* Next steps with pv_ops for Xen
@ 2007-11-21 22:05 Stephen C. Tweedie
  2007-11-21 23:12 ` Jeremy Fitzhardinge
  2007-12-03 12:54 ` Gerd Hoffmann
  0 siblings, 2 replies; 57+ messages in thread
From: Stephen C. Tweedie @ 2007-11-21 22:05 UTC
  To: xen-devel, virtualization
  Cc: Jeremy Fitzhardinge, Eduardo Habkost, Juan Quintela,
	Stephen Tweedie, Jan Beulich, Glauber de Oliveira Costa,
	Chris Wright

Hi all,

I've been looking at the next steps to try to get Xen running fully on
top of pv_ops.  To that end, I've (just) started looking at one of the
next major jobs --- i686 dom0 on pv_ops.

There are still a number of things needing done to reach parity with
xen-unstable:

  x86_64 xen on pv_ops
  Paravirt framebuffer/keyboard
  CPU hotplug
  Balloon
  kexec
  driver domains

but it looks like these can largely proceed in parallel if desired.

My short-term goal with this is simply to come up with a first-pass
merge of the linux-2.6.18-xen.hg dom0 support into the current
kernel.org tree's pv_ops support.  No major refactoring in the first
pass, but absolutely no *-xen.c code copying.

I'm just starting this, but at least with the version magic check (see

	http://lists.xensource.com/archives/html/xen-devel/2007-11/msg00601.html

) out of the way, an SMP dom0 running pv_ops gets all the way through
start_kernel() and into rest_init() before dying with an unsupported cr0
write.  (I'm using direct console hypercalls for printk for now, full
xencons is not working yet.)
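
For illustration, the version magic check mentioned above can be relaxed to a prefix match, so that magic strings beginning "xen-3.1" or "xen-3.2" are accepted as well as "xen-3.0". The following is a minimal userspace sketch, not kernel code: the helper name xen_magic_ok() is hypothetical, and strncmp() is used here in place of the kernel's memcmp() so that short test strings are handled safely.

```c
#include <string.h>

/*
 * Hypothetical helper (not a kernel function): accept any magic
 * string that starts with "xen-3", rather than insisting on the
 * exact "xen-3.0" prefix.
 */
static int xen_magic_ok(const char *magic)
{
	/* Compare only the first 5 characters: 'x','e','n','-','3'. */
	return strncmp(magic, "xen-3", 5) == 0;
}
```

Under these assumptions, xen_magic_ok("xen-3.1") succeeds where an exact 7-character comparison against "xen-3.0" would not.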

Current goal is to get as far as I can with the normal domU boot process
in a dom0 environment (getting console set up correctly, etc), before
starting to piece in the additional extra bits needed for dom0 startup
(mostly, but by no means exclusively, setup-xen.c).

I'm happy to put up a git tree for this once it gets anywhere.  We'd
need to decide which tree to track for that purpose --- Linus's, or
perhaps the tglx or mingo x86 merge tree might make more sense.

--Stephen


* Re: Next steps with pv_ops for Xen
  2007-11-21 22:05 Next steps with pv_ops for Xen Stephen C. Tweedie
@ 2007-11-21 23:12 ` Jeremy Fitzhardinge
  2007-11-26 14:02   ` Juan Quintela
  2007-12-03 12:54 ` Gerd Hoffmann
  1 sibling, 1 reply; 57+ messages in thread
From: Jeremy Fitzhardinge @ 2007-11-21 23:12 UTC
  To: Stephen C. Tweedie
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Jan Beulich,
	Glauber de Oliveira Costa, Chris Wright, virtualization

[-- Attachment #1: Type: text/plain, Size: 2529 bytes --]

Stephen C. Tweedie wrote:
> I've been looking at the next steps to try to get Xen running fully on
> top of pv_ops.  To that end, I've (just) started looking at one of the
> next major jobs --- i686 dom0 on pv_ops.
>   

Great!

> There are still a number of things needing done to reach parity with
> xen-unstable:
>
>   x86_64 xen on pv_ops
>   

I think once pvops has been unified, Xen support should be fairly
straightforward.  I wrote most of the existing code with 64-bit in mind,
so I'm hoping I got it right...

>   Paravirt framebuffer/keyboard
>   CPU hotplug
>   Balloon
>   

I've done some preliminary work on balloon and hotplug.  I think balloon
should make more use of memory hotplug, but a straight port would be a
good first step.
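
As a rough illustration of what a straight port of the balloon has to do: the driver tracks a current and a target page count, and moves towards the target in batches small enough for the frame list to fit in one page. The sketch below is standalone userspace C under stated assumptions. The struct and function names are hypothetical, and the 4096-byte page size is assumed purely for the example.

```c
#include <stddef.h>

/* Assumed page size, for illustration only. */
#define SKETCH_PAGE_SIZE 4096UL
/* Number of frame numbers that fit in one page. */
#define SKETCH_BATCH (SKETCH_PAGE_SIZE / sizeof(unsigned long))

/* Hypothetical cut-down version of the balloon statistics. */
struct balloon_stats_sketch {
	unsigned long current_pages;
	unsigned long target_pages;
};

/* Positive: pages to claim from Xen. Negative: pages to give back. */
static long balloon_credit(const struct balloon_stats_sketch *bs)
{
	return (long)bs->target_pages - (long)bs->current_pages;
}

/* Size of the next batch: |credit|, capped at one page of frames. */
static unsigned long balloon_batch(long credit)
{
	unsigned long want = credit < 0 ? 0UL - (unsigned long)credit
					: (unsigned long)credit;

	return want > SKETCH_BATCH ? SKETCH_BATCH : want;
}
```

A worker would call balloon_batch() repeatedly until balloon_credit() reaches zero. How each batch is actually handed to or taken from Xen is outside this sketch.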

>   kexec
>   driver domains
>
> but it looks like these can largely proceed in parallel if desired.
>
> My short-term goal with this is simply to come up with a first-pass
> merge of the linux-2.6.18-xen.hg dom0 support into the current
> kernel.org tree's pv_ops support.  No major refactoring in the first
> pass, but absolutely no *-xen.c code copying.
>   

Yes.  #ifdefs are the way to go here.

> I'm just starting this, but at least with the version magic check (see
>
> 	http://lists.xensource.com/archives/html/xen-devel/2007-11/msg00601.html
>   

I was just about to post a fix for this.

> ) out of the way, an SMP dom0 running pv_ops gets all the way through
> start_kernel() and into rest_init() before dying with an unsupported cr0
> write.  (I'm using direct console hypercalls for printk for now, full
> xencons is not working yet.)
>   

I have some early dom0 patches already, though they're a few months old
now.  Not much there, but I did do an early console implementation.
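
The shape of that early console work can be sketched as one ops table for dom0 and one for domU, with the choice made once at init time. This is a standalone illustration only: the struct, the stub backends and select_console_ops() are hypothetical names, and the stubs simply report success where the real paths would issue a console hypercall (dom0) or write to the shared console ring (domU).

```c
#include <stdint.h>

/* Hypothetical stand-in for the console layer's ops table. */
struct console_ops_sketch {
	const char *name;
	int (*put_chars)(uint32_t vtermno, const char *data, int len);
};

/* Stub: the real dom0 path would issue a console hypercall. */
static int dom0_put_chars_sketch(uint32_t vtermno, const char *data, int len)
{
	(void)vtermno;
	(void)data;
	return len;
}

/* Stub: the real domU path would copy into the shared ring. */
static int domU_put_chars_sketch(uint32_t vtermno, const char *data, int len)
{
	(void)vtermno;
	(void)data;
	return len;
}

static const struct console_ops_sketch dom0_ops_sketch = {
	"dom0", dom0_put_chars_sketch
};

static const struct console_ops_sketch domU_ops_sketch = {
	"domU", domU_put_chars_sketch
};

/* Pick the backend once, based on whether we are the initial domain. */
static const struct console_ops_sketch *select_console_ops(int is_dom0)
{
	return is_dom0 ? &dom0_ops_sketch : &domU_ops_sketch;
}
```

Selecting the table once keeps the per-character paths free of dom0/domU checks.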

> I'm happy to put up a git tree for this once it gets anywhere.  We'd
> need to decide which tree to track for that purpose --- Linus's, or
> perhaps the tglx or mingo x86 merge tree might make more sense.
>   

Yes, I think the x86 tree is where we need to be, since there's a lot of
activity there.

I'll attach my dom0 patches for whatever use you can make of them.  They
definitely won't apply to anything, not least because of the arch merge
(though it looks like they did get converted by script), but also
because they're based on some defunct experimental booting-from-bzImage
patches.  But perhaps there's some useful stuff in there.

I've also attached my xen-balloon and hotplug patches as-is.  They don't
work completely, but they should be closer to applying.

    J

[-- Attachment #2: xen-dom0-boot.patch --]
[-- Type: text/x-patch, Size: 6090 bytes --]

---
 arch/x86/boot/compressed/notes-xen.c |   16 ---------
 arch/x86/xen/Makefile                |    2 -
 arch/x86/xen/early.c                 |    5 +-
 arch/x86/xen/enlighten.c             |    4 +-
 arch/x86/xen/legacy_boot.c           |   60 ++++++++++++++++++++++++++++++++++
 arch/x86/xen/notes.c                 |   19 ++++++++++
 arch/x86/xen/xen-ops.h               |    3 +
 7 files changed, 89 insertions(+), 20 deletions(-)

===================================================================
--- a/arch/x86/boot/compressed/notes-xen.c
+++ b/arch/x86/boot/compressed/notes-xen.c
@@ -1,17 +1,3 @@
 #ifdef CONFIG_XEN
-#include <linux/elfnote.h>
-#include <xen/interface/elfnote.h>
-
-ELFNOTE("Xen", XEN_ELFNOTE_GUEST_OS,       "linux");
-ELFNOTE("Xen", XEN_ELFNOTE_GUEST_VERSION,  "2.6");
-ELFNOTE("Xen", XEN_ELFNOTE_XEN_VERSION,    "xen-3.0");
-ELFNOTE("Xen", XEN_ELFNOTE_FEATURES,
-	"!writable_page_tables|pae_pgdir_above_4gb");
-ELFNOTE("Xen", XEN_ELFNOTE_LOADER,         "generic");
-
-#ifdef CONFIG_X86_PAE
-	ELFNOTE("Xen", XEN_ELFNOTE_PAE_MODE,       "yes");
-#else
-	ELFNOTE("Xen", XEN_ELFNOTE_PAE_MODE,       "no");
+#include "../../xen/notes.c"
 #endif
-#endif
===================================================================
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -1,4 +1,4 @@ obj-y		:= early.o enlighten.o setup.o fe
 obj-y		:= early.o enlighten.o setup.o features.o multicalls.o mmu.o \
-			events.o time.o manage.o xen-asm.o
+			events.o time.o manage.o xen-asm.o notes.o legacy_boot.o
 
 obj-$(CONFIG_SMP)	+= smp.o
===================================================================
--- a/arch/x86/xen/early.c
+++ b/arch/x86/xen/early.c
@@ -50,7 +50,7 @@ static __init unsigned long early_m2p(un
 	return ret;
 }
 
-static __init void setup_hypercall_page(struct start_info *info)
+__init void xen_setup_hypercall_page(struct start_info *info)
 {
 	unsigned long *mfn_list = (unsigned long *)info->mfn_list;
 	unsigned eax, ebx, ecx, edx;
@@ -183,7 +183,7 @@ void __init xen_entry(void)
 	BUG_ON(memcmp(info->magic, PA(&"xen-3.0"), 7) != 0);
 
 	/* establish a hypercall page */
-	setup_hypercall_page(info);
+	xen_setup_hypercall_page(info);
 
 	/* work out how far we need to remap */
 	limit = __pa(_end);
@@ -203,6 +203,7 @@ void __init xen_entry(void)
 	/* repoint things to their new virtual addresses */
 	info->pt_base = (unsigned long)__va(info->pt_base);
 	info->mfn_list = (unsigned long)__va(info->mfn_list);
+	boot_params.hdr.hardware_subarch_data = (unsigned long)__va(info);
 
 	init_pg_tables_end = limit;
 
===================================================================
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1106,8 +1106,8 @@ void __init xen_start_kernel(void)
 {
 	pgd_t *pgd;
 
-	xen_start_info = (struct start_info *)
-		__va(boot_params.hdr.hardware_subarch_data);
+	xen_start_info = (struct start_info *)(unsigned long)
+		boot_params.hdr.hardware_subarch_data;
 
 	/* Get mfn list */
 	phys_to_machine_mapping = (unsigned long *)xen_start_info->mfn_list;
===================================================================
--- /dev/null
+++ b/arch/x86/xen/legacy_boot.c
@@ -0,0 +1,60 @@
+/*
+ * Notes and setup needed for legacy booting.  This is used either
+ * when loading a domU with vmlinux directly, or for booting
+ * dom0. Normally we'd expect to be booted via the normal boot
+ * protocol.
+ */
+#include <linux/sched.h>
+#include <linux/elfnote.h>
+#include <linux/linkage.h>
+#include <linux/init.h>
+
+#include <asm/setup.h>
+#include <asm/page.h>
+#include <asm/bootparam.h>
+
+#include <xen/interface/xen.h>
+#include <xen/interface/elfnote.h>
+
+#include "xen-ops.h"
+
+extern void xen_legacy_entry(void *);
+
+/* Extra notes needed to set the xen-specific
+   entrypoint and virtual offset */
+ELFNOTE("Xen", XEN_ELFNOTE_ENTRY,		&xen_legacy_entry);
+ELFNOTE("Xen", XEN_ELFNOTE_VIRT_BASE,		PAGE_OFFSET);
+
+static __init __used fastcall void xen_legacy_setup(struct start_info *info)
+{
+	memset(&boot_params, 0, sizeof(boot_params));
+
+	boot_params.hdr.type_of_loader = 0x90;	/* xen */
+
+	boot_params.hdr.hardware_subarch = 2;	/* xen */
+	boot_params.hdr.hardware_subarch_data = (unsigned long)info;
+
+	boot_params.hdr.ramdisk_image = info->mod_start;
+	boot_params.hdr.ramdisk_size = info->mod_len;
+
+	boot_params.hdr.cmd_line_ptr = (unsigned long)info->cmd_line;
+	boot_params.hdr.cmdline_size = sizeof(info->cmd_line);
+
+	xen_setup_hypercall_page(info);
+
+	/* jump to xen_start_kernel with appropriate stack */
+	asm volatile("mov %0,%%esp;"
+		     "push $0;"
+		     "jmp xen_start_kernel"
+		     :
+		     : "i" (&init_thread_union.stack[THREAD_SIZE/sizeof(long)])
+		     : "memory");
+}
+
+
+asm(".section \".init.text\",\"ax\",@progbits	\n"
+    ".globl xen_legacy_entry			\n"
+    "xen_legacy_entry:				\n"
+    "	mov %esi, %eax				\n"
+    "	jmp xen_legacy_setup			\n"
+    ".previous");
===================================================================
--- /dev/null
+++ b/arch/x86/xen/notes.c
@@ -0,0 +1,19 @@
+/*
+ * Common ELF notes needed for all Xen kernel images
+ */
+#include <linux/elfnote.h>
+#include <xen/interface/elfnote.h>
+
+ELFNOTE("Xen", XEN_ELFNOTE_GUEST_OS,		"linux");
+ELFNOTE("Xen", XEN_ELFNOTE_GUEST_VERSION,	"2.6");
+ELFNOTE("Xen", XEN_ELFNOTE_XEN_VERSION,		"xen-3.0");
+ELFNOTE("Xen", XEN_ELFNOTE_FEATURES,
+	"!writable_page_tables|pae_pgdir_above_4gb");
+ELFNOTE("Xen", XEN_ELFNOTE_LOADER,		"generic");
+
+#ifdef CONFIG_X86_PAE
+	ELFNOTE("Xen", XEN_ELFNOTE_PAE_MODE,	"yes");
+#else
+	ELFNOTE("Xen", XEN_ELFNOTE_PAE_MODE,	"no");
+#endif
+
===================================================================
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -2,10 +2,13 @@
 #define XEN_OPS_H
 
 #include <linux/init.h>
+#include <linux/percpu.h>
 
 /* These are code, but not functions.  Defined in entry.S */
 extern const char xen_hypervisor_callback[];
 extern const char xen_failsafe_callback[];
+
+void xen_setup_hypercall_page(struct start_info *info);
 
 void xen_copy_trap_info(struct trap_info *traps);
 

[-- Attachment #3: xen-dom0-xenbus.patch --]
[-- Type: text/x-patch, Size: 1584 bytes --]

---
 drivers/xen/xenbus/xenbus_probe.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

===================================================================
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -786,6 +786,7 @@ static int __init xenbus_probe_init(void
 static int __init xenbus_probe_init(void)
 {
 	int err = 0;
+	unsigned long page = 0;
 
 	DPRINTK("");
 
@@ -806,7 +807,31 @@ static int __init xenbus_probe_init(void
 	 * Domain0 doesn't have a store_evtchn or store_mfn yet.
 	 */
 	if (is_initial_xendomain()) {
-		/* dom0 not yet supported */
+		struct evtchn_alloc_unbound alloc_unbound;
+
+		/* Allocate page. */
+		page = get_zeroed_page(GFP_KERNEL);
+		if (!page)
+			return -ENOMEM;
+
+		xen_store_mfn = xen_start_info->store_mfn =
+			pfn_to_mfn(virt_to_phys((void *)page) >>
+				   PAGE_SHIFT);
+
+		/* Next allocate a local port which xenstored can bind to */
+		alloc_unbound.dom        = DOMID_SELF;
+		alloc_unbound.remote_dom = 0;
+
+		err = HYPERVISOR_event_channel_op(EVTCHNOP_alloc_unbound,
+						  &alloc_unbound);
+		if (err == -ENOSYS)
+			goto out_unreg_front;
+
+		BUG_ON(err);
+		xen_store_evtchn = xen_start_info->store_evtchn =
+			alloc_unbound.port;
+
+		xen_store_interface = mfn_to_virt(xen_store_mfn);
 	} else {
 		xenstored_ready = 1;
 		xen_store_evtchn = xen_start_info->store_evtchn;
@@ -834,6 +859,9 @@ static int __init xenbus_probe_init(void
 	bus_unregister(&xenbus_frontend.bus);
 
   out_error:
+	if (page != 0)
+		free_page(page);
+
 	return err;
 }
 

[-- Attachment #4: xen-dom0-ide.patch --]
[-- Type: text/x-patch, Size: 2507 bytes --]

---
 arch/x86/mm/ioremap_32.c |    3 ---
 arch/x86/xen/enlighten.c |   20 ++++++++++++++++++++
 arch/x86/xen/setup.c     |    3 ++-
 include/asm-x86/io_32.h  |    4 ++++
 4 files changed, 26 insertions(+), 4 deletions(-)

===================================================================
--- a/arch/x86/mm/ioremap_32.c
+++ b/arch/x86/mm/ioremap_32.c
@@ -18,9 +18,6 @@
 #include <asm/tlbflush.h>
 #include <asm/pgtable.h>
 
-#define ISA_START_ADDRESS	0xa0000
-#define ISA_END_ADDRESS		0x100000
-
 /*
  * Generic mapping function (not visible outside):
  */
===================================================================
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -45,6 +45,7 @@
 #include <asm/smp.h>
 #include <asm/tlbflush.h>
 #include <asm/reboot.h>
+#include <asm/io.h>
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -826,6 +827,19 @@ static __init void xen_pagetable_setup_d
 		if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
 			BUG();
 	}
+
+	/*
+	 * If we're dom0, then 1:1 map the ISA machine addresses into
+	 * the kernel's address space.
+	 */
+	if (is_initial_xendomain()) {
+		unsigned i;
+
+		for(i = ISA_START_ADDRESS; i < ISA_END_ADDRESS; i += PAGE_SIZE)
+			set_pte_mfn(PAGE_OFFSET + i, PFN_DOWN(i), PAGE_KERNEL);
+
+		reserve_bootmem(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS);
+	}
 }
 
 /* This is called once we have the cpu_possible_map */
@@ -1144,6 +1158,12 @@ void __init xen_start_kernel(void)
 	if (xen_feature(XENFEAT_supervisor_mode_kernel))
 		paravirt_ops.kernel_rpl = 0;
 
+	if (is_initial_xendomain()) {
+		struct physdev_set_iopl set_iopl;
+		set_iopl.iopl = 1;
+		HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl);
+	}
+
 	/* set the limit of our address space */
 	xen_reserve_top();
 
===================================================================
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -92,5 +92,6 @@ void __init xen_arch_setup(void)
 	xen_fill_possible_map();
 #endif
 
-	paravirt_disable_iospace();
+	if (!is_initial_xendomain())
+		paravirt_disable_iospace();
 }
===================================================================
--- a/include/asm-x86/io_32.h
+++ b/include/asm-x86/io_32.h
@@ -135,6 +135,10 @@ extern void __iomem *fix_ioremap(unsigne
 #define dmi_ioremap bt_ioremap
 #define dmi_iounmap bt_iounmap
 #define dmi_alloc alloc_bootmem
+
+
+#define ISA_START_ADDRESS	0xa0000
+#define ISA_END_ADDRESS		0x100000
 
 /*
  * ISA I/O bus memory addresses are 1:1 with the physical address.

[-- Attachment #5: xen-dom0-console.patch --]
[-- Type: text/x-patch, Size: 3840 bytes --]

---
 arch/x86/xen/events.c  |    2 -
 drivers/char/hvc_xen.c |   61 +++++++++++++++++++++++++++++++++++++++++-------
 include/xen/events.h   |    2 +
 3 files changed, 56 insertions(+), 9 deletions(-)

===================================================================
--- a/arch/x86/xen/events.c
+++ b/arch/x86/xen/events.c
@@ -308,7 +308,7 @@ static int bind_ipi_to_irq(unsigned int 
 }
 
 
-static int bind_virq_to_irq(unsigned int virq, unsigned int cpu)
+int bind_virq_to_irq(unsigned int virq, unsigned int cpu)
 {
 	struct evtchn_bind_virq bind_virq;
 	int evtchn, irq;
===================================================================
--- a/drivers/char/hvc_xen.c
+++ b/drivers/char/hvc_xen.c
@@ -50,7 +50,7 @@ static inline void notify_daemon(void)
 	notify_remote_via_evtchn(xen_start_info->console.domU.evtchn);
 }
 
-static int write_console(uint32_t vtermno, const char *data, int len)
+static int domU_write_console(uint32_t vtermno, const char *data, int len)
 {
 	struct xencons_interface *intf = xencons_interface();
 	XENCONS_RING_IDX cons, prod;
@@ -71,7 +71,28 @@ static int write_console(uint32_t vtermn
 	return sent;
 }
 
-static int read_console(uint32_t vtermno, char *buf, int len)
+static int dom0_write_console(uint32_t vtermno, const char *data, int len)
+{
+	int ret;
+
+	ret = HYPERVISOR_console_io(CONSOLEIO_write, len, (char *)data);
+
+	return ret < 0 ? 0 : len;
+}
+
+static int write_console(uint32_t vtermno, const char *data, int len)
+{
+	int ret;
+
+	if (is_initial_xendomain())
+		ret = dom0_write_console(vtermno, data, len);
+	else
+		ret = domU_write_console(vtermno, data, len);
+
+	return ret;
+}
+
+static int domU_read_console(uint32_t vtermno, char *buf, int len)
 {
 	struct xencons_interface *intf = xencons_interface();
 	XENCONS_RING_IDX cons, prod;
@@ -92,22 +113,40 @@ static int read_console(uint32_t vtermno
 	return recv;
 }
 
-static struct hv_ops hvc_ops = {
-	.get_chars = read_console,
-	.put_chars = write_console,
+static int dom0_read_console(uint32_t vtermno, char *buf, int len)
+{
+	return HYPERVISOR_console_io(CONSOLEIO_read, len, buf);
+}
+
+static struct hv_ops domU_hvc_ops = {
+	.get_chars = domU_read_console,
+	.put_chars = domU_write_console,
+};
+
+static struct hv_ops dom0_hvc_ops = {
+	.get_chars = dom0_read_console,
+	.put_chars = dom0_write_console,
 };
 
 static int __init xen_init(void)
 {
 	struct hvc_struct *hp;
+	struct hv_ops *ops;
 
 	if (!is_running_on_xen())
 		return 0;
 
-	xencons_irq = bind_evtchn_to_irq(xen_start_info->console.domU.evtchn);
+	if (is_initial_xendomain()) {
+		ops = &dom0_hvc_ops;
+		xencons_irq = bind_virq_to_irq(VIRQ_CONSOLE, smp_processor_id());
+	} else {
+		ops = &domU_hvc_ops;
+		xencons_irq = bind_evtchn_to_irq(xen_start_info->console.domU.evtchn);
+	}
+
 	if (xencons_irq < 0)
 		xencons_irq = 0 /* NO_IRQ */;
-	hp = hvc_alloc(HVC_COOKIE, xencons_irq, &hvc_ops, 256);
+	hp = hvc_alloc(HVC_COOKIE, xencons_irq, ops, 256);
 	if (IS_ERR(hp))
 		return PTR_ERR(hp);
 
@@ -123,10 +162,16 @@ static void __exit xen_fini(void)
 
 static int xen_cons_init(void)
 {
+	struct hv_ops *ops;
+
 	if (!is_running_on_xen())
 		return 0;
 
-	hvc_instantiate(HVC_COOKIE, 0, &hvc_ops);
+	ops = &domU_hvc_ops;
+	if (is_initial_xendomain())
+		ops = &dom0_hvc_ops;
+
+	hvc_instantiate(HVC_COOKIE, 0, ops);
 	return 0;
 }
 
===================================================================
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -18,6 +18,8 @@ int bind_evtchn_to_irqhandler(unsigned i
 			      irq_handler_t handler,
 			      unsigned long irqflags, const char *devname,
 			      void *dev_id);
+int bind_virq_to_irq(unsigned int virq, unsigned int cpu);
+
 int bind_virq_to_irqhandler(unsigned int virq, unsigned int cpu,
 			    irq_handler_t handler,
 			    unsigned long irqflags, const char *devname,

[-- Attachment #6: xen-dom0-set_fixmap.patch --]
[-- Type: text/x-patch, Size: 7189 bytes --]

---
 arch/x86/kernel/paravirt_32.c |    2 ++
 arch/x86/mm/pgtable_32.c      |   16 ++++++++++------
 arch/x86/xen/enlighten.c      |   41 +++++++++++++++++++++++++++++++++++++----
 arch/x86/xen/mmu.c            |   30 +-----------------------------
 include/asm-x86/fixmap_32.h   |   13 +++++++++++--
 include/asm-x86/paravirt.h    |   13 +++++++++++++
 include/asm-x86/pgtable_32.h  |    3 +++
 7 files changed, 77 insertions(+), 41 deletions(-)

===================================================================
--- a/arch/x86/kernel/paravirt_32.c
+++ b/arch/x86/kernel/paravirt_32.c
@@ -377,6 +377,8 @@ struct paravirt_ops paravirt_ops = {
 	.dup_mmap = paravirt_nop,
 	.exit_mmap = paravirt_nop,
 	.activate_mm = paravirt_nop,
+
+	.set_fixmap = native_set_fixmap,
 };
 
 EXPORT_SYMBOL(paravirt_ops);
===================================================================
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -73,7 +73,7 @@ void show_mem(void)
  * Associate a virtual page frame with a given physical page frame 
  * and protection flags for that frame.
  */ 
-static void set_pte_pfn(unsigned long vaddr, unsigned long pfn, pgprot_t flags)
+void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -96,9 +96,8 @@ static void set_pte_pfn(unsigned long va
 		return;
 	}
 	pte = pte_offset_kernel(pmd, vaddr);
-	if (pgprot_val(flags))
-		/* <pfn,flags> stored as-is, to permit clearing entries */
-		set_pte(pte, pfn_pte(pfn, flags));
+	if (pte_val(pteval))
+		set_pte_at(&init_mm, vaddr, pte, pteval);
 	else
 		pte_clear(&init_mm, vaddr, pte);
 
@@ -148,7 +147,7 @@ unsigned long __FIXADDR_TOP = 0xfffff000
 unsigned long __FIXADDR_TOP = 0xfffff000;
 EXPORT_SYMBOL(__FIXADDR_TOP);
 
-void __set_fixmap (enum fixed_addresses idx, unsigned long phys, pgprot_t flags)
+void __native_set_fixmap(enum fixed_addresses idx, pte_t pte)
 {
 	unsigned long address = __fix_to_virt(idx);
 
@@ -156,8 +155,13 @@ void __set_fixmap (enum fixed_addresses 
 		BUG();
 		return;
 	}
-	set_pte_pfn(address, phys >> PAGE_SHIFT, flags);
+	set_pte_vaddr(address, pte);
 	fixmaps++;
+}
+
+void native_set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t flags)
+{
+	__native_set_fixmap(idx, pfn_pte(phys >> PAGE_SHIFT, flags));
 }
 
 /**
===================================================================
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -131,10 +131,12 @@ static void xen_cpuid(unsigned int *eax,
 	 * Mask out inconvenient features, to try and disable as many
 	 * unsupported kernel subsystems as possible.
 	 */
-	if (*eax == 1)
-		maskedx = ~((1 << X86_FEATURE_APIC) |  /* disable APIC */
-			    (1 << X86_FEATURE_ACPI) |  /* disable ACPI */
-			    (1 << X86_FEATURE_ACC));   /* thermal monitoring */
+	if (*eax == 1) {
+		maskedx = ~(1 << X86_FEATURE_APIC);  /* disable local APIC */
+		if (!is_initial_xendomain())
+			maskedx &= ~((1 << X86_FEATURE_ACPI) |  /* disable ACPI */
+				     (1 << X86_FEATURE_ACC));   /* thermal monitoring */
+	}
 
 	asm(XEN_EMULATE_PREFIX "cpuid"
 		: "=a" (*eax),
@@ -916,6 +918,35 @@ static unsigned xen_patch(u8 type, u16 c
 	return ret;
 }
 
+static void xen_set_fixmap(unsigned idx, unsigned long phys, pgprot_t prot)
+{
+	pte_t pte;
+
+	phys >>= PAGE_SHIFT;
+
+	switch (idx) {
+#ifdef CONFIG_X86_F00F_BUG
+	case FIX_F00F_IDT:
+#endif
+	case FIX_WP_TEST:
+	case FIX_VDSO:
+#ifdef CONFIG_X86_LOCAL_APIC
+	case FIX_APIC_BASE:	/* maps dummy local APIC */
+#endif
+		pte = pfn_pte(phys, prot);
+		break;
+
+	default:
+		pte = mfn_pte(phys, prot);
+		break;
+	}
+
+	printk("xen_set_fixmap: idx=%d phys=%lx prot=%lx\n",
+	       idx, phys, (unsigned long)pgprot_val(prot));
+
+	__native_set_fixmap(idx, pte);
+}
+
 static const struct paravirt_ops xen_paravirt_ops __initdata = {
 	.paravirt_enabled = 1,
 	.shared_kernel_pmd = 0,
@@ -1046,6 +1077,8 @@ static const struct paravirt_ops xen_par
 	.exit_mmap = xen_exit_mmap,
 
 	.set_lazy_mode = xen_set_lazy_mode,
+
+	.set_fixmap = xen_set_fixmap,
 };
 
 #ifdef CONFIG_SMP
===================================================================
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -117,35 +117,7 @@ void xen_set_pmd(pmd_t *ptr, pmd_t val)
  */
 void set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags)
 {
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-
-	pgd = swapper_pg_dir + pgd_index(vaddr);
-	if (pgd_none(*pgd)) {
-		BUG();
-		return;
-	}
-	pud = pud_offset(pgd, vaddr);
-	if (pud_none(*pud)) {
-		BUG();
-		return;
-	}
-	pmd = pmd_offset(pud, vaddr);
-	if (pmd_none(*pmd)) {
-		BUG();
-		return;
-	}
-	pte = pte_offset_kernel(pmd, vaddr);
-	/* <mfn,flags> stored as-is, to permit clearing entries */
-	xen_set_pte(pte, mfn_pte(mfn, flags));
-
-	/*
-	 * It's enough to flush this one mapping.
-	 * (PGE mappings get flushed as well)
-	 */
-	__flush_tlb_one(vaddr);
+	set_pte_vaddr(vaddr, mfn_pte(mfn, flags));
 }
 
 void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
===================================================================
--- a/include/asm-x86/fixmap_32.h
+++ b/include/asm-x86/fixmap_32.h
@@ -98,8 +98,17 @@ enum fixed_addresses {
 	__end_of_fixed_addresses
 };
 
-extern void __set_fixmap (enum fixed_addresses idx,
-					unsigned long phys, pgprot_t flags);
+void __native_set_fixmap(enum fixed_addresses idx, pte_t pte);
+void native_set_fixmap(enum fixed_addresses idx,
+		       unsigned long phys, pgprot_t flags);
+
+#ifndef CONFIG_PARAVIRT
+static inline void __set_fixmap(enum fixed_addresses idx,
+				unsigned long phys, pgprot_t flags)
+{
+	native_set_fixmap(idx, phys, flags);
+}
+#endif
 extern void reserve_top_address(unsigned long reserve);
 
 #define set_fixmap(idx, phys) \
===================================================================
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -222,6 +222,13 @@ struct paravirt_ops
 	/* These two are jmp to, not actually called. */
 	void (*irq_enable_sysexit)(void);
 	void (*iret)(void);
+
+	/* dom0 ops */
+
+	/* Sometimes the physical address is a pfn, and sometimes it's
+	   an mfn.  We can tell which is which from the index. */
+	void (*set_fixmap)(unsigned /* enum fixed_addresses */ idx,
+			   unsigned long phys, pgprot_t flags);
 };
 
 extern struct paravirt_ops paravirt_ops;
@@ -931,6 +938,12 @@ static inline void arch_flush_lazy_mmu_m
 	PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_FLUSH);
 }
 
+static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
+				unsigned long phys, pgprot_t flags)
+{
+	paravirt_ops.set_fixmap(idx, phys, flags);
+}
+
 void _paravirt_nop(void);
 #define paravirt_nop	((void *)_paravirt_nop)
 
===================================================================
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -522,6 +522,9 @@ void native_pagetable_setup_start(pgd_t 
 void native_pagetable_setup_start(pgd_t *base);
 void native_pagetable_setup_done(pgd_t *base);
 
+/* Install a pte for a particular vaddr in kernel space. */
+void set_pte_vaddr(unsigned long vaddr, pte_t pte);
+
 #ifndef CONFIG_PARAVIRT
 static inline void paravirt_pagetable_setup_start(pgd_t *base)
 {

[-- Attachment #7: xen-signature-check.patch --]
[-- Type: text/x-patch, Size: 687 bytes --]

Subject: xen: relax signature check

Some versions of Xen 3.x set their magic number to "xen-3.[12]", so
relax the test to match them.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

---
 arch/x86/xen/enlighten.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

===================================================================
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1131,7 +1131,7 @@ asmlinkage void __init xen_start_kernel(
 	if (!xen_start_info)
 		return;
 
-	BUG_ON(memcmp(xen_start_info->magic, "xen-3.0", 7) != 0);
+	BUG_ON(memcmp(xen_start_info->magic, "xen-3", 5) != 0);
 
 	/* Install Xen paravirt ops */
 	pv_info = xen_info;

[-- Attachment #8: xen-balloon.patch --]
[-- Type: text/x-patch, Size: 25580 bytes --]

---
 drivers/Kconfig                |    2 
 drivers/xen/Kconfig            |   19 +
 drivers/xen/Makefile           |    2 
 drivers/xen/balloon.c          |  712 ++++++++++++++++++++++++++++++++++++++++
 include/xen/balloon.h          |   61 +++
 include/xen/interface/memory.h |   12 
 6 files changed, 800 insertions(+), 8 deletions(-)

===================================================================
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -95,4 +95,6 @@ source "drivers/uio/Kconfig"
 source "drivers/uio/Kconfig"
 
 source "drivers/virtio/Kconfig"
+
+source "drivers/xen/Kconfig"
 endmenu
===================================================================
--- /dev/null
+++ b/drivers/xen/Kconfig
@@ -0,0 +1,19 @@
+config XEN_BALLOON
+	bool "Xen memory balloon driver"
+	depends on XEN
+	default y
+	help
+	  The balloon driver allows the Xen domain to request more memory from
+	  the system to expand the domain's memory allocation, or alternatively
+	  return unneeded memory to the system.
+
+config XEN_SCRUB_PAGES
+	bool "Scrub pages before returning them to system"
+	depends on XEN_BALLOON
+	default y
+	help
+	  Scrub pages before returning them to the system for reuse by
+	  other domains.  This makes sure that any confidential data
+	  is not accidentally visible to other domains.  It is more
+	  secure, but slightly less efficient.
+	  If in doubt, say yes.
===================================================================
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -1,2 +1,4 @@ obj-y	+= grant-table.o
 obj-y	+= grant-table.o
 obj-y	+= xenbus/
+
+obj-$(CONFIG_XEN_BALLOON) += balloon.o
===================================================================
--- /dev/null
+++ b/drivers/xen/balloon.c
@@ -0,0 +1,712 @@
+/******************************************************************************
+ * balloon.c
+ *
+ * Xen balloon driver - enables returning/claiming memory to/from Xen.
+ *
+ * Copyright (c) 2003, B Dragovic
+ * Copyright (c) 2003-2004, M Williamson, K Fraser
+ * Copyright (c) 2005 Dan M. Smith, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/bootmem.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/mutex.h>
+#include <linux/highmem.h>
+#include <linux/list.h>
+#include <linux/sysdev.h>
+
+#include <asm/xen/hypervisor.h>
+#include <asm/page.h>
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
+#include <asm/uaccess.h>
+#include <asm/tlb.h>
+
+#include <xen/interface/memory.h>
+#include <xen/balloon.h>
+#include <xen/xenbus.h>
+#include <xen/features.h>
+#include <xen/page.h>
+
+#define PAGES2KB(_p) ((_p)<<(PAGE_SHIFT-10))
+
+#define BALLOON_CLASS_NAME "memory"
+
+struct balloon_stats {
+	/* We aim for 'current allocation' == 'target allocation'. */
+	unsigned long current_pages;
+	unsigned long target_pages;
+	/* We may hit the hard limit in Xen. If we do then we remember it. */
+	unsigned long hard_limit;
+	/*
+	 * Drivers may alter the memory reservation independently, but they
+	 * must inform the balloon driver so we avoid hitting the hard limit.
+	 */
+	unsigned long driver_pages;
+	/* Number of pages in high- and low-memory balloons. */
+	unsigned long balloon_low;
+	unsigned long balloon_high;
+};
+
+static DEFINE_MUTEX(balloon_mutex);
+
+static struct sys_device balloon_sysdev;
+
+static int register_balloon(struct sys_device *sysdev);
+
+/*
+ * Protects atomic reservation decrease/increase against concurrent increases.
+ * Also protects non-atomic updates of current_pages and driver_pages, and
+ * balloon lists.
+ */
+static DEFINE_SPINLOCK(balloon_lock);
+
+static struct balloon_stats balloon_stats;
+
+/* We increase/decrease in batches which fit in a page */
+static unsigned long frame_list[PAGE_SIZE / sizeof(unsigned long)];
+
+/* VM /proc information for memory */
+extern unsigned long totalram_pages;
+
+#ifdef CONFIG_HIGHMEM
+extern unsigned long totalhigh_pages;
+#define inc_totalhigh_pages() (totalhigh_pages++)
+#define dec_totalhigh_pages() (totalhigh_pages--)
+#else
+#define inc_totalhigh_pages() do {} while(0)
+#define dec_totalhigh_pages() do {} while(0)
+#endif
+
+/* List of ballooned pages, threaded through the mem_map array. */
+static LIST_HEAD(ballooned_pages);
+
+/* Main work function, always executed in process context. */
+static void balloon_process(struct work_struct *work);
+static DECLARE_WORK(balloon_worker, balloon_process);
+static struct timer_list balloon_timer;
+
+/* When ballooning out (allocating memory to return to Xen) we don't really
+   want the kernel to try too hard since that can trigger the oom killer. */
+#define GFP_BALLOON \
+	(GFP_HIGHUSER | __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC)
+
+static void scrub_page(struct page *page)
+{
+#ifdef CONFIG_XEN_SCRUB_PAGES
+	if (PageHighMem(page)) {
+		void *v = kmap(page);
+		clear_page(v);
+		kunmap(page);
+	} else {
+		void *v = page_address(page);
+		clear_page(v);
+	}
+#endif
+}
+
+/* balloon_append: add the given page to the balloon. */
+static void balloon_append(struct page *page)
+{
+	/* Lowmem is re-populated first, so highmem pages go at list tail. */
+	if (PageHighMem(page)) {
+		list_add_tail(&page->lru, &ballooned_pages);
+		balloon_stats.balloon_high++;
+		dec_totalhigh_pages();
+	} else {
+		list_add(&page->lru, &ballooned_pages);
+		balloon_stats.balloon_low++;
+	}
+}
+
+/* balloon_retrieve: rescue a page from the balloon, if it is not empty. */
+static struct page *balloon_retrieve(void)
+{
+	struct page *page;
+
+	if (list_empty(&ballooned_pages))
+		return NULL;
+
+	page = list_entry(ballooned_pages.next, struct page, lru);
+	list_del(&page->lru);
+
+	if (PageHighMem(page)) {
+		balloon_stats.balloon_high--;
+		inc_totalhigh_pages();
+	}
+	else
+		balloon_stats.balloon_low--;
+
+	return page;
+}
+
+static struct page *balloon_first_page(void)
+{
+	if (list_empty(&ballooned_pages))
+		return NULL;
+	return list_entry(ballooned_pages.next, struct page, lru);
+}
+
+static struct page *balloon_next_page(struct page *page)
+{
+	struct list_head *next = page->lru.next;
+	if (next == &ballooned_pages)
+		return NULL;
+	return list_entry(next, struct page, lru);
+}
+
+static void balloon_alarm(unsigned long unused)
+{
+	schedule_work(&balloon_worker);
+}
+
+static unsigned long current_target(void)
+{
+	unsigned long target = min(balloon_stats.target_pages, balloon_stats.hard_limit);
+
+	target = min(target,
+		     balloon_stats.current_pages +
+		     balloon_stats.balloon_low +
+		     balloon_stats.balloon_high);
+
+	return target;
+}
+
+static int increase_reservation(unsigned long nr_pages)
+{
+	unsigned long  pfn, i, flags;
+	struct page   *page;
+	long           rc;
+	struct xen_memory_reservation reservation = {
+		.address_bits = 0,
+		.extent_order = 0,
+		.domid        = DOMID_SELF
+	};
+
+	if (nr_pages > ARRAY_SIZE(frame_list))
+		nr_pages = ARRAY_SIZE(frame_list);
+
+	spin_lock_irqsave(&balloon_lock, flags);
+
+	page = balloon_first_page();
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(page == NULL);
+		frame_list[i] = page_to_pfn(page);
+		page = balloon_next_page(page);
+	}
+
+	reservation.extent_start = (unsigned long)frame_list;
+	reservation.nr_extents   = nr_pages;
+	rc = HYPERVISOR_memory_op(
+		XENMEM_populate_physmap, &reservation);
+	if (rc < nr_pages) {
+		if (rc > 0) {
+			int ret;
+
+			/* We hit the Xen hard limit: reprobe. */
+			reservation.nr_extents = rc;
+			ret = HYPERVISOR_memory_op(XENMEM_decrease_reservation,
+					&reservation);
+			BUG_ON(ret != rc);
+		}
+		if (rc >= 0)
+			balloon_stats.hard_limit = (balloon_stats.current_pages + rc -
+						    balloon_stats.driver_pages);
+		goto out;
+	}
+
+	for (i = 0; i < nr_pages; i++) {
+		page = balloon_retrieve();
+		BUG_ON(page == NULL);
+
+		pfn = page_to_pfn(page);
+		BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) &&
+		       phys_to_machine_mapping_valid(pfn));
+
+		set_phys_to_machine(pfn, frame_list[i]);
+
+		/* Link back into the page tables if not highmem. */
+		if (pfn < max_low_pfn) {
+			int ret;
+			ret = HYPERVISOR_update_va_mapping(
+				(unsigned long)__va(pfn << PAGE_SHIFT),
+				mfn_pte(frame_list[i], PAGE_KERNEL),
+				0);
+			BUG_ON(ret);
+		}
+
+		/* Relinquish the page back to the allocator. */
+		ClearPageReserved(page);
+		init_page_count(page);
+		__free_page(page);
+	}
+
+	balloon_stats.current_pages += nr_pages;
+	totalram_pages = balloon_stats.current_pages;
+
+ out:
+	spin_unlock_irqrestore(&balloon_lock, flags);
+
+	return 0;
+}
+
+static int decrease_reservation(unsigned long nr_pages)
+{
+	unsigned long  pfn, i, flags;
+	struct page   *page;
+	int            need_sleep = 0;
+	int ret;
+	struct xen_memory_reservation reservation = {
+		.address_bits = 0,
+		.extent_order = 0,
+		.domid        = DOMID_SELF
+	};
+
+	if (nr_pages > ARRAY_SIZE(frame_list))
+		nr_pages = ARRAY_SIZE(frame_list);
+
+	for (i = 0; i < nr_pages; i++) {
+		if ((page = alloc_page(GFP_BALLOON)) == NULL) {
+			nr_pages = i;
+			need_sleep = 1;
+			break;
+		}
+
+		pfn = page_to_pfn(page);
+		frame_list[i] = pfn_to_mfn(pfn);
+
+		scrub_page(page);
+	}
+
+	/* Ensure that ballooned highmem pages don't have kmaps. */
+	kmap_flush_unused();
+	flush_tlb_all();
+
+	spin_lock_irqsave(&balloon_lock, flags);
+
+	/* No more mappings: invalidate P2M and add to balloon. */
+	for (i = 0; i < nr_pages; i++) {
+		pfn = mfn_to_pfn(frame_list[i]);
+		set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
+		balloon_append(pfn_to_page(pfn));
+	}
+
+	reservation.extent_start = (unsigned long)frame_list;
+	reservation.nr_extents   = nr_pages;
+	ret = HYPERVISOR_memory_op(XENMEM_decrease_reservation, &reservation);
+	BUG_ON(ret != nr_pages);
+
+	balloon_stats.current_pages -= nr_pages;
+	totalram_pages = balloon_stats.current_pages;
+
+	spin_unlock_irqrestore(&balloon_lock, flags);
+
+	return need_sleep;
+}
+
+/*
+ * We avoid multiple worker processes conflicting via the balloon mutex.
+ * We may of course race updates of the target counts (which are protected
+ * by the balloon lock), or with changes to the Xen hard limit, but we will
+ * recover from these in time.
+ */
+static void balloon_process(struct work_struct *work)
+{
+	int need_sleep = 0;
+	long credit;
+
+	mutex_lock(&balloon_mutex);
+
+	do {
+		credit = current_target() - balloon_stats.current_pages;
+		if (credit > 0)
+			need_sleep = (increase_reservation(credit) != 0);
+		if (credit < 0)
+			need_sleep = (decrease_reservation(-credit) != 0);
+
+#ifndef CONFIG_PREEMPT
+		if (need_resched())
+			schedule();
+#endif
+	} while ((credit != 0) && !need_sleep);
+
+	/* Schedule more work if there is some still to be done. */
+	if (current_target() != balloon_stats.current_pages)
+		mod_timer(&balloon_timer, jiffies + HZ);
+
+	mutex_unlock(&balloon_mutex);
+}
+
+/* Resets the Xen limit, sets new target, and kicks off processing. */
+void balloon_set_new_target(unsigned long target)
+{
+	/* No need for lock. Not read-modify-write updates. */
+	balloon_stats.hard_limit   = ~0UL;
+	balloon_stats.target_pages = target;
+	schedule_work(&balloon_worker);
+}
+
+static struct xenbus_watch target_watch =
+{
+	.node = "memory/target"
+};
+
+/* React to a change in the target key */
+static void watch_target(struct xenbus_watch *watch,
+			 const char **vec, unsigned int len)
+{
+	unsigned long long new_target;
+	int err;
+
+	err = xenbus_scanf(XBT_NIL, "memory", "target", "%llu", &new_target);
+	if (err != 1) {
+		/* This is ok (for domain0 at least) - so just return */
+		return;
+	}
+
+	/* The given memory/target value is in KiB, so it needs converting to
+	 * pages. PAGE_SHIFT converts bytes to pages, hence PAGE_SHIFT - 10.
+	 */
+	balloon_set_new_target(new_target >> (PAGE_SHIFT - 10));
+}
+
+static int balloon_init_watcher(struct notifier_block *notifier,
+				unsigned long event,
+				void *data)
+{
+	int err;
+
+	err = register_xenbus_watch(&target_watch);
+	if (err)
+		printk(KERN_ERR "Failed to set balloon watcher\n");
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block xenstore_notifier;
+
+static int __init balloon_init(void)
+{
+	unsigned long pfn;
+	struct page *page;
+
+	if (!is_running_on_xen())
+		return -ENODEV;
+
+	pr_info("xen_balloon: Initialising balloon driver.\n");
+
+	balloon_stats.current_pages = min(xen_start_info->nr_pages, max_pfn);
+	totalram_pages   = balloon_stats.current_pages;
+	balloon_stats.target_pages  = balloon_stats.current_pages;
+	balloon_stats.balloon_low   = 0;
+	balloon_stats.balloon_high  = 0;
+	balloon_stats.driver_pages  = 0UL;
+	balloon_stats.hard_limit    = ~0UL;
+
+	init_timer(&balloon_timer);
+	balloon_timer.data = 0;
+	balloon_timer.function = balloon_alarm;
+
+	register_balloon(&balloon_sysdev);
+
+	/* Initialise the balloon with excess memory space. */
+	for (pfn = xen_start_info->nr_pages; pfn < max_pfn; pfn++) {
+		page = pfn_to_page(pfn);
+		if (!PageReserved(page))
+			balloon_append(page);
+	}
+
+	target_watch.callback = watch_target;
+	xenstore_notifier.notifier_call = balloon_init_watcher;
+
+	register_xenstore_notifier(&xenstore_notifier);
+
+	return 0;
+}
+
+subsys_initcall(balloon_init);
+
+static void balloon_exit(void)
+{
+    /* XXX - release balloon here */
+    return;
+}
+
+module_exit(balloon_exit);
+
+static void balloon_update_driver_allowance(long delta)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&balloon_lock, flags);
+	balloon_stats.driver_pages += delta;
+	spin_unlock_irqrestore(&balloon_lock, flags);
+}
+
+static int dealloc_pte_fn(
+	pte_t *pte, struct page *pmd_page, unsigned long addr, void *data)
+{
+	unsigned long mfn = pte_mfn(*pte);
+	int ret;
+	struct xen_memory_reservation reservation = {
+		.nr_extents   = 1,
+		.extent_order = 0,
+		.domid        = DOMID_SELF
+	};
+	reservation.extent_start = (unsigned long)&mfn;
+	set_pte_at(&init_mm, addr, pte, __pte_ma(0ull));
+	set_phys_to_machine(__pa(addr) >> PAGE_SHIFT, INVALID_P2M_ENTRY);
+	ret = HYPERVISOR_memory_op(XENMEM_decrease_reservation, &reservation);
+	BUG_ON(ret != 1);
+	return 0;
+}
+
+static struct page **alloc_empty_pages_and_pagevec(int nr_pages)
+{
+	unsigned long vaddr, flags;
+	struct page *page, **pagevec;
+	int i, ret;
+
+	pagevec = kmalloc(sizeof(page) * nr_pages, GFP_KERNEL);
+	if (pagevec == NULL)
+		return NULL;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = pagevec[i] = alloc_page(GFP_KERNEL);
+		if (page == NULL)
+			goto err;
+
+		vaddr = (unsigned long)page_address(page);
+
+		scrub_page(page);
+
+		spin_lock_irqsave(&balloon_lock, flags);
+
+		if (xen_feature(XENFEAT_auto_translated_physmap)) {
+			unsigned long gmfn = page_to_pfn(page);
+			struct xen_memory_reservation reservation = {
+				.nr_extents   = 1,
+				.extent_order = 0,
+				.domid        = DOMID_SELF
+			};
+			reservation.extent_start = (unsigned long)&gmfn;
+			ret = HYPERVISOR_memory_op(XENMEM_decrease_reservation,
+						   &reservation);
+			if (ret == 1)
+				ret = 0; /* success */
+		} else {
+			ret = apply_to_page_range(&init_mm, vaddr, PAGE_SIZE,
+						  dealloc_pte_fn, NULL);
+		}
+
+		if (ret != 0) {
+			spin_unlock_irqrestore(&balloon_lock, flags);
+			__free_page(page);
+			goto err;
+		}
+
+		totalram_pages = --balloon_stats.current_pages;
+
+		spin_unlock_irqrestore(&balloon_lock, flags);
+	}
+
+ out:
+	schedule_work(&balloon_worker);
+	flush_tlb_all();
+	return pagevec;
+
+ err:
+	spin_lock_irqsave(&balloon_lock, flags);
+	while (--i >= 0)
+		balloon_append(pagevec[i]);
+	spin_unlock_irqrestore(&balloon_lock, flags);
+	kfree(pagevec);
+	pagevec = NULL;
+	goto out;
+}
+
+static void free_empty_pages_and_pagevec(struct page **pagevec, int nr_pages)
+{
+	unsigned long flags;
+	int i;
+
+	if (pagevec == NULL)
+		return;
+
+	spin_lock_irqsave(&balloon_lock, flags);
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(page_count(pagevec[i]) != 1);
+		balloon_append(pagevec[i]);
+	}
+	spin_unlock_irqrestore(&balloon_lock, flags);
+
+	kfree(pagevec);
+
+	schedule_work(&balloon_worker);
+}
+
+static void balloon_release_driver_page(struct page *page)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&balloon_lock, flags);
+	balloon_append(page);
+	balloon_stats.driver_pages--;
+	spin_unlock_irqrestore(&balloon_lock, flags);
+
+	schedule_work(&balloon_worker);
+}
+
+
+#define BALLOON_SHOW(name, format, args...)			\
+	static ssize_t show_##name(struct sys_device *dev,	\
+				   char *buf)			\
+	{							\
+		return sprintf(buf, format, ##args);		\
+	}							\
+	static SYSDEV_ATTR(name, S_IRUGO, show_##name, NULL)
+
+BALLOON_SHOW(current_kb, "%lu\n", PAGES2KB(balloon_stats.current_pages));
+BALLOON_SHOW(low_kb, "%lu\n", PAGES2KB(balloon_stats.balloon_low));
+BALLOON_SHOW(high_kb, "%lu\n", PAGES2KB(balloon_stats.balloon_high));
+BALLOON_SHOW(hard_limit_kb,
+	     (balloon_stats.hard_limit!=~0UL) ? "%lu\n" : "???\n",
+	     (balloon_stats.hard_limit!=~0UL) ? PAGES2KB(balloon_stats.hard_limit) : 0);
+BALLOON_SHOW(driver_kb, "%lu\n", PAGES2KB(balloon_stats.driver_pages));
+
+static ssize_t show_target_kb(struct sys_device *dev, char *buf)
+{
+	return sprintf(buf, "%lu\n", PAGES2KB(balloon_stats.target_pages));
+}
+
+static ssize_t store_target_kb(struct sys_device *dev,
+			       const char *buf,
+			       size_t count)
+{
+	char memstring[64], *endchar;
+	unsigned long long target_bytes;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (count <= 1)
+		return -EBADMSG; /* runt */
+	if (count >= sizeof(memstring))
+		return -EFBIG;   /* too long */
+	strcpy(memstring, buf);
+
+	target_bytes = memparse(memstring, &endchar);
+	balloon_set_new_target(target_bytes >> PAGE_SHIFT);
+
+	return count;
+}
+
+static SYSDEV_ATTR(target_kb, S_IRUGO | S_IWUSR,
+		   show_target_kb, store_target_kb);
+
+static struct sysdev_attribute *balloon_attrs[] = {
+	&attr_target_kb,
+};
+
+static struct attribute *balloon_info_attrs[] = {
+	&attr_current_kb.attr,
+	&attr_low_kb.attr,
+	&attr_high_kb.attr,
+	&attr_hard_limit_kb.attr,
+	&attr_driver_kb.attr,
+	NULL
+};
+
+static struct attribute_group balloon_info_group = {
+	.name = "info",
+	.attrs = balloon_info_attrs,
+};
+
+static struct sysdev_class balloon_sysdev_class = {
+	set_kset_name(BALLOON_CLASS_NAME),
+};
+
+static int register_balloon(struct sys_device *sysdev)
+{
+	int i, error;
+
+	error = sysdev_class_register(&balloon_sysdev_class);
+	if (error)
+		return error;
+
+	sysdev->id = 0;
+	sysdev->cls = &balloon_sysdev_class;
+
+	error = sysdev_register(sysdev);
+	if (error) {
+		sysdev_class_unregister(&balloon_sysdev_class);
+		return error;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(balloon_attrs); i++) {
+		error = sysdev_create_file(sysdev, balloon_attrs[i]);
+		if (error)
+			goto fail;
+	}
+
+	error = sysfs_create_group(&sysdev->kobj, &balloon_info_group);
+	if (error)
+		goto fail;
+
+	return 0;
+
+ fail:
+	while (--i >= 0)
+		sysdev_remove_file(sysdev, balloon_attrs[i]);
+	sysdev_unregister(sysdev);
+	sysdev_class_unregister(&balloon_sysdev_class);
+	return error;
+}
+
+static void unregister_balloon(struct sys_device *sysdev)
+{
+	int i;
+
+	sysfs_remove_group(&sysdev->kobj, &balloon_info_group);
+	for (i = 0; i < ARRAY_SIZE(balloon_attrs); i++)
+		sysdev_remove_file(sysdev, balloon_attrs[i]);
+	sysdev_unregister(sysdev);
+	sysdev_class_unregister(&balloon_sysdev_class);
+}
+
+static void balloon_sysfs_exit(void)
+{
+	unregister_balloon(&balloon_sysdev);
+}
+
+MODULE_LICENSE("GPL");
===================================================================
--- /dev/null
+++ b/include/xen/balloon.h
@@ -0,0 +1,61 @@
+/******************************************************************************
+ * balloon.h
+ *
+ * Xen balloon driver - enables returning/claiming memory to/from Xen.
+ *
+ * Copyright (c) 2003, B Dragovic
+ * Copyright (c) 2003-2004, M Williamson, K Fraser
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef __XEN_BALLOON_H__
+#define __XEN_BALLOON_H__
+
+#include <linux/spinlock.h>
+
+#if 0
+/*
+ * Inform the balloon driver that it should allow some slop for device-driver
+ * memory activities.
+ */
+void balloon_update_driver_allowance(long delta);
+
+/* Allocate/free a set of empty pages in low memory (i.e., no RAM mapped). */
+struct page **alloc_empty_pages_and_pagevec(int nr_pages);
+void free_empty_pages_and_pagevec(struct page **pagevec, int nr_pages);
+
+void balloon_release_driver_page(struct page *page);
+
+/*
+ * Prevent the balloon driver from changing the memory reservation during
+ * a driver critical region.
+ */
+extern spinlock_t balloon_lock;
+#define balloon_lock(__flags)   spin_lock_irqsave(&balloon_lock, __flags)
+#define balloon_unlock(__flags) spin_unlock_irqrestore(&balloon_lock, __flags)
+#endif
+
+#endif /* __XEN_BALLOON_H__ */
===================================================================
--- a/include/xen/interface/memory.h
+++ b/include/xen/interface/memory.h
@@ -29,7 +29,7 @@ struct xen_memory_reservation {
      *   OUT: GMFN bases of extents that were allocated
      *   (NB. This command also updates the mach_to_phys translation table)
      */
-    GUEST_HANDLE(ulong) extent_start;
+    ulong extent_start;
 
     /* Number of extents, and size/alignment of each (2^extent_order pages). */
     unsigned long  nr_extents;
@@ -50,7 +50,6 @@ struct xen_memory_reservation {
     domid_t        domid;
 
 };
-DEFINE_GUEST_HANDLE_STRUCT(xen_memory_reservation);
 
 /*
  * Returns the maximum machine frame number of mapped RAM in this system.
@@ -86,7 +85,7 @@ struct xen_machphys_mfn_list {
      * any large discontiguities in the machine address space, 2MB gaps in
      * the machphys table will be represented by an MFN base of zero.
      */
-    GUEST_HANDLE(ulong) extent_start;
+    ulong extent_start;
 
     /*
      * Number of extents written to the above array. This will be smaller
@@ -94,7 +93,6 @@ struct xen_machphys_mfn_list {
      */
     unsigned int nr_extents;
 };
-DEFINE_GUEST_HANDLE_STRUCT(xen_machphys_mfn_list);
 
 /*
  * Sets the GPFN at which a particular page appears in the specified guest's
@@ -117,7 +115,6 @@ struct xen_add_to_physmap {
     /* GPFN where the source mapping page should appear. */
     unsigned long gpfn;
 };
-DEFINE_GUEST_HANDLE_STRUCT(xen_add_to_physmap);
 
 /*
  * Translates a list of domain-specific GPFNs into MFNs. Returns a -ve error
@@ -132,14 +129,13 @@ struct xen_translate_gpfn_list {
     unsigned long nr_gpfns;
 
     /* List of GPFNs to translate. */
-    GUEST_HANDLE(ulong) gpfn_list;
+    ulong gpfn_list;
 
     /*
      * Output list to contain MFN translations. May be the same as the input
      * list (in which case each input GPFN is overwritten with the output MFN).
      */
-    GUEST_HANDLE(ulong) mfn_list;
+    ulong mfn_list;
 };
-DEFINE_GUEST_HANDLE_STRUCT(xen_translate_gpfn_list);
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
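
As a reading aid for the balloon patch above: the target bookkeeping done
by watch_target(), current_target() and balloon_process() can be modelled
in user space.  This is only a sketch under stated assumptions: the page
shift is fixed at 12 (4 KiB pages, as on i386), and the struct and the
helpers kib_to_pages(), clamp_target() and balloon_credit() are names made
up for illustration, not part of the patch.

```c
#include <limits.h>

/* Assumption: 4 KiB pages, as on i386. The kernel takes this from <asm/page.h>. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical user-space mirror of the counters in struct balloon_stats. */
struct sketch_stats {
	unsigned long current_pages;
	unsigned long target_pages;
	unsigned long hard_limit;	/* ULONG_MAX means "no limit seen yet" */
	unsigned long balloon_low;
	unsigned long balloon_high;
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* memory/target is in KiB; one page is 2^(PAGE_SHIFT - 10) KiB. */
static unsigned long kib_to_pages(unsigned long long kib)
{
	return (unsigned long)(kib >> (SKETCH_PAGE_SHIFT - 10));
}

/* Same clamping as current_target() in the patch. */
static unsigned long clamp_target(const struct sketch_stats *s)
{
	unsigned long target = min_ul(s->target_pages, s->hard_limit);

	return min_ul(target,
		      s->current_pages + s->balloon_low + s->balloon_high);
}

/* Positive: pages to claim from Xen; negative: pages to give back. */
static long balloon_credit(const struct sketch_stats *s)
{
	return (long)clamp_target(s) - (long)s->current_pages;
}
```

With 65536 pages currently allocated, an empty balloon and a memory/target
of 131072 KiB (32768 pages), balloon_credit() comes out at -32768, so the
worker would call decrease_reservation().  Raising the target above 65536
pages while the balloon is still empty gives a credit of 0, because the
domain cannot grow beyond current_pages + balloon_low + balloon_high.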

[-- Attachment #9: xen-cpu-hotplug.patch --]
[-- Type: text/x-patch, Size: 7208 bytes --]

---
 arch/x86/kernel/smp_32.c     |    2 +
 arch/x86/kernel/smpboot_32.c |    6 ++--
 arch/x86/xen/enlighten.c     |   15 ++++++++++-
 arch/x86/xen/smp.c           |   54 ++++++++++++++++++++++++++++--------------
 arch/x86/xen/xen-ops.h       |    1 
 include/asm-x86/smp_32.h     |   18 ++++++++++++--
 6 files changed, 72 insertions(+), 24 deletions(-)

===================================================================
--- a/arch/x86/kernel/smp_32.c
+++ b/arch/x86/kernel/smp_32.c
@@ -704,4 +704,6 @@ struct smp_ops smp_ops = {
 	.smp_send_stop = native_smp_send_stop,
 	.smp_send_reschedule = native_smp_send_reschedule,
 	.smp_call_function_mask = native_smp_call_function_mask,
+
+	.cpu_disable = native_cpu_disable,
 };
===================================================================
--- a/arch/x86/kernel/smpboot_32.c
+++ b/arch/x86/kernel/smpboot_32.c
@@ -1166,7 +1166,7 @@ void remove_siblinginfo(int cpu)
 	cpu_clear(cpu, cpu_sibling_setup_map);
 }
 
-int __cpu_disable(void)
+int native_cpu_disable(void)
 {
 	cpumask_t map = cpu_online_map;
 	int cpu = smp_processor_id();
@@ -1216,12 +1216,12 @@ void __cpu_die(unsigned int cpu)
  	printk(KERN_ERR "CPU %u didn't die...\n", cpu);
 }
 #else /* ... !CONFIG_HOTPLUG_CPU */
-int __cpu_disable(void)
+int native_cpu_disable(void)
 {
 	return -ENOSYS;
 }
 
-void __cpu_die(unsigned int cpu)
+void native_cpu_die(unsigned int cpu)
 {
 	/* We said "no" in __cpu_disable */
 	BUG();
===================================================================
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -254,10 +254,21 @@ static void xen_safe_halt(void)
 		BUG();
 }
 
+static void xen_shutdown_cpu(void)
+{
+	int cpu = smp_processor_id();
+
+	/* make sure we're not pinning something down */
+	load_cr3(swapper_pg_dir);
+	/* GDT too? */
+
+	HYPERVISOR_vcpu_op(VCPUOP_down, cpu, NULL);
+}
+
 static void xen_halt(void)
 {
 	if (irqs_disabled())
-		HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL);
+		xen_shutdown_cpu();
 	else
 		xen_safe_halt();
 }
@@ -1069,6 +1080,8 @@ static const struct smp_ops xen_smp_ops 
 	.smp_send_stop = xen_smp_send_stop,
 	.smp_send_reschedule = xen_smp_send_reschedule,
 	.smp_call_function_mask = xen_smp_call_function_mask,
+
+	.cpu_disable = xen_cpu_disable,
 };
 #endif	/* CONFIG_SMP */
 
===================================================================
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -189,8 +189,14 @@ void __init xen_smp_prepare_cpus(unsigne
 			panic("failed fork for CPU %d", cpu);
 
 		cpu_set(cpu, cpu_present_map);
-	}
-
+
+		smp_store_cpu_info(cpu);
+		init_gdt(cpu);
+		irq_ctx_init(cpu);
+		xen_setup_timer(cpu);
+		xen_smp_intr_init(cpu);
+	}
+
 	//init_xenbus_allowed_cpumask();
 }
 
@@ -198,7 +204,7 @@ cpu_initialize_context(unsigned int cpu,
 cpu_initialize_context(unsigned int cpu, struct task_struct *idle)
 {
 	struct vcpu_guest_context *ctxt;
-	struct gdt_page *gdt = &per_cpu(gdt_page, cpu);
+	struct desc_struct *gdt = get_cpu_gdt_table(cpu);
 
 	if (cpu_test_and_set(cpu, cpu_initialized_map))
 		return 0;
@@ -222,11 +228,11 @@ cpu_initialize_context(unsigned int cpu,
 
 	ctxt->ldt_ents = 0;
 
-	BUG_ON((unsigned long)gdt->gdt & ~PAGE_MASK);
-	make_lowmem_page_readonly(gdt->gdt);
-
-	ctxt->gdt_frames[0] = virt_to_mfn(gdt->gdt);
-	ctxt->gdt_ents      = ARRAY_SIZE(gdt->gdt);
+	BUG_ON((unsigned long)gdt & ~PAGE_MASK);
+	make_lowmem_page_readonly(gdt);
+
+	ctxt->gdt_frames[0] = virt_to_mfn(gdt);
+	ctxt->gdt_ents      = GDT_ENTRIES;
 
 	ctxt->user_regs.cs = __KERNEL_CS;
 	ctxt->user_regs.esp = idle->thread.esp0 - sizeof(struct pt_regs);
@@ -260,26 +266,20 @@ int __cpuinit xen_cpu_up(unsigned int cp
 		return rc;
 #endif
 
-	init_gdt(cpu);
 	per_cpu(current_task, cpu) = idle;
-	irq_ctx_init(cpu);
-	xen_setup_timer(cpu);
 
 	/* make sure interrupts start blocked */
 	per_cpu(xen_vcpu, cpu)->evtchn_upcall_mask = 1;
 
 	rc = cpu_initialize_context(cpu, idle);
 	if (rc)
-		return rc;
+		goto out;
 
 	if (num_online_cpus() == 1)
 		alternatives_smp_switch(1);
 
-	rc = xen_smp_intr_init(cpu);
-	if (rc)
-		return rc;
-
-	smp_store_cpu_info(cpu);
+	get_cpu();		/* set_cpu_sibling_map wants no preempt */
+
 	set_cpu_sibling_map(cpu);
 	/* This must be done before setting cpu_online_map */
 	wmb();
@@ -289,7 +289,10 @@ int __cpuinit xen_cpu_up(unsigned int cp
 	rc = HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
 	BUG_ON(rc);
 
-	return 0;
+	put_cpu();
+
+  out:
+	return rc;
 }
 
 void xen_smp_cpus_done(unsigned int max_cpus)
@@ -408,3 +411,18 @@ int xen_smp_call_function_mask(cpumask_t
 
 	return 0;
 }
+
+int xen_cpu_disable(void)
+{
+	cpumask_t map = cpu_online_map;
+	int cpu = smp_processor_id();
+
+	remove_siblinginfo(cpu);
+
+	cpu_clear(cpu, map);
+	fixup_irqs(map);
+	/* It's now safe to remove this processor from the online map */
+	cpu_clear(cpu, cpu_online_map);
+
+	return 0;
+}
===================================================================
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -39,6 +39,7 @@ void xen_smp_prepare_cpus(unsigned int m
 void xen_smp_prepare_cpus(unsigned int max_cpus);
 int xen_cpu_up(unsigned int cpu);
 void xen_smp_cpus_done(unsigned int max_cpus);
+int xen_cpu_disable(void);
 
 void xen_smp_send_stop(void);
 void xen_smp_send_reschedule(int cpu);
===================================================================
--- a/include/asm-x86/smp_32.h
+++ b/include/asm-x86/smp_32.h
@@ -63,6 +63,9 @@ struct smp_ops
 	int (*smp_call_function_mask)(cpumask_t mask,
 				      void (*func)(void *info), void *info,
 				      int wait);
+
+	int (*cpu_disable)(void);
+	void (*cpu_die)(unsigned int cpu);
 };
 
 extern struct smp_ops smp_ops;
@@ -71,14 +74,17 @@ static inline void smp_prepare_boot_cpu(
 {
 	smp_ops.smp_prepare_boot_cpu();
 }
+
 static inline void smp_prepare_cpus(unsigned int max_cpus)
 {
 	smp_ops.smp_prepare_cpus(max_cpus);
 }
+
 static inline int __cpu_up(unsigned int cpu)
 {
 	return smp_ops.cpu_up(cpu);
 }
+
 static inline void smp_cpus_done(unsigned int max_cpus)
 {
 	smp_ops.smp_cpus_done(max_cpus);
@@ -88,10 +94,12 @@ static inline void smp_send_stop(void)
 {
 	smp_ops.smp_send_stop();
 }
+
 static inline void smp_send_reschedule(int cpu)
 {
 	smp_ops.smp_send_reschedule(cpu);
 }
+
 static inline int smp_call_function_mask(cpumask_t mask,
 					 void (*func) (void *info), void *info,
 					 int wait)
@@ -99,10 +107,18 @@ static inline int smp_call_function_mask
 	return smp_ops.smp_call_function_mask(mask, func, info, wait);
 }
 
+static inline int __cpu_disable(void)
+{
+	return smp_ops.cpu_disable();
+}
+
+
 void native_smp_prepare_boot_cpu(void);
 void native_smp_prepare_cpus(unsigned int max_cpus);
 int native_cpu_up(unsigned int cpunum);
 void native_smp_cpus_done(unsigned int max_cpus);
+extern int native_cpu_disable(void);
+extern void __cpu_die(unsigned int cpu);
 
 #ifndef CONFIG_PARAVIRT
 #define startup_ipi_hook(phys_apicid, start_eip, start_esp) 		\
@@ -128,8 +144,6 @@ static inline int num_booting_cpus(void)
 }
 
 extern int safe_smp_processor_id(void);
-extern int __cpu_disable(void);
-extern void __cpu_die(unsigned int cpu);
 extern unsigned int num_processors;
 
 void __cpuinit smp_store_cpu_info(int id);
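
The hotplug patch above routes __cpu_disable() through struct smp_ops so
that Xen can substitute its own implementation.  A stripped-down
user-space sketch of that indirection follows.  Everything prefixed
sketch_ is invented for illustration (the return values in particular are
placeholders), and only the shape of the dispatch mirrors the patch.

```c
#include <errno.h>

/* Hypothetical cut-down ops table; the real struct smp_ops has more hooks. */
struct sketch_smp_ops {
	int (*cpu_disable)(void);
};

/* Stand-in for native_cpu_disable() built without CONFIG_HOTPLUG_CPU. */
static int sketch_native_cpu_disable(void)
{
	return -ENOSYS;
}

/* Stand-in for xen_cpu_disable(), which always reports success in the patch. */
static int sketch_xen_cpu_disable(void)
{
	return 0;
}

/* Default to the native hook, as smp_32.c does for smp_ops. */
static struct sketch_smp_ops sketch_ops = {
	.cpu_disable = sketch_native_cpu_disable,
};

/* Generic code only ever calls the wrapper, never a backend directly. */
static inline int sketch_cpu_disable(void)
{
	return sketch_ops.cpu_disable();
}

/* Assumption: a paravirt backend swaps the hook once, early in boot. */
static void sketch_install_xen_ops(void)
{
	sketch_ops.cpu_disable = sketch_xen_cpu_disable;
}
```

Before sketch_install_xen_ops() runs, sketch_cpu_disable() returns
-ENOSYS; afterwards it returns 0.  The price is one indirect call per
operation, which should not matter for something as rare as taking a CPU
offline.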


* Re: Re: Next steps with pv_ops for Xen
  2007-11-21 23:12 ` Jeremy Fitzhardinge
@ 2007-11-26 14:02   ` Juan Quintela
  2007-11-26 18:52     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 57+ messages in thread
From: Juan Quintela @ 2007-11-26 14:02 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

Hi,

Your console works great, but the rest of the patches assume at least
these files:

arch/x86/boot/compressed/notes-xen.c
arch/x86/xen/early.c

It looks as if another patch is missing; could you take a look, please?
Otherwise, I will look at what is missing myself.

It breaks with:

Intel machine check architecture supported.
(XEN) traps.c:1734:d0 Domain attempted WRMSR 00000404 from 00000000:00000001 to ffffffff:ffffffff.
Intel machine check reporting enabled on CPU#0.
general protection fault: 0000 [#1] SMP
Modules linked in:

Pid: 1, comm: swapper Not tainted (2.6.24-rc3-q2 #10)
EIP: 0061:[<c0101790>] EFLAGS: 00010082 CPU: 0
EIP is at native_write_cr0+0x0/0x4
EAX: c005003b EBX: c03902a0 ECX: ed03f288 EDX: 00000005
ESI: c1c10c80 EDI: ed054200 EBP: 00000001 ESP: ed027eb8
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: e021
Process swapper (pid: 1, ti=ed027000 task=ed03ebb0 task.ti=ed027000)
Stack: c01125e9 00000000 c03902a0 c1c10c80 ed054200 c01128c6 c03900a0 00000008
       c010e0aa c037b48d 00000000 ed00efa0 ed027f24 0000000a c035215c c01e20a7
       c1c10c80 80000008 000006f4 00020800 c0143563 ed03ebb0 017fe000 c03902a0
Call Trace:
 [<c01125e9>] prepare_set+0x20/0x86
 [<c01128c6>] generic_set_all+0x28/0x34a
 [<c010e0aa>] identify_cpu+0x525/0x52d
 [<c01e20a7>] kvasprintf+0x3f/0x48
 [<c0143563>] trace_hardirqs_off+0x28/0xa1
 [<c0111ac6>] mtrr_ap_init+0x33/0x5d
 [<c0117547>] smp_store_cpu_info+0x32/0xb9
 [<c0104e78>] xen_cpu_up+0x22c/0x3b4
 [<c0148fdf>] _cpu_up+0xab/0x120
 [<c014913e>] cpu_up+0x4e/0x61
 [<c03d33f8>] kernel_init+0x9e/0x2c6
 [<c0107017>] restore_nocheck+0x12/0x15
 [<c03d335a>] kernel_init+0x0/0x2c6
 [<c03d335a>] kernel_init+0x0/0x2c6
 [<c0107c7f>] kernel_thread_helper+0x7/0x10
 =======================
Code: 53 89 cb 83 ec 08 89 14 24 89 da 8b 04 24 89 4c 24 04 89 f9 0f 30 31 c0 5a
 59 5b 5e 5f c3 0f 31 c3 0f 33 c3 0f 06 c3 0f 20 c0 c3 <0f> 22 c0 c3 0f 20 e0 c3
 31 c0 0f 20 e0 c3 0f 09 c3 0f 01 00 c3
EIP: [<c0101790>] native_write_cr0+0x0/0x4 SS:ESP e021:ed027eb8
Kernel panic - not syncing: Attempted to kill init!
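
For reference, the byte that the Code: line marks with <..> is where EIP
pointed when the fault was taken.  A small user-space sketch for pulling
those bytes out of an oops follows.  The helper names are made up, and the
only hardware fact it relies on is that 0f 22 c0 is the i386 encoding of
"mov %eax,%cr0", which matches the native_write_cr0+0x0 in the EIP line.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Copy up to max bytes starting at the byte an oops "Code:" line marks
 * with <..>.  Returns the number of bytes copied, or 0 if there is no
 * marker.  Hypothetical helper for illustration; not a kernel interface.
 */
static size_t oops_faulting_bytes(const char *code, unsigned char *out,
				  size_t max)
{
	const char *p = strchr(code, '<');
	size_t n = 0;

	if (p == NULL)
		return 0;

	while (*p != '\0' && n < max) {
		char *end;
		unsigned long byte;

		/* Skip the marker characters and whitespace between bytes. */
		if (*p == '<' || *p == '>' || *p == ' ' || *p == '\n') {
			p++;
			continue;
		}
		byte = strtoul(p, &end, 16);
		if (end == p)
			break;		/* not a hex digit: stop parsing */
		out[n++] = (unsigned char)byte;
		p = end;
	}
	return n;
}

/* 0f 22 c0 is the i386 encoding of "mov %eax,%cr0". */
static int is_mov_eax_to_cr0(const unsigned char *b, size_t n)
{
	return n >= 3 && b[0] == 0x0f && b[1] == 0x22 && b[2] == 0xc0;
}
```

Fed the Code: line above, this yields 0f 22 c0 c3, i.e. the cr0 write
followed by ret.  That is consistent with the unsupported cr0 write
Stephen described, reached here via smp_store_cpu_info -> mtrr_ap_init ->
generic_set_all -> prepare_set.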


Later, Juan.


On Nov 22, 2007 12:12 AM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Stephen C. Tweedie wrote:
> > I've been looking at the next steps to try to get Xen running fully on
> > top of pv_ops.  To that end, I've (just) started looking at one of the
> > next major jobs --- i686 dom0 on pv_ops.
> >
>
> Great!
>
> > There are still a number of things needing done to reach parity with
> > xen-unstable:
> >
> >   x86_64 xen on pv_ops
> >
>
> I think once pvops has been unified, Xen support should be fairly
> straightforward.  I wrote most of the existing code with 64-bit in mind,
> so I'm hoping I got it right...
>
> >   Paravirt framebuffer/keyboard
> >   CPU hotplug
> >   Balloon
> >
>
> I've done some preliminary work on balloon and hotplug.  I think balloon
> should make more use of memory hotplug, but a straight port would be a
> good first step.
>
> >   kexec
> >   driver domains
> >
> > but it looks like these can largely proceed in parallel if desired.
> >
> > My short-term goal with this is simply to come up with a first-pass
> > merge of the linux-2.6.18-xen.hg dom0 support into the current
> > kernel.org tree's pv_ops support.  No major refactoring in the first
> > pass, but absolutely no *-xen.c code copying.
> >
>
> Yes.  #ifdefs are the way to go here.
>
> > I'm just starting this, but at least with the version magic check (see
> >
> >       http://lists.xensource.com/archives/html/xen-devel/2007-11/msg00601.html
> >
>
> I was just about to post a fix for this.
>
> > ) out of the way, an SMP dom0 running pv_ops gets all the way through
> > start_kernel() and into rest_init() before dying with an unsupported cr0
> > write.  (I'm using direct console hypercalls for printk for now, full
> > xencons is not working yet.)
> >
>
> I have some early dom0 patches already, though they're a few months old
> now.  Not much there, but I did do an early console implementation.
>
> > I'm happy to put up a git tree for this once it gets anywhere.  We'd
> > need to decide which tree to track for that purpose --- Linus's, or
> > perhaps the tglx or mingo x86 merge tree might make more sense.
> >
>
> Yes, I think the x86 tree is where we need to be, since there's a lot of
> activity there.
>
> I'll attach my dom0 patches for whatever use you can make of them.  They
> definitely won't apply to anything, not least because of the arch merge
> (though it looks like they did get converted by script), but also
> because they're based on some defunct experimental booting-from-bzImage
> patches.  But perhaps there's some useful stuff in there.
>
> I've also attached my xen-balloon and hotplug patches as-is.  They don't
> work completely, but they should be closer to applying.
>
>     J
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-11-26 14:02   ` Juan Quintela
@ 2007-11-26 18:52     ` Jeremy Fitzhardinge
  2007-11-27  8:30       ` Jan Beulich
  0 siblings, 1 reply; 57+ messages in thread
From: Jeremy Fitzhardinge @ 2007-11-26 18:52 UTC (permalink / raw)
  To: Juan Quintela
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

Juan Quintela wrote:
> Hi,
>
> your console works great, but the rest of the patches assume:
>
> arch/x86/boot/compressed/notes-xen.c
> arch/x86/xen/early.c
>   

Yes, those are leftovers from a somewhat unsuccessful attempt at getting
ELF-in-bzImage booting working.  I need to go back and make bzImage
booting work properly.

I posted those patches as a source of possibly useful code
snippets/summary of things I've looked at so far, rather than something
that can be directly used.

> at least.  It looks as if another patch is missing; could you
> take a look, please?
> Otherwise, I will take a look at what is missing.
>
> It breaks with:
>
> Intel machine check architecture supported.
> (XEN) traps.c:1734:d0 Domain attempted WRMSR 00000404 from 00000000:00000001 to
> ffffffff:ffffffff.
> Intel machine check reporting enabled on CPU#0.
> general protection fault: 0000 [#1] SMP
> Modules linked in:
>   

Hm.  Looks like Xen is getting upset about dom0 trying to disable
caching.  No, wait: 0xffffffff:ffffffff?  That's strange; I wonder if
it's just misreporting the value, because the code doesn't look like it's
trying to write that.

Either way, the fix is to implement xen_write_cr0, and mask off any bits
that Xen won't want us to set/clear (or if it doesn't allow dom0 to
change cr0, just ignore all updates).
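
A minimal sketch of that masking idea, written as plain user-space C so it can be compiled and tried on its own.  The helper name filter_cr0_write and the choice of which bits a guest may change are assumptions for illustration only; the CR0 bit positions are the architectural ones.  This is not the eventual xen_write_cr0.

```c
#include <stdint.h>

/* Architectural CR0 bit positions. */
#define X86_CR0_TS (1u << 3)   /* Task Switched */
#define X86_CR0_NW (1u << 29)  /* Not Write-through */
#define X86_CR0_CD (1u << 30)  /* Cache Disable */

/*
 * Hypothetical helper: compute the CR0 value to keep after a guest asks
 * to write `requested`.  Only bits set in `guest_mask` are taken from the
 * request; every other bit keeps its current value.  A mask of 0 means
 * "ignore all updates".
 */
static uint32_t filter_cr0_write(uint32_t cur, uint32_t requested,
                                 uint32_t guest_mask)
{
    return (cur & ~guest_mask) | (requested & guest_mask);
}
```

With guest_mask set to X86_CR0_TS, the CD write issued by prepare_set() would be dropped silently while TS updates still go through.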

    J

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-11-26 18:52     ` Jeremy Fitzhardinge
@ 2007-11-27  8:30       ` Jan Beulich
  2007-11-27 17:00         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2007-11-27  8:30 UTC (permalink / raw)
  To: Juan Quintela, Jeremy Fitzhardinge
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Glauber de Oliveira Costa, Chris Wright, virtualization

>> It breaks with:
>>
>> Intel machine check architecture supported.
>> (XEN) traps.c:1734:d0 Domain attempted WRMSR 00000404 from 00000000:00000001 to
>> ffffffff:ffffffff.
>> Intel machine check reporting enabled on CPU#0.
>> general protection fault: 0000 [#1] SMP
>> Modules linked in:
>>   
>
>Hm.  Looks like Xen is getting upset about dom0 trying to disable
>caching.  No, wait: 0xffffffff:ffffffff?  That's strange; I wonder if
>it's just misreporting the value, because the code doesn't look like it's
>trying to write that.
>
>Either way, the fix is to implement xen_write_cr0, and mask off any bits
>that Xen won't want us to set/clear (or if it doesn't allow dom0 to
>change cr0, just ignore all updates).

Why do you think that's a CR0 write? The messages clearly indicate an
MSR write, and these writes are clearly visible in intel_p{4,6}_mcheck_init()
and amd_mcheck_init(). The question is why intel_p4_mcheck_init() doesn't
check CPUID bits before trying to touch any registers... (And similarly
amd_mcheck_init() is checking only the MCE bit, not the MCA one.)
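
As a rough sketch of the CPUID gate being asked for here (not the kernel's actual code): machine-check setup would be attempted only when both the MCE and MCA feature bits are advertised in CPUID leaf 1, EDX.  The bit positions are architectural; the function and macro names are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, EDX feature bits. */
#define CPUID1_EDX_MCE (1u << 7)   /* Machine Check Exception */
#define CPUID1_EDX_MCA (1u << 14)  /* Machine Check Architecture */

/*
 * Hypothetical helper: true only if both MCE and MCA are advertised,
 * i.e. it is reasonable to go on and touch the machine-check MSRs.
 */
static bool mcheck_msrs_usable(uint32_t cpuid1_edx)
{
    const uint32_t need = CPUID1_EDX_MCE | CPUID1_EDX_MCA;

    return (cpuid1_edx & need) == need;
}
```

A hypervisor that masks those two bits in the CPUID values it presents to PV guests would then keep such a check from ever reaching the MSR writes.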

But then I just noticed that Xen itself doesn't clear the MCE/MCA bits either
in emulate_forced_invalid_op(), apparently under the assumption that PV
guests wouldn't try to make use of this feature.

A simple workaround would be to force mce_disabled to 1 in early Xen
initialization.

Jan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-11-27  8:30       ` Jan Beulich
@ 2007-11-27 17:00         ` Jeremy Fitzhardinge
  2007-11-27 17:14           ` Jan Beulich
  2007-11-27 17:15           ` Stephen C. Tweedie
  0 siblings, 2 replies; 57+ messages in thread
From: Jeremy Fitzhardinge @ 2007-11-27 17:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Glauber de Oliveira Costa, Chris Wright, virtualization,
	Juan Quintela

Jan Beulich wrote:
>>> It breaks with:
>>>
>>> Intel machine check architecture supported.
>>> (XEN) traps.c:1734:d0 Domain attempted WRMSR 00000404 from 00000000:00000001 to
>>> ffffffff:ffffffff.
>>> Intel machine check reporting enabled on CPU#0.
>>> general protection fault: 0000 [#1] SMP
>>> Modules linked in:
>>>   
>>>       
>> Hm.  Looks like Xen is getting upset about dom0 trying to disable
>> caching.  No, wait: 0xffffffff:ffffffff?  That's strange; I wonder if
>> it's just misreporting the value, because the code doesn't look like it's
>> trying to write that.
>>
>> Either way, the fix is to implement xen_write_cr0, and mask off any bits
>> that Xen won't want us to set/clear (or if it doesn't allow dom0 to
>> change cr0, just ignore all updates).
>>     
>
> Why do you think that's a CR0 write? 

Well, the oops says "EIP is at native_write_cr0+0x0/0x4", and the caller
is prepare_set(), which does:

	/*  Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
	cr0 = read_cr0() | X86_CR0_CD;
	write_cr0(cr0);
	wbinvd();

This is in preparation for setting up the MTRRs, all of which needs to be
skipped anyway.

> The messages clearly indicate an
> MSR write, and these writes are clearly visible in intel_p{4,6}_mcheck_init()
> and amd_mcheck_init(). The question is why intel_p4_mcheck_init() doesn't
> check CPUID bits before trying to touch any registers... (And similarly
> amd_mcheck_init() is checking only the MCE bit, not the MCA one.)
>   

The oops and backtrace don't suggest it's an MSR write.  Does a crX
write take the same path through the emulator as an MSR write?

> A simple workaround would be to force mce_disabled to 1 in early Xen
> initialization.
>   

That's probably necessary too.

    J

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-11-27 17:00         ` Jeremy Fitzhardinge
@ 2007-11-27 17:14           ` Jan Beulich
  2007-11-27 17:15           ` Stephen C. Tweedie
  1 sibling, 0 replies; 57+ messages in thread
From: Jan Beulich @ 2007-11-27 17:14 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Glauber de Oliveira Costa, Chris Wright, virtualization,
	Juan Quintela

>The oops and backtrace don't suggest it's an MSR write.  Does a crX

Oh, right, the MSR write is being ignored, not failed.

>write take the same path through the emulator as an MSR write?

No, the two operations take different paths.

Jan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-11-27 17:00         ` Jeremy Fitzhardinge
  2007-11-27 17:14           ` Jan Beulich
@ 2007-11-27 17:15           ` Stephen C. Tweedie
  1 sibling, 0 replies; 57+ messages in thread
From: Stephen C. Tweedie @ 2007-11-27 17:15 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Juan Quintela

Hi,

On Tue, 2007-11-27 at 09:00 -0800, Jeremy Fitzhardinge wrote:

> > Why do you think that's a CR0 write? 
> 
> Well, the oops says "EIP is at native_write_cr0+0x0/0x4", and the caller
> is prepare_set(), which does:
> 
> 	/*  Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
> 	cr0 = read_cr0() | X86_CR0_CD;
> 	write_cr0(cr0);
> 	wbinvd();
> 
> This is in preparation to setting up the MTRRs, 

Right: cpu 0 gets past the early mtrr init (on the boot CPU, all the
kernel does is to probe the existing mtrr config), but it dies on cpu 1
trying to copy the mtrr config across.  (As a consequence, we don't hit
this problem on UP configs.)

> which needs to be all skipped anyway.

We _could_ just skip it, but we still want some mtrr support for dom0.
Fortunately the kernel's mtrr interfaces are nicely modular already, so
I'm currently starting to plug the 2.6.18 mtrr/main-xen.c into pv_ops as
a modular mtrr provider.

> > The messages clearly indicate an
> > MSR write, and these writes are clearly visible in intel_p{4,6}_mcheck_init()
> > and amd_mcheck_init(). The question is why intel_p4_mcheck_init() doesn't
> > check CPUID bits before trying to touch any registers... (And similarly
> > amd_mcheck_init() is checking only the MCE bit, not the MCA one.)

> The oops and backtrace don't suggest it's an MSR write.  Does a crX
> write take the same path through the emulator as an MSR write?

We get a slew of MCE-related MSR write warnings from the HV on both boot
and auxiliary processor bring-up, but the kernel doesn't crash at those
points.  (Which is not necessarily a good thing, as it implies the
kernel thinks it has registered its MCE handling, but the MSR writes
have not actually been honoured.)  So it's not a showstopper right now,
but is something we'll still want to deal with at some stage.

> > A simple workaround would be to force mce_disabled to 1 in early Xen
> > initialization.

> That's probably necessary too.

It doesn't seem to be necessary, given that the kernel does get past
this point; it's probably desirable, though, at least until such time as
we can actually do the MCE support correctly.

--Stephen

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Next steps with pv_ops for Xen
  2007-11-21 22:05 Next steps with pv_ops for Xen Stephen C. Tweedie
  2007-11-21 23:12 ` Jeremy Fitzhardinge
@ 2007-12-03 12:54 ` Gerd Hoffmann
  2007-12-03 13:19   ` Derek Murray
  1 sibling, 1 reply; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-03 12:54 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Jan Beulich,
	Glauber de Oliveira Costa, Chris Wright, virtualization

Stephen C. Tweedie wrote:
> Hi all,
> 
>   driver domains

Looked at the gntdev (grant table mappings for user space) driver,
noticed that one is not self-contained.  It needs a hook for page unmapping:

  http://xenbits.xensource.com/xen-3.1-testing.hg?rev/7180d2e61f92
  plus an s/ptep_get_and_clear_full/zap_pte/ fixup a few changesets
  later.

Upstreaming that one could become *uhm* interesting.  Nevertheless the
gntdev functionality is quite useful for writing pure userspace
backend drivers ...

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-03 12:54 ` Gerd Hoffmann
@ 2007-12-03 13:19   ` Derek Murray
  2007-12-03 14:16     ` Gerd Hoffmann
  0 siblings, 1 reply; 57+ messages in thread
From: Derek Murray @ 2007-12-03 13:19 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

I take the blame for that one. I added the hook because, if a process 
were to die whilst holding one or more grants, there were no hooks that 
would make it possible to carry out the grant-unmap. All existing hooks 
on either the device or the VMA were called *after* the PTEs were cleared.

It gets better, though. The same hook is used in the version of blktap 
in linux-2.6.18-xen (not, as far as I can see, in the sparse tree for 
xen-3.1-testing):
 
http://xenbits.xensource.com/linux-2.6.18-xen.hg?file/fd879c0688bf/drivers/xen/blktap/blktap.c

Reverting back to the old (hookless) behaviour would be a retrograde 
step IMHO.

Cheers,

Derek Murray.

Gerd Hoffmann wrote:
> Stephen C. Tweedie wrote:
>> Hi all,
>>
>>   driver domains
> 
> Looked at the gntdev (grant table mappings for user space) driver,
> noticed that one is not self-contained.  It needs a hook for page unmapping:
> 
>   http://xenbits.xensource.com/xen-3.1-testing.hg?rev/7180d2e61f92
>   plus an s/ptep_get_and_clear_full/zap_pte/ fixup a few changesets
>   later.
> 
> Upstreaming that one could become *uhm* interesting.  Nevertheless the
> gntdev functionality is quite useful for writing pure userspace
> backend drivers ...
> 
> cheers,
>   Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-03 13:19   ` Derek Murray
@ 2007-12-03 14:16     ` Gerd Hoffmann
  2007-12-03 14:51       ` Derek Murray
  0 siblings, 1 reply; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-03 14:16 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

Derek Murray wrote:
> I take the blame for that one. I added the hook because, if a process
> were to die whilst holding one or more grants, there were no hooks that
> would make it possible to carry out the grant-unmap. All existing hooks
> on either the device or the VMA were called *after* the PTEs were cleared.

Hmm.  What exactly is the issue here?

This is about *userspace* mappings, right?  As far as I can see from a
quick scan of the code, there is an additional kernel space mapping for
the grants and the userspace mapping is optional.  I don't see any
problems with userspace mapping going away without *instant*
notification.  Cleaning up a bit later, called from the
file_ops->release callback maybe, should work ok.

The problem I see with the additional vm_ops callback is that I suspect
you'll have to come up with some *very* good arguments to get it
accepted by the VM (as in "virtual memory") folks and merged mainline.

> It gets better, though. The same hook is used in the version of blktap
> in linux-2.6.18-xen (not, as far as I can see, in the sparse tree for
> xen-3.1-testing):

Oh, I'm thinking more in the direction of killing blktap altogether in
favor of a pure userspace implementation on top of gntdev.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-03 14:16     ` Gerd Hoffmann
@ 2007-12-03 14:51       ` Derek Murray
  2007-12-03 17:18         ` Mark Williamson
  2007-12-03 20:38         ` Gerd Hoffmann
  0 siblings, 2 replies; 57+ messages in thread
From: Derek Murray @ 2007-12-03 14:51 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

Gerd Hoffmann wrote:
> Derek Murray wrote:
>> I take the blame for that one. I added the hook because, if a process
>> were to die whilst holding one or more grants, there were no hooks that
>> would make it possible to carry out the grant-unmap. All existing hooks
>> on either the device or the VMA were called *after* the PTEs were cleared.
> 
> Hmm.  What exactly is the issue here?
> 
> This is about *userspace* mappings, right?  As far as I can see from a
> quick scan of the code, there is an additional kernel space mapping for
> the grants and the userspace mapping is optional.  I don't see any
> problems with userspace mapping going away without *instant*
> notification.  Cleaning up a bit later, called from the
> file_ops->release callback maybe, should work ok.

If we let Linux zap the page tables before we unmap the grant reference, 
then it is not possible to unmap the grant reference. The 
unmap_grant_ref hypercall ultimately calls destroy_grant_pte_mapping in 
xen/arch/x86/mm.c, which ensures that the PTE does in fact point to the 
granted frame. Note also the comment further up in that file (in 
put_page_from_l1e):

     /*
      * Check if this is a mapping that was established via a grant reference.
      * If it was then we should not be here: we require that such mappings are
      * explicitly destroyed via the grant-table interface.
      *
      * The upshot of this is that the guest can end up with active grants that
      * it cannot destroy (because it no longer has a PTE to present to the
      * grant-table interface). This can lead to subtle hard-to-catch bugs,
      * hence a special grant PTE flag can be enabled to catch the bug early.
      *
      * (Note that the undestroyable active grants are not a security hole in
      * Xen. All active grants can safely be cleaned up when the domain dies.)
      */

Effectively, there is a debug option that sets a bit in PTEs that map 
granted pages, and this can be used to force a domain_crash in the event 
that a VM tries to zap the entries normally. The normal behaviour is to 
silently accept the zap operation, and leak granted pages until the 
grantee domain is killed.
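
To make the ordering constraint concrete, here is a toy model in plain C.  It is not Xen code: the struct and function names are invented, and the only behaviour it borrows from the description above is that the unmap succeeds only while the PTE still points at the granted frame.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for a page-table entry and an active grant mapping. */
struct toy_pte   { uint64_t frame; bool present; };
struct toy_grant { uint64_t frame; bool active;  };

/*
 * Models the check described above: the grant can only be released if
 * the caller can still present a PTE that maps the granted frame.
 */
static bool toy_unmap_grant(struct toy_grant *g, struct toy_pte *pte)
{
    if (!g->active || !pte->present || pte->frame != g->frame)
        return false;          /* nothing to present: the grant stays active */
    pte->present = false;
    g->active = false;
    return true;
}

/* Models Linux clearing the PTE on its own, e.g. at process exit. */
static void toy_zap_pte(struct toy_pte *pte)
{
    pte->present = false;
    pte->frame = 0;
}
```

Calling toy_zap_pte() before toy_unmap_grant() leaves the grant active with no way to release it, which is the leak the unmapping hook is meant to prevent.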

> The problem I see with the additional vm_ops callback is that I suspect
> you'll have to come up with some *very* good arguments to get it
> accepted by the VM (as in "virtual memory") folks and merged mainline.

On this point I completely agree with you! If anyone has any less 
radical suggestions, then I'd be delighted to refactor the gntdev code 
to use them. However, I'm not currently aware of any alternative that 
maintains robustness to process crashes.

>> It gets better, though. The same hook is used in the version of blktap
>> in linux-2.6.18-xen (not, as far as I can see, in the sparse tree for
>> xen-3.1-testing):
> 
> Oh, I'm thinking more in the direction of killing blktap altogether in
> favor of a pure userspace implementation on top of gntdev.

I think this would represent good progress, though I wonder if there 
would be a performance penalty due to performing the mapping and 
unmapping in user-space (multiple syscalls per mapping versus a single 
hypercall).

Cheers,

Derek Murray.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-03 14:51       ` Derek Murray
@ 2007-12-03 17:18         ` Mark Williamson
  2007-12-03 18:36           ` D.G. Murray
  2007-12-03 20:38         ` Gerd Hoffmann
  1 sibling, 1 reply; 57+ messages in thread
From: Mark Williamson @ 2007-12-03 17:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Derek Murray, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann

> >> It gets better, though. The same hook is used in the version of blktap
> >> in linux-2.6.18-xen (not, as far as I can see, in the sparse tree for
> >> xen-3.1-testing):
> >
> > Oh, I'm thinking more in the direction of killing blktap altogether in
> > favor of a pure userspace implementation on top of gntdev.
>
> I think this would represent good progress, though I wonder if there
> would be a performance penalty due to performing the mapping and
> unmapping in user-space (multiple syscalls per mapping versus a single
> hypercall).

Maybe a change to the gntdev userspace API to allow batching of mapping 
requests?

I'm not aware of a batched mmap interface, which would seem to be the ideal 
solution; but it should be possible to batch this stuff somehow.  Although it 
seems like some kind of really weird ioctl might be needed :-S to do it 
*without* such a batched interface...

blktap in userspace, if any performance problems can be addressed, would seem 
to be a far nicer way of doing things.  And it's less code to merge 
upstream ;-)

Cheers,
Mark

-- 
Dave: Just a question. What use is a unicycle with no seat?  And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: Re: Next steps with pv_ops for Xen
  2007-12-03 17:18         ` Mark Williamson
@ 2007-12-03 18:36           ` D.G. Murray
  2007-12-03 19:08             ` Mark Williamson
                               ` (3 more replies)
  0 siblings, 4 replies; 57+ messages in thread
From: D.G. Murray @ 2007-12-03 18:36 UTC (permalink / raw)
  To: 'Mark Williamson', xen-devel
  Cc: 'Eduardo Habkost', 'Juan Quintela',
	'Stephen C. Tweedie', 'Jan Beulich',
	'Glauber de Oliveira Costa', 'Chris Wright',
	virtualization, 'Gerd Hoffmann'

Hi Mark, 

> Maybe a change to the gntdev userspace API to allow batching 
> of mapping requests?

Something along the lines of the following?

/**
 * Memory maps one or more grant references from one or more domains to a
 * contiguous local address range. Mappings should be unmapped with
 * xc_gnttab_munmap. Returns NULL on failure.
 *
 * @parm xcg_handle a handle on an open grant table interface
 * @parm count the number of grant references to be mapped
 * @parm domids an array of @count domain IDs by which the corresponding @refs
 * were granted
 * @parm refs an array of @count grant references to be mapped
 * @parm prot same flag as in mmap()
 */
void *xc_gnttab_map_grant_refs(int xcg_handle,
                               uint32_t count,
                               uint32_t *domids,
                               uint32_t *refs,
                               int prot); 

http://xenbits.xensource.com/xen-unstable.hg?file/3057f813da14/tools/libxc/xenctrl.h
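
A hedged sketch of how a userspace backend might use such a batched call: one call per block request rather than one per segment.  To keep the example compilable without libxc, toy_map_grant_refs is a stand-in that merely shares the parameter list quoted above and allocates zeroed memory; it maps nothing.  TOY_MAX_SEGS, TOY_PAGE_SIZE and toy_map_request are likewise invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_SEGS  16u    /* hypothetical per-request segment limit */

/*
 * Stand-in with the same parameter list as the prototype above.  It only
 * allocates `count` zeroed pages so the calling pattern can be exercised.
 */
static void *toy_map_grant_refs(int xcg_handle, uint32_t count,
                                uint32_t *domids, uint32_t *refs, int prot)
{
    (void)xcg_handle;
    (void)prot;
    if (count == 0 || domids == NULL || refs == NULL)
        return NULL;
    return calloc(count, TOY_PAGE_SIZE);
}

/*
 * Map every segment of one request from a single frontend domain with a
 * single batched call.  Returns NULL on failure; the caller releases the
 * stand-in mapping with free().
 */
static void *toy_map_request(int xcg_handle, uint32_t domid,
                             uint32_t *refs, uint32_t nr_segs, int prot)
{
    uint32_t domids[TOY_MAX_SEGS];
    uint32_t i;

    if (nr_segs == 0 || nr_segs > TOY_MAX_SEGS)
        return NULL;
    for (i = 0; i < nr_segs; i++)
        domids[i] = domid;   /* all references come from the same domain */
    return toy_map_grant_refs(xcg_handle, nr_segs, domids, refs, prot);
}
```

With the real interface the mapping would be released with xc_gnttab_munmap, as the comment on the prototype says.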

> blktap in userspace, if any performance problems can be 
> addressed, would seem to be a far nicer way of doing things.  
> And it's less code to merge upstream ;-)

Agreed.

Cheers,

Derek.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-03 18:36           ` D.G. Murray
@ 2007-12-03 19:08             ` Mark Williamson
  2007-12-04  9:35               ` tgh
  2007-12-06 15:21             ` Gerd Hoffmann
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 57+ messages in thread
From: Mark Williamson @ 2007-12-03 19:08 UTC (permalink / raw)
  To: dgm36
  Cc: xen-devel, 'Eduardo Habkost', 'Juan Quintela',
	'Stephen C. Tweedie', 'Jan Beulich',
	'Glauber de Oliveira Costa', 'Chris Wright',
	virtualization, 'Gerd Hoffmann'

> Hi Mark,
>
> > Maybe a change to the gntdev userspace API to allow batching
> > of mapping requests?
>
> Something along the lines of the following?

Just like that :-D

When you said "multiple syscalls per mapping" I assumed you meant that we'd 
lose the batching you get by doing a multicall.  If it's just a couple of 
syscalls (plus, presumably a couple of hypercalls) per batch of mappings, my 
gut says it's probably not going to hurt block performance.  My guts have 
been wrong in (many!) ways before of course...

I guess the overhead *could* be reduced even more by just having a magic ioctl 
that did all the mmap-ing stuff in one operation, but that'd probably be 
really gross if it wasn't necessary!  And I doubt it'd make upstream very 
happy...

We'll also be eliminating the overheads involved in having a blktap ring for 
talking to userspace and having to move requests between that ring and the 
real block ring, so there's some definite wins in overheads as well.

Cheers,
Mark

-- 
Dave: Just a question. What use is a unicycle with no seat?  And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-03 14:51       ` Derek Murray
  2007-12-03 17:18         ` Mark Williamson
@ 2007-12-03 20:38         ` Gerd Hoffmann
  2007-12-04  9:40           ` Derek Murray
  1 sibling, 1 reply; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-03 20:38 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 1431 bytes --]

Derek Murray wrote:
> If we let Linux zap the page tables before we unmap the grant reference,
> then it is not possible to unmap the grant reference. The
> unmap_grant_ref hypercall ultimately calls destroy_grant_pte_mapping in
> xen/arch/x86/mm.c, which ensures that the PTE does in fact point to the
> granted frame.

Hmm, I see.  You have to do that for every mapping, not just the last
(kernel) one, to release the grant.  And just dropping that check is
probably out of the question because the guest could then fool Xen's
reference counting?

> On this point I completely agree with you! If anyone has any less
> radical suggestions, then I'd be delighted to refactor the gntdev code
> to use them. However, I'm not currently aware of any alternative that
> maintains robustness to process crashes.

Oh, for me it isn't robust at all, it crashes on the first munmap
syscall.  It is the Fedora 8 kernel.  See attachment.  Didn't try
xensource 2.6.18 yet.

Ideas what is wrong?
Who uses the gntdev device right now?

> I think this would represent good progress, though I wonder if there
> would be a performance penalty due to performing the mapping and
> unmapping in user-space (multiple syscalls per mapping versus a single
> hypercall).

I'd expect the hard disk (and how I/O is scheduled) to be the
bottleneck, not the syscall overhead.  Nevertheless I plan to benchmark
it once I have it up and running.

cheers,
  Gerd

[-- Attachment #2: oops --]
[-- Type: text/plain, Size: 25856 bytes --]

Linux version 2.6.21-2952.fc8xen (kojibuilder@hammer2.fedora.redhat.com) (gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)) #1 SMP Mon Nov 19 07:06:55 EST 2007
BIOS-provided physical RAM map:
sanitize start
sanitize bail 0
copy_e820_map() start: 0000000000000000 size: 000000007491e000 end: 000000007491e000 type: 1
 Xen: 0000000000000000 - 000000007491e000 (usable)
1137MB HIGHMEM available.
727MB LOWMEM available.
NX (Execute Disable) protection: active
Entering add_active_range(0, 0, 477470) 0 entries of 256 used
Zone PFN ranges:
  DMA             0 ->   186366
  Normal     186366 ->   186366
  HighMem    186366 ->   477470
early_node_map[1] active PFN ranges
    0:        0 ->   477470
On node 0 totalpages: 477470
  DMA zone: 1455 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 184911 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
  HighMem zone: 2274 pages used for memmap
  HighMem zone: 288830 pages, LIFO batch:31
found SMP MP-table at 000ff780
DMI present.
ACPI: RSDP 000F9990, 0014 (r0 ACPIAM)
ACPI: RSDT 7D6B0000, 0044 (r1 A M I  OEMRSDT   5000708 MSFT       97)
ACPI: FACP 7D6B0200, 0084 (r2 A M I  OEMFACP   5000708 MSFT       97)
ACPI Warning (tbfadt-0360): Ignoring BIOS FADT r2 C-state control [20070126]
ACPI: DSDT 7D6B0490, 6643 (r1 SDBLI9 SDBLI944       44 INTL 20051117)
ACPI: FACS 7D6BE000, 0040
ACPI: APIC 7D6B0390, 006C (r1 A M I  OEMAPIC   5000708 MSFT       97)
ACPI: MCFG 7D6B0450, 003C (r1 A M I  OEMMCFG   5000708 MSFT       97)
ACPI: OEMB 7D6BE040, 0079 (r1 A M I  AMI_OEM   5000708 MSFT       97)
ACPI: ASF! 7D6B6AE0, 0099 (r32 LEGEND I865PASF        1 INTL 20051117)
ACPI: GSCI 7D6BE0C0, 2024 (r1 A M I  GMCHSCI   5000708 MSFT       97)
ACPI: iEIT 7D6C00F0, 00B0 (r1 A M I  EITTABLE  5000708 MSFT       97)
ACPI: SSDT 7D6C0BC0, 0877 (r1 DpgPmm    CpuPm       12 INTL 20051117)
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Detected 2992.804 MHz processor.
Built 1 zonelists.  Total pages: 473741
Kernel command line: ro root=/dev/xeni/fedora32 console=tty1 xencons=xvc0 console=xvc0 panic=30
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c136c000 soft=c134c000
PID hash table entries: 4096 (order: 12, 16384 bytes)
Xen reported: 2992.594 MHz processor.
Console: colour VGA+ 80x50
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Software IO TLB enabled: 
 Aperture:     2 megabytes
 Kernel range: c033c000 - c053c000
 Address size: 24 bits
vmalloc area: ee000000-f4ffe000, maxmem 2d7fe000
Memory: 1866468k/1909880k available (2071k kernel code, 34068k reserved, 1080k data, 188k init, 1164416k highmem)
virtual kernel memory layout:
    fixmap  : 0xf5315000 - 0xf57fe000   (5028 kB)
    pkmap   : 0xf5000000 - 0xf5200000   (2048 kB)
    vmalloc : 0xee000000 - 0xf4ffe000   ( 111 MB)
    lowmem  : 0xc0000000 - 0xed7fe000   ( 727 MB)
      .init : 0xc1319000 - 0xc1348000   ( 188 kB)
      .data : 0xc1205e5e - 0xc1313fd4   (1080 kB)
      .text : 0xc1000000 - 0xc1205e5e   (2071 kB)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 5991.67 BogoMIPS (lpj=11983344)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
CPU: After generic identify, caps: bfebd3f1 20100000 00000000 00000000 0000e3fd 00000000 00000001
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU: After all inits, caps: bfebd3f1 20100000 00000000 00003940 0000e3fd 00000000 00000001
Checking 'hlt' instruction... OK.
SMP alternatives: switching to UP code
ACPI: Core revision 20070126
CPU 1 irqstacks, hard=c136d000 soft=c134d000
ENABLING IO-APIC IRQs
SMP alternatives: switching to SMP code
Brought up 2 CPUs
Initializing CPU#1
sizeof(vma)=88 bytes
sizeof(page)=32 bytes
sizeof(inode)=336 bytes
sizeof(dentry)=132 bytes
sizeof(ext3inode)=488 bytes
sizeof(buffer_head)=56 bytes
sizeof(skbuff)=176 bytes
sizeof(task_struct)=1376 bytes
migration_cost=19
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using configuration type 1
Setting up standard PCI resources
Allocating PCI resources starting at 80000000 (gap: 7d6b0000:82950000)
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
Error attaching device data
Error attaching device data
Error attaching device data
Error attaching device data
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
Boot video device is 0000:00:02.0
PCI: Transparent bridge - 0000:00:1e.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P4._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P6._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 10 12 14 *15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *10 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 *5 6 7 10 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 10 12 *14 15)
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 *10 12 14 15)
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 *5 6 7 10 12 14 15)
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 *5 6 7 10 12 14 15)
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 10 12 14 *15)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 23 devices
xen_mem: Initialising balloon driver.
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
pnp: 00:01: iomem range 0xfed14000-0xfed19fff has been reserved
pnp: 00:0a: ioport range 0xa20-0xa3f has been reserved
pnp: 00:0a: ioport range 0xa00-0xa0f has been reserved
pnp: 00:0a: ioport range 0xa10-0xa1f has been reserved
pnp: 00:0a: ioport range 0xa40-0xa5f has been reserved
pnp: 00:0b: iomem range 0xfed1c000-0xfed1ffff has been reserved
pnp: 00:0b: iomem range 0xfed20000-0xfed8ffff has been reserved
pnp: 00:0e: iomem range 0xffc00000-0xffefffff has been reserved
pnp: 00:0f: iomem range 0xfec00000-0xfec00fff has been reserved
pnp: 00:0f: iomem range 0xfee00000-0xfee00fff has been reserved
pnp: 00:10: iomem range 0xe0000000-0xefffffff has been reserved
pnp: 00:11: iomem range 0xffa77000-0xffa77fff has been reserved
pnp: 00:12: iomem range 0x0-0x0 could not be reserved
pnp: 00:13: iomem range 0x0-0x0 could not be reserved
pnp: 00:14: iomem range 0x0-0x0 could not be reserved
pnp: 00:15: iomem range 0x0-0x0 could not be reserved
pnp: 00:16: iomem range 0x0-0x9ffff could not be reserved
pnp: 00:16: iomem range 0xc0000-0xcffff could not be reserved
pnp: 00:16: iomem range 0xe0000-0xfffff could not be reserved
pnp: 00:16: iomem range 0x100000-0x7d6fffff could not be reserved
PCI: Ignore bogus resource 6 [0:0] of 0000:00:02.0
PCI: Bridge: 0000:00:1c.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:02:00.0
  IO window: disabled.
  MEM window: ff600000-ff6fffff
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:1c.2
  IO window: disabled.
  MEM window: ff600000-ff6fffff
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:1e.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 17 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:00:1c.0 to 64
ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 18 (level, low) -> IRQ 17
PCI: Setting latency timer of device 0000:00:1c.2 to 64
PCI: Setting latency timer of device 0000:02:00.0 to 64
PCI: Setting latency timer of device 0000:00:1e.0 to 64
NET: Registered protocol family 2
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1572864 bytes)
TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
checking if image is initramfs... it is
Freeing initrd memory: 8554k freed
audit: initializing netlink socket (disabled)
audit(1196713193.812:1): initialized
highmem bounce pool size: 64 pages
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
SELinux:  Registering netfilter hooks
ksign: Installing public key data
Loading keyring
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
PCI: Setting latency timer of device 0000:00:1c.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:1c.0:pcie00]
Allocate Port Service[0000:00:1c.0:pcie02]
PCI: Setting latency timer of device 0000:00:1c.2 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:1c.2:pcie00]
Allocate Port Service[0000:00:1c.2:pcie02]
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI Warning (tbutils-0158): Incorrect checksum in table [OEMB] -  5E, should be 4F [20070126]
ACPI: SSDT 7D6C01A0, 02F4 (r1 DpgPmm  P001Ist       11 INTL 20051117)
Monitor-Mwait will be used to enter C-1 state
ACPI: CPU0 (power states: C1[C1] C2[C2])
ACPI: SSDT 7D6C06B0, 02F4 (r1 DpgPmm  P002Ist       12 INTL 20051117)
ACPI: CPU1 (power states: C1[C1] C2[C2])
ACPI Exception (processor_core-0783): AE_NOT_FOUND, Processor Device is not present [20070126]
ACPI Exception (processor_core-0783): AE_NOT_FOUND, Processor Device is not present [20070126]
Real Time Clock Driver v1.12ac
Non-volatile memory driver v1.2
Linux agpgart interface v0.102 (c) Dave Jones
RAMDISK driver initialized: 16 RAM disks of 16384K size 4096 blocksize
input: Macintosh mouse button emulation as /class/input/input0
Xen virtual console successfully installed as xvc0
Event-channel device installed.
usbcore: registered new interface driver hiddev
usbcore: registered new interface driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
PNP: No PS/2 controller found. Probing ports directly.
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
TCP bic registered
Initializing XFRM netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
Using IPI No-Shortcut mode
drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
Freeing unused kernel memory: 188k freed
Write protecting the kernel read-only data: 795k
ACPI: PCI Interrupt 0000:00:1a.7[D] -> GSI 19 (level, low) -> IRQ 18
PCI: Setting latency timer of device 0000:00:1a.7 to 64
ehci_hcd 0000:00:1a.7: EHCI Host Controller
ehci_hcd 0000:00:1a.7: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:1a.7: debug port 1
PCI: cache line size of 32 is not supported by device 0000:00:1a.7
ehci_hcd 0000:00:1a.7: irq 18, io mem 0xffa7b400
ehci_hcd 0000:00:1a.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 6 ports detected
ACPI: PCI Interrupt 0000:00:1d.7[A] -> GSI 23 (level, low) -> IRQ 19
PCI: Setting latency timer of device 0000:00:1d.7 to 64
ehci_hcd 0000:00:1d.7: EHCI Host Controller
ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 2
ehci_hcd 0000:00:1d.7: debug port 1
PCI: cache line size of 32 is not supported by device 0000:00:1d.7
ehci_hcd 0000:00:1d.7: irq 19, io mem 0xffa7b000
ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 6 ports detected
ohci_hcd: 2006 August 04 USB 1.1 'Open' Host Controller (OHCI) Driver
USB Universal Host Controller Interface driver v3.0
ACPI: PCI Interrupt 0000:00:1a.0[A] -> GSI 16 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:1a.0 to 64
uhci_hcd 0000:00:1a.0: UHCI Host Controller
uhci_hcd 0000:00:1a.0: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:1a.0: irq 20, io base 0x0000d880
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1a.1[B] -> GSI 21 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:00:1a.1 to 64
uhci_hcd 0000:00:1a.1: UHCI Host Controller
uhci_hcd 0000:00:1a.1: new USB bus registered, assigned bus number 4
uhci_hcd 0000:00:1a.1: irq 21, io base 0x0000d800
usb usb4: configuration #1 chosen from 1 choice
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1a.2[C] -> GSI 18 (level, low) -> IRQ 17
PCI: Setting latency timer of device 0000:00:1a.2 to 64
uhci_hcd 0000:00:1a.2: UHCI Host Controller
uhci_hcd 0000:00:1a.2: new USB bus registered, assigned bus number 5
uhci_hcd 0000:00:1a.2: irq 17, io base 0x0000d480
usb usb5: configuration #1 chosen from 1 choice
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 23 (level, low) -> IRQ 19
PCI: Setting latency timer of device 0000:00:1d.0 to 64
uhci_hcd 0000:00:1d.0: UHCI Host Controller
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 6
uhci_hcd 0000:00:1d.0: irq 19, io base 0x0000d400
usb usb6: configuration #1 chosen from 1 choice
hub 6-0:1.0: USB hub found
hub 6-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.1[B] -> GSI 19 (level, low) -> IRQ 18
PCI: Setting latency timer of device 0000:00:1d.1 to 64
uhci_hcd 0000:00:1d.1: UHCI Host Controller
uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 7
uhci_hcd 0000:00:1d.1: irq 18, io base 0x0000d080
usb usb7: configuration #1 chosen from 1 choice
hub 7-0:1.0: USB hub found
hub 7-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.2[D] -> GSI 16 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:1d.2 to 64
uhci_hcd 0000:00:1d.2: UHCI Host Controller
uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 8
uhci_hcd 0000:00:1d.2: irq 20, io base 0x0000d000
usb usb8: configuration #1 chosen from 1 choice
hub 8-0:1.0: USB hub found
hub 8-0:1.0: 2 ports detected
SCSI subsystem initialized
libata version 2.20 loaded.
ahci 0000:00:1f.2: version 2.1
ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 19 (level, low) -> IRQ 18
usb 6-1: new low speed USB device using uhci_hcd and address 2
usb 6-1: configuration #1 chosen from 1 choice
input: SOLIDTEK USB Composite Keyboard as /class/input/input1
input: USB HID v1.10 Keyboard [SOLIDTEK USB Composite Keyboard] on usb-0000:00:1d.0-1
input: SOLIDTEK USB Composite Keyboard as /class/input/input2
input,hiddev96: USB HID v1.10 Device [SOLIDTEK USB Composite Keyboard] on usb-0000:00:1d.0-1
PCI: Setting latency timer of device 0000:00:1f.2 to 64
ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports ? Gbps 0x3f impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq stag led clo pmp pio slum part 
ata1: SATA max UDMA/133 cmd 0xee074900 ctl 0x00000000 bmdma 0x00000000 irq 18
ata2: SATA max UDMA/133 cmd 0xee074980 ctl 0x00000000 bmdma 0x00000000 irq 18
ata3: SATA max UDMA/133 cmd 0xee074a00 ctl 0x00000000 bmdma 0x00000000 irq 18
ata4: SATA max UDMA/133 cmd 0xee074a80 ctl 0x00000000 bmdma 0x00000000 irq 18
ata5: SATA max UDMA/133 cmd 0xee074b00 ctl 0x00000000 bmdma 0x00000000 irq 18
ata6: SATA max UDMA/133 cmd 0xee074b80 ctl 0x00000000 bmdma 0x00000000 irq 18
scsi0 : ahci
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ata_hpa_resize 1: sectors = 321672960, hpa_sectors = 321672960
ata1.00: ATA-7: Hitachi HDS721616PLA380, P22OAB3A, max UDMA/133
ata1.00: 321672960 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata1.00: ata_hpa_resize 1: sectors = 321672960, hpa_sectors = 321672960
ata1.00: configured for UDMA/133
scsi1 : ahci
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: ATAPI, max UDMA/66
ata2.00: configured for UDMA/66
scsi2 : ahci
ata3: SATA link down (SStatus 0 SControl 300)
scsi3 : ahci
ata4: SATA link down (SStatus 0 SControl 300)
scsi4 : ahci
ata5: SATA link down (SStatus 0 SControl 300)
scsi5 : ahci
ata6: SATA link down (SStatus 0 SControl 300)
scsi 0:0:0:0: Direct-Access     ATA      Hitachi HDS72161 P22O PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 321672960 512-byte hardware sectors (164697 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 321672960 512-byte hardware sectors (164697 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
scsi 1:0:0:0: CD-ROM            PLEXTOR  DVDR   PX-755A   1.04 PQ: 0 ANSI: 5
device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: dm-devel@redhat.com
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
ata1: D2H reg with I during NCQ, this message won't be printed again
kjournald starting.  Commit interval 5 seconds
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
security:  5 users, 11 roles, 2391 types, 114 bools, 1 sens, 1024 cats
security:  67 classes, 215624 rules
SELinux:  Completing initialization.
SELinux:  Setting up existing superblocks.
SELinux: initialized (dev dm-0, type ext3), uses xattr
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev debugfs, type debugfs), uses genfs_contexts
SELinux: initialized (dev selinuxfs, type selinuxfs), uses genfs_contexts
SELinux: initialized (dev mqueue, type mqueue), uses transition SIDs
SELinux: initialized (dev devpts, type devpts), uses transition SIDs
SELinux: initialized (dev eventpollfs, type eventpollfs), uses task SIDs
SELinux: initialized (dev inotifyfs, type inotifyfs), uses genfs_contexts
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev futexfs, type futexfs), uses genfs_contexts
SELinux: initialized (dev pipefs, type pipefs), uses task SIDs
SELinux: initialized (dev sockfs, type sockfs), uses task SIDs
SELinux: initialized (dev cpuset, type cpuset), uses genfs_contexts
SELinux: initialized (dev proc, type proc), uses genfs_contexts
SELinux: initialized (dev bdev, type bdev), uses genfs_contexts
SELinux: initialized (dev rootfs, type rootfs), uses genfs_contexts
SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts
audit(1196713206.804:2): policy loaded auid=4294967295
sr0: scsi3-mmc drive: 40x/40x writer cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.20
sr 1:0:0:0: Attached scsi CD-ROM sr0
sd 0:0:0:0: Attached scsi generic sg0 type 0
sr 1:0:0:0: Attached scsi generic sg1 type 5
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
input: PC Speaker as /class/input/input3
8250_pnp: Unknown symbol serial8250_unregister_port
8250_pnp: Unknown symbol serial8250_register_port
8250_pnp: Unknown symbol serial8250_unregister_port
8250_pnp: Unknown symbol serial8250_register_port
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
FDC 0 is a National Semiconductor PC87306
Intel(R) PRO/1000 Network Driver - version 7.5.5.1-NAPI
Copyright (c) 1999-2007 Intel Corporation.
ACPI: PCI Interrupt 0000:00:19.0[A] -> GSI 20 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:19.0 to 64
e1000: 0000:00:19.0: e1000_probe: PHY reset is blocked due to SOL/IDER session.
parport: PnPBIOS parport detected.
parport0: PC-style at 0x378 (0x778), irq 7 [PCSPP,TRISTATE,EPP]
e1000: 0000:00:19.0: e1000_check_copper_options: Link active due to SoL/IDER Session. Speed/Duplex/AutoNeg parameter ignored.
e1000: 0000:00:19.0: e1000_probe: (PCI Express:2.5Gb/s:Width x1) 00:13:20:f5:f8:50
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
ACPI: PCI Interrupt 0000:00:1f.3[C] -> GSI 18 (level, low) -> IRQ 17
ACPI: PCI Interrupt 0000:00:1b.0[A] -> GSI 22 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:00:1b.0 to 64
serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
ACPI: PCI Interrupt 0000:00:03.3[B] -> GSI 17 (level, low) -> IRQ 16
device-mapper: multipath: version 1.0.5 loaded
loop: loaded (max 8 devices)
EXT3 FS on dm-0, internal journal
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-2, type ext3), uses xattr
SELinux: initialized (dev sda1, type ext2), uses xattr
SELinux: initialized (dev sda2, type ext2), uses xattr
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
Adding 4194296k swap on /dev/mapper/xeni-swap.  Priority:-1 extents:1 across:4194296k
SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts
IA-32 Microcode Update Driver: v1.14a-xen <tigran@veritas.com>
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
Mobile IPv6
ip6_tables: (C) 2000-2006 Netfilter Core Team
ip_tables: (C) 2000-2006 Netfilter Core Team
Netfilter messages via NETLINK v0.30.
nf_conntrack version 0.5.0 (8192 buckets, 65536 max)
ADDRCONF(NETDEV_UP): eth0: link is not ready
e1000: eth0: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
e1000: eth0: e1000_watchdog_task: 10/100 speed: disabling TSO
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
audit(1196713221.215:3): audit_pid=1901 old=0 by auid=4294967295 subj=system_u:system_r:auditd_t:s0
SELinux: initialized (dev rpc_pipefs, type rpc_pipefs), uses genfs_contexts
eth0: no IPv6 routers present
SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
Bridge firewalling registered
ADDRCONF(NETDEV_UP): peth0: link is not ready
e1000: peth0: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
e1000: peth0: e1000_watchdog_task: 10/100 speed: disabling TSO
ADDRCONF(NETDEV_CHANGE): peth0: link becomes ready
device peth0 entered promiscuous mode
eth0: port 1(peth0) entering learning state
eth0: topology change detected, propagating
eth0: port 1(peth0) entering forwarding state
virbr0: no IPv6 routers present
peth0: no IPv6 routers present
eth0: no IPv6 routers present
xen-vbd: registered block device major 202
blkfront: xvda: barriers enabled
 xvda:<0>Eeek! page_mapcount(page) went negative! (-1)
  page pfn = 29384
  page->flags = 835
  page->count = 3
  page->mapping = eb093d98
  vma->vm_ops = gntdev_vmops+0x0/0x34
  vma->vm_ops->nopage = 0x0
  vma->vm_file->f_op->mmap = gntdev_mmap+0x0/0x467
------------[ cut here ]------------
kernel BUG at mm/rmap.c:574!
invalid opcode: 0000 [#1]
SMP 
last sysfs file: /devices/xen/vbd-51712/devtype
Modules linked in: xenblk(U) ipt_MASQUERADE(U) iptable_nat(U) nf_nat(U) bridge(U) autofs4(U) sunrpc(U) nf_conntrack_netbios_ns(U) nf_conntrack_ipv4(U) xt_state(U) nf_conntrack(U) nfnetlink(U) ipt_REJECT(U) iptable_filter(U) ip_tables(U) xt_tcpudp(U) ip6t_REJECT(U) ip6table_filter(U) ip6_tables(U) x_tables(U) ipv6(U) ext2(U) loop(U) dm_multipath(U) 8250_pci(U) snd_hda_intel(U) snd_hda_codec(U) snd_seq_dummy(U) snd_seq_oss(U) snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U) snd_mixer_oss(U) snd_pcm(U) snd_timer(U) parport_pc(U) snd(U) i2c_i801(U) ata_generic(U) e1000(U) parport(U) floppy(U) serio_raw(U) pcspkr(U) soundcore(U) i2c_core(U) 8250(U) snd_page_alloc(U) sg(U) sr_mod(U) serial_core(U) cdrom(U) ata_piix(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) mbcache(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
CPU:    1
EIP:    0061:[<c1065c32>]    Not tainted VLI
EFLAGS: 00210282   (2.6.21-2952.fc8xen #1)
EIP is at page_remove_rmap+0xce/0xed
eax: 00000036   ebx: c2371080   ecx: 00000001   edx: 00000000
esi: e92aed84   edi: 00000020   ebp: 53584067   esp: e9245ea4
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0069
Process blkbackd (pid: 3154, ti=e9245000 task=c1d95150 task.ti=e9245000)
Stack: c1291346 eb093d98 c1156c6b 00000000 c105d7fa 00000000 00000000 ffffffef 
       b7f03000 e92aed84 e9245f68 53584067 00000000 003ff000 b7f03000 00000000 
       00000000 b7f04000 e95a1010 00000000 eb128580 e92df818 c2371080 c1c64900 
Call Trace:
 [<c1156c6b>] gntdev_clear_pte+0x0/0x289
 [<c105d7fa>] unmap_vmas+0x62e/0x8bd
 [<c101d6f4>] __wake_up+0x32/0x43
 [<c106269f>] unmap_region+0x93/0xf7
 [<c1063073>] do_munmap+0x15a/0x1ac
 [<c10630f5>] sys_munmap+0x30/0x3e
 [<c1005688>] syscall_call+0x7/0xb
 =======================
Code: c0 74 0d 8b 50 08 b8 76 13 29 c1 e8 35 af fd ff 8b 46 4c 85 c0 74 14 8b 40 10 85 c0 74 0d 8b 50 2c b8 95 13 29 c1 e8 1a af fd ff <0f> 0b eb fe 8b 53 10 89 d8 59 5b 5b 83 e2 01 5e f7 da 83 c2 04 
EIP: [<c1065c32>] page_remove_rmap+0xce/0xed SS:ESP 0069:e9245ea4

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* Re: Re: Next steps with pv_ops for Xen
  2007-12-03 19:08             ` Mark Williamson
@ 2007-12-04  9:35               ` tgh
  2007-12-05  3:42                 ` Mark Williamson
  0 siblings, 1 reply; 57+ messages in thread
From: tgh @ 2007-12-04  9:35 UTC (permalink / raw)
  To: Mark Williamson
  Cc: xen-devel, 'Eduardo Habkost', 'Juan Quintela',
	'Stephen C. Tweedie', 'Jan Beulich',
	'Glauber de Oliveira Costa', 'Chris Wright',
	virtualization, dgm36, 'Gerd Hoffmann'

Hi,
I am not quite clear about the purpose of pv-ops. What problem are we
trying to solve by developing "pv-ops"? Is it meant for HVM, for PV, for
KVM, or for something else? I have seen it on the list for a few months,
and "pv-ops" is clearly an active project, but I am not clear about what
its aim is. Could you give me an explanation?

Thanks in advance




Mark Williamson wrote:
>> Hi Mark,
>>
>>     
>>> Maybe a change to the gntdev userspace API to allow batching
>>> of mapping requests?
>>>       
>> Something along the lines of the following?
>>     
>
> Just like that :-D
>
> When you said "multiple syscalls per mapping" I assumed you meant that we'd 
> lose the batching you get by doing a multicall.  If it's just a couple of 
> syscalls (plus, presumably, a couple of hypercalls) per batch of mappings, my 
> gut says it's probably not going to hurt block performance.  My guts have 
> been wrong in (many!) ways before of course...
>
> I guess the overhead *could* be reduced even more by just having a magic ioctl 
> that did all the mmap-ing stuff in one operation, but that'd probably be 
> really gross if it wasn't necessary!  And I doubt it'd make upstream very 
> happy...
>
> We'll also be eliminating the overheads involved in having a blktap ring for 
> talking to userspace and having to move requests between that ring and the 
> real block ring, so there's some definite wins in overheads as well.
>
> Cheers,
> Mark
>
>   


* Re: Re: Next steps with pv_ops for Xen
  2007-12-03 20:38         ` Gerd Hoffmann
@ 2007-12-04  9:40           ` Derek Murray
  2007-12-04 12:01             ` Gerd Hoffmann
  2007-12-04 20:59             ` Ian Main
  0 siblings, 2 replies; 57+ messages in thread
From: Derek Murray @ 2007-12-04  9:40 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

Gerd Hoffmann wrote:
>> On this point I completely agree with you! If anyone has any less
>> radical suggestions, then I'd be delighted to refactor the gntdev code
>> to use them. However, I'm not currently aware of any alternative that
>> maintains robustness to process crashes.
> 
> Oh, for me it isn't robust at all, it crashes on the first munmap
> syscall.  It is the Fedora 8 kernel.  See attachment.  Didn't try
> xensource 2.6.18 yet.

My gut feeling is that something changed in mm between 2.6.18 and 
2.6.21, but that seems like a cop out so...

> Ideas what is wrong?

Since the bug appears to be in page_remove_rmap, that would tend to 
imply that there is never a corresponding page_add_*_rmap 
(page_add_file_rmap?). My knowledge of the Linux mm code is a bit shaky 
here: should gntdev be doing this? Should we be using install_page (or a 
modified version thereof) to set the PTE?
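
To make the suspected imbalance concrete, here is a toy model of the
accounting that page_remove_rmap checks. It is plain Python, not kernel
code: the Page class and the map_and_unmap helper are made up for
illustration, and the only detail taken from the oops is that the map
count reached -1 on unmap.

```python
class Page:
    """Toy stand-in for struct page: tracks only the number of mappings."""

    def __init__(self):
        # Number of PTEs the rmap accounting believes map this page.
        self.mapcount = 0

    def add_rmap(self):
        # Models page_add_*_rmap: accounting done when a PTE is installed.
        self.mapcount += 1

    def remove_rmap(self):
        # Models page_remove_rmap: accounting done when a PTE is torn down.
        self.mapcount -= 1
        if self.mapcount < 0:
            raise RuntimeError(
                "page_mapcount(page) went negative! (%d)" % self.mapcount)


def map_and_unmap(account_on_map):
    """Install one mapping, tear it down again, return the final count."""
    page = Page()
    if account_on_map:
        page.add_rmap()   # balanced path: the map side is accounted for
    page.remove_rmap()    # the unmap path seen in the oops does this
    return page.mapcount


if __name__ == "__main__":
    print(map_and_unmap(account_on_map=True))
    try:
        map_and_unmap(account_on_map=False)
    except RuntimeError as err:
        print("BUG:", err)
```

In this model the balanced case ends with a count of 0, while the
unbalanced case raises on the first unmap, which is the shape of the
failure in the attached oops. Whether gntdev should do the accounting
on the map side, or keep the mm from doing the removal, is exactly the
open question.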

Also, does a simple program that opens gntdev, maps a grant, 
accesses/writes to the page, and unmaps it (all using the xc_gnttab_* 
functions) work?

> Who uses the gntdev device right now?

Good question! I'm aware of it being used in a few research projects, 
and it seems to work for them (though I think it is mostly used with the 
linux-2.6.18-xen kernel). Anyone else?

>> I think this would represent good progress, though I wonder if there
>> would be a performance penalty due to performing the mapping and
>> unmapping in user-space (multiple syscalls per mapping versus a single
>> hypercall).
> 
> I'd expect the hard disk (and how I/O is scheduled) being the
> bottleneck, not the syscall overhead.  Nevertheless I plan to benchmark
> it once I have it up and running.

Great to hear that you're working on this! Let me know if there's any 
other help I can provide with gntdev.
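
On the syscall-overhead question, a back-of-the-envelope count may help
frame the benchmark. The sketch below only counts user-to-kernel
transitions; it measures nothing. The function name and the figure of
2 syscalls per map operation are assumptions for illustration, and 11
is used as the number of grants because that is the per-request segment
limit in the block interface.

```python
def kernel_entries(grants, calls_per_op, batch_size):
    """Count the syscalls needed to map `grants` grant references.

    calls_per_op: syscalls issued per map operation (assumed, not measured).
    batch_size:   grants covered by one map operation (1 means no batching).
    """
    if grants < 0 or calls_per_op < 1 or batch_size < 1:
        raise ValueError("grants must be >= 0, the other arguments >= 1")
    # Ceiling division: a partially filled batch still costs one operation.
    operations = -(-grants // batch_size)
    return operations * calls_per_op


if __name__ == "__main__":
    # One block request with 11 segments, 2 syscalls per map operation.
    print(kernel_entries(grants=11, calls_per_op=2, batch_size=1))
    print(kernel_entries(grants=11, calls_per_op=2, batch_size=11))
```

Under these assumptions the unbatched path costs 22 syscalls per request
and the batched path costs 2, so a batching API would take most of the
per-mapping cost off the table before the disk is even considered.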

Cheers,

Derek.


* Re: Re: Next steps with pv_ops for Xen
  2007-12-04  9:40           ` Derek Murray
@ 2007-12-04 12:01             ` Gerd Hoffmann
  2007-12-04 12:39               ` Stephen C. Tweedie
  2007-12-04 20:59             ` Ian Main
  1 sibling, 1 reply; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-04 12:01 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

Derek Murray wrote:
> Gerd Hoffmann wrote:
>> Oh, for me it isn't robust at all, it crashes on the first munmap
>> syscall.  It is the Fedora 8 kernel.  See attachment.  Didn't try
>> xensource 2.6.18 yet.
> 
> My gut feeling is that something changed in mm between 2.6.18 and
> 2.6.21, but that seems like a cop out so...

Could be.  Cross checking failed though: 2.6.18 doesn't boot the machine
in question (intel devel box with ich9).  It doesn't find the disk.
Probably the ahci driver is too old.

>> Ideas what is wrong?
> 
> Since the bug appears to be in page_remove_rmap, that would tend to
> imply that there is never a corresponding page_add_*_rmap
> (page_add_file_rmap?). My knowledge of the Linux mm code is a bit shaky
> here: should gntdev be doing this? Should we be using install_page (or a
> modified version thereof) to set the PTE?

Don't know, I'm just trying to use it.  I did some mm handling for
device drivers back in my video4linux days, but that never required
getting involved in setting or clearing pte entries.  I just had a
->nopage handler allocate the pages the way I needed them for the
userspace mappings of video dma buffers.

> Also, does a simple program that opens gntdev, maps a grant,
> accesses/writes to the page, and unmaps it (all using the xc_gnttab_*
> functions) work?

Didn't try yet.  The application in question (blkbackd) does this:

  * map blk shared ring
  * see the first request come in (kernel trying to read the
    partition table).
  * map the grants of the request.
  * perform I/O.
  * Try to unmap the grants of the request.  On the first unmap call
    the kernel oopses.

All of this happens without even starting a guest; I'm just using
"xm block-attach" to create a blkfront device in Dom0.

>> Who uses the gntdev device right now?
> 
> Good question! I'm aware of it being used in a few research projects,
> and it seems to work for them (though I think it is mostly used with the
> linux-2.6.18-xen kernel). Anyone else?

So it effectively got no real-world testing yet ...

cheers,
  Gerd


* Re: Re: Next steps with pv_ops for Xen
  2007-12-04 12:01             ` Gerd Hoffmann
@ 2007-12-04 12:39               ` Stephen C. Tweedie
  2007-12-04 19:58                 ` Gerd Hoffmann
                                   ` (3 more replies)
  0 siblings, 4 replies; 57+ messages in thread
From: Stephen C. Tweedie @ 2007-12-04 12:39 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Derek Murray, xen-devel, Eduardo Habkost, Juan Quintela,
	Stephen Tweedie, Jan Beulich, Glauber de Oliveira Costa,
	Chris Wright, virtualization

Hi,

On Tue, 2007-12-04 at 13:01 +0100, Gerd Hoffmann wrote:

> >> Who uses the gntdev device right now?
> > 
> > Good question! I'm aware of it being used in a few research projects,
> > and it seems to work for them (though I think it is mostly used with the
> > linux-2.6.18-xen kernel). Anyone else?
> 
> So it effectively got no real-world testing yet ...

So... the interface (a) cannot be used on the Linux VM without at least
one invasive VM modification, due to the requirement of ptes being
explicitly unmapped via hypercall; and (b) isn't used significantly in
real life yet.

I can't help wondering if this is a hint that now is the time to find a
better API, which doesn't have the requirement (a) that seems to be
causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
to have the same problems with their VM layers if they try to implement
this API?  Upstream Linux pv_ops certainly will, and it would be good if
we could avoid tying unprivileged guests to ABIs which cannot hope to be
merged into pv_ops.

(Just what is the cost of not having this functionality in blktap,
anyway?)

--Stephen


* Re: Re: Next steps with pv_ops for Xen
  2007-12-04 12:39               ` Stephen C. Tweedie
@ 2007-12-04 19:58                 ` Gerd Hoffmann
  2007-12-05 11:48                   ` [Xen-devel] " Derek Murray
                                     ` (2 more replies)
  2007-12-04 21:08                 ` Ian Main
                                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-04 19:58 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Derek Murray, xen-devel, Eduardo Habkost, Juan Quintela,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 1883 bytes --]

Stephen C. Tweedie wrote:
> Hi,
> 
> On Tue, 2007-12-04 at 13:01 +0100, Gerd Hoffmann wrote:
> 
>>>> Who uses the gntdev device right now?
>>> Good question! I'm aware of it being used in a few research projects,
>>> and it seems to work for them (though I think it is mostly used with the
>>> linux-2.6.18-xen kernel). Anyone else?
>> So it effectively got no real-world testing yet ...
> 
> So... the interface (a) cannot be used on the Linux VM without at least
> one invasive VM modification, due to the requirement of ptes being
> explicitly unmapped via hypercall; and (b) isn't used significantly in
> real life yet.

(c) does not seem to work for anything non-trivial.  I've compiled and
tested a xensource 2.6.18 kernel (3.1 testing mercurial tree head,
should be 3.1.2-release); it fails in a similar way.  See attachment.

Want to reproduce?  Here we go:

  * grab xenner 0.8 from http://dl.bytesex.org/releases/xenner/
  * grab a xenified dom0 kernel without blktap driver (either not
    compiled or module not loaded).
  * start xend
  * start blkbackd from xenner package (you probably want the -d switch
    for debug output, twice for more).
  * run "xm block-attach 0 tap:aio:/path/to/some/file xvda r"
  * watch it blow up ;)

> I can't help wondering if this is a hint that now is the time to find a
> better API, which doesn't have the requirement (a) that seems to be
> causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
> to have the same problems with their VM layers if they try to implement
> this API?  Upstream Linux pv_ops certainly will, and it would be good if
> we could avoid tying unprivileged guests to ABIs which cannot hope to be
> merged into pv_ops.

And I fear the problems I've run into so far are only the tip of
the iceberg.  What happens if an application with active grant table
mappings calls fork()?

cheers,
  Gerd

[-- Attachment #2: oops --]
[-- Type: text/plain, Size: 15610 bytes --]

Linux version 2.6.18-xen (kraxel@zweiblum.travel.kraxel.org) (gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)) #1 SMP Tue Dec 4 18:17:24 CET 2007
BIOS-provided physical RAM map:
 Xen: 0000000000000000 - 000000000adc3000 (usable)
0MB HIGHMEM available.
173MB LOWMEM available.
On node 0 totalpages: 44483
  DMA zone: 44483 pages, LIFO batch:7
DMI 2.3 present.
ACPI: RSDP (v000 OID_00                                ) @ 0x000f0010
ACPI: RSDT (v001 OID_00 RSDT_000 0x30303030 & 0x00010000) @ 0x0bfffbd0
ACPI: FADT (v001 OID_00 FACP_000 0x30303030 & 0x00010000) @ 0x0bfffb20
ACPI: BOOT (v001 OID_00 BOOT_000 0x30303030 & 0x00010000) @ 0x0bfffba0
ACPI: DSDT (v001 INT440 SYSFexxx 0x00001001 MSFT 0x0100000b) @ 0x00000000
ACPI: Vendor "INT440" System "SYSFexxx" Revision 0x1001 has a known ACPI BIOS problem.
ACPI: Reason: Does not use _REG to protect EC OpRegions. This is a non-recoverable error
ACPI: Disabling ACPI support
Allocating PCI resources starting at 10000000 (gap: 0c000000:f3fc0000)
Detected 600.047 MHz processor.
Built 1 zonelists.  Total pages: 44483
Kernel command line: ro root=/dev/zen/rhel5 apm=off vga=0x317 panic=30
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 1024 (order: 10, 4096 bytes)
Xen reported: 600.034 MHz processor.
Console: colour VGA+ 80x50
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Software IO TLB enabled: 
 Aperture:     2 megabytes
 Kernel range: c0aad000 - c0cad000
 Address size: 24 bits
vmalloc area: cb800000-f51fe000, maxmem 2d7fe000
Memory: 155572k/177932k available (1972k kernel code, 14020k reserved, 693k data, 192k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 1502.07 BogoMIPS (lpj=7510358)
Security Framework v1.0.0 initialized
Capability LSM initialized
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0387d1f1 00000000 00000000 00000000 00000000 00000000 00000000
CPU: After vendor identify, caps: 0387d1f1 00000000 00000000 00000000 00000000 00000000 00000000
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU serial number disabled.
CPU: After all inits, caps: 0383d1f1 00000000 00000000 00000040 00000000 00000000 00000000
Checking 'hlt' instruction... OK.
SMP alternatives: switching to UP code
Freeing SMP alternatives: 12k freed
Brought up 1 CPUs
migration_cost=0
checking if image is initramfs... it is
Freeing initrd memory: 6538k freed
NET: Registered protocol family 16
PCI: Using configuration type 1
Setting up standard PCI resources
ACPI: Interpreter disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI: disabled
xen_mem: Initialising balloon driver.
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
PCI quirk: region 1000-103f claimed by PIIX4 ACPI
PCI quirk: region 1400-140f claimed by PIIX4 SMB
PIIX4 devres C PIO at 0398-0399
Boot video device is 0000:00:09.0
PCI: Using IRQ router PIIX/ICH [8086/7198] at 0000:00:07.0
PCI: Cannot allocate resource region 0 of device 0000:00:0b.0
PCI: Bus 1, cardbus bridge: 0000:00:08.0
  IO window: 00001c00-00001cff
  IO window: 00002000-000020ff
  PREFETCH window: 10000000-11ffffff
  MEM window: 12000000-13ffffff
PCI: setting IRQ 10 as level-triggered
PCI: Found IRQ 10 for device 0000:00:08.0
NET: Registered protocol family 2
IP route cache hash table entries: 2048 (order: 1, 8192 bytes)
TCP established hash table entries: 8192 (order: 4, 65536 bytes)
TCP bind hash table entries: 4096 (order: 3, 32768 bytes)
TCP: Hash tables configured (established 8192 bind 4096)
TCP reno registered
Simple Boot Flag at 0x37 set to 0x1
IA-32 Microcode Update Driver: v1.14a-xen <tigran@veritas.com>
audit: initializing netlink socket (disabled)
audit(1196794944.970:1): initialized
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
Initializing Cryptographic API
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
floppy0: no floppy controllers found
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
loop: loaded (max 8 devices)
Xen virtual console successfully installed as ttyS0
Event-channel device installed.
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
PNP: No PS/2 controller found. Probing ports directly.
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
NET: Registered protocol family 1
NET: Registered protocol family 17
Using IPI No-Shortcut mode
Freeing unused kernel memory: 192k freed
piix: no version for "struct_module" found: kernel tainted.
PIIX4: IDE controller at PCI slot 0000:00:07.1
PIIX4: chipset revision 0
PIIX4: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0x1100-0x1107, BIOS settings: hda:DMA, hdb:pio
Probing IDE interface ide0...
input: AT Translated Set 2 keyboard as /class/input/input0
hda: HTS548040M9AT00, ATA DISK drive
input: PS/2 Mouse as /class/input/input1
input: AlpsPS/2 ALPS GlidePoint as /class/input/input2
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
usbcore: registered new driver usbfs
usbcore: registered new driver hub
USB Universal Host Controller Interface driver v3.0

PCI: IRQ 11 for device 0000:00:07.2 doesn't match PIRQ mask - try pci=usepirqmask
<7>PCI: setting IRQ 11 as level-triggered
PCI: Found IRQ 11 for device 0000:00:07.2
PCI: Sharing IRQ 11 with 0000:00:0a.0
uhci_hcd 0000:00:07.2: UHCI Host Controller
uhci_hcd 0000:00:07.2: new USB bus registered, assigned bus number 1
uhci_hcd 0000:00:07.2: irq 11, io base 0x00001200
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
hda: max request size: 512KiB
usb 1-2: new full speed USB device using uhci_hcd and address 2
hda: 78140160 sectors (40007 MB) w/7877KiB Cache, CHS=16383/255/63, UDMA(33)
hda: cache flushes supported
 hda:<6>usb 1-2: configuration #1 chosen from 1 choice
hub 1-2:1.0: USB hub found
hub 1-2:1.0: 3 ports detected
 hda1 hda2 hda3 < hda5 > hda4
device-mapper: ioctl: 4.7.0-ioctl (2006-06-24) initialised: dm-devel@redhat.com
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
Real Time Clock Driver v1.12ac
input: PC Speaker as /class/input/input3
piix4_smbus 0000:00:07.3: Found 0000:00:07.3 device
Yenta: CardBus bridge found at 0000:00:08.0 [1071:7722]
Yenta: Enabling burst memory read transactions
Yenta: Using CSCINT to route CSC interrupts to PCI
Yenta: Routing CardBus interrupts to PCI
Yenta TI: socket 0000:00:08.0, mfunc 0x017c1602, devctl 0x64
Intel 810 + AC97 Audio, version 1.01, 18:04:40 Dec  4 2007
Yenta: ISA IRQ mask 0x02d8, PCI irq 10
Socket status: 30000010
PCI: Setting latency timer of device 0000:00:00.1 to 64
i810: Intel 440MX found at IO 0x1500 and 0x1600, MEM 0x0000 and 0x0000, IRQ 5
ieee1394: Initialized config rom entry `ip1394'
i810_audio: Audio Controller supports 2 channels.
i810_audio: Defaulting to base 2 channel mode.
i810_audio: Resetting connection 0
ac97_codec: AC97 Audio codec, id: CRY52 (Cirrus Logic CS4299 rev D)
i810_audio: AC'97 codec 0 supports AMAP, total channels = 2
i810_audio: setting clocking to 38348
PCI: Found IRQ 10 for device 0000:00:0b.0
ohci1394: fw-host0: SelfID received outside of bus reset sequence
ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[10]  MMIO=[14021000-140217ff]  Max Packet=[1024]  IR/IT contexts=[4/4]
pccard: PCMCIA card inserted into slot 0
8139cp: 10/100 PCI Ethernet driver v1.2 (Mar 22, 2004)
8139cp 0000:00:0a.0: This (id 10ec:8139 rev 10) is not an 8139C+ compatible chip
8139cp 0000:00:0a.0: Try the "8139too" driver instead.
8139too Fast Ethernet driver 0.9.27
PCI: Found IRQ 11 for device 0000:00:0a.0
PCI: Sharing IRQ 11 with 0000:00:07.2
eth0: RealTek RTL8139 at 0xcb980000, 00:40:d0:12:f3:b4, IRQ 11
eth0:  Identified 8139 chip type 'RTL-8139B'
PCI: Setting latency timer of device 0000:00:00.2 to 64
evbug.c: Connected device: "AT Translated Set 2 keyboard", isa0060/serio0/input0
evbug.c: Connected device: "PS/2 Mouse", isa0060/serio1/input1
evbug.c: Connected device: "AlpsPS/2 ALPS GlidePoint", isa0060/serio1/input0
evbug.c: Connected device: "PC Speaker", isa0061/input0
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
ts: Compaq touchscreen protocol output
8250_pci: Unknown symbol serial8250_unregister_port
8250_pci: Unknown symbol serial8250_resume_port
8250_pci: Unknown symbol serial8250_register_port
8250_pci: Unknown symbol serial8250_suspend_port
ieee1394: Host added: ID:BUS[0-00:1023]  GUID[0040d00100000b49]
cs: memory probe 0x0c0000-0x0fffff: excluding 0xc0000-0xcffff 0xe0000-0xfffff
cs: memory probe 0x60000000-0x60ffffff: clean.
cs: memory probe 0xa0000000-0xa0ffffff: clean.
pcmcia: registering new device pcmcia0.0
orinoco 0.15 (David Gibson <hermes@gibson.dropbear.id.au>, Pavel Roskin <proski@gnu.org>, et al)
orinoco_cs 0.15 (David Gibson <hermes@gibson.dropbear.id.au>, Pavel Roskin <proski@gnu.org>, et al)
pcmcia: request for exclusive IRQ could not be fulfilled.
pcmcia: the driver needs updating to supported shared IRQ lines.
eth1: Hardware identity 8008:0000:0001:0000
eth1: Station identity  001f:0004:0001:0003
eth1: Firmware determined as Intersil 1.3.4
eth1: Ad-hoc demo mode supported
eth1: IEEE standard IBSS ad-hoc mode supported
eth1: WEP supported, 104-bit key
eth1: MAC address 00:30:AB:0F:69:F6
eth1: Station name "Prism  I"
eth1: ready
eth1: orinoco_cs at 0.0, irq 10, io 0x0100-0x013f
Non-volatile memory driver v1.2
lp: driver loaded but no devices found
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
device-mapper: multipath: version 1.0.4 loaded
EXT3 FS on dm-1, internal journal
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 1048568k swap on /dev/zen/swap.  Priority:-1 extents:1 across:1048568k
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
IPv6 over IPv4 tunneling driver
ip6_tables: (C) 2000-2006 Netfilter Core Team
ip_tables: (C) 2000-2006 Netfilter Core Team
Netfilter messages via NETLINK v0.30.
ip_conntrack version 2.4 (1390 buckets, 11120 max) - 228 bytes per conntrack
process `sysctl' is using deprecated sysctl (syscall) net.ipv6.neigh.lo.retrans_time; Use net.ipv6.neigh.lo.retrans_time_ms instead.
eth0: link down
ADDRCONF(NETDEV_UP): eth0: link is not ready
ADDRCONF(NETDEV_UP): eth1: link is not ready
eth1: New link status: Connected (0001)
ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
audit(1196794998.576:2): audit_pid=3073 old=0 by auid=4294967295
tun: Universal TUN/TAP device driver, 1.6
tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
Bluetooth: Core ver 2.10
NET: Registered protocol family 31
Bluetooth: HCI device and connection manager initialized
Bluetooth: HCI socket layer initialized
Bluetooth: L2CAP ver 2.8
Bluetooth: L2CAP socket layer initialized
Bluetooth: RFCOMM socket layer initialized
Bluetooth: RFCOMM TTY layer initialized
Bluetooth: RFCOMM ver 1.8
eth1: no IPv6 routers present
Bluetooth: HIDP (Human Interface Emulation) ver 1.1
Bridge firewalling registered
openvpn0: no IPv6 routers present
virbr0: no IPv6 routers present
xen-vbd: registered block device major 202
blkfront: xvda: barriers enabled
 xvda:<0>------------[ cut here ]------------
kernel BUG at /home/kraxel/xen/xen31/linux-2.6.18-xen/mm/rmap.c:522!
invalid opcode: 0000 [#1]
SMP 
Modules linked in: xenblk ipt_MASQUERADE iptable_nat ip_nat bridge hidp rfcomm l2cap bluetooth tun sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 binfmt_misc dm_multipath parport_pc lp parport nvram orinoco_cs orinoco hermes joydev pcmcia firmware_class tsdev evbug evdev serial_core snd_intel8x0m snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq 8139too serio_raw 8139cp mii snd_intel8x0 snd_ac97_codec snd_ac97_bus ohci1394 ieee1394 snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i810_audio snd_timer ac97_codec snd snd_page_alloc soundcore yenta_socket rsrc_nonstatic pcmcia_core i2c_piix4 i2c_core pcspkr rtc dm_snapshot dm_zero dm_mirror dm_mod ide_disk ext3 jbd ehci_hcd ohci_hcd uhci_hcd usbcore piix
CPU:    0
EIP:    0061:[<c0169688>]    Tainted: GF     VLI
EFLAGS: 00010286   (2.6.18-xen #1) 
EIP is at page_remove_rmap+0x28/0x40
eax: ffffffff   ebx: c1080780   ecx: c1080780   edx: 00000000
esi: c4e65a14   edi: 00000020   ebp: c536ab80   esp: c407bea8
ds: 007b   es: 007b   ss: 0069
Process blkbackd (pid: 3973, ti=c407a000 task=c05eda30 task.ti=c407a000)
Stack: c0160b51 c536ab80 00000000 c05eda30 00000000 00000000 00000002 c01ea764 
       b7f70000 c4e65a14 c407bf68 07a3c067 00000000 003ff000 b7f70000 00000000 
       00000000 b7f71000 c4ccd010 c99e9740 c1080780 c1161c00 00000000 ffffffff 
Call Trace:
 [<c0160b51>] unmap_vmas+0x4a1/0x910
 [<c01ea764>] copy_from_user+0x34/0x80
 [<c016594b>] unmap_region+0x9b/0x120
 [<c016645c>] do_munmap+0x14c/0x1e0
 [<c0166522>] sys_munmap+0x32/0x50
 [<c010568f>] syscall_call+0x7/0xb
Code: 00 00 00 89 c1 90 83 40 08 ff 0f 98 c0 84 c0 75 02 f3 c3 8b 41 08 83 c0 01 78 10 8b 51 10 89 c8 83 f2 01 83 e2 01 e9 e8 42 ff ff <0f> 0b 0a 02 48 84 30 c0 eb e6 8d b4 26 00 00 00 00 8d bc 27 00 
EIP: [<c0169688>] page_remove_rmap+0x28/0x40 SS:ESP 0069:c407bea8
 <7>evbug.c: Event. Dev: isa0060/serio0/input0, Type: 4, Code: 4, Value: 42
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 1, Code: 42, Value: 1
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 0, Code: 0, Value: 0
XENBUS: Waiting for devices to initialise: 295s...<7>evbug.c: Event. Dev: isa0060/serio0/input0, Type: 4, Code: 4, Value: 201
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 1, Code: 104, Value: 1
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 0, Code: 0, Value: 0
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 4, Code: 4, Value: 201
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 1, Code: 104, Value: 0
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 0, Code: 0, Value: 0
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 4, Code: 4, Value: 201
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 1, Code: 104, Value: 1
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 0, Code: 0, Value: 0
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 4, Code: 4, Value: 201
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 1, Code: 104, Value: 0
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 0, Code: 0, Value: 0
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 4, Code: 4, Value: 42
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 1, Code: 42, Value: 0
evbug.c: Event. Dev: isa0060/serio0/input0, Type: 0, Code: 0, Value: 0
290s...285s...

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-04  9:40           ` Derek Murray
  2007-12-04 12:01             ` Gerd Hoffmann
@ 2007-12-04 20:59             ` Ian Main
  2007-12-05 11:54               ` Derek Murray
  1 sibling, 1 reply; 57+ messages in thread
From: Ian Main @ 2007-12-04 20:59 UTC (permalink / raw)
  To: xen-devel

On Tue, 04 Dec 2007 09:40:49 +0000
Derek Murray <Derek.Murray@cl.cam.ac.uk> wrote:

> Gerd Hoffmann wrote:
> >> On this point I completely agree with you! If anyone has any less
> >> radical suggestions, then I'd be delighted to refactor the gntdev code
> >> to use them. However, I'm not currently aware of any alternative that
> >> maintains robustness to process crashes.
> > 
> > Oh, for me it isn't robust at all, it crashes on the first munmap
> > syscall.  It is the Fedora 8 kernel.  See attachment.  Didn't try
> > xensource 2.6.18 yet.
> 
> My gut feeling is that something changed in mm between 2.6.18 and 
> 2.6.21, but that seems like a cop out so...
> 
> > Ideas what is wrong?
> 
> Since the bug appears to be in page_remove_rmap, that would tend to 
> imply that there is never a corresponding page_add_*_rmap 
> (page_add_file_rmap?). My knowledge of the Linux mm code is a bit shaky 
> here: should gntdev be doing this? Should we be using install_page (or a 
> modified version thereof) to set the PTE?
> 
> Also, does a simple program that opens gntdev, maps a grant, 
> accesses/writes to the page, and unmaps it (all using the xc_gnttab_* 
> functions) work?
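
The diagnosis quoted above (a page_remove_rmap with no matching
page_add_*_rmap) can be illustrated with a toy model of the per-page map
counter.  This is only a sketch under simplifying assumptions: the class
and function names mirror the kernel ones for readability, but none of
this is kernel or gntdev code, and MapcountUnderflow is a made-up
stand-in for BUG().

```python
class MapcountUnderflow(Exception):
    """Hypothetical stand-in for the kernel BUG() in this toy model."""


class Page:
    def __init__(self):
        # Modelled on page->_mapcount: the counter starts at -1,
        # meaning "no PTE maps this page".
        self._mapcount = -1

    def mapcount(self):
        return self._mapcount + 1


def page_add_rmap(page):
    """Account for one more PTE pointing at the page."""
    page._mapcount += 1


def page_remove_rmap(page):
    """Account for one PTE going away; fail if none was ever added."""
    page._mapcount -= 1
    if page.mapcount() < 0:
        raise MapcountUnderflow("page_mapcount went negative")


def map_then_unmap(account_on_map):
    """Install a PTE (with or without rmap accounting), then tear it down."""
    page = Page()
    if account_on_map:
        page_add_rmap(page)
    try:
        page_remove_rmap(page)
    except MapcountUnderflow:
        return "BUG"
    return "ok"
```

In this model map_then_unmap(True) finishes cleanly, while
map_then_unmap(False) trips the underflow check, which is the pattern
one would expect if the mmap path writes PTEs directly and the generic
munmap path later does the rmap accounting.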

I am part of a team working on a project with Intel that is using it a
fair bit in a number of places.

We actually have no such simple test right now that I'm aware of,
but we are certainly using it in larger applications and it does work.
The only problem we're seeing is that killed processes using it cause a
BUG to fire.  I haven't explored it more than that yet, and I can't say
for sure that gntdev is causing that either as it's a complex program
(although I'm not aware of anything else in there that might cause it).

> > Who uses the gntdev device right now?
> 
> Good question! I'm aware of it being used in a few research projects, 
> and it seems to work for them (though I think it is mostly used with the 
> linux-2.6.18-xen kernel). Anyone else?

We are using it with 2.6.18 xen kernel.

    Ian

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-04 12:39               ` Stephen C. Tweedie
  2007-12-04 19:58                 ` Gerd Hoffmann
@ 2007-12-04 21:08                 ` Ian Main
  2007-12-05 10:03                 ` Gerd Hoffmann
  2007-12-05 10:11                 ` Derek Murray
  3 siblings, 0 replies; 57+ messages in thread
From: Ian Main @ 2007-12-04 21:08 UTC (permalink / raw)
  To: xen-devel

On Tue, 04 Dec 2007 12:39:59 +0000
"Stephen C. Tweedie" <sct@redhat.com> wrote:


> So... the interface (a) cannot be used on the Linux VM without at least
> one invasive VM modification, due to the requirement of ptes being
> explicitly unmapped via hypercall; and (b) isn't used significantly in
> real life yet.
> 
> I can't help wondering if this is a hint that now is the time to find a
> better API, which doesn't have the requirement (a) that seems to be
> causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
> to have the same problems with their VM layers if they try to implement
> this API?  Upstream Linux pv_ops certainly will, and it would be good if
> we could avoid tying unprivileged guests to ABIs which cannot hope to be
> merged into pv_ops.

I posted earlier to say we are using the current interface, but if there
are fundamental issues with the API then I'd be in favor of changing it,
even if there is some work involved on our side.

	Ian

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-04  9:35               ` tgh
@ 2007-12-05  3:42                 ` Mark Williamson
  0 siblings, 0 replies; 57+ messages in thread
From: Mark Williamson @ 2007-12-05  3:42 UTC (permalink / raw)
  To: tgh
  Cc: xen-devel, 'Eduardo Habkost', 'Juan Quintela',
	'Stephen C. Tweedie', 'Jan Beulich',
	'Glauber de Oliveira Costa', 'Chris Wright',
	virtualization, dgm36, 'Gerd Hoffmann'

> I am not quite clear about the purpose of pv-ops.  What do we want to
> deal with by developing "pv-ops"?  Is it used for HVM, for PV, for KVM,
> or something else?  I have seen it on the list for a few months, and
> "pv-ops" is an active project, but I am not clear about what its aim
> is.  Could you give me an explanation?

PV-ops is an API within Linux which is used to support paravirtualisation.

paravirt-ops makes it possible to compile a Linux kernel which can boot on 
bare hardware, or on Xen, or using VMI (VMware's paravirtualised interface), 
lguest, or any other VMM that is supported.  The resulting kernel can then 
boot on any of those and make proper use of paravirtualisation.

For instance, with 2.6.23 from kernel.org you should be able to compile a 
kernel that will boot both on bare hardware and in a Xen domU in PV mode.  
Various tricks are used to ensure that it will run with good performance on 
both.

pv-ops mostly deals with the paravirtualisation of the CPU.  IO devices such
as block and network are handled using Xen-aware drivers rather similar to
those in the XenSource Linux kernels; they are not part of pv-ops.
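
As a rough illustration of the idea (one kernel image, with the backend
chosen at boot), here is a toy sketch of an operations table.  All names
in it (NATIVE_OPS, XEN_OPS, select_ops, switch_page_tables, and the
strings written to the log) are invented for this example and are not
the actual paravirt-ops structures.

```python
def native_write_cr3(log, value):
    # On bare hardware the privileged instruction is executed directly.
    log.append(f"mov {value:#x}, %cr3")


def xen_write_cr3(log, value):
    # Under a hypervisor the same request becomes a hypercall.
    log.append(f"hypercall: new baseptr {value:#x}")


# One table of operations per supported platform.
NATIVE_OPS = {"name": "bare hardware", "write_cr3": native_write_cr3}
XEN_OPS = {"name": "Xen", "write_cr3": xen_write_cr3}


def select_ops(running_on_xen):
    """Pick the operations table once, early in boot."""
    return XEN_OPS if running_on_xen else NATIVE_OPS


def switch_page_tables(ops, log, pgd):
    """Generic kernel code: calls through the table, unaware of the platform."""
    ops["write_cr3"](log, pgd)
```

The generic caller is the same in both cases; only the table selected at
boot differs.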

Cheers,
Mark


> Thanks in advance
>
> Mark Williamson wrote:
> >> Hi Mark,
> >>
> >>> Maybe a change to the gntdev userspace API to allow batching
> >>> of mapping requests?
> >>
> >> Something along the lines of the following?
> >
> > Just like that :-D
> >
> > When you said "multiple syscalls per mapping" I assumed you meant that
> > we'd lose the batching you get by doing a multicall.  If it's just a
> > couple of syscalls (plus, presumably a couple of hypercalls) per batch of
> > mappings, my gut says it's probably not going to hurt block performance. 
> > My guts have been wrong in (many!) ways before of course...
> >
> > I guess the overhead *could* be reduced even more by just having a magic
> > ioctl that did all the mmap-ing stuff in one operation, but that'd
> > probably be really gross if it wasn't necessary!  And I doubt it'd make
> > upstream very happy...
> >
> > We'll also be eliminating the overheads involved in having a blktap ring
> > for talking to userspace and having to move requests between that ring
> > and the real block ring, so there's some definite wins in overheads as
> > well.
> >
> > Cheers,
> > Mark



-- 
Dave: Just a question. What use is a unicycle with no seat?  And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-04 12:39               ` Stephen C. Tweedie
  2007-12-04 19:58                 ` Gerd Hoffmann
  2007-12-04 21:08                 ` Ian Main
@ 2007-12-05 10:03                 ` Gerd Hoffmann
  2007-12-05 12:51                   ` Gerd Hoffmann
  2007-12-05 10:11                 ` Derek Murray
  3 siblings, 1 reply; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-05 10:03 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Derek Murray, xen-devel, Eduardo Habkost, Juan Quintela,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

Stephen C. Tweedie wrote:
> I can't help wondering if this is a hint that now is the time to find a
> better API, which doesn't have the requirement (a) that seems to be
> causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
> to have the same problems with their VM layers if they try to implement
> this API?

Well, it isn't that easy, unfortunately.  We have to separate two things here:

  (a) the grant table hypercall API (linux kernel <-> xen).
  (b) the grant table device (userspace interface).

The hypercall API *is* heavily used, block and network drivers are using
it for example.  It works quite well as long as the drivers are living
in kernel space, thus the grants are also mapped in kernel space only.
It isn't very hard to control map and unmap then.

The problems start when the gntdev comes into play, which wants to allow
userspace applications to map grant references.  At this point the whole
VM subsystem becomes involved.  And the requirement of the hypercall API
to do any pte manipulation using grant table hypercalls becomes a real
burden.  The linux VM design simply doesn't allow that.

Consequently the current gntdev implementation tries to get the job done
by bypassing the VM (and hooking into it).  It establishes mappings by
doing the page table manipulations itself in the fops->mmap function.
It tears down mappings using the hook discussed earlier.

gntdev doesn't even try to handle forking.  I wouldn't be surprised if
that is a great way to kill Domain-0.  The xen hypervisor will most
likely not be amused to find a pte referring to a granted (but foreign)
page which wasn't established using the grant table interface.  Pinning
the pgd of the child process will most likely fail and make the kernel
BUG().

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-04 12:39               ` Stephen C. Tweedie
                                   ` (2 preceding siblings ...)
  2007-12-05 10:03                 ` Gerd Hoffmann
@ 2007-12-05 10:11                 ` Derek Murray
  3 siblings, 0 replies; 57+ messages in thread
From: Derek Murray @ 2007-12-05 10:11 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Jan Beulich,
	Glauber de Oliveira Costa, Chris Wright, virtualization,
	Gerd Hoffmann

Stephen C. Tweedie wrote:
> So... the interface (a) cannot be used on the Linux VM without at least
> one invasive VM modification, due to the requirement of ptes being
> explicitly unmapped via hypercall;

Also there is the use of VM_FOREIGN 
(http://xenbits.xensource.com/linux-2.6.18-xen.hg?file/b2768401db94/mm/memory.c 
lines 1040--1059), which has been used quite happily in blktap since 
2005 
(http://lists.xensource.com/archives/html/xen-changelog/2005-07/msg00053.html). 
While it may not be a priority to get gntdev into pv-ops Linux, I should 
imagine that blktap would be fairly critical.

> I can't help wondering if this is a hint that now is the time to find a
> better API, which doesn't have the requirement (a) that seems to be
> causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
> to have the same problems with their VM layers if they try to implement
> this API?  Upstream Linux pv_ops certainly will, and it would be good if
> we could avoid tying unprivileged guests to ABIs which cannot hope to be
> merged into pv_ops.

I'm open to suggestions... but I think it always reduces to needing a 
hook that is called on process exit before the PTEs are zapped.

> (Just what is the cost of not having this functionality in blktap,
> anyway?)

If tapdisk dies whilst holding a granted page, the page can never be 
ungranted, so we leak that page.
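
A toy model of that leak, and of what an exit-time hook would buy, might
look like the following.  It is only a sketch: GrantTable, Process and
leaked_after_crash are hypothetical names, and the model ignores
everything about real grant tables except "mapped" versus "unmapped".

```python
class GrantTable:
    """Toy stand-in for the hypervisor's record of mapped grants."""

    def __init__(self):
        self.mapped = set()

    def map_grant(self, ref):
        self.mapped.add(ref)

    def unmap_grant(self, ref):
        self.mapped.discard(ref)


class Process:
    def __init__(self, table, exit_hook):
        self.table = table
        self.exit_hook = exit_hook  # True if a hook runs before PTEs are zapped
        self.ptes = {}              # vaddr -> grant reference

    def map(self, vaddr, ref):
        self.table.map_grant(ref)
        self.ptes[vaddr] = ref

    def die(self):
        if self.exit_hook:
            # The hook still sees the PTEs, so it can unmap each grant.
            for ref in self.ptes.values():
                self.table.unmap_grant(ref)
        # Generic teardown: zaps the PTEs and knows nothing about grants.
        self.ptes.clear()


def leaked_after_crash(exit_hook):
    """Return the grant references still mapped after the process dies."""
    table = GrantTable()
    proc = Process(table, exit_hook)
    proc.map(0xB7F70000, 7)
    proc.map(0xB7F71000, 8)
    proc.die()
    return sorted(table.mapped)
```

Without the hook both references stay mapped after the crash; with it,
nothing is left behind.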

Regards,

Derek.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Xen-devel] Re: Next steps with pv_ops for Xen
  2007-12-04 19:58                 ` Gerd Hoffmann
@ 2007-12-05 11:48                   ` Derek Murray
  2007-12-05 11:48                   ` Derek Murray
  2007-12-05 13:19                   ` Derek Murray
  2 siblings, 0 replies; 57+ messages in thread
From: Derek Murray @ 2007-12-05 11:48 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Jan Beulich,
	Glauber de Oliveira Costa, Chris Wright, virtualization

Hi Gerd,

Gerd Hoffmann wrote:
> Want to reproduce?  Here we go:
> 
>   * grab xenner 0.8 from http://dl.bytesex.org/releases/xenner/
>   * grab a xenified dom0 kernel without blktap driver (either not
>     compiled or module not loaded).
>   * start xend
>   * start blkbackd from xenner package (you probably want the -d switch
>     for debug output, twice for more).
>   * run "xm block-attach 0 tap:aio:/path/to/some/file xvda r"
>   * watch it blow up ;)

Thanks for the repro details. I'll have a go at this later. One thing we 
haven't tested AFAIK is mapping grants in the same domain: could you 
check to see if the bug is the same if you attach a block device to a 
domain other than Dom0? Also, could you send any Xen console output, if 
it contains errors or warnings?

>> I can't help wondering if this is a hint that now is the time to find a
>> better API, which doesn't have the requirement (a) that seems to be
>> causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
>> to have the same problems with their VM layers if they try to implement
>> this API?  Upstream Linux pv_ops certainly will, and it would be good if
>> we could avoid tying unprivileged guests to ABIs which cannot hope to be
>> merged into pv_ops.
> 
> And I fear the problems I've trapped into up to now is only the tip of
> the iceberg.  What happens if an application with active grant table
> mappings calls fork() ?

Ultimately, fork calls dup_mm, which calls dup_mmap, which calls 
copy_{page,pud,pmd,pte}_range, which calls copy_one_pte, which calls 
set_pte_at, which hypercalls HYPERVISOR_update_va_mapping.

The hypercall will not succeed and will return an error code indicating 
the reason for this. Therefore the PTE will not be set. There appears to 
be no way to propagate this error through the Linux VM code, because 
there is no concept of a PTE update failing. I could add return codes to 
all those functions, but I don't fancy their chances upstream....

A possibility for solving that might be to carry out the mappings upon a 
page fault: I believe this would be compatible with copy_page_range.

(In fact, it's possible that a forked process would attempt to 
demand-page in the granted page, bypassing the copy_page_range code. 
Since there is no nopage handler for a gntdev VMA, that would lead to an 
anonymous page being mapped into memory instead.)

So, as far as I can tell, there would be no kernel BUG() or 
domain_crash() in the event of a fork(). It looks like implementing 
nopage in gntdev would enable grants to be remapped after a fork() and 
the correct behaviour to happen.
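
The reasoning above can be played through in a small toy model.  It is
a sketch under stated assumptions, not the real code path: the dicts
stand in for page tables, update_va_mapping simply refuses granted
frames, and the have_nopage branch represents the hypothetical gntdev
nopage handler proposed here.

```python
GRANT = "grant"
NORMAL = "normal"
ANON = "anonymous"


def update_va_mapping(kind):
    """Toy hypercall: refuses a plain PTE write for a granted foreign frame."""
    return -1 if kind == GRANT else 0


def set_pte_at(page_table, vaddr, pte):
    """Returns nothing, so a refused update is silently dropped."""
    kind, _frame = pte
    if update_va_mapping(kind) == 0:
        page_table[vaddr] = pte


def fork_copy(parent):
    """Copy every parent PTE into a fresh child page table."""
    child = {}
    for vaddr, pte in parent.items():
        set_pte_at(child, vaddr, pte)
    return child


def touch(page_table, vaddr, gntdev_vma, have_nopage):
    """Access vaddr; on a missing PTE, fault the page in."""
    if vaddr not in page_table:
        if have_nopage and vaddr in gntdev_vma:
            # Hypothetical handler: redo the mapping via the grant-table path.
            page_table[vaddr] = (GRANT, gntdev_vma[vaddr])
        else:
            # No handler: the fault is satisfied with an anonymous page.
            page_table[vaddr] = (ANON, None)
    return page_table[vaddr]
```

In this model the child comes out of fork_copy without the granted
entry; touching that address then yields an anonymous page when there is
no handler, and the original grant when there is one.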

Regards,

Derek.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-04 19:58                 ` Gerd Hoffmann
  2007-12-05 11:48                   ` [Xen-devel] " Derek Murray
@ 2007-12-05 11:48                   ` Derek Murray
  2007-12-05 14:12                     ` Gerd Hoffmann
  2007-12-05 18:12                     ` Jeremy Fitzhardinge
  2007-12-05 13:19                   ` Derek Murray
  2 siblings, 2 replies; 57+ messages in thread
From: Derek Murray @ 2007-12-05 11:48 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

Hi Gerd,

Gerd Hoffmann wrote:
> Want to reproduce?  Here we go:
> 
>   * grab xenner 0.8 from http://dl.bytesex.org/releases/xenner/
>   * grab a xenified dom0 kernel without blktap driver (either not
>     compiled or module not loaded).
>   * start xend
>   * start blkbackd from xenner package (you probably want the -d switch
>     for debug output, twice for more).
>   * run "xm block-attach 0 tap:aio:/path/to/some/file xvda r"
>   * watch it blow up ;)

Thanks for the repro details. I'll have a go at this later. One thing we 
haven't tested AFAIK is mapping grants in the same domain: could you 
check to see if the bug is the same if you attach a block device to a 
domain other than Dom0? Also, could you send any Xen console output, if 
it contains errors or warnings?

>> I can't help wondering if this is a hint that now is the time to find a
>> better API, which doesn't have the requirement (a) that seems to be
>> causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
>> to have the same problems with their VM layers if they try to implement
>> this API?  Upstream Linux pv_ops certainly will, and it would be good if
>> we could avoid tying unprivileged guests to ABIs which cannot hope to be
>> merged into pv_ops.
> 
> And I fear the problems I've run into so far are only the tip of
> the iceberg.  What happens if an application with active grant table
> mappings calls fork() ?

Ultimately, fork calls dup_mm, which calls dup_mmap, which calls 
copy_{page,pud,pmd,pte}_range, which calls copy_one_pte, which calls 
set_pte_at, which hypercalls HYPERVISOR_update_va_mapping.

The hypercall will not succeed and will return an error code indicating 
the reason for this. Therefore the PTE will not be set. There appears to 
be no way to propagate this error through the Linux VM code, because 
there is no concept of a PTE update failing. I could add return codes to 
all those functions, but I don't fancy their chances upstream....
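
To illustrate the problem, here is a minimal userspace sketch of how the error gets lost. It is a model only: every model_* name is hypothetical, the bit used to mark a grant-mapped PTE is arbitrary, and -EINVAL is just an assumed error code (the real hypercall may return something else). The point is only that a void set_pte_at-style function gives copy_one_pte no way to see the failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t model_pte_t;

/* Hypothetical marker for a PTE that maps a granted page. */
#define MODEL_PTE_GRANT (1ULL << 9)

/* Only here so the demo can observe the code that the VM layer never sees. */
static int model_last_hypercall_rc;

/* Stand-in for HYPERVISOR_update_va_mapping: refuses to install a
 * grant-mapped PTE that was not set up through the grant table interface. */
static int model_update_va_mapping(model_pte_t *slot, model_pte_t val)
{
	if (val & MODEL_PTE_GRANT)
		return -EINVAL;	/* assumed error code */
	*slot = val;
	return 0;
}

/* Like set_pte_at(), this returns void, so the result is dropped here. */
static void model_set_pte_at(model_pte_t *slot, model_pte_t val)
{
	model_last_hypercall_rc = model_update_va_mapping(slot, val);
}

/* Stand-in for copy_one_pte(): copies the parent's PTE into the child. */
static void model_copy_one_pte(model_pte_t *dst, const model_pte_t *src)
{
	assert(dst != NULL && src != NULL);
	model_set_pte_at(dst, *src);
}
```

Copying a PTE with MODEL_PTE_GRANT set leaves the child's slot untouched, and the caller carries on as if nothing had happened.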

A possibility for solving that might be to carry out the mappings upon a 
page fault: I believe this would be compatible with copy_page_range.

(In fact, it's possible that a forked process would attempt to 
demand-page in the granted page, bypassing the copy_page_range code. 
Since there is no nopage handler for a gntdev VMA, that would lead to an 
anonymous page being mapped into memory instead.)

So, as far as I can tell, there would be no kernel BUG() or 
domain_crash() in the event of a fork(). It looks like implementing 
nopage in gntdev would enable grants to be remapped after a fork() and 
the correct behaviour to happen.

Regards,

Derek.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-04 20:59             ` Ian Main
@ 2007-12-05 11:54               ` Derek Murray
  0 siblings, 0 replies; 57+ messages in thread
From: Derek Murray @ 2007-12-05 11:54 UTC (permalink / raw)
  To: Ian Main; +Cc: xen-devel

Hi Ian,

Ian Main wrote:
> We actually have no such simple test right now that I'm aware of,
> but we are certainly using it in larger applications and it does work.
> The only problem we're seeing is that killed processes using it cause a
> BUG to fire.  I haven't explored it more than that yet, and I can't say
> for sure that gntdev is causing that either as it's a complex program
> (although I'm not aware of anything else in there that might cause it).

Does killing your process using gntdev *always* cause a BUG()? If you 
send me the kernel log and Xen console log, then I can have a look at 
what the problem might be.

Regards,

Derek.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 10:03                 ` Gerd Hoffmann
@ 2007-12-05 12:51                   ` Gerd Hoffmann
  0 siblings, 0 replies; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-05 12:51 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Derek Murray, xen-devel, Eduardo Habkost, Juan Quintela,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

  Hi,

> gntdev doesn't even try to handle forking.  I wouldn't be surprised if
> that is a great way to kill Domain-0.  The xen hypervisor will most
> likely not be amused to find a pte referring to a granted (but foreign)
> page which wasn't established using the grant table interface.  Pinning
> the pgd of the child process will most likely fail and make the kernel
> BUG().

OK, it isn't that bad, thanks to VM_DONTCOPY.  The child just doesn't
get the grant mapping.
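
A rough userspace sketch of why that helps, assuming (as far as I recall) that dup_mmap simply skips any VMA carrying VM_DONTCOPY.  The model_* names and the flag value are made up for illustration; this is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag value, not the kernel's VM_DONTCOPY constant. */
#define MODEL_VM_DONTCOPY 0x1UL

struct model_vma {
	unsigned long start, end, flags;
};

/* Copy the parent's VMAs into the child's array, skipping any area
 * marked DONTCOPY.  Returns the number of VMAs the child ends up with. */
static size_t model_dup_mmap(const struct model_vma *parent, size_t n,
			     struct model_vma *child)
{
	size_t copied = 0;

	assert(parent != NULL && child != NULL);
	for (size_t i = 0; i < n; i++) {
		if (parent[i].flags & MODEL_VM_DONTCOPY)
			continue;	/* no page tables are copied for it */
		child[copied++] = parent[i];
	}
	return copied;
}
```

Because the grant-mapped area never reaches the page-table copy, the child has no PTE for the hypervisor to object to.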

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-04 19:58                 ` Gerd Hoffmann
  2007-12-05 11:48                   ` [Xen-devel] " Derek Murray
  2007-12-05 11:48                   ` Derek Murray
@ 2007-12-05 13:19                   ` Derek Murray
  2 siblings, 0 replies; 57+ messages in thread
From: Derek Murray @ 2007-12-05 13:19 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 488 bytes --]

Gerd,

Can you try the attached patch against linux-2.6.18-xen.hg?

I think the problem was that the gntdev VMA is not marked as being 
VM_PFNMAP, so zap_pte_range tries to look up a struct page for each 
granted page when it is unmapped (and perhaps sometimes succeeds, 
incorrectly, which could be why I haven't seen the bug). With this 
flag, vm_normal_page will return NULL in zap_pte_range, and so the code 
that decrements the reference count will not be executed.

Regards,

Derek.

[-- Attachment #2: gntdev_vm_pfnmap.patch --]
[-- Type: text/x-patch, Size: 1073 bytes --]

# HG changeset patch
# User dgm36@ise.cl.cam.ac.uk
# Date 1196860382 0
# Node ID af26b3dd23822190acbec1872a47259e1fed88b8
# Parent  b2768401db943e66af9d64bd610ffa225f560c0b
Set gntdev VMA to be VM_PFNMAP.

diff -r b2768401db94 -r af26b3dd2382 drivers/xen/gntdev/gntdev.c
--- a/drivers/xen/gntdev/gntdev.c	Mon Dec 03 08:50:12 2007 +0000
+++ b/drivers/xen/gntdev/gntdev.c	Wed Dec 05 13:13:02 2007 +0000
@@ -501,6 +501,17 @@ static int gntdev_mmap (struct file *fli
     
 	/* The VM area contains pages from another VM. */
 	vma->vm_flags |= VM_FOREIGN;
+
+	/* The VM area contains pages that are not backed by page_structs in
+	 * this domain's memory map.
+	 *
+	 * TODO/FIXME?: We should probably use the VM_FOREIGN workaround as
+	 *              used by get_user_pages() to provide access to the
+	 *              page_structs for each page, but I'm not sure if that's
+	 *              necessary.
+	 */
+	vma->vm_flags |= VM_PFNMAP;
+
 	vma->vm_private_data = kzalloc(size * sizeof(struct page_struct *), 
 				       GFP_KERNEL);
 	if (vma->vm_private_data == NULL) {

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 11:48                   ` Derek Murray
@ 2007-12-05 14:12                     ` Gerd Hoffmann
  2007-12-05 14:22                       ` Keir Fraser
  2007-12-05 18:12                     ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-05 14:12 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

  Hi,

> Thanks for the repro details. I'll have a go at this later. One thing we
> haven't tested AFAIK is mapping grants in the same domain: could you
> check to see if the bug is the same if you attach a block device to a
> domain other than Dom0? Also, could you send any Xen console output, if
> it contains errors or warnings?

Attaching to another domain works better.  blkbackd needs some fixes as
well though ...

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 14:12                     ` Gerd Hoffmann
@ 2007-12-05 14:22                       ` Keir Fraser
  2007-12-05 14:30                         ` Derek Murray
  0 siblings, 1 reply; 57+ messages in thread
From: Keir Fraser @ 2007-12-05 14:22 UTC (permalink / raw)
  To: Gerd Hoffmann, Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization

On 5/12/07 14:12, "Gerd Hoffmann" <kraxel@redhat.com> wrote:

>> Thanks for the repro details. I'll have a go at this later. One thing we
>> haven't tested AFAIK is mapping grants in the same domain: could you
>> check to see if the bug is the same if you attach a block device to a
>> domain other than Dom0? Also, could you send any Xen console output, if
>> it contains errors or warnings?
> 
> Attaching to another domain works better.  blkbackd needs some fixes as
> well though ...

Is this patch to go into linux-2.6.18-xen.hg then?

It needs a signed-off-by line if so.

 -- Keir

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 14:22                       ` Keir Fraser
@ 2007-12-05 14:30                         ` Derek Murray
  2007-12-05 16:58                           ` Keir Fraser
  0 siblings, 1 reply; 57+ messages in thread
From: Derek Murray @ 2007-12-05 14:30 UTC (permalink / raw)
  To: Keir Fraser
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann

[-- Attachment #1: Type: text/plain, Size: 253 bytes --]

Keir Fraser wrote:
> Is this patch to go into linux-2.6.18-xen.hg then?

Yes, even if it doesn't fix the exact bug we're seeing here, I think it 
should go in. I've attached a version with my signed-off-by and a better 
commit comment.

Cheers,

Derek.

[-- Attachment #2: gntdev_vm_pfnmap.patch --]
[-- Type: text/x-patch, Size: 1328 bytes --]

# HG changeset patch
# User dgm36@ise.cl.cam.ac.uk
# Date 1196860382 0
# Node ID af26b3dd23822190acbec1872a47259e1fed88b8
# Parent  b2768401db943e66af9d64bd610ffa225f560c0b
Add VM_PFNMAP flag to gntdev-mmaped VM areas. This prevents an attempt in
zap_pte_range to decrement the reverse-mapping count of the non-existent
(but occasionally spuriously present) page_struct associated with the
granted PFN.

Signed-off-by: Derek Murray <Derek.Murray@cl.cam.ac.uk>

diff -r b2768401db94 -r af26b3dd2382 drivers/xen/gntdev/gntdev.c
--- a/drivers/xen/gntdev/gntdev.c	Mon Dec 03 08:50:12 2007 +0000
+++ b/drivers/xen/gntdev/gntdev.c	Wed Dec 05 13:13:02 2007 +0000
@@ -501,6 +501,17 @@ static int gntdev_mmap (struct file *fli
     
 	/* The VM area contains pages from another VM. */
 	vma->vm_flags |= VM_FOREIGN;
+
+	/* The VM area contains pages that are not backed by page_structs in
+	 * this domain's memory map.
+	 *
+	 * TODO/FIXME?: We should probably use the VM_FOREIGN workaround as
+	 *              used by get_user_pages() to provide access to the
+	 *              page_structs for each page, but I'm not sure if that's
+	 *              necessary.
+	 */
+	vma->vm_flags |= VM_PFNMAP;
+
 	vma->vm_private_data = kzalloc(size * sizeof(struct page_struct *), 
 				       GFP_KERNEL);
 	if (vma->vm_private_data == NULL) {

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 14:30                         ` Derek Murray
@ 2007-12-05 16:58                           ` Keir Fraser
  2007-12-05 17:17                             ` Derek Murray
  0 siblings, 1 reply; 57+ messages in thread
From: Keir Fraser @ 2007-12-05 16:58 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann

On 5/12/07 14:30, "Derek Murray" <Derek.Murray@cl.cam.ac.uk> wrote:

> Keir Fraser wrote:
>> Is this patch to go into linux-2.6.18-xen.hg then?
> 
> Yes, even if it doesn't fix the exact bug we're seeing here, I think it
> should go in. I've attached a version with my signed-off-by and a better
> commit comment.

Actually I'm not so sure now. Presumably you add VM_PFNMAP to make
vm_normal_page() return NULL? But actually I would expect pte_pfn() to
return max_mapnr because the mapped page is not a local page. And that
should cause vm_normal_page() to return NULL always, regardless of whether
you assert VM_PFNMAP. Is gntdev being used to grant-and-map local pages in
the test that causes the crash?

 -- Keir

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 16:58                           ` Keir Fraser
@ 2007-12-05 17:17                             ` Derek Murray
  2007-12-05 17:22                               ` Keir Fraser
  0 siblings, 1 reply; 57+ messages in thread
From: Derek Murray @ 2007-12-05 17:17 UTC (permalink / raw)
  To: Keir Fraser
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann

Keir Fraser wrote:
> 
> Actually I'm not so sure now. Presumably you add VM_PFNMAP to make
> vm_normal_page() return NULL? But actually I would expect pte_pfn() to
> return max_mapnr because the mapped page is not a local page. And that
> should cause vm_normal_page() to return NULL always, regardless of whether
> you assert VM_PFNMAP. Is gntdev being used to grant-and-map local pages in
> the test that causes the crash?

That's right (gntdev is being used to map (but not grant) a local page). 
The test case creates a virtual block device in Dom0, and attempts to 
map its ring buffer in a user-space daemon in Dom0. Therefore pte_pfn 
succeeds.
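
To make the two cases concrete, here is a simplified userspace model of the decision vm_normal_page makes. It is a sketch from memory of the 2.6.18-era logic, not the real function: the model_* names, flag values and the tiny mem_map are all invented for the example, and the model takes the pfn directly instead of deriving it from a pte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT  12
#define MODEL_VM_SHARED   0x1UL	/* illustrative flag values */
#define MODEL_VM_MAYWRITE 0x2UL
#define MODEL_VM_PFNMAP   0x4UL
#define MODEL_MAX_MAPNR   1024UL	/* local page frames in the model */

struct model_page { int mapcount; };
struct model_vma  { unsigned long vm_start, vm_pgoff, vm_flags; };

static struct model_page model_mem_map[MODEL_MAX_MAPNR];

/* A private mapping that may be written is treated as copy-on-write. */
static bool model_is_cow_mapping(unsigned long flags)
{
	return (flags & (MODEL_VM_SHARED | MODEL_VM_MAYWRITE)) ==
	       MODEL_VM_MAYWRITE;
}

/* Returns the struct page to be reference counted, or NULL if none. */
static struct model_page *model_vm_normal_page(const struct model_vma *vma,
					       unsigned long addr,
					       unsigned long pfn)
{
	assert(vma != NULL);
	if (vma->vm_flags & MODEL_VM_PFNMAP) {
		unsigned long off = (addr - vma->vm_start) >> MODEL_PAGE_SHIFT;

		if (pfn == vma->vm_pgoff + off)
			return NULL;
		if (!model_is_cow_mapping(vma->vm_flags))
			return NULL;
	}
	if (pfn >= MODEL_MAX_MAPNR)
		return NULL;	/* foreign frame: no local struct page */
	return &model_mem_map[pfn];
}
```

A foreign frame (pfn at or beyond MODEL_MAX_MAPNR) yields NULL with or without the flag, which is Keir's expectation. A local frame in a shared mapping yields a struct page unless VM_PFNMAP is set, which is the case this test hits.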

Regards,

Derek.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 17:17                             ` Derek Murray
@ 2007-12-05 17:22                               ` Keir Fraser
  2007-12-05 17:48                                 ` Derek Murray
  0 siblings, 1 reply; 57+ messages in thread
From: Keir Fraser @ 2007-12-05 17:22 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann

On 5/12/07 17:17, "Derek Murray" <Derek.Murray@cl.cam.ac.uk> wrote:

>> Actually I'm not so sure now. Presumably you add VM_PFNMAP to make
>> vm_normal_page() return NULL? But actually I would expect pte_pfn() to
>> return max_mapnr because the mapped page is not a local page. And that
>> should cause vm_normal_page() to return NULL always, regardless of whether
>> you assert VM_PFNMAP. Is gntdev being used to grant-and-map local pages in
>> the test that causes the crash?
> 
> That's right (gntdev is being used to map (but not grant) a local page).
> The test case creates a virtual block device in Dom0, and attempts to
> map its ring buffer in a user-space daemon in Dom0. Therefore pte_pfn
> succeeds.

Need to bite the bullet and fix this properly by setting a software flag in
ptes that are not subject to reference counting.

Unfortunately that also needs a hypervisor interface change, to allow
setting of those pte flags. Easily done though, and we should definitely get
that piece in for 3.2.0.

Setting VM_PFNMAP is bogus. We used to do that for privcmd mappings too, but
we stopped because IIRC it had other unwanted side effects.

 -- Keir

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 17:22                               ` Keir Fraser
@ 2007-12-05 17:48                                 ` Derek Murray
  2007-12-05 17:59                                   ` Keir Fraser
  0 siblings, 1 reply; 57+ messages in thread
From: Derek Murray @ 2007-12-05 17:48 UTC (permalink / raw)
  To: Keir Fraser
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann

Keir Fraser wrote:
> Need to bite the bullet and fix this properly by setting a software flag in
> ptes that are not subject to reference counting.

Could we get away with testing the VM_FOREIGN flag in vm_normal_page()? 
Although I get the impression that this wouldn't be easily justified if 
trying to merge with upstream Linux....

> Unfortunately that also needs a hypervisor interface change, to allow
> setting of those pte flags. Easily done though, and we should definitely get
> that piece in for 3.2.0.

Alternatively, could we use the _PAGE_GNTTAB PTE flag that is used for 
debugging? Indeed, if we did this, could we obviate the need for the 
PTE-zapping hook, by instead catching the case where this flag is set, 
and unmapping the grant implicitly?

Otherwise, what would the semantics of this new flag be?

Regards,

Derek.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 17:48                                 ` Derek Murray
@ 2007-12-05 17:59                                   ` Keir Fraser
  2007-12-05 18:15                                     ` Derek Murray
  2007-12-05 20:06                                     ` Gerd Hoffmann
  0 siblings, 2 replies; 57+ messages in thread
From: Keir Fraser @ 2007-12-05 17:59 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann

On 5/12/07 17:48, "Derek Murray" <Derek.Murray@cl.cam.ac.uk> wrote:

> Keir Fraser wrote:
>> Need to bite the bullet and fix this properly by setting a software flag in
>> ptes that are not subject to reference counting.
> 
> Could we get away with testing the VM_FOREIGN flag in vm_normal_page()?
> Although I get the impression that this wouldn't be easily justified if
> trying to merge with upstream Linux....

Yes, this would work okay I suspect. Good enough as a stop-gap measure? Are
there any other responsibilities that you acquire if you make use of
VM_FOREIGN (in particular, how would this affect get_user_pages)?

> Alternatively, could we use the _PAGE_GNTTAB PTE flag that is used for
> debugging? Indeed, if we did this, could we obviate the need for the
> PTE-zapping hook, by instead catching the case where this flag is set,
> and unmapping the grant implicitly?

Well, in the general case you don't have enough info to know which grant to
release (a single page can be granted multiple times).

> Otherwise, what would the semantics of this new flag be?

It would cause pte_pfn() to return max_mapnr. It would be set for any
foreign page mapping, and replace mfn_to_local_pfn() in pte_pfn().
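
Roughly, as a userspace sketch (the bit position and all model_* names are placeholders, and the machine-to-pseudophysical translation that the real pte_pfn() performs for local pages is left out):

```c
#include <assert.h>	/* for callers that want to check the model */
#include <stdint.h>

#define MODEL_PAGE_SHIFT   12
#define MODEL_MAX_MAPNR    1024UL	/* local page frames in the model */

/* Placeholder software bit; bits 9-11 of an x86 PTE are available to
 * the OS, but which one would be used is not decided here. */
#define MODEL_PAGE_FOREIGN (1ULL << 9)

/* Foreign mappings report an out-of-range pfn, so nothing downstream
 * tries to find or reference count a local struct page for them. */
static unsigned long model_pte_pfn(uint64_t pte)
{
	if (pte & MODEL_PAGE_FOREIGN)
		return MODEL_MAX_MAPNR;
	return (unsigned long)(pte >> MODEL_PAGE_SHIFT);
}
```

With that, a flagged pte always looks like an invalid local frame, whatever frame number it actually holds.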

 -- Keir

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 11:48                   ` Derek Murray
  2007-12-05 14:12                     ` Gerd Hoffmann
@ 2007-12-05 18:12                     ` Jeremy Fitzhardinge
  2007-12-05 18:29                       ` Derek Murray
  1 sibling, 1 reply; 57+ messages in thread
From: Jeremy Fitzhardinge @ 2007-12-05 18:12 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Jan Beulich,
	Glauber de Oliveira Costa, Chris Wright, virtualization,
	Gerd Hoffmann

Derek Murray wrote:
> Ultimately, fork calls dup_mm, which calls dup_mmap, which calls
> copy_{page,pud,pmd,pte}_range, which calls copy_one_pte, which calls
> set_pte_at, which hypercalls HYPERVISOR_update_va_mapping.
>
> The hypercall will not succeed and will return an error code
> indicating the reason for this. Therefore the PTE will not be set.
> There appears to be no way to propagate this error through the Linux
> VM code, because there is no concept of a PTE update failing. I could
> add return codes to all those functions, but I don't fancy their
> chances upstream....

Could we use one of the software-defined bits in the PTE to indicate
that this is a foreign/granted PTE, and have set_pte_at behave
differently if you pass it a pte with this bit set?

    J

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 17:59                                   ` Keir Fraser
@ 2007-12-05 18:15                                     ` Derek Murray
  2007-12-12  8:27                                       ` Isaku Yamahata
  2007-12-05 20:06                                     ` Gerd Hoffmann
  1 sibling, 1 reply; 57+ messages in thread
From: Derek Murray @ 2007-12-05 18:15 UTC (permalink / raw)
  To: Keir Fraser
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann

[-- Attachment #1: Type: text/plain, Size: 526 bytes --]

Keir Fraser wrote:
> Yes, this would work okay I suspect. Good enough as a stop-gap measure? Are
> there any other responsibilities that you acquire if you make use of
> VM_FOREIGN (in particular, how would this affect get_user_pages)?

VM_FOREIGN is already set for the gntdev VMA (mostly because it's 
directly based on the blktap code). That means that it has the array of 
page_structs in its vm_private_data, which can be used to fulfill a 
get_user_pages call. I've attached a patch based on this fix.

Regards,

Derek.

[-- Attachment #2: gntdev_vm_foreign.patch --]
[-- Type: text/x-patch, Size: 714 bytes --]

# HG changeset patch
# User dgm36@ise.cl.cam.ac.uk
# Date 1196878124 0
# Node ID df7d0555ec3847bd5915063d8ee79123d6ebc67a
# Parent  ba918cb2cf7520604dee724dd80dad5ce4bee8a1
Changed vm_normal_page to return NULL when presented with a VMA marked
as being VM_FOREIGN.

Signed-off-by: Derek Murray <Derek.Murray@cl.cam.ac.uk>

diff -r ba918cb2cf75 -r df7d0555ec38 mm/memory.c
--- a/mm/memory.c	Tue Dec 04 11:54:22 2007 +0000
+++ b/mm/memory.c	Wed Dec 05 18:08:44 2007 +0000
@@ -395,6 +395,9 @@ struct page *vm_normal_page(struct vm_ar
 		if (!is_cow_mapping(vma->vm_flags))
 			return NULL;
 	}
+
+	if (unlikely(vma->vm_flags & VM_FOREIGN))
+		return NULL;
 
 	/*
 	 * Add some anal sanity checks for now. Eventually,

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 18:12                     ` Jeremy Fitzhardinge
@ 2007-12-05 18:29                       ` Derek Murray
  2007-12-05 20:15                         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 57+ messages in thread
From: Derek Murray @ 2007-12-05 18:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Jan Beulich,
	Glauber de Oliveira Costa, Chris Wright, virtualization,
	Gerd Hoffmann

Jeremy Fitzhardinge wrote:
> Could we use one of the software-defined bits in the PTE to indicate
> that this is a foreign/granted PTE, and have set_pte_at behave
> differently if you pass it a pte with this bit set?

Actually, as Gerd pointed out in his answer to his own question, the use 
of VM_DONTCOPY cuts out this entire code path, so we don't need to worry 
about it.

Mind you, it looks like we're going to go ahead and use one of the PTE 
bits to signify foreign PTEs anyway, per Keir's suggestion. Either way, 
it's going to involve making Xen-specific changes to the mm code... have 
you any ideas how we can either (i) get rid of the zap_pte hook in the 
vm_operations_struct, or (ii) make a really compelling case to the 
kernel maintainers that it really should get in?

Regards,

Derek.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 17:59                                   ` Keir Fraser
  2007-12-05 18:15                                     ` Derek Murray
@ 2007-12-05 20:06                                     ` Gerd Hoffmann
  1 sibling, 0 replies; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-05 20:06 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Derek Murray, xen-devel, Eduardo Habkost, Juan Quintela,
	Stephen C. Tweedie, Jan Beulich, Glauber de Oliveira Costa,
	Chris Wright, virtualization

>> Alternatively, could we use the _PAGE_GNTTAB PTE flag that is used for
>> debugging? Indeed, if we did this, could we obviate the need for the
>> PTE-zapping hook, by instead catching the case where this flag is set,
>> and unmapping the grant implicitly?
> 
> Well, in the general case you don't have enough info to know which grant to
> release (a single page can be granted multiple times).

You'll also get the mm and the addr which should make it sufficiently
unique, so this looks like a doable approach to me.

ptep_get_and_clear_full() in include/asm-x86/pgtable_32.h needs to be
changed to take care of this, but that should be possible to do and the
change is local to x86 paravirt_ops, which looks much better to me than
touching generic mm code.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 18:29                       ` Derek Murray
@ 2007-12-05 20:15                         ` Jeremy Fitzhardinge
  2007-12-05 20:35                           ` Geoffrey Lefebvre
  2007-12-05 20:44                           ` Keir Fraser
  0 siblings, 2 replies; 57+ messages in thread
From: Jeremy Fitzhardinge @ 2007-12-05 20:15 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Jan Beulich,
	Glauber de Oliveira Costa, Chris Wright, virtualization,
	Gerd Hoffmann

Derek Murray wrote:
> Jeremy Fitzhardinge wrote:
>> Could we use one of the software-defined bits in the PTE to indicate
>> that this is a foreign/granted PTE, and have set_pte_at behave
>> differently if you pass it a pte with this bit set?
>
> Actually, as Gerd pointed out in his answer to his own question, the
> use of VM_DONTCOPY cuts out this entire code path, so we don't need to
> worry about it.
>
> Mind you, it looks like we're going to go ahead and use one of the PTE
> bits to signify foreign PTEs anyway, per Keir's suggestion. Either
> way, it's going to involve making Xen-specific changes to the mm code... 

Sneaking in a user for the otherwise completely unused PTE bits should
be fairly straightforward.

> have you any ideas how we can either (i) get rid of the zap_pte hook
> in the vm_operations_struct, or (ii) make a really compelling case to
> the kernel maintainers that it really should get in? 

Hm, I haven't spent much time looking at how grant tables and their
mappings work yet, so I can't say I really understand all this myself. 
Hence, questions:

Can we take a different approach from the zap_pte hook?  Given that
we're 1) planning on claiming a pte bit for grant mappings, and 2) need
to hook ptep_get_and_clear anyway to solve the mprotect performance
problems, couldn't we just special-case grant mapping pte_clears?

In 2.6.18-xen the only two implementations of zap_pte are
blktap_clear_pte and gntdev_clear_pte.  Given a ptep with the
grant-mapping bit set, could we determine which of these need calling
and do the appropriate thing?  Do we even need separate implementations
of the core pte-clearing functionality?  Could we just say something like:

	if (pte & _PAGE_XEN_FOREIGN)
		HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, ...);
	else
		xen_set_pte_at(...);


blktap_clear_pte and gntdev_clear_pte do other housekeeping, but do they
have to be done at the same instant as the grant mapping clear?  Could
they be done via some other hook?

(I see Gerd just proposed this, pretty much.)

    J

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 20:15                         ` Jeremy Fitzhardinge
@ 2007-12-05 20:35                           ` Geoffrey Lefebvre
  2007-12-06 10:15                             ` Gerd Hoffmann
  2007-12-05 20:44                           ` Keir Fraser
  1 sibling, 1 reply; 57+ messages in thread
From: Geoffrey Lefebvre @ 2007-12-05 20:35 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Derek Murray, xen-devel, Eduardo Habkost, Juan Quintela,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann

> Can we take a different approach from the zap_pte hook?  Given that
> we're 1) planning on claiming a pte bit for grant mappings, and 2) need
> to hook ptep_get_and_clear anyway to solve the mprotect performance
> problems, couldn't we just special-case grant mapping pte_clears?
>
> In 2.6.18-xen the only two implementations of zap_pte are
> blktap_clear_pte and gntdev_clear_pte.  Given a ptep with the
> grant-mapping bit set, could we determine which of these need calling
> and do the appropriate thing?  Do we even need separate implementations
> of the core pte-clearing functionality?  Could we just say something like:
>
>         if (pte & _PAGE_XEN_FOREIGN)
>                 HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, ...);
>         else
>                 xen_set_pte_at(...);
>

Hi,

In order to unmap a grant, you need the grant handle obtained when the
grant is mapped. That handle needs to be stored somewhere for the
lifetime of the mapping. Where would the handle be stored (as Gerd
proposed) in order to be able to unmap from ptep_get_and_clear_full?

I haven't looked at the paravirt ops in detail so I could be missing
something obvious here.

cheers,

geoffrey

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 20:15                         ` Jeremy Fitzhardinge
  2007-12-05 20:35                           ` Geoffrey Lefebvre
@ 2007-12-05 20:44                           ` Keir Fraser
  2007-12-06 10:00                             ` Derek Murray
  1 sibling, 1 reply; 57+ messages in thread
From: Keir Fraser @ 2007-12-05 20:44 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Jan Beulich,
	Glauber de Oliveira Costa, Chris Wright, virtualization,
	Gerd Hoffmann

On 5/12/07 20:15, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

> In 2.6.18-xen the only two implementations of zap_pte are
> blktap_clear_pte and gntdev_clear_pte.  Given a ptep with the
> grant-mapping bit set, could we determine which of these need calling
> and do the appropriate thing?  Do we even need separate implementations
> of the core pte-clearing functionality?  Could we just say something like:
> 
> if (pte & _PAGE_XEN_FOREIGN)
> HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, ...);
> else
> xen_set_pte_at(...);

You'd need to track pte->grant_handle mappings somewhere, but it could
certainly be done this way, yes.

 -- Keir

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 20:44                           ` Keir Fraser
@ 2007-12-06 10:00                             ` Derek Murray
  2007-12-06 19:55                               ` [Xen-devel] " Jeremy Fitzhardinge
  0 siblings, 1 reply; 57+ messages in thread
From: Derek Murray @ 2007-12-06 10:00 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Jeremy Fitzhardinge, xen-devel, Eduardo Habkost, Juan Quintela,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann

Keir Fraser wrote:
> You'd need to track pte->grant_handle mappings somewhere, but it could
> certainly be done this way, yes.

At the moment, blktap and gntdev provide struct pages to get_user_pages 
by smuggling them in the vm_private_data field of the relevant 
vm_area_struct. Could we use this field to get the handles to 
ptep_get_and_clear_full as well?

Only downside that I can see is that we would need to find the vma for 
each PTE that needs to be cleared this way (since we don't get this 
passed to ptep_get_and_clear_full), but this is mitigated by (i) it only 
happening in the erroneous, unclean-shutdown case, and (ii) getting a 
hit in the mm->mmap_cache for consecutive runs of mapped grants.

Regards,

Derek.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 20:35                           ` Geoffrey Lefebvre
@ 2007-12-06 10:15                             ` Gerd Hoffmann
  0 siblings, 0 replies; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-06 10:15 UTC (permalink / raw)
  To: Geoffrey Lefebvre
  Cc: Derek Murray, Jeremy Fitzhardinge, xen-devel, Eduardo Habkost,
	Juan Quintela, Jan Beulich, Glauber de Oliveira Costa,
	Chris Wright, virtualization

Geoffrey Lefebvre wrote:
> In order to unmap a grant, you need the grant handle obtained when the
> grant is mapped. That handle needs to be stored somewhere for the
> lifetime of the mapping. Where would the handle be stored (as Gerd
> proposed) in order to be able to unmap from ptep_get_and_clear_full?

Sure.  The kernel has to keep track of the grant mappings somewhere, so
it can look up the grant handle from the available information.  Hashing
by machine address should work reasonably fast.  It's probably useful to
have an in-kernel API for that which then can be used by both gntdev and
the in-kernel backend drivers.  This API can also abstract out
arch-specific bits to make life easier for the ia64 guys ...
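
To make the "hash by machine address" idea concrete, here is a minimal
userspace sketch of such a tracking table.  It is only an illustration:
the names (grant_track_add() and friends), the bucket count, the 4 KiB
page size and the one-handle-per-frame assumption are all made up for
the example and are not an existing kernel or Xen interface.  A real
in-kernel version would also need locking.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TRACK_PAGE_SHIFT 12          /* assumption: 4 KiB pages */
#define TRACK_BUCKETS    256u        /* must be a power of two */
#define TRACK_NO_HANDLE  UINT32_MAX  /* sentinel: nothing recorded */

/* One tracked grant mapping, keyed by machine frame number. */
struct grant_track {
	uint64_t mfn;
	uint32_t handle;
	struct grant_track *next;
};

static struct grant_track *track_table[TRACK_BUCKETS];

static unsigned int track_bucket(uint64_t mfn)
{
	return (unsigned int)(mfn & (TRACK_BUCKETS - 1));
}

/* Record the handle returned when the grant was mapped.  Returns 0 or -1. */
int grant_track_add(uint64_t maddr, uint32_t handle)
{
	uint64_t mfn = maddr >> TRACK_PAGE_SHIFT;
	struct grant_track *e;

	assert(handle != TRACK_NO_HANDLE);
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->mfn = mfn;
	e->handle = handle;
	e->next = track_table[track_bucket(mfn)];
	track_table[track_bucket(mfn)] = e;
	return 0;
}

/* Look up the handle for a machine address; low (flag) bits are ignored. */
uint32_t grant_track_lookup(uint64_t maddr)
{
	uint64_t mfn = maddr >> TRACK_PAGE_SHIFT;
	struct grant_track *e;

	for (e = track_table[track_bucket(mfn)]; e; e = e->next)
		if (e->mfn == mfn)
			return e->handle;
	return TRACK_NO_HANDLE;
}

/* Forget a mapping and return its handle, e.g. just before unmapping. */
uint32_t grant_track_remove(uint64_t maddr)
{
	uint64_t mfn = maddr >> TRACK_PAGE_SHIFT;
	struct grant_track **pp = &track_table[track_bucket(mfn)];

	for (; *pp; pp = &(*pp)->next) {
		if ((*pp)->mfn == mfn) {
			struct grant_track *e = *pp;
			uint32_t handle = e->handle;

			*pp = e->next;
			free(e);
			return handle;
		}
	}
	return TRACK_NO_HANDLE;
}
```

The pte-clearing path would call something like grant_track_remove()
with the machine address taken from the pte and hand the result to the
unmap hypercall; the arch-specific part is how that address is
obtained.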

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-03 18:36           ` D.G. Murray
  2007-12-03 19:08             ` Mark Williamson
@ 2007-12-06 15:21             ` Gerd Hoffmann
  2007-12-06 15:32               ` Derek Murray
  2007-12-21 12:58             ` Gerd Hoffmann
  2007-12-21 12:58             ` [Xen-devel] " Gerd Hoffmann
  3 siblings, 1 reply; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-06 15:21 UTC (permalink / raw)
  To: dgm36
  Cc: xen-devel, 'Eduardo Habkost', 'Juan Quintela',
	'Stephen C. Tweedie', 'Jan Beulich',
	'Glauber de Oliveira Costa', 'Chris Wright',
	virtualization, 'Mark Williamson'

D.G. Murray wrote:
> Hi Mark, 
> 
>> Maybe a change to the gntdev userspace API to allow batching 
>> of mapping requests?
> 
> Something along the lines of the following?
> 
> void *xc_gnttab_map_grant_refs(int xcg_handle,
>                                uint32_t count,
>                                uint32_t *domids,
>                                uint32_t *refs,
>                                int prot); 

Yes, except that it should actually work ;)

It doesn't for me (Fedora 8 again).  Grab xenner 0.9 (just uploaded),
edit blkbackd.c and flip the BATCH_MAPS from 0 to 1, compile, run, see
it not work.

With BATCH_MAPS set to 0, blkbackd works nicely as a blktap/tapdisk
drop-in replacement.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-06 15:21             ` Gerd Hoffmann
@ 2007-12-06 15:32               ` Derek Murray
  2007-12-06 15:55                 ` Gerd Hoffmann
  0 siblings, 1 reply; 57+ messages in thread
From: Derek Murray @ 2007-12-06 15:32 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: xen-devel, 'Eduardo Habkost', 'Juan Quintela',
	'Stephen C. Tweedie', 'Jan Beulich',
	'Glauber de Oliveira Costa', 'Chris Wright',
	virtualization, 'Mark Williamson'

Gerd Hoffmann wrote:
> Yes, except that it should actually work ;)
> 
> It doesn't for me (Fedora 8 again).  Grab xenner 0.9 (just uploaded),
> edit blkbackd.c and flip the BATCH_MAPS from 0 to 1, compile, run, see
> it not work.

Which version of the Xen tools are you using? There was a bug in the 
version released with Xen 3.1, which should have been cleaned up in the 
subsequent minor versions. Try grabbing the patch to libxc at:

http://xenbits.xensource.com/xen-3.1-testing.hg?raw-rev/135d5088909f

Otherwise, if this doesn't work/is some other issue, could you post the 
OOPS and relevant Xen console output?

Thanks,

Derek.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-06 15:32               ` Derek Murray
@ 2007-12-06 15:55                 ` Gerd Hoffmann
  0 siblings, 0 replies; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-06 15:55 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, 'Eduardo Habkost', 'Juan Quintela',
	'Stephen C. Tweedie', 'Jan Beulich',
	'Glauber de Oliveira Costa', 'Chris Wright',
	virtualization, 'Mark Williamson'


> Which version of the Xen tools are you using? There was a bug in the
> version released with Xen 3.1, which should have been cleaned up in the
> subsequent minor versions. Try grabbing the patch to libxc at:
> 
> http://xenbits.xensource.com/xen-3.1-testing.hg?raw-rev/135d5088909f

It is probably this one: according to rpm the version is 3.1.0, so most
likely the fix isn't there.

> Otherwise, if this doesn't work/is some other issue, could you post the
> OOPS and relevant Xen console output?

There isn't any; the mapping just doesn't work (libxc returns NULL).

thanks,
  Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Xen-devel] Re: Next steps with pv_ops for Xen
  2007-12-06 10:00                             ` Derek Murray
@ 2007-12-06 19:55                               ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 57+ messages in thread
From: Jeremy Fitzhardinge @ 2007-12-06 19:55 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Jan Beulich,
	Glauber de Oliveira Costa, Chris Wright, virtualization,
	Keir Fraser

Derek Murray wrote:
> Keir Fraser wrote:
>> You'd need to track pte->grant_handle mappings somewhere, but it could
>> certainly be done this way, yes.
>
> At the moment, blktap and gntdev provide struct pages to
> get_user_pages by smuggling them in the vm_private_data field of the
> relevant vm_area_struct. Could we use this field to get the handles to
> ptep_get_and_clear_full as well?

Yes.  Given the mm and a vaddr passed to ptep_get_and_clear, find_vma()
will return the vm_area_struct.  If we assert that anyone who sets the "I'm
foreign" bit in a pte has a standard format for the vm_private_data
field, then we can stash a callback pointer there and make the
appropriate callback.
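
A rough userspace model of that dispatch, to show the shape of the
idea.  Everything here is a stand-in invented for illustration:
model_vma, model_mm and model_find_vma() only imitate the kernel's
vm_area_struct, mm_struct and find_vma(); foreign_private is a
hypothetical "standard format" header; and MODEL_PAGE_FOREIGN is a
placeholder bit, not the real pte flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_FOREIGN (1ULL << 9)   /* placeholder "I'm foreign" bit */

struct model_vma;

/* Hypothetical standard header: must be the first thing in the private
 * data of any vma whose ptes carry the foreign bit. */
struct foreign_private {
	int (*clear_pte)(struct model_vma *vma, uintptr_t vaddr, uint64_t pte);
};

struct model_vma {
	uintptr_t vm_start, vm_end;      /* [vm_start, vm_end) */
	void *vm_private_data;
	struct model_vma *vm_next;       /* sorted by address */
};

struct model_mm {
	struct model_vma *mmap;
};

/* Like find_vma(): first vma with addr < vm_end; it may start above addr. */
static struct model_vma *model_find_vma(struct model_mm *mm, uintptr_t addr)
{
	struct model_vma *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next)
		if (addr < vma->vm_end)
			return vma;
	return NULL;
}

/* Returns 0 for an ordinary pte (caller takes the normal path), the
 * callback's result for a foreign pte, or -1 if no owner can be found. */
int model_clear_pte(struct model_mm *mm, uintptr_t vaddr, uint64_t pte)
{
	struct model_vma *vma;
	struct foreign_private *fp;

	if (!(pte & MODEL_PAGE_FOREIGN))
		return 0;

	vma = model_find_vma(mm, vaddr);
	if (!vma || vaddr < vma->vm_start || !vma->vm_private_data)
		return -1;

	fp = vma->vm_private_data;
	assert(fp->clear_pte != NULL);
	return fp->clear_pte(vma, vaddr, pte);
}

/* Example driver state: standard header first, driver fields after it. */
struct demo_private {
	struct foreign_private hdr;
	unsigned int unmapped;           /* counts callback invocations */
};

static int demo_clear_pte(struct model_vma *vma, uintptr_t vaddr, uint64_t pte)
{
	struct demo_private *p = vma->vm_private_data;

	(void)vaddr;
	(void)pte;
	p->unmapped++;                   /* a real driver would unmap the grant */
	return 1;
}
```

The point of putting the header first is that the generic code can
treat any foreign vma's private data as a foreign_private without
knowing whether blktap or gntdev owns it.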

> Only downside that I can see is that we would need to find the vma for
> each PTE that needs to be cleared this way (since we don't get this
> passed to ptep_get_and_clear_full), but this is mitigated by (i) it
> only happening in the erroneous, unclean-shutdown case, and (ii)
> getting a hit in the mm->mmap_cache for consecutive runs of mapped
> grants.

Yes.  find_vma is fairly hot, since it's used on every fault, so it
should be reasonably fast.  And it doesn't sound like our case is
particularly performance critical.

    J

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-05 18:15                                     ` Derek Murray
@ 2007-12-12  8:27                                       ` Isaku Yamahata
  2007-12-12  8:39                                         ` Keir Fraser
  0 siblings, 1 reply; 57+ messages in thread
From: Isaku Yamahata @ 2007-12-12  8:27 UTC (permalink / raw)
  To: Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann, xen-ia64-devel

On Wed, Dec 05, 2007 at 06:15:49PM +0000, Derek Murray wrote:
> Keir Fraser wrote:
> >Yes, this would work okay I suspect. Good enough as a stop-gap measure? Are
> >there any other responsibilities that you acquire if you make use of
> >VM_FOREIGN (in particular, how would this affect get_user_pages)?
> 
> VM_FOREIGN is already set for the gntdev VMA (mostly because it's 
> directly based on the blktap code). That means that it has the array of 
> page_structs in its vm_private_data, which can be used to fulfill a 
> get_user_pages call. I've attached a patch based on this fix.
> 
> Regards,
> 
> Derek.

Hi Derek. Sorry for this late alert.

This patch breaks blktap and gntdev on ia64.
With auto-translated physmap mode enabled, blktap/gntdev update
the pte entry with vm_insert_page() rather than updating it directly
with the hypercall.
So when zapping the pte entry, it is necessary to release the page
reference count, the rmap, etc.  Thus vm_normal_page() has to return
the struct page when auto-translated physmap mode is enabled.

How about passing a struct page ** to the zap_pte callback
and setting it to NULL if necessary?
(or
Can the condition be changed to check for auto-translated physmap mode?
or
Should the cleanup be done in the zap_pte callback?)
-- 
yamahata

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-12  8:27                                       ` Isaku Yamahata
@ 2007-12-12  8:39                                         ` Keir Fraser
  2007-12-12  8:44                                           ` Isaku Yamahata
  0 siblings, 1 reply; 57+ messages in thread
From: Keir Fraser @ 2007-12-12  8:39 UTC (permalink / raw)
  To: Isaku Yamahata, Derek Murray
  Cc: xen-devel, Eduardo Habkost, Juan Quintela, Stephen C. Tweedie,
	Jan Beulich, Glauber de Oliveira Costa, Chris Wright,
	virtualization, Gerd Hoffmann, xen-ia64-devel

We already make the VM_FOREIGN check conditional on defined(CONFIG_XEN). We
could add defined(CONFIG_X86) as well? This would seem reasonable as a
temporary measure for the old 2.6.18 tree.

 -- Keir

On 12/12/07 08:27, "Isaku Yamahata" <yamahata@valinux.co.jp> wrote:

> This patch breaks blktap and gntdev on ia64.
> With auto-translated physmap mode enabled, blktap/gntdev update
> the pte entry with vm_insert_page() rather than updating it directly
> with the hypercall.
> So when zapping the pte entry, it is necessary to release the page
> reference count, the rmap, etc.  Thus vm_normal_page() has to return
> the struct page when auto-translated physmap mode is enabled.
> 
> How about passing a struct page ** to the zap_pte callback
> and setting it to NULL if necessary?
> (or
> Can the condition be changed to check for auto-translated physmap mode?
> or
> Should the cleanup be done in the zap_pte callback?)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-12  8:39                                         ` Keir Fraser
@ 2007-12-12  8:44                                           ` Isaku Yamahata
  0 siblings, 0 replies; 57+ messages in thread
From: Isaku Yamahata @ 2007-12-12  8:44 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Derek Murray, xen-devel, Eduardo Habkost, Juan Quintela,
	Stephen C. Tweedie, Jan Beulich, Glauber de Oliveira Costa,
	Chris Wright, virtualization, Gerd Hoffmann, xen-ia64-devel

On Wed, Dec 12, 2007 at 08:39:41AM +0000, Keir Fraser wrote:
> We already make the VM_FOREIGN check conditional on defined(CONFIG_XEN). We
> could add defined(CONFIG_X86) as well? This would seem reasonable as a
> temporary measure for the old 2.6.18 tree.

Yes, ok for IA64.
-- 
yamahata

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Xen-devel] Re: Next steps with pv_ops for Xen
  2007-12-03 18:36           ` D.G. Murray
                               ` (2 preceding siblings ...)
  2007-12-21 12:58             ` Gerd Hoffmann
@ 2007-12-21 12:58             ` Gerd Hoffmann
  3 siblings, 0 replies; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-21 12:58 UTC (permalink / raw)
  To: dgm36
  Cc: xen-devel, 'Eduardo Habkost', 'Juan Quintela',
	'Jan Beulich', 'Glauber de Oliveira Costa',
	'Chris Wright', virtualization, 'Mark Williamson'

D.G. Murray wrote:
> Hi Mark, 
> void *xc_gnttab_map_grant_refs(int xcg_handle,
>                                uint32_t count,
>                                uint32_t *domids,
>                                uint32_t *refs,
>                                int prot); 

Fedora 8 has 3.1.2 packages now, still doesn't work for me though.

Bored at xmas?  Want to try fixing it?  Fetch xenner 0.15 from
http://dl.bytesex.org/releases/xenner/, build ("make blkbackd"), and run
it as a drop-in replacement for blktap.  You have to pass the "-b" switch
to make it try batching grant maps.  The code is in ioreq_map(), blkbackd.c.

Oh, and I think the limit should be raised.  32 requests with up
to 11 segments each sum to 352 pages, which is way beyond the
current 128-grant limit, so it may fail under heavy I/O load.
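
The arithmetic, spelled out as a small sketch.  The constant and
function names are made up for this example; the numbers are the ones
quoted above (32 requests, 11 one-page entries per request, 128 grants).

```c
#include <assert.h>

enum {
	DEMO_RING_REQUESTS    = 32,   /* requests that can be in flight */
	DEMO_PAGES_PER_REQ    = 11,   /* one granted page per request entry */
	DEMO_GRANT_LIMIT      = 128   /* current limit on mapped grants */
};

/* Worst case: every in-flight request uses all of its entries. */
unsigned int worst_case_grants(unsigned int requests, unsigned int per_req)
{
	return requests * per_req;
}

/* How many full-size requests fit under a given grant limit. */
unsigned int full_requests_within(unsigned int limit, unsigned int per_req)
{
	assert(per_req > 0);
	return limit / per_req;
}
```

worst_case_grants(DEMO_RING_REQUESTS, DEMO_PAGES_PER_REQ) gives 352,
while full_requests_within(DEMO_GRANT_LIMIT, DEMO_PAGES_PER_REQ) gives
11, so only about a third of a full ring can be mapped at once under
the 128 limit.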

cheers and happy xmas,

  Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: Next steps with pv_ops for Xen
  2007-12-03 18:36           ` D.G. Murray
  2007-12-03 19:08             ` Mark Williamson
  2007-12-06 15:21             ` Gerd Hoffmann
@ 2007-12-21 12:58             ` Gerd Hoffmann
  2007-12-21 12:58             ` [Xen-devel] " Gerd Hoffmann
  3 siblings, 0 replies; 57+ messages in thread
From: Gerd Hoffmann @ 2007-12-21 12:58 UTC (permalink / raw)
  To: dgm36
  Cc: xen-devel, 'Eduardo Habkost', 'Juan Quintela',
	'Stephen C. Tweedie', 'Jan Beulich',
	'Glauber de Oliveira Costa', 'Chris Wright',
	virtualization, 'Mark Williamson'

D.G. Murray wrote:
> Hi Mark, 
> void *xc_gnttab_map_grant_refs(int xcg_handle,
>                                uint32_t count,
>                                uint32_t *domids,
>                                uint32_t *refs,
>                                int prot); 

Fedora 8 has 3.1.2 packages now, still doesn't work for me though.

Bored at xmas?  Want to try fixing it?  Fetch xenner 0.15 from
http://dl.bytesex.org/releases/xenner/, build ("make blkbackd"), and run
it as a drop-in replacement for blktap.  You have to pass the "-b" switch
to make it try batching grant maps.  The code is in ioreq_map(), blkbackd.c.

Oh, and I think the limit should be raised.  32 requests with up
to 11 segments each sum to 352 pages, which is way beyond the
current 128-grant limit, so it may fail under heavy I/O load.

cheers and happy xmas,

  Gerd

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2007-12-21 12:58 UTC | newest]

Thread overview: 57+ messages
2007-11-21 22:05 Next steps with pv_ops for Xen Stephen C. Tweedie
2007-11-21 23:12 ` Jeremy Fitzhardinge
2007-11-26 14:02   ` Juan Quintela
2007-11-26 18:52     ` Jeremy Fitzhardinge
2007-11-27  8:30       ` Jan Beulich
2007-11-27 17:00         ` Jeremy Fitzhardinge
2007-11-27 17:14           ` Jan Beulich
2007-11-27 17:15           ` Stephen C. Tweedie
2007-12-03 12:54 ` Gerd Hoffmann
2007-12-03 13:19   ` Derek Murray
2007-12-03 14:16     ` Gerd Hoffmann
2007-12-03 14:51       ` Derek Murray
2007-12-03 17:18         ` Mark Williamson
2007-12-03 18:36           ` D.G. Murray
2007-12-03 19:08             ` Mark Williamson
2007-12-04  9:35               ` tgh
2007-12-05  3:42                 ` Mark Williamson
2007-12-06 15:21             ` Gerd Hoffmann
2007-12-06 15:32               ` Derek Murray
2007-12-06 15:55                 ` Gerd Hoffmann
2007-12-21 12:58             ` Gerd Hoffmann
2007-12-21 12:58             ` [Xen-devel] " Gerd Hoffmann
2007-12-03 20:38         ` Gerd Hoffmann
2007-12-04  9:40           ` Derek Murray
2007-12-04 12:01             ` Gerd Hoffmann
2007-12-04 12:39               ` Stephen C. Tweedie
2007-12-04 19:58                 ` Gerd Hoffmann
2007-12-05 11:48                   ` [Xen-devel] " Derek Murray
2007-12-05 11:48                   ` Derek Murray
2007-12-05 14:12                     ` Gerd Hoffmann
2007-12-05 14:22                       ` Keir Fraser
2007-12-05 14:30                         ` Derek Murray
2007-12-05 16:58                           ` Keir Fraser
2007-12-05 17:17                             ` Derek Murray
2007-12-05 17:22                               ` Keir Fraser
2007-12-05 17:48                                 ` Derek Murray
2007-12-05 17:59                                   ` Keir Fraser
2007-12-05 18:15                                     ` Derek Murray
2007-12-12  8:27                                       ` Isaku Yamahata
2007-12-12  8:39                                         ` Keir Fraser
2007-12-12  8:44                                           ` Isaku Yamahata
2007-12-05 20:06                                     ` Gerd Hoffmann
2007-12-05 18:12                     ` Jeremy Fitzhardinge
2007-12-05 18:29                       ` Derek Murray
2007-12-05 20:15                         ` Jeremy Fitzhardinge
2007-12-05 20:35                           ` Geoffrey Lefebvre
2007-12-06 10:15                             ` Gerd Hoffmann
2007-12-05 20:44                           ` Keir Fraser
2007-12-06 10:00                             ` Derek Murray
2007-12-06 19:55                               ` [Xen-devel] " Jeremy Fitzhardinge
2007-12-05 13:19                   ` Derek Murray
2007-12-04 21:08                 ` Ian Main
2007-12-05 10:03                 ` Gerd Hoffmann
2007-12-05 12:51                   ` Gerd Hoffmann
2007-12-05 10:11                 ` Derek Murray
2007-12-04 20:59             ` Ian Main
2007-12-05 11:54               ` Derek Murray
