[patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt

All of lore.kernel.org
 help / color / mirror / Atom feed

* [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
@ 2007-03-01 23:24 Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                   ` (27 more replies)
  0 siblings, 28 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell

Hi Andi,

This patch series implements the Linux Xen guest as a paravirt_ops
backend.  The features in implemented this patch series are:
 * domU only
 * UP only (most code is SMP-safe, but there's no way to create a new vcpu)
 * writable pagetables, with late pinning/early unpinning
   (no shadow pagetable support)
 * supports both PAE and non-PAE modes
 * xen hvc console (console=hvc0)
 * virtual block device (blockfront)
 * virtual network device (netfront)

The patch series is in three parts:
  1-2	Cleanup patches to various parts of the kernel
  3-15	Extensions to the core code and/or paravirt ops, needed to support
	Xen. Includes hooks into get/setting pte values and the new
	patching machinery.
 16-26	The Xen paravirt_ops implementation itself.

(Some of the earlier patches in the series have already been posted,
 but are included to make the series self-contained.)

I've tried to make each patch as self-explanatory as possible.  The
series is based on 2.6.21-rc2-git1.

Changes since the previous posting:
- Moved the arch_*_mmap stubs to asm-generic/mm_hooks.h, and included that in
  asm-*/mmu_context.h (James Bottomley)
- Use unsigned long consistently for addresses (Andi Kleen)
- Moved xenboot_console declaration to its own header. (Andi Kleen)
- Renamed FIX_PARAVIRT to FIX_PARAVIRT_BOOTMAP to emphasize that its intended
  for bootstrap only (Ingo Molnar)
- Cleaned up headers a fair bit.  Moved a lot of the native_*
  functions to their "home" headers, so that they can be used either for
  native paravirt boots, or when CONFIG_PARAVIRT is disabled.  This cuts
  down on duplication, and makes paravirt.h simpler.
- Various cleanups from Adrian Bunk

Thanks,
	J
-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 01/26] Xen-paravirt_ops: Fix typo in sync_constant_test_bit()s name.
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell

[-- Attachment #1: fix-sync-bitops.patch --]
[-- Type: text/plain, Size: 664 bytes --]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

---
 include/asm-i386/sync_bitops.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

===================================================================
--- a/include/asm-i386/sync_bitops.h
+++ b/include/asm-i386/sync_bitops.h
@@ -130,7 +130,7 @@ static inline int sync_test_and_change_b
 	return oldbit;
 }
 
-static __always_inline int sync_const_test_bit(int nr, const volatile unsigned long *addr)
+static __always_inline int sync_constant_test_bit(int nr, const volatile unsigned long *addr)
 {
 	return ((1UL << (nr & 31)) &
 		(((const volatile unsigned int *)addr)[nr >> 5])) != 0;

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 01/26] Xen-paravirt_ops: Fix typo in sync_constant_test_bit()s name.
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Chris Wright, virtualization, xen-devel, Andrew Morton, linux-kernel

[-- Attachment #1: fix-sync-bitops.patch --]
[-- Type: text/plain, Size: 681 bytes --]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

---
 include/asm-i386/sync_bitops.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

===================================================================
--- a/include/asm-i386/sync_bitops.h
+++ b/include/asm-i386/sync_bitops.h
@@ -130,7 +130,7 @@ static inline int sync_test_and_change_b
 	return oldbit;
 }
 
-static __always_inline int sync_const_test_bit(int nr, const volatile unsigned long *addr)
+static __always_inline int sync_constant_test_bit(int nr, const volatile unsigned long *addr)
 {
 	return ((1UL << (nr & 31)) &
 		(((const volatile unsigned int *)addr)[nr >> 5])) != 0;

-- 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 02/26] Xen-paravirt_ops: ignore vgacon if hardware not present
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell

[-- Attachment #1: vgacon-ignore-uninitialized.patch --]
[-- Type: text/plain, Size: 843 bytes --]

Avoid trying to set up vgacon if there's no vga hardware present.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
From: Gerd Hoffmann <kraxel@suse.de>

---
 drivers/video/console/vgacon.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

===================================================================
--- a/drivers/video/console/vgacon.c
+++ b/drivers/video/console/vgacon.c
@@ -372,7 +372,8 @@ static const char *vgacon_startup(void)
 	}
 
 	/* VGA16 modes are not handled by VGACON */
-	if ((ORIG_VIDEO_MODE == 0x0D) ||	/* 320x200/4 */
+	if ((ORIG_VIDEO_MODE == 0x00) ||	/* SCREEN_INFO not initialized */
+	    (ORIG_VIDEO_MODE == 0x0D) ||	/* 320x200/4 */
 	    (ORIG_VIDEO_MODE == 0x0E) ||	/* 640x200/4 */
 	    (ORIG_VIDEO_MODE == 0x10) ||	/* 640x350/4 */
 	    (ORIG_VIDEO_MODE == 0x12) ||	/* 640x480/4 */

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 02/26] Xen-paravirt_ops: ignore vgacon if hardware not present
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zachary Amsden, xen-devel, Rusty Russell, linux-kernel,
	Chris Wright, virtualization, Andrew Morton

[-- Attachment #1: vgacon-ignore-uninitialized.patch --]
[-- Type: text/plain, Size: 842 bytes --]

Avoid trying to set up vgacon if there's no vga hardware present.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
From: Gerd Hoffmann <kraxel@suse.de>

---
 drivers/video/console/vgacon.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

===================================================================
--- a/drivers/video/console/vgacon.c
+++ b/drivers/video/console/vgacon.c
@@ -372,7 +372,8 @@ static const char *vgacon_startup(void)
 	}
 
 	/* VGA16 modes are not handled by VGACON */
-	if ((ORIG_VIDEO_MODE == 0x0D) ||	/* 320x200/4 */
+	if ((ORIG_VIDEO_MODE == 0x00) ||	/* SCREEN_INFO not initialized */
+	    (ORIG_VIDEO_MODE == 0x0D) ||	/* 320x200/4 */
 	    (ORIG_VIDEO_MODE == 0x0E) ||	/* 640x200/4 */
 	    (ORIG_VIDEO_MODE == 0x10) ||	/* 640x350/4 */
 	    (ORIG_VIDEO_MODE == 0x12) ||	/* 640x480/4 */

-- 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-01 23:24 ` Jeremy Fitzhardinge
  2007-03-16  9:44     ` Ingo Molnar
  2007-03-01 23:24 ` [patch 04/26] Xen-paravirt_ops: Add pagetable accessors to pack and unpack pagetable entries Jeremy Fitzhardinge
                   ` (24 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell

[-- Attachment #1: paravirt-nop.patch --]
[-- Type: text/plain, Size: 2940 bytes --]

Add a _paravirt_nop function for use as a stub for no-op operations,
and paravirt_nop #defined void * version to make using it easier
(since all its uses are as a void *).

This is useful to allow the patcher to automatically identify noop
operations so it can simply nop out the callsite.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

---
 arch/i386/kernel/paravirt.c |   22 +++++++++++-----------
 include/asm-i386/paravirt.h |    3 +++
 2 files changed, 14 insertions(+), 11 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -34,7 +34,7 @@
 #include <asm/tlbflush.h>
 
 /* nop stub */
-static void native_nop(void)
+void _paravirt_nop(void)
 {
 }
 
@@ -489,7 +489,7 @@ struct paravirt_ops paravirt_ops = {
 
  	.patch = native_patch,
 	.banner = default_banner,
-	.arch_setup = native_nop,
+	.arch_setup = paravirt_nop,
 	.memory_setup = machine_specific_memory_setup,
 	.get_wallclock = native_get_wallclock,
 	.set_wallclock = native_set_wallclock,
@@ -544,23 +544,23 @@ struct paravirt_ops paravirt_ops = {
 	.setup_boot_clock = setup_boot_APIC_clock,
 	.setup_secondary_clock = setup_secondary_APIC_clock,
 #endif
-	.set_lazy_mode = (void *)native_nop,
+	.set_lazy_mode = paravirt_nop,
 
 	.flush_tlb_user = native_flush_tlb,
 	.flush_tlb_kernel = native_flush_tlb_global,
 	.flush_tlb_single = native_flush_tlb_single,
 
-	.alloc_pt = (void *)native_nop,
-	.alloc_pd = (void *)native_nop,
-	.alloc_pd_clone = (void *)native_nop,
-	.release_pt = (void *)native_nop,
-	.release_pd = (void *)native_nop,
+	.alloc_pt = paravirt_nop,
+	.alloc_pd = paravirt_nop,
+	.alloc_pd_clone = paravirt_nop,
+	.release_pt = paravirt_nop,
+	.release_pd = paravirt_nop,
 
 	.set_pte = native_set_pte,
 	.set_pte_at = native_set_pte_at,
 	.set_pmd = native_set_pmd,
-	.pte_update = (void *)native_nop,
-	.pte_update_defer = (void *)native_nop,
+	.pte_update = paravirt_nop,
+	.pte_update_defer = paravirt_nop,
 #ifdef CONFIG_X86_PAE
 	.set_pte_atomic = native_set_pte_atomic,
 	.set_pte_present = native_set_pte_present,
@@ -572,7 +572,7 @@ struct paravirt_ops paravirt_ops = {
 	.irq_enable_sysexit = native_irq_enable_sysexit,
 	.iret = native_iret,
 
-	.startup_ipi_hook = (void *)native_nop,
+	.startup_ipi_hook = paravirt_nop,
 };
 
 /*
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -422,6 +422,9 @@ static inline void pmd_clear(pmd_t *pmdp
 #define arch_enter_lazy_mmu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_MMU)
 #define arch_leave_lazy_mmu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_NONE)
 
+void _paravirt_nop(void);
+#define paravirt_nop	((void *)_paravirt_nop)
+
 /* These all sit in the .parainstructions section to tell us what to patch. */
 struct paravirt_patch {
 	u8 *instr; 		/* original instructions */

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 04/26] Xen-paravirt_ops: Add pagetable accessors to pack and unpack pagetable entries
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (2 preceding siblings ...)
  2007-03-01 23:24 ` [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations Jeremy Fitzhardinge
@ 2007-03-01 23:24 ` Jeremy Fitzhardinge
  2007-03-16  9:38     ` Ingo Molnar
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                   ` (23 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Ingo Molnar

[-- Attachment #1: paravirt-pte-accessors.patch --]
[-- Type: text/plain, Size: 18901 bytes --]

Add a set of accessors to pack, unpack and modify page table entries
(at all levels).  This allows a paravirt implementation to control the
contents of pgd/pmd/pte entries.  For example, Xen uses this to
convert the (pseudo-)physical address into a machine address when
populating a pagetable entry, and converting back to pphys address
when an entry is read.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Ingo Molnar <mingo@elte.hu>

---
 arch/i386/kernel/paravirt.c       |   84 +++++--------------------------------
 arch/i386/kernel/vmi.c            |    6 +-
 include/asm-i386/page.h           |   71 ++++++++++++++++++++++++++-----
 include/asm-i386/paravirt.h       |   52 +++++++++++++++++-----
 include/asm-i386/pgtable-2level.h |   28 +++++++++---
 include/asm-i386/pgtable-3level.h |   65 +++++++++++++++++-----------
 include/asm-i386/pgtable.h        |    2 
 7 files changed, 178 insertions(+), 130 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -398,78 +398,6 @@ static void native_flush_tlb_single(u32 
 {
 	__native_flush_tlb_single(addr);
 }
-
-#ifndef CONFIG_X86_PAE
-static void native_set_pte(pte_t *ptep, pte_t pteval)
-{
-	*ptep = pteval;
-}
-
-static void native_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pteval)
-{
-	*ptep = pteval;
-}
-
-static void native_set_pmd(pmd_t *pmdp, pmd_t pmdval)
-{
-	*pmdp = pmdval;
-}
-
-#else /* CONFIG_X86_PAE */
-
-static void native_set_pte(pte_t *ptep, pte_t pte)
-{
-	ptep->pte_high = pte.pte_high;
-	smp_wmb();
-	ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pte)
-{
-	ptep->pte_high = pte.pte_high;
-	smp_wmb();
-	ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_present(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte)
-{
-	ptep->pte_low = 0;
-	smp_wmb();
-	ptep->pte_high = pte.pte_high;
-	smp_wmb();
-	ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_atomic(pte_t *ptep, pte_t pteval)
-{
-	set_64bit((unsigned long long *)ptep,pte_val(pteval));
-}
-
-static void native_set_pmd(pmd_t *pmdp, pmd_t pmdval)
-{
-	set_64bit((unsigned long long *)pmdp,pmd_val(pmdval));
-}
-
-static void native_set_pud(pud_t *pudp, pud_t pudval)
-{
-	*pudp = pudval;
-}
-
-static void native_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
-{
-	ptep->pte_low = 0;
-	smp_wmb();
-	ptep->pte_high = 0;
-}
-
-static void native_pmd_clear(pmd_t *pmd)
-{
-	u32 *tmp = (u32 *)pmd;
-	*tmp = 0;
-	smp_wmb();
-	*(tmp + 1) = 0;
-}
-#endif /* CONFIG_X86_PAE */
 
 /* These are in entry.S */
 extern void native_iret(void);
@@ -561,13 +489,25 @@ struct paravirt_ops paravirt_ops = {
 	.set_pmd = native_set_pmd,
 	.pte_update = paravirt_nop,
 	.pte_update_defer = paravirt_nop,
+
+	.ptep_get_and_clear = native_ptep_get_and_clear,
+
 #ifdef CONFIG_X86_PAE
 	.set_pte_atomic = native_set_pte_atomic,
 	.set_pte_present = native_set_pte_present,
 	.set_pud = native_set_pud,
 	.pte_clear = native_pte_clear,
 	.pmd_clear = native_pmd_clear,
+
+	.pmd_val = native_pmd_val,
+	.make_pmd = native_make_pmd,
 #endif
+
+	.pte_val = native_pte_val,
+	.pgd_val = native_pgd_val,
+
+	.make_pte = native_make_pte,
+	.make_pgd = native_make_pgd,
 
 	.irq_enable_sysexit = native_irq_enable_sysexit,
 	.iret = native_iret,
===================================================================
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -426,13 +426,13 @@ static void vmi_release_pd(u32 pfn)
         ((level) | (is_current_as(mm, user) ?                           \
                 (VMI_PAGE_DEFER | VMI_PAGE_CURRENT_AS | ((addr) & VMI_PAGE_VA_MASK)) : 0))
 
-static void vmi_update_pte(struct mm_struct *mm, u32 addr, pte_t *ptep)
+static void vmi_update_pte(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	vmi_check_page_type(__pa(ptep) >> PAGE_SHIFT, VMI_PAGE_PTE);
 	vmi_ops.update_pte(ptep, vmi_flags_addr(mm, addr, VMI_PAGE_PT, 0));
 }
 
-static void vmi_update_pte_defer(struct mm_struct *mm, u32 addr, pte_t *ptep)
+static void vmi_update_pte_defer(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	vmi_check_page_type(__pa(ptep) >> PAGE_SHIFT, VMI_PAGE_PTE);
 	vmi_ops.update_pte(ptep, vmi_flags_addr_defer(mm, addr, VMI_PAGE_PT, 0));
@@ -445,7 +445,7 @@ static void vmi_set_pte(pte_t *ptep, pte
 	vmi_ops.set_pte(pte, ptep, VMI_PAGE_PT);
 }
 
-static void vmi_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pte)
+static void vmi_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte)
 {
 	vmi_check_page_type(__pa(ptep) >> PAGE_SHIFT, VMI_PAGE_PTE);
 	vmi_ops.set_pte(pte, ptep, vmi_flags_addr(mm, addr, VMI_PAGE_PT, 0));
===================================================================
--- a/include/asm-i386/page.h
+++ b/include/asm-i386/page.h
@@ -12,7 +12,6 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__
 
-
 #ifdef CONFIG_X86_USE_3DNOW
 
 #include <asm/mmx.h>
@@ -42,26 +41,73 @@
  * These are used to make use of C type-checking..
  */
 extern int nx_enabled;
+
 #ifdef CONFIG_X86_PAE
 extern unsigned long long __supported_pte_mask;
 typedef struct { unsigned long pte_low, pte_high; } pte_t;
 typedef struct { unsigned long long pmd; } pmd_t;
 typedef struct { unsigned long long pgd; } pgd_t;
 typedef struct { unsigned long long pgprot; } pgprot_t;
-#define pmd_val(x)	((x).pmd)
-#define pte_val(x)	((x).pte_low | ((unsigned long long)(x).pte_high << 32))
-#define __pmd(x) ((pmd_t) { (x) } )
+
+static inline unsigned long long native_pgd_val(pgd_t pgd)
+{
+	return pgd.pgd;
+}
+static inline unsigned long long native_pmd_val(pmd_t pmd)
+{
+	return pmd.pmd;
+}
+static inline unsigned long long native_pte_val(pte_t pte)
+{
+	return pte.pte_low | ((unsigned long long)pte.pte_high << 32);
+}
+static inline pgd_t native_make_pgd(unsigned long long val)
+{
+	return (pgd_t) { val };
+}
+static inline pmd_t native_make_pmd(unsigned long long val)
+{
+	return (pmd_t) { val };
+}
+static inline pte_t native_make_pte(unsigned long long val)
+{
+	return (pte_t) { .pte_low = val, .pte_high = (val >> 32) } ;
+}
+
+#ifndef CONFIG_PARAVIRT
+#define pmd_val(x)	native_pmd_val(x)
+#define __pmd(x)	native_make_pmd(x)
+#endif	/* !CONFIG_PARAVIRT */
+
 #define HPAGE_SHIFT	21
 #include <asm-generic/pgtable-nopud.h>
-#else
+#else  /* !CONFIG_X86_PAE */
 typedef struct { unsigned long pte_low; } pte_t;
 typedef struct { unsigned long pgd; } pgd_t;
 typedef struct { unsigned long pgprot; } pgprot_t;
 #define boot_pte_t pte_t /* or would you rather have a typedef */
-#define pte_val(x)	((x).pte_low)
+
+static inline unsigned long native_pgd_val(pgd_t pgd)
+{
+	return pgd.pgd;
+}
+static inline unsigned long native_pte_val(pte_t pte)
+{
+	return pte.pte_low;
+}
+static inline pgd_t native_make_pgd(unsigned long val)
+{
+	return (pgd_t) { val };
+}
+static inline pte_t native_make_pte(unsigned long val)
+{
+	return (pte_t) { .pte_low = val };
+}
+
 #define HPAGE_SHIFT	22
 #include <asm-generic/pgtable-nopmd.h>
-#endif
+#endif	/* CONFIG_X86_PAE */
+
 #define PTE_MASK	PAGE_MASK
 
 #ifdef CONFIG_HUGETLB_PAGE
@@ -71,12 +117,15 @@ typedef struct { unsigned long pgprot; }
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #endif
 
-#define pgd_val(x)	((x).pgd)
 #define pgprot_val(x)	((x).pgprot)
-
-#define __pte(x) ((pte_t) { (x) } )
-#define __pgd(x) ((pgd_t) { (x) } )
 #define __pgprot(x)	((pgprot_t) { (x) } )
+
+#ifndef CONFIG_PARAVIRT
+#define pgd_val(x)	native_pgd_val(x)
+#define __pgd(x)	native_make_pgd(x)
+#define pte_val(x)	native_pte_val(x)
+#define __pte(x)	native_make_pte(x)
+#endif	/* !CONFIG_PARAVIRT */
 
 #endif /* !__ASSEMBLY__ */
 
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -2,7 +2,6 @@
 #define __ASM_PARAVIRT_H
 /* Various instructions on x86 need to be replaced for
  * para-virtualization: those hooks are defined here. */
-#include <linux/linkage.h>
 #include <linux/stringify.h>
 #include <asm/page.h>
 
@@ -25,6 +24,8 @@
 #define CLBR_ANY 0x7
 
 #ifndef __ASSEMBLY__
+#include <linux/types.h>
+
 struct thread_struct;
 struct Xgt_desc_struct;
 struct tss_struct;
@@ -53,11 +54,6 @@ struct paravirt_ops
 	unsigned long (*get_wallclock)(void);
 	int (*set_wallclock)(unsigned long);
 	void (*time_init)(void);
-
-	/* All the function pointers here are declared as "fastcall"
-	   so that we get a specific register-based calling
-	   convention.  This makes it easier to implement inline
-	   assembler replacements. */
 
 	void (*cpuid)(unsigned int *eax, unsigned int *ebx,
 		      unsigned int *ecx, unsigned int *edx);
@@ -136,16 +132,33 @@ struct paravirt_ops
 	void (*release_pd)(u32 pfn);
 
 	void (*set_pte)(pte_t *ptep, pte_t pteval);
-	void (*set_pte_at)(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pteval);
+	void (*set_pte_at)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval);
 	void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval);
-	void (*pte_update)(struct mm_struct *mm, u32 addr, pte_t *ptep);
-	void (*pte_update_defer)(struct mm_struct *mm, u32 addr, pte_t *ptep);
+	void (*pte_update)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+	void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+
+ 	pte_t (*ptep_get_and_clear)(pte_t *ptep);
+
 #ifdef CONFIG_X86_PAE
 	void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
-	void (*set_pte_present)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte);
+ 	void (*set_pte_present)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte);
 	void (*set_pud)(pud_t *pudp, pud_t pudval);
-	void (*pte_clear)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+ 	void (*pte_clear)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 	void (*pmd_clear)(pmd_t *pmdp);
+
+	unsigned long long (*pte_val)(pte_t);
+	unsigned long long (*pmd_val)(pmd_t);
+	unsigned long long (*pgd_val)(pgd_t);
+
+	pte_t (*make_pte)(unsigned long long pte);
+	pmd_t (*make_pmd)(unsigned long long pmd);
+	pgd_t (*make_pgd)(unsigned long long pgd);
+#else
+	unsigned long (*pte_val)(pte_t);
+	unsigned long (*pgd_val)(pgd_t);
+
+	pte_t (*make_pte)(unsigned long pte);
+	pgd_t (*make_pgd)(unsigned long pgd);
 #endif
 
 	void (*set_lazy_mode)(int mode);
@@ -215,6 +228,8 @@ static inline void __cpuid(unsigned int 
 #define read_cr4() paravirt_ops.read_cr4()
 #define read_cr4_safe(x) paravirt_ops.read_cr4_safe()
 #define write_cr4(x) paravirt_ops.write_cr4(x)
+
+#define raw_ptep_get_and_clear(xp)	(paravirt_ops.ptep_get_and_clear(xp))
 
 static inline void raw_safe_halt(void)
 {
@@ -297,6 +312,17 @@ static inline void halt(void)
 	(paravirt_ops.write_idt_entry((dt), (entry), (low), (high)))
 #define set_iopl_mask(mask) (paravirt_ops.set_iopl_mask(mask))
 
+#define __pte(x)	paravirt_ops.make_pte(x)
+#define __pgd(x)	paravirt_ops.make_pgd(x)
+
+#define pte_val(x)	paravirt_ops.pte_val(x)
+#define pgd_val(x)	paravirt_ops.pgd_val(x)
+
+#ifdef CONFIG_X86_PAE
+#define __pmd(x)	paravirt_ops.make_pmd(x)
+#define pmd_val(x)	paravirt_ops.pmd_val(x)
+#endif
+
 /* The paravirtualized I/O functions */
 static inline void slow_down_io(void) {
 	paravirt_ops.io_delay();
@@ -337,6 +363,7 @@ static inline void setup_secondary_clock
 }
 #endif
 
+
 #ifdef CONFIG_SMP
 static inline void startup_ipi_hook(int phys_apicid, unsigned long start_eip,
 				    unsigned long start_esp)
@@ -362,7 +389,8 @@ static inline void set_pte(pte_t *ptep, 
 	paravirt_ops.set_pte(ptep, pteval);
 }
 
-static inline void set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pteval)
+static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
+			      pte_t *ptep, pte_t pteval)
 {
 	paravirt_ops.set_pte_at(mm, addr, ptep, pteval);
 }
===================================================================
--- a/include/asm-i386/pgtable-2level.h
+++ b/include/asm-i386/pgtable-2level.h
@@ -11,11 +11,24 @@
  * within a page table are directly modified.  Thus, the following
  * hook is made available.
  */
+static inline void native_set_pte(pte_t *ptep , pte_t pte)
+{
+	*ptep = pte;
+}
+static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
+				     pte_t *ptep , pte_t pte)
+{
+	native_set_pte(ptep, pte);
+}
+static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
+{
+	*pmdp = pmd;
+}
 #ifndef CONFIG_PARAVIRT
-#define set_pte(pteptr, pteval) (*(pteptr) = pteval)
-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
-#define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))
-#endif
+#define set_pte(pteptr, pteval)		native_set_pte(pteptr, pteval)
+#define set_pte_at(mm,addr,ptep,pteval) native_set_pte_at(mm, addr, ptep, pteval)
+#define set_pmd(pmdptr, pmdval)		native_set_pmd(pmdptr, pmdval)
+#endif	/* CONFIG_PARAVIRT */
 
 #define set_pte_atomic(pteptr, pteval) set_pte(pteptr,pteval)
 #define set_pte_present(mm,addr,ptep,pteval) set_pte_at(mm,addr,ptep,pteval)
@@ -23,11 +36,14 @@
 #define pte_clear(mm,addr,xp)	do { set_pte_at(mm, addr, xp, __pte(0)); } while (0)
 #define pmd_clear(xp)	do { set_pmd(xp, __pmd(0)); } while (0)
 
-#define raw_ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte_low, 0))
+static inline pte_t native_ptep_get_and_clear(pte_t *xp)
+{
+	return __pte(xchg(&xp->pte_low, 0));
+}
 
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
 #define pte_none(x)		(!(x).pte_low)
-#define pte_pfn(x)		((unsigned long)(((x).pte_low >> PAGE_SHIFT)))
+#define pte_pfn(x)		(pte_val(x) >> PAGE_SHIFT)
 #define pfn_pte(pfn, prot)	__pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot))
 #define pfn_pmd(pfn, prot)	__pmd(((pfn) << PAGE_SHIFT) | pgprot_val(prot))
 
===================================================================
--- a/include/asm-i386/pgtable-3level.h
+++ b/include/asm-i386/pgtable-3level.h
@@ -42,20 +42,23 @@ static inline int pte_exec_kernel(pte_t 
 	return pte_x(pte);
 }
 
-#ifndef CONFIG_PARAVIRT
 /* Rules for using set_pte: the pte being assigned *must* be
  * either not present or in a state where the hardware will
  * not attempt to update the pte.  In places where this is
  * not possible, use pte_get_and_clear to obtain the old pte
  * value and then use set_pte to update it.  -ben
  */
-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void native_set_pte(pte_t *ptep, pte_t pte)
 {
 	ptep->pte_high = pte.pte_high;
 	smp_wmb();
 	ptep->pte_low = pte.pte_low;
 }
-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
+static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
+				     pte_t *ptep , pte_t pte)
+{
+	native_set_pte(ptep, pte);
+}
 
 /*
  * Since this is only called on user PTEs, and the page fault handler
@@ -63,7 +66,8 @@ static inline void set_pte(pte_t *ptep, 
  * we are justified in merely clearing the PTE present bit, followed
  * by a set.  The ordering here is important.
  */
-static inline void set_pte_present(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte)
+static inline void native_set_pte_present(struct mm_struct *mm, unsigned long addr,
+					  pte_t *ptep, pte_t pte)
 {
 	ptep->pte_low = 0;
 	smp_wmb();
@@ -72,33 +76,49 @@ static inline void set_pte_present(struc
 	ptep->pte_low = pte.pte_low;
 }
 
-#define set_pte_atomic(pteptr,pteval) \
-		set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
-#define set_pmd(pmdptr,pmdval) \
-		set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
-#define set_pud(pudptr,pudval) \
-		(*(pudptr) = (pudval))
+static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
+{
+	set_64bit((unsigned long long *)(ptep),native_pte_val(pte));
+}
+static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
+{
+	set_64bit((unsigned long long *)(pmdp),native_pmd_val(pmd));
+}
+static inline void native_set_pud(pud_t *pudp, pud_t pud)
+{
+	*pudp = pud;
+}
 
 /*
  * For PTEs and PDEs, we must clear the P-bit first when clearing a page table
  * entry, so clear the bottom half first and enforce ordering with a compiler
  * barrier.
  */
-static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	ptep->pte_low = 0;
 	smp_wmb();
 	ptep->pte_high = 0;
 }
 
-static inline void pmd_clear(pmd_t *pmd)
+static inline void native_pmd_clear(pmd_t *pmd)
 {
 	u32 *tmp = (u32 *)pmd;
 	*tmp = 0;
 	smp_wmb();
 	*(tmp + 1) = 0;
 }
-#endif
+
+#ifndef CONFIG_PARAVIRT
+#define set_pte(ptep, pte)			native_set_pte(ptep, pte)
+#define set_pte_at(mm, addr, ptep, pte)		native_set_pte_at(mm, addr, ptep, pte)
+#define set_pte_present(mm, addr, ptep, pte)	native_set_pte_present(mm, addr, ptep, pte)
+#define set_pte_atomic(ptep, pte)		native_set_pte_atomic(ptep, pte)
+#define set_pmd(pmdp, pmd)			native_set_pmd(pmdp, pmd)
+#define set_pud(pudp, pud)			native_set_pud(pudp, pud)
+#define pte_clear(mm, addr, ptep)		native_pte_clear(mm, addr, ptep)
+#define pmd_clear(pmd)				native_pmd_clear(pmd)
+#endif	/* CONFIG_PARAVIRT */
 
 /*
  * Pentium-II erratum A13: in PAE mode we explicitly have to flush
@@ -119,7 +139,7 @@ static inline void pud_clear (pud_t * pu
 #define pmd_offset(pud, address) ((pmd_t *) pud_page(*(pud)) + \
 			pmd_index(address))
 
-static inline pte_t raw_ptep_get_and_clear(pte_t *ptep)
+static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
 {
 	pte_t res;
 
@@ -146,28 +166,21 @@ static inline int pte_none(pte_t pte)
 
 static inline unsigned long pte_pfn(pte_t pte)
 {
-	return (pte.pte_low >> PAGE_SHIFT) |
-		(pte.pte_high << (32 - PAGE_SHIFT));
+	return pte_val(pte) >> PAGE_SHIFT;
 }
 
 extern unsigned long long __supported_pte_mask;
 
 static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
-	pte_t pte;
-
-	pte.pte_high = (page_nr >> (32 - PAGE_SHIFT)) | \
-					(pgprot_val(pgprot) >> 32);
-	pte.pte_high &= (__supported_pte_mask >> 32);
-	pte.pte_low = ((page_nr << PAGE_SHIFT) | pgprot_val(pgprot)) & \
-							__supported_pte_mask;
-	return pte;
+	return __pte((((unsigned long long)page_nr << PAGE_SHIFT) |
+		      pgprot_val(pgprot)) & __supported_pte_mask);
 }
 
 static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
-	return __pmd((((unsigned long long)page_nr << PAGE_SHIFT) | \
-			pgprot_val(pgprot)) & __supported_pte_mask);
+	return __pmd((((unsigned long long)page_nr << PAGE_SHIFT) |
+		      pgprot_val(pgprot)) & __supported_pte_mask);
 }
 
 /*
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -263,6 +263,8 @@ static inline pte_t pte_mkhuge(pte_t pte
  */
 #define pte_update(mm, addr, ptep)		do { } while (0)
 #define pte_update_defer(mm, addr, ptep)	do { } while (0)
+
+#define raw_ptep_get_and_clear(xp)     native_ptep_get_and_clear(xp)
 #endif
 
 /*

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell,
	William Lee Irwin III, Ingo Molnar

[-- Attachment #1: paravirt-memory-init.patch --]
[-- Type: text/plain, Size: 11672 bytes --]

This patch introduces paravirt_ops hooks to control how the kernel's
initial pagetable is set up.

In the case of a native boot, the very early bootstrap code creates a
simple non-PAE pagetable to map the kernel and physical memory.  When
the VM subsystem is initialized, it creates a proper pagetable which
respects the PAE mode, large pages, etc.

When booting under a hypervisor, there are many possibilities for what
paging environment the hypervisor establishes for the guest kernel, so
the constructon of the kernel's pagetable depends on the hypervisor.

In the case of Xen, the hypervisor boots the kernel with a fully
constructed pagetable, which is already using PAE if necessary.  Also,
Xen requires particular care when constructing pagetables to make sure
all pagetables are always mapped read-only.

In order to make this easier, kernel's initial pagetable construction
has been changed to only allocate and initialize a pagetable page if
there's no page already present in the pagetable.  This allows the Xen
paravirt backend to make a copy of the hypervisor-provided pagetable,
allowing the kernel to establish any more mappings it needs while
keeping the existing ones.

A slightly subtle point which is worth highlighting here is that Xen
requires all kernel mappings to share the same pte_t pages between all
pagetables, so that updating a kernel page's mapping in one pagetable
is reflected in all other pagetables.  This makes it possible to
allocate a page and attach it to a pagetable without having to
explicitly enumerate that page's mapping in all pagetables.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Ingo Molnar <mingo@elte.hu>

---
 arch/i386/kernel/paravirt.c |    3 
 arch/i386/mm/init.c         |  159 +++++++++++++++++++++++++++++--------------
 include/asm-i386/paravirt.h |   17 ++++
 include/asm-i386/pgtable.h  |   16 ++++
 4 files changed, 143 insertions(+), 52 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -549,6 +549,9 @@ struct paravirt_ops paravirt_ops = {
 #endif
 	.set_lazy_mode = paravirt_nop,
 
+	.pagetable_setup_start = native_pagetable_setup_start,
+	.pagetable_setup_done = native_pagetable_setup_done,
+
 	.flush_tlb_user = native_flush_tlb,
 	.flush_tlb_kernel = native_flush_tlb_global,
 	.flush_tlb_single = native_flush_tlb_single,
===================================================================
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -42,6 +42,7 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
+#include <asm/paravirt.h>
 
 unsigned int __VMALLOC_RESERVE = 128 << 20;
 
@@ -62,6 +63,7 @@ static pmd_t * __init one_md_table_init(
 		
 #ifdef CONFIG_X86_PAE
 	pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
+
 	paravirt_alloc_pd(__pa(pmd_table) >> PAGE_SHIFT);
 	set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
 	pud = pud_offset(pgd, 0);
@@ -83,12 +85,10 @@ static pte_t * __init one_page_table_ini
 {
 	if (pmd_none(*pmd)) {
 		pte_t *page_table = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
+
 		paravirt_alloc_pt(__pa(page_table) >> PAGE_SHIFT);
 		set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));
-		if (page_table != pte_offset_kernel(pmd, 0))
-			BUG();	
-
-		return page_table;
+		BUG_ON(page_table != pte_offset_kernel(pmd, 0));
 	}
 	
 	return pte_offset_kernel(pmd, 0);
@@ -119,7 +119,7 @@ static void __init page_table_range_init
 	pgd = pgd_base + pgd_idx;
 
 	for ( ; (pgd_idx < PTRS_PER_PGD) && (vaddr != end); pgd++, pgd_idx++) {
-		if (pgd_none(*pgd)) 
+		if (!(pgd_val(*pgd) & _PAGE_PRESENT))
 			one_md_table_init(pgd);
 		pud = pud_offset(pgd, vaddr);
 		pmd = pmd_offset(pud, vaddr);
@@ -158,7 +158,11 @@ static void __init kernel_physical_mappi
 	pfn = 0;
 
 	for (; pgd_idx < PTRS_PER_PGD; pgd++, pgd_idx++) {
-		pmd = one_md_table_init(pgd);
+		if (!(pgd_val(*pgd) & _PAGE_PRESENT))
+			pmd = one_md_table_init(pgd);
+		else
+			pmd = pmd_offset(pud_offset(pgd, PAGE_OFFSET), PAGE_OFFSET);
+
 		if (pfn >= max_low_pfn)
 			continue;
 		for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD && pfn < max_low_pfn; pmd++, pmd_idx++) {
@@ -167,20 +171,26 @@ static void __init kernel_physical_mappi
 			/* Map with big pages if possible, otherwise create normal page tables. */
 			if (cpu_has_pse) {
 				unsigned int address2 = (pfn + PTRS_PER_PTE - 1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1;
-
-				if (is_kernel_text(address) || is_kernel_text(address2))
-					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE_EXEC));
-				else
-					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE));
+				if (!pmd_present(*pmd)) {
+					if (is_kernel_text(address) || is_kernel_text(address2))
+						set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE_EXEC));
+					else
+						set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE));
+				}
 				pfn += PTRS_PER_PTE;
 			} else {
 				pte = one_page_table_init(pmd);
 
-				for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; pte++, pfn++, pte_ofs++) {
-						if (is_kernel_text(address))
-							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
-						else
-							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));
+				for (pte_ofs = 0;
+				     pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn;
+				     pte++, pfn++, pte_ofs++, address += PAGE_SIZE) {
+					if (pte_present(*pte))
+						continue;
+
+					if (is_kernel_text(address))
+						set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
+					else
+						set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));
 				}
 			}
 		}
@@ -337,44 +347,36 @@ extern void __init remap_numa_kva(void);
 #define remap_numa_kva() do {} while (0)
 #endif
 
-static void __init pagetable_init (void)
-{
-	unsigned long vaddr;
-	pgd_t *pgd_base = swapper_pg_dir;
-
+void __init native_pagetable_setup_start(pgd_t *base)
+{
 #ifdef CONFIG_X86_PAE
 	int i;
-	/* Init entries of the first-level page table to the zero page */
-	for (i = 0; i < PTRS_PER_PGD; i++)
-		set_pgd(pgd_base + i, __pgd(__pa(empty_zero_page) | _PAGE_PRESENT));
+
+	/*
+	 * Init entries of the first-level page table to the
+	 * zero page, if they haven't already been set up.
+	 *
+	 * In a normal native boot, we'll be running on a
+	 * pagetable rooted in swapper_pg_dir, but not in PAE
+	 * mode, so this will end up clobbering the mappings
+	 * for the lower 24Mbytes of the address space,
+	 * without affecting the kernel address space.
+	 */
+	for (i = 0; i < USER_PTRS_PER_PGD; i++)
+		set_pgd(&base[i],
+			__pgd(__pa(empty_zero_page) | _PAGE_PRESENT));
+
+	/* Make sure kernel address space is empty so that a pagetable
+	   will be allocated for it. */
+	memset(&base[USER_PTRS_PER_PGD], 0,
+	       KERNEL_PGD_PTRS * sizeof(pgd_t));
 #else
 	paravirt_alloc_pd(__pa(swapper_pg_dir) >> PAGE_SHIFT);
 #endif
-
-	/* Enable PSE if available */
-	if (cpu_has_pse) {
-		set_in_cr4(X86_CR4_PSE);
-	}
-
-	/* Enable PGE if available */
-	if (cpu_has_pge) {
-		set_in_cr4(X86_CR4_PGE);
-		__PAGE_KERNEL |= _PAGE_GLOBAL;
-		__PAGE_KERNEL_EXEC |= _PAGE_GLOBAL;
-	}
-
-	kernel_physical_mapping_init(pgd_base);
-	remap_numa_kva();
-
-	/*
-	 * Fixed mappings, only the page table structure has to be
-	 * created - mappings will be set by set_fixmap():
-	 */
-	vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
-	page_table_range_init(vaddr, 0, pgd_base);
-
-	permanent_kmaps_init(pgd_base);
-
+}
+
+void __init native_pagetable_setup_done(pgd_t *base)
+{
 #ifdef CONFIG_X86_PAE
 	/*
 	 * Add low memory identity-mappings - SMP needs it when
@@ -383,8 +385,63 @@ static void __init pagetable_init (void)
 	 * All user-space mappings are explicitly cleared after
 	 * SMP startup.
 	 */
-	set_pgd(&pgd_base[0], pgd_base[USER_PTRS_PER_PGD]);
-#endif
+	set_pgd(&base[0], base[USER_PTRS_PER_PGD]);
+#endif
+}
+
+/*
+ * Build a proper pagetable for the kernel mappings.  Up until this
+ * point, we've been running on some set of pagetables constructed by
+ * the boot process.
+ *
+ * If we're booting on native hardware, this will be a pagetable
+ * constructed in arch/i386/kernel/head.S, and not running in PAE mode
+ * (even if we'll end up running in PAE).  The root of the pagetable
+ * will be swapper_pg_dir.
+ *
+ * If we're booting paravirtualized under a hypervisor, then there are
+ * more options: we may already be running PAE, and the pagetable may
+ * or may not be based in swapper_pg_dir.  In any case,
+ * paravirt_pagetable_setup_start() will set up swapper_pg_dir
+ * appropriately for the rest of the initialization to work.
+ *
+ * In general, pagetable_init() assumes that the pagetable may already
+ * be partially populated, and so it avoids stomping on any existing
+ * mappings.
+ */
+static void __init pagetable_init (void)
+{
+	unsigned long vaddr, end;
+	pgd_t *pgd_base = swapper_pg_dir;
+
+	paravirt_pagetable_setup_start(pgd_base);
+
+	/* Enable PSE if available */
+	if (cpu_has_pse) {
+		set_in_cr4(X86_CR4_PSE);
+	}
+
+	/* Enable PGE if available */
+	if (cpu_has_pge) {
+		set_in_cr4(X86_CR4_PGE);
+		__PAGE_KERNEL |= _PAGE_GLOBAL;
+		__PAGE_KERNEL_EXEC |= _PAGE_GLOBAL;
+	}
+
+	kernel_physical_mapping_init(pgd_base);
+	remap_numa_kva();
+
+	/*
+	 * Fixed mappings, only the page table structure has to be
+	 * created - mappings will be set by set_fixmap():
+	 */
+	vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
+	end = (FIXADDR_TOP + PMD_SIZE - 1) & PMD_MASK;
+	page_table_range_init(vaddr, end, pgd_base);
+
+	permanent_kmaps_init(pgd_base);
+
+	paravirt_pagetable_setup_done(pgd_base);
 }
 
 #if defined(CONFIG_SOFTWARE_SUSPEND) || defined(CONFIG_ACPI_SLEEP)
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -2,10 +2,11 @@
 #define __ASM_PARAVIRT_H
 /* Various instructions on x86 need to be replaced for
  * para-virtualization: those hooks are defined here. */
+
+#ifdef CONFIG_PARAVIRT
 #include <linux/stringify.h>
 #include <asm/page.h>
 
-#ifdef CONFIG_PARAVIRT
 /* These are the most performance critical ops, so we want to be able to patch
  * callers */
 #define PARAVIRT_IRQ_DISABLE 0
@@ -48,6 +49,9 @@ struct paravirt_ops
 	void (*arch_setup)(void);
 	char *(*memory_setup)(void);
 	void (*init_IRQ)(void);
+
+	void (*pagetable_setup_start)(pgd_t *pgd_base);
+	void (*pagetable_setup_done)(pgd_t *pgd_base);
 
 	void (*banner)(void);
 
@@ -363,6 +367,17 @@ static inline void setup_secondary_clock
 }
 #endif
 
+static inline void paravirt_pagetable_setup_start(pgd_t *base)
+{
+	if (paravirt_ops.pagetable_setup_start)
+		(*paravirt_ops.pagetable_setup_start)(base);
+}
+
+static inline void paravirt_pagetable_setup_done(pgd_t *base)
+{
+	if (paravirt_ops.pagetable_setup_done)
+		(*paravirt_ops.pagetable_setup_done)(base);
+}
 
 void native_set_pte(pte_t *ptep, pte_t pteval);
 void native_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pteval);
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -497,6 +497,22 @@ do {									\
  * tables contain all the necessary information.
  */
 #define update_mmu_cache(vma,address,pte) do { } while (0)
+
+void native_pagetable_setup_start(pgd_t *base);
+void native_pagetable_setup_done(pgd_t *base);
+
+#ifndef CONFIG_PARAVIRT
+static inline void paravirt_pagetable_setup_start(pgd_t *base)
+{
+	native_pagetable_setup_start(base);
+}
+
+static inline void paravirt_pagetable_setup_done(pgd_t *base)
+{
+	native_pagetable_setup_done(base);
+}
+#endif	/* !CONFIG_PARAVIRT */
+
 #endif /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_FLATMEM

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zachary Amsden, xen-devel, Rusty Russell, linux-kernel,
	William Lee Irwin III, Chris Wright, virtualization,
	Andrew Morton, Ingo Molnar

[-- Attachment #1: paravirt-memory-init.patch --]
[-- Type: text/plain, Size: 11671 bytes --]

This patch introduces paravirt_ops hooks to control how the kernel's
initial pagetable is set up.

In the case of a native boot, the very early bootstrap code creates a
simple non-PAE pagetable to map the kernel and physical memory.  When
the VM subsystem is initialized, it creates a proper pagetable which
respects the PAE mode, large pages, etc.

When booting under a hypervisor, there are many possibilities for what
paging environment the hypervisor establishes for the guest kernel, so
the constructon of the kernel's pagetable depends on the hypervisor.

In the case of Xen, the hypervisor boots the kernel with a fully
constructed pagetable, which is already using PAE if necessary.  Also,
Xen requires particular care when constructing pagetables to make sure
all pagetables are always mapped read-only.

In order to make this easier, kernel's initial pagetable construction
has been changed to only allocate and initialize a pagetable page if
there's no page already present in the pagetable.  This allows the Xen
paravirt backend to make a copy of the hypervisor-provided pagetable,
allowing the kernel to establish any more mappings it needs while
keeping the existing ones.

A slightly subtle point which is worth highlighting here is that Xen
requires all kernel mappings to share the same pte_t pages between all
pagetables, so that updating a kernel page's mapping in one pagetable
is reflected in all other pagetables.  This makes it possible to
allocate a page and attach it to a pagetable without having to
explicitly enumerate that page's mapping in all pagetables.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Ingo Molnar <mingo@elte.hu>

---
 arch/i386/kernel/paravirt.c |    3 
 arch/i386/mm/init.c         |  159 +++++++++++++++++++++++++++++--------------
 include/asm-i386/paravirt.h |   17 ++++
 include/asm-i386/pgtable.h  |   16 ++++
 4 files changed, 143 insertions(+), 52 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -549,6 +549,9 @@ struct paravirt_ops paravirt_ops = {
 #endif
 	.set_lazy_mode = paravirt_nop,
 
+	.pagetable_setup_start = native_pagetable_setup_start,
+	.pagetable_setup_done = native_pagetable_setup_done,
+
 	.flush_tlb_user = native_flush_tlb,
 	.flush_tlb_kernel = native_flush_tlb_global,
 	.flush_tlb_single = native_flush_tlb_single,
===================================================================
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -42,6 +42,7 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
+#include <asm/paravirt.h>
 
 unsigned int __VMALLOC_RESERVE = 128 << 20;
 
@@ -62,6 +63,7 @@ static pmd_t * __init one_md_table_init(
 		
 #ifdef CONFIG_X86_PAE
 	pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
+
 	paravirt_alloc_pd(__pa(pmd_table) >> PAGE_SHIFT);
 	set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
 	pud = pud_offset(pgd, 0);
@@ -83,12 +85,10 @@ static pte_t * __init one_page_table_ini
 {
 	if (pmd_none(*pmd)) {
 		pte_t *page_table = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
+
 		paravirt_alloc_pt(__pa(page_table) >> PAGE_SHIFT);
 		set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));
-		if (page_table != pte_offset_kernel(pmd, 0))
-			BUG();	
-
-		return page_table;
+		BUG_ON(page_table != pte_offset_kernel(pmd, 0));
 	}
 	
 	return pte_offset_kernel(pmd, 0);
@@ -119,7 +119,7 @@ static void __init page_table_range_init
 	pgd = pgd_base + pgd_idx;
 
 	for ( ; (pgd_idx < PTRS_PER_PGD) && (vaddr != end); pgd++, pgd_idx++) {
-		if (pgd_none(*pgd)) 
+		if (!(pgd_val(*pgd) & _PAGE_PRESENT))
 			one_md_table_init(pgd);
 		pud = pud_offset(pgd, vaddr);
 		pmd = pmd_offset(pud, vaddr);
@@ -158,7 +158,11 @@ static void __init kernel_physical_mappi
 	pfn = 0;
 
 	for (; pgd_idx < PTRS_PER_PGD; pgd++, pgd_idx++) {
-		pmd = one_md_table_init(pgd);
+		if (!(pgd_val(*pgd) & _PAGE_PRESENT))
+			pmd = one_md_table_init(pgd);
+		else
+			pmd = pmd_offset(pud_offset(pgd, PAGE_OFFSET), PAGE_OFFSET);
+
 		if (pfn >= max_low_pfn)
 			continue;
 		for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD && pfn < max_low_pfn; pmd++, pmd_idx++) {
@@ -167,20 +171,26 @@ static void __init kernel_physical_mappi
 			/* Map with big pages if possible, otherwise create normal page tables. */
 			if (cpu_has_pse) {
 				unsigned int address2 = (pfn + PTRS_PER_PTE - 1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1;
-
-				if (is_kernel_text(address) || is_kernel_text(address2))
-					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE_EXEC));
-				else
-					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE));
+				if (!pmd_present(*pmd)) {
+					if (is_kernel_text(address) || is_kernel_text(address2))
+						set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE_EXEC));
+					else
+						set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE));
+				}
 				pfn += PTRS_PER_PTE;
 			} else {
 				pte = one_page_table_init(pmd);
 
-				for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; pte++, pfn++, pte_ofs++) {
-						if (is_kernel_text(address))
-							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
-						else
-							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));
+				for (pte_ofs = 0;
+				     pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn;
+				     pte++, pfn++, pte_ofs++, address += PAGE_SIZE) {
+					if (pte_present(*pte))
+						continue;
+
+					if (is_kernel_text(address))
+						set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
+					else
+						set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));
 				}
 			}
 		}
@@ -337,44 +347,36 @@ extern void __init remap_numa_kva(void);
 #define remap_numa_kva() do {} while (0)
 #endif
 
-static void __init pagetable_init (void)
-{
-	unsigned long vaddr;
-	pgd_t *pgd_base = swapper_pg_dir;
-
+void __init native_pagetable_setup_start(pgd_t *base)
+{
 #ifdef CONFIG_X86_PAE
 	int i;
-	/* Init entries of the first-level page table to the zero page */
-	for (i = 0; i < PTRS_PER_PGD; i++)
-		set_pgd(pgd_base + i, __pgd(__pa(empty_zero_page) | _PAGE_PRESENT));
+
+	/*
+	 * Init entries of the first-level page table to the
+	 * zero page, if they haven't already been set up.
+	 *
+	 * In a normal native boot, we'll be running on a
+	 * pagetable rooted in swapper_pg_dir, but not in PAE
+	 * mode, so this will end up clobbering the mappings
+	 * for the lower 24Mbytes of the address space,
+	 * without affecting the kernel address space.
+	 */
+	for (i = 0; i < USER_PTRS_PER_PGD; i++)
+		set_pgd(&base[i],
+			__pgd(__pa(empty_zero_page) | _PAGE_PRESENT));
+
+	/* Make sure kernel address space is empty so that a pagetable
+	   will be allocated for it. */
+	memset(&base[USER_PTRS_PER_PGD], 0,
+	       KERNEL_PGD_PTRS * sizeof(pgd_t));
 #else
 	paravirt_alloc_pd(__pa(swapper_pg_dir) >> PAGE_SHIFT);
 #endif
-
-	/* Enable PSE if available */
-	if (cpu_has_pse) {
-		set_in_cr4(X86_CR4_PSE);
-	}
-
-	/* Enable PGE if available */
-	if (cpu_has_pge) {
-		set_in_cr4(X86_CR4_PGE);
-		__PAGE_KERNEL |= _PAGE_GLOBAL;
-		__PAGE_KERNEL_EXEC |= _PAGE_GLOBAL;
-	}
-
-	kernel_physical_mapping_init(pgd_base);
-	remap_numa_kva();
-
-	/*
-	 * Fixed mappings, only the page table structure has to be
-	 * created - mappings will be set by set_fixmap():
-	 */
-	vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
-	page_table_range_init(vaddr, 0, pgd_base);
-
-	permanent_kmaps_init(pgd_base);
-
+}
+
+void __init native_pagetable_setup_done(pgd_t *base)
+{
 #ifdef CONFIG_X86_PAE
 	/*
 	 * Add low memory identity-mappings - SMP needs it when
@@ -383,8 +385,63 @@ static void __init pagetable_init (void)
 	 * All user-space mappings are explicitly cleared after
 	 * SMP startup.
 	 */
-	set_pgd(&pgd_base[0], pgd_base[USER_PTRS_PER_PGD]);
-#endif
+	set_pgd(&base[0], base[USER_PTRS_PER_PGD]);
+#endif
+}
+
+/*
+ * Build a proper pagetable for the kernel mappings.  Up until this
+ * point, we've been running on some set of pagetables constructed by
+ * the boot process.
+ *
+ * If we're booting on native hardware, this will be a pagetable
+ * constructed in arch/i386/kernel/head.S, and not running in PAE mode
+ * (even if we'll end up running in PAE).  The root of the pagetable
+ * will be swapper_pg_dir.
+ *
+ * If we're booting paravirtualized under a hypervisor, then there are
+ * more options: we may already be running PAE, and the pagetable may
+ * or may not be based in swapper_pg_dir.  In any case,
+ * paravirt_pagetable_setup_start() will set up swapper_pg_dir
+ * appropriately for the rest of the initialization to work.
+ *
+ * In general, pagetable_init() assumes that the pagetable may already
+ * be partially populated, and so it avoids stomping on any existing
+ * mappings.
+ */
+static void __init pagetable_init (void)
+{
+	unsigned long vaddr, end;
+	pgd_t *pgd_base = swapper_pg_dir;
+
+	paravirt_pagetable_setup_start(pgd_base);
+
+	/* Enable PSE if available */
+	if (cpu_has_pse) {
+		set_in_cr4(X86_CR4_PSE);
+	}
+
+	/* Enable PGE if available */
+	if (cpu_has_pge) {
+		set_in_cr4(X86_CR4_PGE);
+		__PAGE_KERNEL |= _PAGE_GLOBAL;
+		__PAGE_KERNEL_EXEC |= _PAGE_GLOBAL;
+	}
+
+	kernel_physical_mapping_init(pgd_base);
+	remap_numa_kva();
+
+	/*
+	 * Fixed mappings, only the page table structure has to be
+	 * created - mappings will be set by set_fixmap():
+	 */
+	vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
+	end = (FIXADDR_TOP + PMD_SIZE - 1) & PMD_MASK;
+	page_table_range_init(vaddr, end, pgd_base);
+
+	permanent_kmaps_init(pgd_base);
+
+	paravirt_pagetable_setup_done(pgd_base);
 }
 
 #if defined(CONFIG_SOFTWARE_SUSPEND) || defined(CONFIG_ACPI_SLEEP)
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -2,10 +2,11 @@
 #define __ASM_PARAVIRT_H
 /* Various instructions on x86 need to be replaced for
  * para-virtualization: those hooks are defined here. */
+
+#ifdef CONFIG_PARAVIRT
 #include <linux/stringify.h>
 #include <asm/page.h>
 
-#ifdef CONFIG_PARAVIRT
 /* These are the most performance critical ops, so we want to be able to patch
  * callers */
 #define PARAVIRT_IRQ_DISABLE 0
@@ -48,6 +49,9 @@ struct paravirt_ops
 	void (*arch_setup)(void);
 	char *(*memory_setup)(void);
 	void (*init_IRQ)(void);
+
+	void (*pagetable_setup_start)(pgd_t *pgd_base);
+	void (*pagetable_setup_done)(pgd_t *pgd_base);
 
 	void (*banner)(void);
 
@@ -363,6 +367,17 @@ static inline void setup_secondary_clock
 }
 #endif
 
+static inline void paravirt_pagetable_setup_start(pgd_t *base)
+{
+	if (paravirt_ops.pagetable_setup_start)
+		(*paravirt_ops.pagetable_setup_start)(base);
+}
+
+static inline void paravirt_pagetable_setup_done(pgd_t *base)
+{
+	if (paravirt_ops.pagetable_setup_done)
+		(*paravirt_ops.pagetable_setup_done)(base);
+}
 
 void native_set_pte(pte_t *ptep, pte_t pteval);
 void native_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pteval);
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -497,6 +497,22 @@ do {									\
  * tables contain all the necessary information.
  */
 #define update_mmu_cache(vma,address,pte) do { } while (0)
+
+void native_pagetable_setup_start(pgd_t *base);
+void native_pagetable_setup_done(pgd_t *base);
+
+#ifndef CONFIG_PARAVIRT
+static inline void paravirt_pagetable_setup_start(pgd_t *base)
+{
+	native_pagetable_setup_start(base);
+}
+
+static inline void paravirt_pagetable_setup_done(pgd_t *base)
+{
+	native_pagetable_setup_done(base);
+}
+#endif	/* !CONFIG_PARAVIRT */
+
 #endif /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_FLATMEM

-- 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 06/26] Xen-paravirt_ops: paravirt_ops: allocate a fixmap slot
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell

[-- Attachment #1: paravirt-fixmap.patch --]
[-- Type: text/plain, Size: 1035 bytes --]

Allocate a fixmap slot for use by a paravirt_ops implementation.  This
is intended for early-boot bootstrap mappings.  Once the zones and
allocator have been set up, it would be better to use get_vm_area() to
allocate some virtual space.

Xen uses this to map the hypervisor's shared info page, which doesn't
have a pseudo-physical page number, and therefore can't be mapped
ordinarily.  It is needed early because it contains the vcpu state,
including the interrupt mask.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

---
 include/asm-i386/fixmap.h |    3 +++
 1 file changed, 3 insertions(+)

===================================================================
--- a/include/asm-i386/fixmap.h
+++ b/include/asm-i386/fixmap.h
@@ -86,6 +86,9 @@ enum fixed_addresses {
 #ifdef CONFIG_PCI_MMCONFIG
 	FIX_PCIE_MCFG,
 #endif
+#ifdef CONFIG_PARAVIRT
+	FIX_PARAVIRT_BOOTMAP,
+#endif
 	__end_of_permanent_fixed_addresses,
 	/* temporary boot-time mappings, used before ioremap() is functional */
 #define NR_FIX_BTMAPS	16

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 06/26] Xen-paravirt_ops: paravirt_ops: allocate a fixmap slot
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Chris Wright, virtualization, xen-devel, Andrew Morton, linux-kernel

[-- Attachment #1: paravirt-fixmap.patch --]
[-- Type: text/plain, Size: 1063 bytes --]

Allocate a fixmap slot for use by a paravirt_ops implementation.  This
is intended for early-boot bootstrap mappings.  Once the zones and
allocator have been set up, it would be better to use get_vm_area() to
allocate some virtual space.

Xen uses this to map the hypervisor's shared info page, which doesn't
have a pseudo-physical page number, and therefore can't be mapped
ordinarily.  It is needed early because it contains the vcpu state,
including the interrupt mask.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

---
 include/asm-i386/fixmap.h |    3 +++
 1 file changed, 3 insertions(+)

===================================================================
--- a/include/asm-i386/fixmap.h
+++ b/include/asm-i386/fixmap.h
@@ -86,6 +86,9 @@ enum fixed_addresses {
 #ifdef CONFIG_PCI_MMCONFIG
 	FIX_PCIE_MCFG,
 #endif
+#ifdef CONFIG_PARAVIRT
+	FIX_PARAVIRT_BOOTMAP,
+#endif
 	__end_of_permanent_fixed_addresses,
 	/* temporary boot-time mappings, used before ioremap() is functional */
 #define NR_FIX_BTMAPS	16

-- 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 07/26] Xen-paravirt_ops: Allow paravirt backend to choose kernel PMD sharing
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (5 preceding siblings ...)
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-01 23:24 ` Jeremy Fitzhardinge
  2007-03-16  9:31     ` Ingo Molnar
  2007-03-01 23:24 ` [patch 08/26] Xen-paravirt_ops: add hooks to intercept mm creation and destruction Jeremy Fitzhardinge
                   ` (20 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell,
	William Lee Irwin III

[-- Attachment #1: shared-kernel-pmd.patch --]
[-- Type: text/plain, Size: 11587 bytes --]

Normally when running in PAE mode, the 4th PMD maps the kernel address
space, which can be shared among all processes (since they all need
the same kernel mappings.

Xen, however, does not allow guests to have the kernel pmd shared
between page tables, so parameterize pgtable.c to allow both modes of
operation.

There are several side-effects of this.  One is that vmalloc will
update the kernel address space mappings, and those updates need to be
propagated into all processes if the kernel mapping are not
intrinsically shared.  In the non-PAE case, this is done by
maintaining a pgd_list of all processes; this list is used when all
process pagetables must be updated.  pgd_list is threaded via
otherwise unused entries in the page structure for the pgd, which
means that the pgd must be page-sized for this to work.

Normally the PAE pgd is only 4x64 byte entries large, but Xen requires
the PAE pgd to page aligned anyway, so this patch forces the pgd to be
page aligned+sized when the kernel pmd is unshared, to accomodate both
these requirements.

Also, since there may be several distinct kernel pmds (if the
user/kernel split is below 3G), there's no point in allocating them
from a slab cache; they're just allocated with get_free_page and
initialized appropriately.  (Of course the could be cached if there is
just a single kernel pmd - which is the default with a 3G user/kernel
split - but it doesn't seem worthwhile to add yet another case into
this code).

[ Many thanks to wli for review comments. ]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: William Lee Irwin III <wli@holomorphy.com>
Cc: Zachary Amsden <zach@vmware.com>
---
 arch/i386/kernel/paravirt.c            |    1 
 arch/i386/mm/fault.c                   |    6 +-
 arch/i386/mm/init.c                    |   18 +++++-
 arch/i386/mm/pageattr.c                |    2 
 arch/i386/mm/pgtable.c                 |   84 ++++++++++++++++++++++++++------
 include/asm-i386/paravirt.h            |    1 
 include/asm-i386/pgtable-2level-defs.h |    2 
 include/asm-i386/pgtable-2level.h      |    2 
 include/asm-i386/pgtable-3level-defs.h |    6 ++
 include/asm-i386/pgtable-3level.h      |    2 
 include/asm-i386/pgtable.h             |    7 ++
 11 files changed, 105 insertions(+), 26 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -604,6 +604,7 @@ struct paravirt_ops paravirt_ops = {
 	.name = "bare hardware",
 	.paravirt_enabled = 0,
 	.kernel_rpl = 0,
+	.shared_kernel_pmd = 1,	/* Only used when CONFIG_X86_PAE is set */
 
  	.patch = native_patch,
 	.banner = default_banner,
===================================================================
--- a/arch/i386/mm/fault.c
+++ b/arch/i386/mm/fault.c
@@ -588,8 +588,7 @@ do_sigbus:
 	force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
 }
 
-#ifndef CONFIG_X86_PAE
-void vmalloc_sync_all(void)
+void _vmalloc_sync_all(void)
 {
 	/*
 	 * Note that races in the updates of insync and start aren't
@@ -600,6 +599,8 @@ void vmalloc_sync_all(void)
 	static DECLARE_BITMAP(insync, PTRS_PER_PGD);
 	static unsigned long start = TASK_SIZE;
 	unsigned long address;
+
+	BUG_ON(SHARED_KERNEL_PMD);
 
 	BUILD_BUG_ON(TASK_SIZE & ~PGDIR_MASK);
 	for (address = start; address >= TASK_SIZE; address += PGDIR_SIZE) {
@@ -623,4 +624,3 @@ void vmalloc_sync_all(void)
 			start = address + PGDIR_SIZE;
 	}
 }
-#endif
===================================================================
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -715,6 +715,8 @@ struct kmem_cache *pmd_cache;
 
 void __init pgtable_cache_init(void)
 {
+	size_t pgd_size = PTRS_PER_PGD*sizeof(pgd_t);
+
 	if (PTRS_PER_PMD > 1) {
 		pmd_cache = kmem_cache_create("pmd",
 					PTRS_PER_PMD*sizeof(pmd_t),
@@ -724,13 +726,23 @@ void __init pgtable_cache_init(void)
 					NULL);
 		if (!pmd_cache)
 			panic("pgtable_cache_init(): cannot create pmd cache");
+
+		if (!SHARED_KERNEL_PMD) {
+			/* If we're in PAE mode and have a non-shared
+			   kernel pmd, then the pgd size must be a
+			   page size.  This is because the pgd_list
+			   links through the page structure, so there
+			   can only be one pgd per page for this to
+			   work. */
+			pgd_size = PAGE_SIZE;
+		}
 	}
 	pgd_cache = kmem_cache_create("pgd",
-				PTRS_PER_PGD*sizeof(pgd_t),
-				PTRS_PER_PGD*sizeof(pgd_t),
+				pgd_size,
+				pgd_size,
 				0,
 				pgd_ctor,
-				PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
+				(!SHARED_KERNEL_PMD) ? pgd_dtor : NULL);
 	if (!pgd_cache)
 		panic("pgtable_cache_init(): Cannot create pgd cache");
 }
===================================================================
--- a/arch/i386/mm/pageattr.c
+++ b/arch/i386/mm/pageattr.c
@@ -91,7 +91,7 @@ static void set_pmd_pte(pte_t *kpte, uns
 	unsigned long flags;
 
 	set_pte_atomic(kpte, pte); 	/* change init_mm */
-	if (PTRS_PER_PMD > 1)
+	if (SHARED_KERNEL_PMD)
 		return;
 
 	spin_lock_irqsave(&pgd_lock, flags);
===================================================================
--- a/arch/i386/mm/pgtable.c
+++ b/arch/i386/mm/pgtable.c
@@ -238,35 +238,56 @@ static inline void pgd_list_del(pgd_t *p
 		set_page_private(next, (unsigned long)pprev);
 }
 
+#if (PTRS_PER_PMD == 1)
+/* Non-PAE pgd constructor */
 void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
 {
 	unsigned long flags;
 
-	if (PTRS_PER_PMD == 1) {
-		memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
-		spin_lock_irqsave(&pgd_lock, flags);
-	}
+	/* !PAE, no pagetable sharing */
+	memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
 
 	clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD,
 			swapper_pg_dir + USER_PTRS_PER_PGD,
 			KERNEL_PGD_PTRS);
 
-	if (PTRS_PER_PMD > 1)
-		return;
+	spin_lock_irqsave(&pgd_lock, flags);
 
 	/* must happen under lock */
 	paravirt_alloc_pd_clone(__pa(pgd) >> PAGE_SHIFT,
-			__pa(swapper_pg_dir) >> PAGE_SHIFT,
-			USER_PTRS_PER_PGD, PTRS_PER_PGD - USER_PTRS_PER_PGD);
+				__pa(swapper_pg_dir) >> PAGE_SHIFT,
+				USER_PTRS_PER_PGD,
+				KERNEL_PGD_PTRS);
 
 	pgd_list_add(pgd);
 	spin_unlock_irqrestore(&pgd_lock, flags);
 }
-
-/* never called when PTRS_PER_PMD > 1 */
+#else  /* PTRS_PER_PMD > 1 */
+/* PAE pgd constructor */
+void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+{
+	/* PAE, kernel PMD may be shared */
+
+	if (SHARED_KERNEL_PMD) {
+		clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD,
+				swapper_pg_dir + USER_PTRS_PER_PGD,
+				KERNEL_PGD_PTRS);
+	} else {
+		unsigned long flags;
+
+		memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
+		spin_lock_irqsave(&pgd_lock, flags);
+		pgd_list_add(pgd);
+		spin_unlock_irqrestore(&pgd_lock, flags);
+	}
+}
+#endif	/* PTRS_PER_PMD */
+
 void pgd_dtor(void *pgd, struct kmem_cache *cache, unsigned long unused)
 {
 	unsigned long flags; /* can be called from interrupt context */
+
+	BUG_ON(SHARED_KERNEL_PMD);
 
 	paravirt_release_pd(__pa(pgd) >> PAGE_SHIFT);
 	spin_lock_irqsave(&pgd_lock, flags);
@@ -274,6 +295,37 @@ void pgd_dtor(void *pgd, struct kmem_cac
 	spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
+#define UNSHARED_PTRS_PER_PGD				\
+	(SHARED_KERNEL_PMD ? USER_PTRS_PER_PGD : PTRS_PER_PGD)
+
+/* If we allocate a pmd for part of the kernel address space, then
+   make sure its initialized with the appropriate kernel mappings.
+   Otherwise use a cached zeroed pmd.  */
+static pmd_t *pmd_cache_alloc(int idx)
+{
+	pmd_t *pmd;
+
+	if (idx >= USER_PTRS_PER_PGD) {
+		pmd = (pmd_t *)__get_free_page(GFP_KERNEL);
+
+		if (pmd)
+			memcpy(pmd,
+			       (void *)pgd_page_vaddr(swapper_pg_dir[idx]),
+			       sizeof(pmd_t) * PTRS_PER_PMD);
+	} else
+		pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL);
+
+	return pmd;
+}
+
+static void pmd_cache_free(pmd_t *pmd, int idx)
+{
+	if (idx >= USER_PTRS_PER_PGD)
+		free_page((unsigned long)pmd);
+	else
+		kmem_cache_free(pmd_cache, pmd);
+}
+
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 	int i;
@@ -282,10 +334,12 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	if (PTRS_PER_PMD == 1 || !pgd)
 		return pgd;
 
-	for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
-		pmd_t *pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL);
+ 	for (i = 0; i < UNSHARED_PTRS_PER_PGD; ++i) {
+		pmd_t *pmd = pmd_cache_alloc(i);
+
 		if (!pmd)
 			goto out_oom;
+
 		paravirt_alloc_pd(__pa(pmd) >> PAGE_SHIFT);
 		set_pgd(&pgd[i], __pgd(1 + __pa(pmd)));
 	}
@@ -296,7 +350,7 @@ out_oom:
 		pgd_t pgdent = pgd[i];
 		void* pmd = (void *)__va(pgd_val(pgdent)-1);
 		paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
-		kmem_cache_free(pmd_cache, pmd);
+		pmd_cache_free(pmd, i);
 	}
 	kmem_cache_free(pgd_cache, pgd);
 	return NULL;
@@ -308,11 +362,11 @@ void pgd_free(pgd_t *pgd)
 
 	/* in the PAE case user pgd entries are overwritten before usage */
 	if (PTRS_PER_PMD > 1)
-		for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
+		for (i = 0; i < UNSHARED_PTRS_PER_PGD; ++i) {
 			pgd_t pgdent = pgd[i];
 			void* pmd = (void *)__va(pgd_val(pgdent)-1);
 			paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
-			kmem_cache_free(pmd_cache, pmd);
+			pmd_cache_free(pmd, i);
 		}
 	/* in the non-PAE case, free_pgtables() clears user pgd entries */
 	kmem_cache_free(pgd_cache, pgd);
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -34,6 +34,7 @@ struct paravirt_ops
 struct paravirt_ops
 {
 	unsigned int kernel_rpl;
+	int shared_kernel_pmd;
  	int paravirt_enabled;
 	const char *name;
 
===================================================================
--- a/include/asm-i386/pgtable-2level-defs.h
+++ b/include/asm-i386/pgtable-2level-defs.h
@@ -1,5 +1,7 @@
 #ifndef _I386_PGTABLE_2LEVEL_DEFS_H
 #define _I386_PGTABLE_2LEVEL_DEFS_H
+
+#define SHARED_KERNEL_PMD	0
 
 /*
  * traditional i386 two-level paging structure:
===================================================================
--- a/include/asm-i386/pgtable-2level.h
+++ b/include/asm-i386/pgtable-2level.h
@@ -65,6 +65,4 @@ static inline int pte_exec_kernel(pte_t 
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })
 
-void vmalloc_sync_all(void);
-
 #endif /* _I386_PGTABLE_2LEVEL_H */
===================================================================
--- a/include/asm-i386/pgtable-3level-defs.h
+++ b/include/asm-i386/pgtable-3level-defs.h
@@ -1,5 +1,11 @@
 #ifndef _I386_PGTABLE_3LEVEL_DEFS_H
 #define _I386_PGTABLE_3LEVEL_DEFS_H
+
+#ifdef CONFIG_PARAVIRT
+#define SHARED_KERNEL_PMD	(paravirt_ops.shared_kernel_pmd)
+#else
+#define SHARED_KERNEL_PMD	1
+#endif
 
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
===================================================================
--- a/include/asm-i386/pgtable-3level.h
+++ b/include/asm-i386/pgtable-3level.h
@@ -180,6 +180,4 @@ static inline pmd_t pfn_pmd(unsigned lon
 
 #define __pmd_free_tlb(tlb, x)		do { } while (0)
 
-#define vmalloc_sync_all() ((void)0)
-
 #endif /* _I386_PGTABLE_3LEVEL_H */
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -244,6 +244,13 @@ static inline pte_t pte_mkwrite(pte_t pt
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
 static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= _PAGE_PSE; return pte; }
 
+extern void _vmalloc_sync_all(void);
+static inline void vmalloc_sync_all(void)
+{
+	if (!SHARED_KERNEL_PMD)
+		_vmalloc_sync_all();
+}
+
 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>
 #else

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 08/26] Xen-paravirt_ops: add hooks to intercept mm creation and destruction
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (6 preceding siblings ...)
  2007-03-01 23:24 ` [patch 07/26] Xen-paravirt_ops: Allow paravirt backend to choose kernel PMD sharing Jeremy Fitzhardinge
@ 2007-03-01 23:24 ` Jeremy Fitzhardinge
  2007-03-16  9:30     ` Ingo Molnar
  2007-03-01 23:24 ` [patch 09/26] Xen-paravirt_ops: remove HAVE_ARCH_MM_LIFETIME, define no-op architecture implementations Jeremy Fitzhardinge
                   ` (19 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, linux-arch

[-- Attachment #1: mm-lifetime-hooks.patch --]
[-- Type: text/plain, Size: 3994 bytes --]

Add hooks to allow a paravirt implementation to track the lifetime of
an mm.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-arch@vger.kernel.org
---
 arch/i386/kernel/paravirt.c    |    4 ++++
 include/asm-i386/mmu_context.h |    8 ++++++--
 include/asm-i386/paravirt.h    |   24 ++++++++++++++++++++++++
 include/linux/sched.h          |    6 ++++++
 kernel/fork.c                  |    2 ++
 mm/mmap.c                      |    3 +++
 6 files changed, 45 insertions(+), 2 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -591,6 +591,10 @@ struct paravirt_ops paravirt_ops = {
 	.irq_enable_sysexit = native_irq_enable_sysexit,
 	.iret = native_iret,
 
+	.dup_mmap = paravirt_nop,
+	.exit_mmap = paravirt_nop,
+	.activate_mm = paravirt_nop,
+
 	.startup_ipi_hook = paravirt_nop,
 };
 
===================================================================
--- a/include/asm-i386/mmu_context.h
+++ b/include/asm-i386/mmu_context.h
@@ -5,6 +5,7 @@
 #include <asm/atomic.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
+#include <asm/paravirt.h>
 
 /*
  * Used for LDT copy/destruction.
@@ -65,7 +66,10 @@ static inline void switch_mm(struct mm_s
 #define deactivate_mm(tsk, mm)			\
 	asm("movl %0,%%gs": :"r" (0));
 
-#define activate_mm(prev, next) \
-	switch_mm((prev),(next),NULL)
+#define activate_mm(prev, next)				\
+	do {						\
+		arch_activate_mm(prev, next);		\
+		switch_mm((prev),(next),NULL);		\
+	} while(0);
 
 #endif
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -118,6 +118,12 @@ struct paravirt_ops
 	void (*io_delay)(void);
 	void (*const_udelay)(unsigned long loops);
 
+	void (*activate_mm)(struct mm_struct *prev,
+			    struct mm_struct *next);
+	void (*dup_mmap)(struct mm_struct *oldmm,
+			 struct mm_struct *mm);
+	void (*exit_mmap)(struct mm_struct *mm);
+
 #ifdef CONFIG_X86_LOCAL_APIC
 	void (*apic_write)(unsigned long reg, unsigned long v);
 	void (*apic_write_atomic)(unsigned long reg, unsigned long v);
@@ -398,6 +404,24 @@ static inline void startup_ipi_hook(int 
 }
 #endif
 
+#define __HAVE_ARCH_MM_LIFETIME
+static inline void arch_activate_mm(struct mm_struct *prev,
+				    struct mm_struct *next)
+{
+	paravirt_ops.activate_mm(prev, next);
+}
+
+static inline void arch_dup_mmap(struct mm_struct *oldmm,
+				 struct mm_struct *mm)
+{
+	paravirt_ops.dup_mmap(oldmm, mm);
+}
+
+static inline void arch_exit_mmap(struct mm_struct *mm)
+{
+	paravirt_ops.exit_mmap(mm);
+}
+
 #define __flush_tlb() paravirt_ops.flush_tlb_user()
 #define __flush_tlb_global() paravirt_ops.flush_tlb_kernel()
 #define __flush_tlb_single(addr) paravirt_ops.flush_tlb_single(addr)
===================================================================
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -374,6 +374,12 @@ struct mm_struct {
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
 };
+
+#ifndef __HAVE_ARCH_MM_LIFETIME
+#define arch_activate_mm(prev, next)	do {} while(0)
+#define arch_dup_mmap(oldmm, mm)	do {} while(0)
+#define arch_exit_mmap(mm)		do {} while(0)
+#endif
 
 struct sighand_struct {
 	atomic_t		count;
===================================================================
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -286,6 +286,8 @@ static inline int dup_mmap(struct mm_str
 		if (retval)
 			goto out;
 	}
+	/* a new mm has just been created */
+	arch_dup_mmap(oldmm, mm);
 	retval = 0;
 out:
 	up_write(&mm->mmap_sem);
===================================================================
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1976,6 +1976,9 @@ void exit_mmap(struct mm_struct *mm)
 	struct vm_area_struct *vma = mm->mmap;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
+
+	/* mm's last user has gone, and its about to be pulled down */
+	arch_exit_mmap(mm);
 
 	lru_add_drain();
 	flush_cache_mm(mm);

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 09/26] Xen-paravirt_ops: remove HAVE_ARCH_MM_LIFETIME, define no-op architecture implementations
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (7 preceding siblings ...)
  2007-03-01 23:24 ` [patch 08/26] Xen-paravirt_ops: add hooks to intercept mm creation and destruction Jeremy Fitzhardinge
@ 2007-03-01 23:24 ` Jeremy Fitzhardinge
  2007-03-16  9:27     ` Ingo Molnar
  2007-03-01 23:24 ` [patch 10/26] Xen-paravirt_ops: rename struct paravirt_patch to paravirt_patch_site for clarity Jeremy Fitzhardinge
                   ` (18 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, linux-arch,
	James Bottomley

[-- Attachment #1: kill-HAVE_ARCH_MM_LIFETIME.patch --]
[-- Type: text/plain, Size: 12359 bytes --]

akpm:
> Can we lose __HAVE_ARCH_MM_LIFETIME?  Just define these (preferably in C,
> not in cpp) in the appropriate include/asm-foo/ files?

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-arch@vger.kernel.org
Cc: James Bottomley <James.Bottomley@SteelEye.com>

---
 include/asm-alpha/mmu_context.h     |    1 +
 include/asm-arm/mmu_context.h       |    1 +
 include/asm-arm26/mmu_context.h     |    2 ++
 include/asm-avr32/mmu_context.h     |    1 +
 include/asm-cris/mmu_context.h      |    2 ++
 include/asm-frv/mmu_context.h       |    1 +
 include/asm-generic/mm_hooks.h      |   18 ++++++++++++++++++
 include/asm-h8300/mmu_context.h     |    1 +
 include/asm-i386/mmu_context.h      |   11 ++++++++++-
 include/asm-i386/paravirt.h         |    5 ++---
 include/asm-ia64/mmu_context.h      |    1 +
 include/asm-m32r/mmu_context.h      |    1 +
 include/asm-m68k/mmu_context.h      |    1 +
 include/asm-m68knommu/mmu_context.h |    1 +
 include/asm-mips/mmu_context.h      |    1 +
 include/asm-parisc/mmu_context.h    |    1 +
 include/asm-powerpc/mmu_context.h   |    1 +
 include/asm-ppc/mmu_context.h       |    1 +
 include/asm-s390/mmu_context.h      |    2 ++
 include/asm-sh/mmu_context.h        |    1 +
 include/asm-sh64/mmu_context.h      |    2 +-
 include/asm-sparc/mmu_context.h     |    2 ++
 include/asm-sparc64/mmu_context.h   |    1 +
 include/asm-um/mmu_context.h        |    2 ++
 include/asm-v850/mmu_context.h      |    2 ++
 include/asm-x86_64/mmu_context.h    |    1 +
 include/asm-xtensa/mmu_context.h    |    1 +
 include/linux/sched.h               |    6 ------
 mm/mmap.c                           |    1 +
 29 files changed, 61 insertions(+), 11 deletions(-)

===================================================================
--- a/include/asm-alpha/mmu_context.h
+++ b/include/asm-alpha/mmu_context.h
@@ -10,6 +10,7 @@
 #include <asm/system.h>
 #include <asm/machvec.h>
 #include <asm/compiler.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * Force a context reload. This is needed when we change the page
===================================================================
--- a/include/asm-arm/mmu_context.h
+++ b/include/asm-arm/mmu_context.h
@@ -16,6 +16,7 @@
 #include <linux/compiler.h>
 #include <asm/cacheflush.h>
 #include <asm/proc-fns.h>
+#include <asm-generic/mm_hooks.h>
 
 void __check_kvm_seq(struct mm_struct *mm);
 
===================================================================
--- a/include/asm-arm26/mmu_context.h
+++ b/include/asm-arm26/mmu_context.h
@@ -12,6 +12,8 @@
  */
 #ifndef __ASM_ARM_MMU_CONTEXT_H
 #define __ASM_ARM_MMU_CONTEXT_H
+
+#include <asm-generic/mm_hooks.h>
 
 #define init_new_context(tsk,mm)	0
 #define destroy_context(mm)		do { } while(0)
===================================================================
--- a/include/asm-avr32/mmu_context.h
+++ b/include/asm-avr32/mmu_context.h
@@ -15,6 +15,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/sysreg.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * The MMU "context" consists of two things:
===================================================================
--- a/include/asm-cris/mmu_context.h
+++ b/include/asm-cris/mmu_context.h
@@ -1,5 +1,7 @@
 #ifndef __CRIS_MMU_CONTEXT_H
 #define __CRIS_MMU_CONTEXT_H
+
+#include <asm-generic/mm_hooks.h>
 
 extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
 extern void get_mmu_context(struct mm_struct *mm);
===================================================================
--- a/include/asm-frv/mmu_context.h
+++ b/include/asm-frv/mmu_context.h
@@ -15,6 +15,7 @@
 #include <asm/setup.h>
 #include <asm/page.h>
 #include <asm/pgalloc.h>
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- /dev/null
+++ b/include/asm-generic/mm_hooks.h
@@ -0,0 +1,18 @@
+/*
+ * Define generic no-op hooks for arch_dup_mmap and arch_exit_mmap, to
+ * be included in asm-FOO/mmu_context.h for any arch FOO which doesn't
+ * need to hook these.
+ */
+#ifndef _ASM_GENERIC_MM_HOOKS_H
+#define _ASM_GENERIC_MM_HOOKS_H
+
+static inline void arch_dup_mmap(struct mm_struct *oldmm,
+				 struct mm_struct *mm)
+{
+}
+
+static inline void arch_exit_mmap(struct mm_struct *mm)
+{
+}
+
+#endif	/* _ASM_GENERIC_MM_HOOKS_H */
===================================================================
--- a/include/asm-h8300/mmu_context.h
+++ b/include/asm-h8300/mmu_context.h
@@ -4,6 +4,7 @@
 #include <asm/setup.h>
 #include <asm/page.h>
 #include <asm/pgalloc.h>
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-i386/mmu_context.h
+++ b/include/asm-i386/mmu_context.h
@@ -6,6 +6,15 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
+#ifndef CONFIG_PARAVIRT
+#include <asm-generic/mm_hooks.h>
+
+static inline void paravirt_activate_mm(struct mm_struct *prev,
+					struct mm_struct *next)
+{
+}
+#endif	/* !CONFIG_PARAVIRT */
+
 
 /*
  * Used for LDT copy/destruction.
@@ -68,7 +77,7 @@ static inline void switch_mm(struct mm_s
 
 #define activate_mm(prev, next)				\
 	do {						\
-		arch_activate_mm(prev, next);		\
+		paravirt_activate_mm(prev, next);	\
 		switch_mm((prev),(next),NULL);		\
 	} while(0);
 
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -403,9 +403,8 @@ static inline void startup_ipi_hook(int 
 }
 #endif
 
-#define __HAVE_ARCH_MM_LIFETIME
-static inline void arch_activate_mm(struct mm_struct *prev,
-				    struct mm_struct *next)
+static inline void paravirt_activate_mm(struct mm_struct *prev,
+					struct mm_struct *next)
 {
 	paravirt_ops.activate_mm(prev, next);
 }
===================================================================
--- a/include/asm-ia64/mmu_context.h
+++ b/include/asm-ia64/mmu_context.h
@@ -29,6 +29,7 @@
 #include <linux/spinlock.h>
 
 #include <asm/processor.h>
+#include <asm-generic/mm_hooks.h>
 
 struct ia64_ctx {
 	spinlock_t lock;
===================================================================
--- a/include/asm-m32r/mmu_context.h
+++ b/include/asm-m32r/mmu_context.h
@@ -15,6 +15,7 @@
 #include <asm/pgalloc.h>
 #include <asm/mmu.h>
 #include <asm/tlbflush.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * Cache of MMU context last used.
===================================================================
--- a/include/asm-m68k/mmu_context.h
+++ b/include/asm-m68k/mmu_context.h
@@ -1,6 +1,7 @@
 #ifndef __M68K_MMU_CONTEXT_H
 #define __M68K_MMU_CONTEXT_H
 
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-m68knommu/mmu_context.h
+++ b/include/asm-m68knommu/mmu_context.h
@@ -4,6 +4,7 @@
 #include <asm/setup.h>
 #include <asm/page.h>
 #include <asm/pgalloc.h>
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-mips/mmu_context.h
+++ b/include/asm-mips/mmu_context.h
@@ -20,6 +20,7 @@
 #include <asm/mipsmtregs.h>
 #include <asm/smtc.h>
 #endif /* SMTC */
+#include <asm-generic/mm_hooks.h>
 
 /*
  * For the fast tlb miss handlers, we keep a per cpu array of pointers
===================================================================
--- a/include/asm-parisc/mmu_context.h
+++ b/include/asm-parisc/mmu_context.h
@@ -5,6 +5,7 @@
 #include <asm/atomic.h>
 #include <asm/pgalloc.h>
 #include <asm/pgtable.h>
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-powerpc/mmu_context.h
+++ b/include/asm-powerpc/mmu_context.h
@@ -10,6 +10,7 @@
 #include <linux/mm.h>	
 #include <asm/mmu.h>	
 #include <asm/cputable.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * Copyright (C) 2001 PPC 64 Team, IBM Corp
===================================================================
--- a/include/asm-ppc/mmu_context.h
+++ b/include/asm-ppc/mmu_context.h
@@ -6,6 +6,7 @@
 #include <asm/bitops.h>
 #include <asm/mmu.h>
 #include <asm/cputable.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * On 32-bit PowerPC 6xx/7xx/7xxx CPUs, we use a set of 16 VSIDs
===================================================================
--- a/include/asm-s390/mmu_context.h
+++ b/include/asm-s390/mmu_context.h
@@ -10,6 +10,8 @@
 #define __S390_MMU_CONTEXT_H
 
 #include <asm/pgalloc.h>
+#include <asm-generic/mm_hooks.h>
+
 /*
  * get a new mmu context.. S390 don't know about contexts.
  */
===================================================================
--- a/include/asm-sh/mmu_context.h
+++ b/include/asm-sh/mmu_context.h
@@ -12,6 +12,7 @@
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
 #include <asm/io.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * The MMU "context" consists of two things:
===================================================================
--- a/include/asm-sh64/mmu_context.h
+++ b/include/asm-sh64/mmu_context.h
@@ -27,7 +27,7 @@ extern unsigned long mmu_context_cache;
 extern unsigned long mmu_context_cache;
 
 #include <asm/page.h>
-
+#include <asm-generic/mm_hooks.h>
 
 /* Current mm's pgd */
 extern pgd_t *mmu_pdtp_cache;
===================================================================
--- a/include/asm-sparc/mmu_context.h
+++ b/include/asm-sparc/mmu_context.h
@@ -4,6 +4,8 @@
 #include <asm/btfixup.h>
 
 #ifndef __ASSEMBLY__
+
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-sparc64/mmu_context.h
+++ b/include/asm-sparc64/mmu_context.h
@@ -9,6 +9,7 @@
 #include <linux/spinlock.h>
 #include <asm/system.h>
 #include <asm/spitfire.h>
+#include <asm-generic/mm_hooks.h>
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
===================================================================
--- a/include/asm-um/mmu_context.h
+++ b/include/asm-um/mmu_context.h
@@ -5,6 +5,8 @@
 
 #ifndef __UM_MMU_CONTEXT_H
 #define __UM_MMU_CONTEXT_H
+
+#include <asm-generic/mm_hooks.h>
 
 #include "linux/sched.h"
 #include "choose-mode.h"
===================================================================
--- a/include/asm-v850/mmu_context.h
+++ b/include/asm-v850/mmu_context.h
@@ -1,5 +1,7 @@
 #ifndef __V850_MMU_CONTEXT_H__
 #define __V850_MMU_CONTEXT_H__
+
+#include <asm-generic/mm_hooks.h>
 
 #define destroy_context(mm)		((void)0)
 #define init_new_context(tsk,mm)	0
===================================================================
--- a/include/asm-x86_64/mmu_context.h
+++ b/include/asm-x86_64/mmu_context.h
@@ -7,6 +7,7 @@
 #include <asm/pda.h>
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
+#include <asm-generic/mm_hooks.h>
 
 /*
  * possibly do the LDT unload here?
===================================================================
--- a/include/asm-xtensa/mmu_context.h
+++ b/include/asm-xtensa/mmu_context.h
@@ -18,6 +18,7 @@
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
+#include <asm-generic/mm_hooks.h>
 
 #define XCHAL_MMU_ASID_BITS	8
 
===================================================================
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -374,12 +374,6 @@ struct mm_struct {
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
 };
-
-#ifndef __HAVE_ARCH_MM_LIFETIME
-#define arch_activate_mm(prev, next)	do {} while(0)
-#define arch_dup_mmap(oldmm, mm)	do {} while(0)
-#define arch_exit_mmap(mm)		do {} while(0)
-#endif
 
 struct sighand_struct {
 	atomic_t		count;
===================================================================
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -29,6 +29,7 @@
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
 #include <asm/tlb.h>
+#include <asm/mmu_context.h>
 
 #ifndef arch_mmap_check
 #define arch_mmap_check(addr, len, flags)	(0)

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 10/26] Xen-paravirt_ops: rename struct paravirt_patch to paravirt_patch_site for clarity
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (8 preceding siblings ...)
  2007-03-01 23:24 ` [patch 09/26] Xen-paravirt_ops: remove HAVE_ARCH_MM_LIFETIME, define no-op architecture implementations Jeremy Fitzhardinge
@ 2007-03-01 23:24 ` Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell

[-- Attachment #1: paravirt-patch-rename-paravirt_patch.patch --]
[-- Type: text/plain, Size: 3280 bytes --]

Rename struct paravirt_patch to paravirt_patch_site, so that it
clearly refers to a callsite, and not the patch which may be applied
to that callsite.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>

---
 arch/i386/kernel/alternative.c |    8 +++-----
 arch/i386/kernel/vmi.c         |    4 ----
 include/asm-i386/alternative.h |    6 +++---
 include/asm-i386/paravirt.h    |    5 ++++-
 4 files changed, 10 insertions(+), 13 deletions(-)

===================================================================
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -350,9 +350,9 @@ void alternatives_smp_switch(int smp)
 #endif
 
 #ifdef CONFIG_PARAVIRT
-void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end)
-{
-	struct paravirt_patch *p;
+void apply_paravirt(struct paravirt_patch_site *start, struct paravirt_patch_site *end)
+{
+	struct paravirt_patch_site *p;
 
 	for (p = start; p < end; p++) {
 		unsigned int used;
@@ -379,8 +379,6 @@ void apply_paravirt(struct paravirt_patc
 	/* Sync to be conservative, in case we patched following instructions */
 	sync_core();
 }
-extern struct paravirt_patch __start_parainstructions[],
-	__stop_parainstructions[];
 #endif	/* CONFIG_PARAVIRT */
 
 void __init alternative_instructions(void)
===================================================================
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -70,10 +70,6 @@ struct {
 	void (*set_initial_ap_state)(int, int);
 	void (*halt)(void);
 } vmi_ops;
-
-/* XXX move this to alternative.h */
-extern struct paravirt_patch __start_parainstructions[],
-	__stop_parainstructions[];
 
 /*
  * VMI patching routines.
===================================================================
--- a/include/asm-i386/alternative.h
+++ b/include/asm-i386/alternative.h
@@ -118,12 +118,12 @@ static inline void alternatives_smp_swit
 #define LOCK_PREFIX ""
 #endif
 
-struct paravirt_patch;
+struct paravirt_patch_site;
 #ifdef CONFIG_PARAVIRT
-void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end);
+void apply_paravirt(struct paravirt_patch_site *start, struct paravirt_patch_site *end);
 #else
 static inline void
-apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end)
+apply_paravirt(struct paravirt_patch_site *start, struct paravirt_patch_site *end)
 {}
 #define __start_parainstructions NULL
 #define __stop_parainstructions NULL
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -501,12 +501,15 @@ void _paravirt_nop(void);
 #define paravirt_nop	((void *)_paravirt_nop)
 
 /* These all sit in the .parainstructions section to tell us what to patch. */
-struct paravirt_patch {
+struct paravirt_patch_site {
 	u8 *instr; 		/* original instructions */
 	u8 instrtype;		/* type of this instruction */
 	u8 len;			/* length of original instruction */
 	u16 clobbers;		/* what registers you may clobber */
 };
+
+extern struct paravirt_patch_site __start_parainstructions[],
+	__stop_parainstructions[];
 
 #define paravirt_alt(insn_string, typenum, clobber)	\
 	"771:\n\t" insn_string "\n" "772:\n"		\

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 11/26] Xen-paravirt_ops: Use patch site IDs computed from offset in paravirt_ops structure
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell

[-- Attachment #1: paravirt-use-offset-site-ids.patch --]
[-- Type: text/plain, Size: 10746 bytes --]

Use patch type identifiers derived from the offset of the operation in
the paravirt_ops structure.  This avoids having to maintain a separate
enum for patch site types.

This patch also drops the fused save_fl+cli operation, which doesn't
really add much and makes things more complex.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>

---
 arch/i386/kernel/asm-offsets.c |    5 +
 arch/i386/kernel/paravirt.c    |   14 ++---
 arch/i386/kernel/vmi.c         |   38 ++------------
 include/asm-i386/paravirt.h    |  107 +++++++++++++++++++---------------------
 4 files changed, 70 insertions(+), 94 deletions(-)

===================================================================
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -110,5 +110,10 @@ void foo(void)
 	OFFSET(PARAVIRT_irq_enable_sysexit, paravirt_ops, irq_enable_sysexit);
 	OFFSET(PARAVIRT_iret, paravirt_ops, iret);
 	OFFSET(PARAVIRT_read_cr0, paravirt_ops, read_cr0);
+
+	DEFINE(PARAVIRT_IRQ_DISABLE, PARAVIRT_PATCH(irq_disable));
+	DEFINE(PARAVIRT_IRQ_ENABLE, PARAVIRT_PATCH(irq_enable));
+	DEFINE(PARAVIRT_STI_SYSEXIT, PARAVIRT_PATCH(irq_enable_sysexit));
+	DEFINE(PARAVIRT_INTERRUPT_RETURN, PARAVIRT_PATCH(iret));
 #endif
 }
===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -57,7 +57,6 @@ DEF_NATIVE(sti, "sti");
 DEF_NATIVE(sti, "sti");
 DEF_NATIVE(popf, "push %eax; popf");
 DEF_NATIVE(pushf, "pushf; pop %eax");
-DEF_NATIVE(pushf_cli, "pushf; pop %eax; cli");
 DEF_NATIVE(iret, "iret");
 DEF_NATIVE(sti_sysexit, "sti; sysexit");
 
@@ -65,13 +64,12 @@ static const struct native_insns
 {
 	const char *start, *end;
 } native_insns[] = {
-	[PARAVIRT_IRQ_DISABLE] = { start_cli, end_cli },
-	[PARAVIRT_IRQ_ENABLE] = { start_sti, end_sti },
-	[PARAVIRT_RESTORE_FLAGS] = { start_popf, end_popf },
-	[PARAVIRT_SAVE_FLAGS] = { start_pushf, end_pushf },
-	[PARAVIRT_SAVE_FLAGS_IRQ_DISABLE] = { start_pushf_cli, end_pushf_cli },
-	[PARAVIRT_INTERRUPT_RETURN] = { start_iret, end_iret },
-	[PARAVIRT_STI_SYSEXIT] = { start_sti_sysexit, end_sti_sysexit },
+	[PARAVIRT_PATCH(irq_disable)] = { start_cli, end_cli },
+	[PARAVIRT_PATCH(irq_enable)] = { start_sti, end_sti },
+	[PARAVIRT_PATCH(restore_fl)] = { start_popf, end_popf },
+	[PARAVIRT_PATCH(save_fl)] = { start_pushf, end_pushf },
+	[PARAVIRT_PATCH(iret)] = { start_iret, end_iret },
+	[PARAVIRT_PATCH(irq_enable_sysexit)] = { start_sti_sysexit, end_sti_sysexit },
 };
 
 static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
===================================================================
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -78,11 +78,6 @@ struct {
 #define MNEM_JMP  0xe9
 #define MNEM_RET  0xc3
 
-static char irq_save_disable_callout[] = {
-	MNEM_CALL, 0, 0, 0, 0,
-	MNEM_CALL, 0, 0, 0, 0,
-	MNEM_RET
-};
 #define IRQ_PATCH_INT_MASK 0
 #define IRQ_PATCH_DISABLE  5
 
@@ -130,33 +125,17 @@ static unsigned vmi_patch(u8 type, u16 c
 static unsigned vmi_patch(u8 type, u16 clobbers, void *insns, unsigned len)
 {
 	switch (type) {
-		case PARAVIRT_IRQ_DISABLE:
+		case PARAVIRT_PATCH(irq_disable):
 			return patch_internal(VMI_CALL_DisableInterrupts, len, insns);
-		case PARAVIRT_IRQ_ENABLE:
+		case PARAVIRT_PATCH(irq_enable):
 			return patch_internal(VMI_CALL_EnableInterrupts, len, insns);
-		case PARAVIRT_RESTORE_FLAGS:
+		case PARAVIRT_PATCH(restore_fl):
 			return patch_internal(VMI_CALL_SetInterruptMask, len, insns);
-		case PARAVIRT_SAVE_FLAGS:
+		case PARAVIRT_PATCH(save_fl):
 			return patch_internal(VMI_CALL_GetInterruptMask, len, insns);
-        	case PARAVIRT_SAVE_FLAGS_IRQ_DISABLE:
-			if (len >= 10) {
-				patch_internal(VMI_CALL_GetInterruptMask, len, insns);
-				patch_internal(VMI_CALL_DisableInterrupts, len-5, insns+5);
-				return 10;
-			} else {
-				/*
-				 * You bastards didn't leave enough room to
-				 * patch save_flags_irq_disable inline.  Patch
-				 * to a helper
-				 */
-				BUG_ON(len < 5);
-				*(char *)insns = MNEM_CALL;
-				patch_offset(insns, irq_save_disable_callout);
-				return 5;
-			}
-		case PARAVIRT_INTERRUPT_RETURN:
+		case PARAVIRT_PATCH(iret):
 			return patch_internal(VMI_CALL_IRET, len, insns);
-		case PARAVIRT_STI_SYSEXIT:
+		case PARAVIRT_PATCH(irq_enable_sysexit):
 			return patch_internal(VMI_CALL_SYSEXIT, len, insns);
 		default:
 			break;
@@ -733,11 +712,6 @@ static inline int __init activate_vmi(vo
 	para_fill(restore_fl, SetInterruptMask);
 	para_fill(irq_disable, DisableInterrupts);
 	para_fill(irq_enable, EnableInterrupts);
-	/* irq_save_disable !!! sheer pain */
-	patch_offset(&irq_save_disable_callout[IRQ_PATCH_INT_MASK],
-		     (char *)paravirt_ops.save_fl);
-	patch_offset(&irq_save_disable_callout[IRQ_PATCH_DISABLE],
-		     (char *)paravirt_ops.irq_disable);
 #ifndef CONFIG_NO_IDLE_HZ
 	para_fill(safe_halt, Halt);
 #else
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -4,18 +4,7 @@
  * para-virtualization: those hooks are defined here. */
 
 #ifdef CONFIG_PARAVIRT
-#include <linux/stringify.h>
 #include <asm/page.h>
-
-/* These are the most performance critical ops, so we want to be able to patch
- * callers */
-#define PARAVIRT_IRQ_DISABLE 0
-#define PARAVIRT_IRQ_ENABLE 1
-#define PARAVIRT_RESTORE_FLAGS 2
-#define PARAVIRT_SAVE_FLAGS 3
-#define PARAVIRT_SAVE_FLAGS_IRQ_DISABLE 4
-#define PARAVIRT_INTERRUPT_RETURN 5
-#define PARAVIRT_STI_SYSEXIT 6
 
 /* Bitmask of what can be clobbered: usually at least eax. */
 #define CLBR_NONE 0x0
@@ -26,6 +15,23 @@
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+
+#define paravirt_type(type)		[paravirt_typenum] "i" (type)
+#define paravirt_clobber(clobber)	[paravirt_clobber] "i" (clobber)
+
+#define PARAVIRT_PATCH(x)		(offsetof(struct paravirt_ops, x) / sizeof(void *))
+
+#define _paravirt_alt(insn_string, type, clobber)	\
+	"771:\n\t" insn_string "\n" "772:\n"		\
+	".pushsection .parainstructions,\"a\"\n"	\
+	"  .long 771b\n"				\
+	"  .byte " type "\n"				\
+	"  .byte 772b-771b\n"				\
+	"  .short " clobber "\n"			\
+	".popsection\n"
+
+#define paravirt_alt(insn_string)				\
+	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
 
 struct thread_struct;
 struct Xgt_desc_struct;
@@ -512,24 +518,16 @@ extern struct paravirt_patch_site __star
 extern struct paravirt_patch_site __start_parainstructions[],
 	__stop_parainstructions[];
 
-#define paravirt_alt(insn_string, typenum, clobber)	\
-	"771:\n\t" insn_string "\n" "772:\n"		\
-	".pushsection .parainstructions,\"a\"\n"	\
-	"  .long 771b\n"				\
-	"  .byte " __stringify(typenum) "\n"		\
-	"  .byte 772b-771b\n"				\
-	"  .short " __stringify(clobber) "\n"		\
-	".popsection"
-
 static inline unsigned long __raw_local_save_flags(void)
 {
 	unsigned long f;
 
 	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
 					   "call *%1;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_SAVE_FLAGS, CLBR_NONE)
-			     : "=a"(f): "m"(paravirt_ops.save_fl)
+					   "popl %%edx; popl %%ecx")
+			     : "=a"(f): "m"(paravirt_ops.save_fl),
+			       paravirt_type(PARAVIRT_PATCH(save_fl)),
+			       paravirt_clobber(CLBR_NONE)
 			     : "memory", "cc");
 	return f;
 }
@@ -538,9 +536,10 @@ static inline void raw_local_irq_restore
 {
 	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
 					   "call *%1;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_RESTORE_FLAGS, CLBR_EAX)
-			     : "=a"(f) : "m" (paravirt_ops.restore_fl), "0"(f)
+					   "popl %%edx; popl %%ecx")
+			     : "=a"(f) : "m" (paravirt_ops.restore_fl), "0"(f),
+			       paravirt_type(PARAVIRT_PATCH(restore_fl)),
+			       paravirt_clobber(CLBR_EAX)
 			     : "memory", "cc");
 }
 
@@ -548,9 +547,10 @@ static inline void raw_local_irq_disable
 {
 	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
 					   "call *%0;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_IRQ_DISABLE, CLBR_EAX)
-			     : : "m" (paravirt_ops.irq_disable)
+					   "popl %%edx; popl %%ecx")
+			     : : "m" (paravirt_ops.irq_disable),
+			       paravirt_type(PARAVIRT_PATCH(irq_disable)),
+			       paravirt_clobber(CLBR_EAX)
 			     : "memory", "eax", "cc");
 }
 
@@ -558,9 +558,10 @@ static inline void raw_local_irq_enable(
 {
 	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
 					   "call *%0;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_IRQ_ENABLE, CLBR_EAX)
-			     : : "m" (paravirt_ops.irq_enable)
+					   "popl %%edx; popl %%ecx")
+			     : : "m" (paravirt_ops.irq_enable),
+			       paravirt_type(PARAVIRT_PATCH(irq_enable)),
+			       paravirt_clobber(CLBR_EAX)
 			     : "memory", "eax", "cc");
 }
 
@@ -568,33 +569,31 @@ static inline unsigned long __raw_local_
 {
 	unsigned long f;
 
-	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
-					   "call *%1; pushl %%eax;"
-					   "call *%2; popl %%eax;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_SAVE_FLAGS_IRQ_DISABLE,
-					  CLBR_NONE)
-			     : "=a"(f)
-			     : "m" (paravirt_ops.save_fl),
-			       "m" (paravirt_ops.irq_disable)
-			     : "memory", "cc");
+	f = __raw_local_save_flags();
+	raw_local_irq_disable();
+
 	return f;
 }
 
-#define CLI_STRING paravirt_alt("pushl %%ecx; pushl %%edx;"		\
-		     "call *paravirt_ops+%c[irq_disable];"		\
-		     "popl %%edx; popl %%ecx",				\
-		     PARAVIRT_IRQ_DISABLE, CLBR_EAX)
-
-#define STI_STRING paravirt_alt("pushl %%ecx; pushl %%edx;"		\
-		     "call *paravirt_ops+%c[irq_enable];"		\
-		     "popl %%edx; popl %%ecx",				\
-		     PARAVIRT_IRQ_ENABLE, CLBR_EAX)
+#define CLI_STRING _paravirt_alt("pushl %%ecx; pushl %%edx;"		\
+				 "call *paravirt_ops+%c[irq_disable];"	\
+				 "popl %%edx; popl %%ecx",		\
+				 "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
+
+#define STI_STRING _paravirt_alt("pushl %%ecx; pushl %%edx;"		\
+				"call *paravirt_ops+%c[irq_enable];"	\
+				 "popl %%edx; popl %%ecx",		\
+				 "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
+
+
 #define CLI_STI_CLOBBERS , "%eax"
-#define CLI_STI_INPUT_ARGS \
+#define CLI_STI_INPUT_ARGS						\
 	,								\
-	[irq_disable] "i" (offsetof(struct paravirt_ops, irq_disable)),	\
-	[irq_enable] "i" (offsetof(struct paravirt_ops, irq_enable))
+		[irq_disable] "i" (offsetof(struct paravirt_ops, irq_disable)),	\
+		[irq_enable] "i" (offsetof(struct paravirt_ops, irq_enable)), \
+		[paravirt_cli_type] "i" (PARAVIRT_PATCH(irq_disable)),	\
+		[paravirt_sti_type] "i" (PARAVIRT_PATCH(irq_disable)),	\
+		paravirt_clobber(CLBR_NONE)
 
 #else  /* __ASSEMBLY__ */
 

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 11/26] Xen-paravirt_ops: Use patch site IDs computed from offset in paravirt_ops structure
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zachary Amsden, xen-devel, Rusty Russell, linux-kernel,
	Chris Wright, virtualization, Andrew Morton

[-- Attachment #1: paravirt-use-offset-site-ids.patch --]
[-- Type: text/plain, Size: 10745 bytes --]

Use patch type identifiers derived from the offset of the operation in
the paravirt_ops structure.  This avoids having to maintain a separate
enum for patch site types.

This patch also drops the fused save_fl+cli operation, which doesn't
really add much and makes things more complex.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>

---
 arch/i386/kernel/asm-offsets.c |    5 +
 arch/i386/kernel/paravirt.c    |   14 ++---
 arch/i386/kernel/vmi.c         |   38 ++------------
 include/asm-i386/paravirt.h    |  107 +++++++++++++++++++---------------------
 4 files changed, 70 insertions(+), 94 deletions(-)

===================================================================
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -110,5 +110,10 @@ void foo(void)
 	OFFSET(PARAVIRT_irq_enable_sysexit, paravirt_ops, irq_enable_sysexit);
 	OFFSET(PARAVIRT_iret, paravirt_ops, iret);
 	OFFSET(PARAVIRT_read_cr0, paravirt_ops, read_cr0);
+
+	DEFINE(PARAVIRT_IRQ_DISABLE, PARAVIRT_PATCH(irq_disable));
+	DEFINE(PARAVIRT_IRQ_ENABLE, PARAVIRT_PATCH(irq_enable));
+	DEFINE(PARAVIRT_STI_SYSEXIT, PARAVIRT_PATCH(irq_enable_sysexit));
+	DEFINE(PARAVIRT_INTERRUPT_RETURN, PARAVIRT_PATCH(iret));
 #endif
 }
===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -57,7 +57,6 @@ DEF_NATIVE(sti, "sti");
 DEF_NATIVE(sti, "sti");
 DEF_NATIVE(popf, "push %eax; popf");
 DEF_NATIVE(pushf, "pushf; pop %eax");
-DEF_NATIVE(pushf_cli, "pushf; pop %eax; cli");
 DEF_NATIVE(iret, "iret");
 DEF_NATIVE(sti_sysexit, "sti; sysexit");
 
@@ -65,13 +64,12 @@ static const struct native_insns
 {
 	const char *start, *end;
 } native_insns[] = {
-	[PARAVIRT_IRQ_DISABLE] = { start_cli, end_cli },
-	[PARAVIRT_IRQ_ENABLE] = { start_sti, end_sti },
-	[PARAVIRT_RESTORE_FLAGS] = { start_popf, end_popf },
-	[PARAVIRT_SAVE_FLAGS] = { start_pushf, end_pushf },
-	[PARAVIRT_SAVE_FLAGS_IRQ_DISABLE] = { start_pushf_cli, end_pushf_cli },
-	[PARAVIRT_INTERRUPT_RETURN] = { start_iret, end_iret },
-	[PARAVIRT_STI_SYSEXIT] = { start_sti_sysexit, end_sti_sysexit },
+	[PARAVIRT_PATCH(irq_disable)] = { start_cli, end_cli },
+	[PARAVIRT_PATCH(irq_enable)] = { start_sti, end_sti },
+	[PARAVIRT_PATCH(restore_fl)] = { start_popf, end_popf },
+	[PARAVIRT_PATCH(save_fl)] = { start_pushf, end_pushf },
+	[PARAVIRT_PATCH(iret)] = { start_iret, end_iret },
+	[PARAVIRT_PATCH(irq_enable_sysexit)] = { start_sti_sysexit, end_sti_sysexit },
 };
 
 static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
===================================================================
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -78,11 +78,6 @@ struct {
 #define MNEM_JMP  0xe9
 #define MNEM_RET  0xc3
 
-static char irq_save_disable_callout[] = {
-	MNEM_CALL, 0, 0, 0, 0,
-	MNEM_CALL, 0, 0, 0, 0,
-	MNEM_RET
-};
 #define IRQ_PATCH_INT_MASK 0
 #define IRQ_PATCH_DISABLE  5
 
@@ -130,33 +125,17 @@ static unsigned vmi_patch(u8 type, u16 c
 static unsigned vmi_patch(u8 type, u16 clobbers, void *insns, unsigned len)
 {
 	switch (type) {
-		case PARAVIRT_IRQ_DISABLE:
+		case PARAVIRT_PATCH(irq_disable):
 			return patch_internal(VMI_CALL_DisableInterrupts, len, insns);
-		case PARAVIRT_IRQ_ENABLE:
+		case PARAVIRT_PATCH(irq_enable):
 			return patch_internal(VMI_CALL_EnableInterrupts, len, insns);
-		case PARAVIRT_RESTORE_FLAGS:
+		case PARAVIRT_PATCH(restore_fl):
 			return patch_internal(VMI_CALL_SetInterruptMask, len, insns);
-		case PARAVIRT_SAVE_FLAGS:
+		case PARAVIRT_PATCH(save_fl):
 			return patch_internal(VMI_CALL_GetInterruptMask, len, insns);
-        	case PARAVIRT_SAVE_FLAGS_IRQ_DISABLE:
-			if (len >= 10) {
-				patch_internal(VMI_CALL_GetInterruptMask, len, insns);
-				patch_internal(VMI_CALL_DisableInterrupts, len-5, insns+5);
-				return 10;
-			} else {
-				/*
-				 * You bastards didn't leave enough room to
-				 * patch save_flags_irq_disable inline.  Patch
-				 * to a helper
-				 */
-				BUG_ON(len < 5);
-				*(char *)insns = MNEM_CALL;
-				patch_offset(insns, irq_save_disable_callout);
-				return 5;
-			}
-		case PARAVIRT_INTERRUPT_RETURN:
+		case PARAVIRT_PATCH(iret):
 			return patch_internal(VMI_CALL_IRET, len, insns);
-		case PARAVIRT_STI_SYSEXIT:
+		case PARAVIRT_PATCH(irq_enable_sysexit):
 			return patch_internal(VMI_CALL_SYSEXIT, len, insns);
 		default:
 			break;
@@ -733,11 +712,6 @@ static inline int __init activate_vmi(vo
 	para_fill(restore_fl, SetInterruptMask);
 	para_fill(irq_disable, DisableInterrupts);
 	para_fill(irq_enable, EnableInterrupts);
-	/* irq_save_disable !!! sheer pain */
-	patch_offset(&irq_save_disable_callout[IRQ_PATCH_INT_MASK],
-		     (char *)paravirt_ops.save_fl);
-	patch_offset(&irq_save_disable_callout[IRQ_PATCH_DISABLE],
-		     (char *)paravirt_ops.irq_disable);
 #ifndef CONFIG_NO_IDLE_HZ
 	para_fill(safe_halt, Halt);
 #else
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -4,18 +4,7 @@
  * para-virtualization: those hooks are defined here. */
 
 #ifdef CONFIG_PARAVIRT
-#include <linux/stringify.h>
 #include <asm/page.h>
-
-/* These are the most performance critical ops, so we want to be able to patch
- * callers */
-#define PARAVIRT_IRQ_DISABLE 0
-#define PARAVIRT_IRQ_ENABLE 1
-#define PARAVIRT_RESTORE_FLAGS 2
-#define PARAVIRT_SAVE_FLAGS 3
-#define PARAVIRT_SAVE_FLAGS_IRQ_DISABLE 4
-#define PARAVIRT_INTERRUPT_RETURN 5
-#define PARAVIRT_STI_SYSEXIT 6
 
 /* Bitmask of what can be clobbered: usually at least eax. */
 #define CLBR_NONE 0x0
@@ -26,6 +15,23 @@
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+
+#define paravirt_type(type)		[paravirt_typenum] "i" (type)
+#define paravirt_clobber(clobber)	[paravirt_clobber] "i" (clobber)
+
+#define PARAVIRT_PATCH(x)		(offsetof(struct paravirt_ops, x) / sizeof(void *))
+
+#define _paravirt_alt(insn_string, type, clobber)	\
+	"771:\n\t" insn_string "\n" "772:\n"		\
+	".pushsection .parainstructions,\"a\"\n"	\
+	"  .long 771b\n"				\
+	"  .byte " type "\n"				\
+	"  .byte 772b-771b\n"				\
+	"  .short " clobber "\n"			\
+	".popsection\n"
+
+#define paravirt_alt(insn_string)				\
+	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
 
 struct thread_struct;
 struct Xgt_desc_struct;
@@ -512,24 +518,16 @@ extern struct paravirt_patch_site __star
 extern struct paravirt_patch_site __start_parainstructions[],
 	__stop_parainstructions[];
 
-#define paravirt_alt(insn_string, typenum, clobber)	\
-	"771:\n\t" insn_string "\n" "772:\n"		\
-	".pushsection .parainstructions,\"a\"\n"	\
-	"  .long 771b\n"				\
-	"  .byte " __stringify(typenum) "\n"		\
-	"  .byte 772b-771b\n"				\
-	"  .short " __stringify(clobber) "\n"		\
-	".popsection"
-
 static inline unsigned long __raw_local_save_flags(void)
 {
 	unsigned long f;
 
 	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
 					   "call *%1;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_SAVE_FLAGS, CLBR_NONE)
-			     : "=a"(f): "m"(paravirt_ops.save_fl)
+					   "popl %%edx; popl %%ecx")
+			     : "=a"(f): "m"(paravirt_ops.save_fl),
+			       paravirt_type(PARAVIRT_PATCH(save_fl)),
+			       paravirt_clobber(CLBR_NONE)
 			     : "memory", "cc");
 	return f;
 }
@@ -538,9 +536,10 @@ static inline void raw_local_irq_restore
 {
 	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
 					   "call *%1;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_RESTORE_FLAGS, CLBR_EAX)
-			     : "=a"(f) : "m" (paravirt_ops.restore_fl), "0"(f)
+					   "popl %%edx; popl %%ecx")
+			     : "=a"(f) : "m" (paravirt_ops.restore_fl), "0"(f),
+			       paravirt_type(PARAVIRT_PATCH(restore_fl)),
+			       paravirt_clobber(CLBR_EAX)
 			     : "memory", "cc");
 }
 
@@ -548,9 +547,10 @@ static inline void raw_local_irq_disable
 {
 	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
 					   "call *%0;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_IRQ_DISABLE, CLBR_EAX)
-			     : : "m" (paravirt_ops.irq_disable)
+					   "popl %%edx; popl %%ecx")
+			     : : "m" (paravirt_ops.irq_disable),
+			       paravirt_type(PARAVIRT_PATCH(irq_disable)),
+			       paravirt_clobber(CLBR_EAX)
 			     : "memory", "eax", "cc");
 }
 
@@ -558,9 +558,10 @@ static inline void raw_local_irq_enable(
 {
 	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
 					   "call *%0;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_IRQ_ENABLE, CLBR_EAX)
-			     : : "m" (paravirt_ops.irq_enable)
+					   "popl %%edx; popl %%ecx")
+			     : : "m" (paravirt_ops.irq_enable),
+			       paravirt_type(PARAVIRT_PATCH(irq_enable)),
+			       paravirt_clobber(CLBR_EAX)
 			     : "memory", "eax", "cc");
 }
 
@@ -568,33 +569,31 @@ static inline unsigned long __raw_local_
 {
 	unsigned long f;
 
-	__asm__ __volatile__(paravirt_alt( "pushl %%ecx; pushl %%edx;"
-					   "call *%1; pushl %%eax;"
-					   "call *%2; popl %%eax;"
-					   "popl %%edx; popl %%ecx",
-					  PARAVIRT_SAVE_FLAGS_IRQ_DISABLE,
-					  CLBR_NONE)
-			     : "=a"(f)
-			     : "m" (paravirt_ops.save_fl),
-			       "m" (paravirt_ops.irq_disable)
-			     : "memory", "cc");
+	f = __raw_local_save_flags();
+	raw_local_irq_disable();
+
 	return f;
 }
 
-#define CLI_STRING paravirt_alt("pushl %%ecx; pushl %%edx;"		\
-		     "call *paravirt_ops+%c[irq_disable];"		\
-		     "popl %%edx; popl %%ecx",				\
-		     PARAVIRT_IRQ_DISABLE, CLBR_EAX)
-
-#define STI_STRING paravirt_alt("pushl %%ecx; pushl %%edx;"		\
-		     "call *paravirt_ops+%c[irq_enable];"		\
-		     "popl %%edx; popl %%ecx",				\
-		     PARAVIRT_IRQ_ENABLE, CLBR_EAX)
+#define CLI_STRING _paravirt_alt("pushl %%ecx; pushl %%edx;"		\
+				 "call *paravirt_ops+%c[irq_disable];"	\
+				 "popl %%edx; popl %%ecx",		\
+				 "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
+
+#define STI_STRING _paravirt_alt("pushl %%ecx; pushl %%edx;"		\
+				"call *paravirt_ops+%c[irq_enable];"	\
+				 "popl %%edx; popl %%ecx",		\
+				 "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
+
+
 #define CLI_STI_CLOBBERS , "%eax"
-#define CLI_STI_INPUT_ARGS \
+#define CLI_STI_INPUT_ARGS						\
 	,								\
-	[irq_disable] "i" (offsetof(struct paravirt_ops, irq_disable)),	\
-	[irq_enable] "i" (offsetof(struct paravirt_ops, irq_enable))
+		[irq_disable] "i" (offsetof(struct paravirt_ops, irq_disable)),	\
+		[irq_enable] "i" (offsetof(struct paravirt_ops, irq_enable)), \
+		[paravirt_cli_type] "i" (PARAVIRT_PATCH(irq_disable)),	\
+		[paravirt_sti_type] "i" (PARAVIRT_PATCH(irq_disable)),	\
+		paravirt_clobber(CLBR_NONE)
 
 #else  /* __ASSEMBLY__ */
 

-- 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell

[-- Attachment #1: paravirt-fix-clobbers.patch --]
[-- Type: text/plain, Size: 1257 bytes --]

Fix a few clobbers to include the return register.  The clobbers set
is the set of all registers modified (or may be modified) by the code
snippet, regardless of whether it was deliberate or accidental.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>

---
 include/asm-i386/paravirt.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -556,7 +556,7 @@ static inline unsigned long __raw_local_
 					   "popl %%edx; popl %%ecx")
 			     : "=a"(f): "m"(paravirt_ops.save_fl),
 			       paravirt_type(PARAVIRT_PATCH(save_fl)),
-			       paravirt_clobber(CLBR_NONE)
+			       paravirt_clobber(CLBR_EAX)
 			     : "memory", "cc");
 	return f;
 }
@@ -610,7 +610,7 @@ static inline unsigned long __raw_local_
 				 "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
 
 #define STI_STRING _paravirt_alt("pushl %%ecx; pushl %%edx;"		\
-				"call *paravirt_ops+%c[irq_enable];"	\
+				 "call *paravirt_ops+%c[irq_enable];"	\
 				 "popl %%edx; popl %%ecx",		\
 				 "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
 

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zachary Amsden, xen-devel, Rusty Russell, linux-kernel,
	Chris Wright, virtualization, Andrew Morton

[-- Attachment #1: paravirt-fix-clobbers.patch --]
[-- Type: text/plain, Size: 1256 bytes --]

Fix a few clobbers to include the return register.  The clobbers set
is the set of all registers modified (or may be modified) by the code
snippet, regardless of whether it was deliberate or accidental.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>

---
 include/asm-i386/paravirt.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -556,7 +556,7 @@ static inline unsigned long __raw_local_
 					   "popl %%edx; popl %%ecx")
 			     : "=a"(f): "m"(paravirt_ops.save_fl),
 			       paravirt_type(PARAVIRT_PATCH(save_fl)),
-			       paravirt_clobber(CLBR_NONE)
+			       paravirt_clobber(CLBR_EAX)
 			     : "memory", "cc");
 	return f;
 }
@@ -610,7 +610,7 @@ static inline unsigned long __raw_local_
 				 "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
 
 #define STI_STRING _paravirt_alt("pushl %%ecx; pushl %%edx;"		\
-				"call *paravirt_ops+%c[irq_enable];"	\
+				 "call *paravirt_ops+%c[irq_enable];"	\
 				 "popl %%edx; popl %%ecx",		\
 				 "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
 

-- 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (11 preceding siblings ...)
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-01 23:24 ` Jeremy Fitzhardinge
  2007-03-16  9:24     ` Ingo Molnar
  2007-03-01 23:24 ` [patch 14/26] Xen-paravirt_ops: add common patching machinery Jeremy Fitzhardinge
                   ` (14 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Anthony Liguori

[-- Attachment #1: paravirt-patchable-call-wrappers.patch --]
[-- Type: text/plain, Size: 24877 bytes --]

Wrap a set of interesting paravirt_ops calls in a wrapper which makes
the callsites available for patching.  Unfortunately this is pretty
ugly because there's no way to get gcc to generate a function call,
but also wrap just the callsite itself with the necessary labels.

This patch supports functions with 0-4 arguments, and either void or
returning a value.  64-bit arguments must be split into a pair of
32-bit arguments (lower word first).  Small structures are returned in
registers.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>

---
 include/asm-i386/paravirt.h |  577 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 442 insertions(+), 135 deletions(-)

===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -33,6 +33,222 @@
 #define paravirt_alt(insn_string)				\
 	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
 
+#define PVOP_CALL0(__rettype, __op)					\
+	({								\
+		__rettype __ret;					\
+		if (sizeof(__rettype) > sizeof(unsigned long)) {	\
+			unsigned long long __tmp;			\
+			unsigned long __ecx;				\
+			asm volatile(paravirt_alt("call *%[op]")	\
+				     : "=A" (__tmp), "=c" (__ecx)	\
+				     : [op] "m" (paravirt_ops.__op),	\
+				       paravirt_type(PARAVIRT_PATCH(__op)), \
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		} else {						\
+			unsigned long __tmp, __edx, __ecx;	\
+			asm volatile(paravirt_alt("call *%[op]")	\
+				     : "=a" (__tmp), "=d" (__edx), "=c" (__ecx) \
+				     : [op] "m" (paravirt_ops.__op),	\
+				       paravirt_type(PARAVIRT_PATCH(__op)), \
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		}							\
+		__ret;							\
+	})
+#define PVOP_VCALL0(__op)						\
+	({								\
+		unsigned long __eax, __edx, __ecx;			\
+		asm volatile(paravirt_alt("call *%[op]")		\
+			     : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+			     : [op] "m" (paravirt_ops.__op),		\
+			       paravirt_type(PARAVIRT_PATCH(__op)),	\
+			       paravirt_clobber(CLBR_ANY)		\
+			     : "memory", "cc");				\
+	})
+
+#define PVOP_CALL1(__rettype, __op, arg1)				\
+	({								\
+		__rettype __ret;					\
+		if (sizeof(__rettype) > sizeof(unsigned long)) {	\
+			unsigned long long __tmp;			\
+			unsigned long __ecx;				\
+			asm volatile(paravirt_alt("call *%[op]")	\
+				     : "=A" (__tmp), "=c" (__ecx)	\
+				     : [op] "m" (paravirt_ops.__op),	\
+				       "a" ((u32)(arg1)),		\
+				       paravirt_type(PARAVIRT_PATCH(__op)), \
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		} else {						\
+			unsigned long __tmp, __edx, __ecx;	\
+			asm volatile(paravirt_alt("call *%[op]")	\
+				     : "=a" (__tmp), "=d" (__edx), "=c" (__ecx) \
+				     : [op] "m" (paravirt_ops.__op),	\
+				       "0" ((u32)(arg1)),		\
+				       paravirt_type(PARAVIRT_PATCH(__op)), \
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		}							\
+		__ret;							\
+	})
+#define PVOP_VCALL1(__op, arg1)						\
+	({								\
+		unsigned long __eax, __edx, __ecx;			\
+		asm volatile(paravirt_alt("call *%[op]")		\
+			     : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+			     : [op] "m" (paravirt_ops.__op),		\
+			       "0" ((u32)(arg1)),			\
+			       paravirt_type(PARAVIRT_PATCH(__op)),	\
+			       paravirt_clobber(CLBR_ANY)		\
+			     : "memory", "cc");				\
+	})
+
+#define PVOP_CALL2(__rettype, __op, arg1, arg2)				\
+	({								\
+		__rettype __ret;					\
+		if (sizeof(__rettype) > sizeof(unsigned long)) {	\
+			unsigned long long __tmp;			\
+			unsigned long __ecx;				\
+			asm volatile(paravirt_alt("call *%[op]")	\
+				     : "=A" (__tmp), "=c" (__ecx)	\
+				     : [op] "m" (paravirt_ops.__op),	\
+				       "a" ((u32)(arg1)),		\
+				       "d" ((u32)(arg2)),		\
+				       paravirt_type(PARAVIRT_PATCH(__op)), \
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		} else {						\
+			unsigned long __tmp, __edx, __ecx;	\
+			asm volatile(paravirt_alt("call *%[op]")	\
+				     : "=a" (__tmp), "=d" (__edx), "=c" (__ecx) \
+				     : [op] "m" (paravirt_ops.__op),	\
+				       "0" ((u32)(arg1)),		\
+				       "1" ((u32)(arg2)),		\
+				       paravirt_type(PARAVIRT_PATCH(__op)), \
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		}							\
+		__ret;							\
+	})
+#define PVOP_VCALL2(__op, arg1, arg2)					\
+	({								\
+		unsigned long __eax, __edx, __ecx;			\
+		asm volatile(paravirt_alt("call *%[op]")		\
+			     : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+			     : [op] "m" (paravirt_ops.__op),		\
+			       "0" ((u32)(arg1)),			\
+			       "1" ((u32)(arg2)),			\
+			       paravirt_type(PARAVIRT_PATCH(__op)),	\
+			       paravirt_clobber(CLBR_ANY)		\
+			     : "memory", "cc");				\
+	})
+
+#define PVOP_CALL3(__rettype, __op, arg1, arg2, arg3)			\
+	({								\
+		__rettype __ret;					\
+		if (sizeof(__rettype) > sizeof(unsigned long)) {	\
+			unsigned long long __tmp;			\
+			unsigned long __ecx;				\
+			asm volatile(paravirt_alt("call *%[op]")	\
+				     : "=A" (__tmp), "=c" (__ecx)	\
+				     : [op] "m" (paravirt_ops.__op),	\
+				       "a" ((u32)(arg1)),		\
+				       "d" ((u32)(arg2)),		\
+				       "1" ((u32)(arg3)),		\
+				       paravirt_type(PARAVIRT_PATCH(__op)), \
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		} else {						\
+			unsigned long __tmp, __edx, __ecx;	\
+			asm volatile(paravirt_alt("call *%[op]")	\
+				     : "=a" (__tmp), "=d" (__edx), "=c" (__ecx) \
+				     : [op] "m" (paravirt_ops.__op),	\
+				       "0" ((u32)(arg1)),		\
+				       "1" ((u32)(arg2)),		\
+				       "2" ((u32)(arg3)),		\
+				       paravirt_type(PARAVIRT_PATCH(__op)), \
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		}							\
+		__ret;							\
+	})
+#define PVOP_VCALL3(__op, arg1, arg2, arg3)				\
+	({								\
+		unsigned long __eax, __edx, __ecx;			\
+		asm volatile(paravirt_alt("call *%[op]")		\
+			     : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+			     : [op] "m" (paravirt_ops.__op),		\
+			       "0" ((u32)(arg1)),			\
+			       "1" ((u32)(arg2)),			\
+			       "2" ((u32)(arg3)),			\
+			       paravirt_type(PARAVIRT_PATCH(__op)),	\
+			       paravirt_clobber(CLBR_ANY)		\
+			     : "memory", "cc");				\
+	})
+
+#define PVOP_CALL4(__rettype, __op, arg1, arg2, arg3, arg4)		\
+	({								\
+		__rettype __ret;					\
+		if (sizeof(__rettype) > sizeof(unsigned long)) {	\
+			unsigned long long __tmp;			\
+			unsigned long __ecx;				\
+			asm volatile("push %[_arg4]; "			\
+				     paravirt_alt("call *%[op]")	\
+				     "lea 4(%%esp),%%esp"		\
+				     : "=A" (__tmp), "=c" (__ecx)	\
+				     : [op] "m" (paravirt_ops.__op),	\
+				       "a" ((u32)(arg1)),		\
+				       "d" ((u32)(arg2)),		\
+				       "1" ((u32)(arg3)),		\
+				       [_arg4] "mr" ((u32)(arg4)),	\
+				       paravirt_type(PARAVIRT_PATCH(__op)), \
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc",);		\
+			__ret = (__rettype)__tmp;			\
+		} else {						\
+			unsigned long __tmp, __edx, __ecx;		\
+			asm volatile("push %[_arg4]; "			\
+				     paravirt_alt("call *%[op]")	\
+				     "lea 4(%%esp),%%esp"		\
+				     : "=a" (__tmp), "=d" (__edx), "=c" (__ecx) \
+				     : [op] "m" (paravirt_ops.__op),	\
+				       "0" ((u32)(arg1)),		\
+				       "1" ((u32)(arg2)),		\
+				       "2" ((u32)(arg3)),		\
+				       [_arg4]"mr" ((u32)(arg4)),	\
+				       paravirt_type(PARAVIRT_PATCH(__op)), \
+				       paravirt_clobber(CLBR_ANY)	\
+				     : "memory", "cc");			\
+			__ret = (__rettype)__tmp;			\
+		}							\
+		__ret;							\
+	})
+#define PVOP_VCALL4(__op, arg1, arg2, arg3, arg4)			\
+	({								\
+		unsigned long __eax, __edx, __ecx;			\
+		asm volatile("push %[_arg4]; "				\
+			     paravirt_alt("call *%[op]")		\
+			     "lea 4(%%esp),%%esp"			\
+			     : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+			     : [op] "m" (paravirt_ops.__op),		\
+			       "0" ((u32)(arg1)),			\
+			       "1" ((u32)(arg2)),			\
+			       "2" ((u32)(arg3)),			\
+			       [_arg4]"mr" ((u32)(arg4)),		\
+			       paravirt_type(PARAVIRT_PATCH(__op)),	\
+			       paravirt_clobber(CLBR_ANY)		\
+			     : "memory", "cc");				\
+	})
+
 struct thread_struct;
 struct Xgt_desc_struct;
 struct tss_struct;
@@ -110,14 +326,10 @@ struct paravirt_ops
 	void (*set_ldt)(const void *desc, unsigned entries);
 	unsigned long (*store_tr)(void);
 	void (*load_tls)(struct thread_struct *t, unsigned int cpu);
-	void (*write_ldt_entry)(void *dt, int entrynum,
-					 u32 low, u32 high);
-	void (*write_gdt_entry)(void *dt, int entrynum,
-					 u32 low, u32 high);
-	void (*write_idt_entry)(void *dt, int entrynum,
-					 u32 low, u32 high);
-	void (*load_esp0)(struct tss_struct *tss,
-				   struct thread_struct *thread);
+	void (*write_ldt_entry)(void *dt, int entrynum, u32 low, u32 high);
+	void (*write_gdt_entry)(void *dt, int entrynum, u32 low, u32 high);
+	void (*write_idt_entry)(void *dt, int entrynum, u32 low, u32 high);
+	void (*load_esp0)(struct tss_struct *tss, struct thread_struct *thread);
 
 	void (*set_iopl_mask)(unsigned mask);
 
@@ -194,151 +406,165 @@ struct paravirt_ops
 
 extern struct paravirt_ops paravirt_ops;
 
-#define paravirt_enabled() (paravirt_ops.paravirt_enabled)
+static inline int paravirt_enabled(void)
+{
+	return paravirt_ops.paravirt_enabled;
+}
 
 static inline void load_esp0(struct tss_struct *tss,
 			     struct thread_struct *thread)
 {
-	paravirt_ops.load_esp0(tss, thread);
-}
-
-#define ARCH_SETUP			paravirt_ops.arch_setup();
+	PVOP_VCALL2(load_esp0, tss, thread);
+}
+
+#define ARCH_SETUP			PVOP_VCALL0(arch_setup);
 static inline unsigned long get_wallclock(void)
 {
-	return paravirt_ops.get_wallclock();
+	return PVOP_CALL0(unsigned long, get_wallclock);
 }
 
 static inline int set_wallclock(unsigned long nowtime)
 {
-	return paravirt_ops.set_wallclock(nowtime);
+	return PVOP_CALL1(int, set_wallclock, nowtime);
 }
 
 static inline void do_time_init(void)
 {
-	return paravirt_ops.time_init();
+	PVOP_VCALL0(time_init);
 }
 
 /* The paravirtualized CPUID instruction. */
 static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
 			   unsigned int *ecx, unsigned int *edx)
 {
-	paravirt_ops.cpuid(eax, ebx, ecx, edx);
+	PVOP_VCALL4(cpuid, eax, ebx, ecx, edx);
 }
 
 /*
  * These special macros can be used to get or set a debugging register
  */
-#define get_debugreg(var, reg) var = paravirt_ops.get_debugreg(reg)
-#define set_debugreg(val, reg) paravirt_ops.set_debugreg(reg, val)
-
-#define clts() paravirt_ops.clts()
-
-#define read_cr0() paravirt_ops.read_cr0()
-#define write_cr0(x) paravirt_ops.write_cr0(x)
-
-#define read_cr2() paravirt_ops.read_cr2()
-#define write_cr2(x) paravirt_ops.write_cr2(x)
-
-#define read_cr3() paravirt_ops.read_cr3()
-#define write_cr3(x) paravirt_ops.write_cr3(x)
-
-#define read_cr4() paravirt_ops.read_cr4()
-#define read_cr4_safe(x) paravirt_ops.read_cr4_safe()
-#define write_cr4(x) paravirt_ops.write_cr4(x)
-
-#define raw_ptep_get_and_clear(xp)	(paravirt_ops.ptep_get_and_clear(xp))
+#define get_debugreg(var, reg) var = PVOP_CALL1(unsigned long, get_debugreg, reg)
+#define set_debugreg(val, reg) PVOP_VCALL2(set_debugreg, reg, val)
+
+#define clts()		PVOP_VCALL0(clts)
+
+#define read_cr0()	PVOP_CALL0(unsigned long, read_cr0)
+#define write_cr0(x)	PVOP_VCALL1(write_cr0, x)
+
+#define read_cr2()	PVOP_CALL0(unsigned long, read_cr2)
+#define write_cr2(x)	PVOP_VCALL1(write_cr2, x)
+
+#define read_cr3()	PVOP_CALL0(unsigned long, read_cr3)
+#define write_cr3(x)	PVOP_VCALL1(write_cr3, x)
+
+#define read_cr4()	PVOP_CALL0(unsigned long, read_cr4)
+#define read_cr4_safe(x) PVOP_CALL0(unsigned long, read_cr4_safe)
+#define write_cr4(x)	PVOP_VCALL1(write_cr4, x)
 
 static inline void raw_safe_halt(void)
 {
-	paravirt_ops.safe_halt();
+	PVOP_VCALL0(safe_halt);
 }
 
 static inline void halt(void)
 {
-	paravirt_ops.safe_halt();
-}
-#define wbinvd() paravirt_ops.wbinvd()
+	PVOP_VCALL0(safe_halt);
+}
+#define wbinvd()	PVOP_VCALL0(wbinvd)
 
 #define get_kernel_rpl()  (paravirt_ops.kernel_rpl)
 
 #define rdmsr(msr,val1,val2) do {				\
 	int _err;						\
-	u64 _l = paravirt_ops.read_msr(msr,&_err);		\
+	u64 _l = PVOP_CALL2(u64, read_msr, msr, &_err);		\
 	val1 = (u32)_l;						\
 	val2 = _l >> 32;					\
 } while(0)
 
-#define wrmsr(msr,val1,val2) do {				\
-	u64 _l = ((u64)(val2) << 32) | (val1);			\
-	paravirt_ops.write_msr((msr), _l);			\
-} while(0)
+#define wrmsr(msr,val1,val2) PVOP_VCALL3(write_msr, msr, val1, val2)
 
 #define rdmsrl(msr,val) do {					\
 	int _err;						\
-	val = paravirt_ops.read_msr((msr),&_err);		\
+	val = PVOP_CALL2(u64, read_msr, msr, &_err);		\
 } while(0)
 
-#define wrmsrl(msr,val) (paravirt_ops.write_msr((msr),(val)))
-#define wrmsr_safe(msr,a,b) ({					\
-	u64 _l = ((u64)(b) << 32) | (a);			\
-	paravirt_ops.write_msr((msr),_l);			\
-})
+#define wrmsrl(msr,val)		PVOP_CALL3(int, write_msr, msr, val, 0)
+#define wrmsr_safe(msr,a,b)	PVOP_CALL3(int, write_msr, msr, a, b)
 
 /* rdmsr with exception handling */
 #define rdmsr_safe(msr,a,b) ({					\
 	int _err;						\
-	u64 _l = paravirt_ops.read_msr(msr,&_err);		\
+	u64 _l = PVOP_CALL2(u64, read_msr, msr, &_err);		\
 	(*a) = (u32)_l;						\
 	(*b) = _l >> 32;					\
 	_err; })
 
 #define rdtsc(low,high) do {					\
-	u64 _l = paravirt_ops.read_tsc();			\
+	u64 _l = PVOP_CALL0(u64, read_tsc);			\
 	low = (u32)_l;						\
 	high = _l >> 32;					\
 } while(0)
 
 #define rdtscl(low) do {					\
-	u64 _l = paravirt_ops.read_tsc();			\
+	u64 _l = PVOP_CALL0(u64, read_tsc);			\
 	low = (int)_l;						\
 } while(0)
 
-#define rdtscll(val) (val = paravirt_ops.read_tsc())
+#define rdtscll(val) (val = PVOP_CALL0(u64, read_tsc))
 
 #define write_tsc(val1,val2) wrmsr(0x10, val1, val2)
 
 #define rdpmc(counter,low,high) do {				\
-	u64 _l = paravirt_ops.read_pmc();			\
+	u64 _l = PVOP_CALL0(u64, read_pmc);			\
 	low = (u32)_l;						\
 	high = _l >> 32;					\
 } while(0)
 
-#define load_TR_desc() (paravirt_ops.load_tr_desc())
-#define load_gdt(dtr) (paravirt_ops.load_gdt(dtr))
-#define load_idt(dtr) (paravirt_ops.load_idt(dtr))
-#define set_ldt(addr, entries) (paravirt_ops.set_ldt((addr), (entries)))
-#define store_gdt(dtr) (paravirt_ops.store_gdt(dtr))
-#define store_idt(dtr) (paravirt_ops.store_idt(dtr))
-#define store_tr(tr) ((tr) = paravirt_ops.store_tr())
-#define load_TLS(t,cpu) (paravirt_ops.load_tls((t),(cpu)))
-#define write_ldt_entry(dt, entry, low, high)				\
-	(paravirt_ops.write_ldt_entry((dt), (entry), (low), (high)))
-#define write_gdt_entry(dt, entry, low, high)				\
-	(paravirt_ops.write_gdt_entry((dt), (entry), (low), (high)))
-#define write_idt_entry(dt, entry, low, high)				\
-	(paravirt_ops.write_idt_entry((dt), (entry), (low), (high)))
-#define set_iopl_mask(mask) (paravirt_ops.set_iopl_mask(mask))
-
-#define __pte(x)	paravirt_ops.make_pte(x)
-#define __pgd(x)	paravirt_ops.make_pgd(x)
-
-#define pte_val(x)	paravirt_ops.pte_val(x)
-#define pgd_val(x)	paravirt_ops.pgd_val(x)
-
-#ifdef CONFIG_X86_PAE
-#define __pmd(x)	paravirt_ops.make_pmd(x)
-#define pmd_val(x)	paravirt_ops.pmd_val(x)
-#endif
+static inline void load_TR_desc(void)
+{
+	PVOP_VCALL0(load_tr_desc);
+}
+static inline void load_gdt(const struct Xgt_desc_struct *dtr)
+{
+	PVOP_VCALL1(load_gdt, dtr);
+}
+static inline void load_idt(const struct Xgt_desc_struct *dtr)
+{
+	PVOP_VCALL1(load_idt, dtr);
+}
+static inline void set_ldt(const void *addr, unsigned entries)
+{
+	PVOP_VCALL2(set_ldt, addr, entries);
+}
+static inline void store_gdt(struct Xgt_desc_struct *dtr)
+{
+	PVOP_VCALL1(store_gdt, dtr);
+}
+static inline void store_idt(struct Xgt_desc_struct *dtr)
+{
+	PVOP_VCALL1(store_idt, dtr);
+}
+#define store_tr(tr)	((tr) = PVOP_CALL0(unsigned long, store_tr))
+static inline void load_TLS(struct thread_struct *t, unsigned cpu)
+{
+	PVOP_VCALL2(load_tls, t, cpu);
+}
+static inline void write_ldt_entry(void *dt, int entry, u32 low, u32 high)
+{
+	PVOP_VCALL4(write_ldt_entry, dt, entry, low, high);
+}
+static inline void write_gdt_entry(void *dt, int entry, u32 low, u32 high)
+{
+	PVOP_VCALL4(write_gdt_entry, dt, entry, low, high);
+}
+static inline void write_idt_entry(void *dt, int entry, u32 low, u32 high)
+{
+	PVOP_VCALL4(write_idt_entry, dt, entry, low, high);
+}
+static inline void set_iopl_mask(unsigned mask)
+{
+	PVOP_VCALL1(set_iopl_mask, mask);
+}
 
 /* The paravirtualized I/O functions */
 static inline void slow_down_io(void) {
@@ -356,27 +582,27 @@ static inline void slow_down_io(void) {
  */
 static inline void apic_write(unsigned long reg, unsigned long v)
 {
-	paravirt_ops.apic_write(reg,v);
+	PVOP_VCALL2(apic_write, reg, v);
 }
 
 static inline void apic_write_atomic(unsigned long reg, unsigned long v)
 {
-	paravirt_ops.apic_write_atomic(reg,v);
+	PVOP_VCALL2(apic_write_atomic, reg, v);
 }
 
 static inline unsigned long apic_read(unsigned long reg)
 {
-	return paravirt_ops.apic_read(reg);
+	return PVOP_CALL1(unsigned long, apic_read, reg);
 }
 
 static inline void setup_boot_clock(void)
 {
-	paravirt_ops.setup_boot_clock();
+	PVOP_VCALL0(setup_boot_clock);
 }
 
 static inline void setup_secondary_clock(void)
 {
-	paravirt_ops.setup_secondary_clock();
+	PVOP_VCALL0(setup_secondary_clock);
 }
 #endif
 
@@ -396,91 +622,172 @@ static inline void startup_ipi_hook(int 
 static inline void startup_ipi_hook(int phys_apicid, unsigned long start_eip,
 				    unsigned long start_esp)
 {
-	return paravirt_ops.startup_ipi_hook(phys_apicid, start_eip, start_esp);
+	PVOP_VCALL3(startup_ipi_hook, phys_apicid, start_eip, start_esp);
 }
 #endif
 
 static inline void paravirt_activate_mm(struct mm_struct *prev,
 					struct mm_struct *next)
 {
-	paravirt_ops.activate_mm(prev, next);
+	PVOP_VCALL2(activate_mm, prev, next);
 }
 
 static inline void arch_dup_mmap(struct mm_struct *oldmm,
 				 struct mm_struct *mm)
 {
-	paravirt_ops.dup_mmap(oldmm, mm);
+	PVOP_VCALL2(dup_mmap, oldmm, mm);
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
 {
-	paravirt_ops.exit_mmap(mm);
-}
-
-#define __flush_tlb() paravirt_ops.flush_tlb_user()
-#define __flush_tlb_global() paravirt_ops.flush_tlb_kernel()
-#define __flush_tlb_single(addr) paravirt_ops.flush_tlb_single(addr)
-
-#define paravirt_alloc_pt(pfn) paravirt_ops.alloc_pt(pfn)
-#define paravirt_release_pt(pfn) paravirt_ops.release_pt(pfn)
-
-#define paravirt_alloc_pd(pfn) paravirt_ops.alloc_pd(pfn)
+	PVOP_VCALL1(exit_mmap, mm);
+}
+
+#define __flush_tlb()		PVOP_VCALL0(flush_tlb_user)
+#define __flush_tlb_global()	PVOP_VCALL0(flush_tlb_kernel)
+#define __flush_tlb_single(addr) PVOP_VCALL1(flush_tlb_single, addr)
+
+#define paravirt_alloc_pt(pfn)	PVOP_VCALL1(alloc_pt, pfn)
+#define paravirt_release_pt(pfn) PVOP_VCALL1(release_pt, pfn)
+
+#define paravirt_alloc_pd(pfn)	PVOP_VCALL1(alloc_pd, pfn)
 #define paravirt_alloc_pd_clone(pfn, clonepfn, start, count) \
-	paravirt_ops.alloc_pd_clone(pfn, clonepfn, start, count)
-#define paravirt_release_pd(pfn) paravirt_ops.release_pd(pfn)
+	PVOP_VCALL4(alloc_pd_clone, pfn, clonepfn, start, count)
+#define paravirt_release_pd(pfn) PVOP_VCALL1(release_pd, pfn)
+
+static inline void pte_update(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	PVOP_VCALL3(pte_update, mm, addr, ptep);
+}
+
+static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	PVOP_VCALL3(pte_update_defer, mm, addr, ptep);
+}
+
+#ifdef CONFIG_X86_PAE
+static inline pte_t __pte(unsigned long long val)
+{
+	unsigned long long ret = PVOP_CALL2(unsigned long long, make_pte, val, val >> 32);
+	return (pte_t) { ret, ret >> 32 };
+}
+
+static inline pmd_t __pmd(unsigned long long val)
+{
+	return (pmd_t) { PVOP_CALL2(unsigned long long, make_pmd, val, val >> 32) };
+}
+
+static inline pgd_t __pgd(unsigned long long val)
+{
+	return (pgd_t) { PVOP_CALL2(unsigned long long, make_pgd, val, val >> 32) };
+}
+
+static inline unsigned long long pte_val(pte_t x)
+{
+	return PVOP_CALL2(unsigned long long, pte_val, x.pte_low, x.pte_high);
+}
+
+static inline unsigned long long pmd_val(pmd_t x)
+{
+	return PVOP_CALL2(unsigned long long, pmd_val, x.pmd, x.pmd >> 32);
+}
+
+static inline unsigned long long pgd_val(pgd_t x)
+{
+	return PVOP_CALL2(unsigned long long, pgd_val, x.pgd, x.pgd >> 32);
+}
 
 static inline void set_pte(pte_t *ptep, pte_t pteval)
 {
-	paravirt_ops.set_pte(ptep, pteval);
+	PVOP_VCALL3(set_pte, ptep, pteval.pte_low, pteval.pte_high);
 }
 
 static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep, pte_t pteval)
 {
+	/* 5 arg words */
 	paravirt_ops.set_pte_at(mm, addr, ptep, pteval);
 }
 
+static inline void set_pte_atomic(pte_t *ptep, pte_t pteval)
+{
+	PVOP_VCALL3(set_pte_atomic, ptep, pteval.pte_low, pteval.pte_high);
+}
+
+static inline void set_pte_present(struct mm_struct *mm, unsigned long addr,
+				   pte_t *ptep, pte_t pte)
+{
+	/* 5 arg words */
+	paravirt_ops.set_pte_present(mm, addr, ptep, pte);
+}
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
 {
-	paravirt_ops.set_pmd(pmdp, pmdval);
-}
-
-static inline void pte_update(struct mm_struct *mm, u32 addr, pte_t *ptep)
-{
-	paravirt_ops.pte_update(mm, addr, ptep);
-}
-
-static inline void pte_update_defer(struct mm_struct *mm, u32 addr, pte_t *ptep)
-{
-	paravirt_ops.pte_update_defer(mm, addr, ptep);
-}
-
-#ifdef CONFIG_X86_PAE
-static inline void set_pte_atomic(pte_t *ptep, pte_t pteval)
-{
-	paravirt_ops.set_pte_atomic(ptep, pteval);
-}
-
-static inline void set_pte_present(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte)
-{
-	paravirt_ops.set_pte_present(mm, addr, ptep, pte);
+	PVOP_VCALL3(set_pmd, pmdp, pmdval.pmd, pmdval.pmd >> 32);
 }
 
 static inline void set_pud(pud_t *pudp, pud_t pudval)
 {
-	paravirt_ops.set_pud(pudp, pudval);
+	PVOP_VCALL3(set_pud, pudp, pudval.pgd.pgd, pudval.pgd.pgd >> 32);
 }
 
 static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
-	paravirt_ops.pte_clear(mm, addr, ptep);
+	PVOP_VCALL3(pte_clear, mm, addr, ptep);
 }
 
 static inline void pmd_clear(pmd_t *pmdp)
 {
-	paravirt_ops.pmd_clear(pmdp);
-}
-#endif
+	PVOP_VCALL1(pmd_clear, pmdp);
+}
+
+static inline pte_t raw_ptep_get_and_clear(pte_t *p)
+{
+	unsigned long long val = PVOP_CALL1(unsigned long long, ptep_get_and_clear, p);
+	return (pte_t) { val, val >> 32 };
+}
+#else  /* !CONFIG_X86_PAE */
+static inline pte_t __pte(unsigned long val)
+{
+	return (pte_t) { PVOP_CALL1(unsigned long, make_pte, val) };
+}
+
+static inline pgd_t __pgd(unsigned long val)
+{
+	return (pgd_t) { PVOP_CALL1(unsigned long, make_pgd, val) };
+}
+
+static inline unsigned long pte_val(pte_t x)
+{
+	return PVOP_CALL1(unsigned long, pte_val, x.pte_low);
+}
+
+static inline unsigned long pgd_val(pgd_t x)
+{
+	return PVOP_CALL1(unsigned long, pgd_val, x.pgd);
+}
+
+static inline void set_pte(pte_t *ptep, pte_t pteval)
+{
+	PVOP_VCALL2(set_pte, ptep, pteval.pte_low);
+}
+
+static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
+			      pte_t *ptep, pte_t pteval)
+{
+	PVOP_VCALL4(set_pte_at, mm, addr, ptep, pteval.pte_low);
+}
+
+static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
+{
+	PVOP_VCALL2(set_pmd, pmdp, pmdval.pud.pgd.pgd);
+}
+
+static inline pte_t raw_ptep_get_and_clear(pte_t *p)
+{
+	return (pte_t) { PVOP_CALL1(unsigned long, ptep_get_and_clear, p) };
+}
+#endif	/* CONFIG_X86_PAE */
 
 /* Lazy mode for batching updates / context switch */
 #define PARAVIRT_LAZY_NONE 0
@@ -488,12 +795,12 @@ static inline void pmd_clear(pmd_t *pmdp
 #define PARAVIRT_LAZY_CPU  2
 
 #define  __HAVE_ARCH_ENTER_LAZY_CPU_MODE
-#define arch_enter_lazy_cpu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_CPU)
-#define arch_leave_lazy_cpu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_NONE)
+#define arch_enter_lazy_cpu_mode() PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_CPU)
+#define arch_leave_lazy_cpu_mode() PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_NONE)
 
 #define  __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-#define arch_enter_lazy_mmu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_MMU)
-#define arch_leave_lazy_mmu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_NONE)
+#define arch_enter_lazy_mmu_mode() PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_MMU)
+#define arch_leave_lazy_mmu_mode() PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_NONE)
 
 void _paravirt_nop(void);
 #define paravirt_nop	((void *)_paravirt_nop)

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 14/26] Xen-paravirt_ops: add common patching machinery
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (12 preceding siblings ...)
  2007-03-01 23:24 ` [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable Jeremy Fitzhardinge
@ 2007-03-01 23:24 ` Jeremy Fitzhardinge
  2007-03-16  9:20     ` Ingo Molnar
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                   ` (13 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Anthony Liguori

[-- Attachment #1: paravirt-patch-machinery.patch --]
[-- Type: text/plain, Size: 7567 bytes --]

Implement the actual patching machinery.  paravirt_patch_default()
contains the logic to automatically patch a callsite based on a few
simple rules:

 - if the paravirt_op function is paravirt_nop, then patch nops
 - if the paravirt_op function is a jmp target, then jmp to it
 - if the paravirt_op function is callable and doesn't clobber too much
    for the callsite, call it directly

paravirt_patch_default is suitable as a default implementation of
paravirt_ops.patch, will remove most of the expensive indirect calls
in favour of either a direct call or a pile of nops.

Backends may implement their own patcher, however.  There are several
helper functions to help with this:

paravirt_patch_nop	nop out a callsite
paravirt_patch_ignore	leave the callsite as-is
paravirt_patch_call	patch a call if the caller and callee
			have compatible clobbers
paravirt_patch_jmp	patch in a jmp
paravirt_patch_insns	patch some literal instructions over
			the callsite, if they fit

This patch also implements more direct patches for the native case, so
that when running on native hardware many common operations are
implemented inline.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>
---
 arch/i386/kernel/alternative.c |    3 
 arch/i386/kernel/paravirt.c    |  163 ++++++++++++++++++++++++++++++++--------
 include/asm-i386/paravirt.h    |   11 ++
 3 files changed, 147 insertions(+), 30 deletions(-)

===================================================================
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -359,6 +359,9 @@ void apply_paravirt(struct paravirt_patc
 
 		used = paravirt_ops.patch(p->instrtype, p->clobbers, p->instr,
 					  p->len);
+
+		BUG_ON(used > p->len);
+
 #ifdef CONFIG_DEBUG_PARAVIRT
 		{
 		int i;
===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -53,40 +53,143 @@ char *memory_setup(void)
 #define DEF_NATIVE(name, code)					\
 	extern const char start_##name[], end_##name[];		\
 	asm("start_" #name ": " code "; end_" #name ":")
-DEF_NATIVE(cli, "cli");
-DEF_NATIVE(sti, "sti");
-DEF_NATIVE(popf, "push %eax; popf");
-DEF_NATIVE(pushf, "pushf; pop %eax");
+
+DEF_NATIVE(irq_disable, "cli");
+DEF_NATIVE(irq_enable, "sti");
+DEF_NATIVE(restore_fl, "push %eax; popf");
+DEF_NATIVE(save_fl, "pushf; pop %eax");
 DEF_NATIVE(iret, "iret");
-DEF_NATIVE(sti_sysexit, "sti; sysexit");
-
-static const struct native_insns
-{
-	const char *start, *end;
-} native_insns[] = {
-	[PARAVIRT_PATCH(irq_disable)] = { start_cli, end_cli },
-	[PARAVIRT_PATCH(irq_enable)] = { start_sti, end_sti },
-	[PARAVIRT_PATCH(restore_fl)] = { start_popf, end_popf },
-	[PARAVIRT_PATCH(save_fl)] = { start_pushf, end_pushf },
-	[PARAVIRT_PATCH(iret)] = { start_iret, end_iret },
-	[PARAVIRT_PATCH(irq_enable_sysexit)] = { start_sti_sysexit, end_sti_sysexit },
-};
+DEF_NATIVE(irq_enable_sysexit, "sti; sysexit");
+DEF_NATIVE(read_cr2, "mov %cr2, %eax");
+DEF_NATIVE(write_cr3, "mov %eax, %cr3");
+DEF_NATIVE(read_cr3, "mov %cr3, %eax");
+DEF_NATIVE(clts, "clts");
+DEF_NATIVE(read_tsc, "rdtsc");
+
+DEF_NATIVE(pushf_cli, "pushf; pop %eax; cli");
+DEF_NATIVE(ud2a, "ud2a");
 
 static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
 {
-	unsigned int insn_len;
-
-	/* Don't touch it if we don't have a replacement */
-	if (type >= ARRAY_SIZE(native_insns) || !native_insns[type].start)
-		return len;
-
-	insn_len = native_insns[type].end - native_insns[type].start;
-
-	/* Similarly if we can't fit replacement. */
-	if (len < insn_len)
-		return len;
-
-	memcpy(insns, native_insns[type].start, insn_len);
+	const unsigned char *start, *end;
+	unsigned ret;
+
+	switch(type) {
+#define SITE(x)	case PARAVIRT_PATCH(x):	start = start_##x; end = end_##x; goto patch_site
+		SITE(irq_disable);
+		SITE(irq_enable);
+		SITE(restore_fl);
+		SITE(save_fl);
+		SITE(iret);
+		SITE(irq_enable_sysexit);
+		SITE(read_cr2);
+		SITE(read_cr3);
+		SITE(write_cr3);
+		SITE(clts);
+		SITE(read_tsc);
+#undef SITE
+
+	patch_site:
+		ret = paravirt_patch_insns(insns, len, start, end);
+		break;
+
+	case PARAVIRT_PATCH(make_pgd):
+	case PARAVIRT_PATCH(make_pte):
+	case PARAVIRT_PATCH(pgd_val):
+	case PARAVIRT_PATCH(pte_val):
+#ifdef CONFIG_X86_PAE
+	case PARAVIRT_PATCH(make_pmd):
+	case PARAVIRT_PATCH(pmd_val):
+#endif
+		/* These functions end up returning exactly what
+		   they're passed, in the same registers. */
+		ret = paravirt_patch_nop();
+		break;
+
+	default:
+		ret = paravirt_patch_default(type, clobbers, insns, len);
+		break;
+	}
+
+	return ret;
+}
+
+unsigned paravirt_patch_nop(void)
+{
+	return 0;
+}
+
+unsigned paravirt_patch_ignore(unsigned len)
+{
+	return len;
+}
+
+unsigned paravirt_patch_call(void *target, u16 tgt_clobbers,
+			     void *site, u16 site_clobbers,
+			     unsigned len)
+{
+	unsigned char *call = site;
+	unsigned long delta = (unsigned long)target - (unsigned long)(call+5);
+
+	if (tgt_clobbers & ~site_clobbers)
+		return len;	/* target would clobber too much for this site */
+	if (len < 5)
+		return len;	/* call too long for patch site */
+
+	*call++ = 0xe8;		/* call */
+	*(unsigned long *)call = delta;
+
+	return 5;
+}
+
+unsigned paravirt_patch_jmp(void *target, void *site, unsigned len)
+{
+	unsigned char *jmp = site;
+	unsigned long delta = (unsigned long)target - (unsigned long)(jmp+5);
+
+	if (len < 5)
+		return len;	/* call too long for patch site */
+
+	*jmp++ = 0xe9;		/* jmp */
+	*(unsigned long *)jmp = delta;
+
+	return 5;
+}
+
+unsigned paravirt_patch_default(u8 type, u16 clobbers, void *site, unsigned len)
+{
+	void *opfunc = *((void **)&paravirt_ops + type);
+	unsigned ret;
+
+	if (opfunc == NULL)
+		/* If there's no function, patch it with a ud2a (BUG) */
+		ret = paravirt_patch_insns(site, len, start_ud2a, end_ud2a);
+	else if (opfunc == paravirt_nop)
+		/* If the operation is a nop, then nop the callsite */
+		ret = paravirt_patch_nop();
+	else if (type == PARAVIRT_PATCH(iret) ||
+		 type == PARAVIRT_PATCH(irq_enable_sysexit))
+		/* If operation requires a jmp, then jmp */
+		ret = paravirt_patch_jmp(opfunc, site, len);
+	else
+		/* Otherwise call the function; assume target could
+		   clobber any caller-save reg */
+		ret = paravirt_patch_call(opfunc, CLBR_ANY,
+					  site, clobbers, len);
+
+	return ret;
+}
+
+unsigned paravirt_patch_insns(void *site, unsigned len,
+			      const char *start, const char *end)
+{
+	unsigned insn_len = end - start;
+
+	if (insn_len > len || start == NULL)
+		insn_len = len;
+	else
+		memcpy(site, start, insn_len);
+
 	return insn_len;
 }
 
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -821,6 +821,17 @@ struct paravirt_patch_site {
 	u16 clobbers;		/* what registers you may clobber */
 };
 
+unsigned paravirt_patch_nop(void);
+unsigned paravirt_patch_ignore(unsigned len);
+unsigned paravirt_patch_call(void *target, u16 tgt_clobbers,
+			     void *site, u16 site_clobbers,
+			     unsigned len);
+unsigned paravirt_patch_jmp(void *target, void *site, unsigned len);
+unsigned paravirt_patch_default(u8 type, u16 clobbers, void *site, unsigned len);
+
+unsigned paravirt_patch_insns(void *site, unsigned len,
+			      const char *start, const char *end);
+
 extern struct paravirt_patch_site __start_parainstructions[],
 	__stop_parainstructions[];
 

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 15/26] Xen-paravirt_ops: Add apply_to_page_range() which applies a function to a pte range.
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Ian Pratt,
	Christian Limpach, Christoph Lameter

[-- Attachment #1: apply-to-page-range.patch --]
[-- Type: text/plain, Size: 4181 bytes --]

Add a new mm function apply_to_page_range() which applies a given
function to every pte in a given virtual address range in a given mm
structure. This is a generic alternative to cut-and-pasting the Linux
idiomatic pagetable walking code in every place that a sequence of
PTEs must be accessed.

Although this interface is intended to be useful in a wide range of
situations, it is currently used specifically by several Xen
subsystems, for example: to ensure that pagetables have been allocated
for a virtual address range, and to construct batched special
pagetable update requests to map I/O memory (in ioremap()).

Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mm.h |    5 ++
 mm/memory.c        |   94 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

===================================================================
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1135,6 +1135,11 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_ANON	0x08	/* give ZERO_PAGE if no pgtable */
 
+typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr,
+			void *data);
+extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
+			       unsigned long size, pte_fn_t fn, void *data);
+
 #ifdef CONFIG_PROC_FS
 void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
 #else
===================================================================
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1448,6 +1448,100 @@ int remap_pfn_range(struct vm_area_struc
 }
 EXPORT_SYMBOL(remap_pfn_range);
 
+static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
+				     unsigned long addr, unsigned long end,
+				     pte_fn_t fn, void *data)
+{
+	pte_t *pte;
+	int err;
+	struct page *pmd_page;
+	spinlock_t *ptl;
+
+	pte = (mm == &init_mm) ?
+		pte_alloc_kernel(pmd, addr) :
+		pte_alloc_map_lock(mm, pmd, addr, &ptl);
+	if (!pte)
+		return -ENOMEM;
+
+	BUG_ON(pmd_huge(*pmd));
+
+	pmd_page = pmd_page(*pmd);
+
+	do {
+		err = fn(pte, pmd_page, addr, data);
+		if (err)
+			break;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	if (mm != &init_mm)
+		pte_unmap_unlock(pte-1, ptl);
+	return err;
+}
+
+static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
+				     unsigned long addr, unsigned long end,
+				     pte_fn_t fn, void *data)
+{
+	pmd_t *pmd;
+	unsigned long next;
+	int err;
+
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return -ENOMEM;
+	do {
+		next = pmd_addr_end(addr, end);
+		err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pmd++, addr = next, addr != end);
+	return err;
+}
+
+static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
+				     unsigned long addr, unsigned long end,
+				     pte_fn_t fn, void *data)
+{
+	pud_t *pud;
+	unsigned long next;
+	int err;
+
+	pud = pud_alloc(mm, pgd, addr);
+	if (!pud)
+		return -ENOMEM;
+	do {
+		next = pud_addr_end(addr, end);
+		err = apply_to_pmd_range(mm, pud, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pud++, addr = next, addr != end);
+	return err;
+}
+
+/*
+ * Scan a region of virtual memory, filling in page tables as necessary
+ * and calling a provided function on each leaf page table.
+ */
+int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
+			unsigned long size, pte_fn_t fn, void *data)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	unsigned long end = addr + size;
+	int err;
+
+	BUG_ON(addr >= end);
+	pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		err = apply_to_pud_range(mm, pgd, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pgd++, addr = next, addr != end);
+	return err;
+}
+EXPORT_SYMBOL_GPL(apply_to_page_range);
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry
  * which was read non-atomically.  Before making any commitment, on

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 15/26] Xen-paravirt_ops: Add apply_to_page_range() which applies a function to a pte range.
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zachary Amsden, xen-devel, Ian Pratt, Rusty Russell,
	linux-kernel, Chris Wright, virtualization, Andrew Morton,
	Christoph Lameter, Christian Limpach

[-- Attachment #1: apply-to-page-range.patch --]
[-- Type: text/plain, Size: 4180 bytes --]

Add a new mm function apply_to_page_range() which applies a given
function to every pte in a given virtual address range in a given mm
structure. This is a generic alternative to cut-and-pasting the Linux
idiomatic pagetable walking code in every place that a sequence of
PTEs must be accessed.

Although this interface is intended to be useful in a wide range of
situations, it is currently used specifically by several Xen
subsystems, for example: to ensure that pagetables have been allocated
for a virtual address range, and to construct batched special
pagetable update requests to map I/O memory (in ioremap()).

Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mm.h |    5 ++
 mm/memory.c        |   94 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

===================================================================
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1135,6 +1135,11 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_ANON	0x08	/* give ZERO_PAGE if no pgtable */
 
+typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr,
+			void *data);
+extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
+			       unsigned long size, pte_fn_t fn, void *data);
+
 #ifdef CONFIG_PROC_FS
 void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
 #else
===================================================================
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1448,6 +1448,100 @@ int remap_pfn_range(struct vm_area_struc
 }
 EXPORT_SYMBOL(remap_pfn_range);
 
+static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
+				     unsigned long addr, unsigned long end,
+				     pte_fn_t fn, void *data)
+{
+	pte_t *pte;
+	int err;
+	struct page *pmd_page;
+	spinlock_t *ptl;
+
+	pte = (mm == &init_mm) ?
+		pte_alloc_kernel(pmd, addr) :
+		pte_alloc_map_lock(mm, pmd, addr, &ptl);
+	if (!pte)
+		return -ENOMEM;
+
+	BUG_ON(pmd_huge(*pmd));
+
+	pmd_page = pmd_page(*pmd);
+
+	do {
+		err = fn(pte, pmd_page, addr, data);
+		if (err)
+			break;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	if (mm != &init_mm)
+		pte_unmap_unlock(pte-1, ptl);
+	return err;
+}
+
+static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
+				     unsigned long addr, unsigned long end,
+				     pte_fn_t fn, void *data)
+{
+	pmd_t *pmd;
+	unsigned long next;
+	int err;
+
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return -ENOMEM;
+	do {
+		next = pmd_addr_end(addr, end);
+		err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pmd++, addr = next, addr != end);
+	return err;
+}
+
+static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
+				     unsigned long addr, unsigned long end,
+				     pte_fn_t fn, void *data)
+{
+	pud_t *pud;
+	unsigned long next;
+	int err;
+
+	pud = pud_alloc(mm, pgd, addr);
+	if (!pud)
+		return -ENOMEM;
+	do {
+		next = pud_addr_end(addr, end);
+		err = apply_to_pmd_range(mm, pud, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pud++, addr = next, addr != end);
+	return err;
+}
+
+/*
+ * Scan a region of virtual memory, filling in page tables as necessary
+ * and calling a provided function on each leaf page table.
+ */
+int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
+			unsigned long size, pte_fn_t fn, void *data)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	unsigned long end = addr + size;
+	int err;
+
+	BUG_ON(addr >= end);
+	pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		err = apply_to_pud_range(mm, pgd, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pgd++, addr = next, addr != end);
+	return err;
+}
+EXPORT_SYMBOL_GPL(apply_to_page_range);
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry
  * which was read non-atomically.  Before making any commitment, on

-- 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 16/26] Xen-paravirt_ops: Allocate and free vmalloc areas
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Ian Pratt,
	Christian Limpach, Jan Beulich

[-- Attachment #1: alloc-vm-area.patch --]
[-- Type: text/plain, Size: 2454 bytes --]

Allocate/destroy a 'vmalloc' VM area: alloc_vm_area and free_vm_area
The alloc function ensures that page tables are constructed for the
region of kernel virtual address space and mapped into init_mm.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: "Jan Beulich" <JBeulich@novell.com>
Cc: "Andi Kleen" <ak@muc.de>

---
 arch/i386/mm/fault.c    |   43 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/vmalloc.h |    4 ++++
 2 files changed, 47 insertions(+)

===================================================================
--- a/arch/i386/mm/fault.c
+++ b/arch/i386/mm/fault.c
@@ -23,6 +23,7 @@
 #include <linux/module.h>
 #include <linux/kprobes.h>
 #include <linux/uaccess.h>
+#include <linux/vmalloc.h>
 
 #include <asm/system.h>
 #include <asm/desc.h>
@@ -624,3 +625,45 @@ void _vmalloc_sync_all(void)
 			start = address + PGDIR_SIZE;
 	}
 }
+
+
+static int f(pte_t *pte, struct page *pmd_page, unsigned long addr, void *data)
+{
+	/* apply_to_page_range() does all the hard work. */
+	return 0;
+}
+
+struct vm_struct *alloc_vm_area(unsigned long size)
+{
+	struct vm_struct *area;
+
+	area = get_vm_area(size, VM_IOREMAP);
+	if (area == NULL)
+		return NULL;
+
+	/*
+	 * This ensures that page tables are constructed for this region
+	 * of kernel virtual address space and mapped into init_mm.
+	 */
+	if (apply_to_page_range(&init_mm, (unsigned long)area->addr,
+				area->size, f, NULL)) {
+		free_vm_area(area);
+		return NULL;
+	}
+
+	/* Make sure the pagetables are constructed in process kernel
+	   mappings */
+	vmalloc_sync_all();
+
+	return area;
+}
+EXPORT_SYMBOL_GPL(alloc_vm_area);
+
+void free_vm_area(struct vm_struct *area)
+{
+	struct vm_struct *ret;
+	ret = remove_vm_area(area->addr);
+	BUG_ON(ret != area);
+	kfree(area);
+}
+EXPORT_SYMBOL_GPL(free_vm_area);
===================================================================
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -68,6 +68,10 @@ extern int map_vm_area(struct vm_struct 
 			struct page ***pages);
 extern void unmap_vm_area(struct vm_struct *area);
 
+/* Allocate/destroy a 'vmalloc' VM area. */
+extern struct vm_struct *alloc_vm_area(unsigned long size);
+extern void free_vm_area(struct vm_struct *area);
+
 /*
  *	Internals.  Dont't use..
  */

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 16/26] Xen-paravirt_ops: Allocate and free vmalloc areas
@ 2007-03-01 23:24   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Chris Wright, virtualization, xen-devel, Ian Pratt,
	Andrew Morton, linux-kernel, Jan Beulich

[-- Attachment #1: alloc-vm-area.patch --]
[-- Type: text/plain, Size: 2540 bytes --]

Allocate/destroy a 'vmalloc' VM area: alloc_vm_area and free_vm_area
The alloc function ensures that page tables are constructed for the
region of kernel virtual address space and mapped into init_mm.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: "Jan Beulich" <JBeulich@novell.com>
Cc: "Andi Kleen" <ak@muc.de>

---
 arch/i386/mm/fault.c    |   43 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/vmalloc.h |    4 ++++
 2 files changed, 47 insertions(+)

===================================================================
--- a/arch/i386/mm/fault.c
+++ b/arch/i386/mm/fault.c
@@ -23,6 +23,7 @@
 #include <linux/module.h>
 #include <linux/kprobes.h>
 #include <linux/uaccess.h>
+#include <linux/vmalloc.h>
 
 #include <asm/system.h>
 #include <asm/desc.h>
@@ -624,3 +625,45 @@ void _vmalloc_sync_all(void)
 			start = address + PGDIR_SIZE;
 	}
 }
+
+
+static int f(pte_t *pte, struct page *pmd_page, unsigned long addr, void *data)
+{
+	/* apply_to_page_range() does all the hard work. */
+	return 0;
+}
+
+struct vm_struct *alloc_vm_area(unsigned long size)
+{
+	struct vm_struct *area;
+
+	area = get_vm_area(size, VM_IOREMAP);
+	if (area == NULL)
+		return NULL;
+
+	/*
+	 * This ensures that page tables are constructed for this region
+	 * of kernel virtual address space and mapped into init_mm.
+	 */
+	if (apply_to_page_range(&init_mm, (unsigned long)area->addr,
+				area->size, f, NULL)) {
+		free_vm_area(area);
+		return NULL;
+	}
+
+	/* Make sure the pagetables are constructed in process kernel
+	   mappings */
+	vmalloc_sync_all();
+
+	return area;
+}
+EXPORT_SYMBOL_GPL(alloc_vm_area);
+
+void free_vm_area(struct vm_struct *area)
+{
+	struct vm_struct *ret;
+	ret = remove_vm_area(area->addr);
+	BUG_ON(ret != area);
+	kfree(area);
+}
+EXPORT_SYMBOL_GPL(free_vm_area);
===================================================================
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -68,6 +68,10 @@ extern int map_vm_area(struct vm_struct 
 			struct page ***pages);
 extern void unmap_vm_area(struct vm_struct *area);
 
+/* Allocate/destroy a 'vmalloc' VM area. */
+extern struct vm_struct *alloc_vm_area(unsigned long size);
+extern void free_vm_area(struct vm_struct *area);
+
 /*
  *	Internals.  Dont't use..
  */

-- 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (15 preceding siblings ...)
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-01 23:25 ` Jeremy Fitzhardinge
  2007-03-16  9:15     ` Ingo Molnar
  2007-03-01 23:25 ` [patch 18/26] Xen-paravirt_ops: Add XEN config options Jeremy Fitzhardinge
                   ` (10 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Ian Pratt,
	Christian Limpach

[-- Attachment #1: xen-vsyscall-note.patch --]
[-- Type: text/plain, Size: 2114 bytes --]

Add the "nosegneg" fake capabilty to the vsyscall page notes. This is
used by the runtime linker to select a glibc version which then
disables negative-offset accesses to the thread-local segment via
%gs. These accesses require emulation in Xen (because segments are
truncated to protect the hypervisor address space) and avoiding them
provides a measurable performance boost.

Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Acked-by: Zachary Amsden <zach@vmware.com>

---
 arch/i386/kernel/vsyscall-note.S |   28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

===================================================================
--- a/arch/i386/kernel/vsyscall-note.S
+++ b/arch/i386/kernel/vsyscall-note.S
@@ -23,3 +24,31 @@ 3:	.balign 4;		/* pad out section */			 
 	ASM_ELF_NOTE_BEGIN(".note.kernel-version", "a", UTS_SYSNAME, 0)
 	.long LINUX_VERSION_CODE
 	ASM_ELF_NOTE_END
+
+#ifdef CONFIG_XEN
+/*
+ * Add a special note telling glibc's dynamic linker a fake hardware
+ * flavor that it will use to choose the search path for libraries in the
+ * same way it uses real hardware capabilities like "mmx".
+ * We supply "nosegneg" as the fake capability, to indicate that we
+ * do not like negative offsets in instructions using segment overrides,
+ * since we implement those inefficiently.  This makes it possible to
+ * install libraries optimized to avoid those access patterns in someplace
+ * like /lib/i686/tls/nosegneg.  Note that an /etc/ld.so.conf.d/file
+ * corresponding to the bits here is needed to make ldconfig work right.
+ * It should contain:
+ *	hwcap 0 nosegneg
+ * to match the mapping of bit to name that we give here.
+ */
+#define NOTE_KERNELCAP_BEGIN(ncaps, mask) \
+	ASM_ELF_NOTE_BEGIN(".note.kernelcap", "a", "GNU", 2) \
+	.long ncaps, mask
+#define NOTE_KERNELCAP(bit, name) \
+	.byte bit; .asciz name
+#define NOTE_KERNELCAP_END ASM_ELF_NOTE_END
+
+NOTE_KERNELCAP_BEGIN(1, 1)
+NOTE_KERNELCAP(1, "nosegneg")
+NOTE_KERNELCAP_END
+#endif
+

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 18/26] Xen-paravirt_ops: Add XEN config options
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (16 preceding siblings ...)
  2007-03-01 23:25 ` [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes Jeremy Fitzhardinge
@ 2007-03-01 23:25 ` Jeremy Fitzhardinge
  2007-03-16  9:14     ` Ingo Molnar
  2007-03-01 23:25 ` [patch 19/26] Xen-paravirt_ops: Add Xen interface header files Jeremy Fitzhardinge
                   ` (9 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Ian Pratt,
	Christian Limpach

[-- Attachment #1: xen-config.patch --]
[-- Type: text/plain, Size: 1455 bytes --]

The XEN config option enables the Xen paravirt_ops interface, which is
installed when the kernel finds itself running under Xen.

Xen is no longer a sub-architecture, so the X86_XEN subarch config
option has gone.

Xen is currently incompatible with:
- PREEMPT
- HZ: set to 100Hz for now, to cut down on VCPU context switch rate.
  This will be adapted to use tickless later.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>

---
 arch/i386/Kconfig     |    2 ++
 arch/i386/xen/Kconfig |   10 ++++++++++
 2 files changed, 12 insertions(+)

===================================================================
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -216,6 +216,8 @@ config PARAVIRT
 	  under a hypervisor, improving performance significantly.
 	  However, when run without a hypervisor the kernel is
 	  theoretically slower.  If in doubt, say N.
+
+source "arch/i386/xen/Kconfig"
 
 config VMI
 	bool "VMI Paravirt-ops support"
===================================================================
--- /dev/null
+++ b/arch/i386/xen/Kconfig
@@ -0,0 +1,10 @@
+#
+# This Kconfig describes xen options
+#
+
+config XEN
+	bool "Enable support for Xen hypervisor"
+	depends on PARAVIRT && HZ_100 && !PREEMPT && !NO_HZ
+	default y
+	help
+	  This is the Linux Xen port.

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 19/26] Xen-paravirt_ops: Add Xen interface header files
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (17 preceding siblings ...)
  2007-03-01 23:25 ` [patch 18/26] Xen-paravirt_ops: Add XEN config options Jeremy Fitzhardinge
@ 2007-03-01 23:25 ` Jeremy Fitzhardinge
  2007-03-01 23:25 ` [patch 20/26] Xen-paravirt_ops: Core Xen implementation Jeremy Fitzhardinge
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zachary Amsden, xen-devel, Ian Pratt, Rusty Russell,
	linux-kernel, Chris Wright, virtualization, Andrew Morton,
	Christian Limpach

[-- Attachment #1: xen-interface.patch --]
[-- Type: text/plain, Size: 106304 bytes --]

Add Xen interface header files. These are taken fairly directly from
the Xen tree, but somewhat rearranged to suit the kernel's conventions.

Define macros and inline functions for doing hypercalls into the
hypervisor.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>


---
 include/asm-i386/xen/hypercall.h      |  413 ++++++++++++++++++++++++++++++
 include/asm-i386/xen/hypervisor.h     |   72 +++++
 include/asm-i386/xen/interface.h      |  187 +++++++++++++
 include/xen/interface/elfnote.h       |  133 +++++++++
 include/xen/interface/event_channel.h |  195 ++++++++++++++
 include/xen/interface/features.h      |   43 +++
 include/xen/interface/grant_table.h   |  301 ++++++++++++++++++++++
 include/xen/interface/io/blkif.h      |   96 +++++++
 include/xen/interface/io/console.h    |   23 +
 include/xen/interface/io/netif.h      |  152 +++++++++++
 include/xen/interface/io/ring.h       |  260 +++++++++++++++++++
 include/xen/interface/io/xenbus.h     |   42 +++
 include/xen/interface/io/xs_wire.h    |   87 ++++++
 include/xen/interface/memory.h        |  145 ++++++++++
 include/xen/interface/physdev.h       |   61 ++++
 include/xen/interface/sched.h         |   77 +++++
 include/xen/interface/vcpu.h          |  109 ++++++++
 include/xen/interface/version.h       |   60 ++++
 include/xen/interface/xen.h           |  445 +++++++++++++++++++++++++++++++++
 19 files changed, 2901 insertions(+)

===================================================================
--- /dev/null
+++ b/include/asm-i386/xen/hypercall.h
@@ -0,0 +1,413 @@
+/******************************************************************************
+ * hypercall.h
+ *
+ * Linux-specific hypervisor handling.
+ *
+ * Copyright (c) 2002-2004, K A Fraser
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef __HYPERCALL_H__
+#define __HYPERCALL_H__
+
+#include <linux/errno.h>
+#include <linux/string.h>
+
+#include <xen/interface/xen.h>
+#include <xen/interface/sched.h>
+#include <xen/interface/physdev.h>
+
+extern struct { char _entry[32]; } hypercall_page[];
+
+#define _hypercall0(type, name)						\
+({									\
+	long __res;							\
+	asm volatile (							\
+		"call %[call]"						\
+		: "=a" (__res)						\
+		: [call] "m" (hypercall_page[__HYPERVISOR_##name])	\
+		: "memory" );						\
+	(type)__res;							\
+})
+
+#define _hypercall1(type, name, a1)					\
+({									\
+	long __res, __ign1;						\
+	asm volatile (							\
+		"call %[call]"						\
+		: "=a" (__res), "=b" (__ign1)				\
+		: "1" ((long)(a1)),					\
+		  [call] "m" (hypercall_page[__HYPERVISOR_##name])	\
+		: "memory" );						\
+	(type)__res;							\
+})
+
+#define _hypercall2(type, name, a1, a2)					\
+({									\
+	long __res, __ign1, __ign2;					\
+	asm volatile (							\
+		"call %[call]"						\
+		: "=a" (__res), "=b" (__ign1), "=c" (__ign2)		\
+		: "1" ((long)(a1)), "2" ((long)(a2)),			\
+		  [call] "m" (hypercall_page[__HYPERVISOR_##name])	\
+		: "memory" );						\
+	(type)__res;							\
+})
+
+#define _hypercall3(type, name, a1, a2, a3)				\
+({									\
+	long __res, __ign1, __ign2, __ign3;				\
+	asm volatile (							\
+		"call %[call]"						\
+		: "=a" (__res), "=b" (__ign1), "=c" (__ign2),		\
+		"=d" (__ign3)						\
+		: "1" ((long)(a1)), "2" ((long)(a2)),			\
+		  "3" ((long)(a3)),					\
+		  [call] "m" (hypercall_page[__HYPERVISOR_##name])	\
+		: "memory" );						\
+	(type)__res;							\
+})
+
+#define _hypercall4(type, name, a1, a2, a3, a4)				\
+({									\
+	long __res, __ign1, __ign2, __ign3, __ign4;			\
+	asm volatile (							\
+		"call %[call]"						\
+		: "=a" (__res), "=b" (__ign1), "=c" (__ign2),		\
+		"=d" (__ign3), "=S" (__ign4)				\
+		: "1" ((long)(a1)), "2" ((long)(a2)),			\
+		  "3" ((long)(a3)), "4" ((long)(a4)),			\
+		  [call] "m" (hypercall_page[__HYPERVISOR_##name])	\
+		: "memory" );						\
+	(type)__res;							\
+})
+
+#define _hypercall5(type, name, a1, a2, a3, a4, a5)			\
+({									\
+	long __res, __ign1, __ign2, __ign3, __ign4, __ign5;		\
+	asm volatile (							\
+		"call %[call]"						\
+		: "=a" (__res), "=b" (__ign1), "=c" (__ign2),		\
+		"=d" (__ign3), "=S" (__ign4), "=D" (__ign5)		\
+		: "1" ((long)(a1)), "2" ((long)(a2)),			\
+		  "3" ((long)(a3)), "4" ((long)(a4)),			\
+		  "5" ((long)(a5)),					\
+		  [call] "m" (hypercall_page[__HYPERVISOR_##name])	\
+		: "memory" );						\
+	(type)__res;							\
+})
+
+static inline int
+HYPERVISOR_set_trap_table(
+	struct trap_info *table)
+{
+	return _hypercall1(int, set_trap_table, table);
+}
+
+static inline int
+HYPERVISOR_mmu_update(
+	struct mmu_update *req, int count, int *success_count, domid_t domid)
+{
+	return _hypercall4(int, mmu_update, req, count, success_count, domid);
+}
+
+static inline int
+HYPERVISOR_mmuext_op(
+	struct mmuext_op *op, int count, int *success_count, domid_t domid)
+{
+	return _hypercall4(int, mmuext_op, op, count, success_count, domid);
+}
+
+static inline int
+HYPERVISOR_set_gdt(
+	unsigned long *frame_list, int entries)
+{
+	return _hypercall2(int, set_gdt, frame_list, entries);
+}
+
+static inline int
+HYPERVISOR_stack_switch(
+	unsigned long ss, unsigned long esp)
+{
+	return _hypercall2(int, stack_switch, ss, esp);
+}
+
+static inline int
+HYPERVISOR_set_callbacks(
+	unsigned long event_selector, unsigned long event_address,
+	unsigned long failsafe_selector, unsigned long failsafe_address)
+{
+	return _hypercall4(int, set_callbacks,
+			   event_selector, event_address,
+			   failsafe_selector, failsafe_address);
+}
+
+static inline int
+HYPERVISOR_fpu_taskswitch(
+	int set)
+{
+	return _hypercall1(int, fpu_taskswitch, set);
+}
+
+static inline int
+HYPERVISOR_sched_op(
+	int cmd, unsigned long arg)
+{
+	return _hypercall2(int, sched_op, cmd, arg);
+}
+
+static inline long
+HYPERVISOR_set_timer_op(
+	u64 timeout)
+{
+	unsigned long timeout_hi = (unsigned long)(timeout>>32);
+	unsigned long timeout_lo = (unsigned long)timeout;
+	return _hypercall2(long, set_timer_op, timeout_lo, timeout_hi);
+}
+
+static inline int
+HYPERVISOR_set_debugreg(
+	int reg, unsigned long value)
+{
+	return _hypercall2(int, set_debugreg, reg, value);
+}
+
+static inline unsigned long
+HYPERVISOR_get_debugreg(
+	int reg)
+{
+	return _hypercall1(unsigned long, get_debugreg, reg);
+}
+
+static inline int
+HYPERVISOR_update_descriptor(
+	u64 ma, u64 desc)
+{
+	return _hypercall4(int, update_descriptor, ma, ma>>32, desc, desc>>32);
+}
+
+static inline int
+HYPERVISOR_memory_op(
+	unsigned int cmd, void *arg)
+{
+	return _hypercall2(int, memory_op, cmd, arg);
+}
+
+static inline int
+HYPERVISOR_multicall(
+	void *call_list, int nr_calls)
+{
+	return _hypercall2(int, multicall, call_list, nr_calls);
+}
+
+static inline int
+HYPERVISOR_update_va_mapping(
+	unsigned long va, pte_t new_val, unsigned long flags)
+{
+	unsigned long pte_hi = 0;
+#ifdef CONFIG_X86_PAE
+	pte_hi = new_val.pte_high;
+#endif
+	return _hypercall4(int, update_va_mapping, va,
+			   new_val.pte_low, pte_hi, flags);
+}
+
+static inline int
+HYPERVISOR_event_channel_op(
+        int cmd, void *arg)
+{
+        int rc = _hypercall2(int, event_channel_op, cmd, arg);
+        if (unlikely(rc == -ENOSYS)) {
+                struct evtchn_op op;
+                op.cmd = cmd;
+                memcpy(&op.u, arg, sizeof(op.u));
+                rc = _hypercall1(int, event_channel_op_compat, &op);
+                memcpy(arg, &op.u, sizeof(op.u));
+        }
+        return rc;
+}
+
+static inline int
+HYPERVISOR_xen_version(
+	int cmd, void *arg)
+{
+	return _hypercall2(int, xen_version, cmd, arg);
+}
+
+static inline int
+HYPERVISOR_console_io(
+	int cmd, int count, char *str)
+{
+	return _hypercall3(int, console_io, cmd, count, str);
+}
+
+static inline int
+HYPERVISOR_physdev_op(int cmd, void *arg)
+{
+	int rc = _hypercall2(int, physdev_op, cmd, arg);
+	if (unlikely(rc == -ENOSYS)) {
+		struct physdev_op op;
+		op.cmd = cmd;
+		memcpy(&op.u, arg, sizeof(op.u));
+		rc = _hypercall1(int, physdev_op_compat, &op);
+		memcpy(arg, &op.u, sizeof(op.u));
+	}
+	return rc;
+}
+
+static inline int
+HYPERVISOR_grant_table_op(
+	unsigned int cmd, void *uop, unsigned int count)
+{
+	return _hypercall3(int, grant_table_op, cmd, uop, count);
+}
+
+static inline int
+HYPERVISOR_update_va_mapping_otherdomain(
+	unsigned long va, pte_t new_val, unsigned long flags, domid_t domid)
+{
+	unsigned long pte_hi = 0;
+#ifdef CONFIG_X86_PAE
+	pte_hi = new_val.pte_high;
+#endif
+	return _hypercall5(int, update_va_mapping_otherdomain, va,
+			   new_val.pte_low, pte_hi, flags, domid);
+}
+
+static inline int
+HYPERVISOR_vm_assist(
+	unsigned int cmd, unsigned int type)
+{
+	return _hypercall2(int, vm_assist, cmd, type);
+}
+
+static inline int
+HYPERVISOR_vcpu_op(
+	int cmd, int vcpuid, void *extra_args)
+{
+	return _hypercall3(int, vcpu_op, cmd, vcpuid, extra_args);
+}
+
+static inline int
+HYPERVISOR_suspend(
+	unsigned long srec)
+{
+	return _hypercall3(int, sched_op, SCHEDOP_shutdown,
+			   SHUTDOWN_suspend, srec);
+}
+
+static inline int
+HYPERVISOR_nmi_op(
+	unsigned long op,
+	unsigned long arg)
+{
+	return _hypercall2(int, nmi_op, op, arg);
+}
+
+static inline void
+MULTI_update_va_mapping(struct multicall_entry *mcl, unsigned long va,
+			pte_t new_val, unsigned long flags)
+{
+	mcl->op = __HYPERVISOR_update_va_mapping;
+	mcl->args[0] = va;
+#ifdef CONFIG_X86_PAE
+	mcl->args[1] = new_val.pte_low;
+	mcl->args[2] = new_val.pte_high;
+#else
+	mcl->args[1] = new_val.pte_low;
+	mcl->args[2] = 0;
+#endif
+	mcl->args[3] = flags;
+}
+
+static inline void
+MULTI_grant_table_op(struct multicall_entry *mcl, unsigned int cmd,
+		     void *uop, unsigned int count)
+{
+	mcl->op = __HYPERVISOR_grant_table_op;
+	mcl->args[0] = cmd;
+	mcl->args[1] = (unsigned long)uop;
+	mcl->args[2] = count;
+}
+
+static inline void
+MULTI_update_va_mapping_otherdomain(struct multicall_entry *mcl, unsigned long va,
+				    pte_t new_val, unsigned long flags, domid_t domid)
+{
+	mcl->op = __HYPERVISOR_update_va_mapping_otherdomain;
+	mcl->args[0] = va;
+#ifdef CONFIG_X86_PAE
+	mcl->args[1] = new_val.pte_low;
+	mcl->args[2] = new_val.pte_high;
+#else
+	mcl->args[1] = new_val.pte_low;
+	mcl->args[2] = 0;
+#endif
+	mcl->args[3] = flags;
+	mcl->args[4] = domid;
+}
+
+static inline void
+MULTI_update_descriptor(struct multicall_entry *mcl, u64 maddr,
+			struct desc_struct desc)
+{
+	mcl->op = __HYPERVISOR_update_descriptor;
+	mcl->args[0] = maddr;
+	mcl->args[1] = maddr >> 32;
+	mcl->args[2] = desc.a;
+	mcl->args[3] = desc.b;
+}
+
+static inline void
+MULTI_memory_op(struct multicall_entry *mcl, unsigned int cmd, void *arg)
+{
+	mcl->op = __HYPERVISOR_memory_op;
+	mcl->args[0] = cmd;
+	mcl->args[1] = (unsigned long)arg;
+}
+
+static inline void
+MULTI_mmu_update(struct multicall_entry *mcl, struct mmu_update *req,
+		 int count, int *success_count, domid_t domid)
+{
+	mcl->op = __HYPERVISOR_mmu_update;
+	mcl->args[0] = (unsigned long)req;
+	mcl->args[1] = count;
+	mcl->args[2] = (unsigned long)success_count;
+	mcl->args[3] = domid;
+}
+
+static inline void
+MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
+		int *success_count, domid_t domid)
+{
+	mcl->op = __HYPERVISOR_mmuext_op;
+	mcl->args[0] = (unsigned long)op;
+	mcl->args[1] = count;
+	mcl->args[2] = (unsigned long)success_count;
+	mcl->args[3] = domid;
+}
+#endif /* __HYPERCALL_H__ */
===================================================================
--- /dev/null
+++ b/include/asm-i386/xen/hypervisor.h
@@ -0,0 +1,72 @@
+/******************************************************************************
+ * hypervisor.h
+ *
+ * Linux-specific hypervisor handling.
+ *
+ * Copyright (c) 2002-2004, K A Fraser
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef __HYPERVISOR_H__
+#define __HYPERVISOR_H__
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/version.h>
+
+#include <xen/interface/xen.h>
+#include <xen/interface/version.h>
+
+#include <asm/ptrace.h>
+#include <asm/page.h>
+#if defined(__i386__)
+#  ifdef CONFIG_X86_PAE
+#   include <asm-generic/pgtable-nopud.h>
+#  else
+#   include <asm-generic/pgtable-nopmd.h>
+#  endif
+#endif
+#include <asm/xen/hypercall.h>
+
+/* arch/i386/kernel/setup.c */
+extern struct shared_info *HYPERVISOR_shared_info;
+extern struct start_info *xen_start_info;
+#define is_initial_xendomain() (xen_start_info->flags & SIF_INITDOMAIN)
+
+/* arch/i386/mach-xen/evtchn.c */
+/* Force a proper event-channel callback from Xen. */
+extern void force_evtchn_callback(void);
+
+/* Turn jiffies into Xen system time. */
+u64 jiffies_to_st(unsigned long jiffies);
+
+
+#define MULTI_UVMFLAGS_INDEX 3
+#define MULTI_UVMDOMID_INDEX 4
+
+#define is_running_on_xen()	(xen_start_info ? 1 : 0)
+
+#endif /* __HYPERVISOR_H__ */
===================================================================
--- /dev/null
+++ b/include/asm-i386/xen/interface.h
@@ -0,0 +1,187 @@
+/******************************************************************************
+ * arch-x86_32.h
+ *
+ * Guest OS interface to x86 32-bit Xen.
+ *
+ * Copyright (c) 2004, K A Fraser
+ */
+
+#ifndef __XEN_PUBLIC_ARCH_X86_32_H__
+#define __XEN_PUBLIC_ARCH_X86_32_H__
+
+#ifdef __XEN__
+#define __DEFINE_GUEST_HANDLE(name, type) \
+    typedef struct { type *p; } __guest_handle_ ## name
+#else
+#define __DEFINE_GUEST_HANDLE(name, type) \
+    typedef type * __guest_handle_ ## name
+#endif
+
+#define DEFINE_GUEST_HANDLE_STRUCT(name) \
+	__DEFINE_GUEST_HANDLE(name, struct name)
+#define DEFINE_GUEST_HANDLE(name) __DEFINE_GUEST_HANDLE(name, name)
+#define GUEST_HANDLE(name)        __guest_handle_ ## name
+
+#ifndef __ASSEMBLY__
+/* Guest handles for primitive C types. */
+__DEFINE_GUEST_HANDLE(uchar, unsigned char);
+__DEFINE_GUEST_HANDLE(uint,  unsigned int);
+__DEFINE_GUEST_HANDLE(ulong, unsigned long);
+DEFINE_GUEST_HANDLE(char);
+DEFINE_GUEST_HANDLE(int);
+DEFINE_GUEST_HANDLE(long);
+DEFINE_GUEST_HANDLE(void);
+#endif
+
+/*
+ * SEGMENT DESCRIPTOR TABLES
+ */
+/*
+ * A number of GDT entries are reserved by Xen. These are not situated at the
+ * start of the GDT because some stupid OSes export hard-coded selector values
+ * in their ABI. These hard-coded values are always near the start of the GDT,
+ * so Xen places itself out of the way, at the far end of the GDT.
+ */
+#define FIRST_RESERVED_GDT_PAGE  14
+#define FIRST_RESERVED_GDT_BYTE  (FIRST_RESERVED_GDT_PAGE * 4096)
+#define FIRST_RESERVED_GDT_ENTRY (FIRST_RESERVED_GDT_BYTE / 8)
+
+/*
+ * These flat segments are in the Xen-private section of every GDT. Since these
+ * are also present in the initial GDT, many OSes will be able to avoid
+ * installing their own GDT.
+ */
+#define FLAT_RING1_CS 0xe019    /* GDT index 259 */
+#define FLAT_RING1_DS 0xe021    /* GDT index 260 */
+#define FLAT_RING1_SS 0xe021    /* GDT index 260 */
+#define FLAT_RING3_CS 0xe02b    /* GDT index 261 */
+#define FLAT_RING3_DS 0xe033    /* GDT index 262 */
+#define FLAT_RING3_SS 0xe033    /* GDT index 262 */
+
+#define FLAT_KERNEL_CS FLAT_RING1_CS
+#define FLAT_KERNEL_DS FLAT_RING1_DS
+#define FLAT_KERNEL_SS FLAT_RING1_SS
+#define FLAT_USER_CS    FLAT_RING3_CS
+#define FLAT_USER_DS    FLAT_RING3_DS
+#define FLAT_USER_SS    FLAT_RING3_SS
+
+/* And the trap vector is... */
+#define TRAP_INSTR "int $0x82"
+
+/*
+ * Virtual addresses beyond this are not modifiable by guest OSes. The
+ * machine->physical mapping table starts at this address, read-only.
+ */
+#ifdef CONFIG_X86_PAE
+#define __HYPERVISOR_VIRT_START 0xF5800000
+#else
+#define __HYPERVISOR_VIRT_START 0xFC000000
+#endif
+
+#ifndef HYPERVISOR_VIRT_START
+#define HYPERVISOR_VIRT_START mk_unsigned_long(__HYPERVISOR_VIRT_START)
+#endif
+
+#ifndef machine_to_phys_mapping
+#define machine_to_phys_mapping ((unsigned long *)HYPERVISOR_VIRT_START)
+#endif
+
+/* Maximum number of virtual CPUs in multi-processor guests. */
+#define MAX_VIRT_CPUS 32
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Send an array of these to HYPERVISOR_set_trap_table()
+ */
+#define TI_GET_DPL(_ti)      ((_ti)->flags & 3)
+#define TI_GET_IF(_ti)       ((_ti)->flags & 4)
+#define TI_SET_DPL(_ti,_dpl) ((_ti)->flags |= (_dpl))
+#define TI_SET_IF(_ti,_if)   ((_ti)->flags |= ((!!(_if))<<2))
+struct trap_info {
+    uint8_t       vector;  /* exception vector                              */
+    uint8_t       flags;   /* 0-3: privilege level; 4: clear event enable?  */
+    uint16_t      cs;      /* code selector                                 */
+    unsigned long address; /* code offset                                   */
+};
+DEFINE_GUEST_HANDLE_STRUCT(trap_info);
+
+struct cpu_user_regs {
+    uint32_t ebx;
+    uint32_t ecx;
+    uint32_t edx;
+    uint32_t esi;
+    uint32_t edi;
+    uint32_t ebp;
+    uint32_t eax;
+    uint16_t error_code;    /* private */
+    uint16_t entry_vector;  /* private */
+    uint32_t eip;
+    uint16_t cs;
+    uint8_t  saved_upcall_mask;
+    uint8_t  _pad0;
+    uint32_t eflags;        /* eflags.IF == !saved_upcall_mask */
+    uint32_t esp;
+    uint16_t ss, _pad1;
+    uint16_t es, _pad2;
+    uint16_t ds, _pad3;
+    uint16_t fs, _pad4;
+    uint16_t gs, _pad5;
+};
+DEFINE_GUEST_HANDLE_STRUCT(cpu_user_regs);
+
+typedef uint64_t tsc_timestamp_t; /* RDTSC timestamp */
+
+/*
+ * The following is all CPU context. Note that the fpu_ctxt block is filled
+ * in by FXSAVE if the CPU has feature FXSR; otherwise FSAVE is used.
+ */
+struct vcpu_guest_context {
+    /* FPU registers come first so they can be aligned for FXSAVE/FXRSTOR. */
+    struct { char x[512]; } fpu_ctxt;       /* User-level FPU registers     */
+#define VGCF_I387_VALID (1<<0)
+#define VGCF_HVM_GUEST  (1<<1)
+#define VGCF_IN_KERNEL  (1<<2)
+    unsigned long flags;                    /* VGCF_* flags                 */
+    struct cpu_user_regs user_regs;         /* User-level CPU registers     */
+    struct trap_info trap_ctxt[256];        /* Virtual IDT                  */
+    unsigned long ldt_base, ldt_ents;       /* LDT (linear address, # ents) */
+    unsigned long gdt_frames[16], gdt_ents; /* GDT (machine frames, # ents) */
+    unsigned long kernel_ss, kernel_sp;     /* Virtual TSS (only SS1/SP1)   */
+    unsigned long ctrlreg[8];               /* CR0-CR7 (control registers)  */
+    unsigned long debugreg[8];              /* DB0-DB7 (debug registers)    */
+    unsigned long event_callback_cs;        /* CS:EIP of event callback     */
+    unsigned long event_callback_eip;
+    unsigned long failsafe_callback_cs;     /* CS:EIP of failsafe callback  */
+    unsigned long failsafe_callback_eip;
+    unsigned long vm_assist;                /* VMASST_TYPE_* bitmap */
+};
+DEFINE_GUEST_HANDLE_STRUCT(vcpu_guest_context);
+
+struct arch_shared_info {
+    unsigned long max_pfn;                  /* max pfn that appears in table */
+    /* Frame containing list of mfns containing list of mfns containing p2m. */
+    unsigned long pfn_to_mfn_frame_list_list;
+    unsigned long nmi_reason;
+};
+
+struct arch_vcpu_info {
+    unsigned long cr2;
+    unsigned long pad[5]; /* sizeof(struct vcpu_info) == 64 */
+};
+
+#endif /* !__ASSEMBLY__ */
+
+/*
+ * Prefix forces emulation of some non-trapping instructions.
+ * Currently only CPUID.
+ */
+#ifdef __ASSEMBLY__
+#define XEN_EMULATE_PREFIX .byte 0x0f,0x0b,0x78,0x65,0x6e ;
+#define XEN_CPUID          XEN_EMULATE_PREFIX cpuid
+#else
+#define XEN_EMULATE_PREFIX ".byte 0x0f,0x0b,0x78,0x65,0x6e ; "
+#define XEN_CPUID          XEN_EMULATE_PREFIX "cpuid"
+#endif
+
+#endif
===================================================================
--- /dev/null
+++ b/include/xen/interface/elfnote.h
@@ -0,0 +1,133 @@
+/******************************************************************************
+ * elfnote.h
+ *
+ * Definitions used for the Xen ELF notes.
+ *
+ * Copyright (c) 2006, Ian Campbell, XenSource Ltd.
+ */
+
+#ifndef __XEN_PUBLIC_ELFNOTE_H__
+#define __XEN_PUBLIC_ELFNOTE_H__
+
+/*
+ * The notes should live in a SHT_NOTE segment and have "Xen" in the
+ * name field.
+ *
+ * Numeric types are either 4 or 8 bytes depending on the content of
+ * the desc field.
+ *
+ * LEGACY indicated the fields in the legacy __xen_guest string which
+ * this a note type replaces.
+ */
+
+/*
+ * NAME=VALUE pair (string).
+ *
+ * LEGACY: FEATURES and PAE
+ */
+#define XEN_ELFNOTE_INFO           0
+
+/*
+ * The virtual address of the entry point (numeric).
+ *
+ * LEGACY: VIRT_ENTRY
+ */
+#define XEN_ELFNOTE_ENTRY          1
+
+/* The virtual address of the hypercall transfer page (numeric).
+ *
+ * LEGACY: HYPERCALL_PAGE. (n.b. legacy value is a physical page
+ * number not a virtual address)
+ */
+#define XEN_ELFNOTE_HYPERCALL_PAGE 2
+
+/* The virtual address where the kernel image should be mapped (numeric).
+ *
+ * Defaults to 0.
+ *
+ * LEGACY: VIRT_BASE
+ */
+#define XEN_ELFNOTE_VIRT_BASE      3
+
+/*
+ * The offset of the ELF paddr field from the acutal required
+ * psuedo-physical address (numeric).
+ *
+ * This is used to maintain backwards compatibility with older kernels
+ * which wrote __PAGE_OFFSET into that field. This field defaults to 0
+ * if not present.
+ *
+ * LEGACY: ELF_PADDR_OFFSET. (n.b. legacy default is VIRT_BASE)
+ */
+#define XEN_ELFNOTE_PADDR_OFFSET   4
+
+/*
+ * The version of Xen that we work with (string).
+ *
+ * LEGACY: XEN_VER
+ */
+#define XEN_ELFNOTE_XEN_VERSION    5
+
+/*
+ * The name of the guest operating system (string).
+ *
+ * LEGACY: GUEST_OS
+ */
+#define XEN_ELFNOTE_GUEST_OS       6
+
+/*
+ * The version of the guest operating system (string).
+ *
+ * LEGACY: GUEST_VER
+ */
+#define XEN_ELFNOTE_GUEST_VERSION  7
+
+/*
+ * The loader type (string).
+ *
+ * LEGACY: LOADER
+ */
+#define XEN_ELFNOTE_LOADER         8
+
+/*
+ * The kernel supports PAE (x86/32 only, string = "yes" or "no").
+ *
+ * LEGACY: PAE (n.b. The legacy interface included a provision to
+ * indicate 'extended-cr3' support allowing L3 page tables to be
+ * placed above 4G. It is assumed that any kernel new enough to use
+ * these ELF notes will include this and therefore "yes" here is
+ * equivalent to "yes[entended-cr3]" in the __xen_guest interface.
+ */
+#define XEN_ELFNOTE_PAE_MODE       9
+
+/*
+ * The features supported/required by this kernel (string).
+ *
+ * The string must consist of a list of feature names (as given in
+ * features.h, without the "XENFEAT_" prefix) separated by '|'
+ * characters. If a feature is required for the kernel to function
+ * then the feature name must be preceded by a '!' character.
+ *
+ * LEGACY: FEATURES
+ */
+#define XEN_ELFNOTE_FEATURES      10
+
+/*
+ * The kernel requires the symbol table to be loaded (string = "yes" or "no")
+ * LEGACY: BSD_SYMTAB (n.b. The legacy treated the presence or absence
+ * of this string as a boolean flag rather than requiring "yes" or
+ * "no".
+ */
+#define XEN_ELFNOTE_BSD_SYMTAB    11
+
+#endif /* __XEN_PUBLIC_ELFNOTE_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-set-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/event_channel.h
@@ -0,0 +1,195 @@
+/******************************************************************************
+ * event_channel.h
+ *
+ * Event channels between domains.
+ *
+ * Copyright (c) 2003-2004, K A Fraser.
+ */
+
+#ifndef __XEN_PUBLIC_EVENT_CHANNEL_H__
+#define __XEN_PUBLIC_EVENT_CHANNEL_H__
+
+typedef uint32_t evtchn_port_t;
+DEFINE_GUEST_HANDLE(evtchn_port_t);
+
+/*
+ * EVTCHNOP_alloc_unbound: Allocate a port in domain <dom> and mark as
+ * accepting interdomain bindings from domain <remote_dom>. A fresh port
+ * is allocated in <dom> and returned as <port>.
+ * NOTES:
+ *  1. If the caller is unprivileged then <dom> must be DOMID_SELF.
+ *  2. <rdom> may be DOMID_SELF, allowing loopback connections.
+ */
+#define EVTCHNOP_alloc_unbound    6
+struct evtchn_alloc_unbound {
+    /* IN parameters */
+    domid_t dom, remote_dom;
+    /* OUT parameters */
+    evtchn_port_t port;
+};
+
+/*
+ * EVTCHNOP_bind_interdomain: Construct an interdomain event channel between
+ * the calling domain and <remote_dom>. <remote_dom,remote_port> must identify
+ * a port that is unbound and marked as accepting bindings from the calling
+ * domain. A fresh port is allocated in the calling domain and returned as
+ * <local_port>.
+ * NOTES:
+ *  2. <remote_dom> may be DOMID_SELF, allowing loopback connections.
+ */
+#define EVTCHNOP_bind_interdomain 0
+struct evtchn_bind_interdomain {
+    /* IN parameters. */
+    domid_t remote_dom;
+    evtchn_port_t remote_port;
+    /* OUT parameters. */
+    evtchn_port_t local_port;
+};
+
+/*
+ * EVTCHNOP_bind_virq: Bind a local event channel to VIRQ <irq> on specified
+ * vcpu.
+ * NOTES:
+ *  1. A virtual IRQ may be bound to at most one event channel per vcpu.
+ *  2. The allocated event channel is bound to the specified vcpu. The binding
+ *     may not be changed.
+ */
+#define EVTCHNOP_bind_virq        1
+struct evtchn_bind_virq {
+    /* IN parameters. */
+    uint32_t virq;
+    uint32_t vcpu;
+    /* OUT parameters. */
+    evtchn_port_t port;
+};
+
+/*
+ * EVTCHNOP_bind_pirq: Bind a local event channel to PIRQ <irq>.
+ * NOTES:
+ *  1. A physical IRQ may be bound to at most one event channel per domain.
+ *  2. Only a sufficiently-privileged domain may bind to a physical IRQ.
+ */
+#define EVTCHNOP_bind_pirq        2
+struct evtchn_bind_pirq {
+    /* IN parameters. */
+    uint32_t pirq;
+#define BIND_PIRQ__WILL_SHARE 1
+    uint32_t flags; /* BIND_PIRQ__* */
+    /* OUT parameters. */
+    evtchn_port_t port;
+};
+
+/*
+ * EVTCHNOP_bind_ipi: Bind a local event channel to receive events.
+ * NOTES:
+ *  1. The allocated event channel is bound to the specified vcpu. The binding
+ *     may not be changed.
+ */
+#define EVTCHNOP_bind_ipi         7
+struct evtchn_bind_ipi {
+    uint32_t vcpu;
+    /* OUT parameters. */
+    evtchn_port_t port;
+};
+
+/*
+ * EVTCHNOP_close: Close a local event channel <port>. If the channel is
+ * interdomain then the remote end is placed in the unbound state
+ * (EVTCHNSTAT_unbound), awaiting a new connection.
+ */
+#define EVTCHNOP_close            3
+struct evtchn_close {
+    /* IN parameters. */
+    evtchn_port_t port;
+};
+
+/*
+ * EVTCHNOP_send: Send an event to the remote end of the channel whose local
+ * endpoint is <port>.
+ */
+#define EVTCHNOP_send             4
+struct evtchn_send {
+    /* IN parameters. */
+    evtchn_port_t port;
+};
+
+/*
+ * EVTCHNOP_status: Get the current status of the communication channel which
+ * has an endpoint at <dom, port>.
+ * NOTES:
+ *  1. <dom> may be specified as DOMID_SELF.
+ *  2. Only a sufficiently-privileged domain may obtain the status of an event
+ *     channel for which <dom> is not DOMID_SELF.
+ */
+#define EVTCHNOP_status           5
+struct evtchn_status {
+    /* IN parameters */
+    domid_t  dom;
+    evtchn_port_t port;
+    /* OUT parameters */
+#define EVTCHNSTAT_closed       0  /* Channel is not in use.                 */
+#define EVTCHNSTAT_unbound      1  /* Channel is waiting interdom connection.*/
+#define EVTCHNSTAT_interdomain  2  /* Channel is connected to remote domain. */
+#define EVTCHNSTAT_pirq         3  /* Channel is bound to a phys IRQ line.   */
+#define EVTCHNSTAT_virq         4  /* Channel is bound to a virtual IRQ line */
+#define EVTCHNSTAT_ipi          5  /* Channel is bound to a virtual IPI line */
+    uint32_t status;
+    uint32_t vcpu;                 /* VCPU to which this channel is bound.   */
+    union {
+        struct {
+            domid_t dom;
+        } unbound; /* EVTCHNSTAT_unbound */
+        struct {
+            domid_t dom;
+            evtchn_port_t port;
+        } interdomain; /* EVTCHNSTAT_interdomain */
+        uint32_t pirq;      /* EVTCHNSTAT_pirq        */
+        uint32_t virq;      /* EVTCHNSTAT_virq        */
+    } u;
+};
+
+/*
+ * EVTCHNOP_bind_vcpu: Specify which vcpu a channel should notify when an
+ * event is pending.
+ * NOTES:
+ *  1. IPI- and VIRQ-bound channels always notify the vcpu that initialised
+ *     the binding. This binding cannot be changed.
+ *  2. All other channels notify vcpu0 by default. This default is set when
+ *     the channel is allocated (a port that is freed and subsequently reused
+ *     has its binding reset to vcpu0).
+ */
+#define EVTCHNOP_bind_vcpu        8
+struct evtchn_bind_vcpu {
+    /* IN parameters. */
+    evtchn_port_t port;
+    uint32_t vcpu;
+};
+
+/*
+ * EVTCHNOP_unmask: Unmask the specified local event-channel port and deliver
+ * a notification to the appropriate VCPU if an event is pending.
+ */
+#define EVTCHNOP_unmask           9
+struct evtchn_unmask {
+    /* IN parameters. */
+    evtchn_port_t port;
+};
+
+struct evtchn_op {
+    uint32_t cmd; /* EVTCHNOP_* */
+    union {
+        struct evtchn_alloc_unbound    alloc_unbound;
+        struct evtchn_bind_interdomain bind_interdomain;
+        struct evtchn_bind_virq        bind_virq;
+        struct evtchn_bind_pirq        bind_pirq;
+        struct evtchn_bind_ipi         bind_ipi;
+        struct evtchn_close            close;
+        struct evtchn_send             send;
+        struct evtchn_status           status;
+        struct evtchn_bind_vcpu        bind_vcpu;
+        struct evtchn_unmask           unmask;
+    } u;
+};
+DEFINE_GUEST_HANDLE_STRUCT(evtchn_op);
+
+#endif /* __XEN_PUBLIC_EVENT_CHANNEL_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/features.h
@@ -0,0 +1,43 @@
+/******************************************************************************
+ * features.h
+ *
+ * Feature flags, reported by XENVER_get_features.
+ *
+ * Copyright (c) 2006, Keir Fraser <keir@xensource.com>
+ */
+
+#ifndef __XEN_PUBLIC_FEATURES_H__
+#define __XEN_PUBLIC_FEATURES_H__
+
+/*
+ * If set, the guest does not need to write-protect its pagetables, and can
+ * update them via direct writes.
+ */
+#define XENFEAT_writable_page_tables       0
+
+/*
+ * If set, the guest does not need to write-protect its segment descriptor
+ * tables, and can update them via direct writes.
+ */
+#define XENFEAT_writable_descriptor_tables 1
+
+/*
+ * If set, translation between the guest's 'pseudo-physical' address space
+ * and the host's machine address space are handled by the hypervisor. In this
+ * mode the guest does not need to perform phys-to/from-machine translations
+ * when performing page table operations.
+ */
+#define XENFEAT_auto_translated_physmap    2
+
+/* If set, the guest is running in supervisor mode (e.g., x86 ring 0). */
+#define XENFEAT_supervisor_mode_kernel     3
+
+/*
+ * If set, the guest does not need to allocate x86 PAE page directories
+ * below 4GB. This flag is usually implied by auto_translated_physmap.
+ */
+#define XENFEAT_pae_pgdir_above_4gb        4
+
+#define XENFEAT_NR_SUBMAPS 1
+
+#endif /* __XEN_PUBLIC_FEATURES_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/grant_table.h
@@ -0,0 +1,301 @@
+/******************************************************************************
+ * grant_table.h
+ *
+ * Interface for granting foreign access to page frames, and receiving
+ * page-ownership transfers.
+ *
+ * Copyright (c) 2004, K A Fraser
+ */
+
+#ifndef __XEN_PUBLIC_GRANT_TABLE_H__
+#define __XEN_PUBLIC_GRANT_TABLE_H__
+
+
+/***********************************
+ * GRANT TABLE REPRESENTATION
+ */
+
+/* Some rough guidelines on accessing and updating grant-table entries
+ * in a concurrency-safe manner. For more information, Linux contains a
+ * reference implementation for guest OSes (arch/i386/mach-xen/grant_table.c).
+ *
+ * NB. WMB is a no-op on current-generation x86 processors. However, a
+ *     compiler barrier will still be required.
+ *
+ * Introducing a valid entry into the grant table:
+ *  1. Write ent->domid.
+ *  2. Write ent->frame:
+ *      GTF_permit_access:   Frame to which access is permitted.
+ *      GTF_accept_transfer: Pseudo-phys frame slot being filled by new
+ *                           frame, or zero if none.
+ *  3. Write memory barrier (WMB).
+ *  4. Write ent->flags, inc. valid type.
+ *
+ * Invalidating an unused GTF_permit_access entry:
+ *  1. flags = ent->flags.
+ *  2. Observe that !(flags & (GTF_reading|GTF_writing)).
+ *  3. Check result of SMP-safe CMPXCHG(&ent->flags, flags, 0).
+ *  NB. No need for WMB as reuse of entry is control-dependent on success of
+ *      step 3, and all architectures guarantee ordering of ctrl-dep writes.
+ *
+ * Invalidating an in-use GTF_permit_access entry:
+ *  This cannot be done directly. Request assistance from the domain controller
+ *  which can set a timeout on the use of a grant entry and take necessary
+ *  action. (NB. This is not yet implemented!).
+ *
+ * Invalidating an unused GTF_accept_transfer entry:
+ *  1. flags = ent->flags.
+ *  2. Observe that !(flags & GTF_transfer_committed). [*]
+ *  3. Check result of SMP-safe CMPXCHG(&ent->flags, flags, 0).
+ *  NB. No need for WMB as reuse of entry is control-dependent on success of
+ *      step 3, and all architectures guarantee ordering of ctrl-dep writes.
+ *  [*] If GTF_transfer_committed is set then the grant entry is 'committed'.
+ *      The guest must /not/ modify the grant entry until the address of the
+ *      transferred frame is written. It is safe for the guest to spin waiting
+ *      for this to occur (detect by observing GTF_transfer_completed in
+ *      ent->flags).
+ *
+ * Invalidating a committed GTF_accept_transfer entry:
+ *  1. Wait for (ent->flags & GTF_transfer_completed).
+ *
+ * Changing a GTF_permit_access from writable to read-only:
+ *  Use SMP-safe CMPXCHG to set GTF_readonly, while checking !GTF_writing.
+ *
+ * Changing a GTF_permit_access from read-only to writable:
+ *  Use SMP-safe bit-setting instruction.
+ */
+
+/*
+ * A grant table comprises a packed array of grant entries in one or more
+ * page frames shared between Xen and a guest.
+ * [XEN]: This field is written by Xen and read by the sharing guest.
+ * [GST]: This field is written by the guest and read by Xen.
+ */
+struct grant_entry {
+    /* GTF_xxx: various type and flag information.  [XEN,GST] */
+    uint16_t flags;
+    /* The domain being granted foreign privileges. [GST] */
+    domid_t  domid;
+    /*
+     * GTF_permit_access: Frame that @domid is allowed to map and access. [GST]
+     * GTF_accept_transfer: Frame whose ownership transferred by @domid. [XEN]
+     */
+    uint32_t frame;
+};
+
+/*
+ * Type of grant entry.
+ *  GTF_invalid: This grant entry grants no privileges.
+ *  GTF_permit_access: Allow @domid to map/access @frame.
+ *  GTF_accept_transfer: Allow @domid to transfer ownership of one page frame
+ *                       to this guest. Xen writes the page number to @frame.
+ */
+#define GTF_invalid         (0U<<0)
+#define GTF_permit_access   (1U<<0)
+#define GTF_accept_transfer (2U<<0)
+#define GTF_type_mask       (3U<<0)
+
+/*
+ * Subflags for GTF_permit_access.
+ *  GTF_readonly: Restrict @domid to read-only mappings and accesses. [GST]
+ *  GTF_reading: Grant entry is currently mapped for reading by @domid. [XEN]
+ *  GTF_writing: Grant entry is currently mapped for writing by @domid. [XEN]
+ */
+#define _GTF_readonly       (2)
+#define GTF_readonly        (1U<<_GTF_readonly)
+#define _GTF_reading        (3)
+#define GTF_reading         (1U<<_GTF_reading)
+#define _GTF_writing        (4)
+#define GTF_writing         (1U<<_GTF_writing)
+
+/*
+ * Subflags for GTF_accept_transfer:
+ *  GTF_transfer_committed: Xen sets this flag to indicate that it is committed
+ *      to transferring ownership of a page frame. When a guest sees this flag
+ *      it must /not/ modify the grant entry until GTF_transfer_completed is
+ *      set by Xen.
+ *  GTF_transfer_completed: It is safe for the guest to spin-wait on this flag
+ *      after reading GTF_transfer_committed. Xen will always write the frame
+ *      address, followed by ORing this flag, in a timely manner.
+ */
+#define _GTF_transfer_committed (2)
+#define GTF_transfer_committed  (1U<<_GTF_transfer_committed)
+#define _GTF_transfer_completed (3)
+#define GTF_transfer_completed  (1U<<_GTF_transfer_completed)
+
+
+/***********************************
+ * GRANT TABLE QUERIES AND USES
+ */
+
+/*
+ * Reference to a grant entry in a specified domain's grant table.
+ */
+typedef uint32_t grant_ref_t;
+
+/*
+ * Handle to track a mapping created via a grant reference.
+ */
+typedef uint32_t grant_handle_t;
+
+/*
+ * GNTTABOP_map_grant_ref: Map the grant entry (<dom>,<ref>) for access
+ * by devices and/or host CPUs. If successful, <handle> is a tracking number
+ * that must be presented later to destroy the mapping(s). On error, <handle>
+ * is a negative status code.
+ * NOTES:
+ *  1. If GNTPIN_map_for_dev is specified then <dev_bus_addr> is the address
+ *     via which I/O devices may access the granted frame.
+ *  2. If GNTPIN_map_for_host is specified then a mapping will be added at
+ *     either a host virtual address in the current address space, or at
+ *     a PTE at the specified machine address.  The type of mapping to
+ *     perform is selected through the GNTMAP_contains_pte flag, and the
+ *     address is specified in <host_addr>.
+ *  3. Mappings should only be destroyed via GNTTABOP_unmap_grant_ref. If a
+ *     host mapping is destroyed by other means then it is *NOT* guaranteed
+ *     to be accounted to the correct grant reference!
+ */
+#define GNTTABOP_map_grant_ref        0
+struct gnttab_map_grant_ref {
+    /* IN parameters. */
+    uint64_t host_addr;
+    uint32_t flags;               /* GNTMAP_* */
+    grant_ref_t ref;
+    domid_t  dom;
+    /* OUT parameters. */
+    int16_t  status;              /* GNTST_* */
+    grant_handle_t handle;
+    uint64_t dev_bus_addr;
+};
+DEFINE_GUEST_HANDLE_STRUCT(gnttab_map_grant_ref);
+
+/*
+ * GNTTABOP_unmap_grant_ref: Destroy one or more grant-reference mappings
+ * tracked by <handle>. If <host_addr> or <dev_bus_addr> is zero, that
+ * field is ignored. If non-zero, they must refer to a device/host mapping
+ * that is tracked by <handle>
+ * NOTES:
+ *  1. The call may fail in an undefined manner if either mapping is not
+ *     tracked by <handle>.
+ *  3. After executing a batch of unmaps, it is guaranteed that no stale
+ *     mappings will remain in the device or host TLBs.
+ */
+#define GNTTABOP_unmap_grant_ref      1
+struct gnttab_unmap_grant_ref {
+    /* IN parameters. */
+    uint64_t host_addr;
+    uint64_t dev_bus_addr;
+    grant_handle_t handle;
+    /* OUT parameters. */
+    int16_t  status;              /* GNTST_* */
+};
+DEFINE_GUEST_HANDLE_STRUCT(gnttab_unmap_grant_ref);
+
+/*
+ * GNTTABOP_setup_table: Set up a grant table for <dom> comprising at least
+ * <nr_frames> pages. The frame addresses are written to the <frame_list>.
+ * Only <nr_frames> addresses are written, even if the table is larger.
+ * NOTES:
+ *  1. <dom> may be specified as DOMID_SELF.
+ *  2. Only a sufficiently-privileged domain may specify <dom> != DOMID_SELF.
+ *  3. Xen may not support more than a single grant-table page per domain.
+ */
+#define GNTTABOP_setup_table          2
+struct gnttab_setup_table {
+    /* IN parameters. */
+    domid_t  dom;
+    uint32_t nr_frames;
+    /* OUT parameters. */
+    int16_t  status;              /* GNTST_* */
+    GUEST_HANDLE(ulong) frame_list;
+};
+DEFINE_GUEST_HANDLE_STRUCT(gnttab_setup_table);
+
+/*
+ * GNTTABOP_dump_table: Dump the contents of the grant table to the
+ * xen console. Debugging use only.
+ */
+#define GNTTABOP_dump_table           3
+struct gnttab_dump_table {
+    /* IN parameters. */
+    domid_t dom;
+    /* OUT parameters. */
+    int16_t status;               /* GNTST_* */
+};
+DEFINE_GUEST_HANDLE_STRUCT(gnttab_dump_table);
+
+/*
+ * GNTTABOP_transfer_grant_ref: Transfer <frame> to a foreign domain. The
+ * foreign domain has previously registered its interest in the transfer via
+ * <domid, ref>.
+ *
+ * Note that, even if the transfer fails, the specified page no longer belongs
+ * to the calling domain *unless* the error is GNTST_bad_page.
+ */
+#define GNTTABOP_transfer                4
+struct gnttab_transfer {
+    /* IN parameters. */
+    unsigned long mfn;
+    domid_t       domid;
+    grant_ref_t   ref;
+    /* OUT parameters. */
+    int16_t       status;
+};
+DEFINE_GUEST_HANDLE_STRUCT(gnttab_transfer);
+
+/*
+ * Bitfield values for update_pin_status.flags.
+ */
+ /* Map the grant entry for access by I/O devices. */
+#define _GNTMAP_device_map      (0)
+#define GNTMAP_device_map       (1<<_GNTMAP_device_map)
+ /* Map the grant entry for access by host CPUs. */
+#define _GNTMAP_host_map        (1)
+#define GNTMAP_host_map         (1<<_GNTMAP_host_map)
+ /* Accesses to the granted frame will be restricted to read-only access. */
+#define _GNTMAP_readonly        (2)
+#define GNTMAP_readonly         (1<<_GNTMAP_readonly)
+ /*
+  * GNTMAP_host_map subflag:
+  *  0 => The host mapping is usable only by the guest OS.
+  *  1 => The host mapping is usable by guest OS + current application.
+  */
+#define _GNTMAP_application_map (3)
+#define GNTMAP_application_map  (1<<_GNTMAP_application_map)
+
+ /*
+  * GNTMAP_contains_pte subflag:
+  *  0 => This map request contains a host virtual address.
+  *  1 => This map request contains the machine addess of the PTE to update.
+  */
+#define _GNTMAP_contains_pte    (4)
+#define GNTMAP_contains_pte     (1<<_GNTMAP_contains_pte)
+
+/*
+ * Values for error status returns. All errors are -ve.
+ */
+#define GNTST_okay             (0)  /* Normal return.                        */
+#define GNTST_general_error    (-1) /* General undefined error.              */
+#define GNTST_bad_domain       (-2) /* Unrecognsed domain id.                */
+#define GNTST_bad_gntref       (-3) /* Unrecognised or inappropriate gntref. */
+#define GNTST_bad_handle       (-4) /* Unrecognised or inappropriate handle. */
+#define GNTST_bad_virt_addr    (-5) /* Inappropriate virtual address to map. */
+#define GNTST_bad_dev_addr     (-6) /* Inappropriate device address to unmap.*/
+#define GNTST_no_device_space  (-7) /* Out of space in I/O MMU.              */
+#define GNTST_permission_denied (-8) /* Not enough privilege for operation.  */
+#define GNTST_bad_page         (-9) /* Specified page was invalid for op.    */
+
+#define GNTTABOP_error_msgs {                   \
+    "okay",                                     \
+    "undefined error",                          \
+    "unrecognised domain id",                   \
+    "invalid grant reference",                  \
+    "invalid mapping handle",                   \
+    "invalid virtual address",                  \
+    "invalid device address",                   \
+    "no spare translation slot in the I/O MMU", \
+    "permission denied",                        \
+    "bad page"                                  \
+}
+
+#endif /* __XEN_PUBLIC_GRANT_TABLE_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/io/blkif.h
@@ -0,0 +1,96 @@
+/******************************************************************************
+ * blkif.h
+ *
+ * Unified block-device I/O interface for Xen guest OSes.
+ *
+ * Copyright (c) 2003-2004, Keir Fraser
+ */
+
+#ifndef __XEN_PUBLIC_IO_BLKIF_H__
+#define __XEN_PUBLIC_IO_BLKIF_H__
+
+#include "ring.h"
+#include "../grant_table.h"
+
+/*
+ * Front->back notifications: When enqueuing a new request, sending a
+ * notification can be made conditional on req_event (i.e., the generic
+ * hold-off mechanism provided by the ring macros). Backends must set
+ * req_event appropriately (e.g., using RING_FINAL_CHECK_FOR_REQUESTS()).
+ *
+ * Back->front notifications: When enqueuing a new response, sending a
+ * notification can be made conditional on rsp_event (i.e., the generic
+ * hold-off mechanism provided by the ring macros). Frontends must set
+ * rsp_event appropriately (e.g., using RING_FINAL_CHECK_FOR_RESPONSES()).
+ */
+
+#ifndef blkif_vdev_t
+#define blkif_vdev_t   uint16_t
+#endif
+#define blkif_sector_t uint64_t
+
+/*
+ * REQUEST CODES.
+ */
+#define BLKIF_OP_READ              0
+#define BLKIF_OP_WRITE             1
+/*
+ * Recognised only if "feature-barrier" is present in backend xenbus info.
+ * The "feature_barrier" node contains a boolean indicating whether barrier
+ * requests are likely to succeed or fail. Either way, a barrier request
+ * may fail at any time with BLKIF_RSP_EOPNOTSUPP if it is unsupported by
+ * the underlying block-device hardware. The boolean simply indicates whether
+ * or not it is worthwhile for the frontend to attempt barrier requests.
+ * If a backend does not recognise BLKIF_OP_WRITE_BARRIER, it should *not*
+ * create the "feature-barrier" node!
+ */
+#define BLKIF_OP_WRITE_BARRIER     2
+
+/*
+ * Maximum scatter/gather segments per request.
+ * This is carefully chosen so that sizeof(struct blkif_ring) <= PAGE_SIZE.
+ * NB. This could be 12 if the ring indexes weren't stored in the same page.
+ */
+#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
+
+struct blkif_request {
+    uint8_t        operation;    /* BLKIF_OP_???                         */
+    uint8_t        nr_segments;  /* number of segments                   */
+    blkif_vdev_t   handle;       /* only for read/write requests         */
+    uint64_t       id;           /* private guest value, echoed in resp  */
+    blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
+    struct blkif_request_segment {
+        grant_ref_t gref;        /* reference to I/O buffer frame        */
+        /* @first_sect: first sector in frame to transfer (inclusive).   */
+        /* @last_sect: last sector in frame to transfer (inclusive).     */
+        uint8_t     first_sect, last_sect;
+    } seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+};
+
+struct blkif_response {
+    uint64_t        id;              /* copied from request */
+    uint8_t         operation;       /* copied from request */
+    int16_t         status;          /* BLKIF_RSP_???       */
+};
+
+/*
+ * STATUS RETURN CODES.
+ */
+ /* Operation not supported (only happens on barrier writes). */
+#define BLKIF_RSP_EOPNOTSUPP  -2
+ /* Operation failed for some unspecified reason (-EIO). */
+#define BLKIF_RSP_ERROR       -1
+ /* Operation completed successfully. */
+#define BLKIF_RSP_OKAY         0
+
+/*
+ * Generate blkif ring structures and types.
+ */
+
+DEFINE_RING_TYPES(blkif, struct blkif_request, struct blkif_response);
+
+#define VDISK_CDROM        0x1
+#define VDISK_REMOVABLE    0x2
+#define VDISK_READONLY     0x4
+
+#endif /* __XEN_PUBLIC_IO_BLKIF_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/io/console.h
@@ -0,0 +1,23 @@
+/******************************************************************************
+ * console.h
+ *
+ * Console I/O interface for Xen guest OSes.
+ *
+ * Copyright (c) 2005, Keir Fraser
+ */
+
+#ifndef __XEN_PUBLIC_IO_CONSOLE_H__
+#define __XEN_PUBLIC_IO_CONSOLE_H__
+
+typedef uint32_t XENCONS_RING_IDX;
+
+#define MASK_XENCONS_IDX(idx, ring) ((idx) & (sizeof(ring)-1))
+
+struct xencons_interface {
+    char in[1024];
+    char out[2048];
+    XENCONS_RING_IDX in_cons, in_prod;
+    XENCONS_RING_IDX out_cons, out_prod;
+};
+
+#endif /* __XEN_PUBLIC_IO_CONSOLE_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/io/netif.h
@@ -0,0 +1,152 @@
+/******************************************************************************
+ * netif.h
+ *
+ * Unified network-device I/O interface for Xen guest OSes.
+ *
+ * Copyright (c) 2003-2004, Keir Fraser
+ */
+
+#ifndef __XEN_PUBLIC_IO_NETIF_H__
+#define __XEN_PUBLIC_IO_NETIF_H__
+
+#include "ring.h"
+#include "../grant_table.h"
+
+/*
+ * Notifications after enqueuing any type of message should be conditional on
+ * the appropriate req_event or rsp_event field in the shared ring.
+ * If the client sends notification for rx requests then it should specify
+ * feature 'feature-rx-notify' via xenbus. Otherwise the backend will assume
+ * that it cannot safely queue packets (as it may not be kicked to send them).
+ */
+
+/*
+ * This is the 'wire' format for packets:
+ *  Request 1: netif_tx_request -- NETTXF_* (any flags)
+ * [Request 2: netif_tx_extra]  (only if request 1 has NETTXF_extra_info)
+ * [Request 3: netif_tx_extra]  (only if request 2 has XEN_NETIF_EXTRA_MORE)
+ *  Request 4: netif_tx_request -- NETTXF_more_data
+ *  Request 5: netif_tx_request -- NETTXF_more_data
+ *  ...
+ *  Request N: netif_tx_request -- 0
+ */
+
+/* Protocol checksum field is blank in the packet (hardware offload)? */
+#define _NETTXF_csum_blank     (0)
+#define  NETTXF_csum_blank     (1U<<_NETTXF_csum_blank)
+
+/* Packet data has been validated against protocol checksum. */
+#define _NETTXF_data_validated (1)
+#define  NETTXF_data_validated (1U<<_NETTXF_data_validated)
+
+/* Packet continues in the next request descriptor. */
+#define _NETTXF_more_data      (2)
+#define  NETTXF_more_data      (1U<<_NETTXF_more_data)
+
+/* Packet to be followed by extra descriptor(s). */
+#define _NETTXF_extra_info     (3)
+#define  NETTXF_extra_info     (1U<<_NETTXF_extra_info)
+
+struct netif_tx_request {
+    grant_ref_t gref;      /* Reference to buffer page */
+    uint16_t offset;       /* Offset within buffer page */
+    uint16_t flags;        /* NETTXF_* */
+    uint16_t id;           /* Echoed in response message. */
+    uint16_t size;         /* Packet size in bytes.       */
+};
+
+/* Types of netif_extra_info descriptors. */
+#define XEN_NETIF_EXTRA_TYPE_NONE  (0)  /* Never used - invalid */
+#define XEN_NETIF_EXTRA_TYPE_GSO   (1)  /* u.gso */
+#define XEN_NETIF_EXTRA_TYPE_MAX   (2)
+
+/* netif_extra_info flags. */
+#define _XEN_NETIF_EXTRA_FLAG_MORE (0)
+#define XEN_NETIF_EXTRA_FLAG_MORE  (1U<<_XEN_NETIF_EXTRA_FLAG_MORE)
+
+/* GSO types - only TCPv4 currently supported. */
+#define XEN_NETIF_GSO_TYPE_TCPV4        (1)
+
+/*
+ * This structure needs to fit within both netif_tx_request and
+ * netif_rx_response for compatibility.
+ */
+struct netif_extra_info {
+    uint8_t type;  /* XEN_NETIF_EXTRA_TYPE_* */
+    uint8_t flags; /* XEN_NETIF_EXTRA_FLAG_* */
+
+    union {
+        struct {
+            /*
+             * Maximum payload size of each segment. For example, for TCP this
+             * is just the path MSS.
+             */
+            uint16_t size;
+
+            /*
+             * GSO type. This determines the protocol of the packet and any
+             * extra features required to segment the packet properly.
+             */
+            uint8_t type; /* XEN_NETIF_GSO_TYPE_* */
+
+            /* Future expansion. */
+            uint8_t pad;
+
+            /*
+             * GSO features. This specifies any extra GSO features required
+             * to process this packet, such as ECN support for TCPv4.
+             */
+            uint16_t features; /* XEN_NETIF_GSO_FEAT_* */
+        } gso;
+
+        uint16_t pad[3];
+    } u;
+};
+
+struct netif_tx_response {
+    uint16_t id;
+    int16_t  status;       /* NETIF_RSP_* */
+};
+
+struct netif_rx_request {
+    uint16_t    id;        /* Echoed in response message.        */
+    grant_ref_t gref;      /* Reference to incoming granted frame */
+};
+
+/* Packet data has been validated against protocol checksum. */
+#define _NETRXF_data_validated (0)
+#define  NETRXF_data_validated (1U<<_NETRXF_data_validated)
+
+/* Protocol checksum field is blank in the packet (hardware offload)? */
+#define _NETRXF_csum_blank     (1)
+#define  NETRXF_csum_blank     (1U<<_NETRXF_csum_blank)
+
+/* Packet continues in the next request descriptor. */
+#define _NETRXF_more_data      (2)
+#define  NETRXF_more_data      (1U<<_NETRXF_more_data)
+
+/* Packet to be followed by extra descriptor(s). */
+#define _NETRXF_extra_info     (3)
+#define  NETRXF_extra_info     (1U<<_NETRXF_extra_info)
+
+struct netif_rx_response {
+    uint16_t id;
+    uint16_t offset;       /* Offset in page of start of received packet  */
+    uint16_t flags;        /* NETRXF_* */
+    int16_t  status;       /* -ve: BLKIF_RSP_* ; +ve: Rx'ed pkt size. */
+};
+
+/*
+ * Generate netif ring structures and types.
+ */
+
+DEFINE_RING_TYPES(netif_tx, struct netif_tx_request, struct netif_tx_response);
+DEFINE_RING_TYPES(netif_rx, struct netif_rx_request, struct netif_rx_response);
+
+#define NETIF_RSP_DROPPED         -2
+#define NETIF_RSP_ERROR           -1
+#define NETIF_RSP_OKAY             0
+/* No response: used for auxiliary requests (e.g., netif_tx_extra). */
+#define NETIF_RSP_NULL             1
+
+#endif
===================================================================
--- /dev/null
+++ b/include/xen/interface/io/ring.h
@@ -0,0 +1,260 @@
+/******************************************************************************
+ * ring.h
+ *
+ * Shared producer-consumer ring macros.
+ *
+ * Tim Deegan and Andrew Warfield November 2004.
+ */
+
+#ifndef __XEN_PUBLIC_IO_RING_H__
+#define __XEN_PUBLIC_IO_RING_H__
+
+typedef unsigned int RING_IDX;
+
+/* Round a 32-bit unsigned constant down to the nearest power of two. */
+#define __RD2(_x)  (((_x) & 0x00000002) ? 0x2                  : ((_x) & 0x1))
+#define __RD4(_x)  (((_x) & 0x0000000c) ? __RD2((_x)>>2)<<2    : __RD2(_x))
+#define __RD8(_x)  (((_x) & 0x000000f0) ? __RD4((_x)>>4)<<4    : __RD4(_x))
+#define __RD16(_x) (((_x) & 0x0000ff00) ? __RD8((_x)>>8)<<8    : __RD8(_x))
+#define __RD32(_x) (((_x) & 0xffff0000) ? __RD16((_x)>>16)<<16 : __RD16(_x))
+
+/*
+ * Calculate size of a shared ring, given the total available space for the
+ * ring and indexes (_sz), and the name tag of the request/response structure.
+ * A ring contains as many entries as will fit, rounded down to the nearest
+ * power of two (so we can mask with (size-1) to loop around).
+ */
+#define __RING_SIZE(_s, _sz) \
+    (__RD32(((_sz) - (long)&(_s)->ring + (long)(_s)) / sizeof((_s)->ring[0])))
+
+/*
+ * Macros to make the correct C datatypes for a new kind of ring.
+ *
+ * To make a new ring datatype, you need to have two message structures,
+ * let's say struct request, and struct response already defined.
+ *
+ * In a header where you want the ring datatype declared, you then do:
+ *
+ *     DEFINE_RING_TYPES(mytag, struct request, struct response);
+ *
+ * These expand out to give you a set of types, as you can see below.
+ * The most important of these are:
+ *
+ *     struct mytag_sring      - The shared ring.
+ *     struct mytag_front_ring - The 'front' half of the ring.
+ *     struct mytag_back_ring  - The 'back' half of the ring.
+ *
+ * To initialize a ring in your code you need to know the location and size
+ * of the shared memory area (PAGE_SIZE, for instance). To initialise
+ * the front half:
+ *
+ *     struct mytag_front_ring front_ring;
+ *     SHARED_RING_INIT((struct mytag_sring *)shared_page);
+ *     FRONT_RING_INIT(&front_ring, (struct mytag_sring *)shared_page,
+ *                     PAGE_SIZE);
+ *
+ * Initializing the back follows similarly (note that only the front
+ * initializes the shared ring):
+ *
+ *     struct mytag_back_ring back_ring;
+ *     BACK_RING_INIT(&back_ring, (struct mytag_sring *)shared_page,
+ *                    PAGE_SIZE);
+ */
+
+#define DEFINE_RING_TYPES(__name, __req_t, __rsp_t)                     \
+                                                                        \
+/* Shared ring entry */                                                 \
+union __name##_sring_entry {                                            \
+    __req_t req;                                                        \
+    __rsp_t rsp;                                                        \
+};                                                                      \
+                                                                        \
+/* Shared ring page */                                                  \
+struct __name##_sring {                                                 \
+    RING_IDX req_prod, req_event;                                       \
+    RING_IDX rsp_prod, rsp_event;                                       \
+    uint8_t  pad[48];                                                   \
+    union __name##_sring_entry ring[1]; /* variable-length */           \
+};                                                                      \
+                                                                        \
+/* "Front" end's private variables */                                   \
+struct __name##_front_ring {                                            \
+    RING_IDX req_prod_pvt;                                              \
+    RING_IDX rsp_cons;                                                  \
+    unsigned int nr_ents;                                               \
+    struct __name##_sring *sring;                                       \
+};                                                                      \
+                                                                        \
+/* "Back" end's private variables */                                    \
+struct __name##_back_ring {                                             \
+    RING_IDX rsp_prod_pvt;                                              \
+    RING_IDX req_cons;                                                  \
+    unsigned int nr_ents;                                               \
+    struct __name##_sring *sring;                                       \
+};
+
+/*
+ * Macros for manipulating rings.
+ *
+ * FRONT_RING_whatever works on the "front end" of a ring: here
+ * requests are pushed on to the ring and responses taken off it.
+ *
+ * BACK_RING_whatever works on the "back end" of a ring: here
+ * requests are taken off the ring and responses put on.
+ *
+ * N.B. these macros do NO INTERLOCKS OR FLOW CONTROL.
+ * This is OK in 1-for-1 request-response situations where the
+ * requestor (front end) never has more than RING_SIZE()-1
+ * outstanding requests.
+ */
+
+/* Initialising empty rings */
+#define SHARED_RING_INIT(_s) do {                                       \
+    (_s)->req_prod  = (_s)->rsp_prod  = 0;                              \
+    (_s)->req_event = (_s)->rsp_event = 1;                              \
+    memset((_s)->pad, 0, sizeof((_s)->pad));                            \
+} while(0)
+
+#define FRONT_RING_INIT(_r, _s, __size) do {                            \
+    (_r)->req_prod_pvt = 0;                                             \
+    (_r)->rsp_cons = 0;                                                 \
+    (_r)->nr_ents = __RING_SIZE(_s, __size);                            \
+    (_r)->sring = (_s);                                                 \
+} while (0)
+
+#define BACK_RING_INIT(_r, _s, __size) do {                             \
+    (_r)->rsp_prod_pvt = 0;                                             \
+    (_r)->req_cons = 0;                                                 \
+    (_r)->nr_ents = __RING_SIZE(_s, __size);                            \
+    (_r)->sring = (_s);                                                 \
+} while (0)
+
+/* Initialize to existing shared indexes -- for recovery */
+#define FRONT_RING_ATTACH(_r, _s, __size) do {                          \
+    (_r)->sring = (_s);                                                 \
+    (_r)->req_prod_pvt = (_s)->req_prod;                                \
+    (_r)->rsp_cons = (_s)->rsp_prod;                                    \
+    (_r)->nr_ents = __RING_SIZE(_s, __size);                            \
+} while (0)
+
+#define BACK_RING_ATTACH(_r, _s, __size) do {                           \
+    (_r)->sring = (_s);                                                 \
+    (_r)->rsp_prod_pvt = (_s)->rsp_prod;                                \
+    (_r)->req_cons = (_s)->req_prod;                                    \
+    (_r)->nr_ents = __RING_SIZE(_s, __size);                            \
+} while (0)
+
+/* How big is this ring? */
+#define RING_SIZE(_r)                                                   \
+    ((_r)->nr_ents)
+
+/* Number of free requests (for use on front side only). */
+#define RING_FREE_REQUESTS(_r)						\
+    (RING_SIZE(_r) - ((_r)->req_prod_pvt - (_r)->rsp_cons))
+
+/* Test if there is an empty slot available on the front ring.
+ * (This is only meaningful from the front. )
+ */
+#define RING_FULL(_r)                                                   \
+    (RING_FREE_REQUESTS(_r) == 0)
+
+/* Test if there are outstanding messages to be processed on a ring. */
+#define RING_HAS_UNCONSUMED_RESPONSES(_r)                               \
+    ((_r)->sring->rsp_prod - (_r)->rsp_cons)
+
+#define RING_HAS_UNCONSUMED_REQUESTS(_r)                                \
+    ({									\
+	unsigned int req = (_r)->sring->req_prod - (_r)->req_cons;	\
+	unsigned int rsp = RING_SIZE(_r) -				\
+			   ((_r)->req_cons - (_r)->rsp_prod_pvt);	\
+	req < rsp ? req : rsp;						\
+    })
+
+/* Direct access to individual ring elements, by index. */
+#define RING_GET_REQUEST(_r, _idx)                                      \
+    (&((_r)->sring->ring[((_idx) & (RING_SIZE(_r) - 1))].req))
+
+#define RING_GET_RESPONSE(_r, _idx)                                     \
+    (&((_r)->sring->ring[((_idx) & (RING_SIZE(_r) - 1))].rsp))
+
+/* Loop termination condition: Would the specified index overflow the ring? */
+#define RING_REQUEST_CONS_OVERFLOW(_r, _cons)                           \
+    (((_cons) - (_r)->rsp_prod_pvt) >= RING_SIZE(_r))
+
+#define RING_PUSH_REQUESTS(_r) do {                                     \
+    wmb(); /* back sees requests /before/ updated producer index */     \
+    (_r)->sring->req_prod = (_r)->req_prod_pvt;                         \
+} while (0)
+
+#define RING_PUSH_RESPONSES(_r) do {                                    \
+    wmb(); /* front sees responses /before/ updated producer index */   \
+    (_r)->sring->rsp_prod = (_r)->rsp_prod_pvt;                         \
+} while (0)
+
+/*
+ * Notification hold-off (req_event and rsp_event):
+ *
+ * When queueing requests or responses on a shared ring, it may not always be
+ * necessary to notify the remote end. For example, if requests are in flight
+ * in a backend, the front may be able to queue further requests without
+ * notifying the back (if the back checks for new requests when it queues
+ * responses).
+ *
+ * When enqueuing requests or responses:
+ *
+ *  Use RING_PUSH_{REQUESTS,RESPONSES}_AND_CHECK_NOTIFY(). The second argument
+ *  is a boolean return value. True indicates that the receiver requires an
+ *  asynchronous notification.
+ *
+ * After dequeuing requests or responses (before sleeping the connection):
+ *
+ *  Use RING_FINAL_CHECK_FOR_REQUESTS() or RING_FINAL_CHECK_FOR_RESPONSES().
+ *  The second argument is a boolean return value. True indicates that there
+ *  are pending messages on the ring (i.e., the connection should not be put
+ *  to sleep).
+ *
+ *  These macros will set the req_event/rsp_event field to trigger a
+ *  notification on the very next message that is enqueued. If you want to
+ *  create batches of work (i.e., only receive a notification after several
+ *  messages have been enqueued) then you will need to create a customised
+ *  version of the FINAL_CHECK macro in your own code, which sets the event
+ *  field appropriately.
+ */
+
+#define RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(_r, _notify) do {           \
+    RING_IDX __old = (_r)->sring->req_prod;                             \
+    RING_IDX __new = (_r)->req_prod_pvt;                                \
+    wmb(); /* back sees requests /before/ updated producer index */     \
+    (_r)->sring->req_prod = __new;                                      \
+    mb(); /* back sees new requests /before/ we check req_event */      \
+    (_notify) = ((RING_IDX)(__new - (_r)->sring->req_event) <           \
+                 (RING_IDX)(__new - __old));                            \
+} while (0)
+
+#define RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(_r, _notify) do {          \
+    RING_IDX __old = (_r)->sring->rsp_prod;                             \
+    RING_IDX __new = (_r)->rsp_prod_pvt;                                \
+    wmb(); /* front sees responses /before/ updated producer index */   \
+    (_r)->sring->rsp_prod = __new;                                      \
+    mb(); /* front sees new responses /before/ we check rsp_event */    \
+    (_notify) = ((RING_IDX)(__new - (_r)->sring->rsp_event) <           \
+                 (RING_IDX)(__new - __old));                            \
+} while (0)
+
+#define RING_FINAL_CHECK_FOR_REQUESTS(_r, _work_to_do) do {             \
+    (_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);                   \
+    if (_work_to_do) break;                                             \
+    (_r)->sring->req_event = (_r)->req_cons + 1;                        \
+    mb();                                                               \
+    (_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);                   \
+} while (0)
+
+#define RING_FINAL_CHECK_FOR_RESPONSES(_r, _work_to_do) do {            \
+    (_work_to_do) = RING_HAS_UNCONSUMED_RESPONSES(_r);                  \
+    if (_work_to_do) break;                                             \
+    (_r)->sring->rsp_event = (_r)->rsp_cons + 1;                        \
+    mb();                                                               \
+    (_work_to_do) = RING_HAS_UNCONSUMED_RESPONSES(_r);                  \
+} while (0)
+
+#endif /* __XEN_PUBLIC_IO_RING_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/io/xenbus.h
@@ -0,0 +1,42 @@
+/*****************************************************************************
+ * xenbus.h
+ *
+ * Xenbus protocol details.
+ *
+ * Copyright (C) 2005 XenSource Ltd.
+ */
+
+#ifndef _XEN_PUBLIC_IO_XENBUS_H
+#define _XEN_PUBLIC_IO_XENBUS_H
+
+/* The state of either end of the Xenbus, i.e. the current communication
+   status of initialisation across the bus.  States here imply nothing about
+   the state of the connection between the driver and the kernel's device
+   layers.  */
+enum xenbus_state
+{
+  XenbusStateUnknown      = 0,
+  XenbusStateInitialising = 1,
+  XenbusStateInitWait     = 2,  /* Finished early initialisation, but waiting
+                                   for information from the peer or hotplug
+				   scripts. */
+  XenbusStateInitialised  = 3,  /* Initialised and waiting for a connection
+				   from the peer. */
+  XenbusStateConnected    = 4,
+  XenbusStateClosing      = 5,  /* The device is being closed due to an error
+				   or an unplug event. */
+  XenbusStateClosed       = 6
+
+};
+
+#endif /* _XEN_PUBLIC_IO_XENBUS_H */
+
+/*
+ * Local variables:
+ *  c-file-style: "linux"
+ *  indent-tabs-mode: t
+ *  c-indent-level: 8
+ *  c-basic-offset: 8
+ *  tab-width: 8
+ * End:
+ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/io/xs_wire.h
@@ -0,0 +1,87 @@
+/*
+ * Details of the "wire" protocol between Xen Store Daemon and client
+ * library or guest kernel.
+ * Copyright (C) 2005 Rusty Russell IBM Corporation
+ */
+
+#ifndef _XS_WIRE_H
+#define _XS_WIRE_H
+
+enum xsd_sockmsg_type
+{
+    XS_DEBUG,
+    XS_DIRECTORY,
+    XS_READ,
+    XS_GET_PERMS,
+    XS_WATCH,
+    XS_UNWATCH,
+    XS_TRANSACTION_START,
+    XS_TRANSACTION_END,
+    XS_INTRODUCE,
+    XS_RELEASE,
+    XS_GET_DOMAIN_PATH,
+    XS_WRITE,
+    XS_MKDIR,
+    XS_RM,
+    XS_SET_PERMS,
+    XS_WATCH_EVENT,
+    XS_ERROR,
+    XS_IS_DOMAIN_INTRODUCED
+};
+
+#define XS_WRITE_NONE "NONE"
+#define XS_WRITE_CREATE "CREATE"
+#define XS_WRITE_CREATE_EXCL "CREATE|EXCL"
+
+/* We hand errors as strings, for portability. */
+struct xsd_errors
+{
+    int errnum;
+    const char *errstring;
+};
+#define XSD_ERROR(x) { x, #x }
+static struct xsd_errors xsd_errors[] __attribute__((unused)) = {
+    XSD_ERROR(EINVAL),
+    XSD_ERROR(EACCES),
+    XSD_ERROR(EEXIST),
+    XSD_ERROR(EISDIR),
+    XSD_ERROR(ENOENT),
+    XSD_ERROR(ENOMEM),
+    XSD_ERROR(ENOSPC),
+    XSD_ERROR(EIO),
+    XSD_ERROR(ENOTEMPTY),
+    XSD_ERROR(ENOSYS),
+    XSD_ERROR(EROFS),
+    XSD_ERROR(EBUSY),
+    XSD_ERROR(EAGAIN),
+    XSD_ERROR(EISCONN)
+};
+
+struct xsd_sockmsg
+{
+    uint32_t type;  /* XS_??? */
+    uint32_t req_id;/* Request identifier, echoed in daemon's response.  */
+    uint32_t tx_id; /* Transaction id (0 if not related to a transaction). */
+    uint32_t len;   /* Length of data following this. */
+
+    /* Generally followed by nul-terminated string(s). */
+};
+
+enum xs_watch_type
+{
+    XS_WATCH_PATH = 0,
+    XS_WATCH_TOKEN
+};
+
+/* Inter-domain shared memory communications. */
+#define XENSTORE_RING_SIZE 1024
+typedef uint32_t XENSTORE_RING_IDX;
+#define MASK_XENSTORE_IDX(idx) ((idx) & (XENSTORE_RING_SIZE-1))
+struct xenstore_domain_interface {
+    char req[XENSTORE_RING_SIZE]; /* Requests to xenstore daemon. */
+    char rsp[XENSTORE_RING_SIZE]; /* Replies and async watch events. */
+    XENSTORE_RING_IDX req_cons, req_prod;
+    XENSTORE_RING_IDX rsp_cons, rsp_prod;
+};
+
+#endif /* _XS_WIRE_H */
===================================================================
--- /dev/null
+++ b/include/xen/interface/memory.h
@@ -0,0 +1,145 @@
+/******************************************************************************
+ * memory.h
+ *
+ * Memory reservation and information.
+ *
+ * Copyright (c) 2005, Keir Fraser <keir@xensource.com>
+ */
+
+#ifndef __XEN_PUBLIC_MEMORY_H__
+#define __XEN_PUBLIC_MEMORY_H__
+
+/*
+ * Increase or decrease the specified domain's memory reservation. Returns a
+ * -ve errcode on failure, or the # extents successfully allocated or freed.
+ * arg == addr of struct xen_memory_reservation.
+ */
+#define XENMEM_increase_reservation 0
+#define XENMEM_decrease_reservation 1
+#define XENMEM_populate_physmap     6
+struct xen_memory_reservation {
+
+    /*
+     * XENMEM_increase_reservation:
+     *   OUT: MFN (*not* GMFN) bases of extents that were allocated
+     * XENMEM_decrease_reservation:
+     *   IN:  GMFN bases of extents to free
+     * XENMEM_populate_physmap:
+     *   IN:  GPFN bases of extents to populate with memory
+     *   OUT: GMFN bases of extents that were allocated
+     *   (NB. This command also updates the mach_to_phys translation table)
+     */
+    GUEST_HANDLE(ulong) extent_start;
+
+    /* Number of extents, and size/alignment of each (2^extent_order pages). */
+    unsigned long  nr_extents;
+    unsigned int   extent_order;
+
+    /*
+     * Maximum # bits addressable by the user of the allocated region (e.g.,
+     * I/O devices often have a 32-bit limitation even in 64-bit systems). If
+     * zero then the user has no addressing restriction.
+     * This field is not used by XENMEM_decrease_reservation.
+     */
+    unsigned int   address_bits;
+
+    /*
+     * Domain whose reservation is being changed.
+     * Unprivileged domains can specify only DOMID_SELF.
+     */
+    domid_t        domid;
+
+};
+DEFINE_GUEST_HANDLE_STRUCT(xen_memory_reservation);
+
+/*
+ * Returns the maximum machine frame number of mapped RAM in this system.
+ * This command always succeeds (it never returns an error code).
+ * arg == NULL.
+ */
+#define XENMEM_maximum_ram_page     2
+
+/*
+ * Returns the current or maximum memory reservation, in pages, of the
+ * specified domain (may be DOMID_SELF). Returns -ve errcode on failure.
+ * arg == addr of domid_t.
+ */
+#define XENMEM_current_reservation  3
+#define XENMEM_maximum_reservation  4
+
+/*
+ * Returns a list of MFN bases of 2MB extents comprising the machine_to_phys
+ * mapping table. Architectures which do not have a m2p table do not implement
+ * this command.
+ * arg == addr of xen_machphys_mfn_list_t.
+ */
+#define XENMEM_machphys_mfn_list    5
+struct xen_machphys_mfn_list {
+    /*
+     * Size of the 'extent_start' array. Fewer entries will be filled if the
+     * machphys table is smaller than max_extents * 2MB.
+     */
+    unsigned int max_extents;
+
+    /*
+     * Pointer to buffer to fill with list of extent starts. If there are
+     * any large discontiguities in the machine address space, 2MB gaps in
+     * the machphys table will be represented by an MFN base of zero.
+     */
+    GUEST_HANDLE(ulong) extent_start;
+
+    /*
+     * Number of extents written to the above array. This will be smaller
+     * than 'max_extents' if the machphys table is smaller than max_e * 2MB.
+     */
+    unsigned int nr_extents;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xen_machphys_mfn_list);
+
+/*
+ * Sets the GPFN at which a particular page appears in the specified guest's
+ * pseudophysical address space.
+ * arg == addr of xen_add_to_physmap_t.
+ */
+#define XENMEM_add_to_physmap      7
+struct xen_add_to_physmap {
+    /* Which domain to change the mapping for. */
+    domid_t domid;
+
+    /* Source mapping space. */
+#define XENMAPSPACE_shared_info 0 /* shared info page */
+#define XENMAPSPACE_grant_table 1 /* grant table page */
+    unsigned int space;
+
+    /* Index into source mapping space. */
+    unsigned long idx;
+
+    /* GPFN where the source mapping page should appear. */
+    unsigned long gpfn;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xen_add_to_physmap);
+
+/*
+ * Translates a list of domain-specific GPFNs into MFNs. Returns a -ve error
+ * code on failure. This call only works for auto-translated guests.
+ */
+#define XENMEM_translate_gpfn_list  8
+struct xen_translate_gpfn_list {
+    /* Which domain to translate for? */
+    domid_t domid;
+
+    /* Length of list. */
+    unsigned long nr_gpfns;
+
+    /* List of GPFNs to translate. */
+    GUEST_HANDLE(ulong) gpfn_list;
+
+    /*
+     * Output list to contain MFN translations. May be the same as the input
+     * list (in which case each input GPFN is overwritten with the output MFN).
+     */
+    GUEST_HANDLE(ulong) mfn_list;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xen_translate_gpfn_list);
+
+#endif /* __XEN_PUBLIC_MEMORY_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/physdev.h
@@ -0,0 +1,61 @@
+
+#ifndef __XEN_PUBLIC_PHYSDEV_H__
+#define __XEN_PUBLIC_PHYSDEV_H__
+
+/* Commands to HYPERVISOR_physdev_op() */
+#define PHYSDEVOP_IRQ_UNMASK_NOTIFY     4
+#define PHYSDEVOP_IRQ_STATUS_QUERY      5
+#define PHYSDEVOP_SET_IOPL              6
+#define PHYSDEVOP_SET_IOBITMAP          7
+#define PHYSDEVOP_APIC_READ             8
+#define PHYSDEVOP_APIC_WRITE            9
+#define PHYSDEVOP_ASSIGN_VECTOR         10
+
+struct physdevop_irq_status_query {
+    /* IN */
+    uint32_t irq;
+    /* OUT */
+/* Need to call PHYSDEVOP_IRQ_UNMASK_NOTIFY when the IRQ has been serviced? */
+#define PHYSDEVOP_IRQ_NEEDS_UNMASK_NOTIFY (1<<0)
+    uint32_t flags;
+};
+
+struct physdevop_set_iopl {
+    /* IN */
+    uint32_t iopl;
+};
+
+struct physdevop_set_iobitmap {
+    /* IN */
+    uint8_t *bitmap;
+    uint32_t nr_ports;
+};
+
+struct physdevop_apic {
+    /* IN */
+    unsigned long apic_physbase;
+    uint32_t reg;
+    /* IN or OUT */
+    uint32_t value;
+};
+
+struct physdevop_irq {
+    /* IN */
+    uint32_t irq;
+    /* OUT */
+    uint32_t vector;
+};
+
+struct physdev_op {
+    uint32_t cmd;
+    union {
+        struct physdevop_irq_status_query      irq_status_query;
+        struct physdevop_set_iopl              set_iopl;
+        struct physdevop_set_iobitmap          set_iobitmap;
+        struct physdevop_apic                  apic_op;
+        struct physdevop_irq                   irq_op;
+    } u;
+};
+DEFINE_GUEST_HANDLE_STRUCT(physdev_op);
+
+#endif /* __XEN_PUBLIC_PHYSDEV_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/sched.h
@@ -0,0 +1,77 @@
+/******************************************************************************
+ * sched.h
+ *
+ * Scheduler state interactions
+ *
+ * Copyright (c) 2005, Keir Fraser <keir@xensource.com>
+ */
+
+#ifndef __XEN_PUBLIC_SCHED_H__
+#define __XEN_PUBLIC_SCHED_H__
+
+#include "event_channel.h"
+
+/*
+ * The prototype for this hypercall is:
+ *  long sched_op_new(int cmd, void *arg)
+ * @cmd == SCHEDOP_??? (scheduler operation).
+ * @arg == Operation-specific extra argument(s), as described below.
+ *
+ * **NOTE**:
+ * Versions of Xen prior to 3.0.2 provide only the following legacy version
+ * of this hypercall, supporting only the commands yield, block and shutdown:
+ *  long sched_op(int cmd, unsigned long arg)
+ * @cmd == SCHEDOP_??? (scheduler operation).
+ * @arg == 0               (SCHEDOP_yield and SCHEDOP_block)
+ *      == SHUTDOWN_* code (SCHEDOP_shutdown)
+ */
+
+/*
+ * Voluntarily yield the CPU.
+ * @arg == NULL.
+ */
+#define SCHEDOP_yield       0
+
+/*
+ * Block execution of this VCPU until an event is received for processing.
+ * If called with event upcalls masked, this operation will atomically
+ * reenable event delivery and check for pending events before blocking the
+ * VCPU. This avoids a "wakeup waiting" race.
+ * @arg == NULL.
+ */
+#define SCHEDOP_block       1
+
+/*
+ * Halt execution of this domain (all VCPUs) and notify the system controller.
+ * @arg == pointer to sched_shutdown structure.
+ */
+#define SCHEDOP_shutdown    2
+struct sched_shutdown {
+    unsigned int reason; /* SHUTDOWN_* */
+};
+DEFINE_GUEST_HANDLE_STRUCT(sched_shutdown);
+
+/*
+ * Poll a set of event-channel ports. Return when one or more are pending. An
+ * optional timeout may be specified.
+ * @arg == pointer to sched_poll structure.
+ */
+#define SCHEDOP_poll        3
+struct sched_poll {
+    GUEST_HANDLE(evtchn_port_t) ports;
+    unsigned int nr_ports;
+    uint64_t timeout;
+};
+DEFINE_GUEST_HANDLE_STRUCT(sched_poll);
+
+/*
+ * Reason codes for SCHEDOP_shutdown. These may be interpreted by control
+ * software to determine the appropriate action. For the most part, Xen does
+ * not care about the shutdown code.
+ */
+#define SHUTDOWN_poweroff   0  /* Domain exited normally. Clean up and kill. */
+#define SHUTDOWN_reboot     1  /* Clean up, kill, and then restart.          */
+#define SHUTDOWN_suspend    2  /* Clean up, save suspend info, kill.         */
+#define SHUTDOWN_crash      3  /* Tell controller we've crashed.             */
+
+#endif /* __XEN_PUBLIC_SCHED_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/vcpu.h
@@ -0,0 +1,109 @@
+/******************************************************************************
+ * vcpu.h
+ *
+ * VCPU initialisation, query, and hotplug.
+ *
+ * Copyright (c) 2005, Keir Fraser <keir@xensource.com>
+ */
+
+#ifndef __XEN_PUBLIC_VCPU_H__
+#define __XEN_PUBLIC_VCPU_H__
+
+/*
+ * Prototype for this hypercall is:
+ *  int vcpu_op(int cmd, int vcpuid, void *extra_args)
+ * @cmd        == VCPUOP_??? (VCPU operation).
+ * @vcpuid     == VCPU to operate on.
+ * @extra_args == Operation-specific extra arguments (NULL if none).
+ */
+
+/*
+ * Initialise a VCPU. Each VCPU can be initialised only once. A
+ * newly-initialised VCPU will not run until it is brought up by VCPUOP_up.
+ *
+ * @extra_arg == pointer to vcpu_guest_context structure containing initial
+ *               state for the VCPU.
+ */
+#define VCPUOP_initialise           0
+
+/*
+ * Bring up a VCPU. This makes the VCPU runnable. This operation will fail
+ * if the VCPU has not been initialised (VCPUOP_initialise).
+ */
+#define VCPUOP_up                   1
+
+/*
+ * Bring down a VCPU (i.e., make it non-runnable).
+ * There are a few caveats that callers should observe:
+ *  1. This operation may return, and VCPU_is_up may return false, before the
+ *     VCPU stops running (i.e., the command is asynchronous). It is a good
+ *     idea to ensure that the VCPU has entered a non-critical loop before
+ *     bringing it down. Alternatively, this operation is guaranteed
+ *     synchronous if invoked by the VCPU itself.
+ *  2. After a VCPU is initialised, there is currently no way to drop all its
+ *     references to domain memory. Even a VCPU that is down still holds
+ *     memory references via its pagetable base pointer and GDT. It is good
+ *     practise to move a VCPU onto an 'idle' or default page table, LDT and
+ *     GDT before bringing it down.
+ */
+#define VCPUOP_down                 2
+
+/* Returns 1 if the given VCPU is up. */
+#define VCPUOP_is_up                3
+
+/*
+ * Return information about the state and running time of a VCPU.
+ * @extra_arg == pointer to vcpu_runstate_info structure.
+ */
+#define VCPUOP_get_runstate_info    4
+struct vcpu_runstate_info {
+    /* VCPU's current state (RUNSTATE_*). */
+    int      state;
+    /* When was current state entered (system time, ns)? */
+    uint64_t state_entry_time;
+    /*
+     * Time spent in each RUNSTATE_* (ns). The sum of these times is
+     * guaranteed not to drift from system time.
+     */
+    uint64_t time[4];
+};
+
+/* VCPU is currently running on a physical CPU. */
+#define RUNSTATE_running  0
+
+/* VCPU is runnable, but not currently scheduled on any physical CPU. */
+#define RUNSTATE_runnable 1
+
+/* VCPU is blocked (a.k.a. idle). It is therefore not runnable. */
+#define RUNSTATE_blocked  2
+
+/*
+ * VCPU is not runnable, but it is not blocked.
+ * This is a 'catch all' state for things like hotplug and pauses by the
+ * system administrator (or for critical sections in the hypervisor).
+ * RUNSTATE_blocked dominates this state (it is the preferred state).
+ */
+#define RUNSTATE_offline  3
+
+/*
+ * Register a shared memory area from which the guest may obtain its own
+ * runstate information without needing to execute a hypercall.
+ * Notes:
+ *  1. The registered address may be virtual or physical, depending on the
+ *     platform. The virtual address should be registered on x86 systems.
+ *  2. Only one shared area may be registered per VCPU. The shared area is
+ *     updated by the hypervisor each time the VCPU is scheduled. Thus
+ *     runstate.state will always be RUNSTATE_running and
+ *     runstate.state_entry_time will indicate the system time at which the
+ *     VCPU was last scheduled to run.
+ * @extra_arg == pointer to vcpu_register_runstate_memory_area structure.
+ */
+#define VCPUOP_register_runstate_memory_area 5
+struct vcpu_register_runstate_memory_area {
+    union {
+        struct vcpu_runstate_info *v;
+        uint64_t p;
+    } addr;
+};
+
+#endif /* __XEN_PUBLIC_VCPU_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/version.h
@@ -0,0 +1,60 @@
+/******************************************************************************
+ * version.h
+ *
+ * Xen version, type, and compile information.
+ *
+ * Copyright (c) 2005, Nguyen Anh Quynh <aquynh@gmail.com>
+ * Copyright (c) 2005, Keir Fraser <keir@xensource.com>
+ */
+
+#ifndef __XEN_PUBLIC_VERSION_H__
+#define __XEN_PUBLIC_VERSION_H__
+
+/* NB. All ops return zero on success, except XENVER_version. */
+
+/* arg == NULL; returns major:minor (16:16). */
+#define XENVER_version      0
+
+/* arg == xen_extraversion_t. */
+#define XENVER_extraversion 1
+struct xen_extraversion {
+    char extraversion[16];
+};
+#define XEN_EXTRAVERSION_LEN (sizeof(struct xen_extraversion))
+
+/* arg == xen_compile_info_t. */
+#define XENVER_compile_info 2
+struct xen_compile_info {
+    char compiler[64];
+    char compile_by[16];
+    char compile_domain[32];
+    char compile_date[32];
+};
+
+#define XENVER_capabilities 3
+struct xen_capabilities_info {
+    char info[1024];
+};
+#define XEN_CAPABILITIES_INFO_LEN (sizeof(struct xen_capabilities_info))
+
+#define XENVER_changeset 4
+struct xen_changeset_info {
+    char info[64];
+};
+#define XEN_CHANGESET_INFO_LEN (sizeof(struct xen_changeset_info))
+
+#define XENVER_platform_parameters 5
+struct xen_platform_parameters {
+    unsigned long virt_start;
+};
+
+#define XENVER_get_features 6
+struct xen_feature_info {
+    unsigned int submap_idx;    /* IN: which 32-bit submap to return */
+    uint32_t     submap;        /* OUT: 32-bit submap */
+};
+
+/* Declares the features reported by XENVER_get_features. */
+#include "features.h"
+
+#endif /* __XEN_PUBLIC_VERSION_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/interface/xen.h
@@ -0,0 +1,445 @@
+/******************************************************************************
+ * xen.h
+ *
+ * Guest OS interface to Xen.
+ *
+ * Copyright (c) 2004, K A Fraser
+ */
+
+#ifndef __XEN_PUBLIC_XEN_H__
+#define __XEN_PUBLIC_XEN_H__
+
+#include <asm/xen/interface.h>
+
+/*
+ * XEN "SYSTEM CALLS" (a.k.a. HYPERCALLS).
+ */
+
+/*
+ * x86_32: EAX = vector; EBX, ECX, EDX, ESI, EDI = args 1, 2, 3, 4, 5.
+ *         EAX = return value
+ *         (argument registers may be clobbered on return)
+ * x86_64: RAX = vector; RDI, RSI, RDX, R10, R8, R9 = args 1, 2, 3, 4, 5, 6.
+ *         RAX = return value
+ *         (argument registers not clobbered on return; RCX, R11 are)
+ */
+#define __HYPERVISOR_set_trap_table        0
+#define __HYPERVISOR_mmu_update            1
+#define __HYPERVISOR_set_gdt               2
+#define __HYPERVISOR_stack_switch          3
+#define __HYPERVISOR_set_callbacks         4
+#define __HYPERVISOR_fpu_taskswitch        5
+#define __HYPERVISOR_sched_op              6
+#define __HYPERVISOR_dom0_op               7
+#define __HYPERVISOR_set_debugreg          8
+#define __HYPERVISOR_get_debugreg          9
+#define __HYPERVISOR_update_descriptor    10
+#define __HYPERVISOR_memory_op            12
+#define __HYPERVISOR_multicall            13
+#define __HYPERVISOR_update_va_mapping    14
+#define __HYPERVISOR_set_timer_op         15
+#define __HYPERVISOR_event_channel_op_compat 16
+#define __HYPERVISOR_xen_version          17
+#define __HYPERVISOR_console_io           18
+#define __HYPERVISOR_physdev_op_compat    19
+#define __HYPERVISOR_grant_table_op       20
+#define __HYPERVISOR_vm_assist            21
+#define __HYPERVISOR_update_va_mapping_otherdomain 22
+#define __HYPERVISOR_iret                 23 /* x86 only */
+#define __HYPERVISOR_vcpu_op              24
+#define __HYPERVISOR_set_segment_base     25 /* x86/64 only */
+#define __HYPERVISOR_mmuext_op            26
+#define __HYPERVISOR_acm_op               27
+#define __HYPERVISOR_nmi_op               28
+#define __HYPERVISOR_sched_op_new         29
+#define __HYPERVISOR_callback_op          30
+#define __HYPERVISOR_xenoprof_op          31
+#define __HYPERVISOR_event_channel_op     32
+#define __HYPERVISOR_physdev_op           33
+#define __HYPERVISOR_hvm_op               34
+
+/*
+ * VIRTUAL INTERRUPTS
+ *
+ * Virtual interrupts that a guest OS may receive from Xen.
+ */
+#define VIRQ_TIMER      0  /* Timebase update, and/or requested timeout.  */
+#define VIRQ_DEBUG      1  /* Request guest to dump debug info.           */
+#define VIRQ_CONSOLE    2  /* (DOM0) Bytes received on emergency console. */
+#define VIRQ_DOM_EXC    3  /* (DOM0) Exceptional event for some domain.   */
+#define VIRQ_DEBUGGER   6  /* (DOM0) A domain has paused for debugging.   */
+#define NR_VIRQS        8
+
+/*
+ * MMU-UPDATE REQUESTS
+ *
+ * HYPERVISOR_mmu_update() accepts a list of (ptr, val) pairs.
+ * A foreigndom (FD) can be specified (or DOMID_SELF for none).
+ * Where the FD has some effect, it is described below.
+ * ptr[1:0] specifies the appropriate MMU_* command.
+ *
+ * ptr[1:0] == MMU_NORMAL_PT_UPDATE:
+ * Updates an entry in a page table. If updating an L1 table, and the new
+ * table entry is valid/present, the mapped frame must belong to the FD, if
+ * an FD has been specified. If attempting to map an I/O page then the
+ * caller assumes the privilege of the FD.
+ * FD == DOMID_IO: Permit /only/ I/O mappings, at the priv level of the caller.
+ * FD == DOMID_XEN: Map restricted areas of Xen's heap space.
+ * ptr[:2]  -- Machine address of the page-table entry to modify.
+ * val      -- Value to write.
+ *
+ * ptr[1:0] == MMU_MACHPHYS_UPDATE:
+ * Updates an entry in the machine->pseudo-physical mapping table.
+ * ptr[:2]  -- Machine address within the frame whose mapping to modify.
+ *             The frame must belong to the FD, if one is specified.
+ * val      -- Value to write into the mapping entry.
+ */
+#define MMU_NORMAL_PT_UPDATE     0 /* checked '*ptr = val'. ptr is MA.       */
+#define MMU_MACHPHYS_UPDATE      1 /* ptr = MA of frame to modify entry for  */
+
+/*
+ * MMU EXTENDED OPERATIONS
+ *
+ * HYPERVISOR_mmuext_op() accepts a list of mmuext_op structures.
+ * A foreigndom (FD) can be specified (or DOMID_SELF for none).
+ * Where the FD has some effect, it is described below.
+ *
+ * cmd: MMUEXT_(UN)PIN_*_TABLE
+ * mfn: Machine frame number to be (un)pinned as a p.t. page.
+ *      The frame must belong to the FD, if one is specified.
+ *
+ * cmd: MMUEXT_NEW_BASEPTR
+ * mfn: Machine frame number of new page-table base to install in MMU.
+ *
+ * cmd: MMUEXT_NEW_USER_BASEPTR [x86/64 only]
+ * mfn: Machine frame number of new page-table base to install in MMU
+ *      when in user space.
+ *
+ * cmd: MMUEXT_TLB_FLUSH_LOCAL
+ * No additional arguments. Flushes local TLB.
+ *
+ * cmd: MMUEXT_INVLPG_LOCAL
+ * linear_addr: Linear address to be flushed from the local TLB.
+ *
+ * cmd: MMUEXT_TLB_FLUSH_MULTI
+ * vcpumask: Pointer to bitmap of VCPUs to be flushed.
+ *
+ * cmd: MMUEXT_INVLPG_MULTI
+ * linear_addr: Linear address to be flushed.
+ * vcpumask: Pointer to bitmap of VCPUs to be flushed.
+ *
+ * cmd: MMUEXT_TLB_FLUSH_ALL
+ * No additional arguments. Flushes all VCPUs' TLBs.
+ *
+ * cmd: MMUEXT_INVLPG_ALL
+ * linear_addr: Linear address to be flushed from all VCPUs' TLBs.
+ *
+ * cmd: MMUEXT_FLUSH_CACHE
+ * No additional arguments. Writes back and flushes cache contents.
+ *
+ * cmd: MMUEXT_SET_LDT
+ * linear_addr: Linear address of LDT base (NB. must be page-aligned).
+ * nr_ents: Number of entries in LDT.
+ */
+#define MMUEXT_PIN_L1_TABLE      0
+#define MMUEXT_PIN_L2_TABLE      1
+#define MMUEXT_PIN_L3_TABLE      2
+#define MMUEXT_PIN_L4_TABLE      3
+#define MMUEXT_UNPIN_TABLE       4
+#define MMUEXT_NEW_BASEPTR       5
+#define MMUEXT_TLB_FLUSH_LOCAL   6
+#define MMUEXT_INVLPG_LOCAL      7
+#define MMUEXT_TLB_FLUSH_MULTI   8
+#define MMUEXT_INVLPG_MULTI      9
+#define MMUEXT_TLB_FLUSH_ALL    10
+#define MMUEXT_INVLPG_ALL       11
+#define MMUEXT_FLUSH_CACHE      12
+#define MMUEXT_SET_LDT          13
+#define MMUEXT_NEW_USER_BASEPTR 15
+
+#ifndef __ASSEMBLY__
+struct mmuext_op {
+    unsigned int cmd;
+    union {
+        /* [UN]PIN_TABLE, NEW_BASEPTR, NEW_USER_BASEPTR */
+        unsigned long mfn;
+        /* INVLPG_LOCAL, INVLPG_ALL, SET_LDT */
+        unsigned long linear_addr;
+    } arg1;
+    union {
+        /* SET_LDT */
+        unsigned int nr_ents;
+        /* TLB_FLUSH_MULTI, INVLPG_MULTI */
+        void *vcpumask;
+    } arg2;
+};
+DEFINE_GUEST_HANDLE_STRUCT(mmuext_op);
+#endif
+
+/* These are passed as 'flags' to update_va_mapping. They can be ORed. */
+/* When specifying UVMF_MULTI, also OR in a pointer to a CPU bitmap.   */
+/* UVMF_LOCAL is merely UVMF_MULTI with a NULL bitmap pointer.         */
+#define UVMF_NONE               (0UL<<0) /* No flushing at all.   */
+#define UVMF_TLB_FLUSH          (1UL<<0) /* Flush entire TLB(s).  */
+#define UVMF_INVLPG             (2UL<<0) /* Flush only one entry. */
+#define UVMF_FLUSHTYPE_MASK     (3UL<<0)
+#define UVMF_MULTI              (0UL<<2) /* Flush subset of TLBs. */
+#define UVMF_LOCAL              (0UL<<2) /* Flush local TLB.      */
+#define UVMF_ALL                (1UL<<2) /* Flush all TLBs.       */
+
+/*
+ * Commands to HYPERVISOR_console_io().
+ */
+#define CONSOLEIO_write         0
+#define CONSOLEIO_read          1
+
+/*
+ * Commands to HYPERVISOR_vm_assist().
+ */
+#define VMASST_CMD_enable                0
+#define VMASST_CMD_disable               1
+#define VMASST_TYPE_4gb_segments         0
+#define VMASST_TYPE_4gb_segments_notify  1
+#define VMASST_TYPE_writable_pagetables  2
+#define VMASST_TYPE_pae_extended_cr3     3
+#define MAX_VMASST_TYPE 3
+
+#ifndef __ASSEMBLY__
+
+typedef uint16_t domid_t;
+
+/* Domain ids >= DOMID_FIRST_RESERVED cannot be used for ordinary domains. */
+#define DOMID_FIRST_RESERVED (0x7FF0U)
+
+/* DOMID_SELF is used in certain contexts to refer to oneself. */
+#define DOMID_SELF (0x7FF0U)
+
+/*
+ * DOMID_IO is used to restrict page-table updates to mapping I/O memory.
+ * Although no Foreign Domain need be specified to map I/O pages, DOMID_IO
+ * is useful to ensure that no mappings to the OS's own heap are accidentally
+ * installed. (e.g., in Linux this could cause havoc as reference counts
+ * aren't adjusted on the I/O-mapping code path).
+ * This only makes sense in MMUEXT_SET_FOREIGNDOM, but in that context can
+ * be specified by any calling domain.
+ */
+#define DOMID_IO   (0x7FF1U)
+
+/*
+ * DOMID_XEN is used to allow privileged domains to map restricted parts of
+ * Xen's heap space (e.g., the machine_to_phys table).
+ * This only makes sense in MMUEXT_SET_FOREIGNDOM, and is only permitted if
+ * the caller is privileged.
+ */
+#define DOMID_XEN  (0x7FF2U)
+
+/*
+ * Send an array of these to HYPERVISOR_mmu_update().
+ * NB. The fields are natural pointer/address size for this architecture.
+ */
+struct mmu_update {
+    uint64_t ptr;       /* Machine address of PTE. */
+    uint64_t val;       /* New contents of PTE.    */
+};
+DEFINE_GUEST_HANDLE_STRUCT(mmu_update);
+
+/*
+ * Send an array of these to HYPERVISOR_multicall().
+ * NB. The fields are natural register size for this architecture.
+ */
+struct multicall_entry {
+    unsigned long op, result;
+    unsigned long args[6];
+};
+DEFINE_GUEST_HANDLE_STRUCT(multicall_entry);
+
+/*
+ * Event channel endpoints per domain:
+ *  1024 if a long is 32 bits; 4096 if a long is 64 bits.
+ */
+#define NR_EVENT_CHANNELS (sizeof(unsigned long) * sizeof(unsigned long) * 64)
+
+struct vcpu_time_info {
+    /*
+     * Updates to the following values are preceded and followed by an
+     * increment of 'version'. The guest can therefore detect updates by
+     * looking for changes to 'version'. If the least-significant bit of
+     * the version number is set then an update is in progress and the guest
+     * must wait to read a consistent set of values.
+     * The correct way to interact with the version number is similar to
+     * Linux's seqlock: see the implementations of read_seqbegin/read_seqretry.
+     */
+    uint32_t version;
+    uint32_t pad0;
+    uint64_t tsc_timestamp;   /* TSC at last update of time vals.  */
+    uint64_t system_time;     /* Time, in nanosecs, since boot.    */
+    /*
+     * Current system time:
+     *   system_time + ((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul
+     * CPU frequency (Hz):
+     *   ((10^9 << 32) / tsc_to_system_mul) >> tsc_shift
+     */
+    uint32_t tsc_to_system_mul;
+    int8_t   tsc_shift;
+    int8_t   pad1[3];
+}; /* 32 bytes */
+
+struct vcpu_info {
+    /*
+     * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
+     * a pending notification for a particular VCPU. It is then cleared
+     * by the guest OS /before/ checking for pending work, thus avoiding
+     * a set-and-check race. Note that the mask is only accessed by Xen
+     * on the CPU that is currently hosting the VCPU. This means that the
+     * pending and mask flags can be updated by the guest without special
+     * synchronisation (i.e., no need for the x86 LOCK prefix).
+     * This may seem suboptimal because if the pending flag is set by
+     * a different CPU then an IPI may be scheduled even when the mask
+     * is set. However, note:
+     *  1. The task of 'interrupt holdoff' is covered by the per-event-
+     *     channel mask bits. A 'noisy' event that is continually being
+     *     triggered can be masked at source at this very precise
+     *     granularity.
+     *  2. The main purpose of the per-VCPU mask is therefore to restrict
+     *     reentrant execution: whether for concurrency control, or to
+     *     prevent unbounded stack usage. Whatever the purpose, we expect
+     *     that the mask will be asserted only for short periods at a time,
+     *     and so the likelihood of a 'spurious' IPI is suitably small.
+     * The mask is read before making an event upcall to the guest: a
+     * non-zero mask therefore guarantees that the VCPU will not receive
+     * an upcall activation. The mask is cleared when the VCPU requests
+     * to block: this avoids wakeup-waiting races.
+     */
+    uint8_t evtchn_upcall_pending;
+    uint8_t evtchn_upcall_mask;
+    unsigned long evtchn_pending_sel;
+    struct arch_vcpu_info arch;
+    struct vcpu_time_info time;
+}; /* 64 bytes (x86) */
+
+/*
+ * Xen/kernel shared data -- pointer provided in start_info.
+ * NB. We expect that this struct is smaller than a page.
+ */
+struct shared_info {
+    struct vcpu_info vcpu_info[MAX_VIRT_CPUS];
+
+    /*
+     * A domain can create "event channels" on which it can send and receive
+     * asynchronous event notifications. There are three classes of event that
+     * are delivered by this mechanism:
+     *  1. Bi-directional inter- and intra-domain connections. Domains must
+     *     arrange out-of-band to set up a connection (usually by allocating
+     *     an unbound 'listener' port and avertising that via a storage service
+     *     such as xenstore).
+     *  2. Physical interrupts. A domain with suitable hardware-access
+     *     privileges can bind an event-channel port to a physical interrupt
+     *     source.
+     *  3. Virtual interrupts ('events'). A domain can bind an event-channel
+     *     port to a virtual interrupt source, such as the virtual-timer
+     *     device or the emergency console.
+     *
+     * Event channels are addressed by a "port index". Each channel is
+     * associated with two bits of information:
+     *  1. PENDING -- notifies the domain that there is a pending notification
+     *     to be processed. This bit is cleared by the guest.
+     *  2. MASK -- if this bit is clear then a 0->1 transition of PENDING
+     *     will cause an asynchronous upcall to be scheduled. This bit is only
+     *     updated by the guest. It is read-only within Xen. If a channel
+     *     becomes pending while the channel is masked then the 'edge' is lost
+     *     (i.e., when the channel is unmasked, the guest must manually handle
+     *     pending notifications as no upcall will be scheduled by Xen).
+     *
+     * To expedite scanning of pending notifications, any 0->1 pending
+     * transition on an unmasked channel causes a corresponding bit in a
+     * per-vcpu selector word to be set. Each bit in the selector covers a
+     * 'C long' in the PENDING bitfield array.
+     */
+    unsigned long evtchn_pending[sizeof(unsigned long) * 8];
+    unsigned long evtchn_mask[sizeof(unsigned long) * 8];
+
+    /*
+     * Wallclock time: updated only by control software. Guests should base
+     * their gettimeofday() syscall on this wallclock-base value.
+     */
+    uint32_t wc_version;      /* Version counter: see vcpu_time_info_t. */
+    uint32_t wc_sec;          /* Secs  00:00:00 UTC, Jan 1, 1970.  */
+    uint32_t wc_nsec;         /* Nsecs 00:00:00 UTC, Jan 1, 1970.  */
+
+    struct arch_shared_info arch;
+
+};
+
+/*
+ * Start-of-day memory layout for the initial domain (DOM0):
+ *  1. The domain is started within contiguous virtual-memory region.
+ *  2. The contiguous region begins and ends on an aligned 4MB boundary.
+ *  3. The region start corresponds to the load address of the OS image.
+ *     If the load address is not 4MB aligned then the address is rounded down.
+ *  4. This the order of bootstrap elements in the initial virtual region:
+ *      a. relocated kernel image
+ *      b. initial ram disk              [mod_start, mod_len]
+ *      c. list of allocated page frames [mfn_list, nr_pages]
+ *      d. start_info_t structure        [register ESI (x86)]
+ *      e. bootstrap page tables         [pt_base, CR3 (x86)]
+ *      f. bootstrap stack               [register ESP (x86)]
+ *  5. Bootstrap elements are packed together, but each is 4kB-aligned.
+ *  6. The initial ram disk may be omitted.
+ *  7. The list of page frames forms a contiguous 'pseudo-physical' memory
+ *     layout for the domain. In particular, the bootstrap virtual-memory
+ *     region is a 1:1 mapping to the first section of the pseudo-physical map.
+ *  8. All bootstrap elements are mapped read-writable for the guest OS. The
+ *     only exception is the bootstrap page table, which is mapped read-only.
+ *  9. There is guaranteed to be at least 512kB padding after the final
+ *     bootstrap element. If necessary, the bootstrap virtual region is
+ *     extended by an extra 4MB to ensure this.
+ */
+
+#define MAX_GUEST_CMDLINE 1024
+struct start_info {
+    /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME.    */
+    char magic[32];             /* "xen-<version>-<platform>".            */
+    unsigned long nr_pages;     /* Total pages allocated to this domain.  */
+    unsigned long shared_info;  /* MACHINE address of shared info struct. */
+    uint32_t flags;             /* SIF_xxx flags.                         */
+    unsigned long store_mfn;    /* MACHINE page number of shared page.    */
+    uint32_t store_evtchn;      /* Event channel for store communication. */
+    union {
+        struct {
+            unsigned long mfn;      /* MACHINE page number of console page.   */
+            uint32_t  evtchn;   /* Event channel for console page.        */
+        } domU;
+        struct {
+            uint32_t info_off;  /* Offset of console_info struct.         */
+            uint32_t info_size; /* Size of console_info struct from start.*/
+        } dom0;
+    } console;
+    /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME).     */
+    unsigned long pt_base;      /* VIRTUAL address of page directory.     */
+    unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames.       */
+    unsigned long mfn_list;     /* VIRTUAL address of page-frame list.    */
+    unsigned long mod_start;    /* VIRTUAL address of pre-loaded module.  */
+    unsigned long mod_len;      /* Size (bytes) of pre-loaded module.     */
+    int8_t cmd_line[MAX_GUEST_CMDLINE];
+};
+
+/* These flags are passed in the 'flags' field of start_info_t. */
+#define SIF_PRIVILEGED    (1<<0)  /* Is the domain privileged? */
+#define SIF_INITDOMAIN    (1<<1)  /* Is this the initial control domain? */
+
+typedef uint64_t cpumap_t;
+
+typedef uint8_t xen_domain_handle_t[16];
+
+/* Turn a plain number into a C unsigned long constant. */
+#define __mk_unsigned_long(x) x ## UL
+#define mk_unsigned_long(x) __mk_unsigned_long(x)
+
+#else /* __ASSEMBLY__ */
+
+/* In assembly code we cannot use C numeric constant suffixes. */
+#define mk_unsigned_long(x) x
+
+#endif /* !__ASSEMBLY__ */
+
+#endif /* __XEN_PUBLIC_XEN_H__ */

-- 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (18 preceding siblings ...)
  2007-03-01 23:25 ` [patch 19/26] Xen-paravirt_ops: Add Xen interface header files Jeremy Fitzhardinge
@ 2007-03-01 23:25 ` Jeremy Fitzhardinge
  2007-03-16  9:14     ` Ingo Molnar
  2007-03-01 23:25 ` [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console Jeremy Fitzhardinge
                   ` (7 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Ian Pratt,
	Christian Limpach, Adrian Bunk

[-- Attachment #1: xen-core.patch --]
[-- Type: text/plain, Size: 84895 bytes --]

Core Xen Implementation

This patch is a rollup of all the core pieces of the Xen
implementation, including booting, memory management, interrupts, time
and so on.

This adds a second entry point to head.S, which is jumped to when
booted by Xen.  This allows startup under Xen to be easily detected.

Because Xen starts the kernel in a fairly sane state, very little
setup is needed here; it just needs to jump into xen_start_kernel to
init the paravirt_ops structure, and then jump into start_kernel
proper.

This also makes a few small adjustments to the gdt tables to make them
properly suited to Xen.

This patch also contains an initial set of paravirt ops for Xen,
mostly ones which can just be implemented with the generic native_*
version.  At boot time, the global paravirt_ops structure is populated
with the Xen pointers in xen_start_kernel, which then jumps to the
standard start_kernel.

Also implements Xen pagetable handling, including the machinery to
implement direct pagetables.  This means that pte update operations
are intercepted so that pseudo-physical addresses can be converted to
machine addresses.  It also implements late pinning/early unpinning so
that pagetables are initialized while they're normal read-write memory
as much as possible, yet pinned to make cr3 reloads as efficient as
possible.

Xen implements interrupts in terms of event channels.  This patch maps
an event through an event channel to an irq, and then feeds it through
the normal interrupt path, via a Xen irq_chip implementation.

Context switches are optimised with the use of multicalls, so that the
multiple hypercalls needed to perform the switch can be implemented
with a single hypercall.  In order to simplify the source changes,
this is done via a generic multicall batching mechanism, which can be
used to batch arbitrary series of hypercalls into a multicall.

Time is implemented by using a clocksource which is driven from the
hypervisor's nanosecond timebase.  Xen implements time by
extrapolating from known timestamps using the tsc; the hypervisor is
responsible for making sure that the tsc is constant rate and
synchronized between vcpus.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Ian Pratt <ian.pratt@xensource.com>
Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Cc: Adrian Bunk <bunk@stusta.de>

---
 arch/i386/Makefile               |    3 
 arch/i386/kernel/cpu/common.c    |    3 
 arch/i386/kernel/entry.S         |   79 +++
 arch/i386/kernel/head.S          |   13 
 arch/i386/kernel/paravirt.c      |   48 +-
 arch/i386/kernel/vmlinux.lds.S   |    1 
 arch/i386/xen/Makefile           |    2 
 arch/i386/xen/enlighten.c        |  752 ++++++++++++++++++++++++++++++++++++++
 arch/i386/xen/events.c           |  481 ++++++++++++++++++++++++
 arch/i386/xen/features.c         |   29 +
 arch/i386/xen/mmu.c              |  411 ++++++++++++++++++++
 arch/i386/xen/mmu.h              |   49 ++
 arch/i386/xen/multicalls.c       |   62 +++
 arch/i386/xen/multicalls.h       |   27 +
 arch/i386/xen/setup.c            |   95 ++++
 arch/i386/xen/time.c             |  446 ++++++++++++++++++++++
 arch/i386/xen/xen-head.S         |   30 +
 arch/i386/xen/xen-ops.h          |   31 +
 include/asm-i386/irq.h           |    1 
 include/asm-i386/paravirt.h      |   39 +
 include/asm-i386/pda.h           |   18 
 include/asm-i386/xen/hypercall.h |   18 
 include/xen/events.h             |   28 +
 include/xen/features.h           |   26 +
 include/xen/page.h               |  175 ++++++++
 25 files changed, 2841 insertions(+), 26 deletions(-)

===================================================================
--- a/arch/i386/Makefile
+++ b/arch/i386/Makefile
@@ -93,6 +93,9 @@ mcore-$(CONFIG_X86_ES7000)	:= mach-defau
 mcore-$(CONFIG_X86_ES7000)	:= mach-default
 core-$(CONFIG_X86_ES7000)	:= arch/i386/mach-es7000/
 
+# Xen paravirtualization support
+core-$(CONFIG_XEN)		+= arch/i386/xen/
+
 # default subarch .h files
 mflags-y += -Iinclude/asm-i386/mach-default
 
===================================================================
--- a/arch/i386/kernel/cpu/common.c
+++ b/arch/i386/kernel/cpu/common.c
@@ -19,6 +19,7 @@
 #include <mach_apic.h>
 #endif
 #include <asm/pda.h>
+#include <asm/paravirt.h>
 
 #include "cpu.h"
 
@@ -707,6 +708,8 @@ __cpuinit int init_gdt(int cpu, struct t
 	pda->cpu_number = cpu;
 	pda->pcurrent = idle;
 
+	paravirt_init_pda(pda, cpu);
+
 	return 1;
 }
 
===================================================================
--- a/arch/i386/kernel/entry.S
+++ b/arch/i386/kernel/entry.S
@@ -1034,6 +1034,85 @@ ENTRY(kernel_thread_helper)
 	CFI_ENDPROC
 ENDPROC(kernel_thread_helper)
 
+#ifdef CONFIG_XEN
+/* Xen only supports sysenter/sysexit in ring0 guests,
+   and only if it the guest asks for it.  So for now,
+   this should never be used. */
+ENTRY(xen_sti_sysexit)
+	CFI_STARTPROC
+	ud2
+	CFI_ENDPROC
+ENDPROC(xen_sti_sysexit)
+
+ENTRY(xen_hypervisor_callback)
+	CFI_STARTPROC
+	pushl $0
+	CFI_ADJUST_CFA_OFFSET 4
+	SAVE_ALL
+	mov %esp, %eax
+	call xen_evtchn_do_upcall
+	jmp  ret_from_intr
+	CFI_ENDPROC
+ENDPROC(xen_hypervisor_callback)
+
+# Hypervisor uses this for application faults while it executes.
+# We get here for two reasons:
+#  1. Fault while reloading DS, ES, FS or GS
+#  2. Fault while executing IRET
+# Category 1 we fix up by reattempting the load, and zeroing the segment
+# register if the load fails.
+# Category 2 we fix up by jumping to do_iret_error. We cannot use the
+# normal Linux return path in this case because if we use the IRET hypercall
+# to pop the stack frame we end up in an infinite loop of failsafe callbacks.
+# We distinguish between categories by maintaining a status value in EAX.
+ENTRY(xen_failsafe_callback)
+	CFI_STARTPROC
+	pushl %eax
+	CFI_ADJUST_CFA_OFFSET 4
+	movl $1,%eax
+1:	mov 4(%esp),%ds
+2:	mov 8(%esp),%es
+3:	mov 12(%esp),%fs
+4:	mov 16(%esp),%gs
+	testl %eax,%eax
+	popl %eax
+	CFI_ADJUST_CFA_OFFSET -4
+	lea 16(%esp),%esp
+	CFI_ADJUST_CFA_OFFSET -16
+	jz 5f
+	addl $16,%esp
+	jmp iret_exc		# EAX != 0 => Category 2 (Bad IRET)
+5:	pushl $0		# EAX == 0 => Category 1 (Bad segment)
+	CFI_ADJUST_CFA_OFFSET 4
+	SAVE_ALL
+	jmp ret_from_exception
+	CFI_ENDPROC
+
+.section .fixup,"ax"
+6:	xorl %eax,%eax
+	movl %eax,4(%esp)
+	jmp 1b
+7:	xorl %eax,%eax
+	movl %eax,8(%esp)
+	jmp 2b
+8:	xorl %eax,%eax
+	movl %eax,12(%esp)
+	jmp 3b
+9:	xorl %eax,%eax
+	movl %eax,16(%esp)
+	jmp 4b
+.previous
+.section __ex_table,"a"
+	.align 4
+	.long 1b,6b
+	.long 2b,7b
+	.long 3b,8b
+	.long 4b,9b
+.previous
+ENDPROC(xen_failsafe_callback)
+
+#endif	/* CONFIG_XEN */
+
 .section .rodata,"a"
 #include "syscall_table.S"
 
===================================================================
--- a/arch/i386/kernel/head.S
+++ b/arch/i386/kernel/head.S
@@ -535,6 +535,10 @@ unhandled_paravirt:
 	ud2
 #endif
 
+#ifdef CONFIG_XEN
+#include "../xen/xen-head.S"
+#endif
+
 /*
  * Real beginning of normal "text" segment
  */
@@ -544,8 +548,9 @@ ENTRY(_stext)
 /*
  * BSS section
  */
-.section ".bss.page_aligned","w"
+.section ".bss.page_aligned","wa"
 ENTRY(swapper_pg_dir)
+	.align PAGE_SIZE_asm
 	.fill 1024,4,0
 ENTRY(empty_zero_page)
 	.fill 4096,1,0
@@ -614,7 +619,8 @@ ENTRY(boot_gdt_table)
 /*
  * The Global Descriptor Table contains 28 quadwords, per-CPU.
  */
-	.align L1_CACHE_BYTES
+	.section ".data.page_aligned"
+	.align PAGE_SIZE_asm
 ENTRY(cpu_gdt_table)
 	.quad 0x0000000000000000	/* NULL descriptor */
 	.quad 0x0000000000000000	/* 0x0b reserved */
@@ -663,3 +669,6 @@ ENTRY(cpu_gdt_table)
 	.quad 0x0000000000000000	/* 0xf0 - unused */
 	.quad 0x0000000000000000	/* 0xf8 - GDT entry 31: double-fault TSS */
 
+	/* Be sure this is zeroed to avoid false validations in Xen */
+	.fill PAGE_SIZE_asm / 8 - GDT_ENTRIES,8,0
+	.previous
===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -247,55 +247,55 @@ void init_IRQ(void)
 	paravirt_ops.init_IRQ();
 }
 
-static void native_clts(void)
+void native_clts(void)
 {
 	asm volatile ("clts");
 }
 
-static unsigned long native_read_cr0(void)
+unsigned long native_read_cr0(void)
 {
 	unsigned long val;
 	asm volatile("movl %%cr0,%0\n\t" :"=r" (val));
 	return val;
 }
 
-static void native_write_cr0(unsigned long val)
+void native_write_cr0(unsigned long val)
 {
 	asm volatile("movl %0,%%cr0": :"r" (val));
 }
 
-static unsigned long native_read_cr2(void)
+unsigned long native_read_cr2(void)
 {
 	unsigned long val;
 	asm volatile("movl %%cr2,%0\n\t" :"=r" (val));
 	return val;
 }
 
-static void native_write_cr2(unsigned long val)
+void native_write_cr2(unsigned long val)
 {
 	asm volatile("movl %0,%%cr2": :"r" (val));
 }
 
-static unsigned long native_read_cr3(void)
+unsigned long native_read_cr3(void)
 {
 	unsigned long val;
 	asm volatile("movl %%cr3,%0\n\t" :"=r" (val));
 	return val;
 }
 
-static void native_write_cr3(unsigned long val)
+void native_write_cr3(unsigned long val)
 {
 	asm volatile("movl %0,%%cr3": :"r" (val));
 }
 
-static unsigned long native_read_cr4(void)
+unsigned long native_read_cr4(void)
 {
 	unsigned long val;
 	asm volatile("movl %%cr4,%0\n\t" :"=r" (val));
 	return val;
 }
 
-static unsigned long native_read_cr4_safe(void)
+unsigned long native_read_cr4_safe(void)
 {
 	unsigned long val;
 	/* This could fault if %cr4 does not exist */
@@ -308,7 +308,7 @@ static unsigned long native_read_cr4_saf
 	return val;
 }
 
-static void native_write_cr4(unsigned long val)
+void native_write_cr4(unsigned long val)
 {
 	asm volatile("movl %0,%%cr4": :"r" (val));
 }
@@ -347,12 +347,12 @@ static void native_halt(void)
 	asm volatile("hlt": : :"memory");
 }
 
-static void native_wbinvd(void)
+void native_wbinvd(void)
 {
 	asm volatile("wbinvd": : :"memory");
 }
 
-static unsigned long long native_read_msr(unsigned int msr, int *err)
+unsigned long long native_read_msr(unsigned int msr, int *err)
 {
 	unsigned long long val;
 
@@ -371,7 +371,7 @@ static unsigned long long native_read_ms
 	return val;
 }
 
-static int native_write_msr(unsigned int msr, unsigned long long val)
+int native_write_msr(unsigned int msr, unsigned long long val)
 {
 	int err;
 	asm volatile("2: wrmsr ; xorl %0,%0\n"
@@ -389,14 +389,14 @@ static int native_write_msr(unsigned int
 	return err;
 }
 
-static unsigned long long native_read_tsc(void)
+unsigned long long native_read_tsc(void)
 {
 	unsigned long long val;
 	asm volatile("rdtsc" : "=A" (val));
 	return val;
 }
 
-static unsigned long long native_read_pmc(void)
+unsigned long long native_read_pmc(void)
 {
 	unsigned long long val;
 	asm volatile("rdpmc" : "=A" (val));
@@ -418,17 +418,17 @@ static void native_load_idt(const struct
 	asm volatile("lidt %0"::"m" (*dtr));
 }
 
-static void native_store_gdt(struct Xgt_desc_struct *dtr)
+void native_store_gdt(struct Xgt_desc_struct *dtr)
 {
 	asm ("sgdt %0":"=m" (*dtr));
 }
 
-static void native_store_idt(struct Xgt_desc_struct *dtr)
+void native_store_idt(struct Xgt_desc_struct *dtr)
 {
 	asm ("sidt %0":"=m" (*dtr));
 }
 
-static unsigned long native_store_tr(void)
+unsigned long native_store_tr(void)
 {
 	unsigned long tr;
 	asm ("str %0":"=r" (tr));
@@ -437,9 +437,9 @@ static unsigned long native_store_tr(voi
 
 static void native_load_tls(struct thread_struct *t, unsigned int cpu)
 {
-#define C(i) get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i]
-	C(0); C(1); C(2);
-#undef C
+	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 0] = t->tls_array[0];
+	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 1] = t->tls_array[1];
+	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 2] = t->tls_array[2];
 }
 
 static inline void native_write_dt_entry(void *dt, int entry, u32 entry_low, u32 entry_high)
@@ -449,17 +449,17 @@ static inline void native_write_dt_entry
 	lp[1] = entry_high;
 }
 
-static void native_write_ldt_entry(void *dt, int entrynum, u32 low, u32 high)
+void native_write_ldt_entry(void *dt, int entrynum, u32 low, u32 high)
 {
 	native_write_dt_entry(dt, entrynum, low, high);
 }
 
-static void native_write_gdt_entry(void *dt, int entrynum, u32 low, u32 high)
+void native_write_gdt_entry(void *dt, int entrynum, u32 low, u32 high)
 {
 	native_write_dt_entry(dt, entrynum, low, high);
 }
 
-static void native_write_idt_entry(void *dt, int entrynum, u32 low, u32 high)
+void native_write_idt_entry(void *dt, int entrynum, u32 low, u32 high)
 {
 	native_write_dt_entry(dt, entrynum, low, high);
 }
===================================================================
--- a/arch/i386/kernel/vmlinux.lds.S
+++ b/arch/i386/kernel/vmlinux.lds.S
@@ -95,6 +95,7 @@ SECTIONS
 
   . = ALIGN(4096);
   .data.page_aligned : AT(ADDR(.data.page_aligned) - LOAD_OFFSET) {
+	*(.data.page_aligned)
 	*(.data.idt)
   }
 
===================================================================
--- /dev/null
+++ b/arch/i386/xen/Makefile
@@ -0,0 +1,2 @@
+obj-y		:= enlighten.o setup.o events.o time.o \
+			features.o mmu.o multicalls.o
===================================================================
--- /dev/null
+++ b/arch/i386/xen/enlighten.c
@@ -0,0 +1,752 @@
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/smp.h>
+#include <linux/preempt.h>
+#include <linux/percpu.h>
+#include <linux/delay.h>
+#include <linux/start_kernel.h>
+#include <linux/sched.h>
+#include <linux/bootmem.h>
+#include <linux/module.h>
+
+#include <xen/interface/xen.h>
+#include <xen/features.h>
+#include <xen/page.h>
+
+#include <asm/paravirt.h>
+#include <asm/page.h>
+#include <asm/xen/hypercall.h>
+#include <asm/xen/hypervisor.h>
+#include <asm/fixmap.h>
+#include <asm/processor.h>
+#include <asm/setup.h>
+#include <asm/desc.h>
+#include <asm/pgtable.h>
+
+#include "xen-ops.h"
+#include "mmu.h"
+#include "multicalls.h"
+
+EXPORT_SYMBOL_GPL(hypercall_page);
+
+DEFINE_PER_CPU(unsigned, xen_lazy_mode);
+
+/* Code defined in entry.S (not a function) */
+extern const char xen_sti_sysexit[];
+
+struct start_info *xen_start_info;
+EXPORT_SYMBOL_GPL(xen_start_info);
+
+static void __init xen_banner(void)
+{
+	printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
+	       paravirt_ops.name);
+	printk(KERN_INFO "Hypervisor signature: %s\n", xen_start_info->magic);
+}
+
+static void xen_init_pda(struct i386_pda *pda, int cpu)
+{
+	/* Don't re-init boot CPU; we do it once very early in boot,
+	   and then then cpu_init tries to do it again. If so, just
+	   reuse the stuff we already set up. */
+	if (cpu == 0 && pda != &boot_pda) {
+		BUG_ON(boot_pda.xen.vcpu == NULL);
+		pda->xen = boot_pda.xen;
+		return;
+	}
+
+	pda->xen.vcpu = &HYPERVISOR_shared_info->vcpu_info[cpu];
+	pda->xen.cr3 = 0;
+}
+
+static void xen_cpuid(unsigned int *eax, unsigned int *ebx,
+		      unsigned int *ecx, unsigned int *edx)
+{
+	unsigned maskedx = ~0;
+	if (*eax == 1)
+		maskedx = ~(1 << X86_FEATURE_APIC);
+
+	asm(XEN_EMULATE_PREFIX "cpuid"
+		: "=a" (*eax),
+		  "=b" (*ebx),
+		  "=c" (*ecx),
+		  "=d" (*edx)
+		: "0" (*eax), "2" (*ecx));
+	*edx &= maskedx;
+}
+
+static void xen_set_debugreg(int reg, unsigned long val)
+{
+	HYPERVISOR_set_debugreg(reg, val);
+}
+
+static unsigned long xen_get_debugreg(int reg)
+{
+	return HYPERVISOR_get_debugreg(reg);
+}
+
+static unsigned long xen_save_fl(void)
+{
+	struct vcpu_info *vcpu;
+	unsigned long flags;
+
+	preempt_disable();
+	vcpu = read_pda(xen.vcpu);
+	/* flag has opposite sense of mask */
+	flags = !vcpu->evtchn_upcall_mask;
+	preempt_enable();
+
+	/* convert to IF type flag
+	   -0 -> 0x00000000
+	   -1 -> 0xffffffff
+	*/
+	return (-flags) & X86_EFLAGS_IF;
+}
+
+static void xen_restore_fl(unsigned long flags)
+{
+	struct vcpu_info *vcpu;
+
+	preempt_disable();
+
+	/* convert from IF type flag */
+	flags = !(flags & X86_EFLAGS_IF);
+	vcpu = read_pda(xen.vcpu);
+	vcpu->evtchn_upcall_mask = flags;
+	if (flags == 0) {
+		barrier(); /* unmask then check (avoid races) */
+		if (unlikely(vcpu->evtchn_upcall_pending))
+			force_evtchn_callback();
+		preempt_enable();
+	} else
+		preempt_enable_no_resched();
+}
+
+static void xen_irq_disable(void)
+{
+	struct vcpu_info *vcpu;
+	preempt_disable();
+	vcpu = read_pda(xen.vcpu);
+	vcpu->evtchn_upcall_mask = 1;
+	preempt_enable_no_resched();
+}
+
+static void xen_irq_enable(void)
+{
+	struct vcpu_info *vcpu;
+
+	preempt_disable();
+	vcpu = read_pda(xen.vcpu);
+	vcpu->evtchn_upcall_mask = 0;
+	barrier(); /* unmask then check (avoid races) */
+	if (unlikely(vcpu->evtchn_upcall_pending))
+		force_evtchn_callback();
+	preempt_enable();
+}
+
+static void xen_safe_halt(void)
+{
+	stop_hz_timer();
+	/* Blocking includes an implicit local_irq_enable(). */
+	if (HYPERVISOR_sched_op(SCHEDOP_block, 0) != 0)
+		BUG();
+	start_hz_timer();
+}
+
+static void xen_halt(void)
+{
+#if 0
+	if (irqs_disabled())
+		HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL);
+#endif
+}
+
+static void xen_set_lazy_mode(int mode)
+{
+	unsigned *lazy = &get_cpu_var(xen_lazy_mode);
+
+	if (xen_mc_flush())
+		BUG();
+
+	*lazy = mode;
+
+	put_cpu_var(xen_lazy_mode);
+}
+
+static unsigned long xen_store_tr(void)
+{
+	return 0;
+}
+
+static void xen_set_ldt(const void *addr, unsigned entries)
+{
+	struct mmuext_op *op;
+	struct multicall_space mcs = xen_mc_entry(sizeof(*op));
+
+	op = mcs.args;
+	op->cmd = MMUEXT_SET_LDT;
+	op->arg1.linear_addr = (unsigned long)addr;
+	if (addr)
+		/* ldt my be vmalloced, use arbitrary_virt_to_machine */
+		op->arg1.linear_addr = arbitrary_virt_to_machine((unsigned long)addr).maddr;
+	op->arg2.nr_ents = entries;
+
+	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+
+	xen_mc_issue();
+}
+
+static void xen_load_gdt(const struct Xgt_desc_struct *dtr)
+{
+	unsigned long va;
+	int f;
+	unsigned size = dtr->size + 1;
+	unsigned long frames[16];
+
+	BUG_ON(size > 16*PAGE_SIZE);
+
+	for (va = dtr->address, f = 0;
+	     va < dtr->address + size;
+	     va += PAGE_SIZE, f++) {
+		frames[f] = virt_to_mfn(va);
+		make_lowmem_page_readonly((void *)va);
+	}
+
+	/* This is used very early, so we can't rely on per-cpu data
+	   being set up, so no multicalls */
+	if (HYPERVISOR_set_gdt(frames, size/8))
+		BUG();
+}
+
+static void load_TLS_descriptor(struct thread_struct *t,
+				unsigned int cpu, unsigned int i)
+{
+	xmaddr_t maddr = virt_to_machine(&get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN+i]);
+	struct multicall_space mc = xen_mc_entry(0);
+
+	MULTI_update_descriptor(mc.mc, maddr.maddr, t->tls_array[i]);
+}
+
+static void xen_load_tls(struct thread_struct *t, unsigned int cpu)
+{
+	load_TLS_descriptor(t, cpu, 0);
+	load_TLS_descriptor(t, cpu, 1);
+	load_TLS_descriptor(t, cpu, 2);
+
+	xen_mc_issue();
+}
+
+static void xen_write_ldt_entry(void *dt, int entrynum, u32 low, u32 high)
+{
+	unsigned long lp = (unsigned long)dt + entrynum * 8;
+	xmaddr_t mach_lp = virt_to_machine(lp);
+	u64 entry = (u64)high << 32 | low;
+
+	xen_mc_flush();
+	if (HYPERVISOR_update_descriptor(mach_lp.maddr, entry))
+		BUG();
+}
+
+static int cvt_gate_to_trap(int vector, u32 low, u32 high, struct trap_info *info)
+{
+	u8 type, dpl;
+
+	type = (high >> 8) & 0x1f;
+	dpl = (high >> 13) & 3;
+
+	if (type != 0xf && type != 0xe)
+		return 0;
+
+	info->vector = vector;
+	info->address = (high & 0xffff0000) | (low & 0x0000ffff);
+	info->cs = low >> 16;
+	info->flags = dpl;
+	/* interrupt gates clear IF */
+	if (type == 0xe)
+		info->flags |= 4;
+
+	return 1;
+}
+
+/* Locations of each CPU's IDT */
+static DEFINE_PER_CPU(struct Xgt_desc_struct, idt_desc);
+
+/* Set an IDT entry.  If the entry is part of the current IDT, then
+   also update Xen. */
+static void xen_write_idt_entry(void *dt, int entrynum, u32 low, u32 high)
+{
+
+	int cpu = smp_processor_id();
+	unsigned long p = (unsigned long)dt + entrynum * 8;
+	unsigned long start = per_cpu(idt_desc, cpu).address;
+	unsigned long end = start + per_cpu(idt_desc, cpu).size + 1;
+
+	xen_mc_flush();
+
+	native_write_idt_entry(dt, entrynum, low, high);
+
+	if (p >= start && (p + 8) <= end) {
+		struct trap_info info[2];
+
+		info[1].address = 0;
+
+		if (cvt_gate_to_trap(entrynum, low, high, &info[0]))
+			if (HYPERVISOR_set_trap_table(info))
+				BUG();
+	}
+}
+
+/* Load a new IDT into Xen.  In principle this can be per-CPU, so we
+   hold a spinlock to protect the static traps[] array (static because
+   it avoids allocation, and saves stack space). */
+static void xen_load_idt(const struct Xgt_desc_struct *desc)
+{
+	static DEFINE_SPINLOCK(lock);
+	static struct trap_info traps[257];
+
+	int cpu = smp_processor_id();
+	unsigned in, out, count;
+
+	per_cpu(idt_desc, cpu) = *desc;
+
+	count = desc->size / 8;
+	BUG_ON(count > 256);
+
+	spin_lock(&lock);
+	for(in = out = 0; in < count; in++) {
+		const u32 *entry = (u32 *)(desc->address + in * 8);
+
+		if (cvt_gate_to_trap(in, entry[0], entry[1], &traps[out]))
+			out++;
+	}
+	traps[out].address = 0;
+
+	xen_mc_flush();
+	if (HYPERVISOR_set_trap_table(traps))
+		BUG();
+
+	spin_unlock(&lock);
+}
+
+/* Write a GDT descriptor entry.  Ignore LDT descriptors, since
+   they're handled differently. */
+static void xen_write_gdt_entry(void *dt, int entry, u32 low, u32 high)
+{
+	switch ((high >> 8) & 0xff) {
+	case DESCTYPE_LDT:
+	case DESCTYPE_TSS:
+		/* ignore */
+		break;
+
+	default:
+		xen_mc_flush();
+		if (HYPERVISOR_update_descriptor(virt_to_machine(dt + entry*8).maddr,
+						 (u64)high << 32 | low))
+			BUG();
+	}
+}
+
+static void xen_load_esp0(struct tss_struct *tss,
+				   struct thread_struct *thread)
+{
+	if (xen_get_lazy_mode() != PARAVIRT_LAZY_CPU) {
+		if (HYPERVISOR_stack_switch(__KERNEL_DS, thread->esp0))
+			BUG();
+	} else {
+		struct multicall_space mcs = xen_mc_entry(0);
+		MULTI_stack_switch(mcs.mc, __KERNEL_DS, thread->esp0);
+	}
+}
+
+static void xen_set_iopl_mask(unsigned mask)
+{
+#if 0
+	struct physdev_set_iopl set_iopl;
+
+	/* Force the change at ring 0. */
+	set_iopl.iopl = (mask == 0) ? 1 : (mask >> 12) & 3;
+	HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl);
+#endif
+}
+
+static void xen_io_delay(void)
+{
+}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+static unsigned long xen_apic_read(unsigned long reg)
+{
+	return 0;
+}
+#endif
+
+static void xen_flush_tlb(void)
+{
+	struct mmuext_op *op;
+	struct multicall_space mcs = xen_mc_entry(sizeof(*op));
+
+	op = mcs.args;
+	op->cmd = MMUEXT_TLB_FLUSH_LOCAL;
+	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+
+	xen_mc_issue();
+}
+
+static void xen_flush_tlb_global(void)
+{
+	struct mmuext_op *op;
+	struct multicall_space mcs = xen_mc_entry(sizeof(*op));
+
+	op = mcs.args;
+	op->cmd = MMUEXT_TLB_FLUSH_ALL;
+	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+
+	xen_mc_issue();
+}
+
+static void xen_flush_tlb_single(u32 addr)
+{
+	struct mmuext_op *op;
+	struct multicall_space mcs = xen_mc_entry(sizeof(*op));
+
+	op = mcs.args;
+	op->cmd = MMUEXT_INVLPG_LOCAL;
+	op->arg1.linear_addr = addr & PAGE_MASK;
+	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+
+	xen_mc_issue();
+}
+
+static unsigned long xen_read_cr2(void)
+{
+	return read_pda(xen.vcpu)->arch.cr2;
+}
+
+static void xen_write_cr4(unsigned long cr4)
+{
+	/* never allow TSC to be disabled */
+	native_write_cr4(cr4 & ~X86_CR4_TSD);
+}
+
+/*
+ * Page-directory addresses above 4GB do not fit into architectural %cr3.
+ * When accessing %cr3, or equivalent field in vcpu_guest_context, guests
+ * must use the following accessor macros to pack/unpack valid MFNs.
+ */
+#define xen_pfn_to_cr3(pfn) (((unsigned)(pfn) << 12) | ((unsigned)(pfn) >> 20))
+#define xen_cr3_to_pfn(cr3) (((unsigned)(cr3) >> 12) | ((unsigned)(cr3) << 20))
+
+static unsigned long xen_read_cr3(void)
+{
+	return read_pda(xen.cr3);
+}
+
+static void xen_write_cr3(unsigned long cr3)
+{
+	if (cr3 == read_pda(xen.cr3)) {
+		/* just a simple tlb flush */
+		xen_flush_tlb();
+		return;
+	}
+
+	write_pda(xen.cr3, cr3);
+
+
+	{
+		struct mmuext_op *op;
+		struct multicall_space mcs = xen_mc_entry(sizeof(*op));
+		unsigned long mfn = pfn_to_mfn(PFN_DOWN(cr3));
+
+		op = mcs.args;
+		op->cmd = MMUEXT_NEW_BASEPTR;
+		op->arg1.mfn = mfn;
+
+		MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+
+		xen_mc_issue();
+	}
+}
+
+static void xen_alloc_pt(u32 pfn)
+{
+	/* XXX pfn isn't necessarily a lowmem page */
+	make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
+}
+
+static void xen_alloc_pd(u32 pfn)
+{
+	make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
+}
+
+static void xen_release_pd(u32 pfn)
+{
+	make_lowmem_page_readwrite(__va(PFN_PHYS(pfn)));
+}
+
+static void xen_release_pt(u32 pfn)
+{
+	make_lowmem_page_readwrite(__va(PFN_PHYS(pfn)));
+}
+
+static void xen_alloc_pd_clone(u32 pfn, u32 clonepfn,
+					u32 start, u32 count)
+{
+	xen_alloc_pd(pfn);
+}
+
+static __init void xen_pagetable_setup_start(pgd_t *base)
+{
+	pgd_t *xen_pgd = (pgd_t *)xen_start_info->pt_base;
+
+	init_mm.pgd = base;
+
+	/* copy top-level of Xen-supplied pagetable into place.	 For
+	   !PAE we can use this as-is, but for PAE it is a stand-in
+	   while we copy the pmd pages. */
+	memcpy(base, xen_pgd, PTRS_PER_PGD * sizeof(pgd_t));
+
+	if (PTRS_PER_PMD > 1) {
+		int i;
+
+		/* For PAE, need to allocate new pmds, rather than
+		   share Xen's, since Xen doesn't like pmd's being
+		   shared between address spaces. */
+		for(i = 0; i < PTRS_PER_PGD; i++) {
+			if (pgd_val_ma(xen_pgd[i]) & _PAGE_PRESENT) {
+				pmd_t *pmd = (pmd_t *)alloc_bootmem_low_pages(PAGE_SIZE);
+
+				memcpy(pmd, (void *)pgd_page_vaddr(xen_pgd[i]),
+				       PAGE_SIZE);
+
+				xen_alloc_pd(PFN_DOWN(__pa(pmd)));
+
+				set_pgd(&base[i], __pgd(1 + __pa(pmd)));
+			} else
+				pgd_clear(&base[i]);
+		}
+	}
+
+	/* make sure the zero_page is mapped RO so we
+	   can use it in pagetables */
+	make_lowmem_page_readonly(empty_zero_page);
+	make_lowmem_page_readonly(base);
+
+	/* Switch to new pagetable.  This is done before
+	   pagetable_init has done anything so that the new pages
+	   added to the table can be prepared properly for Xen.	 */
+	xen_write_cr3(__pa(base));
+}
+
+static __init void xen_pagetable_setup_done(pgd_t *base)
+{
+	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+		/* Create a mapping for the shared info page.
+		   Should be set_fixmap(), but shared_info is a machine
+		   address with no corresponding pseudo-phys address. */
+		set_pte_mfn(fix_to_virt(FIX_PARAVIRT_BOOTMAP),
+			    PFN_DOWN(xen_start_info->shared_info),
+			    PAGE_KERNEL);
+
+		HYPERVISOR_shared_info =
+			(struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+
+	} else
+		HYPERVISOR_shared_info =
+			(struct shared_info *)__va(xen_start_info->shared_info);
+
+	xen_pgd_pin(base);
+
+	write_pda(xen.vcpu, &HYPERVISOR_shared_info->vcpu_info[smp_processor_id()]);
+}
+
+static const struct paravirt_ops xen_paravirt_ops __initdata = {
+	.paravirt_enabled = 1,
+	.shared_kernel_pmd = 0,
+
+	.name = "Xen",
+	.banner = xen_banner,
+
+	.patch = paravirt_patch_default,
+
+	.memory_setup = xen_memory_setup,
+	.arch_setup = xen_arch_setup,
+	.init_IRQ = xen_init_IRQ,
+	.time_init = xen_time_init,
+	.init_pda = xen_init_pda,
+
+	.cpuid = xen_cpuid,
+
+	.set_debugreg = xen_set_debugreg,
+	.get_debugreg = xen_get_debugreg,
+
+	.clts = native_clts,
+
+	.read_cr0 = native_read_cr0,
+	.write_cr0 = native_write_cr0,
+
+	.read_cr2 = xen_read_cr2,
+	.write_cr2 = native_write_cr2,
+
+	.read_cr3 = xen_read_cr3,
+	.write_cr3 = xen_write_cr3,
+
+	.read_cr4 = native_read_cr4,
+	.read_cr4_safe = native_read_cr4_safe,
+	.write_cr4 = xen_write_cr4,
+
+	.save_fl = xen_save_fl,
+	.restore_fl = xen_restore_fl,
+	.irq_disable = xen_irq_disable,
+	.irq_enable = xen_irq_enable,
+	.safe_halt = xen_safe_halt,
+	.halt = xen_halt,
+	.wbinvd = native_wbinvd,
+
+	.read_msr = native_read_msr,
+	.write_msr = native_write_msr,
+	.read_tsc = native_read_tsc,
+	.read_pmc = native_read_pmc,
+
+	.iret = (void *)&hypercall_page[__HYPERVISOR_iret],
+	.irq_enable_sysexit = (void *)xen_sti_sysexit,
+
+	.load_tr_desc = paravirt_nop,
+	.set_ldt = xen_set_ldt,
+	.load_gdt = xen_load_gdt,
+	.load_idt = xen_load_idt,
+	.load_tls = xen_load_tls,
+
+	.store_gdt = native_store_gdt,
+	.store_idt = native_store_idt,
+	.store_tr = xen_store_tr,
+
+	.write_ldt_entry = xen_write_ldt_entry,
+	.write_gdt_entry = xen_write_gdt_entry,
+	.write_idt_entry = xen_write_idt_entry,
+	.load_esp0 = xen_load_esp0,
+
+	.set_iopl_mask = xen_set_iopl_mask,
+	.io_delay = xen_io_delay,
+	.const_udelay = __const_udelay,
+	.set_wallclock = xen_set_wallclock,
+	.get_wallclock = xen_get_wallclock,
+
+#ifdef CONFIG_X86_LOCAL_APIC
+	.apic_write = paravirt_nop,
+	.apic_write_atomic = paravirt_nop,
+	.apic_read = xen_apic_read,
+	.setup_boot_clock = paravirt_nop,
+	.setup_secondary_clock = paravirt_nop,
+#endif
+
+	.flush_tlb_user = xen_flush_tlb,
+	.flush_tlb_kernel = xen_flush_tlb_global,
+	.flush_tlb_single = xen_flush_tlb_single,
+
+	.pte_update = paravirt_nop,
+	.pte_update_defer = paravirt_nop,
+
+	.pagetable_setup_start = xen_pagetable_setup_start,
+	.pagetable_setup_done = xen_pagetable_setup_done,
+	.activate_mm = xen_activate_mm,
+	.dup_mmap = xen_dup_mmap,
+	.exit_mmap = xen_exit_mmap,
+
+	.set_pte = xen_set_pte,
+	.set_pte_at = xen_set_pte_at,
+	.set_pmd = xen_set_pmd,
+
+	.alloc_pt = xen_alloc_pt,
+	.alloc_pd = xen_alloc_pd,
+	.alloc_pd_clone = xen_alloc_pd_clone,
+	.release_pd = xen_release_pd,
+	.release_pt = xen_release_pt,
+
+	.pte_val = xen_pte_val,
+	.pgd_val = xen_pgd_val,
+
+	.make_pte = xen_make_pte,
+	.make_pgd = xen_make_pgd,
+
+	.ptep_get_and_clear = xen_ptep_get_and_clear,
+
+#ifdef CONFIG_X86_PAE
+	.set_pte_atomic = xen_set_pte_atomic,
+	.set_pte_present = xen_set_pte_at,
+	.set_pud = xen_set_pud,
+	.pte_clear = xen_pte_clear,
+	.pmd_clear = xen_pmd_clear,
+
+	.make_pmd = xen_make_pmd,
+	.pmd_val = xen_pmd_val,
+#endif	/* PAE */
+
+	.set_lazy_mode = xen_set_lazy_mode,
+	.startup_ipi_hook = paravirt_nop,
+};
+
+/* First C function to be called on Xen boot */
+static asmlinkage void __init xen_start_kernel(void)
+{
+	u32 low, high;
+	pgd_t *pgd;
+
+	if (!xen_start_info)
+		return;
+
+	BUG_ON(memcmp(xen_start_info->magic, "xen-3.0", 7) != 0);
+
+	/* Install Xen paravirt ops */
+	paravirt_ops = xen_paravirt_ops;
+
+	xen_setup_features();
+
+	/* Get mfn list */
+	if (!xen_feature(XENFEAT_auto_translated_physmap))
+		phys_to_machine_mapping = (unsigned long *)xen_start_info->mfn_list;
+
+	pgd = (pgd_t *)xen_start_info->pt_base;
+
+	init_pg_tables_end = __pa(pgd) + xen_start_info->nr_pt_frames*PAGE_SIZE;
+
+	/* set up the boot-time gdt and segments */
+	init_mm.pgd = pgd; /* use the Xen pagetables to start */
+
+	xen_load_gdt(&early_gdt_descr);
+
+	/* set up PDA descriptor */
+	pack_descriptor(&low, &high, (unsigned)&boot_pda, sizeof(boot_pda)-1,
+			0x80 | DESCTYPE_S | 0x02, 0);
+
+	/* Use hypercall directly, because xen_write_gdt_entry can't
+	   be used until batched multicalls work. */
+	if (HYPERVISOR_update_descriptor(virt_to_machine(cpu_gdt_table +
+							 GDT_ENTRY_PDA).maddr,
+					 (u64)high << 32 | low))
+		BUG();
+
+	/* set up %fs and init Xen parts of the PDA */
+	asm volatile("mov %0, %%fs" : : "r" (__KERNEL_PDA) : "memory");
+	xen_init_pda(&boot_pda, 0);
+	boot_pda.xen.cr3 = __pa(pgd);
+
+	paravirt_ops.kernel_rpl = xen_feature(XENFEAT_supervisor_mode_kernel) ? 0 : 1;
+
+	/* set the limit of our address space */
+	reserve_top_address(-HYPERVISOR_VIRT_START + 2 * PAGE_SIZE);
+
+	/* set up basic CPUID stuff */
+	cpu_detect(&new_cpu_data);
+	new_cpu_data.hard_math = 1;
+	identify_cpu(&new_cpu_data);
+
+	/* Poke various useful things into boot_params */
+	LOADER_TYPE = (9 << 4) | 0;
+	INITRD_START = xen_start_info->mod_start ? __pa(xen_start_info->mod_start) : 0;
+	INITRD_SIZE = xen_start_info->mod_len;
+
+	/* Start the world */
+	start_kernel();
+}
+
+paravirt_probe(xen_start_kernel);
===================================================================
--- /dev/null
+++ b/arch/i386/xen/events.c
@@ -0,0 +1,481 @@
+#include <linux/linkage.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <asm/ptrace.h>
+#include <asm/irq.h>
+#include <asm/sync_bitops.h>
+#include <asm/xen/hypercall.h>
+
+#include <xen/events.h>
+#include <xen/interface/xen.h>
+#include <xen/interface/event_channel.h>
+
+#include "xen-ops.h"
+
+/*
+ * This lock protects updates to the following mapping and reference-count
+ * arrays. The lock does not need to be acquired to read the mapping tables.
+ */
+static DEFINE_SPINLOCK(irq_mapping_update_lock);
+
+/* IRQ <-> VIRQ mapping. */
+static DEFINE_PER_CPU(int, virq_to_irq[NR_VIRQS]) = {[0 ... NR_VIRQS-1] = -1};
+
+/* Packed IRQ information: binding type, sub-type index, and event channel. */
+struct packed_irq
+{
+	unsigned short evtchn;
+	unsigned char index;
+	unsigned char type;
+};
+
+static struct packed_irq irq_info[NR_IRQS];
+
+/* Binding types. */
+enum { IRQT_UNBOUND, IRQT_PIRQ, IRQT_VIRQ, IRQT_IPI, IRQT_EVTCHN };
+
+/* Convenient shorthand for packed representation of an unbound IRQ. */
+#define IRQ_UNBOUND	mk_irq_info(IRQT_UNBOUND, 0, 0)
+
+static int evtchn_to_irq[NR_EVENT_CHANNELS] = {
+	[0 ... NR_EVENT_CHANNELS-1] = -1
+};
+static unsigned long cpu_evtchn_mask[NR_CPUS][NR_EVENT_CHANNELS/BITS_PER_LONG];
+static u8 cpu_evtchn[NR_EVENT_CHANNELS];
+
+/* Reference counts for bindings to IRQs. */
+static int irq_bindcount[NR_IRQS];
+
+/* Xen will never allocate port zero for any purpose. */
+#define VALID_EVTCHN(chn)	((chn) != 0)
+
+/*
+ * Force a proper event-channel callback from Xen after clearing the
+ * callback mask. We do this in a very simple manner, by making a call
+ * down into Xen. The pending flag will be checked by Xen on return.
+ */
+void force_evtchn_callback(void)
+{
+	(void)HYPERVISOR_xen_version(0, NULL);
+}
+EXPORT_SYMBOL_GPL(force_evtchn_callback);
+
+static struct irq_chip xen_dynamic_chip;
+
+/* Constructor for packed IRQ information. */
+static inline struct packed_irq mk_irq_info(u32 type, u32 index, u32 evtchn)
+{
+	return (struct packed_irq) { evtchn, index, type };
+}
+
+/*
+ * Accessors for packed IRQ information.
+ */
+static inline unsigned int evtchn_from_irq(int irq)
+{
+	return irq_info[irq].evtchn;
+}
+
+static inline unsigned int index_from_irq(int irq)
+{
+	return irq_info[irq].index;
+}
+
+static inline unsigned int type_from_irq(int irq)
+{
+	return irq_info[irq].type;
+}
+
+static inline unsigned long active_evtchns(unsigned int cpu,
+					   struct shared_info *sh,
+					   unsigned int idx)
+{
+	return (sh->evtchn_pending[idx] &
+		cpu_evtchn_mask[cpu][idx] &
+		~sh->evtchn_mask[idx]);
+}
+
+static void bind_evtchn_to_cpu(unsigned int chn, unsigned int cpu)
+{
+	int irq = evtchn_to_irq[chn];
+
+	BUG_ON(irq == -1);
+#ifdef CONFIG_SMP
+	irq_desc[irq].affinity = cpumask_of_cpu(cpu);
+#endif
+
+	__clear_bit(chn, cpu_evtchn_mask[cpu_evtchn[chn]]);
+	__set_bit(chn, cpu_evtchn_mask[cpu]);
+
+	cpu_evtchn[chn] = cpu;
+}
+
+static void init_evtchn_cpu_bindings(void)
+{
+#ifdef CONFIG_SMP
+	int i;
+	/* By default all event channels notify CPU#0. */
+	for (i = 0; i < NR_IRQS; i++)
+		irq_desc[i].affinity = cpumask_of_cpu(0);
+#endif
+
+	memset(cpu_evtchn, 0, sizeof(cpu_evtchn));
+	memset(cpu_evtchn_mask[0], ~0, sizeof(cpu_evtchn_mask[0]));
+}
+
+static inline unsigned int cpu_from_evtchn(unsigned int evtchn)
+{
+	return cpu_evtchn[evtchn];
+}
+
+static inline void clear_evtchn(int port)
+{
+	struct shared_info *s = HYPERVISOR_shared_info;
+	sync_clear_bit(port, &s->evtchn_pending[0]);
+}
+
+static inline void set_evtchn(int port)
+{
+	struct shared_info *s = HYPERVISOR_shared_info;
+	sync_set_bit(port, &s->evtchn_pending[0]);
+}
+
+
+/**
+ * notify_remote_via_irq - send event to remote end of event channel via irq
+ * @irq: irq of event channel to send event to
+ *
+ * Unlike notify_remote_via_evtchn(), this is safe to use across
+ * save/restore. Notifications on a broken connection are silently
+ * dropped.
+ */
+void notify_remote_via_irq(int irq)
+{
+	int evtchn = evtchn_from_irq(irq);
+
+	if (VALID_EVTCHN(evtchn))
+		notify_remote_via_evtchn(evtchn);
+}
+EXPORT_SYMBOL_GPL(notify_remote_via_irq);
+
+static void mask_evtchn(int port)
+{
+	struct shared_info *s = HYPERVISOR_shared_info;
+	sync_set_bit(port, &s->evtchn_mask[0]);
+}
+
+static void unmask_evtchn(int port)
+{
+	struct shared_info *s = HYPERVISOR_shared_info;
+	unsigned int cpu = smp_processor_id();
+	struct vcpu_info *vcpu_info = read_pda(xen.vcpu);
+
+	BUG_ON(!irqs_disabled());
+
+	/* Slow path (hypercall) if this is a non-local port. */
+	if (unlikely(cpu != cpu_from_evtchn(port))) {
+		struct evtchn_unmask unmask = { .port = port };
+		(void)HYPERVISOR_event_channel_op(EVTCHNOP_unmask, &unmask);
+		return;
+	}
+
+	sync_clear_bit(port, &s->evtchn_mask[0]);
+
+	/*
+	 * The following is basically the equivalent of 'hw_resend_irq'. Just
+	 * like a real IO-APIC we 'lose the interrupt edge' if the channel is
+	 * masked.
+	 */
+	if (sync_test_bit(port, &s->evtchn_pending[0]) &&
+	    !sync_test_and_set_bit(port / BITS_PER_LONG,
+				   &vcpu_info->evtchn_pending_sel))
+		vcpu_info->evtchn_upcall_pending = 1;
+}
+
+static int find_unbound_irq(void)
+{
+	int irq;
+
+	/* Only allocate from dynirq range */
+	for (irq = 0; irq < NR_IRQS; irq++)
+		if (irq_bindcount[irq] == 0)
+			break;
+
+	if (irq == NR_IRQS)
+		panic("No available IRQ to bind to: increase NR_IRQS!\n");
+
+	return irq;
+}
+
+static int bind_evtchn_to_irq(unsigned int evtchn)
+{
+	int irq;
+
+	spin_lock(&irq_mapping_update_lock);
+
+	irq = evtchn_to_irq[evtchn];
+
+	if (irq == -1) {
+		irq = find_unbound_irq();
+
+		dynamic_irq_init(irq);
+		set_irq_chip_and_handler(irq, &xen_dynamic_chip, handle_level_irq);
+
+		evtchn_to_irq[evtchn] = irq;
+		irq_info[irq] = mk_irq_info(IRQT_EVTCHN, 0, evtchn);
+	}
+
+	irq_bindcount[irq]++;
+
+	spin_unlock(&irq_mapping_update_lock);
+
+	return irq;
+}
+
+static int bind_virq_to_irq(unsigned int virq, unsigned int cpu)
+{
+	struct evtchn_bind_virq bind_virq;
+	int evtchn, irq;
+
+	spin_lock(&irq_mapping_update_lock);
+
+	irq = per_cpu(virq_to_irq, cpu)[virq];
+
+	if (irq == -1) {
+		bind_virq.virq = virq;
+		bind_virq.vcpu = cpu;
+		if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_virq,
+						&bind_virq) != 0)
+			BUG();
+		evtchn = bind_virq.port;
+
+		irq = find_unbound_irq();
+
+		dynamic_irq_init(irq);
+		set_irq_chip_and_handler(irq, &xen_dynamic_chip, handle_level_irq);
+
+		evtchn_to_irq[evtchn] = irq;
+		irq_info[irq] = mk_irq_info(IRQT_VIRQ, virq, evtchn);
+
+		per_cpu(virq_to_irq, cpu)[virq] = irq;
+
+		bind_evtchn_to_cpu(evtchn, cpu);
+	}
+
+	irq_bindcount[irq]++;
+
+	spin_unlock(&irq_mapping_update_lock);
+
+	return irq;
+}
+
+static void unbind_from_irq(unsigned int irq)
+{
+	struct evtchn_close close;
+	int evtchn = evtchn_from_irq(irq);
+
+	spin_lock(&irq_mapping_update_lock);
+
+	if (VALID_EVTCHN(evtchn) && (--irq_bindcount[irq] == 0)) {
+		close.port = evtchn;
+		if (HYPERVISOR_event_channel_op(EVTCHNOP_close, &close) != 0)
+			BUG();
+
+		switch (type_from_irq(irq)) {
+		case IRQT_VIRQ:
+			per_cpu(virq_to_irq, cpu_from_evtchn(evtchn))
+				[index_from_irq(irq)] = -1;
+			break;
+		default:
+			break;
+		}
+
+		/* Closed ports are implicitly re-bound to VCPU0. */
+		bind_evtchn_to_cpu(evtchn, 0);
+
+		evtchn_to_irq[evtchn] = -1;
+		irq_info[irq] = IRQ_UNBOUND;
+
+		dynamic_irq_init(irq);
+	}
+
+	spin_unlock(&irq_mapping_update_lock);
+}
+
+int bind_evtchn_to_irqhandler(unsigned int evtchn,
+			      irqreturn_t (*handler)(int, void *),
+			      unsigned long irqflags, const char *devname, void *dev_id)
+{
+	unsigned int irq;
+	int retval;
+
+	irq = bind_evtchn_to_irq(evtchn);
+	retval = request_irq(irq, handler, irqflags, devname, dev_id);
+	if (retval != 0) {
+		unbind_from_irq(irq);
+		return retval;
+	}
+
+	return irq;
+}
+EXPORT_SYMBOL_GPL(bind_evtchn_to_irqhandler);
+
+int bind_virq_to_irqhandler(unsigned int virq, unsigned int cpu,
+			    irqreturn_t (*handler)(int, void *),
+			    unsigned long irqflags, const char *devname, void *dev_id)
+{
+	unsigned int irq;
+	int retval;
+
+	irq = bind_virq_to_irq(virq, cpu);
+	retval = request_irq(irq, handler, irqflags, devname, dev_id);
+	if (retval != 0) {
+		unbind_from_irq(irq);
+		return retval;
+	}
+
+	return irq;
+}
+EXPORT_SYMBOL_GPL(bind_virq_to_irqhandler);
+
+void unbind_from_irqhandler(unsigned int irq, void *dev_id)
+{
+	free_irq(irq, dev_id);
+	unbind_from_irq(irq);
+}
+EXPORT_SYMBOL_GPL(unbind_from_irqhandler);
+
+/*
+  Search the CPUs pending events bitmasks.  For each one found, map
+  the event number to an irq, and feed it into do_IRQ() for
+  handling.
+
+  Xen uses a two-level bitmap to speed searching.  The first level is
+  a bitset of words which contain pending event bits.  The second
+  level is a bitset of pending events themselves.
+*/
+fastcall void xen_evtchn_do_upcall(struct pt_regs *regs)
+{
+	int cpu = smp_processor_id();
+	struct shared_info *s = HYPERVISOR_shared_info;
+	struct vcpu_info *vcpu_info = read_pda(xen.vcpu);
+	unsigned long pending_words;
+
+	vcpu_info->evtchn_upcall_pending = 0;
+
+	/* NB. No need for a barrier here -- XCHG is a barrier on x86. */
+	pending_words = xchg(&vcpu_info->evtchn_pending_sel, 0);
+	while (pending_words != 0) {
+		unsigned long pending_bits;
+		int word_idx = __ffs(pending_words);
+		pending_words &= ~(1UL << word_idx);
+
+		while ((pending_bits = active_evtchns(cpu, s, word_idx)) != 0) {
+			int bit_idx = __ffs(pending_bits);
+			int port = (word_idx * BITS_PER_LONG) + bit_idx;
+			int irq = evtchn_to_irq[port];
+
+			if (irq != -1) {
+				regs->orig_eax = ~irq;
+				do_IRQ(regs);
+			}
+		}
+	}
+}
+
+/* Rebind an evtchn so that it gets delivered to a specific cpu */
+static void rebind_irq_to_cpu(unsigned irq, unsigned tcpu)
+{
+	struct evtchn_bind_vcpu bind_vcpu;
+	int evtchn = evtchn_from_irq(irq);
+
+	if (!VALID_EVTCHN(evtchn))
+		return;
+
+	/* Send future instances of this interrupt to other vcpu. */
+	bind_vcpu.port = evtchn;
+	bind_vcpu.vcpu = tcpu;
+
+	/*
+	 * If this fails, it usually just indicates that we're dealing with a
+	 * virq or IPI channel, which don't actually need to be rebound. Ignore
+	 * it, but don't do the xenlinux-level rebind in that case.
+	 */
+	if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, &bind_vcpu) >= 0)
+		bind_evtchn_to_cpu(evtchn, tcpu);
+}
+
+
+static void set_affinity_irq(unsigned irq, cpumask_t dest)
+{
+	unsigned tcpu = first_cpu(dest);
+	rebind_irq_to_cpu(irq, tcpu);
+}
+
+static void enable_dynirq(unsigned int irq)
+{
+	int evtchn = evtchn_from_irq(irq);
+
+	if (VALID_EVTCHN(evtchn))
+		unmask_evtchn(evtchn);
+}
+
+static void disable_dynirq(unsigned int irq)
+{
+	int evtchn = evtchn_from_irq(irq);
+
+	if (VALID_EVTCHN(evtchn))
+		mask_evtchn(evtchn);
+}
+
+static void ack_dynirq(unsigned int irq)
+{
+	int evtchn = evtchn_from_irq(irq);
+
+	move_native_irq(irq);
+
+	if (VALID_EVTCHN(evtchn))
+		clear_evtchn(evtchn);
+}
+
+static int retrigger_dynirq(unsigned int irq)
+{
+	int evtchn = evtchn_from_irq(irq);
+	int ret = 0;
+
+	if (VALID_EVTCHN(evtchn)) {
+		set_evtchn(evtchn);
+		ret = 1;
+	}
+
+	return ret;
+}
+
+static struct irq_chip xen_dynamic_chip __read_mostly = {
+	.name		= "xen-virq",
+	.mask		= disable_dynirq,
+	.unmask		= enable_dynirq,
+	.ack		= ack_dynirq,
+	.set_affinity	= set_affinity_irq,
+	.retrigger	= retrigger_dynirq,
+};
+
+void __init xen_init_IRQ(void)
+{
+	int i;
+
+	init_evtchn_cpu_bindings();
+
+	/* No event channels are 'live' right now. */
+	for (i = 0; i < NR_EVENT_CHANNELS; i++)
+		mask_evtchn(i);
+
+	/* Dynamic IRQ space is currently unbound. Zero the refcnts. */
+	for (i = 0; i < NR_IRQS; i++)
+		irq_bindcount[i] = 0;
+
+	irq_ctx_init(smp_processor_id());
+}
===================================================================
--- /dev/null
+++ b/arch/i386/xen/features.c
@@ -0,0 +1,29 @@
+/******************************************************************************
+ * features.c
+ *
+ * Xen feature flags.
+ *
+ * Copyright (c) 2006, Ian Campbell, XenSource Inc.
+ */
+#include <linux/types.h>
+#include <linux/cache.h>
+#include <linux/module.h>
+#include <asm/xen/hypervisor.h>
+#include <xen/features.h>
+
+u8 xen_features[XENFEAT_NR_SUBMAPS * 32] __read_mostly;
+EXPORT_SYMBOL_GPL(xen_features);
+
+void xen_setup_features(void)
+{
+	struct xen_feature_info fi;
+	int i, j;
+
+	for (i = 0; i < XENFEAT_NR_SUBMAPS; i++) {
+		fi.submap_idx = i;
+		if (HYPERVISOR_xen_version(XENVER_get_features, &fi) < 0)
+			break;
+		for (j=0; j<32; j++)
+			xen_features[i*32+j] = !!(fi.submap & 1<<j);
+	}
+}
===================================================================
--- /dev/null
+++ b/arch/i386/xen/mmu.c
@@ -0,0 +1,411 @@
+//#include <linux/bug.h>
+#include <asm/bug.h>
+
+#include <asm/pgtable.h>
+#include <asm/tlbflush.h>
+#include <asm/mmu_context.h>
+
+#include <asm/xen/hypercall.h>
+#include <asm/paravirt.h>
+
+#include <xen/page.h>
+#include <xen/interface/xen.h>
+
+#include "mmu.h"
+
+xmaddr_t arbitrary_virt_to_machine(unsigned long address)
+{
+	pte_t *pte = lookup_address(address);
+	unsigned offset = address & PAGE_MASK;
+
+	BUG_ON(pte == NULL);
+
+	return XMADDR((pte_mfn(*pte) << PAGE_SHIFT) + offset);
+}
+
+void make_lowmem_page_readonly(void *vaddr)
+{
+	pte_t *pte, ptev;
+	unsigned long address = (unsigned long)vaddr;
+
+	pte = lookup_address(address);
+	BUG_ON(pte == NULL);
+
+	ptev = pte_wrprotect(*pte);
+
+	if (xen_feature(XENFEAT_writable_page_tables))
+		*pte = ptev;
+	else
+		if(HYPERVISOR_update_va_mapping(address, ptev, 0))
+			BUG();
+}
+
+void make_lowmem_page_readwrite(void *vaddr)
+{
+	pte_t *pte, ptev;
+	unsigned long address = (unsigned long)vaddr;
+
+	pte = lookup_address(address);
+	BUG_ON(pte == NULL);
+
+	ptev = pte_mkwrite(*pte);
+
+	if (xen_feature(XENFEAT_writable_page_tables))
+		*pte = ptev;
+	else
+		if(HYPERVISOR_update_va_mapping(address, ptev, 0))
+			BUG();
+}
+
+
+void xen_set_pte(pte_t *ptep, pte_t pte)
+{
+#if 1
+	struct mmu_update u;
+
+	u.ptr = virt_to_machine(ptep).maddr;
+	u.val = pte_val_ma(pte);
+	if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
+		BUG();
+#else
+	ptep->pte_high = pte.pte_high;
+	smp_wmb();
+	ptep->pte_low = pte.pte_low;
+#endif
+}
+
+void xen_set_pmd(pmd_t *ptr, pmd_t val)
+{
+	struct mmu_update u;
+
+	u.ptr = virt_to_machine(ptr).maddr;
+	u.val = pmd_val_ma(val);
+	if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
+		BUG();
+}
+
+#ifdef CONFIG_X86_PAE
+void xen_set_pud(pud_t *ptr, pud_t val)
+{
+	struct mmu_update u;
+
+	u.ptr = virt_to_machine(ptr).maddr;
+	u.val = pud_val_ma(val);
+	if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
+		BUG();
+}
+#endif
+
+/*
+ * Associate a virtual page frame with a given physical page frame
+ * and protection flags for that frame.
+ */
+void set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = swapper_pg_dir + pgd_index(vaddr);
+	if (pgd_none(*pgd)) {
+		BUG();
+		return;
+	}
+	pud = pud_offset(pgd, vaddr);
+	if (pud_none(*pud)) {
+		BUG();
+		return;
+	}
+	pmd = pmd_offset(pud, vaddr);
+	if (pmd_none(*pmd)) {
+		BUG();
+		return;
+	}
+	pte = pte_offset_kernel(pmd, vaddr);
+	/* <mfn,flags> stored as-is, to permit clearing entries */
+	xen_set_pte(pte, mfn_pte(mfn, flags));
+
+	/*
+	 * It's enough to flush this one mapping.
+	 * (PGE mappings get flushed as well)
+	 */
+	__flush_tlb_one(vaddr);
+}
+
+void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
+		    pte_t *ptep, pte_t pteval)
+{
+	if ((mm != current->mm && mm != &init_mm) ||
+	    HYPERVISOR_update_va_mapping(addr, pteval, 0) != 0)
+		xen_set_pte(ptep, pteval);
+}
+
+#ifdef CONFIG_X86_PAE
+void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
+{
+	set_64bit((u64 *)ptep, pte_val_ma(pte));
+}
+
+void xen_pte_clear(struct mm_struct *mm, unsigned long addr,pte_t *ptep)
+{
+#if 1
+	ptep->pte_low = 0;
+	smp_wmb();
+	ptep->pte_high = 0;
+#else
+	set_64bit((u64 *)ptep, 0);
+#endif
+}
+
+void xen_pmd_clear(pmd_t *pmdp)
+{
+	xen_set_pmd(pmdp, __pmd(0));
+}
+
+unsigned long long xen_pte_val(pte_t pte)
+{
+	unsigned long long ret = 0;
+
+	if (pte.pte_low) {
+		ret = ((unsigned long long)pte.pte_high << 32) | pte.pte_low;
+		ret = machine_to_phys(XMADDR(ret)).paddr | 1;
+	}
+
+	return ret;
+}
+
+unsigned long long xen_pmd_val(pmd_t pmd)
+{
+	unsigned long long ret = pmd.pmd;
+	if (ret)
+		ret = machine_to_phys(XMADDR(ret)).paddr | 1;
+	return ret;
+}
+
+unsigned long long xen_pgd_val(pgd_t pgd)
+{
+	unsigned long long ret = pgd.pgd;
+	if (ret)
+		ret = machine_to_phys(XMADDR(ret)).paddr | 1;
+	return ret;
+}
+
+pte_t xen_make_pte(unsigned long long pte)
+{
+	if (pte & 1)
+		pte = phys_to_machine(XPADDR(pte)).maddr;
+
+	return (pte_t){ pte, pte >> 32 };
+}
+
+pmd_t xen_make_pmd(unsigned long long pmd)
+{
+	if (pmd & 1)
+		pmd = phys_to_machine(XPADDR(pmd)).maddr;
+
+	return (pmd_t){ pmd };
+}
+
+pgd_t xen_make_pgd(unsigned long long pgd)
+{
+	if (pgd & _PAGE_PRESENT)
+		pgd = phys_to_machine(XPADDR(pgd)).maddr;
+
+	return (pgd_t){ pgd };
+}
+
+pte_t xen_ptep_get_and_clear(pte_t *ptep)
+{
+	pte_t res;
+
+	/* xchg acts as a barrier before the setting of the high bits */
+	res.pte_low = xchg(&ptep->pte_low, 0);
+	res.pte_high = ptep->pte_high;
+	ptep->pte_high = 0;
+
+	return res;
+}
+#else  /* !PAE */
+unsigned long xen_pte_val(pte_t pte)
+{
+	unsigned long ret = pte.pte_low;
+
+	if (ret & _PAGE_PRESENT)
+		ret = machine_to_phys(XMADDR(ret)).paddr;
+
+	return ret;
+}
+
+unsigned long xen_pmd_val(pmd_t pmd)
+{
+	BUG();
+	return 0;
+}
+
+unsigned long xen_pgd_val(pgd_t pgd)
+{
+	unsigned long ret = pgd.pgd;
+	if (ret)
+		ret = machine_to_phys(XMADDR(ret)).paddr | 1;
+	return ret;
+}
+
+pte_t xen_make_pte(unsigned long pte)
+{
+	if (pte & _PAGE_PRESENT)
+		pte = phys_to_machine(XPADDR(pte)).maddr;
+
+	return (pte_t){ pte };
+}
+
+pmd_t xen_make_pmd(unsigned long pmd)
+{
+	BUG();
+	return __pmd(0);
+}
+
+pgd_t xen_make_pgd(unsigned long pgd)
+{
+	if (pgd & _PAGE_PRESENT)
+		pgd = phys_to_machine(XPADDR(pgd)).maddr;
+
+	return (pgd_t){ pgd };
+}
+
+pte_t xen_ptep_get_and_clear(pte_t *ptep)
+{
+	return __pte_ma(xchg(&(ptep)->pte_low, 0));
+}
+#endif	/* CONFIG_X86_PAE */
+
+
+
+static void pgd_walk_set_prot(void *pt, pgprot_t flags)
+{
+	unsigned long pfn = PFN_DOWN(__pa(pt));
+
+	if (HYPERVISOR_update_va_mapping((unsigned long)pt,
+					 pfn_pte(pfn, flags), 0) < 0)
+		BUG();
+}
+
+static void pgd_walk(pgd_t *pgd_base, pgprot_t flags)
+{
+	pgd_t *pgd = pgd_base;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int    g, u, m;
+
+	if (xen_feature(XENFEAT_auto_translated_physmap))
+		return;
+
+	for (g = 0; g < USER_PTRS_PER_PGD; g++, pgd++) {
+		if (pgd_none(*pgd))
+			continue;
+		pud = pud_offset(pgd, 0);
+
+		if (PTRS_PER_PUD > 1) /* not folded */
+			pgd_walk_set_prot(pud,flags);
+
+		for (u = 0; u < PTRS_PER_PUD; u++, pud++) {
+			if (pud_none(*pud))
+				continue;
+			pmd = pmd_offset(pud, 0);
+
+			if (PTRS_PER_PMD > 1) /* not folded */
+				pgd_walk_set_prot(pmd,flags);
+
+			for (m = 0; m < PTRS_PER_PMD; m++, pmd++) {
+				if (pmd_none(*pmd))
+					continue;
+
+				/* This can get called before mem_map
+				   is set up, so we assume nothing is
+				   highmem at that point. */
+				if (mem_map == NULL ||
+				    !PageHighMem(pmd_page(*pmd))) {
+					pte = pte_offset_kernel(pmd,0);
+					pgd_walk_set_prot(pte,flags);
+				}
+			}
+		}
+	}
+
+	if (HYPERVISOR_update_va_mapping((unsigned long)pgd_base,
+					 pfn_pte(PFN_DOWN(__pa(pgd_base)),
+						 flags),
+					 UVMF_TLB_FLUSH) < 0)
+		BUG();
+}
+
+
+/* This is called just after a mm has been duplicated from its parent,
+   but it has not been used yet.  We need to make sure that its
+   pagetable is all read-only, and can be pinned. */
+void xen_pgd_pin(pgd_t *pgd)
+{
+	struct mmuext_op op;
+
+	pgd_walk(pgd, PAGE_KERNEL_RO);
+
+#if defined(CONFIG_X86_PAE)
+	op.cmd = MMUEXT_PIN_L3_TABLE;
+#else
+	op.cmd = MMUEXT_PIN_L2_TABLE;
+#endif
+	op.arg1.mfn = pfn_to_mfn(PFN_DOWN(__pa(pgd)));
+	if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF) < 0)
+		BUG();
+}
+
+/* Release a pagetables pages back as normal RW */
+void xen_pgd_unpin(pgd_t *pgd)
+{
+	struct mmuext_op op;
+
+	op.cmd = MMUEXT_UNPIN_TABLE;
+	op.arg1.mfn = pfn_to_mfn(PFN_DOWN(__pa(pgd)));
+
+	if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF) < 0)
+		BUG();
+
+	pgd_walk(pgd, PAGE_KERNEL);
+}
+
+
+void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
+{
+	xen_pgd_pin(next->pgd);
+}
+
+void xen_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+	xen_pgd_pin(mm->pgd);
+}
+
+void xen_exit_mmap(struct mm_struct *mm)
+{
+	struct task_struct *tsk = current;
+
+	task_lock(tsk);
+
+	/*
+	 * We aggressively remove defunct pgd from cr3. We execute unmap_vmas()
+	 * *much* faster this way, as no tlb flushes means bigger wrpt batches.
+	 */
+	if (tsk->active_mm == mm) {
+		tsk->active_mm = &init_mm;
+		atomic_inc(&init_mm.mm_count);
+
+		switch_mm(mm, &init_mm, tsk);
+
+		atomic_dec(&mm->mm_count);
+		BUG_ON(atomic_read(&mm->mm_count) == 0);
+	}
+
+	task_unlock(tsk);
+
+	xen_pgd_unpin(mm->pgd);
+}
===================================================================
--- /dev/null
+++ b/arch/i386/xen/mmu.h
@@ -0,0 +1,49 @@
+#ifndef _XEN_MMU_H
+
+#include <linux/linkage.h>
+#include <asm/page.h>
+
+void set_pte_mfn(unsigned long vaddr, unsigned long pfn, pgprot_t flags);
+
+void xen_set_pte(pte_t *ptep, pte_t pteval);
+void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
+		    pte_t *ptep, pte_t pteval);
+void xen_set_pmd(pmd_t *pmdp, pmd_t pmdval);
+
+void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next);
+void xen_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm);
+void xen_exit_mmap(struct mm_struct *mm);
+
+pte_t xen_ptep_get_and_clear(pte_t *ptep);
+
+void xen_pgd_pin(pgd_t *pgd);
+void xen_pgd_unpin(pgd_t *pgd);
+
+#ifdef CONFIG_X86_PAE
+unsigned long long xen_pte_val(pte_t);
+unsigned long long xen_pmd_val(pmd_t);
+unsigned long long xen_pgd_val(pgd_t);
+
+pte_t xen_make_pte(unsigned long long);
+pmd_t xen_make_pmd(unsigned long long);
+pgd_t xen_make_pgd(unsigned long long);
+
+void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
+		    pte_t *ptep, pte_t pteval);
+void xen_set_pte_atomic(pte_t *ptep, pte_t pte);
+void xen_set_pud(pud_t *ptr, pud_t val);
+void xen_pte_clear(struct mm_struct *mm, unsigned long addr,pte_t *ptep);
+void xen_pmd_clear(pmd_t *pmdp);
+
+
+#else
+unsigned long xen_pte_val(pte_t);
+unsigned long xen_pmd_val(pmd_t);
+unsigned long xen_pgd_val(pgd_t);
+
+pte_t xen_make_pte(unsigned long);
+pmd_t xen_make_pmd(unsigned long);
+pgd_t xen_make_pgd(unsigned long);
+#endif
+
+#endif	/* _XEN_MMU_H */
===================================================================
--- /dev/null
+++ b/arch/i386/xen/multicalls.c
@@ -0,0 +1,62 @@
+#include <linux/percpu.h>
+
+#include <asm/xen/hypercall.h>
+
+#include "multicalls.h"
+
+#define MC_BATCH	8
+#define MC_ARGS		(MC_BATCH * 32 / sizeof(u64))
+
+struct mc_buffer {
+	struct multicall_entry entries[MC_BATCH];
+	u64 args[MC_ARGS];
+	unsigned mcidx, argidx;
+};
+
+static DEFINE_PER_CPU(struct mc_buffer, mc_buffer);
+
+int xen_mc_flush(void)
+{
+	struct mc_buffer *b = &get_cpu_var(mc_buffer);
+	int ret = 0;
+
+	if (b->mcidx) {
+		int i;
+
+		if (HYPERVISOR_multicall(b->entries, b->mcidx) != 0)
+			BUG();
+		for(i = 0; i < b->mcidx; i++)
+			if (b->entries[i].result < 0)
+				ret++;
+		b->mcidx = 0;
+		b->argidx = 0;
+	} else
+		BUG_ON(b->argidx != 0);
+
+	put_cpu_var(mc_buffer);
+
+	return ret;
+}
+
+struct multicall_space xen_mc_entry(size_t args)
+{
+	struct mc_buffer *b = &get_cpu_var(mc_buffer);
+	struct multicall_space ret;
+	unsigned argspace = (args + sizeof(u64) - 1) / sizeof(u64);
+
+	BUG_ON(argspace > MC_ARGS);
+
+	if (b->mcidx == MC_BATCH ||
+	    (b->argidx + argspace) > MC_ARGS)
+		if (xen_mc_flush())
+			BUG();
+
+	ret.mc = &b->entries[b->mcidx];
+	b->mcidx++;
+	ret.args = &b->args[b->argidx];
+	b->argidx += argspace;
+
+	put_cpu_var(mc_buffer);
+
+	return ret;
+}
===================================================================
--- /dev/null
+++ b/arch/i386/xen/multicalls.h
@@ -0,0 +1,27 @@
+#ifndef _XEN_MULTICALLS_H
+#define _XEN_MULTICALLS_H
+
+#include "xen-ops.h"
+
+/* Multicalls */
+struct multicall_space
+{
+	struct multicall_entry *mc;
+	void *args;
+};
+
+/* Allocate room for a multicall and its args */
+struct multicall_space xen_mc_entry(size_t args);
+
+/* Flush all pending multicalls */
+int xen_mc_flush(void);
+
+/* Issue a multicall if we're not in lazy mode */
+static inline void xen_mc_issue(void)
+{
+	if (xen_get_lazy_mode() == PARAVIRT_LAZY_NONE)
+		if (xen_mc_flush())
+			BUG();
+}
+
+#endif /* _XEN_MULTICALLS_H */
===================================================================
--- /dev/null
+++ b/arch/i386/xen/setup.c
@@ -0,0 +1,95 @@
+/*
+ *	Machine specific setup for xen
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+
+#include <asm/e820.h>
+#include <asm/setup.h>
+#include <asm/xen/hypervisor.h>
+#include <asm/xen/hypercall.h>
+#include <asm/pda.h>
+
+#include <xen/interface/physdev.h>
+#include <xen/features.h>
+
+#include "xen-ops.h"
+
+/* These are code, but not functions.  Defined in entry.S */
+extern const char xen_hypervisor_callback[];
+extern const char xen_failsafe_callback[];
+
+static __initdata struct shared_info init_shared;
+
+/*
+ * Point at some empty memory to start with. We map the real shared_info
+ * page as soon as fixmap is up and running.
+ */
+struct shared_info *HYPERVISOR_shared_info = &init_shared;
+
+unsigned long *phys_to_machine_mapping;
+EXPORT_SYMBOL(phys_to_machine_mapping);
+
+/**
+ * machine_specific_memory_setup - Hook for machine specific memory setup.
+ **/
+
+char * __init xen_memory_setup(void)
+{
+	unsigned long max_pfn = xen_start_info->nr_pages;
+
+	e820.nr_map = 0;
+	add_memory_region(0, PFN_PHYS(max_pfn), E820_RAM);
+
+	return "Xen";
+}
+
+static void xen_idle(void)
+{
+	local_irq_disable();
+
+	if (need_resched())
+		local_irq_enable();
+	else {
+		current_thread_info()->status &= ~TS_POLLING;
+		smp_mb__after_clear_bit();
+		safe_halt();
+		current_thread_info()->status |= TS_POLLING;
+	}
+}
+
+void __init xen_arch_setup(void)
+{
+	struct physdevop_set_iopl set_iopl;
+	int rc;
+
+	HYPERVISOR_vm_assist(VMASST_CMD_enable, VMASST_TYPE_4gb_segments);
+	HYPERVISOR_vm_assist(VMASST_CMD_enable, VMASST_TYPE_writable_pagetables);
+
+	if (!xen_feature(XENFEAT_auto_translated_physmap))
+		HYPERVISOR_vm_assist(VMASST_CMD_enable, VMASST_TYPE_pae_extended_cr3);
+
+	HYPERVISOR_set_callbacks(__KERNEL_CS, (unsigned long)xen_hypervisor_callback,
+				 __KERNEL_CS, (unsigned long)xen_failsafe_callback);
+
+	set_iopl.iopl = 1;
+	rc = HYPERVISOR_physdev_op(PHYSDEVOP_SET_IOPL, &set_iopl);
+	if (rc != 0)
+		printk(KERN_INFO "physdev_op failed %d\n", rc);
+
+#ifdef CONFIG_ACPI
+	if (!(xen_start_info->flags & SIF_INITDOMAIN)) {
+		printk(KERN_INFO "ACPI in unprivileged domain disabled\n");
+		disable_acpi();
+	}
+#endif
+
+	memcpy(boot_command_line, xen_start_info->cmd_line,
+	       MAX_GUEST_CMDLINE > COMMAND_LINE_SIZE ?
+	       COMMAND_LINE_SIZE : MAX_GUEST_CMDLINE);
+
+	pm_idle = xen_idle;
+
+	vdso_enabled = 1;	/* enable by default */
+}
===================================================================
--- /dev/null
+++ b/arch/i386/xen/time.c
@@ -0,0 +1,446 @@
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/kernel_stat.h>
+#include <linux/clocksource.h>
+
+#include <asm/xen/hypercall.h>
+#include <asm/arch_hooks.h>
+
+#include <xen/events.h>
+#include <xen/interface/xen.h>
+#include <xen/interface/vcpu.h>
+
+#include "xen-ops.h"
+
+#define XEN_SHIFT 22
+
+/* Permitted clock jitter, in nsecs, beyond which a warning will be printed. */
+static unsigned long permitted_clock_jitter = 10000000UL; /* 10ms */
+static int __init __permitted_clock_jitter(char *str)
+{
+	permitted_clock_jitter = simple_strtoul(str, NULL, 0);
+	return 1;
+}
+__setup("permitted_clock_jitter=", __permitted_clock_jitter);
+
+
+/* These are perodically updated in shared_info, and then copied here. */
+struct shadow_time_info {
+	u64 tsc_timestamp;     /* TSC at last update of time vals.  */
+	u64 system_timestamp;  /* Time, in nanosecs, since boot.    */
+	u32 tsc_to_nsec_mul;
+	int tsc_shift;
+	u32 version;
+};
+
+static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);
+
+/* Keep track of last time we did processing/updating of jiffies and xtime. */
+static u64 processed_system_time;   /* System time (ns) at last processing. */
+static DEFINE_PER_CPU(u64, processed_system_time);
+
+/* How much CPU time was spent blocked and how much was 'stolen'? */
+static DEFINE_PER_CPU(u64, processed_stolen_time);
+static DEFINE_PER_CPU(u64, processed_blocked_time);
+
+/* Current runstate of each CPU (updated automatically by the hypervisor). */
+static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate);
+
+/* Must be signed, as it's compared with s64 quantities which can be -ve. */
+#define NS_PER_TICK (1000000000LL/HZ)
+
+/*
+ * Reads a consistent set of time-base values from Xen, into a shadow data
+ * area.
+ */
+static void get_time_values_from_xen(void)
+{
+	struct vcpu_time_info   *src;
+	struct shadow_time_info *dst;
+
+	src = &read_pda(xen.vcpu)->time;
+	dst = &get_cpu_var(shadow_time);
+
+	do {
+		dst->version = src->version;
+		rmb();
+		dst->tsc_timestamp     = src->tsc_timestamp;
+		dst->system_timestamp  = src->system_time;
+		dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
+		dst->tsc_shift         = src->tsc_shift;
+		rmb();
+	} while ((src->version & 1) | (dst->version ^ src->version));
+
+	put_cpu_var(shadow_time);
+}
+
+static inline int time_values_up_to_date(void)
+{
+	struct vcpu_time_info   *src;
+	unsigned dstversion;
+
+	src = &read_pda(xen.vcpu)->time;
+	dstversion = get_cpu_var(shadow_time).version;
+	put_cpu_var(shadow_time);
+
+	rmb();
+	return (dstversion == src->version);
+}
+
+/*
+ * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
+ * yielding a 64-bit result.
+ */
+static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
+{
+	u64 product;
+#ifdef __i386__
+	u32 tmp1, tmp2;
+#endif
+
+	if (shift < 0)
+		delta >>= -shift;
+	else
+		delta <<= shift;
+
+#ifdef __i386__
+	__asm__ (
+		"mul  %5       ; "
+		"mov  %4,%%eax ; "
+		"mov  %%edx,%4 ; "
+		"mul  %5       ; "
+		"xor  %5,%5    ; "
+		"add  %4,%%eax ; "
+		"adc  %5,%%edx ; "
+		: "=A" (product), "=r" (tmp1), "=r" (tmp2)
+		: "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
+#elif __x86_64__
+	__asm__ (
+		"mul %%rdx ; shrd $32,%%rdx,%%rax"
+		: "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
+#else
+#error implement me!
+#endif
+
+	return product;
+}
+
+static u64 get_nsec_offset(struct shadow_time_info *shadow)
+{
+	u64 now, delta;
+	rdtscll(now);
+	delta = now - shadow->tsc_timestamp;
+	return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
+}
+
+
+static void xen_timer_interrupt_hook(void)
+{
+	s64 delta, delta_cpu, stolen, blocked;
+	u64 sched_time;
+	int i, cpu = smp_processor_id();
+	unsigned long ticks;
+	struct shadow_time_info *shadow = &__get_cpu_var(shadow_time);
+	struct vcpu_runstate_info *runstate = &__get_cpu_var(runstate);
+
+	do {
+		get_time_values_from_xen();
+
+		/* Obtain a consistent snapshot of elapsed wallclock cycles. */
+		delta = delta_cpu =
+			shadow->system_timestamp + get_nsec_offset(shadow);
+		if (0)
+			printk("tsc_timestamp=%llu system_timestamp=%llu tsc_to_nsec=%u tsc_shift=%d, version=%u, delta=%lld processed_system_time=%lld\n",
+			       shadow->tsc_timestamp, shadow->system_timestamp,
+			       shadow->tsc_to_nsec_mul, shadow->tsc_shift,
+			       shadow->version, delta, processed_system_time);
+
+		delta     -= processed_system_time;
+		delta_cpu -= __get_cpu_var(processed_system_time);
+
+		/*
+		 * Obtain a consistent snapshot of stolen/blocked cycles. We
+		 * can use state_entry_time to detect if we get preempted here.
+		 */
+		do {
+			sched_time = runstate->state_entry_time;
+			barrier();
+			stolen = runstate->time[RUNSTATE_runnable] +
+				runstate->time[RUNSTATE_offline] -
+				__get_cpu_var(processed_stolen_time);
+			blocked = runstate->time[RUNSTATE_blocked] -
+				__get_cpu_var(processed_blocked_time);
+			barrier();
+		} while (sched_time != runstate->state_entry_time);
+	} while (!time_values_up_to_date());
+
+	if ((unlikely(delta < -(s64)permitted_clock_jitter) ||
+	     unlikely(delta_cpu < -(s64)permitted_clock_jitter))
+	    && printk_ratelimit()) {
+		printk("Timer ISR/%d: Time went backwards: "
+		       "delta=%lld delta_cpu=%lld shadow=%lld "
+		       "off=%lld processed=%lld cpu_processed=%lld\n",
+		       cpu, delta, delta_cpu, shadow->system_timestamp,
+		       (s64)get_nsec_offset(shadow),
+		       processed_system_time,
+		       __get_cpu_var(processed_system_time));
+		for (i = 0; i < num_online_cpus(); i++)
+			printk(" %d: %lld\n", i,
+			       per_cpu(processed_system_time, i));
+	}
+
+	/* System-wide jiffy work. */
+	ticks = 0;
+	while(delta > NS_PER_TICK) {
+		delta -= NS_PER_TICK;
+		processed_system_time += NS_PER_TICK;
+		ticks++;
+	}
+	do_timer(ticks);
+
+	/*
+	 * Account stolen ticks.
+	 * HACK: Passing NULL to account_steal_time()
+	 * ensures that the ticks are accounted as stolen.
+	 */
+	if ((stolen > 0) && (delta_cpu > 0)) {
+		delta_cpu -= stolen;
+		if (unlikely(delta_cpu < 0))
+			stolen += delta_cpu; /* clamp local-time progress */
+		do_div(stolen, NS_PER_TICK);
+		__get_cpu_var(processed_stolen_time) += stolen * NS_PER_TICK;
+		__get_cpu_var(processed_system_time) += stolen * NS_PER_TICK;
+		account_steal_time(NULL, (cputime_t)stolen);
+	}
+
+	/*
+	 * Account blocked ticks.
+	 * HACK: Passing idle_task to account_steal_time()
+	 * ensures that the ticks are accounted as idle/wait.
+	 */
+	if ((blocked > 0) && (delta_cpu > 0)) {
+		delta_cpu -= blocked;
+		if (unlikely(delta_cpu < 0))
+			blocked += delta_cpu; /* clamp local-time progress */
+		do_div(blocked, NS_PER_TICK);
+		__get_cpu_var(processed_blocked_time) += blocked * NS_PER_TICK;
+		__get_cpu_var(processed_system_time)  += blocked * NS_PER_TICK;
+		account_steal_time(idle_task(cpu), (cputime_t)blocked);
+	}
+
+	update_process_times(user_mode_vm(get_irq_regs()));
+}
+
+static cycle_t xen_clocksource_read(void)
+{
+	struct shadow_time_info *shadow = &get_cpu_var(shadow_time);
+	cycle_t ret;
+
+	get_time_values_from_xen();
+
+	ret = shadow->system_timestamp + get_nsec_offset(shadow);
+
+	put_cpu_var(shadow_time);
+
+	return ret;
+}
+
+static void xen_read_wallclock(struct timespec *ts)
+{
+	const struct shared_info *s = HYPERVISOR_shared_info;
+	u32 version;
+	u64 delta;
+	struct timespec now;
+
+	/* get wallclock at system boot */
+	do {
+		version = s->wc_version;
+		rmb();
+		now.tv_sec  = s->wc_sec;
+		now.tv_nsec = s->wc_nsec;
+		rmb();
+	} while ((s->wc_version & 1) | (version ^ s->wc_version));
+
+	delta = xen_clocksource_read();	/* time since system boot */
+	delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;
+
+	now.tv_nsec = do_div(delta, NSEC_PER_SEC);
+	now.tv_sec = delta;
+
+	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
+}
+
+unsigned long xen_get_wallclock(void)
+{
+	struct timespec ts;
+
+	xen_read_wallclock(&ts);
+
+	return ts.tv_sec;
+}
+
+int xen_set_wallclock(unsigned long now)
+{
+	/* do nothing for domU */
+	return -1;
+}
+
+static void init_cpu_khz(void)
+{
+	u64 __cpu_khz = 1000000ULL << 32;
+	struct vcpu_time_info *info;
+	info = &HYPERVISOR_shared_info->vcpu_info[0].time;
+	do_div(__cpu_khz, info->tsc_to_system_mul);
+	if (info->tsc_shift < 0)
+		cpu_khz = __cpu_khz << -info->tsc_shift;
+	else
+		cpu_khz = __cpu_khz >> info->tsc_shift;
+}
+
+static struct clocksource xen_clocksource = {
+	.name = "xen",
+	.rating = 400,
+	.read = xen_clocksource_read,
+	.mask = ~0,
+	.mult = 1<<XEN_SHIFT,		/* time directly in nanoseconds */
+	.shift = XEN_SHIFT,
+	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
+};
+
+static void init_missing_ticks_accounting(int cpu)
+{
+	struct vcpu_register_runstate_memory_area area;
+	struct vcpu_runstate_info *runstate = &per_cpu(runstate, cpu);
+
+	memset(runstate, 0, sizeof(*runstate));
+
+	area.addr.v = runstate;
+	HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, cpu, &area);
+
+	per_cpu(processed_blocked_time, cpu) =
+		runstate->time[RUNSTATE_blocked];
+	per_cpu(processed_stolen_time, cpu) =
+		runstate->time[RUNSTATE_runnable] +
+		runstate->time[RUNSTATE_offline];
+}
+
+static irqreturn_t xen_timer_interrupt(int irq, void *dev_id)
+{
+	/*
+	 * Here we are in the timer irq handler. We just have irqs locally
+	 * disabled but we don't know if the timer_bh is running on the other
+	 * CPU. We need to avoid to SMP race with it. NOTE: we don' t need
+	 * the irq version of write_lock because as just said we have irq
+	 * locally disabled. -arca
+	 */
+	write_seqlock(&xtime_lock);
+
+	xen_timer_interrupt_hook();
+
+	write_sequnlock(&xtime_lock);
+
+	return IRQ_HANDLED;
+}
+
+static void setup_cpu0_timer_irq(void)
+{
+	printk(KERN_DEBUG "installing Xen timer for CPU 0\n");
+
+	bind_virq_to_irqhandler(
+		VIRQ_TIMER,
+		0,
+		xen_timer_interrupt,
+		SA_INTERRUPT,
+		"timer0",
+		NULL);
+}
+
+__init void xen_time_init(void)
+{
+	get_time_values_from_xen();
+
+	processed_system_time = per_cpu(shadow_time, 0).system_timestamp;
+	per_cpu(processed_system_time, 0) = processed_system_time;
+
+	init_cpu_khz();
+	printk(KERN_INFO "Xen reported: %u.%03u MHz processor.\n",
+	       cpu_khz / 1000, cpu_khz % 1000);
+
+	init_missing_ticks_accounting(0);
+
+	clocksource_register(&xen_clocksource);
+
+	/* Set initial system time with full resolution */
+	xen_read_wallclock(&xtime);
+	set_normalized_timespec(&wall_to_monotonic,
+				-xtime.tv_sec, -xtime.tv_nsec);
+
+	tsc_disable = 0;
+
+	setup_cpu0_timer_irq();
+}
+
+/* Convert jiffies to system time. */
+static u64 jiffies_to_st(unsigned long j)
+{
+	unsigned long seq;
+	long delta;
+	u64 st;
+
+	do {
+		seq = read_seqbegin(&xtime_lock);
+		delta = j - jiffies;
+		if (delta < 1) {
+			/* Triggers in some wrap-around cases, but that's okay:
+			 * we just end up with a shorter timeout. */
+			st = processed_system_time + NS_PER_TICK;
+		} else if (((unsigned long)delta >> (BITS_PER_LONG-3)) != 0) {
+			/* Very long timeout means there is no pending timer.
+			 * We indicate this to Xen by passing zero timeout. */
+			st = 0;
+		} else {
+			st = processed_system_time + delta * (u64)NS_PER_TICK;
+		}
+	} while (read_seqretry(&xtime_lock, seq));
+
+	return st;
+}
+
+/*
+ * stop_hz_timer / start_hz_timer - enter/exit 'tickless mode' on an idle cpu
+ * These functions are based on implementations from arch/s390/kernel/time.c
+ */
+void stop_hz_timer(void)
+{
+	unsigned int cpu = smp_processor_id();
+	unsigned long j;
+
+	cpu_set(cpu, nohz_cpu_mask);
+
+	/*
+	 * See matching smp_mb in rcu_start_batch in rcupdate.c.  These mbs
+	 * ensure that if __rcu_pending (nested in rcu_needs_cpu) fetches a
+	 * value of rcp->cur that matches rdp->quiescbatch and allows us to
+	 * stop the hz timer then the cpumasks created for subsequent values
+	 * of cur in rcu_start_batch are guaranteed to pick up the updated
+	 * nohz_cpu_mask and so will not depend on this cpu.
+	 */
+
+	smp_mb();
+
+	/* Leave ourselves in tick mode if rcu or softirq or timer pending. */
+	if (rcu_needs_cpu(cpu) || local_softirq_pending() ||
+	    (j = next_timer_interrupt(), time_before_eq(j, jiffies))) {
+		cpu_clear(cpu, nohz_cpu_mask);
+		j = jiffies + 1;
+	}
+
+	if (HYPERVISOR_set_timer_op(jiffies_to_st(j)) != 0)
+		BUG();
+}
+
+void start_hz_timer(void)
+{
+	cpu_clear(smp_processor_id(), nohz_cpu_mask);
+}
+
===================================================================
--- /dev/null
+++ b/arch/i386/xen/xen-head.S
@@ -0,0 +1,30 @@
+/* Xen-specific pieces of head.S, intended to be included in the right
+	place in head.S */
+
+#include <linux/elfnote.h>
+#include <asm/boot.h>
+#include <xen/interface/elfnote.h>
+
+ENTRY(startup_xen)
+	movl %esi,xen_start_info
+	jmp startup_paravirt
+
+.pushsection ".bss.page_aligned"
+ENTRY(hypercall_page)
+	.align PAGE_SIZE_asm
+	.skip 0x1000
+.popsection
+
+	ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS,       .asciz, "linux")
+	ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION,  .asciz, "2.6")
+	ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION,    .asciz, "xen-3.0")
+	ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      .long,  __PAGE_OFFSET)
+	ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          .long,  startup_xen)
+	ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .long,  hypercall_page)
+	ELFNOTE(Xen, XEN_ELFNOTE_FEATURES,       .asciz, "!writable_page_tables|pae_pgdir_above_4gb")
+#ifdef CONFIG_X86_PAE
+	ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE,       .asciz, "yes")
+#else
+	ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE,       .asciz, "no")
+#endif
+	ELFNOTE(Xen, XEN_ELFNOTE_LOADER,         .asciz, "generic")
===================================================================
--- /dev/null
+++ b/arch/i386/xen/xen-ops.h
@@ -0,0 +1,31 @@
+#ifndef XEN_OPS_H
+#define XEN_OPS_H
+
+#include <linux/init.h>
+
+extern struct start_info *xen_start_info;
+extern struct shared_info *HYPERVISOR_shared_info;
+
+char * __init xen_memory_setup(void);
+void __init xen_arch_setup(void);
+void __init xen_init_IRQ(void);
+
+void __init xen_time_init(void);
+unsigned long xen_get_wallclock(void);
+int xen_set_wallclock(unsigned long time);
+
+void stop_hz_timer(void);
+void start_hz_timer(void);
+
+DECLARE_PER_CPU(unsigned, xen_lazy_mode);
+
+static inline unsigned xen_get_lazy_mode(void)
+{
+	unsigned ret = get_cpu_var(xen_lazy_mode);
+	put_cpu_var(xen_lazy_mode);
+
+	return ret;
+}
+
+
+#endif /* XEN_OPS_H */
===================================================================
--- a/include/asm-i386/irq.h
+++ b/include/asm-i386/irq.h
@@ -43,6 +43,7 @@ extern void fixup_irqs(cpumask_t map);
 extern void fixup_irqs(cpumask_t map);
 #endif
 
+unsigned int do_IRQ(struct pt_regs *regs);
 void init_IRQ(void);
 void __init native_init_IRQ(void);
 
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -253,6 +253,7 @@ struct Xgt_desc_struct;
 struct Xgt_desc_struct;
 struct tss_struct;
 struct mm_struct;
+struct i386_pda;
 struct paravirt_ops
 {
 	unsigned int kernel_rpl;
@@ -272,6 +273,7 @@ struct paravirt_ops
 	void (*arch_setup)(void);
 	char *(*memory_setup)(void);
 	void (*init_IRQ)(void);
+	void (*init_pda)(struct i386_pda *, int cpu);
 
 	void (*pagetable_setup_start)(pgd_t *pgd_base);
 	void (*pagetable_setup_done)(pgd_t *pgd_base);
@@ -399,6 +401,30 @@ struct paravirt_ops
 	void (*startup_ipi_hook)(int phys_apicid, unsigned long start_eip, unsigned long start_esp);
 };
 
+/* Non-paravirtualized implementations of various operations for
+   back-ends which don't need their own version. */
+void native_clts(void);
+
+unsigned long native_read_cr0(void);
+void native_write_cr0(unsigned long val);
+
+unsigned long native_read_cr2(void);
+void native_write_cr2(unsigned long val);
+
+unsigned long native_read_cr3(void);
+void native_write_cr3(unsigned long val);
+
+unsigned long native_read_cr4(void);
+unsigned long native_read_cr4_safe(void);
+void native_write_cr4(unsigned long val);
+
+void native_wbinvd(void);
+
+unsigned long long native_read_msr(unsigned int msr, int *err);
+int native_write_msr(unsigned int msr, unsigned long long val);
+unsigned long long native_read_tsc(void);
+unsigned long long native_read_pmc(void);
+
 /* Mark a paravirt probe function. */
 #define paravirt_probe(fn)						\
  static asmlinkage void (*__paravirtprobe_##fn)(void) __attribute_used__ \
@@ -520,6 +546,13 @@ static inline void halt(void)
 	high = _l >> 32;					\
 } while(0)
 
+void native_write_ldt_entry(void *dt, int entrynum, u32 low, u32 high);
+void native_write_gdt_entry(void *dt, int entrynum, u32 low, u32 high);
+void native_write_idt_entry(void *dt, int entrynum, u32 low, u32 high);
+void native_store_gdt(struct Xgt_desc_struct *dtr);
+void native_store_idt(struct Xgt_desc_struct *dtr);
+unsigned long native_store_tr(void);
+
 static inline void load_TR_desc(void)
 {
 	PVOP_VCALL0(load_tr_desc);
@@ -641,6 +674,12 @@ static inline void arch_exit_mmap(struct
 static inline void arch_exit_mmap(struct mm_struct *mm)
 {
 	PVOP_VCALL1(exit_mmap, mm);
+}
+
+static inline void paravirt_init_pda(struct i386_pda *pda, int cpu)
+{
+	if (paravirt_ops.init_pda)
+		(*paravirt_ops.init_pda)(pda, cpu);
 }
 
 #define __flush_tlb()		PVOP_VCALL0(flush_tlb_user)
===================================================================
--- a/include/asm-i386/pda.h
+++ b/include/asm-i386/pda.h
@@ -16,8 +16,26 @@ struct i386_pda
 	int cpu_number;
 	struct task_struct *pcurrent;	/* current process */
 	struct pt_regs *irq_regs;
+
+#ifdef CONFIG_PARAVIRT
+	union {
+#ifdef CONFIG_XEN
+		struct {
+			struct vcpu_info *vcpu;
+			unsigned long cr3;
+		} xen;
+#endif	/* CONFIG_XEN */
+	};
+#endif	/* CONFIG_PARAVIRT */
 };
 
+#ifndef CONFIG_PARAVIRT
+static inline void paravirt_init_pda(struct i386_pda *pda, int cpu)
+{
+}
+#endif
+
+extern struct i386_pda boot_pda;
 extern struct i386_pda *_cpu_pda[];
 
 #define cpu_pda(i)	(_cpu_pda[i])
===================================================================
--- a/include/asm-i386/xen/hypercall.h
+++ b/include/asm-i386/xen/hypercall.h
@@ -410,4 +410,22 @@ MULTI_mmuext_op(struct multicall_entry *
 	mcl->args[2] = (unsigned long)success_count;
 	mcl->args[3] = domid;
 }
+
+static inline void
+MULTI_set_gdt(struct multicall_entry *mcl, unsigned long *frames, int entries)
+{
+	mcl->op = __HYPERVISOR_set_gdt;
+	mcl->args[0] = (unsigned long)frames;
+	mcl->args[1] = entries;
+}
+
+static inline void
+MULTI_stack_switch(struct multicall_entry *mcl,
+		   unsigned long ss, unsigned long esp)
+{
+	mcl->op = __HYPERVISOR_stack_switch;
+	mcl->args[0] = ss;
+	mcl->args[1] = esp;
+}
+
 #endif /* __HYPERCALL_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/events.h
@@ -0,0 +1,28 @@
+#ifndef _XEN_EVENTS_H
+#define _XEN_EVENTS_H
+
+#include <linux/irq.h>
+
+int bind_evtchn_to_irqhandler(unsigned int evtchn,
+			      irqreturn_t (*handler)(int, void *),
+			      unsigned long irqflags, const char *devname,
+			      void *dev_id);
+int bind_virq_to_irqhandler(unsigned int virq, unsigned int cpu,
+			    irqreturn_t (*handler)(int, void *),
+			    unsigned long irqflags, const char *devname, void *dev_id);
+
+/*
+ * Common unbind function for all event sources. Takes IRQ to unbind from.
+ * Automatically closes the underlying event channel (even for bindings
+ * made with bind_evtchn_to_irqhandler()).
+ */
+void unbind_from_irqhandler(unsigned int irq, void *dev_id);
+
+static inline void notify_remote_via_evtchn(int port)
+{
+	struct evtchn_send send = { .port = port };
+	(void)HYPERVISOR_event_channel_op(EVTCHNOP_send, &send);
+}
+
+extern void notify_remote_via_irq(int irq);
+#endif	/* _XEN_EVENTS_H */
===================================================================
--- /dev/null
+++ b/include/xen/features.h
@@ -0,0 +1,26 @@
+/******************************************************************************
+ * features.h
+ *
+ * Query the features reported by Xen.
+ *
+ * Copyright (c) 2006, Ian Campbell
+ */
+
+#ifndef __XEN_FEATURES_H__
+#define __XEN_FEATURES_H__
+
+#include <xen/interface/features.h>
+
+void xen_setup_features(void);
+
+extern u8 xen_features[XENFEAT_NR_SUBMAPS * 32];
+
+static inline int xen_feature(int flag)
+{
+	switch(flag) {
+	}
+
+	return xen_features[flag];
+}
+
+#endif /* __ASM_XEN_FEATURES_H__ */
===================================================================
--- /dev/null
+++ b/include/xen/page.h
@@ -0,0 +1,175 @@
+#ifndef __XEN_PAGE_H
+#define __XEN_PAGE_H
+
+#include <linux/pfn.h>
+
+#include <asm/uaccess.h>
+
+#include <xen/features.h>
+
+#ifdef CONFIG_X86_PAE
+/* Xen machine address */
+typedef struct xmaddr {
+	unsigned long long maddr;
+} xmaddr_t;
+
+/* Xen pseudo-physical address */
+typedef struct xpaddr {
+	unsigned long long paddr;
+} xpaddr_t;
+#else
+/* Xen machine address */
+typedef struct xmaddr {
+	unsigned long maddr;
+} xmaddr_t;
+
+/* Xen pseudo-physical address */
+typedef struct xpaddr {
+	unsigned long paddr;
+} xpaddr_t;
+#endif
+
+#define XMADDR(x)	((xmaddr_t) { .maddr = (x) })
+#define XPADDR(x)	((xpaddr_t) { .paddr = (x) })
+
+/**** MACHINE <-> PHYSICAL CONVERSION MACROS ****/
+#define INVALID_P2M_ENTRY	(~0UL)
+#define FOREIGN_FRAME_BIT	(1UL<<31)
+#define FOREIGN_FRAME(m)	((m) | FOREIGN_FRAME_BIT)
+
+extern unsigned long *phys_to_machine_mapping;
+
+static inline unsigned long pfn_to_mfn(unsigned long pfn)
+{
+	if (xen_feature(XENFEAT_auto_translated_physmap))
+		return pfn;
+
+	return phys_to_machine_mapping[(unsigned int)(pfn)] &
+		~FOREIGN_FRAME_BIT;
+}
+
+static inline int phys_to_machine_mapping_valid(unsigned long pfn)
+{
+	if (xen_feature(XENFEAT_auto_translated_physmap))
+		return 1;
+
+	return (phys_to_machine_mapping[pfn] != INVALID_P2M_ENTRY);
+}
+
+static inline unsigned long mfn_to_pfn(unsigned long mfn)
+{
+	unsigned long pfn;
+
+	if (xen_feature(XENFEAT_auto_translated_physmap))
+		return mfn;
+
+#if 0
+	if (unlikely((mfn >> machine_to_phys_order) != 0))
+		return max_mapnr;
+#endif
+
+	pfn = 0;
+	/*
+	 * The array access can fail (e.g., device space beyond end of RAM).
+	 * In such cases it doesn't matter what we return (we return garbage),
+	 * but we must handle the fault without crashing!
+	 */
+	__get_user(pfn, &machine_to_phys_mapping[mfn]);
+
+	return pfn;
+}
+
+static inline xmaddr_t phys_to_machine(xpaddr_t phys)
+{
+	unsigned offset = phys.paddr & ~PAGE_MASK;
+	return XMADDR(PFN_PHYS(pfn_to_mfn(PFN_DOWN(phys.paddr))) | offset);
+}
+
+static inline xpaddr_t machine_to_phys(xmaddr_t machine)
+{
+	unsigned offset = machine.maddr & ~PAGE_MASK;
+	return XPADDR(PFN_PHYS(mfn_to_pfn(PFN_DOWN(machine.maddr))) | offset);
+}
+
+/*
+ * We detect special mappings in one of two ways:
+ *  1. If the MFN is an I/O page then Xen will set the m2p entry
+ *     to be outside our maximum possible pseudophys range.
+ *  2. If the MFN belongs to a different domain then we will certainly
+ *     not have MFN in our p2m table. Conversely, if the page is ours,
+ *     then we'll have p2m(m2p(MFN))==MFN.
+ * If we detect a special mapping then it doesn't have a 'struct page'.
+ * We force !pfn_valid() by returning an out-of-range pointer.
+ *
+ * NB. These checks require that, for any MFN that is not in our reservation,
+ * there is no PFN such that p2m(PFN) == MFN. Otherwise we can get confused if
+ * we are foreign-mapping the MFN, and the other domain as m2p(MFN) == PFN.
+ * Yikes! Various places must poke in INVALID_P2M_ENTRY for safety.
+ *
+ * NB2. When deliberately mapping foreign pages into the p2m table, you *must*
+ *      use FOREIGN_FRAME(). This will cause pte_pfn() to choke on it, as we
+ *      require. In all the cases we care about, the FOREIGN_FRAME bit is
+ *      masked (e.g., pfn_to_mfn()) so behaviour there is correct.
+ */
+static inline unsigned long mfn_to_local_pfn(unsigned long mfn)
+{
+	extern unsigned long max_mapnr;
+	unsigned long pfn = mfn_to_pfn(mfn);
+	if ((pfn < max_mapnr)
+	    && !xen_feature(XENFEAT_auto_translated_physmap)
+	    && (phys_to_machine_mapping[pfn] != mfn))
+		return max_mapnr; /* force !pfn_valid() */
+	return pfn;
+}
+
+static inline void set_phys_to_machine(unsigned long pfn, unsigned long mfn)
+{
+	if (xen_feature(XENFEAT_auto_translated_physmap)) {
+		BUG_ON(pfn != mfn && mfn != INVALID_P2M_ENTRY);
+		return;
+	}
+	phys_to_machine_mapping[pfn] = mfn;
+}
+
+/* VIRT <-> MACHINE conversion */
+#define virt_to_machine(v)	(phys_to_machine(XPADDR(__pa(v))))
+#define virt_to_mfn(v)		(pfn_to_mfn(PFN_DOWN(__pa(v))))
+#define mfn_to_virt(m)		(__va(mfn_to_pfn(m) << PAGE_SHIFT))
+
+#ifdef CONFIG_X86_PAE
+#define pte_mfn(_pte) (((_pte).pte_low >> PAGE_SHIFT) |\
+                       (((_pte).pte_high & 0xfff) << (32-PAGE_SHIFT)))
+
+static inline pte_t mfn_pte(unsigned long page_nr, pgprot_t pgprot)
+{
+	pte_t pte;
+
+	pte.pte_high = (page_nr >> (32 - PAGE_SHIFT)) | (pgprot_val(pgprot) >> 32);
+	pte.pte_high &= (__supported_pte_mask >> 32);
+	pte.pte_low = ((page_nr << PAGE_SHIFT) | pgprot_val(pgprot));
+	pte.pte_low &= __supported_pte_mask;
+
+	return pte;
+}
+
+static inline unsigned long long pte_val_ma(pte_t x)
+{
+	return ((unsigned long long)x.pte_high << 32) | x.pte_low;
+}
+#define pmd_val_ma(v) ((v).pmd)
+#define pud_val_ma(v) ((v).pgd.pgd)
+#else  /* !X86_PAE */
+#define pte_mfn(_pte) ((_pte).pte_low >> PAGE_SHIFT)
+#define mfn_pte(pfn, prot)	__pte_ma(((pfn) << PAGE_SHIFT) | pgprot_val(prot))
+#define pte_val_ma(x)	((x).pte_low)
+#define pmd_val_ma(v)	((v).pud.pgd.pgd)
+#endif	/* CONFIG_X86_PAE */
+#define pgd_val_ma(x)	((x).pgd)
+
+#define __pte_ma(x)	((pte_t) { (x) } )
+
+xmaddr_t arbitrary_virt_to_machine(unsigned long address);
+void make_lowmem_page_readonly(void *vaddr);
+void make_lowmem_page_readwrite(void *vaddr);
+
+#endif /* __XEN_PAGE_H */

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (19 preceding siblings ...)
  2007-03-01 23:25 ` [patch 20/26] Xen-paravirt_ops: Core Xen implementation Jeremy Fitzhardinge
@ 2007-03-01 23:25 ` Jeremy Fitzhardinge
  2007-03-16  8:54     ` Ingo Molnar
  2007-03-01 23:25 ` [patch 22/26] Xen-paravirt_ops: Add early printk support via hvc console Jeremy Fitzhardinge
                   ` (6 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell

[-- Attachment #1: xen-hvc-console.patch --]
[-- Type: text/plain, Size: 5683 bytes --]

Implement a Xen back-end for hvc console.

From: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

---
 arch/i386/xen/Kconfig     |    1 
 arch/i386/xen/events.c    |    3 -
 drivers/Makefile          |    3 +
 drivers/xen/Makefile      |    1 
 drivers/xen/hvc-console.c |  134 +++++++++++++++++++++++++++++++++++++++++++++
 include/xen/events.h      |    1 
 6 files changed, 142 insertions(+), 1 deletion(-)

===================================================================
--- a/arch/i386/xen/Kconfig
+++ b/arch/i386/xen/Kconfig
@@ -5,6 +5,7 @@ config XEN
 config XEN
 	bool "Enable support for Xen hypervisor"
 	depends on PARAVIRT && HZ_100 && !PREEMPT && !NO_HZ
+	select HVC_DRIVER
 	default y
 	help
 	  This is the Linux Xen port.
===================================================================
--- a/arch/i386/xen/events.c
+++ b/arch/i386/xen/events.c
@@ -209,7 +209,7 @@ static int find_unbound_irq(void)
 	return irq;
 }
 
-static int bind_evtchn_to_irq(unsigned int evtchn)
+int bind_evtchn_to_irq(unsigned int evtchn)
 {
 	int irq;
 
@@ -233,6 +233,7 @@ static int bind_evtchn_to_irq(unsigned i
 
 	return irq;
 }
+EXPORT_SYMBOL_GPL(bind_evtchn_to_irq);
 
 static int bind_virq_to_irq(unsigned int virq, unsigned int cpu)
 {
===================================================================
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -14,6 +14,9 @@ obj-$(CONFIG_ACPI)		+= acpi/
 # was used and do nothing if so
 obj-$(CONFIG_PNP)		+= pnp/
 obj-$(CONFIG_ARM_AMBA)		+= amba/
+
+# Xen is the default console when running as a guest
+obj-$(CONFIG_XEN)		+= xen/
 
 # char/ comes before serial/ etc so that the VT console is the boot-time
 # default.
===================================================================
--- /dev/null
+++ b/drivers/xen/Makefile
@@ -0,0 +1,1 @@
+obj-y	+= hvc-console.o
===================================================================
--- /dev/null
+++ b/drivers/xen/hvc-console.c
@@ -0,0 +1,134 @@
+/*
+ * xen console driver interface to hvc_console.c
+ *
+ * (c) 2007 Gerd Hoffmann <kraxel@suse.de>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
+ */
+
+#include <linux/console.h>
+#include <linux/delay.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/types.h>
+
+#include <asm/xen/hypervisor.h>
+#include <xen/page.h>
+#include <xen/events.h>
+#include <xen/interface/io/console.h>
+
+#include "../char/hvc_console.h"
+
+#define HVC_COOKIE   0x58656e /* "Xen" in hex */
+
+static struct hvc_struct *hvc;
+static int xencons_irq;
+
+/* ------------------------------------------------------------------ */
+
+static inline struct xencons_interface *xencons_interface(void)
+{
+	return mfn_to_virt(xen_start_info->console.domU.mfn);
+}
+
+static inline void notify_daemon(void)
+{
+	/* Use evtchn: this is called early, before irq is set up. */
+	notify_remote_via_evtchn(xen_start_info->console.domU.evtchn);
+}
+
+static int write_console(uint32_t vtermno, const char *data, int len)
+{
+	struct xencons_interface *intf = xencons_interface();
+	XENCONS_RING_IDX cons, prod;
+	int sent = 0;
+
+	cons = intf->out_cons;
+	prod = intf->out_prod;
+	mb();
+	BUG_ON((prod - cons) > sizeof(intf->out));
+
+	while ((sent < len) && ((prod - cons) < sizeof(intf->out)))
+		intf->out[MASK_XENCONS_IDX(prod++, intf->out)] = data[sent++];
+
+	wmb();
+	intf->out_prod = prod;
+
+	notify_daemon();
+	return sent;
+}
+
+static int read_console(uint32_t vtermno, char *buf, int len)
+{
+	struct xencons_interface *intf = xencons_interface();
+	XENCONS_RING_IDX cons, prod;
+	int recv = 0;
+
+	cons = intf->in_cons;
+	prod = intf->in_prod;
+	mb();
+	BUG_ON((prod - cons) > sizeof(intf->in));
+
+	while (cons != prod && recv < len)
+		buf[recv++] = intf->in[MASK_XENCONS_IDX(cons++,intf->in)];
+
+	mb();
+	intf->in_cons = cons;
+
+	notify_daemon();
+	return recv;
+}
+
+static struct hv_ops hvc_ops = {
+	.get_chars = read_console,
+	.put_chars = write_console,
+};
+
+static int __init xen_init(void)
+{
+	struct hvc_struct *hp;
+
+	if (!is_running_on_xen())
+		return 0;
+
+	xencons_irq = bind_evtchn_to_irq(xen_start_info->console.domU.evtchn);
+	if (xencons_irq < 0)
+		xencons_irq = 0 /* NO_IRQ */;
+	hp = hvc_alloc(HVC_COOKIE, xencons_irq, &hvc_ops, 256);
+	if (IS_ERR(hp))
+		return PTR_ERR(hp);
+
+	hvc = hp;
+	return 0;
+}
+
+static void __exit xen_fini(void)
+{
+	if (hvc)
+		hvc_remove(hvc);
+}
+
+static int xen_cons_init(void)
+{
+	if (!is_running_on_xen())
+		return 0;
+
+	hvc_instantiate(HVC_COOKIE, 0, &hvc_ops);
+	return 0;
+}
+
+module_init(xen_init);
+module_exit(xen_fini);
+console_initcall(xen_cons_init);
===================================================================
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -3,6 +3,7 @@
 
 #include <linux/irq.h>
 
+int bind_evtchn_to_irq(unsigned int evtchn);
 int bind_evtchn_to_irqhandler(unsigned int evtchn,
 			      irqreturn_t (*handler)(int, void *),
 			      unsigned long irqflags, const char *devname,

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 22/26] Xen-paravirt_ops: Add early printk support via hvc console
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (20 preceding siblings ...)
  2007-03-01 23:25 ` [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console Jeremy Fitzhardinge
@ 2007-03-01 23:25 ` Jeremy Fitzhardinge
  2007-03-16  8:52     ` Ingo Molnar
  2007-03-01 23:25   ` Jeremy Fitzhardinge
                   ` (5 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell

[-- Attachment #1: xen-hvc-earlyprintk.patch --]
[-- Type: text/plain, Size: 2293 bytes --]

Add early printk support via hvc console, enable using
"earlyprintk=xen" on the kernel command line.

From: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

---
 arch/x86_64/kernel/early_printk.c |    5 +++++
 drivers/xen/hvc-console.c         |   25 +++++++++++++++++++++++++
 include/xen/hvc-console.h         |    6 ++++++
 3 files changed, 36 insertions(+)

===================================================================
--- a/arch/x86_64/kernel/early_printk.c
+++ b/arch/x86_64/kernel/early_printk.c
@@ -6,6 +6,7 @@
 #include <asm/io.h>
 #include <asm/processor.h>
 #include <asm/fcntl.h>
+#include <xen/hvc-console.h>
 
 /* Simple VGA output */
 
@@ -243,6 +244,10 @@ static int __init setup_early_printk(cha
  		simnow_init(buf + 6);
  		early_console = &simnow_console;
  		keep_early = 1;
+#ifdef CONFIG_XEN
+	} else if (!strncmp(buf, "xen", 3)) {
+ 		early_console = &xenboot_console;
+#endif
 	}
 	register_console(early_console);
 	return 0;
===================================================================
--- a/drivers/xen/hvc-console.c
+++ b/drivers/xen/hvc-console.c
@@ -28,6 +28,7 @@
 #include <xen/page.h>
 #include <xen/events.h>
 #include <xen/interface/io/console.h>
+#include <xen/hvc-console.h>
 
 #include "../char/hvc_console.h"
 
@@ -132,3 +133,27 @@ module_init(xen_init);
 module_init(xen_init);
 module_exit(xen_fini);
 console_initcall(xen_cons_init);
+
+static void xenboot_write_console(struct console *console, const char *string,
+				  unsigned len)
+{
+	unsigned int linelen, off = 0;
+	const char *pos;
+
+	while (off < len && NULL != (pos = strchr(string+off, '\n'))) {
+		linelen = pos-string+off;
+		if (off + linelen > len)
+			break;
+		write_console(0, string+off, linelen);
+		write_console(0, "\r\n", 2);
+		off += linelen + 1;
+	}
+	if (off < len)
+		write_console(0, string+off, len-off);
+}
+
+struct console xenboot_console = {
+	.name		= "xenboot",
+	.write		= xenboot_write_console,
+	.flags		= CON_PRINTBUFFER | CON_BOOT,
+};
===================================================================
--- /dev/null
+++ b/include/xen/hvc-console.h
@@ -0,0 +1,6 @@
+#ifndef XEN_HVC_CONSOLE_H
+#define XEN_HVC_CONSOLE_H
+
+extern struct console xenboot_console;
+
+#endif	/* XEN_HVC_CONSOLE_H */

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 23/26] Xen-paravirt_ops: Add Xen grant table support
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
@ 2007-03-01 23:25   ` Jeremy Fitzhardinge
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Ian Pratt,
	Christian Limpach

[-- Attachment #1: xen-grant-table.patch --]
[-- Type: text/plain, Size: 17638 bytes --]

Add Xen 'grant table' driver which allows granting of access to
selected local memory pages by other virtual machines and,
symmetrically, the mapping of remote memory pages which other virtual
machines have granted access to.

This driver is a prerequisite for many of the Xen virtual device
drivers, which grant the 'device driver domain' restricted and
temporary access to only those memory pages that are currently
involved in I/O operations.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
 drivers/xen/Makefile      |    1 
 drivers/xen/grant-table.c |  454 +++++++++++++++++++++++++++++++++++++++++++++
 include/xen/grant_table.h |  107 ++++++++++
 3 files changed, 562 insertions(+)

===================================================================
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -1,1 +1,2 @@ obj-y	+= hvc-console.o
+obj-y	+= grant-table.o
 obj-y	+= hvc-console.o
===================================================================
--- /dev/null
+++ b/drivers/xen/grant-table.c
@@ -0,0 +1,454 @@
+/******************************************************************************
+ * grant_table.c
+ *
+ * Granting foreign access to our memory reservation.
+ *
+ * Copyright (c) 2005, Christopher Clark
+ * Copyright (c) 2004-2005, K A Fraser
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <xen/interface/xen.h>
+#include <xen/page.h>
+#include <xen/grant_table.h>
+
+#include <asm/pgtable.h>
+#include <asm/uaccess.h>
+#include <asm/sync_bitops.h>
+
+
+/* External tools reserve first few grant table entries. */
+#define NR_RESERVED_ENTRIES 8
+
+#define NR_GRANT_ENTRIES \
+	(NR_GRANT_FRAMES * PAGE_SIZE / sizeof(struct grant_entry))
+#define GNTTAB_LIST_END (NR_GRANT_ENTRIES + 1)
+
+static grant_ref_t gnttab_list[NR_GRANT_ENTRIES];
+static int gnttab_free_count;
+static grant_ref_t gnttab_free_head;
+static DEFINE_SPINLOCK(gnttab_list_lock);
+
+static struct grant_entry *shared;
+
+static struct gnttab_free_callback *gnttab_free_callback_list;
+
+static int get_free_entries(int count)
+{
+	unsigned long flags;
+	int ref;
+	grant_ref_t head;
+	spin_lock_irqsave(&gnttab_list_lock, flags);
+	if (gnttab_free_count < count) {
+		spin_unlock_irqrestore(&gnttab_list_lock, flags);
+		return -1;
+	}
+	ref = head = gnttab_free_head;
+	gnttab_free_count -= count;
+	while (count-- > 1)
+		head = gnttab_list[head];
+	gnttab_free_head = gnttab_list[head];
+	gnttab_list[head] = GNTTAB_LIST_END;
+	spin_unlock_irqrestore(&gnttab_list_lock, flags);
+	return ref;
+}
+
+#define get_free_entry() get_free_entries(1)
+
+static void do_free_callbacks(void)
+{
+	struct gnttab_free_callback *callback, *next;
+
+	callback = gnttab_free_callback_list;
+	gnttab_free_callback_list = NULL;
+
+	while (callback != NULL) {
+		next = callback->next;
+		if (gnttab_free_count >= callback->count) {
+			callback->next = NULL;
+			callback->fn(callback->arg);
+		} else {
+			callback->next = gnttab_free_callback_list;
+			gnttab_free_callback_list = callback;
+		}
+		callback = next;
+	}
+}
+
+static inline void check_free_callbacks(void)
+{
+	if (unlikely(gnttab_free_callback_list))
+		do_free_callbacks();
+}
+
+static void put_free_entry(grant_ref_t ref)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&gnttab_list_lock, flags);
+	gnttab_list[ref] = gnttab_free_head;
+	gnttab_free_head = ref;
+	gnttab_free_count++;
+	check_free_callbacks();
+	spin_unlock_irqrestore(&gnttab_list_lock, flags);
+}
+
+static void update_grant_entry(grant_ref_t ref, domid_t domid,
+			       unsigned long frame, unsigned flags)
+{
+	/*
+	 * Introducing a valid entry into the grant table:
+	 *  1. Write ent->domid.
+	 *  2. Write ent->frame:
+	 *      GTF_permit_access:   Frame to which access is permitted.
+	 *      GTF_accept_transfer: Pseudo-phys frame slot being filled by new
+	 *                           frame, or zero if none.
+	 *  3. Write memory barrier (WMB).
+	 *  4. Write ent->flags, inc. valid type.
+	 */
+	shared[ref].frame = frame;
+	shared[ref].domid = domid;
+	wmb();
+	shared[ref].flags = flags;
+}
+
+/*
+ * Public grant-issuing interface functions
+ */
+void gnttab_grant_foreign_access_ref(grant_ref_t ref, domid_t domid,
+				     unsigned long frame, int readonly)
+{
+	update_grant_entry(ref, domid, frame,
+			   GTF_permit_access | (readonly ? GTF_readonly : 0));
+}
+EXPORT_SYMBOL_GPL(gnttab_grant_foreign_access_ref);
+
+int gnttab_grant_foreign_access(domid_t domid, unsigned long frame,
+				int readonly)
+{
+	int ref;
+
+	if (unlikely((ref = get_free_entry()) == -1))
+		return -ENOSPC;
+
+	gnttab_grant_foreign_access_ref(ref, domid, frame, readonly);
+
+	return ref;
+}
+EXPORT_SYMBOL_GPL(gnttab_grant_foreign_access);
+
+int gnttab_query_foreign_access(grant_ref_t ref)
+{
+	u16 nflags;
+
+	nflags = shared[ref].flags;
+
+	return (nflags & (GTF_reading|GTF_writing));
+}
+EXPORT_SYMBOL_GPL(gnttab_query_foreign_access);
+
+int gnttab_end_foreign_access_ref(grant_ref_t ref, int readonly)
+{
+	u16 flags, nflags;
+
+	nflags = shared[ref].flags;
+	do {
+		if ((flags = nflags) & (GTF_reading|GTF_writing)) {
+			printk(KERN_ALERT "WARNING: g.e. still in use!\n");
+			return 0;
+		}
+	} while ((nflags = sync_cmpxchg(&shared[ref].flags, flags, 0)) !=
+		 flags);
+
+	return 1;
+}
+EXPORT_SYMBOL_GPL(gnttab_end_foreign_access_ref);
+
+void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
+			       unsigned long page)
+{
+	if (gnttab_end_foreign_access_ref(ref, readonly)) {
+		put_free_entry(ref);
+		if (page != 0)
+			free_page(page);
+	} else {
+		/* XXX This needs to be fixed so that the ref and page are
+		   placed on a list to be freed up later. */
+		printk(KERN_WARNING
+		       "WARNING: leaking g.e. and page still in use!\n");
+	}
+}
+EXPORT_SYMBOL_GPL(gnttab_end_foreign_access);
+
+int gnttab_grant_foreign_transfer(domid_t domid, unsigned long pfn)
+{
+	int ref;
+
+	if (unlikely((ref = get_free_entry()) == -1))
+		return -ENOSPC;
+	gnttab_grant_foreign_transfer_ref(ref, domid, pfn);
+
+	return ref;
+}
+EXPORT_SYMBOL_GPL(gnttab_grant_foreign_transfer);
+
+void gnttab_grant_foreign_transfer_ref(grant_ref_t ref, domid_t domid,
+				       unsigned long pfn)
+{
+	update_grant_entry(ref, domid, pfn, GTF_accept_transfer);
+}
+EXPORT_SYMBOL_GPL(gnttab_grant_foreign_transfer_ref);
+
+unsigned long gnttab_end_foreign_transfer_ref(grant_ref_t ref)
+{
+	unsigned long frame;
+	u16           flags;
+
+	/*
+	 * If a transfer is not even yet started, try to reclaim the grant
+	 * reference and return failure (== 0).
+	 */
+	while (!((flags = shared[ref].flags) & GTF_transfer_committed)) {
+		if (sync_cmpxchg(&shared[ref].flags, flags, 0) == flags)
+			return 0;
+		cpu_relax();
+	}
+
+	/* If a transfer is in progress then wait until it is completed. */
+	while (!(flags & GTF_transfer_completed)) {
+		flags = shared[ref].flags;
+		cpu_relax();
+	}
+
+	rmb();	/* Read the frame number /after/ reading completion status. */
+	frame = shared[ref].frame;
+	BUG_ON(frame == 0);
+
+	return frame;
+}
+EXPORT_SYMBOL_GPL(gnttab_end_foreign_transfer_ref);
+
+unsigned long gnttab_end_foreign_transfer(grant_ref_t ref)
+{
+	unsigned long frame = gnttab_end_foreign_transfer_ref(ref);
+	put_free_entry(ref);
+	return frame;
+}
+EXPORT_SYMBOL_GPL(gnttab_end_foreign_transfer);
+
+void gnttab_free_grant_reference(grant_ref_t ref)
+{
+	put_free_entry(ref);
+}
+EXPORT_SYMBOL_GPL(gnttab_free_grant_reference);
+
+void gnttab_free_grant_references(grant_ref_t head)
+{
+	grant_ref_t ref;
+	unsigned long flags;
+	int count = 1;
+	if (head == GNTTAB_LIST_END)
+		return;
+	spin_lock_irqsave(&gnttab_list_lock, flags);
+	ref = head;
+	while (gnttab_list[ref] != GNTTAB_LIST_END) {
+		ref = gnttab_list[ref];
+		count++;
+	}
+	gnttab_list[ref] = gnttab_free_head;
+	gnttab_free_head = head;
+	gnttab_free_count += count;
+	check_free_callbacks();
+	spin_unlock_irqrestore(&gnttab_list_lock, flags);
+}
+EXPORT_SYMBOL_GPL(gnttab_free_grant_references);
+
+int gnttab_alloc_grant_references(u16 count, grant_ref_t *head)
+{
+	int h = get_free_entries(count);
+
+	if (h == -1)
+		return -ENOSPC;
+
+	*head = h;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(gnttab_alloc_grant_references);
+
+int gnttab_empty_grant_references(const grant_ref_t *private_head)
+{
+	return (*private_head == GNTTAB_LIST_END);
+}
+EXPORT_SYMBOL_GPL(gnttab_empty_grant_references);
+
+int gnttab_claim_grant_reference(grant_ref_t *private_head)
+{
+	grant_ref_t g = *private_head;
+	if (unlikely(g == GNTTAB_LIST_END))
+		return -ENOSPC;
+	*private_head = gnttab_list[g];
+	return g;
+}
+EXPORT_SYMBOL_GPL(gnttab_claim_grant_reference);
+
+void gnttab_release_grant_reference(grant_ref_t *private_head,
+				    grant_ref_t release)
+{
+	gnttab_list[release] = *private_head;
+	*private_head = release;
+}
+EXPORT_SYMBOL_GPL(gnttab_release_grant_reference);
+
+void gnttab_request_free_callback(struct gnttab_free_callback *callback,
+				  void (*fn)(void *), void *arg, u16 count)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&gnttab_list_lock, flags);
+	if (callback->next)
+		goto out;
+	callback->fn = fn;
+	callback->arg = arg;
+	callback->count = count;
+	callback->next = gnttab_free_callback_list;
+	gnttab_free_callback_list = callback;
+	check_free_callbacks();
+out:
+	spin_unlock_irqrestore(&gnttab_list_lock, flags);
+}
+EXPORT_SYMBOL_GPL(gnttab_request_free_callback);
+
+void gnttab_cancel_free_callback(struct gnttab_free_callback *callback)
+{
+	struct gnttab_free_callback **pcb;
+	unsigned long flags;
+
+	spin_lock_irqsave(&gnttab_list_lock, flags);
+	for (pcb = &gnttab_free_callback_list; *pcb; pcb = &(*pcb)->next) {
+		if (*pcb == callback) {
+			*pcb = callback->next;
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&gnttab_list_lock, flags);
+}
+EXPORT_SYMBOL_GPL(gnttab_cancel_free_callback);
+
+#ifndef __ia64__
+static int map_pte_fn(pte_t *pte, struct page *pmd_page,
+		      unsigned long addr, void *data)
+{
+	unsigned long **frames = (unsigned long **)data;
+
+	set_pte_at(&init_mm, addr, pte, mfn_pte((*frames)[0], PAGE_KERNEL));
+	(*frames)++;
+	return 0;
+}
+
+static int unmap_pte_fn(pte_t *pte, struct page *pmd_page,
+			unsigned long addr, void *data)
+{
+
+	set_pte_at(&init_mm, addr, pte, __pte(0));
+	return 0;
+}
+#endif
+
+int gnttab_resume(void)
+{
+	struct gnttab_setup_table setup;
+	unsigned long frames[NR_GRANT_FRAMES];
+	int rc;
+
+	setup.dom        = DOMID_SELF;
+	setup.nr_frames  = NR_GRANT_FRAMES;
+	setup.frame_list = frames;
+
+	rc = HYPERVISOR_grant_table_op(GNTTABOP_setup_table, &setup, 1);
+	if (rc == -ENOSYS)
+		return -ENOSYS;
+
+	BUG_ON(rc || setup.status);
+
+#ifndef __ia64__
+	{
+		void *pframes = frames;
+		struct vm_struct *area;
+
+		if (shared == NULL) {
+			area = get_vm_area(PAGE_SIZE * NR_GRANT_FRAMES,
+					   VM_IOREMAP);
+			BUG_ON(area == NULL);
+			shared = area->addr;
+		}
+		rc = apply_to_page_range(&init_mm, (unsigned long)shared,
+					 PAGE_SIZE * NR_GRANT_FRAMES,
+					 map_pte_fn, &pframes);
+		BUG_ON(rc);
+	}
+#else
+	shared = __va(frames[0] << PAGE_SHIFT);
+	printk("grant table at %p\n", shared);
+#endif
+
+	return 0;
+}
+
+int gnttab_suspend(void)
+{
+
+#ifndef __ia64__
+	apply_to_page_range(&init_mm, (unsigned long)shared,
+			    PAGE_SIZE * NR_GRANT_FRAMES,
+			    unmap_pte_fn, NULL);
+#endif
+
+	return 0;
+}
+
+static int __init gnttab_init(void)
+{
+	int i;
+
+	if (!is_running_on_xen())
+		return -ENODEV;
+
+	if (gnttab_resume() < 0)
+		return -ENODEV;
+
+	for (i = NR_RESERVED_ENTRIES; i < NR_GRANT_ENTRIES; i++)
+		gnttab_list[i] = i + 1;
+	gnttab_free_count = NR_GRANT_ENTRIES - NR_RESERVED_ENTRIES;
+	gnttab_free_head  = NR_RESERVED_ENTRIES;
+
+	printk("Grant table initialized\n");
+	return 0;
+}
+
+core_initcall(gnttab_init);
===================================================================
--- /dev/null
+++ b/include/xen/grant_table.h
@@ -0,0 +1,107 @@
+/******************************************************************************
+ * grant_table.h
+ *
+ * Two sets of functionality:
+ * 1. Granting foreign access to our memory reservation.
+ * 2. Accessing others' memory reservations via grant references.
+ * (i.e., mechanisms for both sender and recipient of grant references)
+ *
+ * Copyright (c) 2004-2005, K A Fraser
+ * Copyright (c) 2005, Christopher Clark
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef __ASM_GNTTAB_H__
+#define __ASM_GNTTAB_H__
+
+#include <asm/xen/hypervisor.h>
+#include <xen/interface/grant_table.h>
+
+/* NR_GRANT_FRAMES must be less than or equal to that configured in Xen */
+#define NR_GRANT_FRAMES 4
+
+struct gnttab_free_callback {
+	struct gnttab_free_callback *next;
+	void (*fn)(void *);
+	void *arg;
+	u16 count;
+};
+
+int gnttab_grant_foreign_access(domid_t domid, unsigned long frame,
+				int readonly);
+
+/*
+ * End access through the given grant reference, iff the grant entry is no
+ * longer in use.  Return 1 if the grant entry was freed, 0 if it is still in
+ * use.
+ */
+int gnttab_end_foreign_access_ref(grant_ref_t ref, int readonly);
+
+/*
+ * Eventually end access through the given grant reference, and once that
+ * access has been ended, free the given page too.  Access will be ended
+ * immediately iff the grant entry is not in use, otherwise it will happen
+ * some time later.  page may be 0, in which case no freeing will occur.
+ */
+void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
+			       unsigned long page);
+
+int gnttab_grant_foreign_transfer(domid_t domid, unsigned long pfn);
+
+unsigned long gnttab_end_foreign_transfer_ref(grant_ref_t ref);
+unsigned long gnttab_end_foreign_transfer(grant_ref_t ref);
+
+int gnttab_query_foreign_access(grant_ref_t ref);
+
+/*
+ * operations on reserved batches of grant references
+ */
+int gnttab_alloc_grant_references(u16 count, grant_ref_t *pprivate_head);
+
+void gnttab_free_grant_reference(grant_ref_t ref);
+
+void gnttab_free_grant_references(grant_ref_t head);
+
+int gnttab_empty_grant_references(const grant_ref_t *pprivate_head);
+
+int gnttab_claim_grant_reference(grant_ref_t *pprivate_head);
+
+void gnttab_release_grant_reference(grant_ref_t *private_head,
+				    grant_ref_t release);
+
+void gnttab_request_free_callback(struct gnttab_free_callback *callback,
+				  void (*fn)(void *), void *arg, u16 count);
+void gnttab_cancel_free_callback(struct gnttab_free_callback *callback);
+
+void gnttab_grant_foreign_access_ref(grant_ref_t ref, domid_t domid,
+				     unsigned long frame, int readonly);
+
+void gnttab_grant_foreign_transfer_ref(grant_ref_t, domid_t domid,
+				       unsigned long pfn);
+
+#define gnttab_map_vaddr(map) ((void *)(map.host_virt_addr))
+
+#endif /* __ASM_GNTTAB_H__ */

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 23/26] Xen-paravirt_ops: Add Xen grant table support
@ 2007-03-01 23:25   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zachary Amsden, xen-devel, Ian Pratt, Rusty Russell,
	linux-kernel, Chris Wright, virtualization, Andrew Morton,
	Christian Limpach

[-- Attachment #1: xen-grant-table.patch --]
[-- Type: text/plain, Size: 17637 bytes --]

Add Xen 'grant table' driver which allows granting of access to
selected local memory pages by other virtual machines and,
symmetrically, the mapping of remote memory pages which other virtual
machines have granted access to.

This driver is a prerequisite for many of the Xen virtual device
drivers, which grant the 'device driver domain' restricted and
temporary access to only those memory pages that are currently
involved in I/O operations.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
 drivers/xen/Makefile      |    1 
 drivers/xen/grant-table.c |  454 +++++++++++++++++++++++++++++++++++++++++++++
 include/xen/grant_table.h |  107 ++++++++++
 3 files changed, 562 insertions(+)

===================================================================
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -1,1 +1,2 @@ obj-y	+= hvc-console.o
+obj-y	+= grant-table.o
 obj-y	+= hvc-console.o
===================================================================
--- /dev/null
+++ b/drivers/xen/grant-table.c
@@ -0,0 +1,454 @@
+/******************************************************************************
+ * grant_table.c
+ *
+ * Granting foreign access to our memory reservation.
+ *
+ * Copyright (c) 2005, Christopher Clark
+ * Copyright (c) 2004-2005, K A Fraser
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <xen/interface/xen.h>
+#include <xen/page.h>
+#include <xen/grant_table.h>
+
+#include <asm/pgtable.h>
+#include <asm/uaccess.h>
+#include <asm/sync_bitops.h>
+
+
+/* External tools reserve first few grant table entries. */
+#define NR_RESERVED_ENTRIES 8
+
+#define NR_GRANT_ENTRIES \
+	(NR_GRANT_FRAMES * PAGE_SIZE / sizeof(struct grant_entry))
+#define GNTTAB_LIST_END (NR_GRANT_ENTRIES + 1)
+
+static grant_ref_t gnttab_list[NR_GRANT_ENTRIES];
+static int gnttab_free_count;
+static grant_ref_t gnttab_free_head;
+static DEFINE_SPINLOCK(gnttab_list_lock);
+
+static struct grant_entry *shared;
+
+static struct gnttab_free_callback *gnttab_free_callback_list;
+
+static int get_free_entries(int count)
+{
+	unsigned long flags;
+	int ref;
+	grant_ref_t head;
+	spin_lock_irqsave(&gnttab_list_lock, flags);
+	if (gnttab_free_count < count) {
+		spin_unlock_irqrestore(&gnttab_list_lock, flags);
+		return -1;
+	}
+	ref = head = gnttab_free_head;
+	gnttab_free_count -= count;
+	while (count-- > 1)
+		head = gnttab_list[head];
+	gnttab_free_head = gnttab_list[head];
+	gnttab_list[head] = GNTTAB_LIST_END;
+	spin_unlock_irqrestore(&gnttab_list_lock, flags);
+	return ref;
+}
+
+#define get_free_entry() get_free_entries(1)
+
+static void do_free_callbacks(void)
+{
+	struct gnttab_free_callback *callback, *next;
+
+	callback = gnttab_free_callback_list;
+	gnttab_free_callback_list = NULL;
+
+	while (callback != NULL) {
+		next = callback->next;
+		if (gnttab_free_count >= callback->count) {
+			callback->next = NULL;
+			callback->fn(callback->arg);
+		} else {
+			callback->next = gnttab_free_callback_list;
+			gnttab_free_callback_list = callback;
+		}
+		callback = next;
+	}
+}
+
+static inline void check_free_callbacks(void)
+{
+	if (unlikely(gnttab_free_callback_list))
+		do_free_callbacks();
+}
+
+static void put_free_entry(grant_ref_t ref)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&gnttab_list_lock, flags);
+	gnttab_list[ref] = gnttab_free_head;
+	gnttab_free_head = ref;
+	gnttab_free_count++;
+	check_free_callbacks();
+	spin_unlock_irqrestore(&gnttab_list_lock, flags);
+}
+
+static void update_grant_entry(grant_ref_t ref, domid_t domid,
+			       unsigned long frame, unsigned flags)
+{
+	/*
+	 * Introducing a valid entry into the grant table:
+	 *  1. Write ent->domid.
+	 *  2. Write ent->frame:
+	 *      GTF_permit_access:   Frame to which access is permitted.
+	 *      GTF_accept_transfer: Pseudo-phys frame slot being filled by new
+	 *                           frame, or zero if none.
+	 *  3. Write memory barrier (WMB).
+	 *  4. Write ent->flags, inc. valid type.
+	 */
+	shared[ref].frame = frame;
+	shared[ref].domid = domid;
+	wmb();
+	shared[ref].flags = flags;
+}
+
+/*
+ * Public grant-issuing interface functions
+ */
+void gnttab_grant_foreign_access_ref(grant_ref_t ref, domid_t domid,
+				     unsigned long frame, int readonly)
+{
+	update_grant_entry(ref, domid, frame,
+			   GTF_permit_access | (readonly ? GTF_readonly : 0));
+}
+EXPORT_SYMBOL_GPL(gnttab_grant_foreign_access_ref);
+
+int gnttab_grant_foreign_access(domid_t domid, unsigned long frame,
+				int readonly)
+{
+	int ref;
+
+	if (unlikely((ref = get_free_entry()) == -1))
+		return -ENOSPC;
+
+	gnttab_grant_foreign_access_ref(ref, domid, frame, readonly);
+
+	return ref;
+}
+EXPORT_SYMBOL_GPL(gnttab_grant_foreign_access);
+
+int gnttab_query_foreign_access(grant_ref_t ref)
+{
+	u16 nflags;
+
+	nflags = shared[ref].flags;
+
+	return (nflags & (GTF_reading|GTF_writing));
+}
+EXPORT_SYMBOL_GPL(gnttab_query_foreign_access);
+
+int gnttab_end_foreign_access_ref(grant_ref_t ref, int readonly)
+{
+	u16 flags, nflags;
+
+	nflags = shared[ref].flags;
+	do {
+		if ((flags = nflags) & (GTF_reading|GTF_writing)) {
+			printk(KERN_ALERT "WARNING: g.e. still in use!\n");
+			return 0;
+		}
+	} while ((nflags = sync_cmpxchg(&shared[ref].flags, flags, 0)) !=
+		 flags);
+
+	return 1;
+}
+EXPORT_SYMBOL_GPL(gnttab_end_foreign_access_ref);
+
+void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
+			       unsigned long page)
+{
+	if (gnttab_end_foreign_access_ref(ref, readonly)) {
+		put_free_entry(ref);
+		if (page != 0)
+			free_page(page);
+	} else {
+		/* XXX This needs to be fixed so that the ref and page are
+		   placed on a list to be freed up later. */
+		printk(KERN_WARNING
+		       "WARNING: leaking g.e. and page still in use!\n");
+	}
+}
+EXPORT_SYMBOL_GPL(gnttab_end_foreign_access);
+
+int gnttab_grant_foreign_transfer(domid_t domid, unsigned long pfn)
+{
+	int ref;
+
+	if (unlikely((ref = get_free_entry()) == -1))
+		return -ENOSPC;
+	gnttab_grant_foreign_transfer_ref(ref, domid, pfn);
+
+	return ref;
+}
+EXPORT_SYMBOL_GPL(gnttab_grant_foreign_transfer);
+
+void gnttab_grant_foreign_transfer_ref(grant_ref_t ref, domid_t domid,
+				       unsigned long pfn)
+{
+	update_grant_entry(ref, domid, pfn, GTF_accept_transfer);
+}
+EXPORT_SYMBOL_GPL(gnttab_grant_foreign_transfer_ref);
+
+unsigned long gnttab_end_foreign_transfer_ref(grant_ref_t ref)
+{
+	unsigned long frame;
+	u16           flags;
+
+	/*
+	 * If a transfer is not even yet started, try to reclaim the grant
+	 * reference and return failure (== 0).
+	 */
+	while (!((flags = shared[ref].flags) & GTF_transfer_committed)) {
+		if (sync_cmpxchg(&shared[ref].flags, flags, 0) == flags)
+			return 0;
+		cpu_relax();
+	}
+
+	/* If a transfer is in progress then wait until it is completed. */
+	while (!(flags & GTF_transfer_completed)) {
+		flags = shared[ref].flags;
+		cpu_relax();
+	}
+
+	rmb();	/* Read the frame number /after/ reading completion status. */
+	frame = shared[ref].frame;
+	BUG_ON(frame == 0);
+
+	return frame;
+}
+EXPORT_SYMBOL_GPL(gnttab_end_foreign_transfer_ref);
+
+unsigned long gnttab_end_foreign_transfer(grant_ref_t ref)
+{
+	unsigned long frame = gnttab_end_foreign_transfer_ref(ref);
+	put_free_entry(ref);
+	return frame;
+}
+EXPORT_SYMBOL_GPL(gnttab_end_foreign_transfer);
+
+void gnttab_free_grant_reference(grant_ref_t ref)
+{
+	put_free_entry(ref);
+}
+EXPORT_SYMBOL_GPL(gnttab_free_grant_reference);
+
+void gnttab_free_grant_references(grant_ref_t head)
+{
+	grant_ref_t ref;
+	unsigned long flags;
+	int count = 1;
+	if (head == GNTTAB_LIST_END)
+		return;
+	spin_lock_irqsave(&gnttab_list_lock, flags);
+	ref = head;
+	while (gnttab_list[ref] != GNTTAB_LIST_END) {
+		ref = gnttab_list[ref];
+		count++;
+	}
+	gnttab_list[ref] = gnttab_free_head;
+	gnttab_free_head = head;
+	gnttab_free_count += count;
+	check_free_callbacks();
+	spin_unlock_irqrestore(&gnttab_list_lock, flags);
+}
+EXPORT_SYMBOL_GPL(gnttab_free_grant_references);
+
+int gnttab_alloc_grant_references(u16 count, grant_ref_t *head)
+{
+	int h = get_free_entries(count);
+
+	if (h == -1)
+		return -ENOSPC;
+
+	*head = h;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(gnttab_alloc_grant_references);
+
+int gnttab_empty_grant_references(const grant_ref_t *private_head)
+{
+	return (*private_head == GNTTAB_LIST_END);
+}
+EXPORT_SYMBOL_GPL(gnttab_empty_grant_references);
+
+int gnttab_claim_grant_reference(grant_ref_t *private_head)
+{
+	grant_ref_t g = *private_head;
+	if (unlikely(g == GNTTAB_LIST_END))
+		return -ENOSPC;
+	*private_head = gnttab_list[g];
+	return g;
+}
+EXPORT_SYMBOL_GPL(gnttab_claim_grant_reference);
+
+void gnttab_release_grant_reference(grant_ref_t *private_head,
+				    grant_ref_t release)
+{
+	gnttab_list[release] = *private_head;
+	*private_head = release;
+}
+EXPORT_SYMBOL_GPL(gnttab_release_grant_reference);
+
+void gnttab_request_free_callback(struct gnttab_free_callback *callback,
+				  void (*fn)(void *), void *arg, u16 count)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&gnttab_list_lock, flags);
+	if (callback->next)
+		goto out;
+	callback->fn = fn;
+	callback->arg = arg;
+	callback->count = count;
+	callback->next = gnttab_free_callback_list;
+	gnttab_free_callback_list = callback;
+	check_free_callbacks();
+out:
+	spin_unlock_irqrestore(&gnttab_list_lock, flags);
+}
+EXPORT_SYMBOL_GPL(gnttab_request_free_callback);
+
+void gnttab_cancel_free_callback(struct gnttab_free_callback *callback)
+{
+	struct gnttab_free_callback **pcb;
+	unsigned long flags;
+
+	spin_lock_irqsave(&gnttab_list_lock, flags);
+	for (pcb = &gnttab_free_callback_list; *pcb; pcb = &(*pcb)->next) {
+		if (*pcb == callback) {
+			*pcb = callback->next;
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&gnttab_list_lock, flags);
+}
+EXPORT_SYMBOL_GPL(gnttab_cancel_free_callback);
+
+#ifndef __ia64__
+static int map_pte_fn(pte_t *pte, struct page *pmd_page,
+		      unsigned long addr, void *data)
+{
+	unsigned long **frames = (unsigned long **)data;
+
+	set_pte_at(&init_mm, addr, pte, mfn_pte((*frames)[0], PAGE_KERNEL));
+	(*frames)++;
+	return 0;
+}
+
+static int unmap_pte_fn(pte_t *pte, struct page *pmd_page,
+			unsigned long addr, void *data)
+{
+
+	set_pte_at(&init_mm, addr, pte, __pte(0));
+	return 0;
+}
+#endif
+
+int gnttab_resume(void)
+{
+	struct gnttab_setup_table setup;
+	unsigned long frames[NR_GRANT_FRAMES];
+	int rc;
+
+	setup.dom        = DOMID_SELF;
+	setup.nr_frames  = NR_GRANT_FRAMES;
+	setup.frame_list = frames;
+
+	rc = HYPERVISOR_grant_table_op(GNTTABOP_setup_table, &setup, 1);
+	if (rc == -ENOSYS)
+		return -ENOSYS;
+
+	BUG_ON(rc || setup.status);
+
+#ifndef __ia64__
+	{
+		void *pframes = frames;
+		struct vm_struct *area;
+
+		if (shared == NULL) {
+			area = get_vm_area(PAGE_SIZE * NR_GRANT_FRAMES,
+					   VM_IOREMAP);
+			BUG_ON(area == NULL);
+			shared = area->addr;
+		}
+		rc = apply_to_page_range(&init_mm, (unsigned long)shared,
+					 PAGE_SIZE * NR_GRANT_FRAMES,
+					 map_pte_fn, &pframes);
+		BUG_ON(rc);
+	}
+#else
+	shared = __va(frames[0] << PAGE_SHIFT);
+	printk("grant table at %p\n", shared);
+#endif
+
+	return 0;
+}
+
+int gnttab_suspend(void)
+{
+
+#ifndef __ia64__
+	apply_to_page_range(&init_mm, (unsigned long)shared,
+			    PAGE_SIZE * NR_GRANT_FRAMES,
+			    unmap_pte_fn, NULL);
+#endif
+
+	return 0;
+}
+
+static int __init gnttab_init(void)
+{
+	int i;
+
+	if (!is_running_on_xen())
+		return -ENODEV;
+
+	if (gnttab_resume() < 0)
+		return -ENODEV;
+
+	for (i = NR_RESERVED_ENTRIES; i < NR_GRANT_ENTRIES; i++)
+		gnttab_list[i] = i + 1;
+	gnttab_free_count = NR_GRANT_ENTRIES - NR_RESERVED_ENTRIES;
+	gnttab_free_head  = NR_RESERVED_ENTRIES;
+
+	printk("Grant table initialized\n");
+	return 0;
+}
+
+core_initcall(gnttab_init);
===================================================================
--- /dev/null
+++ b/include/xen/grant_table.h
@@ -0,0 +1,107 @@
+/******************************************************************************
+ * grant_table.h
+ *
+ * Two sets of functionality:
+ * 1. Granting foreign access to our memory reservation.
+ * 2. Accessing others' memory reservations via grant references.
+ * (i.e., mechanisms for both sender and recipient of grant references)
+ *
+ * Copyright (c) 2004-2005, K A Fraser
+ * Copyright (c) 2005, Christopher Clark
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef __ASM_GNTTAB_H__
+#define __ASM_GNTTAB_H__
+
+#include <asm/xen/hypervisor.h>
+#include <xen/interface/grant_table.h>
+
+/* NR_GRANT_FRAMES must be less than or equal to that configured in Xen */
+#define NR_GRANT_FRAMES 4
+
+struct gnttab_free_callback {
+	struct gnttab_free_callback *next;
+	void (*fn)(void *);
+	void *arg;
+	u16 count;
+};
+
+int gnttab_grant_foreign_access(domid_t domid, unsigned long frame,
+				int readonly);
+
+/*
+ * End access through the given grant reference, iff the grant entry is no
+ * longer in use.  Return 1 if the grant entry was freed, 0 if it is still in
+ * use.
+ */
+int gnttab_end_foreign_access_ref(grant_ref_t ref, int readonly);
+
+/*
+ * Eventually end access through the given grant reference, and once that
+ * access has been ended, free the given page too.  Access will be ended
+ * immediately iff the grant entry is not in use, otherwise it will happen
+ * some time later.  page may be 0, in which case no freeing will occur.
+ */
+void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
+			       unsigned long page);
+
+int gnttab_grant_foreign_transfer(domid_t domid, unsigned long pfn);
+
+unsigned long gnttab_end_foreign_transfer_ref(grant_ref_t ref);
+unsigned long gnttab_end_foreign_transfer(grant_ref_t ref);
+
+int gnttab_query_foreign_access(grant_ref_t ref);
+
+/*
+ * operations on reserved batches of grant references
+ */
+int gnttab_alloc_grant_references(u16 count, grant_ref_t *pprivate_head);
+
+void gnttab_free_grant_reference(grant_ref_t ref);
+
+void gnttab_free_grant_references(grant_ref_t head);
+
+int gnttab_empty_grant_references(const grant_ref_t *pprivate_head);
+
+int gnttab_claim_grant_reference(grant_ref_t *pprivate_head);
+
+void gnttab_release_grant_reference(grant_ref_t *private_head,
+				    grant_ref_t release);
+
+void gnttab_request_free_callback(struct gnttab_free_callback *callback,
+				  void (*fn)(void *), void *arg, u16 count);
+void gnttab_cancel_free_callback(struct gnttab_free_callback *callback);
+
+void gnttab_grant_foreign_access_ref(grant_ref_t ref, domid_t domid,
+				     unsigned long frame, int readonly);
+
+void gnttab_grant_foreign_transfer_ref(grant_ref_t, domid_t domid,
+				       unsigned long pfn);
+
+#define gnttab_map_vaddr(map) ((void *)(map.host_virt_addr))
+
+#endif /* __ASM_GNTTAB_H__ */

-- 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (22 preceding siblings ...)
  2007-03-01 23:25   ` Jeremy Fitzhardinge
@ 2007-03-01 23:25 ` Jeremy Fitzhardinge
  2007-03-16  8:47     ` Ingo Molnar
  2007-03-01 23:25 ` [patch 25/26] Xen-paravirt_ops: Add Xen virtual block device driver Jeremy Fitzhardinge
                   ` (3 subsequent siblings)
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Ian Pratt,
	Christian Limpach

[-- Attachment #1: xenbus.patch --]
[-- Type: text/plain, Size: 89115 bytes --]

This communicates with the machine control software via a registry
residing in a controlling virtual machine. This allows dynamic
creation, destruction and modification of virtual device
configurations (network devices, block devices and CPUS, to name some
examples).

Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
 arch/i386/xen/Kconfig                     |    5 
 drivers/xen/Makefile                      |    1 
 drivers/xen/xenbus/Makefile               |    9 
 drivers/xen/xenbus/xenbus_client.c        |  563 ++++++++++++++++++
 drivers/xen/xenbus/xenbus_comms.c         |  218 +++++++
 drivers/xen/xenbus/xenbus_comms.h         |   44 +
 drivers/xen/xenbus/xenbus_probe.c         |  891 +++++++++++++++++++++++++++++
 drivers/xen/xenbus/xenbus_probe.h         |   72 ++
 drivers/xen/xenbus/xenbus_probe_backend.c |  269 ++++++++
 drivers/xen/xenbus/xenbus_xs.c            |  840 +++++++++++++++++++++++++++
 include/asm-i386/xen/hypervisor.h         |    1 
 include/xen/xenbus.h                      |  208 ++++++
 12 files changed, 3121 insertions(+)

===================================================================
--- a/arch/i386/xen/Kconfig
+++ b/arch/i386/xen/Kconfig
@@ -9,3 +9,8 @@ config XEN
 	default y
 	help
 	  This is the Linux Xen port.
+
+config XEN_BACKEND
+	bool
+	depends XEN
+	default n
===================================================================
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -1,2 +1,3 @@ obj-y	+= grant-table.o
 obj-y	+= grant-table.o
 obj-y	+= hvc-console.o
+obj-y	+= xenbus/
===================================================================
--- /dev/null
+++ b/drivers/xen/xenbus/Makefile
@@ -0,0 +1,9 @@
+obj-y	+= xenbus.o
+
+xenbus-objs =
+xenbus-objs += xenbus_client.o
+xenbus-objs += xenbus_comms.o
+xenbus-objs += xenbus_xs.o
+xenbus-objs += xenbus_probe.o
+
+obj-$(CONFIG_XEN_BACKEND) += xenbus_probe_backend.o
===================================================================
--- /dev/null
+++ b/drivers/xen/xenbus/xenbus_client.c
@@ -0,0 +1,563 @@
+/******************************************************************************
+ * Client-facing interface for the Xenbus driver.  In other words, the
+ * interface between the Xenbus and the device-specific code, be it the
+ * frontend or the backend of that driver.
+ *
+ * Copyright (C) 2005 XenSource Ltd
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/types.h>
+#include <linux/vmalloc.h>
+#include <asm/xen/hypervisor.h>
+#include <xen/interface/xen.h>
+#include <xen/interface/event_channel.h>
+#include <xen/events.h>
+#include <xen/grant_table.h>
+#include <xen/xenbus.h>
+
+char *xenbus_strstate(enum xenbus_state state)
+{
+	static char *name[] = {
+		[ XenbusStateUnknown      ] = "Unknown",
+		[ XenbusStateInitialising ] = "Initialising",
+		[ XenbusStateInitWait     ] = "InitWait",
+		[ XenbusStateInitialised  ] = "Initialised",
+		[ XenbusStateConnected    ] = "Connected",
+		[ XenbusStateClosing      ] = "Closing",
+		[ XenbusStateClosed	  ] = "Closed",
+	};
+	return (state < ARRAY_SIZE(name)) ? name[state] : "INVALID";
+}
+EXPORT_SYMBOL_GPL(xenbus_strstate);
+
+/**
+ * xenbus_watch_path - register a watch
+ * @dev: xenbus device
+ * @path: path to watch
+ * @watch: watch to register
+ * @callback: callback to register
+ *
+ * Register a @watch on the given path, using the given xenbus_watch structure
+ * for storage, and the given @callback function as the callback.  Return 0 on
+ * success, or -errno on error.  On success, the given @path will be saved as
+ * @watch->node, and remains the caller's to free.  On error, @watch->node will
+ * be NULL, the device will switch to %XenbusStateClosing, and the error will
+ * be saved in the store.
+ */
+int xenbus_watch_path(struct xenbus_device *dev, const char *path,
+		      struct xenbus_watch *watch,
+		      void (*callback)(struct xenbus_watch *,
+				       const char **, unsigned int))
+{
+	int err;
+
+	watch->node = path;
+	watch->callback = callback;
+
+	err = register_xenbus_watch(watch);
+
+	if (err) {
+		watch->node = NULL;
+		watch->callback = NULL;
+		xenbus_dev_fatal(dev, err, "adding watch on %s", path);
+	}
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(xenbus_watch_path);
+
+
+/**
+ * xenbus_watch_path2 - register a watch on path/path2
+ * @dev: xenbus device
+ * @path: first half of path to watch
+ * @path2: second half of path to watch
+ * @watch: watch to register
+ * @callback: callback to register
+ *
+ * Register a watch on the given @path/@path2, using the given xenbus_watch
+ * structure for storage, and the given @callback function as the callback.
+ * Return 0 on success, or -errno on error.  On success, the watched path
+ * (@path/@path2) will be saved as @watch->node, and becomes the caller's to
+ * kfree().  On error, watch->node will be NULL, so the caller has nothing to
+ * free, the device will switch to %XenbusStateClosing, and the error will be
+ * saved in the store.
+ */
+int xenbus_watch_path2(struct xenbus_device *dev, const char *path,
+		       const char *path2, struct xenbus_watch *watch,
+		       void (*callback)(struct xenbus_watch *,
+					const char **, unsigned int))
+{
+	int err;
+	char *state = kasprintf(GFP_KERNEL, "%s/%s", path, path2);
+	if (!state) {
+		xenbus_dev_fatal(dev, -ENOMEM, "allocating path for watch");
+		return -ENOMEM;
+	}
+	err = xenbus_watch_path(dev, state, watch, callback);
+
+	if (err)
+		kfree(state);
+	return err;
+}
+EXPORT_SYMBOL_GPL(xenbus_watch_path2);
+
+
+/**
+ * xenbus_switch_state
+ * @dev: xenbus device
+ * @xbt: transaction handle
+ * @state: new state
+ *
+ * Advertise in the store a change of the given driver to the given new_state.
+ * Return 0 on success, or -errno on error.  On error, the device will switch
+ * to XenbusStateClosing, and the error will be saved in the store.
+ */
+int xenbus_switch_state(struct xenbus_device *dev, enum xenbus_state state)
+{
+	/* We check whether the state is currently set to the given value, and
+	   if not, then the state is set.  We don't want to unconditionally
+	   write the given state, because we don't want to fire watches
+	   unnecessarily.  Furthermore, if the node has gone, we don't write
+	   to it, as the device will be tearing down, and we don't want to
+	   resurrect that directory.
+
+	   Note that, because of this cached value of our state, this function
+	   will not work inside a Xenstore transaction (something it was
+	   trying to in the past) because dev->state would not get reset if
+	   the transaction was aborted.
+
+	 */
+
+	int current_state;
+	int err;
+
+	if (state == dev->state)
+		return 0;
+
+	err = xenbus_scanf(XBT_NIL, dev->nodename, "state", "%d",
+			   &current_state);
+	if (err != 1)
+		return 0;
+
+	err = xenbus_printf(XBT_NIL, dev->nodename, "state", "%d", state);
+	if (err) {
+		if (state != XenbusStateClosing) /* Avoid looping */
+			xenbus_dev_fatal(dev, err, "writing new state");
+		return err;
+	}
+
+	dev->state = state;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xenbus_switch_state);
+
+int xenbus_frontend_closed(struct xenbus_device *dev)
+{
+	xenbus_switch_state(dev, XenbusStateClosed);
+	complete(&dev->down);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xenbus_frontend_closed);
+
+/**
+ * Return the path to the error node for the given device, or NULL on failure.
+ * If the value returned is non-NULL, then it is the caller's to kfree.
+ */
+static char *error_path(struct xenbus_device *dev)
+{
+	return kasprintf(GFP_KERNEL, "error/%s", dev->nodename);
+}
+
+
+static void xenbus_va_dev_error(struct xenbus_device *dev, int err,
+				const char *fmt, va_list ap)
+{
+	int ret;
+	unsigned int len;
+	char *printf_buffer = NULL;
+	char *path_buffer = NULL;
+
+#define PRINTF_BUFFER_SIZE 4096
+	printf_buffer = kmalloc(PRINTF_BUFFER_SIZE, GFP_KERNEL);
+	if (printf_buffer == NULL)
+		goto fail;
+
+	len = sprintf(printf_buffer, "%i ", -err);
+	ret = vsnprintf(printf_buffer+len, PRINTF_BUFFER_SIZE-len, fmt, ap);
+
+	BUG_ON(len + ret > PRINTF_BUFFER_SIZE-1);
+
+	dev_err(&dev->dev, "%s\n", printf_buffer);
+
+	path_buffer = error_path(dev);
+
+	if (path_buffer == NULL) {
+		dev_err(&dev->dev, "failed to write error node for %s (%s)\n",
+		       dev->nodename, printf_buffer);
+		goto fail;
+	}
+
+	if (xenbus_write(XBT_NIL, path_buffer, "error", printf_buffer) != 0) {
+		dev_err(&dev->dev, "failed to write error node for %s (%s)\n",
+		       dev->nodename, printf_buffer);
+		goto fail;
+	}
+
+fail:
+	kfree(printf_buffer);
+	kfree(path_buffer);
+}
+
+
+/**
+ * xenbus_dev_error
+ * @dev: xenbus device
+ * @err: error to report
+ * @fmt: error message format
+ *
+ * Report the given negative errno into the store, along with the given
+ * formatted message.
+ */
+void xenbus_dev_error(struct xenbus_device *dev, int err, const char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	xenbus_va_dev_error(dev, err, fmt, ap);
+	va_end(ap);
+}
+EXPORT_SYMBOL_GPL(xenbus_dev_error);
+
+/**
+ * xenbus_dev_fatal
+ * @dev: xenbus device
+ * @err: error to report
+ * @fmt: error message format
+ *
+ * Equivalent to xenbus_dev_error(dev, err, fmt, args), followed by
+ * xenbus_switch_state(dev, NULL, XenbusStateClosing) to schedule an orderly
+ * closedown of this driver and its peer.
+ */
+
+void xenbus_dev_fatal(struct xenbus_device *dev, int err, const char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	xenbus_va_dev_error(dev, err, fmt, ap);
+	va_end(ap);
+
+	xenbus_switch_state(dev, XenbusStateClosing);
+}
+EXPORT_SYMBOL_GPL(xenbus_dev_fatal);
+
+/**
+ * xenbus_grant_ring
+ * @dev: xenbus device
+ * @ring_mfn: mfn of ring to grant
+
+ * Grant access to the given @ring_mfn to the peer of the given device.  Return
+ * 0 on success, or -errno on error.  On error, the device will switch to
+ * XenbusStateClosing, and the error will be saved in the store.
+ */
+int xenbus_grant_ring(struct xenbus_device *dev, unsigned long ring_mfn)
+{
+	int err = gnttab_grant_foreign_access(dev->otherend_id, ring_mfn, 0);
+	if (err < 0)
+		xenbus_dev_fatal(dev, err, "granting access to ring page");
+	return err;
+}
+EXPORT_SYMBOL_GPL(xenbus_grant_ring);
+
+
+/**
+ * Allocate an event channel for the given xenbus_device, assigning the newly
+ * created local port to *port.  Return 0 on success, or -errno on error.  On
+ * error, the device will switch to XenbusStateClosing, and the error will be
+ * saved in the store.
+ */
+int xenbus_alloc_evtchn(struct xenbus_device *dev, int *port)
+{
+	struct evtchn_alloc_unbound alloc_unbound;
+	int err;
+
+	alloc_unbound.dom = DOMID_SELF;
+	alloc_unbound.remote_dom = dev->otherend_id;
+
+	err = HYPERVISOR_event_channel_op(EVTCHNOP_alloc_unbound,
+					  &alloc_unbound);
+	if (err)
+		xenbus_dev_fatal(dev, err, "allocating event channel");
+	else
+		*port = alloc_unbound.port;
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(xenbus_alloc_evtchn);
+
+
+/**
+ * Bind to an existing interdomain event channel in another domain. Returns 0
+ * on success and stores the local port in *port. On error, returns -errno,
+ * switches the device to XenbusStateClosing, and saves the error in XenStore.
+ */
+int xenbus_bind_evtchn(struct xenbus_device *dev, int remote_port, int *port)
+{
+	struct evtchn_bind_interdomain bind_interdomain;
+	int err;
+
+	bind_interdomain.remote_dom = dev->otherend_id;
+	bind_interdomain.remote_port = remote_port;
+
+	err = HYPERVISOR_event_channel_op(EVTCHNOP_bind_interdomain,
+					  &bind_interdomain);
+	if (err)
+		xenbus_dev_fatal(dev, err,
+				 "binding to event channel %d from domain %d",
+				 remote_port, dev->otherend_id);
+	else
+		*port = bind_interdomain.local_port;
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(xenbus_bind_evtchn);
+
+
+/**
+ * Free an existing event channel. Returns 0 on success or -errno on error.
+ */
+int xenbus_free_evtchn(struct xenbus_device *dev, int port)
+{
+	struct evtchn_close close;
+	int err;
+
+	close.port = port;
+
+	err = HYPERVISOR_event_channel_op(EVTCHNOP_close, &close);
+	if (err)
+		xenbus_dev_error(dev, err, "freeing event channel %d", port);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(xenbus_free_evtchn);
+
+
+/**
+ * xenbus_map_ring_valloc
+ * @dev: xenbus device
+ * @gnt_ref: grant reference
+ * @vaddr: pointer to address to be filled out by mapping
+ *
+ * Based on Rusty Russell's skeleton driver's map_page.
+ * Map a page of memory into this domain from another domain's grant table.
+ * xenbus_map_ring_valloc allocates a page of virtual address space, maps the
+ * page to that address, and sets *vaddr to that address.
+ * Returns 0 on success, and GNTST_* (see xen/include/interface/grant_table.h)
+ * or -ENOMEM on error. If an error is returned, device will switch to
+ * XenbusStateClosing and the error message will be saved in XenStore.
+ */
+int xenbus_map_ring_valloc(struct xenbus_device *dev, int gnt_ref, void **vaddr)
+{
+	struct gnttab_map_grant_ref op = {
+		.flags = GNTMAP_host_map,
+		.ref   = gnt_ref,
+		.dom   = dev->otherend_id,
+	};
+	struct vm_struct *area;
+
+	*vaddr = NULL;
+
+	area = alloc_vm_area(PAGE_SIZE);
+	if (!area)
+		return -ENOMEM;
+
+	op.host_addr = (unsigned long)area->addr;
+
+	if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
+		BUG();
+
+	if (op.status != GNTST_okay) {
+		free_vm_area(area);
+		xenbus_dev_fatal(dev, op.status,
+				 "mapping in shared page %d from domain %d",
+				 gnt_ref, dev->otherend_id);
+		return op.status;
+	}
+
+	/* Stuff the handle in an unused field */
+	area->phys_addr = (unsigned long)op.handle;
+
+	*vaddr = area->addr;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xenbus_map_ring_valloc);
+
+
+/**
+ * xenbus_map_ring
+ * @dev: xenbus device
+ * @gnt_ref: grant reference
+ * @handle: pointer to grant handle to be filled
+ * @vaddr: address to be mapped to
+ *
+ * Map a page of memory into this domain from another domain's grant table.
+ * xenbus_map_ring does not allocate the virtual address space (you must do
+ * this yourself!). It only maps in the page to the specified address.
+ * Returns 0 on success, and GNTST_* (see xen/include/interface/grant_table.h)
+ * or -ENOMEM on error. If an error is returned, device will switch to
+ * XenbusStateClosing and the error message will be saved in XenStore.
+ */
+int xenbus_map_ring(struct xenbus_device *dev, int gnt_ref,
+		    grant_handle_t *handle, void *vaddr)
+{
+	struct gnttab_map_grant_ref op = {
+		.host_addr = (unsigned long)vaddr,
+		.flags     = GNTMAP_host_map,
+		.ref       = gnt_ref,
+		.dom       = dev->otherend_id,
+	};
+
+	if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
+		BUG();
+
+	if (op.status != GNTST_okay) {
+		xenbus_dev_fatal(dev, op.status,
+				 "mapping in shared page %d from domain %d",
+				 gnt_ref, dev->otherend_id);
+	} else
+		*handle = op.handle;
+
+	return op.status;
+}
+EXPORT_SYMBOL_GPL(xenbus_map_ring);
+
+
+/**
+ * xenbus_unmap_ring_vfree
+ * @dev: xenbus device
+ * @vaddr: addr to unmap
+ *
+ * Based on Rusty Russell's skeleton driver's unmap_page.
+ * Unmap a page of memory in this domain that was imported from another domain.
+ * Use xenbus_unmap_ring_vfree if you mapped in your memory with
+ * xenbus_map_ring_valloc (it will free the virtual address space).
+ * Returns 0 on success and returns GNTST_* on error
+ * (see xen/include/interface/grant_table.h).
+ */
+int xenbus_unmap_ring_vfree(struct xenbus_device *dev, void *vaddr)
+{
+	struct vm_struct *area;
+	struct gnttab_unmap_grant_ref op = {
+		.host_addr = (unsigned long)vaddr,
+	};
+
+	/* It'd be nice if linux/vmalloc.h provided a find_vm_area(void *addr)
+	 * method so that we don't have to muck with vmalloc internals here.
+	 * We could force the user to hang on to their struct vm_struct from
+	 * xenbus_map_ring_valloc, but these 6 lines considerably simplify
+	 * this API.
+	 */
+	read_lock(&vmlist_lock);
+	for (area = vmlist; area != NULL; area = area->next) {
+		if (area->addr == vaddr)
+			break;
+	}
+	read_unlock(&vmlist_lock);
+
+	if (!area) {
+		xenbus_dev_error(dev, -ENOENT,
+				 "can't find mapped virtual address %p", vaddr);
+		return GNTST_bad_virt_addr;
+	}
+
+	op.handle = (grant_handle_t)area->phys_addr;
+
+	if (HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, &op, 1))
+		BUG();
+
+	if (op.status == GNTST_okay)
+		free_vm_area(area);
+	else
+		xenbus_dev_error(dev, op.status,
+				 "unmapping page at handle %d error %d",
+				 (int16_t)area->phys_addr, op.status);
+
+	return op.status;
+}
+EXPORT_SYMBOL_GPL(xenbus_unmap_ring_vfree);
+
+
+/**
+ * xenbus_unmap_ring
+ * @dev: xenbus device
+ * @handle: grant handle
+ * @vaddr: addr to unmap
+ *
+ * Unmap a page of memory in this domain that was imported from another domain.
+ * Returns 0 on success and returns GNTST_* on error
+ * (see xen/include/interface/grant_table.h).
+ */
+int xenbus_unmap_ring(struct xenbus_device *dev,
+		      grant_handle_t handle, void *vaddr)
+{
+	struct gnttab_unmap_grant_ref op = {
+		.host_addr = (unsigned long)vaddr,
+		.handle    = handle,
+	};
+
+	if (HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, &op, 1))
+		BUG();
+
+	if (op.status != GNTST_okay)
+		xenbus_dev_error(dev, op.status,
+				 "unmapping page at handle %d error %d",
+				 handle, op.status);
+
+	return op.status;
+}
+EXPORT_SYMBOL_GPL(xenbus_unmap_ring);
+
+
+/**
+ * xenbus_read_driver_state
+ * @path: path for driver
+ *
+ * Return the state of the driver rooted at the given store path, or
+ * XenbusStateUnknown if no state can be read.
+ */
+enum xenbus_state xenbus_read_driver_state(const char *path)
+{
+	enum xenbus_state result;
+	int err = xenbus_gather(XBT_NIL, path, "state", "%d", &result, NULL);
+	if (err)
+		result = XenbusStateUnknown;
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(xenbus_read_driver_state);
===================================================================
--- /dev/null
+++ b/drivers/xen/xenbus/xenbus_comms.c
@@ -0,0 +1,218 @@
+/******************************************************************************
+ * xenbus_comms.c
+ *
+ * Low level code to talks to Xen Store: ringbuffer and event channel.
+ *
+ * Copyright (C) 2005 Rusty Russell, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/wait.h>
+#include <linux/interrupt.h>
+#include <linux/sched.h>
+#include <linux/err.h>
+#include <xen/xenbus.h>
+#include <asm/xen/hypervisor.h>
+#include <xen/events.h>
+#include <xen/page.h>
+#include "xenbus_comms.h"
+
+static int xenbus_irq;
+
+static DECLARE_WORK(probe_work, xenbus_probe);
+
+static DECLARE_WAIT_QUEUE_HEAD(xb_waitq);
+
+static irqreturn_t wake_waiting(int irq, void *unused)
+{
+	if (unlikely(xenstored_ready == 0)) {
+		xenstored_ready = 1;
+		schedule_work(&probe_work);
+	}
+
+	wake_up(&xb_waitq);
+	return IRQ_HANDLED;
+}
+
+static int check_indexes(XENSTORE_RING_IDX cons, XENSTORE_RING_IDX prod)
+{
+	return ((prod - cons) <= XENSTORE_RING_SIZE);
+}
+
+static void *get_output_chunk(XENSTORE_RING_IDX cons,
+			      XENSTORE_RING_IDX prod,
+			      char *buf, uint32_t *len)
+{
+	*len = XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(prod);
+	if ((XENSTORE_RING_SIZE - (prod - cons)) < *len)
+		*len = XENSTORE_RING_SIZE - (prod - cons);
+	return buf + MASK_XENSTORE_IDX(prod);
+}
+
+static const void *get_input_chunk(XENSTORE_RING_IDX cons,
+				   XENSTORE_RING_IDX prod,
+				   const char *buf, uint32_t *len)
+{
+	*len = XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(cons);
+	if ((prod - cons) < *len)
+		*len = prod - cons;
+	return buf + MASK_XENSTORE_IDX(cons);
+}
+
+/**
+ * xb_write - low level write
+ * @data: buffer to send
+ * @len: length of buffer
+ *
+ * Returns 0 on success, error otherwise.
+ */
+int xb_write(const void *data, unsigned len)
+{
+	struct xenstore_domain_interface *intf = xen_store_interface;
+	XENSTORE_RING_IDX cons, prod;
+	int rc;
+
+	while (len != 0) {
+		void *dst;
+		unsigned int avail;
+
+		rc = wait_event_interruptible(
+			xb_waitq,
+			(intf->req_prod - intf->req_cons) !=
+			XENSTORE_RING_SIZE);
+		if (rc < 0)
+			return rc;
+
+		/* Read indexes, then verify. */
+		cons = intf->req_cons;
+		prod = intf->req_prod;
+		mb();
+		if (!check_indexes(cons, prod)) {
+			intf->req_cons = intf->req_prod = 0;
+			return -EIO;
+		}
+
+		dst = get_output_chunk(cons, prod, intf->req, &avail);
+		if (avail == 0)
+			continue;
+		if (avail > len)
+			avail = len;
+
+		memcpy(dst, data, avail);
+		data += avail;
+		len -= avail;
+
+		/* Other side must not see new header until data is there. */
+		wmb();
+		intf->req_prod += avail;
+
+		/* This implies mb() before other side sees interrupt. */
+		notify_remote_via_evtchn(xen_store_evtchn);
+	}
+
+	return 0;
+}
+
+/**
+ * xb_read - low level read
+ * @data: buffer to fill
+ * @len: length of buffer
+ *
+ * Returns 0 on success, error otherwise.
+ */
+int xb_read(void *data, unsigned len)
+{
+	struct xenstore_domain_interface *intf = xen_store_interface;
+	XENSTORE_RING_IDX cons, prod;
+	int rc;
+
+	while (len != 0) {
+		unsigned int avail;
+		const char *src;
+
+		rc = wait_event_interruptible(
+			xb_waitq,
+			intf->rsp_cons != intf->rsp_prod);
+		if (rc < 0)
+			return rc;
+
+		/* Read indexes, then verify. */
+		cons = intf->rsp_cons;
+		prod = intf->rsp_prod;
+		mb();
+		if (!check_indexes(cons, prod)) {
+			intf->rsp_cons = intf->rsp_prod = 0;
+			return -EIO;
+		}
+
+		src = get_input_chunk(cons, prod, intf->rsp, &avail);
+		if (avail == 0)
+			continue;
+		if (avail > len)
+			avail = len;
+
+		/* We must read header before we read data. */
+		rmb();
+
+		memcpy(data, src, avail);
+		data += avail;
+		len -= avail;
+
+		/* Other side must not see free space until we've copied out */
+		mb();
+		intf->rsp_cons += avail;
+
+		pr_debug("Finished read of %i bytes (%i to go)\n", avail, len);
+
+		/* Implies mb(): they will see new header. */
+		notify_remote_via_evtchn(xen_store_evtchn);
+	}
+
+	return 0;
+}
+
+/**
+ * xb_init_comms - Set up interrupt handler off store event channel.
+ */
+int xb_init_comms(void)
+{
+	int err;
+
+	if (xenbus_irq)
+		unbind_from_irqhandler(xenbus_irq, &xb_waitq);
+
+	err = bind_evtchn_to_irqhandler(
+		xen_store_evtchn, wake_waiting,
+		0, "xenbus", &xb_waitq);
+	if (err <= 0) {
+		printk(KERN_ERR "XENBUS request irq failed %i\n", err);
+		return err;
+	}
+
+	xenbus_irq = err;
+
+	return 0;
+}
===================================================================
--- /dev/null
+++ b/drivers/xen/xenbus/xenbus_comms.h
@@ -0,0 +1,44 @@
+/*
+ * Private include for xenbus communications.
+ *
+ * Copyright (C) 2005 Rusty Russell, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef _XENBUS_COMMS_H
+#define _XENBUS_COMMS_H
+
+int xs_init(void);
+int xb_init_comms(void);
+
+/* Low level routines. */
+int xb_write(const void *data, unsigned len);
+int xb_read(void *data, unsigned len);
+int xs_input_avail(void);
+extern struct xenstore_domain_interface *xen_store_interface;
+extern int xen_store_evtchn;
+
+#endif /* _XENBUS_COMMS_H */
===================================================================
--- /dev/null
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -0,0 +1,891 @@
+/******************************************************************************
+ * Talks to Xen Store to figure out what devices we have.
+ *
+ * Copyright (C) 2005 Rusty Russell, IBM Corporation
+ * Copyright (C) 2005 Mike Wray, Hewlett-Packard
+ * Copyright (C) 2005, 2006 XenSource Ltd
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#define DPRINTK(fmt, args...)				\
+	pr_debug("xenbus_probe (%s:%d) " fmt ".\n",	\
+		 __FUNCTION__, __LINE__, ##args)
+
+#include <linux/kernel.h>
+#include <linux/err.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/fcntl.h>
+#include <linux/mm.h>
+#include <linux/notifier.h>
+#include <linux/kthread.h>
+#include <linux/mutex.h>
+
+#include <asm/io.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <asm/xen/hypervisor.h>
+#include <xen/xenbus.h>
+#include <xen/events.h>
+#include <xen/page.h>
+
+#include "xenbus_comms.h"
+#include "xenbus_probe.h"
+
+int xen_store_evtchn;
+struct xenstore_domain_interface *xen_store_interface;
+static unsigned long xen_store_mfn;
+
+static BLOCKING_NOTIFIER_HEAD(xenstore_chain);
+
+static void wait_for_devices(struct xenbus_driver *xendrv);
+
+static int xenbus_probe_frontend(const char *type, const char *name);
+
+static void xenbus_dev_shutdown(struct device *_dev);
+
+/* If something in array of ids matches this device, return it. */
+static const struct xenbus_device_id *
+match_device(const struct xenbus_device_id *arr, struct xenbus_device *dev)
+{
+	for (; *arr->devicetype != '\0'; arr++) {
+		if (!strcmp(arr->devicetype, dev->devicetype))
+			return arr;
+	}
+	return NULL;
+}
+
+int xenbus_match(struct device *_dev, struct device_driver *_drv)
+{
+	struct xenbus_driver *drv = to_xenbus_driver(_drv);
+
+	if (!drv->ids)
+		return 0;
+
+	return match_device(drv->ids, to_xenbus_device(_dev)) != NULL;
+}
+
+/* device/<type>/<id> => <type>-<id> */
+static int frontend_bus_id(char bus_id[BUS_ID_SIZE], const char *nodename)
+{
+	nodename = strchr(nodename, '/');
+	if (!nodename || strlen(nodename + 1) >= BUS_ID_SIZE) {
+		printk(KERN_WARNING "XENBUS: bad frontend %s\n", nodename);
+		return -EINVAL;
+	}
+
+	strlcpy(bus_id, nodename + 1, BUS_ID_SIZE);
+	if (!strchr(bus_id, '/')) {
+		printk(KERN_WARNING "XENBUS: bus_id %s no slash\n", bus_id);
+		return -EINVAL;
+	}
+	*strchr(bus_id, '/') = '-';
+	return 0;
+}
+
+
+static void free_otherend_details(struct xenbus_device *dev)
+{
+	kfree(dev->otherend);
+	dev->otherend = NULL;
+}
+
+
+static void free_otherend_watch(struct xenbus_device *dev)
+{
+	if (dev->otherend_watch.node) {
+		unregister_xenbus_watch(&dev->otherend_watch);
+		kfree(dev->otherend_watch.node);
+		dev->otherend_watch.node = NULL;
+	}
+}
+
+
+int read_otherend_details(struct xenbus_device *xendev,
+				 char *id_node, char *path_node)
+{
+	int err = xenbus_gather(XBT_NIL, xendev->nodename,
+				id_node, "%i", &xendev->otherend_id,
+				path_node, NULL, &xendev->otherend,
+				NULL);
+	if (err) {
+		xenbus_dev_fatal(xendev, err,
+				 "reading other end details from %s",
+				 xendev->nodename);
+		return err;
+	}
+	if (strlen(xendev->otherend) == 0 ||
+	    !xenbus_exists(XBT_NIL, xendev->otherend, "")) {
+		xenbus_dev_fatal(xendev, -ENOENT,
+				 "unable to read other end from %s.  "
+				 "missing or inaccessible.",
+				 xendev->nodename);
+		free_otherend_details(xendev);
+		return -ENOENT;
+	}
+
+	return 0;
+}
+
+
+static int read_backend_details(struct xenbus_device *xendev)
+{
+	return read_otherend_details(xendev, "backend-id", "backend");
+}
+
+
+/* Bus type for frontend drivers. */
+static struct xen_bus_type xenbus_frontend = {
+	.root = "device",
+	.levels = 2, 		/* device/type/<id> */
+	.get_bus_id = frontend_bus_id,
+	.probe = xenbus_probe_frontend,
+	.bus = {
+		.name     = "xen",
+		.match    = xenbus_match,
+		.probe    = xenbus_dev_probe,
+		.remove   = xenbus_dev_remove,
+		.shutdown = xenbus_dev_shutdown,
+	},
+};
+
+static void otherend_changed(struct xenbus_watch *watch,
+			     const char **vec, unsigned int len)
+{
+	struct xenbus_device *dev =
+		container_of(watch, struct xenbus_device, otherend_watch);
+	struct xenbus_driver *drv = to_xenbus_driver(dev->dev.driver);
+	enum xenbus_state state;
+
+	/* Protect us against watches firing on old details when the otherend
+	   details change, say immediately after a resume. */
+	if (!dev->otherend ||
+	    strncmp(dev->otherend, vec[XS_WATCH_PATH],
+		    strlen(dev->otherend))) {
+		dev_dbg(&dev->dev, "Ignoring watch at %s", vec[XS_WATCH_PATH]);
+		return;
+	}
+
+	state = xenbus_read_driver_state(dev->otherend);
+
+	dev_dbg(&dev->dev, "state is %d, (%s), %s, %s",
+		state, xenbus_strstate(state), dev->otherend_watch.node,
+		vec[XS_WATCH_PATH]);
+
+	/*
+	 * Ignore xenbus transitions during shutdown. This prevents us doing
+	 * work that can fail e.g., when the rootfs is gone.
+	 */
+	if (system_state > SYSTEM_RUNNING) {
+		struct xen_bus_type *bus = bus;
+		bus = container_of(dev->dev.bus, struct xen_bus_type, bus);
+		/* If we're frontend, drive the state machine to Closed. */
+		/* This should cause the backend to release our resources. */
+		if ((bus == &xenbus_frontend) && (state == XenbusStateClosing))
+			xenbus_frontend_closed(dev);
+		return;
+	}
+
+	if (drv->otherend_changed)
+		drv->otherend_changed(dev, state);
+}
+
+
+static int talk_to_otherend(struct xenbus_device *dev)
+{
+	struct xenbus_driver *drv = to_xenbus_driver(dev->dev.driver);
+
+	free_otherend_watch(dev);
+	free_otherend_details(dev);
+
+	return drv->read_otherend_details(dev);
+}
+
+
+static int watch_otherend(struct xenbus_device *dev)
+{
+	return xenbus_watch_path2(dev, dev->otherend, "state",
+				  &dev->otherend_watch, otherend_changed);
+}
+
+
+int xenbus_dev_probe(struct device *_dev)
+{
+	struct xenbus_device *dev = to_xenbus_device(_dev);
+	struct xenbus_driver *drv = to_xenbus_driver(_dev->driver);
+	const struct xenbus_device_id *id;
+	int err;
+
+	DPRINTK("%s", dev->nodename);
+
+	if (!drv->probe) {
+		err = -ENODEV;
+		goto fail;
+	}
+
+	id = match_device(drv->ids, dev);
+	if (!id) {
+		err = -ENODEV;
+		goto fail;
+	}
+
+	err = talk_to_otherend(dev);
+	if (err) {
+		dev_warn(&dev->dev, "talk_to_otherend on %s failed.\n",
+			 dev->nodename);
+		return err;
+	}
+
+	err = drv->probe(dev, id);
+	if (err)
+		goto fail;
+
+	err = watch_otherend(dev);
+	if (err) {
+		dev_warn(&dev->dev, "watch_otherend on %s failed.\n",
+		       dev->nodename);
+		return err;
+	}
+
+	return 0;
+fail:
+	xenbus_dev_error(dev, err, "xenbus_dev_probe on %s", dev->nodename);
+	xenbus_switch_state(dev, XenbusStateClosed);
+	return -ENODEV;
+}
+
+int xenbus_dev_remove(struct device *_dev)
+{
+	struct xenbus_device *dev = to_xenbus_device(_dev);
+	struct xenbus_driver *drv = to_xenbus_driver(_dev->driver);
+
+	DPRINTK("%s", dev->nodename);
+
+	free_otherend_watch(dev);
+	free_otherend_details(dev);
+
+	if (drv->remove)
+		drv->remove(dev);
+
+	xenbus_switch_state(dev, XenbusStateClosed);
+	return 0;
+}
+
+static void xenbus_dev_shutdown(struct device *_dev)
+{
+	struct xenbus_device *dev = to_xenbus_device(_dev);
+	unsigned long timeout = 5*HZ;
+
+	DPRINTK("%s", dev->nodename);
+
+	get_device(&dev->dev);
+	if (dev->state != XenbusStateConnected) {
+		printk("%s: %s: %s != Connected, skipping\n", __FUNCTION__,
+		       dev->nodename, xenbus_strstate(dev->state));
+		goto out;
+	}
+	xenbus_switch_state(dev, XenbusStateClosing);
+	timeout = wait_for_completion_timeout(&dev->down, timeout);
+	if (!timeout)
+		printk("%s: %s timeout closing device\n", __FUNCTION__, dev->nodename);
+ out:
+	put_device(&dev->dev);
+}
+
+int xenbus_register_driver_common(struct xenbus_driver *drv,
+				  struct xen_bus_type *bus)
+{
+	drv->driver.name = drv->name;
+	drv->driver.bus = &bus->bus;
+	drv->driver.owner = drv->owner;
+
+	return driver_register(&drv->driver);
+}
+
+int xenbus_register_frontend(struct xenbus_driver *drv)
+{
+	int ret;
+
+	drv->read_otherend_details = read_backend_details;
+
+	ret = xenbus_register_driver_common(drv, &xenbus_frontend);
+	if (ret)
+		return ret;
+
+	/* If this driver is loaded as a module wait for devices to attach. */
+	wait_for_devices(drv);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xenbus_register_frontend);
+
+void xenbus_unregister_driver(struct xenbus_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL_GPL(xenbus_unregister_driver);
+
+struct xb_find_info
+{
+	struct xenbus_device *dev;
+	const char *nodename;
+};
+
+static int cmp_dev(struct device *dev, void *data)
+{
+	struct xenbus_device *xendev = to_xenbus_device(dev);
+	struct xb_find_info *info = data;
+
+	if (!strcmp(xendev->nodename, info->nodename)) {
+		info->dev = xendev;
+		get_device(dev);
+		return 1;
+	}
+	return 0;
+}
+
+struct xenbus_device *xenbus_device_find(const char *nodename,
+					 struct bus_type *bus)
+{
+	struct xb_find_info info = { .dev = NULL, .nodename = nodename };
+
+	bus_for_each_dev(bus, NULL, &info, cmp_dev);
+	return info.dev;
+}
+
+static int cleanup_dev(struct device *dev, void *data)
+{
+	struct xenbus_device *xendev = to_xenbus_device(dev);
+	struct xb_find_info *info = data;
+	int len = strlen(info->nodename);
+
+	DPRINTK("%s", info->nodename);
+
+	/* Match the info->nodename path, or any subdirectory of that path. */
+	if (strncmp(xendev->nodename, info->nodename, len))
+		return 0;
+
+	/* If the node name is longer, ensure it really is a subdirectory. */
+	if ((strlen(xendev->nodename) > len) && (xendev->nodename[len] != '/'))
+		return 0;
+
+	info->dev = xendev;
+	get_device(dev);
+	return 1;
+}
+
+static void xenbus_cleanup_devices(const char *path, struct bus_type *bus)
+{
+	struct xb_find_info info = { .nodename = path };
+
+	do {
+		info.dev = NULL;
+		bus_for_each_dev(bus, NULL, &info, cleanup_dev);
+		if (info.dev) {
+			device_unregister(&info.dev->dev);
+			put_device(&info.dev->dev);
+		}
+	} while (info.dev);
+}
+
+static void xenbus_dev_release(struct device *dev)
+{
+	if (dev)
+		kfree(to_xenbus_device(dev));
+}
+
+static ssize_t xendev_show_nodename(struct device *dev,
+				    struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%s\n", to_xenbus_device(dev)->nodename);
+}
+DEVICE_ATTR(nodename, S_IRUSR | S_IRGRP | S_IROTH, xendev_show_nodename, NULL);
+
+static ssize_t xendev_show_devtype(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%s\n", to_xenbus_device(dev)->devicetype);
+}
+DEVICE_ATTR(devtype, S_IRUSR | S_IRGRP | S_IROTH, xendev_show_devtype, NULL);
+
+
+int xenbus_probe_node(struct xen_bus_type *bus,
+		      const char *type,
+		      const char *nodename)
+{
+	int err;
+	struct xenbus_device *xendev;
+	size_t stringlen;
+	char *tmpstring;
+
+	enum xenbus_state state = xenbus_read_driver_state(nodename);
+
+	if (state != XenbusStateInitialising) {
+		/* Device is not new, so ignore it.  This can happen if a
+		   device is going away after switching to Closed.  */
+		return 0;
+	}
+
+	stringlen = strlen(nodename) + 1 + strlen(type) + 1;
+	xendev = kzalloc(sizeof(*xendev) + stringlen, GFP_KERNEL);
+	if (!xendev)
+		return -ENOMEM;
+
+	xendev->state = XenbusStateInitialising;
+
+	/* Copy the strings into the extra space. */
+
+	tmpstring = (char *)(xendev + 1);
+	strcpy(tmpstring, nodename);
+	xendev->nodename = tmpstring;
+
+	tmpstring += strlen(tmpstring) + 1;
+	strcpy(tmpstring, type);
+	xendev->devicetype = tmpstring;
+	init_completion(&xendev->down);
+
+	xendev->dev.bus = &bus->bus;
+	xendev->dev.release = xenbus_dev_release;
+
+	err = bus->get_bus_id(xendev->dev.bus_id, xendev->nodename);
+	if (err)
+		goto fail;
+
+	/* Register with generic device framework. */
+	err = device_register(&xendev->dev);
+	if (err)
+		goto fail;
+
+	device_create_file(&xendev->dev, &dev_attr_nodename);
+	device_create_file(&xendev->dev, &dev_attr_devtype);
+
+	return 0;
+fail:
+	kfree(xendev);
+	return err;
+}
+
+/* device/<typename>/<name> */
+static int xenbus_probe_frontend(const char *type, const char *name)
+{
+	char *nodename;
+	int err;
+
+	nodename = kasprintf(GFP_KERNEL, "%s/%s/%s", xenbus_frontend.root, type, name);
+	if (!nodename)
+		return -ENOMEM;
+
+	DPRINTK("%s", nodename);
+
+	err = xenbus_probe_node(&xenbus_frontend, type, nodename);
+	kfree(nodename);
+	return err;
+}
+
+static int xenbus_probe_device_type(struct xen_bus_type *bus, const char *type)
+{
+	int err = 0;
+	char **dir;
+	unsigned int dir_n = 0;
+	int i;
+
+	dir = xenbus_directory(XBT_NIL, bus->root, type, &dir_n);
+	if (IS_ERR(dir))
+		return PTR_ERR(dir);
+
+	for (i = 0; i < dir_n; i++) {
+		err = bus->probe(type, dir[i]);
+		if (err)
+			break;
+	}
+	kfree(dir);
+	return err;
+}
+
+int xenbus_probe_devices(struct xen_bus_type *bus)
+{
+	int err = 0;
+	char **dir;
+	unsigned int i, dir_n;
+
+	dir = xenbus_directory(XBT_NIL, bus->root, "", &dir_n);
+	if (IS_ERR(dir))
+		return PTR_ERR(dir);
+
+	for (i = 0; i < dir_n; i++) {
+		err = xenbus_probe_device_type(bus, dir[i]);
+		if (err)
+			break;
+	}
+	kfree(dir);
+	return err;
+}
+
+static unsigned int char_count(const char *str, char c)
+{
+	unsigned int i, ret = 0;
+
+	for (i = 0; str[i]; i++)
+		if (str[i] == c)
+			ret++;
+	return ret;
+}
+
+static int strsep_len(const char *str, char c, unsigned int len)
+{
+	unsigned int i;
+
+	for (i = 0; str[i]; i++)
+		if (str[i] == c) {
+			if (len == 0)
+				return i;
+			len--;
+		}
+	return (len == 0) ? i : -ERANGE;
+}
+
+void dev_changed(const char *node, struct xen_bus_type *bus)
+{
+	int exists, rootlen;
+	struct xenbus_device *dev;
+	char type[BUS_ID_SIZE];
+	const char *p, *root;
+
+	if (char_count(node, '/') < 2)
+ 		return;
+
+	exists = xenbus_exists(XBT_NIL, node, "");
+	if (!exists) {
+		xenbus_cleanup_devices(node, &bus->bus);
+		return;
+	}
+
+	/* backend/<type>/... or device/<type>/... */
+	p = strchr(node, '/') + 1;
+	snprintf(type, BUS_ID_SIZE, "%.*s", (int)strcspn(p, "/"), p);
+	type[BUS_ID_SIZE-1] = '\0';
+
+	rootlen = strsep_len(node, '/', bus->levels);
+	if (rootlen < 0)
+		return;
+	root = kasprintf(GFP_KERNEL, "%.*s", rootlen, node);
+	if (!root)
+		return;
+
+	dev = xenbus_device_find(root, &bus->bus);
+	if (!dev)
+		xenbus_probe_node(bus, type, root);
+	else
+		put_device(&dev->dev);
+
+	kfree(root);
+}
+
+static void frontend_changed(struct xenbus_watch *watch,
+			     const char **vec, unsigned int len)
+{
+	DPRINTK("");
+
+	dev_changed(vec[XS_WATCH_PATH], &xenbus_frontend);
+}
+
+/* We watch for devices appearing and vanishing. */
+static struct xenbus_watch fe_watch = {
+	.node = "device",
+	.callback = frontend_changed,
+};
+
+static int suspend_dev(struct device *dev, void *data)
+{
+	int err = 0;
+	struct xenbus_driver *drv;
+	struct xenbus_device *xdev;
+
+	DPRINTK("");
+
+	if (dev->driver == NULL)
+		return 0;
+	drv = to_xenbus_driver(dev->driver);
+	xdev = container_of(dev, struct xenbus_device, dev);
+	if (drv->suspend)
+		err = drv->suspend(xdev);
+	if (err)
+		printk(KERN_WARNING
+		       "xenbus: suspend %s failed: %i\n", dev->bus_id, err);
+	return 0;
+}
+
+static int resume_dev(struct device *dev, void *data)
+{
+	int err;
+	struct xenbus_driver *drv;
+	struct xenbus_device *xdev;
+
+	DPRINTK("");
+
+	if (dev->driver == NULL)
+		return 0;
+
+	drv = to_xenbus_driver(dev->driver);
+	xdev = container_of(dev, struct xenbus_device, dev);
+
+	err = talk_to_otherend(xdev);
+	if (err) {
+		printk(KERN_WARNING
+		       "xenbus: resume (talk_to_otherend) %s failed: %i\n",
+		       dev->bus_id, err);
+		return err;
+	}
+
+	xdev->state = XenbusStateInitialising;
+
+	if (drv->resume) {
+		err = drv->resume(xdev);
+		if (err) {
+			printk(KERN_WARNING
+			       "xenbus: resume %s failed: %i\n",
+			       dev->bus_id, err);
+			return err;
+		}
+	}
+
+	err = watch_otherend(xdev);
+	if (err) {
+		printk(KERN_WARNING
+		       "xenbus_probe: resume (watch_otherend) %s failed: "
+		       "%d.\n", dev->bus_id, err);
+		return err;
+	}
+
+	return 0;
+}
+
+void xenbus_suspend(void)
+{
+	DPRINTK("");
+
+	bus_for_each_dev(&xenbus_frontend.bus, NULL, NULL, suspend_dev);
+	xenbus_backend_suspend(suspend_dev);
+	xs_suspend();
+}
+EXPORT_SYMBOL_GPL(xenbus_suspend);
+
+void xenbus_resume(void)
+{
+	xb_init_comms();
+	xs_resume();
+	bus_for_each_dev(&xenbus_frontend.bus, NULL, NULL, resume_dev);
+	xenbus_backend_resume(resume_dev);
+}
+EXPORT_SYMBOL_GPL(xenbus_resume);
+
+
+/* A flag to determine if xenstored is 'ready' (i.e. has started) */
+int xenstored_ready = 0;
+
+
+int register_xenstore_notifier(struct notifier_block *nb)
+{
+	int ret = 0;
+
+	if (xenstored_ready > 0)
+		ret = nb->notifier_call(nb, 0, NULL);
+	else
+		blocking_notifier_chain_register(&xenstore_chain, nb);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(register_xenstore_notifier);
+
+void unregister_xenstore_notifier(struct notifier_block *nb)
+{
+	blocking_notifier_chain_unregister(&xenstore_chain, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_xenstore_notifier);
+
+void xenbus_probe(struct work_struct *unused)
+{
+	BUG_ON((xenstored_ready <= 0));
+
+	/* Enumerate devices in xenstore and watch for changes. */
+	xenbus_probe_devices(&xenbus_frontend);
+	register_xenbus_watch(&fe_watch);
+	xenbus_backend_probe_and_watch();
+
+	/* Notify others that xenstore is up */
+	blocking_notifier_call_chain(&xenstore_chain, 0, NULL);
+}
+
+static int __init xenbus_probe_init(void)
+{
+	int err = 0;
+
+	DPRINTK("");
+
+	err = -ENODEV;
+	if (!is_running_on_xen())
+		goto out_error;
+
+	/* Register ourselves with the kernel bus subsystem */
+	err = bus_register(&xenbus_frontend.bus);
+	if (err)
+		goto out_error;
+
+	err = xenbus_backend_bus_register();
+	if (err)
+		goto out_unreg_front;
+
+	/*
+	 * Domain0 doesn't have a store_evtchn or store_mfn yet.
+	 */
+	if (is_initial_xendomain()) {
+		/* dom0 not yet supported */
+	} else {
+		xenstored_ready = 1;
+		xen_store_evtchn = xen_start_info->store_evtchn;
+		xen_store_mfn = xen_start_info->store_mfn;
+	}
+	xen_store_interface = mfn_to_virt(xen_store_mfn);
+
+	/* Initialize the interface to xenstore. */
+	err = xs_init();
+	if (err) {
+		printk(KERN_WARNING
+		       "XENBUS: Error initializing xenstore comms: %i\n", err);
+		goto out_unreg_back;
+	}
+
+	if (!is_initial_xendomain())
+		xenbus_probe(NULL);
+
+	return 0;
+
+  out_unreg_back:
+	xenbus_backend_bus_unregister();
+
+  out_unreg_front:
+	bus_unregister(&xenbus_frontend.bus);
+
+  out_error:
+	return err;
+}
+
+postcore_initcall(xenbus_probe_init);
+
+MODULE_LICENSE("Dual BSD/GPL");
+
+static int is_disconnected_device(struct device *dev, void *data)
+{
+	struct xenbus_device *xendev = to_xenbus_device(dev);
+	struct device_driver *drv = data;
+
+	/*
+	 * A device with no driver will never connect. We care only about
+	 * devices which should currently be in the process of connecting.
+	 */
+	if (!dev->driver)
+		return 0;
+
+	/* Is this search limited to a particular driver? */
+	if (drv && (dev->driver != drv))
+		return 0;
+
+	return (xendev->state != XenbusStateConnected);
+}
+
+static int exists_disconnected_device(struct device_driver *drv)
+{
+	return bus_for_each_dev(&xenbus_frontend.bus, NULL, drv,
+				is_disconnected_device);
+}
+
+static int print_device_status(struct device *dev, void *data)
+{
+	struct xenbus_device *xendev = to_xenbus_device(dev);
+	struct device_driver *drv = data;
+
+	/* Is this operation limited to a particular driver? */
+	if (drv && (dev->driver != drv))
+		return 0;
+
+	if (!dev->driver) {
+		/* Information only: is this too noisy? */
+		printk(KERN_INFO "XENBUS: Device with no driver: %s\n",
+		       xendev->nodename);
+	} else if (xendev->state != XenbusStateConnected) {
+		printk(KERN_WARNING "XENBUS: Timeout connecting "
+		       "to device: %s (state %d)\n",
+		       xendev->nodename, xendev->state);
+	}
+
+	return 0;
+}
+
+/* We only wait for device setup after most initcalls have run. */
+static int ready_to_wait_for_devices;
+
+/*
+ * On a 10 second timeout, wait for all devices currently configured.  We need
+ * to do this to guarantee that the filesystems and / or network devices
+ * needed for boot are available, before we can allow the boot to proceed.
+ *
+ * This needs to be on a late_initcall, to happen after the frontend device
+ * drivers have been initialised, but before the root fs is mounted.
+ *
+ * A possible improvement here would be to have the tools add a per-device
+ * flag to the store entry, indicating whether it is needed at boot time.
+ * This would allow people who knew what they were doing to accelerate their
+ * boot slightly, but of course needs tools or manual intervention to set up
+ * those flags correctly.
+ */
+static void wait_for_devices(struct xenbus_driver *xendrv)
+{
+	unsigned long timeout = jiffies + 10*HZ;
+	struct device_driver *drv = xendrv ? &xendrv->driver : NULL;
+
+	if (!ready_to_wait_for_devices || !is_running_on_xen())
+		return;
+
+	while (exists_disconnected_device(drv)) {
+		if (time_after(jiffies, timeout))
+			break;
+		schedule_timeout_interruptible(HZ/10);
+	}
+
+	bus_for_each_dev(&xenbus_frontend.bus, NULL, drv,
+			 print_device_status);
+}
+
+#ifndef MODULE
+static int __init boot_wait_for_devices(void)
+{
+	ready_to_wait_for_devices = 1;
+	wait_for_devices(NULL);
+	return 0;
+}
+
+late_initcall(boot_wait_for_devices);
+#endif
===================================================================
--- /dev/null
+++ b/drivers/xen/xenbus/xenbus_probe.h
@@ -0,0 +1,72 @@
+/******************************************************************************
+ * xenbus_probe.h
+ *
+ * Talks to Xen Store to figure out what devices we have.
+ *
+ * Copyright (C) 2005 Rusty Russell, IBM Corporation
+ * Copyright (C) 2005 XenSource Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef _XENBUS_PROBE_H
+#define _XENBUS_PROBE_H
+
+#ifdef CONFIG_XEN_BACKEND
+extern void xenbus_backend_suspend(int (*fn)(struct device *, void *));
+extern void xenbus_backend_resume(int (*fn)(struct device *, void *));
+extern void xenbus_backend_probe_and_watch(void);
+extern int xenbus_backend_bus_register(void);
+extern void xenbus_backend_bus_unregister(void);
+#else
+static inline void xenbus_backend_suspend(int (*fn)(struct device *, void *)) {}
+static inline void xenbus_backend_resume(int (*fn)(struct device *, void *)) {}
+static inline void xenbus_backend_probe_and_watch(void) {}
+static inline int xenbus_backend_bus_register(void) {return 0;}
+static inline void xenbus_backend_bus_unregister(void) {}
+#endif
+
+struct xen_bus_type
+{
+	char *root;
+	unsigned int levels;
+	int (*get_bus_id)(char bus_id[BUS_ID_SIZE], const char *nodename);
+	int (*probe)(const char *type, const char *dir);
+	struct bus_type bus;
+};
+
+extern int xenbus_match(struct device *_dev, struct device_driver *_drv);
+extern int xenbus_dev_probe(struct device *_dev);
+extern int xenbus_dev_remove(struct device *_dev);
+extern int xenbus_register_driver_common(struct xenbus_driver *drv,
+					 struct xen_bus_type *bus);
+extern int xenbus_probe_node(struct xen_bus_type *bus,
+			     const char *type,
+			     const char *nodename);
+extern int xenbus_probe_devices(struct xen_bus_type *bus);
+
+extern void dev_changed(const char *node, struct xen_bus_type *bus);
+
+#endif
===================================================================
--- /dev/null
+++ b/drivers/xen/xenbus/xenbus_probe_backend.c
@@ -0,0 +1,269 @@
+/******************************************************************************
+ * Talks to Xen Store to figure out what devices we have (backend half).
+ *
+ * Copyright (C) 2005 Rusty Russell, IBM Corporation
+ * Copyright (C) 2005 Mike Wray, Hewlett-Packard
+ * Copyright (C) 2005, 2006 XenSource Ltd
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#define DPRINTK(fmt, args...)				\
+	pr_debug("xenbus_probe (%s:%d) " fmt ".\n",	\
+		 __FUNCTION__, __LINE__, ##args)
+
+#include <linux/kernel.h>
+#include <linux/err.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/fcntl.h>
+#include <linux/mm.h>
+#include <linux/notifier.h>
+#include <linux/kthread.h>
+
+#include <asm/io.h>
+#include <asm/page.h>
+//#include <asm/maddr.h>
+#include <asm/pgtable.h>
+#include <asm/xen/hypervisor.h>
+#include <xen/xenbus.h>
+//#include <xen/xen_proc.h>
+//#include <xen/evtchn.h>
+#include <xen/features.h>
+//#include <xen/hvm.h>
+
+#include "xenbus_comms.h"
+#include "xenbus_probe.h"
+
+#ifdef HAVE_XEN_PLATFORM_COMPAT_H
+#include <xen/platform-compat.h>
+#endif
+
+static int xenbus_uevent_backend(struct device *dev, char **envp,
+				 int num_envp, char *buffer, int buffer_size);
+static int xenbus_probe_backend(const char *type, const char *domid);
+
+extern int read_otherend_details(struct xenbus_device *xendev,
+				 char *id_node, char *path_node);
+
+static int read_frontend_details(struct xenbus_device *xendev)
+{
+	return read_otherend_details(xendev, "frontend-id", "frontend");
+}
+
+/* backend/<type>/<fe-uuid>/<id> => <type>-<fe-domid>-<id> */
+static int backend_bus_id(char bus_id[BUS_ID_SIZE], const char *nodename)
+{
+	int domid, err;
+	const char *devid, *type, *frontend;
+	unsigned int typelen;
+
+	type = strchr(nodename, '/');
+	if (!type)
+		return -EINVAL;
+	type++;
+	typelen = strcspn(type, "/");
+	if (!typelen || type[typelen] != '/')
+		return -EINVAL;
+
+	devid = strrchr(nodename, '/') + 1;
+
+	err = xenbus_gather(XBT_NIL, nodename, "frontend-id", "%i", &domid,
+			    "frontend", NULL, &frontend,
+			    NULL);
+	if (err)
+		return err;
+	if (strlen(frontend) == 0)
+		err = -ERANGE;
+	if (!err && !xenbus_exists(XBT_NIL, frontend, ""))
+		err = -ENOENT;
+
+	kfree(frontend);
+
+	if (err)
+		return err;
+
+	if (snprintf(bus_id, BUS_ID_SIZE,
+		     "%.*s-%i-%s", typelen, type, domid, devid) >= BUS_ID_SIZE)
+		return -ENOSPC;
+	return 0;
+}
+
+static struct xen_bus_type xenbus_backend = {
+	.root = "backend",
+	.levels = 3, 		/* backend/type/<frontend>/<id> */
+	.get_bus_id = backend_bus_id,
+	.probe = xenbus_probe_backend,
+	.bus = {
+		.name     = "xen-backend",
+		.match    = xenbus_match,
+		.probe    = xenbus_dev_probe,
+		.remove   = xenbus_dev_remove,
+//		.shutdown = xenbus_dev_shutdown,
+		.uevent   = xenbus_uevent_backend,
+	},
+};
+
+static int xenbus_uevent_backend(struct device *dev, char **envp,
+				 int num_envp, char *buffer, int buffer_size)
+{
+	struct xenbus_device *xdev;
+	struct xenbus_driver *drv;
+	int i = 0;
+	int length = 0;
+
+	DPRINTK("");
+
+	if (dev == NULL)
+		return -ENODEV;
+
+	xdev = to_xenbus_device(dev);
+	if (xdev == NULL)
+		return -ENODEV;
+
+	/* stuff we want to pass to /sbin/hotplug */
+	add_uevent_var(envp, num_envp, &i, buffer, buffer_size, &length,
+		       "XENBUS_TYPE=%s", xdev->devicetype);
+
+	add_uevent_var(envp, num_envp, &i, buffer, buffer_size, &length,
+		       "XENBUS_PATH=%s", xdev->nodename);
+
+	add_uevent_var(envp, num_envp, &i, buffer, buffer_size, &length,
+		       "XENBUS_BASE_PATH=%s", xenbus_backend.root);
+
+	/* terminate, set to next free slot, shrink available space */
+	envp[i] = NULL;
+	envp = &envp[i];
+	num_envp -= i;
+	buffer = &buffer[length];
+	buffer_size -= length;
+
+	if (dev->driver) {
+		drv = to_xenbus_driver(dev->driver);
+		if (drv && drv->uevent)
+			return drv->uevent(xdev, envp, num_envp, buffer,
+					   buffer_size);
+	}
+
+	return 0;
+}
+
+int xenbus_register_backend(struct xenbus_driver *drv)
+{
+	drv->read_otherend_details = read_frontend_details;
+
+	return xenbus_register_driver_common(drv, &xenbus_backend);
+}
+EXPORT_SYMBOL_GPL(xenbus_register_backend);
+
+/* backend/<typename>/<frontend-uuid>/<name> */
+static int xenbus_probe_backend_unit(const char *dir,
+				     const char *type,
+				     const char *name)
+{
+	char *nodename;
+	int err;
+
+	nodename = kasprintf(GFP_KERNEL, "%s/%s", dir, name);
+	if (!nodename)
+		return -ENOMEM;
+
+	DPRINTK("%s\n", nodename);
+
+	err = xenbus_probe_node(&xenbus_backend, type, nodename);
+	kfree(nodename);
+	return err;
+}
+
+/* backend/<typename>/<frontend-domid> */
+static int xenbus_probe_backend(const char *type, const char *domid)
+{
+	char *nodename;
+	int err = 0;
+	char **dir;
+	unsigned int i, dir_n = 0;
+
+	DPRINTK("");
+
+	nodename = kasprintf(GFP_KERNEL, "%s/%s/%s", xenbus_backend.root, type, domid);
+	if (!nodename)
+		return -ENOMEM;
+
+	dir = xenbus_directory(XBT_NIL, nodename, "", &dir_n);
+	if (IS_ERR(dir)) {
+		kfree(nodename);
+		return PTR_ERR(dir);
+	}
+
+	for (i = 0; i < dir_n; i++) {
+		err = xenbus_probe_backend_unit(nodename, type, dir[i]);
+		if (err)
+			break;
+	}
+	kfree(dir);
+	kfree(nodename);
+	return err;
+}
+
+static void backend_changed(struct xenbus_watch *watch,
+			    const char **vec, unsigned int len)
+{
+	DPRINTK("");
+
+	dev_changed(vec[XS_WATCH_PATH], &xenbus_backend);
+}
+
+static struct xenbus_watch be_watch = {
+	.node = "backend",
+	.callback = backend_changed,
+};
+
+void xenbus_backend_suspend(int (*fn)(struct device *, void *))
+{
+	DPRINTK("");
+	bus_for_each_dev(&xenbus_backend.bus, NULL, NULL, fn);
+}
+
+void xenbus_backend_resume(int (*fn)(struct device *, void *))
+{
+	DPRINTK("");
+	bus_for_each_dev(&xenbus_backend.bus, NULL, NULL, fn);
+}
+
+void xenbus_backend_probe_and_watch(void)
+{
+	xenbus_probe_devices(&xenbus_backend);
+	register_xenbus_watch(&be_watch);
+}
+
+int xenbus_backend_bus_register(void)
+{
+	return bus_register(&xenbus_backend.bus);
+}
+
+void xenbus_backend_bus_unregister(void)
+{
+	bus_unregister(&xenbus_backend.bus);
+}
===================================================================
--- /dev/null
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -0,0 +1,840 @@
+/******************************************************************************
+ * xenbus_xs.c
+ *
+ * This is the kernel equivalent of the "xs" library.  We don't need everything
+ * and we use xenbus_comms for communication.
+ *
+ * Copyright (C) 2005 Rusty Russell, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/unistd.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/uio.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/fcntl.h>
+#include <linux/kthread.h>
+#include <linux/rwsem.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <xen/xenbus.h>
+#include "xenbus_comms.h"
+
+struct xs_stored_msg {
+	struct list_head list;
+
+	struct xsd_sockmsg hdr;
+
+	union {
+		/* Queued replies. */
+		struct {
+			char *body;
+		} reply;
+
+		/* Queued watch events. */
+		struct {
+			struct xenbus_watch *handle;
+			char **vec;
+			unsigned int vec_size;
+		} watch;
+	} u;
+};
+
+struct xs_handle {
+	/* A list of replies. Currently only one will ever be outstanding. */
+	struct list_head reply_list;
+	spinlock_t reply_lock;
+	wait_queue_head_t reply_waitq;
+
+	/* One request at a time. */
+	struct mutex request_mutex;
+
+	/* Protect transactions against save/restore. */
+	struct rw_semaphore suspend_mutex;
+};
+
+static struct xs_handle xs_state;
+
+/* List of registered watches, and a lock to protect it. */
+static LIST_HEAD(watches);
+static DEFINE_SPINLOCK(watches_lock);
+
+/* List of pending watch callback events, and a lock to protect it. */
+static LIST_HEAD(watch_events);
+static DEFINE_SPINLOCK(watch_events_lock);
+
+/*
+ * Details of the xenwatch callback kernel thread. The thread waits on the
+ * watch_events_waitq for work to do (queued on watch_events list). When it
+ * wakes up it acquires the xenwatch_mutex before reading the list and
+ * carrying out work.
+ */
+static pid_t xenwatch_pid;
+static DEFINE_MUTEX(xenwatch_mutex);
+static DECLARE_WAIT_QUEUE_HEAD(watch_events_waitq);
+
+static int get_error(const char *errorstring)
+{
+	unsigned int i;
+
+	for (i = 0; strcmp(errorstring, xsd_errors[i].errstring) != 0; i++) {
+		if (i == ARRAY_SIZE(xsd_errors) - 1) {
+			printk(KERN_WARNING
+			       "XENBUS xen store gave: unknown error %s",
+			       errorstring);
+			return EINVAL;
+		}
+	}
+	return xsd_errors[i].errnum;
+}
+
+static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
+{
+	struct xs_stored_msg *msg;
+	char *body;
+
+	spin_lock(&xs_state.reply_lock);
+
+	while (list_empty(&xs_state.reply_list)) {
+		spin_unlock(&xs_state.reply_lock);
+		/* XXX FIXME: Avoid synchronous wait for response here. */
+		wait_event(xs_state.reply_waitq,
+			   !list_empty(&xs_state.reply_list));
+		spin_lock(&xs_state.reply_lock);
+	}
+
+	msg = list_entry(xs_state.reply_list.next,
+			 struct xs_stored_msg, list);
+	list_del(&msg->list);
+
+	spin_unlock(&xs_state.reply_lock);
+
+	*type = msg->hdr.type;
+	if (len)
+		*len = msg->hdr.len;
+	body = msg->u.reply.body;
+
+	kfree(msg);
+
+	return body;
+}
+
+/* Emergency write. */
+void xenbus_debug_write(const char *str, unsigned int count)
+{
+	struct xsd_sockmsg msg = { 0 };
+
+	msg.type = XS_DEBUG;
+	msg.len = sizeof("print") + count + 1;
+
+	mutex_lock(&xs_state.request_mutex);
+	xb_write(&msg, sizeof(msg));
+	xb_write("print", sizeof("print"));
+	xb_write(str, count);
+	xb_write("", 1);
+	mutex_unlock(&xs_state.request_mutex);
+}
+
+void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
+{
+	void *ret;
+	struct xsd_sockmsg req_msg = *msg;
+	int err;
+
+	if (req_msg.type == XS_TRANSACTION_START)
+		down_read(&xs_state.suspend_mutex);
+
+	mutex_lock(&xs_state.request_mutex);
+
+	err = xb_write(msg, sizeof(*msg) + msg->len);
+	if (err) {
+		msg->type = XS_ERROR;
+		ret = ERR_PTR(err);
+	} else
+		ret = read_reply(&msg->type, &msg->len);
+
+	mutex_unlock(&xs_state.request_mutex);
+
+	if ((msg->type == XS_TRANSACTION_END) ||
+	    ((req_msg.type == XS_TRANSACTION_START) &&
+	     (msg->type == XS_ERROR)))
+		up_read(&xs_state.suspend_mutex);
+
+	return ret;
+}
+
+/* Send message to xs, get kmalloc'ed reply.  ERR_PTR() on error. */
+static void *xs_talkv(struct xenbus_transaction t,
+		      enum xsd_sockmsg_type type,
+		      const struct kvec *iovec,
+		      unsigned int num_vecs,
+		      unsigned int *len)
+{
+	struct xsd_sockmsg msg;
+	void *ret = NULL;
+	unsigned int i;
+	int err;
+
+	msg.tx_id = t.id;
+	msg.req_id = 0;
+	msg.type = type;
+	msg.len = 0;
+	for (i = 0; i < num_vecs; i++)
+		msg.len += iovec[i].iov_len;
+
+	mutex_lock(&xs_state.request_mutex);
+
+	err = xb_write(&msg, sizeof(msg));
+	if (err) {
+		mutex_unlock(&xs_state.request_mutex);
+		return ERR_PTR(err);
+	}
+
+	for (i = 0; i < num_vecs; i++) {
+		err = xb_write(iovec[i].iov_base, iovec[i].iov_len);;
+		if (err) {
+			mutex_unlock(&xs_state.request_mutex);
+			return ERR_PTR(err);
+		}
+	}
+
+	ret = read_reply(&msg.type, len);
+
+	mutex_unlock(&xs_state.request_mutex);
+
+	if (IS_ERR(ret))
+		return ret;
+
+	if (msg.type == XS_ERROR) {
+		err = get_error(ret);
+		kfree(ret);
+		return ERR_PTR(-err);
+	}
+
+	if (msg.type != type) {
+		if (printk_ratelimit())
+			printk(KERN_WARNING
+			       "XENBUS unexpected type [%d], expected [%d]\n",
+			       msg.type, type);
+		kfree(ret);
+		return ERR_PTR(-EINVAL);
+	}
+	return ret;
+}
+
+/* Simplified version of xs_talkv: single message. */
+static void *xs_single(struct xenbus_transaction t,
+		       enum xsd_sockmsg_type type,
+		       const char *string,
+		       unsigned int *len)
+{
+	struct kvec iovec;
+
+	iovec.iov_base = (void *)string;
+	iovec.iov_len = strlen(string) + 1;
+	return xs_talkv(t, type, &iovec, 1, len);
+}
+
+/* Many commands only need an ack, don't care what it says. */
+static int xs_error(char *reply)
+{
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+	kfree(reply);
+	return 0;
+}
+
+static unsigned int count_strings(const char *strings, unsigned int len)
+{
+	unsigned int num;
+	const char *p;
+
+	for (p = strings, num = 0; p < strings + len; p += strlen(p) + 1)
+		num++;
+
+	return num;
+}
+
+/* Return the path to dir with /name appended. Buffer must be kfree()'ed. */
+static char *join(const char *dir, const char *name)
+{
+	char *buffer;
+
+	if (strlen(name) == 0)
+		buffer = kasprintf(GFP_KERNEL, "%s", dir);
+	else
+		buffer = kasprintf(GFP_KERNEL, "%s/%s", dir, name);
+	return (!buffer) ? ERR_PTR(-ENOMEM) : buffer;
+}
+
+static char **split(char *strings, unsigned int len, unsigned int *num)
+{
+	char *p, **ret;
+
+	/* Count the strings. */
+	*num = count_strings(strings, len);
+
+	/* Transfer to one big alloc for easy freeing. */
+	ret = kmalloc(*num * sizeof(char *) + len, GFP_KERNEL);
+	if (!ret) {
+		kfree(strings);
+		return ERR_PTR(-ENOMEM);
+	}
+	memcpy(&ret[*num], strings, len);
+	kfree(strings);
+
+	strings = (char *)&ret[*num];
+	for (p = strings, *num = 0; p < strings + len; p += strlen(p) + 1)
+		ret[(*num)++] = p;
+
+	return ret;
+}
+
+char **xenbus_directory(struct xenbus_transaction t,
+			const char *dir, const char *node, unsigned int *num)
+{
+	char *strings, *path;
+	unsigned int len;
+
+	path = join(dir, node);
+	if (IS_ERR(path))
+		return (char **)path;
+
+	strings = xs_single(t, XS_DIRECTORY, path, &len);
+	kfree(path);
+	if (IS_ERR(strings))
+		return (char **)strings;
+
+	return split(strings, len, num);
+}
+EXPORT_SYMBOL_GPL(xenbus_directory);
+
+/* Check if a path exists. Return 1 if it does. */
+int xenbus_exists(struct xenbus_transaction t,
+		  const char *dir, const char *node)
+{
+	char **d;
+	int dir_n;
+
+	d = xenbus_directory(t, dir, node, &dir_n);
+	if (IS_ERR(d))
+		return 0;
+	kfree(d);
+	return 1;
+}
+EXPORT_SYMBOL_GPL(xenbus_exists);
+
+/* Get the value of a single file.
+ * Returns a kmalloced value: call free() on it after use.
+ * len indicates length in bytes.
+ */
+void *xenbus_read(struct xenbus_transaction t,
+		  const char *dir, const char *node, unsigned int *len)
+{
+	char *path;
+	void *ret;
+
+	path = join(dir, node);
+	if (IS_ERR(path))
+		return (void *)path;
+
+	ret = xs_single(t, XS_READ, path, len);
+	kfree(path);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xenbus_read);
+
+/* Write the value of a single file.
+ * Returns -err on failure.
+ */
+int xenbus_write(struct xenbus_transaction t,
+		 const char *dir, const char *node, const char *string)
+{
+	const char *path;
+	struct kvec iovec[2];
+	int ret;
+
+	path = join(dir, node);
+	if (IS_ERR(path))
+		return PTR_ERR(path);
+
+	iovec[0].iov_base = (void *)path;
+	iovec[0].iov_len = strlen(path) + 1;
+	iovec[1].iov_base = (void *)string;
+	iovec[1].iov_len = strlen(string);
+
+	ret = xs_error(xs_talkv(t, XS_WRITE, iovec, ARRAY_SIZE(iovec), NULL));
+	kfree(path);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xenbus_write);
+
+/* Create a new directory. */
+int xenbus_mkdir(struct xenbus_transaction t,
+		 const char *dir, const char *node)
+{
+	char *path;
+	int ret;
+
+	path = join(dir, node);
+	if (IS_ERR(path))
+		return PTR_ERR(path);
+
+	ret = xs_error(xs_single(t, XS_MKDIR, path, NULL));
+	kfree(path);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xenbus_mkdir);
+
+/* Destroy a file or directory (directories must be empty). */
+int xenbus_rm(struct xenbus_transaction t, const char *dir, const char *node)
+{
+	char *path;
+	int ret;
+
+	path = join(dir, node);
+	if (IS_ERR(path))
+		return PTR_ERR(path);
+
+	ret = xs_error(xs_single(t, XS_RM, path, NULL));
+	kfree(path);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xenbus_rm);
+
+/* Start a transaction: changes by others will not be seen during this
+ * transaction, and changes will not be visible to others until end.
+ */
+int xenbus_transaction_start(struct xenbus_transaction *t)
+{
+	char *id_str;
+
+	down_read(&xs_state.suspend_mutex);
+
+	id_str = xs_single(XBT_NIL, XS_TRANSACTION_START, "", NULL);
+	if (IS_ERR(id_str)) {
+		up_read(&xs_state.suspend_mutex);
+		return PTR_ERR(id_str);
+	}
+
+	t->id = simple_strtoul(id_str, NULL, 0);
+	kfree(id_str);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xenbus_transaction_start);
+
+/* End a transaction.
+ * If abandon is true, transaction is discarded instead of committed.
+ */
+int xenbus_transaction_end(struct xenbus_transaction t, int abort)
+{
+	char abortstr[2];
+	int err;
+
+	if (abort)
+		strcpy(abortstr, "F");
+	else
+		strcpy(abortstr, "T");
+
+	err = xs_error(xs_single(t, XS_TRANSACTION_END, abortstr, NULL));
+
+	up_read(&xs_state.suspend_mutex);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(xenbus_transaction_end);
+
+/* Single read and scanf: returns -errno or num scanned. */
+int xenbus_scanf(struct xenbus_transaction t,
+		 const char *dir, const char *node, const char *fmt, ...)
+{
+	va_list ap;
+	int ret;
+	char *val;
+
+	val = xenbus_read(t, dir, node, NULL);
+	if (IS_ERR(val))
+		return PTR_ERR(val);
+
+	va_start(ap, fmt);
+	ret = vsscanf(val, fmt, ap);
+	va_end(ap);
+	kfree(val);
+	/* Distinctive errno. */
+	if (ret == 0)
+		return -ERANGE;
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xenbus_scanf);
+
+/* Single printf and write: returns -errno or 0. */
+int xenbus_printf(struct xenbus_transaction t,
+		  const char *dir, const char *node, const char *fmt, ...)
+{
+	va_list ap;
+	int ret;
+#define PRINTF_BUFFER_SIZE 4096
+	char *printf_buffer;
+
+	printf_buffer = kmalloc(PRINTF_BUFFER_SIZE, GFP_KERNEL);
+	if (printf_buffer == NULL)
+		return -ENOMEM;
+
+	va_start(ap, fmt);
+	ret = vsnprintf(printf_buffer, PRINTF_BUFFER_SIZE, fmt, ap);
+	va_end(ap);
+
+	BUG_ON(ret > PRINTF_BUFFER_SIZE-1);
+	ret = xenbus_write(t, dir, node, printf_buffer);
+
+	kfree(printf_buffer);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xenbus_printf);
+
+/* Takes tuples of names, scanf-style args, and void **, NULL terminated. */
+int xenbus_gather(struct xenbus_transaction t, const char *dir, ...)
+{
+	va_list ap;
+	const char *name;
+	int ret = 0;
+
+	va_start(ap, dir);
+	while (ret == 0 && (name = va_arg(ap, char *)) != NULL) {
+		const char *fmt = va_arg(ap, char *);
+		void *result = va_arg(ap, void *);
+		char *p;
+
+		p = xenbus_read(t, dir, name, NULL);
+		if (IS_ERR(p)) {
+			ret = PTR_ERR(p);
+			break;
+		}
+		if (fmt) {
+			if (sscanf(p, fmt, result) == 0)
+				ret = -EINVAL;
+			kfree(p);
+		} else
+			*(char **)result = p;
+	}
+	va_end(ap);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xenbus_gather);
+
+static int xs_watch(const char *path, const char *token)
+{
+	struct kvec iov[2];
+
+	iov[0].iov_base = (void *)path;
+	iov[0].iov_len = strlen(path) + 1;
+	iov[1].iov_base = (void *)token;
+	iov[1].iov_len = strlen(token) + 1;
+
+	return xs_error(xs_talkv(XBT_NIL, XS_WATCH, iov,
+				 ARRAY_SIZE(iov), NULL));
+}
+
+static int xs_unwatch(const char *path, const char *token)
+{
+	struct kvec iov[2];
+
+	iov[0].iov_base = (char *)path;
+	iov[0].iov_len = strlen(path) + 1;
+	iov[1].iov_base = (char *)token;
+	iov[1].iov_len = strlen(token) + 1;
+
+	return xs_error(xs_talkv(XBT_NIL, XS_UNWATCH, iov,
+				 ARRAY_SIZE(iov), NULL));
+}
+
+static struct xenbus_watch *find_watch(const char *token)
+{
+	struct xenbus_watch *i, *cmp;
+
+	cmp = (void *)simple_strtoul(token, NULL, 16);
+
+	list_for_each_entry(i, &watches, list)
+		if (i == cmp)
+			return i;
+
+	return NULL;
+}
+
+/* Register callback to watch this node. */
+int register_xenbus_watch(struct xenbus_watch *watch)
+{
+	/* Pointer in ascii is the token. */
+	char token[sizeof(watch) * 2 + 1];
+	int err;
+
+	sprintf(token, "%lX", (long)watch);
+
+	down_read(&xs_state.suspend_mutex);
+
+	spin_lock(&watches_lock);
+	BUG_ON(find_watch(token));
+	list_add(&watch->list, &watches);
+	spin_unlock(&watches_lock);
+
+	err = xs_watch(watch->node, token);
+
+	/* Ignore errors due to multiple registration. */
+	if ((err != 0) && (err != -EEXIST)) {
+		spin_lock(&watches_lock);
+		list_del(&watch->list);
+		spin_unlock(&watches_lock);
+	}
+
+	up_read(&xs_state.suspend_mutex);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(register_xenbus_watch);
+
+void unregister_xenbus_watch(struct xenbus_watch *watch)
+{
+	struct xs_stored_msg *msg, *tmp;
+	char token[sizeof(watch) * 2 + 1];
+	int err;
+
+	sprintf(token, "%lX", (long)watch);
+
+	down_read(&xs_state.suspend_mutex);
+
+	spin_lock(&watches_lock);
+	BUG_ON(!find_watch(token));
+	list_del(&watch->list);
+	spin_unlock(&watches_lock);
+
+	err = xs_unwatch(watch->node, token);
+	if (err)
+		printk(KERN_WARNING
+		       "XENBUS Failed to release watch %s: %i\n",
+		       watch->node, err);
+
+	up_read(&xs_state.suspend_mutex);
+
+	/* Make sure there are no callbacks running currently (unless
+	   its us) */
+	if (current->pid != xenwatch_pid)
+		mutex_lock(&xenwatch_mutex);
+
+	/* Cancel pending watch events. */
+	spin_lock(&watch_events_lock);
+	list_for_each_entry_safe(msg, tmp, &watch_events, list) {
+		if (msg->u.watch.handle != watch)
+			continue;
+		list_del(&msg->list);
+		kfree(msg->u.watch.vec);
+		kfree(msg);
+	}
+	spin_unlock(&watch_events_lock);
+
+	if (current->pid != xenwatch_pid)
+		mutex_unlock(&xenwatch_mutex);
+}
+EXPORT_SYMBOL_GPL(unregister_xenbus_watch);
+
+void xs_suspend(void)
+{
+	struct xenbus_watch *watch;
+	char token[sizeof(watch) * 2 + 1];
+
+	down_write(&xs_state.suspend_mutex);
+
+	/* No need for watches_lock: the suspend_mutex is sufficient. */
+	list_for_each_entry(watch, &watches, list) {
+		sprintf(token, "%lX", (long)watch);
+		xs_unwatch(watch->node, token);
+	}
+
+	mutex_lock(&xs_state.request_mutex);
+}
+
+void xs_resume(void)
+{
+	struct xenbus_watch *watch;
+	char token[sizeof(watch) * 2 + 1];
+
+	mutex_unlock(&xs_state.request_mutex);
+
+	/* No need for watches_lock: the suspend_mutex is sufficient. */
+	list_for_each_entry(watch, &watches, list) {
+		sprintf(token, "%lX", (long)watch);
+		xs_watch(watch->node, token);
+	}
+
+	up_write(&xs_state.suspend_mutex);
+}
+
+static int xenwatch_thread(void *unused)
+{
+	struct list_head *ent;
+	struct xs_stored_msg *msg;
+
+	for (;;) {
+		wait_event_interruptible(watch_events_waitq,
+					 !list_empty(&watch_events));
+
+		if (kthread_should_stop())
+			break;
+
+		mutex_lock(&xenwatch_mutex);
+
+		spin_lock(&watch_events_lock);
+		ent = watch_events.next;
+		if (ent != &watch_events)
+			list_del(ent);
+		spin_unlock(&watch_events_lock);
+
+		if (ent != &watch_events) {
+			msg = list_entry(ent, struct xs_stored_msg, list);
+			msg->u.watch.handle->callback(
+				msg->u.watch.handle,
+				(const char **)msg->u.watch.vec,
+				msg->u.watch.vec_size);
+			kfree(msg->u.watch.vec);
+			kfree(msg);
+		}
+
+		mutex_unlock(&xenwatch_mutex);
+	}
+
+	return 0;
+}
+
+static int process_msg(void)
+{
+	struct xs_stored_msg *msg;
+	char *body;
+	int err;
+
+	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+	if (msg == NULL)
+		return -ENOMEM;
+
+	err = xb_read(&msg->hdr, sizeof(msg->hdr));
+	if (err) {
+		kfree(msg);
+		return err;
+	}
+
+	body = kmalloc(msg->hdr.len + 1, GFP_KERNEL);
+	if (body == NULL) {
+		kfree(msg);
+		return -ENOMEM;
+	}
+
+	err = xb_read(body, msg->hdr.len);
+	if (err) {
+		kfree(body);
+		kfree(msg);
+		return err;
+	}
+	body[msg->hdr.len] = '\0';
+
+	if (msg->hdr.type == XS_WATCH_EVENT) {
+		msg->u.watch.vec = split(body, msg->hdr.len,
+					 &msg->u.watch.vec_size);
+		if (IS_ERR(msg->u.watch.vec)) {
+			kfree(msg);
+			return PTR_ERR(msg->u.watch.vec);
+		}
+
+		spin_lock(&watches_lock);
+		msg->u.watch.handle = find_watch(
+			msg->u.watch.vec[XS_WATCH_TOKEN]);
+		if (msg->u.watch.handle != NULL) {
+			spin_lock(&watch_events_lock);
+			list_add_tail(&msg->list, &watch_events);
+			wake_up(&watch_events_waitq);
+			spin_unlock(&watch_events_lock);
+		} else {
+			kfree(msg->u.watch.vec);
+			kfree(msg);
+		}
+		spin_unlock(&watches_lock);
+	} else {
+		msg->u.reply.body = body;
+		spin_lock(&xs_state.reply_lock);
+		list_add_tail(&msg->list, &xs_state.reply_list);
+		spin_unlock(&xs_state.reply_lock);
+		wake_up(&xs_state.reply_waitq);
+	}
+
+	return 0;
+}
+
+static int xenbus_thread(void *unused)
+{
+	int err;
+
+	for (;;) {
+		err = process_msg();
+		if (err)
+			printk(KERN_WARNING "XENBUS error %d while reading "
+			       "message\n", err);
+		if (kthread_should_stop())
+			break;
+	}
+
+	return 0;
+}
+
+int xs_init(void)
+{
+	int err;
+	struct task_struct *task;
+
+	INIT_LIST_HEAD(&xs_state.reply_list);
+	spin_lock_init(&xs_state.reply_lock);
+	init_waitqueue_head(&xs_state.reply_waitq);
+
+	mutex_init(&xs_state.request_mutex);
+	init_rwsem(&xs_state.suspend_mutex);
+
+	/* Initialize the shared memory rings to talk to xenstored */
+	err = xb_init_comms();
+	if (err)
+		return err;
+
+	task = kthread_run(xenwatch_thread, NULL, "xenwatch");
+	if (IS_ERR(task))
+		return PTR_ERR(task);
+	xenwatch_pid = task->pid;
+
+	task = kthread_run(xenbus_thread, NULL, "xenbus");
+	if (IS_ERR(task))
+		return PTR_ERR(task);
+
+	return 0;
+}
===================================================================
--- a/include/asm-i386/xen/hypervisor.h
+++ b/include/asm-i386/xen/hypervisor.h
@@ -42,6 +42,7 @@
 
 #include <asm/ptrace.h>
 #include <asm/page.h>
+#include <asm/desc.h>
 #if defined(__i386__)
 #  ifdef CONFIG_X86_PAE
 #   include <asm-generic/pgtable-nopud.h>
===================================================================
--- /dev/null
+++ b/include/xen/xenbus.h
@@ -0,0 +1,208 @@
+/******************************************************************************
+ * xenbus.h
+ *
+ * Talks to Xen Store to figure out what devices we have.
+ *
+ * Copyright (C) 2005 Rusty Russell, IBM Corporation
+ * Copyright (C) 2005 XenSource Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef _XEN_XENBUS_H
+#define _XEN_XENBUS_H
+
+#include <linux/device.h>
+#include <linux/notifier.h>
+#include <linux/mutex.h>
+#include <linux/completion.h>
+#include <linux/init.h>
+#include <xen/interface/xen.h>
+#include <xen/interface/grant_table.h>
+#include <xen/interface/io/xenbus.h>
+#include <xen/interface/io/xs_wire.h>
+
+/* Register callback to watch this node. */
+struct xenbus_watch
+{
+	struct list_head list;
+
+	/* Path being watched. */
+	const char *node;
+
+	/* Callback (executed in a process context with no locks held). */
+	void (*callback)(struct xenbus_watch *,
+			 const char **vec, unsigned int len);
+};
+
+
+/* A xenbus device. */
+struct xenbus_device {
+	const char *devicetype;
+	const char *nodename;
+	const char *otherend;
+	int otherend_id;
+	struct xenbus_watch otherend_watch;
+	struct device dev;
+	enum xenbus_state state;
+	struct completion down;
+};
+
+static inline struct xenbus_device *to_xenbus_device(struct device *dev)
+{
+	return container_of(dev, struct xenbus_device, dev);
+}
+
+struct xenbus_device_id
+{
+	/* .../device/<device_type>/<identifier> */
+	char devicetype[32]; 	/* General class of device. */
+};
+
+/* A xenbus driver. */
+struct xenbus_driver {
+	char *name;
+	struct module *owner;
+	const struct xenbus_device_id *ids;
+	int (*probe)(struct xenbus_device *dev,
+		     const struct xenbus_device_id *id);
+	void (*otherend_changed)(struct xenbus_device *dev,
+				 enum xenbus_state backend_state);
+	int (*remove)(struct xenbus_device *dev);
+	int (*suspend)(struct xenbus_device *dev);
+	int (*resume)(struct xenbus_device *dev);
+	int (*uevent)(struct xenbus_device *, char **, int, char *, int);
+	struct device_driver driver;
+	int (*read_otherend_details)(struct xenbus_device *dev);
+};
+
+static inline struct xenbus_driver *to_xenbus_driver(struct device_driver *drv)
+{
+	return container_of(drv, struct xenbus_driver, driver);
+}
+
+int xenbus_register_frontend(struct xenbus_driver *drv);
+int xenbus_register_backend(struct xenbus_driver *drv);
+void xenbus_unregister_driver(struct xenbus_driver *drv);
+
+struct xenbus_transaction
+{
+	u32 id;
+};
+
+/* Nil transaction ID. */
+#define XBT_NIL ((struct xenbus_transaction) { 0 })
+
+int __init xenbus_dev_init(void);
+
+char **xenbus_directory(struct xenbus_transaction t,
+			const char *dir, const char *node, unsigned int *num);
+void *xenbus_read(struct xenbus_transaction t,
+		  const char *dir, const char *node, unsigned int *len);
+int xenbus_write(struct xenbus_transaction t,
+		 const char *dir, const char *node, const char *string);
+int xenbus_mkdir(struct xenbus_transaction t,
+		 const char *dir, const char *node);
+int xenbus_exists(struct xenbus_transaction t,
+		  const char *dir, const char *node);
+int xenbus_rm(struct xenbus_transaction t, const char *dir, const char *node);
+int xenbus_transaction_start(struct xenbus_transaction *t);
+int xenbus_transaction_end(struct xenbus_transaction t, int abort);
+
+/* Single read and scanf: returns -errno or num scanned if > 0. */
+int xenbus_scanf(struct xenbus_transaction t,
+		 const char *dir, const char *node, const char *fmt, ...)
+	__attribute__((format(scanf, 4, 5)));
+
+/* Single printf and write: returns -errno or 0. */
+int xenbus_printf(struct xenbus_transaction t,
+		  const char *dir, const char *node, const char *fmt, ...)
+	__attribute__((format(printf, 4, 5)));
+
+/* Generic read function: NULL-terminated triples of name,
+ * sprintf-style type string, and pointer. Returns 0 or errno.*/
+int xenbus_gather(struct xenbus_transaction t, const char *dir, ...);
+
+/* notifer routines for when the xenstore comes up */
+extern int xenstored_ready;
+int register_xenstore_notifier(struct notifier_block *nb);
+void unregister_xenstore_notifier(struct notifier_block *nb);
+
+int register_xenbus_watch(struct xenbus_watch *watch);
+void unregister_xenbus_watch(struct xenbus_watch *watch);
+void xs_suspend(void);
+void xs_resume(void);
+
+/* Used by xenbus_dev to borrow kernel's store connection. */
+void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg);
+
+/* Called from xen core code. */
+void xenbus_suspend(void);
+void xenbus_resume(void);
+void xenbus_probe(struct work_struct *);
+
+#define XENBUS_IS_ERR_READ(str) ({			\
+	if (!IS_ERR(str) && strlen(str) == 0) {		\
+		kfree(str);				\
+		str = ERR_PTR(-ERANGE);			\
+	}						\
+	IS_ERR(str);					\
+})
+
+#define XENBUS_EXIST_ERR(err) ((err) == -ENOENT || (err) == -ERANGE)
+
+int xenbus_watch_path(struct xenbus_device *dev, const char *path,
+		      struct xenbus_watch *watch,
+		      void (*callback)(struct xenbus_watch *,
+				       const char **, unsigned int));
+int xenbus_watch_path2(struct xenbus_device *dev, const char *path,
+		       const char *path2, struct xenbus_watch *watch,
+		       void (*callback)(struct xenbus_watch *,
+					const char **, unsigned int));
+int xenbus_switch_state(struct xenbus_device *dev, enum xenbus_state new_state);
+int xenbus_grant_ring(struct xenbus_device *dev, unsigned long ring_mfn);
+int xenbus_map_ring_valloc(struct xenbus_device *dev,
+			   int gnt_ref, void **vaddr);
+int xenbus_map_ring(struct xenbus_device *dev, int gnt_ref,
+			   grant_handle_t *handle, void *vaddr);
+
+int xenbus_unmap_ring_vfree(struct xenbus_device *dev, void *vaddr);
+int xenbus_unmap_ring(struct xenbus_device *dev,
+		      grant_handle_t handle, void *vaddr);
+
+int xenbus_alloc_evtchn(struct xenbus_device *dev, int *port);
+int xenbus_bind_evtchn(struct xenbus_device *dev, int remote_port, int *port);
+int xenbus_free_evtchn(struct xenbus_device *dev, int port);
+
+enum xenbus_state xenbus_read_driver_state(const char *path);
+
+void xenbus_dev_error(struct xenbus_device *dev, int err, const char *fmt, ...);
+void xenbus_dev_fatal(struct xenbus_device *dev, int err, const char *fmt, ...);
+
+char *xenbus_strstate(enum xenbus_state state);
+int xenbus_dev_is_online(struct xenbus_device *dev);
+int xenbus_frontend_closed(struct xenbus_device *dev);
+
+#endif /* _XEN_XENBUS_H */

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 25/26] Xen-paravirt_ops: Add Xen virtual block device driver.
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (23 preceding siblings ...)
  2007-03-01 23:25 ` [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver Jeremy Fitzhardinge
@ 2007-03-01 23:25 ` Jeremy Fitzhardinge
  2007-03-01 23:25 ` [patch 26/26] Xen-paravirt_ops: Add the Xen virtual network " Jeremy Fitzhardinge
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Ian Pratt,
	Christian Limpach, Arjan van de Ven, Greg KH, Jens Axboe

[-- Attachment #1: xen-blockfront.patch --]
[-- Type: text/plain, Size: 36131 bytes --]

The block device frontend driver allows the kernel to access block
devices exported exported by a virtual machine containing a physical
block device driver.

Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
---
 drivers/block/Kconfig        |    1 
 drivers/block/Makefile       |    1 
 drivers/block/xen/Kconfig    |   14 
 drivers/block/xen/Makefile   |    5 
 drivers/block/xen/blkfront.c |  844 ++++++++++++++++++++++++++++++++++++++++++
 drivers/block/xen/block.h    |  135 ++++++
 drivers/block/xen/vbd.c      |  230 +++++++++++
 include/linux/major.h        |    2 
 8 files changed, 1232 insertions(+)

===================================================================
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -461,6 +461,7 @@ config CDROM_PKTCDVD_WCACHE
 	  don't do deferred write error handling yet.
 
 source "drivers/s390/block/Kconfig"
+source "drivers/block/xen/Kconfig"
 
 config ATA_OVER_ETH
 	tristate "ATA over Ethernet support"
===================================================================
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
 
+obj-$(CONFIG_XEN)		+= xen/
===================================================================
--- /dev/null
+++ b/drivers/block/xen/Kconfig
@@ -0,0 +1,14 @@
+menu "Xen block device drivers"
+        depends on XEN
+
+config XEN_BLKDEV_FRONTEND
+	tristate "Block device frontend driver"
+	depends on XEN
+	default y
+	help
+	  The block device frontend driver allows the kernel to access block
+	  devices exported from a device driver virtual machine. Unless you
+	  are building a dedicated device driver virtual machine, then you
+	  almost certainly want to say Y here.
+
+endmenu
===================================================================
--- /dev/null
+++ b/drivers/block/xen/Makefile
@@ -0,0 +1,5 @@
+
+obj-$(CONFIG_XEN_BLKDEV_FRONTEND)	:= xenblk.o
+
+xenblk-objs := blkfront.o vbd.o
+
===================================================================
--- /dev/null
+++ b/drivers/block/xen/blkfront.c
@@ -0,0 +1,844 @@
+/******************************************************************************
+ * blkfront.c
+ *
+ * XenLinux virtual block device driver.
+ *
+ * Copyright (c) 2003-2004, Keir Fraser & Steve Hand
+ * Modifications by Mark A. Williamson are (c) Intel Research Cambridge
+ * Copyright (c) 2004, Christian Limpach
+ * Copyright (c) 2004, Andrew Warfield
+ * Copyright (c) 2005, Christopher Clark
+ * Copyright (c) 2005, XenSource Ltd
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/version.h>
+#include "block.h"
+#include <linux/cdrom.h>
+#include <linux/sched.h>
+#include <linux/interrupt.h>
+#include <scsi/scsi.h>
+#include <xen/xenbus.h>
+#include <xen/interface/grant_table.h>
+#include <xen/grant_table.h>
+#include <xen/events.h>
+#include <xen/page.h>
+#include <asm/xen/hypervisor.h>
+
+#define BLKIF_STATE_DISCONNECTED 0
+#define BLKIF_STATE_CONNECTED    1
+#define BLKIF_STATE_SUSPENDED    2
+
+#define MAXIMUM_OUTSTANDING_BLOCK_REQS \
+    (BLKIF_MAX_SEGMENTS_PER_REQUEST * BLK_RING_SIZE)
+#define GRANT_INVALID_REF	0
+
+static void connect(struct blkfront_info *);
+static void blkfront_closing(struct xenbus_device *);
+static int blkfront_remove(struct xenbus_device *);
+static int talk_to_backend(struct xenbus_device *, struct blkfront_info *);
+static int setup_blkring(struct xenbus_device *, struct blkfront_info *);
+
+static void kick_pending_request_queues(struct blkfront_info *);
+
+static irqreturn_t blkif_int(int irq, void *dev_id);
+static void blkif_restart_queue(struct work_struct *work);
+static void blkif_recover(struct blkfront_info *);
+static void blkif_completion(struct blk_shadow *);
+static void blkif_free(struct blkfront_info *, int);
+
+
+/**
+ * Entry point to this code when a new device is created.  Allocate the basic
+ * structures and the ring buffer for communication with the backend, and
+ * inform the backend of the appropriate details for those.  Switch to
+ * Initialised state.
+ */
+static int blkfront_probe(struct xenbus_device *dev,
+			  const struct xenbus_device_id *id)
+{
+	int err, vdevice, i;
+	struct blkfront_info *info;
+
+	/* FIXME: Use dynamic device id if this is not set. */
+	err = xenbus_scanf(XBT_NIL, dev->nodename,
+			   "virtual-device", "%i", &vdevice);
+	if (err != 1) {
+		xenbus_dev_fatal(dev, err, "reading virtual-device");
+		return err;
+	}
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		xenbus_dev_fatal(dev, -ENOMEM, "allocating info structure");
+		return -ENOMEM;
+	}
+
+	info->xbdev = dev;
+	info->vdevice = vdevice;
+	info->connected = BLKIF_STATE_DISCONNECTED;
+	INIT_WORK(&info->work, blkif_restart_queue);
+
+	for (i = 0; i < BLK_RING_SIZE; i++)
+		info->shadow[i].req.id = i+1;
+	info->shadow[BLK_RING_SIZE-1].req.id = 0x0fffffff;
+
+	/* Front end dir is a number, which is used as the id. */
+	info->handle = simple_strtoul(strrchr(dev->nodename,'/')+1, NULL, 0);
+	dev->dev.driver_data = info;
+
+	err = talk_to_backend(dev, info);
+	if (err) {
+		kfree(info);
+		dev->dev.driver_data = NULL;
+		return err;
+	}
+
+	return 0;
+}
+
+
+/**
+ * We are reconnecting to the backend, due to a suspend/resume, or a backend
+ * driver restart.  We tear down our blkif structure and recreate it, but
+ * leave the device-layer structures intact so that this is transparent to the
+ * rest of the kernel.
+ */
+static int blkfront_resume(struct xenbus_device *dev)
+{
+	struct blkfront_info *info = dev->dev.driver_data;
+	int err;
+
+	dev_dbg(&dev->dev, "blkfront_resume: %s\n", dev->nodename);
+
+	blkif_free(info, info->connected == BLKIF_STATE_CONNECTED);
+
+	err = talk_to_backend(dev, info);
+	if (info->connected == BLKIF_STATE_SUSPENDED && !err)
+		blkif_recover(info);
+
+	return err;
+}
+
+
+/* Common code used when first setting up, and when resuming. */
+static int talk_to_backend(struct xenbus_device *dev,
+			   struct blkfront_info *info)
+{
+	const char *message = NULL;
+	struct xenbus_transaction xbt;
+	int err;
+
+	/* Create shared ring, alloc event channel. */
+	err = setup_blkring(dev, info);
+	if (err)
+		goto out;
+
+again:
+	err = xenbus_transaction_start(&xbt);
+	if (err) {
+		xenbus_dev_fatal(dev, err, "starting transaction");
+		goto destroy_blkring;
+	}
+
+	err = xenbus_printf(xbt, dev->nodename,
+			    "ring-ref","%u", info->ring_ref);
+	if (err) {
+		message = "writing ring-ref";
+		goto abort_transaction;
+	}
+	err = xenbus_printf(xbt, dev->nodename,
+			    "event-channel", "%u", info->evtchn);
+	if (err) {
+		message = "writing event-channel";
+		goto abort_transaction;
+	}
+
+	err = xenbus_transaction_end(xbt, 0);
+	if (err) {
+		if (err == -EAGAIN)
+			goto again;
+		xenbus_dev_fatal(dev, err, "completing transaction");
+		goto destroy_blkring;
+	}
+
+	xenbus_switch_state(dev, XenbusStateInitialised);
+
+	return 0;
+
+ abort_transaction:
+	xenbus_transaction_end(xbt, 1);
+	if (message)
+		xenbus_dev_fatal(dev, err, "%s", message);
+ destroy_blkring:
+	blkif_free(info, 0);
+ out:
+	return err;
+}
+
+
+static int setup_blkring(struct xenbus_device *dev,
+			 struct blkfront_info *info)
+{
+	struct blkif_sring *sring;
+	int err;
+
+	info->ring_ref = GRANT_INVALID_REF;
+
+	sring = (struct blkif_sring *)__get_free_page(GFP_KERNEL);
+	if (!sring) {
+		xenbus_dev_fatal(dev, -ENOMEM, "allocating shared ring");
+		return -ENOMEM;
+	}
+	SHARED_RING_INIT(sring);
+	FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);
+
+	err = xenbus_grant_ring(dev, virt_to_mfn(info->ring.sring));
+	if (err < 0) {
+		free_page((unsigned long)sring);
+		info->ring.sring = NULL;
+		goto fail;
+	}
+	info->ring_ref = err;
+
+	err = xenbus_alloc_evtchn(dev, &info->evtchn);
+	if (err)
+		goto fail;
+
+	err = bind_evtchn_to_irqhandler(
+		info->evtchn, blkif_int, SA_SAMPLE_RANDOM, "blkif", info);
+	if (err <= 0) {
+		xenbus_dev_fatal(dev, err,
+				 "bind_evtchn_to_irqhandler failed");
+		goto fail;
+	}
+	info->irq = err;
+
+	return 0;
+fail:
+	blkif_free(info, 0);
+	return err;
+}
+
+
+/**
+ * Callback received when the backend's state changes.
+ */
+static void backend_changed(struct xenbus_device *dev,
+			    enum xenbus_state backend_state)
+{
+	struct blkfront_info *info = dev->dev.driver_data;
+	struct block_device *bd;
+
+	dev_dbg(&dev->dev, "blkfront:backend_changed.\n");
+
+	switch (backend_state) {
+	case XenbusStateInitialising:
+	case XenbusStateInitWait:
+	case XenbusStateInitialised:
+	case XenbusStateUnknown:
+	case XenbusStateClosed:
+		break;
+
+	case XenbusStateConnected:
+		connect(info);
+		break;
+
+	case XenbusStateClosing:
+		bd = bdget(info->dev);
+		if (bd == NULL)
+			xenbus_dev_fatal(dev, -ENODEV, "bdget failed");
+
+		mutex_lock(&bd->bd_mutex);
+		if (info->users > 0)
+			xenbus_dev_error(dev, -EBUSY,
+					 "Device in use; refusing to close");
+		else
+			blkfront_closing(dev);
+		mutex_unlock(&bd->bd_mutex);
+		bdput(bd);
+		break;
+	}
+}
+
+
+/* ** Connection ** */
+
+
+/*
+ * Invoked when the backend is finally 'ready' (and has told produced
+ * the details about the physical device - #sectors, size, etc).
+ */
+static void connect(struct blkfront_info *info)
+{
+	unsigned long long sectors;
+	unsigned long sector_size;
+	unsigned int binfo;
+	int err;
+
+	if ((info->connected == BLKIF_STATE_CONNECTED) ||
+	    (info->connected == BLKIF_STATE_SUSPENDED) )
+		return;
+
+	dev_dbg(&info->dev, "blkfront.c:connect:%s.\n", info->xbdev->otherend);
+
+	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+			    "sectors", "%llu", &sectors,
+			    "info", "%u", &binfo,
+			    "sector-size", "%lu", &sector_size,
+			    NULL);
+	if (err) {
+		xenbus_dev_fatal(info->xbdev, err,
+				 "reading backend fields at %s",
+				 info->xbdev->otherend);
+		return;
+	}
+
+	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+			    "feature-barrier", "%lu", &info->feature_barrier,
+			    NULL);
+	if (err)
+		info->feature_barrier = 0;
+
+	err = xlvbd_add(sectors, info->vdevice, binfo, sector_size, info);
+	if (err) {
+		xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
+				 info->xbdev->otherend);
+		return;
+	}
+
+	xenbus_switch_state(info->xbdev, XenbusStateConnected);
+
+	/* Kick pending requests. */
+	spin_lock_irq(&blkif_io_lock);
+	info->connected = BLKIF_STATE_CONNECTED;
+	kick_pending_request_queues(info);
+	spin_unlock_irq(&blkif_io_lock);
+
+	add_disk(info->gd);
+}
+
+/**
+ * Handle the change of state of the backend to Closing.  We must delete our
+ * device-layer structures now, to ensure that writes are flushed through to
+ * the backend.  Once is this done, we can switch to Closed in
+ * acknowledgement.
+ */
+static void blkfront_closing(struct xenbus_device *dev)
+{
+	struct blkfront_info *info = dev->dev.driver_data;
+	unsigned long flags;
+
+	dev_dbg(&dev->dev, "blkfront_closing: %s removed\n", dev->nodename);
+
+	if (info->rq == NULL)
+		goto out;
+
+	spin_lock_irqsave(&blkif_io_lock, flags);
+	/* No more blkif_request(). */
+	blk_stop_queue(info->rq);
+	/* No more gnttab callback work. */
+	gnttab_cancel_free_callback(&info->callback);
+	spin_unlock_irqrestore(&blkif_io_lock, flags);
+
+	/* Flush gnttab callback work. Must be done with no locks held. */
+	flush_scheduled_work();
+
+	xlvbd_del(info);
+
+ out:
+	xenbus_frontend_closed(dev);
+}
+
+
+static int blkfront_remove(struct xenbus_device *dev)
+{
+	struct blkfront_info *info = dev->dev.driver_data;
+
+	dev_dbg(&dev->dev, "blkfront_remove: %s removed\n", dev->nodename);
+
+	blkif_free(info, 0);
+
+	kfree(info);
+
+	return 0;
+}
+
+
+static inline int GET_ID_FROM_FREELIST(
+	struct blkfront_info *info)
+{
+	unsigned long free = info->shadow_free;
+	BUG_ON(free > BLK_RING_SIZE);
+	info->shadow_free = info->shadow[free].req.id;
+	info->shadow[free].req.id = 0x0fffffee; /* debug */
+	return free;
+}
+
+static inline void ADD_ID_TO_FREELIST(
+	struct blkfront_info *info, unsigned long id)
+{
+	info->shadow[id].req.id  = info->shadow_free;
+	info->shadow[id].request = 0;
+	info->shadow_free = id;
+}
+
+static inline void flush_requests(struct blkfront_info *info)
+{
+	int notify;
+
+	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&info->ring, notify);
+
+	if (notify)
+		notify_remote_via_irq(info->irq);
+}
+
+static void kick_pending_request_queues(struct blkfront_info *info)
+{
+	if (!RING_FULL(&info->ring)) {
+		/* Re-enable calldowns. */
+		blk_start_queue(info->rq);
+		/* Kick things off immediately. */
+		do_blkif_request(info->rq);
+	}
+}
+
+static void blkif_restart_queue(struct work_struct *work)
+{
+	struct blkfront_info *info = container_of(work, struct blkfront_info, work);
+
+	spin_lock_irq(&blkif_io_lock);
+	if (info->connected == BLKIF_STATE_CONNECTED)
+		kick_pending_request_queues(info);
+	spin_unlock_irq(&blkif_io_lock);
+}
+
+static void blkif_restart_queue_callback(void *arg)
+{
+	struct blkfront_info *info = (struct blkfront_info *)arg;
+	schedule_work(&info->work);
+}
+
+int blkif_open(struct inode *inode, struct file *filep)
+{
+	struct blkfront_info *info = inode->i_bdev->bd_disk->private_data;
+	info->users++;
+	return 0;
+}
+
+
+int blkif_release(struct inode *inode, struct file *filep)
+{
+	struct blkfront_info *info = inode->i_bdev->bd_disk->private_data;
+	info->users--;
+	if (info->users == 0) {
+		/* Check whether we have been instructed to close.  We will
+		   have ignored this request initially, as the device was
+		   still mounted. */
+		struct xenbus_device * dev = info->xbdev;
+		enum xenbus_state state = xenbus_read_driver_state(dev->otherend);
+
+		if (state == XenbusStateClosing)
+			blkfront_closing(dev);
+	}
+	return 0;
+}
+
+
+int blkif_getgeo(struct block_device *bd, struct hd_geometry *hg)
+{
+	/* We don't have real geometry info, but let's at least return
+	   values consistent with the size of the device */
+	sector_t nsect = get_capacity(bd->bd_disk);
+	sector_t cylinders = nsect;
+
+	hg->heads = 0xff;
+	hg->sectors = 0x3f;
+	sector_div(cylinders, hg->heads * hg->sectors);
+	hg->cylinders = cylinders;
+	if ((sector_t)(hg->cylinders + 1) * hg->heads * hg->sectors < nsect)
+		hg->cylinders = 0xffff;
+	return 0;
+}
+
+/*
+ * blkif_queue_request
+ *
+ * request block io
+ *
+ * id: for guest use only.
+ * operation: BLKIF_OP_{READ,WRITE,PROBE}
+ * buffer: buffer to read/write into. this should be a
+ *   virtual address in the guest os.
+ */
+static int blkif_queue_request(struct request *req)
+{
+	struct blkfront_info *info = req->rq_disk->private_data;
+	unsigned long buffer_mfn;
+	struct blkif_request *ring_req;
+	struct bio *bio;
+	struct bio_vec *bvec;
+	int idx;
+	unsigned long id;
+	unsigned int fsect, lsect;
+	int ref;
+	grant_ref_t gref_head;
+
+	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
+		return 1;
+
+	if (gnttab_alloc_grant_references(
+		BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
+		gnttab_request_free_callback(
+			&info->callback,
+			blkif_restart_queue_callback,
+			info,
+			BLKIF_MAX_SEGMENTS_PER_REQUEST);
+		return 1;
+	}
+
+	/* Fill out a communications ring structure. */
+	ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
+	id = GET_ID_FROM_FREELIST(info);
+	info->shadow[id].request = (unsigned long)req;
+
+	ring_req->id = id;
+	ring_req->sector_number = (blkif_sector_t)req->sector;
+	ring_req->handle = info->handle;
+
+	ring_req->operation = rq_data_dir(req) ?
+		BLKIF_OP_WRITE : BLKIF_OP_READ;
+	if (blk_barrier_rq(req))
+		ring_req->operation = BLKIF_OP_WRITE_BARRIER;
+
+	ring_req->nr_segments = 0;
+	rq_for_each_bio (bio, req) {
+		bio_for_each_segment (bvec, bio, idx) {
+			BUG_ON(ring_req->nr_segments
+			       == BLKIF_MAX_SEGMENTS_PER_REQUEST);
+			buffer_mfn = pfn_to_mfn(page_to_pfn(bvec->bv_page));
+			fsect = bvec->bv_offset >> 9;
+			lsect = fsect + (bvec->bv_len >> 9) - 1;
+			/* install a grant reference. */
+			ref = gnttab_claim_grant_reference(&gref_head);
+			BUG_ON(ref == -ENOSPC);
+
+			gnttab_grant_foreign_access_ref(
+				ref,
+				info->xbdev->otherend_id,
+				buffer_mfn,
+				rq_data_dir(req) );
+
+			info->shadow[id].frame[ring_req->nr_segments] =
+				mfn_to_pfn(buffer_mfn);
+
+			ring_req->seg[ring_req->nr_segments] =
+				(struct blkif_request_segment) {
+					.gref       = ref,
+					.first_sect = fsect,
+					.last_sect  = lsect };
+
+			ring_req->nr_segments++;
+		}
+	}
+
+	info->ring.req_prod_pvt++;
+
+	/* Keep a private copy so we can reissue requests when recovering. */
+	info->shadow[id].req = *ring_req;
+
+	gnttab_free_grant_references(gref_head);
+
+	return 0;
+}
+
+/*
+ * do_blkif_request
+ *  read a block; request is in a request queue
+ */
+void do_blkif_request(request_queue_t *rq)
+{
+	struct blkfront_info *info = NULL;
+	struct request *req;
+	int queued;
+
+	pr_debug("Entered do_blkif_request\n");
+
+	queued = 0;
+
+	while ((req = elv_next_request(rq)) != NULL) {
+		info = req->rq_disk->private_data;
+		if (!blk_fs_request(req)) {
+			end_request(req, 0);
+			continue;
+		}
+
+		if (RING_FULL(&info->ring))
+			goto wait;
+
+		pr_debug("do_blk_req %p: cmd %p, sec %lx, "
+			 "(%u/%li) buffer:%p [%s]\n",
+			 req, req->cmd, (unsigned long)req->sector,
+			 req->current_nr_sectors,
+			 req->nr_sectors, req->buffer,
+			 rq_data_dir(req) ? "write" : "read");
+
+
+		blkdev_dequeue_request(req);
+		if (blkif_queue_request(req)) {
+			blk_requeue_request(rq, req);
+		wait:
+			/* Avoid pointless unplugs. */
+			blk_stop_queue(rq);
+			break;
+		}
+
+		queued++;
+	}
+
+	if (queued != 0)
+		flush_requests(info);
+}
+
+
+static irqreturn_t blkif_int(int irq, void *dev_id)
+{
+	struct request *req;
+	struct blkif_response *bret;
+	RING_IDX i, rp;
+	unsigned long flags;
+	struct blkfront_info *info = (struct blkfront_info *)dev_id;
+	int uptodate;
+
+	spin_lock_irqsave(&blkif_io_lock, flags);
+
+	if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) {
+		spin_unlock_irqrestore(&blkif_io_lock, flags);
+		return IRQ_HANDLED;
+	}
+
+ again:
+	rp = info->ring.sring->rsp_prod;
+	rmb(); /* Ensure we see queued responses up to 'rp'. */
+
+	for (i = info->ring.rsp_cons; i != rp; i++) {
+		unsigned long id;
+		int ret;
+
+		bret = RING_GET_RESPONSE(&info->ring, i);
+		id   = bret->id;
+		req  = (struct request *)info->shadow[id].request;
+
+		blkif_completion(&info->shadow[id]);
+
+		ADD_ID_TO_FREELIST(info, id);
+
+		uptodate = (bret->status == BLKIF_RSP_OKAY);
+		switch (bret->operation) {
+		case BLKIF_OP_WRITE_BARRIER:
+			if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
+				printk("blkfront: %s: write barrier op failed\n",
+				       info->gd->disk_name);
+				uptodate = -EOPNOTSUPP;
+				info->feature_barrier = 0;
+			        xlvbd_barrier(info);
+			}
+			/* fall through */
+		case BLKIF_OP_READ:
+		case BLKIF_OP_WRITE:
+			if (unlikely(bret->status != BLKIF_RSP_OKAY))
+				dev_dbg(&info->dev, "Bad return from blkdev data "
+					"request: %x\n", bret->status);
+
+			ret = end_that_request_first(req, uptodate,
+				req->hard_nr_sectors);
+			BUG_ON(ret);
+			end_that_request_last(req, uptodate);
+			break;
+		default:
+			BUG();
+		}
+	}
+
+	info->ring.rsp_cons = i;
+
+	if (i != info->ring.req_prod_pvt) {
+		int more_to_do;
+		RING_FINAL_CHECK_FOR_RESPONSES(&info->ring, more_to_do);
+		if (more_to_do)
+			goto again;
+	} else
+		info->ring.sring->rsp_event = i + 1;
+
+	kick_pending_request_queues(info);
+
+	spin_unlock_irqrestore(&blkif_io_lock, flags);
+
+	return IRQ_HANDLED;
+}
+
+static void blkif_free(struct blkfront_info *info, int suspend)
+{
+	/* Prevent new requests being issued until we fix things up. */
+	spin_lock_irq(&blkif_io_lock);
+	info->connected = suspend ?
+		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
+	/* No more blkif_request(). */
+	if (info->rq)
+		blk_stop_queue(info->rq);
+	/* No more gnttab callback work. */
+	gnttab_cancel_free_callback(&info->callback);
+	spin_unlock_irq(&blkif_io_lock);
+
+	/* Flush gnttab callback work. Must be done with no locks held. */
+	flush_scheduled_work();
+
+	/* Free resources associated with old device channel. */
+	if (info->ring_ref != GRANT_INVALID_REF) {
+		gnttab_end_foreign_access(info->ring_ref, 0,
+					  (unsigned long)info->ring.sring);
+		info->ring_ref = GRANT_INVALID_REF;
+		info->ring.sring = NULL;
+	}
+	if (info->irq)
+		unbind_from_irqhandler(info->irq, info);
+	info->evtchn = info->irq = 0;
+
+}
+
+static void blkif_completion(struct blk_shadow *s)
+{
+	int i;
+	for (i = 0; i < s->req.nr_segments; i++)
+		gnttab_end_foreign_access(s->req.seg[i].gref, 0, 0UL);
+}
+
+static void blkif_recover(struct blkfront_info *info)
+{
+	int i;
+	struct blkif_request *req;
+	struct blk_shadow *copy;
+	int j;
+
+	/* Stage 1: Make a safe copy of the shadow state. */
+	copy = kmalloc(sizeof(info->shadow), GFP_KERNEL | __GFP_NOFAIL);
+	memcpy(copy, info->shadow, sizeof(info->shadow));
+
+	/* Stage 2: Set up free list. */
+	memset(&info->shadow, 0, sizeof(info->shadow));
+	for (i = 0; i < BLK_RING_SIZE; i++)
+		info->shadow[i].req.id = i+1;
+	info->shadow_free = info->ring.req_prod_pvt;
+	info->shadow[BLK_RING_SIZE-1].req.id = 0x0fffffff;
+
+	/* Stage 3: Find pending requests and requeue them. */
+	for (i = 0; i < BLK_RING_SIZE; i++) {
+		/* Not in use? */
+		if (copy[i].request == 0)
+			continue;
+
+		/* Grab a request slot and copy shadow state into it. */
+		req = RING_GET_REQUEST(
+			&info->ring, info->ring.req_prod_pvt);
+		*req = copy[i].req;
+
+		/* We get a new request id, and must reset the shadow state. */
+		req->id = GET_ID_FROM_FREELIST(info);
+		memcpy(&info->shadow[req->id], &copy[i], sizeof(copy[i]));
+
+		/* Rewrite any grant references invalidated by susp/resume. */
+		for (j = 0; j < req->nr_segments; j++)
+			gnttab_grant_foreign_access_ref(
+				req->seg[j].gref,
+				info->xbdev->otherend_id,
+				pfn_to_mfn(info->shadow[req->id].frame[j]),
+				rq_data_dir(
+					(struct request *)
+					info->shadow[req->id].request));
+		info->shadow[req->id].req = *req;
+
+		info->ring.req_prod_pvt++;
+	}
+
+	kfree(copy);
+
+	xenbus_switch_state(info->xbdev, XenbusStateConnected);
+
+	spin_lock_irq(&blkif_io_lock);
+
+	/* Now safe for us to use the shared ring */
+	info->connected = BLKIF_STATE_CONNECTED;
+
+	/* Send off requeued requests */
+	flush_requests(info);
+
+	/* Kick any other new requests queued since we resumed */
+	kick_pending_request_queues(info);
+
+	spin_unlock_irq(&blkif_io_lock);
+}
+
+
+/* ** Driver Registration ** */
+
+
+static struct xenbus_device_id blkfront_ids[] = {
+	{ "vbd" },
+	{ "" }
+};
+
+
+static struct xenbus_driver blkfront = {
+	.name = "vbd",
+	.owner = THIS_MODULE,
+	.ids = blkfront_ids,
+	.probe = blkfront_probe,
+	.remove = blkfront_remove,
+	.resume = blkfront_resume,
+	.otherend_changed = backend_changed,
+};
+
+
+static int __init xlblk_init(void)
+{
+	if (!is_running_on_xen())
+		return -ENODEV;
+
+	if (xlvbd_alloc_major() < 0)
+		return -ENODEV;
+
+	return xenbus_register_frontend(&blkfront);
+}
+module_init(xlblk_init);
+
+
+static void xlblk_exit(void)
+{
+	return xenbus_unregister_driver(&blkfront);
+}
+module_exit(xlblk_exit);
+
+MODULE_LICENSE("Dual BSD/GPL");
===================================================================
--- /dev/null
+++ b/drivers/block/xen/block.h
@@ -0,0 +1,135 @@
+/******************************************************************************
+ * block.h
+ *
+ * Shared definitions between all levels of XenLinux Virtual block devices.
+ *
+ * Copyright (c) 2003-2004, Keir Fraser & Steve Hand
+ * Modifications by Mark A. Williamson are (c) Intel Research Cambridge
+ * Copyright (c) 2004-2005, Christian Limpach
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef __XEN_DRIVERS_BLOCK_H__
+#define __XEN_DRIVERS_BLOCK_H__
+
+#include <linux/version.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/hdreg.h>
+#include <linux/blkdev.h>
+#include <linux/major.h>
+#include <asm/xen/hypervisor.h>
+#include <xen/xenbus.h>
+#include <xen/grant_table.h>
+#include <xen/interface/xen.h>
+#include <xen/interface/io/blkif.h>
+#include <xen/interface/io/ring.h>
+#include <asm/io.h>
+#include <asm/atomic.h>
+#include <asm/uaccess.h>
+
+struct xlbd_type_info
+{
+	int partn_shift;
+	int disks_per_major;
+	char *devname;
+	char *diskname;
+};
+
+struct xlbd_major_info
+{
+	int major;
+	int index;
+	int usage;
+	struct xlbd_type_info *type;
+};
+
+struct blk_shadow {
+	struct blkif_request req;
+	unsigned long request;
+	unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+};
+
+#define BLK_RING_SIZE __RING_SIZE((struct blkif_sring *)0, PAGE_SIZE)
+
+/*
+ * We have one of these per vbd, whether ide, scsi or 'other'.  They
+ * hang in private_data off the gendisk structure. We may end up
+ * putting all kinds of interesting stuff here :-)
+ */
+struct blkfront_info
+{
+	struct xenbus_device *xbdev;
+	dev_t dev;
+ 	struct gendisk *gd;
+	int vdevice;
+	blkif_vdev_t handle;
+	int connected;
+	int ring_ref;
+	struct blkif_front_ring ring;
+	unsigned int evtchn, irq;
+	struct xlbd_major_info *mi;
+	request_queue_t *rq;
+	struct work_struct work;
+	struct gnttab_free_callback callback;
+	struct blk_shadow shadow[BLK_RING_SIZE];
+	unsigned long shadow_free;
+	int feature_barrier;
+
+	/**
+	 * The number of people holding this device open.  We won't allow a
+	 * hot-unplug unless this is 0.
+	 */
+	int users;
+};
+
+extern spinlock_t blkif_io_lock;
+
+extern int blkif_open(struct inode *inode, struct file *filep);
+extern int blkif_release(struct inode *inode, struct file *filep);
+extern int blkif_ioctl(struct inode *inode, struct file *filep,
+		       unsigned command, unsigned long argument);
+extern int blkif_getgeo(struct block_device *, struct hd_geometry *);
+extern int blkif_check(dev_t dev);
+extern int blkif_revalidate(dev_t dev);
+extern void do_blkif_request (request_queue_t *rq);
+
+/* Virtual block device subsystem. */
+int xlvbd_alloc_major(void);
+/* Note that xlvbd_add doesn't call add_disk for you: you're expected
+   to call add_disk on info->gd once the disk is properly connected
+   up. */
+int xlvbd_add(blkif_sector_t capacity, int device,
+	      u16 vdisk_info, u16 sector_size, struct blkfront_info *info);
+void xlvbd_del(struct blkfront_info *info);
+int xlvbd_barrier(struct blkfront_info *info);
+
+#endif /* __XEN_DRIVERS_BLOCK_H__ */
===================================================================
--- /dev/null
+++ b/drivers/block/xen/vbd.c
@@ -0,0 +1,230 @@
+/******************************************************************************
+ * vbd.c
+ *
+ * XenLinux virtual block device driver (xvd).
+ *
+ * Copyright (c) 2003-2004, Keir Fraser & Steve Hand
+ * Modifications by Mark A. Williamson are (c) Intel Research Cambridge
+ * Copyright (c) 2004-2005, Christian Limpach
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include "block.h"
+#include <linux/blkdev.h>
+#include <linux/list.h>
+
+#define BLKIF_MAJOR(dev) ((dev)>>8)
+#define BLKIF_MINOR(dev) ((dev) & 0xff)
+
+static struct xlbd_type_info xvd_type_info = {
+	.partn_shift = 4,
+	.disks_per_major = 16,
+	.devname = "xvd",
+	.diskname = "xvd"
+};
+
+static struct xlbd_major_info xvd_major_info = {
+	.major = XENVBD_MAJOR,
+	.type = &xvd_type_info
+};
+
+/* Information about our VBDs. */
+#define MAX_VBDS 64
+static LIST_HEAD(vbds_list);
+
+static struct block_device_operations xlvbd_block_fops =
+{
+	.owner = THIS_MODULE,
+	.open = blkif_open,
+	.release = blkif_release,
+	.getgeo = blkif_getgeo
+};
+
+DEFINE_SPINLOCK(blkif_io_lock);
+
+int
+xlvbd_alloc_major(void)
+{
+	printk("Registering block device major %i\n", xvd_major_info.major);
+	if (register_blkdev(xvd_major_info.major,
+			    xvd_major_info.type->devname)) {
+		printk(KERN_WARNING "xen_blk: can't get major %d with name %s\n",
+		       xvd_major_info.major, xvd_major_info.type->devname);
+		return -1;
+	}
+	return 0;
+}
+
+static int
+xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
+{
+	request_queue_t *rq;
+
+	rq = blk_init_queue(do_blkif_request, &blkif_io_lock);
+	if (rq == NULL)
+		return -1;
+
+	elevator_init(rq, "noop");
+
+	/* Hard sector size and max sectors impersonate the equiv. hardware. */
+	blk_queue_hardsect_size(rq, sector_size);
+	blk_queue_max_sectors(rq, 512);
+
+	/* Each segment in a request is up to an aligned page in size. */
+	blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
+	blk_queue_max_segment_size(rq, PAGE_SIZE);
+
+	/* Ensure a merged request will fit in a single I/O ring slot. */
+	blk_queue_max_phys_segments(rq, BLKIF_MAX_SEGMENTS_PER_REQUEST);
+	blk_queue_max_hw_segments(rq, BLKIF_MAX_SEGMENTS_PER_REQUEST);
+
+	/* Make sure buffer addresses are sector-aligned. */
+	blk_queue_dma_alignment(rq, 511);
+
+	gd->queue = rq;
+
+	return 0;
+}
+
+static int
+xlvbd_alloc_gendisk(int minor, blkif_sector_t capacity, int vdevice,
+		    u16 vdisk_info, u16 sector_size,
+		    struct blkfront_info *info)
+{
+	struct gendisk *gd;
+	struct xlbd_major_info *mi;
+	int nr_minors = 1;
+	int err = -ENODEV;
+
+	BUG_ON(info->gd != NULL);
+	BUG_ON(info->mi != NULL);
+	BUG_ON(info->rq != NULL);
+
+	mi = &xvd_major_info;
+	info->mi = mi;
+
+	if ((minor & ((1 << mi->type->partn_shift) - 1)) == 0)
+		nr_minors = 1 << mi->type->partn_shift;
+
+	gd = alloc_disk(nr_minors);
+	if (gd == NULL)
+		goto out;
+
+	if (nr_minors > 1)
+		sprintf(gd->disk_name, "%s%c", mi->type->diskname,
+			'a' + mi->index * mi->type->disks_per_major +
+			(minor >> mi->type->partn_shift));
+	else
+		sprintf(gd->disk_name, "%s%c%d", mi->type->diskname,
+			'a' + mi->index * mi->type->disks_per_major +
+			(minor >> mi->type->partn_shift),
+			minor & ((1 << mi->type->partn_shift) - 1));
+
+	gd->major = mi->major;
+	gd->first_minor = minor;
+	gd->fops = &xlvbd_block_fops;
+	gd->private_data = info;
+	gd->driverfs_dev = &(info->xbdev->dev);
+	set_capacity(gd, capacity);
+
+	if (xlvbd_init_blk_queue(gd, sector_size)) {
+		del_gendisk(gd);
+		goto out;
+	}
+
+	info->rq = gd->queue;
+	info->gd = gd;
+
+	if (info->feature_barrier)
+		xlvbd_barrier(info);
+
+	if (vdisk_info & VDISK_READONLY)
+		set_disk_ro(gd, 1);
+
+	if (vdisk_info & VDISK_REMOVABLE)
+		gd->flags |= GENHD_FL_REMOVABLE;
+
+	if (vdisk_info & VDISK_CDROM)
+		gd->flags |= GENHD_FL_CD;
+
+	return 0;
+
+ out:
+	info->mi = NULL;
+	return err;
+}
+
+int
+xlvbd_add(blkif_sector_t capacity, int vdevice, u16 vdisk_info,
+	  u16 sector_size, struct blkfront_info *info)
+{
+	struct block_device *bd;
+	int err = 0;
+
+	info->dev = MKDEV(BLKIF_MAJOR(vdevice), BLKIF_MINOR(vdevice));
+
+	bd = bdget(info->dev);
+	if (bd == NULL)
+		return -ENODEV;
+
+	err = xlvbd_alloc_gendisk(BLKIF_MINOR(vdevice), capacity, vdevice,
+				  vdisk_info, sector_size, info);
+
+	bdput(bd);
+	return err;
+}
+
+void
+xlvbd_del(struct blkfront_info *info)
+{
+	if (info->mi == NULL)
+		return;
+
+	BUG_ON(info->gd == NULL);
+	del_gendisk(info->gd);
+	put_disk(info->gd);
+	info->gd = NULL;
+
+	info->mi = NULL;
+
+	BUG_ON(info->rq == NULL);
+	blk_cleanup_queue(info->rq);
+	info->rq = NULL;
+}
+
+int
+xlvbd_barrier(struct blkfront_info *info)
+{
+	int err;
+
+	err = blk_queue_ordered(info->rq,
+		info->feature_barrier ? QUEUE_ORDERED_DRAIN : QUEUE_ORDERED_NONE, NULL);
+	if (err)
+		return err;
+	printk("blkfront: %s: barriers %s\n",
+	       info->gd->disk_name, info->feature_barrier ? "enabled" : "disabled");
+	return 0;
+}
===================================================================
--- a/include/linux/major.h
+++ b/include/linux/major.h
@@ -156,6 +156,8 @@
 #define VXSPEC_MAJOR		200	/* VERITAS volume config driver */
 #define VXDMP_MAJOR		201	/* VERITAS volume multipath driver */
 
+#define XENVBD_MAJOR		202	/* Xen virtual block device */
+
 #define MSR_MAJOR		202
 #define CPUID_MAJOR		203
 

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 26/26] Xen-paravirt_ops: Add the Xen virtual network device driver.
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
                   ` (24 preceding siblings ...)
  2007-03-01 23:25 ` [patch 25/26] Xen-paravirt_ops: Add Xen virtual block device driver Jeremy Fitzhardinge
@ 2007-03-01 23:25 ` Jeremy Fitzhardinge
  2007-03-02  0:42     ` Stephen Hemminger
  2007-03-16  8:42   ` Ingo Molnar
  2007-03-16  9:21   ` Ingo Molnar
  27 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 23:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell, Ian Pratt,
	Christian Limpach, netdev, Jeff Garzik

[-- Attachment #1: xen-netfront.patch --]
[-- Type: text/plain, Size: 56757 bytes --]

The network device frontend driver allows the kernel to access network
devices exported exported by a virtual machine containing a physical
network device driver.

Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: netdev@vger.kernel.org
Cc: Jeff Garzik <jeff@garzik.org>

---
 drivers/net/Kconfig        |   12 
 drivers/net/Makefile       |    2 
 drivers/net/xen-netfront.c | 2066 ++++++++++++++++++++++++++++++++++++++++++++
 include/xen/events.h       |    2 
 4 files changed, 2082 insertions(+)

===================================================================
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2525,6 +2525,18 @@ source "drivers/atm/Kconfig"
 
 source "drivers/s390/net/Kconfig"
 
+config XEN_NETDEV_FRONTEND
+	tristate "Xen network device frontend driver"
+	depends on XEN
+	default y
+	help
+	  The network device frontend driver allows the kernel to
+	  access network devices exported exported by a virtual
+	  machine containing a physical network device driver. The
+	  frontend driver is intended for unprivileged guest domains;
+	  if you are compiling a kernel for a Xen guest, you almost
+	  certainly want to enable this.
+
 config ISERIES_VETH
 	tristate "iSeries Virtual Ethernet driver support"
 	depends on PPC_ISERIES
===================================================================
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -218,3 +218,5 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
 obj-$(CONFIG_FS_ENET) += fs_enet/
 
 obj-$(CONFIG_NETXEN_NIC) += netxen/
+
+obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
===================================================================
--- /dev/null
+++ b/drivers/net/xen-netfront.c
@@ -0,0 +1,2066 @@
+/******************************************************************************
+ * Virtual network driver for conversing with remote driver backends.
+ *
+ * Copyright (c) 2002-2005, K A Fraser
+ * Copyright (c) 2005, XenSource Ltd
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/bitops.h>
+#include <linux/ethtool.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/io.h>
+#include <linux/moduleparam.h>
+#include <net/sock.h>
+#include <net/pkt_sched.h>
+#include <net/arp.h>
+#include <net/route.h>
+#include <asm/uaccess.h>
+#include <xen/xenbus.h>
+#include <xen/interface/io/netif.h>
+#include <xen/interface/memory.h>
+#ifdef CONFIG_XEN_BALLOON
+#include <xen/balloon.h>
+#endif
+#include <asm/page.h>
+#include <xen/interface/grant_table.h>
+
+#include <xen/events.h>
+#include <xen/page.h>
+#include <xen/grant_table.h>
+
+/*
+ * Mutually-exclusive module options to select receive data path:
+ *  rx_copy : Packets are copied by network backend into local memory
+ *  rx_flip : Page containing packet data is transferred to our ownership
+ * For fully-virtualised guests there is no option - copying must be used.
+ * For paravirtualised guests, flipping is the default.
+ */
+#ifdef CONFIG_XEN
+static int MODPARM_rx_copy = 0;
+module_param_named(rx_copy, MODPARM_rx_copy, bool, 0);
+MODULE_PARM_DESC(rx_copy, "Copy packets from network card (rather than flip)");
+static int MODPARM_rx_flip = 0;
+module_param_named(rx_flip, MODPARM_rx_flip, bool, 0);
+MODULE_PARM_DESC(rx_flip, "Flip packets from network card (rather than copy)");
+#else
+static const int MODPARM_rx_copy = 1;
+static const int MODPARM_rx_flip = 0;
+#endif
+
+#define RX_COPY_THRESHOLD 256
+
+#define GRANT_INVALID_REF	0
+
+#define NET_TX_RING_SIZE __RING_SIZE((struct netif_tx_sring *)0, PAGE_SIZE)
+#define NET_RX_RING_SIZE __RING_SIZE((struct netif_rx_sring *)0, PAGE_SIZE)
+
+struct netfront_info {
+	struct list_head list;
+	struct net_device *netdev;
+
+	struct net_device_stats stats;
+
+	struct netif_tx_front_ring tx;
+	struct netif_rx_front_ring rx;
+
+	spinlock_t   tx_lock;
+	spinlock_t   rx_lock;
+
+	unsigned int evtchn, irq;
+	unsigned int copying_receiver;
+
+	/* Receive-ring batched refills. */
+#define RX_MIN_TARGET 8
+#define RX_DFL_MIN_TARGET 64
+#define RX_MAX_TARGET min_t(int, NET_RX_RING_SIZE, 256)
+	unsigned rx_min_target, rx_max_target, rx_target;
+	struct sk_buff_head rx_batch;
+
+	struct timer_list rx_refill_timer;
+
+	/*
+	 * {tx,rx}_skbs store outstanding skbuffs. The first entry in tx_skbs
+	 * is an index into a chain of free entries.
+	 */
+	struct sk_buff *tx_skbs[NET_TX_RING_SIZE+1];
+	struct sk_buff *rx_skbs[NET_RX_RING_SIZE];
+
+#define TX_MAX_TARGET min_t(int, NET_RX_RING_SIZE, 256)
+	grant_ref_t gref_tx_head;
+	grant_ref_t grant_tx_ref[NET_TX_RING_SIZE + 1];
+	grant_ref_t gref_rx_head;
+	grant_ref_t grant_rx_ref[NET_RX_RING_SIZE];
+
+	struct xenbus_device *xbdev;
+	int tx_ring_ref;
+	int rx_ring_ref;
+	u8 mac[ETH_ALEN];
+
+	unsigned long rx_pfn_array[NET_RX_RING_SIZE];
+	struct multicall_entry rx_mcl[NET_RX_RING_SIZE+1];
+	struct mmu_update rx_mmu[NET_RX_RING_SIZE];
+};
+
+struct netfront_rx_info {
+	struct netif_rx_response rx;
+	struct netif_extra_info extras[XEN_NETIF_EXTRA_TYPE_MAX - 1];
+};
+
+/*
+ * Access macros for acquiring freeing slots in tx_skbs[].
+ */
+
+static inline void add_id_to_freelist(struct sk_buff **list, unsigned short id)
+{
+	list[id] = list[0];
+	list[0]  = (void *)(unsigned long)id;
+}
+
+static inline unsigned short get_id_from_freelist(struct sk_buff **list)
+{
+	unsigned int id = (unsigned int)(unsigned long)list[0];
+	list[0] = list[id];
+	return id;
+}
+
+static inline int xennet_rxidx(RING_IDX idx)
+{
+	return idx & (NET_RX_RING_SIZE - 1);
+}
+
+static inline struct sk_buff *xennet_get_rx_skb(struct netfront_info *np,
+						RING_IDX ri)
+{
+	int i = xennet_rxidx(ri);
+	struct sk_buff *skb = np->rx_skbs[i];
+	np->rx_skbs[i] = NULL;
+	return skb;
+}
+
+static inline grant_ref_t xennet_get_rx_ref(struct netfront_info *np,
+					    RING_IDX ri)
+{
+	int i = xennet_rxidx(ri);
+	grant_ref_t ref = np->grant_rx_ref[i];
+	np->grant_rx_ref[i] = GRANT_INVALID_REF;
+	return ref;
+}
+
+#define DPRINTK(fmt, args...)				\
+	pr_debug("netfront (%s:%d) " fmt,		\
+		 __FUNCTION__, __LINE__, ##args)
+#define IPRINTK(fmt, args...)				\
+	printk(KERN_INFO "netfront: " fmt, ##args)
+#define WPRINTK(fmt, args...)				\
+	printk(KERN_WARNING "netfront: " fmt, ##args)
+
+static int setup_device(struct xenbus_device *, struct netfront_info *);
+static struct net_device *create_netdev(struct xenbus_device *);
+
+static void netfront_closing(struct xenbus_device *);
+
+static void end_access(int, void *);
+static void netif_disconnect_backend(struct netfront_info *);
+static int open_netdev(struct netfront_info *);
+static void close_netdev(struct netfront_info *);
+
+static int network_connect(struct net_device *);
+static void network_tx_buf_gc(struct net_device *);
+static void network_alloc_rx_buffers(struct net_device *);
+static int send_fake_arp(struct net_device *);
+
+static irqreturn_t netif_int(int irq, void *dev_id);
+
+#ifdef CONFIG_SYSFS
+static int xennet_sysfs_addif(struct net_device *netdev);
+static void xennet_sysfs_delif(struct net_device *netdev);
+#else /* !CONFIG_SYSFS */
+#define xennet_sysfs_addif(dev) (0)
+#define xennet_sysfs_delif(dev) do { } while(0)
+#endif
+
+static inline int xennet_can_sg(struct net_device *dev)
+{
+	return dev->features & NETIF_F_SG;
+}
+
+/**
+ * Entry point to this code when a new device is created.  Allocate the basic
+ * structures and the ring buffers for communication with the backend, and
+ * inform the backend of the appropriate details for those.
+ */
+static int __devinit netfront_probe(struct xenbus_device *dev,
+				    const struct xenbus_device_id *id)
+{
+	int err;
+	struct net_device *netdev;
+	struct netfront_info *info;
+
+	netdev = create_netdev(dev);
+	if (IS_ERR(netdev)) {
+		err = PTR_ERR(netdev);
+		xenbus_dev_fatal(dev, err, "creating netdev");
+		return err;
+	}
+
+	info = netdev_priv(netdev);
+	dev->dev.driver_data = info;
+
+	err = open_netdev(info);
+	if (err)
+		goto fail;
+
+	return 0;
+
+ fail:
+	free_netdev(netdev);
+	dev->dev.driver_data = NULL;
+	return err;
+}
+
+
+/**
+ * We are reconnecting to the backend, due to a suspend/resume, or a backend
+ * driver restart.  We tear down our netif structure and recreate it, but
+ * leave the device-layer structures intact so that this is transparent to the
+ * rest of the kernel.
+ */
+static int netfront_resume(struct xenbus_device *dev)
+{
+	struct netfront_info *info = dev->dev.driver_data;
+
+	DPRINTK("%s\n", dev->nodename);
+
+	netif_disconnect_backend(info);
+	return 0;
+}
+
+static int xen_net_read_mac(struct xenbus_device *dev, u8 mac[])
+{
+	char *s, *e, *macstr;
+	int i;
+
+	macstr = s = xenbus_read(XBT_NIL, dev->nodename, "mac", NULL);
+	if (IS_ERR(macstr))
+		return PTR_ERR(macstr);
+
+	for (i = 0; i < ETH_ALEN; i++) {
+		mac[i] = simple_strtoul(s, &e, 16);
+		if ((s == e) || (*e != ((i == ETH_ALEN-1) ? '\0' : ':'))) {
+			kfree(macstr);
+			return -ENOENT;
+		}
+		s = e+1;
+	}
+
+	kfree(macstr);
+	return 0;
+}
+
+/* Common code used when first setting up, and when resuming. */
+static int talk_to_backend(struct xenbus_device *dev,
+			   struct netfront_info *info)
+{
+	const char *message;
+	struct xenbus_transaction xbt;
+	int err;
+
+	err = xen_net_read_mac(dev, info->mac);
+	if (err) {
+		xenbus_dev_fatal(dev, err, "parsing %s/mac", dev->nodename);
+		goto out;
+	}
+
+	/* Create shared ring, alloc event channel. */
+	err = setup_device(dev, info);
+	if (err)
+		goto out;
+
+again:
+	err = xenbus_transaction_start(&xbt);
+	if (err) {
+		xenbus_dev_fatal(dev, err, "starting transaction");
+		goto destroy_ring;
+	}
+
+	err = xenbus_printf(xbt, dev->nodename, "tx-ring-ref","%u",
+			    info->tx_ring_ref);
+	if (err) {
+		message = "writing tx ring-ref";
+		goto abort_transaction;
+	}
+	err = xenbus_printf(xbt, dev->nodename, "rx-ring-ref","%u",
+			    info->rx_ring_ref);
+	if (err) {
+		message = "writing rx ring-ref";
+		goto abort_transaction;
+	}
+	err = xenbus_printf(xbt, dev->nodename,
+			    "event-channel", "%u", info->evtchn);
+	if (err) {
+		message = "writing event-channel";
+		goto abort_transaction;
+	}
+
+	err = xenbus_printf(xbt, dev->nodename, "request-rx-copy", "%u",
+			    info->copying_receiver);
+	if (err) {
+		message = "writing request-rx-copy";
+		goto abort_transaction;
+	}
+
+	err = xenbus_printf(xbt, dev->nodename, "feature-rx-notify", "%d", 1);
+	if (err) {
+		message = "writing feature-rx-notify";
+		goto abort_transaction;
+	}
+
+	err = xenbus_printf(xbt, dev->nodename, "feature-sg", "%d", 1);
+	if (err) {
+		message = "writing feature-sg";
+		goto abort_transaction;
+	}
+
+	err = xenbus_printf(xbt, dev->nodename, "feature-gso-tcpv4", "%d", 1);
+	if (err) {
+		message = "writing feature-gso-tcpv4";
+		goto abort_transaction;
+	}
+
+	err = xenbus_transaction_end(xbt, 0);
+	if (err) {
+		if (err == -EAGAIN)
+			goto again;
+		xenbus_dev_fatal(dev, err, "completing transaction");
+		goto destroy_ring;
+	}
+
+	return 0;
+
+ abort_transaction:
+	xenbus_transaction_end(xbt, 1);
+	xenbus_dev_fatal(dev, err, "%s", message);
+ destroy_ring:
+	netif_disconnect_backend(info);
+ out:
+	return err;
+}
+
+static int setup_device(struct xenbus_device *dev, struct netfront_info *info)
+{
+	struct netif_tx_sring *txs;
+	struct netif_rx_sring *rxs;
+	int err;
+	struct net_device *netdev = info->netdev;
+
+	info->tx_ring_ref = GRANT_INVALID_REF;
+	info->rx_ring_ref = GRANT_INVALID_REF;
+	info->rx.sring = NULL;
+	info->tx.sring = NULL;
+	info->irq = 0;
+
+	txs = (struct netif_tx_sring *)get_zeroed_page(GFP_KERNEL);
+	if (!txs) {
+		err = -ENOMEM;
+		xenbus_dev_fatal(dev, err, "allocating tx ring page");
+		goto fail;
+	}
+	SHARED_RING_INIT(txs);
+	FRONT_RING_INIT(&info->tx, txs, PAGE_SIZE);
+
+	err = xenbus_grant_ring(dev, virt_to_mfn(txs));
+	if (err < 0) {
+		free_page((unsigned long)txs);
+		goto fail;
+	}
+
+	info->tx_ring_ref = err;
+	rxs = (struct netif_rx_sring *)get_zeroed_page(GFP_KERNEL);
+	if (!rxs) {
+		err = -ENOMEM;
+		xenbus_dev_fatal(dev, err, "allocating rx ring page");
+		goto fail;
+	}
+	SHARED_RING_INIT(rxs);
+	FRONT_RING_INIT(&info->rx, rxs, PAGE_SIZE);
+
+	err = xenbus_grant_ring(dev, virt_to_mfn(rxs));
+	if (err < 0) {
+		free_page((unsigned long)rxs);
+		goto fail;
+	}
+	info->rx_ring_ref = err;
+
+	err = xenbus_alloc_evtchn(dev, &info->evtchn);
+	if (err)
+		goto fail;
+
+	memcpy(netdev->dev_addr, info->mac, ETH_ALEN);
+	err = bind_evtchn_to_irqhandler(info->evtchn, netif_int,
+					SA_SAMPLE_RANDOM, netdev->name,
+					netdev);
+	if (err < 0)
+		goto fail;
+	info->irq = err;
+	return 0;
+
+ fail:
+	return err;
+}
+
+/**
+ * Callback received when the backend's state changes.
+ */
+static void backend_changed(struct xenbus_device *dev,
+			    enum xenbus_state backend_state)
+{
+	struct netfront_info *np = dev->dev.driver_data;
+	struct net_device *netdev = np->netdev;
+
+	DPRINTK("%s\n", xenbus_strstate(backend_state));
+
+	switch (backend_state) {
+	case XenbusStateInitialising:
+	case XenbusStateInitialised:
+	case XenbusStateConnected:
+	case XenbusStateUnknown:
+	case XenbusStateClosed:
+		break;
+
+	case XenbusStateInitWait:
+		if (dev->state != XenbusStateInitialising)
+			break;
+		if (network_connect(netdev) != 0)
+			break;
+		xenbus_switch_state(dev, XenbusStateConnected);
+		(void)send_fake_arp(netdev);
+		break;
+
+	case XenbusStateClosing:
+		if (dev->state == XenbusStateClosed)
+			break;
+		netfront_closing(dev);
+		break;
+	}
+}
+
+/** Send a packet on a net device to encourage switches to learn the
+ * MAC. We send a fake ARP request.
+ *
+ * @param dev device
+ * @return 0 on success, error code otherwise
+ */
+static int send_fake_arp(struct net_device *dev)
+{
+	struct sk_buff *skb;
+	u32             src_ip, dst_ip;
+
+	dst_ip = INADDR_BROADCAST;
+	src_ip = inet_select_addr(dev, dst_ip, RT_SCOPE_LINK);
+
+	/* No IP? Then nothing to do. */
+	if (src_ip == 0)
+		return 0;
+
+	skb = arp_create(ARPOP_REPLY, ETH_P_ARP,
+			 dst_ip, dev, src_ip,
+			 /*dst_hw*/ NULL, /*src_hw*/ NULL,
+			 /*target_hw*/ dev->dev_addr);
+	if (skb == NULL)
+		return -ENOMEM;
+
+	return dev_queue_xmit(skb);
+}
+
+static int network_open(struct net_device *dev)
+{
+	struct netfront_info *np = netdev_priv(dev);
+
+	memset(&np->stats, 0, sizeof(np->stats));
+
+	spin_lock(&np->rx_lock);
+	if (netif_carrier_ok(dev)) {
+		network_alloc_rx_buffers(dev);
+		np->rx.sring->rsp_event = np->rx.rsp_cons + 1;
+		if (RING_HAS_UNCONSUMED_RESPONSES(&np->rx))
+			netif_rx_schedule(dev);
+	}
+	spin_unlock(&np->rx_lock);
+
+	netif_start_queue(dev);
+
+	return 0;
+}
+
+static inline int netfront_tx_slot_available(struct netfront_info *np)
+{
+	return RING_FREE_REQUESTS(&np->tx) >= MAX_SKB_FRAGS + 2;
+}
+
+static inline void network_maybe_wake_tx(struct net_device *dev)
+{
+	struct netfront_info *np = netdev_priv(dev);
+
+	if (unlikely(netif_queue_stopped(dev)) &&
+	    netfront_tx_slot_available(np) &&
+	    likely(netif_running(dev)))
+		netif_wake_queue(dev);
+}
+
+static void network_tx_buf_gc(struct net_device *dev)
+{
+	RING_IDX cons, prod;
+	unsigned short id;
+	struct netfront_info *np = netdev_priv(dev);
+	struct sk_buff *skb;
+
+	BUG_ON(!netif_carrier_ok(dev));
+
+	do {
+		prod = np->tx.sring->rsp_prod;
+		rmb(); /* Ensure we see responses up to 'rp'. */
+
+		for (cons = np->tx.rsp_cons; cons != prod; cons++) {
+			struct netif_tx_response *txrsp;
+
+			txrsp = RING_GET_RESPONSE(&np->tx, cons);
+			if (txrsp->status == NETIF_RSP_NULL)
+				continue;
+
+			id  = txrsp->id;
+			skb = np->tx_skbs[id];
+			if (unlikely(gnttab_query_foreign_access(
+				np->grant_tx_ref[id]) != 0)) {
+				printk(KERN_ALERT "network_tx_buf_gc: warning "
+				       "-- grant still in use by backend "
+				       "domain.\n");
+				BUG();
+			}
+			gnttab_end_foreign_access_ref(
+				np->grant_tx_ref[id], GNTMAP_readonly);
+			gnttab_release_grant_reference(
+				&np->gref_tx_head, np->grant_tx_ref[id]);
+			np->grant_tx_ref[id] = GRANT_INVALID_REF;
+			add_id_to_freelist(np->tx_skbs, id);
+			dev_kfree_skb_irq(skb);
+		}
+
+		np->tx.rsp_cons = prod;
+
+		/*
+		 * Set a new event, then check for race with update of tx_cons.
+		 * Note that it is essential to schedule a callback, no matter
+		 * how few buffers are pending. Even if there is space in the
+		 * transmit ring, higher layers may be blocked because too much
+		 * data is outstanding: in such cases notification from Xen is
+		 * likely to be the only kick that we'll get.
+		 */
+		np->tx.sring->rsp_event =
+			prod + ((np->tx.sring->req_prod - prod) >> 1) + 1;
+		mb();
+	} while ((cons == prod) && (prod != np->tx.sring->rsp_prod));
+
+	network_maybe_wake_tx(dev);
+}
+
+static void rx_refill_timeout(unsigned long data)
+{
+	struct net_device *dev = (struct net_device *)data;
+	netif_rx_schedule(dev);
+}
+
+static void network_alloc_rx_buffers(struct net_device *dev)
+{
+	unsigned short id;
+	struct netfront_info *np = netdev_priv(dev);
+	struct sk_buff *skb;
+	struct page *page;
+	int i, batch_target, notify;
+	RING_IDX req_prod = np->rx.req_prod_pvt;
+	struct xen_memory_reservation reservation;
+	grant_ref_t ref;
+	unsigned long pfn;
+	void *vaddr;
+	int nr_flips;
+	struct netif_rx_request *req;
+
+	if (unlikely(!netif_carrier_ok(dev)))
+		return;
+
+	/*
+	 * Allocate skbuffs greedily, even though we batch updates to the
+	 * receive ring. This creates a less bursty demand on the memory
+	 * allocator, so should reduce the chance of failed allocation requests
+	 * both for ourself and for other kernel subsystems.
+	 */
+	batch_target = np->rx_target - (req_prod - np->rx.rsp_cons);
+	for (i = skb_queue_len(&np->rx_batch); i < batch_target; i++) {
+		/*
+		 * Allocate an skb and a page. Do not use __dev_alloc_skb as
+		 * that will allocate page-sized buffers which is not
+		 * necessary here.
+		 * 16 bytes added as necessary headroom for netif_receive_skb.
+		 */
+		skb = alloc_skb(RX_COPY_THRESHOLD + 16 + NET_IP_ALIGN,
+				GFP_ATOMIC | __GFP_NOWARN);
+		if (unlikely(!skb))
+			goto no_skb;
+
+		page = alloc_page(GFP_ATOMIC | __GFP_NOWARN);
+		if (!page) {
+			kfree_skb(skb);
+no_skb:
+			/* Any skbuffs queued for refill? Force them out. */
+			if (i != 0)
+				goto refill;
+			/* Could not allocate any skbuffs. Try again later. */
+			mod_timer(&np->rx_refill_timer,
+				  jiffies + (HZ/10));
+			break;
+		}
+
+		skb_reserve(skb, 16 + NET_IP_ALIGN); /* mimic dev_alloc_skb() */
+		skb_shinfo(skb)->frags[0].page = page;
+		skb_shinfo(skb)->nr_frags = 1;
+		__skb_queue_tail(&np->rx_batch, skb);
+	}
+
+	/* Is the batch large enough to be worthwhile? */
+	if (i < (np->rx_target/2)) {
+		if (req_prod > np->rx.sring->req_prod)
+			goto push;
+		return;
+	}
+
+	/* Adjust our fill target if we risked running out of buffers. */
+	if (((req_prod - np->rx.sring->rsp_prod) < (np->rx_target / 4)) &&
+	    ((np->rx_target *= 2) > np->rx_max_target))
+		np->rx_target = np->rx_max_target;
+
+ refill:
+	for (nr_flips = i = 0; ; i++) {
+		if ((skb = __skb_dequeue(&np->rx_batch)) == NULL)
+			break;
+
+		skb->dev = dev;
+
+		id = xennet_rxidx(req_prod + i);
+
+		BUG_ON(np->rx_skbs[id]);
+		np->rx_skbs[id] = skb;
+
+		ref = gnttab_claim_grant_reference(&np->gref_rx_head);
+		BUG_ON((signed short)ref < 0);
+		np->grant_rx_ref[id] = ref;
+
+		pfn = page_to_pfn(skb_shinfo(skb)->frags[0].page);
+		vaddr = page_address(skb_shinfo(skb)->frags[0].page);
+
+		req = RING_GET_REQUEST(&np->rx, req_prod + i);
+		if (!np->copying_receiver) {
+			gnttab_grant_foreign_transfer_ref(ref,
+							  np->xbdev->otherend_id,
+							  pfn);
+			np->rx_pfn_array[nr_flips] = pfn_to_mfn(pfn);
+			if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+				/* Remove this page before passing
+				 * back to Xen. */
+				set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
+				MULTI_update_va_mapping(np->rx_mcl+i,
+							(unsigned long)vaddr,
+							__pte(0), 0);
+			}
+			nr_flips++;
+		} else {
+			gnttab_grant_foreign_access_ref(ref,
+							np->xbdev->otherend_id,
+							pfn_to_mfn(pfn),
+							0);
+		}
+
+		req->id = id;
+		req->gref = ref;
+	}
+
+	if ( nr_flips != 0 ) {
+#ifdef CONFIG_XEN_BALLOON
+		/* Tell the ballon driver what is going on. */
+		balloon_update_driver_allowance(i);
+#endif
+
+		reservation.extent_start = np->rx_pfn_array;
+		reservation.nr_extents   = nr_flips;
+		reservation.extent_order = 0;
+		reservation.address_bits = 0;
+		reservation.domid        = DOMID_SELF;
+
+		if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+			/* After all PTEs have been zapped, flush the TLB. */
+			np->rx_mcl[i-1].args[MULTI_UVMFLAGS_INDEX] =
+				UVMF_TLB_FLUSH|UVMF_ALL;
+
+			/* Give away a batch of pages. */
+			np->rx_mcl[i].op = __HYPERVISOR_memory_op;
+			np->rx_mcl[i].args[0] = XENMEM_decrease_reservation;
+			np->rx_mcl[i].args[1] = (unsigned long)&reservation;
+
+			/* Zap PTEs and give away pages in one big
+			 * multicall. */
+			(void)HYPERVISOR_multicall(np->rx_mcl, i+1);
+
+			/* Check return status of HYPERVISOR_memory_op(). */
+			if (unlikely(np->rx_mcl[i].result != i))
+				panic("Unable to reduce memory reservation\n");
+		} else {
+			if (HYPERVISOR_memory_op(XENMEM_decrease_reservation,
+						 &reservation) != i)
+				panic("Unable to reduce memory reservation\n");
+		}
+	} else {
+		wmb();
+	}
+
+	/* Above is a suitable barrier to ensure backend will see requests. */
+	np->rx.req_prod_pvt = req_prod + i;
+ push:
+	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&np->rx, notify);
+	if (notify)
+		notify_remote_via_irq(np->irq);
+}
+
+static void xennet_move_rx_slot(struct netfront_info *np, struct sk_buff *skb,
+				grant_ref_t ref)
+{
+	int new = xennet_rxidx(np->rx.req_prod_pvt);
+
+	BUG_ON(np->rx_skbs[new]);
+	np->rx_skbs[new] = skb;
+	np->grant_rx_ref[new] = ref;
+	RING_GET_REQUEST(&np->rx, np->rx.req_prod_pvt)->id = new;
+	RING_GET_REQUEST(&np->rx, np->rx.req_prod_pvt)->gref = ref;
+	np->rx.req_prod_pvt++;
+}
+
+static void xennet_make_frags(struct sk_buff *skb, struct net_device *dev,
+			      struct netif_tx_request *tx)
+{
+	struct netfront_info *np = netdev_priv(dev);
+	char *data = skb->data;
+	unsigned long mfn;
+	RING_IDX prod = np->tx.req_prod_pvt;
+	int frags = skb_shinfo(skb)->nr_frags;
+	unsigned int offset = offset_in_page(data);
+	unsigned int len = skb_headlen(skb);
+	unsigned int id;
+	grant_ref_t ref;
+	int i;
+
+	while (len > PAGE_SIZE - offset) {
+		tx->size = PAGE_SIZE - offset;
+		tx->flags |= NETTXF_more_data;
+		len -= tx->size;
+		data += tx->size;
+		offset = 0;
+
+		id = get_id_from_freelist(np->tx_skbs);
+		np->tx_skbs[id] = skb_get(skb);
+		tx = RING_GET_REQUEST(&np->tx, prod++);
+		tx->id = id;
+		ref = gnttab_claim_grant_reference(&np->gref_tx_head);
+		BUG_ON((signed short)ref < 0);
+
+		mfn = virt_to_mfn(data);
+		gnttab_grant_foreign_access_ref(ref, np->xbdev->otherend_id,
+						mfn, GNTMAP_readonly);
+
+		tx->gref = np->grant_tx_ref[id] = ref;
+		tx->offset = offset;
+		tx->size = len;
+		tx->flags = 0;
+	}
+
+	for (i = 0; i < frags; i++) {
+		skb_frag_t *frag = skb_shinfo(skb)->frags + i;
+
+		tx->flags |= NETTXF_more_data;
+
+		id = get_id_from_freelist(np->tx_skbs);
+		np->tx_skbs[id] = skb_get(skb);
+		tx = RING_GET_REQUEST(&np->tx, prod++);
+		tx->id = id;
+		ref = gnttab_claim_grant_reference(&np->gref_tx_head);
+		BUG_ON((signed short)ref < 0);
+
+		mfn = pfn_to_mfn(page_to_pfn(frag->page));
+		gnttab_grant_foreign_access_ref(ref, np->xbdev->otherend_id,
+						mfn, GNTMAP_readonly);
+
+		tx->gref = np->grant_tx_ref[id] = ref;
+		tx->offset = frag->page_offset;
+		tx->size = frag->size;
+		tx->flags = 0;
+	}
+
+	np->tx.req_prod_pvt = prod;
+}
+
+static int network_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	unsigned short id;
+	struct netfront_info *np = netdev_priv(dev);
+	struct netif_tx_request *tx;
+	struct netif_extra_info *extra;
+	char *data = skb->data;
+	RING_IDX i;
+	grant_ref_t ref;
+	unsigned long mfn;
+	int notify;
+	int frags = skb_shinfo(skb)->nr_frags;
+	unsigned int offset = offset_in_page(data);
+	unsigned int len = skb_headlen(skb);
+
+	frags += (offset + len + PAGE_SIZE - 1) / PAGE_SIZE;
+	if (unlikely(frags > MAX_SKB_FRAGS + 1)) {
+		printk(KERN_ALERT "xennet: skb rides the rocket: %d frags\n",
+		       frags);
+		dump_stack();
+		goto drop;
+	}
+
+	spin_lock_irq(&np->tx_lock);
+
+	if (unlikely(!netif_carrier_ok(dev) ||
+		     (frags > 1 && !xennet_can_sg(dev)) ||
+		     netif_needs_gso(dev, skb))) {
+		spin_unlock_irq(&np->tx_lock);
+		goto drop;
+	}
+
+	i = np->tx.req_prod_pvt;
+
+	id = get_id_from_freelist(np->tx_skbs);
+	np->tx_skbs[id] = skb;
+
+	tx = RING_GET_REQUEST(&np->tx, i);
+
+	tx->id   = id;
+	ref = gnttab_claim_grant_reference(&np->gref_tx_head);
+	BUG_ON((signed short)ref < 0);
+	mfn = virt_to_mfn(data);
+	gnttab_grant_foreign_access_ref(
+		ref, np->xbdev->otherend_id, mfn, GNTMAP_readonly);
+	tx->gref = np->grant_tx_ref[id] = ref;
+	tx->offset = offset;
+	tx->size = len;
+	extra = NULL;
+
+ 	tx->flags = 0;
+ 	if (skb->ip_summed == CHECKSUM_PARTIAL) /* local packet? */
+ 		tx->flags |= NETTXF_csum_blank | NETTXF_data_validated;
+
+	if (skb_shinfo(skb)->gso_size) {
+		struct netif_extra_info *gso = (struct netif_extra_info *)
+			RING_GET_REQUEST(&np->tx, ++i);
+
+		if (extra)
+			extra->flags |= XEN_NETIF_EXTRA_FLAG_MORE;
+		else
+			tx->flags |= NETTXF_extra_info;
+
+		gso->u.gso.size = skb_shinfo(skb)->gso_size;
+		gso->u.gso.type = XEN_NETIF_GSO_TYPE_TCPV4;
+		gso->u.gso.pad = 0;
+		gso->u.gso.features = 0;
+
+		gso->type = XEN_NETIF_EXTRA_TYPE_GSO;
+		gso->flags = 0;
+		extra = gso;
+	}
+
+	np->tx.req_prod_pvt = i + 1;
+
+	xennet_make_frags(skb, dev, tx);
+	tx->size = skb->len;
+
+	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&np->tx, notify);
+	if (notify)
+		notify_remote_via_irq(np->irq);
+
+	network_tx_buf_gc(dev);
+
+	if (!netfront_tx_slot_available(np))
+		netif_stop_queue(dev);
+
+	spin_unlock_irq(&np->tx_lock);
+
+	np->stats.tx_bytes += skb->len;
+	np->stats.tx_packets++;
+
+	return 0;
+
+ drop:
+	np->stats.tx_dropped++;
+	dev_kfree_skb(skb);
+	return 0;
+}
+
+static irqreturn_t netif_int(int irq, void *dev_id)
+{
+	struct net_device *dev = dev_id;
+	struct netfront_info *np = netdev_priv(dev);
+	unsigned long flags;
+
+	spin_lock_irqsave(&np->tx_lock, flags);
+
+	if (likely(netif_carrier_ok(dev))) {
+		network_tx_buf_gc(dev);
+		/* Under tx_lock: protects access to rx shared-ring indexes. */
+		if (RING_HAS_UNCONSUMED_RESPONSES(&np->rx))
+			netif_rx_schedule(dev);
+	}
+
+	spin_unlock_irqrestore(&np->tx_lock, flags);
+
+	return IRQ_HANDLED;
+}
+
+static void handle_incoming_queue(struct net_device *dev, struct sk_buff_head *rxq)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(rxq)) != NULL) {
+		struct page *page = (struct page *)skb->nh.raw;
+		void *vaddr = page_address(page);
+
+		memcpy(skb->data, vaddr + (skb->h.raw - skb->nh.raw),
+		       skb_headlen(skb));
+
+		if (page != skb_shinfo(skb)->frags[0].page)
+			__free_page(page);
+
+		/* Ethernet work: Delayed to here as it peeks the header. */
+		skb->protocol = eth_type_trans(skb, dev);
+
+		/* Pass it up. */
+		netif_receive_skb(skb);
+		dev->last_rx = jiffies;
+	}
+}
+
+int xennet_get_extras(struct netfront_info *np,
+		      struct netif_extra_info *extras, RING_IDX rp)
+
+{
+	struct netif_extra_info *extra;
+	RING_IDX cons = np->rx.rsp_cons;
+	int err = 0;
+
+	do {
+		struct sk_buff *skb;
+		grant_ref_t ref;
+
+		if (unlikely(cons + 1 == rp)) {
+			if (net_ratelimit())
+				WPRINTK("Missing extra info\n");
+			err = -EBADR;
+			break;
+		}
+
+		extra = (struct netif_extra_info *)
+			RING_GET_RESPONSE(&np->rx, ++cons);
+
+		if (unlikely(!extra->type ||
+			     extra->type >= XEN_NETIF_EXTRA_TYPE_MAX)) {
+			if (net_ratelimit())
+				WPRINTK("Invalid extra type: %d\n",
+					extra->type);
+			err = -EINVAL;
+		} else {
+			memcpy(&extras[extra->type - 1], extra,
+			       sizeof(*extra));
+		}
+
+		skb = xennet_get_rx_skb(np, cons);
+		ref = xennet_get_rx_ref(np, cons);
+		xennet_move_rx_slot(np, skb, ref);
+	} while (extra->flags & XEN_NETIF_EXTRA_FLAG_MORE);
+
+	np->rx.rsp_cons = cons;
+	return err;
+}
+
+static int xennet_get_responses(struct netfront_info *np,
+				struct netfront_rx_info *rinfo, RING_IDX rp,
+				struct sk_buff_head *list,
+				int *pages_flipped_p)
+{
+	int pages_flipped = *pages_flipped_p;
+	struct mmu_update *mmu;
+	struct multicall_entry *mcl;
+	struct netif_rx_response *rx = &rinfo->rx;
+	struct netif_extra_info *extras = rinfo->extras;
+	RING_IDX cons = np->rx.rsp_cons;
+	struct sk_buff *skb = xennet_get_rx_skb(np, cons);
+	grant_ref_t ref = xennet_get_rx_ref(np, cons);
+	int max = MAX_SKB_FRAGS + (rx->status <= RX_COPY_THRESHOLD);
+	int frags = 1;
+	int err = 0;
+	unsigned long ret;
+
+	if (rx->flags & NETRXF_extra_info) {
+		err = xennet_get_extras(np, extras, rp);
+		cons = np->rx.rsp_cons;
+	}
+
+	for (;;) {
+		unsigned long mfn;
+
+		if (unlikely(rx->status < 0 ||
+			     rx->offset + rx->status > PAGE_SIZE)) {
+			if (net_ratelimit())
+				WPRINTK("rx->offset: %x, size: %u\n",
+					rx->offset, rx->status);
+			xennet_move_rx_slot(np, skb, ref);
+			err = -EINVAL;
+			goto next;
+		}
+
+		/*
+		 * This definitely indicates a bug, either in this driver or in
+		 * the backend driver. In future this should flag the bad
+		 * situation to the system controller to reboot the backed.
+		 */
+		if (ref == GRANT_INVALID_REF) {
+			if (net_ratelimit())
+				WPRINTK("Bad rx response id %d.\n", rx->id);
+			err = -EINVAL;
+			goto next;
+		}
+
+		if (!np->copying_receiver) {
+			/* Memory pressure, insufficient buffer
+			 * headroom, ... */
+			mfn = gnttab_end_foreign_transfer_ref(ref);
+			if (!mfn) {
+				if (net_ratelimit())
+					WPRINTK("Unfulfilled rx req "
+						"(id=%d, st=%d).\n",
+						rx->id, rx->status);
+				xennet_move_rx_slot(np, skb, ref);
+				err = -ENOMEM;
+				goto next;
+			}
+
+			if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+				/* Remap the page. */
+				struct page *page =
+					skb_shinfo(skb)->frags[0].page;
+				unsigned long pfn = page_to_pfn(page);
+				void *vaddr = page_address(page);
+
+				mcl = np->rx_mcl + pages_flipped;
+				mmu = np->rx_mmu + pages_flipped;
+
+				MULTI_update_va_mapping(mcl,
+							(unsigned long)vaddr,
+							mfn_pte(mfn, PAGE_KERNEL),
+							0);
+				mmu->ptr = ((u64)mfn << PAGE_SHIFT)
+					| MMU_MACHPHYS_UPDATE;
+				mmu->val = pfn;
+
+				set_phys_to_machine(pfn, mfn);
+			}
+			pages_flipped++;
+		} else {
+			ret = gnttab_end_foreign_access_ref(ref, 0);
+			BUG_ON(!ret);
+		}
+
+		gnttab_release_grant_reference(&np->gref_rx_head, ref);
+
+		__skb_queue_tail(list, skb);
+
+next:
+		if (!(rx->flags & NETRXF_more_data))
+			break;
+
+		if (cons + frags == rp) {
+			if (net_ratelimit())
+				WPRINTK("Need more frags\n");
+			err = -ENOENT;
+			break;
+		}
+
+		rx = RING_GET_RESPONSE(&np->rx, cons + frags);
+		skb = xennet_get_rx_skb(np, cons + frags);
+		ref = xennet_get_rx_ref(np, cons + frags);
+		frags++;
+	}
+
+	if (unlikely(frags > max)) {
+		if (net_ratelimit())
+			WPRINTK("Too many frags\n");
+		err = -E2BIG;
+	}
+
+	if (unlikely(err))
+		np->rx.rsp_cons = cons + frags;
+
+	*pages_flipped_p = pages_flipped;
+
+	return err;
+}
+
+static RING_IDX xennet_fill_frags(struct netfront_info *np,
+				  struct sk_buff *skb,
+				  struct sk_buff_head *list)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	int nr_frags = shinfo->nr_frags;
+	RING_IDX cons = np->rx.rsp_cons;
+	skb_frag_t *frag = shinfo->frags + nr_frags;
+	struct sk_buff *nskb;
+
+	while ((nskb = __skb_dequeue(list))) {
+		struct netif_rx_response *rx =
+			RING_GET_RESPONSE(&np->rx, ++cons);
+
+		frag->page = skb_shinfo(nskb)->frags[0].page;
+		frag->page_offset = rx->offset;
+		frag->size = rx->status;
+
+		skb->data_len += rx->status;
+
+		skb_shinfo(nskb)->nr_frags = 0;
+		kfree_skb(nskb);
+
+		frag++;
+		nr_frags++;
+	}
+
+	shinfo->nr_frags = nr_frags;
+	return cons;
+}
+
+static int xennet_set_skb_gso(struct sk_buff *skb, struct netif_extra_info *gso)
+{
+	if (!gso->u.gso.size) {
+		if (net_ratelimit())
+			WPRINTK("GSO size must not be zero.\n");
+		return -EINVAL;
+	}
+
+	/* Currently only TCPv4 S.O. is supported. */
+	if (gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV4) {
+		if (net_ratelimit())
+			WPRINTK("Bad GSO type %d.\n", gso->u.gso.type);
+		return -EINVAL;
+	}
+
+	skb_shinfo(skb)->gso_size = gso->u.gso.size;
+	skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+
+	/* Header must be checked, and gso_segs computed. */
+	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+	skb_shinfo(skb)->gso_segs = 0;
+
+	return 0;
+}
+
+static int netif_poll(struct net_device *dev, int *pbudget)
+{
+	struct netfront_info *np = netdev_priv(dev);
+	struct sk_buff *skb;
+	struct netfront_rx_info rinfo;
+	struct netif_rx_response *rx = &rinfo.rx;
+	struct netif_extra_info *extras = rinfo.extras;
+	RING_IDX i, rp;
+	struct multicall_entry *mcl;
+	int work_done, budget, more_to_do = 1;
+	struct sk_buff_head rxq;
+	struct sk_buff_head errq;
+	struct sk_buff_head tmpq;
+	unsigned long flags;
+	unsigned int len;
+	int pages_flipped = 0;
+	int err;
+
+	spin_lock(&np->rx_lock);
+
+	if (unlikely(!netif_carrier_ok(dev))) {
+		spin_unlock(&np->rx_lock);
+		return 0;
+	}
+
+	skb_queue_head_init(&rxq);
+	skb_queue_head_init(&errq);
+	skb_queue_head_init(&tmpq);
+
+	if ((budget = *pbudget) > dev->quota)
+		budget = dev->quota;
+	rp = np->rx.sring->rsp_prod;
+	rmb(); /* Ensure we see queued responses up to 'rp'. */
+
+	i = np->rx.rsp_cons;
+	work_done = 0;
+	while ((i != rp) && (work_done < budget)) {
+		memcpy(rx, RING_GET_RESPONSE(&np->rx, i), sizeof(*rx));
+		memset(extras, 0, sizeof(extras));
+
+		err = xennet_get_responses(np, &rinfo, rp, &tmpq,
+					   &pages_flipped);
+
+		if (unlikely(err)) {
+err:
+			while ((skb = __skb_dequeue(&tmpq)))
+				__skb_queue_tail(&errq, skb);
+			np->stats.rx_errors++;
+			i = np->rx.rsp_cons;
+			continue;
+		}
+
+		skb = __skb_dequeue(&tmpq);
+
+		if (extras[XEN_NETIF_EXTRA_TYPE_GSO - 1].type) {
+			struct netif_extra_info *gso;
+			gso = &extras[XEN_NETIF_EXTRA_TYPE_GSO - 1];
+
+			if (unlikely(xennet_set_skb_gso(skb, gso))) {
+				__skb_queue_head(&tmpq, skb);
+				np->rx.rsp_cons += skb_queue_len(&tmpq);
+				goto err;
+			}
+		}
+
+		skb->nh.raw = (void *)skb_shinfo(skb)->frags[0].page;
+		skb->h.raw = skb->nh.raw + rx->offset;
+
+		len = rx->status;
+		if (len > RX_COPY_THRESHOLD)
+			len = RX_COPY_THRESHOLD;
+		skb_put(skb, len);
+
+		if (rx->status > len) {
+			skb_shinfo(skb)->frags[0].page_offset =
+				rx->offset + len;
+			skb_shinfo(skb)->frags[0].size = rx->status - len;
+			skb->data_len = rx->status - len;
+		} else {
+			skb_shinfo(skb)->frags[0].page = NULL;
+			skb_shinfo(skb)->nr_frags = 0;
+		}
+
+		i = xennet_fill_frags(np, skb, &tmpq);
+
+		/*
+		 * Truesize must approximates the size of true data plus
+		 * any supervisor overheads. Adding hypervisor overheads
+		 * has been shown to significantly reduce achievable
+		 * bandwidth with the default receive buffer size. It is
+		 * therefore not wise to account for it here.
+		 *
+		 * After alloc_skb(RX_COPY_THRESHOLD), truesize is set to
+		 * RX_COPY_THRESHOLD + the supervisor overheads. Here, we
+		 * add the size of the data pulled in xennet_fill_frags().
+		 *
+		 * We also adjust for any unused space in the main data
+		 * area by subtracting (RX_COPY_THRESHOLD - len). This is
+		 * especially important with drivers which split incoming
+		 * packets into header and data, using only 66 bytes of
+		 * the main data area (see the e1000 driver for example.)
+		 * On such systems, without this last adjustement, our
+		 * achievable receive throughout using the standard receive
+		 * buffer size was cut by 25%(!!!).
+		 */
+		skb->truesize += skb->data_len - (RX_COPY_THRESHOLD - len);
+		skb->len += skb->data_len;
+
+		if (rx->flags & NETRXF_data_validated)
+			skb->ip_summed = CHECKSUM_UNNECESSARY;
+
+		np->stats.rx_packets++;
+		np->stats.rx_bytes += skb->len;
+
+		__skb_queue_tail(&rxq, skb);
+
+		np->rx.rsp_cons = ++i;
+		work_done++;
+	}
+
+	if (pages_flipped) {
+#ifdef CONFIG_XEN_BALLOON
+		/* Some pages are no longer absent... */
+		balloon_update_driver_allowance(-pages_flipped);
+#endif
+		/* Do all the remapping work, and M2P updates. */
+		if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+			mcl = np->rx_mcl + pages_flipped;
+			MULTI_mmu_update(mcl, np->rx_mmu,
+					 pages_flipped, 0, DOMID_SELF);
+			(void)HYPERVISOR_multicall(np->rx_mcl,
+						   pages_flipped + 1);
+		}
+	}
+
+	while ((skb = __skb_dequeue(&errq)))
+		kfree_skb(skb);
+
+	handle_incoming_queue(dev, &rxq);
+
+	/* If we get a callback with very few responses, reduce fill target. */
+	/* NB. Note exponential increase, linear decrease. */
+	if (((np->rx.req_prod_pvt - np->rx.sring->rsp_prod) >
+	     ((3*np->rx_target) / 4)) &&
+	    (--np->rx_target < np->rx_min_target))
+		np->rx_target = np->rx_min_target;
+
+	network_alloc_rx_buffers(dev);
+
+	*pbudget   -= work_done;
+	dev->quota -= work_done;
+
+	if (work_done < budget) {
+		local_irq_save(flags);
+
+		RING_FINAL_CHECK_FOR_RESPONSES(&np->rx, more_to_do);
+		if (!more_to_do)
+			__netif_rx_complete(dev);
+
+		local_irq_restore(flags);
+	}
+
+	spin_unlock(&np->rx_lock);
+
+	return more_to_do;
+}
+
+static void netif_release_tx_bufs(struct netfront_info *np)
+{
+	struct sk_buff *skb;
+	int i;
+
+	for (i = 1; i <= NET_TX_RING_SIZE; i++) {
+		if ((unsigned long)np->tx_skbs[i] < PAGE_OFFSET)
+			continue;
+
+		skb = np->tx_skbs[i];
+		gnttab_end_foreign_access_ref(
+			np->grant_tx_ref[i], GNTMAP_readonly);
+		gnttab_release_grant_reference(
+			&np->gref_tx_head, np->grant_tx_ref[i]);
+		np->grant_tx_ref[i] = GRANT_INVALID_REF;
+		add_id_to_freelist(np->tx_skbs, i);
+		dev_kfree_skb_irq(skb);
+	}
+}
+
+static void netif_release_rx_bufs(struct netfront_info *np)
+{
+	struct mmu_update      *mmu = np->rx_mmu;
+	struct multicall_entry *mcl = np->rx_mcl;
+	struct sk_buff_head free_list;
+	struct sk_buff *skb;
+	unsigned long mfn;
+	int xfer = 0, noxfer = 0, unused = 0;
+	int id, ref;
+
+	if (np->copying_receiver) {
+		printk("%s: fix me for copying receiver.\n", __FUNCTION__);
+		return;
+	}
+
+	skb_queue_head_init(&free_list);
+
+	spin_lock(&np->rx_lock);
+
+	for (id = 0; id < NET_RX_RING_SIZE; id++) {
+		if ((ref = np->grant_rx_ref[id]) == GRANT_INVALID_REF) {
+			unused++;
+			continue;
+		}
+
+		skb = np->rx_skbs[id];
+		mfn = gnttab_end_foreign_transfer_ref(ref);
+		gnttab_release_grant_reference(&np->gref_rx_head, ref);
+		np->grant_rx_ref[id] = GRANT_INVALID_REF;
+		add_id_to_freelist(np->rx_skbs, id);
+
+		if (0 == mfn) {
+#ifdef CONFIG_XEN_BALLOON
+			struct page *page = skb_shinfo(skb)->frags[0].page;
+			balloon_release_driver_page(page);
+#endif
+			skb_shinfo(skb)->nr_frags = 0;
+			dev_kfree_skb(skb);
+			noxfer++;
+			continue;
+		}
+
+		if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+			/* Remap the page. */
+			struct page *page = skb_shinfo(skb)->frags[0].page;
+			unsigned long pfn = page_to_pfn(page);
+			void *vaddr = page_address(page);
+
+			MULTI_update_va_mapping(mcl, (unsigned long)vaddr,
+						mfn_pte(mfn, PAGE_KERNEL),
+						0);
+			mcl++;
+			mmu->ptr = ((u64)mfn << PAGE_SHIFT)
+				| MMU_MACHPHYS_UPDATE;
+			mmu->val = pfn;
+			mmu++;
+
+			set_phys_to_machine(pfn, mfn);
+		}
+		__skb_queue_tail(&free_list, skb);
+		xfer++;
+	}
+
+	printk("%s: %d xfer, %d noxfer, %d unused\n",
+	       __FUNCTION__, xfer, noxfer, unused);
+
+	if (xfer) {
+#ifdef CONFIG_XEN_BALLOON
+		/* Some pages are no longer absent... */
+		balloon_update_driver_allowance(-xfer);
+#endif
+
+		if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+			/* Do all the remapping work and M2P updates. */
+			mcl->op = __HYPERVISOR_mmu_update;
+			mcl->args[0] = (unsigned long)np->rx_mmu;
+			mcl->args[1] = mmu - np->rx_mmu;
+			mcl->args[2] = 0;
+			mcl->args[3] = DOMID_SELF;
+			mcl++;
+			HYPERVISOR_multicall(np->rx_mcl, mcl - np->rx_mcl);
+		}
+	}
+
+	while ((skb = __skb_dequeue(&free_list)) != NULL)
+		dev_kfree_skb(skb);
+
+	spin_unlock(&np->rx_lock);
+}
+
+static int network_close(struct net_device *dev)
+{
+	struct netfront_info *np = netdev_priv(dev);
+	netif_stop_queue(np->netdev);
+	return 0;
+}
+
+
+static struct net_device_stats *network_get_stats(struct net_device *dev)
+{
+	struct netfront_info *np = netdev_priv(dev);
+	return &np->stats;
+}
+
+static int xennet_change_mtu(struct net_device *dev, int mtu)
+{
+	int max = xennet_can_sg(dev) ? 65535 - ETH_HLEN : ETH_DATA_LEN;
+
+	if (mtu > max)
+		return -EINVAL;
+	dev->mtu = mtu;
+	return 0;
+}
+
+static int xennet_set_sg(struct net_device *dev, u32 data)
+{
+	if (data) {
+		struct netfront_info *np = netdev_priv(dev);
+		int val;
+
+		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend, "feature-sg",
+				 "%d", &val) < 0)
+			val = 0;
+		if (!val)
+			return -ENOSYS;
+	} else if (dev->mtu > ETH_DATA_LEN)
+		dev->mtu = ETH_DATA_LEN;
+
+	return ethtool_op_set_sg(dev, data);
+}
+
+static int xennet_set_tso(struct net_device *dev, u32 data)
+{
+	if (data) {
+		struct netfront_info *np = netdev_priv(dev);
+		int val;
+
+		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
+				 "feature-gso-tcpv4", "%d", &val) < 0)
+			val = 0;
+		if (!val)
+			return -ENOSYS;
+	}
+
+	return ethtool_op_set_tso(dev, data);
+}
+
+static void xennet_set_features(struct net_device *dev)
+{
+	/* Turn off all GSO bits except ROBUST. */
+	dev->features &= (1 << NETIF_F_GSO_SHIFT) - 1;
+	dev->features |= NETIF_F_GSO_ROBUST;
+	xennet_set_sg(dev, 0);
+
+	/* We need checksum offload to enable scatter/gather and TSO. */
+	if (!(dev->features & NETIF_F_IP_CSUM))
+		return;
+
+	if (!xennet_set_sg(dev, 1))
+		xennet_set_tso(dev, 1);
+}
+
+static int network_connect(struct net_device *dev)
+{
+	struct netfront_info *np = netdev_priv(dev);
+	int i, requeue_idx, err;
+	struct sk_buff *skb;
+	grant_ref_t ref;
+	struct netif_rx_request *req;
+	unsigned int feature_rx_copy, feature_rx_flip;
+
+	err = xenbus_scanf(XBT_NIL, np->xbdev->otherend,
+			   "feature-rx-copy", "%u", &feature_rx_copy);
+	if (err != 1)
+		feature_rx_copy = 0;
+	err = xenbus_scanf(XBT_NIL, np->xbdev->otherend,
+			   "feature-rx-flip", "%u", &feature_rx_flip);
+	if (err != 1)
+		feature_rx_flip = 1;
+
+	/*
+	 * Copy packets on receive path if:
+	 *  (a) This was requested by user, and the backend supports it; or
+	 *  (b) Flipping was requested, but this is unsupported by the backend.
+	 */
+	np->copying_receiver = ((MODPARM_rx_copy && feature_rx_copy) ||
+				(MODPARM_rx_flip && !feature_rx_flip));
+
+	err = talk_to_backend(np->xbdev, np);
+	if (err)
+		return err;
+
+	xennet_set_features(dev);
+
+	IPRINTK("device %s has %sing receive path.\n",
+		dev->name, np->copying_receiver ? "copy" : "flipp");
+
+	spin_lock_irq(&np->tx_lock);
+	spin_lock(&np->rx_lock);
+
+	/*
+	 * Recovery procedure:
+	 *  NB. Freelist index entries are always going to be less than
+	 *  PAGE_OFFSET, whereas pointers to skbs will always be equal or
+	 *  greater than PAGE_OFFSET: we use this property to distinguish
+	 *  them.
+	 */
+
+	/* Step 1: Discard all pending TX packet fragments. */
+	netif_release_tx_bufs(np);
+
+	/* Step 2: Rebuild the RX buffer freelist and the RX ring itself. */
+	for (requeue_idx = 0, i = 0; i < NET_RX_RING_SIZE; i++) {
+		if (!np->rx_skbs[i])
+			continue;
+
+		skb = np->rx_skbs[requeue_idx] = xennet_get_rx_skb(np, i);
+		ref = np->grant_rx_ref[requeue_idx] = xennet_get_rx_ref(np, i);
+		req = RING_GET_REQUEST(&np->rx, requeue_idx);
+
+		if (!np->copying_receiver) {
+			gnttab_grant_foreign_transfer_ref(
+				ref, np->xbdev->otherend_id,
+				page_to_pfn(skb_shinfo(skb)->frags->page));
+		} else {
+			gnttab_grant_foreign_access_ref(
+				ref, np->xbdev->otherend_id,
+				pfn_to_mfn(page_to_pfn(skb_shinfo(skb)->
+						       frags->page)),
+				0);
+		}
+		req->gref = ref;
+		req->id   = requeue_idx;
+
+		requeue_idx++;
+	}
+
+	np->rx.req_prod_pvt = requeue_idx;
+
+	/*
+	 * Step 3: All public and private state should now be sane.  Get
+	 * ready to start sending and receiving packets and give the driver
+	 * domain a kick because we've probably just requeued some
+	 * packets.
+	 */
+	netif_carrier_on(dev);
+	notify_remote_via_irq(np->irq);
+	network_tx_buf_gc(dev);
+	network_alloc_rx_buffers(dev);
+
+	spin_unlock(&np->rx_lock);
+	spin_unlock_irq(&np->tx_lock);
+
+	return 0;
+}
+
+static void netif_uninit(struct net_device *dev)
+{
+	struct netfront_info *np = netdev_priv(dev);
+	netif_release_tx_bufs(np);
+	netif_release_rx_bufs(np);
+	gnttab_free_grant_references(np->gref_tx_head);
+	gnttab_free_grant_references(np->gref_rx_head);
+}
+
+static struct ethtool_ops network_ethtool_ops =
+{
+	.get_tx_csum = ethtool_op_get_tx_csum,
+	.set_tx_csum = ethtool_op_set_tx_csum,
+	.get_sg = ethtool_op_get_sg,
+	.set_sg = xennet_set_sg,
+	.get_tso = ethtool_op_get_tso,
+	.set_tso = xennet_set_tso,
+	.get_link = ethtool_op_get_link,
+};
+
+#ifdef CONFIG_SYSFS
+static ssize_t show_rxbuf_min(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	struct netfront_info *info = netdev_priv(netdev);
+
+	return sprintf(buf, "%u\n", info->rx_min_target);
+}
+
+static ssize_t store_rxbuf_min(struct device *dev,
+			       struct device_attribute *attr,
+			       const char *buf, size_t len)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	struct netfront_info *np = netdev_priv(netdev);
+	char *endp;
+	unsigned long target;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	target = simple_strtoul(buf, &endp, 0);
+	if (endp == buf)
+		return -EBADMSG;
+
+	if (target < RX_MIN_TARGET)
+		target = RX_MIN_TARGET;
+	if (target > RX_MAX_TARGET)
+		target = RX_MAX_TARGET;
+
+	spin_lock(&np->rx_lock);
+	if (target > np->rx_max_target)
+		np->rx_max_target = target;
+	np->rx_min_target = target;
+	if (target > np->rx_target)
+		np->rx_target = target;
+
+	network_alloc_rx_buffers(netdev);
+
+	spin_unlock(&np->rx_lock);
+	return len;
+}
+
+static ssize_t show_rxbuf_max(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	struct netfront_info *info = netdev_priv(netdev);
+
+	return sprintf(buf, "%u\n", info->rx_max_target);
+}
+
+static ssize_t store_rxbuf_max(struct device *dev,
+			       struct device_attribute *attr,
+			       const char *buf, size_t len)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	struct netfront_info *np = netdev_priv(netdev);
+	char *endp;
+	unsigned long target;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	target = simple_strtoul(buf, &endp, 0);
+	if (endp == buf)
+		return -EBADMSG;
+
+	if (target < RX_MIN_TARGET)
+		target = RX_MIN_TARGET;
+	if (target > RX_MAX_TARGET)
+		target = RX_MAX_TARGET;
+
+	spin_lock(&np->rx_lock);
+	if (target < np->rx_min_target)
+		np->rx_min_target = target;
+	np->rx_max_target = target;
+	if (target < np->rx_target)
+		np->rx_target = target;
+
+	network_alloc_rx_buffers(netdev);
+
+	spin_unlock(&np->rx_lock);
+	return len;
+}
+
+static ssize_t show_rxbuf_cur(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	struct netfront_info *info = netdev_priv(netdev);
+
+	return sprintf(buf, "%u\n", info->rx_target);
+}
+
+static struct device_attribute xennet_attrs[] = {
+	__ATTR(rxbuf_min, S_IRUGO|S_IWUSR, show_rxbuf_min, store_rxbuf_min),
+	__ATTR(rxbuf_max, S_IRUGO|S_IWUSR, show_rxbuf_max, store_rxbuf_max),
+	__ATTR(rxbuf_cur, S_IRUGO, show_rxbuf_cur, NULL),
+};
+
+static int xennet_sysfs_addif(struct net_device *netdev)
+{
+	int i;
+	int error = 0;
+
+	for (i = 0; i < ARRAY_SIZE(xennet_attrs); i++) {
+		error = device_create_file(&netdev->dev,
+					   &xennet_attrs[i]);
+		if (error)
+			goto fail;
+	}
+	return 0;
+
+ fail:
+	while (--i >= 0)
+		device_remove_file(&netdev->dev, &xennet_attrs[i]);
+	return error;
+}
+
+static void xennet_sysfs_delif(struct net_device *netdev)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(xennet_attrs); i++) {
+		device_remove_file(&netdev->dev, &xennet_attrs[i]);
+	}
+}
+
+#endif /* CONFIG_SYSFS */
+
+
+/*
+ * Nothing to do here. Virtual interface is point-to-point and the
+ * physical interface is probably promiscuous anyway.
+ */
+static void network_set_multicast_list(struct net_device *dev)
+{
+}
+
+static struct net_device * __devinit create_netdev(struct xenbus_device *dev)
+{
+	int i, err = 0;
+	struct net_device *netdev = NULL;
+	struct netfront_info *np = NULL;
+
+	netdev = alloc_etherdev(sizeof(struct netfront_info));
+	if (!netdev) {
+		printk(KERN_WARNING "%s> alloc_etherdev failed.\n",
+		       __FUNCTION__);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	np                   = netdev_priv(netdev);
+	np->xbdev            = dev;
+
+	netif_carrier_off(netdev);
+
+	spin_lock_init(&np->tx_lock);
+	spin_lock_init(&np->rx_lock);
+
+	skb_queue_head_init(&np->rx_batch);
+	np->rx_target     = RX_DFL_MIN_TARGET;
+	np->rx_min_target = RX_DFL_MIN_TARGET;
+	np->rx_max_target = RX_MAX_TARGET;
+
+	init_timer(&np->rx_refill_timer);
+	np->rx_refill_timer.data = (unsigned long)netdev;
+	np->rx_refill_timer.function = rx_refill_timeout;
+
+	/* Initialise {tx,rx}_skbs as a free chain containing every entry. */
+	for (i = 0; i <= NET_TX_RING_SIZE; i++) {
+		np->tx_skbs[i] = (void *)((unsigned long) i+1);
+		np->grant_tx_ref[i] = GRANT_INVALID_REF;
+	}
+
+	for (i = 0; i < NET_RX_RING_SIZE; i++) {
+		np->rx_skbs[i] = NULL;
+		np->grant_rx_ref[i] = GRANT_INVALID_REF;
+	}
+
+	/* A grant for every tx ring slot */
+	if (gnttab_alloc_grant_references(TX_MAX_TARGET,
+					  &np->gref_tx_head) < 0) {
+		printk(KERN_ALERT "#### netfront can't alloc tx grant refs\n");
+		err = -ENOMEM;
+		goto exit;
+	}
+	/* A grant for every rx ring slot */
+	if (gnttab_alloc_grant_references(RX_MAX_TARGET,
+					  &np->gref_rx_head) < 0) {
+		printk(KERN_ALERT "#### netfront can't alloc rx grant refs\n");
+		err = -ENOMEM;
+		goto exit_free_tx;
+	}
+
+	netdev->open            = network_open;
+	netdev->hard_start_xmit = network_start_xmit;
+	netdev->stop            = network_close;
+	netdev->get_stats       = network_get_stats;
+	netdev->poll            = netif_poll;
+	netdev->set_multicast_list = network_set_multicast_list;
+	netdev->uninit          = netif_uninit;
+	netdev->change_mtu	= xennet_change_mtu;
+	netdev->weight          = 64;
+	netdev->features        = NETIF_F_IP_CSUM;
+
+	SET_ETHTOOL_OPS(netdev, &network_ethtool_ops);
+	SET_MODULE_OWNER(netdev);
+	SET_NETDEV_DEV(netdev, &dev->dev);
+
+	np->netdev = netdev;
+	return netdev;
+
+ exit_free_tx:
+	gnttab_free_grant_references(np->gref_tx_head);
+ exit:
+	free_netdev(netdev);
+	return ERR_PTR(err);
+}
+
+/*
+ * We use this notifier to send out a fake ARP reply to reset switches and
+ * router ARP caches when an IP interface is brought up on a VIF.
+ */
+static int
+inetdev_notify(struct notifier_block *this, unsigned long event, void *ptr)
+{
+	struct in_ifaddr  *ifa = (struct in_ifaddr *)ptr;
+	struct net_device *dev = ifa->ifa_dev->dev;
+
+	/* UP event and is it one of our devices? */
+	if (event == NETDEV_UP && dev->open == network_open)
+		(void)send_fake_arp(dev);
+
+	return NOTIFY_DONE;
+}
+
+
+/* ** Close down ** */
+
+
+/**
+ * Handle the change of state of the backend to Closing.  We must delete our
+ * device-layer structures now, to ensure that writes are flushed through to
+ * the backend.  Once is this done, we can switch to Closed in
+ * acknowledgement.
+ */
+static void netfront_closing(struct xenbus_device *dev)
+{
+	struct netfront_info *info = dev->dev.driver_data;
+
+	DPRINTK("%s\n", dev->nodename);
+
+	close_netdev(info);
+	xenbus_frontend_closed(dev);
+}
+
+
+static int __devexit netfront_remove(struct xenbus_device *dev)
+{
+	struct netfront_info *info = dev->dev.driver_data;
+
+	DPRINTK("%s\n", dev->nodename);
+
+	netif_disconnect_backend(info);
+	free_netdev(info->netdev);
+
+	return 0;
+}
+
+
+static int open_netdev(struct netfront_info *info)
+{
+	int err;
+
+	err = register_netdev(info->netdev);
+	if (err) {
+		printk(KERN_WARNING "%s: register_netdev err=%d\n",
+		       __FUNCTION__, err);
+		return err;
+	}
+
+	err = xennet_sysfs_addif(info->netdev);
+	if (err) {
+		unregister_netdev(info->netdev);
+		printk(KERN_WARNING "%s: add sysfs failed err=%d\n",
+		       __FUNCTION__, err);
+		return err;
+	}
+
+	return 0;
+}
+
+static void close_netdev(struct netfront_info *info)
+{
+	del_timer_sync(&info->rx_refill_timer);
+
+	xennet_sysfs_delif(info->netdev);
+	unregister_netdev(info->netdev);
+}
+
+
+static void netif_disconnect_backend(struct netfront_info *info)
+{
+	/* Stop old i/f to prevent errors whilst we rebuild the state. */
+	spin_lock_irq(&info->tx_lock);
+	spin_lock(&info->rx_lock);
+	netif_carrier_off(info->netdev);
+	spin_unlock(&info->rx_lock);
+	spin_unlock_irq(&info->tx_lock);
+
+	if (info->irq)
+		unbind_from_irqhandler(info->irq, info->netdev);
+	info->evtchn = info->irq = 0;
+
+	end_access(info->tx_ring_ref, info->tx.sring);
+	end_access(info->rx_ring_ref, info->rx.sring);
+	info->tx_ring_ref = GRANT_INVALID_REF;
+	info->rx_ring_ref = GRANT_INVALID_REF;
+	info->tx.sring = NULL;
+	info->rx.sring = NULL;
+}
+
+
+static void end_access(int ref, void *page)
+{
+	if (ref != GRANT_INVALID_REF)
+		gnttab_end_foreign_access(ref, 0, (unsigned long)page);
+}
+
+
+/* ** Driver registration ** */
+
+
+static struct xenbus_device_id netfront_ids[] = {
+	{ "vif" },
+	{ "" }
+};
+
+
+static struct xenbus_driver netfront = {
+	.name = "vif",
+	.owner = THIS_MODULE,
+	.ids = netfront_ids,
+	.probe = netfront_probe,
+	.remove = __devexit_p(netfront_remove),
+	.resume = netfront_resume,
+	.otherend_changed = backend_changed,
+};
+
+
+static struct notifier_block notifier_inetdev = {
+	.notifier_call  = inetdev_notify,
+};
+
+static int __init netif_init(void)
+{
+	if (!is_running_on_xen())
+		return -ENODEV;
+
+#ifdef CONFIG_XEN
+	if (MODPARM_rx_flip && MODPARM_rx_copy) {
+		WPRINTK("Cannot specify both rx_copy and rx_flip.\n");
+		return -EINVAL;
+	}
+
+	if (!MODPARM_rx_flip && !MODPARM_rx_copy)
+		MODPARM_rx_flip = 1; /* Default is to flip. */
+#endif
+
+	if (is_initial_xendomain())
+		return 0;
+
+	IPRINTK("Initialising virtual ethernet driver.\n");
+
+	(void)register_inetaddr_notifier(&notifier_inetdev);
+
+	return xenbus_register_frontend(&netfront);
+}
+module_init(netif_init);
+
+
+static void __exit netif_exit(void)
+{
+	if (is_initial_xendomain())
+		return;
+
+	unregister_inetaddr_notifier(&notifier_inetdev);
+
+	return xenbus_unregister_driver(&netfront);
+}
+module_exit(netif_exit);
+
+MODULE_LICENSE("Dual BSD/GPL");
===================================================================
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -2,6 +2,8 @@
 #define _XEN_EVENTS_H
 
 #include <linux/irq.h>
+#include <xen/interface/event_channel.h>
+#include <asm/xen/hypercall.h>
 
 int bind_evtchn_to_irq(unsigned int evtchn);
 int bind_evtchn_to_irqhandler(unsigned int evtchn,

-- 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 26/26] Xen-paravirt_ops: Add the Xen virtual network device driver.
  2007-03-01 23:25 ` [patch 26/26] Xen-paravirt_ops: Add the Xen virtual network " Jeremy Fitzhardinge
@ 2007-03-02  0:42     ` Stephen Hemminger
  0 siblings, 0 replies; 330+ messages in thread
From: Stephen Hemminger @ 2007-03-02  0:42 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, netdev, Jeff Garzik

On Thu, 01 Mar 2007 15:25:09 -0800
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> The network device frontend driver allows the kernel to access network
> devices exported exported by a virtual machine containing a physical
> network device driver.
> 
> Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
> Cc: netdev@vger.kernel.org
> Cc: Jeff Garzik <jeff@garzik.org>
> 
> ---
>  drivers/net/Kconfig        |   12 
>  drivers/net/Makefile       |    2 
>  drivers/net/xen-netfront.c | 2066 ++++++++++++++++++++++++++++++++++++++++++++
>  include/xen/events.h       |    2 
>  4 files changed, 2082 insertions(+)
> 
> ===================================================================
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -2525,6 +2525,18 @@ source "drivers/atm/Kconfig"
>  
>  source "drivers/s390/net/Kconfig"
>  
> +config XEN_NETDEV_FRONTEND
> +	tristate "Xen network device frontend driver"
> +	depends on XEN
> +	default y
> +	help
> +	  The network device frontend driver allows the kernel to
> +	  access network devices exported exported by a virtual
> +	  machine containing a physical network device driver. The
> +	  frontend driver is intended for unprivileged guest domains;
> +	  if you are compiling a kernel for a Xen guest, you almost
> +	  certainly want to enable this.
> +
>  config ISERIES_VETH
>  	tristate "iSeries Virtual Ethernet driver support"
>  	depends on PPC_ISERIES

Might make more sense earlier in list (near other virtual devices).
===================================================================
>
> +/*
> + * Mutually-exclusive module options to select receive data path:
> + *  rx_copy : Packets are copied by network backend into local memory
> + *  rx_flip : Page containing packet data is transferred to our ownership
> + * For fully-virtualised guests there is no option - copying must be used.
> + * For paravirtualised guests, flipping is the default.
> + */
> +#ifdef CONFIG_XEN


Hey I thought this driver depended on CONFIG_XEN already?

> +static int MODPARM_rx_copy = 0;
> +module_param_named(rx_copy, MODPARM_rx_copy, bool, 0);
> +MODULE_PARM_DESC(rx_copy, "Copy packets from network card (rather than flip)");
> +static int MODPARM_rx_flip = 0;
> +module_param_named(rx_flip, MODPARM_rx_flip, bool, 0);
> +MODULE_PARM_DESC(rx_flip, "Flip packets from network card (rather than copy)");
> +#else
> +static const int MODPARM_rx_copy = 1;
> +static const int MODPARM_rx_flip = 0;
> +#endif


No MIXED case variable names please.


Why have two mutually exclusive values instead of just one value
with three states: 0 = normal, 1 = copy, 2 = flip?


> +#define DPRINTK(fmt, args...)				\
> +	pr_debug("netfront (%s:%d) " fmt,		\
> +		 __FUNCTION__, __LINE__, ##args)
> +#define IPRINTK(fmt, args...)				\
> +	printk(KERN_INFO "netfront: " fmt, ##args)
> +#define WPRINTK(fmt, args...)				\
> +	printk(KERN_WARNING "netfront: " fmt, ##args)


Could you use dev_dbg, dev_info, dev_warn instead of these macros?

> +
> +/** Send a packet on a net device to encourage switches to learn the
> + * MAC. We send a fake ARP request.
> + *
> + * @param dev device
> + * @return 0 on success, error code otherwise
> + */
Why the sudden urge to use docbook format on one internal function.


> +static int send_fake_arp(struct net_device *dev)
> +{
> +	struct sk_buff *skb;
> +	u32             src_ip, dst_ip;
> +
> +	dst_ip = INADDR_BROADCAST;
> +	src_ip = inet_select_addr(dev, dst_ip, RT_SCOPE_LINK);
> +
> +	/* No IP? Then nothing to do. */
> +	if (src_ip == 0)
> +		return 0;
> +
> +	skb = arp_create(ARPOP_REPLY, ETH_P_ARP,
> +			 dst_ip, dev, src_ip,
> +			 /*dst_hw*/ NULL, /*src_hw*/ NULL,
> +			 /*target_hw*/ dev->dev_addr);
> +	if (skb == NULL)
> +		return -ENOMEM;
> +
> +	return dev_queue_xmit(skb);
> +}

This should probably be done in generic (non driver code).
It creates lots of dependencies here.

> +/*
> + * We use this notifier to send out a fake ARP reply to reset switches and
> + * router ARP caches when an IP interface is brought up on a VIF.
> + */
> +static int
> +inetdev_notify(struct notifier_block *this, unsigned long event, void *ptr)
> +{
> +	struct in_ifaddr  *ifa = (struct in_ifaddr *)ptr;
> +	struct net_device *dev = ifa->ifa_dev->dev;
> +
> +	/* UP event and is it one of our devices? */
> +	if (event == NETDEV_UP && dev->open == network_open)
> +		(void)send_fake_arp(dev);
> +
> +	return NOTIFY_DONE;
> +}

Shouldn't just be a global kernel option for gratuitous ARP.
Doesn't seem to be unique to this driver.
With sysctl to enable it.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 26/26] Xen-paravirt_ops: Add the Xen virtual network device driver.
@ 2007-03-02  0:42     ` Stephen Hemminger
  0 siblings, 0 replies; 330+ messages in thread
From: Stephen Hemminger @ 2007-03-02  0:42 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, Jeff Garzik, Ian Pratt, netdev, linux-kernel,
	Chris Wright, Andi Kleen, Zachary, virtualization, Andrew Morton

On Thu, 01 Mar 2007 15:25:09 -0800
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> The network device frontend driver allows the kernel to access network
> devices exported exported by a virtual machine containing a physical
> network device driver.
> 
> Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
> Cc: netdev@vger.kernel.org
> Cc: Jeff Garzik <jeff@garzik.org>
> 
> ---
>  drivers/net/Kconfig        |   12 
>  drivers/net/Makefile       |    2 
>  drivers/net/xen-netfront.c | 2066 ++++++++++++++++++++++++++++++++++++++++++++
>  include/xen/events.h       |    2 
>  4 files changed, 2082 insertions(+)
> 
> ===================================================================
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -2525,6 +2525,18 @@ source "drivers/atm/Kconfig"
>  
>  source "drivers/s390/net/Kconfig"
>  
> +config XEN_NETDEV_FRONTEND
> +	tristate "Xen network device frontend driver"
> +	depends on XEN
> +	default y
> +	help
> +	  The network device frontend driver allows the kernel to
> +	  access network devices exported exported by a virtual
> +	  machine containing a physical network device driver. The
> +	  frontend driver is intended for unprivileged guest domains;
> +	  if you are compiling a kernel for a Xen guest, you almost
> +	  certainly want to enable this.
> +
>  config ISERIES_VETH
>  	tristate "iSeries Virtual Ethernet driver support"
>  	depends on PPC_ISERIES

Might make more sense earlier in list (near other virtual devices).
===================================================================
>
> +/*
> + * Mutually-exclusive module options to select receive data path:
> + *  rx_copy : Packets are copied by network backend into local memory
> + *  rx_flip : Page containing packet data is transferred to our ownership
> + * For fully-virtualised guests there is no option - copying must be used.
> + * For paravirtualised guests, flipping is the default.
> + */
> +#ifdef CONFIG_XEN


Hey I thought this driver depended on CONFIG_XEN already?

> +static int MODPARM_rx_copy = 0;
> +module_param_named(rx_copy, MODPARM_rx_copy, bool, 0);
> +MODULE_PARM_DESC(rx_copy, "Copy packets from network card (rather than flip)");
> +static int MODPARM_rx_flip = 0;
> +module_param_named(rx_flip, MODPARM_rx_flip, bool, 0);
> +MODULE_PARM_DESC(rx_flip, "Flip packets from network card (rather than copy)");
> +#else
> +static const int MODPARM_rx_copy = 1;
> +static const int MODPARM_rx_flip = 0;
> +#endif


No MIXED case variable names please.


Why have two mutually exclusive values instead of just one value
with three states: 0 = normal, 1 = copy, 2 = flip?


> +#define DPRINTK(fmt, args...)				\
> +	pr_debug("netfront (%s:%d) " fmt,		\
> +		 __FUNCTION__, __LINE__, ##args)
> +#define IPRINTK(fmt, args...)				\
> +	printk(KERN_INFO "netfront: " fmt, ##args)
> +#define WPRINTK(fmt, args...)				\
> +	printk(KERN_WARNING "netfront: " fmt, ##args)


Could you use dev_dbg, dev_info, dev_warn instead of these macros?

> +
> +/** Send a packet on a net device to encourage switches to learn the
> + * MAC. We send a fake ARP request.
> + *
> + * @param dev device
> + * @return 0 on success, error code otherwise
> + */
Why the sudden urge to use docbook format on one internal function.


> +static int send_fake_arp(struct net_device *dev)
> +{
> +	struct sk_buff *skb;
> +	u32             src_ip, dst_ip;
> +
> +	dst_ip = INADDR_BROADCAST;
> +	src_ip = inet_select_addr(dev, dst_ip, RT_SCOPE_LINK);
> +
> +	/* No IP? Then nothing to do. */
> +	if (src_ip == 0)
> +		return 0;
> +
> +	skb = arp_create(ARPOP_REPLY, ETH_P_ARP,
> +			 dst_ip, dev, src_ip,
> +			 /*dst_hw*/ NULL, /*src_hw*/ NULL,
> +			 /*target_hw*/ dev->dev_addr);
> +	if (skb == NULL)
> +		return -ENOMEM;
> +
> +	return dev_queue_xmit(skb);
> +}

This should probably be done in generic (non driver code).
It creates lots of dependencies here.

> +/*
> + * We use this notifier to send out a fake ARP reply to reset switches and
> + * router ARP caches when an IP interface is brought up on a VIF.
> + */
> +static int
> +inetdev_notify(struct notifier_block *this, unsigned long event, void *ptr)
> +{
> +	struct in_ifaddr  *ifa = (struct in_ifaddr *)ptr;
> +	struct net_device *dev = ifa->ifa_dev->dev;
> +
> +	/* UP event and is it one of our devices? */
> +	if (event == NETDEV_UP && dev->open == network_open)
> +		(void)send_fake_arp(dev);
> +
> +	return NOTIFY_DONE;
> +}

Shouldn't just be a global kernel option for gratuitous ARP.
Doesn't seem to be unique to this driver.
With sysctl to enable it.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-02  0:45     ` Zachary Amsden
  -1 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-02  0:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Rusty Russell

Jeremy Fitzhardinge wrote:
> Fix a few clobbers to include the return register.  The clobbers set
> is the set of all registers modified (or may be modified) by the code
> snippet, regardless of whether it was deliberate or accidental.
>
> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Zachary Amsden <zach@vmware.com>
>
> ---
>  include/asm-i386/paravirt.h |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> ===================================================================
> --- a/include/asm-i386/paravirt.h
> +++ b/include/asm-i386/paravirt.h
> @@ -556,7 +556,7 @@ static inline unsigned long __raw_local_
>  					   "popl %%edx; popl %%ecx")
>  			     : "=a"(f): "m"(paravirt_ops.save_fl),
>  			       paravirt_type(PARAVIRT_PATCH(save_fl)),
> -			       paravirt_clobber(CLBR_NONE)
> +			       paravirt_clobber(CLBR_EAX)
>  			     : "memory", "cc");
>  	return f;
>   

Has this been tested on older gcc's?  I seem to recall them barfing over 
things like this.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
@ 2007-03-02  0:45     ` Zachary Amsden
  0 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-02  0:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, virtualization, Rusty Russell, linux-kernel,
	Chris Wright, Andi Kleen, Andrew Morton

Jeremy Fitzhardinge wrote:
> Fix a few clobbers to include the return register.  The clobbers set
> is the set of all registers modified (or may be modified) by the code
> snippet, regardless of whether it was deliberate or accidental.
>
> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Zachary Amsden <zach@vmware.com>
>
> ---
>  include/asm-i386/paravirt.h |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> ===================================================================
> --- a/include/asm-i386/paravirt.h
> +++ b/include/asm-i386/paravirt.h
> @@ -556,7 +556,7 @@ static inline unsigned long __raw_local_
>  					   "popl %%edx; popl %%ecx")
>  			     : "=a"(f): "m"(paravirt_ops.save_fl),
>  			       paravirt_type(PARAVIRT_PATCH(save_fl)),
> -			       paravirt_clobber(CLBR_NONE)
> +			       paravirt_clobber(CLBR_EAX)
>  			     : "memory", "cc");
>  	return f;
>   

Has this been tested on older gcc's?  I seem to recall them barfing over 
things like this.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
  2007-03-02  0:45     ` Zachary Amsden
@ 2007-03-02  0:49       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-02  0:49 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Rusty Russell

Zachary Amsden wrote:
> Jeremy Fitzhardinge wrote:
>> Fix a few clobbers to include the return register.  The clobbers set
>> is the set of all registers modified (or may be modified) by the code
>> snippet, regardless of whether it was deliberate or accidental.
>>
>> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>> Cc: Zachary Amsden <zach@vmware.com>
>>
>> ---
>>  include/asm-i386/paravirt.h |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> ===================================================================
>> --- a/include/asm-i386/paravirt.h
>> +++ b/include/asm-i386/paravirt.h
>> @@ -556,7 +556,7 @@ static inline unsigned long __raw_local_
>>                         "popl %%edx; popl %%ecx")
>>                   : "=a"(f): "m"(paravirt_ops.save_fl),
>>                     paravirt_type(PARAVIRT_PATCH(save_fl)),
>> -                   paravirt_clobber(CLBR_NONE)
>> +                   paravirt_clobber(CLBR_EAX)
>>                   : "memory", "cc");
>>      return f;
>>   
>
> Has this been tested on older gcc's?  I seem to recall them barfing
> over things like this.

Things like what?  Do you mean the %[foo] asm parameter syntax?  I think
those versions are no longer supported - Arjan posted a patch a few days
ago to convert a pile of asms to this form.  Or do you mean something else?

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
@ 2007-03-02  0:49       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-02  0:49 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: xen-devel, virtualization, Rusty Russell, linux-kernel,
	Chris Wright, Andi Kleen, Andrew Morton

Zachary Amsden wrote:
> Jeremy Fitzhardinge wrote:
>> Fix a few clobbers to include the return register.  The clobbers set
>> is the set of all registers modified (or may be modified) by the code
>> snippet, regardless of whether it was deliberate or accidental.
>>
>> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>> Cc: Zachary Amsden <zach@vmware.com>
>>
>> ---
>>  include/asm-i386/paravirt.h |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> ===================================================================
>> --- a/include/asm-i386/paravirt.h
>> +++ b/include/asm-i386/paravirt.h
>> @@ -556,7 +556,7 @@ static inline unsigned long __raw_local_
>>                         "popl %%edx; popl %%ecx")
>>                   : "=a"(f): "m"(paravirt_ops.save_fl),
>>                     paravirt_type(PARAVIRT_PATCH(save_fl)),
>> -                   paravirt_clobber(CLBR_NONE)
>> +                   paravirt_clobber(CLBR_EAX)
>>                   : "memory", "cc");
>>      return f;
>>   
>
> Has this been tested on older gcc's?  I seem to recall them barfing
> over things like this.

Things like what?  Do you mean the %[foo] asm parameter syntax?  I think
those versions are no longer supported - Arjan posted a patch a few days
ago to convert a pile of asms to this form.  Or do you mean something else?

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
  2007-03-02  0:49       ` Jeremy Fitzhardinge
@ 2007-03-02  0:52         ` Zachary Amsden
  -1 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-02  0:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Rusty Russell

Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
>   
>> Jeremy Fitzhardinge wrote:
>>     
>>> Fix a few clobbers to include the return register.  The clobbers set
>>> is the set of all registers modified (or may be modified) by the code
>>> snippet, regardless of whether it was deliberate or accidental.
>>>
>>> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>> Cc: Zachary Amsden <zach@vmware.com>
>>>
>>> ---
>>>  include/asm-i386/paravirt.h |    4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> ===================================================================
>>> --- a/include/asm-i386/paravirt.h
>>> +++ b/include/asm-i386/paravirt.h
>>> @@ -556,7 +556,7 @@ static inline unsigned long __raw_local_
>>>                         "popl %%edx; popl %%ecx")
>>>                   : "=a"(f): "m"(paravirt_ops.save_fl),
>>>                     paravirt_type(PARAVIRT_PATCH(save_fl)),
>>> -                   paravirt_clobber(CLBR_NONE)
>>> +                   paravirt_clobber(CLBR_EAX)
>>>                   : "memory", "cc");
>>>      return f;
>>>   
>>>       
>> Has this been tested on older gcc's?  I seem to recall them barfing
>> over things like this.
>>     
>
> Things like what?  Do you mean the %[foo] asm parameter syntax?  I think
> those versions are no longer supported - Arjan posted a patch a few days
> ago to convert a pile of asms to this form.  Or do you mean something else?
>   

I meant having an output in the clobber list, I didn't know we were 
dropping support for older versions already.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
@ 2007-03-02  0:52         ` Zachary Amsden
  0 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-02  0:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, virtualization, Rusty Russell, linux-kernel,
	Chris Wright, Andi Kleen, Andrew Morton

Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
>   
>> Jeremy Fitzhardinge wrote:
>>     
>>> Fix a few clobbers to include the return register.  The clobbers set
>>> is the set of all registers modified (or may be modified) by the code
>>> snippet, regardless of whether it was deliberate or accidental.
>>>
>>> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>> Cc: Zachary Amsden <zach@vmware.com>
>>>
>>> ---
>>>  include/asm-i386/paravirt.h |    4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> ===================================================================
>>> --- a/include/asm-i386/paravirt.h
>>> +++ b/include/asm-i386/paravirt.h
>>> @@ -556,7 +556,7 @@ static inline unsigned long __raw_local_
>>>                         "popl %%edx; popl %%ecx")
>>>                   : "=a"(f): "m"(paravirt_ops.save_fl),
>>>                     paravirt_type(PARAVIRT_PATCH(save_fl)),
>>> -                   paravirt_clobber(CLBR_NONE)
>>> +                   paravirt_clobber(CLBR_EAX)
>>>                   : "memory", "cc");
>>>      return f;
>>>   
>>>       
>> Has this been tested on older gcc's?  I seem to recall them barfing
>> over things like this.
>>     
>
> Things like what?  Do you mean the %[foo] asm parameter syntax?  I think
> those versions are no longer supported - Arjan posted a patch a few days
> ago to convert a pile of asms to this form.  Or do you mean something else?
>   

I meant having an output in the clobber list, I didn't know we were 
dropping support for older versions already.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 26/26] Xen-paravirt_ops: Add the Xen virtual network device driver.
  2007-03-02  0:42     ` Stephen Hemminger
  (?)
@ 2007-03-02  0:56     ` Jeremy Fitzhardinge
  2007-03-02  1:30         ` Stephen Hemminger
  -1 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-02  0:56 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, netdev, Jeff Garzik

Stephen Hemminger wrote:
> On Thu, 01 Mar 2007 15:25:09 -0800
> Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> The network device frontend driver allows the kernel to access network
>> devices exported exported by a virtual machine containing a physical
>> network device driver.
>>
>> Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
>> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
>> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
>> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
>> Cc: netdev@vger.kernel.org
>> Cc: Jeff Garzik <jeff@garzik.org>
>>
>> ---
>>  drivers/net/Kconfig        |   12 
>>  drivers/net/Makefile       |    2 
>>  drivers/net/xen-netfront.c | 2066 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/xen/events.h       |    2 
>>  4 files changed, 2082 insertions(+)
>>
>> ===================================================================
>> --- a/drivers/net/Kconfig
>> +++ b/drivers/net/Kconfig
>> @@ -2525,6 +2525,18 @@ source "drivers/atm/Kconfig"
>>  
>>  source "drivers/s390/net/Kconfig"
>>  
>> +config XEN_NETDEV_FRONTEND
>> +	tristate "Xen network device frontend driver"
>> +	depends on XEN
>> +	default y
>> +	help
>> +	  The network device frontend driver allows the kernel to
>> +	  access network devices exported exported by a virtual
>> +	  machine containing a physical network device driver. The
>> +	  frontend driver is intended for unprivileged guest domains;
>> +	  if you are compiling a kernel for a Xen guest, you almost
>> +	  certainly want to enable this.
>> +
>>  config ISERIES_VETH
>>  	tristate "iSeries Virtual Ethernet driver support"
>>  	depends on PPC_ISERIES
>>     
>
> Might make more sense earlier in list (near other virtual devices).
>   

OK.

>
> Hey I thought this driver depended on CONFIG_XEN already?
>   

You're right.

>> +static int MODPARM_rx_copy = 0;
>> +module_param_named(rx_copy, MODPARM_rx_copy, bool, 0);
>> +MODULE_PARM_DESC(rx_copy, "Copy packets from network card (rather than flip)");
>> +static int MODPARM_rx_flip = 0;
>> +module_param_named(rx_flip, MODPARM_rx_flip, bool, 0);
>> +MODULE_PARM_DESC(rx_flip, "Flip packets from network card (rather than copy)");
>> +#else
>> +static const int MODPARM_rx_copy = 1;
>> +static const int MODPARM_rx_flip = 0;
>> +#endif
>>     
>
>
> No MIXED case variable names please.
>   

OK.

> Why have two mutually exclusive values instead of just one value
> with three states: 0 = normal, 1 = copy, 2 = flip?
>   

That does seem odd.  I'll fix it up.

>> +#define DPRINTK(fmt, args...)				\
>> +	pr_debug("netfront (%s:%d) " fmt,		\
>> +		 __FUNCTION__, __LINE__, ##args)
>> +#define IPRINTK(fmt, args...)				\
>> +	printk(KERN_INFO "netfront: " fmt, ##args)
>> +#define WPRINTK(fmt, args...)				\
>> +	printk(KERN_WARNING "netfront: " fmt, ##args)
>>     
>
>
> Could you use dev_dbg, dev_info, dev_warn instead of these macros?
>   

Yes.  I'd done that conversion elsewhere, but overlooked it here.

>> +
>> +/** Send a packet on a net device to encourage switches to learn the
>> + * MAC. We send a fake ARP request.
>> + *
>> + * @param dev device
>> + * @return 0 on success, error code otherwise
>> + */
>>     
> Why the sudden urge to use docbook format on one internal function.
>   

It probably got copied from somewhere.  I'll clean it up.

>> +static int send_fake_arp(struct net_device *dev)
>> +{
>> +	struct sk_buff *skb;
>> +	u32             src_ip, dst_ip;
>> +
>> +	dst_ip = INADDR_BROADCAST;
>> +	src_ip = inet_select_addr(dev, dst_ip, RT_SCOPE_LINK);
>> +
>> +	/* No IP? Then nothing to do. */
>> +	if (src_ip == 0)
>> +		return 0;
>> +
>> +	skb = arp_create(ARPOP_REPLY, ETH_P_ARP,
>> +			 dst_ip, dev, src_ip,
>> +			 /*dst_hw*/ NULL, /*src_hw*/ NULL,
>> +			 /*target_hw*/ dev->dev_addr);
>> +	if (skb == NULL)
>> +		return -ENOMEM;
>> +
>> +	return dev_queue_xmit(skb);
>> +}
>>     
>
> This should probably be done in generic (non driver code).
> It creates lots of dependencies here.
> [...]
> Shouldn't just be a global kernel option for gratuitous ARP.
> Doesn't seem to be unique to this driver.
> With sysctl to enable it.
>   

I agree it would be nice to not have to do this in the driver.  The
specific requirement here is to make it send a packet after resume
(which includes arriving after a migration) so that a switch can quickly
work that the machine is there.  Is there a generic mechanism for doing
this?

    J


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
  2007-03-02  0:52         ` Zachary Amsden
  (?)
@ 2007-03-02  0:58         ` Jeremy Fitzhardinge
  2007-03-02  1:18             ` Zachary Amsden
  -1 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-02  0:58 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Rusty Russell

Zachary Amsden wrote:
>>
>> Things like what?  Do you mean the %[foo] asm parameter syntax?  I think
>> those versions are no longer supported - Arjan posted a patch a few days
>> ago to convert a pile of asms to this form.  Or do you mean something
>> else?
>>   
>
> I meant having an output in the clobber list,

There's no output in the clobber list, just "memory" and "cc".  The
paravirt_clobber() is the stuff which gets put in the .parainstructions
section which tells the patcher what registers are expected to be
modified at that callsite (including both temp and output registers).


    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
  2007-03-02  0:58         ` Jeremy Fitzhardinge
@ 2007-03-02  1:18             ` Zachary Amsden
  0 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-02  1:18 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Rusty Russell

Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
>   
>>> Things like what?  Do you mean the %[foo] asm parameter syntax?  I think
>>> those versions are no longer supported - Arjan posted a patch a few days
>>> ago to convert a pile of asms to this form.  Or do you mean something
>>> else?
>>>   
>>>       
>> I meant having an output in the clobber list,
>>     
>
> There's no output in the clobber list, just "memory" and "cc".  The
> paravirt_clobber() is the stuff which gets put in the .parainstructions
> section which tells the patcher what registers are expected to be
> modified at that callsite (including both temp and output registers).
>   

Ah, ok.  I didn't think gcc had grown the appropriate antlers yet to 
allow outputs in the clobber list.  But the paravirt-clobbers, yes that 
makes sense.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
@ 2007-03-02  1:18             ` Zachary Amsden
  0 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-02  1:18 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, virtualization, Rusty Russell, linux-kernel,
	Chris Wright, Andi Kleen, Andrew Morton

Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
>   
>>> Things like what?  Do you mean the %[foo] asm parameter syntax?  I think
>>> those versions are no longer supported - Arjan posted a patch a few days
>>> ago to convert a pile of asms to this form.  Or do you mean something
>>> else?
>>>   
>>>       
>> I meant having an output in the clobber list,
>>     
>
> There's no output in the clobber list, just "memory" and "cc".  The
> paravirt_clobber() is the stuff which gets put in the .parainstructions
> section which tells the patcher what registers are expected to be
> modified at that callsite (including both temp and output registers).
>   

Ah, ok.  I didn't think gcc had grown the appropriate antlers yet to 
allow outputs in the clobber list.  But the paravirt-clobbers, yes that 
makes sense.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 26/26] Xen-paravirt_ops: Add the Xen virtual network device driver.
  2007-03-02  0:42     ` Stephen Hemminger
  (?)
  (?)
@ 2007-03-02  1:21     ` Christoph Hellwig
  2007-03-02  1:26       ` Chris Wright
  -1 siblings, 1 reply; 330+ messages in thread
From: Christoph Hellwig @ 2007-03-02  1:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, netdev, Jeff Garzik

On Thu, Mar 01, 2007 at 04:42:14PM -0800, Stephen Hemminger wrote:
> > +
> > +/** Send a packet on a net device to encourage switches to learn the
> > + * MAC. We send a fake ARP request.
> > + *
> > + * @param dev device
> > + * @return 0 on success, error code otherwise
> > + */
> Why the sudden urge to use docbook format on one internal function.

And it's not even proper kerneldoc but looks more like javadoc.
If you want to write kerneldoc comments please also verify them by
running the tools over it.

> This should probably be done in generic (non driver code).
> It creates lots of dependencies here.

Actually the right way to do it is in userspace, as all clustering
solutions do.  That's whay everyone told the Xen folks by the just
refuse to rip this junk out.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 26/26] Xen-paravirt_ops: Add the Xen virtual network device driver.
  2007-03-02  1:21     ` [patch 26/26] Xen-paravirt_ops: Add the Xen virtual network device driver Christoph Hellwig
@ 2007-03-02  1:26       ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-02  1:26 UTC (permalink / raw)
  To: Christoph Hellwig, Stephen Hemminger, Jeremy Fitzhardinge,
	Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, netdev, Jeff Garzik

* Christoph Hellwig (hch@infradead.org) wrote:
> Actually the right way to do it is in userspace, as all clustering
> solutions do.  That's whay everyone told the Xen folks by the just
> refuse to rip this junk out.

I'm ripping it out.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [RFC] Arp announce (for Xen)
  2007-03-02  0:56     ` Jeremy Fitzhardinge
@ 2007-03-02  1:30         ` Stephen Hemminger
  0 siblings, 0 replies; 330+ messages in thread
From: Stephen Hemminger @ 2007-03-02  1:30 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, netdev, Jeff Garzik

What about implementing the unused arp_announce flag on the inetdevice?
Something like the following.  Totally untested...

Looks like it either was there (and got removed) or was planned but
never implemented.

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index e10794d..cefc339 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1089,6 +1089,16 @@ static int inetdev_event(struct notifier
 			}
 		}
 		ip_mc_up(in_dev);
+		/* fallthru */
+
+	case NETDEV_CHANGEADDR:
+		/* Send gratuitous ARP in case of address change or new device */
+		if (IN_DEV_ARP_ANNOUNCE(in_dev))
+			arp_send(ARPOP_REQUEST, ETH_P_ARP,
+				 in_dev->ifa_list->ifa_address, dev,
+				 in_dev->ifa_list->ifa_address, NULL, 
+				 dev->dev_addr, NULL);
+			
 		break;
 	case NETDEV_DOWN:
 		ip_mc_down(in_dev);


^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [RFC] Arp announce (for Xen)
@ 2007-03-02  1:30         ` Stephen Hemminger
  0 siblings, 0 replies; 330+ messages in thread
From: Stephen Hemminger @ 2007-03-02  1:30 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, Jeff Garzik, Ian Pratt, netdev, linux-kernel,
	Chris Wright, Andi Kleen, Zachary, virtualization, Andrew Morton

What about implementing the unused arp_announce flag on the inetdevice?
Something like the following.  Totally untested...

Looks like it either was there (and got removed) or was planned but
never implemented.

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index e10794d..cefc339 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1089,6 +1089,16 @@ static int inetdev_event(struct notifier
 			}
 		}
 		ip_mc_up(in_dev);
+		/* fallthru */
+
+	case NETDEV_CHANGEADDR:
+		/* Send gratuitous ARP in case of address change or new device */
+		if (IN_DEV_ARP_ANNOUNCE(in_dev))
+			arp_send(ARPOP_REQUEST, ETH_P_ARP,
+				 in_dev->ifa_list->ifa_address, dev,
+				 in_dev->ifa_list->ifa_address, NULL, 
+				 dev->dev_addr, NULL);
+			
 		break;
 	case NETDEV_DOWN:
 		ip_mc_down(in_dev);

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* Re: [RFC] Arp announce (for Xen)
  2007-03-02  1:30         ` Stephen Hemminger
  (?)
@ 2007-03-02  8:09         ` Pekka Savola
  2007-03-02 18:29           ` Ben Greear
  -1 siblings, 1 reply; 330+ messages in thread
From: Pekka Savola @ 2007-03-02  8:09 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

On Thu, 1 Mar 2007, Stephen Hemminger wrote:
> What about implementing the unused arp_announce flag on the inetdevice?
> Something like the following.  Totally untested...
>
> Looks like it either was there (and got removed) or was planned but
> never implemented.

If something like this goes in, it wouldn't hurt to do similar with 
IPv6 (RFC2461 section 7.2.6).

There are very popular hardware-based routers which refresh their NDP 
caches only every 24 hours or 20 minutes (depending on the software 
version).  Sending unsolicited NAs would eliminate traffic 
blackholing.

> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
> index e10794d..cefc339 100644
> --- a/net/ipv4/devinet.c
> +++ b/net/ipv4/devinet.c
> @@ -1089,6 +1089,16 @@ static int inetdev_event(struct notifier
> 			}
> 		}
> 		ip_mc_up(in_dev);
> +		/* fallthru */
> +
> +	case NETDEV_CHANGEADDR:
> +		/* Send gratuitous ARP in case of address change or new device */
> +		if (IN_DEV_ARP_ANNOUNCE(in_dev))
> +			arp_send(ARPOP_REQUEST, ETH_P_ARP,
> +				 in_dev->ifa_list->ifa_address, dev,
> +				 in_dev->ifa_list->ifa_address, NULL,
> +				 dev->dev_addr, NULL);
> +
> 		break;
> 	case NETDEV_DOWN:
> 		ip_mc_down(in_dev);
>
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] Arp announce (for Xen)
  2007-03-02  1:30         ` Stephen Hemminger
@ 2007-03-02  8:46           ` Keir Fraser
  -1 siblings, 0 replies; 330+ messages in thread
From: Keir Fraser @ 2007-03-02  8:46 UTC (permalink / raw)
  To: Stephen Hemminger, Jeremy Fitzhardinge
  Cc: xen-devel, Jeff Garzik, Ian Pratt, netdev, linux-kernel,
	Chris Wright, Andi Kleen, Zachary, virtualization, Andrew Morton

On 2/3/07 01:30, "Stephen Hemminger" <shemminger@linux-foundation.org>
wrote:

> What about implementing the unused arp_announce flag on the inetdevice?
> Something like the following.  Totally untested...
> 
> Looks like it either was there (and got removed) or was planned but
> never implemented.

This would be acceptable. It only needs a simple script to poke the correct
value, and that could be easily integrated into vendor scripts. The general
idea of being able to have a callback to userspace is okay but we'd like
something lighter weight for live migration, which is typically done only
within the scope of a single layer-2 network and hence all that is really
required is the gratuitous ARP to kick the switch hardware.

 -- Keir

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] Arp announce (for Xen)
@ 2007-03-02  8:46           ` Keir Fraser
  0 siblings, 0 replies; 330+ messages in thread
From: Keir Fraser @ 2007-03-02  8:46 UTC (permalink / raw)
  To: Stephen Hemminger, Jeremy Fitzhardinge
  Cc: xen-devel, Jeff Garzik, Andi Kleen, netdev, linux-kernel,
	Chris Wright, Ian Pratt, Zachary, virtualization, Andrew Morton

On 2/3/07 01:30, "Stephen Hemminger" <shemminger@linux-foundation.org>
wrote:

> What about implementing the unused arp_announce flag on the inetdevice?
> Something like the following.  Totally untested...
> 
> Looks like it either was there (and got removed) or was planned but
> never implemented.

This would be acceptable. It only needs a simple script to poke the correct
value, and that could be easily integrated into vendor scripts. The general
idea of being able to have a callback to userspace is okay but we'd like
something lighter weight for live migration, which is typically done only
within the scope of a single layer-2 network and hence all that is really
required is the gratuitous ARP to kick the switch hardware.

 -- Keir

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] Arp announce (for Xen)
  2007-03-02  1:30         ` Stephen Hemminger
@ 2007-03-02 12:54           ` Andi Kleen
  -1 siblings, 0 replies; 330+ messages in thread
From: Andi Kleen @ 2007-03-02 12:54 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jeremy Fitzhardinge, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, netdev, Jeff Garzik

On Thu, Mar 01, 2007 at 05:30:30PM -0800, Stephen Hemminger wrote:
> What about implementing the unused arp_announce flag on the inetdevice?
> Something like the following.  Totally untested...
> 
> Looks like it either was there (and got removed) or was planned but
> never implemented.

Seems like the right solution to me (if it works)

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] Arp announce (for Xen)
@ 2007-03-02 12:54           ` Andi Kleen
  0 siblings, 0 replies; 330+ messages in thread
From: Andi Kleen @ 2007-03-02 12:54 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, Jeff Garzik,
	Ian Pratt, netdev, Rusty Russell, linux-kernel, Chris Wright,
	virtualization, Andrew Morton, Christian Limpach

On Thu, Mar 01, 2007 at 05:30:30PM -0800, Stephen Hemminger wrote:
> What about implementing the unused arp_announce flag on the inetdevice?
> Something like the following.  Totally untested...
> 
> Looks like it either was there (and got removed) or was planned but
> never implemented.

Seems like the right solution to me (if it works)

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] Arp announce (for Xen)
  2007-03-02 12:54           ` Andi Kleen
  (?)
@ 2007-03-02 14:08           ` jamal
  -1 siblings, 0 replies; 330+ messages in thread
From: jamal @ 2007-03-02 14:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stephen Hemminger, Jeremy Fitzhardinge, Andrew Morton,
	linux-kernel, virtualization, xen-devel, Chris Wright,
	Zachary Amsden, Rusty Russell, Ian Pratt, Christian Limpach,
	netdev, Jeff Garzik

On Fri, 2007-02-03 at 13:54 +0100, Andi Kleen wrote:

> Seems like the right solution to me (if it works)

I like it as well given the simple change. 
These are the kind of optimizations that justify offloading things to
the kernel (instead of user space) i.e very little lines of code, covers
a large number of user-desires, and can be turned off if you want to do
something complex. _Almost_ elegant Stephen;->

cheers,
jamal

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] Arp announce (for Xen)
  2007-03-02  1:30         ` Stephen Hemminger
@ 2007-03-02 18:08           ` Chris Wright
  -1 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-02 18:08 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, netdev, Jeff Garzik

* Stephen Hemminger (shemminger@linux-foundation.org) wrote:
> +	case NETDEV_CHANGEADDR:
> +		/* Send gratuitous ARP in case of address change or new device */
> +		if (IN_DEV_ARP_ANNOUNCE(in_dev))

Conceptually right on, but it looks like improper hijacking
of arp_announce sysctl.  Could introduce another

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] Arp announce (for Xen)
@ 2007-03-02 18:08           ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-02 18:08 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, Jeff Garzik,
	Ian Pratt, Andi Kleen, netdev, Rusty Russell, linux-kernel,
	Chris Wright, virtualization, Andrew Morton, Christian Limpach

* Stephen Hemminger (shemminger@linux-foundation.org) wrote:
> +	case NETDEV_CHANGEADDR:
> +		/* Send gratuitous ARP in case of address change or new device */
> +		if (IN_DEV_ARP_ANNOUNCE(in_dev))

Conceptually right on, but it looks like improper hijacking
of arp_announce sysctl.  Could introduce another

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] Arp announce (for Xen)
  2007-03-02  8:09         ` Pekka Savola
@ 2007-03-02 18:29           ` Ben Greear
  2007-03-02 19:59             ` Stephen Hemminger
  0 siblings, 1 reply; 330+ messages in thread
From: Ben Greear @ 2007-03-02 18:29 UTC (permalink / raw)
  To: Pekka Savola; +Cc: Stephen Hemminger, netdev

Pekka Savola wrote:
> On Thu, 1 Mar 2007, Stephen Hemminger wrote:
>> What about implementing the unused arp_announce flag on the inetdevice?
>> Something like the following.  Totally untested...
>>
>> Looks like it either was there (and got removed) or was planned but
>> never implemented.
IN_DEV_ARP_ANNOUNCE is in 2.6.18, at least..used in arp_solicit in arp.c

I really hope this didn't get removed because I find it very useful!

But, you could certainly add another sysctl...

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] Arp announce (for Xen)
  2007-03-02 18:29           ` Ben Greear
@ 2007-03-02 19:59             ` Stephen Hemminger
  0 siblings, 0 replies; 330+ messages in thread
From: Stephen Hemminger @ 2007-03-02 19:59 UTC (permalink / raw)
  To: Ben Greear; +Cc: Pekka Savola, netdev

On Fri, 02 Mar 2007 10:29:53 -0800
Ben Greear <greearb@candelatech.com> wrote:

> Pekka Savola wrote:
> > On Thu, 1 Mar 2007, Stephen Hemminger wrote:
> >> What about implementing the unused arp_announce flag on the inetdevice?
> >> Something like the following.  Totally untested...
> >>
> >> Looks like it either was there (and got removed) or was planned but
> >> never implemented.
> IN_DEV_ARP_ANNOUNCE is in 2.6.18, at least..used in arp_solicit in arp.c
> 
> I really hope this didn't get removed because I find it very useful!
> 
> But, you could certainly add another sysctl...
> 
> Thanks,
> Ben
> 

yeah, something new like arp_notify? or arp_gratiutous

There are other drivers that do their own arp, they need to be fixed.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] Arp announce (for Xen)
  2007-03-02  1:30         ` Stephen Hemminger
                           ` (4 preceding siblings ...)
  (?)
@ 2007-03-06  4:35         ` David Miller
  2007-03-06 18:51             ` Stephen Hemminger
  -1 siblings, 1 reply; 330+ messages in thread
From: David Miller @ 2007-03-06  4:35 UTC (permalink / raw)
  To: shemminger
  Cc: jeremy, ak, akpm, linux-kernel, virtualization, xen-devel,
	chrisw, zach, rusty, ian.pratt, Christian.Limpach, netdev, jeff

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Thu, 1 Mar 2007 17:30:30 -0800

> What about implementing the unused arp_announce flag on the inetdevice?
> Something like the following.  Totally untested...
> 
> Looks like it either was there (and got removed) or was planned but
> never implemented.

This idea is fine.  But:

> +	case NETDEV_CHANGEADDR:
> +		/* Send gratuitous ARP in case of address change or new device */
> +		if (IN_DEV_ARP_ANNOUNCE(in_dev))
> +			arp_send(ARPOP_REQUEST, ETH_P_ARP,
> +				 in_dev->ifa_list->ifa_address, dev,
> +				 in_dev->ifa_list->ifa_address, NULL, 
> +				 dev->dev_addr, NULL);

We'll need to make sure the appropriate 'arp_anounce' address
selection is employed here.

One idea is to change arp_solicit() such that it can be invoked in
this context, or provide a new helper function which will do the
source address selection rules of 'arp_announce' and then invoke
arp_send() as appropriate for us.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [RFC] ARP notify option
  2007-03-06  4:35         ` David Miller
@ 2007-03-06 18:51             ` Stephen Hemminger
  0 siblings, 0 replies; 330+ messages in thread
From: Stephen Hemminger @ 2007-03-06 18:51 UTC (permalink / raw)
  To: David Miller
  Cc: jeremy, ak, akpm, linux-kernel, virtualization, xen-devel,
	chrisw, zach, rusty, ian.pratt, Christian.Limpach, netdev, jeff

This adds another inet device option to enable gratuitous ARP
when device is brought up or address change. This is handy for
clusters or virtualization.

Tested on a normal device (not Xen).

Signed-off-by: Stephen Hemminger <shemminge@linux-foundation.org>

---
 Documentation/networking/ip-sysctl.txt |    6 ++++++
 include/linux/inetdevice.h             |    2 ++
 include/linux/sysctl.h                 |    1 +
 net/ipv4/devinet.c                     |   16 ++++++++++++++++
 4 files changed, 25 insertions(+)

--- net-2.6.22.orig/Documentation/networking/ip-sysctl.txt	2007-03-05 14:35:31.000000000 -0800
+++ net-2.6.22/Documentation/networking/ip-sysctl.txt	2007-03-05 16:46:47.000000000 -0800
@@ -732,6 +732,12 @@
 	The max value from conf/{all,interface}/arp_ignore is used
 	when ARP request is received on the {interface}
 
+arp_notify - BOOLEAN
+	Define mode for notification of address and device changes.
+	0 - (default): do nothing
+	1 - Generate gratuitous arp replies when device is brought up
+	    or hardware address changes.
+
 arp_accept - BOOLEAN
 	Define behavior when gratuitous arp replies are received:
 	0 - drop gratuitous arp frames
--- net-2.6.22.orig/include/linux/inetdevice.h	2007-03-05 14:35:34.000000000 -0800
+++ net-2.6.22/include/linux/inetdevice.h	2007-03-05 16:46:47.000000000 -0800
@@ -26,6 +26,7 @@
 	int	arp_announce;
 	int	arp_ignore;
 	int	arp_accept;
+	int	arp_notify;
 	int	medium_id;
 	int	no_xfrm;
 	int	no_policy;
@@ -84,6 +85,7 @@
 #define IN_DEV_ARPFILTER(in_dev)	(ipv4_devconf.arp_filter || (in_dev)->cnf.arp_filter)
 #define IN_DEV_ARP_ANNOUNCE(in_dev)	(max(ipv4_devconf.arp_announce, (in_dev)->cnf.arp_announce))
 #define IN_DEV_ARP_IGNORE(in_dev)	(max(ipv4_devconf.arp_ignore, (in_dev)->cnf.arp_ignore))
+#define IN_DEV_ARP_NOTIFY(in_dev)	(ipv4_devconf.arp_notify || (in_dev)->cnf.arp_notify)
 
 struct in_ifaddr
 {
--- net-2.6.22.orig/include/linux/sysctl.h	2007-03-05 14:35:34.000000000 -0800
+++ net-2.6.22/include/linux/sysctl.h	2007-03-05 16:46:47.000000000 -0800
@@ -495,6 +495,7 @@
 	NET_IPV4_CONF_ARP_IGNORE=19,
 	NET_IPV4_CONF_PROMOTE_SECONDARIES=20,
 	NET_IPV4_CONF_ARP_ACCEPT=21,
+	NET_IPV4_CONF_ARP_NOTIFY=22,
 	__NET_IPV4_CONF_MAX
 };
 
--- net-2.6.22.orig/net/ipv4/devinet.c	2007-03-05 14:35:34.000000000 -0800
+++ net-2.6.22/net/ipv4/devinet.c	2007-03-05 16:46:47.000000000 -0800
@@ -1089,6 +1089,14 @@
 			}
 		}
 		ip_mc_up(in_dev);
+		/* fall through */
+	case NETDEV_CHANGEADDR:
+		if (IN_DEV_ARP_NOTIFY(in_dev))
+			arp_send(ARPOP_REQUEST, ETH_P_ARP,
+				 in_dev->ifa_list->ifa_address,
+				 dev,
+				 in_dev->ifa_list->ifa_address,
+				 NULL, dev->dev_addr, NULL);
 		break;
 	case NETDEV_DOWN:
 		ip_mc_down(in_dev);
@@ -1495,6 +1503,14 @@
 			.proc_handler	= &proc_dointvec,
 		},
 		{
+			.ctl_name	= NET_IPV4_CONF_ARP_NOTIFY,
+			.procname	= "arp_notify",
+			.data		= &ipv4_devconf.arp_notify,
+			.maxlen		= sizeof(int),
+			.mode		= 0644,
+			.proc_handler	= &proc_dointvec,
+		},
+		{
 			.ctl_name	= NET_IPV4_CONF_NOXFRM,
 			.procname	= "disable_xfrm",
 			.data		= &ipv4_devconf.no_xfrm,

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [RFC] ARP notify option
@ 2007-03-06 18:51             ` Stephen Hemminger
  0 siblings, 0 replies; 330+ messages in thread
From: Stephen Hemminger @ 2007-03-06 18:51 UTC (permalink / raw)
  To: David Miller
  Cc: xen-devel, jeff, ian.pratt, netdev, linux-kernel, chrisw,
	virtualization, akpm, ak

This adds another inet device option to enable gratuitous ARP
when device is brought up or address change. This is handy for
clusters or virtualization.

Tested on a normal device (not Xen).

Signed-off-by: Stephen Hemminger <shemminge@linux-foundation.org>

---
 Documentation/networking/ip-sysctl.txt |    6 ++++++
 include/linux/inetdevice.h             |    2 ++
 include/linux/sysctl.h                 |    1 +
 net/ipv4/devinet.c                     |   16 ++++++++++++++++
 4 files changed, 25 insertions(+)

--- net-2.6.22.orig/Documentation/networking/ip-sysctl.txt	2007-03-05 14:35:31.000000000 -0800
+++ net-2.6.22/Documentation/networking/ip-sysctl.txt	2007-03-05 16:46:47.000000000 -0800
@@ -732,6 +732,12 @@
 	The max value from conf/{all,interface}/arp_ignore is used
 	when ARP request is received on the {interface}
 
+arp_notify - BOOLEAN
+	Define mode for notification of address and device changes.
+	0 - (default): do nothing
+	1 - Generate gratuitous arp replies when device is brought up
+	    or hardware address changes.
+
 arp_accept - BOOLEAN
 	Define behavior when gratuitous arp replies are received:
 	0 - drop gratuitous arp frames
--- net-2.6.22.orig/include/linux/inetdevice.h	2007-03-05 14:35:34.000000000 -0800
+++ net-2.6.22/include/linux/inetdevice.h	2007-03-05 16:46:47.000000000 -0800
@@ -26,6 +26,7 @@
 	int	arp_announce;
 	int	arp_ignore;
 	int	arp_accept;
+	int	arp_notify;
 	int	medium_id;
 	int	no_xfrm;
 	int	no_policy;
@@ -84,6 +85,7 @@
 #define IN_DEV_ARPFILTER(in_dev)	(ipv4_devconf.arp_filter || (in_dev)->cnf.arp_filter)
 #define IN_DEV_ARP_ANNOUNCE(in_dev)	(max(ipv4_devconf.arp_announce, (in_dev)->cnf.arp_announce))
 #define IN_DEV_ARP_IGNORE(in_dev)	(max(ipv4_devconf.arp_ignore, (in_dev)->cnf.arp_ignore))
+#define IN_DEV_ARP_NOTIFY(in_dev)	(ipv4_devconf.arp_notify || (in_dev)->cnf.arp_notify)
 
 struct in_ifaddr
 {
--- net-2.6.22.orig/include/linux/sysctl.h	2007-03-05 14:35:34.000000000 -0800
+++ net-2.6.22/include/linux/sysctl.h	2007-03-05 16:46:47.000000000 -0800
@@ -495,6 +495,7 @@
 	NET_IPV4_CONF_ARP_IGNORE=19,
 	NET_IPV4_CONF_PROMOTE_SECONDARIES=20,
 	NET_IPV4_CONF_ARP_ACCEPT=21,
+	NET_IPV4_CONF_ARP_NOTIFY=22,
 	__NET_IPV4_CONF_MAX
 };
 
--- net-2.6.22.orig/net/ipv4/devinet.c	2007-03-05 14:35:34.000000000 -0800
+++ net-2.6.22/net/ipv4/devinet.c	2007-03-05 16:46:47.000000000 -0800
@@ -1089,6 +1089,14 @@
 			}
 		}
 		ip_mc_up(in_dev);
+		/* fall through */
+	case NETDEV_CHANGEADDR:
+		if (IN_DEV_ARP_NOTIFY(in_dev))
+			arp_send(ARPOP_REQUEST, ETH_P_ARP,
+				 in_dev->ifa_list->ifa_address,
+				 dev,
+				 in_dev->ifa_list->ifa_address,
+				 NULL, dev->dev_addr, NULL);
 		break;
 	case NETDEV_DOWN:
 		ip_mc_down(in_dev);
@@ -1495,6 +1503,14 @@
 			.proc_handler	= &proc_dointvec,
 		},
 		{
+			.ctl_name	= NET_IPV4_CONF_ARP_NOTIFY,
+			.procname	= "arp_notify",
+			.data		= &ipv4_devconf.arp_notify,
+			.maxlen		= sizeof(int),
+			.mode		= 0644,
+			.proc_handler	= &proc_dointvec,
+		},
+		{
 			.ctl_name	= NET_IPV4_CONF_NOXFRM,
 			.procname	= "disable_xfrm",
 			.data		= &ipv4_devconf.no_xfrm,

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] ARP notify option
  2007-03-06 18:51             ` Stephen Hemminger
  (?)
@ 2007-03-06 19:04             ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-06 19:04 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, ak, akpm, linux-kernel, virtualization, xen-devel,
	chrisw, zach, rusty, ian.pratt, Christian.Limpach, netdev, jeff

Stephen Hemminger wrote:
> This adds another inet device option to enable gratuitous ARP
> when device is brought up or address change. This is handy for
> clusters or virtualization.
>   

Thanks Stephen.  Haven't tested this yet, but it definitely cleans up a
warty corner of netfront.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] ARP notify option
  2007-03-06 18:51             ` Stephen Hemminger
@ 2007-03-06 19:07               ` Chris Wright
  -1 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-06 19:07 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, jeremy, ak, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, rusty, ian.pratt, Christian.Limpach,
	netdev, jeff

* Stephen Hemminger (shemminger@linux-foundation.org) wrote:
> This adds another inet device option to enable gratuitous ARP
> when device is brought up or address change. This is handy for
> clusters or virtualization.

This looks good.  I'll test with Xen.  What about the source
addr selection?

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] ARP notify option
@ 2007-03-06 19:07               ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-06 19:07 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: zach, jeremy, xen-devel, jeff, ian.pratt, ak, netdev, rusty,
	linux-kernel, chrisw, virtualization, akpm, David Miller,
	Christian.Limpach

* Stephen Hemminger (shemminger@linux-foundation.org) wrote:
> This adds another inet device option to enable gratuitous ARP
> when device is brought up or address change. This is handy for
> clusters or virtualization.

This looks good.  I'll test with Xen.  What about the source
addr selection?

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] ARP notify option
  2007-03-06 18:51             ` Stephen Hemminger
@ 2007-03-06 21:18               ` Chris Friesen
  -1 siblings, 0 replies; 330+ messages in thread
From: Chris Friesen @ 2007-03-06 21:18 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, jeremy, ak, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, rusty, ian.pratt, Christian.Limpach,
	netdev, jeff

Stephen Hemminger wrote:

> +arp_notify - BOOLEAN
> +	Define mode for notification of address and device changes.
> +	0 - (default): do nothing
> +	1 - Generate gratuitous arp replies when device is brought up
> +	    or hardware address changes.

Did you consider using gratuitous arp requests instead?  I remember 
reading about some hardware that updated its arp cache on gratuitous 
requests but not gratuitous replies.

Chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] ARP notify option
@ 2007-03-06 21:18               ` Chris Friesen
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Friesen @ 2007-03-06 21:18 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: xen-devel, jeff, ian.pratt, netdev, linux-kernel, chrisw,
	virtualization, akpm, David Miller, ak

Stephen Hemminger wrote:

> +arp_notify - BOOLEAN
> +	Define mode for notification of address and device changes.
> +	0 - (default): do nothing
> +	1 - Generate gratuitous arp replies when device is brought up
> +	    or hardware address changes.

Did you consider using gratuitous arp requests instead?  I remember 
reading about some hardware that updated its arp cache on gratuitous 
requests but not gratuitous replies.

Chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] ARP notify option
  2007-03-06 21:18               ` Chris Friesen
@ 2007-03-06 22:52                 ` Stephen Hemminger
  -1 siblings, 0 replies; 330+ messages in thread
From: Stephen Hemminger @ 2007-03-06 22:52 UTC (permalink / raw)
  To: Chris Friesen
  Cc: David Miller, jeremy, ak, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, rusty, ian.pratt, Christian.Limpach,
	netdev, jeff

On Tue, 06 Mar 2007 15:18:07 -0600
"Chris Friesen" <cfriesen@nortel.com> wrote:

> Stephen Hemminger wrote:
> 
> > +arp_notify - BOOLEAN
> > +	Define mode for notification of address and device changes.
> > +	0 - (default): do nothing
> > +	1 - Generate gratuitous arp replies when device is brought up
> > +	    or hardware address changes.
> 
> Did you consider using gratuitous arp requests instead?  I remember 
> reading about some hardware that updated its arp cache on gratuitous 
> requests but not gratuitous replies.
> 
> Chris

I copied the ARP generation from other places that were doing
gratuitous ARP already:  Xen and irlan.
Our local switch used REPLY's to do the same thing.

One could imagine making it a ternary value and having 2 generate
REQUEST's.


-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] ARP notify option
@ 2007-03-06 22:52                 ` Stephen Hemminger
  0 siblings, 0 replies; 330+ messages in thread
From: Stephen Hemminger @ 2007-03-06 22:52 UTC (permalink / raw)
  To: Chris Friesen
  Cc: xen-devel, jeff, ian.pratt, netdev, linux-kernel, chrisw,
	virtualization, akpm, David Miller, ak

On Tue, 06 Mar 2007 15:18:07 -0600
"Chris Friesen" <cfriesen@nortel.com> wrote:

> Stephen Hemminger wrote:
> 
> > +arp_notify - BOOLEAN
> > +	Define mode for notification of address and device changes.
> > +	0 - (default): do nothing
> > +	1 - Generate gratuitous arp replies when device is brought up
> > +	    or hardware address changes.
> 
> Did you consider using gratuitous arp requests instead?  I remember 
> reading about some hardware that updated its arp cache on gratuitous 
> requests but not gratuitous replies.
> 
> Chris

I copied the ARP generation from other places that were doing
gratuitous ARP already:  Xen and irlan.
Our local switch used REPLY's to do the same thing.

One could imagine making it a ternary value and having 2 generate
REQUEST's.


-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] ARP notify option
  2007-03-06 21:18               ` Chris Friesen
@ 2007-03-07  6:42                 ` Pekka Savola
  -1 siblings, 0 replies; 330+ messages in thread
From: Pekka Savola @ 2007-03-07  6:42 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Stephen Hemminger, David Miller, jeremy, ak, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, zach, rusty, ian.pratt,
	Christian.Limpach, netdev, jeff

On Tue, 6 Mar 2007, Chris Friesen wrote:
> Stephen Hemminger wrote:
>>  +arp_notify - BOOLEAN
>>  +	Define mode for notification of address and device changes.
>>  +	0 - (default): do nothing
>>  +	1 - Generate gratuitous arp replies when device is brought up
>>  +	    or hardware address changes.
>
> Did you consider using gratuitous arp requests instead?  I remember reading 
> about some hardware that updated its arp cache on gratuitous requests but not 
> gratuitous replies.

You might be interested in taking a look at:

http://tools.ietf.org/id/draft-cheshire-ipv4-acd

There has been some follow-up discussion on this in the thread 
starting at:

http://www1.ietf.org/mail-archive/web/int-area/current/msg00611.html

In particular, you may be interested in this comment about ARP 
request and ARP reply for gratuitous ARP:

http://www1.ietf.org/mail-archive/web/int-area/current/msg00669.html

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] ARP notify option
@ 2007-03-07  6:42                 ` Pekka Savola
  0 siblings, 0 replies; 330+ messages in thread
From: Pekka Savola @ 2007-03-07  6:42 UTC (permalink / raw)
  To: Chris Friesen
  Cc: xen-devel, jeff, ian.pratt, virtualization, netdev, linux-kernel,
	chrisw, ak, akpm, Stephen Hemminger, David Miller

On Tue, 6 Mar 2007, Chris Friesen wrote:
> Stephen Hemminger wrote:
>>  +arp_notify - BOOLEAN
>>  +	Define mode for notification of address and device changes.
>>  +	0 - (default): do nothing
>>  +	1 - Generate gratuitous arp replies when device is brought up
>>  +	    or hardware address changes.
>
> Did you consider using gratuitous arp requests instead?  I remember reading 
> about some hardware that updated its arp cache on gratuitous requests but not 
> gratuitous replies.

You might be interested in taking a look at:

http://tools.ietf.org/id/draft-cheshire-ipv4-acd

There has been some follow-up discussion on this in the thread 
starting at:

http://www1.ietf.org/mail-archive/web/int-area/current/msg00611.html

In particular, you may be interested in this comment about ARP 
request and ARP reply for gratuitous ARP:

http://www1.ietf.org/mail-archive/web/int-area/current/msg00669.html

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] ARP notify option
  2007-03-07  6:42                 ` Pekka Savola
@ 2007-03-07 17:00                   ` Stephen Hemminger
  -1 siblings, 0 replies; 330+ messages in thread
From: Stephen Hemminger @ 2007-03-07 17:00 UTC (permalink / raw)
  To: Pekka Savola
  Cc: Chris Friesen, David Miller, jeremy, ak, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, zach, rusty, ian.pratt,
	Christian.Limpach, netdev, jeff

On Wed, 7 Mar 2007 08:42:39 +0200 (EET)
Pekka Savola <pekkas@netcore.fi> wrote:

> On Tue, 6 Mar 2007, Chris Friesen wrote:
> > Stephen Hemminger wrote:
> >>  +arp_notify - BOOLEAN
> >>  +	Define mode for notification of address and device changes.
> >>  +	0 - (default): do nothing
> >>  +	1 - Generate gratuitous arp replies when device is brought up
> >>  +	    or hardware address changes.
> >
> > Did you consider using gratuitous arp requests instead?  I remember reading 
> > about some hardware that updated its arp cache on gratuitous requests but not 
> > gratuitous replies.
> 
> You might be interested in taking a look at:
> 
> http://tools.ietf.org/id/draft-cheshire-ipv4-acd
> 
> There has been some follow-up discussion on this in the thread 
> starting at:
> 
> http://www1.ietf.org/mail-archive/web/int-area/current/msg00611.html
> 
> In particular, you may be interested in this comment about ARP 
> request and ARP reply for gratuitous ARP:
> 
> http://www1.ietf.org/mail-archive/web/int-area/current/msg00669.html

Looks like REQUESTS make more sense. (See below). I will
rework this patch.

5. Why are ARP Announcements performed using ARP Request packets
   and not ARP Reply packets?

   During IETF deliberation of IPv4 Address Conflict Detection from 2000
   to 2005, a question that kept being asked repeatedly was, "Shouldn't
   ARP Announcements be performed using gratuitous ARP Reply packets?"

   On the face of it, this seems reasonable.  A conventional ARP Reply
   is an answer to a question.  If in fact no question had been asked,
   then it would be reasonable to describe such a reply as gratuitous. 
   This description would seem to apply perfectly to an ARP
   Announcement: an answer to an implied question that in fact no one
   asked.

   However reasonable this may seem in principle, there are two reasons
   why in practice ARP Request packets are the better choice.  One is
   historical precedent, and the other is practicality.

Expires 11th January 2006             Cheshire                 [Page 14]
\f
Internet Draft       IPv4 Address Conflict Detection      11th July 2005

   The historical precedent is that, as described above in Section 4,
   Gratuitous ARP is described in Stevens Networking [Ste94] as using
   ARP Request packets.  BSD Unix, Windows, Mac OS 9, Mac OS X, etc.
   all use ARP Request packets as described in Stevens.  At this stage,
   trying to mandate that they all switch to using ARP Reply packets
   would be futile.

   The practical reason is that ARP Request packets are more likely to
   work correctly with more existing ARP implementations, some of which
   may not implement RFC 826 correctly.  The Packet Reception rules in
   RFC 826 state that the opcode is the last thing to check in packet
   processing, so it really shouldn't matter, but there may be
   "creative" implementations that have different packet processing
   depending on the ar$op field, and there are several reasons why these
   are more likely to accept gratuitous ARP Requests than gratuitous ARP
   Replies:

   * An incorrect ARP implementation may expect that ARP Replies are
     only sent via unicast.  RFC 826 does not say this, but an incorrect
     implementation may assume it, and the "principle of least surprise"
     dictates that where there are two or more ways to solve a
     networking problem that are otherwise equally good, the one with
     the fewest unusual properties is the one likely to have the fewest
     interoperability problems with existing implementations.  An ARP
     Announcement needs to broadcast information to all hosts on the
     link.  Since ARP Request packets are always broadcast, and ARP
     Reply packets are not, receiving an ARP Request packet via
     broadcast is less surprising than receiving an ARP Reply packet via
     broadcast.

   * An incorrect ARP implementation may expect that ARP Replies are
     only received in response to ARP Requests that have been issued
     recently by that implementation.  Unexpected unsolicited Replies
     may be ignored.

   * An incorrect ARP implementation may ignore ARP Replies where
     ar$tha doesn't match its hardware address.

   * An incorrect ARP implementation may ignore ARP Replies where
     ar$tpa doesn't match its IP address.

   In summary, there are more ways that an incorrect ARP implementation
   might plausibly reject an ARP Reply (which usually occurs as a result
   of being solicited by the client) than an ARP Request (which is
   already expected to occur unsolicited).

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [RFC] ARP notify option
@ 2007-03-07 17:00                   ` Stephen Hemminger
  0 siblings, 0 replies; 330+ messages in thread
From: Stephen Hemminger @ 2007-03-07 17:00 UTC (permalink / raw)
  To: Pekka Savola
  Cc: xen-devel, jeff, ian.pratt, ak, netdev, linux-kernel, chrisw,
	Chris Friesen, virtualization, akpm, David Miller

On Wed, 7 Mar 2007 08:42:39 +0200 (EET)
Pekka Savola <pekkas@netcore.fi> wrote:

> On Tue, 6 Mar 2007, Chris Friesen wrote:
> > Stephen Hemminger wrote:
> >>  +arp_notify - BOOLEAN
> >>  +	Define mode for notification of address and device changes.
> >>  +	0 - (default): do nothing
> >>  +	1 - Generate gratuitous arp replies when device is brought up
> >>  +	    or hardware address changes.
> >
> > Did you consider using gratuitous arp requests instead?  I remember reading 
> > about some hardware that updated its arp cache on gratuitous requests but not 
> > gratuitous replies.
> 
> You might be interested in taking a look at:
> 
> http://tools.ietf.org/id/draft-cheshire-ipv4-acd
> 
> There has been some follow-up discussion on this in the thread 
> starting at:
> 
> http://www1.ietf.org/mail-archive/web/int-area/current/msg00611.html
> 
> In particular, you may be interested in this comment about ARP 
> request and ARP reply for gratuitous ARP:
> 
> http://www1.ietf.org/mail-archive/web/int-area/current/msg00669.html

Looks like REQUESTS make more sense. (See below). I will
rework this patch.

5. Why are ARP Announcements performed using ARP Request packets
   and not ARP Reply packets?

   During IETF deliberation of IPv4 Address Conflict Detection from 2000
   to 2005, a question that kept being asked repeatedly was, "Shouldn't
   ARP Announcements be performed using gratuitous ARP Reply packets?"

   On the face of it, this seems reasonable.  A conventional ARP Reply
   is an answer to a question.  If in fact no question had been asked,
   then it would be reasonable to describe such a reply as gratuitous. 
   This description would seem to apply perfectly to an ARP
   Announcement: an answer to an implied question that in fact no one
   asked.

   However reasonable this may seem in principle, there are two reasons
   why in practice ARP Request packets are the better choice.  One is
   historical precedent, and the other is practicality.

Expires 11th January 2006             Cheshire                 [Page 14]
\f
Internet Draft       IPv4 Address Conflict Detection      11th July 2005

   The historical precedent is that, as described above in Section 4,
   Gratuitous ARP is described in Stevens Networking [Ste94] as using
   ARP Request packets.  BSD Unix, Windows, Mac OS 9, Mac OS X, etc.
   all use ARP Request packets as described in Stevens.  At this stage,
   trying to mandate that they all switch to using ARP Reply packets
   would be futile.

   The practical reason is that ARP Request packets are more likely to
   work correctly with more existing ARP implementations, some of which
   may not implement RFC 826 correctly.  The Packet Reception rules in
   RFC 826 state that the opcode is the last thing to check in packet
   processing, so it really shouldn't matter, but there may be
   "creative" implementations that have different packet processing
   depending on the ar$op field, and there are several reasons why these
   are more likely to accept gratuitous ARP Requests than gratuitous ARP
   Replies:

   * An incorrect ARP implementation may expect that ARP Replies are
     only sent via unicast.  RFC 826 does not say this, but an incorrect
     implementation may assume it, and the "principle of least surprise"
     dictates that where there are two or more ways to solve a
     networking problem that are otherwise equally good, the one with
     the fewest unusual properties is the one likely to have the fewest
     interoperability problems with existing implementations.  An ARP
     Announcement needs to broadcast information to all hosts on the
     link.  Since ARP Request packets are always broadcast, and ARP
     Reply packets are not, receiving an ARP Request packet via
     broadcast is less surprising than receiving an ARP Reply packet via
     broadcast.

   * An incorrect ARP implementation may expect that ARP Replies are
     only received in response to ARP Requests that have been issued
     recently by that implementation.  Unexpected unsolicited Replies
     may be ignored.

   * An incorrect ARP implementation may ignore ARP Replies where
     ar$tha doesn't match its hardware address.

   * An incorrect ARP implementation may ignore ARP Replies where
     ar$tpa doesn't match its IP address.

   In summary, there are more ways that an incorrect ARP implementation
   might plausibly reject an ARP Reply (which usually occurs as a result
   of being solicited by the client) than an ARP Request (which is
   already expected to occur unsolicited).

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
@ 2007-03-16  8:42   ` Ingo Molnar
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  8:42 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Jens Axboe, Jeff Garzik


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

>  * virtual block device (blockfront)
>  * virtual network device (netfront)

note, these drivers should be submitted through the proper block drivers 
and network drivers review process - not via the x86_64 tree.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
@ 2007-03-16  8:42   ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  8:42 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Jens Axboe,
	Andrew Morton, Jeff Garzik


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

>  * virtual block device (blockfront)
>  * virtual network device (netfront)

note, these drivers should be submitted through the proper block drivers 
and network drivers review process - not via the x86_64 tree.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
  2007-03-01 23:25 ` [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver Jeremy Fitzhardinge
@ 2007-03-16  8:47     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  8:47 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Greg KH


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> This communicates with the machine control software via a registry 
> residing in a controlling virtual machine. This allows dynamic 
> creation, destruction and modification of virtual device 
> configurations (network devices, block devices and CPUS, to name some 
> examples).

should be reviewed by the driver-core folks too i guess?

> +//#include <asm/maddr.h>

> +//#include <xen/xen_proc.h>
> +//#include <xen/evtchn.h>

> +//#include <xen/hvm.h>

> +//		.shutdown = xenbus_dev_shutdown,

hm?

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
@ 2007-03-16  8:47     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  8:47 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, Ian Pratt, linux-kernel, Chris Wright, Andi Kleen,
	virtualization, Andrew Morton, Greg KH


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> This communicates with the machine control software via a registry 
> residing in a controlling virtual machine. This allows dynamic 
> creation, destruction and modification of virtual device 
> configurations (network devices, block devices and CPUS, to name some 
> examples).

should be reviewed by the driver-core folks too i guess?

> +//#include <asm/maddr.h>

> +//#include <xen/xen_proc.h>
> +//#include <xen/evtchn.h>

> +//#include <xen/hvm.h>

> +//		.shutdown = xenbus_dev_shutdown,

hm?

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 23/26] Xen-paravirt_ops: Add Xen grant table support
  2007-03-01 23:25   ` Jeremy Fitzhardinge
@ 2007-03-16  8:51     ` Ingo Molnar
  -1 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  8:51 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add Xen 'grant table' driver which allows granting of access to 
> selected local memory pages by other virtual machines and, 
> symmetrically, the mapping of remote memory pages which other virtual 
> machines have granted access to.
> 
> This driver is a prerequisite for many of the Xen virtual device 
> drivers, which grant the 'device driver domain' restricted and 
> temporary access to only those memory pages that are currently 
> involved in I/O operations.

> +
> +#ifndef __ia64__
> +	{

introduce a proper arch method instead.

> +	unsigned long flags;
> +	int ref;
> +	grant_ref_t head;
> +	spin_lock_irqsave(&gnttab_list_lock, flags);

> +	unsigned long flags;
> +	spin_lock_irqsave(&gnttab_list_lock, flags);

> +	unsigned long flags;
> +	int count = 1;
> +	if (head == GNTTAB_LIST_END)

> +	grant_ref_t g = *private_head;
> +	if (unlikely(g == GNTTAB_LIST_END))

coding style problems.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 23/26] Xen-paravirt_ops: Add Xen grant table support
@ 2007-03-16  8:51     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  8:51 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Chris Wright, Andi Kleen,
	Andrew Morton, Christian Limpach


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add Xen 'grant table' driver which allows granting of access to 
> selected local memory pages by other virtual machines and, 
> symmetrically, the mapping of remote memory pages which other virtual 
> machines have granted access to.
> 
> This driver is a prerequisite for many of the Xen virtual device 
> drivers, which grant the 'device driver domain' restricted and 
> temporary access to only those memory pages that are currently 
> involved in I/O operations.

> +
> +#ifndef __ia64__
> +	{

introduce a proper arch method instead.

> +	unsigned long flags;
> +	int ref;
> +	grant_ref_t head;
> +	spin_lock_irqsave(&gnttab_list_lock, flags);

> +	unsigned long flags;
> +	spin_lock_irqsave(&gnttab_list_lock, flags);

> +	unsigned long flags;
> +	int count = 1;
> +	if (head == GNTTAB_LIST_END)

> +	grant_ref_t g = *private_head;
> +	if (unlikely(g == GNTTAB_LIST_END))

coding style problems.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 22/26] Xen-paravirt_ops: Add early printk support via hvc console
  2007-03-01 23:25 ` [patch 22/26] Xen-paravirt_ops: Add early printk support via hvc console Jeremy Fitzhardinge
@ 2007-03-16  8:52     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  8:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add early printk support via hvc console, enable using
> "earlyprintk=xen" on the kernel command line.
> 
> From: Gerd Hoffmann <kraxel@suse.de>
> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 22/26] Xen-paravirt_ops: Add early printk support via hvc console
@ 2007-03-16  8:52     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  8:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add early printk support via hvc console, enable using
> "earlyprintk=xen" on the kernel command line.
> 
> From: Gerd Hoffmann <kraxel@suse.de>
> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-01 23:25 ` [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console Jeremy Fitzhardinge
@ 2007-03-16  8:54     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  8:54 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Implement a Xen back-end for hvc console.

> +	cons = intf->out_cons;
> +	prod = intf->out_prod;
> +	mb();
> +	BUG_ON((prod - cons) > sizeof(intf->out));
> +
> +	while ((sent < len) && ((prod - cons) < sizeof(intf->out)))
> +		intf->out[MASK_XENCONS_IDX(prod++, intf->out)] = data[sent++];
> +
> +	wmb();
> +	intf->out_prod = prod;

> +	prod = intf->in_prod;
> +	mb();
> +	BUG_ON((prod - cons) > sizeof(intf->in));

such mb()'s are typically a sign of "i have no clear idea what SMP 
serialization rules apply here, but something is needed because 
otherwise it breaks" ?

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
@ 2007-03-16  8:54     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  8:54 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Implement a Xen back-end for hvc console.

> +	cons = intf->out_cons;
> +	prod = intf->out_prod;
> +	mb();
> +	BUG_ON((prod - cons) > sizeof(intf->out));
> +
> +	while ((sent < len) && ((prod - cons) < sizeof(intf->out)))
> +		intf->out[MASK_XENCONS_IDX(prod++, intf->out)] = data[sent++];
> +
> +	wmb();
> +	intf->out_prod = prod;

> +	prod = intf->in_prod;
> +	mb();
> +	BUG_ON((prod - cons) > sizeof(intf->in));

such mb()'s are typically a sign of "i have no clear idea what SMP 
serialization rules apply here, but something is needed because 
otherwise it breaks" ?

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-01 23:25 ` [patch 20/26] Xen-paravirt_ops: Core Xen implementation Jeremy Fitzhardinge
@ 2007-03-16  9:14     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:14 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Adrian Bunk, Thomas Gleixner


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Core Xen Implementation
> 
> This patch is a rollup of all the core pieces of the Xen 
> implementation, including booting, memory management, interrupts, time 
> and so on.

> --- a/arch/i386/kernel/head.S
> +++ b/arch/i386/kernel/head.S
> @@ -535,6 +535,10 @@ unhandled_paravirt:
>  	ud2
>  #endif
>  
> +#ifdef CONFIG_XEN
> +#include "../xen/xen-head.S"
> +#endif

i'd suggest to remove the #ifdef and push it into xen-head.S.

>  /*
>   * BSS section
>   */
> -.section ".bss.page_aligned","w"
> +.section ".bss.page_aligned","wa"

why is this done?

>  ENTRY(swapper_pg_dir)
> +	.align PAGE_SIZE_asm
>  	.fill 1024,4,0

does the native kernel lose memory here?

> @@ -437,9 +437,9 @@ static unsigned long native_store_tr(voi
>  
>  static void native_load_tls(struct thread_struct *t, unsigned int cpu)
>  {
> -#define C(i) get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i]
> -	C(0); C(1); C(2);
> -#undef C
> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 0] = t->tls_array[0];
> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 1] = t->tls_array[1];
> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 2] = t->tls_array[2];
>  }

this is a cleanup unrelated to the purpose of the patch.

> +static void xen_safe_halt(void)
> +{
> +	stop_hz_timer();
> +	/* Blocking includes an implicit local_irq_enable(). */
> +	if (HYPERVISOR_sched_op(SCHEDOP_block, 0) != 0)
> +		BUG();
> +	start_hz_timer();
> +}

change this to tick_nohz_stop_sched_tick()/restart_sched_tick() instead.

> +static void xen_halt(void)
> +{
> +#if 0
> +	if (irqs_disabled())
> +		HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL);
> +#endif

hm?

> +		op->arg1.linear_addr = arbitrary_virt_to_machine((unsigned long)addr).maddr;

80+ chars line. (there are more instances of this throughout the patch)

> +	for (va = dtr->address, f = 0;
> +	     va < dtr->address + size;
> +	     va += PAGE_SIZE, f++) {
> +		frames[f] = virt_to_mfn(va);
> +		make_lowmem_page_readonly((void *)va);
> +	}

coding style.

> +	/* This is used very early, so we can't rely on per-cpu data
> +	   being set up, so no multicalls */

comment coding style. (there are instances of this throughout the patch)

> +	spin_lock(&lock);
> +	for(in = out = 0; in < count; in++) {

missing whitespace.

> +		if (HYPERVISOR_update_descriptor(virt_to_machine(dt + entry*8).maddr,

80+ chars.

> +static void xen_set_iopl_mask(unsigned mask)
> +{
> +#if 0
> +	struct physdev_set_iopl set_iopl;
> +
> +	/* Force the change at ring 0. */
> +	set_iopl.iopl = (mask == 0) ? 1 : (mask >> 12) & 3;
> +	HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl);
> +#endif

hm?

> +	}
> +
> +	write_pda(xen.cr3, cr3);
> +
> +
> +	{

unnecessary newline.

> +static struct irq_chip xen_dynamic_chip __read_mostly = {
> +	.name		= "xen-virq",
> +	.mask		= disable_dynirq,
> +	.unmask		= enable_dynirq,
> +	.ack		= ack_dynirq,
> +	.set_affinity	= set_affinity_irq,
> +	.retrigger	= retrigger_dynirq,
> +};

nicely done! :-)

> +void xen_set_pte(pte_t *ptep, pte_t pte)
> +{
> +#if 1
> +	struct mmu_update u;
> +
> +	u.ptr = virt_to_machine(ptep).maddr;
> +	u.val = pte_val_ma(pte);
> +	if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
> +		BUG();
> +#else
> +	ptep->pte_high = pte.pte_high;
> +	smp_wmb();
> +	ptep->pte_low = pte.pte_low;
> +#endif

hm?

> +	pgd = swapper_pg_dir + pgd_index(vaddr);
> +	if (pgd_none(*pgd)) {
> +		BUG();
> +		return;

hm?

> +void xen_pte_clear(struct mm_struct *mm, unsigned long addr,pte_t *ptep)
> +{
> +#if 1
> +	ptep->pte_low = 0;
> +	smp_wmb();
> +	ptep->pte_high = 0;
> +#else
> +	set_64bit((u64 *)ptep, 0);
> +#endif

hm?

> +unsigned long xen_pmd_val(pmd_t pmd)
> +{
> +	BUG();
> +	return 0;
> +}

make it noret.

> +pmd_t xen_make_pmd(unsigned long pmd)
> +{
> +	BUG();
> +	return __pmd(0);
> +}

ditto.

> +static void xen_timer_interrupt_hook(void)
> +{

> +				runstate->time[RUNSTATE_offline] -
> +				__get_cpu_var(processed_stolen_time);

NACK for now, for the reasons outlined in the 'stolen time' thread. 
Stolen time accounting is a concept only related to the scheduler tick, 
it's not a concept that should leak into normal timer interrupt 
concepts.

> +	/* System-wide jiffy work. */
> +	ticks = 0;
> +	while(delta > NS_PER_TICK) {
> +		delta -= NS_PER_TICK;
> +		processed_system_time += NS_PER_TICK;
> +		ticks++;
> +	}
> +	do_timer(ticks);

that should be updated to clockevents. (i suspect it already is?)

> +
> +#if 0
> +	if (unlikely((mfn >> machine_to_phys_order) != 0))
> +		return max_mapnr;
> +#endif

hm?

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
@ 2007-03-16  9:14     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:14 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Adrian Bunk, Chris Wright,
	Andi Kleen, Andrew Morton, Thomas Gleixner, Christian Limpach


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Core Xen Implementation
> 
> This patch is a rollup of all the core pieces of the Xen 
> implementation, including booting, memory management, interrupts, time 
> and so on.

> --- a/arch/i386/kernel/head.S
> +++ b/arch/i386/kernel/head.S
> @@ -535,6 +535,10 @@ unhandled_paravirt:
>  	ud2
>  #endif
>  
> +#ifdef CONFIG_XEN
> +#include "../xen/xen-head.S"
> +#endif

i'd suggest to remove the #ifdef and push it into xen-head.S.

>  /*
>   * BSS section
>   */
> -.section ".bss.page_aligned","w"
> +.section ".bss.page_aligned","wa"

why is this done?

>  ENTRY(swapper_pg_dir)
> +	.align PAGE_SIZE_asm
>  	.fill 1024,4,0

does the native kernel lose memory here?

> @@ -437,9 +437,9 @@ static unsigned long native_store_tr(voi
>  
>  static void native_load_tls(struct thread_struct *t, unsigned int cpu)
>  {
> -#define C(i) get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i]
> -	C(0); C(1); C(2);
> -#undef C
> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 0] = t->tls_array[0];
> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 1] = t->tls_array[1];
> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 2] = t->tls_array[2];
>  }

this is a cleanup unrelated to the purpose of the patch.

> +static void xen_safe_halt(void)
> +{
> +	stop_hz_timer();
> +	/* Blocking includes an implicit local_irq_enable(). */
> +	if (HYPERVISOR_sched_op(SCHEDOP_block, 0) != 0)
> +		BUG();
> +	start_hz_timer();
> +}

change this to tick_nohz_stop_sched_tick()/restart_sched_tick() instead.

> +static void xen_halt(void)
> +{
> +#if 0
> +	if (irqs_disabled())
> +		HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL);
> +#endif

hm?

> +		op->arg1.linear_addr = arbitrary_virt_to_machine((unsigned long)addr).maddr;

80+ chars line. (there are more instances of this throughout the patch)

> +	for (va = dtr->address, f = 0;
> +	     va < dtr->address + size;
> +	     va += PAGE_SIZE, f++) {
> +		frames[f] = virt_to_mfn(va);
> +		make_lowmem_page_readonly((void *)va);
> +	}

coding style.

> +	/* This is used very early, so we can't rely on per-cpu data
> +	   being set up, so no multicalls */

comment coding style. (there are instances of this throughout the patch)

> +	spin_lock(&lock);
> +	for(in = out = 0; in < count; in++) {

missing whitespace.

> +		if (HYPERVISOR_update_descriptor(virt_to_machine(dt + entry*8).maddr,

80+ chars.

> +static void xen_set_iopl_mask(unsigned mask)
> +{
> +#if 0
> +	struct physdev_set_iopl set_iopl;
> +
> +	/* Force the change at ring 0. */
> +	set_iopl.iopl = (mask == 0) ? 1 : (mask >> 12) & 3;
> +	HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl);
> +#endif

hm?

> +	}
> +
> +	write_pda(xen.cr3, cr3);
> +
> +
> +	{

unnecessary newline.

> +static struct irq_chip xen_dynamic_chip __read_mostly = {
> +	.name		= "xen-virq",
> +	.mask		= disable_dynirq,
> +	.unmask		= enable_dynirq,
> +	.ack		= ack_dynirq,
> +	.set_affinity	= set_affinity_irq,
> +	.retrigger	= retrigger_dynirq,
> +};

nicely done! :-)

> +void xen_set_pte(pte_t *ptep, pte_t pte)
> +{
> +#if 1
> +	struct mmu_update u;
> +
> +	u.ptr = virt_to_machine(ptep).maddr;
> +	u.val = pte_val_ma(pte);
> +	if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
> +		BUG();
> +#else
> +	ptep->pte_high = pte.pte_high;
> +	smp_wmb();
> +	ptep->pte_low = pte.pte_low;
> +#endif

hm?

> +	pgd = swapper_pg_dir + pgd_index(vaddr);
> +	if (pgd_none(*pgd)) {
> +		BUG();
> +		return;

hm?

> +void xen_pte_clear(struct mm_struct *mm, unsigned long addr,pte_t *ptep)
> +{
> +#if 1
> +	ptep->pte_low = 0;
> +	smp_wmb();
> +	ptep->pte_high = 0;
> +#else
> +	set_64bit((u64 *)ptep, 0);
> +#endif

hm?

> +unsigned long xen_pmd_val(pmd_t pmd)
> +{
> +	BUG();
> +	return 0;
> +}

make it noret.

> +pmd_t xen_make_pmd(unsigned long pmd)
> +{
> +	BUG();
> +	return __pmd(0);
> +}

ditto.

> +static void xen_timer_interrupt_hook(void)
> +{

> +				runstate->time[RUNSTATE_offline] -
> +				__get_cpu_var(processed_stolen_time);

NACK for now, for the reasons outlined in the 'stolen time' thread. 
Stolen time accounting is a concept only related to the scheduler tick, 
it's not a concept that should leak into normal timer interrupt 
concepts.

> +	/* System-wide jiffy work. */
> +	ticks = 0;
> +	while(delta > NS_PER_TICK) {
> +		delta -= NS_PER_TICK;
> +		processed_system_time += NS_PER_TICK;
> +		ticks++;
> +	}
> +	do_timer(ticks);

that should be updated to clockevents. (i suspect it already is?)

> +
> +#if 0
> +	if (unlikely((mfn >> machine_to_phys_order) != 0))
> +		return max_mapnr;
> +#endif

hm?

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 18/26] Xen-paravirt_ops: Add XEN config options
  2007-03-01 23:25 ` [patch 18/26] Xen-paravirt_ops: Add XEN config options Jeremy Fitzhardinge
@ 2007-03-16  9:14     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:14 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Thomas Gleixner


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Xen is currently incompatible with:
> - PREEMPT

why? That should be fixed.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 18/26] Xen-paravirt_ops: Add XEN config options
@ 2007-03-16  9:14     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:14 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Chris Wright, Andi Kleen,
	Andrew Morton, Thomas Gleixner, Christian Limpach


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Xen is currently incompatible with:
> - PREEMPT

why? That should be fixed.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes
  2007-03-01 23:25 ` [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes Jeremy Fitzhardinge
@ 2007-03-16  9:15     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:15 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Roland McGrath, Ulrich Drepper


i've Cc:-ed Roland and Ulrich, who should make the call on this one.

* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add the "nosegneg" fake capabilty to the vsyscall page notes. This is
> used by the runtime linker to select a glibc version which then
> disables negative-offset accesses to the thread-local segment via
> %gs. These accesses require emulation in Xen (because segments are
> truncated to protect the hypervisor address space) and avoiding them
> provides a measurable performance boost.
>
> Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
> Acked-by: Zachary Amsden <zach@vmware.com>
> 
> ---
>  arch/i386/kernel/vsyscall-note.S |   28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> ===================================================================
> --- a/arch/i386/kernel/vsyscall-note.S
> +++ b/arch/i386/kernel/vsyscall-note.S
> @@ -23,3 +24,31 @@ 3:	.balign 4;		/* pad out section */			 
>  	ASM_ELF_NOTE_BEGIN(".note.kernel-version", "a", UTS_SYSNAME, 0)
>  	.long LINUX_VERSION_CODE
>  	ASM_ELF_NOTE_END
> +
> +#ifdef CONFIG_XEN
> +/*
> + * Add a special note telling glibc's dynamic linker a fake hardware
> + * flavor that it will use to choose the search path for libraries in the
> + * same way it uses real hardware capabilities like "mmx".
> + * We supply "nosegneg" as the fake capability, to indicate that we
> + * do not like negative offsets in instructions using segment overrides,
> + * since we implement those inefficiently.  This makes it possible to
> + * install libraries optimized to avoid those access patterns in someplace
> + * like /lib/i686/tls/nosegneg.  Note that an /etc/ld.so.conf.d/file
> + * corresponding to the bits here is needed to make ldconfig work right.
> + * It should contain:
> + *	hwcap 0 nosegneg
> + * to match the mapping of bit to name that we give here.
> + */
> +#define NOTE_KERNELCAP_BEGIN(ncaps, mask) \
> +	ASM_ELF_NOTE_BEGIN(".note.kernelcap", "a", "GNU", 2) \
> +	.long ncaps, mask
> +#define NOTE_KERNELCAP(bit, name) \
> +	.byte bit; .asciz name
> +#define NOTE_KERNELCAP_END ASM_ELF_NOTE_END
> +
> +NOTE_KERNELCAP_BEGIN(1, 1)
> +NOTE_KERNELCAP(1, "nosegneg")
> +NOTE_KERNELCAP_END
> +#endif
> +
> 
> -- 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes
@ 2007-03-16  9:15     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:15 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, Ian Pratt, linux-kernel, Chris Wright, Andi Kleen,
	virtualization, Andrew Morton, Ulrich Drepper, Roland McGrath


i've Cc:-ed Roland and Ulrich, who should make the call on this one.

* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add the "nosegneg" fake capabilty to the vsyscall page notes. This is
> used by the runtime linker to select a glibc version which then
> disables negative-offset accesses to the thread-local segment via
> %gs. These accesses require emulation in Xen (because segments are
> truncated to protect the hypervisor address space) and avoiding them
> provides a measurable performance boost.
>
> Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
> Acked-by: Zachary Amsden <zach@vmware.com>
> 
> ---
>  arch/i386/kernel/vsyscall-note.S |   28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> ===================================================================
> --- a/arch/i386/kernel/vsyscall-note.S
> +++ b/arch/i386/kernel/vsyscall-note.S
> @@ -23,3 +24,31 @@ 3:	.balign 4;		/* pad out section */			 
>  	ASM_ELF_NOTE_BEGIN(".note.kernel-version", "a", UTS_SYSNAME, 0)
>  	.long LINUX_VERSION_CODE
>  	ASM_ELF_NOTE_END
> +
> +#ifdef CONFIG_XEN
> +/*
> + * Add a special note telling glibc's dynamic linker a fake hardware
> + * flavor that it will use to choose the search path for libraries in the
> + * same way it uses real hardware capabilities like "mmx".
> + * We supply "nosegneg" as the fake capability, to indicate that we
> + * do not like negative offsets in instructions using segment overrides,
> + * since we implement those inefficiently.  This makes it possible to
> + * install libraries optimized to avoid those access patterns in someplace
> + * like /lib/i686/tls/nosegneg.  Note that an /etc/ld.so.conf.d/file
> + * corresponding to the bits here is needed to make ldconfig work right.
> + * It should contain:
> + *	hwcap 0 nosegneg
> + * to match the mapping of bit to name that we give here.
> + */
> +#define NOTE_KERNELCAP_BEGIN(ncaps, mask) \
> +	ASM_ELF_NOTE_BEGIN(".note.kernelcap", "a", "GNU", 2) \
> +	.long ncaps, mask
> +#define NOTE_KERNELCAP(bit, name) \
> +	.byte bit; .asciz name
> +#define NOTE_KERNELCAP_END ASM_ELF_NOTE_END
> +
> +NOTE_KERNELCAP_BEGIN(1, 1)
> +NOTE_KERNELCAP(1, "nosegneg")
> +NOTE_KERNELCAP_END
> +#endif
> +
> 
> -- 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 16/26] Xen-paravirt_ops: Allocate and free vmalloc areas
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-16  9:16     ` Ingo Molnar
  -1 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:16 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Jan Beulich


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Allocate/destroy a 'vmalloc' VM area: alloc_vm_area and free_vm_area 
> The alloc function ensures that page tables are constructed for the 
> region of kernel virtual address space and mapped into init_mm.

> --- a/arch/i386/mm/fault.c

> +struct vm_struct *alloc_vm_area(unsigned long size)

this should go into mm/vmalloc.c.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 16/26] Xen-paravirt_ops: Allocate and free vmalloc areas
@ 2007-03-16  9:16     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:16 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Jan Beulich, Chris Wright,
	Andi Kleen, Andrew Morton, Christian Limpach


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Allocate/destroy a 'vmalloc' VM area: alloc_vm_area and free_vm_area 
> The alloc function ensures that page tables are constructed for the 
> region of kernel virtual address space and mapped into init_mm.

> --- a/arch/i386/mm/fault.c

> +struct vm_struct *alloc_vm_area(unsigned long size)

this should go into mm/vmalloc.c.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 15/26] Xen-paravirt_ops: Add apply_to_page_range() which applies a function to a pte range.
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-16  9:19     ` Ingo Molnar
  -1 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:19 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Christoph Lameter


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add a new mm function apply_to_page_range() which applies a given 
> function to every pte in a given virtual address range in a given mm 
> structure. This is a generic alternative to cut-and-pasting the Linux 
> idiomatic pagetable walking code in every place that a sequence of 
> PTEs must be accessed.

nice one! I suspect we could simplify some of the less 
performance-critical open-coded pagetable walker loops in the kernel 
with this? (i'm not even sure it's all that much of a performance loss 
to pass a function pointer around - would be nice to convert say 
mprotect() to this and see the performance difference?)

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 15/26] Xen-paravirt_ops: Add apply_to_page_range() which applies a function to a pte range.
@ 2007-03-16  9:19     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:19 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Chris Wright, Andi Kleen,
	Andrew Morton, Christoph Lameter, Christian Limpach


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add a new mm function apply_to_page_range() which applies a given 
> function to every pte in a given virtual address range in a given mm 
> structure. This is a generic alternative to cut-and-pasting the Linux 
> idiomatic pagetable walking code in every place that a sequence of 
> PTEs must be accessed.

nice one! I suspect we could simplify some of the less 
performance-critical open-coded pagetable walker loops in the kernel 
with this? (i'm not even sure it's all that much of a performance loss 
to pass a function pointer around - would be nice to convert say 
mprotect() to this and see the performance difference?)

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 14/26] Xen-paravirt_ops: add common patching machinery
  2007-03-01 23:24 ` [patch 14/26] Xen-paravirt_ops: add common patching machinery Jeremy Fitzhardinge
@ 2007-03-16  9:20     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:20 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Anthony Liguori


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Implement the actual patching machinery.  paravirt_patch_default() 
> contains the logic to automatically patch a callsite based on a few 
> simple rules:
> 
>  - if the paravirt_op function is paravirt_nop, then patch nops
>  - if the paravirt_op function is a jmp target, then jmp to it
>  - if the paravirt_op function is callable and doesn't clobber too much
>     for the callsite, call it directly
> 
> paravirt_patch_default is suitable as a default implementation of 
> paravirt_ops.patch, will remove most of the expensive indirect calls 
> in favour of either a direct call or a pile of nops.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 14/26] Xen-paravirt_ops: add common patching machinery
@ 2007-03-16  9:20     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:20 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Anthony Liguori,
	Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Implement the actual patching machinery.  paravirt_patch_default() 
> contains the logic to automatically patch a callsite based on a few 
> simple rules:
> 
>  - if the paravirt_op function is paravirt_nop, then patch nops
>  - if the paravirt_op function is a jmp target, then jmp to it
>  - if the paravirt_op function is callable and doesn't clobber too much
>     for the callsite, call it directly
> 
> paravirt_patch_default is suitable as a default implementation of 
> paravirt_ops.patch, will remove most of the expensive indirect calls 
> in favour of either a direct call or a pile of nops.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
  2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
@ 2007-03-16  9:21   ` Ingo Molnar
  2007-03-01 23:24   ` Jeremy Fitzhardinge
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:21 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> This patch series implements the Linux Xen guest as a paravirt_ops 
> backend. [...]

19/26 seems to be missing (lkml size limit?) so i couldnt review it.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
@ 2007-03-16  9:21   ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:21 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> This patch series implements the Linux Xen guest as a paravirt_ops 
> backend. [...]

19/26 seems to be missing (lkml size limit?) so i couldnt review it.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-01 23:24 ` [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable Jeremy Fitzhardinge
@ 2007-03-16  9:24     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:24 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Anthony Liguori, Linus Torvalds


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Wrap a set of interesting paravirt_ops calls in a wrapper which makes 
> the callsites available for patching.  Unfortunately this is pretty 
> ugly because there's no way to get gcc to generate a function call, 
> but also wrap just the callsite itself with the necessary labels.
> 
> This patch supports functions with 0-4 arguments, and either void or 
> returning a value.  64-bit arguments must be split into a pair of 
> 32-bit arguments (lower word first).  Small structures are returned in 
> registers.

ugh. This is beyond ugly! Why dont we just compile two images, one for 
Xen and one for native, do two passes to get those two images and 
'merge' them into a single vmlinuz (so that we still have a 'single' 
kernel unit to deal with on the distro side). This way we avoid all this 
crazy, limited, fragile patchery business...

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-16  9:24     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:24 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Anthony Liguori,
	Andrew Morton, Linus Torvalds


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Wrap a set of interesting paravirt_ops calls in a wrapper which makes 
> the callsites available for patching.  Unfortunately this is pretty 
> ugly because there's no way to get gcc to generate a function call, 
> but also wrap just the callsite itself with the necessary labels.
> 
> This patch supports functions with 0-4 arguments, and either void or 
> returning a value.  64-bit arguments must be split into a pair of 
> 32-bit arguments (lower word first).  Small structures are returned in 
> registers.

ugh. This is beyond ugly! Why dont we just compile two images, one for 
Xen and one for native, do two passes to get those two images and 
'merge' them into a single vmlinuz (so that we still have a 'single' 
kernel unit to deal with on the distro side). This way we avoid all this 
crazy, limited, fragile patchery business...

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-16  9:26     ` Ingo Molnar
  -1 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:26 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Linus Torvalds


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Fix a few clobbers to include the return register.  The clobbers set 
> is the set of all registers modified (or may be modified) by the code 
> snippet, regardless of whether it was deliberate or accidental.

i guess we need these paravirt.h clobber fixes for v2.6.21 (VMI) too, 
correct?

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
@ 2007-03-16  9:26     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:26 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton,
	Linus Torvalds


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Fix a few clobbers to include the return register.  The clobbers set 
> is the set of all registers modified (or may be modified) by the code 
> snippet, regardless of whether it was deliberate or accidental.

i guess we need these paravirt.h clobber fixes for v2.6.21 (VMI) too, 
correct?

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 09/26] Xen-paravirt_ops: remove HAVE_ARCH_MM_LIFETIME, define no-op architecture implementations
  2007-03-01 23:24 ` [patch 09/26] Xen-paravirt_ops: remove HAVE_ARCH_MM_LIFETIME, define no-op architecture implementations Jeremy Fitzhardinge
@ 2007-03-16  9:27     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:27 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	linux-arch, James Bottomley


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> akpm:
> > Can we lose __HAVE_ARCH_MM_LIFETIME?  Just define these (preferably in C,
> > not in cpp) in the appropriate include/asm-foo/ files?

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 09/26] Xen-paravirt_ops: remove HAVE_ARCH_MM_LIFETIME, define no-op architecture implementations
@ 2007-03-16  9:27     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:27 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, linux-arch, xen-devel, virtualization,
	Rusty Russell, linux-kernel, Chris Wright, Andi Kleen,
	James Bottomley, Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> akpm:
> > Can we lose __HAVE_ARCH_MM_LIFETIME?  Just define these (preferably in C,
> > not in cpp) in the appropriate include/asm-foo/ files?

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16  8:54     ` Ingo Molnar
  (?)
  (?)
@ 2007-03-16  9:28     ` Keir Fraser
  -1 siblings, 0 replies; 330+ messages in thread
From: Keir Fraser @ 2007-03-16  9:28 UTC (permalink / raw)
  To: Ingo Molnar, Jeremy Fitzhardinge
  Cc: virtualization, xen-devel, Chris Wright, Andi Kleen,
	Andrew Morton, linux-kernel

On 16/3/07 08:54, "Ingo Molnar" <mingo@elte.hu> wrote:

>> + prod = intf->in_prod;
>> + mb();
>> + BUG_ON((prod - cons) > sizeof(intf->in));
> 
> such mb()'s are typically a sign of "i have no clear idea what SMP
> serialization rules apply here, but something is needed because
> otherwise it breaks" ?

These mb()'s are pretty standard for lock-free producer/consumer rings.
Write descriptor /then/ write the updated producer. Read the producer /then/
read any descriptors revealed by this new producer value.

All our ring protocols (and we have a few) require memory barriers. Those
that use our standard set of ring macros have the memory barriers hidden
from view, but they're still there.

 -- Keir

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16  8:54     ` Ingo Molnar
  (?)
@ 2007-03-16  9:28     ` Keir Fraser
  2007-03-16  9:58         ` Ingo Molnar
  -1 siblings, 1 reply; 330+ messages in thread
From: Keir Fraser @ 2007-03-16  9:28 UTC (permalink / raw)
  To: Ingo Molnar, Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, Andi Kleen, Rusty Russell,
	linux-kernel, Chris Wright, virtualization, Andrew Morton

On 16/3/07 08:54, "Ingo Molnar" <mingo@elte.hu> wrote:

>> + prod = intf->in_prod;
>> + mb();
>> + BUG_ON((prod - cons) > sizeof(intf->in));
> 
> such mb()'s are typically a sign of "i have no clear idea what SMP
> serialization rules apply here, but something is needed because
> otherwise it breaks" ?

These mb()'s are pretty standard for lock-free producer/consumer rings.
Write descriptor /then/ write the updated producer. Read the producer /then/
read any descriptors revealed by this new producer value.

All our ring protocols (and we have a few) require memory barriers. Those
that use our standard set of ring macros have the memory barriers hidden
from view, but they're still there.

 -- Keir

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 08/26] Xen-paravirt_ops: add hooks to intercept mm creation and destruction
  2007-03-01 23:24 ` [patch 08/26] Xen-paravirt_ops: add hooks to intercept mm creation and destruction Jeremy Fitzhardinge
@ 2007-03-16  9:30     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:30 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	linux-arch


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add hooks to allow a paravirt implementation to track the lifetime of 
> an mm.

(i guess this one should be merged with the ARCH-define cleanup patch)

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 08/26] Xen-paravirt_ops: add hooks to intercept mm creation and destruction
@ 2007-03-16  9:30     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:30 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, linux-arch, xen-devel, virtualization,
	Rusty Russell, linux-kernel, Chris Wright, Andi Kleen,
	Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add hooks to allow a paravirt implementation to track the lifetime of 
> an mm.

(i guess this one should be merged with the ARCH-define cleanup patch)

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 07/26] Xen-paravirt_ops: Allow paravirt backend to choose kernel PMD sharing
  2007-03-01 23:24 ` [patch 07/26] Xen-paravirt_ops: Allow paravirt backend to choose kernel PMD sharing Jeremy Fitzhardinge
@ 2007-03-16  9:31     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:31 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	William Lee Irwin III


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Normally when running in PAE mode, the 4th PMD maps the kernel address 
> space, which can be shared among all processes (since they all need 
> the same kernel mappings.
> 
> Xen, however, does not allow guests to have the kernel pmd shared 
> between page tables, so parameterize pgtable.c to allow both modes of 
> operation.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 07/26] Xen-paravirt_ops: Allow paravirt backend to choose kernel PMD sharing
@ 2007-03-16  9:31     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:31 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, William Lee Irwin III, Chris Wright, Andi Kleen,
	Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Normally when running in PAE mode, the 4th PMD maps the kernel address 
> space, which can be shared among all processes (since they all need 
> the same kernel mappings.
> 
> Xen, however, does not allow guests to have the kernel pmd shared 
> between page tables, so parameterize pgtable.c to allow both modes of 
> operation.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 06/26] Xen-paravirt_ops: paravirt_ops: allocate a fixmap slot
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-16  9:31     ` Ingo Molnar
  -1 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:31 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Allocate a fixmap slot for use by a paravirt_ops implementation.  This 
> is intended for early-boot bootstrap mappings.  Once the zones and 
> allocator have been set up, it would be better to use get_vm_area() to 
> allocate some virtual space.
> 
> Xen uses this to map the hypervisor's shared info page, which doesn't 
> have a pseudo-physical page number, and therefore can't be mapped 
> ordinarily.  It is needed early because it contains the vcpu state, 
> including the interrupt mask.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 06/26] Xen-paravirt_ops: paravirt_ops: allocate a fixmap slot
@ 2007-03-16  9:31     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:31 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Allocate a fixmap slot for use by a paravirt_ops implementation.  This 
> is intended for early-boot bootstrap mappings.  Once the zones and 
> allocator have been set up, it would be better to use get_vm_area() to 
> allocate some virtual space.
> 
> Xen uses this to map the hypervisor's shared info page, which doesn't 
> have a pseudo-physical page number, and therefore can't be mapped 
> ordinarily.  It is needed early because it contains the vcpu state, 
> including the interrupt mask.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-16  9:24     ` Ingo Molnar
  (?)
@ 2007-03-16  9:33     ` David Miller
  2007-03-16  9:57         ` Ingo Molnar
  2007-03-16 20:38       ` Jeremy Fitzhardinge
  -1 siblings, 2 replies; 330+ messages in thread
From: David Miller @ 2007-03-16  9:33 UTC (permalink / raw)
  To: mingo
  Cc: jeremy, ak, akpm, linux-kernel, virtualization, xen-devel,
	chrisw, zach, rusty, anthony, torvalds

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 16 Mar 2007 10:24:45 +0100

> ugh. This is beyond ugly! Why dont we just compile two images, one for 
> Xen and one for native, do two passes to get those two images and 
> 'merge' them into a single vmlinuz (so that we still have a 'single' 
> kernel unit to deal with on the distro side). This way we avoid all this 
> crazy, limited, fragile patchery business...

Perhaps the problem can be dealt with using ELF relocations.

There is another case, discussed yesterday on netdev, where run-time
resolution of ELF relocations would be useful (for
very-very-very-read-only variables) so if it can solve this problem
too it would be nice to have a generic infrastructure for it.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-16  9:33     ` Ingo Molnar
  -1 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:33 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	William Lee Irwin III


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> This patch introduces paravirt_ops hooks to control how the kernel's
> initial pagetable is set up.

looks good. Some minor nits:

> +	/* Make sure kernel address space is empty so that a pagetable
> +	   will be allocated for it. */

comment style.

> +	/* Enable PSE if available */
> +	if (cpu_has_pse) {
> +		set_in_cr4(X86_CR4_PSE);
> +	}

unnecessary braces.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable
@ 2007-03-16  9:33     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:33 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, William Lee Irwin III, Chris Wright, Andi Kleen,
	Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> This patch introduces paravirt_ops hooks to control how the kernel's
> initial pagetable is set up.

looks good. Some minor nits:

> +	/* Make sure kernel address space is empty so that a pagetable
> +	   will be allocated for it. */

comment style.

> +	/* Enable PSE if available */
> +	if (cpu_has_pse) {
> +		set_in_cr4(X86_CR4_PSE);
> +	}

unnecessary braces.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 04/26] Xen-paravirt_ops: Add pagetable accessors to pack and unpack pagetable entries
  2007-03-01 23:24 ` [patch 04/26] Xen-paravirt_ops: Add pagetable accessors to pack and unpack pagetable entries Jeremy Fitzhardinge
@ 2007-03-16  9:38     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:38 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add a set of accessors to pack, unpack and modify page table entries 
> (at all levels).  This allows a paravirt implementation to control the 
> contents of pgd/pmd/pte entries.  For example, Xen uses this to 
> convert the (pseudo-)physical address into a machine address when 
> populating a pagetable entry, and converting back to pphys address 
> when an entry is read.

looks good.

Acked-by: Ingo Molnar <mingo@elte.hu>

only some minor style nits:

> +static inline unsigned long long native_pgd_val(pgd_t pgd)
> +{
> +	return pgd.pgd;
> +}
> +static inline unsigned long long native_pmd_val(pmd_t pmd)
> +{
> +	return pmd.pmd;
> +}
> +static inline unsigned long long native_pte_val(pte_t pte)
> +{
> +	return pte.pte_low | ((unsigned long long)pte.pte_high << 32);
> +}
> +static inline pgd_t native_make_pgd(unsigned long long val)
> +{
> +	return (pgd_t) { val };
> +}
> +static inline pmd_t native_make_pmd(unsigned long long val)
> +{
> +	return (pmd_t) { val };
> +}
> +static inline pte_t native_make_pte(unsigned long long val)
> +{
> +	return (pte_t) { .pte_low = val, .pte_high = (val >> 32) } ;
> +}

missing newlines between inline functions.

> +#ifndef CONFIG_PARAVIRT
> +#define pmd_val(x)	native_pmd_val(x)
> +#define __pmd(x)	native_make_pmd(x)
> +#endif	/* !CONFIG_PARAVIRT */

no need for the closing !CONFIG_PARAVIRT comment: this define is 2 lines 
long so it's not that hard to find the start of the block. We typically 
do the /* !CONFIG_XX */ comment only for larger blocks, and when 
multiple #endif's intermix.

> +static inline unsigned long native_pgd_val(pgd_t pgd)
> +{
> +	return pgd.pgd;
> +}
> +static inline unsigned long native_pte_val(pte_t pte)
> +{
> +	return pte.pte_low;
> +}
> +static inline pgd_t native_make_pgd(unsigned long val)
> +{
> +	return (pgd_t) { val };
> +}
> +static inline pte_t native_make_pte(unsigned long val)
> +{
> +	return (pte_t) { .pte_low = val };
> +}

newlines.

>  #define HPAGE_SHIFT	22
>  #include <asm-generic/pgtable-nopmd.h>
> -#endif
> +#endif	/* CONFIG_X86_PAE */

(for example here the #endif comment is justified.)

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 04/26] Xen-paravirt_ops: Add pagetable accessors to pack and unpack pagetable entries
@ 2007-03-16  9:38     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:38 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add a set of accessors to pack, unpack and modify page table entries 
> (at all levels).  This allows a paravirt implementation to control the 
> contents of pgd/pmd/pte entries.  For example, Xen uses this to 
> convert the (pseudo-)physical address into a machine address when 
> populating a pagetable entry, and converting back to pphys address 
> when an entry is read.

looks good.

Acked-by: Ingo Molnar <mingo@elte.hu>

only some minor style nits:

> +static inline unsigned long long native_pgd_val(pgd_t pgd)
> +{
> +	return pgd.pgd;
> +}
> +static inline unsigned long long native_pmd_val(pmd_t pmd)
> +{
> +	return pmd.pmd;
> +}
> +static inline unsigned long long native_pte_val(pte_t pte)
> +{
> +	return pte.pte_low | ((unsigned long long)pte.pte_high << 32);
> +}
> +static inline pgd_t native_make_pgd(unsigned long long val)
> +{
> +	return (pgd_t) { val };
> +}
> +static inline pmd_t native_make_pmd(unsigned long long val)
> +{
> +	return (pmd_t) { val };
> +}
> +static inline pte_t native_make_pte(unsigned long long val)
> +{
> +	return (pte_t) { .pte_low = val, .pte_high = (val >> 32) } ;
> +}

missing newlines between inline functions.

> +#ifndef CONFIG_PARAVIRT
> +#define pmd_val(x)	native_pmd_val(x)
> +#define __pmd(x)	native_make_pmd(x)
> +#endif	/* !CONFIG_PARAVIRT */

no need for the closing !CONFIG_PARAVIRT comment: this define is 2 lines 
long so it's not that hard to find the start of the block. We typically 
do the /* !CONFIG_XX */ comment only for larger blocks, and when 
multiple #endif's intermix.

> +static inline unsigned long native_pgd_val(pgd_t pgd)
> +{
> +	return pgd.pgd;
> +}
> +static inline unsigned long native_pte_val(pte_t pte)
> +{
> +	return pte.pte_low;
> +}
> +static inline pgd_t native_make_pgd(unsigned long val)
> +{
> +	return (pgd_t) { val };
> +}
> +static inline pte_t native_make_pte(unsigned long val)
> +{
> +	return (pte_t) { .pte_low = val };
> +}

newlines.

>  #define HPAGE_SHIFT	22
>  #include <asm-generic/pgtable-nopmd.h>
> -#endif
> +#endif	/* CONFIG_X86_PAE */

(for example here the #endif comment is justified.)

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
  2007-03-01 23:24 ` [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations Jeremy Fitzhardinge
@ 2007-03-16  9:44     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:44 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add a _paravirt_nop function for use as a stub for no-op operations, 
> and paravirt_nop #defined void * version to make using it easier 
> (since all its uses are as a void *).
> 
> This is useful to allow the patcher to automatically identify noop 
> operations so it can simply nop out the callsite.

Acked-by: Ingo Molnar <mingo@elte.hu>

> +void _paravirt_nop(void);
> +#define paravirt_nop	((void *)_paravirt_nop)

but only as a cleanup of the current open-coded (void *) casts. My 
problem with this is that it loses the types. Not that there is much to 
check for, but still, this adds some assumptions about how function 
calls look like.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
@ 2007-03-16  9:44     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:44 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Add a _paravirt_nop function for use as a stub for no-op operations, 
> and paravirt_nop #defined void * version to make using it easier 
> (since all its uses are as a void *).
> 
> This is useful to allow the patcher to automatically identify noop 
> operations so it can simply nop out the callsite.

Acked-by: Ingo Molnar <mingo@elte.hu>

> +void _paravirt_nop(void);
> +#define paravirt_nop	((void *)_paravirt_nop)

but only as a cleanup of the current open-coded (void *) casts. My 
problem with this is that it loses the types. Not that there is much to 
check for, but still, this adds some assumptions about how function 
calls look like.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 02/26] Xen-paravirt_ops: ignore vgacon if hardware not present
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-16  9:45     ` Ingo Molnar
  -1 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Avoid trying to set up vgacon if there's no vga hardware present.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 02/26] Xen-paravirt_ops: ignore vgacon if hardware not present
@ 2007-03-16  9:45     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Avoid trying to set up vgacon if there's no vga hardware present.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 01/26] Xen-paravirt_ops: Fix typo in sync_constant_test_bit()s name.
  2007-03-01 23:24   ` Jeremy Fitzhardinge
@ 2007-03-16  9:45     ` Ingo Molnar
  -1 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 01/26] Xen-paravirt_ops: Fix typo in sync_constant_test_bit()s name.
@ 2007-03-16  9:45     ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-16  9:33     ` David Miller
@ 2007-03-16  9:57         ` Ingo Molnar
  2007-03-16 20:38       ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:57 UTC (permalink / raw)
  To: David Miller
  Cc: jeremy, ak, akpm, linux-kernel, virtualization, xen-devel,
	chrisw, zach, rusty, anthony, torvalds

* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri, 16 Mar 2007 10:24:45 +0100
> 
> > ugh. This is beyond ugly! Why dont we just compile two images, one 
> > for Xen and one for native, do two passes to get those two images 
> > and 'merge' them into a single vmlinuz (so that we still have a 
> > 'single' kernel unit to deal with on the distro side). This way we 
> > avoid all this crazy, limited, fragile patchery business...
> 
> Perhaps the problem can be dealt with using ELF relocations.
> 
> There is another case, discussed yesterday on netdev, where run-time 
> resolution of ELF relocations would be useful (for 
> very-very-very-read-only variables) so if it can solve this problem 
> too it would be nice to have a generic infrastructure for it.

yeah, and i really think this is very fundamental: we've already got a 
very nice tool that can do things like detect when we are paravirt and 
optimize and patch things in a machine-specific way. It can even reorder 
instructions and simulate the CPU's pipeline state and do very smart 
optimizations based on that. It's a really neat thing, they call it 
"GCC".

Limited, instruction-level patching like alternatives.h is fine because 
that makes it easier to support multiple, incompatible CPU 
architectures, without having to do a hugely intrusive split at the 
kernel RPM level.

but the level of 'binary patching' done by the paravirt and Xen goes way 
beyond that, and the changes here really underscore that we:

  _should not emulate the closed source world_

There the only solution is to binary-patch - because they have no source 
code. But here, we've got all the source code.

splitting the images and simply extending vmlinuz to have a 
'multi-image' format would not only eliminate a huge amount of hookery, 
it would also solve the problem of CONFIG_PARAVIRT bloating the native 
kernel's codepaths by ~7%.

nobody wants to boot a xen-paravirt kernel from a floppy, so image size 
is not an issue. In-RAM overhead would in fact be /reduced/, because 
currently all the paravirt overhead hits both the native and the 
paravirt kernel. Nor would /all/ of the vmlinuz have to be replicated in 
the images - it's enough to replicate only those functions that truly 
differ between the two build methods.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-16  9:57         ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:57 UTC (permalink / raw)
  To: David Miller
  Cc: zach, jeremy, xen-devel, ak, rusty, linux-kernel, chrisw,
	virtualization, anthony, akpm, torvalds

* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri, 16 Mar 2007 10:24:45 +0100
> 
> > ugh. This is beyond ugly! Why dont we just compile two images, one 
> > for Xen and one for native, do two passes to get those two images 
> > and 'merge' them into a single vmlinuz (so that we still have a 
> > 'single' kernel unit to deal with on the distro side). This way we 
> > avoid all this crazy, limited, fragile patchery business...
> 
> Perhaps the problem can be dealt with using ELF relocations.
> 
> There is another case, discussed yesterday on netdev, where run-time 
> resolution of ELF relocations would be useful (for 
> very-very-very-read-only variables) so if it can solve this problem 
> too it would be nice to have a generic infrastructure for it.

yeah, and i really think this is very fundamental: we've already got a 
very nice tool that can do things like detect when we are paravirt and 
optimize and patch things in a machine-specific way. It can even reorder 
instructions and simulate the CPU's pipeline state and do very smart 
optimizations based on that. It's a really neat thing, they call it 
"GCC".

Limited, instruction-level patching like alternatives.h is fine because 
that makes it easier to support multiple, incompatible CPU 
architectures, without having to do a hugely intrusive split at the 
kernel RPM level.

but the level of 'binary patching' done by the paravirt and Xen goes way 
beyond that, and the changes here really underscore that we:

  _should not emulate the closed source world_

There the only solution is to binary-patch - because they have no source 
code. But here, we've got all the source code.

splitting the images and simply extending vmlinuz to have a 
'multi-image' format would not only eliminate a huge amount of hookery, 
it would also solve the problem of CONFIG_PARAVIRT bloating the native 
kernel's codepaths by ~7%.

nobody wants to boot a xen-paravirt kernel from a floppy, so image size 
is not an issue. In-RAM overhead would in fact be /reduced/, because 
currently all the paravirt overhead hits both the native and the 
paravirt kernel. Nor would /all/ of the vmlinuz have to be replicated in 
the images - it's enough to replicate only those functions that truly 
differ between the two build methods.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16  9:28     ` Keir Fraser
@ 2007-03-16  9:58         ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:58 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Jeremy Fitzhardinge, Zachary Amsden, xen-devel, virtualization,
	Rusty Russell, linux-kernel, Chris Wright, Andi Kleen,
	Andrew Morton


* Keir Fraser <Keir.Fraser@xensource.com> wrote:

> On 16/3/07 08:54, "Ingo Molnar" <mingo@elte.hu> wrote:
> 
> >> + prod = intf->in_prod;
> >> + mb();
> >> + BUG_ON((prod - cons) > sizeof(intf->in));
> > 
> > such mb()'s are typically a sign of "i have no clear idea what SMP
> > serialization rules apply here, but something is needed because
> > otherwise it breaks" ?
> 
> These mb()'s are pretty standard for lock-free producer/consumer 
> rings. Write descriptor /then/ write the updated producer. Read the 
> producer /then/ read any descriptors revealed by this new producer 
> value.

then use rmb()/wmb(). Rarely does a ring protocol truly need mb().

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
@ 2007-03-16  9:58         ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16  9:58 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, Andi Kleen,
	Rusty Russell, linux-kernel, Chris Wright, virtualization,
	Andrew Morton


* Keir Fraser <Keir.Fraser@xensource.com> wrote:

> On 16/3/07 08:54, "Ingo Molnar" <mingo@elte.hu> wrote:
> 
> >> + prod = intf->in_prod;
> >> + mb();
> >> + BUG_ON((prod - cons) > sizeof(intf->in));
> > 
> > such mb()'s are typically a sign of "i have no clear idea what SMP
> > serialization rules apply here, but something is needed because
> > otherwise it breaks" ?
> 
> These mb()'s are pretty standard for lock-free producer/consumer 
> rings. Write descriptor /then/ write the updated producer. Read the 
> producer /then/ read any descriptors revealed by this new producer 
> value.

then use rmb()/wmb(). Rarely does a ring protocol truly need mb().

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16  9:58         ` Ingo Molnar
@ 2007-03-16 10:31           ` Keir Fraser
  -1 siblings, 0 replies; 330+ messages in thread
From: Keir Fraser @ 2007-03-16 10:31 UTC (permalink / raw)
  To: Ingo Molnar, Keir Fraser
  Cc: virtualization, xen-devel, Chris Wright, Andi Kleen,
	Andrew Morton, linux-kernel

On 16/3/07 09:58, "Ingo Molnar" <mingo@elte.hu> wrote:

>> These mb()'s are pretty standard for lock-free producer/consumer
>> rings. Write descriptor /then/ write the updated producer. Read the
>> producer /then/ read any descriptors revealed by this new producer
>> value.
> 
> then use rmb()/wmb(). Rarely does a ring protocol truly need mb().

It's needed for writing data /after/ reading the consumer index that shows
you have space to write. Looking through xenbus_comms.c I think all the
barriers are correct except there is a spurious extra mb() in xb_read(),
where there is a later rmb() which is sufficient by itself. All the others
have a purpose.

 -- Keir

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16  9:58         ` Ingo Molnar
                           ` (2 preceding siblings ...)
  (?)
@ 2007-03-16 10:31         ` Keir Fraser
  -1 siblings, 0 replies; 330+ messages in thread
From: Keir Fraser @ 2007-03-16 10:31 UTC (permalink / raw)
  To: Ingo Molnar, Keir Fraser
  Cc: virtualization, xen-devel, Chris Wright, Andi Kleen,
	Andrew Morton, linux-kernel

On 16/3/07 09:58, "Ingo Molnar" <mingo@elte.hu> wrote:

>> These mb()'s are pretty standard for lock-free producer/consumer
>> rings. Write descriptor /then/ write the updated producer. Read the
>> producer /then/ read any descriptors revealed by this new producer
>> value.
> 
> then use rmb()/wmb(). Rarely does a ring protocol truly need mb().

It's needed for writing data /after/ reading the consumer index that shows
you have space to write. Looking through xenbus_comms.c I think all the
barriers are correct except there is a spurious extra mb() in xb_read(),
where there is a later rmb() which is sufficient by itself. All the others
have a purpose.

 -- Keir

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16  9:58         ` Ingo Molnar
  (?)
  (?)
@ 2007-03-16 10:31         ` Keir Fraser
  -1 siblings, 0 replies; 330+ messages in thread
From: Keir Fraser @ 2007-03-16 10:31 UTC (permalink / raw)
  To: Ingo Molnar, Keir Fraser
  Cc: Chris Wright, virtualization, xen-devel, Andi Kleen,
	Andrew Morton, linux-kernel

On 16/3/07 09:58, "Ingo Molnar" <mingo@elte.hu> wrote:

>> These mb()'s are pretty standard for lock-free producer/consumer
>> rings. Write descriptor /then/ write the updated producer. Read the
>> producer /then/ read any descriptors revealed by this new producer
>> value.
> 
> then use rmb()/wmb(). Rarely does a ring protocol truly need mb().

It's needed for writing data /after/ reading the consumer index that shows
you have space to write. Looking through xenbus_comms.c I think all the
barriers are correct except there is a spurious extra mb() in xb_read(),
where there is a later rmb() which is sufficient by itself. All the others
have a purpose.

 -- Keir

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
@ 2007-03-16 10:31           ` Keir Fraser
  0 siblings, 0 replies; 330+ messages in thread
From: Keir Fraser @ 2007-03-16 10:31 UTC (permalink / raw)
  To: Ingo Molnar, Keir Fraser
  Cc: xen-devel, Andi Kleen, linux-kernel, Chris Wright,
	virtualization, Andrew Morton

On 16/3/07 09:58, "Ingo Molnar" <mingo@elte.hu> wrote:

>> These mb()'s are pretty standard for lock-free producer/consumer
>> rings. Write descriptor /then/ write the updated producer. Read the
>> producer /then/ read any descriptors revealed by this new producer
>> value.
> 
> then use rmb()/wmb(). Rarely does a ring protocol truly need mb().

It's needed for writing data /after/ reading the consumer index that shows
you have space to write. Looking through xenbus_comms.c I think all the
barriers are correct except there is a spurious extra mb() in xb_read(),
where there is a later rmb() which is sufficient by itself. All the others
have a purpose.

 -- Keir

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16 10:31           ` Keir Fraser
@ 2007-03-16 11:41             ` Andrew Morton
  -1 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2007-03-16 11:41 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Ingo Molnar, Keir Fraser, virtualization, xen-devel,
	Chris Wright, Andi Kleen, linux-kernel

On Fri, 16 Mar 2007 10:31:49 +0000 Keir Fraser <keir@xensource.com> wrote:

> On 16/3/07 09:58, "Ingo Molnar" <mingo@elte.hu> wrote:
> 
> >> These mb()'s are pretty standard for lock-free producer/consumer
> >> rings. Write descriptor /then/ write the updated producer. Read the
> >> producer /then/ read any descriptors revealed by this new producer
> >> value.
> > 
> > then use rmb()/wmb(). Rarely does a ring protocol truly need mb().
> 
> It's needed for writing data /after/ reading the consumer index that shows
> you have space to write. Looking through xenbus_comms.c I think all the
> barriers are correct except there is a spurious extra mb() in xb_read(),
> where there is a later rmb() which is sufficient by itself. All the others
> have a purpose.
> 

If Ingo couldn't work this out from reading the code then nobody else can,
and we have a maintainability problem which can only be solved with
adequate commenting.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
@ 2007-03-16 11:41             ` Andrew Morton
  0 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2007-03-16 11:41 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Ingo Molnar, Keir Fraser, virtualization, xen-devel,
	Chris Wright, Andi Kleen, linux-kernel

On Fri, 16 Mar 2007 10:31:49 +0000 Keir Fraser <keir@xensource.com> wrote:

> On 16/3/07 09:58, "Ingo Molnar" <mingo@elte.hu> wrote:
> 
> >> These mb()'s are pretty standard for lock-free producer/consumer
> >> rings. Write descriptor /then/ write the updated producer. Read the
> >> producer /then/ read any descriptors revealed by this new producer
> >> value.
> > 
> > then use rmb()/wmb(). Rarely does a ring protocol truly need mb().
> 
> It's needed for writing data /after/ reading the consumer index that shows
> you have space to write. Looking through xenbus_comms.c I think all the
> barriers are correct except there is a spurious extra mb() in xb_read(),
> where there is a later rmb() which is sufficient by itself. All the others
> have a purpose.
> 

If Ingo couldn't work this out from reading the code then nobody else can,
and we have a maintainability problem which can only be solved with
adequate commenting.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16 11:41             ` Andrew Morton
@ 2007-03-16 11:58               ` Keir Fraser
  -1 siblings, 0 replies; 330+ messages in thread
From: Keir Fraser @ 2007-03-16 11:58 UTC (permalink / raw)
  To: Andrew Morton, Keir Fraser
  Cc: Wright, virtualization, xen-devel, Andi Kleen, Chris,
	Ingo Molnar, linux-kernel, Keir Fraser

On 16/3/07 11:41, "Andrew Morton" <akpm@linux-foundation.org> wrote:

>> It's needed for writing data /after/ reading the consumer index that shows
>> you have space to write. Looking through xenbus_comms.c I think all the
>> barriers are correct except there is a spurious extra mb() in xb_read(),
>> where there is a later rmb() which is sufficient by itself. All the others
>> have a purpose.
>> 
> 
> If Ingo couldn't work this out from reading the code then nobody else can,
> and we have a maintainability problem which can only be solved with
> adequate commenting.

Agreed. I've fixed up the commenting, removed the spurious mb(), and I just
need Jeremy to merge that change into the Xen pv_ops patchset. :-)

 -- Keir


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16 11:41             ` Andrew Morton
  (?)
@ 2007-03-16 11:58             ` Keir Fraser
  -1 siblings, 0 replies; 330+ messages in thread
From: Keir Fraser @ 2007-03-16 11:58 UTC (permalink / raw)
  To: Andrew Morton, Keir Fraser
  Cc: Wright, virtualization, xen-devel, Andi Kleen, Chris,
	Ingo Molnar, linux-kernel, Keir Fraser

On 16/3/07 11:41, "Andrew Morton" <akpm@linux-foundation.org> wrote:

>> It's needed for writing data /after/ reading the consumer index that shows
>> you have space to write. Looking through xenbus_comms.c I think all the
>> barriers are correct except there is a spurious extra mb() in xb_read(),
>> where there is a later rmb() which is sufficient by itself. All the others
>> have a purpose.
>> 
> 
> If Ingo couldn't work this out from reading the code then nobody else can,
> and we have a maintainability problem which can only be solved with
> adequate commenting.

Agreed. I've fixed up the commenting, removed the spurious mb(), and I just
need Jeremy to merge that change into the Xen pv_ops patchset. :-)

 -- Keir

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16 11:41             ` Andrew Morton
                               ` (2 preceding siblings ...)
  (?)
@ 2007-03-16 11:58             ` Keir Fraser
  -1 siblings, 0 replies; 330+ messages in thread
From: Keir Fraser @ 2007-03-16 11:58 UTC (permalink / raw)
  To: Andrew Morton, Keir Fraser
  Cc: Wright, virtualization, xen-devel, Andi Kleen, Ingo Molnar,
	Chris, linux-kernel, Keir Fraser

On 16/3/07 11:41, "Andrew Morton" <akpm@linux-foundation.org> wrote:

>> It's needed for writing data /after/ reading the consumer index that shows
>> you have space to write. Looking through xenbus_comms.c I think all the
>> barriers are correct except there is a spurious extra mb() in xb_read(),
>> where there is a later rmb() which is sufficient by itself. All the others
>> have a purpose.
>> 
> 
> If Ingo couldn't work this out from reading the code then nobody else can,
> and we have a maintainability problem which can only be solved with
> adequate commenting.

Agreed. I've fixed up the commenting, removed the spurious mb(), and I just
need Jeremy to merge that change into the Xen pv_ops patchset. :-)

 -- Keir

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
@ 2007-03-16 11:58               ` Keir Fraser
  0 siblings, 0 replies; 330+ messages in thread
From: Keir Fraser @ 2007-03-16 11:58 UTC (permalink / raw)
  To: Andrew Morton, Keir Fraser
  Cc: xen-devel, Andi Kleen, linux-kernel, Keir Fraser, Wright,
	virtualization, Ingo Molnar, Chris

On 16/3/07 11:41, "Andrew Morton" <akpm@linux-foundation.org> wrote:

>> It's needed for writing data /after/ reading the consumer index that shows
>> you have space to write. Looking through xenbus_comms.c I think all the
>> barriers are correct except there is a spurious extra mb() in xb_read(),
>> where there is a later rmb() which is sufficient by itself. All the others
>> have a purpose.
>> 
> 
> If Ingo couldn't work this out from reading the code then nobody else can,
> and we have a maintainability problem which can only be solved with
> adequate commenting.

Agreed. I've fixed up the commenting, removed the spurious mb(), and I just
need Jeremy to merge that change into the Xen pv_ops patchset. :-)

 -- Keir

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-16  9:14     ` Ingo Molnar
  (?)
@ 2007-03-16 12:00     ` Christoph Hellwig
  -1 siblings, 0 replies; 330+ messages in thread
From: Christoph Hellwig @ 2007-03-16 12:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Adrian Bunk,
	Thomas Gleixner

On Fri, Mar 16, 2007 at 10:14:11AM +0100, Ingo Molnar wrote:
> 
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > Core Xen Implementation
> > 
> > This patch is a rollup of all the core pieces of the Xen 
> > implementation, including booting, memory management, interrupts, time 
> > and so on.
> 
> > --- a/arch/i386/kernel/head.S
> > +++ b/arch/i386/kernel/head.S
> > @@ -535,6 +535,10 @@ unhandled_paravirt:
> >  	ud2
> >  #endif
> >  
> > +#ifdef CONFIG_XEN
> > +#include "../xen/xen-head.S"
> > +#endif
> 
> i'd suggest to remove the #ifdef and push it into xen-head.S.

Even better would be an explanation of why this needs to be included
at all.  Life would be much simpler if this was simply a totally
separate file.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
  2007-03-16  8:47     ` Ingo Molnar
@ 2007-03-16 16:18       ` Chris Wright
  -1 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 16:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Greg KH

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > This communicates with the machine control software via a registry 
> > residing in a controlling virtual machine. This allows dynamic 
> > creation, destruction and modification of virtual device 
> > configurations (network devices, block devices and CPUS, to name some 
> > examples).
> 
> should be reviewed by the driver-core folks too i guess?

Greg has gone through it a couple times.  Any more feedback would
be useful.

> > +//#include <asm/maddr.h>
> 
> > +//#include <xen/xen_proc.h>
> > +//#include <xen/evtchn.h>
> 
> > +//#include <xen/hvm.h>
> 
> > +//		.shutdown = xenbus_dev_shutdown,
> 
> hm?

fixing that up, not sure where that snuck in.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
@ 2007-03-16 16:18       ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 16:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: xen-devel, Ian Pratt, linux-kernel, Chris Wright, virtualization,
	Andrew Morton, Greg KH, Andi Kleen

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > This communicates with the machine control software via a registry 
> > residing in a controlling virtual machine. This allows dynamic 
> > creation, destruction and modification of virtual device 
> > configurations (network devices, block devices and CPUS, to name some 
> > examples).
> 
> should be reviewed by the driver-core folks too i guess?

Greg has gone through it a couple times.  Any more feedback would
be useful.

> > +//#include <asm/maddr.h>
> 
> > +//#include <xen/xen_proc.h>
> > +//#include <xen/evtchn.h>
> 
> > +//#include <xen/hvm.h>
> 
> > +//		.shutdown = xenbus_dev_shutdown,
> 
> hm?

fixing that up, not sure where that snuck in.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
  2007-03-16 16:18       ` Chris Wright
@ 2007-03-16 16:19         ` Ingo Molnar
  -1 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16 16:19 UTC (permalink / raw)
  To: Chris Wright
  Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Greg KH


* Chris Wright <chrisw@sous-sol.org> wrote:

> * Ingo Molnar (mingo@elte.hu) wrote:
> > 
> > * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> > 
> > > This communicates with the machine control software via a registry 
> > > residing in a controlling virtual machine. This allows dynamic 
> > > creation, destruction and modification of virtual device 
> > > configurations (network devices, block devices and CPUS, to name some 
> > > examples).
> > 
> > should be reviewed by the driver-core folks too i guess?
> 
> Greg has gone through it a couple times.  Any more feedback would be 
> useful.

please indicate any such review feedback (and approval) via 
signed-off-by lines done by that person.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
@ 2007-03-16 16:19         ` Ingo Molnar
  0 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-16 16:19 UTC (permalink / raw)
  To: Chris Wright
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, Ian Pratt,
	Andi Kleen, Rusty Russell, linux-kernel, virtualization,
	Andrew Morton, Greg KH, Christian Limpach


* Chris Wright <chrisw@sous-sol.org> wrote:

> * Ingo Molnar (mingo@elte.hu) wrote:
> > 
> > * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> > 
> > > This communicates with the machine control software via a registry 
> > > residing in a controlling virtual machine. This allows dynamic 
> > > creation, destruction and modification of virtual device 
> > > configurations (network devices, block devices and CPUS, to name some 
> > > examples).
> > 
> > should be reviewed by the driver-core folks too i guess?
> 
> Greg has gone through it a couple times.  Any more feedback would be 
> useful.

please indicate any such review feedback (and approval) via 
signed-off-by lines done by that person.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-16  9:14     ` Ingo Molnar
@ 2007-03-16 16:33       ` Chris Wright
  -1 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 16:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Adrian Bunk,
	Thomas Gleixner

* Ingo Molnar (mingo@elte.hu) wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > Core Xen Implementation
> > 
> > This patch is a rollup of all the core pieces of the Xen 
> > implementation, including booting, memory management, interrupts, time 
> > and so on.
> 
> > --- a/arch/i386/kernel/head.S
> > +++ b/arch/i386/kernel/head.S
> > @@ -535,6 +535,10 @@ unhandled_paravirt:
> >  	ud2
> >  #endif
> >  
> > +#ifdef CONFIG_XEN
> > +#include "../xen/xen-head.S"
> > +#endif
> 
> i'd suggest to remove the #ifdef and push it into xen-head.S.

That's been fixed, the two are built as seperate objects now.

> > @@ -437,9 +437,9 @@ static unsigned long native_store_tr(voi
> >  
> >  static void native_load_tls(struct thread_struct *t, unsigned int cpu)
> >  {
> > -#define C(i) get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i]
> > -	C(0); C(1); C(2);
> > -#undef C
> > +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 0] = t->tls_array[0];
> > +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 1] = t->tls_array[1];
> > +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 2] = t->tls_array[2];
> >  }
> 
> this is a cleanup unrelated to the purpose of the patch.

Sure, will split out.

> > +static void xen_safe_halt(void)
> > +{
> > +	stop_hz_timer();
> > +	/* Blocking includes an implicit local_irq_enable(). */
> > +	if (HYPERVISOR_sched_op(SCHEDOP_block, 0) != 0)
> > +		BUG();
> > +	start_hz_timer();
> > +}
> 
> change this to tick_nohz_stop_sched_tick()/restart_sched_tick() instead.

I don't think we want to do that.  It's dropped now with proper NO_HZ
support, as the core does that for us.

> > +static void xen_halt(void)
> > +{
> > +#if 0
> > +	if (irqs_disabled())
> > +		HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL);
> > +#endif
> 
> hm?

That's already been fixed.

> > +		op->arg1.linear_addr = arbitrary_virt_to_machine((unsigned long)addr).maddr;
> 
> 80+ chars line. (there are more instances of this throughout the patch)
> 
> > +	for (va = dtr->address, f = 0;
> > +	     va < dtr->address + size;
> > +	     va += PAGE_SIZE, f++) {
> > +		frames[f] = virt_to_mfn(va);
> > +		make_lowmem_page_readonly((void *)va);
> > +	}
> 
> coding style.
> 
> > +	/* This is used very early, so we can't rely on per-cpu data
> > +	   being set up, so no multicalls */
> 
> comment coding style. (there are instances of this throughout the patch)
> 
> > +	spin_lock(&lock);
> > +	for(in = out = 0; in < count; in++) {
> 
> missing whitespace.
> 
> > +		if (HYPERVISOR_update_descriptor(virt_to_machine(dt + entry*8).maddr,
> 
> 80+ chars.

cleaning those

> > +static void xen_set_iopl_mask(unsigned mask)
> > +{
> > +#if 0
> > +	struct physdev_set_iopl set_iopl;
> > +
> > +	/* Force the change at ring 0. */
> > +	set_iopl.iopl = (mask == 0) ? 1 : (mask >> 12) & 3;
> > +	HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl);
> > +#endif
> 
> hm?
> 
> > +	}
> > +
> > +	write_pda(xen.cr3, cr3);
> > +
> > +
> > +	{
> 
> unnecessary newline.

dropped

> > +static struct irq_chip xen_dynamic_chip __read_mostly = {
> > +	.name		= "xen-virq",
> > +	.mask		= disable_dynirq,
> > +	.unmask		= enable_dynirq,
> > +	.ack		= ack_dynirq,
> > +	.set_affinity	= set_affinity_irq,
> > +	.retrigger	= retrigger_dynirq,
> > +};
> 
> nicely done! :-)
> 
> > +void xen_set_pte(pte_t *ptep, pte_t pte)
> > +{
> > +#if 1
> > +	struct mmu_update u;
> > +
> > +	u.ptr = virt_to_machine(ptep).maddr;
> > +	u.val = pte_val_ma(pte);
> > +	if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
> > +		BUG();
> > +#else
> > +	ptep->pte_high = pte.pte_high;
> > +	smp_wmb();
> > +	ptep->pte_low = pte.pte_low;
> > +#endif
> 
> hm?

that's already been fixed

> > +	pgd = swapper_pg_dir + pgd_index(vaddr);
> > +	if (pgd_none(*pgd)) {
> > +		BUG();
> > +		return;
> 
> hm?
that's already been fixed

> > +void xen_pte_clear(struct mm_struct *mm, unsigned long addr,pte_t *ptep)
> > +{
> > +#if 1
> > +	ptep->pte_low = 0;
> > +	smp_wmb();
> > +	ptep->pte_high = 0;
> > +#else
> > +	set_64bit((u64 *)ptep, 0);
> > +#endif
> 
> hm?

that's already been fixed

> > +unsigned long xen_pmd_val(pmd_t pmd)
> > +{
> > +	BUG();
> > +	return 0;
> > +}
> 
> make it noret.

It's actually used now, so it's not noret-able (if that's a word ;-)
> 
> > +pmd_t xen_make_pmd(unsigned long pmd)
> > +{
> > +	BUG();
> > +	return __pmd(0);
> > +}
> 
> ditto.

ditto

> > +static void xen_timer_interrupt_hook(void)
> > +{
> 
> > +				runstate->time[RUNSTATE_offline] -
> > +				__get_cpu_var(processed_stolen_time);
> 
> NACK for now, for the reasons outlined in the 'stolen time' thread. 
> Stolen time accounting is a concept only related to the scheduler tick, 
> it's not a concept that should leak into normal timer interrupt 
> concepts.

As Jeremey mentioned in that thread, it's just used as a place
to tigger the caluculations periodically.

> > +	/* System-wide jiffy work. */
> > +	ticks = 0;
> > +	while(delta > NS_PER_TICK) {
> > +		delta -= NS_PER_TICK;
> > +		processed_system_time += NS_PER_TICK;
> > +		ticks++;
> > +	}
> > +	do_timer(ticks);
> 
> that should be updated to clockevents. (i suspect it already is?)

Yes it is.

> > +
> > +#if 0
> > +	if (unlikely((mfn >> machine_to_phys_order) != 0))
> > +		return max_mapnr;
> > +#endif
> 
> hm?

Good point, I don't recall the condition where that would get
triggered.

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
@ 2007-03-16 16:33       ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 16:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, Ian Pratt,
	Andi Kleen, Rusty Russell, linux-kernel, Adrian Bunk,
	Chris Wright, virtualization, Andrew Morton, Thomas Gleixner,
	Christian Limpach

* Ingo Molnar (mingo@elte.hu) wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > Core Xen Implementation
> > 
> > This patch is a rollup of all the core pieces of the Xen 
> > implementation, including booting, memory management, interrupts, time 
> > and so on.
> 
> > --- a/arch/i386/kernel/head.S
> > +++ b/arch/i386/kernel/head.S
> > @@ -535,6 +535,10 @@ unhandled_paravirt:
> >  	ud2
> >  #endif
> >  
> > +#ifdef CONFIG_XEN
> > +#include "../xen/xen-head.S"
> > +#endif
> 
> i'd suggest to remove the #ifdef and push it into xen-head.S.

That's been fixed, the two are built as seperate objects now.

> > @@ -437,9 +437,9 @@ static unsigned long native_store_tr(voi
> >  
> >  static void native_load_tls(struct thread_struct *t, unsigned int cpu)
> >  {
> > -#define C(i) get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i]
> > -	C(0); C(1); C(2);
> > -#undef C
> > +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 0] = t->tls_array[0];
> > +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 1] = t->tls_array[1];
> > +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 2] = t->tls_array[2];
> >  }
> 
> this is a cleanup unrelated to the purpose of the patch.

Sure, will split out.

> > +static void xen_safe_halt(void)
> > +{
> > +	stop_hz_timer();
> > +	/* Blocking includes an implicit local_irq_enable(). */
> > +	if (HYPERVISOR_sched_op(SCHEDOP_block, 0) != 0)
> > +		BUG();
> > +	start_hz_timer();
> > +}
> 
> change this to tick_nohz_stop_sched_tick()/restart_sched_tick() instead.

I don't think we want to do that.  It's dropped now with proper NO_HZ
support, as the core does that for us.

> > +static void xen_halt(void)
> > +{
> > +#if 0
> > +	if (irqs_disabled())
> > +		HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL);
> > +#endif
> 
> hm?

That's already been fixed.

> > +		op->arg1.linear_addr = arbitrary_virt_to_machine((unsigned long)addr).maddr;
> 
> 80+ chars line. (there are more instances of this throughout the patch)
> 
> > +	for (va = dtr->address, f = 0;
> > +	     va < dtr->address + size;
> > +	     va += PAGE_SIZE, f++) {
> > +		frames[f] = virt_to_mfn(va);
> > +		make_lowmem_page_readonly((void *)va);
> > +	}
> 
> coding style.
> 
> > +	/* This is used very early, so we can't rely on per-cpu data
> > +	   being set up, so no multicalls */
> 
> comment coding style. (there are instances of this throughout the patch)
> 
> > +	spin_lock(&lock);
> > +	for(in = out = 0; in < count; in++) {
> 
> missing whitespace.
> 
> > +		if (HYPERVISOR_update_descriptor(virt_to_machine(dt + entry*8).maddr,
> 
> 80+ chars.

cleaning those

> > +static void xen_set_iopl_mask(unsigned mask)
> > +{
> > +#if 0
> > +	struct physdev_set_iopl set_iopl;
> > +
> > +	/* Force the change at ring 0. */
> > +	set_iopl.iopl = (mask == 0) ? 1 : (mask >> 12) & 3;
> > +	HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl);
> > +#endif
> 
> hm?
> 
> > +	}
> > +
> > +	write_pda(xen.cr3, cr3);
> > +
> > +
> > +	{
> 
> unnecessary newline.

dropped

> > +static struct irq_chip xen_dynamic_chip __read_mostly = {
> > +	.name		= "xen-virq",
> > +	.mask		= disable_dynirq,
> > +	.unmask		= enable_dynirq,
> > +	.ack		= ack_dynirq,
> > +	.set_affinity	= set_affinity_irq,
> > +	.retrigger	= retrigger_dynirq,
> > +};
> 
> nicely done! :-)
> 
> > +void xen_set_pte(pte_t *ptep, pte_t pte)
> > +{
> > +#if 1
> > +	struct mmu_update u;
> > +
> > +	u.ptr = virt_to_machine(ptep).maddr;
> > +	u.val = pte_val_ma(pte);
> > +	if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
> > +		BUG();
> > +#else
> > +	ptep->pte_high = pte.pte_high;
> > +	smp_wmb();
> > +	ptep->pte_low = pte.pte_low;
> > +#endif
> 
> hm?

that's already been fixed

> > +	pgd = swapper_pg_dir + pgd_index(vaddr);
> > +	if (pgd_none(*pgd)) {
> > +		BUG();
> > +		return;
> 
> hm?
that's already been fixed

> > +void xen_pte_clear(struct mm_struct *mm, unsigned long addr,pte_t *ptep)
> > +{
> > +#if 1
> > +	ptep->pte_low = 0;
> > +	smp_wmb();
> > +	ptep->pte_high = 0;
> > +#else
> > +	set_64bit((u64 *)ptep, 0);
> > +#endif
> 
> hm?

that's already been fixed

> > +unsigned long xen_pmd_val(pmd_t pmd)
> > +{
> > +	BUG();
> > +	return 0;
> > +}
> 
> make it noret.

It's actually used now, so it's not noret-able (if that's a word ;-)
> 
> > +pmd_t xen_make_pmd(unsigned long pmd)
> > +{
> > +	BUG();
> > +	return __pmd(0);
> > +}
> 
> ditto.

ditto

> > +static void xen_timer_interrupt_hook(void)
> > +{
> 
> > +				runstate->time[RUNSTATE_offline] -
> > +				__get_cpu_var(processed_stolen_time);
> 
> NACK for now, for the reasons outlined in the 'stolen time' thread. 
> Stolen time accounting is a concept only related to the scheduler tick, 
> it's not a concept that should leak into normal timer interrupt 
> concepts.

As Jeremey mentioned in that thread, it's just used as a place
to tigger the caluculations periodically.

> > +	/* System-wide jiffy work. */
> > +	ticks = 0;
> > +	while(delta > NS_PER_TICK) {
> > +		delta -= NS_PER_TICK;
> > +		processed_system_time += NS_PER_TICK;
> > +		ticks++;
> > +	}
> > +	do_timer(ticks);
> 
> that should be updated to clockevents. (i suspect it already is?)

Yes it is.

> > +
> > +#if 0
> > +	if (unlikely((mfn >> machine_to_phys_order) != 0))
> > +		return max_mapnr;
> > +#endif
> 
> hm?

Good point, I don't recall the condition where that would get
triggered.

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
  2007-03-16 16:19         ` Ingo Molnar
@ 2007-03-16 16:40           ` Chris Wright
  -1 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 16:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Wright, Jeremy Fitzhardinge, Andi Kleen, Andrew Morton,
	linux-kernel, virtualization, xen-devel, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Greg KH

* Ingo Molnar (mingo@elte.hu) wrote:
> please indicate any such review feedback (and approval) via 
> signed-off-by lines done by that person.

Certainly, Greg didn't sign off, he simply gave review feedback.

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
@ 2007-03-16 16:40           ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 16:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, Ian Pratt,
	virtualization, Rusty Russell, linux-kernel, Chris Wright,
	Andi Kleen, Andrew Morton, Greg KH, Christian Limpach

* Ingo Molnar (mingo@elte.hu) wrote:
> please indicate any such review feedback (and approval) via 
> signed-off-by lines done by that person.

Certainly, Greg didn't sign off, he simply gave review feedback.

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-16 16:33       ` Chris Wright
  (?)
@ 2007-03-16 16:44       ` Jeremy Fitzhardinge
  2007-03-16 16:57           ` Chris Wright
  -1 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 16:44 UTC (permalink / raw)
  To: Chris Wright
  Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Adrian Bunk, Thomas Gleixner

Chris Wright wrote:
> * Ingo Molnar (mingo@elte.hu) wrote:
>   
>> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>>
>>     
>>> Core Xen Implementation
>>>
>>> This patch is a rollup of all the core pieces of the Xen 
>>> implementation, including booting, memory management, interrupts, time 
>>> and so on.
>>>       
>>> --- a/arch/i386/kernel/head.S
>>> +++ b/arch/i386/kernel/head.S
>>> @@ -535,6 +535,10 @@ unhandled_paravirt:
>>>  	ud2
>>>  #endif
>>>  
>>> +#ifdef CONFIG_XEN
>>> +#include "../xen/xen-head.S"
>>> +#endif
>>>       
>> i'd suggest to remove the #ifdef and push it into xen-head.S.
>>     
>
> That's been fixed, the two are built as seperate objects now.
>   

Actually, we tried it but it causes bad kernel images with some
binutils, so it has to be included for now.

>>> @@ -437,9 +437,9 @@ static unsigned long native_store_tr(voi
>>>  
>>>  static void native_load_tls(struct thread_struct *t, unsigned int cpu)
>>>  {
>>> -#define C(i) get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i]
>>> -	C(0); C(1); C(2);
>>> -#undef C
>>> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 0] = t->tls_array[0];
>>> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 1] = t->tls_array[1];
>>> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 2] = t->tls_array[2];
>>>  }
>>>       
>> this is a cleanup unrelated to the purpose of the patch.
>>     
>
> Sure, will split out.
>   

Already done in Rusty's cleanup patch.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 15/26] Xen-paravirt_ops: Add apply_to_page_range() which applies a function to a pte range.
  2007-03-16  9:19     ` Ingo Molnar
@ 2007-03-16 16:47       ` Chris Wright
  -1 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 16:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Christoph Lameter

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > Add a new mm function apply_to_page_range() which applies a given 
> > function to every pte in a given virtual address range in a given mm 
> > structure. This is a generic alternative to cut-and-pasting the Linux 
> > idiomatic pagetable walking code in every place that a sequence of 
> > PTEs must be accessed.
> 
> nice one! I suspect we could simplify some of the less 
> performance-critical open-coded pagetable walker loops in the kernel 
> with this? (i'm not even sure it's all that much of a performance loss 
> to pass a function pointer around - would be nice to convert say 
> mprotect() to this and see the performance difference?)

We went through this before, and eventually got kick back to
keep it Xen specific.  There's still some effort for generalized
walkers from the Gelato folks IIRC.  And we can certainly adapt.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 15/26] Xen-paravirt_ops: Add apply_to_page_range() which applies a function to a pte range.
@ 2007-03-16 16:47       ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 16:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, Ian Pratt,
	Andi Kleen, Rusty Russell, linux-kernel, Chris Wright,
	virtualization, Andrew Morton, Christoph Lameter,
	Christian Limpach

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > Add a new mm function apply_to_page_range() which applies a given 
> > function to every pte in a given virtual address range in a given mm 
> > structure. This is a generic alternative to cut-and-pasting the Linux 
> > idiomatic pagetable walking code in every place that a sequence of 
> > PTEs must be accessed.
> 
> nice one! I suspect we could simplify some of the less 
> performance-critical open-coded pagetable walker loops in the kernel 
> with this? (i'm not even sure it's all that much of a performance loss 
> to pass a function pointer around - would be nice to convert say 
> mprotect() to this and see the performance difference?)

We went through this before, and eventually got kick back to
keep it Xen specific.  There's still some effort for generalized
walkers from the Gelato folks IIRC.  And we can certainly adapt.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
  2007-03-16  8:42   ` Ingo Molnar
@ 2007-03-16 16:55     ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 16:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Jens Axboe, Jeff Garzik

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>>  * virtual block device (blockfront)
>>  * virtual network device (netfront)
>>     
>
> note, these drivers should be submitted through the proper block drivers 
> and network drivers review process - not via the x86_64 tree.
>   

Yes; who should look at them?  I posted netfront to net-dev, and Stephen
Hemminger commented on it on the last repost, but I'd love them to get
more scrutiny.

Obviously they're not actually useful without Xen (and vice-versa), so
there's not much point in committing them separately.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
@ 2007-03-16 16:55     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 16:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Jens Axboe,
	Andrew Morton, Jeff Garzik

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>>  * virtual block device (blockfront)
>>  * virtual network device (netfront)
>>     
>
> note, these drivers should be submitted through the proper block drivers 
> and network drivers review process - not via the x86_64 tree.
>   

Yes; who should look at them?  I posted netfront to net-dev, and Stephen
Hemminger commented on it on the last repost, but I'd love them to get
more scrutiny.

Obviously they're not actually useful without Xen (and vice-versa), so
there's not much point in committing them separately.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
  2007-03-16  8:47     ` Ingo Molnar
@ 2007-03-16 16:57       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 16:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Greg KH

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> This communicates with the machine control software via a registry 
>> residing in a controlling virtual machine. This allows dynamic 
>> creation, destruction and modification of virtual device 
>> configurations (network devices, block devices and CPUS, to name some 
>> examples).
>>     
>
> should be reviewed by the driver-core folks too i guess?
>
>   
>> +//#include <asm/maddr.h>
>>     
>
>   
>> +//#include <xen/xen_proc.h>
>> +//#include <xen/evtchn.h>
>>     
>
>   
>> +//#include <xen/hvm.h>
>>     
>
>   
>> +//		.shutdown = xenbus_dev_shutdown,
>>     
>
> hm?
>   

Yeah, gunk.  I'll clean that out.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
@ 2007-03-16 16:57       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 16:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Chris Wright, Andi Kleen,
	Andrew Morton, Greg KH, Christian Limpach

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> This communicates with the machine control software via a registry 
>> residing in a controlling virtual machine. This allows dynamic 
>> creation, destruction and modification of virtual device 
>> configurations (network devices, block devices and CPUS, to name some 
>> examples).
>>     
>
> should be reviewed by the driver-core folks too i guess?
>
>   
>> +//#include <asm/maddr.h>
>>     
>
>   
>> +//#include <xen/xen_proc.h>
>> +//#include <xen/evtchn.h>
>>     
>
>   
>> +//#include <xen/hvm.h>
>>     
>
>   
>> +//		.shutdown = xenbus_dev_shutdown,
>>     
>
> hm?
>   

Yeah, gunk.  I'll clean that out.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-16 16:44       ` Jeremy Fitzhardinge
@ 2007-03-16 16:57           ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 16:57 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Ingo Molnar, Andi Kleen, Andrew Morton,
	linux-kernel, virtualization, xen-devel, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Adrian Bunk,
	Thomas Gleixner

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Chris Wright wrote:
> > That's been fixed, the two are built as seperate objects now.
> 
> Actually, we tried it but it causes bad kernel images with some
> binutils, so it has to be included for now.

Ah, missed that.

> >>> @@ -437,9 +437,9 @@ static unsigned long native_store_tr(voi
> >>>  
> >>>  static void native_load_tls(struct thread_struct *t, unsigned int cpu)
> >>>  {
> >>> -#define C(i) get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i]
> >>> -	C(0); C(1); C(2);
> >>> -#undef C
> >>> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 0] = t->tls_array[0];
> >>> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 1] = t->tls_array[1];
> >>> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 2] = t->tls_array[2];
> >>>  }
> >>>       
> >> this is a cleanup unrelated to the purpose of the patch.
> >>     
> >
> > Sure, will split out.
> >   
> 
> Already done in Rusty's cleanup patch.

Hmm, not here, I'll fix it up.

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
@ 2007-03-16 16:57           ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 16:57 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Adrian Bunk, Chris Wright,
	Andi Kleen, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Christian Limpach

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Chris Wright wrote:
> > That's been fixed, the two are built as seperate objects now.
> 
> Actually, we tried it but it causes bad kernel images with some
> binutils, so it has to be included for now.

Ah, missed that.

> >>> @@ -437,9 +437,9 @@ static unsigned long native_store_tr(voi
> >>>  
> >>>  static void native_load_tls(struct thread_struct *t, unsigned int cpu)
> >>>  {
> >>> -#define C(i) get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i]
> >>> -	C(0); C(1); C(2);
> >>> -#undef C
> >>> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 0] = t->tls_array[0];
> >>> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 1] = t->tls_array[1];
> >>> +	get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + 2] = t->tls_array[2];
> >>>  }
> >>>       
> >> this is a cleanup unrelated to the purpose of the patch.
> >>     
> >
> > Sure, will split out.
> >   
> 
> Already done in Rusty's cleanup patch.

Hmm, not here, I'll fix it up.

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 23/26] Xen-paravirt_ops: Add Xen grant table support
  2007-03-16  8:51     ` Ingo Molnar
@ 2007-03-16 17:00       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach

Ingo Molnar wrote:
>> +
>> +#ifndef __ia64__
>> +	{
>>     
>
> introduce a proper arch method instead.
>   

I'll clean out the ia64 stuff.  We can add it back properly when (if?)
the time comes.

>> +	grant_ref_t g = *private_head;
>> +	if (unlikely(g == GNTTAB_LIST_END))
>>     
>
> coding style problems.

Yup.



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 23/26] Xen-paravirt_ops: Add Xen grant table support
@ 2007-03-16 17:00       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Chris Wright, Andi Kleen,
	Andrew Morton, Christian Limpach

Ingo Molnar wrote:
>> +
>> +#ifndef __ia64__
>> +	{
>>     
>
> introduce a proper arch method instead.
>   

I'll clean out the ia64 stuff.  We can add it back properly when (if?)
the time comes.

>> +	grant_ref_t g = *private_head;
>> +	if (unlikely(g == GNTTAB_LIST_END))
>>     
>
> coding style problems.

Yup.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16  8:54     ` Ingo Molnar
@ 2007-03-16 17:02       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell

Ingo Molnar wrote:
>> +	prod = intf->in_prod;
>> +	mb();
>> +	BUG_ON((prod - cons) > sizeof(intf->in));
>>     
>
> such mb()'s are typically a sign of "i have no clear idea what SMP 
> serialization rules apply here, but something is needed because 
> otherwise it breaks" ?

Hm, in this case its because it's sharing the memory with Xen, so
there's a particular ordering protocol.  It needs some comments.

    J


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
@ 2007-03-16 17:02       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton

Ingo Molnar wrote:
>> +	prod = intf->in_prod;
>> +	mb();
>> +	BUG_ON((prod - cons) > sizeof(intf->in));
>>     
>
> such mb()'s are typically a sign of "i have no clear idea what SMP 
> serialization rules apply here, but something is needed because 
> otherwise it breaks" ?

Hm, in this case its because it's sharing the memory with Xen, so
there's a particular ordering protocol.  It needs some comments.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 18/26] Xen-paravirt_ops: Add XEN config options
  2007-03-16  9:14     ` Ingo Molnar
@ 2007-03-16 17:04       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Thomas Gleixner

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> Xen is currently incompatible with:
>> - PREEMPT
>>     
>
> why? That should be fixed.

Actually I don't think that's true any more.  We'll need to work out how
to deal with preempt and suspend/resume, but the process freezer seems
like the way to do that.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 18/26] Xen-paravirt_ops: Add XEN config options
@ 2007-03-16 17:04       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Chris Wright, Andi Kleen,
	Andrew Morton, Thomas Gleixner, Christian Limpach

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> Xen is currently incompatible with:
>> - PREEMPT
>>     
>
> why? That should be fixed.

Actually I don't think that's true any more.  We'll need to work out how
to deal with preempt and suspend/resume, but the process freezer seems
like the way to do that.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 16/26] Xen-paravirt_ops: Allocate and free vmalloc areas
  2007-03-16  9:16     ` Ingo Molnar
@ 2007-03-16 17:05       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Jan Beulich

Ingo Molnar wrote:
>> +struct vm_struct *alloc_vm_area(unsigned long size)
>>     
>
> this should go into mm/vmalloc.c.

The trouble is that it calls the arch-dependent vmalloc_sync_all(),
which is important for Xen's use of this function.  And without that
it's just a pointless wrapper around get_vm_area.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 16/26] Xen-paravirt_ops: Allocate and free vmalloc areas
@ 2007-03-16 17:05       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Jan Beulich, Chris Wright,
	Andi Kleen, Andrew Morton, Christian Limpach

Ingo Molnar wrote:
>> +struct vm_struct *alloc_vm_area(unsigned long size)
>>     
>
> this should go into mm/vmalloc.c.

The trouble is that it calls the arch-dependent vmalloc_sync_all(),
which is important for Xen's use of this function.  And without that
it's just a pointless wrapper around get_vm_area.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 15/26] Xen-paravirt_ops: Add apply_to_page_range() which applies a function to a pte range.
  2007-03-16  9:19     ` Ingo Molnar
@ 2007-03-16 17:08       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Ian Pratt, Christian Limpach, Christoph Lameter

Ingo Molnar wrote:
> nice one! I suspect we could simplify some of the less 
> performance-critical open-coded pagetable walker loops in the kernel 
> with this? (i'm not even sure it's all that much of a performance loss 
> to pass a function pointer around - would be nice to convert say 
> mprotect() to this and see the performance difference?)
>   

apply_page_to_range has the side-effect of allocating all the pagetable
levels for the address range it walks.  Xen uses this to populate
pagetables (for example, build out the pagetable, and let the hypervisor
plug a mapping into the pte page).

If that's OK for the other uses, then it should be good.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 15/26] Xen-paravirt_ops: Add apply_to_page_range() which applies a function to a pte range.
@ 2007-03-16 17:08       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Chris Wright, Andi Kleen,
	Andrew Morton, Christoph Lameter, Christian Limpach

Ingo Molnar wrote:
> nice one! I suspect we could simplify some of the less 
> performance-critical open-coded pagetable walker loops in the kernel 
> with this? (i'm not even sure it's all that much of a performance loss 
> to pass a function pointer around - would be nice to convert say 
> mprotect() to this and see the performance difference?)
>   

apply_page_to_range has the side-effect of allocating all the pagetable
levels for the address range it walks.  Xen uses this to populate
pagetables (for example, build out the pagetable, and let the hypervisor
plug a mapping into the pte page).

If that's OK for the other uses, then it should be good.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-16  9:14     ` Ingo Molnar
@ 2007-03-16 17:12       ` Chris Wright
  -1 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 17:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Adrian Bunk,
	Thomas Gleixner

* Ingo Molnar (mingo@elte.hu) wrote:
> >  ENTRY(swapper_pg_dir)
> > +	.align PAGE_SIZE_asm
> >  	.fill 1024,4,0
> 
> does the native kernel lose memory here?

Not in my builds.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
@ 2007-03-16 17:12       ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 17:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, Ian Pratt,
	Andi Kleen, Rusty Russell, linux-kernel, Adrian Bunk,
	Chris Wright, virtualization, Andrew Morton, Thomas Gleixner,
	Christian Limpach

* Ingo Molnar (mingo@elte.hu) wrote:
> >  ENTRY(swapper_pg_dir)
> > +	.align PAGE_SIZE_asm
> >  	.fill 1024,4,0
> 
> does the native kernel lose memory here?

Not in my builds.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
  2007-03-16  9:21   ` Ingo Molnar
@ 2007-03-16 17:26     ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> This patch series implements the Linux Xen guest as a paravirt_ops 
>> backend. [...]
>>     
>
> 19/26 seems to be missing (lkml size limit?) so i couldnt review it.
>   

Shouldn't think so; its a fairly small patch.  Here it is.

    J

Subject: add kmap_atomic_pte for mapping highpte pages

Xen and VMI both have special requirements when mapping a highmem pte
page into the kernel address space.  These can be dealt with by adding
a new kmap_atomic_pte() function for mapping highptes, and hooking it
into the paravirt_ops infrastructure.

Xen specifically wants to map the pte page RO, so this patch exposes a
helper function, _kmap_atomic, which maps the page with the specified
page protections.

This also adds a kmap_flush_unused() function to clear out the cached
kmap mappings.  Xen needs this to clear out any potential stray RW
mappings of pages which will become part of a pagetable.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Zachary Amsden <zach@vmware.com>

---
 arch/i386/kernel/paravirt.c |    7 +++++++
 arch/i386/mm/highmem.c      |    9 +++++++--
 include/asm-i386/highmem.h  |   11 +++++++++++
 include/asm-i386/paravirt.h |   14 +++++++++++++-
 include/asm-i386/pgtable.h  |    4 ++--
 include/linux/highmem.h     |    6 ++++++
 mm/highmem.c                |    9 +++++++++
 7 files changed, 55 insertions(+), 5 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -20,6 +20,7 @@
 #include <linux/efi.h>
 #include <linux/bcd.h>
 #include <linux/start_kernel.h>
+#include <linux/highmem.h>
 
 #include <asm/bug.h>
 #include <asm/paravirt.h>
@@ -600,6 +601,12 @@ struct paravirt_ops paravirt_ops = {
 
 	.ptep_get_and_clear = native_ptep_get_and_clear,
 
+#ifdef CONFIG_HIGHPTE
+	.kmap_atomic_pte = native_kmap_atomic_pte,
+#else
+	.kmap_atomic_pte = paravirt_nop,
+#endif
+
 #ifdef CONFIG_X86_PAE
 	.set_pte_atomic = native_set_pte_atomic,
 	.set_pte_present = native_set_pte_present,
===================================================================
--- a/arch/i386/mm/highmem.c
+++ b/arch/i386/mm/highmem.c
@@ -26,7 +26,7 @@ void kunmap(struct page *page)
  * However when holding an atomic kmap is is not legal to sleep, so atomic
  * kmaps are appropriate for short, tight code paths only.
  */
-void *kmap_atomic(struct page *page, enum km_type type)
+void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot)
 {
 	enum fixed_addresses idx;
 	unsigned long vaddr;
@@ -41,9 +41,14 @@ void *kmap_atomic(struct page *page, enu
 		return page_address(page);
 
 	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-	set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
+	set_pte(kmap_pte-idx, mk_pte(page, prot));
 
 	return (void*) vaddr;
+}
+
+void *kmap_atomic(struct page *page, enum km_type type)
+{
+	return _kmap_atomic(page, type, kmap_prot);
 }
 
 void kunmap_atomic(void *kvaddr, enum km_type type)
===================================================================
--- a/include/asm-i386/highmem.h
+++ b/include/asm-i386/highmem.h
@@ -24,6 +24,7 @@
 #include <linux/threads.h>
 #include <asm/kmap_types.h>
 #include <asm/tlbflush.h>
+#include <asm/paravirt.h>
 
 /* declarations for highmem.c */
 extern unsigned long highstart_pfn, highend_pfn;
@@ -67,10 +68,20 @@ extern void FASTCALL(kunmap_high(struct 
 
 void *kmap(struct page *page);
 void kunmap(struct page *page);
+void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot);
 void *kmap_atomic(struct page *page, enum km_type type);
 void kunmap_atomic(void *kvaddr, enum km_type type);
 void *kmap_atomic_pfn(unsigned long pfn, enum km_type type);
 struct page *kmap_atomic_to_page(void *ptr);
+
+static inline void *native_kmap_atomic_pte(struct page *page, enum km_type type)
+{
+	return kmap_atomic(page, type);
+}
+
+#ifndef CONFIG_PARAVIRT
+#define kmap_atomic_pte(page, type)	kmap_atomic(page, type)
+#endif
 
 #define flush_cache_kmaps()	do { } while (0)
 
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -16,11 +16,14 @@
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
 #include <linux/cpumask.h>
-
+#include <asm/kmap_types.h>
+
+struct page;
 struct thread_struct;
 struct Xgt_desc_struct;
 struct tss_struct;
 struct mm_struct;
+
 struct paravirt_ops
 {
 	unsigned int kernel_rpl;
@@ -138,6 +141,8 @@ struct paravirt_ops
 	void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 
  	pte_t (*ptep_get_and_clear)(pte_t *ptep);
+
+	void *(*kmap_atomic_pte)(struct page *page, enum km_type type);
 
 #ifdef CONFIG_X86_PAE
 	void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
@@ -667,6 +672,13 @@ static inline void flush_tlb_others(cpum
 	PVOP_VCALL4(alloc_pd_clone, pfn, clonepfn, start, count)
 #define paravirt_release_pd(pfn) PVOP_VCALL1(release_pd, pfn)
 
+static inline void *kmap_atomic_pte(struct page *page, enum km_type type)
+{
+	unsigned long ret;
+	ret = PVOP_CALL2(unsigned long, kmap_atomic_pte, page, type);
+	return (void *)ret;
+}
+
 static inline void pte_update(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	PVOP_VCALL3(pte_update, mm, addr, ptep);
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -479,9 +479,9 @@ extern pte_t *lookup_address(unsigned lo
 
 #if defined(CONFIG_HIGHPTE)
 #define pte_offset_map(dir, address) \
-	((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE0) + pte_index(address))
+	((pte_t *)kmap_atomic_pte(pmd_page(*(dir)),KM_PTE0) + pte_index(address))
 #define pte_offset_map_nested(dir, address) \
-	((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE1) + pte_index(address))
+	((pte_t *)kmap_atomic_pte(pmd_page(*(dir)),KM_PTE1) + pte_index(address))
 #define pte_unmap(pte) kunmap_atomic(pte, KM_PTE0)
 #define pte_unmap_nested(pte) kunmap_atomic(pte, KM_PTE1)
 #else
===================================================================
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -27,6 +27,8 @@ unsigned int nr_free_highpages(void);
 unsigned int nr_free_highpages(void);
 extern unsigned long totalhigh_pages;
 
+void kmap_flush_unused(void);
+
 #else /* CONFIG_HIGHMEM */
 
 static inline unsigned int nr_free_highpages(void) { return 0; }
@@ -44,9 +46,13 @@ static inline void *kmap(struct page *pa
 
 #define kmap_atomic(page, idx) \
 	({ pagefault_disable(); page_address(page); })
+#define _kmap_atomic(page, idx, prot)	kmap_atomic(page, idx)
+
 #define kunmap_atomic(addr, idx)	do { pagefault_enable(); } while (0)
 #define kmap_atomic_pfn(pfn, idx)	kmap_atomic(pfn_to_page(pfn), (idx))
 #define kmap_atomic_to_page(ptr)	virt_to_page(ptr)
+
+#define kmap_flush_unused()	do {} while(0)
 #endif
 
 #endif /* CONFIG_HIGHMEM */
===================================================================
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -97,6 +97,15 @@ static void flush_all_zero_pkmaps(void)
 		set_page_address(page, NULL);
 	}
 	flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
+}
+
+/* Flush all unused kmap mappings in order to remove stray
+   mappings. */
+void kmap_flush_unused(void)
+{
+	spin_lock(&kmap_lock);
+	flush_all_zero_pkmaps();
+	spin_unlock(&kmap_lock);
 }
 
 static inline unsigned long map_new_virtual(struct page *page)



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
@ 2007-03-16 17:26     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> This patch series implements the Linux Xen guest as a paravirt_ops 
>> backend. [...]
>>     
>
> 19/26 seems to be missing (lkml size limit?) so i couldnt review it.
>   

Shouldn't think so; its a fairly small patch.  Here it is.

    J

Subject: add kmap_atomic_pte for mapping highpte pages

Xen and VMI both have special requirements when mapping a highmem pte
page into the kernel address space.  These can be dealt with by adding
a new kmap_atomic_pte() function for mapping highptes, and hooking it
into the paravirt_ops infrastructure.

Xen specifically wants to map the pte page RO, so this patch exposes a
helper function, _kmap_atomic, which maps the page with the specified
page protections.

This also adds a kmap_flush_unused() function to clear out the cached
kmap mappings.  Xen needs this to clear out any potential stray RW
mappings of pages which will become part of a pagetable.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Zachary Amsden <zach@vmware.com>

---
 arch/i386/kernel/paravirt.c |    7 +++++++
 arch/i386/mm/highmem.c      |    9 +++++++--
 include/asm-i386/highmem.h  |   11 +++++++++++
 include/asm-i386/paravirt.h |   14 +++++++++++++-
 include/asm-i386/pgtable.h  |    4 ++--
 include/linux/highmem.h     |    6 ++++++
 mm/highmem.c                |    9 +++++++++
 7 files changed, 55 insertions(+), 5 deletions(-)

===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -20,6 +20,7 @@
 #include <linux/efi.h>
 #include <linux/bcd.h>
 #include <linux/start_kernel.h>
+#include <linux/highmem.h>
 
 #include <asm/bug.h>
 #include <asm/paravirt.h>
@@ -600,6 +601,12 @@ struct paravirt_ops paravirt_ops = {
 
 	.ptep_get_and_clear = native_ptep_get_and_clear,
 
+#ifdef CONFIG_HIGHPTE
+	.kmap_atomic_pte = native_kmap_atomic_pte,
+#else
+	.kmap_atomic_pte = paravirt_nop,
+#endif
+
 #ifdef CONFIG_X86_PAE
 	.set_pte_atomic = native_set_pte_atomic,
 	.set_pte_present = native_set_pte_present,
===================================================================
--- a/arch/i386/mm/highmem.c
+++ b/arch/i386/mm/highmem.c
@@ -26,7 +26,7 @@ void kunmap(struct page *page)
  * However when holding an atomic kmap is is not legal to sleep, so atomic
  * kmaps are appropriate for short, tight code paths only.
  */
-void *kmap_atomic(struct page *page, enum km_type type)
+void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot)
 {
 	enum fixed_addresses idx;
 	unsigned long vaddr;
@@ -41,9 +41,14 @@ void *kmap_atomic(struct page *page, enu
 		return page_address(page);
 
 	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-	set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
+	set_pte(kmap_pte-idx, mk_pte(page, prot));
 
 	return (void*) vaddr;
+}
+
+void *kmap_atomic(struct page *page, enum km_type type)
+{
+	return _kmap_atomic(page, type, kmap_prot);
 }
 
 void kunmap_atomic(void *kvaddr, enum km_type type)
===================================================================
--- a/include/asm-i386/highmem.h
+++ b/include/asm-i386/highmem.h
@@ -24,6 +24,7 @@
 #include <linux/threads.h>
 #include <asm/kmap_types.h>
 #include <asm/tlbflush.h>
+#include <asm/paravirt.h>
 
 /* declarations for highmem.c */
 extern unsigned long highstart_pfn, highend_pfn;
@@ -67,10 +68,20 @@ extern void FASTCALL(kunmap_high(struct 
 
 void *kmap(struct page *page);
 void kunmap(struct page *page);
+void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot);
 void *kmap_atomic(struct page *page, enum km_type type);
 void kunmap_atomic(void *kvaddr, enum km_type type);
 void *kmap_atomic_pfn(unsigned long pfn, enum km_type type);
 struct page *kmap_atomic_to_page(void *ptr);
+
+static inline void *native_kmap_atomic_pte(struct page *page, enum km_type type)
+{
+	return kmap_atomic(page, type);
+}
+
+#ifndef CONFIG_PARAVIRT
+#define kmap_atomic_pte(page, type)	kmap_atomic(page, type)
+#endif
 
 #define flush_cache_kmaps()	do { } while (0)
 
===================================================================
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -16,11 +16,14 @@
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
 #include <linux/cpumask.h>
-
+#include <asm/kmap_types.h>
+
+struct page;
 struct thread_struct;
 struct Xgt_desc_struct;
 struct tss_struct;
 struct mm_struct;
+
 struct paravirt_ops
 {
 	unsigned int kernel_rpl;
@@ -138,6 +141,8 @@ struct paravirt_ops
 	void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 
  	pte_t (*ptep_get_and_clear)(pte_t *ptep);
+
+	void *(*kmap_atomic_pte)(struct page *page, enum km_type type);
 
 #ifdef CONFIG_X86_PAE
 	void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
@@ -667,6 +672,13 @@ static inline void flush_tlb_others(cpum
 	PVOP_VCALL4(alloc_pd_clone, pfn, clonepfn, start, count)
 #define paravirt_release_pd(pfn) PVOP_VCALL1(release_pd, pfn)
 
+static inline void *kmap_atomic_pte(struct page *page, enum km_type type)
+{
+	unsigned long ret;
+	ret = PVOP_CALL2(unsigned long, kmap_atomic_pte, page, type);
+	return (void *)ret;
+}
+
 static inline void pte_update(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	PVOP_VCALL3(pte_update, mm, addr, ptep);
===================================================================
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -479,9 +479,9 @@ extern pte_t *lookup_address(unsigned lo
 
 #if defined(CONFIG_HIGHPTE)
 #define pte_offset_map(dir, address) \
-	((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE0) + pte_index(address))
+	((pte_t *)kmap_atomic_pte(pmd_page(*(dir)),KM_PTE0) + pte_index(address))
 #define pte_offset_map_nested(dir, address) \
-	((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE1) + pte_index(address))
+	((pte_t *)kmap_atomic_pte(pmd_page(*(dir)),KM_PTE1) + pte_index(address))
 #define pte_unmap(pte) kunmap_atomic(pte, KM_PTE0)
 #define pte_unmap_nested(pte) kunmap_atomic(pte, KM_PTE1)
 #else
===================================================================
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -27,6 +27,8 @@ unsigned int nr_free_highpages(void);
 unsigned int nr_free_highpages(void);
 extern unsigned long totalhigh_pages;
 
+void kmap_flush_unused(void);
+
 #else /* CONFIG_HIGHMEM */
 
 static inline unsigned int nr_free_highpages(void) { return 0; }
@@ -44,9 +46,13 @@ static inline void *kmap(struct page *pa
 
 #define kmap_atomic(page, idx) \
 	({ pagefault_disable(); page_address(page); })
+#define _kmap_atomic(page, idx, prot)	kmap_atomic(page, idx)
+
 #define kunmap_atomic(addr, idx)	do { pagefault_enable(); } while (0)
 #define kmap_atomic_pfn(pfn, idx)	kmap_atomic(pfn_to_page(pfn), (idx))
 #define kmap_atomic_to_page(ptr)	virt_to_page(ptr)
+
+#define kmap_flush_unused()	do {} while(0)
 #endif
 
 #endif /* CONFIG_HIGHMEM */
===================================================================
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -97,6 +97,15 @@ static void flush_all_zero_pkmaps(void)
 		set_page_address(page, NULL);
 	}
 	flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
+}
+
+/* Flush all unused kmap mappings in order to remove stray
+   mappings. */
+void kmap_flush_unused(void)
+{
+	spin_lock(&kmap_lock);
+	flush_all_zero_pkmaps();
+	spin_unlock(&kmap_lock);
 }
 
 static inline unsigned long map_new_virtual(struct page *page)

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-16  9:24     ` Ingo Molnar
@ 2007-03-16 17:36       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Anthony Liguori, Linus Torvalds, David Miller

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> Wrap a set of interesting paravirt_ops calls in a wrapper which makes 
>> the callsites available for patching.  Unfortunately this is pretty 
>> ugly because there's no way to get gcc to generate a function call, 
>> but also wrap just the callsite itself with the necessary labels.
>>
>> This patch supports functions with 0-4 arguments, and either void or 
>> returning a value.  64-bit arguments must be split into a pair of 
>> 32-bit arguments (lower word first).  Small structures are returned in 
>> registers.
>>     
>
> ugh. This is beyond ugly!

Yes.  I can't say I'm proud of this, but I really couldn't come up with
a cleaner approach.

>  Why dont we just compile two images, one for 
> Xen and one for native, do two passes to get those two images and 
> 'merge' them into a single vmlinuz (so that we still have a 'single' 
> kernel unit to deal with on the distro side). This way we avoid all this 
> crazy, limited, fragile patchery business...

Well, one thing to make clear is this is absolutely not a Xen-specific
patch or piece of code.  This is part of the paravirt_ops infrastructure
designed to remove the overhead of all the indirect calls which are
scattered all over the place.  (Perhaps I should post the general
paravirt and Xen specific patches in separate patch series to make this
clear...).

The idea is to wrap the callsite itself with in the same manner as the
other altinstructions so that the general patcher can, at the very
least, convert the indirect call to a direct one, or nop it out if its
an indirect call.  This means that a pv_ops implementation can get about
90% of the benefit of patching without any extra effort.

So, I disagree with your characterisation that its "limited"; this is a
pretty general mechanism.  The fragile part is in using the PVOP_*
macros properly to match the ABI's calling convention, particularly with
tricky cases like passed and returned structures and 64-bit args.  But
that just needs to be done once in one place, and is otherwise
self-contained.

I would love a better mechanism.  I played with using things like gcc's
builtin_apply stuff, and so on, but I could find no way to get gcc to
set up the args and then be able to just emit the call itself under asm
control.

I haven't looked at Dave's reply in detail, but I saw some mention of
using relocs.  The idea is intriguing , but I don't quite see how it
would all fit together.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-16 17:36       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Anthony Liguori,
	Andrew Morton, Linus Torvalds, David Miller

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> Wrap a set of interesting paravirt_ops calls in a wrapper which makes 
>> the callsites available for patching.  Unfortunately this is pretty 
>> ugly because there's no way to get gcc to generate a function call, 
>> but also wrap just the callsite itself with the necessary labels.
>>
>> This patch supports functions with 0-4 arguments, and either void or 
>> returning a value.  64-bit arguments must be split into a pair of 
>> 32-bit arguments (lower word first).  Small structures are returned in 
>> registers.
>>     
>
> ugh. This is beyond ugly!

Yes.  I can't say I'm proud of this, but I really couldn't come up with
a cleaner approach.

>  Why dont we just compile two images, one for 
> Xen and one for native, do two passes to get those two images and 
> 'merge' them into a single vmlinuz (so that we still have a 'single' 
> kernel unit to deal with on the distro side). This way we avoid all this 
> crazy, limited, fragile patchery business...

Well, one thing to make clear is this is absolutely not a Xen-specific
patch or piece of code.  This is part of the paravirt_ops infrastructure
designed to remove the overhead of all the indirect calls which are
scattered all over the place.  (Perhaps I should post the general
paravirt and Xen specific patches in separate patch series to make this
clear...).

The idea is to wrap the callsite itself with in the same manner as the
other altinstructions so that the general patcher can, at the very
least, convert the indirect call to a direct one, or nop it out if its
an indirect call.  This means that a pv_ops implementation can get about
90% of the benefit of patching without any extra effort.

So, I disagree with your characterisation that its "limited"; this is a
pretty general mechanism.  The fragile part is in using the PVOP_*
macros properly to match the ABI's calling convention, particularly with
tricky cases like passed and returned structures and 64-bit args.  But
that just needs to be done once in one place, and is otherwise
self-contained.

I would love a better mechanism.  I played with using things like gcc's
builtin_apply stuff, and so on, but I could find no way to get gcc to
set up the args and then be able to just emit the call itself under asm
control.

I haven't looked at Dave's reply in detail, but I saw some mention of
using relocs.  The idea is intriguing , but I don't quite see how it
would all fit together.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
  2007-03-16  9:26     ` Ingo Molnar
@ 2007-03-16 17:37       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	Linus Torvalds

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> Fix a few clobbers to include the return register.  The clobbers set 
>> is the set of all registers modified (or may be modified) by the code 
>> snippet, regardless of whether it was deliberate or accidental.
>>     
>
> i guess we need these paravirt.h clobber fixes for v2.6.21 (VMI) too, 
> correct?

No, I don't think so.  These aren't gcc-style clobbers, but a marker in
the .parainstruction section about what registers may be clobbered by
the patched in code.   VMI doesn't use that info, and/or doesn't clobber
more than is allowed at each site, so it isn't an issue.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register
@ 2007-03-16 17:37       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton,
	Linus Torvalds

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> Fix a few clobbers to include the return register.  The clobbers set 
>> is the set of all registers modified (or may be modified) by the code 
>> snippet, regardless of whether it was deliberate or accidental.
>>     
>
> i guess we need these paravirt.h clobber fixes for v2.6.21 (VMI) too, 
> correct?

No, I don't think so.  These aren't gcc-style clobbers, but a marker in
the .parainstruction section about what registers may be clobbered by
the patched in code.   VMI doesn't use that info, and/or doesn't clobber
more than is allowed at each site, so it isn't an issue.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 08/26] Xen-paravirt_ops: add hooks to intercept mm creation and destruction
  2007-03-16  9:30     ` Ingo Molnar
@ 2007-03-16 17:38       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	linux-arch

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> Add hooks to allow a paravirt implementation to track the lifetime of 
>> an mm.
>>     
>
> (i guess this one should be merged with the ARCH-define cleanup patch)
>   

Already done.  In fact, I think I did it before the last patch post. 
Are you looking at an uptodate series of patches?

   

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 08/26] Xen-paravirt_ops: add hooks to intercept mm creation and destruction
@ 2007-03-16 17:38       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 17:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, linux-arch, xen-devel, virtualization,
	Rusty Russell, linux-kernel, Chris Wright, Andi Kleen,
	Andrew Morton

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> Add hooks to allow a paravirt implementation to track the lifetime of 
>> an mm.
>>     
>
> (i guess this one should be merged with the ARCH-define cleanup patch)
>   

Already done.  In fact, I think I did it before the last patch post. 
Are you looking at an uptodate series of patches?

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver.
  2007-03-16 16:40           ` Chris Wright
  (?)
@ 2007-03-16 17:53           ` Greg KH
  -1 siblings, 0 replies; 330+ messages in thread
From: Greg KH @ 2007-03-16 17:53 UTC (permalink / raw)
  To: Chris Wright
  Cc: Ingo Molnar, Jeremy Fitzhardinge, Andi Kleen, Andrew Morton,
	linux-kernel, virtualization, xen-devel, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Greg KH

On Fri, Mar 16, 2007 at 09:40:23AM -0700, Chris Wright wrote:
> * Ingo Molnar (mingo@elte.hu) wrote:
> > please indicate any such review feedback (and approval) via 
> > signed-off-by lines done by that person.
> 
> Certainly, Greg didn't sign off, he simply gave review feedback.

Yeah, I did that a while ago, but I don't want to give my acked-by or
signed-off for this code.  While I can not find anything specfically
wrong with it, it seems a bit "odd", but that might just be because I
really don't understand what they are doing here, or why.

And no, I really don't want to understand it either at the moment due to
being busy with other things :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable
  2007-03-16  9:33     ` Ingo Molnar
  (?)
@ 2007-03-16 18:39     ` Jeremy Fitzhardinge
  2007-03-16 19:35         ` Steven Rostedt
  2007-03-17  9:47         ` Rusty Russell
  -1 siblings, 2 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 18:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell,
	William Lee Irwin III

Ingo Molnar wrote:
>> +	/* Make sure kernel address space is empty so that a pagetable
>> +	   will be allocated for it. */
>>     
>
> comment style.
>   

As you've noticed its a comment style I use quite often.  I use it for
what's essentially a little local comment, rather than a block comment
which really needs to draw attention to itself.  In this case, I think using

/*
 * Make sure kernel address space is empty so that a pagetable
 * will be allocated for it.
 */

is just too loud for what its trying to say.  If // comments were deemed
acceptable, I'd probably use them here instead.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 04/26] Xen-paravirt_ops: Add pagetable accessors to pack and unpack pagetable entries
  2007-03-16  9:38     ` Ingo Molnar
@ 2007-03-16 18:42       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 18:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell

Ingo Molnar wrote:
>> +static inline pmd_t native_make_pmd(unsigned long long val)
>> +{
>> +	return (pmd_t) { val };
>> +}
>> +static inline pte_t native_make_pte(unsigned long long val)
>> +{
>> +	return (pte_t) { .pte_low = val, .pte_high = (val >> 32) } ;
>> +}
>>     
>
> missing newlines between inline functions.
>   

OK.

>> +#ifndef CONFIG_PARAVIRT
>> +#define pmd_val(x)	native_pmd_val(x)
>> +#define __pmd(x)	native_make_pmd(x)
>> +#endif	/* !CONFIG_PARAVIRT */
>>     
>
> no need for the closing !CONFIG_PARAVIRT comment: this define is 2 lines 
> long so it's not that hard to find the start of the block. We typically 
> do the /* !CONFIG_XX */ comment only for larger blocks, and when 
> multiple #endif's intermix.
>   

Yeah, I tend to put them there reflexively.  Its so easy for an #endif
to drift away over time, and suddenly you have no idea what's going on. 
I agree its overkill in this case.

>>  #define HPAGE_SHIFT	22
>>  #include <asm-generic/pgtable-nopmd.h>
>> -#endif
>> +#endif	/* CONFIG_X86_PAE */
>>     
>
> (for example here the #endif comment is justified.)
>   

Yeah, and it probably started life much closer to the #ifdef...

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 04/26] Xen-paravirt_ops: Add pagetable accessors to pack and unpack pagetable entries
@ 2007-03-16 18:42       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 18:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton

Ingo Molnar wrote:
>> +static inline pmd_t native_make_pmd(unsigned long long val)
>> +{
>> +	return (pmd_t) { val };
>> +}
>> +static inline pte_t native_make_pte(unsigned long long val)
>> +{
>> +	return (pte_t) { .pte_low = val, .pte_high = (val >> 32) } ;
>> +}
>>     
>
> missing newlines between inline functions.
>   

OK.

>> +#ifndef CONFIG_PARAVIRT
>> +#define pmd_val(x)	native_pmd_val(x)
>> +#define __pmd(x)	native_make_pmd(x)
>> +#endif	/* !CONFIG_PARAVIRT */
>>     
>
> no need for the closing !CONFIG_PARAVIRT comment: this define is 2 lines 
> long so it's not that hard to find the start of the block. We typically 
> do the /* !CONFIG_XX */ comment only for larger blocks, and when 
> multiple #endif's intermix.
>   

Yeah, I tend to put them there reflexively.  Its so easy for an #endif
to drift away over time, and suddenly you have no idea what's going on. 
I agree its overkill in this case.

>>  #define HPAGE_SHIFT	22
>>  #include <asm-generic/pgtable-nopmd.h>
>> -#endif
>> +#endif	/* CONFIG_X86_PAE */
>>     
>
> (for example here the #endif comment is justified.)
>   

Yeah, and it probably started life much closer to the #ifdef...

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
  2007-03-16  9:44     ` Ingo Molnar
@ 2007-03-16 18:43       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 18:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrew Morton, linux-kernel, virtualization,
	xen-devel, Chris Wright, Zachary Amsden, Rusty Russell

Ingo Molnar wrote:
> but only as a cleanup of the current open-coded (void *) casts. My 
> problem with this is that it loses the types. Not that there is much to 
> check for, but still, this adds some assumptions about how function 
> calls look like.

I agree.  I don't generally like this kind of hack, but having a single
test for "func == paravirt_nop" to look for nop pv_ops in the patcher is
what tipped the balance.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
@ 2007-03-16 18:43       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 18:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Andrew Morton

Ingo Molnar wrote:
> but only as a cleanup of the current open-coded (void *) casts. My 
> problem with this is that it loses the types. Not that there is much to 
> check for, but still, this adds some assumptions about how function 
> calls look like.

I agree.  I don't generally like this kind of hack, but having a single
test for "func == paravirt_nop" to look for nop pv_ops in the patcher is
what tipped the balance.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
  2007-03-16 17:26     ` Jeremy Fitzhardinge
@ 2007-03-16 18:59       ` Christoph Hellwig
  -1 siblings, 0 replies; 330+ messages in thread
From: Christoph Hellwig @ 2007-03-16 18:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell

On Fri, Mar 16, 2007 at 10:26:55AM -0700, Jeremy Fitzhardinge wrote:
> +#ifdef CONFIG_HIGHPTE
> +	.kmap_atomic_pte = native_kmap_atomic_pte,
> +#else
> +	.kmap_atomic_pte = paravirt_nop,
> +#endif

This is ifdefing is quite ugly.  Shouldn't native_kmap_atomic_pte
just be a noop in the !CONFIG_HIGHPTE case?

> -void *kmap_atomic(struct page *page, enum km_type type)
> +void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot)

We normally call our "secial" function __foo, not _foo.  But in this
case it really should have a more meaningfull name like
kmap_atomic_prot anyway.

> +void *kmap_atomic(struct page *page, enum km_type type)
> +{
> +	return _kmap_atomic(page, type, kmap_prot);

And this one should probably be an inline.

> +static inline void *native_kmap_atomic_pte(struct page *page, enum km_type type)
> +{
> +	return kmap_atomic(page, type);
> +}
> +
> +#ifndef CONFIG_PARAVIRT
> +#define kmap_atomic_pte(page, type)	kmap_atomic(page, type)
> +#endif

This is all getting rather ugly just for your pagetable hackery.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
@ 2007-03-16 18:59       ` Christoph Hellwig
  0 siblings, 0 replies; 330+ messages in thread
From: Christoph Hellwig @ 2007-03-16 18:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, virtualization, xen-devel, Andi Kleen, Ingo Molnar,
	Andrew Morton, linux-kernel

On Fri, Mar 16, 2007 at 10:26:55AM -0700, Jeremy Fitzhardinge wrote:
> +#ifdef CONFIG_HIGHPTE
> +	.kmap_atomic_pte = native_kmap_atomic_pte,
> +#else
> +	.kmap_atomic_pte = paravirt_nop,
> +#endif

This is ifdefing is quite ugly.  Shouldn't native_kmap_atomic_pte
just be a noop in the !CONFIG_HIGHPTE case?

> -void *kmap_atomic(struct page *page, enum km_type type)
> +void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot)

We normally call our "secial" function __foo, not _foo.  But in this
case it really should have a more meaningfull name like
kmap_atomic_prot anyway.

> +void *kmap_atomic(struct page *page, enum km_type type)
> +{
> +	return _kmap_atomic(page, type, kmap_prot);

And this one should probably be an inline.

> +static inline void *native_kmap_atomic_pte(struct page *page, enum km_type type)
> +{
> +	return kmap_atomic(page, type);
> +}
> +
> +#ifndef CONFIG_PARAVIRT
> +#define kmap_atomic_pte(page, type)	kmap_atomic(page, type)
> +#endif

This is all getting rather ugly just for your pagetable hackery.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* RE: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
  2007-03-16 11:58               ` Keir Fraser
@ 2007-03-16 19:01                 ` Huang2, Wei
  -1 siblings, 0 replies; 330+ messages in thread
From: Huang2, Wei @ 2007-03-16 19:01 UTC (permalink / raw)
  To: Keir Fraser, Andrew Morton
  Cc: xen-devel, Andi Kleen, linux-kernel, Keir Fraser, Wright,
	virtualization, Ingo Molnar, Chris

But it is shorter than from JFK or Newark. 

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com
[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Keir Fraser
Sent: Friday, March 16, 2007 6:58 AM
To: Andrew Morton; Keir Fraser
Cc: xen-devel@lists.xensource.com; Andi Kleen;
linux-kernel@vger.kernel.org; Keir Fraser; Wright;
virtualization@lists.osdl.org; Ingo Molnar; Chris@smtp.osdl.org
Subject: Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc
console infrastructure for Xen console

On 16/3/07 11:41, "Andrew Morton" <akpm@linux-foundation.org> wrote:

>> It's needed for writing data /after/ reading the consumer index that
shows
>> you have space to write. Looking through xenbus_comms.c I think all
the
>> barriers are correct except there is a spurious extra mb() in
xb_read(),
>> where there is a later rmb() which is sufficient by itself. All the
others
>> have a purpose.
>> 
> 
> If Ingo couldn't work this out from reading the code then nobody else
can,
> and we have a maintainability problem which can only be solved with
> adequate commenting.

Agreed. I've fixed up the commenting, removed the spurious mb(), and I
just
need Jeremy to merge that change into the Xen pv_ops patchset. :-)

 -- Keir


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel





^ permalink raw reply	[flat|nested] 330+ messages in thread

* RE: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console
@ 2007-03-16 19:01                 ` Huang2, Wei
  0 siblings, 0 replies; 330+ messages in thread
From: Huang2, Wei @ 2007-03-16 19:01 UTC (permalink / raw)
  To: Keir Fraser, Andrew Morton
  Cc: Wright, Andi Kleen, xen-devel, virtualization, Ingo Molnar,
	Chris, linux-kernel, Keir Fraser

But it is shorter than from JFK or Newark. 

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com
[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Keir Fraser
Sent: Friday, March 16, 2007 6:58 AM
To: Andrew Morton; Keir Fraser
Cc: xen-devel@lists.xensource.com; Andi Kleen;
linux-kernel@vger.kernel.org; Keir Fraser; Wright;
virtualization@lists.osdl.org; Ingo Molnar; Chris@smtp.osdl.org
Subject: Re: [Xen-devel] Re: [patch 21/26] Xen-paravirt_ops: Use the hvc
console infrastructure for Xen console

On 16/3/07 11:41, "Andrew Morton" <akpm@linux-foundation.org> wrote:

>> It's needed for writing data /after/ reading the consumer index that
shows
>> you have space to write. Looking through xenbus_comms.c I think all
the
>> barriers are correct except there is a spurious extra mb() in
xb_read(),
>> where there is a later rmb() which is sufficient by itself. All the
others
>> have a purpose.
>> 
> 
> If Ingo couldn't work this out from reading the code then nobody else
can,
> and we have a maintainability problem which can only be solved with
> adequate commenting.

Agreed. I've fixed up the commenting, removed the spurious mb(), and I
just
need Jeremy to merge that change into the Xen pv_ops patchset. :-)

 -- Keir


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-16  9:57         ` Ingo Molnar
  (?)
@ 2007-03-16 19:16         ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 19:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, ak, akpm, linux-kernel, virtualization, xen-devel,
	chrisw, zach, rusty, anthony, torvalds

Ingo Molnar wrote:

> * David Miller <davem@davemloft.net> wrote:

> > Perhaps the problem can be dealt with using ELF relocations.
> > 
> > There is another case, discussed yesterday on netdev, where run-time 
> > resolution of ELF relocations would be useful (for 
> > very-very-very-read-only variables) so if it can solve this problem 
> > too it would be nice to have a generic infrastructure for it.
>   

> yeah, and i really think this is very fundamental: [...]

I think what Dave is suggesting is that we use the reloc information the
compiler generates to find the patchable callsites rather than have
special wrappers.  This is an interesting idea.

> Limited, instruction-level patching like alternatives.h is fine because 
> that makes it easier to support multiple, incompatible CPU 
> architectures, without having to do a hugely intrusive split at the 
> kernel RPM level.
>
> but the level of 'binary patching' done by the paravirt and Xen goes way 
> beyond that,

Not really.  There are only three cases:

   1. replace an indirect call with a direct call
   2. nop out a callsite
   3. patch in a short inline sequence

And as I pointed out, this is used by all pv_op backends, using a common
piece of code to implement at least 1 and 2.  3 could be implemented
semi-generically by using rules like "if (func == native_sti) {
patch("sti"); }", which would cover many cases where a hypervisor
doesn't need any special handling for a particular operation.

The goal is to eliminate the cost of the indirect calls with nice
predictable indirect calls.  There's a 1 byte/callsite overhead, but I
don't think that's a horrible overhead.

And, at worst, its only a little more complex than the kinds of
transformations.

Ideally, its a mechanism which could be used elsewhere.  It applies with
you have some kind of ops_vector table which is updated once (or perhaps
very rarely), and you don't want to wear the overhead of indirect calls
everywhere.

>  and the changes here really underscore that we:
>
>   _should not emulate the closed source world_
>
> There the only solution is to binary-patch - because they have no source 
> code. But here, we've got all the source code.
>   

I don't think this is a relevant comparison.  This is purely a matter of
optimising out unnecessary indirect calls.

> nobody wants to boot a xen-paravirt kernel from a floppy, so image size 
> is not an issue. In-RAM overhead would in fact be /reduced/, because 
> currently all the paravirt overhead hits both the native and the 
> paravirt kernel. Nor would /all/ of the vmlinuz have to be replicated in 
> the images - it's enough to replicate only those functions that truly 
> differ between the two build methods.

One of the explicit goals of pv_ops was to allow a single kernel to
either boot on native hardware or under any one of the supported
hypervisors, explicitly to avoid having to manage multiple kernel
images.  Compiling the kernel N+1 times for N hypervisors, and then
bundling them up in some kind of multi-image format doesn't seem like a
particularly good tradeoff.  The kernel RPM on my machine here is
already ~50Mbytes; expanding that to 250Mbytes to support native, Xen,
vmi, lguest and kvm doesn't seem reasonable.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface
  2007-03-16 18:59       ` Christoph Hellwig
  (?)
@ 2007-03-16 19:26       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 19:26 UTC (permalink / raw)
  To: Christoph Hellwig, Jeremy Fitzhardinge, Ingo Molnar, Andi Kleen,
	Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Zachary Amsden, Rusty Russell

Christoph Hellwig wrote:
> On Fri, Mar 16, 2007 at 10:26:55AM -0700, Jeremy Fitzhardinge wrote:
>   
>> +#ifdef CONFIG_HIGHPTE
>> +	.kmap_atomic_pte = native_kmap_atomic_pte,
>> +#else
>> +	.kmap_atomic_pte = paravirt_nop,
>> +#endif
>>     
>
> This is ifdefing is quite ugly.  Shouldn't native_kmap_atomic_pte
> just be a noop in the !CONFIG_HIGHPTE case?
>   

Yes, but the trouble is that asm/highmem.h simply isn't included in the
!HIGHMEM case, so I can't put anything in there, and putting anything
pv_ops related into linux/highmem.h isn't appropriate either.

>> -void *kmap_atomic(struct page *page, enum km_type type)
>> +void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot)
>>     
>
> We normally call our "secial" function __foo, not _foo.  But in this
> case it really should have a more meaningfull name like
> kmap_atomic_prot anyway.
>   

OK.

>> +void *kmap_atomic(struct page *page, enum km_type type)
>> +{
>> +	return _kmap_atomic(page, type, kmap_prot);
>>     
>
> And this one should probably be an inline.
>   

OK, if you think it makes a difference.

>> +static inline void *native_kmap_atomic_pte(struct page *page, enum km_type type)
>> +{
>> +	return kmap_atomic(page, type);
>> +}
>> +
>> +#ifndef CONFIG_PARAVIRT
>> +#define kmap_atomic_pte(page, type)	kmap_atomic(page, type)
>> +#endif
>>     
>
> This is all getting rather ugly just for your pagetable hackery.
>   

Well, I could promote kmap_atomic_pte to a first-class interface, but it
seems like overkill.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable
  2007-03-16 18:39     ` Jeremy Fitzhardinge
@ 2007-03-16 19:35         ` Steven Rostedt
  2007-03-17  9:47         ` Rusty Russell
  1 sibling, 0 replies; 330+ messages in thread
From: Steven Rostedt @ 2007-03-16 19:35 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Chris Wright, Andi Kleen, xen-devel, virtualization,
	Andrew Morton, linux-kernel, William Lee Irwin III

On Fri, 2007-03-16 at 11:39 -0700, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> >> +	/* Make sure kernel address space is empty so that a pagetable
> >> +	   will be allocated for it. */
> >>     
> >
> > comment style.
> >   
> 
> As you've noticed its a comment style I use quite often.  I use it for
> what's essentially a little local comment, rather than a block comment
> which really needs to draw attention to itself.  In this case, I think using
> 

Understood, but then you need to make that comment a single line :)

The way Linux code prefers comments (like the code itself actually
prefers it!) is single line comments stay on a single line, and
multiline comments are always block.

So it's either:

   /* my little comment here */

or 

   /*
    * This comment, although small and to the point, does span multiple
    * lines and regardless, needs to be a block comment.
    */

I guess it just makes it easier on the eyes when everything conforms and
others need to look at / maintain your code.


-- Steve



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable
@ 2007-03-16 19:35         ` Steven Rostedt
  0 siblings, 0 replies; 330+ messages in thread
From: Steven Rostedt @ 2007-03-16 19:35 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Andi Kleen, xen-devel, virtualization, Ingo Molnar,
	Andrew Morton, William Lee Irwin III, linux-kernel

On Fri, 2007-03-16 at 11:39 -0700, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> >> +	/* Make sure kernel address space is empty so that a pagetable
> >> +	   will be allocated for it. */
> >>     
> >
> > comment style.
> >   
> 
> As you've noticed its a comment style I use quite often.  I use it for
> what's essentially a little local comment, rather than a block comment
> which really needs to draw attention to itself.  In this case, I think using
> 

Understood, but then you need to make that comment a single line :)

The way Linux code prefers comments (like the code itself actually
prefers it!) is single line comments stay on a single line, and
multiline comments are always block.

So it's either:

   /* my little comment here */

or 

   /*
    * This comment, although small and to the point, does span multiple
    * lines and regardless, needs to be a block comment.
    */

I guess it just makes it easier on the eyes when everything conforms and
others need to look at / maintain your code.


-- Steve

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
  2007-03-16 18:43       ` Jeremy Fitzhardinge
@ 2007-03-16 19:49         ` Chris Wright
  -1 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 19:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Ingo Molnar wrote:
> > but only as a cleanup of the current open-coded (void *) casts. My 
> > problem with this is that it loses the types. Not that there is much to 
> > check for, but still, this adds some assumptions about how function 
> > calls look like.
> 
> I agree.  I don't generally like this kind of hack, but having a single
> test for "func == paravirt_nop" to look for nop pv_ops in the patcher is
> what tipped the balance.

how about __paravirt_nop_start < func < __paravirt_nop_end  and preserve
the types?

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
@ 2007-03-16 19:49         ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 19:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, Andi Kleen, Rusty Russell,
	linux-kernel, Chris Wright, virtualization, Ingo Molnar,
	Andrew Morton

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Ingo Molnar wrote:
> > but only as a cleanup of the current open-coded (void *) casts. My 
> > problem with this is that it loses the types. Not that there is much to 
> > check for, but still, this adds some assumptions about how function 
> > calls look like.
> 
> I agree.  I don't generally like this kind of hack, but having a single
> test for "func == paravirt_nop" to look for nop pv_ops in the patcher is
> what tipped the balance.

how about __paravirt_nop_start < func < __paravirt_nop_end  and preserve
the types?

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
  2007-03-16 19:49         ` Chris Wright
  (?)
@ 2007-03-16 20:00         ` Jeremy Fitzhardinge
  2007-03-16 21:59             ` Chris Wright
  -1 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 20:00 UTC (permalink / raw)
  To: Chris Wright
  Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Zachary Amsden, Rusty Russell

Chris Wright wrote:
> how about __paravirt_nop_start < func < __paravirt_nop_end  and preserve
> the types?
>   

Er?  The reason for the (void *) cast is to stop gcc complaining about
mismatched pointer types.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-16  9:33     ` David Miller
  2007-03-16  9:57         ` Ingo Molnar
@ 2007-03-16 20:38       ` Jeremy Fitzhardinge
  2007-03-17 10:33           ` Rusty Russell
  1 sibling, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 20:38 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, ak, akpm, linux-kernel, virtualization, xen-devel, chrisw,
	zach, rusty, anthony, torvalds, NetDev

David Miller wrote:
> Perhaps the problem can be dealt with using ELF relocations.
>
> There is another case, discussed yesterday on netdev, where run-time
> resolution of ELF relocations would be useful (for
> very-very-very-read-only variables) so if it can solve this problem
> too it would be nice to have a generic infrastructure for it.

That's an interesting idea.  Have you or anyone else looked at what it
would take to code up?

For this case, I guess you'd walk the relocs looking for references into
the paravirt_ops structure.  You'd need to check that was a reference
from an indirect jump or call instruction, which would identify a
patchable callsite.  The offset into the pv_ops structure would identify
which operation is involved.  That would be enough to use the existing
paravirt_ops patch machinery to insert a patch, though it doesn't
provide quite as much information as the current scheme.  The current
scheme also tells us how much space is available for patching over, and
what registers the patched-in code can use/clobber.

What are the netdev requirements?

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes
  2007-03-16  9:15     ` Ingo Molnar
@ 2007-03-16 21:26       ` Roland McGrath
  -1 siblings, 0 replies; 330+ messages in thread
From: Roland McGrath @ 2007-03-16 21:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Ulrich Drepper,
	Jakub Jelinek

> > +NOTE_KERNELCAP_BEGIN(1, 1)
> > +NOTE_KERNELCAP(1, "nosegneg")
> > +NOTE_KERNELCAP_END

This should be:

NOTE_KERNELCAP_BEGIN(1, 1)
NOTE_KERNELCAP(0, "nosegneg")
NOTE_KERNELCAP_END

i.e. 1->0 in the "bit" member.  (Note the ld.so.conf.d file must have the
matching bit number for ldconfig-based lookups to do the right thing.)
Or else:

NOTE_KERNELCAP_BEGIN(1, 2)
NOTE_KERNELCAP(0, "nosegneg")
NOTE_KERNELCAP_END

i.e. 1->2 in the "mask" member.  (The mask value should be 1<<bit.)

Some pre-release glibc's (before 2.4) had a bug in the code that parses
this, and would crash parsing the correct note.  Using the wrong bit value
with nonmatching mask worked around this.  IIRC, no glibc release ever
included the buggy version of the code.  In nonbuggy glibc, the mismatched
value causes the "nosegneg" to be omitted from the directory search (under
LD_LIBRARY_PATH and default directories), though ldconfig-based lookups
will work (the most common case).

Thanks,
Roland

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes
@ 2007-03-16 21:26       ` Roland McGrath
  0 siblings, 0 replies; 330+ messages in thread
From: Roland McGrath @ 2007-03-16 21:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Ulrich Drepper,
	Jakub Jelinek

> > +NOTE_KERNELCAP_BEGIN(1, 1)
> > +NOTE_KERNELCAP(1, "nosegneg")
> > +NOTE_KERNELCAP_END

This should be:

NOTE_KERNELCAP_BEGIN(1, 1)
NOTE_KERNELCAP(0, "nosegneg")
NOTE_KERNELCAP_END

i.e. 1->0 in the "bit" member.  (Note the ld.so.conf.d file must have the
matching bit number for ldconfig-based lookups to do the right thing.)
Or else:

NOTE_KERNELCAP_BEGIN(1, 2)
NOTE_KERNELCAP(0, "nosegneg")
NOTE_KERNELCAP_END

i.e. 1->2 in the "mask" member.  (The mask value should be 1<<bit.)

Some pre-release glibc's (before 2.4) had a bug in the code that parses
this, and would crash parsing the correct note.  Using the wrong bit value
with nonmatching mask worked around this.  IIRC, no glibc release ever
included the buggy version of the code.  In nonbuggy glibc, the mismatched
value causes the "nosegneg" to be omitted from the directory search (under
LD_LIBRARY_PATH and default directories), though ldconfig-based lookups
will work (the most common case).

Thanks,
Roland

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes
  2007-03-16 21:26       ` Roland McGrath
@ 2007-03-16 21:56         ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 21:56 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Ulrich Drepper,
	Jakub Jelinek

Roland McGrath wrote:
> This should be:
>
> NOTE_KERNELCAP_BEGIN(1, 1)
> NOTE_KERNELCAP(0, "nosegneg")
> NOTE_KERNELCAP_END
>
> i.e. 1->0 in the "bit" member.  (Note the ld.so.conf.d file must have the
> matching bit number for ldconfig-based lookups to do the right thing.)
> Or else:
>
> NOTE_KERNELCAP_BEGIN(1, 2)
> NOTE_KERNELCAP(0, "nosegneg")
> NOTE_KERNELCAP_END
>
> i.e. 1->2 in the "mask" member.  (The mask value should be 1<<bit.)
>   

Thanks Roland.  I've never really understood this stuff, and I just
copied this cargo-cultishly.

I'm not quite sure what you're suggesting here though.  Do you mean one of:

NOTE_KERNELCAP_BEGIN(1, 1)
NOTE_KERNELCAP(0, "nosegneg")
NOTE_KERNELCAP_END

or

NOTE_KERNELCAP_BEGIN(1, 2)
NOTE_KERNELCAP(1, "nosegneg")
NOTE_KERNELCAP_END

is the correct thing to use?

> Some pre-release glibc's (before 2.4) had a bug in the code that parses
> this, and would crash parsing the correct note.  Using the wrong bit value
> with nonmatching mask worked around this.  IIRC, no glibc release ever
> included the buggy version of the code.  In nonbuggy glibc, the mismatched
> value causes the "nosegneg" to be omitted from the directory search (under
> LD_LIBRARY_PATH and default directories), though ldconfig-based lookups
> will work (the most common case).
>   

Are you saying that one of the corrected forms might cause old glibcs to
crash, or just ignore nosegneg?

    J


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes
@ 2007-03-16 21:56         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 21:56 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Zachary Amsden, Jakub Jelinek, xen-devel, Ian Pratt, Andi Kleen,
	Rusty Russell, linux-kernel, Ulrich Drepper, Chris Wright,
	virtualization, Ingo Molnar, Andrew Morton, Christian Limpach

Roland McGrath wrote:
> This should be:
>
> NOTE_KERNELCAP_BEGIN(1, 1)
> NOTE_KERNELCAP(0, "nosegneg")
> NOTE_KERNELCAP_END
>
> i.e. 1->0 in the "bit" member.  (Note the ld.so.conf.d file must have the
> matching bit number for ldconfig-based lookups to do the right thing.)
> Or else:
>
> NOTE_KERNELCAP_BEGIN(1, 2)
> NOTE_KERNELCAP(0, "nosegneg")
> NOTE_KERNELCAP_END
>
> i.e. 1->2 in the "mask" member.  (The mask value should be 1<<bit.)
>   

Thanks Roland.  I've never really understood this stuff, and I just
copied this cargo-cultishly.

I'm not quite sure what you're suggesting here though.  Do you mean one of:

NOTE_KERNELCAP_BEGIN(1, 1)
NOTE_KERNELCAP(0, "nosegneg")
NOTE_KERNELCAP_END

or

NOTE_KERNELCAP_BEGIN(1, 2)
NOTE_KERNELCAP(1, "nosegneg")
NOTE_KERNELCAP_END

is the correct thing to use?

> Some pre-release glibc's (before 2.4) had a bug in the code that parses
> this, and would crash parsing the correct note.  Using the wrong bit value
> with nonmatching mask worked around this.  IIRC, no glibc release ever
> included the buggy version of the code.  In nonbuggy glibc, the mismatched
> value causes the "nosegneg" to be omitted from the directory search (under
> LD_LIBRARY_PATH and default directories), though ldconfig-based lookups
> will work (the most common case).
>   

Are you saying that one of the corrected forms might cause old glibcs to
crash, or just ignore nosegneg?

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
  2007-03-16 20:00         ` Jeremy Fitzhardinge
@ 2007-03-16 21:59             ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 21:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Ingo Molnar, Andi Kleen, Andrew Morton,
	linux-kernel, virtualization, xen-devel, Zachary Amsden,
	Rusty Russell

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Chris Wright wrote:
> > how about __paravirt_nop_start < func < __paravirt_nop_end  and preserve
> > the types?
> >   
> 
> Er?  The reason for the (void *) cast is to stop gcc complaining about
> mismatched pointer types.

I mean like this (bunch of work, for a type check that we're really ignoring
anwyay, but this is the idea...)

diff -r 930fff55070e arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c	Fri Mar 16 11:09:10 2007 -0700
+++ b/arch/i386/kernel/paravirt.c	Fri Mar 16 14:48:04 2007 -0700
@@ -35,10 +35,60 @@
 #include <asm/tlbflush.h>
 #include <asm/timer.h>
 
-/* nop stub */
-void _paravirt_nop(void)
-{
-}
+/* nop stubs */
+void __paravirt_nop pv_nop_arch_setup(void)
+{
+}
+void __paravirt_nop pv_nop_set_lazy_mode(int mode)
+{
+}
+void __paravirt_nop pv_nop_alloc_pt(struct mm_struct *mm, u32 pfn)
+{
+}
+void __paravirt_nop pv_nop_alloc_pd(u32 pfn)
+{
+}
+void __paravirt_nop pv_nop_alloc_pd_clone(u32 pfn, u32 clonepfn, u32 start, u32 count)
+{
+}
+void __paravirt_nop pv_nop_release_pt(u32 pfn)
+{
+}
+void __paravirt_nop pv_nop_release_pd(u32 pfn)
+{
+}
+void __paravirt_nop pv_nop_pte_update(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+}
+void __paravirt_nop pv_nop_pte_update_defer(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+}
+void * __paravirt_nop pv_nop_kmap_atomic_pte(struct page *page, enum km_type type)
+{
+}
+void __paravirt_nop pv_nop_load_tr_desc(void)
+{
+}
+void __paravirt_nop pv_nop_activate_mm(struct mm_struct *prev, struct mm_struct *next)
+{
+}
+void __paravirt_nop pv_nop_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+}
+void __paravirt_nop pv_nop_exit_mmap(struct mm_struct *mm)
+{
+}
+void __paravirt_nop pv_nop_startup_ipi_hook(int phys_apicid, unsigned long start_eip, unsigned long start_esp)
+{
+}
+#ifdef CONFIG_X86_LOCAL_APIC
+void __paravirt_nop pv_nop_setup_boot_clock(void)
+{
+}
+void __paravirt_nop pv_nop_setup_secondary_clock(void)
+{
+}
+#endif
 
 static void __init default_banner(void)
 {
@@ -166,7 +216,7 @@ unsigned paravirt_patch_default(u8 type,
 	if (opfunc == NULL)
 		/* If there's no function, patch it with a ud2a (BUG) */
 		ret = paravirt_patch_insns(site, len, start_ud2a, end_ud2a);
-	else if (opfunc == paravirt_nop)
+	else if (opfunc >= __paravirt_nop_start || opfunc < __paravirt_nop_end)
 		/* If the operation is a nop, then nop the callsite */
 		ret = paravirt_patch_nop();
 	else if (type == PARAVIRT_PATCH(iret) ||
@@ -521,7 +571,7 @@ struct paravirt_ops paravirt_ops = {
 
  	.patch = native_patch,
 	.banner = default_banner,
-	.arch_setup = paravirt_nop,
+	.arch_setup = pv_nop_arch_setup,
 	.memory_setup = machine_specific_memory_setup,
 	.get_wallclock = native_get_wallclock,
 	.set_wallclock = native_set_wallclock,
@@ -577,7 +627,7 @@ struct paravirt_ops paravirt_ops = {
 	.setup_boot_clock = setup_boot_APIC_clock,
 	.setup_secondary_clock = setup_secondary_APIC_clock,
 #endif
-	.set_lazy_mode = paravirt_nop,
+	.set_lazy_mode = pv_nop_set_lazy_mode,
 
 	.pagetable_setup_start = native_pagetable_setup_start,
 	.pagetable_setup_done = native_pagetable_setup_done,
@@ -587,24 +637,24 @@ struct paravirt_ops paravirt_ops = {
 	.flush_tlb_single = native_flush_tlb_single,
 	.flush_tlb_others = native_flush_tlb_others,
 
-	.alloc_pt = paravirt_nop,
-	.alloc_pd = paravirt_nop,
-	.alloc_pd_clone = paravirt_nop,
-	.release_pt = paravirt_nop,
-	.release_pd = paravirt_nop,
+	.alloc_pt = pv_nop_alloc_pt,
+	.alloc_pd = pv_nop_alloc_pd,
+	.alloc_pd_clone = pv_nop_alloc_pd_clone,
+	.release_pt = pv_nop_release_pt,
+	.release_pd = pv_nop_release_pd,
 
 	.set_pte = native_set_pte,
 	.set_pte_at = native_set_pte_at,
 	.set_pmd = native_set_pmd,
-	.pte_update = paravirt_nop,
-	.pte_update_defer = paravirt_nop,
+	.pte_update = pv_nop_pte_update,
+	.pte_update_defer = pv_nop_pte_update_defer,
 
 	.ptep_get_and_clear = native_ptep_get_and_clear,
 
 #ifdef CONFIG_HIGHPTE
 	.kmap_atomic_pte = native_kmap_atomic_pte,
 #else
-	.kmap_atomic_pte = paravirt_nop,
+	.kmap_atomic_pte = pv_nop_kmap_atomic_pte,
 #endif
 
 #ifdef CONFIG_X86_PAE
@@ -627,11 +677,11 @@ struct paravirt_ops paravirt_ops = {
 	.irq_enable_sysexit = native_irq_enable_sysexit,
 	.iret = native_iret,
 
-	.dup_mmap = paravirt_nop,
-	.exit_mmap = paravirt_nop,
-	.activate_mm = paravirt_nop,
-
-	.startup_ipi_hook = paravirt_nop,
+	.dup_mmap = pv_nop_dup_mmap,
+	.exit_mmap = pv_nop_exit_mmap,
+	.activate_mm = pv_nop_activate_mm,
+
+	.startup_ipi_hook = pv_nop_startup_ipi_hook,
 };
 
 /*
diff -r 930fff55070e arch/i386/kernel/vmlinux.lds.S
--- a/arch/i386/kernel/vmlinux.lds.S	Fri Mar 16 11:09:10 2007 -0700
+++ b/arch/i386/kernel/vmlinux.lds.S	Fri Mar 16 14:16:42 2007 -0700
@@ -49,6 +49,9 @@ SECTIONS
 	SCHED_TEXT
 	LOCK_TEXT
 	KPROBES_TEXT
+	__paravirt_nop_start = .;
+	*(.paravirtnop)
+	__paravirt_nop_end = .;
 	*(.fixup)
 	*(.gnu.warning)
   	_etext = .;			/* End of text section */
diff -r 930fff55070e arch/i386/xen/enlighten.c
--- a/arch/i386/xen/enlighten.c	Fri Mar 16 11:09:10 2007 -0700
+++ b/arch/i386/xen/enlighten.c	Fri Mar 16 14:53:37 2007 -0700
@@ -697,7 +697,7 @@ static const struct paravirt_ops xen_par
 	.iret = (void *)&hypercall_page[__HYPERVISOR_iret],
 	.irq_enable_sysexit = (void *)xen_sti_sysexit,
 
-	.load_tr_desc = paravirt_nop,
+	.load_tr_desc = pv_nop_load_tr_desc,
 	.set_ldt = xen_set_ldt,
 	.load_gdt = xen_load_gdt,
 	.load_idt = xen_load_idt,
@@ -723,8 +723,8 @@ static const struct paravirt_ops xen_par
 	.apic_write = xen_apic_write,
 	.apic_write_atomic = xen_apic_write,
 	.apic_read = xen_apic_read,
-	.setup_boot_clock = paravirt_nop,
-	.setup_secondary_clock = paravirt_nop,
+	.setup_boot_clock = pv_nop_setup_boot_clock,
+	.setup_secondary_clock = pv_nop_setup_secondary_clock,
 #endif
 
 	.flush_tlb_user = xen_flush_tlb,
@@ -732,8 +732,8 @@ static const struct paravirt_ops xen_par
 	.flush_tlb_single = xen_flush_tlb_single,
 	.flush_tlb_others = xen_flush_tlb_others,
 
-	.pte_update = paravirt_nop,
-	.pte_update_defer = paravirt_nop,
+	.pte_update = pv_nop_pte_update,
+	.pte_update_defer = pv_nop_pte_update_defer,
 
 	.pagetable_setup_start = xen_pagetable_setup_start,
 	.pagetable_setup_done = xen_pagetable_setup_done,
@@ -747,9 +747,9 @@ static const struct paravirt_ops xen_par
 
 	.alloc_pt = xen_alloc_pt_init,
 	.release_pt = xen_release_pt,
-	.alloc_pd = paravirt_nop,
-	.alloc_pd_clone = paravirt_nop,
-	.release_pd = paravirt_nop,
+	.alloc_pd = pv_nop_alloc_pd,
+	.alloc_pd_clone = pv_nop_alloc_pd_clone,
+	.release_pd = pv_nop_release_pd,
 
 	.kmap_atomic_pte = xen_kmap_atomic_pte,
 
@@ -773,7 +773,7 @@ static const struct paravirt_ops xen_par
 #endif	/* PAE */
 
 	.set_lazy_mode = xen_set_lazy_mode,
-	.startup_ipi_hook = paravirt_nop,
+	.startup_ipi_hook = pv_nop_startup_ipi_hook,
 };
 
 #ifdef CONFIG_SMP
diff -r 930fff55070e include/asm-i386/paravirt.h
--- a/include/asm-i386/paravirt.h	Fri Mar 16 11:09:10 2007 -0700
+++ b/include/asm-i386/paravirt.h	Fri Mar 16 14:50:18 2007 -0700
@@ -202,6 +202,30 @@ int native_write_msr(unsigned int msr, u
 int native_write_msr(unsigned int msr, unsigned long long val);
 unsigned long long native_read_tsc(void);
 unsigned long long native_read_pmc(void);
+
+/* Paravirt nops */
+#define __paravirt_nop __attribute__ ((__section__ (".paravirtnop")))
+extern char __paravirt_nop_start[], __paravirt_nop_end[];
+
+void __paravirt_nop pv_nop_arch_setup(void);
+void __paravirt_nop pv_nop_set_lazy_mode(int mode);
+void __paravirt_nop pv_nop_alloc_pt(struct mm_struct *mm, u32 pfn);
+void __paravirt_nop pv_nop_alloc_pd(u32 pfn);
+void __paravirt_nop pv_nop_alloc_pd_clone(u32 pfn, u32 clonepfn, u32 start, u32 count);
+void __paravirt_nop pv_nop_release_pt(u32 pfn);
+void __paravirt_nop pv_nop_release_pd(u32 pfn);
+void __paravirt_nop pv_nop_pte_update(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+void __paravirt_nop pv_nop_pte_update_defer(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+void * __paravirt_nop pv_nop_kmap_atomic_pte(struct page *page, enum km_type type);
+void __paravirt_nop pv_nop_load_tr_desc(void);
+void __paravirt_nop pv_nop_activate_mm(struct mm_struct *prev, struct mm_struct *next);
+void __paravirt_nop pv_nop_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm);
+void __paravirt_nop pv_nop_exit_mmap(struct mm_struct *mm);
+void __paravirt_nop pv_nop_startup_ipi_hook(int phys_apicid, unsigned long start_eip, unsigned long start_esp);
+#ifdef CONFIG_X86_LOCAL_APIC
+void __paravirt_nop pv_nop_setup_boot_clock(void);
+void __paravirt_nop pv_nop_setup_secondary_clock(void);
+#endif
 
 /* Mark a paravirt probe function. */
 #define paravirt_probe(fn)						\
@@ -876,9 +900,6 @@ static inline pte_t raw_ptep_get_and_cle
 #define arch_enter_lazy_mmu_mode() PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_MMU)
 #define arch_leave_lazy_mmu_mode() PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_NONE)
 
-void _paravirt_nop(void);
-#define paravirt_nop	((void *)_paravirt_nop)
-
 /* These all sit in the .parainstructions section to tell us what to patch. */
 struct paravirt_patch_site {
 	u8 *instr; 		/* original instructions */

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
@ 2007-03-16 21:59             ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 21:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Ingo Molnar,
	Andrew Morton

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Chris Wright wrote:
> > how about __paravirt_nop_start < func < __paravirt_nop_end  and preserve
> > the types?
> >   
> 
> Er?  The reason for the (void *) cast is to stop gcc complaining about
> mismatched pointer types.

I mean like this (bunch of work, for a type check that we're really ignoring
anwyay, but this is the idea...)

diff -r 930fff55070e arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c	Fri Mar 16 11:09:10 2007 -0700
+++ b/arch/i386/kernel/paravirt.c	Fri Mar 16 14:48:04 2007 -0700
@@ -35,10 +35,60 @@
 #include <asm/tlbflush.h>
 #include <asm/timer.h>
 
-/* nop stub */
-void _paravirt_nop(void)
-{
-}
+/* nop stubs */
+void __paravirt_nop pv_nop_arch_setup(void)
+{
+}
+void __paravirt_nop pv_nop_set_lazy_mode(int mode)
+{
+}
+void __paravirt_nop pv_nop_alloc_pt(struct mm_struct *mm, u32 pfn)
+{
+}
+void __paravirt_nop pv_nop_alloc_pd(u32 pfn)
+{
+}
+void __paravirt_nop pv_nop_alloc_pd_clone(u32 pfn, u32 clonepfn, u32 start, u32 count)
+{
+}
+void __paravirt_nop pv_nop_release_pt(u32 pfn)
+{
+}
+void __paravirt_nop pv_nop_release_pd(u32 pfn)
+{
+}
+void __paravirt_nop pv_nop_pte_update(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+}
+void __paravirt_nop pv_nop_pte_update_defer(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+}
+void * __paravirt_nop pv_nop_kmap_atomic_pte(struct page *page, enum km_type type)
+{
+}
+void __paravirt_nop pv_nop_load_tr_desc(void)
+{
+}
+void __paravirt_nop pv_nop_activate_mm(struct mm_struct *prev, struct mm_struct *next)
+{
+}
+void __paravirt_nop pv_nop_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+}
+void __paravirt_nop pv_nop_exit_mmap(struct mm_struct *mm)
+{
+}
+void __paravirt_nop pv_nop_startup_ipi_hook(int phys_apicid, unsigned long start_eip, unsigned long start_esp)
+{
+}
+#ifdef CONFIG_X86_LOCAL_APIC
+void __paravirt_nop pv_nop_setup_boot_clock(void)
+{
+}
+void __paravirt_nop pv_nop_setup_secondary_clock(void)
+{
+}
+#endif
 
 static void __init default_banner(void)
 {
@@ -166,7 +216,7 @@ unsigned paravirt_patch_default(u8 type,
 	if (opfunc == NULL)
 		/* If there's no function, patch it with a ud2a (BUG) */
 		ret = paravirt_patch_insns(site, len, start_ud2a, end_ud2a);
-	else if (opfunc == paravirt_nop)
+	else if (opfunc >= __paravirt_nop_start || opfunc < __paravirt_nop_end)
 		/* If the operation is a nop, then nop the callsite */
 		ret = paravirt_patch_nop();
 	else if (type == PARAVIRT_PATCH(iret) ||
@@ -521,7 +571,7 @@ struct paravirt_ops paravirt_ops = {
 
  	.patch = native_patch,
 	.banner = default_banner,
-	.arch_setup = paravirt_nop,
+	.arch_setup = pv_nop_arch_setup,
 	.memory_setup = machine_specific_memory_setup,
 	.get_wallclock = native_get_wallclock,
 	.set_wallclock = native_set_wallclock,
@@ -577,7 +627,7 @@ struct paravirt_ops paravirt_ops = {
 	.setup_boot_clock = setup_boot_APIC_clock,
 	.setup_secondary_clock = setup_secondary_APIC_clock,
 #endif
-	.set_lazy_mode = paravirt_nop,
+	.set_lazy_mode = pv_nop_set_lazy_mode,
 
 	.pagetable_setup_start = native_pagetable_setup_start,
 	.pagetable_setup_done = native_pagetable_setup_done,
@@ -587,24 +637,24 @@ struct paravirt_ops paravirt_ops = {
 	.flush_tlb_single = native_flush_tlb_single,
 	.flush_tlb_others = native_flush_tlb_others,
 
-	.alloc_pt = paravirt_nop,
-	.alloc_pd = paravirt_nop,
-	.alloc_pd_clone = paravirt_nop,
-	.release_pt = paravirt_nop,
-	.release_pd = paravirt_nop,
+	.alloc_pt = pv_nop_alloc_pt,
+	.alloc_pd = pv_nop_alloc_pd,
+	.alloc_pd_clone = pv_nop_alloc_pd_clone,
+	.release_pt = pv_nop_release_pt,
+	.release_pd = pv_nop_release_pd,
 
 	.set_pte = native_set_pte,
 	.set_pte_at = native_set_pte_at,
 	.set_pmd = native_set_pmd,
-	.pte_update = paravirt_nop,
-	.pte_update_defer = paravirt_nop,
+	.pte_update = pv_nop_pte_update,
+	.pte_update_defer = pv_nop_pte_update_defer,
 
 	.ptep_get_and_clear = native_ptep_get_and_clear,
 
 #ifdef CONFIG_HIGHPTE
 	.kmap_atomic_pte = native_kmap_atomic_pte,
 #else
-	.kmap_atomic_pte = paravirt_nop,
+	.kmap_atomic_pte = pv_nop_kmap_atomic_pte,
 #endif
 
 #ifdef CONFIG_X86_PAE
@@ -627,11 +677,11 @@ struct paravirt_ops paravirt_ops = {
 	.irq_enable_sysexit = native_irq_enable_sysexit,
 	.iret = native_iret,
 
-	.dup_mmap = paravirt_nop,
-	.exit_mmap = paravirt_nop,
-	.activate_mm = paravirt_nop,
-
-	.startup_ipi_hook = paravirt_nop,
+	.dup_mmap = pv_nop_dup_mmap,
+	.exit_mmap = pv_nop_exit_mmap,
+	.activate_mm = pv_nop_activate_mm,
+
+	.startup_ipi_hook = pv_nop_startup_ipi_hook,
 };
 
 /*
diff -r 930fff55070e arch/i386/kernel/vmlinux.lds.S
--- a/arch/i386/kernel/vmlinux.lds.S	Fri Mar 16 11:09:10 2007 -0700
+++ b/arch/i386/kernel/vmlinux.lds.S	Fri Mar 16 14:16:42 2007 -0700
@@ -49,6 +49,9 @@ SECTIONS
 	SCHED_TEXT
 	LOCK_TEXT
 	KPROBES_TEXT
+	__paravirt_nop_start = .;
+	*(.paravirtnop)
+	__paravirt_nop_end = .;
 	*(.fixup)
 	*(.gnu.warning)
   	_etext = .;			/* End of text section */
diff -r 930fff55070e arch/i386/xen/enlighten.c
--- a/arch/i386/xen/enlighten.c	Fri Mar 16 11:09:10 2007 -0700
+++ b/arch/i386/xen/enlighten.c	Fri Mar 16 14:53:37 2007 -0700
@@ -697,7 +697,7 @@ static const struct paravirt_ops xen_par
 	.iret = (void *)&hypercall_page[__HYPERVISOR_iret],
 	.irq_enable_sysexit = (void *)xen_sti_sysexit,
 
-	.load_tr_desc = paravirt_nop,
+	.load_tr_desc = pv_nop_load_tr_desc,
 	.set_ldt = xen_set_ldt,
 	.load_gdt = xen_load_gdt,
 	.load_idt = xen_load_idt,
@@ -723,8 +723,8 @@ static const struct paravirt_ops xen_par
 	.apic_write = xen_apic_write,
 	.apic_write_atomic = xen_apic_write,
 	.apic_read = xen_apic_read,
-	.setup_boot_clock = paravirt_nop,
-	.setup_secondary_clock = paravirt_nop,
+	.setup_boot_clock = pv_nop_setup_boot_clock,
+	.setup_secondary_clock = pv_nop_setup_secondary_clock,
 #endif
 
 	.flush_tlb_user = xen_flush_tlb,
@@ -732,8 +732,8 @@ static const struct paravirt_ops xen_par
 	.flush_tlb_single = xen_flush_tlb_single,
 	.flush_tlb_others = xen_flush_tlb_others,
 
-	.pte_update = paravirt_nop,
-	.pte_update_defer = paravirt_nop,
+	.pte_update = pv_nop_pte_update,
+	.pte_update_defer = pv_nop_pte_update_defer,
 
 	.pagetable_setup_start = xen_pagetable_setup_start,
 	.pagetable_setup_done = xen_pagetable_setup_done,
@@ -747,9 +747,9 @@ static const struct paravirt_ops xen_par
 
 	.alloc_pt = xen_alloc_pt_init,
 	.release_pt = xen_release_pt,
-	.alloc_pd = paravirt_nop,
-	.alloc_pd_clone = paravirt_nop,
-	.release_pd = paravirt_nop,
+	.alloc_pd = pv_nop_alloc_pd,
+	.alloc_pd_clone = pv_nop_alloc_pd_clone,
+	.release_pd = pv_nop_release_pd,
 
 	.kmap_atomic_pte = xen_kmap_atomic_pte,
 
@@ -773,7 +773,7 @@ static const struct paravirt_ops xen_par
 #endif	/* PAE */
 
 	.set_lazy_mode = xen_set_lazy_mode,
-	.startup_ipi_hook = paravirt_nop,
+	.startup_ipi_hook = pv_nop_startup_ipi_hook,
 };
 
 #ifdef CONFIG_SMP
diff -r 930fff55070e include/asm-i386/paravirt.h
--- a/include/asm-i386/paravirt.h	Fri Mar 16 11:09:10 2007 -0700
+++ b/include/asm-i386/paravirt.h	Fri Mar 16 14:50:18 2007 -0700
@@ -202,6 +202,30 @@ int native_write_msr(unsigned int msr, u
 int native_write_msr(unsigned int msr, unsigned long long val);
 unsigned long long native_read_tsc(void);
 unsigned long long native_read_pmc(void);
+
+/* Paravirt nops */
+#define __paravirt_nop __attribute__ ((__section__ (".paravirtnop")))
+extern char __paravirt_nop_start[], __paravirt_nop_end[];
+
+void __paravirt_nop pv_nop_arch_setup(void);
+void __paravirt_nop pv_nop_set_lazy_mode(int mode);
+void __paravirt_nop pv_nop_alloc_pt(struct mm_struct *mm, u32 pfn);
+void __paravirt_nop pv_nop_alloc_pd(u32 pfn);
+void __paravirt_nop pv_nop_alloc_pd_clone(u32 pfn, u32 clonepfn, u32 start, u32 count);
+void __paravirt_nop pv_nop_release_pt(u32 pfn);
+void __paravirt_nop pv_nop_release_pd(u32 pfn);
+void __paravirt_nop pv_nop_pte_update(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+void __paravirt_nop pv_nop_pte_update_defer(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+void * __paravirt_nop pv_nop_kmap_atomic_pte(struct page *page, enum km_type type);
+void __paravirt_nop pv_nop_load_tr_desc(void);
+void __paravirt_nop pv_nop_activate_mm(struct mm_struct *prev, struct mm_struct *next);
+void __paravirt_nop pv_nop_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm);
+void __paravirt_nop pv_nop_exit_mmap(struct mm_struct *mm);
+void __paravirt_nop pv_nop_startup_ipi_hook(int phys_apicid, unsigned long start_eip, unsigned long start_esp);
+#ifdef CONFIG_X86_LOCAL_APIC
+void __paravirt_nop pv_nop_setup_boot_clock(void);
+void __paravirt_nop pv_nop_setup_secondary_clock(void);
+#endif
 
 /* Mark a paravirt probe function. */
 #define paravirt_probe(fn)						\
@@ -876,9 +900,6 @@ static inline pte_t raw_ptep_get_and_cle
 #define arch_enter_lazy_mmu_mode() PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_MMU)
 #define arch_leave_lazy_mmu_mode() PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_NONE)
 
-void _paravirt_nop(void);
-#define paravirt_nop	((void *)_paravirt_nop)
-
 /* These all sit in the .parainstructions section to tell us what to patch. */
 struct paravirt_patch_site {
 	u8 *instr; 		/* original instructions */

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
  2007-03-16 21:59             ` Chris Wright
@ 2007-03-16 22:10               ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 22:10 UTC (permalink / raw)
  To: Chris Wright
  Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Zachary Amsden, Rusty Russell

Chris Wright wrote:
> I mean like this (bunch of work, for a type check that we're really ignoring
> anwyay, but this is the idea...)
>   

Oh, I see.  I think this is the best argument yet for the current
arrangement...

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
@ 2007-03-16 22:10               ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 22:10 UTC (permalink / raw)
  To: Chris Wright
  Cc: Zachary Amsden, xen-devel, Andi Kleen, Rusty Russell,
	linux-kernel, virtualization, Ingo Molnar, Andrew Morton

Chris Wright wrote:
> I mean like this (bunch of work, for a type check that we're really ignoring
> anwyay, but this is the idea...)
>   

Oh, I see.  I think this is the best argument yet for the current
arrangement...

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
  2007-03-16 22:10               ` Jeremy Fitzhardinge
@ 2007-03-16 22:18                 ` Chris Wright
  -1 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 22:18 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Ingo Molnar, Andi Kleen, Andrew Morton,
	linux-kernel, virtualization, xen-devel, Zachary Amsden,
	Rusty Russell

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Chris Wright wrote:
> > I mean like this (bunch of work, for a type check that we're really ignoring
> > anwyay, but this is the idea...)
> 
> Oh, I see.  I think this is the best argument yet for the current
> arrangement...

Heh, like I said it's a bunch of work for literally nothing ;-)

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations
@ 2007-03-16 22:18                 ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-16 22:18 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, virtualization, Rusty Russell,
	linux-kernel, Chris Wright, Andi Kleen, Ingo Molnar,
	Andrew Morton

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Chris Wright wrote:
> > I mean like this (bunch of work, for a type check that we're really ignoring
> > anwyay, but this is the idea...)
> 
> Oh, I see.  I think this is the best argument yet for the current
> arrangement...

Heh, like I said it's a bunch of work for literally nothing ;-)

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes
  2007-03-16 21:56         ` Jeremy Fitzhardinge
@ 2007-03-16 22:20           ` Roland McGrath
  -1 siblings, 0 replies; 330+ messages in thread
From: Roland McGrath @ 2007-03-16 22:20 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Ulrich Drepper,
	Jakub Jelinek

> I'm not quite sure what you're suggesting here though.  Do you mean one of:
> 
> NOTE_KERNELCAP_BEGIN(1, 1)
> NOTE_KERNELCAP(0, "nosegneg")
> NOTE_KERNELCAP_END
> 
> or
> 
> NOTE_KERNELCAP_BEGIN(1, 2)
> NOTE_KERNELCAP(1, "nosegneg")
> NOTE_KERNELCAP_END
> 
> is the correct thing to use?

Yes.  (Sorry about the typo, 1 and 0 are close enough aren't they? ;-)

> > Some pre-release glibc's (before 2.4) had a bug in the code that parses
> > this, and would crash parsing the correct note.  Using the wrong bit value
> > with nonmatching mask worked around this.  IIRC, no glibc release ever
> > included the buggy version of the code.  In nonbuggy glibc, the mismatched
> > value causes the "nosegneg" to be omitted from the directory search (under
> > LD_LIBRARY_PATH and default directories), though ldconfig-based lookups
> > will work (the most common case).
> 
> Are you saying that one of the corrected forms might cause old glibcs to
> crash, or just ignore nosegneg?

Yes, the affected glibc crashed with the canonical form (the first above).
I'm not sure such a glibc exists in the wild today, maybe only in some FC5
test releases (I CC'd Jakub so he can verify that at least as far as FC).
Rik van Riel discovered that the s/0/1/ tweak sufficed for common daily use
(i.e. ld.so.cache hits), and did that temporarily when he was hacking on
Xen Linux kernels for Fedora, but reverted it after we fixed glibc.

The uncorrected form causes current (correct) glibc to ignore nosegneg 
(for cache misses).

Looking at the buggy version of the code, I think it will not crash with
the second form above, just avoid using bit 0.  (But I wouldn't swear to it
without testing it.)  The second form should certainly be fine with the
current glibc.  Just make sure that "kernelcap 1 nosegneg" is used in the
ld.so.conf.d file, to match 1 in the NOTE_KERNELCAP first arg.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes
@ 2007-03-16 22:20           ` Roland McGrath
  0 siblings, 0 replies; 330+ messages in thread
From: Roland McGrath @ 2007-03-16 22:20 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Jakub Jelinek, xen-devel, Ian Pratt, linux-kernel,
	Ulrich Drepper, Chris Wright, virtualization, Ingo Molnar,
	Andrew Morton, Andi Kleen

> I'm not quite sure what you're suggesting here though.  Do you mean one of:
> 
> NOTE_KERNELCAP_BEGIN(1, 1)
> NOTE_KERNELCAP(0, "nosegneg")
> NOTE_KERNELCAP_END
> 
> or
> 
> NOTE_KERNELCAP_BEGIN(1, 2)
> NOTE_KERNELCAP(1, "nosegneg")
> NOTE_KERNELCAP_END
> 
> is the correct thing to use?

Yes.  (Sorry about the typo, 1 and 0 are close enough aren't they? ;-)

> > Some pre-release glibc's (before 2.4) had a bug in the code that parses
> > this, and would crash parsing the correct note.  Using the wrong bit value
> > with nonmatching mask worked around this.  IIRC, no glibc release ever
> > included the buggy version of the code.  In nonbuggy glibc, the mismatched
> > value causes the "nosegneg" to be omitted from the directory search (under
> > LD_LIBRARY_PATH and default directories), though ldconfig-based lookups
> > will work (the most common case).
> 
> Are you saying that one of the corrected forms might cause old glibcs to
> crash, or just ignore nosegneg?

Yes, the affected glibc crashed with the canonical form (the first above).
I'm not sure such a glibc exists in the wild today, maybe only in some FC5
test releases (I CC'd Jakub so he can verify that at least as far as FC).
Rik van Riel discovered that the s/0/1/ tweak sufficed for common daily use
(i.e. ld.so.cache hits), and did that temporarily when he was hacking on
Xen Linux kernels for Fedora, but reverted it after we fixed glibc.

The uncorrected form causes current (correct) glibc to ignore nosegneg 
(for cache misses).

Looking at the buggy version of the code, I think it will not crash with
the second form above, just avoid using bit 0.  (But I wouldn't swear to it
without testing it.)  The second form should certainly be fine with the
current glibc.  Just make sure that "kernelcap 1 nosegneg" is used in the
ld.so.conf.d file, to match 1 in the NOTE_KERNELCAP first arg.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-16 17:36       ` Jeremy Fitzhardinge
@ 2007-03-16 23:29         ` Zachary Amsden
  -1 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-16 23:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Linus Torvalds, David Miller
  Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Rusty Russell,
	Anthony Liguori

Jeremy Fitzhardinge wrote:
> Well, one thing to make clear is this is absolutely not a Xen-specific
> patch or piece of code.  This is part of the paravirt_ops infrastructure
> designed to remove the overhead of all the indirect calls which are
> scattered all over the place.  (Perhaps I should post the general
> paravirt and Xen specific patches in separate patch series to make this
> clear...).
>
> The idea is to wrap the callsite itself with in the same manner as the
> other altinstructions so that the general patcher can, at the very
> least, convert the indirect call to a direct one, or nop it out if its
> an indirect call.  This means that a pv_ops implementation can get about
> 90% of the benefit of patching without any extra effort.
>   

I like this code very much; although it is unavoidably ugly, it is a 
nice general mechanism for doing code rewriting.  Much more elaboration 
on this below.

> So, I disagree with your characterisation that its "limited"; this is a
> pretty general mechanism.  The fragile part is in using the PVOP_*
> macros properly to match the ABI's calling convention, particularly with
> tricky cases like passed and returned structures and 64-bit args.  But
> that just needs to be done once in one place, and is otherwise
> self-contained.
>   

You could just use the VMI ABI, then patch everything at runtime to call 
into the Xen paravirt-ops backend ;)

> I would love a better mechanism.  I played with using things like gcc's
> builtin_apply stuff, and so on, but I could find no way to get gcc to
> set up the args and then be able to just emit the call itself under asm
> control.
>   

I fought tooth and nail to get something cleaner than this for VMI back 
when it was a subarch.  In the end, the best I could do was wrap the 
constraints into prettier macros so the asm volatile stuff wasn't 
sticking out everywhere.  It was pretty, but the macros were so 
grotesque that I was exiled from my home planet.

static inline void local_irq_restore(const unsigned long flags)
{
        vmi_wrap_call(
                SetInterruptMask, "pushl %0; popfl",
                VMI_NO_OUTPUT,
                1, VMI_IREG1 (flags),
                XCONC("cc", "memory"));
}

So the constraints are obvious and tied to the inline assembly.  But 
Jeremy seems to have done even better with the vcall stuff.  Prettier:

+	PVOP_VCALL0(setup_boot_clock);

>
> I haven't looked at Dave's reply in detail, but I saw some mention of
> using relocs.  The idea is intriguing , but I don't quite see how it
> would all fit together.
>   

We went through this design exercise, and thought it was pretty 
promising.  Basically, you would reserve a set of "local" relocation 
types that should never be emitted by the toolchain.  Then you can have 
complex relocations, such as "replace pushf; popf %0 with arbitrary 
code."  You can even leave the arguments unfixed and grant the compiler 
register allocation, as long as you took care to encode the input / 
output registers somewhere (in a .reloc section of some sort, or encoded 
in the relocation type itself).

Now you can make complex decisions at runtime, and apply choice 
functions to these relocations that can cope with a variety of different 
circumstances - you could encode not just paravirt-ops as relocations, 
but all of the alternative instructions, and smp alternatives, and even 
higher level constructs, such as choices made by the user with the 
kernel command line - some potential examples:

acpi=noirq
idle=halt

With proper synchronization, using something like stop_machine_run, you 
can even make these choices dynamically, and then relink the kernel in 
place to take faster paths.  And the technique is universal, so you 
could use it cross architecture, which would be really helpful for 
architectures that say, have really slow indirect branches.

Once the technique gains wide acceptance, you could use it for all 
kernel interfaces which have static function pointers for the post-init 
lifetime of the kernel.  Which might contribute to a global performance 
improvement of perhaps a couple percent.  But the cost is clearly the 
complexity.

I just had a slightly interesting idea - you could even catch bugs where 
dynamic assignments to function pointers fail to update the appropriate 
patch sites by checking for non .init code sections which write through 
accelerated_fn_ptr_t's using static checking from sparse.

Is that sort of what you were thinking of Dave?

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-16 23:29         ` Zachary Amsden
  0 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-16 23:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Linus Torvalds, David Miller
  Cc: xen-devel, Andi Kleen, Rusty Russell, linux-kernel, Chris Wright,
	virtualization, Anthony Liguori, Ingo Molnar, Andrew Morton

Jeremy Fitzhardinge wrote:
> Well, one thing to make clear is this is absolutely not a Xen-specific
> patch or piece of code.  This is part of the paravirt_ops infrastructure
> designed to remove the overhead of all the indirect calls which are
> scattered all over the place.  (Perhaps I should post the general
> paravirt and Xen specific patches in separate patch series to make this
> clear...).
>
> The idea is to wrap the callsite itself with in the same manner as the
> other altinstructions so that the general patcher can, at the very
> least, convert the indirect call to a direct one, or nop it out if its
> an indirect call.  This means that a pv_ops implementation can get about
> 90% of the benefit of patching without any extra effort.
>   

I like this code very much; although it is unavoidably ugly, it is a 
nice general mechanism for doing code rewriting.  Much more elaboration 
on this below.

> So, I disagree with your characterisation that its "limited"; this is a
> pretty general mechanism.  The fragile part is in using the PVOP_*
> macros properly to match the ABI's calling convention, particularly with
> tricky cases like passed and returned structures and 64-bit args.  But
> that just needs to be done once in one place, and is otherwise
> self-contained.
>   

You could just use the VMI ABI, then patch everything at runtime to call 
into the Xen paravirt-ops backend ;)

> I would love a better mechanism.  I played with using things like gcc's
> builtin_apply stuff, and so on, but I could find no way to get gcc to
> set up the args and then be able to just emit the call itself under asm
> control.
>   

I fought tooth and nail to get something cleaner than this for VMI back 
when it was a subarch.  In the end, the best I could do was wrap the 
constraints into prettier macros so the asm volatile stuff wasn't 
sticking out everywhere.  It was pretty, but the macros were so 
grotesque that I was exiled from my home planet.

static inline void local_irq_restore(const unsigned long flags)
{
        vmi_wrap_call(
                SetInterruptMask, "pushl %0; popfl",
                VMI_NO_OUTPUT,
                1, VMI_IREG1 (flags),
                XCONC("cc", "memory"));
}

So the constraints are obvious and tied to the inline assembly.  But 
Jeremy seems to have done even better with the vcall stuff.  Prettier:

+	PVOP_VCALL0(setup_boot_clock);

>
> I haven't looked at Dave's reply in detail, but I saw some mention of
> using relocs.  The idea is intriguing , but I don't quite see how it
> would all fit together.
>   

We went through this design exercise, and thought it was pretty 
promising.  Basically, you would reserve a set of "local" relocation 
types that should never be emitted by the toolchain.  Then you can have 
complex relocations, such as "replace pushf; popf %0 with arbitrary 
code."  You can even leave the arguments unfixed and grant the compiler 
register allocation, as long as you took care to encode the input / 
output registers somewhere (in a .reloc section of some sort, or encoded 
in the relocation type itself).

Now you can make complex decisions at runtime, and apply choice 
functions to these relocations that can cope with a variety of different 
circumstances - you could encode not just paravirt-ops as relocations, 
but all of the alternative instructions, and smp alternatives, and even 
higher level constructs, such as choices made by the user with the 
kernel command line - some potential examples:

acpi=noirq
idle=halt

With proper synchronization, using something like stop_machine_run, you 
can even make these choices dynamically, and then relink the kernel in 
place to take faster paths.  And the technique is universal, so you 
could use it cross architecture, which would be really helpful for 
architectures that say, have really slow indirect branches.

Once the technique gains wide acceptance, you could use it for all 
kernel interfaces which have static function pointers for the post-init 
lifetime of the kernel.  Which might contribute to a global performance 
improvement of perhaps a couple percent.  But the cost is clearly the 
complexity.

I just had a slightly interesting idea - you could even catch bugs where 
dynamic assignments to function pointers fail to update the appropriate 
patch sites by checking for non .init code sections which write through 
accelerated_fn_ptr_t's using static checking from sparse.

Is that sort of what you were thinking of Dave?

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-16 23:29         ` Zachary Amsden
  (?)
@ 2007-03-17  0:40         ` Jeremy Fitzhardinge
  2007-03-17  9:10           ` Zachary Amsden
  -1 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-17  0:40 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Linus Torvalds, David Miller, Ingo Molnar, Andi Kleen,
	Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Rusty Russell, Anthony Liguori

Zachary Amsden wrote:
> I like this code very much; although it is unavoidably ugly, it is a
> nice general mechanism for doing code rewriting.  Much more
> elaboration on this below.
>

Thanks.

> static inline void local_irq_restore(const unsigned long flags)
> {
>        vmi_wrap_call(
>                SetInterruptMask, "pushl %0; popfl",
>                VMI_NO_OUTPUT,
>                1, VMI_IREG1 (flags),
>                XCONC("cc", "memory"));
> }
>
> So the constraints are obvious and tied to the inline assembly.  But
> Jeremy seems to have done even better with the vcall stuff.  Prettier:
>
> +    PVOP_VCALL0(setup_boot_clock);

Yeah, it doesn't try as hard as your example, so its all based around
the function call ABI.  If you want to inline something, you need to do
that elsewhere, which I guess is OK because that's not the
common case (only very simple cases can be replaced by inlines, and only
a few of those are worth doing).

> We went through this design exercise, and thought it was pretty
> promising.  Basically, you would reserve a set of "local" relocation
> types that should never be emitted by the toolchain.  Then you can
> have complex relocations, such as "replace pushf; popf %0 with
> arbitrary code."  You can even leave the arguments unfixed and grant
> the compiler register allocation, as long as you took care to encode
> the input / output registers somewhere (in a .reloc section of some
> sort, or encoded in the relocation type itself).

I'm pretty sure that's not what he means.  The big objection to the
PVOP_* stuff is the fact that there are these massive macros full of
inline asm to wrap the calls, which have to be invoked in a fragile
type-unsafe way.  Adding custom relocs would suffer the same problem,
since you'd need inline asm to deal with them, and I'm deathly
frightened of whatever binutils would do if you mean real relocs.

I think the suggestion is much simpler.  If you convince gcc/binutils to
leave the .reloc section in vmlinux, and make that available to the
kernel itself, then you can scan all the kernel's relocs to find ones
which refer to paravirt_ops, and use those to determine which are
callsites that can be patched.

The main upside is that all the callsites are just normal C calls;
there's no special syntax or strange macros, and we get the full benefit
of typechecking, etc.

But I can see a few downsides compared the current scheme:

   1. Identifying the callsites is a somewhat hackish process of looking
      at a reloc and doing a bit of dissassembly to see what is using
      the reloc, to identify calls and jumps
   2. There's nothing explicit to tell us how much space there is to
      patch into; we just have to assume sizeof(indirect call/jmp)
   3. There's no information about the register environment at the
      callsite, so we just have to adopt normal C ABI rules.  For the
      patch sites in hand-written asm, this could be tricky.
   4. gcc could do strange things which prevent detection of patch
      sites.  For example, it might CSE the value of, say,
      paravirt_ops.irq_enable, which would be a reasonable optimisation,
      but prevent any of the resulting indirect calls from being
      patched.  In general it relies on gcc to generate identifiable
      callsites, which is a bit unpredictable.
   5. There's still a moderate amount of binutils hackery to get the
      relocs into the right form, and there's plenty of scope for it to
      screw up.

> [ Roswell technology deleted ]

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-17  0:40         ` Jeremy Fitzhardinge
@ 2007-03-17  9:10           ` Zachary Amsden
  0 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-17  9:10 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linus Torvalds, David Miller, Ingo Molnar, Andi Kleen,
	Andrew Morton, linux-kernel, virtualization, xen-devel,
	Chris Wright, Rusty Russell, Anthony Liguori

Jeremy Fitzhardinge wrote:
> I think the suggestion is much simpler.  If you convince gcc/binutils to
> leave the .reloc section in vmlinux, and make that available to the
> kernel itself, then you can scan all the kernel's relocs to find ones
> which refer to paravirt_ops, and use those to determine which are
> callsites that can be patched.
>   

Yes, that is pretty nice.

> The main upside is that all the callsites are just normal C calls;
> there's no special syntax or strange macros, and we get the full benefit
> of typechecking, etc.
>
> But I can see a few downsides compared the current scheme:
>
>    1. Identifying the callsites is a somewhat hackish process of looking
>       at a reloc and doing a bit of dissassembly to see what is using
>       the reloc, to identify calls and jumps
>    2. There's nothing explicit to tell us how much space there is to
>       patch into; we just have to assume sizeof(indirect call/jmp)
>    3. There's no information about the register environment at the
>       callsite, so we just have to adopt normal C ABI rules.  For the
>       patch sites in hand-written asm, this could be tricky.
>    4. gcc could do strange things which prevent detection of patch
>       sites.  For example, it might CSE the value of, say,
>       paravirt_ops.irq_enable, which would be a reasonable optimisation,
>       but prevent any of the resulting indirect calls from being
>       patched.  In general it relies on gcc to generate identifiable
>       callsites, which is a bit unpredictable.
>    5. There's still a moderate amount of binutils hackery to get the
>       relocs into the right form, and there's plenty of scope for it to
>       screw up.
>   

And yes, those are nasty points.  I think I'd be interested in seeing 
more discussion on it, perhaps those issues could be worked out.

>> [ Roswell technology deleted ]
>>     

Nack.  Everyone needs Roswell technology.

Actually, that was not a serious proposal.  I think the effort and 
complexity would likely not justify the gain.  But I still had to throw 
it out, since it is what we use on Betelgeuse.

Z

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-16  9:14     ` Ingo Molnar
@ 2007-03-17  9:13       ` Rusty Russell
  -1 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-17  9:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Zachary Amsden, xen-devel, Ian Pratt,
	virtualization, linux-kernel, Adrian Bunk, Chris Wright,
	Andi Kleen, Andrew Morton, Thomas Gleixner, Christian Limpach

On Fri, 2007-03-16 at 10:14 +0100, Ingo Molnar wrote:
> > +unsigned long xen_pmd_val(pmd_t pmd)
> > +{
> > +	BUG();
> > +	return 0;
> > +}
> 
> make it noret.

OK, I missed this one.  How?

Wondering if I've missed a trick here...
Rusty.



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
@ 2007-03-17  9:13       ` Rusty Russell
  0 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-17  9:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, virtualization,
	linux-kernel, Adrian Bunk, Chris Wright, Ian Pratt,
	Andrew Morton, Christian Limpach, Thomas Gleixner, Andi Kleen

On Fri, 2007-03-16 at 10:14 +0100, Ingo Molnar wrote:
> > +unsigned long xen_pmd_val(pmd_t pmd)
> > +{
> > +	BUG();
> > +	return 0;
> > +}
> 
> make it noret.

OK, I missed this one.  How?

Wondering if I've missed a trick here...
Rusty.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 14/26] Xen-paravirt_ops: add common patching machinery
  2007-03-16  9:20     ` Ingo Molnar
@ 2007-03-17  9:15       ` Rusty Russell
  -1 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-17  9:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	Anthony Liguori

On Fri, 2007-03-16 at 10:20 +0100, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > Implement the actual patching machinery.  paravirt_patch_default() 
> > contains the logic to automatically patch a callsite based on a few 
> > simple rules:
> > 
> >  - if the paravirt_op function is paravirt_nop, then patch nops
> >  - if the paravirt_op function is a jmp target, then jmp to it
> >  - if the paravirt_op function is callable and doesn't clobber too much
> >     for the callsite, call it directly
> > 
> > paravirt_patch_default is suitable as a default implementation of 
> > paravirt_ops.patch, will remove most of the expensive indirect calls 
> > in favour of either a direct call or a pile of nops.
> 
> Acked-by: Ingo Molnar <mingo@elte.hu>

I like this one too, but note that it needs a twist when we change to
use direct calls to wrappers for the Great paravirt_ops Unexporting.

Rusty.



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 14/26] Xen-paravirt_ops: add common patching machinery
@ 2007-03-17  9:15       ` Rusty Russell
  0 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-17  9:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, Andi Kleen,
	linux-kernel, Chris Wright, virtualization, Anthony Liguori,
	Andrew Morton

On Fri, 2007-03-16 at 10:20 +0100, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > Implement the actual patching machinery.  paravirt_patch_default() 
> > contains the logic to automatically patch a callsite based on a few 
> > simple rules:
> > 
> >  - if the paravirt_op function is paravirt_nop, then patch nops
> >  - if the paravirt_op function is a jmp target, then jmp to it
> >  - if the paravirt_op function is callable and doesn't clobber too much
> >     for the callsite, call it directly
> > 
> > paravirt_patch_default is suitable as a default implementation of 
> > paravirt_ops.patch, will remove most of the expensive indirect calls 
> > in favour of either a direct call or a pile of nops.
> 
> Acked-by: Ingo Molnar <mingo@elte.hu>

I like this one too, but note that it needs a twist when we change to
use direct calls to wrappers for the Great paravirt_ops Unexporting.

Rusty.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-16  9:24     ` Ingo Molnar
@ 2007-03-17  9:26       ` Rusty Russell
  -1 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-17  9:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Zachary Amsden, xen-devel, virtualization,
	linux-kernel, Chris Wright, Andi Kleen, Anthony Liguori,
	Andrew Morton, Linus Torvalds

On Fri, 2007-03-16 at 10:24 +0100, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > Wrap a set of interesting paravirt_ops calls in a wrapper which makes 
> > the callsites available for patching.  Unfortunately this is pretty 
> > ugly because there's no way to get gcc to generate a function call, 
> > but also wrap just the callsite itself with the necessary labels.
> > 
> > This patch supports functions with 0-4 arguments, and either void or 
> > returning a value.  64-bit arguments must be split into a pair of 
> > 32-bit arguments (lower word first).  Small structures are returned in 
> > registers.
> 
> ugh. This is beyond ugly! Why dont we just compile two images, one for 
> Xen and one for native, do two passes to get those two images and 
> 'merge' them into a single vmlinuz (so that we still have a 'single' 
> kernel unit to deal with on the distro side). This way we avoid all this 
> crazy, limited, fragile patchery business...

But with lguest, VMI and kvm I don't think that's a good idea.

For background: the current patching code is ugly too, but it only deals
with the 6 most common functions, so it's contained.  This gets us
pretty close to !CONFIG_PARAVIRT performance.  (But slowdown is still
measurable).

We get worse with Xen wanting to alter mkpte et al.  By the time we
patch another 6 functions (each one slightly different), we get pretty
close to general coverage anyway, so it's clearer to do them all.

I think the question is, can we do it better than this?  My previous
attempts (which lead to the current code) ranged from ugly to very ugly
8(

Rusty.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-17  9:26       ` Rusty Russell
  0 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-17  9:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, Andi Kleen,
	linux-kernel, Chris Wright, virtualization, Anthony Liguori,
	Andrew Morton, Linus Torvalds

On Fri, 2007-03-16 at 10:24 +0100, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > Wrap a set of interesting paravirt_ops calls in a wrapper which makes 
> > the callsites available for patching.  Unfortunately this is pretty 
> > ugly because there's no way to get gcc to generate a function call, 
> > but also wrap just the callsite itself with the necessary labels.
> > 
> > This patch supports functions with 0-4 arguments, and either void or 
> > returning a value.  64-bit arguments must be split into a pair of 
> > 32-bit arguments (lower word first).  Small structures are returned in 
> > registers.
> 
> ugh. This is beyond ugly! Why dont we just compile two images, one for 
> Xen and one for native, do two passes to get those two images and 
> 'merge' them into a single vmlinuz (so that we still have a 'single' 
> kernel unit to deal with on the distro side). This way we avoid all this 
> crazy, limited, fragile patchery business...

But with lguest, VMI and kvm I don't think that's a good idea.

For background: the current patching code is ugly too, but it only deals
with the 6 most common functions, so it's contained.  This gets us
pretty close to !CONFIG_PARAVIRT performance.  (But slowdown is still
measurable).

We get worse with Xen wanting to alter mkpte et al.  By the time we
patch another 6 functions (each one slightly different), we get pretty
close to general coverage anyway, so it's clearer to do them all.

I think the question is, can we do it better than this?  My previous
attempts (which lead to the current code) ranged from ugly to very ugly
8(

Rusty.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable
  2007-03-16 18:39     ` Jeremy Fitzhardinge
@ 2007-03-17  9:47         ` Rusty Russell
  2007-03-17  9:47         ` Rusty Russell
  1 sibling, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-17  9:47 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel,
	virtualization, xen-devel, Chris Wright, Zachary Amsden,
	William Lee Irwin III

On Fri, 2007-03-16 at 11:39 -0700, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> >> +	/* Make sure kernel address space is empty so that a pagetable
> >> +	   will be allocated for it. */
> >>     
> >
> > comment style.
> >   
> 
> As you've noticed its a comment style I use quite often.

Me too, and Ingo and I have had this argument before.  These days I try
hard to fit simple comments in a single line 8)

Rusty.



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable
@ 2007-03-17  9:47         ` Rusty Russell
  0 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-17  9:47 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, Andi Kleen, linux-kernel,
	William Lee Irwin III, Chris Wright, virtualization, Ingo Molnar,
	Andrew Morton

On Fri, 2007-03-16 at 11:39 -0700, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> >> +	/* Make sure kernel address space is empty so that a pagetable
> >> +	   will be allocated for it. */
> >>     
> >
> > comment style.
> >   
> 
> As you've noticed its a comment style I use quite often.

Me too, and Ingo and I have had this argument before.  These days I try
hard to fit simple comments in a single line 8)

Rusty.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-16 20:38       ` Jeremy Fitzhardinge
@ 2007-03-17 10:33           ` Rusty Russell
  0 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-17 10:33 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: David Miller, mingo, ak, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, anthony, torvalds, NetDev

On Fri, 2007-03-16 at 13:38 -0700, Jeremy Fitzhardinge wrote:
> David Miller wrote:
> > Perhaps the problem can be dealt with using ELF relocations.
> >
> > There is another case, discussed yesterday on netdev, where run-time
> > resolution of ELF relocations would be useful (for
> > very-very-very-read-only variables) so if it can solve this problem
> > too it would be nice to have a generic infrastructure for it.
> 
> That's an interesting idea.  Have you or anyone else looked at what it
> would take to code up?
> 
> For this case, I guess you'd walk the relocs looking for references into
> the paravirt_ops structure.  You'd need to check that was a reference
> from an indirect jump or call instruction, which would identify a
> patchable callsite.  The offset into the pv_ops structure would identify
> which operation is involved.

I wrote a whole email on ways to do this, BUT...

Perhaps our patching code can already be vastly simplified to something
like:

#define pv_patch(call, args...) \
	asm volatile("8888:"); 
	call(args);
	asm volatile("8889:"
	 [ stuff to put 8889, 8888 and call in fixup section ]

The patching code doesn't even need to decode between those two: it
simply looks for an indirect call insn (0xff 0xd?).  If it finds more
than one, it doesn't patch.  If it only finds one, it patches.  We'll
probably hit them all just doing that.

> What are the netdev requirements?

Reading Ben LaHaise's (very cool!) patch, it's not clear that using
reloc postprocessing is going to be clearer than open-coding it as he
has done.

Rusty.



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-17 10:33           ` Rusty Russell
  0 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-17 10:33 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: zach, xen-devel, akpm, ak, NetDev, linux-kernel, chrisw,
	virtualization, anthony, mingo, torvalds, David Miller

On Fri, 2007-03-16 at 13:38 -0700, Jeremy Fitzhardinge wrote:
> David Miller wrote:
> > Perhaps the problem can be dealt with using ELF relocations.
> >
> > There is another case, discussed yesterday on netdev, where run-time
> > resolution of ELF relocations would be useful (for
> > very-very-very-read-only variables) so if it can solve this problem
> > too it would be nice to have a generic infrastructure for it.
> 
> That's an interesting idea.  Have you or anyone else looked at what it
> would take to code up?
> 
> For this case, I guess you'd walk the relocs looking for references into
> the paravirt_ops structure.  You'd need to check that was a reference
> from an indirect jump or call instruction, which would identify a
> patchable callsite.  The offset into the pv_ops structure would identify
> which operation is involved.

I wrote a whole email on ways to do this, BUT...

Perhaps our patching code can already be vastly simplified to something
like:

#define pv_patch(call, args...) \
	asm volatile("8888:"); 
	call(args);
	asm volatile("8889:"
	 [ stuff to put 8889, 8888 and call in fixup section ]

The patching code doesn't even need to decode between those two: it
simply looks for an indirect call insn (0xff 0xd?).  If it finds more
than one, it doesn't patch.  If it only finds one, it patches.  We'll
probably hit them all just doing that.

> What are the netdev requirements?

Reading Ben LaHaise's (very cool!) patch, it's not clear that using
reloc postprocessing is going to be clearer than open-coding it as he
has done.

Rusty.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-17  9:13       ` Rusty Russell
  (?)
@ 2007-03-18  7:03       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-18  7:03 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, Zachary Amsden, xen-devel, Ian Pratt,
	virtualization, linux-kernel, Adrian Bunk, Chris Wright,
	Andi Kleen, Andrew Morton, Thomas Gleixner, Christian Limpach

Rusty Russell wrote:
> On Fri, 2007-03-16 at 10:14 +0100, Ingo Molnar wrote:
>   
>>> +unsigned long xen_pmd_val(pmd_t pmd)
>>> +{
>>> +	BUG();
>>> +	return 0;
>>> +}
>>>       
>> make it noret.
>>     
>
> OK, I missed this one.  How?
>
> Wondering if I've missed a trick here...

No, I don't think its terribly useful to make it noret; noret is an
interface annotation to let the caller know the function won't return. 
If I mark this noret, then the compiler will complain that it does
appear to return.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-17 10:33           ` Rusty Russell
  (?)
@ 2007-03-18  7:33           ` David Miller
  2007-03-18  7:59               ` Jeremy Fitzhardinge
  2007-03-18 12:08             ` Andi Kleen
  -1 siblings, 2 replies; 330+ messages in thread
From: David Miller @ 2007-03-18  7:33 UTC (permalink / raw)
  To: rusty
  Cc: jeremy, mingo, ak, akpm, linux-kernel, virtualization, xen-devel,
	chrisw, zach, anthony, torvalds, netdev

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Sat, 17 Mar 2007 21:33:58 +1100

> On Fri, 2007-03-16 at 13:38 -0700, Jeremy Fitzhardinge wrote:
> > David Miller wrote:
> > > Perhaps the problem can be dealt with using ELF relocations.
> > >
> > > There is another case, discussed yesterday on netdev, where run-time
> > > resolution of ELF relocations would be useful (for
> > > very-very-very-read-only variables) so if it can solve this problem
> > > too it would be nice to have a generic infrastructure for it.
> > 
> > That's an interesting idea.  Have you or anyone else looked at what it
> > would take to code up?
> > 
> > For this case, I guess you'd walk the relocs looking for references into
> > the paravirt_ops structure.  You'd need to check that was a reference
> > from an indirect jump or call instruction, which would identify a
> > patchable callsite.  The offset into the pv_ops structure would identify
> > which operation is involved.
> 
> I wrote a whole email on ways to do this, BUT...

The idea is _NOT_ that you go look for references to the paravirt_ops
members structure, that would be stupid and you wouldn't be able to
use the most efficient addressing mode on a given cpu, you'd be
patching up indirect calls and crap like that.  Just say no...

Instead you get rid of paravirt ops completely, and you call functions
whose symbol name will not resolve in the initial kernel link.

You do an initial link of the kernel, look for the unresolved symbols
in the ELF relocation tables (just like the linker does), and put
those references into a table that is use to patch things up and you
can use standard ELF relocation code to handle this, exactly like code
we already have for module loading in the kernel already.

This idea is about 15 years old, sparc32 has been doing exactly this
via something called "btfixup" to handle the page table, TLB, and
cache differences of 15 different cpu+cache type combinations.

> #define pv_patch(call, args...) \
> 	asm volatile("8888:"); 
> 	call(args);
> 	asm volatile("8889:"
> 	 [ stuff to put 8889, 8888 and call in fixup section ]

Please, use ELF and it's powerful and clean existing way to
do this please. :-)

> > What are the netdev requirements?
> 
> Reading Ben LaHaise's (very cool!) patch, it's not clear that using
> reloc postprocessing is going to be clearer than open-coding it as he
> has done.

Ben's case can be handled in the same way.  Just do not define the
symbols, pre-link, look for the references in the relocation tables,
and run through that when you do the set_very_readonly() or
install_paravirt_ops() thing.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-18  7:33           ` David Miller
@ 2007-03-18  7:59               ` Jeremy Fitzhardinge
  2007-03-18 12:08             ` Andi Kleen
  1 sibling, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-18  7:59 UTC (permalink / raw)
  To: David Miller
  Cc: rusty, mingo, ak, akpm, linux-kernel, virtualization, xen-devel,
	chrisw, zach, anthony, torvalds, netdev

David Miller wrote:
> The idea is _NOT_ that you go look for references to the paravirt_ops
> members structure, that would be stupid and you wouldn't be able to
> use the most efficient addressing mode on a given cpu, you'd be
> patching up indirect calls and crap like that.  Just say no...
>
> Instead you get rid of paravirt ops completely, and you call functions
> whose symbol name will not resolve in the initial kernel link.
>   

Yeah, I came to that conclusion after thinking about it for a while. 
Thanks for the pointer to the sparc stuff; it looks very interesting.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-18  7:59               ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-18  7:59 UTC (permalink / raw)
  To: David Miller
  Cc: zach, xen-devel, ak, netdev, rusty, linux-kernel, chrisw,
	virtualization, anthony, mingo, torvalds, akpm

David Miller wrote:
> The idea is _NOT_ that you go look for references to the paravirt_ops
> members structure, that would be stupid and you wouldn't be able to
> use the most efficient addressing mode on a given cpu, you'd be
> patching up indirect calls and crap like that.  Just say no...
>
> Instead you get rid of paravirt ops completely, and you call functions
> whose symbol name will not resolve in the initial kernel link.
>   

Yeah, I came to that conclusion after thinking about it for a while. 
Thanks for the pointer to the sparc stuff; it looks very interesting.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-18  7:33           ` David Miller
  2007-03-18  7:59               ` Jeremy Fitzhardinge
@ 2007-03-18 12:08             ` Andi Kleen
  2007-03-18 15:58               ` Jeremy Fitzhardinge
  2007-03-19  2:47               ` Rusty Russell
  1 sibling, 2 replies; 330+ messages in thread
From: Andi Kleen @ 2007-03-18 12:08 UTC (permalink / raw)
  To: David Miller
  Cc: rusty, jeremy, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, anthony, torvalds, netdev

> The idea is _NOT_ that you go look for references to the paravirt_ops
> members structure, that would be stupid and you wouldn't be able to
> use the most efficient addressing mode on a given cpu, you'd be
> patching up indirect calls and crap like that.  Just say no...

That wouldn't handle inlines though. At least some of the current
paravirtops like cli/sti are critical enough to require inlining.

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-18 12:08             ` Andi Kleen
@ 2007-03-18 15:58               ` Jeremy Fitzhardinge
  2007-03-18 17:04                 ` Andi Kleen
  2007-03-19  2:47               ` Rusty Russell
  1 sibling, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-18 15:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, rusty, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, anthony, torvalds, netdev

Andi Kleen wrote:
>> The idea is _NOT_ that you go look for references to the paravirt_ops
>> members structure, that would be stupid and you wouldn't be able to
>> use the most efficient addressing mode on a given cpu, you'd be
>> patching up indirect calls and crap like that.  Just say no...
>>     
>
> That wouldn't handle inlines though. At least some of the current
> paravirtops like cli/sti are critical enough to require inlining.
>   

You could special-case it in the thing handling the relocs; if you're
relocating to point to a function which you have an inline substitution,
then inline.

The bigger problem is that you don't know what registers you can
clobber.  For the pv_ops in hand-written asm, that a big pain.  The code
either has to assume the worst, or the "relocator" has to do something
more sophisticated (like look for push/pop pairs surrounding the call,
perhaps?).

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-18 15:58               ` Jeremy Fitzhardinge
@ 2007-03-18 17:04                 ` Andi Kleen
  2007-03-18 17:29                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 330+ messages in thread
From: Andi Kleen @ 2007-03-18 17:04 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: David Miller, rusty, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, anthony, torvalds, netdev

> The bigger problem is that you don't know what registers you can
> clobber.  For the pv_ops in hand-written asm, that a big pain.  The code
> either has to assume the worst, or the "relocator" has to do something
> more sophisticated (like look for push/pop pairs surrounding the call,
> perhaps?).

You could use the dwarf2 unwind tables. They have exact information
what register has what. But it would likely get complicated.

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-18 17:04                 ` Andi Kleen
@ 2007-03-18 17:29                   ` Jeremy Fitzhardinge
  2007-03-18 19:30                     ` Andi Kleen
  0 siblings, 1 reply; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-18 17:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, rusty, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, anthony, torvalds, netdev

Andi Kleen wrote:
> You could use the dwarf2 unwind tables. They have exact information
> what register has what. But it would likely get complicated.

Yes.  And would they be accurate for hand-written asm, which is where we
have this problem?

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-18 17:29                   ` Jeremy Fitzhardinge
@ 2007-03-18 19:30                     ` Andi Kleen
  2007-03-18 23:46                         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 330+ messages in thread
From: Andi Kleen @ 2007-03-18 19:30 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: David Miller, rusty, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, anthony, torvalds, netdev

On Sun, Mar 18, 2007 at 10:29:10AM -0700, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > You could use the dwarf2 unwind tables. They have exact information
> > what register has what. But it would likely get complicated.
> 
> Yes.  And would they be accurate for hand-written asm, which is where we
> have this problem?

Yes. All inline assembly tells gcc what registers are clobbered
and it fills in the tables. Hand clobbering in inline assembly cannot
be expressed with the current toolchain, so we moved all those
out of line.

But again I'm not sure it will work anyways. For once you would
need large padding around the calls anyways for inline replacement --
how would you generate that? I expect you would need to put the calls
into asm() again and with that a custom annotiation format looks reasonable.

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-18 19:30                     ` Andi Kleen
@ 2007-03-18 23:46                         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-18 23:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, rusty, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, anthony, torvalds, netdev

Andi Kleen wrote:
> Yes. All inline assembly tells gcc what registers are clobbered
> and it fills in the tables. Hand clobbering in inline assembly cannot
> be expressed with the current toolchain, so we moved all those
> out of line.
>
> But again I'm not sure it will work anyways. For once you would
> need large padding around the calls anyways for inline replacement --
> how would you generate that? I expect you would need to put the calls
> into asm() again and with that a custom annotiation format looks reasonable.

Inlining is most important for very small code: sti, cli, pushf;pop eax,
etc (in many cases, no-ops).  We'd have at least 5 bytes to work in, and
maybe more if there are surrounding push/pops to be consumed.

For example, say we wanted to put a general call for sti into entry.S,
where its expected it won't touch any registers.  In that case, we'd
have a sequence like:

    push %eax
    push %ecx
    push %edx
    call paravirt_cli
    pop %edx
    pop %ecx
    pop %eax

If we parse the relocs, then we'd find the reference to paravirt_cli. 
If we look at the byte before and see 0xe8, then we can see if its a
call.  If we then work out in each direction and see matched push/pops,
then we know what registers can be trashed in the call.  This also
allows us to determine the callsite size, and therefore how much space
we need for inlining.

So in this case, we see that there are 5 bytes for the call and a
further 6 bytes of push/pops available for inlining.

Of course this is hand-written code anyway, so there's no particular
burden to having some extra metadata stashed away in another section. 
For compiler-generated code, we know that it's already expecting
standard C ABI calling conventions.  The downside, of course, is that
only the 5 byte call space is available for inline patching.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-18 23:46                         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-18 23:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: zach, xen-devel, akpm, netdev, rusty, linux-kernel, chrisw,
	virtualization, anthony, mingo, torvalds, David Miller

Andi Kleen wrote:
> Yes. All inline assembly tells gcc what registers are clobbered
> and it fills in the tables. Hand clobbering in inline assembly cannot
> be expressed with the current toolchain, so we moved all those
> out of line.
>
> But again I'm not sure it will work anyways. For once you would
> need large padding around the calls anyways for inline replacement --
> how would you generate that? I expect you would need to put the calls
> into asm() again and with that a custom annotiation format looks reasonable.

Inlining is most important for very small code: sti, cli, pushf;pop eax,
etc (in many cases, no-ops).  We'd have at least 5 bytes to work in, and
maybe more if there are surrounding push/pops to be consumed.

For example, say we wanted to put a general call for sti into entry.S,
where its expected it won't touch any registers.  In that case, we'd
have a sequence like:

    push %eax
    push %ecx
    push %edx
    call paravirt_cli
    pop %edx
    pop %ecx
    pop %eax

If we parse the relocs, then we'd find the reference to paravirt_cli. 
If we look at the byte before and see 0xe8, then we can see if its a
call.  If we then work out in each direction and see matched push/pops,
then we know what registers can be trashed in the call.  This also
allows us to determine the callsite size, and therefore how much space
we need for inlining.

So in this case, we see that there are 5 bytes for the call and a
further 6 bytes of push/pops available for inlining.

Of course this is hand-written code anyway, so there's no particular
burden to having some extra metadata stashed away in another section. 
For compiler-generated code, we know that it's already expecting
standard C ABI calling conventions.  The downside, of course, is that
only the 5 byte call space is available for inline patching.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-18 12:08             ` Andi Kleen
  2007-03-18 15:58               ` Jeremy Fitzhardinge
@ 2007-03-19  2:47               ` Rusty Russell
  2007-03-19 18:25                 ` Eric W. Biederman
  1 sibling, 1 reply; 330+ messages in thread
From: Rusty Russell @ 2007-03-19  2:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, jeremy, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, anthony, torvalds, netdev

On Sun, 2007-03-18 at 13:08 +0100, Andi Kleen wrote:
> > The idea is _NOT_ that you go look for references to the paravirt_ops
> > members structure, that would be stupid and you wouldn't be able to
> > use the most efficient addressing mode on a given cpu, you'd be
> > patching up indirect calls and crap like that.  Just say no...
> 
> That wouldn't handle inlines though. At least some of the current
> paravirtops like cli/sti are critical enough to require inlining.

Well, we'd patch the inline over the call if we have room.

Magic patching would be neat, but the downsides are that (1) we can't
expand the patching room and (2) there's no way of attaching clobber
info to the call site (doing register liveness analysis is not
appealing).

Now, this may not be fatal.  5 bytes is enough for all the native ops to
be patched inline.   For lguest this covers popf and pushf, but not cli
and sti (10 bytes): they'd have to be calls.

As for clobber info, it turns out that almost all of the calls can
clobber %eax, which is probably enough.  We just need to mark the
handful of asm ones where this isn't true.

Thoughts?
Rusty.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-18 23:46                         ` Jeremy Fitzhardinge
  (?)
@ 2007-03-19 10:57                         ` Andi Kleen
  2007-03-19 17:58                           ` Jeremy Fitzhardinge
  2007-03-19 19:08                           ` David Miller
  -1 siblings, 2 replies; 330+ messages in thread
From: Andi Kleen @ 2007-03-19 10:57 UTC (permalink / raw)
  To: virtualization, jbeulich
  Cc: Jeremy Fitzhardinge, Andi Kleen, xen-devel, netdev, linux-kernel,
	David Miller, chrisw, virtualization, anthony, akpm, torvalds,
	mingo

On Monday 19 March 2007 00:46, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > Yes. All inline assembly tells gcc what registers are clobbered
> > and it fills in the tables. Hand clobbering in inline assembly cannot
> > be expressed with the current toolchain, so we moved all those
> > out of line.
> >
> > But again I'm not sure it will work anyways. For once you would
> > need large padding around the calls anyways for inline replacement --
> > how would you generate that? I expect you would need to put the calls
> > into asm() again and with that a custom annotiation format looks
> > reasonable.
>
> Inlining is most important for very small code: sti, cli, pushf;pop eax,
> etc (in many cases, no-ops).  We'd have at least 5 bytes to work in, and
> maybe more if there are surrounding push/pops to be consumed.
>
> For example, say we wanted to put a general call for sti into entry.S,
> where its expected it won't touch any registers.  In that case, we'd
> have a sequence like:
>
>     push %eax
>     push %ecx
>     push %edx
>     call paravirt_cli
>     pop %edx
>     pop %ecx
>     pop %eax

This cannot right now be expressed as inline assembly in the unwinder at all 
because there is no way to inject the push/pops into the compiler generated
ehframe tables.

[BTW I plan to resubmit the unwinder with some changes]

>
>
> If we parse the relocs, then we'd find the reference to paravirt_cli.
> If we look at the byte before and see 0xe8, then we can see if its a
> call.  If we then work out in each direction and see matched push/pops,
> then we know what registers can be trashed in the call.  This also
> allows us to determine the callsite size, and therefore how much space
> we need for inlining.

gcc normally doesn't generate push/pops around directly around the
call site, but somewhere else due to the way its register allocator works.
It can be anywhere in the function or even not there at all if the register
didn't contain anything useful. And they're not necessarily push/pops of 
course.

So you would need to write it as inline assembly. I'm not sure it would
be significantly cleaner than just having tables then.

> So in this case, we see that there are 5 bytes for the call and a
> further 6 bytes of push/pops available for inlining.
>
> Of course this is hand-written code anyway, so there's no particular
> burden to having some extra metadata stashed away in another section.
> For compiler-generated code, we know that it's already expecting
> standard C ABI calling conventions.  The downside, of course, is that
> only the 5 byte call space is available for inline patching.

It's unlikely you can do much useful in 5 bytes I guess.

Regarding cli/sti: i've been actually thinking about changing it in the
non paravirt kernel. IIRC most save_flags/restore_flags are inside
spin_lock_irqsave/restore() and that is a separate function anyways
so a little larger special case code is ok as long as it is not slower. 
There is some evidence that at least on P4 a software cli/sti flag without 
pushf/popf would be faster.

-Andi


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 10:57                         ` Andi Kleen
@ 2007-03-19 17:58                           ` Jeremy Fitzhardinge
  2007-03-19 19:08                           ` David Miller
  1 sibling, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-19 17:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: virtualization, jbeulich, Andi Kleen, xen-devel, netdev,
	linux-kernel, David Miller, chrisw, virtualization, anthony,
	akpm, torvalds, mingo

Andi Kleen wrote:
>>
>>     push %eax
>>     push %ecx
>>     push %edx
>>     call paravirt_cli
>>     pop %edx
>>     pop %ecx
>>     pop %eax
>>     
>
> This cannot right now be expressed as inline assembly in the unwinder at all 
> because there is no way to inject the push/pops into the compiler generated
> ehframe tables.
>   

I was thinking more of entry.S rather than inline asms.  The only ones I
can think of which would be relevant are the interrupt operations in
spinlocks, and even then we get a bit more leeway.

> gcc normally doesn't generate push/pops around directly around the
> call site, but somewhere else due to the way its register allocator works.
> It can be anywhere in the function or even not there at all if the register
> didn't contain anything useful. And they're not necessarily push/pops of 
> course.
>   

Right, I'm not making any assumptions about what gcc generates for
calls, other than assuming it uses a standard "call X" direct call
instruction.

> So you would need to write it as inline assembly. I'm not sure it would
> be significantly cleaner than just having tables then.
>   

No, my intention is that the vast majority of pv_ops uses, which are
calls from C code, would simply be normal C calls, syntatically and
semantically.

> It's unlikely you can do much useful in 5 bytes I guess.
>   

I think the main value is retaining non-PARAVIRT native performance
levels.  In the native case, the inlined instructions amount to 1 or 2
bytes, and having small patch sites is an advantage because there's less
space wasted on nops.  My understanding is that a direct call has almost
zero overhead on modern processors, because both the call and the return
can be completely predicted and prefetched, so the threshhold at which
you make inline vs call tradeoff is pretty small.

> Regarding cli/sti: i've been actually thinking about changing it in the
> non paravirt kernel. IIRC most save_flags/restore_flags are inside
> spin_lock_irqsave/restore() and that is a separate function anyways
> so a little larger special case code is ok as long as it is not slower. 
> There is some evidence that at least on P4 a software cli/sti flag without 
> pushf/popf would be faster.
>   

There are also the ones in entry.S.  I suppose spinlocks do get used
more than syscalls and interrupts.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-16 17:12       ` Chris Wright
  (?)
@ 2007-03-19 18:05       ` Eric W. Biederman
  2007-03-19 18:13           ` Jeremy Fitzhardinge
  2007-03-19 18:15           ` Chris Wright
  -1 siblings, 2 replies; 330+ messages in thread
From: Eric W. Biederman @ 2007-03-19 18:05 UTC (permalink / raw)
  To: Chris Wright
  Cc: Ingo Molnar, Jeremy Fitzhardinge, Andi Kleen, Andrew Morton,
	linux-kernel, virtualization, xen-devel, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Adrian Bunk,
	Thomas Gleixner

Chris Wright <chrisw@sous-sol.org> writes:

> * Ingo Molnar (mingo@elte.hu) wrote:
>> >  ENTRY(swapper_pg_dir)
>> > +	.align PAGE_SIZE_asm
>> >  	.fill 1024,4,0
>> 
>> does the native kernel lose memory here?
>
> Not in my builds.

Shouldn't the align be before the label.  Otherwise padding
would be inserted between the label and the data.

Eric

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-19 18:05       ` Eric W. Biederman
@ 2007-03-19 18:13           ` Jeremy Fitzhardinge
  2007-03-19 18:15           ` Chris Wright
  1 sibling, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-19 18:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Chris Wright, Ingo Molnar, Andi Kleen, Andrew Morton,
	linux-kernel, virtualization, xen-devel, Zachary Amsden,
	Rusty Russell, Ian Pratt, Christian Limpach, Adrian Bunk,
	Thomas Gleixner

Eric W. Biederman wrote:
>>>>  ENTRY(swapper_pg_dir)
>>>> +	.align PAGE_SIZE_asm
>>>>  	.fill 1024,4,0
>>>>         
>>>
> Shouldn't the align be before the label.  Otherwise padding
> would be inserted between the label and the data.


Good point.

Thanks,
    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
@ 2007-03-19 18:13           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-19 18:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Zachary Amsden, xen-devel, Ian Pratt, virtualization,
	Rusty Russell, linux-kernel, Adrian Bunk, Chris Wright,
	Andi Kleen, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Christian Limpach

Eric W. Biederman wrote:
>>>>  ENTRY(swapper_pg_dir)
>>>> +	.align PAGE_SIZE_asm
>>>>  	.fill 1024,4,0
>>>>         
>>>
> Shouldn't the align be before the label.  Otherwise padding
> would be inserted between the label and the data.


Good point.

Thanks,
    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
  2007-03-19 18:05       ` Eric W. Biederman
@ 2007-03-19 18:15           ` Chris Wright
  2007-03-19 18:15           ` Chris Wright
  1 sibling, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-19 18:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Chris Wright, Ingo Molnar, Jeremy Fitzhardinge, Andi Kleen,
	Andrew Morton, linux-kernel, virtualization, xen-devel,
	Zachary Amsden, Rusty Russell, Ian Pratt, Christian Limpach,
	Adrian Bunk, Thomas Gleixner

* Eric W. Biederman (ebiederm@xmission.com) wrote:
> Chris Wright <chrisw@sous-sol.org> writes:
> 
> > * Ingo Molnar (mingo@elte.hu) wrote:
> >> >  ENTRY(swapper_pg_dir)
> >> > +	.align PAGE_SIZE_asm
> >> >  	.fill 1024,4,0
> >> 
> >> does the native kernel lose memory here?
> >
> > Not in my builds.
> 
> Shouldn't the align be before the label.  Otherwise padding
> would be inserted between the label and the data.

Yes, I think you're right.  Thanks
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 20/26] Xen-paravirt_ops: Core Xen implementation
@ 2007-03-19 18:15           ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-19 18:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, Ian Pratt,
	virtualization, Rusty Russell, linux-kernel, Adrian Bunk,
	Chris Wright, Andi Kleen, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Christian Limpach

* Eric W. Biederman (ebiederm@xmission.com) wrote:
> Chris Wright <chrisw@sous-sol.org> writes:
> 
> > * Ingo Molnar (mingo@elte.hu) wrote:
> >> >  ENTRY(swapper_pg_dir)
> >> > +	.align PAGE_SIZE_asm
> >> >  	.fill 1024,4,0
> >> 
> >> does the native kernel lose memory here?
> >
> > Not in my builds.
> 
> Shouldn't the align be before the label.  Otherwise padding
> would be inserted between the label and the data.

Yes, I think you're right.  Thanks
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19  2:47               ` Rusty Russell
@ 2007-03-19 18:25                 ` Eric W. Biederman
  2007-03-19 18:38                     ` Linus Torvalds
                                     ` (2 more replies)
  0 siblings, 3 replies; 330+ messages in thread
From: Eric W. Biederman @ 2007-03-19 18:25 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andi Kleen, David Miller, jeremy, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, zach, anthony, torvalds,
	netdev

Rusty Russell <rusty@rustcorp.com.au> writes:

> On Sun, 2007-03-18 at 13:08 +0100, Andi Kleen wrote:
>> > The idea is _NOT_ that you go look for references to the paravirt_ops
>> > members structure, that would be stupid and you wouldn't be able to
>> > use the most efficient addressing mode on a given cpu, you'd be
>> > patching up indirect calls and crap like that.  Just say no...
>> 
>> That wouldn't handle inlines though. At least some of the current
>> paravirtops like cli/sti are critical enough to require inlining.
>
> Well, we'd patch the inline over the call if we have room.
>
> Magic patching would be neat, but the downsides are that (1) we can't
> expand the patching room and (2) there's no way of attaching clobber
> info to the call site (doing register liveness analysis is not
> appealing).

True.  You can use all of the call clobbered registers.

> Now, this may not be fatal.  5 bytes is enough for all the native ops to
> be patched inline.   For lguest this covers popf and pushf, but not cli
> and sti (10 bytes): they'd have to be calls.
>
> As for clobber info, it turns out that almost all of the calls can
> clobber %eax, which is probably enough.  We just need to mark the
> handful of asm ones where this isn't true.

I guess if the code is larger than a function call I'm failing to see
the disadvantage in making it a direct function call.  Any modern
processor ought to be able to predict it perfectly, and processors
like the P4 may even optimize the call out of their L1 instruction
cache.

If what David is suggesting works, making all of these direct calls
looks easy and very maintainable.   At which point patching
instructions inline is quite possibly overkill.

Is it truly critical to inline any of these instructions?

Eric

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 18:25                 ` Eric W. Biederman
@ 2007-03-19 18:38                     ` Linus Torvalds
  2007-03-19 18:41                     ` Chris Wright
  2007-03-19 19:10                     ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-19 18:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Rusty Russell, Andi Kleen, David Miller, jeremy, mingo, akpm,
	linux-kernel, virtualization, xen-devel, chrisw, zach, anthony,
	netdev

On Mon, 19 Mar 2007, Eric W. Biederman wrote:
> 
> True.  You can use all of the call clobbered registers.

Quite often, the biggest single win of inlining is not so much the code 
size (although if done right, that will be smaller too), but the fact that 
inlining DOES NOT CLOBBER AS MANY REGISTERS!

The lack of register clobbering, and the freedom for the compiler to 
choose registers around an inlined function is usually the biggest win! If 
you can't do that, then inlining generally doesn't actually even help: a 
call/return to a single instruction isn't all that much slower than just 
doing the "cli" in the first place.

If we end up with a setup where any inlined instruction needs to act as if 
it was a function call (just with the "call" instruction papered over with 
the inlined instruction sequence), then there is no point to this at all. 

In short: people here seem to think that inlining is about avoiding the 
call/ret sequence. Not so. The real advantages of inlining are elsewhere.

So *please* don't believe that you can make it "as cheap" to have some 
automatic fixup of two sequences, one inlined and one as a "call".  It may 
look so when you look at the single instruction generated, but you're 
ignoring all the instructions *around* the site.

			Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-19 18:38                     ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-19 18:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: xen-devel, virtualization, netdev, linux-kernel, David Miller,
	chrisw, Andi Kleen, anthony, mingo, akpm

On Mon, 19 Mar 2007, Eric W. Biederman wrote:
> 
> True.  You can use all of the call clobbered registers.

Quite often, the biggest single win of inlining is not so much the code 
size (although if done right, that will be smaller too), but the fact that 
inlining DOES NOT CLOBBER AS MANY REGISTERS!

The lack of register clobbering, and the freedom for the compiler to 
choose registers around an inlined function is usually the biggest win! If 
you can't do that, then inlining generally doesn't actually even help: a 
call/return to a single instruction isn't all that much slower than just 
doing the "cli" in the first place.

If we end up with a setup where any inlined instruction needs to act as if 
it was a function call (just with the "call" instruction papered over with 
the inlined instruction sequence), then there is no point to this at all. 

In short: people here seem to think that inlining is about avoiding the 
call/ret sequence. Not so. The real advantages of inlining are elsewhere.

So *please* don't believe that you can make it "as cheap" to have some 
automatic fixup of two sequences, one inlined and one as a "call".  It may 
look so when you look at the single instruction generated, but you're 
ignoring all the instructions *around* the site.

			Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 18:25                 ` Eric W. Biederman
@ 2007-03-19 18:41                     ` Chris Wright
  2007-03-19 18:41                     ` Chris Wright
  2007-03-19 19:10                     ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-19 18:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Rusty Russell, Andi Kleen, David Miller, jeremy, mingo, akpm,
	linux-kernel, virtualization, xen-devel, chrisw, zach, anthony,
	torvalds, netdev

* Eric W. Biederman (ebiederm@xmission.com) wrote:
> Is it truly critical to inline any of these instructions?

I don't have any current measurements.  But we'd been aiming
at getting irq_{en,dis}able to a simple memory write to pda.
But simplicity, maintenance, etc. win over trimming a couple
cycles, so still worth real look.

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-19 18:41                     ` Chris Wright
  0 siblings, 0 replies; 330+ messages in thread
From: Chris Wright @ 2007-03-19 18:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: zach, jeremy, xen-devel, anthony, virtualization, netdev,
	Rusty Russell, linux-kernel, chrisw, Andi Kleen, mingo, akpm,
	torvalds, David Miller

* Eric W. Biederman (ebiederm@xmission.com) wrote:
> Is it truly critical to inline any of these instructions?

I don't have any current measurements.  But we'd been aiming
at getting irq_{en,dis}able to a simple memory write to pda.
But simplicity, maintenance, etc. win over trimming a couple
cycles, so still worth real look.

thanks,
-chris

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 18:38                     ` Linus Torvalds
@ 2007-03-19 18:44                       ` Linus Torvalds
  -1 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-19 18:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Rusty Russell, Andi Kleen, David Miller, jeremy, mingo, akpm,
	linux-kernel, virtualization, xen-devel, chrisw, zach, anthony,
	netdev



On Mon, 19 Mar 2007, Linus Torvalds wrote:
> 
> So *please* don't believe that you can make it "as cheap" to have some 
> automatic fixup of two sequences, one inlined and one as a "call".  It may 
> look so when you look at the single instruction generated, but you're 
> ignoring all the instructions *around* the site.

Side note, you can certainly fix things like this at least in theory, but 
it requires:

 - the "call" instruction that is used instead of the inlining should 
   basically have no callee-clobbers. Any paravirt_ops called this way
   should have a special calling sequence where they simple save and 
   restore all the registers they use.

   This is usually not that bad. Just create a per-architecture wrapper 
   function that saves/restores anything that the C calling convention on 
   that architecture says is clobbered by calls.

 - if the function has arguments, and the inlined sequence can take the 
   arguments in arbitrary registers, you are going to penalize the inlined 
   sequence anyway (by forcing some fixed arbitrary register allocation 
   policy).This thing is largely unfixable without some really extreme 
   compiler games (like post-processing the assembler output and having 
   different entry-points depending on where the arguments are..)

.. it will obviously depend on how thngs are done whether these things are 
useful or not. But it does mean that it's always a good idea to just have 
a config option of "turn off all the paravirt crap, because it *does* add 
overhead, and replacing instructions on the fly doesn't make that 
overhead go away".

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-19 18:44                       ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-19 18:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: xen-devel, virtualization, netdev, linux-kernel, David Miller,
	chrisw, Andi Kleen, anthony, mingo, akpm



On Mon, 19 Mar 2007, Linus Torvalds wrote:
> 
> So *please* don't believe that you can make it "as cheap" to have some 
> automatic fixup of two sequences, one inlined and one as a "call".  It may 
> look so when you look at the single instruction generated, but you're 
> ignoring all the instructions *around* the site.

Side note, you can certainly fix things like this at least in theory, but 
it requires:

 - the "call" instruction that is used instead of the inlining should 
   basically have no callee-clobbers. Any paravirt_ops called this way
   should have a special calling sequence where they simple save and 
   restore all the registers they use.

   This is usually not that bad. Just create a per-architecture wrapper 
   function that saves/restores anything that the C calling convention on 
   that architecture says is clobbered by calls.

 - if the function has arguments, and the inlined sequence can take the 
   arguments in arbitrary registers, you are going to penalize the inlined 
   sequence anyway (by forcing some fixed arbitrary register allocation 
   policy).This thing is largely unfixable without some really extreme 
   compiler games (like post-processing the assembler output and having 
   different entry-points depending on where the arguments are..)

.. it will obviously depend on how thngs are done whether these things are 
useful or not. But it does mean that it's always a good idea to just have 
a config option of "turn off all the paravirt crap, because it *does* add 
overhead, and replacing instructions on the fly doesn't make that 
overhead go away".

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 10:57                         ` Andi Kleen
  2007-03-19 17:58                           ` Jeremy Fitzhardinge
@ 2007-03-19 19:08                           ` David Miller
  2007-03-19 20:59                             ` Andi Kleen
  1 sibling, 1 reply; 330+ messages in thread
From: David Miller @ 2007-03-19 19:08 UTC (permalink / raw)
  To: ak
  Cc: virtualization, jbeulich, jeremy, ak, xen-devel, netdev,
	linux-kernel, chrisw, virtualization, anthony, akpm, torvalds,
	mingo

From: Andi Kleen <ak@suse.de>
Date: Mon, 19 Mar 2007 11:57:28 +0100

> On Monday 19 March 2007 00:46, Jeremy Fitzhardinge wrote:
> > Andi Kleen wrote:
> > For example, say we wanted to put a general call for sti into entry.S,
> > where its expected it won't touch any registers.  In that case, we'd
> > have a sequence like:
> >
> >     push %eax
> >     push %ecx
> >     push %edx
> >     call paravirt_cli
> >     pop %edx
> >     pop %ecx
> >     pop %eax
> 
> This cannot right now be expressed as inline assembly in the unwinder at all 
> because there is no way to inject the push/pops into the compiler generated
> ehframe tables.
> 
> [BTW I plan to resubmit the unwinder with some changes]

It's inability to handle sequences like the above sounds to me like
a very good argument to _not_ merge the unwinder back into the tree.

To me, that unwinder is nothing but trouble, it severly limits what
cases you can use special calling conventions via inline asm (and we
have done that on several occaisions) and even ignoring that the
unwinder only works half the time.

Please don't subject us to another couple months of hair-pulling only
to have Linus yank the thing out again, there are certainly more
useful things to spend time on :-)

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 18:25                 ` Eric W. Biederman
@ 2007-03-19 19:10                     ` Jeremy Fitzhardinge
  2007-03-19 18:41                     ` Chris Wright
  2007-03-19 19:10                     ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-19 19:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Rusty Russell, Andi Kleen, David Miller, mingo, akpm,
	linux-kernel, virtualization, xen-devel, chrisw, zach, anthony,
	torvalds, netdev

Eric W. Biederman wrote:
> Rusty Russell <rusty@rustcorp.com.au> writes:
>
>   
>> On Sun, 2007-03-18 at 13:08 +0100, Andi Kleen wrote:
>>     
>>>> The idea is _NOT_ that you go look for references to the paravirt_ops
>>>> members structure, that would be stupid and you wouldn't be able to
>>>> use the most efficient addressing mode on a given cpu, you'd be
>>>> patching up indirect calls and crap like that.  Just say no...
>>>>         
>>> That wouldn't handle inlines though. At least some of the current
>>> paravirtops like cli/sti are critical enough to require inlining.
>>>       
>> Well, we'd patch the inline over the call if we have room.
>>
>> Magic patching would be neat, but the downsides are that (1) we can't
>> expand the patching room and (2) there's no way of attaching clobber
>> info to the call site (doing register liveness analysis is not
>> appealing).
>>     
>
> True.  You can use all of the call clobbered registers.
>   

The trouble is that there are places where we want to intercept things
like sti/cli in entry.S in a context where all the registers are in use,
so we can't generally make the assumption that caller-saved regs are
clobberable.  For any call-site in compiler-generated code we can make
that assumption, of course.

>> Now, this may not be fatal.  5 bytes is enough for all the native ops to
>> be patched inline.   For lguest this covers popf and pushf, but not cli
>> and sti (10 bytes): they'd have to be calls.
>>
>> As for clobber info, it turns out that almost all of the calls can
>> clobber %eax, which is probably enough.  We just need to mark the
>> handful of asm ones where this isn't true.
>>     
>
> I guess if the code is larger than a function call I'm failing to see
> the disadvantage in making it a direct function call.  Any modern
> processor ought to be able to predict it perfectly, and processors
> like the P4 may even optimize the call out of their L1 instruction
> cache.
>   

For current processors, I think i-cache pollution is really the only
downside to a call.  But maybe you could put multiple pv-op functions
into a cacheline if they're used in close proximity
(cli/sti/save_fl/restore_fl would all fit into a single cacheline).

> If what David is suggesting works, making all of these direct calls
> looks easy and very maintainable.   At which point patching
> instructions inline is quite possibly overkill.
>   

I spent some time looking at what it would take to get this going.  It
would probably look something like:

   1. make CONFIG_PARAVIRT select CONFIG_RELOCATABLE
   2. generate vmlinux
   3. extract relocs, and process the references to paravirt symbols
      into a table
   4. compile and link that table into the kernel again (like the ksyms
      dance)
   5. at boot time, use that table to redirect calls/references into the
      paravirt backend to the appropriate one

I think if we store the reloc references as section-relative, and the
reloc table itself is linked into its own section, then there's no need
for multipass linking to combine the reloc table into the kernel image,
since adding that table won't affect any existing reloc.

All this is doable; I'd probably end up hacking boot/compressed/relocs.c
to generate the appropriate reloc table.  My main concern is hacking the
kernel build process itself; I'm unsure of what it would actually take
to implement all this.

> Is it truly critical to inline any of these instructions?
>   

Possibly not, but I'd like to be able to say with confidence that
running a PARAVIRT kernel on bare hardware has no performance loss
compared to running a !PARAVIRT kernel.  There's the case of small
instruction sequences which have been replaced with calls (such as
sti/cli/push;popf/etc), and also the case of hooks which are noops on
native (like make_pte/pte_val, etc).  In those cases it would be nice to
patch in a replacement, either a real instruction or just nops.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-19 19:10                     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-19 19:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: zach, xen-devel, akpm, virtualization, netdev, Rusty Russell,
	linux-kernel, chrisw, Andi Kleen, anthony, mingo, torvalds,
	David Miller

Eric W. Biederman wrote:
> Rusty Russell <rusty@rustcorp.com.au> writes:
>
>   
>> On Sun, 2007-03-18 at 13:08 +0100, Andi Kleen wrote:
>>     
>>>> The idea is _NOT_ that you go look for references to the paravirt_ops
>>>> members structure, that would be stupid and you wouldn't be able to
>>>> use the most efficient addressing mode on a given cpu, you'd be
>>>> patching up indirect calls and crap like that.  Just say no...
>>>>         
>>> That wouldn't handle inlines though. At least some of the current
>>> paravirtops like cli/sti are critical enough to require inlining.
>>>       
>> Well, we'd patch the inline over the call if we have room.
>>
>> Magic patching would be neat, but the downsides are that (1) we can't
>> expand the patching room and (2) there's no way of attaching clobber
>> info to the call site (doing register liveness analysis is not
>> appealing).
>>     
>
> True.  You can use all of the call clobbered registers.
>   

The trouble is that there are places where we want to intercept things
like sti/cli in entry.S in a context where all the registers are in use,
so we can't generally make the assumption that caller-saved regs are
clobberable.  For any call-site in compiler-generated code we can make
that assumption, of course.

>> Now, this may not be fatal.  5 bytes is enough for all the native ops to
>> be patched inline.   For lguest this covers popf and pushf, but not cli
>> and sti (10 bytes): they'd have to be calls.
>>
>> As for clobber info, it turns out that almost all of the calls can
>> clobber %eax, which is probably enough.  We just need to mark the
>> handful of asm ones where this isn't true.
>>     
>
> I guess if the code is larger than a function call I'm failing to see
> the disadvantage in making it a direct function call.  Any modern
> processor ought to be able to predict it perfectly, and processors
> like the P4 may even optimize the call out of their L1 instruction
> cache.
>   

For current processors, I think i-cache pollution is really the only
downside to a call.  But maybe you could put multiple pv-op functions
into a cacheline if they're used in close proximity
(cli/sti/save_fl/restore_fl would all fit into a single cacheline).

> If what David is suggesting works, making all of these direct calls
> looks easy and very maintainable.   At which point patching
> instructions inline is quite possibly overkill.
>   

I spent some time looking at what it would take to get this going.  It
would probably look something like:

   1. make CONFIG_PARAVIRT select CONFIG_RELOCATABLE
   2. generate vmlinux
   3. extract relocs, and process the references to paravirt symbols
      into a table
   4. compile and link that table into the kernel again (like the ksyms
      dance)
   5. at boot time, use that table to redirect calls/references into the
      paravirt backend to the appropriate one

I think if we store the reloc references as section-relative, and the
reloc table itself is linked into its own section, then there's no need
for multipass linking to combine the reloc table into the kernel image,
since adding that table won't affect any existing reloc.

All this is doable; I'd probably end up hacking boot/compressed/relocs.c
to generate the appropriate reloc table.  My main concern is hacking the
kernel build process itself; I'm unsure of what it would actually take
to implement all this.

> Is it truly critical to inline any of these instructions?
>   

Possibly not, but I'd like to be able to say with confidence that
running a PARAVIRT kernel on bare hardware has no performance loss
compared to running a !PARAVIRT kernel.  There's the case of small
instruction sequences which have been replaced with calls (such as
sti/cli/push;popf/etc), and also the case of hooks which are noops on
native (like make_pte/pte_val, etc).  In those cases it would be nice to
patch in a replacement, either a real instruction or just nops.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 18:38                     ` Linus Torvalds
  (?)
  (?)
@ 2007-03-19 19:33                     ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-19 19:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Rusty Russell, Andi Kleen, David Miller,
	mingo, akpm, linux-kernel, virtualization, xen-devel, chrisw,
	zach, anthony, netdev

Linus Torvalds wrote:
> On Mon, 19 Mar 2007, Eric W. Biederman wrote:
>   
>> True.  You can use all of the call clobbered registers.
>>     
>
> Quite often, the biggest single win of inlining is not so much the code 
> size (although if done right, that will be smaller too), but the fact that 
> inlining DOES NOT CLOBBER AS MANY REGISTERS!
>   

Yes, that's something that we've been conscious of when designing the
pv_ops patching mechanism.  As it stands, each time you emit a call to a
pv-ops operation, it not only stores the start and end of the patchable
area, but also which registers are available for clobbering at that
callsite by the patched-in code.  For things like sti/cli, where the
surrounding code expects nothing to be clobbered, we make sure that the
regs are preserved on the pv-ops side, even though there's a call to a
normal C function in the middle.  That gives in the pvops backend the
flexibility to patch over that site with either some inline code or a
call out to some function which doesn't necessarily clobber the full
caller-save set (or even any registers).

> In short: people here seem to think that inlining is about avoiding the 
> call/ret sequence. Not so. The real advantages of inlining are elsewhere.
>   

Yes.  Probably the biggest win of inlining is constant propagation into
the inline function, but register-use flexibility is good small-scale
win (vs the micro-scale call/ret elimination).

> Side note, you can certainly fix things like this at least in theory, but 
> it requires:
>
>  - the "call" instruction that is used instead of the inlining should 
>    basically have no callee-clobbers. Any paravirt_ops called this way
>    should have a special calling sequence where they simple save and 
>    restore all the registers they use.
>
>    This is usually not that bad. Just create a per-architecture wrapper 
>    function that saves/restores anything that the C calling convention on 
>    that architecture says is clobbered by calls.
>   

Most of the places we intercept are normal C calls anyway, so this isn't
a big issue.  Its mainly the places where we replace single instructions
that this will have a big effect.

The trouble with this is that we're back to having to have special
wrappers around each call to hide the normal C calling convention from
the compiler, so that it knows that it has more registers to play with. 
This was the main complaint about the original version of the patch,
because it all looks very ugly.

>  - if the function has arguments, and the inlined sequence can take the 
>    arguments in arbitrary registers, you are going to penalize the inlined 
>    sequence anyway (by forcing some fixed arbitrary register allocation 
>    policy).This thing is largely unfixable without some really extreme 
>    compiler games (like post-processing the assembler output and having 
>    different entry-points depending on where the arguments are..)
>   

Yeah, that doesn't sound like much fun.  I think using the normal
regparm calling convention will be OK.  Aside from some slightly longer
instruction encodings, all the registers are more or less
interchangeable anyway.

> .. it will obviously depend on how thngs are done whether these things are 
> useful or not. But it does mean that it's always a good idea to just have 
> a config option of "turn off all the paravirt crap, because it *does* add 
> overhead, and replacing instructions on the fly doesn't make that 
> overhead go away".

Yes, there's a big switch to turn all this off.  It would be nice if we
could get things to the point that it isn't necessary to have (ie,
running on bare hardware really is indistinguishable either way), but
we're a fair way from being able to prove that.  In the meantime, the
goal is to try to keep the source-level changes a local as possible so
that maintaining the CONFIG_PARAVIRT vs non-PARAVIRT distinction is
straightforward.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 19:10                     ` Jeremy Fitzhardinge
  (?)
@ 2007-03-19 19:46                     ` David Miller
  2007-03-19 20:06                         ` Jeremy Fitzhardinge
  -1 siblings, 1 reply; 330+ messages in thread
From: David Miller @ 2007-03-19 19:46 UTC (permalink / raw)
  To: jeremy
  Cc: ebiederm, rusty, ak, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, anthony, torvalds, netdev

From: Jeremy Fitzhardinge <jeremy@goop.org>
Date: Mon, 19 Mar 2007 12:10:08 -0700

> All this is doable; I'd probably end up hacking boot/compressed/relocs.c
> to generate the appropriate reloc table.  My main concern is hacking the
> kernel build process itself; I'm unsure of what it would actually take
> to implement all this.

32-bit Sparc's btfixup might be usable as a guide.

Another point worth making is that for function calls you
can fix things up lazily if you want.

So you link, build the reloc tables, then link in a *.o file that
does provide the functions in the form of stubs.  The stubs intercept
the call, and patch the callsite, then revector to the real handler.

I don't like this idea actually because it essentially means you
either:

1) Only allow one setting of the operations

OR

2) Need to have code which walks the whole reloc table anyways
   to handle settings after the first so you can revector
   everyone back to the stubs and lazy reloc properly again

In fact forget I mentioned this idea :)

As another note, I do agree with Linus about the register usage
arguments.  It is important.  I think it's been mentioned but what you
could do is save nothing (so that "sti" and "cli" are just that and
cost nothing), but the more complicated versions save and restore
enough registers to operate.

It all depends upon what you're trying to do.  For example, it's
easy to use patching to make different PTE layouts be supportable
in the same kernel image.  We do this on sparc64 since sun4v
has a different PTE layout than sun4u, you can see the code in
asm-sparc64/pgtable.h for details (search for "sun4v_*_patch")

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 19:46                     ` David Miller
@ 2007-03-19 20:06                         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-19 20:06 UTC (permalink / raw)
  To: David Miller
  Cc: ebiederm, rusty, ak, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, zach, anthony, torvalds, netdev

David Miller wrote:
> Another point worth making is that for function calls you
> can fix things up lazily if you want.
> [...]
> In fact forget I mentioned this idea :)
>   

OK :)  I think we'll only ever want to bind to a hypervisor once, since
the underlying hypervisor can't change on the fly (well, in principle it
might if you migrate, but you'll have more problems than just dealing
with the hook points).  And lazy binding doesn't buy much; I think its
much better to get it all out of the way at once, and then throw the
reloc info away.

> As another note, I do agree with Linus about the register usage
> arguments.  It is important.  I think it's been mentioned but what you
> could do is save nothing (so that "sti" and "cli" are just that and
> cost nothing), but the more complicated versions save and restore
> enough registers to operate.
>   

Right, that's pretty much what we do now.

> It all depends upon what you're trying to do.  For example, it's
> easy to use patching to make different PTE layouts be supportable
> in the same kernel image.  We do this on sparc64 since sun4v
> has a different PTE layout than sun4u, you can see the code in
> asm-sparc64/pgtable.h for details (search for "sun4v_*_patch")
>   

I see.  We want to do something conceptually like this, but we need to
handle more than just adjusting which of two constants to use as a
mask.  For example, Xen needs to run pfns through a pfn->mfn (machine
frame number) conversion and back when making/unpacking pagetable entries.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-19 20:06                         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-19 20:06 UTC (permalink / raw)
  To: David Miller
  Cc: zach, xen-devel, virtualization, netdev, rusty, linux-kernel,
	chrisw, ak, ebiederm, anthony, mingo, torvalds, akpm

David Miller wrote:
> Another point worth making is that for function calls you
> can fix things up lazily if you want.
> [...]
> In fact forget I mentioned this idea :)
>   

OK :)  I think we'll only ever want to bind to a hypervisor once, since
the underlying hypervisor can't change on the fly (well, in principle it
might if you migrate, but you'll have more problems than just dealing
with the hook points).  And lazy binding doesn't buy much; I think its
much better to get it all out of the way at once, and then throw the
reloc info away.

> As another note, I do agree with Linus about the register usage
> arguments.  It is important.  I think it's been mentioned but what you
> could do is save nothing (so that "sti" and "cli" are just that and
> cost nothing), but the more complicated versions save and restore
> enough registers to operate.
>   

Right, that's pretty much what we do now.

> It all depends upon what you're trying to do.  For example, it's
> easy to use patching to make different PTE layouts be supportable
> in the same kernel image.  We do this on sparc64 since sun4v
> has a different PTE layout than sun4u, you can see the code in
> asm-sparc64/pgtable.h for details (search for "sun4v_*_patch")
>   

I see.  We want to do something conceptually like this, but we need to
handle more than just adjusting which of two constants to use as a
mask.  For example, Xen needs to run pfns through a pfn->mfn (machine
frame number) conversion and back when making/unpacking pagetable entries.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 19:08                           ` David Miller
@ 2007-03-19 20:59                             ` Andi Kleen
  2007-03-19 21:55                               ` [PATCH] x86_64 : Suppress __jiffies Eric Dumazet
  2007-03-20  3:18                                 ` Linus Torvalds
  0 siblings, 2 replies; 330+ messages in thread
From: Andi Kleen @ 2007-03-19 20:59 UTC (permalink / raw)
  To: David Miller
  Cc: virtualization, jbeulich, jeremy, xen-devel, netdev,
	linux-kernel, chrisw, virtualization, anthony, akpm, torvalds,
	mingo

> It's inability to handle sequences like the above sounds to me like
> a very good argument to _not_ merge the unwinder back into the tree.

The unwinder can handle it fine, it is just that gcc right now cannot
be taught to generate correct unwind tables for it. If paravirt ops
is widely used i guess it would be possible to find some workaround
for that though.

There is one case it truly cannot handle (it would be possible by switching 
to dwarf3), but that was relatively easily eliminated and wasn't a problem
imho.

> To me, that unwinder is nothing but trouble, it severly limits what
> cases you can use special calling conventions via inline asm (and we
> have done that on several occaisions) and even ignoring that the
> unwinder only works half the time.

Initially we had some bugs that accounted for near all failures, but they 
were all fixed in the latest version.

> Please don't subject us to another couple months of hair-pulling only
> to have Linus yank the thing out again, there are certainly more
> useful things to spend time on :-)

I think better backtraces are a good use of my time. Sorry if you
disagree on that.

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [PATCH] x86_64 : Suppress __jiffies
  2007-03-19 20:59                             ` Andi Kleen
@ 2007-03-19 21:55                               ` Eric Dumazet
  2007-03-20  3:18                                 ` Linus Torvalds
  1 sibling, 0 replies; 330+ messages in thread
From: Eric Dumazet @ 2007-03-19 21:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

[-- Attachment #1: Type: text/plain, Size: 721 bytes --]

Hi Andi

I was considering re-submiting patch to group together 
xtime_lock/xtime/jiffies in the same cache line. Before doing so, I believe 
the following cleanup patch could prepare the x86_64 port to be similar with 
other arches.

Thank you

[PATCH] x86_64 : Suppress __jiffies

x86_64 arch currently specially defines __jiffies, as a relic of old vsyscall 
implementation. It is currently used to validate vgetcpu() user cache as a 
token. If we use another token, more opaque like the seqlock sequence, we can 
get rid of special __jiffies handling. As a result, jiffies_64 is no more 
included in vsyscall page. This patch prepares the introduction of time_data.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: no__jiffies.patch --]
[-- Type: text/plain, Size: 3287 bytes --]

diff --git a/arch/x86_64/kernel/vsyscall.c b/arch/x86_64/kernel/vsyscall.c
index b43c698..ac4d865 100644
--- a/arch/x86_64/kernel/vsyscall.c
+++ b/arch/x86_64/kernel/vsyscall.c
@@ -62,6 +62,14 @@ struct vsyscall_gtod_data_t __vsyscall_g
 	.sysctl_enabled = 1,
 };
 
+/*
+ * vgetcpu_token is an opaque token used to validate vgetcpu() user cache.
+ * We need something that change like jiffies. It used to be jiffies
+ * but using seqlock __vsyscall_gtod_data.lock sequence is better because
+ * we prefer not let user read jiffies.
+ */
+#define vgetcpu_token() read_seqbegin(&__vsyscall_gtod_data.lock)
+
 void update_vsyscall(struct timespec *wall_time, struct clocksource *clock)
 {
 	unsigned long flags;
@@ -180,7 +188,7 @@ vgetcpu(unsigned *cpu, unsigned *node, s
 	   We do this here because otherwise user space would do it on
 	   its own in a likely inferior way (no access to jiffies).
 	   If you don't like it pass NULL. */
-	if (tcache && tcache->blob[0] == (j = __jiffies)) {
+	if (tcache && tcache->blob[0] == (j = vgetcpu_token())) {
 		p = tcache->blob[1];
 	} else if (__vgetcpu_mode == VGETCPU_RDTSCP) {
 		/* Load per CPU data from RDTSCP */
diff --git a/arch/x86_64/kernel/time.c b/arch/x86_64/kernel/time.c
index 75d73a9..5a44c11 100644
--- a/arch/x86_64/kernel/time.c
+++ b/arch/x86_64/kernel/time.c
@@ -53,7 +53,6 @@ DEFINE_SPINLOCK(rtc_lock);
 EXPORT_SYMBOL(rtc_lock);
 DEFINE_SPINLOCK(i8253_lock);
 
-volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES;
 
 unsigned long profile_pc(struct pt_regs *regs)
 {
diff --git a/arch/x86_64/kernel/vmlinux.lds.S b/arch/x86_64/kernel/vmlinux.lds.S
index b73212c..6ca4660 100644
--- a/arch/x86_64/kernel/vmlinux.lds.S
+++ b/arch/x86_64/kernel/vmlinux.lds.S
@@ -12,7 +12,7 @@ #undef i386	/* in case the preprocessor 
 OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
 OUTPUT_ARCH(i386:x86-64)
 ENTRY(phys_startup_64)
-jiffies_64 = jiffies;
+jiffies = jiffies_64;
 _proxy_pda = 0;
 PHDRS {
 	text PT_LOAD FLAGS(5);	/* R_E */
@@ -97,10 +97,6 @@ #define VVIRT(x) (ADDR(x) - VVIRT_OFFSET
   .vgetcpu_mode : AT(VLOAD(.vgetcpu_mode)) { *(.vgetcpu_mode) }
   vgetcpu_mode = VVIRT(.vgetcpu_mode);
 
-  . = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
-  .jiffies : AT(VLOAD(.jiffies)) { *(.jiffies) }
-  jiffies = VVIRT(.jiffies);
-
   .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1))
 		{ *(.vsyscall_1) }
   .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2))
diff --git a/include/asm-x86_64/vsyscall.h b/include/asm-x86_64/vsyscall.h
index 82b4afe..1fb09dc 100644
--- a/include/asm-x86_64/vsyscall.h
+++ b/include/asm-x86_64/vsyscall.h
@@ -17,7 +17,6 @@ #ifdef __KERNEL__
 #include <linux/seqlock.h>
 
 #define __section_vgetcpu_mode __attribute__ ((unused, __section__ (".vgetcpu_mode"), aligned(16)))
-#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16)))
 
 /* Definitions for CONFIG_GENERIC_TIME definitions */
 #define __section_vsyscall_gtod_data __attribute__ \
@@ -31,7 +30,6 @@ #define hpet_readl(a)           readl((c
 #define hpet_writel(d,a)        writel(d, (void __iomem *)fix_to_virt(FIX_HPET_BASE) + a)
 
 extern int __vgetcpu_mode;
-extern volatile unsigned long __jiffies;
 
 /* kernel space (writeable) */
 extern int vgetcpu_mode;

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 19:10                     ` Jeremy Fitzhardinge
  (?)
  (?)
@ 2007-03-19 23:42                     ` Andi Kleen
  -1 siblings, 0 replies; 330+ messages in thread
From: Andi Kleen @ 2007-03-19 23:42 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Eric W. Biederman, Rusty Russell, David Miller, mingo, akpm,
	linux-kernel, virtualization, xen-devel, chrisw, zach, anthony,
	torvalds, netdev

> Possibly not, but I'd like to be able to say with confidence that
> running a PARAVIRT kernel on bare hardware has no performance loss
> compared to running a !PARAVIRT kernel.  There's the case of small
> instruction sequences which have been replaced with calls (such as
> sti/cli/push;popf/etc), 

My guess is that most critical pushf/popf are in spin_lock_irqsave(). It would 
be possible to special case that one -- inline it -- and use out of line
versions for all the others.

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 18:38                     ` Linus Torvalds
                                       ` (2 preceding siblings ...)
  (?)
@ 2007-03-20  0:01                     ` Rusty Russell
  2007-03-20  2:00                       ` Zachary Amsden
  -1 siblings, 1 reply; 330+ messages in thread
From: Rusty Russell @ 2007-03-20  0:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Andi Kleen, David Miller, jeremy, mingo, akpm,
	linux-kernel, virtualization, xen-devel, chrisw, zach, anthony,
	netdev

On Mon, 2007-03-19 at 11:38 -0700, Linus Torvalds wrote:
> 
> On Mon, 19 Mar 2007, Eric W. Biederman wrote:
> > 
> > True.  You can use all of the call clobbered registers.
> 
> Quite often, the biggest single win of inlining is not so much the code 
> size (although if done right, that will be smaller too), but the fact that 
> inlining DOES NOT CLOBBER AS MANY REGISTERS!

Thanks Linus.

*This* was the reason that the current hand-coded calls only clobber %
eax.  It was a compromise between native (no clobbers) and others (might
need a reg).

Now, since we decided to allow paravirt_ops operations to be normal C
(ie. the patching is optional and done late), we actually push and pop %
ecx and %edx.  This makes the call site 10 bytes long, which is a nice
size for patching anyway (enough for a movl $0, <addr>, a-la lguest's
cli, or movw $0, %gs:<addr> if we supported SMP).

The current 6 paravirt ops which are patched cover the vast majority of
calls (until the Xen patches, then we need ~4 more?).  Jeremy chose to
expand patching to cover *all* paravirt ops, rather than just the new
hot ones, and that's where we tipped over the ugliness threshold.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-18 23:46                         ` Jeremy Fitzhardinge
  (?)
  (?)
@ 2007-03-20  1:23                         ` Zachary Amsden
  2007-03-20  1:45                           ` Jeremy Fitzhardinge
  -1 siblings, 1 reply; 330+ messages in thread
From: Zachary Amsden @ 2007-03-20  1:23 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, David Miller, rusty, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, torvalds, netdev

Jeremy Fitzhardinge wrote:
> For example, say we wanted to put a general call for sti into entry.S,
> where its expected it won't touch any registers.  In that case, we'd
> have a sequence like:
>
>     push %eax
>     push %ecx
>     push %edx
>     call paravirt_cli
>     pop %edx
>     pop %ecx
>     pop %eax
>       
>
> If we parse the relocs, then we'd find the reference to paravirt_cli. 
> If we look at the byte before and see 0xe8, then we can see if its a
> call.  If we then work out in each direction and see matched push/pops,
> then we know what registers can be trashed in the call.  This also
> allows us to determine the callsite size, and therefore how much space
> we need for inlining.
>   

No, that is a very dangerous suggestion.  You absolutely *cannot* do 
this safely without explicitly marking the start EIP of this code.  You 
*must* use metadata to do that.  It is never safe to disassemble 
backwards or "rewind" EIP for x86 code.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20  1:23                         ` Zachary Amsden
@ 2007-03-20  1:45                           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-20  1:45 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Andi Kleen, David Miller, rusty, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, torvalds, netdev

Zachary Amsden wrote:
> Jeremy Fitzhardinge wrote:
>>  If we then work out in each direction and see matched push/pops,
>> then we know what registers can be trashed in the call.  This also
>> allows us to determine the callsite size, and therefore how much space
>> we need for inlining.
>>   
>
> No, that is a very dangerous suggestion.  You absolutely *cannot* do
> this safely without explicitly marking the start EIP of this code. 
> You *must* use metadata to do that.  It is never safe to disassemble
> backwards or "rewind" EIP for x86 code. 

What do you mean the instruction before is "mov $0x52515000,%eax"?

Yeah, you're right.  Oh well.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20  0:01                     ` Rusty Russell
@ 2007-03-20  2:00                       ` Zachary Amsden
  2007-03-20  4:20                           ` Rusty Russell
  2007-03-20  5:54                         ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-20  2:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Linus Torvalds, Eric W. Biederman, Andi Kleen, David Miller,
	jeremy, mingo, akpm, linux-kernel, virtualization, xen-devel,
	chrisw, anthony, netdev

Rusty Russell wrote:
> On Mon, 2007-03-19 at 11:38 -0700, Linus Torvalds wrote:
>   
>> On Mon, 19 Mar 2007, Eric W. Biederman wrote:
>>     
>>> True.  You can use all of the call clobbered registers.
>>>       
>> Quite often, the biggest single win of inlining is not so much the code 
>> size (although if done right, that will be smaller too), but the fact that 
>> inlining DOES NOT CLOBBER AS MANY REGISTERS!
>>     

For VMI, the default clobber was "cc", and you need a way to allow at 
least that, because saving and restoring flags is too expensive on x86.

> Thanks Linus.
>
> *This* was the reason that the current hand-coded calls only clobber %
> eax.  It was a compromise between native (no clobbers) and others (might
> need a reg).
>   

I still don't think this was a good trade.  The primary motivation for 
clobbering %eax was that Xen wanted a free register to use for computing 
the offset into the shared data in the case of SMP preemptible kernels.  
Xen no longer needs such a register, they can use the PDA offset 
instead.  And it does hurt native performance by unconditionally 
stealing a register in the four most commonly invoked paravirt-ops code 
sequences.

> Now, since we decided to allow paravirt_ops operations to be normal C
> (ie. the patching is optional and done late), we actually push and pop %
> ecx and %edx.  This makes the call site 10 bytes long, which is a nice
> size for patching anyway (enough for a movl $0, <addr>, a-la lguest's
> cli, or movw $0, %gs:<addr> if we supported SMP).
>   

You can do it in 11 bytes with no clobbers and normal C semantics by 
linking to a direct address instead of calling to an indirect, but then 
you need some gross fixup technology in paravirt_patch:

if (call_addr == (void*)native_sti) {
  ...
}

I think we should probably try to do it in 12 bytes.  Freeing eax to the 
inline caller is likely to make up the 2 bytes of space more we have to nop.

One thing I always tried to get in VMI was to encapsulate the actual 
code which went through the business of computing arguments that were 
not even used in the native case.  Unfortunately, that seems impossible 
in the current design, but I don't think it is an issue because I don't 
think there is actually a way to express:

SWITCHABLE_CODE_BLOCK_BEGIN {
   /* arbitrary C code for native */
} SWITCHABLE_CODE_BLOCK_ALTERNATIVE {
   /* arbitrary C code for something else */
}

Dave's linker suggestion is probably the best for things like that.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-19 20:59                             ` Andi Kleen
@ 2007-03-20  3:18                                 ` Linus Torvalds
  2007-03-20  3:18                                 ` Linus Torvalds
  1 sibling, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20  3:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, virtualization, jbeulich, jeremy, xen-devel,
	netdev, linux-kernel, chrisw, virtualization, anthony, akpm,
	mingo

On Mon, 19 Mar 2007, Andi Kleen wrote:
> 
> Initially we had some bugs that accounted for near all failures, but they 
> were all fixed in the latest version.

No. The real bugs were that the people involved wouldn't even accept that 
unwinding information was inevitably buggy and/or incomplete.

That much more fundamental bug never got fixed, as far as I know. 

I'm not going to merge anything that depends on unwind tables as things 
stand. The pain just isn't worth it.

> > Please don't subject us to another couple months of hair-pulling only
> > to have Linus yank the thing out again, there are certainly more
> > useful things to spend time on :-)

Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it 
out, I simply won't merge it. It was more than just totally buggy code, it 
was an inability of the people to understand that even bugfree code 
isn't enough - you have to be able to also handle buggy data.

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20  3:18                                 ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20  3:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: xen-devel, netdev, mingo, linux-kernel, jbeulich, virtualization,
	chrisw, virtualization, anthony, akpm, David Miller

On Mon, 19 Mar 2007, Andi Kleen wrote:
> 
> Initially we had some bugs that accounted for near all failures, but they 
> were all fixed in the latest version.

No. The real bugs were that the people involved wouldn't even accept that 
unwinding information was inevitably buggy and/or incomplete.

That much more fundamental bug never got fixed, as far as I know. 

I'm not going to merge anything that depends on unwind tables as things 
stand. The pain just isn't worth it.

> > Please don't subject us to another couple months of hair-pulling only
> > to have Linus yank the thing out again, there are certainly more
> > useful things to spend time on :-)

Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it 
out, I simply won't merge it. It was more than just totally buggy code, it 
was an inability of the people to understand that even bugfree code 
isn't enough - you have to be able to also handle buggy data.

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20  3:18                                 ` Linus Torvalds
  (?)
@ 2007-03-20  3:47                                 ` David Miller
  2007-03-20  4:19                                     ` Eric W. Biederman
  -1 siblings, 1 reply; 330+ messages in thread
From: David Miller @ 2007-03-20  3:47 UTC (permalink / raw)
  To: torvalds
  Cc: ak, virtualization, jbeulich, jeremy, xen-devel, netdev,
	linux-kernel, chrisw, virtualization, anthony, akpm, mingo

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 19 Mar 2007 20:18:14 -0700 (PDT)

> > > Please don't subject us to another couple months of hair-pulling only
> > > to have Linus yank the thing out again, there are certainly more
> > > useful things to spend time on :-)
> 
> Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it 
> out, I simply won't merge it. It was more than just totally buggy code, it 
> was an inability of the people to understand that even bugfree code 
> isn't enough - you have to be able to also handle buggy data.

Thank you.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20  3:47                                 ` David Miller
@ 2007-03-20  4:19                                     ` Eric W. Biederman
  0 siblings, 0 replies; 330+ messages in thread
From: Eric W. Biederman @ 2007-03-20  4:19 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, ak, virtualization, jbeulich, jeremy, xen-devel,
	netdev, linux-kernel, chrisw, virtualization, anthony, akpm,
	mingo

David Miller <davem@davemloft.net> writes:

> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Mon, 19 Mar 2007 20:18:14 -0700 (PDT)
>
>> > > Please don't subject us to another couple months of hair-pulling only
>> > > to have Linus yank the thing out again, there are certainly more
>> > > useful things to spend time on :-)
>> 
>> Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it 
>> out, I simply won't merge it. It was more than just totally buggy code, it 
>> was an inability of the people to understand that even bugfree code 
>> isn't enough - you have to be able to also handle buggy data.
>
> Thank you.

Hmm..

I know the feeling I have had a similar rant about the kexec on panic
code path.   The code is still no where near as paranoid about normal
kernel things not working as it could be, but by ranting about it
periodically the people doing the work are gradually making it better.

I'm conflicted about the dwarf unwinder.  I was off doing other things
at the time so I missed the pain, but I do have a distinct recollection of
the back traces on x86_64 being distinctly worse the on i386.  Lately
I haven't seen that so it may be I was misinterpreting what I was
seeing, and the compiler optimizations were what gave me such weird
back traces.  

But if the quality of our backtraces has gone down and dwarf unwinder
could give us better back traces it is likely worth pursuing.  Of
course it would need to start with the assumption that it's tables
may be borked (the kernel is busted after all) and be much more
careful than Andi's last attempt.

Eric

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20  4:19                                     ` Eric W. Biederman
  0 siblings, 0 replies; 330+ messages in thread
From: Eric W. Biederman @ 2007-03-20  4:19 UTC (permalink / raw)
  To: David Miller
  Cc: xen-devel, netdev, linux-kernel, jbeulich, chrisw,
	virtualization, anthony, akpm, virtualization, torvalds, mingo

David Miller <davem@davemloft.net> writes:

> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Mon, 19 Mar 2007 20:18:14 -0700 (PDT)
>
>> > > Please don't subject us to another couple months of hair-pulling only
>> > > to have Linus yank the thing out again, there are certainly more
>> > > useful things to spend time on :-)
>> 
>> Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it 
>> out, I simply won't merge it. It was more than just totally buggy code, it 
>> was an inability of the people to understand that even bugfree code 
>> isn't enough - you have to be able to also handle buggy data.
>
> Thank you.

Hmm..

I know the feeling I have had a similar rant about the kexec on panic
code path.   The code is still no where near as paranoid about normal
kernel things not working as it could be, but by ranting about it
periodically the people doing the work are gradually making it better.

I'm conflicted about the dwarf unwinder.  I was off doing other things
at the time so I missed the pain, but I do have a distinct recollection of
the back traces on x86_64 being distinctly worse the on i386.  Lately
I haven't seen that so it may be I was misinterpreting what I was
seeing, and the compiler optimizations were what gave me such weird
back traces.  

But if the quality of our backtraces has gone down and dwarf unwinder
could give us better back traces it is likely worth pursuing.  Of
course it would need to start with the assumption that it's tables
may be borked (the kernel is busted after all) and be much more
careful than Andi's last attempt.

Eric

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20  2:00                       ` Zachary Amsden
@ 2007-03-20  4:20                           ` Rusty Russell
  2007-03-20  5:54                         ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-20  4:20 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Linus Torvalds, Eric W. Biederman, Andi Kleen, David Miller,
	jeremy, mingo, akpm, linux-kernel, virtualization, xen-devel,
	chrisw, anthony, netdev

On Mon, 2007-03-19 at 18:00 -0800, Zachary Amsden wrote:
> Rusty Russell wrote:
> > *This* was the reason that the current hand-coded calls only clobber %
> > eax.  It was a compromise between native (no clobbers) and others (might
> > need a reg).
> 
> I still don't think this was a good trade.
...
> Xen no longer needs such a register

Hmm, well, if VMI is happy, Xen is happy, and lguest is happy, then
perhaps we're better off with a cc-only clobber rule?  Certainly makes
life simpler.

> > Now, since we decided to allow paravirt_ops operations to be normal C
> > (ie. the patching is optional and done late), we actually push and pop %
> > ecx and %edx.  This makes the call site 10 bytes long, which is a nice
> > size for patching anyway (enough for a movl $0, <addr>, a-la lguest's
> > cli, or movw $0, %gs:<addr> if we supported SMP).
> 
> You can do it in 11 bytes with no clobbers and normal C semantics by 
> linking to a direct address instead of calling to an indirect, but then 
> you need some gross fixup technology in paravirt_patch:
> 
> if (call_addr == (void*)native_sti) {
>   ...
> }

Well, I don't think we need such hacks: since we have to use handcoded
asm and mark the callsites anyway, marking what they're calling is
trivial.

The other idea from "btfixup" is that we can do the patching *much*
earlier, so we don't need the initial code to be valid at all if we
wanted to: we just need room to patch in a call insn.  We could then
generate trampolines which do the necessary pushes & pops automatically
for backends which want to use C calling conventions.

Perhaps it's time for code and benchmarks?

Rusty.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20  4:20                           ` Rusty Russell
  0 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-20  4:20 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: jeremy, xen-devel, akpm, virtualization, netdev, linux-kernel,
	chrisw, Andi Kleen, Eric W. Biederman, anthony, mingo,
	Linus Torvalds, David Miller

On Mon, 2007-03-19 at 18:00 -0800, Zachary Amsden wrote:
> Rusty Russell wrote:
> > *This* was the reason that the current hand-coded calls only clobber %
> > eax.  It was a compromise between native (no clobbers) and others (might
> > need a reg).
> 
> I still don't think this was a good trade.
...
> Xen no longer needs such a register

Hmm, well, if VMI is happy, Xen is happy, and lguest is happy, then
perhaps we're better off with a cc-only clobber rule?  Certainly makes
life simpler.

> > Now, since we decided to allow paravirt_ops operations to be normal C
> > (ie. the patching is optional and done late), we actually push and pop %
> > ecx and %edx.  This makes the call site 10 bytes long, which is a nice
> > size for patching anyway (enough for a movl $0, <addr>, a-la lguest's
> > cli, or movw $0, %gs:<addr> if we supported SMP).
> 
> You can do it in 11 bytes with no clobbers and normal C semantics by 
> linking to a direct address instead of calling to an indirect, but then 
> you need some gross fixup technology in paravirt_patch:
> 
> if (call_addr == (void*)native_sti) {
>   ...
> }

Well, I don't think we need such hacks: since we have to use handcoded
asm and mark the callsites anyway, marking what they're calling is
trivial.

The other idea from "btfixup" is that we can do the patching *much*
earlier, so we don't need the initial code to be valid at all if we
wanted to: we just need room to patch in a call insn.  We could then
generate trampolines which do the necessary pushes & pops automatically
for backends which want to use C calling conventions.

Perhaps it's time for code and benchmarks?

Rusty.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20  2:00                       ` Zachary Amsden
  2007-03-20  4:20                           ` Rusty Russell
@ 2007-03-20  5:54                         ` Jeremy Fitzhardinge
  2007-03-20 11:33                           ` Andreas Kleen
  2007-03-20 15:09                             ` Linus Torvalds
  1 sibling, 2 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-20  5:54 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Rusty Russell, Linus Torvalds, Eric W. Biederman, Andi Kleen,
	David Miller, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, anthony, netdev

Zachary Amsden wrote:
> For VMI, the default clobber was "cc", and you need a way to allow at
> least that, because saving and restoring flags is too expensive on x86.

According to lore (Andi, I think), asm() always clobbers cc. 

> I still don't think this was a good trade.  The primary motivation for
> clobbering %eax was that Xen wanted a free register to use for
> computing the offset into the shared data in the case of SMP
> preemptible kernels.  Xen no longer needs such a register, they can
> use the PDA offset instead.  And it does hurt native performance by
> unconditionally stealing a register in the four most commonly invoked
> paravirt-ops code sequences.

Actually, it still does need a temp register.  The sequence for cli is:

    mov %fs:xen_vcpu, %eax
    movb $1,1(%eax)

At some point I hope to move the vcpu structure directly into the
pda/percpu variables, at which point it will need no temps.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20  5:54                         ` Jeremy Fitzhardinge
@ 2007-03-20 11:33                           ` Andreas Kleen
  2007-03-20 15:09                             ` Linus Torvalds
  1 sibling, 0 replies; 330+ messages in thread
From: Andreas Kleen @ 2007-03-20 11:33 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, xen-devel, akpm, virtualization, netdev,
	linux-kernel, chrisw, Andi Kleen, Eric W. Biederman, anthony,
	mingo, Linus Torvalds, David Miller

Am Di 20.03.2007 06:54 schrieb Jeremy Fitzhardinge <jeremy@goop.org>:

> Zachary Amsden wrote:
> > For VMI, the default clobber was "cc", and you need a way to allow
> > at
> > least that, because saving and restoring flags is too expensive on
> > x86.
>
> According to lore (Andi, I think), asm() always clobbers cc.

asm with input and/or output. I'm not sure about asms without that.

-Andi



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20  4:19                                     ` Eric W. Biederman
@ 2007-03-20 13:28                                       ` Andi Kleen
  -1 siblings, 0 replies; 330+ messages in thread
From: Andi Kleen @ 2007-03-20 13:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, torvalds, virtualization, jbeulich, jeremy,
	xen-devel, netdev, linux-kernel, chrisw, virtualization, anthony,
	akpm, mingo

\
> I'm conflicted about the dwarf unwinder.  I was off doing other things
> at the time so I missed the pain, but I do have a distinct recollection of
> the back traces on x86_64 being distinctly worse the on i386. 

The only case were i386 was better was with frame pointers, which
was never fully implemented for x86-64. However i find that hilarious: 
people are spending a lot of time right here in this thread to squeeze
out the best call sequences for the paravirt ops, but then accept
losing a full frame pointer register on i386. I never found that
acceptable, that is why I prefered the unwinder instead. 

This said the big problem with the frame pointers is mostly gone now:
on older CPUs it tended to cause a pipeline stall early in the function.
That is now fixed in the latest Intel/upcomming AMD CPUs, but there 
are still millions and millions of older CPUs out there so I still
don't consider it acceptable.

> Lately 
> I haven't seen that so it may be I was misinterpreting what I was
> seeing, and the compiler optimizations were what gave me such weird
> back traces. 

The main problem is that subsystems are getting more and more complex
and especially callbacks seem to multiply far too quickly.

In 2.4 it was often very reasonable to just sort out the false positives,
but with sometimes 20-30+ level deep call chains in 2.6 with many callbacks that just
gets far too tenuous. 

> But if the quality of our backtraces has gone down and dwarf unwinder
> could give us better back traces it is likely worth pursuing.  Of
> course it would need to start with the assumption that it's tables
> may be borked (the kernel is busted after all) and be much more
> careful than Andi's last attempt.

The latest version validates the stack always. It was only a few lines
of change. I doubt it will make much difference though. The few true crashes
we had were not actually due the unwinder itself, but the buggy fallback code
(which were fixed quickly). But anyways it should satisfy everybody's paranoia now.

Although in future it would be good if people did some more analysis in root causes
for failures before let the paranoia take over and revert patches.

We see a good example here of what I call the JFS/ACPI effect: code gets merged
too early with some visible problems. It gets a bad name and afterwards people never
look objectively at it again and just trust their prejudices. 

But that's not a good strategy to get good code in the end I think. If there
is enough evidence the early problems were fixed then prejudices should
be reevaluated.

I will let it cook some time in -mm* and we will see if it works now or not.
I'm pretty confident it will though. And if it does there is no reason not 
to resubmit it.

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 13:28                                       ` Andi Kleen
  0 siblings, 0 replies; 330+ messages in thread
From: Andi Kleen @ 2007-03-20 13:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: jeremy, xen-devel, netdev, mingo, linux-kernel, jbeulich,
	virtualization, chrisw, virtualization, anthony, akpm, torvalds,
	David Miller

\
> I'm conflicted about the dwarf unwinder.  I was off doing other things
> at the time so I missed the pain, but I do have a distinct recollection of
> the back traces on x86_64 being distinctly worse the on i386. 

The only case were i386 was better was with frame pointers, which
was never fully implemented for x86-64. However i find that hilarious: 
people are spending a lot of time right here in this thread to squeeze
out the best call sequences for the paravirt ops, but then accept
losing a full frame pointer register on i386. I never found that
acceptable, that is why I prefered the unwinder instead. 

This said the big problem with the frame pointers is mostly gone now:
on older CPUs it tended to cause a pipeline stall early in the function.
That is now fixed in the latest Intel/upcomming AMD CPUs, but there 
are still millions and millions of older CPUs out there so I still
don't consider it acceptable.

> Lately 
> I haven't seen that so it may be I was misinterpreting what I was
> seeing, and the compiler optimizations were what gave me such weird
> back traces. 

The main problem is that subsystems are getting more and more complex
and especially callbacks seem to multiply far too quickly.

In 2.4 it was often very reasonable to just sort out the false positives,
but with sometimes 20-30+ level deep call chains in 2.6 with many callbacks that just
gets far too tenuous. 

> But if the quality of our backtraces has gone down and dwarf unwinder
> could give us better back traces it is likely worth pursuing.  Of
> course it would need to start with the assumption that it's tables
> may be borked (the kernel is busted after all) and be much more
> careful than Andi's last attempt.

The latest version validates the stack always. It was only a few lines
of change. I doubt it will make much difference though. The few true crashes
we had were not actually due the unwinder itself, but the buggy fallback code
(which were fixed quickly). But anyways it should satisfy everybody's paranoia now.

Although in future it would be good if people did some more analysis in root causes
for failures before let the paranoia take over and revert patches.

We see a good example here of what I call the JFS/ACPI effect: code gets merged
too early with some visible problems. It gets a bad name and afterwards people never
look objectively at it again and just trust their prejudices. 

But that's not a good strategy to get good code in the end I think. If there
is enough evidence the early problems were fixed then prejudices should
be reevaluated.

I will let it cook some time in -mm* and we will see if it works now or not.
I'm pretty confident it will though. And if it does there is no reason not 
to resubmit it.

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20  5:54                         ` Jeremy Fitzhardinge
@ 2007-03-20 15:09                             ` Linus Torvalds
  2007-03-20 15:09                             ` Linus Torvalds
  1 sibling, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 15:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, Rusty Russell, Eric W. Biederman, Andi Kleen,
	David Miller, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, anthony, netdev

On Mon, 19 Mar 2007, Jeremy Fitzhardinge wrote:
>
> Zachary Amsden wrote:
> > For VMI, the default clobber was "cc", and you need a way to allow at
> > least that, because saving and restoring flags is too expensive on x86.
> 
> According to lore (Andi, I think), asm() always clobbers cc. 

On x86, yes. Practically any instruction will clobber cc anyway, so it's 
the default.

Although the gcc guys have occasionally been suggesting we should set it 
in our asms.

> Actually, it still does need a temp register.  The sequence for cli is:
> 
>     mov %fs:xen_vcpu, %eax
>     movb $1,1(%eax)

We should just do this natively. There's been several tests over the years 
saying that it's much more efficient to do sti/cli as a simple store, and 
handling the "oops, we got an interrupt while interrupts were disabled" as 
a special case.

I have this dim memory that ARM has done it that way for a long time 
because it's so expensive to do a "real" cli/sti.

And I think -rt does it for other reasons. It's just more flexible.

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 15:09                             ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 15:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, akpm, virtualization, netdev, linux-kernel, chrisw,
	Andi Kleen, Eric W. Biederman, anthony, mingo, David Miller

On Mon, 19 Mar 2007, Jeremy Fitzhardinge wrote:
>
> Zachary Amsden wrote:
> > For VMI, the default clobber was "cc", and you need a way to allow at
> > least that, because saving and restoring flags is too expensive on x86.
> 
> According to lore (Andi, I think), asm() always clobbers cc. 

On x86, yes. Practically any instruction will clobber cc anyway, so it's 
the default.

Although the gcc guys have occasionally been suggesting we should set it 
in our asms.

> Actually, it still does need a temp register.  The sequence for cli is:
> 
>     mov %fs:xen_vcpu, %eax
>     movb $1,1(%eax)

We should just do this natively. There's been several tests over the years 
saying that it's much more efficient to do sti/cli as a simple store, and 
handling the "oops, we got an interrupt while interrupts were disabled" as 
a special case.

I have this dim memory that ARM has done it that way for a long time 
because it's so expensive to do a "real" cli/sti.

And I think -rt does it for other reasons. It's just more flexible.

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 15:09                             ` Linus Torvalds
  (?)
@ 2007-03-20 15:58                             ` Eric W. Biederman
  2007-03-20 16:06                                 ` Linus Torvalds
                                                 ` (2 more replies)
  -1 siblings, 3 replies; 330+ messages in thread
From: Eric W. Biederman @ 2007-03-20 15:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Rusty Russell, Andi Kleen,
	David Miller, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, anthony, netdev

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, 19 Mar 2007, Jeremy Fitzhardinge wrote:
>> Actually, it still does need a temp register.  The sequence for cli is:
>> 
>>     mov %fs:xen_vcpu, %eax
>>     movb $1,1(%eax)
>
> We should just do this natively. There's been several tests over the years 
> saying that it's much more efficient to do sti/cli as a simple store, and 
> handling the "oops, we got an interrupt while interrupts were disabled" as 
> a special case.
>
> I have this dim memory that ARM has done it that way for a long time 
> because it's so expensive to do a "real" cli/sti.
>
> And I think -rt does it for other reasons. It's just more flexible.

If that is the case.  In the normal kernel what would
the "the oops, we got an interrupt code do?"
I assume it would leave interrupts disabled when it returns?
Like we currently do with the delayed disable of normal interrupts?

I'm trying to understand the proposed semantics.

Looking at the above code snippet.  I guess it is about time to
merge our per_cpu and pda variables...

Eric

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 15:58                             ` Eric W. Biederman
@ 2007-03-20 16:06                                 ` Linus Torvalds
  2007-03-20 16:26                               ` Jeremy Fitzhardinge
  2007-03-20 22:41                                 ` Rusty Russell
  2 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 16:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Rusty Russell, Andi Kleen,
	David Miller, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, anthony, netdev



On Tue, 20 Mar 2007, Eric W. Biederman wrote:
> 
> If that is the case.  In the normal kernel what would
> the "the oops, we got an interrupt code do?"
> I assume it would leave interrupts disabled when it returns?
> Like we currently do with the delayed disable of normal interrupts?

Yeah, disable interrupts, and set a flag that the fake "sti" can test, and 
just return without doing anything.

(You may or may not also need to do extra work to Ack the hardware 
interrupt etc, which may be irq-controller specific. Once the CPU has 
accepted the interrupt, you may not be able to just leave it dangling)

But I was throwing that out as a long-term thing. I'm not claiming it's 
trivial, but it should be doable.

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 16:06                                 ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 16:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: xen-devel, virtualization, netdev, linux-kernel, David Miller,
	chrisw, Andi Kleen, anthony, akpm, mingo

On Tue, 20 Mar 2007, Eric W. Biederman wrote:
> 
> If that is the case.  In the normal kernel what would
> the "the oops, we got an interrupt code do?"
> I assume it would leave interrupts disabled when it returns?
> Like we currently do with the delayed disable of normal interrupts?

Yeah, disable interrupts, and set a flag that the fake "sti" can test, and 
just return without doing anything.

(You may or may not also need to do extra work to Ack the hardware 
interrupt etc, which may be irq-controller specific. Once the CPU has 
accepted the interrupt, you may not be able to just leave it dangling)

But I was throwing that out as a long-term thing. I'm not claiming it's 
trivial, but it should be doable.

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20  4:19                                     ` Eric W. Biederman
@ 2007-03-20 16:12                                       ` Chuck Ebbert
  -1 siblings, 0 replies; 330+ messages in thread
From: Chuck Ebbert @ 2007-03-20 16:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, torvalds, ak, virtualization, jbeulich, jeremy,
	xen-devel, netdev, linux-kernel, chrisw, virtualization, anthony,
	akpm, mingo

Eric W. Biederman wrote:
 
> I'm conflicted about the dwarf unwinder.  I was off doing other things
> at the time so I missed the pain, but I do have a distinct recollection of
> the back traces on x86_64 being distinctly worse the on i386.  Lately
> I haven't seen that so it may be I was misinterpreting what I was
> seeing, and the compiler optimizations were what gave me such weird
> back traces.  
> 

Well, if you compile x86_64 with frame pointers it helps a bit because
the compiler doesn't tail merge function calls. But the stack backtrace
ignores the frame pointers even if they're present, unlike i386 which
will use them.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 16:12                                       ` Chuck Ebbert
  0 siblings, 0 replies; 330+ messages in thread
From: Chuck Ebbert @ 2007-03-20 16:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: xen-devel, netdev, mingo, jbeulich, linux-kernel, chrisw,
	virtualization, anthony, akpm, virtualization, torvalds,
	David Miller

Eric W. Biederman wrote:
 
> I'm conflicted about the dwarf unwinder.  I was off doing other things
> at the time so I missed the pain, but I do have a distinct recollection of
> the back traces on x86_64 being distinctly worse the on i386.  Lately
> I haven't seen that so it may be I was misinterpreting what I was
> seeing, and the compiler optimizations were what gave me such weird
> back traces.  
> 

Well, if you compile x86_64 with frame pointers it helps a bit because
the compiler doesn't tail merge function calls. But the stack backtrace
ignores the frame pointers even if they're present, unlike i386 which
will use them.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 13:28                                       ` Andi Kleen
@ 2007-03-20 16:25                                         ` Eric W. Biederman
  -1 siblings, 0 replies; 330+ messages in thread
From: Eric W. Biederman @ 2007-03-20 16:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, torvalds, virtualization, jbeulich, jeremy,
	xen-devel, netdev, linux-kernel, chrisw, virtualization, anthony,
	akpm, mingo

Andi Kleen <ak@suse.de> writes:

>> I'm conflicted about the dwarf unwinder.  I was off doing other things
>> at the time so I missed the pain, but I do have a distinct recollection of
>> the back traces on x86_64 being distinctly worse the on i386. 
>
> The only case were i386 was better was with frame pointers, which
> was never fully implemented for x86-64. However i find that hilarious: 
> people are spending a lot of time right here in this thread to squeeze
> out the best call sequences for the paravirt ops, but then accept
> losing a full frame pointer register on i386. I never found that
> acceptable, that is why I prefered the unwinder instead. 
>
> This said the big problem with the frame pointers is mostly gone now:
> on older CPUs it tended to cause a pipeline stall early in the function.
> That is now fixed in the latest Intel/upcomming AMD CPUs, but there 
> are still millions and millions of older CPUs out there so I still
> don't consider it acceptable.

What I recall observing is call traces that made no sense.  Not just
extra noise in the stack trace but things like seeing a function that
has exactly one path to it, and not seeing all of the functions on
that path in the call trace.

In my later debugging I have been reasonably able to attribute those
kinds of things to compiler optimizations like inlining and tail call
optimization.

Now I will agree that having fewer or no false positives to weed
through is a good thing, if we can do it reliably.

>> Lately 
>> I haven't seen that so it may be I was misinterpreting what I was
>> seeing, and the compiler optimizations were what gave me such weird
>> back traces. 
>
> The main problem is that subsystems are getting more and more complex
> and especially callbacks seem to multiply far too quickly.
>
> In 2.4 it was often very reasonable to just sort out the false positives,
> but with sometimes 20-30+ level deep call chains in 2.6 with many callbacks that
> just
> gets far too tenuous. 

Hmm.  I haven't seen those traces, but I wonder if the size of those
stack traces indicates potential stack overflow problems.
  
>> But if the quality of our backtraces has gone down and dwarf unwinder
>> could give us better back traces it is likely worth pursuing.  Of
>> course it would need to start with the assumption that it's tables
>> may be borked (the kernel is busted after all) and be much more
>> careful than Andi's last attempt.
>
> The latest version validates the stack always. It was only a few lines
> of change. I doubt it will make much difference though. The few true crashes
> we had were not actually due the unwinder itself, but the buggy fallback code
> (which were fixed quickly). But anyways it should satisfy everybody's paranoia
> now.

Do you also validate the unwind data?

> Although in future it would be good if people did some more analysis in root
> causes for failures before let the paranoia take over and revert patches.
>
> We see a good example here of what I call the JFS/ACPI effect: code gets merged
> too early with some visible problems. It gets a bad name and afterwards people
> never look objectively at it again and just trust their prejudices. 

I don't know.  The impression I got was the root cause analysis stopped 
when it was observed that the code was unsuitable for solving the problem.
When asked about it, it appeared the developer did not understand the
question.  Therefore the root cause was assumed to be the developer.

At least that is how I have read the few little bits I have seen.

> But that's not a good strategy to get good code in the end I think. If there
> is enough evidence the early problems were fixed then prejudices should
> be reevaluated.

Certainly.  However if the developer has lost a certain amount of
initial trust, the burden becomes much higher.

Eric

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 16:25                                         ` Eric W. Biederman
  0 siblings, 0 replies; 330+ messages in thread
From: Eric W. Biederman @ 2007-03-20 16:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: xen-devel, netdev, mingo, linux-kernel, jbeulich, virtualization,
	chrisw, virtualization, anthony, akpm, torvalds, David Miller

Andi Kleen <ak@suse.de> writes:

>> I'm conflicted about the dwarf unwinder.  I was off doing other things
>> at the time so I missed the pain, but I do have a distinct recollection of
>> the back traces on x86_64 being distinctly worse the on i386. 
>
> The only case were i386 was better was with frame pointers, which
> was never fully implemented for x86-64. However i find that hilarious: 
> people are spending a lot of time right here in this thread to squeeze
> out the best call sequences for the paravirt ops, but then accept
> losing a full frame pointer register on i386. I never found that
> acceptable, that is why I prefered the unwinder instead. 
>
> This said the big problem with the frame pointers is mostly gone now:
> on older CPUs it tended to cause a pipeline stall early in the function.
> That is now fixed in the latest Intel/upcomming AMD CPUs, but there 
> are still millions and millions of older CPUs out there so I still
> don't consider it acceptable.

What I recall observing is call traces that made no sense.  Not just
extra noise in the stack trace but things like seeing a function that
has exactly one path to it, and not seeing all of the functions on
that path in the call trace.

In my later debugging I have been reasonably able to attribute those
kinds of things to compiler optimizations like inlining and tail call
optimization.

Now I will agree that having fewer or no false positives to weed
through is a good thing, if we can do it reliably.

>> Lately 
>> I haven't seen that so it may be I was misinterpreting what I was
>> seeing, and the compiler optimizations were what gave me such weird
>> back traces. 
>
> The main problem is that subsystems are getting more and more complex
> and especially callbacks seem to multiply far too quickly.
>
> In 2.4 it was often very reasonable to just sort out the false positives,
> but with sometimes 20-30+ level deep call chains in 2.6 with many callbacks that
> just
> gets far too tenuous. 

Hmm.  I haven't seen those traces, but I wonder if the size of those
stack traces indicates potential stack overflow problems.
  
>> But if the quality of our backtraces has gone down and dwarf unwinder
>> could give us better back traces it is likely worth pursuing.  Of
>> course it would need to start with the assumption that it's tables
>> may be borked (the kernel is busted after all) and be much more
>> careful than Andi's last attempt.
>
> The latest version validates the stack always. It was only a few lines
> of change. I doubt it will make much difference though. The few true crashes
> we had were not actually due the unwinder itself, but the buggy fallback code
> (which were fixed quickly). But anyways it should satisfy everybody's paranoia
> now.

Do you also validate the unwind data?

> Although in future it would be good if people did some more analysis in root
> causes for failures before let the paranoia take over and revert patches.
>
> We see a good example here of what I call the JFS/ACPI effect: code gets merged
> too early with some visible problems. It gets a bad name and afterwards people
> never look objectively at it again and just trust their prejudices. 

I don't know.  The impression I got was the root cause analysis stopped 
when it was observed that the code was unsuitable for solving the problem.
When asked about it, it appeared the developer did not understand the
question.  Therefore the root cause was assumed to be the developer.

At least that is how I have read the few little bits I have seen.

> But that's not a good strategy to get good code in the end I think. If there
> is enough evidence the early problems were fixed then prejudices should
> be reevaluated.

Certainly.  However if the developer has lost a certain amount of
initial trust, the burden becomes much higher.

Eric

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 15:58                             ` Eric W. Biederman
  2007-03-20 16:06                                 ` Linus Torvalds
@ 2007-03-20 16:26                               ` Jeremy Fitzhardinge
  2007-03-20 22:41                                 ` Rusty Russell
  2 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-20 16:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Zachary Amsden, Rusty Russell, Andi Kleen,
	David Miller, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, anthony, netdev

Eric W. Biederman wrote:
> Looking at the above code snippet.  I guess it is about time to
> merge our per_cpu and pda variables...

Rusty has a nice patch series to do just that; I think he's still
looking for a taker.

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 16:06                                 ` Linus Torvalds
@ 2007-03-20 16:31                                   ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-20 16:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Zachary Amsden, Rusty Russell, Andi Kleen,
	David Miller, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, anthony, netdev

Linus Torvalds wrote:
> On Tue, 20 Mar 2007, Eric W. Biederman wrote:
>   
>> If that is the case.  In the normal kernel what would
>> the "the oops, we got an interrupt code do?"
>> I assume it would leave interrupts disabled when it returns?
>> Like we currently do with the delayed disable of normal interrupts?
>>     
>
> Yeah, disable interrupts, and set a flag that the fake "sti" can test, and 
> just return without doing anything.
>
> (You may or may not also need to do extra work to Ack the hardware 
> interrupt etc, which may be irq-controller specific. Once the CPU has 
> accepted the interrupt, you may not be able to just leave it dangling)
>   

So it would be something like:

    pda.intr_mask = 1;		/* disable interrupts */
    ...
    pda.intr_mask = 0;		/* enable interrupts */
    if (xchg(&pda.intr_pending, 0))	/* check pending */
    	asm("sti");		/* was pending; isr left cpu interrupts masked */
      

and in the interrupt handler:

    if (pda.intr_mask) {
    	pda.intr_pending = 1;
    	regs->eflags &= ~IF;
    	maybe_ack_interrupt_controller();
    	iret
      

    }

?

    J


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 16:31                                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-20 16:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zachary Amsden, xen-devel, akpm, virtualization, netdev,
	Rusty Russell, linux-kernel, chrisw, Andi Kleen,
	Eric W. Biederman, anthony, mingo, David Miller

Linus Torvalds wrote:
> On Tue, 20 Mar 2007, Eric W. Biederman wrote:
>   
>> If that is the case.  In the normal kernel what would
>> the "the oops, we got an interrupt code do?"
>> I assume it would leave interrupts disabled when it returns?
>> Like we currently do with the delayed disable of normal interrupts?
>>     
>
> Yeah, disable interrupts, and set a flag that the fake "sti" can test, and 
> just return without doing anything.
>
> (You may or may not also need to do extra work to Ack the hardware 
> interrupt etc, which may be irq-controller specific. Once the CPU has 
> accepted the interrupt, you may not be able to just leave it dangling)
>   

So it would be something like:

    pda.intr_mask = 1;		/* disable interrupts */
    ...
    pda.intr_mask = 0;		/* enable interrupts */
    if (xchg(&pda.intr_pending, 0))	/* check pending */
    	asm("sti");		/* was pending; isr left cpu interrupts masked */
      

and in the interrupt handler:

    if (pda.intr_mask) {
    	pda.intr_pending = 1;
    	regs->eflags &= ~IF;
    	maybe_ack_interrupt_controller();
    	iret
      

    }

?

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 17:42                                         ` Andi Kleen
@ 2007-03-20 16:52                                             ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 16:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, David Miller, virtualization, jbeulich,
	jeremy, xen-devel, netdev, linux-kernel, chrisw, virtualization,
	anthony, akpm, mingo

On Tue, 20 Mar 2007, Andi Kleen wrote:
> 
> No, me and Jan fixed all reported bugs as far as I know.

No you did not. You didn't fix the ones I reported. Which is why it got 
removed, and will not get added back until there is another maintainer.

The ones I reported were all about trusting the stack contents implicitly, 
and assuming that the unwind info was there and valid. Using things like 
"__get_user()" didn't fix it, because if a WARN_ON() happened while we 
held the mm semaphore and the unwind info was bogus, it would take a 
page-fault and deadlock.

Those kinds of things are not acceptable for debugging output. If I cannot 
use WARN_ON() because I hold the MM lock and I'm afraid there might be 
kernel corruption, then something is *wrong*!

And I told you guys this. Over *months*. And you ignored me. You told me 
everything was fine. Each time, somebody else ended up reporting a hang 
where the unwinder was at fault. And since I couldn't trust the 
maintainers to fix it, removing the broken feature that only caused more 
problems than it fixed was the only option.

And you clearly *still* haven't accepted the fact that the code was buggy. 

Does anybody wonder why I wouldn't merge it back?

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 16:52                                             ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 16:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: xen-devel, netdev, mingo, linux-kernel, jbeulich, virtualization,
	chrisw, virtualization, Eric W. Biederman, anthony, akpm,
	David Miller

On Tue, 20 Mar 2007, Andi Kleen wrote:
> 
> No, me and Jan fixed all reported bugs as far as I know.

No you did not. You didn't fix the ones I reported. Which is why it got 
removed, and will not get added back until there is another maintainer.

The ones I reported were all about trusting the stack contents implicitly, 
and assuming that the unwind info was there and valid. Using things like 
"__get_user()" didn't fix it, because if a WARN_ON() happened while we 
held the mm semaphore and the unwind info was bogus, it would take a 
page-fault and deadlock.

Those kinds of things are not acceptable for debugging output. If I cannot 
use WARN_ON() because I hold the MM lock and I'm afraid there might be 
kernel corruption, then something is *wrong*!

And I told you guys this. Over *months*. And you ignored me. You told me 
everything was fine. Each time, somebody else ended up reporting a hang 
where the unwinder was at fault. And since I couldn't trust the 
maintainers to fix it, removing the broken feature that only caused more 
problems than it fixed was the only option.

And you clearly *still* haven't accepted the fact that the code was buggy. 

Does anybody wonder why I wouldn't merge it back?

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 15:09                             ` Linus Torvalds
  (?)
  (?)
@ 2007-03-20 17:00                             ` Ingo Molnar
  -1 siblings, 0 replies; 330+ messages in thread
From: Ingo Molnar @ 2007-03-20 17:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Rusty Russell,
	Eric W. Biederman, Andi Kleen, David Miller, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I have this dim memory that ARM has done it that way for a long time 
> because it's so expensive to do a "real" cli/sti.
> 
> And I think -rt does it for other reasons. It's just more flexible.

-rt doesnt wrap cli/sti anymore: spin_lock_irq*() doesnt disable irq 
flags on -rt, thus the amount of real irqs-off sections is very small 
and reviewable.

But nevertheless an incarnation of the code survived and is upstream 
already, in the form of TRACE_IRQFLAGS lockdep code ;) This implements a 
soft hardirq flag _today_: all that would be needed is for Xen to define 
raw_local_irq_disable() as a NOP, and to use the 
current->hardirqs_enabled as 'soft IRQ-off flag'.

Note that ->hardirqs_enabled is self-maintained, i.e. it's not just a 
stupid shadow of the hardirq flag, it's an independently maintained flag 
that does not rely on the existence of the hard flag.

[ this code even has its own debugging code, so out-of-sync-flags,
  double-off and double-on is detected and complained about. So all the 
  hard stuff has already been done as part of lockdep, and it's even 
  long-term maintainable because under a native lockdep kernel we check 
  the soft flag against the hard flag. ]

so i think a soft cli/sti flag support should be merged/unified with the 
trace_hardirqs_on()/trace_hardirqs_off() code.

	Ingo

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 18:03                                             ` Andi Kleen
@ 2007-03-20 17:27                                                 ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 17:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, David Miller, virtualization, jbeulich,
	jeremy, xen-devel, netdev, linux-kernel, chrisw, virtualization,
	anthony, akpm, mingo

On Tue, 20 Mar 2007, Andi Kleen wrote:
> 
> The code never did that. In fact many of the problems we had initially
> especially came out of that -- the fallback code that would handle
> this case wasn't fully correct.

I don't keep my emails any more, but you *never* fixed the problems in 
arch/*/kernel/traps.c.

Yes, the kernel/unwind.c issues generally got fixed. The infinite loops in 
the *callers* never did.

> Also frankly often your analysis about what went wrong was just
> incorrect.

Still in denial, I see.

Do you still claim that "the fallback position always did the right 
thing"? Despite the fact that the unwinder had sometimes *corrupted* the 
incoming information so much that the fallback position was the one that 
oopsed? And no, you didn't fix that.

And no, IT DID NOT use probe_kernel_address like you still claim.

Anyway, you work for Suse, I don't care what you do to the Suse kernel. 
Maybe it will get stable some day. Somehow, I doubt it. 

			Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 17:27                                                 ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 17:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: xen-devel, netdev, mingo, linux-kernel, jbeulich, virtualization,
	chrisw, virtualization, Eric W. Biederman, anthony, akpm,
	David Miller

On Tue, 20 Mar 2007, Andi Kleen wrote:
> 
> The code never did that. In fact many of the problems we had initially
> especially came out of that -- the fallback code that would handle
> this case wasn't fully correct.

I don't keep my emails any more, but you *never* fixed the problems in 
arch/*/kernel/traps.c.

Yes, the kernel/unwind.c issues generally got fixed. The infinite loops in 
the *callers* never did.

> Also frankly often your analysis about what went wrong was just
> incorrect.

Still in denial, I see.

Do you still claim that "the fallback position always did the right 
thing"? Despite the fact that the unwinder had sometimes *corrupted* the 
incoming information so much that the fallback position was the one that 
oopsed? And no, you didn't fix that.

And no, IT DID NOT use probe_kernel_address like you still claim.

Anyway, you work for Suse, I don't care what you do to the Suse kernel. 
Maybe it will get stable some day. Somehow, I doubt it. 

			Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 16:25                                         ` Eric W. Biederman
  (?)
@ 2007-03-20 17:42                                         ` Andi Kleen
  2007-03-20 16:52                                             ` Linus Torvalds
  -1 siblings, 1 reply; 330+ messages in thread
From: Andi Kleen @ 2007-03-20 17:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, David Miller, torvalds, virtualization, jbeulich,
	jeremy, xen-devel, netdev, linux-kernel, chrisw, virtualization,
	anthony, akpm, mingo

On Tue, Mar 20, 2007 at 10:25:20AM -0600, Eric W. Biederman wrote:
> What I recall observing is call traces that made no sense.  Not just
> extra noise in the stack trace but things like seeing a function that
> has exactly one path to it, and not seeing all of the functions on
> that path in the call trace.

That's tail call/sibling call optimization. No unwinder can untangle that
because the return address is lost.  But it's also an quite important optimization.

> >
> > In 2.4 it was often very reasonable to just sort out the false positives,
> > but with sometimes 20-30+ level deep call chains in 2.6 with many callbacks that
> > just
> > gets far too tenuous. 
> 
> Hmm.  I haven't seen those traces, but I wonder if the size of those
> stack traces indicates potential stack overflow problems.

Most functions have quite small frames, so 20-30 is still not a problem

> Do you also validate the unwind data?

There are many sanity checks in the unwind code and it will fall back
to the old unwinder when it gets stuck.

> 
> > Although in future it would be good if people did some more analysis in root
> > causes for failures before let the paranoia take over and revert patches.
> >
> > We see a good example here of what I call the JFS/ACPI effect: code gets merged
> > too early with some visible problems. It gets a bad name and afterwards people
> > never look objectively at it again and just trust their prejudices. 
> 
> I don't know.  The impression I got was the root cause analysis stopped 
> when it was observed that the code was unsuitable for solving the problem.

No, me and Jan fixed all reported bugs as far as I know.

-Andi


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 16:52                                             ` Linus Torvalds
  (?)
@ 2007-03-20 18:03                                             ` Andi Kleen
  2007-03-20 17:27                                                 ` Linus Torvalds
  -1 siblings, 1 reply; 330+ messages in thread
From: Andi Kleen @ 2007-03-20 18:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Eric W. Biederman, David Miller, virtualization,
	jbeulich, jeremy, xen-devel, netdev, linux-kernel, chrisw,
	virtualization, anthony, akpm, mingo

On Tue, Mar 20, 2007 at 09:52:42AM -0700, Linus Torvalds wrote:
> The ones I reported were all about trusting the stack contents implicitly, 

The latest version added the full double check you wanted - every
new RSP is validated against the current stack or the exception stacks.

> and assuming that the unwind info was there and valid. Using things like 

The code never did that. In fact many of the problems we had initially
especially came out of that -- the fallback code that would handle
this case wasn't fully correct.

> "__get_user()" didn't fix it, because if a WARN_ON() happened while we 
> held the mm semaphore and the unwind info was bogus, it would take a 
> page-fault and deadlock.

That was fixed too even before you dropped it by using probe_kernel_address()

> And I told you guys this. Over *months*. And you ignored me. You told me 
> everything was fine. Each time, somebody else ended up reporting a hang 
> where the unwinder was at fault. 

That was because most of the bugs were reported many times duplicated --
and reports from older releases kept streaming in even after the fix
went in.

Also frankly often your analysis about what went wrong was just
incorrect.

> And you clearly *still* haven't accepted the fact that the code was buggy. 

There were some bugs, but they were all fixed. Or at least I hope -- 
we will find out during testing.

> Does anybody wonder why I wouldn't merge it back?

So do you have an alternative to the unwinder? Don't tell me 
"real men decode 30+ entry long stack traces with lots of callbacks
by hand". Or do you prefer to use frame pointers everywhere? 

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 19:21                                                 ` Andi Kleen
@ 2007-03-20 18:49                                                     ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 18:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, David Miller, virtualization, jbeulich,
	jeremy, xen-devel, netdev, linux-kernel, chrisw, virtualization,
	anthony, akpm, mingo

On Tue, 20 Mar 2007, Andi Kleen wrote:
> 
> So what is your proposed alternative to handle long backtraces? 
> You didn't answer that question. Please do, I'm curious about your thoughts
> in this area.

the thing is, I'd rather see a long backtrace that is hard to decipher but 
that *never* *ever* causes any additional problems, over a pretty one.

Because that's really the issue: do you want a "pretty" backtrace, or do 
you want one that is rock solid but has some crud in it.

I'll take the rock solid one any day. Especially as even the pretty one 
won't fix the most common problem, which is "I don't see the caller" (due 
to inlining or tail-calls).

In contrast, the ugly backtrace will have some "garbage entries" from 
previous frames that didn't get overwritten, but there have actually been 
(admittedly rare) cases where those garbage entries have given hints about 
what happened just before.

			Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 18:49                                                     ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 18:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: xen-devel, netdev, mingo, linux-kernel, jbeulich, virtualization,
	chrisw, virtualization, Eric W. Biederman, anthony, akpm,
	David Miller

On Tue, 20 Mar 2007, Andi Kleen wrote:
> 
> So what is your proposed alternative to handle long backtraces? 
> You didn't answer that question. Please do, I'm curious about your thoughts
> in this area.

the thing is, I'd rather see a long backtrace that is hard to decipher but 
that *never* *ever* causes any additional problems, over a pretty one.

Because that's really the issue: do you want a "pretty" backtrace, or do 
you want one that is rock solid but has some crud in it.

I'll take the rock solid one any day. Especially as even the pretty one 
won't fix the most common problem, which is "I don't see the caller" (due 
to inlining or tail-calls).

In contrast, the ugly backtrace will have some "garbage entries" from 
previous frames that didn't get overwritten, but there have actually been 
(admittedly rare) cases where those garbage entries have given hints about 
what happened just before.

			Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 17:27                                                 ` Linus Torvalds
  (?)
@ 2007-03-20 19:21                                                 ` Andi Kleen
  2007-03-20 18:49                                                     ` Linus Torvalds
  -1 siblings, 1 reply; 330+ messages in thread
From: Andi Kleen @ 2007-03-20 19:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Eric W. Biederman, David Miller, virtualization,
	jbeulich, jeremy, xen-devel, netdev, linux-kernel, chrisw,
	virtualization, anthony, akpm, mingo

On Tue, Mar 20, 2007 at 10:27:00AM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 20 Mar 2007, Andi Kleen wrote:
> > 
> > The code never did that. In fact many of the problems we had initially
> > especially came out of that -- the fallback code that would handle
> > this case wasn't fully correct.
> 
> I don't keep my emails any more, but you *never* fixed the problems in 
> arch/*/kernel/traps.c.

I fixed that one after you dropped it (hmm, double checking: or at least
I thought I had fixed it, but don't see the code right now; will redo then) 
Basically it was just a one liner anyways - always check against all the
stacks that are there.


> Yes, the kernel/unwind.c issues generally got fixed. The infinite loops in 
> the *callers* never did.

There was later a weaker form that should have caught most loops, but admittedly
it wasn't 100% bullet-proof with exception stacks.

> 
> > Also frankly often your analysis about what went wrong was just
> > incorrect.
> 
> Still in denial, I see.
> 
> Do you still claim that "the fallback position always did the right 
> thing"

No initially it was buggy and that caused several of the crashes.

> Despite the fact that the unwinder had sometimes *corrupted* the 
> incoming information so much that the fallback position was the one that 
> oopsed? And no, you didn't fix that.

No, it oopsed because it was broken by itself.
Anyways that got fixed quickly.

> 
> And no, IT DID NOT use probe_kernel_address like you still claim.

There definitely was a patch that made it use it. You might have
not merged it though.
 
> Anyway, you work for Suse, I don't care what you do to the Suse kernel. 
> Maybe it will get stable some day. Somehow, I doubt it. 

So what is your proposed alternative to handle long backtraces? 
You didn't answer that question. Please do, I'm curious about your thoughts
in this area.

-Andi


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 16:06                                 ` Linus Torvalds
  (?)
  (?)
@ 2007-03-20 19:28                                 ` Andi Kleen
  2007-03-20 19:54                                   ` Zachary Amsden
  -1 siblings, 1 reply; 330+ messages in thread
From: Andi Kleen @ 2007-03-20 19:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Jeremy Fitzhardinge, Zachary Amsden,
	Rusty Russell, David Miller, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

> But I was throwing that out as a long-term thing. I'm not claiming it's 
> trivial, but it should be doable.

One thing I was pondering was to replace the expensive popfs with

bt  $IF,(%rsp) 
jnc 1f 
sti
1: 

But would be mostly a P4 optimization again and I'm not 100% sure it is
worth it.

-Andi


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 19:28                                 ` Andi Kleen
@ 2007-03-20 19:54                                   ` Zachary Amsden
  2007-03-20 20:02                                     ` Andi Kleen
  0 siblings, 1 reply; 330+ messages in thread
From: Zachary Amsden @ 2007-03-20 19:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Eric W. Biederman, Jeremy Fitzhardinge,
	Rusty Russell, David Miller, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

Andi Kleen wrote:
> One thing I was pondering was to replace the expensive popfs with
>
> bt  $IF,(%rsp) 
> jnc 1f 
> sti
> 1: 
>
> But would be mostly a P4 optimization again and I'm not 100% sure it is
> worth it.
>   

Worth it on 32-bit.  On AMD64, probably not.  On Intel 64-bit, maybe, 
but less important than in P4 days.

This could change character completely if used at the tail of a function 
where you now have

sti; 1: ret

Which generates an interrupt holdoff on the ret, an unusual thing to do.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 19:54                                   ` Zachary Amsden
@ 2007-03-20 20:02                                     ` Andi Kleen
  0 siblings, 0 replies; 330+ messages in thread
From: Andi Kleen @ 2007-03-20 20:02 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Linus Torvalds, Eric W. Biederman, Jeremy Fitzhardinge,
	Rusty Russell, David Miller, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

> Worth it on 32-bit.  On AMD64, probably not.  On Intel 64-bit, maybe, 
> but less important than in P4 days.

Well most of Intel 64bit is P4 -- and Intel is still shipping
millions more of them each quarter.

> This could change character completely if used at the tail of a function 
> where you now have
> 
> sti; 1: ret
> 
> Which generates an interrupt holdoff on the ret, an unusual thing to do.

Unusual yes, but I don't see how it should cause problems. Or do you
have anything specific in mind? 

Worse is probably that on K8 this case might cause a pipeline stall
if there isn't a prefix in front of the ret.

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 18:49                                                     ` Linus Torvalds
  (?)
@ 2007-03-20 20:23                                                     ` Andi Kleen
  2007-03-20 21:39                                                         ` Alan Cox
                                                                         ` (2 more replies)
  -1 siblings, 3 replies; 330+ messages in thread
From: Andi Kleen @ 2007-03-20 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Eric W. Biederman, David Miller, virtualization,
	jbeulich, jeremy, xen-devel, netdev, linux-kernel, chrisw,
	virtualization, anthony, akpm, mingo

On Tue, Mar 20, 2007 at 11:49:39AM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 20 Mar 2007, Andi Kleen wrote:
> > 
> > So what is your proposed alternative to handle long backtraces? 
> > You didn't answer that question. Please do, I'm curious about your thoughts
> > in this area.
> 
> the thing is, I'd rather see a long backtrace that is hard to decipher but 
> that *never* *ever* causes any additional problems, over a pretty one.

Well it causes additional problems. We had some cases where it was really
hard to distingush garbage and the true call chain. I can probably dig
out some examples if you want.

With lots of call backs (e.g. common with sysfs) it is also frequently
not obvious how the call chains are supposed to go.

I would have agreed with you in the 2.4 time frame. It was also
always fun to explain this to a newbie with oopses and watch their
face when they secretly think you're crazy ;-) But code is getting
much and more complex unfortunately and I had a few cases where
it was really ugly to sort them out. 

> Because that's really the issue: do you want a "pretty" backtrace, or do 
> you want one that is rock solid but has some crud in it.

I just want an as exact backtrace as possible. I also think
that we can make the unwinder robust enough.

> I'll take the rock solid one any day. Especially as even the pretty one 
> won't fix the most common problem, which is "I don't see the caller" (due 
> to inlining or tail-calls).

The problem is not one or a few levels of calls. That can be always figured
out from the source. The problem is when you have a double
digit number of them with non obvious (dynamic indirect) dependencies.  Yes it 
can be figured out too, but it is a long and very annoying process
with lots of grepping etc. and even then you sometimes can't be 100% 
sure in some cases. 

It's also a mechanic process that just cries to be done by a machine instead.

> In contrast, the ugly backtrace will have some "garbage entries" from 
> previous frames that didn't get overwritten, but there have actually been 

If it was only a few that would be great ...

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 20:23                                                     ` Andi Kleen
  2007-03-20 21:39                                                         ` Alan Cox
@ 2007-03-20 21:39                                                         ` Alan Cox
  2007-03-21  6:08                                                         ` Andrew Morton
  2 siblings, 0 replies; 330+ messages in thread
From: Alan Cox @ 2007-03-20 21:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, jeremy, xen-devel, netdev, mingo, Andi Kleen,
	jbeulich, virtualization, chrisw, virtualization,
	Eric W. Biederman, anthony, akpm, David Miller, linux-kernel

> > Because that's really the issue: do you want a "pretty" backtrace, or do 
> > you want one that is rock solid but has some crud in it.
> 
> I just want an as exact backtrace as possible. I also think
> that we can make the unwinder robust enough.

Any reason you can't put the exact back trace in "[xxx]" and the ones we
see on the stack which dont look like call trace as ?xxx? It makes the
code a bit trickier but we depend on the quality of traces

Alan

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 21:39                                                         ` Alan Cox
  0 siblings, 0 replies; 330+ messages in thread
From: Alan Cox @ 2007-03-20 21:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: jeremy, xen-devel, Andi, David Miller, netdev, Kleen, jbeulich,
	virtualization, chrisw, virtualization, Eric W. Biederman,
	anthony, mingo, Linus Torvalds, akpm, linux-kernel

> > Because that's really the issue: do you want a "pretty" backtrace, or do 
> > you want one that is rock solid but has some crud in it.
> 
> I just want an as exact backtrace as possible. I also think
> that we can make the unwinder robust enough.

Any reason you can't put the exact back trace in "[xxx]" and the ones we
see on the stack which dont look like call trace as ?xxx? It makes the
code a bit trickier but we depend on the quality of traces

Alan

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 21:39                                                         ` Alan Cox
  0 siblings, 0 replies; 330+ messages in thread
From: Alan Cox @ 2007-03-20 21:39 UTC (permalink / raw)
  Cc: jeremy, xen-devel, Andi, David Miller, netdev, Kleen, jbeulich,
	virtualization, chrisw, virtualization, Eric W. Biederman,
	anthony, mingo, Linus Torvalds, akpm, linux-kernel

> > Because that's really the issue: do you want a "pretty" backtrace, or do 
> > you want one that is rock solid but has some crud in it.
> 
> I just want an as exact backtrace as possible. I also think
> that we can make the unwinder robust enough.

Any reason you can't put the exact back trace in "[xxx]" and the ones we
see on the stack which dont look like call trace as ?xxx? It makes the
code a bit trickier but we depend on the quality of traces

Alan

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 21:39                                                         ` Alan Cox
  (?)
  (?)
@ 2007-03-20 21:49                                                         ` Andi Kleen
  2007-03-20 23:51                                                             ` Linus Torvalds
  -1 siblings, 1 reply; 330+ messages in thread
From: Andi Kleen @ 2007-03-20 21:49 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Linus Torvalds, jeremy, xen-devel, netdev, mingo,
	jbeulich, virtualization, chrisw, virtualization,
	Eric W. Biederman, anthony, akpm, David Miller, linux-kernel

On Tue, Mar 20, 2007 at 09:39:18PM +0000, Alan Cox wrote:
> > > Because that's really the issue: do you want a "pretty" backtrace, or do 
> > > you want one that is rock solid but has some crud in it.
> > 
> > I just want an as exact backtrace as possible. I also think
> > that we can make the unwinder robust enough.
> 
> Any reason you can't put the exact back trace in "[xxx]" and the ones we
> see on the stack which dont look like call trace as ?xxx? It makes the
> code a bit trickier but we depend on the quality of traces

Linus is worried about the unwinder crashing -- that wouldn't help with that.

What the (now out of tree) unwinder does is to check if it finishes
the trace and if not fall back to the old unwinder.

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 16:31                                   ` Jeremy Fitzhardinge
@ 2007-03-20 22:09                                     ` Zachary Amsden
  -1 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-20 22:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linus Torvalds, Eric W. Biederman, Rusty Russell, Andi Kleen,
	David Miller, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, anthony, netdev

Jeremy Fitzhardinge wrote:

>> Yeah, disable interrupts, and set a flag that the fake "sti" can test, and 
>> just return without doing anything.
>>
>> (You may or may not also need to do extra work to Ack the hardware 
>> interrupt etc, which may be irq-controller specific. Once the CPU has 
>> accepted the interrupt, you may not be able to just leave it dangling)
>>   
>>     
>
> So it would be something like:
>
>     pda.intr_mask = 1;		/* disable interrupts */
>     ...
>     pda.intr_mask = 0;		/* enable interrupts */
>     if (xchg(&pda.intr_pending, 0))	/* check pending */
>   

Well, can't do xchg, since it implies #LOCK, and you'll lose more than 
you gain on the processors where it matters.  Cmpxchg is fine, but 
processor dependent.

Or, just make the interrupt handlers use software resend for IRQs when 
pda.intr_mask is set to zero.  Now, local_irq_save / restore are very 
pretty:

int local_irq_save(void)
{
   int tmp = pda.intr_mask;
   pda.intr_mask = 0;
   /*
    * note there is a window here where local IRQs notice intr_mask == 0
    * in that case, they will attempt to resend the IRQ via a tasklet,
    * and will succeed, albeit through a slightly longer path
    */
   local_bh_disable();
   return tmp;
}

void local_irq_restore(int enabled)
{
    pda.intr_mask = enabled;
    /*
     * note there is a window here where softirqs are not processed by
     * the interrupt handler, but that is not a problem, since it will
     * get done here in the outer enable of any nested pair.
     */
    if (enabled)
        local_bh_enable();
}

I think Ingo's suggestion of using the hardirq tracing is another way 
that could work, but it seems to be too heavyweight and tied too much to 
the lockdep verification code - plus it inserts additional 
raw_irq_disable in places that seem counter to the goal of getting rid 
of them in the first place.  Perhaps I'm misunderstanding the code, though.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 22:09                                     ` Zachary Amsden
  0 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-20 22:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, akpm, virtualization, netdev, Rusty Russell,
	linux-kernel, chrisw, Andi Kleen, Eric W. Biederman, anthony,
	mingo, Linus Torvalds, David Miller

Jeremy Fitzhardinge wrote:

>> Yeah, disable interrupts, and set a flag that the fake "sti" can test, and 
>> just return without doing anything.
>>
>> (You may or may not also need to do extra work to Ack the hardware 
>> interrupt etc, which may be irq-controller specific. Once the CPU has 
>> accepted the interrupt, you may not be able to just leave it dangling)
>>   
>>     
>
> So it would be something like:
>
>     pda.intr_mask = 1;		/* disable interrupts */
>     ...
>     pda.intr_mask = 0;		/* enable interrupts */
>     if (xchg(&pda.intr_pending, 0))	/* check pending */
>   

Well, can't do xchg, since it implies #LOCK, and you'll lose more than 
you gain on the processors where it matters.  Cmpxchg is fine, but 
processor dependent.

Or, just make the interrupt handlers use software resend for IRQs when 
pda.intr_mask is set to zero.  Now, local_irq_save / restore are very 
pretty:

int local_irq_save(void)
{
   int tmp = pda.intr_mask;
   pda.intr_mask = 0;
   /*
    * note there is a window here where local IRQs notice intr_mask == 0
    * in that case, they will attempt to resend the IRQ via a tasklet,
    * and will succeed, albeit through a slightly longer path
    */
   local_bh_disable();
   return tmp;
}

void local_irq_restore(int enabled)
{
    pda.intr_mask = enabled;
    /*
     * note there is a window here where softirqs are not processed by
     * the interrupt handler, but that is not a problem, since it will
     * get done here in the outer enable of any nested pair.
     */
    if (enabled)
        local_bh_enable();
}

I think Ingo's suggestion of using the hardirq tracing is another way 
that could work, but it seems to be too heavyweight and tied too much to 
the lockdep verification code - plus it inserts additional 
raw_irq_disable in places that seem counter to the goal of getting rid 
of them in the first place.  Perhaps I'm misunderstanding the code, though.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 15:58                             ` Eric W. Biederman
@ 2007-03-20 22:41                                 ` Rusty Russell
  2007-03-20 16:26                               ` Jeremy Fitzhardinge
  2007-03-20 22:41                                 ` Rusty Russell
  2 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-20 22:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden, Andi Kleen,
	David Miller, mingo, akpm, linux-kernel, virtualization,
	xen-devel, chrisw, anthony, netdev

On Tue, 2007-03-20 at 09:58 -0600, Eric W. Biederman wrote:
> Looking at the above code snippet.  I guess it is about time to
> merge our per_cpu and pda variables...

Indeed, thanks for the prod.  Now 2.6.21-rc4-mm1 is out, I'll resend the
patches.

Cheers,
Rusty.



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 22:41                                 ` Rusty Russell
  0 siblings, 0 replies; 330+ messages in thread
From: Rusty Russell @ 2007-03-20 22:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Zachary Amsden, Jeremy Fitzhardinge, xen-devel, akpm,
	virtualization, netdev, linux-kernel, chrisw, Andi Kleen,
	anthony, mingo, Linus Torvalds, David Miller

On Tue, 2007-03-20 at 09:58 -0600, Eric W. Biederman wrote:
> Looking at the above code snippet.  I guess it is about time to
> merge our per_cpu and pda variables...

Indeed, thanks for the prod.  Now 2.6.21-rc4-mm1 is out, I'll resend the
patches.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 16:31                                   ` Jeremy Fitzhardinge
@ 2007-03-20 22:43                                     ` Matt Mackall
  -1 siblings, 0 replies; 330+ messages in thread
From: Matt Mackall @ 2007-03-20 22:43 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linus Torvalds, Eric W. Biederman, Zachary Amsden, Rusty Russell,
	Andi Kleen, David Miller, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

On Tue, Mar 20, 2007 at 09:31:58AM -0700, Jeremy Fitzhardinge wrote:
> Linus Torvalds wrote:
> > On Tue, 20 Mar 2007, Eric W. Biederman wrote:
> >   
> >> If that is the case.  In the normal kernel what would
> >> the "the oops, we got an interrupt code do?"
> >> I assume it would leave interrupts disabled when it returns?
> >> Like we currently do with the delayed disable of normal interrupts?
> >>     
> >
> > Yeah, disable interrupts, and set a flag that the fake "sti" can test, and 
> > just return without doing anything.
> >
> > (You may or may not also need to do extra work to Ack the hardware 
> > interrupt etc, which may be irq-controller specific. Once the CPU has 
> > accepted the interrupt, you may not be able to just leave it dangling)
> >   
> 
> So it would be something like:
> 
>     pda.intr_mask = 1;		/* disable interrupts */
>     ...
>     pda.intr_mask = 0;		/* enable interrupts */
>     if (xchg(&pda.intr_pending, 0))	/* check pending */
>     	asm("sti");		/* was pending; isr left cpu interrupts masked */

I don't know that you need an xchg there. If you're still on the same
CPU, it should all be nice and causal even across an interrupt handler.
So it could be:

   pda.intr_mask = 0; /* intr_pending can't get set after this */
   if (unlikely(pda.intr_pending)) {
      pda.intr_pending = 0;
      asm("sti");
   }

(This would actually need a C barrier, but I'll ignore that as this'd
end up being asm...)

But other interesting things could happen. If we never did a real CLI
and we get preempted and switched to another CPU between clearing
intr_mask and checking intr_pending, we get a little confused. 

But perhaps that doesn't matter because we'd by definition have no
pending interrupts on either processor?

Is it expensive to do an STI if interrupts are already enabled?

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 22:43                                     ` Matt Mackall
  0 siblings, 0 replies; 330+ messages in thread
From: Matt Mackall @ 2007-03-20 22:43 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, akpm, virtualization, netdev, linux-kernel, chrisw,
	Andi Kleen, Eric W. Biederman, anthony, mingo, Linus Torvalds,
	David Miller

On Tue, Mar 20, 2007 at 09:31:58AM -0700, Jeremy Fitzhardinge wrote:
> Linus Torvalds wrote:
> > On Tue, 20 Mar 2007, Eric W. Biederman wrote:
> >   
> >> If that is the case.  In the normal kernel what would
> >> the "the oops, we got an interrupt code do?"
> >> I assume it would leave interrupts disabled when it returns?
> >> Like we currently do with the delayed disable of normal interrupts?
> >>     
> >
> > Yeah, disable interrupts, and set a flag that the fake "sti" can test, and 
> > just return without doing anything.
> >
> > (You may or may not also need to do extra work to Ack the hardware 
> > interrupt etc, which may be irq-controller specific. Once the CPU has 
> > accepted the interrupt, you may not be able to just leave it dangling)
> >   
> 
> So it would be something like:
> 
>     pda.intr_mask = 1;		/* disable interrupts */
>     ...
>     pda.intr_mask = 0;		/* enable interrupts */
>     if (xchg(&pda.intr_pending, 0))	/* check pending */
>     	asm("sti");		/* was pending; isr left cpu interrupts masked */

I don't know that you need an xchg there. If you're still on the same
CPU, it should all be nice and causal even across an interrupt handler.
So it could be:

   pda.intr_mask = 0; /* intr_pending can't get set after this */
   if (unlikely(pda.intr_pending)) {
      pda.intr_pending = 0;
      asm("sti");
   }

(This would actually need a C barrier, but I'll ignore that as this'd
end up being asm...)

But other interesting things could happen. If we never did a real CLI
and we get preempted and switched to another CPU between clearing
intr_mask and checking intr_pending, we get a little confused. 

But perhaps that doesn't matter because we'd by definition have no
pending interrupts on either processor?

Is it expensive to do an STI if interrupts are already enabled?

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 22:43                                     ` Matt Mackall
  (?)
@ 2007-03-20 23:08                                     ` Zachary Amsden
  2007-03-20 23:33                                         ` Jeremy Fitzhardinge
  2007-03-20 23:41                                         ` Matt Mackall
  -1 siblings, 2 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-20 23:08 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Jeremy Fitzhardinge, Linus Torvalds, Eric W. Biederman,
	Rusty Russell, Andi Kleen, David Miller, mingo, akpm,
	linux-kernel, virtualization, xen-devel, chrisw, anthony, netdev

Matt Mackall wrote:
> I don't know that you need an xchg there. If you're still on the same
> CPU, it should all be nice and causal even across an interrupt handler.
> So it could be:
>
>    pda.intr_mask = 0; /* intr_pending can't get set after this */
>   

Why not?  Oh, I see.  intr_mask is inverted form of EFLAGS_IF.

>    if (unlikely(pda.intr_pending)) {
>       pda.intr_pending = 0;
>       asm("sti");
>    }
>
> (This would actually need a C barrier, but I'll ignore that as this'd
> end up being asm...)
>
> But other interesting things could happen. If we never did a real CLI
> and we get preempted and switched to another CPU between clearing
> intr_mask and checking intr_pending, we get a little confused. 
>   

I think Jeremy's idea was to have interrupt handlers leave interrupts 
disabled on exit if pda.intr_mask was set.  In which case, they would 
bypass all work and we could never get preempted.  I don't think leaving 
hardware interrupts disabled for such a long time is good though.

> But perhaps that doesn't matter because we'd by definition have no
> pending interrupts on either processor?
>
> Is it expensive to do an STI if interrupts are already enabled?
>   

Yes.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 23:08                                     ` Zachary Amsden
@ 2007-03-20 23:33                                         ` Jeremy Fitzhardinge
  2007-03-20 23:41                                         ` Matt Mackall
  1 sibling, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-20 23:33 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Matt Mackall, Linus Torvalds, Eric W. Biederman, Rusty Russell,
	Andi Kleen, David Miller, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

Zachary Amsden wrote:
> I think Jeremy's idea was to have interrupt handlers leave interrupts
> disabled on exit if pda.intr_mask was set.  In which case, they would
> bypass all work and we could never get preempted.

Yes, I was worried that if we left the isr without actually handling the
interrupt, it would still be asserted and we'd just get interrupted
again.  The idea is that we avoid touching cli/sti for the common case
of no interrupts while interrupts are disabled, but we'd still need to
fall back to using them if an interrupt becomes pending.

> I don't think leaving hardware interrupts disabled for such a long
> time is good though. 

How long?  It would be no longer than now, and possibly less, wouldn't it?

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 23:33                                         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-20 23:33 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: xen-devel, akpm, virtualization, netdev, Rusty Russell,
	linux-kernel, chrisw, Andi Kleen, Eric W. Biederman, anthony,
	Matt Mackall, mingo, Linus Torvalds, David Miller

Zachary Amsden wrote:
> I think Jeremy's idea was to have interrupt handlers leave interrupts
> disabled on exit if pda.intr_mask was set.  In which case, they would
> bypass all work and we could never get preempted.

Yes, I was worried that if we left the isr without actually handling the
interrupt, it would still be asserted and we'd just get interrupted
again.  The idea is that we avoid touching cli/sti for the common case
of no interrupts while interrupts are disabled, but we'd still need to
fall back to using them if an interrupt becomes pending.

> I don't think leaving hardware interrupts disabled for such a long
> time is good though. 

How long?  It would be no longer than now, and possibly less, wouldn't it?

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 23:08                                     ` Zachary Amsden
@ 2007-03-20 23:41                                         ` Matt Mackall
  2007-03-20 23:41                                         ` Matt Mackall
  1 sibling, 0 replies; 330+ messages in thread
From: Matt Mackall @ 2007-03-20 23:41 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeremy Fitzhardinge, Linus Torvalds, Eric W. Biederman,
	Rusty Russell, Andi Kleen, David Miller, mingo, akpm,
	linux-kernel, virtualization, xen-devel, chrisw, anthony, netdev

On Tue, Mar 20, 2007 at 03:08:19PM -0800, Zachary Amsden wrote:
> Matt Mackall wrote:
> >I don't know that you need an xchg there. If you're still on the same
> >CPU, it should all be nice and causal even across an interrupt handler.
> >So it could be:
> >
> >   pda.intr_mask = 0; /* intr_pending can't get set after this */
> >  
> 
> Why not?  Oh, I see.  intr_mask is inverted form of EFLAGS_IF.

It's not even that. There are two things that can happen:

case 1:

  intr_mask = 1;
            <interrupt occurs and is deferred>
  intr_mask = 0;
  /* intr_pending is already set and CLI is in effect */
  if(intr_pending)

case 2:

  intr_mask = 1;
  intr_mask = 0;
            <interrupt occurs and is processed>
  /* intr_pending remains cleared */
  if(intr_pending)

As this is all about local interrupts, it's all on a single CPU and
out of order issues aren't visible..
 
> >(This would actually need a C barrier, but I'll ignore that as this'd
> >end up being asm...)

..unless the compiler is doing the reordering, of course.

> >But other interesting things could happen. If we never did a real CLI
> >and we get preempted and switched to another CPU between clearing
> >intr_mask and checking intr_pending, we get a little confused. 
> 
> I think Jeremy's idea was to have interrupt handlers leave interrupts 
> disabled on exit if pda.intr_mask was set.  In which case, they would 
> bypass all work and we could never get preempted.

I was actually worrying about the case where the interrupt came in
"late". But I don't think it's a problem there either.

> I don't think leaving 
> hardware interrupts disabled for such a long time is good though.

It can only be worse than the current situation by the amount of time
it takes to defer an interrupt once. On average, it'll be a lot
better as most critical sections are -tiny-.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 23:41                                         ` Matt Mackall
  0 siblings, 0 replies; 330+ messages in thread
From: Matt Mackall @ 2007-03-20 23:41 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: xen-devel, akpm, virtualization, netdev, linux-kernel, chrisw,
	Andi Kleen, Eric W. Biederman, anthony, mingo, Linus Torvalds,
	David Miller

On Tue, Mar 20, 2007 at 03:08:19PM -0800, Zachary Amsden wrote:
> Matt Mackall wrote:
> >I don't know that you need an xchg there. If you're still on the same
> >CPU, it should all be nice and causal even across an interrupt handler.
> >So it could be:
> >
> >   pda.intr_mask = 0; /* intr_pending can't get set after this */
> >  
> 
> Why not?  Oh, I see.  intr_mask is inverted form of EFLAGS_IF.

It's not even that. There are two things that can happen:

case 1:

  intr_mask = 1;
            <interrupt occurs and is deferred>
  intr_mask = 0;
  /* intr_pending is already set and CLI is in effect */
  if(intr_pending)

case 2:

  intr_mask = 1;
  intr_mask = 0;
            <interrupt occurs and is processed>
  /* intr_pending remains cleared */
  if(intr_pending)

As this is all about local interrupts, it's all on a single CPU and
out of order issues aren't visible..
 
> >(This would actually need a C barrier, but I'll ignore that as this'd
> >end up being asm...)

..unless the compiler is doing the reordering, of course.

> >But other interesting things could happen. If we never did a real CLI
> >and we get preempted and switched to another CPU between clearing
> >intr_mask and checking intr_pending, we get a little confused. 
> 
> I think Jeremy's idea was to have interrupt handlers leave interrupts 
> disabled on exit if pda.intr_mask was set.  In which case, they would 
> bypass all work and we could never get preempted.

I was actually worrying about the case where the interrupt came in
"late". But I don't think it's a problem there either.

> I don't think leaving 
> hardware interrupts disabled for such a long time is good though.

It can only be worse than the current situation by the amount of time
it takes to defer an interrupt once. On average, it'll be a lot
better as most critical sections are -tiny-.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 20:23                                                     ` Andi Kleen
@ 2007-03-20 23:43                                                         ` Linus Torvalds
  2007-03-20 23:43                                                         ` Linus Torvalds
  2007-03-21  6:08                                                         ` Andrew Morton
  2 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 23:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, David Miller, virtualization, jbeulich,
	jeremy, xen-devel, netdev, linux-kernel, chrisw, virtualization,
	anthony, akpm, mingo

On Tue, 20 Mar 2007, Andi Kleen wrote:

> On Tue, Mar 20, 2007 at 11:49:39AM -0700, Linus Torvalds wrote:
> > 
> > the thing is, I'd rather see a long backtrace that is hard to decipher but 
> > that *never* *ever* causes any additional problems, over a pretty one.
> 
> Well it causes additional problems. We had some cases where it was really
> hard to distingush garbage and the true call chain. I can probably dig
> out some examples if you want.

Well, by "additional problems" _I_ mean things like "a warning turned into 
a fatal oops and didn't get logged at all".

That's a lot more serious than "there were a few extra entries in the 
traceback that caused us some confusion".

And yes, we had exactly that case happen several times.

> With lots of call backs (e.g. common with sysfs) it is also frequently
> not obvious how the call chains are supposed to go.

With callbacks, it's actually often nice to see the callback data that is 
on the stack (and it's very obvious from the "<function+0>" ksymtab 
explanation: you can't have a <function+0> that is anything but a callback 
pointer (since it isn't a return address).

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 23:43                                                         ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 23:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: xen-devel, netdev, mingo, linux-kernel, jbeulich, virtualization,
	chrisw, virtualization, Eric W. Biederman, anthony, akpm,
	David Miller

On Tue, 20 Mar 2007, Andi Kleen wrote:

> On Tue, Mar 20, 2007 at 11:49:39AM -0700, Linus Torvalds wrote:
> > 
> > the thing is, I'd rather see a long backtrace that is hard to decipher but 
> > that *never* *ever* causes any additional problems, over a pretty one.
> 
> Well it causes additional problems. We had some cases where it was really
> hard to distingush garbage and the true call chain. I can probably dig
> out some examples if you want.

Well, by "additional problems" _I_ mean things like "a warning turned into 
a fatal oops and didn't get logged at all".

That's a lot more serious than "there were a few extra entries in the 
traceback that caused us some confusion".

And yes, we had exactly that case happen several times.

> With lots of call backs (e.g. common with sysfs) it is also frequently
> not obvious how the call chains are supposed to go.

With callbacks, it's actually often nice to see the callback data that is 
on the stack (and it's very obvious from the "<function+0>" ksymtab 
explanation: you can't have a <function+0> that is anything but a callback 
pointer (since it isn't a return address).

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 21:49                                                         ` [Xen-devel] " Andi Kleen
@ 2007-03-20 23:51                                                             ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 23:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, jeremy, xen-devel, netdev, mingo, jbeulich,
	virtualization, chrisw, virtualization, Eric W. Biederman,
	anthony, akpm, David Miller, linux-kernel



On Tue, 20 Mar 2007, Andi Kleen wrote:
> 
> Linus is worried about the unwinder crashing -- that wouldn't help with that.

And to make it clear: this is not a theoretical worry. It happened many 
times over the months the unwinder was in. 

It was supposed to help debugging, but it made bugs that *would* have been 
nicely debuggable without it into nightmares. So the only reason for it 
existing in the first place was actually the thing that made it not work.

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [Xen-devel] Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-20 23:51                                                             ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-20 23:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: xen-devel, David Miller, netdev, linux-kernel, jbeulich,
	virtualization, chrisw, virtualization, Eric W. Biederman,
	anthony, mingo, akpm, Alan Cox

On Tue, 20 Mar 2007, Andi Kleen wrote:
> 
> Linus is worried about the unwinder crashing -- that wouldn't help with that.

And to make it clear: this is not a theoretical worry. It happened many 
times over the months the unwinder was in. 

It was supposed to help debugging, but it made bugs that *would* have been 
nicely debuggable without it into nightmares. So the only reason for it 
existing in the first place was actually the thing that made it not work.

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 15:09                             ` Linus Torvalds
                                               ` (2 preceding siblings ...)
  (?)
@ 2007-03-21  0:03                             ` Paul Mackerras
  2007-04-12 23:16                                 ` David Miller
  -1 siblings, 1 reply; 330+ messages in thread
From: Paul Mackerras @ 2007-03-21  0:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Rusty Russell,
	Eric W. Biederman, Andi Kleen, David Miller, mingo, akpm,
	linux-kernel, virtualization, xen-devel, chrisw, anthony, netdev

Linus Torvalds writes:

> We should just do this natively. There's been several tests over the years 
> saying that it's much more efficient to do sti/cli as a simple store, and 
> handling the "oops, we got an interrupt while interrupts were disabled" as 
> a special case.
> 
> I have this dim memory that ARM has done it that way for a long time 
> because it's so expensive to do a "real" cli/sti.
> 
> And I think -rt does it for other reasons. It's just more flexible.

64-bit powerpc does this now as well.

Paul.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 22:43                                     ` Matt Mackall
  (?)
  (?)
@ 2007-03-21  0:20                                     ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 330+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-21  0:20 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Linus Torvalds, Eric W. Biederman, Zachary Amsden, Rusty Russell,
	Andi Kleen, David Miller, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

Matt Mackall wrote:
> On Tue, Mar 20, 2007 at 09:31:58AM -0700, Jeremy Fitzhardinge wrote:
>   
>> Linus Torvalds wrote:
>>     
>>> On Tue, 20 Mar 2007, Eric W. Biederman wrote:
>>>   
>>>       
>>>> If that is the case.  In the normal kernel what would
>>>> the "the oops, we got an interrupt code do?"
>>>> I assume it would leave interrupts disabled when it returns?
>>>> Like we currently do with the delayed disable of normal interrupts?
>>>>     
>>>>         
>>> Yeah, disable interrupts, and set a flag that the fake "sti" can test, and 
>>> just return without doing anything.
>>>
>>> (You may or may not also need to do extra work to Ack the hardware 
>>> interrupt etc, which may be irq-controller specific. Once the CPU has 
>>> accepted the interrupt, you may not be able to just leave it dangling)
>>>   
>>>       
>> So it would be something like:
>>
>>     pda.intr_mask = 1;		/* disable interrupts */
>>     ...
>>     pda.intr_mask = 0;		/* enable interrupts */
>>     if (xchg(&pda.intr_pending, 0))	/* check pending */
>>     	asm("sti");		/* was pending; isr left cpu interrupts masked */
>>     
>
> I don't know that you need an xchg there. If you're still on the same
> CPU, it should all be nice and causal even across an interrupt handler.
> So it could be:
>
>    pda.intr_mask = 0; /* intr_pending can't get set after this */
>    if (unlikely(pda.intr_pending)) {
>       pda.intr_pending = 0;
>       asm("sti");
>    }
>
> (This would actually need a C barrier, but I'll ignore that as this'd
> end up being asm...)
>
> But other interesting things could happen. If we never did a real CLI
> and we get preempted and switched to another CPU between clearing
> intr_mask and checking intr_pending, we get a little confused. 
>   

Could prevent preempt if pda.intr_mask is set.  preemptible() is defined as:

    # define preemptible()    (preempt_count() == 0 && !irqs_disabled())

anyway, so that would be changed to look at the intr_mask rather than
eflags.
(I'm not sure if preemptible() is actually used to determine whether
preempt or not).

Alternatively, the intr_mask could be encoded in a bit of preempt_count...

    J

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 22:09                                     ` Zachary Amsden
@ 2007-03-21  0:24                                       ` Linus Torvalds
  -1 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-21  0:24 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeremy Fitzhardinge, Eric W. Biederman, Rusty Russell,
	Andi Kleen, David Miller, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

On Tue, 20 Mar 2007, Zachary Amsden wrote:
> 
> void local_irq_restore(int enabled)
> {
>    pda.intr_mask = enabled;
>    /*
>     * note there is a window here where softirqs are not processed by
>     * the interrupt handler, but that is not a problem, since it will
>     * get done here in the outer enable of any nested pair.
>     */
>    if (enabled)
>        local_bh_enable();
> }

Actually, this one is more complicated. You also need to actually enable 
hardware interrupts again if they got disabled by an interrupt actually 
occurring while the "soft-interrupt" was disabled.

But since it's all a local-cpu issue, you can do things like test 
cpu-local memory flags for whetehr that has happened or not.

So it *should* be something as simple as

	local_irq_disable()
	{
		pda.irq_enable = 0;
	}

	handle_interrupt()
	{
		if (!pda.irq_enable) {
			pda.irq_queued = 1;
			queue_interrupt();
			.. make sure we return with hardirq's now 
			   disabled: just clear IF in the pt_regs ..
			return;
		}
		.. normal ..
	}

	local_irq_enable()
	{
		pda.irq_enable = 1;
		barrier();
		/* Common case - nothing happened while we were fake-disabled.. */
		if (!pda.irq_queued)
			return; 

		/* Ok, actually handle the things! */
		handle_queued_irqs();

		/*
		 * And enable the hw interrupts again, they got disabled 
		 * when we were queueing stuff.. 
		 */
		hardware_sti();
	}

but I haven't really gone over it in any detail, I may have missed 
something really obvious.

Anyway, it really *should* be pretty damn simple. No need to disable 
preemption, there should be no events that can *cause* it, since all 
interrupts get headed off at the pass.. (the return-from-interrupt thng 
should already notice that it's returning to an interrupts-disabled 
section and not try to do any preemption).

What did I miss?

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-21  0:24                                       ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-21  0:24 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: xen-devel, akpm, virtualization, netdev, linux-kernel, chrisw,
	Andi Kleen, Eric W. Biederman, anthony, mingo, David Miller

On Tue, 20 Mar 2007, Zachary Amsden wrote:
> 
> void local_irq_restore(int enabled)
> {
>    pda.intr_mask = enabled;
>    /*
>     * note there is a window here where softirqs are not processed by
>     * the interrupt handler, but that is not a problem, since it will
>     * get done here in the outer enable of any nested pair.
>     */
>    if (enabled)
>        local_bh_enable();
> }

Actually, this one is more complicated. You also need to actually enable 
hardware interrupts again if they got disabled by an interrupt actually 
occurring while the "soft-interrupt" was disabled.

But since it's all a local-cpu issue, you can do things like test 
cpu-local memory flags for whetehr that has happened or not.

So it *should* be something as simple as

	local_irq_disable()
	{
		pda.irq_enable = 0;
	}

	handle_interrupt()
	{
		if (!pda.irq_enable) {
			pda.irq_queued = 1;
			queue_interrupt();
			.. make sure we return with hardirq's now 
			   disabled: just clear IF in the pt_regs ..
			return;
		}
		.. normal ..
	}

	local_irq_enable()
	{
		pda.irq_enable = 1;
		barrier();
		/* Common case - nothing happened while we were fake-disabled.. */
		if (!pda.irq_queued)
			return; 

		/* Ok, actually handle the things! */
		handle_queued_irqs();

		/*
		 * And enable the hw interrupts again, they got disabled 
		 * when we were queueing stuff.. 
		 */
		hardware_sti();
	}

but I haven't really gone over it in any detail, I may have missed 
something really obvious.

Anyway, it really *should* be pretty damn simple. No need to disable 
preemption, there should be no events that can *cause* it, since all 
interrupts get headed off at the pass.. (the return-from-interrupt thng 
should already notice that it's returning to an interrupts-disabled 
section and not try to do any preemption).

What did I miss?

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 23:33                                         ` Jeremy Fitzhardinge
  (?)
@ 2007-03-21  1:14                                         ` Zachary Amsden
  -1 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-21  1:14 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Matt Mackall, Linus Torvalds, Eric W. Biederman, Rusty Russell,
	Andi Kleen, David Miller, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
>   
>> I think Jeremy's idea was to have interrupt handlers leave interrupts
>> disabled on exit if pda.intr_mask was set.  In which case, they would
>> bypass all work and we could never get preempted.
>>     
>
> Yes, I was worried that if we left the isr without actually handling the
> interrupt, it would still be asserted and we'd just get interrupted
> again.  The idea is that we avoid touching cli/sti for the common case
> of no interrupts while interrupts are disabled, but we'd still need to
> fall back to using them if an interrupt becomes pending.
>
>   
>> I don't think leaving hardware interrupts disabled for such a long
>> time is good though. 
>>     
>
> How long?  It would be no longer than now, and possibly less, wouldn't it?
>   

Hmm.  Perhaps.  Something about the asymmetry bothers me alot though.

Zach


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-21  2:53                                       ` Zachary Amsden
@ 2007-03-21  2:15                                           ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-21  2:15 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeremy Fitzhardinge, Eric W. Biederman, Rusty Russell,
	Andi Kleen, David Miller, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

On Tue, 20 Mar 2007, Zachary Amsden wrote:
> 
> Actually, I was thinking the irq handlers would just not mess around with
> eflags on the stack, just call the chip to ack the interrupt and re-enable
> hardware interrupts when they left, since that is free anyway with the iret.

No can do. Think level-triggered. You *need* to disable the interrupt, and 
disabling it at the CPU is the easiest approach. Even so, you need to 
worry about SMP and screaming interrupts at all CPU's, but if you don't 
ack it to the IO-APIC until later, that should be ok (alternatively, you 
need to just mask-and-ack the irq controller).

> Maybe leaving irqs disabled is better.

One of the advantages of doing that is that you only ever have a queue of 
one single entry, which then makes it easier to do the replay.

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-21  2:15                                           ` Linus Torvalds
  0 siblings, 0 replies; 330+ messages in thread
From: Linus Torvalds @ 2007-03-21  2:15 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: xen-devel, akpm, virtualization, netdev, linux-kernel, chrisw,
	Andi Kleen, Eric W. Biederman, anthony, mingo, David Miller

On Tue, 20 Mar 2007, Zachary Amsden wrote:
> 
> Actually, I was thinking the irq handlers would just not mess around with
> eflags on the stack, just call the chip to ack the interrupt and re-enable
> hardware interrupts when they left, since that is free anyway with the iret.

No can do. Think level-triggered. You *need* to disable the interrupt, and 
disabling it at the CPU is the easiest approach. Even so, you need to 
worry about SMP and screaming interrupts at all CPU's, but if you don't 
ack it to the IO-APIC until later, that should be ok (alternatively, you 
need to just mask-and-ack the irq controller).

> Maybe leaving irqs disabled is better.

One of the advantages of doing that is that you only ever have a queue of 
one single entry, which then makes it easier to do the replay.

		Linus

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-21  0:24                                       ` Linus Torvalds
  (?)
@ 2007-03-21  2:53                                       ` Zachary Amsden
  2007-03-21  2:15                                           ` Linus Torvalds
  -1 siblings, 1 reply; 330+ messages in thread
From: Zachary Amsden @ 2007-03-21  2:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Eric W. Biederman, Rusty Russell,
	Andi Kleen, David Miller, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

Linus Torvalds wrote:
> On Tue, 20 Mar 2007, Zachary Amsden wrote:
>   
>> void local_irq_restore(int enabled)
>> {
>>    pda.intr_mask = enabled;
>>    /*
>>     * note there is a window here where softirqs are not processed by
>>     * the interrupt handler, but that is not a problem, since it will
>>     * get done here in the outer enable of any nested pair.
>>     */
>>    if (enabled)
>>        local_bh_enable();
>> }
>>     
>
> Actually, this one is more complicated. You also need to actually enable 
> hardware interrupts again if they got disabled by an interrupt actually 
> occurring while the "soft-interrupt" was disabled.
>   

Actually, I was thinking the irq handlers would just not mess around 
with eflags on the stack, just call the chip to ack the interrupt and 
re-enable hardware interrupts when they left, since that is free anyway 
with the iret.  Maybe leaving irqs disabled is better.

> Anyway, it really *should* be pretty damn simple. No need to disable 
> preemption, there should be no events that can *cause* it, since all 
> interrupts get headed off at the pass.. (the return-from-interrupt thng 
> should already notice that it's returning to an interrupts-disabled 
> section and not try to do any preemption).
>   
> What did I miss?
>   

I wasn't disabling preemption to actually disable preemption.  I was 
just using bh_disable as a global hammer to stop softirqs (thus the irq 
replay tasklet) from running during the normal irq_exit path.  Then, we 
can just use the existing software IRQ replay code, and I think barely 
any new code (queue_irq(), etc) has to be written.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-21  2:15                                           ` Linus Torvalds
  (?)
@ 2007-03-21  3:43                                           ` Zachary Amsden
  -1 siblings, 0 replies; 330+ messages in thread
From: Zachary Amsden @ 2007-03-21  3:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Eric W. Biederman, Rusty Russell,
	Andi Kleen, David Miller, mingo, akpm, linux-kernel,
	virtualization, xen-devel, chrisw, anthony, netdev

Linus Torvalds wrote:
> On Tue, 20 Mar 2007, Zachary Amsden wrote:
>   
>> Actually, I was thinking the irq handlers would just not mess around with
>> eflags on the stack, just call the chip to ack the interrupt and re-enable
>> hardware interrupts when they left, since that is free anyway with the iret.
>>     
>
> No can do. Think level-triggered. You *need* to disable the interrupt, and 
> disabling it at the CPU is the easiest approach. Even so, you need to 
> worry about SMP and screaming interrupts at all CPU's, but if you don't 
> ack it to the IO-APIC until later, that should be ok (alternatively, you 
> need to just mask-and-ack the irq controller).
>   

Well, you can keep it masked, but a more important point is that I've 
entirely neglected local interrupts.  This might work for IRQs, but for 
local timer or thermal or IPIs, using the tasklet based replay simply 
will not work.

> One of the advantages of doing that is that you only ever have a queue of 
> one single entry, which then makes it easier to do the replay.
>   

Yes.  Unfortunately now both do_IRQ and all the smp_foo interrupt 
handlers need to detect and queue for replay, but fortunately they all 
have the interrupt number conveniently on the stack.

Zach

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-20 20:23                                                     ` Andi Kleen
@ 2007-03-21  6:08                                                         ` Andrew Morton
  2007-03-20 23:43                                                         ` Linus Torvalds
  2007-03-21  6:08                                                         ` Andrew Morton
  2 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2007-03-21  6:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Eric W. Biederman, David Miller, virtualization,
	jbeulich, jeremy, xen-devel, netdev, linux-kernel, chrisw,
	virtualization, anthony, mingo

On Tue, 20 Mar 2007 21:23:52 +0100 Andi Kleen <ak@suse.de> wrote:

> Well it causes additional problems. We had some cases where it was really
> hard to distingush garbage and the true call chain.

yes, for some reason the naive backtraces seem to have got messier and messier over
the years and some of them are really quite hard to piece together nowadays.

An accurate backtrace would have some value, if we can get it bullet-proof.

The fault-injection driver wants it too.  And lockdep, I guess.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-03-21  6:08                                                         ` Andrew Morton
  0 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2007-03-21  6:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: xen-devel, netdev, linux-kernel, jbeulich, virtualization,
	chrisw, virtualization, Eric W. Biederman, anthony, mingo,
	Linus Torvalds, David Miller

On Tue, 20 Mar 2007 21:23:52 +0100 Andi Kleen <ak@suse.de> wrote:

> Well it causes additional problems. We had some cases where it was really
> hard to distingush garbage and the true call chain.

yes, for some reason the naive backtraces seem to have got messier and messier over
the years and some of them are really quite hard to piece together nowadays.

An accurate backtrace would have some value, if we can get it bullet-proof.

The fault-injection driver wants it too.  And lockdep, I guess.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
  2007-03-21  0:03                             ` Paul Mackerras
@ 2007-04-12 23:16                                 ` David Miller
  0 siblings, 0 replies; 330+ messages in thread
From: David Miller @ 2007-04-12 23:16 UTC (permalink / raw)
  To: paulus
  Cc: torvalds, jeremy, zach, rusty, ebiederm, ak, mingo, akpm,
	linux-kernel, virtualization, xen-devel, chrisw, anthony, netdev

From: Paul Mackerras <paulus@samba.org>
Date: Wed, 21 Mar 2007 11:03:14 +1100

> Linus Torvalds writes:
> 
> > We should just do this natively. There's been several tests over the years 
> > saying that it's much more efficient to do sti/cli as a simple store, and 
> > handling the "oops, we got an interrupt while interrupts were disabled" as 
> > a special case.
> > 
> > I have this dim memory that ARM has done it that way for a long time 
> > because it's so expensive to do a "real" cli/sti.
> > 
> > And I think -rt does it for other reasons. It's just more flexible.
> 
> 64-bit powerpc does this now as well.

I was curious about this so I had a look.

There appears to be three pieces of state used to manage this
on powerpc, PACASOFTIRQEN(r13), PACAHARDIRQEN(r13) and the
SOFTE() in the stackframe.

Plus there is all of this complicated logic on trap entry and
exit to manage these three values properly.

local_irq_restore() doesn't look like a simple piece of code
either.  Logically it should be simple, update the software
binary state, and if enabling see if any interrupts came in
while we were disable so we can run them.

Given all of that, is it really cheaper than just flipping the
bit in the cpu control register? :-/

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
@ 2007-04-12 23:16                                 ` David Miller
  0 siblings, 0 replies; 330+ messages in thread
From: David Miller @ 2007-04-12 23:16 UTC (permalink / raw)
  To: paulus
  Cc: xen-devel, virtualization, netdev, linux-kernel, chrisw, ak,
	ebiederm, anthony, mingo, torvalds, akpm

From: Paul Mackerras <paulus@samba.org>
Date: Wed, 21 Mar 2007 11:03:14 +1100

> Linus Torvalds writes:
> 
> > We should just do this natively. There's been several tests over the years 
> > saying that it's much more efficient to do sti/cli as a simple store, and 
> > handling the "oops, we got an interrupt while interrupts were disabled" as 
> > a special case.
> > 
> > I have this dim memory that ARM has done it that way for a long time 
> > because it's so expensive to do a "real" cli/sti.
> > 
> > And I think -rt does it for other reasons. It's just more flexible.
> 
> 64-bit powerpc does this now as well.

I was curious about this so I had a look.

There appears to be three pieces of state used to manage this
on powerpc, PACASOFTIRQEN(r13), PACAHARDIRQEN(r13) and the
SOFTE() in the stackframe.

Plus there is all of this complicated logic on trap entry and
exit to manage these three values properly.

local_irq_restore() doesn't look like a simple piece of code
either.  Logically it should be simple, update the software
binary state, and if enabling see if any interrupts came in
while we were disable so we can run them.

Given all of that, is it really cheaper than just flipping the
bit in the cpu control register? :-/

^ permalink raw reply	[flat|nested] 330+ messages in thread

end of thread, other threads:[~2007-04-12 23:16 UTC | newest]

Thread overview: 330+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-01 23:24 [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Jeremy Fitzhardinge
2007-03-01 23:24 ` [patch 01/26] Xen-paravirt_ops: Fix typo in sync_constant_test_bit()s name Jeremy Fitzhardinge
2007-03-01 23:24   ` Jeremy Fitzhardinge
2007-03-16  9:45   ` Ingo Molnar
2007-03-16  9:45     ` Ingo Molnar
2007-03-01 23:24 ` [patch 02/26] Xen-paravirt_ops: ignore vgacon if hardware not present Jeremy Fitzhardinge
2007-03-01 23:24   ` Jeremy Fitzhardinge
2007-03-16  9:45   ` Ingo Molnar
2007-03-16  9:45     ` Ingo Molnar
2007-03-01 23:24 ` [patch 03/26] Xen-paravirt_ops: use paravirt_nop to consistently mark no-op operations Jeremy Fitzhardinge
2007-03-16  9:44   ` Ingo Molnar
2007-03-16  9:44     ` Ingo Molnar
2007-03-16 18:43     ` Jeremy Fitzhardinge
2007-03-16 18:43       ` Jeremy Fitzhardinge
2007-03-16 19:49       ` Chris Wright
2007-03-16 19:49         ` Chris Wright
2007-03-16 20:00         ` Jeremy Fitzhardinge
2007-03-16 21:59           ` Chris Wright
2007-03-16 21:59             ` Chris Wright
2007-03-16 22:10             ` Jeremy Fitzhardinge
2007-03-16 22:10               ` Jeremy Fitzhardinge
2007-03-16 22:18               ` Chris Wright
2007-03-16 22:18                 ` Chris Wright
2007-03-01 23:24 ` [patch 04/26] Xen-paravirt_ops: Add pagetable accessors to pack and unpack pagetable entries Jeremy Fitzhardinge
2007-03-16  9:38   ` Ingo Molnar
2007-03-16  9:38     ` Ingo Molnar
2007-03-16 18:42     ` Jeremy Fitzhardinge
2007-03-16 18:42       ` Jeremy Fitzhardinge
2007-03-01 23:24 ` [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable Jeremy Fitzhardinge
2007-03-01 23:24   ` Jeremy Fitzhardinge
2007-03-16  9:33   ` Ingo Molnar
2007-03-16  9:33     ` Ingo Molnar
2007-03-16 18:39     ` Jeremy Fitzhardinge
2007-03-16 19:35       ` Steven Rostedt
2007-03-16 19:35         ` Steven Rostedt
2007-03-17  9:47       ` Rusty Russell
2007-03-17  9:47         ` Rusty Russell
2007-03-01 23:24 ` [patch 06/26] Xen-paravirt_ops: paravirt_ops: allocate a fixmap slot Jeremy Fitzhardinge
2007-03-01 23:24   ` Jeremy Fitzhardinge
2007-03-16  9:31   ` Ingo Molnar
2007-03-16  9:31     ` Ingo Molnar
2007-03-01 23:24 ` [patch 07/26] Xen-paravirt_ops: Allow paravirt backend to choose kernel PMD sharing Jeremy Fitzhardinge
2007-03-16  9:31   ` Ingo Molnar
2007-03-16  9:31     ` Ingo Molnar
2007-03-01 23:24 ` [patch 08/26] Xen-paravirt_ops: add hooks to intercept mm creation and destruction Jeremy Fitzhardinge
2007-03-16  9:30   ` Ingo Molnar
2007-03-16  9:30     ` Ingo Molnar
2007-03-16 17:38     ` Jeremy Fitzhardinge
2007-03-16 17:38       ` Jeremy Fitzhardinge
2007-03-01 23:24 ` [patch 09/26] Xen-paravirt_ops: remove HAVE_ARCH_MM_LIFETIME, define no-op architecture implementations Jeremy Fitzhardinge
2007-03-16  9:27   ` Ingo Molnar
2007-03-16  9:27     ` Ingo Molnar
2007-03-01 23:24 ` [patch 10/26] Xen-paravirt_ops: rename struct paravirt_patch to paravirt_patch_site for clarity Jeremy Fitzhardinge
2007-03-01 23:24 ` [patch 11/26] Xen-paravirt_ops: Use patch site IDs computed from offset in paravirt_ops structure Jeremy Fitzhardinge
2007-03-01 23:24   ` Jeremy Fitzhardinge
2007-03-01 23:24 ` [patch 12/26] Xen-paravirt_ops: Fix patch site clobbers to include return register Jeremy Fitzhardinge
2007-03-01 23:24   ` Jeremy Fitzhardinge
2007-03-02  0:45   ` Zachary Amsden
2007-03-02  0:45     ` Zachary Amsden
2007-03-02  0:49     ` Jeremy Fitzhardinge
2007-03-02  0:49       ` Jeremy Fitzhardinge
2007-03-02  0:52       ` Zachary Amsden
2007-03-02  0:52         ` Zachary Amsden
2007-03-02  0:58         ` Jeremy Fitzhardinge
2007-03-02  1:18           ` Zachary Amsden
2007-03-02  1:18             ` Zachary Amsden
2007-03-16  9:26   ` Ingo Molnar
2007-03-16  9:26     ` Ingo Molnar
2007-03-16 17:37     ` Jeremy Fitzhardinge
2007-03-16 17:37       ` Jeremy Fitzhardinge
2007-03-01 23:24 ` [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable Jeremy Fitzhardinge
2007-03-16  9:24   ` Ingo Molnar
2007-03-16  9:24     ` Ingo Molnar
2007-03-16  9:33     ` David Miller
2007-03-16  9:57       ` Ingo Molnar
2007-03-16  9:57         ` Ingo Molnar
2007-03-16 19:16         ` Jeremy Fitzhardinge
2007-03-16 20:38       ` Jeremy Fitzhardinge
2007-03-17 10:33         ` Rusty Russell
2007-03-17 10:33           ` Rusty Russell
2007-03-18  7:33           ` David Miller
2007-03-18  7:59             ` Jeremy Fitzhardinge
2007-03-18  7:59               ` Jeremy Fitzhardinge
2007-03-18 12:08             ` Andi Kleen
2007-03-18 15:58               ` Jeremy Fitzhardinge
2007-03-18 17:04                 ` Andi Kleen
2007-03-18 17:29                   ` Jeremy Fitzhardinge
2007-03-18 19:30                     ` Andi Kleen
2007-03-18 23:46                       ` Jeremy Fitzhardinge
2007-03-18 23:46                         ` Jeremy Fitzhardinge
2007-03-19 10:57                         ` Andi Kleen
2007-03-19 17:58                           ` Jeremy Fitzhardinge
2007-03-19 19:08                           ` David Miller
2007-03-19 20:59                             ` Andi Kleen
2007-03-19 21:55                               ` [PATCH] x86_64 : Suppress __jiffies Eric Dumazet
2007-03-20  3:18                               ` [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable Linus Torvalds
2007-03-20  3:18                                 ` Linus Torvalds
2007-03-20  3:47                                 ` David Miller
2007-03-20  4:19                                   ` Eric W. Biederman
2007-03-20  4:19                                     ` Eric W. Biederman
2007-03-20 13:28                                     ` Andi Kleen
2007-03-20 13:28                                       ` Andi Kleen
2007-03-20 16:25                                       ` Eric W. Biederman
2007-03-20 16:25                                         ` Eric W. Biederman
2007-03-20 17:42                                         ` Andi Kleen
2007-03-20 16:52                                           ` Linus Torvalds
2007-03-20 16:52                                             ` Linus Torvalds
2007-03-20 18:03                                             ` Andi Kleen
2007-03-20 17:27                                               ` Linus Torvalds
2007-03-20 17:27                                                 ` Linus Torvalds
2007-03-20 19:21                                                 ` Andi Kleen
2007-03-20 18:49                                                   ` Linus Torvalds
2007-03-20 18:49                                                     ` Linus Torvalds
2007-03-20 20:23                                                     ` Andi Kleen
2007-03-20 21:39                                                       ` [Xen-devel] " Alan Cox
2007-03-20 21:39                                                         ` Alan Cox
2007-03-20 21:39                                                         ` Alan Cox
2007-03-20 21:49                                                         ` [Xen-devel] " Andi Kleen
2007-03-20 23:51                                                           ` Linus Torvalds
2007-03-20 23:51                                                             ` Linus Torvalds
2007-03-20 23:43                                                       ` Linus Torvalds
2007-03-20 23:43                                                         ` Linus Torvalds
2007-03-21  6:08                                                       ` Andrew Morton
2007-03-21  6:08                                                         ` Andrew Morton
2007-03-20 16:12                                     ` Chuck Ebbert
2007-03-20 16:12                                       ` Chuck Ebbert
2007-03-20  1:23                         ` Zachary Amsden
2007-03-20  1:45                           ` Jeremy Fitzhardinge
2007-03-19  2:47               ` Rusty Russell
2007-03-19 18:25                 ` Eric W. Biederman
2007-03-19 18:38                   ` Linus Torvalds
2007-03-19 18:38                     ` Linus Torvalds
2007-03-19 18:44                     ` Linus Torvalds
2007-03-19 18:44                       ` Linus Torvalds
2007-03-19 19:33                     ` Jeremy Fitzhardinge
2007-03-20  0:01                     ` Rusty Russell
2007-03-20  2:00                       ` Zachary Amsden
2007-03-20  4:20                         ` Rusty Russell
2007-03-20  4:20                           ` Rusty Russell
2007-03-20  5:54                         ` Jeremy Fitzhardinge
2007-03-20 11:33                           ` Andreas Kleen
2007-03-20 15:09                           ` Linus Torvalds
2007-03-20 15:09                             ` Linus Torvalds
2007-03-20 15:58                             ` Eric W. Biederman
2007-03-20 16:06                               ` Linus Torvalds
2007-03-20 16:06                                 ` Linus Torvalds
2007-03-20 16:31                                 ` Jeremy Fitzhardinge
2007-03-20 16:31                                   ` Jeremy Fitzhardinge
2007-03-20 22:09                                   ` Zachary Amsden
2007-03-20 22:09                                     ` Zachary Amsden
2007-03-21  0:24                                     ` Linus Torvalds
2007-03-21  0:24                                       ` Linus Torvalds
2007-03-21  2:53                                       ` Zachary Amsden
2007-03-21  2:15                                         ` Linus Torvalds
2007-03-21  2:15                                           ` Linus Torvalds
2007-03-21  3:43                                           ` Zachary Amsden
2007-03-20 22:43                                   ` Matt Mackall
2007-03-20 22:43                                     ` Matt Mackall
2007-03-20 23:08                                     ` Zachary Amsden
2007-03-20 23:33                                       ` Jeremy Fitzhardinge
2007-03-20 23:33                                         ` Jeremy Fitzhardinge
2007-03-21  1:14                                         ` Zachary Amsden
2007-03-20 23:41                                       ` Matt Mackall
2007-03-20 23:41                                         ` Matt Mackall
2007-03-21  0:20                                     ` Jeremy Fitzhardinge
2007-03-20 19:28                                 ` Andi Kleen
2007-03-20 19:54                                   ` Zachary Amsden
2007-03-20 20:02                                     ` Andi Kleen
2007-03-20 16:26                               ` Jeremy Fitzhardinge
2007-03-20 22:41                               ` Rusty Russell
2007-03-20 22:41                                 ` Rusty Russell
2007-03-20 17:00                             ` Ingo Molnar
2007-03-21  0:03                             ` Paul Mackerras
2007-04-12 23:16                               ` David Miller
2007-04-12 23:16                                 ` David Miller
2007-03-19 18:41                   ` Chris Wright
2007-03-19 18:41                     ` Chris Wright
2007-03-19 19:10                   ` Jeremy Fitzhardinge
2007-03-19 19:10                     ` Jeremy Fitzhardinge
2007-03-19 19:46                     ` David Miller
2007-03-19 20:06                       ` Jeremy Fitzhardinge
2007-03-19 20:06                         ` Jeremy Fitzhardinge
2007-03-19 23:42                     ` Andi Kleen
2007-03-16 17:36     ` Jeremy Fitzhardinge
2007-03-16 17:36       ` Jeremy Fitzhardinge
2007-03-16 23:29       ` Zachary Amsden
2007-03-16 23:29         ` Zachary Amsden
2007-03-17  0:40         ` Jeremy Fitzhardinge
2007-03-17  9:10           ` Zachary Amsden
2007-03-17  9:26     ` [Xen-devel] " Rusty Russell
2007-03-17  9:26       ` Rusty Russell
2007-03-01 23:24 ` [patch 14/26] Xen-paravirt_ops: add common patching machinery Jeremy Fitzhardinge
2007-03-16  9:20   ` Ingo Molnar
2007-03-16  9:20     ` Ingo Molnar
2007-03-17  9:15     ` Rusty Russell
2007-03-17  9:15       ` Rusty Russell
2007-03-01 23:24 ` [patch 15/26] Xen-paravirt_ops: Add apply_to_page_range() which applies a function to a pte range Jeremy Fitzhardinge
2007-03-01 23:24   ` Jeremy Fitzhardinge
2007-03-16  9:19   ` Ingo Molnar
2007-03-16  9:19     ` Ingo Molnar
2007-03-16 16:47     ` Chris Wright
2007-03-16 16:47       ` Chris Wright
2007-03-16 17:08     ` Jeremy Fitzhardinge
2007-03-16 17:08       ` Jeremy Fitzhardinge
2007-03-01 23:24 ` [patch 16/26] Xen-paravirt_ops: Allocate and free vmalloc areas Jeremy Fitzhardinge
2007-03-01 23:24   ` Jeremy Fitzhardinge
2007-03-16  9:16   ` Ingo Molnar
2007-03-16  9:16     ` Ingo Molnar
2007-03-16 17:05     ` Jeremy Fitzhardinge
2007-03-16 17:05       ` Jeremy Fitzhardinge
2007-03-01 23:25 ` [patch 17/26] Xen-paravirt_ops: Add nosegneg capability to the vsyscall page notes Jeremy Fitzhardinge
2007-03-16  9:15   ` Ingo Molnar
2007-03-16  9:15     ` Ingo Molnar
2007-03-16 21:26     ` Roland McGrath
2007-03-16 21:26       ` Roland McGrath
2007-03-16 21:56       ` Jeremy Fitzhardinge
2007-03-16 21:56         ` Jeremy Fitzhardinge
2007-03-16 22:20         ` Roland McGrath
2007-03-16 22:20           ` Roland McGrath
2007-03-01 23:25 ` [patch 18/26] Xen-paravirt_ops: Add XEN config options Jeremy Fitzhardinge
2007-03-16  9:14   ` Ingo Molnar
2007-03-16  9:14     ` Ingo Molnar
2007-03-16 17:04     ` Jeremy Fitzhardinge
2007-03-16 17:04       ` Jeremy Fitzhardinge
2007-03-01 23:25 ` [patch 19/26] Xen-paravirt_ops: Add Xen interface header files Jeremy Fitzhardinge
2007-03-01 23:25 ` [patch 20/26] Xen-paravirt_ops: Core Xen implementation Jeremy Fitzhardinge
2007-03-16  9:14   ` Ingo Molnar
2007-03-16  9:14     ` Ingo Molnar
2007-03-16 12:00     ` Christoph Hellwig
2007-03-16 16:33     ` Chris Wright
2007-03-16 16:33       ` Chris Wright
2007-03-16 16:44       ` Jeremy Fitzhardinge
2007-03-16 16:57         ` Chris Wright
2007-03-16 16:57           ` Chris Wright
2007-03-16 17:12     ` Chris Wright
2007-03-16 17:12       ` Chris Wright
2007-03-19 18:05       ` Eric W. Biederman
2007-03-19 18:13         ` Jeremy Fitzhardinge
2007-03-19 18:13           ` Jeremy Fitzhardinge
2007-03-19 18:15         ` Chris Wright
2007-03-19 18:15           ` Chris Wright
2007-03-17  9:13     ` [Xen-devel] " Rusty Russell
2007-03-17  9:13       ` Rusty Russell
2007-03-18  7:03       ` [Xen-devel] " Jeremy Fitzhardinge
2007-03-01 23:25 ` [patch 21/26] Xen-paravirt_ops: Use the hvc console infrastructure for Xen console Jeremy Fitzhardinge
2007-03-16  8:54   ` Ingo Molnar
2007-03-16  8:54     ` Ingo Molnar
2007-03-16  9:28     ` Keir Fraser
2007-03-16  9:58       ` [Xen-devel] " Ingo Molnar
2007-03-16  9:58         ` Ingo Molnar
2007-03-16 10:31         ` [Xen-devel] " Keir Fraser
2007-03-16 10:31           ` Keir Fraser
2007-03-16 11:41           ` [Xen-devel] " Andrew Morton
2007-03-16 11:41             ` Andrew Morton
2007-03-16 11:58             ` Keir Fraser
2007-03-16 11:58             ` Keir Fraser
2007-03-16 11:58               ` Keir Fraser
2007-03-16 19:01               ` [Xen-devel] " Huang2, Wei
2007-03-16 19:01                 ` Huang2, Wei
2007-03-16 11:58             ` Keir Fraser
2007-03-16 10:31         ` Keir Fraser
2007-03-16 10:31         ` Keir Fraser
2007-03-16  9:28     ` Keir Fraser
2007-03-16 17:02     ` Jeremy Fitzhardinge
2007-03-16 17:02       ` Jeremy Fitzhardinge
2007-03-01 23:25 ` [patch 22/26] Xen-paravirt_ops: Add early printk support via hvc console Jeremy Fitzhardinge
2007-03-16  8:52   ` Ingo Molnar
2007-03-16  8:52     ` Ingo Molnar
2007-03-01 23:25 ` [patch 23/26] Xen-paravirt_ops: Add Xen grant table support Jeremy Fitzhardinge
2007-03-01 23:25   ` Jeremy Fitzhardinge
2007-03-16  8:51   ` Ingo Molnar
2007-03-16  8:51     ` Ingo Molnar
2007-03-16 17:00     ` Jeremy Fitzhardinge
2007-03-16 17:00       ` Jeremy Fitzhardinge
2007-03-01 23:25 ` [patch 24/26] Xen-paravirt_ops: Add the Xenbus sysfs and virtual device hotplug driver Jeremy Fitzhardinge
2007-03-16  8:47   ` Ingo Molnar
2007-03-16  8:47     ` Ingo Molnar
2007-03-16 16:18     ` Chris Wright
2007-03-16 16:18       ` Chris Wright
2007-03-16 16:19       ` Ingo Molnar
2007-03-16 16:19         ` Ingo Molnar
2007-03-16 16:40         ` Chris Wright
2007-03-16 16:40           ` Chris Wright
2007-03-16 17:53           ` Greg KH
2007-03-16 16:57     ` Jeremy Fitzhardinge
2007-03-16 16:57       ` Jeremy Fitzhardinge
2007-03-01 23:25 ` [patch 25/26] Xen-paravirt_ops: Add Xen virtual block device driver Jeremy Fitzhardinge
2007-03-01 23:25 ` [patch 26/26] Xen-paravirt_ops: Add the Xen virtual network " Jeremy Fitzhardinge
2007-03-02  0:42   ` Stephen Hemminger
2007-03-02  0:42     ` Stephen Hemminger
2007-03-02  0:56     ` Jeremy Fitzhardinge
2007-03-02  1:30       ` [RFC] Arp announce (for Xen) Stephen Hemminger
2007-03-02  1:30         ` Stephen Hemminger
2007-03-02  8:09         ` Pekka Savola
2007-03-02 18:29           ` Ben Greear
2007-03-02 19:59             ` Stephen Hemminger
2007-03-02  8:46         ` Keir Fraser
2007-03-02  8:46           ` Keir Fraser
2007-03-02 12:54         ` Andi Kleen
2007-03-02 12:54           ` Andi Kleen
2007-03-02 14:08           ` jamal
2007-03-02 18:08         ` Chris Wright
2007-03-02 18:08           ` Chris Wright
2007-03-06  4:35         ` David Miller
2007-03-06 18:51           ` [RFC] ARP notify option Stephen Hemminger
2007-03-06 18:51             ` Stephen Hemminger
2007-03-06 19:04             ` Jeremy Fitzhardinge
2007-03-06 19:07             ` Chris Wright
2007-03-06 19:07               ` Chris Wright
2007-03-06 21:18             ` Chris Friesen
2007-03-06 21:18               ` Chris Friesen
2007-03-06 22:52               ` Stephen Hemminger
2007-03-06 22:52                 ` Stephen Hemminger
2007-03-07  6:42               ` Pekka Savola
2007-03-07  6:42                 ` Pekka Savola
2007-03-07 17:00                 ` Stephen Hemminger
2007-03-07 17:00                   ` Stephen Hemminger
2007-03-02  1:21     ` [patch 26/26] Xen-paravirt_ops: Add the Xen virtual network device driver Christoph Hellwig
2007-03-02  1:26       ` Chris Wright
2007-03-16  8:42 ` [patch 00/26] Xen-paravirt_ops: Xen guest implementation for paravirt_ops interface Ingo Molnar
2007-03-16  8:42   ` Ingo Molnar
2007-03-16 16:55   ` Jeremy Fitzhardinge
2007-03-16 16:55     ` Jeremy Fitzhardinge
2007-03-16  9:21 ` Ingo Molnar
2007-03-16  9:21   ` Ingo Molnar
2007-03-16 17:26   ` Jeremy Fitzhardinge
2007-03-16 17:26     ` Jeremy Fitzhardinge
2007-03-16 18:59     ` Christoph Hellwig
2007-03-16 18:59       ` Christoph Hellwig
2007-03-16 19:26       ` Jeremy Fitzhardinge

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.