All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/10] lguest
@ 2007-02-09  9:11 ` Rusty Russell
  0 siblings, 0 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:11 UTC (permalink / raw)
  To: lkml - Kernel Mailing List; +Cc: Andrew Morton, Andi Kleen, virtualization

This patch series is against 2.6.20; some things are in flux, so there
might be issues as other things flow into the latest -git tree.

>From the documentation:

Lguest is designed to be a minimal hypervisor for the Linux kernel, for
Linux developers and users to experiment with virtualization with the
minimum of complexity.  Nonetheless, it should have sufficient
features to make it useful for specific tasks, and, of course, you are
encouraged to fork and enhance it.

Features:

- Kernel module which runs in a normal kernel.
- Simple I/O model for communication.
- Simple program to create new guests.
- Logo contains cute puppies: http://lguest.ozlabs.org

Developer features:

- Fun to hack on.
- No ABI: being tied to a specific kernel anyway, you can change
anything.
- Many opportunities for improvement or feature implementation.

Cheers!
Rusty


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 0/10] lguest
@ 2007-02-09  9:11 ` Rusty Russell
  0 siblings, 0 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:11 UTC (permalink / raw)
  To: lkml - Kernel Mailing List; +Cc: Andrew Morton, Andi Kleen, virtualization

This patch series is against 2.6.20; some things are in flux, so there
might be issues as other things flow into the latest -git tree.

From the documentation:

Lguest is designed to be a minimal hypervisor for the Linux kernel, for
Linux developers and users to experiment with virtualization with the
minimum of complexity.  Nonetheless, it should have sufficient
features to make it useful for specific tasks, and, of course, you are
encouraged to fork and enhance it.

Features:

- Kernel module which runs in a normal kernel.
- Simple I/O model for communication.
- Simple program to create new guests.
- Logo contains cute puppies: http://lguest.ozlabs.org

Developer features:

- Fun to hack on.
- No ABI: being tied to a specific kernel anyway, you can change
anything.
- Many opportunities for improvement or feature implementation.

Cheers!
Rusty

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler
  2007-02-09  9:11 ` Rusty Russell
  (?)
@ 2007-02-09  9:14 ` Rusty Russell
  2007-02-09  9:15   ` [PATCH 2/10] lguest: Export symbols for lguest as a module Rusty Russell
                     ` (3 more replies)
  -1 siblings, 4 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:14 UTC (permalink / raw)
  To: lkml - Kernel Mailing List; +Cc: Andrew Morton, Andi Kleen, virtualization

The current code simply calls "start_kernel" directly if we're under a
hypervisor and no paravirt_ops backend wants us, because paravirt.c
registers that as a backend and it's linked last.

This was always a vain hope; start_kernel won't get far without setup.
It's also impossible for paravirt_ops backends which don't sit in the
arch/i386/kernel directory: they can't link before paravirt.o anyway.

This implements a real fallthrough if we pass all the registered
paravirt probes.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/arch/i386/kernel/Makefile
+++ b/arch/i386/kernel/Makefile
@@ -39,8 +39,6 @@ obj-$(CONFIG_EARLY_PRINTK)	+= early_prin
 obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
 obj-$(CONFIG_HPET_TIMER) 	+= hpet.o
 obj-$(CONFIG_K8_NB)		+= k8.o
-
-# Make sure this is linked after any other paravirt_ops structs: see head.S
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o
 
 EXTRA_AFLAGS   := -traditional
===================================================================
--- a/arch/i386/kernel/head.S
+++ b/arch/i386/kernel/head.S
@@ -502,10 +502,11 @@ startup_paravirt:
 	pushl	%ecx
 	pushl	%eax
 
-	/* paravirt.o is last in link, and that probe fn never returns */
 	pushl	$__start_paravirtprobe
 1:
 	movl	0(%esp), %eax
+	cmpl	$__stop_paravirtprobe, %eax
+	je	unhandled_paravirt
 	pushl	(%eax)
 	movl	8(%esp), %eax
 	call	*(%esp)
@@ -517,6 +518,12 @@ 1:
 
 	addl	$4, (%esp)
 	jmp	1b
+
+unhandled_paravirt:
+	/* Nothing wanted us: try to die with dignity (impossible trap). */ 
+	movl	$0x1F, %edx
+	pushl	$0
+	jmp	early_fault
 #endif
 
 /*
===================================================================
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -481,9 +481,6 @@ static int __init print_banner(void)
 	return 0;
 }
 core_initcall(print_banner);
-
-/* We simply declare start_kernel to be the paravirt probe of last resort. */
-paravirt_probe(start_kernel);
 
 struct paravirt_ops paravirt_ops = {
 	.name = "bare hardware",



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 2/10] lguest: Export symbols for lguest as a module
  2007-02-09  9:14 ` [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler Rusty Russell
@ 2007-02-09  9:15   ` Rusty Russell
  2007-02-09  9:32     ` Andi Kleen
  2007-02-09  9:17   ` [PATCH 3/10] lguest: Expose get_futex_key, get_key_refs and drop_key_refs Rusty Russell
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:15 UTC (permalink / raw)
  To: lkml - Kernel Mailing List; +Cc: Andrew Morton, Andi Kleen, virtualization

lguest does some fairly lowlevel things to support a host, which
normal modules don't need:

math_state_restore:
	When the guest triggers a Device Not Available fault, we need
	to be able to restore the FPU

tsc_khz:
	Simplest way of telling the guest how to interpret the TSC
	counter.

__put_task_struct:
	We need to hold a reference to another task for inter-guest
	I/O, and put_task_struct() is an inline function which calls
	__put_task_struct.

access_process_vm:
	We need to access another task for inter-guest I/O.

map_vm_area & __get_vm_area:
	We need to map the switcher shim (ie. monitor) at 0xFFC01000.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/arch/i386/kernel/traps.c
+++ b/arch/i386/kernel/traps.c
@@ -1054,6 +1054,7 @@ asmlinkage void math_state_restore(void)
 	thread->status |= TS_USEDFPU;	/* So we fnsave on switch_to() */
 	tsk->fpu_counter++;
 }
+EXPORT_SYMBOL_GPL(math_state_restore);
 
 #ifndef CONFIG_MATH_EMULATION
 
===================================================================
--- a/arch/i386/kernel/tsc.c
+++ b/arch/i386/kernel/tsc.c
@@ -475,3 +475,4 @@ static int __init init_tsc_clocksource(v
 }
 
 module_init(init_tsc_clocksource);
+EXPORT_SYMBOL_GPL(tsc_khz);
===================================================================
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -126,6 +126,7 @@ void __put_task_struct(struct task_struc
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
+EXPORT_SYMBOL_GPL(__put_task_struct);
 
 void __init fork_init(unsigned long mempages)
 {
===================================================================
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2692,3 +2692,4 @@ int access_process_vm(struct task_struct
 
 	return buf - old_buf;
 }
+EXPORT_SYMBOL_GPL(access_process_vm);
===================================================================
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -159,6 +159,7 @@ int map_vm_area(struct vm_struct *area, 
 	flush_cache_vmap((unsigned long) area->addr, end);
 	return err;
 }
+EXPORT_SYMBOL_GPL(map_vm_area);
 
 static struct vm_struct *__get_vm_area_node(unsigned long size, unsigned long flags,
 					    unsigned long start, unsigned long end,
@@ -237,6 +238,7 @@ struct vm_struct *__get_vm_area(unsigned
 {
 	return __get_vm_area_node(size, flags, start, end, -1, GFP_KERNEL);
 }
+EXPORT_SYMBOL_GPL(__get_vm_area);
 
 /**
  *	get_vm_area  -  reserve a contingous kernel virtual area



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 3/10] lguest: Expose get_futex_key, get_key_refs and drop_key_refs.
  2007-02-09  9:14 ` [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler Rusty Russell
  2007-02-09  9:15   ` [PATCH 2/10] lguest: Export symbols for lguest as a module Rusty Russell
@ 2007-02-09  9:17   ` Rusty Russell
  2007-02-09  9:18   ` [PATCH 4/10] lguest: Initialize esp0 properly all the time Rusty Russell
  2007-02-09  9:31   ` [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler Andi Kleen
  3 siblings, 0 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:17 UTC (permalink / raw)
  To: lkml - Kernel Mailing List; +Cc: Andrew Morton, Andi Kleen, virtualization

Name: Expose get_futex_key, get_key_refs and drop_key_refs.

lguest uses the convenient futex infrastructure for inter-domain I/O,
so expose get_futex_key, get_key_refs (renamed get_futex_key_refs) and
drop_key_refs (renamed drop_futex_key_refs).  Also means we need to
expose the union that these use.

No code changes.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -100,6 +100,35 @@ extern int
 extern int
 handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi);
 
+/*
+ * Futexes are matched on equal values of this key.
+ * The key type depends on whether it's a shared or private mapping.
+ * Don't rearrange members without looking at hash_futex().
+ *
+ * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
+ * We set bit 0 to indicate if it's an inode-based key.
+ */
+union futex_key {
+	struct {
+		unsigned long pgoff;
+		struct inode *inode;
+		int offset;
+	} shared;
+	struct {
+		unsigned long address;
+		struct mm_struct *mm;
+		int offset;
+	} private;
+	struct {
+		unsigned long word;
+		void *ptr;
+		int offset;
+	} both;
+};
+int get_futex_key(u32 __user *uaddr, union futex_key *key);
+void get_futex_key_refs(union futex_key *key);
+void drop_futex_key_refs(union futex_key *key);
+
 #ifdef CONFIG_FUTEX
 extern void exit_robust_list(struct task_struct *curr);
 extern void exit_pi_state_list(struct task_struct *curr);
===================================================================
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -48,37 +48,12 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/signal.h>
+#include <linux/module.h>
 #include <asm/futex.h>
 
 #include "rtmutex_common.h"
 
 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
-
-/*
- * Futexes are matched on equal values of this key.
- * The key type depends on whether it's a shared or private mapping.
- * Don't rearrange members without looking at hash_futex().
- *
- * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
- */
-union futex_key {
-	struct {
-		unsigned long pgoff;
-		struct inode *inode;
-		int offset;
-	} shared;
-	struct {
-		unsigned long address;
-		struct mm_struct *mm;
-		int offset;
-	} private;
-	struct {
-		unsigned long word;
-		void *ptr;
-		int offset;
-	} both;
-};
 
 /*
  * Priority Inheritance state:
@@ -175,7 +150,7 @@ static inline int match_futex(union fute
  *
  * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
  */
-static int get_futex_key(u32 __user *uaddr, union futex_key *key)
+int get_futex_key(u32 __user *uaddr, union futex_key *key)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
@@ -246,6 +221,7 @@ static int get_futex_key(u32 __user *uad
 	}
 	return err;
 }
+EXPORT_SYMBOL_GPL(get_futex_key);
 
 /*
  * Take a reference to the resource addressed by a key.
@@ -254,7 +230,7 @@ static int get_futex_key(u32 __user *uad
  * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
  * function, if it is called at all.  mmap_sem keeps key->shared.inode valid.
  */
-static inline void get_key_refs(union futex_key *key)
+inline void get_futex_key_refs(union futex_key *key)
 {
 	if (key->both.ptr != 0) {
 		if (key->both.offset & 1)
@@ -263,12 +239,13 @@ static inline void get_key_refs(union fu
 			atomic_inc(&key->private.mm->mm_count);
 	}
 }
+EXPORT_SYMBOL_GPL(get_futex_key_refs);
 
 /*
  * Drop a reference to the resource addressed by a key.
  * The hash bucket spinlock must not be held.
  */
-static void drop_key_refs(union futex_key *key)
+void drop_futex_key_refs(union futex_key *key)
 {
 	if (key->both.ptr != 0) {
 		if (key->both.offset & 1)
@@ -277,6 +254,7 @@ static void drop_key_refs(union futex_ke
 			mmdrop(key->private.mm);
 	}
 }
+EXPORT_SYMBOL_GPL(drop_futex_key_refs);
 
 static inline int get_futex_value_locked(u32 *dest, u32 __user *from)
 {
@@ -871,7 +849,7 @@ static int futex_requeue(u32 __user *uad
 				this->lock_ptr = &hb2->lock;
 			}
 			this->key = key2;
-			get_key_refs(&key2);
+			get_futex_key_refs(&key2);
 			drop_count++;
 
 			if (ret - nr_wake >= nr_requeue)
@@ -884,9 +862,9 @@ out_unlock:
 	if (hb1 != hb2)
 		spin_unlock(&hb2->lock);
 
-	/* drop_key_refs() must be called outside the spinlocks. */
+	/* drop_futex_key_refs() must be called outside the spinlocks. */
 	while (--drop_count >= 0)
-		drop_key_refs(&key1);
+		drop_futex_key_refs(&key1);
 
 out:
 	up_read(&current->mm->mmap_sem);
@@ -904,7 +882,7 @@ queue_lock(struct futex_q *q, int fd, st
 
 	init_waitqueue_head(&q->waiters);
 
-	get_key_refs(&q->key);
+	get_futex_key_refs(&q->key);
 	hb = hash_futex(&q->key);
 	q->lock_ptr = &hb->lock;
 
@@ -923,7 +901,7 @@ queue_unlock(struct futex_q *q, struct f
 queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
 {
 	spin_unlock(&hb->lock);
-	drop_key_refs(&q->key);
+	drop_futex_key_refs(&q->key);
 }
 
 /*
@@ -978,7 +956,7 @@ static int unqueue_me(struct futex_q *q)
 		ret = 1;
 	}
 
-	drop_key_refs(&q->key);
+	drop_futex_key_refs(&q->key);
 	return ret;
 }
 
@@ -997,7 +975,7 @@ static void unqueue_me_pi(struct futex_q
 
 	spin_unlock(&hb->lock);
 
-	drop_key_refs(&q->key);
+	drop_futex_key_refs(&q->key);
 }
 
 static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)




^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 4/10] lguest: Initialize esp0 properly all the time
  2007-02-09  9:14 ` [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler Rusty Russell
  2007-02-09  9:15   ` [PATCH 2/10] lguest: Export symbols for lguest as a module Rusty Russell
  2007-02-09  9:17   ` [PATCH 3/10] lguest: Expose get_futex_key, get_key_refs and drop_key_refs Rusty Russell
@ 2007-02-09  9:18   ` Rusty Russell
  2007-02-09  9:19       ` Rusty Russell
  2007-02-09  9:31   ` [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler Andi Kleen
  3 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:18 UTC (permalink / raw)
  To: lkml - Kernel Mailing List; +Cc: Andrew Morton, Andi Kleen, virtualization

Whenever we schedule, __switch_to calls load_esp0 which does:

	tss->esp0 = thread->esp0;

This is never initialized for the initial thread (ie "swapper"), so
when we're scheduling that, we end up setting esp0 to 0.  This is
fine: the swapper never leaves ring 0, so this field is never used.

lguest, however, gets upset that we're trying to used an unmapped page
as our kernel stack.  Rather than work around it there, let's
initialize it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/include/asm-i386/processor.h
+++ b/include/asm-i386/processor.h
@@ -421,6 +421,7 @@ struct thread_struct {
 };
 
 #define INIT_THREAD  {							\
+	.esp0 = sizeof(init_stack) + (long)&init_stack,			\
 	.vm86_info = NULL,						\
 	.sysenter_cs = __KERNEL_CS,					\
 	.io_bitmap_ptr = NULL,						\



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 5/10] Make hvc_console.c compile on non-PowerPC
  2007-02-09  9:18   ` [PATCH 4/10] lguest: Initialize esp0 properly all the time Rusty Russell
@ 2007-02-09  9:19       ` Rusty Russell
  0 siblings, 0 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:19 UTC (permalink / raw)
  To: lkml - Kernel Mailing List
  Cc: Andrew Morton, Andi Kleen, virtualization, Paul Mackerras,
	Stephen Rothwell

There's a really nice console helper (esp. for virtual console
drivers) in drivers/char/hvc_console.c.  It has only ever been used
for PowerPC, though, so it uses NO_IRQ which is only defined there.

Let's fix that so it's more widely useful.  By, say, lguest.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/drivers/char/hvc_console.c
+++ b/drivers/char/hvc_console.c
@@ -48,6 +48,10 @@
 #define HVC_MINOR	0
 
 #define TIMEOUT		(10)
+
+#ifndef NO_IRQ
+#define NO_IRQ 0
+#endif
 
 /*
  * Wait this long per iteration while trying to push buffered data to the



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 5/10] Make hvc_console.c compile on non-PowerPC
@ 2007-02-09  9:19       ` Rusty Russell
  0 siblings, 0 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:19 UTC (permalink / raw)
  To: lkml - Kernel Mailing List
  Cc: Andrew Morton, Andi Kleen, Stephen Rothwell, Paul Mackerras,
	virtualization

There's a really nice console helper (esp. for virtual console
drivers) in drivers/char/hvc_console.c.  It has only ever been used
for PowerPC, though, so it uses NO_IRQ which is only defined there.

Let's fix that so it's more widely useful.  By, say, lguest.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/drivers/char/hvc_console.c
+++ b/drivers/char/hvc_console.c
@@ -48,6 +48,10 @@
 #define HVC_MINOR	0
 
 #define TIMEOUT		(10)
+
+#ifndef NO_IRQ
+#define NO_IRQ 0
+#endif
 
 /*
  * Wait this long per iteration while trying to push buffered data to the

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09  9:19       ` Rusty Russell
  (?)
@ 2007-02-09  9:20       ` Rusty Russell
  2007-02-09  9:22         ` [PATCH 7/10] lguest: Simple lguest network driver Rusty Russell
                           ` (2 more replies)
  -1 siblings, 3 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:20 UTC (permalink / raw)
  To: lkml - Kernel Mailing List
  Cc: Andrew Morton, Andi Kleen, Stephen Rothwell, Paul Mackerras,
	virtualization

This is the core of lguest: both the guest code (always compiled in to
the image so it can boot under lguest), and the host code (lg.ko).

There is only one config prompt at the moment: lguest is currently
designed to run exactly the same guest and host kernels so we can
frob the ABI freely.

Unfortunately, we don't have the build infrastructure for "private"
asm-offsets.h files, so there's a not-so-neat include in
arch/i386/kernel/asm-offsets.c.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -226,6 +226,27 @@ config ES7000_CLUSTERED_APIC
 	depends on SMP && X86_ES7000 && MPENTIUMIII
 
 source "arch/i386/Kconfig.cpu"
+
+config LGUEST
+	tristate "Linux hypervisor example code"
+	depends on X86 && PARAVIRT && EXPERIMENTAL && !X86_PAE
+	select LGUEST_GUEST
+	select HVC_DRIVER
+	---help---
+	  This is a very simple module which allows you to run
+	  multiple instances of the same Linux kernel, using the
+	  "lguest" command found in the Documentation/lguest directory.
+	  Note that "lguest" is pronounced to rhyme with "fell quest",
+	  not "rustyvisor".  See Documentation/lguest/lguest.txt.
+
+	  If unsure, say N.  If curious, say M.  If masochistic, say Y.
+
+config LGUEST_GUEST
+	bool
+	help
+	  The guest needs code built-in, even if the host has lguest
+	  support as a module.  The drivers are tiny, so we build them
+	  in too.
 
 config HPET_TIMER
 	bool "HPET Timer Support"
===================================================================
--- a/arch/i386/Makefile
+++ b/arch/i386/Makefile
@@ -108,6 +108,7 @@ drivers-$(CONFIG_PCI)			+= arch/i386/pci
 # must be linked after kernel/
 drivers-$(CONFIG_OPROFILE)		+= arch/i386/oprofile/
 drivers-$(CONFIG_PM)			+= arch/i386/power/
+drivers-$(CONFIG_LGUEST_GUEST)		+= arch/i386/lguest/
 
 CFLAGS += $(mflags-y)
 AFLAGS += $(mflags-y)
===================================================================
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -16,6 +16,10 @@
 #include <asm/thread_info.h>
 #include <asm/elf.h>
 #include <asm/pda.h>
+#ifdef CONFIG_LGUEST_GUEST
+#include <asm/lguest.h>
+#include "../lguest/lg.h"
+#endif
 
 #define DEFINE(sym, val) \
         asm volatile("\n->" #sym " %0 " #val : : "i" (val))
@@ -111,4 +115,19 @@ void foo(void)
 	OFFSET(PARAVIRT_iret, paravirt_ops, iret);
 	OFFSET(PARAVIRT_read_cr0, paravirt_ops, read_cr0);
 #endif
+
+#ifdef CONFIG_LGUEST_GUEST
+	BLANK();
+	OFFSET(LGUEST_DATA_irq_enabled, lguest_data, irq_enabled);
+	OFFSET(LGUEST_STATE_host_stackptr, lguest_state, host.stackptr);
+	OFFSET(LGUEST_STATE_host_pgdir, lguest_state, host.pgdir);
+	OFFSET(LGUEST_STATE_host_gdt, lguest_state, host.gdt);
+	OFFSET(LGUEST_STATE_host_idt, lguest_state, host.idt);
+	OFFSET(LGUEST_STATE_regs, lguest_state, regs);
+	OFFSET(LGUEST_STATE_gdt, lguest_state, gdt);
+	OFFSET(LGUEST_STATE_idt, lguest_state, idt);
+	OFFSET(LGUEST_STATE_gdt_table, lguest_state, gdt_table);
+	OFFSET(LGUEST_STATE_trapnum, lguest_state, regs.trapnum);
+	OFFSET(LGUEST_STATE_errcode, lguest_state, regs.errcode);
+#endif
 }
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/Makefile
@@ -0,0 +1,22 @@
+# Guest requires the paravirt_ops replacement and the bus driver.
+obj-$(CONFIG_LGUEST_GUEST) += lguest.o lguest_bus.o
+
+# Host requires the other files, which can be a module.
+obj-$(CONFIG_LGUEST)	+= lg.o
+lg-objs := core.o hypercalls.o page_tables.o interrupts_and_traps.o \
+	segments.o io.o lguest_user.o
+
+# We use top 4MB for guest traps page, then hypervisor. */
+HYPE_ADDR := (0xFFC00000+4096)
+# The data is only 1k (256 interrupt handler pointers)
+HYPE_DATA_SIZE := 1024
+CFLAGS += -DHYPE_ADDR="$(HYPE_ADDR)" -DHYPE_DATA_SIZE="$(HYPE_DATA_SIZE)"
+
+$(obj)/core.o: $(obj)/hypervisor-blob.c
+# This links the hypervisor in the right place and turns it into a C array.
+$(obj)/hypervisor-raw: $(obj)/hypervisor.o
+	@$(LD) -static -Tdata=`printf %#x $$(($(HYPE_ADDR)))` -Ttext=`printf %#x $$(($(HYPE_ADDR)+$(HYPE_DATA_SIZE)))` -o $@ $< && $(OBJCOPY) -O binary $@
+$(obj)/hypervisor-blob.c: $(obj)/hypervisor-raw
+	@od -tx1 -An -v $< | sed -e 's/^ /0x/' -e 's/$$/,/' -e 's/ /,0x/g' > $@
+
+clean-files := hypervisor-blob.c hypervisor-raw
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/core.c
@@ -0,0 +1,425 @@
+/* World's simplest hypervisor, to test paravirt_ops and show
+ * unbelievers that virtualization is the future.  Plus, it's fun! */
+#include <linux/module.h>
+#include <linux/stringify.h>
+#include <linux/stddef.h>
+#include <linux/io.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <asm/lguest.h>
+#include <asm/paravirt.h>
+#include <asm/desc.h>
+#include <asm/pgtable.h>
+#include <asm/uaccess.h>
+#include <asm/poll.h>
+#include <asm/highmem.h>
+#include <asm/asm-offsets.h>
+#include "lg.h"
+
+/* This is our hypervisor, compiled from hypervisor.S. */
+static char __initdata hypervisor_blob[] = {
+#include "hypervisor-blob.c"
+};
+
+#define MAX_LGUEST_GUESTS \
+	((HYPERVISOR_SIZE-sizeof(hypervisor_blob))/sizeof(struct lguest_state))
+
+static struct vm_struct *hypervisor_vma;
+static int cpu_had_pge;
+static struct {
+	unsigned long offset;
+	unsigned short segment;
+} lguest_entry;
+struct page *hype_pages; /* Contiguous pages. */
+struct lguest lguests[MAX_LGUEST_GUESTS];
+DECLARE_MUTEX(lguest_lock);
+
+/* IDT entries are at start of hypervisor. */
+const unsigned long *__lguest_default_idt_entries(void)
+{
+	return (void *)HYPE_ADDR;
+}
+
+/* Next is switch_to_guest */
+static void *__lguest_switch_to_guest(void)
+{
+	return (void *)HYPE_ADDR + HYPE_DATA_SIZE;
+}
+
+/* Then we use everything else to hold guest state. */
+struct lguest_state *__lguest_states(void)
+{
+	return (void *)HYPE_ADDR + sizeof(hypervisor_blob);
+}
+
+static __init int map_hypervisor(void)
+{
+	unsigned int i;
+	int err;
+	struct page *pages[HYPERVISOR_PAGES], **pagep = pages;
+
+	hype_pages = alloc_pages(GFP_KERNEL|__GFP_ZERO,
+				 get_order(HYPERVISOR_SIZE));
+	if (!hype_pages)
+		return -ENOMEM;
+
+	hypervisor_vma = __get_vm_area(HYPERVISOR_SIZE, VM_ALLOC,
+				       HYPE_ADDR, VMALLOC_END);
+	if (!hypervisor_vma) {
+		err = -ENOMEM;
+		printk("lguest: could not map hypervisor pages high\n");
+		goto free_pages;
+	}
+
+	for (i = 0; i < HYPERVISOR_PAGES; i++)
+		pages[i] = hype_pages + i;
+
+	err = map_vm_area(hypervisor_vma, PAGE_KERNEL, &pagep);
+	if (err) {
+		printk("lguest: map_vm_area failed: %i\n", err);
+		goto free_vma;
+	}
+	memcpy(hypervisor_vma->addr, hypervisor_blob, sizeof(hypervisor_blob));
+
+	/* Setup LGUEST segments on all cpus */
+	for_each_possible_cpu(i) {
+		get_cpu_gdt_table(i)[GDT_ENTRY_LGUEST_CS] = FULL_EXEC_SEGMENT;
+		get_cpu_gdt_table(i)[GDT_ENTRY_LGUEST_DS] = FULL_SEGMENT;
+	}
+
+	/* Initialize entry point into hypervisor. */
+	lguest_entry.offset = (long)__lguest_switch_to_guest();
+	lguest_entry.segment = LGUEST_CS;
+
+	printk("lguest: mapped hypervisor at %p\n", hypervisor_vma->addr);
+	return 0;
+
+free_vma:
+	vunmap(hypervisor_vma->addr);
+free_pages:
+	__free_pages(hype_pages, get_order(HYPERVISOR_SIZE));
+	return err;
+}
+
+static __exit void unmap_hypervisor(void)
+{
+	vunmap(hypervisor_vma->addr);
+	__free_pages(hype_pages, get_order(HYPERVISOR_SIZE));
+}
+
+/* IN/OUT insns: enough to get us past boot-time probing. */
+static int emulate_insn(struct lguest *lg)
+{
+	u8 insn;
+	unsigned int insnlen = 0, in = 0, shift = 0;
+	unsigned long physaddr = guest_pa(lg, lg->state->regs.eip);
+
+	/* This only works for addresses in linear mapping... */
+	if (lg->state->regs.eip < lg->page_offset)
+		return 0;
+	lhread(lg, &insn, physaddr, 1);
+
+	/* Operand size prefix means it's actually for ax. */
+	if (insn == 0x66) {
+		shift = 16;
+		insnlen = 1;
+		lhread(lg, &insn, physaddr + insnlen, 1);
+	}
+
+	switch (insn & 0xFE) {
+	case 0xE4: /* in     <next byte>,%al */
+		insnlen += 2;
+		in = 1;
+		break;
+	case 0xEC: /* in     (%dx),%al */
+		insnlen += 1;
+		in = 1;
+		break;
+	case 0xE6: /* out    %al,<next byte> */
+		insnlen += 2;
+		break;
+	case 0xEE: /* out    %al,(%dx) */
+		insnlen += 1;
+		break;
+	default:
+		return 0;
+	}
+
+	if (in) {
+		/* Lower bit tells is whether it's a 16 or 32 bit access */
+		if (insn & 0x1)
+			lg->state->regs.eax = 0xFFFFFFFF;
+		else
+			lg->state->regs.eax |= (0xFFFF << shift);
+	}
+	lg->state->regs.eip += insnlen;
+	return 1;
+}
+
+int find_free_guest(void)
+{
+	unsigned int i;
+	for (i = 0; i < MAX_LGUEST_GUESTS; i++)
+		if (!lguests[i].state)
+			return i;
+	return -1;
+}
+
+int lguest_address_ok(const struct lguest *lg, unsigned long addr)
+{
+	return addr / PAGE_SIZE < lg->pfn_limit;
+}
+
+/* Just like get_user, but don't let guest access lguest binary. */
+u32 lhread_u32(struct lguest *lg, u32 addr)
+{
+	u32 val = 0;
+
+	/* Don't let them access lguest_add */
+	if (!lguest_address_ok(lg, addr)
+	    || get_user(val, (u32 __user *)addr) != 0)
+		kill_guest(lg, "bad read address %u", addr);
+	return val;
+}
+
+void lhwrite_u32(struct lguest *lg, u32 addr, u32 val)
+{
+	if (!lguest_address_ok(lg, addr)
+	    || put_user(val, (u32 __user *)addr) != 0)
+		kill_guest(lg, "bad write address %u", addr);
+}
+
+void lhread(struct lguest *lg, void *b, u32 addr, unsigned bytes)
+{
+	if (addr + bytes < addr || !lguest_address_ok(lg, addr+bytes)
+	    || copy_from_user(b, (void __user *)addr, bytes) != 0) {
+		/* copy_from_user should do this, but as we rely on it... */
+		memset(b, 0, bytes);
+		kill_guest(lg, "bad read address %u len %u", addr, bytes);
+	}
+}
+
+void lhwrite(struct lguest *lg, u32 addr, const void *b, unsigned bytes)
+{
+	if (addr + bytes < addr
+	    || !lguest_address_ok(lg, addr+bytes)
+	    || copy_to_user((void __user *)addr, b, bytes) != 0)
+		kill_guest(lg, "bad write address %u len %u", addr, bytes);
+}
+
+/* Saves exporting idt_table from kernel */
+static struct desc_struct *get_idt_table(void)
+{
+	struct Xgt_desc_struct idt;
+
+	asm("sidt %0":"=m" (idt));
+	return (void *)idt.address;
+}
+
+extern asmlinkage void math_state_restore(void);
+
+static int usermode(struct lguest_regs *regs)
+{
+	return (regs->cs & SEGMENT_RPL_MASK) == USER_RPL;
+}
+
+/* Trap page resets this when it reloads gs. */
+static int new_gfp_eip(struct lguest *lg, struct lguest_regs *regs)
+{
+	u32 eip;
+	get_user(eip, &lg->lguest_data->gs_gpf_eip);
+	if (eip == regs->eip)
+		return 0;
+	put_user(regs->eip, &lg->lguest_data->gs_gpf_eip);
+	return 1;
+}
+
+static void set_ts(unsigned int guest_ts)
+{
+	u32 cr0;
+	if (guest_ts) {
+		asm("movl %%cr0,%0":"=r" (cr0));
+		if (!(cr0 & 8))
+			asm("movl %0,%%cr0": :"r" (cr0|8));
+	}
+}
+
+static void run_guest_once(struct lguest *lg)
+{
+	unsigned int clobber;
+
+	/* Put eflags on stack, lcall does rest. */
+	asm volatile("pushf; lcall *lguest_entry"
+		     : "=a"(clobber), "=d"(clobber)
+		     : "0"(lg->state), "1"(get_idt_table())
+		     : "memory");
+}
+
+int run_guest(struct lguest *lg, char *__user user)
+{
+	struct lguest_regs *regs = &lg->state->regs;
+
+	while (!lg->dead) {
+		unsigned int cr2 = 0; /* Damn gcc */
+
+		/* Hypercalls first: we might have been out to userspace */
+		if (do_async_hcalls(lg))
+			goto pending_dma;
+
+		if (regs->trapnum == LGUEST_TRAP_ENTRY) {
+			/* Only do hypercall once. */
+			regs->trapnum = 255;
+			if (hypercall(lg, regs))
+				goto pending_dma;
+		}
+
+		if (signal_pending(current))
+			return -EINTR;
+		maybe_do_interrupt(lg);
+
+		if (lg->dead)
+			break;
+
+		if (lg->halted) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(1);
+			continue;
+		}
+
+		/* Restore limits on TLS segments if in user mode. */
+		if (usermode(regs)) {
+			unsigned int i;
+			for (i = 0; i < ARRAY_SIZE(lg->tls_limits); i++)
+				lg->state->gdt_table[GDT_ENTRY_TLS_MIN+i].a
+					|= lg->tls_limits[i];
+		}
+
+		local_irq_disable();
+		map_trap_page(lg);
+
+		/* Host state to be restored after the guest returns. */
+		asm("sidt %0":"=m"(lg->state->host.idt));
+		lg->state->host.gdt = __get_cpu_var(cpu_gdt_descr);
+
+		/* Even if *we* don't want FPU trap, guest might... */
+		set_ts(lg->ts);
+
+		run_guest_once(lg);
+
+		/* Save cr2 now if we page-faulted. */
+		if (regs->trapnum == 14)
+			asm("movl %%cr2,%0" :"=r" (cr2));
+		else if (regs->trapnum == 7)
+			math_state_restore();
+		local_irq_enable();
+
+		switch (regs->trapnum) {
+		case 13: /* We've intercepted a GPF. */
+			if (regs->errcode == 0) {
+				if (emulate_insn(lg))
+					continue;
+
+				/* FIXME: If it's reloading %gs in a loop? */
+				if (usermode(regs) && new_gfp_eip(lg,regs))
+					continue;
+			}
+
+			if (reflect_trap(lg, &lg->gpf_trap, 1))
+				continue;
+			break;
+		case 14: /* We've intercepted a page fault. */
+			if (demand_page(lg, cr2, regs->errcode & 2))
+				continue;
+
+			/* If lguest_data is NULL, this won't hurt. */
+			put_user(cr2, &lg->lguest_data->cr2);
+			if (reflect_trap(lg, &lg->page_trap, 1))
+				continue;
+			kill_guest(lg, "unhandled page fault at %#x"
+				   " (eip=%#x, errcode=%#x)",
+				   cr2, regs->eip, regs->errcode);
+			break;
+		case 7: /* We've intercepted a Device Not Available fault. */
+			/* If they don't want to know, just absorb it. */
+			if (!lg->ts) 
+				continue;
+			if (reflect_trap(lg, &lg->fpu_trap, 0))
+				continue;
+			kill_guest(lg, "unhandled FPU fault at %#x",
+				   regs->eip);
+			break;
+		case 32 ... 255: /* Real interrupt, fall thru */
+			cond_resched();
+		case LGUEST_TRAP_ENTRY: /* Handled at top of loop */
+			continue;
+		case 6: /* Invalid opcode before they installed handler */
+			check_bug_kill(lg);
+		}
+		kill_guest(lg,"unhandled trap %i at %#x (err=%i)",
+			   regs->trapnum, regs->eip, regs->errcode);
+	}
+	return -ENOENT;
+
+pending_dma:
+	put_user(lg->pending_dma, (unsigned long *)user);
+	put_user(lg->pending_addr, (unsigned long *)user+1);
+	return sizeof(unsigned long)*2;
+}
+
+#define STRUCT_LGUEST_ELEM_SIZE(elem) sizeof(((struct lguest_state *)0)->elem)
+
+static void adjust_pge(void *on)
+{
+	if (on)
+		write_cr4(read_cr4() | X86_CR4_PGE);
+	else
+		write_cr4(read_cr4() & ~X86_CR4_PGE);
+}
+ 
+static int __init init(void)
+{
+	int err;
+
+	if (paravirt_enabled())
+		return -EPERM;
+
+	err = map_hypervisor();
+	if (err)
+		return err;
+
+	err = init_pagetables(hype_pages);
+	if (err) {
+		unmap_hypervisor();
+		return err;
+	}
+	lguest_io_init();
+
+	err = lguest_device_init();
+	if (err) {
+		free_pagetables();
+		unmap_hypervisor();
+		return err;
+	}
+	if (cpu_has_pge) { /* We have a broader idea of "global". */
+		cpu_had_pge = 1;
+		on_each_cpu(adjust_pge, 0, 0, 1);
+		clear_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability);
+	}
+	return 0;
+}
+
+static void __exit fini(void)
+{
+	lguest_device_remove();
+	free_pagetables();
+	unmap_hypervisor();
+	if (cpu_had_pge) {
+		set_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability);
+		on_each_cpu(adjust_pge, (void *)1, 0, 1);
+	}
+}
+
+module_init(init);
+module_exit(fini);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Rusty Russell <rusty@rustcorp.com.au>");
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/hypercalls.c
@@ -0,0 +1,199 @@
+/*  Actual hypercalls, which allow guests to actually do something.
+    Copyright (C) 2006 Rusty Russell IBM Corporation
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+*/
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/mm.h>
+#include <linux/clocksource.h>
+#include <asm/lguest.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <irq_vectors.h>
+#include "lg.h"
+
+static void guest_set_stack(struct lguest *lg,
+			    u32 seg, u32 esp, unsigned int pages)
+{
+	/* You cannot have a stack segment with priv level 0. */
+	if ((seg & 0x3) != GUEST_DPL)
+		kill_guest(lg, "bad stack segment %i", seg);
+	if (pages > 2)
+		kill_guest(lg, "bad stack pages %u", pages);
+	lg->state->tss.ss1 = seg;
+	lg->state->tss.esp1 = esp;
+	lg->stack_pages = pages;
+	pin_stack_pages(lg);
+}
+
+/* Return true if DMA to host userspace now pending. */
+static int do_hcall(struct lguest *lg, struct lguest_regs *regs)
+{
+	switch (regs->eax) {
+	case LHCALL_FLUSH_ASYNC:
+		break;
+	case LHCALL_LGUEST_INIT:
+		kill_guest(lg, "already have lguest_data");
+		break;
+	case LHCALL_CRASH: {
+		char msg[128];
+		lhread(lg, msg, regs->edx, sizeof(msg));
+		msg[sizeof(msg)-1] = '\0';
+		kill_guest(lg, "CRASH: %s", msg);
+		break;
+	}
+	case LHCALL_LOAD_GDT:
+		load_guest_gdt(lg, regs->edx, regs->ebx);
+		break;
+	case LHCALL_NEW_PGTABLE:
+		guest_new_pagetable(lg, regs->edx);
+		break;
+	case LHCALL_FLUSH_TLB:
+		if (regs->edx)
+			guest_pagetable_clear_all(lg);
+		else
+			guest_pagetable_flush_user(lg);
+		break;
+	case LHCALL_LOAD_IDT_ENTRY:
+		load_guest_idt_entry(lg, regs->edx, regs->ebx, regs->ecx);
+		break;
+	case LHCALL_SET_STACK:
+		guest_set_stack(lg, regs->edx, regs->ebx, regs->ecx);
+		break;
+	case LHCALL_TS:
+		lg->ts = regs->edx;
+		break;
+	case LHCALL_TIMER_READ: {
+		u32 now = jiffies;
+		mb();
+		regs->eax = now - lg->last_timer;
+		lg->last_timer = now;
+		break;
+	}
+	case LHCALL_TIMER_START:
+		lg->timer_on = 1;
+		if (regs->edx != HZ)
+			kill_guest(lg, "Bad clock speed %i", regs->edx);
+		lg->last_timer = jiffies;
+		break;
+	case LHCALL_HALT:
+		lg->halted = 1;
+		break;
+	case LHCALL_GET_WALLCLOCK: {
+		struct timeval tv;
+		do_gettimeofday(&tv);
+		regs->eax = tv.tv_sec;
+		break;
+	}
+	case LHCALL_BIND_DMA:
+		regs->eax = bind_dma(lg, regs->edx, regs->ebx,
+				     regs->ecx >> 8, regs->ecx & 0xFF);
+		break;
+	case LHCALL_SEND_DMA:
+		return send_dma(lg, regs->edx, regs->ebx);
+	case LHCALL_SET_PTE:
+		guest_set_pte(lg, regs->edx, regs->ebx, regs->ecx);
+		break;
+	case LHCALL_SET_UNKNOWN_PTE:
+		guest_pagetable_clear_all(lg);
+		break;
+	case LHCALL_SET_PUD:
+		guest_set_pud(lg, regs->edx, regs->ebx);
+		break;
+	case LHCALL_LOAD_TLS:
+		guest_load_tls(lg, (struct desc_struct __user*)regs->edx);
+		break;
+	default:
+		kill_guest(lg, "Bad hypercall %i\n", regs->eax);
+	}
+	return 0;
+}
+
+#define log(...)					\
+	do {						\
+		mm_segment_t oldfs = get_fs();		\
+		char buf[100];				\
+		sprintf(buf, "lguest:" __VA_ARGS__);	\
+		set_fs(KERNEL_DS);			\
+		sys_write(1, buf, strlen(buf));		\
+		set_fs(oldfs);				\
+	} while(0)
+
+/* We always do queued calls before actual hypercall. */
+int do_async_hcalls(struct lguest *lg)
+{
+	unsigned int i, pending;
+	u8 st[LHCALL_RING_SIZE];
+
+	if (!lg->lguest_data)
+		return 0;
+
+	copy_from_user(&st, &lg->lguest_data->hcall_status, sizeof(st));
+	for (i = 0; i < ARRAY_SIZE(st); i++) {
+		struct lguest_regs regs;
+		unsigned int n = lg->next_hcall;
+
+		if (st[n] == 0xFF)
+			break;
+
+		if (++lg->next_hcall == LHCALL_RING_SIZE)
+			lg->next_hcall = 0;
+
+		get_user(regs.eax, &lg->lguest_data->hcalls[n].eax);
+		get_user(regs.edx, &lg->lguest_data->hcalls[n].edx);
+		get_user(regs.ecx, &lg->lguest_data->hcalls[n].ecx);
+		get_user(regs.ebx, &lg->lguest_data->hcalls[n].ebx);
+		pending = do_hcall(lg, &regs);
+		put_user(0xFF, &lg->lguest_data->hcall_status[n]);
+		if (pending)
+			return 1;
+	}
+
+	set_wakeup_process(lg, NULL);
+	return 0;
+}
+
+int hypercall(struct lguest *lg, struct lguest_regs *regs)
+{
+	int pending;
+
+	if (!lg->lguest_data) {
+		if (regs->eax != LHCALL_LGUEST_INIT) {
+			kill_guest(lg, "hypercall %i before LGUEST_INIT",
+				   regs->eax);
+			return 0;
+		}
+
+		lg->lguest_data = (struct lguest_data __user *)regs->edx;
+		/* We check here so we can simply copy_to_user/from_user */
+		if (!lguest_address_ok(lg, (long)lg->lguest_data)
+		    || !lguest_address_ok(lg, (long)(lg->lguest_data+1))){
+			kill_guest(lg, "bad guest page %p", lg->lguest_data);
+			return 0;
+		}
+		get_user(lg->noirq_start, &lg->lguest_data->noirq_start);
+		get_user(lg->noirq_end, &lg->lguest_data->noirq_end);
+		/* We reserve the top pgd entry. */
+		put_user(4U*1024*1024, &lg->lguest_data->reserve_mem);
+		put_user(lg->guestid, &lg->lguest_data->guestid);
+		put_user(clocksource_khz2mult(tsc_khz, 22),
+			 &lg->lguest_data->clock_mult);
+		return 0;
+	}
+	pending = do_hcall(lg, regs);
+	set_wakeup_process(lg, NULL);
+	return pending;
+}
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/hypervisor.S
@@ -0,0 +1,170 @@
+/* This code sits at 0xFFFF1000 to do the low-level guest<->host switch.
+   Layout is: default_idt_entries (1k), then switch_to_guest entry point. */
+#include <linux/linkage.h>
+#include <asm/asm-offsets.h>
+#include "lg.h"
+
+#define SAVE_REGS				\
+	/* Save old guest/host state */		\
+	pushl	%es;				\
+	pushl	%ds;				\
+	pushl	%fs;				\
+	pushl	%eax;				\
+	pushl	%gs;				\
+	pushl	%ebp;				\
+	pushl	%edi;				\
+	pushl	%esi;				\
+	pushl	%edx;				\
+	pushl	%ecx;				\
+	pushl	%ebx;				\
+
+.text
+ENTRY(_start) /* ld complains unless _start is defined. */
+/* %eax contains ptr to target guest state, %edx contains host idt. */
+switch_to_guest:
+	pushl	%ss
+	SAVE_REGS
+	/* Save old stack, switch to guest's stack. */
+	movl	%esp, LGUEST_STATE_host_stackptr(%eax)
+	movl	%eax, %esp
+	/* Guest registers will be at: %esp-$LGUEST_STATE_regs */
+	addl	$LGUEST_STATE_regs, %esp
+	/* Switch to guest's GDT, IDT. */
+	lgdt	LGUEST_STATE_gdt(%eax)
+	lidt	LGUEST_STATE_idt(%eax)
+	/* Save page table top. */
+	movl	%cr3, %ebx
+	movl	%ebx, LGUEST_STATE_host_pgdir(%eax)
+	/* Set host's TSS to available (clear byte 5 bit 2). */
+	movl	(LGUEST_STATE_host_gdt+2)(%eax), %ebx
+	andb	$0xFD, (GDT_ENTRY_TSS*8 + 5)(%ebx)
+	/* Switch to guest page tables */
+	popl	%ebx
+	movl	%ebx, %cr3
+	/* Switch to guest's TSS. */
+	movl	$(GDT_ENTRY_TSS*8), %ebx
+	ltr	%bx
+	/* Restore guest regs */
+	popl	%ebx
+	popl	%ecx
+	popl	%edx
+	popl	%esi
+	popl	%edi
+	popl	%ebp
+	popl	%gs
+	/* Now we've loaded gs, neuter the TLS entries down to 1 byte/page */
+	addl	$(LGUEST_STATE_gdt_table+GDT_ENTRY_TLS_MIN*8), %eax
+	movw	$0,(%eax)
+	movw	$0,8(%eax)
+	movw	$0,16(%eax)
+	popl	%eax
+	popl	%fs
+	popl	%ds
+	popl	%es
+	/* Skip error code and trap number */
+	addl	$8, %esp
+	iret
+
+#define SWITCH_TO_HOST							\
+	SAVE_REGS;							\
+	/* Save old pgdir */						\
+	movl	%cr3, %eax;						\
+	pushl	%eax;							\
+	/* Load lguest ds segment for convenience. */			\
+	movl	$(LGUEST_DS), %eax;					\
+	movl	%eax, %ds;						\
+	/* Now figure out who we are */					\
+	movl	%esp, %eax;						\
+	subl	$LGUEST_STATE_regs, %eax;				\
+	/* Switch to host page tables (GDT, IDT and stack are in host   \
+	   mem, so need this first) */					\
+	movl	LGUEST_STATE_host_pgdir(%eax), %ebx;			\
+	movl	%ebx, %cr3;						\
+	/* Set guest's TSS to available (clear byte 5 bit 2). */	\
+	andb	$0xFD, (LGUEST_STATE_gdt_table+GDT_ENTRY_TSS*8+5)(%eax);\
+	/* Switch to host's GDT & IDT. */				\
+	lgdt	LGUEST_STATE_host_gdt(%eax);				\
+	lidt	LGUEST_STATE_host_idt(%eax);				\
+	/* Switch to host's stack. */					\
+	movl	LGUEST_STATE_host_stackptr(%eax), %esp;			\
+	/* Switch to host's TSS */					\
+	movl	$(GDT_ENTRY_TSS*8), %eax;				\
+	ltr	%ax;							\
+	/* Restore host regs */						\
+	popl	%ebx;							\
+	popl	%ecx;							\
+	popl	%edx;							\
+	popl	%esi;							\
+	popl	%edi;							\
+	popl	%ebp;							\
+	popl	%gs;							\
+	popl	%eax;							\
+	popl	%fs;							\
+	popl	%ds;							\
+	popl	%es;							\
+	popl	%ss
+	
+/* Return to run_guest_once. */
+return_to_host:
+	SWITCH_TO_HOST
+	iret
+
+deliver_to_host:
+	SWITCH_TO_HOST
+decode_idt_and_jmp:
+	/* Decode IDT and jump to hosts' irq handler.  When that does iret, it
+	 * will return to run_guest_once.  This is a feature. */
+	/* We told gcc we'd clobber edx and eax... */
+	movl	LGUEST_STATE_trapnum(%eax), %eax
+	leal	(%edx,%eax,8), %eax
+	movzwl	(%eax),%edx
+	movl	4(%eax), %eax
+	xorw	%ax, %ax
+	orl	%eax, %edx
+	jmp	*%edx
+
+deliver_to_host_with_errcode:
+	SWITCH_TO_HOST
+	pushl	LGUEST_STATE_errcode(%eax)
+	jmp decode_idt_and_jmp
+
+/* Real hardware interrupts are delivered straight to the host.  Others
+   cause us to return to run_guest_once so it can decide what to do.  Note
+   that some of these are overridden by the guest to deliver directly, and
+   never enter here (see load_guest_idt_entry). */
+.macro IRQ_STUB N TARGET
+	.data; .long 1f; .text; 1:
+ /* Make an error number for most traps, which don't have one. */
+ .if (\N <> 2) && (\N <> 8) && (\N < 10 || \N > 14) && (\N <> 17)
+	pushl	$0
+ .endif
+	pushl	$\N
+	jmp	\TARGET
+	ALIGN
+.endm
+
+.macro IRQ_STUBS FIRST LAST TARGET
+ irq=\FIRST
+ .rept \LAST-\FIRST+1
+	IRQ_STUB irq \TARGET
+  irq=irq+1
+ .endr
+.endm
+	
+/* We intercept every interrupt, because we may need to switch back to
+ * host.  Unfortunately we can't tell them apart except by entry
+ * point, so we need 256 entry points.
+ */
+irq_stubs:
+.data
+default_idt_entries:	
+.text
+	IRQ_STUBS 0 1 return_to_host		/* First two traps */
+	IRQ_STUB 2 deliver_to_host_with_errcode	/* NMI */
+	IRQ_STUBS 3 31 return_to_host		/* Rest of traps */
+	IRQ_STUBS 32 127 deliver_to_host	/* Real interrupts */
+	IRQ_STUB 128 return_to_host		/* System call (overridden) */
+	IRQ_STUBS 129 255 deliver_to_host	/* Other real interrupts */
+
+/* Everything after this is used for the lguest_state structs. */
+ALIGN
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/interrupts_and_traps.c
@@ -0,0 +1,221 @@
+#include <linux/uaccess.h>
+#include "lg.h"
+
+static void push_guest_stack(struct lguest *lg, u32 __user **gstack, u32 val)
+{
+	lhwrite_u32(lg, (u32)--(*gstack), val);
+}
+
+int reflect_trap(struct lguest *lg, const struct host_trap *trap, int has_err)
+{
+	u32 __user *gstack;
+	u32 eflags, ss, irq_enable;
+	struct lguest_regs *regs = &lg->state->regs;
+
+	if (!trap->addr)
+		return 0;
+
+	/* If they want a ring change, we use new stack and push old ss/esp */
+	if ((regs->ss&0x3) != GUEST_DPL) {
+		gstack = (u32 __user *)guest_pa(lg, lg->state->tss.esp1);
+		ss = lg->state->tss.ss1;
+		push_guest_stack(lg, &gstack, regs->ss);
+		push_guest_stack(lg, &gstack, regs->esp);
+	} else {
+		gstack = (u32 __user *)guest_pa(lg, regs->esp);
+		ss = regs->ss;
+	}
+
+	/* We use IF bit in eflags to indicate whether irqs were disabled
+	   (it's always 0, since irqs are enabled when guest is running). */
+	eflags = regs->eflags;
+	get_user(irq_enable, &lg->lguest_data->irq_enabled);
+	eflags |= (irq_enable & 512);
+
+	push_guest_stack(lg, &gstack, eflags);
+	push_guest_stack(lg, &gstack, regs->cs);
+	push_guest_stack(lg, &gstack, regs->eip);
+
+	if (has_err)
+		push_guest_stack(lg, &gstack, regs->errcode);
+
+	/* Change the real stack so hypervisor returns to trap handler */
+	regs->ss = ss;
+	regs->esp = (u32)gstack + lg->page_offset;
+	regs->cs = (__KERNEL_CS|GUEST_DPL);
+	regs->eip = trap->addr;
+
+	/* GS will be neutered on way back to guest. */
+	put_user(0, &lg->lguest_data->gs_gpf_eip);
+
+	/* Disable interrupts for an interrupt gate. */
+	if (trap->disable_interrupts)
+		put_user(0, &lg->lguest_data->irq_enabled);
+	return 1;
+}
+
+void maybe_do_interrupt(struct lguest *lg)
+{
+	unsigned int irq;
+	DECLARE_BITMAP(irqs, LGUEST_IRQS);
+
+	if (!lg->lguest_data)
+		return;
+
+	/* If timer has changed, set timer interrupt. */
+	if (lg->timer_on && jiffies != lg->last_timer)
+		set_bit(0, lg->irqs_pending);
+
+	/* Mask out any interrupts they have blocked. */
+	copy_from_user(&irqs, lg->lguest_data->interrupts, sizeof(irqs));
+	bitmap_andnot(irqs, lg->irqs_pending, irqs, LGUEST_IRQS);
+
+	irq = find_first_bit(irqs, LGUEST_IRQS);
+	if (irq >= LGUEST_IRQS)
+		return;
+
+	/* If they're halted, we re-enable interrupts. */
+	if (lg->halted) {
+		/* Re-enable interrupts. */
+		put_user(512, &lg->lguest_data->irq_enabled);
+		lg->halted = 0;
+	} else {
+		/* Maybe they have interrupts disabled? */
+		u32 irq_enabled;
+		get_user(irq_enabled, &lg->lguest_data->irq_enabled);
+		if (!irq_enabled)
+			return;
+	}
+
+	if (lg->interrupt[irq].addr != 0) {
+		clear_bit(irq, lg->irqs_pending);
+		reflect_trap(lg, &lg->interrupt[irq], 0);
+	}
+}
+
+void check_bug_kill(struct lguest *lg)
+{
+#ifdef CONFIG_BUG
+	u32 eip = lg->state->regs.eip - PAGE_OFFSET;
+	u16 insn;
+
+	/* This only works for addresses in linear mapping... */
+	if (lg->state->regs.eip < PAGE_OFFSET)
+		return;
+	lhread(lg, &insn, eip, sizeof(insn));
+	if (insn == 0x0b0f) {
+#ifdef CONFIG_DEBUG_BUGVERBOSE
+		u16 l;
+		u32 f;
+		char file[128];
+		lhread(lg, &l, eip+sizeof(insn), sizeof(l));
+		lhread(lg, &f, eip+sizeof(insn)+sizeof(l), sizeof(f));
+		lhread(lg, file, f - PAGE_OFFSET, sizeof(file));
+		file[sizeof(file)-1] = 0;
+		kill_guest(lg, "BUG() at %#x %s:%u", eip, file, l);
+#else
+		kill_guest(lg, "BUG() at %#x", eip);
+#endif	/* CONFIG_DEBUG_BUGVERBOSE */
+	}
+#endif	/* CONFIG_BUG */
+}
+
+static void copy_trap(struct lguest *lg,
+		      struct host_trap *trap,
+		      const struct desc_struct *desc)
+{
+	u8 type = ((desc->b >> 8) & 0xF);
+
+	/* Not present? */
+	if (!(desc->b & 0x8000)) {
+		trap->addr = 0;
+		return;
+	}
+	if (type != 0xE && type != 0xF)
+		kill_guest(lg, "bad IDT type %i", type);
+	trap->disable_interrupts = (type == 0xE);
+	trap->addr = ((desc->a & 0x0000FFFF) | (desc->b & 0xFFFF0000));
+}
+
+/* FIXME: Put this in hypervisor.S and do something clever with relocs? */
+static u8 tramp[] 
+= { 0x0f, 0xa8, 0x0f, 0xa9, /* push %gs; pop %gs */
+    0x36, 0xc7, 0x05, 0x55, 0x55, 0x55, 0x55, 0x00, 0x00, 0x00, 0x00,
+    /* movl 0, %ss:lguest_data.gs_gpf_eip */
+    0xe9, 0x55, 0x55, 0x55, 0x55 /* jmp dstaddr */
+};
+#define TRAMP_MOVL_TARGET_OFF 7
+#define TRAMP_JMP_TARGET_OFF 16
+
+static u32 setup_trampoline(struct lguest *lg, unsigned int i, u32 dstaddr)
+{
+	u32 addr, off;
+
+	off = sizeof(tramp)*i;
+	memcpy(lg->trap_page + off, tramp, sizeof(tramp));
+
+	/* 0 is to be placed in lguest_data.gs_gpf_eip. */
+	addr = (u32)&lg->lguest_data->gs_gpf_eip + lg->page_offset;
+	memcpy(lg->trap_page + off + TRAMP_MOVL_TARGET_OFF, &addr, 4);
+
+	/* Address is relative to where end of jmp will be. */
+	addr = dstaddr - ((-4*1024*1024) + off + sizeof(tramp));
+	memcpy(lg->trap_page + off + TRAMP_JMP_TARGET_OFF, &addr, 4);
+	return (-4*1024*1024) + off;
+}
+
+/* We bounce through the trap page, for two reasons: firstly, we need
+   the interrupt destination always mapped, to avoid double faults,
+   secondly we want to reload %gs to make it innocuous on entering kernel.
+ */
+static void setup_idt(struct lguest *lg,
+		      unsigned int i,
+		      const struct desc_struct *desc)
+{
+	u8 type = ((desc->b >> 8) & 0xF);
+	u32 taddr;
+
+	/* Not present? */
+	if (!(desc->b & 0x8000)) {
+		/* FIXME: When we need this, we'll know... */
+		if (lg->state->idt_table[i].a & 0x8000)
+			kill_guest(lg, "removing interrupts not supported");
+		return;
+	}
+
+	/* We could reflect and disable interrupts, but guest can do itself. */
+	if (type != 0xF)
+		kill_guest(lg, "bad direct IDT %i type %i", i, type);
+
+	taddr = setup_trampoline(lg, i, (desc->a&0xFFFF)|(desc->b&0xFFFF0000));
+
+	lg->state->idt_table[i].a = (((__KERNEL_CS|GUEST_DPL)<<16)
+					| (taddr & 0x0000FFFF));
+	lg->state->idt_table[i].b = (desc->b&0xEF00)|(taddr&0xFFFF0000);
+}
+
+void load_guest_idt_entry(struct lguest *lg, unsigned int i, u32 low, u32 high)
+{
+	struct desc_struct d = { low, high };
+
+	/* Ignore NMI, doublefault, hypercall, spurious interrupt. */
+	if (i == 2 || i == 8 || i == 15 || i == LGUEST_TRAP_ENTRY)
+		return;
+	/* FIXME: We should handle debug and int3 */
+	else if (i == 1 || i == 3)
+		return;
+	/* We intercept page fault, general protection fault and fpu missing */
+	else if (i == 13)
+		copy_trap(lg, &lg->gpf_trap, &d);
+	else if (i == 14)
+		copy_trap(lg, &lg->page_trap, &d);
+	else if (i == 7)
+		copy_trap(lg, &lg->fpu_trap, &d);
+	/* Other traps go straight to guest. */
+	else if (i < FIRST_EXTERNAL_VECTOR || i == SYSCALL_VECTOR)
+		setup_idt(lg, i, &d);
+	/* A virtual interrupt */
+	else if (i < FIRST_EXTERNAL_VECTOR + LGUEST_IRQS)
+		copy_trap(lg, &lg->interrupt[i-FIRST_EXTERNAL_VECTOR], &d);
+}
+
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/io.c
@@ -0,0 +1,413 @@
+/* Simple I/O model for guests, based on shared memory.
+ * Copyright (C) 2006 Rusty Russell IBM Corporation
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+#include <linux/types.h>
+#include <linux/futex.h>
+#include <linux/jhash.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/uaccess.h>
+#include "lg.h"
+
+static struct list_head dma_hash[64];
+
+/* FIXME: allow multi-page lengths. */
+static int check_dma_list(struct lguest *lg, const struct lguest_dma *dma)
+{
+	unsigned int i;
+
+	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) {
+		if (!dma->len[i])
+			return 1;
+		if (!lguest_address_ok(lg, dma->addr[i]))
+			goto kill;
+		if (dma->len[i] > PAGE_SIZE)
+			goto kill;
+		/* We could do over a page, but is it worth it? */
+		if ((dma->addr[i] % PAGE_SIZE) + dma->len[i] > PAGE_SIZE)
+			goto kill;
+	}
+	return 1;
+
+kill:
+	kill_guest(lg, "bad DMA entry: %u@%#x", dma->len[i], dma->addr[i]);
+	return 0;
+}
+
+static unsigned int hash(const union futex_key *key)
+{
+	return jhash2((u32*)&key->both.word,
+		      (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
+		      key->both.offset)
+		% ARRAY_SIZE(dma_hash);
+}
+
+/* Must hold read lock on dmainfo owner's current->mm->mmap_sem */
+static void unlink_dma(struct lguest_dma_info *dmainfo)
+{
+	BUG_ON(down_trylock(&lguest_lock) == 0);
+	dmainfo->interrupt = 0;
+	list_del(&dmainfo->list);
+	drop_futex_key_refs(&dmainfo->key);
+}
+
+static inline int key_eq(const union futex_key *a, const union futex_key *b)
+{
+	return (a->both.word == b->both.word
+		&& a->both.ptr == b->both.ptr
+		&& a->both.offset == b->both.offset);
+}
+
+static u32 unbind_dma(struct lguest *lg,
+		      const union futex_key *key,
+		      unsigned long dmas)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < LGUEST_MAX_DMA; i++) {
+		if (key_eq(key, &lg->dma[i].key) && dmas == lg->dma[i].dmas) {
+			unlink_dma(&lg->dma[i]);
+			ret = 1;
+			break;
+		}
+	}
+	return ret;
+}
+
+u32 bind_dma(struct lguest *lg,
+	     unsigned long addr, unsigned long dmas, u16 numdmas, u8 interrupt)
+{
+	unsigned int i;
+	u32 ret = 0;
+	union futex_key key;
+
+	if (interrupt >= LGUEST_IRQS)
+		return 0;
+
+	down(&lguest_lock);
+	down_read(&current->mm->mmap_sem);
+	if (get_futex_key((u32 __user *)addr, &key) != 0) {
+		kill_guest(lg, "bad dma address %#lx", addr);
+		goto unlock;
+	}
+	get_futex_key_refs(&key);
+
+	if (interrupt == 0)
+		ret = unbind_dma(lg, &key, dmas);
+	else {
+		for (i = 0; i < LGUEST_MAX_DMA; i++) {
+			if (lg->dma[i].interrupt == 0) {
+				lg->dma[i].dmas = dmas;
+				lg->dma[i].num_dmas = numdmas;
+				lg->dma[i].next_dma = 0;
+				lg->dma[i].key = key;
+				lg->dma[i].guestid = lg->guestid;
+				lg->dma[i].interrupt = interrupt;
+				list_add(&lg->dma[i].list,
+					 &dma_hash[hash(&key)]);
+				ret = 1;
+				goto unlock;
+			}
+		}
+	}
+	drop_futex_key_refs(&key);
+unlock:
+ 	up_read(&current->mm->mmap_sem);
+	up(&lguest_lock);
+	return ret;
+}
+
+/* lhread from another guest */
+static int lhread_other(struct lguest *lg,
+			void *buf, u32 addr, unsigned bytes)
+{
+	if (addr + bytes < addr
+	    || !lguest_address_ok(lg, addr+bytes)
+	    || access_process_vm(lg->tsk, addr, buf, bytes, 0) != bytes) {
+		memset(buf, 0, bytes);
+		kill_guest(lg, "bad address in registered DMA struct");
+		return 0;
+	}
+	return 1;
+}
+
+/* lhwrite to another guest */
+static int lhwrite_other(struct lguest *lg, u32 addr,
+			 const void *buf, unsigned bytes)
+{
+	if (addr + bytes < addr
+	    || !lguest_address_ok(lg, addr+bytes)
+	    || (access_process_vm(lg->tsk, addr, (void *)buf, bytes, 1)
+		!= bytes)) {
+		kill_guest(lg, "bad address writing to registered DMA");
+		return 0;
+	}
+	return 1;
+}
+
+static u32 copy_data(const struct lguest_dma *src,
+		     const struct lguest_dma *dst,
+		     struct page *pages[])
+{
+	unsigned int totlen, si, di, srcoff, dstoff;
+	void *maddr = NULL;
+
+	totlen = 0;
+	si = di = 0;
+	srcoff = dstoff = 0;
+	while (si < LGUEST_MAX_DMA_SECTIONS && src->len[si]
+	       && di < LGUEST_MAX_DMA_SECTIONS && dst->len[di]) {
+		u32 len = min(src->len[si] - srcoff, dst->len[di] - dstoff);
+
+		if (!maddr)
+			maddr = kmap(pages[di]);
+
+		/* FIXME: This is not completely portable, since
+		   archs do different things for copy_to_user_page. */
+		if (copy_from_user(maddr + (dst->addr[di] + dstoff)%PAGE_SIZE,
+				   (void *__user)src->addr[si], len) != 0) {
+			totlen = 0;
+			break;
+		}
+
+		totlen += len;
+		srcoff += len;
+		dstoff += len;
+		if (srcoff == src->len[si]) {
+			si++;
+			srcoff = 0;
+		}
+		if (dstoff == dst->len[di]) {
+			kunmap(pages[di]);
+			maddr = NULL;
+			di++;
+			dstoff = 0;
+		}
+	}
+
+	if (maddr)
+		kunmap(pages[di]);
+
+	return totlen;
+}
+
+/* Src is us, ie. current. */
+static u32 do_dma(struct lguest *srclg, const struct lguest_dma *src,
+		  struct lguest *dstlg, const struct lguest_dma *dst)
+{
+	int i;
+	u32 ret;
+	struct page *pages[LGUEST_MAX_DMA_SECTIONS];
+
+	if (!check_dma_list(dstlg, dst) || !check_dma_list(srclg, src))
+		return 0;
+
+	/* First get the destination pages */
+	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) {
+		if (dst->len[i] == 0)
+			break;
+		if (get_user_pages(dstlg->tsk, dstlg->mm,
+				   dst->addr[i], 1, 1, 1, pages+i, NULL)
+		    != 1) {
+			ret = 0;
+			goto drop_pages;
+		}
+	}
+
+	/* Now copy until we run out of src or dst. */
+	ret = copy_data(src, dst, pages);
+
+drop_pages:
+	while (--i >= 0)
+		put_page(pages[i]);
+	return ret;
+}
+
+/* We cache one process to wakeup: helps for batching & wakes outside locks. */
+void set_wakeup_process(struct lguest *lg, struct task_struct *p)
+{
+	if (p == lg->wake)
+		return;
+
+	if (lg->wake) {
+		wake_up_process(lg->wake);
+		put_task_struct(lg->wake);
+	}
+	lg->wake = p;
+	if (lg->wake)
+		get_task_struct(lg->wake);
+}
+
+static int dma_transfer(struct lguest *srclg,
+			unsigned long udma,
+			struct lguest_dma_info *dst)
+{
+	struct lguest_dma dst_dma, src_dma;
+	struct lguest *dstlg;
+	u32 i, dma = 0;
+
+	dstlg = &lguests[dst->guestid];
+	/* Get our dma list. */
+	lhread(srclg, &src_dma, udma, sizeof(src_dma));
+
+	/* We can't deadlock against them dmaing to us, because this
+	 * is all under the lguest_lock. */
+	down_read(&dstlg->mm->mmap_sem);
+
+	for (i = 0; i < dst->num_dmas; i++) {
+		dma = (dst->next_dma + i) % dst->num_dmas;
+		if (!lhread_other(dstlg, &dst_dma,
+				  dst->dmas + dma * sizeof(struct lguest_dma),
+				  sizeof(dst_dma))) {
+			goto fail;
+		}
+		if (!dst_dma.used_len)
+			break;
+	}
+	if (i != dst->num_dmas) {
+		unsigned long used_lenp;
+		unsigned int ret;
+
+		ret = do_dma(srclg, &src_dma, dstlg, &dst_dma);
+		/* Put used length in src. */
+		lhwrite_u32(srclg,
+			    udma+offsetof(struct lguest_dma, used_len), ret);
+		if (ret == 0 && src_dma.len[0] != 0)
+			goto fail;
+
+		/* Make sure destination sees contents before length. */
+		mb();
+		used_lenp = dst->dmas
+			+ dma * sizeof(struct lguest_dma)
+			+ offsetof(struct lguest_dma, used_len);
+		lhwrite_other(dstlg, used_lenp, &ret, sizeof(ret));
+		dst->next_dma++;
+	}
+ 	up_read(&dstlg->mm->mmap_sem);
+
+	/* Do this last so dst doesn't simply sleep on lock. */
+	set_bit(dst->interrupt, dstlg->irqs_pending);
+	set_wakeup_process(srclg, dstlg->tsk);
+	return i == dst->num_dmas;
+
+fail:
+	up_read(&dstlg->mm->mmap_sem);
+	return 0;
+}
+
+int send_dma(struct lguest *lg, unsigned long addr, unsigned long udma)
+{
+	union futex_key key;
+	int pending = 0, empty = 0;
+
+again:
+	down(&lguest_lock);
+	down_read(&current->mm->mmap_sem);
+	if (get_futex_key((u32 __user *)addr, &key) != 0) {
+		kill_guest(lg, "bad sending DMA address");
+		goto unlock;
+	}
+	/* Shared mapping?  Look for other guests... */
+	if (key.shared.offset & 1) {
+		struct lguest_dma_info *i, *n;
+		list_for_each_entry_safe(i, n, &dma_hash[hash(&key)], list) {
+			if (i->guestid == lg->guestid)
+				continue;
+			if (!key_eq(&key, &i->key))
+				continue;
+
+			empty += dma_transfer(lg, udma, i);
+			break;
+		}
+		if (empty == 1) {
+			/* Give any recipients one chance to restock. */
+			up_read(&current->mm->mmap_sem);
+			up(&lguest_lock);
+			yield();
+			empty++;
+			goto again;
+		}
+		pending = 0;
+	} else {
+		/* Private mapping: tell our userspace. */
+		lg->dma_is_pending = 1;
+		lg->pending_dma = udma;
+		lg->pending_addr = addr;
+		pending = 1;
+	}
+unlock:
+	up_read(&current->mm->mmap_sem);
+	up(&lguest_lock);
+	return pending;
+}
+
+void release_all_dma(struct lguest *lg)
+{
+	unsigned int i;
+
+	BUG_ON(down_trylock(&lguest_lock) == 0);
+
+	down_read(&lg->mm->mmap_sem);
+	for (i = 0; i < LGUEST_MAX_DMA; i++) {
+		if (lg->dma[i].interrupt)
+			unlink_dma(&lg->dma[i]);
+	}
+	up_read(&lg->mm->mmap_sem);
+}
+
+/* Userspace wants a dma buffer from this guest. */
+unsigned long get_dma_buffer(struct lguest *lg,
+			     unsigned long addr, unsigned long *interrupt)
+{
+	unsigned long ret = 0;
+	union futex_key key;
+	struct lguest_dma_info *i;
+
+	down(&lguest_lock);
+	down_read(&current->mm->mmap_sem);
+	if (get_futex_key((u32 __user *)addr, &key) != 0) {
+		kill_guest(lg, "bad registered DMA buffer");
+		goto unlock;
+	}
+	list_for_each_entry(i, &dma_hash[hash(&key)], list) {
+		if (key_eq(&key, &i->key) && i->guestid == lg->guestid) {
+			unsigned int j;
+			for (j = 0; j < i->num_dmas; j++) {
+				struct lguest_dma dma;
+
+				ret = i->dmas + j * sizeof(struct lguest_dma);
+				lhread(lg, &dma, ret, sizeof(dma));
+				if (dma.used_len == 0)
+					break;
+			}
+			*interrupt = i->interrupt;
+			break;
+		}
+	}
+unlock:
+	up_read(&current->mm->mmap_sem);
+	up(&lguest_lock);
+	return ret;
+}
+
+void lguest_io_init(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(dma_hash); i++)
+		INIT_LIST_HEAD(&dma_hash[i]);
+}
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/lg.h
@@ -0,0 +1,274 @@
+#ifndef _LGUEST_H
+#define _LGUEST_H
+
+#include <asm/desc.h>
+/* 64k ought to be enough for anybody! */
+#define HYPERVISOR_SIZE 65536
+#define HYPERVISOR_PAGES (HYPERVISOR_SIZE/PAGE_SIZE)
+
+#define GDT_ENTRY_LGUEST_CS	10
+#define GDT_ENTRY_LGUEST_DS	11
+#define LGUEST_CS		(GDT_ENTRY_LGUEST_CS * 8)
+#define LGUEST_DS		(GDT_ENTRY_LGUEST_DS * 8)
+
+#if 0
+/* FIXME: Use asm-offsets here... */
+#define LGUEST_TSS_OFF		0
+#define LGUEST_TSS_SIZE		(26*4)
+#define LGUEST_GDT_OFF		(LGUEST_TSS_OFF + LGUEST_TSS_SIZE)
+#define LGUEST_GDTABLE_OFF	(LGUEST_GDT_OFF + 8)
+#define LGUEST_GDTABLE_SIZE	(8 * GDT_ENTRIES)
+#define LGUEST_IDT_OFF		(LGUEST_GDTABLE_OFF + LGUEST_GDTABLE_SIZE)
+#define LGUEST_IDTABLE_SIZE	(8 * IDT_ENTRIES)
+#define LGUEST_IDTABLE_OFF	(LGUEST_IDT_OFF + 8)
+#define LGUEST_HOST_OFF		(LGUEST_IDTABLE_OFF + LGUEST_IDTABLE_SIZE)
+#define LGUEST_HOST_GDT_OFF	LGUEST_HOST_OFF
+#define LGUEST_HOST_IDT_OFF	(LGUEST_HOST_OFF + 8)
+#define LGUEST_HOST_PGDIR_OFF	(LGUEST_HOST_IDT_OFF + 8)
+#define LGUEST_HOST_STKP_OFF	(LGUEST_HOST_PGDIR_OFF + 4)
+#define LGUEST_HOST_SIZE	(8+8+4+4)
+#define LGUEST_REGS_OFF		(LGUEST_HOST_OFF + LGUEST_HOST_SIZE)	
+#define LGUEST_TRAPNUM_OFF	(LGUEST_REGS_OFF + 12*4)
+#define LGUEST_ERRCODE_OFF	(LGUEST_REGS_OFF + 13*4)
+#endif
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/stringify.h>
+#include <linux/binfmts.h>
+#include <linux/futex.h>
+#include <asm/lguest.h>
+#include <asm/lguest_user.h>
+#include <asm/semaphore.h>
+#include "irq_vectors.h"
+
+#define GUEST_DPL 1
+
+struct lguest_regs
+{
+	/* Manually saved part. */
+	u32 cr3;
+	u32 ebx, ecx, edx;
+	u32 esi, edi, ebp;
+	u32 gs;
+	u32 eax;
+	u32 fs, ds, es;
+	u32 trapnum, errcode;
+	/* Trap pushed part */
+	u32 eip;
+	u32 cs;
+	u32 eflags;
+	u32 esp;
+	u32 ss;
+};
+
+__exit void free_pagetables(void);
+__init int init_pagetables(struct page *hype_pages);
+
+/* Full 4G segment descriptors, suitable for CS and DS. */
+#define FULL_EXEC_SEGMENT ((struct desc_struct){0x0000ffff, 0x00cf9b00}) 
+#define FULL_SEGMENT ((struct desc_struct){0x0000ffff, 0x00cf9300}) 
+
+/* Simplified version of IDT. */
+struct host_trap
+{
+	unsigned long addr;
+	int disable_interrupts;
+};
+
+struct lguest_dma_info
+{
+	struct list_head list;
+	union futex_key key;
+	unsigned long dmas;
+	u16 next_dma;
+	u16 num_dmas;
+	u16 guestid;
+	u8 interrupt; 	/* 0 when not registered */
+};
+
+struct pgdir
+{
+	u32 cr3;
+	u32 *pgdir;
+};
+
+/* The private info the thread maintains about the guest. */
+struct lguest
+{
+	struct lguest_state *state;
+	struct lguest_data __user *lguest_data;
+	struct task_struct *tsk;
+	struct mm_struct *mm; 	/* == tsk->mm, but that becomes NULL on exit */
+	u16 guestid;
+	u32 pfn_limit;
+	u32 page_offset;
+	u32 cr2;
+	int timer_on;
+	int halted;
+	int ts;
+	u32 gpf_eip;
+	u32 last_timer;
+	u32 next_hcall;
+	u16 tls_limits[GDT_ENTRY_TLS_ENTRIES];
+
+	/* We keep a small number of these. */
+	u32 pgdidx;
+	struct pgdir pgdirs[4];
+	void *trap_page;
+
+	/* Cached wakeup: we hold a reference to this task. */
+	struct task_struct *wake;
+
+	unsigned long noirq_start, noirq_end;
+	int dma_is_pending;
+	unsigned long pending_dma; /* struct lguest_dma */
+	unsigned long pending_addr; /* address they're sending to */
+
+	unsigned int stack_pages;
+
+	struct lguest_dma_info dma[LGUEST_MAX_DMA];
+
+	/* Dead? */
+	const char *dead;
+
+	/* We intercept page fault (demand shadow paging & cr2 saving)
+	   protection fault (in/out emulation, TLS handling) and
+	   device not available (TS handling). */
+	struct host_trap page_trap, gpf_trap, fpu_trap;
+
+	/* Virtual interrupts */
+	DECLARE_BITMAP(irqs_pending, LGUEST_IRQS);
+	struct host_trap interrupt[LGUEST_IRQS];
+};
+
+extern struct page *hype_pages; /* Contiguous pages. */
+extern struct lguest lguests[];
+extern struct semaphore lguest_lock;
+
+/* core.c: */
+/* Entry points in hypervisor */
+const unsigned long *__lguest_default_idt_entries(void);
+struct lguest_state *__lguest_states(void);
+u32 lhread_u32(struct lguest *lg, u32 addr);
+void lhwrite_u32(struct lguest *lg, u32 val, u32 addr);
+void lhread(struct lguest *lg, void *buf, u32 addr, unsigned bytes);
+void lhwrite(struct lguest *lg, u32 addr, const void *buf, unsigned bytes);
+int lguest_address_ok(const struct lguest *lg, unsigned long addr);
+int run_guest(struct lguest *lg, char *__user user);
+int find_free_guest(void);
+
+/* interrupts_and_traps.c: */
+void maybe_do_interrupt(struct lguest *lg);
+int reflect_trap(struct lguest *lg, const struct host_trap *trap, int has_err);
+void check_bug_kill(struct lguest *lg);
+void load_guest_idt_entry(struct lguest *lg, unsigned int i, u32 low, u32 hi);
+
+/* segments.c: */
+void load_guest_gdt(struct lguest *lg, u32 table, u32 num);
+void guest_load_tls(struct lguest *lg,
+		    const struct desc_struct __user *tls_array);
+
+int init_guest_pagetable(struct lguest *lg, u32 pgtable);
+void free_guest_pagetable(struct lguest *lg);
+void guest_new_pagetable(struct lguest *lg, u32 pgtable);
+void guest_set_pud(struct lguest *lg, unsigned long cr3, u32 i);
+void guest_pagetable_clear_all(struct lguest *lg);
+void guest_pagetable_flush_user(struct lguest *lg);
+void guest_set_pte(struct lguest *lg, unsigned long cr3,
+		   unsigned long vaddr, u32 val);
+void map_trap_page(struct lguest *info);
+int demand_page(struct lguest *info, u32 cr2, int write);
+void pin_stack_pages(struct lguest *lg);
+
+int lguest_device_init(void);
+void lguest_device_remove(void);
+void lguest_io_init(void);
+u32 bind_dma(struct lguest *lg,
+	     unsigned long addr, unsigned long udma, u16 numdmas,u8 interrupt);
+int send_dma(struct lguest *info, unsigned long addr,
+	     unsigned long udma);
+void release_all_dma(struct lguest *lg);
+unsigned long get_dma_buffer(struct lguest *lg, unsigned long addr,
+			     unsigned long *interrupt);
+
+void set_wakeup_process(struct lguest *lg, struct task_struct *p);
+int do_async_hcalls(struct lguest *info);
+int hypercall(struct lguest *info, struct lguest_regs *regs);
+
+#define kill_guest(lg, fmt...)					\
+do {								\
+	if (!(lg)->dead) {					\
+		(lg)->dead = kasprintf(GFP_ATOMIC, fmt);	\
+		if (!(lg)->dead)				\
+			(lg)->dead = (void *)1;			\
+	}							\
+} while(0)
+
+static inline unsigned long guest_pa(struct lguest *lg, unsigned long vaddr)
+{
+	return vaddr - lg->page_offset;
+}
+
+/* Hardware-defined TSS structure. */
+struct x86_tss
+{
+	unsigned short	back_link,__blh;
+	unsigned long	esp0;
+	unsigned short	ss0,__ss0pad;
+	unsigned long	esp1;
+	unsigned short	ss1,__ss1pad;
+	unsigned long	esp2;
+	unsigned short	ss2,__ss2pad;
+	unsigned long	cr3;
+	unsigned long	eip;
+	unsigned long	eflags;
+	unsigned long	eax,ecx,edx,ebx;
+	unsigned long	esp; /* We actually use this one to save esp. */
+	unsigned long	ebp;
+	unsigned long	esi;
+	unsigned long	edi;
+	unsigned short	es, __espad;
+	unsigned short	cs, __cspad;
+	unsigned short	ss, __sspad;
+	unsigned short	ds, __dspad;
+	unsigned short	fs, __fspad;
+	unsigned short	gs, __gspad;
+	unsigned short	ldt, __ldtpad;
+	unsigned short	trace, io_bitmap_base;
+};
+
+int fixup_gdt_table(struct desc_struct *gdt, unsigned int num,
+		    struct lguest_regs *regs, struct x86_tss *tss);
+
+struct lguest_host_state
+{
+	struct Xgt_desc_struct	gdt;
+	struct Xgt_desc_struct	idt;
+	unsigned long		pgdir;
+	unsigned long		stackptr;
+};
+
+/* This sits in the high-mapped shim. */
+struct lguest_state
+{
+	/* Task struct. */
+	struct x86_tss tss;
+
+	/* Gate descriptor table. */
+	struct Xgt_desc_struct gdt;
+	struct desc_struct gdt_table[GDT_ENTRIES];
+
+	/* Interrupt descriptor table. */
+	struct Xgt_desc_struct idt;
+	struct desc_struct idt_table[IDT_ENTRIES];
+
+	/* Host state we store while the guest runs. */
+	struct lguest_host_state host;
+
+	/* This is the stack on which we push our regs. */
+	struct lguest_regs regs;
+};
+#endif	/* __ASSEMBLY__ */
+#endif	/* _LGUEST_H */
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/lguest.c
@@ -0,0 +1,595 @@
+/*
+ * Lguest specific paravirt-ops implementation
+ *
+ * Copyright (C) 2006, Rusty Russell <rusty@rustcorp.com.au> IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+#include <linux/kernel.h>
+#include <linux/start_kernel.h>
+#include <linux/string.h>
+#include <linux/console.h>
+#include <linux/screen_info.h>
+#include <linux/irq.h>
+#include <linux/interrupt.h>
+#include <linux/clocksource.h>
+#include <asm/paravirt.h>
+#include <asm/lguest.h>
+#include <asm/lguest_user.h>
+#include <asm/param.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <asm/desc.h>
+#include <asm/setup.h>
+#include <asm/e820.h>
+#include <asm/pda.h>
+#include <asm/asm-offsets.h>
+
+extern int mce_disabled;
+
+struct lguest_data lguest_data;
+struct lguest_device_desc *lguest_devices;
+static __initdata const struct lguest_boot_info *boot = __va(0);
+
+void async_hcall(unsigned long call,
+		 unsigned long arg1, unsigned long arg2, unsigned long arg3)
+{
+	/* Note: This code assumes we're uniprocessor. */
+	static unsigned int next_call;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (lguest_data.hcall_status[next_call] != 0xFF) {
+		/* Table full, so do normal hcall which will flush table. */
+		hcall(call, arg1, arg2, arg3);
+	} else {
+		lguest_data.hcalls[next_call].eax = call;
+		lguest_data.hcalls[next_call].edx = arg1;
+		lguest_data.hcalls[next_call].ebx = arg2;
+		lguest_data.hcalls[next_call].ecx = arg3;
+		wmb();
+		lguest_data.hcall_status[next_call] = 0;
+		if (++next_call == LHCALL_RING_SIZE)
+			next_call = 0;
+	}
+	local_irq_restore(flags);
+}
+
+#ifdef PARAVIRT_LAZY_NONE 	/* Not in 2.6.20. */
+static int lazy_mode;
+static void fastcall lguest_lazy_mode(int mode)
+{
+	lazy_mode = mode;
+	if (mode == PARAVIRT_LAZY_NONE)
+		hcall(LHCALL_FLUSH_ASYNC, 0, 0, 0);
+}
+
+static void lazy_hcall(unsigned long call,
+		       unsigned long arg1,
+		       unsigned long arg2,
+		       unsigned long arg3)
+{
+	if (lazy_mode == PARAVIRT_LAZY_NONE)
+		hcall(call, arg1, arg2, arg3);
+	else
+		async_hcall(call, arg1, arg2, arg3);
+}
+#else
+#define lazy_hcall hcall
+#endif
+
+static unsigned long fastcall save_fl(void)
+{
+	return lguest_data.irq_enabled;
+}
+
+static void fastcall restore_fl(unsigned long flags)
+{
+	/* FIXME: Check if interrupt pending... */
+	lguest_data.irq_enabled = flags;
+}
+
+static void fastcall irq_disable(void)
+{
+	lguest_data.irq_enabled = 0;
+}
+
+static void fastcall irq_enable(void)
+{
+	/* Linux i386 code expects bit 9 set. */
+	/* FIXME: Check if interrupt pending... */
+	lguest_data.irq_enabled = 512;
+}
+
+static void fastcall lguest_load_gdt(const struct Xgt_desc_struct *desc)
+{
+	BUG_ON((desc->size+1)/8 != GDT_ENTRIES);
+	hcall(LHCALL_LOAD_GDT, __pa(desc->address), GDT_ENTRIES, 0);
+}
+
+static void fastcall lguest_load_idt(const struct Xgt_desc_struct *desc)
+{
+	unsigned int i;
+	struct desc_struct *idt = (void *)desc->address;
+
+	for (i = 0; i < (desc->size+1)/8; i++)
+		hcall(LHCALL_LOAD_IDT_ENTRY, i, idt[i].a, idt[i].b);
+}
+
+static int lguest_panic(struct notifier_block *nb, unsigned long l, void *p)
+{
+	hcall(LHCALL_CRASH, __pa(p), 0, 0);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block paniced = {
+	.notifier_call = lguest_panic
+};
+
+static cycle_t lguest_clock_read(void)
+{
+	/* FIXME: This is just the native one.  Account stolen time! */
+	return paravirt_ops.read_tsc();
+}
+
+/* FIXME: Update iff tsc rate changes. */
+static struct clocksource lguest_clock = {
+	.name			= "lguest",
+	.rating			= 400,
+	.read			= lguest_clock_read,
+	.mask			= CLOCKSOURCE_MASK(64),
+	.mult			= 0, /* to be set */
+	.shift			= 22,
+	.is_continuous		= 1,
+};
+
+static char *lguest_memory_setup(void)
+{
+	/* We do these here because lockcheck barfs if before start_kernel */
+	atomic_notifier_chain_register(&panic_notifier_list, &paniced);
+	lguest_clock.mult = lguest_data.clock_mult;
+	clocksource_register(&lguest_clock);
+
+	e820.nr_map = 0;
+	add_memory_region(0, PFN_PHYS(boot->max_pfn), E820_RAM);
+	return "LGUEST";
+}
+
+static fastcall void lguest_cpuid(unsigned int *eax, unsigned int *ebx,
+				 unsigned int *ecx, unsigned int *edx)
+{
+	int is_feature = (*eax == 1);
+
+	asm volatile ("cpuid"
+		      : "=a" (*eax),
+			"=b" (*ebx),
+			"=c" (*ecx),
+			"=d" (*edx)
+		      : "0" (*eax), "2" (*ecx));
+
+	if (is_feature) {
+		unsigned long *excap = (unsigned long *)ecx,
+			*features = (unsigned long *)edx;
+		/* Hypervisor needs to know when we flush kernel pages. */
+		set_bit(X86_FEATURE_PGE, features);
+		/* We don't have any features! */
+		clear_bit(X86_FEATURE_VME, features);
+		clear_bit(X86_FEATURE_DE, features);
+		clear_bit(X86_FEATURE_PSE, features);
+		clear_bit(X86_FEATURE_PAE, features);
+		clear_bit(X86_FEATURE_SEP, features);
+		clear_bit(X86_FEATURE_APIC, features);
+		clear_bit(X86_FEATURE_MTRR, features);
+		/* No MWAIT, either */
+		clear_bit(3, excap);
+	}
+}
+
+static unsigned long current_cr3;
+static void fastcall lguest_write_cr3(unsigned long cr3)
+{
+	hcall(LHCALL_NEW_PGTABLE, cr3, 0, 0);
+	current_cr3 = cr3;
+}
+
+static void fastcall lguest_flush_tlb(void)
+{
+	lazy_hcall(LHCALL_FLUSH_TLB, 0, 0, 0);
+}
+
+static void fastcall lguest_flush_tlb_kernel(void)
+{
+	lazy_hcall(LHCALL_FLUSH_TLB, 1, 0, 0);
+}
+
+static void fastcall lguest_flush_tlb_single(u32 addr)
+{
+	/* Simply set it to zero, and it will fault back in. */
+	lazy_hcall(LHCALL_SET_PTE, current_cr3, addr, 0);
+}
+
+/* FIXME: Eliminate all callers of this. */
+static fastcall void lguest_set_pte(pte_t *ptep, pte_t pteval)
+{
+	*ptep = pteval;
+	/* Don't bother with hypercall before initial setup. */
+	if (current_cr3)
+		hcall(LHCALL_SET_UNKNOWN_PTE, 0, 0, 0);
+}
+
+static fastcall void lguest_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pteval)
+{
+	*ptep = pteval;
+	lazy_hcall(LHCALL_SET_PTE, __pa(mm->pgd), addr, pteval.pte_low);
+}
+
+/* We only support two-level pagetables at the moment. */
+static fastcall void lguest_set_pud(pmd_t *pmdp, pmd_t pmdval)
+{
+	*pmdp = pmdval;
+	lazy_hcall(LHCALL_SET_PUD, __pa(pmdp)&PAGE_MASK,
+		   (__pa(pmdp)&(PAGE_SIZE-1))/4, 0);
+}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+static fastcall void lguest_apic_write(unsigned long reg, unsigned long v)
+{
+}
+
+static fastcall void lguest_apic_write_atomic(unsigned long reg, unsigned long v)
+{
+}
+
+static fastcall unsigned long lguest_apic_read(unsigned long reg)
+{
+	return 0;
+}
+#endif
+
+/* We move eflags word to lguest_data.irq_enabled to restore interrupt
+   state.  For page faults, gpfs and virtual interrupts, the
+   hypervisor has saved eflags manually, otherwise it was delivered
+   directly and so eflags reflects the real machine IF state,
+   ie. interrupts on.  Since the kernel always dies if it takes such a
+   trap with interrupts disabled anyway, turning interrupts back on
+   unconditionally here is OK. */
+asm("lguest_iret:"
+    " pushl	%eax;"
+    " movl	12(%esp), %eax;"
+    "lguest_noirq_start:;"
+    " movl	%eax,%ss:lguest_data+"__stringify(LGUEST_DATA_irq_enabled)";"
+    " popl	%eax;"
+    " iret;"
+    "lguest_noirq_end:");
+extern void fastcall lguest_iret(void);
+extern char lguest_noirq_start[], lguest_noirq_end[];
+
+static void fastcall lguest_load_esp0(struct tss_struct *tss,
+				     struct thread_struct *thread)
+{
+	lazy_hcall(LHCALL_SET_STACK, __KERNEL_DS|0x1, thread->esp0,
+		   THREAD_SIZE/PAGE_SIZE);
+}
+
+static fastcall void lguest_load_tr_desc(void)
+{
+}
+
+static fastcall void lguest_set_ldt(const void *addr, unsigned entries)
+{
+	/* FIXME: Implement. */
+	BUG_ON(entries);
+}
+
+static fastcall void lguest_load_tls(struct thread_struct *t, unsigned int cpu)
+{
+	lazy_hcall(LHCALL_LOAD_TLS, __pa(&t->tls_array), cpu, 0);
+}
+
+static fastcall void lguest_set_debugreg(int regno, unsigned long value)
+{
+	/* FIXME: Implement */
+}
+
+static unsigned int lguest_cr0;
+static fastcall void lguest_clts(void)
+{
+	lazy_hcall(LHCALL_TS, 0, 0, 0);
+	lguest_cr0 &= ~8U;
+}
+
+static fastcall unsigned long lguest_read_cr0(void)
+{
+	return lguest_cr0;
+}
+
+static fastcall void lguest_write_cr0(unsigned long val)
+{
+	hcall(LHCALL_TS, val & 8, 0, 0);
+	lguest_cr0 = val;
+}
+
+static fastcall unsigned long lguest_read_cr2(void)
+{
+	return lguest_data.cr2;
+}
+
+static fastcall unsigned long lguest_read_cr3(void)
+{
+	return current_cr3;
+}
+
+/* Used to enable/disable PGE, but we don't care. */
+static fastcall unsigned long lguest_read_cr4(void)
+{
+	return 0;
+}
+
+static fastcall void lguest_write_cr4(unsigned long val)
+{
+}
+
+/* FIXME: These should be in a header somewhere */
+extern unsigned long init_pg_tables_end;
+
+static void fastcall lguest_time_irq(unsigned int irq, struct irq_desc *desc)
+{
+	do_timer(hcall(LHCALL_TIMER_READ, 0, 0, 0));
+	update_process_times(user_mode_vm(get_irq_regs()));
+}
+
+static void disable_lguest_irq(unsigned int irq)
+{
+	set_bit(irq, lguest_data.interrupts);
+}
+
+static void enable_lguest_irq(unsigned int irq)
+{
+	clear_bit(irq, lguest_data.interrupts);
+	/* FIXME: If it's pending? */
+}
+
+static struct irq_chip lguest_irq_controller = {
+	.name		= "lguest",
+	.mask		= disable_lguest_irq,
+	.mask_ack	= disable_lguest_irq,
+	.unmask		= enable_lguest_irq,
+};
+
+static void lguest_time_init(void)
+{
+	set_irq_handler(0, lguest_time_irq);
+	hcall(LHCALL_TIMER_START,HZ,0,0);
+}
+
+static void __init lguest_init_IRQ(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < LGUEST_IRQS; i++) {
+		int vector = FIRST_EXTERNAL_VECTOR + i;
+		if (i >= NR_IRQS)
+			break;
+		if (vector != SYSCALL_VECTOR) {
+			set_intr_gate(vector, interrupt[i]);
+			set_irq_chip_and_handler(i, &lguest_irq_controller,
+						 handle_level_irq);
+		}
+	}
+	irq_ctx_init(smp_processor_id());
+}
+
+static inline void native_write_dt_entry(void *dt, int entry, u32 entry_low, u32 entry_high)
+{
+	u32 *lp = (u32 *)((char *)dt + entry*8);
+	lp[0] = entry_low;
+	lp[1] = entry_high;
+}
+
+static fastcall void lguest_write_ldt_entry(void *dt, int entrynum, u32 low, u32 high)
+{
+	/* FIXME: Allow this. */
+	BUG();
+}
+
+static fastcall void lguest_write_gdt_entry(void *dt, int entrynum,
+					   u32 low, u32 high)
+{
+	native_write_dt_entry(dt, entrynum, low, high);
+	hcall(LHCALL_LOAD_GDT, __pa(dt), GDT_ENTRIES, 0);
+}
+
+static fastcall void lguest_write_idt_entry(void *dt, int entrynum,
+					   u32 low, u32 high)
+{
+	native_write_dt_entry(dt, entrynum, low, high);
+	hcall(LHCALL_LOAD_IDT_ENTRY, entrynum, low, high);
+}
+
+#define LGUEST_IRQ "lguest_data+"__stringify(LGUEST_DATA_irq_enabled)
+#define DEF_LGUEST(name, code)				\
+	extern const char start_##name[], end_##name[];		\
+	asm("start_" #name ": " code "; end_" #name ":")
+DEF_LGUEST(cli, "movl $0," LGUEST_IRQ);
+DEF_LGUEST(sti, "movl $512," LGUEST_IRQ);
+DEF_LGUEST(popf, "movl %eax," LGUEST_IRQ);
+DEF_LGUEST(pushf, "movl " LGUEST_IRQ ",%eax");
+DEF_LGUEST(pushf_cli, "movl " LGUEST_IRQ ",%eax; movl $0," LGUEST_IRQ);
+DEF_LGUEST(iret, ".byte 0xE9,0,0,0,0"); /* jmp ... */
+
+static const struct lguest_insns
+{
+	const char *start, *end;
+} lguest_insns[] = {
+	[PARAVIRT_IRQ_DISABLE] = { start_cli, end_cli },
+	[PARAVIRT_IRQ_ENABLE] = { start_sti, end_sti },
+	[PARAVIRT_RESTORE_FLAGS] = { start_popf, end_popf },
+	[PARAVIRT_SAVE_FLAGS] = { start_pushf, end_pushf },
+	[PARAVIRT_SAVE_FLAGS_IRQ_DISABLE] = { start_pushf_cli, end_pushf_cli },
+	[PARAVIRT_INTERRUPT_RETURN] = { start_iret, end_iret },
+};
+static unsigned lguest_patch(u8 type, u16 clobber, void *insns, unsigned len)
+{
+	unsigned int insn_len;
+
+	/* Don't touch it if we don't have a replacement */
+	if (type >= ARRAY_SIZE(lguest_insns) || !lguest_insns[type].start)
+		return len;
+
+	insn_len = lguest_insns[type].end - lguest_insns[type].start;
+
+	/* Similarly if we can't fit replacement. */
+	if (len < insn_len)
+		return len;
+
+	memcpy(insns, lguest_insns[type].start, insn_len);
+	if (type == PARAVIRT_INTERRUPT_RETURN) {
+		/* Jumps are relative. */
+		u32 off = (u32)lguest_iret - ((u32)insns + insn_len);
+		memcpy(insns+1, &off, sizeof(off));
+	}
+	return insn_len;
+}
+
+static void fastcall lguest_safe_halt(void)
+{
+	hcall(LHCALL_HALT, 0, 0, 0);
+}
+
+static unsigned long lguest_get_wallclock(void)
+{
+	return hcall(LHCALL_GET_WALLCLOCK, 0, 0, 0);
+}
+
+static void lguest_power_off(void)
+{
+	hcall(LHCALL_CRASH, __pa("Power down"), 0, 0);
+}
+
+static __attribute_used__ __init void lguest_init(void)
+{
+	extern struct Xgt_desc_struct cpu_gdt_descr;
+	extern struct i386_pda boot_pda;
+
+	paravirt_ops.name = "lguest";
+	paravirt_ops.paravirt_enabled = 1;
+	paravirt_ops.kernel_rpl = 1;
+
+	paravirt_ops.save_fl = save_fl;
+	paravirt_ops.restore_fl = restore_fl;
+	paravirt_ops.irq_disable = irq_disable;
+	paravirt_ops.irq_enable = irq_enable;
+	paravirt_ops.load_gdt = lguest_load_gdt;
+	paravirt_ops.memory_setup = lguest_memory_setup;
+	paravirt_ops.cpuid = lguest_cpuid;
+	paravirt_ops.write_cr3 = lguest_write_cr3;
+	paravirt_ops.flush_tlb_user = lguest_flush_tlb;
+	paravirt_ops.flush_tlb_single = lguest_flush_tlb_single;
+	paravirt_ops.flush_tlb_kernel = lguest_flush_tlb_kernel;
+	paravirt_ops.set_pte = lguest_set_pte;
+	paravirt_ops.set_pte_at = lguest_set_pte_at;
+	paravirt_ops.set_pmd = lguest_set_pud;
+#ifdef CONFIG_X86_LOCAL_APIC
+	paravirt_ops.apic_write = lguest_apic_write;
+	paravirt_ops.apic_write_atomic = lguest_apic_write_atomic;
+	paravirt_ops.apic_read = lguest_apic_read;
+#endif
+	paravirt_ops.load_idt = lguest_load_idt;
+	paravirt_ops.iret = lguest_iret;
+	paravirt_ops.load_esp0 = lguest_load_esp0;
+	paravirt_ops.load_tr_desc = lguest_load_tr_desc;
+	paravirt_ops.set_ldt = lguest_set_ldt;
+	paravirt_ops.load_tls = lguest_load_tls;
+	paravirt_ops.set_debugreg = lguest_set_debugreg;
+	paravirt_ops.clts = lguest_clts;
+	paravirt_ops.read_cr0 = lguest_read_cr0;
+	paravirt_ops.write_cr0 = lguest_write_cr0;
+	paravirt_ops.init_IRQ = lguest_init_IRQ;
+	paravirt_ops.read_cr2 = lguest_read_cr2;
+	paravirt_ops.read_cr3 = lguest_read_cr3;
+	paravirt_ops.read_cr4 = lguest_read_cr4;
+	paravirt_ops.write_cr4 = lguest_write_cr4;
+	paravirt_ops.write_ldt_entry = lguest_write_ldt_entry;
+	paravirt_ops.write_gdt_entry = lguest_write_gdt_entry;
+	paravirt_ops.write_idt_entry = lguest_write_idt_entry;
+	paravirt_ops.patch = lguest_patch;
+	paravirt_ops.safe_halt = lguest_safe_halt;
+	paravirt_ops.get_wallclock = lguest_get_wallclock;
+	paravirt_ops.time_init = lguest_time_init;
+#ifdef PARAVIRT_LAZY_NONE
+	paravirt_ops.set_lazy_mode = lguest_lazy_mode;
+#endif
+
+	memset(lguest_data.hcall_status,0xFF,sizeof(lguest_data.hcall_status));
+	lguest_data.noirq_start = (u32)lguest_noirq_start;
+	lguest_data.noirq_end = (u32)lguest_noirq_end;
+	hcall(LHCALL_LGUEST_INIT, __pa(&lguest_data), 0, 0);
+	strncpy(saved_command_line, boot->cmdline, COMMAND_LINE_SIZE);
+
+	/* We use top of mem for initial pagetables. */
+	init_pg_tables_end = __pa(pg0);
+
+	/* set up PDA descriptor */
+	pack_descriptor((u32 *)&cpu_gdt_table[GDT_ENTRY_PDA].a,
+			(u32 *)&cpu_gdt_table[GDT_ENTRY_PDA].b,
+			(unsigned)&boot_pda, sizeof(boot_pda)-1,
+			0x80 | DESCTYPE_S | 0x02, 0);
+	load_gdt(&cpu_gdt_descr);
+	asm volatile ("mov %0, %%gs" : : "r" (__KERNEL_PDA) : "memory");
+
+	reserve_top_address(lguest_data.reserve_mem);
+
+	cpu_detect(&new_cpu_data);
+	/* Need this before paging_init. */
+	set_bit(X86_FEATURE_PGE, new_cpu_data.x86_capability);
+	/* Math is always hard! */
+	new_cpu_data.hard_math = 1;
+
+	/* FIXME: Better way? */
+	/* Suppress vgacon startup code */
+	SCREEN_INFO.orig_video_isVGA = VIDEO_TYPE_VLFB;
+
+	add_preferred_console("hvc", 0, NULL);
+
+#ifdef CONFIG_X86_MCE
+	mce_disabled = 1;
+#endif
+
+#ifdef CONFIG_ACPI
+	acpi_disabled = 1;
+	acpi_ht = 0;
+#endif
+	if (boot->initrd_size) {
+		/* We stash this at top of memory. */
+		INITRD_START = boot->max_pfn*PAGE_SIZE - boot->initrd_size;
+		INITRD_SIZE = boot->initrd_size;
+		LOADER_TYPE = 0xFF;
+	}
+
+	pm_power_off = lguest_power_off;
+	start_kernel();
+}
+
+asm("lguest_maybe_init:\n"
+    "	cmpl $"__stringify(LGUEST_MAGIC_EBP)", %ebp\n"
+    "	jne 1f\n"
+    "	cmpl $"__stringify(LGUEST_MAGIC_EDI)", %edi\n"
+    "	jne 1f\n"
+    "	cmpl $"__stringify(LGUEST_MAGIC_ESI)", %esi\n"
+    "	je lguest_init\n"
+    "1: ret");
+extern void asmlinkage lguest_maybe_init(void);
+paravirt_probe(lguest_maybe_init);
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/lguest_bus.c
@@ -0,0 +1,180 @@
+#include <linux/init.h>
+#include <linux/bootmem.h>
+#include <asm/lguest_device.h>
+#include <asm/lguest.h>
+#include <asm/io.h>
+
+static ssize_t type_show(struct device *_dev,
+                         struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%hu", lguest_devices[dev->index].type);
+}
+static ssize_t features_show(struct device *_dev,
+                             struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%hx", lguest_devices[dev->index].features);
+}
+static ssize_t pfn_show(struct device *_dev,
+			 struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%u", lguest_devices[dev->index].pfn);
+}
+static ssize_t status_show(struct device *_dev,
+                           struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%hx", lguest_devices[dev->index].status);
+}
+static ssize_t status_store(struct device *_dev, struct device_attribute *attr,
+                            const char *buf, size_t count)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	if (sscanf(buf, "%hi", &lguest_devices[dev->index].status) != 1)
+		return -EINVAL;
+	return count;
+}
+static struct device_attribute lguest_dev_attrs[] = {
+	__ATTR_RO(type),
+	__ATTR_RO(features),
+	__ATTR_RO(pfn),
+	__ATTR(status, 0644, status_show, status_store),
+	__ATTR_NULL
+};
+
+static int lguest_dev_match(struct device *_dev, struct device_driver *_drv)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	struct lguest_driver *drv = container_of(_drv,struct lguest_driver,drv);
+
+	return (drv->device_type == lguest_devices[dev->index].type);
+}
+
+struct lguest_bus {
+	struct bus_type bus;
+	struct device dev;
+};
+
+static struct lguest_bus lguest_bus = {
+	.bus = {
+		.name  = "lguest",
+		.match = lguest_dev_match,
+		.dev_attrs = lguest_dev_attrs,
+	},
+	.dev = {
+		.parent = NULL,
+		.bus_id = "lguest",
+	}
+};
+
+static int lguest_dev_probe(struct device *_dev)
+{
+	int ret;
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	struct lguest_driver *drv = container_of(dev->dev.driver,
+						struct lguest_driver, drv);
+
+	lguest_devices[dev->index].status |= LGUEST_DEVICE_S_DRIVER;
+	ret = drv->probe(dev);
+	if (ret == 0)
+		lguest_devices[dev->index].status |= LGUEST_DEVICE_S_DRIVER_OK;
+	return ret;
+}
+
+static int lguest_dev_remove(struct device *_dev)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	struct lguest_driver *drv = container_of(dev->dev.driver,
+						struct lguest_driver, drv);
+
+	if (dev->dev.driver && drv->remove)
+		drv->remove(dev);
+	put_device(&dev->dev);
+	return 0;
+}
+
+int register_lguest_driver(struct lguest_driver *drv)
+{
+	if (!lguest_devices)
+		return 0;
+	
+	drv->drv.bus = &lguest_bus.bus;
+	drv->drv.name = drv->name;
+	drv->drv.owner = drv->owner;
+	drv->drv.probe = lguest_dev_probe;
+	drv->drv.remove = lguest_dev_remove;
+
+	return driver_register(&drv->drv);
+}
+EXPORT_SYMBOL_GPL(register_lguest_driver);
+
+void unregister_lguest_driver(struct lguest_driver *drv)
+{
+	if (!lguest_devices)
+		return;
+
+	driver_unregister(&drv->drv);
+}
+EXPORT_SYMBOL_GPL(unregister_lguest_driver);
+
+static void release_lguest_device(struct device *_dev)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+
+	lguest_devices[dev->index].status |= LGUEST_DEVICE_S_REMOVED_ACK;
+	kfree(dev);
+}
+
+static void add_lguest_device(unsigned int index)
+{
+	struct lguest_device *new;
+
+	lguest_devices[index].status |= LGUEST_DEVICE_S_ACKNOWLEDGE;
+	new = kmalloc(sizeof(struct lguest_device), GFP_KERNEL);
+	if (!new) {
+		printk(KERN_EMERG "Cannot allocate lguest device %u\n", index);
+		lguest_devices[index].status |= LGUEST_DEVICE_S_FAILED;
+		return;
+	}
+
+	new->index = index;
+	new->private = NULL;
+	memset(&new->dev, 0, sizeof(new->dev));
+	new->dev.parent = &lguest_bus.dev;
+	new->dev.bus = &lguest_bus.bus;
+	new->dev.release = release_lguest_device;
+	sprintf(new->dev.bus_id, "%u", index);
+	if (device_register(&new->dev) != 0) {
+		printk(KERN_EMERG "Cannot register lguest device %u\n", index);
+		lguest_devices[index].status |= LGUEST_DEVICE_S_FAILED;
+		kfree(new);
+	}
+}
+
+static void scan_devices(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < LGUEST_MAX_DEVICES; i++)
+		if (lguest_devices[i].type)
+			add_lguest_device(i);
+}
+
+static int __init lguest_bus_init(void)
+{
+	if (strcmp(paravirt_ops.name, "lguest") != 0)
+		return 0;
+
+	/* Devices are in page above top of "normal" mem. */
+	lguest_devices = ioremap(max_pfn << PAGE_SHIFT, PAGE_SIZE);
+
+	if (bus_register(&lguest_bus.bus) != 0
+	    || device_register(&lguest_bus.dev) != 0)
+		panic("lguest bus registration failed");
+
+	scan_devices();
+	return 0;
+}
+postcore_initcall(lguest_bus_init);
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/lguest_user.c
@@ -0,0 +1,242 @@
+/* Userspace control of the guest, via /dev/lguest. */
+#include <linux/uaccess.h>
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+#include "lg.h"
+
+static struct lguest_state *setup_guest_state(unsigned int num, void *pgdir,
+					      unsigned long start)
+{
+	struct lguest_state *guest = &__lguest_states()[num];
+	unsigned int i;
+	const long *def = __lguest_default_idt_entries();
+	struct lguest_regs *regs;
+
+	guest->gdt_table[GDT_ENTRY_KERNEL_CS] = FULL_EXEC_SEGMENT;
+	guest->gdt_table[GDT_ENTRY_KERNEL_DS] = FULL_SEGMENT;
+	guest->gdt.size = GDT_ENTRIES*8-1;
+	guest->gdt.address = (unsigned long)&guest->gdt_table;
+
+	/* Other guest's IDTs are initialized from default. */
+	guest->idt.size = 8 * IDT_ENTRIES;
+	guest->idt.address = (long)guest->idt_table;
+	for (i = 0; i < IDT_ENTRIES; i++) {
+		u32 flags = 0x8e00;
+
+		/* They can't "int" into any of them except hypercall. */
+		if (i == LGUEST_TRAP_ENTRY)
+			flags |= (GUEST_DPL << 13);
+
+		guest->idt_table[i].a = (LGUEST_CS<<16) | (def[i]&0x0000FFFF);
+		guest->idt_table[i].b = (def[i]&0xFFFF0000) | flags;
+	}
+
+	memset(&guest->tss, 0, sizeof(guest->tss));
+	guest->tss.ss0 = LGUEST_DS;
+	guest->tss.esp0 = (unsigned long)(guest+1);
+	guest->tss.io_bitmap_base = sizeof(guest->tss); /* No I/O for you! */
+
+	/* Write out stack in format lguest expects, so we can switch to it. */
+	regs = &guest->regs;
+	regs->cr3 = __pa(pgdir);
+	regs->eax = regs->ebx = regs->ecx = regs->edx = regs->esp = 0;
+	regs->edi = LGUEST_MAGIC_EDI;
+	regs->ebp = LGUEST_MAGIC_EBP;
+	regs->esi = LGUEST_MAGIC_ESI;
+	regs->gs = regs->fs = 0;
+	regs->ds = regs->es = __KERNEL_DS|GUEST_DPL;
+	regs->trapnum = regs->errcode = 0;
+	regs->eip = start;
+	regs->cs = __KERNEL_CS|GUEST_DPL;
+	regs->eflags = 0x202; 	/* Interrupts enabled. */
+	regs->ss = __KERNEL_DS|GUEST_DPL;
+
+	if (!fixup_gdt_table(guest->gdt_table, ARRAY_SIZE(guest->gdt_table),
+			     &guest->regs, &guest->tss))
+		return NULL;
+
+	return guest;
+}
+
+/* + addr */
+static long user_get_dma(struct lguest *lg, const u32 __user *input)
+{
+	unsigned long addr, udma, irq;
+
+	if (get_user(addr, input) != 0)
+		return -EFAULT;
+	udma = get_dma_buffer(lg, addr, &irq);
+	if (!udma)
+		return -ENOENT;
+
+	/* We put irq number in udma->used_len. */
+	lhwrite_u32(lg, udma + offsetof(struct lguest_dma, used_len), irq);
+	return udma;
+}
+
+/* + irq */
+static int user_send_irq(struct lguest *lg, const u32 __user *input)
+{
+	u32 irq;
+
+	if (get_user(irq, input) != 0)
+		return -EFAULT;
+	if (irq >= LGUEST_IRQS)
+		return -EINVAL;
+	set_bit(irq, lg->irqs_pending);
+	return 0;
+}
+
+static ssize_t read(struct file *file, char __user *user, size_t size,loff_t*o)
+{
+	struct lguest *lg = file->private_data;
+
+	if (!lg)
+		return -EINVAL;
+
+	if (lg->dead) {
+		size_t len;
+
+		if (lg->dead == (void *)-1)
+			return -ENOMEM;
+
+		len = min(size, strlen(lg->dead)+1);
+		if (copy_to_user(user, lg->dead, len) != 0)
+			return -EFAULT;
+		return len;
+	}
+
+	if (lg->dma_is_pending)
+		lg->dma_is_pending = 0;
+
+	return run_guest(lg, user);
+}
+
+/* Take: pfnlimit, pgdir, start, pageoffset. */
+static int initialize(struct file *file, const u32 __user *input)
+{
+	struct lguest *lg;
+	int err, i;
+	u32 args[4];
+
+	if (file->private_data)
+		return -EBUSY;
+
+	if (copy_from_user(args, input, sizeof(args)) != 0)
+		return -EFAULT;
+
+	if (args[1] <= PAGE_SIZE)
+		return -EINVAL;
+
+	down(&lguest_lock);
+	i = find_free_guest();
+	if (i < 0) {
+		err = -ENOSPC;
+		goto unlock;
+	}
+	lg = &lguests[i];
+	lg->guestid = i;
+	lg->pfn_limit = args[0];
+	lg->page_offset = args[3];
+
+	lg->trap_page = (u32 *)get_zeroed_page(GFP_KERNEL);
+	if (!lg->trap_page) {
+		err = -ENOMEM;
+		goto release_guest;
+	}
+
+	err = init_guest_pagetable(lg, args[1]);
+	if (err)
+		goto free_trap_page;
+
+	lg->state = setup_guest_state(i, lg->pgdirs[lg->pgdidx].pgdir,args[2]);
+	if (!lg->state) {
+		err = -ENOEXEC;
+		goto release_pgtable;
+	}
+	up(&lguest_lock);
+
+	lg->tsk = current;
+	lg->mm = get_task_mm(current);
+	file->private_data = lg;
+	return sizeof(args);
+
+release_pgtable:
+	free_guest_pagetable(lg);
+free_trap_page:
+	free_page((long)lg->trap_page);
+release_guest:
+	memset(lg, 0, sizeof(*lg));
+unlock:
+	up(&lguest_lock);
+	return err;
+}
+
+static ssize_t write(struct file *file, const char __user *input,
+		     size_t size, loff_t *off)
+{
+	struct lguest *lg = file->private_data;
+	u32 req;
+
+	if (get_user(req, input) != 0)
+		return -EFAULT;
+	input += sizeof(req);
+
+	if (req != LHREQ_INITIALIZE && !lg)
+		return -EINVAL;
+	if (lg && lg->dead)
+		return -ENOENT;
+
+	switch (req) {
+	case LHREQ_INITIALIZE:
+		return initialize(file, (const u32 __user *)input);
+	case LHREQ_GETDMA:
+		return user_get_dma(lg, (const u32 __user *)input);
+	case LHREQ_IRQ:
+		return user_send_irq(lg, (const u32 __user *)input);
+	default:
+		return -EINVAL;
+	}
+}
+
+static int close(struct inode *inode, struct file *file)
+{
+	struct lguest *lg = file->private_data;
+
+	if (!lg)
+		return 0;
+
+	down(&lguest_lock);
+	release_all_dma(lg);
+	free_page((long)lg->trap_page);
+	free_guest_pagetable(lg);
+	mmput(lg->mm);
+	if (lg->dead != (void *)1)
+		kfree(lg->dead);
+	memset(lg->state, 0, sizeof(*lg->state));
+	memset(lg, 0, sizeof(*lg));
+	up(&lguest_lock);
+	return 0;
+}
+
+static struct file_operations lguest_fops = {
+	.owner	 = THIS_MODULE,
+	.release = close,
+	.write	 = write,
+	.read	 = read,
+};
+static struct miscdevice lguest_dev = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "lguest",
+	.fops	= &lguest_fops,
+};
+
+int __init lguest_device_init(void)
+{
+	return misc_register(&lguest_dev);
+}
+
+void __exit lguest_device_remove(void)
+{
+	misc_deregister(&lguest_dev);
+}
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/page_tables.c
@@ -0,0 +1,374 @@
+/* Shadow page table operations.
+ * Copyright (C) Rusty Russell IBm Corporation 2006.
+ * GPL v2 and any later version */
+#include <linux/mm.h>
+#include <linux/types.h>
+#include <linux/spinlock.h>
+#include <linux/random.h>
+#include <linux/percpu.h>
+#include <asm/tlbflush.h>
+#include "lg.h"
+
+#define PTES_PER_PAGE_SHIFT 10
+#define PTES_PER_PAGE (1 << PTES_PER_PAGE_SHIFT)
+#define HYPERVISOR_PGD_ENTRY (PTES_PER_PAGE - 1)
+
+static DEFINE_PER_CPU(u32 *, hypervisor_pte_pages) = { NULL };
+#define hypervisor_pte_page(cpu) per_cpu(hypervisor_pte_pages, cpu)
+
+static unsigned vaddr_to_pgd(unsigned long vaddr)
+{
+	return vaddr >> (PAGE_SHIFT + PTES_PER_PAGE_SHIFT);
+}
+
+/* These access the real versions. */
+static u32 *toplev(struct lguest *lg, u32 i, unsigned long vaddr)
+{
+	unsigned int index = vaddr_to_pgd(vaddr);
+
+	if (index >= HYPERVISOR_PGD_ENTRY) {
+		kill_guest(lg, "attempt to access hypervisor pages");
+		index = 0;
+	} 
+	return &lg->pgdirs[i].pgdir[index];
+}
+
+static u32 *pteof(struct lguest *lg, u32 top, unsigned long vaddr)
+{
+	u32 *page = __va(top&PAGE_MASK);
+	BUG_ON(!(top & _PAGE_PRESENT));
+	return &page[(vaddr >> PAGE_SHIFT) % PTES_PER_PAGE];
+}
+
+/* These access the guest versions. */
+static u32 gtoplev(struct lguest *lg, unsigned long vaddr)
+{
+	unsigned int index = vaddr >> (PAGE_SHIFT + PTES_PER_PAGE_SHIFT);
+	return lg->pgdirs[lg->pgdidx].cr3 + index * sizeof(u32);
+}
+
+static u32 gpteof(struct lguest *lg, u32 gtop, unsigned long vaddr)
+{
+	u32 gpage = (gtop&PAGE_MASK);
+	BUG_ON(!(gtop & _PAGE_PRESENT));
+	return gpage + ((vaddr >> PAGE_SHIFT) % PTES_PER_PAGE) * sizeof(u32);
+}
+
+static void release_pte(u32 pte)
+{
+	if (pte & _PAGE_PRESENT)
+		put_page(pfn_to_page(pte >> PAGE_SHIFT));
+}
+
+/* Do a virtual -> physical mapping on a user page. */
+static unsigned long get_pfn(unsigned long virtpfn, int write)
+{
+	struct vm_area_struct *vma;
+	struct page *page;
+	unsigned long ret = -1UL;
+
+	down_read(&current->mm->mmap_sem);
+	if (get_user_pages(current, current->mm, virtpfn << PAGE_SHIFT,
+			   1, write, 1, &page, &vma) == 1)
+		ret = page_to_pfn(page);
+	up_read(&current->mm->mmap_sem);
+	return ret;
+}
+
+static u32 check_pgtable_entry(struct lguest *lg, u32 entry)
+{
+	if ((entry & (_PAGE_PWT|_PAGE_PSE))
+	    || (entry >> PAGE_SHIFT) >= lg->pfn_limit)
+		kill_guest(lg, "bad page table entry");
+	return entry & ~_PAGE_GLOBAL;
+}
+
+static u32 get_pte(struct lguest *lg, u32 entry, int write)
+{
+	u32 pfn;
+
+	pfn = get_pfn(entry >> PAGE_SHIFT, write);
+	if (pfn == -1UL) {
+		kill_guest(lg, "failed to get page %u", entry>>PAGE_SHIFT);
+		return 0;
+	}
+	return ((pfn << PAGE_SHIFT) | (entry & (PAGE_SIZE-1)));
+}
+
+/* FIXME: We hold reference to pages, which prevents them from being
+   swapped.  It'd be nice to have a callback when Linux wants to swap out. */
+
+/* We fault pages in, which allows us to update accessed/dirty bits.
+ * Return NULL or the pte page. */
+static int page_in(struct lguest *lg, u32 vaddr, unsigned flags)
+{
+	u32 gtop, gpte;
+	u32 *top, *pte, *ptepage;
+	u32 val;
+
+	gtop = gtoplev(lg, vaddr);
+	val = lhread_u32(lg, gtop);
+	if (!(val & _PAGE_PRESENT))
+		return 0;
+
+	top = toplev(lg, lg->pgdidx, vaddr);
+	if (!(*top & _PAGE_PRESENT)) {
+		/* Get a PTE page for them. */
+		ptepage = (void *)get_zeroed_page(GFP_KERNEL);
+		/* FIXME: Steal from self in this case? */
+		if (!ptepage) {
+			kill_guest(lg, "out of memory allocating pte page");
+			return 0;
+		}
+		val = check_pgtable_entry(lg, val);
+		*top = (__pa(ptepage) | (val & (PAGE_SIZE-1)));
+	} else
+		ptepage = __va(*top & PAGE_MASK);
+
+	gpte = gpteof(lg, val, vaddr);
+	val = lhread_u32(lg, gpte);
+
+	/* No page, or write to readonly page? */
+	if (!(val&_PAGE_PRESENT) || ((flags&_PAGE_DIRTY) && !(val&_PAGE_RW)))
+		return 0;
+
+	pte = pteof(lg, *top, vaddr);
+	val = check_pgtable_entry(lg, val) | flags;
+
+	/* We're done with the old pte. */
+	release_pte(*pte);
+
+	/* We don't make it writable if this isn't a write: later
+	 * write will fault so we can set dirty bit in guest. */
+	if (val & _PAGE_DIRTY)
+		*pte = get_pte(lg, val, 1);
+	else
+		*pte = get_pte(lg, val & ~_PAGE_RW, 0);
+
+	/* Now we update dirty/accessed on guest. */
+	lhwrite_u32(lg, gpte, val);
+	return 1;
+}
+
+int demand_page(struct lguest *lg, u32 vaddr, int write)
+{
+	return page_in(lg, vaddr, (write ? _PAGE_DIRTY : 0)|_PAGE_ACCESSED);
+}
+
+void pin_stack_pages(struct lguest *lg)
+{
+	unsigned int i;
+	u32 stack = lg->state->tss.esp1;
+
+	for (i = 0; i < lg->stack_pages; i++)
+		if (!demand_page(lg, stack - i*PAGE_SIZE, 1))
+			kill_guest(lg, "bad stack page %i@%#x", i, stack);
+}
+
+static unsigned int find_pgdir(struct lguest *lg, u32 pgtable)
+{
+	unsigned int i;
+	for (i = 0; i < ARRAY_SIZE(lg->pgdirs); i++)
+		if (lg->pgdirs[i].cr3 == pgtable)
+			break;
+	return i;
+}
+
+static void release_pgd(struct lguest *lg, u32 *pgd)
+{
+	if (*pgd & _PAGE_PRESENT) {
+		unsigned int i;
+		u32 *ptepage = __va(*pgd & ~(PAGE_SIZE-1));
+		for (i = 0; i < PTES_PER_PAGE; i++)
+			release_pte(ptepage[i]);
+		free_page((long)ptepage);
+		*pgd = 0;
+	}
+}
+
+static void flush_user_mappings(struct lguest *lg, int idx)
+{
+	unsigned int i;
+	for (i = 0; i < vaddr_to_pgd(lg->page_offset); i++)
+		release_pgd(lg, lg->pgdirs[idx].pgdir + i);
+}
+
+void guest_pagetable_flush_user(struct lguest *lg)
+{
+	flush_user_mappings(lg, lg->pgdidx);
+}
+
+static unsigned int new_pgdir(struct lguest *lg, u32 cr3)
+{
+	unsigned int next;
+
+	next = (lg->pgdidx + random32()) % ARRAY_SIZE(lg->pgdirs);
+	if (!lg->pgdirs[next].pgdir) {
+		lg->pgdirs[next].pgdir = (u32 *)get_zeroed_page(GFP_KERNEL);
+		if (!lg->pgdirs[next].pgdir)
+			next = lg->pgdidx;
+	}
+	lg->pgdirs[next].cr3 = cr3;
+	/* Release all the non-kernel mappings. */
+	flush_user_mappings(lg, next);
+
+	return next;
+}
+
+void guest_new_pagetable(struct lguest *lg, u32 pgtable)
+{
+	int newpgdir;
+
+	newpgdir = find_pgdir(lg, pgtable);
+	if (newpgdir == ARRAY_SIZE(lg->pgdirs))
+		newpgdir = new_pgdir(lg, pgtable);
+	lg->pgdidx = newpgdir;
+	lg->state->regs.cr3 = __pa(lg->pgdirs[lg->pgdidx].pgdir);
+	pin_stack_pages(lg);
+}
+
+static void release_all_pagetables(struct lguest *lg)
+{
+	unsigned int i, j;
+
+	for (i = 0; i < ARRAY_SIZE(lg->pgdirs); i++)
+		if (lg->pgdirs[i].pgdir)
+			for (j = 0; j < HYPERVISOR_PGD_ENTRY; j++)
+				release_pgd(lg, lg->pgdirs[i].pgdir + j);
+}
+
+void guest_pagetable_clear_all(struct lguest *lg)
+{
+	release_all_pagetables(lg);
+	pin_stack_pages(lg);
+}
+
+static void do_set_pte(struct lguest *lg, int idx,
+		       unsigned long vaddr, u32 val)
+{
+	u32 *top = toplev(lg, idx, vaddr);
+	if (*top & _PAGE_PRESENT) {
+		u32 *pte = pteof(lg, *top, vaddr);
+		release_pte(*pte);
+		if (val & (_PAGE_DIRTY | _PAGE_ACCESSED)) {
+			val = check_pgtable_entry(lg, val);
+			*pte = get_pte(lg, val, val & _PAGE_DIRTY);
+		} else
+			*pte = 0;
+	}
+}
+
+void guest_set_pte(struct lguest *lg,
+		   unsigned long cr3, unsigned long vaddr, u32 val)
+{
+	/* Kernel mappings must be changed on all top levels. */
+	if (vaddr >= lg->page_offset) {
+		unsigned int i;
+		for (i = 0; i < ARRAY_SIZE(lg->pgdirs); i++)
+			if (lg->pgdirs[i].pgdir)
+				do_set_pte(lg, i, vaddr, val);
+	} else {
+		int pgdir = find_pgdir(lg, cr3);
+		if (pgdir != ARRAY_SIZE(lg->pgdirs))
+			do_set_pte(lg, pgdir, vaddr, val);
+	}
+}
+
+void guest_set_pud(struct lguest *lg, unsigned long cr3, u32 idx)
+{
+	int pgdir;
+
+	if (idx >= HYPERVISOR_PGD_ENTRY)
+		return;
+
+	pgdir = find_pgdir(lg, cr3);
+	if (pgdir < ARRAY_SIZE(lg->pgdirs))
+		release_pgd(lg, lg->pgdirs[pgdir].pgdir + idx);
+}
+
+int init_guest_pagetable(struct lguest *lg, u32 pgtable)
+{
+	/* We assume this in flush_user_mappings, so check now */
+	if (vaddr_to_pgd(lg->page_offset) >= HYPERVISOR_PGD_ENTRY)
+		return -EINVAL;
+	lg->pgdidx = 0;
+	lg->pgdirs[lg->pgdidx].cr3 = pgtable;
+	lg->pgdirs[lg->pgdidx].pgdir = (u32*)get_zeroed_page(GFP_KERNEL);
+	if (!lg->pgdirs[lg->pgdidx].pgdir)
+		return -ENOMEM;
+	return 0;
+}
+
+void free_guest_pagetable(struct lguest *lg)
+{
+	unsigned int i;
+
+	release_all_pagetables(lg);
+	for (i = 0; i < ARRAY_SIZE(lg->pgdirs); i++)
+		free_page((long)lg->pgdirs[i].pgdir);
+}
+
+/* Caller must be preempt-safe */
+void map_trap_page(struct lguest *lg)
+{
+	int cpu = smp_processor_id();
+	
+	hypervisor_pte_page(cpu)[0] = (__pa(lg->trap_page)|_PAGE_PRESENT);
+
+	/* Since hypervisor less that 4MB, we simply mug top pte page. */
+	lg->pgdirs[lg->pgdidx].pgdir[HYPERVISOR_PGD_ENTRY] =
+		(__pa(hypervisor_pte_page(cpu))| _PAGE_KERNEL);
+}
+
+static void free_hypervisor_pte_pages(void)
+{
+	int i;
+	
+	for_each_possible_cpu(i)
+		free_page((long)hypervisor_pte_page(i));
+}
+
+static __init int alloc_hypervisor_pte_pages(void)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		hypervisor_pte_page(i) = (u32 *)get_zeroed_page(GFP_KERNEL);
+		if (!hypervisor_pte_page(i)) {
+			free_hypervisor_pte_pages();
+			return -ENOMEM;
+		}
+	}
+	return 0;
+}
+
+static __init void populate_hypervisor_pte_page(int cpu)
+{
+	int i;
+	u32 *pte = hypervisor_pte_page(cpu);
+
+	for (i = 0; i < HYPERVISOR_PAGES; i++) {
+		/* First entry set dynamically in map_trap_page */
+		pte[i+1] = ((page_to_pfn(&hype_pages[i]) << PAGE_SHIFT) 
+			    | _PAGE_KERNEL_EXEC);
+	}
+}
+
+__init int init_pagetables(struct page hype_pages[])
+{
+	int ret;
+	unsigned int i;
+
+	ret = alloc_hypervisor_pte_pages();
+	if (ret)
+		return ret;
+
+	for_each_possible_cpu(i)
+		populate_hypervisor_pte_page(i);
+	return 0;
+}
+
+__exit void free_pagetables(void)
+{
+	free_hypervisor_pte_pages();
+}
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/segments.c
@@ -0,0 +1,171 @@
+#include "lg.h"
+
+/* Dealing with GDT entries is such a horror, I convert to sanity and back */
+struct decoded_gdt_entry
+{
+	u32 base, limit;
+	union {
+		struct {
+			unsigned type:4;
+			unsigned dtype:1;
+			unsigned dpl:2;
+			unsigned present:1;
+			unsigned unused:4;
+			unsigned avl:1;
+			unsigned mbz:1;
+			unsigned def:1;
+			unsigned page_granularity:1;
+		};
+		u16 raw_attributes;
+	};
+};
+
+static struct decoded_gdt_entry decode_gdt_entry(const struct desc_struct *en)
+{
+	struct decoded_gdt_entry de;
+	de.base = ((en->a >> 16) | ((en->b & 0xff) << 16) 
+		   | (en->b & 0xFF000000));
+	de.limit = ((en->a & 0xFFFF) | (en->b & 0xF0000));
+	de.raw_attributes = (en->b >> 8);
+	return de;
+}
+
+static struct desc_struct encode_gdt_entry(const struct decoded_gdt_entry *de)
+{
+	struct desc_struct en;
+	en.a = ((de->limit & 0xFFFF) | (de->base << 16));
+	en.b = (((de->base >> 16) & 0xFF) 
+		 | ((((u32)de->raw_attributes) & 0xF0FF) << 8)
+		 | (de->limit & 0xF0000)
+		 | (de->base & 0xFF000000));
+	return en;
+}
+
+static int check_desc(const struct decoded_gdt_entry *dec)
+{
+	return (dec->mbz == 0 && dec->dtype == 1 && (dec->type & 4) == 0);
+}
+
+static void check_segment(const struct desc_struct *gdt, u32 *segreg)
+{
+	if (*segreg > 255 || !(gdt[*segreg >> 3].b & 0x8000))
+		*segreg = 0;
+}
+
+/* Ensure our manually-loaded segment regs don't fault in switch_to_guest. */
+static void check_live_segments(const struct desc_struct *gdt,
+				struct lguest_regs *regs)
+{
+	check_segment(gdt, &regs->es);
+	check_segment(gdt, &regs->ds);
+	check_segment(gdt, &regs->fs);
+	check_segment(gdt, &regs->gs);
+}
+
+int fixup_gdt_table(struct desc_struct *gdt, unsigned int num,
+		    struct lguest_regs *regs, struct x86_tss *tss)
+{
+	unsigned int i;
+	struct decoded_gdt_entry dec;
+
+	for (i = 0; i < num; i++) {
+		unsigned long base, length;
+
+		/* We override these ones, so we don't care what they give. */
+		if (i == GDT_ENTRY_TSS
+		    || i == GDT_ENTRY_LGUEST_CS
+		    || i == GDT_ENTRY_LGUEST_DS
+		    || i == GDT_ENTRY_DOUBLEFAULT_TSS)
+			continue;
+
+		dec = decode_gdt_entry(&gdt[i]);
+		if (!dec.present)
+			continue;
+
+		if (!check_desc(&dec))
+			return 0;
+
+		base = dec.base;
+		length = dec.limit + 1;
+		if (dec.page_granularity) {
+			base *= PAGE_SIZE;
+			length *= PAGE_SIZE;
+		}
+
+		/* Unacceptable base? */
+		if (base >= HYPE_ADDR)
+			return 0;
+
+		/* Wrap around or segment overlaps hypervisor mem? */
+		if (!length
+		    || base + length < base
+		    || base + length > HYPE_ADDR) {
+			/* Trim to edge of hypervisor. */
+			length = HYPE_ADDR - base;
+			if (dec.page_granularity)
+				dec.limit = (length / PAGE_SIZE) - 1;
+			else
+				dec.limit = length - 1;
+		}
+		if (dec.dpl == 0)
+			dec.dpl = GUEST_DPL;
+		gdt[i] = encode_gdt_entry(&dec);
+	}
+	check_live_segments(gdt, regs);
+
+	/* Now put in hypervisor data and code segments. */
+	gdt[GDT_ENTRY_LGUEST_CS] = FULL_EXEC_SEGMENT;
+	gdt[GDT_ENTRY_LGUEST_DS] = FULL_SEGMENT;
+
+	/* Finally, TSS entry */
+	dec.base = (unsigned long)tss;
+	dec.limit = sizeof(*tss)-1;
+	dec.type = 0x9;
+	dec.dtype = 0;
+	dec.def = 0;
+	dec.present = 1;
+	dec.mbz = 0;
+	dec.page_granularity = 0;
+	gdt[GDT_ENTRY_TSS] = encode_gdt_entry(&dec);
+
+	return 1;
+}
+
+void load_guest_gdt(struct lguest *lg, u32 table, u32 num)
+{
+	if (num > GDT_ENTRIES)
+		kill_guest(lg, "too many gdt entries %i", num);
+
+	lhread(lg, lg->state->gdt_table, table,
+	       num * sizeof(lg->state->gdt_table[0]));
+	if (!fixup_gdt_table(lg->state->gdt_table, num, 
+			     &lg->state->regs, &lg->state->tss))
+		kill_guest(lg, "bad gdt table");
+}
+
+/* We don't care about limit here, since we only let them use these in
+ * usermode (where lack of USER bit in pagetable protects hypervisor mem).
+ * However, we want to ensure it doesn't fault when loaded, since *we* are
+ * the ones who will load it in switch_to_guest.
+ */
+void guest_load_tls(struct lguest *lg, const struct desc_struct __user *gtls)
+{
+	unsigned int i;
+	struct desc_struct *tls = &lg->state->gdt_table[GDT_ENTRY_TLS_MIN];
+
+	lhread(lg, tls, (u32)gtls, sizeof(*tls)*GDT_ENTRY_TLS_ENTRIES);
+	for (i = 0; i < ARRAY_SIZE(lg->tls_limits); i++) {
+		struct decoded_gdt_entry dec = decode_gdt_entry(&tls[i]);
+
+		if (!dec.present)
+			continue;
+
+		/* We truncate to one byte/page (depending on G bit) to neuter
+		   it, so ensure it's more than 1 page below trap page. */
+		tls[i].a &= 0xFFFF0000;
+		lg->tls_limits[i] = dec.limit;
+		if (!check_desc(&dec) || dec.base > HYPE_ADDR - PAGE_SIZE)
+			kill_guest(lg, "bad TLS descriptor %i", i);
+	}
+	check_live_segments(lg->state->gdt_table, &lg->state->regs);
+}
===================================================================
--- /dev/null
+++ b/include/asm-i386/lguest.h
@@ -0,0 +1,86 @@
+/* Things the lguest guest needs to know. */
+#ifndef _ASM_LGUEST_H
+#define _ASM_LGUEST_H
+
+#define LGUEST_MAGIC_EBP 0x4C687970
+#define LGUEST_MAGIC_EDI 0x652D4D65
+#define LGUEST_MAGIC_ESI 0xFFFFFFFF
+
+#define LHCALL_FLUSH_ASYNC	0
+#define LHCALL_LGUEST_INIT	1
+#define LHCALL_CRASH		2
+#define LHCALL_LOAD_GDT		3
+#define LHCALL_NEW_PGTABLE	4
+#define LHCALL_FLUSH_TLB	5
+#define LHCALL_LOAD_IDT_ENTRY	6
+#define LHCALL_SET_STACK	7
+#define LHCALL_TS		8
+#define LHCALL_TIMER_READ	9
+#define LHCALL_TIMER_START	10
+#define LHCALL_HALT		11
+#define LHCALL_GET_WALLCLOCK	12
+#define LHCALL_BIND_DMA		13
+#define LHCALL_SEND_DMA		14
+#define LHCALL_SET_PTE		15
+#define LHCALL_SET_UNKNOWN_PTE	16
+#define LHCALL_SET_PUD		17
+#define LHCALL_LOAD_TLS		18
+
+#define LGUEST_TRAP_ENTRY 0x1F
+
+static inline unsigned long
+hcall(unsigned long call,
+      unsigned long arg1, unsigned long arg2, unsigned long arg3)
+{
+	asm volatile("int $" __stringify(LGUEST_TRAP_ENTRY)
+		     : "=a"(call)
+		     : "a"(call), "d"(arg1), "b"(arg2), "c"(arg3) 
+		     : "memory");
+	return call;
+}
+
+void async_hcall(unsigned long call,
+		 unsigned long arg1, unsigned long arg2, unsigned long arg3);
+
+#define LGUEST_IRQS 32
+
+#define LHCALL_RING_SIZE 64
+struct hcall_ring
+{
+	u32 eax, edx, ebx, ecx;
+};
+
+/* All the good stuff happens here: guest registers it with LGUEST_INIT */
+struct lguest_data
+{
+/* Fields which change during running: */
+	/* 512 == enabled (same as eflags) */
+	unsigned int irq_enabled;
+	/* Blocked interrupts. */
+	DECLARE_BITMAP(interrupts, LGUEST_IRQS); 
+
+	/* Last (userspace) address we got a GPF & reloaded gs. */
+	unsigned int gs_gpf_eip;
+
+	/* Virtual address of page fault. */
+	unsigned long cr2;
+
+	/* Async hypercall ring.  0xFF == done, 0 == pending. */
+	u8 hcall_status[LHCALL_RING_SIZE];
+	struct hcall_ring hcalls[LHCALL_RING_SIZE];
+			
+/* Fields initialized by the hypervisor at boot: */
+	/* Memory not to try to access */
+	unsigned long reserve_mem;
+	/* ID of this guest (used by network driver to set ethernet address) */
+	u16 guestid;
+	/* Multiplier for TSC clock. */
+	u32 clock_mult;
+
+/* Fields initialized by the guest at boot: */
+	/* Instruction range to suppress interrupts even if enabled */
+	unsigned long noirq_start, noirq_end;
+};
+extern struct lguest_data lguest_data;
+extern struct lguest_device_desc *lguest_devices; /* Just past max_pfn */
+#endif	/* _ASM_LGUEST_H */
===================================================================
--- /dev/null
+++ b/include/asm-i386/lguest_device.h
@@ -0,0 +1,31 @@
+#ifndef _ASM_LGUEST_DEVICE_H
+#define _ASM_LGUEST_DEVICE_H
+/* Everything you need to know about lguest devices. */
+#include <linux/device.h>
+#include <asm/lguest.h>
+#include <asm/lguest_user.h>
+
+struct lguest_device {
+	/* Unique busid, and index into lguest_page->devices[] */
+	/* By convention, each device can use irq index+1 if it wants to. */
+	unsigned int index;
+
+	struct device dev;
+
+	/* Driver can hang data off here. */
+	void *private;
+};
+
+struct lguest_driver {
+	const char *name;
+	struct module *owner;
+	u16 device_type;
+	int (*probe)(struct lguest_device *dev);
+	void (*remove)(struct lguest_device *dev);
+
+	struct device_driver drv;
+};
+
+extern int register_lguest_driver(struct lguest_driver *drv);
+extern void unregister_lguest_driver(struct lguest_driver *drv);
+#endif /* _ASM_LGUEST_DEVICE_H */
===================================================================
--- /dev/null
+++ b/include/asm-i386/lguest_user.h
@@ -0,0 +1,86 @@
+#ifndef _ASM_LGUEST_USER
+#define _ASM_LGUEST_USER
+/* Everything the "lguest" userspace program needs to know. */
+/* They can register up to 32 arrays of lguest_dma. */
+#define LGUEST_MAX_DMA		32
+/* At most we can dma 16 lguest_dma in one op. */
+#define LGUEST_MAX_DMA_SECTIONS	16
+
+/* How many devices?  Assume each one wants up to two dma arrays per device. */
+#define LGUEST_MAX_DEVICES (LGUEST_MAX_DMA/2)
+
+struct lguest_dma
+{
+	/* 0 if free to be used, filled by hypervisor. */
+ 	u32 used_len;
+	u32 addr[LGUEST_MAX_DMA_SECTIONS];
+	u16 len[LGUEST_MAX_DMA_SECTIONS];
+};
+
+/* This is found at address 0. */
+struct lguest_boot_info
+{
+	u32 max_pfn;
+	u32 initrd_size;
+	char cmdline[256];
+};
+
+struct lguest_block_page
+{
+	/* 0 is a read, 1 is a write. */
+	int type;
+	u32 sector; 	/* Offset in device = sector * 512. */
+	u32 bytes;	/* Length expected to be read/written in bytes */
+	/* 0 = pending, 1 = done, 2 = done, error */
+	int result;
+	u32 num_sectors; /* Disk length = num_sectors * 512 */
+};
+
+/* There is a shared page of these. */
+struct lguest_net
+{
+	union {
+		unsigned char mac[6];
+		struct {
+			u8 promisc;
+			u8 pad;
+			u16 guestid;
+		};
+	};
+};
+
+/* lguest_device_desc->type */
+#define LGUEST_DEVICE_T_CONSOLE	1
+#define LGUEST_DEVICE_T_NET	2
+#define LGUEST_DEVICE_T_BLOCK	3
+
+/* lguest_device_desc->status.  256 and above are device specific. */
+#define LGUEST_DEVICE_S_ACKNOWLEDGE	1 /* We have seen device. */
+#define LGUEST_DEVICE_S_DRIVER		2 /* We have found a driver */
+#define LGUEST_DEVICE_S_DRIVER_OK	4 /* Driver says OK! */
+#define LGUEST_DEVICE_S_REMOVED		8 /* Device has gone away. */
+#define LGUEST_DEVICE_S_REMOVED_ACK	16 /* Driver has been told. */
+#define LGUEST_DEVICE_S_FAILED		128 /* Something actually failed */
+
+#define LGUEST_NET_F_NOCSUM		0x4000 /* Don't bother checksumming */
+#define LGUEST_DEVICE_F_RANDOMNESS	0x8000 /* IRQ is fairly random */
+
+/* We have a page of these descriptors in the lguest_device page. */
+struct lguest_device_desc {
+	u16 type;
+	u16 features;
+	u16 status;
+	u16 num_pages;
+	u32 pfn;
+};
+
+/* Write command first word is a request. */
+enum lguest_req
+{
+	LHREQ_INITIALIZE, /* + pfnlimit, pgdir, start, pageoffset */
+	LHREQ_GETDMA, /* + addr (returns &lguest_dma, irq in ->used_len) */
+	LHREQ_IRQ, /* + irq */
+};
+
+
+#endif /* _ASM_LGUEST_USER */

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 7/10] lguest: Simple lguest network driver.
  2007-02-09  9:20       ` [PATCH 6/10] lguest code: the little linux hypervisor Rusty Russell
@ 2007-02-09  9:22         ` Rusty Russell
  2007-02-09  9:23           ` [PATCH 8/10] lguest: console driver Rusty Russell
  2007-02-09  9:35         ` [PATCH 6/10] lguest code: the little linux hypervisor Andrew Morton
  2007-02-09 10:09         ` Andi Kleen
  2 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:22 UTC (permalink / raw)
  To: lkml - Kernel Mailing List
  Cc: Andrew Morton, Andi Kleen, virtualization, jgarzik

This network driver operates both to the host process, and to other
guests.  It's pretty trivial.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -217,3 +217,4 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
 obj-$(CONFIG_FS_ENET) += fs_enet/
 
 obj-$(CONFIG_NETXEN_NIC) += netxen/
+obj-$(CONFIG_LGUEST_GUEST) += lguest_net.o
===================================================================
--- /dev/null
+++ b/drivers/net/lguest_net.c
@@ -0,0 +1,400 @@
+/* A simple network driver for lguest.
+ *
+ * Copyright 2006 Rusty Russell <rusty@rustcorp.com.au> IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+//#define DEBUG
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/module.h>
+#include <linux/mm_types.h>
+#include <asm/io.h>
+#include <asm/lguest_device.h>
+
+#define SHARED_SIZE		PAGE_SIZE
+#define DATA_SIZE		1500
+#define MAX_LANS		4
+#define NUM_SKBS		8
+/* We overload multicast bit to show promiscuous mode. */
+#define PROMISC_BIT		0x80
+
+struct lguestnet_info
+{
+	/* The shared page. */
+	struct lguest_net *peer;
+	unsigned long peer_phys;
+
+	/* My peerid. */
+	unsigned int me;
+
+	struct net_device_stats stats;
+
+	/* Receive queue. */
+	struct sk_buff *skb[NUM_SKBS];
+	struct lguest_dma dma[NUM_SKBS];
+};
+
+/* How many bytes left in this page. */
+static unsigned int rest_of_page(void *data)
+{
+	return PAGE_SIZE - ((unsigned long)data % PAGE_SIZE);
+}
+
+/* Simple convention: offset 4 * peernum. */
+static unsigned long peer_addr(struct lguestnet_info *info, unsigned peernum)
+{
+	return info->peer_phys + 4 * peernum;
+}
+
+static void skb_to_dma(const struct sk_buff *skb, unsigned int len,
+		       struct lguest_dma *dma)
+{
+	unsigned int i, seg;
+
+	for (i = seg = 0; i < len; seg++, i += rest_of_page(skb->data + i)) {
+		dma->addr[seg] = virt_to_phys(skb->data + i);
+		dma->len[seg] = min((unsigned)(len - i),
+				    rest_of_page(skb->data + i));
+	}
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++, seg++) {
+		const skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+		/* Should not happen with MTU less than 64k - 2 * PAGE_SIZE. */
+		if (seg == LGUEST_MAX_DMA_SECTIONS) {
+			printk("Woah dude!  Megapacket!\n");
+			break;
+		}
+		dma->addr[seg] = page_to_phys(f->page) + f->page_offset;
+		dma->len[seg] = f->size;
+	}
+	if (seg < LGUEST_MAX_DMA_SECTIONS)
+		dma->len[seg] = 0;
+}
+
+static void transfer_packet(struct net_device *dev,
+			    struct sk_buff *skb,
+			    unsigned int peernum)
+{
+	struct lguestnet_info *info = dev->priv;
+	struct lguest_dma dma;
+
+	skb_to_dma(skb, skb->len, &dma);
+	pr_debug("xfer length %04x (%u)\n", htons(skb->len), skb->len);
+
+	hcall(LHCALL_SEND_DMA, peer_addr(info,peernum), __pa(&dma),0);
+	if (dma.used_len != skb->len) {
+		info->stats.tx_carrier_errors++;
+		pr_debug("Bad xfer to peer %i: %i of %i (dma %p/%i)\n",
+			 peernum, dma.used_len, skb->len,
+			 (void *)dma.addr[0], dma.len[0]);
+	} else {
+		pr_debug("lguestnet: sent %u bytes\n", skb->len);
+		info->stats.tx_bytes += skb->len;
+		info->stats.tx_packets++;
+	}
+}
+
+static int mac_eq(const unsigned char mac[ETH_ALEN],
+		  struct lguestnet_info *info, unsigned int peer)
+{
+	/* Ignore multicast bit, which peer turns on to mean promisc. */
+	if ((info->peer[peer].promisc & (~PROMISC_BIT)) != mac[0])
+		return 0;
+	return memcmp(mac+1, info->peer[peer].mac+1, ETH_ALEN-1) == 0;
+}
+
+static int unused_peer(const struct lguest_net peer[], unsigned int num)
+{
+	return peer[num].guestid == 0xFFFF;
+}
+
+static int is_broadcast(const unsigned char dest[ETH_ALEN])
+{
+	return dest[0] == 0xFF && dest[1] == 0xFF && dest[2] == 0xFF
+		&& dest[3] == 0xFF && dest[4] == 0xFF && dest[5] == 0xFF;
+}
+
+static int promisc(struct lguestnet_info *info, unsigned int peer)
+{
+	return info->peer[peer].promisc & PROMISC_BIT;
+}
+
+static void lguestnet_set_multicast(struct net_device *dev)
+{
+	struct lguestnet_info *info = dev->priv;
+
+	if (dev->flags & IFF_PROMISC)
+		info->peer[info->me].promisc |= PROMISC_BIT;
+	else
+		info->peer[info->me].promisc &= ~PROMISC_BIT;
+}
+
+static int lguestnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	unsigned int i;
+	int broadcast;
+	struct lguestnet_info *info = dev->priv;
+	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
+
+	broadcast = is_broadcast(dest);
+
+	pr_debug("lguestnet %s: xmit broadcast=%i\n",
+		 dev->name, broadcast);
+	pr_debug("dest: %02x:%02x:%02x:%02x:%02x:%02x\n",
+		 dest[0], dest[1], dest[2], dest[3], dest[4], dest[5]);
+
+	for (i = 0; i < SHARED_SIZE/sizeof(struct lguest_net); i++) {
+		if (i == info->me || unused_peer(info->peer, i))
+			continue;
+
+		if (!broadcast && !promisc(info, i) && !mac_eq(dest, info, i))
+			continue;
+
+		pr_debug("lguestnet %s: sending from %i to %i\n",
+			 dev->name, info->me, i);
+		transfer_packet(dev, skb, i);
+	}
+	dev_kfree_skb(skb);
+	return 0;
+}
+
+static struct sk_buff *lguestnet_alloc_skb(struct net_device *dev, int gfpflags)
+{
+	struct sk_buff *skb;
+
+	skb = alloc_skb(16 + ETH_HLEN + DATA_SIZE, gfpflags);
+	if (!skb)
+		return NULL;
+
+	skb->dev = dev;
+	skb_reserve(skb, 16);
+	return skb;
+}
+
+/* Find a new skb to put in this slot in shared mem. */
+static int fill_slot(struct net_device *dev, unsigned int slot)
+{
+	struct lguestnet_info *info = dev->priv;
+	/* Try to create and register a new one. */
+	info->skb[slot] = lguestnet_alloc_skb(dev, GFP_ATOMIC);
+	if (!info->skb[slot]) {
+		printk("%s: could not fill slot %i\n", dev->name, slot);
+		return -ENOMEM;
+	}
+
+	skb_to_dma(info->skb[slot], ETH_HLEN + DATA_SIZE, &info->dma[slot]);
+	wmb();
+	/* Now we tell hypervisor it can use the slot. */
+	info->dma[slot].used_len = 0;
+
+	pr_debug("lguestnet: %s populating slot %i with %p\n",
+		 dev->name, slot, info->skb[slot]);
+	return 0;
+}
+
+static irqreturn_t lguestnet_rcv(int irq, void *dev_id)
+{
+	struct net_device *dev = dev_id;
+	struct lguestnet_info *info = dev->priv;
+	unsigned int i, done = 0;
+
+	for (i = 0; i < ARRAY_SIZE(info->dma); i++) {
+		unsigned int length;
+		struct sk_buff *skb;
+
+		length = info->dma[i].used_len;
+		if (length == 0)
+			continue;
+
+		done++;
+		skb = info->skb[i];
+		fill_slot(dev, i);
+
+		pr_debug("Received skb %p\n", skb);
+		if (length < 14 || length > 1514) {
+			pr_debug(KERN_WARNING "%s: unbelievable skb len: %i\n",
+				 dev->name, length);
+			dev_kfree_skb(skb);
+			continue;
+		}
+
+		skb_put(skb, length);
+		skb->protocol = eth_type_trans(skb, dev);
+		/* This is a reliable transport. */
+		skb->ip_summed = CHECKSUM_UNNECESSARY;
+		pr_debug("Receiving skb proto 0x%04x len %i type %i\n",
+			 ntohs(skb->protocol), skb->len,skb->pkt_type);
+
+		info->stats.rx_bytes += skb->len;
+		info->stats.rx_packets++;
+		netif_rx(skb);
+	}
+	return done ? IRQ_HANDLED : IRQ_NONE;
+}
+
+static int populate_page(struct net_device *dev)
+{
+	int i;
+	struct lguestnet_info *info = dev->priv;
+
+	pr_debug("lguestnet: peer %i shared page %p me %p\n",
+	         info->me, info->peer, &info->peer[info->me]);
+
+	/* Set up our MAC address */
+	memcpy(info->peer[info->me].mac, dev->dev_addr, ETH_ALEN);
+
+	/* Turn on promisc mode if needed */
+	lguestnet_set_multicast(dev);
+
+	for (i = 0; i < ARRAY_SIZE(info->dma); i++) {
+		if (fill_slot(dev, i) != 0)
+			goto cleanup;
+	}
+	pr_debug("lguestnet: allocated %i dma recv buffers\n", i);
+	if (!hcall(LHCALL_BIND_DMA, peer_addr(info, info->me), __pa(info->dma),
+		   (NUM_SKBS << 8) | dev->irq))
+		goto cleanup;
+	return 0;
+
+cleanup:
+	while (--i >= 0)
+		dev_kfree_skb(info->skb[i]);
+	return -ENOMEM;
+}
+
+static void unpopulate_page(struct lguestnet_info *info)
+{
+	unsigned int i;
+	struct lguest_net *me = &info->peer[info->me];
+
+	/* Clear all trace: others might deliver packets, we'll ignore it. */
+	memset(me, 0, sizeof(*me));
+
+	/* Deregister sg lists. */
+	hcall(LHCALL_BIND_DMA, peer_addr(info, info->me), __pa(info->dma), 0);
+	for (i = 0; i < ARRAY_SIZE(info->dma); i++)
+		dev_kfree_skb(info->skb[i]);
+}
+
+static int lguestnet_open(struct net_device *dev)
+{
+	return populate_page(dev);
+}
+
+static int lguestnet_close(struct net_device *dev)
+{
+	unpopulate_page(dev->priv);
+	return 0;
+}
+
+static struct net_device_stats *lguestnet_get_stats(struct net_device *dev)
+{
+	struct lguestnet_info *info = dev->priv;
+
+	return &info->stats;
+}
+
+static int lguestnet_probe(struct lguest_device *lhdev)
+{
+	int err, irqf = IRQF_SHARED;
+	unsigned long mapsize;
+	struct net_device *dev;
+	struct lguestnet_info *info;
+	struct lguest_device_desc *desc = &lguest_devices[lhdev->index];
+
+	pr_debug("lguest_net: probing for device %i\n", lhdev->index);
+	mapsize = PAGE_SIZE * desc->num_pages;
+
+	dev = alloc_etherdev(sizeof(struct lguestnet_info));
+	if (!dev)
+		return -ENOMEM;
+
+	SET_MODULE_OWNER(dev);
+
+	/* Ethernet defaults with some changes */
+	ether_setup(dev);
+	dev->set_mac_address = NULL;
+	dev->mtu = DATA_SIZE;
+
+	dev->dev_addr[0] = 0x02; /* set local assignment bit (IEEE802) */
+	dev->dev_addr[1] = 0x00;
+	memcpy(&dev->dev_addr[2], &lguest_data.guestid, 2);
+	dev->dev_addr[4] = 0x00;
+	dev->dev_addr[5] = 0x00;
+
+	dev->open = lguestnet_open;
+	dev->stop = lguestnet_close;
+	dev->hard_start_xmit = lguestnet_start_xmit;
+	dev->get_stats = lguestnet_get_stats;
+
+	/* Turning on/off promisc will call dev->set_multicast_list.
+	 * We don't actually support multicast yet */
+	dev->set_multicast_list = lguestnet_set_multicast;
+	dev->mem_start = ((unsigned long)desc->pfn << PAGE_SHIFT);
+	dev->mem_end = dev->mem_start + mapsize;
+	dev->irq = lhdev->index+1;
+	dev->dma = 0;
+	dev->features = NETIF_F_SG;
+	if (desc->features & LGUEST_NET_F_NOCSUM)
+		dev->features |= NETIF_F_NO_CSUM;
+
+	info = dev->priv;
+	info->peer_phys = ((unsigned long)desc->pfn << PAGE_SHIFT);
+	info->peer = (void *)ioremap(info->peer_phys, mapsize);
+	/* This stores our peerid (upper bits reserved for future). */
+	info->me = (desc->features & (mapsize-1));
+
+	/* skbs allocated on open */
+	memset(info->skb, 0, sizeof(info->skb));
+
+	err = register_netdev(dev);
+	if (err) {
+		pr_debug("lguestnet: registering device failed\n");
+		goto free;
+	}
+
+	if (lguest_devices[lhdev->index].features & LGUEST_DEVICE_F_RANDOMNESS)
+		irqf |= IRQF_SAMPLE_RANDOM;
+	if (request_irq(dev->irq, lguestnet_rcv, irqf, "lguestnet", dev) != 0) {
+		pr_debug("lguestnet: could not get net irq %i\n", dev->irq);
+		goto unregister;
+	}
+
+	pr_debug("lguestnet: registered device %s\n", dev->name);
+	lhdev->private = dev;
+	return 0;
+
+unregister:
+	unregister_netdev(dev);
+free:
+	free_netdev(dev);
+	return err;
+}
+
+static struct lguest_driver lguestnet_drv = {
+	.name = "lguestnet",
+	.owner = THIS_MODULE,
+	.device_type = LGUEST_DEVICE_T_NET,
+	.probe = lguestnet_probe,
+};
+
+static __init int lguestnet_init(void)
+{
+	return register_lguest_driver(&lguestnet_drv);
+}
+module_init(lguestnet_init);
+
+MODULE_DESCRIPTION("Lguest network driver");
+MODULE_LICENSE("GPL");



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 8/10] lguest: console driver.
  2007-02-09  9:22         ` [PATCH 7/10] lguest: Simple lguest network driver Rusty Russell
@ 2007-02-09  9:23           ` Rusty Russell
  2007-02-09  9:24             ` [PATCH 9/10] lguest: block driver Rusty Russell
  0 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:23 UTC (permalink / raw)
  To: lkml - Kernel Mailing List; +Cc: Andrew Morton, Andi Kleen, virtualization

A trivial driver to have a basic lguest console, using the hvc_console
infrastructure.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -45,6 +45,7 @@ obj-$(CONFIG_HVC_CONSOLE)	+= hvc_vio.o h
 obj-$(CONFIG_HVC_CONSOLE)	+= hvc_vio.o hvsi.o
 obj-$(CONFIG_HVC_ISERIES)	+= hvc_iseries.o
 obj-$(CONFIG_HVC_RTAS)		+= hvc_rtas.o
+obj-$(CONFIG_LGUEST_GUEST)	+= hvc_lguest.o
 obj-$(CONFIG_HVC_DRIVER)	+= hvc_console.o
 obj-$(CONFIG_RAW_DRIVER)	+= raw.o
 obj-$(CONFIG_SGI_SNSC)		+= snsc.o snsc_event.o
===================================================================
--- /dev/null
+++ b/drivers/char/hvc_lguest.c
@@ -0,0 +1,99 @@
+/* Simple console for lguest.
+ *
+ * Copyright (C) 2006 Rusty Russell, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+#include <linux/err.h>
+#include <linux/init.h>
+#include <asm/lguest_device.h>
+#include "hvc_console.h"
+
+static int cons_irq;
+static int cons_offset;
+static char inbuf[256];
+static struct lguest_dma cons_input = { .used_len = 0,
+					.addr[0] = __pa(inbuf),
+					.len[0] = sizeof(inbuf),
+					.len[1] = 0 };
+
+static int get_chars(u32 vtermno, char *buf, int count)
+{
+	if (!cons_input.used_len)
+		return 0;
+
+	if (cons_input.used_len - cons_offset < count)
+		count = cons_input.used_len - cons_offset;
+
+	memcpy(buf, inbuf + cons_offset, count);
+	cons_offset += count;
+	if (cons_offset == cons_input.used_len) {
+		cons_offset = 0;
+		cons_input.used_len = 0;
+	}
+	return count;
+}
+
+static int put_chars(u32 vtermno, const char *buf, int count)
+{
+	struct lguest_dma dma;
+
+	/* FIXME: what if it's over a page boundary? */
+	dma.len[0] = count;
+	dma.len[1] = 0;
+	dma.addr[0] = __pa(buf);
+
+	hcall(LHCALL_SEND_DMA, 4, __pa(&dma), 0);
+	return count;
+}
+
+struct hv_ops lguest_cons = {
+	.get_chars = get_chars,
+	.put_chars = put_chars,
+};
+
+static int __init cons_init(void)
+{
+	if (strcmp(paravirt_ops.name, "lguest") != 0)
+		return 0;
+
+	return hvc_instantiate(0, 0, &lguest_cons);
+}
+console_initcall(cons_init);
+
+static int lguestcons_probe(struct lguest_device *lhdev)
+{
+	cons_irq = lhdev->index+1;
+	lhdev->private = hvc_alloc(0, cons_irq, &lguest_cons, 256);
+	if (IS_ERR(lhdev->private))
+		return PTR_ERR(lhdev->private);
+
+	if (!hcall(LHCALL_BIND_DMA, 0, __pa(&cons_input), (1<<8)+cons_irq))
+		printk("lguest console: failed to bind buffer.\n");
+	return 0;
+}
+
+static struct lguest_driver lguestcons_drv = {
+	.name = "lguestcons",
+	.owner = THIS_MODULE,
+	.device_type = LGUEST_DEVICE_T_CONSOLE,
+	.probe = lguestcons_probe,
+};
+
+static int __init hvc_lguest_init(void)
+{
+	return register_lguest_driver(&lguestcons_drv);
+}
+module_init(hvc_lguest_init);



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 9/10] lguest: block driver
  2007-02-09  9:23           ` [PATCH 8/10] lguest: console driver Rusty Russell
@ 2007-02-09  9:24             ` Rusty Russell
  2007-02-09  9:25               ` [PATCH 10/10] lguest: documentatation including example launcher Rusty Russell
  0 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:24 UTC (permalink / raw)
  To: lkml - Kernel Mailing List
  Cc: Andrew Morton, Andi Kleen, virtualization, axboe

A simple block driver for lguest (/dev/lgbX).  Only does one request
at once.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -28,4 +28,5 @@ obj-$(CONFIG_VIODASD)		+= viodasd.o
 obj-$(CONFIG_VIODASD)		+= viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
+obj-$(CONFIG_LGUEST_GUEST)	+= lguest_blk.o
 
===================================================================
--- /dev/null
+++ b/drivers/block/lguest_blk.c
@@ -0,0 +1,260 @@
+/* A simple block driver for lguest.
+ *
+ * Copyright 2006 Rusty Russell <rusty@rustcorp.com.au> IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+//#define DEBUG
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/blkdev.h>
+#include <linux/interrupt.h>
+#include <asm/lguest_device.h>
+
+static char next_block_index = 'a';
+
+struct blockdev
+{
+	spinlock_t lock;
+
+	/* The disk structure for the kernel. */
+	struct gendisk *disk;
+
+	/* The major number for this disk. */
+	int major;
+	int irq;
+
+	unsigned long phys_addr;
+	/* The ioremap'ed block page. */
+	struct lguest_block_page *lb_page;
+
+	/* We only have a single request outstanding at a time. */
+	struct lguest_dma dma;
+	struct request *req;
+};
+
+static irqreturn_t lgb_irq(int irq, void *_bd)
+{
+	struct blockdev *bd = _bd;
+	unsigned long flags;
+
+	if (!bd->req) {
+		pr_debug("No work!\n");
+		return IRQ_NONE;
+	}
+
+	if (!bd->lb_page->result) {
+		pr_debug("No result!\n");
+		return IRQ_NONE;
+	}
+
+	spin_lock_irqsave(&bd->lock, flags);
+	end_request(bd->req, bd->lb_page->result == 1);
+	bd->req = NULL;
+	bd->dma.used_len = 0;
+	blk_start_queue(bd->disk->queue);
+	spin_unlock_irqrestore(&bd->lock, flags);
+	return IRQ_HANDLED;
+}
+
+static unsigned int req_to_dma(struct request *req, struct lguest_dma *dma)
+{
+	unsigned int i = 0, idx, len = 0;
+	struct bio *bio;
+
+	rq_for_each_bio(bio, req) {
+		struct bio_vec *bvec;
+		bio_for_each_segment(bvec, bio, idx) {
+			BUG_ON(i == LGUEST_MAX_DMA_SECTIONS);
+			BUG_ON(!bvec->bv_len);
+			dma->addr[i] = page_to_phys(bvec->bv_page)
+				+ bvec->bv_offset;
+			dma->len[i] = bvec->bv_len;
+			len += bvec->bv_len;
+			i++;
+		}
+	}
+	if (i < LGUEST_MAX_DMA_SECTIONS)
+		dma->len[i] = 0;
+	return len;
+}
+
+static void empty_dma(struct lguest_dma *dma)
+{
+	dma->len[0] = 0;
+}
+
+static void setup_req(struct blockdev *bd,
+		      int type, struct request *req, struct lguest_dma *dma)
+{
+	bd->lb_page->type = type;
+	bd->lb_page->sector = req->sector;
+	bd->lb_page->result = 0;
+	bd->req = req;
+	bd->lb_page->bytes = req_to_dma(req, dma);
+}
+
+static int do_write(struct blockdev *bd, struct request *req)
+{
+	struct lguest_dma send;
+
+	pr_debug("lgb: WRITE sector %li\n", (long)req->sector);
+	setup_req(bd, 1, req, &send);
+
+	hcall(LHCALL_SEND_DMA, bd->phys_addr, __pa(&send), 0);
+	return 1;
+}
+
+static int do_read(struct blockdev *bd, struct request *req)
+{
+	struct lguest_dma ping;
+
+	pr_debug("lgb: READ sector %li\n", (long)req->sector);
+	setup_req(bd, 0, req, &bd->dma);
+
+	empty_dma(&ping);
+	hcall(LHCALL_SEND_DMA,bd->phys_addr,__pa(&ping),0);
+	return 1;
+}
+
+static void do_lgb_request(request_queue_t *q)
+{
+	struct blockdev *bd;
+	struct request *req;
+	int ok;
+
+again:
+	req = elv_next_request(q);
+	if (!req)
+		return;
+
+	bd = req->rq_disk->private_data;
+	/* Sometimes we get repeated requests after blk_stop_queue. */
+	if (bd->req)
+		return;
+
+	if (!blk_fs_request(req)) {
+		pr_debug("Got non-command 0x%08x\n", req->cmd_type);
+	error:
+		req->errors++;
+		end_request(req, 0);
+		goto again;
+	} else {
+		if (rq_data_dir(req) == WRITE)
+			ok = do_write(req->rq_disk->private_data, req);
+		else
+			ok = do_read(req->rq_disk->private_data, req);
+
+		if (!ok)
+			goto error;
+		/* Wait for interrupt to tell us it's done. */
+		blk_stop_queue(q);
+	}
+}
+
+static struct block_device_operations lguestblk_fops = {
+	.owner = THIS_MODULE,
+};
+
+static int lguestblk_probe(struct lguest_device *lhdev)
+{
+	struct blockdev *bd;
+	int err;
+	int irqflags = IRQF_SHARED;
+
+	bd = kmalloc(sizeof(*bd), GFP_KERNEL);
+	if (!bd)
+		return -ENOMEM;
+
+	spin_lock_init(&bd->lock);
+	bd->phys_addr = (lguest_devices[lhdev->index].pfn << PAGE_SHIFT);
+
+	bd->disk = alloc_disk(1);
+	if (!bd->disk) {
+		err = -ENOMEM;
+		goto out_free_bd;
+	}
+
+	bd->disk->queue = blk_init_queue(do_lgb_request, &bd->lock);
+	if (!bd->disk->queue) {
+		err = -ENOMEM;
+		goto out_put;
+	}
+
+	/* We can only handle a certain number of sg entries */
+	blk_queue_max_hw_segments(bd->disk->queue, LGUEST_MAX_DMA_SECTIONS);
+	/* Buffers must not cross page boundaries */
+	blk_queue_segment_boundary(bd->disk->queue, PAGE_SIZE-1);
+
+	bd->irq = lhdev->index+1;
+	bd->major = register_blkdev(0, "lguestblk");
+	if (bd->major < 0) {
+		err = bd->major;
+		goto out_cleanup_queue;
+	}
+	bd->lb_page = (void *)ioremap(bd->phys_addr, PAGE_SIZE);
+	bd->req = NULL;
+
+	sprintf(bd->disk->disk_name, "lgb%c", next_block_index++);
+	if (lguest_devices[lhdev->index].features & LGUEST_DEVICE_F_RANDOMNESS)
+		irqflags |= IRQF_SAMPLE_RANDOM;
+	err = request_irq(bd->irq, lgb_irq, irqflags, bd->disk->disk_name, bd);
+	if (err)
+		goto out_unmap;
+
+	bd->dma.used_len = 0;
+	bd->dma.len[0] = 0;
+	hcall(LHCALL_BIND_DMA, bd->phys_addr, __pa(&bd->dma), (1<<8)+bd->irq);
+
+	printk(KERN_INFO "%s: device %i at major %d\n",
+	       bd->disk->disk_name, lhdev->index, bd->major);
+
+	bd->disk->major = bd->major;
+	bd->disk->first_minor = 0;
+	bd->disk->private_data = bd;
+	bd->disk->fops = &lguestblk_fops;
+	/* This is initialized to the disk size by the other end. */
+	set_capacity(bd->disk, bd->lb_page->num_sectors);
+	add_disk(bd->disk);
+
+	lhdev->private = bd;
+	return 0;
+
+out_unmap:
+	iounmap(bd->lb_page);
+out_cleanup_queue:
+	blk_cleanup_queue(bd->disk->queue);
+out_put:
+	put_disk(bd->disk);
+out_free_bd:
+	kfree(bd);
+	return err;
+}
+
+static struct lguest_driver lguestblk_drv = {
+	.name = "lguestblk",
+	.owner = THIS_MODULE,
+	.device_type = LGUEST_DEVICE_T_BLOCK,
+	.probe = lguestblk_probe,
+};
+
+static __init int lguestblk_init(void)
+{
+	return register_lguest_driver(&lguestblk_drv);
+}
+module_init(lguestblk_init);
+
+MODULE_DESCRIPTION("Lguest block driver");
+MODULE_LICENSE("GPL");



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 10/10] lguest: documentatation including example launcher
  2007-02-09  9:24             ` [PATCH 9/10] lguest: block driver Rusty Russell
@ 2007-02-09  9:25               ` Rusty Russell
  0 siblings, 0 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09  9:25 UTC (permalink / raw)
  To: lkml - Kernel Mailing List
  Cc: Andrew Morton, Andi Kleen, virtualization, axboe

Name: Lguest documentatation and example launcher

Fairly complete documentation for lguest.  I actually want to get rid
of the "coding" part of lguest.txt and roll it into the code itself,
literary-programming-style.

The launcher utility is also here: I don't have delusions of interface
stability, so it makes sense to have it here as an example, and it's
only 1000 lines.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/Documentation/dontdiff
+++ b/Documentation/dontdiff
@@ -144,3 +144,6 @@ wanxlfw.inc
 wanxlfw.inc
 uImage
 zImage
+hypervisor-blob.c
+lguest.lds
+hypervisor-raw
===================================================================
--- /dev/null
+++ b/Documentation/lguest/Makefile
@@ -0,0 +1,21 @@
+# This creates the demonstration utility "lguest" which runs a Linux guest.
+
+# We rely on CONFIG_PAGE_OFFSET to know where to put lguest binary.
+# Some shells (dash - ubunu) can't handle numbers that big so we cheat.
+include ../../.config
+LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x08000000)
+
+CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 \
+	-static -DLGUEST_GUEST_TOP="$(LGUEST_GUEST_TOP)" -Wl,-T,lguest.lds
+LDLIBS:=-lz
+
+all: lguest
+
+# The linker script on x86 is so complex the only way of creating one
+# which will link our binary in the right place is to mangle the
+# default one.
+lguest.lds:
+	$(LD) --verbose | awk '/^==========/ { PRINT=1; next; } /SIZEOF_HEADERS/ { gsub(/0x[0-9A-F]*/, "$(LGUEST_GUEST_TOP)") } { if (PRINT) print $$0; }' > $@
+
+clean:
+	rm -f lguest.lds lguest
===================================================================
--- /dev/null
+++ b/Documentation/lguest/lguest.c
@@ -0,0 +1,989 @@
+/* Simple program to layout "physical" memory for new lguest guest.
+ * Linked high to avoid likely physical memory.  */
+#define _LARGEFILE64_SOURCE
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <err.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <elf.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <assert.h>
+#include <stdbool.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <time.h>
+#include <netinet/in.h>
+#include <linux/if.h>
+#include <linux/if_tun.h>
+#include <sys/uio.h>
+#include <termios.h>
+#include <zlib.h>
+typedef uint32_t u32;
+typedef uint16_t u16;
+typedef uint8_t u8;
+
+#include "../../include/asm/lguest_user.h"
+
+#define PAGE_PRESENT 0x7 	/* Present, RW, Execute */
+#define NET_PEERNUM 1
+
+static bool verbose;
+#define verbose(args...) \
+	do { if (verbose) printf(args); fflush(stdout); } while(0)
+
+struct devices
+{
+	fd_set infds;
+	int max_infd;
+
+	struct device *dev;
+};
+
+struct device
+{
+	struct device *next;
+	struct lguest_device_desc *desc;
+	void *mem;
+
+	/* Watch this fd if handle_input non-NULL. */
+	int fd;
+	int (*handle_input)(int fd, struct device *me);
+
+	/* Watch DMA to this address if handle_input non-NULL. */
+	unsigned long watch_address;
+	u32 (*handle_output)(int fd, const struct iovec *iov,
+			     unsigned int num, struct device *me);
+
+	/* Device-specific data. */
+	void *priv;
+};
+
+static char buf[1024];
+static struct iovec discard_iov = { .iov_base=buf, .iov_len=sizeof(buf) };
+static int zero_fd;
+
+static u32 memparse(const char *ptr)
+{
+	char *end;
+	unsigned long ret = strtoul(ptr, &end, 0);
+
+	switch (*end) {
+	case 'G':
+	case 'g':
+		ret <<= 10;
+	case 'M':
+	case 'm':
+		ret <<= 10;
+	case 'K':
+	case 'k':
+		ret <<= 10;
+		end++;
+	default:
+		break;
+	}
+	return ret;
+}
+
+static inline unsigned long page_align(unsigned long addr)
+{
+	return ((addr + getpagesize()-1) & ~(getpagesize()-1));
+}
+
+/* initrd gets loaded at top of memory: return length. */
+static unsigned long load_initrd(const char *name, unsigned long end)
+{
+	int ifd;
+	struct stat st;
+	void *iaddr;
+
+	if (!name)
+		return 0;
+
+	ifd = open(name, O_RDONLY, 0);
+	if (ifd < 0)
+		err(1, "Opening initrd '%s'", name);
+		
+	if (fstat(ifd, &st) < 0)
+		err(1, "fstat() on initrd '%s'", name);
+
+	iaddr = mmap((void *)end - st.st_size, st.st_size,
+		     PROT_READ|PROT_EXEC|PROT_WRITE,
+		     MAP_FIXED|MAP_PRIVATE, ifd, 0);
+	if (iaddr != (void *)end - st.st_size)
+		err(1, "Mmaping initrd '%s' returned %p not %p",
+		    name, iaddr, (void *)end - st.st_size);
+	close(ifd);
+	verbose("mapped initrd %s size=%lu @ %p\n", name, st.st_size, iaddr);
+	return st.st_size;
+}
+
+/* First map /dev/zero over entire memory, then insert kernel. */
+static void map_memory(unsigned long mem)
+{
+	if (mmap(0, mem,
+		 PROT_READ|PROT_WRITE|PROT_EXEC,
+		 MAP_FIXED|MAP_PRIVATE, zero_fd, 0) != (void *)0)
+		err(1, "Mmaping /dev/zero for %li bytes", mem);
+}
+
+static u32 finish(unsigned long mem, unsigned long *page_offset,
+		  const char *initrd, unsigned long *ird_size)
+{
+	u32 *pgdir = NULL, *linear = NULL;
+	int i, pte_pages;
+
+	/* This is a top of mem. */
+	*ird_size = load_initrd(initrd, mem);
+
+	/* Below initrd is used as top level of pagetable. */
+	pte_pages = 1 + (mem/getpagesize() + 1023)/1024;
+
+	pgdir = (u32 *)page_align(mem - *ird_size - pte_pages*getpagesize());
+	linear = (void *)pgdir + getpagesize();
+
+	/* Linear map all of memory at page_offset (to top of mem). */
+	if (mem > -*page_offset)
+		mem = -*page_offset;
+
+	for (i = 0; i < mem / getpagesize(); i++)
+		linear[i] = ((i * getpagesize()) | PAGE_PRESENT);
+	verbose("Linear %p-%p (%i-%i) = %#08x-%#08x\n",
+		linear, linear+i-1, 0, i-1, linear[0], linear[i-1]);
+
+	/* Now set up pgd so that this memory is at page_offset */
+	for (i = 0; i < mem / getpagesize(); i += getpagesize()/sizeof(u32)) {
+		pgdir[(i + *page_offset/getpagesize())/1024] 
+			= (((u32)linear + i*sizeof(u32)) | PAGE_PRESENT);
+		verbose("Top level %lu = %#08x\n",
+			(i + *page_offset/getpagesize())/1024,
+			pgdir[(i + *page_offset/getpagesize())/1024]);
+	}
+
+	return (unsigned long)pgdir;
+}
+
+/* Returns the entry point */
+static u32 map_elf(int elf_fd, const Elf32_Ehdr *ehdr, unsigned long mem,
+		   unsigned long *pgdir_addr,
+		   const char *initrd, unsigned long *ird_size,
+		   unsigned long *page_offset)
+{
+	void *addr;
+	Elf32_Phdr phdr[ehdr->e_phnum];
+	unsigned int i;
+
+	/* Sanity checks. */
+	if (ehdr->e_type != ET_EXEC
+	    || ehdr->e_machine != EM_386
+	    || ehdr->e_phentsize != sizeof(Elf32_Phdr)
+	    || ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf32_Phdr))
+		errx(1, "Malformed elf header");
+
+	if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0)
+		err(1, "Seeking to program headers");
+	if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr))
+		err(1, "Reading program headers");
+
+	map_memory(mem);
+
+	*page_offset = 0;
+	/* We map the loadable segments at virtual addresses corresponding
+	 * to their physical addresses (our virtual == guest physical). */
+	for (i = 0; i < ehdr->e_phnum; i++) {
+		if (phdr[i].p_type != PT_LOAD)
+			continue;
+
+		verbose("Section %i: size %i addr %p\n",
+			i, phdr[i].p_memsz, (void *)phdr[i].p_paddr);
+		/* We map everything private, writable. */
+		if (phdr[i].p_paddr + phdr[i].p_memsz > mem)
+			errx(1, "Segment %i overlaps end of memory", i);
+
+		/* We expect linear address space. */
+		if (!*page_offset)
+			*page_offset = phdr[i].p_vaddr - phdr[i].p_paddr;
+		else if (*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr)
+			errx(1, "Page offset of section %i different", i);
+
+		/* Recent ld versions don't page align any more. */
+		if (phdr[i].p_paddr % getpagesize()) {
+			phdr[i].p_filesz += (phdr[i].p_paddr % getpagesize());
+			phdr[i].p_offset -= (phdr[i].p_paddr % getpagesize());
+			phdr[i].p_paddr -= (phdr[i].p_paddr % getpagesize());
+		}
+		addr = mmap((void *)phdr[i].p_paddr,
+			    phdr[i].p_filesz,
+			    PROT_READ|PROT_WRITE|PROT_EXEC,
+			    MAP_FIXED|MAP_PRIVATE,
+			    elf_fd, phdr[i].p_offset);
+		if (addr != (void *)phdr[i].p_paddr)
+			err(1, "Mmaping vmlinux segment %i returned %p not %p (%p)",
+			    i, addr, (void *)phdr[i].p_paddr, &phdr[i].p_paddr);
+	}
+
+	*pgdir_addr = finish(mem, page_offset, initrd, ird_size);
+	/* Entry is physical address: convert to virtual */
+	return ehdr->e_entry + *page_offset;
+}
+
+static unsigned long intuit_page_offset(unsigned char *img, unsigned long len)
+{
+	unsigned int i, possibilities[256];
+
+	for (i = 0; i + 4 < len; i++) {
+		/* mov 0xXXXXXXXX,%eax */
+		if (img[i] == 0xA1 && ++possibilities[img[i+4]] > 3)
+			return (unsigned long)img[i+4] << 24;
+	}
+	errx(1, "could not determine page offset");
+}
+
+static u32 bzimage(int fd, unsigned long mem, unsigned long *pgdir_addr,
+		   const char *initrd, unsigned long *ird_size,
+		   unsigned long *page_offset)
+{
+	gzFile f;
+	int ret, len = 0;
+	void *img = (void *)0x100000;
+
+	map_memory(mem);
+
+	f = gzdopen(fd, "rb");
+	if (gzdirect(f))
+		errx(1, "did not find correct gzip header");
+	while ((ret = gzread(f, img + len, 65536)) > 0)
+		len += ret;
+	if (ret < 0)
+		err(1, "reading image from bzImage");
+
+	verbose("Unpacked size %i addr %p\n", len, img);
+	*page_offset = intuit_page_offset(img, len);
+	*pgdir_addr = finish(mem, page_offset, initrd, ird_size);
+
+	/* Entry is physical address: convert to virtual */
+	return (u32)img + *page_offset;
+}
+
+static u32 load_bzimage(int bzimage_fd, const Elf32_Ehdr *ehdr, 
+			unsigned long mem, unsigned long *pgdir_addr,
+			const char *initrd, unsigned long *ird_size,
+			unsigned long *page_offset)
+{
+	unsigned char c;
+	int state = 0;
+
+	/* Just brute force it. */
+	while (read(bzimage_fd, &c, 1) == 1) {
+		switch (state) {
+		case 0:
+			if (c == 0x1F)
+				state++;
+			break;
+		case 1:
+			if (c == 0x8B)
+				state++;
+			else
+				state = 0;
+			break;
+		case 2 ... 8:
+			state++;
+			break;
+		case 9:
+			lseek(bzimage_fd, -10, SEEK_CUR);
+			if (c != 0x03) /* Compressed under UNIX. */
+				state = -1;
+			else
+				return bzimage(bzimage_fd, mem, pgdir_addr,
+					       initrd, ird_size, page_offset);
+		}
+	}
+	errx(1, "Could not find kernel in bzImage");
+}
+
+static void *map_pages(unsigned long addr, unsigned int num)
+{
+	if (mmap((void *)addr, getpagesize() * num,
+		 PROT_READ|PROT_WRITE|PROT_EXEC,
+		 MAP_FIXED|MAP_PRIVATE, zero_fd, 0) != (void *)addr)
+		err(1, "Mmaping %u pages of /dev/zero @%p", num, (void *)addr);
+	return (void *)addr;
+}
+
+static struct lguest_device_desc *
+get_dev_entry(struct lguest_device_desc *descs, u16 type, u16 num_pages)
+{
+	static unsigned long top = LGUEST_GUEST_TOP;
+	int i;
+	unsigned long pfn = 0;
+
+	if (num_pages) {
+		top -= num_pages*getpagesize();
+		map_pages(top, num_pages);
+		pfn = top / getpagesize();
+	}
+
+	for (i = 0; i < LGUEST_MAX_DEVICES; i++) {
+		if (!descs[i].type) {
+			descs[i].features = descs[i].status = 0;
+			descs[i].type = type;
+			descs[i].num_pages = num_pages;
+			descs[i].pfn = pfn;
+			return &descs[i];
+		}
+	}
+	errx(1, "too many devices");
+}
+
+static void set_fd(int fd, struct devices *devices)
+{
+	FD_SET(fd, &devices->infds);
+	if (fd > devices->max_infd)
+		devices->max_infd = fd;
+}
+
+static struct device *new_device(struct devices *devices,
+				 struct lguest_device_desc *descs,
+				 u16 type, u16 num_pages,
+				 int fd,
+				 int (*handle_input)(int, struct device *),
+				 unsigned long watch_off,
+				 u32 (*handle_output)(int,
+						      const struct iovec *,
+						      unsigned,
+						      struct device *))
+{
+	struct device *dev = malloc(sizeof(*dev));
+
+	dev->next = devices->dev;
+	devices->dev = dev;
+
+	dev->fd = fd;
+	if (handle_input)
+		set_fd(dev->fd, devices);
+	dev->desc = get_dev_entry(descs, type, num_pages);
+	dev->mem = (void *)(dev->desc->pfn * getpagesize());
+	dev->handle_input = handle_input;
+	dev->watch_address = (unsigned long)dev->mem + watch_off;
+	dev->handle_output = handle_output;
+	return dev;
+}
+
+static int tell_kernel(u32 pagelimit, u32 pgdir, u32 start, u32 page_offset)
+{
+	u32 args[] = { LHREQ_INITIALIZE,
+		       pagelimit, pgdir, start, page_offset };
+	int fd = open("/dev/lguest", O_RDWR);
+
+	if (fd < 0)
+		err(1, "Opening /dev/lguest");
+
+	verbose("Telling kernel limit %u, pgdir %i, e=%#08x page_off=0x%08x\n",
+		pagelimit, pgdir, start, page_offset);
+	if (write(fd, args, sizeof(args)) < 0)
+		err(1, "Writing to /dev/lguest");
+	return fd;
+}
+
+static void concat(char *dst, char *args[])
+{
+	unsigned int i, len = 0;
+
+	for (i = 0; args[i]; i++) {
+		strcpy(dst+len, args[i]);
+		strcat(dst+len, " ");
+		len += strlen(args[i]) + 1;
+	}
+	/* In case it's empty. */
+	dst[len] = '\0';
+}
+
+static void *_check_pointer(unsigned long addr, unsigned int size,
+			    unsigned int line)
+{
+	if (addr >= LGUEST_GUEST_TOP || addr + size >= LGUEST_GUEST_TOP)
+		errx(1, "%s:%i: Invalid address %li", __FILE__, line, addr);
+	return (void *)addr;
+}
+#define check_pointer(addr,size) _check_pointer(addr, size, __LINE__)
+
+/* Returns pointer to dma->used_len */
+static u32 *dma2iov(unsigned long dma, struct iovec iov[], unsigned *num)
+{
+	unsigned int i;
+	struct lguest_dma *udma;
+
+	/* No buffers? */
+	if (dma == 0) {
+		printf("no buffers\n");
+		return NULL;
+	}
+
+	udma = check_pointer(dma, sizeof(*udma));
+	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) {
+		if (!udma->len[i])
+			break;
+
+		iov[i].iov_base = check_pointer(udma->addr[i], udma->len[i]);
+		iov[i].iov_len = udma->len[i];
+	}
+	*num = i;
+	return &udma->used_len;
+}
+
+static u32 *get_dma_buffer(int fd, void *addr,
+			   struct iovec iov[], unsigned *num, u32 *irq)
+{
+	u32 buf[] = { LHREQ_GETDMA, (u32)addr };
+	unsigned long udma;
+	u32 *res;
+
+	udma = write(fd, buf, sizeof(buf));
+	if (udma == (unsigned long)-1)
+		return NULL;
+
+	/* Kernel stashes irq in ->used_len. */
+	res = dma2iov(udma, iov, num);
+	if (res)
+		*irq = *res;
+	return res;
+}
+
+static void trigger_irq(int fd, u32 irq)
+{
+	u32 buf[] = { LHREQ_IRQ, irq };
+	if (write(fd, buf, sizeof(buf)) != 0)
+		err(1, "Triggering irq %i", irq);
+}
+
+static struct termios orig_term;
+static void restore_term(void)
+{
+	tcsetattr(STDIN_FILENO, TCSANOW, &orig_term);
+}
+
+struct console_abort
+{
+	int count;
+	struct timeval start;
+};
+
+/* We DMA input to buffer bound at start of console page. */
+static int handle_console_input(int fd, struct device *dev)
+{
+	u32 num, irq = 0, *lenp;
+	int len;
+	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
+	struct console_abort *abort = dev->priv;
+
+	lenp = get_dma_buffer(fd, dev->mem, iov, &num, &irq);
+	if (!lenp) {
+		warn("console: no dma buffer!");
+		iov[0] = discard_iov;
+		num = 1;
+	}
+
+	len = readv(dev->fd, iov, num);
+	if (len <= 0) {
+		warnx("Failed to get console input, ignoring console.");
+		len = 0;
+	}
+
+	if (lenp) {
+		*lenp = len;
+		trigger_irq(fd, irq);
+	}
+
+	/* Three ^C within one second?  Exit. */
+	if (len == 1 && ((char *)iov[0].iov_base)[0] == 3) {
+		if (!abort->count++)
+			gettimeofday(&abort->start, NULL);
+		else if (abort->count == 3) {
+			struct timeval now;
+			gettimeofday(&now, NULL);
+			if (now.tv_sec <= abort->start.tv_sec+1)
+				exit(2);
+			abort->count = 0;
+		}
+	} else
+		abort->count = 0;
+
+	if (!len) {
+		restore_term();
+		return 0;
+	}
+	return 1;
+}
+
+static unsigned long peer_offset(unsigned int peernum)
+{
+	return 4 * peernum;
+}
+
+static u32 handle_tun_output(int fd, const struct iovec *iov,
+			     unsigned num, struct device *dev)
+{
+	/* Now we've seen output, we should warn if we can't get buffers. */
+	*(bool *)dev->priv = true;
+	return writev(dev->fd, iov, num);
+}
+
+static u32 handle_block_output(int fd, const struct iovec *iov,
+			       unsigned num, struct device *dev)
+{
+	struct lguest_block_page *p = dev->mem;
+	u32 irq, reply_num, *lenp;
+	int len;
+	struct iovec reply[LGUEST_MAX_DMA_SECTIONS];
+	off64_t device_len, off = (off64_t)p->sector * 512;
+
+	device_len = *(off64_t *)dev->priv;
+
+	if (off >= device_len)
+		err(1, "Bad offset %llu vs %llu", off, device_len);
+	if (lseek64(dev->fd, off, SEEK_SET) != off)
+		err(1, "Bad seek to sector %i", p->sector);
+
+	verbose("Block: %s at offset %llu\n", p->type ? "WRITE" : "READ", off);
+
+	lenp = get_dma_buffer(fd, dev->mem, reply, &reply_num, &irq);
+	if (!lenp)
+		err(1, "Block request didn't give us a dma buffer");
+
+	if (p->type) {
+		len = writev(dev->fd, iov, num);
+		if (off + len > device_len) {
+			ftruncate(dev->fd, device_len);
+			errx(1, "Write past end %llu+%u", off, len);
+		}
+		*lenp = 0;
+	} else {
+		len = readv(dev->fd, reply, reply_num);
+		*lenp = len;
+	}
+
+	p->result = 1 + (p->bytes != len);
+	trigger_irq(fd, irq);
+	return 0;
+}
+
+#define HIPQUAD(ip)				\
+	((u8)(ip >> 24)),			\
+	((u8)(ip >> 16)),			\
+	((u8)(ip >> 8)),			\
+	((u8)(ip))
+
+static void configure_device(const char *devname, u32 ipaddr,
+			     unsigned char hwaddr[6])
+{
+	struct ifreq ifr;
+	int fd;
+	struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
+
+	memset(&ifr, 0, sizeof(ifr));
+	strcpy(ifr.ifr_name, devname);
+	sin->sin_family = AF_INET;
+	sin->sin_addr.s_addr = htonl(ipaddr);
+	fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+	if (fd < 0)
+		err(1, "opening IP socket");
+	if (ioctl(fd, SIOCSIFADDR, &ifr) != 0)
+		err(1, "Setting %s interface address", devname);
+	ifr.ifr_flags = IFF_UP;
+	if (ioctl(fd, SIOCSIFFLAGS, &ifr) != 0)
+		err(1, "Bringing interface %s up", devname);
+
+	if (ioctl(fd, SIOCGIFHWADDR, &ifr) != 0)
+		err(1, "getting hw address for %s", devname);
+
+	memcpy(hwaddr, ifr.ifr_hwaddr.sa_data, 6);
+}
+
+/* We send lguest_add signals while input is pending: avoids races. */
+static void wake_parent(int pipefd, struct devices *devices)
+{
+	int parent = getppid();
+	nice(19);
+
+	set_fd(pipefd, devices);
+
+	for (;;) {
+		fd_set rfds = devices->infds;
+
+		select(devices->max_infd+1, &rfds, NULL, NULL, NULL);
+		if (FD_ISSET(pipefd, &rfds)) {
+			int ignorefd;
+			if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0)
+				exit(0);
+			FD_CLR(ignorefd, &devices->infds);
+		}
+		kill(parent, SIGUSR1);
+	}
+}
+
+/* We don't want signal to kill us, just jerk us out of kernel. */
+static void wakeup(int signo)
+{
+}
+
+static int handle_tun_input(int fd, struct device *dev)
+{
+	u32 irq = 0, num, *lenp;
+	int len;
+	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
+
+	lenp = get_dma_buffer(fd, dev->mem+peer_offset(NET_PEERNUM), iov, &num,
+			      &irq);
+	if (!lenp) {
+		if (*(bool *)dev->priv)
+			warn("network: no dma buffer!");
+		iov[0] = discard_iov;
+		num = 1;
+	}
+
+	len = readv(dev->fd, iov, num);
+	if (len <= 0)
+		err(1, "reading network");
+	if (lenp) {
+		*lenp = len;
+		trigger_irq(fd, irq);
+	}
+	verbose("tun input packet len %i [%02x %02x] (%s)\n", len,
+		((u8 *)iov[0].iov_base)[0], ((u8 *)iov[0].iov_base)[1],
+		lenp ? "sent" : "discarded");
+	return 1;
+}
+
+/* We use fnctl locks to reserve network slots (autocleanup!) */
+static unsigned int find_slot(int netfd, const char *filename)
+{
+	struct flock fl;
+
+	fl.l_type = F_WRLCK;
+	fl.l_whence = SEEK_SET;
+	fl.l_len = 1;
+	for (fl.l_start = 0;
+	     fl.l_start < getpagesize()/sizeof(struct lguest_net);
+	     fl.l_start++) {
+		if (fcntl(netfd, F_SETLK, &fl) == 0)
+			return fl.l_start;
+	}
+	errx(1, "No free slots in network file %s", filename);
+}
+
+static void setup_net_file(const char *filename,
+			   struct lguest_device_desc *descs,
+			   struct devices *devices)
+{
+	int netfd;
+	struct device *dev;
+
+	netfd = open(filename, O_RDWR, 0);
+	if (netfd < 0) {
+		if (errno == ENOENT) {
+			netfd = open(filename, O_RDWR|O_CREAT, 0600);
+			if (netfd >= 0) {
+				char page[getpagesize()];
+				/* 0xFFFF == NO_GUEST */
+				memset(page, 0xFF, sizeof(page));
+				write(netfd, page, sizeof(page));
+			}
+		}
+		if (netfd < 0)
+			err(1, "cannot open net file '%s'", filename);
+	}
+
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_NET, 1,
+			 -1, NULL, 0, NULL);
+
+	/* This is the slot for the guest to use. */
+	dev->desc->features = find_slot(netfd, filename)|LGUEST_NET_F_NOCSUM;
+	/* We overwrite the /dev/zero mapping with the actual file. */
+	if (mmap(dev->mem, getpagesize(), PROT_READ|PROT_WRITE,
+			 MAP_FIXED|MAP_SHARED, netfd, 0) != dev->mem)
+			err(1, "could not mmap '%s'", filename);
+	verbose("device %p@%p: shared net %s, peer %i\n", dev->desc, 
+		(void *)(dev->desc->pfn * getpagesize()), filename, 
+		dev->desc->features & ~LGUEST_NET_F_NOCSUM);
+}
+
+static u32 str2ip(const char *ipaddr)
+{
+	unsigned int byte[4];
+
+	sscanf(ipaddr, "%u.%u.%u.%u", &byte[0], &byte[1], &byte[2], &byte[3]);
+	return (byte[0] << 24) | (byte[1] << 16) | (byte[2] << 8) | byte[3];
+}
+
+static void setup_tun_net(const char *ipaddr,
+			  struct lguest_device_desc *descs,
+			  struct devices *devices)
+{
+	struct device *dev;
+	struct ifreq ifr;
+	int netfd;
+
+	netfd = open("/dev/net/tun", O_RDWR);
+	if (netfd < 0)
+		err(1, "opening /dev/net/tun");
+
+	memset(&ifr, 0, sizeof(ifr));
+	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+	strcpy(ifr.ifr_name, "tap%d");
+	if (ioctl(netfd, TUNSETIFF, &ifr) != 0)
+		err(1, "configuring /dev/net/tun");
+
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_NET, 1,
+			 netfd, handle_tun_input,
+			 peer_offset(0), handle_tun_output);
+	dev->priv = malloc(sizeof(bool));
+	*(bool *)dev->priv = false;
+
+	/* We are peer 0, rest is all NO_GUEST */
+	memset(dev->mem, 0xFF, getpagesize());
+	configure_device(ifr.ifr_name, str2ip(ipaddr), dev->mem);
+
+	/* You will be peer 1: we should create enough jitter to randomize */
+	dev->desc->features = NET_PEERNUM|LGUEST_DEVICE_F_RANDOMNESS;
+	verbose("device %p@%p: tun net %u.%u.%u.%u\n", dev->desc, 
+		(void *)(dev->desc->pfn * getpagesize()),
+		HIPQUAD(str2ip(ipaddr)));
+}
+
+static void setup_block_file(const char *filename,
+			     struct lguest_device_desc *descs,
+			     struct devices *devices)
+{
+	int fd;
+	struct device *dev;
+	off64_t *blocksize;
+	struct lguest_block_page *p;
+
+	fd = open(filename, O_RDWR|O_LARGEFILE|O_DIRECT, 0);
+	if (fd < 0)
+		err(1, "Opening %s", filename);
+
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_BLOCK, 1,
+			 fd, NULL, 0, handle_block_output);
+	dev->desc->features = LGUEST_DEVICE_F_RANDOMNESS;
+	blocksize = dev->priv = malloc(sizeof(*blocksize));
+	*blocksize = lseek64(fd, 0, SEEK_END);
+	p = dev->mem;
+
+	p->num_sectors = *blocksize/512;
+	verbose("device %p@%p: block %i sectors\n", dev->desc, 
+		(void *)(dev->desc->pfn * getpagesize()), p->num_sectors);
+}
+
+static u32 handle_console_output(int fd, const struct iovec *iov,
+				 unsigned num, struct device*dev)
+{
+	return writev(STDOUT_FILENO, iov, num);
+}
+
+static void setup_console(struct lguest_device_desc *descs,
+			  struct devices *devices)
+{
+	struct device *dev;
+
+	if (tcgetattr(STDIN_FILENO, &orig_term) == 0) {
+		struct termios term = orig_term;
+		term.c_lflag &= ~(ISIG|ICANON|ECHO);
+		tcsetattr(STDIN_FILENO, TCSANOW, &term);
+		atexit(restore_term);
+	}
+
+	/* We don't currently require a page for the console. */
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_CONSOLE, 0,
+			 STDIN_FILENO, handle_console_input,
+			 4, handle_console_output);
+	dev->priv = malloc(sizeof(struct console_abort));
+	((struct console_abort *)dev->priv)->count = 0;
+	verbose("device %p@%p: console\n", dev->desc, 
+		(void *)(dev->desc->pfn * getpagesize()));
+}
+
+static const char *get_arg(const char *arg, const char *prefix)
+{
+	if (strncmp(arg, prefix, strlen(prefix)) == 0)
+		return arg + strlen(prefix);
+	return NULL;
+}
+
+static u32 handle_device(int fd, unsigned long dma, unsigned long addr,
+			 struct devices *devices)
+{
+	struct device *i;
+	u32 *lenp;
+	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
+	unsigned num = 0;
+
+	lenp = dma2iov(dma, iov, &num);
+	if (!lenp)
+		errx(1, "Bad SEND_DMA %li for address %#lx\n", dma, addr);
+
+	for (i = devices->dev; i; i = i->next) {
+		if (i->handle_output && addr == i->watch_address) {
+			*lenp = i->handle_output(fd, iov, num, i);
+			return 0;
+		}
+	}
+	warnx("Pending dma %p, addr %p", (void *)dma, (void *)addr);
+	return 0;
+}
+
+static void handle_input(int fd, int childfd, struct devices *devices)
+{
+	struct timeval poll = { .tv_sec = 0, .tv_usec = 0 };
+
+	for (;;) {
+		struct device *i;
+		fd_set fds = devices->infds;
+
+		if (select(devices->max_infd+1, &fds, NULL, NULL, &poll) == 0)
+			break;
+
+		for (i = devices->dev; i; i = i->next) {
+			if (i->handle_input && FD_ISSET(i->fd, &fds)) {
+				if (!i->handle_input(fd, i)) {
+					FD_CLR(i->fd, &devices->infds);
+					/* Tell child to ignore it too... */
+					write(childfd, &i->fd, sizeof(i->fd));
+				}
+			}
+		}
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	unsigned long mem, pgdir, entry, initrd_size, page_offset;
+	int arg, kern_fd, fd, child, pipefd[2];
+	Elf32_Ehdr hdr;
+	struct sigaction act;
+	sigset_t sigset;
+	struct lguest_device_desc *devdescs;
+	struct devices devices;
+	struct lguest_boot_info *boot = (void *)0;
+	const char *initrd_name = NULL;
+	u32 (*load)(int, const Elf32_Ehdr *ehdr, unsigned long,
+		    unsigned long *, const char *, unsigned long *,
+		    unsigned long *);
+
+	if (argv[1] && strcmp(argv[1], "--verbose") == 0) {
+		verbose = true;
+		argv++;
+		argc--;
+	}
+
+	if (argc < 4)
+		errx(1, "Usage: lguest [--verbose] <mem> vmlinux "
+			"[--sharenet=<filename>|--tunnet=<ipaddr>|--block=<filename>"
+			"|--initrd=<filename>]... [args...]");
+
+	zero_fd = open("/dev/zero", O_RDONLY, 0);
+	if (zero_fd < 0)
+		err(1, "Opening /dev/zero");
+
+	mem = memparse(argv[1]);
+	kern_fd = open(argv[2], O_RDONLY, 0);
+	if (kern_fd < 0)
+		err(1, "Opening %s", argv[2]);
+
+	if (read(kern_fd, &hdr, sizeof(hdr)) != sizeof(hdr))
+		err(1, "Reading %s elf header", argv[2]);
+
+	if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) == 0)
+		load = map_elf;
+	else
+		load = load_bzimage;
+
+	devices.max_infd = -1;
+	devices.dev = NULL;
+	FD_ZERO(&devices.infds);
+
+	devdescs = map_pages(mem, 1);
+	arg = 3;
+	while (argv[arg] && argv[arg][0] == '-') {
+		const char *argval;
+
+		if ((argval = get_arg(argv[arg], "--sharenet=")) != NULL)
+			setup_net_file(argval, devdescs, &devices);
+		else if ((argval = get_arg(argv[arg], "--tunnet=")) != NULL)
+			setup_tun_net(argval, devdescs, &devices);
+		else if ((argval = get_arg(argv[arg], "--block=")) != NULL)
+			setup_block_file(argval, devdescs, &devices);
+		else if ((argval = get_arg(argv[arg], "--initrd=")) != NULL)
+			initrd_name = argval;
+		else
+			errx(1, "unknown arg '%s'", argv[arg]);
+		arg++;
+	}
+
+	entry = load(kern_fd, &hdr, mem, &pgdir, initrd_name, &initrd_size,
+		     &page_offset);
+	setup_console(devdescs, &devices);
+
+	concat(boot->cmdline, argv+arg);
+	boot->max_pfn = mem/getpagesize();
+	boot->initrd_size = initrd_size;
+
+	act.sa_handler = wakeup;
+	sigemptyset(&act.sa_mask);
+	act.sa_flags = 0;
+	sigaction(SIGUSR1, &act, NULL);
+
+	pipe(pipefd);
+	child = fork();
+	if (child == -1)
+		err(1, "forking");
+
+	if (child == 0) {
+		close(pipefd[1]);
+		wake_parent(pipefd[0], &devices);
+	}
+	close(pipefd[0]);
+
+	sigemptyset(&sigset);
+	sigaddset(&sigset, SIGUSR1);
+	sigprocmask(SIG_BLOCK, &sigset, NULL);
+
+	/* LGUEST_GUEST_TOP defined in Makefile, just below us. */
+	fd = tell_kernel(LGUEST_GUEST_TOP/getpagesize(),
+			 pgdir, entry, page_offset);
+
+	for (;;) {
+		unsigned long arr[2];
+		int readval;
+
+		sigprocmask(SIG_UNBLOCK, &sigset, NULL);
+		readval = read(fd, arr, sizeof(arr));
+		sigprocmask(SIG_BLOCK, &sigset, NULL);
+
+		switch (readval) {
+		case sizeof(arr):
+			handle_device(fd, arr[0], arr[1], &devices);
+			break;
+		case -1:
+			if (errno == EINTR)
+				break;
+		default:
+			if (errno == ENOENT) {
+				char reason[1024];
+				if (read(fd, reason, sizeof(reason)) > 0)
+					errx(1, "%s", reason);
+			}
+			err(1, "Running guest failed");
+		}
+		handle_input(fd, pipefd[1], &devices);
+	}
+}
===================================================================
--- /dev/null
+++ b/Documentation/lguest/lguest.txt
@@ -0,0 +1,355 @@
+Rusty's Remarkably Unreliable Guide to Lguest
+	- or, A Young Coder's Illustrated Hypervisor
+http://lguest.ozlabs.org
+
+Lguest is designed to be a minimal hypervisor for the Linux kernel, for
+Linux developers and users to experiment with virtualization with the
+minimum of complexity.  Nonetheless, it should have sufficient
+features to make it useful for specific tasks, and, of course, you are
+encouraged to fork and enhance it.
+
+Features:
+
+- Kernel module which runs in a normal kernel.
+- Simple I/O model for communication.
+- Simple program to create new guests.
+- Logo contains cute puppies: http://lguest.ozlabs.org
+
+Developer features:
+
+- Fun to hack on.
+- No ABI: being tied to a specific kernel anyway, you can change anything.
+- Many opportunities for improvement or feature implementation.
+
+Running Lguest:
+
+- You will need to configure your kernel with the following options:
+
+  CONFIG_HIGHMEM64G=n ("High Memory Support" "64GB")[1]
+  CONFIG_TUN=y/m ("Universal TUN/TAP device driver support")
+  CONFIG_EXPERIMENTAL=y ("Prompt for development and/or incomplete code/drivers")
+  CONFIG_PARAVIRT=y ("Paravirtualization support (EXPERIMENTAL)")
+  CONFIG_LGUEST=y/m ("Linux hypervisor example code")
+
+  and I recommend:
+  CONFIG_HZ=100 ("Timer frequency")[2]
+
+  You must have a machine with a TSC: look for "tsc" in /proc/cpuinfo.
+  It's simple to remove this restriction, but everyone has a TSC these
+  days.
+
+- A tool called "lguest" is available in this directory: type "make"
+  to build it.
+
+- Create or find a root disk image.  There are several useful ones
+  around, such as the xm-test tiny root image at 
+	  http://xm-test.xensource.com/ramdisks/initrd-1.1-i386.img
+
+  For more serious work, I usually use a distribution ISO image and
+  install it under qemu, then make multiple copies:
+
+	  dd if=/dev/zero of=rootfile bs=1M count=2048
+	  qemu -cdrom image.iso -hda rootfile -net user -net nic -boot d
+
+- "modprobe lg" if you built it as a module.
+
+- Run an lguest as root:
+
+      Documentation/lguest/lguest 64m vmlinux --tunnet=192.168.19.1 --block=rootfile root=/dev/lgba
+
+   Explanation:
+    64m: the amount of memory to use.
+
+    vmlinux: the kernel image found in the top of your build directory.  You
+       can also use a standard bzImage.
+
+    --tunnet=192.168.19.1: configures a "tap" device for networking with this
+       IP address.
+
+    --block=rootfile: a file or block device which becomes /dev/lgba
+       inside the guest.
+
+    root=/dev/lgba: this (and anything else on the command line) are
+       kernel boot parameters.
+
+- Configuring networking.  I usually have the host masquerade, using
+  "iptables -t nat -o eth0 -j MASQUERADE" and "echo 1 >
+  /proc/sys/net/ipv4/ip_forward".  In this example, I would configure
+  eth0 inside the guest at 192.168.19.2.
+
+- You can also create an inter-guest network using
+  "--sharenet=<filename>": any two guests using the same file are on
+  the same network.  This file is created if it does not exist.
+
+
+Lguest I/O model:
+
+Lguest uses a simplified DMA model plus shared memory for I/O.  Guests
+can communicate with each other if they share underlying memory
+(usually by the lguest program mmaping the same file), but they can
+use any non-shared memory to communicate with the lguest process.
+
+Guests can register DMA buffers at any physical address using the
+LHCALL_BIND_DMA(physaddr, dmabufs, num<<8|irq) hypercall.  "dmabufs"
+is the physical address of an array of "num" "struct lguest_dma": each
+contains a used_len, and an array of physical addresses and lengths.
+When a transfer occurs, the "used_len" field of one of the buffers
+which has used_len 0 will be set to the length transferred and the irq
+will fire.
+
+Using an irq value of 0 unbinds the dma buffers.
+
+To send DMA, the LHCALL_SEND_DMA(physaddr, dma_physaddr) hypercall is
+used, and the bytes used is written to the used_len field.  This can
+be 0 if noone else has bound a DMA buffer to that address or some
+other error.  DMA buffers bound by the same guest are ignored.
+
+
+Hacking on Lguest:
+
+Lguest uses the paravirt_ops infrastructure to override various
+sensitive operations so Linux can run in ring level 1 (rather
+than 0).  These operations make "hypercalls": traps into a tiny shim
+which is mapped at the top of memory which then switches back to the
+host Linux for servicing.  In fact, any real interrupt and many
+traps cause a switch back to the host, which doesn't even notice that
+it was switched out.  This means that the guest process is scheduled
+like any other process, although it spends most of its time in its own
+special address space.
+
+Here are the parts of the hypervisor at the moment:
+
+hypervisor.S:
+	The assembler shim which is mapped at 0xFFC01000 (-4M+1page)
+	in the host and all the guests.  This is built into a .o file
+	and inserted in the source as a C array: it is simply copied
+	into the mapped memory.
+
+	The shim is entered from the host at switch_to_guest with
+	interrupts off: this saves state and switches page tables,
+	GDT, IDT, TSS and stack, then dives into the guest with an
+	iret.
+
+	There are two ways back to the host: a trap or an external
+	interrupt.  A trap, such as a page fault, goes through
+	return_to_host, which simply switches back and irets to the
+	caller (init.c's lcall), which decides what to do.  For an
+	interrupt we call deliver_to_host, which switches to the host
+	then jumps straight to the host interrupt routine: the
+	interrupt routine will do an "iret" at some stage, which, now
+	we've switched stacks, will return to the caller in init.c.
+
+page_tables.c:
+	We cannot let guests control their own pagetables, since they
+	must not access others' memory and their concept of physical
+	addresses is not related to the real physical addresses: the
+	guest "physical" addresses are in fact virtual addresses in
+	the host's lguest thread.  The process of mapping the two
+	can be fairly complicated.
+
+	We keep up to 4 cached page tables.  When a page is referred
+	to by these guest "shadow" pagetables, we keep a reference to
+	it to prevent the Linux kernel from thinking it is unused and
+	paging it out underneath us.
+	FIXME: it would be much better to have a callback in mm_struct.
+
+	The main work is done in page_in.  First we check the
+	top-level guest page table: if that entry is not present, then
+	it's a real guest fault and we reflect it to the guest.
+	Otherwise, we check the real top level, and allocate a new
+	pagetable page if necessary.  Then we check the next level of
+	the guest page table: if that isn't present, or this was a
+	write and the guest entry is read only, we reflect it to the
+	guest.  Otherwise, we check the guest entry, convert the page
+	number to the actual physical page number, then set it in our
+	page table.  At this point we also update the accessed and
+	dirty bits in the guest.
+
+	So a guest's top-level pagetable starts empty, and over time
+	we fault more pages in.  If the guest switches page tables, we
+	see if it's in out 4-entry cache: if not, we clear the
+	non-kernel section of one of them and use that.  (The kernel
+	page table entries will always be the same in all top levels).
+
+	We have to keep the stack pages for the guest kernel mapped at
+	all times, since we point some traps (particularly system
+	calls) directly into the guest.  If the stack were not mapped
+	we would get a double fault, which means we kill the guest.
+
+	Note that there are three page tables for each guest: the
+	Linux host ones which exist for lguest just like any other
+	process, the actual ones used when we switch to running the
+	guest, and the ones inside the guest which it thinks it's
+	using (and we copy to the actual ones after checking).
+
+hypercalls.c:
+	This is where the guest used int 0x1F to ask the hypervisor
+	for something.  The first hypercall is always
+	LHCALL_LGUEST_INIT, which tells us where the "struct
+	lguest_page" is.  We populate the lguest_page with useful
+	information, and it's also used to indicate virtual interrupts
+	and whether the guest expects interrupts to be disabled.
+
+	Most of these calls are fairly self-explanatory, or covered
+	elsewhere.  Note that LHCALL_CRASH allows a guest to get a
+	message out before any devices are enabled, which can be
+	useful for debugging.
+
+	do_async_hypercalls: a ringbuffer in the lguest page allows
+	the guest to queue hypercalls for later execution.  This is
+	useful for hypercall batching during context switch, and for
+	some bulk I/O.  The return value of the hypercall is
+	discarded, so it doesn't make sense to batch some hypercalls.
+	Note that we always do all these "async" calls before any
+	normal hypercall, which means that any hypercall acts as a
+	flush operation.  The only trick is that an async SEND_DMA
+	hypercall may need to be serviced by the host userspace; the
+	run_guest loop is constructed so that we continue servicing
+	hypercalls when we re-enter the loop after host userspace has
+	done the I/O operation.
+
+	setup_trampoline: this populates a stub for direct traps to
+	the guest.  Using a trampoline page (which sits just below the
+	hypervisor at -4M) ensures that the page is always mapped, and
+	also ensures that we reload the %gs register before entering the
+	kernel (see guest_load_tls).
+
+io.c:
+	lguest provides DMA-style transfer, and buffer registration.
+	The guest can dma send to a particular address, or register a
+	set of DMA buffers at a particular address.  This provides
+	inter-guest I/O (for shared addresses, such as a shared mmap)
+	or I/O out to the userspace process (lguest).
+
+	We currently use the futex infrastructure to see if a given
+	address is shared: if it is, we look for another guest which
+	has registered a DMA buffer at this address and copy the data,
+	then interrupt the recipient.  Otherwise, we notify the guest
+	userspace (which has access to all the guest memory) to handle
+	the transfer.
+
+	TODO: We could flip whole pages between guests at this point
+	if we wanted to, however it seems unlikely to be worthwhile.
+	More optimization could be gained by having servers for certain
+	devices within the host kernel itself, avoiding at
+	least two switches into the lguest binary and back.
+
+core.c:
+	This contains the core of lguest, "run_guest", which
+	continuously lcalls into the switch_to_guest routine until
+	something interesting happens.  In particular, we only return
+	to userspace (ie. "lguest") when a signal occurs or the guest
+	does a SEND_DMA destined for host userspace.
+
+	emulate_insn(): we don't paravirtualize io and out
+	instructions, so we trap and emulate them here.  This is only
+	used when the guest is booting and probing for PCI busses,
+	etc.
+
+	lguest_address_ok(): the guest kernel must not be able to
+	access the lguest binary, otherwise it could break out of
+	its virtualization, so all dereferences must use the
+	lhread_u32/lhwrite_u32/lhread/lhwrite routines which check
+	this.
+
+	reflect_trap(): when we decide that the guest should handle a
+	trap (a page fault, a general protection fault, an FPU fault
+	or a virtual interrupt), we manually push a trap frame onto
+	its stack as it expects it to be.  There are two kinds of
+	traps for x86: interrupt gates expect to have interrupts
+	disabled, and trap gates expect interrupts to be left alone.
+	The guest will restore interrupts in lguest_iret.
+
+	Of course, we don't actually let the guest disable interrupts,
+	just prevent us from delivering interupts to that guest (the
+	flag "irq_enabled" in the lguest_page).
+
+	kill_guest: this is used when an error occurs which can only
+	be caused by the guest kernel.  You can continue as normal
+	after this: the guest will exit when it returns to run_thread.
+
+	fixup_gdt_table: we protect the hypervisor shim from being
+	accessed using segments, so we have to trim segments the guest
+	uses to exclude the hypervisor.  The shim itself uses two
+	segments (only accessible to ring 0) which map the entire
+	memory range, and we use our own TSS entry.
+
+	guest_load_tls: glibc implements __thread using
+	thread-local-storage segments.  These segments start at a
+	different offset for each thread, and cover the entire 4GB
+	address space.  glibc then uses huge offsets into this segment
+	to wrap around and access variables below that offset.
+	Unfortunately, we cannot allow this in general, as this would
+	allow access to the hypervisor shim!  Fortunately, x86 page
+	table entries contain a "user" bit, which when cleared makes
+	pages inaccessible to ring level 3.  We clear this bit for the
+	pagetable entries mapping the hypervisor, so we can allow ring
+	3 (ie. userspace) access to 4G segments.  If the guest is in
+	ring 3, we setup the segment limits at the full 4G just before
+	calling into hypervisor.S.  It will reload %gs, then truncate
+	these TLS segments to a single page.  This ensures that any
+	reload of gs gets the truncated segments.  As the guest
+	userspace will also load %gs itself, we ignore the first
+	protection fault that occurs at any given address in userspace
+	(assuming it's caused by use of the truncated segment).  As
+	all traps reload gs explicitly (trampoline page) or implicitly
+	(reflect_trap), they all must reset the pointer to the
+	last-detected faulting instruction, as they will fault again.
+
+device.c:
+	This contains the host userspace interface code	(ie. /dev/lguest).
+
+	The read and write routines are where the userspace program
+	lguest starts and performs I/O to the guest.  The initial
+	write supplies the number of memory pages, the access limit
+	(which is used to ensure the guest doesn't overwrite the
+	lguest binary which sits above this address), the initial
+	guest pagetable top, and the address to jump into the guest
+	image.  Reading from the file causes the guest to run until a
+	signal or I/O is pending.
+
+lguest_bus.c:
+	A simple bus which sits in the lguest_page and indicates what
+	devices are available.  Using the interrupt model it would be
+	easy to make this dynamic.
+
+drivers/net/lguest_net.c:
+	A simple network device, which (invisible to the guest) can be
+	shared between several guests or simply talk to the lguest
+	process.  There is only one unusual element: the sender
+	needs to find the packet destination.
+
+	We manually scan the shared page for mac addresses to decide
+	where to send a packet.  We overload an unusable bit in that
+	mac address to indicate promiscuous mode (so the sender knows
+	to send a copy of all packets to that recipient).
+
+drivers/char/hvc_lguest.c:
+	A simple console.  It could use a shared page as a ringbuffer
+	and merely use the dma mechanism for notifications, but using
+	DMA directly is less code.
+
+	TODO: The console input can be flooded if it doesn't service
+	fast enough, and will lose characters.  If this is a problem,
+	switch to ringbuffer or use multiple DMA buffers and define an
+	ordering.
+
+drivers/block/lguest_blk.c:
+	A simple block device.  It's actually overkill for the current
+	use: talking to the userspace side is synchronous, but this allows
+	it to be served by something else in future.
+
+arch/i386/kernel/lguest.c:
+	The guest paravirt_ops implementation.  The only complexity is
+	in the implementation of lguest_iret: we need to restore the
+	interrupt state and return from the interrupt atomically.  To
+	this end, we tell the hypervisor that it is not to interrupt
+	us in those instructions between the restoration (usually
+	enabling) of interrupts and the actual "iret".
+
+Cheers!
+Rusty Russell rusty@rustcorp.com.au.
+
+[1] These are on various places on the TODO list, waiting for you to
+    get annoyed enough at the limitation to fix it.
+[2] Lguest is not yet tickless when idle.  See [1].



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler
  2007-02-09  9:14 ` [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler Rusty Russell
                     ` (2 preceding siblings ...)
  2007-02-09  9:18   ` [PATCH 4/10] lguest: Initialize esp0 properly all the time Rusty Russell
@ 2007-02-09  9:31   ` Andi Kleen
  2007-02-09 11:52     ` Rusty Russell
  3 siblings, 1 reply; 57+ messages in thread
From: Andi Kleen @ 2007-02-09  9:31 UTC (permalink / raw)
  To: virtualization; +Cc: Rusty Russell, lkml - Kernel Mailing List, Andrew Morton

On Friday 09 February 2007 10:14, Rusty Russell wrote:

> +unhandled_paravirt:
> +	/* Nothing wanted us: try to die with dignity (impossible trap). */ 
> +	movl	$0x1F, %edx
> +	pushl	$0
> +	jmp	early_fault

Please print a real message with early_printk


-Andi

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/10] lguest: Export symbols for lguest as a module
  2007-02-09  9:15   ` [PATCH 2/10] lguest: Export symbols for lguest as a module Rusty Russell
@ 2007-02-09  9:32     ` Andi Kleen
  2007-02-09 12:06       ` Rusty Russell
  0 siblings, 1 reply; 57+ messages in thread
From: Andi Kleen @ 2007-02-09  9:32 UTC (permalink / raw)
  To: virtualization; +Cc: Rusty Russell, lkml - Kernel Mailing List, Andrew Morton

On Friday 09 February 2007 10:15, Rusty Russell wrote:

> tsc_khz:
> 	Simplest way of telling the guest how to interpret the TSC
> 	counter.


Are you sure this will work with varying TSC frequencies? 

In general you should get this from cpufreq.

-Andi

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09  9:20       ` [PATCH 6/10] lguest code: the little linux hypervisor Rusty Russell
  2007-02-09  9:22         ` [PATCH 7/10] lguest: Simple lguest network driver Rusty Russell
@ 2007-02-09  9:35         ` Andrew Morton
  2007-02-09 11:00           ` Rusty Russell
  2007-02-09 10:09         ` Andi Kleen
  2 siblings, 1 reply; 57+ messages in thread
From: Andrew Morton @ 2007-02-09  9:35 UTC (permalink / raw)
  To: Rusty Russell
  Cc: lkml - Kernel Mailing List, Andi Kleen, virtualization,
	Paul Mackerras, Stephen Rothwell

On Fri, 09 Feb 2007 20:20:27 +1100 Rusty Russell <rusty@rustcorp.com.au> wrote:

> +#define log(...)					\
> +	do {						\
> +		mm_segment_t oldfs = get_fs();		\
> +		char buf[100];				\
> +		sprintf(buf, "lguest:" __VA_ARGS__);	\
> +		set_fs(KERNEL_DS);			\
> +		sys_write(1, buf, strlen(buf));		\
> +		set_fs(oldfs);				\
> +	} while(0)

Due to gcc shortcomings, each instance of this will chew an additional 100
bytes of stack.  Unless they fixed it recently.  Is a bit of a timebomb.  I
guess ksaprintf() could be used.

It also looks a bit, umm, innovative.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09  9:20       ` [PATCH 6/10] lguest code: the little linux hypervisor Rusty Russell
  2007-02-09  9:22         ` [PATCH 7/10] lguest: Simple lguest network driver Rusty Russell
  2007-02-09  9:35         ` [PATCH 6/10] lguest code: the little linux hypervisor Andrew Morton
@ 2007-02-09 10:09         ` Andi Kleen
  2007-02-09 12:39           ` Rusty Russell
  2 siblings, 1 reply; 57+ messages in thread
From: Andi Kleen @ 2007-02-09 10:09 UTC (permalink / raw)
  To: virtualization
  Cc: Rusty Russell, lkml - Kernel Mailing List, Andrew Morton,
	Stephen Rothwell, Paul Mackerras

On Friday 09 February 2007 10:20, Rusty Russell wrote:
> This is the core of lguest: both the guest code (always compiled in to
> the image so it can boot under lguest), and the host code (lg.ko).
> 
> There is only one config prompt at the moment: lguest is currently
> designed to run exactly the same guest and host kernels so we can
> frob the ABI freely.
> 
> Unfortunately, we don't have the build infrastructure for "private"
> asm-offsets.h files, so there's a not-so-neat include in
> arch/i386/kernel/asm-offsets.c.

Ask the kbuild people to fix that? 

It indeed looks ugly.

I bet Xen et.al. could make good use of that too.

> +# This links the hypervisor in the right place and turns it into a C array.
> +$(obj)/hypervisor-raw: $(obj)/hypervisor.o
> +	@$(LD) -static -Tdata=`printf %#x $$(($(HYPE_ADDR)))` -Ttext=`printf %#x $$(($(HYPE_ADDR)+$(HYPE_DATA_SIZE)))` -o $@ $< && $(OBJCOPY) -O binary $@
> +$(obj)/hypervisor-blob.c: $(obj)/hypervisor-raw
> +	@od -tx1 -An -v $< | sed -e 's/^ /0x/' -e 's/$$/,/' -e 's/ /,0x/g' > $@

an .S file with .incbin is more efficient and simpler
(note it has to be an separate .S file, otherwise icecream/distcc break) 

It won't allow to show off any sed skills, but I guess we can live with that ;-)


> +static struct vm_struct *hypervisor_vma;
> +static int cpu_had_pge;
> +static struct {
> +	unsigned long offset;
> +	unsigned short segment;
> +} lguest_entry;
> +struct page *hype_pages; /* Contiguous pages. */

Statics? looks funky.  Why only a single hypervisor_vma?

> +struct lguest lguests[MAX_LGUEST_GUESTS];
> +DECLARE_MUTEX(lguest_lock);
> +
> +/* IDT entries are at start of hypervisor. */
> +const unsigned long *__lguest_default_idt_entries(void)
> +{
> +	return (void *)HYPE_ADDR;
> +}
> +
> +/* Next is switch_to_guest */
> +static void *__lguest_switch_to_guest(void)
> +{
> +	return (void *)HYPE_ADDR + HYPE_DATA_SIZE;
> +}
> +
> +/* Then we use everything else to hold guest state. */
> +struct lguest_state *__lguest_states(void)
> +{
> +	return (void *)HYPE_ADDR + sizeof(hypervisor_blob);

This cries for asm_offsets.h too, doesn't it? 

> +}
> +
> +static __init int map_hypervisor(void)
> +{
> +	unsigned int i;
> +	int err;
> +	struct page *pages[HYPERVISOR_PAGES], **pagep = pages;
> +
> +	hype_pages = alloc_pages(GFP_KERNEL|__GFP_ZERO,
> +				 get_order(HYPERVISOR_SIZE));

Wasteful because of the rounding. Probably wants reintroduction
of alloc_pages_exact()


> +
> +static __exit void unmap_hypervisor(void)
> +{
> +	vunmap(hypervisor_vma->addr);
> +	__free_pages(hype_pages, get_order(HYPERVISOR_SIZE));

Shouldn't you clean up the GDTs too? 

> +}
> +
> +/* IN/OUT insns: enough to get us past boot-time probing. */
> +static int emulate_insn(struct lguest *lg)
> +{
> +	u8 insn;
> +	unsigned int insnlen = 0, in = 0, shift = 0;
> +	unsigned long physaddr = guest_pa(lg, lg->state->regs.eip);
> +
> +	/* This only works for addresses in linear mapping... */
> +	if (lg->state->regs.eip < lg->page_offset)
> +		return 0;

Shouldn't there be a printk here?

> +/* Saves exporting idt_table from kernel */
> +static struct desc_struct *get_idt_table(void)
> +{
> +	struct Xgt_desc_struct idt;
> +
> +	asm("sidt %0":"=m" (idt));

Nasty, but ok.

> +	return (void *)idt.address;
> +}
> +
> +extern asmlinkage void math_state_restore(void);

No externs in .c files

> +
> +/* Trap page resets this when it reloads gs. */
> +static int new_gfp_eip(struct lguest *lg, struct lguest_regs *regs)
> +{
> +	u32 eip;
> +	get_user(eip, &lg->lguest_data->gs_gpf_eip);
> +	if (eip == regs->eip)
> +		return 0;
> +	put_user(regs->eip, &lg->lguest_data->gs_gpf_eip);

No fault checking? 

lhread/write use probably also needs to be double checked that a malicious
guest can't put the kernel into a loop.

> +	return 1;
> +}
> +
> +static void set_ts(unsigned int guest_ts)
> +{
> +	u32 cr0;
> +	if (guest_ts) {
> +		asm("movl %%cr0,%0":"=r" (cr0));
> +		if (!(cr0 & 8))
> +			asm("movl %0,%%cr0": :"r" (cr0|8));
> +	}

We have macros and defines for this in standard headers.
\
> +	while (!lg->dead) {
> +		unsigned int cr2 = 0; /* Damn gcc */
> +
> +		/* Hypercalls first: we might have been out to userspace */
> +		if (do_async_hcalls(lg))
> +			goto pending_dma;
> +
> +		if (regs->trapnum == LGUEST_TRAP_ENTRY) {
> +			/* Only do hypercall once. */
> +			regs->trapnum = 255;
> +			if (hypercall(lg, regs))
> +				goto pending_dma;
> +		}
> +
> +		if (signal_pending(current))
> +			return -EINTR

Probably needs freezer checking here somewhere.

> ; 
> +		maybe_do_interrupt(lg);
> +
> +		if (lg->dead)
> +			break;
> +
> +		if (lg->halted) {
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			schedule_timeout(1);

1?  And what is that good for anyways?

> +				/* FIXME: If it's reloading %gs in a loop? */

Yes what then? Have you tried it?

In general i miss printks when things go wrong. Do you expect
all users to have a gdbstub ready? @)

> +pending_dma:
> +	put_user(lg->pending_dma, (unsigned long *)user);
> +	put_user(lg->pending_addr, (unsigned long *)user+1);

error checking? How do you avoid loops?


> +	if (cpu_has_pge) { /* We have a broader idea of "global". */
> +		cpu_had_pge = 1;
> +		on_each_cpu(adjust_pge, 0, 0, 1);

cpu hotplug? 

> +		clear_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability);
> +	}
> +	return 0;
> +}
> 
> +	case LHCALL_CRASH: {
> +		char msg[128];
> +		lhread(lg, msg, regs->edx, sizeof(msg));
> +		msg[sizeof(msg)-1] = '\0';

Might be safer to vet for isprint here

> +#define log(...)					\
> +	do {						\
> +		mm_segment_t oldfs = get_fs();		\
> +		char buf[100];				\

At least older gccs will accumulate the bufs in a function, eventually possibly blowing
the stack. Better use a function.


> +	/* If they're halted, we re-enable interrupts. */
> +	if (lg->halted) {
> +		/* Re-enable interrupts. */
> +		put_user(512, &lg->lguest_data->irq_enabled);

interesting magic number

> +	/* Ignore NMI, doublefault, hypercall, spurious interrupt. */
> +	if (i == 2 || i == 8 || i == 15 || i == LGUEST_TRAP_ENTRY)
> +		return;
> +	/* FIXME: We should handle debug and int3 */
> +	else if (i == 1 || i == 3)
> +		return;
> +	/* We intercept page fault, general protection fault and fpu missing */
> +	else if (i == 13)
> +		copy_trap(lg, &lg->gpf_trap, &d);
> +	else if (i == 14)
> +		copy_trap(lg, &lg->page_trap, &d);
> +	else if (i == 7)
> +		copy_trap(lg, &lg->fpu_trap, &d);
> +	/* Other traps go straight to guest. */
> +	else if (i < FIRST_EXTERNAL_VECTOR || i == SYSCALL_VECTOR)
> +		setup_idt(lg, i, &d);
> +	/* A virtual interrupt */
> +	else if (i < FIRST_EXTERNAL_VECTOR + LGUEST_IRQS)
> +		copy_trap(lg, &lg->interrupt[i-FIRST_EXTERNAL_VECTOR], &d);\

switch is not cool enough anymore?

>
> +	down(&lguest_lock);

i suspect mutexes are the new way to do this

> +	down_read(&current->mm->mmap_sem);
> +	if (get_futex_key((u32 __user *)addr, &key) != 0) {
> +		kill_guest(lg, "bad dma address %#lx", addr);
> +		goto unlock;

Risky? Use probe_kernel_address et.al.?

> +#if 0
> +/* FIXME: Use asm-offsets here... */

Remove?

> +extern int mce_disabled;

tststs

> +
> +/* FIXME: Update iff tsc rate changes. */

It does.


> +static fastcall void lguest_cpuid(unsigned int *eax, unsigned int *ebx,
> +				 unsigned int *ecx, unsigned int *edx)
> +{
> +	int is_feature = (*eax == 1);
> +
> +	asm volatile ("cpuid"
> +		      : "=a" (*eax),
> +			"=b" (*ebx),
> +			"=c" (*ecx),
> +			"=d" (*edx)
> +		      : "0" (*eax), "2" (*ecx));

What's wrong with the standard cpuid*() macros?

> +	extern struct Xgt_desc_struct cpu_gdt_descr;
> +	extern struct i386_pda boot_pda;

No externs in .c

> +
> +	paravirt_ops.name = "lguest";

Can you just statically initialize this and then copy over? 

> +	asm volatile ("mov %0, %%gs" : : "r" (__KERNEL_PDA) : "memory");

This will be %fs soon.


... haven't read everything else. the IO driver earlier was also not very closely looked at.

-Andi

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 6a/10] lguest: Config and headers
  2007-02-09  9:19       ` Rusty Russell
  (?)
  (?)
@ 2007-02-09 10:55       ` Rusty Russell
  2007-02-09 10:56         ` [PATCH 6b/10] lguest: the host code (lg.ko) Rusty Russell
  -1 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-09 10:55 UTC (permalink / raw)
  To: lkml - Kernel Mailing List
  Cc: Andrew Morton, Andi Kleen, Stephen Rothwell, Paul Mackerras,
	virtualization

[ Part 6 was too big, so posting in four parts ]

Unfortunately, we don't have the build infrastructure for "private"
asm-offsets.h files, so there's a not-so-neat include in
arch/i386/kernel/asm-offsets.c.

The four headers are:
asm/lguest.h:
	Things the guest needs to know (hypercall numbers, etc).
asm/lguest_device.h:
	Things lguest devices need to know (lguest bus registration)
asm/lguest_user.h:
	Things that the lguest userspace utility needs (/dev/lguest
	and some devices)
arch/i386/lguest/lg.h:
	Internal header for the lg module (which consists of 8 files).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -226,6 +226,27 @@ config ES7000_CLUSTERED_APIC
 	depends on SMP && X86_ES7000 && MPENTIUMIII
 
 source "arch/i386/Kconfig.cpu"
+
+config LGUEST
+	tristate "Linux hypervisor example code"
+	depends on X86 && PARAVIRT && EXPERIMENTAL && !X86_PAE
+	select LGUEST_GUEST
+	select HVC_DRIVER
+	---help---
+	  This is a very simple module which allows you to run
+	  multiple instances of the same Linux kernel, using the
+	  "lguest" command found in the Documentation/lguest directory.
+	  Note that "lguest" is pronounced to rhyme with "fell quest",
+	  not "rustyvisor".  See Documentation/lguest/lguest.txt.
+
+	  If unsure, say N.  If curious, say M.  If masochistic, say Y.
+
+config LGUEST_GUEST
+	bool
+	help
+	  The guest needs code built-in, even if the host has lguest
+	  support as a module.  The drivers are tiny, so we build them
+	  in too.
 
 config HPET_TIMER
 	bool "HPET Timer Support"
===================================================================
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -16,6 +16,10 @@
 #include <asm/thread_info.h>
 #include <asm/elf.h>
 #include <asm/pda.h>
+#ifdef CONFIG_LGUEST_GUEST
+#include <asm/lguest.h>
+#include "../lguest/lg.h"
+#endif
 
 #define DEFINE(sym, val) \
         asm volatile("\n->" #sym " %0 " #val : : "i" (val))
@@ -111,4 +115,19 @@ void foo(void)
 	OFFSET(PARAVIRT_iret, paravirt_ops, iret);
 	OFFSET(PARAVIRT_read_cr0, paravirt_ops, read_cr0);
 #endif
+
+#ifdef CONFIG_LGUEST_GUEST
+	BLANK();
+	OFFSET(LGUEST_DATA_irq_enabled, lguest_data, irq_enabled);
+	OFFSET(LGUEST_STATE_host_stackptr, lguest_state, host.stackptr);
+	OFFSET(LGUEST_STATE_host_pgdir, lguest_state, host.pgdir);
+	OFFSET(LGUEST_STATE_host_gdt, lguest_state, host.gdt);
+	OFFSET(LGUEST_STATE_host_idt, lguest_state, host.idt);
+	OFFSET(LGUEST_STATE_regs, lguest_state, regs);
+	OFFSET(LGUEST_STATE_gdt, lguest_state, gdt);
+	OFFSET(LGUEST_STATE_idt, lguest_state, idt);
+	OFFSET(LGUEST_STATE_gdt_table, lguest_state, gdt_table);
+	OFFSET(LGUEST_STATE_trapnum, lguest_state, regs.trapnum);
+	OFFSET(LGUEST_STATE_errcode, lguest_state, regs.errcode);
+#endif
 }
===================================================================
--- /dev/null
+++ b/include/asm-i386/lguest.h
@@ -0,0 +1,86 @@
+/* Things the lguest guest needs to know. */
+#ifndef _ASM_LGUEST_H
+#define _ASM_LGUEST_H
+
+#define LGUEST_MAGIC_EBP 0x4C687970
+#define LGUEST_MAGIC_EDI 0x652D4D65
+#define LGUEST_MAGIC_ESI 0xFFFFFFFF
+
+#define LHCALL_FLUSH_ASYNC	0
+#define LHCALL_LGUEST_INIT	1
+#define LHCALL_CRASH		2
+#define LHCALL_LOAD_GDT		3
+#define LHCALL_NEW_PGTABLE	4
+#define LHCALL_FLUSH_TLB	5
+#define LHCALL_LOAD_IDT_ENTRY	6
+#define LHCALL_SET_STACK	7
+#define LHCALL_TS		8
+#define LHCALL_TIMER_READ	9
+#define LHCALL_TIMER_START	10
+#define LHCALL_HALT		11
+#define LHCALL_GET_WALLCLOCK	12
+#define LHCALL_BIND_DMA		13
+#define LHCALL_SEND_DMA		14
+#define LHCALL_SET_PTE		15
+#define LHCALL_SET_UNKNOWN_PTE	16
+#define LHCALL_SET_PUD		17
+#define LHCALL_LOAD_TLS		18
+
+#define LGUEST_TRAP_ENTRY 0x1F
+
+static inline unsigned long
+hcall(unsigned long call,
+      unsigned long arg1, unsigned long arg2, unsigned long arg3)
+{
+	asm volatile("int $" __stringify(LGUEST_TRAP_ENTRY)
+		     : "=a"(call)
+		     : "a"(call), "d"(arg1), "b"(arg2), "c"(arg3) 
+		     : "memory");
+	return call;
+}
+
+void async_hcall(unsigned long call,
+		 unsigned long arg1, unsigned long arg2, unsigned long arg3);
+
+#define LGUEST_IRQS 32
+
+#define LHCALL_RING_SIZE 64
+struct hcall_ring
+{
+	u32 eax, edx, ebx, ecx;
+};
+
+/* All the good stuff happens here: guest registers it with LGUEST_INIT */
+struct lguest_data
+{
+/* Fields which change during running: */
+	/* 512 == enabled (same as eflags) */
+	unsigned int irq_enabled;
+	/* Blocked interrupts. */
+	DECLARE_BITMAP(interrupts, LGUEST_IRQS); 
+
+	/* Last (userspace) address we got a GPF & reloaded gs. */
+	unsigned int gs_gpf_eip;
+
+	/* Virtual address of page fault. */
+	unsigned long cr2;
+
+	/* Async hypercall ring.  0xFF == done, 0 == pending. */
+	u8 hcall_status[LHCALL_RING_SIZE];
+	struct hcall_ring hcalls[LHCALL_RING_SIZE];
+			
+/* Fields initialized by the hypervisor at boot: */
+	/* Memory not to try to access */
+	unsigned long reserve_mem;
+	/* ID of this guest (used by network driver to set ethernet address) */
+	u16 guestid;
+	/* Multiplier for TSC clock. */
+	u32 clock_mult;
+
+/* Fields initialized by the guest at boot: */
+	/* Instruction range to suppress interrupts even if enabled */
+	unsigned long noirq_start, noirq_end;
+};
+extern struct lguest_data lguest_data;
+extern struct lguest_device_desc *lguest_devices; /* Just past max_pfn */
+#endif	/* _ASM_LGUEST_H */
===================================================================
--- /dev/null
+++ b/include/asm-i386/lguest_device.h
@@ -0,0 +1,31 @@
+#ifndef _ASM_LGUEST_DEVICE_H
+#define _ASM_LGUEST_DEVICE_H
+/* Everything you need to know about lguest devices. */
+#include <linux/device.h>
+#include <asm/lguest.h>
+#include <asm/lguest_user.h>
+
+struct lguest_device {
+	/* Unique busid, and index into lguest_page->devices[] */
+	/* By convention, each device can use irq index+1 if it wants to. */
+	unsigned int index;
+
+	struct device dev;
+
+	/* Driver can hang data off here. */
+	void *private;
+};
+
+struct lguest_driver {
+	const char *name;
+	struct module *owner;
+	u16 device_type;
+	int (*probe)(struct lguest_device *dev);
+	void (*remove)(struct lguest_device *dev);
+
+	struct device_driver drv;
+};
+
+extern int register_lguest_driver(struct lguest_driver *drv);
+extern void unregister_lguest_driver(struct lguest_driver *drv);
+#endif /* _ASM_LGUEST_DEVICE_H */
===================================================================
--- /dev/null
+++ b/include/asm-i386/lguest_user.h
@@ -0,0 +1,86 @@
+#ifndef _ASM_LGUEST_USER
+#define _ASM_LGUEST_USER
+/* Everything the "lguest" userspace program needs to know. */
+/* They can register up to 32 arrays of lguest_dma. */
+#define LGUEST_MAX_DMA		32
+/* At most we can dma 16 lguest_dma in one op. */
+#define LGUEST_MAX_DMA_SECTIONS	16
+
+/* How many devices?  Assume each one wants up to two dma arrays per device. */
+#define LGUEST_MAX_DEVICES (LGUEST_MAX_DMA/2)
+
+struct lguest_dma
+{
+	/* 0 if free to be used, filled by hypervisor. */
+ 	u32 used_len;
+	u32 addr[LGUEST_MAX_DMA_SECTIONS];
+	u16 len[LGUEST_MAX_DMA_SECTIONS];
+};
+
+/* This is found at address 0. */
+struct lguest_boot_info
+{
+	u32 max_pfn;
+	u32 initrd_size;
+	char cmdline[256];
+};
+
+struct lguest_block_page
+{
+	/* 0 is a read, 1 is a write. */
+	int type;
+	u32 sector; 	/* Offset in device = sector * 512. */
+	u32 bytes;	/* Length expected to be read/written in bytes */
+	/* 0 = pending, 1 = done, 2 = done, error */
+	int result;
+	u32 num_sectors; /* Disk length = num_sectors * 512 */
+};
+
+/* There is a shared page of these. */
+struct lguest_net
+{
+	union {
+		unsigned char mac[6];
+		struct {
+			u8 promisc;
+			u8 pad;
+			u16 guestid;
+		};
+	};
+};
+
+/* lguest_device_desc->type */
+#define LGUEST_DEVICE_T_CONSOLE	1
+#define LGUEST_DEVICE_T_NET	2
+#define LGUEST_DEVICE_T_BLOCK	3
+
+/* lguest_device_desc->status.  256 and above are device specific. */
+#define LGUEST_DEVICE_S_ACKNOWLEDGE	1 /* We have seen device. */
+#define LGUEST_DEVICE_S_DRIVER		2 /* We have found a driver */
+#define LGUEST_DEVICE_S_DRIVER_OK	4 /* Driver says OK! */
+#define LGUEST_DEVICE_S_REMOVED		8 /* Device has gone away. */
+#define LGUEST_DEVICE_S_REMOVED_ACK	16 /* Driver has been told. */
+#define LGUEST_DEVICE_S_FAILED		128 /* Something actually failed */
+
+#define LGUEST_NET_F_NOCSUM		0x4000 /* Don't bother checksumming */
+#define LGUEST_DEVICE_F_RANDOMNESS	0x8000 /* IRQ is fairly random */
+
+/* We have a page of these descriptors in the lguest_device page. */
+struct lguest_device_desc {
+	u16 type;
+	u16 features;
+	u16 status;
+	u16 num_pages;
+	u32 pfn;
+};
+
+/* Write command first word is a request. */
+enum lguest_req
+{
+	LHREQ_INITIALIZE, /* + pfnlimit, pgdir, start, pageoffset */
+	LHREQ_GETDMA, /* + addr (returns &lguest_dma, irq in ->used_len) */
+	LHREQ_IRQ, /* + irq */
+};
+
+
+#endif /* _ASM_LGUEST_USER */
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/lg.h
@@ -0,0 +1,253 @@
+#ifndef _LGUEST_H
+#define _LGUEST_H
+
+#include <asm/desc.h>
+/* 64k ought to be enough for anybody! */
+#define HYPERVISOR_SIZE 65536
+#define HYPERVISOR_PAGES (HYPERVISOR_SIZE/PAGE_SIZE)
+
+#define GDT_ENTRY_LGUEST_CS	10
+#define GDT_ENTRY_LGUEST_DS	11
+#define LGUEST_CS		(GDT_ENTRY_LGUEST_CS * 8)
+#define LGUEST_DS		(GDT_ENTRY_LGUEST_DS * 8)
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/stringify.h>
+#include <linux/binfmts.h>
+#include <linux/futex.h>
+#include <asm/lguest.h>
+#include <asm/lguest_user.h>
+#include <asm/semaphore.h>
+#include "irq_vectors.h"
+
+#define GUEST_DPL 1
+
+struct lguest_regs
+{
+	/* Manually saved part. */
+	u32 cr3;
+	u32 ebx, ecx, edx;
+	u32 esi, edi, ebp;
+	u32 gs;
+	u32 eax;
+	u32 fs, ds, es;
+	u32 trapnum, errcode;
+	/* Trap pushed part */
+	u32 eip;
+	u32 cs;
+	u32 eflags;
+	u32 esp;
+	u32 ss;
+};
+
+__exit void free_pagetables(void);
+__init int init_pagetables(struct page *hype_pages);
+
+/* Full 4G segment descriptors, suitable for CS and DS. */
+#define FULL_EXEC_SEGMENT ((struct desc_struct){0x0000ffff, 0x00cf9b00}) 
+#define FULL_SEGMENT ((struct desc_struct){0x0000ffff, 0x00cf9300}) 
+
+/* Simplified version of IDT. */
+struct host_trap
+{
+	unsigned long addr;
+	int disable_interrupts;
+};
+
+struct lguest_dma_info
+{
+	struct list_head list;
+	union futex_key key;
+	unsigned long dmas;
+	u16 next_dma;
+	u16 num_dmas;
+	u16 guestid;
+	u8 interrupt; 	/* 0 when not registered */
+};
+
+struct pgdir
+{
+	u32 cr3;
+	u32 *pgdir;
+};
+
+/* The private info the thread maintains about the guest. */
+struct lguest
+{
+	struct lguest_state *state;
+	struct lguest_data __user *lguest_data;
+	struct task_struct *tsk;
+	struct mm_struct *mm; 	/* == tsk->mm, but that becomes NULL on exit */
+	u16 guestid;
+	u32 pfn_limit;
+	u32 page_offset;
+	u32 cr2;
+	int timer_on;
+	int halted;
+	int ts;
+	u32 gpf_eip;
+	u32 last_timer;
+	u32 next_hcall;
+	u16 tls_limits[GDT_ENTRY_TLS_ENTRIES];
+
+	/* We keep a small number of these. */
+	u32 pgdidx;
+	struct pgdir pgdirs[4];
+	void *trap_page;
+
+	/* Cached wakeup: we hold a reference to this task. */
+	struct task_struct *wake;
+
+	unsigned long noirq_start, noirq_end;
+	int dma_is_pending;
+	unsigned long pending_dma; /* struct lguest_dma */
+	unsigned long pending_addr; /* address they're sending to */
+
+	unsigned int stack_pages;
+
+	struct lguest_dma_info dma[LGUEST_MAX_DMA];
+
+	/* Dead? */
+	const char *dead;
+
+	/* We intercept page fault (demand shadow paging & cr2 saving)
+	   protection fault (in/out emulation, TLS handling) and
+	   device not available (TS handling). */
+	struct host_trap page_trap, gpf_trap, fpu_trap;
+
+	/* Virtual interrupts */
+	DECLARE_BITMAP(irqs_pending, LGUEST_IRQS);
+	struct host_trap interrupt[LGUEST_IRQS];
+};
+
+extern struct page *hype_pages; /* Contiguous pages. */
+extern struct lguest lguests[];
+extern struct semaphore lguest_lock;
+
+/* core.c: */
+/* Entry points in hypervisor */
+const unsigned long *__lguest_default_idt_entries(void);
+struct lguest_state *__lguest_states(void);
+u32 lhread_u32(struct lguest *lg, u32 addr);
+void lhwrite_u32(struct lguest *lg, u32 val, u32 addr);
+void lhread(struct lguest *lg, void *buf, u32 addr, unsigned bytes);
+void lhwrite(struct lguest *lg, u32 addr, const void *buf, unsigned bytes);
+int lguest_address_ok(const struct lguest *lg, unsigned long addr);
+int run_guest(struct lguest *lg, char *__user user);
+int find_free_guest(void);
+
+/* interrupts_and_traps.c: */
+void maybe_do_interrupt(struct lguest *lg);
+int reflect_trap(struct lguest *lg, const struct host_trap *trap, int has_err);
+void check_bug_kill(struct lguest *lg);
+void load_guest_idt_entry(struct lguest *lg, unsigned int i, u32 low, u32 hi);
+
+/* segments.c: */
+void load_guest_gdt(struct lguest *lg, u32 table, u32 num);
+void guest_load_tls(struct lguest *lg,
+		    const struct desc_struct __user *tls_array);
+
+int init_guest_pagetable(struct lguest *lg, u32 pgtable);
+void free_guest_pagetable(struct lguest *lg);
+void guest_new_pagetable(struct lguest *lg, u32 pgtable);
+void guest_set_pud(struct lguest *lg, unsigned long cr3, u32 i);
+void guest_pagetable_clear_all(struct lguest *lg);
+void guest_pagetable_flush_user(struct lguest *lg);
+void guest_set_pte(struct lguest *lg, unsigned long cr3,
+		   unsigned long vaddr, u32 val);
+void map_trap_page(struct lguest *info);
+int demand_page(struct lguest *info, u32 cr2, int write);
+void pin_stack_pages(struct lguest *lg);
+
+int lguest_device_init(void);
+void lguest_device_remove(void);
+void lguest_io_init(void);
+u32 bind_dma(struct lguest *lg,
+	     unsigned long addr, unsigned long udma, u16 numdmas,u8 interrupt);
+int send_dma(struct lguest *info, unsigned long addr,
+	     unsigned long udma);
+void release_all_dma(struct lguest *lg);
+unsigned long get_dma_buffer(struct lguest *lg, unsigned long addr,
+			     unsigned long *interrupt);
+
+void set_wakeup_process(struct lguest *lg, struct task_struct *p);
+int do_async_hcalls(struct lguest *info);
+int hypercall(struct lguest *info, struct lguest_regs *regs);
+
+#define kill_guest(lg, fmt...)					\
+do {								\
+	if (!(lg)->dead) {					\
+		(lg)->dead = kasprintf(GFP_ATOMIC, fmt);	\
+		if (!(lg)->dead)				\
+			(lg)->dead = (void *)1;			\
+	}							\
+} while(0)
+
+static inline unsigned long guest_pa(struct lguest *lg, unsigned long vaddr)
+{
+	return vaddr - lg->page_offset;
+}
+
+/* Hardware-defined TSS structure. */
+struct x86_tss
+{
+	unsigned short	back_link,__blh;
+	unsigned long	esp0;
+	unsigned short	ss0,__ss0pad;
+	unsigned long	esp1;
+	unsigned short	ss1,__ss1pad;
+	unsigned long	esp2;
+	unsigned short	ss2,__ss2pad;
+	unsigned long	cr3;
+	unsigned long	eip;
+	unsigned long	eflags;
+	unsigned long	eax,ecx,edx,ebx;
+	unsigned long	esp; /* We actually use this one to save esp. */
+	unsigned long	ebp;
+	unsigned long	esi;
+	unsigned long	edi;
+	unsigned short	es, __espad;
+	unsigned short	cs, __cspad;
+	unsigned short	ss, __sspad;
+	unsigned short	ds, __dspad;
+	unsigned short	fs, __fspad;
+	unsigned short	gs, __gspad;
+	unsigned short	ldt, __ldtpad;
+	unsigned short	trace, io_bitmap_base;
+};
+
+int fixup_gdt_table(struct desc_struct *gdt, unsigned int num,
+		    struct lguest_regs *regs, struct x86_tss *tss);
+
+struct lguest_host_state
+{
+	struct Xgt_desc_struct	gdt;
+	struct Xgt_desc_struct	idt;
+	unsigned long		pgdir;
+	unsigned long		stackptr;
+};
+
+/* This sits in the high-mapped shim. */
+struct lguest_state
+{
+	/* Task struct. */
+	struct x86_tss tss;
+
+	/* Gate descriptor table. */
+	struct Xgt_desc_struct gdt;
+	struct desc_struct gdt_table[GDT_ENTRIES];
+
+	/* Interrupt descriptor table. */
+	struct Xgt_desc_struct idt;
+	struct desc_struct idt_table[IDT_ENTRIES];
+
+	/* Host state we store while the guest runs. */
+	struct lguest_host_state host;
+
+	/* This is the stack on which we push our regs. */
+	struct lguest_regs regs;
+};
+#endif	/* __ASSEMBLY__ */
+#endif	/* _LGUEST_H */



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 6b/10] lguest: the host code (lg.ko)
  2007-02-09 10:55       ` [PATCH 6a/10] lguest: Config and headers Rusty Russell
@ 2007-02-09 10:56         ` Rusty Russell
  2007-02-09 10:57           ` [PATCH 6c/10] lguest: the guest code Rusty Russell
  0 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-09 10:56 UTC (permalink / raw)
  To: lkml - Kernel Mailing List; +Cc: Andrew Morton, Andi Kleen, virtualization

This is the host module (lg.ko) which supports lguest:

arch/i386/lguest/hypervisor.S:
	The actual guest <-> host switching code.  This is compiled into
	a C array, which is mapped to 0xFFC01000 in host and guests.

arch/i386/lguest/core.c:
	The core of the hypervisor, which calls into the assembler
	code which does this actual switch.  Also contains helper
	routines.

arch/i386/lguest/hypercalls.c:
	The entry point for the 19 hypercalls.

arch/i386/lguest/interrupts_and_traps.c:
	Handling of interrupts and traps, except page faults.

arch/i386/lguest/io.c:
	I/O from guest to host, and between guests.

arch/i386/lguest/lguest_user.c:
	/dev/lguest interface for lguest program to launch/control guests.

arch/i386/lguest/page_tables.c:
	Shadow Page table handling: generally we build up the shadow
	page tables by converting from guest page tables when a fault occurs.

arch/i386/lguest/segments.c:
	Segmentation (GDT) handling: we have to ensure they're trimmed
	to avoid guest access to the switching code.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- /dev/null
+++ b/arch/i386/lguest/core.c
@@ -0,0 +1,425 @@
+/* World's simplest hypervisor, to test paravirt_ops and show
+ * unbelievers that virtualization is the future.  Plus, it's fun! */
+#include <linux/module.h>
+#include <linux/stringify.h>
+#include <linux/stddef.h>
+#include <linux/io.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <asm/lguest.h>
+#include <asm/paravirt.h>
+#include <asm/desc.h>
+#include <asm/pgtable.h>
+#include <asm/uaccess.h>
+#include <asm/poll.h>
+#include <asm/highmem.h>
+#include <asm/asm-offsets.h>
+#include "lg.h"
+
+/* This is our hypervisor, compiled from hypervisor.S. */
+static char __initdata hypervisor_blob[] = {
+#include "hypervisor-blob.c"
+};
+
+#define MAX_LGUEST_GUESTS \
+	((HYPERVISOR_SIZE-sizeof(hypervisor_blob))/sizeof(struct lguest_state))
+
+static struct vm_struct *hypervisor_vma;
+static int cpu_had_pge;
+static struct {
+	unsigned long offset;
+	unsigned short segment;
+} lguest_entry;
+struct page *hype_pages; /* Contiguous pages. */
+struct lguest lguests[MAX_LGUEST_GUESTS];
+DECLARE_MUTEX(lguest_lock);
+
+/* IDT entries are at start of hypervisor. */
+const unsigned long *__lguest_default_idt_entries(void)
+{
+	return (void *)HYPE_ADDR;
+}
+
+/* Next is switch_to_guest */
+static void *__lguest_switch_to_guest(void)
+{
+	return (void *)HYPE_ADDR + HYPE_DATA_SIZE;
+}
+
+/* Then we use everything else to hold guest state. */
+struct lguest_state *__lguest_states(void)
+{
+	return (void *)HYPE_ADDR + sizeof(hypervisor_blob);
+}
+
+static __init int map_hypervisor(void)
+{
+	unsigned int i;
+	int err;
+	struct page *pages[HYPERVISOR_PAGES], **pagep = pages;
+
+	hype_pages = alloc_pages(GFP_KERNEL|__GFP_ZERO,
+				 get_order(HYPERVISOR_SIZE));
+	if (!hype_pages)
+		return -ENOMEM;
+
+	hypervisor_vma = __get_vm_area(HYPERVISOR_SIZE, VM_ALLOC,
+				       HYPE_ADDR, VMALLOC_END);
+	if (!hypervisor_vma) {
+		err = -ENOMEM;
+		printk("lguest: could not map hypervisor pages high\n");
+		goto free_pages;
+	}
+
+	for (i = 0; i < HYPERVISOR_PAGES; i++)
+		pages[i] = hype_pages + i;
+
+	err = map_vm_area(hypervisor_vma, PAGE_KERNEL, &pagep);
+	if (err) {
+		printk("lguest: map_vm_area failed: %i\n", err);
+		goto free_vma;
+	}
+	memcpy(hypervisor_vma->addr, hypervisor_blob, sizeof(hypervisor_blob));
+
+	/* Setup LGUEST segments on all cpus */
+	for_each_possible_cpu(i) {
+		get_cpu_gdt_table(i)[GDT_ENTRY_LGUEST_CS] = FULL_EXEC_SEGMENT;
+		get_cpu_gdt_table(i)[GDT_ENTRY_LGUEST_DS] = FULL_SEGMENT;
+	}
+
+	/* Initialize entry point into hypervisor. */
+	lguest_entry.offset = (long)__lguest_switch_to_guest();
+	lguest_entry.segment = LGUEST_CS;
+
+	printk("lguest: mapped hypervisor at %p\n", hypervisor_vma->addr);
+	return 0;
+
+free_vma:
+	vunmap(hypervisor_vma->addr);
+free_pages:
+	__free_pages(hype_pages, get_order(HYPERVISOR_SIZE));
+	return err;
+}
+
+static __exit void unmap_hypervisor(void)
+{
+	vunmap(hypervisor_vma->addr);
+	__free_pages(hype_pages, get_order(HYPERVISOR_SIZE));
+}
+
+/* IN/OUT insns: enough to get us past boot-time probing. */
+static int emulate_insn(struct lguest *lg)
+{
+	u8 insn;
+	unsigned int insnlen = 0, in = 0, shift = 0;
+	unsigned long physaddr = guest_pa(lg, lg->state->regs.eip);
+
+	/* This only works for addresses in linear mapping... */
+	if (lg->state->regs.eip < lg->page_offset)
+		return 0;
+	lhread(lg, &insn, physaddr, 1);
+
+	/* Operand size prefix means it's actually for ax. */
+	if (insn == 0x66) {
+		shift = 16;
+		insnlen = 1;
+		lhread(lg, &insn, physaddr + insnlen, 1);
+	}
+
+	switch (insn & 0xFE) {
+	case 0xE4: /* in     <next byte>,%al */
+		insnlen += 2;
+		in = 1;
+		break;
+	case 0xEC: /* in     (%dx),%al */
+		insnlen += 1;
+		in = 1;
+		break;
+	case 0xE6: /* out    %al,<next byte> */
+		insnlen += 2;
+		break;
+	case 0xEE: /* out    %al,(%dx) */
+		insnlen += 1;
+		break;
+	default:
+		return 0;
+	}
+
+	if (in) {
+		/* Lower bit tells is whether it's a 16 or 32 bit access */
+		if (insn & 0x1)
+			lg->state->regs.eax = 0xFFFFFFFF;
+		else
+			lg->state->regs.eax |= (0xFFFF << shift);
+	}
+	lg->state->regs.eip += insnlen;
+	return 1;
+}
+
+int find_free_guest(void)
+{
+	unsigned int i;
+	for (i = 0; i < MAX_LGUEST_GUESTS; i++)
+		if (!lguests[i].state)
+			return i;
+	return -1;
+}
+
+int lguest_address_ok(const struct lguest *lg, unsigned long addr)
+{
+	return addr / PAGE_SIZE < lg->pfn_limit;
+}
+
+/* Just like get_user, but don't let guest access lguest binary. */
+u32 lhread_u32(struct lguest *lg, u32 addr)
+{
+	u32 val = 0;
+
+	/* Don't let them access lguest_add */
+	if (!lguest_address_ok(lg, addr)
+	    || get_user(val, (u32 __user *)addr) != 0)
+		kill_guest(lg, "bad read address %u", addr);
+	return val;
+}
+
+void lhwrite_u32(struct lguest *lg, u32 addr, u32 val)
+{
+	if (!lguest_address_ok(lg, addr)
+	    || put_user(val, (u32 __user *)addr) != 0)
+		kill_guest(lg, "bad write address %u", addr);
+}
+
+void lhread(struct lguest *lg, void *b, u32 addr, unsigned bytes)
+{
+	if (addr + bytes < addr || !lguest_address_ok(lg, addr+bytes)
+	    || copy_from_user(b, (void __user *)addr, bytes) != 0) {
+		/* copy_from_user should do this, but as we rely on it... */
+		memset(b, 0, bytes);
+		kill_guest(lg, "bad read address %u len %u", addr, bytes);
+	}
+}
+
+void lhwrite(struct lguest *lg, u32 addr, const void *b, unsigned bytes)
+{
+	if (addr + bytes < addr
+	    || !lguest_address_ok(lg, addr+bytes)
+	    || copy_to_user((void __user *)addr, b, bytes) != 0)
+		kill_guest(lg, "bad write address %u len %u", addr, bytes);
+}
+
+/* Saves exporting idt_table from kernel */
+static struct desc_struct *get_idt_table(void)
+{
+	struct Xgt_desc_struct idt;
+
+	asm("sidt %0":"=m" (idt));
+	return (void *)idt.address;
+}
+
+extern asmlinkage void math_state_restore(void);
+
+static int usermode(struct lguest_regs *regs)
+{
+	return (regs->cs & SEGMENT_RPL_MASK) == USER_RPL;
+}
+
+/* Trap page resets this when it reloads gs. */
+static int new_gfp_eip(struct lguest *lg, struct lguest_regs *regs)
+{
+	u32 eip;
+	get_user(eip, &lg->lguest_data->gs_gpf_eip);
+	if (eip == regs->eip)
+		return 0;
+	put_user(regs->eip, &lg->lguest_data->gs_gpf_eip);
+	return 1;
+}
+
+static void set_ts(unsigned int guest_ts)
+{
+	u32 cr0;
+	if (guest_ts) {
+		asm("movl %%cr0,%0":"=r" (cr0));
+		if (!(cr0 & 8))
+			asm("movl %0,%%cr0": :"r" (cr0|8));
+	}
+}
+
+static void run_guest_once(struct lguest *lg)
+{
+	unsigned int clobber;
+
+	/* Put eflags on stack, lcall does rest. */
+	asm volatile("pushf; lcall *lguest_entry"
+		     : "=a"(clobber), "=d"(clobber)
+		     : "0"(lg->state), "1"(get_idt_table())
+		     : "memory");
+}
+
+int run_guest(struct lguest *lg, char *__user user)
+{
+	struct lguest_regs *regs = &lg->state->regs;
+
+	while (!lg->dead) {
+		unsigned int cr2 = 0; /* Damn gcc */
+
+		/* Hypercalls first: we might have been out to userspace */
+		if (do_async_hcalls(lg))
+			goto pending_dma;
+
+		if (regs->trapnum == LGUEST_TRAP_ENTRY) {
+			/* Only do hypercall once. */
+			regs->trapnum = 255;
+			if (hypercall(lg, regs))
+				goto pending_dma;
+		}
+
+		if (signal_pending(current))
+			return -EINTR;
+		maybe_do_interrupt(lg);
+
+		if (lg->dead)
+			break;
+
+		if (lg->halted) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(1);
+			continue;
+		}
+
+		/* Restore limits on TLS segments if in user mode. */
+		if (usermode(regs)) {
+			unsigned int i;
+			for (i = 0; i < ARRAY_SIZE(lg->tls_limits); i++)
+				lg->state->gdt_table[GDT_ENTRY_TLS_MIN+i].a
+					|= lg->tls_limits[i];
+		}
+
+		local_irq_disable();
+		map_trap_page(lg);
+
+		/* Host state to be restored after the guest returns. */
+		asm("sidt %0":"=m"(lg->state->host.idt));
+		lg->state->host.gdt = __get_cpu_var(cpu_gdt_descr);
+
+		/* Even if *we* don't want FPU trap, guest might... */
+		set_ts(lg->ts);
+
+		run_guest_once(lg);
+
+		/* Save cr2 now if we page-faulted. */
+		if (regs->trapnum == 14)
+			asm("movl %%cr2,%0" :"=r" (cr2));
+		else if (regs->trapnum == 7)
+			math_state_restore();
+		local_irq_enable();
+
+		switch (regs->trapnum) {
+		case 13: /* We've intercepted a GPF. */
+			if (regs->errcode == 0) {
+				if (emulate_insn(lg))
+					continue;
+
+				/* FIXME: If it's reloading %gs in a loop? */
+				if (usermode(regs) && new_gfp_eip(lg,regs))
+					continue;
+			}
+
+			if (reflect_trap(lg, &lg->gpf_trap, 1))
+				continue;
+			break;
+		case 14: /* We've intercepted a page fault. */
+			if (demand_page(lg, cr2, regs->errcode & 2))
+				continue;
+
+			/* If lguest_data is NULL, this won't hurt. */
+			put_user(cr2, &lg->lguest_data->cr2);
+			if (reflect_trap(lg, &lg->page_trap, 1))
+				continue;
+			kill_guest(lg, "unhandled page fault at %#x"
+				   " (eip=%#x, errcode=%#x)",
+				   cr2, regs->eip, regs->errcode);
+			break;
+		case 7: /* We've intercepted a Device Not Available fault. */
+			/* If they don't want to know, just absorb it. */
+			if (!lg->ts) 
+				continue;
+			if (reflect_trap(lg, &lg->fpu_trap, 0))
+				continue;
+			kill_guest(lg, "unhandled FPU fault at %#x",
+				   regs->eip);
+			break;
+		case 32 ... 255: /* Real interrupt, fall thru */
+			cond_resched();
+		case LGUEST_TRAP_ENTRY: /* Handled at top of loop */
+			continue;
+		case 6: /* Invalid opcode before they installed handler */
+			check_bug_kill(lg);
+		}
+		kill_guest(lg,"unhandled trap %i at %#x (err=%i)",
+			   regs->trapnum, regs->eip, regs->errcode);
+	}
+	return -ENOENT;
+
+pending_dma:
+	put_user(lg->pending_dma, (unsigned long *)user);
+	put_user(lg->pending_addr, (unsigned long *)user+1);
+	return sizeof(unsigned long)*2;
+}
+
+#define STRUCT_LGUEST_ELEM_SIZE(elem) sizeof(((struct lguest_state *)0)->elem)
+
+static void adjust_pge(void *on)
+{
+	if (on)
+		write_cr4(read_cr4() | X86_CR4_PGE);
+	else
+		write_cr4(read_cr4() & ~X86_CR4_PGE);
+}
+ 
+static int __init init(void)
+{
+	int err;
+
+	if (paravirt_enabled())
+		return -EPERM;
+
+	err = map_hypervisor();
+	if (err)
+		return err;
+
+	err = init_pagetables(hype_pages);
+	if (err) {
+		unmap_hypervisor();
+		return err;
+	}
+	lguest_io_init();
+
+	err = lguest_device_init();
+	if (err) {
+		free_pagetables();
+		unmap_hypervisor();
+		return err;
+	}
+	if (cpu_has_pge) { /* We have a broader idea of "global". */
+		cpu_had_pge = 1;
+		on_each_cpu(adjust_pge, 0, 0, 1);
+		clear_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability);
+	}
+	return 0;
+}
+
+static void __exit fini(void)
+{
+	lguest_device_remove();
+	free_pagetables();
+	unmap_hypervisor();
+	if (cpu_had_pge) {
+		set_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability);
+		on_each_cpu(adjust_pge, (void *)1, 0, 1);
+	}
+}
+
+module_init(init);
+module_exit(fini);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Rusty Russell <rusty@rustcorp.com.au>");
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/hypercalls.c
@@ -0,0 +1,199 @@
+/*  Actual hypercalls, which allow guests to actually do something.
+    Copyright (C) 2006 Rusty Russell IBM Corporation
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+*/
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/mm.h>
+#include <linux/clocksource.h>
+#include <asm/lguest.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <irq_vectors.h>
+#include "lg.h"
+
+static void guest_set_stack(struct lguest *lg,
+			    u32 seg, u32 esp, unsigned int pages)
+{
+	/* You cannot have a stack segment with priv level 0. */
+	if ((seg & 0x3) != GUEST_DPL)
+		kill_guest(lg, "bad stack segment %i", seg);
+	if (pages > 2)
+		kill_guest(lg, "bad stack pages %u", pages);
+	lg->state->tss.ss1 = seg;
+	lg->state->tss.esp1 = esp;
+	lg->stack_pages = pages;
+	pin_stack_pages(lg);
+}
+
+/* Return true if DMA to host userspace now pending. */
+static int do_hcall(struct lguest *lg, struct lguest_regs *regs)
+{
+	switch (regs->eax) {
+	case LHCALL_FLUSH_ASYNC:
+		break;
+	case LHCALL_LGUEST_INIT:
+		kill_guest(lg, "already have lguest_data");
+		break;
+	case LHCALL_CRASH: {
+		char msg[128];
+		lhread(lg, msg, regs->edx, sizeof(msg));
+		msg[sizeof(msg)-1] = '\0';
+		kill_guest(lg, "CRASH: %s", msg);
+		break;
+	}
+	case LHCALL_LOAD_GDT:
+		load_guest_gdt(lg, regs->edx, regs->ebx);
+		break;
+	case LHCALL_NEW_PGTABLE:
+		guest_new_pagetable(lg, regs->edx);
+		break;
+	case LHCALL_FLUSH_TLB:
+		if (regs->edx)
+			guest_pagetable_clear_all(lg);
+		else
+			guest_pagetable_flush_user(lg);
+		break;
+	case LHCALL_LOAD_IDT_ENTRY:
+		load_guest_idt_entry(lg, regs->edx, regs->ebx, regs->ecx);
+		break;
+	case LHCALL_SET_STACK:
+		guest_set_stack(lg, regs->edx, regs->ebx, regs->ecx);
+		break;
+	case LHCALL_TS:
+		lg->ts = regs->edx;
+		break;
+	case LHCALL_TIMER_READ: {
+		u32 now = jiffies;
+		mb();
+		regs->eax = now - lg->last_timer;
+		lg->last_timer = now;
+		break;
+	}
+	case LHCALL_TIMER_START:
+		lg->timer_on = 1;
+		if (regs->edx != HZ)
+			kill_guest(lg, "Bad clock speed %i", regs->edx);
+		lg->last_timer = jiffies;
+		break;
+	case LHCALL_HALT:
+		lg->halted = 1;
+		break;
+	case LHCALL_GET_WALLCLOCK: {
+		struct timeval tv;
+		do_gettimeofday(&tv);
+		regs->eax = tv.tv_sec;
+		break;
+	}
+	case LHCALL_BIND_DMA:
+		regs->eax = bind_dma(lg, regs->edx, regs->ebx,
+				     regs->ecx >> 8, regs->ecx & 0xFF);
+		break;
+	case LHCALL_SEND_DMA:
+		return send_dma(lg, regs->edx, regs->ebx);
+	case LHCALL_SET_PTE:
+		guest_set_pte(lg, regs->edx, regs->ebx, regs->ecx);
+		break;
+	case LHCALL_SET_UNKNOWN_PTE:
+		guest_pagetable_clear_all(lg);
+		break;
+	case LHCALL_SET_PUD:
+		guest_set_pud(lg, regs->edx, regs->ebx);
+		break;
+	case LHCALL_LOAD_TLS:
+		guest_load_tls(lg, (struct desc_struct __user*)regs->edx);
+		break;
+	default:
+		kill_guest(lg, "Bad hypercall %i\n", regs->eax);
+	}
+	return 0;
+}
+
+#define log(...)					\
+	do {						\
+		mm_segment_t oldfs = get_fs();		\
+		char buf[100];				\
+		sprintf(buf, "lguest:" __VA_ARGS__);	\
+		set_fs(KERNEL_DS);			\
+		sys_write(1, buf, strlen(buf));		\
+		set_fs(oldfs);				\
+	} while(0)
+
+/* We always do queued calls before actual hypercall. */
+int do_async_hcalls(struct lguest *lg)
+{
+	unsigned int i, pending;
+	u8 st[LHCALL_RING_SIZE];
+
+	if (!lg->lguest_data)
+		return 0;
+
+	copy_from_user(&st, &lg->lguest_data->hcall_status, sizeof(st));
+	for (i = 0; i < ARRAY_SIZE(st); i++) {
+		struct lguest_regs regs;
+		unsigned int n = lg->next_hcall;
+
+		if (st[n] == 0xFF)
+			break;
+
+		if (++lg->next_hcall == LHCALL_RING_SIZE)
+			lg->next_hcall = 0;
+
+		get_user(regs.eax, &lg->lguest_data->hcalls[n].eax);
+		get_user(regs.edx, &lg->lguest_data->hcalls[n].edx);
+		get_user(regs.ecx, &lg->lguest_data->hcalls[n].ecx);
+		get_user(regs.ebx, &lg->lguest_data->hcalls[n].ebx);
+		pending = do_hcall(lg, &regs);
+		put_user(0xFF, &lg->lguest_data->hcall_status[n]);
+		if (pending)
+			return 1;
+	}
+
+	set_wakeup_process(lg, NULL);
+	return 0;
+}
+
+int hypercall(struct lguest *lg, struct lguest_regs *regs)
+{
+	int pending;
+
+	if (!lg->lguest_data) {
+		if (regs->eax != LHCALL_LGUEST_INIT) {
+			kill_guest(lg, "hypercall %i before LGUEST_INIT",
+				   regs->eax);
+			return 0;
+		}
+
+		lg->lguest_data = (struct lguest_data __user *)regs->edx;
+		/* We check here so we can simply copy_to_user/from_user */
+		if (!lguest_address_ok(lg, (long)lg->lguest_data)
+		    || !lguest_address_ok(lg, (long)(lg->lguest_data+1))){
+			kill_guest(lg, "bad guest page %p", lg->lguest_data);
+			return 0;
+		}
+		get_user(lg->noirq_start, &lg->lguest_data->noirq_start);
+		get_user(lg->noirq_end, &lg->lguest_data->noirq_end);
+		/* We reserve the top pgd entry. */
+		put_user(4U*1024*1024, &lg->lguest_data->reserve_mem);
+		put_user(lg->guestid, &lg->lguest_data->guestid);
+		put_user(clocksource_khz2mult(tsc_khz, 22),
+			 &lg->lguest_data->clock_mult);
+		return 0;
+	}
+	pending = do_hcall(lg, regs);
+	set_wakeup_process(lg, NULL);
+	return pending;
+}
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/hypervisor.S
@@ -0,0 +1,170 @@
+/* This code sits at 0xFFFF1000 to do the low-level guest<->host switch.
+   Layout is: default_idt_entries (1k), then switch_to_guest entry point. */
+#include <linux/linkage.h>
+#include <asm/asm-offsets.h>
+#include "lg.h"
+
+#define SAVE_REGS				\
+	/* Save old guest/host state */		\
+	pushl	%es;				\
+	pushl	%ds;				\
+	pushl	%fs;				\
+	pushl	%eax;				\
+	pushl	%gs;				\
+	pushl	%ebp;				\
+	pushl	%edi;				\
+	pushl	%esi;				\
+	pushl	%edx;				\
+	pushl	%ecx;				\
+	pushl	%ebx;				\
+
+.text
+ENTRY(_start) /* ld complains unless _start is defined. */
+/* %eax contains ptr to target guest state, %edx contains host idt. */
+switch_to_guest:
+	pushl	%ss
+	SAVE_REGS
+	/* Save old stack, switch to guest's stack. */
+	movl	%esp, LGUEST_STATE_host_stackptr(%eax)
+	movl	%eax, %esp
+	/* Guest registers will be at: %esp-$LGUEST_STATE_regs */
+	addl	$LGUEST_STATE_regs, %esp
+	/* Switch to guest's GDT, IDT. */
+	lgdt	LGUEST_STATE_gdt(%eax)
+	lidt	LGUEST_STATE_idt(%eax)
+	/* Save page table top. */
+	movl	%cr3, %ebx
+	movl	%ebx, LGUEST_STATE_host_pgdir(%eax)
+	/* Set host's TSS to available (clear byte 5 bit 2). */
+	movl	(LGUEST_STATE_host_gdt+2)(%eax), %ebx
+	andb	$0xFD, (GDT_ENTRY_TSS*8 + 5)(%ebx)
+	/* Switch to guest page tables */
+	popl	%ebx
+	movl	%ebx, %cr3
+	/* Switch to guest's TSS. */
+	movl	$(GDT_ENTRY_TSS*8), %ebx
+	ltr	%bx
+	/* Restore guest regs */
+	popl	%ebx
+	popl	%ecx
+	popl	%edx
+	popl	%esi
+	popl	%edi
+	popl	%ebp
+	popl	%gs
+	/* Now we've loaded gs, neuter the TLS entries down to 1 byte/page */
+	addl	$(LGUEST_STATE_gdt_table+GDT_ENTRY_TLS_MIN*8), %eax
+	movw	$0,(%eax)
+	movw	$0,8(%eax)
+	movw	$0,16(%eax)
+	popl	%eax
+	popl	%fs
+	popl	%ds
+	popl	%es
+	/* Skip error code and trap number */
+	addl	$8, %esp
+	iret
+
+#define SWITCH_TO_HOST							\
+	SAVE_REGS;							\
+	/* Save old pgdir */						\
+	movl	%cr3, %eax;						\
+	pushl	%eax;							\
+	/* Load lguest ds segment for convenience. */			\
+	movl	$(LGUEST_DS), %eax;					\
+	movl	%eax, %ds;						\
+	/* Now figure out who we are */					\
+	movl	%esp, %eax;						\
+	subl	$LGUEST_STATE_regs, %eax;				\
+	/* Switch to host page tables (GDT, IDT and stack are in host   \
+	   mem, so need this first) */					\
+	movl	LGUEST_STATE_host_pgdir(%eax), %ebx;			\
+	movl	%ebx, %cr3;						\
+	/* Set guest's TSS to available (clear byte 5 bit 2). */	\
+	andb	$0xFD, (LGUEST_STATE_gdt_table+GDT_ENTRY_TSS*8+5)(%eax);\
+	/* Switch to host's GDT & IDT. */				\
+	lgdt	LGUEST_STATE_host_gdt(%eax);				\
+	lidt	LGUEST_STATE_host_idt(%eax);				\
+	/* Switch to host's stack. */					\
+	movl	LGUEST_STATE_host_stackptr(%eax), %esp;			\
+	/* Switch to host's TSS */					\
+	movl	$(GDT_ENTRY_TSS*8), %eax;				\
+	ltr	%ax;							\
+	/* Restore host regs */						\
+	popl	%ebx;							\
+	popl	%ecx;							\
+	popl	%edx;							\
+	popl	%esi;							\
+	popl	%edi;							\
+	popl	%ebp;							\
+	popl	%gs;							\
+	popl	%eax;							\
+	popl	%fs;							\
+	popl	%ds;							\
+	popl	%es;							\
+	popl	%ss
+	
+/* Return to run_guest_once. */
+return_to_host:
+	SWITCH_TO_HOST
+	iret
+
+deliver_to_host:
+	SWITCH_TO_HOST
+decode_idt_and_jmp:
+	/* Decode IDT and jump to hosts' irq handler.  When that does iret, it
+	 * will return to run_guest_once.  This is a feature. */
+	/* We told gcc we'd clobber edx and eax... */
+	movl	LGUEST_STATE_trapnum(%eax), %eax
+	leal	(%edx,%eax,8), %eax
+	movzwl	(%eax),%edx
+	movl	4(%eax), %eax
+	xorw	%ax, %ax
+	orl	%eax, %edx
+	jmp	*%edx
+
+deliver_to_host_with_errcode:
+	SWITCH_TO_HOST
+	pushl	LGUEST_STATE_errcode(%eax)
+	jmp decode_idt_and_jmp
+
+/* Real hardware interrupts are delivered straight to the host.  Others
+   cause us to return to run_guest_once so it can decide what to do.  Note
+   that some of these are overridden by the guest to deliver directly, and
+   never enter here (see load_guest_idt_entry). */
+.macro IRQ_STUB N TARGET
+	.data; .long 1f; .text; 1:
+ /* Make an error number for most traps, which don't have one. */
+ .if (\N <> 2) && (\N <> 8) && (\N < 10 || \N > 14) && (\N <> 17)
+	pushl	$0
+ .endif
+	pushl	$\N
+	jmp	\TARGET
+	ALIGN
+.endm
+
+.macro IRQ_STUBS FIRST LAST TARGET
+ irq=\FIRST
+ .rept \LAST-\FIRST+1
+	IRQ_STUB irq \TARGET
+  irq=irq+1
+ .endr
+.endm
+	
+/* We intercept every interrupt, because we may need to switch back to
+ * host.  Unfortunately we can't tell them apart except by entry
+ * point, so we need 256 entry points.
+ */
+irq_stubs:
+.data
+default_idt_entries:	
+.text
+	IRQ_STUBS 0 1 return_to_host		/* First two traps */
+	IRQ_STUB 2 deliver_to_host_with_errcode	/* NMI */
+	IRQ_STUBS 3 31 return_to_host		/* Rest of traps */
+	IRQ_STUBS 32 127 deliver_to_host	/* Real interrupts */
+	IRQ_STUB 128 return_to_host		/* System call (overridden) */
+	IRQ_STUBS 129 255 deliver_to_host	/* Other real interrupts */
+
+/* Everything after this is used for the lguest_state structs. */
+ALIGN
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/interrupts_and_traps.c
@@ -0,0 +1,221 @@
+#include <linux/uaccess.h>
+#include "lg.h"
+
+static void push_guest_stack(struct lguest *lg, u32 __user **gstack, u32 val)
+{
+	lhwrite_u32(lg, (u32)--(*gstack), val);
+}
+
+int reflect_trap(struct lguest *lg, const struct host_trap *trap, int has_err)
+{
+	u32 __user *gstack;
+	u32 eflags, ss, irq_enable;
+	struct lguest_regs *regs = &lg->state->regs;
+
+	if (!trap->addr)
+		return 0;
+
+	/* If they want a ring change, we use new stack and push old ss/esp */
+	if ((regs->ss&0x3) != GUEST_DPL) {
+		gstack = (u32 __user *)guest_pa(lg, lg->state->tss.esp1);
+		ss = lg->state->tss.ss1;
+		push_guest_stack(lg, &gstack, regs->ss);
+		push_guest_stack(lg, &gstack, regs->esp);
+	} else {
+		gstack = (u32 __user *)guest_pa(lg, regs->esp);
+		ss = regs->ss;
+	}
+
+	/* We use IF bit in eflags to indicate whether irqs were disabled
+	   (it's always 0, since irqs are enabled when guest is running). */
+	eflags = regs->eflags;
+	get_user(irq_enable, &lg->lguest_data->irq_enabled);
+	eflags |= (irq_enable & 512);
+
+	push_guest_stack(lg, &gstack, eflags);
+	push_guest_stack(lg, &gstack, regs->cs);
+	push_guest_stack(lg, &gstack, regs->eip);
+
+	if (has_err)
+		push_guest_stack(lg, &gstack, regs->errcode);
+
+	/* Change the real stack so hypervisor returns to trap handler */
+	regs->ss = ss;
+	regs->esp = (u32)gstack + lg->page_offset;
+	regs->cs = (__KERNEL_CS|GUEST_DPL);
+	regs->eip = trap->addr;
+
+	/* GS will be neutered on way back to guest. */
+	put_user(0, &lg->lguest_data->gs_gpf_eip);
+
+	/* Disable interrupts for an interrupt gate. */
+	if (trap->disable_interrupts)
+		put_user(0, &lg->lguest_data->irq_enabled);
+	return 1;
+}
+
+void maybe_do_interrupt(struct lguest *lg)
+{
+	unsigned int irq;
+	DECLARE_BITMAP(irqs, LGUEST_IRQS);
+
+	if (!lg->lguest_data)
+		return;
+
+	/* If timer has changed, set timer interrupt. */
+	if (lg->timer_on && jiffies != lg->last_timer)
+		set_bit(0, lg->irqs_pending);
+
+	/* Mask out any interrupts they have blocked. */
+	copy_from_user(&irqs, lg->lguest_data->interrupts, sizeof(irqs));
+	bitmap_andnot(irqs, lg->irqs_pending, irqs, LGUEST_IRQS);
+
+	irq = find_first_bit(irqs, LGUEST_IRQS);
+	if (irq >= LGUEST_IRQS)
+		return;
+
+	/* If they're halted, we re-enable interrupts. */
+	if (lg->halted) {
+		/* Re-enable interrupts. */
+		put_user(512, &lg->lguest_data->irq_enabled);
+		lg->halted = 0;
+	} else {
+		/* Maybe they have interrupts disabled? */
+		u32 irq_enabled;
+		get_user(irq_enabled, &lg->lguest_data->irq_enabled);
+		if (!irq_enabled)
+			return;
+	}
+
+	if (lg->interrupt[irq].addr != 0) {
+		clear_bit(irq, lg->irqs_pending);
+		reflect_trap(lg, &lg->interrupt[irq], 0);
+	}
+}
+
+void check_bug_kill(struct lguest *lg)
+{
+#ifdef CONFIG_BUG
+	u32 eip = lg->state->regs.eip - PAGE_OFFSET;
+	u16 insn;
+
+	/* This only works for addresses in linear mapping... */
+	if (lg->state->regs.eip < PAGE_OFFSET)
+		return;
+	lhread(lg, &insn, eip, sizeof(insn));
+	if (insn == 0x0b0f) {
+#ifdef CONFIG_DEBUG_BUGVERBOSE
+		u16 l;
+		u32 f;
+		char file[128];
+		lhread(lg, &l, eip+sizeof(insn), sizeof(l));
+		lhread(lg, &f, eip+sizeof(insn)+sizeof(l), sizeof(f));
+		lhread(lg, file, f - PAGE_OFFSET, sizeof(file));
+		file[sizeof(file)-1] = 0;
+		kill_guest(lg, "BUG() at %#x %s:%u", eip, file, l);
+#else
+		kill_guest(lg, "BUG() at %#x", eip);
+#endif	/* CONFIG_DEBUG_BUGVERBOSE */
+	}
+#endif	/* CONFIG_BUG */
+}
+
+static void copy_trap(struct lguest *lg,
+		      struct host_trap *trap,
+		      const struct desc_struct *desc)
+{
+	u8 type = ((desc->b >> 8) & 0xF);
+
+	/* Not present? */
+	if (!(desc->b & 0x8000)) {
+		trap->addr = 0;
+		return;
+	}
+	if (type != 0xE && type != 0xF)
+		kill_guest(lg, "bad IDT type %i", type);
+	trap->disable_interrupts = (type == 0xE);
+	trap->addr = ((desc->a & 0x0000FFFF) | (desc->b & 0xFFFF0000));
+}
+
+/* FIXME: Put this in hypervisor.S and do something clever with relocs? */
+static u8 tramp[] 
+= { 0x0f, 0xa8, 0x0f, 0xa9, /* push %gs; pop %gs */
+    0x36, 0xc7, 0x05, 0x55, 0x55, 0x55, 0x55, 0x00, 0x00, 0x00, 0x00,
+    /* movl 0, %ss:lguest_data.gs_gpf_eip */
+    0xe9, 0x55, 0x55, 0x55, 0x55 /* jmp dstaddr */
+};
+#define TRAMP_MOVL_TARGET_OFF 7
+#define TRAMP_JMP_TARGET_OFF 16
+
+static u32 setup_trampoline(struct lguest *lg, unsigned int i, u32 dstaddr)
+{
+	u32 addr, off;
+
+	off = sizeof(tramp)*i;
+	memcpy(lg->trap_page + off, tramp, sizeof(tramp));
+
+	/* 0 is to be placed in lguest_data.gs_gpf_eip. */
+	addr = (u32)&lg->lguest_data->gs_gpf_eip + lg->page_offset;
+	memcpy(lg->trap_page + off + TRAMP_MOVL_TARGET_OFF, &addr, 4);
+
+	/* Address is relative to where end of jmp will be. */
+	addr = dstaddr - ((-4*1024*1024) + off + sizeof(tramp));
+	memcpy(lg->trap_page + off + TRAMP_JMP_TARGET_OFF, &addr, 4);
+	return (-4*1024*1024) + off;
+}
+
+/* We bounce through the trap page, for two reasons: firstly, we need
+   the interrupt destination always mapped, to avoid double faults,
+   secondly we want to reload %gs to make it innocuous on entering kernel.
+ */
+static void setup_idt(struct lguest *lg,
+		      unsigned int i,
+		      const struct desc_struct *desc)
+{
+	u8 type = ((desc->b >> 8) & 0xF);
+	u32 taddr;
+
+	/* Not present? */
+	if (!(desc->b & 0x8000)) {
+		/* FIXME: When we need this, we'll know... */
+		if (lg->state->idt_table[i].a & 0x8000)
+			kill_guest(lg, "removing interrupts not supported");
+		return;
+	}
+
+	/* We could reflect and disable interrupts, but guest can do itself. */
+	if (type != 0xF)
+		kill_guest(lg, "bad direct IDT %i type %i", i, type);
+
+	taddr = setup_trampoline(lg, i, (desc->a&0xFFFF)|(desc->b&0xFFFF0000));
+
+	lg->state->idt_table[i].a = (((__KERNEL_CS|GUEST_DPL)<<16)
+					| (taddr & 0x0000FFFF));
+	lg->state->idt_table[i].b = (desc->b&0xEF00)|(taddr&0xFFFF0000);
+}
+
+void load_guest_idt_entry(struct lguest *lg, unsigned int i, u32 low, u32 high)
+{
+	struct desc_struct d = { low, high };
+
+	/* Ignore NMI, doublefault, hypercall, spurious interrupt. */
+	if (i == 2 || i == 8 || i == 15 || i == LGUEST_TRAP_ENTRY)
+		return;
+	/* FIXME: We should handle debug and int3 */
+	else if (i == 1 || i == 3)
+		return;
+	/* We intercept page fault, general protection fault and fpu missing */
+	else if (i == 13)
+		copy_trap(lg, &lg->gpf_trap, &d);
+	else if (i == 14)
+		copy_trap(lg, &lg->page_trap, &d);
+	else if (i == 7)
+		copy_trap(lg, &lg->fpu_trap, &d);
+	/* Other traps go straight to guest. */
+	else if (i < FIRST_EXTERNAL_VECTOR || i == SYSCALL_VECTOR)
+		setup_idt(lg, i, &d);
+	/* A virtual interrupt */
+	else if (i < FIRST_EXTERNAL_VECTOR + LGUEST_IRQS)
+		copy_trap(lg, &lg->interrupt[i-FIRST_EXTERNAL_VECTOR], &d);
+}
+
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/io.c
@@ -0,0 +1,413 @@
+/* Simple I/O model for guests, based on shared memory.
+ * Copyright (C) 2006 Rusty Russell IBM Corporation
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+#include <linux/types.h>
+#include <linux/futex.h>
+#include <linux/jhash.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/uaccess.h>
+#include "lg.h"
+
+static struct list_head dma_hash[64];
+
+/* FIXME: allow multi-page lengths. */
+static int check_dma_list(struct lguest *lg, const struct lguest_dma *dma)
+{
+	unsigned int i;
+
+	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) {
+		if (!dma->len[i])
+			return 1;
+		if (!lguest_address_ok(lg, dma->addr[i]))
+			goto kill;
+		if (dma->len[i] > PAGE_SIZE)
+			goto kill;
+		/* We could do over a page, but is it worth it? */
+		if ((dma->addr[i] % PAGE_SIZE) + dma->len[i] > PAGE_SIZE)
+			goto kill;
+	}
+	return 1;
+
+kill:
+	kill_guest(lg, "bad DMA entry: %u@%#x", dma->len[i], dma->addr[i]);
+	return 0;
+}
+
+static unsigned int hash(const union futex_key *key)
+{
+	return jhash2((u32*)&key->both.word,
+		      (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
+		      key->both.offset)
+		% ARRAY_SIZE(dma_hash);
+}
+
+/* Must hold read lock on dmainfo owner's current->mm->mmap_sem */
+static void unlink_dma(struct lguest_dma_info *dmainfo)
+{
+	BUG_ON(down_trylock(&lguest_lock) == 0);
+	dmainfo->interrupt = 0;
+	list_del(&dmainfo->list);
+	drop_futex_key_refs(&dmainfo->key);
+}
+
+static inline int key_eq(const union futex_key *a, const union futex_key *b)
+{
+	return (a->both.word == b->both.word
+		&& a->both.ptr == b->both.ptr
+		&& a->both.offset == b->both.offset);
+}
+
+static u32 unbind_dma(struct lguest *lg,
+		      const union futex_key *key,
+		      unsigned long dmas)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < LGUEST_MAX_DMA; i++) {
+		if (key_eq(key, &lg->dma[i].key) && dmas == lg->dma[i].dmas) {
+			unlink_dma(&lg->dma[i]);
+			ret = 1;
+			break;
+		}
+	}
+	return ret;
+}
+
+u32 bind_dma(struct lguest *lg,
+	     unsigned long addr, unsigned long dmas, u16 numdmas, u8 interrupt)
+{
+	unsigned int i;
+	u32 ret = 0;
+	union futex_key key;
+
+	if (interrupt >= LGUEST_IRQS)
+		return 0;
+
+	down(&lguest_lock);
+	down_read(&current->mm->mmap_sem);
+	if (get_futex_key((u32 __user *)addr, &key) != 0) {
+		kill_guest(lg, "bad dma address %#lx", addr);
+		goto unlock;
+	}
+	get_futex_key_refs(&key);
+
+	if (interrupt == 0)
+		ret = unbind_dma(lg, &key, dmas);
+	else {
+		for (i = 0; i < LGUEST_MAX_DMA; i++) {
+			if (lg->dma[i].interrupt == 0) {
+				lg->dma[i].dmas = dmas;
+				lg->dma[i].num_dmas = numdmas;
+				lg->dma[i].next_dma = 0;
+				lg->dma[i].key = key;
+				lg->dma[i].guestid = lg->guestid;
+				lg->dma[i].interrupt = interrupt;
+				list_add(&lg->dma[i].list,
+					 &dma_hash[hash(&key)]);
+				ret = 1;
+				goto unlock;
+			}
+		}
+	}
+	drop_futex_key_refs(&key);
+unlock:
+ 	up_read(&current->mm->mmap_sem);
+	up(&lguest_lock);
+	return ret;
+}
+
+/* lhread from another guest */
+static int lhread_other(struct lguest *lg,
+			void *buf, u32 addr, unsigned bytes)
+{
+	if (addr + bytes < addr
+	    || !lguest_address_ok(lg, addr+bytes)
+	    || access_process_vm(lg->tsk, addr, buf, bytes, 0) != bytes) {
+		memset(buf, 0, bytes);
+		kill_guest(lg, "bad address in registered DMA struct");
+		return 0;
+	}
+	return 1;
+}
+
+/* lhwrite to another guest */
+static int lhwrite_other(struct lguest *lg, u32 addr,
+			 const void *buf, unsigned bytes)
+{
+	if (addr + bytes < addr
+	    || !lguest_address_ok(lg, addr+bytes)
+	    || (access_process_vm(lg->tsk, addr, (void *)buf, bytes, 1)
+		!= bytes)) {
+		kill_guest(lg, "bad address writing to registered DMA");
+		return 0;
+	}
+	return 1;
+}
+
+static u32 copy_data(const struct lguest_dma *src,
+		     const struct lguest_dma *dst,
+		     struct page *pages[])
+{
+	unsigned int totlen, si, di, srcoff, dstoff;
+	void *maddr = NULL;
+
+	totlen = 0;
+	si = di = 0;
+	srcoff = dstoff = 0;
+	while (si < LGUEST_MAX_DMA_SECTIONS && src->len[si]
+	       && di < LGUEST_MAX_DMA_SECTIONS && dst->len[di]) {
+		u32 len = min(src->len[si] - srcoff, dst->len[di] - dstoff);
+
+		if (!maddr)
+			maddr = kmap(pages[di]);
+
+		/* FIXME: This is not completely portable, since
+		   archs do different things for copy_to_user_page. */
+		if (copy_from_user(maddr + (dst->addr[di] + dstoff)%PAGE_SIZE,
+				   (void *__user)src->addr[si], len) != 0) {
+			totlen = 0;
+			break;
+		}
+
+		totlen += len;
+		srcoff += len;
+		dstoff += len;
+		if (srcoff == src->len[si]) {
+			si++;
+			srcoff = 0;
+		}
+		if (dstoff == dst->len[di]) {
+			kunmap(pages[di]);
+			maddr = NULL;
+			di++;
+			dstoff = 0;
+		}
+	}
+
+	if (maddr)
+		kunmap(pages[di]);
+
+	return totlen;
+}
+
+/* Src is us, ie. current. */
+static u32 do_dma(struct lguest *srclg, const struct lguest_dma *src,
+		  struct lguest *dstlg, const struct lguest_dma *dst)
+{
+	int i;
+	u32 ret;
+	struct page *pages[LGUEST_MAX_DMA_SECTIONS];
+
+	if (!check_dma_list(dstlg, dst) || !check_dma_list(srclg, src))
+		return 0;
+
+	/* First get the destination pages */
+	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) {
+		if (dst->len[i] == 0)
+			break;
+		if (get_user_pages(dstlg->tsk, dstlg->mm,
+				   dst->addr[i], 1, 1, 1, pages+i, NULL)
+		    != 1) {
+			ret = 0;
+			goto drop_pages;
+		}
+	}
+
+	/* Now copy until we run out of src or dst. */
+	ret = copy_data(src, dst, pages);
+
+drop_pages:
+	while (--i >= 0)
+		put_page(pages[i]);
+	return ret;
+}
+
+/* We cache one process to wakeup: helps for batching & wakes outside locks. */
+void set_wakeup_process(struct lguest *lg, struct task_struct *p)
+{
+	if (p == lg->wake)
+		return;
+
+	if (lg->wake) {
+		wake_up_process(lg->wake);
+		put_task_struct(lg->wake);
+	}
+	lg->wake = p;
+	if (lg->wake)
+		get_task_struct(lg->wake);
+}
+
+static int dma_transfer(struct lguest *srclg,
+			unsigned long udma,
+			struct lguest_dma_info *dst)
+{
+	struct lguest_dma dst_dma, src_dma;
+	struct lguest *dstlg;
+	u32 i, dma = 0;
+
+	dstlg = &lguests[dst->guestid];
+	/* Get our dma list. */
+	lhread(srclg, &src_dma, udma, sizeof(src_dma));
+
+	/* We can't deadlock against them dmaing to us, because this
+	 * is all under the lguest_lock. */
+	down_read(&dstlg->mm->mmap_sem);
+
+	for (i = 0; i < dst->num_dmas; i++) {
+		dma = (dst->next_dma + i) % dst->num_dmas;
+		if (!lhread_other(dstlg, &dst_dma,
+				  dst->dmas + dma * sizeof(struct lguest_dma),
+				  sizeof(dst_dma))) {
+			goto fail;
+		}
+		if (!dst_dma.used_len)
+			break;
+	}
+	if (i != dst->num_dmas) {
+		unsigned long used_lenp;
+		unsigned int ret;
+
+		ret = do_dma(srclg, &src_dma, dstlg, &dst_dma);
+		/* Put used length in src. */
+		lhwrite_u32(srclg,
+			    udma+offsetof(struct lguest_dma, used_len), ret);
+		if (ret == 0 && src_dma.len[0] != 0)
+			goto fail;
+
+		/* Make sure destination sees contents before length. */
+		mb();
+		used_lenp = dst->dmas
+			+ dma * sizeof(struct lguest_dma)
+			+ offsetof(struct lguest_dma, used_len);
+		lhwrite_other(dstlg, used_lenp, &ret, sizeof(ret));
+		dst->next_dma++;
+	}
+ 	up_read(&dstlg->mm->mmap_sem);
+
+	/* Do this last so dst doesn't simply sleep on lock. */
+	set_bit(dst->interrupt, dstlg->irqs_pending);
+	set_wakeup_process(srclg, dstlg->tsk);
+	return i == dst->num_dmas;
+
+fail:
+	up_read(&dstlg->mm->mmap_sem);
+	return 0;
+}
+
+int send_dma(struct lguest *lg, unsigned long addr, unsigned long udma)
+{
+	union futex_key key;
+	int pending = 0, empty = 0;
+
+again:
+	down(&lguest_lock);
+	down_read(&current->mm->mmap_sem);
+	if (get_futex_key((u32 __user *)addr, &key) != 0) {
+		kill_guest(lg, "bad sending DMA address");
+		goto unlock;
+	}
+	/* Shared mapping?  Look for other guests... */
+	if (key.shared.offset & 1) {
+		struct lguest_dma_info *i, *n;
+		list_for_each_entry_safe(i, n, &dma_hash[hash(&key)], list) {
+			if (i->guestid == lg->guestid)
+				continue;
+			if (!key_eq(&key, &i->key))
+				continue;
+
+			empty += dma_transfer(lg, udma, i);
+			break;
+		}
+		if (empty == 1) {
+			/* Give any recipients one chance to restock. */
+			up_read(&current->mm->mmap_sem);
+			up(&lguest_lock);
+			yield();
+			empty++;
+			goto again;
+		}
+		pending = 0;
+	} else {
+		/* Private mapping: tell our userspace. */
+		lg->dma_is_pending = 1;
+		lg->pending_dma = udma;
+		lg->pending_addr = addr;
+		pending = 1;
+	}
+unlock:
+	up_read(&current->mm->mmap_sem);
+	up(&lguest_lock);
+	return pending;
+}
+
+void release_all_dma(struct lguest *lg)
+{
+	unsigned int i;
+
+	BUG_ON(down_trylock(&lguest_lock) == 0);
+
+	down_read(&lg->mm->mmap_sem);
+	for (i = 0; i < LGUEST_MAX_DMA; i++) {
+		if (lg->dma[i].interrupt)
+			unlink_dma(&lg->dma[i]);
+	}
+	up_read(&lg->mm->mmap_sem);
+}
+
+/* Userspace wants a dma buffer from this guest. */
+unsigned long get_dma_buffer(struct lguest *lg,
+			     unsigned long addr, unsigned long *interrupt)
+{
+	unsigned long ret = 0;
+	union futex_key key;
+	struct lguest_dma_info *i;
+
+	down(&lguest_lock);
+	down_read(&current->mm->mmap_sem);
+	if (get_futex_key((u32 __user *)addr, &key) != 0) {
+		kill_guest(lg, "bad registered DMA buffer");
+		goto unlock;
+	}
+	list_for_each_entry(i, &dma_hash[hash(&key)], list) {
+		if (key_eq(&key, &i->key) && i->guestid == lg->guestid) {
+			unsigned int j;
+			for (j = 0; j < i->num_dmas; j++) {
+				struct lguest_dma dma;
+
+				ret = i->dmas + j * sizeof(struct lguest_dma);
+				lhread(lg, &dma, ret, sizeof(dma));
+				if (dma.used_len == 0)
+					break;
+			}
+			*interrupt = i->interrupt;
+			break;
+		}
+	}
+unlock:
+	up_read(&current->mm->mmap_sem);
+	up(&lguest_lock);
+	return ret;
+}
+
+void lguest_io_init(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(dma_hash); i++)
+		INIT_LIST_HEAD(&dma_hash[i]);
+}
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/lguest_user.c
@@ -0,0 +1,242 @@
+/* Userspace control of the guest, via /dev/lguest. */
+#include <linux/uaccess.h>
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+#include "lg.h"
+
+static struct lguest_state *setup_guest_state(unsigned int num, void *pgdir,
+					      unsigned long start)
+{
+	struct lguest_state *guest = &__lguest_states()[num];
+	unsigned int i;
+	const long *def = __lguest_default_idt_entries();
+	struct lguest_regs *regs;
+
+	guest->gdt_table[GDT_ENTRY_KERNEL_CS] = FULL_EXEC_SEGMENT;
+	guest->gdt_table[GDT_ENTRY_KERNEL_DS] = FULL_SEGMENT;
+	guest->gdt.size = GDT_ENTRIES*8-1;
+	guest->gdt.address = (unsigned long)&guest->gdt_table;
+
+	/* Other guest's IDTs are initialized from default. */
+	guest->idt.size = 8 * IDT_ENTRIES;
+	guest->idt.address = (long)guest->idt_table;
+	for (i = 0; i < IDT_ENTRIES; i++) {
+		u32 flags = 0x8e00;
+
+		/* They can't "int" into any of them except hypercall. */
+		if (i == LGUEST_TRAP_ENTRY)
+			flags |= (GUEST_DPL << 13);
+
+		guest->idt_table[i].a = (LGUEST_CS<<16) | (def[i]&0x0000FFFF);
+		guest->idt_table[i].b = (def[i]&0xFFFF0000) | flags;
+	}
+
+	memset(&guest->tss, 0, sizeof(guest->tss));
+	guest->tss.ss0 = LGUEST_DS;
+	guest->tss.esp0 = (unsigned long)(guest+1);
+	guest->tss.io_bitmap_base = sizeof(guest->tss); /* No I/O for you! */
+
+	/* Write out stack in format lguest expects, so we can switch to it. */
+	regs = &guest->regs;
+	regs->cr3 = __pa(pgdir);
+	regs->eax = regs->ebx = regs->ecx = regs->edx = regs->esp = 0;
+	regs->edi = LGUEST_MAGIC_EDI;
+	regs->ebp = LGUEST_MAGIC_EBP;
+	regs->esi = LGUEST_MAGIC_ESI;
+	regs->gs = regs->fs = 0;
+	regs->ds = regs->es = __KERNEL_DS|GUEST_DPL;
+	regs->trapnum = regs->errcode = 0;
+	regs->eip = start;
+	regs->cs = __KERNEL_CS|GUEST_DPL;
+	regs->eflags = 0x202; 	/* Interrupts enabled. */
+	regs->ss = __KERNEL_DS|GUEST_DPL;
+
+	if (!fixup_gdt_table(guest->gdt_table, ARRAY_SIZE(guest->gdt_table),
+			     &guest->regs, &guest->tss))
+		return NULL;
+
+	return guest;
+}
+
+/* + addr */
+static long user_get_dma(struct lguest *lg, const u32 __user *input)
+{
+	unsigned long addr, udma, irq;
+
+	if (get_user(addr, input) != 0)
+		return -EFAULT;
+	udma = get_dma_buffer(lg, addr, &irq);
+	if (!udma)
+		return -ENOENT;
+
+	/* We put irq number in udma->used_len. */
+	lhwrite_u32(lg, udma + offsetof(struct lguest_dma, used_len), irq);
+	return udma;
+}
+
+/* + irq */
+static int user_send_irq(struct lguest *lg, const u32 __user *input)
+{
+	u32 irq;
+
+	if (get_user(irq, input) != 0)
+		return -EFAULT;
+	if (irq >= LGUEST_IRQS)
+		return -EINVAL;
+	set_bit(irq, lg->irqs_pending);
+	return 0;
+}
+
+static ssize_t read(struct file *file, char __user *user, size_t size,loff_t*o)
+{
+	struct lguest *lg = file->private_data;
+
+	if (!lg)
+		return -EINVAL;
+
+	if (lg->dead) {
+		size_t len;
+
+		if (lg->dead == (void *)-1)
+			return -ENOMEM;
+
+		len = min(size, strlen(lg->dead)+1);
+		if (copy_to_user(user, lg->dead, len) != 0)
+			return -EFAULT;
+		return len;
+	}
+
+	if (lg->dma_is_pending)
+		lg->dma_is_pending = 0;
+
+	return run_guest(lg, user);
+}
+
+/* Take: pfnlimit, pgdir, start, pageoffset. */
+static int initialize(struct file *file, const u32 __user *input)
+{
+	struct lguest *lg;
+	int err, i;
+	u32 args[4];
+
+	if (file->private_data)
+		return -EBUSY;
+
+	if (copy_from_user(args, input, sizeof(args)) != 0)
+		return -EFAULT;
+
+	if (args[1] <= PAGE_SIZE)
+		return -EINVAL;
+
+	down(&lguest_lock);
+	i = find_free_guest();
+	if (i < 0) {
+		err = -ENOSPC;
+		goto unlock;
+	}
+	lg = &lguests[i];
+	lg->guestid = i;
+	lg->pfn_limit = args[0];
+	lg->page_offset = args[3];
+
+	lg->trap_page = (u32 *)get_zeroed_page(GFP_KERNEL);
+	if (!lg->trap_page) {
+		err = -ENOMEM;
+		goto release_guest;
+	}
+
+	err = init_guest_pagetable(lg, args[1]);
+	if (err)
+		goto free_trap_page;
+
+	lg->state = setup_guest_state(i, lg->pgdirs[lg->pgdidx].pgdir,args[2]);
+	if (!lg->state) {
+		err = -ENOEXEC;
+		goto release_pgtable;
+	}
+	up(&lguest_lock);
+
+	lg->tsk = current;
+	lg->mm = get_task_mm(current);
+	file->private_data = lg;
+	return sizeof(args);
+
+release_pgtable:
+	free_guest_pagetable(lg);
+free_trap_page:
+	free_page((long)lg->trap_page);
+release_guest:
+	memset(lg, 0, sizeof(*lg));
+unlock:
+	up(&lguest_lock);
+	return err;
+}
+
+static ssize_t write(struct file *file, const char __user *input,
+		     size_t size, loff_t *off)
+{
+	struct lguest *lg = file->private_data;
+	u32 req;
+
+	if (get_user(req, input) != 0)
+		return -EFAULT;
+	input += sizeof(req);
+
+	if (req != LHREQ_INITIALIZE && !lg)
+		return -EINVAL;
+	if (lg && lg->dead)
+		return -ENOENT;
+
+	switch (req) {
+	case LHREQ_INITIALIZE:
+		return initialize(file, (const u32 __user *)input);
+	case LHREQ_GETDMA:
+		return user_get_dma(lg, (const u32 __user *)input);
+	case LHREQ_IRQ:
+		return user_send_irq(lg, (const u32 __user *)input);
+	default:
+		return -EINVAL;
+	}
+}
+
+static int close(struct inode *inode, struct file *file)
+{
+	struct lguest *lg = file->private_data;
+
+	if (!lg)
+		return 0;
+
+	down(&lguest_lock);
+	release_all_dma(lg);
+	free_page((long)lg->trap_page);
+	free_guest_pagetable(lg);
+	mmput(lg->mm);
+	if (lg->dead != (void *)1)
+		kfree(lg->dead);
+	memset(lg->state, 0, sizeof(*lg->state));
+	memset(lg, 0, sizeof(*lg));
+	up(&lguest_lock);
+	return 0;
+}
+
+static struct file_operations lguest_fops = {
+	.owner	 = THIS_MODULE,
+	.release = close,
+	.write	 = write,
+	.read	 = read,
+};
+static struct miscdevice lguest_dev = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "lguest",
+	.fops	= &lguest_fops,
+};
+
+int __init lguest_device_init(void)
+{
+	return misc_register(&lguest_dev);
+}
+
+void __exit lguest_device_remove(void)
+{
+	misc_deregister(&lguest_dev);
+}
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/page_tables.c
@@ -0,0 +1,374 @@
+/* Shadow page table operations.
+ * Copyright (C) Rusty Russell IBm Corporation 2006.
+ * GPL v2 and any later version */
+#include <linux/mm.h>
+#include <linux/types.h>
+#include <linux/spinlock.h>
+#include <linux/random.h>
+#include <linux/percpu.h>
+#include <asm/tlbflush.h>
+#include "lg.h"
+
+#define PTES_PER_PAGE_SHIFT 10
+#define PTES_PER_PAGE (1 << PTES_PER_PAGE_SHIFT)
+#define HYPERVISOR_PGD_ENTRY (PTES_PER_PAGE - 1)
+
+static DEFINE_PER_CPU(u32 *, hypervisor_pte_pages) = { NULL };
+#define hypervisor_pte_page(cpu) per_cpu(hypervisor_pte_pages, cpu)
+
+static unsigned vaddr_to_pgd(unsigned long vaddr)
+{
+	return vaddr >> (PAGE_SHIFT + PTES_PER_PAGE_SHIFT);
+}
+
+/* These access the real versions. */
+static u32 *toplev(struct lguest *lg, u32 i, unsigned long vaddr)
+{
+	unsigned int index = vaddr_to_pgd(vaddr);
+
+	if (index >= HYPERVISOR_PGD_ENTRY) {
+		kill_guest(lg, "attempt to access hypervisor pages");
+		index = 0;
+	} 
+	return &lg->pgdirs[i].pgdir[index];
+}
+
+static u32 *pteof(struct lguest *lg, u32 top, unsigned long vaddr)
+{
+	u32 *page = __va(top&PAGE_MASK);
+	BUG_ON(!(top & _PAGE_PRESENT));
+	return &page[(vaddr >> PAGE_SHIFT) % PTES_PER_PAGE];
+}
+
+/* These access the guest versions. */
+static u32 gtoplev(struct lguest *lg, unsigned long vaddr)
+{
+	unsigned int index = vaddr >> (PAGE_SHIFT + PTES_PER_PAGE_SHIFT);
+	return lg->pgdirs[lg->pgdidx].cr3 + index * sizeof(u32);
+}
+
+static u32 gpteof(struct lguest *lg, u32 gtop, unsigned long vaddr)
+{
+	u32 gpage = (gtop&PAGE_MASK);
+	BUG_ON(!(gtop & _PAGE_PRESENT));
+	return gpage + ((vaddr >> PAGE_SHIFT) % PTES_PER_PAGE) * sizeof(u32);
+}
+
+static void release_pte(u32 pte)
+{
+	if (pte & _PAGE_PRESENT)
+		put_page(pfn_to_page(pte >> PAGE_SHIFT));
+}
+
+/* Do a virtual -> physical mapping on a user page. */
+static unsigned long get_pfn(unsigned long virtpfn, int write)
+{
+	struct vm_area_struct *vma;
+	struct page *page;
+	unsigned long ret = -1UL;
+
+	down_read(&current->mm->mmap_sem);
+	if (get_user_pages(current, current->mm, virtpfn << PAGE_SHIFT,
+			   1, write, 1, &page, &vma) == 1)
+		ret = page_to_pfn(page);
+	up_read(&current->mm->mmap_sem);
+	return ret;
+}
+
+static u32 check_pgtable_entry(struct lguest *lg, u32 entry)
+{
+	if ((entry & (_PAGE_PWT|_PAGE_PSE))
+	    || (entry >> PAGE_SHIFT) >= lg->pfn_limit)
+		kill_guest(lg, "bad page table entry");
+	return entry & ~_PAGE_GLOBAL;
+}
+
+static u32 get_pte(struct lguest *lg, u32 entry, int write)
+{
+	u32 pfn;
+
+	pfn = get_pfn(entry >> PAGE_SHIFT, write);
+	if (pfn == -1UL) {
+		kill_guest(lg, "failed to get page %u", entry>>PAGE_SHIFT);
+		return 0;
+	}
+	return ((pfn << PAGE_SHIFT) | (entry & (PAGE_SIZE-1)));
+}
+
+/* FIXME: We hold reference to pages, which prevents them from being
+   swapped.  It'd be nice to have a callback when Linux wants to swap out. */
+
+/* We fault pages in, which allows us to update accessed/dirty bits.
+ * Return NULL or the pte page. */
+static int page_in(struct lguest *lg, u32 vaddr, unsigned flags)
+{
+	u32 gtop, gpte;
+	u32 *top, *pte, *ptepage;
+	u32 val;
+
+	gtop = gtoplev(lg, vaddr);
+	val = lhread_u32(lg, gtop);
+	if (!(val & _PAGE_PRESENT))
+		return 0;
+
+	top = toplev(lg, lg->pgdidx, vaddr);
+	if (!(*top & _PAGE_PRESENT)) {
+		/* Get a PTE page for them. */
+		ptepage = (void *)get_zeroed_page(GFP_KERNEL);
+		/* FIXME: Steal from self in this case? */
+		if (!ptepage) {
+			kill_guest(lg, "out of memory allocating pte page");
+			return 0;
+		}
+		val = check_pgtable_entry(lg, val);
+		*top = (__pa(ptepage) | (val & (PAGE_SIZE-1)));
+	} else
+		ptepage = __va(*top & PAGE_MASK);
+
+	gpte = gpteof(lg, val, vaddr);
+	val = lhread_u32(lg, gpte);
+
+	/* No page, or write to readonly page? */
+	if (!(val&_PAGE_PRESENT) || ((flags&_PAGE_DIRTY) && !(val&_PAGE_RW)))
+		return 0;
+
+	pte = pteof(lg, *top, vaddr);
+	val = check_pgtable_entry(lg, val) | flags;
+
+	/* We're done with the old pte. */
+	release_pte(*pte);
+
+	/* We don't make it writable if this isn't a write: later
+	 * write will fault so we can set dirty bit in guest. */
+	if (val & _PAGE_DIRTY)
+		*pte = get_pte(lg, val, 1);
+	else
+		*pte = get_pte(lg, val & ~_PAGE_RW, 0);
+
+	/* Now we update dirty/accessed on guest. */
+	lhwrite_u32(lg, gpte, val);
+	return 1;
+}
+
+int demand_page(struct lguest *lg, u32 vaddr, int write)
+{
+	return page_in(lg, vaddr, (write ? _PAGE_DIRTY : 0)|_PAGE_ACCESSED);
+}
+
+void pin_stack_pages(struct lguest *lg)
+{
+	unsigned int i;
+	u32 stack = lg->state->tss.esp1;
+
+	for (i = 0; i < lg->stack_pages; i++)
+		if (!demand_page(lg, stack - i*PAGE_SIZE, 1))
+			kill_guest(lg, "bad stack page %i@%#x", i, stack);
+}
+
+static unsigned int find_pgdir(struct lguest *lg, u32 pgtable)
+{
+	unsigned int i;
+	for (i = 0; i < ARRAY_SIZE(lg->pgdirs); i++)
+		if (lg->pgdirs[i].cr3 == pgtable)
+			break;
+	return i;
+}
+
+static void release_pgd(struct lguest *lg, u32 *pgd)
+{
+	if (*pgd & _PAGE_PRESENT) {
+		unsigned int i;
+		u32 *ptepage = __va(*pgd & ~(PAGE_SIZE-1));
+		for (i = 0; i < PTES_PER_PAGE; i++)
+			release_pte(ptepage[i]);
+		free_page((long)ptepage);
+		*pgd = 0;
+	}
+}
+
+static void flush_user_mappings(struct lguest *lg, int idx)
+{
+	unsigned int i;
+	for (i = 0; i < vaddr_to_pgd(lg->page_offset); i++)
+		release_pgd(lg, lg->pgdirs[idx].pgdir + i);
+}
+
+void guest_pagetable_flush_user(struct lguest *lg)
+{
+	flush_user_mappings(lg, lg->pgdidx);
+}
+
+static unsigned int new_pgdir(struct lguest *lg, u32 cr3)
+{
+	unsigned int next;
+
+	next = (lg->pgdidx + random32()) % ARRAY_SIZE(lg->pgdirs);
+	if (!lg->pgdirs[next].pgdir) {
+		lg->pgdirs[next].pgdir = (u32 *)get_zeroed_page(GFP_KERNEL);
+		if (!lg->pgdirs[next].pgdir)
+			next = lg->pgdidx;
+	}
+	lg->pgdirs[next].cr3 = cr3;
+	/* Release all the non-kernel mappings. */
+	flush_user_mappings(lg, next);
+
+	return next;
+}
+
+void guest_new_pagetable(struct lguest *lg, u32 pgtable)
+{
+	int newpgdir;
+
+	newpgdir = find_pgdir(lg, pgtable);
+	if (newpgdir == ARRAY_SIZE(lg->pgdirs))
+		newpgdir = new_pgdir(lg, pgtable);
+	lg->pgdidx = newpgdir;
+	lg->state->regs.cr3 = __pa(lg->pgdirs[lg->pgdidx].pgdir);
+	pin_stack_pages(lg);
+}
+
+static void release_all_pagetables(struct lguest *lg)
+{
+	unsigned int i, j;
+
+	for (i = 0; i < ARRAY_SIZE(lg->pgdirs); i++)
+		if (lg->pgdirs[i].pgdir)
+			for (j = 0; j < HYPERVISOR_PGD_ENTRY; j++)
+				release_pgd(lg, lg->pgdirs[i].pgdir + j);
+}
+
+void guest_pagetable_clear_all(struct lguest *lg)
+{
+	release_all_pagetables(lg);
+	pin_stack_pages(lg);
+}
+
+static void do_set_pte(struct lguest *lg, int idx,
+		       unsigned long vaddr, u32 val)
+{
+	u32 *top = toplev(lg, idx, vaddr);
+	if (*top & _PAGE_PRESENT) {
+		u32 *pte = pteof(lg, *top, vaddr);
+		release_pte(*pte);
+		if (val & (_PAGE_DIRTY | _PAGE_ACCESSED)) {
+			val = check_pgtable_entry(lg, val);
+			*pte = get_pte(lg, val, val & _PAGE_DIRTY);
+		} else
+			*pte = 0;
+	}
+}
+
+void guest_set_pte(struct lguest *lg,
+		   unsigned long cr3, unsigned long vaddr, u32 val)
+{
+	/* Kernel mappings must be changed on all top levels. */
+	if (vaddr >= lg->page_offset) {
+		unsigned int i;
+		for (i = 0; i < ARRAY_SIZE(lg->pgdirs); i++)
+			if (lg->pgdirs[i].pgdir)
+				do_set_pte(lg, i, vaddr, val);
+	} else {
+		int pgdir = find_pgdir(lg, cr3);
+		if (pgdir != ARRAY_SIZE(lg->pgdirs))
+			do_set_pte(lg, pgdir, vaddr, val);
+	}
+}
+
+void guest_set_pud(struct lguest *lg, unsigned long cr3, u32 idx)
+{
+	int pgdir;
+
+	if (idx >= HYPERVISOR_PGD_ENTRY)
+		return;
+
+	pgdir = find_pgdir(lg, cr3);
+	if (pgdir < ARRAY_SIZE(lg->pgdirs))
+		release_pgd(lg, lg->pgdirs[pgdir].pgdir + idx);
+}
+
+int init_guest_pagetable(struct lguest *lg, u32 pgtable)
+{
+	/* We assume this in flush_user_mappings, so check now */
+	if (vaddr_to_pgd(lg->page_offset) >= HYPERVISOR_PGD_ENTRY)
+		return -EINVAL;
+	lg->pgdidx = 0;
+	lg->pgdirs[lg->pgdidx].cr3 = pgtable;
+	lg->pgdirs[lg->pgdidx].pgdir = (u32*)get_zeroed_page(GFP_KERNEL);
+	if (!lg->pgdirs[lg->pgdidx].pgdir)
+		return -ENOMEM;
+	return 0;
+}
+
+void free_guest_pagetable(struct lguest *lg)
+{
+	unsigned int i;
+
+	release_all_pagetables(lg);
+	for (i = 0; i < ARRAY_SIZE(lg->pgdirs); i++)
+		free_page((long)lg->pgdirs[i].pgdir);
+}
+
+/* Caller must be preempt-safe */
+void map_trap_page(struct lguest *lg)
+{
+	int cpu = smp_processor_id();
+	
+	hypervisor_pte_page(cpu)[0] = (__pa(lg->trap_page)|_PAGE_PRESENT);
+
+	/* Since hypervisor less that 4MB, we simply mug top pte page. */
+	lg->pgdirs[lg->pgdidx].pgdir[HYPERVISOR_PGD_ENTRY] =
+		(__pa(hypervisor_pte_page(cpu))| _PAGE_KERNEL);
+}
+
+static void free_hypervisor_pte_pages(void)
+{
+	int i;
+	
+	for_each_possible_cpu(i)
+		free_page((long)hypervisor_pte_page(i));
+}
+
+static __init int alloc_hypervisor_pte_pages(void)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		hypervisor_pte_page(i) = (u32 *)get_zeroed_page(GFP_KERNEL);
+		if (!hypervisor_pte_page(i)) {
+			free_hypervisor_pte_pages();
+			return -ENOMEM;
+		}
+	}
+	return 0;
+}
+
+static __init void populate_hypervisor_pte_page(int cpu)
+{
+	int i;
+	u32 *pte = hypervisor_pte_page(cpu);
+
+	for (i = 0; i < HYPERVISOR_PAGES; i++) {
+		/* First entry set dynamically in map_trap_page */
+		pte[i+1] = ((page_to_pfn(&hype_pages[i]) << PAGE_SHIFT) 
+			    | _PAGE_KERNEL_EXEC);
+	}
+}
+
+__init int init_pagetables(struct page hype_pages[])
+{
+	int ret;
+	unsigned int i;
+
+	ret = alloc_hypervisor_pte_pages();
+	if (ret)
+		return ret;
+
+	for_each_possible_cpu(i)
+		populate_hypervisor_pte_page(i);
+	return 0;
+}
+
+__exit void free_pagetables(void)
+{
+	free_hypervisor_pte_pages();
+}
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/segments.c
@@ -0,0 +1,171 @@
+#include "lg.h"
+
+/* Dealing with GDT entries is such a horror, I convert to sanity and back */
+struct decoded_gdt_entry
+{
+	u32 base, limit;
+	union {
+		struct {
+			unsigned type:4;
+			unsigned dtype:1;
+			unsigned dpl:2;
+			unsigned present:1;
+			unsigned unused:4;
+			unsigned avl:1;
+			unsigned mbz:1;
+			unsigned def:1;
+			unsigned page_granularity:1;
+		};
+		u16 raw_attributes;
+	};
+};
+
+static struct decoded_gdt_entry decode_gdt_entry(const struct desc_struct *en)
+{
+	struct decoded_gdt_entry de;
+	de.base = ((en->a >> 16) | ((en->b & 0xff) << 16) 
+		   | (en->b & 0xFF000000));
+	de.limit = ((en->a & 0xFFFF) | (en->b & 0xF0000));
+	de.raw_attributes = (en->b >> 8);
+	return de;
+}
+
+static struct desc_struct encode_gdt_entry(const struct decoded_gdt_entry *de)
+{
+	struct desc_struct en;
+	en.a = ((de->limit & 0xFFFF) | (de->base << 16));
+	en.b = (((de->base >> 16) & 0xFF) 
+		 | ((((u32)de->raw_attributes) & 0xF0FF) << 8)
+		 | (de->limit & 0xF0000)
+		 | (de->base & 0xFF000000));
+	return en;
+}
+
+static int check_desc(const struct decoded_gdt_entry *dec)
+{
+	return (dec->mbz == 0 && dec->dtype == 1 && (dec->type & 4) == 0);
+}
+
+static void check_segment(const struct desc_struct *gdt, u32 *segreg)
+{
+	if (*segreg > 255 || !(gdt[*segreg >> 3].b & 0x8000))
+		*segreg = 0;
+}
+
+/* Ensure our manually-loaded segment regs don't fault in switch_to_guest. */
+static void check_live_segments(const struct desc_struct *gdt,
+				struct lguest_regs *regs)
+{
+	check_segment(gdt, &regs->es);
+	check_segment(gdt, &regs->ds);
+	check_segment(gdt, &regs->fs);
+	check_segment(gdt, &regs->gs);
+}
+
+int fixup_gdt_table(struct desc_struct *gdt, unsigned int num,
+		    struct lguest_regs *regs, struct x86_tss *tss)
+{
+	unsigned int i;
+	struct decoded_gdt_entry dec;
+
+	for (i = 0; i < num; i++) {
+		unsigned long base, length;
+
+		/* We override these ones, so we don't care what they give. */
+		if (i == GDT_ENTRY_TSS
+		    || i == GDT_ENTRY_LGUEST_CS
+		    || i == GDT_ENTRY_LGUEST_DS
+		    || i == GDT_ENTRY_DOUBLEFAULT_TSS)
+			continue;
+
+		dec = decode_gdt_entry(&gdt[i]);
+		if (!dec.present)
+			continue;
+
+		if (!check_desc(&dec))
+			return 0;
+
+		base = dec.base;
+		length = dec.limit + 1;
+		if (dec.page_granularity) {
+			base *= PAGE_SIZE;
+			length *= PAGE_SIZE;
+		}
+
+		/* Unacceptable base? */
+		if (base >= HYPE_ADDR)
+			return 0;
+
+		/* Wrap around or segment overlaps hypervisor mem? */
+		if (!length
+		    || base + length < base
+		    || base + length > HYPE_ADDR) {
+			/* Trim to edge of hypervisor. */
+			length = HYPE_ADDR - base;
+			if (dec.page_granularity)
+				dec.limit = (length / PAGE_SIZE) - 1;
+			else
+				dec.limit = length - 1;
+		}
+		if (dec.dpl == 0)
+			dec.dpl = GUEST_DPL;
+		gdt[i] = encode_gdt_entry(&dec);
+	}
+	check_live_segments(gdt, regs);
+
+	/* Now put in hypervisor data and code segments. */
+	gdt[GDT_ENTRY_LGUEST_CS] = FULL_EXEC_SEGMENT;
+	gdt[GDT_ENTRY_LGUEST_DS] = FULL_SEGMENT;
+
+	/* Finally, TSS entry */
+	dec.base = (unsigned long)tss;
+	dec.limit = sizeof(*tss)-1;
+	dec.type = 0x9;
+	dec.dtype = 0;
+	dec.def = 0;
+	dec.present = 1;
+	dec.mbz = 0;
+	dec.page_granularity = 0;
+	gdt[GDT_ENTRY_TSS] = encode_gdt_entry(&dec);
+
+	return 1;
+}
+
+void load_guest_gdt(struct lguest *lg, u32 table, u32 num)
+{
+	if (num > GDT_ENTRIES)
+		kill_guest(lg, "too many gdt entries %i", num);
+
+	lhread(lg, lg->state->gdt_table, table,
+	       num * sizeof(lg->state->gdt_table[0]));
+	if (!fixup_gdt_table(lg->state->gdt_table, num, 
+			     &lg->state->regs, &lg->state->tss))
+		kill_guest(lg, "bad gdt table");
+}
+
+/* We don't care about limit here, since we only let them use these in
+ * usermode (where lack of USER bit in pagetable protects hypervisor mem).
+ * However, we want to ensure it doesn't fault when loaded, since *we* are
+ * the ones who will load it in switch_to_guest.
+ */
+void guest_load_tls(struct lguest *lg, const struct desc_struct __user *gtls)
+{
+	unsigned int i;
+	struct desc_struct *tls = &lg->state->gdt_table[GDT_ENTRY_TLS_MIN];
+
+	lhread(lg, tls, (u32)gtls, sizeof(*tls)*GDT_ENTRY_TLS_ENTRIES);
+	for (i = 0; i < ARRAY_SIZE(lg->tls_limits); i++) {
+		struct decoded_gdt_entry dec = decode_gdt_entry(&tls[i]);
+
+		if (!dec.present)
+			continue;
+
+		/* We truncate to one byte/page (depending on G bit) to neuter
+		   it, so ensure it's more than 1 page below trap page. */
+		tls[i].a &= 0xFFFF0000;
+		lg->tls_limits[i] = dec.limit;
+		if (!check_desc(&dec) || dec.base > HYPE_ADDR - PAGE_SIZE)
+			kill_guest(lg, "bad TLS descriptor %i", i);
+	}
+	check_live_segments(lg->state->gdt_table, &lg->state->regs);
+}



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 6c/10] lguest: the guest code
  2007-02-09 10:56         ` [PATCH 6b/10] lguest: the host code (lg.ko) Rusty Russell
@ 2007-02-09 10:57           ` Rusty Russell
  2007-02-09 10:58             ` [PATCH 6d/10] lguest: the Makefiles Rusty Russell
  2007-02-09 17:06             ` [PATCH 6c/10] lguest: the guest code Len Brown
  0 siblings, 2 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09 10:57 UTC (permalink / raw)
  To: lkml - Kernel Mailing List; +Cc: Andrew Morton, Andi Kleen, virtualization

This is the guest code which replaces the parts of paravirt_ops with
hypercalls.  It's fairly trivial.  This patch also includes trivial
bus driver for lguest devices.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- /dev/null
+++ b/arch/i386/lguest/lguest.c
@@ -0,0 +1,595 @@
+/*
+ * Lguest specific paravirt-ops implementation
+ *
+ * Copyright (C) 2006, Rusty Russell <rusty@rustcorp.com.au> IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+#include <linux/kernel.h>
+#include <linux/start_kernel.h>
+#include <linux/string.h>
+#include <linux/console.h>
+#include <linux/screen_info.h>
+#include <linux/irq.h>
+#include <linux/interrupt.h>
+#include <linux/clocksource.h>
+#include <asm/paravirt.h>
+#include <asm/lguest.h>
+#include <asm/lguest_user.h>
+#include <asm/param.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <asm/desc.h>
+#include <asm/setup.h>
+#include <asm/e820.h>
+#include <asm/pda.h>
+#include <asm/asm-offsets.h>
+
+extern int mce_disabled;
+
+struct lguest_data lguest_data;
+struct lguest_device_desc *lguest_devices;
+static __initdata const struct lguest_boot_info *boot = __va(0);
+
+void async_hcall(unsigned long call,
+		 unsigned long arg1, unsigned long arg2, unsigned long arg3)
+{
+	/* Note: This code assumes we're uniprocessor. */
+	static unsigned int next_call;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (lguest_data.hcall_status[next_call] != 0xFF) {
+		/* Table full, so do normal hcall which will flush table. */
+		hcall(call, arg1, arg2, arg3);
+	} else {
+		lguest_data.hcalls[next_call].eax = call;
+		lguest_data.hcalls[next_call].edx = arg1;
+		lguest_data.hcalls[next_call].ebx = arg2;
+		lguest_data.hcalls[next_call].ecx = arg3;
+		wmb();
+		lguest_data.hcall_status[next_call] = 0;
+		if (++next_call == LHCALL_RING_SIZE)
+			next_call = 0;
+	}
+	local_irq_restore(flags);
+}
+
+#ifdef PARAVIRT_LAZY_NONE 	/* Not in 2.6.20. */
+static int lazy_mode;
+static void fastcall lguest_lazy_mode(int mode)
+{
+	lazy_mode = mode;
+	if (mode == PARAVIRT_LAZY_NONE)
+		hcall(LHCALL_FLUSH_ASYNC, 0, 0, 0);
+}
+
+static void lazy_hcall(unsigned long call,
+		       unsigned long arg1,
+		       unsigned long arg2,
+		       unsigned long arg3)
+{
+	if (lazy_mode == PARAVIRT_LAZY_NONE)
+		hcall(call, arg1, arg2, arg3);
+	else
+		async_hcall(call, arg1, arg2, arg3);
+}
+#else
+#define lazy_hcall hcall
+#endif
+
+static unsigned long fastcall save_fl(void)
+{
+	return lguest_data.irq_enabled;
+}
+
+static void fastcall restore_fl(unsigned long flags)
+{
+	/* FIXME: Check if interrupt pending... */
+	lguest_data.irq_enabled = flags;
+}
+
+static void fastcall irq_disable(void)
+{
+	lguest_data.irq_enabled = 0;
+}
+
+static void fastcall irq_enable(void)
+{
+	/* Linux i386 code expects bit 9 set. */
+	/* FIXME: Check if interrupt pending... */
+	lguest_data.irq_enabled = 512;
+}
+
+static void fastcall lguest_load_gdt(const struct Xgt_desc_struct *desc)
+{
+	BUG_ON((desc->size+1)/8 != GDT_ENTRIES);
+	hcall(LHCALL_LOAD_GDT, __pa(desc->address), GDT_ENTRIES, 0);
+}
+
+static void fastcall lguest_load_idt(const struct Xgt_desc_struct *desc)
+{
+	unsigned int i;
+	struct desc_struct *idt = (void *)desc->address;
+
+	for (i = 0; i < (desc->size+1)/8; i++)
+		hcall(LHCALL_LOAD_IDT_ENTRY, i, idt[i].a, idt[i].b);
+}
+
+static int lguest_panic(struct notifier_block *nb, unsigned long l, void *p)
+{
+	hcall(LHCALL_CRASH, __pa(p), 0, 0);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block paniced = {
+	.notifier_call = lguest_panic
+};
+
+static cycle_t lguest_clock_read(void)
+{
+	/* FIXME: This is just the native one.  Account stolen time! */
+	return paravirt_ops.read_tsc();
+}
+
+/* FIXME: Update iff tsc rate changes. */
+static struct clocksource lguest_clock = {
+	.name			= "lguest",
+	.rating			= 400,
+	.read			= lguest_clock_read,
+	.mask			= CLOCKSOURCE_MASK(64),
+	.mult			= 0, /* to be set */
+	.shift			= 22,
+	.is_continuous		= 1,
+};
+
+static char *lguest_memory_setup(void)
+{
+	/* We do these here because lockcheck barfs if before start_kernel */
+	atomic_notifier_chain_register(&panic_notifier_list, &paniced);
+	lguest_clock.mult = lguest_data.clock_mult;
+	clocksource_register(&lguest_clock);
+
+	e820.nr_map = 0;
+	add_memory_region(0, PFN_PHYS(boot->max_pfn), E820_RAM);
+	return "LGUEST";
+}
+
+static fastcall void lguest_cpuid(unsigned int *eax, unsigned int *ebx,
+				 unsigned int *ecx, unsigned int *edx)
+{
+	int is_feature = (*eax == 1);
+
+	asm volatile ("cpuid"
+		      : "=a" (*eax),
+			"=b" (*ebx),
+			"=c" (*ecx),
+			"=d" (*edx)
+		      : "0" (*eax), "2" (*ecx));
+
+	if (is_feature) {
+		unsigned long *excap = (unsigned long *)ecx,
+			*features = (unsigned long *)edx;
+		/* Hypervisor needs to know when we flush kernel pages. */
+		set_bit(X86_FEATURE_PGE, features);
+		/* We don't have any features! */
+		clear_bit(X86_FEATURE_VME, features);
+		clear_bit(X86_FEATURE_DE, features);
+		clear_bit(X86_FEATURE_PSE, features);
+		clear_bit(X86_FEATURE_PAE, features);
+		clear_bit(X86_FEATURE_SEP, features);
+		clear_bit(X86_FEATURE_APIC, features);
+		clear_bit(X86_FEATURE_MTRR, features);
+		/* No MWAIT, either */
+		clear_bit(3, excap);
+	}
+}
+
+static unsigned long current_cr3;
+static void fastcall lguest_write_cr3(unsigned long cr3)
+{
+	hcall(LHCALL_NEW_PGTABLE, cr3, 0, 0);
+	current_cr3 = cr3;
+}
+
+static void fastcall lguest_flush_tlb(void)
+{
+	lazy_hcall(LHCALL_FLUSH_TLB, 0, 0, 0);
+}
+
+static void fastcall lguest_flush_tlb_kernel(void)
+{
+	lazy_hcall(LHCALL_FLUSH_TLB, 1, 0, 0);
+}
+
+static void fastcall lguest_flush_tlb_single(u32 addr)
+{
+	/* Simply set it to zero, and it will fault back in. */
+	lazy_hcall(LHCALL_SET_PTE, current_cr3, addr, 0);
+}
+
+/* FIXME: Eliminate all callers of this. */
+static fastcall void lguest_set_pte(pte_t *ptep, pte_t pteval)
+{
+	*ptep = pteval;
+	/* Don't bother with hypercall before initial setup. */
+	if (current_cr3)
+		hcall(LHCALL_SET_UNKNOWN_PTE, 0, 0, 0);
+}
+
+static fastcall void lguest_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t pteval)
+{
+	*ptep = pteval;
+	lazy_hcall(LHCALL_SET_PTE, __pa(mm->pgd), addr, pteval.pte_low);
+}
+
+/* We only support two-level pagetables at the moment. */
+static fastcall void lguest_set_pud(pmd_t *pmdp, pmd_t pmdval)
+{
+	*pmdp = pmdval;
+	lazy_hcall(LHCALL_SET_PUD, __pa(pmdp)&PAGE_MASK,
+		   (__pa(pmdp)&(PAGE_SIZE-1))/4, 0);
+}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+static fastcall void lguest_apic_write(unsigned long reg, unsigned long v)
+{
+}
+
+static fastcall void lguest_apic_write_atomic(unsigned long reg, unsigned long v)
+{
+}
+
+static fastcall unsigned long lguest_apic_read(unsigned long reg)
+{
+	return 0;
+}
+#endif
+
+/* We move eflags word to lguest_data.irq_enabled to restore interrupt
+   state.  For page faults, gpfs and virtual interrupts, the
+   hypervisor has saved eflags manually, otherwise it was delivered
+   directly and so eflags reflects the real machine IF state,
+   ie. interrupts on.  Since the kernel always dies if it takes such a
+   trap with interrupts disabled anyway, turning interrupts back on
+   unconditionally here is OK. */
+asm("lguest_iret:"
+    " pushl	%eax;"
+    " movl	12(%esp), %eax;"
+    "lguest_noirq_start:;"
+    " movl	%eax,%ss:lguest_data+"__stringify(LGUEST_DATA_irq_enabled)";"
+    " popl	%eax;"
+    " iret;"
+    "lguest_noirq_end:");
+extern void fastcall lguest_iret(void);
+extern char lguest_noirq_start[], lguest_noirq_end[];
+
+static void fastcall lguest_load_esp0(struct tss_struct *tss,
+				     struct thread_struct *thread)
+{
+	lazy_hcall(LHCALL_SET_STACK, __KERNEL_DS|0x1, thread->esp0,
+		   THREAD_SIZE/PAGE_SIZE);
+}
+
+static fastcall void lguest_load_tr_desc(void)
+{
+}
+
+static fastcall void lguest_set_ldt(const void *addr, unsigned entries)
+{
+	/* FIXME: Implement. */
+	BUG_ON(entries);
+}
+
+static fastcall void lguest_load_tls(struct thread_struct *t, unsigned int cpu)
+{
+	lazy_hcall(LHCALL_LOAD_TLS, __pa(&t->tls_array), cpu, 0);
+}
+
+static fastcall void lguest_set_debugreg(int regno, unsigned long value)
+{
+	/* FIXME: Implement */
+}
+
+static unsigned int lguest_cr0;
+static fastcall void lguest_clts(void)
+{
+	lazy_hcall(LHCALL_TS, 0, 0, 0);
+	lguest_cr0 &= ~8U;
+}
+
+static fastcall unsigned long lguest_read_cr0(void)
+{
+	return lguest_cr0;
+}
+
+static fastcall void lguest_write_cr0(unsigned long val)
+{
+	hcall(LHCALL_TS, val & 8, 0, 0);
+	lguest_cr0 = val;
+}
+
+static fastcall unsigned long lguest_read_cr2(void)
+{
+	return lguest_data.cr2;
+}
+
+static fastcall unsigned long lguest_read_cr3(void)
+{
+	return current_cr3;
+}
+
+/* Used to enable/disable PGE, but we don't care. */
+static fastcall unsigned long lguest_read_cr4(void)
+{
+	return 0;
+}
+
+static fastcall void lguest_write_cr4(unsigned long val)
+{
+}
+
+/* FIXME: These should be in a header somewhere */
+extern unsigned long init_pg_tables_end;
+
+static void fastcall lguest_time_irq(unsigned int irq, struct irq_desc *desc)
+{
+	do_timer(hcall(LHCALL_TIMER_READ, 0, 0, 0));
+	update_process_times(user_mode_vm(get_irq_regs()));
+}
+
+static void disable_lguest_irq(unsigned int irq)
+{
+	set_bit(irq, lguest_data.interrupts);
+}
+
+static void enable_lguest_irq(unsigned int irq)
+{
+	clear_bit(irq, lguest_data.interrupts);
+	/* FIXME: If it's pending? */
+}
+
+static struct irq_chip lguest_irq_controller = {
+	.name		= "lguest",
+	.mask		= disable_lguest_irq,
+	.mask_ack	= disable_lguest_irq,
+	.unmask		= enable_lguest_irq,
+};
+
+static void lguest_time_init(void)
+{
+	set_irq_handler(0, lguest_time_irq);
+	hcall(LHCALL_TIMER_START,HZ,0,0);
+}
+
+static void __init lguest_init_IRQ(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < LGUEST_IRQS; i++) {
+		int vector = FIRST_EXTERNAL_VECTOR + i;
+		if (i >= NR_IRQS)
+			break;
+		if (vector != SYSCALL_VECTOR) {
+			set_intr_gate(vector, interrupt[i]);
+			set_irq_chip_and_handler(i, &lguest_irq_controller,
+						 handle_level_irq);
+		}
+	}
+	irq_ctx_init(smp_processor_id());
+}
+
+static inline void native_write_dt_entry(void *dt, int entry, u32 entry_low, u32 entry_high)
+{
+	u32 *lp = (u32 *)((char *)dt + entry*8);
+	lp[0] = entry_low;
+	lp[1] = entry_high;
+}
+
+static fastcall void lguest_write_ldt_entry(void *dt, int entrynum, u32 low, u32 high)
+{
+	/* FIXME: Allow this. */
+	BUG();
+}
+
+static fastcall void lguest_write_gdt_entry(void *dt, int entrynum,
+					   u32 low, u32 high)
+{
+	native_write_dt_entry(dt, entrynum, low, high);
+	hcall(LHCALL_LOAD_GDT, __pa(dt), GDT_ENTRIES, 0);
+}
+
+static fastcall void lguest_write_idt_entry(void *dt, int entrynum,
+					   u32 low, u32 high)
+{
+	native_write_dt_entry(dt, entrynum, low, high);
+	hcall(LHCALL_LOAD_IDT_ENTRY, entrynum, low, high);
+}
+
+#define LGUEST_IRQ "lguest_data+"__stringify(LGUEST_DATA_irq_enabled)
+#define DEF_LGUEST(name, code)				\
+	extern const char start_##name[], end_##name[];		\
+	asm("start_" #name ": " code "; end_" #name ":")
+DEF_LGUEST(cli, "movl $0," LGUEST_IRQ);
+DEF_LGUEST(sti, "movl $512," LGUEST_IRQ);
+DEF_LGUEST(popf, "movl %eax," LGUEST_IRQ);
+DEF_LGUEST(pushf, "movl " LGUEST_IRQ ",%eax");
+DEF_LGUEST(pushf_cli, "movl " LGUEST_IRQ ",%eax; movl $0," LGUEST_IRQ);
+DEF_LGUEST(iret, ".byte 0xE9,0,0,0,0"); /* jmp ... */
+
+static const struct lguest_insns
+{
+	const char *start, *end;
+} lguest_insns[] = {
+	[PARAVIRT_IRQ_DISABLE] = { start_cli, end_cli },
+	[PARAVIRT_IRQ_ENABLE] = { start_sti, end_sti },
+	[PARAVIRT_RESTORE_FLAGS] = { start_popf, end_popf },
+	[PARAVIRT_SAVE_FLAGS] = { start_pushf, end_pushf },
+	[PARAVIRT_SAVE_FLAGS_IRQ_DISABLE] = { start_pushf_cli, end_pushf_cli },
+	[PARAVIRT_INTERRUPT_RETURN] = { start_iret, end_iret },
+};
+static unsigned lguest_patch(u8 type, u16 clobber, void *insns, unsigned len)
+{
+	unsigned int insn_len;
+
+	/* Don't touch it if we don't have a replacement */
+	if (type >= ARRAY_SIZE(lguest_insns) || !lguest_insns[type].start)
+		return len;
+
+	insn_len = lguest_insns[type].end - lguest_insns[type].start;
+
+	/* Similarly if we can't fit replacement. */
+	if (len < insn_len)
+		return len;
+
+	memcpy(insns, lguest_insns[type].start, insn_len);
+	if (type == PARAVIRT_INTERRUPT_RETURN) {
+		/* Jumps are relative. */
+		u32 off = (u32)lguest_iret - ((u32)insns + insn_len);
+		memcpy(insns+1, &off, sizeof(off));
+	}
+	return insn_len;
+}
+
+static void fastcall lguest_safe_halt(void)
+{
+	hcall(LHCALL_HALT, 0, 0, 0);
+}
+
+static unsigned long lguest_get_wallclock(void)
+{
+	return hcall(LHCALL_GET_WALLCLOCK, 0, 0, 0);
+}
+
+static void lguest_power_off(void)
+{
+	hcall(LHCALL_CRASH, __pa("Power down"), 0, 0);
+}
+
+static __attribute_used__ __init void lguest_init(void)
+{
+	extern struct Xgt_desc_struct cpu_gdt_descr;
+	extern struct i386_pda boot_pda;
+
+	paravirt_ops.name = "lguest";
+	paravirt_ops.paravirt_enabled = 1;
+	paravirt_ops.kernel_rpl = 1;
+
+	paravirt_ops.save_fl = save_fl;
+	paravirt_ops.restore_fl = restore_fl;
+	paravirt_ops.irq_disable = irq_disable;
+	paravirt_ops.irq_enable = irq_enable;
+	paravirt_ops.load_gdt = lguest_load_gdt;
+	paravirt_ops.memory_setup = lguest_memory_setup;
+	paravirt_ops.cpuid = lguest_cpuid;
+	paravirt_ops.write_cr3 = lguest_write_cr3;
+	paravirt_ops.flush_tlb_user = lguest_flush_tlb;
+	paravirt_ops.flush_tlb_single = lguest_flush_tlb_single;
+	paravirt_ops.flush_tlb_kernel = lguest_flush_tlb_kernel;
+	paravirt_ops.set_pte = lguest_set_pte;
+	paravirt_ops.set_pte_at = lguest_set_pte_at;
+	paravirt_ops.set_pmd = lguest_set_pud;
+#ifdef CONFIG_X86_LOCAL_APIC
+	paravirt_ops.apic_write = lguest_apic_write;
+	paravirt_ops.apic_write_atomic = lguest_apic_write_atomic;
+	paravirt_ops.apic_read = lguest_apic_read;
+#endif
+	paravirt_ops.load_idt = lguest_load_idt;
+	paravirt_ops.iret = lguest_iret;
+	paravirt_ops.load_esp0 = lguest_load_esp0;
+	paravirt_ops.load_tr_desc = lguest_load_tr_desc;
+	paravirt_ops.set_ldt = lguest_set_ldt;
+	paravirt_ops.load_tls = lguest_load_tls;
+	paravirt_ops.set_debugreg = lguest_set_debugreg;
+	paravirt_ops.clts = lguest_clts;
+	paravirt_ops.read_cr0 = lguest_read_cr0;
+	paravirt_ops.write_cr0 = lguest_write_cr0;
+	paravirt_ops.init_IRQ = lguest_init_IRQ;
+	paravirt_ops.read_cr2 = lguest_read_cr2;
+	paravirt_ops.read_cr3 = lguest_read_cr3;
+	paravirt_ops.read_cr4 = lguest_read_cr4;
+	paravirt_ops.write_cr4 = lguest_write_cr4;
+	paravirt_ops.write_ldt_entry = lguest_write_ldt_entry;
+	paravirt_ops.write_gdt_entry = lguest_write_gdt_entry;
+	paravirt_ops.write_idt_entry = lguest_write_idt_entry;
+	paravirt_ops.patch = lguest_patch;
+	paravirt_ops.safe_halt = lguest_safe_halt;
+	paravirt_ops.get_wallclock = lguest_get_wallclock;
+	paravirt_ops.time_init = lguest_time_init;
+#ifdef PARAVIRT_LAZY_NONE
+	paravirt_ops.set_lazy_mode = lguest_lazy_mode;
+#endif
+
+	memset(lguest_data.hcall_status,0xFF,sizeof(lguest_data.hcall_status));
+	lguest_data.noirq_start = (u32)lguest_noirq_start;
+	lguest_data.noirq_end = (u32)lguest_noirq_end;
+	hcall(LHCALL_LGUEST_INIT, __pa(&lguest_data), 0, 0);
+	strncpy(saved_command_line, boot->cmdline, COMMAND_LINE_SIZE);
+
+	/* We use top of mem for initial pagetables. */
+	init_pg_tables_end = __pa(pg0);
+
+	/* set up PDA descriptor */
+	pack_descriptor((u32 *)&cpu_gdt_table[GDT_ENTRY_PDA].a,
+			(u32 *)&cpu_gdt_table[GDT_ENTRY_PDA].b,
+			(unsigned)&boot_pda, sizeof(boot_pda)-1,
+			0x80 | DESCTYPE_S | 0x02, 0);
+	load_gdt(&cpu_gdt_descr);
+	asm volatile ("mov %0, %%gs" : : "r" (__KERNEL_PDA) : "memory");
+
+	reserve_top_address(lguest_data.reserve_mem);
+
+	cpu_detect(&new_cpu_data);
+	/* Need this before paging_init. */
+	set_bit(X86_FEATURE_PGE, new_cpu_data.x86_capability);
+	/* Math is always hard! */
+	new_cpu_data.hard_math = 1;
+
+	/* FIXME: Better way? */
+	/* Suppress vgacon startup code */
+	SCREEN_INFO.orig_video_isVGA = VIDEO_TYPE_VLFB;
+
+	add_preferred_console("hvc", 0, NULL);
+
+#ifdef CONFIG_X86_MCE
+	mce_disabled = 1;
+#endif
+
+#ifdef CONFIG_ACPI
+	acpi_disabled = 1;
+	acpi_ht = 0;
+#endif
+	if (boot->initrd_size) {
+		/* We stash this at top of memory. */
+		INITRD_START = boot->max_pfn*PAGE_SIZE - boot->initrd_size;
+		INITRD_SIZE = boot->initrd_size;
+		LOADER_TYPE = 0xFF;
+	}
+
+	pm_power_off = lguest_power_off;
+	start_kernel();
+}
+
+asm("lguest_maybe_init:\n"
+    "	cmpl $"__stringify(LGUEST_MAGIC_EBP)", %ebp\n"
+    "	jne 1f\n"
+    "	cmpl $"__stringify(LGUEST_MAGIC_EDI)", %edi\n"
+    "	jne 1f\n"
+    "	cmpl $"__stringify(LGUEST_MAGIC_ESI)", %esi\n"
+    "	je lguest_init\n"
+    "1: ret");
+extern void asmlinkage lguest_maybe_init(void);
+paravirt_probe(lguest_maybe_init);
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/lguest_bus.c
@@ -0,0 +1,180 @@
+#include <linux/init.h>
+#include <linux/bootmem.h>
+#include <asm/lguest_device.h>
+#include <asm/lguest.h>
+#include <asm/io.h>
+
+static ssize_t type_show(struct device *_dev,
+                         struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%hu", lguest_devices[dev->index].type);
+}
+static ssize_t features_show(struct device *_dev,
+                             struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%hx", lguest_devices[dev->index].features);
+}
+static ssize_t pfn_show(struct device *_dev,
+			 struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%u", lguest_devices[dev->index].pfn);
+}
+static ssize_t status_show(struct device *_dev,
+                           struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%hx", lguest_devices[dev->index].status);
+}
+static ssize_t status_store(struct device *_dev, struct device_attribute *attr,
+                            const char *buf, size_t count)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	if (sscanf(buf, "%hi", &lguest_devices[dev->index].status) != 1)
+		return -EINVAL;
+	return count;
+}
+static struct device_attribute lguest_dev_attrs[] = {
+	__ATTR_RO(type),
+	__ATTR_RO(features),
+	__ATTR_RO(pfn),
+	__ATTR(status, 0644, status_show, status_store),
+	__ATTR_NULL
+};
+
+static int lguest_dev_match(struct device *_dev, struct device_driver *_drv)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	struct lguest_driver *drv = container_of(_drv,struct lguest_driver,drv);
+
+	return (drv->device_type == lguest_devices[dev->index].type);
+}
+
+struct lguest_bus {
+	struct bus_type bus;
+	struct device dev;
+};
+
+static struct lguest_bus lguest_bus = {
+	.bus = {
+		.name  = "lguest",
+		.match = lguest_dev_match,
+		.dev_attrs = lguest_dev_attrs,
+	},
+	.dev = {
+		.parent = NULL,
+		.bus_id = "lguest",
+	}
+};
+
+static int lguest_dev_probe(struct device *_dev)
+{
+	int ret;
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	struct lguest_driver *drv = container_of(dev->dev.driver,
+						struct lguest_driver, drv);
+
+	lguest_devices[dev->index].status |= LGUEST_DEVICE_S_DRIVER;
+	ret = drv->probe(dev);
+	if (ret == 0)
+		lguest_devices[dev->index].status |= LGUEST_DEVICE_S_DRIVER_OK;
+	return ret;
+}
+
+static int lguest_dev_remove(struct device *_dev)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	struct lguest_driver *drv = container_of(dev->dev.driver,
+						struct lguest_driver, drv);
+
+	if (dev->dev.driver && drv->remove)
+		drv->remove(dev);
+	put_device(&dev->dev);
+	return 0;
+}
+
+int register_lguest_driver(struct lguest_driver *drv)
+{
+	if (!lguest_devices)
+		return 0;
+	
+	drv->drv.bus = &lguest_bus.bus;
+	drv->drv.name = drv->name;
+	drv->drv.owner = drv->owner;
+	drv->drv.probe = lguest_dev_probe;
+	drv->drv.remove = lguest_dev_remove;
+
+	return driver_register(&drv->drv);
+}
+EXPORT_SYMBOL_GPL(register_lguest_driver);
+
+void unregister_lguest_driver(struct lguest_driver *drv)
+{
+	if (!lguest_devices)
+		return;
+
+	driver_unregister(&drv->drv);
+}
+EXPORT_SYMBOL_GPL(unregister_lguest_driver);
+
+static void release_lguest_device(struct device *_dev)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+
+	lguest_devices[dev->index].status |= LGUEST_DEVICE_S_REMOVED_ACK;
+	kfree(dev);
+}
+
+static void add_lguest_device(unsigned int index)
+{
+	struct lguest_device *new;
+
+	lguest_devices[index].status |= LGUEST_DEVICE_S_ACKNOWLEDGE;
+	new = kmalloc(sizeof(struct lguest_device), GFP_KERNEL);
+	if (!new) {
+		printk(KERN_EMERG "Cannot allocate lguest device %u\n", index);
+		lguest_devices[index].status |= LGUEST_DEVICE_S_FAILED;
+		return;
+	}
+
+	new->index = index;
+	new->private = NULL;
+	memset(&new->dev, 0, sizeof(new->dev));
+	new->dev.parent = &lguest_bus.dev;
+	new->dev.bus = &lguest_bus.bus;
+	new->dev.release = release_lguest_device;
+	sprintf(new->dev.bus_id, "%u", index);
+	if (device_register(&new->dev) != 0) {
+		printk(KERN_EMERG "Cannot register lguest device %u\n", index);
+		lguest_devices[index].status |= LGUEST_DEVICE_S_FAILED;
+		kfree(new);
+	}
+}
+
+static void scan_devices(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < LGUEST_MAX_DEVICES; i++)
+		if (lguest_devices[i].type)
+			add_lguest_device(i);
+}
+
+static int __init lguest_bus_init(void)
+{
+	if (strcmp(paravirt_ops.name, "lguest") != 0)
+		return 0;
+
+	/* Devices are in page above top of "normal" mem. */
+	lguest_devices = ioremap(max_pfn << PAGE_SHIFT, PAGE_SIZE);
+
+	if (bus_register(&lguest_bus.bus) != 0
+	    || device_register(&lguest_bus.dev) != 0)
+		panic("lguest bus registration failed");
+
+	scan_devices();
+	return 0;
+}
+postcore_initcall(lguest_bus_init);



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 6d/10] lguest: the Makefiles
  2007-02-09 10:57           ` [PATCH 6c/10] lguest: the guest code Rusty Russell
@ 2007-02-09 10:58             ` Rusty Russell
  2007-02-09 17:06             ` [PATCH 6c/10] lguest: the guest code Len Brown
  1 sibling, 0 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09 10:58 UTC (permalink / raw)
  To: lkml - Kernel Mailing List; +Cc: Andrew Morton, Andi Kleen, virtualization

Finally, we put in the Makefile, so it will build.

You can see the pain involved in creating the switcher code
(hypervisor.S) ready to be copied into the top of memory.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/arch/i386/Makefile
+++ b/arch/i386/Makefile
@@ -108,6 +108,7 @@ drivers-$(CONFIG_PCI)			+= arch/i386/pci
 # must be linked after kernel/
 drivers-$(CONFIG_OPROFILE)		+= arch/i386/oprofile/
 drivers-$(CONFIG_PM)			+= arch/i386/power/
+drivers-$(CONFIG_LGUEST_GUEST)		+= arch/i386/lguest/
 
 CFLAGS += $(mflags-y)
 AFLAGS += $(mflags-y)
===================================================================
--- /dev/null
+++ b/arch/i386/lguest/Makefile
@@ -0,0 +1,22 @@
+# Guest requires the paravirt_ops replacement and the bus driver.
+obj-$(CONFIG_LGUEST_GUEST) += lguest.o lguest_bus.o
+
+# Host requires the other files, which can be a module.
+obj-$(CONFIG_LGUEST)	+= lg.o
+lg-objs := core.o hypercalls.o page_tables.o interrupts_and_traps.o \
+	segments.o io.o lguest_user.o
+
+# We use top 4MB for guest traps page, then hypervisor. */
+HYPE_ADDR := (0xFFC00000+4096)
+# The data is only 1k (256 interrupt handler pointers)
+HYPE_DATA_SIZE := 1024
+CFLAGS += -DHYPE_ADDR="$(HYPE_ADDR)" -DHYPE_DATA_SIZE="$(HYPE_DATA_SIZE)"
+
+$(obj)/core.o: $(obj)/hypervisor-blob.c
+# This links the hypervisor in the right place and turns it into a C array.
+$(obj)/hypervisor-raw: $(obj)/hypervisor.o
+	@$(LD) -static -Tdata=`printf %#x $$(($(HYPE_ADDR)))` -Ttext=`printf %#x $$(($(HYPE_ADDR)+$(HYPE_DATA_SIZE)))` -o $@ $< && $(OBJCOPY) -O binary $@
+$(obj)/hypervisor-blob.c: $(obj)/hypervisor-raw
+	@od -tx1 -An -v $< | sed -e 's/^ /0x/' -e 's/$$/,/' -e 's/ /,0x/g' > $@
+
+clean-files := hypervisor-blob.c hypervisor-raw



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09  9:35         ` [PATCH 6/10] lguest code: the little linux hypervisor Andrew Morton
@ 2007-02-09 11:00           ` Rusty Russell
  2007-02-09 11:13             ` Zachary Amsden
  0 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-09 11:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: lkml - Kernel Mailing List, Andi Kleen, virtualization,
	Paul Mackerras, Stephen Rothwell

On Fri, 2007-02-09 at 01:35 -0800, Andrew Morton wrote:
> On Fri, 09 Feb 2007 20:20:27 +1100 Rusty Russell <rusty@rustcorp.com.au> wrote:
> 
> > +#define log(...)					\
> > +	do {						\
> > +		mm_segment_t oldfs = get_fs();		\
> > +		char buf[100];				\
> > +		sprintf(buf, "lguest:" __VA_ARGS__);	\
> > +		set_fs(KERNEL_DS);			\
> > +		sys_write(1, buf, strlen(buf));		\
> > +		set_fs(oldfs);				\
> > +	} while(0)
> 
> Due to gcc shortcomings, each instance of this will chew an additional 100
> bytes of stack.  Unless they fixed it recently.  Is a bit of a timebomb.  I
> guess ksaprintf() could be used.
> 
> It also looks a bit, umm, innovative.

It's also unused 8)

It's an extremely useful macro for doing grossly invasive logging of the
guest.  I'll drop it if you prefer.

Cheers,
Rusty.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09 11:00           ` Rusty Russell
@ 2007-02-09 11:13             ` Zachary Amsden
  2007-02-09 11:50               ` Andi Kleen
  0 siblings, 1 reply; 57+ messages in thread
From: Zachary Amsden @ 2007-02-09 11:13 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, lkml - Kernel Mailing List, Andi Kleen,
	virtualization, Paul Mackerras, Stephen Rothwell

Rusty Russell wrote:
> On Fri, 2007-02-09 at 01:35 -0800, Andrew Morton wrote:
>   
>> On Fri, 09 Feb 2007 20:20:27 +1100 Rusty Russell <rusty@rustcorp.com.au> wrote:
>>
>>     
>>> +#define log(...)					\
>>> +	do {						\
>>> +		mm_segment_t oldfs = get_fs();		\
>>> +		char buf[100];				\
>>> +		sprintf(buf, "lguest:" __VA_ARGS__);	\
>>> +		set_fs(KERNEL_DS);			\
>>> +		sys_write(1, buf, strlen(buf));		\
>>> +		set_fs(oldfs);				\
>>> +	} while(0)
>>>       
>> Due to gcc shortcomings, each instance of this will chew an additional 100
>> bytes of stack.  Unless they fixed it recently.  Is a bit of a timebomb.  I
>> guess ksaprintf() could be used.
>>
>> It also looks a bit, umm, innovative.
>>     
>
> It's also unused 8)
>
> It's an extremely useful macro for doing grossly invasive logging of the
> guest.  I'll drop it if you prefer.
>   

Yes, it is a bit, umm, innovative.  If it is going to be kept, even if 
just for devel logging, you should disable interrupts around it.  
Changing segments is not a normal thing to do.

Zach

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09 11:13             ` Zachary Amsden
@ 2007-02-09 11:50               ` Andi Kleen
  2007-02-09 11:54                 ` Zachary Amsden
  2007-02-09 22:29                 ` David Miller
  0 siblings, 2 replies; 57+ messages in thread
From: Andi Kleen @ 2007-02-09 11:50 UTC (permalink / raw)
  To: virtualization
  Cc: Zachary Amsden, Rusty Russell, Paul Mackerras, Stephen Rothwell,
	Andrew Morton, lkml - Kernel Mailing List


> Yes, it is a bit, umm, innovative.  If it is going to be kept, even if 
> just for devel logging, you should disable interrupts around it.  
> Changing segments is not a normal thing to do.

Actually that wouldn't be needed because interrupts are not allowed to do any 
user accesses. And contrary to the name it doesn't actually change
the segment registers, only state used by *_user.

-Andi

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler
  2007-02-09  9:31   ` [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler Andi Kleen
@ 2007-02-09 11:52     ` Rusty Russell
  2007-02-09 20:49       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-09 11:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: virtualization, lkml - Kernel Mailing List, Andrew Morton,
	Jeremy Fitzhardinge

On Fri, 2007-02-09 at 10:31 +0100, Andi Kleen wrote:
> On Friday 09 February 2007 10:14, Rusty Russell wrote:
> 
> > +unhandled_paravirt:
> > +	/* Nothing wanted us: try to die with dignity (impossible trap). */ 
> > +	movl	$0x1F, %edx
> > +	pushl	$0
> > +	jmp	early_fault
> 
> Please print a real message with early_printk

If we make it thought early_fault, this will do just that.

Given this is a "never happens" situation, however... if you're actually
under Xen or lguest, you won't make it that far (lguest, at least, will
kill you on the cr2 load in early_fault, but it doesn't matter because
we won't get anywhere with early_printk anyway).

Actually, if we did BUG() here at least lguest would print something...
I wonder what Xen would do...

Rusty.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09 11:50               ` Andi Kleen
@ 2007-02-09 11:54                 ` Zachary Amsden
  2007-02-09 11:57                   ` Andi Kleen
  2007-02-09 22:29                 ` David Miller
  1 sibling, 1 reply; 57+ messages in thread
From: Zachary Amsden @ 2007-02-09 11:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: virtualization, Rusty Russell, Paul Mackerras, Stephen Rothwell,
	Andrew Morton, lkml - Kernel Mailing List

Andi Kleen wrote:
>> Yes, it is a bit, umm, innovative.  If it is going to be kept, even if 
>> just for devel logging, you should disable interrupts around it.  
>> Changing segments is not a normal thing to do.
>>     
>
> Actually that wouldn't be needed because interrupts are not allowed to do any 
> user accesses. And contrary to the name it doesn't actually change
> the segment registers, only state used by *_user.
>   

My bad, I fell for the same mistake as everyone.  Set_fs is just way too 
confusing of a name now.  But good to know interrupts must be disable in 
such a circumstance.

Zach

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09 11:54                 ` Zachary Amsden
@ 2007-02-09 11:57                   ` Andi Kleen
  2007-02-09 12:08                     ` Zachary Amsden
  0 siblings, 1 reply; 57+ messages in thread
From: Andi Kleen @ 2007-02-09 11:57 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: virtualization, Rusty Russell, Paul Mackerras, Stephen Rothwell,
	Andrew Morton, lkml - Kernel Mailing List

On Fri, Feb 09, 2007 at 03:54:37AM -0800, Zachary Amsden wrote:
> Andi Kleen wrote:
> >>Yes, it is a bit, umm, innovative.  If it is going to be kept, even if 
> >>just for devel logging, you should disable interrupts around it.  
> >>Changing segments is not a normal thing to do.
> >>    
> >
> >Actually that wouldn't be needed because interrupts are not allowed to do 
> >any user accesses. And contrary to the name it doesn't actually change
> >the segment registers, only state used by *_user.
> >  
> 
> My bad, I fell for the same mistake as everyone.  Set_fs is just way too 

You could change the name. Only 654 occurrences all over the tree @)

> confusing of a name now.  But good to know interrupts must be disable in 
> such a circumstance.

+not

-Andi


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/10] lguest: Export symbols for lguest as a module
  2007-02-09  9:32     ` Andi Kleen
@ 2007-02-09 12:06       ` Rusty Russell
  2007-02-09 13:58         ` Andi Kleen
  0 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-09 12:06 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, lkml - Kernel Mailing List, Andrew Morton

On Fri, 2007-02-09 at 10:32 +0100, Andi Kleen wrote:
> On Friday 09 February 2007 10:15, Rusty Russell wrote:
> 
> > tsc_khz:
> > 	Simplest way of telling the guest how to interpret the TSC
> > 	counter.
> 
> 
> Are you sure this will work with varying TSC frequencies? 

I'm actually quite sure it doesn't (there's a FIXME in the lguest code).
Given the debate over how useful the TSC was, I originally didn't use
it, but (1) it's simple, and (2) when it doesn't change, it's pretty
accurate.

> In general you should get this from cpufreq.

Hmm, ok, I'll bite: how?  Time is a mystery I've avoided so far 8)

Thanks!
Rusty.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09 11:57                   ` Andi Kleen
@ 2007-02-09 12:08                     ` Zachary Amsden
  0 siblings, 0 replies; 57+ messages in thread
From: Zachary Amsden @ 2007-02-09 12:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: virtualization, Rusty Russell, Paul Mackerras, Stephen Rothwell,
	Andrew Morton, lkml - Kernel Mailing List

Andi Kleen wrote:
>>>  
>>>       
>> My bad, I fell for the same mistake as everyone.  Set_fs is just way too 
>>     
>
> You could change the name. Only 654 occurrences all over the tree @)
>   

I'm preparing a patch.  +not.  This legacy thing is certainly taking a 
long time to disappear.  Long live x86!

>   
>> confusing of a name now.  But good to know interrupts must be disable in 
>> such a circumstance.
>>     
>
> +not
>   

+ack

Zach

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09 10:09         ` Andi Kleen
@ 2007-02-09 12:39           ` Rusty Russell
  2007-02-09 13:57             ` Andi Kleen
  2007-02-09 14:17             ` Sam Ravnborg
  0 siblings, 2 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09 12:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: virtualization, lkml - Kernel Mailing List, Andrew Morton, Sam Ravnborg

On Fri, 2007-02-09 at 11:09 +0100, Andi Kleen wrote:
> On Friday 09 February 2007 10:20, Rusty Russell wrote:
> > Unfortunately, we don't have the build infrastructure for "private"
> > asm-offsets.h files, so there's a not-so-neat include in
> > arch/i386/kernel/asm-offsets.c.
> 
> Ask the kbuild people to fix that? 
> 
> It indeed looks ugly.
> 
> I bet Xen et.al. could make good use of that too.

Yes.   I originally had the constants #defined in the header and a whole
heap of BUILD_BUG_ON(XYZ_offset != offsetof(xyz)) in my module, which
was even uglier (but at least contained in my code).

> > +# This links the hypervisor in the right place and turns it into a C array.
> > +$(obj)/hypervisor-raw: $(obj)/hypervisor.o
> > +	@$(LD) -static -Tdata=`printf %#x $$(($(HYPE_ADDR)))` -Ttext=`printf %#x $$(($(HYPE_ADDR)+$(HYPE_DATA_SIZE)))` -o $@ $< && $(OBJCOPY) -O binary $@
> > +$(obj)/hypervisor-blob.c: $(obj)/hypervisor-raw
> > +	@od -tx1 -An -v $< | sed -e 's/^ /0x/' -e 's/$$/,/' -e 's/ /,0x/g' > $@
> 
> an .S file with .incbin is more efficient and simpler
> (note it has to be an separate .S file, otherwise icecream/distcc break) 
> 
> It won't allow to show off any sed skills, but I guess we can live with that ;-)

Good idea, except I currently use sizeof(hypervisor_blob): I'd have to
extract the size separately and hand it in the CFLAGS 8(

> > +static struct vm_struct *hypervisor_vma;
> > +static int cpu_had_pge;
> > +static struct {
> > +	unsigned long offset;
> > +	unsigned short segment;
> > +} lguest_entry;
> > +struct page *hype_pages; /* Contiguous pages. */
> 
> Statics? looks funky.  Why only a single hypervisor_vma?

We only have one switcher: it contains an array of "struct
lguest_state"; one for each guest.  (This is host code we're looking at
here).

> > +/* IDT entries are at start of hypervisor. */
> > +const unsigned long *__lguest_default_idt_entries(void)
> > +{
> > +	return (void *)HYPE_ADDR;
> > +}
> > +
> > +/* Next is switch_to_guest */
> > +static void *__lguest_switch_to_guest(void)
> > +{
> > +	return (void *)HYPE_ADDR + HYPE_DATA_SIZE;
> > +}
> > +
> > +/* Then we use everything else to hold guest state. */
> > +struct lguest_state *__lguest_states(void)
> > +{
> > +	return (void *)HYPE_ADDR + sizeof(hypervisor_blob);
> 
> This cries for asm_offsets.h too, doesn't it? 

HYPE_DATA_SIZE is the size of the hypervisor.S data segment, I'm not
sure how I'd get that into asm_offsets....

> > +}
> > +
> > +static __init int map_hypervisor(void)
> > +{
> > +	unsigned int i;
> > +	int err;
> > +	struct page *pages[HYPERVISOR_PAGES], **pagep = pages;
> > +
> > +	hype_pages = alloc_pages(GFP_KERNEL|__GFP_ZERO,
> > +				 get_order(HYPERVISOR_SIZE));
> 
> Wasteful because of the rounding. Probably wants reintroduction
> of alloc_pages_exact()

HYPERVISOR_SIZE here is 64k, so it's actually OK (we use the space after
the ~3k text & data from hypervisor.S to hold as many struct
lguest_state's as we can).  The name is bad tho; SWITCHER_MAP_SIZE would
probably be better...

> > +
> > +static __exit void unmap_hypervisor(void)
> > +{
> > +	vunmap(hypervisor_vma->addr);
> > +	__free_pages(hype_pages, get_order(HYPERVISOR_SIZE));
> 
> Shouldn't you clean up the GDTs too? 

I could, but there doesn't seem a great deal of point.  If anyone else
is using them, we're in trouble already...

> > +/* IN/OUT insns: enough to get us past boot-time probing. */
> > +static int emulate_insn(struct lguest *lg)
> > +{
> > +	u8 insn;
> > +	unsigned int insnlen = 0, in = 0, shift = 0;
> > +	unsigned long physaddr = guest_pa(lg, lg->state->regs.eip);
> > +
> > +	/* This only works for addresses in linear mapping... */
> > +	if (lg->state->regs.eip < lg->page_offset)
> > +		return 0;
> 
> Shouldn't there be a printk here?

No, the guest should not be able to evoke a printk from the host kernel.
In this case, however, we'll reflect the trap to the kernel, or if the
kernel hasn't installed an IDT yet we'll kill the guest and the lguest
program will get the string "unhandled trap 13 at <EIP> (err=0)".

Basically this emulation code is simply to get us through all that
annoying early boot probing (paravirt_ops does *not* override in/out).

> > +/* Saves exporting idt_table from kernel */
> > +static struct desc_struct *get_idt_table(void)
> > +{
> > +	struct Xgt_desc_struct idt;
> > +
> > +	asm("sidt %0":"=m" (idt));
> 
> Nasty, but ok.
> 
> > +	return (void *)idt.address;
> > +}
> > +
> > +extern asmlinkage void math_state_restore(void);
> 
> No externs in .c files

Agreed, will fix.

> > +
> > +/* Trap page resets this when it reloads gs. */
> > +static int new_gfp_eip(struct lguest *lg, struct lguest_regs *regs)
> > +{
> > +	u32 eip;
> > +	get_user(eip, &lg->lguest_data->gs_gpf_eip);
> > +	if (eip == regs->eip)
> > +		return 0;
> > +	put_user(regs->eip, &lg->lguest_data->gs_gpf_eip);
> 
> No fault checking? 

What do we care if the guest process has unmapped the lguest_data page:
it's his funeral.  We could put a kill_guest there, but it seems a waste
of code to me.

> lhread/write use probably also needs to be double checked that a malicious
> guest can't put the kernel into a loop.

How?  

> > +static void set_ts(unsigned int guest_ts)
> > +{
> > +	u32 cr0;
> > +	if (guest_ts) {
> > +		asm("movl %%cr0,%0":"=r" (cr0));
> > +		if (!(cr0 & 8))
> > +			asm("movl %0,%%cr0": :"r" (cr0|8));
> > +	}
> 
> We have macros and defines for this in standard headers.

This was in prep to unexport paravirt_ops, but you threatened the freeze
on me so I pushed this ahead.

Some of this code will get cleaned up along with others when that patch
goes through (my idea is that we'll make sure to expose the native
versions as inlines, so this code can use native_read_cr0 etc. directly.
lguest hosts cannot be paravirtualized anyway 8).

> > +		if (signal_pending(current))
> > +			return -EINTR
> 
> Probably needs freezer checking here somewhere.

Yes, that code always mystified me...

> > +		if (lg->halted) {
> > +			set_current_state(TASK_INTERRUPTIBLE);
> > +			schedule_timeout(1);
> 
> 1?  And what is that good for anyways?

It's waiting for the dynamic tick code to go in.  Again, I was hoping
that that would go in and I could go in behind it, but you made it clear
that waiting longer isn't an option.

> > +				/* FIXME: If it's reloading %gs in a loop? */
> 
> Yes what then? Have you tried it?
> 
> In general i miss printks when things go wrong. Do you expect
> all users to have a gdbstub ready? @)

If this happens then the process inside the guest which is doing this
will get a Segmentation Fault, because we'll deliver a GPF to the kernel
at that EIP.  In practice, glibc doesn't do this that I have found.  The
note is there so I think harder about it if anyone reports strange segvs
under lguest 8)

> > +pending_dma:
> > +	put_user(lg->pending_dma, (unsigned long *)user);
> > +	put_user(lg->pending_addr, (unsigned long *)user+1);
> 
> error checking? How do you avoid loops?

Why?  And I still don't get what loops?

> > +	if (cpu_has_pge) { /* We have a broader idea of "global". */
> > +		cpu_had_pge = 1;
> > +		on_each_cpu(adjust_pge, 0, 0, 1);
> 
> cpu hotplug? 

Good catch.  Should lock out cpu hotplug during this.  Once it's done,
we're fine, because we clear the feature bit.  Same when putting it
back.

> > +		clear_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability);
> > +	}
> > +	return 0;
> > +}
> > 
> > +	case LHCALL_CRASH: {
> > +		char msg[128];
> > +		lhread(lg, msg, regs->edx, sizeof(msg));
> > +		msg[sizeof(msg)-1] = '\0';
> 
> Might be safer to vet for isprint here.

Hmm, I hope the kasprintf won't barf... this buffer gets handed straight
to the lguest process on its next read.

> > +#define log(...)					\
> > +	do {						\
> > +		mm_segment_t oldfs = get_fs();		\
> > +		char buf[100];				\
> 
> At least older gccs will accumulate the bufs in a function, eventually possibly blowing
> the stack. Better use a function.

Or I'll kill the macro altogether.  It's useful for shotgun debugging a
guest, but it's unused.

> > +	/* If they're halted, we re-enable interrupts. */
> > +	if (lg->halted) {
> > +		/* Re-enable interrupts. */
> > +		put_user(512, &lg->lguest_data->irq_enabled);
> 
> interesting magic number

Yeah, we don't define it anywhere, and we probably should.  From
irqflags.h:

static inline int raw_irqs_disabled_flags(unsigned long flags)
{
	return !(flags & (1 << 9));
}

8(

> > +	/* Ignore NMI, doublefault, hypercall, spurious interrupt. */
> > +	if (i == 2 || i == 8 || i == 15 || i == LGUEST_TRAP_ENTRY)
> > +		return;
> > +	/* FIXME: We should handle debug and int3 */
> > +	else if (i == 1 || i == 3)
> > +		return;
> > +	/* We intercept page fault, general protection fault and fpu missing */
> > +	else if (i == 13)
> > +		copy_trap(lg, &lg->gpf_trap, &d);
> > +	else if (i == 14)
> > +		copy_trap(lg, &lg->page_trap, &d);
> > +	else if (i == 7)
> > +		copy_trap(lg, &lg->fpu_trap, &d);
> > +	/* Other traps go straight to guest. */
> > +	else if (i < FIRST_EXTERNAL_VECTOR || i == SYSCALL_VECTOR)
> > +		setup_idt(lg, i, &d);
> > +	/* A virtual interrupt */
> > +	else if (i < FIRST_EXTERNAL_VECTOR + LGUEST_IRQS)
> > +		copy_trap(lg, &lg->interrupt[i-FIRST_EXTERNAL_VECTOR], &d);\
> 
> switch is not cool enough anymore?

It would have to be a switch then gunk at the bottom, because those last
two tests don't switch-ify.  IIRC I changed back from a switch because
of that.

> > +	down(&lguest_lock);
> 
> i suspect mutexes are the new way to do this

Yep, will replace.

> > +	down_read(&current->mm->mmap_sem);
> > +	if (get_futex_key((u32 __user *)addr, &key) != 0) {
> > +		kill_guest(lg, "bad dma address %#lx", addr);
> > +		goto unlock;
> 
> Risky? Use probe_kernel_address et.al.?

No, get_futex_key handles anything; we rely on that in futex.c

> > +#if 0
> > +/* FIXME: Use asm-offsets here... */
> 
> Remove?

Yep, done in the split-up version...

> > +extern int mce_disabled;
> 
> tststs

I'll get that too 8)

> > +
> > +/* FIXME: Update iff tsc rate changes. */
> 
> It does.

Hey, but I have a FIXME for it!

> > +static fastcall void lguest_cpuid(unsigned int *eax, unsigned int *ebx,
> > +				 unsigned int *ecx, unsigned int *edx)
> > +{
> > +	int is_feature = (*eax == 1);
> > +
> > +	asm volatile ("cpuid"
> > +		      : "=a" (*eax),
> > +			"=b" (*ebx),
> > +			"=c" (*ecx),
> > +			"=d" (*edx)
> > +		      : "0" (*eax), "2" (*ecx));
> 
> What's wrong with the standard cpuid*() macros?

Good call... we expose native_cpuid so I should use that (I can't use
cpuid: this *is* cpuid!)

> > +	extern struct Xgt_desc_struct cpu_gdt_descr;
> > +	extern struct i386_pda boot_pda;
> 
> No externs in .c

Yes, ugly code.  Usually this stuff is done in asm, so it doesn't exist
in a header.  I'll fix it.

> > +
> > +	paravirt_ops.name = "lguest";
> 
> Can you just statically initialize this and then copy over? 

Sure, but then I'll hit all the things I don't want to override, and
it'll be far less clear.

> > +	asm volatile ("mov %0, %%gs" : : "r" (__KERNEL_PDA) : "memory");
> 
> This will be %fs soon.

I know, I had to change this back to backport from -mm tree 8(
Fortunately, the breakage is *really* obvious when it happens.

> ... haven't read everything else. the IO driver earlier was also not very closely looked at.

OK, I'll resend patches (in 4 parts) based on these comments (and
Andrew's).

Thanks!
Rusty.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09 12:39           ` Rusty Russell
@ 2007-02-09 13:57             ` Andi Kleen
  2007-02-09 15:01               ` Rusty Russell
  2007-02-09 14:17             ` Sam Ravnborg
  1 sibling, 1 reply; 57+ messages in thread
From: Andi Kleen @ 2007-02-09 13:57 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtualization, lkml - Kernel Mailing List, Andrew Morton, Sam Ravnborg

On Fri, Feb 09, 2007 at 11:39:31PM +1100, Rusty Russell wrote:
> On Fri, 2007-02-09 at 11:09 +0100, Andi Kleen wrote:
> > > +# This links the hypervisor in the right place and turns it into a C array.
> > > +$(obj)/hypervisor-raw: $(obj)/hypervisor.o
> > > +	@$(LD) -static -Tdata=`printf %#x $$(($(HYPE_ADDR)))` -Ttext=`printf %#x $$(($(HYPE_ADDR)+$(HYPE_DATA_SIZE)))` -o $@ $< && $(OBJCOPY) -O binary $@
> > > +$(obj)/hypervisor-blob.c: $(obj)/hypervisor-raw
> > > +	@od -tx1 -An -v $< | sed -e 's/^ /0x/' -e 's/$$/,/' -e 's/ /,0x/g' > $@
> > 
> > an .S file with .incbin is more efficient and simpler
> > (note it has to be an separate .S file, otherwise icecream/distcc break) 
> > 
> > It won't allow to show off any sed skills, but I guess we can live with that ;-)
> 
> Good idea, except I currently use sizeof(hypervisor_blob): I'd have to
> extract the size separately and hand it in the CFLAGS 8(

hypervisor_start:
	.incbin "hypervisor"
hypervisor_end:

...
	extern char hypervisor_start[], hypervisor_end[];

	size = hypervisor_end - hypervisor_start;

	


> > > +static int cpu_had_pge;
> > > +static struct {
> > > +	unsigned long offset;
> > > +	unsigned short segment;
> > > +} lguest_entry;
> > > +struct page *hype_pages; /* Contiguous pages. */
> > 
> > Statics? looks funky.  Why only a single hypervisor_vma?
> 
> We only have one switcher: it contains an array of "struct
> lguest_state"; one for each guest.  (This is host code we're looking at
> here).

This means it is not SMP safe? 

> No, the guest should not be able to evoke a printk from the host kernel.

This means nobody will know why it failed.

> > > +	else if (i < FIRST_EXTERNAL_VECTOR || i == SYSCALL_VECTOR)
> > > +		setup_idt(lg, i, &d);
> > > +	/* A virtual interrupt */
> > > +	else if (i < FIRST_EXTERNAL_VECTOR + LGUEST_IRQS)
> > > +		copy_trap(lg, &lg->interrupt[i-FIRST_EXTERNAL_VECTOR], &d);\
> > 
> > switch is not cool enough anymore?
> 
> It would have to be a switch then gunk at the bottom, because those last
> two tests don't switch-ify.  IIRC I changed back from a switch because
> of that.

gcc has a handy extension for this: 

case 0...FIRST_EXTERNAL_VECTOR-1:
case SYSCALL_VECTOR:
case FIRST_EXTERNAL_VECTOR...FIRST_EXTERNAL_VECTOR+LGUEST_IRQS:


Re: the loops; e.g. we used to have possible loop cases
when a page fault does read instructions and then causes another
page fault etc.etc. I haven't seen any immediate danger of this,
but it might be worth double checking.

-Andi

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/10] lguest: Export symbols for lguest as a module
  2007-02-09 12:06       ` Rusty Russell
@ 2007-02-09 13:58         ` Andi Kleen
  2007-02-10 11:39           ` Rusty Russell
  0 siblings, 1 reply; 57+ messages in thread
From: Andi Kleen @ 2007-02-09 13:58 UTC (permalink / raw)
  To: Rusty Russell; +Cc: virtualization, lkml - Kernel Mailing List, Andrew Morton

On Fri, Feb 09, 2007 at 11:06:06PM +1100, Rusty Russell wrote:
> On Fri, 2007-02-09 at 10:32 +0100, Andi Kleen wrote:
> > On Friday 09 February 2007 10:15, Rusty Russell wrote:
> > 
> > > tsc_khz:
> > > 	Simplest way of telling the guest how to interpret the TSC
> > > 	counter.
> > 
> > 
> > Are you sure this will work with varying TSC frequencies? 
> 
> I'm actually quite sure it doesn't (there's a FIXME in the lguest code).
> Given the debate over how useful the TSC was, I originally didn't use
> it, but (1) it's simple, and (2) when it doesn't change, it's pretty
> accurate.

But when it changes users become pretty unhappy

> 
> > In general you should get this from cpufreq.
> 
> Hmm, ok, I'll bite: how?  Time is a mystery I've avoided so far 8)

the old x86-64 time.c (before -mm) has a example in #ifdef CONFIG_CPUFREQ


-Andi

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09 12:39           ` Rusty Russell
  2007-02-09 13:57             ` Andi Kleen
@ 2007-02-09 14:17             ` Sam Ravnborg
  2007-02-09 15:23               ` Rusty Russell
  1 sibling, 1 reply; 57+ messages in thread
From: Sam Ravnborg @ 2007-02-09 14:17 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andi Kleen, virtualization, lkml - Kernel Mailing List, Andrew Morton

On Fri, Feb 09, 2007 at 11:39:31PM +1100, Rusty Russell wrote:
> On Fri, 2007-02-09 at 11:09 +0100, Andi Kleen wrote:
> > On Friday 09 February 2007 10:20, Rusty Russell wrote:
> > > Unfortunately, we don't have the build infrastructure for "private"
> > > asm-offsets.h files, so there's a not-so-neat include in
> > > arch/i386/kernel/asm-offsets.c.
> > 
> > Ask the kbuild people to fix that? 
> > 
> > It indeed looks ugly.
> > 
> > I bet Xen et.al. could make good use of that too.
> 
> Yes.   I originally had the constants #defined in the header and a whole
> heap of BUILD_BUG_ON(XYZ_offset != offsetof(xyz)) in my module, which
> was even uglier (but at least contained in my code).
I do not quite see what you ask for.
Care to try to describe the problem a bit then I may look at it sometime.

[Heading for vacation in a few hours so no prompt reply]

	Sam

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09 13:57             ` Andi Kleen
@ 2007-02-09 15:01               ` Rusty Russell
  0 siblings, 0 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09 15:01 UTC (permalink / raw)
  To: Andi Kleen
  Cc: virtualization, lkml - Kernel Mailing List, Andrew Morton, Sam Ravnborg

On Fri, 2007-02-09 at 14:57 +0100, Andi Kleen wrote:
> On Fri, Feb 09, 2007 at 11:39:31PM +1100, Rusty Russell wrote:
> > On Fri, 2007-02-09 at 11:09 +0100, Andi Kleen wrote:
> > > > +# This links the hypervisor in the right place and turns it into a C array.
> > > > +$(obj)/hypervisor-raw: $(obj)/hypervisor.o
> > > > +	@$(LD) -static -Tdata=`printf %#x $$(($(HYPE_ADDR)))` -Ttext=`printf %#x $$(($(HYPE_ADDR)+$(HYPE_DATA_SIZE)))` -o $@ $< && $(OBJCOPY) -O binary $@
> > > > +$(obj)/hypervisor-blob.c: $(obj)/hypervisor-raw
> > > > +	@od -tx1 -An -v $< | sed -e 's/^ /0x/' -e 's/$$/,/' -e 's/ /,0x/g' > $@
> > > 
> > > an .S file with .incbin is more efficient and simpler
> > > (note it has to be an separate .S file, otherwise icecream/distcc break) 
> > > 
> > > It won't allow to show off any sed skills, but I guess we can live with that ;-)
> > 
> > Good idea, except I currently use sizeof(hypervisor_blob): I'd have to
> > extract the size separately and hand it in the CFLAGS 8(
> 
> hypervisor_start:
> 	.incbin "hypervisor"
> hypervisor_end:
> 
> ...
> 	extern char hypervisor_start[], hypervisor_end[];
> 
> 	size = hypervisor_end - hypervisor_start;

#define MAX_LGUEST_GUESTS \
	((HYPERVISOR_SIZE-sizeof(hypervisor_blob))/sizeof(struct lguest_state))
struct lguest lguests[MAX_LGUEST_GUESTS];

I could kmalloc that array, of course, but is it worth it to get rid of
one line in a Makefile?

> > > Statics? looks funky.  Why only a single hypervisor_vma?
> > 
> > We only have one switcher: it contains an array of "struct
> > lguest_state"; one for each guest.  (This is host code we're looking at
> > here).
> 
> This means it is not SMP safe? 

No, it's host-SMP safe.  There's no guest SMP support though, which
keeps things nice and simple.

> > No, the guest should not be able to evoke a printk from the host kernel.
> 
> This means nobody will know why it failed.

No, that's why the lguest process gets the error string (and prints it
out).  Biggest usability improvement I made in a while.

The kernel log is the absolute worst place to report errors; iptables
and module code both do that and the #1 FAQ is "What happened?"

I didn't make the same mistake this (third) time!  Here's lguest when
the guest crashes:

        # lguest 64m bzImage ...
        ...
        lguest: CRASH: Attempted to kill init!
        # 

> > It would have to be a switch then gunk at the bottom, because those last
> > two tests don't switch-ify.  IIRC I changed back from a switch because
> > of that.
> 
> gcc has a handy extension for this: 
> 
> case 0...FIRST_EXTERNAL_VECTOR-1:
> case SYSCALL_VECTOR:
> case FIRST_EXTERNAL_VECTOR...FIRST_EXTERNAL_VECTOR+LGUEST_IRQS:

Indeed, and I really wanted to use it, but it still doesn't allow
overlapping ranges.  The first cases are in the middle of 0 ...
FIRST_EXTERNAL_VECTOR-1, and if LGUEST_IRQS is large enough, 
SYSCALL_VECTOR is in the middle of FIRST_EXTERNAL_VECTOR ...
FIRST_EXTERNAL_VECTOR+LGUEST_IRQS 8(

Nonetheless, I switchified what I could in my update (testing now).

> Re: the loops; e.g. we used to have possible loop cases
> when a page fault does read instructions and then causes another
> page fault etc.etc. I haven't seen any immediate danger of this,
> but it might be worth double checking.

OK, I'll run some tests here.  There shouldn't be any danger here
though....

Thanks!
Rusty.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09 14:17             ` Sam Ravnborg
@ 2007-02-09 15:23               ` Rusty Russell
  2007-02-12 13:34                 ` [q] kbuild for private asm-offsets (Re: [PATCH 6/10] lguest code: the little linux hypervisor.) Oleg Verych
  0 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-09 15:23 UTC (permalink / raw)
  To: Sam Ravnborg
  Cc: Andi Kleen, virtualization, lkml - Kernel Mailing List, Andrew Morton

On Fri, 2007-02-09 at 15:17 +0100, Sam Ravnborg wrote:
> I do not quite see what you ask for.
> Care to try to describe the problem a bit then I may look at it sometime.
> 
> [Heading for vacation in a few hours so no prompt reply]

I'd like my own, private "asm-offsets.h".  In this case, in
arch/i386/lguest/.  I guess it's a matter of extracting the core of the
asm-offsets.h magic and generalizing it.

Have a good break!
Rusty.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6c/10] lguest: the guest code
  2007-02-09 10:57           ` [PATCH 6c/10] lguest: the guest code Rusty Russell
  2007-02-09 10:58             ` [PATCH 6d/10] lguest: the Makefiles Rusty Russell
@ 2007-02-09 17:06             ` Len Brown
  2007-02-09 17:14               ` James Morris
  1 sibling, 1 reply; 57+ messages in thread
From: Len Brown @ 2007-02-09 17:06 UTC (permalink / raw)
  To: Rusty Russell
  Cc: lkml - Kernel Mailing List, Andrew Morton, Andi Kleen, virtualization

On Friday 09 February 2007 05:57, Rusty Russell wrote:

> +#ifdef CONFIG_ACPI
> +       acpi_disabled = 1;
> +       acpi_ht = 0;
> +#endif

If this is hard-coded to have ACPI disabled, why isn't it enforced at build-time?

thanks,
-Len

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6c/10] lguest: the guest code
  2007-02-09 17:06             ` [PATCH 6c/10] lguest: the guest code Len Brown
@ 2007-02-09 17:14               ` James Morris
  2007-02-09 17:49                 ` Len Brown
  0 siblings, 1 reply; 57+ messages in thread
From: James Morris @ 2007-02-09 17:14 UTC (permalink / raw)
  To: Len Brown
  Cc: Rusty Russell, lkml - Kernel Mailing List, Andrew Morton,
	Andi Kleen, virtualization

[-- Attachment #1: Type: TEXT/PLAIN, Size: 438 bytes --]

On Fri, 9 Feb 2007, Len Brown wrote:

> On Friday 09 February 2007 05:57, Rusty Russell wrote:
> 
> > +#ifdef CONFIG_ACPI
> > +       acpi_disabled = 1;
> > +       acpi_ht = 0;
> > +#endif
> 
> If this is hard-coded to have ACPI disabled, why isn't it enforced at build-time?

This is being disabled in the guest kernel only.  The host and guest 
kernels are expected to be the same build.



- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6c/10] lguest: the guest code
  2007-02-09 17:14               ` James Morris
@ 2007-02-09 17:49                 ` Len Brown
  2007-02-09 23:48                   ` [PATCH 11/10] lguest: use disable_acpi() Rusty Russell
  0 siblings, 1 reply; 57+ messages in thread
From: Len Brown @ 2007-02-09 17:49 UTC (permalink / raw)
  To: James Morris
  Cc: Rusty Russell, lkml - Kernel Mailing List, Andrew Morton,
	Andi Kleen, virtualization

On Friday 09 February 2007 12:14, James Morris wrote:
> On Fri, 9 Feb 2007, Len Brown wrote:
> 
> > On Friday 09 February 2007 05:57, Rusty Russell wrote:
> > 
> > > +#ifdef CONFIG_ACPI
> > > +       acpi_disabled = 1;
> > > +       acpi_ht = 0;
> > > +#endif
> > 
> > If this is hard-coded to have ACPI disabled, why isn't it enforced at build-time?
> 
> This is being disabled in the guest kernel only.  The host and guest 
> kernels are expected to be the same build.

Okay, but better to use disable_acpi()
indeed, since this would be the first code not already inside CONFIG_ACPI
to invoke disable_acpi(), we could define the inline as empty and you could
then scratch the #ifdef too.

cheers,
-Len

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler
  2007-02-09 11:52     ` Rusty Russell
@ 2007-02-09 20:49       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 57+ messages in thread
From: Jeremy Fitzhardinge @ 2007-02-09 20:49 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andi Kleen, virtualization, lkml - Kernel Mailing List,
	Andrew Morton, Jeremy Fitzhardinge

Rusty Russell wrote:
> If we make it thought early_fault, this will do just that.
>
> Given this is a "never happens" situation, however... if you're actually
> under Xen or lguest, you won't make it that far (lguest, at least, will
> kill you on the cr2 load in early_fault, but it doesn't matter because
> we won't get anywhere with early_printk anyway).
>
> Actually, if we did BUG() here at least lguest would print something...
> I wonder what Xen would do...

Xen would print a complete register dump and backtrace.  There's not a
lot else you can do here; you could try early_printk, but if we're in a
strange virtual environment, there may be no device on which the output
could appear.

    J


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/10] lguest code: the little linux hypervisor.
  2007-02-09 11:50               ` Andi Kleen
  2007-02-09 11:54                 ` Zachary Amsden
@ 2007-02-09 22:29                 ` David Miller
  1 sibling, 0 replies; 57+ messages in thread
From: David Miller @ 2007-02-09 22:29 UTC (permalink / raw)
  To: ak; +Cc: virtualization, zach, rusty, paulus, sfr, akpm, linux-kernel

From: Andi Kleen <ak@muc.de>
Date: Fri, 9 Feb 2007 12:50:06 +0100

> 
> > Yes, it is a bit, umm, innovative.  If it is going to be kept, even if 
> > just for devel logging, you should disable interrupts around it.  
> > Changing segments is not a normal thing to do.
> 
> Actually that wouldn't be needed because interrupts are not allowed to do any 
> user accesses. And contrary to the name it doesn't actually change
> the segment registers, only state used by *_user.

That's right and we use this construct all throughout the
syscall compatibility layer for 64-bit platforms.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 11/10] lguest: use disable_acpi()
  2007-02-09 17:49                 ` Len Brown
@ 2007-02-09 23:48                   ` Rusty Russell
  0 siblings, 0 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-09 23:48 UTC (permalink / raw)
  To: Len Brown
  Cc: James Morris, lkml - Kernel Mailing List, Andrew Morton,
	Andi Kleen, virtualization

On Fri, 2007-02-09 at 12:49 -0500, Len Brown wrote:
> On Friday 09 February 2007 12:14, James Morris wrote:
> > This is being disabled in the guest kernel only.  The host and guest 
> > kernels are expected to be the same build.
> 
> Okay, but better to use disable_acpi()
> indeed, since this would be the first code not already inside CONFIG_ACPI
> to invoke disable_acpi(), we could define the inline as empty and you could
> then scratch the #ifdef too.

Thanks Len!

This applies on top of that series.

== 
Len Brown <lenb@kernel.org> said:
> Okay, but better to use disable_acpi()
> indeed, since this would be the first code not already inside CONFIG_ACPI
> to invoke disable_acpi(), we could define the inline as empty and you could
> then scratch the #ifdef too.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff -r 85363b87e20b arch/i386/lguest/lguest.c
--- a/arch/i386/lguest/lguest.c	Sat Feb 10 01:52:37 2007 +1100
+++ b/arch/i386/lguest/lguest.c	Sat Feb 10 10:28:36 2007 +1100
@@ -555,10 +555,7 @@ static __attribute_used__ __init void lg
 	mce_disabled = 1;
 #endif
 
-#ifdef CONFIG_ACPI
-	acpi_disabled = 1;
-	acpi_ht = 0;
-#endif
+	disable_acpi();
 	if (boot->initrd_size) {
 		/* We stash this at top of memory. */
 		INITRD_START = boot->max_pfn*PAGE_SIZE - boot->initrd_size;
diff -r 85363b87e20b include/asm-i386/acpi.h
--- a/include/asm-i386/acpi.h	Sat Feb 10 01:52:37 2007 +1100
+++ b/include/asm-i386/acpi.h	Sat Feb 10 10:43:43 2007 +1100
@@ -127,6 +127,7 @@ extern int acpi_irq_balance_set(char *st
 #define acpi_ioapic 0
 static inline void acpi_noirq_set(void) { }
 static inline void acpi_disable_pci(void) { }
+static inline void disable_acpi(void) { }
 
 #endif	/* !CONFIG_ACPI */
 




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/10] lguest: Export symbols for lguest as a module
  2007-02-09 13:58         ` Andi Kleen
@ 2007-02-10 11:39           ` Rusty Russell
  0 siblings, 0 replies; 57+ messages in thread
From: Rusty Russell @ 2007-02-10 11:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: virtualization, lkml - Kernel Mailing List, Andrew Morton

On Fri, 2007-02-09 at 14:58 +0100, Andi Kleen wrote:
> On Fri, Feb 09, 2007 at 11:06:06PM +1100, Rusty Russell wrote:
> > On Fri, 2007-02-09 at 10:32 +0100, Andi Kleen wrote:
> > > Are you sure this will work with varying TSC frequencies? 
> > 
> > I'm actually quite sure it doesn't (there's a FIXME in the lguest code).
> > Given the debate over how useful the TSC was, I originally didn't use
> > it, but (1) it's simple, and (2) when it doesn't change, it's pretty
> > accurate.
> 
> But when it changes users become pretty unhappy

True.  Simplest fix is below.  There are several time issues on the TODO
list, and I will simply add this one.

Thanks!
Rusty.

lguest: Don't use the TSC in guest.

Andi complained that lguest guests don't deal with host TSC speed
changing.  Close this can of worms by not using the TSC in the guest.
Later on we could do something clever when we overhaul this to deal
with stolen time (also on the TODO list).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

===================================================================
--- a/arch/i386/kernel/tsc.c
+++ b/arch/i386/kernel/tsc.c
@@ -475,4 +475,3 @@ static int __init init_tsc_clocksource(v
 }
 
 module_init(init_tsc_clocksource);
-EXPORT_SYMBOL_GPL(tsc_khz);
===================================================================
--- a/arch/i386/lguest/hypercalls.c
+++ b/arch/i386/lguest/hypercalls.c
@@ -18,7 +18,6 @@
 #include <linux/uaccess.h>
 #include <linux/syscalls.h>
 #include <linux/mm.h>
-#include <linux/clocksource.h>
 #include <asm/lguest.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -179,8 +178,6 @@ int hypercall(struct lguest *lg, struct 
 		/* We reserve the top pgd entry. */
 		put_user(4U*1024*1024, &lg->lguest_data->reserve_mem);
 		put_user(lg->guestid, &lg->lguest_data->guestid);
-		put_user(clocksource_khz2mult(tsc_khz, 22),
-			 &lg->lguest_data->clock_mult);
 		return 0;
 	}
 	pending = do_hcall(lg, regs);
===================================================================
--- a/arch/i386/lguest/lguest.c
+++ b/arch/i386/lguest/lguest.c
@@ -25,7 +25,6 @@
 #include <linux/screen_info.h>
 #include <linux/irq.h>
 #include <linux/interrupt.h>
-#include <linux/clocksource.h>
 #include <asm/paravirt.h>
 #include <asm/lguest.h>
 #include <asm/lguest_user.h>
@@ -138,29 +137,10 @@ static struct notifier_block paniced = {
 	.notifier_call = lguest_panic
 };
 
-static cycle_t lguest_clock_read(void)
-{
-	/* FIXME: This is just the native one.  Account stolen time! */
-	return paravirt_ops.read_tsc();
-}
-
-/* FIXME: Update iff tsc rate changes. */
-static struct clocksource lguest_clock = {
-	.name			= "lguest",
-	.rating			= 400,
-	.read			= lguest_clock_read,
-	.mask			= CLOCKSOURCE_MASK(64),
-	.mult			= 0, /* to be set */
-	.shift			= 22,
-	.is_continuous		= 1,
-};
-
 static char *lguest_memory_setup(void)
 {
 	/* We do these here because lockcheck barfs if before start_kernel */
 	atomic_notifier_chain_register(&panic_notifier_list, &paniced);
-	lguest_clock.mult = lguest_data.clock_mult;
-	clocksource_register(&lguest_clock);
 
 	e820.nr_map = 0;
 	add_memory_region(0, PFN_PHYS(boot->max_pfn), E820_RAM);
===================================================================
--- a/include/asm-i386/lguest.h
+++ b/include/asm-i386/lguest.h
@@ -74,8 +74,6 @@ struct lguest_data
 	unsigned long reserve_mem;
 	/* ID of this guest (used by network driver to set ethernet address) */
 	u16 guestid;
-	/* Multiplier for TSC clock. */
-	u32 clock_mult;
 
 /* Fields initialized by the guest at boot: */
 	/* Instruction range to suppress interrupts even if enabled */



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [q] kbuild for private asm-offsets (Re: [PATCH 6/10] lguest code: the little linux hypervisor.)
  2007-02-09 15:23               ` Rusty Russell
@ 2007-02-12 13:34                 ` Oleg Verych
  2007-02-12 17:24                   ` Andi Kleen
                                     ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Oleg Verych @ 2007-02-12 13:34 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Sam Ravnborg, Andi Kleen, virtualization,
	lkml - Kernel Mailing List, Andrew Morton

> From: Rusty Russell
> Newsgroups: gmane.linux.kernel,gmane.linux.kernel.virtualization
> Subject: Re: [PATCH 6/10] lguest code: the little linux hypervisor.
> Date: Sat, 10 Feb 2007 02:23:19 +1100

Hallo, Rusty, guys.

> On Fri, 2007-02-09 at 15:17 +0100, Sam Ravnborg wrote:
>> I do not quite see what you ask for.
>> Care to try to describe the problem a bit then I may look at it sometime.
>> 
>> [Heading for vacation in a few hours so no prompt reply]
>
> I'd like my own, private "asm-offsets.h".  In this case, in
> arch/i386/lguest/.  I guess it's a matter of extracting the core of the
> asm-offsets.h magic and generalizing it.
>
> Have a good break!
> Rusty.

If you will have time for newbie, to explain in a few words, what is it need
for (whole idea, or key detail), and, maybe, why it is generated so ... interestingly:

            asm-offsets.c -> *.s -> *.h
 (but this looks like interconnecting C and assembler, obviously)

I will glad to help providing solution maybe somewhat earlier (well, i'm
trying to understand whole building process, if that matters).

.... And, of course, if this q isn't dumb.

Thanks.
____

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [q] kbuild for private asm-offsets (Re: [PATCH 6/10] lguest code: the little linux hypervisor.)
  2007-02-12 13:34                 ` [q] kbuild for private asm-offsets (Re: [PATCH 6/10] lguest code: the little linux hypervisor.) Oleg Verych
@ 2007-02-12 17:24                   ` Andi Kleen
  2007-02-12 21:41                   ` Sam Ravnborg
  2007-02-12 23:41                   ` Rusty Russell
  2 siblings, 0 replies; 57+ messages in thread
From: Andi Kleen @ 2007-02-12 17:24 UTC (permalink / raw)
  To: Oleg Verych
  Cc: Rusty Russell, Sam Ravnborg, virtualization,
	lkml - Kernel Mailing List, Andrew Morton

> If you will have time for newbie, to explain in a few words, what is it need
> for (whole idea, or key detail), and, maybe, why it is generated so ... interestingly:
> 
>             asm-offsets.c -> *.s -> *.h
>  (but this looks like interconnecting C and assembler, obviously)
> 
> I will glad to help providing solution maybe somewhat earlier (well, i'm
> trying to understand whole building process, if that matters).

The problem is trying to figure out what needs to be done to get
multiple asm-offsets.h.

-Andi

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [q] kbuild for private asm-offsets (Re: [PATCH 6/10] lguest code: the little linux hypervisor.)
  2007-02-12 13:34                 ` [q] kbuild for private asm-offsets (Re: [PATCH 6/10] lguest code: the little linux hypervisor.) Oleg Verych
  2007-02-12 17:24                   ` Andi Kleen
@ 2007-02-12 21:41                   ` Sam Ravnborg
  2007-02-12 23:41                   ` Rusty Russell
  2 siblings, 0 replies; 57+ messages in thread
From: Sam Ravnborg @ 2007-02-12 21:41 UTC (permalink / raw)
  To: Oleg Verych
  Cc: Rusty Russell, Andi Kleen, virtualization,
	lkml - Kernel Mailing List, Andrew Morton

> 
> If you will have time for newbie, to explain in a few words, what is it need
> for (whole idea, or key detail), and, maybe, why it is generated so ... interestingly:
> 
>             asm-offsets.c -> *.s -> *.h
>  (but this looks like interconnecting C and assembler, obviously)

Correct - asm-offsets is used to transfer constants for .c to assembler,
which us especially convinient for constant that differ when a stuct
is changed.

> 
> I will glad to help providing solution maybe somewhat earlier (well, i'm
> trying to understand whole building process, if that matters).

Basically what is requested is to move the generic parts of asm-offsets
generation from top-level Kbuild file to a generic file somewhere.

Since this is not generic Kbuild functionality but more specific to the
kernel it should not be part of Kbuild generic files.
include/asm-generic/asm-offsets seems to be a fair place for it.

It would contain the sed-y parts as well as cmd_offsets definition.

I would be glad if you could give that a spin.

	Sam

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [q] kbuild for private asm-offsets (Re: [PATCH 6/10] lguest code: the little linux hypervisor.)
  2007-02-12 13:34                 ` [q] kbuild for private asm-offsets (Re: [PATCH 6/10] lguest code: the little linux hypervisor.) Oleg Verych
  2007-02-12 17:24                   ` Andi Kleen
  2007-02-12 21:41                   ` Sam Ravnborg
@ 2007-02-12 23:41                   ` Rusty Russell
  2007-02-13  3:10                       ` Oleg Verych
  2 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-12 23:41 UTC (permalink / raw)
  To: Oleg Verych
  Cc: Sam Ravnborg, Andi Kleen, virtualization,
	lkml - Kernel Mailing List, Andrew Morton

On Mon, 2007-02-12 at 14:34 +0100, Oleg Verych wrote:
> > I'd like my own, private "asm-offsets.h".  In this case, in
> > arch/i386/lguest/.  I guess it's a matter of extracting the core of the
> > asm-offsets.h magic and generalizing it.
> >
> > Have a good break!
> > Rusty.
> 
> If you will have time for newbie, to explain in a few words, what is it need
> for (whole idea, or key detail), and, maybe, why it is generated so ... interestingly:
> 
>             asm-offsets.c -> *.s -> *.h
>  (but this looks like interconnecting C and assembler, obviously)

Hi Oleg,

	Always happy to explain.  There's often a need to access constants in
assembler, which can only be derived from C, such as the size of a
structure, or the offset of a certain member within a structure.
Hardcoding the numbers in assembler is fragile leading to breakage when
something changes.

	So, asm-offsets.c is the solution: it uses asm() statements to emit
patterns in the assembler, with the compiler computing the actual
numbers, eg:

	#define DEFINE(sym, val) \
	        asm volatile("\n->" #sym " %0 " #val : : "i" (val))
	DEFINE(SIZEOF_FOOBAR, sizeof(foobar));

Becomes in asm-offsets.s:
	->SIZEOF_FOOBAR $10 sizeof(foobar)  #

This gets sed'd back into asm-offsets.h:
	#define SIZEOF_FOOBAR 10 /* SIZEOF_FOOBAR  # */

This can be included from .S files (which get passed through the
pre-processor).

Hope that helps!
Rusty.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [q] kbuild for private asm-offsets (Re: [PATCH 6/10] lguest code: the little linux hypervisor.)
  2007-02-12 23:41                   ` Rusty Russell
@ 2007-02-13  3:10                       ` Oleg Verych
  0 siblings, 0 replies; 57+ messages in thread
From: Oleg Verych @ 2007-02-13  3:10 UTC (permalink / raw)
  To: LKML
  Cc: Rusty Russell, Sam Ravnborg, Andi Kleen, virtualization, Andrew Morton

On Tue, Feb 13, 2007 at 10:41:36AM +1100, Rusty Russell wrote:
> 	Always happy to explain.

Thanks!

[]
> 	So, asm-offsets.c is the solution: it uses asm() statements to emit
> patterns in the assembler, with the compiler computing the actual
> numbers, eg:
> 
> 	#define DEFINE(sym, val) \
(1) 	        asm volatile("\n->" #sym " %0 " #val : : "i" (val))
> 	DEFINE(SIZEOF_FOOBAR, sizeof(foobar));
> 
(2) Becomes in asm-offsets.s:
> 	->SIZEOF_FOOBAR $10 sizeof(foobar)  #
> 
(3) This gets sed'd back into asm-offsets.h:
> 	#define SIZEOF_FOOBAR 10 /* SIZEOF_FOOBAR  # */
> 
> This can be included from .S files (which get passed through the
> pre-processor).

So, to make this parsing job clear once again, (1) is a pattern generator
of (2) for sed (this is not assembler), which is making final (3) then.
And this are actually C-for-asm offsets (asm-offsets.h is confusing to
me).

On Mon, Feb 12, 2007 at 10:41:28PM +0100, Sam Ravnborg wrote:
[]
> > I will glad to help providing solution maybe somewhat earlier (well, i'm
> > trying to understand whole building process, if that matters).
> 
> Basically what is requested is to move the generic parts of asm-offsets
> generation from top-level Kbuild file to a generic file somewhere.
> 
> Since this is not generic Kbuild functionality but more specific to the
> kernel it should not be part of Kbuild generic files.
> include/asm-generic/asm-offsets seems to be a fair place for it.

Too much "generic" were used. Anyways, simply thing is, that there must be:

-- hardware file(s) for Arch;

-- "private" software file(s), for paravirt and such.

On Mon, Feb 12, 2007 at 06:24:13PM +0100, Andi Kleen wrote:
[]
> The problem is trying to figure out what needs to be done to get
> multiple asm-offsets.h.

Proposition will follow.

Kind regards.

P.S.
While it was fun to run kinds of offtopic operation systems in
virtual machine back in 2000-2001. Now it isn't.

Mainly due to yet another zoo. May be this is good for quick adoption,
commercial benefit, etc. Yet thankfully to Rusty, general useful
infrastructure is available now. Kudos to everyone :)
____

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [q] kbuild for private asm-offsets (Re: [PATCH 6/10] lguest code: the little linux hypervisor.)
@ 2007-02-13  3:10                       ` Oleg Verych
  0 siblings, 0 replies; 57+ messages in thread
From: Oleg Verych @ 2007-02-13  3:10 UTC (permalink / raw)
  To: LKML; +Cc: Andi Kleen, Andrew Morton, Sam Ravnborg, virtualization

On Tue, Feb 13, 2007 at 10:41:36AM +1100, Rusty Russell wrote:
> 	Always happy to explain.

Thanks!

[]
> 	So, asm-offsets.c is the solution: it uses asm() statements to emit
> patterns in the assembler, with the compiler computing the actual
> numbers, eg:
> 
> 	#define DEFINE(sym, val) \
(1) 	        asm volatile("\n->" #sym " %0 " #val : : "i" (val))
> 	DEFINE(SIZEOF_FOOBAR, sizeof(foobar));
> 
(2) Becomes in asm-offsets.s:
> 	->SIZEOF_FOOBAR $10 sizeof(foobar)  #
> 
(3) This gets sed'd back into asm-offsets.h:
> 	#define SIZEOF_FOOBAR 10 /* SIZEOF_FOOBAR  # */
> 
> This can be included from .S files (which get passed through the
> pre-processor).

So, to make this parsing job clear once again, (1) is a pattern generator
of (2) for sed (this is not assembler), which is making final (3) then.
And this are actually C-for-asm offsets (asm-offsets.h is confusing to
me).

On Mon, Feb 12, 2007 at 10:41:28PM +0100, Sam Ravnborg wrote:
[]
> > I will glad to help providing solution maybe somewhat earlier (well, i'm
> > trying to understand whole building process, if that matters).
> 
> Basically what is requested is to move the generic parts of asm-offsets
> generation from top-level Kbuild file to a generic file somewhere.
> 
> Since this is not generic Kbuild functionality but more specific to the
> kernel it should not be part of Kbuild generic files.
> include/asm-generic/asm-offsets seems to be a fair place for it.

Too much "generic" were used. Anyways, simply thing is, that there must be:

-- hardware file(s) for Arch;

-- "private" software file(s), for paravirt and such.

On Mon, Feb 12, 2007 at 06:24:13PM +0100, Andi Kleen wrote:
[]
> The problem is trying to figure out what needs to be done to get
> multiple asm-offsets.h.

Proposition will follow.

Kind regards.

P.S.
While it was fun to run kinds of offtopic operation systems in
virtual machine back in 2000-2001. Now it isn't.

Mainly due to yet another zoo. May be this is good for quick adoption,
commercial benefit, etc. Yet thankfully to Rusty, general useful
infrastructure is available now. Kudos to everyone :)
____

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [pp] kbuild: lguest with private asm-offsets (and some bloat)
  2007-02-13  3:10                       ` Oleg Verych
  (?)
@ 2007-02-16 15:55                       ` Oleg Verych
  2007-02-16 15:59                         ` [pp] kbuild: asm-offsets generalized Oleg Verych
  -1 siblings, 1 reply; 57+ messages in thread
From: Oleg Verych @ 2007-02-16 15:55 UTC (permalink / raw)
  To: LKML; +Cc: Andi Kleen, Andrew Morton, Sam Ravnborg, olecom

On Tue, Feb 13, 2007 at 04:10:44AM +0100, Oleg Verych wrote:
[]
> 
> Proposition will follow.
> 
[]

[patch proposition] kbuild: lguest with private asm-offsets

 * added some bloat to lguest's Makefile:
   - lguest doesn't rebuild, if not changed (due to FORCED implicit %o:%S),
   - support of kbuild's queit/nonquiet commands,
   - (hopefully) more readable and clear build process,
   - private asm-offsets,

 * some whitespace was killed,

 * needs "asm-offsets magic demystified, generalized".

Done on top of first lkml post of lguest.

pp-by: Oleg Verych
---
 arch/i386/kernel/asm-offsets.c |  23 +-----------------
 arch/i386/lguest/Makefile      |  40 +++++++++++++++++++++++++++++---
 arch/i386/lguest/asm-offsets.c |  31 +++++++++++++++++++++++++
 arch/i386/lguest/hypervisor.S  |   8 +++---
 
 4 files changed, 73 insertions(+), 29 deletions(-)

Index: linux-2.6.20/arch/i386/lguest/Makefile
===================================================================
--- linux-2.6.20.orig/arch/i386/lguest/Makefile	2007-02-16 15:32:51.665149000 +0100
+++ linux-2.6.20/arch/i386/lguest/Makefile	2007-02-16 16:16:48.525942500 +0100
@@ -1,2 +1,6 @@
+#
+# i386/lguest/Makefile
+#
+
 # Guest requires the paravirt_ops replacement and the bus driver.
 obj-$(CONFIG_LGUEST_GUEST) += lguest.o lguest_bus.o
@@ -7,4 +11,6 @@ lg-objs := core.o hypercalls.o page_tabl
 	segments.o io.o lguest_user.o
 
+asm-offsets-lg := $(objtree)/include/asm/asm-offsets-lg.h
+
 # We use top 4MB for guest traps page, then hypervisor. */
 HYPE_ADDR := (0xFFC00000+4096)
@@ -13,10 +19,36 @@ HYPE_DATA_SIZE := 1024
 CFLAGS += -DHYPE_ADDR="$(HYPE_ADDR)" -DHYPE_DATA_SIZE="$(HYPE_DATA_SIZE)"
 
+LD_HYPE_DATA := $(shell printf %\#x $$(($(HYPE_ADDR))))
+LD_HYPE_TEXT := $(shell printf %\#x $$(($(HYPE_ADDR)+$(HYPE_DATA_SIZE))))
+
+quiet_cmd_ld_hyber = LD [H]  $@
+      cmd_ld_hyber = $(LD) -static -Tdata=$(LD_HYPE_DATA) \
+				   -Ttext=$(LD_HYPE_TEXT) -o $@ $<
+cmd_objcopy = $(OBJCOPY) -O binary $@
+
+quiet_cmd_blob = BLOB    $@
+      cmd_blob = od -tx1 -An -v $< | sed -e 's/^ /0x/' \
+					 -e 's/$$/,/' -e 's/ /,0x/g' > $@
+quiet_cmd_offsets = GEN     $@
+      cmd_offsets = $(srctree)/scripts/mkCconstants $< $@
+
 $(obj)/core.o: $(obj)/hypervisor-blob.c
+
+$(obj)/hypervisor-blob.c: $(obj)/hypervisor-raw
+	$(call cmd,blob)
+
 # This links the hypervisor in the right place and turns it into a C array.
 $(obj)/hypervisor-raw: $(obj)/hypervisor.o
-	@$(LD) -static -Tdata=`printf %#x $$(($(HYPE_ADDR)))` -Ttext=`printf %#x $$(($(HYPE_ADDR)+$(HYPE_DATA_SIZE)))` -o $@ $< && $(OBJCOPY) -O binary $@
-$(obj)/hypervisor-blob.c: $(obj)/hypervisor-raw
-	@od -tx1 -An -v $< | sed -e 's/^ /0x/' -e 's/$$/,/' -e 's/ /,0x/g' > $@
+	$(call if_changed,ld_hyber)
+	$(call cmd,objcopy)
+
+$(obj)/hypervisor.o: $(src)/hypervisor.S $(asm-offsets-lg)
+	$(call if_changed_dep,as_o_S)
+
+$(asm-offsets-lg): $(obj)/asm-offsets.s
+	$(call cmd,offsets)
+
+$(obj)/asm-offsets.s: $(src)/asm-offsets.c
+	$(call if_changed_dep,cc_s_c)
 
-clean-files := hypervisor-blob.c hypervisor-raw
+clean-files := $(asm-offsets-lg) hypervisor-blob.c hypervisor-raw
Index: linux-2.6.20/arch/i386/lguest/asm-offsets.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20/arch/i386/lguest/asm-offsets.c	2007-02-16 16:03:10.410813500 +0100
@@ -0,0 +1,31 @@
+/*
+ * lguest's "private"
+ */
+
+#include <linux/sched.h>
+#include <asm/lguest.h>
+#include "../lguest/lg.h"
+
+#define DEFINE(sym, val) \
+	asm volatile("\n->" #sym " %0 " #val : : "i" (val))
+
+#define BLANK() asm volatile("\n->" : : )
+
+#define OFFSET(sym, str, mem) \
+	DEFINE(sym, offsetof(struct str, mem));
+
+void fell_lguest(void)
+{
+	BLANK();
+	OFFSET(LGUEST_DATA_irq_enabled, lguest_data, irq_enabled);
+	OFFSET(LGUEST_STATE_host_stackptr, lguest_state, host.stackptr);
+	OFFSET(LGUEST_STATE_host_pgdir, lguest_state, host.pgdir);
+	OFFSET(LGUEST_STATE_host_gdt, lguest_state, host.gdt);
+	OFFSET(LGUEST_STATE_host_idt, lguest_state, host.idt);
+	OFFSET(LGUEST_STATE_regs, lguest_state, regs);
+	OFFSET(LGUEST_STATE_gdt, lguest_state, gdt);
+	OFFSET(LGUEST_STATE_idt, lguest_state, idt);
+	OFFSET(LGUEST_STATE_gdt_table, lguest_state, gdt_table);
+	OFFSET(LGUEST_STATE_trapnum, lguest_state, regs.trapnum);
+	OFFSET(LGUEST_STATE_errcode, lguest_state, regs.errcode);
+}

Index: linux-2.6.20/arch/i386/kernel/asm-offsets.c
===================================================================
--- linux-2.6.20.orig/arch/i386/kernel/asm-offsets.c	2007-02-16 16:02:11.155110250 +0100
+++ linux-2.6.20/arch/i386/kernel/asm-offsets.c	2007-02-16 16:03:52.697456250 +0100
@@ -17,11 +17,7 @@
 #include <asm/elf.h>
 #include <asm/pda.h>
-#ifdef CONFIG_LGUEST_GUEST
-#include <asm/lguest.h>
-#include "../lguest/lg.h"
-#endif
 
 #define DEFINE(sym, val) \
-        asm volatile("\n->" #sym " %0 " #val : : "i" (val))
+	asm volatile("\n->" #sym " %0 " #val : : "i" (val))
 
 #define BLANK() asm volatile("\n->" : : )
@@ -104,5 +100,5 @@ void foo(void)
 
 	BLANK();
- 	OFFSET(PDA_cpu, i386_pda, cpu_number);
+	OFFSET(PDA_cpu, i386_pda, cpu_number);
 	OFFSET(PDA_pcurrent, i386_pda, pcurrent);
 
@@ -116,18 +112,3 @@ void foo(void)
 	OFFSET(PARAVIRT_read_cr0, paravirt_ops, read_cr0);
 #endif
-
-#ifdef CONFIG_LGUEST_GUEST
-	BLANK();
-	OFFSET(LGUEST_DATA_irq_enabled, lguest_data, irq_enabled);
-	OFFSET(LGUEST_STATE_host_stackptr, lguest_state, host.stackptr);
-	OFFSET(LGUEST_STATE_host_pgdir, lguest_state, host.pgdir);
-	OFFSET(LGUEST_STATE_host_gdt, lguest_state, host.gdt);
-	OFFSET(LGUEST_STATE_host_idt, lguest_state, host.idt);
-	OFFSET(LGUEST_STATE_regs, lguest_state, regs);
-	OFFSET(LGUEST_STATE_gdt, lguest_state, gdt);
-	OFFSET(LGUEST_STATE_idt, lguest_state, idt);
-	OFFSET(LGUEST_STATE_gdt_table, lguest_state, gdt_table);
-	OFFSET(LGUEST_STATE_trapnum, lguest_state, regs.trapnum);
-	OFFSET(LGUEST_STATE_errcode, lguest_state, regs.errcode);
-#endif
 }
Index: linux-2.6.20/arch/i386/lguest/hypervisor.S
===================================================================
--- linux-2.6.20.orig/arch/i386/lguest/hypervisor.S	2007-02-16 16:09:02.620825250 +0100
+++ linux-2.6.20/arch/i386/lguest/hypervisor.S	2007-02-16 16:13:19.548882250 +0100
@@ -2,5 +2,5 @@
    Layout is: default_idt_entries (1k), then switch_to_guest entry point. */
 #include <linux/linkage.h>
-#include <asm/asm-offsets.h>
+#include <asm/asm-offsets-lg.h>
 #include "lg.h"
 
@@ -104,5 +104,5 @@ switch_to_guest:
 	popl	%es;							\
 	popl	%ss
-	
+
 /* Return to run_guest_once. */
 return_to_host:
@@ -151,5 +151,5 @@ deliver_to_host_with_errcode:
  .endr
 .endm
-	
+
 /* We intercept every interrupt, because we may need to switch back to
  * host.  Unfortunately we can't tell them apart except by entry
@@ -158,5 +158,5 @@ deliver_to_host_with_errcode:
 irq_stubs:
 .data
-default_idt_entries:	
+default_idt_entries:
 .text
 	IRQ_STUBS 0 1 return_to_host		/* First two traps */

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [pp] kbuild: asm-offsets generalized
  2007-02-16 15:55                       ` [pp] kbuild: lguest with private asm-offsets (and some bloat) Oleg Verych
@ 2007-02-16 15:59                         ` Oleg Verych
  2007-02-16 18:56                           ` Sam Ravnborg
  2007-04-01 20:42                           ` Sam Ravnborg
  0 siblings, 2 replies; 57+ messages in thread
From: Oleg Verych @ 2007-02-16 15:59 UTC (permalink / raw)
  To: LKML; +Cc: Andi Kleen, Andrew Morton, Sam Ravnborg

> > 
> > Proposition will follow.
> > 
> []
> 
> [patch proposition] kbuild: lguest with private asm-offsets
[] 
>  * needs "asm-offsets magic demystified, generalized".
[] 

[patch proposition] kbuild: asm-offsets generalized

 * scripts/mkCconstants:
   - asm-offsets magic demystified, generalized,

 * (hopefully) more readable sed scripts,

 * top Kbuild may be updated...

 * file needs `chmod u+x`, i don't know, how it's done in patch(1).

pp-by: Oleg Verych
---
 scripts/mkCconstants           |  50 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 50 insertions(+), 0 deletions(-)

Index: linux-2.6.20/scripts/mkCconstants
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20/scripts/mkCconstants	2007-02-16 15:33:51.696900750 +0100
@@ -0,0 +1,50 @@
+#!/bin/sh
+
+# Input file, where values of interest are stored is produced by
+# `cmd_cc_s_c'. It yields calculation of constants, needed in
+# assembler modules. Output is a suitable header file.
+#
+# $1 - input filename;
+# $2 - output filename;
+# $3 - header file format: "normal" (default), "mips".
+
+set -e
+
+[ -z "$1" ] || [ -z "$2" ] && exit 1
+
+case $3 in
+    mips)
+	SED_SCRIPT='
+/^@@@/{
+s/^@@@//;
+s/ \#.*\$//;
+p;
+}'
+	;;
+    normal | *)
+	SED_SCRIPT='
+/^->/{
+s:^->\([^ ]*\) [\$#]*\([^ ]*\) \(.*\):#define \1 \2 /* \3 */:;
+s:->::;
+p;
+}'
+	;;
+esac
+
+cat << "EOF"  > $2
+#ifndef __ASM_OFFSETS_H__
+#define __ASM_OFFSETS_H__
+
+/*
+ * This file was generated by scripts/mkCconstants
+ */
+
+EOF
+
+sed -ne "$SED_SCRIPT" $1 >> $2
+
+cat << "EOF" >> $2
+
+#endif
+
+EOF

--
-o--=O`C  info emacs : not found
 #oo'L O  info make  : not found
<___=E M  man gcc    : not found

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [pp] kbuild: asm-offsets generalized
  2007-02-16 15:59                         ` [pp] kbuild: asm-offsets generalized Oleg Verych
@ 2007-02-16 18:56                           ` Sam Ravnborg
  2007-02-16 21:56                             ` Oleg Verych
  2007-04-01 20:42                           ` Sam Ravnborg
  1 sibling, 1 reply; 57+ messages in thread
From: Sam Ravnborg @ 2007-02-16 18:56 UTC (permalink / raw)
  To: Oleg Verych; +Cc: LKML, Andi Kleen, Andrew Morton

On Fri, Feb 16, 2007 at 04:59:29PM +0100, Oleg Verych wrote:
> > > 
> > > Proposition will follow.
> > > 
> > []
> > 
> > [patch proposition] kbuild: lguest with private asm-offsets
> [] 
> >  * needs "asm-offsets magic demystified, generalized".
> [] 

To make asm-offset generic I had in mind something like the
following.
It uses the currect functionality, and allows for architecture override
as needed - but default not used. And architecture override is kept in a
architecture specific location.

	Sam

commit df95bb04b04ff2f64805dfa8459099ffe469c8a5
Author: Sam Ravnborg <sam@uranus.ravnborg.org>
Date:   Fri Feb 16 19:51:29 2007 +0100

    kbuild: Make asm-offset generic
    
    Signed-off-by: Sam Ravnborg <sam@ravnborg.org>

diff --git a/include/asm-generic/asm-offset b/include/asm-generic/asm-offset
new file mode 100644
index 0000000..5e5629e
--- /dev/null
+++ b/include/asm-generic/asm-offset
@@ -0,0 +1,30 @@
+#
+# Support generating constant useable from assembler but defined on C-level
+#
+# See usage in top-level Kbuild file
+
+# Default sed regexp - multiline due to syntax constraints
+define sed-y
+	"/^->/{s:^->\([^ ]*\) [\$$#]*\([^ ]*\) \(.*\):#define \1 \2 /* \3 */:; s:->::; p;}"
+endef
+
+# let architectures override the sed expression as needed
+-include include/asm/asm-offset
+
+quiet_cmd_offsets = GEN     $@
+define cmd_offsets
+	(set -e; \
+	 echo "#ifndef __ASM_OFFSETS_H__"; \
+	 echo "#define __ASM_OFFSETS_H__"; \
+	 echo "/*"; \
+	 echo " * DO NOT MODIFY."; \
+	 echo " *"; \
+	 echo " * This file was generated by Kbuild"; \
+	 echo " *"; \
+	 echo " */"; \
+	 echo ""; \
+	 sed -ne $(sed-y) $<; \
+	 echo ""; \
+	 echo "#endif" ) > $@
+endef
+
diff --git a/include/asm-mips/asm-offset b/include/asm-mips/asm-offset
new file mode 100644
index 0000000..b2eb959
--- /dev/null
+++ b/include/asm-mips/asm-offset
@@ -0,0 +1,4 @@
+# Override default sed for MIPS when generating asm-offset
+sed-$(CONFIG_MIPS) := "/^@@@/{s/^@@@//; s/ \#.*\$$//; p;}"
+
+
diff --git a/Kbuild b/Kbuild
index 0451f69..0a32f5f 100644
--- a/Kbuild
+++ b/Kbuild
@@ -1,5 +1,8 @@
 #
 # Kbuild for top-level directory of the kernel
+
+
+
 # This file takes care of the following:
 # 1) Generate asm-offsets.h
 
@@ -7,36 +10,14 @@ #####
 # 1) Generate asm-offsets.h
 #
 
+include include/asm-generic/asm-offset
+
 offsets-file := include/asm-$(ARCH)/asm-offsets.h
 
 always  := $(offsets-file)
 targets := $(offsets-file)
 targets += arch/$(ARCH)/kernel/asm-offsets.s
 
-# Default sed regexp - multiline due to syntax constraints
-define sed-y
-	"/^->/{s:^->\([^ ]*\) [\$$#]*\([^ ]*\) \(.*\):#define \1 \2 /* \3 */:; s:->::; p;}"
-endef
-# Override default regexp for specific architectures
-sed-$(CONFIG_MIPS) := "/^@@@/{s/^@@@//; s/ \#.*\$$//; p;}"
-
-quiet_cmd_offsets = GEN     $@
-define cmd_offsets
-	(set -e; \
-	 echo "#ifndef __ASM_OFFSETS_H__"; \
-	 echo "#define __ASM_OFFSETS_H__"; \
-	 echo "/*"; \
-	 echo " * DO NOT MODIFY."; \
-	 echo " *"; \
-	 echo " * This file was generated by Kbuild"; \
-	 echo " *"; \
-	 echo " */"; \
-	 echo ""; \
-	 sed -ne $(sed-y) $<; \
-	 echo ""; \
-	 echo "#endif" ) > $@
-endef
-
 # We use internal kbuild rules to avoid the "is up to date" message from make
 arch/$(ARCH)/kernel/asm-offsets.s: arch/$(ARCH)/kernel/asm-offsets.c FORCE
 	$(Q)mkdir -p $(dir $@)

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [pp] kbuild: asm-offsets generalized
  2007-02-16 18:56                           ` Sam Ravnborg
@ 2007-02-16 21:56                             ` Oleg Verych
  2007-02-17  4:43                               ` Rusty Russell
  0 siblings, 1 reply; 57+ messages in thread
From: Oleg Verych @ 2007-02-16 21:56 UTC (permalink / raw)
  To: Sam Ravnborg; +Cc: LKML, Andi Kleen, Andrew Morton, Rusty Russell

Hallo.

On Fri, Feb 16, 2007 at 07:56:35PM +0100, Sam Ravnborg wrote:
> On Fri, Feb 16, 2007 at 04:59:29PM +0100, Oleg Verych wrote:
> > > > 
> > > > Proposition will follow.
> > > > 
> > > []
> > > 
> > > [patch proposition] kbuild: lguest with private asm-offsets
> > [] 
> > >  * needs "asm-offsets magic demystified, generalized".
> > [] 
> 
> To make asm-offset generic I had in mind something like the
> following.

Then i misunderstood, what you mean, sorry.

> It uses the currect functionality, and allows for architecture override
> as needed - but default not used. And architecture override is kept in a
> architecture specific location.
> 
> 	Sam
> 
> commit df95bb04b04ff2f64805dfa8459099ffe469c8a5
> Author: Sam Ravnborg <sam@uranus.ravnborg.org>
> Date:   Fri Feb 16 19:51:29 2007 +0100
> 
>     kbuild: Make asm-offset generic
>     
>     Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
> 
> diff --git a/include/asm-generic/asm-offset b/include/asm-generic/asm-offset
> new file mode 100644
> index 0000000..5e5629e
> --- /dev/null
> +++ b/include/asm-generic/asm-offset
> @@ -0,0 +1,30 @@
> +#
> +# Support generating constant useable from assembler but defined on C-level
> +#
> +# See usage in top-level Kbuild file
> +
> +# Default sed regexp - multiline due to syntax constraints
> +define sed-y
> +	"/^->/{s:^->\([^ ]*\) [\$$#]*\([^ ]*\) \(.*\):#define \1 \2 /* \3 */:; s:->::; p;}"
> +endef
> +
> +# let architectures override the sed expression as needed
> +-include include/asm/asm-offset
> +
> +quiet_cmd_offsets = GEN     $@
> +define cmd_offsets
> +	(set -e; \
> +	 echo "#ifndef __ASM_OFFSETS_H__"; \
> +	 echo "#define __ASM_OFFSETS_H__"; \
> +	 echo "/*"; \
> +	 echo " * DO NOT MODIFY."; \
> +	 echo " *"; \
> +	 echo " * This file was generated by Kbuild"; \
> +	 echo " *"; \
> +	 echo " */"; \
> +	 echo ""; \
> +	 sed -ne $(sed-y) $<; \
> +	 echo ""; \
> +	 echo "#endif" ) > $@
> +endef
> +
> diff --git a/include/asm-mips/asm-offset b/include/asm-mips/asm-offset
> new file mode 100644
> index 0000000..b2eb959
> --- /dev/null
> +++ b/include/asm-mips/asm-offset
> @@ -0,0 +1,4 @@
> +# Override default sed for MIPS when generating asm-offset
> +sed-$(CONFIG_MIPS) := "/^@@@/{s/^@@@//; s/ \#.*\$$//; p;}"
> +
> +
> diff --git a/Kbuild b/Kbuild
> index 0451f69..0a32f5f 100644
> --- a/Kbuild
> +++ b/Kbuild
> @@ -1,5 +1,8 @@
>  #
>  # Kbuild for top-level directory of the kernel
> +
> +
> +
>  # This file takes care of the following:
>  # 1) Generate asm-offsets.h
>  
> @@ -7,36 +10,14 @@ #####
>  # 1) Generate asm-offsets.h
>  #
>  
> +include include/asm-generic/asm-offset
> +
>  offsets-file := include/asm-$(ARCH)/asm-offsets.h
>  
>  always  := $(offsets-file)
>  targets := $(offsets-file)
>  targets += arch/$(ARCH)/kernel/asm-offsets.s
>  
> -# Default sed regexp - multiline due to syntax constraints
> -define sed-y
> -	"/^->/{s:^->\([^ ]*\) [\$$#]*\([^ ]*\) \(.*\):#define \1 \2 /* \3 */:; s:->::; p;}"
> -endef
> -# Override default regexp for specific architectures
> -sed-$(CONFIG_MIPS) := "/^@@@/{s/^@@@//; s/ \#.*\$$//; p;}"
> -
> -quiet_cmd_offsets = GEN     $@
> -define cmd_offsets
> -	(set -e; \
> -	 echo "#ifndef __ASM_OFFSETS_H__"; \
> -	 echo "#define __ASM_OFFSETS_H__"; \
> -	 echo "/*"; \
> -	 echo " * DO NOT MODIFY."; \
> -	 echo " *"; \
> -	 echo " * This file was generated by Kbuild"; \
> -	 echo " *"; \
> -	 echo " */"; \
> -	 echo ""; \
> -	 sed -ne $(sed-y) $<; \
> -	 echo ""; \
> -	 echo "#endif" ) > $@
> -endef
> -
>  # We use internal kbuild rules to avoid the "is up to date" message from make
>  arch/$(ARCH)/kernel/asm-offsets.s: arch/$(ARCH)/kernel/asm-offsets.c FORCE
>  	$(Q)mkdir -p $(dir $@)

Thanks.

____

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [pp] kbuild: asm-offsets generalized
  2007-02-16 21:56                             ` Oleg Verych
@ 2007-02-17  4:43                               ` Rusty Russell
  2007-02-17  5:33                                 ` Oleg Verych
  0 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-02-17  4:43 UTC (permalink / raw)
  To: Oleg Verych; +Cc: Sam Ravnborg, LKML, Andi Kleen, Andrew Morton

On Fri, 2007-02-16 at 22:56 +0100, Oleg Verych wrote:
> Hallo.

lguest parts look good though!

Rusty.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [pp] kbuild: asm-offsets generalized
  2007-02-17  4:43                               ` Rusty Russell
@ 2007-02-17  5:33                                 ` Oleg Verych
  0 siblings, 0 replies; 57+ messages in thread
From: Oleg Verych @ 2007-02-17  5:33 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Sam Ravnborg, LKML, Andi Kleen, Andrew Morton

On Sat, Feb 17, 2007 at 03:43:49PM +1100, Rusty Russell wrote:
> On Fri, 2007-02-16 at 22:56 +0100, Oleg Verych wrote:
> > Hallo.
> 
> lguest parts look good though!

Thanks.

Well, then what about my way of doing generalization?

I.e. one-file shell script-let, rather many indistinguishable GNU make
files throughout source tree?

Sam's approach also offers using of CONGIG* options. But this has two
issues (for me, of course):

* lguest is in i386 arch, *top* Kbuild doing this for specific ARCH
(include/asm/);

* lguest wants to have private for-asm constants generated, unless CONFIG
is set, kbuild never will reach that directory.

Finally (my favorite). While i was told, that i'm doing more obfusticated
solutions, i doubt i did now: script-let makes magic more readable, and
GNU make's complications, like `$$' and various whitespace issues, do not
influence.

One thing is, that

+quiet_cmd_offsets = GEN     $@
+      cmd_offsets = $(srctree)/scripts/mkCconstants $< $@
+

can be moved in Kbuild.include with CONFIG* upgrade:
....
sed-$(CONFIG_MIPS) = mips
....
      cmd_offsets = $(srctree)/scripts/mkCconstants $< $@ $(sed-y)


Kind regards!
--
-o--=O`C  info emacs : not found
 #oo'L O  info make  : not found
<___=E M  man gcc    : not found

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [pp] kbuild: asm-offsets generalized
  2007-02-16 15:59                         ` [pp] kbuild: asm-offsets generalized Oleg Verych
  2007-02-16 18:56                           ` Sam Ravnborg
@ 2007-04-01 20:42                           ` Sam Ravnborg
  2007-04-01 21:08                             ` Oleg Verych
  1 sibling, 1 reply; 57+ messages in thread
From: Sam Ravnborg @ 2007-04-01 20:42 UTC (permalink / raw)
  To: Oleg Verych; +Cc: LKML, Andi Kleen, Andrew Morton

On Fri, Feb 16, 2007 at 04:59:29PM +0100, Oleg Verych wrote:
> > > 
> > > Proposition will follow.
> > > 
> > []
> > 
> > [patch proposition] kbuild: lguest with private asm-offsets
> [] 
> >  * needs "asm-offsets magic demystified, generalized".
> [] 
> 
> [patch proposition] kbuild: asm-offsets generalized
> 
>  * scripts/mkCconstants:
>    - asm-offsets magic demystified, generalized,
> 
>  * (hopefully) more readable sed scripts,
> 
>  * top Kbuild may be updated...
> 
>  * file needs `chmod u+x`, i don't know, how it's done in patch(1).

Can I ask you to provide a complete patch that replaces the current
asm-offset stuff with your more readable script version.

Name the script: mkasm-offset.sh to make a direct connection to 
the resulting .h file name.

Thanks,

	Sam


> 
> pp-by: Oleg Verych
> ---
>  scripts/mkCconstants           |  50 +++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 50 insertions(+), 0 deletions(-)
> 
> Index: linux-2.6.20/scripts/mkCconstants
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.20/scripts/mkCconstants	2007-02-16 15:33:51.696900750 +0100
> @@ -0,0 +1,50 @@
> +#!/bin/sh
> +
> +# Input file, where values of interest are stored is produced by
> +# `cmd_cc_s_c'. It yields calculation of constants, needed in
> +# assembler modules. Output is a suitable header file.
> +#
> +# $1 - input filename;
> +# $2 - output filename;
> +# $3 - header file format: "normal" (default), "mips".
> +
> +set -e
> +
> +[ -z "$1" ] || [ -z "$2" ] && exit 1
> +
> +case $3 in
> +    mips)
> +	SED_SCRIPT='
> +/^@@@/{
> +s/^@@@//;
> +s/ \#.*\$//;
> +p;
> +}'
> +	;;
> +    normal | *)
> +	SED_SCRIPT='
> +/^->/{
> +s:^->\([^ ]*\) [\$#]*\([^ ]*\) \(.*\):#define \1 \2 /* \3 */:;
> +s:->::;
> +p;
> +}'
> +	;;
> +esac
> +
> +cat << "EOF"  > $2
> +#ifndef __ASM_OFFSETS_H__
> +#define __ASM_OFFSETS_H__
> +
> +/*
> + * This file was generated by scripts/mkCconstants
> + */
> +
> +EOF
> +
> +sed -ne "$SED_SCRIPT" $1 >> $2
> +
> +cat << "EOF" >> $2
> +
> +#endif
> +
> +EOF
> 
> --
> -o--=O`C  info emacs : not found
>  #oo'L O  info make  : not found
> <___=E M  man gcc    : not found

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [pp] kbuild: asm-offsets generalized
  2007-04-01 21:08                             ` Oleg Verych
@ 2007-04-01 21:03                               ` Sam Ravnborg
  0 siblings, 0 replies; 57+ messages in thread
From: Sam Ravnborg @ 2007-04-01 21:03 UTC (permalink / raw)
  To: Oleg Verych; +Cc: LKML, Andi Kleen, Andrew Morton

On Sun, Apr 01, 2007 at 11:08:03PM +0200, Oleg Verych wrote:
> On Sun, Apr 01, 2007 at 10:42:03PM +0200, Sam Ravnborg wrote:
> > On Fri, Feb 16, 2007 at 04:59:29PM +0100, Oleg Verych wrote:
> []
> > > [patch proposition] kbuild: asm-offsets generalized
> []
> > >  * (hopefully) more readable sed scripts,
> > > 
> > >  * top Kbuild may be updated...
> []
> > 
> > Can I ask you to provide a complete patch that replaces the current
> > asm-offset stuff with your more readable script version.
>  
> OK, unless this is 1st April joke ;)
> 
> > >  * scripts/mkCconstants:
> > >    - asm-offsets magic demystified, generalized,
> 
> > Name the script: mkasm-offset.sh to make a direct connection to 
> > the resulting .h file name.
>  
> OK, nice name.
> 
> > > 
> > >  * file needs `chmod u+x`, i don't know, how it's done in patch(1).
> 
> BTW, i'll look at git-diff/git-apply for doing that, as was noted by Linus.
When patch passes me I can take care - so do not worry.
But tell me to do so if I should forget in the meantime.

PS. On vacation for a week from tomorrow so take your time.

	Sam

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [pp] kbuild: asm-offsets generalized
  2007-04-01 20:42                           ` Sam Ravnborg
@ 2007-04-01 21:08                             ` Oleg Verych
  2007-04-01 21:03                               ` Sam Ravnborg
  0 siblings, 1 reply; 57+ messages in thread
From: Oleg Verych @ 2007-04-01 21:08 UTC (permalink / raw)
  To: Sam Ravnborg; +Cc: LKML, Andi Kleen, Andrew Morton

On Sun, Apr 01, 2007 at 10:42:03PM +0200, Sam Ravnborg wrote:
> On Fri, Feb 16, 2007 at 04:59:29PM +0100, Oleg Verych wrote:
[]
> > [patch proposition] kbuild: asm-offsets generalized
[]
> >  * (hopefully) more readable sed scripts,
> > 
> >  * top Kbuild may be updated...
[]
> 
> Can I ask you to provide a complete patch that replaces the current
> asm-offset stuff with your more readable script version.
 
OK, unless this is 1st April joke ;)

> >  * scripts/mkCconstants:
> >    - asm-offsets magic demystified, generalized,

> Name the script: mkasm-offset.sh to make a direct connection to 
> the resulting .h file name.
 
OK, nice name.

> > 
> >  * file needs `chmod u+x`, i don't know, how it's done in patch(1).

BTW, i'll look at git-diff/git-apply for doing that, as was noted by Linus.
____

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2007-04-01 21:03 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-09  9:11 [PATCH 0/10] lguest Rusty Russell
2007-02-09  9:11 ` Rusty Russell
2007-02-09  9:14 ` [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler Rusty Russell
2007-02-09  9:15   ` [PATCH 2/10] lguest: Export symbols for lguest as a module Rusty Russell
2007-02-09  9:32     ` Andi Kleen
2007-02-09 12:06       ` Rusty Russell
2007-02-09 13:58         ` Andi Kleen
2007-02-10 11:39           ` Rusty Russell
2007-02-09  9:17   ` [PATCH 3/10] lguest: Expose get_futex_key, get_key_refs and drop_key_refs Rusty Russell
2007-02-09  9:18   ` [PATCH 4/10] lguest: Initialize esp0 properly all the time Rusty Russell
2007-02-09  9:19     ` [PATCH 5/10] Make hvc_console.c compile on non-PowerPC Rusty Russell
2007-02-09  9:19       ` Rusty Russell
2007-02-09  9:20       ` [PATCH 6/10] lguest code: the little linux hypervisor Rusty Russell
2007-02-09  9:22         ` [PATCH 7/10] lguest: Simple lguest network driver Rusty Russell
2007-02-09  9:23           ` [PATCH 8/10] lguest: console driver Rusty Russell
2007-02-09  9:24             ` [PATCH 9/10] lguest: block driver Rusty Russell
2007-02-09  9:25               ` [PATCH 10/10] lguest: documentatation including example launcher Rusty Russell
2007-02-09  9:35         ` [PATCH 6/10] lguest code: the little linux hypervisor Andrew Morton
2007-02-09 11:00           ` Rusty Russell
2007-02-09 11:13             ` Zachary Amsden
2007-02-09 11:50               ` Andi Kleen
2007-02-09 11:54                 ` Zachary Amsden
2007-02-09 11:57                   ` Andi Kleen
2007-02-09 12:08                     ` Zachary Amsden
2007-02-09 22:29                 ` David Miller
2007-02-09 10:09         ` Andi Kleen
2007-02-09 12:39           ` Rusty Russell
2007-02-09 13:57             ` Andi Kleen
2007-02-09 15:01               ` Rusty Russell
2007-02-09 14:17             ` Sam Ravnborg
2007-02-09 15:23               ` Rusty Russell
2007-02-12 13:34                 ` [q] kbuild for private asm-offsets (Re: [PATCH 6/10] lguest code: the little linux hypervisor.) Oleg Verych
2007-02-12 17:24                   ` Andi Kleen
2007-02-12 21:41                   ` Sam Ravnborg
2007-02-12 23:41                   ` Rusty Russell
2007-02-13  3:10                     ` Oleg Verych
2007-02-13  3:10                       ` Oleg Verych
2007-02-16 15:55                       ` [pp] kbuild: lguest with private asm-offsets (and some bloat) Oleg Verych
2007-02-16 15:59                         ` [pp] kbuild: asm-offsets generalized Oleg Verych
2007-02-16 18:56                           ` Sam Ravnborg
2007-02-16 21:56                             ` Oleg Verych
2007-02-17  4:43                               ` Rusty Russell
2007-02-17  5:33                                 ` Oleg Verych
2007-04-01 20:42                           ` Sam Ravnborg
2007-04-01 21:08                             ` Oleg Verych
2007-04-01 21:03                               ` Sam Ravnborg
2007-02-09 10:55       ` [PATCH 6a/10] lguest: Config and headers Rusty Russell
2007-02-09 10:56         ` [PATCH 6b/10] lguest: the host code (lg.ko) Rusty Russell
2007-02-09 10:57           ` [PATCH 6c/10] lguest: the guest code Rusty Russell
2007-02-09 10:58             ` [PATCH 6d/10] lguest: the Makefiles Rusty Russell
2007-02-09 17:06             ` [PATCH 6c/10] lguest: the guest code Len Brown
2007-02-09 17:14               ` James Morris
2007-02-09 17:49                 ` Len Brown
2007-02-09 23:48                   ` [PATCH 11/10] lguest: use disable_acpi() Rusty Russell
2007-02-09  9:31   ` [PATCH 1/10] lguest: Don't rely on last-linked fallthru when no paravirt handler Andi Kleen
2007-02-09 11:52     ` Rusty Russell
2007-02-09 20:49       ` Jeremy Fitzhardinge

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.