linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/11] KVM: Add asynchronous page fault for PV guest.
@ 2009-11-01 11:56 Gleb Natapov
  2009-11-01 11:56 ` [PATCH 01/11] Add shared memory hypercall to PV Linux guest Gleb Natapov
                   ` (10 more replies)
  0 siblings, 11 replies; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel

KVM virtualizes guest memory by means of shadow pages or HW assistance
such as NPT/EPT. Not all memory used by a guest is mapped into the guest
address space or even present in host memory at any given time. When a
vcpu tries to access a memory page that is not mapped into the guest
address space, KVM is notified about it. KVM maps the page into the guest
address space and resumes vcpu execution. If the page has been swapped
out of host memory, vcpu execution is suspended until the page is swapped
back in. This is inefficient since the vcpu could do other work (run
another task or serve interrupts) while the page is being swapped in.

To overcome this inefficiency this patch series implements "asynchronous
page faults" for paravirtualized KVM guests. If a page that a vcpu is
trying to access is swapped out, KVM sends an async PF to the vcpu and
continues vcpu execution. The requested page is swapped in by another
thread in parallel. When the vcpu gets the async PF it puts the task that
faulted to sleep until a "wake up" interrupt is delivered. When the page
is brought back into host memory KVM sends the "wake up" interrupt and the
guest task resumes execution.
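
For reference, the guest side of this protocol is small; a condensed
sketch of the PV fault handler from patch 3 (simplified, per-cpu setup
and error handling omitted) looks like this:

int kvm_handle_pf(struct pt_regs *regs, unsigned long error_code)
{
        struct kvm_vcpu_pv_shm *pv_shm =
                per_cpu(kvm_vcpu_pv_shm, smp_processor_id());
        u64 reason, token;

        if (!pv_shm)
                return 0;               /* not ours, take the normal #PF path */

        reason = pv_shm->reason;        /* written by the host before injection */
        pv_shm->reason = 0;
        token = pv_shm->param;

        switch (reason) {
        case KVM_PV_REASON_PAGE_NP:     /* page missing on the host: sleep */
                apf_task_wait(current, token);
                return 1;
        case KVM_PV_REASON_PAGE_READY:  /* page arrived: wake the sleeper */
                apf_task_wake(token);
                return 1;
        default:
                return 0;
        }
}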

Gleb Natapov (11):
  Add shared memory hypercall to PV Linux guest.
  Add "handle page fault" PV helper.
  Handle asynchronous page fault in a PV guest.
  Export __get_user_pages_fast.
  Add get_user_pages() variant that fails if major fault is required.
  Inject asynchronous page fault into a guest if page is swapped out.
  Retry fault before vmentry
  Add "wait for page" hypercall.
  Maintain preemptability count even for !CONFIG_PREEMPT kernels
  Handle async PF in non preemptable context.
  Send async PF when guest is not in userspace too.

 arch/x86/include/asm/kvm_host.h       |   29 ++++-
 arch/x86/include/asm/kvm_para.h       |   14 ++
 arch/x86/include/asm/paravirt.h       |    7 +
 arch/x86/include/asm/paravirt_types.h |    4 +
 arch/x86/kernel/kvm.c                 |  210 ++++++++++++++++++++++++
 arch/x86/kernel/paravirt.c            |    8 +
 arch/x86/kernel/paravirt_patch_32.c   |    8 +
 arch/x86/kernel/paravirt_patch_64.c   |    7 +
 arch/x86/kernel/setup.c               |    1 +
 arch/x86/kernel/smpboot.c             |    3 +
 arch/x86/kvm/mmu.c                    |  284 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmutrace.h               |   79 +++++++++
 arch/x86/kvm/paging_tmpl.h            |   44 +++++-
 arch/x86/kvm/x86.c                    |   97 +++++++++++-
 arch/x86/mm/fault.c                   |    3 +
 arch/x86/mm/gup.c                     |    2 +
 fs/ncpfs/mmap.c                       |    2 +
 include/linux/hardirq.h               |    6 +-
 include/linux/kvm.h                   |    1 +
 include/linux/kvm_para.h              |    5 +
 include/linux/mm.h                    |    5 +
 include/linux/preempt.h               |   22 ++-
 kernel/sched.c                        |    6 -
 lib/kernel_lock.c                     |    1 +
 mm/filemap.c                          |    3 +
 mm/memory.c                           |   31 ++++-
 mm/shmem.c                            |    8 +-
 27 files changed, 855 insertions(+), 35 deletions(-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 01/11] Add shared memory hypercall to PV Linux guest.
  2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
@ 2009-11-01 11:56 ` Gleb Natapov
  2009-11-02  4:27   ` Rik van Riel
  2009-11-02 12:18   ` Avi Kivity
  2009-11-01 11:56 ` [PATCH 02/11] Add "handle page fault" PV helper Gleb Natapov
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel

Add a hypercall that allows the guest and host to set up per-cpu shared
memory.
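
The ABI implemented below is: a0 = guest physical address of the per-cpu
area, a1 = its size, a2 = KVM_PV_SHM_VERSION; a size of 0 unregisters a
previously registered area. A minimal guest-side registration sketch
(condensed from kvm_guest_cpu_init() below; the boot-cpu bootmem path and
error unwinding are omitted):

        struct kvm_vcpu_pv_shm *pv_shm = kmalloc(sizeof(*pv_shm), GFP_ATOMIC);

        memset(pv_shm, 0, sizeof(*pv_shm));
        if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
                pv_shm->features |= KVM_PV_SHM_FEATURES_ASYNC_PF;

        /* a0 = gpa of the area, a1 = size, a2 = ABI version */
        if (kvm_hypercall3(KVM_HC_SETUP_SHM, __pa(pv_shm),
                           sizeof(*pv_shm), KVM_PV_SHM_VERSION))
                kfree(pv_shm);  /* host rejected the registration */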

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    3 +
 arch/x86/include/asm/kvm_para.h |   11 +++++
 arch/x86/kernel/kvm.c           |   82 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/setup.c         |    1 +
 arch/x86/kernel/smpboot.c       |    3 +
 arch/x86/kvm/x86.c              |   70 +++++++++++++++++++++++++++++++++
 include/linux/kvm.h             |    1 +
 include/linux/kvm_para.h        |    4 ++
 8 files changed, 175 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26a74b7..2d1f526 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -374,6 +374,9 @@ struct kvm_vcpu_arch {
 	/* used for guest single stepping over the given code position */
 	u16 singlestep_cs;
 	unsigned long singlestep_rip;
+
+	struct kvm_vcpu_pv_shm *pv_shm;
+	struct page *pv_shm_page;
 };
 
 struct kvm_mem_alias {
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c584076..90708b7 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -15,6 +15,7 @@
 #define KVM_FEATURE_CLOCKSOURCE		0
 #define KVM_FEATURE_NOP_IO_DELAY	1
 #define KVM_FEATURE_MMU_OP		2
+#define KVM_FEATURE_ASYNC_PF		3
 
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
@@ -47,6 +48,16 @@ struct kvm_mmu_op_release_pt {
 	__u64 pt_phys;
 };
 
+#define KVM_PV_SHM_VERSION 1
+
+#define KVM_PV_SHM_FEATURES_ASYNC_PF		(1 << 0)
+
+struct kvm_vcpu_pv_shm {
+	__u64 features;
+	__u64 reason;
+	__u64 param;
+};
+
 #ifdef __KERNEL__
 #include <asm/processor.h>
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 63b0ec8..d03f33c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -27,7 +27,11 @@
 #include <linux/mm.h>
 #include <linux/highmem.h>
 #include <linux/hardirq.h>
+#include <linux/bootmem.h>
+#include <linux/notifier.h>
+#include <linux/reboot.h>
 #include <asm/timer.h>
+#include <asm/cpu.h>
 
 #define MMU_QUEUE_SIZE 1024
 
@@ -37,6 +41,7 @@ struct kvm_para_state {
 };
 
 static DEFINE_PER_CPU(struct kvm_para_state, para_state);
+static DEFINE_PER_CPU(struct kvm_vcpu_pv_shm *, kvm_vcpu_pv_shm);
 
 static struct kvm_para_state *kvm_para_state(void)
 {
@@ -50,6 +55,17 @@ static void kvm_io_delay(void)
 {
 }
 
+static void kvm_end_context_switch(struct task_struct *next)
+{
+	struct kvm_vcpu_pv_shm *pv_shm =
+		per_cpu(kvm_vcpu_pv_shm, smp_processor_id());
+
+	if (!pv_shm)
+		return;
+
+	pv_shm->current_task = (u64)next;
+}
+
 static void kvm_mmu_op(void *buffer, unsigned len)
 {
 	int r;
@@ -231,10 +247,76 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+static void kvm_pv_unregister_shm(void *unused)
+{
+	if (per_cpu(kvm_vcpu_pv_shm, smp_processor_id()) == NULL)
+		return;
+
+	kvm_hypercall3(KVM_HC_SETUP_SHM, 0, 0, KVM_PV_SHM_VERSION);
+	printk(KERN_INFO"Unregister pv shared memory for cpu %d\n",
+	       smp_processor_id());
+
+}
+
+static int kvm_pv_reboot_notify(struct notifier_block *nb,
+				unsigned long code, void *unused)
+{
+	if (code == SYS_RESTART)
+		on_each_cpu(kvm_pv_unregister_shm, NULL, 1);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block kvm_pv_reboot_nb = {
+        .notifier_call = kvm_pv_reboot_notify,
+};
+
 void __init kvm_guest_init(void)
 {
 	if (!kvm_para_available())
 		return;
 
 	paravirt_ops_setup();
+	register_reboot_notifier(&kvm_pv_reboot_nb);
+}
+
+void __cpuinit kvm_guest_cpu_init(void)
+{
+	int r;
+	unsigned long a0, a1, a2;
+	struct kvm_vcpu_pv_shm *pv_shm;
+
+	if (!kvm_para_available())
+		return;
+
+	if (smp_processor_id() == boot_cpu_id)
+		pv_shm = alloc_bootmem(sizeof(*pv_shm));
+	else
+		pv_shm = kmalloc(sizeof(*pv_shm), GFP_ATOMIC);
+
+	if (!pv_shm)
+		return;
+
+	memset(pv_shm, 0, sizeof(*pv_shm));
+
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
+		pv_shm->features |= KVM_PV_SHM_FEATURES_ASYNC_PF;
+
+	per_cpu(kvm_vcpu_pv_shm, smp_processor_id()) = pv_shm;
+	a0 = __pa(pv_shm);
+	a1 = sizeof(*pv_shm);
+	a2 = KVM_PV_SHM_VERSION;
+	r = kvm_hypercall3(KVM_HC_SETUP_SHM, a0, a1, a2);
+
+	if (!r) {
+		printk(KERN_INFO"Setup pv shared memory for cpu %d\n",
+		       smp_processor_id());
+		return;
+	}
+
+	if (smp_processor_id() == boot_cpu_id)
+		free_bootmem(__pa(pv_shm), sizeof(*pv_shm));
+	else
+		kfree(pv_shm);
+
+	per_cpu(kvm_vcpu_pv_shm, smp_processor_id()) = NULL;
 }
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index e09f0e2..1c2f8dd 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1007,6 +1007,7 @@ void __init setup_arch(char **cmdline_p)
 	probe_nr_irqs_gsi();
 
 	kvm_guest_init();
+	kvm_guest_cpu_init();
 
 	e820_reserve_resources();
 	e820_mark_nosave_regions(max_low_pfn);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 565ebc6..5599098 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -65,6 +65,7 @@
 #include <asm/setup.h>
 #include <asm/uv/uv.h>
 #include <linux/mc146818rtc.h>
+#include <linux/kvm_para.h>
 
 #include <asm/smpboot_hooks.h>
 
@@ -321,6 +322,8 @@ notrace static void __cpuinit start_secondary(void *unused)
 	ipi_call_unlock();
 	per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
 
+	kvm_guest_cpu_init();
+
 	/* enable local interrupts */
 	local_irq_enable();
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2ef3906..c177933 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1342,6 +1342,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_SET_IDENTITY_MAP_ADDR:
 	case KVM_CAP_XEN_HVM:
 	case KVM_CAP_ADJUST_CLOCK:
+	case KVM_CAP_ASYNC_PF:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -3371,6 +3372,68 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_emulate_halt);
 
+static void kvm_pv_release_shm(struct kvm_vcpu *vcpu)
+{
+	if (!vcpu->arch.pv_shm_page)
+		return;
+
+	kunmap(vcpu->arch.pv_shm_page);
+	put_page(vcpu->arch.pv_shm_page);
+	vcpu->arch.pv_shm_page = NULL;
+	vcpu->arch.pv_shm = NULL;
+}
+
+static int kvm_pv_setup_shm(struct kvm_vcpu *vcpu, unsigned long gpa,
+			    unsigned long size, unsigned long version,
+			    unsigned long *ret)
+{
+	int r;
+	unsigned long addr;
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+	int offset = offset_in_page(gpa);
+	struct mm_struct *mm = current->mm;
+	void *p;
+
+	*ret = -KVM_EINVAL;
+
+	if (size == 0 && vcpu->arch.pv_shm != NULL &&
+	    version == KVM_PV_SHM_VERSION) {
+		kvm_pv_release_shm(vcpu);
+		goto out;
+	}
+
+	if (vcpu->arch.pv_shm != NULL || size != sizeof(*vcpu->arch.pv_shm) ||
+	    size > PAGE_SIZE || version != KVM_PV_SHM_VERSION)
+		return -EINVAL;
+
+	*ret = -KVM_EFAULT;
+
+	addr = gfn_to_hva(vcpu->kvm, gfn);
+	if (kvm_is_error_hva(addr))
+		return -EFAULT;
+
+	/* pin page with pv shared memory */
+	down_read(&mm->mmap_sem);
+	r = get_user_pages(current, mm, addr, 1, 1, 0, &vcpu->arch.pv_shm_page,
+			   NULL);
+	up_read(&mm->mmap_sem);
+	if (r != 1)
+		return -EFAULT;
+
+	p = kmap(vcpu->arch.pv_shm_page);
+	if (!p) {
+		put_page(vcpu->arch.pv_shm_page);
+		vcpu->arch.pv_shm_page = NULL;
+		return -ENOMEM;
+	}
+
+	vcpu->arch.pv_shm = p + offset;
+
+out:
+	*ret = 0;
+	return 0;
+}
+
 static inline gpa_t hc_gpa(struct kvm_vcpu *vcpu, unsigned long a0,
 			   unsigned long a1)
 {
@@ -3413,6 +3476,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	case KVM_HC_MMU_OP:
 		r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
 		break;
+	case KVM_HC_SETUP_SHM:
+		r = kvm_pv_setup_shm(vcpu, a0, a1, a2, &ret);
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;
@@ -4860,6 +4926,8 @@ free_vcpu:
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
+	kvm_pv_release_shm(vcpu);
+
 	vcpu_load(vcpu);
 	kvm_mmu_unload(vcpu);
 	vcpu_put(vcpu);
@@ -4877,6 +4945,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.dr6 = DR6_FIXED_1;
 	vcpu->arch.dr7 = DR7_FIXED_1;
 
+	kvm_pv_release_shm(vcpu);
+
 	return kvm_x86_ops->vcpu_reset(vcpu);
 }
 
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 6ed1a12..2fec4a2 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -440,6 +440,7 @@ struct kvm_ioeventfd {
 #define KVM_CAP_XEN_HVM 38
 #endif
 #define KVM_CAP_ADJUST_CLOCK 39
+#define KVM_CAP_ASYNC_PF 40
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index d731092..1c37495 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -14,9 +14,11 @@
 #define KVM_EFAULT		EFAULT
 #define KVM_E2BIG		E2BIG
 #define KVM_EPERM		EPERM
+#define KVM_EINVAL		EINVAL
 
 #define KVM_HC_VAPIC_POLL_IRQ		1
 #define KVM_HC_MMU_OP			2
+#define KVM_HC_SETUP_SHM		3
 
 /*
  * hypercalls use architecture specific
@@ -26,8 +28,10 @@
 #ifdef __KERNEL__
 #ifdef CONFIG_KVM_GUEST
 void __init kvm_guest_init(void);
+void __cpuinit kvm_guest_cpu_init(void);
 #else
 #define kvm_guest_init() do { } while (0)
+#define kvm_guest_cpu_init() do { } while (0)
 #endif
 
 static inline int kvm_para_has_feature(unsigned int feature)
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
  2009-11-01 11:56 ` [PATCH 01/11] Add shared memory hypercall to PV Linux guest Gleb Natapov
@ 2009-11-01 11:56 ` Gleb Natapov
  2009-11-02  9:22   ` Ingo Molnar
  2009-11-01 11:56 ` [PATCH 03/11] Handle asynchronous page fault in a PV guest Gleb Natapov
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel

Allow a paravirtualized guest to do special handling for some page faults.
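
The contract, as wired up below: the hook runs at the very top of
do_page_fault(), and a non-zero return means the paravirt backend consumed
the fault so the normal handling is skipped. The default op is
_paravirt_ret_0, which is patched down to a single "xor %eax, %eax" on
kernels that don't override it. KVM registers its handler in the next
patch, roughly:

        /* arch/x86/mm/fault.c (hunk below) */
        if (arch_handle_page_fault(regs, error_code))
                return;         /* handled by the PV backend, e.g. async PF */

        /* guest setup, from patch 3 */
        if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
                pv_cpu_ops.handle_pf = kvm_handle_pf;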

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/paravirt.h       |    7 +++++++
 arch/x86/include/asm/paravirt_types.h |    4 ++++
 arch/x86/kernel/paravirt.c            |    8 ++++++++
 arch/x86/kernel/paravirt_patch_32.c   |    8 ++++++++
 arch/x86/kernel/paravirt_patch_64.c   |    7 +++++++
 arch/x86/mm/fault.c                   |    3 +++
 6 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index efb3899..5203da1 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -6,6 +6,7 @@
 #ifdef CONFIG_PARAVIRT
 #include <asm/pgtable_types.h>
 #include <asm/asm.h>
+#include <asm/ptrace.h>
 
 #include <asm/paravirt_types.h>
 
@@ -710,6 +711,12 @@ static inline void arch_end_context_switch(struct task_struct *next)
 	PVOP_VCALL1(pv_cpu_ops.end_context_switch, next);
 }
 
+static inline int arch_handle_page_fault(struct pt_regs *regs,
+					 unsigned long error_code)
+{
+	return PVOP_CALL2(int, pv_cpu_ops.handle_pf, regs, error_code);
+}
+
 #define  __HAVE_ARCH_ENTER_LAZY_MMU_MODE
 static inline void arch_enter_lazy_mmu_mode(void)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 9357473..bcc39b3 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -186,6 +186,7 @@ struct pv_cpu_ops {
 
 	void (*start_context_switch)(struct task_struct *prev);
 	void (*end_context_switch)(struct task_struct *next);
+	int (*handle_pf)(struct pt_regs *regs, unsigned long error_code);
 };
 
 struct pv_irq_ops {
@@ -385,6 +386,7 @@ extern struct pv_lock_ops pv_lock_ops;
 unsigned paravirt_patch_nop(void);
 unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len);
 unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len);
+unsigned paravirt_patch_ret_0(void *insnbuf, unsigned len);
 unsigned paravirt_patch_ignore(unsigned len);
 unsigned paravirt_patch_call(void *insnbuf,
 			     const void *target, u16 tgt_clobbers,
@@ -676,8 +678,10 @@ void paravirt_leave_lazy_mmu(void);
 void _paravirt_nop(void);
 u32 _paravirt_ident_32(u32);
 u64 _paravirt_ident_64(u64);
+unsigned long _paravirt_ret_0(void);
 
 #define paravirt_nop	((void *)_paravirt_nop)
+#define paravirt_ret_0  ((void *)_paravirt_ret_0)
 
 /* These all sit in the .parainstructions section to tell us what to patch. */
 struct paravirt_patch_site {
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 1b1739d..7d8f37b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -54,6 +54,11 @@ u64 _paravirt_ident_64(u64 x)
 	return x;
 }
 
+unsigned long _paravirt_ret_0(void)
+{
+	return 0;
+}
+
 void __init default_banner(void)
 {
 	printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
@@ -154,6 +159,8 @@ unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
 		ret = paravirt_patch_ident_32(insnbuf, len);
 	else if (opfunc == _paravirt_ident_64)
 		ret = paravirt_patch_ident_64(insnbuf, len);
+	else if (opfunc == _paravirt_ret_0)
+		ret = paravirt_patch_ret_0(insnbuf, len);
 
 	else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
@@ -380,6 +387,7 @@ struct pv_cpu_ops pv_cpu_ops = {
 
 	.start_context_switch = paravirt_nop,
 	.end_context_switch = paravirt_nop,
+	.handle_pf = paravirt_ret_0,
 };
 
 struct pv_apic_ops pv_apic_ops = {
diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c
index d9f32e6..de006b1 100644
--- a/arch/x86/kernel/paravirt_patch_32.c
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -12,6 +12,8 @@ DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
 DEF_NATIVE(pv_cpu_ops, clts, "clts");
 DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");
 
+DEF_NATIVE(, mov0, "xor %eax, %eax");
+
 unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
 {
 	/* arg in %eax, return in %eax */
@@ -24,6 +26,12 @@ unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
 	return 0;
 }
 
+unsigned paravirt_patch_ret_0(void *insnbuf, unsigned len)
+{
+	return paravirt_patch_insns(insnbuf, len,
+				    start__mov0, end__mov0);
+}
+
 unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
 		      unsigned long addr, unsigned len)
 {
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
index 3f08f34..d685e7d 100644
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -21,6 +21,7 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
 
 DEF_NATIVE(, mov32, "mov %edi, %eax");
 DEF_NATIVE(, mov64, "mov %rdi, %rax");
+DEF_NATIVE(, mov0, "xor %rax, %rax");
 
 unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
 {
@@ -34,6 +35,12 @@ unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
 				    start__mov64, end__mov64);
 }
 
+unsigned paravirt_patch_ret_0(void *insnbuf, unsigned len)
+{
+	return paravirt_patch_insns(insnbuf, len,
+				    start__mov0, end__mov0);
+}
+
 unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
 		      unsigned long addr, unsigned len)
 {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f4cee90..14707dc 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -952,6 +952,9 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	int write;
 	int fault;
 
+	if (arch_handle_page_fault(regs, error_code))
+		return;
+
 	tsk = current;
 	mm = tsk->mm;
 
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 03/11] Handle asynchronous page fault in a PV guest.
  2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
  2009-11-01 11:56 ` [PATCH 01/11] Add shared memory hypercall to PV Linux guest Gleb Natapov
  2009-11-01 11:56 ` [PATCH 02/11] Add "handle page fault" PV helper Gleb Natapov
@ 2009-11-01 11:56 ` Gleb Natapov
  2009-11-02 12:38   ` Avi Kivity
  2009-11-03 14:14   ` Marcelo Tosatti
  2009-11-01 11:56 ` [PATCH 04/11] Export __get_user_pages_fast Gleb Natapov
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel

An asynchronous page fault notifies the vcpu that the page it is trying to
access has been swapped out by the host. In response the guest puts the
task that caused the fault to sleep until the page is swapped in again.
When the missing page is brought back into memory the guest is notified
and the task resumes execution.
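
One subtlety the code below handles: the "page ready" notification may
arrive before the faulting task has queued itself on the sleeper hash, so
the wake side leaves a dummy node behind and the wait side consumes it
instead of sleeping. Condensed from apf_task_wake()/apf_task_wait() below
(locking and declarations omitted):

        /* apf_task_wake(token): the sleeper may not be queued yet */
        n = _find_apf_task(b, token);
        if (!n) {
                n = kmalloc(sizeof(*n), GFP_ATOMIC);    /* dummy entry */
                n->token = token;
                hlist_add_head(&n->link, &b->list);
        } else {
                hlist_del_init(&n->link);
                wake_up(&n->wq);
        }

        /* apf_task_wait(current, token): check for that dummy first */
        e = _find_apf_task(b, token);
        if (e) {
                hlist_del(&e->link);    /* wake-up already arrived */
                kfree(e);
                return;
        }
        /* otherwise queue ourselves and sleep until hlist_unhashed(&n.link) */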

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    3 +
 arch/x86/kernel/kvm.c           |  120 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 120 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 90708b7..61e2aa3 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -52,6 +52,9 @@ struct kvm_mmu_op_release_pt {
 
 #define KVM_PV_SHM_FEATURES_ASYNC_PF		(1 << 0)
 
+#define KVM_PV_REASON_PAGE_NP 1
+#define KVM_PV_REASON_PAGE_READY 2
+
 struct kvm_vcpu_pv_shm {
 	__u64 features;
 	__u64 reason;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index d03f33c..79d291f 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -30,6 +30,8 @@
 #include <linux/bootmem.h>
 #include <linux/notifier.h>
 #include <linux/reboot.h>
+#include <linux/hash.h>
+#include <linux/sched.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 
@@ -55,15 +57,121 @@ static void kvm_io_delay(void)
 {
 }
 
-static void kvm_end_context_switch(struct task_struct *next)
+#define KVM_TASK_SLEEP_HASHBITS 8
+#define KVM_TASK_SLEEP_HASHSIZE (1<<KVM_TASK_SLEEP_HASHBITS)
+
+struct kvm_task_sleep_node {
+	struct hlist_node link;
+	wait_queue_head_t wq;
+	u64 token;
+};
+
+static struct kvm_task_sleep_head {
+	spinlock_t lock;
+	struct hlist_head list;
+} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];
+
+static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
+						  u64 token)
+{
+	struct hlist_node *p;
+
+	hlist_for_each(p, &b->list) {
+		struct kvm_task_sleep_node *n =
+			hlist_entry(p, typeof(*n), link);
+		if (n->token == token)
+			return n;
+	}
+
+	return NULL;
+}
+
+static void apf_task_wait(struct task_struct *tsk, u64 token)
 {
+	u64 key = hash_64(token, KVM_TASK_SLEEP_HASHBITS);
+	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+	struct kvm_task_sleep_node n, *e;
+	DEFINE_WAIT(wait);
+
+	spin_lock(&b->lock);
+	e = _find_apf_task(b, token);
+	if (e) {
+		/* dummy entry exist -> wake up was delivered ahead of PF */
+		hlist_del(&e->link);
+		kfree(e);
+		spin_unlock(&b->lock);
+		return;
+	}
+
+	n.token = token;
+	init_waitqueue_head(&n.wq);
+	hlist_add_head(&n.link, &b->list);
+	spin_unlock(&b->lock);
+
+	for (;;) {
+		prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+		if (hlist_unhashed(&n.link))
+			break;
+		schedule();
+	}
+	finish_wait(&n.wq, &wait);
+
+	return;
+}
+
+static void apf_task_wake(u64 token)
+{
+	u64 key = hash_64(token, KVM_TASK_SLEEP_HASHBITS);
+	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+	struct kvm_task_sleep_node *n;
+
+	spin_lock(&b->lock);
+	n = _find_apf_task(b, token);
+	if (!n) {
+		/* PF was not yet handled. Add dummy entry for the token */
+		n = kmalloc(sizeof(*n), GFP_ATOMIC);
+		if (!n) {
+			printk(KERN_EMERG"async PF can't allocate memory\n");
+		} else {
+			n->token = token;
+			hlist_add_head(&n->link, &b->list);
+		}
+	} else {
+		hlist_del_init(&n->link);
+		if (waitqueue_active(&n->wq))
+			wake_up(&n->wq);
+	}
+	spin_unlock(&b->lock);
+	return;
+}
+
+int kvm_handle_pf(struct pt_regs *regs, unsigned long error_code)
+{
+	u64 reason, token;
 	struct kvm_vcpu_pv_shm *pv_shm =
 		per_cpu(kvm_vcpu_pv_shm, smp_processor_id());
 
 	if (!pv_shm)
-		return;
+		return 0;
+
+	reason = pv_shm->reason;
+	pv_shm->reason = 0;
+
+	token = pv_shm->param;
+
+	switch (reason) {
+	default:
+		return 0;
+	case KVM_PV_REASON_PAGE_NP:
+		/* real page is missing. */
+		apf_task_wait(current, token);
+		break;
+	case KVM_PV_REASON_PAGE_READY:
+		apf_task_wake(token);
+		break;
+	}
 
-	pv_shm->current_task = (u64)next;
+	return 1;
 }
 
 static void kvm_mmu_op(void *buffer, unsigned len)
@@ -219,6 +327,9 @@ static void __init paravirt_ops_setup(void)
 	if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
 		pv_cpu_ops.io_delay = kvm_io_delay;
 
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
+		pv_cpu_ops.handle_pf = kvm_handle_pf;
+
 	if (kvm_para_has_feature(KVM_FEATURE_MMU_OP)) {
 		pv_mmu_ops.set_pte = kvm_set_pte;
 		pv_mmu_ops.set_pte_at = kvm_set_pte_at;
@@ -272,11 +383,14 @@ static struct notifier_block kvm_pv_reboot_nb = {
 
 void __init kvm_guest_init(void)
 {
+	int i;
 	if (!kvm_para_available())
 		return;
 
 	paravirt_ops_setup();
 	register_reboot_notifier(&kvm_pv_reboot_nb);
+	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
+		spin_lock_init(&async_pf_sleepers[i].lock);
 }
 
 void __cpuinit kvm_guest_cpu_init(void)
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 04/11] Export __get_user_pages_fast.
  2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
                   ` (2 preceding siblings ...)
  2009-11-01 11:56 ` [PATCH 03/11] Handle asynchronous page fault in a PV guest Gleb Natapov
@ 2009-11-01 11:56 ` Gleb Natapov
  2009-11-02  9:23   ` Ingo Molnar
  2009-11-01 11:56 ` [PATCH 05/11] Add get_user_pages() variant that fails if major fault is required Gleb Natapov
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel


Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/mm/gup.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 71da1bc..cea0dfe 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -8,6 +8,7 @@
 #include <linux/mm.h>
 #include <linux/vmstat.h>
 #include <linux/highmem.h>
+#include <linux/module.h>
 
 #include <asm/pgtable.h>
 
@@ -274,6 +275,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	return nr;
 }
+EXPORT_SYMBOL_GPL(__get_user_pages_fast);
 
 /**
  * get_user_pages_fast() - pin user pages in memory
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 05/11] Add get_user_pages() variant that fails if major fault is required.
  2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
                   ` (3 preceding siblings ...)
  2009-11-01 11:56 ` [PATCH 04/11] Export __get_user_pages_fast Gleb Natapov
@ 2009-11-01 11:56 ` Gleb Natapov
  2009-11-02 19:05   ` Rik van Riel
  2009-11-01 11:56 ` [PATCH 06/11] Inject asynchronous page fault into a guest if page is swapped out Gleb Natapov
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel

This patch adds a get_user_pages() variant that only succeeds if getting
a reference to a page doesn't require a major fault.
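
A sketch of the intended use, mirroring how patch 6 calls it from the KVM
MMU: try the lockless fast path first, then the _noio variant, and treat
failure as "faulting this page in would do I/O, handle it asynchronously":

        struct page *page[1];
        int npages;

        npages = __get_user_pages_fast(addr, 1, 1, page);
        if (npages != 1) {
                down_read(&current->mm->mmap_sem);
                npages = get_user_pages_noio(current, current->mm, addr,
                                             1, 1, 0, page, NULL);
                up_read(&current->mm->mmap_sem);
        }
        if (npages != 1) {
                /* page not resident; schedule an asynchronous fault-in
                 * instead of blocking here */
        }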

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 fs/ncpfs/mmap.c    |    2 ++
 include/linux/mm.h |    5 +++++
 mm/filemap.c       |    3 +++
 mm/memory.c        |   31 ++++++++++++++++++++++++++++---
 mm/shmem.c         |    8 +++++++-
 5 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
index 15458de..338527e 100644
--- a/fs/ncpfs/mmap.c
+++ b/fs/ncpfs/mmap.c
@@ -39,6 +39,8 @@ static int ncp_file_mmap_fault(struct vm_area_struct *area,
 	int bufsize;
 	int pos; /* XXX: loff_t ? */
 
+	if (vmf->flags & FAULT_FLAG_MINOR)
+		return VM_FAULT_MAJOR | VM_FAULT_ERROR;
 	/*
 	 * ncpfs has nothing against high pages as long
 	 * as recvmsg and memset works on it
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 24c3956..2304181 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -136,6 +136,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
 #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_MINOR	0x08	/* Do only minor fault */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -821,6 +822,9 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			unsigned long start, int nr_pages, int write, int force,
 			struct page **pages, struct vm_area_struct **vmas);
+int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
+			unsigned long start, int nr_pages, int write, int force,
+			struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct page *get_dump_page(unsigned long addr);
@@ -1239,6 +1243,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_MINOR	0x20	/* do only minor page faults */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/filemap.c b/mm/filemap.c
index ef169f3..6ef29e0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1530,6 +1530,9 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 			goto no_cached_page;
 		}
 	} else {
+		if (vmf->flags & FAULT_FLAG_MINOR)
+			return VM_FAULT_MAJOR | VM_FAULT_ERROR;
+
 		/* No page in the page cache at all */
 		do_sync_mmap_readahead(vma, ra, file, offset);
 		count_vm_event(PGMAJFAULT);
diff --git a/mm/memory.c b/mm/memory.c
index 7e91b5f..b621526 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1318,10 +1318,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			cond_resched();
 			while (!(page = follow_page(vma, start, foll_flags))) {
 				int ret;
+				unsigned int fault_fl =
+					((foll_flags & FOLL_WRITE) ?
+					FAULT_FLAG_WRITE : 0) |
+					((foll_flags & FOLL_MINOR) ?
+					FAULT_FLAG_MINOR : 0);
 
-				ret = handle_mm_fault(mm, vma, start,
-					(foll_flags & FOLL_WRITE) ?
-					FAULT_FLAG_WRITE : 0);
+				ret = handle_mm_fault(mm, vma, start, fault_fl);
 
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
@@ -1329,6 +1332,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 					if (ret &
 					    (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
 						return i ? i : -EFAULT;
+					else if (ret & VM_FAULT_MAJOR)
+						return i ? i : -EFAULT;
 					BUG();
 				}
 				if (ret & VM_FAULT_MAJOR)
@@ -1439,6 +1444,23 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	int flags = FOLL_TOUCH | FOLL_MINOR;
+
+	if (pages)
+		flags |= FOLL_GET;
+	if (write)
+		flags |= FOLL_WRITE;
+	if (force)
+		flags |= FOLL_FORCE;
+
+	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
+}
+EXPORT_SYMBOL(get_user_pages_noio);
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
@@ -2518,6 +2540,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry);
 	if (!page) {
+		if (flags & FAULT_FLAG_MINOR)
+			return VM_FAULT_MAJOR | VM_FAULT_ERROR;
+
 		grab_swap_token(mm); /* Contend for token _before_ read-in */
 		page = swapin_readahead(entry,
 					GFP_HIGHUSER_MOVABLE, vma, address);
diff --git a/mm/shmem.c b/mm/shmem.c
index 356dd99..6a9d3c0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1218,6 +1218,7 @@ static int shmem_getpage(struct inode *inode, unsigned long idx,
 	swp_entry_t swap;
 	gfp_t gfp;
 	int error;
+	int flags = type ? *type : 0;
 
 	if (idx >= SHMEM_MAX_INDEX)
 		return -EFBIG;
@@ -1266,6 +1267,11 @@ repeat:
 		swappage = lookup_swap_cache(swap);
 		if (!swappage) {
 			shmem_swp_unmap(entry);
+			if (flags & FAULT_FLAG_MINOR) {
+				spin_unlock(&info->lock);
+				*type = VM_FAULT_MAJOR | VM_FAULT_ERROR;
+				goto failed;
+			}
 			/* here we actually do the io */
 			if (type && !(*type & VM_FAULT_MAJOR)) {
 				__count_vm_event(PGMAJFAULT);
@@ -1474,7 +1480,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
 	int error;
-	int ret;
+	int ret = (int)vmf->flags;
 
 	if (((loff_t)vmf->pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 06/11] Inject asynchronous page fault into a guest if page is swapped out.
  2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
                   ` (4 preceding siblings ...)
  2009-11-01 11:56 ` [PATCH 05/11] Add get_user_pages() variant that fails if major fault is required Gleb Natapov
@ 2009-11-01 11:56 ` Gleb Natapov
  2009-11-02 12:56   ` Avi Kivity
  2009-11-01 11:56 ` [PATCH 07/11] Retry fault before vmentry Gleb Natapov
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel

If a guest accesses swapped-out memory, do not swap it in from vcpu thread
context. Instead set up a slow work item to do the swapping and send an
async page fault to the guest.

Allow async page fault injection only when the guest is in user mode,
since otherwise the guest may be in a non-sleepable context and will not
be able to reschedule.
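
The notification path itself is small. The token identifies the fault
(the low 12 bits carry the vcpu id, the rest a per-vcpu counter), and the
host delivers both "page not present" and "page ready" by writing the
shared area and injecting an ordinary #PF, condensed from
kvm_check_async_pf_completion() below:

        vcpu->arch.pv_shm->reason = KVM_PV_REASON_PAGE_NP;      /* or _PAGE_READY */
        vcpu->arch.pv_shm->param = work->token;
        kvm_inject_page_fault(vcpu, work->gva, 0);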

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |   20 +++
 arch/x86/kvm/mmu.c              |  243 ++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmutrace.h         |   60 ++++++++++
 arch/x86/kvm/paging_tmpl.h      |   16 +++-
 arch/x86/kvm/x86.c              |   22 +++-
 5 files changed, 352 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2d1f526..e3cdbfe 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -15,6 +15,7 @@
 #include <linux/mm.h>
 #include <linux/mmu_notifier.h>
 #include <linux/tracepoint.h>
+#include <linux/slow-work.h>
 
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
@@ -235,6 +236,18 @@ struct kvm_pv_mmu_op_buffer {
 	char buf[512] __aligned(sizeof(long));
 };
 
+struct kvm_mmu_async_pf {
+	struct slow_work work;
+	struct list_head link;
+	struct kvm_vcpu *vcpu;
+	struct mm_struct *mm;
+	gva_t gva;
+	unsigned long addr;
+	u64 token;
+	struct page *page;
+	atomic_t used;
+};
+
 struct kvm_pio_request {
 	unsigned long count;
 	int cur_count;
@@ -318,6 +331,11 @@ struct kvm_vcpu_arch {
 		unsigned long mmu_seq;
 	} update_pte;
 
+	struct list_head mmu_async_pf_done;
+	spinlock_t mmu_async_pf_lock;
+	struct kvm_mmu_async_pf *mmu_async_pf_work;
+	u32 async_pf_id;
+
 	struct i387_fxsave_struct host_fx_image;
 	struct i387_fxsave_struct guest_fx_image;
 
@@ -654,6 +672,8 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
 int kvm_mmu_load(struct kvm_vcpu *vcpu);
 void kvm_mmu_unload(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
 
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a902479..31e837b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -19,6 +19,7 @@
 
 #include "mmu.h"
 #include "kvm_cache_regs.h"
+#include "x86.h"
 
 #include <linux/kvm_host.h>
 #include <linux/types.h>
@@ -188,6 +189,7 @@ typedef int (*mmu_parent_walk_fn) (struct kvm_vcpu *vcpu, struct kvm_mmu_page *s
 static struct kmem_cache *pte_chain_cache;
 static struct kmem_cache *rmap_desc_cache;
 static struct kmem_cache *mmu_page_header_cache;
+static struct kmem_cache *mmu_async_pf_cache;
 
 static u64 __read_mostly shadow_trap_nonpresent_pte;
 static u64 __read_mostly shadow_notrap_nonpresent_pte;
@@ -2189,6 +2191,218 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 			     error_code & PFERR_WRITE_MASK, gfn);
 }
 
+static int gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, pfn_t *pfn)
+{
+	struct page *page[1];
+	unsigned long addr;
+	int npages;
+
+	*pfn = bad_pfn;
+
+	addr = gfn_to_hva(kvm, gfn);
+	if (kvm_is_error_hva(addr)) {
+		get_page(bad_page);
+		return 1;
+	}
+
+	npages = __get_user_pages_fast(addr, 1, 1, page);
+
+	if (unlikely(npages != 1)) {
+		down_read(&current->mm->mmap_sem);
+		npages = get_user_pages_noio(current, current->mm, addr, 1, 1,
+					     0, page, NULL);
+		up_read(&current->mm->mmap_sem);
+	}
+
+	if (unlikely(npages != 1)) {
+		struct vm_area_struct *vma;
+
+		down_read(&current->mm->mmap_sem);
+		vma = find_vma(current->mm, addr);
+
+		if (vma == NULL || addr < vma->vm_start ||
+		    !(vma->vm_flags & VM_PFNMAP)) {
+			up_read(&current->mm->mmap_sem);
+			return 0; /* do async fault in */
+		}
+
+		*pfn = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+		up_read(&current->mm->mmap_sem);
+		BUG_ON(!kvm_is_mmio_pfn(*pfn));
+	} else
+		*pfn = page_to_pfn(page[0]);
+
+	return 1;
+}
+
+static void async_pf_work_free(struct kvm_mmu_async_pf *apf)
+{
+	if (atomic_dec_and_test(&apf->used))
+		kmem_cache_free(mmu_async_pf_cache, apf);
+}
+
+static int async_pf_get_ref(struct slow_work *work)
+{
+	struct kvm_mmu_async_pf *apf =
+		container_of(work, struct kvm_mmu_async_pf, work);
+
+	atomic_inc(&apf->used);
+	return 0;
+}
+
+static void async_pf_put_ref(struct slow_work *work)
+{
+	struct kvm_mmu_async_pf *apf =
+		container_of(work, struct kvm_mmu_async_pf, work);
+
+	kvm_put_kvm(apf->vcpu->kvm);
+	async_pf_work_free(apf);
+}
+
+static void async_pf_execute(struct slow_work *work)
+{
+	struct page *page[1];
+	struct kvm_mmu_async_pf *apf =
+		container_of(work, struct kvm_mmu_async_pf, work);
+	wait_queue_head_t *q = &apf->vcpu->wq;
+
+	might_sleep();
+
+	down_read(&apf->mm->mmap_sem);
+	get_user_pages(current, apf->mm, apf->addr, 1, 1, 0, page, NULL);
+	up_read(&apf->mm->mmap_sem);
+
+	spin_lock(&apf->vcpu->arch.mmu_async_pf_lock);
+	list_add_tail(&apf->link, &apf->vcpu->arch.mmu_async_pf_done);
+	apf->page = page[0];
+	spin_unlock(&apf->vcpu->arch.mmu_async_pf_lock);
+
+	trace_kvm_mmu_async_pf_executed(apf->addr, apf->page, apf->token,
+					apf->gva);
+
+	if (waitqueue_active(q))
+		wake_up_interruptible(q);
+
+	mmdrop(apf->mm);
+}
+
+struct slow_work_ops async_pf_ops = {
+	.get_ref = async_pf_get_ref,
+	.put_ref = async_pf_put_ref,
+	.execute = async_pf_execute
+};
+
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
+{
+	while (!list_empty(&vcpu->arch.mmu_async_pf_done)) {
+		struct kvm_mmu_async_pf *work =
+			list_entry(vcpu->arch.mmu_async_pf_done.next,
+				   typeof(*work), link);
+		list_del(&work->link);
+		put_page(work->page);
+		kmem_cache_free(mmu_async_pf_cache, work);
+	}
+}
+
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmu_async_pf *work = vcpu->arch.mmu_async_pf_work;
+
+	if (work) {
+		vcpu->arch.mmu_async_pf_work = NULL;
+		if (work->page == NULL) {
+			vcpu->arch.pv_shm->reason = KVM_PV_REASON_PAGE_NP;
+			vcpu->arch.pv_shm->param = work->token;
+			kvm_inject_page_fault(vcpu, work->gva, 0);
+			trace_kvm_mmu_send_async_pf(
+				vcpu->arch.pv_shm->param,
+				work->gva, KVM_PV_REASON_PAGE_NP);
+			return;
+		} else {
+			spin_lock(&vcpu->arch.mmu_async_pf_lock);
+			list_del(&work->link);
+			spin_unlock(&vcpu->arch.mmu_async_pf_lock);
+			put_page(work->page);
+			async_pf_work_free(work);
+		}
+	}
+
+	if (list_empty_careful(&vcpu->arch.mmu_async_pf_done) ||
+	    kvm_event_needs_reinjection(vcpu) ||
+	    !kvm_x86_ops->interrupt_allowed(vcpu))
+		return;
+
+	spin_lock(&vcpu->arch.mmu_async_pf_lock);
+	work = list_first_entry(&vcpu->arch.mmu_async_pf_done, typeof(*work),
+				link);
+	list_del(&work->link);
+	spin_unlock(&vcpu->arch.mmu_async_pf_lock);
+
+	vcpu->arch.pv_shm->reason = KVM_PV_REASON_PAGE_READY;
+	vcpu->arch.pv_shm->param = work->token;
+	kvm_inject_page_fault(vcpu, work->gva, 0);
+	trace_kvm_mmu_send_async_pf(work->token, work->gva,
+				    KVM_PV_REASON_PAGE_READY);
+
+	put_page(work->page);
+	async_pf_work_free(work);
+}
+
+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment kvm_seg;
+
+	if (!vcpu->arch.pv_shm ||
+	    !(vcpu->arch.pv_shm->features & KVM_PV_SHM_FEATURES_ASYNC_PF) ||
+	    kvm_event_needs_reinjection(vcpu))
+		return false;
+
+	kvm_get_segment(vcpu, &kvm_seg, VCPU_SREG_CS);
+
+	/* is userspace code? TODO check VM86 mode */
+	return !!(kvm_seg.selector & 3);
+}
+
+static int setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
+{
+	struct kvm_mmu_async_pf *work;
+
+	/* setup slow work */
+
+	/* do alloc atomic since if we are going to sleep anyway we
+	   may as well sleep faulting in page */
+	work = kmem_cache_zalloc(mmu_async_pf_cache, GFP_ATOMIC);
+	if (!work)
+		return 0;
+
+	atomic_set(&work->used, 1);
+	work->page = NULL;
+	work->vcpu = vcpu;
+	work->gva = gva;
+	work->addr = gfn_to_hva(vcpu->kvm, gfn);
+	work->token = (vcpu->arch.async_pf_id++ << 12) | vcpu->vcpu_id;
+	work->mm = current->mm;
+	atomic_inc(&work->mm->mm_count);
+	kvm_get_kvm(work->vcpu->kvm);
+
+	/* this can't really happen otherwise gfn_to_pfn_async
+	   would succeed */
+	if (unlikely(kvm_is_error_hva(work->addr)))
+		goto retry_sync;
+
+	slow_work_init(&work->work, &async_pf_ops);
+	if (slow_work_enqueue(&work->work) != 0)
+		goto retry_sync;
+
+	vcpu->arch.mmu_async_pf_work = work;
+	return 1;
+retry_sync:
+	kvm_put_kvm(work->vcpu->kvm);
+	mmdrop(work->mm);
+	kmem_cache_free(mmu_async_pf_cache, work);
+	return 0;
+}
+
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 				u32 error_code)
 {
@@ -2211,7 +2425,23 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-	pfn = gfn_to_pfn(vcpu->kvm, gfn);
+
+	if (can_do_async_pf(vcpu)) {
+		r = gfn_to_pfn_async(vcpu->kvm, gfn, &pfn);
+		trace_kvm_mmu_try_async_get_page(r, pfn);
+	} else {
+do_sync:
+		r = 1;
+		pfn = gfn_to_pfn(vcpu->kvm, gfn);
+	}
+
+	if (!r) {
+		if (!setup_async_pf(vcpu, gpa, gfn))
+			goto do_sync;
+		return 0;
+	}
+
+	/* mmio */
 	if (is_error_pfn(pfn)) {
 		kvm_release_pfn_clean(pfn);
 		return 1;
@@ -2220,8 +2450,8 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
 	kvm_mmu_free_some_pages(vcpu);
-	r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK,
-			 level, gfn, pfn);
+	r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK, level, gfn,
+			 pfn);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
 	return r;
@@ -2976,6 +3206,8 @@ static void mmu_destroy_caches(void)
 		kmem_cache_destroy(rmap_desc_cache);
 	if (mmu_page_header_cache)
 		kmem_cache_destroy(mmu_page_header_cache);
+	if (mmu_async_pf_cache)
+		kmem_cache_destroy(mmu_async_pf_cache);
 }
 
 void kvm_mmu_module_exit(void)
@@ -3003,6 +3235,11 @@ int kvm_mmu_module_init(void)
 	if (!mmu_page_header_cache)
 		goto nomem;
 
+	mmu_async_pf_cache = KMEM_CACHE(kvm_mmu_async_pf, 0);
+
+	if (!mmu_async_pf_cache)
+		goto nomem;
+
 	register_shrinker(&mmu_shrinker);
 
 	return 0;
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 3e4a5c6..d6dd63c 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -214,6 +214,66 @@ TRACE_EVENT(
 	TP_printk("%s", KVM_MMU_PAGE_PRINTK())
 );
 
+TRACE_EVENT(
+	kvm_mmu_try_async_get_page,
+	TP_PROTO(bool r, u64 pfn),
+	TP_ARGS(r, pfn),
+
+	TP_STRUCT__entry(
+		__field(__u64, pfn)
+		),
+
+	TP_fast_assign(
+		__entry->pfn = r ? pfn : (u64)-1;
+		),
+
+	TP_printk("pfn %#llx", __entry->pfn)
+);
+
+TRACE_EVENT(
+	kvm_mmu_send_async_pf,
+	TP_PROTO(u64 task, u64 gva, u64 reason),
+	TP_ARGS(task, gva, reason),
+
+	TP_STRUCT__entry(
+		__field(__u64, task)
+		__field(__u64, gva)
+		__field(bool, np)
+		),
+
+	TP_fast_assign(
+		__entry->task = task;
+		__entry->gva = gva;
+		__entry->np = (reason == KVM_PV_REASON_PAGE_NP);
+		),
+
+	TP_printk("task %#llx gva %#llx %s", __entry->task, __entry->gva,
+		  __entry->np ? "not present" : "ready")
+);
+
+TRACE_EVENT(
+	kvm_mmu_async_pf_executed,
+	TP_PROTO(unsigned long address, struct page *page, u64 task, u64 gva),
+	TP_ARGS(address, page, task, gva),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, address)
+		__field(struct page*, page)
+		__field(u64, task)
+		__field(u64, gva)
+		),
+
+	TP_fast_assign(
+		__entry->address = address;
+		__entry->page = page;
+		__entry->task = task;
+		__entry->gva = gva;
+		),
+
+	TP_printk("task %#llx gva %#llx address %#lx pfn %lx", __entry->task,
+		  __entry->gva, __entry->address, page_to_pfn(__entry->page))
+);
+
 #endif /* _TRACE_KVMMMU_H */
 
 /* This part must be outside protection */
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 72558f8..9fe2ecd 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -419,7 +419,21 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-	pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
+
+	if (can_do_async_pf(vcpu)) {
+		r = gfn_to_pfn_async(vcpu->kvm, walker.gfn, &pfn);
+		trace_kvm_mmu_try_async_get_page(r, pfn);
+	} else {
+do_sync:
+		r = 1;
+		pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
+	}
+
+	if (!r) {
+		if (!setup_async_pf(vcpu, addr, walker.gfn))
+			goto do_sync;
+		return 0;
+	}
 
 	/* mmio */
 	if (is_error_pfn(pfn)) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c177933..e6bd3ad 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3330,6 +3330,10 @@ int kvm_arch_init(void *opaque)
 		goto out;
 	}
 
+	r = slow_work_register_user();
+	if (r)
+		goto out;
+
 	r = kvm_mmu_module_init();
 	if (r)
 		goto out;
@@ -3352,6 +3356,7 @@ out:
 
 void kvm_arch_exit(void)
 {
+	slow_work_unregister_user();
 	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
 		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
 					    CPUFREQ_TRANSITION_NOTIFIER);
@@ -3837,6 +3842,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	kvm_check_async_pf_completion(vcpu);
+
 	preempt_disable();
 
 	kvm_x86_ops->prepare_guest_switch(vcpu);
@@ -3927,7 +3934,6 @@ out:
 	return r;
 }
 
-
 static int __vcpu_run(struct kvm_vcpu *vcpu)
 {
 	int r;
@@ -5026,6 +5032,9 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	}
 	vcpu->arch.mcg_cap = KVM_MAX_MCE_BANKS;
 
+	INIT_LIST_HEAD(&vcpu->arch.mmu_async_pf_done);
+	spin_lock_init(&vcpu->arch.mmu_async_pf_lock);
+
 	return 0;
 
 fail_mmu_destroy:
@@ -5078,8 +5087,10 @@ static void kvm_free_vcpus(struct kvm *kvm)
 	/*
 	 * Unpin any mmu pages first.
 	 */
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_unload_vcpu_mmu(vcpu);
+	}
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_arch_vcpu_free(vcpu);
 
@@ -5178,10 +5189,11 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
+		|| !list_empty_careful(&vcpu->arch.mmu_async_pf_done)
 		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
-		|| vcpu->arch.nmi_pending ||
-		(kvm_arch_interrupt_allowed(vcpu) &&
-		 kvm_cpu_has_interrupt(vcpu));
+		|| vcpu->arch.nmi_pending
+		|| (kvm_arch_interrupt_allowed(vcpu) &&
+		    kvm_cpu_has_interrupt(vcpu));
 }
 
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 07/11] Retry fault before vmentry
  2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
                   ` (5 preceding siblings ...)
  2009-11-01 11:56 ` [PATCH 06/11] Inject asynchronous page fault into a guest if page is swapped out Gleb Natapov
@ 2009-11-01 11:56 ` Gleb Natapov
  2009-11-02 13:03   ` Avi Kivity
  2009-11-01 11:56 ` [PATCH 08/11] Add "wait for page" hypercall Gleb Natapov
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel

When a page is swapped in it is mapped into guest memory only after the
guest tries to access it again and generates another fault. To save this
fault we can map it immediately, since we know that the guest is going to
access the page.
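
Concretely, when the completion is picked up the host now replays the
fault itself with the saved gva and error code instead of waiting for the
guest to re-fault; passing (gpa_t)-1 as the cr3 argument forces the replay
down the synchronous path (see the can_do_async_pf() gate):

        /* the page is resident now: map it before re-entering the guest */
        vcpu->arch.mmu.page_fault(vcpu, (gpa_t)-1, work->gva, work->error_code);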

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    4 +++-
 arch/x86/kvm/mmu.c              |   18 ++++++++++++------
 arch/x86/kvm/paging_tmpl.h      |   32 ++++++++++++++++++++++++++++----
 3 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e3cdbfe..6c781ea 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -241,6 +241,8 @@ struct kvm_mmu_async_pf {
 	struct list_head link;
 	struct kvm_vcpu *vcpu;
 	struct mm_struct *mm;
+	gpa_t cr3;
+	u32 error_code;
 	gva_t gva;
 	unsigned long addr;
 	u64 token;
@@ -267,7 +269,7 @@ struct kvm_pio_request {
  */
 struct kvm_mmu {
 	void (*new_cr3)(struct kvm_vcpu *vcpu);
-	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
+	int (*page_fault)(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t gva, u32 err);
 	void (*free)(struct kvm_vcpu *vcpu);
 	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva);
 	void (*prefetch_page)(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 31e837b..abe1ce9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2171,7 +2171,7 @@ static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr)
 	return vaddr;
 }
 
-static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
+static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t gva,
 				u32 error_code)
 {
 	gfn_t gfn;
@@ -2322,6 +2322,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 			spin_lock(&vcpu->arch.mmu_async_pf_lock);
 			list_del(&work->link);
 			spin_unlock(&vcpu->arch.mmu_async_pf_lock);
+			vcpu->arch.mmu.page_fault(vcpu, (gpa_t)-1, work->gva,
+						  work->error_code);
 			put_page(work->page);
 			async_pf_work_free(work);
 		}
@@ -2338,6 +2340,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 	list_del(&work->link);
 	spin_unlock(&vcpu->arch.mmu_async_pf_lock);
 
+	vcpu->arch.mmu.page_fault(vcpu, (gpa_t)-1, work->gva, work->error_code);
 	vcpu->arch.pv_shm->reason = KVM_PV_REASON_PAGE_READY;
 	vcpu->arch.pv_shm->param = work->token;
 	kvm_inject_page_fault(vcpu, work->gva, 0);
@@ -2363,7 +2366,8 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 	return !!(kvm_seg.selector & 3);
 }
 
-static int setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
+static int setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t gva,
+			  gfn_t gfn, u32 error_code)
 {
 	struct kvm_mmu_async_pf *work;
 
@@ -2378,6 +2382,8 @@ static int setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
 	atomic_set(&work->used, 1);
 	work->page = NULL;
 	work->vcpu = vcpu;
+	work->cr3 = cr3;
+	work->error_code = error_code;
 	work->gva = gva;
 	work->addr = gfn_to_hva(vcpu->kvm, gfn);
 	work->token = (vcpu->arch.async_pf_id++ << 12) | vcpu->vcpu_id;
@@ -2403,7 +2409,7 @@ retry_sync:
 	return 0;
 }
 
-static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
+static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t gpa,
 				u32 error_code)
 {
 	pfn_t pfn;
@@ -2426,7 +2432,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (can_do_async_pf(vcpu)) {
+	if (cr3 != (gpa_t)-1 && can_do_async_pf(vcpu)) {
 		r = gfn_to_pfn_async(vcpu->kvm, gfn, &pfn);
 		trace_kvm_mmu_try_async_get_page(r, pfn);
 	} else {
@@ -2436,7 +2442,7 @@ do_sync:
 	}
 
 	if (!r) {
-		if (!setup_async_pf(vcpu, gpa, gfn))
+		if (!setup_async_pf(vcpu, cr3, gpa, gfn, error_code))
 			goto do_sync;
 		return 0;
 	}
@@ -3006,7 +3012,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code)
 	int r;
 	enum emulation_result er;
 
-	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code);
+	r = vcpu->arch.mmu.page_fault(vcpu, vcpu->arch.cr3, cr2, error_code);
 	if (r < 0)
 		goto out;
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 9fe2ecd..b1fe61f 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -375,7 +375,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
  *  Returns: 1 if we need to emulate the instruction, 0 otherwise, or
  *           a negative value on error.
  */
-static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
+static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t addr,
 			       u32 error_code)
 {
 	int write_fault = error_code & PFERR_WRITE_MASK;
@@ -388,6 +388,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	pfn_t pfn;
 	int level = PT_PAGE_TABLE_LEVEL;
 	unsigned long mmu_seq;
+	gpa_t curr_cr3 = vcpu->arch.cr3;
 
 	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
 	kvm_mmu_audit(vcpu, "pre page fault");
@@ -396,6 +397,13 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	if (r)
 		return r;
 
+	if (curr_cr3 != cr3) {
+		vcpu->arch.cr3 = cr3;
+		paging_new_cr3(vcpu);
+		if (kvm_mmu_reload(vcpu))
+			goto switch_cr3;
+	}
+
 	/*
 	 * Look up the guest pte for the faulting address.
 	 */
@@ -406,6 +414,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	 * The page is not mapped by the guest.  Let the guest handle it.
 	 */
 	if (!r) {
+		if (curr_cr3 != vcpu->arch.cr3)
+			goto switch_cr3;
 		pgprintk("%s: guest page fault\n", __func__);
 		inject_page_fault(vcpu, addr, walker.error_code);
 		vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
@@ -420,7 +430,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (can_do_async_pf(vcpu)) {
+	if (cr3 != (gpa_t)-1 && can_do_async_pf(vcpu)) {
 		r = gfn_to_pfn_async(vcpu->kvm, walker.gfn, &pfn);
 		trace_kvm_mmu_try_async_get_page(r, pfn);
 	} else {
@@ -430,13 +440,17 @@ do_sync:
 	}
 
 	if (!r) {
-		if (!setup_async_pf(vcpu, addr, walker.gfn))
+		if (!setup_async_pf(vcpu, cr3, addr, walker.gfn, error_code))
 			goto do_sync;
+		if (curr_cr3 != vcpu->arch.cr3)
+			goto switch_cr3;
 		return 0;
 	}
 
 	/* mmio */
 	if (is_error_pfn(pfn)) {
+		if (curr_cr3 != vcpu->arch.cr3)
+			goto switch_cr3;
 		pgprintk("gfn %lx is mmio\n", walker.gfn);
 		kvm_release_pfn_clean(pfn);
 		return 1;
@@ -458,12 +472,22 @@ do_sync:
 	kvm_mmu_audit(vcpu, "post page fault (fixed)");
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
+	if (curr_cr3 != vcpu->arch.cr3)
+		goto switch_cr3;
+
 	return write_pt;
 
 out_unlock:
 	spin_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(pfn);
-	return 0;
+switch_cr3:
+	if (curr_cr3 != vcpu->arch.cr3) {
+		vcpu->arch.cr3 = curr_cr3;
+		paging_new_cr3(vcpu);
+		kvm_mmu_reload(vcpu);
+	}
+
+	return write_pt;
 }
 
 static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 08/11] Add "wait for page" hypercall.
  2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
                   ` (6 preceding siblings ...)
  2009-11-01 11:56 ` [PATCH 07/11] Retry fault before vmentry Gleb Natapov
@ 2009-11-01 11:56 ` Gleb Natapov
  2009-11-02 13:05   ` Avi Kivity
  2009-11-01 11:56 ` [PATCH 09/11] Maintain preemptability count even for !CONFIG_PREEMPT kernels Gleb Natapov
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel

We want to be able to inject an async page fault into the guest even if the
guest is not executing userspace code. In that case the guest may receive the
async page fault in a non-sleepable context. It can then make the "wait for
page" hypercall; the vcpu will be put to sleep until the page is swapped in,
and the guest can continue without rescheduling.
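
For reference, a rough sketch of how the guest side could use the hypercall
from its page fault handler when it cannot sleep (the real guest wiring is
added by a later patch in this series; this snippet is illustrative only):

	int cpu = get_cpu();
	int idle = idle_cpu(cpu);
	put_cpu();

	if (idle || preempt_count())
		/* cannot reschedule: block the whole vcpu until the page is in */
		kvm_hypercall0(KVM_HC_WAIT_FOR_ASYNC_PF);
	else
		apf_task_wait(current, token);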

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    2 ++
 arch/x86/kvm/mmu.c              |   35 ++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmutrace.h         |   19 +++++++++++++++++++
 arch/x86/kvm/x86.c              |    5 +++++
 include/linux/kvm_para.h        |    1 +
 5 files changed, 61 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6c781ea..d404b14 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -456,6 +456,7 @@ struct kvm_vm_stat {
 
 struct kvm_vcpu_stat {
 	u32 pf_fixed;
+	u32 pf_async_wait;
 	u32 pf_guest;
 	u32 tlb_flush;
 	u32 invlpg;
@@ -676,6 +677,7 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
 void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
 void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
+int kvm_pv_wait_for_async_pf(struct kvm_vcpu *vcpu);
 
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index abe1ce9..3d33994 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2281,7 +2281,7 @@ static void async_pf_execute(struct slow_work *work)
 					apf->gva);
 
 	if (waitqueue_active(q))
-		wake_up_interruptible(q);
+		wake_up(q);
 
 	mmdrop(apf->mm);
 }
@@ -2351,6 +2351,39 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 	async_pf_work_free(work);
 }
 
+static bool kvm_asyc_pf_is_done(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmu_async_pf *p, *node;
+	bool found = false;
+
+	spin_lock(&vcpu->arch.mmu_async_pf_lock);
+	list_for_each_entry_safe(p, node, &vcpu->arch.mmu_async_pf_done, link) {
+		if (p->guest_task != vcpu->arch.pv_shm->current_task)
+			continue;
+		list_del(&p->link);
+		found = true;
+		break;
+	}
+	spin_unlock(&vcpu->arch.mmu_async_pf_lock);
+	if (found) {
+		vcpu->arch.mmu.page_fault(vcpu, (gpa_t)-1, p->gva,
+					  p->error_code);
+		put_page(p->page);
+		async_pf_work_free(p);
+		trace_kvm_mmu_async_pf_wait(vcpu->arch.pv_shm->current_task, 0);
+	}
+	return found;
+}
+
+int kvm_pv_wait_for_async_pf(struct kvm_vcpu *vcpu)
+{
+	++vcpu->stat.pf_async_wait;
+	trace_kvm_mmu_async_pf_wait(vcpu->arch.pv_shm->current_task, 1);
+	wait_event(vcpu->wq, kvm_asyc_pf_is_done(vcpu));
+
+	return 0;
+}
+
 static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 {
 	struct kvm_segment kvm_seg;
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index d6dd63c..a74f718 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -274,6 +274,25 @@ TRACE_EVENT(
 		  __entry->gva, __entry->address, page_to_pfn(__entry->page))
 );
 
+TRACE_EVENT(
+	kvm_mmu_async_pf_wait,
+	TP_PROTO(u64 task, bool wait),
+	TP_ARGS(task, wait),
+
+	TP_STRUCT__entry(
+		__field(u64, task)
+		__field(bool, wait)
+		),
+
+	TP_fast_assign(
+		__entry->task = task;
+		__entry->wait = wait;
+		),
+
+	TP_printk("task %#llx %s", __entry->task, __entry->wait ?
+		  "waits for PF" : "end wait for PF")
+);
+
 #endif /* _TRACE_KVMMMU_H */
 
 /* This part must be outside protection */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e6bd3ad..9208796 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -109,6 +109,7 @@ static DEFINE_PER_CPU(struct kvm_shared_msrs, shared_msrs);
 
 struct kvm_stats_debugfs_item debugfs_entries[] = {
 	{ "pf_fixed", VCPU_STAT(pf_fixed) },
+	{ "pf_async_wait", VCPU_STAT(pf_async_wait) },
 	{ "pf_guest", VCPU_STAT(pf_guest) },
 	{ "tlb_flush", VCPU_STAT(tlb_flush) },
 	{ "invlpg", VCPU_STAT(invlpg) },
@@ -3484,6 +3485,10 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	case KVM_HC_SETUP_SHM:
 		r = kvm_pv_setup_shm(vcpu, a0, a1, a2, &ret);
 		break;
+	case KVM_HC_WAIT_FOR_ASYNC_PF:
+		r = kvm_pv_wait_for_async_pf(vcpu);
+		ret = 0;
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index 1c37495..50296a6 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -19,6 +19,7 @@
 #define KVM_HC_VAPIC_POLL_IRQ		1
 #define KVM_HC_MMU_OP			2
 #define KVM_HC_SETUP_SHM		3
+#define KVM_HC_WAIT_FOR_ASYNC_PF	4
 
 /*
  * hypercalls use architecture specific
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 09/11] Maintain preemptability count even for !CONFIG_PREEMPT kernels
  2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
                   ` (7 preceding siblings ...)
  2009-11-01 11:56 ` [PATCH 08/11] Add "wait for page" hypercall Gleb Natapov
@ 2009-11-01 11:56 ` Gleb Natapov
  2009-11-02  9:24   ` Ingo Molnar
  2009-11-01 11:56 ` [PATCH 10/11] Handle async PF in non preemptable context Gleb Natapov
  2009-11-01 11:56 ` [PATCH 11/11] Send async PF when guest is not in userspace too Gleb Natapov
  10 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel

Do not preempt the kernel. Just maintain the preempt counter so code can tell
whether the current task is allowed to sleep.
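
A minimal sketch of the kind of check this enables even on !CONFIG_PREEMPT
kernels (cannot_sleep_path() is a placeholder, not part of this patch):

	if (in_atomic() || irqs_disabled())
		cannot_sleep_path();	/* e.g. the "wait for page" hypercall */
	else
		schedule();		/* count is zero, sleeping is safe */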

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 include/linux/hardirq.h |    6 ++----
 include/linux/preempt.h |   22 ++++++++++++++++------
 kernel/sched.c          |    6 ------
 lib/kernel_lock.c       |    1 +
 4 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 6d527ee..a6b6040 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -92,12 +92,11 @@
  */
 #define in_nmi()	(preempt_count() & NMI_MASK)
 
+#define PREEMPT_CHECK_OFFSET 1
 #if defined(CONFIG_PREEMPT)
 # define PREEMPT_INATOMIC_BASE kernel_locked()
-# define PREEMPT_CHECK_OFFSET 1
 #else
 # define PREEMPT_INATOMIC_BASE 0
-# define PREEMPT_CHECK_OFFSET 0
 #endif
 
 /*
@@ -116,12 +115,11 @@
 #define in_atomic_preempt_off() \
 		((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
 
+#define IRQ_EXIT_OFFSET (HARDIRQ_OFFSET-1)
 #ifdef CONFIG_PREEMPT
 # define preemptible()	(preempt_count() == 0 && !irqs_disabled())
-# define IRQ_EXIT_OFFSET (HARDIRQ_OFFSET-1)
 #else
 # define preemptible()	0
-# define IRQ_EXIT_OFFSET HARDIRQ_OFFSET
 #endif
 
 #if defined(CONFIG_SMP) || defined(CONFIG_GENERIC_HARDIRQS)
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 72b1a10..7d039ca 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -82,14 +82,24 @@ do { \
 
 #else
 
-#define preempt_disable()		do { } while (0)
-#define preempt_enable_no_resched()	do { } while (0)
-#define preempt_enable()		do { } while (0)
+#define preempt_disable() \
+do { \
+	inc_preempt_count(); \
+	barrier(); \
+} while (0)
+
+#define preempt_enable() \
+do { \
+	barrier(); \
+	dec_preempt_count(); \
+} while (0)
+
+#define preempt_enable_no_resched()	preempt_enable()
 #define preempt_check_resched()		do { } while (0)
 
-#define preempt_disable_notrace()		do { } while (0)
-#define preempt_enable_no_resched_notrace()	do { } while (0)
-#define preempt_enable_notrace()		do { } while (0)
+#define preempt_disable_notrace()		preempt_disable()
+#define preempt_enable_no_resched_notrace()	preempt_enable()
+#define preempt_enable_notrace()		preempt_enable()
 
 #endif
 
diff --git a/kernel/sched.c b/kernel/sched.c
index e886895..adb0415 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2561,10 +2561,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
 #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
 	p->oncpu = 0;
 #endif
-#ifdef CONFIG_PREEMPT
 	/* Want to start with kernel preemption disabled. */
 	task_thread_info(p)->preempt_count = 1;
-#endif
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 
 	put_cpu();
@@ -6944,11 +6942,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
 	spin_unlock_irqrestore(&rq->lock, flags);
 
 	/* Set the preempt count _outside_ the spinlocks! */
-#if defined(CONFIG_PREEMPT)
 	task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0);
-#else
-	task_thread_info(idle)->preempt_count = 0;
-#endif
 	/*
 	 * The idle tasks have their own, simple scheduling class:
 	 */
diff --git a/lib/kernel_lock.c b/lib/kernel_lock.c
index 39f1029..6e2659d 100644
--- a/lib/kernel_lock.c
+++ b/lib/kernel_lock.c
@@ -93,6 +93,7 @@ static inline void __lock_kernel(void)
  */
 static inline void __lock_kernel(void)
 {
+	preempt_disable();
 	_raw_spin_lock(&kernel_flag);
 }
 #endif
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 10/11] Handle async PF in non preemptable context.
  2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
                   ` (8 preceding siblings ...)
  2009-11-01 11:56 ` [PATCH 09/11] Maintain preemptability count even for !CONFIG_PREEMPT kernels Gleb Natapov
@ 2009-11-01 11:56 ` Gleb Natapov
  2009-11-01 11:56 ` [PATCH 11/11] Send async PF when guest is not in userspace too Gleb Natapov
  10 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel

If an async page fault is received by the idle task, or while preempt_count
is not zero, the guest cannot reschedule, so it makes the "wait for page"
hypercall and continues only after the page is ready.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kernel/kvm.c |   16 +++++++++++++++-
 1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 79d291f..1bd8b8d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -162,10 +162,24 @@ int kvm_handle_pf(struct pt_regs *regs, unsigned long error_code)
 	switch (reason) {
 	default:
 		return 0;
-	case KVM_PV_REASON_PAGE_NP:
+	case KVM_PV_REASON_PAGE_NP: {
+		int cpu, idle;
+		cpu = get_cpu();
+		idle = idle_cpu(cpu);
+		put_cpu();
+
+		/*
+		 * We cannot reschedule. Wait for page to be ready.
+		 */
+		if (idle || preempt_count()) {
+			kvm_hypercall0(KVM_HC_WAIT_FOR_ASYNC_PF);
+			break;
+		}
+
 		/* real page is missing. */
 		apf_task_wait(current, token);
 		break;
+	}
 	case KVM_PV_REASON_PAGE_READY:
 		apf_task_wake(token);
 		break;
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 11/11] Send async PF when guest is not in userspace too.
  2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
                   ` (9 preceding siblings ...)
  2009-11-01 11:56 ` [PATCH 10/11] Handle async PF in non preemptable context Gleb Natapov
@ 2009-11-01 11:56 ` Gleb Natapov
  10 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2009-11-01 11:56 UTC (permalink / raw)
  To: kvm; +Cc: linux-mm, linux-kernel


Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/mmu.c |   16 ++++++----------
 1 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3d33994..21ec65a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2358,7 +2358,7 @@ static bool kvm_asyc_pf_is_done(struct kvm_vcpu *vcpu)
 
 	spin_lock(&vcpu->arch.mmu_async_pf_lock);
 	list_for_each_entry_safe(p, node, &vcpu->arch.mmu_async_pf_done, link) {
-		if (p->guest_task != vcpu->arch.pv_shm->current_task)
+		if (p->token != vcpu->arch.pv_shm->param)
 			continue;
 		list_del(&p->link);
 		found = true;
@@ -2370,7 +2370,7 @@ static bool kvm_asyc_pf_is_done(struct kvm_vcpu *vcpu)
 					  p->error_code);
 		put_page(p->page);
 		async_pf_work_free(p);
-		trace_kvm_mmu_async_pf_wait(vcpu->arch.pv_shm->current_task, 0);
+		trace_kvm_mmu_async_pf_wait(vcpu->arch.pv_shm->param, 0);
 	}
 	return found;
 }
@@ -2378,7 +2378,7 @@ static bool kvm_asyc_pf_is_done(struct kvm_vcpu *vcpu)
 int kvm_pv_wait_for_async_pf(struct kvm_vcpu *vcpu)
 {
 	++vcpu->stat.pf_async_wait;
-	trace_kvm_mmu_async_pf_wait(vcpu->arch.pv_shm->current_task, 1);
+	trace_kvm_mmu_async_pf_wait(vcpu->arch.pv_shm->param, 1);
 	wait_event(vcpu->wq, kvm_asyc_pf_is_done(vcpu));
 
 	return 0;
@@ -2386,17 +2386,13 @@ int kvm_pv_wait_for_async_pf(struct kvm_vcpu *vcpu)
 
 static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 {
-	struct kvm_segment kvm_seg;
-
 	if (!vcpu->arch.pv_shm ||
 	    !(vcpu->arch.pv_shm->features & KVM_PV_SHM_FEATURES_ASYNC_PF) ||
-	    kvm_event_needs_reinjection(vcpu))
+	    kvm_event_needs_reinjection(vcpu) ||
+	    !kvm_x86_ops->interrupt_allowed(vcpu))
 		return false;
 
-	kvm_get_segment(vcpu, &kvm_seg, VCPU_SREG_CS);
-
-	/* is userspace code? TODO check VM86 mode */
-	return !!(kvm_seg.selector & 3);
+	return true;
 }
 
 static int setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t gva,
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/11] Add shared memory hypercall to PV Linux guest.
  2009-11-01 11:56 ` [PATCH 01/11] Add shared memory hypercall to PV Linux guest Gleb Natapov
@ 2009-11-02  4:27   ` Rik van Riel
  2009-11-02  7:07     ` Gleb Natapov
  2009-11-02 12:18   ` Avi Kivity
  1 sibling, 1 reply; 55+ messages in thread
From: Rik van Riel @ 2009-11-02  4:27 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On 11/01/2009 06:56 AM, Gleb Natapov wrote:
> Add hypercall that allows guest and host to setup per cpu shared
> memory.

While it is pretty obvious that we should implement
the asynchronous pagefaults for KVM, so a swap-in
of a page the host swapped out does not stall the
entire virtual CPU, I believe that adding extra
data accesses at context switch time may not be
the best tradeoff.

It may be better to simply tell the guest what
address is faulting (or give the guest some other
random unique number as a token).  Then, once the
host brings that page into memory, we can send a
signal to the guest with that same token.

The problem of finding the task(s) associated with
that token can be left to the guest.  A little more
complexity on the guest side, but probably worth it
if we can avoid adding cost to the context switch
path.

> +static void kvm_end_context_switch(struct task_struct *next)
> +{
> +	struct kvm_vcpu_pv_shm *pv_shm =
> +		per_cpu(kvm_vcpu_pv_shm, smp_processor_id());
> +
> +	if (!pv_shm)
> +		return;
> +
> +	pv_shm->current_task = (u64)next;
> +}
> +



-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/11] Add shared memory hypercall to PV Linux guest.
  2009-11-02  4:27   ` Rik van Riel
@ 2009-11-02  7:07     ` Gleb Natapov
  0 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2009-11-02  7:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: kvm, linux-mm, linux-kernel

On Sun, Nov 01, 2009 at 11:27:15PM -0500, Rik van Riel wrote:
> On 11/01/2009 06:56 AM, Gleb Natapov wrote:
> >Add hypercall that allows guest and host to setup per cpu shared
> >memory.
> 
> While it is pretty obvious that we should implement
> the asynchronous pagefaults for KVM, so a swap-in
> of a page the host swapped out does not stall the
> entire virtual CPU, I believe that adding extra
> data accesses at context switch time may not be
> the best tradeoff.
> 
> It may be better to simply tell the guest what
> address is faulting (or give the guest some other
> random unique number as a token).  Then, once the
> host brings that page into memory, we can send a
> signal to the guest with that same token.
> 
> The problem of finding the task(s) associated with
> that token can be left to the guest.  A little more
> complexity on the guest side, but probably worth it
> if we can avoid adding cost to the context switch
> path.
> 
This is precisely what this series implements. The function below
is a leftover from the previous implementation, not used by the rest of the
patch and removed by a later patch. Just a leftover from the rebase. Sorry
about that. Will be fixed in future submissions.

> >+static void kvm_end_context_switch(struct task_struct *next)
> >+{
> >+	struct kvm_vcpu_pv_shm *pv_shm =
> >+		per_cpu(kvm_vcpu_pv_shm, smp_processor_id());
> >+
> >+	if (!pv_shm)
> >+		return;
> >+
> >+	pv_shm->current_task = (u64)next;
> >+}
> >+
> 
> 
> 
> -- 
> All rights reversed.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-01 11:56 ` [PATCH 02/11] Add "handle page fault" PV helper Gleb Natapov
@ 2009-11-02  9:22   ` Ingo Molnar
  2009-11-02 16:04     ` Gleb Natapov
  2009-11-02 19:03     ` Rik van Riel
  0 siblings, 2 replies; 55+ messages in thread
From: Ingo Molnar @ 2009-11-02  9:22 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel, H. Peter Anvin, Thomas Gleixner


* Gleb Natapov <gleb@redhat.com> wrote:

> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f4cee90..14707dc 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -952,6 +952,9 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
>  	int write;
>  	int fault;
>  
> +	if (arch_handle_page_fault(regs, error_code))
> +		return;
> +

This patch is not acceptable unless it's done cleaner. Currently we 
already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace, 
notifier), and this adds a fourth one. Please consolidate them into a 
single callback site, this is a hotpath on x86.

And please always Cc: the x86 maintainers to patches that touch x86 
code.

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/11] Export __get_user_pages_fast.
  2009-11-01 11:56 ` [PATCH 04/11] Export __get_user_pages_fast Gleb Natapov
@ 2009-11-02  9:23   ` Ingo Molnar
  0 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2009-11-02  9:23 UTC (permalink / raw)
  To: Gleb Natapov, Peter Zijlstra, Thomas Gleixner, H. Peter Anvin,
	Nick Piggin
  Cc: kvm, linux-mm, linux-kernel


* Gleb Natapov <gleb@redhat.com> wrote:

> 
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/x86/mm/gup.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 71da1bc..cea0dfe 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -8,6 +8,7 @@
>  #include <linux/mm.h>
>  #include <linux/vmstat.h>
>  #include <linux/highmem.h>
> +#include <linux/module.h>
>  
>  #include <asm/pgtable.h>
>  
> @@ -274,6 +275,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  
>  	return nr;
>  }
> +EXPORT_SYMBOL_GPL(__get_user_pages_fast);

Lack of explanation in the changelog and lack of Cc:s. I fixed the 
latter.

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/11] Maintain preemptability count even for !CONFIG_PREEMPT kernels
  2009-11-01 11:56 ` [PATCH 09/11] Maintain preemptability count even for !CONFIG_PREEMPT kernels Gleb Natapov
@ 2009-11-02  9:24   ` Ingo Molnar
  0 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2009-11-02  9:24 UTC (permalink / raw)
  To: Gleb Natapov, Peter Zijlstra, Thomas Gleixner, H. Peter Anvin
  Cc: kvm, linux-mm, linux-kernel


* Gleb Natapov <gleb@redhat.com> wrote:

> Do not preempt kernel. Just maintain counter to know if task can sleep.
> 
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  include/linux/hardirq.h |    6 ++----
>  include/linux/preempt.h |   22 ++++++++++++++++------
>  kernel/sched.c          |    6 ------
>  lib/kernel_lock.c       |    1 +
>  4 files changed, 19 insertions(+), 16 deletions(-)

Lack of explanation in the changelog and lack of Cc:s.

This whole patch-set should be Cc:-ed to a lot more people, and the acks 
of various maintainers are needed before it can be applied.

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/11] Add shared memory hypercall to PV Linux guest.
  2009-11-01 11:56 ` [PATCH 01/11] Add shared memory hypercall to PV Linux guest Gleb Natapov
  2009-11-02  4:27   ` Rik van Riel
@ 2009-11-02 12:18   ` Avi Kivity
  2009-11-02 16:18     ` Gleb Natapov
  1 sibling, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2009-11-02 12:18 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On 11/01/2009 01:56 PM, Gleb Natapov wrote:
> Add hypercall that allows guest and host to setup per cpu shared
> memory.
>
>    

Better to set this up as an MSR (with bit zero enabling, bits 1-5 
features, and 64-byte alignment).  This allows auto-reset on INIT and 
live migration using the existing MSR save/restore infrastructure.
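
A rough sketch of what that could look like on the guest side (the MSR index,
bit names and variable names here are made up for illustration, nothing in
this patch set defines them):

	#define MSR_KVM_PV_SHM		0x4b564d00
	#define KVM_PV_SHM_ENABLE	(1ULL << 0)

	static DEFINE_PER_CPU(struct kvm_vcpu_pv_shm, pv_shm) __aligned(64);

	wrmsrl(MSR_KVM_PV_SHM,
	       __pa(&per_cpu(pv_shm, cpu)) | KVM_PV_SHM_ENABLE);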

>   arch/x86/include/asm/kvm_host.h |    3 +
>   arch/x86/include/asm/kvm_para.h |   11 +++++
>   arch/x86/kernel/kvm.c           |   82 +++++++++++++++++++++++++++++++++++++++
>   arch/x86/kernel/setup.c         |    1 +
>   arch/x86/kernel/smpboot.c       |    3 +
>   arch/x86/kvm/x86.c              |   70 +++++++++++++++++++++++++++++++++
>   include/linux/kvm.h             |    1 +
>   include/linux/kvm_para.h        |    4 ++
>   8 files changed, 175 insertions(+), 0 deletions(-)
>    

Please separate into guest and host patches.

> +#define KVM_PV_SHM_VERSION 1
>    

versions = bad, feature bits = good

> +
> +#define KVM_PV_SHM_FEATURES_ASYNC_PF		(1<<  0)
> +
> +struct kvm_vcpu_pv_shm {
> +	__u64 features;
> +	__u64 reason;
> +	__u64 param;
> +};
> +
>    

Some documentation for this?

Also, the name should reflect the pv pagefault use.  For other uses we 
can register other areas.

>   #define MMU_QUEUE_SIZE 1024
>
> @@ -37,6 +41,7 @@ struct kvm_para_state {
>   };
>
>   static DEFINE_PER_CPU(struct kvm_para_state, para_state);
> +static DEFINE_PER_CPU(struct kvm_vcpu_pv_shm *, kvm_vcpu_pv_shm);
>    

Easier to put the entire structure here, not a pointer.

> +
> +static int kvm_pv_reboot_notify(struct notifier_block *nb,
> +				unsigned long code, void *unused)
> +{
> +	if (code == SYS_RESTART)
> +		on_each_cpu(kvm_pv_unregister_shm, NULL, 1);
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block kvm_pv_reboot_nb = {
> +        .notifier_call = kvm_pv_reboot_notify,
> +};
>    

Is this called on kexec, or do we need another hook?

> +static int kvm_pv_setup_shm(struct kvm_vcpu *vcpu, unsigned long gpa,
> +			    unsigned long size, unsigned long version,
> +			    unsigned long *ret)
> +{
> +	addr = gfn_to_hva(vcpu->kvm, gfn);
> +	if (kvm_is_error_hva(addr))
> +		return -EFAULT;
> +
> +	/* pin page with pv shared memory */
> +	down_read(&mm->mmap_sem);
> +	r = get_user_pages(current, mm, addr, 1, 1, 0,&vcpu->arch.pv_shm_page,
> +			   NULL);
> +	up_read(&mm->mmap_sem);
>    

This fails if the memory area straddles a page boundary.  Aligning would 
solve this.  I prefer using put_user(), though, rather than a permanent 
get_user_pages().
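
For illustration, one way the host could touch the area without pinning it
(pv_shm_gpa is a hypothetical field holding the registered guest address,
not something this patch defines):

	kvm_write_guest(vcpu->kvm,
			vcpu->arch.pv_shm_gpa +
			offsetof(struct kvm_vcpu_pv_shm, reason),
			&reason, sizeof(reason));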


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/11] Handle asynchronous page fault in a PV guest.
  2009-11-01 11:56 ` [PATCH 03/11] Handle asynchronous page fault in a PV guest Gleb Natapov
@ 2009-11-02 12:38   ` Avi Kivity
  2009-11-02 15:54     ` Gleb Natapov
  2009-11-03 14:14   ` Marcelo Tosatti
  1 sibling, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2009-11-02 12:38 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On 11/01/2009 01:56 PM, Gleb Natapov wrote:
> Asynchronous page fault notifies vcpu that page it is trying to access
> is swapped out by a host. In response guest puts a task that caused the
> fault to sleep until page is swapped in again. When missing page is
> brought back into the memory guest is notified and task resumes execution.
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 90708b7..61e2aa3 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -52,6 +52,9 @@ struct kvm_mmu_op_release_pt {
>
>   #define KVM_PV_SHM_FEATURES_ASYNC_PF		(1 << 0)
>
> +#define KVM_PV_REASON_PAGE_NP 1
> +#define KVM_PV_REASON_PAGE_READY 2
>    

_NOT_PRESENT would improve readability.

> +static void apf_task_wait(struct task_struct *tsk, u64 token)
>   {
> +	u64 key = hash_64(token, KVM_TASK_SLEEP_HASHBITS);
> +	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> +	struct kvm_task_sleep_node n, *e;
> +	DEFINE_WAIT(wait);
> +
> +	spin_lock(&b->lock);
> +	e = _find_apf_task(b, token);
> +	if (e) {
> +		/* dummy entry exist ->  wake up was delivered ahead of PF */
> +		hlist_del(&e->link);
> +		kfree(e);
> +		spin_unlock(&b->lock);
> +		return;
> +	}
> +
> +	n.token = token;
> +	init_waitqueue_head(&n.wq);
> +	hlist_add_head(&n.link, &b->list);
> +	spin_unlock(&b->lock);
> +
> +	for (;;) {
> +		prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> +		if (hlist_unhashed(&n.link))
> +			break;
>    

Don't you need locking here?  At least for the implied memory barriers.

> +		schedule();
> +	}
> +	finish_wait(&n.wq, &wait);
> +
> +	return;
> +}
> +
> +static void apf_task_wake(u64 token)
> +{
> +	u64 key = hash_64(token, KVM_TASK_SLEEP_HASHBITS);
> +	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> +	struct kvm_task_sleep_node *n;
> +
> +	spin_lock(&b->lock);
> +	n = _find_apf_task(b, token);
> +	if (!n) {
> +		/* PF was not yet handled. Add dummy entry for the token */
> +		n = kmalloc(sizeof(*n), GFP_ATOMIC);
> +		if (!n) {
> +			printk(KERN_EMERG"async PF can't allocate memory\n");
>    

Worrying.  We could have an emergency pool of one node per cpu, and 
disable apf if we use it until it's returned.  But that's a lot of 
complexity for an edge case, so a simpler solution would be welcome.
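
A sketch of the per-cpu emergency node idea, just to make the suggestion
concrete (names here are illustrative):

	static DEFINE_PER_CPU(struct kvm_task_sleep_node, apf_spare_node);
	static DEFINE_PER_CPU(bool, apf_spare_in_use);

	n = kmalloc(sizeof(*n), GFP_ATOMIC);
	if (!n && !__get_cpu_var(apf_spare_in_use)) {
		n = &__get_cpu_var(apf_spare_node);
		__get_cpu_var(apf_spare_in_use) = true;
		/* and disable async PF until the spare node is returned */
	}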

> +int kvm_handle_pf(struct pt_regs *regs, unsigned long error_code)
> +{
> +	u64 reason, token;
>   	struct kvm_vcpu_pv_shm *pv_shm =
>   		per_cpu(kvm_vcpu_pv_shm, smp_processor_id());
>
>   	if (!pv_shm)
> -		return;
> +		return 0;
> +
> +	reason = pv_shm->reason;
> +	pv_shm->reason = 0;
> +
> +	token = pv_shm->param;
> +
> +	switch (reason) {
> +	default:
> +		return 0;
> +	case KVM_PV_REASON_PAGE_NP:
> +		/* real page is missing. */
> +		apf_task_wait(current, token);
> +		break;
> +	case KVM_PV_REASON_PAGE_READY:
> +		apf_task_wake(token);
> +		break;
> +	}
>    

Ah, reason is not a bitmask but an enumerator.  __u32 is more friendly 
to i386 in that case.

Much of the code here is arch independent and would work well on non-x86 
kvm ports.  But we can always lay the burden of moving it on the non-x86 
maintainers.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 06/11] Inject asynchronous page fault into a guest if page is swapped out.
  2009-11-01 11:56 ` [PATCH 06/11] Inject asynchronous page fault into a guest if page is swapped out Gleb Natapov
@ 2009-11-02 12:56   ` Avi Kivity
  2009-11-02 15:41     ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2009-11-02 12:56 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On 11/01/2009 01:56 PM, Gleb Natapov wrote:
> If guest access swapped out memory do not swap it in from vcpu thread
> context. Setup slow work to do swapping and send async page fault to
> a guest.
>
> Allow async page fault injection only when guest is in user mode since
> otherwise guest may be in non-sleepable context and will not be able to
> reschedule.
>    

That loses us page cache accesses, which may be the majority of accesses 
in some workloads.

If we allow the guest to ignore a fault, and ensure that a second access 
to an apf page from the same vcpu doesn't trigger another apf, we can 
simply ignore the apf in a guest when we can't schedule.

Probably best done with an enable bit for kernel-mode apfs.
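
For example, a second feature bit along these lines (the bit name is
illustrative, not something defined by the series):

	#define KVM_PV_SHM_FEATURES_ASYNC_PF_KERNEL	(1 << 1)

	if (kvm_x86_ops->get_cpl(vcpu) == 0 &&
	    !(vcpu->arch.pv_shm->features & KVM_PV_SHM_FEATURES_ASYNC_PF_KERNEL))
		return false;	/* guest can't take apf here, swap in synchronously */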

> Signed-off-by: Gleb Natapov<gleb@redhat.com>
> ---
>   arch/x86/include/asm/kvm_host.h |   20 +++
>   arch/x86/kvm/mmu.c              |  243 ++++++++++++++++++++++++++++++++++++++-
>   arch/x86/kvm/mmutrace.h         |   60 ++++++++++
>   arch/x86/kvm/paging_tmpl.h      |   16 +++-
>   arch/x86/kvm/x86.c              |   22 +++-
>    

Much of the code is generic, please move it to virt/kvm.

> +static void async_pf_execute(struct slow_work *work)
> +{
> +	struct page *page[1];
>    

No need to make it an array, just pass its address.

> +	struct kvm_mmu_async_pf *apf =
> +		container_of(work, struct kvm_mmu_async_pf, work);
> +	wait_queue_head_t *q = &apf->vcpu->wq;
> +
> +	might_sleep();
> +
> +	down_read(&apf->mm->mmap_sem);
> +	get_user_pages(current, apf->mm, apf->addr, 1, 1, 0, page, NULL);
> +	up_read(&apf->mm->mmap_sem);
> +
> +	spin_lock(&apf->vcpu->arch.mmu_async_pf_lock);
> +	list_add_tail(&apf->link, &apf->vcpu->arch.mmu_async_pf_done);
> +	apf->page = page[0];
> +	spin_unlock(&apf->vcpu->arch.mmu_async_pf_lock);
> +
> +	trace_kvm_mmu_async_pf_executed(apf->addr, apf->page, apf->token,
> +					apf->gva);
>    

_completed, but maybe better placed in vcpu context.

> +
> +static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_segment kvm_seg;
> +
> +	if (!vcpu->arch.pv_shm ||
> +	    !(vcpu->arch.pv_shm->features & KVM_PV_SHM_FEATURES_ASYNC_PF) ||
> +	    kvm_event_needs_reinjection(vcpu))
> +		return false;
> +
> +	kvm_get_segment(vcpu, &kvm_seg, VCPU_SREG_CS);
> +
> +	/* is userspace code? TODO check VM86 mode */
> +	return !!(kvm_seg.selector & 3);
>    

There's a ->get_cpl() which is slightly faster.  Note vm86 is perfectly 
fine for async pf.
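
A one-line sketch of the check using that callback (userspace, and vm86,
should show up as CPL 3 here):

	return kvm_x86_ops->get_cpl(vcpu) == 3;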

> +static int setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
> +{
> +	struct kvm_mmu_async_pf *work;
> +
> +	/* setup slow work */
> +
> +	/* do alloc atomic since if we are going to sleep anyway we
> +	   may as well sleep faulting in page */
> +	work = kmem_cache_zalloc(mmu_async_pf_cache, GFP_ATOMIC);
> +	if (!work)
> +		return 0;
> +
> +	atomic_set(&work->used, 1);
> +	work->page = NULL;
> +	work->vcpu = vcpu;
> +	work->gva = gva;
> +	work->addr = gfn_to_hva(vcpu->kvm, gfn);
> +	work->token = (vcpu->arch.async_pf_id++ << 12) | vcpu->vcpu_id;
>    

The shift truncates async_pf_id.
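
A sketch of the obvious fix, assuming async_pf_id stays a 32-bit counter:

	work->token = ((u64)vcpu->arch.async_pf_id++ << 12) | vcpu->vcpu_id;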

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/11] Retry fault before vmentry
  2009-11-01 11:56 ` [PATCH 07/11] Retry fault before vmentry Gleb Natapov
@ 2009-11-02 13:03   ` Avi Kivity
  0 siblings, 0 replies; 55+ messages in thread
From: Avi Kivity @ 2009-11-02 13:03 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On 11/01/2009 01:56 PM, Gleb Natapov wrote:
> When page is swapped in it is mapped into guest memory only after guest
> tries to access it again and generate another fault. To save this fault
> we can map it immediately since we know that guest is going to access
> the page.
>
>    
>
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 9fe2ecd..b1fe61f 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -375,7 +375,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>    *  Returns: 1 if we need to emulate the instruction, 0 otherwise, or
>    *           a negative value on error.
>    */
> -static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
> +static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t addr,
>   			       u32 error_code)
>   {
>   	int write_fault = error_code & PFERR_WRITE_MASK;
> @@ -388,6 +388,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
>   	pfn_t pfn;
>   	int level = PT_PAGE_TABLE_LEVEL;
>   	unsigned long mmu_seq;
> +	gpa_t curr_cr3 = vcpu->arch.cr3;
>
>   	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
>   	kvm_mmu_audit(vcpu, "pre page fault");
> @@ -396,6 +397,13 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
>   	if (r)
>   		return r;
>
> +	if (curr_cr3 != cr3) {
> +		vcpu->arch.cr3 = cr3;
> +		paging_new_cr3(vcpu);
> +		if (kvm_mmu_reload(vcpu))
> +			goto switch_cr3;
> +	}
> +
>    

This is a little frightening.  I can't put my finger on anything 
though.  But playing with cr3 under the guest's feet worries me.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/11] Add "wait for page" hypercall.
  2009-11-01 11:56 ` [PATCH 08/11] Add "wait for page" hypercall Gleb Natapov
@ 2009-11-02 13:05   ` Avi Kivity
  2009-11-02 15:13     ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2009-11-02 13:05 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On 11/01/2009 01:56 PM, Gleb Natapov wrote:
> We want to be able to inject async pagefault into guest event if a guest
> is not executing userspace code. But in this case guest may receive
> async page fault in non-sleepable context. In this case it will be
> able to make "wait for page" hypercall vcpu will be put to sleep until
> page is swapped in and guest can continue without reschedule.
>    

What's wrong with just 'hlt' and checking in the guest?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/11] Add "wait for page" hypercall.
  2009-11-02 13:05   ` Avi Kivity
@ 2009-11-02 15:13     ` Gleb Natapov
  2009-11-02 15:19       ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-02 15:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-mm, linux-kernel

On Mon, Nov 02, 2009 at 03:05:11PM +0200, Avi Kivity wrote:
> On 11/01/2009 01:56 PM, Gleb Natapov wrote:
> >We want to be able to inject async pagefault into guest event if a guest
> >is not executing userspace code. But in this case guest may receive
> >async page fault in non-sleepable context. In this case it will be
> >able to make "wait for page" hypercall vcpu will be put to sleep until
> >page is swapped in and guest can continue without reschedule.
> 
> What's wrong with just 'hlt' and checking in the guest?
> 
Halting here will leave the vcpu with interrupts disabled and this will prevent
"wake up" signal delivery. Enabling interrupts is also not an option
since we can't be sure that the vcpu can process an interrupt at this point.
And we can't allow NMI delivery for the same reason.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/11] Add "wait for page" hypercall.
  2009-11-02 15:13     ` Gleb Natapov
@ 2009-11-02 15:19       ` Avi Kivity
  0 siblings, 0 replies; 55+ messages in thread
From: Avi Kivity @ 2009-11-02 15:19 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On 11/02/2009 05:13 PM, Gleb Natapov wrote:
> On Mon, Nov 02, 2009 at 03:05:11PM +0200, Avi Kivity wrote:
>    
>> On 11/01/2009 01:56 PM, Gleb Natapov wrote:
>>      
>>> We want to be able to inject async pagefault into guest event if a guest
>>> is not executing userspace code. But in this case guest may receive
>>> async page fault in non-sleepable context. In this case it will be
>>> able to make "wait for page" hypercall vcpu will be put to sleep until
>>> page is swapped in and guest can continue without reschedule.
>>>        
>> What's wrong with just 'hlt' and checking in the guest?
>>
>>      
> Halting here will leave vcpu with interrupt disabled and this will prevent
> "wake up" signal delivery.

Page faults can be delivered with interrupts disabled.

>   Enabling interrupts is also not an options
> since we can't be sure that vcpu can process interrupt at this point.
>    

That's too bad, allowing interrupts in this context can help maintain 
responsiveness.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 06/11] Inject asynchronous page fault into a guest if page is swapped out.
  2009-11-02 12:56   ` Avi Kivity
@ 2009-11-02 15:41     ` Gleb Natapov
  0 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2009-11-02 15:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-mm, linux-kernel

On Mon, Nov 02, 2009 at 02:56:46PM +0200, Avi Kivity wrote:
> On 11/01/2009 01:56 PM, Gleb Natapov wrote:
> >If guest access swapped out memory do not swap it in from vcpu thread
> >context. Setup slow work to do swapping and send async page fault to
> >a guest.
> >
> >Allow async page fault injection only when guest is in user mode since
> >otherwise guest may be in non-sleepable context and will not be able to
> >reschedule.
> 
> That loses us page cache accesses, which may be the majority of
> accesses in some workloads.
> 
This is addressed later in the patch series.

> If we allow the guest to ignore a fault, and ensure that a second
> access to an apf page from the same vcpu doesn't trigger another
> apf, we can simply ignore the apf in a guest when we can't schedule.
> 
> Probably best done with an enable bit for kernel-mode apfs.
> 
> >Signed-off-by: Gleb Natapov<gleb@redhat.com>
> >---
> >  arch/x86/include/asm/kvm_host.h |   20 +++
> >  arch/x86/kvm/mmu.c              |  243 ++++++++++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/mmutrace.h         |   60 ++++++++++
> >  arch/x86/kvm/paging_tmpl.h      |   16 +++-
> >  arch/x86/kvm/x86.c              |   22 +++-
> 
> Much of the code is generic, please move it to virt/kvm.
> 
OK, will move the generic part to virt.

> >+static void async_pf_execute(struct slow_work *work)
> >+{
> >+	struct page *page[1];
> 
> No need to make it an array, just pass its address.
> 
OK

> >+	struct kvm_mmu_async_pf *apf =
> >+		container_of(work, struct kvm_mmu_async_pf, work);
> >+	wait_queue_head_t *q =&apf->vcpu->wq;
> >+
> >+	might_sleep();
> >+
> >+	down_read(&apf->mm->mmap_sem);
> >+	get_user_pages(current, apf->mm, apf->addr, 1, 1, 0, page, NULL);
> >+	up_read(&apf->mm->mmap_sem);
> >+
> >+	spin_lock(&apf->vcpu->arch.mmu_async_pf_lock);
> >+	list_add_tail(&apf->link,&apf->vcpu->arch.mmu_async_pf_done);
> >+	apf->page = page[0];
> >+	spin_unlock(&apf->vcpu->arch.mmu_async_pf_lock);
> >+
> >+	trace_kvm_mmu_async_pf_executed(apf->addr, apf->page, apf->token,
> >+					apf->gva);
> 
> _completed, but maybe better placed in vcpu context.
> 
> >+
> >+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> >+{
> >+	struct kvm_segment kvm_seg;
> >+
> >+	if (!vcpu->arch.pv_shm ||
> >+	    !(vcpu->arch.pv_shm->features&  KVM_PV_SHM_FEATURES_ASYNC_PF) ||
> >+	    kvm_event_needs_reinjection(vcpu))
> >+		return false;
> >+
> >+	kvm_get_segment(vcpu,&kvm_seg, VCPU_SREG_CS);
> >+
> >+	/* is userspace code? TODO check VM86 mode */
> >+	return !!(kvm_seg.selector&  3);
> 
> There's a ->get_cpl() which is slightly faster.  Note vm86 is
> perfectly fine for async pf.
> 
OK. But the code is removed by following patches anyway.

> >+static int setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
> >+{
> >+	struct kvm_mmu_async_pf *work;
> >+
> >+	/* setup slow work */
> >+
> >+	/* do alloc atomic since if we are going to sleep anyway we
> >+	   may as well sleep faulting in page */
> >+	work = kmem_cache_zalloc(mmu_async_pf_cache, GFP_ATOMIC);
> >+	if (!work)
> >+		return 0;
> >+
> >+	atomic_set(&work->used, 1);
> >+	work->page = NULL;
> >+	work->vcpu = vcpu;
> >+	work->gva = gva;
> >+	work->addr = gfn_to_hva(vcpu->kvm, gfn);
> >+	work->token = (vcpu->arch.async_pf_id++<<  12) | vcpu->vcpu_id;
> 
> The shift truncates async_pf_id.
> 
Will fix.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/11] Handle asynchronous page fault in a PV guest.
  2009-11-02 12:38   ` Avi Kivity
@ 2009-11-02 15:54     ` Gleb Natapov
  0 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2009-11-02 15:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-mm, linux-kernel

On Mon, Nov 02, 2009 at 02:38:30PM +0200, Avi Kivity wrote:
> On 11/01/2009 01:56 PM, Gleb Natapov wrote:
> >Asynchronous page fault notifies vcpu that page it is trying to access
> >is swapped out by a host. In response guest puts a task that caused the
> >fault to sleep until page is swapped in again. When missing page is
> >brought back into the memory guest is notified and task resumes execution.
> >
> >diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> >index 90708b7..61e2aa3 100644
> >--- a/arch/x86/include/asm/kvm_para.h
> >+++ b/arch/x86/include/asm/kvm_para.h
> >@@ -52,6 +52,9 @@ struct kvm_mmu_op_release_pt {
> >
> >  #define KVM_PV_SHM_FEATURES_ASYNC_PF		(1<<  0)
> >
> >+#define KVM_PV_REASON_PAGE_NP 1
> >+#define KVM_PV_REASON_PAGE_READY 2
> 
> _NOT_PRESENT would improve readability.
> 
> >+static void apf_task_wait(struct task_struct *tsk, u64 token)
> >  {
> >+	u64 key = hash_64(token, KVM_TASK_SLEEP_HASHBITS);
> >+	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
> >+	struct kvm_task_sleep_node n, *e;
> >+	DEFINE_WAIT(wait);
> >+
> >+	spin_lock(&b->lock);
> >+	e = _find_apf_task(b, token);
> >+	if (e) {
> >+		/* dummy entry exist ->  wake up was delivered ahead of PF */
> >+		hlist_del(&e->link);
> >+		kfree(e);
> >+		spin_unlock(&b->lock);
> >+		return;
> >+	}
> >+
> >+	n.token = token;
> >+	init_waitqueue_head(&n.wq);
> >+	hlist_add_head(&n.link,&b->list);
> >+	spin_unlock(&b->lock);
> >+
> >+	for (;;) {
> >+		prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
> >+		if (hlist_unhashed(&n.link))
> >+			break;
> 
> Don't you need locking here?  At least for the implied memory barriers.
> 
Maybe a memory barrier will be enough. Will look at it.

> >+		schedule();
> >+	}
> >+	finish_wait(&n.wq,&wait);
> >+
> >+	return;
> >+}
> >+
> >+static void apf_task_wake(u64 token)
> >+{
> >+	u64 key = hash_64(token, KVM_TASK_SLEEP_HASHBITS);
> >+	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
> >+	struct kvm_task_sleep_node *n;
> >+
> >+	spin_lock(&b->lock);
> >+	n = _find_apf_task(b, token);
> >+	if (!n) {
> >+		/* PF was not yet handled. Add dummy entry for the token */
> >+		n = kmalloc(sizeof(*n), GFP_ATOMIC);
> >+		if (!n) {
> >+			printk(KERN_EMERG"async PF can't allocate memory\n");
> 
> Worrying.  We could have an emergency pool of one node per cpu, and
> disable apf if we use it until it's returned.  But that's a lot of
> complexity for an edge case, so a simpler solution would be welcome.
> 
Currently this code can't trigger since "wake up" is always sent on the
same vcpu as "not present", but I don't want this implementation detail
to be part of the guest/host interface. The idea you've described should be easy
to implement. Will look into it.

> >+int kvm_handle_pf(struct pt_regs *regs, unsigned long error_code)
> >+{
> >+	u64 reason, token;
> >  	struct kvm_vcpu_pv_shm *pv_shm =
> >  		per_cpu(kvm_vcpu_pv_shm, smp_processor_id());
> >
> >  	if (!pv_shm)
> >-		return;
> >+		return 0;
> >+
> >+	reason = pv_shm->reason;
> >+	pv_shm->reason = 0;
> >+
> >+	token = pv_shm->param;
> >+
> >+	switch (reason) {
> >+	default:
> >+		return 0;
> >+	case KVM_PV_REASON_PAGE_NP:
> >+		/* real page is missing. */
> >+		apf_task_wait(current, token);
> >+		break;
> >+	case KVM_PV_REASON_PAGE_READY:
> >+		apf_task_wake(token);
> >+		break;
> >+	}
> 
> Ah, reason is not a bitmask but an enumerator.  __u32 is more
> friendly to i386 in that case.
> 
OK. Will need padding for 64 bit host case.

> Much of the code here is arch independent and would work well on
> non-x86 kvm ports.  But we can always lay the burden of moving it on
> the non-x86 maintainers.
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-02  9:22   ` Ingo Molnar
@ 2009-11-02 16:04     ` Gleb Natapov
  2009-11-02 16:12       ` Ingo Molnar
  2009-11-02 19:03     ` Rik van Riel
  1 sibling, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-02 16:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm, linux-mm, linux-kernel, H. Peter Anvin, Thomas Gleixner

On Mon, Nov 02, 2009 at 10:22:14AM +0100, Ingo Molnar wrote:
> 
> * Gleb Natapov <gleb@redhat.com> wrote:
> 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index f4cee90..14707dc 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -952,6 +952,9 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> >  	int write;
> >  	int fault;
> >  
> > +	if (arch_handle_page_fault(regs, error_code))
> > +		return;
> > +
> 
> This patch is not acceptable unless it's done cleaner. Currently we 
> already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace, 
> notifier), and this adds a fourth one. Please consolidate them into a 
> single callback site, this is a hotpath on x86.
> 
This call is patched out by the paravirt patching mechanism so the overhead should be
zero for non-paravirt cases. What do you want to achieve by consolidating
them into a single callback? I mean the code will still exist and will have to be
executed on every #PF. Is the goal to move them out of line?

> And please always Cc: the x86 maintainers to patches that touch x86 
> code.
> 
Will do.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-02 16:04     ` Gleb Natapov
@ 2009-11-02 16:12       ` Ingo Molnar
  2009-11-02 16:22         ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2009-11-02 16:12 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, H. Peter Anvin, Thomas Gleixner,
	Frédéric Weisbecker


* Gleb Natapov <gleb@redhat.com> wrote:

> On Mon, Nov 02, 2009 at 10:22:14AM +0100, Ingo Molnar wrote:
> > 
> > * Gleb Natapov <gleb@redhat.com> wrote:
> > 
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index f4cee90..14707dc 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -952,6 +952,9 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > >  	int write;
> > >  	int fault;
> > >  
> > > +	if (arch_handle_page_fault(regs, error_code))
> > > +		return;
> > > +
> > 
> > This patch is not acceptable unless it's done cleaner. Currently we 
> > already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace, 
> > notifier), and this adds a fourth one. Please consolidate them into a 
> > single callback site, this is a hotpath on x86.
> > 
> This call is patched out by paravirt patching mechanism so overhead 
> should be zero for non paravirt cases. [...]

arch_handle_page_fault() isn't upstream yet - precisely what is the 
instruction sequence injected into do_page_fault() in the patched-out 
case?

> [...] What do you want to achieve by consolidate them into single 
> callback? [...]

Less bloat in a hotpath and a shared callback infrastructure.

> [...] I mean the code will still exist and will have to be executed on 
> every #PF. Is the goal to move them out of line?

The goal is to have a single callback site for all the users - which 
call-site is patched out ideally - on non-paravirt too if needed. Most 
of these callbacks/notifier-chains are inactive most of the time.

I.e. a very low overhead 'conditional callback' facility, and a single 
one - not just lots of them sprinkled around the code.
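
Something along these lines, purely as a sketch (the hook name is made up):

	/* one patched-out call site instead of several: */
	if (unlikely(x86_pf_callbacks(regs, error_code)))
		return;	/* kmemcheck, mmiotrace, notifiers, pv handled inside */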

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/11] Add shared memory hypercall to PV Linux guest.
  2009-11-02 12:18   ` Avi Kivity
@ 2009-11-02 16:18     ` Gleb Natapov
  2009-11-03  5:15       ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-02 16:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-mm, linux-kernel

On Mon, Nov 02, 2009 at 02:18:54PM +0200, Avi Kivity wrote:
> On 11/01/2009 01:56 PM, Gleb Natapov wrote:
> >Add hypercall that allows guest and host to setup per cpu shared
> >memory.
> >
> 
> Better to set this up as an MSR (with bit zero enabling, bits 1-5
> features, and 64-byte alignment).  This allows auto-reset on INIT
> and live migration using the existing MSR save/restore
> infrastructure.
> 
Hmm. Will do.

> >  arch/x86/include/asm/kvm_host.h |    3 +
> >  arch/x86/include/asm/kvm_para.h |   11 +++++
> >  arch/x86/kernel/kvm.c           |   82 +++++++++++++++++++++++++++++++++++++++
> >  arch/x86/kernel/setup.c         |    1 +
> >  arch/x86/kernel/smpboot.c       |    3 +
> >  arch/x86/kvm/x86.c              |   70 +++++++++++++++++++++++++++++++++
> >  include/linux/kvm.h             |    1 +
> >  include/linux/kvm_para.h        |    4 ++
> >  8 files changed, 175 insertions(+), 0 deletions(-)
> 
> Please separate into guest and host patches.
> 
OK.

> >+#define KVM_PV_SHM_VERSION 1
> 
> versions = bad, feature bits = good
> 
I have both! Do you want me to drop version?

> >+
> >+#define KVM_PV_SHM_FEATURES_ASYNC_PF		(1<<  0)
> >+
> >+struct kvm_vcpu_pv_shm {
> >+	__u64 features;
> >+	__u64 reason;
> >+	__u64 param;
> >+};
> >+
> 
> Some documentation for this?
> 
> Also, the name should reflect the pv pagefault use.  For other uses
> we can register other areas.
> 
I wanted it to be generic, but I am fine with making it apf specific.
That will allow making it smaller too.

> >  #define MMU_QUEUE_SIZE 1024
> >
> >@@ -37,6 +41,7 @@ struct kvm_para_state {
> >  };
> >
> >  static DEFINE_PER_CPU(struct kvm_para_state, para_state);
> >+static DEFINE_PER_CPU(struct kvm_vcpu_pv_shm *, kvm_vcpu_pv_shm);
> 
> Easier to put the entire structure here, not a pointer.
OK.

> 
> >+
> >+static int kvm_pv_reboot_notify(struct notifier_block *nb,
> >+				unsigned long code, void *unused)
> >+{
> >+	if (code == SYS_RESTART)
> >+		on_each_cpu(kvm_pv_unregister_shm, NULL, 1);
> >+	return NOTIFY_DONE;
> >+}
> >+
> >+static struct notifier_block kvm_pv_reboot_nb = {
> >+        .notifier_call = kvm_pv_reboot_notify,
> >+};
> 
> Is this called on kexec, or do we need another hook?
> 
This was added specifically for kexec to work. It was called in my test.

> >+static int kvm_pv_setup_shm(struct kvm_vcpu *vcpu, unsigned long gpa,
> >+			    unsigned long size, unsigned long version,
> >+			    unsigned long *ret)
> >+{
> >+	addr = gfn_to_hva(vcpu->kvm, gfn);
> >+	if (kvm_is_error_hva(addr))
> >+		return -EFAULT;
> >+
> >+	/* pin page with pv shared memory */
> >+	down_read(&mm->mmap_sem);
> >+	r = get_user_pages(current, mm, addr, 1, 1, 0,&vcpu->arch.pv_shm_page,
> >+			   NULL);
> >+	up_read(&mm->mmap_sem);
> 
> This fails if the memory area straddles a page boundary.  Aligning
Good point.

> would solve this.  I prefer using put_user() though than a permanent
> get_user_pages().
> 
I want to prevent it from being swapped out.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-02 16:12       ` Ingo Molnar
@ 2009-11-02 16:22         ` Gleb Natapov
  2009-11-02 16:29           ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-02 16:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kvm, linux-mm, linux-kernel, H. Peter Anvin, Thomas Gleixner,
	Frédéric Weisbecker

On Mon, Nov 02, 2009 at 05:12:48PM +0100, Ingo Molnar wrote:
> 
> * Gleb Natapov <gleb@redhat.com> wrote:
> 
> > On Mon, Nov 02, 2009 at 10:22:14AM +0100, Ingo Molnar wrote:
> > > 
> > > * Gleb Natapov <gleb@redhat.com> wrote:
> > > 
> > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > index f4cee90..14707dc 100644
> > > > --- a/arch/x86/mm/fault.c
> > > > +++ b/arch/x86/mm/fault.c
> > > > @@ -952,6 +952,9 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > >  	int write;
> > > >  	int fault;
> > > >  
> > > > +	if (arch_handle_page_fault(regs, error_code))
> > > > +		return;
> > > > +
> > > 
> > > This patch is not acceptable unless it's done cleaner. Currently we 
> > > already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace, 
> > > notifier), and this adds a fourth one. Please consolidate them into a 
> > > single callback site, this is a hotpath on x86.
> > > 
> > This call is patched out by paravirt patching mechanism so overhead 
> > should be zero for non paravirt cases. [...]
> 
> arch_handle_page_fault() isnt upstream yet - precisely what is the 
> instruction sequence injected into do_page_fault() in the patched-out 
> case?
> 
It is introduced by the same patch. The instruction inserted is: 
 xor %rax, %rax

> > [...] What do you want to achieve by consolidate them into single 
> > callback? [...]
> 
> Less bloat in a hotpath and a shared callback infrastructure.
> 
> > [...] I mean the code will still exist and will have to be executed on 
> > every #PF. Is the goal to move them out of line?
> 
> The goal is to have a single callback site for all the users - which 
> call-site is patched out ideally - on non-paravirt too if needed. Most 
> of these callbacks/notifier-chains have are inactive most of the time.
> 
> I.e. a very low overhead 'conditional callback' facility, and a single 
> one - not just lots of them sprinkled around the code.
> 
> 	Ingo

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-02 16:22         ` Gleb Natapov
@ 2009-11-02 16:29           ` Ingo Molnar
  2009-11-02 16:31             ` Gleb Natapov
  2009-11-02 17:42             ` Gleb Natapov
  0 siblings, 2 replies; 55+ messages in thread
From: Ingo Molnar @ 2009-11-02 16:29 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, H. Peter Anvin, Thomas Gleixner,
	Frédéric Weisbecker


* Gleb Natapov <gleb@redhat.com> wrote:

> On Mon, Nov 02, 2009 at 05:12:48PM +0100, Ingo Molnar wrote:
> > 
> > * Gleb Natapov <gleb@redhat.com> wrote:
> > 
> > > On Mon, Nov 02, 2009 at 10:22:14AM +0100, Ingo Molnar wrote:
> > > > 
> > > > * Gleb Natapov <gleb@redhat.com> wrote:
> > > > 
> > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > > index f4cee90..14707dc 100644
> > > > > --- a/arch/x86/mm/fault.c
> > > > > +++ b/arch/x86/mm/fault.c
> > > > > @@ -952,6 +952,9 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > > >  	int write;
> > > > >  	int fault;
> > > > >  
> > > > > +	if (arch_handle_page_fault(regs, error_code))
> > > > > +		return;
> > > > > +
> > > > 
> > > > This patch is not acceptable unless it's done cleaner. Currently we 
> > > > already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace, 
> > > > notifier), and this adds a fourth one. Please consolidate them into a 
> > > > single callback site, this is a hotpath on x86.
> > > > 
> > > This call is patched out by paravirt patching mechanism so overhead 
> > > should be zero for non paravirt cases. [...]
> > 
> > arch_handle_page_fault() isnt upstream yet - precisely what is the 
> > instruction sequence injected into do_page_fault() in the patched-out 
> > case?
> 
> It is introduced by the same patch. The instruction inserted is:
>  xor %rax, %rax

ok.

My observations still stand:

> > > [...] What do you want to achieve by consolidate them into single 
> > > callback? [...]
> > 
> > Less bloat in a hotpath and a shared callback infrastructure.
> > 
> > > [...] I mean the code will still exist and will have to be executed on 
> > > every #PF. Is the goal to move them out of line?
> > 
> > The goal is to have a single callback site for all the users - which 
> > call-site is patched out ideally - on non-paravirt too if needed. Most 
> > of these callbacks/notifier-chains have are inactive most of the time.
> > 
> > I.e. a very low overhead 'conditional callback' facility, and a single 
> > one - not just lots of them sprinkled around the code.

looks like a golden opportunity to get this right.

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-02 16:29           ` Ingo Molnar
@ 2009-11-02 16:31             ` Gleb Natapov
  2009-11-02 17:42             ` Gleb Natapov
  1 sibling, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2009-11-02 16:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kvm, linux-mm, linux-kernel, H. Peter Anvin, Thomas Gleixner,
	Frédéric Weisbecker

On Mon, Nov 02, 2009 at 05:29:41PM +0100, Ingo Molnar wrote:
> 
> * Gleb Natapov <gleb@redhat.com> wrote:
> 
> > On Mon, Nov 02, 2009 at 05:12:48PM +0100, Ingo Molnar wrote:
> > > 
> > > * Gleb Natapov <gleb@redhat.com> wrote:
> > > 
> > > > On Mon, Nov 02, 2009 at 10:22:14AM +0100, Ingo Molnar wrote:
> > > > > 
> > > > > * Gleb Natapov <gleb@redhat.com> wrote:
> > > > > 
> > > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > > > index f4cee90..14707dc 100644
> > > > > > --- a/arch/x86/mm/fault.c
> > > > > > +++ b/arch/x86/mm/fault.c
> > > > > > @@ -952,6 +952,9 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > > > >  	int write;
> > > > > >  	int fault;
> > > > > >  
> > > > > > +	if (arch_handle_page_fault(regs, error_code))
> > > > > > +		return;
> > > > > > +
> > > > > 
> > > > > This patch is not acceptable unless it's done cleaner. Currently we 
> > > > > already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace, 
> > > > > notifier), and this adds a fourth one. Please consolidate them into a 
> > > > > single callback site, this is a hotpath on x86.
> > > > > 
> > > > This call is patched out by paravirt patching mechanism so overhead 
> > > > should be zero for non paravirt cases. [...]
> > > 
> > > arch_handle_page_fault() isnt upstream yet - precisely what is the 
> > > instruction sequence injected into do_page_fault() in the patched-out 
> > > case?
> > 
> > It is introduced by the same patch. The instruction inserted is:
> >  xor %rax, %rax
> 
> ok.
> 
> My observations still stand:
> 
> > > > [...] What do you want to achieve by consolidate them into single 
> > > > callback? [...]
> > > 
> > > Less bloat in a hotpath and a shared callback infrastructure.
> > > 
> > > > [...] I mean the code will still exist and will have to be executed on 
> > > > every #PF. Is the goal to move them out of line?
> > > 
> > > The goal is to have a single callback site for all the users - which 
> > > call-site is patched out ideally - on non-paravirt too if needed. Most 
> > > of these callbacks/notifier-chains have are inactive most of the time.
> > > 
> > > I.e. a very low overhead 'conditional callback' facility, and a single 
> > > one - not just lots of them sprinkled around the code.
> 
> looks like a golden opportunity to get this right.
> 
I'll look into it. Expect questions from me ;)

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-02 16:29           ` Ingo Molnar
  2009-11-02 16:31             ` Gleb Natapov
@ 2009-11-02 17:42             ` Gleb Natapov
  2009-11-08 11:36               ` Ingo Molnar
  1 sibling, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-02 17:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kvm, linux-mm, linux-kernel, H. Peter Anvin, Thomas Gleixner,
	Frédéric Weisbecker

On Mon, Nov 02, 2009 at 05:29:41PM +0100, Ingo Molnar wrote:
> 
> * Gleb Natapov <gleb@redhat.com> wrote:
> 
> > On Mon, Nov 02, 2009 at 05:12:48PM +0100, Ingo Molnar wrote:
> > > 
> > > * Gleb Natapov <gleb@redhat.com> wrote:
> > > 
> > > > On Mon, Nov 02, 2009 at 10:22:14AM +0100, Ingo Molnar wrote:
> > > > > 
> > > > > * Gleb Natapov <gleb@redhat.com> wrote:
> > > > > 
> > > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > > > index f4cee90..14707dc 100644
> > > > > > --- a/arch/x86/mm/fault.c
> > > > > > +++ b/arch/x86/mm/fault.c
> > > > > > @@ -952,6 +952,9 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > > > >  	int write;
> > > > > >  	int fault;
> > > > > >  
> > > > > > +	if (arch_handle_page_fault(regs, error_code))
> > > > > > +		return;
> > > > > > +
> > > > > 
> > > > > This patch is not acceptable unless it's done cleaner. Currently we 
> > > > > already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace, 
> > > > > notifier), and this adds a fourth one. Please consolidate them into a 
> > > > > single callback site, this is a hotpath on x86.
> > > > > 
> > > > This call is patched out by paravirt patching mechanism so overhead 
> > > > should be zero for non paravirt cases. [...]
> > > 
> > > arch_handle_page_fault() isnt upstream yet - precisely what is the 
> > > instruction sequence injected into do_page_fault() in the patched-out 
> > > case?
> > 
> > It is introduced by the same patch. The instruction inserted is:
> >  xor %rax, %rax
> 
> ok.
> 
> My observations still stand:
> 
> > > > [...] What do you want to achieve by consolidate them into single 
> > > > callback? [...]
> > > 
> > > Less bloat in a hotpath and a shared callback infrastructure.
> > > 
> > > > [...] I mean the code will still exist and will have to be executed on 
> > > > every #PF. Is the goal to move them out of line?
> > > 
> > > The goal is to have a single callback site for all the users - which 
> > > call-site is patched out ideally - on non-paravirt too if needed. Most 
> > > of these callbacks/notifier-chains have are inactive most of the time.
> > > 
> > > I.e. a very low overhead 'conditional callback' facility, and a single 
> > > one - not just lots of them sprinkled around the code.
> 
> looks like a golden opportunity to get this right.
>
The three existing callbacks are kmemcheck, mmiotrace and the notifier. Two
of them, kmemcheck and mmiotrace, are enabled only for debugging, so they
should not be a performance concern. And the notifier call sites (there are
two of them) are deliberately, as a comment explains, not at the function
entry, so they can't be unified with the others. (And kmemcheck also has two
different call sites, BTW.)

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-02  9:22   ` Ingo Molnar
  2009-11-02 16:04     ` Gleb Natapov
@ 2009-11-02 19:03     ` Rik van Riel
  2009-11-02 19:33       ` Avi Kivity
  1 sibling, 1 reply; 55+ messages in thread
From: Rik van Riel @ 2009-11-02 19:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, H. Peter Anvin,
	Thomas Gleixner

On 11/02/2009 04:22 AM, Ingo Molnar wrote:
>
> * Gleb Natapov<gleb@redhat.com>  wrote:
>
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index f4cee90..14707dc 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -952,6 +952,9 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
>>   	int write;
>>   	int fault;
>>
>> +	if (arch_handle_page_fault(regs, error_code))
>> +		return;
>> +
>
> This patch is not acceptable unless it's done cleaner. Currently we
> already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace,
> notifier), and this adds a fourth one.

There's another alternative - add our own exception vector
for async page faults.  Not sure if that is warranted though,
especially if we already have other callbacks in do_page_fault()
and we can consolidate them.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/11] Add get_user_pages() variant that fails if major fault is required.
  2009-11-01 11:56 ` [PATCH 05/11] Add get_user_pages() variant that fails if major fault is required Gleb Natapov
@ 2009-11-02 19:05   ` Rik van Riel
  0 siblings, 0 replies; 55+ messages in thread
From: Rik van Riel @ 2009-11-02 19:05 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On 11/01/2009 06:56 AM, Gleb Natapov wrote:
> This patch add get_user_pages() variant that only succeeds if getting
> a reference to a page doesn't require major fault.
>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-02 19:03     ` Rik van Riel
@ 2009-11-02 19:33       ` Avi Kivity
  2009-11-02 23:35         ` Rik van Riel
  0 siblings, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2009-11-02 19:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Gleb Natapov, kvm, linux-mm, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On 11/02/2009 09:03 PM, Rik van Riel wrote:
>> This patch is not acceptable unless it's done cleaner. Currently we
>> already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace,
>> notifier), and this adds a fourth one.
>
>
> There's another alternative - add our own exception vector
> for async page faults.  Not sure if that is warranted though,
> especially if we already have other callbacks in do_page_fault()
> and we can consolidate them.
>

We can't add an exception vector since all the existing ones are either 
taken or reserved.  We can try to use an interrupt vector as an 
exception, but that becomes messy, and I'm not sure hardware will allow 
us to inject an interrupt when interrupts are disabled.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-02 19:33       ` Avi Kivity
@ 2009-11-02 23:35         ` Rik van Riel
  2009-11-03  4:57           ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Rik van Riel @ 2009-11-02 23:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Gleb Natapov, kvm, linux-mm, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On 11/02/2009 02:33 PM, Avi Kivity wrote:
> On 11/02/2009 09:03 PM, Rik van Riel wrote:
>>> This patch is not acceptable unless it's done cleaner. Currently we
>>> already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace,
>>> notifier), and this adds a fourth one.
>>
>>
>> There's another alternative - add our own exception vector
>> for async page faults. Not sure if that is warranted though,
>> especially if we already have other callbacks in do_page_fault()
>> and we can consolidate them.
>>
>
> We can't add an exception vector since all the existing ones are either
> taken or reserved.

I believe some are reserved for operating system use.

That means the guest can pick one and tell the host to use
that one to notify it.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-02 23:35         ` Rik van Riel
@ 2009-11-03  4:57           ` Avi Kivity
  2009-11-05  6:44             ` Tian, Kevin
  0 siblings, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2009-11-03  4:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Gleb Natapov, kvm, linux-mm, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On 11/03/2009 01:35 AM, Rik van Riel wrote:
>> We can't add an exception vector since all the existing ones are either
>> taken or reserved.
>
>
> I believe some are reserved for operating system use.

Table 6-1 says:

   9      |   | Coprocessor Segment Overrun (reserved) | Fault     | No | Floating-point instruction.
   15     | — | (Intel reserved. Do not use.)          |           | No |
   20-31  | — | Intel reserved. Do not use.            |           |    |
   32-255 | — | User Defined (Non-reserved) Interrupts | Interrupt |    | External interrupt or INT n instruction.

So we can only use 32-255, but these are not fault-like exceptions that 
can be delivered with interrupts disabled.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/11] Add shared memory hypercall to PV Linux guest.
  2009-11-02 16:18     ` Gleb Natapov
@ 2009-11-03  5:15       ` Avi Kivity
  2009-11-03  7:16         ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2009-11-03  5:15 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On 11/02/2009 06:18 PM, Gleb Natapov wrote:
>>> +#define KVM_PV_SHM_VERSION 1
>>>        
>> versions = bad, feature bits = good
>>
>>      
> I have both! Do you want me to drop version?
>    

Yes.  Once a kernel is released you can't realistically change the version.

>> Some documentation for this?
>>
>> Also, the name should reflect the pv pagefault use.  For other uses
>> we can register other areas.
>>
>>      
> I wanted it to be generic, but I am fine with making it apf specific.
> It will allow to make it smaller too.
>    

Maybe we can squeeze it into the page-fault error code?

>> would solve this.  I prefer using put_user() though than a permanent
>> get_user_pages().
>>
>>      
> I want to prevent it from been swapped out.
>    

Since you don't prevent the page fault handler or code from being 
swapped out, you don't get anything out of it.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/11] Add shared memory hypercall to PV Linux guest.
  2009-11-03  5:15       ` Avi Kivity
@ 2009-11-03  7:16         ` Gleb Natapov
  2009-11-03  7:40           ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-03  7:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-mm, linux-kernel

On Tue, Nov 03, 2009 at 07:15:10AM +0200, Avi Kivity wrote:
> On 11/02/2009 06:18 PM, Gleb Natapov wrote:
> >>>+#define KVM_PV_SHM_VERSION 1
> >>versions = bad, feature bits = good
> >>
> >I have both! Do you want me to drop version?
> 
> Yes.  Once a kernel is released you can't realistically change the version.
> 
Why not? If the version doesn't match, apf will not be used.

> >>Some documentation for this?
> >>
> >>Also, the name should reflect the pv pagefault use.  For other uses
> >>we can register other areas.
> >>
> >I wanted it to be generic, but I am fine with making it apf specific.
> >It will allow to make it smaller too.
> 
> Maybe we can squeeze it into the page-fault error code?
> 
apf has to pass two things into a guest kernel:
 - event type (page not present/wake up)
 - unique token
The error code has 32 bits; at least one of them should indicate that this
is an apf and another one should indicate the event type, which leaves us 30
bits for a token. 12 bits of the token are used to store the vcpu id, leaving
18 bits for a unique per-vcpu id. Yes, this may be enough. I don't think it
is realistic to have more than 200,000 outstanding apfs per vcpu. Alternatively
we can use CR2 to pass a token.
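
For illustration, a rough sketch of this kind of packing; the bit positions,
names and helpers below are made up for the example, not taken from the
patches:

/* Hypothetical layout: 2 flag bits, 12-bit vcpu id, 18-bit per-vcpu id. */
#define APF_FLAG	(1u << 31)	/* this #PF is an async PF           */
#define APF_WAKEUP	(1u << 30)	/* 0 = page not present, 1 = wake up */
#define APF_VCPU_BITS	12
#define APF_ID_BITS	18

static inline u32 apf_pack_token(bool wakeup, u32 vcpu, u32 id)
{
	return APF_FLAG | (wakeup ? APF_WAKEUP : 0) |
	       ((vcpu & ((1u << APF_VCPU_BITS) - 1)) << APF_ID_BITS) |
	       (id & ((1u << APF_ID_BITS) - 1));
}

static inline u32 apf_token_vcpu(u32 token)
{
	return (token >> APF_ID_BITS) & ((1u << APF_VCPU_BITS) - 1);
}

static inline u32 apf_token_id(u32 token)
{
	return token & ((1u << APF_ID_BITS) - 1);
}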
 
> >>would solve this.  I prefer using put_user() though than a permanent
> >>get_user_pages().
> >>
> >I want to prevent it from been swapped out.
> 
> Since you don't prevent the page fault handler or code from being
> swapped out, you don't get anything out of it.
> 
Performance. Currently it is accessed on each page fault, and without pinning
a gup+kmap would have to be done each and every time.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/11] Add shared memory hypercall to PV Linux guest.
  2009-11-03  7:16         ` Gleb Natapov
@ 2009-11-03  7:40           ` Avi Kivity
  0 siblings, 0 replies; 55+ messages in thread
From: Avi Kivity @ 2009-11-03  7:40 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On 11/03/2009 09:16 AM, Gleb Natapov wrote:
>>>
>>> I have both! Do you want me to drop version?
>>>        
>> Yes.  Once a kernel is released you can't realistically change the version.
>>
>>      
> Why not? If version doesn't match apf will not be used.
>    

Then you cause a large performance regression (assuming apf is any 
good).  So there will be a lot of pressure to modify things 
incrementally via feature bits.
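
For illustration, a minimal sketch of a feature-bit check on the guest side;
kvm_para_available(), kvm_para_has_feature() and KVM_CPUID_FEATURES already
exist, but the KVM_FEATURE_ASYNC_PF bit below is made up for the example:

#include <linux/kvm_para.h>

#define KVM_FEATURE_ASYNC_PF	4	/* hypothetical feature bit */

static bool guest_can_use_apf(void)
{
	return kvm_para_available() &&
	       kvm_para_has_feature(KVM_FEATURE_ASYNC_PF);
}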

>    
>>>> Some documentation for this?
>>>>
>>>> Also, the name should reflect the pv pagefault use.  For other uses
>>>> we can register other areas.
>>>>
>>>>          
>>> I wanted it to be generic, but I am fine with making it apf specific.
>>> It will allow to make it smaller too.
>>>        
>> Maybe we can squeeze it into the page-fault error code?
>>
>>      
> apf has to pass two things into a guest kernel:
>   - event type (page not present/wake up)
>   - unique token
> Error code has 32 bits and at least 1 of them should indicate that this
> is apf another one should indicate event type so this leaves us 30 bits
> for a token. 12 bits of a token is used to store vcpu id this leaves 18
> bits for unique per vcpu id. Yes this may be enough. I don't think it is
> realistic to have more then 200000 outstanding apfs per vcpu. Alternately
> we can use CR2 to pass a token.
>    

Or a combination of pfec and cr2, yes.

>>>> would solve this.  I prefer using put_user() though than a permanent
>>>> get_user_pages().
>>>>
>>>>          
>>> I want to prevent it from been swapped out.
>>>        
>> Since you don't prevent the page fault handler or code from being
>> swapped out, you don't get anything out of it.
>>
>>      
> Performance. Currently it is accessed on each page fault and to access
> it gup+kmap should be done each and every time.
>    

put_user() is just as fast as a kmap, and doesn't prevent page migration
or defragmentation.

Note we still have to mark_page_dirty() unless we want to chase live 
migration bugs.
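
For illustration, a rough sketch of the put_user() approach; gfn_to_hva(),
kvm_is_error_hva(), put_user() and mark_page_dirty() are existing helpers,
while apf_reason_gfn and the reason encoding are made up for the example:

static int write_apf_reason(struct kvm_vcpu *vcpu, u32 reason)
{
	gfn_t gfn = vcpu->arch.apf_reason_gfn;		/* hypothetical field */
	unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);

	if (kvm_is_error_hva(hva))
		return -EFAULT;
	if (put_user(reason, (u32 __user *)hva))
		return -EFAULT;
	mark_page_dirty(vcpu->kvm, gfn);	/* keep live migration honest */
	return 0;
}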

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/11] Handle asynchronous page fault in a PV guest.
  2009-11-01 11:56 ` [PATCH 03/11] Handle asynchronous page fault in a PV guest Gleb Natapov
  2009-11-02 12:38   ` Avi Kivity
@ 2009-11-03 14:14   ` Marcelo Tosatti
  2009-11-03 14:25     ` Gleb Natapov
  1 sibling, 1 reply; 55+ messages in thread
From: Marcelo Tosatti @ 2009-11-03 14:14 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On Sun, Nov 01, 2009 at 01:56:22PM +0200, Gleb Natapov wrote:
> Asynchronous page fault notifies vcpu that page it is trying to access
> is swapped out by a host. In response guest puts a task that caused the
> fault to sleep until page is swapped in again. When missing page is
> brought back into the memory guest is notified and task resumes execution.

Can't you apply this to non-paravirt guests, and continue to deliver
interrupts while waiting for the swapin? 

It should allow the guest to schedule a different task.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/11] Handle asynchronous page fault in a PV guest.
  2009-11-03 14:14   ` Marcelo Tosatti
@ 2009-11-03 14:25     ` Gleb Natapov
  2009-11-03 14:32       ` Marcelo Tosatti
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2009-11-03 14:25 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, linux-mm, linux-kernel

On Tue, Nov 03, 2009 at 12:14:23PM -0200, Marcelo Tosatti wrote:
> On Sun, Nov 01, 2009 at 01:56:22PM +0200, Gleb Natapov wrote:
> > Asynchronous page fault notifies vcpu that page it is trying to access
> > is swapped out by a host. In response guest puts a task that caused the
> > fault to sleep until page is swapped in again. When missing page is
> > brought back into the memory guest is notified and task resumes execution.
> 
> Can't you apply this to non-paravirt guests, and continue to deliver
> interrupts while waiting for the swapin? 
> 
> It should allow the guest to schedule a different task.
But how can I make the guest not run the task that caused the fault?

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/11] Handle asynchronous page fault in a PV guest.
  2009-11-03 14:25     ` Gleb Natapov
@ 2009-11-03 14:32       ` Marcelo Tosatti
  2009-11-03 14:38         ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Marcelo Tosatti @ 2009-11-03 14:32 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-mm, linux-kernel

On Tue, Nov 03, 2009 at 04:25:33PM +0200, Gleb Natapov wrote:
> On Tue, Nov 03, 2009 at 12:14:23PM -0200, Marcelo Tosatti wrote:
> > On Sun, Nov 01, 2009 at 01:56:22PM +0200, Gleb Natapov wrote:
> > > Asynchronous page fault notifies vcpu that page it is trying to access
> > > is swapped out by a host. In response guest puts a task that caused the
> > > fault to sleep until page is swapped in again. When missing page is
> > > brought back into the memory guest is notified and task resumes execution.
> > 
> > Can't you apply this to non-paravirt guests, and continue to deliver
> > interrupts while waiting for the swapin? 
> > 
> > It should allow the guest to schedule a different task.
> But how can I make the guest to not run the task that caused the fault?

Any attempt to access the swapped-out data will cause a #PF vmexit,
since the translation is marked as not present. If there's a swapin in
progress, you wait for that swapin; otherwise you start the swapin and wait.

It's not as efficient as paravirt because you have to wait for a timer
interrupt and for the guest scheduler to decide to task-switch, but OTOH
it's transparent.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/11] Handle asynchronous page fault in a PV guest.
  2009-11-03 14:32       ` Marcelo Tosatti
@ 2009-11-03 14:38         ` Avi Kivity
  0 siblings, 0 replies; 55+ messages in thread
From: Avi Kivity @ 2009-11-03 14:38 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Gleb Natapov, kvm, linux-mm, linux-kernel

On 11/03/2009 04:32 PM, Marcelo Tosatti wrote:
> Any attempt to access the swapped out data will cause a #PF vmexit,
> since the translation is marked as not present. If there's swapin in
> progress, you wait for that swapin, otherwise start swapin and wait.
>
> Its not as efficient as paravirt because you have to wait for a timer
> interrupt and the guest scheduler to decide to taskswitch, but OTOH its
> transparent.
>    

With a dyntick guest the timer interrupt will come at the end of the 
time slice, likely after the page has been swapped in.  That leaves smp 
reschedule interrupts and non-dyntick guests.

An advantage is that there is one code path for apf and non-apf.  
Another is that interrupts are processed, improving timekeeping and 
maybe responsiveness.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-03  4:57           ` Avi Kivity
@ 2009-11-05  6:44             ` Tian, Kevin
  2009-11-05  8:22               ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2009-11-05  6:44 UTC (permalink / raw)
  To: Avi Kivity, Rik van Riel
  Cc: Ingo Molnar, Gleb Natapov, kvm, linux-mm, linux-kernel,
	H. Peter Anvin, Thomas Gleixner


>From: Avi Kivity
>Sent: 2009年11月3日 12:57
>
>On 11/03/2009 01:35 AM, Rik van Riel wrote:
>>> We can't add an exception vector since all the existing ones are either
>>> taken or reserved.
>>
>>
>> I believe some are reserved for operating system use.
>
>Table 6-1 says:
>
>   9 |  | Coprocessor Segment Overrun (reserved)  |  Fault |  No  | Floating-point instruction.
>   15 |  — |  (Intel reserved. Do not use.) |   | No |
>   20-31 |  — | Intel reserved. Do not use.  |
>   32-255 |  —  | User Defined (Non-reserved) Interrupts |  Interrupt  |   | External interrupt or INT n instruction.
>
>So we can only use 32-255, but these are not fault-like exceptions that
>can be delivered with interrupts disabled.
>

would you really want to inject a fault-like exception here? A fault
is architecturally a synchronous event, while here apf is more like an
asynchronous interrupt, since it is not caused by the guest itself. If
the guest runs with interrupts disabled, preemption won't happen and the
apf path just ends up in the "wait for page" hypercall, wasting cycles.

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-05  6:44             ` Tian, Kevin
@ 2009-11-05  8:22               ` Avi Kivity
  0 siblings, 0 replies; 55+ messages in thread
From: Avi Kivity @ 2009-11-05  8:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Rik van Riel, Ingo Molnar, Gleb Natapov, kvm, linux-mm,
	linux-kernel, H. Peter Anvin, Thomas Gleixner

On 11/05/2009 08:44 AM, Tian, Kevin wrote:
>> From: Avi Kivity
>> Sent: 2009年11月3日 12:57
>>
>> On 11/03/2009 01:35 AM, Rik van Riel wrote:
>>     
>>>> We can't add an exception vector since all the existing 
>>>>         
>> ones are either
>>     
>>>> taken or reserved.
>>>>         
>>>
>>> I believe some are reserved for operating system use.
>>>       
>> Table 6-1 says:
>>
>>   9 |  | Coprocessor Segment Overrun (reserved)  |  Fault |  No  | 
>> Floating-point instruction.2
>>   15 |  ― |  (Intel reserved. Do not use.) |   | No |
>>   20-31 |  ― | Intel reserved. Do not use.  |
>>   32-255 |  ―  | User Defined (Non-reserved) Interrupts |  Interrupt  
>> |   | External interrupt or INT n instruction.
>>
>> So we can only use 32-255, but these are not fault-like 
>> exceptions that 
>> can be delivered with interrupts disabled.
>>
>>     
> would you really want to inject a fault-like exception here? Fault
> is architurally synchronous event while here apf is more like an 
> asynchronous interrupt as it's not caused by guest itself. If 
> guest is with interrupt disabled, preemption won't happen and 
> apf path just ends up "wait for page" hypercall to waste cycles.
>   

An async page fault is, despite its name, synchronous, since it is
associated with an instruction. It must either be delivered immediately
or not at all.

It's true that in kernel mode you can't do much with an apf if
interrupts are disabled, but you still want to receive apfs for user
mode with interrupts disabled (for example due to interrupt shadow).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-02 17:42             ` Gleb Natapov
@ 2009-11-08 11:36               ` Ingo Molnar
  2009-11-08 12:43                 ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2009-11-08 11:36 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, H. Peter Anvin, Thomas Gleixner,
	Frédéric Weisbecker


* Gleb Natapov <gleb@redhat.com> wrote:

> On Mon, Nov 02, 2009 at 05:29:41PM +0100, Ingo Molnar wrote:
> > 
> > * Gleb Natapov <gleb@redhat.com> wrote:
> > 
> > > On Mon, Nov 02, 2009 at 05:12:48PM +0100, Ingo Molnar wrote:
> > > > 
> > > > * Gleb Natapov <gleb@redhat.com> wrote:
> > > > 
> > > > > On Mon, Nov 02, 2009 at 10:22:14AM +0100, Ingo Molnar wrote:
> > > > > > 
> > > > > > * Gleb Natapov <gleb@redhat.com> wrote:
> > > > > > 
> > > > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > > > > index f4cee90..14707dc 100644
> > > > > > > --- a/arch/x86/mm/fault.c
> > > > > > > +++ b/arch/x86/mm/fault.c
> > > > > > > @@ -952,6 +952,9 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > > > > >  	int write;
> > > > > > >  	int fault;
> > > > > > >  
> > > > > > > +	if (arch_handle_page_fault(regs, error_code))
> > > > > > > +		return;
> > > > > > > +
> > > > > > 
> > > > > > This patch is not acceptable unless it's done cleaner. Currently we 
> > > > > > already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace, 
> > > > > > notifier), and this adds a fourth one. Please consolidate them into a 
> > > > > > single callback site, this is a hotpath on x86.
> > > > > > 
> > > > > This call is patched out by paravirt patching mechanism so overhead 
> > > > > should be zero for non paravirt cases. [...]
> > > > 
> > > > arch_handle_page_fault() isnt upstream yet - precisely what is the 
> > > > instruction sequence injected into do_page_fault() in the patched-out 
> > > > case?
> > > 
> > > It is introduced by the same patch. The instruction inserted is:
> > >  xor %rax, %rax
> > 
> > ok.
> > 
> > My observations still stand:
> > 
> > > > > [...] What do you want to achieve by consolidate them into single 
> > > > > callback? [...]
> > > > 
> > > > Less bloat in a hotpath and a shared callback infrastructure.
> > > > 
> > > > > [...] I mean the code will still exist and will have to be executed on 
> > > > > every #PF. Is the goal to move them out of line?
> > > > 
> > > > The goal is to have a single callback site for all the users - which 
> > > > call-site is patched out ideally - on non-paravirt too if needed. Most 
> > > > of these callbacks/notifier-chains have are inactive most of the time.
> > > > 
> > > > I.e. a very low overhead 'conditional callback' facility, and a single 
> > > > one - not just lots of them sprinkled around the code.
> > 
> > looks like a golden opportunity to get this right.
> >
> Three existing callbacks are: kmemcheck, mmiotrace, notifier. Two
> of them kmemcheck, mmiotrace are enabled only for debugging, should
> not be performance concern. And notifier call sites (two of them)
> are deliberately, as explained by comment, not at the function entry,
> so can't be unified with others. (And kmemcheck also has two different
> call site BTW)

We want mmiotrace to be generic and distro-capable, so the overhead when the
hook is not used is of concern.

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-08 11:36               ` Ingo Molnar
@ 2009-11-08 12:43                 ` Avi Kivity
  2009-11-08 12:51                   ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2009-11-08 12:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Frédéric Weisbecker

On 11/08/2009 01:36 PM, Ingo Molnar wrote:
>> Three existing callbacks are: kmemcheck, mmiotrace, notifier. Two
>> of them kmemcheck, mmiotrace are enabled only for debugging, should
>> not be performance concern. And notifier call sites (two of them)
>> are deliberately, as explained by comment, not at the function entry,
>> so can't be unified with others. (And kmemcheck also has two different
>> call site BTW)
>>      
> We want mmiotrace to be generic distro capable so the overhead when the
> hook is not used is of concern.
>
>    

Maybe we should generalize paravirt-ops patching in case if (x) f() is 
deemed too expensive.
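
For illustration, the shape of the 'conditional callback' site being
discussed: a hook that costs a load, a test and a branch on every page fault
even when nothing is registered. The names below are made up; patching would
rewrite the whole site rather than testing a pointer at run time:

struct pf_hook {
	int (*fn)(struct pt_regs *regs, unsigned long error_code);
};

static struct pf_hook pf_hook;		/* hypothetical single hook site */

static inline int call_pf_hook(struct pt_regs *regs, unsigned long error_code)
{
	if (unlikely(pf_hook.fn))	/* the test patching would eliminate */
		return pf_hook.fn(regs, error_code);
	return 0;
}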

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-08 12:43                 ` Avi Kivity
@ 2009-11-08 12:51                   ` Ingo Molnar
  2009-11-08 13:01                     ` Avi Kivity
  2009-11-08 16:44                     ` H. Peter Anvin
  0 siblings, 2 replies; 55+ messages in thread
From: Ingo Molnar @ 2009-11-08 12:51 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Frédéric Weisbecker


* Avi Kivity <avi@redhat.com> wrote:

> On 11/08/2009 01:36 PM, Ingo Molnar wrote:
> >>Three existing callbacks are: kmemcheck, mmiotrace, notifier. Two
> >>of them kmemcheck, mmiotrace are enabled only for debugging, should
> >>not be performance concern. And notifier call sites (two of them)
> >>are deliberately, as explained by comment, not at the function entry,
> >>so can't be unified with others. (And kmemcheck also has two different
> >>call site BTW)
> >
> > We want mmiotrace to be generic distro capable so the overhead when 
> > the hook is not used is of concern.
> 
> Maybe we should generalize paravirt-ops patching in case if (x) f() is 
> deemed too expensive.

Yes, that's a nice idea. We have quite a number of 'conditional 
callbacks' in various critical paths that could be made lighter via such 
a technique.

It would also free new callbacks from the 'it increases overhead even if 
unused' criticism and make it easier to add them.

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-08 12:51                   ` Ingo Molnar
@ 2009-11-08 13:01                     ` Avi Kivity
  2009-11-08 13:05                       ` Ingo Molnar
  2009-11-08 16:44                     ` H. Peter Anvin
  1 sibling, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2009-11-08 13:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Frédéric Weisbecker

On 11/08/2009 02:51 PM, Ingo Molnar wrote:
>> Maybe we should generalize paravirt-ops patching in case if (x) f() is
>> deemed too expensive.
>>      
> Yes, that's a nice idea. We have quite a number of 'conditional
> callbacks' in various critical paths that could be made lighter via such
> a technique.
>
> It would also free new callbacks from the 'it increases overhead even if
> unused' criticism and made it easier to add them.
>    

We can take the "immediate values" infrastructure as a first step.  Has 
that been merged?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-08 13:01                     ` Avi Kivity
@ 2009-11-08 13:05                       ` Ingo Molnar
  2009-11-08 13:08                         ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2009-11-08 13:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Frédéric Weisbecker


* Avi Kivity <avi@redhat.com> wrote:

> On 11/08/2009 02:51 PM, Ingo Molnar wrote:
> >>Maybe we should generalize paravirt-ops patching in case if (x) f() is
> >>deemed too expensive.
> >Yes, that's a nice idea. We have quite a number of 'conditional
> >callbacks' in various critical paths that could be made lighter via such
> >a technique.
> >
> > It would also free new callbacks from the 'it increases overhead 
> > even if unused' criticism and made it easier to add them.
> 
> We can take the "immediate values" infrastructure as a first step. Has 
> that been merged?

No, there were doubts about whether patching in live instructions like 
that is safe on all CPU types.

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-08 13:05                       ` Ingo Molnar
@ 2009-11-08 13:08                         ` Avi Kivity
  0 siblings, 0 replies; 55+ messages in thread
From: Avi Kivity @ 2009-11-08 13:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Frédéric Weisbecker

On 11/08/2009 03:05 PM, Ingo Molnar wrote:
>> We can take the "immediate values" infrastructure as a first step. Has
>> that been merged?
>>      
> No, there were doubts about whether patching in live instructions like
> that is safe on all CPU types.
>    

Ah, I remember.  Doesn't the trick of going through a breakpoint 
instruction work?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-08 12:51                   ` Ingo Molnar
  2009-11-08 13:01                     ` Avi Kivity
@ 2009-11-08 16:44                     ` H. Peter Anvin
  2009-11-08 16:47                       ` Ingo Molnar
  1 sibling, 1 reply; 55+ messages in thread
From: H. Peter Anvin @ 2009-11-08 16:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Gleb Natapov, kvm, linux-mm, linux-kernel,
	Thomas Gleixner, Frédéric Weisbecker

On 11/08/2009 04:51 AM, Ingo Molnar wrote:
> 
> * Avi Kivity <avi@redhat.com> wrote:
> 
>> On 11/08/2009 01:36 PM, Ingo Molnar wrote:
>>>> Three existing callbacks are: kmemcheck, mmiotrace, notifier. Two
>>>> of them kmemcheck, mmiotrace are enabled only for debugging, should
>>>> not be performance concern. And notifier call sites (two of them)
>>>> are deliberately, as explained by comment, not at the function entry,
>>>> so can't be unified with others. (And kmemcheck also has two different
>>>> call site BTW)
>>>
>>> We want mmiotrace to be generic distro capable so the overhead when 
>>> the hook is not used is of concern.
>>
>> Maybe we should generalize paravirt-ops patching in case if (x) f() is 
>> deemed too expensive.
> 
> Yes, that's a nice idea. We have quite a number of 'conditional 
> callbacks' in various critical paths that could be made lighter via such 
> a technique.
> 
> It would also free new callbacks from the 'it increases overhead even if 
> unused' criticism and made it easier to add them.
> 

There are a number of other things where we permanently bind to a single
instance of something, too.  Optimizing those away would be nice.
Consider memcpy(), where we may want to have different implementations
for different processors.
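
For illustration, a rough sketch of that kind of single binding done through
a function pointer chosen at boot; boot_cpu_has() and X86_FEATURE_REP_GOOD
exist, while the memcpy variants below are purely illustrative:

static void *memcpy_generic(void *dst, const void *src, size_t len);
static void *memcpy_rep_movs(void *dst, const void *src, size_t len);

static void *(*memcpy_impl)(void *, const void *, size_t) = memcpy_generic;

void __init pick_memcpy(void)
{
	if (boot_cpu_has(X86_FEATURE_REP_GOOD))
		memcpy_impl = memcpy_rep_movs;	/* patching call sites would avoid the indirection */
}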

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/11] Add "handle page fault" PV helper.
  2009-11-08 16:44                     ` H. Peter Anvin
@ 2009-11-08 16:47                       ` Ingo Molnar
  0 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2009-11-08 16:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Avi Kivity, Gleb Natapov, kvm, linux-mm, linux-kernel,
	Thomas Gleixner, Frédéric Weisbecker


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 11/08/2009 04:51 AM, Ingo Molnar wrote:
> > 
> > * Avi Kivity <avi@redhat.com> wrote:
> > 
> >> On 11/08/2009 01:36 PM, Ingo Molnar wrote:
> >>>> Three existing callbacks are: kmemcheck, mmiotrace, notifier. Two
> >>>> of them kmemcheck, mmiotrace are enabled only for debugging, should
> >>>> not be performance concern. And notifier call sites (two of them)
> >>>> are deliberately, as explained by comment, not at the function entry,
> >>>> so can't be unified with others. (And kmemcheck also has two different
> >>>> call site BTW)
> >>>
> >>> We want mmiotrace to be generic distro capable so the overhead when 
> >>> the hook is not used is of concern.
> >>
> >> Maybe we should generalize paravirt-ops patching in case if (x) f() is 
> >> deemed too expensive.
> > 
> > Yes, that's a nice idea. We have quite a number of 'conditional 
> > callbacks' in various critical paths that could be made lighter via such 
> > a technique.
> > 
> > It would also free new callbacks from the 'it increases overhead 
> > even if unused' criticism and made it easier to add them.
> 
> There are a number of other things were we permanently bind to a 
> single instance of something, too.  Optimizing those away would be 
> nice. Consider memcpy(), where we may want to have different 
> implementations for different processors.

yeah.

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2009-11-08 16:47 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-01 11:56 [PATCH 00/11] KVM: Add asynchronous page fault for PV guest Gleb Natapov
2009-11-01 11:56 ` [PATCH 01/11] Add shared memory hypercall to PV Linux guest Gleb Natapov
2009-11-02  4:27   ` Rik van Riel
2009-11-02  7:07     ` Gleb Natapov
2009-11-02 12:18   ` Avi Kivity
2009-11-02 16:18     ` Gleb Natapov
2009-11-03  5:15       ` Avi Kivity
2009-11-03  7:16         ` Gleb Natapov
2009-11-03  7:40           ` Avi Kivity
2009-11-01 11:56 ` [PATCH 02/11] Add "handle page fault" PV helper Gleb Natapov
2009-11-02  9:22   ` Ingo Molnar
2009-11-02 16:04     ` Gleb Natapov
2009-11-02 16:12       ` Ingo Molnar
2009-11-02 16:22         ` Gleb Natapov
2009-11-02 16:29           ` Ingo Molnar
2009-11-02 16:31             ` Gleb Natapov
2009-11-02 17:42             ` Gleb Natapov
2009-11-08 11:36               ` Ingo Molnar
2009-11-08 12:43                 ` Avi Kivity
2009-11-08 12:51                   ` Ingo Molnar
2009-11-08 13:01                     ` Avi Kivity
2009-11-08 13:05                       ` Ingo Molnar
2009-11-08 13:08                         ` Avi Kivity
2009-11-08 16:44                     ` H. Peter Anvin
2009-11-08 16:47                       ` Ingo Molnar
2009-11-02 19:03     ` Rik van Riel
2009-11-02 19:33       ` Avi Kivity
2009-11-02 23:35         ` Rik van Riel
2009-11-03  4:57           ` Avi Kivity
2009-11-05  6:44             ` Tian, Kevin
2009-11-05  8:22               ` Avi Kivity
2009-11-01 11:56 ` [PATCH 03/11] Handle asynchronous page fault in a PV guest Gleb Natapov
2009-11-02 12:38   ` Avi Kivity
2009-11-02 15:54     ` Gleb Natapov
2009-11-03 14:14   ` Marcelo Tosatti
2009-11-03 14:25     ` Gleb Natapov
2009-11-03 14:32       ` Marcelo Tosatti
2009-11-03 14:38         ` Avi Kivity
2009-11-01 11:56 ` [PATCH 04/11] Export __get_user_pages_fast Gleb Natapov
2009-11-02  9:23   ` Ingo Molnar
2009-11-01 11:56 ` [PATCH 05/11] Add get_user_pages() variant that fails if major fault is required Gleb Natapov
2009-11-02 19:05   ` Rik van Riel
2009-11-01 11:56 ` [PATCH 06/11] Inject asynchronous page fault into a guest if page is swapped out Gleb Natapov
2009-11-02 12:56   ` Avi Kivity
2009-11-02 15:41     ` Gleb Natapov
2009-11-01 11:56 ` [PATCH 07/11] Retry fault before vmentry Gleb Natapov
2009-11-02 13:03   ` Avi Kivity
2009-11-01 11:56 ` [PATCH 08/11] Add "wait for page" hypercall Gleb Natapov
2009-11-02 13:05   ` Avi Kivity
2009-11-02 15:13     ` Gleb Natapov
2009-11-02 15:19       ` Avi Kivity
2009-11-01 11:56 ` [PATCH 09/11] Maintain preemptability count even for !CONFIG_PREEMPT kernels Gleb Natapov
2009-11-02  9:24   ` Ingo Molnar
2009-11-01 11:56 ` [PATCH 10/11] Handle async PF in non preemptable context Gleb Natapov
2009-11-01 11:56 ` [PATCH 11/11] Send async PF when guest is not in userspace too Gleb Natapov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).