linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest
@ 2010-10-04 15:56 Gleb Natapov
  2010-10-04 15:56 ` [PATCH v6 01/12] Add get_user_pages() variant that fails if major fault is required Gleb Natapov
                   ` (11 more replies)
  0 siblings, 12 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

KVM virtualizes guest memory by means of shadow pages or HW assistance
like NPT/EPT. Not all memory used by a guest is mapped into the guest
address space or even present in host memory at any given time.
When a vcpu tries to access a memory page that is not mapped into the
guest address space, KVM is notified about it. KVM maps the page into the
guest address space and resumes vcpu execution. If the page is swapped
out from host memory, vcpu execution is suspended until the page is
swapped back in. This is inefficient, since the vcpu could do other work
(run another task or serve interrupts) while the page is swapped in.

The patch series tries to mitigate this problem by introducing two
mechanisms. The first one is used with non-PV guests and works like
this: when a vcpu tries to access a swapped out page it is halted and
the requested page is swapped in by another thread. That way the vcpu
can still process interrupts while the io happens in parallel and, with
any luck, an interrupt will cause the guest to schedule another task on
the vcpu, so it will have work to do instead of waiting for the page to
be swapped in.
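
For orientation, here is a condensed sketch of that host-side flow. It is
a paraphrase of patches 02-03 below, not the actual code, and
map_into_guest() is an illustrative stand-in:

/* Sketch of the fault path -- see try_async_pf() and
 * kvm_check_async_pf_completion() in patches 02-03 for the real code. */
pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
if (!async) {
	map_into_guest(vcpu, gfn, pfn);		/* page was already resident */
} else {
	/* Queue a work item that will call get_user_pages() from a
	 * workqueue, where sleeping on swap-in I/O is fine ... */
	kvm_arch_setup_async_pf(vcpu, gva, gfn);
	/* ... and, before the next vmentry, put the vcpu into a synthetic
	 * halted state instead of blocking its thread.  Interrupts keep
	 * being delivered, so the guest can switch to another task. */
	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
}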

The second mechanism introduces a PV notification about swapped page state
to the guest (asynchronous page fault). Instead of halting the vcpu upon
access to a swapped out page and hoping that some interrupt will cause a
reschedule, we immediately inject an asynchronous page fault into the vcpu.
A PV aware guest knows that upon receiving such an exception it should
schedule another task to run on the vcpu. The current task is put to sleep
until another kind of asynchronous page fault is received that notifies the
guest that the page is now in host memory, so the task that waits for it
can run again.
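
Conceptually, the guest side looks like the sketch below. The real guest
implementation is in patches 07-08 (not quoted in this excerpt); every
name here is illustrative:

/* Conceptual sketch only -- not the code from patches 07-08. */
void async_page_fault_sketch(unsigned long id, int reason)
{
	if (reason == PAGE_NOT_PRESENT) {
		/* The host says the page is swapped out: put the current
		 * task to sleep, keyed by 'id', and run something else. */
		sleep_on_async_pf(id);
	} else {	/* PAGE_READY */
		/* The host says the page was swapped back in: wake up the
		 * task that was sleeping on 'id' so it can retry. */
		wake_async_pf_sleeper(id);
	}
}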

To measure the performance benefits I used a simple benchmark program
(below) that starts a number of threads. Some of them do work (increment
a counter), others access a huge array at random locations trying to
generate host page faults. The size of the array is smaller than guest
memory but bigger than host memory, so we are guaranteed that the host
will swap out part of the array.

I ran the benchmark on three setups: with current kvm.git (master),
with my patch series + non-pv guest (nonpv) and with my patch series +
pv guest (pv).

Each guest had 4 cpus and 2G of memory and was launched inside a 512M
memory container. The command line was "./bm -f 4 -w 4 -t 60" (run 4
faulting threads and 4 working threads for a minute).

Below is the total amount of "work" each guest managed to do
(average of 10 runs):
         total work    std error
master: 122789420615 (3818565029)
nonpv:  138455939001 (773774299)
pv:     234351846135 (10461117116)

Relative to master, that is roughly 13% more work for nonpv and roughly
91% more for pv.

Changes:
 v1->v2
   Use MSR instead of hypercall.
   Move most of the code into arch independent place.
   Halt inside a guest instead of doing a "wait for page" hypercall if
    preemption is disabled.
 v2->v3
   Use MSR from range 0x4b564dxx.
   Add slot version tracking.
   Support migration by restarting all guest processes after migration.
   Drop patch that tracked preemptability for non-preemptable kernels
    due to performance concerns. Send async PF to non-preemptable
    guests only when vcpu is executing userspace code.
 v3->v4
  Provide an alternative page fault handler in the PV guest instead of adding
   a hook to the standard page fault handler and patching it out on non-PV
   guests.
  Allow only a limited number of outstanding async page faults per vcpu.
  Unify gfn_to_pfn and gfn_to_pfn_async code.
  Cancel outstanding slow work on reset.
 v4->v5
  Move async pv cpu initialization into cpu hotplug notifier.
  Use GFP_NOWAIT instead of GFP_ATOMIC for allocations that shouldn't sleep.
  Process KVM_REQ_MMU_SYNC even in page_fault_other_cr3() before changing
   cr3 back.
 v5->v6
  Too many to list; only major changes here.
  Replace slow work with work queues.
  Halt vcpu for non-pv guests.
  Handle async PF in nested SVM mode.
  Do not prefault swapped-in page for the non-tdp case.

Gleb Natapov (12):
  Add get_user_pages() variant that fails if major fault is required.
  Halt vcpu if page it tries to access is swapped out.
  Retry fault before vmentry
  Add memory slot versioning and use it to provide fast guest write interface
  Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
  Add PV MSR to enable asynchronous page faults delivery.
  Add async PF initialization to PV guest.
  Handle async PF in a guest.
  Inject asynchronous page fault into a PV guest if page is swapped out.
  Handle async PF in non preemptable context
  Let host know whether the guest can handle async PF in non-userspace context.
  Send async PF when guest is not in userspace too.

 Documentation/kernel-parameters.txt |    3 +
 Documentation/kvm/cpuid.txt         |    3 +
 Documentation/kvm/msr.txt           |   14 ++-
 arch/x86/include/asm/kvm_host.h     |   27 +++-
 arch/x86/include/asm/kvm_para.h     |   23 +++
 arch/x86/include/asm/traps.h        |    1 +
 arch/x86/kernel/entry_32.S          |   10 +
 arch/x86/kernel/entry_64.S          |    3 +
 arch/x86/kernel/kvm.c               |  316 +++++++++++++++++++++++++++++++++++
 arch/x86/kernel/kvmclock.c          |   13 +--
 arch/x86/kvm/Kconfig                |    1 +
 arch/x86/kvm/Makefile               |    1 +
 arch/x86/kvm/mmu.c                  |   60 ++++++-
 arch/x86/kvm/paging_tmpl.h          |    8 +-
 arch/x86/kvm/svm.c                  |   43 ++++-
 arch/x86/kvm/x86.c                  |  189 ++++++++++++++++++++-
 fs/ncpfs/mmap.c                     |    2 +
 include/linux/kvm.h                 |    1 +
 include/linux/kvm_host.h            |   38 ++++
 include/linux/kvm_types.h           |    7 +
 include/linux/mm.h                  |    5 +
 include/trace/events/kvm.h          |   93 ++++++++++
 mm/filemap.c                        |    3 +
 mm/memory.c                         |   31 +++-
 mm/shmem.c                          |    8 +-
 virt/kvm/Kconfig                    |    3 +
 virt/kvm/async_pf.c                 |  223 ++++++++++++++++++++++++
 virt/kvm/async_pf.h                 |   36 ++++
 virt/kvm/kvm_main.c                 |  114 +++++++++++--
 29 files changed, 1225 insertions(+), 54 deletions(-)
 create mode 100644 virt/kvm/async_pf.c
 create mode 100644 virt/kvm/async_pf.h

=== benchmark.c ===

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

#define FAULTING_THREADS 1
#define WORKING_THREADS 1
#define TIMEOUT 5
#define MEMORY 1024*1024*1024

pthread_barrier_t barrier;
volatile int stop;
size_t pages;

void *fault_thread(void* p)
{
	char *mem = p;

	pthread_barrier_wait(&barrier);

	while (!stop)
		mem[(random() % pages) << 12] = 10;

	pthread_barrier_wait(&barrier);

	return NULL;
}

void *work_thread(void* p)
{
	unsigned long *i = p;

	pthread_barrier_wait(&barrier);

	while (!stop)
		(*i)++;

	pthread_barrier_wait(&barrier);

	return NULL;
}

int main(int argc, char **argv)
{
	int ft = FAULTING_THREADS, wt = WORKING_THREADS;
	unsigned int timeout = TIMEOUT;
	size_t mem = MEMORY;
	void *buf;
	int i, opt, verbose = 0;
	pthread_t t;
	pthread_attr_t pattr;
	unsigned long *res, sum = 0;

	while((opt = getopt(argc, argv, "f:w:m:t:v")) != -1) {
		switch (opt) {
		case 'f':
			ft = atoi(optarg);
			break;
		case 'w':
			wt = atoi(optarg);
			break;
		case 'm':
			mem = atoi(optarg);
			break;
		case 't':
			timeout = atoi(optarg);
			break;
		case 'v':
			verbose++;
			break;
		default:
			fprintf(stderr, "Usage %s [-f num] [-w num] [-m byte] [-t secs]\n", argv[0]);
			exit(1);
		}
	}

	if (verbose)
		printf("fault=%d work=%d mem=%lu timeout=%d\n", ft, wt, mem, timeout);

	pages = mem >> 12;
	posix_memalign(&buf, 4096, pages << 12);
	res = malloc(sizeof (unsigned long) * wt);
	memset(res, 0, sizeof (unsigned long) * wt);

	pthread_attr_init(&pattr);
	pthread_barrier_init(&barrier, NULL, ft + wt + 1);

	for (i = 0; i < ft; i++) {
		pthread_create(&t, &pattr, fault_thread, buf);
		pthread_detach(t);
	}

	for (i = 0; i < wt; i++) {
		pthread_create(&t, &pattr, work_thread, &res[i]);
		pthread_detach(t);
	}

	/* prefault memory */
	memset(buf, 0, pages << 12);
	printf("start\n");

	pthread_barrier_wait(&barrier);

	pthread_barrier_destroy(&barrier);
	pthread_barrier_init(&barrier, NULL, ft + wt + 1);

	sleep(timeout);
	stop = 1;

	pthread_barrier_wait(&barrier);

	for (i = 0; i < wt; i++) {
		sum += res[i];
		printf("worker %d: %lu\n", i, res[i]);
	}
	printf("total: %lu\n", sum);

	return 0;
}


* [PATCH v6 01/12] Add get_user_pages() variant that fails if major fault is required.
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-04 15:56 ` [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out Gleb Natapov
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

This patch adds a get_user_pages() variant that only succeeds if getting
a reference to a page doesn't require a major fault.

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 fs/ncpfs/mmap.c    |    2 ++
 include/linux/mm.h |    5 +++++
 mm/filemap.c       |    3 +++
 mm/memory.c        |   31 ++++++++++++++++++++++++++++---
 mm/shmem.c         |    8 +++++++-
 5 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
index 56f5b3a..b9c4f36 100644
--- a/fs/ncpfs/mmap.c
+++ b/fs/ncpfs/mmap.c
@@ -39,6 +39,8 @@ static int ncp_file_mmap_fault(struct vm_area_struct *area,
 	int bufsize;
 	int pos; /* XXX: loff_t ? */
 
+	if (vmf->flags & FAULT_FLAG_MINOR)
+		return VM_FAULT_MAJOR | VM_FAULT_ERROR;
 	/*
 	 * ncpfs has nothing against high pages as long
 	 * as recvmsg and memset works on it
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74949fb..da32900 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -144,6 +144,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
 #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_MINOR	0x08	/* Do only minor fault */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -848,6 +849,9 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			unsigned long start, int nr_pages, int write, int force,
 			struct page **pages, struct vm_area_struct **vmas);
+int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
+			unsigned long start, int nr_pages, int write, int force,
+			struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct page *get_dump_page(unsigned long addr);
@@ -1394,6 +1398,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_MINOR	0x20	/* do only minor page faults */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..ef28b6d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1548,6 +1548,9 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 			goto no_cached_page;
 		}
 	} else {
+		if (vmf->flags & FAULT_FLAG_MINOR)
+			return VM_FAULT_MAJOR | VM_FAULT_ERROR;
+
 		/* No page in the page cache at all */
 		do_sync_mmap_readahead(vma, ra, file, offset);
 		count_vm_event(PGMAJFAULT);
diff --git a/mm/memory.c b/mm/memory.c
index 0e18b4d..b221458 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1441,10 +1441,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			cond_resched();
 			while (!(page = follow_page(vma, start, foll_flags))) {
 				int ret;
+				unsigned int fault_fl =
+					((foll_flags & FOLL_WRITE) ?
+					FAULT_FLAG_WRITE : 0) |
+					((foll_flags & FOLL_MINOR) ?
+					FAULT_FLAG_MINOR : 0);
 
-				ret = handle_mm_fault(mm, vma, start,
-					(foll_flags & FOLL_WRITE) ?
-					FAULT_FLAG_WRITE : 0);
+				ret = handle_mm_fault(mm, vma, start, fault_fl);
 
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
@@ -1452,6 +1455,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 					if (ret &
 					    (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
 						return i ? i : -EFAULT;
+					else if (ret & VM_FAULT_MAJOR)
+						return i ? i : -EFAULT;
 					BUG();
 				}
 				if (ret & VM_FAULT_MAJOR)
@@ -1562,6 +1567,23 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	int flags = FOLL_TOUCH | FOLL_MINOR;
+
+	if (pages)
+		flags |= FOLL_GET;
+	if (write)
+		flags |= FOLL_WRITE;
+	if (force)
+		flags |= FOLL_FORCE;
+
+	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
+}
+EXPORT_SYMBOL(get_user_pages_noio);
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
@@ -2648,6 +2670,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry);
 	if (!page) {
+		if (flags & FAULT_FLAG_MINOR)
+			return VM_FAULT_MAJOR | VM_FAULT_ERROR;
+
 		grab_swap_token(mm); /* Contend for token _before_ read-in */
 		page = swapin_readahead(entry,
 					GFP_HIGHUSER_MOVABLE, vma, address);
diff --git a/mm/shmem.c b/mm/shmem.c
index 080b09a..470d8a7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1228,6 +1228,7 @@ static int shmem_getpage(struct inode *inode, unsigned long idx,
 	swp_entry_t swap;
 	gfp_t gfp;
 	int error;
+	int flags = type ? *type : 0;
 
 	if (idx >= SHMEM_MAX_INDEX)
 		return -EFBIG;
@@ -1287,6 +1288,11 @@ repeat:
 		swappage = lookup_swap_cache(swap);
 		if (!swappage) {
 			shmem_swp_unmap(entry);
+			if (flags & FAULT_FLAG_MINOR) {
+				spin_unlock(&info->lock);
+				*type = VM_FAULT_MAJOR | VM_FAULT_ERROR;
+				goto failed;
+			}
 			/* here we actually do the io */
 			if (type && !(*type & VM_FAULT_MAJOR)) {
 				__count_vm_event(PGMAJFAULT);
@@ -1510,7 +1516,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
 	int error;
-	int ret;
+	int ret = (int)vmf->flags;
 
 	if (((loff_t)vmf->pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
-- 
1.7.1
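
For reference, a minimal sketch of how a caller is expected to use the new
helper (this mirrors the hva_to_pfn() change in patch 02; the wrapper
function itself is illustrative and not part of the series):

/* Illustrative wrapper -- the real caller is hva_to_pfn() in patch 02. */
static struct page *try_pin_page_without_io(unsigned long addr, int write)
{
	struct page *page = NULL;
	int npages;

	down_read(&current->mm->mmap_sem);
	/* Succeeds only if pinning the page needs no major (I/O) fault. */
	npages = get_user_pages_noio(current, current->mm, addr, 1,
				     write, 0, &page, NULL);
	up_read(&current->mm->mmap_sem);

	/* On failure the caller should defer the access to a context that
	 * is allowed to sleep on swap-in I/O (e.g. a workqueue). */
	return npages == 1 ? page : NULL;
}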



* [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
  2010-10-04 15:56 ` [PATCH v6 01/12] Add get_user_pages() variant that fails if major fault is required Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-05  1:20   ` Rik van Riel
                     ` (2 more replies)
  2010-10-04 15:56 ` [PATCH v6 03/12] Retry fault before vmentry Gleb Natapov
                   ` (9 subsequent siblings)
  11 siblings, 3 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If a guest accesses swapped out memory, do not swap it in from the vcpu
thread context. Schedule work to do the swapping and put the vcpu into a
halted state instead.

Interrupts will still be delivered to the guest, and if an interrupt
causes a reschedule, the guest will continue running another task.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |   17 +++
 arch/x86/kvm/Kconfig            |    1 +
 arch/x86/kvm/Makefile           |    1 +
 arch/x86/kvm/mmu.c              |   51 +++++++++-
 arch/x86/kvm/paging_tmpl.h      |    4 +-
 arch/x86/kvm/x86.c              |  109 +++++++++++++++++++-
 include/linux/kvm_host.h        |   31 ++++++
 include/trace/events/kvm.h      |   88 ++++++++++++++++
 virt/kvm/Kconfig                |    3 +
 virt/kvm/async_pf.c             |  220 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/async_pf.h             |   36 +++++++
 virt/kvm/kvm_main.c             |   57 ++++++++--
 12 files changed, 603 insertions(+), 15 deletions(-)
 create mode 100644 virt/kvm/async_pf.c
 create mode 100644 virt/kvm/async_pf.h

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e209078..5f154d3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -83,6 +83,8 @@
 #define KVM_NR_FIXED_MTRR_REGION 88
 #define KVM_NR_VAR_MTRR 8
 
+#define ASYNC_PF_PER_VCPU 64
+
 extern spinlock_t kvm_lock;
 extern struct list_head vm_list;
 
@@ -412,6 +414,10 @@ struct kvm_vcpu_arch {
 	u64 hv_vapic;
 
 	cpumask_var_t wbinvd_dirty_mask;
+
+	struct {
+		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
+	} apf;
 };
 
 struct kvm_arch {
@@ -585,7 +591,12 @@ struct kvm_x86_ops {
 	const struct trace_print_flags *exit_reasons_str;
 };
 
+struct kvm_arch_async_pf {
+	gfn_t gfn;
+};
+
 extern struct kvm_x86_ops *kvm_x86_ops;
+extern struct kvm_async_pf *kvm_double_apf;
 
 int kvm_mmu_module_init(void);
 void kvm_mmu_module_exit(void);
@@ -823,4 +834,10 @@ void kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
 
 bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip);
 
+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work);
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				 struct kvm_async_pf *work);
+extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ddc131f..50f6364 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -28,6 +28,7 @@ config KVM
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_EVENTFD
 	select KVM_APIC_ARCHITECTURE
+	select KVM_ASYNC_PF
 	select USER_RETURN_NOTIFIER
 	select KVM_MMIO
 	---help---
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 31a7035..c53bf19 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -9,6 +9,7 @@ kvm-y			+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
 				coalesced_mmio.o irq_comm.o eventfd.o \
 				assigned-dev.o)
 kvm-$(CONFIG_IOMMU_API)	+= $(addprefix ../../../virt/kvm/, iommu.o)
+kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(addprefix ../../../virt/kvm/, async_pf.o)
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o timer.o
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c94c432..4d49b5e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -18,9 +18,11 @@
  *
  */
 
+#include "irq.h"
 #include "mmu.h"
 #include "x86.h"
 #include "kvm_cache_regs.h"
+#include "x86.h"
 
 #include <linux/kvm_host.h>
 #include <linux/types.h>
@@ -2575,6 +2577,49 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 			     error_code & PFERR_WRITE_MASK, gfn);
 }
 
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
+{
+	struct kvm_arch_async_pf arch;
+	arch.gfn = gfn;
+
+	return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
+}
+
+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
+{
+	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
+		     kvm_event_needs_reinjection(vcpu)))
+		return false;
+
+	return kvm_x86_ops->interrupt_allowed(vcpu);
+}
+
+static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
+			 pfn_t *pfn)
+{
+	bool async;
+
+	*pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
+
+	if (!async)
+		return false; /* *pfn has correct page already */
+
+	put_page(pfn_to_page(*pfn));
+
+	if (can_do_async_pf(vcpu)) {
+		trace_kvm_try_async_get_page(async, *pfn);
+		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
+			vcpu->async_pf.work = kvm_double_apf;
+			return true;
+		} else if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
+			return true;
+	}
+
+	*pfn = gfn_to_pfn(vcpu->kvm, gfn);
+	
+	return false;
+}
+
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 				u32 error_code)
 {
@@ -2597,7 +2642,11 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-	pfn = gfn_to_pfn(vcpu->kvm, gfn);
+
+	if (try_async_pf(vcpu, gfn, gpa, &pfn))
+		return 0;
+
+	/* mmio */
 	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
 	spin_lock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 2bdd843..8154353 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -573,7 +573,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-	pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
+
+	if (try_async_pf(vcpu, walker.gfn, addr, &pfn))
+		return 0;
 
 	/* mmio */
 	if (is_error_pfn(pfn))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3729bcb..8dd9ac2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -43,6 +43,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/uaccess.h>
+#include <linux/hash.h>
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -116,6 +117,7 @@ struct kvm_shared_msrs {
 
 static struct kvm_shared_msrs_global __read_mostly shared_msrs_global;
 static DEFINE_PER_CPU(struct kvm_shared_msrs, shared_msrs);
+struct kvm_async_pf *kvm_double_apf;
 
 struct kvm_stats_debugfs_item debugfs_entries[] = {
 	{ "pf_fixed", VCPU_STAT(pf_fixed) },
@@ -4635,6 +4637,12 @@ int kvm_arch_init(void *opaque)
 		goto out;
 	}
 
+	kvm_double_apf = kzalloc(sizeof(struct kvm_async_pf), GFP_KERNEL);
+	if (!kvm_double_apf) {
+		r = -ENOMEM;
+		goto out;
+	}
+
 	r = kvm_mmu_module_init();
 	if (r)
 		goto out;
@@ -4657,6 +4665,7 @@ int kvm_arch_init(void *opaque)
 	return 0;
 
 out:
+	kfree(kvm_double_apf);
 	return r;
 }
 
@@ -4669,6 +4678,7 @@ void kvm_arch_exit(void)
 					    CPUFREQ_TRANSITION_NOTIFIER);
 	unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
 	kvm_x86_ops = NULL;
+	kfree(kvm_double_apf);
 	kvm_mmu_module_exit();
 }
 
@@ -5112,6 +5122,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (unlikely(r))
 		goto out;
 
+	kvm_check_async_pf_completion(vcpu);
+	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
+		/* Page is swapped out. Do synthetic halt */
+		r = 1;
+		goto out;
+	}
+
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
 		inject_pending_event(vcpu);
 
@@ -5781,6 +5798,9 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
+	kvm_clear_async_pf_completion_queue(vcpu);
+	memset(vcpu->arch.apf.gfns, 0xff, sizeof vcpu->arch.apf.gfns);
+
 	return kvm_x86_ops->vcpu_reset(vcpu);
 }
 
@@ -5869,6 +5889,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask, GFP_KERNEL))
 		goto fail_free_mce_banks;
 
+	memset(vcpu->arch.apf.gfns, 0xff, sizeof vcpu->arch.apf.gfns);
+
 	return 0;
 fail_free_mce_banks:
 	kfree(vcpu->arch.mce_banks);
@@ -5927,8 +5949,10 @@ static void kvm_free_vcpus(struct kvm *kvm)
 	/*
 	 * Unpin any mmu pages first.
 	 */
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_unload_vcpu_mmu(vcpu);
+	}
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_arch_vcpu_free(vcpu);
 
@@ -6040,6 +6064,7 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
+		|| !list_empty_careful(&vcpu->async_pf.done)
 		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
 		|| vcpu->arch.nmi_pending ||
 		(kvm_arch_interrupt_allowed(vcpu) &&
@@ -6098,6 +6123,88 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
 
+static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
+{
+	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
+}
+
+static inline u32 kvm_async_pf_next_probe(u32 key)
+{
+	return (key + 1) & (roundup_pow_of_two(ASYNC_PF_PER_VCPU) - 1);
+}
+
+static void kvm_add_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_fn(gfn);
+
+	while (vcpu->arch.apf.gfns[key] != -1)
+		key = kvm_async_pf_next_probe(key);
+
+	vcpu->arch.apf.gfns[key] = gfn;
+}
+
+static u32 kvm_async_pf_gfn_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	int i;
+	u32 key = kvm_async_pf_hash_fn(gfn);
+
+	for (i = 0; i < roundup_pow_of_two(ASYNC_PF_PER_VCPU) &&
+		     (vcpu->arch.apf.gfns[key] != gfn ||
+		      vcpu->arch.apf.gfns[key] == -1); i++)
+		key = kvm_async_pf_next_probe(key);
+
+	return key;
+}
+
+bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	return vcpu->arch.apf.gfns[kvm_async_pf_gfn_slot(vcpu, gfn)] == gfn;
+}
+
+static void kvm_del_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 i, j, k;
+
+	i = j = kvm_async_pf_gfn_slot(vcpu, gfn);
+	while (true) {
+		vcpu->arch.apf.gfns[i] = -1;
+		do {
+			j = kvm_async_pf_next_probe(j);
+			if (vcpu->arch.apf.gfns[j] == -1)
+				return;
+			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
+			/*
+			 * k lies cyclically in ]i,j]
+			 * |    i.k.j |
+			 * |....j i.k.| or  |.k..j i...|
+			 */
+		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
+		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
+		i = j;
+	}
+}
+
+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work)
+{
+	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
+
+	if (work == kvm_double_apf)
+		trace_kvm_async_pf_doublefault(kvm_rip_read(vcpu));
+	else {
+		trace_kvm_async_pf_not_present(work->gva);
+
+		kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
+	}
+}
+
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				 struct kvm_async_pf *work)
+{
+	trace_kvm_async_pf_ready(work->gva);
+	kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0b89d00..a08614e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -74,6 +74,26 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 			      struct kvm_io_device *dev);
 
+#ifdef CONFIG_KVM_ASYNC_PF
+struct kvm_async_pf {
+	struct work_struct work;
+	struct list_head link;
+	struct list_head queue;
+	struct kvm_vcpu *vcpu;
+	struct mm_struct *mm;
+	gva_t gva;
+	unsigned long addr;
+	struct kvm_arch_async_pf arch;
+	struct page *page;
+};
+
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+		       struct kvm_arch_async_pf *arch);
+int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
+#endif
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -104,6 +124,16 @@ struct kvm_vcpu {
 	gpa_t mmio_phys_addr;
 #endif
 
+#ifdef CONFIG_KVM_ASYNC_PF
+	struct {
+		u32 queued;
+		struct list_head queue;
+		struct list_head done;
+		spinlock_t lock;
+		struct kvm_async_pf *work;
+	} async_pf;
+#endif
+
 	struct kvm_vcpu_arch arch;
 };
 
@@ -302,6 +332,7 @@ void kvm_set_page_accessed(struct page *page);
 
 pfn_t hva_to_pfn_atomic(struct kvm *kvm, unsigned long addr);
 pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
+pfn_t gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, bool *async);
 pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
 			 struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 6dd3a51..bcc69b2 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -185,6 +185,94 @@ TRACE_EVENT(kvm_age_page,
 		  __entry->referenced ? "YOUNG" : "OLD")
 );
 
+#ifdef CONFIG_KVM_ASYNC_PF
+TRACE_EVENT(
+	kvm_try_async_get_page,
+	TP_PROTO(bool async, u64 pfn),
+	TP_ARGS(async, pfn),
+
+	TP_STRUCT__entry(
+		__field(__u64, pfn)
+		),
+
+	TP_fast_assign(
+		__entry->pfn = (!async) ? pfn : (u64)-1;
+		),
+
+	TP_printk("pfn %#llx", __entry->pfn)
+);
+
+TRACE_EVENT(
+	kvm_async_pf_not_present,
+	TP_PROTO(u64 gva),
+	TP_ARGS(gva),
+
+	TP_STRUCT__entry(
+		__field(__u64, gva)
+		),
+
+	TP_fast_assign(
+		__entry->gva = gva;
+		),
+
+	TP_printk("gva %#llx not present", __entry->gva)
+);
+
+TRACE_EVENT(
+	kvm_async_pf_ready,
+	TP_PROTO(u64 gva),
+	TP_ARGS(gva),
+
+	TP_STRUCT__entry(
+		__field(__u64, gva)
+		),
+
+	TP_fast_assign(
+		__entry->gva = gva;
+		),
+
+	TP_printk("gva %#llx ready", __entry->gva)
+);
+
+TRACE_EVENT(
+	kvm_async_pf_completed,
+	TP_PROTO(unsigned long address, struct page *page, u64 gva),
+	TP_ARGS(address, page, gva),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, address)
+		__field(struct page*, page)
+		__field(u64, gva)
+		),
+
+	TP_fast_assign(
+		__entry->address = address;
+		__entry->page = page;
+		__entry->gva = gva;
+		),
+
+	TP_printk("gva %#llx address %#lx pfn %lx",  __entry->gva,
+		  __entry->address, page_to_pfn(__entry->page))
+);
+
+TRACE_EVENT(
+	kvm_async_pf_doublefault,
+	TP_PROTO(unsigned long rip),
+	TP_ARGS(rip),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, rip)
+		),
+
+	TP_fast_assign(
+		__entry->rip = rip;
+		),
+
+	TP_printk("rip = %#lx", __entry->rip)
+);
+
+#endif
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 7f1178f..f63ccb0 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -15,3 +15,6 @@ config KVM_APIC_ARCHITECTURE
 
 config KVM_MMIO
        bool
+
+config KVM_ASYNC_PF
+       bool
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
new file mode 100644
index 0000000..f5109eb
--- /dev/null
+++ b/virt/kvm/async_pf.c
@@ -0,0 +1,220 @@
+/*
+ * kvm asynchromous fault support
+ *
+ * Copyright 2010 Red Hat, Inc.
+ *
+ * Author:
+ *      Gleb Natapov <gleb@redhat.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/mmu_context.h>
+
+#include "async_pf.h"
+#include <trace/events/kvm.h>
+
+static struct kmem_cache *async_pf_cache;
+
+int kvm_async_pf_init(void)
+{
+	async_pf_cache = KMEM_CACHE(kvm_async_pf, 0);
+
+	if (!async_pf_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void kvm_async_pf_deinit(void)
+{
+	if (async_pf_cache)
+		kmem_cache_destroy(async_pf_cache);
+	async_pf_cache = NULL;
+}
+
+void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu)
+{
+	INIT_LIST_HEAD(&vcpu->async_pf.done);
+	INIT_LIST_HEAD(&vcpu->async_pf.queue);
+	spin_lock_init(&vcpu->async_pf.lock);
+}
+
+static void async_pf_execute(struct work_struct *work)
+{
+	struct page *page;
+	struct kvm_async_pf *apf =
+		container_of(work, struct kvm_async_pf, work);
+	struct mm_struct *mm = apf->mm;
+	struct kvm_vcpu *vcpu = apf->vcpu;
+	unsigned long addr = apf->addr;
+	gva_t gva = apf->gva;
+
+	might_sleep();
+
+	use_mm(mm);
+	down_read(&mm->mmap_sem);
+	get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
+	up_read(&mm->mmap_sem);
+	unuse_mm(mm);
+
+	spin_lock(&vcpu->async_pf.lock);
+	list_add_tail(&apf->link, &vcpu->async_pf.done);
+	apf->page = page;
+	spin_unlock(&vcpu->async_pf.lock);
+
+	/*
+	 * apf may be freed by kvm_check_async_pf_completion() after
+	 * this point
+	 */
+
+	trace_kvm_async_pf_completed(addr, page, gva);
+
+	if (waitqueue_active(&vcpu->wq))
+		wake_up_interruptible(&vcpu->wq);
+
+	mmdrop(mm);
+	kvm_put_kvm(vcpu->kvm);
+}
+
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
+{
+	/* cancel outstanding work queue item */
+	while (!list_empty(&vcpu->async_pf.queue)) {
+		struct kvm_async_pf *work =
+			list_entry(vcpu->async_pf.queue.next,
+				   typeof(*work), queue);
+		cancel_work_sync(&work->work);
+		list_del(&work->queue);
+		if (!work->page) /* work was canceled */
+			kmem_cache_free(async_pf_cache, work);
+	}
+
+	spin_lock(&vcpu->async_pf.lock);
+	while (!list_empty(&vcpu->async_pf.done)) {
+		struct kvm_async_pf *work =
+			list_entry(vcpu->async_pf.done.next,
+				   typeof(*work), link);
+		list_del(&work->link);
+		put_page(work->page);
+		kmem_cache_free(async_pf_cache, work);
+	}
+	spin_unlock(&vcpu->async_pf.lock);
+
+	vcpu->async_pf.queued = 0;
+	vcpu->async_pf.work = NULL;
+}
+
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
+{
+	struct kvm_async_pf *work = vcpu->async_pf.work;
+
+	if (work) {
+		vcpu->async_pf.work = NULL;
+		if (work->page == NULL) {
+			kvm_arch_async_page_not_present(vcpu, work);
+			return;
+		} else {
+			/* page was swapped in before vcpu entry */
+			spin_lock(&vcpu->async_pf.lock);
+			list_del(&work->link);
+			spin_unlock(&vcpu->async_pf.lock);
+			goto free;
+		}
+	}
+
+	if (list_empty_careful(&vcpu->async_pf.done))
+		return;
+
+	spin_lock(&vcpu->async_pf.lock);
+	work = list_first_entry(&vcpu->async_pf.done, typeof(*work), link);
+	list_del(&work->link);
+	spin_unlock(&vcpu->async_pf.lock);
+
+	kvm_arch_async_page_present(vcpu, work);
+
+free:
+	list_del(&work->queue);
+	vcpu->async_pf.queued--;
+	put_page(work->page);
+	kmem_cache_free(async_pf_cache, work);
+}
+
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+		       struct kvm_arch_async_pf *arch)
+{
+	struct kvm_async_pf *work;
+
+	if (vcpu->async_pf.queued >= ASYNC_PF_PER_VCPU)
+		return 0;
+
+	/* setup delayed work */
+
+	/* do alloc nowait since if we are going to sleep anyway we
+	   may as well sleep faulting in page */
+	work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
+	if (!work)
+		return 0;
+
+	work->page = NULL;
+	work->vcpu = vcpu;
+	work->gva = gva;
+	work->addr = gfn_to_hva(vcpu->kvm, gfn);
+	work->arch = *arch;
+	work->mm = current->mm;
+	atomic_inc(&work->mm->mm_count);
+	kvm_get_kvm(work->vcpu->kvm);
+
+	/* this can't really happen otherwise gfn_to_pfn_async
+	   would succeed */
+	if (unlikely(kvm_is_error_hva(work->addr)))
+		goto retry_sync;
+
+	INIT_WORK(&work->work, async_pf_execute);
+	if (!schedule_work(&work->work))
+		goto retry_sync;
+
+	vcpu->async_pf.work = work;
+	list_add_tail(&work->queue, &vcpu->async_pf.queue);
+	vcpu->async_pf.queued++;
+	return 1;
+retry_sync:
+	kvm_put_kvm(work->vcpu->kvm);
+	mmdrop(work->mm);
+	kmem_cache_free(async_pf_cache, work);
+	return 0;
+}
+
+int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
+{
+	struct kvm_async_pf *work;
+
+	if (!list_empty(&vcpu->async_pf.done))
+		return 0;
+
+	work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
+	if (!work)
+		return -ENOMEM;
+
+	work->page = bad_page;
+	get_page(bad_page);
+	INIT_LIST_HEAD(&work->queue); /* for list_del to work */
+
+	list_add_tail(&work->link, &vcpu->async_pf.done);
+	vcpu->async_pf.queued++;
+	return 0;
+}
diff --git a/virt/kvm/async_pf.h b/virt/kvm/async_pf.h
new file mode 100644
index 0000000..fa15074
--- /dev/null
+++ b/virt/kvm/async_pf.h
@@ -0,0 +1,36 @@
+/*
+ * kvm asynchromous fault support
+ *
+ * Copyright 2010 Red Hat, Inc.
+ *
+ * Author:
+ *      Gleb Natapov <gleb@redhat.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __KVM_ASYNC_PF_H__
+#define __KVM_ASYNC_PF_H__
+
+#ifdef CONFIG_KVM_ASYNC_PF
+int kvm_async_pf_init(void);
+void kvm_async_pf_deinit(void);
+void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu);
+#else
+#define kvm_async_pf_init() (0)
+#define kvm_async_pf_deinit() do{}while(0)
+#define kvm_async_pf_vcpu_init(C) do{}while(0)
+#endif
+
+#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b8499f5..db58a1b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -55,6 +55,7 @@
 #include <asm-generic/bitops/le.h>
 
 #include "coalesced_mmio.h"
+#include "async_pf.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>
@@ -186,6 +187,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->kvm = kvm;
 	vcpu->vcpu_id = id;
 	init_waitqueue_head(&vcpu->wq);
+	kvm_async_pf_vcpu_init(vcpu);
 
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
@@ -948,17 +950,29 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(gfn_to_hva);
 
-static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic)
+static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic,
+			bool *async)
 {
 	struct page *page[1];
-	int npages;
+	int npages = 0;
 	pfn_t pfn;
 
-	if (atomic)
+	/* we can do it either atomically or asynchronously, not both */
+	BUG_ON(atomic && async);
+
+	if (atomic || async)
 		npages = __get_user_pages_fast(addr, 1, 1, page);
-	else {
+
+	if (unlikely(npages != 1) && !atomic) {
 		might_sleep();
-		npages = get_user_pages_fast(addr, 1, 1, page);
+
+		if (async) {
+			down_read(&current->mm->mmap_sem);
+			npages = get_user_pages_noio(current, current->mm,
+						     addr, 1, 1, 0, page, NULL);
+			up_read(&current->mm->mmap_sem);
+		} else
+			npages = get_user_pages_fast(addr, 1, 1, page);
 	}
 
 	if (unlikely(npages != 1)) {
@@ -978,6 +992,9 @@ static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic)
 
 		if (vma == NULL || addr < vma->vm_start ||
 		    !(vma->vm_flags & VM_PFNMAP)) {
+			if (async && !(vma->vm_flags & VM_PFNMAP) &&
+			    (vma->vm_flags & VM_WRITE))
+				*async = true;
 			up_read(&current->mm->mmap_sem);
 return_fault_page:
 			get_page(fault_page);
@@ -995,32 +1012,41 @@ return_fault_page:
 
 pfn_t hva_to_pfn_atomic(struct kvm *kvm, unsigned long addr)
 {
-	return hva_to_pfn(kvm, addr, true);
+	return hva_to_pfn(kvm, addr, true, NULL);
 }
 EXPORT_SYMBOL_GPL(hva_to_pfn_atomic);
 
-static pfn_t __gfn_to_pfn(struct kvm *kvm, gfn_t gfn, bool atomic)
+static pfn_t __gfn_to_pfn(struct kvm *kvm, gfn_t gfn, bool atomic, bool *async)
 {
 	unsigned long addr;
 
+	if (async)
+		*async = false;
+
 	addr = gfn_to_hva(kvm, gfn);
 	if (kvm_is_error_hva(addr)) {
 		get_page(bad_page);
 		return page_to_pfn(bad_page);
 	}
 
-	return hva_to_pfn(kvm, addr, atomic);
+	return hva_to_pfn(kvm, addr, atomic, async);
 }
 
 pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
 {
-	return __gfn_to_pfn(kvm, gfn, true);
+	return __gfn_to_pfn(kvm, gfn, true, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_atomic);
 
+pfn_t gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, bool *async)
+{
+	return __gfn_to_pfn(kvm, gfn, false, async);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_async);
+
 pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 {
-	return __gfn_to_pfn(kvm, gfn, false);
+	return __gfn_to_pfn(kvm, gfn, false, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
@@ -1028,7 +1054,7 @@ pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
 			 struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	unsigned long addr = gfn_to_hva_memslot(slot, gfn);
-	return hva_to_pfn(kvm, addr, false);
+	return hva_to_pfn(kvm, addr, false, NULL);
 }
 
 int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages,
@@ -2335,6 +2361,10 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 		goto out_free_5;
 	}
 
+	r = kvm_async_pf_init();
+	if (r)
+		goto out_free;
+
 	kvm_chardev_ops.owner = module;
 	kvm_vm_fops.owner = module;
 	kvm_vcpu_fops.owner = module;
@@ -2342,7 +2372,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	r = misc_register(&kvm_dev);
 	if (r) {
 		printk(KERN_ERR "kvm: misc device register failed\n");
-		goto out_free;
+		goto out_unreg;
 	}
 
 	kvm_preempt_ops.sched_in = kvm_sched_in;
@@ -2352,6 +2382,8 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 
 	return 0;
 
+out_unreg:
+	kvm_async_pf_deinit();
 out_free:
 	kmem_cache_destroy(kvm_vcpu_cache);
 out_free_5:
@@ -2384,6 +2416,7 @@ void kvm_exit(void)
 	kvm_exit_debug();
 	misc_deregister(&kvm_dev);
 	kmem_cache_destroy(kvm_vcpu_cache);
+	kvm_async_pf_deinit();
 	sysdev_unregister(&kvm_sysdev);
 	sysdev_class_unregister(&kvm_sysdev_class);
 	unregister_reboot_notifier(&kvm_reboot_notifier);
-- 
1.7.1



* [PATCH v6 03/12] Retry fault before vmentry
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
  2010-10-04 15:56 ` [PATCH v6 01/12] Add get_user_pages() variant that fails if major fault is required Gleb Natapov
  2010-10-04 15:56 ` [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-05 15:54   ` Marcelo Tosatti
  2010-10-07 12:29   ` Avi Kivity
  2010-10-04 15:56 ` [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface Gleb Natapov
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

When a page is swapped in, it is mapped into guest memory only after the
guest tries to access it again and generates another fault. To save this
fault we can map it immediately, since we know that the guest is going to
access the page. Do it only when tdp is enabled for now. The shadow paging
case is more complicated: the CR[034] and EFER registers would have to be
switched before doing the mapping and then switched back.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    4 +++-
 arch/x86/kvm/mmu.c              |   16 ++++++++--------
 arch/x86/kvm/paging_tmpl.h      |    6 +++---
 arch/x86/kvm/x86.c              |    7 +++++++
 virt/kvm/async_pf.c             |    2 ++
 5 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5f154d3..b9f263e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -240,7 +240,7 @@ struct kvm_mmu {
 	void (*new_cr3)(struct kvm_vcpu *vcpu);
 	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
 	unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
-	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
+	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, bool no_apf);
 	void (*inject_page_fault)(struct kvm_vcpu *vcpu);
 	void (*free)(struct kvm_vcpu *vcpu);
 	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
@@ -838,6 +838,8 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 				     struct kvm_async_pf *work);
 void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work);
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+			       struct kvm_async_pf *work);
 extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4d49b5e..d85fda8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2558,7 +2558,7 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr,
 }
 
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
-				u32 error_code)
+				u32 error_code, bool no_apf)
 {
 	gfn_t gfn;
 	int r;
@@ -2594,8 +2594,8 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 	return kvm_x86_ops->interrupt_allowed(vcpu);
 }
 
-static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
-			 pfn_t *pfn)
+static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
+			 gva_t gva, pfn_t *pfn)
 {
 	bool async;
 
@@ -2606,7 +2606,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 
 	put_page(pfn_to_page(*pfn));
 
-	if (can_do_async_pf(vcpu)) {
+	if (!no_apf && can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(async, *pfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			vcpu->async_pf.work = kvm_double_apf;
@@ -2620,8 +2620,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 	return false;
 }
 
-static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
-				u32 error_code)
+static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
+			  bool no_apf)
 {
 	pfn_t pfn;
 	int r;
@@ -2643,7 +2643,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, gfn, gpa, &pfn))
+	if (try_async_pf(vcpu, no_apf, gfn, gpa, &pfn))
 		return 0;
 
 	/* mmio */
@@ -3306,7 +3306,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code)
 	int r;
 	enum emulation_result er;
 
-	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code);
+	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
 	if (r < 0)
 		goto out;
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 8154353..9ad90f8 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -530,8 +530,8 @@ out_gpte_changed:
  *  Returns: 1 if we need to emulate the instruction, 0 otherwise, or
  *           a negative value on error.
  */
-static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
-			       u32 error_code)
+static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
+			     bool no_apf)
 {
 	int write_fault = error_code & PFERR_WRITE_MASK;
 	int user_fault = error_code & PFERR_USER_MASK;
@@ -574,7 +574,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, walker.gfn, addr, &pfn))
+	if (try_async_pf(vcpu, no_apf, walker.gfn, addr, &pfn))
 		return 0;
 
 	/* mmio */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8dd9ac2..48fd59d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6123,6 +6123,13 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
 
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
+{
+	if (!tdp_enabled || is_error_page(work->page))
+		return;
+	vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true);
+}
+
 static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
 {
 	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index f5109eb..44f4005 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -133,6 +133,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 			spin_lock(&vcpu->async_pf.lock);
 			list_del(&work->link);
 			spin_unlock(&vcpu->async_pf.lock);
+			kvm_arch_async_page_ready(vcpu, work);
 			goto free;
 		}
 	}
@@ -145,6 +146,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 	list_del(&work->link);
 	spin_unlock(&vcpu->async_pf.lock);
 
+	kvm_arch_async_page_ready(vcpu, work);
 	kvm_arch_async_page_present(vcpu, work);
 
 free:
-- 
1.7.1



* [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
                   ` (2 preceding siblings ...)
  2010-10-04 15:56 ` [PATCH v6 03/12] Retry fault before vmentry Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-05  1:29   ` Rik van Riel
                     ` (2 more replies)
  2010-10-04 15:56 ` [PATCH v6 05/12] Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c Gleb Natapov
                   ` (7 subsequent siblings)
  11 siblings, 3 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Keep track of memslot changes by keeping a generation number in the
memslots structure. Provide a kvm_write_guest_cached() function that
skips the gfn_to_hva() translation if the memslots have not changed
since the previous invocation.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 include/linux/kvm_host.h  |    7 +++++
 include/linux/kvm_types.h |    7 +++++
 virt/kvm/kvm_main.c       |   57 +++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a08614e..4dff9a1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -199,6 +199,7 @@ struct kvm_irq_routing_table {};
 
 struct kvm_memslots {
 	int nmemslots;
+	u32 generation;
 	struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
 					KVM_PRIVATE_MEM_SLOTS];
 };
@@ -352,12 +353,18 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
 			 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 		    unsigned long len);
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len);
+int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			      gpa_t gpa);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn);
 
 void kvm_vcpu_block(struct kvm_vcpu *vcpu);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 7ac0d4e..ee6eb71 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -67,4 +67,11 @@ struct kvm_lapic_irq {
 	u32 dest_id;
 };
 
+struct gfn_to_hva_cache {
+	u32 generation;
+	gpa_t gpa;
+	unsigned long hva;
+	struct kvm_memory_slot *memslot;
+};
+
 #endif /* __KVM_TYPES_H__ */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index db58a1b..45ef50c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -687,6 +687,7 @@ skip_lpage:
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 		if (mem->slot >= slots->nmemslots)
 			slots->nmemslots = mem->slot + 1;
+		slots->generation++;
 		slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
 
 		old_memslots = kvm->memslots;
@@ -723,6 +724,7 @@ skip_lpage:
 	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 	if (mem->slot >= slots->nmemslots)
 		slots->nmemslots = mem->slot + 1;
+	slots->generation++;
 
 	/* actual memory is freed via old in kvm_free_physmem_slot below */
 	if (!npages) {
@@ -1247,6 +1249,47 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 	return 0;
 }
 
+int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			      gpa_t gpa)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	int offset = offset_in_page(gpa);
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+
+	ghc->gpa = gpa;
+	ghc->generation = slots->generation;
+	ghc->memslot = gfn_to_memslot(kvm, gfn);
+	ghc->hva = gfn_to_hva(kvm, gfn);
+	if (!kvm_is_error_hva(ghc->hva))
+		ghc->hva += offset;
+	else
+		return -EFAULT;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);
+
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	int r;
+
+	if (slots->generation != ghc->generation)
+		kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);
+	
+	if (kvm_is_error_hva(ghc->hva))
+		return -EFAULT;
+
+	r = copy_to_user((void __user *)ghc->hva, data, len);
+	if (r)
+		return -EFAULT;
+	mark_page_dirty_in_slot(kvm, ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
+
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
 {
 	return kvm_write_guest_page(kvm, gfn, empty_zero_page, offset, len);
@@ -1272,11 +1315,9 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn)
 {
-	struct kvm_memory_slot *memslot;
-
-	memslot = gfn_to_memslot(kvm, gfn);
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 
@@ -1284,6 +1325,14 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	}
 }
 
+void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *memslot;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	mark_page_dirty_in_slot(kvm, memslot, gfn);
+}
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
-- 
1.7.1
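
For reference, a minimal sketch of how a caller might use the new cached
write interface (the wrapper structure and function names below are
illustrative, not part of the series):

/* Illustrative caller of kvm_gfn_to_hva_cache_init()/kvm_write_guest_cached(). */
struct shared_area {
	gpa_t gpa;			/* guest physical address of the area */
	struct gfn_to_hva_cache ghc;	/* cached gpa -> hva translation */
};

static int shared_area_init(struct kvm *kvm, struct shared_area *s, gpa_t gpa)
{
	s->gpa = gpa;
	return kvm_gfn_to_hva_cache_init(kvm, &s->ghc, gpa);
}

static int shared_area_write(struct kvm *kvm, struct shared_area *s, u64 val)
{
	/* Redoes the gfn_to_hva() lookup only if the memslot generation
	 * changed since the cache was initialized. */
	return kvm_write_guest_cached(kvm, &s->ghc, &val, sizeof(val));
}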



* [PATCH v6 05/12] Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
                   ` (3 preceding siblings ...)
  2010-10-04 15:56 ` [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-04 15:56 ` [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery Gleb Natapov
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Async PF also needs to hook into smp_prepare_boot_cpu(), so move the hook
into generic code.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    1 +
 arch/x86/kernel/kvm.c           |   11 +++++++++++
 arch/x86/kernel/kvmclock.c      |   13 +------------
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 7b562b6..e3faaaf 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -65,6 +65,7 @@ struct kvm_mmu_op_release_pt {
 #include <asm/processor.h>
 
 extern void kvmclock_init(void);
+extern int kvm_register_clock(char *txt);
 
 
 /* This instruction is vmcall.  On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 63b0ec8..e6db179 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -231,10 +231,21 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+#ifdef CONFIG_SMP
+static void __init kvm_smp_prepare_boot_cpu(void)
+{
+	WARN_ON(kvm_register_clock("primary cpu clock"));
+	native_smp_prepare_boot_cpu();
+}
+#endif
+
 void __init kvm_guest_init(void)
 {
 	if (!kvm_para_available())
 		return;
 
 	paravirt_ops_setup();
+#ifdef CONFIG_SMP
+	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+#endif
 }
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ca43ce3..f98d3ea 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -125,7 +125,7 @@ static struct clocksource kvm_clock = {
 	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
 };
 
-static int kvm_register_clock(char *txt)
+int kvm_register_clock(char *txt)
 {
 	int cpu = smp_processor_id();
 	int low, high, ret;
@@ -152,14 +152,6 @@ static void __cpuinit kvm_setup_secondary_clock(void)
 }
 #endif
 
-#ifdef CONFIG_SMP
-static void __init kvm_smp_prepare_boot_cpu(void)
-{
-	WARN_ON(kvm_register_clock("primary cpu clock"));
-	native_smp_prepare_boot_cpu();
-}
-#endif
-
 /*
  * After the clock is registered, the host will keep writing to the
  * registered memory location. If the guest happens to shutdown, this memory
@@ -206,9 +198,6 @@ void __init kvmclock_init(void)
 	x86_cpuinit.setup_percpu_clockev =
 		kvm_setup_secondary_clock;
 #endif
-#ifdef CONFIG_SMP
-	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
-#endif
 	machine_ops.shutdown  = kvm_shutdown;
 #ifdef CONFIG_KEXEC
 	machine_ops.crash_shutdown  = kvm_crash_shutdown;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery.
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
                   ` (4 preceding siblings ...)
  2010-10-04 15:56 ` [PATCH v6 05/12] Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-07 12:42   ` Avi Kivity
  2010-10-07 12:58   ` Avi Kivity
  2010-10-04 15:56 ` [PATCH v6 07/12] Add async PF initialization to PV guest Gleb Natapov
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Guest enables async PF vcpu functionality using this MSR.
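
As a sketch of the layout described in the msr.txt hunk below (illustrative
only; apf_reason_word is a made-up stand-in, and the real guest-side enable
code is added later in this series using native_write_msr_safe()):

static u32 apf_reason_word __aligned(64);	/* stand-in for the per-cpu area */

static void example_enable_async_pf(void)
{
	/* bits 63-6: physical address of the 64-byte aligned area,
	 * bits 5-1: reserved, must stay zero, bit 0: enable */
	u64 pa = __pa(&apf_reason_word);

	wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
}

static void example_disable_async_pf(void)
{
	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
}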

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 Documentation/kvm/cpuid.txt     |    3 +++
 Documentation/kvm/msr.txt       |   13 ++++++++++++-
 arch/x86/include/asm/kvm_host.h |    2 ++
 arch/x86/include/asm/kvm_para.h |    4 ++++
 arch/x86/kvm/x86.c              |   38 ++++++++++++++++++++++++++++++++++++--
 include/linux/kvm.h             |    1 +
 6 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/Documentation/kvm/cpuid.txt b/Documentation/kvm/cpuid.txt
index 14a12ea..8820685 100644
--- a/Documentation/kvm/cpuid.txt
+++ b/Documentation/kvm/cpuid.txt
@@ -36,6 +36,9 @@ KVM_FEATURE_MMU_OP                 ||     2 || deprecated.
 KVM_FEATURE_CLOCKSOURCE2           ||     3 || kvmclock available at msrs
                                    ||       || 0x4b564d00 and 0x4b564d01
 ------------------------------------------------------------------------------
+KVM_FEATURE_ASYNC_PF               ||     4 || async pf can be enabled by
+                                   ||       || writing to msr 0x4b564d02
+------------------------------------------------------------------------------
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
                                    ||       || per-cpu warps are expected in
                                    ||       || kvmclock.
diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
index 8ddcfe8..d64e723 100644
--- a/Documentation/kvm/msr.txt
+++ b/Documentation/kvm/msr.txt
@@ -3,7 +3,6 @@ Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
 =====================================================
 
 KVM makes use of some custom MSRs to service some requests.
-At present, this facility is only used by kvmclock.
 
 Custom MSRs have a range reserved for them, that goes from
 0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
@@ -151,3 +150,15 @@ MSR_KVM_SYSTEM_TIME: 0x12
 			return PRESENT;
 		} else
 			return NON_PRESENT;
+
+MSR_KVM_ASYNC_PF_EN: 0x4b564d02
+	data: Bits 63-6 hold 64-byte aligned physical address of a 32bit memory
+	area which must be in guest RAM. Bits 5-1 are reserved and should be
+	zero. Bit 0 is 1 when asynchronous page faults are enabled on the vcpu,
+	0 when disabled.
+
+	Physical address points to 32 bit memory location that will be written
+	to by the hypervisor at the time of asynchronous page fault injection to
+	indicate the type of asynchronous page fault. A value of 1 means that
+	the page referred to by the page fault is not present. A value of 2
+	means that the page is now available.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b9f263e..de31551 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -417,6 +417,8 @@ struct kvm_vcpu_arch {
 
 	struct {
 		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
+		struct gfn_to_hva_cache data;
+		u64 msr_val;
 	} apf;
 };
 
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index e3faaaf..8662ae0 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -20,6 +20,7 @@
  * are available. The use of 0x11 and 0x12 is deprecated
  */
 #define KVM_FEATURE_CLOCKSOURCE2        3
+#define KVM_FEATURE_ASYNC_PF		4
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -32,9 +33,12 @@
 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
+#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 
 #define KVM_MAX_MMU_OP_BATCH           32
 
+#define KVM_ASYNC_PF_ENABLED			(1 << 0)
+
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE            1
 #define KVM_MMU_OP_FLUSH_TLB	        2
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 48fd59d..3e123ab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -782,12 +782,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	7
+#define KVM_SAVE_MSRS_BEGIN	8
 static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
 	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
-	HV_X64_MSR_APIC_ASSIST_PAGE,
+	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1425,6 +1425,29 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	return 0;
 }
 
+static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
+{
+	gpa_t gpa = data & ~0x3f;
+
+	/* Bits 1:5 are reserved, should be zero */
+	if (data & 0x3e)
+		return 1;
+
+	vcpu->arch.apf.msr_val = data;
+
+	if (!(data & KVM_ASYNC_PF_ENABLED)) {
+		kvm_clear_async_pf_completion_queue(vcpu);
+		memset(vcpu->arch.apf.gfns, 0xff, sizeof vcpu->arch.apf.gfns);
+		return 0;
+	}
+
+	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa))
+		return 1;
+
+	kvm_async_pf_wakeup_all(vcpu);
+	return 0;
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
 	switch (msr) {
@@ -1506,6 +1529,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		}
 		break;
 	}
+	case MSR_KVM_ASYNC_PF_EN:
+		if (kvm_pv_enable_async_pf(vcpu, data))
+			return 1;
+		break;
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1782,6 +1809,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 	case MSR_KVM_SYSTEM_TIME_NEW:
 		data = vcpu->arch.time;
 		break;
+	case MSR_KVM_ASYNC_PF_EN:
+		data = vcpu->arch.apf.msr_val;
+		break;
 	case MSR_IA32_P5_MC_ADDR:
 	case MSR_IA32_P5_MC_TYPE:
 	case MSR_IA32_MCG_CAP:
@@ -1929,6 +1959,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_DEBUGREGS:
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
+	case KVM_CAP_ASYNC_PF:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -5778,6 +5809,8 @@ free_vcpu:
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
+	vcpu->arch.apf.msr_val = 0;
+
 	vcpu_load(vcpu);
 	kvm_mmu_unload(vcpu);
 	vcpu_put(vcpu);
@@ -5797,6 +5830,7 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.dr7 = DR7_FIXED_1;
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
+	vcpu->arch.apf.msr_val = 0;
 
 	kvm_clear_async_pf_completion_queue(vcpu);
 	memset(vcpu->arch.apf.gfns, 0xff, sizeof vcpu->arch.apf.gfns);
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 919ae53..ea2dc1a 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -540,6 +540,7 @@ struct kvm_ppc_pvinfo {
 #endif
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
+#define KVM_CAP_ASYNC_PF 59
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v6 07/12] Add async PF initialization to PV guest.
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
                   ` (5 preceding siblings ...)
  2010-10-04 15:56 ` [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-05  2:34   ` Rik van Riel
                     ` (2 more replies)
  2010-10-04 15:56 ` [PATCH v6 08/12] Handle async PF in a guest Gleb Natapov
                   ` (4 subsequent siblings)
  11 siblings, 3 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Enable async PF in a guest if async PF capability is discovered.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 Documentation/kernel-parameters.txt |    3 +
 arch/x86/include/asm/kvm_para.h     |    5 ++
 arch/x86/kernel/kvm.c               |   92 +++++++++++++++++++++++++++++++++++
 3 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 8dc2548..0bd2203 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1699,6 +1699,9 @@ and is between 256 and 4096 characters. It is defined in the file
 
 	no-kvmclock	[X86,KVM] Disable paravirtualized KVM clock driver
 
+	no-kvmapf	[X86,KVM] Disable paravirtualized asynchronous page
+			fault handling.
+
 	nolapic		[X86-32,APIC] Do not enable or use the local APIC.
 
 	nolapic_timer	[X86-32,APIC] Do not use the local APIC timer.
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 8662ae0..e193c4f 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -65,6 +65,11 @@ struct kvm_mmu_op_release_pt {
 	__u64 pt_phys;
 };
 
+struct kvm_vcpu_pv_apf_data {
+	__u32 reason;
+	__u32 enabled;
+};
+
 #ifdef __KERNEL__
 #include <asm/processor.h>
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e6db179..67d1c8d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -27,16 +27,30 @@
 #include <linux/mm.h>
 #include <linux/highmem.h>
 #include <linux/hardirq.h>
+#include <linux/notifier.h>
+#include <linux/reboot.h>
 #include <asm/timer.h>
+#include <asm/cpu.h>
 
 #define MMU_QUEUE_SIZE 1024
 
+static int kvmapf = 1;
+
+static int parse_no_kvmapf(char *arg)
+{
+        kvmapf = 0;
+        return 0;
+}
+
+early_param("no-kvmapf", parse_no_kvmapf);
+
 struct kvm_para_state {
 	u8 mmu_queue[MMU_QUEUE_SIZE];
 	int mmu_queue_len;
 };
 
 static DEFINE_PER_CPU(struct kvm_para_state, para_state);
+static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 
 static struct kvm_para_state *kvm_para_state(void)
 {
@@ -231,12 +245,86 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+void __cpuinit kvm_guest_cpu_init(void)
+{
+	if (!kvm_para_available())
+		return;
+
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
+		u64 pa = __pa(&__get_cpu_var(apf_reason));
+
+		if (native_write_msr_safe(MSR_KVM_ASYNC_PF_EN,
+					  pa | KVM_ASYNC_PF_ENABLED, pa >> 32))
+			return;
+		__get_cpu_var(apf_reason).enabled = 1;
+		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
+		       smp_processor_id());
+	}
+}
+
+static void kvm_pv_disable_apf(void *unused)
+{
+	if (!__get_cpu_var(apf_reason).enabled)
+		return;
+
+	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
+	__get_cpu_var(apf_reason).enabled = 0;
+
+	printk(KERN_INFO"Unregister pv shared memory for cpu %d\n",
+	       smp_processor_id());
+}
+
+static int kvm_pv_reboot_notify(struct notifier_block *nb,
+				unsigned long code, void *unused)
+{
+	if (code == SYS_RESTART)
+		on_each_cpu(kvm_pv_disable_apf, NULL, 1);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block kvm_pv_reboot_nb = {
+	.notifier_call = kvm_pv_reboot_notify,
+};
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
 	WARN_ON(kvm_register_clock("primary cpu clock"));
+	kvm_guest_cpu_init();
 	native_smp_prepare_boot_cpu();
 }
+
+static void kvm_guest_cpu_notify(void *dummy)
+{
+	if (!dummy)
+		kvm_guest_cpu_init();
+	else
+		kvm_pv_disable_apf(NULL);
+}
+
+static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
+				    unsigned long action, void *hcpu)
+{
+	int cpu = (unsigned long)hcpu;
+	switch (action) {
+	case CPU_ONLINE:
+	case CPU_DOWN_FAILED:
+	case CPU_ONLINE_FROZEN:
+		smp_call_function_single(cpu, kvm_guest_cpu_notify, NULL, 0);
+		break;
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		smp_call_function_single(cpu, kvm_guest_cpu_notify, (void*)1, 1);
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata kvm_cpu_notifier = {
+        .notifier_call  = kvm_cpu_notify,
+};
 #endif
 
 void __init kvm_guest_init(void)
@@ -245,7 +333,11 @@ void __init kvm_guest_init(void)
 		return;
 
 	paravirt_ops_setup();
+	register_reboot_notifier(&kvm_pv_reboot_nb);
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+	register_cpu_notifier(&kvm_cpu_notifier);
+#else
+	kvm_guest_cpu_init();
 #endif
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
                   ` (6 preceding siblings ...)
  2010-10-04 15:56 ` [PATCH v6 07/12] Add async PF initialization to PV guest Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-07 13:10   ` Avi Kivity
  2010-10-04 15:56 ` [PATCH v6 09/12] Inject asynchronous page fault into a PV guest if page is swapped out Gleb Natapov
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

When the async PF capability is detected, hook up a special page fault
handler that handles async page fault events and passes all other page
faults through to the regular page fault handler. Also add async PF handling
to nested SVM emulation: an async PF always generates an exit to L1, where
the vcpu thread will be scheduled out until the page is available.
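
Condensed, the dispatch installed on vector 14 looks like this (a simplified
restatement of do_async_page_fault() from the hunk below, with regs and
error_code being its parameters; it is not additional code):

	switch (kvm_read_and_reset_pf_reason()) {
	case KVM_PV_REASON_PAGE_NOT_PRESENT:
		/* the host is swapping the page in: sleep on the CR2 token */
		kvm_async_pf_task_wait((u32)read_cr2());
		break;
	case KVM_PV_REASON_PAGE_READY:
		/* the page arrived: wake whoever sleeps on this token */
		kvm_async_pf_task_wake((u32)read_cr2());
		break;
	default:
		/* an ordinary #PF: hand it to the regular handler */
		do_page_fault(regs, error_code);
		break;
	}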

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |   12 +++
 arch/x86/include/asm/traps.h    |    1 +
 arch/x86/kernel/entry_32.S      |   10 ++
 arch/x86/kernel/entry_64.S      |    3 +
 arch/x86/kernel/kvm.c           |  184 ++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/svm.c              |   43 +++++++--
 6 files changed, 243 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index e193c4f..bcc5022 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -65,6 +65,9 @@ struct kvm_mmu_op_release_pt {
 	__u64 pt_phys;
 };
 
+#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
+#define KVM_PV_REASON_PAGE_READY 2
+
 struct kvm_vcpu_pv_apf_data {
 	__u32 reason;
 	__u32 enabled;
@@ -170,8 +173,17 @@ static inline unsigned int kvm_arch_para_features(void)
 
 #ifdef CONFIG_KVM_GUEST
 void __init kvm_guest_init(void);
+void kvm_async_pf_task_wait(u32 token);
+void kvm_async_pf_task_wake(u32 token);
+u32 kvm_read_and_reset_pf_reason(void);
 #else
 #define kvm_guest_init() do { } while (0)
+#define kvm_async_pf_task_wait(T) do {} while(0)
+#define kvm_async_pf_task_wake(T) do {} while(0)
+static u32 kvm_read_and_reset_pf_reason(void)
+{
+	return 0;
+}
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index f66cda5..0310da6 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -30,6 +30,7 @@ asmlinkage void segment_not_present(void);
 asmlinkage void stack_segment(void);
 asmlinkage void general_protection(void);
 asmlinkage void page_fault(void);
+asmlinkage void async_page_fault(void);
 asmlinkage void spurious_interrupt_bug(void);
 asmlinkage void coprocessor_error(void);
 asmlinkage void alignment_check(void);
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 227d009..e6e7273 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -1496,6 +1496,16 @@ ENTRY(general_protection)
 	CFI_ENDPROC
 END(general_protection)
 
+#ifdef CONFIG_KVM_GUEST
+ENTRY(async_page_fault)
+	RING0_EC_FRAME
+	pushl $do_async_page_fault
+	CFI_ADJUST_CFA_OFFSET 4
+	jmp error_code
+	CFI_ENDPROC
+END(async_page_fault)
+#endif
+
 /*
  * End of kprobes section
  */
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 17be5ec..def98c3 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1349,6 +1349,9 @@ errorentry xen_stack_segment do_stack_segment
 #endif
 errorentry general_protection do_general_protection
 errorentry page_fault do_page_fault
+#ifdef CONFIG_KVM_GUEST
+errorentry async_page_fault do_async_page_fault
+#endif
 #ifdef CONFIG_X86_MCE
 paranoidzeroentry machine_check *machine_check_vector(%rip)
 #endif
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 67d1c8d..36fb3e4 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -29,8 +29,14 @@
 #include <linux/hardirq.h>
 #include <linux/notifier.h>
 #include <linux/reboot.h>
+#include <linux/hash.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/kprobes.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
+#include <asm/traps.h>
+#include <asm/desc.h>
 
 #define MMU_QUEUE_SIZE 1024
 
@@ -64,6 +70,168 @@ static void kvm_io_delay(void)
 {
 }
 
+#define KVM_TASK_SLEEP_HASHBITS 8
+#define KVM_TASK_SLEEP_HASHSIZE (1<<KVM_TASK_SLEEP_HASHBITS)
+
+struct kvm_task_sleep_node {
+	struct hlist_node link;
+	wait_queue_head_t wq;
+	u32 token;
+	int cpu;
+};
+
+static struct kvm_task_sleep_head {
+	spinlock_t lock;
+	struct hlist_head list;
+} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];
+
+static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
+						  u32 token)
+{
+	struct hlist_node *p;
+
+	hlist_for_each(p, &b->list) {
+		struct kvm_task_sleep_node *n =
+			hlist_entry(p, typeof(*n), link);
+		if (n->token == token)
+			return n;
+	}
+
+	return NULL;
+}
+
+void kvm_async_pf_task_wait(u32 token)
+{
+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+	struct kvm_task_sleep_node n, *e;
+	DEFINE_WAIT(wait);
+
+	spin_lock(&b->lock);
+	e = _find_apf_task(b, token);
+	if (e) {
+		/* dummy entry exists -> wake up was delivered ahead of PF */
+		hlist_del(&e->link);
+		kfree(e);
+		spin_unlock(&b->lock);
+		return;
+	}
+
+	n.token = token;
+	n.cpu = smp_processor_id();
+	init_waitqueue_head(&n.wq);
+	hlist_add_head(&n.link, &b->list);
+	spin_unlock(&b->lock);
+
+	for (;;) {
+		prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+		if (hlist_unhashed(&n.link))
+			break;
+		local_irq_enable();
+		schedule();
+		local_irq_disable();
+	}
+	finish_wait(&n.wq, &wait);
+
+	return;
+}
+EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
+
+static void apf_task_wake_one(struct kvm_task_sleep_node *n)
+{
+	hlist_del_init(&n->link);
+	if (waitqueue_active(&n->wq))
+		wake_up(&n->wq);
+}
+
+static void apf_task_wake_all(void)
+{
+	int i;
+
+	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
+		struct hlist_node *p, *next;
+		struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
+		spin_lock(&b->lock);
+		hlist_for_each_safe(p, next, &b->list) {
+			struct kvm_task_sleep_node *n =
+				hlist_entry(p, typeof(*n), link);
+			if (n->cpu == smp_processor_id())
+				apf_task_wake_one(n);
+		}
+		spin_unlock(&b->lock);
+	}
+}
+
+void kvm_async_pf_task_wake(u32 token)
+{
+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+	struct kvm_task_sleep_node *n;
+
+	if (token == ~0) {
+		apf_task_wake_all();
+		return;
+	}
+
+again:
+	spin_lock(&b->lock);
+	n = _find_apf_task(b, token);
+	if (!n) {
+		/*
+		 * async PF was not yet handled.
+		 * Add dummy entry for the token.
+		 */
+		n = kmalloc(sizeof(*n), GFP_ATOMIC);
+		if (!n) {
+			/*
+			 * Allocation failed! Busy wait while other cpu
+			 * handles async PF.
+			 */
+			spin_unlock(&b->lock);
+			cpu_relax();
+			goto again;
+		}
+		n->token = token;
+		n->cpu = smp_processor_id();
+		init_waitqueue_head(&n->wq);
+		hlist_add_head(&n->link, &b->list);
+	} else
+		apf_task_wake_one(n);
+	spin_unlock(&b->lock);
+	return;
+}
+EXPORT_SYMBOL_GPL(kvm_async_pf_task_wake);
+
+u32 kvm_read_and_reset_pf_reason(void)
+{
+	u32 reason = 0;
+
+	if (__get_cpu_var(apf_reason).enabled) {
+		reason = __get_cpu_var(apf_reason).reason;
+		__get_cpu_var(apf_reason).reason = 0;
+	}
+
+	return reason;
+}
+EXPORT_SYMBOL_GPL(kvm_read_and_reset_pf_reason);
+
+dotraplinkage void __kprobes
+do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	switch (kvm_read_and_reset_pf_reason()) {
+	default:
+		do_page_fault(regs, error_code);
+		break;
+	case KVM_PV_REASON_PAGE_NOT_PRESENT:
+		/* page is swapped out by the host. */
+		kvm_async_pf_task_wait((u32)read_cr2());
+		break;
+	case KVM_PV_REASON_PAGE_READY:
+		kvm_async_pf_task_wake((u32)read_cr2());
+		break;
+	}
+}
+
 static void kvm_mmu_op(void *buffer, unsigned len)
 {
 	int r;
@@ -298,8 +466,10 @@ static void kvm_guest_cpu_notify(void *dummy)
 {
 	if (!dummy)
 		kvm_guest_cpu_init();
-	else
+	else {
 		kvm_pv_disable_apf(NULL);
+		apf_task_wake_all();
+	}
 }
 
 static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
@@ -327,13 +497,25 @@ static struct notifier_block __cpuinitdata kvm_cpu_notifier = {
 };
 #endif
 
+static void __init kvm_apf_trap_init(void)
+{
+	set_intr_gate(14, &async_page_fault);
+}
+
 void __init kvm_guest_init(void)
 {
+	int i;
+
 	if (!kvm_para_available())
 		return;
 
 	paravirt_ops_setup();
 	register_reboot_notifier(&kvm_pv_reboot_nb);
+	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
+		spin_lock_init(&async_pf_sleepers[i].lock);
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
+		x86_init.irqs.trap_init = kvm_apf_trap_init;
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 	register_cpu_notifier(&kvm_cpu_notifier);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ca778d5..709456f 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -31,6 +31,7 @@
 
 #include <asm/tlbflush.h>
 #include <asm/desc.h>
+#include <asm/kvm_para.h>
 
 #include <asm/virtext.h>
 #include "trace.h"
@@ -133,6 +134,7 @@ struct vcpu_svm {
 
 	unsigned int3_injected;
 	unsigned long int3_rip;
+	u32 apf_reason;
 };
 
 #define MSR_INVALID			0xffffffffU
@@ -1383,16 +1385,31 @@ static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)
 
 static int pf_interception(struct vcpu_svm *svm)
 {
-	u64 fault_address;
+	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u32 error_code;
+	int r = 1;
 
-	fault_address  = svm->vmcb->control.exit_info_2;
-	error_code = svm->vmcb->control.exit_info_1;
+	switch (svm->apf_reason) {
+	default:
+		error_code = svm->vmcb->control.exit_info_1;
 
-	trace_kvm_page_fault(fault_address, error_code);
-	if (!npt_enabled && kvm_event_needs_reinjection(&svm->vcpu))
-		kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
-	return kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code);
+		trace_kvm_page_fault(fault_address, error_code);
+		if (!npt_enabled && kvm_event_needs_reinjection(&svm->vcpu))
+			kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
+		r = kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code);
+		break;
+	case KVM_PV_REASON_PAGE_NOT_PRESENT:
+		local_irq_disable();
+		kvm_async_pf_task_wait(fault_address);
+		local_irq_enable();
+		break;
+	case KVM_PV_REASON_PAGE_READY:
+		local_irq_disable();
+		kvm_async_pf_task_wake(fault_address);
+		local_irq_enable();
+		break;
+	}
+	return r;
 }
 
 static int db_interception(struct vcpu_svm *svm)
@@ -1836,8 +1853,8 @@ static int nested_svm_exit_special(struct vcpu_svm *svm)
 			return NESTED_EXIT_HOST;
 		break;
 	case SVM_EXIT_EXCP_BASE + PF_VECTOR:
-		/* When we're shadowing, trap PFs */
-		if (!npt_enabled)
+		/* When we're shadowing, trap PFs, but not async PF */
+		if (!npt_enabled && svm->apf_reason == 0)
 			return NESTED_EXIT_HOST;
 		break;
 	case SVM_EXIT_EXCP_BASE + NM_VECTOR:
@@ -1893,6 +1910,10 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
 		u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE);
 		if (svm->nested.intercept_exceptions & excp_bits)
 			vmexit = NESTED_EXIT_DONE;
+		/* async page fault always causes a vmexit */
+		else if ((exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) &&
+			 svm->apf_reason != 0)
+			vmexit = NESTED_EXIT_DONE;
 		break;
 	}
 	case SVM_EXIT_ERR: {
@@ -3409,6 +3430,10 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 
 	svm->next_rip = 0;
 
+	/* if exit due to PF check for async PF */
+	if (svm->vmcb->control.exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR)
+		svm->apf_reason = kvm_read_and_reset_pf_reason();
+
 	if (npt_enabled) {
 		vcpu->arch.regs_avail &= ~(1 << VCPU_EXREG_PDPTR);
 		vcpu->arch.regs_dirty &= ~(1 << VCPU_EXREG_PDPTR);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v6 09/12] Inject asynchronous page fault into a PV guest if page is swapped out.
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
                   ` (7 preceding siblings ...)
  2010-10-04 15:56 ` [PATCH v6 08/12] Handle async PF in a guest Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-05  2:36   ` Rik van Riel
  2010-10-05 19:00   ` Marcelo Tosatti
  2010-10-04 15:56 ` [PATCH v6 10/12] Handle async PF in non preemptable context Gleb Natapov
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Send async page fault to a PV guest if it accesses swapped out memory.
Guest will choose another task to run upon receiving the fault.

Allow async page fault injection only when the guest is in user mode, since
otherwise the guest may be in a non-sleepable context and will not be able
to reschedule.

The vcpu will be halted if the guest faults on the same page again or if the
vcpu executes kernel code.
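
The token that ties the two notifications together is built as in the mmu.c
hunk below; roughly (illustrative restatement, not extra code):

	/* a per-vcpu sequence number in the high bits, the vcpu id below */
	arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;

The same token is later handed to the guest in CR2 for both the "page not
present" and the "page ready" events, so the guest can match the wakeup to
the task that went to sleep.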

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    3 ++
 arch/x86/kvm/mmu.c              |    1 +
 arch/x86/kvm/x86.c              |   49 ++++++++++++++++++++++++++++++++------
 include/trace/events/kvm.h      |   17 ++++++++----
 virt/kvm/async_pf.c             |    3 +-
 5 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index de31551..2f6fc87 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -419,6 +419,7 @@ struct kvm_vcpu_arch {
 		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
 		struct gfn_to_hva_cache data;
 		u64 msr_val;
+		u32 id;
 	} apf;
 };
 
@@ -594,6 +595,7 @@ struct kvm_x86_ops {
 };
 
 struct kvm_arch_async_pf {
+	u32 token;
 	gfn_t gfn;
 };
 
@@ -842,6 +844,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work);
 void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
 			       struct kvm_async_pf *work);
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
 extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d85fda8..de53cab 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2580,6 +2580,7 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
 {
 	struct kvm_arch_async_pf arch;
+	arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
 	arch.gfn = gfn;
 
 	return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3e123ab..0e69d37 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6225,25 +6225,58 @@ static void kvm_del_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
 	}
 }
 
+static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
+{
+
+	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, &val,
+				      sizeof(val));
+}
+
 void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 				     struct kvm_async_pf *work)
 {
-	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
-
-	if (work == kvm_double_apf)
+	if (work == kvm_double_apf) {
 		trace_kvm_async_pf_doublefault(kvm_rip_read(vcpu));
-	else {
-		trace_kvm_async_pf_not_present(work->gva);
-
+		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
+	} else {
+		trace_kvm_async_pf_not_present(work->arch.token, work->gva);
 		kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
+
+		if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
+		    kvm_x86_ops->get_cpl(vcpu) == 0)
+			vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
+		else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
+			vcpu->arch.fault.error_code = 0;
+			vcpu->arch.fault.address = work->arch.token;
+			kvm_inject_page_fault(vcpu);
+		}
 	}
 }
 
 void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work)
 {
-	trace_kvm_async_pf_ready(work->gva);
-	kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
+	trace_kvm_async_pf_ready(work->arch.token, work->gva);
+	if (is_error_page(work->page))
+		work->arch.token = ~0; /* broadcast wakeup */
+	else
+		kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
+
+	if ((vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) &&
+	    !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
+		vcpu->arch.fault.error_code = 0;
+		vcpu->arch.fault.address = work->arch.token;
+		kvm_inject_page_fault(vcpu);
+	}
+}
+
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
+{
+	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
+		return true;
+	else
+		return !kvm_event_needs_reinjection(vcpu) &&
+			kvm_x86_ops->interrupt_allowed(vcpu);
 }
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index bcc69b2..dd44aa3 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -204,34 +204,39 @@ TRACE_EVENT(
 
 TRACE_EVENT(
 	kvm_async_pf_not_present,
-	TP_PROTO(u64 gva),
-	TP_ARGS(gva),
+	TP_PROTO(u64 token, u64 gva),
+	TP_ARGS(token, gva),
 
 	TP_STRUCT__entry(
+		__field(__u64, token)
 		__field(__u64, gva)
 		),
 
 	TP_fast_assign(
+		__entry->token = token;
 		__entry->gva = gva;
 		),
 
-	TP_printk("gva %#llx not present", __entry->gva)
+	TP_printk("token %#llx gva %#llx not present", __entry->token,
+		  __entry->gva)
 );
 
 TRACE_EVENT(
 	kvm_async_pf_ready,
-	TP_PROTO(u64 gva),
-	TP_ARGS(gva),
+	TP_PROTO(u64 token, u64 gva),
+	TP_ARGS(token, gva),
 
 	TP_STRUCT__entry(
+		__field(__u64, token)
 		__field(__u64, gva)
 		),
 
 	TP_fast_assign(
+		__entry->token = token;
 		__entry->gva = gva;
 		),
 
-	TP_printk("gva %#llx ready", __entry->gva)
+	TP_printk("token %#llx gva %#llx ready", __entry->token, __entry->gva)
 );
 
 TRACE_EVENT(
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 44f4005..d1cd495 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -138,7 +138,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 		}
 	}
 
-	if (list_empty_careful(&vcpu->async_pf.done))
+	if (list_empty_careful(&vcpu->async_pf.done) ||
+	    !kvm_arch_can_inject_async_page_present(vcpu))
 		return;
 
 	spin_lock(&vcpu->async_pf.lock);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v6 10/12] Handle async PF in non preemptable context
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
                   ` (8 preceding siblings ...)
  2010-10-04 15:56 ` [PATCH v6 09/12] Inject asynchronous page fault into a PV guest if page is swapped out Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-05 19:51   ` Marcelo Tosatti
  2010-10-04 15:56 ` [PATCH v6 11/12] Let host know whether the guest can handle async PF in non-userspace context Gleb Natapov
  2010-10-04 15:56 ` [PATCH v6 12/12] Send async PF when guest is not in userspace too Gleb Natapov
  11 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If an async page fault is received by the idle task, or while preempt_count
is not zero, the guest cannot reschedule, so do sti; hlt and wait for the
page to become ready. The vcpu can still process interrupts while it waits
for the page.
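
Schematically, the wait loop then has two modes (simplified from the hunk
below):

	if (!n.halted) {
		/* normal task context: sleep and let the scheduler pick
		 * another task */
		local_irq_enable();
		schedule();
		local_irq_disable();
	} else {
		/* idle task or elevated preempt_count(): we must not sleep,
		 * so sti; hlt (native_safe_halt) until an interrupt, e.g.
		 * the reschedule IPI sent by the wakeup path, arrives */
		native_safe_halt();
		local_irq_disable();
	}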

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kernel/kvm.c |   40 ++++++++++++++++++++++++++++++++++------
 1 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 36fb3e4..f73946f 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -37,6 +37,7 @@
 #include <asm/cpu.h>
 #include <asm/traps.h>
 #include <asm/desc.h>
+#include <asm/tlbflush.h>
 
 #define MMU_QUEUE_SIZE 1024
 
@@ -78,6 +79,8 @@ struct kvm_task_sleep_node {
 	wait_queue_head_t wq;
 	u32 token;
 	int cpu;
+	bool halted;
+	struct mm_struct *mm;
 };
 
 static struct kvm_task_sleep_head {
@@ -106,6 +109,11 @@ void kvm_async_pf_task_wait(u32 token)
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
 	struct kvm_task_sleep_node n, *e;
 	DEFINE_WAIT(wait);
+	int cpu, idle;
+
+	cpu = get_cpu();
+	idle = idle_cpu(cpu);
+	put_cpu();
 
 	spin_lock(&b->lock);
 	e = _find_apf_task(b, token);
@@ -119,19 +127,33 @@ void kvm_async_pf_task_wait(u32 token)
 
 	n.token = token;
 	n.cpu = smp_processor_id();
+	n.mm = current->active_mm;
+	n.halted = idle || preempt_count() > 1;
+	atomic_inc(&n.mm->mm_count);
 	init_waitqueue_head(&n.wq);
 	hlist_add_head(&n.link, &b->list);
 	spin_unlock(&b->lock);
 
 	for (;;) {
-		prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+		if (!n.halted)
+			prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
 		if (hlist_unhashed(&n.link))
 			break;
-		local_irq_enable();
-		schedule();
-		local_irq_disable();
+
+		if (!n.halted) {
+			local_irq_enable();
+			schedule();
+			local_irq_disable();
+		} else {
+			/*
+			 * We cannot reschedule. So halt.
+			 */
+			native_safe_halt();
+			local_irq_disable();
+		}
 	}
-	finish_wait(&n.wq, &wait);
+	if (!n.halted)
+		finish_wait(&n.wq, &wait);
 
 	return;
 }
@@ -140,7 +162,12 @@ EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
 static void apf_task_wake_one(struct kvm_task_sleep_node *n)
 {
 	hlist_del_init(&n->link);
-	if (waitqueue_active(&n->wq))
+	if (!n->mm)
+		return;
+	mmdrop(n->mm);
+	if (n->halted)
+		smp_send_reschedule(n->cpu);
+	else if (waitqueue_active(&n->wq))
 		wake_up(&n->wq);
 }
 
@@ -193,6 +220,7 @@ again:
 		}
 		n->token = token;
 		n->cpu = smp_processor_id();
+		n->mm = NULL;
 		init_waitqueue_head(&n->wq);
 		hlist_add_head(&n->link, &b->list);
 	} else
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v6 11/12] Let host know whether the guest can handle async PF in non-userspace context.
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
                   ` (9 preceding siblings ...)
  2010-10-04 15:56 ` [PATCH v6 10/12] Handle async PF in non preemptable context Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-07 13:36   ` Avi Kivity
  2010-10-04 15:56 ` [PATCH v6 12/12] Send async PF when guest is not in userspace too Gleb Natapov
  11 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If the guest can detect that it runs in a non-preemptable context, it can
handle async PFs at any time, so let the host know that it may send async
PFs even when the guest cpu is not in userspace.
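
In other words, on a CONFIG_PREEMPT guest the enable value written to the
MSR simply carries one more flag (sketch; the kvm.c hunk below uses
native_write_msr_safe() for the actual write):

	u64 pa = __pa(&__get_cpu_var(apf_reason));

#ifdef CONFIG_PREEMPT
	pa |= KVM_ASYNC_PF_SEND_ALWAYS;	/* bit 1: kernel-mode faults are ok */
#endif
	wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);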

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 Documentation/kvm/msr.txt       |    5 +++--
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/include/asm/kvm_para.h |    1 +
 arch/x86/kernel/kvm.c           |    3 +++
 arch/x86/kvm/x86.c              |    5 +++--
 5 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
index d64e723..918f9ad 100644
--- a/Documentation/kvm/msr.txt
+++ b/Documentation/kvm/msr.txt
@@ -153,9 +153,10 @@ MSR_KVM_SYSTEM_TIME: 0x12
 
 MSR_KVM_ASYNC_PF_EN: 0x4b564d02
 	data: Bits 63-6 hold 64-byte aligned physical address of a 32bit memory
-	area which must be in guest RAM. Bits 5-1 are reserved and should be
+	area which must be in guest RAM. Bits 5-2 are reserved and should be
 	zero. Bit 0 is 1 when asynchronous page faults are enabled on the vcpu,
-	0 when disabled.
+	0 when disabled. Bit 1 is 1 if asynchronous page faults can be injected
+	when the vcpu is in kernel mode.
 
 	Physical address points to 32 bit memory location that will be written
 	to by the hypervisor at the time of asynchronous page fault injection to
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2f6fc87..81c1a4f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -420,6 +420,7 @@ struct kvm_vcpu_arch {
 		struct gfn_to_hva_cache data;
 		u64 msr_val;
 		u32 id;
+		bool send_user_only;
 	} apf;
 };
 
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index bcc5022..bf3dab3 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -38,6 +38,7 @@
 #define KVM_MAX_MMU_OP_BATCH           32
 
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
+#define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
 
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE            1
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index f73946f..d5877bf 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -449,6 +449,9 @@ void __cpuinit kvm_guest_cpu_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
 		u64 pa = __pa(&__get_cpu_var(apf_reason));
 
+#ifdef CONFIG_PREEMPT
+		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
+#endif
 		if (native_write_msr_safe(MSR_KVM_ASYNC_PF_EN,
 					  pa | KVM_ASYNC_PF_ENABLED, pa >> 32))
 			return;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0e69d37..cad4412 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1429,8 +1429,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 {
 	gpa_t gpa = data & ~0x3f;
 
-	/* Bits 1:5 are reserved, should be zero */
-	if (data & 0x3e)
+	/* Bits 2:5 are reserved, should be zero */
+	if (data & 0x3c)
 		return 1;
 
 	vcpu->arch.apf.msr_val = data;
@@ -1444,6 +1444,7 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa))
 		return 1;
 
+	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
 	kvm_async_pf_wakeup_all(vcpu);
 	return 0;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v6 12/12] Send async PF when guest is not in userspace too.
  2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
                   ` (10 preceding siblings ...)
  2010-10-04 15:56 ` [PATCH v6 11/12] Let host know whether the guest can handle async PF in non-userspace context Gleb Natapov
@ 2010-10-04 15:56 ` Gleb Natapov
  2010-10-05  2:37   ` Rik van Riel
  11 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-04 15:56 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If the guest indicates that it can handle async PFs in kernel mode too, send
them, but only if interrupts are enabled.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/x86.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cad4412..30b1cd1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6244,7 +6244,8 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 		kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
 
 		if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
-		    kvm_x86_ops->get_cpl(vcpu) == 0)
+		    (vcpu->arch.apf.send_user_only &&
+		     kvm_x86_ops->get_cpl(vcpu) == 0))
 			vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
 		else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
 			vcpu->arch.fault.error_code = 0;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-04 15:56 ` [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out Gleb Natapov
@ 2010-10-05  1:20   ` Rik van Riel
  2010-10-05 14:59   ` Marcelo Tosatti
  2010-10-07  9:50   ` Avi Kivity
  2 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2010-10-05  1:20 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	cl, mtosatti

On 10/04/2010 11:56 AM, Gleb Natapov wrote:
> If a guest accesses swapped out memory do not swap it in from vcpu thread
> context. Schedule work to do swapping and put vcpu into halted state
> instead.
>
> Interrupts will still be delivered to the guest and if interrupt will
> cause reschedule guest will continue to run another task.
>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>

This seems quite different from the last version, but it
looks fine to me.

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-04 15:56 ` [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface Gleb Natapov
@ 2010-10-05  1:29   ` Rik van Riel
  2010-10-05 16:57   ` Marcelo Tosatti
  2010-10-07 12:31   ` Avi Kivity
  2 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2010-10-05  1:29 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	cl, mtosatti

On 10/04/2010 11:56 AM, Gleb Natapov wrote:
> Keep track of memslots changes by keeping generation number in memslots
> structure. Provide kvm_write_guest_cached() function that skips
> gfn_to_hva() translation if memslots was not changed since previous
> invocation.
>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 07/12] Add async PF initialization to PV guest.
  2010-10-04 15:56 ` [PATCH v6 07/12] Add async PF initialization to PV guest Gleb Natapov
@ 2010-10-05  2:34   ` Rik van Riel
  2010-10-05 18:25   ` Marcelo Tosatti
  2010-10-07 12:50   ` Avi Kivity
  2 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2010-10-05  2:34 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	cl, mtosatti

On 10/04/2010 11:56 AM, Gleb Natapov wrote:
> Enable async PF in a guest if async PF capability is discovered.
>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 09/12] Inject asynchronous page fault into a PV guest if page is swapped out.
  2010-10-04 15:56 ` [PATCH v6 09/12] Inject asynchronous page fault into a PV guest if page is swapped out Gleb Natapov
@ 2010-10-05  2:36   ` Rik van Riel
  2010-10-05 19:00   ` Marcelo Tosatti
  1 sibling, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2010-10-05  2:36 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	cl, mtosatti

On 10/04/2010 11:56 AM, Gleb Natapov wrote:
> Send async page fault to a PV guest if it accesses swapped out memory.
> Guest will choose another task to run upon receiving the fault.
>
> Allow async page fault injection only when guest is in user mode since
> otherwise guest may be in non-sleepable context and will not be able
> to reschedule.
>
> Vcpu will be halted if guest will fault on the same page again or if
> vcpu executes kernel code.
>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 12/12] Send async PF when guest is not in userspace too.
  2010-10-04 15:56 ` [PATCH v6 12/12] Send async PF when guest is not in userspace too Gleb Natapov
@ 2010-10-05  2:37   ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2010-10-05  2:37 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	cl, mtosatti

On 10/04/2010 11:56 AM, Gleb Natapov wrote:
> If guest indicates that it can handle async pf in kernel mode too send
> it, but only if interrupts are enabled.
>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-04 15:56 ` [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out Gleb Natapov
  2010-10-05  1:20   ` Rik van Riel
@ 2010-10-05 14:59   ` Marcelo Tosatti
  2010-10-06 10:50     ` Avi Kivity
  2010-10-06 11:15     ` Gleb Natapov
  2010-10-07  9:50   ` Avi Kivity
  2 siblings, 2 replies; 88+ messages in thread
From: Marcelo Tosatti @ 2010-10-05 14:59 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Mon, Oct 04, 2010 at 05:56:24PM +0200, Gleb Natapov wrote:
> If a guest accesses swapped out memory do not swap it in from vcpu thread
> context. Schedule work to do swapping and put vcpu into halted state
> instead.
> 
> Interrupts will still be delivered to the guest and if interrupt will
> cause reschedule guest will continue to run another task.
> 
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   17 +++
>  arch/x86/kvm/Kconfig            |    1 +
>  arch/x86/kvm/Makefile           |    1 +
>  arch/x86/kvm/mmu.c              |   51 +++++++++-
>  arch/x86/kvm/paging_tmpl.h      |    4 +-
>  arch/x86/kvm/x86.c              |  109 +++++++++++++++++++-
>  include/linux/kvm_host.h        |   31 ++++++
>  include/trace/events/kvm.h      |   88 ++++++++++++++++
>  virt/kvm/Kconfig                |    3 +
>  virt/kvm/async_pf.c             |  220 +++++++++++++++++++++++++++++++++++++++
>  virt/kvm/async_pf.h             |   36 +++++++
>  virt/kvm/kvm_main.c             |   57 ++++++++--
>  12 files changed, 603 insertions(+), 15 deletions(-)
>  create mode 100644 virt/kvm/async_pf.c
>  create mode 100644 virt/kvm/async_pf.h
> 

> +	async_pf_cache = NULL;
> +}
> +
> +void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu)
> +{
> +	INIT_LIST_HEAD(&vcpu->async_pf.done);
> +	INIT_LIST_HEAD(&vcpu->async_pf.queue);
> +	spin_lock_init(&vcpu->async_pf.lock);
> +}
> +
> +static void async_pf_execute(struct work_struct *work)
> +{
> +	struct page *page;
> +	struct kvm_async_pf *apf =
> +		container_of(work, struct kvm_async_pf, work);
> +	struct mm_struct *mm = apf->mm;
> +	struct kvm_vcpu *vcpu = apf->vcpu;
> +	unsigned long addr = apf->addr;
> +	gva_t gva = apf->gva;
> +
> +	might_sleep();
> +
> +	use_mm(mm);
> +	down_read(&mm->mmap_sem);
> +	get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
> +	up_read(&mm->mmap_sem);
> +	unuse_mm(mm);
> +
> +	spin_lock(&vcpu->async_pf.lock);
> +	list_add_tail(&apf->link, &vcpu->async_pf.done);
> +	apf->page = page;
> +	spin_unlock(&vcpu->async_pf.lock);

This can fail, and apf->page becomes NULL.

> +	if (list_empty_careful(&vcpu->async_pf.done))
> +		return;
> +
> +	spin_lock(&vcpu->async_pf.lock);
> +	work = list_first_entry(&vcpu->async_pf.done, typeof(*work), link);
> +	list_del(&work->link);
> +	spin_unlock(&vcpu->async_pf.lock);
> +
> +	kvm_arch_async_page_present(vcpu, work);
> +
> +free:
> +	list_del(&work->queue);
> +	vcpu->async_pf.queued--;
> +	put_page(work->page);
> +	kmem_cache_free(async_pf_cache, work);
> +}

Better handle it here (and other sites).
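
For illustration only (this spells out the concern; it is not necessarily how
the series ends up handling it), the producer and the consumers would need
roughly:

	/* in async_pf_execute(): get_user_pages() may pin nothing */
	struct page *page = NULL;

	if (get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL) != 1)
		page = NULL;		/* make the failure explicit */

	apf->page = page;		/* may now legitimately be NULL */

	/* and every consumer must tolerate that, e.g.: */
	if (work->page)
		put_page(work->page);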

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 03/12] Retry fault before vmentry
  2010-10-04 15:56 ` [PATCH v6 03/12] Retry fault before vmentry Gleb Natapov
@ 2010-10-05 15:54   ` Marcelo Tosatti
  2010-10-06 11:07     ` Gleb Natapov
  2010-10-07 12:29   ` Avi Kivity
  1 sibling, 1 reply; 88+ messages in thread
From: Marcelo Tosatti @ 2010-10-05 15:54 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Mon, Oct 04, 2010 at 05:56:25PM +0200, Gleb Natapov wrote:
> When page is swapped in it is mapped into guest memory only after guest
> tries to access it again and generate another fault. To save this fault
> we can map it immediately since we know that guest is going to access
> the page. Do it only when tdp is enabled for now. Shadow paging case is
> more complicated. CR[034] and EFER registers should be switched before
> doing mapping and then switched back.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    4 +++-
>  arch/x86/kvm/mmu.c              |   16 ++++++++--------
>  arch/x86/kvm/paging_tmpl.h      |    6 +++---
>  arch/x86/kvm/x86.c              |    7 +++++++
>  virt/kvm/async_pf.c             |    2 ++
>  5 files changed, 23 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5f154d3..b9f263e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -240,7 +240,7 @@ struct kvm_mmu {
>  	void (*new_cr3)(struct kvm_vcpu *vcpu);
>  	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
>  	unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
> -	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
> +	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, bool no_apf);
>  	void (*inject_page_fault)(struct kvm_vcpu *vcpu);
>  	void (*free)(struct kvm_vcpu *vcpu);
>  	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
> @@ -838,6 +838,8 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>  				     struct kvm_async_pf *work);
>  void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>  				 struct kvm_async_pf *work);
> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
> +			       struct kvm_async_pf *work);
>  extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
>  
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 4d49b5e..d85fda8 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2558,7 +2558,7 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr,
>  }
>  
>  static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
> -				u32 error_code)
> +				u32 error_code, bool no_apf)
>  {
>  	gfn_t gfn;
>  	int r;
> @@ -2594,8 +2594,8 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
>  	return kvm_x86_ops->interrupt_allowed(vcpu);
>  }
>  
> -static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
> -			 pfn_t *pfn)
> +static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
> +			 gva_t gva, pfn_t *pfn)
>  {
>  	bool async;
>  
> @@ -2606,7 +2606,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
>  
>  	put_page(pfn_to_page(*pfn));
>  
> -	if (can_do_async_pf(vcpu)) {
> +	if (!no_apf && can_do_async_pf(vcpu)) {
>  		trace_kvm_try_async_get_page(async, *pfn);
>  		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
>  			vcpu->async_pf.work = kvm_double_apf;
> @@ -2620,8 +2620,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
>  	return false;
>  }
>  
> -static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
> -				u32 error_code)
> +static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
> +			  bool no_apf)
>  {
>  	pfn_t pfn;
>  	int r;
> @@ -2643,7 +2643,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>  	smp_rmb();
>  
> -	if (try_async_pf(vcpu, gfn, gpa, &pfn))
> +	if (try_async_pf(vcpu, no_apf, gfn, gpa, &pfn))
>  		return 0;
>  
>  	/* mmio */
> @@ -3306,7 +3306,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code)
>  	int r;
>  	enum emulation_result er;
>  
> -	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code);
> +	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
>  	if (r < 0)
>  		goto out;
>  
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 8154353..9ad90f8 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -530,8 +530,8 @@ out_gpte_changed:
>   *  Returns: 1 if we need to emulate the instruction, 0 otherwise, or
>   *           a negative value on error.
>   */
> -static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
> -			       u32 error_code)
> +static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
> +			     bool no_apf)
>  {
>  	int write_fault = error_code & PFERR_WRITE_MASK;
>  	int user_fault = error_code & PFERR_USER_MASK;
> @@ -574,7 +574,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>  	smp_rmb();
>  
> -	if (try_async_pf(vcpu, walker.gfn, addr, &pfn))
> +	if (try_async_pf(vcpu, no_apf, walker.gfn, addr, &pfn))
>  		return 0;
>  
>  	/* mmio */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8dd9ac2..48fd59d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6123,6 +6123,13 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_rflags);
>  
> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> +{
> +	if (!tdp_enabled || is_error_page(work->page))
> +		return;
> +	vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true);
> +}
> +

Can't you set a bit in vcpu->requests instead, and handle it in "out:"
at the end of vcpu_enter_guest? 

To have a single entry point for pagefaults, after vmexit handling.
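
For readers unfamiliar with that pattern, the suggestion amounts to something
like the sketch below. KVM_REQ_APF_READY and apf.ready_gva are made-up names,
kvm_make_request()/kvm_check_request() are the existing vcpu->requests
helpers, and the gva would have to be stashed on the vcpu:

	/* where kvm_arch_async_page_ready() maps the page today: */
	vcpu->arch.apf.ready_gva = work->gva;
	kvm_make_request(KVM_REQ_APF_READY, vcpu);

	/* and once per vmexit, at the end of vcpu_enter_guest(): */
	if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
		vcpu->arch.mmu.page_fault(vcpu, vcpu->arch.apf.ready_gva,
					  0, true);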


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-04 15:56 ` [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface Gleb Natapov
  2010-10-05  1:29   ` Rik van Riel
@ 2010-10-05 16:57   ` Marcelo Tosatti
  2010-10-06 11:14     ` Gleb Natapov
  2010-10-07 12:31   ` Avi Kivity
  2 siblings, 1 reply; 88+ messages in thread
From: Marcelo Tosatti @ 2010-10-05 16:57 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Mon, Oct 04, 2010 at 05:56:26PM +0200, Gleb Natapov wrote:
> Keep track of memslots changes by keeping generation number in memslots
> structure. Provide kvm_write_guest_cached() function that skips
> gfn_to_hva() translation if memslots was not changed since previous
> invocation.
> 
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  include/linux/kvm_host.h  |    7 +++++
>  include/linux/kvm_types.h |    7 +++++
>  virt/kvm/kvm_main.c       |   57 +++++++++++++++++++++++++++++++++++++++++---
>  3 files changed, 67 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a08614e..4dff9a1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -199,6 +199,7 @@ struct kvm_irq_routing_table {};
>  
>  struct kvm_memslots {
>  	int nmemslots;
> +	u32 generation;
>  	struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
>  					KVM_PRIVATE_MEM_SLOTS];
>  };
> @@ -352,12 +353,18 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
>  			 int offset, int len);
>  int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
>  		    unsigned long len);
> +int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
> +			   void *data, unsigned long len);
> +int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
> +			      gpa_t gpa);
>  int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
>  int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
>  int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
>  unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
>  void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
> +void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> +			     gfn_t gfn);
>  
>  void kvm_vcpu_block(struct kvm_vcpu *vcpu);
>  void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 7ac0d4e..ee6eb71 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -67,4 +67,11 @@ struct kvm_lapic_irq {
>  	u32 dest_id;
>  };
>  
> +struct gfn_to_hva_cache {
> +	u32 generation;
> +	gpa_t gpa;
> +	unsigned long hva;
> +	struct kvm_memory_slot *memslot;
> +};
> +
>  #endif /* __KVM_TYPES_H__ */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index db58a1b..45ef50c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -687,6 +687,7 @@ skip_lpage:
>  		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
>  		if (mem->slot >= slots->nmemslots)
>  			slots->nmemslots = mem->slot + 1;
> +		slots->generation++;
>  		slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
>  
>  		old_memslots = kvm->memslots;
> @@ -723,6 +724,7 @@ skip_lpage:
>  	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
>  	if (mem->slot >= slots->nmemslots)
>  		slots->nmemslots = mem->slot + 1;
> +	slots->generation++;
>  
>  	/* actual memory is freed via old in kvm_free_physmem_slot below */
>  	if (!npages) {
> @@ -1247,6 +1249,47 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
>  	return 0;
>  }
>  
> +int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
> +			      gpa_t gpa)
> +{
> +	struct kvm_memslots *slots = kvm_memslots(kvm);
> +	int offset = offset_in_page(gpa);
> +	gfn_t gfn = gpa >> PAGE_SHIFT;
> +
> +	ghc->gpa = gpa;
> +	ghc->generation = slots->generation;
> +	ghc->memslot = gfn_to_memslot(kvm, gfn);
> +	ghc->hva = gfn_to_hva(kvm, gfn);
> +	if (!kvm_is_error_hva(ghc->hva))
> +		ghc->hva += offset;
> +	else
> +		return -EFAULT;
> +
> +	return 0;
> +}

Should use a unique kvm_memslots structure for the cache entry, since it
can change in between (use gfn_to_hva_memslot, etc on "slots" pointer).

Also should zap any cached entries on overflow, otherwise malicious
userspace could make use of stale slots:

> +void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
> +{
> +	struct kvm_memory_slot *memslot;
> +
> +	memslot = gfn_to_memslot(kvm, gfn);
> +	mark_page_dirty_in_slot(kvm, memslot, gfn);
> +}
> +
>  /*
>   * The vCPU has executed a HLT instruction with in-kernel mode enabled.
>   */
> -- 
> 1.7.1


* Re: [PATCH v6 07/12] Add async PF initialization to PV guest.
  2010-10-04 15:56 ` [PATCH v6 07/12] Add async PF initialization to PV guest Gleb Natapov
  2010-10-05  2:34   ` Rik van Riel
@ 2010-10-05 18:25   ` Marcelo Tosatti
  2010-10-06 10:55     ` Gleb Natapov
  2010-10-07 12:50   ` Avi Kivity
  2 siblings, 1 reply; 88+ messages in thread
From: Marcelo Tosatti @ 2010-10-05 18:25 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Mon, Oct 04, 2010 at 05:56:29PM +0200, Gleb Natapov wrote:
> Enable async PF in a guest if async PF capability is discovered.
> 
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  Documentation/kernel-parameters.txt |    3 +
>  arch/x86/include/asm/kvm_para.h     |    5 ++
>  arch/x86/kernel/kvm.c               |   92 +++++++++++++++++++++++++++++++++++
>  3 files changed, 100 insertions(+), 0 deletions(-)
> 

> +static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
> +				    unsigned long action, void *hcpu)
> +{
> +	int cpu = (unsigned long)hcpu;
> +	switch (action) {
> +	case CPU_ONLINE:
> +	case CPU_DOWN_FAILED:
> +	case CPU_ONLINE_FROZEN:
> +		smp_call_function_single(cpu, kvm_guest_cpu_notify, NULL, 0);

wait parameter should probably be 1.
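
i.e. simply:

	smp_call_function_single(cpu, kvm_guest_cpu_notify, NULL, 1);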



* Re: [PATCH v6 09/12] Inject asynchronous page fault into a PV guest if page is swapped out.
  2010-10-04 15:56 ` [PATCH v6 09/12] Inject asynchronous page fault into a PV guest if page is swapped out Gleb Natapov
  2010-10-05  2:36   ` Rik van Riel
@ 2010-10-05 19:00   ` Marcelo Tosatti
  2010-10-06 10:42     ` Gleb Natapov
  1 sibling, 1 reply; 88+ messages in thread
From: Marcelo Tosatti @ 2010-10-05 19:00 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Mon, Oct 04, 2010 at 05:56:31PM +0200, Gleb Natapov wrote:
> Send async page fault to a PV guest if it accesses swapped out memory.
> Guest will choose another task to run upon receiving the fault.
> 
> Allow async page fault injection only when guest is in user mode since
> otherwise guest may be in non-sleepable context and will not be able
> to reschedule.
> 
> Vcpu will be halted if guest will fault on the same page again or if
> vcpu executes kernel code.
> 
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    3 ++
>  arch/x86/kvm/mmu.c              |    1 +
>  arch/x86/kvm/x86.c              |   49 ++++++++++++++++++++++++++++++++------
>  include/trace/events/kvm.h      |   17 ++++++++----
>  virt/kvm/async_pf.c             |    3 +-
>  5 files changed, 58 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index de31551..2f6fc87 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -419,6 +419,7 @@ struct kvm_vcpu_arch {
>  		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
>  		struct gfn_to_hva_cache data;
>  		u64 msr_val;
> +		u32 id;
>  	} apf;
>  };
>  
> @@ -594,6 +595,7 @@ struct kvm_x86_ops {
>  };
>  
>  struct kvm_arch_async_pf {
> +	u32 token;
>  	gfn_t gfn;
>  };
>  
> @@ -842,6 +844,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>  				 struct kvm_async_pf *work);
>  void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
>  			       struct kvm_async_pf *work);
> +bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
>  extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
>  
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index d85fda8..de53cab 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2580,6 +2580,7 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
>  int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
>  {
>  	struct kvm_arch_async_pf arch;
> +	arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
>  	arch.gfn = gfn;
>  
>  	return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3e123ab..0e69d37 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6225,25 +6225,58 @@ static void kvm_del_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
>  	}
>  }
>  
> +static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
> +{
> +
> +	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, &val,
> +				      sizeof(val));
> +}
> +
>  void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>  				     struct kvm_async_pf *work)
>  {
> -	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> -
> -	if (work == kvm_double_apf)
> +	if (work == kvm_double_apf) {
>  		trace_kvm_async_pf_doublefault(kvm_rip_read(vcpu));
> -	else {
> -		trace_kvm_async_pf_not_present(work->gva);
> -
> +		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> +	} else {
> +		trace_kvm_async_pf_not_present(work->arch.token, work->gva);
>  		kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
> +
> +		if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
> +		    kvm_x86_ops->get_cpl(vcpu) == 0)
> +			vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> +		else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
> +			vcpu->arch.fault.error_code = 0;
> +			vcpu->arch.fault.address = work->arch.token;
> +			kvm_inject_page_fault(vcpu);
> +		}

Missed !kvm_event_needs_reinjection(vcpu) ? 



* Re: [PATCH v6 10/12] Handle async PF in non preemptable context
  2010-10-04 15:56 ` [PATCH v6 10/12] Handle async PF in non preemptable context Gleb Natapov
@ 2010-10-05 19:51   ` Marcelo Tosatti
  2010-10-06 10:41     ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Marcelo Tosatti @ 2010-10-05 19:51 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Mon, Oct 04, 2010 at 05:56:32PM +0200, Gleb Natapov wrote:
> If async page fault is received by idle task or when preemp_count is
> not zero guest cannot reschedule, so do sti; hlt and wait for page to be
> ready. vcpu can still process interrupts while it waits for the page to
> be ready.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/x86/kernel/kvm.c |   40 ++++++++++++++++++++++++++++++++++------
>  1 files changed, 34 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 36fb3e4..f73946f 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -37,6 +37,7 @@
>  #include <asm/cpu.h>
>  #include <asm/traps.h>
>  #include <asm/desc.h>
> +#include <asm/tlbflush.h>
>  
>  #define MMU_QUEUE_SIZE 1024
>  
> @@ -78,6 +79,8 @@ struct kvm_task_sleep_node {
>  	wait_queue_head_t wq;
>  	u32 token;
>  	int cpu;
> +	bool halted;
> +	struct mm_struct *mm;
>  };
>  
>  static struct kvm_task_sleep_head {
> @@ -106,6 +109,11 @@ void kvm_async_pf_task_wait(u32 token)
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
>  	struct kvm_task_sleep_node n, *e;
>  	DEFINE_WAIT(wait);
> +	int cpu, idle;
> +
> +	cpu = get_cpu();
> +	idle = idle_cpu(cpu);
> +	put_cpu();
>  
>  	spin_lock(&b->lock);
>  	e = _find_apf_task(b, token);
> @@ -119,19 +127,33 @@ void kvm_async_pf_task_wait(u32 token)
>  
>  	n.token = token;
>  	n.cpu = smp_processor_id();
> +	n.mm = current->active_mm;
> +	n.halted = idle || preempt_count() > 1;
> +	atomic_inc(&n.mm->mm_count);

Can't see why this reference is needed.



* Re: [PATCH v6 10/12] Handle async PF in non preemptable context
  2010-10-05 19:51   ` Marcelo Tosatti
@ 2010-10-06 10:41     ` Gleb Natapov
  2010-10-10 14:25       ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-06 10:41 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Tue, Oct 05, 2010 at 04:51:50PM -0300, Marcelo Tosatti wrote:
> On Mon, Oct 04, 2010 at 05:56:32PM +0200, Gleb Natapov wrote:
> > If async page fault is received by idle task or when preemp_count is
> > not zero guest cannot reschedule, so do sti; hlt and wait for page to be
> > ready. vcpu can still process interrupts while it waits for the page to
> > be ready.
> > 
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> >  arch/x86/kernel/kvm.c |   40 ++++++++++++++++++++++++++++++++++------
> >  1 files changed, 34 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index 36fb3e4..f73946f 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -37,6 +37,7 @@
> >  #include <asm/cpu.h>
> >  #include <asm/traps.h>
> >  #include <asm/desc.h>
> > +#include <asm/tlbflush.h>
> >  
> >  #define MMU_QUEUE_SIZE 1024
> >  
> > @@ -78,6 +79,8 @@ struct kvm_task_sleep_node {
> >  	wait_queue_head_t wq;
> >  	u32 token;
> >  	int cpu;
> > +	bool halted;
> > +	struct mm_struct *mm;
> >  };
> >  
> >  static struct kvm_task_sleep_head {
> > @@ -106,6 +109,11 @@ void kvm_async_pf_task_wait(u32 token)
> >  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> >  	struct kvm_task_sleep_node n, *e;
> >  	DEFINE_WAIT(wait);
> > +	int cpu, idle;
> > +
> > +	cpu = get_cpu();
> > +	idle = idle_cpu(cpu);
> > +	put_cpu();
> >  
> >  	spin_lock(&b->lock);
> >  	e = _find_apf_task(b, token);
> > @@ -119,19 +127,33 @@ void kvm_async_pf_task_wait(u32 token)
> >  
> >  	n.token = token;
> >  	n.cpu = smp_processor_id();
> > +	n.mm = current->active_mm;
> > +	n.halted = idle || preempt_count() > 1;
> > +	atomic_inc(&n.mm->mm_count);
> 
> Can't see why this reference is needed.
I thought that if a kernel thread faults on behalf of some process, the
mm can go away while the kernel thread is sleeping. But it looks like a
kernel thread takes a reference to the mm it runs with by itself, so
maybe this is redundant (but not harmful).

--
			Gleb.


* Re: [PATCH v6 09/12] Inject asynchronous page fault into a PV guest if page is swapped out.
  2010-10-05 19:00   ` Marcelo Tosatti
@ 2010-10-06 10:42     ` Gleb Natapov
  0 siblings, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-06 10:42 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Tue, Oct 05, 2010 at 04:00:51PM -0300, Marcelo Tosatti wrote:
> On Mon, Oct 04, 2010 at 05:56:31PM +0200, Gleb Natapov wrote:
> > Send async page fault to a PV guest if it accesses swapped out memory.
> > Guest will choose another task to run upon receiving the fault.
> > 
> > Allow async page fault injection only when guest is in user mode since
> > otherwise guest may be in non-sleepable context and will not be able
> > to reschedule.
> > 
> > Vcpu will be halted if guest will fault on the same page again or if
> > vcpu executes kernel code.
> > 
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |    3 ++
> >  arch/x86/kvm/mmu.c              |    1 +
> >  arch/x86/kvm/x86.c              |   49 ++++++++++++++++++++++++++++++++------
> >  include/trace/events/kvm.h      |   17 ++++++++----
> >  virt/kvm/async_pf.c             |    3 +-
> >  5 files changed, 58 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index de31551..2f6fc87 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -419,6 +419,7 @@ struct kvm_vcpu_arch {
> >  		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
> >  		struct gfn_to_hva_cache data;
> >  		u64 msr_val;
> > +		u32 id;
> >  	} apf;
> >  };
> >  
> > @@ -594,6 +595,7 @@ struct kvm_x86_ops {
> >  };
> >  
> >  struct kvm_arch_async_pf {
> > +	u32 token;
> >  	gfn_t gfn;
> >  };
> >  
> > @@ -842,6 +844,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> >  				 struct kvm_async_pf *work);
> >  void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
> >  			       struct kvm_async_pf *work);
> > +bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
> >  extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
> >  
> >  #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index d85fda8..de53cab 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -2580,6 +2580,7 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
> >  int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
> >  {
> >  	struct kvm_arch_async_pf arch;
> > +	arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
> >  	arch.gfn = gfn;
> >  
> >  	return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 3e123ab..0e69d37 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6225,25 +6225,58 @@ static void kvm_del_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> >  	}
> >  }
> >  
> > +static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
> > +{
> > +
> > +	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, &val,
> > +				      sizeof(val));
> > +}
> > +
> >  void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> >  				     struct kvm_async_pf *work)
> >  {
> > -	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> > -
> > -	if (work == kvm_double_apf)
> > +	if (work == kvm_double_apf) {
> >  		trace_kvm_async_pf_doublefault(kvm_rip_read(vcpu));
> > -	else {
> > -		trace_kvm_async_pf_not_present(work->gva);
> > -
> > +		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> > +	} else {
> > +		trace_kvm_async_pf_not_present(work->arch.token, work->gva);
> >  		kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
> > +
> > +		if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
> > +		    kvm_x86_ops->get_cpl(vcpu) == 0)
> > +			vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> > +		else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
> > +			vcpu->arch.fault.error_code = 0;
> > +			vcpu->arch.fault.address = work->arch.token;
> > +			kvm_inject_page_fault(vcpu);
> > +		}
> 
> Missed !kvm_event_needs_reinjection(vcpu) ? 
This check is done in can_do_async_pf(). We will not get here if an event is pending.
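
For reference, can_do_async_pf() from patch 02 starts with:

	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
		     kvm_event_needs_reinjection(vcpu)))
		return false;

so if an event needs reinjection we never set up the async fault in the
first place.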

--
			Gleb.


* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-05 14:59   ` Marcelo Tosatti
@ 2010-10-06 10:50     ` Avi Kivity
  2010-10-06 10:52       ` Gleb Natapov
  2010-10-06 11:15     ` Gleb Natapov
  1 sibling, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-06 10:50 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra,
	tglx, hpa, riel, cl

  On 10/05/2010 04:59 PM, Marcelo Tosatti wrote:
> On Mon, Oct 04, 2010 at 05:56:24PM +0200, Gleb Natapov wrote:
> >  If a guest accesses swapped out memory do not swap it in from vcpu thread
> >  context. Schedule work to do swapping and put vcpu into halted state
> >  instead.
> >
> >  Interrupts will still be delivered to the guest and if interrupt will
> >  cause reschedule guest will continue to run another task.
> >
> >  Signed-off-by: Gleb Natapov<gleb@redhat.com>
> >  ---
> >   arch/x86/include/asm/kvm_host.h |   17 +++
> >   arch/x86/kvm/Kconfig            |    1 +
> >   arch/x86/kvm/Makefile           |    1 +
> >   arch/x86/kvm/mmu.c              |   51 +++++++++-
> >   arch/x86/kvm/paging_tmpl.h      |    4 +-
> >   arch/x86/kvm/x86.c              |  109 +++++++++++++++++++-
> >   include/linux/kvm_host.h        |   31 ++++++
> >   include/trace/events/kvm.h      |   88 ++++++++++++++++
> >   virt/kvm/Kconfig                |    3 +
> >   virt/kvm/async_pf.c             |  220 +++++++++++++++++++++++++++++++++++++++
> >   virt/kvm/async_pf.h             |   36 +++++++
> >   virt/kvm/kvm_main.c             |   57 ++++++++--
> >   12 files changed, 603 insertions(+), 15 deletions(-)
> >   create mode 100644 virt/kvm/async_pf.c
> >   create mode 100644 virt/kvm/async_pf.h
> >
>
> >  +	async_pf_cache = NULL;
> >  +}
> >  +
> >  +void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu)
> >  +{
> >  +	INIT_LIST_HEAD(&vcpu->async_pf.done);
> >  +	INIT_LIST_HEAD(&vcpu->async_pf.queue);
> >  +	spin_lock_init(&vcpu->async_pf.lock);
> >  +}
> >  +
> >  +static void async_pf_execute(struct work_struct *work)
> >  +{
> >  +	struct page *page;
> >  +	struct kvm_async_pf *apf =
> >  +		container_of(work, struct kvm_async_pf, work);
> >  +	struct mm_struct *mm = apf->mm;
> >  +	struct kvm_vcpu *vcpu = apf->vcpu;
> >  +	unsigned long addr = apf->addr;
> >  +	gva_t gva = apf->gva;
> >  +
> >  +	might_sleep();
> >  +
> >  +	use_mm(mm);
> >  +	down_read(&mm->mmap_sem);
> >  +	get_user_pages(current, mm, addr, 1, 1, 0,&page, NULL);
> >  +	up_read(&mm->mmap_sem);
> >  +	unuse_mm(mm);
> >  +
> >  +	spin_lock(&vcpu->async_pf.lock);
> >  +	list_add_tail(&apf->link,&vcpu->async_pf.done);
> >  +	apf->page = page;
> >  +	spin_unlock(&vcpu->async_pf.lock);
>
> This can fail, and apf->page become NULL.

Does it even become NULL?  On error, get_user_pages() won't update the 
pages argument, so page becomes garbage here.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-06 10:50     ` Avi Kivity
@ 2010-10-06 10:52       ` Gleb Natapov
  2010-10-07  9:54         ` Avi Kivity
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-06 10:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Marcelo Tosatti, kvm, linux-mm, linux-kernel, mingo,
	a.p.zijlstra, tglx, hpa, riel, cl

On Wed, Oct 06, 2010 at 12:50:01PM +0200, Avi Kivity wrote:
>  On 10/05/2010 04:59 PM, Marcelo Tosatti wrote:
> >On Mon, Oct 04, 2010 at 05:56:24PM +0200, Gleb Natapov wrote:
> >>  If a guest accesses swapped out memory do not swap it in from vcpu thread
> >>  context. Schedule work to do swapping and put vcpu into halted state
> >>  instead.
> >>
> >>  Interrupts will still be delivered to the guest and if interrupt will
> >>  cause reschedule guest will continue to run another task.
> >>
> >>  Signed-off-by: Gleb Natapov<gleb@redhat.com>
> >>  ---
> >>   arch/x86/include/asm/kvm_host.h |   17 +++
> >>   arch/x86/kvm/Kconfig            |    1 +
> >>   arch/x86/kvm/Makefile           |    1 +
> >>   arch/x86/kvm/mmu.c              |   51 +++++++++-
> >>   arch/x86/kvm/paging_tmpl.h      |    4 +-
> >>   arch/x86/kvm/x86.c              |  109 +++++++++++++++++++-
> >>   include/linux/kvm_host.h        |   31 ++++++
> >>   include/trace/events/kvm.h      |   88 ++++++++++++++++
> >>   virt/kvm/Kconfig                |    3 +
> >>   virt/kvm/async_pf.c             |  220 +++++++++++++++++++++++++++++++++++++++
> >>   virt/kvm/async_pf.h             |   36 +++++++
> >>   virt/kvm/kvm_main.c             |   57 ++++++++--
> >>   12 files changed, 603 insertions(+), 15 deletions(-)
> >>   create mode 100644 virt/kvm/async_pf.c
> >>   create mode 100644 virt/kvm/async_pf.h
> >>
> >
> >>  +	async_pf_cache = NULL;
> >>  +}
> >>  +
> >>  +void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu)
> >>  +{
> >>  +	INIT_LIST_HEAD(&vcpu->async_pf.done);
> >>  +	INIT_LIST_HEAD(&vcpu->async_pf.queue);
> >>  +	spin_lock_init(&vcpu->async_pf.lock);
> >>  +}
> >>  +
> >>  +static void async_pf_execute(struct work_struct *work)
> >>  +{
> >>  +	struct page *page;
> >>  +	struct kvm_async_pf *apf =
> >>  +		container_of(work, struct kvm_async_pf, work);
> >>  +	struct mm_struct *mm = apf->mm;
> >>  +	struct kvm_vcpu *vcpu = apf->vcpu;
> >>  +	unsigned long addr = apf->addr;
> >>  +	gva_t gva = apf->gva;
> >>  +
> >>  +	might_sleep();
> >>  +
> >>  +	use_mm(mm);
> >>  +	down_read(&mm->mmap_sem);
> >>  +	get_user_pages(current, mm, addr, 1, 1, 0,&page, NULL);
> >>  +	up_read(&mm->mmap_sem);
> >>  +	unuse_mm(mm);
> >>  +
> >>  +	spin_lock(&vcpu->async_pf.lock);
> >>  +	list_add_tail(&apf->link,&vcpu->async_pf.done);
> >>  +	apf->page = page;
> >>  +	spin_unlock(&vcpu->async_pf.lock);
> >
> >This can fail, and apf->page become NULL.
> 
> Does it even become NULL?  On error, get_user_pages() won't update
> the pages argument, so page becomes garbage here.
> 
apf is allocated with kmem_cache_zalloc() and ->page is set to NULL in
kvm_setup_async_pf() to be extra sure.

--
			Gleb.


* Re: [PATCH v6 07/12] Add async PF initialization to PV guest.
  2010-10-05 18:25   ` Marcelo Tosatti
@ 2010-10-06 10:55     ` Gleb Natapov
  2010-10-06 14:45       ` Marcelo Tosatti
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-06 10:55 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Tue, Oct 05, 2010 at 03:25:54PM -0300, Marcelo Tosatti wrote:
> On Mon, Oct 04, 2010 at 05:56:29PM +0200, Gleb Natapov wrote:
> > Enable async PF in a guest if async PF capability is discovered.
> > 
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> >  Documentation/kernel-parameters.txt |    3 +
> >  arch/x86/include/asm/kvm_para.h     |    5 ++
> >  arch/x86/kernel/kvm.c               |   92 +++++++++++++++++++++++++++++++++++
> >  3 files changed, 100 insertions(+), 0 deletions(-)
> > 
> 
> > +static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
> > +				    unsigned long action, void *hcpu)
> > +{
> > +	int cpu = (unsigned long)hcpu;
> > +	switch (action) {
> > +	case CPU_ONLINE:
> > +	case CPU_DOWN_FAILED:
> > +	case CPU_ONLINE_FROZEN:
> > +		smp_call_function_single(cpu, kvm_guest_cpu_notify, NULL, 0);
> 
> wait parameter should probably be 1.
Why should we wait for it? FWIW I copied this from somewhere (maybe
arch/x86/pci/amd_bus.c).

--
			Gleb.


* Re: [PATCH v6 03/12] Retry fault before vmentry
  2010-10-05 15:54   ` Marcelo Tosatti
@ 2010-10-06 11:07     ` Gleb Natapov
  2010-10-06 14:20       ` Marcelo Tosatti
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-06 11:07 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Tue, Oct 05, 2010 at 12:54:09PM -0300, Marcelo Tosatti wrote:
> On Mon, Oct 04, 2010 at 05:56:25PM +0200, Gleb Natapov wrote:
> > When page is swapped in it is mapped into guest memory only after guest
> > tries to access it again and generate another fault. To save this fault
> > we can map it immediately since we know that guest is going to access
> > the page. Do it only when tdp is enabled for now. Shadow paging case is
> > more complicated. CR[034] and EFER registers should be switched before
> > doing mapping and then switched back.
> > 
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |    4 +++-
> >  arch/x86/kvm/mmu.c              |   16 ++++++++--------
> >  arch/x86/kvm/paging_tmpl.h      |    6 +++---
> >  arch/x86/kvm/x86.c              |    7 +++++++
> >  virt/kvm/async_pf.c             |    2 ++
> >  5 files changed, 23 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 5f154d3..b9f263e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -240,7 +240,7 @@ struct kvm_mmu {
> >  	void (*new_cr3)(struct kvm_vcpu *vcpu);
> >  	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
> >  	unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
> > -	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
> > +	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, bool no_apf);
> >  	void (*inject_page_fault)(struct kvm_vcpu *vcpu);
> >  	void (*free)(struct kvm_vcpu *vcpu);
> >  	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
> > @@ -838,6 +838,8 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> >  				     struct kvm_async_pf *work);
> >  void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> >  				 struct kvm_async_pf *work);
> > +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
> > +			       struct kvm_async_pf *work);
> >  extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
> >  
> >  #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 4d49b5e..d85fda8 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -2558,7 +2558,7 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr,
> >  }
> >  
> >  static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
> > -				u32 error_code)
> > +				u32 error_code, bool no_apf)
> >  {
> >  	gfn_t gfn;
> >  	int r;
> > @@ -2594,8 +2594,8 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> >  	return kvm_x86_ops->interrupt_allowed(vcpu);
> >  }
> >  
> > -static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
> > -			 pfn_t *pfn)
> > +static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
> > +			 gva_t gva, pfn_t *pfn)
> >  {
> >  	bool async;
> >  
> > @@ -2606,7 +2606,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
> >  
> >  	put_page(pfn_to_page(*pfn));
> >  
> > -	if (can_do_async_pf(vcpu)) {
> > +	if (!no_apf && can_do_async_pf(vcpu)) {
> >  		trace_kvm_try_async_get_page(async, *pfn);
> >  		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
> >  			vcpu->async_pf.work = kvm_double_apf;
> > @@ -2620,8 +2620,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
> >  	return false;
> >  }
> >  
> > -static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
> > -				u32 error_code)
> > +static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
> > +			  bool no_apf)
> >  {
> >  	pfn_t pfn;
> >  	int r;
> > @@ -2643,7 +2643,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
> >  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
> >  	smp_rmb();
> >  
> > -	if (try_async_pf(vcpu, gfn, gpa, &pfn))
> > +	if (try_async_pf(vcpu, no_apf, gfn, gpa, &pfn))
> >  		return 0;
> >  
> >  	/* mmio */
> > @@ -3306,7 +3306,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code)
> >  	int r;
> >  	enum emulation_result er;
> >  
> > -	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code);
> > +	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
> >  	if (r < 0)
> >  		goto out;
> >  
> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index 8154353..9ad90f8 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -530,8 +530,8 @@ out_gpte_changed:
> >   *  Returns: 1 if we need to emulate the instruction, 0 otherwise, or
> >   *           a negative value on error.
> >   */
> > -static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
> > -			       u32 error_code)
> > +static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
> > +			     bool no_apf)
> >  {
> >  	int write_fault = error_code & PFERR_WRITE_MASK;
> >  	int user_fault = error_code & PFERR_USER_MASK;
> > @@ -574,7 +574,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
> >  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
> >  	smp_rmb();
> >  
> > -	if (try_async_pf(vcpu, walker.gfn, addr, &pfn))
> > +	if (try_async_pf(vcpu, no_apf, walker.gfn, addr, &pfn))
> >  		return 0;
> >  
> >  	/* mmio */
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 8dd9ac2..48fd59d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6123,6 +6123,13 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_set_rflags);
> >  
> > +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > +{
> > +	if (!tdp_enabled || is_error_page(work->page))
> > +		return;
> > +	vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true);
> > +}
> > +
> 
> Can't you set a bit in vcpu->requests instead, and handle it in "out:"
> at the end of vcpu_enter_guest? 
> 
> To have a single entry point for pagefaults, after vmexit handling.
Jumping to "out:" will skip vmexit handling, so we will not reuse the
same call site anyway. I don't yet see what advantage the approach you
propose would have.

--
			Gleb.


* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-05 16:57   ` Marcelo Tosatti
@ 2010-10-06 11:14     ` Gleb Natapov
  2010-10-06 14:38       ` Marcelo Tosatti
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-06 11:14 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Tue, Oct 05, 2010 at 01:57:38PM -0300, Marcelo Tosatti wrote:
> On Mon, Oct 04, 2010 at 05:56:26PM +0200, Gleb Natapov wrote:
> > Keep track of memslots changes by keeping generation number in memslots
> > structure. Provide kvm_write_guest_cached() function that skips
> > gfn_to_hva() translation if memslots was not changed since previous
> > invocation.
> > 
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> >  include/linux/kvm_host.h  |    7 +++++
> >  include/linux/kvm_types.h |    7 +++++
> >  virt/kvm/kvm_main.c       |   57 +++++++++++++++++++++++++++++++++++++++++---
> >  3 files changed, 67 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a08614e..4dff9a1 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -199,6 +199,7 @@ struct kvm_irq_routing_table {};
> >  
> >  struct kvm_memslots {
> >  	int nmemslots;
> > +	u32 generation;
> >  	struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
> >  					KVM_PRIVATE_MEM_SLOTS];
> >  };
> > @@ -352,12 +353,18 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
> >  			 int offset, int len);
> >  int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
> >  		    unsigned long len);
> > +int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
> > +			   void *data, unsigned long len);
> > +int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
> > +			      gpa_t gpa);
> >  int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
> >  int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
> >  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
> >  int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
> >  unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
> >  void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
> > +void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > +			     gfn_t gfn);
> >  
> >  void kvm_vcpu_block(struct kvm_vcpu *vcpu);
> >  void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 7ac0d4e..ee6eb71 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -67,4 +67,11 @@ struct kvm_lapic_irq {
> >  	u32 dest_id;
> >  };
> >  
> > +struct gfn_to_hva_cache {
> > +	u32 generation;
> > +	gpa_t gpa;
> > +	unsigned long hva;
> > +	struct kvm_memory_slot *memslot;
> > +};
> > +
> >  #endif /* __KVM_TYPES_H__ */
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index db58a1b..45ef50c 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -687,6 +687,7 @@ skip_lpage:
> >  		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
> >  		if (mem->slot >= slots->nmemslots)
> >  			slots->nmemslots = mem->slot + 1;
> > +		slots->generation++;
> >  		slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
> >  
> >  		old_memslots = kvm->memslots;
> > @@ -723,6 +724,7 @@ skip_lpage:
> >  	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
> >  	if (mem->slot >= slots->nmemslots)
> >  		slots->nmemslots = mem->slot + 1;
> > +	slots->generation++;
> >  
> >  	/* actual memory is freed via old in kvm_free_physmem_slot below */
> >  	if (!npages) {
> > @@ -1247,6 +1249,47 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
> >  	return 0;
> >  }
> >  
> > +int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
> > +			      gpa_t gpa)
> > +{
> > +	struct kvm_memslots *slots = kvm_memslots(kvm);
> > +	int offset = offset_in_page(gpa);
> > +	gfn_t gfn = gpa >> PAGE_SHIFT;
> > +
> > +	ghc->gpa = gpa;
> > +	ghc->generation = slots->generation;
> > +	ghc->memslot = gfn_to_memslot(kvm, gfn);
> > +	ghc->hva = gfn_to_hva(kvm, gfn);
> > +	if (!kvm_is_error_hva(ghc->hva))
> > +		ghc->hva += offset;
> > +	else
> > +		return -EFAULT;
> > +
> > +	return 0;
> > +}
> 
> Should use a unique kvm_memslots structure for the cache entry, since it
> can change in between (use gfn_to_hva_memslot, etc on "slots" pointer).
> 
I do not understand what you mean here. The kvm_memslots structure itself
is not cached, only the various translations that use it are cached. The
translation results are never used if kvm_memslots has changed.

> Also should zap any cached entries on overflow, otherwise malicious
> userspace could make use of stale slots:
> 
There is only one cached entry at any given time. A user who wants to
write into guest memory often defines a gfn_to_hva_cache variable
somewhere, initializes it with kvm_gfn_to_hva_cache_init() and then calls
kvm_write_guest_cached() on it. If there were no slot changes in between,
the cached translation is used. Otherwise the cache is recalculated.
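
I.e. the intended usage is: init once, then write through the cache.
This is essentially what the async-pf code in later patches does with
vcpu->arch.apf.data:

	struct gfn_to_hva_cache ghc;
	u32 val = KVM_PV_REASON_PAGE_NOT_PRESENT;

	if (kvm_gfn_to_hva_cache_init(kvm, &ghc, gpa))
		return -EFAULT;		/* bad gpa */

	kvm_write_guest_cached(kvm, &ghc, &val, sizeof(val));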

> > +void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	struct kvm_memory_slot *memslot;
> > +
> > +	memslot = gfn_to_memslot(kvm, gfn);
> > +	mark_page_dirty_in_slot(kvm, memslot, gfn);
> > +}
> > +
> >  /*
> >   * The vCPU has executed a HLT instruction with in-kernel mode enabled.
> >   */
> > -- 
> > 1.7.1

--
			Gleb.


* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-05 14:59   ` Marcelo Tosatti
  2010-10-06 10:50     ` Avi Kivity
@ 2010-10-06 11:15     ` Gleb Natapov
  1 sibling, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-06 11:15 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Tue, Oct 05, 2010 at 11:59:16AM -0300, Marcelo Tosatti wrote:
> On Mon, Oct 04, 2010 at 05:56:24PM +0200, Gleb Natapov wrote:
> > If a guest accesses swapped out memory do not swap it in from vcpu thread
> > context. Schedule work to do swapping and put vcpu into halted state
> > instead.
> > 
> > Interrupts will still be delivered to the guest and if interrupt will
> > cause reschedule guest will continue to run another task.
> > 
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |   17 +++
> >  arch/x86/kvm/Kconfig            |    1 +
> >  arch/x86/kvm/Makefile           |    1 +
> >  arch/x86/kvm/mmu.c              |   51 +++++++++-
> >  arch/x86/kvm/paging_tmpl.h      |    4 +-
> >  arch/x86/kvm/x86.c              |  109 +++++++++++++++++++-
> >  include/linux/kvm_host.h        |   31 ++++++
> >  include/trace/events/kvm.h      |   88 ++++++++++++++++
> >  virt/kvm/Kconfig                |    3 +
> >  virt/kvm/async_pf.c             |  220 +++++++++++++++++++++++++++++++++++++++
> >  virt/kvm/async_pf.h             |   36 +++++++
> >  virt/kvm/kvm_main.c             |   57 ++++++++--
> >  12 files changed, 603 insertions(+), 15 deletions(-)
> >  create mode 100644 virt/kvm/async_pf.c
> >  create mode 100644 virt/kvm/async_pf.h
> > 
> 
> > +	async_pf_cache = NULL;
> > +}
> > +
> > +void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu)
> > +{
> > +	INIT_LIST_HEAD(&vcpu->async_pf.done);
> > +	INIT_LIST_HEAD(&vcpu->async_pf.queue);
> > +	spin_lock_init(&vcpu->async_pf.lock);
> > +}
> > +
> > +static void async_pf_execute(struct work_struct *work)
> > +{
> > +	struct page *page;
> > +	struct kvm_async_pf *apf =
> > +		container_of(work, struct kvm_async_pf, work);
> > +	struct mm_struct *mm = apf->mm;
> > +	struct kvm_vcpu *vcpu = apf->vcpu;
> > +	unsigned long addr = apf->addr;
> > +	gva_t gva = apf->gva;
> > +
> > +	might_sleep();
> > +
> > +	use_mm(mm);
> > +	down_read(&mm->mmap_sem);
> > +	get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
> > +	up_read(&mm->mmap_sem);
> > +	unuse_mm(mm);
> > +
> > +	spin_lock(&vcpu->async_pf.lock);
> > +	list_add_tail(&apf->link, &vcpu->async_pf.done);
> > +	apf->page = page;
> > +	spin_unlock(&vcpu->async_pf.lock);
> 
> This can fail, and apf->page become NULL.
> 
> > +	if (list_empty_careful(&vcpu->async_pf.done))
> > +		return;
> > +
> > +	spin_lock(&vcpu->async_pf.lock);
> > +	work = list_first_entry(&vcpu->async_pf.done, typeof(*work), link);
> > +	list_del(&work->link);
> > +	spin_unlock(&vcpu->async_pf.lock);
> > +
> > +	kvm_arch_async_page_present(vcpu, work);
> > +
> > +free:
> > +	list_del(&work->queue);
> > +	vcpu->async_pf.queued--;
> > +	put_page(work->page);
> > +	kmem_cache_free(async_pf_cache, work);
> > +}
> 
> Better handle it here (and other sites).
Yeah. We should just reenter the guest and let the usual code path handle
the error on the next guest access.
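
Something like this in async_pf_execute(), I think (untested):

	int npages;

	down_read(&mm->mmap_sem);
	npages = get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
	up_read(&mm->mmap_sem);
	if (npages != 1)
		page = NULL;	/* completion path then skips put_page()
				 * and the prefault, and the vcpu just
				 * faults again on the next access */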

--
			Gleb.


* Re: [PATCH v6 03/12] Retry fault before vmentry
  2010-10-06 11:07     ` Gleb Natapov
@ 2010-10-06 14:20       ` Marcelo Tosatti
  2010-10-07 18:44         ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Marcelo Tosatti @ 2010-10-06 14:20 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Wed, Oct 06, 2010 at 01:07:04PM +0200, Gleb Natapov wrote:
> > Can't you set a bit in vcpu->requests instead, and handle it in "out:"
> > at the end of vcpu_enter_guest? 
> > 
> > To have a single entry point for pagefaults, after vmexit handling.
> Jumping to "out:" will skip vmexit handling, so we will not reuse the
> same call site anyway. I don't yet see what advantage the approach you
> propose would have.

What I meant was to call the pagefault handler after vmexit handling.

Because the way it is in your patch now, with the pre-pagefault on entry,
one has to make an effort to verify the ordering wrt other events in
entry processing.

With the pre-pagefault after vmexit, it's more natural.

Does that make sense?



* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-06 11:14     ` Gleb Natapov
@ 2010-10-06 14:38       ` Marcelo Tosatti
  2010-10-06 20:08         ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Marcelo Tosatti @ 2010-10-06 14:38 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Wed, Oct 06, 2010 at 01:14:17PM +0200, Gleb Natapov wrote:
> > > +int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
> > > +			      gpa_t gpa)
> > > +{
> > > +	struct kvm_memslots *slots = kvm_memslots(kvm);
> > > +	int offset = offset_in_page(gpa);
> > > +	gfn_t gfn = gpa >> PAGE_SHIFT;
> > > +
> > > +	ghc->gpa = gpa;
> > > +	ghc->generation = slots->generation;

kvm->memslots can change here.

> > > +	ghc->memslot = gfn_to_memslot(kvm, gfn);
> > > +	ghc->hva = gfn_to_hva(kvm, gfn);

And if so, gfn_to_memslot / gfn_to_hva will use new memslots pointer.

Should dereference all values from one copy of kvm->memslots pointer.
 
> > > +	if (!kvm_is_error_hva(ghc->hva))
> > > +		ghc->hva += offset;
> > > +	else
> > > +		return -EFAULT;
> > > +
> > > +	return 0;
> > > +}
> > 
> > Should use a unique kvm_memslots structure for the cache entry, since it
> > can change in between (use gfn_to_hva_memslot, etc on "slots" pointer).
> > 
> I do not understand what you mean here. The kvm_memslots structure itself
> is not cached, only the various translations that use it are cached. The
> translation results are never used if kvm_memslots has changed.

> > Also should zap any cached entries on overflow, otherwise malicious
> > userspace could make use of stale slots:
> > 
> There is only one cached entry at any given time. A user who wants to
> write into guest memory often defines a gfn_to_hva_cache variable
> somewhere, initializes it with kvm_gfn_to_hva_cache_init() and then calls
> kvm_write_guest_cached() on it. If there were no slot changes in between,
> the cached translation is used. Otherwise the cache is recalculated.

A malicious userspace can cause an entry to be cached, issue the
SET_USER_MEMORY_REGION ioctl 2^32 times so that the generation number
wraps around and matches again, and mark_page_dirty_in_slot() will then
be called with a pointer to freed memory.



* Re: [PATCH v6 07/12] Add async PF initialization to PV guest.
  2010-10-06 10:55     ` Gleb Natapov
@ 2010-10-06 14:45       ` Marcelo Tosatti
  2010-10-06 20:05         ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Marcelo Tosatti @ 2010-10-06 14:45 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Wed, Oct 06, 2010 at 12:55:04PM +0200, Gleb Natapov wrote:
> On Tue, Oct 05, 2010 at 03:25:54PM -0300, Marcelo Tosatti wrote:
> > On Mon, Oct 04, 2010 at 05:56:29PM +0200, Gleb Natapov wrote:
> > > Enable async PF in a guest if async PF capability is discovered.
> > > 
> > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > ---
> > >  Documentation/kernel-parameters.txt |    3 +
> > >  arch/x86/include/asm/kvm_para.h     |    5 ++
> > >  arch/x86/kernel/kvm.c               |   92 +++++++++++++++++++++++++++++++++++
> > >  3 files changed, 100 insertions(+), 0 deletions(-)
> > > 
> > 
> > > +static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
> > > +				    unsigned long action, void *hcpu)
> > > +{
> > > +	int cpu = (unsigned long)hcpu;
> > > +	switch (action) {
> > > +	case CPU_ONLINE:
> > > +	case CPU_DOWN_FAILED:
> > > +	case CPU_ONLINE_FROZEN:
> > > +		smp_call_function_single(cpu, kvm_guest_cpu_notify, NULL, 0);
> > 
> > wait parameter should probably be 1.
> Why should we wait for it? FWIW I copied this from somewhere (maybe
> arch/x86/pci/amd_bus.c).

So that you know it's executed at a defined point in cpu bringup.



* Re: [PATCH v6 07/12] Add async PF initialization to PV guest.
  2010-10-06 14:45       ` Marcelo Tosatti
@ 2010-10-06 20:05         ` Gleb Natapov
  0 siblings, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-06 20:05 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, avi, mingo,
	a.p.zijlstra, tglx, hpa, riel, cl

On Wed, Oct 06, 2010 at 11:45:12AM -0300, Marcelo Tosatti wrote:
> On Wed, Oct 06, 2010 at 12:55:04PM +0200, Gleb Natapov wrote:
> > On Tue, Oct 05, 2010 at 03:25:54PM -0300, Marcelo Tosatti wrote:
> > > On Mon, Oct 04, 2010 at 05:56:29PM +0200, Gleb Natapov wrote:
> > > > Enable async PF in a guest if async PF capability is discovered.
> > > > 
> > > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > > ---
> > > >  Documentation/kernel-parameters.txt |    3 +
> > > >  arch/x86/include/asm/kvm_para.h     |    5 ++
> > > >  arch/x86/kernel/kvm.c               |   92 +++++++++++++++++++++++++++++++++++
> > > >  3 files changed, 100 insertions(+), 0 deletions(-)
> > > > 
> > > 
> > > > +static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
> > > > +				    unsigned long action, void *hcpu)
> > > > +{
> > > > +	int cpu = (unsigned long)hcpu;
> > > > +	switch (action) {
> > > > +	case CPU_ONLINE:
> > > > +	case CPU_DOWN_FAILED:
> > > > +	case CPU_ONLINE_FROZEN:
> > > > +		smp_call_function_single(cpu, kvm_guest_cpu_notify, NULL, 0);
> > > 
> > > wait parameter should probably be 1.
> > Why should we wait for it? FWIW I copied this from somewhere (maybe
> > arch/x86/pci/amd_bus.c).
> 
> So that you know it's executed at a defined point in cpu bringup.
> 
If I read the code correctly, the CPU we are notified about is already
running when the callback is called, so I do not see what waiting for the
IPI to be processed will accomplish here. With many cpus it will only make
boot a little bit slower. I don't care too much though, so if you still
think that 1 is required here I'll make it so.

--
			Gleb.


* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-06 14:38       ` Marcelo Tosatti
@ 2010-10-06 20:08         ` Gleb Natapov
  2010-10-07 10:00           ` Avi Kivity
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-06 20:08 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, avi, mingo,
	a.p.zijlstra, tglx, hpa, riel, cl

On Wed, Oct 06, 2010 at 11:38:47AM -0300, Marcelo Tosatti wrote:
> On Wed, Oct 06, 2010 at 01:14:17PM +0200, Gleb Natapov wrote:
> > > > +int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
> > > > +			      gpa_t gpa)
> > > > +{
> > > > +	struct kvm_memslots *slots = kvm_memslots(kvm);
> > > > +	int offset = offset_in_page(gpa);
> > > > +	gfn_t gfn = gpa >> PAGE_SHIFT;
> > > > +
> > > > +	ghc->gpa = gpa;
> > > > +	ghc->generation = slots->generation;
> 
> kvm->memslots can change here.
> 
> > > > +	ghc->memslot = gfn_to_memslot(kvm, gfn);
> > > > +	ghc->hva = gfn_to_hva(kvm, gfn);
> 
> And if so, gfn_to_memslot / gfn_to_hva will use new memslots pointer.
> 
> Should dereference all values from one copy of kvm->memslots pointer.
>  
Ah, I see now. Thanks! Will fix.
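
I.e. take the slots pointer once and derive everything from it. Roughly
(untested; __gfn_to_memslot() stands for whatever slot-lookup helper
takes an explicit slots pointer, along the lines of the
gfn_to_hva_memslot you suggested):

	struct kvm_memslots *slots = kvm_memslots(kvm);
	int offset = offset_in_page(gpa);
	gfn_t gfn = gpa >> PAGE_SHIFT;

	ghc->gpa = gpa;
	ghc->generation = slots->generation;
	ghc->memslot = __gfn_to_memslot(slots, gfn);	/* assumed helper */
	ghc->hva = gfn_to_hva_memslot(ghc->memslot, gfn);
	if (kvm_is_error_hva(ghc->hva))
		return -EFAULT;
	ghc->hva += offset;
	return 0;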

> > > > +	if (!kvm_is_error_hva(ghc->hva))
> > > > +		ghc->hva += offset;
> > > > +	else
> > > > +		return -EFAULT;
> > > > +
> > > > +	return 0;
> > > > +}
> > > 
> > > Should use a unique kvm_memslots structure for the cache entry, since it
> > > can change in between (use gfn_to_hva_memslot, etc on "slots" pointer).
> > > 
> > I do not understand what you mean here. The kvm_memslots structure itself
> > is not cached, only the various translations that use it are cached. The
> > translation results are never used if kvm_memslots has changed.
> 
> > > Also should zap any cached entries on overflow, otherwise malicious
> > > userspace could make use of stale slots:
> > > 
> > There is only one cached entry at any given time. A user who wants to
> > write into guest memory often defines a gfn_to_hva_cache variable
> > somewhere, initializes it with kvm_gfn_to_hva_cache_init() and then calls
> > kvm_write_guest_cached() on it. If there were no slot changes in between,
> > the cached translation is used. Otherwise the cache is recalculated.
> 
> A malicious userspace can cause an entry to be cached, issue the
> SET_USER_MEMORY_REGION ioctl 2^32 times so that the generation number
> wraps around and matches again, and mark_page_dirty_in_slot() will then
> be called with a pointer to freed memory.
> 
Hmm. To zap all cached entries on overflow we need to track them. But if
we track them, then we can zap them on each slot update and drop
"generation" entirely.

--
			Gleb.


* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-04 15:56 ` [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out Gleb Natapov
  2010-10-05  1:20   ` Rik van Riel
  2010-10-05 14:59   ` Marcelo Tosatti
@ 2010-10-07  9:50   ` Avi Kivity
  2010-10-07  9:52     ` Avi Kivity
                       ` (2 more replies)
  2 siblings, 3 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-07  9:50 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> If a guest accesses swapped out memory do not swap it in from vcpu thread
> context. Schedule work to do swapping and put vcpu into halted state
> instead.
>
> Interrupts will still be delivered to the guest and if interrupt will
> cause reschedule guest will continue to run another task.
>
>
> +
> +static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> +{
> +	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
> +		     kvm_event_needs_reinjection(vcpu)))
> +		return false;
> +
> +	return kvm_x86_ops->interrupt_allowed(vcpu);
> +}

Strictly speaking, if the cpu can handle NMIs it can take an apf?

> @@ -5112,6 +5122,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>   	if (unlikely(r))
>   		goto out;
>
> +	kvm_check_async_pf_completion(vcpu);
> +	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
> +		/* Page is swapped out. Do synthetic halt */
> +		r = 1;
> +		goto out;
> +	}
> +

Why do it here in the fast path?  Can't you halt the cpu when starting 
the page fault?

I guess the apf threads can't touch mp_state, but they can have a 
KVM_REQ to trigger the check.

>   	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>   		inject_pending_event(vcpu);
>
> @@ -5781,6 +5798,9 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
>
>   	kvm_make_request(KVM_REQ_EVENT, vcpu);
>
> +	kvm_clear_async_pf_completion_queue(vcpu);
> +	memset(vcpu->arch.apf.gfns, 0xff, sizeof vcpu->arch.apf.gfns);

An ordinary for loop is less tricky, even if it means one more line.
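
e.g. (spelling the "empty" value explicitly also answers the -1 question
further down):

	int i;

	for (i = 0; i < ARRAY_SIZE(vcpu->arch.apf.gfns); i++)
		vcpu->arch.apf.gfns[i] = ~(gfn_t)0;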

>
> @@ -6040,6 +6064,7 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
>   int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>   {
>   	return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
> +		|| !list_empty_careful(&vcpu->async_pf.done)
>   		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
>   		|| vcpu->arch.nmi_pending ||
>   		(kvm_arch_interrupt_allowed(vcpu)&&

Unrelated, shouldn't kvm_arch_vcpu_runnable() look at vcpu->requests?  
Specifically KVM_REQ_EVENT?

> +static void kvm_add_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_async_pf_hash_fn(gfn);
> +
> +	while (vcpu->arch.apf.gfns[key] != -1)
> +		key = kvm_async_pf_next_probe(key);

Not sure what that -1 converts to on i386 where gfn_t is u64.
> +
> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> +				     struct kvm_async_pf *work)
> +{
> +	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> +
> +	if (work == kvm_double_apf)
> +		trace_kvm_async_pf_doublefault(kvm_rip_read(vcpu));
> +	else {
> +		trace_kvm_async_pf_not_present(work->gva);
> +
> +		kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
> +	}
> +}

Just have vcpu as the argument for tracepoints to avoid unconditional 
kvm_rip_read (slow on Intel), and call kvm_rip_read() in 
TP_fast_assign().  Similarly you can pass work instead of work->gva,
though that's not nearly as important.
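
For the doublefault one that would look roughly like this (untested
sketch, field name made up):

	TRACE_EVENT(
		kvm_async_pf_doublefault,
		TP_PROTO(struct kvm_vcpu *vcpu),
		TP_ARGS(vcpu),

		TP_STRUCT__entry(
			__field(unsigned long, rip)
			),

		TP_fast_assign(
			__entry->rip = kvm_rip_read(vcpu);
			),

		TP_printk("rip 0x%lx", __entry->rip)
	);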

> +
> +TRACE_EVENT(
> +	kvm_async_pf_not_present,
> +	TP_PROTO(u64 gva),
> +	TP_ARGS(gva),

Do you actually have a gva with tdp?  With nested virtualization, how do 
you interpret this gva?
> +
> +TRACE_EVENT(
> +	kvm_async_pf_completed,
> +	TP_PROTO(unsigned long address, struct page *page, u64 gva),
> +	TP_ARGS(address, page, gva),

What does address mean?  There's also gva?

> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, address)
> +		__field(struct page*, page)
> +		__field(u64, gva)
> +		),
> +
> +	TP_fast_assign(
> +		__entry->address = address;
> +		__entry->page = page;
> +		__entry->gva = gva;
> +		),

Recording a struct page * in a tracepoint?  Userspace can read this
entry; better to do the page_to_pfn() here.
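
i.e. store the pfn rather than the kernel pointer, something like

	__field(unsigned long, pfn)

in TP_STRUCT__entry and

	__entry->pfn = page_to_pfn(page);

in TP_fast_assign().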


> +void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> +{
> +	/* cancel outstanding work queue item */
> +	while (!list_empty(&vcpu->async_pf.queue)) {
> +		struct kvm_async_pf *work =
> +			list_entry(vcpu->async_pf.queue.next,
> +				   typeof(*work), queue);
> +		cancel_work_sync(&work->work);
> +		list_del(&work->queue);
> +		if (!work->page) /* work was canceled */
> +			kmem_cache_free(async_pf_cache, work);
> +	}

Are you holding any lock here?

If not, what protects vcpu->async_pf.queue?
If yes, cancel_work_sync() will need to acquire it too (in case work is 
running now and needs to take the lock, and cancel_work_sync() needs to 
wait for it) -> deadlock.

> +
> +	/* do alloc nowait since if we are going to sleep anyway we
> +	   may as well sleep faulting in page */
/*
  * multi
  * line
  * comment
  */

(but a good one, this is subtle)

I missed where you halt the vcpu.  Can you point me at the function?

Note this is a synthetic halt and must not be visible to live migration, 
or we risk live migrating a halted state which doesn't really exist.

Might be simplest to drain the apf queue on any of the save/restore ioctls.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-07  9:50   ` Avi Kivity
@ 2010-10-07  9:52     ` Avi Kivity
  2010-10-07 13:24     ` Rik van Riel
  2010-10-07 17:47     ` Gleb Natapov
  2 siblings, 0 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-07  9:52 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/07/2010 11:50 AM, Avi Kivity wrote:
>
> I missed where you halt the vcpu.  Can you point me at the function?

Found it.

Multiplexing a guest state with a host state is a bad idea.  Better to 
have separate state.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-06 10:52       ` Gleb Natapov
@ 2010-10-07  9:54         ` Avi Kivity
  2010-10-07 17:48           ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-07  9:54 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Marcelo Tosatti, kvm, linux-mm, linux-kernel, mingo,
	a.p.zijlstra, tglx, hpa, riel, cl

  On 10/06/2010 12:52 PM, Gleb Natapov wrote:
> On Wed, Oct 06, 2010 at 12:50:01PM +0200, Avi Kivity wrote:
> >   On 10/05/2010 04:59 PM, Marcelo Tosatti wrote:
> >  >On Mon, Oct 04, 2010 at 05:56:24PM +0200, Gleb Natapov wrote:
> >  >>   If a guest accesses swapped out memory do not swap it in from vcpu thread
> >  >>   context. Schedule work to do swapping and put vcpu into halted state
> >  >>   instead.
> >  >>
> >  >>   Interrupts will still be delivered to the guest and if interrupt will
> >  >>   cause reschedule guest will continue to run another task.
> >  >>
> >  >>   Signed-off-by: Gleb Natapov<gleb@redhat.com>
> >  >>   ---
> >  >>    arch/x86/include/asm/kvm_host.h |   17 +++
> >  >>    arch/x86/kvm/Kconfig            |    1 +
> >  >>    arch/x86/kvm/Makefile           |    1 +
> >  >>    arch/x86/kvm/mmu.c              |   51 +++++++++-
> >  >>    arch/x86/kvm/paging_tmpl.h      |    4 +-
> >  >>    arch/x86/kvm/x86.c              |  109 +++++++++++++++++++-
> >  >>    include/linux/kvm_host.h        |   31 ++++++
> >  >>    include/trace/events/kvm.h      |   88 ++++++++++++++++
> >  >>    virt/kvm/Kconfig                |    3 +
> >  >>    virt/kvm/async_pf.c             |  220 +++++++++++++++++++++++++++++++++++++++
> >  >>    virt/kvm/async_pf.h             |   36 +++++++
> >  >>    virt/kvm/kvm_main.c             |   57 ++++++++--
> >  >>    12 files changed, 603 insertions(+), 15 deletions(-)
> >  >>    create mode 100644 virt/kvm/async_pf.c
> >  >>    create mode 100644 virt/kvm/async_pf.h
> >  >>
> >  >
> >  >>   +	async_pf_cache = NULL;
> >  >>   +}
> >  >>   +
> >  >>   +void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu)
> >  >>   +{
> >  >>   +	INIT_LIST_HEAD(&vcpu->async_pf.done);
> >  >>   +	INIT_LIST_HEAD(&vcpu->async_pf.queue);
> >  >>   +	spin_lock_init(&vcpu->async_pf.lock);
> >  >>   +}
> >  >>   +
> >  >>   +static void async_pf_execute(struct work_struct *work)
> >  >>   +{
> >  >>   +	struct page *page;
> >  >>   +	struct kvm_async_pf *apf =
> >  >>   +		container_of(work, struct kvm_async_pf, work);
> >  >>   +	struct mm_struct *mm = apf->mm;
> >  >>   +	struct kvm_vcpu *vcpu = apf->vcpu;
> >  >>   +	unsigned long addr = apf->addr;
> >  >>   +	gva_t gva = apf->gva;
> >  >>   +
> >  >>   +	might_sleep();
> >  >>   +
> >  >>   +	use_mm(mm);
> >  >>   +	down_read(&mm->mmap_sem);
> >  >>   +	get_user_pages(current, mm, addr, 1, 1, 0,&page, NULL);
> >  >>   +	up_read(&mm->mmap_sem);
> >  >>   +	unuse_mm(mm);
> >  >>   +
> >  >>   +	spin_lock(&vcpu->async_pf.lock);
> >  >>   +	list_add_tail(&apf->link,&vcpu->async_pf.done);
> >  >>   +	apf->page = page;
> >  >>   +	spin_unlock(&vcpu->async_pf.lock);
> >  >
> >  >This can fail, and apf->page become NULL.
> >
> >  Does it even become NULL?  On error, get_user_pages() won't update
> >  the pages argument, so page becomes garbage here.
> >
> apf is allocated with kmem_cache_zalloc() and ->page is set to NULL in
> kvm_setup_async_pf() to be extra sure.
>

But you assign apf->page = page;, overriding it with garbage here.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-06 20:08         ` Gleb Natapov
@ 2010-10-07 10:00           ` Avi Kivity
  2010-10-07 15:42             ` Marcelo Tosatti
  0 siblings, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-07 10:00 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Marcelo Tosatti, Gleb Natapov, kvm, linux-mm, linux-kernel,
	mingo, a.p.zijlstra, tglx, hpa, riel, cl

  On 10/06/2010 10:08 PM, Gleb Natapov wrote:
> >  Malicious userspace can cause entry to be cached, ioctl
> >  SET_USER_MEMORY_REGION 2^32 times, generation number will match,
> >  mark_page_dirty_in_slot will be called with pointer to freed memory.
> >
> Hmm. To zap all cached entires on overflow we need to track them. If we
> will track then we can zap them on each slot update and drop "generation"
> entirely.

To track them you need locking.

Isn't SET_USER_MEMORY_REGION so slow that calling it 2^32 times isn't 
really feasible?

In any case, can use u64 generation count.
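
I.e. keep the scheme in the patch, just widen the counter (sketch; field
names as in the patch under review):

	/* in struct kvm_memslots and struct gfn_to_hva_cache */
	u64 generation;

	/* the cached-write fast path stays the same */
	if (unlikely(ghc->generation != kvm_memslots(kvm)->generation))
		/* redo the gfn_to_hva lookup and cache the result */ ;

A u64 at 100 Hz overflows after billions of years, so wraparound stops
being a practical concern.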

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 03/12] Retry fault before vmentry
  2010-10-04 15:56 ` [PATCH v6 03/12] Retry fault before vmentry Gleb Natapov
  2010-10-05 15:54   ` Marcelo Tosatti
@ 2010-10-07 12:29   ` Avi Kivity
  2010-10-07 17:21     ` Gleb Natapov
  1 sibling, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-07 12:29 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> When page is swapped in it is mapped into guest memory only after guest
> tries to access it again and generate another fault. To save this fault
> we can map it immediately since we know that guest is going to access
> the page. Do it only when tdp is enabled for now. Shadow paging case is
> more complicated. CR[034] and EFER registers should be switched before
> doing mapping and then switched back.

With non-pv apf, I don't think we can do shadow paging.  The guest isn't 
aware of the apf, so as far as it is concerned it is allowed to kill the 
process and replace it with something else:

   guest process x: apf
   kvm: timer intr
   guest kernel: context switch
   very fast guest admin: pkill -9 x
   guest kernel: destroy x's cr3
   guest kernel: reuse x's cr3 for new process y
   kvm: retry fault, instantiating x's page in y's page table

Even with tdp, we have the same case for nnpt (just s/kernel/hypervisor/ 
and s/process/guest/).  What we really need is to only instantiate the 
page for direct maps, which are independent of the guest.

Could be done like this:

- at apf time, walk shadow mmu
- if !sp->role.direct, abort
- take reference to sp
- on apf completion, instantiate spte in sp
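
Very roughly (page_header() and role.direct are mmu.c internals; the
reference-taking is hand-waved):

	/* at apf time, under mmu_lock; sptep is the missing last-level spte */
	struct kvm_mmu_page *sp = page_header(__pa(sptep));

	if (!sp->role.direct)
		return false;	/* mapping depends on guest paging state, abort */

	/* take a reference to sp and remember gfn/level for later */

	/* on apf completion, under mmu_lock: if sp is still live, instantiate
	 * the spte in it via the direct map path, otherwise just drop it */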

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-04 15:56 ` [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface Gleb Natapov
  2010-10-05  1:29   ` Rik van Riel
  2010-10-05 16:57   ` Marcelo Tosatti
@ 2010-10-07 12:31   ` Avi Kivity
  2 siblings, 0 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-07 12:31 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> Keep track of memslots changes by keeping generation number in memslots
> structure. Provide kvm_write_guest_cached() function that skips
> gfn_to_hva() translation if memslots was not changed since previous
> invocation.

btw, this patch (and patch 5, and perhaps more) can be applied 
independently.  If you like, you can submit them before the patch set is 
complete to reduce your queue length.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery.
  2010-10-04 15:56 ` [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery Gleb Natapov
@ 2010-10-07 12:42   ` Avi Kivity
  2010-10-07 17:53     ` Gleb Natapov
  2010-10-07 12:58   ` Avi Kivity
  1 sibling, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-07 12:42 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> Guest enables async PF vcpu functionality using this MSR.
>
>   			return NON_PRESENT;
> +
> +MSR_KVM_ASYNC_PF_EN: 0x4b564d02
> +	data: Bits 63-6 hold 64-byte aligned physical address of a 32bit memory

Given that it must be aligned anyway, we can require it to be a 64-byte 
region and also require that the guest zero it before writing the MSR.  
That will give us a little more flexibility in the future.

> +	area which must be in guest RAM. Bits 5-1 are reserved and should be
> +	zero. Bit 0 is 1 when asynchronous page faults are enabled on the vcpu
> +	0 when disabled.
> +
> +	Physical address points to 32 bit memory location that will be written
> +	to by the hypervisor at the time of asynchronous page fault injection to
> +	indicate type of asynchronous page fault. Value of 1 means that the page
> +	referred to by the page fault is not present. Value 2 means that the
> +	page is now available.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b9f263e..de31551 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -417,6 +417,8 @@ struct kvm_vcpu_arch {
>
>   	struct {
>   		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
> +		struct gfn_to_hva_cache data;
> +		u64 msr_val;
>   	} apf;
>   };
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index e3faaaf..8662ae0 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -20,6 +20,7 @@
>    * are available. The use of 0x11 and 0x12 is deprecated
>    */
>   #define KVM_FEATURE_CLOCKSOURCE2        3
> +#define KVM_FEATURE_ASYNC_PF		4
>
>   /* The last 8 bits are used to indicate how to interpret the flags field
>    * in pvclock structure. If no bits are set, all flags are ignored.
> @@ -32,9 +33,12 @@
>   /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
>   #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
>   #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
> +#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
>
>   #define KVM_MAX_MMU_OP_BATCH           32
>
> +#define KVM_ASYNC_PF_ENABLED			(1<<  0)
> +
>   /* Operations for KVM_HC_MMU_OP */
>   #define KVM_MMU_OP_WRITE_PTE            1
>   #define KVM_MMU_OP_FLUSH_TLB	        2
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 48fd59d..3e123ab 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -782,12 +782,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
>    * kvm-specific. Those are put in the beginning of the list.
>    */
>
> -#define KVM_SAVE_MSRS_BEGIN	7
> +#define KVM_SAVE_MSRS_BEGIN	8
>   static u32 msrs_to_save[] = {
>   	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
>   	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
>   	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
> -	HV_X64_MSR_APIC_ASSIST_PAGE,
> +	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
>   	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
>   	MSR_STAR,
>   #ifdef CONFIG_X86_64
> @@ -1425,6 +1425,29 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>   	return 0;
>   }
>
> +static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
> +{
> +	gpa_t gpa = data&  ~0x3f;
> +
> +	/* Bits 1:5 are resrved, Should be zero */
> +	if (data&  0x3e)
> +		return 1;
> +
> +	vcpu->arch.apf.msr_val = data;
> +
> +	if (!(data&  KVM_ASYNC_PF_ENABLED)) {
> +		kvm_clear_async_pf_completion_queue(vcpu);

May be a lengthy synchronous operation.  I guess we don't care.

> +		memset(vcpu->arch.apf.gfns, 0xff, sizeof vcpu->arch.apf.gfns);

That memset again.

> +		return 0;
> +	}
> +
> +	if (kvm_gfn_to_hva_cache_init(vcpu->kvm,&vcpu->arch.apf.data, gpa))
> +		return 1;

Note: we need to handle the memory being removed from underneath 
kvm_gfn_to_hva_cache().  Given that, we can just make 
kvm_gfn_to_hva_cache_init() return void.  "success" means nothing when 
future changes can invalidate it.

> +
> +	kvm_async_pf_wakeup_all(vcpu);

Why is this needed?  If all apfs are flushed at disable time, what do we 
need to wake up?

Need to list the MSR for save/restore/reset.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 07/12] Add async PF initialization to PV guest.
  2010-10-04 15:56 ` [PATCH v6 07/12] Add async PF initialization to PV guest Gleb Natapov
  2010-10-05  2:34   ` Rik van Riel
  2010-10-05 18:25   ` Marcelo Tosatti
@ 2010-10-07 12:50   ` Avi Kivity
  2010-10-08  7:54     ` Gleb Natapov
  2 siblings, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-07 12:50 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> Enable async PF in a guest if async PF capability is discovered.
>
>
> +void __cpuinit kvm_guest_cpu_init(void)
> +{
> +	if (!kvm_para_available())
> +		return;
> +
> +	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)&&  kvmapf) {
> +		u64 pa = __pa(&__get_cpu_var(apf_reason));
> +
> +		if (native_write_msr_safe(MSR_KVM_ASYNC_PF_EN,
> +					  pa | KVM_ASYNC_PF_ENABLED, pa>>  32))

native_ versions of processor accessors shouldn't be used generally.

Also, the MSR isn't documented to fail on valid input, so you can use a 
normal wrmsrl() here.
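
I.e. simply:

	u64 pa = __pa(&__get_cpu_var(apf_reason));

	wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
	__get_cpu_var(apf_reason).enabled = 1;

and let a #GP there oops the guest, which is what we want for a kvm bug.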

> +			return;
> +		__get_cpu_var(apf_reason).enabled = 1;
> +		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
> +		       smp_processor_id());
> +	}
> +}
> +
>
> +static int kvm_pv_reboot_notify(struct notifier_block *nb,
> +				unsigned long code, void *unused)
> +{
> +	if (code == SYS_RESTART)
> +		on_each_cpu(kvm_pv_disable_apf, NULL, 1);
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block kvm_pv_reboot_nb = {
> +	.notifier_call = kvm_pv_reboot_notify,
> +};

Does this handle kexec?

> +
> +static void kvm_guest_cpu_notify(void *dummy)
> +{
> +	if (!dummy)
> +		kvm_guest_cpu_init();
> +	else
> +		kvm_pv_disable_apf(NULL);
> +}

Why are you making decisions based on a dummy input?

The whole thing looks strange.  Use two functions?
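
E.g. (names made up):

	static void kvm_guest_cpu_online(void *dummy)
	{
		kvm_guest_cpu_init();
	}

	static void kvm_guest_cpu_offline(void *dummy)
	{
		kvm_pv_disable_apf(NULL);
	}

and have the cpu notifier pick the right one for CPU_ONLINE vs CPU_DEAD.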


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery.
  2010-10-04 15:56 ` [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery Gleb Natapov
  2010-10-07 12:42   ` Avi Kivity
@ 2010-10-07 12:58   ` Avi Kivity
  2010-10-07 17:59     ` Gleb Natapov
  1 sibling, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-07 12:58 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> +
> +	Physical address points to 32 bit memory location that will be written
> +	to by the hypervisor at the time of asynchronous page fault injection to
> +	indicate type of asynchronous page fault. Value of 1 means that the page
> +	referred to by the page fault is not present. Value 2 means that the
> +	page is now available.

"The must not enable interrupts before the reason is read, or it may be 
overwritten by another apf".
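
I.e. the guest side presumably has to look something like this (field and
helper names are guesses based on the rest of the series):

	u32 reason, token;	/* how the token is delivered (cr2?) is the question below */

	/* interrupts are still off, so the hypervisor cannot deliver another
	 * apf and overwrite the slot before we have consumed it */
	reason = __get_cpu_var(apf_reason).reason;
	__get_cpu_var(apf_reason).reason = 0;

	if (reason == 1)		/* page not present */
		kvm_async_pf_task_wait(token);
	else if (reason == 2)		/* page ready */
		kvm_async_pf_task_wake(token);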

Document the fact that disabling interrupts disables APFs.

How does the guest distinguish between APFs and ordinary page faults?

What's the role of cr2?

When disabling APF, all pending APFs are flushed and may or may not get 
a completion.

Is a "page available" notification guaranteed to arrive on the same vcpu 
that took the "page not present" fault?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-04 15:56 ` [PATCH v6 08/12] Handle async PF in a guest Gleb Natapov
@ 2010-10-07 13:10   ` Avi Kivity
  2010-10-07 17:14     ` Gleb Natapov
  2010-10-10 12:32     ` Gleb Natapov
  0 siblings, 2 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-07 13:10 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> When async PF capability is detected hook up special page fault handler
> that will handle async page fault events and bypass other page faults to
> regular page fault handler. Also add async PF handling to nested SVM
> emulation. Async PF always generates exit to L1 where vcpu thread will
> be scheduled out until page is available.
>

Please separate guest and host changes.

> +void kvm_async_pf_task_wait(u32 token)
> +{
> +	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> +	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
> +	struct kvm_task_sleep_node n, *e;
> +	DEFINE_WAIT(wait);
> +
> +	spin_lock(&b->lock);
> +	e = _find_apf_task(b, token);
> +	if (e) {
> +		/* dummy entry exist ->  wake up was delivered ahead of PF */
> +		hlist_del(&e->link);
> +		kfree(e);
> +		spin_unlock(&b->lock);
> +		return;
> +	}
> +
> +	n.token = token;
> +	n.cpu = smp_processor_id();
> +	init_waitqueue_head(&n.wq);
> +	hlist_add_head(&n.link,&b->list);
> +	spin_unlock(&b->lock);
> +
> +	for (;;) {
> +		prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
> +		if (hlist_unhashed(&n.link))
> +			break;
> +		local_irq_enable();

Suppose we take another apf here.  And another, and another (for 
different pages, while executing schedule()).  What's to prevent kernel 
stack overflow?

> +		schedule();
> +		local_irq_disable();
> +	}
> +	finish_wait(&n.wq,&wait);
> +
> +	return;
> +}
> +EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
> +
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-07  9:50   ` Avi Kivity
  2010-10-07  9:52     ` Avi Kivity
@ 2010-10-07 13:24     ` Rik van Riel
  2010-10-07 13:29       ` Avi Kivity
  2010-10-07 17:47     ` Gleb Natapov
  2 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2010-10-07 13:24 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra,
	tglx, hpa, cl, mtosatti

On 10/07/2010 05:50 AM, Avi Kivity wrote:

>> +static bool can_do_async_pf(struct kvm_vcpu *vcpu)
>> +{
>> + if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
>> + kvm_event_needs_reinjection(vcpu)))
>> + return false;
>> +
>> + return kvm_x86_ops->interrupt_allowed(vcpu);
>> +}
>
> Strictly speaking, if the cpu can handle NMIs it can take an apf?

Strictly speaking, yes.

However, it may not be able to DO anything with it, since
it won't be able to reschedule the context it's running :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-07 13:24     ` Rik van Riel
@ 2010-10-07 13:29       ` Avi Kivity
  0 siblings, 0 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-07 13:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra,
	tglx, hpa, cl, mtosatti

  On 10/07/2010 03:24 PM, Rik van Riel wrote:
> On 10/07/2010 05:50 AM, Avi Kivity wrote:
>
>>> +static bool can_do_async_pf(struct kvm_vcpu *vcpu)
>>> +{
>>> + if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
>>> + kvm_event_needs_reinjection(vcpu)))
>>> + return false;
>>> +
>>> + return kvm_x86_ops->interrupt_allowed(vcpu);
>>> +}
>>
>> Strictly speaking, if the cpu can handle NMIs it can take an apf?
>
> Strictly speaking, yes.
>
> However, it may not be able to DO anything with it, since
> it won't be able to reschedule the context it's running :)
>

I was thinking about keeping the watchdog nmi handler responsive.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 11/12] Let host know whether the guest can handle async PF in non-userspace context.
  2010-10-04 15:56 ` [PATCH v6 11/12] Let host know whether the guest can handle async PF in non-userspace context Gleb Natapov
@ 2010-10-07 13:36   ` Avi Kivity
  0 siblings, 0 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-07 13:36 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> If guest can detect that it runs in non-preemptable context it can
> handle async PFs at any time, so let host know that it can send async
> PF even if guest cpu is not in userspace.
>
>
>
>   MSR_KVM_ASYNC_PF_EN: 0x4b564d02
>   	data: Bits 63-6 hold 64-byte aligned physical address of a 32bit memory
> -	area which must be in guest RAM. Bits 5-1 are reserved and should be
> +	area which must be in guest RAM. Bits 5-2 are reserved and should be
>   	zero. Bit 0 is 1 when asynchronous page faults are enabled on the vcpu
> -	0 when disabled.
> +	0 when disabled. Bit 2 is 1 if asynchronous page faults can be injected
> +	when vcpu is in kernel mode.

Please use cpl instead of user mode and kernel mode.  The original terms 
are ambiguous for cpl == 1 || cpl == 2.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-07 10:00           ` Avi Kivity
@ 2010-10-07 15:42             ` Marcelo Tosatti
  2010-10-07 16:03               ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Marcelo Tosatti @ 2010-10-07 15:42 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gleb Natapov, Gleb Natapov, kvm, linux-mm, linux-kernel, mingo,
	a.p.zijlstra, tglx, hpa, riel, cl

On Thu, Oct 07, 2010 at 12:00:13PM +0200, Avi Kivity wrote:
>  On 10/06/2010 10:08 PM, Gleb Natapov wrote:
> >>  Malicious userspace can cause entry to be cached, ioctl
> >>  SET_USER_MEMORY_REGION 2^32 times, generation number will match,
> >>  mark_page_dirty_in_slot will be called with pointer to freed memory.
> >>
> >Hmm. To zap all cached entires on overflow we need to track them. If we
> >will track then we can zap them on each slot update and drop "generation"
> >entirely.
> 
> To track them you need locking.
> 
> Isn't SET_USER_MEMORY_REGION so slow that calling it 2^32 times
> isn't really feasible?

Assuming it takes 1ms, it would take 49 days.

> In any case, can use u64 generation count.

Agree.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-07 15:42             ` Marcelo Tosatti
@ 2010-10-07 16:03               ` Gleb Natapov
  2010-10-07 16:20                 ` Avi Kivity
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-07 16:03 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Avi Kivity, Gleb Natapov, kvm, linux-mm, linux-kernel, mingo,
	a.p.zijlstra, tglx, hpa, riel, cl

On Thu, Oct 07, 2010 at 12:42:48PM -0300, Marcelo Tosatti wrote:
> On Thu, Oct 07, 2010 at 12:00:13PM +0200, Avi Kivity wrote:
> >  On 10/06/2010 10:08 PM, Gleb Natapov wrote:
> > >>  Malicious userspace can cause entry to be cached, ioctl
> > >>  SET_USER_MEMORY_REGION 2^32 times, generation number will match,
> > >>  mark_page_dirty_in_slot will be called with pointer to freed memory.
> > >>
> > >Hmm. To zap all cached entires on overflow we need to track them. If we
> > >will track then we can zap them on each slot update and drop "generation"
> > >entirely.
> > 
> > To track them you need locking.
> > 
> > Isn't SET_USER_MEMORY_REGION so slow that calling it 2^32 times
> > isn't really feasible?
> 
> Assuming it takes 1ms, it would take 49 days.
> 
We may fail ioctl when max value is reached. The question is how much slot
changes can we expect from real guest during its lifetime.

> > In any case, can use u64 generation count.
> 
> Agree.
Yes, 64 bit ought to be enough for anybody.

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-07 16:03               ` Gleb Natapov
@ 2010-10-07 16:20                 ` Avi Kivity
  2010-10-07 17:23                   ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-07 16:20 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Marcelo Tosatti, Gleb Natapov, kvm, linux-mm, linux-kernel,
	mingo, a.p.zijlstra, tglx, hpa, riel, cl

  On 10/07/2010 06:03 PM, Gleb Natapov wrote:
> >  >
> >  >  Isn't SET_USER_MEMORY_REGION so slow that calling it 2^32 times
> >  >  isn't really feasible?
> >
> >  Assuming it takes 1ms, it would take 49 days.
> >
> We may fail ioctl when max value is reached. The question is how much slot
> changes can we expect from real guest during its lifetime.
>

A normal guest has a 30 Hz timer for reading the vga framebuffer, 
multiple slots.  Let's assume 100 Hz frequency, that gives 490 days 
until things stop working.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-07 13:10   ` Avi Kivity
@ 2010-10-07 17:14     ` Gleb Natapov
  2010-10-07 17:18       ` Avi Kivity
  2010-10-10 12:32     ` Gleb Natapov
  1 sibling, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-07 17:14 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Thu, Oct 07, 2010 at 03:10:27PM +0200, Avi Kivity wrote:
>  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >When async PF capability is detected hook up special page fault handler
> >that will handle async page fault events and bypass other page faults to
> >regular page fault handler. Also add async PF handling to nested SVM
> >emulation. Async PF always generates exit to L1 where vcpu thread will
> >be scheduled out until page is available.
> >
> 
> Please separate guest and host changes.
> 
> >+void kvm_async_pf_task_wait(u32 token)
> >+{
> >+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> >+	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
> >+	struct kvm_task_sleep_node n, *e;
> >+	DEFINE_WAIT(wait);
> >+
> >+	spin_lock(&b->lock);
> >+	e = _find_apf_task(b, token);
> >+	if (e) {
> >+		/* dummy entry exist ->  wake up was delivered ahead of PF */
> >+		hlist_del(&e->link);
> >+		kfree(e);
> >+		spin_unlock(&b->lock);
> >+		return;
> >+	}
> >+
> >+	n.token = token;
> >+	n.cpu = smp_processor_id();
> >+	init_waitqueue_head(&n.wq);
> >+	hlist_add_head(&n.link,&b->list);
> >+	spin_unlock(&b->lock);
> >+
> >+	for (;;) {
> >+		prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
> >+		if (hlist_unhashed(&n.link))
> >+			break;
> >+		local_irq_enable();
> 
> Suppose we take another apf here.  And another, and another (for
> different pages, while executing schedule()).  What's to prevent
> kernel stack overflow?
> 
Host side keeps track of outstanding apfs and will not send apf for the
same phys address twice. It will halt vcpu instead.
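
Roughly (simplified; the lookup helper name here is illustrative, the real
one lives next to the gfn hash added in patch 02):

	/* before setting up a new apf for this fault */
	if (kvm_find_async_pf_gfn(vcpu, gfn)) {
		/* an apf is already outstanding for this gfn: don't inject
		 * another not-present fault, do the synthetic halt instead */
		return false;
	}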

> >+		schedule();
> >+		local_irq_disable();
> >+	}
> >+	finish_wait(&n.wq,&wait);
> >+
> >+	return;
> >+}
> >+EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
> >+
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-07 17:14     ` Gleb Natapov
@ 2010-10-07 17:18       ` Avi Kivity
  2010-10-07 17:48         ` Rik van Riel
  2010-10-07 18:03         ` Gleb Natapov
  0 siblings, 2 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-07 17:18 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/07/2010 07:14 PM, Gleb Natapov wrote:
> On Thu, Oct 07, 2010 at 03:10:27PM +0200, Avi Kivity wrote:
> >   On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >  >When async PF capability is detected hook up special page fault handler
> >  >that will handle async page fault events and bypass other page faults to
> >  >regular page fault handler. Also add async PF handling to nested SVM
> >  >emulation. Async PF always generates exit to L1 where vcpu thread will
> >  >be scheduled out until page is available.
> >  >
> >
> >  Please separate guest and host changes.
> >
> >  >+void kvm_async_pf_task_wait(u32 token)
> >  >+{
> >  >+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> >  >+	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
> >  >+	struct kvm_task_sleep_node n, *e;
> >  >+	DEFINE_WAIT(wait);
> >  >+
> >  >+	spin_lock(&b->lock);
> >  >+	e = _find_apf_task(b, token);
> >  >+	if (e) {
> >  >+		/* dummy entry exist ->   wake up was delivered ahead of PF */
> >  >+		hlist_del(&e->link);
> >  >+		kfree(e);
> >  >+		spin_unlock(&b->lock);
> >  >+		return;
> >  >+	}
> >  >+
> >  >+	n.token = token;
> >  >+	n.cpu = smp_processor_id();
> >  >+	init_waitqueue_head(&n.wq);
> >  >+	hlist_add_head(&n.link,&b->list);
> >  >+	spin_unlock(&b->lock);
> >  >+
> >  >+	for (;;) {
> >  >+		prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
> >  >+		if (hlist_unhashed(&n.link))
> >  >+			break;
> >  >+		local_irq_enable();
> >
> >  Suppose we take another apf here.  And another, and another (for
> >  different pages, while executing schedule()).  What's to prevent
> >  kernel stack overflow?
> >
> Host side keeps track of outstanding apfs and will not send apf for the
> same phys address twice. It will halt vcpu instead.

What about different pages, running the scheduler code?

Oh, and we'll run the scheduler recursively.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 03/12] Retry fault before vmentry
  2010-10-07 12:29   ` Avi Kivity
@ 2010-10-07 17:21     ` Gleb Natapov
  2010-10-09 18:42       ` Avi Kivity
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-07 17:21 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Thu, Oct 07, 2010 at 02:29:07PM +0200, Avi Kivity wrote:
>  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >When page is swapped in it is mapped into guest memory only after guest
> >tries to access it again and generate another fault. To save this fault
> >we can map it immediately since we know that guest is going to access
> >the page. Do it only when tdp is enabled for now. Shadow paging case is
> >more complicated. CR[034] and EFER registers should be switched before
> >doing mapping and then switched back.
> 
> With non-pv apf, I don't think we can do shadow paging.  The guest
Yes, with non-pv this trick will not work without tdp. I haven't even
considered it for that case.

> isn't aware of the apf, so as far as it is concerned it is allowed
> to kill the process and replace it with something else:
> 
>   guest process x: apf
>   kvm: timer intr
>   guest kernel: context switch
>   very fast guest admin: pkill -9 x
>   guest kernel: destroy x's cr3
>   guest kernel: reuse x's cr3 for new process y
>   kvm: retry fault, instantiating x's page in y's page table
> 
> Even with tdp, we have the same case for nnpt (just
> s/kernel/hypervisor/ and s/process/guest/).  What we really need is
> to only instantiate the page for direct maps, which are independent
> of the guest.
> 
> Could be done like this:
> 
> - at apf time, walk shadow mmu
> - if !sp->role.direct, abort
> - take reference to sp
> - on apf completion, instantiate spte in sp
> 
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-07 16:20                 ` Avi Kivity
@ 2010-10-07 17:23                   ` Gleb Natapov
  2010-10-10 12:48                     ` Avi Kivity
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-07 17:23 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gleb Natapov, Marcelo Tosatti, kvm, linux-mm, linux-kernel,
	mingo, a.p.zijlstra, tglx, hpa, riel, cl

On Thu, Oct 07, 2010 at 06:20:53PM +0200, Avi Kivity wrote:
>  On 10/07/2010 06:03 PM, Gleb Natapov wrote:
> >>  >
> >>  >  Isn't SET_USER_MEMORY_REGION so slow that calling it 2^32 times
> >>  >  isn't really feasible?
> >>
> >>  Assuming it takes 1ms, it would take 49 days.
> >>
> >We may fail ioctl when max value is reached. The question is how much slot
> >changes can we expect from real guest during its lifetime.
> >
> 
> A normal guest has a 30 Hz timer for reading the vga framebuffer,
> multiple slots.  Let's assume 100 Hz frequency, that gives 490 days
> until things stop working.
> 
And reading the vga framebuffer needs slot changes because of dirty map
tracking?

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-07  9:50   ` Avi Kivity
  2010-10-07  9:52     ` Avi Kivity
  2010-10-07 13:24     ` Rik van Riel
@ 2010-10-07 17:47     ` Gleb Natapov
  2010-10-09 18:30       ` Avi Kivity
  2 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-07 17:47 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Thu, Oct 07, 2010 at 11:50:08AM +0200, Avi Kivity wrote:
>  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >If a guest accesses swapped out memory do not swap it in from vcpu thread
> >context. Schedule work to do swapping and put vcpu into halted state
> >instead.
> >
> >Interrupts will still be delivered to the guest and if interrupt will
> >cause reschedule guest will continue to run another task.
> >
> >
> >+
> >+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> >+{
> >+	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
> >+		     kvm_event_needs_reinjection(vcpu)))
> >+		return false;
> >+
> >+	return kvm_x86_ops->interrupt_allowed(vcpu);
> >+}
> 
> Strictly speaking, if the cpu can handle NMIs it can take an apf?
> 
We can always do apf, but if the vcpu can't do anything with it, why bother. For the NMI
watchdog, yes, maybe it is worth allowing apf if nmi is allowed.

> >@@ -5112,6 +5122,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  	if (unlikely(r))
> >  		goto out;
> >
> >+	kvm_check_async_pf_completion(vcpu);
> >+	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
> >+		/* Page is swapped out. Do synthetic halt */
> >+		r = 1;
> >+		goto out;
> >+	}
> >+
> 
> Why do it here in the fast path?  Can't you halt the cpu when
> starting the page fault?
Page fault may complete before guest re-entry. We do not want to halt vcpu
in this case.
> 
> I guess the apf threads can't touch mp_state, but they can have a
> KVM_REQ to trigger the check.
This will require a KVM_REQ check on the fast path, so what's the difference
performance-wise?

> 
> >  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> >  		inject_pending_event(vcpu);
> >
> >@@ -5781,6 +5798,9 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
> >
> >  	kvm_make_request(KVM_REQ_EVENT, vcpu);
> >
> >+	kvm_clear_async_pf_completion_queue(vcpu);
> >+	memset(vcpu->arch.apf.gfns, 0xff, sizeof vcpu->arch.apf.gfns);
> 
> An ordinary for loop is less tricky, even if it means one more line.
> 
> >
> >@@ -6040,6 +6064,7 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
> >  int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
> >  {
> >  	return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
> >+		|| !list_empty_careful(&vcpu->async_pf.done)
> >  		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
> >  		|| vcpu->arch.nmi_pending ||
> >  		(kvm_arch_interrupt_allowed(vcpu)&&
> 
> Unrelated, shouldn't kvm_arch_vcpu_runnable() look at
> vcpu->requests?  Specifically KVM_REQ_EVENT?
I think KVM_REQ_EVENT is covered by checking nmi and interrupt queue
here.

> 
> >+static void kvm_add_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> >+{
> >+	u32 key = kvm_async_pf_hash_fn(gfn);
> >+
> >+	while (vcpu->arch.apf.gfns[key] != -1)
> >+		key = kvm_async_pf_next_probe(key);
> 
> Not sure what that -1 converts to on i386 where gfn_t is u64.
Will check.

> >+
> >+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> >+				     struct kvm_async_pf *work)
> >+{
> >+	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> >+
> >+	if (work == kvm_double_apf)
> >+		trace_kvm_async_pf_doublefault(kvm_rip_read(vcpu));
> >+	else {
> >+		trace_kvm_async_pf_not_present(work->gva);
> >+
> >+		kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
> >+	}
> >+}
> 
> Just have vcpu as the argument for tracepoints to avoid
> unconditional kvm_rip_read (slow on Intel), and call kvm_rip_read()
> in tp_fast_assign().  Similarly you can pass work instead of
> work->gva, though that's not nearly as important.
> 
Will do.

> >+
> >+TRACE_EVENT(
> >+	kvm_async_pf_not_present,
> >+	TP_PROTO(u64 gva),
> >+	TP_ARGS(gva),
> 
> Do you actually have a gva with tdp?  With nested virtualization,
> how do you interpret this gva?
With tdp it is a gpa, just like tdp_page_fault gets a gpa whereas the shadow paging
version gets a gva. Nested virtualization is too complex to interpret.

> >+
> >+TRACE_EVENT(
> >+	kvm_async_pf_completed,
> >+	TP_PROTO(unsigned long address, struct page *page, u64 gva),
> >+	TP_ARGS(address, page, gva),
> 
> What does address mean?  There's also gva?
> 
hva.

> >+
> >+	TP_STRUCT__entry(
> >+		__field(unsigned long, address)
> >+		__field(struct page*, page)
> >+		__field(u64, gva)
> >+		),
> >+
> >+	TP_fast_assign(
> >+		__entry->address = address;
> >+		__entry->page = page;
> >+		__entry->gva = gva;
> >+		),
> 
> Recording a struct page * in a tracepoint?  Userspace can read this
> entry, better to the page_to_pfn() here.
>
OK.
 
> 
> >+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >+{
> >+	/* cancel outstanding work queue item */
> >+	while (!list_empty(&vcpu->async_pf.queue)) {
> >+		struct kvm_async_pf *work =
> >+			list_entry(vcpu->async_pf.queue.next,
> >+				   typeof(*work), queue);
> >+		cancel_work_sync(&work->work);
> >+		list_del(&work->queue);
> >+		if (!work->page) /* work was canceled */
> >+			kmem_cache_free(async_pf_cache, work);
> >+	}
> 
> Are you holding any lock here?
> 
> If not, what protects vcpu->async_pf.queue?
Nothing. It is accessed only from vcpu thread.

> If yes, cancel_work_sync() will need to aquire it too (in case work
> is running now and needs to take the lock, and cacncel_work_sync()
> needs to wait for it) -> deadlock.
> 
Work never touches this list.

> >+
> >+	/* do alloc nowait since if we are going to sleep anyway we
> >+	   may as well sleep faulting in page */
> /*
>  * multi
>  * line
>  * comment
>  */
> 
> (but a good one, this is subtle)
> 
> I missed where you halt the vcpu.  Can you point me at the function?
> 
> Note this is a synthetic halt and must not be visible to live
> migration, or we risk live migrating a halted state which doesn't
> really exist.
> 
> Might be simplest to drain the apf queue on any of the save/restore ioctls.
> 
So that "info cpu" will interfere with apf? Migration should work
in regular way. apf state should not be migrated since it has no meaning
on the destination. I'll make sure synthetic halt state will not
interfere with migration.

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-07  9:54         ` Avi Kivity
@ 2010-10-07 17:48           ` Gleb Natapov
  0 siblings, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-07 17:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Marcelo Tosatti, kvm, linux-mm, linux-kernel, mingo,
	a.p.zijlstra, tglx, hpa, riel, cl

On Thu, Oct 07, 2010 at 11:54:22AM +0200, Avi Kivity wrote:
>  On 10/06/2010 12:52 PM, Gleb Natapov wrote:
> >On Wed, Oct 06, 2010 at 12:50:01PM +0200, Avi Kivity wrote:
> >>   On 10/05/2010 04:59 PM, Marcelo Tosatti wrote:
> >>  >On Mon, Oct 04, 2010 at 05:56:24PM +0200, Gleb Natapov wrote:
> >>  >>   If a guest accesses swapped out memory do not swap it in from vcpu thread
> >>  >>   context. Schedule work to do swapping and put vcpu into halted state
> >>  >>   instead.
> >>  >>
> >>  >>   Interrupts will still be delivered to the guest and if interrupt will
> >>  >>   cause reschedule guest will continue to run another task.
> >>  >>
> >>  >>   Signed-off-by: Gleb Natapov<gleb@redhat.com>
> >>  >>   ---
> >>  >>    arch/x86/include/asm/kvm_host.h |   17 +++
> >>  >>    arch/x86/kvm/Kconfig            |    1 +
> >>  >>    arch/x86/kvm/Makefile           |    1 +
> >>  >>    arch/x86/kvm/mmu.c              |   51 +++++++++-
> >>  >>    arch/x86/kvm/paging_tmpl.h      |    4 +-
> >>  >>    arch/x86/kvm/x86.c              |  109 +++++++++++++++++++-
> >>  >>    include/linux/kvm_host.h        |   31 ++++++
> >>  >>    include/trace/events/kvm.h      |   88 ++++++++++++++++
> >>  >>    virt/kvm/Kconfig                |    3 +
> >>  >>    virt/kvm/async_pf.c             |  220 +++++++++++++++++++++++++++++++++++++++
> >>  >>    virt/kvm/async_pf.h             |   36 +++++++
> >>  >>    virt/kvm/kvm_main.c             |   57 ++++++++--
> >>  >>    12 files changed, 603 insertions(+), 15 deletions(-)
> >>  >>    create mode 100644 virt/kvm/async_pf.c
> >>  >>    create mode 100644 virt/kvm/async_pf.h
> >>  >>
> >>  >
> >>  >>   +	async_pf_cache = NULL;
> >>  >>   +}
> >>  >>   +
> >>  >>   +void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu)
> >>  >>   +{
> >>  >>   +	INIT_LIST_HEAD(&vcpu->async_pf.done);
> >>  >>   +	INIT_LIST_HEAD(&vcpu->async_pf.queue);
> >>  >>   +	spin_lock_init(&vcpu->async_pf.lock);
> >>  >>   +}
> >>  >>   +
> >>  >>   +static void async_pf_execute(struct work_struct *work)
> >>  >>   +{
> >>  >>   +	struct page *page;
> >>  >>   +	struct kvm_async_pf *apf =
> >>  >>   +		container_of(work, struct kvm_async_pf, work);
> >>  >>   +	struct mm_struct *mm = apf->mm;
> >>  >>   +	struct kvm_vcpu *vcpu = apf->vcpu;
> >>  >>   +	unsigned long addr = apf->addr;
> >>  >>   +	gva_t gva = apf->gva;
> >>  >>   +
> >>  >>   +	might_sleep();
> >>  >>   +
> >>  >>   +	use_mm(mm);
> >>  >>   +	down_read(&mm->mmap_sem);
> >>  >>   +	get_user_pages(current, mm, addr, 1, 1, 0,&page, NULL);
> >>  >>   +	up_read(&mm->mmap_sem);
> >>  >>   +	unuse_mm(mm);
> >>  >>   +
> >>  >>   +	spin_lock(&vcpu->async_pf.lock);
> >>  >>   +	list_add_tail(&apf->link,&vcpu->async_pf.done);
> >>  >>   +	apf->page = page;
> >>  >>   +	spin_unlock(&vcpu->async_pf.lock);
> >>  >
> >>  >This can fail, and apf->page become NULL.
> >>
> >>  Does it even become NULL?  On error, get_user_pages() won't update
> >>  the pages argument, so page becomes garbage here.
> >>
> >apf is allocated with kmem_cache_zalloc() and ->page is set to NULL in
> >kvm_setup_async_pf() to be extra sure.
> >
> 
> But you assign apf->page = page;, overriding it with garbage here.
> 
Ah, right you are.

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-07 17:18       ` Avi Kivity
@ 2010-10-07 17:48         ` Rik van Riel
  2010-10-07 18:03         ` Gleb Natapov
  1 sibling, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2010-10-07 17:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra,
	tglx, hpa, cl, mtosatti

On 10/07/2010 01:18 PM, Avi Kivity wrote:
> On 10/07/2010 07:14 PM, Gleb Natapov wrote:

>> Host side keeps track of outstanding apfs and will not send apf for the
>> same phys address twice. It will halt vcpu instead.
>
> What about different pages, running the scheduler code?
>
> Oh, and we'll run the scheduler recursively.

When preempt is disabled in the guest, it will not invoke
the "reschedule for apf" code, but it will simply turn
into a normal page fault.

Last I looked, the scheduler code disabled preempt (for
obvious reasons).

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery.
  2010-10-07 12:42   ` Avi Kivity
@ 2010-10-07 17:53     ` Gleb Natapov
  2010-10-10 12:47       ` Avi Kivity
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-07 17:53 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Thu, Oct 07, 2010 at 02:42:06PM +0200, Avi Kivity wrote:
>  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >Guest enables async PF vcpu functionality using this MSR.
> >
> >  			return NON_PRESENT;
> >+
> >+MSR_KVM_ASYNC_PF_EN: 0x4b564d02
> >+	data: Bits 63-6 hold 64-byte aligned physical address of a 32bit memory
> 
> Given that it must be aligned anyway, we can require it to be a
> 64-byte region and also require that the guest zero it before
> writing the MSR.  That will give us a little more flexibility in the
> future.
> 
No code change needed, so OK.

> >+	area which must be in guest RAM. Bits 5-1 are reserved and should be
> >+	zero. Bit 0 is 1 when asynchronous page faults are enabled on the vcpu
> >+	0 when disabled.
> >+
> >+	Physical address points to 32 bit memory location that will be written
> >+	to by the hypervisor at the time of asynchronous page fault injection to
> >+	indicate type of asynchronous page fault. Value of 1 means that the page
> >+	referred to by the page fault is not present. Value 2 means that the
> >+	page is now available.
> >diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >index b9f263e..de31551 100644
> >--- a/arch/x86/include/asm/kvm_host.h
> >+++ b/arch/x86/include/asm/kvm_host.h
> >@@ -417,6 +417,8 @@ struct kvm_vcpu_arch {
> >
> >  	struct {
> >  		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
> >+		struct gfn_to_hva_cache data;
> >+		u64 msr_val;
> >  	} apf;
> >  };
> >
> >diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> >index e3faaaf..8662ae0 100644
> >--- a/arch/x86/include/asm/kvm_para.h
> >+++ b/arch/x86/include/asm/kvm_para.h
> >@@ -20,6 +20,7 @@
> >   * are available. The use of 0x11 and 0x12 is deprecated
> >   */
> >  #define KVM_FEATURE_CLOCKSOURCE2        3
> >+#define KVM_FEATURE_ASYNC_PF		4
> >
> >  /* The last 8 bits are used to indicate how to interpret the flags field
> >   * in pvclock structure. If no bits are set, all flags are ignored.
> >@@ -32,9 +33,12 @@
> >  /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
> >  #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
> >  #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
> >+#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
> >
> >  #define KVM_MAX_MMU_OP_BATCH           32
> >
> >+#define KVM_ASYNC_PF_ENABLED			(1<<  0)
> >+
> >  /* Operations for KVM_HC_MMU_OP */
> >  #define KVM_MMU_OP_WRITE_PTE            1
> >  #define KVM_MMU_OP_FLUSH_TLB	        2
> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >index 48fd59d..3e123ab 100644
> >--- a/arch/x86/kvm/x86.c
> >+++ b/arch/x86/kvm/x86.c
> >@@ -782,12 +782,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
> >   * kvm-specific. Those are put in the beginning of the list.
> >   */
> >
> >-#define KVM_SAVE_MSRS_BEGIN	7
> >+#define KVM_SAVE_MSRS_BEGIN	8
> >  static u32 msrs_to_save[] = {
> >  	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
> >  	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
> >  	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
> >-	HV_X64_MSR_APIC_ASSIST_PAGE,
> >+	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
> >  	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
> >  	MSR_STAR,
> >  #ifdef CONFIG_X86_64
> >@@ -1425,6 +1425,29 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >  	return 0;
> >  }
> >
> >+static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
> >+{
> >+	gpa_t gpa = data&  ~0x3f;
> >+
> >+	/* Bits 1:5 are resrved, Should be zero */
> >+	if (data&  0x3e)
> >+		return 1;
> >+
> >+	vcpu->arch.apf.msr_val = data;
> >+
> >+	if (!(data&  KVM_ASYNC_PF_ENABLED)) {
> >+		kvm_clear_async_pf_completion_queue(vcpu);
> 
> May be a lengthy synchronous operation.  I guess we don't care.
> 
> >+		memset(vcpu->arch.apf.gfns, 0xff, sizeof vcpu->arch.apf.gfns);
> 
> That memset again.
> 
> >+		return 0;
> >+	}
> >+
> >+	if (kvm_gfn_to_hva_cache_init(vcpu->kvm,&vcpu->arch.apf.data, gpa))
> >+		return 1;
> 
> Note: we need to handle the memory being removed from underneath
> kvm_gfn_to_hva_cache().  Given that, we can just make
> kvm_gfn_to_hva_cache_init() return void.  "success" means nothing
> when future changes can invalidate it.
> 
I want to catch the guest doing stupid things. If the guest gives us a non-existent
address I want the wrmsr to #GP.

> >+
> >+	kvm_async_pf_wakeup_all(vcpu);
> 
> Why is this needed?  If all apfs are flushed at disable time, what
> do we need to wake up?
For migration. The destination will rewrite the msr and all processes will be
woken up.

> 
> Need to list the MSR for save/restore/reset.
> 
> 
This patch adds it to msrs_to_save, no?

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery.
  2010-10-07 12:58   ` Avi Kivity
@ 2010-10-07 17:59     ` Gleb Natapov
  2010-10-09 18:43       ` Avi Kivity
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-07 17:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Thu, Oct 07, 2010 at 02:58:26PM +0200, Avi Kivity wrote:
>  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >+
> >+	Physical address points to 32 bit memory location that will be written
> >+	to by the hypervisor at the time of asynchronous page fault injection to
> >+	indicate type of asynchronous page fault. Value of 1 means that the page
> >+	referred to by the page fault is not present. Value 2 means that the
> >+	page is now available.
> 
> "The must not enable interrupts before the reason is read, or it may
> be overwritten by another apf".
> 
> Document the fact that disabling interrupts disables APFs.
> 
> How does the guest distinguish between APFs and ordinary page faults?
> 
> What's the role of cr2?
> 
> When disabling APF, all pending APFs are flushed and may or may not
> get a completion.
> 
> Is a "page available" notification guaranteed to arrive on the same
> vcpu that took the "page not present" fault?
> 
You mean documentation is lacking? :)

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-07 17:18       ` Avi Kivity
  2010-10-07 17:48         ` Rik van Riel
@ 2010-10-07 18:03         ` Gleb Natapov
  2010-10-09 18:48           ` Avi Kivity
  1 sibling, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-07 18:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Thu, Oct 07, 2010 at 07:18:03PM +0200, Avi Kivity wrote:
>  On 10/07/2010 07:14 PM, Gleb Natapov wrote:
> >On Thu, Oct 07, 2010 at 03:10:27PM +0200, Avi Kivity wrote:
> >>   On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >>  >When async PF capability is detected hook up special page fault handler
> >>  >that will handle async page fault events and bypass other page faults to
> >>  >regular page fault handler. Also add async PF handling to nested SVM
> >>  >emulation. Async PF always generates exit to L1 where vcpu thread will
> >>  >be scheduled out until page is available.
> >>  >
> >>
> >>  Please separate guest and host changes.
> >>
> >>  >+void kvm_async_pf_task_wait(u32 token)
> >>  >+{
> >>  >+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> >>  >+	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
> >>  >+	struct kvm_task_sleep_node n, *e;
> >>  >+	DEFINE_WAIT(wait);
> >>  >+
> >>  >+	spin_lock(&b->lock);
> >>  >+	e = _find_apf_task(b, token);
> >>  >+	if (e) {
> >>  >+		/* dummy entry exist ->   wake up was delivered ahead of PF */
> >>  >+		hlist_del(&e->link);
> >>  >+		kfree(e);
> >>  >+		spin_unlock(&b->lock);
> >>  >+		return;
> >>  >+	}
> >>  >+
> >>  >+	n.token = token;
> >>  >+	n.cpu = smp_processor_id();
> >>  >+	init_waitqueue_head(&n.wq);
> >>  >+	hlist_add_head(&n.link,&b->list);
> >>  >+	spin_unlock(&b->lock);
> >>  >+
> >>  >+	for (;;) {
> >>  >+		prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
> >>  >+		if (hlist_unhashed(&n.link))
> >>  >+			break;
> >>  >+		local_irq_enable();
> >>
> >>  Suppose we take another apf here.  And another, and another (for
> >>  different pages, while executing schedule()).  What's to prevent
> >>  kernel stack overflow?
> >>
> >Host side keeps track of outstanding apfs and will not send apf for the
> >same phys address twice. It will halt vcpu instead.
> 
> What about different pages, running the scheduler code?
> 
We can get a couple of nested apfs, just like we can get nested
interrupts. Since the scheduler disables preemption, the second apf will halt.

> Oh, and we'll run the scheduler recursively.
> 
As Rik said, the scheduler disables preemption. And this is actually the first
thing it does. Otherwise any interrupt may cause a recursive scheduler
invocation.
 
--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 03/12] Retry fault before vmentry
  2010-10-06 14:20       ` Marcelo Tosatti
@ 2010-10-07 18:44         ` Gleb Natapov
  2010-10-08 16:07           ` Marcelo Tosatti
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-07 18:44 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Wed, Oct 06, 2010 at 11:20:50AM -0300, Marcelo Tosatti wrote:
> On Wed, Oct 06, 2010 at 01:07:04PM +0200, Gleb Natapov wrote:
> > > Can't you set a bit in vcpu->requests instead, and handle it in "out:"
> > > at the end of vcpu_enter_guest? 
> > > 
> > > To have a single entry point for pagefaults, after vmexit handling.
> > Jumping to "out:" will skip vmexit handling anyway, so we will not reuse
> > same call site anyway. I don't see yet why the way you propose will have
> > an advantage.
> 
> What i meant was to call pagefault handler after vmexit handling.
> 
> Because the way it is in your patch now, with pre pagefault on entry,
> one has to make an effort to verify ordering wrt other events on entry
> processing.
> 
What events do you have in mind?

> With pre pagefault after vmexit, its more natural.
> 
I do not see a non-ugly way to pass the information needed to perform
the prefault to the place you want me to put it. We can skip guest entry
when a prefault was done, which will have the same effect as your
proposal, but I want a good reason to do so, since otherwise we will
just do more work for nothing on guest entry.

> Does that make sense?

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 07/12] Add async PF initialization to PV guest.
  2010-10-07 12:50   ` Avi Kivity
@ 2010-10-08  7:54     ` Gleb Natapov
  2010-10-09 18:44       ` Avi Kivity
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-08  7:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Thu, Oct 07, 2010 at 02:50:49PM +0200, Avi Kivity wrote:
>  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >Enable async PF in a guest if async PF capability is discovered.
> >
> >
> >+void __cpuinit kvm_guest_cpu_init(void)
> >+{
> >+	if (!kvm_para_available())
> >+		return;
> >+
> >+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)&&  kvmapf) {
> >+		u64 pa = __pa(&__get_cpu_var(apf_reason));
> >+
> >+		if (native_write_msr_safe(MSR_KVM_ASYNC_PF_EN,
> >+					  pa | KVM_ASYNC_PF_ENABLED, pa>>  32))
> 
> native_ versions of processor accessors shouldn't be used generally.
> 
> Also, the MSR isn't documented to fail on valid input, so you can
> use a normal wrmsrl() here.
> 
The kernel will oops on a wrong write then. OK, why not.
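
So the hunk would become roughly this (just a sketch of the suggested
change, reusing the constants from this patch; not the respun code itself):

	void __cpuinit kvm_guest_cpu_init(void)
	{
		if (!kvm_para_available())
			return;

		if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
			u64 pa = __pa(&__get_cpu_var(apf_reason));

			/* documented not to fault on valid input, so no _safe */
			wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
			__get_cpu_var(apf_reason).enabled = 1;
			printk(KERN_INFO "KVM setup async PF for cpu %d\n",
			       smp_processor_id());
		}
	}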

> >+			return;
> >+		__get_cpu_var(apf_reason).enabled = 1;
> >+		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
> >+		       smp_processor_id());
> >+	}
> >+}
> >+
> >
> >+static int kvm_pv_reboot_notify(struct notifier_block *nb,
> >+				unsigned long code, void *unused)
> >+{
> >+	if (code == SYS_RESTART)
> >+		on_each_cpu(kvm_pv_disable_apf, NULL, 1);
> >+	return NOTIFY_DONE;
> >+}
> >+
> >+static struct notifier_block kvm_pv_reboot_nb = {
> >+	.notifier_call = kvm_pv_reboot_notify,
> >+};
> 
> Does this handle kexec?
> 
Yes.

> >+
> >+static void kvm_guest_cpu_notify(void *dummy)
> >+{
> >+	if (!dummy)
> >+		kvm_guest_cpu_init();
> >+	else
> >+		kvm_pv_disable_apf(NULL);
> >+}
> 
> Why are you making decisions based on a dummy input?
> 
> The whole thing looks strange.  Use two functions?
> 
What is so strange? The type of notification is passed as a parameter.
The code that does this is right below the function. I can rename
dummy to something else, or make it two functions.

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 03/12] Retry fault before vmentry
  2010-10-07 18:44         ` Gleb Natapov
@ 2010-10-08 16:07           ` Marcelo Tosatti
  0 siblings, 0 replies; 88+ messages in thread
From: Marcelo Tosatti @ 2010-10-08 16:07 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Thu, Oct 07, 2010 at 08:44:57PM +0200, Gleb Natapov wrote:
> On Wed, Oct 06, 2010 at 11:20:50AM -0300, Marcelo Tosatti wrote:
> > On Wed, Oct 06, 2010 at 01:07:04PM +0200, Gleb Natapov wrote:
> > > > Can't you set a bit in vcpu->requests instead, and handle it in "out:"
> > > > at the end of vcpu_enter_guest? 
> > > > 
> > > > To have a single entry point for pagefaults, after vmexit handling.
> > > Jumping to "out:" will skip vmexit handling anyway, so we will not reuse
> > > same call site anyway. I don't see yet why the way you propose will have
> > > an advantage.
> > 
> > What i meant was to call pagefault handler after vmexit handling.
> > 
> > Because the way it is in your patch now, with pre pagefault on entry,
> > one has to make an effort to verify ordering wrt other events on entry
> > processing.
> > 
> What events do you have in mind?

TLB flushing, event injection, etc.

> > With pre pagefault after vmexit, its more natural.
> > 
> I do not see non-ugly way to pass information that is needed to perform
> the prefault to the place you want me to put it. We can skip guest entry
> in case prefault was done which will have the same effect as your
> proposal, but I want to have a good reason to do so since otherwise we
> will just do more work for nothing on guest entry.

The reason is that it becomes similar to normal pagefault handling. I
don't have a specific bug to give you as an example.

> 
> > Does that make sense?
> 
> --
> 			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-07 17:47     ` Gleb Natapov
@ 2010-10-09 18:30       ` Avi Kivity
  2010-10-09 18:32         ` Avi Kivity
  2010-10-10  7:29         ` Gleb Natapov
  0 siblings, 2 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-09 18:30 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/07/2010 07:47 PM, Gleb Natapov wrote:
> On Thu, Oct 07, 2010 at 11:50:08AM +0200, Avi Kivity wrote:
> >   On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >  >If a guest accesses swapped out memory do not swap it in from vcpu thread
> >  >context. Schedule work to do swapping and put vcpu into halted state
> >  >instead.
> >  >
> >  >Interrupts will still be delivered to the guest and if interrupt will
> >  >cause reschedule guest will continue to run another task.
> >  >
> >  >
> >  >+
> >  >+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> >  >+{
> >  >+	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
> >  >+		     kvm_event_needs_reinjection(vcpu)))
> >  >+		return false;
> >  >+
> >  >+	return kvm_x86_ops->interrupt_allowed(vcpu);
> >  >+}
> >
> >  Strictly speaking, if the cpu can handle NMIs it can take an apf?
> >
> We can always do apf, but if vcpu can't do anything hwy bother. For NMI
> watchdog yes, may be it is worth to allow apf if nmi is allowed.

Actually it's very dangerous - the IRET from APF will re-enable NMIs.  
So without the guest enabling apf-in-nmi we shouldn't allow it.

Not worth the complexity IMO.

> >  >@@ -5112,6 +5122,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  >   	if (unlikely(r))
> >  >   		goto out;
> >  >
> >  >+	kvm_check_async_pf_completion(vcpu);
> >  >+	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
> >  >+		/* Page is swapped out. Do synthetic halt */
> >  >+		r = 1;
> >  >+		goto out;
> >  >+	}
> >  >+
> >
> >  Why do it here in the fast path?  Can't you halt the cpu when
> >  starting the page fault?
> Page fault may complete before guest re-entry. We do not want to halt vcpu
> in this case.

So unhalt on completion.

> >
> >  I guess the apf threads can't touch mp_state, but they can have a
> >  KVM_REQ to trigger the check.
> This will require KVM_REQ check on fast path, so what's the difference
> performance wise.

We already have a KVM_REQ check (if (vcpu->requests)) so it doesn't cost 
anything extra.

> >  >
> >  >@@ -6040,6 +6064,7 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
> >  >   int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
> >  >   {
> >  >   	return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
> >  >+		|| !list_empty_careful(&vcpu->async_pf.done)
> >  >   		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
> >  >   		|| vcpu->arch.nmi_pending ||
> >  >   		(kvm_arch_interrupt_allowed(vcpu)&&
> >
> >  Unrelated, shouldn't kvm_arch_vcpu_runnable() look at
> >  vcpu->requests?  Specifically KVM_REQ_EVENT?
> I think KVM_REQ_EVENT is covered by checking nmi and interrupt queue
> here.

No, the nmi and interrupt queues are only filled when the lapic is 
polled via KVM_REQ_EVENT.  I'll prepare a patch.

> >  >+
> >  >+TRACE_EVENT(
> >  >+	kvm_async_pf_not_present,
> >  >+	TP_PROTO(u64 gva),
> >  >+	TP_ARGS(gva),
> >
> >  Do you actually have a gva with tdp?  With nested virtualization,
> >  how do you interpret this gva?
> With tdp it is gpa just like tdp_page_fault gets gpa where shadow page
> version gets gva. Nested virtualization is too complex to interpret.

It's not good to have a tracepoint that depends on cpu mode (without 
recording that mode). I think we have the same issue in 
trace_kvm_page_fault though.

> >  >+
> >  >+TRACE_EVENT(
> >  >+	kvm_async_pf_completed,
> >  >+	TP_PROTO(unsigned long address, struct page *page, u64 gva),
> >  >+	TP_ARGS(address, page, gva),
> >
> >  What does address mean?  There's also gva?
> >
> hva.

Is hva helpful here?  Generally gpa is better, but may not be available 
since it's ambiguous.

>
> >
> >  >+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >  >+{
> >  >+	/* cancel outstanding work queue item */
> >  >+	while (!list_empty(&vcpu->async_pf.queue)) {
> >  >+		struct kvm_async_pf *work =
> >  >+			list_entry(vcpu->async_pf.queue.next,
> >  >+				   typeof(*work), queue);
> >  >+		cancel_work_sync(&work->work);
> >  >+		list_del(&work->queue);
> >  >+		if (!work->page) /* work was canceled */
> >  >+			kmem_cache_free(async_pf_cache, work);
> >  >+	}
> >
> >  Are you holding any lock here?
> >
> >  If not, what protects vcpu->async_pf.queue?
> Nothing. It is accessed only from vcpu thread.
>
> >  If yes, cancel_work_sync() will need to aquire it too (in case work
> >  is running now and needs to take the lock, and cacncel_work_sync()
> >  needs to wait for it) ->  deadlock.
> >
> Work never touches this list.

So, an apf is always in ->queue and when completed also in ->done?

Is it not cleaner to list_move the apf from ->queue to ->done?  saves a 
->link.

Can be done later.

> >  >+
> >  >+	/* do alloc nowait since if we are going to sleep anyway we
> >  >+	   may as well sleep faulting in page */
> >  /*
> >   * multi
> >   * line
> >   * comment
> >   */
> >
> >  (but a good one, this is subtle)
> >
> >  I missed where you halt the vcpu.  Can you point me at the function?
> >
> >  Note this is a synthetic halt and must not be visible to live
> >  migration, or we risk live migrating a halted state which doesn't
> >  really exist.
> >
> >  Might be simplest to drain the apf queue on any of the save/restore ioctls.
> >
> So that "info cpu" will interfere with apf? Migration should work
> in regular way. apf state should not be migrated since it has no meaning
> on the destination. I'll make sure synthetic halt state will not
> interfere with migration.

If you deliver an apf, the guest expects a completion.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-09 18:30       ` Avi Kivity
@ 2010-10-09 18:32         ` Avi Kivity
  2010-10-10  7:30           ` Gleb Natapov
  2010-10-10  7:29         ` Gleb Natapov
  1 sibling, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-09 18:32 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/09/2010 08:30 PM, Avi Kivity wrote:
>> So that "info cpu" will interfere with apf? Migration should work
>> in regular way. apf state should not be migrated since it has no meaning
>> on the destination. I'll make sure synthetic halt state will not
>> interfere with migration.
>
>
> If you deliver an apf, the guest expects a completion.
>

btw, the token generation scheme resets as well.  Draining the queue 
fixes that too.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 03/12] Retry fault before vmentry
  2010-10-07 17:21     ` Gleb Natapov
@ 2010-10-09 18:42       ` Avi Kivity
  2010-10-10  7:35         ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-09 18:42 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/07/2010 07:21 PM, Gleb Natapov wrote:
> On Thu, Oct 07, 2010 at 02:29:07PM +0200, Avi Kivity wrote:
> >   On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >  >When page is swapped in it is mapped into guest memory only after guest
> >  >tries to access it again and generate another fault. To save this fault
> >  >we can map it immediately since we know that guest is going to access
> >  >the page. Do it only when tdp is enabled for now. Shadow paging case is
> >  >more complicated. CR[034] and EFER registers should be switched before
> >  >doing mapping and then switched back.
> >
> >  With non-pv apf, I don't think we can do shadow paging.  The guest
> Yes, with non-pv this trick will not work without tdp. I haven't even
> considered it for that case.
>

What about nnpt?  The same issues exist.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery.
  2010-10-07 17:59     ` Gleb Natapov
@ 2010-10-09 18:43       ` Avi Kivity
  0 siblings, 0 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-09 18:43 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/07/2010 07:59 PM, Gleb Natapov wrote:
> On Thu, Oct 07, 2010 at 02:58:26PM +0200, Avi Kivity wrote:
> >   On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >  >+
> >  >+	Physical address points to 32 bit memory location that will be written
> >  >+	to by the hypervisor at the time of asynchronous page fault injection to
> >  >+	indicate type of asynchronous page fault. Value of 1 means that the page
> >  >+	referred to by the page fault is not present. Value 2 means that the
> >  >+	page is now available.
> >
> >  "The must not enable interrupts before the reason is read, or it may
> >  be overwritten by another apf".
> >
> >  Document the fact that disabling interrupts disables APFs.
> >
> >  How does the guest distinguish betweem APFs and ordinary page faults?
> >
> >  What's the role of cr2?
> >
> >  When disabling APF, all pending APFs are flushed and may or may not
> >  get a completion.
> >
> >  Is a "page available" notification guaranteed to arrive on the same
> >  vcpu that took the "page not present" fault?
> >
> You mean documentation is lacking? :)
>

I mean you should be able to write guest support code without reading 
the host code, just the documentation.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 07/12] Add async PF initialization to PV guest.
  2010-10-08  7:54     ` Gleb Natapov
@ 2010-10-09 18:44       ` Avi Kivity
  0 siblings, 0 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-09 18:44 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/08/2010 09:54 AM, Gleb Natapov wrote:
> >  >+
> >  >+static void kvm_guest_cpu_notify(void *dummy)
> >  >+{
> >  >+	if (!dummy)
> >  >+		kvm_guest_cpu_init();
> >  >+	else
> >  >+		kvm_pv_disable_apf(NULL);
> >  >+}
> >
> >  Why are you making decisions based on a dummy input?
> >
> >  The whole thing looks strange.  Use two functions?
> >
> What is so strange? Type of notification is passed as a parameter.
> The code that does this is just under the function. I can rename
> dummy to something else. Or make it two functions.

Two separate functions is simplest.
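
Something like this, say (names made up here just to illustrate the split;
use whatever fits the rest of the file):

	static void kvm_guest_cpu_online(void *dummy)
	{
		kvm_guest_cpu_init();
	}

	static void kvm_guest_cpu_offline(void *dummy)
	{
		kvm_pv_disable_apf(NULL);
	}

and then the callers pick the right one instead of passing a flag through
the dummy argument.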

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-07 18:03         ` Gleb Natapov
@ 2010-10-09 18:48           ` Avi Kivity
  2010-10-10  7:56             ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-09 18:48 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/07/2010 08:03 PM, Gleb Natapov wrote:
> >  >>
> >  >Host side keeps track of outstanding apfs and will not send apf for the
> >  >same phys address twice. It will halt vcpu instead.
> >
> >  What about different pages, running the scheduler code?
> >
> We can get couple of nested apfs, just like we can get nested
> interrupts. Since scheduler disables preemption second apf will halt.

How much is a couple?

Consider:

SIGSTOP
Entire process swapped out
SIGCONT

We can get APF's on the current code, the scheduler code, the stack, any 
debugging code in between (e.g. ftrace), and the page tables for all of 
these.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-09 18:30       ` Avi Kivity
  2010-10-09 18:32         ` Avi Kivity
@ 2010-10-10  7:29         ` Gleb Natapov
  2010-10-10 15:55           ` Avi Kivity
  1 sibling, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-10  7:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Sat, Oct 09, 2010 at 08:30:18PM +0200, Avi Kivity wrote:
>  On 10/07/2010 07:47 PM, Gleb Natapov wrote:
> >On Thu, Oct 07, 2010 at 11:50:08AM +0200, Avi Kivity wrote:
> >>   On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >>  >If a guest accesses swapped out memory do not swap it in from vcpu thread
> >>  >context. Schedule work to do swapping and put vcpu into halted state
> >>  >instead.
> >>  >
> >>  >Interrupts will still be delivered to the guest and if interrupt will
> >>  >cause reschedule guest will continue to run another task.
> >>  >
> >>  >
> >>  >+
> >>  >+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> >>  >+{
> >>  >+	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
> >>  >+		     kvm_event_needs_reinjection(vcpu)))
> >>  >+		return false;
> >>  >+
> >>  >+	return kvm_x86_ops->interrupt_allowed(vcpu);
> >>  >+}
> >>
> >>  Strictly speaking, if the cpu can handle NMIs it can take an apf?
> >>
> >We can always do apf, but if vcpu can't do anything hwy bother. For NMI
> >watchdog yes, may be it is worth to allow apf if nmi is allowed.
> 
> Actually it's very dangerous - the IRET from APF will re-enable
> NMIs.  So without the guest enabling apf-in-nmi we shouldn't allow
> it.
> 
Good point.

> Not worth the complexity IMO.
> 
> >>  >@@ -5112,6 +5122,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >>  >   	if (unlikely(r))
> >>  >   		goto out;
> >>  >
> >>  >+	kvm_check_async_pf_completion(vcpu);
> >>  >+	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
> >>  >+		/* Page is swapped out. Do synthetic halt */
> >>  >+		r = 1;
> >>  >+		goto out;
> >>  >+	}
> >>  >+
> >>
> >>  Why do it here in the fast path?  Can't you halt the cpu when
> >>  starting the page fault?
> >Page fault may complete before guest re-entry. We do not want to halt vcpu
> >in this case.
> 
> So unhalt on completion.
> 
I want to avoid touching vcpu state from work if possible. The work code
does not contain arch-dependent code right now, and mp_state is an x86 thing.

> >>
> >>  I guess the apf threads can't touch mp_state, but they can have a
> >>  KVM_REQ to trigger the check.
> >This will require KVM_REQ check on fast path, so what's the difference
> >performance wise.
> 
> We already have a KVM_REQ check (if (vcpu->requests)) so it doesn't
> cost anything extra.
if (vcpu->requests) does not clear the req bit, so what will have to be
added is: if (kvm_check_request(KVM_REQ_APF_HLT, vcpu)), which is even more
expensive than my check (but not expensive enough to worry about).
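
I.e. the fast path would grow something like this (sketch only;
KVM_REQ_APF_HLT would be a new request bit that does not exist yet):

	if (kvm_check_request(KVM_REQ_APF_HLT, vcpu)) {
		/* page is swapped out, do the synthetic halt */
		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
		r = 1;
		goto out;
	}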

> 
> >>  >
> >>  >@@ -6040,6 +6064,7 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
> >>  >   int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
> >>  >   {
> >>  >   	return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
> >>  >+		|| !list_empty_careful(&vcpu->async_pf.done)
> >>  >   		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
> >>  >   		|| vcpu->arch.nmi_pending ||
> >>  >   		(kvm_arch_interrupt_allowed(vcpu)&&
> >>
> >>  Unrelated, shouldn't kvm_arch_vcpu_runnable() look at
> >>  vcpu->requests?  Specifically KVM_REQ_EVENT?
> >I think KVM_REQ_EVENT is covered by checking nmi and interrupt queue
> >here.
> 
> No, the nmi and interrupt queues are only filled when the lapic is
> polled via KVM_REQ_EVENT.  I'll prepare a patch.
I don't think you are correct. nmi_pending is filled before setting
KVM_REQ_EVENT and kvm_cpu_has_interrupt() checks directly in apic/pic.

> 
> >>  >+
> >>  >+TRACE_EVENT(
> >>  >+	kvm_async_pf_not_present,
> >>  >+	TP_PROTO(u64 gva),
> >>  >+	TP_ARGS(gva),
> >>
> >>  Do you actually have a gva with tdp?  With nested virtualization,
> >>  how do you interpret this gva?
> >With tdp it is gpa just like tdp_page_fault gets gpa where shadow page
> >version gets gva. Nested virtualization is too complex to interpret.
> 
> It's not good to have a tracepoint that depends on cpu mode (without
> recording that mode). I think we have the same issue in
> trace_kvm_page_fault though.
We have mmu_is_nested(). I'll just disable apf while vcpu is in nested
mode for now.

> 
> >>  >+
> >>  >+TRACE_EVENT(
> >>  >+	kvm_async_pf_completed,
> >>  >+	TP_PROTO(unsigned long address, struct page *page, u64 gva),
> >>  >+	TP_ARGS(address, page, gva),
> >>
> >>  What does address mean?  There's also gva?
> >>
> >hva.
> 
> Is hva helpful here?  Generally gpa is better, but may not be
> available since it's ambiguous.
> 
> >
> >>
> >>  >+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >>  >+{
> >>  >+	/* cancel outstanding work queue item */
> >>  >+	while (!list_empty(&vcpu->async_pf.queue)) {
> >>  >+		struct kvm_async_pf *work =
> >>  >+			list_entry(vcpu->async_pf.queue.next,
> >>  >+				   typeof(*work), queue);
> >>  >+		cancel_work_sync(&work->work);
> >>  >+		list_del(&work->queue);
> >>  >+		if (!work->page) /* work was canceled */
> >>  >+			kmem_cache_free(async_pf_cache, work);
> >>  >+	}
> >>
> >>  Are you holding any lock here?
> >>
> >>  If not, what protects vcpu->async_pf.queue?
> >Nothing. It is accessed only from vcpu thread.
> >
> >>  If yes, cancel_work_sync() will need to aquire it too (in case work
> >>  is running now and needs to take the lock, and cacncel_work_sync()
> >>  needs to wait for it) ->  deadlock.
> >>
> >Work never touches this list.
> 
> So, an apf is always in ->queue and when completed also in ->done?
> 
> Is it not cleaner to list_move the apf from ->queue to ->done?
> saves a ->link.
Then you have more complicated locking issues.

> 
> Can be done later.
> 
> >>  >+
> >>  >+	/* do alloc nowait since if we are going to sleep anyway we
> >>  >+	   may as well sleep faulting in page */
> >>  /*
> >>   * multi
> >>   * line
> >>   * comment
> >>   */
> >>
> >>  (but a good one, this is subtle)
> >>
> >>  I missed where you halt the vcpu.  Can you point me at the function?
> >>
> >>  Note this is a synthetic halt and must not be visible to live
> >>  migration, or we risk live migrating a halted state which doesn't
> >>  really exist.
> >>
> >>  Might be simplest to drain the apf queue on any of the save/restore ioctls.
> >>
> >So that "info cpu" will interfere with apf? Migration should work
> >in regular way. apf state should not be migrated since it has no meaning
> >on the destination. I'll make sure synthetic halt state will not
> >interfere with migration.
> 
> If you deliver an apf, the guest expects a completion.
> 
There is a special completion that tells the guest to wake all sleeping
tasks on the vcpu. It is delivered after migration on the destination.

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-09 18:32         ` Avi Kivity
@ 2010-10-10  7:30           ` Gleb Natapov
  0 siblings, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-10  7:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Sat, Oct 09, 2010 at 08:32:14PM +0200, Avi Kivity wrote:
>  On 10/09/2010 08:30 PM, Avi Kivity wrote:
> >>So that "info cpu" will interfere with apf? Migration should work
> >>in regular way. apf state should not be migrated since it has no meaning
> >>on the destination. I'll make sure synthetic halt state will not
> >>interfere with migration.
> >
> >
> >If you deliver an apf, the guest expects a completion.
> >
> 
> btw, the token generation scheme resets as well.  Draining the queue
> fixes that as well.
> 
I don't see what's there to fix. Can you explain what problem you see in
the way the current code works?

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 03/12] Retry fault before vmentry
  2010-10-09 18:42       ` Avi Kivity
@ 2010-10-10  7:35         ` Gleb Natapov
  0 siblings, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-10  7:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Sat, Oct 09, 2010 at 08:42:00PM +0200, Avi Kivity wrote:
>  On 10/07/2010 07:21 PM, Gleb Natapov wrote:
> >On Thu, Oct 07, 2010 at 02:29:07PM +0200, Avi Kivity wrote:
> >>   On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >>  >When page is swapped in it is mapped into guest memory only after guest
> >>  >tries to access it again and generate another fault. To save this fault
> >>  >we can map it immediately since we know that guest is going to access
> >>  >the page. Do it only when tdp is enabled for now. Shadow paging case is
> >>  >more complicated. CR[034] and EFER registers should be switched before
> >>  >doing mapping and then switched back.
> >>
> >>  With non-pv apf, I don't think we can do shadow paging.  The guest
> >Yes, with non-pv this trick will not work without tdp. I haven't even
> >considered it for that case.
> >
> 
> What about nnpt?  The same issues exist.
> 
I am not sure how nnpt works. What is the problem there? In the case of
tdp, prefault instantiates the page in the direct_map; how does nnpt
interfere with that?

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-09 18:48           ` Avi Kivity
@ 2010-10-10  7:56             ` Gleb Natapov
  2010-10-10 12:40               ` Avi Kivity
  0 siblings, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-10  7:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Sat, Oct 09, 2010 at 08:48:15PM +0200, Avi Kivity wrote:
>  On 10/07/2010 08:03 PM, Gleb Natapov wrote:
> >>  >>
> >>  >Host side keeps track of outstanding apfs and will not send apf for the
> >>  >same phys address twice. It will halt vcpu instead.
> >>
> >>  What about different pages, running the scheduler code?
> >>
> >We can get couple of nested apfs, just like we can get nested
> >interrupts. Since scheduler disables preemption second apf will halt.
> 
> How much is a couple?
> 
> Consider:
> 
> SIGSTOP
> Entire process swapped out
> SIGCONT
> 
> We can get APF's on the current code, the scheduler code, the stack,
> any debugging code in between (e.g. ftrace), and the page tables for
> all of these.
> 
Let's count them all. Suppose the guest is in userspace process code and
guest memory is completely swapped out. The guest starts to run and faults
in userspace. An apf is queued but can't be delivered due to faults on the
idt and the exception stack; all of those will be taken synchronously due
to the pending-event check. After the apf is delivered, any fault in the
apf code will be taken synchronously since interrupts are disabled. Just
before calling schedule() interrupts are enabled, so the next pf that
happens during the call to schedule() will be taken asynchronously, which
will cause another call to schedule(), at which point the vcpu will be
halted since two apfs happened at the same address. So I counted two of them.

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-07 13:10   ` Avi Kivity
  2010-10-07 17:14     ` Gleb Natapov
@ 2010-10-10 12:32     ` Gleb Natapov
  2010-10-10 12:38       ` Avi Kivity
  1 sibling, 1 reply; 88+ messages in thread
From: Gleb Natapov @ 2010-10-10 12:32 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Thu, Oct 07, 2010 at 03:10:27PM +0200, Avi Kivity wrote:
>  On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >When async PF capability is detected hook up special page fault handler
> >that will handle async page fault events and bypass other page faults to
> >regular page fault handler. Also add async PF handling to nested SVM
> >emulation. Async PF always generates exit to L1 where vcpu thread will
> >be scheduled out until page is available.
> >
> 
> Please separate guest and host changes.
> 
Hmm. There are only guest changes here as far as I can see.

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-10 12:32     ` Gleb Natapov
@ 2010-10-10 12:38       ` Avi Kivity
  2010-10-10 13:22         ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-10 12:38 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/10/2010 02:32 PM, Gleb Natapov wrote:
> On Thu, Oct 07, 2010 at 03:10:27PM +0200, Avi Kivity wrote:
> >   On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >  >When async PF capability is detected hook up special page fault handler
> >  >that will handle async page fault events and bypass other page faults to
> >  >regular page fault handler. Also add async PF handling to nested SVM
> >  >emulation. Async PF always generates exit to L1 where vcpu thread will
> >  >be scheduled out until page is available.
> >  >
> >
> >  Please separate guest and host changes.
> >
> Hmm. There are only guest changes here as far as I can see.

 From the diffstat:

>   arch/x86/include/asm/kvm_para.h |   12 +++
>   arch/x86/include/asm/traps.h    |    1 +
>   arch/x86/kernel/entry_32.S      |   10 ++
>   arch/x86/kernel/entry_64.S      |    3 +
>   arch/x86/kernel/kvm.c           |  184 ++++++++++++++++++++++++++++++++++++++-
>   arch/x86/kvm/svm.c              |   43 +++++++--

svm.c is host code.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-10  7:56             ` Gleb Natapov
@ 2010-10-10 12:40               ` Avi Kivity
  0 siblings, 0 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-10 12:40 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/10/2010 09:56 AM, Gleb Natapov wrote:
> On Sat, Oct 09, 2010 at 08:48:15PM +0200, Avi Kivity wrote:
> >   On 10/07/2010 08:03 PM, Gleb Natapov wrote:
> >  >>   >>
> >  >>   >Host side keeps track of outstanding apfs and will not send apf for the
> >  >>   >same phys address twice. It will halt vcpu instead.
> >  >>
> >  >>   What about different pages, running the scheduler code?
> >  >>
> >  >We can get couple of nested apfs, just like we can get nested
> >  >interrupts. Since scheduler disables preemption second apf will halt.
> >
> >  How much is a couple?
> >
> >  Consider:
> >
> >  SIGSTOP
> >  Entire process swapped out
> >  SIGCONT
> >
> >  We can get APF's on the current code, the scheduler code, the stack,
> >  any debugging code in between (e.g. ftrace), and the page tables for
> >  all of these.
> >
> Lets count them all. Suppose guest is in a userspace process code and
> guest memory is completely swapped out. Guest starts to run and faults
> in userspace. Apf is queued but can't be delivered due to faults in
> idt and exception stack. All of them will be taken synchronously due
> to event pending check. After apf is delivered any fault in apf code
> will be takes synchronously since interrupt are disabled. Just before
> calling schedule() interrupts are enabled, so next pf that will happen
> during call to schedule() will be taken asynchronously. Which will cause
> another call to schedule() at which point vcpu will be halted since two
> apfs happened at the same address. So I counted two of them.
>

Ok.  Feels weird, but I guess this is fine.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery.
  2010-10-07 17:53     ` Gleb Natapov
@ 2010-10-10 12:47       ` Avi Kivity
  2010-10-10 13:27         ` Gleb Natapov
  0 siblings, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-10 12:47 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/07/2010 07:53 PM, Gleb Natapov wrote:
> On Thu, Oct 07, 2010 at 02:42:06PM +0200, Avi Kivity wrote:
> >   On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >  >Guest enables async PF vcpu functionality using this MSR.
> >  >
> >  >   			return NON_PRESENT;
> >  >+
> >  >+MSR_KVM_ASYNC_PF_EN: 0x4b564d02
> >  >+	data: Bits 63-6 hold 64-byte aligned physical address of a 32bit memory
> >
> >  Given that it must be aligned anyway, we can require it to be a
> >  64-byte region and also require that the guest zero it before
> >  writing the MSR.  That will give us a little more flexibility in the
> >  future.
> >
> No code change needed, so OK.

The guest needs to allocate a 64-byte per-cpu entry instead of a 4-byte 
entry.
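
For example (sketch only; struct and field names here are illustrative,
only the 64-byte size and alignment of the MSR-visible part matter):

	struct kvm_vcpu_pv_apf_data {
		u32 reason;	/* written by the host */
		u8 pad[60];	/* rest of the 64-byte area, zeroed by the guest */
		u32 enabled;	/* guest-side bookkeeping, as in the patch */
	} __attribute__((aligned(64)));

	static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason);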

> >
> >  >+		return 0;
> >  >+	}
> >  >+
> >  >+	if (kvm_gfn_to_hva_cache_init(vcpu->kvm,&vcpu->arch.apf.data, gpa))
> >  >+		return 1;
> >
> >  Note: we need to handle the memory being removed from underneath
> >  kvm_gfn_to_hve_cache().  Given that, we can just make
> >  kvm_gfn_to_hva_cache_init() return void.  "success" means nothing
> >  when future changes can invalidate it.
> >
> I want to catch guest doing stupid things. If guest give us non-existent
> address I want wrmsr to #GP.

Ok.

> >  >+
> >  >+	kvm_async_pf_wakeup_all(vcpu);
> >
> >  Why is this needed?  If all apfs are flushed at disable time, what
> >  do we need to wake up?
> For migration. Destination will rewrite msr and all processes will be
> waked up.

Ok. What happens to apf completions that happen after all vcpus are stopped?

> >
> >  Need to list the MSR for save/restore/reset.
> >
> >
> This patch adds it to msrs_to_save, no?

Yes, missed it somehow.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-07 17:23                   ` Gleb Natapov
@ 2010-10-10 12:48                     ` Avi Kivity
  0 siblings, 0 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-10 12:48 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Gleb Natapov, Marcelo Tosatti, kvm, linux-mm, linux-kernel,
	mingo, a.p.zijlstra, tglx, hpa, riel, cl

  On 10/07/2010 07:23 PM, Gleb Natapov wrote:
> On Thu, Oct 07, 2010 at 06:20:53PM +0200, Avi Kivity wrote:
> >   On 10/07/2010 06:03 PM, Gleb Natapov wrote:
> >  >>   >
> >  >>   >   Isn't SET_USER_MEMORY_REGION so slow that calling it 2^32 times
> >  >>   >   isn't really feasible?
> >  >>
> >  >>   Assuming it takes 1ms, it would take 49 days.
> >  >>
> >  >We may fail ioctl when max value is reached. The question is how much slot
> >  >changes can we expect from real guest during its lifetime.
> >  >
> >
> >  A normal guest has a 30 Hz timer for reading the vga framebuffer,
> >  multiple slots.  Let's assume 100 Hz frequency, that gives 490 days
> >  until things stop working.
> >
> And reading vga framebuffer needs slots changes because of dirty map
> tracking?

Yes.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 08/12] Handle async PF in a guest.
  2010-10-10 12:38       ` Avi Kivity
@ 2010-10-10 13:22         ` Gleb Natapov
  0 siblings, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-10 13:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Sun, Oct 10, 2010 at 02:38:58PM +0200, Avi Kivity wrote:
>  On 10/10/2010 02:32 PM, Gleb Natapov wrote:
> >On Thu, Oct 07, 2010 at 03:10:27PM +0200, Avi Kivity wrote:
> >>   On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >>  >When async PF capability is detected hook up special page fault handler
> >>  >that will handle async page fault events and bypass other page faults to
> >>  >regular page fault handler. Also add async PF handling to nested SVM
> >>  >emulation. Async PF always generates exit to L1 where vcpu thread will
> >>  >be scheduled out until page is available.
> >>  >
> >>
> >>  Please separate guest and host changes.
> >>
> >Hmm. There are only guest changes here as far as I can see.
> 
> From the diffstat:
> 
> >  arch/x86/include/asm/kvm_para.h |   12 +++
> >  arch/x86/include/asm/traps.h    |    1 +
> >  arch/x86/kernel/entry_32.S      |   10 ++
> >  arch/x86/kernel/entry_64.S      |    3 +
> >  arch/x86/kernel/kvm.c           |  184 ++++++++++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/svm.c              |   43 +++++++--
> 
> svm.c is host code.
> 
Not exactly :) It is host code from the nested guest's perspective, but
guest code from the L0 perspective.

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery.
  2010-10-10 12:47       ` Avi Kivity
@ 2010-10-10 13:27         ` Gleb Natapov
  0 siblings, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-10 13:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Sun, Oct 10, 2010 at 02:47:43PM +0200, Avi Kivity wrote:
>  On 10/07/2010 07:53 PM, Gleb Natapov wrote:
> >On Thu, Oct 07, 2010 at 02:42:06PM +0200, Avi Kivity wrote:
> >>   On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >>  >Guest enables async PF vcpu functionality using this MSR.
> >>  >
> >>  >   			return NON_PRESENT;
> >>  >+
> >>  >+MSR_KVM_ASYNC_PF_EN: 0x4b564d02
> >>  >+	data: Bits 63-6 hold 64-byte aligned physical address of a 32bit memory
> >>
> >>  Given that it must be aligned anyway, we can require it to be a
> >>  64-byte region and also require that the guest zero it before
> >>  writing the MSR.  That will give us a little more flexibility in the
> >>  future.
> >>
> >No code change needed, so OK.
> 
> The guest needs to allocate a 64-byte per-cpu entry instead of a
> 4-byte entry.
> 
Yes, noticed that already :(

> 
> >>  >+
> >>  >+	kvm_async_pf_wakeup_all(vcpu);
> >>
> >>  Why is this needed?  If all apfs are flushed at disable time, what
> >>  do we need to wake up?
> >For migration. Destination will rewrite msr and all processes will be
> >waked up.
> 
> Ok. What happens to apf completions that happen after all vcpus are stopped?
> 
They will be cleaned by kvm_clear_async_pf_completion_queue() on vcpu
destroy.

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 10/12] Handle async PF in non preemptable context
  2010-10-06 10:41     ` Gleb Natapov
@ 2010-10-10 14:25       ` Gleb Natapov
  0 siblings, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-10 14:25 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Wed, Oct 06, 2010 at 12:41:32PM +0200, Gleb Natapov wrote:
> On Tue, Oct 05, 2010 at 04:51:50PM -0300, Marcelo Tosatti wrote:
> > On Mon, Oct 04, 2010 at 05:56:32PM +0200, Gleb Natapov wrote:
> > > If async page fault is received by idle task or when preemp_count is
> > > not zero guest cannot reschedule, so do sti; hlt and wait for page to be
> > > ready. vcpu can still process interrupts while it waits for the page to
> > > be ready.
> > > 
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > ---
> > >  arch/x86/kernel/kvm.c |   40 ++++++++++++++++++++++++++++++++++------
> > >  1 files changed, 34 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index 36fb3e4..f73946f 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> > > @@ -37,6 +37,7 @@
> > >  #include <asm/cpu.h>
> > >  #include <asm/traps.h>
> > >  #include <asm/desc.h>
> > > +#include <asm/tlbflush.h>
> > >  
> > >  #define MMU_QUEUE_SIZE 1024
> > >  
> > > @@ -78,6 +79,8 @@ struct kvm_task_sleep_node {
> > >  	wait_queue_head_t wq;
> > >  	u32 token;
> > >  	int cpu;
> > > +	bool halted;
> > > +	struct mm_struct *mm;
> > >  };
> > >  
> > >  static struct kvm_task_sleep_head {
> > > @@ -106,6 +109,11 @@ void kvm_async_pf_task_wait(u32 token)
> > >  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> > >  	struct kvm_task_sleep_node n, *e;
> > >  	DEFINE_WAIT(wait);
> > > +	int cpu, idle;
> > > +
> > > +	cpu = get_cpu();
> > > +	idle = idle_cpu(cpu);
> > > +	put_cpu();
> > >  
> > >  	spin_lock(&b->lock);
> > >  	e = _find_apf_task(b, token);
> > > @@ -119,19 +127,33 @@ void kvm_async_pf_task_wait(u32 token)
> > >  
> > >  	n.token = token;
> > >  	n.cpu = smp_processor_id();
> > > +	n.mm = current->active_mm;
> > > +	n.halted = idle || preempt_count() > 1;
> > > +	atomic_inc(&n.mm->mm_count);
> > 
> > Can't see why this reference is needed.
> I thought that if kernel thread does fault on behalf of some
> process mm can go away while kernel thread is sleeping. But it looks
> like kernel thread increase reference to mm it runs with by himself, so
> may be this is redundant (but not harmful).
> 
Actually it is not redundant. A kernel thread will release its reference
to active_mm on reschedule.
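
The dual of that is that the wake-up path has to drop the reference,
roughly like this (sketch only, the exact placement may differ in the
patch):

	static void apf_task_wake_one(struct kvm_task_sleep_node *n)
	{
		hlist_del_init(&n->link);
		if (n->mm)
			mmdrop(n->mm);	/* pairs with atomic_inc(&mm->mm_count) */
		if (waitqueue_active(&n->wq))
			wake_up(&n->wq);
	}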

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-10  7:29         ` Gleb Natapov
@ 2010-10-10 15:55           ` Avi Kivity
  2010-10-10 15:56             ` Avi Kivity
  2010-10-10 16:16             ` Gleb Natapov
  0 siblings, 2 replies; 88+ messages in thread
From: Avi Kivity @ 2010-10-10 15:55 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/10/2010 09:29 AM, Gleb Natapov wrote:
> On Sat, Oct 09, 2010 at 08:30:18PM +0200, Avi Kivity wrote:
> >   On 10/07/2010 07:47 PM, Gleb Natapov wrote:
> >  >On Thu, Oct 07, 2010 at 11:50:08AM +0200, Avi Kivity wrote:
> >  >>    On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >  >>   >If a guest accesses swapped out memory do not swap it in from vcpu thread
> >  >>   >context. Schedule work to do swapping and put vcpu into halted state
> >  >>   >instead.
> >  >>   >
> >  >>   >Interrupts will still be delivered to the guest and if interrupt will
> >  >>   >cause reschedule guest will continue to run another task.
> >  >>   >
> >  >>   >
> >  >>   >+
> >  >>   >+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> >  >>   >+{
> >  >>   >+	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
> >  >>   >+		     kvm_event_needs_reinjection(vcpu)))
> >  >>   >+		return false;
> >  >>   >+
> >  >>   >+	return kvm_x86_ops->interrupt_allowed(vcpu);
> >  >>   >+}
> >  >>
> >  >>   Strictly speaking, if the cpu can handle NMIs it can take an apf?
> >  >>
> >  >We can always do apf, but if vcpu can't do anything hwy bother. For NMI
> >  >watchdog yes, may be it is worth to allow apf if nmi is allowed.
> >
> >  Actually it's very dangerous - the IRET from APF will re-enable
> >  NMIs.  So without the guest enabling apf-in-nmi we shouldn't allow
> >  it.
> >
> Good point.
>
> >  Not worth the complexity IMO.
> >
> >  >>   >@@ -5112,6 +5122,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  >>   >    	if (unlikely(r))
> >  >>   >    		goto out;
> >  >>   >
> >  >>   >+	kvm_check_async_pf_completion(vcpu);
> >  >>   >+	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
> >  >>   >+		/* Page is swapped out. Do synthetic halt */
> >  >>   >+		r = 1;
> >  >>   >+		goto out;
> >  >>   >+	}
> >  >>   >+
> >  >>
> >  >>   Why do it here in the fast path?  Can't you halt the cpu when
> >  >>   starting the page fault?
> >  >Page fault may complete before guest re-entry. We do not want to halt vcpu
> >  >in this case.
> >
> >  So unhalt on completion.
> >
> I want to avoid touching vcpu state from work if possible. Work code does
> not contain arch dependent code right now and mp_state is x86 thing
>

Use a KVM_REQ.


> >  >>
> >  >>   I guess the apf threads can't touch mp_state, but they can have a
> >  >>   KVM_REQ to trigger the check.
> >  >This will require KVM_REQ check on fast path, so what's the difference
> >  >performance wise.
> >
> >  We already have a KVM_REQ check (if (vcpu->requests)) so it doesn't
> >  cost anything extra.
> if (vcpu->requests) does not clear req bit, so what will have to be added
> is: if (kvm_check_request(KVM_REQ_APF_HLT, vcpu)) which is even more
> expensive then my check (but not so expensive to worry about).

It's only expensive when it happens.  Most entries will have the bit clear.

> >
> >  >>   >
> >  >>   >@@ -6040,6 +6064,7 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
> >  >>   >    int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
> >  >>   >    {
> >  >>   >    	return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
> >  >>   >+		|| !list_empty_careful(&vcpu->async_pf.done)
> >  >>   >    		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
> >  >>   >    		|| vcpu->arch.nmi_pending ||
> >  >>   >    		(kvm_arch_interrupt_allowed(vcpu)&&
> >  >>
> >  >>   Unrelated, shouldn't kvm_arch_vcpu_runnable() look at
> >  >>   vcpu->requests?  Specifically KVM_REQ_EVENT?
> >  >I think KVM_REQ_EVENT is covered by checking nmi and interrupt queue
> >  >here.
> >
> >  No, the nmi and interrupt queues are only filled when the lapic is
> >  polled via KVM_REQ_EVENT.  I'll prepare a patch.
> I don't think you are correct. nmi_pending is filled before setting
> KVM_REQ_EVENT and kvm_cpu_has_interrupt() checks directly in apic/pic.

Right.

> >
> >  >>   >+
> >  >>   >+TRACE_EVENT(
> >  >>   >+	kvm_async_pf_not_present,
> >  >>   >+	TP_PROTO(u64 gva),
> >  >>   >+	TP_ARGS(gva),
> >  >>
> >  >>   Do you actually have a gva with tdp?  With nested virtualization,
> >  >>   how do you interpret this gva?
> >  >With tdp it is gpa just like tdp_page_fault gets gpa where shadow page
> >  >version gets gva. Nested virtualization is too complex to interpret.
> >
> >  It's not good to have a tracepoint that depends on cpu mode (without
> >  recording that mode). I think we have the same issue in
> >  trace_kvm_page_fault though.
> We have mmu_is_nested(). I'll just disable apf while vcpu is in nested
> mode for now.

What if we get the apf in non-nested mode and it completes in nested mode?

> >
> >  >>   >+
> >  >>   >+	/* do alloc nowait since if we are going to sleep anyway we
> >  >>   >+	   may as well sleep faulting in page */
> >  >>   /*
> >  >>    * multi
> >  >>    * line
> >  >>    * comment
> >  >>    */
> >  >>
> >  >>   (but a good one, this is subtle)
> >  >>
> >  >>   I missed where you halt the vcpu.  Can you point me at the function?
> >  >>
> >  >>   Note this is a synthetic halt and must not be visible to live
> >  >>   migration, or we risk live migrating a halted state which doesn't
> >  >>   really exist.
> >  >>
> >  >>   Might be simplest to drain the apf queue on any of the save/restore ioctls.
> >  >>
> >  >So that "info cpu" will interfere with apf? Migration should work
> >  >in regular way. apf state should not be migrated since it has no meaning
> >  >on the destination. I'll make sure synthetic halt state will not
> >  >interfere with migration.
> >
> >  If you deliver an apf, the guest expects a completion.
> >
> There is special completion that tells guest to wake all sleeping tasks
> on vcpu. It is delivered after migration on the destination.
>

Yes, I saw.

What if you can't deliver it?  Is it possible that some other vcpu will 
start receiving apfs that alias the old ones?  Or is the broadcast global?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-10 15:55           ` Avi Kivity
@ 2010-10-10 15:56             ` Avi Kivity
  2010-10-10 16:17               ` Gleb Natapov
  2010-10-10 16:16             ` Gleb Natapov
  1 sibling, 1 reply; 88+ messages in thread
From: Avi Kivity @ 2010-10-10 15:56 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/10/2010 05:55 PM, Avi Kivity wrote:
>> There is special completion that tells guest to wake all sleeping tasks
>> on vcpu. It is delivered after migration on the destination.
>>
> >
>
> Yes, I saw.
>
> What if you can't deliver it?  is it possible that some other vcpu 
> will start receiving apfs that alias the old ones?  Or is the 
> broadcast global?
>

And, is the broadcast used only for migrations?


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-10 15:55           ` Avi Kivity
  2010-10-10 15:56             ` Avi Kivity
@ 2010-10-10 16:16             ` Gleb Natapov
  1 sibling, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-10 16:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Sun, Oct 10, 2010 at 05:55:25PM +0200, Avi Kivity wrote:
> >>
> >>  >>   >@@ -5112,6 +5122,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >>  >>   >    	if (unlikely(r))
> >>  >>   >    		goto out;
> >>  >>   >
> >>  >>   >+	kvm_check_async_pf_completion(vcpu);
> >>  >>   >+	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
> >>  >>   >+		/* Page is swapped out. Do synthetic halt */
> >>  >>   >+		r = 1;
> >>  >>   >+		goto out;
> >>  >>   >+	}
> >>  >>   >+
> >>  >>
> >>  >>   Why do it here in the fast path?  Can't you halt the cpu when
> >>  >>   starting the page fault?
> >>  >Page fault may complete before guest re-entry. We do not want to halt vcpu
> >>  >in this case.
> >>
> >>  So unhalt on completion.
> >>
> >I want to avoid touching vcpu state from work if possible. Work code does
> >not contain arch dependent code right now and mp_state is x86 thing
> >
> 
> Use a KVM_REQ.
> 
Completion happens asynchronously. The CPU may not even be halted at that
point. Actually, completion does unhalt the vcpu: it puts the completed
work onto the vcpu->async_pf.done list and wakes the vcpu thread if it is
sleeping. The next invocation of kvm_arch_vcpu_runnable() will return true
since vcpu->async_pf.done is not empty, and the vcpu will be unhalted in
the usual way by kvm_vcpu_block().
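
To spell the completion side out, the work item does roughly this
(simplified sketch, field names approximate, not the exact patch code):

	static void async_pf_execute(struct work_struct *work)
	{
		struct kvm_async_pf *apf =
			container_of(work, struct kvm_async_pf, work);
		struct kvm_vcpu *vcpu = apf->vcpu;

		/* ... fault the page in (get_user_pages) ... */

		spin_lock(&vcpu->async_pf.lock);
		list_add_tail(&apf->link, &vcpu->async_pf.done);
		spin_unlock(&vcpu->async_pf.lock);

		/* kick the vcpu thread out of kvm_vcpu_block() if it sleeps */
		wake_up_interruptible(&vcpu->wq);
	}

So the unhalt itself always happens on the vcpu thread, never from the work.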

> 
> >>  >>
> >>  >>   I guess the apf threads can't touch mp_state, but they can have a
> >>  >>   KVM_REQ to trigger the check.
> >>  >This will require KVM_REQ check on fast path, so what's the difference
> >>  >performance wise.
> >>
> >>  We already have a KVM_REQ check (if (vcpu->requests)) so it doesn't
> >>  cost anything extra.
> >if (vcpu->requests) does not clear req bit, so what will have to be added
> >is: if (kvm_check_request(KVM_REQ_APF_HLT, vcpu)) which is even more
> >expensive then my check (but not so expensive to worry about).
> 
> It's only expensive when it happens.  Most entries will have the bit clear.
kvm_check_async_pf_completion() (the one that detects whether the vcpu
should be halted) is called after vcpu->requests processing. This is done
in order to delay completion checking as far as possible, in the hope of
getting the completion before the next vcpu entry and skipping the apf, so
I do it at the last possible moment before event injection.

> >>
> >>  >>   >+
> >>  >>   >+TRACE_EVENT(
> >>  >>   >+	kvm_async_pf_not_present,
> >>  >>   >+	TP_PROTO(u64 gva),
> >>  >>   >+	TP_ARGS(gva),
> >>  >>
> >>  >>   Do you actually have a gva with tdp?  With nested virtualization,
> >>  >>   how do you interpret this gva?
> >>  >With tdp it is gpa just like tdp_page_fault gets gpa where shadow page
> >>  >version gets gva. Nested virtualization is too complex to interpret.
> >>
> >>  It's not good to have a tracepoint that depends on cpu mode (without
> >>  recording that mode). I think we have the same issue in
> >>  trace_kvm_page_fault though.
> >We have mmu_is_nested(). I'll just disable apf while vcpu is in nested
> >mode for now.
> 
> What if we get the apf in non-nested mode and it completes in nested mode?
> 
I am not yet sure we have any problem with nested mode at all. I am
looking at it. If we do, we can skip the prefault when nested.

> >>
> >>  >>   >+
> >>  >>   >+	/* do alloc nowait since if we are going to sleep anyway we
> >>  >>   >+	   may as well sleep faulting in page */
> >>  >>   /*
> >>  >>    * multi
> >>  >>    * line
> >>  >>    * comment
> >>  >>    */
> >>  >>
> >>  >>   (but a good one, this is subtle)
> >>  >>
> >>  >>   I missed where you halt the vcpu.  Can you point me at the function?
> >>  >>
> >>  >>   Note this is a synthetic halt and must not be visible to live
> >>  >>   migration, or we risk live migrating a halted state which doesn't
> >>  >>   really exist.
> >>  >>
> >>  >>   Might be simplest to drain the apf queue on any of the save/restore ioctls.
> >>  >>
> >>  >So that "info cpu" will interfere with apf? Migration should work
> >>  >in regular way. apf state should not be migrated since it has no meaning
> >>  >on the destination. I'll make sure synthetic halt state will not
> >>  >interfere with migration.
> >>
> >>  If you deliver an apf, the guest expects a completion.
> >>
> >There is special completion that tells guest to wake all sleeping tasks
> >on vcpu. It is delivered after migration on the destination.
> >
> 
> Yes, I saw.
> 
> What if you can't deliver it?  is it possible that some other vcpu
How can this happen? If I can't deliver it I can't deliver
non-broadcast apfs either.

> will start receiving apfs that alias the old ones?  Or is the
> broadcast global?
> 
The broadcast is not global, but tokens are unique per cpu, so another
vcpu will not be able to receive apfs that alias the old ones (if I
understand what you mean correctly). 

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-10 15:56             ` Avi Kivity
@ 2010-10-10 16:17               ` Gleb Natapov
  0 siblings, 0 replies; 88+ messages in thread
From: Gleb Natapov @ 2010-10-10 16:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Sun, Oct 10, 2010 at 05:56:31PM +0200, Avi Kivity wrote:
>  On 10/10/2010 05:55 PM, Avi Kivity wrote:
> >>There is special completion that tells guest to wake all sleeping tasks
> >>on vcpu. It is delivered after migration on the destination.
> >>
> >>
> >
> >Yes, I saw.
> >
> >What if you can't deliver it?  is it possible that some other vcpu
> >will start receiving apfs that alias the old ones?  Or is the
> >broadcast global?
> >
> 
> And, is the broadcast used only for migrations?
> 
Any time apf is enabled on a vcpu, a broadcast is sent to that vcpu. The guest
should be careful here and not write to the apf MSR without a reason.
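
(To make that concrete, a guest would typically enable apf exactly once per
vcpu during cpu bringup, roughly as sketched below. The MSR index, the enable
bit, the alignment and the per-cpu variable are illustrative assumptions, not
the series' actual definitions.)

#include <linux/percpu.h>
#include <asm/msr.h>
#include <asm/page.h>

/* All names and values below are illustrative assumptions. */
#define MSR_KVM_ASYNC_PF_EN	0x4b564d02
#define KVM_ASYNC_PF_ENABLED	(1 << 0)

static DEFINE_PER_CPU(u64, apf_reason) __aligned(64);

static void kvm_apf_cpu_init(void)
{
	/*
	 * Enable async PF once during vcpu bringup.  Every write to the
	 * MSR makes the host send the "wake all" broadcast, so the guest
	 * must not rewrite the MSR gratuitously.
	 */
	wrmsrl(MSR_KVM_ASYNC_PF_EN,
	       __pa(&__get_cpu_var(apf_reason)) | KVM_ASYNC_PF_ENABLED);
}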

--
			Gleb.

^ permalink raw reply	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2010-10-10 16:18 UTC | newest]

Thread overview: 88+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-04 15:56 [PATCH v6 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
2010-10-04 15:56 ` [PATCH v6 01/12] Add get_user_pages() variant that fails if major fault is required Gleb Natapov
2010-10-04 15:56 ` [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out Gleb Natapov
2010-10-05  1:20   ` Rik van Riel
2010-10-05 14:59   ` Marcelo Tosatti
2010-10-06 10:50     ` Avi Kivity
2010-10-06 10:52       ` Gleb Natapov
2010-10-07  9:54         ` Avi Kivity
2010-10-07 17:48           ` Gleb Natapov
2010-10-06 11:15     ` Gleb Natapov
2010-10-07  9:50   ` Avi Kivity
2010-10-07  9:52     ` Avi Kivity
2010-10-07 13:24     ` Rik van Riel
2010-10-07 13:29       ` Avi Kivity
2010-10-07 17:47     ` Gleb Natapov
2010-10-09 18:30       ` Avi Kivity
2010-10-09 18:32         ` Avi Kivity
2010-10-10  7:30           ` Gleb Natapov
2010-10-10  7:29         ` Gleb Natapov
2010-10-10 15:55           ` Avi Kivity
2010-10-10 15:56             ` Avi Kivity
2010-10-10 16:17               ` Gleb Natapov
2010-10-10 16:16             ` Gleb Natapov
2010-10-04 15:56 ` [PATCH v6 03/12] Retry fault before vmentry Gleb Natapov
2010-10-05 15:54   ` Marcelo Tosatti
2010-10-06 11:07     ` Gleb Natapov
2010-10-06 14:20       ` Marcelo Tosatti
2010-10-07 18:44         ` Gleb Natapov
2010-10-08 16:07           ` Marcelo Tosatti
2010-10-07 12:29   ` Avi Kivity
2010-10-07 17:21     ` Gleb Natapov
2010-10-09 18:42       ` Avi Kivity
2010-10-10  7:35         ` Gleb Natapov
2010-10-04 15:56 ` [PATCH v6 04/12] Add memory slot versioning and use it to provide fast guest write interface Gleb Natapov
2010-10-05  1:29   ` Rik van Riel
2010-10-05 16:57   ` Marcelo Tosatti
2010-10-06 11:14     ` Gleb Natapov
2010-10-06 14:38       ` Marcelo Tosatti
2010-10-06 20:08         ` Gleb Natapov
2010-10-07 10:00           ` Avi Kivity
2010-10-07 15:42             ` Marcelo Tosatti
2010-10-07 16:03               ` Gleb Natapov
2010-10-07 16:20                 ` Avi Kivity
2010-10-07 17:23                   ` Gleb Natapov
2010-10-10 12:48                     ` Avi Kivity
2010-10-07 12:31   ` Avi Kivity
2010-10-04 15:56 ` [PATCH v6 05/12] Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c Gleb Natapov
2010-10-04 15:56 ` [PATCH v6 06/12] Add PV MSR to enable asynchronous page faults delivery Gleb Natapov
2010-10-07 12:42   ` Avi Kivity
2010-10-07 17:53     ` Gleb Natapov
2010-10-10 12:47       ` Avi Kivity
2010-10-10 13:27         ` Gleb Natapov
2010-10-07 12:58   ` Avi Kivity
2010-10-07 17:59     ` Gleb Natapov
2010-10-09 18:43       ` Avi Kivity
2010-10-04 15:56 ` [PATCH v6 07/12] Add async PF initialization to PV guest Gleb Natapov
2010-10-05  2:34   ` Rik van Riel
2010-10-05 18:25   ` Marcelo Tosatti
2010-10-06 10:55     ` Gleb Natapov
2010-10-06 14:45       ` Marcelo Tosatti
2010-10-06 20:05         ` Gleb Natapov
2010-10-07 12:50   ` Avi Kivity
2010-10-08  7:54     ` Gleb Natapov
2010-10-09 18:44       ` Avi Kivity
2010-10-04 15:56 ` [PATCH v6 08/12] Handle async PF in a guest Gleb Natapov
2010-10-07 13:10   ` Avi Kivity
2010-10-07 17:14     ` Gleb Natapov
2010-10-07 17:18       ` Avi Kivity
2010-10-07 17:48         ` Rik van Riel
2010-10-07 18:03         ` Gleb Natapov
2010-10-09 18:48           ` Avi Kivity
2010-10-10  7:56             ` Gleb Natapov
2010-10-10 12:40               ` Avi Kivity
2010-10-10 12:32     ` Gleb Natapov
2010-10-10 12:38       ` Avi Kivity
2010-10-10 13:22         ` Gleb Natapov
2010-10-04 15:56 ` [PATCH v6 09/12] Inject asynchronous page fault into a PV guest if page is swapped out Gleb Natapov
2010-10-05  2:36   ` Rik van Riel
2010-10-05 19:00   ` Marcelo Tosatti
2010-10-06 10:42     ` Gleb Natapov
2010-10-04 15:56 ` [PATCH v6 10/12] Handle async PF in non preemptable context Gleb Natapov
2010-10-05 19:51   ` Marcelo Tosatti
2010-10-06 10:41     ` Gleb Natapov
2010-10-10 14:25       ` Gleb Natapov
2010-10-04 15:56 ` [PATCH v6 11/12] Let host know whether the guest can handle async PF in non-userspace context Gleb Natapov
2010-10-07 13:36   ` Avi Kivity
2010-10-04 15:56 ` [PATCH v6 12/12] Send async PF when guest is not in userspace too Gleb Natapov
2010-10-05  2:37   ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).