* [PATCH v7 00/12] KVM: Add host swap event notifications for PV guest
From: Gleb Natapov
Date: 2010-10-14  9:22 UTC
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

KVM virtualizes guest memory by means of shadow pages or HW assistance
like NPT/EPT. Not all memory used by a guest is mapped into the guest
address space or even present in host memory at any given time.
When a vcpu tries to access a memory page that is not mapped into the
guest address space, KVM is notified about it. KVM maps the page into
the guest address space and resumes vcpu execution. If the page has been
swapped out of host memory, vcpu execution is suspended until the page
is swapped back in. This is inefficient, since the vcpu could do other
work (run another task or serve interrupts) while the page is swapped in.

The patch series tries to mitigate this problem by introducing two
mechanisms. The first one is used with non-PV guests and works like
this: when a vcpu tries to access a swapped-out page, it is halted and
the requested page is swapped in by another thread. That way the vcpu
can still process interrupts while the I/O happens in parallel and, with
any luck, an interrupt will cause the guest to schedule another task on
the vcpu, so it will have work to do instead of waiting for the page to
be swapped in.
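
In outline, the host side of this first mechanism works as in the sketch
below. This is a condensed paraphrase of the try_async_pf() path added in
patch 02, not a drop-in implementation:

/* Condensed sketch of the non-PV flow from patch 02 (not verbatim). */
static bool try_async_pf_sketch(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
				pfn_t *pfn)
{
	bool async;

	/* Fast path: the page is already in host memory; no I/O needed. */
	*pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
	if (!async)
		return false;	/* *pfn is valid, map it as usual */

	/* Slow path: queue work that swaps the page in from a work queue
	 * thread and synthetically halt the vcpu in the meantime.  (The
	 * real code also checks can_do_async_pf() and double faults.) */
	if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
		return true;	/* vcpu halts; the fault is retried later */

	/* Could not queue the work; fall back to synchronous swap-in. */
	*pfn = gfn_to_pfn(vcpu->kvm, gfn);
	return false;
}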

The second mechanism introduces a PV notification about the swapped
page's state to the guest (an asynchronous page fault). Instead of
halting the vcpu upon access to a swapped-out page and hoping that some
interrupt will cause a reschedule, we immediately inject an asynchronous
page fault into the vcpu. A PV-aware guest knows that upon receiving
such an exception it should schedule another task to run on the vcpu.
The current task is put to sleep until a second kind of asynchronous
page fault arrives, notifying the guest that the page is now in host
memory, so the task that waits for it can run again.
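
The guest-side code is not part of this excerpt; schematically, and with
helper names made up purely for illustration, a PV-aware guest's handler
does something like:

/* Schematic only; the helper names here are hypothetical. */
void guest_async_pf_sketch(u32 token, bool page_ready)
{
	if (!page_ready) {
		/* "Page not present" notification: park the current task
		 * on a wait queue keyed by the token and run another one. */
		apf_sleep_on_token(token);
		schedule();
	} else {
		/* "Page ready" notification: the host finished the swap-in;
		 * wake whichever task was waiting on this token. */
		apf_wake_token(token);
	}
}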

To measure the performance benefits I use a simple benchmark program
(below) that starts a number of threads. Some of them do work (increment
a counter), others access a huge array at random locations, trying to
generate host page faults. The size of the array is smaller than guest
memory but bigger than host memory, so we are guaranteed that the host
will swap out part of the array.

I ran the benchmark on three setups: with current kvm.git (master),
with my patch series + a non-PV guest (nonpv), and with my patch series
+ a PV guest (pv).

Each guest had 4 cpus and 2G of memory and was launched inside a 512M
memory container. The command line was "./bm -f 4 -w 4 -t 60" (run 4
faulting threads and 4 working threads for a minute).

Below is the total amount of "work" each guest managed to do
(average of 10 runs):
         total work    std error
master: 122789420615 (3818565029)
nonpv:  138455939001 (773774299)
pv:     234351846135 (10461117116)

Changes:
 v1->v2
  Use MSR instead of hypercall.
  Move most of the code into an arch-independent place.
  Halt inside the guest instead of doing a "wait for page" hypercall if
   preemption is disabled.
 v2->v3
  Use an MSR from the 0x4b564dxx range.
  Add slot version tracking.
  Support migration by restarting all guest processes after migration.
  Drop the patch that tracked preemptability for non-preemptable kernels
   due to performance concerns. Send async PF to non-preemptable guests
   only when the vcpu is executing userspace code.
 v3->v4
  Provide an alternative page fault handler in the PV guest instead of
   adding a hook to the standard page fault handler and patching it out
   on non-PV guests.
  Allow only a limited number of outstanding async page faults per vcpu.
  Unify gfn_to_pfn and gfn_to_pfn_async code.
  Cancel outstanding slow work on reset.
 v4->v5
  Move async PF cpu initialization into a cpu hotplug notifier.
  Use GFP_NOWAIT instead of GFP_ATOMIC for allocations that shouldn't sleep.
  Process KVM_REQ_MMU_SYNC even in page_fault_other_cr3() before changing
   cr3 back.
 v5->v6
  Too many to list them all; only major changes are noted here.
  Replace slow work with work queues.
  Halt the vcpu for non-PV guests.
  Handle async PF in nested SVM mode.
  Do not prefault swapped-in pages in the non-tdp case.
 v6->v7
  Fix the "GUP fail in work thread" problem.
  Do prefault only if the mmu is in direct map mode.
  Use cpu->request to ask for a vcpu halt (drop the optimization that tried
   to skip non-present apf injection if the page is swapped in before the
   next vmentry).
  Keep track of the synthetic halt in separate state to prevent it from
   leaking during migration.
  Fix memslot tracking problems.
  More documentation.
  Other small review comments addressed.

Gleb Natapov (12):
  Add get_user_pages() variant that fails if major fault is required.
  Halt vcpu if page it tries to access is swapped out.
  Retry fault before vmentry
  Add memory slot versioning and use it to provide fast guest write interface
  Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
  Add PV MSR to enable asynchronous page faults delivery.
  Add async PF initialization to PV guest.
  Handle async PF in a guest.
  Inject asynchronous page fault into a PV guest if page is swapped out.
  Handle async PF in non preemptable context
  Let host know whether the guest can handle async PF in non-userspace context.
  Send async PF when guest is not in userspace too.

 Documentation/kernel-parameters.txt |    3 +
 Documentation/kvm/cpuid.txt         |    3 +
 Documentation/kvm/msr.txt           |   36 ++++-
 arch/x86/include/asm/kvm_host.h     |   28 +++-
 arch/x86/include/asm/kvm_para.h     |   24 +++
 arch/x86/include/asm/traps.h        |    1 +
 arch/x86/kernel/entry_32.S          |   10 +
 arch/x86/kernel/entry_64.S          |    3 +
 arch/x86/kernel/kvm.c               |  315 +++++++++++++++++++++++++++++++++++
 arch/x86/kernel/kvmclock.c          |   13 +--
 arch/x86/kvm/Kconfig                |    1 +
 arch/x86/kvm/Makefile               |    1 +
 arch/x86/kvm/mmu.c                  |   61 ++++++-
 arch/x86/kvm/paging_tmpl.h          |    8 +-
 arch/x86/kvm/svm.c                  |   45 ++++-
 arch/x86/kvm/x86.c                  |  192 +++++++++++++++++++++-
 fs/ncpfs/mmap.c                     |    2 +
 include/linux/kvm.h                 |    1 +
 include/linux/kvm_host.h            |   39 +++++
 include/linux/kvm_types.h           |    7 +
 include/linux/mm.h                  |    5 +
 include/trace/events/kvm.h          |   95 +++++++++++
 mm/filemap.c                        |    3 +
 mm/memory.c                         |   31 +++-
 mm/shmem.c                          |    8 +-
 virt/kvm/Kconfig                    |    3 +
 virt/kvm/async_pf.c                 |  213 +++++++++++++++++++++++
 virt/kvm/async_pf.h                 |   36 ++++
 virt/kvm/kvm_main.c                 |  132 ++++++++++++---
 29 files changed, 1255 insertions(+), 64 deletions(-)
 create mode 100644 virt/kvm/async_pf.c
 create mode 100644 virt/kvm/async_pf.h

=== benchmark.c ===
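
/*
 * Worker threads increment per-thread counters while faulting threads
 * touch random pages of a large buffer to force host swapping; the
 * total of the counters is reported as "work".
 */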

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

#define FAULTING_THREADS 1
#define WORKING_THREADS 1
#define TIMEOUT 5
#define MEMORY 1024*1024*1024

pthread_barrier_t barrier;
volatile int stop;
size_t pages;

void *fault_thread(void* p)
{
	char *mem = p;

	pthread_barrier_wait(&barrier);

	while (!stop)
		mem[(random() % pages) << 12] = 10;

	pthread_barrier_wait(&barrier);

	return NULL;
}

void *work_thread(void* p)
{
	unsigned long *i = p;

	pthread_barrier_wait(&barrier);

	while (!stop)
		(*i)++;

	pthread_barrier_wait(&barrier);

	return NULL;
}

int main(int argc, char **argv)
{
	int ft = FAULTING_THREADS, wt = WORKING_THREADS;
	unsigned int timeout = TIMEOUT;
	size_t mem = MEMORY;
	void *buf;
	int i, opt, verbose = 0;
	pthread_t t;
	pthread_attr_t pattr;
	unsigned long *res, sum = 0;

	while((opt = getopt(argc, argv, "f:w:m:t:v")) != -1) {
		switch (opt) {
		case 'f':
			ft = atoi(optarg);
			break;
		case 'w':
			wt = atoi(optarg);
			break;
		case 'm':
			mem = strtoul(optarg, NULL, 0); /* atoi truncates >2G */
			break;
		case 't':
			timeout = atoi(optarg);
			break;
		case 'v':
			verbose++;
			break;
		default:
			fprintf(stderr, "Usage %s [-f num] [-w num] [-m bytes] [-t secs] [-v]\n", argv[0]);
			exit(1);
		}
	}

	if (verbose)
		printf("fault=%d work=%d mem=%lu timeout=%d\n", ft, wt, mem, timeout);

	pages = mem >> 12;
	if (posix_memalign(&buf, 4096, pages << 12)) {
		fprintf(stderr, "posix_memalign failed\n");
		exit(1);
	}
	res = malloc(sizeof (unsigned long) * wt);
	memset(res, 0, sizeof (unsigned long) * wt);

	pthread_attr_init(&pattr);
	pthread_barrier_init(&barrier, NULL, ft + wt + 1);

	for (i = 0; i < ft; i++) {
		pthread_create(&t, &pattr, fault_thread, buf);
		pthread_detach(t);
	}

	for (i = 0; i < wt; i++) {
		pthread_create(&t, &pattr, work_thread, &res[i]);
		pthread_detach(t);
	}

	/* prefault memory */
	memset(buf, 0, pages << 12);
	printf("start\n");

	pthread_barrier_wait(&barrier);

	pthread_barrier_destroy(&barrier);
	pthread_barrier_init(&barrier, NULL, ft + wt + 1);

	sleep(timeout);
	stop = 1;

	pthread_barrier_wait(&barrier);

	for (i = 0; i < wt; i++) {
		sum += res[i];
		printf("worker %d: %lu\n", i, res[i]);
	}
	printf("total: %lu\n", sum);

	return 0;
}

* [PATCH v7 01/12] Add get_user_pages() variant that fails if major fault is required.
From: Gleb Natapov
Date: 2010-10-14  9:22 UTC
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

This patch adds a get_user_pages() variant that only succeeds if getting
a reference to a page doesn't require a major fault.

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
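
A caller is expected to use the new variant roughly as in the fragment
below, modeled on the hva_to_pfn() change later in this series; this is
a sketch, not code from the patch:

/* Sketch: pin one page only if no major fault (swap-in) is required. */
static int pin_page_noio_sketch(unsigned long addr, struct page **page)
{
	int npages;

	down_read(&current->mm->mmap_sem);
	/* Unlike get_user_pages(), this fails instead of sleeping when
	 * the page would have to be swapped in (a major fault). */
	npages = get_user_pages_noio(current, current->mm, addr,
				     1 /* nr_pages */, 1 /* write */,
				     0 /* force */, page, NULL);
	up_read(&current->mm->mmap_sem);

	return npages == 1 ? 0 : -EFAULT;
}
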
 fs/ncpfs/mmap.c    |    2 ++
 include/linux/mm.h |    5 +++++
 mm/filemap.c       |    3 +++
 mm/memory.c        |   31 ++++++++++++++++++++++++++++---
 mm/shmem.c         |    8 +++++++-
 5 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
index 56f5b3a..b9c4f36 100644
--- a/fs/ncpfs/mmap.c
+++ b/fs/ncpfs/mmap.c
@@ -39,6 +39,8 @@ static int ncp_file_mmap_fault(struct vm_area_struct *area,
 	int bufsize;
 	int pos; /* XXX: loff_t ? */
 
+	if (vmf->flags & FAULT_FLAG_MINOR)
+		return VM_FAULT_MAJOR | VM_FAULT_ERROR;
 	/*
 	 * ncpfs has nothing against high pages as long
 	 * as recvmsg and memset works on it
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74949fb..da32900 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -144,6 +144,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
 #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_MINOR	0x08	/* Do only minor fault */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -848,6 +849,9 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			unsigned long start, int nr_pages, int write, int force,
 			struct page **pages, struct vm_area_struct **vmas);
+int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
+			unsigned long start, int nr_pages, int write, int force,
+			struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct page *get_dump_page(unsigned long addr);
@@ -1394,6 +1398,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_MINOR	0x20	/* do only minor page faults */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..ef28b6d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1548,6 +1548,9 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 			goto no_cached_page;
 		}
 	} else {
+		if (vmf->flags & FAULT_FLAG_MINOR)
+			return VM_FAULT_MAJOR | VM_FAULT_ERROR;
+
 		/* No page in the page cache at all */
 		do_sync_mmap_readahead(vma, ra, file, offset);
 		count_vm_event(PGMAJFAULT);
diff --git a/mm/memory.c b/mm/memory.c
index 0e18b4d..b221458 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1441,10 +1441,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			cond_resched();
 			while (!(page = follow_page(vma, start, foll_flags))) {
 				int ret;
+				unsigned int fault_fl =
+					((foll_flags & FOLL_WRITE) ?
+					FAULT_FLAG_WRITE : 0) |
+					((foll_flags & FOLL_MINOR) ?
+					FAULT_FLAG_MINOR : 0);
 
-				ret = handle_mm_fault(mm, vma, start,
-					(foll_flags & FOLL_WRITE) ?
-					FAULT_FLAG_WRITE : 0);
+				ret = handle_mm_fault(mm, vma, start, fault_fl);
 
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
@@ -1452,6 +1455,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 					if (ret &
 					    (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
 						return i ? i : -EFAULT;
+					else if (ret & VM_FAULT_MAJOR)
+						return i ? i : -EFAULT;
 					BUG();
 				}
 				if (ret & VM_FAULT_MAJOR)
@@ -1562,6 +1567,23 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	int flags = FOLL_TOUCH | FOLL_MINOR;
+
+	if (pages)
+		flags |= FOLL_GET;
+	if (write)
+		flags |= FOLL_WRITE;
+	if (force)
+		flags |= FOLL_FORCE;
+
+	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
+}
+EXPORT_SYMBOL(get_user_pages_noio);
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
@@ -2648,6 +2670,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry);
 	if (!page) {
+		if (flags & FAULT_FLAG_MINOR)
+			return VM_FAULT_MAJOR | VM_FAULT_ERROR;
+
 		grab_swap_token(mm); /* Contend for token _before_ read-in */
 		page = swapin_readahead(entry,
 					GFP_HIGHUSER_MOVABLE, vma, address);
diff --git a/mm/shmem.c b/mm/shmem.c
index 080b09a..470d8a7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1228,6 +1228,7 @@ static int shmem_getpage(struct inode *inode, unsigned long idx,
 	swp_entry_t swap;
 	gfp_t gfp;
 	int error;
+	int flags = type ? *type : 0;
 
 	if (idx >= SHMEM_MAX_INDEX)
 		return -EFBIG;
@@ -1287,6 +1288,11 @@ repeat:
 		swappage = lookup_swap_cache(swap);
 		if (!swappage) {
 			shmem_swp_unmap(entry);
+			if (flags & FAULT_FLAG_MINOR) {
+				spin_unlock(&info->lock);
+				*type = VM_FAULT_MAJOR | VM_FAULT_ERROR;
+				goto failed;
+			}
 			/* here we actually do the io */
 			if (type && !(*type & VM_FAULT_MAJOR)) {
 				__count_vm_event(PGMAJFAULT);
@@ -1510,7 +1516,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
 	int error;
-	int ret;
+	int ret = (int)vmf->flags;
 
 	if (((loff_t)vmf->pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
-- 
1.7.1


* [PATCH v7 02/12] Halt vcpu if page it tries to access is swapped out.
From: Gleb Natapov
Date: 2010-10-14  9:22 UTC
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If a guest accesses swapped-out memory, do not swap it in from vcpu
thread context. Schedule work to do the swapping and put the vcpu into
a halted state instead.

Interrupts will still be delivered to the guest, and if an interrupt
causes a reschedule the guest will continue running another task.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
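
Condensed, the vcpu run-loop interaction added by the x86.c hunks below
looks like this (a sketch, not verbatim code):

	while (r > 0) {
		if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
		    !vcpu->arch.apf.halted)
			r = vcpu_enter_guest(vcpu); /* may request APF halt */
		else
			/* Synthetic halt: block until an interrupt arrives
			 * or an async swap-in completes. */
			kvm_vcpu_block(vcpu);

		/* Retire completed swap-ins queued by the worker. */
		kvm_check_async_pf_completion(vcpu);
	}
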
 arch/x86/include/asm/kvm_host.h |   18 ++++
 arch/x86/kvm/Kconfig            |    1 +
 arch/x86/kvm/Makefile           |    1 +
 arch/x86/kvm/mmu.c              |   52 +++++++++++-
 arch/x86/kvm/paging_tmpl.h      |    4 +-
 arch/x86/kvm/x86.c              |  112 ++++++++++++++++++++++-
 include/linux/kvm_host.h        |   31 +++++++
 include/trace/events/kvm.h      |   90 ++++++++++++++++++
 virt/kvm/Kconfig                |    3 +
 virt/kvm/async_pf.c             |  190 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/async_pf.h             |   36 ++++++++
 virt/kvm/kvm_main.c             |   57 +++++++++---
 12 files changed, 578 insertions(+), 17 deletions(-)
 create mode 100644 virt/kvm/async_pf.c
 create mode 100644 virt/kvm/async_pf.h

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e209078..043e29e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -83,11 +83,14 @@
 #define KVM_NR_FIXED_MTRR_REGION 88
 #define KVM_NR_VAR_MTRR 8
 
+#define ASYNC_PF_PER_VCPU 64
+
 extern spinlock_t kvm_lock;
 extern struct list_head vm_list;
 
 struct kvm_vcpu;
 struct kvm;
+struct kvm_async_pf;
 
 enum kvm_reg {
 	VCPU_REGS_RAX = 0,
@@ -412,6 +415,11 @@ struct kvm_vcpu_arch {
 	u64 hv_vapic;
 
 	cpumask_var_t wbinvd_dirty_mask;
+
+	struct {
+		bool halted;
+		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
+	} apf;
 };
 
 struct kvm_arch {
@@ -585,6 +593,10 @@ struct kvm_x86_ops {
 	const struct trace_print_flags *exit_reasons_str;
 };
 
+struct kvm_arch_async_pf {
+	gfn_t gfn;
+};
+
 extern struct kvm_x86_ops *kvm_x86_ops;
 
 int kvm_mmu_module_init(void);
@@ -823,4 +835,10 @@ void kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
 
 bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip);
 
+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work);
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				 struct kvm_async_pf *work);
+extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ddc131f..50f6364 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -28,6 +28,7 @@ config KVM
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_EVENTFD
 	select KVM_APIC_ARCHITECTURE
+	select KVM_ASYNC_PF
 	select USER_RETURN_NOTIFIER
 	select KVM_MMIO
 	---help---
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 31a7035..c53bf19 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -9,6 +9,7 @@ kvm-y			+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
 				coalesced_mmio.o irq_comm.o eventfd.o \
 				assigned-dev.o)
 kvm-$(CONFIG_IOMMU_API)	+= $(addprefix ../../../virt/kvm/, iommu.o)
+kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(addprefix ../../../virt/kvm/, async_pf.o)
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o timer.o
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 908ea54..f01e89a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -18,9 +18,11 @@
  *
  */
 
+#include "irq.h"
 #include "mmu.h"
 #include "x86.h"
 #include "kvm_cache_regs.h"
+#include "x86.h"
 
 #include <linux/kvm_host.h>
 #include <linux/types.h>
@@ -2585,6 +2587,50 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 			     error_code & PFERR_WRITE_MASK, gfn);
 }
 
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
+{
+	struct kvm_arch_async_pf arch;
+	arch.gfn = gfn;
+
+	return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
+}
+
+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
+{
+	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
+		     kvm_event_needs_reinjection(vcpu)))
+		return false;
+
+	return kvm_x86_ops->interrupt_allowed(vcpu);
+}
+
+static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
+			 pfn_t *pfn)
+{
+	bool async;
+
+	*pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
+
+	if (!async)
+		return false; /* *pfn has correct page already */
+
+	put_page(pfn_to_page(*pfn));
+
+	if (can_do_async_pf(vcpu)) {
+		trace_kvm_try_async_get_page(async, *pfn);
+		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
+			trace_kvm_async_pf_doublefault(gva, gfn);
+			kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+			return true;
+		} else if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
+			return true;
+	}
+
+	*pfn = gfn_to_pfn(vcpu->kvm, gfn);
+	
+	return false;
+}
+
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 				u32 error_code)
 {
@@ -2607,7 +2653,11 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-	pfn = gfn_to_pfn(vcpu->kvm, gfn);
+
+	if (try_async_pf(vcpu, gfn, gpa, &pfn))
+		return 0;
+
+	/* mmio */
 	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
 	spin_lock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index cd7a833..c45376d 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -568,7 +568,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-	pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
+
+	if (try_async_pf(vcpu, walker.gfn, addr, &pfn))
+		return 0;
 
 	/* mmio */
 	if (is_error_pfn(pfn))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7127a13..09e72fc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -43,6 +43,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/uaccess.h>
+#include <linux/hash.h>
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -155,6 +156,13 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 
 u64 __read_mostly host_xcr0;
 
+static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
+{
+	int i;
+	for (i = 0; i < roundup_pow_of_two(ASYNC_PF_PER_VCPU); i++)
+		vcpu->arch.apf.gfns[i] = ~0;
+}
+
 static inline u32 bit(int bitno)
 {
 	return 1 << (bitno & 31);
@@ -5110,6 +5118,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			vcpu->fpu_active = 0;
 			kvm_x86_ops->fpu_deactivate(vcpu);
 		}
+		if (kvm_check_request(KVM_REQ_APF_HALT, vcpu)) {
+			/* Page is swapped out. Do synthetic halt */
+			vcpu->arch.apf.halted = true;
+			r = 1;
+			goto out;
+		}
 	}
 
 	r = kvm_mmu_reload(vcpu);
@@ -5238,7 +5252,8 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 
 	r = 1;
 	while (r > 0) {
-		if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE)
+		if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
+		    !vcpu->arch.apf.halted)
 			r = vcpu_enter_guest(vcpu);
 		else {
 			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
@@ -5251,6 +5266,7 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 					vcpu->arch.mp_state =
 						KVM_MP_STATE_RUNNABLE;
 				case KVM_MP_STATE_RUNNABLE:
+					vcpu->arch.apf.halted = false;
 					break;
 				case KVM_MP_STATE_SIPI_RECEIVED:
 				default:
@@ -5272,6 +5288,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 			vcpu->run->exit_reason = KVM_EXIT_INTR;
 			++vcpu->stat.request_irq_exits;
 		}
+		
+		kvm_check_async_pf_completion(vcpu);
+
 		if (signal_pending(current)) {
 			r = -EINTR;
 			vcpu->run->exit_reason = KVM_EXIT_INTR;
@@ -5785,6 +5804,10 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
+	kvm_clear_async_pf_completion_queue(vcpu);
+	kvm_async_pf_hash_reset(vcpu);
+	vcpu->arch.apf.halted = false;
+
 	return kvm_x86_ops->vcpu_reset(vcpu);
 }
 
@@ -5873,6 +5896,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask, GFP_KERNEL))
 		goto fail_free_mce_banks;
 
+	kvm_async_pf_hash_reset(vcpu);
+
 	return 0;
 fail_free_mce_banks:
 	kfree(vcpu->arch.mce_banks);
@@ -5931,8 +5956,10 @@ static void kvm_free_vcpus(struct kvm *kvm)
 	/*
 	 * Unpin any mmu pages first.
 	 */
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_unload_vcpu_mmu(vcpu);
+	}
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_arch_vcpu_free(vcpu);
 
@@ -6043,7 +6070,9 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
 
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 {
-	return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
+	return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
+		!vcpu->arch.apf.halted)
+		|| !list_empty_careful(&vcpu->async_pf.done)
 		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
 		|| vcpu->arch.nmi_pending ||
 		(kvm_arch_interrupt_allowed(vcpu) &&
@@ -6102,6 +6131,83 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
 
+static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
+{
+	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
+}
+
+static inline u32 kvm_async_pf_next_probe(u32 key)
+{
+	return (key + 1) & (roundup_pow_of_two(ASYNC_PF_PER_VCPU) - 1);
+}
+
+static void kvm_add_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_fn(gfn);
+
+	while (vcpu->arch.apf.gfns[key] != ~0)
+		key = kvm_async_pf_next_probe(key);
+
+	vcpu->arch.apf.gfns[key] = gfn;
+}
+
+static u32 kvm_async_pf_gfn_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	int i;
+	u32 key = kvm_async_pf_hash_fn(gfn);
+
+	for (i = 0; i < roundup_pow_of_two(ASYNC_PF_PER_VCPU) &&
+		     (vcpu->arch.apf.gfns[key] != gfn ||
+		      vcpu->arch.apf.gfns[key] == ~0); i++)
+		key = kvm_async_pf_next_probe(key);
+
+	return key;
+}
+
+bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	return vcpu->arch.apf.gfns[kvm_async_pf_gfn_slot(vcpu, gfn)] == gfn;
+}
+
+static void kvm_del_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 i, j, k;
+
+	i = j = kvm_async_pf_gfn_slot(vcpu, gfn);
+	while (true) {
+		vcpu->arch.apf.gfns[i] = ~0;
+		do {
+			j = kvm_async_pf_next_probe(j);
+			if (vcpu->arch.apf.gfns[j] == ~0)
+				return;
+			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
+			/*
+			 * k lies cyclically in ]i,j]
+			 * |    i.k.j |
+			 * |....j i.k.| or  |.k..j i...|
+			 */
+		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
+		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
+		i = j;
+	}
+}
+
+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work)
+{
+	trace_kvm_async_pf_not_present(work->gva);
+
+	kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+	kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
+}
+
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				 struct kvm_async_pf *work)
+{
+	trace_kvm_async_pf_ready(work->gva);
+	kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0b89d00..9a9b017 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -40,6 +40,7 @@
 #define KVM_REQ_KICK               9
 #define KVM_REQ_DEACTIVATE_FPU    10
 #define KVM_REQ_EVENT             11
+#define KVM_REQ_APF_HALT          12
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID	0
 
@@ -74,6 +75,26 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 			      struct kvm_io_device *dev);
 
+#ifdef CONFIG_KVM_ASYNC_PF
+struct kvm_async_pf {
+	struct work_struct work;
+	struct list_head link;
+	struct list_head queue;
+	struct kvm_vcpu *vcpu;
+	struct mm_struct *mm;
+	gva_t gva;
+	unsigned long addr;
+	struct kvm_arch_async_pf arch;
+	struct page *page;
+	bool done;
+};
+
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+		       struct kvm_arch_async_pf *arch);
+#endif
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -104,6 +125,15 @@ struct kvm_vcpu {
 	gpa_t mmio_phys_addr;
 #endif
 
+#ifdef CONFIG_KVM_ASYNC_PF
+	struct {
+		u32 queued;
+		struct list_head queue;
+		struct list_head done;
+		spinlock_t lock;
+	} async_pf;
+#endif
+
 	struct kvm_vcpu_arch arch;
 };
 
@@ -302,6 +332,7 @@ void kvm_set_page_accessed(struct page *page);
 
 pfn_t hva_to_pfn_atomic(struct kvm *kvm, unsigned long addr);
 pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
+pfn_t gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, bool *async);
 pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
 			 struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 6dd3a51..a78a5e5 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -185,6 +185,96 @@ TRACE_EVENT(kvm_age_page,
 		  __entry->referenced ? "YOUNG" : "OLD")
 );
 
+#ifdef CONFIG_KVM_ASYNC_PF
+TRACE_EVENT(
+	kvm_try_async_get_page,
+	TP_PROTO(bool async, u64 pfn),
+	TP_ARGS(async, pfn),
+
+	TP_STRUCT__entry(
+		__field(__u64, pfn)
+		),
+
+	TP_fast_assign(
+		__entry->pfn = (!async) ? pfn : (u64)-1;
+		),
+
+	TP_printk("pfn %#llx", __entry->pfn)
+);
+
+TRACE_EVENT(
+	kvm_async_pf_not_present,
+	TP_PROTO(u64 gva),
+	TP_ARGS(gva),
+
+	TP_STRUCT__entry(
+		__field(__u64, gva)
+		),
+
+	TP_fast_assign(
+		__entry->gva = gva;
+		),
+
+	TP_printk("gva %#llx not present", __entry->gva)
+);
+
+TRACE_EVENT(
+	kvm_async_pf_ready,
+	TP_PROTO(u64 gva),
+	TP_ARGS(gva),
+
+	TP_STRUCT__entry(
+		__field(__u64, gva)
+		),
+
+	TP_fast_assign(
+		__entry->gva = gva;
+		),
+
+	TP_printk("gva %#llx ready", __entry->gva)
+);
+
+TRACE_EVENT(
+	kvm_async_pf_completed,
+	TP_PROTO(unsigned long address, struct page *page, u64 gva),
+	TP_ARGS(address, page, gva),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, address)
+		__field(pfn_t, pfn)
+		__field(u64, gva)
+		),
+
+	TP_fast_assign(
+		__entry->address = address;
+		__entry->pfn = page ? page_to_pfn(page) : 0;
+		__entry->gva = gva;
+		),
+
+	TP_printk("gva %#llx address %#lx pfn %#llx",  __entry->gva,
+		  __entry->address, __entry->pfn)
+);
+
+TRACE_EVENT(
+	kvm_async_pf_doublefault,
+	TP_PROTO(u64 gva, u64 gfn),
+	TP_ARGS(gva, gfn),
+
+	TP_STRUCT__entry(
+		__field(u64, gva)
+		__field(u64, gfn)
+		),
+
+	TP_fast_assign(
+		__entry->gva = gva;
+		__entry->gfn = gfn;
+		),
+
+	TP_printk("gva = %#llx, gfn = %#llx", __entry->gva, __entry->gfn)
+);
+
+#endif
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 7f1178f..f63ccb0 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -15,3 +15,6 @@ config KVM_APIC_ARCHITECTURE
 
 config KVM_MMIO
        bool
+
+config KVM_ASYNC_PF
+       bool
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
new file mode 100644
index 0000000..8b144d5
--- /dev/null
+++ b/virt/kvm/async_pf.c
@@ -0,0 +1,190 @@
+/*
+ * kvm asynchronous fault support
+ *
+ * Copyright 2010 Red Hat, Inc.
+ *
+ * Author:
+ *      Gleb Natapov <gleb@redhat.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/mmu_context.h>
+
+#include "async_pf.h"
+#include <trace/events/kvm.h>
+
+static struct kmem_cache *async_pf_cache;
+
+int kvm_async_pf_init(void)
+{
+	async_pf_cache = KMEM_CACHE(kvm_async_pf, 0);
+
+	if (!async_pf_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void kvm_async_pf_deinit(void)
+{
+	if (async_pf_cache)
+		kmem_cache_destroy(async_pf_cache);
+	async_pf_cache = NULL;
+}
+
+void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu)
+{
+	INIT_LIST_HEAD(&vcpu->async_pf.done);
+	INIT_LIST_HEAD(&vcpu->async_pf.queue);
+	spin_lock_init(&vcpu->async_pf.lock);
+}
+
+static void async_pf_execute(struct work_struct *work)
+{
+	struct page *page = NULL;
+	struct kvm_async_pf *apf =
+		container_of(work, struct kvm_async_pf, work);
+	struct mm_struct *mm = apf->mm;
+	struct kvm_vcpu *vcpu = apf->vcpu;
+	unsigned long addr = apf->addr;
+	gva_t gva = apf->gva;
+
+	might_sleep();
+
+	use_mm(mm);
+	down_read(&mm->mmap_sem);
+	get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
+	up_read(&mm->mmap_sem);
+	unuse_mm(mm);
+
+	spin_lock(&vcpu->async_pf.lock);
+	list_add_tail(&apf->link, &vcpu->async_pf.done);
+	apf->page = page;
+	apf->done = true;
+	spin_unlock(&vcpu->async_pf.lock);
+
+	/*
+	 * apf may be freed by kvm_check_async_pf_completion() after
+	 * this point
+	 */
+
+	trace_kvm_async_pf_completed(addr, page, gva);
+
+	if (waitqueue_active(&vcpu->wq))
+		wake_up_interruptible(&vcpu->wq);
+
+	mmdrop(mm);
+	kvm_put_kvm(vcpu->kvm);
+}
+
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
+{
+	/* cancel outstanding work queue item */
+	while (!list_empty(&vcpu->async_pf.queue)) {
+		struct kvm_async_pf *work =
+			list_entry(vcpu->async_pf.queue.next,
+				   typeof(*work), queue);
+		cancel_work_sync(&work->work);
+		list_del(&work->queue);
+		if (!work->done) /* work was canceled */
+			kmem_cache_free(async_pf_cache, work);
+	}
+
+	spin_lock(&vcpu->async_pf.lock);
+	while (!list_empty(&vcpu->async_pf.done)) {
+		struct kvm_async_pf *work =
+			list_entry(vcpu->async_pf.done.next,
+				   typeof(*work), link);
+		list_del(&work->link);
+		if (work->page)
+			put_page(work->page);
+		kmem_cache_free(async_pf_cache, work);
+	}
+	spin_unlock(&vcpu->async_pf.lock);
+
+	vcpu->async_pf.queued = 0;
+}
+
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
+{
+	struct kvm_async_pf *work;
+
+	if (list_empty_careful(&vcpu->async_pf.done))
+		return;
+
+	spin_lock(&vcpu->async_pf.lock);
+	work = list_first_entry(&vcpu->async_pf.done, typeof(*work), link);
+	list_del(&work->link);
+	spin_unlock(&vcpu->async_pf.lock);
+
+	kvm_arch_async_page_present(vcpu, work);
+
+	list_del(&work->queue);
+	vcpu->async_pf.queued--;
+	if (work->page)
+		put_page(work->page);
+	kmem_cache_free(async_pf_cache, work);
+}
+
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+		       struct kvm_arch_async_pf *arch)
+{
+	struct kvm_async_pf *work;
+
+	if (vcpu->async_pf.queued >= ASYNC_PF_PER_VCPU)
+		return 0;
+
+	/* setup delayed work */
+
+	/*
+	 * do alloc nowait since if we are going to sleep anyway we
+	 * may as well sleep faulting in page
+	 */
+	work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
+	if (!work)
+		return 0;
+
+	work->page = NULL;
+	work->done = false;
+	work->vcpu = vcpu;
+	work->gva = gva;
+	work->addr = gfn_to_hva(vcpu->kvm, gfn);
+	work->arch = *arch;
+	work->mm = current->mm;
+	atomic_inc(&work->mm->mm_count);
+	kvm_get_kvm(work->vcpu->kvm);
+
+	/* this can't really happen otherwise gfn_to_pfn_async
+	   would succeed */
+	if (unlikely(kvm_is_error_hva(work->addr)))
+		goto retry_sync;
+
+	INIT_WORK(&work->work, async_pf_execute);
+	if (!schedule_work(&work->work))
+		goto retry_sync;
+
+	list_add_tail(&work->queue, &vcpu->async_pf.queue);
+	vcpu->async_pf.queued++;
+	kvm_arch_async_page_not_present(vcpu, work);
+	return 1;
+retry_sync:
+	kvm_put_kvm(work->vcpu->kvm);
+	mmdrop(work->mm);
+	kmem_cache_free(async_pf_cache, work);
+	return 0;
+}
diff --git a/virt/kvm/async_pf.h b/virt/kvm/async_pf.h
new file mode 100644
index 0000000..fa15074
--- /dev/null
+++ b/virt/kvm/async_pf.h
@@ -0,0 +1,36 @@
+/*
+ * kvm asynchronous fault support
+ *
+ * Copyright 2010 Red Hat, Inc.
+ *
+ * Author:
+ *      Gleb Natapov <gleb@redhat.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __KVM_ASYNC_PF_H__
+#define __KVM_ASYNC_PF_H__
+
+#ifdef CONFIG_KVM_ASYNC_PF

* [PATCH v7 02/12] Halt vcpu if page it tries to access is swapped out.
@ 2010-10-14  9:22   ` Gleb Natapov
  0 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-14  9:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If a guest accesses swapped-out memory, do not swap it in from the vcpu
thread context. Schedule work to do the swapping and put the vcpu into a
halted state instead.

Interrupts will still be delivered to the guest, and if an interrupt
causes a reschedule, the guest will continue running another task.
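
The flow is easiest to see in miniature. Below is a stand-alone
user-space analogy (pthreads stand in for the workqueue item and for the
vcpu wait queue; none of the names below are actual KVM code):

	/* build: cc -pthread apf_demo.c -o apf_demo */
	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <unistd.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
	static bool page_ready;

	/* plays the role of the work item that calls get_user_pages() */
	static void *swap_in_worker(void *arg)
	{
		(void)arg;
		sleep(1);			/* pretend this is disk I/O */
		pthread_mutex_lock(&lock);
		page_ready = true;		/* like apf->done = true */
		pthread_cond_signal(&cond);	/* like waking vcpu->wq */
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t worker;

		/* vcpu faults on a swapped-out page: queue work, "halt" */
		pthread_create(&worker, NULL, swap_in_worker, NULL);

		pthread_mutex_lock(&lock);
		while (!page_ready)	/* a real vcpu still takes interrupts here */
			pthread_cond_wait(&cond, &lock);
		pthread_mutex_unlock(&lock);

		puts("page swapped in, vcpu resumes");
		pthread_join(worker, NULL);
		return 0;
	}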

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |   18 ++++
 arch/x86/kvm/Kconfig            |    1 +
 arch/x86/kvm/Makefile           |    1 +
 arch/x86/kvm/mmu.c              |   52 +++++++++++-
 arch/x86/kvm/paging_tmpl.h      |    4 +-
 arch/x86/kvm/x86.c              |  112 ++++++++++++++++++++++-
 include/linux/kvm_host.h        |   31 +++++++
 include/trace/events/kvm.h      |   90 ++++++++++++++++++
 virt/kvm/Kconfig                |    3 +
 virt/kvm/async_pf.c             |  190 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/async_pf.h             |   36 ++++++++
 virt/kvm/kvm_main.c             |   57 +++++++++---
 12 files changed, 578 insertions(+), 17 deletions(-)
 create mode 100644 virt/kvm/async_pf.c
 create mode 100644 virt/kvm/async_pf.h

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e209078..043e29e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -83,11 +83,14 @@
 #define KVM_NR_FIXED_MTRR_REGION 88
 #define KVM_NR_VAR_MTRR 8
 
+#define ASYNC_PF_PER_VCPU 64
+
 extern spinlock_t kvm_lock;
 extern struct list_head vm_list;
 
 struct kvm_vcpu;
 struct kvm;
+struct kvm_async_pf;
 
 enum kvm_reg {
 	VCPU_REGS_RAX = 0,
@@ -412,6 +415,11 @@ struct kvm_vcpu_arch {
 	u64 hv_vapic;
 
 	cpumask_var_t wbinvd_dirty_mask;
+
+	struct {
+		bool halted;
+		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
+	} apf;
 };
 
 struct kvm_arch {
@@ -585,6 +593,10 @@ struct kvm_x86_ops {
 	const struct trace_print_flags *exit_reasons_str;
 };
 
+struct kvm_arch_async_pf {
+	gfn_t gfn;
+};
+
 extern struct kvm_x86_ops *kvm_x86_ops;
 
 int kvm_mmu_module_init(void);
@@ -823,4 +835,10 @@ void kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
 
 bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip);
 
+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work);
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				 struct kvm_async_pf *work);
+extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ddc131f..50f6364 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -28,6 +28,7 @@ config KVM
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_EVENTFD
 	select KVM_APIC_ARCHITECTURE
+	select KVM_ASYNC_PF
 	select USER_RETURN_NOTIFIER
 	select KVM_MMIO
 	---help---
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 31a7035..c53bf19 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -9,6 +9,7 @@ kvm-y			+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
 				coalesced_mmio.o irq_comm.o eventfd.o \
 				assigned-dev.o)
 kvm-$(CONFIG_IOMMU_API)	+= $(addprefix ../../../virt/kvm/, iommu.o)
+kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(addprefix ../../../virt/kvm/, async_pf.o)
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o timer.o
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 908ea54..f01e89a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -18,9 +18,11 @@
  *
  */
 
+#include "irq.h"
 #include "mmu.h"
 #include "x86.h"
 #include "kvm_cache_regs.h"
+#include "x86.h"
 
 #include <linux/kvm_host.h>
 #include <linux/types.h>
@@ -2585,6 +2587,50 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 			     error_code & PFERR_WRITE_MASK, gfn);
 }
 
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
+{
+	struct kvm_arch_async_pf arch;
+	arch.gfn = gfn;
+
+	return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
+}
+
+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
+{
+	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
+		     kvm_event_needs_reinjection(vcpu)))
+		return false;
+
+	return kvm_x86_ops->interrupt_allowed(vcpu);
+}
+
+static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
+			 pfn_t *pfn)
+{
+	bool async;
+
+	*pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
+
+	if (!async)
+		return false; /* *pfn has correct page already */
+
+	put_page(pfn_to_page(*pfn));
+
+	if (can_do_async_pf(vcpu)) {
+		trace_kvm_try_async_get_page(async, *pfn);
+		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
+			trace_kvm_async_pf_doublefault(gva, gfn);
+			kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+			return true;
+		} else if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
+			return true;
+	}
+
+	*pfn = gfn_to_pfn(vcpu->kvm, gfn);
+
+	return false;
+}
+
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 				u32 error_code)
 {
@@ -2607,7 +2653,11 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-	pfn = gfn_to_pfn(vcpu->kvm, gfn);
+
+	if (try_async_pf(vcpu, gfn, gpa, &pfn))
+		return 0;
+
+	/* mmio */
 	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
 	spin_lock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index cd7a833..c45376d 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -568,7 +568,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-	pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
+
+	if (try_async_pf(vcpu, walker.gfn, addr, &pfn))
+		return 0;
 
 	/* mmio */
 	if (is_error_pfn(pfn))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7127a13..09e72fc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -43,6 +43,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/uaccess.h>
+#include <linux/hash.h>
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -155,6 +156,13 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 
 u64 __read_mostly host_xcr0;
 
+static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
+{
+	int i;
+	for (i = 0; i < roundup_pow_of_two(ASYNC_PF_PER_VCPU); i++)
+		vcpu->arch.apf.gfns[i] = ~0;
+}
+
 static inline u32 bit(int bitno)
 {
 	return 1 << (bitno & 31);
@@ -5110,6 +5118,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			vcpu->fpu_active = 0;
 			kvm_x86_ops->fpu_deactivate(vcpu);
 		}
+		if (kvm_check_request(KVM_REQ_APF_HALT, vcpu)) {
+			/* Page is swapped out. Do synthetic halt */
+			vcpu->arch.apf.halted = true;
+			r = 1;
+			goto out;
+		}
 	}
 
 	r = kvm_mmu_reload(vcpu);
@@ -5238,7 +5252,8 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 
 	r = 1;
 	while (r > 0) {
-		if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE)
+		if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
+		    !vcpu->arch.apf.halted)
 			r = vcpu_enter_guest(vcpu);
 		else {
 			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
@@ -5251,6 +5266,7 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 					vcpu->arch.mp_state =
 						KVM_MP_STATE_RUNNABLE;
 				case KVM_MP_STATE_RUNNABLE:
+					vcpu->arch.apf.halted = false;
 					break;
 				case KVM_MP_STATE_SIPI_RECEIVED:
 				default:
@@ -5272,6 +5288,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 			vcpu->run->exit_reason = KVM_EXIT_INTR;
 			++vcpu->stat.request_irq_exits;
 		}
+
+		kvm_check_async_pf_completion(vcpu);
+
 		if (signal_pending(current)) {
 			r = -EINTR;
 			vcpu->run->exit_reason = KVM_EXIT_INTR;
@@ -5785,6 +5804,10 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
+	kvm_clear_async_pf_completion_queue(vcpu);
+	kvm_async_pf_hash_reset(vcpu);
+	vcpu->arch.apf.halted = false;
+
 	return kvm_x86_ops->vcpu_reset(vcpu);
 }
 
@@ -5873,6 +5896,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask, GFP_KERNEL))
 		goto fail_free_mce_banks;
 
+	kvm_async_pf_hash_reset(vcpu);
+
 	return 0;
 fail_free_mce_banks:
 	kfree(vcpu->arch.mce_banks);
@@ -5931,8 +5956,10 @@ static void kvm_free_vcpus(struct kvm *kvm)
 	/*
 	 * Unpin any mmu pages first.
 	 */
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_unload_vcpu_mmu(vcpu);
+	}
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_arch_vcpu_free(vcpu);
 
@@ -6043,7 +6070,9 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
 
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 {
-	return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
+	return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
+		!vcpu->arch.apf.halted)
+		|| !list_empty_careful(&vcpu->async_pf.done)
 		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
 		|| vcpu->arch.nmi_pending ||
 		(kvm_arch_interrupt_allowed(vcpu) &&
@@ -6102,6 +6131,83 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
 
+static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
+{
+	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
+}
+
+static inline u32 kvm_async_pf_next_probe(u32 key)
+{
+	return (key + 1) & (roundup_pow_of_two(ASYNC_PF_PER_VCPU) - 1);
+}
+
+static void kvm_add_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_fn(gfn);
+
+	while (vcpu->arch.apf.gfns[key] != ~0)
+		key = kvm_async_pf_next_probe(key);
+
+	vcpu->arch.apf.gfns[key] = gfn;
+}
+
+static u32 kvm_async_pf_gfn_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	int i;
+	u32 key = kvm_async_pf_hash_fn(gfn);
+
+	for (i = 0; i < roundup_pow_of_two(ASYNC_PF_PER_VCPU) &&
+		     (vcpu->arch.apf.gfns[key] != gfn ||
+		      vcpu->arch.apf.gfns[key] == ~0); i++)
+		key = kvm_async_pf_next_probe(key);
+
+	return key;
+}
+
+bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	return vcpu->arch.apf.gfns[kvm_async_pf_gfn_slot(vcpu, gfn)] == gfn;
+}
+
+static void kvm_del_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 i, j, k;
+
+	i = j = kvm_async_pf_gfn_slot(vcpu, gfn);
+	while (true) {
+		vcpu->arch.apf.gfns[i] = ~0;
+		do {
+			j = kvm_async_pf_next_probe(j);
+			if (vcpu->arch.apf.gfns[j] == ~0)
+				return;
+			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
+			/*
+			 * k lies cyclically in ]i,j]
+			 * |    i.k.j |
+			 * |....j i.k.| or  |.k..j i...|
+			 */
+		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
+		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
+		i = j;
+	}
+}
+
+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work)
+{
+	trace_kvm_async_pf_not_present(work->gva);
+
+	kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+	kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
+}
+
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				 struct kvm_async_pf *work)
+{
+	trace_kvm_async_pf_ready(work->gva);
+	kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0b89d00..9a9b017 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -40,6 +40,7 @@
 #define KVM_REQ_KICK               9
 #define KVM_REQ_DEACTIVATE_FPU    10
 #define KVM_REQ_EVENT             11
+#define KVM_REQ_APF_HALT          12
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID	0
 
@@ -74,6 +75,26 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 			      struct kvm_io_device *dev);
 
+#ifdef CONFIG_KVM_ASYNC_PF
+struct kvm_async_pf {
+	struct work_struct work;
+	struct list_head link;
+	struct list_head queue;
+	struct kvm_vcpu *vcpu;
+	struct mm_struct *mm;
+	gva_t gva;
+	unsigned long addr;
+	struct kvm_arch_async_pf arch;
+	struct page *page;
+	bool done;
+};
+
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+		       struct kvm_arch_async_pf *arch);
+#endif
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -104,6 +125,15 @@ struct kvm_vcpu {
 	gpa_t mmio_phys_addr;
 #endif
 
+#ifdef CONFIG_KVM_ASYNC_PF
+	struct {
+		u32 queued;
+		struct list_head queue;
+		struct list_head done;
+		spinlock_t lock;
+	} async_pf;
+#endif
+
 	struct kvm_vcpu_arch arch;
 };
 
@@ -302,6 +332,7 @@ void kvm_set_page_accessed(struct page *page);
 
 pfn_t hva_to_pfn_atomic(struct kvm *kvm, unsigned long addr);
 pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
+pfn_t gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, bool *async);
 pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
 			 struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 6dd3a51..a78a5e5 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -185,6 +185,96 @@ TRACE_EVENT(kvm_age_page,
 		  __entry->referenced ? "YOUNG" : "OLD")
 );
 
+#ifdef CONFIG_KVM_ASYNC_PF
+TRACE_EVENT(
+	kvm_try_async_get_page,
+	TP_PROTO(bool async, u64 pfn),
+	TP_ARGS(async, pfn),
+
+	TP_STRUCT__entry(
+		__field(__u64, pfn)
+		),
+
+	TP_fast_assign(
+		__entry->pfn = (!async) ? pfn : (u64)-1;
+		),
+
+	TP_printk("pfn %#llx", __entry->pfn)
+);
+
+TRACE_EVENT(
+	kvm_async_pf_not_present,
+	TP_PROTO(u64 gva),
+	TP_ARGS(gva),
+
+	TP_STRUCT__entry(
+		__field(__u64, gva)
+		),
+
+	TP_fast_assign(
+		__entry->gva = gva;
+		),
+
+	TP_printk("gva %#llx not present", __entry->gva)
+);
+
+TRACE_EVENT(
+	kvm_async_pf_ready,
+	TP_PROTO(u64 gva),
+	TP_ARGS(gva),
+
+	TP_STRUCT__entry(
+		__field(__u64, gva)
+		),
+
+	TP_fast_assign(
+		__entry->gva = gva;
+		),
+
+	TP_printk("gva %#llx ready", __entry->gva)
+);
+
+TRACE_EVENT(
+	kvm_async_pf_completed,
+	TP_PROTO(unsigned long address, struct page *page, u64 gva),
+	TP_ARGS(address, page, gva),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, address)
+		__field(pfn_t, pfn)
+		__field(u64, gva)
+		),
+
+	TP_fast_assign(
+		__entry->address = address;
+		__entry->pfn = page ? page_to_pfn(page) : 0;
+		__entry->gva = gva;
+		),
+
+	TP_printk("gva %#llx address %#lx pfn %#llx",  __entry->gva,
+		  __entry->address, __entry->pfn)
+);
+
+TRACE_EVENT(
+	kvm_async_pf_doublefault,
+	TP_PROTO(u64 gva, u64 gfn),
+	TP_ARGS(gva, gfn),
+
+	TP_STRUCT__entry(
+		__field(u64, gva)
+		__field(u64, gfn)
+		),
+
+	TP_fast_assign(
+		__entry->gva = gva;
+		__entry->gfn = gfn;
+		),
+
+	TP_printk("gva = %#llx, gfn = %#llx", __entry->gva, __entry->gfn)
+);
+
+#endif
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 7f1178f..f63ccb0 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -15,3 +15,6 @@ config KVM_APIC_ARCHITECTURE
 
 config KVM_MMIO
        bool
+
+config KVM_ASYNC_PF
+       bool
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
new file mode 100644
index 0000000..8b144d5
--- /dev/null
+++ b/virt/kvm/async_pf.c
@@ -0,0 +1,190 @@
+/*
+ * kvm asynchronous fault support
+ *
+ * Copyright 2010 Red Hat, Inc.
+ *
+ * Author:
+ *      Gleb Natapov <gleb@redhat.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/mmu_context.h>
+
+#include "async_pf.h"
+#include <trace/events/kvm.h>
+
+static struct kmem_cache *async_pf_cache;
+
+int kvm_async_pf_init(void)
+{
+	async_pf_cache = KMEM_CACHE(kvm_async_pf, 0);
+
+	if (!async_pf_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void kvm_async_pf_deinit(void)
+{
+	if (async_pf_cache)
+		kmem_cache_destroy(async_pf_cache);
+	async_pf_cache = NULL;
+}
+
+void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu)
+{
+	INIT_LIST_HEAD(&vcpu->async_pf.done);
+	INIT_LIST_HEAD(&vcpu->async_pf.queue);
+	spin_lock_init(&vcpu->async_pf.lock);
+}
+
+static void async_pf_execute(struct work_struct *work)
+{
+	struct page *page = NULL;
+	struct kvm_async_pf *apf =
+		container_of(work, struct kvm_async_pf, work);
+	struct mm_struct *mm = apf->mm;
+	struct kvm_vcpu *vcpu = apf->vcpu;
+	unsigned long addr = apf->addr;
+	gva_t gva = apf->gva;
+
+	might_sleep();
+
+	use_mm(mm);
+	down_read(&mm->mmap_sem);
+	get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
+	up_read(&mm->mmap_sem);
+	unuse_mm(mm);
+
+	spin_lock(&vcpu->async_pf.lock);
+	list_add_tail(&apf->link, &vcpu->async_pf.done);
+	apf->page = page;
+	apf->done = true;
+	spin_unlock(&vcpu->async_pf.lock);
+
+	/*
+	 * apf may be freed by kvm_check_async_pf_completion() after
+	 * this point
+	 */
+
+	trace_kvm_async_pf_completed(addr, page, gva);
+
+	if (waitqueue_active(&vcpu->wq))
+		wake_up_interruptible(&vcpu->wq);
+
+	mmdrop(mm);
+	kvm_put_kvm(vcpu->kvm);
+}
+
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
+{
+	/* cancel outstanding work queue item */
+	while (!list_empty(&vcpu->async_pf.queue)) {
+		struct kvm_async_pf *work =
+			list_entry(vcpu->async_pf.queue.next,
+				   typeof(*work), queue);
+		cancel_work_sync(&work->work);
+		list_del(&work->queue);
+		if (!work->done) /* work was canceled */
+			kmem_cache_free(async_pf_cache, work);
+	}
+
+	spin_lock(&vcpu->async_pf.lock);
+	while (!list_empty(&vcpu->async_pf.done)) {
+		struct kvm_async_pf *work =
+			list_entry(vcpu->async_pf.done.next,
+				   typeof(*work), link);
+		list_del(&work->link);
+		if (work->page)
+			put_page(work->page);
+		kmem_cache_free(async_pf_cache, work);
+	}
+	spin_unlock(&vcpu->async_pf.lock);
+
+	vcpu->async_pf.queued = 0;
+}
+
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
+{
+	struct kvm_async_pf *work;
+
+	if (list_empty_careful(&vcpu->async_pf.done))
+		return;
+
+	spin_lock(&vcpu->async_pf.lock);
+	work = list_first_entry(&vcpu->async_pf.done, typeof(*work), link);
+	list_del(&work->link);
+	spin_unlock(&vcpu->async_pf.lock);
+
+	kvm_arch_async_page_present(vcpu, work);
+
+	list_del(&work->queue);
+	vcpu->async_pf.queued--;
+	if (work->page)
+		put_page(work->page);
+	kmem_cache_free(async_pf_cache, work);
+}
+
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+		       struct kvm_arch_async_pf *arch)
+{
+	struct kvm_async_pf *work;
+
+	if (vcpu->async_pf.queued >= ASYNC_PF_PER_VCPU)
+		return 0;
+
+	/* setup delayed work */
+
+	/*
+	 * do alloc nowait since if we are going to sleep anyway we
+	 * may as well sleep faulting in page
+	 */
+	work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
+	if (!work)
+		return 0;
+
+	work->page = NULL;
+	work->done = false;
+	work->vcpu = vcpu;
+	work->gva = gva;
+	work->addr = gfn_to_hva(vcpu->kvm, gfn);
+	work->arch = *arch;
+	work->mm = current->mm;
+	atomic_inc(&work->mm->mm_count);
+	kvm_get_kvm(work->vcpu->kvm);
+
+	/* this can't really happen, otherwise gfn_to_pfn_async
+	   would have succeeded */
+	if (unlikely(kvm_is_error_hva(work->addr)))
+		goto retry_sync;
+
+	INIT_WORK(&work->work, async_pf_execute);
+	if (!schedule_work(&work->work))
+		goto retry_sync;
+
+	list_add_tail(&work->queue, &vcpu->async_pf.queue);
+	vcpu->async_pf.queued++;
+	kvm_arch_async_page_not_present(vcpu, work);
+	return 1;
+retry_sync:
+	kvm_put_kvm(work->vcpu->kvm);
+	mmdrop(work->mm);
+	kmem_cache_free(async_pf_cache, work);
+	return 0;
+}
diff --git a/virt/kvm/async_pf.h b/virt/kvm/async_pf.h
new file mode 100644
index 0000000..fa15074
--- /dev/null
+++ b/virt/kvm/async_pf.h
@@ -0,0 +1,36 @@
+/*
+ * kvm asynchronous fault support
+ *
+ * Copyright 2010 Red Hat, Inc.
+ *
+ * Author:
+ *      Gleb Natapov <gleb@redhat.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __KVM_ASYNC_PF_H__
+#define __KVM_ASYNC_PF_H__
+
+#ifdef CONFIG_KVM_ASYNC_PF
+int kvm_async_pf_init(void);
+void kvm_async_pf_deinit(void);
+void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu);
+#else
+#define kvm_async_pf_init() (0)
+#define kvm_async_pf_deinit() do {} while (0)
+#define kvm_async_pf_vcpu_init(C) do {} while (0)
+#endif
+
+#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1aeeb7f..238079e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -55,6 +55,7 @@
 #include <asm-generic/bitops/le.h>
 
 #include "coalesced_mmio.h"
+#include "async_pf.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>
@@ -186,6 +187,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->kvm = kvm;
 	vcpu->vcpu_id = id;
 	init_waitqueue_head(&vcpu->wq);
+	kvm_async_pf_vcpu_init(vcpu);
 
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
@@ -948,17 +950,29 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(gfn_to_hva);
 
-static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic)
+static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic,
+			bool *async)
 {
 	struct page *page[1];
-	int npages;
+	int npages = 0;
 	pfn_t pfn;
 
-	if (atomic)
+	/* we can do it either atomically or asynchronously, not both */
+	BUG_ON(atomic && async);
+
+	if (atomic || async)
 		npages = __get_user_pages_fast(addr, 1, 1, page);
-	else {
+
+	if (unlikely(npages != 1) && !atomic) {
 		might_sleep();
-		npages = get_user_pages_fast(addr, 1, 1, page);
+
+		if (async) {
+			down_read(&current->mm->mmap_sem);
+			npages = get_user_pages_noio(current, current->mm,
+						     addr, 1, 1, 0, page, NULL);
+			up_read(&current->mm->mmap_sem);
+		} else
+			npages = get_user_pages_fast(addr, 1, 1, page);
 	}
 
 	if (unlikely(npages != 1)) {
@@ -978,6 +992,9 @@ static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic)
 
 		if (vma == NULL || addr < vma->vm_start ||
 		    !(vma->vm_flags & VM_PFNMAP)) {
+			if (async && vma && !(vma->vm_flags & VM_PFNMAP) &&
+			    (vma->vm_flags & VM_WRITE))
+				*async = true;
 			up_read(&current->mm->mmap_sem);
 return_fault_page:
 			get_page(fault_page);
@@ -995,32 +1012,41 @@ return_fault_page:
 
 pfn_t hva_to_pfn_atomic(struct kvm *kvm, unsigned long addr)
 {
-	return hva_to_pfn(kvm, addr, true);
+	return hva_to_pfn(kvm, addr, true, NULL);
 }
 EXPORT_SYMBOL_GPL(hva_to_pfn_atomic);
 
-static pfn_t __gfn_to_pfn(struct kvm *kvm, gfn_t gfn, bool atomic)
+static pfn_t __gfn_to_pfn(struct kvm *kvm, gfn_t gfn, bool atomic, bool *async)
 {
 	unsigned long addr;
 
+	if (async)
+		*async = false;
+
 	addr = gfn_to_hva(kvm, gfn);
 	if (kvm_is_error_hva(addr)) {
 		get_page(bad_page);
 		return page_to_pfn(bad_page);
 	}
 
-	return hva_to_pfn(kvm, addr, atomic);
+	return hva_to_pfn(kvm, addr, atomic, async);
 }
 
 pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
 {
-	return __gfn_to_pfn(kvm, gfn, true);
+	return __gfn_to_pfn(kvm, gfn, true, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_atomic);
 
+pfn_t gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, bool *async)
+{
+	return __gfn_to_pfn(kvm, gfn, false, async);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_async);
+
 pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 {
-	return __gfn_to_pfn(kvm, gfn, false);
+	return __gfn_to_pfn(kvm, gfn, false, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
@@ -1028,7 +1054,7 @@ pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
 			 struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	unsigned long addr = gfn_to_hva_memslot(slot, gfn);
-	return hva_to_pfn(kvm, addr, false);
+	return hva_to_pfn(kvm, addr, false, NULL);
 }
 
 int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages,
@@ -2335,6 +2361,10 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 		goto out_free_5;
 	}
 
+	r = kvm_async_pf_init();
+	if (r)
+		goto out_free;
+
 	kvm_chardev_ops.owner = module;
 	kvm_vm_fops.owner = module;
 	kvm_vcpu_fops.owner = module;
@@ -2342,7 +2372,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	r = misc_register(&kvm_dev);
 	if (r) {
 		printk(KERN_ERR "kvm: misc device register failed\n");
-		goto out_free;
+		goto out_unreg;
 	}
 
 	kvm_preempt_ops.sched_in = kvm_sched_in;
@@ -2352,6 +2382,8 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 
 	return 0;
 
+out_unreg:
+	kvm_async_pf_deinit();
 out_free:
 	kmem_cache_destroy(kvm_vcpu_cache);
 out_free_5:
@@ -2384,6 +2416,7 @@ void kvm_exit(void)
 	kvm_exit_debug();
 	misc_deregister(&kvm_dev);
 	kmem_cache_destroy(kvm_vcpu_cache);
+	kvm_async_pf_deinit();
 	sysdev_unregister(&kvm_sysdev);
 	sysdev_class_unregister(&kvm_sysdev_class);
 	unregister_reboot_notifier(&kvm_reboot_notifier);
-- 
1.7.1
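
A note on the apf.gfns[] table added in x86.c above: it is a small
open-addressed hash with linear probing, and kvm_del_async_pf_gfn()
avoids tombstones by shifting later entries back into the hole (the
"k lies cyclically in ]i,j]" test). Here is a stand-alone toy model of
just that probing logic; the trivial hash function is a simplification,
and NSLOTS stands in for roundup_pow_of_two(ASYNC_PF_PER_VCPU):

	#include <stdio.h>

	#define NSLOTS 8u
	#define EMPTY  (~0UL)

	static unsigned long table[NSLOTS];

	static unsigned hash(unsigned long gfn) { return gfn & (NSLOTS - 1); }
	static unsigned next(unsigned key) { return (key + 1) & (NSLOTS - 1); }

	static void add(unsigned long gfn)
	{
		unsigned key = hash(gfn);

		while (table[key] != EMPTY)
			key = next(key);
		table[key] = gfn;
	}

	static unsigned slot(unsigned long gfn)
	{
		unsigned key = hash(gfn);

		while (table[key] != gfn && table[key] != EMPTY)
			key = next(key);
		return key;
	}

	static void del(unsigned long gfn)
	{
		unsigned i, j, k;

		i = j = slot(gfn);
		for (;;) {
			table[i] = EMPTY;
			do {
				j = next(j);
				if (table[j] == EMPTY)
					return;
				k = hash(table[j]);
				/* same cyclic-interval test as the patch */
			} while ((i <= j) ? (i < k && k <= j)
					  : (i < k || k <= j));
			table[i] = table[j];	/* shift entry back into hole */
			i = j;
		}
	}

	int main(void)
	{
		unsigned n;

		for (n = 0; n < NSLOTS; n++)
			table[n] = EMPTY;
		add(1); add(9); add(17);   /* all hash to slot 1; land in 1,2,3 */
		del(9);			   /* 17 must shift back to slot 2 */
		printf("gfn 17 found at slot %u\n", slot(17));
		return 0;
	}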


* [PATCH v7 03/12] Retry fault before vmentry
  2010-10-14  9:22 ` Gleb Natapov
@ 2010-10-14  9:22   ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-14  9:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

When a page is swapped in, it is mapped into guest memory only after the
guest tries to access it again and generates another fault. To save this
second fault we can map the page immediately, since we know the guest is
going to access it. Do it only when tdp is enabled for now. The shadow
paging case is more complicated: the CR[034] and EFER registers would
have to be switched before doing the mapping and then switched back.
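
A minimal stand-alone sketch of the completion-time decision (the names
below are illustrative stand-ins, not the KVM symbols; in the real code
the check is vcpu->arch.mmu.direct_map and the prefault goes through
vcpu->arch.mmu.page_fault() with no_apf set):

	#include <stdbool.h>
	#include <stdio.h>

	struct apf_work { bool have_page; unsigned long gva; };

	static bool direct_map = true;	/* tdp: no guest CR3/EFER state needed */

	/* mirrors kvm_arch_async_page_ready(): prefault before vmentry */
	static void page_ready(struct apf_work *w)
	{
		if (!direct_map || !w->have_page)
			return;	/* shadow paging would need CR[034]/EFER switched */
		printf("map gva %#lx now; guest skips a second fault\n", w->gva);
	}

	int main(void)
	{
		struct apf_work w = { .have_page = true, .gva = 0x1000 };

		page_ready(&w);
		return 0;
	}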

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    4 +++-
 arch/x86/kvm/mmu.c              |   16 ++++++++--------
 arch/x86/kvm/paging_tmpl.h      |    6 +++---
 arch/x86/kvm/x86.c              |    7 +++++++
 virt/kvm/async_pf.c             |    2 ++
 5 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 043e29e..96aca44 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -241,7 +241,7 @@ struct kvm_mmu {
 	void (*new_cr3)(struct kvm_vcpu *vcpu);
 	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
 	unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
-	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
+	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, bool no_apf);
 	void (*inject_page_fault)(struct kvm_vcpu *vcpu);
 	void (*free)(struct kvm_vcpu *vcpu);
 	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
@@ -839,6 +839,8 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 				     struct kvm_async_pf *work);
 void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work);
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+			       struct kvm_async_pf *work);
 extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f01e89a..11d152b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2568,7 +2568,7 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr,
 }
 
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
-				u32 error_code)
+				u32 error_code, bool no_apf)
 {
 	gfn_t gfn;
 	int r;
@@ -2604,8 +2604,8 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 	return kvm_x86_ops->interrupt_allowed(vcpu);
 }
 
-static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
-			 pfn_t *pfn)
+static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
+			 gva_t gva, pfn_t *pfn)
 {
 	bool async;
 
@@ -2616,7 +2616,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 
 	put_page(pfn_to_page(*pfn));
 
-	if (can_do_async_pf(vcpu)) {
+	if (!no_apf && can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(async, *pfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			trace_kvm_async_pf_doublefault(gva, gfn);
@@ -2631,8 +2631,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 	return false;
 }
 
-static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
-				u32 error_code)
+static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
+			  bool no_apf)
 {
 	pfn_t pfn;
 	int r;
@@ -2654,7 +2654,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, gfn, gpa, &pfn))
+	if (try_async_pf(vcpu, no_apf, gfn, gpa, &pfn))
 		return 0;
 
 	/* mmio */
@@ -3317,7 +3317,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code)
 	int r;
 	enum emulation_result er;
 
-	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code);
+	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
 	if (r < 0)
 		goto out;
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index c45376d..d6b281e 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -527,8 +527,8 @@ out_gpte_changed:
  *  Returns: 1 if we need to emulate the instruction, 0 otherwise, or
  *           a negative value on error.
  */
-static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
-			       u32 error_code)
+static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
+			     bool no_apf)
 {
 	int write_fault = error_code & PFERR_WRITE_MASK;
 	int user_fault = error_code & PFERR_USER_MASK;
@@ -569,7 +569,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, walker.gfn, addr, &pfn))
+	if (try_async_pf(vcpu, no_apf, walker.gfn, addr, &pfn))
 		return 0;
 
 	/* mmio */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 09e72fc..bf37397 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6131,6 +6131,13 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
 
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
+{
+	if (!vcpu->arch.mmu.direct_map || is_error_page(work->page))
+		return;
+	vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true);
+}
+
 static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
 {
 	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 8b144d5..41607ed 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -132,6 +132,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 	list_del(&work->link);
 	spin_unlock(&vcpu->async_pf.lock);
 
+	if (work->page)
+		kvm_arch_async_page_ready(vcpu, work);
 	kvm_arch_async_page_present(vcpu, work);
 
 	list_del(&work->queue);
-- 
1.7.1


* [PATCH v7 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-14  9:22 ` Gleb Natapov
@ 2010-10-14  9:22   ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-14  9:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Keep track of memslot changes by keeping a generation number in the
memslots structure. Provide a kvm_write_guest_cached() function that
skips the gfn_to_hva() translation when the memslots have not changed
since the previous invocation.
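
The invalidation scheme in miniature (a stand-alone toy model with
simplified types; the real cache is struct gfn_to_hva_cache and the slow
path it avoids is the gfn_to_hva() memslot walk):

	#include <stdio.h>

	struct slots { unsigned long long generation; };
	struct cache { unsigned long long generation, gpa, hva; };

	/* stand-in for the memslot walk we want to avoid */
	static unsigned long long slow_translate(unsigned long long gpa)
	{
		printf("  (slow translation of gpa %#llx)\n", gpa);
		return gpa + 0x1000000;
	}

	static void cache_init(struct slots *s, struct cache *c,
			       unsigned long long gpa)
	{
		c->gpa = gpa;
		c->generation = s->generation;
		c->hva = slow_translate(gpa);
	}

	static unsigned long long cached_hva(struct slots *s, struct cache *c)
	{
		if (s->generation != c->generation)	/* memslots changed */
			cache_init(s, c, c->gpa);
		return c->hva;
	}

	int main(void)
	{
		struct slots s = { .generation = 1 };
		struct cache c;

		cache_init(&s, &c, 0x1000);
		printf("write via hva %#llx\n", cached_hva(&s, &c)); /* hit */
		s.generation++;	/* e.g. a KVM_SET_USER_MEMORY_REGION ioctl */
		printf("write via hva %#llx\n", cached_hva(&s, &c)); /* refresh */
		return 0;
	}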

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 include/linux/kvm_host.h  |    7 ++++
 include/linux/kvm_types.h |    7 ++++
 virt/kvm/kvm_main.c       |   75 +++++++++++++++++++++++++++++++++++++-------
 3 files changed, 77 insertions(+), 12 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9a9b017..dda88f2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -199,6 +199,7 @@ struct kvm_irq_routing_table {};
 
 struct kvm_memslots {
 	int nmemslots;
+	u64 generation;
 	struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
 					KVM_PRIVATE_MEM_SLOTS];
 };
@@ -352,12 +353,18 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
 			 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 		    unsigned long len);
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len);
+int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			      gpa_t gpa);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn);
 
 void kvm_vcpu_block(struct kvm_vcpu *vcpu);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 7ac0d4e..fa7cc72 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -67,4 +67,11 @@ struct kvm_lapic_irq {
 	u32 dest_id;
 };
 
+struct gfn_to_hva_cache {
+	u64 generation;
+	gpa_t gpa;
+	unsigned long hva;
+	struct kvm_memory_slot *memslot;
+};
+
 #endif /* __KVM_TYPES_H__ */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 238079e..5d57ec9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -687,6 +687,7 @@ skip_lpage:
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 		if (mem->slot >= slots->nmemslots)
 			slots->nmemslots = mem->slot + 1;
+		slots->generation++;
 		slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
 
 		old_memslots = kvm->memslots;
@@ -723,6 +724,7 @@ skip_lpage:
 	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 	if (mem->slot >= slots->nmemslots)
 		slots->nmemslots = mem->slot + 1;
+	slots->generation++;
 
 	/* actual memory is freed via old in kvm_free_physmem_slot below */
 	if (!npages) {
@@ -853,10 +855,10 @@ int kvm_is_error_hva(unsigned long addr)
 }
 EXPORT_SYMBOL_GPL(kvm_is_error_hva);
 
-struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
+static struct kvm_memory_slot *__gfn_to_memslot(struct kvm_memslots *slots,
+						gfn_t gfn)
 {
 	int i;
-	struct kvm_memslots *slots = kvm_memslots(kvm);
 
 	for (i = 0; i < slots->nmemslots; ++i) {
 		struct kvm_memory_slot *memslot = &slots->memslots[i];
@@ -867,6 +869,11 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 	}
 	return NULL;
 }
+
+struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
+{
+	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
+}
 EXPORT_SYMBOL_GPL(gfn_to_memslot);
 
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
@@ -929,12 +936,9 @@ int memslot_id(struct kvm *kvm, gfn_t gfn)
 	return memslot - slots->memslots;
 }
 
-static unsigned long gfn_to_hva_many(struct kvm *kvm, gfn_t gfn,
+static unsigned long gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
 				     gfn_t *nr_pages)
 {
-	struct kvm_memory_slot *slot;
-
-	slot = gfn_to_memslot(kvm, gfn);
 	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
 		return bad_hva();
 
@@ -946,7 +950,7 @@ static unsigned long gfn_to_hva_many(struct kvm *kvm, gfn_t gfn,
 
 unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
 {
-	return gfn_to_hva_many(kvm, gfn, NULL);
+	return gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_hva);
 
@@ -1063,7 +1067,7 @@ int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages,
 	unsigned long addr;
 	gfn_t entry;
 
-	addr = gfn_to_hva_many(kvm, gfn, &entry);
+	addr = gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, &entry);
 	if (kvm_is_error_hva(addr))
 		return -1;
 
@@ -1247,6 +1251,47 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 	return 0;
 }
 
+int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			      gpa_t gpa)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	int offset = offset_in_page(gpa);
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+
+	ghc->gpa = gpa;
+	ghc->generation = slots->generation;
+	ghc->memslot = __gfn_to_memslot(slots, gfn);
+	ghc->hva = gfn_to_hva_many(ghc->memslot, gfn, NULL);
+	if (!kvm_is_error_hva(ghc->hva))
+		ghc->hva += offset;
+	else
+		return -EFAULT;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);
+
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	int r;
+
+	if (slots->generation != ghc->generation)
+		kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);
+
+	if (kvm_is_error_hva(ghc->hva))
+		return -EFAULT;
+
+	r = copy_to_user((void __user *)ghc->hva, data, len);
+	if (r)
+		return -EFAULT;
+	mark_page_dirty_in_slot(kvm, ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
+
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
 {
 	return kvm_write_guest_page(kvm, gfn, empty_zero_page, offset, len);
@@ -1272,11 +1317,9 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn)
 {
-	struct kvm_memory_slot *memslot;
-
-	memslot = gfn_to_memslot(kvm, gfn);
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 
@@ -1284,6 +1327,14 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	}
 }
 
+void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *memslot;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	mark_page_dirty_in_slot(kvm, memslot, gfn);
+}
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
-- 
1.7.1


* [PATCH v7 05/12] Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
  2010-10-14  9:22 ` Gleb Natapov
@ 2010-10-14  9:22   ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-14  9:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Async PF also needs to hook into smp_prepare_boot_cpu(), so move the
hook into generic code.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    1 +
 arch/x86/kernel/kvm.c           |   11 +++++++++++
 arch/x86/kernel/kvmclock.c      |   13 +------------
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 7b562b6..e3faaaf 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -65,6 +65,7 @@ struct kvm_mmu_op_release_pt {
 #include <asm/processor.h>
 
 extern void kvmclock_init(void);
+extern int kvm_register_clock(char *txt);
 
 
 /* This instruction is vmcall.  On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 63b0ec8..e6db179 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -231,10 +231,21 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+#ifdef CONFIG_SMP
+static void __init kvm_smp_prepare_boot_cpu(void)
+{
+	WARN_ON(kvm_register_clock("primary cpu clock"));
+	native_smp_prepare_boot_cpu();
+}
+#endif
+
 void __init kvm_guest_init(void)
 {
 	if (!kvm_para_available())
 		return;
 
 	paravirt_ops_setup();
+#ifdef CONFIG_SMP
+	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+#endif
 }
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ca43ce3..f98d3ea 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -125,7 +125,7 @@ static struct clocksource kvm_clock = {
 	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
 };
 
-static int kvm_register_clock(char *txt)
+int kvm_register_clock(char *txt)
 {
 	int cpu = smp_processor_id();
 	int low, high, ret;
@@ -152,14 +152,6 @@ static void __cpuinit kvm_setup_secondary_clock(void)
 }
 #endif
 
-#ifdef CONFIG_SMP
-static void __init kvm_smp_prepare_boot_cpu(void)
-{
-	WARN_ON(kvm_register_clock("primary cpu clock"));
-	native_smp_prepare_boot_cpu();
-}
-#endif
-
 /*
  * After the clock is registered, the host will keep writing to the
  * registered memory location. If the guest happens to shutdown, this memory
@@ -206,9 +198,6 @@ void __init kvmclock_init(void)
 	x86_cpuinit.setup_percpu_clockev =
 		kvm_setup_secondary_clock;
 #endif
-#ifdef CONFIG_SMP
-	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
-#endif
 	machine_ops.shutdown  = kvm_shutdown;
 #ifdef CONFIG_KEXEC
 	machine_ops.crash_shutdown  = kvm_crash_shutdown;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 06/12] Add PV MSR to enable asynchronous page fault delivery.
  2010-10-14  9:22 ` Gleb Natapov
@ 2010-10-14  9:22   ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-14  9:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

The guest enables per-vcpu async PF functionality by writing to this MSR.
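
For illustration, here is a rough guest-side sketch of the protocol
this MSR sets up (the struct and helper names are made up for the
example, not the final in-tree ones; the layout and the
read-then-reset discipline follow the msr.txt text below):

	/* the 64-byte, 64-byte-aligned area registered through the MSR */
	struct apf_shared {
		__u32 reason;	/* 0 = none, 1 = not present, 2 = ready */
		__u8 pad[60];
	};

	/*
	 * Must be called with interrupts disabled so that a second APF
	 * cannot overwrite the reason before it is consumed.
	 */
	static u32 apf_consume_reason(struct apf_shared *s)
	{
		u32 reason = s->reason;

		s->reason = 0;
		return reason;
	}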

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 Documentation/kvm/cpuid.txt     |    3 +++
 Documentation/kvm/msr.txt       |   36 +++++++++++++++++++++++++++++++++++-
 arch/x86/include/asm/kvm_host.h |    2 ++
 arch/x86/include/asm/kvm_para.h |    4 ++++
 arch/x86/kvm/x86.c              |   38 ++++++++++++++++++++++++++++++++++++--
 include/linux/kvm.h             |    1 +
 include/linux/kvm_host.h        |    1 +
 virt/kvm/async_pf.c             |   20 ++++++++++++++++++++
 8 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/Documentation/kvm/cpuid.txt b/Documentation/kvm/cpuid.txt
index 14a12ea..8820685 100644
--- a/Documentation/kvm/cpuid.txt
+++ b/Documentation/kvm/cpuid.txt
@@ -36,6 +36,9 @@ KVM_FEATURE_MMU_OP                 ||     2 || deprecated.
 KVM_FEATURE_CLOCKSOURCE2           ||     3 || kvmclock available at msrs
                                    ||       || 0x4b564d00 and 0x4b564d01
 ------------------------------------------------------------------------------
+KVM_FEATURE_ASYNC_PF               ||     4 || async pf can be enabled by
+                                   ||       || writing to msr 0x4b564d02
+------------------------------------------------------------------------------
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
                                    ||       || per-cpu warps are expected in
                                    ||       || kvmclock.
diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
index 8ddcfe8..27c11a6 100644
--- a/Documentation/kvm/msr.txt
+++ b/Documentation/kvm/msr.txt
@@ -3,7 +3,6 @@ Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
 =====================================================
 
 KVM makes use of some custom MSRs to service some requests.
-At present, this facility is only used by kvmclock.
 
 Custom MSRs have a range reserved for them, that goes from
 0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
@@ -151,3 +150,38 @@ MSR_KVM_SYSTEM_TIME: 0x12
 			return PRESENT;
 		} else
 			return NON_PRESENT;
+
+MSR_KVM_ASYNC_PF_EN: 0x4b564d02
+	data: Bits 63-6 hold the 64-byte aligned physical address of a
+	64-byte memory area which must be in guest RAM and must be
+	zeroed. Bits 5-1 are reserved and should be zero. Bit 0 is 1
+	when asynchronous page faults are enabled on the vcpu, 0 when
+	they are disabled.
+
+	The first 4 bytes of the 64-byte memory area will be written
+	by the hypervisor at the time of asynchronous page fault (APF)
+	injection to indicate the type of asynchronous page fault. A
+	value of 1 means that the page referred to by the page fault
+	is not present. A value of 2 means that the page is now
+	available. Disabling interrupts inhibits APFs. The guest must
+	not enable interrupts before the reason is read, or it may be
+	overwritten by another APF. Since APF uses the same exception
+	vector as a regular page fault, the guest must reset the
+	reason to 0 before it does anything that can generate a normal
+	page fault. If the APF reason is 0 during a page fault, it is
+	a regular page fault.
+
+	During delivery of a type 1 APF, cr2 contains a token that
+	will be used to notify the guest when the missing page becomes
+	available. When the page becomes available, a type 2 APF is
+	sent with cr2 set to the token associated with the page. There
+	is a special token, 0xffffffff, which tells the vcpu to wake
+	up all processes waiting for APFs; no individual type 2 APFs
+	will be sent for them.
+
+	If APF is disabled while there are outstanding APFs, they will
+	not be delivered.
+
+	Currently a type 2 APF will always be delivered on the same
+	vcpu as the type 1 APF was, but the guest should not rely on that.
+
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 96aca44..26b2064 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -419,6 +419,8 @@ struct kvm_vcpu_arch {
 	struct {
 		bool halted;
 		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
+		struct gfn_to_hva_cache data;
+		u64 msr_val;
 	} apf;
 };
 
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index e3faaaf..8662ae0 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -20,6 +20,7 @@
  * are available. The use of 0x11 and 0x12 is deprecated
  */
 #define KVM_FEATURE_CLOCKSOURCE2        3
+#define KVM_FEATURE_ASYNC_PF		4
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -32,9 +33,12 @@
 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
+#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 
 #define KVM_MAX_MMU_OP_BATCH           32
 
+#define KVM_ASYNC_PF_ENABLED			(1 << 0)
+
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE            1
 #define KVM_MMU_OP_FLUSH_TLB	        2
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bf37397..68a3a06 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -788,12 +788,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	7
+#define KVM_SAVE_MSRS_BEGIN	8
 static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
 	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
-	HV_X64_MSR_APIC_ASSIST_PAGE,
+	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1431,6 +1431,29 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	return 0;
 }
 
+static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
+{
+	gpa_t gpa = data & ~0x3f;
+
+	/* Bits 1:5 are reserved, should be zero */
+	if (data & 0x3e)
+		return 1;
+
+	vcpu->arch.apf.msr_val = data;
+
+	if (!(data & KVM_ASYNC_PF_ENABLED)) {
+		kvm_clear_async_pf_completion_queue(vcpu);
+		kvm_async_pf_hash_reset(vcpu);
+		return 0;
+	}
+
+	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa))
+		return 1;
+
+	kvm_async_pf_wakeup_all(vcpu);
+	return 0;
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
 	switch (msr) {
@@ -1512,6 +1535,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		}
 		break;
 	}
+	case MSR_KVM_ASYNC_PF_EN:
+		if (kvm_pv_enable_async_pf(vcpu, data))
+			return 1;
+		break;
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1788,6 +1815,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 	case MSR_KVM_SYSTEM_TIME_NEW:
 		data = vcpu->arch.time;
 		break;
+	case MSR_KVM_ASYNC_PF_EN:
+		data = vcpu->arch.apf.msr_val;
+		break;
 	case MSR_IA32_P5_MC_ADDR:
 	case MSR_IA32_P5_MC_TYPE:
 	case MSR_IA32_MCG_CAP:
@@ -1935,6 +1965,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_DEBUGREGS:
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
+	case KVM_CAP_ASYNC_PF:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -5784,6 +5815,8 @@ free_vcpu:
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
+	vcpu->arch.apf.msr_val = 0;
+
 	vcpu_load(vcpu);
 	kvm_mmu_unload(vcpu);
 	vcpu_put(vcpu);
@@ -5803,6 +5836,7 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.dr7 = DR7_FIXED_1;
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
+	vcpu->arch.apf.msr_val = 0;
 
 	kvm_clear_async_pf_completion_queue(vcpu);
 	kvm_async_pf_hash_reset(vcpu);
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 919ae53..ea2dc1a 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -540,6 +540,7 @@ struct kvm_ppc_pvinfo {
 #endif
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
+#define KVM_CAP_ASYNC_PF 59
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index dda88f2..5d09197 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -93,6 +93,7 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
 void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
 int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
 		       struct kvm_arch_async_pf *arch);
+int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
 struct kvm_vcpu {
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 41607ed..b276b06 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -190,3 +190,23 @@ retry_sync:
 	kmem_cache_free(async_pf_cache, work);
 	return 0;
 }
+
+int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
+{
+	struct kvm_async_pf *work;
+
+	if (!list_empty(&vcpu->async_pf.done))
+		return 0;
+
+	work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
+	if (!work)
+		return -ENOMEM;
+
+	work->page = bad_page;
+	get_page(bad_page);
+	INIT_LIST_HEAD(&work->queue); /* for list_del to work */
+
+	list_add_tail(&work->link, &vcpu->async_pf.done);
+	vcpu->async_pf.queued++;
+	return 0;
+}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 07/12] Add async PF initialization to PV guest.
  2010-10-14  9:22 ` Gleb Natapov
@ 2010-10-14  9:22   ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-14  9:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Enable async PF in a guest if the async PF capability is discovered.
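
As a minimal sketch, the per-cpu enable sequence below boils down to
the following (assuming a 64-byte aligned per-cpu area, as
MSR_KVM_ASYNC_PF_EN requires; error handling omitted):

	static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason)
		__aligned(64);

	u64 pa = __pa(&__get_cpu_var(apf_reason));

	/* bits 63-6: address of the area, bit 0: enable */
	wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);

	/* and to disable again, e.g. on reboot or cpu offline */
	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);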

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 Documentation/kernel-parameters.txt |    3 +
 arch/x86/include/asm/kvm_para.h     |    6 ++
 arch/x86/kernel/kvm.c               |   92 +++++++++++++++++++++++++++++++++++
 3 files changed, 101 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 8dc2548..0bd2203 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1699,6 +1699,9 @@ and is between 256 and 4096 characters. It is defined in the file
 
 	no-kvmclock	[X86,KVM] Disable paravirtualized KVM clock driver
 
+	no-kvmapf	[X86,KVM] Disable paravirtualized asynchronous page
+			fault handling.
+
 	nolapic		[X86-32,APIC] Do not enable or use the local APIC.
 
 	nolapic_timer	[X86-32,APIC] Do not use the local APIC timer.
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 8662ae0..2315398 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -65,6 +65,12 @@ struct kvm_mmu_op_release_pt {
 	__u64 pt_phys;
 };
 
+struct kvm_vcpu_pv_apf_data {
+	__u32 reason;
+	__u8 pad[60];
+	__u32 enabled;
+};
+
 #ifdef __KERNEL__
 #include <asm/processor.h>
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e6db179..032d03b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -27,16 +27,30 @@
 #include <linux/mm.h>
 #include <linux/highmem.h>
 #include <linux/hardirq.h>
+#include <linux/notifier.h>
+#include <linux/reboot.h>
 #include <asm/timer.h>
+#include <asm/cpu.h>
 
 #define MMU_QUEUE_SIZE 1024
 
+static int kvmapf = 1;
+
+static int parse_no_kvmapf(char *arg)
+{
+	kvmapf = 0;
+	return 0;
+}
+
+early_param("no-kvmapf", parse_no_kvmapf);
+
 struct kvm_para_state {
 	u8 mmu_queue[MMU_QUEUE_SIZE];
 	int mmu_queue_len;
 };
 
 static DEFINE_PER_CPU(struct kvm_para_state, para_state);
+static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 
 static struct kvm_para_state *kvm_para_state(void)
 {
@@ -231,12 +245,86 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+void __cpuinit kvm_guest_cpu_init(void)
+{
+	if (!kvm_para_available())
+		return;
+
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
+		u64 pa = __pa(&__get_cpu_var(apf_reason));
+
+		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
+		__get_cpu_var(apf_reason).enabled = 1;
+		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
+		       smp_processor_id());
+	}
+}
+
+static void kvm_pv_disable_apf(void *unused)
+{
+	if (!__get_cpu_var(apf_reason).enabled)
+		return;
+
+	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
+	__get_cpu_var(apf_reason).enabled = 0;
+
+	printk(KERN_INFO"Unregister pv shared memory for cpu %d\n",
+	       smp_processor_id());
+}
+
+static int kvm_pv_reboot_notify(struct notifier_block *nb,
+				unsigned long code, void *unused)
+{
+	if (code == SYS_RESTART)
+		on_each_cpu(kvm_pv_disable_apf, NULL, 1);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block kvm_pv_reboot_nb = {
+	.notifier_call = kvm_pv_reboot_notify,
+};
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
 	WARN_ON(kvm_register_clock("primary cpu clock"));
+	kvm_guest_cpu_init();
 	native_smp_prepare_boot_cpu();
 }
+
+static void kvm_guest_cpu_online(void *dummy)
+{
+	kvm_guest_cpu_init();
+}
+
+static void kvm_guest_cpu_offline(void *dummy)
+{
+	kvm_pv_disable_apf(NULL);
+}
+
+static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
+				    unsigned long action, void *hcpu)
+{
+	int cpu = (unsigned long)hcpu;
+	switch (action) {
+	case CPU_ONLINE:
+	case CPU_DOWN_FAILED:
+	case CPU_ONLINE_FROZEN:
+		smp_call_function_single(cpu, kvm_guest_cpu_online, NULL, 0);
+		break;
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		smp_call_function_single(cpu, kvm_guest_cpu_offline, NULL, 1);
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata kvm_cpu_notifier = {
+	.notifier_call = kvm_cpu_notify,
+};
 #endif
 
 void __init kvm_guest_init(void)
@@ -245,7 +333,11 @@ void __init kvm_guest_init(void)
 		return;
 
 	paravirt_ops_setup();
+	register_reboot_notifier(&kvm_pv_reboot_nb);
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+	register_cpu_notifier(&kvm_cpu_notifier);
+#else
+	kvm_guest_cpu_init();
 #endif
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 08/12] Handle async PF in a guest.
  2010-10-14  9:22 ` Gleb Natapov
@ 2010-10-14  9:22   ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-14  9:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

When the async PF capability is detected, hook up a special page fault
handler that handles async page fault events and passes other page
faults through to the regular page fault handler. Also add async PF
handling to nested SVM emulation. An async PF always generates an exit
to L1, where the vcpu thread will be scheduled out until the page is
available.
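
The wait and wake sides rendezvous on the token delivered in cr2; the
intended sequence is roughly the following (a sketch, not literal
code):

	/*
	 * task on vcpu: #PF, reason = PAGE_NOT_PRESENT, cr2 = T
	 *               -> kvm_async_pf_task_wait(T), sleeps on hash(T)
	 * host:         swaps the page back in
	 * vcpu:         #PF, reason = PAGE_READY, cr2 = T
	 *               -> kvm_async_pf_task_wake(T), wakes the sleeper
	 *
	 * If the wake is delivered before the corresponding wait, wake
	 * inserts a dummy node for T; the later wait finds the node and
	 * returns immediately instead of sleeping.
	 */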

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |   12 +++
 arch/x86/include/asm/traps.h    |    1 +
 arch/x86/kernel/entry_32.S      |   10 ++
 arch/x86/kernel/entry_64.S      |    3 +
 arch/x86/kernel/kvm.c           |  181 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm.c              |   45 ++++++++--
 6 files changed, 243 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 2315398..fbfd367 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -65,6 +65,9 @@ struct kvm_mmu_op_release_pt {
 	__u64 pt_phys;
 };
 
+#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
+#define KVM_PV_REASON_PAGE_READY 2
+
 struct kvm_vcpu_pv_apf_data {
 	__u32 reason;
 	__u8 pad[60];
@@ -171,8 +174,17 @@ static inline unsigned int kvm_arch_para_features(void)
 
 #ifdef CONFIG_KVM_GUEST
 void __init kvm_guest_init(void);
+void kvm_async_pf_task_wait(u32 token);
+void kvm_async_pf_task_wake(u32 token);
+u32 kvm_read_and_reset_pf_reason(void);
 #else
 #define kvm_guest_init() do { } while (0)
+#define kvm_async_pf_task_wait(T) do {} while(0)
+#define kvm_async_pf_task_wake(T) do {} while(0)
+static u32 kvm_read_and_reset_pf_reason(void)
+{
+	return 0;
+}
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index f66cda5..0310da6 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -30,6 +30,7 @@ asmlinkage void segment_not_present(void);
 asmlinkage void stack_segment(void);
 asmlinkage void general_protection(void);
 asmlinkage void page_fault(void);
+asmlinkage void async_page_fault(void);
 asmlinkage void spurious_interrupt_bug(void);
 asmlinkage void coprocessor_error(void);
 asmlinkage void alignment_check(void);
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 227d009..e6e7273 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -1496,6 +1496,16 @@ ENTRY(general_protection)
 	CFI_ENDPROC
 END(general_protection)
 
+#ifdef CONFIG_KVM_GUEST
+ENTRY(async_page_fault)
+	RING0_EC_FRAME
+	pushl $do_async_page_fault
+	CFI_ADJUST_CFA_OFFSET 4
+	jmp error_code
+	CFI_ENDPROC
+END(async_page_fault)
+#endif
+
 /*
  * End of kprobes section
  */
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 17be5ec..def98c3 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1349,6 +1349,9 @@ errorentry xen_stack_segment do_stack_segment
 #endif
 errorentry general_protection do_general_protection
 errorentry page_fault do_page_fault
+#ifdef CONFIG_KVM_GUEST
+errorentry async_page_fault do_async_page_fault
+#endif
 #ifdef CONFIG_X86_MCE
 paranoidzeroentry machine_check *machine_check_vector(%rip)
 #endif
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 032d03b..d564063 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -29,8 +29,14 @@
 #include <linux/hardirq.h>
 #include <linux/notifier.h>
 #include <linux/reboot.h>
+#include <linux/hash.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/kprobes.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
+#include <asm/traps.h>
+#include <asm/desc.h>
 
 #define MMU_QUEUE_SIZE 1024
 
@@ -64,6 +70,168 @@ static void kvm_io_delay(void)
 {
 }
 
+#define KVM_TASK_SLEEP_HASHBITS 8
+#define KVM_TASK_SLEEP_HASHSIZE (1<<KVM_TASK_SLEEP_HASHBITS)
+
+struct kvm_task_sleep_node {
+	struct hlist_node link;
+	wait_queue_head_t wq;
+	u32 token;
+	int cpu;
+};
+
+static struct kvm_task_sleep_head {
+	spinlock_t lock;
+	struct hlist_head list;
+} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];
+
+static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
+						  u32 token)
+{
+	struct hlist_node *p;
+
+	hlist_for_each(p, &b->list) {
+		struct kvm_task_sleep_node *n =
+			hlist_entry(p, typeof(*n), link);
+		if (n->token == token)
+			return n;
+	}
+
+	return NULL;
+}
+
+void kvm_async_pf_task_wait(u32 token)
+{
+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+	struct kvm_task_sleep_node n, *e;
+	DEFINE_WAIT(wait);
+
+	spin_lock(&b->lock);
+	e = _find_apf_task(b, token);
+	if (e) {
+		/* dummy entry exists -> wake up was delivered ahead of PF */
+		hlist_del(&e->link);
+		kfree(e);
+		spin_unlock(&b->lock);
+		return;
+	}
+
+	n.token = token;
+	n.cpu = smp_processor_id();
+	init_waitqueue_head(&n.wq);
+	hlist_add_head(&n.link, &b->list);
+	spin_unlock(&b->lock);
+
+	for (;;) {
+		prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+		if (hlist_unhashed(&n.link))
+			break;
+		local_irq_enable();
+		schedule();
+		local_irq_disable();
+	}
+	finish_wait(&n.wq, &wait);
+
+	return;
+}
+EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
+
+static void apf_task_wake_one(struct kvm_task_sleep_node *n)
+{
+	hlist_del_init(&n->link);
+	if (waitqueue_active(&n->wq))
+		wake_up(&n->wq);
+}
+
+static void apf_task_wake_all(void)
+{
+	int i;
+
+	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
+		struct hlist_node *p, *next;
+		struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
+		spin_lock(&b->lock);
+		hlist_for_each_safe(p, next, &b->list) {
+			struct kvm_task_sleep_node *n =
+				hlist_entry(p, typeof(*n), link);
+			if (n->cpu == smp_processor_id())
+				apf_task_wake_one(n);
+		}
+		spin_unlock(&b->lock);
+	}
+}
+
+void kvm_async_pf_task_wake(u32 token)
+{
+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+	struct kvm_task_sleep_node *n;
+
+	if (token == ~0) {
+		apf_task_wake_all();
+		return;
+	}
+
+again:
+	spin_lock(&b->lock);
+	n = _find_apf_task(b, token);
+	if (!n) {
+		/*
+		 * async PF was not yet handled.
+		 * Add dummy entry for the token.
+		 */
+		n = kmalloc(sizeof(*n), GFP_ATOMIC);
+		if (!n) {
+			/*
+			 * Allocation failed! Busy wait while other cpu
+			 * handles async PF.
+			 */
+			spin_unlock(&b->lock);
+			cpu_relax();
+			goto again;
+		}
+		n->token = token;
+		n->cpu = smp_processor_id();
+		init_waitqueue_head(&n->wq);
+		hlist_add_head(&n->link, &b->list);
+	} else
+		apf_task_wake_one(n);
+	spin_unlock(&b->lock);
+	return;
+}
+EXPORT_SYMBOL_GPL(kvm_async_pf_task_wake);
+
+u32 kvm_read_and_reset_pf_reason(void)
+{
+	u32 reason = 0;
+
+	if (__get_cpu_var(apf_reason).enabled) {
+		reason = __get_cpu_var(apf_reason).reason;
+		__get_cpu_var(apf_reason).reason = 0;
+	}
+
+	return reason;
+}
+EXPORT_SYMBOL_GPL(kvm_read_and_reset_pf_reason);
+
+dotraplinkage void __kprobes
+do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	switch (kvm_read_and_reset_pf_reason()) {
+	default:
+		do_page_fault(regs, error_code);
+		break;
+	case KVM_PV_REASON_PAGE_NOT_PRESENT:
+		/* page is swapped out by the host. */
+		kvm_async_pf_task_wait((u32)read_cr2());
+		break;
+	case KVM_PV_REASON_PAGE_READY:
+		kvm_async_pf_task_wake((u32)read_cr2());
+		break;
+	}
+}
+
 static void kvm_mmu_op(void *buffer, unsigned len)
 {
 	int r;
@@ -300,6 +468,7 @@ static void kvm_guest_cpu_online(void *dummy)
 static void kvm_guest_cpu_offline(void *dummy)
 {
 	kvm_pv_disable_apf(NULL);
+	apf_task_wake_all();
 }
 
 static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
@@ -327,13 +496,25 @@ static struct notifier_block __cpuinitdata kvm_cpu_notifier = {
 };
 #endif
 
+static void __init kvm_apf_trap_init(void)
+{
+	set_intr_gate(14, &async_page_fault);
+}
+
 void __init kvm_guest_init(void)
 {
+	int i;
+
 	if (!kvm_para_available())
 		return;
 
 	paravirt_ops_setup();
 	register_reboot_notifier(&kvm_pv_reboot_nb);
+	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
+		spin_lock_init(&async_pf_sleepers[i].lock);
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
+		x86_init.irqs.trap_init = kvm_apf_trap_init;
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 	register_cpu_notifier(&kvm_cpu_notifier);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 9a92224..9fa27a5 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -31,6 +31,7 @@
 
 #include <asm/tlbflush.h>
 #include <asm/desc.h>
+#include <asm/kvm_para.h>
 
 #include <asm/virtext.h>
 #include "trace.h"
@@ -133,6 +134,7 @@ struct vcpu_svm {
 
 	unsigned int3_injected;
 	unsigned long int3_rip;
+	u32 apf_reason;
 };
 
 #define MSR_INVALID			0xffffffffU
@@ -1383,16 +1385,33 @@ static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)
 
 static int pf_interception(struct vcpu_svm *svm)
 {
-	u64 fault_address;
+	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u32 error_code;
+	int r = 1;
 
-	fault_address  = svm->vmcb->control.exit_info_2;
-	error_code = svm->vmcb->control.exit_info_1;
+	switch (svm->apf_reason) {
+	default:
+		error_code = svm->vmcb->control.exit_info_1;
 
-	trace_kvm_page_fault(fault_address, error_code);
-	if (!npt_enabled && kvm_event_needs_reinjection(&svm->vcpu))
-		kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
-	return kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code);
+		trace_kvm_page_fault(fault_address, error_code);
+		if (!npt_enabled && kvm_event_needs_reinjection(&svm->vcpu))
+			kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
+		r = kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code);
+		break;
+	case KVM_PV_REASON_PAGE_NOT_PRESENT:
+		svm->apf_reason = 0;
+		local_irq_disable();
+		kvm_async_pf_task_wait(fault_address);
+		local_irq_enable();
+		break;
+	case KVM_PV_REASON_PAGE_READY:
+		svm->apf_reason = 0;
+		local_irq_disable();
+		kvm_async_pf_task_wake(fault_address);
+		local_irq_enable();
+		break;
+	}
+	return r;
 }
 
 static int db_interception(struct vcpu_svm *svm)
@@ -1836,8 +1855,8 @@ static int nested_svm_exit_special(struct vcpu_svm *svm)
 			return NESTED_EXIT_HOST;
 		break;
 	case SVM_EXIT_EXCP_BASE + PF_VECTOR:
-		/* When we're shadowing, trap PFs */
-		if (!npt_enabled)
+		/* When we're shadowing, trap PFs, but not async PF */
+		if (!npt_enabled && svm->apf_reason == 0)
 			return NESTED_EXIT_HOST;
 		break;
 	case SVM_EXIT_EXCP_BASE + NM_VECTOR:
@@ -1893,6 +1912,10 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
 		u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE);
 		if (svm->nested.intercept_exceptions & excp_bits)
 			vmexit = NESTED_EXIT_DONE;
+		/* an async page fault always causes a vmexit */
+		else if ((exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) &&
+			 svm->apf_reason != 0)
+			vmexit = NESTED_EXIT_DONE;
 		break;
 	}
 	case SVM_EXIT_ERR: {
@@ -3409,6 +3432,10 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 
 	svm->next_rip = 0;
 
+	/* if exit due to PF check for async PF */
+	if (svm->vmcb->control.exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR)
+		svm->apf_reason = kvm_read_and_reset_pf_reason();
+
 	if (npt_enabled) {
 		vcpu->arch.regs_avail &= ~(1 << VCPU_EXREG_PDPTR);
 		vcpu->arch.regs_dirty &= ~(1 << VCPU_EXREG_PDPTR);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 09/12] Inject asynchronous page fault into a PV guest if page is swapped out.
  2010-10-14  9:22 ` Gleb Natapov
@ 2010-10-14  9:22   ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-14  9:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Send an async page fault to a PV guest if it accesses swapped-out memory.
The guest will choose another task to run upon receiving the fault.

Allow async page fault injection only when the guest is in user mode,
since otherwise the guest may be in a non-sleepable context and unable
to reschedule.

The vcpu will be halted if the guest faults on the same page again or if
the vcpu is executing kernel code.
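
For context, the guest-side dispatch this injection pairs with (added by
an earlier patch in the series) looks roughly like the sketch below; the
token travels in CR2 in place of a fault address:

	dotraplinkage void
	do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
	{
		switch (kvm_read_and_reset_pf_reason()) {
		default:
			/* an ordinary #PF: use the normal handler */
			do_page_fault(regs, error_code);
			break;
		case KVM_PV_REASON_PAGE_NOT_PRESENT:
			/* page is being swapped in: run something else */
			kvm_async_pf_task_wait((u32)read_cr2());
			break;
		case KVM_PV_REASON_PAGE_READY:
			/* page arrived: wake the task waiting on the token */
			kvm_async_pf_task_wake((u32)read_cr2());
			break;
		}
	}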

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    3 ++
 arch/x86/kvm/mmu.c              |    1 +
 arch/x86/kvm/x86.c              |   43 ++++++++++++++++++++++++++++++++++----
 include/trace/events/kvm.h      |   17 ++++++++++-----
 virt/kvm/async_pf.c             |    3 +-
 5 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26b2064..f1868ed 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -421,6 +421,7 @@ struct kvm_vcpu_arch {
 		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
 		struct gfn_to_hva_cache data;
 		u64 msr_val;
+		u32 id;
 	} apf;
 };
 
@@ -596,6 +597,7 @@ struct kvm_x86_ops {
 };
 
 struct kvm_arch_async_pf {
+	u32 token;
 	gfn_t gfn;
 };
 
@@ -843,6 +845,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work);
 void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
 			       struct kvm_async_pf *work);
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
 extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 11d152b..463ff2e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2590,6 +2590,7 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
 {
 	struct kvm_arch_async_pf arch;
+	arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
 	arch.gfn = gfn;
 
 	return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 68a3a06..8e2fc59 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6233,20 +6233,53 @@ static void kvm_del_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
 	}
 }
 
+static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
+{
+
+	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, &val,
+				      sizeof(val));
+}
+
 void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 				     struct kvm_async_pf *work)
 {
-	trace_kvm_async_pf_not_present(work->gva);
-
-	kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+	trace_kvm_async_pf_not_present(work->arch.token, work->gva);
 	kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
+
+	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
+	    kvm_x86_ops->get_cpl(vcpu) == 0)
+		kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+	else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
+		vcpu->arch.fault.error_code = 0;
+		vcpu->arch.fault.address = work->arch.token;
+		kvm_inject_page_fault(vcpu);
+	}
 }
 
 void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work)
 {
-	trace_kvm_async_pf_ready(work->gva);
-	kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
+	trace_kvm_async_pf_ready(work->arch.token, work->gva);
+	if (is_error_page(work->page))
+		work->arch.token = ~0; /* broadcast wakeup */
+	else
+		kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
+
+	if ((vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) &&
+	    !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
+		vcpu->arch.fault.error_code = 0;
+		vcpu->arch.fault.address = work->arch.token;
+		kvm_inject_page_fault(vcpu);
+	}
+}
+
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
+{
+	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
+		return true;
+	else
+		return !kvm_event_needs_reinjection(vcpu) &&
+			kvm_x86_ops->interrupt_allowed(vcpu);
 }
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index a78a5e5..9c2cc6a 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -204,34 +204,39 @@ TRACE_EVENT(
 
 TRACE_EVENT(
 	kvm_async_pf_not_present,
-	TP_PROTO(u64 gva),
-	TP_ARGS(gva),
+	TP_PROTO(u64 token, u64 gva),
+	TP_ARGS(token, gva),
 
 	TP_STRUCT__entry(
+		__field(__u64, token)
 		__field(__u64, gva)
 		),
 
 	TP_fast_assign(
+		__entry->token = token;
 		__entry->gva = gva;
 		),
 
-	TP_printk("gva %#llx not present", __entry->gva)
+	TP_printk("token %#llx gva %#llx not present", __entry->token,
+		  __entry->gva)
 );
 
 TRACE_EVENT(
 	kvm_async_pf_ready,
-	TP_PROTO(u64 gva),
-	TP_ARGS(gva),
+	TP_PROTO(u64 token, u64 gva),
+	TP_ARGS(token, gva),
 
 	TP_STRUCT__entry(
+		__field(__u64, token)
 		__field(__u64, gva)
 		),
 
 	TP_fast_assign(
+		__entry->token = token;
 		__entry->gva = gva;
 		),
 
-	TP_printk("gva %#llx ready", __entry->gva)
+	TP_printk("token %#llx gva %#llx ready", __entry->token, __entry->gva)
 );
 
 TRACE_EVENT(
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index b276b06..2ab2089 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -124,7 +124,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 {
 	struct kvm_async_pf *work;
 
-	if (list_empty_careful(&vcpu->async_pf.done))
+	if (list_empty_careful(&vcpu->async_pf.done) ||
+	    !kvm_arch_can_inject_async_page_present(vcpu))
 		return;
 
 	spin_lock(&vcpu->async_pf.lock);
-- 
1.7.1
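
The token built in kvm_arch_setup_async_pf() packs a per-vcpu sequence
number above the vcpu id. Hypothetical decode helpers (not part of the
patch) make the layout explicit:

	/* token layout: bits 11:0 = vcpu_id, bits 31:12 = per-vcpu counter */
	static inline u32 apf_token_vcpu_id(u32 token)
	{
		return token & 0xfff;
	}

	static inline u32 apf_token_seq(u32 token)
	{
		return token >> 12;
	}

The ~0 token written for error pages bypasses this encoding entirely and
acts as a broadcast that wakes every waiter.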


* [PATCH v7 10/12] Handle async PF in non preemptable context
  2010-10-14  9:22 ` Gleb Natapov
@ 2010-10-14  9:22   ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-14  9:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If an async page fault is received by the idle task, or while
preempt_count is non-zero, the guest cannot reschedule, so instead do
sti; hlt and wait for the page to become ready. The vcpu can still
process interrupts while it waits for the page.
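
For readers unfamiliar with the idiom, native_safe_halt() on x86 is
essentially the following (a sketch of the standard implementation, not
part of this patch):

	static inline void native_safe_halt(void)
	{
		/* "sti; hlt" enables interrupts and halts atomically, so a
		 * wakeup interrupt arriving in between cannot be lost. */
		asm volatile("sti; hlt" : : : "memory");
	}

This is what lets the vcpu keep taking interrupts, including the injected
PAGE_READY notification, even when the faulting context cannot sleep.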

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kernel/kvm.c |   40 ++++++++++++++++++++++++++++++++++------
 1 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index d564063..47ea93e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -37,6 +37,7 @@
 #include <asm/cpu.h>
 #include <asm/traps.h>
 #include <asm/desc.h>
+#include <asm/tlbflush.h>
 
 #define MMU_QUEUE_SIZE 1024
 
@@ -78,6 +79,8 @@ struct kvm_task_sleep_node {
 	wait_queue_head_t wq;
 	u32 token;
 	int cpu;
+	bool halted;
+	struct mm_struct *mm;
 };
 
 static struct kvm_task_sleep_head {
@@ -106,6 +109,11 @@ void kvm_async_pf_task_wait(u32 token)
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
 	struct kvm_task_sleep_node n, *e;
 	DEFINE_WAIT(wait);
+	int cpu, idle;
+
+	cpu = get_cpu();
+	idle = idle_cpu(cpu);
+	put_cpu();
 
 	spin_lock(&b->lock);
 	e = _find_apf_task(b, token);
@@ -119,19 +127,33 @@ void kvm_async_pf_task_wait(u32 token)
 
 	n.token = token;
 	n.cpu = smp_processor_id();
+	n.mm = current->active_mm;
+	n.halted = idle || preempt_count() > 1;
+	atomic_inc(&n.mm->mm_count);
 	init_waitqueue_head(&n.wq);
 	hlist_add_head(&n.link, &b->list);
 	spin_unlock(&b->lock);
 
 	for (;;) {
-		prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+		if (!n.halted)
+			prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
 		if (hlist_unhashed(&n.link))
 			break;
-		local_irq_enable();
-		schedule();
-		local_irq_disable();
+
+		if (!n.halted) {
+			local_irq_enable();
+			schedule();
+			local_irq_disable();
+		} else {
+			/*
+			 * We cannot reschedule. So halt.
+			 */
+			native_safe_halt();
+			local_irq_disable();
+		}
 	}
-	finish_wait(&n.wq, &wait);
+	if (!n.halted)
+		finish_wait(&n.wq, &wait);
 
 	return;
 }
@@ -140,7 +162,12 @@ EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
 static void apf_task_wake_one(struct kvm_task_sleep_node *n)
 {
 	hlist_del_init(&n->link);
-	if (waitqueue_active(&n->wq))
+	if (!n->mm)
+		return;
+	mmdrop(n->mm);
+	if (n->halted)
+		smp_send_reschedule(n->cpu);
+	else if (waitqueue_active(&n->wq))
 		wake_up(&n->wq);
 }
 
@@ -193,6 +220,7 @@ again:
 		}
 		n->token = token;
 		n->cpu = smp_processor_id();
+		n->mm = NULL;
 		init_waitqueue_head(&n->wq);
 		hlist_add_head(&n->link, &b->list);
 	} else
-- 
1.7.1


* [PATCH v7 11/12] Let host know whether the guest can handle async PF in non-userspace context.
  2010-10-14  9:22 ` Gleb Natapov
@ 2010-10-14  9:22   ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-14  9:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If the guest can detect that it runs in a non-preemptable context, it can
handle async PFs at any time, so let the host know that it may send async
PFs even when the guest cpu is not in userspace.
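
Condensed from the hunks below, the enable sequence and the MSR layout
after this patch look roughly like this (illustrative summary; bit names
come from kvm_para.h and the check in kvm_pv_enable_async_pf()):

	/* MSR_KVM_ASYNC_PF_EN (0x4b564d02):
	 *   bits 63-6  64-byte aligned gpa of the per-cpu apf_reason area
	 *   bits 5-2   reserved, must be zero (data & 0x3c is rejected)
	 *   bit 1      KVM_ASYNC_PF_SEND_ALWAYS
	 *   bit 0      KVM_ASYNC_PF_ENABLED
	 */
	u64 pa = __pa(&__get_cpu_var(apf_reason));

#ifdef CONFIG_PREEMPT
	pa |= KVM_ASYNC_PF_SEND_ALWAYS;	/* preemptible guests cope in cpl 0 */
#endif
	wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);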

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 Documentation/kvm/msr.txt       |    6 +++---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/include/asm/kvm_para.h |    1 +
 arch/x86/kernel/kvm.c           |    3 +++
 arch/x86/kvm/x86.c              |    5 +++--
 5 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
index 27c11a6..d079aed 100644
--- a/Documentation/kvm/msr.txt
+++ b/Documentation/kvm/msr.txt
@@ -154,9 +154,10 @@ MSR_KVM_SYSTEM_TIME: 0x12
 MSR_KVM_ASYNC_PF_EN: 0x4b564d02
 	data: Bits 63-6 hold 64-byte aligned physical address of a
 	64 byte memory area which must be in guest RAM and must be
-	zeroed. Bits 5-1 are reserved and should be zero. Bit 0 is 1
+	zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1
 	when asynchronous page faults are enabled on the vcpu 0 when
-	disabled.
+	disabled. Bit 1 is 1 if asynchronous page faults can be injected
+	when vcpu is in cpl == 0.
 
 	First 4 byte of 64 byte memory location will be written to by
 	the hypervisor at the time of asynchronous page fault (APF)
@@ -184,4 +185,3 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
 
 	Currently type 2 APF will be always delivered on the same vcpu as
 	type 1 was, but guest should not rely on that.
-
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f1868ed..d2fa951 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -422,6 +422,7 @@ struct kvm_vcpu_arch {
 		struct gfn_to_hva_cache data;
 		u64 msr_val;
 		u32 id;
+		bool send_user_only;
 	} apf;
 };
 
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index fbfd367..d3a1a48 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -38,6 +38,7 @@
 #define KVM_MAX_MMU_OP_BATCH           32
 
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
+#define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
 
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE            1
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 47ea93e..91b3d65 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -449,6 +449,9 @@ void __cpuinit kvm_guest_cpu_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
 		u64 pa = __pa(&__get_cpu_var(apf_reason));
 
+#ifdef CONFIG_PREEMPT
+		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
+#endif
 		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
 		__get_cpu_var(apf_reason).enabled = 1;
 		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e2fc59..1e442df 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1435,8 +1435,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 {
 	gpa_t gpa = data & ~0x3f;
 
-	/* Bits 1:5 are resrved, Should be zero */
-	if (data & 0x3e)
+	/* Bits 2:5 are reserved, should be zero */
+	if (data & 0x3c)
 		return 1;
 
 	vcpu->arch.apf.msr_val = data;
@@ -1450,6 +1450,7 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa))
 		return 1;
 
+	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
 	kvm_async_pf_wakeup_all(vcpu);
 	return 0;
 }
-- 
1.7.1



* [PATCH v7 12/12] Send async PF when guest is not in userspace too.
  2010-10-14  9:22 ` Gleb Natapov
@ 2010-10-14  9:22   ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-14  9:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If the guest indicates that it can handle async PFs in kernel mode too,
send them there as well, but only if interrupts are enabled.
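
Together with patch 11, the host-side decision for the not-present event
condenses to roughly this (a restatement of the final code path):

	/* Halt the vcpu instead of injecting when injection is unsafe. */
	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
	    (vcpu->arch.apf.send_user_only &&
	     kvm_x86_ops->get_cpl(vcpu) == 0))
		kvm_make_request(KVM_REQ_APF_HALT, vcpu);
	else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT))
		kvm_inject_page_fault(vcpu);	/* token travels in CR2 */

kvm_arch_can_inject_async_page_present() additionally requires that no
event needs reinjection and that interrupts are deliverable before the
ready notification is sent.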

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/x86.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e442df..51cff2f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6248,7 +6248,8 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 	kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
 
 	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
-	    kvm_x86_ops->get_cpl(vcpu) == 0)
+	    (vcpu->arch.apf.send_user_only &&
+	     kvm_x86_ops->get_cpl(vcpu) == 0))
 		kvm_make_request(KVM_REQ_APF_HALT, vcpu);
 	else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
 		vcpu->arch.fault.error_code = 0;
-- 
1.7.1



* Re: [PATCH v7 03/12] Retry fault before vmentry
  2010-10-14  9:22   ` Gleb Natapov
@ 2010-10-17 10:33     ` Avi Kivity
  -1 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 10:33 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/14/2010 11:22 AM, Gleb Natapov wrote:
> When a page is swapped in, it is mapped into guest memory only after the
> guest tries to access it again and generates another fault. To save this
> fault we can map it immediately, since we know the guest is going to
> access the page. Do it only when tdp is enabled for now. The shadow
> paging case is more complicated: CR[034] and EFER registers should be
> switched before doing the mapping and then switched back.
>
> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> +{
> +	if (!vcpu->arch.mmu.direct_map || is_error_page(work->page))
> +		return;
> +	vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true);
> +}

Missing mmu_topup_memory_caches().


-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH v7 03/12] Retry fault before vmentry
  2010-10-17 10:33     ` Avi Kivity
@ 2010-10-17 10:43       ` Avi Kivity
  -1 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 10:43 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 10/17/2010 12:33 PM, Avi Kivity wrote:
>  On 10/14/2010 11:22 AM, Gleb Natapov wrote:
>> When a page is swapped in, it is mapped into guest memory only after the
>> guest tries to access it again and generates another fault. To save this
>> fault we can map it immediately, since we know the guest is going to
>> access the page. Do it only when tdp is enabled for now. The shadow
>> paging case is more complicated: CR[034] and EFER registers should be
>> switched before doing the mapping and then switched back.
>>
>> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>> +{
>> +    if (!vcpu->arch.mmu.direct_map || is_error_page(work->page))
>> +        return;
>> +    vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true);
>> +}
>
> Missing mmu_topup_memory_caches().
>

Actually not.  tdp_page_fault() has its own topup.


-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH v7 03/12] Retry fault before vmentry
  2010-10-14  9:22   ` Gleb Natapov
@ 2010-10-17 16:13     ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-17 16:13 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

When a page is swapped in, it is mapped into guest memory only after the
guest tries to access it again and generates another fault. To save this
fault we can map it immediately, since we know the guest is going to
access the page. Do it only when tdp is enabled for now. The shadow
paging case is more complicated: CR[034] and EFER registers should be
switched before doing the mapping and then switched back.
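
After this patch the completion path runs, in order (condensed from the
virt/kvm/async_pf.c hunk below):

	if (work->page)					/* swap-in succeeded */
		kvm_arch_async_page_ready(vcpu, work);	/* prefault; no_apf set
							 * so it cannot recurse */
	kvm_arch_async_page_present(vcpu, work);	/* notify/wake the guest */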

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---

Please use this one instead; we need to call kvm_mmu_reload() in kvm_arch_async_page_ready().

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 043e29e..96aca44 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -241,7 +241,7 @@ struct kvm_mmu {
 	void (*new_cr3)(struct kvm_vcpu *vcpu);
 	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
 	unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
-	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
+	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, bool no_apf);
 	void (*inject_page_fault)(struct kvm_vcpu *vcpu);
 	void (*free)(struct kvm_vcpu *vcpu);
 	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
@@ -839,6 +839,8 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 				     struct kvm_async_pf *work);
 void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work);
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+			       struct kvm_async_pf *work);
 extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f01e89a..11d152b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2568,7 +2568,7 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr,
 }
 
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
-				u32 error_code)
+				u32 error_code, bool no_apf)
 {
 	gfn_t gfn;
 	int r;
@@ -2604,8 +2604,8 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 	return kvm_x86_ops->interrupt_allowed(vcpu);
 }
 
-static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
-			 pfn_t *pfn)
+static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
+			 gva_t gva, pfn_t *pfn)
 {
 	bool async;
 
@@ -2616,7 +2616,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 
 	put_page(pfn_to_page(*pfn));
 
-	if (can_do_async_pf(vcpu)) {
+	if (!no_apf && can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(async, *pfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			trace_kvm_async_pf_doublefault(gva, gfn);
@@ -2631,8 +2631,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 	return false;
 }
 
-static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
-				u32 error_code)
+static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
+			  bool no_apf)
 {
 	pfn_t pfn;
 	int r;
@@ -2654,7 +2654,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, gfn, gpa, &pfn))
+	if (try_async_pf(vcpu, no_apf, gfn, gpa, &pfn))
 		return 0;
 
 	/* mmio */
@@ -3317,7 +3317,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code)
 	int r;
 	enum emulation_result er;
 
-	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code);
+	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
 	if (r < 0)
 		goto out;
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index c45376d..d6b281e 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -527,8 +527,8 @@ out_gpte_changed:
  *  Returns: 1 if we need to emulate the instruction, 0 otherwise, or
  *           a negative value on error.
  */
-static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
-			       u32 error_code)
+static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
+			     bool no_apf)
 {
 	int write_fault = error_code & PFERR_WRITE_MASK;
 	int user_fault = error_code & PFERR_USER_MASK;
@@ -569,7 +569,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, walker.gfn, addr, &pfn))
+	if (try_async_pf(vcpu, no_apf, walker.gfn, addr, &pfn))
 		return 0;
 
 	/* mmio */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 09e72fc..9b178ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6131,6 +6131,20 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
 
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
+{
+	int r;
+
+	if (!vcpu->arch.mmu.direct_map || is_error_page(work->page))
+		return;
+
+	r = kvm_mmu_reload(vcpu);
+	if (unlikely(r))
+		return;
+
+	vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true);
+}
+
 static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
 {
 	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 857d634..e97eae9 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -132,6 +132,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 	list_del(&work->link);
 	spin_unlock(&vcpu->async_pf.lock);
 
+	if (work->page)
+		kvm_arch_async_page_ready(vcpu, work);
 	kvm_arch_async_page_present(vcpu, work);
 
 	list_del(&work->queue);
--
			Gleb.


* Re: [PATCH v7 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-14  9:22   ` Gleb Natapov
@ 2010-10-18 13:22     ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-18 13:22 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Keep track of memslot changes by keeping a generation number in the
memslots structure. Provide a kvm_write_guest_cached() function that
skips the gfn_to_hva() translation if the memslots have not changed
since the previous invocation.
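
A caller-side sketch of the new interface, matching how the series uses
it for the per-vcpu apf area (error handling abbreviated):

	struct gfn_to_hva_cache ghc;
	u32 reason = KVM_PV_REASON_PAGE_NOT_PRESENT;

	/* once, when the cached gpa is established */
	if (kvm_gfn_to_hva_cache_init(kvm, &ghc, gpa))
		return 1;		/* gpa does not map to a valid hva */

	/* hot path: no memslot walk unless the generation has moved */
	if (kvm_write_guest_cached(kvm, &ghc, &reason, sizeof(reason)))
		return -EFAULT;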

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---

arch/x86/kvm/x86.c also changes the memslots; I forgot to update the
generation there.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bee71ec..bcc48bc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3193,6 +3193,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 		}
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 		slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
+		slots->generation++;
 
 		old_slots = kvm->memslots;
 		rcu_assign_pointer(kvm->memslots, slots);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9a9b017..dda88f2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -199,6 +199,7 @@ struct kvm_irq_routing_table {};
 
 struct kvm_memslots {
 	int nmemslots;
+	u64 generation;
 	struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
 					KVM_PRIVATE_MEM_SLOTS];
 };
@@ -352,12 +353,18 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
 			 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 		    unsigned long len);
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len);
+int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			      gpa_t gpa);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn);
 
 void kvm_vcpu_block(struct kvm_vcpu *vcpu);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 7ac0d4e..fa7cc72 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -67,4 +67,11 @@ struct kvm_lapic_irq {
 	u32 dest_id;
 };
 
+struct gfn_to_hva_cache {
+	u64 generation;
+	gpa_t gpa;
+	unsigned long hva;
+	struct kvm_memory_slot *memslot;
+};
+
 #endif /* __KVM_TYPES_H__ */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 238079e..5d57ec9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -687,6 +687,7 @@ skip_lpage:
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 		if (mem->slot >= slots->nmemslots)
 			slots->nmemslots = mem->slot + 1;
+		slots->generation++;
 		slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
 
 		old_memslots = kvm->memslots;
@@ -723,6 +724,7 @@ skip_lpage:
 	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 	if (mem->slot >= slots->nmemslots)
 		slots->nmemslots = mem->slot + 1;
+	slots->generation++;
 
 	/* actual memory is freed via old in kvm_free_physmem_slot below */
 	if (!npages) {
@@ -853,10 +855,10 @@ int kvm_is_error_hva(unsigned long addr)
 }
 EXPORT_SYMBOL_GPL(kvm_is_error_hva);
 
-struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
+static struct kvm_memory_slot *__gfn_to_memslot(struct kvm_memslots *slots,
+						gfn_t gfn)
 {
 	int i;
-	struct kvm_memslots *slots = kvm_memslots(kvm);
 
 	for (i = 0; i < slots->nmemslots; ++i) {
 		struct kvm_memory_slot *memslot = &slots->memslots[i];
@@ -867,6 +869,11 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 	}
 	return NULL;
 }
+
+struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
+{
+	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
+}
 EXPORT_SYMBOL_GPL(gfn_to_memslot);
 
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
@@ -929,12 +936,9 @@ int memslot_id(struct kvm *kvm, gfn_t gfn)
 	return memslot - slots->memslots;
 }
 
-static unsigned long gfn_to_hva_many(struct kvm *kvm, gfn_t gfn,
+static unsigned long gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
 				     gfn_t *nr_pages)
 {
-	struct kvm_memory_slot *slot;
-
-	slot = gfn_to_memslot(kvm, gfn);
 	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
 		return bad_hva();
 
@@ -946,7 +950,7 @@ static unsigned long gfn_to_hva_many(struct kvm *kvm, gfn_t gfn,
 
 unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
 {
-	return gfn_to_hva_many(kvm, gfn, NULL);
+	return gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_hva);
 
@@ -1063,7 +1067,7 @@ int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages,
 	unsigned long addr;
 	gfn_t entry;
 
-	addr = gfn_to_hva_many(kvm, gfn, &entry);
+	addr = gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, &entry);
 	if (kvm_is_error_hva(addr))
 		return -1;
 
@@ -1247,6 +1251,47 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 	return 0;
 }
 
+int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			      gpa_t gpa)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	int offset = offset_in_page(gpa);
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+
+	ghc->gpa = gpa;
+	ghc->generation = slots->generation;
+	ghc->memslot = __gfn_to_memslot(slots, gfn);
+	ghc->hva = gfn_to_hva_many(ghc->memslot, gfn, NULL);
+	if (!kvm_is_error_hva(ghc->hva))
+		ghc->hva += offset;
+	else
+		return -EFAULT;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);
+
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	int r;
+
+	if (slots->generation != ghc->generation)
+		kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);
+
+	if (kvm_is_error_hva(ghc->hva))
+		return -EFAULT;
+
+	r = copy_to_user((void __user *)ghc->hva, data, len);
+	if (r)
+		return -EFAULT;
+	mark_page_dirty_in_slot(kvm, ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
+
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
 {
 	return kvm_write_guest_page(kvm, gfn, empty_zero_page, offset, len);
@@ -1272,11 +1317,9 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn)
 {
-	struct kvm_memory_slot *memslot;
-
-	memslot = gfn_to_memslot(kvm, gfn);
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 
@@ -1284,6 +1327,14 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	}
 }
 
+void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *memslot;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	mark_page_dirty_in_slot(kvm, memslot, gfn);
+}
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
--
			Gleb.

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/12] KVM: Add host swap event notifications for PV guest
  2010-10-14  9:22 ` Gleb Natapov
@ 2010-10-18 15:34   ` Marcelo Tosatti
  -1 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2010-10-18 15:34 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl

On Thu, Oct 14, 2010 at 11:22:44AM +0200, Gleb Natapov wrote:
> KVM virtualizes guest memory by means of shadow pages or HW assistance
> like NPT/EPT. Not all memory used by a guest is mapped into the guest
> address space or even present in a host memory at any given time.
> When vcpu tries to access memory page that is not mapped into the guest
> address space KVM is notified about it. KVM maps the page into the guest
> address space and resumes vcpu execution. If the page is swapped out from
> the host memory vcpu execution is suspended till the page is swapped
> into the memory again. This is inefficient since vcpu can do other work
> (run other task or serve interrupts) while page gets swapped in.
> 
> The patch series tries to mitigate this problem by introducing two
> mechanisms. The first one is used with non-PV guest and it works like
> this: when vcpu tries to access swapped out page it is halted and
> requested page is swapped in by another thread. That way vcpu can still
> process interrupts while io is happening in parallel and, with any luck,
> interrupt will cause the guest to schedule another task on the vcpu, so
> it will have work to do instead of waiting for the page to be swapped in.
> 
> The second mechanism introduces PV notification about swapped page state to
> a guest (asynchronous page fault). Instead of halting vcpu upon access to
> swapped out page and hoping that some interrupt will cause reschedule we
> immediately inject asynchronous page fault to the vcpu.  PV aware guest
> knows that upon receiving such exception it should schedule another task
> to run on the vcpu. Current task is put to sleep until another kind of
> asynchronous page fault is received that notifies the guest that page
> is now in the host memory, so task that waits for it can run again.
> 
> To measure performance benefits I use a simple benchmark program (below)
> that starts number of threads. Some of them do work (increment counter),
> others access huge array in random location trying to generate host page
> faults. The size of the array is smaller then guest memory bug bigger
> then host memory so we are guarantied that host will swap out part of
> the array.
> 
> I ran the benchmark on three setups: with current kvm.git (master),
> with my patch series + non-pv guest (nonpv) and with my patch series +
> pv guest (pv).
> 
> Each guest had 4 cpus and 2G memory and was launched inside 512M memory
> container. The command line was "./bm -f 4 -w 4 -t 60" (run 4 faulting
> threads and 4 working threads for a minute).
> 
> Below is the total amount of "work" each guest managed to do
> (average of 10 runs):
>          total work    std error
> master: 122789420615 (3818565029)
> nonpv:  138455939001 (773774299)
> pv:     234351846135 (10461117116)
> 
> Changes:
>  v1->v2
>    Use MSR instead of hypercall.
>    Move most of the code into arch independent place.
>    halt inside a guest instead of doing "wait for page" hypercall if
>     preemption is disabled.
>  v2->v3
>    Use MSR from range 0x4b564dxx.
>    Add slot version tracking.
>    Support migration by restarting all guest processes after migration.
>    Drop patch that tract preemptability for non-preemptable kernels
>     due to performance concerns. Send async PF to non-preemptable
>     guests only when vcpu is executing userspace code.
>  v3->v4
>   Provide alternative page fault handler in PV guest instead of adding hook to
>    standard page fault handler and patch it out on non-PV guests.
>   Allow only limited number of outstanding async page fault per vcpu.
>   Unify  gfn_to_pfn and gfn_to_pfn_async code.
>   Cancel outstanding slow work on reset.
>  v4->v5
>   Move async pv cpu initialization into cpu hotplug notifier.
>   Use GFP_NOWAIT instead of GFP_ATOMIC for allocation that shouldn't sleep
>   Process KVM_REQ_MMU_SYNC even in page_fault_other_cr3() before changing
>    cr3 back
>  v5->v6
>   To many. Will list only major changes here.
>   Replace slow work with work queues.
>   Halt vcpu for non-pv guests.
>   Handle async PF in nested SVM mode.
>   Do not prefault swapped in page for non tdp case.
>  v6->v7
>   Fix "GUP fail in work thread" problem
>   Do prefault only if mmu is in direct map mode
>   Use cpu->request to ask for vcpu halt (drop optimization that tried to
>    skip non-present apf injection if page is swapped in before next vmentry)
>   Keep track of synthetic halt in separate state to prevent it from leaking
>    during migration.
>   Fix memslot tracking problems.
>   More documentation.
>   Other small comments are addressed
> 
> Gleb Natapov (12):
>   Add get_user_pages() variant that fails if major fault is required.
>   Halt vcpu if page it tries to access is swapped out.
>   Retry fault before vmentry
>   Add memory slot versioning and use it to provide fast guest write interface
>   Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
>   Add PV MSR to enable asynchronous page faults delivery.
>   Add async PF initialization to PV guest.
>   Handle async PF in a guest.
>   Inject asynchronous page fault into a PV guest if page is swapped out.
>   Handle async PF in non preemptable context
>   Let host know whether the guest can handle async PF in non-userspace context.
>   Send async PF when guest is not in userspace too.
> 
>  Documentation/kernel-parameters.txt |    3 +
>  Documentation/kvm/cpuid.txt         |    3 +
>  Documentation/kvm/msr.txt           |   36 ++++-
>  arch/x86/include/asm/kvm_host.h     |   28 +++-
>  arch/x86/include/asm/kvm_para.h     |   24 +++
>  arch/x86/include/asm/traps.h        |    1 +
>  arch/x86/kernel/entry_32.S          |   10 +
>  arch/x86/kernel/entry_64.S          |    3 +
>  arch/x86/kernel/kvm.c               |  315 +++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/kvmclock.c          |   13 +--
>  arch/x86/kvm/Kconfig                |    1 +
>  arch/x86/kvm/Makefile               |    1 +
>  arch/x86/kvm/mmu.c                  |   61 ++++++-
>  arch/x86/kvm/paging_tmpl.h          |    8 +-
>  arch/x86/kvm/svm.c                  |   45 ++++-
>  arch/x86/kvm/x86.c                  |  192 +++++++++++++++++++++-
>  fs/ncpfs/mmap.c                     |    2 +
>  include/linux/kvm.h                 |    1 +
>  include/linux/kvm_host.h            |   39 +++++
>  include/linux/kvm_types.h           |    7 +
>  include/linux/mm.h                  |    5 +
>  include/trace/events/kvm.h          |   95 +++++++++++
>  mm/filemap.c                        |    3 +
>  mm/memory.c                         |   31 +++-
>  mm/shmem.c                          |    8 +-
>  virt/kvm/Kconfig                    |    3 +
>  virt/kvm/async_pf.c                 |  213 +++++++++++++++++++++++
>  virt/kvm/async_pf.h                 |   36 ++++
>  virt/kvm/kvm_main.c                 |  132 ++++++++++++---
>  29 files changed, 1255 insertions(+), 64 deletions(-)
>  create mode 100644 virt/kvm/async_pf.c
>  create mode 100644 virt/kvm/async_pf.h

Applied, thanks.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-14  9:22   ` Gleb Natapov
@ 2010-10-20 11:28     ` Jan Kiszka
  -1 siblings, 0 replies; 56+ messages in thread
From: Jan Kiszka @ 2010-10-20 11:28 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On 14.10.2010 11:22, Gleb Natapov wrote:
> If a guest accesses swapped out memory, do not swap it in from the vcpu
> thread context. Schedule work to do the swapping and put the vcpu into
> a halted state instead.
> 
> Interrupts will still be delivered to the guest, and if an interrupt
> causes a reschedule the guest will continue to run another task.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   18 ++++
>  arch/x86/kvm/Kconfig            |    1 +
>  arch/x86/kvm/Makefile           |    1 +
>  arch/x86/kvm/mmu.c              |   52 +++++++++++-
>  arch/x86/kvm/paging_tmpl.h      |    4 +-
>  arch/x86/kvm/x86.c              |  112 ++++++++++++++++++++++-
>  include/linux/kvm_host.h        |   31 +++++++
>  include/trace/events/kvm.h      |   90 ++++++++++++++++++
>  virt/kvm/Kconfig                |    3 +
>  virt/kvm/async_pf.c             |  190 +++++++++++++++++++++++++++++++++++++++
>  virt/kvm/async_pf.h             |   36 ++++++++
>  virt/kvm/kvm_main.c             |   57 +++++++++---
>  12 files changed, 578 insertions(+), 17 deletions(-)
>  create mode 100644 virt/kvm/async_pf.c
>  create mode 100644 virt/kvm/async_pf.h
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e209078..043e29e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -83,11 +83,14 @@
>  #define KVM_NR_FIXED_MTRR_REGION 88
>  #define KVM_NR_VAR_MTRR 8
>  
> +#define ASYNC_PF_PER_VCPU 64
> +
>  extern spinlock_t kvm_lock;
>  extern struct list_head vm_list;
>  
>  struct kvm_vcpu;
>  struct kvm;
> +struct kvm_async_pf;
>  
>  enum kvm_reg {
>  	VCPU_REGS_RAX = 0,
> @@ -412,6 +415,11 @@ struct kvm_vcpu_arch {
>  	u64 hv_vapic;
>  
>  	cpumask_var_t wbinvd_dirty_mask;
> +
> +	struct {
> +		bool halted;
> +		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
> +	} apf;
>  };
>  
>  struct kvm_arch {
> @@ -585,6 +593,10 @@ struct kvm_x86_ops {
>  	const struct trace_print_flags *exit_reasons_str;
>  };
>  
> +struct kvm_arch_async_pf {
> +	gfn_t gfn;
> +};
> +
>  extern struct kvm_x86_ops *kvm_x86_ops;
>  
>  int kvm_mmu_module_init(void);
> @@ -823,4 +835,10 @@ void kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
>  
>  bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip);
>  
> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> +				     struct kvm_async_pf *work);
> +void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> +				 struct kvm_async_pf *work);
> +extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index ddc131f..50f6364 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -28,6 +28,7 @@ config KVM
>  	select HAVE_KVM_IRQCHIP
>  	select HAVE_KVM_EVENTFD
>  	select KVM_APIC_ARCHITECTURE
> +	select KVM_ASYNC_PF
>  	select USER_RETURN_NOTIFIER
>  	select KVM_MMIO
>  	---help---
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 31a7035..c53bf19 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -9,6 +9,7 @@ kvm-y			+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
>  				coalesced_mmio.o irq_comm.o eventfd.o \
>  				assigned-dev.o)
>  kvm-$(CONFIG_IOMMU_API)	+= $(addprefix ../../../virt/kvm/, iommu.o)
> +kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(addprefix ../../../virt/kvm/, async_pf.o)
>  
>  kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
>  			   i8254.o timer.o
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 908ea54..f01e89a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -18,9 +18,11 @@
>   *
>   */
>  
> +#include "irq.h"
>  #include "mmu.h"
>  #include "x86.h"
>  #include "kvm_cache_regs.h"
> +#include "x86.h"
>  
>  #include <linux/kvm_host.h>
>  #include <linux/types.h>
> @@ -2585,6 +2587,50 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
>  			     error_code & PFERR_WRITE_MASK, gfn);
>  }
>  
> +int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
> +{
> +	struct kvm_arch_async_pf arch;
> +	arch.gfn = gfn;
> +
> +	return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
> +}
> +
> +static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> +{
> +	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
> +		     kvm_event_needs_reinjection(vcpu)))
> +		return false;
> +
> +	return kvm_x86_ops->interrupt_allowed(vcpu);
> +}
> +
> +static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
> +			 pfn_t *pfn)
> +{
> +	bool async;
> +
> +	*pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
> +
> +	if (!async)
> +		return false; /* *pfn has correct page already */
> +
> +	put_page(pfn_to_page(*pfn));
> +
> +	if (can_do_async_pf(vcpu)) {
> +		trace_kvm_try_async_get_page(async, *pfn);
> +		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
> +			trace_kvm_async_pf_doublefault(gva, gfn);
> +			kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> +			return true;
> +		} else if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
> +			return true;
> +	}
> +
> +	*pfn = gfn_to_pfn(vcpu->kvm, gfn);
> +	
> +	return false;
> +}
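
/* (Not part of the patch: a summary of the contract try_async_pf()
 * gives the two fault handlers below.  A false return means *pfn is
 * usable, or is an error pfn for the mmio path, and the fault is
 * handled synchronously as before.  A true return means the handler
 * should return 0 immediately: the page is being brought in
 * asynchronously, either through a queued async PF work item or, for
 * a repeated fault on the same gfn, a synthetic halt of the vcpu.) */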
> +
>  static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
>  				u32 error_code)
>  {
> @@ -2607,7 +2653,11 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
>  
>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>  	smp_rmb();
> -	pfn = gfn_to_pfn(vcpu->kvm, gfn);
> +
> +	if (try_async_pf(vcpu, gfn, gpa, &pfn))
> +		return 0;
> +
> +	/* mmio */
>  	if (is_error_pfn(pfn))
>  		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
>  	spin_lock(&vcpu->kvm->mmu_lock);
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index cd7a833..c45376d 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -568,7 +568,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
>  
>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>  	smp_rmb();
> -	pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
> +
> +	if (try_async_pf(vcpu, walker.gfn, addr, &pfn))
> +		return 0;
>  
>  	/* mmio */
>  	if (is_error_pfn(pfn))
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7127a13..09e72fc 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -43,6 +43,7 @@
>  #include <linux/slab.h>
>  #include <linux/perf_event.h>
>  #include <linux/uaccess.h>
> +#include <linux/hash.h>
>  #include <trace/events/kvm.h>
>  
>  #define CREATE_TRACE_POINTS
> @@ -155,6 +156,13 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
>  
>  u64 __read_mostly host_xcr0;
>  
> +static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
> +{
> +	int i;
> +	for (i = 0; i < roundup_pow_of_two(ASYNC_PF_PER_VCPU); i++)
> +		vcpu->arch.apf.gfns[i] = ~0;
> +}
> +
>  static inline u32 bit(int bitno)
>  {
>  	return 1 << (bitno & 31);
> @@ -5110,6 +5118,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			vcpu->fpu_active = 0;
>  			kvm_x86_ops->fpu_deactivate(vcpu);
>  		}
> +		if (kvm_check_request(KVM_REQ_APF_HALT, vcpu)) {
> +			/* Page is swapped out. Do synthetic halt */
> +			vcpu->arch.apf.halted = true;
> +			r = 1;
> +			goto out;
> +		}
>  	}
>  
>  	r = kvm_mmu_reload(vcpu);
> @@ -5238,7 +5252,8 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
>  
>  	r = 1;
>  	while (r > 0) {
> -		if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE)
> +		if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
> +		    !vcpu->arch.apf.halted)
>  			r = vcpu_enter_guest(vcpu);
>  		else {
>  			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> @@ -5251,6 +5266,7 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
>  					vcpu->arch.mp_state =
>  						KVM_MP_STATE_RUNNABLE;
>  				case KVM_MP_STATE_RUNNABLE:
> +					vcpu->arch.apf.halted = false;
>  					break;
>  				case KVM_MP_STATE_SIPI_RECEIVED:
>  				default:
> @@ -5272,6 +5288,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
>  			vcpu->run->exit_reason = KVM_EXIT_INTR;
>  			++vcpu->stat.request_irq_exits;
>  		}
> +		
> +		kvm_check_async_pf_completion(vcpu);
> +
>  		if (signal_pending(current)) {
>  			r = -EINTR;
>  			vcpu->run->exit_reason = KVM_EXIT_INTR;
> @@ -5785,6 +5804,10 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
>  
>  	kvm_make_request(KVM_REQ_EVENT, vcpu);
>  
> +	kvm_clear_async_pf_completion_queue(vcpu);
> +	kvm_async_pf_hash_reset(vcpu);
> +	vcpu->arch.apf.halted = false;
> +
>  	return kvm_x86_ops->vcpu_reset(vcpu);
>  }
>  
> @@ -5873,6 +5896,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
>  	if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask, GFP_KERNEL))
>  		goto fail_free_mce_banks;
>  
> +	kvm_async_pf_hash_reset(vcpu);
> +
>  	return 0;
>  fail_free_mce_banks:
>  	kfree(vcpu->arch.mce_banks);
> @@ -5931,8 +5956,10 @@ static void kvm_free_vcpus(struct kvm *kvm)
>  	/*
>  	 * Unpin any mmu pages first.
>  	 */
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		kvm_clear_async_pf_completion_queue(vcpu);
>  		kvm_unload_vcpu_mmu(vcpu);
> +	}
>  	kvm_for_each_vcpu(i, vcpu, kvm)
>  		kvm_arch_vcpu_free(vcpu);
>  
> @@ -6043,7 +6070,9 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
>  
>  int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>  {
> -	return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
> +	return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
> +		!vcpu->arch.apf.halted)
> +		|| !list_empty_careful(&vcpu->async_pf.done)
>  		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
>  		|| vcpu->arch.nmi_pending ||
>  		(kvm_arch_interrupt_allowed(vcpu) &&
> @@ -6102,6 +6131,83 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_rflags);
>  
> +static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
> +{
> +	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
> +}
> +
> +static inline u32 kvm_async_pf_next_probe(u32 key)
> +{
> +	return (key + 1) & (roundup_pow_of_two(ASYNC_PF_PER_VCPU) - 1);
> +}
> +
> +static void kvm_add_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_async_pf_hash_fn(gfn);
> +
> +	while (vcpu->arch.apf.gfns[key] != ~0)
> +		key = kvm_async_pf_next_probe(key);
> +
> +	vcpu->arch.apf.gfns[key] = gfn;
> +}
> +
> +static u32 kvm_async_pf_gfn_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	int i;
> +	u32 key = kvm_async_pf_hash_fn(gfn);
> +
> +	for (i = 0; i < roundup_pow_of_two(ASYNC_PF_PER_VCPU) &&
> +		     (vcpu->arch.apf.gfns[key] != gfn &&
> +		      vcpu->arch.apf.gfns[key] != ~0); i++)
> +		key = kvm_async_pf_next_probe(key);
> +
> +	return key;
> +}
> +
> +bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	return vcpu->arch.apf.gfns[kvm_async_pf_gfn_slot(vcpu, gfn)] == gfn;
> +}
> +
> +static void kvm_del_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 i, j, k;
> +
> +	i = j = kvm_async_pf_gfn_slot(vcpu, gfn);
> +	while (true) {
> +		vcpu->arch.apf.gfns[i] = ~0;
> +		do {
> +			j = kvm_async_pf_next_probe(j);
> +			if (vcpu->arch.apf.gfns[j] == ~0)
> +				return;
> +			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
> +			/*
> +			 * k lies cyclically in ]i,j]
> +			 * |    i.k.j |
> +			 * |....j i.k.| or  |.k..j i...|
> +			 */
> +		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
> +		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
> +		i = j;
> +	}
> +}
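
/* (Not part of the patch: the loop above is a backward-shift delete
 * for linear probing.  A worked example on a hypothetical table, with
 * home slots in parentheses:
 *
 *   before delete A:  [0]=A(0)  [1]=B(0)  [2]=C(1)  [3]=~0
 *
 * Deleting A empties slot 0 (i=0).  j=1 holds B with k=0; 0 is not in
 * ]0,1], so B moves back to slot 0 and i becomes 1.  j=2 holds C with
 * k=1; 1 is not in ]1,2], so C moves to slot 1 and i becomes 2.  j=3
 * is empty, so the loop stops:
 *
 *   after delete A:   [0]=B(0)  [1]=C(1)  [2]=~0    [3]=~0
 *
 * Every surviving gfn stays reachable by probing from its home slot,
 * so no tombstones are needed.) */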
> +
> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> +				     struct kvm_async_pf *work)
> +{
> +	trace_kvm_async_pf_not_present(work->gva);
> +
> +	kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> +	kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
> +}
> +
> +void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> +				 struct kvm_async_pf *work)
> +{
> +	trace_kvm_async_pf_ready(work->gva);
> +	kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
> +}
> +
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0b89d00..9a9b017 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -40,6 +40,7 @@
>  #define KVM_REQ_KICK               9
>  #define KVM_REQ_DEACTIVATE_FPU    10
>  #define KVM_REQ_EVENT             11
> +#define KVM_REQ_APF_HALT          12
>  
>  #define KVM_USERSPACE_IRQ_SOURCE_ID	0
>  
> @@ -74,6 +75,26 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
>  int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
>  			      struct kvm_io_device *dev);
>  
> +#ifdef CONFIG_KVM_ASYNC_PF
> +struct kvm_async_pf {
> +	struct work_struct work;
> +	struct list_head link;
> +	struct list_head queue;
> +	struct kvm_vcpu *vcpu;
> +	struct mm_struct *mm;
> +	gva_t gva;
> +	unsigned long addr;
> +	struct kvm_arch_async_pf arch;
> +	struct page *page;
> +	bool done;
> +};
> +
> +void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
> +void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
> +int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
> +		       struct kvm_arch_async_pf *arch);
> +#endif
> +

Based on early kvm-kmod experiments, it looks like this (and maybe more)
breaks the build in arch/x86/kvm/x86.c if CONFIG_KVM_ASYNC_PF is
disabled. Please have a look.
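
For illustration, stubs along these lines are roughly what a
!CONFIG_KVM_ASYNC_PF build would need (an untested sketch mirroring
the declarations quoted above; and, as noted, there may be more to it,
e.g. the vcpu->async_pf.done member used in kvm_arch_vcpu_runnable()):

#ifndef CONFIG_KVM_ASYNC_PF
static inline void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
{
}

static inline void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
{
}

static inline int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva,
				     gfn_t gfn, struct kvm_arch_async_pf *arch)
{
	return 0;	/* nothing queued; callers keep the synchronous path */
}
#endif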

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-20 11:28     ` Jan Kiszka
@ 2010-10-20 11:33       ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-20 11:33 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Wed, Oct 20, 2010 at 01:28:49PM +0200, Jan Kiszka wrote:
> 
> Based on early kvm-kmod experiments, it looks like this (and maybe more)
> breaks the build in arch/x86/kvm/x86.c if CONFIG_KVM_ASYNC_PF is
> disabled. Please have a look.
> 
CONFIG_KVM_ASYNC_PF is always enabled on x86.
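
The select chain is what makes it unconditional whenever KVM itself is
built (a sketch: the arch side is the Kconfig hunk of patch 02, the
virt/kvm side is abridged):

# arch/x86/kvm/Kconfig
config KVM
	...
	select KVM_ASYNC_PF

# virt/kvm/Kconfig
config KVM_ASYNC_PF
	bool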

--
			Gleb.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-20 11:33       ` Gleb Natapov
@ 2010-10-20 11:35         ` Jan Kiszka
  -1 siblings, 0 replies; 56+ messages in thread
From: Jan Kiszka @ 2010-10-20 11:35 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On 20.10.2010 13:33, Gleb Natapov wrote:
> On Wed, Oct 20, 2010 at 01:28:49PM +0200, Jan Kiszka wrote:
>>
>> Based on early kvm-kmod experiments, it looks like this (and maybe more)
>> breaks the build in arch/x86/kvm/x86.c if CONFIG_KVM_ASYNC_PF is
>> disabled. Please have a look.
>>
> CONFIG_KVM_ASYNC_PF is always enabled on x86.

Ah, so this is more like CONFIG_HAVE_KVM_ASYNC_PF?

Then I need some stubs in kvm-kmod to handle the case that it's disabled
on x86 (due to missing host kernel features).
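
With a HAVE_-style capability symbol the split would presumably follow
the usual kernel pattern, roughly (hypothetical sketch, not from the
series):

# virt/kvm/Kconfig
config HAVE_KVM_ASYNC_PF
	bool

# arch/x86/kvm/Kconfig
config KVM
	...
	select HAVE_KVM_ASYNC_PF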

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 02/12] Halt vcpu if page it tries to access is swapped out.
  2010-10-20 11:35         ` Jan Kiszka
@ 2010-10-20 11:39           ` Gleb Natapov
  -1 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-20 11:39 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Wed, Oct 20, 2010 at 01:35:59PM +0200, Jan Kiszka wrote:
> On 20.10.2010 13:33, Gleb Natapov wrote:
> > On Wed, Oct 20, 2010 at 01:28:49PM +0200, Jan Kiszka wrote:
> >>
> >> Based on early kvm-kmod experiments, it looks like this (and maybe more)
> >> breaks the build in arch/x86/kvm/x86.c if CONFIG_KVM_ASYNC_PF is
> >> disabled. Please have a look.
> >>
> > CONFIG_KVM_ASYNC_PF is always enabled on x86.
> 
> Ah, so this is more like CONFIG_HAVE_KVM_ASYNC_PF?
> 
Yes. Your name is probably better.

--
			Gleb.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 08/12] Handle async PF in a guest.
  2010-10-14  9:22   ` Gleb Natapov
@ 2010-10-20 11:48     ` Jan Kiszka
  -1 siblings, 0 replies; 56+ messages in thread
From: Jan Kiszka @ 2010-10-20 11:48 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On 14.10.2010 11:22, Gleb Natapov wrote:
> When the async PF capability is detected, hook up a special page fault
> handler that handles async page fault events and passes all other page
> faults through to the regular page fault handler. Also add async PF
> handling to nested SVM emulation. Async PF always generates an exit to
> L1, where the vcpu thread will be scheduled out until the page is
> available.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/x86/include/asm/kvm_para.h |   12 +++
>  arch/x86/include/asm/traps.h    |    1 +
>  arch/x86/kernel/entry_32.S      |   10 ++
>  arch/x86/kernel/entry_64.S      |    3 +
>  arch/x86/kernel/kvm.c           |  181 +++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/svm.c              |   45 ++++++++--
>  6 files changed, 243 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 2315398..fbfd367 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -65,6 +65,9 @@ struct kvm_mmu_op_release_pt {
>  	__u64 pt_phys;
>  };
>  
> +#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
> +#define KVM_PV_REASON_PAGE_READY 2
> +
>  struct kvm_vcpu_pv_apf_data {
>  	__u32 reason;
>  	__u8 pad[60];
> @@ -171,8 +174,17 @@ static inline unsigned int kvm_arch_para_features(void)
>  
>  #ifdef CONFIG_KVM_GUEST
>  void __init kvm_guest_init(void);
> +void kvm_async_pf_task_wait(u32 token);
> +void kvm_async_pf_task_wake(u32 token);
> +u32 kvm_read_and_reset_pf_reason(void);
>  #else
>  #define kvm_guest_init() do { } while (0)
> +#define kvm_async_pf_task_wait(T) do {} while(0)
> +#define kvm_async_pf_task_wake(T) do {} while(0)
> +static inline u32 kvm_read_and_reset_pf_reason(void)
> +{
> +	return 0;
> +}
>  #endif
>  
>  #endif /* __KERNEL__ */
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index f66cda5..0310da6 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -30,6 +30,7 @@ asmlinkage void segment_not_present(void);
>  asmlinkage void stack_segment(void);
>  asmlinkage void general_protection(void);
>  asmlinkage void page_fault(void);
> +asmlinkage void async_page_fault(void);
>  asmlinkage void spurious_interrupt_bug(void);
>  asmlinkage void coprocessor_error(void);
>  asmlinkage void alignment_check(void);
> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index 227d009..e6e7273 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -1496,6 +1496,16 @@ ENTRY(general_protection)
>  	CFI_ENDPROC
>  END(general_protection)
>  
> +#ifdef CONFIG_KVM_GUEST
> +ENTRY(async_page_fault)
> +	RING0_EC_FRAME
> +	pushl $do_async_page_fault
> +	CFI_ADJUST_CFA_OFFSET 4
> +	jmp error_code
> +	CFI_ENDPROC
> +END(async_page_fault)
> +#endif
> +
>  /*
>   * End of kprobes section
>   */
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 17be5ec..def98c3 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1349,6 +1349,9 @@ errorentry xen_stack_segment do_stack_segment
>  #endif
>  errorentry general_protection do_general_protection
>  errorentry page_fault do_page_fault
> +#ifdef CONFIG_KVM_GUEST
> +errorentry async_page_fault do_async_page_fault
> +#endif
>  #ifdef CONFIG_X86_MCE
>  paranoidzeroentry machine_check *machine_check_vector(%rip)
>  #endif
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 032d03b..d564063 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -29,8 +29,14 @@
>  #include <linux/hardirq.h>
>  #include <linux/notifier.h>
>  #include <linux/reboot.h>
> +#include <linux/hash.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/kprobes.h>
>  #include <asm/timer.h>
>  #include <asm/cpu.h>
> +#include <asm/traps.h>
> +#include <asm/desc.h>
>  
>  #define MMU_QUEUE_SIZE 1024
>  
> @@ -64,6 +70,168 @@ static void kvm_io_delay(void)
>  {
>  }
>  
> +#define KVM_TASK_SLEEP_HASHBITS 8
> +#define KVM_TASK_SLEEP_HASHSIZE (1<<KVM_TASK_SLEEP_HASHBITS)
> +
> +struct kvm_task_sleep_node {
> +	struct hlist_node link;
> +	wait_queue_head_t wq;
> +	u32 token;
> +	int cpu;
> +};
> +
> +static struct kvm_task_sleep_head {
> +	spinlock_t lock;
> +	struct hlist_head list;
> +} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];
> +
> +static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
> +						  u32 token)
> +{
> +	struct hlist_node *p;
> +
> +	hlist_for_each(p, &b->list) {
> +		struct kvm_task_sleep_node *n =
> +			hlist_entry(p, typeof(*n), link);
> +		if (n->token == token)
> +			return n;
> +	}
> +
> +	return NULL;
> +}
> +
> +void kvm_async_pf_task_wait(u32 token)
> +{
> +	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> +	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> +	struct kvm_task_sleep_node n, *e;
> +	DEFINE_WAIT(wait);
> +
> +	spin_lock(&b->lock);
> +	e = _find_apf_task(b, token);
> +	if (e) {
> +		/* dummy entry exists -> wake up was delivered ahead of PF */
> +		hlist_del(&e->link);
> +		kfree(e);
> +		spin_unlock(&b->lock);
> +		return;
> +	}
> +
> +	n.token = token;
> +	n.cpu = smp_processor_id();
> +	init_waitqueue_head(&n.wq);
> +	hlist_add_head(&n.link, &b->list);
> +	spin_unlock(&b->lock);
> +
> +	for (;;) {
> +		prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> +		if (hlist_unhashed(&n.link))
> +			break;
> +		local_irq_enable();
> +		schedule();
> +		local_irq_disable();
> +	}
> +	finish_wait(&n.wq, &wait);
> +
> +	return;
> +}
> +EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
> +
> +static void apf_task_wake_one(struct kvm_task_sleep_node *n)
> +{
> +	hlist_del_init(&n->link);
> +	if (waitqueue_active(&n->wq))
> +		wake_up(&n->wq);
> +}
> +
> +static void apf_task_wake_all(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
> +		struct hlist_node *p, *next;
> +		struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
> +		spin_lock(&b->lock);
> +		hlist_for_each_safe(p, next, &b->list) {
> +			struct kvm_task_sleep_node *n =
> +				hlist_entry(p, typeof(*n), link);
> +			if (n->cpu == smp_processor_id())
> +				apf_task_wake_one(n);
> +		}
> +		spin_unlock(&b->lock);
> +	}
> +}
> +
> +void kvm_async_pf_task_wake(u32 token)
> +{
> +	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> +	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> +	struct kvm_task_sleep_node *n;
> +
> +	if (token == ~0) {
> +		apf_task_wake_all();
> +		return;
> +	}
> +
> +again:
> +	spin_lock(&b->lock);
> +	n = _find_apf_task(b, token);
> +	if (!n) {
> +		/*
> +		 * async PF was not yet handled.
> +		 * Add dummy entry for the token.
> +		 */
> +		n = kmalloc(sizeof(*n), GFP_ATOMIC);
> +		if (!n) {
> +			/*
> +			 * Allocation failed! Busy wait while other cpu
> +			 * handles async PF.
> +			 */
> +			spin_unlock(&b->lock);
> +			cpu_relax();
> +			goto again;
> +		}
> +		n->token = token;
> +		n->cpu = smp_processor_id();
> +		init_waitqueue_head(&n->wq);
> +		hlist_add_head(&n->link, &b->list);
> +	} else
> +		apf_task_wake_one(n);
> +	spin_unlock(&b->lock);
> +	return;
> +}
> +EXPORT_SYMBOL_GPL(kvm_async_pf_task_wake);
> +
> +u32 kvm_read_and_reset_pf_reason(void)
> +{
> +	u32 reason = 0;
> +
> +	if (__get_cpu_var(apf_reason).enabled) {
> +		reason = __get_cpu_var(apf_reason).reason;
> +		__get_cpu_var(apf_reason).reason = 0;
> +	}
> +
> +	return reason;
> +}
> +EXPORT_SYMBOL_GPL(kvm_read_and_reset_pf_reason);
> +
> +dotraplinkage void __kprobes
> +do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
> +{
> +	switch (kvm_read_and_reset_pf_reason()) {
> +	default:
> +		do_page_fault(regs, error_code);
> +		break;
> +	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +		/* page is swapped out by the host. */
> +		kvm_async_pf_task_wait((u32)read_cr2());
> +		break;
> +	case KVM_PV_REASON_PAGE_READY:
> +		kvm_async_pf_task_wake((u32)read_cr2());
> +		break;
> +	}
> +}
> +
>  static void kvm_mmu_op(void *buffer, unsigned len)
>  {
>  	int r;
> @@ -300,6 +468,7 @@ static void kvm_guest_cpu_online(void *dummy)
>  static void kvm_guest_cpu_offline(void *dummy)
>  {
>  	kvm_pv_disable_apf(NULL);
> +	apf_task_wake_all();
>  }
>  
>  static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
> @@ -327,13 +496,25 @@ static struct notifier_block __cpuinitdata kvm_cpu_notifier = {
>  };
>  #endif
>  
> +static void __init kvm_apf_trap_init(void)
> +{
> +	set_intr_gate(14, &async_page_fault);
> +}
> +
>  void __init kvm_guest_init(void)
>  {
> +	int i;
> +
>  	if (!kvm_para_available())
>  		return;
>  
>  	paravirt_ops_setup();
>  	register_reboot_notifier(&kvm_pv_reboot_nb);
> +	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
> +		spin_lock_init(&async_pf_sleepers[i].lock);
> +	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
> +		x86_init.irqs.trap_init = kvm_apf_trap_init;
> +
>  #ifdef CONFIG_SMP
>  	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
>  	register_cpu_notifier(&kvm_cpu_notifier);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 9a92224..9fa27a5 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -31,6 +31,7 @@
>  
>  #include <asm/tlbflush.h>
>  #include <asm/desc.h>
> +#include <asm/kvm_para.h>
>  
>  #include <asm/virtext.h>
>  #include "trace.h"
> @@ -133,6 +134,7 @@ struct vcpu_svm {
>  
>  	unsigned int3_injected;
>  	unsigned long int3_rip;
> +	u32 apf_reason;
>  };
>  
>  #define MSR_INVALID			0xffffffffU
> @@ -1383,16 +1385,33 @@ static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)
>  
>  static int pf_interception(struct vcpu_svm *svm)
>  {
> -	u64 fault_address;
> +	u64 fault_address = svm->vmcb->control.exit_info_2;
>  	u32 error_code;
> +	int r = 1;
>  
> -	fault_address  = svm->vmcb->control.exit_info_2;
> -	error_code = svm->vmcb->control.exit_info_1;
> +	switch (svm->apf_reason) {
> +	default:
> +		error_code = svm->vmcb->control.exit_info_1;
>  
> -	trace_kvm_page_fault(fault_address, error_code);
> -	if (!npt_enabled && kvm_event_needs_reinjection(&svm->vcpu))
> -		kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
> -	return kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code);
> +		trace_kvm_page_fault(fault_address, error_code);
> +		if (!npt_enabled && kvm_event_needs_reinjection(&svm->vcpu))
> +			kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
> +		r = kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code);
> +		break;
> +	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +		svm->apf_reason = 0;
> +		local_irq_disable();
> +		kvm_async_pf_task_wait(fault_address);
> +		local_irq_enable();
> +		break;
> +	case KVM_PV_REASON_PAGE_READY:
> +		svm->apf_reason = 0;
> +		local_irq_disable();
> +		kvm_async_pf_task_wake(fault_address);
> +		local_irq_enable();
> +		break;

That's only available if CONFIG_KVM_GUEST is set, no? Is there anything
I'm missing that resolves this dependency automatically? Otherwise, some
more #ifdef CONFIG_KVM_GUEST might be needed.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 56+ messages in thread
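
Setting the build question aside, the heart of the quoted guest code is the
rendezvous on the fault token: whichever of wait and wake runs first leaves
state behind for the other, so a "page ready" notification that races ahead
of the corresponding fault is never lost. A rough userspace analogue,
assuming POSIX threads (all names illustrative; a fixed-size table stands in
for the kernel's hash of dynamically allocated nodes):

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NSLOTS 256

struct apf_slot {
	uint32_t token;
	bool woken;	/* set once the wake side has run */
	bool in_use;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static struct apf_slot slots[NSLOTS];

static struct apf_slot *find_slot(uint32_t token)
{
	for (int i = 0; i < NSLOTS; i++)
		if (slots[i].in_use && slots[i].token == token)
			return &slots[i];
	return NULL;
}

static struct apf_slot *alloc_slot(uint32_t token)
{
	for (int i = 0; i < NSLOTS; i++)
		if (!slots[i].in_use) {
			slots[i] = (struct apf_slot){ .token = token,
						      .in_use = true };
			return &slots[i];
		}
	abort();	/* table full; the kernel instead busy-waits */
}

void apf_task_wait(uint32_t token)	/* "page not present" path */
{
	pthread_mutex_lock(&lock);
	struct apf_slot *s = find_slot(token);
	if (s && s->woken) {
		/* dummy entry: the wake beat us here, nothing to wait for */
		s->in_use = false;
		pthread_mutex_unlock(&lock);
		return;
	}
	if (!s)
		s = alloc_slot(token);
	while (!s->woken)	/* broadcast wakes everyone; recheck our slot */
		pthread_cond_wait(&cond, &lock);
	s->in_use = false;
	pthread_mutex_unlock(&lock);
}

void apf_task_wake(uint32_t token)	/* "page ready" path */
{
	pthread_mutex_lock(&lock);
	struct apf_slot *s = find_slot(token);
	if (!s)
		s = alloc_slot(token);	/* wake before wait: leave a dummy */
	s->woken = true;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
}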

* Re: [PATCH v7 08/12] Handle async PF in a guest.
  2010-10-20 11:48     ` Jan Kiszka
@ 2010-10-20 11:50       ` Jan Kiszka
  -1 siblings, 0 replies; 56+ messages in thread
From: Jan Kiszka @ 2010-10-20 11:50 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Am 20.10.2010 13:48, Jan Kiszka wrote:
> Am 14.10.2010 11:22, Gleb Natapov wrote:
>> [...]
>> @@ -1383,16 +1385,33 @@ static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)
>>  
>>  static int pf_interception(struct vcpu_svm *svm)
>>  {
>> -	u64 fault_address;
>> +	u64 fault_address = svm->vmcb->control.exit_info_2;
>>  	u32 error_code;
>> +	int r = 1;
>>  
>> -	fault_address  = svm->vmcb->control.exit_info_2;
>> -	error_code = svm->vmcb->control.exit_info_1;
>> +	switch (svm->apf_reason) {
>> +	default:
>> +		error_code = svm->vmcb->control.exit_info_1;
>>  
>> -	trace_kvm_page_fault(fault_address, error_code);
>> -	if (!npt_enabled && kvm_event_needs_reinjection(&svm->vcpu))
>> -		kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
>> -	return kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code);
>> +		trace_kvm_page_fault(fault_address, error_code);
>> +		if (!npt_enabled && kvm_event_needs_reinjection(&svm->vcpu))
>> +			kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
>> +		r = kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code);
>> +		break;
>> +	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>> +		svm->apf_reason = 0;
>> +		local_irq_disable();
>> +		kvm_async_pf_task_wait(fault_address);
>> +		local_irq_enable();
>> +		break;
>> +	case KVM_PV_REASON_PAGE_READY:
>> +		svm->apf_reason = 0;
>> +		local_irq_disable();
>> +		kvm_async_pf_task_wake(fault_address);
>> +		local_irq_enable();
>> +		break;
> 
> That's only available if CONFIG_KVM_GUEST is set, no? Is there anything
> I miss that resolves this dependency automatically? Otherwise, some more
> #ifdef CONFIG_KVM_GUEST might be needed.

Err, found it. Sorry for the noise.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 56+ messages in thread
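
What resolves the dependency is presumably the !CONFIG_KVM_GUEST branch of
the kvm_para.h hunk in this same patch: with the guest feature configured
out, the wait/wake calls compile to no-ops, so svm.c builds either way.
In outline (mirroring the quoted hunk):

#ifdef CONFIG_KVM_GUEST
void kvm_async_pf_task_wait(u32 token);
void kvm_async_pf_task_wake(u32 token);
#else
#define kvm_async_pf_task_wait(T) do { } while (0)
#define kvm_async_pf_task_wake(T) do { } while (0)
#endif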

* Re: [PATCH v7 08/12] Handle async PF in a guest.
  2010-10-20 11:48     ` Jan Kiszka
  (?)
@ 2010-10-20 11:53       ` Peter Zijlstra
  -1 siblings, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2010-10-20 11:53 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, avi, mingo, tglx, hpa,
	riel, cl, mtosatti

On Wed, 2010-10-20 at 13:48 +0200, Jan Kiszka wrote:
> > +     case KVM_PV_REASON_PAGE_READY:
> > +             svm->apf_reason = 0;
> > +             local_irq_disable();
> > +             kvm_async_pf_task_wake(fault_address);
> > +             local_irq_enable();
> > +             break;
> 
> That's only available if CONFIG_KVM_GUEST is set, no? Is there anything
> I miss that resolves this dependency automatically? Otherwise, some more
> #ifdef CONFIG_KVM_GUEST might be needed.


Could you please trim your replies?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v7 04/12] Add memory slot versioning and use it to provide fast guest write interface
  2010-10-14  9:16 y
@ 2010-10-14  9:17   ` y
  2010-10-14  9:17 ` y
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 56+ messages in thread
From: y @ 2010-10-14  9:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

From: Gleb Natapov <gleb@redhat.com>

Keep track of memslot changes by keeping a generation number in the
memslots structure. Provide a kvm_write_guest_cached() function that
skips the gfn_to_hva() translation if the memslots have not changed
since the previous invocation.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 include/linux/kvm_host.h  |    7 ++++
 include/linux/kvm_types.h |    7 ++++
 virt/kvm/kvm_main.c       |   75 +++++++++++++++++++++++++++++++++++++-------
 3 files changed, 77 insertions(+), 12 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9a9b017..dda88f2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -199,6 +199,7 @@ struct kvm_irq_routing_table {};
 
 struct kvm_memslots {
 	int nmemslots;
+	u64 generation;
 	struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
 					KVM_PRIVATE_MEM_SLOTS];
 };
@@ -352,12 +353,18 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
 			 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 		    unsigned long len);
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len);
+int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			      gpa_t gpa);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn);
 
 void kvm_vcpu_block(struct kvm_vcpu *vcpu);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 7ac0d4e..fa7cc72 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -67,4 +67,11 @@ struct kvm_lapic_irq {
 	u32 dest_id;
 };
 
+struct gfn_to_hva_cache {
+	u64 generation;
+	gpa_t gpa;
+	unsigned long hva;
+	struct kvm_memory_slot *memslot;
+};
+
 #endif /* __KVM_TYPES_H__ */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 238079e..5d57ec9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -687,6 +687,7 @@ skip_lpage:
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 		if (mem->slot >= slots->nmemslots)
 			slots->nmemslots = mem->slot + 1;
+		slots->generation++;
 		slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
 
 		old_memslots = kvm->memslots;
@@ -723,6 +724,7 @@ skip_lpage:
 	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 	if (mem->slot >= slots->nmemslots)
 		slots->nmemslots = mem->slot + 1;
+	slots->generation++;
 
 	/* actual memory is freed via old in kvm_free_physmem_slot below */
 	if (!npages) {
@@ -853,10 +855,10 @@ int kvm_is_error_hva(unsigned long addr)
 }
 EXPORT_SYMBOL_GPL(kvm_is_error_hva);
 
-struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
+static struct kvm_memory_slot *__gfn_to_memslot(struct kvm_memslots *slots,
+						gfn_t gfn)
 {
 	int i;
-	struct kvm_memslots *slots = kvm_memslots(kvm);
 
 	for (i = 0; i < slots->nmemslots; ++i) {
 		struct kvm_memory_slot *memslot = &slots->memslots[i];
@@ -867,6 +869,11 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 	}
 	return NULL;
 }
+
+struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
+{
+	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
+}
 EXPORT_SYMBOL_GPL(gfn_to_memslot);
 
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
@@ -929,12 +936,9 @@ int memslot_id(struct kvm *kvm, gfn_t gfn)
 	return memslot - slots->memslots;
 }
 
-static unsigned long gfn_to_hva_many(struct kvm *kvm, gfn_t gfn,
+static unsigned long gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
 				     gfn_t *nr_pages)
 {
-	struct kvm_memory_slot *slot;
-
-	slot = gfn_to_memslot(kvm, gfn);
 	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
 		return bad_hva();
 
@@ -946,7 +950,7 @@ static unsigned long gfn_to_hva_many(struct kvm *kvm, gfn_t gfn,
 
 unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
 {
-	return gfn_to_hva_many(kvm, gfn, NULL);
+	return gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_hva);
 
@@ -1063,7 +1067,7 @@ int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages,
 	unsigned long addr;
 	gfn_t entry;
 
-	addr = gfn_to_hva_many(kvm, gfn, &entry);
+	addr = gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, &entry);
 	if (kvm_is_error_hva(addr))
 		return -1;
 
@@ -1247,6 +1251,47 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 	return 0;
 }
 
+int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			      gpa_t gpa)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	int offset = offset_in_page(gpa);
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+
+	ghc->gpa = gpa;
+	ghc->generation = slots->generation;
+	ghc->memslot = __gfn_to_memslot(slots, gfn);
+	ghc->hva = gfn_to_hva_many(ghc->memslot, gfn, NULL);
+	if (!kvm_is_error_hva(ghc->hva))
+		ghc->hva += offset;
+	else
+		return -EFAULT;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);
+
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	int r;
+
+	if (slots->generation != ghc->generation)
+		kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);
+
+	if (kvm_is_error_hva(ghc->hva))
+		return -EFAULT;
+
+	r = copy_to_user((void __user *)ghc->hva, data, len);
+	if (r)
+		return -EFAULT;
+	mark_page_dirty_in_slot(kvm, ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
+
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
 {
 	return kvm_write_guest_page(kvm, gfn, empty_zero_page, offset, len);
@@ -1272,11 +1317,9 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn)
 {
-	struct kvm_memory_slot *memslot;
-
-	memslot = gfn_to_memslot(kvm, gfn);
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 
@@ -1284,6 +1327,14 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	}
 }
 
+void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *memslot;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	mark_page_dirty_in_slot(kvm, memslot, gfn);
+}
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread
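
The mechanism is easiest to see stripped of the KVM types: a cached
translation records the generation it was computed under, and a mismatch
against the current memslots generation forces one slow re-lookup before
the fast path resumes. A rough userspace sketch (names illustrative,
assuming a single global counter bumped on every slot update, as in the
patch):

#include <stdint.h>

struct hva_cache {
	uint64_t generation;	/* memslots generation this entry saw */
	uint64_t gpa;
	void *hva;
};

static uint64_t slots_generation;	/* bumped on every memslot change */

static void *slow_translate(uint64_t gpa)
{
	/* stand-in for the real gfn_to_hva() walk */
	return (void *)(uintptr_t)(gpa + 0x100000);
}

void *cached_translate(struct hva_cache *c, uint64_t gpa)
{
	if (c->generation != slots_generation || c->gpa != gpa) {
		c->hva = slow_translate(gpa);	/* slow path: full lookup */
		c->gpa = gpa;
		c->generation = slots_generation;
	}
	return c->hva;	/* fast path: reuse while generation matches */
}

Invalidation cost is thus deferred to the next user of each cache entry
rather than paid eagerly when a slot changes.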

end of thread, other threads:[~2010-10-20 12:13 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-14  9:22 [PATCH v7 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
2010-10-14  9:22 ` Gleb Natapov
2010-10-14  9:22 ` [PATCH v7 01/12] Add get_user_pages() variant that fails if major fault is required Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-14  9:22 ` [PATCH v7 02/12] Halt vcpu if page it tries to access is swapped out Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-20 11:28   ` Jan Kiszka
2010-10-20 11:28     ` Jan Kiszka
2010-10-20 11:33     ` Gleb Natapov
2010-10-20 11:33       ` Gleb Natapov
2010-10-20 11:35       ` Jan Kiszka
2010-10-20 11:35         ` Jan Kiszka
2010-10-20 11:39         ` Gleb Natapov
2010-10-20 11:39           ` Gleb Natapov
2010-10-14  9:22 ` [PATCH v7 03/12] Retry fault before vmentry Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-17 10:33   ` Avi Kivity
2010-10-17 10:33     ` Avi Kivity
2010-10-17 10:43     ` Avi Kivity
2010-10-17 10:43       ` Avi Kivity
2010-10-17 16:13   ` Gleb Natapov
2010-10-17 16:13     ` Gleb Natapov
2010-10-14  9:22 ` [PATCH v7 04/12] Add memory slot versioning and use it to provide fast guest write interface Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-18 13:22   ` Gleb Natapov
2010-10-18 13:22     ` Gleb Natapov
2010-10-14  9:22 ` [PATCH v7 05/12] Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-14  9:22 ` [PATCH v7 06/12] Add PV MSR to enable asynchronous page faults delivery Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-14  9:22 ` [PATCH v7 07/12] Add async PF initialization to PV guest Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-14  9:22 ` [PATCH v7 08/12] Handle async PF in a guest Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-20 11:48   ` Jan Kiszka
2010-10-20 11:48     ` Jan Kiszka
2010-10-20 11:50     ` Jan Kiszka
2010-10-20 11:50       ` Jan Kiszka
2010-10-20 11:53     ` Peter Zijlstra
2010-10-20 11:53       ` Peter Zijlstra
2010-10-20 11:53       ` Peter Zijlstra
2010-10-14  9:22 ` [PATCH v7 09/12] Inject asynchronous page fault into a PV guest if page is swapped out Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-14  9:22 ` [PATCH v7 10/12] Handle async PF in non preemptable context Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-14  9:22 ` [PATCH v7 11/12] Let host know whether the guest can handle async PF in non-userspace context Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-14  9:22 ` [PATCH v7 12/12] Send async PF when guest is not in userspace too Gleb Natapov
2010-10-14  9:22   ` Gleb Natapov
2010-10-18 15:34 ` [PATCH v7 00/12] KVM: Add host swap event notifications for PV guest Marcelo Tosatti
2010-10-18 15:34   ` Marcelo Tosatti

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.