* [PATCH v5 00/12] KVM: Add host swap event notifications for PV guest
@ 2010-07-19 15:30 ` Gleb Natapov
  0 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

KVM virtualizes guest memory by means of shadow pages or HW assistance
like NPT/EPT. Not all memory used by a guest is mapped into the guest
address space or even present in host memory at any given time.
When a vcpu tries to access a memory page that is not mapped into the
guest address space, KVM is notified about it. KVM maps the page into
the guest address space and resumes vcpu execution. If the page has been
swapped out of host memory, vcpu execution is suspended until the page
is swapped back in. This is inefficient, since the vcpu could do other
work (run another task or serve interrupts) while the page is being
swapped in.

To overcome this inefficiency this patch series implements "asynchronous
page fault" for paravirtualized KVM guests. If a page that a vcpu is
trying to access is swapped out, KVM sends an async PF to the vcpu
and continues vcpu execution; the requested page is swapped in by another
thread in parallel. When the vcpu gets the async PF it puts the faulting
task to sleep until a "wake up" interrupt is delivered. When the page is
brought back into host memory, KVM sends the "wake up" interrupt and the
guest task resumes execution.
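
For readers skimming the series, here is a condensed sketch of the guest
side of this protocol (a simplified rendering of the do_async_page_fault
handler added in patch 4, not the verbatim code; the function name here
is only for illustration):

void handle_async_pf(struct pt_regs *regs, unsigned long error_code)
{
	/* The host writes the reason into a per-cpu area shared via
	 * MSR_KVM_ASYNC_PF_EN and delivers the token through CR2. */
	u32 reason = __get_cpu_var(apf_reason).reason;
	u32 token = (u32)read_cr2();

	__get_cpu_var(apf_reason).reason = 0;

	if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT)
		apf_task_wait(current, token);	/* host is swapping the page in */
	else if (reason == KVM_PV_REASON_PAGE_READY)
		apf_task_wake(token);		/* wake the task sleeping on this token */
	else
		do_page_fault(regs, error_code);	/* ordinary #PF */
}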

To measure the performance benefit I use a simple benchmark program
(below) that starts a number of threads. Some of them do work (increment
a counter), others access a huge array at random locations, trying to
generate host page faults. The size of the array is smaller than guest
memory but bigger than host memory, so we are guaranteed that the host
will swap out part of the array.

Running the benchmark inside a guest with 4 cpus and 2G of memory,
itself running in a 512M container on the host, with the command line
"./bm -f 4 -w 4 -t 60" (run 4 faulting threads and 4 working threads for
a minute) I get this result:

With async pf:
start
worker 0: 63972141051
worker 1: 65149033299
worker 2: 66301967246
worker 3: 63423000989
total: 258846142585


Without async pf:
start
worker 0: 30619912622
worker 1: 33951339266
worker 2: 31577780093
worker 3: 33603607972
total: 129752639953

With async PF the benchmark does roughly twice as much work (258.8G vs.
129.8G total increments); without it, throughput drops by about 50%.

Perf data look like this:
With async pf:
    97.93%       bm  bm                    [.] work_thread
     1.74%       bm  [kernel.kallsyms]     [k] retint_careful
     0.10%       bm  [kernel.kallsyms]     [k] _raw_spin_unlock_irq
     0.08%       bm  bm                    [.] fault_thread
     0.05%       bm  [kernel.kallsyms]     [k] _raw_spin_unlock_irqrestore
     0.02%       bm  [kernel.kallsyms]     [k] __do_softirq
     0.02%       bm  [kernel.kallsyms]     [k] rcu_process_gp_end

Without async pf:
    63.42%       bm  bm                    [.] work_thread
    13.64%       bm  [kernel.kallsyms]     [k] __do_softirq
     8.95%       bm  bm                    [.] fault_thread
     5.27%       bm  [kernel.kallsyms]     [k] _raw_spin_unlock_irq
     2.79%       bm  [kernel.kallsyms]     [k] hrtimer_run_pending
     2.35%       bm  [kernel.kallsyms]     [k] run_timer_softirq
     1.28%       bm  [kernel.kallsyms]     [k] _raw_spin_lock_irq
     1.16%       bm  [kernel.kallsyms]     [k] debug_smp_processor_id
     0.23%       bm  libc-2.10.2.so        [.] random_r
     0.18%       bm  [kernel.kallsyms]     [k] rcu_bh_qs
     0.18%       bm  [kernel.kallsyms]     [k] find_busiest_group
     0.14%       bm  [kernel.kallsyms]     [k] retint_careful
     0.14%       bm  libc-2.10.2.so        [.] random

Changes:
 v1->v2
   Use MSR instead of hypercall.
   Move most of the code into arch independent place.
   Halt inside the guest instead of doing a "wait for page" hypercall if
    preemption is disabled.
 v2->v3
   Use MSR from range 0x4b564dxx.
   Add slot version tracking.
   Support migration by restarting all guest processes after migration.
   Drop the patch that tracked preemptability for non-preemptable kernels
    due to performance concerns. Send async PF to non-preemptable
     guests only when the vcpu is executing userspace code.
 v3->v4
  Provide an alternative page fault handler in the PV guest instead of adding
   a hook to the standard page fault handler and patching it out on non-PV guests.
  Allow only a limited number of outstanding async page faults per vcpu.
  Unify gfn_to_pfn and gfn_to_pfn_async code.
  Cancel outstanding slow work on reset.
 v4->v5
  Move async pv cpu initialization into cpu hotplug notifier.
  Use GFP_NOWAIT instead of GFP_ATOMIC for allocations that shouldn't sleep.
  Process KVM_REQ_MMU_SYNC even in page_fault_other_cr3() before changing
   cr3 back.

Gleb Natapov (12):
  Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
  Add PV MSR to enable asynchronous page faults delivery.
  Add async PF initialization to PV guest.
  Provide special async page fault handler when async PF capability is
    detected
  Export __get_user_pages_fast.
  Add get_user_pages() variant that fails if major fault is required.
  Maintain memslot version number
  Inject asynchronous page fault into a guest if page is swapped out.
  Retry fault before vmentry
  Handle async PF in non preemptable context
  Let host know whether the guest can handle async PF in non-userspace
    context.
  Send async PF when guest is not in userspace too.

 arch/x86/include/asm/kvm_host.h |   27 ++++-
 arch/x86/include/asm/kvm_para.h |   14 ++
 arch/x86/include/asm/traps.h    |    1 +
 arch/x86/kernel/entry_32.S      |   10 ++
 arch/x86/kernel/entry_64.S      |    3 +
 arch/x86/kernel/kvm.c           |  280 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/kvmclock.c      |   13 +--
 arch/x86/kvm/Kconfig            |    2 +
 arch/x86/kvm/mmu.c              |   62 ++++++++-
 arch/x86/kvm/paging_tmpl.h      |   53 +++++++-
 arch/x86/kvm/x86.c              |  122 +++++++++++++++++-
 arch/x86/mm/gup.c               |    2 +
 fs/ncpfs/mmap.c                 |    2 +
 include/linux/kvm.h             |    1 +
 include/linux/kvm_host.h        |   32 +++++
 include/linux/mm.h              |    5 +
 include/trace/events/kvm.h      |   60 +++++++++
 mm/filemap.c                    |    3 +
 mm/memory.c                     |   31 ++++-
 mm/shmem.c                      |    8 +-
 virt/kvm/Kconfig                |    3 +
 virt/kvm/kvm_main.c             |  266 ++++++++++++++++++++++++++++++++++++-
 22 files changed, 966 insertions(+), 34 deletions(-)

=== benchmark.c ===
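
/*
 * Build/run example (an assumption for convenience, not part of the
 * original posting):
 *   gcc -O2 -pthread benchmark.c -o bm
 *   ./bm -f 4 -w 4 -t 60
 */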

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

#define FAULTING_THREADS 1
#define WORKING_THREADS 1
#define TIMEOUT 5
#define MEMORY (1024*1024*1024)

pthread_barrier_t barrier;
volatile int stop;
size_t pages;

void *fault_thread(void* p)
{
	char *mem = p;

	pthread_barrier_wait(&barrier);

	/* touch random pages; when the array exceeds host memory this
	 * regularly hits pages the host has swapped out */
	while (!stop)
		mem[(random() % pages) << 12] = 10;

	pthread_barrier_wait(&barrier);

	return NULL;
}

void *work_thread(void* p)
{
	unsigned long *i = p;

	pthread_barrier_wait(&barrier);

	while (!stop)
		(*i)++;

	pthread_barrier_wait(&barrier);

	return NULL;
}

int main(int argc, char **argv)
{
	int ft = FAULTING_THREADS, wt = WORKING_THREADS;
	unsigned int timeout = TIMEOUT;
	size_t mem = MEMORY;
	void *buf;
	int i, opt, verbose = 0;
	pthread_t t;
	pthread_attr_t pattr;
	unsigned long *res, sum = 0;

	while((opt = getopt(argc, argv, "f:w:m:t:v")) != -1) {
		switch (opt) {
		case 'f':
			ft = atoi(optarg);
			break;
		case 'w':
			wt = atoi(optarg);
			break;
		case 'm':
			mem = atoi(optarg);
			break;
		case 't':
			timeout = atoi(optarg);
			break;
		case 'v':
			verbose++;
			break;
		default:
			fprintf(stderr, "Usage %s [-f num] [-w num] [-m byte] [-t secs]\n", argv[0]);
			exit(1);
		}
	}

	if (verbose)
		printf("fault=%d work=%d mem=%lu timeout=%d\n", ft, wt, mem, timeout);

	pages = mem >> 12;
	posix_memalign(&buf, 4096, pages << 12);
	res = malloc(sizeof (unsigned long) * wt);
	memset(res, 0, sizeof (unsigned long) * wt);

	pthread_attr_init(&pattr);
	pthread_barrier_init(&barrier, NULL, ft + wt + 1);

	for (i = 0; i < ft; i++) {
		pthread_create(&t, &pattr, fault_thread, buf);
		pthread_detach(t);
	}

	for (i = 0; i < wt; i++) {
		pthread_create(&t, &pattr, work_thread, &res[i]);
		pthread_detach(t);
	}

	/* prefault memory */
	memset(buf, 0, pages << 12);
	printf("start\n");

	/* release all threads to start the measurement */
	pthread_barrier_wait(&barrier);

	/* re-arm the barrier for the stop handshake below */
	pthread_barrier_destroy(&barrier);
	pthread_barrier_init(&barrier, NULL, ft + wt + 1);

	sleep(timeout);
	stop = 1;

	pthread_barrier_wait(&barrier);

	for (i = 0; i < wt; i++) {
		sum += res[i];
		printf("worker %d: %lu\n", i, res[i]);
	}
	printf("total: %lu\n", sum);

	return 0;
}

* [PATCH v5 01/12] Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:30   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Async PF also needs to hook into smp_prepare_boot_cpu so move the hook
into generic code.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    1 +
 arch/x86/kernel/kvm.c           |   11 +++++++++++
 arch/x86/kernel/kvmclock.c      |   13 +------------
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 05eba5e..42298ab 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -65,6 +65,7 @@ struct kvm_mmu_op_release_pt {
 #include <asm/processor.h>
 
 extern void kvmclock_init(void);
+extern int kvm_register_clock(char *txt);
 
 
 /* This instruction is vmcall.  On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 63b0ec8..e6db179 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -231,10 +231,21 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+#ifdef CONFIG_SMP
+static void __init kvm_smp_prepare_boot_cpu(void)
+{
+	WARN_ON(kvm_register_clock("primary cpu clock"));
+	native_smp_prepare_boot_cpu();
+}
+#endif
+
 void __init kvm_guest_init(void)
 {
 	if (!kvm_para_available())
 		return;
 
 	paravirt_ops_setup();
+#ifdef CONFIG_SMP
+	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+#endif
 }
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index eb9b76c..67a5f46 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -125,7 +125,7 @@ static struct clocksource kvm_clock = {
 	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
 };
 
-static int kvm_register_clock(char *txt)
+int kvm_register_clock(char *txt)
 {
 	int cpu = smp_processor_id();
 	int low, high;
@@ -150,14 +150,6 @@ static void __cpuinit kvm_setup_secondary_clock(void)
 }
 #endif
 
-#ifdef CONFIG_SMP
-static void __init kvm_smp_prepare_boot_cpu(void)
-{
-	WARN_ON(kvm_register_clock("primary cpu clock"));
-	native_smp_prepare_boot_cpu();
-}
-#endif
-
 /*
  * After the clock is registered, the host will keep writing to the
  * registered memory location. If the guest happens to shutdown, this memory
@@ -204,9 +196,6 @@ void __init kvmclock_init(void)
 	x86_cpuinit.setup_percpu_clockev =
 		kvm_setup_secondary_clock;
 #endif
-#ifdef CONFIG_SMP
-	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
-#endif
 	machine_ops.shutdown  = kvm_shutdown;
 #ifdef CONFIG_KEXEC
 	machine_ops.crash_shutdown  = kvm_crash_shutdown;
-- 
1.7.1


* [PATCH v5 02/12] Add PV MSR to enable asynchronous page faults delivery.
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:30   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Guest enables async PF vcpu functionality using this MSR.

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    3 ++
 arch/x86/include/asm/kvm_para.h |    4 +++
 arch/x86/kvm/x86.c              |   49 +++++++++++++++++++++++++++++++++++++-
 include/linux/kvm.h             |    1 +
 4 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 502e53f..245831a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -364,6 +364,9 @@ struct kvm_vcpu_arch {
 	u64 hv_vapic;
 
 	cpumask_var_t wbinvd_dirty_mask;
+
+	u32 __user *apf_data;
+	u64 apf_msr_val;
 };
 
 struct kvm_arch {
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 42298ab..5b05e9f 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -20,6 +20,7 @@
  * are available. The use of 0x11 and 0x12 is deprecated
  */
 #define KVM_FEATURE_CLOCKSOURCE2        3
+#define KVM_FEATURE_ASYNC_PF		4
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -32,9 +33,12 @@
 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
+#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 
 #define KVM_MAX_MMU_OP_BATCH           32
 
+#define KVM_ASYNC_PF_ENABLED			(1 << 0)
+
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE            1
 #define KVM_MMU_OP_FLUSH_TLB	        2
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 84bfb51..b09bf61 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -726,12 +726,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	7
+#define KVM_SAVE_MSRS_BEGIN	8
 static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
 	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
-	HV_X64_MSR_APIC_ASSIST_PAGE,
+	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_K6_STAR,
 #ifdef CONFIG_X86_64
@@ -1214,6 +1214,37 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	return 0;
 }
 
+static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
+{
+	u64 gpa = data & ~0x3f;
+	int offset = offset_in_page(gpa);
+	unsigned long addr;
+
+	/* Bits 1:5 are reserved, should be zero */
+	if (data & 0x3e)
+		return 1;
+
+	vcpu->arch.apf_msr_val = data;
+
+	if (!(data & KVM_ASYNC_PF_ENABLED)) {
+		vcpu->arch.apf_data = NULL;
+		return 0;
+	}
+
+	addr = gfn_to_hva(vcpu->kvm, gpa >> PAGE_SHIFT);
+	if (kvm_is_error_hva(addr))
+		return 1;
+
+	vcpu->arch.apf_data = (u32 __user*)(addr + offset);
+
+	/* check if address is mapped */
+	if (get_user(offset, vcpu->arch.apf_data)) {
+		vcpu->arch.apf_data = NULL;
+		return 1;
+	}
+	return 0;
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
 	switch (msr) {
@@ -1296,6 +1327,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		kvm_request_guest_time_update(vcpu);
 		break;
 	}
+	case MSR_KVM_ASYNC_PF_EN:
+		if (kvm_pv_enable_async_pf(vcpu, data))
+			return 1;
+		break;
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1548,6 +1583,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 	case MSR_KVM_SYSTEM_TIME_NEW:
 		data = vcpu->arch.time;
 		break;
+	case MSR_KVM_ASYNC_PF_EN:
+		data = vcpu->arch.apf_msr_val;
+		break;
 	case MSR_IA32_P5_MC_ADDR:
 	case MSR_IA32_P5_MC_TYPE:
 	case MSR_IA32_MCG_CAP:
@@ -1683,6 +1721,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_DEBUGREGS:
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
+	case KVM_CAP_ASYNC_PF:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -5357,6 +5396,9 @@ free_vcpu:
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
+	vcpu->arch.apf_data = NULL;
+	vcpu->arch.apf_msr_val = 0;
+
 	vcpu_load(vcpu);
 	kvm_mmu_unload(vcpu);
 	vcpu_put(vcpu);
@@ -5375,6 +5417,9 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.dr6 = DR6_FIXED_1;
 	vcpu->arch.dr7 = DR7_FIXED_1;
 
+	vcpu->arch.apf_data = NULL;
+	vcpu->arch.apf_msr_val = 0;
+
 	return kvm_x86_ops->vcpu_reset(vcpu);
 }
 
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 636fc38..bab7ef0 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -530,6 +530,7 @@ struct kvm_enable_cap {
 #ifdef __KVM_HAVE_XCRS
 #define KVM_CAP_XCRS 56
 #endif
+#define KVM_CAP_ASYNC_PF 57
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.1
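
As an illustration of the MSR layout (this mirrors what the guest-side
code in patch 3 does; the helper below is hypothetical and only shows
the bit layout):

/* MSR_KVM_ASYNC_PF_EN: bits 63:6 hold the physical address of a 64-byte
 * aligned per-vcpu shared area, bits 5:1 are reserved (must be zero) and
 * bit 0 is KVM_ASYNC_PF_ENABLED. */
static void enable_async_pf(struct kvm_vcpu_pv_apf_data *apf)
{
	u64 pa = __pa(apf);	/* must be 64-byte aligned */

	wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
}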


* [PATCH v5 03/12] Add async PF initialization to PV guest.
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:30   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Enable async PF in a guest if async PF capability is discovered.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    5 +++
 arch/x86/kernel/kvm.c           |   68 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 5b05e9f..f1662d7 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -65,6 +65,11 @@ struct kvm_mmu_op_release_pt {
 	__u64 pt_phys;
 };
 
+struct kvm_vcpu_pv_apf_data {
+	__u32 reason;
+	__u32 enabled;
+};
+
 #ifdef __KERNEL__
 #include <asm/processor.h>
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e6db179..5177dd1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -27,7 +27,10 @@
 #include <linux/mm.h>
 #include <linux/highmem.h>
 #include <linux/hardirq.h>
+#include <linux/notifier.h>
+#include <linux/reboot.h>
 #include <asm/timer.h>
+#include <asm/cpu.h>
 
 #define MMU_QUEUE_SIZE 1024
 
@@ -37,6 +40,7 @@ struct kvm_para_state {
 };
 
 static DEFINE_PER_CPU(struct kvm_para_state, para_state);
+static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 
 static struct kvm_para_state *kvm_para_state(void)
 {
@@ -231,12 +235,72 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+void __cpuinit kvm_guest_cpu_init(void)
+{
+	if (!kvm_para_available())
+		return;
+
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) {
+		u64 pa = __pa(&__get_cpu_var(apf_reason));
+
+		if (native_write_msr_safe(MSR_KVM_ASYNC_PF_EN,
+					  pa | KVM_ASYNC_PF_ENABLED, pa >> 32))
+			return;
+		__get_cpu_var(apf_reason).enabled = 1;
+		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
+		       smp_processor_id());
+	}
+}
+
+static void kvm_pv_disable_apf(void *unused)
+{
+	if (!__get_cpu_var(apf_reason).enabled)
+		return;
+
+	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
+	__get_cpu_var(apf_reason).enabled = 0;
+
+	printk(KERN_INFO"Unregister pv shared memory for cpu %d\n",
+	       smp_processor_id());
+}
+
+static int kvm_pv_reboot_notify(struct notifier_block *nb,
+				unsigned long code, void *unused)
+{
+	if (code == SYS_RESTART)
+		on_each_cpu(kvm_pv_disable_apf, NULL, 1);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block kvm_pv_reboot_nb = {
+	.notifier_call = kvm_pv_reboot_notify,
+};
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
 	WARN_ON(kvm_register_clock("primary cpu clock"));
+	kvm_guest_cpu_init();
 	native_smp_prepare_boot_cpu();
 }
+
+static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
+				    unsigned long action, void *hcpu)
+{
+	switch (action) {
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+		kvm_guest_cpu_init();
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata kvm_cpu_notifier = {
+        .notifier_call  = kvm_cpu_notify,
+};
 #endif
 
 void __init kvm_guest_init(void)
@@ -245,7 +309,11 @@ void __init kvm_guest_init(void)
 		return;
 
 	paravirt_ops_setup();
+	register_reboot_notifier(&kvm_pv_reboot_nb);
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+	register_cpu_notifier(&kvm_cpu_notifier);
+#else
+	kvm_guest_cpu_init();
 #endif
 }
-- 
1.7.1


* [PATCH v5 04/12] Provide special async page fault handler when async PF capability is detected
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:30   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

When the async PF capability is detected, hook up a special page fault
handler that will handle async page fault events and pass other page
faults on to the regular page fault handler.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    3 +
 arch/x86/include/asm/traps.h    |    1 +
 arch/x86/kernel/entry_32.S      |   10 +++
 arch/x86/kernel/entry_64.S      |    3 +
 arch/x86/kernel/kvm.c           |  170 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 187 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f1662d7..edf07cf 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -65,6 +65,9 @@ struct kvm_mmu_op_release_pt {
 	__u64 pt_phys;
 };
 
+#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
+#define KVM_PV_REASON_PAGE_READY 2
+
 struct kvm_vcpu_pv_apf_data {
 	__u32 reason;
 	__u32 enabled;
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index f66cda5..0310da6 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -30,6 +30,7 @@ asmlinkage void segment_not_present(void);
 asmlinkage void stack_segment(void);
 asmlinkage void general_protection(void);
 asmlinkage void page_fault(void);
+asmlinkage void async_page_fault(void);
 asmlinkage void spurious_interrupt_bug(void);
 asmlinkage void coprocessor_error(void);
 asmlinkage void alignment_check(void);
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index cd49141..95e13da 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -1494,6 +1494,16 @@ ENTRY(general_protection)
 	CFI_ENDPROC
 END(general_protection)
 
+#ifdef CONFIG_KVM_GUEST
+ENTRY(async_page_fault)
+	RING0_EC_FRAME
+	pushl $do_async_page_fault
+	CFI_ADJUST_CFA_OFFSET 4
+	jmp error_code
+	CFI_ENDPROC
+END(async_page_fault)
+#endif
+
 /*
  * End of kprobes section
  */
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 0697ff1..65c3eb6 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1346,6 +1346,9 @@ errorentry xen_stack_segment do_stack_segment
 #endif
 errorentry general_protection do_general_protection
 errorentry page_fault do_page_fault
+#ifdef CONFIG_KVM_GUEST
+errorentry async_page_fault do_async_page_fault
+#endif
 #ifdef CONFIG_X86_MCE
 paranoidzeroentry machine_check *machine_check_vector(%rip)
 #endif
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5177dd1..a6db92e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -29,8 +29,14 @@
 #include <linux/hardirq.h>
 #include <linux/notifier.h>
 #include <linux/reboot.h>
+#include <linux/hash.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/kprobes.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
+#include <asm/traps.h>
+#include <asm/desc.h>
 
 #define MMU_QUEUE_SIZE 1024
 
@@ -54,6 +60,158 @@ static void kvm_io_delay(void)
 {
 }
 
+#define KVM_TASK_SLEEP_HASHBITS 8
+#define KVM_TASK_SLEEP_HASHSIZE (1<<KVM_TASK_SLEEP_HASHBITS)
+
+struct kvm_task_sleep_node {
+	struct hlist_node link;
+	wait_queue_head_t wq;
+	u32 token;
+	int cpu;
+};
+
+static struct kvm_task_sleep_head {
+	spinlock_t lock;
+	struct hlist_head list;
+} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];
+
+static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
+						  u64 token)
+{
+	struct hlist_node *p;
+
+	hlist_for_each(p, &b->list) {
+		struct kvm_task_sleep_node *n =
+			hlist_entry(p, typeof(*n), link);
+		if (n->token == token)
+			return n;
+	}
+
+	return NULL;
+}
+
+static void apf_task_wait(struct task_struct *tsk, u32 token)
+{
+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+	struct kvm_task_sleep_node n, *e;
+	DEFINE_WAIT(wait);
+
+	spin_lock(&b->lock);
+	e = _find_apf_task(b, token);
+	if (e) {
+		/* dummy entry exist -> wake up was delivered ahead of PF */
+		hlist_del(&e->link);
+		kfree(e);
+		spin_unlock(&b->lock);
+		return;
+	}
+
+	n.token = token;
+	n.cpu = smp_processor_id();
+	init_waitqueue_head(&n.wq);
+	hlist_add_head(&n.link, &b->list);
+	spin_unlock(&b->lock);
+
+	for (;;) {
+		prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+		if (hlist_unhashed(&n.link))
+			break;
+		schedule();
+	}
+	finish_wait(&n.wq, &wait);
+
+	return;
+}
+
+static void apf_task_wake_one(struct kvm_task_sleep_node *n)
+{
+	hlist_del_init(&n->link);
+	if (waitqueue_active(&n->wq))
+		wake_up(&n->wq);
+}
+
+static void apf_task_wake(u32 token)
+{
+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+	struct kvm_task_sleep_node *n;
+
+again:
+	spin_lock(&b->lock);
+	n = _find_apf_task(b, token);
+	if (!n) {
+		/*
+		 * async PF was not yet handled.
+		 * Add dummy entry for the token.
+		 */
+		n = kmalloc(sizeof(*n), GFP_ATOMIC);
+		if (!n) {
+			/*
+			 * Allocation failed! Busy wait while other vcpu
+			 * handles async PF.
+			 */
+			spin_unlock(&b->lock);
+			cpu_relax();
+			goto again;
+		}
+		n->token = token;
+		n->cpu = smp_processor_id();
+		init_waitqueue_head(&n->wq);
+		hlist_add_head(&n->link, &b->list);
+	} else
+		apf_task_wake_one(n);
+	spin_unlock(&b->lock);
+	return;
+}
+
+static void apf_task_wake_all(void)
+{
+	int i;
+
+	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
+		struct hlist_node *p, *next;
+		struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
+		spin_lock(&b->lock);
+		hlist_for_each_safe(p, next, &b->list) {
+			struct kvm_task_sleep_node *n =
+				hlist_entry(p, typeof(*n), link);
+			if (n->cpu == smp_processor_id())
+				apf_task_wake_one(n);
+		}
+		spin_unlock(&b->lock);
+	}
+}
+
+dotraplinkage void __kprobes
+do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	u32 reason = 0, token;
+
+	if (__get_cpu_var(apf_reason).enabled) {
+		reason = __get_cpu_var(apf_reason).reason;
+		__get_cpu_var(apf_reason).reason = 0;
+
+		token = (u32)read_cr2();
+	}
+
+	switch (reason) {
+	default:
+		do_page_fault(regs, error_code);
+		break;
+	case KVM_PV_REASON_PAGE_NOT_PRESENT:
+		/* page is swapped out by the host. */
+		apf_task_wait(current, token);
+		break;
+	case KVM_PV_REASON_PAGE_READY:
+		if (unlikely(token == ~0))
+			apf_task_wake_all();
+		else
+			apf_task_wake(token);
+		break;
+	}
+}
+
 static void kvm_mmu_op(void *buffer, unsigned len)
 {
 	int r;
@@ -303,13 +461,25 @@ static struct notifier_block __cpuinitdata kvm_cpu_notifier = {
 };
 #endif
 
+static void __init kvm_apf_trap_init(void)
+{
+	set_intr_gate(14, &async_page_fault);
+}
+
 void __init kvm_guest_init(void)
 {
+	int i;
+
 	if (!kvm_para_available())
 		return;
 
 	paravirt_ops_setup();
 	register_reboot_notifier(&kvm_pv_reboot_nb);
+	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
+		spin_lock_init(&async_pf_sleepers[i].lock);
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
+		x86_init.irqs.trap_init = kvm_apf_trap_init;
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 	register_cpu_notifier(&kvm_cpu_notifier);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v5 04/12] Provide special async page fault handler when async PF capability is detected
@ 2010-07-19 15:30   ` Gleb Natapov
  0 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

When async PF capability is detected hook up special page fault handler
that will handle async page fault events and bypass other page faults to
regular page fault handler.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    3 +
 arch/x86/include/asm/traps.h    |    1 +
 arch/x86/kernel/entry_32.S      |   10 +++
 arch/x86/kernel/entry_64.S      |    3 +
 arch/x86/kernel/kvm.c           |  170 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 187 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f1662d7..edf07cf 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -65,6 +65,9 @@ struct kvm_mmu_op_release_pt {
 	__u64 pt_phys;
 };
 
+#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
+#define KVM_PV_REASON_PAGE_READY 2
+
 struct kvm_vcpu_pv_apf_data {
 	__u32 reason;
 	__u32 enabled;
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index f66cda5..0310da6 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -30,6 +30,7 @@ asmlinkage void segment_not_present(void);
 asmlinkage void stack_segment(void);
 asmlinkage void general_protection(void);
 asmlinkage void page_fault(void);
+asmlinkage void async_page_fault(void);
 asmlinkage void spurious_interrupt_bug(void);
 asmlinkage void coprocessor_error(void);
 asmlinkage void alignment_check(void);
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index cd49141..95e13da 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -1494,6 +1494,16 @@ ENTRY(general_protection)
 	CFI_ENDPROC
 END(general_protection)
 
+#ifdef CONFIG_KVM_GUEST
+ENTRY(async_page_fault)
+	RING0_EC_FRAME
+	pushl $do_async_page_fault
+	CFI_ADJUST_CFA_OFFSET 4
+	jmp error_code
+	CFI_ENDPROC
+END(apf_page_fault)
+#endif
+
 /*
  * End of kprobes section
  */
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 0697ff1..65c3eb6 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1346,6 +1346,9 @@ errorentry xen_stack_segment do_stack_segment
 #endif
 errorentry general_protection do_general_protection
 errorentry page_fault do_page_fault
+#ifdef CONFIG_KVM_GUEST
+errorentry async_page_fault do_async_page_fault
+#endif
 #ifdef CONFIG_X86_MCE
 paranoidzeroentry machine_check *machine_check_vector(%rip)
 #endif
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5177dd1..a6db92e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -29,8 +29,14 @@
 #include <linux/hardirq.h>
 #include <linux/notifier.h>
 #include <linux/reboot.h>
+#include <linux/hash.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/kprobes.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
+#include <asm/traps.h>
+#include <asm/desc.h>
 
 #define MMU_QUEUE_SIZE 1024
 
@@ -54,6 +60,158 @@ static void kvm_io_delay(void)
 {
 }
 
+#define KVM_TASK_SLEEP_HASHBITS 8
+#define KVM_TASK_SLEEP_HASHSIZE (1<<KVM_TASK_SLEEP_HASHBITS)
+
+struct kvm_task_sleep_node {
+	struct hlist_node link;
+	wait_queue_head_t wq;
+	u32 token;
+	int cpu;
+};
+
+static struct kvm_task_sleep_head {
+	spinlock_t lock;
+	struct hlist_head list;
+} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];
+
+static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
+						  u64 token)
+{
+	struct hlist_node *p;
+
+	hlist_for_each(p, &b->list) {
+		struct kvm_task_sleep_node *n =
+			hlist_entry(p, typeof(*n), link);
+		if (n->token == token)
+			return n;
+	}
+
+	return NULL;
+}
+
+static void apf_task_wait(struct task_struct *tsk, u32 token)
+{
+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+	struct kvm_task_sleep_node n, *e;
+	DEFINE_WAIT(wait);
+
+	spin_lock(&b->lock);
+	e = _find_apf_task(b, token);
+	if (e) {
+		/* dummy entry exist -> wake up was delivered ahead of PF */
+		hlist_del(&e->link);
+		kfree(e);
+		spin_unlock(&b->lock);
+		return;
+	}
+
+	n.token = token;
+	n.cpu = smp_processor_id();
+	init_waitqueue_head(&n.wq);
+	hlist_add_head(&n.link, &b->list);
+	spin_unlock(&b->lock);
+
+	for (;;) {
+		prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+		if (hlist_unhashed(&n.link))
+			break;
+		schedule();
+	}
+	finish_wait(&n.wq, &wait);
+
+	return;
+}
+
+static void apf_task_wake_one(struct kvm_task_sleep_node *n)
+{
+	hlist_del_init(&n->link);
+	if (waitqueue_active(&n->wq))
+		wake_up(&n->wq);
+}
+
+static void apf_task_wake(u32 token)
+{
+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+	struct kvm_task_sleep_node *n;
+
+again:
+	spin_lock(&b->lock);
+	n = _find_apf_task(b, token);
+	if (!n) {
+		/*
+		 * async PF was not yet handled.
+		 * Add dummy entry for the token.
+		 */
+		n = kmalloc(sizeof(*n), GFP_ATOMIC);
+		if (!n) {
+			/*
+			 * Allocation failed! Busy wait while other vcpu
+			 * handles async PF.
+			 */
+			spin_unlock(&b->lock);
+			cpu_relax();
+			goto again;
+		}
+		n->token = token;
+		n->cpu = smp_processor_id();
+		init_waitqueue_head(&n->wq);
+		hlist_add_head(&n->link, &b->list);
+	} else
+		apf_task_wake_one(n);
+	spin_unlock(&b->lock);
+	return;
+}
+
+static void apf_task_wake_all(void)
+{
+	int i;
+
+	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
+		struct hlist_node *p, *next;
+		struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
+		spin_lock(&b->lock);
+		hlist_for_each_safe(p, next, &b->list) {
+			struct kvm_task_sleep_node *n =
+				hlist_entry(p, typeof(*n), link);
+			if (n->cpu == smp_processor_id())
+				apf_task_wake_one(n);
+		}
+		spin_unlock(&b->lock);
+	}
+}
+
+dotraplinkage void __kprobes
+do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	u32 reason = 0, token;
+
+	if (__get_cpu_var(apf_reason).enabled) {
+		reason = __get_cpu_var(apf_reason).reason;
+		__get_cpu_var(apf_reason).reason = 0;
+
+		token = (u32)read_cr2();
+	}
+
+	switch (reason) {
+	default:
+		do_page_fault(regs, error_code);
+		break;
+	case KVM_PV_REASON_PAGE_NOT_PRESENT:
+		/* page is swapped out by the host. */
+		apf_task_wait(current, token);
+		break;
+	case KVM_PV_REASON_PAGE_READY:
+		if (unlikely(token == ~0))
+			apf_task_wake_all();
+		else
+			apf_task_wake(token);
+		break;
+	}
+}
+
 static void kvm_mmu_op(void *buffer, unsigned len)
 {
 	int r;
@@ -303,13 +461,25 @@ static struct notifier_block __cpuinitdata kvm_cpu_notifier = {
 };
 #endif
 
+static void __init kvm_apf_trap_init(void)
+{
+	set_intr_gate(14, &async_page_fault);
+}
+
 void __init kvm_guest_init(void)
 {
+	int i;
+
 	if (!kvm_para_available())
 		return;
 
 	paravirt_ops_setup();
 	register_reboot_notifier(&kvm_pv_reboot_nb);
+	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
+		spin_lock_init(&async_pf_sleepers[i].lock);
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
+		x86_init.irqs.trap_init = kvm_apf_trap_init;
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 	register_cpu_notifier(&kvm_cpu_notifier);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v5 05/12] Export __get_user_pages_fast.
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:30   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

KVM will use it to try to find a page without falling back to the slow
gup path. That is why get_user_pages_fast(), which does that fallback
internally, is not enough.
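
For illustration, the intended call pattern looks like the sketch below
(the helper is hypothetical, not part of this patch): try the exported
non-sleeping fast path first and let the caller decide what to do on a
miss.

/*
 * Hypothetical sketch: probe one user page without sleeping and
 * without starting I/O; on a miss the caller chooses its own fallback.
 */
static struct page *probe_user_page(unsigned long addr, int write)
{
	struct page *page[1];

	/* __get_user_pages_fast() only walks the page tables; it never
	 * falls back to the slow, sleeping gup path. */
	if (__get_user_pages_fast(addr, 1, write, page) == 1)
		return page[0];

	return NULL;	/* page is not resident (or not mapped) right now */
}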

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/mm/gup.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 738e659..a4ce19f 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -8,6 +8,7 @@
 #include <linux/mm.h>
 #include <linux/vmstat.h>
 #include <linux/highmem.h>
+#include <linux/module.h>
 
 #include <asm/pgtable.h>
 
@@ -274,6 +275,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	return nr;
 }
+EXPORT_SYMBOL_GPL(__get_user_pages_fast);
 
 /**
  * get_user_pages_fast() - pin user pages in memory
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v5 06/12] Add get_user_pages() variant that fails if major fault is required.
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:30   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

This patch adds a get_user_pages() variant that only succeeds if getting
a reference to a page does not require a major fault.
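
A minimal usage sketch, assuming a caller that wants the page only if
no disk I/O is needed (the helper name is made up for illustration):

/*
 * Hypothetical caller: resolve one user page only if it can be found
 * without I/O; a would-be major fault makes the call fail instead of
 * blocking on the disk.
 */
static struct page *get_page_noio(struct mm_struct *mm,
				  unsigned long addr, int write)
{
	struct page *page = NULL;
	int npages;

	down_read(&mm->mmap_sem);
	npages = get_user_pages_noio(current, mm, addr, 1, write,
				     0 /* force */, &page, NULL);
	up_read(&mm->mmap_sem);

	return npages == 1 ? page : NULL;
}

Internally the new variant sets FOLL_MINOR, __get_user_pages() turns
that into FAULT_FLAG_MINOR, and the filemap, shmem and swap fault paths
then return VM_FAULT_MAJOR | VM_FAULT_ERROR instead of starting the
read-in.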

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 fs/ncpfs/mmap.c    |    2 ++
 include/linux/mm.h |    5 +++++
 mm/filemap.c       |    3 +++
 mm/memory.c        |   31 ++++++++++++++++++++++++++++---
 mm/shmem.c         |    8 +++++++-
 5 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
index 56f5b3a..b9c4f36 100644
--- a/fs/ncpfs/mmap.c
+++ b/fs/ncpfs/mmap.c
@@ -39,6 +39,8 @@ static int ncp_file_mmap_fault(struct vm_area_struct *area,
 	int bufsize;
 	int pos; /* XXX: loff_t ? */
 
+	if (vmf->flags & FAULT_FLAG_MINOR)
+		return VM_FAULT_MAJOR | VM_FAULT_ERROR;
 	/*
 	 * ncpfs has nothing against high pages as long
 	 * as recvmsg and memset works on it
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4238a9c..2bfc85a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -140,6 +140,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
 #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_MINOR	0x08	/* Do only minor fault */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -843,6 +844,9 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			unsigned long start, int nr_pages, int write, int force,
 			struct page **pages, struct vm_area_struct **vmas);
+int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
+			unsigned long start, int nr_pages, int write, int force,
+			struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct page *get_dump_page(unsigned long addr);
@@ -1373,6 +1377,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_MINOR	0x20	/* do only minor page faults */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/filemap.c b/mm/filemap.c
index 20e5642..1186338 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1548,6 +1548,9 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 			goto no_cached_page;
 		}
 	} else {
+		if (vmf->flags & FAULT_FLAG_MINOR)
+			return VM_FAULT_MAJOR | VM_FAULT_ERROR;
+
 		/* No page in the page cache at all */
 		do_sync_mmap_readahead(vma, ra, file, offset);
 		count_vm_event(PGMAJFAULT);
diff --git a/mm/memory.c b/mm/memory.c
index 119b7cc..7dfaba2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1433,10 +1433,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			cond_resched();
 			while (!(page = follow_page(vma, start, foll_flags))) {
 				int ret;
+				unsigned int fault_fl =
+					((foll_flags & FOLL_WRITE) ?
+					FAULT_FLAG_WRITE : 0) |
+					((foll_flags & FOLL_MINOR) ?
+					FAULT_FLAG_MINOR : 0);
 
-				ret = handle_mm_fault(mm, vma, start,
-					(foll_flags & FOLL_WRITE) ?
-					FAULT_FLAG_WRITE : 0);
+				ret = handle_mm_fault(mm, vma, start, fault_fl);
 
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
@@ -1444,6 +1447,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 					if (ret &
 					    (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
 						return i ? i : -EFAULT;
+					else if (ret & VM_FAULT_MAJOR)
+						return i ? i : -EFAULT;
 					BUG();
 				}
 				if (ret & VM_FAULT_MAJOR)
@@ -1554,6 +1559,23 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	int flags = FOLL_TOUCH | FOLL_MINOR;
+
+	if (pages)
+		flags |= FOLL_GET;
+	if (write)
+		flags |= FOLL_WRITE;
+	if (force)
+		flags |= FOLL_FORCE;
+
+	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
+}
+EXPORT_SYMBOL(get_user_pages_noio);
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
@@ -2640,6 +2662,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry);
 	if (!page) {
+		if (flags & FAULT_FLAG_MINOR)
+			return VM_FAULT_MAJOR | VM_FAULT_ERROR;
+
 		grab_swap_token(mm); /* Contend for token _before_ read-in */
 		page = swapin_readahead(entry,
 					GFP_HIGHUSER_MOVABLE, vma, address);
diff --git a/mm/shmem.c b/mm/shmem.c
index f65f840..acc8958 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1227,6 +1227,7 @@ static int shmem_getpage(struct inode *inode, unsigned long idx,
 	swp_entry_t swap;
 	gfp_t gfp;
 	int error;
+	int flags = type ? *type : 0;
 
 	if (idx >= SHMEM_MAX_INDEX)
 		return -EFBIG;
@@ -1275,6 +1276,11 @@ repeat:
 		swappage = lookup_swap_cache(swap);
 		if (!swappage) {
 			shmem_swp_unmap(entry);
+			if (flags & FAULT_FLAG_MINOR) {
+				spin_unlock(&info->lock);
+				*type = VM_FAULT_MAJOR | VM_FAULT_ERROR;
+				goto failed;
+			}
 			/* here we actually do the io */
 			if (type && !(*type & VM_FAULT_MAJOR)) {
 				__count_vm_event(PGMAJFAULT);
@@ -1483,7 +1489,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
 	int error;
-	int ret;
+	int ret = (int)vmf->flags;
 
 	if (((loff_t)vmf->pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v5 07/12] Maintain memslot version number
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:30   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

Code that depends on a particular memslot layout can now track changes
and adjust to the new layout.
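
For illustration, a consumer can cache a translation together with the
version it was computed under and revalidate it on use (hypothetical
helper; a later patch in this series uses the same idea for the guest's
apf_data pointer):

/*
 * Hypothetical consumer: keep a cached gfn->hva translation valid
 * across memslot changes by remembering the version it was made under.
 */
struct hva_cache {
	gfn_t gfn;
	unsigned long hva;
	u32 version;
};

static unsigned long cached_gfn_to_hva(struct kvm *kvm,
				       struct hva_cache *c, gfn_t gfn)
{
	u32 ver = kvm->memslot_version;

	if (c->gfn != gfn || c->version != ver) {
		c->hva = gfn_to_hva(kvm, gfn);	/* may be an error hva */
		c->gfn = gfn;
		c->version = ver;
	}
	return c->hva;
}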

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 include/linux/kvm_host.h |    1 +
 virt/kvm/kvm_main.c      |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c13cc48..c74ffc0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -177,6 +177,7 @@ struct kvm {
 	raw_spinlock_t requests_lock;
 	struct mutex slots_lock;
 	struct mm_struct *mm; /* userspace tied to this vm */
+	u32 memslot_version;
 	struct kvm_memslots *memslots;
 	struct srcu_struct srcu;
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b78b794..292514c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -733,6 +733,7 @@ skip_lpage:
 	slots->memslots[mem->slot] = new;
 	old_memslots = kvm->memslots;
 	rcu_assign_pointer(kvm->memslots, slots);
+	kvm->memslot_version++;
 	synchronize_srcu_expedited(&kvm->srcu);
 
 	kvm_arch_commit_memory_region(kvm, mem, old, user_alloc);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v5 08/12] Inject asynchronous page fault into a guest if page is swapped out.
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:30   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If a guest accesses swapped-out memory, do not swap it in from the vcpu
thread context. Set up a slow work item to do the swap-in and send an
async page fault to the guest.
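
The flow added to mmu.c and kvm_main.c below condenses to the sketch
that follows (hypothetical helper; locking, tracing and error handling
trimmed; the names are the ones introduced by this patch):

static int fault_in_or_defer(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
			     pfn_t *pfn)
{
	bool async;

	if (can_do_async_pf(vcpu)) {
		*pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
		if (!async)
			return 0;	/* page is resident, map it as usual */
		if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
			/* swap-in queued as a slow work item; a "page
			 * not present" fault with a token is injected
			 * before the next guest entry */
			return 1;
	}
	*pfn = gfn_to_pfn(vcpu->kvm, gfn);	/* synchronous swap-in */
	return 0;
}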

Allow async page fault injection only when the guest is in user mode,
since otherwise the guest may be in a non-sleepable context and would
not be able to reschedule.
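
In code this is the can_do_async_pf() guard added to mmu.c below,
restated here with the reasoning spelled out (the helper name is
illustrative):

static bool may_send_async_pf(struct kvm_vcpu *vcpu)
{
	/* The guest has not enabled async PF through its MSR. */
	if (!vcpu->arch.apf_data)
		return false;

	/* Do not mix an async PF with an event pending re-injection. */
	if (kvm_event_needs_reinjection(vcpu))
		return false;

	/* CPL > 0: the guest is in user mode, so its #PF handler may
	 * put the faulting task to sleep and run something else. */
	return kvm_x86_ops->get_cpl(vcpu) != 0;
}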

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |   16 +++
 arch/x86/kvm/Kconfig            |    2 +
 arch/x86/kvm/mmu.c              |   35 +++++-
 arch/x86/kvm/paging_tmpl.h      |   17 +++-
 arch/x86/kvm/x86.c              |   63 +++++++++-
 include/linux/kvm_host.h        |   31 +++++
 include/trace/events/kvm.h      |   60 +++++++++
 virt/kvm/Kconfig                |    3 +
 virt/kvm/kvm_main.c             |  263 ++++++++++++++++++++++++++++++++++++++-
 9 files changed, 481 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 245831a..db514ea 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -366,7 +366,9 @@ struct kvm_vcpu_arch {
 	cpumask_var_t wbinvd_dirty_mask;
 
 	u32 __user *apf_data;
+	u32 apf_memslot_ver;
 	u64 apf_msr_val;
+	u32 async_pf_id;
 };
 
 struct kvm_arch {
@@ -444,6 +446,8 @@ struct kvm_vcpu_stat {
 	u32 hypercalls;
 	u32 irq_injections;
 	u32 nmi_injections;
+	u32 apf_not_present;
+	u32 apf_present;
 };
 
 struct kvm_x86_ops {
@@ -528,6 +532,10 @@ struct kvm_x86_ops {
 	const struct trace_print_flags *exit_reasons_str;
 };
 
+struct kvm_arch_async_pf {
+	u32 token;
+};
+
 extern struct kvm_x86_ops *kvm_x86_ops;
 
 int kvm_mmu_module_init(void);
@@ -763,4 +771,12 @@ void kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
 
 bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip);
 
+struct kvm_async_pf;
+
+void kvm_arch_inject_async_page_not_present(struct kvm_vcpu *vcpu,
+					    struct kvm_async_pf *work);
+void kvm_arch_inject_async_page_present(struct kvm_vcpu *vcpu,
+					struct kvm_async_pf *work);
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
 #endif /* _ASM_X86_KVM_HOST_H */
+
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 970bbd4..2461284 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -28,6 +28,8 @@ config KVM
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_EVENTFD
 	select KVM_APIC_ARCHITECTURE
+	select KVM_ASYNC_PF
+	select SLOW_WORK
 	select USER_RETURN_NOTIFIER
 	select KVM_MMIO
 	---help---
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d16efbe..5e6105c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -21,6 +21,7 @@
 #include "mmu.h"
 #include "x86.h"
 #include "kvm_cache_regs.h"
+#include "x86.h"
 
 #include <linux/kvm_host.h>
 #include <linux/types.h>
@@ -2345,6 +2346,21 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 			     error_code & PFERR_WRITE_MASK, gfn);
 }
 
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
+{
+	struct kvm_arch_async_pf arch;
+	arch.token = (vcpu->arch.async_pf_id++ << 12) | vcpu->vcpu_id;
+	return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
+}
+
+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
+{
+	if (!vcpu->arch.apf_data || kvm_event_needs_reinjection(vcpu))
+		return false;
+
+	return !!kvm_x86_ops->get_cpl(vcpu);
+}
+
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 				u32 error_code)
 {
@@ -2353,6 +2369,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 	int level;
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	unsigned long mmu_seq;
+	bool async;
 
 	ASSERT(vcpu);
 	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
@@ -2367,7 +2384,23 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-	pfn = gfn_to_pfn(vcpu->kvm, gfn);
+
+	if (can_do_async_pf(vcpu)) {
+		pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
+		trace_kvm_try_async_get_page(async, pfn);
+	} else {
+do_sync:
+		async = false;
+		pfn = gfn_to_pfn(vcpu->kvm, gfn);
+	}
+
+	if (async) {
+		if (!kvm_arch_setup_async_pf(vcpu, gpa, gfn))
+			goto do_sync;
+		return 0;
+	}
+
+	/* mmio */
 	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
 	spin_lock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index a09e04c..f8c74a1 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -428,6 +428,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	pfn_t pfn;
 	int level = PT_PAGE_TABLE_LEVEL;
 	unsigned long mmu_seq;
+	bool async;
 
 	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
 	kvm_mmu_audit(vcpu, "pre page fault");
@@ -459,7 +460,21 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-	pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
+
+	if (can_do_async_pf(vcpu)) {
+		pfn = gfn_to_pfn_async(vcpu->kvm, walker.gfn, &async);
+		trace_kvm_try_async_get_page(async, pfn);
+	} else {
+do_sync:
+		async = false;
+		pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
+	}
+
+	if (async) {
+		if (!kvm_arch_setup_async_pf(vcpu, addr, walker.gfn))
+			goto do_sync;
+		return 0;
+	}
 
 	/* mmio */
 	if (is_error_pfn(pfn))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b09bf61..2603cc4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -118,6 +118,8 @@ static DEFINE_PER_CPU(struct kvm_shared_msrs, shared_msrs);
 struct kvm_stats_debugfs_item debugfs_entries[] = {
 	{ "pf_fixed", VCPU_STAT(pf_fixed) },
 	{ "pf_guest", VCPU_STAT(pf_guest) },
+	{ "apf_not_present", VCPU_STAT(apf_not_present) },
+	{ "apf_present", VCPU_STAT(apf_present) },
 	{ "tlb_flush", VCPU_STAT(tlb_flush) },
 	{ "invlpg", VCPU_STAT(invlpg) },
 	{ "exits", VCPU_STAT(exits) },
@@ -1228,6 +1230,7 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 
 	if (!(data & KVM_ASYNC_PF_ENABLED)) {
 		vcpu->arch.apf_data = NULL;
+		kvm_clear_async_pf_completion_queue(vcpu);
 		return 0;
 	}
 
@@ -1242,6 +1245,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 		vcpu->arch.apf_data = NULL;
 		return 1;
 	}
+	vcpu->arch.apf_memslot_ver = vcpu->kvm->memslot_version;
+	kvm_async_pf_wakeup_all(vcpu);
 	return 0;
 }
 
@@ -4748,6 +4753,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (unlikely(r))
 		goto out;
 
+	kvm_check_async_pf_completion(vcpu);
+
 	preempt_disable();
 
 	kvm_x86_ops->prepare_guest_switch(vcpu);
@@ -5420,6 +5427,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.apf_data = NULL;
 	vcpu->arch.apf_msr_val = 0;
 
+	kvm_clear_async_pf_completion_queue(vcpu);
+
 	return kvm_x86_ops->vcpu_reset(vcpu);
 }
 
@@ -5561,8 +5570,10 @@ static void kvm_free_vcpus(struct kvm *kvm)
 	/*
 	 * Unpin any mmu pages first.
 	 */
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_unload_vcpu_mmu(vcpu);
+	}
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_arch_vcpu_free(vcpu);
 
@@ -5674,6 +5685,7 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
+		|| !list_empty_careful(&vcpu->async_pf_done)
 		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
 		|| vcpu->arch.nmi_pending ||
 		(kvm_arch_interrupt_allowed(vcpu) &&
@@ -5731,6 +5743,55 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
 
+static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
+{
+	if (unlikely(vcpu->arch.apf_memslot_ver !=
+		     vcpu->kvm->memslot_version)) {
+		u64 gpa = vcpu->arch.apf_msr_val & ~0x3f;
+		unsigned long addr;
+		int offset = offset_in_page(gpa);
+
+		addr = gfn_to_hva(vcpu->kvm, gpa >> PAGE_SHIFT);
+		vcpu->arch.apf_data = (u32 __user *)(addr + offset);
+		if (kvm_is_error_hva(addr)) {
+			vcpu->arch.apf_data = NULL;
+			return -EFAULT;
+		}
+	}
+
+	return put_user(val, vcpu->arch.apf_data);
+}
+
+void kvm_arch_inject_async_page_not_present(struct kvm_vcpu *vcpu,
+					    struct kvm_async_pf *work)
+{
+	if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
+		kvm_inject_page_fault(vcpu, work->arch.token, 0);
+		++vcpu->stat.apf_not_present;
+		trace_kvm_send_async_pf(work->arch.token, work->gva,
+					KVM_PV_REASON_PAGE_NOT_PRESENT);
+	}
+}
+
+void kvm_arch_inject_async_page_present(struct kvm_vcpu *vcpu,
+					struct kvm_async_pf *work)
+{
+	if (is_error_page(work->page))
+		work->arch.token = ~0; /* broadcast wakeup */
+	if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
+		kvm_inject_page_fault(vcpu, work->arch.token, 0);
+		++vcpu->stat.apf_present;
+		trace_kvm_send_async_pf(work->arch.token, work->gva,
+					KVM_PV_REASON_PAGE_READY);
+	}
+}
+
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
+{
+	return !kvm_event_needs_reinjection(vcpu) &&
+		kvm_x86_ops->interrupt_allowed(vcpu);
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c74ffc0..2e14c1b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -16,6 +16,7 @@
 #include <linux/mm.h>
 #include <linux/preempt.h>
 #include <linux/msi.h>
+#include <linux/slow-work.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -73,6 +74,27 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 			      struct kvm_io_device *dev);
 
+#ifdef CONFIG_KVM_ASYNC_PF
+struct kvm_async_pf {
+	struct slow_work work;
+	struct list_head link;
+	struct list_head queue;
+	struct kvm_vcpu *vcpu;
+	struct mm_struct *mm;
+	gva_t gva;
+	unsigned long addr;
+	struct kvm_arch_async_pf arch;
+	struct page *page;
+	atomic_t used;
+};
+
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+		       struct kvm_arch_async_pf *arch);
+int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
+#endif
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -103,6 +125,14 @@ struct kvm_vcpu {
 	gpa_t mmio_phys_addr;
 #endif
 
+#ifdef CONFIG_KVM_ASYNC_PF
+	u32 async_pf_queued;
+	struct list_head async_pf_queue;
+	struct list_head async_pf_done;
+	spinlock_t async_pf_lock;
+	struct kvm_async_pf *async_pf_work;
+#endif
+
 	struct kvm_vcpu_arch arch;
 };
 
@@ -297,6 +327,7 @@ void kvm_release_page_dirty(struct page *page);
 void kvm_set_page_dirty(struct page *page);
 void kvm_set_page_accessed(struct page *page);
 
+pfn_t gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, bool *async);
 pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
 			 struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 6dd3a51..6d9f0c2 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -185,6 +185,66 @@ TRACE_EVENT(kvm_age_page,
 		  __entry->referenced ? "YOUNG" : "OLD")
 );
 
+#ifdef CONFIG_KVM_ASYNC_PF
+TRACE_EVENT(
+	kvm_try_async_get_page,
+	TP_PROTO(bool async, u64 pfn),
+	TP_ARGS(async, pfn),
+
+	TP_STRUCT__entry(
+		__field(__u64, pfn)
+		),
+
+	TP_fast_assign(
+		__entry->pfn = (!async) ? pfn : (u64)-1;
+		),
+
+	TP_printk("pfn %#llx", __entry->pfn)
+);
+
+TRACE_EVENT(
+	kvm_send_async_pf,
+	TP_PROTO(u64 token, u64 gva, u64 reason),
+	TP_ARGS(token, gva, reason),
+
+	TP_STRUCT__entry(
+		__field(__u64, token)
+		__field(__u64, gva)
+		__field(bool, np)
+		),
+
+	TP_fast_assign(
+		__entry->token = token;
+		__entry->gva = gva;
+		__entry->np = (reason == KVM_PV_REASON_PAGE_NOT_PRESENT);
+		),
+
+	TP_printk("token %#llx gva %#llx %s", __entry->token, __entry->gva,
+		  __entry->np ? "not present" : "ready")
+);
+
+TRACE_EVENT(
+	kvm_async_pf_completed,
+	TP_PROTO(unsigned long address, struct page *page, u64 gva),
+	TP_ARGS(address, page, gva),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, address)
+		__field(struct page*, page)
+		__field(u64, gva)
+		),
+
+	TP_fast_assign(
+		__entry->address = address;
+		__entry->page = page;
+		__entry->gva = gva;
+		),
+
+	TP_printk("gva %#llx address %#lx pfn %lx",  __entry->gva,
+		  __entry->address, page_to_pfn(__entry->page))
+);
+#endif
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 7f1178f..f63ccb0 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -15,3 +15,6 @@ config KVM_APIC_ARCHITECTURE
 
 config KVM_MMIO
        bool
+
+config KVM_ASYNC_PF
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 292514c..f56e8ac 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -78,6 +78,11 @@ static atomic_t hardware_enable_failed;
 struct kmem_cache *kvm_vcpu_cache;
 EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
 
+#ifdef CONFIG_KVM_ASYNC_PF
+#define ASYNC_PF_PER_VCPU 100
+static struct kmem_cache *async_pf_cache;
+#endif
+
 static __read_mostly struct preempt_ops kvm_preempt_ops;
 
 struct dentry *kvm_debugfs_dir;
@@ -186,6 +191,11 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->kvm = kvm;
 	vcpu->vcpu_id = id;
 	init_waitqueue_head(&vcpu->wq);
+#ifdef CONFIG_KVM_ASYNC_PF
+	INIT_LIST_HEAD(&vcpu->async_pf_done);
+	INIT_LIST_HEAD(&vcpu->async_pf_queue);
+	spin_lock_init(&vcpu->async_pf_lock);
+#endif
 
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
@@ -944,7 +954,7 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(gfn_to_hva);
 
-static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr)
+static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool *async)
 {
 	struct page *page[1];
 	int npages;
@@ -952,7 +962,19 @@ static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr)
 
 	might_sleep();
 
-	npages = get_user_pages_fast(addr, 1, 1, page);
+	if (async) {
+#ifdef CONFIG_X86
+		npages = __get_user_pages_fast(addr, 1, 1, page);
+#endif
+		if (unlikely(npages != 1)) {
+			down_read(&current->mm->mmap_sem);
+			npages = get_user_pages_noio(current, current->mm,
+						     addr, 1, 1, 0, page,
+						     NULL);
+			up_read(&current->mm->mmap_sem);
+		}
+	} else
+		npages = get_user_pages_fast(addr, 1, 1, page);
 
 	if (unlikely(npages != 1)) {
 		struct vm_area_struct *vma;
@@ -968,6 +990,9 @@ static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr)
 
 		if (vma == NULL || addr < vma->vm_start ||
 		    !(vma->vm_flags & VM_PFNMAP)) {
+			if (async && !(vma->vm_flags & VM_PFNMAP) &&
+			    (vma->vm_flags & VM_WRITE))
+				*async = true;
 			up_read(&current->mm->mmap_sem);
 			get_page(fault_page);
 			return page_to_pfn(fault_page);
@@ -982,25 +1007,37 @@ static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr)
 	return pfn;
 }
 
-pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
+static inline pfn_t __gfn_to_pfn(struct kvm *kvm, gfn_t gfn, bool *async)
 {
 	unsigned long addr;
 
+	if (async)
+		*async = false;
 	addr = gfn_to_hva(kvm, gfn);
 	if (kvm_is_error_hva(addr)) {
 		get_page(bad_page);
 		return page_to_pfn(bad_page);
 	}
 
-	return hva_to_pfn(kvm, addr);
+	return hva_to_pfn(kvm, addr, async);
+}
+
+pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
+{
+	return __gfn_to_pfn(kvm, gfn, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
+pfn_t gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, bool *async)
+{
+	return __gfn_to_pfn(kvm, gfn, async);
+}
+
 pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
 			 struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	unsigned long addr = gfn_to_hva_memslot(slot, gfn);
-	return hva_to_pfn(kvm, addr);
+	return hva_to_pfn(kvm, addr, NULL);
 }
 
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
@@ -1213,6 +1250,196 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	}
 }
 
+#ifdef CONFIG_KVM_ASYNC_PF
+static void async_pf_work_free(struct kvm_async_pf *apf)
+{
+	if (atomic_dec_and_test(&apf->used))
+		kmem_cache_free(async_pf_cache, apf);
+}
+
+static int async_pf_get_ref(struct slow_work *work)
+{
+	struct kvm_async_pf *apf =
+		container_of(work, struct kvm_async_pf, work);
+
+	atomic_inc(&apf->used);
+	return 0;
+}
+
+static void async_pf_put_ref(struct slow_work *work)
+{
+	struct kvm_async_pf *apf =
+		container_of(work, struct kvm_async_pf, work);
+
+	kvm_put_kvm(apf->vcpu->kvm);
+	async_pf_work_free(apf);
+}
+
+static void async_pf_execute(struct slow_work *work)
+{
+	struct page *page;
+	struct kvm_async_pf *apf =
+		container_of(work, struct kvm_async_pf, work);
+	wait_queue_head_t *q = &apf->vcpu->wq;
+
+	might_sleep();
+
+	down_read(&apf->mm->mmap_sem);
+	get_user_pages(current, apf->mm, apf->addr, 1, 1, 0, &page, NULL);
+	up_read(&apf->mm->mmap_sem);
+
+	spin_lock(&apf->vcpu->async_pf_lock);
+	list_add_tail(&apf->link, &apf->vcpu->async_pf_done);
+	apf->page = page;
+	spin_unlock(&apf->vcpu->async_pf_lock);
+
+	trace_kvm_async_pf_completed(apf->addr, apf->page, apf->gva);
+
+	if (waitqueue_active(q))
+		wake_up_interruptible(q);
+
+	mmdrop(apf->mm);
+}
+
+struct slow_work_ops async_pf_ops = {
+	.get_ref = async_pf_get_ref,
+	.put_ref = async_pf_put_ref,
+	.execute = async_pf_execute
+};
+
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
+{
+	/* cancel outstanding slow work item */
+	while (!list_empty(&vcpu->async_pf_queue)) {
+		struct kvm_async_pf *work =
+			list_entry(vcpu->async_pf_queue.next,
+				   typeof(*work), queue);
+		slow_work_cancel(&work->work);
+		list_del(&work->queue);
+		if (!work->page) /* work was canceled */
+			kmem_cache_free(async_pf_cache, work);
+	}
+
+	spin_lock(&vcpu->async_pf_lock);
+	while (!list_empty(&vcpu->async_pf_done)) {
+		struct kvm_async_pf *work =
+			list_entry(vcpu->async_pf_done.next,
+				   typeof(*work), link);
+		list_del(&work->link);
+		put_page(work->page);
+		kmem_cache_free(async_pf_cache, work);
+	}
+	spin_unlock(&vcpu->async_pf_lock);
+
+	vcpu->async_pf_queued = 0;
+	vcpu->async_pf_work = NULL;
+}
+
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
+{
+	struct kvm_async_pf *work = vcpu->async_pf_work;
+
+	if (work) {
+		vcpu->async_pf_work = NULL;
+		if (work->page == NULL) {
+			kvm_arch_inject_async_page_not_present(vcpu, work);
+			return;
+		} else {
+			spin_lock(&vcpu->async_pf_lock);
+			list_del(&work->link);
+			spin_unlock(&vcpu->async_pf_lock);
+			put_page(work->page);
+			async_pf_work_free(work);
+			list_del(&work->queue);
+			vcpu->async_pf_queued--;
+		}
+	}
+
+	if (list_empty_careful(&vcpu->async_pf_done) ||
+	    !kvm_arch_can_inject_async_page_present(vcpu))
+		return;
+
+	spin_lock(&vcpu->async_pf_lock);
+	work = list_first_entry(&vcpu->async_pf_done, typeof(*work), link);
+	list_del(&work->link);
+	spin_unlock(&vcpu->async_pf_lock);
+	list_del(&work->queue);
+	vcpu->async_pf_queued--;
+
+	kvm_arch_inject_async_page_present(vcpu, work);
+
+	put_page(work->page);
+	async_pf_work_free(work);
+}
+
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+		       struct kvm_arch_async_pf *arch)
+{
+	struct kvm_async_pf *work;
+
+	if (vcpu->async_pf_queued >= ASYNC_PF_PER_VCPU)
+		return 0;
+
+	/* setup slow work */
+
+	/* do alloc nowait since if we are going to sleep anyway we
+	   may as well sleep faulting in page */
+	work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
+	if (!work)
+		return 0;
+
+	atomic_set(&work->used, 1);
+	work->page = NULL;
+	work->vcpu = vcpu;
+	work->gva = gva;
+	work->addr = gfn_to_hva(vcpu->kvm, gfn);
+	work->arch = *arch;
+	work->mm = current->mm;
+	atomic_inc(&work->mm->mm_count);
+	kvm_get_kvm(work->vcpu->kvm);
+
+	/* this can't really happen otherwise gfn_to_pfn_async
+	   would succeed */
+	if (unlikely(kvm_is_error_hva(work->addr)))
+		goto retry_sync;
+
+	slow_work_init(&work->work, &async_pf_ops);
+	if (slow_work_enqueue(&work->work) != 0)
+		goto retry_sync;
+
+	vcpu->async_pf_work = work;
+	list_add_tail(&work->queue, &vcpu->async_pf_queue);
+	vcpu->async_pf_queued++;
+	return 1;
+retry_sync:
+	kvm_put_kvm(work->vcpu->kvm);
+	mmdrop(work->mm);
+	kmem_cache_free(async_pf_cache, work);
+	return 0;
+}
+
+int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
+{
+	struct kvm_async_pf *work;
+
+	if (!list_empty(&vcpu->async_pf_done))
+		return 0;
+
+	work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
+	if (!work)
+		return -ENOMEM;
+
+	atomic_set(&work->used, 1);
+	work->page = bad_page;
+	get_page(bad_page);
+	INIT_LIST_HEAD(&work->queue); /* for list_del to work */
+
+	list_add_tail(&work->link, &vcpu->async_pf_done);
+	vcpu->async_pf_queued++;
+	return 0;
+}
+#endif
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
@@ -2285,6 +2512,19 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 		goto out_free_5;
 	}
 
+#ifdef CONFIG_KVM_ASYNC_PF
+	async_pf_cache = KMEM_CACHE(kvm_async_pf, 0);
+
+	if (!async_pf_cache) {
+		r = -ENOMEM;
+		goto out_free_6;
+	}
+
+	r = slow_work_register_user(THIS_MODULE);
+	if (r)
+		goto out_free;
+#endif
+
 	kvm_chardev_ops.owner = module;
 	kvm_vm_fops.owner = module;
 	kvm_vcpu_fops.owner = module;
@@ -2292,7 +2532,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	r = misc_register(&kvm_dev);
 	if (r) {
 		printk(KERN_ERR "kvm: misc device register failed\n");
-		goto out_free;
+		goto out_unreg;
 	}
 
 	kvm_preempt_ops.sched_in = kvm_sched_in;
@@ -2302,7 +2542,13 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 
 	return 0;
 
+out_unreg:
+#ifdef CONFIG_KVM_ASYNC_PF
+	slow_work_unregister_user(THIS_MODULE);
 out_free:
+	kmem_cache_destroy(async_pf_cache);
+out_free_6:
+#endif
 	kmem_cache_destroy(kvm_vcpu_cache);
 out_free_5:
 	sysdev_unregister(&kvm_sysdev);
@@ -2334,6 +2580,11 @@ void kvm_exit(void)
 	kvm_exit_debug();
 	misc_deregister(&kvm_dev);
 	kmem_cache_destroy(kvm_vcpu_cache);
+#ifdef CONFIG_KVM_ASYNC_PF
+	if (async_pf_cache)
+		kmem_cache_destroy(async_pf_cache);
+	slow_work_unregister_user(THIS_MODULE);
+#endif
 	sysdev_unregister(&kvm_sysdev);
 	sysdev_class_unregister(&kvm_sysdev_class);
 	unregister_reboot_notifier(&kvm_reboot_notifier);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

+		if (!work->page) /* work was canceled */
+			kmem_cache_free(async_pf_cache, work);
+	}
+
+	spin_lock(&vcpu->async_pf_lock);
+	while (!list_empty(&vcpu->async_pf_done)) {
+		struct kvm_async_pf *work =
+			list_entry(vcpu->async_pf_done.next,
+				   typeof(*work), link);
+		list_del(&work->link);
+		put_page(work->page);
+		kmem_cache_free(async_pf_cache, work);
+	}
+	spin_unlock(&vcpu->async_pf_lock);
+
+	vcpu->async_pf_queued = 0;
+	vcpu->async_pf_work = NULL;
+}
+
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
+{
+	struct kvm_async_pf *work = vcpu->async_pf_work;
+
+	if (work) {
+		vcpu->async_pf_work = NULL;
+		if (work->page == NULL) {
+			kvm_arch_inject_async_page_not_present(vcpu, work);
+			return;
+		} else {
+			spin_lock(&vcpu->async_pf_lock);
+			list_del(&work->link);
+			spin_unlock(&vcpu->async_pf_lock);
+			put_page(work->page);
+			async_pf_work_free(work);
+			list_del(&work->queue);
+			vcpu->async_pf_queued--;
+		}
+	}
+
+	if (list_empty_careful(&vcpu->async_pf_done) ||
+	    !kvm_arch_can_inject_async_page_present(vcpu))
+		return;
+
+	spin_lock(&vcpu->async_pf_lock);
+	work = list_first_entry(&vcpu->async_pf_done, typeof(*work), link);
+	list_del(&work->link);
+	spin_unlock(&vcpu->async_pf_lock);
+	list_del(&work->queue);
+	vcpu->async_pf_queued--;
+
+	kvm_arch_inject_async_page_present(vcpu, work);
+
+	put_page(work->page);
+	async_pf_work_free(work);
+}
+
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+		       struct kvm_arch_async_pf *arch)
+{
+	struct kvm_async_pf *work;
+
+	if (vcpu->async_pf_queued >= ASYNC_PF_PER_VCPU)
+		return 0;
+
+	/* setup slow work */
+
+	/* do alloc nowait since if we are going to sleep anyway we
+	   may as well sleep faulting in page */
+	work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
+	if (!work)
+		return 0;
+
+	atomic_set(&work->used, 1);
+	work->page = NULL;
+	work->vcpu = vcpu;
+	work->gva = gva;
+	work->addr = gfn_to_hva(vcpu->kvm, gfn);
+	work->arch = *arch;
+	work->mm = current->mm;
+	atomic_inc(&work->mm->mm_count);
+	kvm_get_kvm(work->vcpu->kvm);
+
+	/* this can't really happen otherwise gfn_to_pfn_async
+	   would succeed */
+	if (unlikely(kvm_is_error_hva(work->addr)))
+		goto retry_sync;
+
+	slow_work_init(&work->work, &async_pf_ops);
+	if (slow_work_enqueue(&work->work) != 0)
+		goto retry_sync;
+
+	vcpu->async_pf_work = work;
+	list_add_tail(&work->queue, &vcpu->async_pf_queue);
+	vcpu->async_pf_queued++;
+	return 1;
+retry_sync:
+	kvm_put_kvm(work->vcpu->kvm);
+	mmdrop(work->mm);
+	kmem_cache_free(async_pf_cache, work);
+	return 0;
+}
+
+int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
+{
+	struct kvm_async_pf *work;
+
+	if (!list_empty(&vcpu->async_pf_done))
+		return 0;
+
+	work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
+	if (!work)
+		return -ENOMEM;
+
+	atomic_set(&work->used, 1);
+	work->page = bad_page;
+	get_page(bad_page);
+	INIT_LIST_HEAD(&work->queue); /* for list_del to work */
+
+	list_add_tail(&work->link, &vcpu->async_pf_done);
+	vcpu->async_pf_queued++;
+	return 0;
+}
+#endif
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
@@ -2285,6 +2512,19 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 		goto out_free_5;
 	}
 
+#ifdef CONFIG_KVM_ASYNC_PF
+	async_pf_cache = KMEM_CACHE(kvm_async_pf, 0);
+
+	if (!async_pf_cache) {
+		r = -ENOMEM;
+		goto out_free_6;
+	}
+
+	r = slow_work_register_user(THIS_MODULE);
+	if (r)
+		goto out_free;
+#endif
+
 	kvm_chardev_ops.owner = module;
 	kvm_vm_fops.owner = module;
 	kvm_vcpu_fops.owner = module;
@@ -2292,7 +2532,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	r = misc_register(&kvm_dev);
 	if (r) {
 		printk(KERN_ERR "kvm: misc device register failed\n");
-		goto out_free;
+		goto out_unreg;
 	}
 
 	kvm_preempt_ops.sched_in = kvm_sched_in;
@@ -2302,7 +2542,13 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 
 	return 0;
 
+out_unreg:
+#ifdef CONFIG_KVM_ASYNC_PF
+	slow_work_unregister_user(THIS_MODULE);
 out_free:
+	kmem_cache_destroy(async_pf_cache);
+out_free_6:
+#endif
 	kmem_cache_destroy(kvm_vcpu_cache);
 out_free_5:
 	sysdev_unregister(&kvm_sysdev);
@@ -2334,6 +2580,11 @@ void kvm_exit(void)
 	kvm_exit_debug();
 	misc_deregister(&kvm_dev);
 	kmem_cache_destroy(kvm_vcpu_cache);
+#ifdef CONFIG_KVM_ASYNC_PF
+	if (async_pf_cache)
+		kmem_cache_destroy(async_pf_cache);
+	slow_work_unregister_user(THIS_MODULE);
+#endif
 	sysdev_unregister(&kvm_sysdev);
 	sysdev_class_unregister(&kvm_sysdev_class);
 	unregister_reboot_notifier(&kvm_reboot_notifier);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v5 09/12] Retry fault before vmentry
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:30   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:30 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

When a page is swapped in, it is mapped into guest memory only after the
guest tries to access it again and generates another fault. To avoid this
second fault we can map the page immediately, since we know the guest is
going to access it.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    7 +++++-
 arch/x86/kvm/mmu.c              |   27 +++++++++++++++++++------
 arch/x86/kvm/paging_tmpl.h      |   40 +++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/x86.c              |    9 ++++++++
 virt/kvm/kvm_main.c             |    2 +
 5 files changed, 73 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index db514ea..45e6c12 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -236,7 +236,8 @@ struct kvm_pio_request {
  */
 struct kvm_mmu {
 	void (*new_cr3)(struct kvm_vcpu *vcpu);
-	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
+	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, bool sync);
+	int (*page_fault_other_cr3)(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t gva, u32 err);
 	void (*free)(struct kvm_vcpu *vcpu);
 	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
 			    u32 *error);
@@ -534,6 +535,8 @@ struct kvm_x86_ops {
 
 struct kvm_arch_async_pf {
 	u32 token;
+	gpa_t cr3;
+	u32 error_code;
 };
 
 extern struct kvm_x86_ops *kvm_x86_ops;
@@ -777,6 +780,8 @@ void kvm_arch_inject_async_page_not_present(struct kvm_vcpu *vcpu,
 					    struct kvm_async_pf *work);
 void kvm_arch_inject_async_page_present(struct kvm_vcpu *vcpu,
 					struct kvm_async_pf *work);
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+			       struct kvm_async_pf *work);
 bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
 #endif /* _ASM_X86_KVM_HOST_H */
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5e6105c..12d1a7b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2327,7 +2327,7 @@ static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr,
 }
 
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
-				u32 error_code)
+				u32 error_code, bool sync)
 {
 	gfn_t gfn;
 	int r;
@@ -2346,10 +2346,13 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 			     error_code & PFERR_WRITE_MASK, gfn);
 }
 
-int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t gva,
+			    gfn_t gfn, u32 error_code)
 {
 	struct kvm_arch_async_pf arch;
 	arch.token = (vcpu->arch.async_pf_id++ << 12) | vcpu->vcpu_id;
+	arch.cr3 = cr3;
+	arch.error_code = error_code;
 	return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
 }
 
@@ -2361,8 +2364,8 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 	return !!kvm_x86_ops->get_cpl(vcpu);
 }
 
-static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
-				u32 error_code)
+static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
+			  bool sync)
 {
 	pfn_t pfn;
 	int r;
@@ -2385,7 +2388,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (can_do_async_pf(vcpu)) {
+	if (!sync && can_do_async_pf(vcpu)) {
 		pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
 		trace_kvm_try_async_get_page(async, pfn);
 	} else {
@@ -2395,7 +2398,8 @@ do_sync:
 	}
 
 	if (async) {
-		if (!kvm_arch_setup_async_pf(vcpu, gpa, gfn))
+		if (!kvm_arch_setup_async_pf(vcpu, vcpu->arch.cr3, gpa, gfn,
+					     error_code))
 			goto do_sync;
 		return 0;
 	}
@@ -2419,6 +2423,12 @@ out_unlock:
 	return 0;
 }
 
+static int tdp_page_fault_sync(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t gpa,
+			       u32 error_code)
+{
+	return tdp_page_fault(vcpu, gpa, error_code, true);
+}
+
 static void nonpaging_free(struct kvm_vcpu *vcpu)
 {
 	mmu_free_roots(vcpu);
@@ -2549,6 +2559,7 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
 	ASSERT(is_pae(vcpu));
 	context->new_cr3 = paging_new_cr3;
 	context->page_fault = paging64_page_fault;
+	context->page_fault_other_cr3 = paging64_page_fault_other_cr3;
 	context->gva_to_gpa = paging64_gva_to_gpa;
 	context->prefetch_page = paging64_prefetch_page;
 	context->sync_page = paging64_sync_page;
@@ -2573,6 +2584,7 @@ static int paging32_init_context(struct kvm_vcpu *vcpu)
 	reset_rsvds_bits_mask(vcpu, PT32_ROOT_LEVEL);
 	context->new_cr3 = paging_new_cr3;
 	context->page_fault = paging32_page_fault;
+	context->page_fault_other_cr3 = paging32_page_fault_other_cr3;
 	context->gva_to_gpa = paging32_gva_to_gpa;
 	context->free = paging_free;
 	context->prefetch_page = paging32_prefetch_page;
@@ -2596,6 +2608,7 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 
 	context->new_cr3 = nonpaging_new_cr3;
 	context->page_fault = tdp_page_fault;
+	context->page_fault_other_cr3 = tdp_page_fault_sync;
 	context->free = nonpaging_free;
 	context->prefetch_page = nonpaging_prefetch_page;
 	context->sync_page = nonpaging_sync_page;
@@ -2983,7 +2996,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code)
 	int r;
 	enum emulation_result er;
 
-	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code);
+	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
 	if (r < 0)
 		goto out;
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index f8c74a1..fec8e52 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -415,8 +415,8 @@ out_gpte_changed:
  *  Returns: 1 if we need to emulate the instruction, 0 otherwise, or
  *           a negative value on error.
  */
-static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
-			       u32 error_code)
+static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
+			     bool sync)
 {
 	int write_fault = error_code & PFERR_WRITE_MASK;
 	int user_fault = error_code & PFERR_USER_MASK;
@@ -461,7 +461,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (can_do_async_pf(vcpu)) {
+	if (!sync && can_do_async_pf(vcpu)) {
 		pfn = gfn_to_pfn_async(vcpu->kvm, walker.gfn, &async);
 		trace_kvm_try_async_get_page(async, pfn);
 	} else {
@@ -471,7 +471,8 @@ do_sync:
 	}
 
 	if (async) {
-		if (!kvm_arch_setup_async_pf(vcpu, addr, walker.gfn))
+		if (!kvm_arch_setup_async_pf(vcpu, vcpu->arch.cr3, addr,
+					     walker.gfn, error_code))
 			goto do_sync;
 		return 0;
 	}
@@ -505,6 +506,37 @@ out_unlock:
 	return 0;
 }
 
+static int FNAME(page_fault_other_cr3)(struct kvm_vcpu *vcpu, gpa_t cr3,
+				       gva_t addr, u32 error_code)
+{
+	int r = 0;
+	gpa_t curr_cr3 = vcpu->arch.cr3;
+
+	if (curr_cr3 != cr3) {
+		/*
+		 * We do page fault on behalf of a process that is sleeping
+		 * because of async PF. PV guest takes reference to mm that cr3
+		 * belongs to, so it has to be valid here.
+		 */
+		kvm_set_cr3(vcpu, cr3);
+		if (kvm_mmu_reload(vcpu))
+			goto switch_cr3;
+	}
+
+	r = FNAME(page_fault)(vcpu, addr, error_code, true);
+
+	if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
+		kvm_mmu_sync_roots(vcpu);
+
+switch_cr3:
+	if (curr_cr3 != vcpu->arch.cr3) {
+		kvm_set_cr3(vcpu, curr_cr3);
+		kvm_mmu_reload(vcpu);
+	}
+
+	return r;
+}
+
 static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	struct kvm_shadow_walk_iterator iterator;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2603cc4..5482db0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5743,6 +5743,15 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
 
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+			       struct kvm_async_pf *work)
+{
+	if (!vcpu->arch.mmu.page_fault_other_cr3 || is_error_page(work->page))
+		return;
+	vcpu->arch.mmu.page_fault_other_cr3(vcpu, work->arch.cr3, work->gva,
+					    work->arch.error_code);
+}
+
 static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
 {
 	if (unlikely(vcpu->arch.apf_memslot_ver !=
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f56e8ac..de1d5b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1348,6 +1348,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 			spin_lock(&vcpu->async_pf_lock);
 			list_del(&work->link);
 			spin_unlock(&vcpu->async_pf_lock);
+			kvm_arch_async_page_ready(vcpu, work);
 			put_page(work->page);
 			async_pf_work_free(work);
 			list_del(&work->queue);
@@ -1366,6 +1367,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 	list_del(&work->queue);
 	vcpu->async_pf_queued--;
 
+	kvm_arch_async_page_ready(vcpu, work);
 	kvm_arch_inject_async_page_present(vcpu, work);
 
 	put_page(work->page);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v5 10/12] Handle async PF in non preemptable context
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:31   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:31 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If an async page fault is received by the idle task, or while preempt_count
is not zero, the guest cannot reschedule, so do sti; hlt and wait for the
page to become ready. The vcpu can still process interrupts while it waits
for the page.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kernel/kvm.c |   36 ++++++++++++++++++++++++++++++++----
 1 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a6db92e..914b0fc 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -37,6 +37,7 @@
 #include <asm/cpu.h>
 #include <asm/traps.h>
 #include <asm/desc.h>
+#include <asm/tlbflush.h>
 
 #define MMU_QUEUE_SIZE 1024
 
@@ -68,6 +69,8 @@ struct kvm_task_sleep_node {
 	wait_queue_head_t wq;
 	u32 token;
 	int cpu;
+	bool halted;
+	struct mm_struct *mm;
 };
 
 static struct kvm_task_sleep_head {
@@ -96,6 +99,11 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
 	struct kvm_task_sleep_node n, *e;
 	DEFINE_WAIT(wait);
+	int cpu, idle;
+
+	cpu = get_cpu();
+	idle = idle_cpu(cpu);
+	put_cpu();
 
 	spin_lock(&b->lock);
 	e = _find_apf_task(b, token);
@@ -109,17 +117,31 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
 
 	n.token = token;
 	n.cpu = smp_processor_id();
+	n.mm = current->active_mm;
+	n.halted = idle || preempt_count() > 1;
+	atomic_inc(&n.mm->mm_count);
 	init_waitqueue_head(&n.wq);
 	hlist_add_head(&n.link, &b->list);
 	spin_unlock(&b->lock);
 
 	for (;;) {
-		prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+		if (!n.halted)
+			prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
 		if (hlist_unhashed(&n.link))
 			break;
-		schedule();
+
+		if (!n.halted) {
+			schedule();
+		} else {
+			/*
+			 * We cannot reschedule. So halt.
+			 */
+			native_safe_halt();
+			local_irq_disable();
+		}
 	}
-	finish_wait(&n.wq, &wait);
+	if (!n.halted)
+		finish_wait(&n.wq, &wait);
 
 	return;
 }
@@ -127,7 +149,12 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
 static void apf_task_wake_one(struct kvm_task_sleep_node *n)
 {
 	hlist_del_init(&n->link);
-	if (waitqueue_active(&n->wq))
+	if (!n->mm)
+		return;
+	mmdrop(n->mm);
+	if (n->halted)
+		smp_send_reschedule(n->cpu);
+	else if (waitqueue_active(&n->wq))
 		wake_up(&n->wq);
 }
 
@@ -157,6 +184,7 @@ again:
 		}
 		n->token = token;
 		n->cpu = smp_processor_id();
+		n->mm = NULL;
 		init_waitqueue_head(&n->wq);
 		hlist_add_head(&n->link, &b->list);
 	} else
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v5 11/12] Let host know whether the guest can handle async PF in non-userspace context.
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:31   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:31 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If the guest can detect that it runs in a non-preemptible context, it can
handle async PFs at any time, so let the host know that it may send async
PFs even when the guest cpu is not in userspace.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/include/asm/kvm_para.h |    1 +
 arch/x86/kernel/kvm.c           |    3 +++
 arch/x86/kvm/x86.c              |    5 +++--
 4 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 45e6c12..c675d5d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -367,6 +367,7 @@ struct kvm_vcpu_arch {
 	cpumask_var_t wbinvd_dirty_mask;
 
 	u32 __user *apf_data;
+	bool apf_send_user_only;
 	u32 apf_memslot_ver;
 	u64 apf_msr_val;
 	u32 async_pf_id;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index edf07cf..a33372c 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -38,6 +38,7 @@
 #define KVM_MAX_MMU_OP_BATCH           32
 
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
+#define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
 
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE            1
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 914b0fc..462b47d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -429,6 +429,9 @@ void __cpuinit kvm_guest_cpu_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) {
 		u64 pa = __pa(&__get_cpu_var(apf_reason));
 
+#ifdef CONFIG_PREEMPT
+		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
+#endif
 		if (native_write_msr_safe(MSR_KVM_ASYNC_PF_EN,
 					  pa | KVM_ASYNC_PF_ENABLED, pa >> 32))
 			return;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5482db0..ba351f5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1222,8 +1222,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 	int offset = offset_in_page(gpa);
 	unsigned long addr;
 
-	/* Bits 1:5 are resrved, Should be zero */
-	if (data & 0x3e)
+	/* Bits 2:5 are resrved, Should be zero */
+	if (data & 0x3c)
 		return 1;
 
 	vcpu->arch.apf_msr_val = data;
@@ -1246,6 +1246,7 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 		return 1;
 	}
 	vcpu->arch.apf_memslot_ver = vcpu->kvm->memslot_version;
+	vcpu->arch.apf_send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
 	kvm_async_pf_wakeup_all(vcpu);
 	return 0;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v5 12/12] Send async PF when guest is not in userspace too.
  2010-07-19 15:30 ` Gleb Natapov
@ 2010-07-19 15:31   ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-07-19 15:31 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

If the guest indicates that it can handle async PFs in kernel mode too,
send them, but only if interrupts are enabled.

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/mmu.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 12d1a7b..ed87b1c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2361,7 +2361,13 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 	if (!vcpu->arch.apf_data || kvm_event_needs_reinjection(vcpu))
 		return false;
 
-	return !!kvm_x86_ops->get_cpl(vcpu);
+	if (vcpu->arch.apf_send_user_only)
+		return !!kvm_x86_ops->get_cpl(vcpu);
+
+	if (!kvm_x86_ops->interrupt_allowed(vcpu))
+		return false;
+
+	return true;
 }
 
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 03/12] Add async PF initialization to PV guest.
  2010-07-19 15:30   ` Gleb Natapov
@ 2010-07-19 19:52     ` Rik van Riel
  -1 siblings, 0 replies; 82+ messages in thread
From: Rik van Riel @ 2010-07-19 19:52 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, avi, mingo, a.p.zijlstra, tglx, hpa,
	cl, mtosatti

On 07/19/2010 11:30 AM, Gleb Natapov wrote:
> Enable async PF in a guest if async PF capability is discovered.
>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 02/12] Add PV MSR to enable asynchronous page faults delivery.
  2010-07-19 15:30   ` Gleb Natapov
@ 2010-08-23 15:22     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-23 15:22 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> Guest enables async PF vcpu functionality using this MSR.
>
>
>
> +static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
> +{
> +	u64 gpa = data&  ~0x3f;
> +	int offset = offset_in_page(gpa);
> +	unsigned long addr;
> +
> +	/* Bits 1:5 are resrved, Should be zero */
> +	if (data&  0x3e)
> +		return 1;
> +
> +	vcpu->arch.apf_msr_val = data;
> +
> +	if (!(data&  KVM_ASYNC_PF_ENABLED)) {
> +		vcpu->arch.apf_data = NULL;
> +		return 0;
> +	}
> +
> +	addr = gfn_to_hva(vcpu->kvm, gpa>>  PAGE_SHIFT);
> +	if (kvm_is_error_hva(addr))
> +		return 1;
> +
> +	vcpu->arch.apf_data = (u32 __user*)(addr + offset);

This can be invalidated by host userspace playing with memory regions.  
It needs to be recalculated on memory map changes, and it may disappear 
from under the guest's feet (in which case we're allowed to 
KVM_REQ_TRIPLE_FAULT it).

(note: this is a much better approach than kvmclock's and vapic's, we 
should copy it there)

> +
> +	/* check if address is mapped */
> +	if (get_user(offset, vcpu->arch.apf_data)) {
> +		vcpu->arch.apf_data = NULL;
> +		return 1;
> +	}

So, this check can succeed today but fail tomorrow.

> +	return 0;
> +}
> +

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 03/12] Add async PF initialization to PV guest.
  2010-07-19 15:30   ` Gleb Natapov
@ 2010-08-23 15:26     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-23 15:26 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> Enable async PF in a guest if async PF capability is discovered.
>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>
> ---
>   arch/x86/include/asm/kvm_para.h |    5 +++
>   arch/x86/kernel/kvm.c           |   68 +++++++++++++++++++++++++++++++++++++++
>   2 files changed, 73 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 5b05e9f..f1662d7 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -65,6 +65,11 @@ struct kvm_mmu_op_release_pt {
>   	__u64 pt_phys;
>   };
>
> +struct kvm_vcpu_pv_apf_data {
> +	__u32 reason;
> +	__u32 enabled;
> +};
> +

The guest will have to align this on a 64-byte boundary; should this be
marked __aligned(64) here?

> @@ -231,12 +235,72 @@ static void __init paravirt_ops_setup(void)
>   #endif
>   }
>
> +void __cpuinit kvm_guest_cpu_init(void)
> +{
> +	if (!kvm_para_available())
> +		return;
> +
> +	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) {
> +		u64 pa = __pa(&__get_cpu_var(apf_reason));
> +
> +		if (native_write_msr_safe(MSR_KVM_ASYNC_PF_EN,
> +					  pa | KVM_ASYNC_PF_ENABLED, pa>>  32))
> +			return;
> +		__get_cpu_var(apf_reason).enabled = 1;
> +		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
> +		       smp_processor_id());
> +	}
> +}

Need a way to disable apf from the guest kernel command line.
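Something along these lines would do, perhaps (untested sketch; the
"no-kvmapf" parameter name and the flag are just a suggestion, not code
from this series):

static int kvm_async_pf_disabled __read_mostly;

/* "no-kvmapf" on the guest command line disables async PF completely */
static int __init parse_no_kvmapf(char *arg)
{
	kvm_async_pf_disabled = 1;
	return 0;
}
early_param("no-kvmapf", parse_no_kvmapf);

kvm_guest_cpu_init() would then check !kvm_async_pf_disabled in addition
to kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) before writing
MSR_KVM_ASYNC_PF_EN.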

> +
> +static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
> +				    unsigned long action, void *hcpu)
> +{
> +	switch (action) {
> +	case CPU_ONLINE:
> +	case CPU_ONLINE_FROZEN:
> +		kvm_guest_cpu_init();
> +		break;
> +	default:
> +		break;

Should we disable apf if the cpu is dying here?
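Something like this, e.g. (sketch only; kvm_pv_disable_apf() is a made-up
helper, and CPU_DYING is picked because it runs on the dying cpu with
interrupts off, so the MSR write hits the right vcpu):

/* tell the host to stop sending async PFs to this vcpu */
static void kvm_pv_disable_apf(void)
{
	if (!__get_cpu_var(apf_reason).enabled)
		return;
	native_write_msr(MSR_KVM_ASYNC_PF_EN, 0, 0);
	__get_cpu_var(apf_reason).enabled = 0;
}

plus CPU_DYING/CPU_DYING_FROZEN cases in the notifier that call it.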

> +	}
> +	return NOTIFY_OK;
> +}
> +

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 02/12] Add PV MSR to enable asynchronous page faults delivery.
  2010-08-23 15:22     ` Avi Kivity
@ 2010-08-23 15:29       ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-08-23 15:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Mon, Aug 23, 2010 at 06:22:02PM +0300, Avi Kivity wrote:
>  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> >Guest enables async PF vcpu functionality using this MSR.
> >
> >
> >
> >+static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
> >+{
> >+	u64 gpa = data&  ~0x3f;
> >+	int offset = offset_in_page(gpa);
> >+	unsigned long addr;
> >+
> >+	/* Bits 1:5 are resrved, Should be zero */
> >+	if (data&  0x3e)
> >+		return 1;
> >+
> >+	vcpu->arch.apf_msr_val = data;
> >+
> >+	if (!(data&  KVM_ASYNC_PF_ENABLED)) {
> >+		vcpu->arch.apf_data = NULL;
> >+		return 0;
> >+	}
> >+
> >+	addr = gfn_to_hva(vcpu->kvm, gpa>>  PAGE_SHIFT);
> >+	if (kvm_is_error_hva(addr))
> >+		return 1;
> >+
> >+	vcpu->arch.apf_data = (u32 __user*)(addr + offset);
> 
> This can be invalidated by host userspace playing with memory
> regions.  It needs to be recalculated on memory map changes, and it
> may disappear from under the guest's feet (in which case we're
> allowed to KVM_REQ_TRIPLE_FAULT it).
> 
> (note: this is a much better approach than kvmclock's and vapic's,
> we should copy it there)
> 
apf_put_user() tracks memory slot changes and revalidates the address if
needed.
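For reference, it does roughly this (paraphrased sketch, not the exact
patch code; details may differ):

static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
{
	/* memslot layout changed since the MSR write: recompute the hva */
	if (unlikely(vcpu->arch.apf_memslot_ver !=
		     vcpu->kvm->memslot_version)) {
		u64 gpa = vcpu->arch.apf_msr_val & ~0x3f;
		unsigned long addr;

		addr = gfn_to_hva(vcpu->kvm, gpa >> PAGE_SHIFT);
		if (kvm_is_error_hva(addr)) {
			vcpu->arch.apf_data = NULL;
			return -EFAULT;
		}
		vcpu->arch.apf_data =
			(u32 __user *)(addr + offset_in_page(gpa));
		vcpu->arch.apf_memslot_ver = vcpu->kvm->memslot_version;
	}
	return put_user(val, vcpu->arch.apf_data);
}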

> >+
> >+	/* check if address is mapped */
> >+	if (get_user(offset, vcpu->arch.apf_data)) {
> >+		vcpu->arch.apf_data = NULL;
> >+		return 1;
> >+	}
> 
> So, this check can succeed today but fail tomorrow.
> 

> >+	return 0;
> >+}
> >+
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 03/12] Add async PF initialization to PV guest.
  2010-08-23 15:26     ` Avi Kivity
@ 2010-08-23 15:35       ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-08-23 15:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Mon, Aug 23, 2010 at 06:26:58PM +0300, Avi Kivity wrote:
>  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> >Enable async PF in a guest if async PF capability is discovered.
> >
> >Signed-off-by: Gleb Natapov<gleb@redhat.com>
> >---
> >  arch/x86/include/asm/kvm_para.h |    5 +++
> >  arch/x86/kernel/kvm.c           |   68 +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 73 insertions(+), 0 deletions(-)
> >
> >diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> >index 5b05e9f..f1662d7 100644
> >--- a/arch/x86/include/asm/kvm_para.h
> >+++ b/arch/x86/include/asm/kvm_para.h
> >@@ -65,6 +65,11 @@ struct kvm_mmu_op_release_pt {
> >  	__u64 pt_phys;
> >  };
> >
> >+struct kvm_vcpu_pv_apf_data {
> >+	__u32 reason;
> >+	__u32 enabled;
> >+};
> >+
> 
> The guest will have to align this on a 64 byte boundary, should this
> be marked __aligned(64) here?
> 
I do __aligned(64) when I declare a variable of that type:

static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
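
For comparison, putting the alignment on the type itself, as suggested above,
would look roughly like this (a sketch only, not what the series does; it
assumes __aligned() is usable where the struct is defined):

struct kvm_vcpu_pv_apf_data {
	__u32 reason;
	__u32 enabled;
} __aligned(64);

static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason);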

> >@@ -231,12 +235,72 @@ static void __init paravirt_ops_setup(void)
> >  #endif
> >  }
> >
> >+void __cpuinit kvm_guest_cpu_init(void)
> >+{
> >+	if (!kvm_para_available())
> >+		return;
> >+
> >+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) {
> >+		u64 pa = __pa(&__get_cpu_var(apf_reason));
> >+
> >+		if (native_write_msr_safe(MSR_KVM_ASYNC_PF_EN,
> >+					  pa | KVM_ASYNC_PF_ENABLED, pa>>  32))
> >+			return;
> >+		__get_cpu_var(apf_reason).enabled = 1;
> >+		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
> >+		       smp_processor_id());
> >+	}
> >+}
> 
> Need a way to disable apf from the guest kernel command line.
> 
OK.
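
For what it's worth, a minimal sketch of such a switch (the parameter name
"no-kvmapf" is made up here; only kvm_guest_cpu_init() and
KVM_FEATURE_ASYNC_PF from this patch are assumed):

static int no_kvmapf;

static int __init parse_no_kvmapf(char *arg)
{
	no_kvmapf = 1;
	return 0;
}
early_param("no-kvmapf", parse_no_kvmapf);

kvm_guest_cpu_init() would then bail out before touching the MSR:

	if (no_kvmapf || !kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
		return;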

> >+
> >+static int __cpuinit kvm_cpu_notify(struct notifier_block *self,
> >+				    unsigned long action, void *hcpu)
> >+{
> >+	switch (action) {
> >+	case CPU_ONLINE:
> >+	case CPU_ONLINE_FROZEN:
> >+		kvm_guest_cpu_init();
> >+		break;
> >+	default:
> >+		break;
> 
> Should we disable apf if the cpu is dying here?
> 
Why? Can a CPU die with outstanding sleeping tasks?
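
If it turns out that it can, disabling async PF on a dying cpu would only take
clearing the MSR before the cpu goes away — a sketch, assuming
MSR_KVM_ASYNC_PF_EN and the apf_reason per-cpu data from this patch:

static void kvm_guest_cpu_offline(void)
{
	if (!__get_cpu_var(apf_reason).enabled)
		return;
	native_write_msr_safe(MSR_KVM_ASYNC_PF_EN, 0, 0);
	__get_cpu_var(apf_reason).enabled = 0;
}

It would then be hooked into the notifier above on
CPU_DOWN_PREPARE/CPU_DOWN_PREPARE_FROZEN.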

> >+	}
> >+	return NOTIFY_OK;
> >+}
> >+
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 04/12] Provide special async page fault handler when async PF capability is detected
  2010-07-19 15:30   ` Gleb Natapov
@ 2010-08-23 15:48     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-23 15:48 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> When async PF capability is detected hook up special page fault handler
> that will handle async page fault events and bypass other page faults to
> regular page fault handler.
>
> Acked-by: Rik van Riel<riel@redhat.com>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>
> ---
>   arch/x86/include/asm/kvm_para.h |    3 +
>   arch/x86/include/asm/traps.h    |    1 +
>   arch/x86/kernel/entry_32.S      |   10 +++
>   arch/x86/kernel/entry_64.S      |    3 +
>   arch/x86/kernel/kvm.c           |  170 +++++++++++++++++++++++++++++++++++++++
>   5 files changed, 187 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index f1662d7..edf07cf 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -65,6 +65,9 @@ struct kvm_mmu_op_release_pt {
>   	__u64 pt_phys;
>   };
>
> +#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
> +#define KVM_PV_REASON_PAGE_READY 2
> +
>   struct kvm_vcpu_pv_apf_data {
>   	__u32 reason;
>   	__u32 enabled;
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index f66cda5..0310da6 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -30,6 +30,7 @@ asmlinkage void segment_not_present(void);
>   asmlinkage void stack_segment(void);
>   asmlinkage void general_protection(void);
>   asmlinkage void page_fault(void);
> +asmlinkage void async_page_fault(void);
>   asmlinkage void spurious_interrupt_bug(void);
>   asmlinkage void coprocessor_error(void);
>   asmlinkage void alignment_check(void);
> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index cd49141..95e13da 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -1494,6 +1494,16 @@ ENTRY(general_protection)
>   	CFI_ENDPROC
>   END(general_protection)
>
> +#ifdef CONFIG_KVM_GUEST
> +ENTRY(async_page_fault)
> +	RING0_EC_FRAME
> +	pushl $do_async_page_fault
> +	CFI_ADJUST_CFA_OFFSET 4
> +	jmp error_code
> +	CFI_ENDPROC
> +END(async_page_fault)
> +#endif
> +
>   /*
>    * End of kprobes section
>    */
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 0697ff1..65c3eb6 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1346,6 +1346,9 @@ errorentry xen_stack_segment do_stack_segment
>   #endif
>   errorentry general_protection do_general_protection
>   errorentry page_fault do_page_fault
> +#ifdef CONFIG_KVM_GUEST
> +errorentry async_page_fault do_async_page_fault
> +#endif
>   #ifdef CONFIG_X86_MCE
>   paranoidzeroentry machine_check *machine_check_vector(%rip)
>   #endif
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 5177dd1..a6db92e 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -29,8 +29,14 @@
>   #include<linux/hardirq.h>
>   #include<linux/notifier.h>
>   #include<linux/reboot.h>
> +#include<linux/hash.h>
> +#include<linux/sched.h>
> +#include<linux/slab.h>
> +#include<linux/kprobes.h>
>   #include<asm/timer.h>
>   #include<asm/cpu.h>
> +#include<asm/traps.h>
> +#include<asm/desc.h>
>
>   #define MMU_QUEUE_SIZE 1024
>
> @@ -54,6 +60,158 @@ static void kvm_io_delay(void)
>   {
>   }
>
> +#define KVM_TASK_SLEEP_HASHBITS 8
> +#define KVM_TASK_SLEEP_HASHSIZE (1<<KVM_TASK_SLEEP_HASHBITS)
> +
> +struct kvm_task_sleep_node {
> +	struct hlist_node link;
> +	wait_queue_head_t wq;
> +	u32 token;
> +	int cpu;
> +};
> +
> +static struct kvm_task_sleep_head {
> +	spinlock_t lock;
> +	struct hlist_head list;
> +} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];
> +
> +static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
> +						  u64 token)

u64 token?

> +{
> +	struct hlist_node *p;
> +
> +	hlist_for_each(p,&b->list) {
> +		struct kvm_task_sleep_node *n =
> +			hlist_entry(p, typeof(*n), link);
> +		if (n->token == token)
> +			return n;

Do you need to match cpu here as well?  Or is token globally unique?

Perhaps we should make it locally unique to remove a requirement from 
the host to synchronize?  I haven't seen how you generate it yet.

> +	}
> +
> +	return NULL;
> +}
> +
> +static void apf_task_wait(struct task_struct *tsk, u32 token)
> +{
> +	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> +	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
> +	struct kvm_task_sleep_node n, *e;
> +	DEFINE_WAIT(wait);
> +
> +	spin_lock(&b->lock);
> +	e = _find_apf_task(b, token);
> +	if (e) {
> +		/* dummy entry exists ->  wake up was delivered ahead of PF */
> +		hlist_del(&e->link);
> +		kfree(e);
> +		spin_unlock(&b->lock);
> +		return;
> +	}
> +
> +	n.token = token;
> +	n.cpu = smp_processor_id();

What's the meaning of cpu?  Won't the waiter migrate to other cpus?  Can 
apf_task_wait() start on a different cpu than the one we got our apf on?

> +	init_waitqueue_head(&n.wq);
> +	hlist_add_head(&n.link,&b->list);
> +	spin_unlock(&b->lock);
> +
> +	for (;;) {
> +		prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);

In theory we could make it interruptible if it's in user context.  The 
signal could arrive before the page and we could handle it.  Not worth 
the complexity I think (having a wakeup with no task to wake).

The user might be confused why they have uninterruptible tasks and no 
disk load, but more than likely they're confused already, so no big loss.

> +		if (hlist_unhashed(&n.link))
> +			break;
> +		schedule();
> +	}
> +	finish_wait(&n.wq,&wait);
> +
> +	return;
> +}
> +
> +static void apf_task_wake_one(struct kvm_task_sleep_node *n)
> +{
> +	hlist_del_init(&n->link);
> +	if (waitqueue_active(&n->wq))
> +		wake_up(&n->wq);
> +}
> +
> +static void apf_task_wake(u32 token)
> +{
> +	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> +	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
> +	struct kvm_task_sleep_node *n;
> +
> +again:
> +	spin_lock(&b->lock);
> +	n = _find_apf_task(b, token);
> +	if (!n) {
> +		/*
> +		 * async PF was not yet handled.
> +		 * Add dummy entry for the token.
> +		 */
> +		n = kmalloc(sizeof(*n), GFP_ATOMIC);
> +		if (!n) {
> +			/*
> +			 * Allocation failed! Busy wait while other vcpu
> +			 * handles async PF.
> +			 */

In guest code, please use 'cpu', not 'vcpu'.

> +			spin_unlock(&b->lock);
> +			cpu_relax();
> +			goto again;
> +		}

The other cpu might be waiting for us to yield.  We can fix it later 
with the pv spinlock infrastructure.

Or, we can avoid the allocation.  If at most one apf can be pending (is 
this true?), we can use a per-cpu variable for this dummy entry.
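
If it does turn out that at most one async PF can be outstanding per cpu (Rik
questions exactly that further down the thread), the GFP_ATOMIC allocation
could be replaced by a statically allocated per-cpu node — a sketch, reusing
kvm_task_sleep_node from this patch:

static DEFINE_PER_CPU(struct kvm_task_sleep_node, apf_dummy_node);

	/* in apf_task_wake(), instead of kmalloc(sizeof(*n), GFP_ATOMIC): */
	n = &__get_cpu_var(apf_dummy_node);
	n->token = token;
	n->cpu = smp_processor_id();
	init_waitqueue_head(&n->wq);
	hlist_add_head(&n->link, &b->list);

apf_task_wait() would then have to stop kfree()ing the dummy entry when it
finds one.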

> +		n->token = token;
> +		n->cpu = smp_processor_id();
> +		init_waitqueue_head(&n->wq);
> +		hlist_add_head(&n->link,&b->list);
> +	} else
> +		apf_task_wake_one(n);
> +	spin_unlock(&b->lock);
> +	return;
> +}
> +
> +static void apf_task_wake_all(void)
> +{
> +	int i;
> +
> +	for (i = 0; i<  KVM_TASK_SLEEP_HASHSIZE; i++) {
> +		struct hlist_node *p, *next;
> +		struct kvm_task_sleep_head *b =&async_pf_sleepers[i];
> +		spin_lock(&b->lock);
> +		hlist_for_each_safe(p, next,&b->list) {
> +			struct kvm_task_sleep_node *n =
> +				hlist_entry(p, typeof(*n), link);
> +			if (n->cpu == smp_processor_id())
> +				apf_task_wake_one(n);
> +		}
> +		spin_unlock(&b->lock);
> +	}
> +}
> +
> +dotraplinkage void __kprobes
> +do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
> +{
> +	u32 reason = 0, token;
> +
> +	if (__get_cpu_var(apf_reason).enabled) {
> +		reason = __get_cpu_var(apf_reason).reason;
> +		__get_cpu_var(apf_reason).reason = 0;

Can per-cpu vars be in vmalloc space?  if so they may trigger nested faults.

I don't think that's the case for core code, so probably safe here.

> +
> +		token = (u32)read_cr2();
> +	}
> +
> +	switch (reason) {
> +	default:
> +		do_page_fault(regs, error_code);
> +		break;
> +	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +		/* page is swapped out by the host. */
> +		apf_task_wait(current, token);
> +		break;
> +	case KVM_PV_REASON_PAGE_READY:
> +		if (unlikely(token == ~0))
> +			apf_task_wake_all();
> +		else
> +			apf_task_wake(token);
> +		break;
> +	}
> +}
> +
>   static void kvm_mmu_op(void *buffer, unsigned len)
>   {
>   	int r;
> @@ -303,13 +461,25 @@ static struct notifier_block __cpuinitdata kvm_cpu_notifier = {
>   };
>   #endif
>
> +static void __init kvm_apf_trap_init(void)
> +{
> +	set_intr_gate(14,&async_page_fault);
> +}

Nice!  Zero impact on non-virt.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 06/12] Add get_user_pages() variant that fails if major fault is required.
  2010-07-19 15:30   ` Gleb Natapov
@ 2010-08-23 15:50     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-23 15:50 UTC (permalink / raw)
  To: Gleb Natapov, Andrew Morton
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> This patch adds a get_user_pages() variant that only succeeds if getting
> a reference to a page doesn't require a major fault.
>
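
To make the intended use concrete, a hypothetical caller would look like this
(the helper name is made up; the caller is assumed to hold
current->mm->mmap_sem for read, as with get_user_pages()):

static struct page *get_page_noio(unsigned long addr, int write)
{
	struct page *page;

	if (get_user_pages_noio(current, current->mm, addr, 1, write,
				0, &page, NULL) != 1)
		return NULL;	/* would need a major fault -> go the async route */
	return page;
}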


Andrew, can this go in through kvm.git?

> Reviewed-by: Rik van Riel<riel@redhat.com>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>
> ---
>   fs/ncpfs/mmap.c    |    2 ++
>   include/linux/mm.h |    5 +++++
>   mm/filemap.c       |    3 +++
>   mm/memory.c        |   31 ++++++++++++++++++++++++++++---
>   mm/shmem.c         |    8 +++++++-
>   5 files changed, 45 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
> index 56f5b3a..b9c4f36 100644
> --- a/fs/ncpfs/mmap.c
> +++ b/fs/ncpfs/mmap.c
> @@ -39,6 +39,8 @@ static int ncp_file_mmap_fault(struct vm_area_struct *area,
>   	int bufsize;
>   	int pos; /* XXX: loff_t ? */
>
> +	if (vmf->flags&  FAULT_FLAG_MINOR)
> +		return VM_FAULT_MAJOR | VM_FAULT_ERROR;
>   	/*
>   	 * ncpfs has nothing against high pages as long
>   	 * as recvmsg and memset works on it
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4238a9c..2bfc85a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -140,6 +140,7 @@ extern pgprot_t protection_map[16];
>   #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
>   #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
>   #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
> +#define FAULT_FLAG_MINOR	0x08	/* Do only minor fault */
>
>   /*
>    * This interface is used by x86 PAT code to identify a pfn mapping that is
> @@ -843,6 +844,9 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *
>   int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   			unsigned long start, int nr_pages, int write, int force,
>   			struct page **pages, struct vm_area_struct **vmas);
> +int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
> +			unsigned long start, int nr_pages, int write, int force,
> +			struct page **pages, struct vm_area_struct **vmas);
>   int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>   			struct page **pages);
>   struct page *get_dump_page(unsigned long addr);
> @@ -1373,6 +1377,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
>   #define FOLL_GET	0x04	/* do get_page on page */
>   #define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
>   #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
> +#define FOLL_MINOR	0x20	/* do only minor page faults */
>
>   typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>   			void *data);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 20e5642..1186338 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1548,6 +1548,9 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>   			goto no_cached_page;
>   		}
>   	} else {
> +		if (vmf->flags&  FAULT_FLAG_MINOR)
> +			return VM_FAULT_MAJOR | VM_FAULT_ERROR;
> +
>   		/* No page in the page cache at all */
>   		do_sync_mmap_readahead(vma, ra, file, offset);
>   		count_vm_event(PGMAJFAULT);
> diff --git a/mm/memory.c b/mm/memory.c
> index 119b7cc..7dfaba2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1433,10 +1433,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   			cond_resched();
>   			while (!(page = follow_page(vma, start, foll_flags))) {
>   				int ret;
> +				unsigned int fault_fl =
> +					((foll_flags&  FOLL_WRITE) ?
> +					FAULT_FLAG_WRITE : 0) |
> +					((foll_flags&  FOLL_MINOR) ?
> +					FAULT_FLAG_MINOR : 0);
>
> -				ret = handle_mm_fault(mm, vma, start,
> -					(foll_flags&  FOLL_WRITE) ?
> -					FAULT_FLAG_WRITE : 0);
> +				ret = handle_mm_fault(mm, vma, start, fault_fl);
>
>   				if (ret&  VM_FAULT_ERROR) {
>   					if (ret&  VM_FAULT_OOM)
> @@ -1444,6 +1447,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   					if (ret&
>   					(VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
>   						return i ? i : -EFAULT;
> +					else if (ret&  VM_FAULT_MAJOR)
> +						return i ? i : -EFAULT;
>   					BUG();
>   				}
>   				if (ret&  VM_FAULT_MAJOR)
> @@ -1554,6 +1559,23 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   }
>   EXPORT_SYMBOL(get_user_pages);
>
> +int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
> +		unsigned long start, int nr_pages, int write, int force,
> +		struct page **pages, struct vm_area_struct **vmas)
> +{
> +	int flags = FOLL_TOUCH | FOLL_MINOR;
> +
> +	if (pages)
> +		flags |= FOLL_GET;
> +	if (write)
> +		flags |= FOLL_WRITE;
> +	if (force)
> +		flags |= FOLL_FORCE;
> +
> +	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
> +}
> +EXPORT_SYMBOL(get_user_pages_noio);
> +
>   /**
>    * get_dump_page() - pin user page in memory while writing it to core dump
>    * @addr: user address
> @@ -2640,6 +2662,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>   	page = lookup_swap_cache(entry);
>   	if (!page) {
> +		if (flags&  FAULT_FLAG_MINOR)
> +			return VM_FAULT_MAJOR | VM_FAULT_ERROR;
> +
>   		grab_swap_token(mm); /* Contend for token _before_ read-in */
>   		page = swapin_readahead(entry,
>   					GFP_HIGHUSER_MOVABLE, vma, address);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f65f840..acc8958 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1227,6 +1227,7 @@ static int shmem_getpage(struct inode *inode, unsigned long idx,
>   	swp_entry_t swap;
>   	gfp_t gfp;
>   	int error;
> +	int flags = type ? *type : 0;
>
>   	if (idx>= SHMEM_MAX_INDEX)
>   		return -EFBIG;
> @@ -1275,6 +1276,11 @@ repeat:
>   		swappage = lookup_swap_cache(swap);
>   		if (!swappage) {
>   			shmem_swp_unmap(entry);
> +			if (flags&  FAULT_FLAG_MINOR) {
> +				spin_unlock(&info->lock);
> +				*type = VM_FAULT_MAJOR | VM_FAULT_ERROR;
> +				goto failed;
> +			}
>   			/* here we actually do the io */
>   			if (type&&  !(*type&  VM_FAULT_MAJOR)) {
>   				__count_vm_event(PGMAJFAULT);
> @@ -1483,7 +1489,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>   {
>   	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
>   	int error;
> -	int ret;
> +	int ret = (int)vmf->flags;
>
>   	if (((loff_t)vmf->pgoff<<  PAGE_CACHE_SHIFT)>= i_size_read(inode))
>   		return VM_FAULT_SIGBUS;


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 04/12] Provide special async page fault handler when async PF capability is detected
  2010-08-23 15:48     ` Avi Kivity
@ 2010-08-23 15:52       ` Rik van Riel
  -1 siblings, 0 replies; 82+ messages in thread
From: Rik van Riel @ 2010-08-23 15:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra,
	tglx, hpa, cl, mtosatti

On 08/23/2010 11:48 AM, Avi Kivity wrote:

> Do you need to match cpu here as well? Or is token globally unique?
>
> Perhaps we should make it locally unique to remove a requirement from
> the host to synchronize? I haven't seen how you generate it yet.

If a task goes to sleep on one VCPU, but that VCPU ends
up not being runnable later on, it would be nice to wake
the task up on a different VCPU.

I do not remember why it is safe to send this wakeup
event as an exception rather than an interrupt...

> The other cpu might be waiting for us to yield. We can fix it later with
> the pv spinlock infrastructure.
>
> Or, we can avoid the allocation. If at most one apf can be pending (is
> this true?), we can use a per-cpu variable for this dummy entry.

Having a limit of just one APF pending kind of defeats
the point.

At that point, a second one of these faults would put
the VCPU to sleep, which prevents the first task from
running once its pagefault (which started earlier)
completes...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 07/12] Maintain memslot version number
  2010-07-19 15:30   ` Gleb Natapov
@ 2010-08-23 15:53     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-23 15:53 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> Code that depends on a particular memslot layout can track changes and
> adjust to the new layout.
>
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c13cc48..c74ffc0 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -177,6 +177,7 @@ struct kvm {
>   	raw_spinlock_t requests_lock;
>   	struct mutex slots_lock;
>   	struct mm_struct *mm; /* userspace tied to this vm */
> +	u32 memslot_version;
>   	struct kvm_memslots *memslots;
>   	struct srcu_struct srcu;
>   #ifdef CONFIG_KVM_APIC_ARCHITECTURE
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b78b794..292514c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -733,6 +733,7 @@ skip_lpage:
>   	slots->memslots[mem->slot] = new;
>   	old_memslots = kvm->memslots;
>   	rcu_assign_pointer(kvm->memslots, slots);
> +	kvm->memslot_version++;
>   	synchronize_srcu_expedited(&kvm->srcu);
>
>   	kvm_arch_commit_memory_region(kvm, mem, old, user_alloc);

How does this interact with rcu?  Nothing enforces consistency between 
rcu_dereference(kvm->memslots) and kvm->memslot_version.

Should probably be rcu_dereference(kvm->memslots)->version.
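
A sketch of what the reader side would look like with the version folded into
the RCU-protected object (the version field is the one this series adds, just
relocated; apf_memslot_ver is the cache from patch 08):

	struct kvm_memslots *slots;
	int idx;

	idx = srcu_read_lock(&kvm->srcu);
	slots = rcu_dereference(kvm->memslots);
	if (unlikely(slots->version != vcpu->arch.apf_memslot_ver)) {
		/* recompute the cached hva against 'slots', then: */
		vcpu->arch.apf_memslot_ver = slots->version;
	}
	srcu_read_unlock(&kvm->srcu, idx);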

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 03/12] Add async PF initialization to PV guest.
  2010-08-23 15:35       ` Gleb Natapov
@ 2010-08-23 16:08         ` Christoph Lameter
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Lameter @ 2010-08-23 16:08 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Avi Kivity, kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra,
	tglx, hpa, riel, mtosatti

On Mon, 23 Aug 2010, Gleb Natapov wrote:

> > The guest will have to align this on a 64 byte boundary, should this
> > be marked __aligned(64) here?
> >
> I do __aligned(64) when I declare variable of that type:
>
> static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);

64 byte boundary: You mean cacheline aligned? We have a special define for
that.

DEFINE_PER_CPU_SHARED_ALIGNED


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 03/12] Add async PF initialization to PV guest.
  2010-08-23 16:08         ` Christoph Lameter
@ 2010-08-23 16:10           ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-08-23 16:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Avi Kivity, kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra,
	tglx, hpa, riel, mtosatti

On Mon, Aug 23, 2010 at 11:08:06AM -0500, Christoph Lameter wrote:
> On Mon, 23 Aug 2010, Gleb Natapov wrote:
> 
> > > The guest will have to align this on a 64 byte boundary, should this
> > > be marked __aligned(64) here?
> > >
> > I do __aligned(64) when I declare variable of that type:
> >
> > static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
> 
> 64 byte boundary: You mean cacheline aligned? We have a special define for
> that.
> 
> DEFINE_PER_CPU_SHARED_ALIGNED
IIRC I tried to use it and it gives different alignment on 64-bit and 32-bit.
The alignment here is part of the guest/host interface, so it should be the
same on both.
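
In other words, the choice is between cacheline-sized alignment and a fixed
ABI value (the two declarations below are alternatives, shown side by side for
illustration only):

/* follows L1_CACHE_BYTES, which varies by architecture and config: */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct kvm_vcpu_pv_apf_data, apf_reason);

/* fixed 64 bytes, as the guest/host ABI requires: */
static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);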

--
			Gleb.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 08/12] Inject asynchronous page fault into a guest if page is swapped out.
  2010-07-19 15:30   ` Gleb Natapov
@ 2010-08-23 16:17     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-23 16:17 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> If the guest accesses swapped-out memory, do not swap it in from the vcpu
> thread context. Set up slow work to do the swapping and send an async page
> fault to the guest.
>
> Allow async page fault injection only when the guest is in user mode, since
> otherwise the guest may be in a non-sleepable context and will not be able
> to reschedule.
>
>
>
>   struct kvm_arch {
> @@ -444,6 +446,8 @@ struct kvm_vcpu_stat {
>   	u32 hypercalls;
>   	u32 irq_injections;
>   	u32 nmi_injections;
> +	u32 apf_not_present;
> +	u32 apf_present;
>   };

Please don't add more stats, instead add tracepoints which can be 
converted to stats by userspace.

Would be good to have both guest and host tracepoints.
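
For the host side, the kind of tracepoint being asked for could look roughly
like this, e.g. in arch/x86/kvm/trace.h (event and field names are only
illustrative, not from the series):

TRACE_EVENT(kvm_async_pf_not_present,
	TP_PROTO(u64 gva),
	TP_ARGS(gva),

	TP_STRUCT__entry(
		__field(u64, gva)
	),

	TP_fast_assign(
		__entry->gva = gva;
	),

	TP_printk("gva %#llx", (unsigned long long)__entry->gva)
);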

> @@ -2345,6 +2346,21 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
>   			     error_code&  PFERR_WRITE_MASK, gfn);
>   }
>
> +int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
> +{
> +	struct kvm_arch_async_pf arch;
> +	arch.token = (vcpu->arch.async_pf_id++<<  12) | vcpu->vcpu_id;
> +	return kvm_setup_async_pf(vcpu, gva, gfn,&arch);
> +}

Ok, so the token is globally unique.  We're limited to 4096 vcpus, I guess
that's fine.  Wraparound at 1M faults/vcpu, what's the impact?  Failure
if we have 1M faulting processes on one vcpu?

I guess that's fine too.

> +
> +static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> +{
> +	if (!vcpu->arch.apf_data || kvm_event_needs_reinjection(vcpu))
> +		return false;
> +
> +	return !!kvm_x86_ops->get_cpl(vcpu);

The !! is not needed, bool auto-converts.  But > 0 is more readable.
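
I.e. the last line would simply read:

	return kvm_x86_ops->get_cpl(vcpu) > 0;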

> +}
> +
>   static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
>   				u32 error_code)
>   {
> @@ -2353,6 +2369,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
>   	int level;
>   	gfn_t gfn = gpa>>  PAGE_SHIFT;
>   	unsigned long mmu_seq;
> +	bool async;
>
>   	ASSERT(vcpu);
>   	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
> @@ -2367,7 +2384,23 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
>
>   	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>   	smp_rmb();
> -	pfn = gfn_to_pfn(vcpu->kvm, gfn);
> +
> +	if (can_do_async_pf(vcpu)) {
> +		pfn = gfn_to_pfn_async(vcpu->kvm, gfn,&async);
> +		trace_kvm_try_async_get_page(async, pfn);
> +	} else {
> +do_sync:
> +		async = false;
> +		pfn = gfn_to_pfn(vcpu->kvm, gfn);
> +	}
> +
> +	if (async) {
> +		if (!kvm_arch_setup_async_pf(vcpu, gpa, gfn))
> +			goto do_sync;
> +		return 0;
> +	}
> +

This goto is pretty ugly.  How about:

     async = false;
     if (can_do_async_pf(&async)) {
     }
     if (async && !setup())
           async = false;
     if (async)
           ...

or something.

> @@ -459,7 +460,21 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
>
>   	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>   	smp_rmb();
> -	pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
> +
> +	if (can_do_async_pf(vcpu)) {
> +		pfn = gfn_to_pfn_async(vcpu->kvm, walker.gfn,&async);
> +		trace_kvm_try_async_get_page(async, pfn);
> +	} else {
> +do_sync:
> +		async = false;
> +		pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
> +	}
> +
> +	if (async) {
> +		if (!kvm_arch_setup_async_pf(vcpu, addr, walker.gfn))
> +			goto do_sync;
> +		return 0;
> +	}
>

This repetition is ugly too.
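
One way to address both the goto and the duplication is to pull the whole
sequence into a small helper shared by tdp_page_fault() and
FNAME(page_fault) — a sketch only, built from the calls quoted above (the
helper name is made up):

static bool try_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
			 pfn_t *pfn)
{
	bool async = false;

	if (!can_do_async_pf(vcpu)) {
		*pfn = gfn_to_pfn(vcpu->kvm, gfn);
		return false;
	}

	*pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
	trace_kvm_try_async_get_page(async, *pfn);
	if (!async)
		return false;		/* *pfn is already valid */

	if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
		return true;		/* fault queued, the caller returns 0 */

	/* could not queue the async fault, fall back to a synchronous one */
	*pfn = gfn_to_pfn(vcpu->kvm, gfn);
	return false;
}

and each call site shrinks to:

	if (try_async_pf(vcpu, gpa, gfn, &pfn))
		return 0;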

>
> +static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
> +{
> +	if (unlikely(vcpu->arch.apf_memslot_ver !=
> +		     vcpu->kvm->memslot_version)) {
> +		u64 gpa = vcpu->arch.apf_msr_val&  ~0x3f;
> +		unsigned long addr;
> +		int offset = offset_in_page(gpa);
> +
> +		addr = gfn_to_hva(vcpu->kvm, gpa>>  PAGE_SHIFT);
> +		vcpu->arch.apf_data = (u32 __user *)(addr + offset);
> +		if (kvm_is_error_hva(addr)) {
> +			vcpu->arch.apf_data = NULL;
> +			return -EFAULT;
> +		}
> +	}
> +
> +	return put_user(val, vcpu->arch.apf_data);
> +}

This nice cache needs to live outside the apf code, both to reduce complexity
for reviewers and because it is useful for others.

Would be good to have memslot-cached kvm_put_guest() and kvm_get_guest().
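
A rough shape for such a helper, lifted from apf_put_user() above (the struct
and function names are made up; memslot_version is the field this series
adds):

struct kvm_guest_cache {
	gpa_t gpa;
	u32 memslot_ver;
	u32 __user *hva;
};

static int kvm_put_guest_cached(struct kvm *kvm, struct kvm_guest_cache *c,
				u32 val)
{
	if (unlikely(c->memslot_ver != kvm->memslot_version)) {
		unsigned long addr = gfn_to_hva(kvm, c->gpa >> PAGE_SHIFT);

		if (kvm_is_error_hva(addr))
			return -EFAULT;
		c->hva = (u32 __user *)(addr + offset_in_page(c->gpa));
		c->memslot_ver = kvm->memslot_version;
	}
	return put_user(val, c->hva);
}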

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 292514c..f56e8ac 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -78,6 +78,11 @@ static atomic_t hardware_enable_failed;
>   struct kmem_cache *kvm_vcpu_cache;
>   EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
>
> +#ifdef CONFIG_KVM_ASYNC_PF
> +#define ASYNC_PF_PER_VCPU 100
> +static struct kmem_cache *async_pf_cache;
> +#endif

All those #ifdefs can be eliminated with virt/kvm/apf.[ch].
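
E.g. a single header could carry the #ifdef once, with empty inline stubs for
the !CONFIG_KVM_ASYNC_PF case (the file and function names below are only
illustrative):

/* virt/kvm/async_pf.h */
#ifdef CONFIG_KVM_ASYNC_PF
int kvm_async_pf_init(void);
void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu);
#else
static inline int kvm_async_pf_init(void) { return 0; }
static inline void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu) {}
#endif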

> +
>   static __read_mostly struct preempt_ops kvm_preempt_ops;
>
>   struct dentry *kvm_debugfs_dir;
> @@ -186,6 +191,11 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>   	vcpu->kvm = kvm;
>   	vcpu->vcpu_id = id;
>   	init_waitqueue_head(&vcpu->wq);
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	INIT_LIST_HEAD(&vcpu->async_pf_done);
> +	INIT_LIST_HEAD(&vcpu->async_pf_queue);
> +	spin_lock_init(&vcpu->async_pf_lock);
> +#endif

  kvm_apf_init() etc.

> +		       struct kvm_arch_async_pf *arch)
> +{
> +	struct kvm_async_pf *work;
> +
> +	if (vcpu->async_pf_queued>= ASYNC_PF_PER_VCPU)
> +		return 0;

100 == too high.  At 16 vcpus, this allows 1600 kernel threads to wait 
for I/O.

Would have been best if we could ask for a page to be paged in 
asynchronously.

> +
> +	/* setup slow work */
> +
> +	/* do alloc nowait since if we are going to sleep anyway we
> +	   may as well sleep faulting in page */
> +	work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
> +	if (!work)
> +		return 0;
> +
> +	atomic_set(&work->used, 1);
> +	work->page = NULL;
> +	work->vcpu = vcpu;
> +	work->gva = gva;
> +	work->addr = gfn_to_hva(vcpu->kvm, gfn);
> +	work->arch = *arch;
> +	work->mm = current->mm;
> +	atomic_inc(&work->mm->mm_count);
> +	kvm_get_kvm(work->vcpu->kvm);
> +
> +	/* this can't really happen otherwise gfn_to_pfn_async
> +	   would succeed */
> +	if (unlikely(kvm_is_error_hva(work->addr)))
> +		goto retry_sync;
> +
> +	slow_work_init(&work->work,&async_pf_ops);
> +	if (slow_work_enqueue(&work->work) != 0)
> +		goto retry_sync;
> +
> +	vcpu->async_pf_work = work;
> +	list_add_tail(&work->queue,&vcpu->async_pf_queue);
> +	vcpu->async_pf_queued++;
> +	return 1;
> +retry_sync:
> +	kvm_put_kvm(work->vcpu->kvm);
> +	mmdrop(work->mm);
> +	kmem_cache_free(async_pf_cache, work);
> +	return 0;
> +}
> +
> +

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 08/12] Inject asynchronous page fault into a guest if page is swapped out.
@ 2010-08-23 16:17     ` Avi Kivity
  0 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-23 16:17 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> If the guest accesses swapped-out memory, do not swap it in from the vcpu
> thread context. Set up slow work to do the swapping and send an async page
> fault to the guest.
>
> Allow async page fault injection only when the guest is in user mode, since
> otherwise the guest may be in a non-sleepable context and will not be able
> to reschedule.
>
>
>
>   struct kvm_arch {
> @@ -444,6 +446,8 @@ struct kvm_vcpu_stat {
>   	u32 hypercalls;
>   	u32 irq_injections;
>   	u32 nmi_injections;
> +	u32 apf_not_present;
> +	u32 apf_present;
>   };

Please don't add more stats, instead add tracepoints which can be 
converted to stats by userspace.

Would be good to have both guest and host tracepoints.

> @@ -2345,6 +2346,21 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
>   			     error_code&  PFERR_WRITE_MASK, gfn);
>   }
>
> +int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
> +{
> +	struct kvm_arch_async_pf arch;
> +	arch.token = (vcpu->arch.async_pf_id++<<  12) | vcpu->vcpu_id;
> +	return kvm_setup_async_pf(vcpu, gva, gfn,&arch);
> +}

Ok.  so token is globally unique.  We're limited to 4096 vcpus, I guess 
that's fine.  Wraparound at 1M faults/vcpu, what's the impact?  failure 
if we have 1M faulting processes on one vcpu?

I guess that's fine too.

> +
> +static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> +{
> +	if (!vcpu->arch.apf_data || kvm_event_needs_reinjection(vcpu))
> +		return false;
> +
> +	return !!kvm_x86_ops->get_cpl(vcpu);

!! !needed, bool autoconverts.  But > 0 is more readable.

> +}
> +
>   static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
>   				u32 error_code)
>   {
> @@ -2353,6 +2369,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
>   	int level;
>   	gfn_t gfn = gpa>>  PAGE_SHIFT;
>   	unsigned long mmu_seq;
> +	bool async;
>
>   	ASSERT(vcpu);
>   	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
> @@ -2367,7 +2384,23 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
>
>   	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>   	smp_rmb();
> -	pfn = gfn_to_pfn(vcpu->kvm, gfn);
> +
> +	if (can_do_async_pf(vcpu)) {
> +		pfn = gfn_to_pfn_async(vcpu->kvm, gfn,&async);
> +		trace_kvm_try_async_get_page(async, pfn);
> +	} else {
> +do_sync:
> +		async = false;
> +		pfn = gfn_to_pfn(vcpu->kvm, gfn);
> +	}
> +
> +	if (async) {
> +		if (!kvm_arch_setup_async_pf(vcpu, gpa, gfn))
> +			goto do_sync;
> +		return 0;
> +	}
> +

This goto is pretty ugly.  How about:

     async = false;
     if (can_do_async_pf(&async)) {
     }
     if (async && !setup())
           async = false;
     if (async)
           ...

or something.

> @@ -459,7 +460,21 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
>
>   	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>   	smp_rmb();
> -	pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
> +
> +	if (can_do_async_pf(vcpu)) {
> +		pfn = gfn_to_pfn_async(vcpu->kvm, walker.gfn,&async);
> +		trace_kvm_try_async_get_page(async, pfn);
> +	} else {
> +do_sync:
> +		async = false;
> +		pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
> +	}
> +
> +	if (async) {
> +		if (!kvm_arch_setup_async_pf(vcpu, addr, walker.gfn))
> +			goto do_sync;
> +		return 0;
> +	}
>

This repetition is ugly too.

>
> +static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
> +{
> +	if (unlikely(vcpu->arch.apf_memslot_ver !=
> +		     vcpu->kvm->memslot_version)) {
> +		u64 gpa = vcpu->arch.apf_msr_val&  ~0x3f;
> +		unsigned long addr;
> +		int offset = offset_in_page(gpa);
> +
> +		addr = gfn_to_hva(vcpu->kvm, gpa>>  PAGE_SHIFT);
> +		vcpu->arch.apf_data = (u32 __user *)(addr + offset);
> +		if (kvm_is_error_hva(addr)) {
> +			vcpu->arch.apf_data = NULL;
> +			return -EFAULT;
> +		}
> +	}
> +
> +	return put_user(val, vcpu->arch.apf_data);
> +}

This nice cache needs to be outside apf to reduce complexity for 
reviewers and since it is useful for others.

Would be good to have memslot-cached kvm_put_guest() and kvm_get_guest().

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 292514c..f56e8ac 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -78,6 +78,11 @@ static atomic_t hardware_enable_failed;
>   struct kmem_cache *kvm_vcpu_cache;
>   EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
>
> +#ifdef CONFIG_KVM_ASYNC_PF
> +#define ASYNC_PF_PER_VCPU 100
> +static struct kmem_cache *async_pf_cache;
> +#endif

All those #ifdefs can be eliminated with virt/kvm/apf.[ch].

> +
>   static __read_mostly struct preempt_ops kvm_preempt_ops;
>
>   struct dentry *kvm_debugfs_dir;
> @@ -186,6 +191,11 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>   	vcpu->kvm = kvm;
>   	vcpu->vcpu_id = id;
>   	init_waitqueue_head(&vcpu->wq);
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	INIT_LIST_HEAD(&vcpu->async_pf_done);
> +	INIT_LIST_HEAD(&vcpu->async_pf_queue);
> +	spin_lock_init(&vcpu->async_pf_lock);
> +#endif

  kvm_apf_init() etc.

> +		       struct kvm_arch_async_pf *arch)
> +{
> +	struct kvm_async_pf *work;
> +
> +	if (vcpu->async_pf_queued>= ASYNC_PF_PER_VCPU)
> +		return 0;

100 == too high.  At 16 vcpus, this allows 1600 kernel threads to wait 
for I/O.

Would have been best if we could ask for a page to be paged in 
asynchronously.

> +
> +	/* setup slow work */
> +
> +	/* do alloc nowait since if we are going to sleep anyway we
> +	   may as well sleep faulting in page */
> +	work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
> +	if (!work)
> +		return 0;
> +
> +	atomic_set(&work->used, 1);
> +	work->page = NULL;
> +	work->vcpu = vcpu;
> +	work->gva = gva;
> +	work->addr = gfn_to_hva(vcpu->kvm, gfn);
> +	work->arch = *arch;
> +	work->mm = current->mm;
> +	atomic_inc(&work->mm->mm_count);
> +	kvm_get_kvm(work->vcpu->kvm);
> +
> +	/* this can't really happen otherwise gfn_to_pfn_async
> +	   would succeed */
> +	if (unlikely(kvm_is_error_hva(work->addr)))
> +		goto retry_sync;
> +
> +	slow_work_init(&work->work,&async_pf_ops);
> +	if (slow_work_enqueue(&work->work) != 0)
> +		goto retry_sync;
> +
> +	vcpu->async_pf_work = work;
> +	list_add_tail(&work->queue,&vcpu->async_pf_queue);
> +	vcpu->async_pf_queued++;
> +	return 1;
> +retry_sync:
> +	kvm_put_kvm(work->vcpu->kvm);
> +	mmdrop(work->mm);
> +	kmem_cache_free(async_pf_cache, work);
> +	return 0;
> +}
> +
> +

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 03/12] Add async PF initialization to PV guest.
  2010-08-23 16:08         ` Christoph Lameter
@ 2010-08-23 16:19           ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-23 16:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra,
	tglx, hpa, riel, mtosatti

  On 08/23/2010 07:08 PM, Christoph Lameter wrote:
> On Mon, 23 Aug 2010, Gleb Natapov wrote:
>
>>> The guest will have to align this on a 64 byte boundary, should this
>>> be marked __aligned(64) here?
>>>
>> I do __aligned(64) when I declare variable of that type:
>>
>> static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
> 64 byte boundary: You mean cacheline aligned? We have a special define for
> that.
>
> DEFINE_PER_CPU_SHARED_ALIGNED
>

It's an ABI, so we can't use something that might change when Intel 
releases a cpu with 75.2 byte cache lines.
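
To spell out the difference with a sketch (the same declaration is shown
twice here purely for comparison):

	/* ABI: always a 64 byte boundary, whatever the hardware does */
	static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);

	/* tied to the kernel's idea of a cache line, which is free to change */
	static DEFINE_PER_CPU_SHARED_ALIGNED(struct kvm_vcpu_pv_apf_data, apf_reason);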

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 04/12] Provide special async page fault handler when async PF capability is detected
  2010-08-23 15:52       ` Rik van Riel
@ 2010-08-23 16:22         ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-23 16:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Gleb Natapov, kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra,
	tglx, hpa, cl, mtosatti

  On 08/23/2010 06:52 PM, Rik van Riel wrote:
> On 08/23/2010 11:48 AM, Avi Kivity wrote:
>
>> Do you need to match cpu here as well? Or is token globally unique?
>>
>> Perhaps we should make it locally unique to remove a requirement from
>> the host to synchronize? I haven't seen how you generate it yet.
>
> If a task goes to sleep on one VCPU, but that VCPU ends
> up not being runnable later on, it would be nice to wake
> the task up on a different VCPU.
>
> I do not remember why it is safe to send this wakeup
> event as an exception rather than an interrupt...

Wakeup could definitely be an interrupt, but the apf needs to be an 
exception so we reuse it.

>
>> The other cpu might be waiting for us to yield. We can fix it later with
>> the pv spinlock infrastructure.
>>
>> Or, we can avoid the allocation. If at most one apf can be pending (is
>> this true?), we can use a per-cpu variable for this dummy entry.
>
> Having a limit of just one APF pending kind of defeats
> the point.

Yes.  How about, one APF pending before it is seen by the guest - but 
how can we tell without an annoying xchg?
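
Purely as a sketch of what I mean (this is not from the series): before
injecting another apf the host would have to peek at the shared reason word
and see whether the guest already consumed the previous one, e.g.

	/* hypothetical host-side check; apf_data is the u32 reason pointer */
	static bool apf_slot_free(struct kvm_vcpu *vcpu)
	{
		u32 reason;

		if (get_user(reason, vcpu->arch.apf_data))
			return false;
		/* a plain read still races with the guest clearing the
		 * word, hence the xchg question */
		return reason == 0;
	}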

>
> At that point, a second one of these faults would put
> the VCPU to sleep, which prevents the first task from
> running once its pagefault (which started earlier)
> completes...
>


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 04/12] Provide special async page fault handler when async PF capability is detected
  2010-08-23 15:48     ` Avi Kivity
@ 2010-08-24  7:31       ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-08-24  7:31 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Mon, Aug 23, 2010 at 06:48:53PM +0300, Avi Kivity wrote:
>  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> >When async PF capability is detected hook up special page fault handler
> >that will handle async page fault events and bypass other page faults to
> >regular page fault handler.
> >
> >Acked-by: Rik van Riel<riel@redhat.com>
> >Signed-off-by: Gleb Natapov<gleb@redhat.com>
> >---
> >  arch/x86/include/asm/kvm_para.h |    3 +
> >  arch/x86/include/asm/traps.h    |    1 +
> >  arch/x86/kernel/entry_32.S      |   10 +++
> >  arch/x86/kernel/entry_64.S      |    3 +
> >  arch/x86/kernel/kvm.c           |  170 +++++++++++++++++++++++++++++++++++++++
> >  5 files changed, 187 insertions(+), 0 deletions(-)
> >
> >diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> >index f1662d7..edf07cf 100644
> >--- a/arch/x86/include/asm/kvm_para.h
> >+++ b/arch/x86/include/asm/kvm_para.h
> >@@ -65,6 +65,9 @@ struct kvm_mmu_op_release_pt {
> >  	__u64 pt_phys;
> >  };
> >
> >+#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
> >+#define KVM_PV_REASON_PAGE_READY 2
> >+
> >  struct kvm_vcpu_pv_apf_data {
> >  	__u32 reason;
> >  	__u32 enabled;
> >diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> >index f66cda5..0310da6 100644
> >--- a/arch/x86/include/asm/traps.h
> >+++ b/arch/x86/include/asm/traps.h
> >@@ -30,6 +30,7 @@ asmlinkage void segment_not_present(void);
> >  asmlinkage void stack_segment(void);
> >  asmlinkage void general_protection(void);
> >  asmlinkage void page_fault(void);
> >+asmlinkage void async_page_fault(void);
> >  asmlinkage void spurious_interrupt_bug(void);
> >  asmlinkage void coprocessor_error(void);
> >  asmlinkage void alignment_check(void);
> >diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> >index cd49141..95e13da 100644
> >--- a/arch/x86/kernel/entry_32.S
> >+++ b/arch/x86/kernel/entry_32.S
> >@@ -1494,6 +1494,16 @@ ENTRY(general_protection)
> >  	CFI_ENDPROC
> >  END(general_protection)
> >
> >+#ifdef CONFIG_KVM_GUEST
> >+ENTRY(async_page_fault)
> >+	RING0_EC_FRAME
> >+	pushl $do_async_page_fault
> >+	CFI_ADJUST_CFA_OFFSET 4
> >+	jmp error_code
> >+	CFI_ENDPROC
> >+END(async_page_fault)
> >+#endif
> >+
> >  /*
> >   * End of kprobes section
> >   */
> >diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> >index 0697ff1..65c3eb6 100644
> >--- a/arch/x86/kernel/entry_64.S
> >+++ b/arch/x86/kernel/entry_64.S
> >@@ -1346,6 +1346,9 @@ errorentry xen_stack_segment do_stack_segment
> >  #endif
> >  errorentry general_protection do_general_protection
> >  errorentry page_fault do_page_fault
> >+#ifdef CONFIG_KVM_GUEST
> >+errorentry async_page_fault do_async_page_fault
> >+#endif
> >  #ifdef CONFIG_X86_MCE
> >  paranoidzeroentry machine_check *machine_check_vector(%rip)
> >  #endif
> >diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> >index 5177dd1..a6db92e 100644
> >--- a/arch/x86/kernel/kvm.c
> >+++ b/arch/x86/kernel/kvm.c
> >@@ -29,8 +29,14 @@
> >  #include<linux/hardirq.h>
> >  #include<linux/notifier.h>
> >  #include<linux/reboot.h>
> >+#include<linux/hash.h>
> >+#include<linux/sched.h>
> >+#include<linux/slab.h>
> >+#include<linux/kprobes.h>
> >  #include<asm/timer.h>
> >  #include<asm/cpu.h>
> >+#include<asm/traps.h>
> >+#include<asm/desc.h>
> >
> >  #define MMU_QUEUE_SIZE 1024
> >
> >@@ -54,6 +60,158 @@ static void kvm_io_delay(void)
> >  {
> >  }
> >
> >+#define KVM_TASK_SLEEP_HASHBITS 8
> >+#define KVM_TASK_SLEEP_HASHSIZE (1<<KVM_TASK_SLEEP_HASHBITS)
> >+
> >+struct kvm_task_sleep_node {
> >+	struct hlist_node link;
> >+	wait_queue_head_t wq;
> >+	u32 token;
> >+	int cpu;
> >+};
> >+
> >+static struct kvm_task_sleep_head {
> >+	spinlock_t lock;
> >+	struct hlist_head list;
> >+} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];
> >+
> >+static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
> >+						  u64 token)
> 
> u64 token?
> 
Should be u32. Will fix.

> >+{
> >+	struct hlist_node *p;
> >+
> >+	hlist_for_each(p,&b->list) {
> >+		struct kvm_task_sleep_node *n =
> >+			hlist_entry(p, typeof(*n), link);
> >+		if (n->token == token)
> >+			return n;
> 
> Do you need to match cpu here as well?  Or is token globally unique?
Tokens are globally unique.

> 
> Perhaps we should make it locally unique to remove a requirement
> from the host to synchronize?  I haven't seen how you generate it
> yet.
The host does not need to synchronize to generate a globally unique token
since the vcpu id is part of the token.
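
For reference, the encoding used on the host side later in the series
(patch 08/12) is

	/* low 12 bits: vcpu id, upper bits: per-vcpu fault counter */
	arch.token = (vcpu->arch.async_pf_id++ << 12) | vcpu->vcpu_id;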

> 
> >+	}
> >+
> >+	return NULL;
> >+}
> >+
> >+static void apf_task_wait(struct task_struct *tsk, u32 token)
> >+{
> >+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> >+	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
> >+	struct kvm_task_sleep_node n, *e;
> >+	DEFINE_WAIT(wait);
> >+
> >+	spin_lock(&b->lock);
> >+	e = _find_apf_task(b, token);
> >+	if (e) {
> >+		/* dummy entry exist ->  wake up was delivered ahead of PF */
> >+		hlist_del(&e->link);
> >+		kfree(e);
> >+		spin_unlock(&b->lock);
> >+		return;
> >+	}
> >+
> >+	n.token = token;
> >+	n.cpu = smp_processor_id();
> 
> What's the meaning of cpu?  Won't the waiter migrate to other cpus?
The waiter cannot migrate to another cpu while it is sleeping. It may be
scheduled to run on any cpu once it is woken.

> Can apf_task_wait() start on a different cpu than the one we got our
> apf on?
No. It is called directly from the exception handler.

> 
> >+	init_waitqueue_head(&n.wq);
> >+	hlist_add_head(&n.link,&b->list);
> >+	spin_unlock(&b->lock);
> >+
> >+	for (;;) {
> >+		prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
> 
> In theory we could make it interruptible if it's in user context.
> The signal could arrive before the page and we could handle it.  Not
> worth the complexity I think (having a wakeup with no task to wake).
> 
> The user might be confused why they have uninterruptible tasks and
> no disk load, but more than likely they're confused already, so no
> big loss.
> 
> >+		if (hlist_unhashed(&n.link))
> >+			break;
> >+		schedule();
> >+	}
> >+	finish_wait(&n.wq,&wait);
> >+
> >+	return;
> >+}
> >+
> >+static void apf_task_wake_one(struct kvm_task_sleep_node *n)
> >+{
> >+	hlist_del_init(&n->link);
> >+	if (waitqueue_active(&n->wq))
> >+		wake_up(&n->wq);
> >+}
> >+
> >+static void apf_task_wake(u32 token)
> >+{
> >+	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> >+	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
> >+	struct kvm_task_sleep_node *n;
> >+
> >+again:
> >+	spin_lock(&b->lock);
> >+	n = _find_apf_task(b, token);
> >+	if (!n) {
> >+		/*
> >+		 * async PF was not yet handled.
> >+		 * Add dummy entry for the token.
> >+		 */
> >+		n = kmalloc(sizeof(*n), GFP_ATOMIC);
> >+		if (!n) {
> >+			/*
> >+			 * Allocation failed! Busy wait while other vcpu
> >+			 * handles async PF.
> >+			 */
> 
> In guest code, please use 'cpu', not 'vcpu'.
> 
OK. This is just a comment :)

> >+			spin_unlock(&b->lock);
> >+			cpu_relax();
> >+			goto again;
> >+		}
> 
> The other cpu might be waiting for us to yield.  We can fix it later
> with the pv spinlock infrastructure.
> 
This busy wait happens only if a (very small) atomic allocation fails, so if
a guest ever hits this code path I expect it to be on its way to dying
anyway.

> Or, we can avoid the allocation.  If at most one apf can be pending
> (is this true?), we can use a per-cpu variable for this dummy entry.
> 
We can have many outstanding apfs.

> >+		n->token = token;
> >+		n->cpu = smp_processor_id();
> >+		init_waitqueue_head(&n->wq);
> >+		hlist_add_head(&n->link,&b->list);
> >+	} else
> >+		apf_task_wake_one(n);
> >+	spin_unlock(&b->lock);
> >+	return;
> >+}
> >+
> >+static void apf_task_wake_all(void)
> >+{
> >+	int i;
> >+
> >+	for (i = 0; i<  KVM_TASK_SLEEP_HASHSIZE; i++) {
> >+		struct hlist_node *p, *next;
> >+		struct kvm_task_sleep_head *b =&async_pf_sleepers[i];
> >+		spin_lock(&b->lock);
> >+		hlist_for_each_safe(p, next,&b->list) {
> >+			struct kvm_task_sleep_node *n =
> >+				hlist_entry(p, typeof(*n), link);
> >+			if (n->cpu == smp_processor_id())
> >+				apf_task_wake_one(n);
> >+		}
> >+		spin_unlock(&b->lock);
> >+	}
> >+}
> >+
> >+dotraplinkage void __kprobes
> >+do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
> >+{
> >+	u32 reason = 0, token;
> >+
> >+	if (__get_cpu_var(apf_reason).enabled) {
> >+		reason = __get_cpu_var(apf_reason).reason;
> >+		__get_cpu_var(apf_reason).reason = 0;
> 
> Can per-cpu vars be in vmalloc space?  if so they may trigger nested faults.
> 
> I don't think that's the case for core code, so probably safe here.
> 
> >+
> >+		token = (u32)read_cr2();
> >+	}
> >+
> >+	switch (reason) {
> >+	default:
> >+		do_page_fault(regs, error_code);
> >+		break;
> >+	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> >+		/* page is swapped out by the host. */
> >+		apf_task_wait(current, token);
> >+		break;
> >+	case KVM_PV_REASON_PAGE_READY:
> >+		if (unlikely(token == ~0))
> >+			apf_task_wake_all();
> >+		else
> >+			apf_task_wake(token);
> >+		break;
> >+	}
> >+}
> >+
> >  static void kvm_mmu_op(void *buffer, unsigned len)
> >  {
> >  	int r;
> >@@ -303,13 +461,25 @@ static struct notifier_block __cpuinitdata kvm_cpu_notifier = {
> >  };
> >  #endif
> >
> >+static void __init kvm_apf_trap_init(void)
> >+{
> >+	set_intr_gate(14,&async_page_fault);
> >+}
> 
> Nice!  Zero impact on non-virt.
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 08/12] Inject asynchronous page fault into a guest if page is swapped out.
  2010-08-23 16:17     ` Avi Kivity
@ 2010-08-24  7:52       ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-08-24  7:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Mon, Aug 23, 2010 at 07:17:20PM +0300, Avi Kivity wrote:
>  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> >If guest access swapped out memory do not swap it in from vcpu thread
> >context. Setup slow work to do swapping and send async page fault to
> >a guest.
> >
> >Allow async page fault injection only when guest is in user mode since
> >otherwise guest may be in non-sleepable context and will not be able to
> >reschedule.
> >
> >
> >
> >  struct kvm_arch {
> >@@ -444,6 +446,8 @@ struct kvm_vcpu_stat {
> >  	u32 hypercalls;
> >  	u32 irq_injections;
> >  	u32 nmi_injections;
> >+	u32 apf_not_present;
> >+	u32 apf_present;
> >  };
> 
> Please don't add more stats, instead add tracepoints which can be
> converted to stats by userspace.
> 
> Would be good to have both guest and host tracepoints.
> 
I do have host tracepoints for all events. I still prefer to also have the
kvm stats since they are so much easier to use right now. We can delete
them later when we replace kvm_stat with perf.
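
For what it's worth, the host-side tracepoint for the swap-in attempt is
tiny; roughly (the field layout below is illustrative, not necessarily what
the patch defines):

	TRACE_EVENT(kvm_try_async_get_page,
		TP_PROTO(bool async, u64 pfn),
		TP_ARGS(async, pfn),
		TP_STRUCT__entry(
			__field(bool, async)
			__field(u64, pfn)
		),
		TP_fast_assign(
			__entry->async = async;
			__entry->pfn = pfn;
		),
		TP_printk("async %d pfn %#llx", __entry->async, __entry->pfn)
	);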

> >@@ -2345,6 +2346,21 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
> >  			     error_code&  PFERR_WRITE_MASK, gfn);
> >  }
> >
> >+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
> >+{
> >+	struct kvm_arch_async_pf arch;
> >+	arch.token = (vcpu->arch.async_pf_id++<<  12) | vcpu->vcpu_id;
> >+	return kvm_setup_async_pf(vcpu, gva, gfn,&arch);
> >+}
> 
> Ok.  so token is globally unique.  We're limited to 4096 vcpus, I
> guess that's fine.  Wraparound at 1M faults/vcpu, what's the impact?
> failure if we have 1M faulting processes on one vcpu?
Only with 1M faulting processes on one vcpu simultaneously. And we limit the
number of outstanding apfs anyway.

> 
> I guess that's fine too.
> 
> >+
> >+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> >+{
> >+	if (!vcpu->arch.apf_data || kvm_event_needs_reinjection(vcpu))
> >+		return false;
> >+
> >+	return !!kvm_x86_ops->get_cpl(vcpu);
> 
> !! !needed, bool autoconverts.  But > 0 is more readable.
OK.
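
I.e. the check simply becomes

	return kvm_x86_ops->get_cpl(vcpu) > 0;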

> 
> >+}
> >+
> >  static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
> >  				u32 error_code)
> >  {
> >@@ -2353,6 +2369,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
> >  	int level;
> >  	gfn_t gfn = gpa>>  PAGE_SHIFT;
> >  	unsigned long mmu_seq;
> >+	bool async;
> >
> >  	ASSERT(vcpu);
> >  	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
> >@@ -2367,7 +2384,23 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
> >
> >  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
> >  	smp_rmb();
> >-	pfn = gfn_to_pfn(vcpu->kvm, gfn);
> >+
> >+	if (can_do_async_pf(vcpu)) {
> >+		pfn = gfn_to_pfn_async(vcpu->kvm, gfn,&async);
> >+		trace_kvm_try_async_get_page(async, pfn);
> >+	} else {
> >+do_sync:
> >+		async = false;
> >+		pfn = gfn_to_pfn(vcpu->kvm, gfn);
> >+	}
> >+
> >+	if (async) {
> >+		if (!kvm_arch_setup_async_pf(vcpu, gpa, gfn))
> >+			goto do_sync;
> >+		return 0;
> >+	}
> >+
> 
> This goto is pretty ugly.  How about:
> 
>     async = false;
>     if (can_do_async_pf(&async)) {
>     }
>     if (async && !setup())
>           async = false;
>     if (async)
>           ...
> 
> or something.
> 
Will try to re-factor.

> >@@ -459,7 +460,21 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
> >
> >  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
> >  	smp_rmb();
> >-	pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
> >+
> >+	if (can_do_async_pf(vcpu)) {
> >+		pfn = gfn_to_pfn_async(vcpu->kvm, walker.gfn,&async);
> >+		trace_kvm_try_async_get_page(async, pfn);
> >+	} else {
> >+do_sync:
> >+		async = false;
> >+		pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
> >+	}
> >+
> >+	if (async) {
> >+		if (!kvm_arch_setup_async_pf(vcpu, addr, walker.gfn))
> >+			goto do_sync;
> >+		return 0;
> >+	}
> >
> 
> This repetition is ugly too.
> 
Yeah. Maybe it can be moved to a separate function, which will also help
get rid of the goto.
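
Something along these lines, perhaps (untested sketch, helper name invented):

	/* returns true if the fault was handed off to async PF */
	static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
				 pfn_t *pfn)
	{
		bool async = false;

		if (can_do_async_pf(vcpu)) {
			*pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);
			trace_kvm_try_async_get_page(async, *pfn);
			if (!async)
				return false;	/* page already resident */
			if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
				return true;
		}
		/* no async PF possible, or setup failed: fault synchronously */
		*pfn = gfn_to_pfn(vcpu->kvm, gfn);
		return false;
	}

Both tdp_page_fault() and FNAME(page_fault) could then do
"if (try_async_pf(...)) return 0;" and the goto disappears.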

> >
> >+static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
> >+{
> >+	if (unlikely(vcpu->arch.apf_memslot_ver !=
> >+		     vcpu->kvm->memslot_version)) {
> >+		u64 gpa = vcpu->arch.apf_msr_val&  ~0x3f;
> >+		unsigned long addr;
> >+		int offset = offset_in_page(gpa);
> >+
> >+		addr = gfn_to_hva(vcpu->kvm, gpa>>  PAGE_SHIFT);
> >+		vcpu->arch.apf_data = (u32 __user *)(addr + offset);
> >+		if (kvm_is_error_hva(addr)) {
> >+			vcpu->arch.apf_data = NULL;
> >+			return -EFAULT;
> >+		}
> >+	}
> >+
> >+	return put_user(val, vcpu->arch.apf_data);
> >+}
> 
> This nice cache needs to be outside apf to reduce complexity for
> reviewers and since it is useful for others.
> 
> Would be good to have memslot-cached kvm_put_guest() and kvm_get_guest().
Will look into it.
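
Roughly something like this, I guess (sketch only, name invented; the
memslot-version cache could then live inside it):

	static int kvm_put_guest_u32(struct kvm *kvm, gpa_t gpa, u32 val)
	{
		unsigned long addr = gfn_to_hva(kvm, gpa >> PAGE_SHIFT);

		if (kvm_is_error_hva(addr))
			return -EFAULT;
		return put_user(val, (u32 __user *)(addr + offset_in_page(gpa)));
	}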

> 
> >diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >index 292514c..f56e8ac 100644
> >--- a/virt/kvm/kvm_main.c
> >+++ b/virt/kvm/kvm_main.c
> >@@ -78,6 +78,11 @@ static atomic_t hardware_enable_failed;
> >  struct kmem_cache *kvm_vcpu_cache;
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
> >
> >+#ifdef CONFIG_KVM_ASYNC_PF
> >+#define ASYNC_PF_PER_VCPU 100
> >+static struct kmem_cache *async_pf_cache;
> >+#endif
> 
> All those #ifdefs can be eliminated with virt/kvm/apf.[ch].
> 
OK.

> >+
> >  static __read_mostly struct preempt_ops kvm_preempt_ops;
> >
> >  struct dentry *kvm_debugfs_dir;
> >@@ -186,6 +191,11 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> >  	vcpu->kvm = kvm;
> >  	vcpu->vcpu_id = id;
> >  	init_waitqueue_head(&vcpu->wq);
> >+#ifdef CONFIG_KVM_ASYNC_PF
> >+	INIT_LIST_HEAD(&vcpu->async_pf_done);
> >+	INIT_LIST_HEAD(&vcpu->async_pf_queue);
> >+	spin_lock_init(&vcpu->async_pf_lock);
> >+#endif
> 
>  kvm_apf_init() etc.
> 
> >+		       struct kvm_arch_async_pf *arch)
> >+{
> >+	struct kvm_async_pf *work;
> >+
> >+	if (vcpu->async_pf_queued>= ASYNC_PF_PER_VCPU)
> >+		return 0;
> 
> 100 == too high.  At 16 vcpus, this allows 1600 kernel threads to
> wait for I/O.
The number of kernel threads is limited by other means. The slow work
subsystem has its own knobs to tune that. Here we limit how many slow work
items can be queued per vcpu.

> 
> Would have been best if we could ask for a page to be paged in
> asynchronously.
> 
You mean to have a core kernel facility for that? I agree it would be
nice, but that is much harder.

> >+
> >+	/* setup slow work */
> >+
> >+	/* do alloc nowait since if we are going to sleep anyway we
> >+	   may as well sleep faulting in page */
> >+	work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
> >+	if (!work)
> >+		return 0;
> >+
> >+	atomic_set(&work->used, 1);
> >+	work->page = NULL;
> >+	work->vcpu = vcpu;
> >+	work->gva = gva;
> >+	work->addr = gfn_to_hva(vcpu->kvm, gfn);
> >+	work->arch = *arch;
> >+	work->mm = current->mm;
> >+	atomic_inc(&work->mm->mm_count);
> >+	kvm_get_kvm(work->vcpu->kvm);
> >+
> >+	/* this can't really happen otherwise gfn_to_pfn_async
> >+	   would succeed */
> >+	if (unlikely(kvm_is_error_hva(work->addr)))
> >+		goto retry_sync;
> >+
> >+	slow_work_init(&work->work,&async_pf_ops);
> >+	if (slow_work_enqueue(&work->work) != 0)
> >+		goto retry_sync;
> >+
> >+	vcpu->async_pf_work = work;
> >+	list_add_tail(&work->queue,&vcpu->async_pf_queue);
> >+	vcpu->async_pf_queued++;
> >+	return 1;
> >+retry_sync:
> >+	kvm_put_kvm(work->vcpu->kvm);
> >+	mmdrop(work->mm);
> >+	kmem_cache_free(async_pf_cache, work);
> >+	return 0;
> >+}
> >+
> >+
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 04/12] Provide special async page fault handler when async PF capability is detected
  2010-08-24  7:31       ` Gleb Natapov
@ 2010-08-24  9:02         ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-24  9:02 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 08/24/2010 10:31 AM, Gleb Natapov wrote:
> +
>>> +static void apf_task_wait(struct task_struct *tsk, u32 token)
>>> +{
>>> +	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>>> +	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
>>> +	struct kvm_task_sleep_node n, *e;
>>> +	DEFINE_WAIT(wait);
>>> +
>>> +	spin_lock(&b->lock);
>>> +	e = _find_apf_task(b, token);
>>> +	if (e) {
>>> +		/* dummy entry exist ->   wake up was delivered ahead of PF */
>>> +		hlist_del(&e->link);
>>> +		kfree(e);
>>> +		spin_unlock(&b->lock);
>>> +		return;
>>> +	}
>>> +
>>> +	n.token = token;
>>> +	n.cpu = smp_processor_id();
>> What's the meaning of cpu?  Won't the waiter migrate to other cpus?
> Waiter cannot migrate to other cpu since it is sleeping. It may be
> scheduled to run on any cpu when it will be waked.

What if you have a spurious wakeup?  Also, nothing prevents the 
scheduler from migrating the thread even if it is sleeping.  It may not 
do so now, but it might do it in the future.

Oh, it probably does now on cpu hotunplug.

Why do you need n.cpu?


>>> +			spin_unlock(&b->lock);
>>> +			cpu_relax();
>>> +			goto again;
>>> +		}
>> The other cpu might be waiting for us to yield.  We can fix it later
>> with the pv spinlock infrastructure.
>>
> This busy wait happens only if (very small) allocation fails, so if
> a guest ever hits this code path I expect it to be on his way to die
> anyway.

Hm.  I don't have a good feel for how rare atomic allocation failures are
on common workloads.

Note a kmem_cache for apfs will make failures even more rare.
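
Setting one up is cheap, something like (illustrative, cache name invented):

	static struct kmem_cache *apf_node_cache;

	/* in kvm_guest_init() or similar: */
	apf_node_cache = kmem_cache_create("kvm_apf_node",
					   sizeof(struct kvm_task_sleep_node),
					   0, 0, NULL);

and apf_task_wake() then allocates with
kmem_cache_alloc(apf_node_cache, GFP_ATOMIC) instead of kmalloc().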

>> Or, we can avoid the allocation.  If at most one apf can be pending
>> (is this true?), we can use a per-cpu variable for this dummy entry.
>>
> We can have may outstanding apfs.

But, while we're processing an apf, we can't take any more.

So we can have a buffer of one pre-allocated entry per cpu, and do 
something like:

apf:
   disable apf for this cpu
   handle apf using buffered entry
   enable interrupts
   allocate new entry
   buffer it
   enable apf for that cpu

This trades a bigger apf-disabled window for not busy looping.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 08/12] Inject asynchronous page fault into a guest if page is swapped out.
  2010-08-24  7:52       ` Gleb Natapov
@ 2010-08-24  9:04         ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-24  9:04 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 08/24/2010 10:52 AM, Gleb Natapov wrote:
>> This nice cache needs to be outside apf to reduce complexity for
>> reviewers and since it is useful for others.
>>
>> Would be good to have memslot-cached kvm_put_guest() and kvm_get_guest().
> Will look into it.

In the meantime, you can just drop the caching.
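
I.e. recompute the hva on every write, something like (untested):

	static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
	{
		gpa_t gpa = vcpu->arch.apf_msr_val & ~0x3fULL;
		unsigned long addr = gfn_to_hva(vcpu->kvm, gpa >> PAGE_SHIFT);

		if (kvm_is_error_hva(addr))
			return -EFAULT;
		return put_user(val, (u32 __user *)(addr + offset_in_page(gpa)));
	}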


>>> +		       struct kvm_arch_async_pf *arch)
>>> +{
>>> +	struct kvm_async_pf *work;
>>> +
>>> +	if (vcpu->async_pf_queued>= ASYNC_PF_PER_VCPU)
>>> +		return 0;
>> 100 == too high.  At 16 vcpus, this allows 1600 kernel threads to
>> wait for I/O.
> Number of kernel threads are limited by other means. Slow work subsystem
> has its own knobs to tune that. Here we limit how much slow work items
> can be queued per vcpu.

OK.

>> Would have been best if we could ask for a page to be paged in
>> asynchronously.
>>
> You mean to have core kernel facility for that? I agree it would be
> nice, but much harder.

Yes, that's what I meant.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 09/12] Retry fault before vmentry
  2010-07-19 15:30   ` Gleb Natapov
@ 2010-08-24  9:25     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-24  9:25 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> When page is swapped in it is mapped into guest memory only after guest
> tries to access it again and generate another fault. To save this fault
> we can map it immediately since we know that guest is going to access
> the page.
>
>
>
> -static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
> -				u32 error_code)
> +static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
> +			  bool sync)

'sync' means something else in the shadow mmu.  Please rename to 
something longer, maybe 'apf_completion'.

Alternatively, split it into two functions: a base function that doesn't do
apf and a wrapper that handles apf.

> @@ -505,6 +506,37 @@ out_unlock:
>   	return 0;
>   }
>
> +static int FNAME(page_fault_other_cr3)(struct kvm_vcpu *vcpu, gpa_t cr3,
> +				       gva_t addr, u32 error_code)
> +{
> +	int r = 0;
> +	gpa_t curr_cr3 = vcpu->arch.cr3;
> +
> +	if (curr_cr3 != cr3) {
> +		/*
> +		 * We do page fault on behalf of a process that is sleeping
> +		 * because of async PF. PV guest takes reference to mm that cr3
> +		 * belongs to, so it has to be valid here.
> +		 */
> +		kvm_set_cr3(vcpu, cr3);
> +		if (kvm_mmu_reload(vcpu))
> +			goto switch_cr3;
> +	}

With nested virtualization, we need to switch cr0, cr4, and efer as well...

> +
> +	r = FNAME(page_fault)(vcpu, addr, error_code, true);
> +
> +	if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
> +		kvm_mmu_sync_roots(vcpu);

Why is this needed?

> +
> +switch_cr3:
> +	if (curr_cr3 != vcpu->arch.cr3) {
> +		kvm_set_cr3(vcpu, curr_cr3);
> +		kvm_mmu_reload(vcpu);
> +	}
> +
> +	return r;
> +}

This has the nasty effect of flushing the TLB on AMD.

> +
>   static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>   {
>   	struct kvm_shadow_walk_iterator iterator;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2603cc4..5482db0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5743,6 +5743,15 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>   }
>   EXPORT_SYMBOL_GPL(kvm_set_rflags);
>
> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
> +			       struct kvm_async_pf *work)
> +{
> +	if (!vcpu->arch.mmu.page_fault_other_cr3 || is_error_page(work->page))
> +		return;
> +	vcpu->arch.mmu.page_fault_other_cr3(vcpu, work->arch.cr3, work->gva,
> +					    work->arch.error_code);
> +}
> +
>   static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
>   {
>   	if (unlikely(vcpu->arch.apf_memslot_ver !=
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f56e8ac..de1d5b6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1348,6 +1348,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
>   			spin_lock(&vcpu->async_pf_lock);
>   			list_del(&work->link);
>   			spin_unlock(&vcpu->async_pf_lock);
> +			kvm_arch_async_page_ready(vcpu, work);
>   			put_page(work->page);
>   			async_pf_work_free(work);
>   			list_del(&work->queue);
> @@ -1366,6 +1367,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
>   	list_del(&work->queue);
>   	vcpu->async_pf_queued--;
>
> +	kvm_arch_async_page_ready(vcpu, work);
>   	kvm_arch_inject_async_page_present(vcpu, work);
>
>   	put_page(work->page);


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 10/12] Handle async PF in non preemptable context
  2010-07-19 15:31   ` Gleb Natapov
@ 2010-08-24  9:30     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-24  9:30 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 07/19/2010 06:31 PM, Gleb Natapov wrote:
> If async page fault is received by idle task or when preemp_count is
> not zero guest cannot reschedule, so do sti; hlt and wait for page to be
> ready. vcpu can still process interrupts while it waits for the page to
> be ready.
>
> Acked-by: Rik van Riel<riel@redhat.com>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>
> ---
>   arch/x86/kernel/kvm.c |   36 ++++++++++++++++++++++++++++++++----
>   1 files changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index a6db92e..914b0fc 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -37,6 +37,7 @@
>   #include<asm/cpu.h>
>   #include<asm/traps.h>
>   #include<asm/desc.h>
> +#include<asm/tlbflush.h>
>
>   #define MMU_QUEUE_SIZE 1024
>
> @@ -68,6 +69,8 @@ struct kvm_task_sleep_node {
>   	wait_queue_head_t wq;
>   	u32 token;
>   	int cpu;
> +	bool halted;
> +	struct mm_struct *mm;
>   };
>
>   static struct kvm_task_sleep_head {
> @@ -96,6 +99,11 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
>   	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
>   	struct kvm_task_sleep_node n, *e;
>   	DEFINE_WAIT(wait);
> +	int cpu, idle;
> +
> +	cpu = get_cpu();
> +	idle = idle_cpu(cpu);
> +	put_cpu();
>
>   	spin_lock(&b->lock);
>   	e = _find_apf_task(b, token);
> @@ -109,17 +117,31 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
>
>   	n.token = token;
>   	n.cpu = smp_processor_id();
> +	n.mm = current->active_mm;
> +	n.halted = idle || preempt_count()>  1;
> +	atomic_inc(&n.mm->mm_count);
>   	init_waitqueue_head(&n.wq);
>   	hlist_add_head(&n.link,&b->list);
>   	spin_unlock(&b->lock);
>
>   	for (;;) {
> -		prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
> +		if (!n.halted)
> +			prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
>   		if (hlist_unhashed(&n.link))
>   			break;
> -		schedule();
> +
> +		if (!n.halted) {
> +			schedule();
> +		} else {
> +			/*
> +			 * We cannot reschedule. So halt.
> +			 */

If we get the wakeup here, we'll halt and never wake up again.

> +			native_safe_halt();
> +			local_irq_disable();

So we need a local_irq_disable() before the hlist_unhashed() check.

> +		}
>   	}
> -	finish_wait(&n.wq,&wait);
> +	if (!n.halted)
> +		finish_wait(&n.wq,&wait);
>
>   	return;
>   }
> @@ -127,7 +149,12 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
>   static void apf_task_wake_one(struct kvm_task_sleep_node *n)
>   {
>   	hlist_del_init(&n->link);
> -	if (waitqueue_active(&n->wq))
> +	if (!n->mm)
> +		return;
> +	mmdrop(n->mm);
> +	if (n->halted)
> +		smp_send_reschedule(n->cpu);
> +	else if (waitqueue_active(&n->wq))
>   		wake_up(&n->wq);
>   }
>
> @@ -157,6 +184,7 @@ again:
>   		}
>   		n->token = token;
>   		n->cpu = smp_processor_id();
> +		n->mm = NULL;
>   		init_waitqueue_head(&n->wq);
>   		hlist_add_head(&n->link,&b->list);
>   	} else


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 11/12] Let host know whether the guest can handle async PF in non-userspace context.
  2010-07-19 15:31   ` Gleb Natapov
@ 2010-08-24  9:31     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-24  9:31 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 07/19/2010 06:31 PM, Gleb Natapov wrote:
> If guest can detect that it runs in non-preemptable context it can
> handle async PFs at any time, so let host know that it can send async
> PF even if guest cpu is not in userspace.
>
> Acked-by: Rik van Riel<riel@redhat.com>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>
> ---
>   arch/x86/include/asm/kvm_host.h |    1 +
>   arch/x86/include/asm/kvm_para.h |    1 +
>   arch/x86/kernel/kvm.c           |    3 +++
>   arch/x86/kvm/x86.c              |    5 +++--
>   4 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 45e6c12..c675d5d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -367,6 +367,7 @@ struct kvm_vcpu_arch {
>   	cpumask_var_t wbinvd_dirty_mask;
>
>   	u32 __user *apf_data;
> +	bool apf_send_user_only;
>   	u32 apf_memslot_ver;
>   	u64 apf_msr_val;
>   	u32 async_pf_id;

Lots of apf stuff in here.  Make it apf.data etc.?
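
I.e. group them into one struct, something like (sketch, reusing the
existing field names):

	struct {
		u32 __user *data;
		bool send_user_only;
		u32 memslot_ver;
		u64 msr_val;
		u32 id;		/* async_pf_id */
	} apf;

so the users become vcpu->arch.apf.data, vcpu->arch.apf.msr_val and so on.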

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 09/12] Retry fault before vmentry
  2010-08-24  9:25     ` Avi Kivity
@ 2010-08-24  9:33       ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-08-24  9:33 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Tue, Aug 24, 2010 at 12:25:33PM +0300, Avi Kivity wrote:
>  On 07/19/2010 06:30 PM, Gleb Natapov wrote:
> >When page is swapped in it is mapped into guest memory only after guest
> >tries to access it again and generate another fault. To save this fault
> >we can map it immediately since we know that guest is going to access
> >the page.
> >
> >
> >
> >-static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
> >-				u32 error_code)
> >+static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
> >+			  bool sync)
> 
> 'sync' means something else in the shadow mmu.  Please rename to
> something longer, maybe 'apf_completion'.
> 
> Alternatively, split to two functions, a base function that doesn't
> do apf and a wrapper that handles apf.
> 
Will rename to something else.

> >@@ -505,6 +506,37 @@ out_unlock:
> >  	return 0;
> >  }
> >
> >+static int FNAME(page_fault_other_cr3)(struct kvm_vcpu *vcpu, gpa_t cr3,
> >+				       gva_t addr, u32 error_code)
> >+{
> >+	int r = 0;
> >+	gpa_t curr_cr3 = vcpu->arch.cr3;
> >+
> >+	if (curr_cr3 != cr3) {
> >+		/*
> >+		 * We do page fault on behalf of a process that is sleeping
> >+		 * because of async PF. PV guest takes reference to mm that cr3
> >+		 * belongs too, so it has to be valid here.
> >+		 */
> >+		kvm_set_cr3(vcpu, cr3);
> >+		if (kvm_mmu_reload(vcpu))
> >+			goto switch_cr3;
> >+	}
> 
> With nested virtualization, we need to switch cr0, cr4, and efer as well...
> 
On SVM or VMX or both?

> >+
> >+	r = FNAME(page_fault)(vcpu, addr, error_code, true);
> >+
> >+	if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
> >+		kvm_mmu_sync_roots(vcpu);
> 
> Why is this needed?
> 
http://www.mail-archive.com/kvm@vger.kernel.org/msg37827.html

 KVM_REQ_MMU_SYNC request generated here must be processed before
 switching to a different cr3 (otherwise vcpu_enter_guest will process it 
 with the wrong cr3 in place).


> >+
> >+switch_cr3:
> >+	if (curr_cr3 != vcpu->arch.cr3) {
> >+		kvm_set_cr3(vcpu, curr_cr3);
> >+		kvm_mmu_reload(vcpu);
> >+	}
> >+
> >+	return r;
> >+}
> 
> This has the nasty effect of flushing the TLB on AMD.
> 
Which is more expensive: reentering the guest and handling one more fault, or
flushing the TLB here?

> >+
> >  static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
> >  {
> >  	struct kvm_shadow_walk_iterator iterator;
> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >index 2603cc4..5482db0 100644
> >--- a/arch/x86/kvm/x86.c
> >+++ b/arch/x86/kvm/x86.c
> >@@ -5743,6 +5743,15 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_set_rflags);
> >
> >+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
> >+			       struct kvm_async_pf *work)
> >+{
> >+	if (!vcpu->arch.mmu.page_fault_other_cr3 || is_error_page(work->page))
> >+		return;
> >+	vcpu->arch.mmu.page_fault_other_cr3(vcpu, work->arch.cr3, work->gva,
> >+					    work->arch.error_code);
> >+}
> >+
> >  static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
> >  {
> >  	if (unlikely(vcpu->arch.apf_memslot_ver !=
> >diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >index f56e8ac..de1d5b6 100644
> >--- a/virt/kvm/kvm_main.c
> >+++ b/virt/kvm/kvm_main.c
> >@@ -1348,6 +1348,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
> >  			spin_lock(&vcpu->async_pf_lock);
> >  			list_del(&work->link);
> >  			spin_unlock(&vcpu->async_pf_lock);
> >+			kvm_arch_async_page_ready(vcpu, work);
> >  			put_page(work->page);
> >  			async_pf_work_free(work);
> >  			list_del(&work->queue);
> >@@ -1366,6 +1367,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
> >  	list_del(&work->queue);
> >  	vcpu->async_pf_queued--;
> >
> >+	kvm_arch_async_page_ready(vcpu, work);
> >  	kvm_arch_inject_async_page_present(vcpu, work);
> >
> >  	put_page(work->page);
> 
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 12/12] Send async PF when guest is not in userspace too.
  2010-07-19 15:31   ` Gleb Natapov
@ 2010-08-24  9:36     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-24  9:36 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 07/19/2010 06:31 PM, Gleb Natapov wrote:
> If guest indicates that it can handle async pf in kernel mode too send
> it, but only if interrupt are enabled.
>
> Reviewed-by: Rik van Riel<riel@redhat.com>
> Signed-off-by: Gleb Natapov<gleb@redhat.com>
> ---
>   arch/x86/kvm/mmu.c |    8 +++++++-
>   1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 12d1a7b..ed87b1c 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2361,7 +2361,13 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
>   	if (!vcpu->arch.apf_data || kvm_event_needs_reinjection(vcpu))
>   		return false;
>
> -	return !!kvm_x86_ops->get_cpl(vcpu);
> +	if (vcpu->arch.apf_send_user_only)
> +		return !!kvm_x86_ops->get_cpl(vcpu);

cpl is not a bool.  Compare it with 0.
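
I.e. something like (sketch):

	if (vcpu->arch.apf_send_user_only)
		return kvm_x86_ops->get_cpl(vcpu) != 0;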

> +
> +	if (!kvm_x86_ops->interrupt_allowed(vcpu))
> +		return false;
> +
> +	return true;
>   }

Should have commented before, but get_cpl() is not accurate when doing 
nested virtualization.  When L1 intercepts page faults, being in L2 is 
equivalent to CPL 3.  But we need to get the apf information to L1 somehow.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 10/12] Handle async PF in non preemptable context
  2010-08-24  9:30     ` Avi Kivity
@ 2010-08-24  9:36       ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-08-24  9:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Tue, Aug 24, 2010 at 12:30:25PM +0300, Avi Kivity wrote:
>  On 07/19/2010 06:31 PM, Gleb Natapov wrote:
> >If async page fault is received by idle task or when preemp_count is
> >not zero guest cannot reschedule, so do sti; hlt and wait for page to be
> >ready. vcpu can still process interrupts while it waits for the page to
> >be ready.
> >
> >Acked-by: Rik van Riel<riel@redhat.com>
> >Signed-off-by: Gleb Natapov<gleb@redhat.com>
> >---
> >  arch/x86/kernel/kvm.c |   36 ++++++++++++++++++++++++++++++++----
> >  1 files changed, 32 insertions(+), 4 deletions(-)
> >
> >diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> >index a6db92e..914b0fc 100644
> >--- a/arch/x86/kernel/kvm.c
> >+++ b/arch/x86/kernel/kvm.c
> >@@ -37,6 +37,7 @@
> >  #include<asm/cpu.h>
> >  #include<asm/traps.h>
> >  #include<asm/desc.h>
> >+#include<asm/tlbflush.h>
> >
> >  #define MMU_QUEUE_SIZE 1024
> >
> >@@ -68,6 +69,8 @@ struct kvm_task_sleep_node {
> >  	wait_queue_head_t wq;
> >  	u32 token;
> >  	int cpu;
> >+	bool halted;
> >+	struct mm_struct *mm;
> >  };
> >
> >  static struct kvm_task_sleep_head {
> >@@ -96,6 +99,11 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
> >  	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
> >  	struct kvm_task_sleep_node n, *e;
> >  	DEFINE_WAIT(wait);
> >+	int cpu, idle;
> >+
> >+	cpu = get_cpu();
> >+	idle = idle_cpu(cpu);
> >+	put_cpu();
> >
> >  	spin_lock(&b->lock);
> >  	e = _find_apf_task(b, token);
> >@@ -109,17 +117,31 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
> >
> >  	n.token = token;
> >  	n.cpu = smp_processor_id();
> >+	n.mm = current->active_mm;
> >+	n.halted = idle || preempt_count()>  1;
> >+	atomic_inc(&n.mm->mm_count);
> >  	init_waitqueue_head(&n.wq);
> >  	hlist_add_head(&n.link,&b->list);
> >  	spin_unlock(&b->lock);
> >
> >  	for (;;) {
> >-		prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
> >+		if (!n.halted)
> >+			prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
> >  		if (hlist_unhashed(&n.link))
> >  			break;
> >-		schedule();
> >+
> >+		if (!n.halted) {
> >+			schedule();
> >+		} else {
> >+			/*
> >+			 * We cannot reschedule. So halt.
> >+			 */
> 
> If we get the wakeup here, we'll halt and never wake up again.
> 
We will not. IRQs are disabled here. native_safe_halt() enables them.

> >+			native_safe_halt();
> >+			local_irq_disable();
> 
> So we need a local_irq_disable() before the hlish_unhashed() check.
We are still in the exception handler, so IRQs should be off.
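
For reference, native_safe_halt() on x86 is essentially sti; hlt, e.g.
(simplified sketch, not the actual kernel code):

	static inline void safe_halt_sketch(void)
	{
		/*
		 * STI keeps interrupts blocked until after the following
		 * instruction, so a wakeup interrupt that arrives between the
		 * hlist_unhashed() check and here is delivered only once HLT
		 * executes, and HLT then returns immediately.  No wakeup can
		 * be lost in between.
		 */
		asm volatile("sti; hlt" : : : "memory");
	}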

> 
> >+		}
> >  	}
> >-	finish_wait(&n.wq,&wait);
> >+	if (!n.halted)
> >+		finish_wait(&n.wq,&wait);
> >
> >  	return;
> >  }
> >@@ -127,7 +149,12 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
> >  static void apf_task_wake_one(struct kvm_task_sleep_node *n)
> >  {
> >  	hlist_del_init(&n->link);
> >-	if (waitqueue_active(&n->wq))
> >+	if (!n->mm)
> >+		return;
> >+	mmdrop(n->mm);
> >+	if (n->halted)
> >+		smp_send_reschedule(n->cpu);
> >+	else if (waitqueue_active(&n->wq))
> >  		wake_up(&n->wq);
> >  }
> >
> >@@ -157,6 +184,7 @@ again:
> >  		}
> >  		n->token = token;
> >  		n->cpu = smp_processor_id();
> >+		n->mm = NULL;
> >  		init_waitqueue_head(&n->wq);
> >  		hlist_add_head(&n->link,&b->list);
> >  	} else
> 
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 09/12] Retry fault before vmentry
  2010-08-24  9:33       ` Gleb Natapov
@ 2010-08-24  9:38         ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-24  9:38 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 08/24/2010 12:33 PM, Gleb Natapov wrote:
>
>
>>> @@ -505,6 +506,37 @@ out_unlock:
>>>   	return 0;
>>>   }
>>>
>>> +static int FNAME(page_fault_other_cr3)(struct kvm_vcpu *vcpu, gpa_t cr3,
>>> +				       gva_t addr, u32 error_code)
>>> +{
>>> +	int r = 0;
>>> +	gpa_t curr_cr3 = vcpu->arch.cr3;
>>> +
>>> +	if (curr_cr3 != cr3) {
>>> +		/*
>>> +		 * We do page fault on behalf of a process that is sleeping
>>> +		 * because of async PF. PV guest takes reference to mm that cr3
>>> +		 * belongs too, so it has to be valid here.
>>> +		 */
>>> +		kvm_set_cr3(vcpu, cr3);
>>> +		if (kvm_mmu_reload(vcpu))
>>> +			goto switch_cr3;
>>> +	}
>> With nested virtualization, we need to switch cr0, cr4, and efer as well...
>>
> On SVM or VMX or both?

Both.  Let's defer this patch since it's an optimization; this is really
complicated.

>>> +
>>> +	r = FNAME(page_fault)(vcpu, addr, error_code, true);
>>> +
>>> +	if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
>>> +		kvm_mmu_sync_roots(vcpu);
>> Why is this needed?
>>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg37827.html
>
>   KVM_REQ_MMU_SYNC request generated here must be processed before
>   switching to a different cr3 (otherwise vcpu_enter_guest will process it
>   with the wrong cr3 in place).

Ah, it should be part of the cr3 switch block above.

>>> +
>>> +switch_cr3:
>>> +	if (curr_cr3 != vcpu->arch.cr3) {
>>> +		kvm_set_cr3(vcpu, curr_cr3);
>>> +		kvm_mmu_reload(vcpu);
>>> +	}
>>> +
>>> +	return r;
>>> +}
>> This has the nasty effect of flushing the TLB on AMD.
>>
> What is more expansive reenter the guest and handle one more fault, or
> flash TLB here?

No idea.  Probably the reentry.  On Intel the TLB is flushed anyway.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 10/12] Handle async PF in non preemptable context
  2010-08-24  9:36       ` Gleb Natapov
@ 2010-08-24  9:46         ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-24  9:46 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 08/24/2010 12:36 PM, Gleb Natapov wrote:
> On Tue, Aug 24, 2010 at 12:30:25PM +0300, Avi Kivity wrote:
>>   On 07/19/2010 06:31 PM, Gleb Natapov wrote:
>>> If async page fault is received by idle task or when preemp_count is
>>> not zero guest cannot reschedule, so do sti; hlt and wait for page to be
>>> ready. vcpu can still process interrupts while it waits for the page to
>>> be ready.
>>>
>>> Acked-by: Rik van Riel<riel@redhat.com>
>>> Signed-off-by: Gleb Natapov<gleb@redhat.com>
>>> ---
>>>   arch/x86/kernel/kvm.c |   36 ++++++++++++++++++++++++++++++++----
>>>   1 files changed, 32 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>>> index a6db92e..914b0fc 100644
>>> --- a/arch/x86/kernel/kvm.c
>>> +++ b/arch/x86/kernel/kvm.c
>>> @@ -37,6 +37,7 @@
>>>   #include<asm/cpu.h>
>>>   #include<asm/traps.h>
>>>   #include<asm/desc.h>
>>> +#include<asm/tlbflush.h>
>>>
>>>   #define MMU_QUEUE_SIZE 1024
>>>
>>> @@ -68,6 +69,8 @@ struct kvm_task_sleep_node {
>>>   	wait_queue_head_t wq;
>>>   	u32 token;
>>>   	int cpu;
>>> +	bool halted;
>>> +	struct mm_struct *mm;
>>>   };
>>>
>>>   static struct kvm_task_sleep_head {
>>> @@ -96,6 +99,11 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
>>>   	struct kvm_task_sleep_head *b =&async_pf_sleepers[key];
>>>   	struct kvm_task_sleep_node n, *e;
>>>   	DEFINE_WAIT(wait);
>>> +	int cpu, idle;
>>> +
>>> +	cpu = get_cpu();
>>> +	idle = idle_cpu(cpu);
>>> +	put_cpu();
>>>
>>>   	spin_lock(&b->lock);
>>>   	e = _find_apf_task(b, token);
>>> @@ -109,17 +117,31 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
>>>
>>>   	n.token = token;
>>>   	n.cpu = smp_processor_id();
>>> +	n.mm = current->active_mm;
>>> +	n.halted = idle || preempt_count()>   1;
>>> +	atomic_inc(&n.mm->mm_count);
>>>   	init_waitqueue_head(&n.wq);
>>>   	hlist_add_head(&n.link,&b->list);
>>>   	spin_unlock(&b->lock);
>>>
>>>   	for (;;) {
>>> -		prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
>>> +		if (!n.halted)
>>> +			prepare_to_wait(&n.wq,&wait, TASK_UNINTERRUPTIBLE);
>>>   		if (hlist_unhashed(&n.link))
>>>   			break;
>>> -		schedule();
>>> +
>>> +		if (!n.halted) {
>>> +			schedule();
>>> +		} else {
>>> +			/*
>>> +			 * We cannot reschedule. So halt.
>>> +			 */
>> If we get the wakeup here, we'll halt and never wake up again.
>>
> We will not. IRQs are disabled here. native_safe_halt() enables them.

Ok.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 08/12] Inject asynchronous page fault into a guest if page is swapped out.
  2010-08-23 16:17     ` Avi Kivity
@ 2010-08-24 12:28       ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-08-24 12:28 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Mon, Aug 23, 2010 at 07:17:20PM +0300, Avi Kivity wrote:
> >
> >+static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
> >+{
> >+	if (unlikely(vcpu->arch.apf_memslot_ver !=
> >+		     vcpu->kvm->memslot_version)) {
> >+		u64 gpa = vcpu->arch.apf_msr_val&  ~0x3f;
> >+		unsigned long addr;
> >+		int offset = offset_in_page(gpa);
> >+
> >+		addr = gfn_to_hva(vcpu->kvm, gpa>>  PAGE_SHIFT);
> >+		vcpu->arch.apf_data = (u32 __user *)(addr + offset);
> >+		if (kvm_is_error_hva(addr)) {
> >+			vcpu->arch.apf_data = NULL;
> >+			return -EFAULT;
> >+		}
> >+	}
> >+
> >+	return put_user(val, vcpu->arch.apf_data);
> >+}
> 
> This nice cache needs to be outside apf to reduce complexity for
> reviewers and since it is useful for others.
> 
> Would be good to have memslot-cached kvm_put_guest() and kvm_get_guest().
> 
Something like this? (only compile tested)


diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c13cc48..9aa3dd2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -168,10 +168,18 @@ struct kvm_irq_routing_table {};
 
 struct kvm_memslots {
 	int nmemslots;
+	u32 generation;
 	struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
 					KVM_PRIVATE_MEM_SLOTS];
 };
 
+struct gfn_to_hva_cache {
+	u32 generation;
+	gpa_t gpa;
+	unsigned long hva;
+	struct kvm_memory_slot *memslot;
+};
+
 struct kvm {
 	spinlock_t mmu_lock;
 	raw_spinlock_t requests_lock;
@@ -315,12 +323,16 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
 			 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 		    unsigned long len);
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn);
 
 void kvm_vcpu_block(struct kvm_vcpu *vcpu);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b78b794..512cf9b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -685,6 +685,7 @@ skip_lpage:
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 		if (mem->slot >= slots->nmemslots)
 			slots->nmemslots = mem->slot + 1;
+		slots->generation++;
 		slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
 
 		old_memslots = kvm->memslots;
@@ -721,6 +722,7 @@ skip_lpage:
 	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 	if (mem->slot >= slots->nmemslots)
 		slots->nmemslots = mem->slot + 1;
+	slots->generation++;
 
 	/* actual memory is freed via old in kvm_free_physmem_slot below */
 	if (!npages) {
@@ -1175,6 +1177,36 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 	return 0;
 }
 
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len)
+{
+	int r;
+	gfn_t gfn = ghc->gpa >> PAGE_SHIFT;
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+
+	if (slots->generation != ghc->generation) {
+		int offset = offset_in_page(ghc->gpa);
+
+		ghc->hva = gfn_to_hva(kvm, gfn);
+		if (!kvm_is_error_hva(ghc->hva))
+			ghc->hva += offset;
+		ghc->generation = slots->generation;
+
+		ghc->memslot = gfn_to_memslot(kvm, gfn);
+	}
+	
+	if (kvm_is_error_hva(ghc->hva))
+		return -EFAULT;
+
+	r = copy_to_user((void __user *)ghc->hva, data, len);
+	if (r)
+		return -EFAULT;
+	mark_page_dirty_in_slot(kvm, ghc->memslot, gfn);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
+
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
 {
 	return kvm_write_guest_page(kvm, gfn, empty_zero_page, offset, len);
@@ -1200,11 +1232,9 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn)
 {
-	struct kvm_memory_slot *memslot;
-
-	memslot = gfn_to_memslot(kvm, gfn);
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 
@@ -1212,6 +1242,14 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	}
 }
 
+void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *memslot;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	mark_page_dirty_in_slot(kvm, memslot, gfn);
+}
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
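
With that, apf_put_user() could shrink to something like this (hypothetical
apf_ghc field, assumes ghc->gpa is set up when the guest writes the MSR;
untested):

	static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
	{
		return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf_ghc,
					      &val, sizeof(val));
	}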
--
			Gleb.

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 08/12] Inject asynchronous page fault into a guest if page is swapped out.
@ 2010-08-24 12:28       ` Gleb Natapov
  0 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2010-08-24 12:28 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

On Mon, Aug 23, 2010 at 07:17:20PM +0300, Avi Kivity wrote:
> >
> >+static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
> >+{
> >+	if (unlikely(vcpu->arch.apf_memslot_ver !=
> >+		     vcpu->kvm->memslot_version)) {
> >+		u64 gpa = vcpu->arch.apf_msr_val&  ~0x3f;
> >+		unsigned long addr;
> >+		int offset = offset_in_page(gpa);
> >+
> >+		addr = gfn_to_hva(vcpu->kvm, gpa>>  PAGE_SHIFT);
> >+		vcpu->arch.apf_data = (u32 __user *)(addr + offset);
> >+		if (kvm_is_error_hva(addr)) {
> >+			vcpu->arch.apf_data = NULL;
> >+			return -EFAULT;
> >+		}
> >+	}
> >+
> >+	return put_user(val, vcpu->arch.apf_data);
> >+}
> 
> This nice cache needs to be outside apf to reduce complexity for
> reviewers and since it is useful for others.
> 
> Would be good to have memslot-cached kvm_put_guest() and kvm_get_guest().
> 
Something like this? (only compile tested)


diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c13cc48..9aa3dd2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -168,10 +168,18 @@ struct kvm_irq_routing_table {};
 
 struct kvm_memslots {
 	int nmemslots;
+	u32 generation;
 	struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
 					KVM_PRIVATE_MEM_SLOTS];
 };
 
+struct gfn_to_hva_cache {
+	u32 generation;
+	gpa_t gpa;
+	unsigned long hva;
+	struct kvm_memory_slot *memslot;
+};
+
 struct kvm {
 	spinlock_t mmu_lock;
 	raw_spinlock_t requests_lock;
@@ -315,12 +323,16 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
 			 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 		    unsigned long len);
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn);
 
 void kvm_vcpu_block(struct kvm_vcpu *vcpu);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b78b794..512cf9b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -685,6 +685,7 @@ skip_lpage:
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 		if (mem->slot >= slots->nmemslots)
 			slots->nmemslots = mem->slot + 1;
+		slots->generation++;
 		slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
 
 		old_memslots = kvm->memslots;
@@ -721,6 +722,7 @@ skip_lpage:
 	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 	if (mem->slot >= slots->nmemslots)
 		slots->nmemslots = mem->slot + 1;
+	slots->generation++;
 
 	/* actual memory is freed via old in kvm_free_physmem_slot below */
 	if (!npages) {
@@ -1175,6 +1177,36 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 	return 0;
 }
 
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len)
+{
+	int r;
+	gfn_t gfn = ghc->gpa >> PAGE_SHIFT;
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+
+	if (slots->generation != ghc->generation) {
+		int offset = offset_in_page(ghc->gpa);
+
+		ghc->hva = gfn_to_hva(kvm, gfn);
+		if (!kvm_is_error_hva(ghc->hva))
+			ghc->hva += offset;
+		ghc->generation = slots->generation;
+
+		ghc->memslot = gfn_to_memslot(kvm, gfn);
+	}
+	
+	if (kvm_is_error_hva(ghc->hva))
+		return -EFAULT;
+
+	r = copy_to_user((void __user *)ghc->hva, data, len);
+	if (r)
+		return -EFAULT;
+	mark_page_dirty_in_slot(kvm, ghc->memslot, gfn);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
+
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
 {
 	return kvm_write_guest_page(kvm, gfn, empty_zero_page, offset, len);
@@ -1200,11 +1232,9 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			     gfn_t gfn)
 {
-	struct kvm_memory_slot *memslot;
-
-	memslot = gfn_to_memslot(kvm, gfn);
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 
@@ -1212,6 +1242,14 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	}
 }
 
+void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *memslot;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	mark_page_dirty_in_slot(kvm, memslot, gfn);
+}
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
--
			Gleb.
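
For illustration only (this is not part of the patch, and the apf_* names
below are hypothetical), a caller that repeatedly writes the same guest
page, such as the async PF path, could keep a gfn_to_hva_cache around and
let kvm_write_guest_cached() re-resolve the hva only when the memslot
generation changes:

	/* Hypothetical per-VM cache for a fixed shared guest page. */
	static struct gfn_to_hva_cache apf_cache;

	static void apf_cache_init(struct kvm *kvm, gpa_t gpa)
	{
		apf_cache.gpa = gpa;
		/*
		 * Desync the generation so the first write re-resolves
		 * the gpa -> hva mapping.
		 */
		apf_cache.generation = kvm_memslots(kvm)->generation - 1;
	}

	static int apf_put_val(struct kvm *kvm, u32 val)
	{
		/* Re-resolves only after a memslot update bumped the generation. */
		return kvm_write_guest_cached(kvm, &apf_cache, &val, sizeof(val));
	}

Any memslot change bumps slots->generation and so invalidates the cached
hva; the next cached write then falls back to gfn_to_hva().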

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 08/12] Inject asynchronous page fault into a guest if page is swapped out.
  2010-08-24 12:28       ` Gleb Natapov
@ 2010-08-24 12:33         ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2010-08-24 12:33 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-mm, linux-kernel, mingo, a.p.zijlstra, tglx, hpa,
	riel, cl, mtosatti

  On 08/24/2010 03:28 PM, Gleb Natapov wrote:
> On Mon, Aug 23, 2010 at 07:17:20PM +0300, Avi Kivity wrote:
>>> +static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
>>> +{
>>> +	if (unlikely(vcpu->arch.apf_memslot_ver !=
>>> +		     vcpu->kvm->memslot_version)) {
>>> +	u64 gpa = vcpu->arch.apf_msr_val & ~0x3f;
>>> +		unsigned long addr;
>>> +		int offset = offset_in_page(gpa);
>>> +
>>> +	addr = gfn_to_hva(vcpu->kvm, gpa >> PAGE_SHIFT);
>>> +		vcpu->arch.apf_data = (u32 __user *)(addr + offset);
>>> +		if (kvm_is_error_hva(addr)) {
>>> +			vcpu->arch.apf_data = NULL;
>>> +			return -EFAULT;
>>> +		}
>>> +	}
>>> +
>>> +	return put_user(val, vcpu->arch.apf_data);
>>> +}
>> This nice cache needs to be outside apf, both to reduce complexity
>> for reviewers and because it is useful to others.
>>
>> Would be good to have memslot-cached kvm_put_guest() and kvm_get_guest().
>>
> Something like this? (only compile tested)

Yes, exactly.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread
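
As a sketch of the read side mentioned above (hypothetical, not part of
this series), a kvm_read_guest_cached() could mirror the write path,
minus the dirty tracking:

	int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
				  void *data, unsigned long len)
	{
		struct kvm_memslots *slots = kvm_memslots(kvm);
		int r;

		if (slots->generation != ghc->generation) {
			int offset = offset_in_page(ghc->gpa);

			ghc->hva = gfn_to_hva(kvm, ghc->gpa >> PAGE_SHIFT);
			if (!kvm_is_error_hva(ghc->hva))
				ghc->hva += offset;
			ghc->generation = slots->generation;
			ghc->memslot = gfn_to_memslot(kvm, ghc->gpa >> PAGE_SHIFT);
		}

		if (kvm_is_error_hva(ghc->hva))
			return -EFAULT;

		/* Reads do not dirty the page, so no mark_page_dirty_in_slot(). */
		r = copy_from_user(data, (void __user *)ghc->hva, len);
		if (r)
			return -EFAULT;

		return 0;
	}

A kvm_put_guest()/kvm_get_guest() pair for single values could then be
thin wrappers around these two cached helpers.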


end of thread, other threads:[~2010-08-24 12:33 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-19 15:30 [PATCH v5 00/12] KVM: Add host swap event notifications for PV guest Gleb Natapov
2010-07-19 15:30 ` Gleb Natapov
2010-07-19 15:30 ` [PATCH v5 01/12] Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c Gleb Natapov
2010-07-19 15:30   ` Gleb Natapov
2010-07-19 15:30 ` [PATCH v5 02/12] Add PV MSR to enable asynchronous page faults delivery Gleb Natapov
2010-07-19 15:30   ` Gleb Natapov
2010-08-23 15:22   ` Avi Kivity
2010-08-23 15:22     ` Avi Kivity
2010-08-23 15:29     ` Gleb Natapov
2010-08-23 15:29       ` Gleb Natapov
2010-07-19 15:30 ` [PATCH v5 03/12] Add async PF initialization to PV guest Gleb Natapov
2010-07-19 15:30   ` Gleb Natapov
2010-07-19 19:52   ` Rik van Riel
2010-07-19 19:52     ` Rik van Riel
2010-08-23 15:26   ` Avi Kivity
2010-08-23 15:26     ` Avi Kivity
2010-08-23 15:35     ` Gleb Natapov
2010-08-23 15:35       ` Gleb Natapov
2010-08-23 16:08       ` Christoph Lameter
2010-08-23 16:08         ` Christoph Lameter
2010-08-23 16:10         ` Gleb Natapov
2010-08-23 16:10           ` Gleb Natapov
2010-08-23 16:19         ` Avi Kivity
2010-08-23 16:19           ` Avi Kivity
2010-07-19 15:30 ` [PATCH v5 04/12] Provide special async page fault handler when async PF capability is detected Gleb Natapov
2010-07-19 15:30   ` Gleb Natapov
2010-08-23 15:48   ` Avi Kivity
2010-08-23 15:48     ` Avi Kivity
2010-08-23 15:52     ` Rik van Riel
2010-08-23 15:52       ` Rik van Riel
2010-08-23 16:22       ` Avi Kivity
2010-08-23 16:22         ` Avi Kivity
2010-08-24  7:31     ` Gleb Natapov
2010-08-24  7:31       ` Gleb Natapov
2010-08-24  9:02       ` Avi Kivity
2010-08-24  9:02         ` Avi Kivity
2010-07-19 15:30 ` [PATCH v5 05/12] Export __get_user_pages_fast Gleb Natapov
2010-07-19 15:30   ` Gleb Natapov
2010-07-19 15:30 ` [PATCH v5 06/12] Add get_user_pages() variant that fails if major fault is required Gleb Natapov
2010-07-19 15:30   ` Gleb Natapov
2010-08-23 15:50   ` Avi Kivity
2010-08-23 15:50     ` Avi Kivity
2010-07-19 15:30 ` [PATCH v5 07/12] Maintain memslot version number Gleb Natapov
2010-07-19 15:30   ` Gleb Natapov
2010-08-23 15:53   ` Avi Kivity
2010-08-23 15:53     ` Avi Kivity
2010-07-19 15:30 ` [PATCH v5 08/12] Inject asynchronous page fault into a guest if page is swapped out Gleb Natapov
2010-07-19 15:30   ` Gleb Natapov
2010-08-23 16:17   ` Avi Kivity
2010-08-23 16:17     ` Avi Kivity
2010-08-24  7:52     ` Gleb Natapov
2010-08-24  7:52       ` Gleb Natapov
2010-08-24  9:04       ` Avi Kivity
2010-08-24  9:04         ` Avi Kivity
2010-08-24 12:28     ` Gleb Natapov
2010-08-24 12:28       ` Gleb Natapov
2010-08-24 12:33       ` Avi Kivity
2010-08-24 12:33         ` Avi Kivity
2010-07-19 15:30 ` [PATCH v5 09/12] Retry fault before vmentry Gleb Natapov
2010-07-19 15:30   ` Gleb Natapov
2010-08-24  9:25   ` Avi Kivity
2010-08-24  9:25     ` Avi Kivity
2010-08-24  9:33     ` Gleb Natapov
2010-08-24  9:33       ` Gleb Natapov
2010-08-24  9:38       ` Avi Kivity
2010-08-24  9:38         ` Avi Kivity
2010-07-19 15:31 ` [PATCH v5 10/12] Handle async PF in non preemptable context Gleb Natapov
2010-07-19 15:31   ` Gleb Natapov
2010-08-24  9:30   ` Avi Kivity
2010-08-24  9:30     ` Avi Kivity
2010-08-24  9:36     ` Gleb Natapov
2010-08-24  9:36       ` Gleb Natapov
2010-08-24  9:46       ` Avi Kivity
2010-08-24  9:46         ` Avi Kivity
2010-07-19 15:31 ` [PATCH v5 11/12] Let host know whether the guest can handle async PF in non-userspace context Gleb Natapov
2010-07-19 15:31   ` Gleb Natapov
2010-08-24  9:31   ` Avi Kivity
2010-08-24  9:31     ` Avi Kivity
2010-07-19 15:31 ` [PATCH v5 12/12] Send async PF when guest is not in userspace too Gleb Natapov
2010-07-19 15:31   ` Gleb Natapov
2010-08-24  9:36   ` Avi Kivity
2010-08-24  9:36     ` Avi Kivity
