linux-kernel.vger.kernel.org archive mirror
* [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests
@ 2012-09-06 11:27 Tomoki Sekiyama
  2012-09-06 11:27 ` [RFC v2 PATCH 01/21] x86: Split memory hotplug function from cpu_up() as cpu_memory_up() Tomoki Sekiyama
                   ` (22 more replies)
  0 siblings, 23 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:27 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, x86, yrl.pp-manager.tt

This RFC patch series provides a facility to dedicate CPUs to KVM guests
and to let the guests handle interrupts from passed-through PCI devices
directly (without VM exits or relaying by the host).

With this feature, we can improve the throughput and response time of the
device, and reduce the host's CPU usage, by cutting the overhead of
interrupt handling. This is useful for applications that use devices with
very high throughput or frequent interrupts (e.g. a 10GbE NIC).
Real-time applications also benefit from the CPU isolation feature, which
reduces interference from host kernel tasks and scheduling delay.

An overview of this patch series was presented at CloudOpen 2012.
The slides are available at:
http://events.linuxfoundation.org/images/stories/pdf/lcna_co2012_sekiyama.pdf

* Changes from v1 ( https://lkml.org/lkml/2012/6/28/30 )
 - SMP guests are supported
 - Direct EOI support is added, which eliminates VM exits on EOI
 - Direct local APIC timer access from guests is added, which passes the
   physical timer of a dedicated CPU through to the guest.
 - Rebased on v3.6-rc4

* How to test
 - Create a guest VM with 1 CPU and some PCI passthrough devices (which
   support MSI/MSI-X).
   Having no VGA display works better.
 - Apply the patch at the end of this mail to qemu-kvm.
   (This patch is just for simple testing; the dedicated CPU IDs for the
    guest are hard-coded.)
 - Run the guest once to ensure the PCI passthrough works correctly.
 - Make the specified CPU offline.
     # echo 0 > /sys/devices/system/cpu/cpu3/online
 - Launch qemu-kvm with the -no-kvm-pit option.
   The offlined CPU is booted as a slave CPU and the guest runs on that CPU.

* To-do
 - Enable slave CPUs to handle access fault
 - Support AMD SVM
 - Support non-Linux guests

---

Tomoki Sekiyama (21):
      x86: request TLB flush to slave CPU using NMI
      KVM: Pass-through local APIC timer of on slave CPUs to guest VM
      KVM: Enable direct EOI for directly routed interrupts to guests
      KVM: route assigned devices' MSI/MSI-X directly to guests on slave CPUs
      KVM: add kvm_arch_vcpu_prevent_run to prevent VM ENTER when NMI is received
      KVM: vmx: Add definitions PIN_BASED_PREEMPTION_TIMER
      KVM: add tracepoint on enabling/disabling direct interrupt delivery
      KVM: Directly handle interrupts by guests without VM EXIT on slave CPUs
      x86/apic: IRQ vector remapping on slave for slave CPUs
      x86/apic: Enable external interrupt routing to slave CPUs
      KVM: no exiting from guest when slave CPU halted
      KVM: proxy slab operations for slave CPUs on online CPUs
      KVM: Go back to online CPU on VM exit by external interrupt
      KVM: Add KVM_GET_SLAVE_CPU and KVM_SET_SLAVE_CPU to vCPU ioctl
      KVM: handle page faults of slave guests on online CPUs
      KVM: Add facility to run guests on slave CPUs
      KVM: Enable/Disable virtualization on slave CPUs are activated/dying
      x86: Avoid RCU warnings on slave CPUs
      x86: Support hrtimer on slave CPUs
      x86: Add a facility to use offlined CPUs as slave CPUs
      x86: Split memory hotplug function from cpu_up() as cpu_memory_up()


 arch/x86/Kconfig                      |   10 +
 arch/x86/include/asm/apic.h           |   10 +
 arch/x86/include/asm/irq.h            |   15 +
 arch/x86/include/asm/kvm_host.h       |   59 +++++
 arch/x86/include/asm/tlbflush.h       |    5 
 arch/x86/include/asm/vmx.h            |    3 
 arch/x86/kernel/apic/apic.c           |   11 +
 arch/x86/kernel/apic/io_apic.c        |  111 ++++++++-
 arch/x86/kernel/apic/x2apic_cluster.c |    8 -
 arch/x86/kernel/cpu/common.c          |    5 
 arch/x86/kernel/smp.c                 |    2 
 arch/x86/kernel/smpboot.c             |  264 ++++++++++++++++++++++-
 arch/x86/kvm/irq.c                    |  136 ++++++++++++
 arch/x86/kvm/lapic.c                  |   56 +++++
 arch/x86/kvm/lapic.h                  |    2 
 arch/x86/kvm/mmu.c                    |   63 ++++-
 arch/x86/kvm/mmu.h                    |    4 
 arch/x86/kvm/trace.h                  |   19 ++
 arch/x86/kvm/vmx.c                    |  180 +++++++++++++++
 arch/x86/kvm/x86.c                    |  387 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.h                    |    9 +
 arch/x86/mm/tlb.c                     |   94 ++++++++
 drivers/iommu/intel_irq_remapping.c   |   32 ++-
 include/linux/cpu.h                   |   36 +++
 include/linux/cpumask.h               |   26 ++
 include/linux/kvm.h                   |    4 
 include/linux/kvm_host.h              |    2 
 kernel/cpu.c                          |   83 +++++--
 kernel/hrtimer.c                      |   14 +
 kernel/irq/manage.c                   |    4 
 kernel/irq/migration.c                |    2 
 kernel/irq/proc.c                     |    2 
 kernel/rcutree.c                      |   14 +
 kernel/smp.c                          |    9 +
 virt/kvm/assigned-dev.c               |    8 +
 virt/kvm/async_pf.c                   |   17 +
 virt/kvm/kvm_main.c                   |   32 +++
 37 files changed, 1629 insertions(+), 109 deletions(-)



* Patch for qemu-kvm-1.0

diff -Narup a/qemu-kvm-1.0/linux-headers/linux/kvm.h b/qemu-kvm-1.0/linux-headers/linux/kvm.h
--- a/qemu-kvm-1.0/linux-headers/linux/kvm.h	2011-12-04 19:38:06.000000000 +0900
+++ b/qemu-kvm-1.0/linux-headers/linux/kvm.h	2012-08-22 14:20:50.080495725 +0900
@@ -558,6 +558,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_PAPR 68
 #define KVM_CAP_SW_TLB 69
 #define KVM_CAP_ONE_REG 70
+#define KVM_CAP_SLAVE_CPU 81
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -811,6 +812,10 @@ struct kvm_one_reg {
 /* Available with KVM_CAP_ONE_REG */
 #define KVM_GET_ONE_REG		  _IOWR(KVMIO, 0xab, struct kvm_one_reg)
 #define KVM_SET_ONE_REG		  _IOW(KVMIO,  0xac, struct kvm_one_reg)
+/* Available with KVM_CAP_SLAVE_CPU */
+#define KVM_GET_SLAVE_CPU	  _IO(KVMIO,  0xae)
+#define KVM_SET_SLAVE_CPU	  _IO(KVMIO,  0xaf)
+
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 
diff -Narup a/qemu-kvm-1.0/qemu-kvm-x86.c b/qemu-kvm-1.0/qemu-kvm-x86.c
--- a/qemu-kvm-1.0/qemu-kvm-x86.c	2011-12-04 19:38:06.000000000 +0900
+++ b/qemu-kvm-1.0/qemu-kvm-x86.c	2012-09-06 20:19:44.828163734 +0900
@@ -139,12 +139,28 @@ static int kvm_enable_tpr_access_reporti
     return kvm_vcpu_ioctl(env, KVM_TPR_ACCESS_REPORTING, &tac);
 }
 
+static int kvm_set_slave_cpu(CPUState *env)
+{
+    int r, slave = env->cpu_index == 0 ? 2 : env->cpu_index == 1 ? 3 : -1;
+
+    r = kvm_ioctl(env->kvm_state, KVM_CHECK_EXTENSION, KVM_CAP_SLAVE_CPU);
+    if (r <= 0) {
+        return -ENOSYS;
+    }
+    r = kvm_vcpu_ioctl(env, KVM_SET_SLAVE_CPU, slave);
+    if (r < 0)
+        perror("kvm_set_slave_cpu");
+    return r;
+}
+
 static int _kvm_arch_init_vcpu(CPUState *env)
 {
     kvm_arch_reset_vcpu(env);
 
     kvm_enable_tpr_access_reporting(env);
 
+    kvm_set_slave_cpu(env);
+
     return kvm_update_ioport_access(env);
 }
 



* [RFC v2 PATCH 01/21] x86: Split memory hotplug function from cpu_up() as cpu_memory_up()
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
@ 2012-09-06 11:27 ` Tomoki Sekiyama
  2012-09-06 11:31   ` Avi Kivity
  2012-09-06 11:27 ` [RFC v2 PATCH 02/21] x86: Add a facility to use offlined CPUs as slave CPUs Tomoki Sekiyama
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:27 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Split the memory hotplug part out of cpu_up() into cpu_memory_up(), which
will be used for assigning a memory area to offlined CPUs in a following
patch in this series.

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 include/linux/cpu.h |    9 +++++++++
 kernel/cpu.c        |   46 +++++++++++++++++++++++++++-------------------
 2 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..8395ac9 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -146,6 +146,15 @@ void notify_cpu_starting(unsigned int cpu);
 extern void cpu_maps_update_begin(void);
 extern void cpu_maps_update_done(void);
 
+#ifdef	CONFIG_MEMORY_HOTPLUG
+extern int cpu_memory_up(unsigned int cpu);
+#else
+static inline int cpu_memory_up(unsigned int cpu)
+{
+	return 0;
+}
+#endif
+
 #else	/* CONFIG_SMP */
 
 #define cpu_notifier(fn, pri)	do { (void)(fn); } while (0)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 14d3258..5df8f36 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -384,11 +384,6 @@ int __cpuinit cpu_up(unsigned int cpu)
 {
 	int err = 0;
 
-#ifdef	CONFIG_MEMORY_HOTPLUG
-	int nid;
-	pg_data_t	*pgdat;
-#endif
-
 	if (!cpu_possible(cpu)) {
 		printk(KERN_ERR "can't online cpu %d because it is not "
 			"configured as may-hotadd at boot time\n", cpu);
@@ -399,7 +394,32 @@ int __cpuinit cpu_up(unsigned int cpu)
 		return -EINVAL;
 	}
 
+	err = cpu_memory_up(cpu);
+	if (err)
+		return err;
+
+	cpu_maps_update_begin();
+
+	if (cpu_hotplug_disabled) {
+		err = -EBUSY;
+		goto out;
+	}
+
+	err = _cpu_up(cpu, 0);
+
+out:
+	cpu_maps_update_done();
+	return err;
+}
+EXPORT_SYMBOL_GPL(cpu_up);
+
 #ifdef	CONFIG_MEMORY_HOTPLUG
+int __cpuinit cpu_memory_up(unsigned int cpu)
+{
+	int err;
+	int nid;
+	pg_data_t	*pgdat;
+
 	nid = cpu_to_node(cpu);
 	if (!node_online(nid)) {
 		err = mem_online_node(nid);
@@ -419,22 +439,10 @@ int __cpuinit cpu_up(unsigned int cpu)
 		build_all_zonelists(NULL, NULL);
 		mutex_unlock(&zonelists_mutex);
 	}
-#endif
 
-	cpu_maps_update_begin();
-
-	if (cpu_hotplug_disabled) {
-		err = -EBUSY;
-		goto out;
-	}
-
-	err = _cpu_up(cpu, 0);
-
-out:
-	cpu_maps_update_done();
-	return err;
+	return 0;
 }
-EXPORT_SYMBOL_GPL(cpu_up);
+#endif
 
 #ifdef CONFIG_PM_SLEEP_SMP
 static cpumask_var_t frozen_cpus;




* [RFC v2 PATCH 02/21] x86: Add a facility to use offlined CPUs as slave CPUs
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
  2012-09-06 11:27 ` [RFC v2 PATCH 01/21] x86: Split memory hotplug function from cpu_up() as cpu_memory_up() Tomoki Sekiyama
@ 2012-09-06 11:27 ` Tomoki Sekiyama
  2012-09-06 11:27 ` [RFC v2 PATCH 03/21] x86: Support hrtimer on " Tomoki Sekiyama
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:27 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Add a facility for using offlined CPUs as slave CPUs. Slave CPUs are
specialized to exclusively run functions specified by online CPUs, and they
do not run user processes.

To use this feature, build the kernel with CONFIG_SLAVE_CPU=y.

A slave CPU is launched by calling slave_cpu_up() when the CPU is offlined.
Once launched, the slave CPU waits for IPIs in the idle thread context.
Users of the slave CPU can run a specific kernel function by sending an IPI
using slave_cpu_call_function().

When slave_cpu_down() is called, the slave CPU goes back to the offline state.

The cpumask `cpu_slave_mask' is provided to track which CPUs are slaves.
In addition, `cpu_online_or_slave_mask' is also provided for convenience in
APIC handling, etc.
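
For illustration, a minimal kernel-side user of this interface could look
like the sketch below. The callback name and its body are made-up examples;
only slave_cpu_up(), slave_cpu_call_function() and slave_cpu_down() are the
interfaces added by this patch.

#include <linux/kernel.h>
#include <linux/smp.h>
#include <asm/cpu.h>	/* slave_cpu_up() etc.; moved to <linux/cpu.h> later in the series */

/* example function to be run on the slave CPU (arbitrary name) */
static void example_slave_work(void *arg)
{
	pr_info("running on slave CPU %d\n", smp_processor_id());
}

static int example_use_slave_cpu(unsigned int cpu)
{
	int ret;

	ret = slave_cpu_up(cpu);	/* boot the offlined CPU as a slave */
	if (ret)
		return ret;

	/* run example_slave_work(NULL) on the slave CPU via IPI */
	slave_cpu_call_function(cpu, example_slave_work, NULL);

	return slave_cpu_down(cpu);	/* stop the slave and offline it again */
}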

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/Kconfig             |   10 ++
 arch/x86/include/asm/cpu.h   |   11 ++
 arch/x86/kernel/cpu/common.c |    5 +
 arch/x86/kernel/smpboot.c    |  190 ++++++++++++++++++++++++++++++++++++++++--
 include/linux/cpumask.h      |   26 ++++++
 kernel/cpu.c                 |   37 ++++++++
 kernel/smp.c                 |    8 +-
 7 files changed, 275 insertions(+), 12 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8ec3a1a..106c958 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1678,6 +1678,16 @@ config HOTPLUG_CPU
 	    automatically on SMP systems. )
 	  Say N if you want to disable CPU hotplug.
 
+config SLAVE_CPU
+	bool "Support for slave CPUs (EXPERIMENTAL)"
+	depends on EXPERIMENTAL && HOTPLUG_CPU
+	---help---
+	  Say Y here to allow use some of CPUs as slave processors.
+	  Slave CPUs are controlled from another CPU and do some tasks
+	  and cannot run user processes. Slave processors can be
+	  specified through /sys/devices/system/cpu.
+	  Say N if you want to disable slave CPU support.
+
 config COMPAT_VDSO
 	def_bool y
 	prompt "Compat VDSO support"
diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index 4564c8e..b7ace52 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -30,6 +30,17 @@ extern int arch_register_cpu(int num);
 extern void arch_unregister_cpu(int);
 #endif
 
+#ifdef CONFIG_SLAVE_CPU
+#define CPU_SLAVE_UP_PREPARE	0xff00
+#define CPU_SLAVE_UP		0xff01
+#define CPU_SLAVE_DEAD		0xff02
+
+extern int slave_cpu_up(unsigned int cpu);
+extern int slave_cpu_down(unsigned int cpu);
+extern void slave_cpu_call_function(unsigned int cpu,
+				    void (*f)(void *), void *arg);
+#endif
+
 DECLARE_PER_CPU(int, cpu_state);
 
 int mwait_usable(const struct cpuinfo_x86 *);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a5fbc3c..ab7f9a7 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -913,7 +913,10 @@ static void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
 	}
 
 	/* Init Machine Check Exception if available. */
-	mcheck_cpu_init(c);
+#ifdef CONFIG_SLAVE_CPU
+	if (per_cpu(cpu_state, smp_processor_id()) != CPU_SLAVE_UP_PREPARE)
+#endif
+		mcheck_cpu_init(c);
 
 	select_idle_routine(c);
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 7c5a8c3..b9e1297 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -53,6 +53,9 @@
 #include <linux/stackprotector.h>
 #include <linux/gfp.h>
 #include <linux/cpuidle.h>
+#include <linux/clockchips.h>
+#include <linux/tick.h>
+#include "../kernel/smpboot.h"
 
 #include <asm/acpi.h>
 #include <asm/desc.h>
@@ -128,7 +131,7 @@ atomic_t init_deasserted;
  * Report back to the Boot Processor.
  * Running on AP.
  */
-static void __cpuinit smp_callin(void)
+static void __cpuinit smp_callin(int notify_starting)
 {
 	int cpuid, phys_id;
 	unsigned long timeout;
@@ -220,7 +223,8 @@ static void __cpuinit smp_callin(void)
 	set_cpu_sibling_map(raw_smp_processor_id());
 	wmb();
 
-	notify_cpu_starting(cpuid);
+	if (notify_starting)
+		notify_cpu_starting(cpuid);
 
 	/*
 	 * Allow the master to continue.
@@ -241,7 +245,7 @@ notrace static void __cpuinit start_secondary(void *unused)
 	cpu_init();
 	x86_cpuinit.early_percpu_clock_init();
 	preempt_disable();
-	smp_callin();
+	smp_callin(1);
 
 #ifdef CONFIG_X86_32
 	/* switch away from the initial page table */
@@ -279,6 +283,90 @@ notrace static void __cpuinit start_secondary(void *unused)
 	cpu_idle();
 }
 
+#ifdef CONFIG_SLAVE_CPU
+
+struct slave_cpu_func_info {
+	void (*func)(void *);
+	void *arg;
+};
+static DEFINE_PER_CPU(struct slave_cpu_func_info, slave_cpu_func);
+
+/*
+ * Activate cpu as a slave processor.
+ * The cpu is used to run specified function using smp_call_function
+ * from online processors.
+ * Note that this doesn't mark the cpu online.
+ */
+notrace static void __cpuinit start_slave_cpu(void *unused)
+{
+	int cpu;
+
+	/*
+	 * Don't put *anything* before cpu_init(), SMP booting is too
+	 * fragile that we want to limit the things done here to the
+	 * most necessary things.
+	 */
+	cpu_init();
+	preempt_disable();
+	smp_callin(0);
+
+#ifdef CONFIG_X86_32
+	/* switch away from the initial page table */
+	load_cr3(swapper_pg_dir);
+	__flush_tlb_all();
+#endif
+
+	/* otherwise gcc will move up smp_processor_id before the cpu_init */
+	barrier();
+	/*
+	 * Check TSC synchronization with the BP:
+	 */
+	check_tsc_sync_target();
+
+	x86_platform.nmi_init();
+
+	/* enable local interrupts */
+	local_irq_enable();
+
+	cpu = smp_processor_id();
+
+	/* to prevent fake stack check failure */
+	boot_init_stack_canary();
+
+	/* announce slave CPU started */
+	pr_info("Slave CPU %d is up\n", cpu);
+	__this_cpu_write(cpu_state, CPU_SLAVE_UP);
+	set_cpu_slave(cpu, true);
+	wmb();
+
+	/* wait for slave_cpu_call_function or slave_cpu_down */
+	while (__this_cpu_read(cpu_state) == CPU_SLAVE_UP) {
+		struct slave_cpu_func_info f;
+
+		local_irq_disable();
+		f = per_cpu(slave_cpu_func, cpu);
+		per_cpu(slave_cpu_func, cpu).func = NULL;
+
+		if (!f.func) {
+			native_safe_halt();
+			continue;
+		}
+
+		local_irq_enable();
+		preempt_enable_no_resched();
+		f.func(f.arg);
+		preempt_disable();
+	}
+
+	/* now stop this CPU again */
+	pr_info("Slave CPU %d is going down ...\n", cpu);
+	local_irq_disable();
+	native_cpu_disable();
+	set_cpu_slave(cpu, false);
+	native_play_dead();
+}
+#endif
+
 /*
  * The bootstrap kernel entry code has set these up. Save them for
  * a given CPU
@@ -655,7 +743,8 @@ static void __cpuinit announce_cpu(int cpu, int apicid)
  * Returns zero if CPU booted OK, else error code from
  * ->wakeup_secondary_cpu.
  */
-static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
+static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
+				 int slave)
 {
 	volatile u32 *trampoline_status =
 		(volatile u32 *) __va(real_mode_header->trampoline_status);
@@ -683,6 +772,10 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 #endif
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
 	initial_code = (unsigned long)start_secondary;
+#ifdef CONFIG_SLAVE_CPU
+	if (unlikely(slave))
+		initial_code = (unsigned long)start_slave_cpu;
+#endif
 	stack_start  = idle->thread.sp;
 
 	/* So we see what's up */
@@ -784,7 +877,8 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 	return boot_error;
 }
 
-int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+static int __cpuinit __native_cpu_up(unsigned int cpu,
+				     struct task_struct *tidle, int slave)
 {
 	int apicid = apic->cpu_present_to_apicid(cpu);
 	unsigned long flags;
@@ -815,9 +909,14 @@ int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
 	 */
 	mtrr_save_state();
 
+#ifdef CONFIG_SLAVE_CPU
+	per_cpu(cpu_state, cpu) = slave ? CPU_SLAVE_UP_PREPARE
+					: CPU_UP_PREPARE;
+#else
 	per_cpu(cpu_state, cpu) = CPU_UP_PREPARE;
+#endif
 
-	err = do_boot_cpu(apicid, cpu, tidle);
+	err = do_boot_cpu(apicid, cpu, tidle, slave);
 	if (err) {
 		pr_debug("do_boot_cpu failed %d\n", err);
 		return -EIO;
@@ -831,7 +930,7 @@ int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
 	check_tsc_sync_source(cpu);
 	local_irq_restore(flags);
 
-	while (!cpu_online(cpu)) {
+	while (!cpu_online(cpu) && !cpu_slave(cpu)) {
 		cpu_relax();
 		touch_nmi_watchdog();
 	}
@@ -839,6 +938,83 @@ int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
 	return 0;
 }
 
+int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+{
+	return __native_cpu_up(cpu, tidle, 0);
+}
+
+#ifdef CONFIG_SLAVE_CPU
+
+/* boot CPU as a slave processor */
+int __cpuinit slave_cpu_up(unsigned int cpu)
+{
+	int ret;
+	struct task_struct *idle;
+
+	if (!cpu_possible(cpu)) {
+		pr_err("can't start slave cpu %d because it is not "
+		       "configured as may-hotadd at boot time\n", cpu);
+		return -EINVAL;
+	}
+	if (cpu_online(cpu) || !cpu_present(cpu))
+		return -EINVAL;
+
+	ret = cpu_memory_up(cpu);
+	if (ret)
+		return ret;
+
+	cpu_maps_update_begin();
+
+	idle = idle_thread_get(cpu);
+	if (IS_ERR(idle))
+		return PTR_ERR(idle);
+
+	ret = __native_cpu_up(cpu, idle, 1);
+
+	cpu_maps_update_done();
+
+	return ret;
+}
+EXPORT_SYMBOL(slave_cpu_up);
+
+static void __slave_cpu_down(void *dummy)
+{
+	__this_cpu_write(cpu_state, CPU_DYING);
+}
+
+int slave_cpu_down(unsigned int cpu)
+{
+	if (!cpu_slave(cpu))
+		return -EINVAL;
+
+	slave_cpu_call_function(cpu, __slave_cpu_down, NULL);
+	native_cpu_die(cpu);
+
+	if (cpu_slave(cpu)) {
+		pr_err("failed to stop slave cpu %d\n", cpu);
+		return -EBUSY;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(slave_cpu_down);
+
+void __slave_cpu_call_function(void *info)
+{
+	per_cpu(slave_cpu_func, smp_processor_id()) =
+		*(struct slave_cpu_func_info *)info;
+}
+
+void slave_cpu_call_function(unsigned int cpu, void (*f)(void *), void *arg)
+{
+	struct slave_cpu_func_info info = {f, arg};
+
+	smp_call_function_single(cpu, __slave_cpu_call_function, &info, 1);
+}
+EXPORT_SYMBOL(slave_cpu_call_function);
+
+#endif
+
 /**
  * arch_disable_smp_support() - disables SMP support for x86 at runtime
  */
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 0325602..f5fec4f 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -81,6 +81,19 @@ extern const struct cpumask *const cpu_online_mask;
 extern const struct cpumask *const cpu_present_mask;
 extern const struct cpumask *const cpu_active_mask;
 
+/*
+ * cpu_online_or_slave_mask represents cpu_online_mask | cpu_slave_mask.
+ * This mask indicates which CPUs can handle IRQs.
+ * If CONFIG_SLAVE_CPU is not defined, cpu_slave() is defined as 0,
+ * and cpu_online_or_slave_mask is equal to cpu_online_mask.
+ */
+#ifdef CONFIG_SLAVE_CPU
+extern const struct cpumask *const cpu_slave_mask;
+extern const struct cpumask *const cpu_online_or_slave_mask;
+#else
+#define cpu_online_or_slave_mask cpu_online_mask
+#endif
+
 #if NR_CPUS > 1
 #define num_online_cpus()	cpumask_weight(cpu_online_mask)
 #define num_possible_cpus()	cpumask_weight(cpu_possible_mask)
@@ -101,6 +114,12 @@ extern const struct cpumask *const cpu_active_mask;
 #define cpu_active(cpu)		((cpu) == 0)
 #endif
 
+#if defined(CONFIG_SLAVE_CPU) && NR_CPUS > 1
+#define cpu_slave(cpu)		cpumask_test_cpu((cpu), cpu_slave_mask)
+#else
+#define cpu_slave(cpu)		0
+#endif
+
 /* verify cpu argument to cpumask_* operators */
 static inline unsigned int cpumask_check(unsigned int cpu)
 {
@@ -716,6 +735,13 @@ void init_cpu_present(const struct cpumask *src);
 void init_cpu_possible(const struct cpumask *src);
 void init_cpu_online(const struct cpumask *src);
 
+#ifdef CONFIG_SLAVE_CPU
+#define for_each_slave_cpu(cpu)  for_each_cpu((cpu), cpu_slave_mask)
+void set_cpu_slave(unsigned int cpu, bool slave);
+#else
+#define for_each_slave_cpu(cpu)  while (0)
+#endif
+
 /**
  * to_cpumask - convert an NR_CPUS bitmap to a struct cpumask *
  * @bitmap: the bitmap
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 5df8f36..fecb394 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -343,7 +343,7 @@ static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen)
 	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
 	struct task_struct *idle;
 
-	if (cpu_online(cpu) || !cpu_present(cpu))
+	if (cpu_online(cpu) || cpu_slave(cpu) || !cpu_present(cpu))
 		return -EINVAL;
 
 	cpu_hotplug_begin();
@@ -685,6 +685,17 @@ static DECLARE_BITMAP(cpu_active_bits, CONFIG_NR_CPUS) __read_mostly;
 const struct cpumask *const cpu_active_mask = to_cpumask(cpu_active_bits);
 EXPORT_SYMBOL(cpu_active_mask);
 
+#ifdef CONFIG_SLAVE_CPU
+static DECLARE_BITMAP(cpu_slave_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_slave_mask = to_cpumask(cpu_slave_bits);
+EXPORT_SYMBOL(cpu_slave_mask);
+
+static DECLARE_BITMAP(cpu_online_or_slave_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_online_or_slave_mask =
+	to_cpumask(cpu_online_or_slave_bits);
+EXPORT_SYMBOL(cpu_online_or_slave_mask);
+#endif
+
 void set_cpu_possible(unsigned int cpu, bool possible)
 {
 	if (possible)
@@ -707,6 +718,13 @@ void set_cpu_online(unsigned int cpu, bool online)
 		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
 	else
 		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
+
+#ifdef CONFIG_SLAVE_CPU
+	if (online)
+		cpumask_set_cpu(cpu, to_cpumask(cpu_online_or_slave_bits));
+	else
+		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_or_slave_bits));
+#endif
 }
 
 void set_cpu_active(unsigned int cpu, bool active)
@@ -717,6 +735,19 @@ void set_cpu_active(unsigned int cpu, bool active)
 		cpumask_clear_cpu(cpu, to_cpumask(cpu_active_bits));
 }
 
+#ifdef CONFIG_SLAVE_CPU
+void set_cpu_slave(unsigned int cpu, bool slave)
+{
+	if (slave) {
+		cpumask_set_cpu(cpu, to_cpumask(cpu_slave_bits));
+		cpumask_set_cpu(cpu, to_cpumask(cpu_online_or_slave_bits));
+	} else {
+		cpumask_clear_cpu(cpu, to_cpumask(cpu_slave_bits));
+		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_or_slave_bits));
+	}
+}
+#endif
+
 void init_cpu_present(const struct cpumask *src)
 {
 	cpumask_copy(to_cpumask(cpu_present_bits), src);
@@ -730,4 +761,8 @@ void init_cpu_possible(const struct cpumask *src)
 void init_cpu_online(const struct cpumask *src)
 {
 	cpumask_copy(to_cpumask(cpu_online_bits), src);
+
+#ifdef CONFIG_SLAVE_CPU
+	cpumask_copy(to_cpumask(cpu_online_or_slave_bits), src);
+#endif
 }
diff --git a/kernel/smp.c b/kernel/smp.c
index 29dd40a..fda7a8d 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -177,7 +177,7 @@ void generic_smp_call_function_interrupt(void)
 	/*
 	 * Shouldn't receive this interrupt on a cpu that is not yet online.
 	 */
-	WARN_ON_ONCE(!cpu_online(cpu));
+	WARN_ON_ONCE(!cpu_online(cpu) && !cpu_slave(cpu));
 
 	/*
 	 * Ensure entry is visible on call_function_queue after we have
@@ -257,7 +257,8 @@ void generic_smp_call_function_single_interrupt(void)
 	/*
 	 * Shouldn't receive this interrupt on a cpu that is not yet online.
 	 */
-	WARN_ON_ONCE(!cpu_online(smp_processor_id()));
+	WARN_ON_ONCE(!cpu_online(smp_processor_id()) &&
+		     !cpu_slave(smp_processor_id()));
 
 	raw_spin_lock(&q->lock);
 	list_replace_init(&q->list, &list);
@@ -326,7 +327,8 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 		func(info);
 		local_irq_restore(flags);
 	} else {
-		if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
+		if ((unsigned)cpu < nr_cpu_ids &&
+		    (cpu_online(cpu) || cpu_slave(cpu))) {
 			struct call_single_data *data = &d;
 
 			if (!wait)




* [RFC v2 PATCH 03/21] x86: Support hrtimer on slave CPUs
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
  2012-09-06 11:27 ` [RFC v2 PATCH 01/21] x86: Split memory hotplug function from cpu_up() as cpu_memory_up() Tomoki Sekiyama
  2012-09-06 11:27 ` [RFC v2 PATCH 02/21] x86: Add a facility to use offlined CPUs as slave CPUs Tomoki Sekiyama
@ 2012-09-06 11:27 ` Tomoki Sekiyama
  2012-09-06 11:27 ` [RFC v2 PATCH 04/21] x86: Avoid RCU warnings " Tomoki Sekiyama
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:27 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Add a facility to use hrtimers on slave CPUs.

To initialize hrtimers when slave CPUs are activated, and to shut down
hrtimers when slave CPUs are stopped, this patch adds a slave CPU notifier
chain, which calls registered callbacks when slave CPUs are brought up,
are dying, and have died.

The registered callbacks are called with CPU_SLAVE_UP when a slave CPU
becomes active. When the slave CPU is stopped, callbacks are called with
CPU_SLAVE_DYING on the slave CPU, and with CPU_SLAVE_DEAD on an online CPU.
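
For reference, a subsystem that needs to follow slave CPU state could hook
into this chain roughly as sketched below. The callback body is a made-up
example; the CPU_SLAVE_* constants and register_slave_cpu_notifier() are the
ones introduced by this patch.

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/cpu.h>
#include <linux/notifier.h>

static int example_slave_notify(struct notifier_block *self,
				unsigned long action, void *hcpu)
{
	int cpu = (long)hcpu;

	switch (action) {
	case CPU_SLAVE_UP:	/* runs on the slave CPU when it becomes active */
		pr_info("slave CPU %d is up\n", cpu);
		break;
	case CPU_SLAVE_DYING:	/* runs on the slave CPU while it is stopping */
		pr_info("slave CPU %d is dying\n", cpu);
		break;
	case CPU_SLAVE_DEAD:	/* runs on an online CPU after the slave stopped */
		pr_info("slave CPU %d is dead\n", cpu);
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block example_slave_nb = {
	.notifier_call = example_slave_notify,
};

static int __init example_init(void)
{
	return register_slave_cpu_notifier(&example_slave_nb);
}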

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/cpu.h |   11 -----------
 arch/x86/kernel/smpboot.c  |   37 +++++++++++++++++++++++++++++++++++++
 include/linux/cpu.h        |   22 ++++++++++++++++++++++
 kernel/hrtimer.c           |   14 ++++++++++++++
 4 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index b7ace52..4564c8e 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -30,17 +30,6 @@ extern int arch_register_cpu(int num);
 extern void arch_unregister_cpu(int);
 #endif
 
-#ifdef CONFIG_SLAVE_CPU
-#define CPU_SLAVE_UP_PREPARE	0xff00
-#define CPU_SLAVE_UP		0xff01
-#define CPU_SLAVE_DEAD		0xff02
-
-extern int slave_cpu_up(unsigned int cpu);
-extern int slave_cpu_down(unsigned int cpu);
-extern void slave_cpu_call_function(unsigned int cpu,
-				    void (*f)(void *), void *arg);
-#endif
-
 DECLARE_PER_CPU(int, cpu_state);
 
 int mwait_usable(const struct cpuinfo_x86 *);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index b9e1297..e8cfe377 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -127,6 +127,36 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
 
 atomic_t init_deasserted;
 
+static void __ref remove_cpu_from_maps(int cpu);
+
+
+#ifdef CONFIG_SLAVE_CPU
+/* Notify slave cpu up and down */
+static RAW_NOTIFIER_HEAD(slave_cpu_chain);
+
+int register_slave_cpu_notifier(struct notifier_block *nb)
+{
+	return raw_notifier_chain_register(&slave_cpu_chain, nb);
+}
+EXPORT_SYMBOL(register_slave_cpu_notifier);
+
+void unregister_slave_cpu_notifier(struct notifier_block *nb)
+{
+	raw_notifier_chain_unregister(&slave_cpu_chain, nb);
+}
+EXPORT_SYMBOL(unregister_slave_cpu_notifier);
+
+static int slave_cpu_notify(unsigned long val, int cpu)
+{
+	int ret;
+
+	ret = __raw_notifier_call_chain(&slave_cpu_chain, val,
+					(void *)(long)cpu, -1, NULL);
+
+	return notifier_to_errno(ret);
+}
+#endif
+
 /*
  * Report back to the Boot Processor.
  * Running on AP.
@@ -307,6 +337,7 @@ notrace static void __cpuinit start_slave_cpu(void *unused)
 	 * most necessary things.
 	 */
 	cpu_init();
+	x86_cpuinit.early_percpu_clock_init();
 	preempt_disable();
 	smp_callin(0);
 
@@ -333,10 +364,14 @@ notrace static void __cpuinit start_slave_cpu(void *unused)
 	/* to prevent fake stack check failure */
 	boot_init_stack_canary();
 
+	x86_cpuinit.setup_percpu_clockev();
+	tick_nohz_idle_enter();
+
 	/* announce slave CPU started */
 	pr_info("Slave CPU %d is up\n", cpu);
 	__this_cpu_write(cpu_state, CPU_SLAVE_UP);
 	set_cpu_slave(cpu, true);
+	slave_cpu_notify(CPU_SLAVE_UP, cpu);
 	wmb();
 
 	/* wait for slave_cpu_call_function or slave_cpu_down */
@@ -363,6 +398,7 @@ notrace static void __cpuinit start_slave_cpu(void *unused)
 	local_irq_disable();
 	native_cpu_disable();
 	set_cpu_slave(cpu, false);
+	slave_cpu_notify(CPU_SLAVE_DYING, cpu);
 	native_play_dead();
 }
 #endif
@@ -995,6 +1031,7 @@ int slave_cpu_down(unsigned int cpu)
 		return -EBUSY;
 	}
 
+	slave_cpu_notify(CPU_SLAVE_DEAD, cpu);
 	return 0;
 }
 EXPORT_SYMBOL(slave_cpu_down);
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 8395ac9..f1aa3cc 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -221,4 +221,26 @@ static inline int disable_nonboot_cpus(void) { return 0; }
 static inline void enable_nonboot_cpus(void) {}
 #endif /* !CONFIG_PM_SLEEP_SMP */
 
+#ifdef CONFIG_SLAVE_CPU
+int register_slave_cpu_notifier(struct notifier_block *nb);
+void unregister_slave_cpu_notifier(struct notifier_block *nb);
+
+/* CPU notifier constants for slave processors */
+#define CPU_SLAVE_UP_PREPARE	0xff00
+#define CPU_SLAVE_UP		0xff01
+#define CPU_SLAVE_DEAD		0xff02
+#define CPU_SLAVE_DYING		0xff03
+
+extern int slave_cpu_up(unsigned int cpu);
+extern int slave_cpu_down(unsigned int cpu);
+extern void slave_cpu_call_function(unsigned int cpu,
+				    void (*f)(void *), void *arg);
+#else
+static inline int register_slave_cpu_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+static inline void unregister_slave_cpu_notifier(struct notifier_block *nb) {}
+#endif
+
 #endif /* _LINUX_CPU_H_ */
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 6db7a5e..e899a2c 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1727,16 +1727,25 @@ static int __cpuinit hrtimer_cpu_notify(struct notifier_block *self,
 
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
+#ifdef CONFIG_SLAVE_CPU
+	case CPU_SLAVE_UP:
+#endif
 		init_hrtimers_cpu(scpu);
 		break;
 
 #ifdef CONFIG_HOTPLUG_CPU
 	case CPU_DYING:
 	case CPU_DYING_FROZEN:
+#ifdef CONFIG_SLAVE_CPU
+	case CPU_SLAVE_DYING:
+#endif
 		clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DYING, &scpu);
 		break;
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
+#ifdef CONFIG_SLAVE_CPU
+	case CPU_SLAVE_DEAD:
+#endif
 	{
 		clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DEAD, &scpu);
 		migrate_hrtimers(scpu);
@@ -1755,11 +1764,16 @@ static struct notifier_block __cpuinitdata hrtimers_nb = {
 	.notifier_call = hrtimer_cpu_notify,
 };
 
+static struct notifier_block __cpuinitdata hrtimers_slave_nb = {
+	.notifier_call = hrtimer_cpu_notify,
+};
+
 void __init hrtimers_init(void)
 {
 	hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
 			  (void *)(long)smp_processor_id());
 	register_cpu_notifier(&hrtimers_nb);
+	register_slave_cpu_notifier(&hrtimers_slave_nb);
 #ifdef CONFIG_HIGH_RES_TIMERS
 	open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq);
 #endif




* [RFC v2 PATCH 04/21] x86: Avoid RCU warnings on slave CPUs
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (2 preceding siblings ...)
  2012-09-06 11:27 ` [RFC v2 PATCH 03/21] x86: Support hrtimer on " Tomoki Sekiyama
@ 2012-09-06 11:27 ` Tomoki Sekiyama
  2012-09-20 17:34   ` Paul E. McKenney
  2012-09-06 11:27 ` [RFC v2 PATCH 05/21] KVM: Enable/Disable virtualization on slave CPUs are activated/dying Tomoki Sekiyama
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:27 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Initialize RCU-related variables to avoid warnings about RCU usage while
slave CPUs are running specified functions. Also notify the RCU subsystem
before the slave CPU enters the idle (halted) state.

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/kernel/smpboot.c |    4 ++++
 kernel/rcutree.c          |   14 ++++++++++++++
 2 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index e8cfe377..45dfc1d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -382,6 +382,8 @@ notrace static void __cpuinit start_slave_cpu(void *unused)
 		f = per_cpu(slave_cpu_func, cpu);
 		per_cpu(slave_cpu_func, cpu).func = NULL;
 
+		rcu_note_context_switch(cpu);
+
 		if (!f.func) {
 			native_safe_halt();
 			continue;
@@ -1005,6 +1007,8 @@ int __cpuinit slave_cpu_up(unsigned int cpu)
 	if (IS_ERR(idle))
 		return PTR_ERR(idle);
 
+	slave_cpu_notify(CPU_SLAVE_UP_PREPARE, cpu);
+
 	ret = __native_cpu_up(cpu, idle, 1);
 
 	cpu_maps_update_done();
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f280e54..31a7c8c 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2589,6 +2589,9 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
+#ifdef CONFIG_SLAVE_CPU
+	case CPU_SLAVE_UP_PREPARE:
+#endif
 		rcu_prepare_cpu(cpu);
 		rcu_prepare_kthreads(cpu);
 		break;
@@ -2603,6 +2606,9 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
 		break;
 	case CPU_DYING:
 	case CPU_DYING_FROZEN:
+#ifdef CONFIG_SLAVE_CPU
+	case CPU_SLAVE_DYING:
+#endif
 		/*
 		 * The whole machine is "stopped" except this CPU, so we can
 		 * touch any data without introducing corruption. We send the
@@ -2616,6 +2622,9 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
 	case CPU_DEAD_FROZEN:
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
+#ifdef CONFIG_SLAVE_CPU
+	case CPU_SLAVE_DEAD:
+#endif
 		for_each_rcu_flavor(rsp)
 			rcu_cleanup_dead_cpu(cpu, rsp);
 		break;
@@ -2797,6 +2806,10 @@ static void __init rcu_init_geometry(void)
 	rcu_num_nodes -= n;
 }
 
+static struct notifier_block __cpuinitdata rcu_slave_nb = {
+	.notifier_call = rcu_cpu_notify,
+};
+
 void __init rcu_init(void)
 {
 	int cpu;
@@ -2814,6 +2827,7 @@ void __init rcu_init(void)
 	 * or the scheduler are operational.
 	 */
 	cpu_notifier(rcu_cpu_notify, 0);
+	register_slave_cpu_notifier(&rcu_slave_nb);
 	for_each_online_cpu(cpu)
 		rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
 	check_cpu_stall_init();




* [RFC v2 PATCH 05/21] KVM: Enable/Disable virtualization on slave CPUs are activated/dying
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (3 preceding siblings ...)
  2012-09-06 11:27 ` [RFC v2 PATCH 04/21] x86: Avoid RCU warnings " Tomoki Sekiyama
@ 2012-09-06 11:27 ` Tomoki Sekiyama
  2012-09-06 11:27 ` [RFC v2 PATCH 06/21] KVM: Add facility to run guests on slave CPUs Tomoki Sekiyama
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:27 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Enable virtualization when slave CPUs are activated, and disable it when
the CPUs are dying, using the slave CPU notifier call chain.

On x86, the TSC kHz must also be initialized by tsc_khz_changed() when
new slave CPUs are activated.

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/kvm/x86.c  |   16 ++++++++++++++++
 virt/kvm/kvm_main.c |   30 ++++++++++++++++++++++++++++--
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 148ed66..7501cc4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -61,6 +61,7 @@
 #include <asm/xcr.h>
 #include <asm/pvclock.h>
 #include <asm/div64.h>
+#include <asm/cpu.h>
 
 #define MAX_IO_MSRS 256
 #define KVM_MAX_MCE_BANKS 32
@@ -4782,9 +4783,15 @@ static int kvmclock_cpu_notifier(struct notifier_block *nfb,
 	switch (action) {
 		case CPU_ONLINE:
 		case CPU_DOWN_FAILED:
+#ifdef CONFIG_SLAVE_CPU
+		case CPU_SLAVE_UP:
+#endif
 			smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
 			break;
 		case CPU_DOWN_PREPARE:
+#ifdef CONFIG_SLAVE_CPU
+		case CPU_SLAVE_DYING:
+#endif
 			smp_call_function_single(cpu, tsc_bad, NULL, 1);
 			break;
 	}
@@ -4796,12 +4803,18 @@ static struct notifier_block kvmclock_cpu_notifier_block = {
 	.priority = -INT_MAX
 };
 
+static struct notifier_block kvmclock_slave_cpu_notifier_block = {
+	.notifier_call  = kvmclock_cpu_notifier,
+	.priority = -INT_MAX
+};
+
 static void kvm_timer_init(void)
 {
 	int cpu;
 
 	max_tsc_khz = tsc_khz;
 	register_hotcpu_notifier(&kvmclock_cpu_notifier_block);
+	register_slave_cpu_notifier(&kvmclock_slave_cpu_notifier_block);
 	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
 #ifdef CONFIG_CPU_FREQ
 		struct cpufreq_policy policy;
@@ -4818,6 +4831,8 @@ static void kvm_timer_init(void)
 	pr_debug("kvm: max_tsc_khz = %ld\n", max_tsc_khz);
 	for_each_online_cpu(cpu)
 		smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
+	for_each_slave_cpu(cpu)
+		smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
 }
 
 static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
@@ -4943,6 +4958,7 @@ void kvm_arch_exit(void)
 		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
 					    CPUFREQ_TRANSITION_NOTIFIER);
 	unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
+	unregister_slave_cpu_notifier(&kvmclock_slave_cpu_notifier_block);
 	kvm_x86_ops = NULL;
 	kvm_mmu_module_exit();
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d617f69..dc86e9a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -54,6 +54,9 @@
 #include <asm/io.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
+#ifdef CONFIG_X86
+#include <asm/cpu.h>
+#endif
 
 #include "coalesced_mmio.h"
 #include "async_pf.h"
@@ -2336,11 +2339,17 @@ static void hardware_disable(void *junk)
 
 static void hardware_disable_all_nolock(void)
 {
+	int cpu;
+
 	BUG_ON(!kvm_usage_count);
 
 	kvm_usage_count--;
-	if (!kvm_usage_count)
+	if (!kvm_usage_count) {
 		on_each_cpu(hardware_disable_nolock, NULL, 1);
+		for_each_slave_cpu(cpu)
+			smp_call_function_single(cpu, hardware_disable_nolock,
+						 NULL, 1);
+	}
 }
 
 static void hardware_disable_all(void)
@@ -2353,6 +2362,7 @@ static void hardware_disable_all(void)
 static int hardware_enable_all(void)
 {
 	int r = 0;
+	int cpu;
 
 	raw_spin_lock(&kvm_lock);
 
@@ -2360,6 +2370,9 @@ static int hardware_enable_all(void)
 	if (kvm_usage_count == 1) {
 		atomic_set(&hardware_enable_failed, 0);
 		on_each_cpu(hardware_enable_nolock, NULL, 1);
+		for_each_slave_cpu(cpu)
+			smp_call_function_single(cpu, hardware_enable_nolock,
+						 NULL, 1);
 
 		if (atomic_read(&hardware_enable_failed)) {
 			hardware_disable_all_nolock();
@@ -2383,11 +2396,17 @@ static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
 	val &= ~CPU_TASKS_FROZEN;
 	switch (val) {
 	case CPU_DYING:
+#ifdef CONFIG_SLAVE_CPU
+	case CPU_SLAVE_DYING:
+#endif
 		printk(KERN_INFO "kvm: disabling virtualization on CPU%d\n",
 		       cpu);
 		hardware_disable(NULL);
 		break;
 	case CPU_STARTING:
+#ifdef CONFIG_SLAVE_CPU
+	case CPU_SLAVE_UP:
+#endif
 		printk(KERN_INFO "kvm: enabling virtualization on CPU%d\n",
 		       cpu);
 		hardware_enable(NULL);
@@ -2605,6 +2624,10 @@ static struct notifier_block kvm_cpu_notifier = {
 	.notifier_call = kvm_cpu_hotplug,
 };
 
+static struct notifier_block kvm_slave_cpu_notifier = {
+	.notifier_call = kvm_cpu_hotplug,
+};
+
 static int vm_stat_get(void *_offset, u64 *val)
 {
 	unsigned offset = (long)_offset;
@@ -2768,7 +2791,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	if (r < 0)
 		goto out_free_0a;
 
-	for_each_online_cpu(cpu) {
+	for_each_cpu(cpu, cpu_online_or_slave_mask) {
 		smp_call_function_single(cpu,
 				kvm_arch_check_processor_compat,
 				&r, 1);
@@ -2779,6 +2802,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	r = register_cpu_notifier(&kvm_cpu_notifier);
 	if (r)
 		goto out_free_2;
+	register_slave_cpu_notifier(&kvm_slave_cpu_notifier);
 	register_reboot_notifier(&kvm_reboot_notifier);
 
 	/* A kmem cache lets us meet the alignment requirements of fx_save. */
@@ -2826,6 +2850,7 @@ out_free:
 	kmem_cache_destroy(kvm_vcpu_cache);
 out_free_3:
 	unregister_reboot_notifier(&kvm_reboot_notifier);
+	unregister_slave_cpu_notifier(&kvm_slave_cpu_notifier);
 	unregister_cpu_notifier(&kvm_cpu_notifier);
 out_free_2:
 out_free_1:
@@ -2853,6 +2878,7 @@ void kvm_exit(void)
 	kvm_async_pf_deinit();
 	unregister_syscore_ops(&kvm_syscore_ops);
 	unregister_reboot_notifier(&kvm_reboot_notifier);
+	unregister_slave_cpu_notifier(&kvm_slave_cpu_notifier);
 	unregister_cpu_notifier(&kvm_cpu_notifier);
 	on_each_cpu(hardware_disable_nolock, NULL, 1);
 	kvm_arch_hardware_unsetup();




* [RFC v2 PATCH 06/21] KVM: Add facility to run guests on slave CPUs
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (4 preceding siblings ...)
  2012-09-06 11:27 ` [RFC v2 PATCH 05/21] KVM: Enable/Disable virtualization on slave CPUs are activated/dying Tomoki Sekiyama
@ 2012-09-06 11:27 ` Tomoki Sekiyama
  2012-09-06 11:27 ` [RFC v2 PATCH 07/21] KVM: handle page faults of slave guests on online CPUs Tomoki Sekiyama
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:27 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Add a path to migrate execution of vcpu_enter_guest to a slave CPU when
vcpu->arch.slave_cpu is set.

After moving to the slave CPU, execution comes back to an online CPU when
the guest exits for reasons that cannot be handled by the slave CPU alone
(e.g. handling async page faults).

On migration, kvm_arch_vcpu_put_migrate is used to avoid sending an IPI to
clear the loaded VMCS from the old CPU; instead, it clears the VMCS
immediately.
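
For orientation before reading the diff, the online-CPU side of this
hand-off boils down to roughly the following sequence; this is only a
simplified restatement of vcpu_enter_guest_slave() below, with SRCU,
preemption and signal handling omitted.

/* simplified from vcpu_enter_guest_slave() below; not the actual code */
static int run_guest_on_slave_sketch(struct kvm_vcpu *vcpu,
				     struct task_struct *task)
{
	struct __vcpu_enter_guest_args arg = {vcpu, task};

	/* drop the vcpu state from this online CPU without sending an IPI */
	kvm_arch_vcpu_put_migrate(vcpu);

	/* ask the dedicated slave CPU to run the guest entry loop ... */
	init_completion(&arg.wait);
	slave_cpu_call_function(vcpu->arch.slave_cpu,
				__vcpu_enter_guest_slave, &arg);

	/* ... and wait until it exits for a reason the slave cannot handle */
	wait_for_completion(&arg.wait);

	/* reload the vcpu here and continue the normal __vcpu_run() loop */
	kvm_arch_vcpu_load(vcpu, smp_processor_id());
	return arg.ret;
}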

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/kvm_host.h |    9 ++
 arch/x86/kernel/smp.c           |    2 
 arch/x86/kvm/vmx.c              |   10 ++
 arch/x86/kvm/x86.c              |  189 ++++++++++++++++++++++++++++++++++-----
 arch/x86/kvm/x86.h              |    9 ++
 include/linux/kvm_host.h        |    1 
 kernel/smp.c                    |    1 
 virt/kvm/async_pf.c             |    9 +-
 virt/kvm/kvm_main.c             |    3 -
 9 files changed, 203 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09155d6..72a0a64 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -354,6 +354,14 @@ struct kvm_vcpu_arch {
 	u64 ia32_misc_enable_msr;
 	bool tpr_access_reporting;
 
+#ifdef CONFIG_SLAVE_CPU
+	/* slave cpu dedicated to this vcpu */
+	int slave_cpu;
+#endif
+
+	/* user process tied to each vcpu */
+	struct task_struct *task;
+
 	/*
 	 * Paging state of the vcpu
 	 *
@@ -617,6 +625,7 @@ struct kvm_x86_ops {
 	void (*prepare_guest_switch)(struct kvm_vcpu *vcpu);
 	void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
 	void (*vcpu_put)(struct kvm_vcpu *vcpu);
+	void (*vcpu_put_migrate)(struct kvm_vcpu *vcpu);
 
 	void (*set_guest_debug)(struct kvm_vcpu *vcpu,
 				struct kvm_guest_debug *dbg);
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..a58dead 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -119,7 +119,7 @@ static bool smp_no_nmi_ipi = false;
  */
 static void native_smp_send_reschedule(int cpu)
 {
-	if (unlikely(cpu_is_offline(cpu))) {
+	if (unlikely(cpu_is_offline(cpu) && !cpu_slave(cpu))) {
 		WARN_ON(1);
 		return;
 	}
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c00f03d..c5db714 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1557,6 +1557,13 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 	}
 }
 
+static void vmx_vcpu_put_migrate(struct kvm_vcpu *vcpu)
+{
+	vmx_vcpu_put(vcpu);
+	__loaded_vmcs_clear(to_vmx(vcpu)->loaded_vmcs);
+	vcpu->cpu = -1;
+}
+
 static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
 {
 	ulong cr0;
@@ -5017,7 +5024,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 			return 0;
 		}
 
-		if (signal_pending(current))
+		if (signal_pending(vcpu->arch.task))
 			goto out;
 		if (need_resched())
 			schedule();
@@ -7263,6 +7270,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.prepare_guest_switch = vmx_save_host_state,
 	.vcpu_load = vmx_vcpu_load,
 	.vcpu_put = vmx_vcpu_put,
+	.vcpu_put_migrate = vmx_vcpu_put_migrate,
 
 	.set_guest_debug = set_guest_debug,
 	.get_msr = vmx_get_msr,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7501cc4..827b681 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -46,6 +46,7 @@
 #include <linux/uaccess.h>
 #include <linux/hash.h>
 #include <linux/pci.h>
+#include <linux/mmu_context.h>
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -62,6 +63,7 @@
 #include <asm/pvclock.h>
 #include <asm/div64.h>
 #include <asm/cpu.h>
+#include <asm/mmu.h>
 
 #define MAX_IO_MSRS 256
 #define KVM_MAX_MCE_BANKS 32
@@ -1655,6 +1657,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		if (unlikely(!sched_info_on()))
 			return 1;
 
+		if (vcpu_has_slave_cpu(vcpu))
+			break;
+
 		if (data & KVM_STEAL_RESERVED_MASK)
 			return 1;
 
@@ -2348,6 +2353,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	vcpu->arch.last_host_tsc = native_read_tsc();
 }
 
+void kvm_arch_vcpu_put_migrate(struct kvm_vcpu *vcpu)
+{
+	kvm_x86_ops->vcpu_put_migrate(vcpu);
+	kvm_put_guest_fpu(vcpu);
+	vcpu->arch.last_host_tsc = native_read_tsc();
+}
+
 static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
 				    struct kvm_lapic_state *s)
 {
@@ -5255,7 +5267,46 @@ static void process_nmi(struct kvm_vcpu *vcpu)
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 
-static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
+enum vcpu_enter_guest_slave_retval {
+	EXIT_TO_USER = 0,
+	LOOP_ONLINE,		/* vcpu_post_run is done in online cpu */
+	LOOP_SLAVE,		/* vcpu_post_run is done in slave cpu */
+	LOOP_APF,		    /* must handle async_pf in online cpu */
+	LOOP_RETRY,		    /* may in hlt state */
+};
+
+static int vcpu_post_run(struct kvm_vcpu *vcpu, struct task_struct *task,
+			 int *can_complete_async_pf)
+{
+	int r = LOOP_ONLINE;
+
+	clear_bit(KVM_REQ_PENDING_TIMER, &vcpu->requests);
+	if (kvm_cpu_has_pending_timer(vcpu))
+		kvm_inject_pending_timer_irqs(vcpu);
+
+	if (dm_request_for_irq_injection(vcpu)) {
+		r = -EINTR;
+		vcpu->run->exit_reason = KVM_EXIT_INTR;
+		++vcpu->stat.request_irq_exits;
+	}
+
+	if (can_complete_async_pf) {
+		*can_complete_async_pf = kvm_can_complete_async_pf(vcpu);
+		if (r == LOOP_ONLINE)
+			r = *can_complete_async_pf ? LOOP_APF : LOOP_SLAVE;
+	} else
+		kvm_check_async_pf_completion(vcpu);
+
+	if (signal_pending(task)) {
+		r = -EINTR;
+		vcpu->run->exit_reason = KVM_EXIT_INTR;
+		++vcpu->stat.signal_exits;
+	}
+
+	return r;
+}
+
+static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct task_struct *task)
 {
 	int r;
 	bool req_int_win = !irqchip_in_kernel(vcpu->kvm) &&
@@ -5345,7 +5396,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	local_irq_disable();
 
 	if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
-	    || need_resched() || signal_pending(current)) {
+	    || need_resched() || signal_pending(task)) {
 		vcpu->mode = OUTSIDE_GUEST_MODE;
 		smp_wmb();
 		local_irq_enable();
@@ -5429,10 +5480,95 @@ out:
 	return r;
 }
 
+#ifdef CONFIG_SLAVE_CPU
+
+struct __vcpu_enter_guest_args {
+	struct kvm_vcpu *vcpu;
+	struct task_struct *task;
+	struct completion wait;
+	int ret, apf_pending;
+};
 
-static int __vcpu_run(struct kvm_vcpu *vcpu)
+static void __vcpu_enter_guest_slave(void *_arg)
 {
+	struct __vcpu_enter_guest_args *arg = _arg;
+	struct kvm_vcpu *vcpu = arg->vcpu;
+	int r = LOOP_SLAVE;
+	int cpu = smp_processor_id();
+
+	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+	if (!tsk_used_math(current))
+		init_fpu(current);
+
+	use_mm(arg->task->mm);
+	kvm_arch_vcpu_load(vcpu, cpu);
+
+	while (r == LOOP_SLAVE) {
+		r = vcpu_enter_guest(vcpu, arg->task);
+
+		if (r <= 0)
+			break;
+
+		/* determine if slave cpu can handle the exit alone */
+		r = vcpu_post_run(vcpu, arg->task, &arg->apf_pending);
+
+		if (r == LOOP_SLAVE &&
+		    (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE ||
+		     vcpu->arch.apf.halted)) {
+			r = LOOP_RETRY;
+		}
+	}
+
+	kvm_arch_vcpu_put_migrate(vcpu);
+	unuse_mm(arg->task->mm);
+	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
+
+	arg->ret = r;
+	complete(&arg->wait);
+}
+
+static int vcpu_enter_guest_slave(struct kvm_vcpu *vcpu,
+				  struct task_struct *task, int *apf_pending)
+{
+	struct __vcpu_enter_guest_args arg = {vcpu, task};
+	int slave = vcpu->arch.slave_cpu;
 	int r;
+
+	BUG_ON((unsigned)slave >= nr_cpu_ids || !cpu_slave(slave));
+
+	preempt_disable();
+	preempt_notifier_unregister(&vcpu->preempt_notifier);
+	kvm_arch_vcpu_put_migrate(vcpu);
+	preempt_enable();
+
+	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
+	init_completion(&arg.wait);
+	slave_cpu_call_function(slave, __vcpu_enter_guest_slave, &arg);
+	r = wait_for_completion_interruptible(&arg.wait);
+	if (r) {
+		/* interrupted: kick and wait VM on the slave cpu */
+		kvm_vcpu_kick(vcpu);
+		wait_for_completion(&arg.wait);
+	}
+	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+	preempt_disable();
+	kvm_arch_vcpu_load(vcpu, smp_processor_id());
+	preempt_notifier_register(&vcpu->preempt_notifier);
+	preempt_enable();
+
+	r = arg.ret;
+	*apf_pending = arg.apf_pending;
+
+	return r;
+}
+
+#endif /* CONFIG_SLAVE_CPU */
+
+static int __vcpu_run(struct kvm_vcpu *vcpu, struct task_struct *task)
+{
+	int r, apf_pending = 0;
 	struct kvm *kvm = vcpu->kvm;
 
 	if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED)) {
@@ -5448,12 +5584,22 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 	vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
 	vapic_enter(vcpu);
 
-	r = 1;
+	r = LOOP_ONLINE;
 	while (r > 0) {
 		if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
-		    !vcpu->arch.apf.halted)
-			r = vcpu_enter_guest(vcpu);
-		else {
+		    !vcpu->arch.apf.halted) {
+#ifdef CONFIG_SLAVE_CPU
+			apf_pending = 0;
+			if (vcpu_has_slave_cpu(vcpu)) {
+				r = vcpu_enter_guest_slave(vcpu, task,
+							   &apf_pending);
+				if (r == LOOP_RETRY)
+					continue;
+			} else
+#endif
+				r = vcpu_enter_guest(vcpu, task);
+		} else {
+			r = LOOP_ONLINE;
 			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
 			kvm_vcpu_block(vcpu);
 			vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
@@ -5474,26 +5620,15 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 			}
 		}
 
+		if (apf_pending)
+			kvm_check_async_pf_completion(vcpu);
+
 		if (r <= 0)
 			break;
 
-		clear_bit(KVM_REQ_PENDING_TIMER, &vcpu->requests);
-		if (kvm_cpu_has_pending_timer(vcpu))
-			kvm_inject_pending_timer_irqs(vcpu);
+		if (r == LOOP_ONLINE)
+			r = vcpu_post_run(vcpu, task, NULL);
 
-		if (dm_request_for_irq_injection(vcpu)) {
-			r = -EINTR;
-			vcpu->run->exit_reason = KVM_EXIT_INTR;
-			++vcpu->stat.request_irq_exits;
-		}
-
-		kvm_check_async_pf_completion(vcpu);
-
-		if (signal_pending(current)) {
-			r = -EINTR;
-			vcpu->run->exit_reason = KVM_EXIT_INTR;
-			++vcpu->stat.signal_exits;
-		}
 		if (need_resched()) {
 			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
 			kvm_resched(vcpu);
@@ -5595,8 +5730,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	if (r <= 0)
 		goto out;
 
-	r = __vcpu_run(vcpu);
-
+	r = __vcpu_run(vcpu, current);
 out:
 	post_kvm_run_save(vcpu);
 	if (vcpu->sigset_active)
@@ -6035,6 +6169,7 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
 	r = kvm_arch_vcpu_reset(vcpu);
 	if (r == 0)
 		r = kvm_mmu_setup(vcpu);
+	vcpu->arch.task = current;
 	vcpu_put(vcpu);
 
 	return r;
@@ -6217,6 +6352,10 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
 	kvm_set_tsc_khz(vcpu, max_tsc_khz);
 
+#ifdef CONFIG_SLAVE_CPU
+	vcpu->arch.slave_cpu = -1;
+#endif
+
 	r = kvm_mmu_create(vcpu);
 	if (r < 0)
 		goto fail_free_pio_data;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 3d1134d..3ae93d4 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -108,6 +108,15 @@ static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
 	return false;
 }
 
+static inline bool vcpu_has_slave_cpu(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_SLAVE_CPU
+	return vcpu->arch.slave_cpu >= 0;
+#else
+	return 0;
+#endif
+}
+
 void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b70b48b..a60743f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -119,6 +119,7 @@ struct kvm_async_pf {
 };
 
 void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
+int kvm_can_complete_async_pf(struct kvm_vcpu *vcpu);
 void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
 int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
 		       struct kvm_arch_async_pf *arch);
diff --git a/kernel/smp.c b/kernel/smp.c
index fda7a8d..95a9a39 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -431,6 +431,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data,
 	}
 	put_cpu();
 }
+EXPORT_SYMBOL(__smp_call_function_single);
 
 /**
  * smp_call_function_many(): Run a function on a set of other CPUs.
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 74268b4..feb5e76 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -120,12 +120,17 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
 	vcpu->async_pf.queued = 0;
 }
 
+int kvm_can_complete_async_pf(struct kvm_vcpu *vcpu)
+{
+	return !list_empty_careful(&vcpu->async_pf.done) &&
+		kvm_arch_can_inject_async_page_present(vcpu);
+}
+
 void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 {
 	struct kvm_async_pf *work;
 
-	while (!list_empty_careful(&vcpu->async_pf.done) &&
-	      kvm_arch_can_inject_async_page_present(vcpu)) {
+	while (kvm_can_complete_async_pf(vcpu)) {
 		spin_lock(&vcpu->async_pf.lock);
 		work = list_first_entry(&vcpu->async_pf.done, typeof(*work),
 					      link);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dc86e9a..debbee1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1543,7 +1543,8 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 	}
 
 	me = get_cpu();
-	if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
+	if (cpu != me && (unsigned)cpu < nr_cpu_ids &&
+	    (cpu_online(cpu) || cpu_slave(cpu)))
 		if (kvm_arch_vcpu_should_kick(vcpu))
 			smp_send_reschedule(cpu);
 	put_cpu();




* [RFC v2 PATCH 07/21] KVM: handle page faults of slave guests on online CPUs
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (5 preceding siblings ...)
  2012-09-06 11:27 ` [RFC v2 PATCH 06/21] KVM: Add facility to run guests on slave CPUs Tomoki Sekiyama
@ 2012-09-06 11:27 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 08/21] KVM: Add KVM_GET_SLAVE_CPU and KVM_SET_SLAVE_CPU to vCPU ioctl Tomoki Sekiyama
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:27 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Page faults raised by a guest running on a slave CPU cannot be handled
on that slave CPU, because the slave CPU runs in the idle process
context.

With this patch, a page fault that happens on a slave CPU is recorded in
struct kvm_access_fault and notified to an online CPU, where it is
handled after the user process for the guest is resumed.

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/kvm_host.h |   15 +++++++++++++++
 arch/x86/kvm/mmu.c              |   13 +++++++++++++
 arch/x86/kvm/x86.c              |   10 ++++++++++
 3 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 72a0a64..8dc1a0a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -67,6 +67,11 @@
 
 #define UNMAPPED_GVA (~(gpa_t)0)
 
+#ifdef CONFIG_SLAVE_CPU
+/* Requests to handle VM exit on online cpu */
+#define KVM_REQ_HANDLE_PF	32
+#endif
+
 /* KVM Hugepage definitions for x86 */
 #define KVM_NR_PAGE_SIZES	3
 #define KVM_HPAGE_GFN_SHIFT(x)	(((x) - 1) * 9)
@@ -413,6 +418,16 @@ struct kvm_vcpu_arch {
 		u8 nr;
 	} interrupt;
 
+#ifdef CONFIG_SLAVE_CPU
+	/* used for recording page fault on offline CPU */
+	struct kvm_access_fault {
+		gva_t cr2;
+		u32 error_code;
+		void *insn;
+		int insn_len;
+	} page_fault;
+#endif
+
 	int halt_request; /* real mode on Intel only */
 
 	int cpuid_nent;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7fbd0d2..eb1d397 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3946,6 +3946,19 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
 	int r, emulation_type = EMULTYPE_RETRY;
 	enum emulation_result er;
 
+#ifdef CONFIG_SLAVE_CPU
+	if (cpu_slave(smp_processor_id())) {
+		/* Page fault must be handled on user-process context. */
+		r = -EFAULT;
+		vcpu->arch.page_fault.cr2 = cr2;
+		vcpu->arch.page_fault.error_code = error_code;
+		vcpu->arch.page_fault.insn = insn;
+		vcpu->arch.page_fault.insn_len = insn_len;
+		kvm_make_request(KVM_REQ_HANDLE_PF, vcpu);
+		goto out;
+	}
+#endif
+
 	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
 	if (r < 0)
 		goto out;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 827b681..579c41c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5561,6 +5561,16 @@ static int vcpu_enter_guest_slave(struct kvm_vcpu *vcpu,
 	r = arg.ret;
 	*apf_pending = arg.apf_pending;
 
+	if (r == -EFAULT && kvm_check_request(KVM_REQ_HANDLE_PF, vcpu)) {
+		pr_debug("handling page fault request @%p\n",
+			 (void *)vcpu->arch.page_fault.cr2);
+		r = kvm_mmu_page_fault(vcpu,
+				       vcpu->arch.page_fault.cr2,
+				       vcpu->arch.page_fault.error_code,
+				       vcpu->arch.page_fault.insn,
+				       vcpu->arch.page_fault.insn_len);
+	}
+
 	return r;
 }
 




* [RFC v2 PATCH 08/21] KVM: Add KVM_GET_SLAVE_CPU and KVM_SET_SLAVE_CPU to vCPU ioctl
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (6 preceding siblings ...)
  2012-09-06 11:27 ` [RFC v2 PATCH 07/21] KVM: handle page faults of slave guests on online CPUs Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 09/21] KVM: Go back to online CPU on VM exit by external interrupt Tomoki Sekiyama
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Add an interface to set/get the slave CPU dedicated to a vCPU.

Calling ioctl with KVM_GET_SLAVE_CPU returns the slave CPU id dedicated
to the vCPU, or -1 if no slave CPU is set.

Calling ioctl with KVM_SET_SLAVE_CPU dedicates the specified slave CPU
to the vCPU. The CPU must be offlined before the ioctl is called; it is
activated as a slave CPU for the vCPU when a valid id is set. Setting
-1 as the slave CPU id releases the slave CPU and takes it back offline.

Whether KVM supports getting/setting slave CPUs can be checked via
KVM_CAP_SLAVE_CPU.
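
A minimal userspace sketch of the intended usage (not part of this
patch; it assumes headers with this series applied, a kvm_fd opened on
/dev/kvm, a vcpu_fd obtained via KVM_CREATE_VCPU, and only minimal
error handling):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int bind_vcpu_to_slave_cpu(int kvm_fd, int vcpu_fd, int cpu)
{
        int ret;

        /* Check whether this KVM supports slave CPUs at all. */
        if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SLAVE_CPU) <= 0)
                return -1;

        /* 'cpu' must already be offlined (echo 0 > .../cpuN/online). */
        ret = ioctl(vcpu_fd, KVM_SET_SLAVE_CPU, cpu);
        if (ret < 0) {
                perror("KVM_SET_SLAVE_CPU");
                return ret;
        }

        /* Read back the dedicated slave CPU (-1 means none is set). */
        printf("slave cpu: %d\n", ioctl(vcpu_fd, KVM_GET_SLAVE_CPU, 0));
        return 0;
}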

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/vmx.c              |    7 +++++
 arch/x86/kvm/x86.c              |   58 +++++++++++++++++++++++++++++++++++++++
 include/linux/kvm.h             |    4 +++
 4 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8dc1a0a..0ea04c9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -718,6 +718,8 @@ struct kvm_x86_ops {
 	int (*check_intercept)(struct kvm_vcpu *vcpu,
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage);
+
+	void (*set_slave_mode)(struct kvm_vcpu *vcpu, bool slave);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c5db714..7bbfa01 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1698,6 +1698,11 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
 	vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+static void vmx_set_slave_mode(struct kvm_vcpu *vcpu, bool slave)
+{
+	/* Nothing */
+}
+
 /*
  * KVM wants to inject page-faults which it got to the guest. This function
  * checks whether in a nested guest, we need to inject them to L1 or L2.
@@ -7344,6 +7349,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.set_tdp_cr3 = vmx_set_cr3,
 
 	.check_intercept = vmx_check_intercept,
+
+	.set_slave_mode = vmx_set_slave_mode,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 579c41c..b62f59c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2183,6 +2183,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_GET_TSC_KHZ:
 	case KVM_CAP_PCI_2_3:
 	case KVM_CAP_KVMCLOCK_CTRL:
+#ifdef CONFIG_SLAVE_CPU
+	case KVM_CAP_SLAVE_CPU:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -2657,6 +2660,48 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+#ifdef CONFIG_SLAVE_CPU
+/* vcpu currently running on each slave CPU */
+static DEFINE_PER_CPU(struct kvm_vcpu *, slave_vcpu);
+
+static int kvm_arch_vcpu_ioctl_set_slave_cpu(struct kvm_vcpu *vcpu,
+					     int slave, int set_slave_mode)
+{
+	int old = vcpu->arch.slave_cpu;
+	int r = -EINVAL;
+
+	if (slave >= nr_cpu_ids || (slave >= 0 && cpu_online(slave)))
+		goto out;
+	if (slave >= 0 && slave != old && cpu_slave(slave))
+		goto out; /* new slave cpu must be offlined */
+
+	if (old >= 0 && slave != old) {
+		BUG_ON(old >= nr_cpu_ids || !cpu_slave(old));
+		per_cpu(slave_vcpu, old) = NULL;
+		r = slave_cpu_down(old);
+		if (r) {
+			pr_err("kvm: slave_cpu_down %d failed\n", old);
+			goto out;
+		}
+	}
+
+	if (slave >= 0) {
+		r = slave_cpu_up(slave);
+		if (r)
+			goto out;
+		BUG_ON(!cpu_slave(slave));
+		per_cpu(slave_vcpu, slave) = vcpu;
+	}
+
+	vcpu->arch.slave_cpu = slave;
+	if (set_slave_mode && kvm_x86_ops->set_slave_mode)
+		kvm_x86_ops->set_slave_mode(vcpu, slave >= 0);
+out:
+	return r;
+}
+
+#endif
+
 long kvm_arch_vcpu_ioctl(struct file *filp,
 			 unsigned int ioctl, unsigned long arg)
 {
@@ -2937,6 +2982,16 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = kvm_set_guest_paused(vcpu);
 		goto out;
 	}
+#ifdef CONFIG_SLAVE_CPU
+	case KVM_SET_SLAVE_CPU: {
+		r = kvm_arch_vcpu_ioctl_set_slave_cpu(vcpu, (int)arg, 1);
+		goto out;
+	}
+	case KVM_GET_SLAVE_CPU: {
+		r = vcpu->arch.slave_cpu;
+		goto out;
+	}
+#endif
 	default:
 		r = -EINVAL;
 	}
@@ -6154,6 +6209,9 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	kvmclock_reset(vcpu);
+#ifdef CONFIG_SLAVE_CPU
+	kvm_arch_vcpu_ioctl_set_slave_cpu(vcpu, -1, 0);
+#endif
 
 	free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
 	fx_free(vcpu);
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 2ce09aa..c2d1604 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
+#define KVM_CAP_SLAVE_CPU 81
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -904,6 +905,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_SET_ONE_REG		  _IOW(KVMIO,  0xac, struct kvm_one_reg)
 /* VM is being stopped by host */
 #define KVM_KVMCLOCK_CTRL	  _IO(KVMIO,   0xad)
+/* Available with KVM_CAP_SLAVE_CPU */
+#define KVM_GET_SLAVE_CPU	  _IO(KVMIO,  0xae)
+#define KVM_SET_SLAVE_CPU	  _IO(KVMIO,  0xaf)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)




* [RFC v2 PATCH 09/21] KVM: Go back to online CPU on VM exit by external interrupt
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (7 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 08/21] KVM: Add KVM_GET_SLAVE_CPU and KVM_SET_SLAVE_CPU to vCPU ioctl Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 10/21] KVM: proxy slab operations for slave CPUs on online CPUs Tomoki Sekiyama
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

If a slave CPU receives an interrupt while running a guest, the current
implementation must first go back to an online CPU to handle the
interrupt.

This behavior is replaced by a later patch, which introduces a direct
interrupt handling mechanism for the guest.

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/vmx.c              |    1 +
 arch/x86/kvm/x86.c              |    6 ++++++
 3 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0ea04c9..af68ffb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -358,6 +358,7 @@ struct kvm_vcpu_arch {
 	int sipi_vector;
 	u64 ia32_misc_enable_msr;
 	bool tpr_access_reporting;
+	bool interrupted;
 
 #ifdef CONFIG_SLAVE_CPU
 	/* slave cpu dedicated to this vcpu */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7bbfa01..d99bee6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4408,6 +4408,7 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 
 static int handle_external_interrupt(struct kvm_vcpu *vcpu)
 {
+	vcpu->arch.interrupted = true;
 	++vcpu->stat.irq_exits;
 	return 1;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b62f59c..db0be81 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5566,6 +5566,12 @@ static void __vcpu_enter_guest_slave(void *_arg)
 			break;
 
 		/* determine if slave cpu can handle the exit alone */
+		if (vcpu->arch.interrupted) {
+			vcpu->arch.interrupted = false;
+			arg->ret = LOOP_ONLINE;
+			break;
+		}
+
 		r = vcpu_post_run(vcpu, arg->task, &arg->apf_pending);
 
 		if (r == LOOP_SLAVE &&




* [RFC v2 PATCH 10/21] KVM: proxy slab operations for slave CPUs on online CPUs
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (8 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 09/21] KVM: Go back to online CPU on VM exit by external interrupt Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 11/21] KVM: no exiting from guest when slave CPU halted Tomoki Sekiyama
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Add fix-ups that proxy slab operations to online CPUs on behalf of the
guest, to avoid touching the slab on slave CPUs, where some slab
functions are not activated.

Currently, the slab may be touched on slave CPUs in the following three
cases. For each case, the fix-up below is introduced:

* kvm_mmu_commit_zap_page
    Instead of committing zapped pages on the slave CPU, the pages are
    added to the invalid_mmu_pages list and a KVM_REQ_COMMIT_ZAP_PAGE
    request is made. The pages are then freed on an online CPU once the
    vCPU thread resumes execution there.

* mmu_topup_memory_caches
    Caches for MMU operations are preallocated in vcpu_enter_guest_slave,
    which runs on an online CPU before entering the guest.

* kvm_async_pf_wakeup_all
    If this function is called on a slave CPU, it makes a
    KVM_REQ_WAKEUP_APF request, which is handled by calling
    kvm_async_pf_wakeup_all on an online CPU.

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/kvm_host.h |    5 ++++
 arch/x86/kvm/mmu.c              |   52 ++++++++++++++++++++++++++++-----------
 arch/x86/kvm/mmu.h              |    4 +++
 arch/x86/kvm/x86.c              |   15 +++++++++++
 virt/kvm/async_pf.c             |    8 ++++++
 5 files changed, 69 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index af68ffb..5ce89f1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -70,6 +70,8 @@
 #ifdef CONFIG_SLAVE_CPU
 /* Requests to handle VM exit on online cpu */
 #define KVM_REQ_HANDLE_PF	32
+#define KVM_REQ_COMMIT_ZAP_PAGE	33
+#define KVM_REQ_WAKEUP_APF	34
 #endif
 
 /* KVM Hugepage definitions for x86 */
@@ -542,6 +544,9 @@ struct kvm_arch {
 	 * Hash table of struct kvm_mmu_page.
 	 */
 	struct list_head active_mmu_pages;
+#ifdef CONFIG_SLAVE_CPU
+	struct list_head invalid_mmu_pages;
+#endif
 	struct list_head assigned_dev_head;
 	struct iommu_domain *iommu_domain;
 	int iommu_flags;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index eb1d397..871483a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -612,6 +612,10 @@ static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
 
 	if (cache->nobjs >= min)
 		return 0;
+#ifdef CONFIG_SLAVE_CPU
+	if (cpu_slave(raw_smp_processor_id()))
+		return -ENOMEM;
+#endif
 	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
 		obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
 		if (!obj)
@@ -655,7 +659,7 @@ static void mmu_free_memory_cache_page(struct kvm_mmu_memory_cache *mc)
 		free_page((unsigned long)mc->objects[--mc->nobjs]);
 }
 
-static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
+int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
 {
 	int r;
 
@@ -1617,7 +1621,7 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 
 static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 				    struct list_head *invalid_list);
-static void kvm_mmu_commit_zap_page(struct kvm *kvm,
+static void kvm_mmu_commit_zap_page(struct kvm *kvm, struct kvm_vcpu *vcpu,
 				    struct list_head *invalid_list);
 
 #define for_each_gfn_sp(kvm, sp, gfn, pos)				\
@@ -1660,7 +1664,7 @@ static int kvm_sync_page_transient(struct kvm_vcpu *vcpu,
 
 	ret = __kvm_sync_page(vcpu, sp, &invalid_list, false);
 	if (ret)
-		kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+		kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
 
 	return ret;
 }
@@ -1700,7 +1704,7 @@ static void kvm_sync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
 		flush = true;
 	}
 
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+	kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
 	if (flush)
 		kvm_mmu_flush_tlb(vcpu);
 }
@@ -1787,7 +1791,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 			kvm_sync_page(vcpu, sp, &invalid_list);
 			mmu_pages_clear_parents(&parents);
 		}
-		kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+		kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
 		cond_resched_lock(&vcpu->kvm->mmu_lock);
 		kvm_mmu_pages_init(parent, &parents, &pages);
 	}
@@ -2064,7 +2068,7 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	return ret;
 }
 
-static void kvm_mmu_commit_zap_page(struct kvm *kvm,
+static void kvm_mmu_commit_zap_page(struct kvm *kvm, struct kvm_vcpu *vcpu,
 				    struct list_head *invalid_list)
 {
 	struct kvm_mmu_page *sp;
@@ -2078,6 +2082,16 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 	 */
 	smp_mb();
 
+#ifdef CONFIG_SLAVE_CPU
+	if (cpu_slave(raw_smp_processor_id())) {
+		/* Avoid touching kmem_cache on slave cpu */
+		list_splice_init(invalid_list, &kvm->arch.invalid_mmu_pages);
+		if (vcpu)
+			kvm_make_request(KVM_REQ_COMMIT_ZAP_PAGE, vcpu);
+		return;
+	}
+#endif
+
 	/*
 	 * Wait for all vcpus to exit guest mode and/or lockless shadow
 	 * page table walks.
@@ -2092,6 +2106,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 	} while (!list_empty(invalid_list));
 }
 
+#ifdef CONFIG_SLAVE_CPU
+void kvm_mmu_commit_zap_page_late(struct kvm_vcpu *vcpu)
+{
+	kvm_mmu_commit_zap_page(vcpu->kvm, vcpu,
+				&vcpu->kvm->arch.invalid_mmu_pages);
+}
+#endif
+
 /*
  * Changing the number of mmu pages allocated to the vm
  * Note: if goal_nr_mmu_pages is too small, you will get dead lock
@@ -2114,7 +2136,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int goal_nr_mmu_pages)
 					    struct kvm_mmu_page, link);
 			kvm_mmu_prepare_zap_page(kvm, page, &invalid_list);
 		}
-		kvm_mmu_commit_zap_page(kvm, &invalid_list);
+		kvm_mmu_commit_zap_page(kvm, NULL, &invalid_list);
 		goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages;
 	}
 
@@ -2137,7 +2159,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
 		r = 1;
 		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 	}
-	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+	kvm_mmu_commit_zap_page(kvm, NULL, &invalid_list);
 	spin_unlock(&kvm->mmu_lock);
 
 	return r;
@@ -2861,7 +2883,7 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
 		--sp->root_count;
 		if (!sp->root_count && sp->role.invalid) {
 			kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
-			kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+			kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
 		}
 		vcpu->arch.mmu.root_hpa = INVALID_PAGE;
 		spin_unlock(&vcpu->kvm->mmu_lock);
@@ -2880,7 +2902,7 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
 		}
 		vcpu->arch.mmu.pae_root[i] = INVALID_PAGE;
 	}
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+	kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 	vcpu->arch.mmu.root_hpa = INVALID_PAGE;
 }
@@ -3895,7 +3917,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		}
 	}
 	mmu_pte_write_flush_tlb(vcpu, zap_page, remote_flush, local_flush);
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+	kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
 	kvm_mmu_audit(vcpu, AUDIT_POST_PTE_WRITE);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 }
@@ -3929,7 +3951,7 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
 		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
 		++vcpu->kvm->stat.mmu_recycled;
 	}
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+	kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
 }
 
 static bool is_mmio_page_fault(struct kvm_vcpu *vcpu, gva_t addr)
@@ -3947,7 +3969,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
 	enum emulation_result er;
 
 #ifdef CONFIG_SLAVE_CPU
-	if (cpu_slave(smp_processor_id())) {
+	if (cpu_slave(raw_smp_processor_id())) {
 		/* Page fault must be handled on user-process context. */
 		r = -EFAULT;
 		vcpu->arch.page_fault.cr2 = cr2;
@@ -4094,7 +4116,7 @@ restart:
 		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
 			goto restart;
 
-	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+	kvm_mmu_commit_zap_page(kvm, NULL, &invalid_list);
 	spin_unlock(&kvm->mmu_lock);
 }
 
@@ -4146,7 +4168,7 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
 		spin_lock(&kvm->mmu_lock);
 
 		kvm_mmu_remove_some_alloc_mmu_pages(kvm, &invalid_list);
-		kvm_mmu_commit_zap_page(kvm, &invalid_list);
+		kvm_mmu_commit_zap_page(kvm, NULL, &invalid_list);
 
 		spin_unlock(&kvm->mmu_lock);
 		srcu_read_unlock(&kvm->srcu, idx);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e374db9..32efc36 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -52,6 +52,10 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int mmu_topup_memory_caches(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_SLAVE_CPU
+void kvm_mmu_commit_zap_page_late(struct kvm_vcpu *vcpu);
+#endif
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db0be81..a6b2521 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5571,6 +5571,11 @@ static void __vcpu_enter_guest_slave(void *_arg)
 			arg->ret = LOOP_ONLINE;
 			break;
 		}
+		if (test_bit(KVM_REQ_COMMIT_ZAP_PAGE, &vcpu->requests) ||
+		    test_bit(KVM_REQ_WAKEUP_APF,      &vcpu->requests)) {
+			r = LOOP_ONLINE;
+			break;
+		}
 
 		r = vcpu_post_run(vcpu, arg->task, &arg->apf_pending);
 
@@ -5598,6 +5603,9 @@ static int vcpu_enter_guest_slave(struct kvm_vcpu *vcpu,
 
 	BUG_ON((unsigned)slave >= nr_cpu_ids || !cpu_slave(slave));
 
+	/* Refill memory caches here to avoid calling slab on slave cpu */
+	mmu_topup_memory_caches(vcpu);
+
 	preempt_disable();
 	preempt_notifier_unregister(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_put_migrate(vcpu);
@@ -5631,6 +5639,10 @@ static int vcpu_enter_guest_slave(struct kvm_vcpu *vcpu,
 				       vcpu->arch.page_fault.insn,
 				       vcpu->arch.page_fault.insn_len);
 	}
+	if (kvm_check_request(KVM_REQ_WAKEUP_APF, vcpu))
+		kvm_async_pf_wakeup_all(vcpu);
+	if (kvm_check_request(KVM_REQ_COMMIT_ZAP_PAGE, vcpu))
+		kvm_mmu_commit_zap_page_late(vcpu);
 
 	return r;
 }
@@ -6486,6 +6498,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 		return -EINVAL;
 
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+#ifdef CONFIG_SLAVE_CPU
+	INIT_LIST_HEAD(&kvm->arch.invalid_mmu_pages);
+#endif
 	INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
 
 	/* Reserve bit 0 of irq_sources_bitmap for userspace irq source */
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index feb5e76..97b3a77 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -204,6 +204,14 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
 	if (!list_empty_careful(&vcpu->async_pf.done))
 		return 0;
 
+#ifdef CONFIG_SLAVE_CPU
+	if (cpu_slave(raw_smp_processor_id())) {
+		/* Redo on online cpu to avoid kmem_cache_alloc on slave cpu */
+		kvm_make_request(KVM_REQ_WAKEUP_APF, vcpu);
+		return -ENOMEM;
+	}
+#endif
+
 	work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
 	if (!work)
 		return -ENOMEM;




* [RFC v2 PATCH 11/21] KVM: no exiting from guest when slave CPU halted
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (9 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 10/21] KVM: proxy slab operations for slave CPUs on online CPUs Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 12/21] x86/apic: Enable external interrupt routing to slave CPUs Tomoki Sekiyama
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Avi Kivity,
	Marcelo Tosatti, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

Avoid exiting from a guest running on a slave CPU even when the HLT
instruction is executed. Since the slave CPU is dedicated to a single
vCPU, exiting on HLT is not required, and avoiding the VM exit improves
the guest's performance.

This is a partial revert of

        10166744b80a ("KVM: VMX: remove yield_on_hlt")

Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/kvm/vmx.c |   25 ++++++++++++++++++++++++-
 1 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d99bee6..03a2d02 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1698,9 +1698,29 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
 	vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+static inline void vmx_clear_hlt(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_SLAVE_CPU
+	/* Ensure that we clear the HLT state in the VMCS.  We don't need to
+	 * explicitly skip the instruction because if the HLT state is set,
+	 * then the instruction is already executing and RIP has already been
+	 * advanced. */
+	if (vcpu->arch.slave_cpu >= 0 &&
+	    vmcs_read32(GUEST_ACTIVITY_STATE) == GUEST_ACTIVITY_HLT)
+		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
+#endif
+}
+
 static void vmx_set_slave_mode(struct kvm_vcpu *vcpu, bool slave)
 {
-	/* Nothing */
+	/* Don't intercept the guest's halt on slave CPU */
+	if (slave) {
+		vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
+				CPU_BASED_HLT_EXITING);
+	} else {
+		vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL,
+			      CPU_BASED_HLT_EXITING);
+	}
 }
 
 /*
@@ -1755,6 +1775,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 		intr_info |= INTR_TYPE_HARD_EXCEPTION;
 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
+	vmx_clear_hlt(vcpu);
 }
 
 static bool vmx_rdtscp_supported(void)
@@ -4125,6 +4146,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu)
 	} else
 		intr |= INTR_TYPE_EXT_INTR;
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr);
+	vmx_clear_hlt(vcpu);
 }
 
 static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
@@ -4156,6 +4178,7 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
 	}
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
 			INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR);
+	vmx_clear_hlt(vcpu);
 }
 
 static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)




* [RFC v2 PATCH 12/21] x86/apic: Enable external interrupt routing to slave CPUs
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (10 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 11/21] KVM: no exiting from guest when slave CPU halted Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 13/21] x86/apic: IRQ vector remapping on slave for " Tomoki Sekiyama
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Enable the APIC to handle interrupts on slave CPUs, and enable
interrupt routing to slave CPUs by setting the IRQ affinity.

Since slave CPUs running a KVM guest handle external interrupts
directly in the vCPU, the guest's vector/IRQ mapping differs from the
host's. That requires each interrupt to be routed either to online CPUs
or to slave CPUs, but not to a mixture of both.

With this patch, if the specified affinity mask contains any online
CPUs, the affinity is applied to those online CPUs only. If every
specified CPU is a slave CPU, the IRQ is routed to the slave CPUs.
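
For illustration only (the IRQ number and CPU below would be values
chosen by the administrator or management tool): a passed-through
device's IRQ is routed to a slave CPU by writing a slave-only mask to
procfs, e.g. from C:

#include <stdio.h>

/* Route 'irq' to the single slave CPU 'cpu' (cpu < 32 assumed here).
 * With a mask that contains no online CPUs, __ioapic_set_affinity()
 * or intel_ioapic_set_affinity() assigns the vector on the slave CPU;
 * if the mask also contained online CPUs, only those would be used. */
static int route_irq_to_slave(unsigned int irq, unsigned int cpu)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%x\n", 1u << cpu);
        fclose(f);
        return 0;
}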

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/apic.h           |    6 ++---
 arch/x86/kernel/apic/io_apic.c        |   43 ++++++++++++++++++++++++---------
 arch/x86/kernel/apic/x2apic_cluster.c |    8 +++---
 drivers/iommu/intel_irq_remapping.c   |   30 +++++++++++++++++++----
 kernel/irq/manage.c                   |    4 ++-
 kernel/irq/migration.c                |    2 +-
 kernel/irq/proc.c                     |    2 +-
 7 files changed, 67 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index f342612..d37ae5c 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -535,7 +535,7 @@ extern void generic_bigsmp_probe(void);
 static inline const struct cpumask *default_target_cpus(void)
 {
 #ifdef CONFIG_SMP
-	return cpu_online_mask;
+	return cpu_online_or_slave_mask;
 #else
 	return cpumask_of(0);
 #endif
@@ -543,7 +543,7 @@ static inline const struct cpumask *default_target_cpus(void)
 
 static inline const struct cpumask *online_target_cpus(void)
 {
-	return cpu_online_mask;
+	return cpu_online_or_slave_mask;
 }
 
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
@@ -602,7 +602,7 @@ flat_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
 {
 	unsigned long cpu_mask = cpumask_bits(cpumask)[0] &
 				 cpumask_bits(andmask)[0] &
-				 cpumask_bits(cpu_online_mask)[0] &
+				 cpumask_bits(cpu_online_or_slave_mask)[0] &
 				 APIC_ALL_CPUS;
 
 	if (likely(cpu_mask)) {
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index c265593..0cd2682 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1125,7 +1125,7 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 	/* Only try and allocate irqs on cpus that are present */
 	err = -ENOSPC;
 	cpumask_clear(cfg->old_domain);
-	cpu = cpumask_first_and(mask, cpu_online_mask);
+	cpu = cpumask_first_and(mask, cpu_online_or_slave_mask);
 	while (cpu < nr_cpu_ids) {
 		int new_cpu, vector, offset;
 
@@ -1158,14 +1158,14 @@ next:
 		if (unlikely(current_vector == vector)) {
 			cpumask_or(cfg->old_domain, cfg->old_domain, tmp_mask);
 			cpumask_andnot(tmp_mask, mask, cfg->old_domain);
-			cpu = cpumask_first_and(tmp_mask, cpu_online_mask);
+			cpu = cpumask_first_and(tmp_mask, cpu_online_or_slave_mask);
 			continue;
 		}
 
 		if (test_bit(vector, used_vectors))
 			goto next;
 
-		for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
+		for_each_cpu_and(new_cpu, tmp_mask, cpu_online_or_slave_mask)
 			if (per_cpu(vector_irq, new_cpu)[vector] != -1)
 				goto next;
 		/* Found one! */
@@ -1175,7 +1175,7 @@ next:
 			cfg->move_in_progress = 1;
 			cpumask_copy(cfg->old_domain, cfg->domain);
 		}
-		for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
+		for_each_cpu_and(new_cpu, tmp_mask, cpu_online_or_slave_mask)
 			per_cpu(vector_irq, new_cpu)[vector] = irq;
 		cfg->vector = vector;
 		cpumask_copy(cfg->domain, tmp_mask);
@@ -1204,7 +1204,7 @@ static void __clear_irq_vector(int irq, struct irq_cfg *cfg)
 	BUG_ON(!cfg->vector);
 
 	vector = cfg->vector;
-	for_each_cpu_and(cpu, cfg->domain, cpu_online_mask)
+	for_each_cpu_and(cpu, cfg->domain, cpu_online_or_slave_mask)
 		per_cpu(vector_irq, cpu)[vector] = -1;
 
 	cfg->vector = 0;
@@ -1212,7 +1212,7 @@ static void __clear_irq_vector(int irq, struct irq_cfg *cfg)
 
 	if (likely(!cfg->move_in_progress))
 		return;
-	for_each_cpu_and(cpu, cfg->old_domain, cpu_online_mask) {
+	for_each_cpu_and(cpu, cfg->old_domain, cpu_online_or_slave_mask) {
 		for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS;
 								vector++) {
 			if (per_cpu(vector_irq, cpu)[vector] != irq)
@@ -2354,28 +2354,47 @@ int __ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
 {
 	struct irq_cfg *cfg = data->chip_data;
 	unsigned int irq = data->irq;
+	cpumask_var_t new_mask;
+	bool mask_ok = false;
 	int err;
 
 	if (!config_enabled(CONFIG_SMP))
 		return -1;
 
-	if (!cpumask_intersects(mask, cpu_online_mask))
+	if (!cpumask_intersects(mask, cpu_online_or_slave_mask))
 		return -EINVAL;
 
+#ifdef CONFIG_SLAVE_CPU
+	/* Set affinity to either online cpus or slave cpus */
+	mask_ok = alloc_cpumask_var(&new_mask, GFP_ATOMIC);
+	if (mask_ok) {
+		cpumask_and(new_mask, mask, cpu_online_mask);
+		if (cpumask_empty(new_mask))
+			cpumask_copy(new_mask, mask);
+		err = assign_irq_vector(irq, cfg, new_mask);
+	} else if (!cpumask_intersects(mask, cpu_online_mask) ||
+		   !cpumask_intersects(mask, cpu_slave_mask))
+		err = assign_irq_vector(irq, cfg, mask);
+	else
+		return -ENOMEM;
+#else
 	err = assign_irq_vector(irq, cfg, mask);
+#endif
 	if (err)
-		return err;
+		goto error;
 
 	err = apic->cpu_mask_to_apicid_and(mask, cfg->domain, dest_id);
 	if (err) {
 		if (assign_irq_vector(irq, cfg, data->affinity))
 			pr_err("Failed to recover vector for irq %d\n", irq);
-		return err;
+		goto error;
 	}
 
-	cpumask_copy(data->affinity, mask);
-
-	return 0;
+	cpumask_copy(data->affinity, mask_ok ? new_mask : mask);
+error:
+	if (mask_ok)
+		free_cpumask_var(new_mask);
+	return err;
 }
 
 static int
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index c88baa4..7403e1e 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -106,7 +106,7 @@ x2apic_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
 	int i;
 
 	for_each_cpu_and(i, cpumask, andmask) {
-		if (!cpumask_test_cpu(i, cpu_online_mask))
+		if (!cpumask_test_cpu(i, cpu_online_or_slave_mask))
 			continue;
 		dest = per_cpu(x86_cpu_to_logical_apicid, i);
 		cluster = x2apic_cluster(i);
@@ -117,7 +117,7 @@ x2apic_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
 		return -EINVAL;
 
 	for_each_cpu_and(i, cpumask, andmask) {
-		if (!cpumask_test_cpu(i, cpu_online_mask))
+		if (!cpumask_test_cpu(i, cpu_online_or_slave_mask))
 			continue;
 		if (cluster != x2apic_cluster(i))
 			continue;
@@ -137,7 +137,7 @@ static void init_x2apic_ldr(void)
 	per_cpu(x86_cpu_to_logical_apicid, this_cpu) = apic_read(APIC_LDR);
 
 	__cpu_set(this_cpu, per_cpu(cpus_in_cluster, this_cpu));
-	for_each_online_cpu(cpu) {
+	for_each_cpu(cpu, cpu_online_or_slave_mask) {
 		if (x2apic_cluster(this_cpu) != x2apic_cluster(cpu))
 			continue;
 		__cpu_set(this_cpu, per_cpu(cpus_in_cluster, cpu));
@@ -169,7 +169,7 @@ update_clusterinfo(struct notifier_block *nfb, unsigned long action, void *hcpu)
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:
-		for_each_online_cpu(cpu) {
+		for_each_cpu(cpu, cpu_online_or_slave_mask) {
 			if (x2apic_cluster(this_cpu) != x2apic_cluster(cpu))
 				continue;
 			__cpu_clear(this_cpu, per_cpu(cpus_in_cluster, cpu));
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index af8904d..df38334 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -931,26 +931,43 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	struct irq_cfg *cfg = data->chip_data;
 	unsigned int dest, irq = data->irq;
 	struct irte irte;
+	cpumask_var_t new_mask;
+	bool mask_ok = false;
 	int err;
 
 	if (!config_enabled(CONFIG_SMP))
 		return -EINVAL;
 
-	if (!cpumask_intersects(mask, cpu_online_mask))
+	if (!cpumask_intersects(mask, cpu_online_or_slave_mask))
 		return -EINVAL;
 
 	if (get_irte(irq, &irte))
 		return -EBUSY;
 
+#ifdef CONFIG_SLAVE_CPU
+	/* Set affinity to either online cpus only or slave cpus only */
+	mask_ok = alloc_cpumask_var(&new_mask, GFP_ATOMIC);
+	if (mask_ok) {
+		cpumask_and(new_mask, mask, cpu_online_mask);
+		if (cpumask_empty(new_mask))
+			cpumask_copy(new_mask, mask);
+		err = assign_irq_vector(irq, cfg, new_mask);
+	} else if (!cpumask_intersects(mask, cpu_online_mask) ||
+		   !cpumask_intersects(mask, cpu_slave_mask))
+		err = assign_irq_vector(irq, cfg, mask);
+	else
+		return -ENOMEM;
+#else
 	err = assign_irq_vector(irq, cfg, mask);
+#endif
 	if (err)
-		return err;
+		goto error;
 
 	err = apic->cpu_mask_to_apicid_and(cfg->domain, mask, &dest);
 	if (err) {
 		if (assign_irq_vector(irq, cfg, data->affinity))
 			pr_err("Failed to recover vector for irq %d\n", irq);
-		return err;
+		goto error;
 	}
 
 	irte.vector = cfg->vector;
@@ -970,8 +987,11 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	if (cfg->move_in_progress)
 		send_cleanup_vector(cfg);
 
-	cpumask_copy(data->affinity, mask);
-	return 0;
+	cpumask_copy(data->affinity, mask_ok ? new_mask : mask);
+error:
+	if (mask_ok)
+		free_cpumask_var(new_mask);
+	return err;
 }
 
 static void intel_compose_msi_msg(struct pci_dev *pdev,
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 4c69326..dafe0505 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -308,13 +308,13 @@ setup_affinity(unsigned int irq, struct irq_desc *desc, struct cpumask *mask)
 	 */
 	if (irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
 		if (cpumask_intersects(desc->irq_data.affinity,
-				       cpu_online_mask))
+				       cpu_online_or_slave_mask))
 			set = desc->irq_data.affinity;
 		else
 			irqd_clear(&desc->irq_data, IRQD_AFFINITY_SET);
 	}
 
-	cpumask_and(mask, cpu_online_mask, set);
+	cpumask_and(mask, cpu_online_or_slave_mask, set);
 	if (node != NUMA_NO_NODE) {
 		const struct cpumask *nodemask = cpumask_of_node(node);
 
diff --git a/kernel/irq/migration.c b/kernel/irq/migration.c
index ca3f4aa..6e3aaa9 100644
--- a/kernel/irq/migration.c
+++ b/kernel/irq/migration.c
@@ -42,7 +42,7 @@ void irq_move_masked_irq(struct irq_data *idata)
 	 * For correct operation this depends on the caller
 	 * masking the irqs.
 	 */
-	if (cpumask_any_and(desc->pending_mask, cpu_online_mask) < nr_cpu_ids)
+	if (cpumask_any_and(desc->pending_mask, cpu_online_or_slave_mask) < nr_cpu_ids)
 		irq_do_set_affinity(&desc->irq_data, desc->pending_mask, false);
 
 	cpumask_clear(desc->pending_mask);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 4bd4faa..76bd7b2 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -103,7 +103,7 @@ static ssize_t write_irq_affinity(int type, struct file *file,
 	 * way to make the system unusable accidentally :-) At least
 	 * one online CPU still has to be targeted.
 	 */
-	if (!cpumask_intersects(new_value, cpu_online_mask)) {
+	if (!cpumask_intersects(new_value, cpu_online_or_slave_mask)) {
 		/* Special case for empty set - allow the architecture
 		   code to set default SMP affinity. */
 		err = irq_select_affinity_usr(irq, new_value) ? -EINVAL : count;




* [RFC v2 PATCH 13/21] x86/apic: IRQ vector remapping on slave for slave CPUs
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (11 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 12/21] x86/apic: Enable external interrupt routing to slave CPUs Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 14/21] KVM: Directly handle interrupts by guests without VM EXIT on " Tomoki Sekiyama
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Add a facility for slave CPUs to use an IRQ vector different from the
one used by online CPUs.

When an alternative vector for an IRQ is registered with
remap_slave_vector_irq() and the IRQ affinity is set to slave CPUs
only, the device is configured to use the alternative vector.

The current patch only supports MSI and Intel CPUs with the IOMMU's IRQ
remapper.

This is intended for routing interrupts directly to a KVM guest running
on slave CPUs, where external interrupts do not cause a VM EXIT.
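
A minimal sketch of how a caller could combine the new helpers; the
calling context, the route_msi_to_slave() name and the guest_vector
variable are assumptions for illustration only, and the real wiring
into KVM's assigned-device code belongs to a later patch of this
series:

#include <linux/cpumask.h>
#include <linux/interrupt.h>
#include <asm/irq.h>

/* Route an assigned device's MSI (host 'irq') directly to a guest
 * running on 'slave_cpu', using the vector the guest programmed into
 * its virtual MSI entry ('guest_vector').  'slave_cpu' must already
 * have been activated as a slave CPU. */
static int route_msi_to_slave(unsigned int irq, u8 guest_vector, int slave_cpu)
{
        const struct cpumask *mask = get_cpu_mask(slave_cpu);

        /* Record the guest's vector for this IRQ on the slave CPU... */
        remap_slave_vector_irq(irq, guest_vector, mask);

        /* ...and move the IRQ there; with a slave-only mask,
         * msi_set_affinity()/intel_ioapic_set_affinity() program the
         * remapped vector via get_remapped_slave_vector(). */
        return irq_set_affinity(irq, mask);
}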

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/irq.h          |   15 ++++++++
 arch/x86/kernel/apic/io_apic.c      |   68 ++++++++++++++++++++++++++++++++++-
 drivers/iommu/intel_irq_remapping.c |    2 +
 3 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index ba870bb..84756f7 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -41,4 +41,19 @@ extern int vector_used_by_percpu_irq(unsigned int vector);
 
 extern void init_ISA_irqs(void);
 
+#ifdef CONFIG_SLAVE_CPU
+extern void remap_slave_vector_irq(int irq, int vector,
+				   const struct cpumask *mask);
+extern void revert_slave_vector_irq(int irq, const struct cpumask *mask);
+extern u8 get_remapped_slave_vector(u8 vector, unsigned int irq,
+				    const struct cpumask *mask);
+#else
+static inline u8 get_remapped_slave_vector(u8 vector, unsigned int irq,
+					   const struct cpumask *mask)
+{
+	return vector;
+}
+#endif
+
+
 #endif /* _ASM_X86_IRQ_H */
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 0cd2682..167b001 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1266,6 +1266,69 @@ void __setup_vector_irq(int cpu)
 	raw_spin_unlock(&vector_lock);
 }
 
+#ifdef CONFIG_SLAVE_CPU
+
+/* vector table remapped on slave cpus, indexed by IRQ */
+static DEFINE_PER_CPU(u8[NR_IRQS], slave_vector_remap_tbl) = {
+	[0 ... NR_IRQS - 1] = 0,
+};
+
+void remap_slave_vector_irq(int irq, int vector, const struct cpumask *mask)
+{
+	int cpu;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&vector_lock, flags);
+	for_each_cpu(cpu, mask) {
+		BUG_ON(!cpu_slave(cpu));
+		per_cpu(slave_vector_remap_tbl, cpu)[irq] = vector;
+		per_cpu(vector_irq, cpu)[vector] = irq;
+	}
+	raw_spin_unlock_irqrestore(&vector_lock, flags);
+}
+EXPORT_SYMBOL_GPL(remap_slave_vector_irq);
+
+void revert_slave_vector_irq(int irq, const struct cpumask *mask)
+{
+	int cpu;
+	u8 vector;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&vector_lock, flags);
+	for_each_cpu(cpu, mask) {
+		BUG_ON(!cpu_slave(cpu));
+		vector = per_cpu(slave_vector_remap_tbl, cpu)[irq];
+		if (vector) {
+			per_cpu(vector_irq, cpu)[vector] = -1;
+			per_cpu(slave_vector_remap_tbl, cpu)[irq] = 0;
+		}
+	}
+	raw_spin_unlock_irqrestore(&vector_lock, flags);
+}
+EXPORT_SYMBOL_GPL(revert_slave_vector_irq);
+
+/* If all targets CPUs are slave, returns remapped vector */
+u8 get_remapped_slave_vector(u8 vector, unsigned int irq,
+			     const struct cpumask *mask)
+{
+	u8 slave_vector;
+
+	if (vector < FIRST_EXTERNAL_VECTOR ||
+	    cpumask_intersects(mask, cpu_online_mask))
+		return vector;
+
+	slave_vector = per_cpu(slave_vector_remap_tbl,
+			       cpumask_first(mask))[irq];
+	if (slave_vector >= FIRST_EXTERNAL_VECTOR)
+		vector = slave_vector;
+
+	pr_info("slave vector remap: irq: %d => vector: %d\n", irq, vector);
+
+	return vector;
+}
+
+#endif
+
 static struct irq_chip ioapic_chip;
 
 #ifdef CONFIG_X86_32
@@ -3133,6 +3196,7 @@ static int
 msi_set_affinity(struct irq_data *data, const struct cpumask *mask, bool force)
 {
 	struct irq_cfg *cfg = data->chip_data;
+	int vector = cfg->vector;
 	struct msi_msg msg;
 	unsigned int dest;
 
@@ -3141,8 +3205,10 @@ msi_set_affinity(struct irq_data *data, const struct cpumask *mask, bool force)
 
 	__get_cached_msi_msg(data->msi_desc, &msg);
 
+	vector = get_remapped_slave_vector(vector, data->irq, mask);
+
 	msg.data &= ~MSI_DATA_VECTOR_MASK;
-	msg.data |= MSI_DATA_VECTOR(cfg->vector);
+	msg.data |= MSI_DATA_VECTOR(vector);
 	msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
 	msg.address_lo |= MSI_ADDR_DEST_ID(dest);
 
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index df38334..471d23f 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -970,7 +970,7 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
 		goto error;
 	}
 
-	irte.vector = cfg->vector;
+	irte.vector = get_remapped_slave_vector(cfg->vector, irq, mask);
 	irte.dest_id = IRTE_DEST(dest);
 
 	/*




* [RFC v2 PATCH 14/21] KVM: Directly handle interrupts by guests without VM EXIT on slave CPUs
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (12 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 13/21] x86/apic: IRQ vector remapping on slave for " Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 15/21] KVM: add tracepoint on enabling/disabling direct interrupt delivery Tomoki Sekiyama
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Let guests handle interrupts on slave CPUs without a VM EXIT.
This reduces the host CPU usage spent relaying interrupts of assigned
PCI devices from the host to guests. It also removes the cost of the
VM EXIT and shortens the guests' response time to those interrupts.

When a slave CPU is dedicated to a vCPU, exiting on external interrupts
is disabled. Unfortunately, exits can only be enabled/disabled for
external interrupts as a whole (except NMIs) and cannot be switched per
IRQ or per vector. Thus, to keep IPIs from online CPUs from being
delivered to guests, this patch modifies kvm_vcpu_kick() to use an NMI
for guests running on slave CPUs.

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/lapic.c            |    5 +++++
 arch/x86/kvm/vmx.c              |   19 ++++++++++++++++++
 arch/x86/kvm/x86.c              |   41 +++++++++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h        |    1 +
 virt/kvm/kvm_main.c             |    5 +++--
 6 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5ce89f1..65242a6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -725,6 +725,7 @@ struct kvm_x86_ops {
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage);
 
+	void (*set_direct_interrupt)(struct kvm_vcpu *vcpu, bool enabled);
 	void (*set_slave_mode)(struct kvm_vcpu *vcpu, bool slave);
 };
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ce87878..73f57f3 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -601,6 +601,9 @@ static int apic_set_eoi(struct kvm_lapic *apic)
 		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
 	}
 	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
+	if (vcpu_has_slave_cpu(apic->vcpu) &&
+	    kvm_x86_ops->set_direct_interrupt)
+		kvm_x86_ops->set_direct_interrupt(apic->vcpu, 1);
 	return vector;
 }
 
@@ -1569,6 +1572,8 @@ int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data)
 	u64 addr = data & ~KVM_MSR_ENABLED;
 	if (!IS_ALIGNED(addr, 4))
 		return 1;
+	if (vcpu_has_slave_cpu(vcpu))
+		return 1;
 
 	vcpu->arch.pv_eoi.msr_val = data;
 	if (!pv_eoi_enabled(vcpu))
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 03a2d02..605abea 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1711,6 +1711,16 @@ static inline void vmx_clear_hlt(struct kvm_vcpu *vcpu)
 #endif
 }
 
+static void vmx_set_direct_interrupt(struct kvm_vcpu *vcpu, bool enabled)
+{
+	if (enabled)
+		vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
+				PIN_BASED_EXT_INTR_MASK);
+	else
+		vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
+			      PIN_BASED_EXT_INTR_MASK);
+}
+
 static void vmx_set_slave_mode(struct kvm_vcpu *vcpu, bool slave)
 {
 	/* Don't intercept the guest's halt on slave CPU */
@@ -1721,6 +1731,8 @@ static void vmx_set_slave_mode(struct kvm_vcpu *vcpu, bool slave)
 		vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL,
 			      CPU_BASED_HLT_EXITING);
 	}
+
+	vmx_set_direct_interrupt(vcpu, slave);
 }
 
 /*
@@ -1776,6 +1788,8 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
 	vmx_clear_hlt(vcpu);
+	if (vcpu_has_slave_cpu(vcpu))
+		vmx_set_direct_interrupt(vcpu, 0);
 }
 
 static bool vmx_rdtscp_supported(void)
@@ -4147,6 +4161,8 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu)
 		intr |= INTR_TYPE_EXT_INTR;
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr);
 	vmx_clear_hlt(vcpu);
+	if (vcpu_has_slave_cpu(vcpu))
+		vmx_set_direct_interrupt(vcpu, 0);
 }
 
 static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
@@ -4179,6 +4195,8 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
 			INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR);
 	vmx_clear_hlt(vcpu);
+	if (vcpu_has_slave_cpu(vcpu))
+		vmx_set_direct_interrupt(vcpu, 0);
 }
 
 static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
@@ -7374,6 +7392,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 
 	.check_intercept = vmx_check_intercept,
 
+	.set_direct_interrupt = vmx_set_direct_interrupt,
 	.set_slave_mode = vmx_set_slave_mode,
 };
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a6b2521..b7d28df 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -63,6 +63,7 @@
 #include <asm/pvclock.h>
 #include <asm/div64.h>
 #include <asm/cpu.h>
+#include <asm/nmi.h>
 #include <asm/mmu.h>
 
 #define MAX_IO_MSRS 256
@@ -2664,6 +2665,8 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
 /* vcpu currently running on each slave CPU */
 static DEFINE_PER_CPU(struct kvm_vcpu *, slave_vcpu);
 
+static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs);
+
 static int kvm_arch_vcpu_ioctl_set_slave_cpu(struct kvm_vcpu *vcpu,
 					     int slave, int set_slave_mode)
 {
@@ -5011,6 +5014,11 @@ int kvm_arch_init(void *opaque)
 	if (cpu_has_xsave)
 		host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
 
+#ifdef CONFIG_SLAVE_CPU
+	register_nmi_handler(NMI_LOCAL, kvm_arch_kicked_by_nmi, 0,
+			     "kvm_kick");
+#endif
+
 	return 0;
 
 out:
@@ -5026,6 +5034,9 @@ void kvm_arch_exit(void)
 					    CPUFREQ_TRANSITION_NOTIFIER);
 	unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
 	unregister_slave_cpu_notifier(&kvmclock_slave_cpu_notifier_block);
+#ifdef CONFIG_SLAVE_CPU
+	unregister_nmi_handler(NMI_LOCAL, "kvm_kick");
+#endif
 	kvm_x86_ops = NULL;
 	kvm_mmu_module_exit();
 }
@@ -5322,6 +5333,27 @@ static void process_nmi(struct kvm_vcpu *vcpu)
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 
+#ifdef CONFIG_SLAVE_CPU
+static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs)
+{
+	struct kvm_vcpu *vcpu;
+	int cpu = smp_processor_id();
+
+	if (!cpu_slave(cpu))
+		return NMI_DONE;
+
+	vcpu = __this_cpu_read(slave_vcpu);
+	if (!vcpu)
+		return NMI_DONE;
+
+	/* if called from NMI handler after VM exit, no need to prevent run */
+	if (vcpu->mode == OUTSIDE_GUEST_MODE || kvm_is_in_guest())
+		return NMI_HANDLED;
+
+	return NMI_HANDLED;
+}
+#endif
+
 enum vcpu_enter_guest_slave_retval {
 	EXIT_TO_USER = 0,
 	LOOP_ONLINE,		/* vcpu_post_run is done in online cpu */
@@ -6706,6 +6738,15 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
 }
 
+int kvm_arch_vcpu_kick_slave(struct kvm_vcpu *vcpu)
+{
+	if (kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE) {
+		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), NMI_VECTOR);
+		return 1;
+	}
+	return 0;
+}
+
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
 	return kvm_x86_ops->interrupt_allowed(vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a60743f..5fd9c64 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -539,6 +539,7 @@ void kvm_arch_hardware_unsetup(void);
 void kvm_arch_check_processor_compat(void *rtn);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_kick_slave(struct kvm_vcpu *vcpu);
 
 void kvm_free_physmem(struct kvm *kvm);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index debbee1..858b7a9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1543,10 +1543,11 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 	}
 
 	me = get_cpu();
-	if (cpu != me && (unsigned)cpu < nr_cpu_ids &&
-	    (cpu_online(cpu) || cpu_slave(cpu)))
+	if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
 		if (kvm_arch_vcpu_should_kick(vcpu))
 			smp_send_reschedule(cpu);
+	if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_slave(cpu))
+		kvm_arch_vcpu_kick_slave(vcpu);
 	put_cpu();
 }
 #endif /* !CONFIG_S390 */




* [RFC v2 PATCH 15/21] KVM: add tracepoint on enabling/disabling direct interrupt delivery
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (13 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 14/21] KVM: Directly handle interrupts by guests without VM EXIT on " Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 16/21] KVM: vmx: Add definitions PIN_BASED_PREEMPTION_TIMER Tomoki Sekiyama
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Add a trace event "kvm_set_direct_interrupt" to trace enabling and disabling
of direct interrupt delivery on slave CPUs. The event logs the guest RIP and
whether the feature is enabled.

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/kvm/trace.h |   18 ++++++++++++++++++
 arch/x86/kvm/vmx.c   |    2 ++
 arch/x86/kvm/x86.c   |    1 +
 3 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index a71faf7..6081be7 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -551,6 +551,24 @@ TRACE_EVENT(kvm_pv_eoi,
 	TP_printk("apicid %x vector %d", __entry->apicid, __entry->vector)
 );
 
+TRACE_EVENT(kvm_set_direct_interrupt,
+	TP_PROTO(struct kvm_vcpu *vcpu, bool enabled),
+	TP_ARGS(vcpu, enabled),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	guest_rip	)
+		__field(	bool,	        enabled         )
+	),
+
+	TP_fast_assign(
+		__entry->guest_rip	= kvm_rip_read(vcpu);
+		__entry->enabled        = enabled;
+	),
+
+	TP_printk("rip 0x%lx enabled %d",
+		 __entry->guest_rip, __entry->enabled)
+);
+
 /*
  * Tracepoint for nested VMRUN
  */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 605abea..6dc59c8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1719,6 +1719,8 @@ static void vmx_set_direct_interrupt(struct kvm_vcpu *vcpu, bool enabled)
 	else
 		vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
 			      PIN_BASED_EXT_INTR_MASK);
+
+	trace_kvm_set_direct_interrupt(vcpu, enabled);
 }
 
 static void vmx_set_slave_mode(struct kvm_vcpu *vcpu, bool slave)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b7d28df..1449187 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6936,3 +6936,4 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intr_vmexit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_set_direct_interrupt);




* [RFC v2 PATCH 16/21] KVM: vmx: Add definitions PIN_BASED_PREEMPTION_TIMER
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (14 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 15/21] KVM: add tracepoint on enabling/disabling direct interrupt delivery Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 17/21] KVM: add kvm_arch_vcpu_prevent_run to prevent VM ENTER when NMI is received Tomoki Sekiyama
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Add the definitions needed to use PIN_BASED_PREEMPTION_TIMER.

When PIN_BASED_PREEMPTION_TIMER is enabled, the guest exits with
reason=EXIT_REASON_PREEMPTION_TIMER once the counter specified in
VMX_PREEMPTION_TIMER_VALUE reaches 0.
This patch also adds a dummy handler for EXIT_REASON_PREEMPTION_TIMER,
which simply resumes VM execution.

These definitions are currently intended only for avoiding entry into the
guest on a slave CPU when vmx_prevent_run(vcpu, 1) is called.
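
For reference, arming the timer so that the very next VM entry exits
immediately boils down to two VMCS writes using these definitions. This is
only an illustrative sketch, not part of this patch (the actual arming is
done by vmx_prevent_run() in a later patch of this series):

  /* Sketch only: force the next VM entry to exit at once. */
  vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_PREEMPTION_TIMER);
  vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
  /* The resulting exit is handled by handle_preemption_timer(). */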

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/vmx.h |    3 +++
 arch/x86/kvm/trace.h       |    1 +
 arch/x86/kvm/vmx.c         |    7 +++++++
 3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 74fcb96..6899aaa 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -66,6 +66,7 @@
 #define PIN_BASED_EXT_INTR_MASK                 0x00000001
 #define PIN_BASED_NMI_EXITING                   0x00000008
 #define PIN_BASED_VIRTUAL_NMIS                  0x00000020
+#define PIN_BASED_PREEMPTION_TIMER              0x00000040
 
 #define VM_EXIT_SAVE_DEBUG_CONTROLS             0x00000002
 #define VM_EXIT_HOST_ADDR_SPACE_SIZE            0x00000200
@@ -196,6 +197,7 @@ enum vmcs_field {
 	GUEST_INTERRUPTIBILITY_INFO     = 0x00004824,
 	GUEST_ACTIVITY_STATE            = 0X00004826,
 	GUEST_SYSENTER_CS               = 0x0000482A,
+	VMX_PREEMPTION_TIMER_VALUE      = 0x0000482E,
 	HOST_IA32_SYSENTER_CS           = 0x00004c00,
 	CR0_GUEST_HOST_MASK             = 0x00006000,
 	CR4_GUEST_HOST_MASK             = 0x00006002,
@@ -280,6 +282,7 @@ enum vmcs_field {
 #define EXIT_REASON_APIC_ACCESS         44
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
+#define EXIT_REASON_PREEMPTION_TIMER	52
 #define EXIT_REASON_WBINVD		54
 #define EXIT_REASON_XSETBV		55
 #define EXIT_REASON_INVPCID		58
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 6081be7..fc350f3 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -218,6 +218,7 @@ TRACE_EVENT(kvm_apic,
 	{ EXIT_REASON_APIC_ACCESS,		"APIC_ACCESS" }, \
 	{ EXIT_REASON_EPT_VIOLATION,		"EPT_VIOLATION" }, \
 	{ EXIT_REASON_EPT_MISCONFIG,		"EPT_MISCONFIG" }, \
+	{ EXIT_REASON_PREEMPTION_TIMER,		"PREEMPTION_TIMER" }, \
 	{ EXIT_REASON_WBINVD,			"WBINVD" }
 
 #define SVM_EXIT_REASONS \
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6dc59c8..2130cbd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4456,6 +4456,12 @@ static int handle_external_interrupt(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int handle_preemption_timer(struct kvm_vcpu *vcpu)
+{
+	/* Nothing */
+	return 1;
+}
+
 static int handle_triple_fault(struct kvm_vcpu *vcpu)
 {
 	vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
@@ -5768,6 +5774,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
+	[EXIT_REASON_PREEMPTION_TIMER]        = handle_preemption_timer,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
 	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
 	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,




* [RFC v2 PATCH 17/21] KVM: add kvm_arch_vcpu_prevent_run to prevent VM ENTER when NMI is received
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (15 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 16/21] KVM: vmx: Add definitions PIN_BASED_PREEMPTION_TIMER Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 18/21] KVM: route assigned devices' MSI/MSI-X directly to guests on slave CPUs Tomoki Sekiyama
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Since NMIs cannot be disabled around VM entry, there is a race between
receiving an NMI to kick a guest and entering the guest on a slave CPU. If
the NMI arrives just before VM entry, the NMI handler runs, but the CPU
still continues into the guest and the effect of the NMI is lost.

This patch adds kvm_arch_vcpu_prevent_run(), which forces a VM exit right
after VM entry. The NMI handler uses this to ensure that execution of the
guest is cancelled after the NMI.
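
Schematically, the ordering implemented below is as follows (comment form
only, to show the handshake; names follow the patch):

/*
 *   slave CPU, vcpu run path              NMI handler (kicker), via
 *   ------------------------              kvm_arch_vcpu_prevent_run(vcpu, 1)
 *   vcpu->arch.prevent_run = 0;           ----------------------------------
 *   ...
 *   barrier();
 *   vcpu->arch.prevent_needed = 1;        vcpu->arch.prevent_run = 1;
 *   if (vcpu->arch.prevent_run)           if (vcpu->arch.prevent_needed)
 *           vmx_prevent_run(vcpu, 1);             ->prevent_run(vcpu, 1);
 *   VMLAUNCH/VMRESUME (exits at once if the preemption timer was armed)
 *   vcpu->arch.prevent_needed = 0;
 *   barrier();
 */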

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/kvm_host.h |    6 ++++++
 arch/x86/kvm/vmx.c              |   42 ++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   31 +++++++++++++++++++++++++++++
 3 files changed, 78 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 65242a6..624e5ad 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -429,6 +429,9 @@ struct kvm_vcpu_arch {
 		void *insn;
 		int insn_len;
 	} page_fault;
+
+	bool prevent_run;
+	bool prevent_needed;
 #endif
 
 	int halt_request; /* real mode on Intel only */
@@ -681,6 +684,7 @@ struct kvm_x86_ops {
 
 	void (*run)(struct kvm_vcpu *vcpu);
 	int (*handle_exit)(struct kvm_vcpu *vcpu);
+	void (*prevent_run)(struct kvm_vcpu *vcpu, int prevent);
 	void (*skip_emulated_instruction)(struct kvm_vcpu *vcpu);
 	void (*set_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
 	u32 (*get_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
@@ -1027,4 +1031,6 @@ int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
 void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
 void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
 
+int kvm_arch_vcpu_run_prevented(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2130cbd..39a4cb4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1713,6 +1713,9 @@ static inline void vmx_clear_hlt(struct kvm_vcpu *vcpu)
 
 static void vmx_set_direct_interrupt(struct kvm_vcpu *vcpu, bool enabled)
 {
+#ifdef CONFIG_SLAVE_CPU
+	void *msr_bitmap;
+
 	if (enabled)
 		vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
 				PIN_BASED_EXT_INTR_MASK);
@@ -1721,6 +1724,7 @@ static void vmx_set_direct_interrupt(struct kvm_vcpu *vcpu, bool enabled)
 			      PIN_BASED_EXT_INTR_MASK);
 
 	trace_kvm_set_direct_interrupt(vcpu, enabled);
+#endif
 }
 
 static void vmx_set_slave_mode(struct kvm_vcpu *vcpu, bool slave)
@@ -4458,7 +4462,7 @@ static int handle_external_interrupt(struct kvm_vcpu *vcpu)
 
 static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 {
-	/* Nothing */
+	kvm_arch_vcpu_run_prevented(vcpu);
 	return 1;
 }
 
@@ -6052,6 +6056,10 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 	}
 
 	if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
+#ifdef CONFIG_SLAVE_CPU
+		if (vcpu->arch.prevent_run)
+			return kvm_arch_vcpu_run_prevented(vcpu);
+#endif
 		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
 		vcpu->run->fail_entry.hardware_entry_failure_reason
 			= exit_reason;
@@ -6059,6 +6067,10 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 	}
 
 	if (unlikely(vmx->fail)) {
+#ifdef CONFIG_SLAVE_CPU
+		if (vcpu->arch.prevent_run)
+			return kvm_arch_vcpu_run_prevented(vcpu);
+#endif
 		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
 		vcpu->run->fail_entry.hardware_entry_failure_reason
 			= vmcs_read32(VM_INSTRUCTION_ERROR);
@@ -6275,6 +6287,21 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 					msrs[i].host);
 }
 
+/*
+ * Make VMRESUME fail using the preemption timer with a timer value of 0.
+ * On processors that don't support the preemption timer, VMRESUME fails
+ * with an internal error.
+ */
+static void vmx_prevent_run(struct kvm_vcpu *vcpu, int prevent)
+{
+	if (prevent)
+		vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
+			      PIN_BASED_PREEMPTION_TIMER);
+	else
+		vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
+				PIN_BASED_PREEMPTION_TIMER);
+}
+
 #ifdef CONFIG_X86_64
 #define R "r"
 #define Q "q"
@@ -6326,6 +6353,13 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	atomic_switch_perf_msrs(vmx);
 
+#ifdef CONFIG_SLAVE_CPU
+	barrier();	/* Avoid vmcs modification by NMI before here */
+	vcpu->arch.prevent_needed = 1;
+	if (vcpu->arch.prevent_run)
+		vmx_prevent_run(vcpu, 1);
+#endif
+
 	vmx->__launched = vmx->loaded_vmcs->launched;
 	asm(
 		/* Store host registers */
@@ -6439,6 +6473,11 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	loadsegment(es, __USER_DS);
 #endif
 
+#ifdef CONFIG_SLAVE_CPU
+	vcpu->arch.prevent_needed = 0;
+	barrier();	/* Avoid vmcs modification by NMI after here */
+#endif
+
 	vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
 				  | (1 << VCPU_EXREG_RFLAGS)
 				  | (1 << VCPU_EXREG_CPL)
@@ -7358,6 +7397,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 
 	.run = vmx_vcpu_run,
 	.handle_exit = vmx_handle_exit,
+	.prevent_run = vmx_prevent_run,
 	.skip_emulated_instruction = skip_emulated_instruction,
 	.set_interrupt_shadow = vmx_set_interrupt_shadow,
 	.get_interrupt_shadow = vmx_get_interrupt_shadow,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1449187..eb2e5c4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2667,6 +2667,19 @@ static DEFINE_PER_CPU(struct kvm_vcpu *, slave_vcpu);
 
 static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs);
 
+struct kvm_vcpu *get_slave_vcpu(int cpu)
+{
+	return per_cpu(slave_vcpu, cpu);
+}
+
+static int kvm_arch_vcpu_prevent_run(struct kvm_vcpu *vcpu, int prevent)
+{
+	vcpu->arch.prevent_run = prevent;
+	if (!prevent || vcpu->arch.prevent_needed)
+		kvm_x86_ops->prevent_run(vcpu, prevent);
+	return 1;
+}
+
 static int kvm_arch_vcpu_ioctl_set_slave_cpu(struct kvm_vcpu *vcpu,
 					     int slave, int set_slave_mode)
 {
@@ -4873,10 +4886,12 @@ static struct notifier_block kvmclock_cpu_notifier_block = {
 	.priority = -INT_MAX
 };
 
+#ifdef CONFIG_SLAVE_CPU
 static struct notifier_block kvmclock_slave_cpu_notifier_block = {
 	.notifier_call  = kvmclock_cpu_notifier,
 	.priority = -INT_MAX
 };
+#endif
 
 static void kvm_timer_init(void)
 {
@@ -5350,6 +5365,11 @@ static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs)
 	if (vcpu->mode == OUTSIDE_GUEST_MODE || kvm_is_in_guest())
 		return NMI_HANDLED;
 
+	/*
+	 * We may be about to enter the VM. To prevent that, mark the
+	 * vcpu so that it exits as soon as possible.
+	 */
+	kvm_arch_vcpu_prevent_run(vcpu, 1);
 	return NMI_HANDLED;
 }
 #endif
@@ -5592,6 +5612,8 @@ static void __vcpu_enter_guest_slave(void *_arg)
 	kvm_arch_vcpu_load(vcpu, cpu);
 
 	while (r == LOOP_SLAVE) {
+		vcpu->arch.prevent_run = 0;
+
 		r = vcpu_enter_guest(vcpu, arg->task);
 
 		if (r <= 0)
@@ -5618,6 +5640,7 @@ static void __vcpu_enter_guest_slave(void *_arg)
 		}
 	}
 
+	kvm_arch_vcpu_prevent_run(vcpu, 0);
 	kvm_arch_vcpu_put_migrate(vcpu);
 	unuse_mm(arg->task->mm);
 	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
@@ -6733,6 +6756,14 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 		 kvm_cpu_has_interrupt(vcpu));
 }
 
+int kvm_arch_vcpu_run_prevented(struct kvm_vcpu *vcpu)
+{
+	kvm_x86_ops->prevent_run(vcpu, 0);
+	vcpu->arch.interrupted = true;
+	return 1;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_run_prevented);
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
 	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;




* [RFC v2 PATCH 18/21] KVM: route assigned devices' MSI/MSI-X directly to guests on slave CPUs
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (16 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 17/21] KVM: add kvm_arch_vcpu_prevent_run to prevent VM ENTER when NMI is received Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:28 ` [RFC v2 PATCH 19/21] KVM: Enable direct EOI for directly routed interrupts to guests Tomoki Sekiyama
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

When a PCI device is assigned to a guest running on slave CPUs, this
routes the device's MSI/MSI-X interrupts directly to the guest.

Because the guest uses a different interrupt vector from the host,
vector remapping is required. This is safe because slave CPUs only handle
interrupts for the assigned guest.

A slave CPU may receive interrupts for the guest while the guest is not
running. In that case, the host IRQ handler is invoked and the interrupt
is transferred to the guest as a virtual IRQ.

If the guest receives a direct interrupt from the device, an EOI to the
physical APIC is required. To handle this, when the guest issues an EOI
while there are no in-service interrupts in the virtual APIC, a physical
EOI is issued (see the sketch below).
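
The EOI rule can be sketched as follows (illustrative only; the actual
change is the apic_set_eoi() hunk in lapic.c below). A directly delivered
interrupt never sets a bit in the virtual APIC's ISR, so an EOI that finds
no in-service vector on a slave CPU is forwarded to the physical APIC:

  if (apic_find_highest_isr(apic) == -1) {
          /* no vISR bit: on a slave CPU this EOI belongs to a direct IRQ */
          if (cpu_slave(smp_processor_id()))
                  ack_APIC_irq();
  } else {
          /* normal virtual EOI handling */
  }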

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/kvm_host.h |   19 +++++
 arch/x86/kvm/irq.c              |  136 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/lapic.c            |    6 +-
 arch/x86/kvm/x86.c              |   12 +++
 virt/kvm/assigned-dev.c         |    8 ++
 5 files changed, 179 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 624e5ad..f43680e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1033,4 +1033,23 @@ void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
 
 int kvm_arch_vcpu_run_prevented(struct kvm_vcpu *vcpu);
 
+#ifdef CONFIG_SLAVE_CPU
+void kvm_get_slave_cpu_mask(struct kvm *kvm, struct cpumask *mask);
+
+struct kvm_assigned_dev_kernel;
+extern void assign_slave_msi(struct kvm *kvm,
+			     struct kvm_assigned_dev_kernel *assigned_dev);
+extern void deassign_slave_msi(struct kvm *kvm,
+			       struct kvm_assigned_dev_kernel *assigned_dev);
+extern void assign_slave_msix(struct kvm *kvm,
+			      struct kvm_assigned_dev_kernel *assigned_dev);
+extern void deassign_slave_msix(struct kvm *kvm,
+				struct kvm_assigned_dev_kernel *assigned_dev);
+#else
+#define assign_slave_msi(kvm, assigned_dev)
+#define deassign_slave_msi(kvm, assigned_dev)
+#define assign_slave_msix(kvm, assigned_dev)
+#define deassign_slave_msix(kvm, assigned_dev)
+#endif
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 7e06ba1..128431a 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -22,6 +22,8 @@
 
 #include <linux/module.h>
 #include <linux/kvm_host.h>
+#include <linux/pci.h>
+#include <asm/msidef.h>
 
 #include "irq.h"
 #include "i8254.h"
@@ -94,3 +96,137 @@ void __kvm_migrate_timers(struct kvm_vcpu *vcpu)
 	__kvm_migrate_apic_timer(vcpu);
 	__kvm_migrate_pit_timer(vcpu);
 }
+
+
+#ifdef CONFIG_SLAVE_CPU
+
+static int kvm_lookup_msi_routing_entry(struct kvm *kvm, int irq)
+{
+	int vec = -1;
+	struct kvm_irq_routing_table *irq_rt;
+	struct kvm_kernel_irq_routing_entry *e;
+	struct hlist_node *n;
+
+	rcu_read_lock();
+	irq_rt = rcu_dereference(kvm->irq_routing);
+	if (irq < irq_rt->nr_rt_entries)
+		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
+			if (e->type == KVM_IRQ_ROUTING_MSI)
+				vec = (e->msi.data & MSI_DATA_VECTOR_MASK)
+					>> MSI_DATA_VECTOR_SHIFT;
+	rcu_read_unlock();
+
+	return vec;
+}
+
+void assign_slave_msi(struct kvm *kvm,
+		      struct kvm_assigned_dev_kernel *assigned_dev)
+{
+	int irq = assigned_dev->guest_irq;
+	int host_irq = assigned_dev->host_irq;
+	struct irq_data *data = irq_get_irq_data(host_irq);
+	int vec = kvm_lookup_msi_routing_entry(kvm, irq);
+	cpumask_var_t slave_mask;
+	char buffer[16];
+
+	BUG_ON(!data);
+
+	if (!zalloc_cpumask_var(&slave_mask, GFP_KERNEL)) {
+		pr_err("assign slave MSI failed: no memory\n");
+		return;
+	}
+	kvm_get_slave_cpu_mask(kvm, slave_mask);
+
+	bitmap_scnprintf(buffer, sizeof(buffer), cpu_slave_mask->bits, 32);
+	pr_info("assigned_device slave msi: irq:%d host:%d vec:%d mask:%s\n",
+		irq, host_irq, vec, buffer);
+
+	remap_slave_vector_irq(host_irq, vec, slave_mask);
+	data->chip->irq_set_affinity(data, slave_mask, 1);
+
+	free_cpumask_var(slave_mask);
+}
+
+void deassign_slave_msi(struct kvm *kvm,
+			struct kvm_assigned_dev_kernel *assigned_dev)
+{
+	int host_irq = assigned_dev->host_irq;
+	cpumask_var_t slave_mask;
+	char buffer[16];
+
+	if (!zalloc_cpumask_var(&slave_mask, GFP_KERNEL)) {
+		pr_err("deassign slave MSI failed: no memory\n");
+		return;
+	}
+	kvm_get_slave_cpu_mask(kvm, slave_mask);
+
+	bitmap_scnprintf(buffer, sizeof(buffer), cpu_slave_mask->bits, 32);
+	pr_info("deassigned_device slave msi: host:%d mask:%s\n",
+		host_irq, buffer);
+
+	revert_slave_vector_irq(host_irq, slave_mask);
+
+	free_cpumask_var(slave_mask);
+}
+
+void assign_slave_msix(struct kvm *kvm,
+		       struct kvm_assigned_dev_kernel *assigned_dev)
+{
+	int i;
+
+	for (i = 0; i < assigned_dev->entries_nr; i++) {
+		int irq = assigned_dev->guest_msix_entries[i].vector;
+		int host_irq = assigned_dev->host_msix_entries[i].vector;
+		struct irq_data *data = irq_get_irq_data(host_irq);
+		int vec = kvm_lookup_msi_routing_entry(kvm, irq);
+		cpumask_var_t slave_mask;
+		char buffer[16];
+
+		pr_info("assign_slave_msix: %d %d %x\n", irq, host_irq, vec);
+		BUG_ON(!data);
+
+		if (!zalloc_cpumask_var(&slave_mask, GFP_KERNEL)) {
+			pr_err("assign slave MSI-X failed: no memory\n");
+			return;
+		}
+		kvm_get_slave_cpu_mask(kvm, slave_mask);
+
+		bitmap_scnprintf(buffer, sizeof(buffer), cpu_slave_mask->bits,
+				 32);
+		pr_info("assigned_device slave msi-x: irq:%d host:%d vec:%d mask:%s\n",
+			irq, host_irq, vec, buffer);
+
+		remap_slave_vector_irq(host_irq, vec, slave_mask);
+		data->chip->irq_set_affinity(data, slave_mask, 1);
+
+		free_cpumask_var(slave_mask);
+	}
+}
+
+void deassign_slave_msix(struct kvm *kvm,
+			 struct kvm_assigned_dev_kernel *assigned_dev)
+{
+	int i;
+
+	for (i = 0; i < assigned_dev->entries_nr; i++) {
+		int host_irq = assigned_dev->host_msix_entries[i].vector;
+		cpumask_var_t slave_mask;
+		char buffer[16];
+
+		if (!zalloc_cpumask_var(&slave_mask, GFP_KERNEL)) {
+			pr_err("deassign slave MSI failed: no memory\n");
+			return;
+		}
+		kvm_get_slave_cpu_mask(kvm, slave_mask);
+
+		bitmap_scnprintf(buffer, sizeof(buffer), cpu_slave_mask->bits, 32);
+		pr_info("deassigned_device slave msi: host:%d mask:%s\n",
+			host_irq, buffer);
+
+		revert_slave_vector_irq(host_irq, slave_mask);
+
+		free_cpumask_var(slave_mask);
+	}
+}
+
+#endif
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 73f57f3..bf8e351 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -585,8 +585,12 @@ static int apic_set_eoi(struct kvm_lapic *apic)
 	 * Not every write EOI will has corresponding ISR,
 	 * one example is when Kernel check timer on setup_IO_APIC
 	 */
-	if (vector == -1)
+	if (vector == -1) {
+		/* On a slave CPU, this can be an EOI for a direct interrupt */
+		if (cpu_slave(smp_processor_id()))
+			ack_APIC_irq();
 		return vector;
+	}
 
 	apic_clear_isr(vector, apic);
 	apic_update_ppr(apic);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb2e5c4..609ab62 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5349,6 +5349,17 @@ static void process_nmi(struct kvm_vcpu *vcpu)
 }
 
 #ifdef CONFIG_SLAVE_CPU
+
+void kvm_get_slave_cpu_mask(struct kvm *kvm, struct cpumask *mask)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		if (vcpu_has_slave_cpu(vcpu))
+			cpumask_set_cpu(vcpu->arch.slave_cpu, mask);
+}
+
 static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs)
 {
 	struct kvm_vcpu *vcpu;
@@ -5372,6 +5383,7 @@ static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs)
 	kvm_arch_vcpu_prevent_run(vcpu, 1);
 	return NMI_HANDLED;
 }
+
 #endif
 
 enum vcpu_enter_guest_slave_retval {
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 23a41a9..a3acc67 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -225,8 +225,12 @@ static void deassign_host_irq(struct kvm *kvm,
 
 		free_irq(assigned_dev->host_irq, assigned_dev);
 
-		if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI)
+		if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI) {
 			pci_disable_msi(assigned_dev->dev);
+			deassign_slave_msi(kvm, assigned_dev);
+		}
+		if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX)
+			deassign_slave_msix(kvm, assigned_dev);
 	}
 
 	assigned_dev->irq_requested_type &= ~(KVM_DEV_IRQ_HOST_MASK);
@@ -417,6 +421,7 @@ static int assigned_device_enable_guest_msi(struct kvm *kvm,
 {
 	dev->guest_irq = irq->guest_irq;
 	dev->ack_notifier.gsi = -1;
+	assign_slave_msi(kvm, dev);
 	return 0;
 }
 #endif
@@ -428,6 +433,7 @@ static int assigned_device_enable_guest_msix(struct kvm *kvm,
 {
 	dev->guest_irq = irq->guest_irq;
 	dev->ack_notifier.gsi = -1;
+	assign_slave_msix(kvm, dev);
 	return 0;
 }
 #endif




* [RFC v2 PATCH 19/21] KVM: Enable direct EOI for directly routed interrupts to guests
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (17 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 18/21] KVM: route assigned devices' MSI/MSI-X directly to guests on slave CPUs Tomoki Sekiyama
@ 2012-09-06 11:28 ` Tomoki Sekiyama
  2012-09-06 11:29 ` [RFC v2 PATCH 20/21] KVM: Pass-through local APIC timer of on slave CPUs to guest VM Tomoki Sekiyama
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Enable direct guest access to the x2APIC EOI MSR to accelerate guests.
This speeds up handling of interrupts delivered directly to the guest from
passed-through PCI devices. When a virtual IRQ is injected, the feature is
disabled so that the following EOI is routed to the virtual APIC; it is
enabled again once all virtual IRQs have been handled.
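
For reference (not part of the change itself), the x2APIC register-to-MSR
mapping that makes this work: register offset R is accessed through MSR
0x800 + (R >> 4), so the EOI register (offset 0xb0) becomes MSR 0x80b.
That is the entry whose intercept is cleared in the slave-mode MSR bitmaps
added below:

  /* Sketch: the x2APIC EOI MSR index cleared in the slave bitmaps.
   * APIC_BASE_MSR is 0x800 and APIC_EOI is 0xb0 (<asm/apicdef.h>). */
  u32 eoi_msr = APIC_BASE_MSR + (APIC_EOI >> 4);	/* == 0x80b */
  vmx_disable_intercept_for_msr_slave(eoi_msr, false);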

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/kvm/vmx.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 39a4cb4..f93e08c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -636,6 +636,10 @@ static unsigned long *vmx_io_bitmap_a;
 static unsigned long *vmx_io_bitmap_b;
 static unsigned long *vmx_msr_bitmap_legacy;
 static unsigned long *vmx_msr_bitmap_longmode;
+#ifdef CONFIG_SLAVE_CPU
+static unsigned long *vmx_msr_bitmap_slave_legacy;
+static unsigned long *vmx_msr_bitmap_slave_longmode;
+#endif
 
 static bool cpu_has_load_ia32_efer;
 static bool cpu_has_load_perf_global_ctrl;
@@ -912,6 +916,11 @@ static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
 			struct vmcs12 *vmcs12,
 			u32 reason, unsigned long qualification);
 
+static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only);
+#ifdef CONFIG_SLAVE_CPU
+static void vmx_disable_intercept_for_msr_slave(u32 msr, bool longmode_only);
+#endif
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -1716,13 +1725,28 @@ static void vmx_set_direct_interrupt(struct kvm_vcpu *vcpu, bool enabled)
 #ifdef CONFIG_SLAVE_CPU
 	void *msr_bitmap;
 
-	if (enabled)
+	if (enabled) {
 		vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
 				PIN_BASED_EXT_INTR_MASK);
-	else
+
+		if (cpu_has_vmx_msr_bitmap()) {
+			msr_bitmap = is_long_mode(vcpu) ?
+				vmx_msr_bitmap_slave_longmode :
+				vmx_msr_bitmap_slave_legacy;
+			vmcs_write64(MSR_BITMAP, __pa(msr_bitmap));
+		}
+	} else {
 		vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
 			      PIN_BASED_EXT_INTR_MASK);
 
+		if (cpu_has_vmx_msr_bitmap()) {
+			msr_bitmap = is_long_mode(vcpu) ?
+				vmx_msr_bitmap_longmode :
+				vmx_msr_bitmap_legacy;
+			vmcs_write64(MSR_BITMAP, __pa(msr_bitmap));
+		}
+	}
+
 	trace_kvm_set_direct_interrupt(vcpu, enabled);
 #endif
 }
@@ -3771,6 +3795,16 @@ static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
 }
 
+#ifdef CONFIG_SLAVE_CPU
+static void vmx_disable_intercept_for_msr_slave(u32 msr, bool longmode_only)
+{
+	if (!longmode_only)
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_slave_legacy,
+					       msr);
+	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_slave_longmode, msr);
+}
+#endif
+
 /*
  * Set up the vmcs's constant host-state fields, i.e., host-state fields that
  * will not change in the lifetime of the guest.
@@ -7474,6 +7508,22 @@ static int __init vmx_init(void)
 		goto out2;
 
 
+#ifdef CONFIG_SLAVE_CPU
+	vmx_msr_bitmap_slave_legacy =
+		(unsigned long *)__get_free_page(GFP_KERNEL);
+	if (!vmx_msr_bitmap_slave_legacy) {
+		r = -ENOMEM;
+		goto out1s;
+	}
+
+	vmx_msr_bitmap_slave_longmode =
+		(unsigned long *)__get_free_page(GFP_KERNEL);
+	if (!vmx_msr_bitmap_slave_longmode) {
+		r = -ENOMEM;
+		goto out2s;
+	}
+#endif
+
 	/*
 	 * Allow direct access to the PC debug port (it is often used for I/O
 	 * delays, but the vmexits simply slow things down).
@@ -7500,6 +7550,15 @@ static int __init vmx_init(void)
 	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
 	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
 
+#ifdef CONFIG_SLAVE_CPU
+	memcpy(vmx_msr_bitmap_slave_legacy,
+	       vmx_msr_bitmap_legacy, PAGE_SIZE);
+	memcpy(vmx_msr_bitmap_slave_longmode,
+	       vmx_msr_bitmap_longmode, PAGE_SIZE);
+	vmx_disable_intercept_for_msr_slave(
+		APIC_BASE_MSR + (APIC_EOI >> 4), false);
+#endif
+
 	if (enable_ept) {
 		kvm_mmu_set_mask_ptes(0ull,
 			(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
@@ -7513,6 +7572,12 @@ static int __init vmx_init(void)
 	return 0;
 
 out3:
+#ifdef CONFIG_SLAVE_CPU
+	free_page((unsigned long)vmx_msr_bitmap_slave_longmode);
+out2s:
+	free_page((unsigned long)vmx_msr_bitmap_slave_legacy);
+out1s:
+#endif
 	free_page((unsigned long)vmx_msr_bitmap_longmode);
 out2:
 	free_page((unsigned long)vmx_msr_bitmap_legacy);




* [RFC v2 PATCH 20/21] KVM: Pass-through local APIC timer of on slave CPUs to guest VM
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (18 preceding siblings ...)
  2012-09-06 11:28 ` [RFC v2 PATCH 19/21] KVM: Enable direct EOI for directly routed interrupts to guests Tomoki Sekiyama
@ 2012-09-06 11:29 ` Tomoki Sekiyama
  2012-09-06 11:29 ` [RFC v2 PATCH 21/21] x86: request TLB flush to slave CPU using NMI Tomoki Sekiyama
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:29 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Give the guest direct control of the local APIC timer on slave CPUs.
The timer interrupt does not cause a VM exit when direct interrupt delivery
is enabled. To handle the timer correctly, the guest is given exclusive use
of the slave CPU's local APIC timer.

If the host supports x2APIC, TMICT and TMCCT are exposed to the guest so
that it can start the timer and read the timer count without a VM exit.
Otherwise, the APIC registers are set to the specified values on the
guest's behalf. LVTT is not passed through, to avoid changing the timer
interrupt vector.

Remapping of the guest timer interrupt vector is not yet supported, so the
guest must use the same vector as the host.
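
A note on the TSC-deadline rewrite in the start_apic_timer() hunk below
(a sketch that assumes the guest TSC differs from the host TSC only by a
fixed offset, i.e. no TSC scaling):

  /*
   * remaining ticks = tscdeadline - guest_tsc
   * host deadline   = native_read_tsc() + remaining ticks
   * so the physical deadline MSR is programmed as:
   */
  wrmsrl(MSR_IA32_TSCDEADLINE, tscdeadline - guest_tsc + native_read_tsc());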

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/apic.h     |    4 +++
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kernel/apic/apic.c     |   11 ++++++++++
 arch/x86/kernel/smpboot.c       |   30 ++++++++++++++++++++++++++
 arch/x86/kvm/lapic.c            |   45 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/lapic.h            |    2 ++
 arch/x86/kvm/vmx.c              |    6 +++++
 arch/x86/kvm/x86.c              |    3 +++
 include/linux/cpu.h             |    5 ++++
 kernel/hrtimer.c                |    2 +-
 10 files changed, 108 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index d37ae5c..66e1155 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -44,6 +44,8 @@ static inline void generic_apic_probe(void)
 
 #ifdef CONFIG_X86_LOCAL_APIC
 
+struct clock_event_device;
+
 extern unsigned int apic_verbosity;
 extern int local_apic_timer_c2_ok;
 
@@ -245,6 +247,8 @@ extern void init_apic_mappings(void);
 void register_lapic_address(unsigned long address);
 extern void setup_boot_APIC_clock(void);
 extern void setup_secondary_APIC_clock(void);
+extern void override_local_apic_timer(int cpu,
+	void (*handler)(struct clock_event_device *));
 extern int APIC_init_uniprocessor(void);
 extern int apic_force_enable(unsigned long addr);
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f43680e..a95bb62 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1035,6 +1035,7 @@ int kvm_arch_vcpu_run_prevented(struct kvm_vcpu *vcpu);
 
 #ifdef CONFIG_SLAVE_CPU
 void kvm_get_slave_cpu_mask(struct kvm *kvm, struct cpumask *mask);
+struct kvm_vcpu *get_slave_vcpu(int cpu);
 
 struct kvm_assigned_dev_kernel;
 extern void assign_slave_msi(struct kvm *kvm,
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 24deb30..90ed84a 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -901,6 +901,17 @@ void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
 	set_irq_regs(old_regs);
 }
 
+void override_local_apic_timer(int cpu,
+			       void (*handler)(struct clock_event_device *))
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	per_cpu(lapic_events, cpu).event_handler = handler;
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(override_local_apic_timer);
+
 int setup_profiling_timer(unsigned int multiplier)
 {
 	return -EINVAL;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 45dfc1d..ba7c99b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -133,6 +133,7 @@ static void __ref remove_cpu_from_maps(int cpu);
 #ifdef CONFIG_SLAVE_CPU
 /* Notify slave cpu up and down */
 static RAW_NOTIFIER_HEAD(slave_cpu_chain);
+struct notifier_block *slave_timer_nb;
 
 int register_slave_cpu_notifier(struct notifier_block *nb)
 {
@@ -140,6 +141,13 @@ int register_slave_cpu_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL(register_slave_cpu_notifier);
 
+int register_slave_cpu_timer_notifier(struct notifier_block *nb)
+{
+	slave_timer_nb = nb;
+	return register_slave_cpu_notifier(nb);
+}
+EXPORT_SYMBOL(register_slave_cpu_timer_notifier);
+
 void unregister_slave_cpu_notifier(struct notifier_block *nb)
 {
 	raw_notifier_chain_unregister(&slave_cpu_chain, nb);
@@ -155,6 +163,8 @@ static int slave_cpu_notify(unsigned long val, int cpu)
 
 	return notifier_to_errno(ret);
 }
+
+static void slave_cpu_disable_timer(int cpu);
 #endif
 
 /*
@@ -1013,10 +1023,30 @@ int __cpuinit slave_cpu_up(unsigned int cpu)
 
 	cpu_maps_update_done();
 
+	/* Timer may be used only in starting the slave CPU */
+	slave_cpu_disable_timer(cpu);
+
 	return ret;
 }
 EXPORT_SYMBOL(slave_cpu_up);
 
+static void __slave_cpu_disable_timer(void *hcpu)
+{
+	int cpu = (long)hcpu;
+
+	pr_info("Disabling timer on slave cpu %d\n", cpu);
+	BUG_ON(!slave_timer_nb);
+	slave_timer_nb->notifier_call(slave_timer_nb, CPU_SLAVE_DYING, hcpu);
+}
+
+static void slave_cpu_disable_timer(int cpu)
+{
+	void *hcpu = (void *)(long)cpu;
+
+	slave_cpu_call_function(cpu, __slave_cpu_disable_timer, hcpu);
+	slave_timer_nb->notifier_call(slave_timer_nb, CPU_SLAVE_DEAD, hcpu);
+}
+
 static void __slave_cpu_down(void *dummy)
 {
 	__this_cpu_write(cpu_state, CPU_DYING);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index bf8e351..47288b5 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -652,6 +652,9 @@ static u32 apic_get_tmcct(struct kvm_lapic *apic)
 	if (apic_get_reg(apic, APIC_TMICT) == 0)
 		return 0;
 
+	if (vcpu_has_slave_cpu(apic->vcpu))
+		return apic_read(APIC_TMCCT);
+
 	remaining = hrtimer_get_remaining(&apic->lapic_timer.timer);
 	if (ktime_to_ns(remaining) < 0)
 		remaining = ktime_set(0, 0);
@@ -799,6 +802,10 @@ static void start_apic_timer(struct kvm_lapic *apic)
 	atomic_set(&apic->lapic_timer.pending, 0);
 
 	if (apic_lvtt_period(apic) || apic_lvtt_oneshot(apic)) {
+		if (vcpu_has_slave_cpu(apic->vcpu)) {
+			apic_write(APIC_TMICT, apic_get_reg(apic, APIC_TMICT));
+			return;
+		}
 		/* lapic timer in oneshot or peroidic mode */
 		now = apic->lapic_timer.timer.base->get_time();
 		apic->lapic_timer.period = (u64)apic_get_reg(apic, APIC_TMICT)
@@ -845,6 +852,15 @@ static void start_apic_timer(struct kvm_lapic *apic)
 		unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
 		unsigned long flags;
 
+		if (vcpu_has_slave_cpu(apic->vcpu)) {
+			local_irq_save(flags);
+			guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+			wrmsrl(MSR_IA32_TSCDEADLINE,
+			       tscdeadline - guest_tsc + native_read_tsc());
+			local_irq_restore(flags);
+			return;
+		}
+
 		if (unlikely(!tscdeadline || !this_tsc_khz))
 			return;
 
@@ -971,6 +987,16 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 			val |= APIC_LVT_MASKED;
 		val &= (apic_lvt_mask[0] | apic->lapic_timer.timer_mode_mask);
 		apic_set_reg(apic, APIC_LVTT, val);
+		if (vcpu_has_slave_cpu(apic->vcpu)) {
+			u8 vector = val & 0xff;
+
+			if (vector && vector != LOCAL_TIMER_VECTOR) {
+				pr_err("slave lapic: vector %x unsupported\n",
+				       vector);
+				ret = -EINVAL;
+			} else
+				apic_write(reg, val);
+		}
 		break;
 
 	case APIC_TMICT:
@@ -987,6 +1013,8 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 			apic_debug("KVM_WRITE:TDCR %x\n", val);
 		apic_set_reg(apic, APIC_TDCR, val);
 		update_divide_count(apic);
+		if (vcpu_has_slave_cpu(apic->vcpu))
+			apic_write(reg, val);
 		break;
 
 	case APIC_ESR:
@@ -1267,6 +1295,17 @@ static const struct kvm_io_device_ops apic_mmio_ops = {
 	.write    = apic_mmio_write,
 };
 
+#ifdef CONFIG_SLAVE_CPU
+void slave_lapic_timer_fn(struct clock_event_device *dev)
+{
+	int cpu = raw_smp_processor_id();
+	struct kvm_vcpu *vcpu = get_slave_vcpu(cpu);
+
+	BUG_ON(!vcpu || !vcpu->arch.apic);
+	kvm_timer_fn(&vcpu->arch.apic->lapic_timer.timer);
+}
+#endif
+
 int kvm_create_lapic(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic;
@@ -1295,6 +1334,12 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
 	apic->lapic_timer.kvm = vcpu->kvm;
 	apic->lapic_timer.vcpu = vcpu;
 
+#ifdef CONFIG_SLAVE_CPU
+	if (vcpu_has_slave_cpu(vcpu))
+		override_local_apic_timer(vcpu->arch.slave_cpu,
+					  slave_lapic_timer_fn);
+#endif
+
 	apic->base_address = APIC_DEFAULT_PHYS_BASE;
 	vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE;
 
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 4af5405..e8e51ca 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -55,6 +55,8 @@ int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
 void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
 
+void slave_lapic_timer_fn(struct clock_event_device *dev);
+
 void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
 void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f93e08c..0aa9423 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7557,6 +7557,12 @@ static int __init vmx_init(void)
 	       vmx_msr_bitmap_longmode, PAGE_SIZE);
 	vmx_disable_intercept_for_msr_slave(
 		APIC_BASE_MSR + (APIC_EOI >> 4), false);
+
+	/* Allow guests on slave CPU to access local APIC timer directly */
+	vmx_disable_intercept_for_msr_slave(
+		APIC_BASE_MSR + (APIC_TMICT >> 4), false);
+	vmx_disable_intercept_for_msr_slave(
+		APIC_BASE_MSR + (APIC_TMCCT >> 4), false);
 #endif
 
 	if (enable_ept) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 609ab62..9d92581 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2707,6 +2707,9 @@ static int kvm_arch_vcpu_ioctl_set_slave_cpu(struct kvm_vcpu *vcpu,
 			goto out;
 		BUG_ON(!cpu_slave(slave));
 		per_cpu(slave_vcpu, slave) = vcpu;
+
+		if (vcpu->arch.apic)
+			override_local_apic_timer(slave, slave_lapic_timer_fn);
 	}
 
 	vcpu->arch.slave_cpu = slave;
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index f1aa3cc..f072c6a 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -223,6 +223,7 @@ static inline void enable_nonboot_cpus(void) {}
 
 #ifdef CONFIG_SLAVE_CPU
 int register_slave_cpu_notifier(struct notifier_block *nb);
+int register_slave_cpu_timer_notifier(struct notifier_block *nb);
 void unregister_slave_cpu_notifier(struct notifier_block *nb);
 
 /* CPU notifier constants for slave processors */
@@ -240,6 +241,10 @@ static inline int register_slave_cpu_notifier(struct notifier_block *nb)
 {
 	return 0;
 }
+static inline int register_slave_cpu_timer_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
 static inline void unregister_slave_cpu_notifier(struct notifier_block *nb) {}
 #endif
 
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index e899a2c..61d1706 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1773,7 +1773,7 @@ void __init hrtimers_init(void)
 	hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
 			  (void *)(long)smp_processor_id());
 	register_cpu_notifier(&hrtimers_nb);
-	register_slave_cpu_notifier(&hrtimers_slave_nb);
+	register_slave_cpu_timer_notifier(&hrtimers_slave_nb);
 #ifdef CONFIG_HIGH_RES_TIMERS
 	open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq);
 #endif




* [RFC v2 PATCH 21/21] x86: request TLB flush to slave CPU using NMI
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (19 preceding siblings ...)
  2012-09-06 11:29 ` [RFC v2 PATCH 20/21] KVM: Pass-through local APIC timer of on slave CPUs to guest VM Tomoki Sekiyama
@ 2012-09-06 11:29 ` Tomoki Sekiyama
  2012-09-06 11:46 ` [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Avi Kivity
  2012-09-07  8:26 ` Jan Kiszka
  22 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-06 11:29 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, x86, yrl.pp-manager.tt, Tomoki Sekiyama,
	Avi Kivity, Marcelo Tosatti, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

For slave CPUs, it is inappropriate to request a TLB flush using an IPI,
because the IPI may be delivered to the KVM guest while the slave CPU is
running the guest with direct interrupt routing.

Instead, a TLB flush request is registered in a per-CPU bitmask and an NMI
is sent to interrupt execution of the guest. The NMI handler then checks
for pending requests and handles them.

This implementation has scalability issues and is just a proof of concept.
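
Schematically, the request/acknowledge protocol added to tlb.c works as
follows (comment form only; names follow the patch):

/*
 *   requester (online CPU)                      slave CPU
 *   ----------------------                      ---------
 *   info->mask = cpumask & cpu_slave_mask;
 *   list_add(&info->list, &fti_list);
 *   nr_slave_tlbf++ on each target;
 *   send_IPI_mask(info->mask, NMI_VECTOR); --->  NMI / guest exit / idle:
 *                                                walk fti_list, call
 *                                                flush_tlb_func(), clear
 *   spin until info->mask is empty;        <---  own bit in info->mask
 *   list_del(&info->list); free the mask;
 */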

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---

 arch/x86/include/asm/tlbflush.h |    5 ++
 arch/x86/kernel/smpboot.c       |    3 +
 arch/x86/kvm/x86.c              |    5 ++
 arch/x86/mm/tlb.c               |   94 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 106 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 74a4433..bcd637b 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -168,6 +168,11 @@ static inline void reset_lazy_tlbstate(void)
 	this_cpu_write(cpu_tlbstate.active_mm, &init_mm);
 }
 
+#ifdef CONFIG_SLAVE_CPU
+DECLARE_PER_CPU(bool, slave_idle);
+void handle_slave_tlb_flush(unsigned int cpu);
+#endif	/* SLAVE_CPU */
+
 #endif	/* SMP */
 
 #ifndef CONFIG_PARAVIRT
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ba7c99b..9854087 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -395,7 +395,10 @@ notrace static void __cpuinit start_slave_cpu(void *unused)
 		rcu_note_context_switch(cpu);
 
 		if (!f.func) {
+			__this_cpu_write(slave_idle, 1);
+			handle_slave_tlb_flush(cpu);
 			native_safe_halt();
+			__this_cpu_write(slave_idle, 0);
 			continue;
 		}
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9d92581..d3ee570 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -65,6 +65,7 @@
 #include <asm/cpu.h>
 #include <asm/nmi.h>
 #include <asm/mmu.h>
+#include <asm/tlbflush.h>
 
 #define MAX_IO_MSRS 256
 #define KVM_MAX_MCE_BANKS 32
@@ -5529,6 +5530,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct task_struct *task)
 
 	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
 
+	handle_slave_tlb_flush(vcpu->cpu);
+
 	if (req_immediate_exit)
 		smp_send_reschedule(vcpu->cpu);
 
@@ -5631,6 +5634,8 @@ static void __vcpu_enter_guest_slave(void *_arg)
 
 		r = vcpu_enter_guest(vcpu, arg->task);
 
+		handle_slave_tlb_flush(cpu);
+
 		if (r <= 0)
 			break;
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 613cd83..54f1c1b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -11,6 +11,7 @@
 #include <asm/mmu_context.h>
 #include <asm/cache.h>
 #include <asm/apic.h>
+#include <asm/nmi.h>
 #include <asm/uv/uv.h>
 #include <linux/debugfs.h>
 
@@ -35,6 +36,10 @@ struct flush_tlb_info {
 	struct mm_struct *flush_mm;
 	unsigned long flush_start;
 	unsigned long flush_end;
+#ifdef CONFIG_SLAVE_CPU
+	cpumask_var_t mask;
+	struct list_head list;
+#endif
 };
 
 /*
@@ -97,6 +102,7 @@ EXPORT_SYMBOL_GPL(leave_mm);
 static void flush_tlb_func(void *info)
 {
 	struct flush_tlb_info *f = info;
+	int cpu = smp_processor_id();
 
 	if (f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm))
 		return;
@@ -115,9 +121,94 @@ static void flush_tlb_func(void *info)
 			}
 		}
 	} else
-		leave_mm(smp_processor_id());
+		leave_mm(cpu);
+
+#ifdef CONFIG_SLAVE_CPU
+	if (cpu_slave(cpu))
+		cpumask_test_and_clear_cpu(cpu, f->mask);
+#endif
+}
+
+#ifdef CONFIG_SLAVE_CPU
+static DEFINE_PER_CPU(atomic_t, nr_slave_tlbf);
+DEFINE_PER_CPU(bool, slave_idle);
+static LIST_HEAD(fti_list);
+static DEFINE_RWLOCK(fti_list_lock);
+
+static int slave_tlb_flush_nmi(unsigned int val, struct pt_regs *regs)
+{
+	int cpu = smp_processor_id();
+
+	if (!cpu_slave(cpu) || !atomic_read(&__get_cpu_var(nr_slave_tlbf)))
+		return NMI_DONE;
+	if (this_cpu_read(slave_idle))
+		handle_slave_tlb_flush(cpu);
+	return NMI_HANDLED;
+}
+
+static int __cpuinit register_slave_tlb_flush_nmi(void)
+{
+	register_nmi_handler(NMI_LOCAL, slave_tlb_flush_nmi,
+			     NMI_FLAG_FIRST, "slave_tlb_flush");
+	return 0;
+}
+late_initcall(register_slave_tlb_flush_nmi);
+
+void handle_slave_tlb_flush(unsigned int cpu)
+{
+	struct flush_tlb_info *info;
 
+	if (!cpu_slave(cpu) ||
+	    !atomic_read(&__get_cpu_var(nr_slave_tlbf)))
+		return;
+
+	read_lock(&fti_list_lock);
+	list_for_each_entry(info, &fti_list, list) {
+		if (cpumask_test_cpu(cpu, info->mask)) {
+			flush_tlb_func(info);
+			atomic_dec(&__get_cpu_var(nr_slave_tlbf));
+		}
+	}
+	read_unlock(&fti_list_lock);
+}
+EXPORT_SYMBOL_GPL(handle_slave_tlb_flush);
+
+static void request_slave_tlb_flush(const struct cpumask *mask,
+				    struct flush_tlb_info *info)
+{
+	int cpu;
+
+	if (!cpumask_intersects(mask, cpu_slave_mask))
+		return;
+
+	if (!alloc_cpumask_var(&info->mask, GFP_ATOMIC)) {
+		pr_err("%s: not enough memory\n", __func__);
+		return;
+	}
+
+	cpumask_and(info->mask, mask, cpu_slave_mask);
+	INIT_LIST_HEAD(&info->list);
+	write_lock(&fti_list_lock);
+	list_add(&info->list, &fti_list);
+	write_unlock(&fti_list_lock);
+
+	for_each_cpu_and(cpu, mask, cpu_slave_mask)
+		atomic_inc(&per_cpu(nr_slave_tlbf, cpu));
+
+	apic->send_IPI_mask(info->mask, NMI_VECTOR);
+	while (!cpumask_empty(info->mask))
+		cpu_relax();
+	write_lock(&fti_list_lock);
+	list_del(&info->list);
+	write_unlock(&fti_list_lock);
+	free_cpumask_var(info->mask);
+}
+#else
+static inline void request_slave_tlb_flush(const struct cpumask *mask,
+					   struct flush_tlb_info *info)
+{
 }
+#endif
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
@@ -139,6 +230,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		return;
 	}
 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+	request_slave_tlb_flush(cpumask, &info);
 }
 
 void flush_tlb_current_task(void)




* Re: [RFC v2 PATCH 01/21] x86: Split memory hotplug function from cpu_up() as cpu_memory_up()
  2012-09-06 11:27 ` [RFC v2 PATCH 01/21] x86: Split memory hotplug function from cpu_up() as cpu_memory_up() Tomoki Sekiyama
@ 2012-09-06 11:31   ` Avi Kivity
  2012-09-06 11:32     ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2012-09-06 11:31 UTC (permalink / raw)
  To: Tomoki Sekiyama
  Cc: kvm, linux-kernel, x86, yrl.pp-manager.tt, Marcelo Tosatti,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On 09/06/2012 02:27 PM, Tomoki Sekiyama wrote:
> Split memory hotplug function from cpu_up() as cpu_memory_up(), which will
> be used for assigning memory area to off-lined cpus at following patch
> in this series.
> 

Can you post a summary containing both the general outline, for people
reading this for the first time or who have forgotten it, and the list
of changes from v1?

-- 
error compiling committee.c: too many arguments to function


* Re: [RFC v2 PATCH 01/21] x86: Split memory hotplug function from cpu_up() as cpu_memory_up()
  2012-09-06 11:31   ` Avi Kivity
@ 2012-09-06 11:32     ` Avi Kivity
  0 siblings, 0 replies; 29+ messages in thread
From: Avi Kivity @ 2012-09-06 11:32 UTC (permalink / raw)
  To: Tomoki Sekiyama
  Cc: kvm, linux-kernel, x86, yrl.pp-manager.tt, Marcelo Tosatti,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On 09/06/2012 02:31 PM, Avi Kivity wrote:
> On 09/06/2012 02:27 PM, Tomoki Sekiyama wrote:
>> Split memory hotplug function from cpu_up() as cpu_memory_up(), which will
>> be used for assigning memory area to off-lined cpus at following patch
>> in this series.
>> 
> 
> Can post a summary containing both the general outline for people
> reading this for the first time, or who have forgotten it, and the list
> of changes from v1?
> 

Never mind, I see it was posted, just that I wasn't copied on it.

-- 
error compiling committee.c: too many arguments to function


* Re: [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (20 preceding siblings ...)
  2012-09-06 11:29 ` [RFC v2 PATCH 21/21] x86: request TLB flush to slave CPU using NMI Tomoki Sekiyama
@ 2012-09-06 11:46 ` Avi Kivity
  2012-09-07  8:26 ` Jan Kiszka
  22 siblings, 0 replies; 29+ messages in thread
From: Avi Kivity @ 2012-09-06 11:46 UTC (permalink / raw)
  To: Tomoki Sekiyama; +Cc: kvm, linux-kernel, x86, yrl.pp-manager.tt

On 09/06/2012 02:27 PM, Tomoki Sekiyama wrote:
> This RFC patch series provides facility to dedicate CPUs to KVM guests
> and enable the guests to handle interrupts from passed-through PCI devices
> directly (without VM exit and relay by the host).
> 
> With this feature, we can improve throughput and response time of the device
> and the host's CPU usage by reducing the overhead of interrupt handling.
> This is good for the application using very high throughput/frequent
> interrupt device (e.g. 10GbE NIC).
> Real-time applicatoins also gets benefit from CPU isolation feature, which
> reduces interfare from host kernel tasks and scheduling delay.
> 
> The overview of this patch series is presented in CloudOpen 2012.
> The slides are available at:
> http://events.linuxfoundation.org/images/stories/pdf/lcna_co2012_sekiyama.pdf

During Plumbers 2012, both Intel and AMD disclosed upcoming features to
their processors (APIC-V and AVIC) that allow directing device
interrupts to guest vcpus without host kernel involvement.  This works
without pinning, dedicating a core to a guest, or any special measures
beyond support for the feature.

CPU isolation is still useful to improve real-time latency further, but
this is really independent of kvm.

I am inclined to reject this feature in favour of the new hardware
support.  Sorry, I know this isn't nice to hear, but the extra
maintenance burden cannot be justified for a niche use case with special
limitations when a generally useful feature exploiting proper hardware
support provides the same functionality.

-- 
error compiling committee.c: too many arguments to function


* Re: [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests
  2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
                   ` (21 preceding siblings ...)
  2012-09-06 11:46 ` [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Avi Kivity
@ 2012-09-07  8:26 ` Jan Kiszka
  2012-09-10 11:36   ` Tomoki Sekiyama
  22 siblings, 1 reply; 29+ messages in thread
From: Jan Kiszka @ 2012-09-07  8:26 UTC (permalink / raw)
  To: Tomoki Sekiyama; +Cc: kvm, linux-kernel, x86, yrl.pp-manager.tt

On 2012-09-06 13:27, Tomoki Sekiyama wrote:
> This RFC patch series provides facility to dedicate CPUs to KVM guests
> and enable the guests to handle interrupts from passed-through PCI devices
> directly (without VM exit and relay by the host).
> 
> With this feature, we can improve throughput and response time of the device
> and the host's CPU usage by reducing the overhead of interrupt handling.
> This is good for the application using very high throughput/frequent
> interrupt device (e.g. 10GbE NIC).
> Real-time applicatoins also gets benefit from CPU isolation feature, which
> reduces interfare from host kernel tasks and scheduling delay.
> 
> The overview of this patch series is presented in CloudOpen 2012.
> The slides are available at:
> http://events.linuxfoundation.org/images/stories/pdf/lcna_co2012_sekiyama.pdf

One question regarding your benchmarks: if you measured against standard
KVM, was the vCPU thread running on an isolcpus core of its own as
well?  If not, your numbers for the impact of these patches on maximum,
and possibly also average, latencies are likely too good.  Also, using a
non-RT host and possibly a non-prioritized vCPU thread for the standard
scenario looks like an unfair comparison.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests
  2012-09-07  8:26 ` Jan Kiszka
@ 2012-09-10 11:36   ` Tomoki Sekiyama
  0 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-10 11:36 UTC (permalink / raw)
  To: jan.kiszka; +Cc: kvm, linux-kernel, x86

Hi Jan,

On 2012/09/07 17:26, Jan Kiszka wrote:

> On 2012-09-06 13:27, Tomoki Sekiyama wrote:
>> This RFC patch series provides facility to dedicate CPUs to KVM guests
>> and enable the guests to handle interrupts from passed-through PCI devices
>> directly (without VM exit and relay by the host).
>>
>> With this feature, we can improve throughput and response time of the device
>> and the host's CPU usage by reducing the overhead of interrupt handling.
>> This is good for applications using very high-throughput / frequent-interrupt
>> devices (e.g. a 10GbE NIC).
>> Real-time applications also benefit from the CPU isolation feature, which
>> reduces interference from host kernel tasks and scheduling delay.
>>
>> The overview of this patch series is presented in CloudOpen 2012.
>> The slides are available at:
>> http://events.linuxfoundation.org/images/stories/pdf/lcna_co2012_sekiyama.pdf
> 
> One question regarding your benchmarks: if you measured against standard
> KVM, was the vCPU thread running on an isolcpus core of its own as
> well?  If not, your numbers for the impact of these patches on maximum,
> and possibly also average, latencies are likely too good.  Also, using a
> non-RT host and possibly a non-prioritized vCPU thread for the standard
> scenario looks like an unfair comparison.


In the standard KVM benchmark, the vCPU thread is pinned down to its own
CPU core. In addition, the vCPU thread and irq/*-kvm threads are both set
to the max priority with SCHED_RR policy.
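
For concreteness, a minimal user-space sketch of that baseline setup (the
CPU number and error handling are omitted, the call is assumed to be made
from the vCPU thread itself, and the irq/*-kvm threads would be handled the
same way by PID):

	#define _GNU_SOURCE
	#include <sched.h>

	/*
	 * Pin the calling (vCPU) thread to one core and give it the highest
	 * SCHED_RR priority, matching the baseline described above.
	 */
	static void pin_and_prioritize(int cpu)
	{
		cpu_set_t set;
		struct sched_param sp = {
			.sched_priority = sched_get_priority_max(SCHED_RR),
		};

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		sched_setaffinity(0, sizeof(set), &set); /* 0 = calling thread */
		sched_setscheduler(0, SCHED_RR, &sp);    /* needs CAP_SYS_NICE */
	}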

As you said, an RT host may result in better max latencies, as shown below.
(However, min/average latencies became worse; this might be an issue with
 our setup.)
                                 Min / Avg / Max latencies
Normal KVM
  RT-host     (3.4.4-rt14)      15us / 21us / 117us
  non RT-host (3.5.0-rc6)        6us / 11us / 152us
KVM + Direct IRQ
  non RT-host (3.5.0-rc6 +patch) 1us /  2us /  14us

Thanks,
-- 
Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 PATCH 04/21] x86: Avoid RCU warnings on slave CPUs
  2012-09-06 11:27 ` [RFC v2 PATCH 04/21] x86: Avoid RCU warnings " Tomoki Sekiyama
@ 2012-09-20 17:34   ` Paul E. McKenney
  2012-09-28  8:10     ` Tomoki Sekiyama
  0 siblings, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2012-09-20 17:34 UTC (permalink / raw)
  To: Tomoki Sekiyama
  Cc: kvm, linux-kernel, x86, yrl.pp-manager.tt, Avi Kivity,
	Marcelo Tosatti, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Thu, Sep 06, 2012 at 08:27:40PM +0900, Tomoki Sekiyama wrote:
> Initialize RCU-related variables to avoid warnings about RCU usage while
> slave CPUs are running the specified functions. Also notify the RCU
> subsystem before the slave CPU enters the idle state.

Hello, Tomoki,

A few questions and comments interspersed below.

							Thanx, Paul

> Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
> Cc: Avi Kivity <avi@redhat.com>
> Cc: Marcelo Tosatti <mtosatti@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> ---
> 
>  arch/x86/kernel/smpboot.c |    4 ++++
>  kernel/rcutree.c          |   14 ++++++++++++++
>  2 files changed, 18 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index e8cfe377..45dfc1d 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -382,6 +382,8 @@ notrace static void __cpuinit start_slave_cpu(void *unused)
>  		f = per_cpu(slave_cpu_func, cpu);
>  		per_cpu(slave_cpu_func, cpu).func = NULL;
> 
> +		rcu_note_context_switch(cpu);
> +

Why not use rcu_idle_enter() and rcu_idle_exit()?  These would tell
RCU to ignore the slave CPU for the duration of its idle period.
The way you have it, if a slave CPU stayed idle for too long, you
would get RCU CPU stall warnings, and possibly system hangs as well.

Or is this being called from some task that is not the idle task?
If so, you instead want the new rcu_user_enter() and rcu_user_exit()
that are hopefully on their way into 3.7.  Or maybe better, use a real
idle task, so that idle_task(smp_processor_id()) returns true and RCU
stops complaining.  ;-)

Note that CPUs that RCU believes to be idle are not permitted to contain
RCU read-side critical sections, which in turn means no entering the
scheduler, no sleeping, and so on.  There is an RCU_NONIDLE() macro
to tell RCU to pay attention to the CPU only for the duration of the
statement passed to RCU_NONIDLE, and there is also an _rcuidle variant
of the tracing statements to allow tracing from idle.
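
A rough sketch of the rcu_idle_enter()/rcu_idle_exit() approach, applied to
the start_slave_cpu() loop quoted from the patch above (the surrounding names
come from the series, not mainline, and the tracepoint in the comment is
hypothetical):

	if (!f.func) {
		rcu_idle_enter();	/* RCU may now ignore this slave CPU */
		/*
		 * Anything traced in here must use an _rcuidle tracepoint or
		 * be wrapped, e.g. RCU_NONIDLE(trace_slave_cpu_halt(cpu));
		 * trace_slave_cpu_halt() is a hypothetical tracepoint.
		 */
		native_safe_halt();
		rcu_idle_exit();	/* back under RCU's watch before doing work */
		continue;
	}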

>  		if (!f.func) {
>  			native_safe_halt();
>  			continue;
> @@ -1005,6 +1007,8 @@ int __cpuinit slave_cpu_up(unsigned int cpu)
>  	if (IS_ERR(idle))
>  		return PTR_ERR(idle);
> 
> +	slave_cpu_notify(CPU_SLAVE_UP_PREPARE, cpu);
> +
>  	ret = __native_cpu_up(cpu, idle, 1);
> 
>  	cpu_maps_update_done();
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index f280e54..31a7c8c 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -2589,6 +2589,9 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
>  	switch (action) {
>  	case CPU_UP_PREPARE:
>  	case CPU_UP_PREPARE_FROZEN:
> +#ifdef CONFIG_SLAVE_CPU
> +	case CPU_SLAVE_UP_PREPARE:
> +#endif

Why do you need #ifdef here?  Why not define CPU_SLAVE_UP_PREPARE
unconditionally?  Then if CONFIG_SLAVE_CPU=n, rcu_cpu_notify() would
never be invoked with CPU_SLAVE_UP_PREPARE, so no problems.
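
A sketch of that alternative (the header location and the numeric values are
placeholders, not the series' actual definitions):

	/* defined unconditionally, e.g. in the series' <linux/cpu.h> additions */
	#define CPU_SLAVE_UP_PREPARE	0x000b	/* placeholder values */
	#define CPU_SLAVE_DYING		0x000c
	#define CPU_SLAVE_DEAD		0x000d

	/*
	 * rcu_cpu_notify() can then list the cases without #ifdef; with
	 * CONFIG_SLAVE_CPU=n these actions are simply never passed in.
	 */
	switch (action) {
	case CPU_UP_PREPARE:
	case CPU_UP_PREPARE_FROZEN:
	case CPU_SLAVE_UP_PREPARE:
		rcu_prepare_cpu(cpu);
		rcu_prepare_kthreads(cpu);
		break;
	/* ... */
	}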

>  		rcu_prepare_cpu(cpu);
>  		rcu_prepare_kthreads(cpu);
>  		break;
> @@ -2603,6 +2606,9 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
>  		break;
>  	case CPU_DYING:
>  	case CPU_DYING_FROZEN:
> +#ifdef CONFIG_SLAVE_CPU
> +	case CPU_SLAVE_DYING:
> +#endif

Same here.

>  		/*
>  		 * The whole machine is "stopped" except this CPU, so we can
>  		 * touch any data without introducing corruption. We send the
> @@ -2616,6 +2622,9 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
>  	case CPU_DEAD_FROZEN:
>  	case CPU_UP_CANCELED:
>  	case CPU_UP_CANCELED_FROZEN:
> +#ifdef CONFIG_SLAVE_CPU
> +	case CPU_SLAVE_DEAD:
> +#endif

And here.

>  		for_each_rcu_flavor(rsp)
>  			rcu_cleanup_dead_cpu(cpu, rsp);
>  		break;
> @@ -2797,6 +2806,10 @@ static void __init rcu_init_geometry(void)
>  	rcu_num_nodes -= n;
>  }
> 
> +static struct notifier_block __cpuinitdata rcu_slave_nb = {
> +	.notifier_call = rcu_cpu_notify,
> +};
> +
>  void __init rcu_init(void)
>  {
>  	int cpu;
> @@ -2814,6 +2827,7 @@ void __init rcu_init(void)
>  	 * or the scheduler are operational.
>  	 */
>  	cpu_notifier(rcu_cpu_notify, 0);
> +	register_slave_cpu_notifier(&rcu_slave_nb);
>  	for_each_online_cpu(cpu)
>  		rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
>  	check_cpu_stall_init();
> 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re: [RFC v2 PATCH 04/21] x86: Avoid RCU warnings on slave CPUs
  2012-09-20 17:34   ` Paul E. McKenney
@ 2012-09-28  8:10     ` Tomoki Sekiyama
  0 siblings, 0 replies; 29+ messages in thread
From: Tomoki Sekiyama @ 2012-09-28  8:10 UTC (permalink / raw)
  To: paulmck
  Cc: kvm, linux-kernel, x86, yrl.pp-manager.tt, avi, mtosatti, tglx,
	mingo, hpa

Hi Paul,

Thank you for your comments, and sorry for my late reply.

On 2012/09/21 2:34, Paul E. McKenney wrote:

> On Thu, Sep 06, 2012 at 08:27:40PM +0900, Tomoki Sekiyama wrote:
>> Initialize RCU-related variables to avoid warnings about RCU usage while
>> slave CPUs are running the specified functions. Also notify the RCU
>> subsystem before the slave CPU enters the idle state.
> 
> Hello, Tomoki,
> 
> A few questions and comments interspersed below.
>> <snip>
>> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
>> index e8cfe377..45dfc1d 100644
>> --- a/arch/x86/kernel/smpboot.c
>> +++ b/arch/x86/kernel/smpboot.c
>> @@ -382,6 +382,8 @@ notrace static void __cpuinit start_slave_cpu(void *unused)
>>  		f = per_cpu(slave_cpu_func, cpu);
>>  		per_cpu(slave_cpu_func, cpu).func = NULL;
>>
>> +		rcu_note_context_switch(cpu);
>> +
> 
> Why not use rcu_idle_enter() and rcu_idle_exit()?  These would tell
> RCU to ignore the slave CPU for the duration of its idle period.
> The way you have it, if a slave CPU stayed idle for too long, you
> would get RCU CPU stall warnings, and possibly system hangs as well. 

That's true; rcu_idle_enter() and rcu_idle_exit() should be used when
the slave CPU is idle. Thanks.

> Or is this being called from some task that is not the idle task?
> If so, you instead want the new rcu_user_enter() and rcu_user_exit()
> that are hopefully on their way into 3.7.  Or maybe better, use a real
> idle task, so that idle_task(smp_processor_id()) returns true and RCU
> stops complaining.  ;-)
>
> Note that CPUs that RCU believes to be idle are not permitted to contain
> RCU read-side critical sections, which in turn means no entering the
> scheduler, no sleeping, and so on.  There is an RCU_NONIDLE() macro
> to tell RCU to pay attention to the CPU only for the duration of the
> statement passed to RCU_NONIDLE, and there is also an _rcuidle variant
> of the tracing statements to allow tracing from idle.

This was because KVM is called as `func', which contains RCU read-side
critical sections and calls rcu_virt_note_context_switch() (that is,
rcu_note_context_switch(cpu)) before entering the guest.
Maybe it should be replaced by rcu_user_enter() and rcu_user_exit() in the
future.
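
A minimal sketch of what the entry code run as `func' effectively does today,
mirroring kvm_guest_enter() on ordinary online CPUs (the wrapper function
name is hypothetical):

	static void slave_guest_enter(void)
	{
		/* equivalent to rcu_note_context_switch(smp_processor_id()) */
		rcu_virt_note_context_switch(smp_processor_id());

		/* ... RCU read-side critical sections, then VM entry ... */
	}

Because of this, the slave CPU must stay visible to RCU while `func' runs;
only the truly idle (halted) part of the loop can use rcu_idle_enter()/
rcu_idle_exit().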

>> --- a/kernel/rcutree.c
>> +++ b/kernel/rcutree.c
>> @@ -2589,6 +2589,9 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
>>  	switch (action) {
>>  	case CPU_UP_PREPARE:
>>  	case CPU_UP_PREPARE_FROZEN:
>> +#ifdef CONFIG_SLAVE_CPU
>> +	case CPU_SLAVE_UP_PREPARE:
>> +#endif
> 
> Why do you need #ifdef here?  Why not define CPU_SLAVE_UP_PREPARE
> unconditionally?  Then if CONFIG_SLAVE_CPU=n, rcu_cpu_notify() would
> never be invoked with CPU_SLAVE_UP_PREPARE, so no problems. 

Agreed. That will make the code simpler.

Thank you again,
-- 
Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2012-09-28  8:10 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-06 11:27 [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Tomoki Sekiyama
2012-09-06 11:27 ` [RFC v2 PATCH 01/21] x86: Split memory hotplug function from cpu_up() as cpu_memory_up() Tomoki Sekiyama
2012-09-06 11:31   ` Avi Kivity
2012-09-06 11:32     ` Avi Kivity
2012-09-06 11:27 ` [RFC v2 PATCH 02/21] x86: Add a facility to use offlined CPUs as slave CPUs Tomoki Sekiyama
2012-09-06 11:27 ` [RFC v2 PATCH 03/21] x86: Support hrtimer on " Tomoki Sekiyama
2012-09-06 11:27 ` [RFC v2 PATCH 04/21] x86: Avoid RCU warnings " Tomoki Sekiyama
2012-09-20 17:34   ` Paul E. McKenney
2012-09-28  8:10     ` Tomoki Sekiyama
2012-09-06 11:27 ` [RFC v2 PATCH 05/21] KVM: Enable/Disable virtualization on slave CPUs are activated/dying Tomoki Sekiyama
2012-09-06 11:27 ` [RFC v2 PATCH 06/21] KVM: Add facility to run guests on slave CPUs Tomoki Sekiyama
2012-09-06 11:27 ` [RFC v2 PATCH 07/21] KVM: handle page faults of slave guests on online CPUs Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 08/21] KVM: Add KVM_GET_SLAVE_CPU and KVM_SET_SLAVE_CPU to vCPU ioctl Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 09/21] KVM: Go back to online CPU on VM exit by external interrupt Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 10/21] KVM: proxy slab operations for slave CPUs on online CPUs Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 11/21] KVM: no exiting from guest when slave CPU halted Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 12/21] x86/apic: Enable external interrupt routing to slave CPUs Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 13/21] x86/apic: IRQ vector remapping on slave for " Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 14/21] KVM: Directly handle interrupts by guests without VM EXIT on " Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 15/21] KVM: add tracepoint on enabling/disabling direct interrupt delivery Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 16/21] KVM: vmx: Add definitions PIN_BASED_PREEMPTION_TIMER Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 17/21] KVM: add kvm_arch_vcpu_prevent_run to prevent VM ENTER when NMI is received Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 18/21] KVM: route assigned devices' MSI/MSI-X directly to guests on slave CPUs Tomoki Sekiyama
2012-09-06 11:28 ` [RFC v2 PATCH 19/21] KVM: Enable direct EOI for directly routed interrupts to guests Tomoki Sekiyama
2012-09-06 11:29 ` [RFC v2 PATCH 20/21] KVM: Pass-through local APIC timer of on slave CPUs to guest VM Tomoki Sekiyama
2012-09-06 11:29 ` [RFC v2 PATCH 21/21] x86: request TLB flush to slave CPU using NMI Tomoki Sekiyama
2012-09-06 11:46 ` [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests Avi Kivity
2012-09-07  8:26 ` Jan Kiszka
2012-09-10 11:36   ` Tomoki Sekiyama

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).