* [PATCH v2 0/7] KVM paravirt remote flush tlb
@ 2012-06-04  5:05 Nikunj A. Dadhania
  2012-06-04  5:06 ` [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
                   ` (6 more replies)
  0 siblings, 7 replies; 37+ messages in thread
From: Nikunj A. Dadhania @ 2012-06-04  5:05 UTC
  To: peterz, mingo, mtosatti, avi
  Cc: raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa


The remote TLB flush APIs busy-wait, which is fine in a bare-metal
scenario. But within a guest, the target vcpus might have been
preempted or blocked. In that scenario, the initiator vcpu would end
up busy-waiting for a long time.

This was discovered in our gang scheduling tests, and another way to
solve it is by para-virtualizing flush_tlb_others_ipi.

This patch set implements a paravirt TLB flush that does not wait for
vcpus that are sleeping; instead, the sleeping vcpus flush the TLB on
guest entry (a simplified sketch follows). The idea was discussed
here:
https://lkml.org/lkml/2012/2/20/157
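
In outline, the protocol looks like this (a minimal sketch only:
vcpu_state, state and flush_on_enter are the names introduced by the
patches below, send_flush_ipi() stands in for the real IPI path, and
patch 3 contains the full pseudo-algorithm including the races):

	/* Guest side: skip vcpus that are not running. */
	for_each_cpu(cpu, flush_mask) {
		struct kvm_vcpu_state *v = &per_cpu(vcpu_state, cpu);

		if (!v->state) {		/* vcpu preempted/sleeping */
			v->flush_on_enter = 1;	/* defer flush to guest entry */
			smp_mb();
			if (!v->state)		/* re-check, see patch 3 */
				cpumask_clear_cpu(cpu, flush_mask);
		}
	}
	send_flush_ipi(flush_mask);		/* only running vcpus remain */

	/* Host side, on guest entry: perform any deferred flush. */
	if (vs->flush_on_enter)
		kvm_x86_ops->tlb_flush(vcpu);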

This also brings in one more dependency, for the lockless page-table
walk performed by get_user_pages_fast (gup_fast). gup_fast disables
interrupts and assumes that the page-table pages will not be freed
during that period. That was fine, as flush_tlb_others_ipi would wait
for all the IPIs to be processed before returning. With the new
approach of not waiting for the sleeping vcpus, this assumption is no
longer valid. So HAVE_RCU_TABLE_FREE is now used to free the pages;
this makes sure that every cpu has at least processed the smp callback
before the pages are freed.
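
To make the assumption concrete, this is the pattern gup_fast relies
on (a minimal sketch, not the actual mm/ code):

	/*
	 * Lockless walker: with IRQs disabled, a conventional flush
	 * IPI cannot be delivered here, so an IPI-based page-table
	 * teardown cannot complete while the walk is in progress.
	 */
	local_irq_save(flags);
	/* ... walk pgd/pud/pmd/pte without taking any locks ... */
	local_irq_restore(flags);

With HAVE_RCU_TABLE_FREE, the IRQs-off region instead acts as an
RCU-sched read-side critical section: the page-table pages are freed
from a call_rcu_sched() callback, which cannot run until every cpu has
passed through a quiescent state, i.e. after the walker has re-enabled
interrupts.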

The patchset depends on the ticketlocks[1] and KVM paravirt
spinlock[2] patches.

Changelog from v1:
• Race fixes reported by Vatsa
• Address gup_fast dependency using PeterZ's rcu table free patch
• Fix rcu_table_free for hw pagetable walkers
• Increased SPIN_THRESHOLD to 8k to address the baseline regression
  in ebizzy (non-PLE). Raghu is working on tuning the threshold value
  along with ple_window and ple_gap.

Here are the results from PLE hardware. Setup details:
• 8 CPUs (HT disabled)
• 64-bit VM
   • 8 vcpus
   • 1GB RAM

Numbers are % improvement/degradation w.r.t. the base kernel
3.4.0-rc4 (commit af3a3ab2).

Note: SPIN_THRESHOLD is set to 8192

gang - Base kernel + gang scheduling patches
pvspin - Base kernel + ticketlocks patches + paravirt spinlock patches
pvflush - Base kernel + paravirt tlb flush patches
pvall - pvspin + paravirt tlb flush patches
pvallnople - pvall with PLE disabled (ple_gap = 0)

+-------------+-----------+-----------+-----------+-----------+-----------+
|             |   gang    |  pvspin   |  pvflush  |   pvall   | pvallnople|
+-------------+-----------+-----------+-----------+-----------+-----------+
|  ebizzy-1vm |        2  |        2  |        3  |      -11  |         4 |
|  ebizzy-2vm |      156  |       15  |      -58  |      343  |       110 | 
|  ebizzy-4vm |      238  |       14  |      -42  |       17  |        47 |
+-------------+-----------+-----------+-----------+-----------+-----------+
| specjbb-1vm |        3  |        5  |        3  |        3  |         2 |
| specjbb-2vm |      -10  |        3  |        2  |        2  |         3 |
| specjbb-4vm |        1  |        4  |        3  |        4  |         4 |
+-------------+-----------+-----------+-----------+-----------+-----------+
|  hbench-1vm |      -14  |      -58  |       -1  |        2  |         7 |
|  hbench-2vm |      -35  |       -5  |        7  |       11  |        12 |
|  hbench-4vm |       19  |        8  |       -1  |       14  |        35 |
+-------------+-----------+-----------+-----------+-----------+-----------+
|  dbench-1vm |       -1  |      -17  |      -25  |       -7  |       -18 |
|  dbench-2vm |        3  |       -4  |        1  |        5  |         3 |
|  dbench-4vm |        8  |        6  |       22  |        6  |        -6 |
+-------------+-----------+-----------+-----------+-----------+-----------+
|  kbench-1vm |     -100  |        8  |        4  |        5  |         7 |
|  kbench-2vm |        7  |        9  |        0  |       -2  |        -2 |
|  kbench-4vm |       12  |       -1  |        0  |       -6  |       -15 |
+-------------+-----------+-----------+-----------+-----------+-----------+
| sysbnch-1vm |        4  |        1  |        3  |        4  |         5 |
| sysbnch-2vm |       73  |       15  |       29  |       34  |        49 |
| sysbnch-4vm |       22  |        2  |        9  |       17  |        31 |
+-------------+-----------+-----------+-----------+-----------+-----------+

Observations from the above table:
* pvall does well in most of the benchmarks.
* pvall does not do quite as well for kernbench: 2vm (-2%) and 4vm (-6%)

Another experiment that Vatsa suggested was to disable PLE, since the
paravirt patches provide similar functionality. In those experiments
we did see notable improvements in hackbench and sysbench. Kernbench
degraded further; PLE does help kernbench. This will be addressed by
Raghu's directed-yield approach.

Comments/suggestions welcome.

Regards
Nikunj

---

Nikunj A. Dadhania (6):
      KVM Guest: Add VCPU running/pre-empted state for guest
      KVM-HV: Add VCPU running/pre-empted state for guest
      KVM: Add paravirt kvm_flush_tlb_others
      KVM: export kvm_kick_vcpu for pv_flush
      KVM: Introduce PV kick in flush tlb
      Flush page-table pages before freeing them

Peter Zijlstra (1):
      kvm,x86: RCU based table free


 arch/Kconfig                        |    3 ++
 arch/powerpc/include/asm/pgalloc.h  |    1 +
 arch/s390/mm/pgtable.c              |    1 +
 arch/sparc/include/asm/pgalloc_64.h |    1 +
 arch/x86/Kconfig                    |   12 ++++++
 arch/x86/include/asm/kvm_host.h     |    7 ++++
 arch/x86/include/asm/kvm_para.h     |   15 ++++++++
 arch/x86/include/asm/tlbflush.h     |    9 +++++
 arch/x86/kernel/kvm.c               |   52 ++++++++++++++++++++++----
 arch/x86/kvm/cpuid.c                |    1 +
 arch/x86/kvm/x86.c                  |   57 ++++++++++++++++++++++++++++-
 arch/x86/mm/pgtable.c               |    6 ++-
 arch/x86/mm/tlb.c                   |   70 +++++++++++++++++++++++++++++++++++
 include/asm-generic/tlb.h           |    9 +++++
 mm/memory.c                         |   31 +++++++++++++++-
 15 files changed, 260 insertions(+), 15 deletions(-)



* [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest
  2012-06-04  5:05 [PATCH v2 0/7] KVM paravirt remote flush tlb Nikunj A. Dadhania
@ 2012-06-04  5:06 ` Nikunj A. Dadhania
  2012-06-12 22:43   ` Marcelo Tosatti
  2012-06-04  5:06 ` [PATCH v2 2/7] KVM-HV: " Nikunj A. Dadhania
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Nikunj A. Dadhania @ 2012-06-04  5:06 UTC
  To: peterz, mingo, mtosatti, avi
  Cc: raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

This patch adds the guest side of an MSR shared between guest and
hypervisor. The MSR exports the vcpu running/preempted state from the
host to the guest. This enables the guest to send IPIs only to running
vcpus and to set a flag for preempted vcpus, preventing it from
waiting on vcpus that are not running.
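
In outline, each vcpu registers a 64-byte-aligned per-cpu area with
the hypervisor by writing its physical address, tagged with an enable
bit, to the new MSR (a compact restatement of the diff below):

	v_state = &per_cpu(vcpu_state, cpu);
	memset(v_state, 0, sizeof(*v_state));
	wrmsrl(MSR_KVM_VCPU_STATE, __pa(v_state) | KVM_MSR_ENABLED);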

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_para.h |   10 ++++++++++
 arch/x86/kernel/kvm.c           |   33 +++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 77266d3..f57b5cc 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -24,6 +24,7 @@
 #define KVM_FEATURE_ASYNC_PF		4
 #define KVM_FEATURE_STEAL_TIME		5
 #define KVM_FEATURE_PV_UNHALT		6
+#define KVM_FEATURE_VCPU_STATE          7
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -39,6 +40,7 @@
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
+#define MSR_KVM_VCPU_STATE  0x4b564d04
 
 struct kvm_steal_time {
 	__u64 steal;
@@ -51,6 +53,14 @@ struct kvm_steal_time {
 #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
 #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
 
+struct kvm_vcpu_state {
+	__u32 state;
+	__u32 pad[15];
+};
+
+#define KVM_VCPU_STATE_ALIGN_BITS 5
+#define KVM_VCPU_STATE_VALID_BITS ((-1ULL << (KVM_VCPU_STATE_ALIGN_BITS + 1)))
+
 #define KVM_MAX_MMU_OP_BATCH           32
 
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 98f0378..bb686a6 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -64,6 +64,9 @@ static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
 
+DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+static int has_vcpu_state;
+
 /*
  * No need for any "IO delay" on KVM
  */
@@ -291,6 +294,22 @@ static void kvm_register_steal_time(void)
 		cpu, __pa(st));
 }
 
+static void kvm_register_vcpu_state(void)
+{
+	int cpu = smp_processor_id();
+	struct kvm_vcpu_state *v_state;
+
+	if (!has_vcpu_state)
+		return;
+
+	v_state = &per_cpu(vcpu_state, cpu);
+	memset(v_state, 0, sizeof(*v_state));
+
+	wrmsrl(MSR_KVM_VCPU_STATE, (__pa(v_state) | KVM_MSR_ENABLED));
+	printk(KERN_INFO "kvm-vcpustate: cpu %d, msr %lu\n",
+		cpu, __pa(v_state));
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
 	if (!kvm_para_available())
@@ -310,6 +329,9 @@ void __cpuinit kvm_guest_cpu_init(void)
 
 	if (has_steal_clock)
 		kvm_register_steal_time();
+
+	if (has_vcpu_state)
+		kvm_register_vcpu_state();
 }
 
 static void kvm_pv_disable_apf(void *unused)
@@ -361,6 +383,14 @@ void kvm_disable_steal_time(void)
 	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+void kvm_disable_vcpu_state(void)
+{
+	if (!has_vcpu_state)
+		return;
+
+	wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -379,6 +409,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
 
 static void kvm_guest_cpu_offline(void *dummy)
 {
+	kvm_disable_vcpu_state();
 	kvm_disable_steal_time();
 	kvm_pv_disable_apf(NULL);
 	apf_task_wake_all();
@@ -433,6 +464,8 @@ void __init kvm_guest_init(void)
 		pv_time_ops.steal_clock = kvm_steal_clock;
 	}
 
+	has_vcpu_state = 1;
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 	register_cpu_notifier(&kvm_cpu_notifier);



* [PATCH v2 2/7] KVM-HV: Add VCPU running/pre-empted state for guest
  2012-06-04  5:05 [PATCH v2 0/7] KVM paravirt remote flush tlb Nikunj A. Dadhania
  2012-06-04  5:06 ` [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
@ 2012-06-04  5:06 ` Nikunj A. Dadhania
  2012-06-04  5:07 ` [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Nikunj A. Dadhania @ 2012-06-04  5:06 UTC
  To: peterz, mingo, mtosatti, avi
  Cc: raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

Hypervisor code to indicate the guest's running/preempted status
through an MSR.

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    7 ++++++
 arch/x86/kvm/cpuid.c            |    1 +
 arch/x86/kvm/x86.c              |   45 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dad475b..12fe3c7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -418,6 +418,13 @@ struct kvm_vcpu_arch {
 		struct kvm_steal_time steal;
 	} st;
 
+	/* indicates vcpu is running or preempted */
+	struct {
+		u64 msr_val;
+		struct gfn_to_hva_cache data;
+		struct kvm_vcpu_state vs;
+	} v_state;
+
 	u64 last_guest_tsc;
 	u64 last_kernel_ns;
 	u64 last_host_tsc;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7c93806..0588984 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -409,6 +409,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
 			     (1 << KVM_FEATURE_ASYNC_PF) |
 			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+			     (1 << KVM_FEATURE_VCPU_STATE) |
 			     (1 << KVM_FEATURE_PV_UNHALT);
 
 		if (sched_info_on())
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e5f57b..264f172 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -789,12 +789,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	9
+#define KVM_SAVE_MSRS_BEGIN	10
 static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
 	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
 	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
+	MSR_KVM_VCPU_STATE,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1539,6 +1540,32 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
 }
 
+static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_state *vs = &vcpu->arch.v_state.vs;
+	struct gfn_to_hva_cache *ghc = &vcpu->arch.v_state.data;
+
+	if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	vs->state = 1;
+	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+	smp_wmb();
+}
+
+static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_state *vs = &vcpu->arch.v_state.vs;
+	struct gfn_to_hva_cache *ghc = &vcpu->arch.v_state.data;
+
+	if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	vs->state = 0;
+	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+	smp_wmb();
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
 	bool pr = false;
@@ -1654,6 +1681,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 
 		break;
 
+	case MSR_KVM_VCPU_STATE:
+		if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.v_state.data,
+					      data & KVM_VCPU_STATE_VALID_BITS))
+			return 1;
+
+		vcpu->arch.v_state.msr_val = data;
+		break;
+
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1974,6 +2009,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 	case MSR_KVM_STEAL_TIME:
 		data = vcpu->arch.st.msr_val;
 		break;
+	case MSR_KVM_VCPU_STATE:
+		data = vcpu->arch.v_state.msr_val;
+		break;
 	case MSR_IA32_P5_MC_ADDR:
 	case MSR_IA32_P5_MC_TYPE:
 	case MSR_IA32_MCG_CAP:
@@ -5324,6 +5362,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		kvm_load_guest_fpu(vcpu);
 	kvm_load_guest_xcr0(vcpu);
 
+	kvm_set_vcpu_state(vcpu);
+
 	vcpu->mode = IN_GUEST_MODE;
 
 	/* We should set ->mode before check ->requests,
@@ -5340,6 +5380,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		local_irq_enable();
 		preempt_enable();
 		kvm_x86_ops->cancel_injection(vcpu);
+		kvm_clear_vcpu_state(vcpu);
 		r = 1;
 		goto out;
 	}
@@ -5374,6 +5415,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 
+	kvm_clear_vcpu_state(vcpu);
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();
 	local_irq_enable();
@@ -6029,6 +6071,7 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	vcpu->arch.apf.msr_val = 0;
 	vcpu->arch.st.msr_val = 0;
+	vcpu->arch.v_state.msr_val = 0;
 
 	kvmclock_reset(vcpu);
 



* [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
  2012-06-04  5:05 [PATCH v2 0/7] KVM paravirt remote flush tlb Nikunj A. Dadhania
  2012-06-04  5:06 ` [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
  2012-06-04  5:06 ` [PATCH v2 2/7] KVM-HV: " Nikunj A. Dadhania
@ 2012-06-04  5:07 ` Nikunj A. Dadhania
  2012-06-12 23:02   ` Marcelo Tosatti
                     ` (2 more replies)
  2012-06-04  5:07 ` [PATCH v2 4/7] KVM: export kvm_kick_vcpu for pv_flush Nikunj A. Dadhania
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 37+ messages in thread
From: Nikunj A. Dadhania @ 2012-06-04  5:07 UTC
  To: peterz, mingo, mtosatti, avi
  Cc: raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicate
flush_tlb_others_ipi as kvm_flush_tlb_others so that it can be further
adapted to paravirtualization.

Use the vcpu state information inside kvm_flush_tlb_others to avoid
sending IPIs to preempted vcpus:

* Do not send IPIs to preempted (offline) vcpus; set their
  flush_on_enter flag instead
* For running vcpus: wait for them to clear the flag

The approach was discussed here: https://lkml.org/lkml/2012/2/20/157

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>

--
Pseudo Algo:

   Write()
   ======

	   guest_exit()
		   flush_on_enter[i]=0;
		   running[i] = 0;

	   guest_enter()
		   running[i] = 1;
		   smp_mb();
		   if(flush_on_enter[i]) {
			   tlb_flush()
			   flush_on_enter[i]=0;
		   }


   Read()
   ======

	   GUEST                                                KVM-HV

   f->flushcpumask = cpumask - me;

again:
   for_each_cpu(i, f->flushmask) {

	   if (!running[i]) {
						   case 1:

						   running[n]=1

						   (cpuN does not see
						   flush_on_enter set,
						   guest later finds it
						   running and sends ipi,
						   we are fine here, need
						   to clear the flag on
						   guest_exit)

		  flush_on_enter[i] = 1;
					   case 2:

						   running[n]=1
						   (cpuN - will see flush
						   on enter and an IPI as
						   well - addressed in patch-4)

		  if (!running[i])
		     cpu_clear(f->flushmask);      All is well, vm_enter
						   will do the fixup
	   }
						   case 3:
						   running[n] = 0;

						   (cpuN went to sleep,
						   we saw it as awake,
						   ipi sent, but wait
						   will break without
						   zero_mask and goto
						   again will take care)

   }
   send_ipi(f->flushmask)

   wait_a_while_for_zero_mask();

   if (!zero_mask)
	   goto again;
---
 arch/x86/include/asm/kvm_para.h |    3 +-
 arch/x86/include/asm/tlbflush.h |    9 ++++++
 arch/x86/kernel/kvm.c           |    1 +
 arch/x86/kvm/x86.c              |   14 ++++++++-
 arch/x86/mm/tlb.c               |   61 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f57b5cc..684a285 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -55,7 +55,8 @@ struct kvm_steal_time {
 
 struct kvm_vcpu_state {
 	__u32 state;
-	__u32 pad[15];
+	__u32 flush_on_enter;
+	__u32 pad[14];
 };
 
 #define KVM_VCPU_STATE_ALIGN_BITS 5
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index c0e108e..29470bd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -119,6 +119,12 @@ static inline void native_flush_tlb_others(const struct cpumask *cpumask,
 {
 }
 
+static inline void kvm_flush_tlb_others(const struct cpumask *cpumask,
+					struct mm_struct *mm,
+					unsigned long va)
+{
+}
+
 static inline void reset_lazy_tlbstate(void)
 {
 }
@@ -145,6 +151,9 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     struct mm_struct *mm, unsigned long va);
 
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+			  struct mm_struct *mm, unsigned long va);
+
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index bb686a6..66db54e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,6 +465,7 @@ void __init kvm_guest_init(void)
 	}
 
 	has_vcpu_state = 1;
+	pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
 
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 264f172..4714a7b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1548,9 +1548,20 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
 	if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
 		return;
 
+	/* 
+	 * Let the guest know that we are online, make sure we do not
+	 * overwrite flush_on_enter, just write the vs->state.
+	 */
 	vs->state = 1;
-	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 1*sizeof(__u32));
 	smp_wmb();
+	/* 
+	 * Guest might have seen us offline and would have set
+	 * flush_on_enter. 
+	 */
+	kvm_read_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+	if (vs->flush_on_enter) 
+		kvm_x86_ops->tlb_flush(vcpu);
 }
 
 static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
@@ -1561,6 +1572,7 @@ static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
 	if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
 		return;
 
+	vs->flush_on_enter = 0;
 	vs->state = 0;
 	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
 	smp_wmb();
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d6c0418..f5dacdd 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -6,6 +6,7 @@
 #include <linux/interrupt.h>
 #include <linux/module.h>
 #include <linux/cpu.h>
+#include <linux/kvm_para.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -202,6 +203,66 @@ static void flush_tlb_others_ipi(const struct cpumask *cpumask,
 		raw_spin_unlock(&f->tlbstate_lock);
 }
 
+#ifdef CONFIG_KVM_GUEST
+
+DECLARE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+			struct mm_struct *mm, unsigned long va)
+{
+	unsigned int sender;
+	union smp_flush_state *f;
+	int cpu, loop;
+	struct kvm_vcpu_state *v_state;
+
+	/* Caller has disabled preemption */
+	sender = this_cpu_read(tlb_vector_offset);
+	f = &flush_state[sender];
+
+	if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
+		raw_spin_lock(&f->tlbstate_lock);
+
+	f->flush_mm = mm;
+	f->flush_va = va;
+	if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+		/*
+		 * We have to send the IPI only to online vCPUs
+		 * affected. And queue flush_on_enter for pre-empted
+		 * vCPUs
+		 */
+again:
+		for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
+			v_state = &per_cpu(vcpu_state, cpu);
+
+			if (!v_state->state) {
+				v_state->flush_on_enter = 1;
+				smp_mb();
+				if (!v_state->state)
+					cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
+			}
+		}
+
+		if (cpumask_empty(to_cpumask(f->flush_cpumask)))
+			goto out;
+
+		apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
+				    INVALIDATE_TLB_VECTOR_START + sender);
+
+		loop = 1000;
+		while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
+			cpu_relax();
+
+		if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
+			goto again;
+	}
+out:
+	f->flush_mm = NULL;
+	f->flush_va = 0;
+	if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
+		raw_spin_unlock(&f->tlbstate_lock);
+}
+#endif /* CONFIG_KVM_GUEST */
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     struct mm_struct *mm, unsigned long va)
 {



* [PATCH v2 4/7] KVM: export kvm_kick_vcpu for pv_flush
  2012-06-04  5:05 [PATCH v2 0/7] KVM paravirt remote flush tlb Nikunj A. Dadhania
                   ` (2 preceding siblings ...)
  2012-06-04  5:07 ` [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
@ 2012-06-04  5:07 ` Nikunj A. Dadhania
  2012-06-04  5:08 ` [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb Nikunj A. Dadhania
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Nikunj A. Dadhania @ 2012-06-04  5:07 UTC
  To: peterz, mingo, mtosatti, avi
  Cc: raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

Make kvm_kick_cpu available outside of CONFIG_PARAVIRT_SPINLOCKS, so
that the paravirt TLB flush code (introduced in the next patch) can
use it to wake up a halted sender vcpu.

Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_para.h |    4 ++++
 arch/x86/kernel/kvm.c           |   18 +++++++++---------
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 684a285..651a305 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -206,6 +206,7 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
+void kvm_kick_cpu(int cpu);
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 void __init kvm_spinlock_init(void);
@@ -229,6 +230,9 @@ static inline void kvm_disable_steal_time(void)
 {
 	return;
 }
+
+#define kvm_kick_cpu(T) do {} while (0)
+
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 66db54e..5943285 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -487,6 +487,15 @@ static __init int activate_jump_labels(void)
 }
 arch_initcall(activate_jump_labels);
 
+/* Kick a cpu */
+void kvm_kick_cpu(int cpu)
+{
+	int apicid;
+
+	apicid = per_cpu(x86_cpu_to_apicid, cpu);
+	kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
+}
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 enum kvm_contention_stat {
@@ -695,15 +704,6 @@ out:
 }
 PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
 
-/* Kick a cpu by its apicid*/
-static inline void kvm_kick_cpu(int cpu)
-{
-	int apicid;
-
-	apicid = per_cpu(x86_cpu_to_apicid, cpu);
-	kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
-}
-
 /* Kick vcpu waiting on @lock->head to reach value @ticket */
 static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 {



* [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb
  2012-06-04  5:05 [PATCH v2 0/7] KVM paravirt remote flush tlb Nikunj A. Dadhania
                   ` (3 preceding siblings ...)
  2012-06-04  5:07 ` [PATCH v2 4/7] KVM: export kvm_kick_vcpu for pv_flush Nikunj A. Dadhania
@ 2012-06-04  5:08 ` Nikunj A. Dadhania
  2012-07-03  8:07   ` Marcelo Tosatti
  2012-06-04  5:08 ` [PATCH v2 6/7] kvm,x86: RCU based table free Nikunj A. Dadhania
  2012-06-04  5:08 ` [PATCH v2 7/7] Flush page-table pages before freeing them Nikunj A. Dadhania
  6 siblings, 1 reply; 37+ messages in thread
From: Nikunj A. Dadhania @ 2012-06-04  5:08 UTC
  To: peterz, mingo, mtosatti, avi
  Cc: raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

Instead of looping continuously, halt if we do not succeed after some
time.

For vcpus that were running, an IPI is sent. If a vcpu went to sleep
in between, we will also be setting flush_on_enter (harmless). But as
a flush IPI was already sent and will be processed in the IPI handler,
this might result in something undesirable, i.e. it might clear the
flush_cpumask of a new request.

So after sending an IPI and waiting for a while, do a halt and wait
for a kick from the last vcpu.
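
In outline, the handshake looks like this (a simplified restatement of
the diff below):

	/* Sender, once the spin budget is exhausted: */
	f->need_kick = 1;
	smp_mb();
	while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
		halt();			/* woken by KVM_HC_KICK_CPU */

	/* Receiver (flush IPI handler), after clearing itself: */
	if (f->need_kick && cpumask_empty(to_cpumask(f->flush_cpumask))) {
		f->need_kick = 0;
		smp_wmb();
		kvm_kick_cpu(f->sender_cpu);
	}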

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
 arch/x86/mm/tlb.c |   25 +++++++++++++++++--------
 1 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f5dacdd..2c686bf 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -43,6 +43,8 @@ union smp_flush_state {
 	struct {
 		struct mm_struct *flush_mm;
 		unsigned long flush_va;
+		int sender_cpu;
+		unsigned int need_kick;
 		raw_spinlock_t tlbstate_lock;
 		DECLARE_BITMAP(flush_cpumask, NR_CPUS);
 	};
@@ -167,6 +169,11 @@ out:
 	smp_mb__before_clear_bit();
 	cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
 	smp_mb__after_clear_bit();
+	if (f->need_kick && cpumask_empty(to_cpumask(f->flush_cpumask))) {
+		f->need_kick = 0;
+		smp_wmb();
+		kvm_kick_cpu(f->sender_cpu);
+	}
 	inc_irq_stat(irq_tlb_count);
 }
 
@@ -222,15 +229,17 @@ void kvm_flush_tlb_others(const struct cpumask *cpumask,
 	if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
 		raw_spin_lock(&f->tlbstate_lock);
 
+	cpu = smp_processor_id();
 	f->flush_mm = mm;
 	f->flush_va = va;
-	if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+	f->sender_cpu = cpu;
+	f->need_kick = 0;
+	if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(cpu))) {
 		/*
 		 * We have to send the IPI only to online vCPUs
 		 * affected. And queue flush_on_enter for pre-empted
 		 * vCPUs
 		 */
-again:
 		for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
 			v_state = &per_cpu(vcpu_state, cpu);
 
@@ -242,9 +251,6 @@ again:
 			}
 		}
 
-		if (cpumask_empty(to_cpumask(f->flush_cpumask)))
-			goto out;
-
 		apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
 				    INVALIDATE_TLB_VECTOR_START + sender);
 
@@ -252,10 +258,13 @@ again:
 		while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
 			cpu_relax();
 
-		if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
-			goto again;
+		if (!loop) {
+			f->need_kick = 1;
+			smp_mb();
+			while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
+				halt();
+		}
 	}
-out:
 	f->flush_mm = NULL;
 	f->flush_va = 0;
 	if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)



* [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-04  5:05 [PATCH v2 0/7] KVM paravirt remote flush tlb Nikunj A. Dadhania
                   ` (4 preceding siblings ...)
  2012-06-04  5:08 ` [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb Nikunj A. Dadhania
@ 2012-06-04  5:08 ` Nikunj A. Dadhania
  2012-06-05 10:48   ` Stefano Stabellini
  2012-06-04  5:08 ` [PATCH v2 7/7] Flush page-table pages before freeing them Nikunj A. Dadhania
  6 siblings, 1 reply; 37+ messages in thread
From: Nikunj A. Dadhania @ 2012-06-04  5:08 UTC
  To: peterz, mingo, mtosatti, avi
  Cc: raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

get_user_pages_fast() depends on the IPI to hold off page-table
teardown while the tables are locklessly walked with interrupts
disabled. If a vcpu were to be preempted while in this critical
section, another vcpu tearing down page tables would go ahead and
destroy them. When the preempted vcpu resumes, it then touches the
freed pages.

Using HAVE_RCU_TABLE_FREE:

By using call_rcu_sched() to free the page tables you'd need to
receive and process at least one tick on the woken-up cpu after the
freeing, but since the in-progress gup_fast() will have IRQs disabled,
this will be delayed.

http://article.gmane.org/gmane.linux.kernel/1290539
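
On the freeing side this means (a minimal sketch of the
HAVE_RCU_TABLE_FREE path; tlb_remove_table() and
tlb_remove_table_rcu() are the helpers from mm/memory.c):

	/*
	 * Page-table pages are queued on a batch and freed from an
	 * RCU-sched callback; the callback cannot run while any cpu
	 * is still inside an IRQs-off gup_fast() walk.
	 */
	tlb_remove_table(tlb, virt_to_page(pmd));	/* queue the page */

	/* later, when the batch is flushed: */
	call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);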

Tested-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgalloc.h  |    1 +
 arch/s390/mm/pgtable.c              |    1 +
 arch/sparc/include/asm/pgalloc_64.h |    1 +
 arch/x86/mm/pgtable.c               |    6 +++---
 include/asm-generic/tlb.h           |    9 +++++++++
 mm/memory.c                         |    7 +++++++
 6 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index bf301ac..c33ae79 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -49,6 +49,7 @@ static inline void __tlb_remove_table(void *_table)
 
 	pgtable_free(table, shift);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6e765bf..7029ed7 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -730,6 +730,7 @@ void __tlb_remove_table(void *_table)
 	else
 		free_pages((unsigned long) table, ALLOC_ORDER);
 }
+#define __tlb_remove_table __tlb_remove_table
 
 static void tlb_remove_table_smp_sync(void *arg)
 {
diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 40b2d7a..d10913a 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -106,6 +106,7 @@ static inline void __tlb_remove_table(void *_table)
 		is_page = true;
 	pgtable_free(table, is_page);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, bool is_page)
 {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..34fa168 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
 	pgtable_page_dtor(pte);
 	paravirt_release_pte(page_to_pfn(pte));
-	tlb_remove_page(tlb, pte);
+	tlb_remove_table(tlb, pte);
 }
 
 #if PAGETABLE_LEVELS > 2
 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 {
 	paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
-	tlb_remove_page(tlb, virt_to_page(pmd));
+	tlb_remove_table(tlb, virt_to_page(pmd));
 }
 
 #if PAGETABLE_LEVELS > 3
 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 {
 	paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-	tlb_remove_page(tlb, virt_to_page(pud));
+	tlb_remove_table(tlb, virt_to_page(pud));
 }
 #endif	/* PAGETABLE_LEVELS > 3 */
 #endif	/* PAGETABLE_LEVELS > 2 */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f96a5b5..9ac30f7 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -19,6 +19,8 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -60,6 +62,13 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+#else
+
+static inline void tlb_remove_table(struct mmu_gather *tlb, void *page)
+{
+	tlb_remove_page(tlb, page);
+}
+
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 6105f47..c12685d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -297,6 +297,13 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
  * See the comment near struct mmu_table_batch.
  */
 
+#ifndef __tlb_remove_table
+static void __tlb_remove_table(void *table)
+{
+	free_page_and_swap_cache(table);
+}
+#endif
+
 static void tlb_remove_table_smp_sync(void *arg)
 {
 	/* Simply deliver the interrupt */



* [PATCH v2 7/7] Flush page-table pages before freeing them
  2012-06-04  5:05 [PATCH v2 0/7] KVM paravirt remote flush tlb Nikunj A. Dadhania
                   ` (5 preceding siblings ...)
  2012-06-04  5:08 ` [PATCH v2 6/7] kvm,x86: RCU based table free Nikunj A. Dadhania
@ 2012-06-04  5:08 ` Nikunj A. Dadhania
  6 siblings, 0 replies; 37+ messages in thread
From: Nikunj A. Dadhania @ 2012-06-04  5:08 UTC
  To: peterz, mingo, mtosatti, avi
  Cc: raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

From: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>

Certain architectures (viz. x86, arm, s390) have hardware page-table
walkers (#PF). So during the RCU page-table teardown, make sure we do
a TLB flush of the page-table pages on all relevant CPUs to
synchronize against the hardware walkers, and only then free the
pages.

Moreover, the (mm_users < 2) condition does not hold for the above
architectures, as the hardware engine is one of the users.
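
In outline (a compact restatement of the tlb_table_flush() change in
the diff below): flush the TLBs before queueing the batch's RCU
callback, so that hardware walkers cannot still be using the pages
when they are freed:

	if (*batch) {
		/* tlb_flush_mmu() when ARCH_HW_WALKS_PAGE_TABLE is set */
		tlb_table_flush_mmu(tlb);
		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
		*batch = NULL;
	}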

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
 arch/Kconfig     |    3 +++
 arch/x86/Kconfig |   12 ++++++++++++
 mm/memory.c      |   24 ++++++++++++++++++++++--
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 684eb5a..abc3739 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -196,6 +196,9 @@ config HAVE_ARCH_MUTEX_CPU_RELAX
 config HAVE_RCU_TABLE_FREE
 	bool
 
+config ARCH_HW_WALKS_PAGE_TABLE
+       bool
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a9ec0da..b0a9f11 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -617,6 +617,18 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer N.
 
+config PARAVIRT_TLB_FLUSH
+	bool "Paravirtualization layer for TLB Flush"
+	depends on PARAVIRT && SMP && EXPERIMENTAL
+	select HAVE_RCU_TABLE_FREE
+	select ARCH_HW_WALKS_PAGE_TABLE
+	---help---
+	  Paravirtualized Flush TLB replace the native implementation
+	  with something virtualization-friendly (for example, set a
+	  flag for sleeping vcpu and do not wait for it).
+
+	  If you are unsure how to answer this question, answer N.
+
 config PARAVIRT_CLOCK
 	bool
 
diff --git a/mm/memory.c b/mm/memory.c
index c12685d..acfadb8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -335,11 +335,27 @@ static void tlb_remove_table_rcu(struct rcu_head *head)
 	free_page((unsigned long)batch);
 }
 
+#ifdef CONFIG_ARCH_HW_WALKS_PAGE_TABLE
+/*
+ * Some architectures(x86, arm, s390) HW walks the page tables when
+ * the page-table tear down might be happening. So make sure that
+ * before freeing the page-table pages, flush their tlbs
+ */
+static inline void tlb_table_flush_mmu(struct mmu_gather *tlb)
+{
+	tlb_flush_mmu(tlb);
+}
+
+#else
+#define tlb_table_flush_mmu(tlb) do {} while (0)
+#endif
+
 void tlb_table_flush(struct mmu_gather *tlb)
 {
 	struct mmu_table_batch **batch = &tlb->batch;
 
 	if (*batch) {
+		tlb_table_flush_mmu(tlb);
 		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
 		*batch = NULL;
 	}
@@ -351,18 +367,22 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 
 	tlb->need_flush = 1;
 
+#ifndef CONFIG_ARCH_HW_WALKS_PAGE_TABLE
 	/*
-	 * When there's less then two users of this mm there cannot be a
-	 * concurrent page-table walk.
+	 * When there's less then two users of this mm there cannot be
+	 * a concurrent page-table walk for architectures that do not
+	 * have hardware page-table walkers.
 	 */
 	if (atomic_read(&tlb->mm->mm_users) < 2) {
 		__tlb_remove_table(table);
 		return;
 	}
+#endif
 
 	if (*batch == NULL) {
 		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
 		if (*batch == NULL) {
+			tlb_table_flush_mmu(tlb);
 			tlb_remove_table_one(table);
 			return;
 		}



* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-04  5:08 ` [PATCH v2 6/7] kvm,x86: RCU based table free Nikunj A. Dadhania
@ 2012-06-05 10:48   ` Stefano Stabellini
  2012-06-05 11:08     ` Nikunj A Dadhania
  0 siblings, 1 reply; 37+ messages in thread
From: Stefano Stabellini @ 2012-06-05 10:48 UTC
  To: Nikunj A. Dadhania
  Cc: peterz, mingo, mtosatti, avi, raghukt, kvm, linux-kernel, x86,
	jeremy, vatsa, hpa

On Mon, 4 Jun 2012, Nikunj A. Dadhania wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> get_user_pages_fast() depends on the IPI to hold off page table
> teardown while they are locklessly walked with interrupts disabled.
> If a vcpu were to be preempted while in this critical section, another
> vcpu tearing down page tables would go ahead and destroy them.  when
> the preempted vcpu resumes it then touches the freed pages.
> 
> Using HAVE_RCU_TABLE_FREE:
> 
> By using call_rcu_sched() to free the page-tables you'd need to
> receive and process at least one tick on the woken up cpu after the
> freeing, but since the in-progress gup_fast() will have IRQs disabled
> this will be delayed.
> 
> http://article.gmane.org/gmane.linux.kernel/1290539
> 
> Tested-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
>
>  arch/powerpc/include/asm/pgalloc.h  |    1 +
>  arch/s390/mm/pgtable.c              |    1 +
>  arch/sparc/include/asm/pgalloc_64.h |    1 +
>  arch/x86/mm/pgtable.c               |    6 +++---
>  include/asm-generic/tlb.h           |    9 +++++++++
>  mm/memory.c                         |    7 +++++++
>  6 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
> index bf301ac..c33ae79 100644
> --- a/arch/powerpc/include/asm/pgalloc.h
> +++ b/arch/powerpc/include/asm/pgalloc.h
> @@ -49,6 +49,7 @@ static inline void __tlb_remove_table(void *_table)
>  
>  	pgtable_free(table, shift);
>  }
> +#define __tlb_remove_table __tlb_remove_table
>  #else /* CONFIG_SMP */
>  static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
>  {
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index 6e765bf..7029ed7 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -730,6 +730,7 @@ void __tlb_remove_table(void *_table)
>  	else
>  		free_pages((unsigned long) table, ALLOC_ORDER);
>  }
> +#define __tlb_remove_table __tlb_remove_table
>  
>  static void tlb_remove_table_smp_sync(void *arg)
>  {
> diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
> index 40b2d7a..d10913a 100644
> --- a/arch/sparc/include/asm/pgalloc_64.h
> +++ b/arch/sparc/include/asm/pgalloc_64.h
> @@ -106,6 +106,7 @@ static inline void __tlb_remove_table(void *_table)
>  		is_page = true;
>  	pgtable_free(table, is_page);
>  }
> +#define __tlb_remove_table __tlb_remove_table
>  #else /* CONFIG_SMP */
>  static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, bool is_page)
>  {
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 8573b83..34fa168 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
>  {
>  	pgtable_page_dtor(pte);
>  	paravirt_release_pte(page_to_pfn(pte));
> -	tlb_remove_page(tlb, pte);
> +	tlb_remove_table(tlb, pte);
>  }
>  
>  #if PAGETABLE_LEVELS > 2
>  void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
>  {
>  	paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
> -	tlb_remove_page(tlb, virt_to_page(pmd));
> +	tlb_remove_table(tlb, virt_to_page(pmd));
>  }
>  
>  #if PAGETABLE_LEVELS > 3
>  void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
>  {
>  	paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
> -	tlb_remove_page(tlb, virt_to_page(pud));
> +	tlb_remove_table(tlb, virt_to_page(pud));
>  }
>  #endif	/* PAGETABLE_LEVELS > 3 */
>  #endif	/* PAGETABLE_LEVELS > 2 */
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index f96a5b5..9ac30f7 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -19,6 +19,8 @@
>  #include <asm/pgalloc.h>
>  #include <asm/tlbflush.h>
>  
> +static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
> +
>  #ifdef CONFIG_HAVE_RCU_TABLE_FREE
>  /*
>   * Semi RCU freeing of the page directories.
> @@ -60,6 +62,13 @@ struct mmu_table_batch {
>  extern void tlb_table_flush(struct mmu_gather *tlb);
>  extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
>  
> +#else
> +
> +static inline void tlb_remove_table(struct mmu_gather *tlb, void *page)
> +{
> +	tlb_remove_page(tlb, page);
> +}
> +
>  #endif
>  
>  /*
> diff --git a/mm/memory.c b/mm/memory.c
> index 6105f47..c12685d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -297,6 +297,13 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
>   * See the comment near struct mmu_table_batch.
>   */
>  
> +#ifndef __tlb_remove_table
> +static void __tlb_remove_table(void *table)
> +{
> +	free_page_and_swap_cache(table);
> +}
> +#endif
> +
>  static void tlb_remove_table_smp_sync(void *arg)
>  {
>  	/* Simply deliver the interrupt */
 

I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
Maybe we can pull our efforts together :-)

Taking a look at this patch, it doesn't look like it is introducing
CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
How is the user supposed to set it?

I am appending the version of this patch I was working on: it introduces
a pvop in order not to force HAVE_RCU_TABLE_FREE on native x86.


From cbb816810f17b57627285383c3d0f771dc89939f Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, 22 May 2012 11:39:44 +0000
Subject: [PATCH] [RFC] Introduce a new pv_mmu_op to free a directory page

At the moment get_user_pages_fast is unsafe on Xen, because it relies
on the fact that flush_tlb_others sends an IPI to flush the tlb, but
xen_flush_tlb_others doesn't send any IPIs and always returns
successfully and immediately.

The kernel offers HAVE_RCU_TABLE_FREE, which enables an RCU lock to
protect this kind of software page-table walk (see
include/asm-generic/tlb.h).
The goal of this patch is to enable HAVE_RCU_TABLE_FREE on Xen without
impacting the native x86 case.

The original patch to enable HAVE_RCU_TABLE_FREE on x86 is by Peter
Zijlstra, see http://marc.info/?l=linux-kernel&m=133595422305515&w=2; I
have only modified it so that it enables HAVE_RCU_TABLE_FREE on Xen
but not on native.
It does so by introducing a new pv_mmu_op to free a directory page:
pv_mmu_ops.tlb_remove_table.
The pvop is set to tlb_remove_page on native and to tlb_remove_table on
Xen.

The result is that if CONFIG_XEN is disabled, the behavior is the same
as today.
If CONFIG_XEN is enabled and we are running on native,
HAVE_RCU_TABLE_FREE is set but tlb_remove_table is never called: we
still call tlb_remove_page so there should be no performance penalty.
If CONFIG_XEN is enabled and we are running on Xen, we make full use
of the RCU directory-page freeing as described in tlb.h.

Please advise on how to proceed: is it correct to enable
HAVE_RCU_TABLE_FREE but never call tlb_remove_table?
Is a pvop really the best thing to do here?
Maybe we can just implement get_user_pages_fast as a call to
get_user_pages on Xen?

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
---
 arch/powerpc/include/asm/pgalloc.h    |    1 +
 arch/s390/mm/pgtable.c                |    1 +
 arch/sparc/include/asm/pgalloc_64.h   |    1 +
 arch/x86/include/asm/paravirt.h       |    6 ++++++
 arch/x86/include/asm/paravirt_types.h |    3 +++
 arch/x86/include/asm/pgalloc.h        |    1 +
 arch/x86/kernel/paravirt.c            |    2 ++
 arch/x86/mm/pgtable.c                 |    6 +++---
 arch/x86/xen/Kconfig                  |    1 +
 arch/x86/xen/mmu.c                    |    7 +++++++
 mm/memory.c                           |    7 +++++++
 11 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index bf301ac..c33ae79 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -49,6 +49,7 @@ static inline void __tlb_remove_table(void *_table)
 
 	pgtable_free(table, shift);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6e765bf..7029ed7 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -730,6 +730,7 @@ void __tlb_remove_table(void *_table)
 	else
 		free_pages((unsigned long) table, ALLOC_ORDER);
 }
+#define __tlb_remove_table __tlb_remove_table
 
 static void tlb_remove_table_smp_sync(void *arg)
 {
diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 40b2d7a..d10913a 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -106,6 +106,7 @@ static inline void __tlb_remove_table(void *_table)
 		is_page = true;
 	pgtable_free(table, is_page);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, bool is_page)
 {
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index aa0f913..42c6a9b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -402,6 +402,12 @@ static inline void flush_tlb_others(const struct cpumask *cpumask,
 	PVOP_VCALL3(pv_mmu_ops.flush_tlb_others, cpumask, mm, va);
 }
 
+static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb,
+		struct page *page)
+{
+	PVOP_VCALL2(pv_mmu_ops.tlb_remove_table, tlb, page);
+}
+
 static inline int paravirt_pgd_alloc(struct mm_struct *mm)
 {
 	return PVOP_CALL1(int, pv_mmu_ops.pgd_alloc, mm);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8e8b9a4..7e5ad6d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -51,6 +51,7 @@ struct mm_struct;
 struct desc_struct;
 struct task_struct;
 struct cpumask;
+struct mmu_gather;
 
 /*
  * Wrapper type for pointers to code which uses the non-standard
@@ -251,6 +252,8 @@ struct pv_mmu_ops {
 	void (*flush_tlb_others)(const struct cpumask *cpus,
 				 struct mm_struct *mm,
 				 unsigned long va);
+	/* free a directory page */
+	void (*tlb_remove_table)(struct mmu_gather *tlb, struct page *page);
 
 	/* Hooks for allocating and freeing a pagetable top-level */
 	int  (*pgd_alloc)(struct mm_struct *mm);
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index b4389a4..0cc3251 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -20,6 +20,7 @@ static inline void paravirt_alloc_pud(struct mm_struct *mm, unsigned long pfn)	{
 static inline void paravirt_release_pte(unsigned long pfn) {}
 static inline void paravirt_release_pmd(unsigned long pfn) {}
 static inline void paravirt_release_pud(unsigned long pfn) {}
+#define paravirt_tlb_remove_table(tlb, page) tlb_remove_page(tlb, page)
 #endif
 
 /*
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab13760..370b3b4 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -37,6 +37,7 @@
 #include <asm/fixmap.h>
 #include <asm/apic.h>
 #include <asm/tlbflush.h>
+#include <asm/tlb.h>
 #include <asm/timer.h>
 #include <asm/special_insns.h>
 
@@ -422,6 +423,7 @@ struct pv_mmu_ops pv_mmu_ops = {
 	.flush_tlb_kernel = native_flush_tlb_global,
 	.flush_tlb_single = native_flush_tlb_single,
 	.flush_tlb_others = native_flush_tlb_others,
+	.tlb_remove_table = tlb_remove_page,
 
 	.pgd_alloc = __paravirt_pgd_alloc,
 	.pgd_free = paravirt_nop,
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..904d45c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
 	pgtable_page_dtor(pte);
 	paravirt_release_pte(page_to_pfn(pte));
-	tlb_remove_page(tlb, pte);
+	paravirt_tlb_remove_table(tlb, pte);
 }
 
 #if PAGETABLE_LEVELS > 2
 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 {
 	paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
-	tlb_remove_page(tlb, virt_to_page(pmd));
+	paravirt_tlb_remove_table(tlb, virt_to_page(pmd));
 }
 
 #if PAGETABLE_LEVELS > 3
 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 {
 	paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-	tlb_remove_page(tlb, virt_to_page(pud));
+	paravirt_tlb_remove_table(tlb, virt_to_page(pud));
 }
 #endif	/* PAGETABLE_LEVELS > 3 */
 #endif	/* PAGETABLE_LEVELS > 2 */
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index fdce49c..e2efb29 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -6,6 +6,7 @@ config XEN
 	bool "Xen guest support"
 	select PARAVIRT
 	select PARAVIRT_CLOCK
+	select HAVE_RCU_TABLE_FREE if SMP
 	depends on X86_64 || (X86_32 && X86_PAE && !X86_VISWS)
 	depends on X86_CMPXCHG && X86_TSC
 	help
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 69f5857..a58153b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -62,6 +62,7 @@
 #include <asm/init.h>
 #include <asm/pat.h>
 #include <asm/smp.h>
+#include <asm/tlb.h>
 
 #include <asm/xen/hypercall.h>
 #include <asm/xen/hypervisor.h>
@@ -1281,6 +1282,11 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
 
+static void xen_tlb_remove_table(struct mmu_gather *tlb, struct page *page)
+{
+	tlb_remove_table(tlb, page);
+}
+
 static unsigned long xen_read_cr3(void)
 {
 	return this_cpu_read(xen_cr3);
@@ -2006,6 +2012,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 	.flush_tlb_kernel = xen_flush_tlb,
 	.flush_tlb_single = xen_flush_tlb_single,
 	.flush_tlb_others = xen_flush_tlb_others,
+	.tlb_remove_table = xen_tlb_remove_table,
 
 	.pte_update = paravirt_nop,
 	.pte_update_defer = paravirt_nop,
diff --git a/mm/memory.c b/mm/memory.c
index 6105f47..c12685d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -297,6 +297,13 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
  * See the comment near struct mmu_table_batch.
  */
 
+#ifndef __tlb_remove_table
+static void __tlb_remove_table(void *table)
+{
+	free_page_and_swap_cache(table);
+}
+#endif
+
 static void tlb_remove_table_smp_sync(void *arg)
 {
 	/* Simply deliver the interrupt */
-- 
1.7.2.5



* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-05 10:48   ` Stefano Stabellini
@ 2012-06-05 11:08     ` Nikunj A Dadhania
  2012-06-05 11:58       ` Stefano Stabellini
  0 siblings, 1 reply; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-06-05 11:08 UTC
  To: Stefano Stabellini
  Cc: peterz, mingo, mtosatti, avi, raghukt, kvm, linux-kernel, x86,
	jeremy, vatsa, hpa

On Tue, 5 Jun 2012 11:48:02 +0100, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> 
> I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
> Maybe we can pull our efforts together :-)
> 
> Giving a look at this patch, it doesn't look like it is introducing
> CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
> How is the user supposed to set it?
>
I am doing that in the next patch, only for the KVM paravirt TLB
flush, as there is a bug in this implementation that patch [7/7]
fixes.

Refer following thread for details:
http://mid.gmane.org/1337254086.4281.26.camel@twins
http://mid.gmane.org/1337273959.4281.62.camel@twins

Regards
Nikunj



* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-05 11:08     ` Nikunj A Dadhania
@ 2012-06-05 11:58       ` Stefano Stabellini
  2012-06-05 13:04         ` Nikunj A Dadhania
  0 siblings, 1 reply; 37+ messages in thread
From: Stefano Stabellini @ 2012-06-05 11:58 UTC
  To: Nikunj A Dadhania
  Cc: Stefano Stabellini, peterz, mingo, mtosatti, avi, raghukt, kvm,
	linux-kernel, x86, jeremy, vatsa, hpa, Konrad Rzeszutek Wilk

On Tue, 5 Jun 2012, Nikunj A Dadhania wrote:
> On Tue, 5 Jun 2012 11:48:02 +0100, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > 
> > I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
> > Maybe we can pull our efforts together :-)
> > 
> > Giving a look at this patch, it doesn't look like it is introducing
> > CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
> > How is the user supposed to set it?
> >
> I am doing that in the next patch only for KVM-ParavirtTLB flush, as
> there is a bug in this implementation that patch [7/7] fixes.
> 
> Refer following thread for details:
> http://mid.gmane.org/1337254086.4281.26.camel@twins
> http://mid.gmane.org/1337273959.4281.62.camel@twins

Thanks, somehow I missed the 7/7 patch.

From the Xen POV, your patch is fine because we'll just select
PARAVIRT_TLB_FLUSH on CONFIG_XEN (see appended patch for completeness).

The main difference between the two approaches is that a kernel with
PARAVIRT_TLB_FLUSH and/or CONFIG_XEN enabled is going to have
HAVE_RCU_TABLE_FREE even when running on native.

Are you proposing this series for 3.5?
If not (because it depends on ticketlocks and KVM Paravirt Spinlock
patches), could you extract patch 6/7 and 7/7 and send them out
separately?
I am saying this because Xen needs the HAVE_RCU_TABLE_FREE fix even if
the pv ticketlocks are not accepted. Unfortunately, this is an
outstanding bug for us.
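
For reference, the native assumption being broken is roughly this
(a sketch of the gup_fast fast path, not the literal code):

	local_irq_save(flags);
	/*
	 * Walk pgd/pud/pmd/pte and take page references.  Safe on
	 * native x86 because freeing a page table first requires the
	 * remote TLB flush IPI to be acked, and IPIs cannot be acked
	 * here with interrupts disabled.  xen_flush_tlb_others sends
	 * no IPI, so that protection silently disappears.
	 */
	local_irq_restore(flags);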

---

xen: select PARAVIRT_TLB_FLUSH if SMP

At the moment get_user_pages_fast is unsafe on Xen, because it relies on
the fact that flush_tlb_others sends an IPI to flush the tlb, but
xen_flush_tlb_others doesn't send any IPIs and always returns
successfully and immediately.

Select PARAVIRT_TLB_FLUSH, which enables RCU protection for this kind
of software pagetable walk (also see HAVE_RCU_TABLE_FREE and
include/asm-generic/tlb.h).

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index fdce49c..18c9876 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -6,6 +6,7 @@ config XEN
 	bool "Xen guest support"
 	select PARAVIRT
 	select PARAVIRT_CLOCK
+	select PARAVIRT_TLB_FLUSH if SMP
 	depends on X86_64 || (X86_32 && X86_PAE && !X86_VISWS)
 	depends on X86_CMPXCHG && X86_TSC
 	help


* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-05 11:58       ` Stefano Stabellini
@ 2012-06-05 13:04         ` Nikunj A Dadhania
  2012-06-05 13:08           ` Peter Zijlstra
  2012-06-05 13:21           ` Stefano Stabellini
  0 siblings, 2 replies; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-06-05 13:04 UTC (permalink / raw)
  To: Stefano Stabellini, peterz
  Cc: Stefano Stabellini, mingo, mtosatti, avi, raghukt, kvm,
	linux-kernel, x86, jeremy, vatsa, hpa, Konrad Rzeszutek Wilk

On Tue, 5 Jun 2012 12:58:32 +0100, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> On Tue, 5 Jun 2012, Nikunj A Dadhania wrote:
> > On Tue, 5 Jun 2012 11:48:02 +0100, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > > 
> > > I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
> > > Maybe we can pull our efforts together :-)
> > > 
> > > Giving a look at this patch, it doesn't look like it is introducing
> > > CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
> > > How is the user supposed to set it?
> > >
> > I am doing that in the next patch only for KVM-ParavirtTLB flush, as
> > there is a bug in this implementation that patch [7/7] fixes.
> > 
> > Refer following thread for details:
> > http://mid.gmane.org/1337254086.4281.26.camel@twins
> > http://mid.gmane.org/1337273959.4281.62.camel@twins
> 
> Thanks, somehow I missed the 7/7 patch.
> 
> From the Xen POV, your patch is fine because we'll just select
> PARAVIRT_TLB_FLUSH on CONFIG_XEN (see appended patch for completeness).
> 
Selecting ARCH_HW_WALKS_PAGE_TABLE in place of PARAVIRT_TLB_FLUSH should
suffice.

> The main difference between the two approaches is that a kernel with
> PARAVIRT_TLB_FLUSH and/or CONFIG_XEN enabled is going to have
> HAVE_RCU_TABLE_FREE even when running on native.
> 
> Are you proposing this series for 3.5?
> If not (because it depends on ticketlocks and KVM Paravirt Spinlock
> patches), 
>
3.6, I suppose, as the merge window is already closed and we are still
having some discussions on the PLE results.

> could you extract patch 6/7 and 7/7 and send them out
> separately?
>
> I am saying this because Xen needs the HAVE_RCU_TABLE_FREE fix even if
> pv ticketlock are not accepted. This is an outstanding bug for us
> unfortunately.
> 
PeterZ has a patch in his tlb-unify branch:

    mm, x86: Add HAVE_RCU_TABLE_FREE support
    
    Implements optional HAVE_RCU_TABLE_FREE support for x86.
    
    This is useful for things like Xen and KVM where paravirt tlb flush
    means the software page table walkers like GUP-fast cannot rely on
    IRQs disabling like regular x86 can.
    
    Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
    Cc: Jeremy Fitzhardinge <jeremy@goop.org>
    Cc: Avi Kivity <avi@redhat.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

http://git.kernel.org/?p=linux/kernel/git/peterz/mmu.git;a=commit;h=8a7e6fa5be9d2645c3394892c870113e6e5d9309

PeterZ, is 7/7 alright to be picked?

Regards
Nikunj




* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-05 13:04         ` Nikunj A Dadhania
@ 2012-06-05 13:08           ` Peter Zijlstra
  2012-06-05 13:26             ` Stefano Stabellini
  2012-06-05 15:29             ` Nikunj A Dadhania
  2012-06-05 13:21           ` Stefano Stabellini
  1 sibling, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2012-06-05 13:08 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Stefano Stabellini, mingo, mtosatti, avi, raghukt, kvm,
	linux-kernel, x86, jeremy, vatsa, hpa, Konrad Rzeszutek Wilk

On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> PeterZ, is 7/7 alright to be picked?

Yeah, I guess it is.. I haven't had time to rework my tlb series yet
though. But these two patches together should make it work for x86.


* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-05 13:04         ` Nikunj A Dadhania
  2012-06-05 13:08           ` Peter Zijlstra
@ 2012-06-05 13:21           ` Stefano Stabellini
  1 sibling, 0 replies; 37+ messages in thread
From: Stefano Stabellini @ 2012-06-05 13:21 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Stefano Stabellini, peterz, mingo, mtosatti, avi, raghukt, kvm,
	linux-kernel, x86, jeremy, vatsa, hpa, Konrad Rzeszutek Wilk

On Tue, 5 Jun 2012, Nikunj A Dadhania wrote:
> On Tue, 5 Jun 2012 12:58:32 +0100, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > On Tue, 5 Jun 2012, Nikunj A Dadhania wrote:
> > > On Tue, 5 Jun 2012 11:48:02 +0100, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > > > 
> > > > I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
> > > > Maybe we can pull our efforts together :-)
> > > > 
> > > > Giving a look at this patch, it doesn't look like it is introducing
> > > > CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
> > > > How is the user supposed to set it?
> > > >
> > > I am doing that in the next patch only for KVM-ParavirtTLB flush, as
> > > there is a bug in this implementation that patch [7/7] fixes.
> > > 
> > > Refer following thread for details:
> > > http://mid.gmane.org/1337254086.4281.26.camel@twins
> > > http://mid.gmane.org/1337273959.4281.62.camel@twins
> > 
> > Thanks, somehow I missed the 7/7 patch.
> > 
> > From the Xen POV, your patch is fine because we'll just select
> > PARAVIRT_TLB_FLUSH on CONFIG_XEN (see appended patch for completeness).
> > 
> Selecting ARCH_HW_WALKS_PAGE_TABLE in place of PARAVIRT_TLB_FLUSH should
> suffice.

We would also need to select HAVE_RCU_TABLE_FREE, but it could be a
good idea not to go through PARAVIRT_TLB_FLUSH.
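
I.e., something like this untested variant of the Kconfig hunk above,
assuming ARCH_HW_WALKS_PAGE_TABLE is the symbol patch 7/7 introduces:

	config XEN
		bool "Xen guest support"
		select PARAVIRT
		select PARAVIRT_CLOCK
		select ARCH_HW_WALKS_PAGE_TABLE if SMP
		select HAVE_RCU_TABLE_FREE if SMP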



* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-05 13:08           ` Peter Zijlstra
@ 2012-06-05 13:26             ` Stefano Stabellini
  2012-06-05 13:31               ` Peter Zijlstra
  2012-08-01 11:23               ` Stefano Stabellini
  2012-06-05 15:29             ` Nikunj A Dadhania
  1 sibling, 2 replies; 37+ messages in thread
From: Stefano Stabellini @ 2012-06-05 13:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikunj A Dadhania, Stefano Stabellini, mingo, mtosatti, avi,
	raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa,
	Konrad Rzeszutek Wilk

On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > PeterZ, is 7/7 alright to be picked?
> 
> Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> though. But these two patches together should make it work for x86.
> 

Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
3.6?


* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-05 13:26             ` Stefano Stabellini
@ 2012-06-05 13:31               ` Peter Zijlstra
  2012-06-05 13:41                 ` Stefano Stabellini
  2012-08-01 11:23               ` Stefano Stabellini
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2012-06-05 13:31 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Nikunj A Dadhania, mingo, mtosatti, avi, raghukt, kvm,
	linux-kernel, x86, jeremy, vatsa, hpa, Konrad Rzeszutek Wilk

On Tue, 2012-06-05 at 14:26 +0100, Stefano Stabellini wrote:
> On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > > PeterZ, is 7/7 alright to be picked?
> > 
> > Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> > though. But these two patches together should make it work for x86.
> > 
> 
> Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
> 3.6?

I wouldn't rush this stuff, but if you're up for it and can convince
Linus that it's all peachy you can try ;-)


* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-05 13:31               ` Peter Zijlstra
@ 2012-06-05 13:41                 ` Stefano Stabellini
  0 siblings, 0 replies; 37+ messages in thread
From: Stefano Stabellini @ 2012-06-05 13:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stefano Stabellini, Nikunj A Dadhania, mingo, mtosatti, avi,
	raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa,
	Konrad Rzeszutek Wilk

On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 14:26 +0100, Stefano Stabellini wrote:
> > On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > > On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > > > PeterZ, is 7/7 alright to be picked?
> > > 
> > > Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> > > though. But these two patches together should make it work for x86.
> > > 
> > 
> > Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
> > 3.6?
> 
> I wouldn't rush this stuff, but if you're up for it and can convince
> Linus that its all peachy you can try ;-)
> 

Fair enough, I think I'll let Konrad make this call :-)


* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-05 13:08           ` Peter Zijlstra
  2012-06-05 13:26             ` Stefano Stabellini
@ 2012-06-05 15:29             ` Nikunj A Dadhania
  1 sibling, 0 replies; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-06-05 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stefano Stabellini, mingo, mtosatti, avi, raghukt, kvm,
	linux-kernel, x86, jeremy, vatsa, hpa, Konrad Rzeszutek Wilk

On Tue, 05 Jun 2012 15:08:08 +0200, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > PeterZ, is 7/7 alright to be picked?
> 
> Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> though. But these two patches together should make it work for x86.
>
I haven't added your SOB to this yet, though I have your name
mentioned in "From". Should I add your SOB? I have added a minor fix
for the !CONFIG_HAVE_RCU_TABLE_FREE case.

Regards
Nikunj



* Re: [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest
  2012-06-04  5:06 ` [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
@ 2012-06-12 22:43   ` Marcelo Tosatti
  2012-06-19  6:03     ` Nikunj A Dadhania
  0 siblings, 1 reply; 37+ messages in thread
From: Marcelo Tosatti @ 2012-06-12 22:43 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Mon, Jun 04, 2012 at 10:36:05AM +0530, Nikunj A. Dadhania wrote:
> The patch adds guest code for msr between guest and hypervisor. The
> msr will export the vcpu running/pre-empted information to the guest
> from host. This will enable guest to intelligently send ipi to running
> vcpus and set flag for pre-empted vcpus. This will prevent waiting for
> vcpus that are not running.
> 
> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/kvm_para.h |   10 ++++++++++
>  arch/x86/kernel/kvm.c           |   33 +++++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 77266d3..f57b5cc 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -24,6 +24,7 @@
>  #define KVM_FEATURE_ASYNC_PF		4
>  #define KVM_FEATURE_STEAL_TIME		5
>  #define KVM_FEATURE_PV_UNHALT		6
> +#define KVM_FEATURE_VCPU_STATE          7
>  
>  /* The last 8 bits are used to indicate how to interpret the flags field
>   * in pvclock structure. If no bits are set, all flags are ignored.
> @@ -39,6 +40,7 @@
>  #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
>  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
>  #define MSR_KVM_STEAL_TIME  0x4b564d03
> +#define MSR_KVM_VCPU_STATE  0x4b564d04
>  
>  struct kvm_steal_time {
>  	__u64 steal;
> @@ -51,6 +53,14 @@ struct kvm_steal_time {
>  #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
>  #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
>  
> +struct kvm_vcpu_state {
> +	__u32 state;
> +	__u32 pad[15];
> +};
> +
> +#define KVM_VCPU_STATE_ALIGN_BITS 5
> +#define KVM_VCPU_STATE_VALID_BITS ((-1ULL << (KVM_VCPU_STATE_ALIGN_BITS + 1)))
> +
>  #define KVM_MAX_MMU_OP_BATCH           32
>  
>  #define KVM_ASYNC_PF_ENABLED			(1 << 0)
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 98f0378..bb686a6 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -64,6 +64,9 @@ static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
>  static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
>  static int has_steal_clock = 0;
>  
> +DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
> +static int has_vcpu_state;
> +
>  /*
>   * No need for any "IO delay" on KVM
>   */
> @@ -291,6 +294,22 @@ static void kvm_register_steal_time(void)
>  		cpu, __pa(st));
>  }
>  
> +static void kvm_register_vcpu_state(void)
> +{
> +	int cpu = smp_processor_id();
> +	struct kvm_vcpu_state *v_state;
> +
> +	if (!has_vcpu_state)
> +		return;
> +
> +	v_state = &per_cpu(vcpu_state, cpu);
> +	memset(v_state, 0, sizeof(*v_state));
> +
> +	wrmsrl(MSR_KVM_VCPU_STATE, (__pa(v_state) | KVM_MSR_ENABLED));
> +	printk(KERN_INFO "kvm-vcpustate: cpu %d, msr %lu\n",
> +		cpu, __pa(v_state));
> +}
> +
>  void __cpuinit kvm_guest_cpu_init(void)
>  {
>  	if (!kvm_para_available())
> @@ -310,6 +329,9 @@ void __cpuinit kvm_guest_cpu_init(void)
>  
>  	if (has_steal_clock)
>  		kvm_register_steal_time();
> +
> +	if (has_vcpu_state)
> +		kvm_register_vcpu_state();
>  }
>  
>  static void kvm_pv_disable_apf(void *unused)
> @@ -361,6 +383,14 @@ void kvm_disable_steal_time(void)
>  	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
>  }
>  
> +void kvm_disable_vcpu_state(void)
> +{
> +	if (!has_vcpu_state)
> +		return;
> +
> +	wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
> +}
> +
>  #ifdef CONFIG_SMP
>  static void __init kvm_smp_prepare_boot_cpu(void)
>  {
> @@ -379,6 +409,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
>  
>  static void kvm_guest_cpu_offline(void *dummy)
>  {
> +	kvm_disable_vcpu_state();
>  	kvm_disable_steal_time();
>  	kvm_pv_disable_apf(NULL);
>  	apf_task_wake_all();
> @@ -433,6 +464,8 @@ void __init kvm_guest_init(void)
>  		pv_time_ops.steal_clock = kvm_steal_clock;
>  	}
>  
> +	has_vcpu_state = 1;
> +

Should be checking for a feature bit, see kvm_para_has_feature() 
examples above in the function.
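
I.e., something along these lines (untested):

	if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
		has_vcpu_state = 1;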


* Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
  2012-06-04  5:07 ` [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
@ 2012-06-12 23:02   ` Marcelo Tosatti
  2012-06-19  6:11     ` Nikunj A Dadhania
  2012-07-03  7:55   ` Marcelo Tosatti
  2012-07-03  8:11   ` Marcelo Tosatti
  2 siblings, 1 reply; 37+ messages in thread
From: Marcelo Tosatti @ 2012-06-12 23:02 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> paravirtualization.
> 
> Use the vcpu state information inside the kvm_flush_tlb_others to
> avoid sending ipi to pre-empted vcpus.
> 
> * Do not send ipi's to offline vcpus and set flush_on_enter flag
> * For online vcpus: Wait for them to clear the flag
> 
> The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
> 
> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>

Why not reintroduce the hypercall to flush TLBs? No waiting, no
entry/exit trickery.

This is half-way-there paravirt with all the downsides. Even though the
guest running information might be useful in other cases.
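
For illustration, the guest side could then collapse to something like
this (KVM_HC_FLUSH_TLBS is a made-up hypercall number here, and the
ABI for passing the cpumask is hand-waved):

	static void kvm_flush_tlb_others(const struct cpumask *cpumask,
					 struct mm_struct *mm, unsigned long va)
	{
		/* hand the whole flush to the host: no IPIs, no busy-wait */
		kvm_hypercall2(KVM_HC_FLUSH_TLBS,
			       __pa(cpumask_bits(cpumask)), va);
	}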

> Pseudo Algo:
> 
>    Write()
>    ======
> 
> 	   guest_exit()
> 		   flush_on_enter[i]=0;
> 		   running[i] = 0;
> 
> 	   guest_enter()
> 		   running[i] = 1;
> 		   smp_mb();
> 		   if(flush_on_enter[i]) {
> 			   tlb_flush()
> 			   flush_on_enter[i]=0;
> 		   }
> 
> 
>    Read()
>    ======
> 
> 	   GUEST                                                KVM-HV
> 
>    f->flushcpumask = cpumask - me;
> 
> again:
>    for_each_cpu(i, f->flushmask) {
> 
> 	   if (!running[i]) {
> 						   case 1:
> 
> 						   running[n]=1
> 
> 						   (cpuN does not see
> 						   flush_on_enter set,
> 						   guest later finds it
> 						   running and sends ipi,
> 						   we are fine here, need
> 						   to clear the flag on
> 						   guest_exit)
> 
> 		  flush_on_enter[i] = 1;
> 						   case2:
> 
> 						   running[n]=1
> 						   (cpuN - will see flush
> 						   on enter and an IPI as
> 						   well - addressed in patch-4)
> 
> 		  if (!running[i])
> 		     cpu_clear(f->flushmask);      All is well, vm_enter
> 						   will do the fixup
> 	   }
> 						   case 3:
> 						   running[n] = 0;
> 
> 						   (cpuN went to sleep,
> 						   we saw it as awake,
> 						   ipi sent, but wait
> 						   will break without
> 						   zero_mask and goto
> 						   again will take care)
> 
>    }
>    send_ipi(f->flushmask)
> 
>    wait_a_while_for_zero_mask();
> 
>    if (!zero_mask)
> 	   goto again;
> ---
>  arch/x86/include/asm/kvm_para.h |    3 +-
>  arch/x86/include/asm/tlbflush.h |    9 ++++++
>  arch/x86/kernel/kvm.c           |    1 +
>  arch/x86/kvm/x86.c              |   14 ++++++++-
>  arch/x86/mm/tlb.c               |   61 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 86 insertions(+), 2 deletions(-)
> 




* Re: [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest
  2012-06-12 22:43   ` Marcelo Tosatti
@ 2012-06-19  6:03     ` Nikunj A Dadhania
  0 siblings, 0 replies; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-06-19  6:03 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Tue, 12 Jun 2012 19:43:10 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Jun 04, 2012 at 10:36:05AM +0530, Nikunj A. Dadhania wrote:
> > The patch adds guest code for msr between guest and hypervisor. The
> > msr will export the vcpu running/pre-empted information to the guest
> > from host. This will enable guest to intelligently send ipi to running
> > vcpus and set flag for pre-empted vcpus. This will prevent waiting for
> > vcpus that are not running.
> > 
> > Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>

[...]

> > @@ -433,6 +464,8 @@ void __init kvm_guest_init(void)
> >  		pv_time_ops.steal_clock = kvm_steal_clock;
> >  	}
> >  
> > +	has_vcpu_state = 1;
> > +
> 
> Should be checking for a feature bit, see kvm_para_has_feature() 
> examples above in the function.
>
Sure, will take of this in my next version. 




* Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
  2012-06-12 23:02   ` Marcelo Tosatti
@ 2012-06-19  6:11     ` Nikunj A Dadhania
  2012-06-21 12:26       ` Marcelo Tosatti
  0 siblings, 1 reply; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-06-19  6:11 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

Hi Marcelo,

Thanks for the review.

On Tue, 12 Jun 2012 20:02:18 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > paravirtualization.
> > 
> > Use the vcpu state information inside the kvm_flush_tlb_others to
> > avoid sending ipi to pre-empted vcpus.
> > 
> > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > * For online vcpus: Wait for them to clear the flag
> > 
> > The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
> > 
> > Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> 
> Why not reintroduce the hypercall to flush TLBs? 
Sure, I will get a version with this.

> No waiting, no entry/exit trickery.
> 
Are you also suggesting that we get rid of the vcpu-running information?
We would at least need this to raise a flush TLB hypercall on the
sleeping vcpu.

> This is half-way-there paravirt with all the downsides. 
Some more details on what the downsides are would help us reach a
better solution.

> Even though the guest running information might be useful in other
> cases.
> 
Yes, that was one of the things on my mind.

> > Pseudo Algo:
> > 
> >    Write()
> >    ======
> > 
> > 	   guest_exit()
> > 		   flush_on_enter[i]=0;
> > 		   running[i] = 0;
> > 
> > 	   guest_enter()
> > 		   running[i] = 1;
> > 		   smp_mb();
> > 		   if(flush_on_enter[i]) {
> > 			   tlb_flush()
> > 			   flush_on_enter[i]=0;
> > 		   }
> > 
> > 
> >    Read()
> >    ======
> > 
> > 	   GUEST                                                KVM-HV
> > 
> >    f->flushcpumask = cpumask - me;
> > 
> > again:
> >    for_each_cpu(i, f->flushmask) {
> > 
> > 	   if (!running[i]) {
> > 						   case 1:
> > 
> > 						   running[n]=1
> > 
> > 						   (cpuN does not see
> > 						   flush_on_enter set,
> > 						   guest later finds it
> > 						   running and sends ipi,
> > 						   we are fine here, need
> > 						   to clear the flag on
> > 						   guest_exit)
> > 
> > 		  flush_on_enter[i] = 1;
> > 						   case2:
> > 
> > 						   running[n]=1
> > 						   (cpuN - will see flush
> > 						   on enter and an IPI as
> > 						   well - addressed in patch-4)
> > 
> > 		  if (!running[i])
> > 		     cpu_clear(f->flushmask);      All is well, vm_enter
> > 						   will do the fixup
> > 	   }
> > 						   case 3:
> > 						   running[n] = 0;
> > 
> > 						   (cpuN went to sleep,
> > 						   we saw it as awake,
> > 						   ipi sent, but wait
> > 						   will break without
> > 						   zero_mask and goto
> > 						   again will take care)
> > 
> >    }
> >    send_ipi(f->flushmask)
> > 
> >    wait_a_while_for_zero_mask();
> > 
> >    if (!zero_mask)
> > 	   goto again;
> > ---
> >  arch/x86/include/asm/kvm_para.h |    3 +-
> >  arch/x86/include/asm/tlbflush.h |    9 ++++++
> >  arch/x86/kernel/kvm.c           |    1 +
> >  arch/x86/kvm/x86.c              |   14 ++++++++-
> >  arch/x86/mm/tlb.c               |   61 +++++++++++++++++++++++++++++++++++++++
> >  5 files changed, 86 insertions(+), 2 deletions(-)
> > 
> 
> 
> 



* Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
  2012-06-19  6:11     ` Nikunj A Dadhania
@ 2012-06-21 12:26       ` Marcelo Tosatti
  0 siblings, 0 replies; 37+ messages in thread
From: Marcelo Tosatti @ 2012-06-21 12:26 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Tue, Jun 19, 2012 at 11:41:54AM +0530, Nikunj A Dadhania wrote:
> Hi Marcelo,
> 
> Thanks for the review.
> 
> On Tue, 12 Jun 2012 20:02:18 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> > > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > > paravirtualization.
> > > 
> > > Use the vcpu state information inside the kvm_flush_tlb_others to
> > > avoid sending ipi to pre-empted vcpus.
> > > 
> > > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > > * For online vcpus: Wait for them to clear the flag
> > > 
> > > The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
> > > 
> > > Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> > 
> > Why not reintroduce the hypercall to flush TLBs? 
> Sure, I will get a version with this.
> 
> > No waiting, no entry/exit trickery.
> > 
> Are you also suggesting to get rid of vcpu-running information?

Yes.

> We would atleast need this to raise a flushTLB hypercall on the sleeping
> vcpu.

No, you always raise a flush TLB hypercall in flush_tlb_others, like Xen
does. Instead of an MMIO exit to the APIC to send the IPI, it's a
hypercall exit.

> > This is half-way-there paravirt with all the downsides. 
> Some more details on what are the downsides would help us reach to a
> better solution.

The downside I refer to is the overhead of writing, testing and
maintaining separate code for running as a guest. If you are adding
hypervisor awareness anyway, a hypercall is the simplest way:

- It is simple to request a remote TLB flush via vcpu->requests in the
  hypervisor. Information about which vcpus are in/out of guest mode is
  already available.
- Maintenance of in/out guest mode information in guest memory adds more
  overhead to entry/exit paths.
- No need to handle the interprocessor synchronization in the pseudo
  algo below (already handled by the vcpu->requests mechanism).
- The guest TLB IPI flow is (target vcpu):
  - exit-due-to-external-interrupt.
  - inject-ipi.
  - enter guest mode.
  - execute IPI handler, flush TLB.
  - ACK IPI.
  - source vcpu continues.
- The hypervisor TLB flush flow is:
  - exit-due-to-external-interrupt.
  - execute IPI handler (in host, from make_all_cpus_request)
  - ACK IPI.
  - source vcpu continues.

Unless there is an advantage in using APIC IPI exit vs hypercall
exit?
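
On the host side this maps onto the existing request machinery,
roughly (sketch only):

	/* in the hypercall handler, for each target vcpu */
	kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
	kvm_vcpu_kick(vcpu);	/* force an exit if it is in guest mode */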

> > Even though the guest running information might be useful in other
> > cases.
> > 
> Yes, that was one of the things on the mind.
> 
> > > Pseudo Algo:
> > > 
> > >    Write()
> > >    ======
> > > 
> > > 	   guest_exit()
> > > 		   flush_on_enter[i]=0;
> > > 		   running[i] = 0;
> > > 
> > > 	   guest_enter()
> > > 		   running[i] = 1;
> > > 		   smp_mb();
> > > 		   if(flush_on_enter[i]) {
> > > 			   tlb_flush()
> > > 			   flush_on_enter[i]=0;
> > > 		   }
> > > 
> > > 
> > >    Read()
> > >    ======
> > > 
> > > 	   GUEST                                                KVM-HV
> > > 
> > >    f->flushcpumask = cpumask - me;
> > > 
> > > again:
> > >    for_each_cpu(i, f->flushmask) {
> > > 
> > > 	   if (!running[i]) {
> > > 						   case 1:
> > > 
> > > 						   running[n]=1
> > > 
> > > 						   (cpuN does not see
> > > 						   flush_on_enter set,
> > > 						   guest later finds it
> > > 						   running and sends ipi,
> > > 						   we are fine here, need
> > > 						   to clear the flag on
> > > 						   guest_exit)
> > > 
> > > 		  flush_on_enter[i] = 1;
> > > 						   case2:
> > > 
> > > 						   running[n]=1
> > > 						   (cpuN - will see flush
> > > 						   on enter and an IPI as
> > > 						   well - addressed in patch-4)
> > > 
> > > 		  if (!running[i])
> > > 		     cpu_clear(f->flushmask);      All is well, vm_enter
> > > 						   will do the fixup
> > > 	   }
> > > 						   case 3:
> > > 						   running[n] = 0;
> > > 
> > > 						   (cpuN went to sleep,
> > > 						   we saw it as awake,
> > > 						   ipi sent, but wait
> > > 						   will break without
> > > 						   zero_mask and goto
> > > 						   again will take care)
> > > 
> > >    }
> > >    send_ipi(f->flushmask)
> > > 
> > >    wait_a_while_for_zero_mask();
> > > 
> > >    if (!zero_mask)
> > > 	   goto again;
> > > ---
> > >  arch/x86/include/asm/kvm_para.h |    3 +-
> > >  arch/x86/include/asm/tlbflush.h |    9 ++++++
> > >  arch/x86/kernel/kvm.c           |    1 +
> > >  arch/x86/kvm/x86.c              |   14 ++++++++-
> > >  arch/x86/mm/tlb.c               |   61 +++++++++++++++++++++++++++++++++++++++
> > >  5 files changed, 86 insertions(+), 2 deletions(-)
> > > 
> > 
> > 


* Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
  2012-06-04  5:07 ` [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
  2012-06-12 23:02   ` Marcelo Tosatti
@ 2012-07-03  7:55   ` Marcelo Tosatti
  2012-07-03  8:19     ` Nikunj A Dadhania
  2012-07-06  9:47     ` Nikunj A Dadhania
  2012-07-03  8:11   ` Marcelo Tosatti
  2 siblings, 2 replies; 37+ messages in thread
From: Marcelo Tosatti @ 2012-07-03  7:55 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> paravirtualization.
> 
> Use the vcpu state information inside the kvm_flush_tlb_others to
> avoid sending ipi to pre-empted vcpus.
> 
> * Do not send ipi's to offline vcpus and set flush_on_enter flag
> * For online vcpus: Wait for them to clear the flag
> 
> The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
> 
> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> 
> --
> Pseudo Algo:
> 
>    Write()
>    ======
> 
> 	   guest_exit()
> 		   flush_on_enter[i]=0;
> 		   running[i] = 0;
> 
> 	   guest_enter()
> 		   running[i] = 1;
> 		   smp_mb();
> 		   if(flush_on_enter[i]) {
> 			   tlb_flush()
> 			   flush_on_enter[i]=0;
> 		   }
> 
> 
>    Read()
>    ======
> 
> 	   GUEST                                                KVM-HV
> 
>    f->flushcpumask = cpumask - me;
> 
> again:
>    for_each_cpu(i, f->flushmask) {
> 
> 	   if (!running[i]) {
> 						   case 1:
> 
> 						   running[n]=1
> 
> 						   (cpuN does not see
> 						   flush_on_enter set,
> 						   guest later finds it
> 						   running and sends ipi,
> 						   we are fine here, need
> 						   to clear the flag on
> 						   guest_exit)
> 
> 		  flush_on_enter[i] = 1;
> 						   case2:
> 
> 						   running[n]=1
> 						   (cpuN - will see flush
> 						   on enter and an IPI as
> 						   well - addressed in patch-4)
> 
> 		  if (!running[i])
> 		     cpu_clear(f->flushmask);      All is well, vm_enter
> 						   will do the fixup
> 	   }
> 						   case 3:
> 						   running[n] = 0;
> 
> 						   (cpuN went to sleep,
> 						   we saw it as awake,
> 						   ipi sent, but wait
> 						   will break without
> 						   zero_mask and goto
> 						   again will take care)
> 
>    }
>    send_ipi(f->flushmask)
> 
>    wait_a_while_for_zero_mask();
> 
>    if (!zero_mask)
> 	   goto again;

Can you please measure increased vmentry/vmexit overhead? x86/vmexit.c 
of git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git should 
help.
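
(Roughly: build the tests and run the vmexit binary, e.g. something
like "./x86-run x86/vmexit.flat", which prints the cycle cost per exit
type; the exact invocation may differ depending on the tree.)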

>  arch/x86/include/asm/kvm_para.h |    3 +-
>  arch/x86/include/asm/tlbflush.h |    9 ++++++
>  arch/x86/kernel/kvm.c           |    1 +
>  arch/x86/kvm/x86.c              |   14 ++++++++-
>  arch/x86/mm/tlb.c               |   61 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 86 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index f57b5cc..684a285 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -55,7 +55,8 @@ struct kvm_steal_time {
>  
>  struct kvm_vcpu_state {
>  	__u32 state;
> -	__u32 pad[15];
> +	__u32 flush_on_enter;
> +	__u32 pad[14];
>  };
>  
>  #define KVM_VCPU_STATE_ALIGN_BITS 5
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index c0e108e..29470bd 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -119,6 +119,12 @@ static inline void native_flush_tlb_others(const struct cpumask *cpumask,
>  {
>  }
>  
> +static inline void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +					struct mm_struct *mm,
> +					unsigned long va)
> +{
> +}
> +
>  static inline void reset_lazy_tlbstate(void)
>  {
>  }
> @@ -145,6 +151,9 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
>  void native_flush_tlb_others(const struct cpumask *cpumask,
>  			     struct mm_struct *mm, unsigned long va);
>  
> +void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +			  struct mm_struct *mm, unsigned long va);
> +
>  #define TLBSTATE_OK	1
>  #define TLBSTATE_LAZY	2
>  
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index bb686a6..66db54e 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -465,6 +465,7 @@ void __init kvm_guest_init(void)
>  	}
>  
>  	has_vcpu_state = 1;
> +	pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
>  
>  #ifdef CONFIG_SMP
>  	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 264f172..4714a7b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1548,9 +1548,20 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
>  	if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
>  		return;
>  
> +	/* 
> +	 * Let the guest know that we are online, make sure we do not
> +	 * overwrite flush_on_enter, just write the vs->state.
> +	 */
>  	vs->state = 1;
> -	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
> +	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 1*sizeof(__u32));
>  	smp_wmb();
> +	/* 
> +	 * Guest might have seen us offline and would have set
> +	 * flush_on_enter. 
> +	 */
> +	kvm_read_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
> +	if (vs->flush_on_enter) 
> +		kvm_x86_ops->tlb_flush(vcpu);


So flush_tlb_page which was an invlpg now flushes the entire TLB. Did
you take that into account?

>  static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
> @@ -1561,6 +1572,7 @@ static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
>  	if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
>  		return;
>  
> +	vs->flush_on_enter = 0;
>  	vs->state = 0;
>  	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
>  	smp_wmb();
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index d6c0418..f5dacdd 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -6,6 +6,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/module.h>
>  #include <linux/cpu.h>
> +#include <linux/kvm_para.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/mmu_context.h>
> @@ -202,6 +203,66 @@ static void flush_tlb_others_ipi(const struct cpumask *cpumask,
>  		raw_spin_unlock(&f->tlbstate_lock);
>  }
>  
> +#ifdef CONFIG_KVM_GUEST
> +
> +DECLARE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
> +
> +void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +			struct mm_struct *mm, unsigned long va)
> +{
> +	unsigned int sender;
> +	union smp_flush_state *f;
> +	int cpu, loop;
> +	struct kvm_vcpu_state *v_state;
> +
> +	/* Caller has disabled preemption */
> +	sender = this_cpu_read(tlb_vector_offset);
> +	f = &flush_state[sender];
> +
> +	if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
> +		raw_spin_lock(&f->tlbstate_lock);
> +
> +	f->flush_mm = mm;
> +	f->flush_va = va;
> +	if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
> +		/*
> +		 * We have to send the IPI only to online vCPUs
> +		 * affected. And queue flush_on_enter for pre-empted
> +		 * vCPUs
> +		 */
> +again:
> +		for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
> +			v_state = &per_cpu(vcpu_state, cpu);
> +
> +			if (!v_state->state) {

Should use ACCESS_ONCE to make sure the value is not register cached.

> +				v_state->flush_on_enter = 1;
> +				smp_mb();
> +				if (!v_state->state)

And here.
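
Something like this in both places (sketch):

	if (!ACCESS_ONCE(v_state->state)) {
		v_state->flush_on_enter = 1;
		smp_mb();
		if (!ACCESS_ONCE(v_state->state))
			cpumask_clear_cpu(cpu,
					  to_cpumask(f->flush_cpumask));
	}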

> +					cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
> +			}
> +		}
> +
> +		if (cpumask_empty(to_cpumask(f->flush_cpumask)))
> +			goto out;
> +
> +		apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
> +				    INVALIDATE_TLB_VECTOR_START + sender);
> +
> +		loop = 1000;
> +		while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> +			cpu_relax();
> +
> +		if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
> +			goto again;

Is this necessary in addition to the in-guest-mode/out-guest-mode
detection? If so, why?

> +	}
> +out:
> +	f->flush_mm = NULL;
> +	f->flush_va = 0;
> +	if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
> +		raw_spin_unlock(&f->tlbstate_lock);
> +}
> +#endif /* CONFIG_KVM_GUEST */
> +
>  void native_flush_tlb_others(const struct cpumask *cpumask,
>  			     struct mm_struct *mm, unsigned long va)
>  {


* Re: [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb
  2012-06-04  5:08 ` [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb Nikunj A. Dadhania
@ 2012-07-03  8:07   ` Marcelo Tosatti
  2012-07-03  8:25     ` Nikunj A Dadhania
  0 siblings, 1 reply; 37+ messages in thread
From: Marcelo Tosatti @ 2012-07-03  8:07 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Mon, Jun 04, 2012 at 10:38:17AM +0530, Nikunj A. Dadhania wrote:
> In place of looping continuously introduce a halt if we do not succeed
> after some time.
> 
> For vcpus that were running an IPI is sent.  In case, it went to sleep
> between this, we will be doing flush_on_enter(harmless). But as a
> flush IPI was already sent, that will be processed in ipi handler,
> this might result into something undesireable, i.e. It might clear the
> flush_mask of a new request.
> 
> So after sending an IPI and waiting for a while, do a halt and wait
> for a kick from the last vcpu.
> 
> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>

Again, was it determined that this is necessary from benchmarking data
on the in-guest-mode/out-guest-mode patch?



* Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
  2012-06-04  5:07 ` [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
  2012-06-12 23:02   ` Marcelo Tosatti
  2012-07-03  7:55   ` Marcelo Tosatti
@ 2012-07-03  8:11   ` Marcelo Tosatti
  2012-07-03  8:27     ` Nikunj A Dadhania
  2 siblings, 1 reply; 37+ messages in thread
From: Marcelo Tosatti @ 2012-07-03  8:11 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> paravirtualization.
> 
> Use the vcpu state information inside the kvm_flush_tlb_others to
> avoid sending ipi to pre-empted vcpus.
> 
> * Do not send ipi's to offline vcpus and set flush_on_enter flag
> * For online vcpus: Wait for them to clear the flag
> 
> The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
> 
> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> 
> --
> Pseudo Algo:
> 
>    Write()
>    ======
> 
> 	   guest_exit()
> 		   flush_on_enter[i]=0;
> 		   running[i] = 0;
> 
> 	   guest_enter()
> 		   running[i] = 1;
> 		   smp_mb();
> 		   if(flush_on_enter[i]) {
> 			   tlb_flush()
> 			   flush_on_enter[i]=0;
> 		   }
> 
> 
>    Read()
>    ======
> 
> 	   GUEST                                                KVM-HV
> 
>    f->flushcpumask = cpumask - me;
> 
> again:
>    for_each_cpu(i, f->flushmask) {
> 
> 	   if (!running[i]) {
> 						   case 1:
> 
> 						   running[n]=1
> 
> 						   (cpuN does not see
> 						   flush_on_enter set,
> 						   guest later finds it
> 						   running and sends ipi,
> 						   we are fine here, need
> 						   to clear the flag on
> 						   guest_exit)
> 
> 		  flush_on_enter[i] = 1;
> 						   case2:
> 
> 						   running[n]=1
> 						   (cpuN - will see flush
> 						   on enter and an IPI as
> 						   well - addressed in patch-4)
> 
> 		  if (!running[i])
> 		     cpu_clear(f->flushmask);      All is well, vm_enter
> 						   will do the fixup
> 	   }
> 						   case 3:
> 						   running[n] = 0;
> 
> 						   (cpuN went to sleep,
> 						   we saw it as awake,
> 						   ipi sent, but wait
> 						   will break without
> 						   zero_mask and goto
> 						   again will take care)
> 
>    }
>    send_ipi(f->flushmask)
> 
>    wait_a_while_for_zero_mask();
> 
>    if (!zero_mask)
> 	   goto again;
> ---
>  arch/x86/include/asm/kvm_para.h |    3 +-
>  arch/x86/include/asm/tlbflush.h |    9 ++++++
>  arch/x86/kernel/kvm.c           |    1 +
>  arch/x86/kvm/x86.c              |   14 ++++++++-
>  arch/x86/mm/tlb.c               |   61 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 86 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index f57b5cc..684a285 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -55,7 +55,8 @@ struct kvm_steal_time {
>  
>  struct kvm_vcpu_state {
>  	__u32 state;
> -	__u32 pad[15];
> +	__u32 flush_on_enter;
> +	__u32 pad[14];
>  };
>  
>  #define KVM_VCPU_STATE_ALIGN_BITS 5
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index c0e108e..29470bd 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -119,6 +119,12 @@ static inline void native_flush_tlb_others(const struct cpumask *cpumask,
>  {
>  }
>  
> +static inline void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +					struct mm_struct *mm,
> +					unsigned long va)
> +{
> +}
> +
>  static inline void reset_lazy_tlbstate(void)
>  {
>  }
> @@ -145,6 +151,9 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
>  void native_flush_tlb_others(const struct cpumask *cpumask,
>  			     struct mm_struct *mm, unsigned long va);
>  
> +void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +			  struct mm_struct *mm, unsigned long va);
> +
>  #define TLBSTATE_OK	1
>  #define TLBSTATE_LAZY	2
>  
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index bb686a6..66db54e 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -465,6 +465,7 @@ void __init kvm_guest_init(void)
>  	}
>  
>  	has_vcpu_state = 1;
> +	pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
>  
>  #ifdef CONFIG_SMP
>  	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 264f172..4714a7b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c

Please split the guest and host (arch/x86/kernel/kvm.c etc. vs
arch/x86/kvm/) patches.

Please document the guest/host interface
(Documentation/virtual/kvm/paravirt-tlb-flush.txt; add a pointer to it
from msr.txt).




* Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
  2012-07-03  7:55   ` Marcelo Tosatti
@ 2012-07-03  8:19     ` Nikunj A Dadhania
  2012-07-05  2:09       ` Marcelo Tosatti
  2012-07-06  9:47     ` Nikunj A Dadhania
  1 sibling, 1 reply; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-07-03  8:19 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Tue, 3 Jul 2012 04:55:35 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > 
> >    if (!zero_mask)
> > 	   goto again;
> 
> Can you please measure increased vmentry/vmexit overhead? x86/vmexit.c 
> of git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git should 
> help.
>
Sure, I will get back with the results.

> > +	/* 
> > +	 * Guest might have seen us offline and would have set
> > +	 * flush_on_enter. 
> > +	 */
> > +	kvm_read_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
> > +	if (vs->flush_on_enter) 
> > +		kvm_x86_ops->tlb_flush(vcpu);
> 
> 
> So flush_tlb_page which was an invlpg now flushes the entire TLB. Did
> you take that into account?
> 
When the vcpu is sleeping/pre-empted out, multiple requests for flush_tlb
could have happened. And by the time we get here, the full flush cleans
all of them up.

One other approach would be to queue the addresses, which brings up the
question: how many requests to queue? This would also require adding more
synchronization between guest and host for updating the area where these
addresses are shared.

> > +again:
> > +		for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
> > +			v_state = &per_cpu(vcpu_state, cpu);
> > +
> > +			if (!v_state->state) {
> 
> Should use ACCESS_ONCE to make sure the value is not register cached.
> \
> > +				v_state->flush_on_enter = 1;
> > +				smp_mb();
> > +				if (!v_state->state)
> 
> And here.
> 
Sure, I will add ACCESS_ONCE in both places in my next version.

> > +					cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
> > +			}
> > +		}
> > +
> > +		if (cpumask_empty(to_cpumask(f->flush_cpumask)))
> > +			goto out;
> > +
> > +		apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
> > +				    INVALIDATE_TLB_VECTOR_START + sender);
> > +
> > +		loop = 1000;
> > +		while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> > +			cpu_relax();
> > +
> > +		if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
> > +			goto again;
> 
> Is this necessary in addition to the in-guest-mode/out-guest-mode
> detection? If so, why?
> 
The "case 3" where we initially saw the vcpu was running, and a flush
ipi is send to the vcpu. During this time the vcpu might be pre-empted,
so we come out of the loop=1000 with !empty flushmask. We then re-verify
the flushmask against the current running vcpu and make sure that the
vcpu that was pre-empted is un-marked and we can proceed out of the
kvm_flush_tlb_others_ipi without waiting for sleeping/pre-empted vcpus.

Regards
Nikunj



* Re: [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb
  2012-07-03  8:07   ` Marcelo Tosatti
@ 2012-07-03  8:25     ` Nikunj A Dadhania
  2012-07-05  2:37       ` Marcelo Tosatti
  0 siblings, 1 reply; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-07-03  8:25 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Tue, 3 Jul 2012 05:07:13 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Jun 04, 2012 at 10:38:17AM +0530, Nikunj A. Dadhania wrote:
> > In place of looping continuously introduce a halt if we do not succeed
> > after some time.
> > 
> > For vcpus that were running an IPI is sent.  In case, it went to sleep
> > between this, we will be doing flush_on_enter(harmless). But as a
> > flush IPI was already sent, that will be processed in ipi handler,
> > this might result into something undesireable, i.e. It might clear the
> > flush_mask of a new request.
> > 
> > So after sending an IPI and waiting for a while, do a halt and wait
> > for a kick from the last vcpu.
> > 
> > Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> > Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> 
> Again, was it determined that this is necessary from data of 
> benchmarking on the in-guest-mode/out-guest-mode patch?
> 
No, this is more of a correctness fix to the algorithm.



* Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
  2012-07-03  8:11   ` Marcelo Tosatti
@ 2012-07-03  8:27     ` Nikunj A Dadhania
  0 siblings, 0 replies; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-07-03  8:27 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Tue, 3 Jul 2012 05:11:35 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:

> >  arch/x86/include/asm/kvm_para.h |    3 +-
> >  arch/x86/include/asm/tlbflush.h |    9 ++++++
> >  arch/x86/kernel/kvm.c           |    1 +
> >  arch/x86/kvm/x86.c              |   14 ++++++++-
> >  arch/x86/mm/tlb.c               |   61 +++++++++++++++++++++++++++++++++++++++
> >  5 files changed, 86 insertions(+), 2 deletions(-)
> > 

[...]

> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 264f172..4714a7b 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> 
> Please split guest/host (arch/x86/kernel/kvm.c etc VS arch/x86/kvm/)
> patches.
> 
Ok

> Please document guest/host interface
> (Documentation/virtual/kvm/paravirt-tlb-flush.txt, add a pointer to it
> from msr.txt).
> 
Sure.



* Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
  2012-07-03  8:19     ` Nikunj A Dadhania
@ 2012-07-05  2:09       ` Marcelo Tosatti
  2012-07-05  5:55         ` Nikunj A Dadhania
  0 siblings, 1 reply; 37+ messages in thread
From: Marcelo Tosatti @ 2012-07-05  2:09 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Tue, Jul 03, 2012 at 01:49:49PM +0530, Nikunj A Dadhania wrote:
> On Tue, 3 Jul 2012 04:55:35 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > 
> > >    if (!zero_mask)
> > > 	   goto again;
> > 
> > Can you please measure increased vmentry/vmexit overhead? x86/vmexit.c 
> > of git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git should 
> > help.
> >
> Sure will get back with the result.
> 
> > > +	/* 
> > > +	 * Guest might have seen us offline and would have set
> > > +	 * flush_on_enter. 
> > > +	 */
> > > +	kvm_read_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
> > > +	if (vs->flush_on_enter) 
> > > +		kvm_x86_ops->tlb_flush(vcpu);
> > 
> > 
> > So flush_tlb_page which was an invlpg now flushes the entire TLB. Did
> > you take that into account?
> > 
> When the vcpu is sleeping/pre-empted out, multiple request for flush_tlb
> could have happened. And now when we are here, it is cleaning up all the
> TLB.

Yes, cases where there are sufficient exits transforming one TLB entry
invalidation into full TLB invalidation should go unnoticed.

> One other approach would be to queue the addresses, that brings us with
> the question: how many request to queue? This would require us adding
> more syncronization between guest and host for updating the area where
> these addresses is shared.

Sounds unnecessarily complicated.

> > > +again:
> > > +		for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
> > > +			v_state = &per_cpu(vcpu_state, cpu);
> > > +
> > > +			if (!v_state->state) {
> > 
> > Should use ACCESS_ONCE to make sure the value is not register cached.
> > \
> > > +				v_state->flush_on_enter = 1;
> > > +				smp_mb();
> > > +				if (!v_state->state)
> > 
> > And here.
> > 
> Sure will add this check for both in my next version.
> 
> > > +					cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
> > > +			}
> > > +		}
> > > +
> > > +		if (cpumask_empty(to_cpumask(f->flush_cpumask)))
> > > +			goto out;
> > > +
> > > +		apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
> > > +				    INVALIDATE_TLB_VECTOR_START + sender);
> > > +
> > > +		loop = 1000;
> > > +		while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> > > +			cpu_relax();
> > > +
> > > +		if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
> > > +			goto again;
> > 
> > Is this necessary in addition to the in-guest-mode/out-guest-mode
> > detection? If so, why?
> > 
> The "case 3" where we initially saw the vcpu was running, and a flush
> ipi is send to the vcpu. During this time the vcpu might be pre-empted,
> so we come out of the loop=1000 with !empty flushmask. We then re-verify
> the flushmask against the current running vcpu and make sure that the
> vcpu that was pre-empted is un-marked and we can proceed out of the
> kvm_flush_tlb_others_ipi without waiting for sleeping/pre-empted vcpus.
> 
> Regards
> Nikunj


* Re: [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb
  2012-07-03  8:25     ` Nikunj A Dadhania
@ 2012-07-05  2:37       ` Marcelo Tosatti
  2012-07-05  5:53         ` Nikunj A Dadhania
  0 siblings, 1 reply; 37+ messages in thread
From: Marcelo Tosatti @ 2012-07-05  2:37 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Tue, Jul 03, 2012 at 01:55:02PM +0530, Nikunj A Dadhania wrote:
> On Tue, 3 Jul 2012 05:07:13 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Mon, Jun 04, 2012 at 10:38:17AM +0530, Nikunj A. Dadhania wrote:
> > > In place of looping continuously introduce a halt if we do not succeed
> > > after some time.
> > > 
> > > For vcpus that were running an IPI is sent.  In case, it went to sleep
> > > between this, we will be doing flush_on_enter(harmless). But as a
> > > flush IPI was already sent, that will be processed in ipi handler,
> > > this might result into something undesireable, i.e. It might clear the
> > > flush_mask of a new request.
> > > 
> > > So after sending an IPI and waiting for a while, do a halt and wait
> > > for a kick from the last vcpu.
> > > 
> > > Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> > > Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> > 
> > Again, was it determined that this is necessary from data of 
> > benchmarking on the in-guest-mode/out-guest-mode patch?
> > 
> No, this is more of a fix wrt algo.

Please have numbers for the improvement relative to the previous
patch.

It introduces a dependency; these (pvtlbflush and pvspinlocks) are
separate features, and it is useful to be able to switch them on/off
individually.



* Re: [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb
  2012-07-05  2:37       ` Marcelo Tosatti
@ 2012-07-05  5:53         ` Nikunj A Dadhania
  0 siblings, 0 replies; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-07-05  5:53 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Wed, 4 Jul 2012 23:37:46 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Tue, Jul 03, 2012 at 01:55:02PM +0530, Nikunj A Dadhania wrote:
> > On Tue, 3 Jul 2012 05:07:13 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > On Mon, Jun 04, 2012 at 10:38:17AM +0530, Nikunj A. Dadhania wrote:
> > > > Instead of looping continuously, introduce a halt if we do not succeed
> > > > after some time.
> > > > 
> > > > For vcpus that were running, an IPI is sent. If a vcpu went to sleep
> > > > in the meantime, we will also do flush_on_enter (harmless). But as a
> > > > flush IPI was already sent, it will be processed in the ipi handler,
> > > > and this might result in something undesirable, i.e. it might clear
> > > > the flush_mask of a new request.
> > > > 
> > > > So after sending an IPI and waiting for a while, do a halt and wait
> > > > for a kick from the last vcpu.
> > > > 
> > > > Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> > > > Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> > > 
> > > Again, was it determined that this is necessary from benchmarking
> > > data on the in-guest-mode/out-guest-mode patch?
> > > 
> > No, this is more of a correctness fix to the algorithm.
> 
> Please have numbers for the improvement relative to the previous
> patch.
> 
I would consider this more of a correctness fix than an improvement. In
this scenario, suppose vcpu1 was pre-empted before delivery of the IPI.
After the loop count expires, we find that vcpu1 did not respond and is
pre-empted. We set the flush_on_enter flag for vcpu1 and proceed. During
vcpu1's guest_enter we would do a flush_on_enter. But vcpu1 will also
receive the flush ipi in guest mode, where it will try to clear the
flush_mask and acknowledge the interrupt. That ipi processing would not
be correct, since the flush_mask it clears may by then belong to a new
request. So with this patch, we execute a halt and wait for vcpu1 to
clear the flush_mask through the ipi interrupt.
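
In code terms the wait then becomes something like this (a sketch only;
SPIN_LIMIT, send_ipi_mask() and halt() are placeholder names, not the
exact patch code):

	send_ipi_mask(to_cpumask(f->flush_cpumask), INVALIDATE_TLB_VECTOR);

	/* bounded busy-wait for the running vcpus to ack */
	for (loop = SPIN_LIMIT; loop; loop--) {
		if (cpumask_empty(to_cpumask(f->flush_cpumask)))
			return;				/* all acked */
		cpu_relax();
	}

	/*
	 * A vcpu (vcpu1 above) still has the IPI in flight: halt rather
	 * than spin.  The last vcpu to clear its bit in flush_cpumask
	 * sends us a PV kick and wakes us up.
	 */
	while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
		halt();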

> It introduces a dependency: these (pvtlbflush and pvspinlocks) are
> separate features, and it is useful to switch them on/off individually.
> 
Agreed. We can also split out the pv kick feature, which is useful to
both approaches, so that they become independent.
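
The kick itself is feature-agnostic; modelled on the pv-spinlock
series it is essentially the following (a sketch; the exact hypercall
arguments in that series may differ):

	static void kvm_kick_cpu(int cpu)
	{
		int apicid = per_cpu(x86_cpu_to_apicid, cpu);

		/* the host looks up the target vcpu and wakes it if halted */
		kvm_hypercall2(KVM_HC_KICK_CPU, 0, apicid);
	}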

That said, tests suggest that for best results both of these features
should be enabled.

Nikunj


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
  2012-07-05  2:09       ` Marcelo Tosatti
@ 2012-07-05  5:55         ` Nikunj A Dadhania
  0 siblings, 0 replies; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-07-05  5:55 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

On Wed, 4 Jul 2012 23:09:10 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Tue, Jul 03, 2012 at 01:49:49PM +0530, Nikunj A Dadhania wrote:
> > On Tue, 3 Jul 2012 04:55:35 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > > 
> > > >    if (!zero_mask)
> > > > 	   goto again;
> > > 
> > > Can you please measure increased vmentry/vmexit overhead? x86/vmexit.c 
> > > of git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git should 
> > > help.
> > >
> > Sure, will get back with the results.
> > 
> > > > +	/* 
> > > > +	 * Guest might have seen us offline and would have set
> > > > +	 * flush_on_enter. 
> > > > +	 */
> > > > +	kvm_read_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
> > > > +	if (vs->flush_on_enter) 
> > > > +		kvm_x86_ops->tlb_flush(vcpu);
> > > 
> > > 
> > > So flush_tlb_page which was an invlpg now flushes the entire TLB. Did
> > > you take that into account?
> > > 
> > While the vcpu is sleeping/pre-empted out, multiple flush_tlb requests
> > could have accumulated. So by the time we get here, one full TLB flush
> > cleans all of them up at once.
> 
> Yes, cases where there are sufficient exits transforming one TLB entry
> invalidation into full TLB invalidation should go unnoticed.
> 
> > One other approach would be to queue the addresses, but that raises
> > the question: how many requests to queue? It would also require adding
> > more synchronization between guest and host for updating the shared
> > area where these addresses live.
> 
> Sounds unnecessarily complicated.
> 
Yes, I did give this a try earlier, but did not see enough improvement
to justify the complexity it was bringing in.
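
For reference, the host entry-path fixup that makes the single full
flush sufficient, sketched from the snippet quoted above (the
write-back of the cleared flag is my reading of the patch, not quoted
code):

	kvm_read_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
	vs->state = 1;			/* vcpu is running again */
	if (vs->flush_on_enter) {
		/*
		 * One full flush covers every request that piled up
		 * while the vcpu was sleeping/pre-empted.
		 */
		kvm_x86_ops->tlb_flush(vcpu);
		vs->flush_on_enter = 0;
	}
	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));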

Regards
Nikunj


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
  2012-07-03  7:55   ` Marcelo Tosatti
  2012-07-03  8:19     ` Nikunj A Dadhania
@ 2012-07-06  9:47     ` Nikunj A Dadhania
  1 sibling, 0 replies; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-07-06  9:47 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: peterz, mingo, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa, hpa

[-- Attachment #1: Type: text/plain, Size: 5551 bytes --]

On Tue, 3 Jul 2012 04:55:35 -0300, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > paravirtualization.
> > 
> > Use the vcpu state information inside the kvm_flush_tlb_others to
> > avoid sending ipi to pre-empted vcpus.
> > 
> > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > * For online vcpus: Wait for them to clear the flag
> > 
> > The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
> > 
> > Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
> > 
> > --
> > Pseudo Algo:
> > 
> >    Write()
> >    ======
> > 
> > 	   guest_exit()
> > 		   flush_on_enter[i]=0;
> > 		   running[i] = 0;
> > 
> > 	   guest_enter()
> > 		   running[i] = 1;
> > 		   smp_mb();
> > 		   if(flush_on_enter[i]) {
> > 			   tlb_flush()
> > 			   flush_on_enter[i]=0;
> > 		   }
> > 
> > 
> >    Read()
> >    ======
> > 
> > 	   GUEST                                                KVM-HV
> > 
> >    f->flushcpumask = cpumask - me;
> > 
> > again:
> >    for_each_cpu(i, f->flushmask) {
> > 
> > 	   if (!running[i]) {
> > 						   case 1:
> > 
> > 						   running[n]=1
> > 
> > 						   (cpuN does not see
> > 						   flush_on_enter set,
> > 						   guest later finds it
> > 						   running and sends ipi,
> > 						   we are fine here, need
> > 						   to clear the flag on
> > 						   guest_exit)
> > 
> > 		  flush_on_enter[i] = 1;
> > 						   case2:
> > 
> > 						   running[n]=1
> > 						   (cpuN - will see flush
> > 						   on enter and an IPI as
> > 						   well - addressed in patch-4)
> > 
> > 		  if (!running[i])
> > 		     cpu_clear(f->flushmask);      All is well, vm_enter
> > 						   will do the fixup
> > 	   }
> > 						   case 3:
> > 						   running[n] = 0;
> > 
> > 						   (cpuN went to sleep,
> > 						   we saw it as awake,
> > 						   ipi sent, but wait
> > 						   will break without
> > 						   zero_mask and goto
> > 						   again will take care)
> > 
> >    }
> >    send_ipi(f->flushmask)
> > 
> >    wait_a_while_for_zero_mask();
> > 
> >    if (!zero_mask)
> > 	   goto again;
> 
> Can you please measure increased vmentry/vmexit overhead? x86/vmexit.c 
> of git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git should 
> help.
> 

Please find below the results (debug patch attached for enabling
registration of kvm_vcpu_state).

I have taken results for 1 and 4 vcpus, using the following command to
start the tests:

/usr/libexec/qemu-kvm -smp $i -device testdev,chardev=testlog -chardev
file,id=testlog,path=vmexit.out -serial stdio -kernel ./x86/vmexit.flat

Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPUs,
          32 cores, 32 online cpus and 4*64GB RAM.

x  base  - unpatched host kernel
+  wo_vs - patched host kernel, vcpu_state not registered
*  w_vs  - patched host kernel, vcpu_state registered

1 vcpu results:
---------------
    cpuid
    =====
           N        Avg       Stddev
    x     10     2135.1      17.8975
    +     10       2188      18.3666
    *     10     2448.9      43.9910
    
    vmcall
    ======
           N        Avg       Stddev
    x     10     2025.5      38.1641
    +     10     2047.5      24.8205
    *     10     2306.2      40.3066
    
    mov_from_cr8
    ============
           N        Avg       Stddev
    x     10         12       0.0000
    +     10         12       0.0000
    *     10         12       0.0000
    
    mov_to_cr8
    ==========
           N        Avg       Stddev
    x     10       19.4       0.5164
    +     10       19.1       0.3162
    *     10       19.2       0.4216
    
    inl_from_pmtimer
    ================
           N        Avg       Stddev
    x     10    18093.2     462.0543
    +     10    16579.7    1448.8892
    *     10    18577.7     266.2676
    
    ple-round-robin
    ===============
           N        Avg       Stddev
    x     10       16.1       0.3162
    +     10       16.2       0.4216
    *     10       15.3       0.4830

4 vcpus result
--------------
    cpuid
    =====
           N        Avg       Stddev
    x     10     2135.8      10.0642
    +     10       2165       6.4118
    *     10     2423.7      12.5526
    
    vmcall
    ======
           N        Avg       Stddev
    x     10     2028.3      19.6641
    +     10     2024.7       7.2273
    *     10     2276.1      13.8680
    
    mov_from_cr8
    ============
           N        Avg       Stddev
    x     10         12       0.0000
    +     10         12       0.0000
    *     10         12       0.0000
    
    mov_to_cr8
    ==========
           N        Avg       Stddev
    x     10         19       0.0000
    +     10         19       0.0000
    *     10         19       0.0000
    
    inl_from_pmtimer
    ================
           N        Avg       Stddev
    x     10    25574.2    1693.5374
    +     10    25190.7    2219.9223
    *     10      23044    1230.8737
    
    ipi
    ===
           N        Avg       Stddev
    x     20   31996.75    7290.1777
    +     20   33683.25    9795.1601
    *     20    34563.5    8338.7826
    
    ple-round-robin
    ===============
           N        Avg       Stddev
    x     10     6281.7    1543.8601
    +     10     6149.8    1207.7928
    *     10     6433.3    2304.5377

Thanks
Nikunj


[-- Attachment #2: enable_vcpu_state.diff --]
[-- Type: text/x-patch, Size: 1895 bytes --]


Enable and register vcpu_state information with the host
    
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>

diff --git a/x86/vmexit.c b/x86/vmexit.c
index ad8ab55..a9823c9 100644
--- a/x86/vmexit.c
+++ b/x86/vmexit.c
@@ -3,6 +3,7 @@
 #include "smp.h"
 #include "processor.h"
 #include "atomic.h"
+#include "vm.h"
 
 static unsigned int inl(unsigned short port)
 {
@@ -173,10 +174,45 @@ static void enable_nx(void *junk)
 		wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_NX_MASK);
 }
 
+#define KVM_MSR_ENABLED                 1
+#define KVM_FEATURE_VCPU_STATE          7
+#define MSR_KVM_VCPU_STATE              0x4b564d04
+
+struct kvm_vcpu_state {
+        int state;
+        int flush_on_enter;
+        int pad[14];
+};
+
+struct kvm_vcpu_state test[4];
+
+static inline void my_wrmsr(unsigned int msr,
+			  unsigned low, unsigned high)
+{
+  asm volatile("wrmsr" : : "c" (msr), "a"(low), "d" (high) : "memory");
+}
+#define wrmsrl(msr, val) my_wrmsr(msr, (u32)((u64)(val)), ((u64)(val))>>32)
+
+static void enable_vcpu_state(void *junk)
+{
+	struct kvm_vcpu_state *vs;
+	int me = smp_id();
+
+	if (cpuid(0x80000001).d & (1 << KVM_FEATURE_VCPU_STATE)) {
+		vs = &test[me];
+		memset(vs, 0, sizeof(struct kvm_vcpu_state));
+		
+		wrmsrl(MSR_KVM_VCPU_STATE, ((unsigned long)(vs) | KVM_MSR_ENABLED));
+		printf("%d: Done vcpu state %p\n", me, virt_to_phys((void*)vs));
+	}
+}
+
 bool test_wanted(struct test *test, char *wanted[], int nwanted)
 {
 	int i;
 
+	return true;
+
 	if (!nwanted)
 		return true;
 
@@ -192,11 +228,16 @@ int main(int ac, char **av)
 	int i;
 
 	smp_init();
+	setup_vm();
+
 	nr_cpus = cpu_count();
 
 	for (i = cpu_count(); i > 0; i--)
 		on_cpu(i-1, enable_nx, 0);
 
+	for (i = cpu_count(); i > 0; i--)
+		on_cpu(i-1, enable_vcpu_state, 0);
+
 	for (i = 0; i < ARRAY_SIZE(tests); ++i)
 		if (test_wanted(&tests[i], av + 1, ac - 1))
 			do_test(&tests[i]);

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-06-05 13:26             ` Stefano Stabellini
  2012-06-05 13:31               ` Peter Zijlstra
@ 2012-08-01 11:23               ` Stefano Stabellini
  2012-08-01 12:12                 ` Nikunj A Dadhania
  1 sibling, 1 reply; 37+ messages in thread
From: Stefano Stabellini @ 2012-08-01 11:23 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Peter Zijlstra, Nikunj A Dadhania, mingo, mtosatti, avi, raghukt,
	kvm, linux-kernel, x86, jeremy, vatsa, hpa,
	Konrad Rzeszutek Wilk

On Tue, 5 Jun 2012, Stefano Stabellini wrote:
> On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > > PeterZ, is 7/7 alright to be picked?
> > 
> > Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> > though. But these two patches together should make it work for x86.
> > 
> 
> Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
> 3.6?
> 

Hello Nikunj,
what happened to this patch series?
In particular I am interested in the following two patches:

kvm,x86: RCU based table free
Flush page-table pages before freeing them

do you still intend to carry on with the development? Is there anything
missing that is preventing them from going upstream?

Cheers,

Stefano

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-08-01 11:23               ` Stefano Stabellini
@ 2012-08-01 12:12                 ` Nikunj A Dadhania
  2012-08-01 12:59                   ` Stefano Stabellini
  0 siblings, 1 reply; 37+ messages in thread
From: Nikunj A Dadhania @ 2012-08-01 12:12 UTC (permalink / raw)
  To: Stefano Stabellini, Stefano Stabellini
  Cc: Peter Zijlstra, mingo, mtosatti, avi, raghukt, kvm, linux-kernel,
	x86, jeremy, vatsa, hpa, Konrad Rzeszutek Wilk

Hi Stefano,

On Wed, 1 Aug 2012 12:23:37 +0100, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> On Tue, 5 Jun 2012, Stefano Stabellini wrote:
> > On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > > On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > > > PeterZ, is 7/7 alright to be picked?
> > > 
> > > Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> > > though. But these two patches together should make it work for x86.
> > > 
> > 
> > Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
> > 3.6?
> > 
> 
> Hello Nikunj,
> what happened to this patch series?
> In particular I am interested in the following two patches:
> 
> kvm,x86: RCU based table free
> Flush page-table pages before freeing them
> 
> do you still intend to carry on with the development? Is there anything
> missing that is preventing them from going upstream?
>
I have posted a v3 on the kvm-list:
http://www.spinics.net/lists/kvm/msg76955.html

I am carrying the above two patches (with one fix) in my series as
well, for completeness.

I have picked up the patches from PeterZ's "Unify TLB gather
implementations -v3"
http://article.gmane.org/gmane.linux.kernel.mm/81278

Regards
Nikunj


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] kvm,x86: RCU based table free
  2012-08-01 12:12                 ` Nikunj A Dadhania
@ 2012-08-01 12:59                   ` Stefano Stabellini
  0 siblings, 0 replies; 37+ messages in thread
From: Stefano Stabellini @ 2012-08-01 12:59 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Stefano Stabellini, Stefano Stabellini, Peter Zijlstra, mingo,
	mtosatti, avi, raghukt, kvm, linux-kernel, x86, jeremy, vatsa,
	hpa, Konrad Rzeszutek Wilk

On Wed, 1 Aug 2012, Nikunj A Dadhania wrote:
> Hi Stefano,
> 
> On Wed, 1 Aug 2012 12:23:37 +0100, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > On Tue, 5 Jun 2012, Stefano Stabellini wrote:
> > > On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > > > On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > > > > PeterZ, is 7/7 alright to be picked?
> > > > 
> > > > Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> > > > though. But these two patches together should make it work for x86.
> > > > 
> > > 
> > > Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
> > > 3.6?
> > > 
> > 
> > Hello Nikunj,
> > what happened to this patch series?
> > In particular I am interested in the following two patches:
> > 
> > kvm,x86: RCU based table free
> > Flush page-table pages before freeing them
> > 
> > do you still intend to carry on with the development? Is there anything
> > missing that is preventing them from going upstream?
> >
> I have posted a v3 on the kvm-list:
> http://www.spinics.net/lists/kvm/msg76955.html
> 
> I am carrying the above two patches(with one fix) in my series as well
> for completeness. 
> 
> I have picked up the patches from PeterZ's "Unify TLB gather
> implementations -v3"
> http://article.gmane.org/gmane.linux.kernel.mm/81278

It is good to see that you are following up on this; since I didn't see
any updates I got worried :)
Cheers,

Stefano

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2012-08-01 13:00 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-04  5:05 [PATCH v2 0/7] KVM paravirt remote flush tlb Nikunj A. Dadhania
2012-06-04  5:06 ` [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
2012-06-12 22:43   ` Marcelo Tosatti
2012-06-19  6:03     ` Nikunj A Dadhania
2012-06-04  5:06 ` [PATCH v2 2/7] KVM-HV: " Nikunj A. Dadhania
2012-06-04  5:07 ` [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
2012-06-12 23:02   ` Marcelo Tosatti
2012-06-19  6:11     ` Nikunj A Dadhania
2012-06-21 12:26       ` Marcelo Tosatti
2012-07-03  7:55   ` Marcelo Tosatti
2012-07-03  8:19     ` Nikunj A Dadhania
2012-07-05  2:09       ` Marcelo Tosatti
2012-07-05  5:55         ` Nikunj A Dadhania
2012-07-06  9:47     ` Nikunj A Dadhania
2012-07-03  8:11   ` Marcelo Tosatti
2012-07-03  8:27     ` Nikunj A Dadhania
2012-06-04  5:07 ` [PATCH v2 4/7] KVM: export kvm_kick_vcpu for pv_flush Nikunj A. Dadhania
2012-06-04  5:08 ` [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb Nikunj A. Dadhania
2012-07-03  8:07   ` Marcelo Tosatti
2012-07-03  8:25     ` Nikunj A Dadhania
2012-07-05  2:37       ` Marcelo Tosatti
2012-07-05  5:53         ` Nikunj A Dadhania
2012-06-04  5:08 ` [PATCH v2 6/7] kvm,x86: RCU based table free Nikunj A. Dadhania
2012-06-05 10:48   ` Stefano Stabellini
2012-06-05 11:08     ` Nikunj A Dadhania
2012-06-05 11:58       ` Stefano Stabellini
2012-06-05 13:04         ` Nikunj A Dadhania
2012-06-05 13:08           ` Peter Zijlstra
2012-06-05 13:26             ` Stefano Stabellini
2012-06-05 13:31               ` Peter Zijlstra
2012-06-05 13:41                 ` Stefano Stabellini
2012-08-01 11:23               ` Stefano Stabellini
2012-08-01 12:12                 ` Nikunj A Dadhania
2012-08-01 12:59                   ` Stefano Stabellini
2012-06-05 15:29             ` Nikunj A Dadhania
2012-06-05 13:21           ` Stefano Stabellini
2012-06-04  5:08 ` [PATCH v2 7/7] Flush page-table pages before freeing them Nikunj A. Dadhania
