linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v1 0/5] KVM paravirt remote flush tlb
@ 2012-04-27 16:23 Nikunj A. Dadhania
  2012-04-27 16:23 ` [RFC PATCH v1 1/5] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Nikunj A. Dadhania @ 2012-04-27 16:23 UTC (permalink / raw)
  To: peterz, mingo; +Cc: jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, avi, hpa

The remote TLB flush APIs do a busy wait, which is fine on bare
metal. But within a guest, the target vcpus might have been pre-empted
or blocked, in which case the initiator vcpu ends up busy-waiting for
a long time.
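
The culprit is the wait loop in flush_tlb_others_ipi()
(arch/x86/mm/tlb.c, simplified):

	apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
			    INVALIDATE_TLB_VECTOR_START + sender);
	while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
		cpu_relax();	/* spin until every target has acked the flush */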

This was discovered in our gang scheduling tests; another way of
solving the problem is to para-virtualize flush_tlb_others_ipi.

This patch set implements a para-virt TLB flush that does not wait for
vcpus that are sleeping; sleeping vcpus instead flush the TLB on guest
enter. The idea was discussed here:
https://lkml.org/lkml/2012/2/20/157
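
In short, the initiator skips vcpus that are not running and queues the
flush for them instead (sketch; the field names match the patches below):

	for_each_cpu(cpu, flushmask) {
		struct kvm_vcpu_state *v = &per_cpu(vcpu_state, cpu);

		if (!v->state) {		/* vcpu not running */
			v->flush_on_enter = 1;	/* host flushes on guest enter */
			smp_mb();
			if (!v->state)		/* still not running: skip it */
				cpumask_clear_cpu(cpu, flushmask);
		}
	}
	/* IPI, and wait, only for the vcpus that are still in the mask */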

This patch set depends on the ticketlock[1] and KVM paravirt
spinlock[2] patches. Based on 3.4.0-rc4 (commit af3a3ab2).

Here are the results from non-PLE hardware, running the ebizzy
workload inside the VMs. The table shows the ebizzy scores normalized
to the baseline.

Machine:
8CPU Intel Xeon, HT disabled, 64 bit VM(8vcpu, 1G RAM)

        Gang    pv_spin   pv_flush   pv_spin_flush
1VM     1.01    0.30      1.01       0.49
2VMs    7.07    0.53      0.91       4.04
4VMs    9.07    0.59      0.31       5.27
8VMs    9.99    1.58      0.48       7.65

Perf report from the guest VM:
Base:
    41.25%       [k] flush_tlb_others_ipi
    41.21%       [k] __bitmap_empty
     7.66%       [k] _raw_spin_unlock_irqrestore
     3.07%       [.] __memcpy_ssse3_back
     1.20%       [k] clear_page

gang:
    22.92%       [.] __memcpy_ssse3_back
    15.46%       [k] _raw_spin_unlock_irqrestore
     9.82%       [k] clear_page
     6.35%       [k] do_page_fault
     4.57%       [k] down_read_trylock
     3.36%       [k] __mem_cgroup_commit_charge
     3.26%       [k] __x2apic_send_IPI_mask
     3.23%       [k] up_read
     2.87%       [k] __bitmap_empty
     2.78%       [k] flush_tlb_others_ipi

pv_spin:
    34.82%       [k] __bitmap_empty
    34.75%       [k] flush_tlb_others_ipi
    25.10%       [k] _raw_spin_unlock_irqrestore
     1.52%       [.] __memcpy_ssse3_back

pv_flush:
    37.34%       [k] _raw_spin_unlock_irqrestore
    18.26%       [k] native_halt
    11.58%       [.] __memcpy_ssse3_back
     4.83%       [k] clear_page
     3.68%       [k] do_page_fault

pv_spin_flush:
    71.13%       [k] _raw_spin_unlock_irqrestore
     8.89%       [.] __memcpy_ssse3_back
     4.68%       [k] native_halt
     3.92%       [k] clear_page
     2.31%       [k] do_page_fault

Looking at the perf output for pv_flush and pv_spin_flush: in both
cases flush_tlb_others_ipi is no longer contending for the cpu; it
relinquishes the cpu so that others can make progress.

Comments?

Regards
Nikunj

1. https://lkml.org/lkml/2012/4/19/335
2. https://lkml.org/lkml/2012/4/23/123

---

Nikunj A. Dadhania (5):
      KVM Guest: Add VCPU running/pre-empted state for guest
      KVM-HV: Add VCPU running/pre-empted state for guest
      KVM: Add paravirt kvm_flush_tlb_others
      KVM: export kvm_kick_vcpu for pv_flush
      KVM: Introduce PV kick in flush tlb


 arch/x86/include/asm/kvm_host.h |    7 ++++
 arch/x86/include/asm/kvm_para.h |   11 ++++++
 arch/x86/include/asm/tlbflush.h |    9 +++++
 arch/x86/kernel/kvm.c           |   52 +++++++++++++++++++++++++-----
 arch/x86/kvm/cpuid.c            |    1 +
 arch/x86/kvm/x86.c              |   50 ++++++++++++++++++++++++++++-
 arch/x86/mm/tlb.c               |   68 +++++++++++++++++++++++++++++++++++++++
 7 files changed, 188 insertions(+), 10 deletions(-)



* [RFC PATCH v1 1/5] KVM Guest: Add VCPU running/pre-empted state for guest
  2012-04-27 16:23 [RFC PATCH v1 0/5] KVM paravirt remote flush tlb Nikunj A. Dadhania
@ 2012-04-27 16:23 ` Nikunj A. Dadhania
  2012-05-01  1:03   ` Raghavendra K T
  2012-04-27 16:23 ` [RFC PATCH v1 2/5] KVM-HV: " Nikunj A. Dadhania
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Nikunj A. Dadhania @ 2012-04-27 16:23 UTC (permalink / raw)
  To: peterz, mingo; +Cc: jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, avi, hpa

This patch adds the guest code for an MSR shared between guest and
hypervisor. The MSR exports the vcpu running/pre-empted information
from the host to the guest. This enables the guest to send IPIs only
to running vcpus, and to set a flag for pre-empted vcpus instead of
waiting for vcpus that are not running.

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_para.h |   10 ++++++++++
 arch/x86/kernel/kvm.c           |   33 +++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 77266d3..f57b5cc 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -24,6 +24,7 @@
 #define KVM_FEATURE_ASYNC_PF		4
 #define KVM_FEATURE_STEAL_TIME		5
 #define KVM_FEATURE_PV_UNHALT		6
+#define KVM_FEATURE_VCPU_STATE          7
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -39,6 +40,7 @@
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
+#define MSR_KVM_VCPU_STATE  0x4b564d04
 
 struct kvm_steal_time {
 	__u64 steal;
@@ -51,6 +53,14 @@ struct kvm_steal_time {
 #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
 #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
 
+struct kvm_vcpu_state {
+	__u32 state;
+	__u32 pad[15];
+};
+
+#define KVM_VCPU_STATE_ALIGN_BITS 5
+#define KVM_VCPU_STATE_VALID_BITS ((-1ULL << (KVM_VCPU_STATE_ALIGN_BITS + 1)))
+
 #define KVM_MAX_MMU_OP_BATCH           32
 
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 98f0378..bb686a6 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -64,6 +64,9 @@ static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
 
+DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+static int has_vcpu_state;
+
 /*
  * No need for any "IO delay" on KVM
  */
@@ -291,6 +294,22 @@ static void kvm_register_steal_time(void)
 		cpu, __pa(st));
 }
 
+static void kvm_register_vcpu_state(void)
+{
+	int cpu = smp_processor_id();
+	struct kvm_vcpu_state *v_state;
+
+	if (!has_vcpu_state)
+		return;
+
+	v_state = &per_cpu(vcpu_state, cpu);
+	memset(v_state, 0, sizeof(*v_state));
+
+	wrmsrl(MSR_KVM_VCPU_STATE, (__pa(v_state) | KVM_MSR_ENABLED));
+	printk(KERN_INFO "kvm-vcpustate: cpu %d, msr %lu\n",
+		cpu, __pa(v_state));
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
 	if (!kvm_para_available())
@@ -310,6 +329,9 @@ void __cpuinit kvm_guest_cpu_init(void)
 
 	if (has_steal_clock)
 		kvm_register_steal_time();
+
+	if (has_vcpu_state)
+		kvm_register_vcpu_state();
 }
 
 static void kvm_pv_disable_apf(void *unused)
@@ -361,6 +383,14 @@ void kvm_disable_steal_time(void)
 	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+void kvm_disable_vcpu_state(void)
+{
+	if (!has_vcpu_state)
+		return;
+
+	wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -379,6 +409,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
 
 static void kvm_guest_cpu_offline(void *dummy)
 {
+	kvm_disable_vcpu_state();
 	kvm_disable_steal_time();
 	kvm_pv_disable_apf(NULL);
 	apf_task_wake_all();
@@ -433,6 +464,8 @@ void __init kvm_guest_init(void)
 		pv_time_ops.steal_clock = kvm_steal_clock;
 	}
 
+	has_vcpu_state = 1;
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 	register_cpu_notifier(&kvm_cpu_notifier);



* [RFC PATCH v1 2/5] KVM-HV: Add VCPU running/pre-empted state for guest
  2012-04-27 16:23 [RFC PATCH v1 0/5] KVM paravirt remote flush tlb Nikunj A. Dadhania
  2012-04-27 16:23 ` [RFC PATCH v1 1/5] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
@ 2012-04-27 16:23 ` Nikunj A. Dadhania
  2012-04-27 16:24 ` [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Nikunj A. Dadhania @ 2012-04-27 16:23 UTC (permalink / raw)
  To: peterz, mingo; +Cc: jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, avi, hpa

Hypervisor code to indicate the guest's running/pre-empted status
through an MSR.

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    7 ++++++
 arch/x86/kvm/cpuid.c            |    1 +
 arch/x86/kvm/x86.c              |   44 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dad475b..12fe3c7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -418,6 +418,13 @@ struct kvm_vcpu_arch {
 		struct kvm_steal_time steal;
 	} st;
 
+	/* indicates vcpu is running or preempted */
+	struct {
+		u64 msr_val;
+		struct gfn_to_hva_cache data;
+		struct kvm_vcpu_state vs;
+	} v_state;
+
 	u64 last_guest_tsc;
 	u64 last_kernel_ns;
 	u64 last_host_tsc;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7c93806..0588984 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -409,6 +409,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
 			     (1 << KVM_FEATURE_ASYNC_PF) |
 			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+			     (1 << KVM_FEATURE_VCPU_STATE) |
 			     (1 << KVM_FEATURE_PV_UNHALT);
 
 		if (sched_info_on())
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e5f57b..60546e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -789,12 +789,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	9
+#define KVM_SAVE_MSRS_BEGIN	10
 static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
 	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
 	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
+	MSR_KVM_VCPU_STATE,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1539,6 +1540,32 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
 }
 
+static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_state *vs = &vcpu->arch.v_state.vs;
+	struct gfn_to_hva_cache *ghc = &vcpu->arch.v_state.data;
+
+	if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	vs->state = 1;
+	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+	smp_wmb();
+}
+
+static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_state *vs = &vcpu->arch.v_state.vs;
+	struct gfn_to_hva_cache *ghc = &vcpu->arch.v_state.data;
+
+	if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	vs->state = 0;
+	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+	smp_wmb();
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
 	bool pr = false;
@@ -1654,6 +1681,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 
 		break;
 
+	case MSR_KVM_VCPU_STATE:
+		if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.v_state.data,
+					      data & KVM_VCPU_STATE_VALID_BITS))
+			return 1;
+
+		vcpu->arch.v_state.msr_val = data;
+		break;
+
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1974,6 +2009,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 	case MSR_KVM_STEAL_TIME:
 		data = vcpu->arch.st.msr_val;
 		break;
+	case MSR_KVM_VCPU_STATE:
+		data = vcpu->arch.v_state.msr_val;
+		break;
 	case MSR_IA32_P5_MC_ADDR:
 	case MSR_IA32_P5_MC_TYPE:
 	case MSR_IA32_MCG_CAP:
@@ -5324,6 +5362,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		kvm_load_guest_fpu(vcpu);
 	kvm_load_guest_xcr0(vcpu);
 
+	kvm_set_vcpu_state(vcpu);
+
 	vcpu->mode = IN_GUEST_MODE;
 
 	/* We should set ->mode before check ->requests,
@@ -5374,6 +5414,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 
+	kvm_clear_vcpu_state(vcpu);
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();
 	local_irq_enable();
@@ -6029,6 +6070,7 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	vcpu->arch.apf.msr_val = 0;
 	vcpu->arch.st.msr_val = 0;
+	vcpu->arch.v_state.msr_val = 0;
 
 	kvmclock_reset(vcpu);
 



* [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-04-27 16:23 [RFC PATCH v1 0/5] KVM paravirt remote flush tlb Nikunj A. Dadhania
  2012-04-27 16:23 ` [RFC PATCH v1 1/5] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
  2012-04-27 16:23 ` [RFC PATCH v1 2/5] KVM-HV: " Nikunj A. Dadhania
@ 2012-04-27 16:24 ` Nikunj A. Dadhania
  2012-04-29 12:23   ` Avi Kivity
  2012-05-04 11:44   ` Srivatsa Vaddagiri
  2012-04-27 16:26 ` [RFC PATCH v1 4/5] KVM: get kvm_kick_vcpu out for pv_flush Nikunj A. Dadhania
  2012-04-27 16:27 ` [RFC PATCH v1 5/5] KVM: Introduce PV kick in flush tlb Nikunj A. Dadhania
  4 siblings, 2 replies; 36+ messages in thread
From: Nikunj A. Dadhania @ 2012-04-27 16:24 UTC (permalink / raw)
  To: peterz, mingo; +Cc: jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, avi, hpa

flush_tlb_others_ipi depends on a lot of statics in tlb.c.  Replicate
flush_tlb_others_ipi as kvm_flush_tlb_others so that it can be adapted
for paravirtualization.

Use the vcpu state information inside kvm_flush_tlb_others to avoid
sending IPIs to pre-empted vcpus.

* Do not send IPIs to pre-empted (offline) vcpus; set the
  flush_on_enter flag instead
* For online vcpus: wait for them to clear the flag

The approach was discussed here: https://lkml.org/lkml/2012/2/20/157

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>

--
Pseudo Algo:

   Write()
   ======

	   guest_exit()
		   flush_on_enter[i]=0;
		   running[i] = 0;

	   guest_enter()
		   running[i] = 1;
		   if(flush_on_enter[i]) {
			   tlb_flush()
			   flush_on_enter[i]=0;
		   }


   Read()
   ======

	   GUEST                                                KVM-HV

   f->flushcpumask = cpumask - me;

again:
   for_each_cpu(i, f->flushmask) {

	   if (!running[i]) {
						   case 1:

						   running[n]=1

						   (cpuN does not see
						   flush_on_enter set,
						   guest later finds it
						   running and sends ipi,
						   we are fine here, need
						   to clear the flag on
						   guest_exit)

		  flush_on_enter[i] = 1;
						   case 2:

						   running[n]=1
						   (cpuN - will see flush
						   on enter and an IPI as
						   well - addressed in patch-4)

		  if (!running[i])
		     cpu_clear(f->flushmask);      All is well, vm_enter
						   will do the fixup
	   }
						   case 3:
						   running[n] = 0;

						   (cpuN went to sleep,
						   we saw it as awake,
						   ipi sent, but wait
						   will break without
						   zero_mask and goto
						   again will take care)

   }
   send_ipi(f->flushmask)

   wait_a_while_for_zero_mask();

   if (!zero_mask)
	   goto again;
---
 arch/x86/include/asm/kvm_para.h |    3 +-
 arch/x86/include/asm/tlbflush.h |    9 ++++++
 arch/x86/kernel/kvm.c           |    1 +
 arch/x86/kvm/x86.c              |    6 ++++
 arch/x86/mm/tlb.c               |   57 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 75 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f57b5cc..684a285 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -55,7 +55,8 @@ struct kvm_steal_time {
 
 struct kvm_vcpu_state {
 	__u32 state;
-	__u32 pad[15];
+	__u32 flush_on_enter;
+	__u32 pad[14];
 };
 
 #define KVM_VCPU_STATE_ALIGN_BITS 5
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index c0e108e..29470bd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -119,6 +119,12 @@ static inline void native_flush_tlb_others(const struct cpumask *cpumask,
 {
 }
 
+static inline void kvm_flush_tlb_others(const struct cpumask *cpumask,
+					struct mm_struct *mm,
+					unsigned long va)
+{
+}
+
 static inline void reset_lazy_tlbstate(void)
 {
 }
@@ -145,6 +151,9 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     struct mm_struct *mm, unsigned long va);
 
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+			  struct mm_struct *mm, unsigned long va);
+
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index bb686a6..66db54e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,6 +465,7 @@ void __init kvm_guest_init(void)
 	}
 
 	has_vcpu_state = 1;
+	pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
 
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 60546e9..6c42056 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1549,6 +1549,11 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
 		return;
 
 	vs->state = 1;
+	if (vs->flush_on_enter) {
+		kvm_mmu_flush_tlb(vcpu);
+		vs->flush_on_enter = 0;
+	}
+
 	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
 	smp_wmb();
 }
@@ -1561,6 +1566,7 @@ static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
 	if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
 		return;
 
+	vs->flush_on_enter = 0;
 	vs->state = 0;
 	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
 	smp_wmb();
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d6c0418..91ae34e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -6,6 +6,7 @@
 #include <linux/interrupt.h>
 #include <linux/module.h>
 #include <linux/cpu.h>
+#include <linux/kvm_para.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -69,6 +70,7 @@ void leave_mm(int cpu)
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
+DECLARE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
 /*
  *
  * The flush IPI assumes that a thread switch happens in this order:
@@ -202,6 +204,61 @@ static void flush_tlb_others_ipi(const struct cpumask *cpumask,
 		raw_spin_unlock(&f->tlbstate_lock);
 }
 
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+			struct mm_struct *mm, unsigned long va)
+{
+	unsigned int sender;
+	union smp_flush_state *f;
+	int cpu, loop;
+	struct kvm_vcpu_state *v_state;
+
+	/* Caller has disabled preemption */
+	sender = this_cpu_read(tlb_vector_offset);
+	f = &flush_state[sender];
+
+	if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
+		raw_spin_lock(&f->tlbstate_lock);
+
+	f->flush_mm = mm;
+	f->flush_va = va;
+	if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+		/*
+		 * We have to send the IPI only to online vCPUs
+		 * affected. And queue flush_on_enter for pre-empted
+		 * vCPUs
+		 */
+again:
+		for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
+			v_state = &per_cpu(vcpu_state, cpu);
+
+			if (!v_state->state) {
+				v_state->flush_on_enter = 1;
+				smp_mb();
+				if (!v_state->state)
+					cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
+			}
+		}
+
+		if (cpumask_empty(to_cpumask(f->flush_cpumask)))
+			goto out;
+
+		apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
+				    INVALIDATE_TLB_VECTOR_START + sender);
+
+		loop = 1000;
+		while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
+			cpu_relax();
+
+		if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
+			goto again;
+	}
+out:
+	f->flush_mm = NULL;
+	f->flush_va = 0;
+	if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
+		raw_spin_unlock(&f->tlbstate_lock);
+}
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     struct mm_struct *mm, unsigned long va)
 {



* [RFC PATCH v1 4/5] KVM: get kvm_kick_vcpu out for pv_flush
  2012-04-27 16:23 [RFC PATCH v1 0/5] KVM paravirt remote flush tlb Nikunj A. Dadhania
                   ` (2 preceding siblings ...)
  2012-04-27 16:24 ` [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
@ 2012-04-27 16:26 ` Nikunj A. Dadhania
  2012-04-27 16:27 ` [RFC PATCH v1 5/5] KVM: Introduce PV kick in flush tlb Nikunj A. Dadhania
  4 siblings, 0 replies; 36+ messages in thread
From: Nikunj A. Dadhania @ 2012-04-27 16:26 UTC (permalink / raw)
  To: peterz, mingo; +Cc: jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, avi, hpa

Move kvm_kick_cpu out of the CONFIG_PARAVIRT_SPINLOCKS ifdef so that
pv_flush can use it.

Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
 arch/x86/kernel/kvm.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 66db54e..5943285 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -487,6 +487,15 @@ static __init int activate_jump_labels(void)
 }
 arch_initcall(activate_jump_labels);
 
+/* Kick a cpu */
+void kvm_kick_cpu(int cpu)
+{
+	int apicid;
+
+	apicid = per_cpu(x86_cpu_to_apicid, cpu);
+	kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
+}
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 enum kvm_contention_stat {
@@ -695,15 +704,6 @@ out:
 }
 PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
 
-/* Kick a cpu by its apicid*/
-static inline void kvm_kick_cpu(int cpu)
-{
-	int apicid;
-
-	apicid = per_cpu(x86_cpu_to_apicid, cpu);
-	kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
-}
-
 /* Kick vcpu waiting on @lock->head to reach value @ticket */
 static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 {



* [RFC PATCH v1 5/5] KVM: Introduce PV kick in flush tlb
  2012-04-27 16:23 [RFC PATCH v1 0/5] KVM paravirt remote flush tlb Nikunj A. Dadhania
                   ` (3 preceding siblings ...)
  2012-04-27 16:26 ` [RFC PATCH v1 4/5] KVM: get kvm_kick_vcpu out for pv_flush Nikunj A. Dadhania
@ 2012-04-27 16:27 ` Nikunj A. Dadhania
  4 siblings, 0 replies; 36+ messages in thread
From: Nikunj A. Dadhania @ 2012-04-27 16:27 UTC (permalink / raw)
  To: peterz, mingo; +Cc: jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, avi, hpa

Instead of looping continuously, halt if we do not succeed after some
time.

An IPI is sent to the vcpus that were running.  If such a vcpu goes to
sleep in the meantime, we will also set flush_on_enter (harmless). But
as a flush IPI was already sent, it will be processed in the IPI
handler; this might result in something undesirable, i.e. it might
clear the flush_cpumask of a new request.

So after sending an IPI and waiting for a while, halt and wait for a
kick from the last vcpu.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---
 arch/x86/mm/tlb.c |   27 +++++++++++++++++++--------
 1 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 91ae34e..2a20e59 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -43,6 +43,8 @@ union smp_flush_state {
 	struct {
 		struct mm_struct *flush_mm;
 		unsigned long flush_va;
+		int sender_cpu;
+		unsigned int need_kick;
 		raw_spinlock_t tlbstate_lock;
 		DECLARE_BITMAP(flush_cpumask, NR_CPUS);
 	};
@@ -71,6 +73,8 @@ void leave_mm(int cpu)
 EXPORT_SYMBOL_GPL(leave_mm);
 
 DECLARE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+extern void kvm_kick_cpu(int cpu);
+
 /*
  *
  * The flush IPI assumes that a thread switch happens in this order:
@@ -168,6 +172,11 @@ out:
 	smp_mb__before_clear_bit();
 	cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
 	smp_mb__after_clear_bit();
+	if (f->need_kick && cpumask_empty(to_cpumask(f->flush_cpumask))) {
+		f->need_kick = 0;
+		smp_wmb();
+		kvm_kick_cpu(f->sender_cpu);
+	}
 	inc_irq_stat(irq_tlb_count);
 }
 
@@ -219,15 +228,17 @@ void kvm_flush_tlb_others(const struct cpumask *cpumask,
 	if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
 		raw_spin_lock(&f->tlbstate_lock);
 
+	cpu = smp_processor_id();
 	f->flush_mm = mm;
 	f->flush_va = va;
-	if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+	f->sender_cpu = cpu;
+	f->need_kick = 0;
+	if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(cpu))) {
 		/*
 		 * We have to send the IPI only to online vCPUs
 		 * affected. And queue flush_on_enter for pre-empted
 		 * vCPUs
 		 */
-again:
 		for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
 			v_state = &per_cpu(vcpu_state, cpu);
 
@@ -239,9 +250,6 @@ again:
 			}
 		}
 
-		if (cpumask_empty(to_cpumask(f->flush_cpumask)))
-			goto out;
-
 		apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
 				    INVALIDATE_TLB_VECTOR_START + sender);
 
@@ -249,10 +257,13 @@ again:
 		while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
 			cpu_relax();
 
-		if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
-			goto again;
+		if (!loop) {
+			f->need_kick = 1;
+			smp_mb();
+			while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
+				halt();
+		}
 	}
-out:
 	f->flush_mm = NULL;
 	f->flush_va = 0;
 	if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-04-27 16:24 ` [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
@ 2012-04-29 12:23   ` Avi Kivity
  2012-05-01  3:34     ` Nikunj A Dadhania
  2012-05-01  9:39     ` Peter Zijlstra
  2012-05-04 11:44   ` Srivatsa Vaddagiri
  1 sibling, 2 replies; 36+ messages in thread
From: Avi Kivity @ 2012-04-29 12:23 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: peterz, mingo, jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, hpa

On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> paravirtualization.
>
> Use the vcpu state information inside the kvm_flush_tlb_others to
> avoid sending ipi to pre-empted vcpus.
>
> * Do not send ipi's to offline vcpus and set flush_on_enter flag

get_user_pages_fast() depends on the IPI to hold off page table teardown
while the page tables are locklessly walked with interrupts disabled.  If
a vcpu were to be preempted while in this critical section, another vcpu
tearing down page tables would go ahead and destroy them.  When the
preempted vcpu resumes it then touches the freed pages.
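
The pattern in get_user_pages_fast() is roughly (simplified from
arch/x86/mm/gup.c; the walker name here is illustrative):

	local_irq_disable();
	/*
	 * Lockless page-table walk: the pages seen here cannot be freed,
	 * because the TLB flush IPI cannot complete until IRQs are
	 * re-enabled; skipping the IPI for a preempted vcpu, as above,
	 * breaks exactly this assumption.
	 */
	nr = walk_and_pin_pages(start, end, write, pages);
	local_irq_enable();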

We could try to teach kvm and get_user_pages_fast() about this, but this
is intrusive.  Another option is to replace the cpu_relax() loop with
something that sleeps and is then woken up by the TLB IPI handler if needed.

> * For online vcpus: Wait for them to clear the flag
>
>

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH v1 1/5] KVM Guest: Add VCPU running/pre-empted state for guest
  2012-04-27 16:23 ` [RFC PATCH v1 1/5] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
@ 2012-05-01  1:03   ` Raghavendra K T
  2012-05-01  3:25     ` Nikunj A Dadhania
  0 siblings, 1 reply; 36+ messages in thread
From: Raghavendra K T @ 2012-05-01  1:03 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: peterz, mingo, jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, avi, hpa

On 04/27/2012 09:53 PM, Nikunj A. Dadhania wrote:
> The patch adds guest code for msr between guest and hypervisor. The
> msr will export the vcpu running/pre-empted information to the guest
> from host. This will enable guest to intelligently send ipi to running
> vcpus and set flag for pre-empted vcpus. This will prevent waiting for
> vcpus that are not running.
>
> Suggested-by: Peter Zijlstra<a.p.zijlstra@chello.nl>
> Signed-off-by: Nikunj A. Dadhania<nikunj@linux.vnet.ibm.com>
> ---
>   arch/x86/include/asm/kvm_para.h |   10 ++++++++++
>   arch/x86/kernel/kvm.c           |   33 +++++++++++++++++++++++++++++++++
>   2 files changed, 43 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 77266d3..f57b5cc 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -24,6 +24,7 @@
>   #define KVM_FEATURE_ASYNC_PF		4
>   #define KVM_FEATURE_STEAL_TIME		5
>   #define KVM_FEATURE_PV_UNHALT		6
> +#define KVM_FEATURE_VCPU_STATE          7

I think you intended KVM_FEATURE_VCPU_STATE to address the guest/host
compatibility issue, so that the host/guest does not break when one of
them runs an older kernel?



* Re: [RFC PATCH v1 1/5] KVM Guest: Add VCPU running/pre-empted state for guest
  2012-05-01  1:03   ` Raghavendra K T
@ 2012-05-01  3:25     ` Nikunj A Dadhania
  0 siblings, 0 replies; 36+ messages in thread
From: Nikunj A Dadhania @ 2012-05-01  3:25 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: peterz, mingo, jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, avi, hpa

On Tue, 01 May 2012 06:33:59 +0530, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> On 04/27/2012 09:53 PM, Nikunj A. Dadhania wrote:
> > The patch adds guest code for msr between guest and hypervisor. The
> > msr will export the vcpu running/pre-empted information to the guest
> > from host. This will enable guest to intelligently send ipi to running
> > vcpus and set flag for pre-empted vcpus. This will prevent waiting for
> > vcpus that are not running.
> >
> > Suggested-by: Peter Zijlstra<a.p.zijlstra@chello.nl>
> > Signed-off-by: Nikunj A. Dadhania<nikunj@linux.vnet.ibm.com>
> > ---
> >   arch/x86/include/asm/kvm_para.h |   10 ++++++++++
> >   arch/x86/kernel/kvm.c           |   33 +++++++++++++++++++++++++++++++++
> >   2 files changed, 43 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index 77266d3..f57b5cc 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -24,6 +24,7 @@
> >   #define KVM_FEATURE_ASYNC_PF		4
> >   #define KVM_FEATURE_STEAL_TIME		5
> >   #define KVM_FEATURE_PV_UNHALT		6
> > +#define KVM_FEATURE_VCPU_STATE          7
> 
> I think you intended to use KVM_FEATURE_VCPU_STATE to address
> guest/host compatibility issue so that host/guest does not break
> when one of them run older kernel?
>
Yes, that's correct.

Regards
Nikunj



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-04-29 12:23   ` Avi Kivity
@ 2012-05-01  3:34     ` Nikunj A Dadhania
  2012-05-01  9:39     ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Nikunj A Dadhania @ 2012-05-01  3:34 UTC (permalink / raw)
  To: Avi Kivity
  Cc: peterz, mingo, jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, hpa

On Sun, 29 Apr 2012 15:23:16 +0300, Avi Kivity <avi@redhat.com> wrote:
> On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > paravirtualization.
> >
> > Use the vcpu state information inside the kvm_flush_tlb_others to
> > avoid sending ipi to pre-empted vcpus.
> >
> > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> 
> get_user_pages_fast() depends on the IPI to hold off page table teardown
> while they are locklessly walked with interrupts disabled.  If a vcpu
> were to be preempted while in this critical section, another vcpu
> tearing down page tables would go ahead and destroy them.  when the
> preempted vcpu resumes it then touches the freed pages.
> 
> We could try to teach kvm and get_user_pages_fast() about this, but this
> is intrusive. 


> Another option is to replace the cpu_relax() loop with
> something that sleeps and is then woken up by the TLB IPI handler if needed.
> 
That was my initial implementation, where we were not looking for
sleeping vcpus: just send IPIs to the affected vcpus and execute halt.
One of the vcpus, on seeing the flushmask become empty, would kick the
halted vcpu. I can respin my patches with just this as well.

Regards
Nikunj



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-04-29 12:23   ` Avi Kivity
  2012-05-01  3:34     ` Nikunj A Dadhania
@ 2012-05-01  9:39     ` Peter Zijlstra
  2012-05-01 10:47       ` Avi Kivity
  2012-05-02  8:51       ` Nikunj A Dadhania
  1 sibling, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-01  9:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On Sun, 2012-04-29 at 15:23 +0300, Avi Kivity wrote:
> On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > paravirtualization.
> >
> > Use the vcpu state information inside the kvm_flush_tlb_others to
> > avoid sending ipi to pre-empted vcpus.
> >
> > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> 
> get_user_pages_fast() depends on the IPI to hold off page table teardown
> while they are locklessly walked with interrupts disabled.  If a vcpu
> were to be preempted while in this critical section, another vcpu
> tearing down page tables would go ahead and destroy them.  when the
> preempted vcpu resumes it then touches the freed pages.
> 
> We could try to teach kvm and get_user_pages_fast() about this, but this
> is intrusive.  Another option is to replace the cpu_relax() loop with
> something that sleeps and is then woken up by the TLB IPI handler if needed.

I think something like

  select HAVE_RCU_TABLE_FREE if PARAVIRT

or somesuch is just about all it takes.

A slightly better option would be to wrap all that tlb_*table* goo into
paravirt stuff and only do the RCU free when paravirt is indeed enabled,
but other than that you're there.

This should work because the preempted vcpu's RCU state would also be
stalled and thus avoids the actual page-table from going away.
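
For reference, the generic implementation (mm/memory.c under
CONFIG_HAVE_RCU_TABLE_FREE, simplified) batches the page-table pages and
defers the actual free with call_rcu_sched():

	void tlb_table_flush(struct mmu_gather *tlb)
	{
		struct mmu_table_batch **batch = &tlb->batch;

		if (*batch) {
			/* free the batch only after a -sched grace period */
			call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
			*batch = NULL;
		}
	}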


* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01  9:39     ` Peter Zijlstra
@ 2012-05-01 10:47       ` Avi Kivity
  2012-05-01 10:57         ` Peter Zijlstra
  2012-05-02  8:51       ` Nikunj A Dadhania
  1 sibling, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2012-05-01 10:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On 05/01/2012 12:39 PM, Peter Zijlstra wrote:
> On Sun, 2012-04-29 at 15:23 +0300, Avi Kivity wrote:
> > On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > > paravirtualization.
> > >
> > > Use the vcpu state information inside the kvm_flush_tlb_others to
> > > avoid sending ipi to pre-empted vcpus.
> > >
> > > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > 
> > get_user_pages_fast() depends on the IPI to hold off page table teardown
> > while they are locklessly walked with interrupts disabled.  If a vcpu
> > were to be preempted while in this critical section, another vcpu
> > tearing down page tables would go ahead and destroy them.  when the
> > preempted vcpu resumes it then touches the freed pages.
> > 
> > We could try to teach kvm and get_user_pages_fast() about this, but this
> > is intrusive.  Another option is to replace the cpu_relax() loop with
> > something that sleeps and is then woken up by the TLB IPI handler if needed.
>
> I think something like
>
>   select HAVE_RCU_TABLE_FREE if PARAVIRT
>
> or somesuch is just about all it takes.
>
> A slightly better option would be to wrap all that tlb_*table* goo into
> paravirt stuff and only do the RCU free when paravirt is indeed enabled,
> but other than that you're there.

I infer from this that there is a cost involved with rcu freeing.  Any
idea how much?

Looks like this increases performance for the overcommitted case, and
also for the case where many vcpus are sleeping, while reducing
performance for the uncontended, high duty cycle case.

> This should work because the preempted vcpu's RCU state would also be
> stalled and thus avoids the actual page-table from going away.

It can be unstalled at any moment.  But spin_lock_irq() > rcu_read_lock().

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 10:47       ` Avi Kivity
@ 2012-05-01 10:57         ` Peter Zijlstra
  2012-05-01 10:59           ` Peter Zijlstra
  2012-05-01 12:12           ` Avi Kivity
  0 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-01 10:57 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On Tue, 2012-05-01 at 13:47 +0300, Avi Kivity wrote:
> On 05/01/2012 12:39 PM, Peter Zijlstra wrote:
> > On Sun, 2012-04-29 at 15:23 +0300, Avi Kivity wrote:
> > > On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > > > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > > > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > > > paravirtualization.
> > > >
> > > > Use the vcpu state information inside the kvm_flush_tlb_others to
> > > > avoid sending ipi to pre-empted vcpus.
> > > >
> > > > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > > 
> > > get_user_pages_fast() depends on the IPI to hold off page table teardown
> > > while they are locklessly walked with interrupts disabled.  If a vcpu
> > > were to be preempted while in this critical section, another vcpu
> > > tearing down page tables would go ahead and destroy them.  when the
> > > preempted vcpu resumes it then touches the freed pages.
> > > 
> > > We could try to teach kvm and get_user_pages_fast() about this, but this
> > > is intrusive.  Another option is to replace the cpu_relax() loop with
> > > something that sleeps and is then woken up by the TLB IPI handler if needed.
> >
> > I think something like
> >
> >   select HAVE_RCU_TABLE_FREE if PARAVIRT
> >
> > or somesuch is just about all it takes.
> >
> > A slightly better option would be to wrap all that tlb_*table* goo into
> > paravirt stuff and only do the RCU free when paravirt is indeed enabled,
> > but other than that you're there.
> 
> I infer from this that there is a cost involved with rcu freeing.  Any
> idea how much?

No idea, so far that code has only been used on platforms that required
it so they didn't have a choice in the matter.

> Looks like this increases performance for the overcommitted case, and
> also for the case where many vcpus are sleeping, while reducing
> performance for the uncontended, high duty cycle case.

Sounds backwards if you put it like that ;-)

> > This should work because the preempted vcpu's RCU state would also be
> > stalled and thus avoids the actual page-table from going away.
> 
> It can be unstalled at any moment.  But spin_lock_irq() > rcu_read_lock().

Right, but since gup_fast has IRQs disabled the RCU state machine (as
driven by the tick) won't actually do anything until it's done.

To be clear, the case was where the gup_fast() performing vcpu was
preempted in the middle of gup_fast(), on wakeup it would perform the
TLB flush on the virt-enter hook, but meanwhile a sibling vcpu might
have free'd the page-tables.

By using call_rcu_sched() to free the page-tables you'd need to receive
and process at least one tick on the woken up cpu after the freeing, but
since the in-progress gup_fast() will have IRQs disabled this will be
delayed.

Anyway, I don't have any idea about the costs involved with
HAVE_RCU_TABLE_FREE, but I don't think it's much.. otherwise these other
platforms (PPC,SPARC) wouldn't have used it; gup_fast() is a very
specific case, whereas mmu-gather is something affecting pretty much all
tasks.

But mostly my comment was due to you saying modifying gup_fast() would
be difficult.. I was thinking the one Kconfig line wasn't as onerous ;-)



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 10:57         ` Peter Zijlstra
@ 2012-05-01 10:59           ` Peter Zijlstra
  2012-05-01 22:49             ` Jeremy Fitzhardinge
  2012-05-01 12:12           ` Avi Kivity
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-01 10:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On Tue, 2012-05-01 at 12:57 +0200, Peter Zijlstra wrote:
> Anyway, I don't have any idea about the costs involved with
> HAVE_RCU_TABLE_FREE, but I don't think its much.. otherwise these other
> platforms (PPC,SPARC) wouldn't have used it, gup_fast() is a very
> specific case, whereas mmu-gather is something affecting pretty much all
> tasks. 

Which reminds me, I thought Xen needed this too, but a git grep on
HAVE_RCU_TABLE_FREE shows it's still only ppc and sparc.

Jeremy?


* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 10:57         ` Peter Zijlstra
  2012-05-01 10:59           ` Peter Zijlstra
@ 2012-05-01 12:12           ` Avi Kivity
  2012-05-01 14:59             ` Peter Zijlstra
                               ` (2 more replies)
  1 sibling, 3 replies; 36+ messages in thread
From: Avi Kivity @ 2012-05-01 12:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On 05/01/2012 01:57 PM, Peter Zijlstra wrote:
> > Looks like this increases performance for the overcommitted case, and
> > also for the case where many vcpus are sleeping, while reducing
> > performance for the uncontended, high duty cycle case.
>
> Sounds backwards if you put it like that ;-)

Yes...

> > > This should work because the preempted vcpu's RCU state would also be
> > > stalled and thus avoids the actual page-table from going away.
> > 
> > It can be unstalled at any moment.  But spin_lock_irq() > rcu_read_lock().
>
> Right, but since gup_fast has IRQs disabled the RCU state machine (as
> driven by the tick) won't actually do anything until its done.

That's what I meant, except I mistyped local_irq_disable() as
spin_lock_irq().  local_irq_save() is a stronger version of rcu_read_lock().

> To be clear, the case was where the gup_fast() performing vcpu was
> preempted in the middle of gup_fast(), on wakeup it would perform the
> TLB flush on the virt-enter hook, but meanwhile a sibling vcpu might
> have free'd the page-tables.
>
> By using call_rcu_sched() to free the page-tables you'd need to receive
> and process at least one tick on the woken up cpu after the freeing, but
> since the in-progress gup_fast() will have IRQs disabled this will be
> delayed.

We're now moving the freeing of kvm shadow page tables from using rcu to
using an irq-protected scheme like gup_fast(), because of the
performance differences.  We didn't track it down but I expect the cause
is less reuse of cache-hot pages.

> Anyway, I don't have any idea about the costs involved with
> HAVE_RCU_TABLE_FREE, but I don't think its much.. otherwise these other
> platforms (PPC,SPARC) wouldn't have used it, gup_fast() is a very
> specific case, whereas mmu-gather is something affecting pretty much all
> tasks.

What's changed is not gup_fast() but the performance of munmap(),
exit(), and exec(), no?

What bounds the amount of memory waiting to be freed during an rcu grace
period?

> But mostly my comment was due to you saying modifying gup_fast() would
> be difficult.. I was thinking the one Kconfig line wasn't as onerous ;-)

Yes, you're right - the code is all there and as long as we tolerate the
use of local_irq_disable() in place of rcu_read_lock() no code changes
are needed.

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 12:12           ` Avi Kivity
@ 2012-05-01 14:59             ` Peter Zijlstra
  2012-05-01 15:31               ` Avi Kivity
  2012-05-01 15:11             ` Peter Zijlstra
  2012-05-01 15:14             ` Peter Zijlstra
  2 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-01 14:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> local_irq_save() is a stronger version or rcu_read_lock()

Not so; local_irq_save() doesn't at all stop regular RCU grace periods,
and things like preemptible rcu can be driven from rcu_read_unlock().

Note that this is the reason call_rcu_sched() exists and is used in this
case.


* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 12:12           ` Avi Kivity
  2012-05-01 14:59             ` Peter Zijlstra
@ 2012-05-01 15:11             ` Peter Zijlstra
  2012-05-01 15:33               ` Avi Kivity
  2012-05-01 15:14             ` Peter Zijlstra
  2 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-01 15:11 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> We're now moving the freeing of kvm shadow page tables from using rcu to
> using an irq-protected scheme like gup_fast(), because of the
> performance differences.  We didn't track it down but I expect the cause
> is less reuse of cache-hot pages. 

It would be good to understand these things.. just doing things because
tends to come back and bite you :-)

If its cache behaviour you should be able to clearly see that using perf
stat.


* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 12:12           ` Avi Kivity
  2012-05-01 14:59             ` Peter Zijlstra
  2012-05-01 15:11             ` Peter Zijlstra
@ 2012-05-01 15:14             ` Peter Zijlstra
  2012-05-01 15:36               ` Avi Kivity
  2 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-01 15:14 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> 
> What's changed is not gup_fast() but the performance of munmap(),
> exit(), and exec(), no?

If it is indeed cache related like you suggested earlier, it would be
the allocation side of things, like fork()/mmap() that suffer since
there's fewer hot pages about, but yes, anything creating/destroying
page-tables.

> What bounds the amount of memory waiting to be freed during an rcu grace
> period?

Most RCU implementations don't have limits, so that could be quite a
lot. I think preemptible RCU has a batch limit at which point it tries
rather hard to force a grace period, but I'm not sure if even that
provides a hard limit.

Practically though, I haven't had reports of PPC/Sparc going funny
because of this.


* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 14:59             ` Peter Zijlstra
@ 2012-05-01 15:31               ` Avi Kivity
  2012-05-01 15:36                 ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2012-05-01 15:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On 05/01/2012 05:59 PM, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> > local_irq_save() is a stronger version or rcu_read_lock()
>
> Not so, local_irq_save() doesn't at all stop regular RCU grace periods,
> things like preemptible rcu can be driven from rcu_read_unlock().
>
> Note that this is the reason call_rcu_sched() exists and is used in this
> case.

Ah, thanks for the correction.  I'll read some more about the _sched
variants.

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 15:11             ` Peter Zijlstra
@ 2012-05-01 15:33               ` Avi Kivity
  0 siblings, 0 replies; 36+ messages in thread
From: Avi Kivity @ 2012-05-01 15:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On 05/01/2012 06:11 PM, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> > We're now moving the freeing of kvm shadow page tables from using rcu to
> > using an irq-protected scheme like gup_fast(), because of the
> > performance differences.  We didn't track it down but I expect the cause
> > is less reuse of cache-hot pages. 
>
> It would be good to understand these things.. just doing things because
> tends to come back and bite you :-)

Agree.  Anyway that was only part of the motivation, the other part was
establishing a bound on the number of pages under deferred freeing.

> If its cache behaviour you should be able to clearly see that using perf
> stat.

It's hard to see it clearly because the changes are on the order of 1-2%
in a very noisy benchmark.

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 15:14             ` Peter Zijlstra
@ 2012-05-01 15:36               ` Avi Kivity
  2012-05-01 16:16                 ` Peter Zijlstra
  2012-05-01 16:18                 ` Peter Zijlstra
  0 siblings, 2 replies; 36+ messages in thread
From: Avi Kivity @ 2012-05-01 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On 05/01/2012 06:14 PM, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> > 
> > What's changed is not gup_fast() but the performance of munmap(),
> > exit(), and exec(), no?
>
> If it is indeed cache related like you suggested earlier, it would be
> the allocation side of things, like fork()/mmap() that suffer since
> there's fewer hot pages about, but yes, anything creating/destroying
> page-tables.

Right.

>
> > What bounds the amount of memory waiting to be freed during an rcu grace
> > period?
>
> Most RCU implementations don't have limits, so that could be quite a
> lot. I think preemptible RCU has a batch limit at which point it tries
> rather hard to force a grace period, but I'm not sure if even that
> provides a hard limit.
>
> Practically though, I haven't had reports of PPC/Sparc going funny
> because of this.

It could be considered a DoS if a user is able to free page tables
faster than rcu is able to recycle them, possibly triggering the oom
killer (should that force a grace period before firing from the hip?)

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 15:31               ` Avi Kivity
@ 2012-05-01 15:36                 ` Peter Zijlstra
  2012-05-01 15:39                   ` Avi Kivity
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-01 15:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On Tue, 2012-05-01 at 18:31 +0300, Avi Kivity wrote:
> On 05/01/2012 05:59 PM, Peter Zijlstra wrote:
> > On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> > > local_irq_save() is a stronger version or rcu_read_lock()
> >
> > Not so, local_irq_save() doesn't at all stop regular RCU grace periods,
> > things like preemptible rcu can be driven from rcu_read_unlock().
> >
> > Note that this is the reason call_rcu_sched() exists and is used in this
> > case.
> 
> Ah, thanks for the correction.  I'll read some more about the _sched
> variants.

Basically:

 rcu_read_{,un}lock_sched() -- aka preempt_{en,dis}able()
 call_rcu_sched()
 synchronize_rcu_sched() -- aka synchronize_sched();

Define an RCU variant where each cpu has to have scheduled at least once
to complete a grace period.
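
Usage is then something like (a sketch; free_table() stands in for
whatever callback actually frees the object):

	/* reader side: any non-preemptible region, IRQs-off included */
	rcu_read_lock_sched();			/* == preempt_disable() */
	p = rcu_dereference_sched(gp);
	/* ... use p ... */
	rcu_read_unlock_sched();

	/* updater side: callback runs only after every cpu has scheduled */
	call_rcu_sched(&p->rcu, free_table);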




* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 15:36                 ` Peter Zijlstra
@ 2012-05-01 15:39                   ` Avi Kivity
  2012-05-01 15:42                     ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2012-05-01 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On 05/01/2012 06:36 PM, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 18:31 +0300, Avi Kivity wrote:
> > On 05/01/2012 05:59 PM, Peter Zijlstra wrote:
> > > On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> > > > local_irq_save() is a stronger version or rcu_read_lock()
> > >
> > > Not so, local_irq_save() doesn't at all stop regular RCU grace periods,
> > > things like preemptible rcu can be driven from rcu_read_unlock().
> > >
> > > Note that this is the reason call_rcu_sched() exists and is used in this
> > > case.
> > 
> > Ah, thanks for the correction.  I'll read some more about the _sched
> > variants.
>
> Basically:
>
>  rcu_read_{,un}lock_sched() -- aka preempt_{en,dis}able()
>  call_rcu_sched()
>  synchronize_rcu_sched() -- aka synchronize_sched();
>
> Define an RCU variant where each cpu has to have scheduled at least once
> to complete a grace period.

Which was the original rcu implementation, until it got replaced by
preemptible rcu, yes?  I guess that was the source of my confusion.

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 15:39                   ` Avi Kivity
@ 2012-05-01 15:42                     ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-01 15:42 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa

On Tue, 2012-05-01 at 18:39 +0300, Avi Kivity wrote:

> >  rcu_read_{,un}lock_sched() -- aka preempt_{en,dis}able()
> >  call_rcu_sched()
> >  synchronize_rcu_sched() -- aka synchronize_sched();
> >
> > Define an RCU variant where each cpu has to have scheduled at least once
> > to complete a grace period.
> 
> Which was the original rcu implementation, until it got replaced by
> preemptible rcu, yes?  I guess that was the source of my confusion.

Indeed. The _sched RCU variant has the 'classic' RCU semantics.


* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 15:36               ` Avi Kivity
@ 2012-05-01 16:16                 ` Peter Zijlstra
  2012-05-01 16:43                   ` Paul E. McKenney
  2012-05-01 16:18                 ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-01 16:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa, paulmck

On Tue, 2012-05-01 at 18:36 +0300, Avi Kivity wrote:

> > > What bounds the amount of memory waiting to be freed during an rcu grace
> > > period?
> >
> > Most RCU implementations don't have limits, so that could be quite a
> > lot. I think preemptible RCU has a batch limit at which point it tries
> > rather hard to force a grace period, but I'm not sure if even that
> > provides a hard limit.
> >
> > Practically though, I haven't had reports of PPC/Sparc going funny
> > because of this.
> 
> It could be considered a DoS if a user is able to free page tables
> faster than rcu is able to recycle them, possibly triggering the oom
> killer (should that force a grace period before firing from the hip?)

One would think that would be a good thing, yes. However, I cannot seem
to find anything like that in the current OOM killer. David, Paul, I
seem to have vague recollections of a discussion about RCU vs OOM; what
was the resolution (if anything), and would something like the below make
sense?

---
 mm/oom_kill.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 46bf2ed5..244a371 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -607,6 +607,9 @@ int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 	struct zone *zone;
 	int ret = 1;
 
+	synchronize_sched();
+	synchronize_rcu();
+
 	spin_lock(&zone_scan_lock);
 	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
 		if (zone_is_oom_locked(zone)) {



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 15:36               ` Avi Kivity
  2012-05-01 16:16                 ` Peter Zijlstra
@ 2012-05-01 16:18                 ` Peter Zijlstra
  2012-05-01 16:20                   ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-01 16:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa, paulmck, David Rientjes

And now with David actually on CC ;-)

On Tue, 2012-05-01 at 18:36 +0300, Avi Kivity wrote:

> > > What bounds the amount of memory waiting to be freed during an rcu grace
> > > period?
> >
> > Most RCU implementations don't have limits, so that could be quite a
> > lot. I think preemptible RCU has a batch limit at which point it tries
> > rather hard to force a grace period, but I'm not sure if even that
> > provides a hard limit.
> >
> > Practically though, I haven't had reports of PPC/Sparc going funny
> > because of this.
> 
> It could be considered a DoS if a user is able to free page tables
> faster than rcu is able to recycle them, possibly triggering the oom
> killer (should that force a grace period before firing from the hip?)

One would think that would be a good thing, yes. However, I cannot seem
to find anything like that in the current OOM killer. David, Paul, I
seem to have vague recollections of a discussion about RCU vs OOM; what
was the resolution (if anything), and would something like the below make
sense?

---
 mm/oom_kill.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 46bf2ed5..244a371 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -607,6 +607,9 @@ int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 	struct zone *zone;
 	int ret = 1;
 
+	synchronize_sched();
+	synchronize_rcu();
+
 	spin_lock(&zone_scan_lock);
 	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
 		if (zone_is_oom_locked(zone)) {




* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 16:18                 ` Peter Zijlstra
@ 2012-05-01 16:20                   ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-01 16:20 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa, paulmck, David Rientjes

On Tue, 2012-05-01 at 18:18 +0200, Peter Zijlstra wrote:

> ---
>  mm/oom_kill.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 46bf2ed5..244a371 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -607,6 +607,9 @@ int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  	struct zone *zone;
>  	int ret = 1;
>  
> +	synchronize_sched();
> +	synchronize_rcu();
> +
>  	spin_lock(&zone_scan_lock);
>  	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
>  		if (zone_is_oom_locked(zone)) {
> 

Hmm, I guess that should be rcu_barrier_sched(); rcu_barrier(); instead
of the synchronize variants.


* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 16:16                 ` Peter Zijlstra
@ 2012-05-01 16:43                   ` Paul E. McKenney
  0 siblings, 0 replies; 36+ messages in thread
From: Paul E. McKenney @ 2012-05-01 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Nikunj A. Dadhania, mingo, jeremy, mtosatti, kvm,
	x86, vatsa, linux-kernel, hpa

On Tue, May 01, 2012 at 06:16:46PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 18:36 +0300, Avi Kivity wrote:
> 
> > > > What bounds the amount of memory waiting to be freed during an rcu grace
> > > > period?
> > >
> > > Most RCU implementations don't have limits, so that could be quite a
> > > lot. I think preemptible RCU has a batch limit at which point it tries
> > > rather hard to force a grace period, but I'm not sure if even that
> > > provides a hard limit.

All the TREE_RCU variants will get more aggressive about forcing grace
periods if any given CPU has more than 10,000 callbacks posted.  When this
happens, the call_rcu() variants will try to push things ahead.

> > > Practically though, I haven't had reports of PPC/Sparc going funny
> > > because of this.
> > 
> > It could be considered a DoS if a user is able to free page tables
> > faster than rcu is able to recycle them, possibly triggering the oom
> > killer (should that force a grace period before firing from the hip?)
> 
> One would think that would be a good thing, yes. However, I cannot seem
> to find anything like that in the current OOM killer. David, Paul, I
> seem to have vague recollections of a discussion about RCU vs OOM; what
> was the resolution (if anything), and would something like the below make
> sense?
> 
> ---
>  mm/oom_kill.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 46bf2ed5..244a371 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -607,6 +607,9 @@ int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  	struct zone *zone;
>  	int ret = 1;
>  
> +	synchronize_sched();
> +	synchronize_rcu();

This will wait for a grace period, but not for the callbacks, which are
the things that actually free the memory.  Given that, should we instead
do something like:

	rcu_barrier();

Note that rcu_barrier() and rcu_barrier_sched() are one and the same
for CONFIG_PREEMPT=n kernels, and there seems to be a lot more
call_rcu() than call_rcu_sched(), so I left out the rcu_barrier_sched().

That said, this does have the effect of delaying the startup of the OOM
killer, and it does nothing to tell RCU that accelerating grace periods
would be a good thing.  If a DoS attack is a theoretical possibility rather
than a real bug, is a pure wait on RCU the right approach?

Alternative approaches include:

1.	OOM killer calls into RCU, which arranges to become more
	aggressive about forcing grace periods.  (For example, RCU
	could set a flag that caused it to act as if there were
	lots of callbacks posted.)

2.	RCU provides an API that forces grace periods, perhaps
	invoked from a separate kthread so that the OOM killer can
	proceed in parallel with RCU's grace-period forcing.

3.	Like #2, but invoke it a bit earlier than the OOM killer
	would normally start running.

						Thanx, Paul

>  	spin_lock(&zone_scan_lock);
>  	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
>  		if (zone_is_oom_locked(zone)) {
> 
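
Folding Paul's suggestion back into Peter's patch would look roughly like
this (a hypothetical sketch, not something posted in the thread):

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -607,6 +607,13 @@ int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 	struct zone *zone;
 	int ret = 1;
 
+	/*
+	 * Wait for the pending callbacks, not just a grace period: the
+	 * callbacks are what actually free the memory.  On CONFIG_PREEMPT=n
+	 * kernels this covers the call_rcu_sched() callbacks as well.
+	 */
+	rcu_barrier();
+
 	spin_lock(&zone_scan_lock);
 	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
 		if (zone_is_oom_locked(zone)) {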



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 10:59           ` Peter Zijlstra
@ 2012-05-01 22:49             ` Jeremy Fitzhardinge
  2012-05-03 14:09               ` Stefano Stabellini
  0 siblings, 1 reply; 36+ messages in thread
From: Jeremy Fitzhardinge @ 2012-05-01 22:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Nikunj A. Dadhania, mingo, mtosatti, kvm, x86, vatsa,
	linux-kernel, hpa, Konrad Rzeszutek Wilk, Stefano Stabellini

On 05/01/2012 03:59 AM, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 12:57 +0200, Peter Zijlstra wrote:
>> Anyway, I don't have any idea about the costs involved with
>> HAVE_RCU_TABLE_FREE, but I don't think it's much.. otherwise these other
>> platforms (PPC, SPARC) wouldn't have used it; gup_fast() is a very
>> specific case, whereas mmu-gather is something affecting pretty much all
>> tasks.
> Which reminds me, I thought Xen needed this too, but a git grep on
> HAVE_RCU_TABLE_FREE shows it's still only ppc and sparc.
>
> Jeremy?

Yeah, I was thinking that too, but I can't remember what we did to
resolve it.  For pure PV guests, gupf simply isn't used, so the problem
is moot.  But for dom0 or PCI-passthrough it could be.

Konrad, Stefano?

    J


* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01  9:39     ` Peter Zijlstra
  2012-05-01 10:47       ` Avi Kivity
@ 2012-05-02  8:51       ` Nikunj A Dadhania
  2012-05-02 10:20         ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Nikunj A Dadhania @ 2012-05-02  8:51 UTC (permalink / raw)
  To: Peter Zijlstra, Avi Kivity
  Cc: mingo, jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, hpa

On Tue, 01 May 2012 11:39:36 +0200, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, 2012-04-29 at 15:23 +0300, Avi Kivity wrote:
> > On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > > flush_tlb_others_ipi depends on a lot of statics in tlb.c.  Replicated
> > > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > > paravirtualization.
> > >
> > > Use the vcpu state information inside the kvm_flush_tlb_others to
> > > avoid sending ipi to pre-empted vcpus.
> > >
> > > * Do not send IPIs to offline vcpus; instead set their flush_on_enter flag
> > 
> > get_user_pages_fast() depends on the IPI to hold off page table teardown
> > while they are locklessly walked with interrupts disabled.  If a vcpu
> > were to be preempted while in this critical section, another vcpu
> > tearing down page tables would go ahead and destroy them.  When the
> > preempted vcpu resumes, it then touches the freed pages.
> > 
> > We could try to teach kvm and get_user_pages_fast() about this, but this
> > is intrusive.  Another option is to replace the cpu_relax() loop with
> > something that sleeps and is then woken up by the TLB IPI handler if needed.
> 
> I think something like
> 
>   select HAVE_RCU_TABLE_FREE if PARAVIRT
> 
> or somesuch is just about all it takes.
>
[root@krm1 linux]# grep HAVE_RCU_TABLE .config
CONFIG_HAVE_RCU_TABLE_FREE=y
[root@krm1 linux]# make -j32  -s
mm/memory.c: In function ‘tlb_remove_table_one’:
mm/memory.c:315: error: implicit declaration of function ‘__tlb_remove_table’

I suppose we need to have __tlb_remove_table. Trying to understand what
needs to be done there.

Regards
Nikunj
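
For reference, the select Peter suggests would sit in arch/x86/Kconfig
roughly as follows (a sketch of the placement only; the neighbouring
entries are elided):

# arch/x86/Kconfig (sketch)
config X86
	def_bool y
	select HAVE_RCU_TABLE_FREE if PARAVIRT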



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-02  8:51       ` Nikunj A Dadhania
@ 2012-05-02 10:20         ` Peter Zijlstra
  2012-05-02 13:53           ` Nikunj A Dadhania
  2012-05-04  4:32           ` Nikunj A Dadhania
  0 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2012-05-02 10:20 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Avi Kivity, mingo, jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, hpa

On Wed, 2012-05-02 at 14:21 +0530, Nikunj A Dadhania wrote:
> [root@krm1 linux]# grep HAVE_RCU_TABLE .config
> CONFIG_HAVE_RCU_TABLE_FREE=y
> [root@krm1 linux]# make -j32  -s
> mm/memory.c: In function ‘tlb_remove_table_one’:
> mm/memory.c:315: error: implicit declaration of function ‘__tlb_remove_table’
> 
> I suppose we need to have __tlb_remove_table. Trying to understand what
> needs to be done there. 

Argh, I really should get back to unifying all mmu-gather
implementations :/

I think something like the below ought to sort it.

Completely untested though..

---
 arch/powerpc/include/asm/pgalloc.h  |    1 +
 arch/s390/mm/pgtable.c              |    1 +
 arch/sparc/include/asm/pgalloc_64.h |    1 +
 arch/x86/mm/pgtable.c               |    6 +++---
 include/asm-generic/tlb.h           |    9 +++++++++
 mm/memory.c                         |    7 +++++++
 6 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index bf301ac..c33ae79 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -49,6 +49,7 @@ static inline void __tlb_remove_table(void *_table)
 
 	pgtable_free(table, shift);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6e765bf..7029ed7 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -730,6 +730,7 @@ void __tlb_remove_table(void *_table)
 	else
 		free_pages((unsigned long) table, ALLOC_ORDER);
 }
+#define __tlb_remove_table __tlb_remove_table
 
 static void tlb_remove_table_smp_sync(void *arg)
 {
diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 40b2d7a..d10913a 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -106,6 +106,7 @@ static inline void __tlb_remove_table(void *_table)
 		is_page = true;
 	pgtable_free(table, is_page);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, bool is_page)
 {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..34fa168 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
 	pgtable_page_dtor(pte);
 	paravirt_release_pte(page_to_pfn(pte));
-	tlb_remove_page(tlb, pte);
+	tlb_remove_table(tlb, pte);
 }
 
 #if PAGETABLE_LEVELS > 2
 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 {
 	paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
-	tlb_remove_page(tlb, virt_to_page(pmd));
+	tlb_remove_table(tlb, virt_to_page(pmd));
 }
 
 #if PAGETABLE_LEVELS > 3
 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 {
 	paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-	tlb_remove_page(tlb, virt_to_page(pud));
+	tlb_remove_table(tlb, virt_to_page(pud));
 }
 #endif	/* PAGETABLE_LEVELS > 3 */
 #endif	/* PAGETABLE_LEVELS > 2 */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f96a5b5..8ca33e9 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -19,6 +19,8 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -60,6 +62,13 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+#else
+
+static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	tlb_remove_page(tlb, page);
+}
+
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index bf8b403..6ca6561 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -297,6 +297,13 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
  * See the comment near struct mmu_table_batch.
  */
 
+#ifndef __tlb_remove_table
+static void __tlb_remove_table(void *table)
+{
+	free_page_and_swap_cache(table);
+}
+#endif
+
 static void tlb_remove_table_smp_sync(void *arg)
 {
 	/* Simply deliver the interrupt */



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-02 10:20         ` Peter Zijlstra
@ 2012-05-02 13:53           ` Nikunj A Dadhania
  2012-05-04  4:32           ` Nikunj A Dadhania
  1 sibling, 0 replies; 36+ messages in thread
From: Nikunj A Dadhania @ 2012-05-02 13:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, mingo, jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, hpa

On Wed, 02 May 2012 12:20:40 +0200, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2012-05-02 at 14:21 +0530, Nikunj A Dadhania wrote:
> > [root@krm1 linux]# grep HAVE_RCU_TABLE .config
> > CONFIG_HAVE_RCU_TABLE_FREE=y
> > [root@krm1 linux]# make -j32  -s
> > mm/memory.c: In function ‘tlb_remove_table_one’:
> > mm/memory.c:315: error: implicit declaration of function ‘__tlb_remove_table’
> > 
> > I suppose we need to have __tlb_remove_table. Trying to understand what
> > needs to be done there. 
> 
> Argh, I really should get back to unifying all mmu-gather
> implementations :/
> 
> I think something like the below ought to sort it.
> 
Thanks a lot.

> Completely untested though..
> 

Tested-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>

Here is the comparison with the other version. 

	Gang    pv_spin_flush    pv_spin_flush_rcu  
1VM     1.01       0.49            0.49     
2VMs    7.07       4.04            4.06     
4VMs    9.07       5.27            5.19     
8VMs    9.99       7.65            7.80     

Will test other use cases as well and report back.

Regards
Nikunj



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-01 22:49             ` Jeremy Fitzhardinge
@ 2012-05-03 14:09               ` Stefano Stabellini
  0 siblings, 0 replies; 36+ messages in thread
From: Stefano Stabellini @ 2012-05-03 14:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Peter Zijlstra, Avi Kivity, Nikunj A. Dadhania, mingo, mtosatti,
	kvm, x86, vatsa, linux-kernel, hpa, Konrad Rzeszutek Wilk,
	Stefano Stabellini

On Tue, 1 May 2012, Jeremy Fitzhardinge wrote:
> On 05/01/2012 03:59 AM, Peter Zijlstra wrote:
> > On Tue, 2012-05-01 at 12:57 +0200, Peter Zijlstra wrote:
> >> Anyway, I don't have any idea about the costs involved with
> >> HAVE_RCU_TABLE_FREE, but I don't think it's much.. otherwise these other
> >> platforms (PPC, SPARC) wouldn't have used it; gup_fast() is a very
> >> specific case, whereas mmu-gather is something affecting pretty much all
> >> tasks.
> > Which reminds me, I thought Xen needed this too, but a git grep on
> > HAVE_RCU_TABLE_FREE shows it's still only ppc and sparc.
> >
> > Jeremy?
> 
> Yeah, I was thinking that too, but I can't remember what we did to
> resolve it.  For pure PV guests, gupf simply isn't used, so the problem
> is moot.  But for dom0 or PCI-passthrough it could be.

Yes, dom0 can use gupf, for example when a userspace block backend is
involved.

Reading the code, it seems to me that xen_flush_tlb_others returns
immediately and successfully, regardless of whether the vcpus are
currently running and regardless of whether any of them previously
disabled interrupts.

Therefore I think that we should be using HAVE_RCU_TABLE_FREE.
I am going to submit a patch for that.


* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-02 10:20         ` Peter Zijlstra
  2012-05-02 13:53           ` Nikunj A Dadhania
@ 2012-05-04  4:32           ` Nikunj A Dadhania
  1 sibling, 0 replies; 36+ messages in thread
From: Nikunj A Dadhania @ 2012-05-04  4:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, mingo, jeremy, mtosatti, kvm, x86, vatsa, linux-kernel, hpa

On Wed, 02 May 2012 12:20:40 +0200, Peter Zijlstra <peterz@infradead.org> wrote:
[...] 
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index f96a5b5..8ca33e9 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -19,6 +19,8 @@
>  #include <asm/pgalloc.h>
>  #include <asm/tlbflush.h>
>  
> +static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
> +
>  #ifdef CONFIG_HAVE_RCU_TABLE_FREE
>  /*
>   * Semi RCU freeing of the page directories.
> @@ -60,6 +62,13 @@ struct mmu_table_batch {
>  extern void tlb_table_flush(struct mmu_gather *tlb);
>  extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
>  
> +#else
> +
> +static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
> +{
> +	tlb_remove_page(tlb, page);
>
tlb_remove_page(tlb, table);
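
With that fix folded in, the !CONFIG_HAVE_RCU_TABLE_FREE fallback reads:

static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	tlb_remove_page(tlb, table);	/* 'table' is the struct page here */
}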



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-04-27 16:24 ` [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
  2012-04-29 12:23   ` Avi Kivity
@ 2012-05-04 11:44   ` Srivatsa Vaddagiri
  2012-05-07  3:10     ` Nikunj A Dadhania
  1 sibling, 1 reply; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2012-05-04 11:44 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: peterz, mingo, jeremy, mtosatti, kvm, x86, linux-kernel, avi, hpa

* Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> [2012-04-27 21:54:37]:

> @@ -1549,6 +1549,11 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
>  		return;
> 
>  	vs->state = 1;
> +	if (vs->flush_on_enter) {
> +		kvm_mmu_flush_tlb(vcpu);
> +		vs->flush_on_enter = 0;
> +	}
> +
>  	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));

Reading flush_on_enter before writing ->state (=1) is racy afaics (and
may cause a vcpu to miss a TLB flush request).

- vatsa
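
To spell the race out, here is one possible interleaving (a hypothetical
timeline; the field names are from the patch quoted above):

  host, on vcpu entry                    initiator vcpu, in guest
  (kvm_set_vcpu_state)                   (kvm_flush_tlb_others)
  -----------------------------------    -----------------------------------
  vs->state = 1 (host-side copy)
  sees vs->flush_on_enter == 0
                                         reads state == 0 from guest memory
                                         (host write not yet performed)
                                         sets flush_on_enter = 1, skips IPI
  kvm_write_guest_cached(..., 2 words)
    clobbers flush_on_enter back to 0
  vcpu runs with a stale TLB             the flush request is lost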



* Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others
  2012-05-04 11:44   ` Srivatsa Vaddagiri
@ 2012-05-07  3:10     ` Nikunj A Dadhania
  0 siblings, 0 replies; 36+ messages in thread
From: Nikunj A Dadhania @ 2012-05-07  3:10 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: peterz, mingo, jeremy, mtosatti, kvm, x86, linux-kernel, avi, hpa

On Fri, 4 May 2012 17:14:49 +0530, Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:
> * Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> [2012-04-27 21:54:37]:
> 
> > @@ -1549,6 +1549,11 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
> >  		return;
> > 
> >  	vs->state = 1;
> > +	if (vs->flush_on_enter) {
> > +		kvm_mmu_flush_tlb(vcpu);
> > +		vs->flush_on_enter = 0;
> > +	}
> > +
> >  	kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
> 
> Reading flush_on_enter before writing ->state (=1) is racy afaics (and
> may cause vcpu to miss a TLB flush request).
> 
Yes, I see this with sysbench. Here is what I have now; so far I have
tested it with sysbench (50 runs). Will fold this into my v2.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 60546e9..b2ee9fd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1548,9 +1548,20 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
        if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
                return;
 
+       /*
+        * Let the guest know that we are online; to avoid overwriting
+        * flush_on_enter, write only vs->state.
+        */
        vs->state = 1;
-       kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+       kvm_write_guest_cached(vcpu->kvm, ghc, vs, 1*sizeof(__u32));
        smp_wmb();
+       /*
+        * The guest might have seen us offline and set flush_on_enter;
+        * honour that request now.
+        */
+       kvm_read_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+       if (vs->flush_on_enter)
+               kvm_x86_ops->tlb_flush(vcpu);
 }
 
 static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)


Nikunj



end of thread

Thread overview: 36+ messages
2012-04-27 16:23 [RFC PATCH v1 0/5] KVM paravirt remote flush tlb Nikunj A. Dadhania
2012-04-27 16:23 ` [RFC PATCH v1 1/5] KVM Guest: Add VCPU running/pre-empted state for guest Nikunj A. Dadhania
2012-05-01  1:03   ` Raghavendra K T
2012-05-01  3:25     ` Nikunj A Dadhania
2012-04-27 16:23 ` [RFC PATCH v1 2/5] KVM-HV: " Nikunj A. Dadhania
2012-04-27 16:24 ` [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others Nikunj A. Dadhania
2012-04-29 12:23   ` Avi Kivity
2012-05-01  3:34     ` Nikunj A Dadhania
2012-05-01  9:39     ` Peter Zijlstra
2012-05-01 10:47       ` Avi Kivity
2012-05-01 10:57         ` Peter Zijlstra
2012-05-01 10:59           ` Peter Zijlstra
2012-05-01 22:49             ` Jeremy Fitzhardinge
2012-05-03 14:09               ` Stefano Stabellini
2012-05-01 12:12           ` Avi Kivity
2012-05-01 14:59             ` Peter Zijlstra
2012-05-01 15:31               ` Avi Kivity
2012-05-01 15:36                 ` Peter Zijlstra
2012-05-01 15:39                   ` Avi Kivity
2012-05-01 15:42                     ` Peter Zijlstra
2012-05-01 15:11             ` Peter Zijlstra
2012-05-01 15:33               ` Avi Kivity
2012-05-01 15:14             ` Peter Zijlstra
2012-05-01 15:36               ` Avi Kivity
2012-05-01 16:16                 ` Peter Zijlstra
2012-05-01 16:43                   ` Paul E. McKenney
2012-05-01 16:18                 ` Peter Zijlstra
2012-05-01 16:20                   ` Peter Zijlstra
2012-05-02  8:51       ` Nikunj A Dadhania
2012-05-02 10:20         ` Peter Zijlstra
2012-05-02 13:53           ` Nikunj A Dadhania
2012-05-04  4:32           ` Nikunj A Dadhania
2012-05-04 11:44   ` Srivatsa Vaddagiri
2012-05-07  3:10     ` Nikunj A Dadhania
2012-04-27 16:26 ` [RFC PATCH v1 4/5] KVM: get kvm_kick_vcpu out for pv_flush Nikunj A. Dadhania
2012-04-27 16:27 ` [RFC PATCH v1 5/5] KVM: Introduce PV kick in flush tlb Nikunj A. Dadhania
