All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] KVM: x86: Various bug fixes
@ 2014-10-02 22:10 Nadav Amit
  2014-10-02 22:10 ` [PATCH 1/5] KVM: x86: Clear DR7.LE during task-switch Nadav Amit
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Nadav Amit @ 2014-10-02 22:10 UTC (permalink / raw)
  To: pbonzini, joro; +Cc: kvm, Nadav Amit

This patch-set includes various bug fixes.

The most notable bug is the wrong decoding of guest instructions which cross
page boundary. I am surprised it does not happen in real-systems.

Please review carefully the changes the DR4/5 bug-fix introduces to SVM part.
Since I do not have AMD machine I could not check the patch. Anyhow, I suspect
SVM dr-interception is buggy since in certain cases instructions are skipped
while fault is injected.

Thanks for reviewing the patches.

Nadav Amit (5):
  KVM: x86: Clear DR7.LE during task-switch
  KVM: x86: Emulator performs code segment checks on read access
  KVM: x86: Decoding guest instructions which cross page boundary may
    fail
  KVM: vmx: Unavailable DR4/5 is checked before CPL
  KVM: x86: Using TSC deadline may cause multiple interrupts by user
    writes

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/emulate.c          | 10 +++++----
 arch/x86/kvm/lapic.c            |  7 +++++-
 arch/x86/kvm/lapic.h            |  3 ++-
 arch/x86/kvm/svm.c              | 10 +++++----
 arch/x86/kvm/vmx.c              | 19 ++++++++++------
 arch/x86/kvm/x86.c              | 48 ++++++++++++++++-------------------------
 7 files changed, 51 insertions(+), 47 deletions(-)

-- 
1.9.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/5] KVM: x86: Clear DR7.LE during task-switch
  2014-10-02 22:10 [PATCH 0/5] KVM: x86: Various bug fixes Nadav Amit
@ 2014-10-02 22:10 ` Nadav Amit
  2014-10-06 19:45   ` Radim Krčmář
  2014-10-02 22:10 ` [PATCH 2/5] KVM: x86: Emulator performs code segment checks on read access Nadav Amit
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2014-10-02 22:10 UTC (permalink / raw)
  To: pbonzini, joro; +Cc: kvm, Nadav Amit

DR7.LE should be cleared during task-switch. This feature is poorly documented.
For reference, see:
http://pdos.csail.mit.edu/6.828/2005/readings/i386/s12_02.htm

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
---
 arch/x86/kvm/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 04fa1b8..4190eb5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5483,7 +5483,7 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
 	}
 
 	/* clear all local breakpoint enable flags */
-	vmcs_writel(GUEST_DR7, vmcs_readl(GUEST_DR7) & ~0x55);
+	vmcs_writel(GUEST_DR7, vmcs_readl(GUEST_DR7) & ~0x155);
 
 	/*
 	 * TODO: What about debug traps on tss switch?
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/5] KVM: x86: Emulator performs code segment checks on read access
  2014-10-02 22:10 [PATCH 0/5] KVM: x86: Various bug fixes Nadav Amit
  2014-10-02 22:10 ` [PATCH 1/5] KVM: x86: Clear DR7.LE during task-switch Nadav Amit
@ 2014-10-02 22:10 ` Nadav Amit
  2014-10-06 20:32   ` Radim Krčmář
  2014-10-02 22:10 ` [PATCH 3/5] KVM: x86: Decoding guest instructions which cross page boundary may fail Nadav Amit
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2014-10-02 22:10 UTC (permalink / raw)
  To: pbonzini, joro; +Cc: kvm, Nadav Amit

When read access is performed using a readable code segment, the "conforming"
and "non-conforming" checks should not be done.  As a result, read using
non-conforming readable code segment fails.

This is according to Intel SDM 5.6.1 ("Accessing Data in Code Segments").

The fix is not to perform the "non-conforming" checks if the access is not a
fetch.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
---
 arch/x86/kvm/emulate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a46207a..694dfa7 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -661,8 +661,8 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
 				goto bad;
 		}
 		cpl = ctxt->ops->cpl(ctxt);
-		if (!(desc.type & 8)) {
-			/* data segment */
+		if (!fetch) {
+			/* data segment or readable code segment */
 			if (cpl > desc.dpl)
 				goto bad;
 		} else if ((desc.type & 8) && !(desc.type & 4)) {
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/5] KVM: x86: Decoding guest instructions which cross page boundary may fail
  2014-10-02 22:10 [PATCH 0/5] KVM: x86: Various bug fixes Nadav Amit
  2014-10-02 22:10 ` [PATCH 1/5] KVM: x86: Clear DR7.LE during task-switch Nadav Amit
  2014-10-02 22:10 ` [PATCH 2/5] KVM: x86: Emulator performs code segment checks on read access Nadav Amit
@ 2014-10-02 22:10 ` Nadav Amit
  2014-10-06 20:50   ` Radim Krčmář
  2014-10-02 22:10 ` [PATCH 4/5] KVM: vmx: Unavailable DR4/5 is checked before CPL Nadav Amit
  2014-10-02 22:10 ` [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes Nadav Amit
  4 siblings, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2014-10-02 22:10 UTC (permalink / raw)
  To: pbonzini, joro; +Cc: kvm, Nadav Amit

Once an instruction crosses a page boundary, the size read from the second page
disregards the common case that part of the operand resides on the first page.
As a result, fetch of long insturctions may fail, and thereby cause the
decoding to fail as well.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
---
 arch/x86/kvm/emulate.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 694dfa7..b2e8aba 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -751,8 +751,10 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
 					       unsigned size)
 {
-	if (unlikely(ctxt->fetch.end - ctxt->fetch.ptr < size))
-		return __do_insn_fetch_bytes(ctxt, size);
+	unsigned done_size = ctxt->fetch.end - ctxt->fetch.ptr;
+
+	if (unlikely(done_size < size))
+		return __do_insn_fetch_bytes(ctxt, size - done_size);
 	else
 		return X86EMUL_CONTINUE;
 }
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/5] KVM: vmx: Unavailable DR4/5 is checked before CPL
  2014-10-02 22:10 [PATCH 0/5] KVM: x86: Various bug fixes Nadav Amit
                   ` (2 preceding siblings ...)
  2014-10-02 22:10 ` [PATCH 3/5] KVM: x86: Decoding guest instructions which cross page boundary may fail Nadav Amit
@ 2014-10-02 22:10 ` Nadav Amit
  2014-10-06 19:33   ` Radim Krčmář
  2014-10-02 22:10 ` [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes Nadav Amit
  4 siblings, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2014-10-02 22:10 UTC (permalink / raw)
  To: pbonzini, joro; +Cc: kvm, Nadav Amit

If DR4/5 is accessed when it is unavailable (since CR4.DE is set), then #UD
should be generated even if CPL>0. This is according to Intel SDM Table 6-2:
"Priority Among Simultaneous Exceptions and Interrupts".

Note, that this may happen on the first DR access, even if the host does not
sets debug breakpoints. Obviously, it occurs when the host debugs the guest.

This patch moves the DR4/5 checks from __kvm_set_dr/_kvm_get_dr to handle_dr.
The emulator already checks DR4/5 availability in check_dr_read. Nested
virutalization related calls to kvm_set_dr/kvm_get_dr would not like to inject
exceptions to the guest.

As for SVM, the patch follows the previous logic as much as possible. Anyhow,
it appears the DR interception code might be buggy - even if the DR access
may cause an exception, the instruction is skipped.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/svm.c              | 10 +++++----
 arch/x86/kvm/vmx.c              | 17 +++++++++------
 arch/x86/kvm/x86.c              | 46 +++++++++++++++--------------------------
 4 files changed, 35 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7d603a7..592e030 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -895,6 +895,7 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			    gfn_t gfn, void *data, int offset, int len,
 			    u32 access);
 bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
+bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr);
 
 static inline int __kvm_irq_line_state(unsigned long *irq_state,
 				       int irq_source_id, int level)
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f7f6a4a..df3860b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2999,7 +2999,6 @@ static int dr_interception(struct vcpu_svm *svm)
 {
 	int reg, dr;
 	unsigned long val;
-	int err;
 
 	if (svm->vcpu.guest_debug == 0) {
 		/*
@@ -3019,12 +3018,15 @@ static int dr_interception(struct vcpu_svm *svm)
 	dr = svm->vmcb->control.exit_code - SVM_EXIT_READ_DR0;
 
 	if (dr >= 16) { /* mov to DRn */
+		if (!kvm_require_dr(&svm->vcpu, dr - 16))
+			return 1;
 		val = kvm_register_read(&svm->vcpu, reg);
 		kvm_set_dr(&svm->vcpu, dr - 16, val);
 	} else {
-		err = kvm_get_dr(&svm->vcpu, dr, &val);
-		if (!err)
-			kvm_register_write(&svm->vcpu, reg, val);
+		if (!kvm_require_dr(&svm->vcpu, dr))
+			return 1;
+		kvm_get_dr(&svm->vcpu, dr, &val);
+		kvm_register_write(&svm->vcpu, reg, val);
 	}
 
 	skip_emulated_instruction(&svm->vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4190eb5..c845a19 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5154,13 +5154,20 @@ static int handle_cr(struct kvm_vcpu *vcpu)
 static int handle_dr(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification;
-	int dr, reg;
+	int dr, dr7, reg;
+
+	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	dr = exit_qualification & DEBUG_REG_ACCESS_NUM;
+
+	/* First, if DR does not exist, trigger UD */
+	if (!kvm_require_dr(vcpu, dr))
+		return 1;
 
 	/* Do not handle if the CPL > 0, will trigger GP on re-entry */
 	if (!kvm_require_cpl(vcpu, 0))
 		return 1;
-	dr = vmcs_readl(GUEST_DR7);
-	if (dr & DR7_GD) {
+	dr7 = vmcs_readl(GUEST_DR7);
+	if (dr7 & DR7_GD) {
 		/*
 		 * As the vm-exit takes precedence over the debug trap, we
 		 * need to emulate the latter, either for the host or the
@@ -5168,7 +5175,7 @@ static int handle_dr(struct kvm_vcpu *vcpu)
 		 */
 		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) {
 			vcpu->run->debug.arch.dr6 = vcpu->arch.dr6;
-			vcpu->run->debug.arch.dr7 = dr;
+			vcpu->run->debug.arch.dr7 = dr7;
 			vcpu->run->debug.arch.pc =
 				vmcs_readl(GUEST_CS_BASE) +
 				vmcs_readl(GUEST_RIP);
@@ -5200,8 +5207,6 @@ static int handle_dr(struct kvm_vcpu *vcpu)
 		return 1;
 	}
 
-	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
-	dr = exit_qualification & DEBUG_REG_ACCESS_NUM;
 	reg = DEBUG_REG_ACCESS_REG(exit_qualification);
 	if (exit_qualification & TYPE_MOV_FROM_DR) {
 		unsigned long val;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6857257..e903167 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -450,6 +450,16 @@ bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl)
 }
 EXPORT_SYMBOL_GPL(kvm_require_cpl);
 
+bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr)
+{
+	if ((dr != 4 && dr != 5) || !kvm_read_cr4_bits(vcpu, X86_CR4_DE))
+		return true;
+
+	kvm_queue_exception(vcpu, UD_VECTOR);
+	return false;
+}
+EXPORT_SYMBOL_GPL(kvm_require_dr);
+
 /*
  * This function will be used to read from the physical memory of the currently
  * running guest. The difference to kvm_read_guest_page is that this function
@@ -806,8 +816,6 @@ static int __kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val)
 			vcpu->arch.eff_db[dr] = val;
 		break;
 	case 4:
-		if (kvm_read_cr4_bits(vcpu, X86_CR4_DE))
-			return 1; /* #UD */
 		/* fall through */
 	case 6:
 		if (val & 0xffffffff00000000ULL)
@@ -816,8 +824,6 @@ static int __kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val)
 		kvm_update_dr6(vcpu);
 		break;
 	case 5:
-		if (kvm_read_cr4_bits(vcpu, X86_CR4_DE))
-			return 1; /* #UD */
 		/* fall through */
 	default: /* 7 */
 		if (val & 0xffffffff00000000ULL)
@@ -832,27 +838,21 @@ static int __kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val)
 
 int kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val)
 {
-	int res;
-
-	res = __kvm_set_dr(vcpu, dr, val);
-	if (res > 0)
-		kvm_queue_exception(vcpu, UD_VECTOR);
-	else if (res < 0)
+	if (__kvm_set_dr(vcpu, dr, val)) {
 		kvm_inject_gp(vcpu, 0);
-
-	return res;
+		return 1;
+	}
+	return 0;
 }
 EXPORT_SYMBOL_GPL(kvm_set_dr);
 
-static int _kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val)
+int kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val)
 {
 	switch (dr) {
 	case 0 ... 3:
 		*val = vcpu->arch.db[dr];
 		break;
 	case 4:
-		if (kvm_read_cr4_bits(vcpu, X86_CR4_DE))
-			return 1;
 		/* fall through */
 	case 6:
 		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
@@ -861,23 +861,11 @@ static int _kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val)
 			*val = kvm_x86_ops->get_dr6(vcpu);
 		break;
 	case 5:
-		if (kvm_read_cr4_bits(vcpu, X86_CR4_DE))
-			return 1;
 		/* fall through */
 	default: /* 7 */
 		*val = vcpu->arch.dr7;
 		break;
 	}
-
-	return 0;
-}
-
-int kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val)
-{
-	if (_kvm_get_dr(vcpu, dr, val)) {
-		kvm_queue_exception(vcpu, UD_VECTOR);
-		return 1;
-	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(kvm_get_dr);
@@ -3076,7 +3064,7 @@ static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
 	unsigned long val;
 
 	memcpy(dbgregs->db, vcpu->arch.db, sizeof(vcpu->arch.db));
-	_kvm_get_dr(vcpu, 6, &val);
+	kvm_get_dr(vcpu, 6, &val);
 	dbgregs->dr6 = val;
 	dbgregs->dr7 = vcpu->arch.dr7;
 	dbgregs->flags = 0;
@@ -4637,7 +4625,7 @@ static void emulator_wbinvd(struct x86_emulate_ctxt *ctxt)
 
 int emulator_get_dr(struct x86_emulate_ctxt *ctxt, int dr, unsigned long *dest)
 {
-	return _kvm_get_dr(emul_to_vcpu(ctxt), dr, dest);
+	return kvm_get_dr(emul_to_vcpu(ctxt), dr, dest);
 }
 
 int emulator_set_dr(struct x86_emulate_ctxt *ctxt, int dr, unsigned long value)
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-02 22:10 [PATCH 0/5] KVM: x86: Various bug fixes Nadav Amit
                   ` (3 preceding siblings ...)
  2014-10-02 22:10 ` [PATCH 4/5] KVM: vmx: Unavailable DR4/5 is checked before CPL Nadav Amit
@ 2014-10-02 22:10 ` Nadav Amit
  2014-10-06 20:57   ` Radim Krčmář
  2014-10-08  9:29   ` Paolo Bonzini
  4 siblings, 2 replies; 32+ messages in thread
From: Nadav Amit @ 2014-10-02 22:10 UTC (permalink / raw)
  To: pbonzini, joro; +Cc: kvm, Nadav Amit

Setting the TSC deadline MSR that are initiated by the host (using ioctl's) may
cause superfluous interrupt.  This occurs in the following case:

1. A TSC deadline timer interrupt is pending.
2. TSC deadline was still not cleared (which happens during vcpu_run).
3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.

To solve this situation, ignore host initiated TSC deadline writes that do not
change the deadline value.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
---
 arch/x86/kvm/lapic.c | 7 ++++++-
 arch/x86/kvm/lapic.h | 3 ++-
 arch/x86/kvm/x86.c   | 2 +-
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b8345dd..0bcf2e1 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1346,14 +1346,19 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu)
 	return apic->lapic_timer.tscdeadline;
 }
 
-void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
+void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu,
+				   struct msr_data *msr_info)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
+	u64 data = msr_info->data;
 
 	if (!kvm_vcpu_has_lapic(vcpu) || apic_lvtt_oneshot(apic) ||
 			apic_lvtt_period(apic))
 		return;
 
+	if (msr_info->host_initiated && apic->lapic_timer.tscdeadline == data)
+		return;
+
 	hrtimer_cancel(&apic->lapic_timer.timer);
 	/* Inject here so clearing tscdeadline won't override new value */
 	if (apic_has_pending_timer(vcpu))
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 6a11845..5bfa3db 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -71,7 +71,8 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
 int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
 
 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
-void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
+void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu,
+				struct msr_data *msr_info);
 
 void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
 void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e903167..8c2745c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2093,7 +2093,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff:
 		return kvm_x2apic_msr_write(vcpu, msr, data);
 	case MSR_IA32_TSCDEADLINE:
-		kvm_set_lapic_tscdeadline_msr(vcpu, data);
+		kvm_set_lapic_tscdeadline_msr(vcpu, msr_info);
 		break;
 	case MSR_IA32_TSC_ADJUST:
 		if (guest_cpuid_has_tsc_adjust(vcpu)) {
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/5] KVM: vmx: Unavailable DR4/5 is checked before CPL
  2014-10-02 22:10 ` [PATCH 4/5] KVM: vmx: Unavailable DR4/5 is checked before CPL Nadav Amit
@ 2014-10-06 19:33   ` Radim Krčmář
  0 siblings, 0 replies; 32+ messages in thread
From: Radim Krčmář @ 2014-10-06 19:33 UTC (permalink / raw)
  To: Nadav Amit; +Cc: pbonzini, joro, kvm

2014-10-03 01:10+0300, Nadav Amit:
> If DR4/5 is accessed when it is unavailable (since CR4.DE is set), then #UD
> should be generated even if CPL>0. This is according to Intel SDM Table 6-2:
> "Priority Among Simultaneous Exceptions and Interrupts".
> 
> Note, that this may happen on the first DR access, even if the host does not
> sets debug breakpoints. Obviously, it occurs when the host debugs the guest.

(This got me confused for a while; "first" because we disable DR exiting
 in the handler.)

> This patch moves the DR4/5 checks from __kvm_set_dr/_kvm_get_dr to handle_dr.
> The emulator already checks DR4/5 availability in check_dr_read. Nested
> virutalization related calls to kvm_set_dr/kvm_get_dr would not like to inject
> exceptions to the guest.
> 
> As for SVM, the patch follows the previous logic as much as possible. Anyhow,
> it appears the DR interception code might be buggy - even if the DR access
> may cause an exception, the instruction is skipped.

SVM likely injects GP (UD) before it intercepts DR.  [2:Table 15-7]:
  All normal exception checks take precedence over the SVM intercepts.
=> no need to check even in our case.

> Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
> ---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6857257..e903167 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -806,8 +816,6 @@ static int __kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val)
>  			vcpu->arch.eff_db[dr] = val;
>  		break;
>  	case 4:
> -		if (kvm_read_cr4_bits(vcpu, X86_CR4_DE))

WARN_ONCE_ON() instead?

> -			return 1; /* #UD */
>  		/* fall through */
>  	case 6:
>  		if (val & 0xffffffff00000000ULL)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Clear DR7.LE during task-switch
  2014-10-02 22:10 ` [PATCH 1/5] KVM: x86: Clear DR7.LE during task-switch Nadav Amit
@ 2014-10-06 19:45   ` Radim Krčmář
  0 siblings, 0 replies; 32+ messages in thread
From: Radim Krčmář @ 2014-10-06 19:45 UTC (permalink / raw)
  To: Nadav Amit; +Cc: pbonzini, joro, kvm

2014-10-03 01:10+0300, Nadav Amit:
> DR7.LE should be cleared during task-switch. This feature is poorly documented.
> For reference, see:
> http://pdos.csail.mit.edu/6.828/2005/readings/i386/s12_02.htm

SDM [17.2.4]:
  This feature is not supported in the P6 family processors, later IA-32
  processors, and Intel 64 processors.

AMD [2:13.1.1.4]:
  This bit is ignored by implementations of the AMD64 architecture.

Intel's formulation could mean that it isn't even zeroed, but if current
hardware behaves like that,

> Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
> ---

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
(It shouldn't change a thing.)

>  arch/x86/kvm/vmx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 04fa1b8..4190eb5 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -5483,7 +5483,7 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
>  	}
>  
>  	/* clear all local breakpoint enable flags */
> -	vmcs_writel(GUEST_DR7, vmcs_readl(GUEST_DR7) & ~0x55);
> +	vmcs_writel(GUEST_DR7, vmcs_readl(GUEST_DR7) & ~0x155);
>  
>  	/*
>  	 * TODO: What about debug traps on tss switch?
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/5] KVM: x86: Emulator performs code segment checks on read access
  2014-10-02 22:10 ` [PATCH 2/5] KVM: x86: Emulator performs code segment checks on read access Nadav Amit
@ 2014-10-06 20:32   ` Radim Krčmář
  2014-10-10  2:07     ` [PATCH v2 " Nadav Amit
  0 siblings, 1 reply; 32+ messages in thread
From: Radim Krčmář @ 2014-10-06 20:32 UTC (permalink / raw)
  To: Nadav Amit; +Cc: pbonzini, joro, kvm

2014-10-03 01:10+0300, Nadav Amit:
> When read access is performed using a readable code segment, the "conforming"
> and "non-conforming" checks should not be done.  As a result, read using
> non-conforming readable code segment fails.
> 
> This is according to Intel SDM 5.6.1 ("Accessing Data in Code Segments").

(Checks are neccessarily true, except for non-conforming readable code,
 which is to be treated like a data segment.)

> The fix is not to perform the "non-conforming" checks if the access is not a
> fetch.

(Because 'fetch' implies correct conditions and the rest is checked when
 loading the segment,)

> Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
> ---

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>

>  arch/x86/kvm/emulate.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index a46207a..694dfa7 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -661,8 +661,8 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
>  				goto bad;
>  		}
>  		cpl = ctxt->ops->cpl(ctxt);
> -		if (!(desc.type & 8)) {
> -			/* data segment */
> +		if (!fetch) {
> +			/* data segment or readable code segment */
>  			if (cpl > desc.dpl)
>  				goto bad;
>  		} else if ((desc.type & 8) && !(desc.type & 4)) {
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] KVM: x86: Decoding guest instructions which cross page boundary may fail
  2014-10-02 22:10 ` [PATCH 3/5] KVM: x86: Decoding guest instructions which cross page boundary may fail Nadav Amit
@ 2014-10-06 20:50   ` Radim Krčmář
  2014-10-07  9:15     ` Nadav Amit
  0 siblings, 1 reply; 32+ messages in thread
From: Radim Krčmář @ 2014-10-06 20:50 UTC (permalink / raw)
  To: Nadav Amit; +Cc: pbonzini, joro, kvm

2014-10-03 01:10+0300, Nadav Amit:
> Once an instruction crosses a page boundary, the size read from the second page
> disregards the common case that part of the operand resides on the first page.
> As a result, fetch of long insturctions may fail, and thereby cause the
> decoding to fail as well.
> 
> Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
> ---

Good catch, was it thanks to an exhaustive test-suite?

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>

>  arch/x86/kvm/emulate.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 694dfa7..b2e8aba 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -751,8 +751,10 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
>  static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
>  					       unsigned size)
>  {
> -	if (unlikely(ctxt->fetch.end - ctxt->fetch.ptr < size))
> -		return __do_insn_fetch_bytes(ctxt, size);
> +	unsigned done_size = ctxt->fetch.end - ctxt->fetch.ptr;
> +
> +	if (unlikely(done_size < size))
> +		return __do_insn_fetch_bytes(ctxt, size - done_size);
>  	else
>  		return X86EMUL_CONTINUE;
>  }
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-02 22:10 ` [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes Nadav Amit
@ 2014-10-06 20:57   ` Radim Krčmář
  2014-10-07  9:35     ` Nadav Amit
  2014-10-08  9:29   ` Paolo Bonzini
  1 sibling, 1 reply; 32+ messages in thread
From: Radim Krčmář @ 2014-10-06 20:57 UTC (permalink / raw)
  To: Nadav Amit; +Cc: pbonzini, joro, kvm

2014-10-03 01:10+0300, Nadav Amit:
> Setting the TSC deadline MSR that are initiated by the host (using ioctl's) may
> cause superfluous interrupt.  This occurs in the following case:
> 
> 1. A TSC deadline timer interrupt is pending.
> 2. TSC deadline was still not cleared (which happens during vcpu_run).
> 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.
> 
> To solve this situation, ignore host initiated TSC deadline writes that do not
> change the deadline value.

I find this change slightly dubious ...
 - why does the userspace do that?
 - why is host_initiated required?

Thanks.

It seems like an performance improvement, so why shouldn't return when
  'data <= tscdeadline && pending()'
or even
  'data <= now() && pending()'

(Sorry, I ran out of time to search for answers today.)

> Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
> ---
>  arch/x86/kvm/lapic.c | 7 ++++++-
>  arch/x86/kvm/lapic.h | 3 ++-
>  arch/x86/kvm/x86.c   | 2 +-
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b8345dd..0bcf2e1 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1346,14 +1346,19 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu)
>  	return apic->lapic_timer.tscdeadline;
>  }
>  
> -void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
> +void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu,
> +				   struct msr_data *msr_info)
>  {
>  	struct kvm_lapic *apic = vcpu->arch.apic;
> +	u64 data = msr_info->data;
>  
>  	if (!kvm_vcpu_has_lapic(vcpu) || apic_lvtt_oneshot(apic) ||
>  			apic_lvtt_period(apic))
>  		return;
>  
> +	if (msr_info->host_initiated && apic->lapic_timer.tscdeadline == data)
> +		return;
> +
>  	hrtimer_cancel(&apic->lapic_timer.timer);
>  	/* Inject here so clearing tscdeadline won't override new value */
>  	if (apic_has_pending_timer(vcpu))
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 6a11845..5bfa3db 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -71,7 +71,8 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
>  int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
>  
>  u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
> -void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
> +void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu,
> +				struct msr_data *msr_info);
>  
>  void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
>  void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e903167..8c2745c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2093,7 +2093,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff:
>  		return kvm_x2apic_msr_write(vcpu, msr, data);
>  	case MSR_IA32_TSCDEADLINE:
> -		kvm_set_lapic_tscdeadline_msr(vcpu, data);
> +		kvm_set_lapic_tscdeadline_msr(vcpu, msr_info);
>  		break;
>  	case MSR_IA32_TSC_ADJUST:
>  		if (guest_cpuid_has_tsc_adjust(vcpu)) {
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] KVM: x86: Decoding guest instructions which cross page boundary may fail
  2014-10-06 20:50   ` Radim Krčmář
@ 2014-10-07  9:15     ` Nadav Amit
  2014-10-08  9:02       ` Paolo Bonzini
  0 siblings, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2014-10-07  9:15 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: Nadav Amit, Paolo Bonzini, joro, kvm


On Oct 6, 2014, at 11:50 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:

> 2014-10-03 01:10+0300, Nadav Amit:
>> Once an instruction crosses a page boundary, the size read from the second page
>> disregards the common case that part of the operand resides on the first page.
>> As a result, fetch of long insturctions may fail, and thereby cause the
>> decoding to fail as well.
>> 
>> Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
>> ---
> 
> Good catch, was it thanks to an exhaustive test-suite?
> 
> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>

It was catcher in a test-environment. However, I keep wondering how it did not happen in real guest OS.
I think it is due to pure luck, so I recommend to put it in -stable.

Nadav

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-06 20:57   ` Radim Krčmář
@ 2014-10-07  9:35     ` Nadav Amit
  2014-10-08 10:06       ` Radim Krčmář
  0 siblings, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2014-10-07  9:35 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: Nadav Amit, pbonzini, joro, kvm

Thanks for reviewing this patch and the rest of the gang.

On Oct 6, 2014, at 11:57 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:

> 2014-10-03 01:10+0300, Nadav Amit:
>> Setting the TSC deadline MSR that are initiated by the host (using ioctl's) may
>> cause superfluous interrupt.  This occurs in the following case:
>> 
>> 1. A TSC deadline timer interrupt is pending.
>> 2. TSC deadline was still not cleared (which happens during vcpu_run).
>> 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.
>> 
>> To solve this situation, ignore host initiated TSC deadline writes that do not
>> change the deadline value.
> 
> I find this change slightly dubious …
Why? I see similar handling of MSR_TSC_ADJUST.

> - why does the userspace do that?
It seems qemu’s kvm_cpu_exec does so when it calls kvm_arch_put_registers.
It is pretty much done after every exit to userspace.

> - why is host_initiated required?
Since if the guest writes to the MSR, it means it wants to rearm the TSC deadline. Even if the deadline passed, interrupt should be triggered.
If the guest writes the same value on the deadline MSR twice, it might expect two interrupts.


> 
> Thanks.
> 
> It seems like an performance improvement, so why shouldn't return when
>  'data <= tscdeadline && pending()'
> or even
>  'data <= now() && pending()'
> 
> (Sorry, I ran out of time to search for answers today.)
The bug I encountered is not a performance issue, but a superfluous interrupt (functional bug).
As I said, the guest may write a new deadline MSR value which is in the past and expect an interrupt.

Nadav

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] KVM: x86: Decoding guest instructions which cross page boundary may fail
  2014-10-07  9:15     ` Nadav Amit
@ 2014-10-08  9:02       ` Paolo Bonzini
  0 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2014-10-08  9:02 UTC (permalink / raw)
  To: Nadav Amit, Radim Krčmář; +Cc: Nadav Amit, joro, kvm

Il 07/10/2014 11:15, Nadav Amit ha scritto:
> 
> On Oct 6, 2014, at 11:50 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> 
>> 2014-10-03 01:10+0300, Nadav Amit:
>>> Once an instruction crosses a page boundary, the size read from the second page
>>> disregards the common case that part of the operand resides on the first page.
>>> As a result, fetch of long insturctions may fail, and thereby cause the
>>> decoding to fail as well.
>>>
>>> Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
>>> ---
>>
>> Good catch, was it thanks to an exhaustive test-suite?
>>
>> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
> 
> It was catcher in a test-environment. However, I keep wondering how it did not happen in real guest OS.
> I think it is due to pure luck, so I recommend to put it in -stable.

The shorter the immediate, the more you need an unlucky alignment for
this to happen.  For example, say you have 10 byte instruction with 2
opcode bytes and one qword immediate.

1) Instruction at 0x1ffd, __do_insn_fetch_bytes is requested 8 bytes
instead of 7, but it accepts up to 15 - 3 = 12 bytes and everything works.

2) Instruction at 0x1ff7, __do_insn_fetch_bytes is requested 8 bytes
instead of 1.  It accepts up to 15 - 9 = 6 bytes and fails.

Most emulated instructions have a 4-byte immediate or no immediate at all.

Fixes: 5cfc7e0f5e5e1adf998df94f8e36edaf5d30d38e

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-02 22:10 ` [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes Nadav Amit
  2014-10-06 20:57   ` Radim Krčmář
@ 2014-10-08  9:29   ` Paolo Bonzini
  1 sibling, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2014-10-08  9:29 UTC (permalink / raw)
  To: Nadav Amit, joro; +Cc: kvm

Il 03/10/2014 00:10, Nadav Amit ha scritto:
> To solve this situation, ignore host initiated TSC deadline writes that do not
> change the deadline value.
> 
> Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
> ---
>  arch/x86/kvm/lapic.c | 7 ++++++-
>  arch/x86/kvm/lapic.h | 3 ++-
>  arch/x86/kvm/x86.c   | 2 +-
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b8345dd..0bcf2e1 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1346,14 +1346,19 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu)
>  	return apic->lapic_timer.tscdeadline;
>  }
>  
> -void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
> +void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu,
> +				   struct msr_data *msr_info)
>  {
>  	struct kvm_lapic *apic = vcpu->arch.apic;
> +	u64 data = msr_info->data;
>  
>  	if (!kvm_vcpu_has_lapic(vcpu) || apic_lvtt_oneshot(apic) ||
>  			apic_lvtt_period(apic))
>  		return;
>  
> +	if (msr_info->host_initiated && apic->lapic_timer.tscdeadline == data)
> +		return;
> +

Why do we have to check host_initiated?

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-07  9:35     ` Nadav Amit
@ 2014-10-08 10:06       ` Radim Krčmář
  2014-10-08 10:07         ` Paolo Bonzini
                           ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Radim Krčmář @ 2014-10-08 10:06 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Nadav Amit, pbonzini, joro, kvm

2014-10-07 12:35+0300, Nadav Amit:
> Thanks for reviewing this patch and the rest of the gang.

Happy to do so, I've learned a lot.

> On Oct 6, 2014, at 11:57 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> > 2014-10-03 01:10+0300, Nadav Amit:
> >> Setting the TSC deadline MSR that are initiated by the host (using ioctl's) may
> >> cause superfluous interrupt.  This occurs in the following case:
> >> 
> >> 1. A TSC deadline timer interrupt is pending.
> >> 2. TSC deadline was still not cleared (which happens during vcpu_run).
> >> 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.
> >> 
> >> To solve this situation, ignore host initiated TSC deadline writes that do not
> >> change the deadline value.
> > 
> > I find this change slightly dubious …
> Why? I see similar handling of MSR_TSC_ADJUST.

In other modes, we don't inject pending timer when writing to
APIC_TMICT.  (Which, sadly, is inconsistent with APIC_LVTT.)

Adding a workaround is usually worse than removing the reason ...

> > - why does the userspace do that?
> It seems qemu’s kvm_cpu_exec does so when it calls kvm_arch_put_registers.
> It is pretty much done after every exit to userspace.

Thanks, it really doesn't do much checking of what is needed.

> > - why is host_initiated required?
> Since if the guest writes to the MSR, it means it wants to rearm the TSC deadline. Even if the deadline passed, interrupt should be triggered.

MSR isn't 0, so the deadline hasn't passed for the guest yet.

> If the guest writes the same value on the deadline MSR twice, it might expect two interrupts.

When guest writes to it without getting an interrupt first, it might
expect just one.  (Which it better IMO.)

> > 
> > Thanks.
> > 
> > It seems like an performance improvement, so why shouldn't return when
> >  'data <= tscdeadline && pending()'
> > or even
> >  'data <= now() && pending()'
> > 
> > (Sorry, I ran out of time to search for answers today.)
> The bug I encountered is not a performance issue, but a superfluous interrupt (functional bug).

True.

> As I said, the guest may write a new deadline MSR value which is in the past and expect an interrupt.

And it would get one from the currently pending timer.

What about the following patch?
(The introduced else branch could use some abstractions.)

--8<---
KVM: x86: fix deadline tsc interrupt injection

The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
situation where we lose a pending deadline timer in a MSR write.
Losing it is fine, because it effectively occurs before the timer fired,
so we should be able to cancel or postpone it.

Another problem comes from interaction with QEMU, or other userspace
that can set deadline MSR without a good reason, when timer is already
pending:  one guest's deadline request results in more than one
interrupt because one is injected immediately on MSR write from
userspace and one through hrtimer later.

The solution is to remove the injection when replacing a pending timer
and to improve the usual QEMU path, we inject without a hrtimer when the
deadline has already passed.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reported-by: Nadav Amit <namit@cs.technion.ac.il>
---
 arch/x86/kvm/lapic.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b8345dd..51428dd 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
 		if (likely(tscdeadline > guest_tsc)) {
 			ns = (tscdeadline - guest_tsc) * 1000000ULL;
 			do_div(ns, this_tsc_khz);
+			hrtimer_start(&apic->lapic_timer.timer,
+				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+		} else {
+			atomic_inc(&ktimer->pending);
+			kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
 		}
-		hrtimer_start(&apic->lapic_timer.timer,
-			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
 
 		local_irq_restore(flags);
 	}
@@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
 		return;
 
 	hrtimer_cancel(&apic->lapic_timer.timer);
-	/* Inject here so clearing tscdeadline won't override new value */
-	if (apic_has_pending_timer(vcpu))
-		kvm_inject_apic_timer_irqs(vcpu);
 	apic->lapic_timer.tscdeadline = data;
 	start_apic_timer(apic);
 }

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-08 10:06       ` Radim Krčmář
@ 2014-10-08 10:07         ` Paolo Bonzini
  2014-10-10  1:55         ` Nadav Amit
  2014-10-10 14:02         ` Paolo Bonzini
  2 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2014-10-08 10:07 UTC (permalink / raw)
  To: Radim Krčmář, Nadav Amit; +Cc: Nadav Amit, joro, kvm

Il 08/10/2014 12:06, Radim Krčmář ha scritto:
> > > - why is host_initiated required?
> 
> > Since if the guest writes to the MSR, it means it wants to rearm the TSC deadline. Even if the deadline passed, interrupt should be triggered.
> 
> MSR isn't 0, so the deadline hasn't passed for the guest yet.
> 
> > If the guest writes the same value on the deadline MSR twice, it might expect two interrupts.
> 
> When guest writes to it without getting an interrupt first, it might
> expect just one.  (Which it better IMO.)

Indeed that was my doubt as well.

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-08 10:06       ` Radim Krčmář
  2014-10-08 10:07         ` Paolo Bonzini
@ 2014-10-10  1:55         ` Nadav Amit
  2014-10-10  9:45           ` Paolo Bonzini
  2014-10-10 14:02         ` Paolo Bonzini
  2 siblings, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2014-10-10  1:55 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: Nadav Amit, pbonzini, joro, kvm


On Oct 8, 2014, at 1:06 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:

> 
> And it would get one from the currently pending timer.
> 
> What about the following patch?
> (The introduced else branch could use some abstractions.)
> 
> --8<---
> KVM: x86: fix deadline tsc interrupt injection
> 
> The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
> situation where we lose a pending deadline timer in a MSR write.
> Losing it is fine, because it effectively occurs before the timer fired,
> so we should be able to cancel or postpone it.
> 
> Another problem comes from interaction with QEMU, or other userspace
> that can set deadline MSR without a good reason, when timer is already
> pending:  one guest's deadline request results in more than one
> interrupt because one is injected immediately on MSR write from
> userspace and one through hrtimer later.
> 
> The solution is to remove the injection when replacing a pending timer
> and to improve the usual QEMU path, we inject without a hrtimer when the
> deadline has already passed.
> 
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> Reported-by: Nadav Amit <namit@cs.technion.ac.il>
> ---
> arch/x86/kvm/lapic.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b8345dd..51428dd 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
>  		if (likely(tscdeadline > guest_tsc)) {
>  			ns = (tscdeadline - guest_tsc) * 1000000ULL;
>  			do_div(ns, this_tsc_khz);
> +			hrtimer_start(&apic->lapic_timer.timer,
> +				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> +		} else {
> +			atomic_inc(&ktimer->pending);
> +			kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>  		}
> -		hrtimer_start(&apic->lapic_timer.timer,
> -			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> 
>  		local_irq_restore(flags);
>  	}
> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
>  		return;
> 
>  	hrtimer_cancel(&apic->lapic_timer.timer);
> -	/* Inject here so clearing tscdeadline won't override new value */
> -	if (apic_has_pending_timer(vcpu))
> -		kvm_inject_apic_timer_irqs(vcpu);
>  	apic->lapic_timer.tscdeadline = data;
>  	start_apic_timer(apic);
> }

Perhaps I am missing something, but I don’t see how it solves the problem I encountered.
Recall the scenario:
1. A TSC deadline timer interrupt is pending.
2. TSC deadline was still not cleared (which happens during vcpu_run).
3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.

It appears that in such scenario the guest would still get spurious interrupt for no reason, as ktimer->pending may already be increased in apic_timer_fn.

Second, I think that the solution I proposed would perform better. Currently, there are many unnecessary cancellations and setups of the timer. This solution does not resolve this problem.

Last, I think that having less interrupts on deadline changes is not completely according to the SDM which says: "If software disarms the timer or postpones the deadline, race conditions may result in the delivery of a spurious timer interrupt.” It never says interrupts may be lost if you reprogram the deadline before you check it expired.

Thanks,
Nadav


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access
  2014-10-06 20:32   ` Radim Krčmář
@ 2014-10-10  2:07     ` Nadav Amit
  2014-10-10 15:54       ` Radim Krčmář
  0 siblings, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2014-10-10  2:07 UTC (permalink / raw)
  To: pbonzini, rkrcmar; +Cc: kvm, Nadav Amit

When read access is performed using a readable code segment, the "conforming"
and "non-conforming" checks should not be done.  As a result, read using
non-conforming readable code segment fails.

This is according to Intel SDM 5.6.1 ("Accessing Data in Code Segments").

One exception is the case of conforming code segment. The SDM says: "Use a
code-segment override prefix (CS) to read a readable...  [it is] valid because
the DPL of the code segment selected by the CS register is the same as the
CPL." This is misleading since CS.DPL may be lower (numerically) than CPL, and
CS would still be accessible.  The emulator should avoid privilage level checks
for data reads using CS.

The fix is not to perform the "non-conforming" checks if the access is not a
fetch, and never to perform the checks for CS.

---
v1->v2: Privilage level checks are always skipped for CS

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
---
 arch/x86/kvm/emulate.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a46207a..0fee0a0 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -661,9 +661,9 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
 				goto bad;
 		}
 		cpl = ctxt->ops->cpl(ctxt);
-		if (!(desc.type & 8)) {
-			/* data segment */
-			if (cpl > desc.dpl)
+		if (!fetch) {
+			/* data segment or readable code segment */
+			if (cpl > desc.dpl && addr.seg != VCPU_SREG_CS)
 				goto bad;
 		} else if ((desc.type & 8) && !(desc.type & 4)) {
 			/* nonconforming code segment */
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-10  1:55         ` Nadav Amit
@ 2014-10-10  9:45           ` Paolo Bonzini
  2014-10-10 12:50             ` Radim Krčmář
  2014-10-10 12:51             ` Nadav Amit
  0 siblings, 2 replies; 32+ messages in thread
From: Paolo Bonzini @ 2014-10-10  9:45 UTC (permalink / raw)
  To: Nadav Amit, Radim Krčmář; +Cc: Nadav Amit, joro, kvm

Il 10/10/2014 03:55, Nadav Amit ha scritto:
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index b8345dd..51428dd 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
>>  		if (likely(tscdeadline > guest_tsc)) {
>>  			ns = (tscdeadline - guest_tsc) * 1000000ULL;
>>  			do_div(ns, this_tsc_khz);
>> +			hrtimer_start(&apic->lapic_timer.timer,
>> +				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>> +		} else {
>> +			atomic_inc(&ktimer->pending);
>> +			kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>>  		}
>> -		hrtimer_start(&apic->lapic_timer.timer,
>> -			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>>
>>  		local_irq_restore(flags);
>>  	}
>> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
>>  		return;
>>
>>  	hrtimer_cancel(&apic->lapic_timer.timer);
>> -	/* Inject here so clearing tscdeadline won't override new value */
>> -	if (apic_has_pending_timer(vcpu))
>> -		kvm_inject_apic_timer_irqs(vcpu);
>>  	apic->lapic_timer.tscdeadline = data;
>>  	start_apic_timer(apic);
>> }
> 
> Perhaps I am missing something, but I don’t see how it solves the problem I encountered.
> Recall the scenario:
> 1. A TSC deadline timer interrupt is pending.
> 2. TSC deadline was still not cleared (which happens during vcpu_run).
> 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.
> 
> It appears that in such scenario the guest would still get spurious
> interrupt for no reason, as ktimer->pending may already be increased in
> apic_timer_fn.

If ktimer->pending == 2 you still get only one interrupt, not two.  Am I
misunderstanding your objection?

> Second, I think that the solution I proposed would perform better.
> Currently, there are many unnecessary cancellations and setups of the
> timer. This solution does not resolve this problem.

I think it does.  You do not get an hrtimer_start if tscdeadline <=
guest_tsc.  To avoid useless cancels, either check hrtimer_is_enqueued
before calling hrtimer_cancel, or go straight to the source and avoid
taking the lock in the easy cases:

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 1c2fe7de2842..6ce725007424 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
 {
 	struct hrtimer_clock_base *base;
 	unsigned long flags;
-	int ret = -1;
+	unsigned long state = timer->state;
+	int ret;
+
+	if (state & HRTIMER_STATE_ENQUEUED)
+		return 0;
+	if (state & HRTIMER_STATE_CALLBACK)
+		return -1;

 	base = lock_hrtimer_base(timer, &flags);

+	ret = -1;
 	if (!hrtimer_callback_running(timer))
 		ret = remove_hrtimer(timer, base);


> Last, I think that having less interrupts on deadline changes is not
> completely according to the SDM which says: "If software disarms the
> timer or postpones the deadline, race conditions may result in the
> delivery of a spurious timer interrupt.” It never says interrupts may
> be lost if you reprogram the deadline before you check it expired.

But the case when you rewrite the same value to the MSR is neither
disarming nor postponing.  You would be getting two interrupts for the
same event.  That is why I agree with Radim that checking host_initiated
is wrong.

Paolo

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-10  9:45           ` Paolo Bonzini
@ 2014-10-10 12:50             ` Radim Krčmář
  2014-10-10 12:51             ` Nadav Amit
  1 sibling, 0 replies; 32+ messages in thread
From: Radim Krčmář @ 2014-10-10 12:50 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Nadav Amit, Nadav Amit, joro, kvm

2014-10-10 11:45+0200, Paolo Bonzini:
> Il 10/10/2014 03:55, Nadav Amit ha scritto:
> >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >> index b8345dd..51428dd 100644
> >> --- a/arch/x86/kvm/lapic.c
> >> +++ b/arch/x86/kvm/lapic.c
> >> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
> >>  		if (likely(tscdeadline > guest_tsc)) {
> >>  			ns = (tscdeadline - guest_tsc) * 1000000ULL;
> >>  			do_div(ns, this_tsc_khz);
> >> +			hrtimer_start(&apic->lapic_timer.timer,
> >> +				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> >> +		} else {
> >> +			atomic_inc(&ktimer->pending);
> >> +			kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
> >>  		}
> >> -		hrtimer_start(&apic->lapic_timer.timer,
> >> -			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> >>
> >>  		local_irq_restore(flags);
> >>  	}
> >> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
> >>  		return;
> >>
> >>  	hrtimer_cancel(&apic->lapic_timer.timer);
> >> -	/* Inject here so clearing tscdeadline won't override new value */
> >> -	if (apic_has_pending_timer(vcpu))
> >> -		kvm_inject_apic_timer_irqs(vcpu);
> >>  	apic->lapic_timer.tscdeadline = data;
> >>  	start_apic_timer(apic);
> >> }
> > 
> > Perhaps I am missing something, but I don’t see how it solves the problem I encountered.
> > Recall the scenario:
> > 1. A TSC deadline timer interrupt is pending.
> > 2. TSC deadline was still not cleared (which happens during vcpu_run).
> > 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.
> > 
> > It appears that in such scenario the guest would still get spurious
> > interrupt for no reason, as ktimer->pending may already be increased in
> > apic_timer_fn.
> 
> If ktimer->pending == 2 you still get only one interrupt, not two.  Am I
> misunderstanding your objection?

(I don't see it either, the patch removed kvm_inject_apic_timer_irqs(),
 so we inject as if the MSR write didn't happen.)

> > Second, I think that the solution I proposed would perform better.
> > Currently, there are many unnecessary cancellations and setups of the
> > timer. This solution does not resolve this problem.
> 
> I think it does.  You do not get an hrtimer_start if tscdeadline <=
> guest_tsc.  To avoid useless cancels, either check hrtimer_is_enqueued
> before calling hrtimer_cancel, or go straight to the source and avoid
> taking the lock in the easy cases:

I think the useless cancels were when the timer is set to the same value
in the future.

Which makes sense to optimize, but I wasn't sure how it would affect the
guest if the TSC offset was be changing or TSC itself wasn't stable ...
so I chose not to do it.

> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 1c2fe7de2842..6ce725007424 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
>  {
>  	struct hrtimer_clock_base *base;
>  	unsigned long flags;
> -	int ret = -1;
> +	unsigned long state = timer->state;
> +	int ret;
> +
> +	if (state & HRTIMER_STATE_ENQUEUED)
> +		return 0;

It doesn't try very hard to cancel it ;)

> +	if (state & HRTIMER_STATE_CALLBACK)
> +		return -1;
> 
>  	base = lock_hrtimer_base(timer, &flags);
> 
> +	ret = -1;
>  	if (!hrtimer_callback_running(timer))
>  		ret = remove_hrtimer(timer, base);
> 
> > Last, I think that having less interrupts on deadline changes is not
> > completely according to the SDM which says: "If software disarms the
> > timer or postpones the deadline, race conditions may result in the
> > delivery of a spurious timer interrupt.” It never says interrupts may
> > be lost if you reprogram the deadline before you check it expired.

Yeah, too bad it wasn't written in a formally defined language ...
They could be just reserving design space for concurrent implementation,
not making race conditions mandatory.

Our race windows are much bigger -- a disarm that is hundreds of cycles
in the future can easily be hit after a msr write vm-exit, but it would
be very unlikely to get an interrupt on bare metal.

And thinking about the meaning of postpone or disarm -- software doesn't
want the old interrupt anymore, so it would just add useless work.

(And I liked the code better without it, but it'd be ok if we fixed all
 the cases.)

> But the case when you rewrite the same value to the MSR is neither
> disarming nor postponing.  You would be getting two interrupts for the
> same event.  That is why I agree with Radim that checking host_initiated
> is wrong.

Current implementation also injects two interrupts when the timer is set
to a lower nonzero value, which should give us just one under both
interpretations.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-10  9:45           ` Paolo Bonzini
  2014-10-10 12:50             ` Radim Krčmář
@ 2014-10-10 12:51             ` Nadav Amit
  2014-10-10 13:55               ` Paolo Bonzini
  1 sibling, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2014-10-10 12:51 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Radim Krčmář, Nadav Amit, joro, kvm


On Oct 10, 2014, at 12:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:

> Il 10/10/2014 03:55, Nadav Amit ha scritto:
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index b8345dd..51428dd 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
>>> 		if (likely(tscdeadline > guest_tsc)) {
>>> 			ns = (tscdeadline - guest_tsc) * 1000000ULL;
>>> 			do_div(ns, this_tsc_khz);
>>> +			hrtimer_start(&apic->lapic_timer.timer,
>>> +				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>>> +		} else {
>>> +			atomic_inc(&ktimer->pending);
>>> +			kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>>> 		}
>>> -		hrtimer_start(&apic->lapic_timer.timer,
>>> -			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>>> 
>>> 		local_irq_restore(flags);
>>> 	}
>>> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
>>> 		return;
>>> 
>>> 	hrtimer_cancel(&apic->lapic_timer.timer);
>>> -	/* Inject here so clearing tscdeadline won't override new value */
>>> -	if (apic_has_pending_timer(vcpu))
>>> -		kvm_inject_apic_timer_irqs(vcpu);
>>> 	apic->lapic_timer.tscdeadline = data;
>>> 	start_apic_timer(apic);
>>> }
>> 
>> Perhaps I am missing something, but I don’t see how it solves the problem I encountered.
>> Recall the scenario:
>> 1. A TSC deadline timer interrupt is pending.
>> 2. TSC deadline was still not cleared (which happens during vcpu_run).
>> 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.
>> 
>> It appears that in such scenario the guest would still get spurious
>> interrupt for no reason, as ktimer->pending may already be increased in
>> apic_timer_fn.
> 
> If ktimer->pending == 2 you still get only one interrupt, not two.  Am I
> misunderstanding your objection?
You understood me, and I was wrong...

> 
>> Second, I think that the solution I proposed would perform better.
>> Currently, there are many unnecessary cancellations and setups of the
>> timer. This solution does not resolve this problem.
> 
> I think it does.  You do not get an hrtimer_start if tscdeadline <=
> guest_tsc.  To avoid useless cancels, either check hrtimer_is_enqueued
> before calling hrtimer_cancel, or go straight to the source and avoid
> taking the lock in the easy cases:
> 
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 1c2fe7de2842..6ce725007424 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
> {
> 	struct hrtimer_clock_base *base;
> 	unsigned long flags;
> -	int ret = -1;
> +	unsigned long state = timer->state;
> +	int ret;
> +
> +	if (state & HRTIMER_STATE_ENQUEUED)
> +		return 0;
> +	if (state & HRTIMER_STATE_CALLBACK)
> +		return -1;
> 
> 	base = lock_hrtimer_base(timer, &flags);
> 
> +	ret = -1;
> 	if (!hrtimer_callback_running(timer))
> 		ret = remove_hrtimer(timer, base);
Wouldn’t this change would cause cancellations never to succeed (the first check would always be true if the timer is active)?

> 
> 
>> Last, I think that having less interrupts on deadline changes is not
>> completely according to the SDM which says: "If software disarms the
>> timer or postpones the deadline, race conditions may result in the
>> delivery of a spurious timer interrupt.” It never says interrupts may
>> be lost if you reprogram the deadline before you check it expired.
> 
> But the case when you rewrite the same value to the MSR is neither
> disarming nor postponing.  You would be getting two interrupts for the
> same event.  That is why I agree with Radim that checking host_initiated
> is wrong.
I understand, and Radim's solution seems functionally fine, now that I am fully awake and understand it.
I still think that if tscdeadline > guest_tsc, then reprogramming the deadline with the same value, as QEMU does, would result in unwarranted overhead.
Perhaps it would be enough not to reprogram the timer if tscdeadline value does not change (by either guest or host).

Thanks,
Nadav


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-10 12:51             ` Nadav Amit
@ 2014-10-10 13:55               ` Paolo Bonzini
  0 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2014-10-10 13:55 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Radim Krčmář, Nadav Amit, joro, kvm

Il 10/10/2014 14:51, Nadav Amit ha scritto:
>>> Second, I think that the solution I proposed would perform better.
>>> Currently, there are many unnecessary cancellations and setups of the
>>> timer. This solution does not resolve this problem.
>>
>> I think it does.  You do not get an hrtimer_start if tscdeadline <=
>> guest_tsc.  To avoid useless cancels, either check hrtimer_is_enqueued
>> before calling hrtimer_cancel, or go straight to the source and avoid
>> taking the lock in the easy cases:
>>
>> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
>> index 1c2fe7de2842..6ce725007424 100644
>> --- a/kernel/time/hrtimer.c
>> +++ b/kernel/time/hrtimer.c
>> @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
>> {
>> 	struct hrtimer_clock_base *base;
>> 	unsigned long flags;
>> -	int ret = -1;
>> +	unsigned long state = timer->state;
>> +	int ret;
>> +
>> +	if (state & HRTIMER_STATE_ENQUEUED)
>> +		return 0;
>> +	if (state & HRTIMER_STATE_CALLBACK)
>> +		return -1;
>>
>> 	base = lock_hrtimer_base(timer, &flags);
>>
>> +	ret = -1;
>> 	if (!hrtimer_callback_running(timer))
>> 		ret = remove_hrtimer(timer, base);
> Wouldn’t this change would cause cancellations never to succeed (the first check would always be true if the timer is active)?

Ehm, there is a missing ! in that first "if".

>>> Last, I think that having less interrupts on deadline changes is not
>>> completely according to the SDM which says: "If software disarms the
>>> timer or postpones the deadline, race conditions may result in the
>>> delivery of a spurious timer interrupt.” It never says interrupts may
>>> be lost if you reprogram the deadline before you check it expired.
>>
>> But the case when you rewrite the same value to the MSR is neither
>> disarming nor postponing.  You would be getting two interrupts for the
>> same event.  That is why I agree with Radim that checking host_initiated
>> is wrong.
> 
> I understand, and Radim's solution seems functionally fine, now that I am fully awake and understand it.
> I still think that if tscdeadline > guest_tsc, then reprogramming the deadline with the same value, as QEMU does, would result in unwarranted overhead.

The overhead is about two atomic operations (70 clock cycles?).  Still,
there are plenty of other micro-optimizations possible:

1) instead of incrementing timer->pending, set it to 1

2) change it to test_and_set_bit and only set PENDING_TIMER if the
result was zero

3) non-atomically test PENDING_TIMER before (atomically) clearing it

4) return bool from kvm_inject_apic_timer_irqs and only clear
PENDING_TIMER if a timer interrupt was actually injected.

(1) or (2) would remove one atomic operation when reprogramming a passed
deadline with the same value.  (3) or (4) would remove one atomic
operation in the case where the cause of the exit is not an expired
timer.  Any takers?

> Perhaps it would be enough not to reprogram the timer if tscdeadline value does not change (by either guest or host).

Yes, and that would just be your patch without host_initiated.

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
  2014-10-08 10:06       ` Radim Krčmář
  2014-10-08 10:07         ` Paolo Bonzini
  2014-10-10  1:55         ` Nadav Amit
@ 2014-10-10 14:02         ` Paolo Bonzini
  2 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2014-10-10 14:02 UTC (permalink / raw)
  To: Radim Krčmář, Nadav Amit; +Cc: Nadav Amit, joro, kvm

Il 08/10/2014 12:06, Radim Krčmář ha scritto:
> KVM: x86: fix deadline tsc interrupt injection
> 
> The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
> situation where we lose a pending deadline timer in a MSR write.
> Losing it is fine, because it effectively occurs before the timer fired,
> so we should be able to cancel or postpone it.
> 
> Another problem comes from interaction with QEMU, or other userspace
> that can set deadline MSR without a good reason, when timer is already
> pending:  one guest's deadline request results in more than one
> interrupt because one is injected immediately on MSR write from
> userspace and one through hrtimer later.
> 
> The solution is to remove the injection when replacing a pending timer
> and to improve the usual QEMU path, we inject without a hrtimer when the
> deadline has already passed.
> 
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> Reported-by: Nadav Amit <namit@cs.technion.ac.il>
> ---
>  arch/x86/kvm/lapic.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b8345dd..51428dd 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
>  		if (likely(tscdeadline > guest_tsc)) {
>  			ns = (tscdeadline - guest_tsc) * 1000000ULL;
>  			do_div(ns, this_tsc_khz);
> +			hrtimer_start(&apic->lapic_timer.timer,
> +				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> +		} else {
> +			atomic_inc(&ktimer->pending);
> +			kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>  		}
> -		hrtimer_start(&apic->lapic_timer.timer,
> -			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>  
>  		local_irq_restore(flags);
>  	}
> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
>  		return;
>  
>  	hrtimer_cancel(&apic->lapic_timer.timer);
> -	/* Inject here so clearing tscdeadline won't override new value */
> -	if (apic_has_pending_timer(vcpu))
> -		kvm_inject_apic_timer_irqs(vcpu);
>  	apic->lapic_timer.tscdeadline = data;
>  	start_apic_timer(apic);
>  }

Radim, the patch looks good but please extract this code:

        /*
         * There is a race window between reading and incrementing, but we do
         * not care about potentially losing timer events in the !reinject
         * case anyway. Note: KVM_REQ_PENDING_TIMER is implicitly checked
         * in vcpu_enter_guest.
         */
        if (!atomic_read(&ktimer->pending)) {
                atomic_inc(&ktimer->pending);
                /* FIXME: this code should not know anything about vcpus */
                kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
        }

        if (waitqueue_active(q))
                wake_up_interruptible(q);

to a new "static void apic_timer_expired(struct kvm_lapic *apic)" function,
and call it from both apic_timer_fn and start_apic_timer.

Also, we should not need to do wake_up_interruptible() unless we have
changed ktimer->pending from zero to non-zero.

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access
  2014-10-10  2:07     ` [PATCH v2 " Nadav Amit
@ 2014-10-10 15:54       ` Radim Krčmář
  2014-10-11  9:39         ` Paolo Bonzini
  0 siblings, 1 reply; 32+ messages in thread
From: Radim Krčmář @ 2014-10-10 15:54 UTC (permalink / raw)
  To: Nadav Amit; +Cc: pbonzini, kvm

2014-10-10 05:07+0300, Nadav Amit:
> When read access is performed using a readable code segment, the "conforming"
> and "non-conforming" checks should not be done.  As a result, read using
> non-conforming readable code segment fails.
> 
> This is according to Intel SDM 5.6.1 ("Accessing Data in Code Segments").
> 
> One exception is the case of conforming code segment. The SDM says: "Use a
> code-segment override prefix (CS) to read a readable...  [it is] valid because
> the DPL of the code segment selected by the CS register is the same as the
> CPL." This is misleading since CS.DPL may be lower (numerically) than CPL, and
> CS would still be accessible.  The emulator should avoid privilage level checks
> for data reads using CS.

Ah, after stripping faulty presumptions, I'm not sure this change is
enough ... shouldn't we also skip the check on conforming code segments?

 Method 2 is always valid because the privilege level of a conforming
 code segment is effectively the same as the CPL, regardless of its DPL.

And we couldn't read it from less privileged modes.

> The fix is not to perform the "non-conforming" checks if the access is not a
> fetch, and never to perform the checks for CS.
> 
> ---
> v1->v2: Privilage level checks are always skipped for CS
> 
> Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
> ---
>  arch/x86/kvm/emulate.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index a46207a..0fee0a0 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -661,9 +661,9 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
>  				goto bad;
>  		}
>  		cpl = ctxt->ops->cpl(ctxt);
> -		if (!(desc.type & 8)) {
> -			/* data segment */
> -			if (cpl > desc.dpl)
> +		if (!fetch) {
> +			/* data segment or readable code segment */
> +			if (cpl > desc.dpl && addr.seg != VCPU_SREG_CS)
>  				goto bad;
>  		} else if ((desc.type & 8) && !(desc.type & 4)) {
>  			/* nonconforming code segment */
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access
  2014-10-10 15:54       ` Radim Krčmář
@ 2014-10-11  9:39         ` Paolo Bonzini
  2014-10-12  6:57           ` Nadav Amit
  0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2014-10-11  9:39 UTC (permalink / raw)
  To: Radim Krčmář, Nadav Amit; +Cc: kvm

Il 10/10/2014 17:54, Radim Krčmář ha scritto:
>> > 
>> > One exception is the case of conforming code segment. The SDM says: "Use a
>> > code-segment override prefix (CS) to read a readable...  [it is] valid because
>> > the DPL of the code segment selected by the CS register is the same as the
>> > CPL." This is misleading since CS.DPL may be lower (numerically) than CPL, and
>> > CS would still be accessible.  The emulator should avoid privilage level checks
>> > for data reads using CS.
> Ah, after stripping faulty presumptions, I'm not sure this change is
> enough ... shouldn't we also skip the check on conforming code segments?
> 
>  Method 2 is always valid because the privilege level of a conforming
>  code segment is effectively the same as the CPL, regardless of its DPL.

Radim is right; we need to skip the check on conforming code segments 
and, once we do that, checking addr.seg is not necessary anymore.  That 
is because, for a CS override on a nonconforming code segment, at the 
time we fetch the instruction we know that cpl == desc.dpl.  The less 
restrictive data segment check (cpl <= desc.dpl) thus always passes.

Let's put together this check and the readability check, too, since
we are adding another "if (fetch)".

Can you guys think of a way to simplify the following untested patch?

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 03954f7900f5..9f3e33551db9 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -638,9 +638,6 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
 		if ((((ctxt->mode != X86EMUL_MODE_REAL) && (desc.type & 8))
 					|| !(desc.type & 2)) && write)
 			goto bad;
-		/* unreadable code segment */
-		if (!fetch && (desc.type & 8) && !(desc.type & 2))
-			goto bad;
 		lim = desc_limit_scaled(&desc);
 		if ((ctxt->mode == X86EMUL_MODE_REAL) && !fetch &&
 		    (ctxt->d & NoBigReal)) {
@@ -660,17 +657,40 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
 				goto bad;
 		}
 		cpl = ctxt->ops->cpl(ctxt);
-		if (!(desc.type & 8)) {
-			/* data segment */
+		if (fetch && (desc.type & 8)) {
+			if (!(desc.type & 4)) {
+				/* nonconforming code segment */
+				if (cpl != desc.dpl)
+					goto bad;
+				break;
+			} else {
+				/* conforming code segment */
+				if (cpl < desc.dpl)
+					goto bad;
+				break;
+			}
+		}
+
+		if (likely(!(desc.type & 8) || (desc.type & 6) == 2)) {
+			/*
+			 * Data segment or readable, nonconforming code
+			 * segment.  The SDM mentions that access through
+			 * a code-segment override prefix is always valid.
+			 * This really only matters for conforming code
+			 * segments (checked below, and always valid anyway):
+			 * for nonconforming ones, cpl == desc.dpl was checked
+			 * when fetching the instruction, meaning the following
+			 * test will always pass too.
+			 */
 			if (cpl > desc.dpl)
 				goto bad;
-		} else if ((desc.type & 8) && !(desc.type & 4)) {
-			/* nonconforming code segment */
-			if (cpl != desc.dpl)
-				goto bad;
-		} else if ((desc.type & 8) && (desc.type & 4)) {
-			/* conforming code segment */
-			if (cpl < desc.dpl)
+		} else {
+			/*
+			 * These are the (rare) cases that do not behave
+			 * like data segments: nonreadable code segments (bad)
+			 * and readable, conforming code segments (good).
+			 */
+			if (!(desc.type & 2))
 				goto bad;
 		}
 		break;


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access
  2014-10-11  9:39         ` Paolo Bonzini
@ 2014-10-12  6:57           ` Nadav Amit
  2014-10-12 12:12             ` Paolo Bonzini
  0 siblings, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2014-10-12  6:57 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Radim Krčmář, Nadav Amit, kvm

[-- Attachment #1: Type: text/plain, Size: 4131 bytes --]

Radim, Paolo, Sorry for the late responses (due to holidays)…

On Oct 11, 2014, at 12:39 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:

> Il 10/10/2014 17:54, Radim Krčmář ha scritto:
>>>> 
>>>> One exception is the case of conforming code segment. The SDM says: "Use a
>>>> code-segment override prefix (CS) to read a readable...  [it is] valid because
>>>> the DPL of the code segment selected by the CS register is the same as the
>>>> CPL." This is misleading since CS.DPL may be lower (numerically) than CPL, and
>>>> CS would still be accessible.  The emulator should avoid privilage level checks
>>>> for data reads using CS.
>> Ah, after stripping faulty presumptions, I'm not sure this change is
>> enough ... shouldn't we also skip the check on conforming code segments?
>> 
>> Method 2 is always valid because the privilege level of a conforming
>> code segment is effectively the same as the CPL, regardless of its DPL.
> 
> Radim is right; we need to skip the check on conforming code segments 
> and, once we do that, checking addr.seg is not necessary anymore.  That 
> is because, for a CS override on a nonconforming code segment, at the 
> time we fetch the instruction we know that cpl == desc.dpl.  The less 
> restrictive data segment check (cpl <= desc.dpl) thus always passes.
Yes. I was wrong, assuming the code-segment checks are just a derivative of the data segment checks.


> 
> Let's put together this check and the readability check, too, since
> we are adding another "if (fetch)".
> 
> Can you guys think of a way to simplify the following untested patch?
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 03954f7900f5..9f3e33551db9 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -638,9 +638,6 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
> 		if ((((ctxt->mode != X86EMUL_MODE_REAL) && (desc.type & 8))
> 					|| !(desc.type & 2)) && write)
> 			goto bad;
> -		/* unreadable code segment */
> -		if (!fetch && (desc.type & 8) && !(desc.type & 2))
> -			goto bad;
> 		lim = desc_limit_scaled(&desc);
> 		if ((ctxt->mode == X86EMUL_MODE_REAL) && !fetch &&
> 		    (ctxt->d & NoBigReal)) {
> @@ -660,17 +657,40 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
> 				goto bad;
> 		}
> 		cpl = ctxt->ops->cpl(ctxt);
> -		if (!(desc.type & 8)) {
> -			/* data segment */
> +		if (fetch && (desc.type & 8)) {
> +			if (!(desc.type & 4)) {
> +				/* nonconforming code segment */
> +				if (cpl != desc.dpl)
> +					goto bad;
> +				break;
> +			} else {
> +				/* conforming code segment */
> +				if (cpl < desc.dpl)
> +					goto bad;
> +				break;
> +			}
> +		}
> +
> +		if (likely(!(desc.type & 8) || (desc.type & 6) == 2)) {
> +			/*
> +			 * Data segment or readable, nonconforming code
> +			 * segment.  The SDM mentions that access through
> +			 * a code-segment override prefix is always valid.
> +			 * This really only matters for conforming code
> +			 * segments (checked below, and always valid anyway):
> +			 * for nonconforming ones, cpl == desc.dpl was checked
> +			 * when fetching the instruction, meaning the following
> +			 * test will always pass too.
> +			 */
> 			if (cpl > desc.dpl)
> 				goto bad;
> -		} else if ((desc.type & 8) && !(desc.type & 4)) {
> -			/* nonconforming code segment */
> -			if (cpl != desc.dpl)
> -				goto bad;
> -		} else if ((desc.type & 8) && (desc.type & 4)) {
> -			/* conforming code segment */
> -			if (cpl < desc.dpl)
> +		} else {
> +			/*
> +			 * These are the (rare) cases that do not behave
> +			 * like data segments: nonreadable code segments (bad)
> +			 * and readable, conforming code segments (good).
> +			 */
> +			if (!(desc.type & 2))
> 				goto bad;
> 		}
> 		break;

Looks good. I’ll give it a try but it is hard to give a definitive answer, since the emulator is still bug-ridden.
Please note I submitted another patch at this area ("Wrong error code on limit violation during emulation”), so both should be merged.

Thanks,
Nadav




[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access
  2014-10-12  6:57           ` Nadav Amit
@ 2014-10-12 12:12             ` Paolo Bonzini
  2014-10-12 23:15               ` Nadav Amit
  0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2014-10-12 12:12 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Radim Krčmář, Nadav Amit, kvm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Il 12/10/2014 08:57, Nadav Amit ha scritto:
> Looks good. I’ll give it a try but it is hard to give a definitive 
> answer, since the emulator is still bug-ridden.

Yes, we need to write unit tests for this, especially the conforming
case.  A bit of a pain to get kvm-unit-tests in ring 3 (access.flat
does it), but I'll give it a shot.

Paolo
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBAgAGBQJUOnBCAAoJEBRUblpOawnXLHEH/3zlIsVFow9IWrZsxaZopWy5
DKncQriHnhsyc6W2U9oFvNgB/7+dTWscdR58jnKLr/Qt64DGH01pq4MvisdhV+31
53pv+CycUgs85EoZkA7MzArbT1Tb/gd/KB8QoqdKIC4+bNEd6JMydcsq5d6nMgOd
yWLYjcYzWzzmJSNCC7UYOtN4SB4brC5RyITLq+CgT4ufSPHtBYxGd8fSHpzvJzTU
T1hqelYcLRFGzoPR4ux4SP8EgXle+sslFV4KAyXkucLIafmeekcmR/AO8hy84TXj
Kcpt6Y3hsY8pWXM3YB4LF/7+NuaPu/Ud+2VkbfuFhq5gUtVLwjtWtA32IlYdFEc=
=BhUU
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access
  2014-10-12 12:12             ` Paolo Bonzini
@ 2014-10-12 23:15               ` Nadav Amit
  2014-10-13  4:29                 ` Paolo Bonzini
  2014-10-13 11:31                 ` Gleb Natapov
  0 siblings, 2 replies; 32+ messages in thread
From: Nadav Amit @ 2014-10-12 23:15 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Radim Krčmář, Nadav Amit, kvm



On 10/12/14 3:12 PM, Paolo Bonzini wrote:
> Il 12/10/2014 08:57, Nadav Amit ha scritto:
>> Looks good. I’ll give it a try but it is hard to give a definitive
>> answer, since the emulator is still bug-ridden.
> 
> Yes, we need to write unit tests for this, especially the conforming
> case.  A bit of a pain to get kvm-unit-tests in ring 3 (access.flat
> does it), but I'll give it a shot.
> 
> Paolo
> 

I think the problem might be even more fundamental.
According to the SDM, the privilege level checks (CPL/DPL/RPL) are only performed when the segment is loaded; I see no reference to privilege checks when data is accessed.
You should be able to load a segment with DPL=0 while you are in CPL=0, then change CPL to 3 and still access the segment (obviously, it is not the best practice).

In that case, all the privilege checks in __linearize are redundant and for some extent incorrect.
Obviously, I am afraid to submit a patch that removes them, since if the privilege checks of __linearize are needed in certain case, this may introduce security problem.

Do you agree?

Nadav

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access
  2014-10-12 23:15               ` Nadav Amit
@ 2014-10-13  4:29                 ` Paolo Bonzini
  2014-10-13 11:31                 ` Gleb Natapov
  1 sibling, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2014-10-13  4:29 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Radim Krčmář, Nadav Amit, kvm

Il 13/10/2014 01:15, Nadav Amit ha scritto:
> I think the problem might be even more fundamental. According to the
> SDM, the privilege level checks (CPL/DPL/RPL) are only performed when
> the segment is loaded; I see no reference to privilege checks when
> data is accessed. You should be able to load a segment with DPL=0
> while you are in CPL=0, then change CPL to 3 and still access the
> segment (obviously, it is not the best practice).

This can be tested without invoking the emulator...

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access
  2014-10-12 23:15               ` Nadav Amit
  2014-10-13  4:29                 ` Paolo Bonzini
@ 2014-10-13 11:31                 ` Gleb Natapov
  2014-10-19 16:07                   ` Nadav Amit
  1 sibling, 1 reply; 32+ messages in thread
From: Gleb Natapov @ 2014-10-13 11:31 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Paolo Bonzini, Radim Krčmář, Nadav Amit, kvm

On Mon, Oct 13, 2014 at 02:15:43AM +0300, Nadav Amit wrote:
> 
> 
> On 10/12/14 3:12 PM, Paolo Bonzini wrote:
> > Il 12/10/2014 08:57, Nadav Amit ha scritto:
> >> Looks good. I’ll give it a try but it is hard to give a definitive
> >> answer, since the emulator is still bug-ridden.
> > 
> > Yes, we need to write unit tests for this, especially the conforming
> > case.  A bit of a pain to get kvm-unit-tests in ring 3 (access.flat
> > does it), but I'll give it a shot.
> > 
> > Paolo
> > 
> 
> I think the problem might be even more fundamental.
> According to the SDM, the privilege level checks (CPL/DPL/RPL) are only performed when the segment is loaded; I see no reference to privilege checks when data is accessed.
> You should be able to load a segment with DPL=0 while you are in CPL=0, then change CPL to 3 and still access the segment (obviously, it is not the best practice).
> 
> In that case, all the privilege checks in __linearize are redundant and for some extent incorrect.
> Obviously, I am afraid to submit a patch that removes them, since if the privilege checks of __linearize are needed in certain case, this may introduce security problem.
> 
> Do you agree?
> 
3a78a4f46302bfc83602a53dfa4dcbe76a7a1f5f removed RPL check from __linearize already, so
you are probably right, but better verify it on real HW.

--
			Gleb.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access
  2014-10-13 11:31                 ` Gleb Natapov
@ 2014-10-19 16:07                   ` Nadav Amit
  0 siblings, 0 replies; 32+ messages in thread
From: Nadav Amit @ 2014-10-19 16:07 UTC (permalink / raw)
  To: Gleb Natapov, Paolo Bonzini
  Cc: Radim Krčmář, Nadav Amit, kvm list


> On Oct 13, 2014, at 2:31 PM, Gleb Natapov <gleb@kernel.org> wrote:
> 
> On Mon, Oct 13, 2014 at 02:15:43AM +0300, Nadav Amit wrote:
>> 
>> 
>> On 10/12/14 3:12 PM, Paolo Bonzini wrote:
>>> Il 12/10/2014 08:57, Nadav Amit ha scritto:
>>>> Looks good. I’ll give it a try but it is hard to give a definitive
>>>> answer, since the emulator is still bug-ridden.
>>> 
>>> Yes, we need to write unit tests for this, especially the conforming
>>> case.  A bit of a pain to get kvm-unit-tests in ring 3 (access.flat
>>> does it), but I'll give it a shot.
>>> 
>>> Paolo
>>> 
>> 
>> I think the problem might be even more fundamental.
>> According to the SDM, the privilege level checks (CPL/DPL/RPL) are only performed when the segment is loaded; I see no reference to privilege checks when data is accessed.
>> You should be able to load a segment with DPL=0 while you are in CPL=0, then change CPL to 3 and still access the segment (obviously, it is not the best practice).
>> 
>> In that case, all the privilege checks in __linearize are redundant and for some extent incorrect.
>> Obviously, I am afraid to submit a patch that removes them, since if the privilege checks of __linearize are needed in certain case, this may introduce security problem.
>> 
>> Do you agree?
>> 
> 3a78a4f46302bfc83602a53dfa4dcbe76a7a1f5f removed RPL check from __linearize already, so
> you are probably right, but better verify it on real HW.


It turns far-ret cannot be used for such experiments, since it rechecks privilege and sets the segments which should be inaccessible (whose DPL is less than the new CPL) to NULL - see Intel SDM "5.8.6 Returning from a Called Procedure”.

Using the sysexit, however, I managed to verify the behaviour - DS was loaded with a segment whose DPL=0, then sysexit was executed, and eventually DS was used to access memory (while CPL=3). Experiments were done using the kvm-unit-test environment; the memory accessing instruction was _not_ emulated.

Accordingly, I’ll create a new version of this patch which removes all segment privilege checks in __linearize.

Nadav

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2014-10-19 16:07 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-02 22:10 [PATCH 0/5] KVM: x86: Various bug fixes Nadav Amit
2014-10-02 22:10 ` [PATCH 1/5] KVM: x86: Clear DR7.LE during task-switch Nadav Amit
2014-10-06 19:45   ` Radim Krčmář
2014-10-02 22:10 ` [PATCH 2/5] KVM: x86: Emulator performs code segment checks on read access Nadav Amit
2014-10-06 20:32   ` Radim Krčmář
2014-10-10  2:07     ` [PATCH v2 " Nadav Amit
2014-10-10 15:54       ` Radim Krčmář
2014-10-11  9:39         ` Paolo Bonzini
2014-10-12  6:57           ` Nadav Amit
2014-10-12 12:12             ` Paolo Bonzini
2014-10-12 23:15               ` Nadav Amit
2014-10-13  4:29                 ` Paolo Bonzini
2014-10-13 11:31                 ` Gleb Natapov
2014-10-19 16:07                   ` Nadav Amit
2014-10-02 22:10 ` [PATCH 3/5] KVM: x86: Decoding guest instructions which cross page boundary may fail Nadav Amit
2014-10-06 20:50   ` Radim Krčmář
2014-10-07  9:15     ` Nadav Amit
2014-10-08  9:02       ` Paolo Bonzini
2014-10-02 22:10 ` [PATCH 4/5] KVM: vmx: Unavailable DR4/5 is checked before CPL Nadav Amit
2014-10-06 19:33   ` Radim Krčmář
2014-10-02 22:10 ` [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes Nadav Amit
2014-10-06 20:57   ` Radim Krčmář
2014-10-07  9:35     ` Nadav Amit
2014-10-08 10:06       ` Radim Krčmář
2014-10-08 10:07         ` Paolo Bonzini
2014-10-10  1:55         ` Nadav Amit
2014-10-10  9:45           ` Paolo Bonzini
2014-10-10 12:50             ` Radim Krčmář
2014-10-10 12:51             ` Nadav Amit
2014-10-10 13:55               ` Paolo Bonzini
2014-10-10 14:02         ` Paolo Bonzini
2014-10-08  9:29   ` Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.