* [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
@ 2015-07-25  5:36 Andy Lutomirski
  2015-07-25  5:36 ` [PATCH v4 1/3] x86/ldt: Make modify_ldt synchronous Andy Lutomirski
                   ` (9 more replies)
  0 siblings, 10 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-25  5:36 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: security, X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel, Andy Lutomirski

Here's v4.  It fixes the "dazed and confused" issue, I hope.  It's also
probably a good general attack surface reduction, and it replaces some
scary code with IMO less scary code.

Also, servers and embedded systems should probably turn off modify_ldt.
This makes that possible.
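
For anyone who wants to check from userspace whether a given kernel
still offers the syscall, here is a minimal sketch.  It assumes that a
kernel built without modify_ldt reports ENOSYS, which is what the
kernel/sys_ni.c entry in the diffstat suggests but which this cover
letter does not spell out.  The helper name modify_ldt_available is
made up for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of this series.
 *
 * Returns 1 if modify_ldt appears usable, 0 if the kernel reports
 * ENOSYS (assumed here to mean the syscall was configured out), and
 * -1 if the syscall number is unknown on this architecture or the
 * call failed for some other reason.
 */
int modify_ldt_available(void)
{
#ifdef SYS_modify_ldt
	unsigned char buf[8];	/* room for one 8-byte descriptor */
	long ret;

	errno = 0;
	/* func 0 reads the LDT; a process with no LDT reads 0 bytes. */
	ret = syscall(SYS_modify_ldt, 0, buf, sizeof(buf));
	if (ret >= 0)
		return 1;
	return errno == ENOSYS ? 0 : -1;
#else
	return -1;
#endif
}
```

The read-only func 0 is used so that the probe does not change the
caller's LDT.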

Xen people, can you take a look at this?

Willy and Kees: I left the config option alone.  The -tiny people will
like it, and we can always add a sysctl of some sort later.

Changes from v3:
 - Hopefully fixed Xen.
 - Fixed 32-bit test case on 32-bit native kernel.
 - Fix bogus vunmap for some LDT sizes.
 - Strengthen test case to check all LDT sizes (catches bogus vunmap).
 - Lots of cleanups, mostly from Borislav.
 - Simplify IPI code using on_each_cpu_mask.
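
To illustrate why the test now walks every LDT size: in patch 1/3 the
allocation and free paths both choose between kzalloc/kfree and
vzalloc/vfree with the same "size * LDT_ENTRY_SIZE > PAGE_SIZE" check,
and the two must agree for every size.  The sketch below is a userspace
model of that predicate only.  The SKETCH_* constants and all function
names are assumptions for illustration (4096-byte pages, 8-byte
entries, 8192 entries), not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_LDT_ENTRY_SIZE	8u
#define SKETCH_LDT_ENTRIES	8192u

enum ldt_backing { BACKING_KMALLOC, BACKING_VMALLOC };

/* Bytes requested for an LDT with nr_entries descriptors. */
size_t ldt_alloc_bytes(unsigned int nr_entries)
{
	size_t bytes;

	assert(nr_entries <= SKETCH_LDT_ENTRIES);
	bytes = (size_t)nr_entries * SKETCH_LDT_ENTRY_SIZE;

	/* Never request less than one page. */
	return bytes > SKETCH_PAGE_SIZE ? bytes : SKETCH_PAGE_SIZE;
}

/* One predicate, meant to be shared by the alloc and free paths. */
enum ldt_backing ldt_backing_for(unsigned int nr_entries)
{
	size_t bytes = (size_t)nr_entries * SKETCH_LDT_ENTRY_SIZE;

	return bytes > SKETCH_PAGE_SIZE ? BACKING_VMALLOC : BACKING_KMALLOC;
}

/* Walk every legal size and count how many would use vmalloc. */
unsigned int count_vmalloc_sizes(void)
{
	unsigned int n, count = 0;

	for (n = 1; n <= SKETCH_LDT_ENTRIES; n++)
		if (ldt_backing_for(n) == BACKING_VMALLOC)
			count++;
	return count;
}
```

With these assumed constants the boundary sits at 512 entries, so a
test that only tries a handful of sizes can easily stay on one side of
it.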

Changes from v2:
 - Allocate ldt_struct and the LDT entries separately.  This should fix Xen.
 - Stop using write_ldt_entry, since I'm pretty sure it's unnecessary now
   that we no longer mutate an in-use LDT.  (Xen people, can you check?)
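
The "never mutate an in-use LDT" idea can be sketched in userspace as
copy, modify, publish.  This is only a model under stated assumptions:
C11 atomics stand in for smp_store_release and lockless_dereference
(acquire is used here, which is stronger than the kernel needs), the
code is single-threaded so the old table can be freed immediately
where the patch waits for the IPI first, and struct table, table_set
and table_get are made-up names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_ENTRIES 8192	/* assumed cap, for illustration */

struct table {
	uint64_t *entries;
	int size;
};

/* Published pointer: readers load it, writers replace it wholesale. */
static _Atomic(struct table *) live_table;

static struct table *table_alloc(int size)
{
	struct table *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	t->entries = calloc((size_t)size, sizeof(*t->entries));
	if (!t->entries) {
		free(t);
		return NULL;
	}
	t->size = size;
	return t;
}

static void table_free(struct table *t)
{
	if (!t)
		return;
	free(t->entries);
	free(t);
}

/* Writer: build a fresh table with the change, then publish it. */
int table_set(int idx, uint64_t value)
{
	struct table *old, *next;
	int oldsize, newsize;

	if (idx < 0 || idx >= SKETCH_MAX_ENTRIES)
		return -1;

	old = atomic_load_explicit(&live_table, memory_order_relaxed);
	oldsize = old ? old->size : 0;
	newsize = idx + 1 > oldsize ? idx + 1 : oldsize;

	next = table_alloc(newsize);
	if (!next)
		return -1;
	assert(next->size > idx);

	if (old)
		memcpy(next->entries, old->entries,
		       (size_t)oldsize * sizeof(*old->entries));
	next->entries[idx] = value;

	/* After this store the new table is never written again. */
	atomic_store_explicit(&live_table, next, memory_order_release);

	/* Single-threaded sketch: nobody can still be using 'old'. */
	table_free(old);
	return 0;
}

/* Reader: returns 0 for an empty table or an out-of-range index. */
uint64_t table_get(int idx)
{
	struct table *t = atomic_load_explicit(&live_table,
					       memory_order_acquire);

	if (!t || idx < 0 || idx >= t->size)
		return 0;
	return t->entries[idx];
}
```

Because a published table is never written, a reader that loaded the
pointer sees a consistent snapshot; the only remaining question is
when the old table may be freed, which is what the IPI answers in the
real code.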

Changes from v1:
 - The config option is new.
 - The test case is new.
 - Fixed a missing allocation failure check.
 - Fixed a use-after-free on fork().

Andy Lutomirski (3):
  x86/ldt: Make modify_ldt synchronous
  x86/ldt: Make modify_ldt optional
  selftests/x86, x86/ldt: Add a selftest for modify_ldt

 arch/x86/Kconfig                      |  17 ++
 arch/x86/include/asm/desc.h           |  15 --
 arch/x86/include/asm/mmu.h            |   5 +-
 arch/x86/include/asm/mmu_context.h    |  68 ++++-
 arch/x86/kernel/Makefile              |   3 +-
 arch/x86/kernel/cpu/common.c          |   4 +-
 arch/x86/kernel/cpu/perf_event.c      |  16 +-
 arch/x86/kernel/ldt.c                 | 262 +++++++++---------
 arch/x86/kernel/process_64.c          |   6 +-
 arch/x86/kernel/step.c                |   8 +-
 arch/x86/power/cpu.c                  |   3 +-
 kernel/sys_ni.c                       |   1 +
 tools/testing/selftests/x86/Makefile  |   2 +-
 tools/testing/selftests/x86/ldt_gdt.c | 492 ++++++++++++++++++++++++++++++++++
 14 files changed, 747 insertions(+), 155 deletions(-)
 create mode 100644 tools/testing/selftests/x86/ldt_gdt.c

-- 
2.4.3



* [PATCH v4 1/3] x86/ldt: Make modify_ldt synchronous
  2015-07-25  5:36 [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option Andy Lutomirski
@ 2015-07-25  5:36 ` Andy Lutomirski
  2015-07-25  9:03   ` Borislav Petkov
  2015-07-25  9:03   ` Borislav Petkov
  2015-07-25  5:36 ` Andy Lutomirski
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-25  5:36 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: security, X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel, Andy Lutomirski, stable

modify_ldt has questionable locking and does not synchronize
threads.  Improve it: redesign the locking and synchronize all
threads' LDTs using an IPI on all modifications.

This will dramatically slow down modify_ldt in multithreaded
programs, but there shouldn't be any multithreaded programs that
care about modify_ldt's performance in the first place.

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/desc.h        |  15 ---
 arch/x86/include/asm/mmu.h         |   3 +-
 arch/x86/include/asm/mmu_context.h |  53 +++++++-
 arch/x86/kernel/cpu/common.c       |   4 +-
 arch/x86/kernel/cpu/perf_event.c   |  12 +-
 arch/x86/kernel/ldt.c              | 262 ++++++++++++++++++++-----------------
 arch/x86/kernel/process_64.c       |   4 +-
 arch/x86/kernel/step.c             |   6 +-
 arch/x86/power/cpu.c               |   3 +-
 9 files changed, 209 insertions(+), 153 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index a0bf89fd2647..4e10d73cf018 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -280,21 +280,6 @@ static inline void clear_LDT(void)
 	set_ldt(NULL, 0);
 }
 
-/*
- * load one particular LDT into the current CPU
- */
-static inline void load_LDT_nolock(mm_context_t *pc)
-{
-	set_ldt(pc->ldt, pc->size);
-}
-
-static inline void load_LDT(mm_context_t *pc)
-{
-	preempt_disable();
-	load_LDT_nolock(pc);
-	preempt_enable();
-}
-
 static inline unsigned long get_desc_base(const struct desc_struct *desc)
 {
 	return (unsigned)(desc->base0 | ((desc->base1) << 16) | ((desc->base2) << 24));
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 09b9620a73b4..364d27481a52 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -9,8 +9,7 @@
  * we put the segment information here.
  */
 typedef struct {
-	void *ldt;
-	int size;
+	struct ldt_struct *ldt;
 
 #ifdef CONFIG_X86_64
 	/* True if mm supports a task running in 32 bit compatibility mode. */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 804a3a6030ca..3fcff70c398e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -34,6 +34,49 @@ static inline void load_mm_cr4(struct mm_struct *mm) {}
 #endif
 
 /*
+ * ldt_structs can be allocated, used, and freed, but they are never
+ * modified while live.
+ */
+struct ldt_struct {
+	/*
+	 * Xen requires page-aligned LDTs with special permissions.  This is
+	 * needed to prevent us from installing evil descriptors such as
+	 * call gates.  On native, we could merge the ldt_struct and LDT
+	 * allocations, but it's not worth trying to optimize.
+	 */
+	struct desc_struct *entries;
+	int size;
+};
+
+static inline void load_mm_ldt(struct mm_struct *mm)
+{
+	struct ldt_struct *ldt;
+	DEBUG_LOCKS_WARN_ON(!irqs_disabled());
+
+	/* lockless_dereference synchronizes with smp_store_release */
+	ldt = lockless_dereference(mm->context.ldt);
+
+	/*
+	 * Any change to mm->context.ldt is followed by an IPI to all
+	 * CPUs with the mm active.  The LDT will not be freed until
+	 * after the IPI is handled by all such CPUs.  This means that,
+	 * if the ldt_struct changes before we return, the values we see
+	 * will be safe, and the new values will be loaded before we run
+	 * any user code.
+	 *
+	 * NB: don't try to convert this to use RCU without extreme care.
+	 * We would still need IRQs off, because we don't want to change
+	 * the local LDT after an IPI loaded a newer value than the one
+	 * that we can see.
+	 */
+
+	if (unlikely(ldt))
+		set_ldt(ldt->entries, ldt->size);
+	else
+		clear_LDT();
+}
+
+/*
  * Used for LDT copy/destruction.
  */
 int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
@@ -78,12 +121,12 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		 * was called and then modify_ldt changed
 		 * prev->context.ldt but suppressed an IPI to this CPU.
 		 * In this case, prev->context.ldt != NULL, because we
-		 * never free an LDT while the mm still exists.  That
-		 * means that next->context.ldt != prev->context.ldt,
-		 * because mms never share an LDT.
+		 * never set context.ldt to NULL while the mm still
+		 * exists.  That means that next->context.ldt !=
+		 * prev->context.ldt, because mms never share an LDT.
 		 */
 		if (unlikely(prev->context.ldt != next->context.ldt))
-			load_LDT_nolock(&next->context);
+			load_mm_ldt(next);
 	}
 #ifdef CONFIG_SMP
 	  else {
@@ -106,7 +149,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			load_cr3(next->pgd);
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 			load_mm_cr4(next);
-			load_LDT_nolock(&next->context);
+			load_mm_ldt(next);
 		}
 	}
 #endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 922c5e0cea4c..cb9e5df42dd2 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1410,7 +1410,7 @@ void cpu_init(void)
 	load_sp0(t, &current->thread);
 	set_tss_desc(cpu, t);
 	load_TR_desc();
-	load_LDT(&init_mm.context);
+	load_mm_ldt(&init_mm);
 
 	clear_all_debug_regs();
 	dbg_restore_debug_regs();
@@ -1459,7 +1459,7 @@ void cpu_init(void)
 	load_sp0(t, thread);
 	set_tss_desc(cpu, t);
 	load_TR_desc();
-	load_LDT(&init_mm.context);
+	load_mm_ldt(&init_mm);
 
 	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 3658de47900f..9469dfa55607 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2179,21 +2179,25 @@ static unsigned long get_segment_base(unsigned int segment)
 	int idx = segment >> 3;
 
 	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+		struct ldt_struct *ldt;
+
 		if (idx > LDT_ENTRIES)
 			return 0;
 
-		if (idx > current->active_mm->context.size)
+		/* IRQs are off, so this synchronizes with smp_store_release */
+		ldt = lockless_dereference(current->active_mm->context.ldt);
+		if (!ldt || idx >= ldt->size)
 			return 0;
 
-		desc = current->active_mm->context.ldt;
+		desc = &ldt->entries[idx];
 	} else {
 		if (idx > GDT_ENTRIES)
 			return 0;
 
-		desc = raw_cpu_ptr(gdt_page.gdt);
+		desc = raw_cpu_ptr(gdt_page.gdt) + idx;
 	}
 
-	return get_desc_base(desc + idx);
+	return get_desc_base(desc);
 }
 
 #ifdef CONFIG_COMPAT
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index c37886d759cc..2bcc0525f1c1 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -12,6 +12,7 @@
 #include <linux/string.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
+#include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/uaccess.h>
 
@@ -20,82 +21,82 @@
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
 
-#ifdef CONFIG_SMP
+/* context.lock is held for us, so we don't need any locking. */
 static void flush_ldt(void *current_mm)
 {
-	if (current->active_mm == current_mm)
-		load_LDT(&current->active_mm->context);
+	mm_context_t *pc;
+
+	if (current->active_mm != current_mm)
+		return;
+
+	pc = &current->active_mm->context;
+	set_ldt(pc->ldt->entries, pc->ldt->size);
 }
-#endif
 
-static int alloc_ldt(mm_context_t *pc, int mincount, int reload)
+/* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
+static struct ldt_struct *alloc_ldt_struct(int size)
 {
-	void *oldldt, *newldt;
-	int oldsize;
-
-	if (mincount <= pc->size)
-		return 0;
-	oldsize = pc->size;
-	mincount = (mincount + (PAGE_SIZE / LDT_ENTRY_SIZE - 1)) &
-			(~(PAGE_SIZE / LDT_ENTRY_SIZE - 1));
-	if (mincount * LDT_ENTRY_SIZE > PAGE_SIZE)
-		newldt = vmalloc(mincount * LDT_ENTRY_SIZE);
+	struct ldt_struct *new_ldt;
+	int alloc_size;
+
+	if (size > LDT_ENTRIES)
+		return NULL;
+
+	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+	if (!new_ldt)
+		return NULL;
+
+	BUILD_BUG_ON(LDT_ENTRY_SIZE != sizeof(struct desc_struct));
+	alloc_size = size * LDT_ENTRY_SIZE;
+
+	/*
+	 * Xen is very picky: it requires a page-aligned LDT that has no
+	 * trailing nonzero bytes in any page that contains LDT descriptors.
+	 * Keep it simple: zero the whole allocation and never allocate less
+	 * than PAGE_SIZE.
+	 */
+	if (alloc_size > PAGE_SIZE)
+		new_ldt->entries = vzalloc(alloc_size);
 	else
-		newldt = (void *)__get_free_page(GFP_KERNEL);
-
-	if (!newldt)
-		return -ENOMEM;
+		new_ldt->entries = kzalloc(PAGE_SIZE, GFP_KERNEL);
 
-	if (oldsize)
-		memcpy(newldt, pc->ldt, oldsize * LDT_ENTRY_SIZE);
-	oldldt = pc->ldt;
-	memset(newldt + oldsize * LDT_ENTRY_SIZE, 0,
-	       (mincount - oldsize) * LDT_ENTRY_SIZE);
+	if (!new_ldt->entries) {
+		kfree(new_ldt);
+		return NULL;
+	}
 
-	paravirt_alloc_ldt(newldt, mincount);
+	new_ldt->size = size;
+	return new_ldt;
+}
 
-#ifdef CONFIG_X86_64
-	/* CHECKME: Do we really need this ? */
-	wmb();
-#endif
-	pc->ldt = newldt;
-	wmb();
-	pc->size = mincount;
-	wmb();
-
-	if (reload) {
-#ifdef CONFIG_SMP
-		preempt_disable();
-		load_LDT(pc);
-		if (!cpumask_equal(mm_cpumask(current->mm),
-				   cpumask_of(smp_processor_id())))
-			smp_call_function(flush_ldt, current->mm, 1);
-		preempt_enable();
-#else
-		load_LDT(pc);
-#endif
-	}
-	if (oldsize) {
-		paravirt_free_ldt(oldldt, oldsize);
-		if (oldsize * LDT_ENTRY_SIZE > PAGE_SIZE)
-			vfree(oldldt);
-		else
-			put_page(virt_to_page(oldldt));
-	}
-	return 0;
+/* After calling this, the LDT is immutable. */
+static void finalize_ldt_struct(struct ldt_struct *ldt)
+{
+	paravirt_alloc_ldt(ldt->entries, ldt->size);
 }
 
-static inline int copy_ldt(mm_context_t *new, mm_context_t *old)
+/* context.lock is held */
+static void install_ldt(struct mm_struct *current_mm,
+			struct ldt_struct *ldt)
 {
-	int err = alloc_ldt(new, old->size, 0);
-	int i;
+	/* Synchronizes with lockless_dereference in load_mm_ldt. */
+	smp_store_release(&current_mm->context.ldt, ldt);
+
+	/* Activate the LDT for all CPUs using current_mm. */
+	on_each_cpu_mask(mm_cpumask(current_mm), flush_ldt, current_mm, true);
+}
 
-	if (err < 0)
-		return err;
+static void free_ldt_struct(struct ldt_struct *ldt)
+{
+	if (likely(!ldt))
+		return;
 
-	for (i = 0; i < old->size; i++)
-		write_ldt_entry(new->ldt, i, old->ldt + i * LDT_ENTRY_SIZE);
-	return 0;
+	paravirt_free_ldt(ldt->entries, ldt->size);
+	if (ldt->size * LDT_ENTRY_SIZE > PAGE_SIZE)
+		vfree(ldt->entries);
+	else
+		kfree(ldt->entries);
+	kfree(ldt);
 }
 
 /*
@@ -104,17 +105,37 @@ static inline int copy_ldt(mm_context_t *new, mm_context_t *old)
  */
 int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 {
+	struct ldt_struct *new_ldt;
 	struct mm_struct *old_mm;
 	int retval = 0;
 
 	mutex_init(&mm->context.lock);
-	mm->context.size = 0;
 	old_mm = current->mm;
-	if (old_mm && old_mm->context.size > 0) {
-		mutex_lock(&old_mm->context.lock);
-		retval = copy_ldt(&mm->context, &old_mm->context);
-		mutex_unlock(&old_mm->context.lock);
+	if (!old_mm) {
+		mm->context.ldt = NULL;
+		return 0;
 	}
+
+	mutex_lock(&old_mm->context.lock);
+	if (!old_mm->context.ldt) {
+		mm->context.ldt = NULL;
+		goto out_unlock;
+	}
+
+	new_ldt = alloc_ldt_struct(old_mm->context.ldt->size);
+	if (!new_ldt) {
+		retval = -ENOMEM;
+		goto out_unlock;
+	}
+
+	memcpy(new_ldt->entries, old_mm->context.ldt->entries,
+	       new_ldt->size * LDT_ENTRY_SIZE);
+	finalize_ldt_struct(new_ldt);
+
+	mm->context.ldt = new_ldt;
+
+out_unlock:
+	mutex_unlock(&old_mm->context.lock);
 	return retval;
 }
 
@@ -125,53 +146,47 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
  */
 void destroy_context(struct mm_struct *mm)
 {
-	if (mm->context.size) {
-#ifdef CONFIG_X86_32
-		/* CHECKME: Can this ever happen ? */
-		if (mm == current->active_mm)
-			clear_LDT();
-#endif
-		paravirt_free_ldt(mm->context.ldt, mm->context.size);
-		if (mm->context.size * LDT_ENTRY_SIZE > PAGE_SIZE)
-			vfree(mm->context.ldt);
-		else
-			put_page(virt_to_page(mm->context.ldt));
-		mm->context.size = 0;
-	}
+	free_ldt_struct(mm->context.ldt);
+	mm->context.ldt = NULL;
 }
 
 static int read_ldt(void __user *ptr, unsigned long bytecount)
 {
-	int err;
+	int retval;
 	unsigned long size;
 	struct mm_struct *mm = current->mm;
 
-	if (!mm->context.size)
-		return 0;
+	mutex_lock(&mm->context.lock);
+
+	if (!mm->context.ldt) {
+		retval = 0;
+		goto out_unlock;
+	}
+
 	if (bytecount > LDT_ENTRY_SIZE * LDT_ENTRIES)
 		bytecount = LDT_ENTRY_SIZE * LDT_ENTRIES;
 
-	mutex_lock(&mm->context.lock);
-	size = mm->context.size * LDT_ENTRY_SIZE;
+	size = mm->context.ldt->size * LDT_ENTRY_SIZE;
 	if (size > bytecount)
 		size = bytecount;
 
-	err = 0;
-	if (copy_to_user(ptr, mm->context.ldt, size))
-		err = -EFAULT;
-	mutex_unlock(&mm->context.lock);
-	if (err < 0)
-		goto error_return;
+	if (copy_to_user(ptr, mm->context.ldt->entries, size)) {
+		retval = -EFAULT;
+		goto out_unlock;
+	}
+
 	if (size != bytecount) {
-		/* zero-fill the rest */
-		if (clear_user(ptr + size, bytecount - size) != 0) {
-			err = -EFAULT;
-			goto error_return;
+		/* Zero-fill the rest and pretend we read bytecount bytes. */
+		if (clear_user(ptr + size, bytecount - size)) {
+			retval = -EFAULT;
+			goto out_unlock;
 		}
 	}
-	return bytecount;
-error_return:
-	return err;
+	retval = bytecount;
+
+out_unlock:
+	mutex_unlock(&mm->context.lock);
+	return retval;
 }
 
 static int read_default_ldt(void __user *ptr, unsigned long bytecount)
@@ -195,6 +210,8 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 	struct desc_struct ldt;
 	int error;
 	struct user_desc ldt_info;
+	int oldsize, newsize;
+	struct ldt_struct *new_ldt, *old_ldt;
 
 	error = -EINVAL;
 	if (bytecount != sizeof(ldt_info))
@@ -213,34 +230,39 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 			goto out;
 	}
 
-	mutex_lock(&mm->context.lock);
-	if (ldt_info.entry_number >= mm->context.size) {
-		error = alloc_ldt(&current->mm->context,
-				  ldt_info.entry_number + 1, 1);
-		if (error < 0)
-			goto out_unlock;
-	}
-
-	/* Allow LDTs to be cleared by the user. */
-	if (ldt_info.base_addr == 0 && ldt_info.limit == 0) {
-		if (oldmode || LDT_empty(&ldt_info)) {
-			memset(&ldt, 0, sizeof(ldt));
-			goto install;
+	if ((oldmode && !ldt_info.base_addr && !ldt_info.limit) ||
+	    LDT_empty(&ldt_info)) {
+		/* The user wants to clear the entry. */
+		memset(&ldt, 0, sizeof(ldt));
+	} else {
+		if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
+			error = -EINVAL;
+			goto out;
 		}
+
+		fill_ldt(&ldt, &ldt_info);
+		if (oldmode)
+			ldt.avl = 0;
 	}
 
-	if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
-		error = -EINVAL;
+	mutex_lock(&mm->context.lock);
+
+	old_ldt = mm->context.ldt;
+	oldsize = old_ldt ? old_ldt->size : 0;
+	newsize = max((int)(ldt_info.entry_number + 1), oldsize);
+
+	error = -ENOMEM;
+	new_ldt = alloc_ldt_struct(newsize);
+	if (!new_ldt)
 		goto out_unlock;
-	}
 
-	fill_ldt(&ldt, &ldt_info);
-	if (oldmode)
-		ldt.avl = 0;
+	if (old_ldt)
+		memcpy(new_ldt->entries, old_ldt->entries, oldsize * LDT_ENTRY_SIZE);
+	new_ldt->entries[ldt_info.entry_number] = ldt;
+	finalize_ldt_struct(new_ldt);
 
-	/* Install the new entry ...  */
-install:
-	write_ldt_entry(mm->context.ldt, ldt_info.entry_number, &ldt);
+	install_ldt(mm, new_ldt);
+	free_ldt_struct(old_ldt);
 	error = 0;
 
 out_unlock:
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 71d7849a07f7..f6b916387590 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -121,11 +121,11 @@ void __show_regs(struct pt_regs *regs, int all)
 void release_thread(struct task_struct *dead_task)
 {
 	if (dead_task->mm) {
-		if (dead_task->mm->context.size) {
+		if (dead_task->mm->context.ldt) {
 			pr_warn("WARNING: dead process %s still has LDT? <%p/%d>\n",
 				dead_task->comm,
 				dead_task->mm->context.ldt,
-				dead_task->mm->context.size);
+				dead_task->mm->context.ldt->size);
 			BUG();
 		}
 	}
diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index 9b4d51d0c0d0..6273324186ac 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 #include <linux/ptrace.h>
 #include <asm/desc.h>
+#include <asm/mmu_context.h>
 
 unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs)
 {
@@ -30,10 +31,11 @@ unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *re
 		seg &= ~7UL;
 
 		mutex_lock(&child->mm->context.lock);
-		if (unlikely((seg >> 3) >= child->mm->context.size))
+		if (unlikely(!child->mm->context.ldt ||
+			     (seg >> 3) >= child->mm->context.ldt->size))
 			addr = -1L; /* bogus selector, access would fault */
 		else {
-			desc = child->mm->context.ldt + seg;
+			desc = &child->mm->context.ldt->entries[seg >> 3];
 			base = get_desc_base(desc);
 
 			/* 16-bit code segment? */
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 0d7dd1f5ac36..9ab52791fed5 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -22,6 +22,7 @@
 #include <asm/fpu/internal.h>
 #include <asm/debugreg.h>
 #include <asm/cpu.h>
+#include <asm/mmu_context.h>
 
 #ifdef CONFIG_X86_32
 __visible unsigned long saved_context_ebx;
@@ -153,7 +154,7 @@ static void fix_processor_context(void)
 	syscall_init();				/* This sets MSR_*STAR and related */
 #endif
 	load_TR_desc();				/* This does ltr */
-	load_LDT(&current->active_mm->context);	/* This does lldt */
+	load_mm_ldt(current->active_mm);	/* This does lldt */
 
 	fpu__resume_cpu();
 }
-- 
2.4.3



 
-	if (!mm->context.size)
-		return 0;
+	mutex_lock(&mm->context.lock);
+
+	if (!mm->context.ldt) {
+		retval = 0;
+		goto out_unlock;
+	}
+
 	if (bytecount > LDT_ENTRY_SIZE * LDT_ENTRIES)
 		bytecount = LDT_ENTRY_SIZE * LDT_ENTRIES;
 
-	mutex_lock(&mm->context.lock);
-	size = mm->context.size * LDT_ENTRY_SIZE;
+	size = mm->context.ldt->size * LDT_ENTRY_SIZE;
 	if (size > bytecount)
 		size = bytecount;
 
-	err = 0;
-	if (copy_to_user(ptr, mm->context.ldt, size))
-		err = -EFAULT;
-	mutex_unlock(&mm->context.lock);
-	if (err < 0)
-		goto error_return;
+	if (copy_to_user(ptr, mm->context.ldt->entries, size)) {
+		retval = -EFAULT;
+		goto out_unlock;
+	}
+
 	if (size != bytecount) {
-		/* zero-fill the rest */
-		if (clear_user(ptr + size, bytecount - size) != 0) {
-			err = -EFAULT;
-			goto error_return;
+		/* Zero-fill the rest and pretend we read bytecount bytes. */
+		if (clear_user(ptr + size, bytecount - size)) {
+			retval = -EFAULT;
+			goto out_unlock;
 		}
 	}
-	return bytecount;
-error_return:
-	return err;
+	retval = bytecount;
+
+out_unlock:
+	mutex_unlock(&mm->context.lock);
+	return retval;
 }
 
 static int read_default_ldt(void __user *ptr, unsigned long bytecount)
@@ -195,6 +210,8 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 	struct desc_struct ldt;
 	int error;
 	struct user_desc ldt_info;
+	int oldsize, newsize;
+	struct ldt_struct *new_ldt, *old_ldt;
 
 	error = -EINVAL;
 	if (bytecount != sizeof(ldt_info))
@@ -213,34 +230,39 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 			goto out;
 	}
 
-	mutex_lock(&mm->context.lock);
-	if (ldt_info.entry_number >= mm->context.size) {
-		error = alloc_ldt(&current->mm->context,
-				  ldt_info.entry_number + 1, 1);
-		if (error < 0)
-			goto out_unlock;
-	}
-
-	/* Allow LDTs to be cleared by the user. */
-	if (ldt_info.base_addr == 0 && ldt_info.limit == 0) {
-		if (oldmode || LDT_empty(&ldt_info)) {
-			memset(&ldt, 0, sizeof(ldt));
-			goto install;
+	if ((oldmode && !ldt_info.base_addr && !ldt_info.limit) ||
+	    LDT_empty(&ldt_info)) {
+		/* The user wants to clear the entry. */
+		memset(&ldt, 0, sizeof(ldt));
+	} else {
+		if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
+			error = -EINVAL;
+			goto out;
 		}
+
+		fill_ldt(&ldt, &ldt_info);
+		if (oldmode)
+			ldt.avl = 0;
 	}
 
-	if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
-		error = -EINVAL;
+	mutex_lock(&mm->context.lock);
+
+	old_ldt = mm->context.ldt;
+	oldsize = old_ldt ? old_ldt->size : 0;
+	newsize = max((int)(ldt_info.entry_number + 1), oldsize);
+
+	error = -ENOMEM;
+	new_ldt = alloc_ldt_struct(newsize);
+	if (!new_ldt)
 		goto out_unlock;
-	}
 
-	fill_ldt(&ldt, &ldt_info);
-	if (oldmode)
-		ldt.avl = 0;
+	if (old_ldt)
+		memcpy(new_ldt->entries, old_ldt->entries, oldsize * LDT_ENTRY_SIZE);
+	new_ldt->entries[ldt_info.entry_number] = ldt;
+	finalize_ldt_struct(new_ldt);
 
-	/* Install the new entry ...  */
-install:
-	write_ldt_entry(mm->context.ldt, ldt_info.entry_number, &ldt);
+	install_ldt(mm, new_ldt);
+	free_ldt_struct(old_ldt);
 	error = 0;
 
 out_unlock:
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 71d7849a07f7..f6b916387590 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -121,11 +121,11 @@ void __show_regs(struct pt_regs *regs, int all)
 void release_thread(struct task_struct *dead_task)
 {
 	if (dead_task->mm) {
-		if (dead_task->mm->context.size) {
+		if (dead_task->mm->context.ldt) {
 			pr_warn("WARNING: dead process %s still has LDT? <%p/%d>\n",
 				dead_task->comm,
 				dead_task->mm->context.ldt,
-				dead_task->mm->context.size);
+				dead_task->mm->context.ldt->size);
 			BUG();
 		}
 	}
diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index 9b4d51d0c0d0..6273324186ac 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 #include <linux/ptrace.h>
 #include <asm/desc.h>
+#include <asm/mmu_context.h>
 
 unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs)
 {
@@ -30,10 +31,11 @@ unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *re
 		seg &= ~7UL;
 
 		mutex_lock(&child->mm->context.lock);
-		if (unlikely((seg >> 3) >= child->mm->context.size))
+		if (unlikely(!child->mm->context.ldt ||
+			     (seg >> 3) >= child->mm->context.ldt->size))
 			addr = -1L; /* bogus selector, access would fault */
 		else {
-			desc = child->mm->context.ldt + seg;
+			desc = &child->mm->context.ldt->entries[seg >> 3];
 			base = get_desc_base(desc);
 
 			/* 16-bit code segment? */
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 0d7dd1f5ac36..9ab52791fed5 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -22,6 +22,7 @@
 #include <asm/fpu/internal.h>
 #include <asm/debugreg.h>
 #include <asm/cpu.h>
+#include <asm/mmu_context.h>
 
 #ifdef CONFIG_X86_32
 __visible unsigned long saved_context_ebx;
@@ -153,7 +154,7 @@ static void fix_processor_context(void)
 	syscall_init();				/* This sets MSR_*STAR and related */
 #endif
 	load_TR_desc();				/* This does ltr */
-	load_LDT(&current->active_mm->context);	/* This does lldt */
+	load_mm_ldt(current->active_mm);	/* This does lldt */
 
 	fpu__resume_cpu();
 }
-- 
2.4.3

* [PATCH v4 2/3] x86/ldt: Make modify_ldt optional
  2015-07-25  5:36 [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option Andy Lutomirski
                   ` (2 preceding siblings ...)
  2015-07-25  5:36 ` [PATCH v4 2/3] x86/ldt: Make modify_ldt optional Andy Lutomirski
@ 2015-07-25  5:36 ` Andy Lutomirski
  2015-07-25  6:23   ` Willy Tarreau
                     ` (3 more replies)
  2015-07-25  5:36 ` [PATCH v4 3/3] selftests/x86, x86/ldt: Add a selftest for modify_ldt Andy Lutomirski
                   ` (5 subsequent siblings)
  9 siblings, 4 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-25  5:36 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: security, X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel, Andy Lutomirski

The modify_ldt syscall exposes a large attack surface and is
unnecessary for modern userspace.  Make it optional.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
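Not part of the patch, just an illustration: with MODIFY_LDT_SYSCALL=n
the syscall goes through cond_syscall() and should fail with ENOSYS, so
userspace can probe for it.  A minimal sketch, assuming an x86 Linux
build; modify_ldt_available() is a hypothetical helper name, and the
int return handling mirrors what the selftest in this series does.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Hypothetical helper, not part of this series.
 * Returns 1 if modify_ldt looks usable, 0 if the kernel reports
 * ENOSYS (or the syscall number is unknown), -1 on any other error.
 */
static int modify_ldt_available(void)
{
#ifdef SYS_modify_ldt
	unsigned char buf[8];	/* room for one 8-byte LDT entry */
	int ret;

	errno = 0;
	/* func 0 reads the LDT; it does not change any state. */
	ret = syscall(SYS_modify_ldt, 0, buf, sizeof(buf));

	/* 64-bit kernels can hand back a 32-bit negative errno. */
	if (ret < -1) {
		errno = -ret;
		ret = -1;
	}

	if (ret >= 0)
		return 1;
	return errno == ENOSYS ? 0 : -1;
#else
	return 0;	/* no syscall number on this architecture */
#endif
}
```

A caller would typically treat 0 as "skip LDT-dependent features" and
-1 as "something else is blocking the call, e.g. a seccomp filter".
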
 arch/x86/Kconfig                   | 17 +++++++++++++++++
 arch/x86/include/asm/mmu.h         |  2 ++
 arch/x86/include/asm/mmu_context.h | 31 +++++++++++++++++++++++--------
 arch/x86/kernel/Makefile           |  3 ++-
 arch/x86/kernel/cpu/perf_event.c   |  4 ++++
 arch/x86/kernel/process_64.c       |  2 ++
 arch/x86/kernel/step.c             |  2 ++
 kernel/sys_ni.c                    |  1 +
 8 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b3a1a5d77d92..ede52be845db 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1015,6 +1015,7 @@ config VM86
 config X86_16BIT
 	bool "Enable support for 16-bit segments" if EXPERT
 	default y
+	depends on MODIFY_LDT_SYSCALL
 	---help---
 	  This option is required by programs like Wine to run 16-bit
 	  protected mode legacy code on x86 processors.  Disabling
@@ -2053,6 +2054,22 @@ config CMDLINE_OVERRIDE
 	  This is used to work around broken boot loaders.  This should
 	  be set to 'N' under normal conditions.
 
+config MODIFY_LDT_SYSCALL
+	bool "Enable the LDT (local descriptor table)" if EXPERT
+	default y
+	---help---
+	  Linux can allow user programs to install a per-process x86
+	  Local Descriptor Table (LDT) using the modify_ldt(2) system
+	  call.  This is required to run 16-bit or segmented code such as
+	  DOSEMU or some Wine programs.  It is also used by some very old
+	  threading libraries.
+
+	  Enabling this feature adds a small amount of overhead to
+	  context switches and increases the low-level kernel attack
+	  surface.  Disabling it removes the modify_ldt(2) system call.
+
+	  Saying 'N' here may make sense for embedded or server kernels.
+
 source "kernel/livepatch/Kconfig"
 
 endmenu
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 364d27481a52..55234d5e7160 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -9,7 +9,9 @@
  * we put the segment information here.
  */
 typedef struct {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct ldt_struct *ldt;
+#endif
 
 #ifdef CONFIG_X86_64
 	/* True if mm supports a task running in 32 bit compatibility mode. */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 3fcff70c398e..08094eded318 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -33,6 +33,7 @@ static inline void load_mm_cr4(struct mm_struct *mm)
 static inline void load_mm_cr4(struct mm_struct *mm) {}
 #endif
 
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
 /*
  * ldt_structs can be allocated, used, and freed, but they are never
  * modified while live.
@@ -48,10 +49,24 @@ struct ldt_struct {
 	int size;
 };
 
+/*
+ * Used for LDT copy/destruction.
+ */
+int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
+void destroy_context(struct mm_struct *mm);
+#else	/* CONFIG_MODIFY_LDT_SYSCALL */
+static inline int init_new_context(struct task_struct *tsk,
+				   struct mm_struct *mm)
+{
+	return 0;
+}
+static inline void destroy_context(struct mm_struct *mm) {}
+#endif
+
 static inline void load_mm_ldt(struct mm_struct *mm)
 {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct ldt_struct *ldt;
-	DEBUG_LOCKS_WARN_ON(!irqs_disabled());
 
 	/* lockless_dereference synchronizes with smp_store_release */
 	ldt = lockless_dereference(mm->context.ldt);
@@ -74,14 +89,12 @@ static inline void load_mm_ldt(struct mm_struct *mm)
 		set_ldt(ldt->entries, ldt->size);
 	else
 		clear_LDT();
-}
-
-/*
- * Used for LDT copy/destruction.
- */
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
-void destroy_context(struct mm_struct *mm);
+#else
+	clear_LDT();
+#endif
 
+	DEBUG_LOCKS_WARN_ON(!irqs_disabled());
+}
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
@@ -113,6 +126,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		/* Load per-mm CR4 state */
 		load_mm_cr4(next);
 
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
 		/*
 		 * Load the LDT, if the LDT is different.
 		 *
@@ -127,6 +141,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		 */
 		if (unlikely(prev->context.ldt != next->context.ldt))
 			load_mm_ldt(next);
+#endif
 	}
 #ifdef CONFIG_SMP
 	  else {
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 0f15af41bd80..2b507befcd3f 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -24,7 +24,8 @@ CFLAGS_irq.o := -I$(src)/../include/asm/trace
 
 obj-y			:= process_$(BITS).o signal.o
 obj-y			+= traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
-obj-y			+= time.o ioport.o ldt.o dumpstack.o nmi.o
+obj-y			+= time.o ioport.o dumpstack.o nmi.o
+obj-$(CONFIG_MODIFY_LDT_SYSCALL)	+= ldt.o
 obj-y			+= setup.o x86_init.o i8259.o irqinit.o jump_label.o
 obj-$(CONFIG_IRQ_WORK)  += irq_work.o
 obj-y			+= probe_roms.o
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 9469dfa55607..58b872ef2329 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2179,6 +2179,7 @@ static unsigned long get_segment_base(unsigned int segment)
 	int idx = segment >> 3;
 
 	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
 		struct ldt_struct *ldt;
 
 		if (idx > LDT_ENTRIES)
@@ -2190,6 +2191,9 @@ static unsigned long get_segment_base(unsigned int segment)
 			return 0;
 
 		desc = &ldt->entries[idx];
+#else
+		return 0;
+#endif
 	} else {
 		if (idx > GDT_ENTRIES)
 			return 0;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index f6b916387590..941295ddf802 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -121,6 +121,7 @@ void __show_regs(struct pt_regs *regs, int all)
 void release_thread(struct task_struct *dead_task)
 {
 	if (dead_task->mm) {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
 		if (dead_task->mm->context.ldt) {
 			pr_warn("WARNING: dead process %s still has LDT? <%p/%d>\n",
 				dead_task->comm,
@@ -128,6 +129,7 @@ void release_thread(struct task_struct *dead_task)
 				dead_task->mm->context.ldt->size);
 			BUG();
 		}
+#endif
 	}
 }
 
diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index 6273324186ac..fd88e152d584 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -18,6 +18,7 @@ unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *re
 		return addr;
 	}
 
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
 	/*
 	 * We'll assume that the code segments in the GDT
 	 * are all zero-based. That is largely true: the
@@ -45,6 +46,7 @@ unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *re
 		}
 		mutex_unlock(&child->mm->context.lock);
 	}
+#endif
 
 	return addr;
 }
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7995ef5868d8..ca7d84f438f1 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -140,6 +140,7 @@ cond_syscall(sys_sgetmask);
 cond_syscall(sys_ssetmask);
 cond_syscall(sys_vm86old);
 cond_syscall(sys_vm86);
+cond_syscall(sys_modify_ldt);
 cond_syscall(sys_ipc);
 cond_syscall(compat_sys_ipc);
 cond_syscall(compat_sys_sysctl);
-- 
2.4.3


* [PATCH v4 3/3] selftests/x86, x86/ldt: Add a selftest for modify_ldt
  2015-07-25  5:36 [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option Andy Lutomirski
                   ` (4 preceding siblings ...)
  2015-07-25  5:36 ` [PATCH v4 3/3] selftests/x86, x86/ldt: Add a selftest for modify_ldt Andy Lutomirski
@ 2015-07-25  5:36 ` Andy Lutomirski
  2015-07-27 15:52   ` [PATCH v4.1 3.3] " Andy Lutomirski
  2015-07-27 15:52   ` Andy Lutomirski
  2015-07-25  6:27 ` [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option Willy Tarreau
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-25  5:36 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: security, X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel, Andy Lutomirski

This tests general modify_ldt behavior (only writes, so far) as
well as synchronous updates via IPI.  It fails on old kernels.

I called this ldt_gdt because I'll add set_thread_area tests to
it at some point.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
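For anyone reading the test without the SDM open, here is a sketch of
the two bits of arithmetic it relies on: how a selector is built from
an index and the table indicator, and what limit LSL should report when
limit_in_pages is set.  The helper names make_selector() and
expected_limit() are made up for illustration; the test itself
open-codes both.

```c
#include <stdint.h>

/*
 * Selector layout: descriptor index in bits 3 and up, table indicator
 * in bit 2 (1 = LDT, 0 = GDT), requested privilege level 3 in bits 0-1.
 */
static uint32_t make_selector(uint16_t index, int ldt)
{
	return ((uint32_t)index << 3) | ((uint32_t)(ldt ? 1 : 0) << 2) | 3;
}

/*
 * With limit_in_pages set, the limit counts 4096-byte pages, so the
 * byte-granular limit covers the whole last page as well.
 */
static uint32_t expected_limit(uint32_t limit, int limit_in_pages)
{
	return limit_in_pages ? (limit << 12) + 4095 : limit;
}
```

So LDT entry 2 is reached through selector 0x17, and a descriptor
installed with limit 10 and limit_in_pages set should read back with a
limit of 0xAFFF.
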
 tools/testing/selftests/x86/Makefile  |   2 +-
 tools/testing/selftests/x86/ldt_gdt.c | 492 ++++++++++++++++++++++++++++++++++
 2 files changed, 493 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/ldt_gdt.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index caa60d56d7d1..4138387b892c 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -4,7 +4,7 @@ include ../lib.mk
 
 .PHONY: all all_32 all_64 warn_32bit_failure clean
 
-TARGETS_C_BOTHBITS := sigreturn single_step_syscall sysret_ss_attrs
+TARGETS_C_BOTHBITS := sigreturn single_step_syscall sysret_ss_attrs ldt_gdt
 TARGETS_C_32BIT_ONLY := entry_from_vm86
 
 TARGETS_C_32BIT_ALL := $(TARGETS_C_BOTHBITS) $(TARGETS_C_32BIT_ONLY)
diff --git a/tools/testing/selftests/x86/ldt_gdt.c b/tools/testing/selftests/x86/ldt_gdt.c
new file mode 100644
index 000000000000..7723a12d42e1
--- /dev/null
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -0,0 +1,492 @@
+/*
+ * ldt_gdt.c - Test cases for LDT and GDT access
+ * Copyright (c) 2015 Andrew Lutomirski
+ */
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <signal.h>
+#include <setjmp.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <asm/ldt.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <stdbool.h>
+#include <pthread.h>
+#include <sched.h>
+#include <linux/futex.h>
+
+#define AR_ACCESSED		(1<<8)
+
+#define AR_TYPE_RODATA		(0 * (1<<9))
+#define AR_TYPE_RWDATA		(1 * (1<<9))
+#define AR_TYPE_RODATA_EXPDOWN	(2 * (1<<9))
+#define AR_TYPE_RWDATA_EXPDOWN	(3 * (1<<9))
+#define AR_TYPE_XOCODE		(4 * (1<<9))
+#define AR_TYPE_XRCODE		(5 * (1<<9))
+#define AR_TYPE_XOCODE_CONF	(6 * (1<<9))
+#define AR_TYPE_XRCODE_CONF	(7 * (1<<9))
+
+#define AR_DPL3			(3 * (1<<13))
+
+#define AR_S			(1 << 12)
+#define AR_P			(1 << 15)
+#define AR_AVL			(1 << 20)
+#define AR_L			(1 << 21)
+#define AR_DB			(1 << 22)
+#define AR_G			(1 << 23)
+
+static int nerrs;
+
+static void check_invalid_segment(uint16_t index, int ldt)
+{
+	uint32_t has_limit = 0, has_ar = 0, limit, ar;
+	uint32_t selector = (index << 3) | (ldt << 2) | 3;
+
+	asm ("lsl %[selector], %[limit]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_limit]\n\t"
+	     "1:"
+	     : [limit] "=r" (limit), [has_limit] "+rm" (has_limit)
+	     : [selector] "r" (selector));
+	asm ("larl %[selector], %[ar]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_ar]\n\t"
+	     "1:"
+	     : [ar] "=r" (ar), [has_ar] "+rm" (has_ar)
+	     : [selector] "r" (selector));
+
+	if (has_limit || has_ar) {
+		printf("[FAIL]\t%s entry %hu is valid but should be invalid\n",
+		       (ldt ? "LDT" : "GDT"), index);
+		nerrs++;
+	} else {
+		printf("[OK]\t%s entry %hu is invalid\n",
+		       (ldt ? "LDT" : "GDT"), index);
+	}
+}
+
+static void check_valid_segment(uint16_t index, int ldt,
+				uint32_t expected_ar, uint32_t expected_limit)
+{
+	uint32_t has_limit = 0, has_ar = 0, limit, ar;
+	uint32_t selector = (index << 3) | (ldt << 2) | 3;
+
+	asm ("lsl %[selector], %[limit]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_limit]\n\t"
+	     "1:"
+	     : [limit] "=r" (limit), [has_limit] "+rm" (has_limit)
+	     : [selector] "r" (selector));
+	asm ("larl %[selector], %[ar]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_ar]\n\t"
+	     "1:"
+	     : [ar] "=r" (ar), [has_ar] "+rm" (has_ar)
+	     : [selector] "r" (selector));
+
+	if (!has_limit || !has_ar) {
+		printf("[FAIL]\t%s entry %hu is invalid but should be valid\n",
+		       (ldt ? "LDT" : "GDT"), index);
+		nerrs++;
+		return;
+	}
+
+	if (ar != expected_ar) {
+		printf("[FAIL]\t%s entry %hu has AR 0x%08X but expected 0x%08X\n",
+		       (ldt ? "LDT" : "GDT"), index, ar, expected_ar);
+		nerrs++;
+	} else if (limit != expected_limit) {
+		printf("[FAIL]\t%s entry %hu has limit 0x%08X but expected 0x%08X\n",
+		       (ldt ? "LDT" : "GDT"), index, limit, expected_limit);
+		nerrs++;
+	} else {
+		printf("[OK]\t%s entry %hu has AR 0x%08X and limit 0x%08X\n",
+		       (ldt ? "LDT" : "GDT"), index, ar, limit);
+	}
+}
+
+static bool install_valid_mode(const struct user_desc *desc, uint32_t ar,
+			       bool oldmode)
+{
+	int ret = syscall(SYS_modify_ldt, oldmode ? 1 : 0x11,
+			  desc, sizeof(*desc));
+	if (ret < -1)
+		errno = -ret;
+	if (ret == 0) {
+		uint32_t limit = desc->limit;
+		if (desc->limit_in_pages)
+			limit = (limit << 12) + 4095;
+		check_valid_segment(desc->entry_number, 1, ar, limit);
+		return true;
+	} else if (errno == ENOSYS) {
+		printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+		return false;
+	} else {
+		if (desc->seg_32bit) {
+			printf("[FAIL]\tUnexpected modify_ldt failure %d\n",
+			       errno);
+			nerrs++;
+			return false;
+		} else {
+			printf("[OK]\tmodify_ldt rejected 16 bit segment\n");
+			return false;
+		}
+	}
+}
+
+static bool install_valid(const struct user_desc *desc, uint32_t ar)
+{
+	return install_valid_mode(desc, ar, false);
+}
+
+static void install_invalid(const struct user_desc *desc, bool oldmode)
+{
+	int ret = syscall(SYS_modify_ldt, oldmode ? 1 : 0x11,
+			  desc, sizeof(*desc));
+	if (ret < -1)
+		errno = -ret;
+	if (ret == 0) {
+		check_invalid_segment(desc->entry_number, 1);
+	} else if (errno == ENOSYS) {
+		printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+	} else {
+		if (desc->seg_32bit) {
+			printf("[FAIL]\tUnexpected modify_ldt failure %d\n",
+			       errno);
+			nerrs++;
+		} else {
+			printf("[OK]\tmodify_ldt rejected 16 bit segment\n");
+		}
+	}
+}
+
+static int safe_modify_ldt(int func, struct user_desc *ptr,
+			   unsigned long bytecount)
+{
+	int ret = syscall(SYS_modify_ldt, func, ptr, bytecount);
+	if (ret < -1)
+		errno = -ret;
+	return ret;
+}
+
+static void fail_install(struct user_desc *desc)
+{
+	if (safe_modify_ldt(0x11, desc, sizeof(*desc)) == 0) {
+		printf("[FAIL]\tmodify_ldt accepted a bad descriptor\n");
+		nerrs++;
+	} else if (errno == ENOSYS) {
+		printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+	} else {
+		printf("[OK]\tmodify_ldt failure %d\n", errno);
+	}
+}
+
+static void do_simple_tests(void)
+{
+	struct user_desc desc = {
+		.entry_number    = 0,
+		.base_addr       = 0,
+		.limit           = 10,
+		.seg_32bit       = 1,
+		.contents        = 2, /* Code, not conforming */
+		.read_exec_only  = 0,
+		.limit_in_pages  = 0,
+		.seg_not_present = 0,
+		.useable         = 0
+	};
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE | AR_S | AR_P | AR_DB);
+
+	desc.limit_in_pages = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G);
+
+	check_invalid_segment(1, 1);
+
+	desc.entry_number = 2;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G);
+
+	check_invalid_segment(1, 1);
+
+	desc.base_addr = 0xf0000000;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G);
+
+	desc.useable = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G | AR_AVL);
+
+	desc.seg_not_present = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.seg_32bit = 0;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_G | AR_AVL);
+
+	desc.seg_32bit = 1;
+	desc.contents = 0;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.read_exec_only = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.contents = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA_EXPDOWN |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.read_exec_only = 0;
+	desc.limit_in_pages = 0;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA_EXPDOWN |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.contents = 3;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE_CONF |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.read_exec_only = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE_CONF |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.read_exec_only = 0;
+	desc.contents = 2;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.read_exec_only = 1;
+
+#ifdef __x86_64__
+	desc.lm = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE |
+		      AR_S | AR_DB | AR_AVL);
+	desc.lm = 0;
+#endif
+
+	bool entry1_okay = install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE |
+					 AR_S | AR_DB | AR_AVL);
+
+	if (entry1_okay) {
+		printf("[RUN]\tTest fork\n");
+		pid_t child = fork();
+		if (child == 0) {
+			nerrs = 0;
+			check_valid_segment(desc.entry_number, 1,
+					    AR_DPL3 | AR_TYPE_XOCODE |
+					    AR_S | AR_DB | AR_AVL, desc.limit);
+			check_invalid_segment(1, 1);
+			exit(nerrs ? 1 : 0);
+		} else {
+			int status;
+			if (waitpid(child, &status, 0) != child ||
+			    !WIFEXITED(status)) {
+				printf("[FAIL]\tChild died\n");
+				nerrs++;
+			} else if (WEXITSTATUS(status) != 0) {
+				printf("[FAIL]\tChild failed\n");
+				nerrs++;
+			} else {
+				printf("[OK]\tChild succeeded\n");
+			}
+		}
+	} else {
+		printf("[SKIP]\tSkipping fork test because have no LDT\n");
+	}
+
+	/* Test entry_number too high. */
+	desc.entry_number = 100000;
+	fail_install(&desc);
+
+	/* Test deletion and actions mistakeable for deletion. */
+	memset(&desc, 0, sizeof(desc));
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S | AR_P);
+
+	desc.seg_not_present = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S);
+
+	desc.seg_not_present = 0;
+	desc.read_exec_only = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S | AR_P);
+
+	desc.read_exec_only = 0;
+	desc.seg_not_present = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S);
+
+	desc.read_exec_only = 1;
+	desc.limit = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S);
+
+	desc.limit = 0;
+	desc.base_addr = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S);
+
+	desc.base_addr = 0;
+	install_invalid(&desc, false);
+
+	desc.seg_not_present = 0;
+	desc.read_exec_only = 0;
+	desc.seg_32bit = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S | AR_P | AR_DB);
+	install_invalid(&desc, true);
+}
+
+/*
+ * 0: thread is idle
+ * 1: thread armed
+ * 2: thread should clear LDT entry 0
+ * 3: thread should exit
+ */
+static volatile unsigned int ftx;
+
+static void *threadproc(void *ctx)
+{
+	cpu_set_t cpuset;
+	CPU_ZERO(&cpuset);
+	CPU_SET(1, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
+		err(1, "sched_setaffinity to CPU 1");	/* should never fail */
+
+	while (1) {
+		syscall(SYS_futex, &ftx, FUTEX_WAIT, 0, NULL, NULL, 0);
+		while (ftx != 2) {
+			if (ftx == 3)
+				return NULL;
+		}
+
+		/* clear LDT entry 0 */
+		const struct user_desc desc = {};
+		if (syscall(SYS_modify_ldt, 1, &desc, sizeof(desc)) != 0)
+			err(1, "modify_ldt");
+
+		ftx = 0;
+	}
+}
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+		       int flags)
+{
+	struct sigaction sa;
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_sigaction = handler;
+	sa.sa_flags = SA_SIGINFO | flags;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+
+}
+
+static sigjmp_buf jmpbuf;
+
+static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
+{
+	siglongjmp(jmpbuf, 1);
+}
+
+static void do_multicpu_tests(void)
+{
+	cpu_set_t cpuset;
+	pthread_t thread;
+	int failures = 0, iters = 5, i;
+	unsigned short orig_ss;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(1, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
+		printf("[SKIP]\tCannot set affinity to CPU 1\n");
+		return;
+	}
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
+		printf("[SKIP]\tCannot set affinity to CPU 0\n");
+		return;
+	}
+
+	sethandler(SIGSEGV, sigsegv, 0);
+
+	printf("[RUN]\tCross-CPU LDT invalidation\n");
+
+	if (pthread_create(&thread, 0, threadproc, 0) != 0)
+		err(1, "pthread_create");
+
+	asm volatile ("mov %%ss, %0" : "=rm" (orig_ss));
+
+	for (i = 0; i < iters; i++) {
+		if (sigsetjmp(jmpbuf, 1) != 0)
+			continue;
+
+		/* Make sure the thread is ready after the last test. */
+		while (ftx != 0)
+			;
+
+		struct user_desc desc = {
+			.entry_number    = 0,
+			.base_addr       = 0,
+			.limit           = 0xfffff,
+			.seg_32bit       = 1,
+			.contents        = 0, /* Data */
+			.read_exec_only  = 0,
+			.limit_in_pages  = 1,
+			.seg_not_present = 0,
+			.useable         = 0
+		};
+
+		if (safe_modify_ldt(0x11, &desc, sizeof(desc)) != 0) {
+			if (errno != ENOSYS)
+				err(1, "modify_ldt");
+			printf("[SKIP]\tmodify_ldt unavailable\n");
+			break;
+		}
+
+		/* Arm the thread. */
+		ftx = 1;
+		syscall(SYS_futex, &ftx, FUTEX_WAKE, 0, NULL, NULL, 0);
+
+		asm volatile ("mov %0, %%ss" : : "r" (0x7));
+
+		/* Go! */
+		ftx = 2;
+
+		while (ftx != 0)
+			;
+
+		/*
+		 * On success, modify_ldt will segfault us synchronously,
+		 * and we'll escape via siglongjmp.
+		 */
+
+		failures++;
+		asm volatile ("mov %0, %%ss" : : "rm" (orig_ss));
+	};
+
+	ftx = 3;
+	syscall(SYS_futex, &ftx, FUTEX_WAKE, 0, NULL, NULL, 0);
+
+	if (pthread_join(thread, NULL) != 0)
+		err(1, "pthread_join");
+
+	if (failures) {
+		printf("[FAIL]\t%d of %d iterations failed\n", failures, iters);
+		nerrs++;
+	} else {
+		printf("[OK]\tAll %d iterations succeeded\n", iters);
+	}
+}
+
+int main(void)
+{
+	do_simple_tests();
+
+	do_multicpu_tests();
+
+	return nerrs ? 1 : 0;
+}
-- 
2.4.3
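
For readers following the test: the selector loaded into SS in the cross-CPU test (0x7) and the limits handed to check_valid_segment follow from two small rules. The sketch below is illustrative only and is not part of the patch. make_selector and effective_limit are hypothetical helper names, and the page-granular formula assumes the usual x86 rule that a granular limit covers whole 4 KiB pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Build a segment selector (hypothetical helper, not in the patch):
 * descriptor index in bits 15:3, table indicator in bit 2 (1 = LDT),
 * requested privilege level in bits 1:0.
 */
static uint16_t make_selector(uint16_t index, bool ldt, unsigned int rpl)
{
	assert(index < 8192 && rpl < 4);
	return (uint16_t)((index << 3) | (ldt ? 4u : 0u) | rpl);
}

/*
 * Limit that LSL should report for a user_desc-style limit
 * (hypothetical helper). With limit_in_pages set, the limit counts
 * 4 KiB pages, so the last valid byte offset is (limit << 12) + 4095.
 */
static uint32_t effective_limit(uint32_t limit, bool limit_in_pages)
{
	return limit_in_pages ? (limit << 12) + 4095u : limit;
}
```

For example, make_selector(0, true, 3) gives 0x7, the value the multi-CPU test moves into SS, and effective_limit(0xfffff, true) gives 0xffffffff, i.e. a flat 4 GiB segment.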


^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 2/3] x86/ldt: Make modify_ldt optional
  2015-07-25  5:36 ` Andy Lutomirski
@ 2015-07-25  6:23   ` Willy Tarreau
  2015-07-25  6:44     ` Andy Lutomirski
  2015-07-25  6:44     ` Andy Lutomirski
  2015-07-25  6:23   ` Willy Tarreau
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 130+ messages in thread
From: Willy Tarreau @ 2015-07-25  6:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Steven Rostedt, security, X86 ML,
	Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel

On Fri, Jul 24, 2015 at 10:36:45PM -0700, Andy Lutomirski wrote:
> The modify_ldt syscall exposes a large attack surface and is
> unnecessary for modern userspace.  Make it optional.

Andy, you didn't say whether you think it would be better to make
it runtime-configurable instead. The goal here is to ensure distros
ship with modify_ldt disabled by default. But if that means breaking
compatibility with (rare) existing applications, I see a risk
that they'll ship with it enabled instead, which would make the config
option useless. CONFIG_DEFAULT_MMAP_MIN_ADDR was a good example of
a successfully deployed hardening measure: it was widely
adopted despite its (low) risk of breakage in the field because it was
adjustable in the field.

That's why I think we should do the same here, and possibly even
emit a one-time warning reporting the first user of modify_ldt, if that
can help.
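
A rough user-space sketch of that idea, assuming a boolean runtime switch plus a one-time warning. It is not kernel code, and every name in it (modify_ldt_enabled, gated_modify_ldt, warnings_emitted) is hypothetical. Returning -ENOSYS while the switch is off is also an assumption, picked because the selftest in patch 3/3 already treats ENOSYS as "modify_ldt unavailable".

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical runtime switch; a real patch might expose it as a sysctl. */
static bool modify_ldt_enabled;	/* off by default */
static bool warned_once;	/* set after the first denied caller */
static int warnings_emitted;	/* only here so the sketch can be checked */

/*
 * Hypothetical gate in front of the syscall body: deny with -ENOSYS while
 * the switch is off, and report only the first caller that was denied.
 */
static int gated_modify_ldt(const char *comm)
{
	if (!modify_ldt_enabled) {
		if (!warned_once) {
			warned_once = true;
			warnings_emitted++;
			fprintf(stderr,
				"modify_ldt denied for %s (runtime switch is off)\n",
				comm);
		}
		return -ENOSYS;
	}

	return 0;	/* the real modify_ldt work would happen here */
}
```

With the switch off, repeated calls keep failing with -ENOSYS but only the first one logs; flipping modify_ldt_enabled to true lets later calls through.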

What do you think?

Willy


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-25  5:36 [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option Andy Lutomirski
                   ` (6 preceding siblings ...)
  2015-07-25  6:27 ` [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option Willy Tarreau
@ 2015-07-25  6:27 ` Willy Tarreau
  2015-07-27 15:36 ` Boris Ostrovsky
  2015-07-27 15:36 ` Boris Ostrovsky
  9 siblings, 0 replies; 130+ messages in thread
From: Willy Tarreau @ 2015-07-25  6:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Steven Rostedt, security, X86 ML,
	Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel

On Fri, Jul 24, 2015 at 10:36:43PM -0700, Andy Lutomirski wrote:
> Willy and Kees: I left the config option alone.  The -tiny people will
> like it, and we can always add a sysctl of some sort later.

OK, please ignore my other e-mail; I missed this part. I'll see if I
can propose a complementary sysctl on top of this so that we can
hope for wider deployment ASAP.

Willy


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 2/3] x86/ldt: Make modify_ldt optional
  2015-07-25  6:23   ` Willy Tarreau
@ 2015-07-25  6:44     ` Andy Lutomirski
  2015-07-25  7:50       ` Willy Tarreau
  2015-07-25  7:50       ` [PATCH v4 2/3] x86/ldt: Make modify_ldt optional Willy Tarreau
  2015-07-25  6:44     ` Andy Lutomirski
  1 sibling, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-25  6:44 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel

On Fri, Jul 24, 2015 at 11:23 PM, Willy Tarreau <w@1wt.eu> wrote:
> On Fri, Jul 24, 2015 at 10:36:45PM -0700, Andy Lutomirski wrote:
>> The modify_ldt syscall exposes a large attack surface and is
>> unnecessary for modern userspace.  Make it optional.
>
> Andy, you didn't say whether you think it would be better to make
> it runtime-configurable instead. The goal here is to ensure distros
> ship with modify_ldt disabled by default. But if that means breaking
> compatibility with (rare) existing applications, I see a risk
> that they'll ship with it enabled instead, which would make the config
> option useless. CONFIG_DEFAULT_MMAP_MIN_ADDR was a good example of
> a successfully deployed hardening measure: it was widely
> adopted despite its (low) risk of breakage in the field because it was
> adjustable in the field.

I'm all for it, but I think it should be hard-disablable in config,
too, for the -tiny people.  If we add a runtime disable, let's do a
separate patch, and you and Kees can fight over how general it should
be.

>
> That's why here I think we should do the same, and possibly even
> emit a warning once to report the first user of modify_ldt if that
> can help.
>
> What do you think?

I'm generally in favor.

On the other hand, the current series is already written, might even
be compatible with Xen, and patch 1 at least fixes a real bug.  Maybe
several real bugs.

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 2/3] x86/ldt: Make modify_ldt optional
  2015-07-25  6:44     ` Andy Lutomirski
@ 2015-07-25  7:50       ` Willy Tarreau
  2015-07-25 13:03         ` [PATCH 4/3] x86/ldt: allow to disable modify_ldt at runtime Willy Tarreau
  2015-07-25 13:03         ` Willy Tarreau
  2015-07-25  7:50       ` [PATCH v4 2/3] x86/ldt: Make modify_ldt optional Willy Tarreau
  1 sibling, 2 replies; 130+ messages in thread
From: Willy Tarreau @ 2015-07-25  7:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel

On Fri, Jul 24, 2015 at 11:44:52PM -0700, Andy Lutomirski wrote:
> I'm all for it, but I think it should be hard-disablable in config,
> too, for the -tiny people.

I totally agree.

> If we add a runtime disable, let's do a
> separate patch, and you and Kees can fight over how general it should
> be.

Initially I was thinking about changing it to a 3-state option, but
that would prevent X86_16BIT from being hard-disablable, so I'll do
something completely separate.

> > That's why here I think we should do the same, and possibly even
> > emit a warning once to report the first user of modify_ldt if that
> > can help.
> >
> > What do you think?
> 
> I'm generally in favor.

OK.

> On the other hand, the current series is already written, might even
> be compatible with Xen, and patch 1 at least fixes a real bug.  Maybe
> several real bugs.

That's my guess as well given how hard it seems for everyone in this
long thread to imagine all possible bugs we can face :-/

Thanks,
Willy


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 1/3] x86/ldt: Make modify_ldt synchronous
  2015-07-25  5:36 ` [PATCH v4 1/3] x86/ldt: Make modify_ldt synchronous Andy Lutomirski
@ 2015-07-25  9:03   ` Borislav Petkov
  2015-07-25  9:03   ` Borislav Petkov
  1 sibling, 0 replies; 130+ messages in thread
From: Borislav Petkov @ 2015-07-25  9:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Steven Rostedt, security, X86 ML, Sasha Levin,
	linux-kernel, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Andrew Cooper, Jan Beulich, xen-devel, stable

On Fri, Jul 24, 2015 at 10:36:44PM -0700, Andy Lutomirski wrote:
> modify_ldt has questionable locking and does not synchronize
> threads.  Improve it: redesign the locking and synchronize all
> threads' LDTs using an IPI on all modifications.
> 
> This will dramatically slow down modify_ldt in multithreaded
> programs, but there shouldn't be any multithreaded programs that
> care about modify_ldt's performance in the first place.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 2/3] x86/ldt: Make modify_ldt optional
  2015-07-25  5:36 ` Andy Lutomirski
                     ` (2 preceding siblings ...)
  2015-07-25  9:15   ` Borislav Petkov
@ 2015-07-25  9:15   ` Borislav Petkov
  2015-07-25 16:03     ` Andy Lutomirski
  2015-07-25 16:03     ` Andy Lutomirski
  3 siblings, 2 replies; 130+ messages in thread
From: Borislav Petkov @ 2015-07-25  9:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Steven Rostedt, security, X86 ML, Sasha Levin,
	linux-kernel, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Andrew Cooper, Jan Beulich, xen-devel

On Fri, Jul 24, 2015 at 10:36:45PM -0700, Andy Lutomirski wrote:
> The modify_ldt syscall exposes a large attack surface and is
> unnecessary for modern userspace.  Make it optional.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/Kconfig                   | 17 +++++++++++++++++
>  arch/x86/include/asm/mmu.h         |  2 ++
>  arch/x86/include/asm/mmu_context.h | 31 +++++++++++++++++++++++--------
>  arch/x86/kernel/Makefile           |  3 ++-
>  arch/x86/kernel/cpu/perf_event.c   |  4 ++++
>  arch/x86/kernel/process_64.c       |  2 ++
>  arch/x86/kernel/step.c             |  2 ++
>  kernel/sys_ni.c                    |  1 +
>  8 files changed, 53 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b3a1a5d77d92..ede52be845db 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1015,6 +1015,7 @@ config VM86
>  config X86_16BIT
>  	bool "Enable support for 16-bit segments" if EXPERT
>  	default y
> +	depends on MODIFY_LDT_SYSCALL
>  	---help---
>  	  This option is required by programs like Wine to run 16-bit
>  	  protected mode legacy code on x86 processors.  Disabling
> @@ -2053,6 +2054,22 @@ config CMDLINE_OVERRIDE
>  	  This is used to work around broken boot loaders.  This should
>  	  be set to 'N' under normal conditions.
>  
> +config MODIFY_LDT_SYSCALL
> +       bool "Enable the LDT (local descriptor table)" if EXPERT

	bool "Enable modify_ldt() for per-process Local Descriptor Table"

is how I'd call it.

> +       default y

Is that "default y" going to turn into a "default n" after a grace
period?

> +       ---help---
> +         Linux can allow user programs to install a per-process x86
> +	 Local Descriptor Table (LDT) using the modify_ldt(2) system
> +	 call.  This is required to run 16-bit or segmented code such as
> +	 DOSEMU or some Wine programs.  It is also used by some very old
> +	 threading libraries.
> +
> +	 Enabling this feature adds a small amount of overhead to
> +	 context switches and increases the low-level kernel attack
> +	 surface.  Disabling it removes the modify_ldt(2) system call.
> +
> +	 Saying 'N' here may make sense for embedded or server kernels.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 130+ messages in thread

* [PATCH 4/3] x86/ldt: allow to disable modify_ldt at runtime
  2015-07-25  7:50       ` Willy Tarreau
  2015-07-25 13:03         ` [PATCH 4/3] x86/ldt: allow to disable modify_ldt at runtime Willy Tarreau
@ 2015-07-25 13:03         ` Willy Tarreau
  2015-07-25 16:08           ` Andy Lutomirski
                             ` (3 more replies)
  1 sibling, 4 replies; 130+ messages in thread
From: Willy Tarreau @ 2015-07-25 13:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel, Kees Cook

[-- Attachment #1: Type: text/plain, Size: 1249 bytes --]

On Sat, Jul 25, 2015 at 09:50:52AM +0200, Willy Tarreau wrote:
> On Fri, Jul 24, 2015 at 11:44:52PM -0700, Andy Lutomirski wrote:
> > I'm all for it, but I think it should be hard-disablable in config,
> > too, for the -tiny people.
> 
> I totally agree.
> 
> > If we add a runtime disable, let's do a
> > separate patch, and you and Kees can fight over how general it should
> > be.
> 
> Initially I was thinking about changing it for a 3-state option but
> that would prevent X86_16BIT from being hard-disablable, so I'll do
> something completely separate.

So here comes the proposed patch. It adds a default setting for the
sysctl when the option is not hard-disabled (eg: distros not wanting
to take risks with legacy apps). It suggests to leave the option off.
In case a syscall is blocked, a printk_ratelimited() is called with
relevant info (program name, pid, uid) so that the admin can decide
whether it's a legitimate call or not. Eg:

  Denied a call to modify_ldt() from a.out[1736] (uid: 100). Adjust sysctl if this was not an exploit attempt.

I personally think it completes well your series, hence the 4/3 numbering.
Feel free to adopt it if you cycle another round and if you're OK with it
of course.
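
For anyone who wants to check the behaviour from userspace, here is a minimal sketch (not part of the patch). It assumes an x86 Linux build where SYS_modify_ldt is defined; the names ldt_status, classify_ldt_result() and probe_modify_ldt() are made up for illustration. Since the denied path in the patch returns -ENOSYS, a caller cannot tell a compiled-out syscall from one that was denied by the sysctl.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical result type for this sketch. */
enum ldt_status { LDT_AVAILABLE, LDT_DISABLED, LDT_OTHER_ERROR };

/*
 * Map a raw syscall result to a status.  ENOSYS covers both the
 * compiled-out case and the runtime-denied case, because the patch
 * returns -ENOSYS when the sysctl is 0.
 */
static enum ldt_status classify_ldt_result(long ret, int err)
{
	if (ret >= 0)
		return LDT_AVAILABLE;
	return err == ENOSYS ? LDT_DISABLED : LDT_OTHER_ERROR;
}

/* Probe with func 0 (read the LDT), which does not modify anything. */
static enum ldt_status probe_modify_ldt(void)
{
#ifdef SYS_modify_ldt
	unsigned char buf[8];
	long ret;

	errno = 0;
	ret = syscall(SYS_modify_ldt, 0, buf, sizeof(buf));
	return classify_ldt_result(ret, errno);
#else
	/* Assumption: no syscall number means a non-x86 build. */
	return LDT_DISABLED;
#endif
}
```

With the patch applied and the sysctl set to 0, a call to probe_modify_ldt() should also leave one rate-limited "Denied a call to modify_ldt()" line in the kernel log, since the check sits in front of the func switch and so covers reads as well.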

CCing Kees as well.

Willy


[-- Attachment #2: 0001-x86-ldt-allow-to-disable-modify_ldt-at-runtime.patch --]
[-- Type: text/plain, Size: 5342 bytes --]

From 93cadf50b56a1f2f1e43137503edc1242f8476a7 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Sat, 25 Jul 2015 12:18:33 +0200
Subject: x86/ldt: allow to disable modify_ldt at runtime

For distros who prefer not to take the risk of completely disabling the
modify_ldt syscall using CONFIG_MODIFY_LDT_SYSCALL, this patch adds a
sysctl to enable or disable it at runtime, and proposes to disable it
by default. This can be a safe alternative. A message is logged if an
attempt was stopped so that it's easy to spot if/when it is needed.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 Documentation/sysctl/kernel.txt | 15 +++++++++++++++
 arch/x86/Kconfig                | 17 +++++++++++++++++
 arch/x86/kernel/ldt.c           | 15 +++++++++++++++
 kernel/sysctl.c                 | 12 ++++++++++++
 4 files changed, 59 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 6fccb69..60c7c7a 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -41,6 +41,7 @@ show up in /proc/sys/kernel:
 - kptr_restrict
 - kstack_depth_to_print       [ X86 only ]
 - l2cr                        [ PPC only ]
+- modify_ldt                  [ X86 only ]
 - modprobe                    ==> Documentation/debugging-modules.txt
 - modules_disabled
 - msg_next_id		      [ sysv ipc ]
@@ -391,6 +392,20 @@ This flag controls the L2 cache of G3 processor boards. If
 
 ==============================================================
 
+modify_ldt: (X86 only)
+
+Enables (1) or disables (0) the modify_ldt syscall. Modifying the LDT
+(Local Descriptor Table) may be needed to run 16-bit or segmented code
+such as Dosemu or Wine. This is done via a system call which is not needed
+to run portable applications, and which can sometimes be abused to exploit
+some weaknesses of the architecture, opening new vulnerabilities.
+
+This sysctl allows one to increase the system's security by disabling the
+system call, or to restore compatibility with specific applications when it
+was already disabled.
+
+==============================================================
+
 modules_disabled:
 
 A toggle value indicating if modules are allowed to be loaded
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ede52be..37f83d6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2069,6 +2069,23 @@ config MODIFY_LDT_SYSCALL
 	 surface.  Disabling it removes the modify_ldt(2) system call.
 
 	 Saying 'N' here may make sense for embedded or server kernels.
+	 If really unsure, say 'Y', you'll be able to disable it at runtime.
+
+config DEFAULT_MODIFY_LDT_SYSCALL
+	bool "Allow userspace to modify the LDT by default"
+	depends on MODIFY_LDT_SYSCALL
+	default y
+	---help---
+	  Modifying the LDT (Local Descriptor Table) may be needed to run
+	  16-bit or segmented code such as Dosemu or Wine. This is done via
+	  a system call which is not needed to run portable applications,
+	  and which can sometimes be abused to exploit some weaknesses of
+	  the architecture, opening new vulnerabilities.
+
+	  For this reason this option allows one to enable or disable the
+	  feature at runtime. It is recommended to say 'N' here to leave
+	  the system protected, and to enable it at runtime only if needed
+	  by setting the kernel.modify_ldt sysctl.
 
 source "kernel/livepatch/Kconfig"
 
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 2bcc052..cb64b85 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -11,6 +11,7 @@
 #include <linux/sched.h>
 #include <linux/string.h>
 #include <linux/mm.h>
+#include <linux/ratelimit.h>
 #include <linux/smp.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
@@ -21,6 +22,11 @@
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
 
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+int sysctl_modify_ldt __read_mostly =
+	IS_ENABLED(CONFIG_DEFAULT_MODIFY_LDT_SYSCALL);
+#endif
+
 /* context.lock is held for us, so we don't need any locking. */
 static void flush_ldt(void *current_mm)
 {
@@ -276,6 +282,15 @@ asmlinkage int sys_modify_ldt(int func, void __user *ptr,
 {
 	int ret = -ENOSYS;
 
+	if (!sysctl_modify_ldt) {
+		printk_ratelimited(KERN_INFO
+			"Denied a call to modify_ldt() from %s[%d] (uid: %d)."
+			" Adjust sysctl if this was not an exploit attempt.\n",
+			current->comm, task_pid_nr(current),
+			from_kuid_munged(current_user_ns(), current_uid()));
+		return ret;
+	}
+
 	switch (func) {
 	case 0:
 		ret = read_ldt(ptr, bytecount);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 19b62b5..3dcf8e4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -111,6 +111,9 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+extern int sysctl_modify_ldt;
+#endif
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -960,6 +963,15 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+	{
+		.procname	= "modify_ldt",
+		.data		= &sysctl_modify_ldt,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 #endif
 #if defined(CONFIG_MMU)
 	{
-- 
1.7.12.1


^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 2/3] x86/ldt: Make modify_ldt optional
  2015-07-25  9:15   ` Borislav Petkov
  2015-07-25 16:03     ` Andy Lutomirski
@ 2015-07-25 16:03     ` Andy Lutomirski
  2015-07-25 16:35       ` Willy Tarreau
  2015-07-25 16:35       ` Willy Tarreau
  1 sibling, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-25 16:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Sasha Levin, linux-kernel, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Andrew Cooper, Jan Beulich, xen-devel

On Sat, Jul 25, 2015 at 2:15 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jul 24, 2015 at 10:36:45PM -0700, Andy Lutomirski wrote:
>> The modify_ldt syscall exposes a large attack surface and is
>> unnecessary for modern userspace.  Make it optional.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/Kconfig                   | 17 +++++++++++++++++
>>  arch/x86/include/asm/mmu.h         |  2 ++
>>  arch/x86/include/asm/mmu_context.h | 31 +++++++++++++++++++++++--------
>>  arch/x86/kernel/Makefile           |  3 ++-
>>  arch/x86/kernel/cpu/perf_event.c   |  4 ++++
>>  arch/x86/kernel/process_64.c       |  2 ++
>>  arch/x86/kernel/step.c             |  2 ++
>>  kernel/sys_ni.c                    |  1 +
>>  8 files changed, 53 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index b3a1a5d77d92..ede52be845db 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1015,6 +1015,7 @@ config VM86
>>  config X86_16BIT
>>       bool "Enable support for 16-bit segments" if EXPERT
>>       default y
>> +     depends on MODIFY_LDT_SYSCALL
>>       ---help---
>>         This option is required by programs like Wine to run 16-bit
>>         protected mode legacy code on x86 processors.  Disabling
>> @@ -2053,6 +2054,22 @@ config CMDLINE_OVERRIDE
>>         This is used to work around broken boot loaders.  This should
>>         be set to 'N' under normal conditions.
>>
>> +config MODIFY_LDT_SYSCALL
>> +       bool "Enable the LDT (local descriptor table)" if EXPERT
>
>         bool "Enable modify_ldt() for per-process Local Descriptor Table"
>
> is how I'd call it.

Okay with me.

>
>> +       default y
>
> Is that "default y" going to turn into a "default n" after a grace
> period?

Let's see how Willy's default-off sysctl plays out.  In the long run,
maybe we'll have it compiled in but runtime-disabled by default.
There's a big community of users who *really* like using Wine :)

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH 4/3] x86/ldt: allow to disable modify_ldt at runtime
  2015-07-25 13:03         ` Willy Tarreau
@ 2015-07-25 16:08           ` Andy Lutomirski
  2015-07-25 16:33             ` Willy Tarreau
  2015-07-25 16:33             ` Willy Tarreau
  2015-07-25 16:08           ` Andy Lutomirski
                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-25 16:08 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel, Kees Cook

On Sat, Jul 25, 2015 at 6:03 AM, Willy Tarreau <w@1wt.eu> wrote:
> On Sat, Jul 25, 2015 at 09:50:52AM +0200, Willy Tarreau wrote:
>> On Fri, Jul 24, 2015 at 11:44:52PM -0700, Andy Lutomirski wrote:
>> > I'm all for it, but I think it should be hard-disablable in config,
>> > too, for the -tiny people.
>>
>> I totally agree.
>>
>> > If we add a runtime disable, let's do a
>> > separate patch, and you and Kees can fight over how general it should
>> > be.
>>
>> Initially I was thinking about changing it for a 3-state option but
>> that would prevent X86_16BIT from being hard-disablable, so I'll do
>> something completely separate.
>
> So here comes the proposed patch. It adds a default setting for the
> sysctl when the option is not hard-disabled (eg: distros not wanting
> to take risks with legacy apps). It suggests to leave the option off.
> In case a syscall is blocked, a printk_ratelimited() is called with
> relevant info (program name, pid, uid) so that the admin can decide
> whether it's a legitimate call or not. Eg:
>
>   Denied a call to modify_ldt() from a.out[1736] (uid: 100). Adjust sysctl if this was not an exploit attempt.
>
> I personally think it completes well your series, hence the 4/3 numbering.
> Feel free to adopt it if you cycle another round and if you're OK with it
> of course.
>

There's one thing that I think is incomplete here.  Currently, espfix
triggers if SS points to the LDT.  It's possible for SS to point to
the LDT even with modify_ldt disabled, and there's a decent amount of
attack surface there.

Can we improve this?  Two ideas:

1. In the asm, patch out or otherwise disable espfix if that sysctl
has never been set.  (Ick.)

2. When modify_ldt is runtime-disabled (or compile-time disabled,
perhaps), disallow setting the LDT bit in SS in the handful of places
that would allow it (ptrace and sigreturn off the top of my head).  We
don't need to worry about (regs->ss & 4) being set on kernel entry
because we'll never be in user mode with that bit set if the LDT is
disabled, but that bit could still be set using kernel APIs.  (In
fact, my sigreturn test does exactly that.)

Hmm.  With synchronous LDT, we could plausibly check at runtime in the
espfix code, too.  We used to use LAR to do this, but hpa removed it
when he realized that it was racy.  It shouldn't be racy any more,
because, with my patches applied, the LDT never changes while
interrupts are off.
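
For readers following the (regs->ss & 4) test above, here is a minimal
sketch of the selector layout it relies on: bit 2 of an x86 segment
selector is the table indicator (set means LDT), the low two bits are
the RPL, and the rest is the descriptor index.  The helper names and
SEL_TI_LDT are hypothetical, chosen for illustration; only the bit
layout and the (index << 3) | (ldt << 2) | 3 construction (which the
selftest in this thread also uses) are taken as given.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 segment selector: bits 15..3 index, bit 2 TI, bits 1..0 RPL. */
#define SEL_TI_LDT 0x4u	/* hypothetical name for the "LDT bit" */

/* True if the selector refers to the LDT rather than the GDT. */
static inline bool selector_uses_ldt(uint16_t sel)
{
	return (sel & SEL_TI_LDT) != 0;
}

/* Descriptor table index encoded in the selector. */
static inline uint16_t selector_index(uint16_t sel)
{
	return sel >> 3;
}

/* Requested privilege level (3 for user mode). */
static inline unsigned int selector_rpl(uint16_t sel)
{
	return sel & 3;
}

/* Build an RPL-3 selector the same way the selftest does. */
static inline uint16_t make_user_selector(uint16_t index, bool ldt)
{
	return (uint16_t)((index << 3) | (ldt ? SEL_TI_LDT : 0) | 3);
}
```

With these helpers, make_user_selector(0, true) yields 0x7, the
selector for LDT entry 0, and selector_uses_ldt() is the check that
option 2 would apply to SS values coming in through ptrace or
sigreturn.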

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH 4/3] x86/ldt: allow to disable modify_ldt at runtime
  2015-07-25 16:08           ` Andy Lutomirski
  2015-07-25 16:33             ` Willy Tarreau
@ 2015-07-25 16:33             ` Willy Tarreau
  2015-07-25 17:42               ` Andy Lutomirski
  2015-07-25 17:42               ` Andy Lutomirski
  1 sibling, 2 replies; 130+ messages in thread
From: Willy Tarreau @ 2015-07-25 16:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel, Kees Cook

On Sat, Jul 25, 2015 at 09:08:39AM -0700, Andy Lutomirski wrote:
> There's one thing that I think is incomplete here.  Currently, espfix
> triggers if SS points to the LDT.  It's possible for SS to point to
> the LDT even with modify_ldt disabled, and there's a decent amount of
> attack surface there.
> 
> Can we improve this?  Two ideas:
> 
> 1. In the asm, patch out or otherwise disable espfix if that sysctl
> has never been set.  (Ick.)
> 
> 2. When modify_ldt is runtime-disabled (or compile-time disabled,
> perhaps), disallow setting the LDT bit in SS in the handful of places
> that would allow it (ptrace and sigreturn off the top of my head).  We
> don't need to worry about (regs->ss & 4) being set on kernel entry
> because we'll never be in user mode with that bit set if the LDT is
> disabled, but that bit could still be set using kernel APIs.  (In
> fact, my sigreturn test does exactly that.)
> 
> Hmm.  With synchronous LDT, we could plausibly check at runtime in the
> espfix code, too.  We used to use LAR to do this, but hpa removed it
> when he realized that it was racy.  It shouldn't be racy any more,
> because, with my patches applied, the LDT never changes while
> interrupts are off.

I understand it's not complete, but I'm a bit bothered by conflating
this sysctl with other setting methods: if the purpose of the sysctl
is to disable the syscall, it should do only that. I'd rather document
that it's less complete than the Kconfig method and continue to
recommend using your option whenever possible (e.g. all my kernels
will use it, just as I've already disabled X86_16BIT everywhere).

Also, one benefit of having both options is that it will mechanically
make the LDT a much less interesting target for future attacks, since
it will significantly reduce the likelihood of success, and hence the
motivation for writing exploits that only work at conferences.

Willy


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 2/3] x86/ldt: Make modify_ldt optional
  2015-07-25 16:03     ` Andy Lutomirski
  2015-07-25 16:35       ` Willy Tarreau
@ 2015-07-25 16:35       ` Willy Tarreau
  1 sibling, 0 replies; 130+ messages in thread
From: Willy Tarreau @ 2015-07-25 16:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Andy Lutomirski, Peter Zijlstra, Steven Rostedt,
	security, X86 ML, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel

On Sat, Jul 25, 2015 at 09:03:54AM -0700, Andy Lutomirski wrote:
> On Sat, Jul 25, 2015 at 2:15 AM, Borislav Petkov <bp@alien8.de> wrote:
> > Is that "default y" going to turn into a "default n" after a grace
> > period?
> 
> Let's see how Willy's default-off sysctl plays out.  In the long run,
> maybe we'll have it compiled in but runtime-disabled by default.

That's the purpose at least at the beginning.

> There's a big community of users who *really* like using Wine :)

If distro vendors are willing to document a sysctl setting in order
to be able to use Wine in exchange for better security, I'm sure most
users will still prefer to stay safe.
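
A program that depends on the LDT could probe for this at startup.
The following is only a sketch, under assumptions: modify_ldt_usable()
is a hypothetical helper, it issues a read-only call (func 0), and it
just reports whatever errno comes back.  ENOSYS is what the selftest
in this thread expects when the syscall is unavailable; which errno a
sysctl-disabled call returns is not stated in this part of the thread,
so the sketch does not assume one.  The "ret < -1" handling mirrors
the selftest.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Returns 1 if a read-only modify_ldt call succeeds, 0 otherwise.
 * On failure the error number is stored in *err; on success *err is 0.
 */
static int modify_ldt_usable(int *err)
{
#ifdef SYS_modify_ldt
	unsigned char buf[8];	/* room for one 8-byte descriptor */
	int ret = (int)syscall(SYS_modify_ldt, 0, buf, sizeof(buf));

	if (ret == -1)
		*err = errno;	/* normal libc error reporting */
	else if (ret < -1)
		*err = -ret;	/* raw negative errno, as in the selftest */
	else
		*err = 0;	/* ret is the number of bytes read */
	return ret >= 0;
#else
	*err = ENOSYS;		/* no modify_ldt syscall number on this arch */
	return 0;
#endif
}
```

A caller would check the return value once and, on failure, print the
stored error so the admin can tell a compiled-out syscall from one
that was denied at runtime.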

Willy


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH 4/3] x86/ldt: allow to disable modify_ldt at runtime
  2015-07-25 16:33             ` Willy Tarreau
@ 2015-07-25 17:42               ` Andy Lutomirski
  2015-07-25 18:45                 ` Willy Tarreau
  2015-07-25 18:45                 ` Willy Tarreau
  2015-07-25 17:42               ` Andy Lutomirski
  1 sibling, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-25 17:42 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel, Kees Cook

On Sat, Jul 25, 2015 at 9:33 AM, Willy Tarreau <w@1wt.eu> wrote:
> On Sat, Jul 25, 2015 at 09:08:39AM -0700, Andy Lutomirski wrote:
>> There's one thing that I think is incomplete here.  Currently, espfix
>> triggers if SS points to the LDT.  It's possible for SS to point to
>> the LDT even with modify_ldt disabled, and there's a decent amount of
>> attack surface there.
>>
>> Can we improve this?  Two ideas:
>>
>> 1. In the asm, patch out or otherwise disable espfix if that sysctl
>> has never been set.  (Ick.)
>>
>> 2. When modify_ldt is runtime-disabled (or compile-time disabled,
>> perhaps), disallow setting the LDT bit in SS in the handful of places
>> that would allow it (ptrace and sigreturn off the top of my head).  We
>> don't need to worry about (regs->ss & 4) being set on kernel entry
>> because we'll never be in user mode with that bit set if the LDT is
>> disabled, but that bit could still be set using kernel APIs.  (In
>> fact, my sigreturn test does exactly that.)
>>
>> Hmm.  With synchronous LDT, we could plausibly check at runtime in the
>> espfix code, too.  We used to use LAR to do this, but hpa removed it
>> when he realized that it was racy.  It shouldn't be racy any more,
>> because, with my patches applied, the LDT never changes while
>> interrupts are off.
>
> I understand it's not complete but I'm a bit bothered with conflating
> this sysctl with other setting methods, because if the purpose of the
> sysctl is to disable the syscall, it should do that only. I'd rather
> document that it's less complete than the Kconfig method and continue
> to recommend using your option whenever possible (eg: all my kernels
> will use it just as I've already disabled X86_16BIT everywhere).
>

Agreed.  We can certainly tighten up the espfix code later.

> Also one benefit of having both options is that it will mechanically
> make LDT a much less interesting target for future attacks, since it
> will significantly reduce the likeliness of success, hence the motivation
> for writing exploits that only work in conferences.
>

Patch looks fine to me.

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH 4/3] x86/ldt: allow to disable modify_ldt at runtime
  2015-07-25 17:42               ` Andy Lutomirski
@ 2015-07-25 18:45                 ` Willy Tarreau
  2015-07-25 18:45                 ` Willy Tarreau
  1 sibling, 0 replies; 130+ messages in thread
From: Willy Tarreau @ 2015-07-25 18:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel, Kees Cook

On Sat, Jul 25, 2015 at 10:42:14AM -0700, Andy Lutomirski wrote:
> On Sat, Jul 25, 2015 at 9:33 AM, Willy Tarreau <w@1wt.eu> wrote:
> > On Sat, Jul 25, 2015 at 09:08:39AM -0700, Andy Lutomirski wrote:
> >> There's one thing that I think is incomplete here.  Currently, espfix
> >> triggers if SS points to the LDT.  It's possible for SS to point to
> >> the LDT even with modify_ldt disabled, and there's a decent amount of
> >> attack surface there.
> >>
> >> Can we improve this?  Two ideas:
> >>
> >> 1. In the asm, patch out or otherwise disable espfix if that sysctl
> >> has never been set.  (Ick.)
> >>
> >> 2. When modify_ldt is runtime-disabled (or compile-time disabled,
> >> perhaps), disallow setting the LDT bit in SS in the handful of places
> >> that would allow it (ptrace and sigreturn off the top of my head).  We
> >> don't need to worry about (regs->ss & 4) being set on kernel entry
> >> because we'll never be in user mode with that bit set if the LDT is
> >> disabled, but that bit could still be set using kernel APIs.  (In
> >> fact, my sigreturn test does exactly that.)
> >>
> >> Hmm.  With synchronous LDT, we could plausibly check at runtime in the
> >> espfix code, too.  We used to use LAR to do this, but hpa removed it
> >> when he realized that it was racy.  It shouldn't be racy any more,
> >> because, with my patches applied, the LDT never changes while
> >> interrupts are off.
> >
> > I understand it's not complete but I'm a bit bothered with conflating
> > this sysctl with other setting methods, because if the purpose of the
> > sysctl is to disable the syscall, it should do that only. I'd rather
> > document that it's less complete than the Kconfig method and continue
> > to recommend using your option whenever possible (eg: all my kernels
> > will use it just as I've already disabled X86_16BIT everywhere).
> >
> 
> Agreed.  We can certainly tighten up the espfix code later.
> 
> > Also one benefit of having both options is that it will mechanically
> > make LDT a much less interesting target for future attacks, since it
> > will significantly reduce the likeliness of success, hence the motivation
> > for writing exploits that only work in conferences.
> >
> 
> Patch looks fine to me.

OK thanks.

Willy


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-25  5:36 [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option Andy Lutomirski
                   ` (7 preceding siblings ...)
  2015-07-25  6:27 ` Willy Tarreau
@ 2015-07-27 15:36 ` Boris Ostrovsky
  2015-07-27 15:53   ` Andy Lutomirski
  2015-07-27 15:53   ` Andy Lutomirski
  2015-07-27 15:36 ` Boris Ostrovsky
  9 siblings, 2 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-27 15:36 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra, Steven Rostedt
  Cc: security, X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Andrew Cooper, Jan Beulich, xen-devel

On 07/25/2015 01:36 AM, Andy Lutomirski wrote:
> Here's v3.  It fixes the "dazed and confused" issue, I hope.  It's also
> probably a good general attack surface reduction, and it replaces some
> scary code with IMO less scary code.
>
> Also, servers and embedded systems should probably turn off modify_ldt.
> This makes that possible.
>
> Xen people, can you take a look at this?
>
> Willy and Kees: I left the config option alone.  The -tiny people will
> like it, and we can always add a sysctl of some sort later.
>
> Changes from v3:
>   - Hopefully fixed Xen.

32b-on-32b fails in the same manner. (but non-zero LDT is taken care of)

>   - Fixed 32-bit test case on 32-bit native kernel.

I am not sure I see what changed.

-boris

^ permalink raw reply	[flat|nested] 130+ messages in thread

* [PATCH v4.1 3/3] selftests/x86, x86/ldt: Add a selftest for modify_ldt
  2015-07-25  5:36 ` Andy Lutomirski
@ 2015-07-27 15:52   ` Andy Lutomirski
  2015-07-27 15:52   ` Andy Lutomirski
  1 sibling, 0 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-27 15:52 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: security, X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel, Andy Lutomirski

This tests general modify_ldt behavior (only writes, so far) as
well as synchronous updates via IPI.  It fails on old kernels.

I called this ldt_gdt because I'll add set_thread_area tests to
it at some point.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

Oops.  This is what I meant to send as v4.  Patches 1 and 2 were correct
but I sent the wrong version of patch 3.
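
For reference, the expected values that the test passes to
install_valid() are plain bit arithmetic over the AR_* masks and the
user_desc limit field.  The sketch below reproduces that arithmetic in
isolation; the AR_* values are copied from ldt_gdt.c in this patch,
while expected_limit() and expected_code_ar() are hypothetical helper
names used only for illustration.

```c
#include <stdint.h>

/* Subset of the AR_* masks from ldt_gdt.c in this patch. */
#define AR_TYPE_XRCODE	(5 * (1 << 9))
#define AR_S		(1 << 12)
#define AR_DPL3		(3 * (1 << 13))
#define AR_P		(1 << 15)
#define AR_DB		(1 << 22)
#define AR_G		(1 << 23)

/* Byte-granular limit that LSL should report for a user_desc limit. */
static inline uint32_t expected_limit(uint32_t limit, int limit_in_pages)
{
	return limit_in_pages ? (limit << 12) + 4095 : limit;
}

/* AR expected for a present, DPL 3, 32-bit, readable code segment. */
static inline uint32_t expected_code_ar(int limit_in_pages)
{
	uint32_t ar = AR_DPL3 | AR_TYPE_XRCODE | AR_S | AR_P | AR_DB;

	if (limit_in_pages)
		ar |= AR_G;	/* page granularity sets the G bit */
	return ar;
}
```

For the first two cases in do_simple_tests() (limit 10, first byte
granular, then page granular), this gives AR 0x40FA00 with limit 10,
and AR 0xC0FA00 with limit 0xAFFF.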

 tools/testing/selftests/x86/Makefile  |   2 +-
 tools/testing/selftests/x86/ldt_gdt.c | 520 ++++++++++++++++++++++++++++++++++
 2 files changed, 521 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/ldt_gdt.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index caa60d56d7d1..4138387b892c 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -4,7 +4,7 @@ include ../lib.mk
 
 .PHONY: all all_32 all_64 warn_32bit_failure clean
 
-TARGETS_C_BOTHBITS := sigreturn single_step_syscall sysret_ss_attrs
+TARGETS_C_BOTHBITS := sigreturn single_step_syscall sysret_ss_attrs ldt_gdt
 TARGETS_C_32BIT_ONLY := entry_from_vm86
 
 TARGETS_C_32BIT_ALL := $(TARGETS_C_BOTHBITS) $(TARGETS_C_32BIT_ONLY)
diff --git a/tools/testing/selftests/x86/ldt_gdt.c b/tools/testing/selftests/x86/ldt_gdt.c
new file mode 100644
index 000000000000..c27adfc9ae72
--- /dev/null
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -0,0 +1,520 @@
+/*
+ * ldt_gdt.c - Test cases for LDT and GDT access
+ * Copyright (c) 2015 Andrew Lutomirski
+ */
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <signal.h>
+#include <setjmp.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <asm/ldt.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <stdbool.h>
+#include <pthread.h>
+#include <sched.h>
+#include <linux/futex.h>
+
+#define AR_ACCESSED		(1<<8)
+
+#define AR_TYPE_RODATA		(0 * (1<<9))
+#define AR_TYPE_RWDATA		(1 * (1<<9))
+#define AR_TYPE_RODATA_EXPDOWN	(2 * (1<<9))
+#define AR_TYPE_RWDATA_EXPDOWN	(3 * (1<<9))
+#define AR_TYPE_XOCODE		(4 * (1<<9))
+#define AR_TYPE_XRCODE		(5 * (1<<9))
+#define AR_TYPE_XOCODE_CONF	(6 * (1<<9))
+#define AR_TYPE_XRCODE_CONF	(7 * (1<<9))
+
+#define AR_DPL3			(3 * (1<<13))
+
+#define AR_S			(1 << 12)
+#define AR_P			(1 << 15)
+#define AR_AVL			(1 << 20)
+#define AR_L			(1 << 21)
+#define AR_DB			(1 << 22)
+#define AR_G			(1 << 23)
+
+static int nerrs;
+
+static void check_invalid_segment(uint16_t index, int ldt)
+{
+	uint32_t has_limit = 0, has_ar = 0, limit, ar;
+	uint32_t selector = (index << 3) | (ldt << 2) | 3;
+
+	asm ("lsl %[selector], %[limit]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_limit]\n\t"
+	     "1:"
+	     : [limit] "=r" (limit), [has_limit] "+rm" (has_limit)
+	     : [selector] "r" (selector));
+	asm ("larl %[selector], %[ar]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_ar]\n\t"
+	     "1:"
+	     : [ar] "=r" (ar), [has_ar] "+rm" (has_ar)
+	     : [selector] "r" (selector));
+
+	if (has_limit || has_ar) {
+		printf("[FAIL]\t%s entry %hu is valid but should be invalid\n",
+		       (ldt ? "LDT" : "GDT"), index);
+		nerrs++;
+	} else {
+		printf("[OK]\t%s entry %hu is invalid\n",
+		       (ldt ? "LDT" : "GDT"), index);
+	}
+}
+
+static void check_valid_segment(uint16_t index, int ldt,
+				uint32_t expected_ar, uint32_t expected_limit,
+				bool verbose)
+{
+	uint32_t has_limit = 0, has_ar = 0, limit, ar;
+	uint32_t selector = (index << 3) | (ldt << 2) | 3;
+
+	asm ("lsl %[selector], %[limit]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_limit]\n\t"
+	     "1:"
+	     : [limit] "=r" (limit), [has_limit] "+rm" (has_limit)
+	     : [selector] "r" (selector));
+	asm ("larl %[selector], %[ar]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_ar]\n\t"
+	     "1:"
+	     : [ar] "=r" (ar), [has_ar] "+rm" (has_ar)
+	     : [selector] "r" (selector));
+
+	if (!has_limit || !has_ar) {
+		printf("[FAIL]\t%s entry %hu is invalid but should be valid\n",
+		       (ldt ? "LDT" : "GDT"), index);
+		nerrs++;
+		return;
+	}
+
+	if (ar != expected_ar) {
+		printf("[FAIL]\t%s entry %hu has AR 0x%08X but expected 0x%08X\n",
+		       (ldt ? "LDT" : "GDT"), index, ar, expected_ar);
+		nerrs++;
+	} else if (limit != expected_limit) {
+		printf("[FAIL]\t%s entry %hu has limit 0x%08X but expected 0x%08X\n",
+		       (ldt ? "LDT" : "GDT"), index, limit, expected_limit);
+		nerrs++;
+	} else if (verbose) {
+		printf("[OK]\t%s entry %hu has AR 0x%08X and limit 0x%08X\n",
+		       (ldt ? "LDT" : "GDT"), index, ar, limit);
+	}
+}
+
+static bool install_valid_mode(const struct user_desc *desc, uint32_t ar,
+			       bool oldmode)
+{
+	int ret = syscall(SYS_modify_ldt, oldmode ? 1 : 0x11,
+			  desc, sizeof(*desc));
+	if (ret < -1)
+		errno = -ret;
+	if (ret == 0) {
+		uint32_t limit = desc->limit;
+		if (desc->limit_in_pages)
+			limit = (limit << 12) + 4095;
+		check_valid_segment(desc->entry_number, 1, ar, limit, true);
+		return true;
+	} else if (errno == ENOSYS) {
+		printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+		return false;
+	} else {
+		if (desc->seg_32bit) {
+			printf("[FAIL]\tUnexpected modify_ldt failure %d\n",
+			       errno);
+			nerrs++;
+			return false;
+		} else {
+			printf("[OK]\tmodify_ldt rejected 16 bit segment\n");
+			return false;
+		}
+	}
+}
+
+static bool install_valid(const struct user_desc *desc, uint32_t ar)
+{
+	return install_valid_mode(desc, ar, false);
+}
+
+static void install_invalid(const struct user_desc *desc, bool oldmode)
+{
+	int ret = syscall(SYS_modify_ldt, oldmode ? 1 : 0x11,
+			  desc, sizeof(*desc));
+	if (ret < -1)
+		errno = -ret;
+	if (ret == 0) {
+		check_invalid_segment(desc->entry_number, 1);
+	} else if (errno == ENOSYS) {
+		printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+	} else {
+		if (desc->seg_32bit) {
+			printf("[FAIL]\tUnexpected modify_ldt failure %d\n",
+			       errno);
+			nerrs++;
+		} else {
+			printf("[OK]\tmodify_ldt rejected 16 bit segment\n");
+		}
+	}
+}
+
+static int safe_modify_ldt(int func, struct user_desc *ptr,
+			   unsigned long bytecount)
+{
+	int ret = syscall(SYS_modify_ldt, func, ptr, bytecount);
+	if (ret < -1)
+		errno = -ret;
+	return ret;
+}
+
+static void fail_install(struct user_desc *desc)
+{
+	if (safe_modify_ldt(0x11, desc, sizeof(*desc)) == 0) {
+		printf("[FAIL]\tmodify_ldt accepted a bad descriptor\n");
+		nerrs++;
+	} else if (errno == ENOSYS) {
+		printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+	} else {
+		printf("[OK]\tmodify_ldt failure %d\n", errno);
+	}
+}
+
+static void do_simple_tests(void)
+{
+	struct user_desc desc = {
+		.entry_number    = 0,
+		.base_addr       = 0,
+		.limit           = 10,
+		.seg_32bit       = 1,
+		.contents        = 2, /* Code, not conforming */
+		.read_exec_only  = 0,
+		.limit_in_pages  = 0,
+		.seg_not_present = 0,
+		.useable         = 0
+	};
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE | AR_S | AR_P | AR_DB);
+
+	desc.limit_in_pages = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G);
+
+	check_invalid_segment(1, 1);
+
+	desc.entry_number = 2;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G);
+
+	check_invalid_segment(1, 1);
+
+	desc.base_addr = 0xf0000000;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G);
+
+	desc.useable = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G | AR_AVL);
+
+	desc.seg_not_present = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.seg_32bit = 0;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_G | AR_AVL);
+
+	desc.seg_32bit = 1;
+	desc.contents = 0;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.read_exec_only = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.contents = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA_EXPDOWN |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.read_exec_only = 0;
+	desc.limit_in_pages = 0;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA_EXPDOWN |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.contents = 3;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE_CONF |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.read_exec_only = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE_CONF |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.read_exec_only = 0;
+	desc.contents = 2;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.read_exec_only = 1;
+
+#ifdef __x86_64__
+	desc.lm = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE |
+		      AR_S | AR_DB | AR_AVL);
+	desc.lm = 0;
+#endif
+
+	bool entry1_okay = install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE |
+					 AR_S | AR_DB | AR_AVL);
+
+	if (entry1_okay) {
+		printf("[RUN]\tTest fork\n");
+		pid_t child = fork();
+		if (child == 0) {
+			nerrs = 0;
+			check_valid_segment(desc.entry_number, 1,
+					    AR_DPL3 | AR_TYPE_XOCODE |
+					    AR_S | AR_DB | AR_AVL, desc.limit,
+					    true);
+			check_invalid_segment(1, 1);
+			exit(nerrs ? 1 : 0);
+		} else {
+			int status;
+			if (waitpid(child, &status, 0) != child ||
+			    !WIFEXITED(status)) {
+				printf("[FAIL]\tChild died\n");
+				nerrs++;
+			} else if (WEXITSTATUS(status) != 0) {
+				printf("[FAIL]\tChild failed\n");
+				nerrs++;
+			} else {
+				printf("[OK]\tChild succeeded\n");
+			}
+		}
+
+		printf("[RUN]\tTest size\n");
+		int i;
+		for (i = 0; i < 8192; i++) {
+			desc.entry_number = i;
+			desc.limit = i;
+			if (safe_modify_ldt(0x11, &desc, sizeof(desc)) != 0) {
+				printf("[FAIL]\tFailed to install entry %d\n", i);
+				nerrs++;
+				break;
+			}
+		}
+		for (int j = 0; j < i; j++) {
+			check_valid_segment(j, 1, AR_DPL3 | AR_TYPE_XOCODE |
+					    AR_S | AR_DB | AR_AVL, j, false);
+		}
+		printf("[DONE]\tSize test\n");
+	} else {
+		printf("[SKIP]\tSkipping fork and size tests because we have no LDT\n");
+	}
+
+	/* Test entry_number too high. */
+	desc.entry_number = 8192;
+	fail_install(&desc);
+
+	/* Test deletion and actions mistakeable for deletion. */
+	memset(&desc, 0, sizeof(desc));
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S | AR_P);
+
+	desc.seg_not_present = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S);
+
+	desc.seg_not_present = 0;
+	desc.read_exec_only = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S | AR_P);
+
+	desc.read_exec_only = 0;
+	desc.seg_not_present = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S);
+
+	desc.read_exec_only = 1;
+	desc.limit = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S);
+
+	desc.limit = 0;
+	desc.base_addr = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S);
+
+	desc.base_addr = 0;
+	install_invalid(&desc, false);
+
+	desc.seg_not_present = 0;
+	desc.read_exec_only = 0;
+	desc.seg_32bit = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S | AR_P | AR_DB);
+	install_invalid(&desc, true);
+}
+
+/*
+ * 0: thread is idle
+ * 1: thread armed
+ * 2: thread should clear LDT entry 0
+ * >= 3: thread should exit (the main thread writes 100)
+ */
+static volatile unsigned int ftx;
+
+static void *threadproc(void *ctx)
+{
+	cpu_set_t cpuset;
+	CPU_ZERO(&cpuset);
+	CPU_SET(1, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
+		err(1, "sched_setaffinity to CPU 1");	/* should never fail */
+
+	while (1) {
+		syscall(SYS_futex, &ftx, FUTEX_WAIT, 0, NULL, NULL, 0);
+		while (ftx != 2) {
+			if (ftx >= 3)
+				return NULL;
+		}
+
+		/* clear LDT entry 0 */
+		const struct user_desc desc = {};
+		if (syscall(SYS_modify_ldt, 1, &desc, sizeof(desc)) != 0)
+			err(1, "modify_ldt");
+
+		/* If ftx == 2, set it to zero.  If ftx == 100, quit. */
+		unsigned int x = -2;
+		asm volatile ("lock xaddl %[x], %[ftx]" :
+			      [x] "+r" (x), [ftx] "+m" (ftx));
+		if (x != 2)
+			return NULL;
+	}
+}
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+		       int flags)
+{
+	struct sigaction sa;
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_sigaction = handler;
+	sa.sa_flags = SA_SIGINFO | flags;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+
+}
+
+static jmp_buf jmpbuf;
+
+static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
+{
+	siglongjmp(jmpbuf, 1);
+}
+
+static void do_multicpu_tests(void)
+{
+	cpu_set_t cpuset;
+	pthread_t thread;
+	int failures = 0, iters = 5, i;
+	unsigned short orig_ss;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(1, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
+		printf("[SKIP]\tCannot set affinity to CPU 1\n");
+		return;
+	}
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
+		printf("[SKIP]\tCannot set affinity to CPU 0\n");
+		return;
+	}
+
+	sethandler(SIGSEGV, sigsegv, 0);
+#ifdef __i386__
+	/* True 32-bit kernels send SIGILL instead of SIGSEGV on IRET faults. */
+	sethandler(SIGILL, sigsegv, 0);
+#endif
+
+	printf("[RUN]\tCross-CPU LDT invalidation\n");
+
+	if (pthread_create(&thread, 0, threadproc, 0) != 0)
+		err(1, "pthread_create");
+
+	asm volatile ("mov %%ss, %0" : "=rm" (orig_ss));
+
+	for (i = 0; i < iters; i++) {
+		if (sigsetjmp(jmpbuf, 1) != 0)
+			continue;
+
+		/* Make sure the thread is ready after the last test. */
+		while (ftx != 0)
+			;
+
+		struct user_desc desc = {
+			.entry_number    = 0,
+			.base_addr       = 0,
+			.limit           = 0xfffff,
+			.seg_32bit       = 1,
+			.contents        = 0, /* Data */
+			.read_exec_only  = 0,
+			.limit_in_pages  = 1,
+			.seg_not_present = 0,
+			.useable         = 0
+		};
+
+		if (safe_modify_ldt(0x11, &desc, sizeof(desc)) != 0) {
+			if (errno != ENOSYS)
+				err(1, "modify_ldt");
+			printf("[SKIP]\tmodify_ldt unavailable\n");
+			break;
+		}
+
+		/* Arm the thread. */
+		ftx = 1;
+		syscall(SYS_futex, &ftx, FUTEX_WAKE, 0, NULL, NULL, 0);
+
+		asm volatile ("mov %0, %%ss" : : "r" (0x7));
+
+		/* Go! */
+		ftx = 2;
+
+		while (ftx != 0)
+			;
+
+		/*
+		 * On success, modify_ldt will segfault us synchronously,
+		 * and we'll escape via siglongjmp.
+		 */
+
+		failures++;
+		asm volatile ("mov %0, %%ss" : : "rm" (orig_ss));
+	}
+
+	ftx = 100;  /* Kill the thread. */
+	syscall(SYS_futex, &ftx, FUTEX_WAKE, 0, NULL, NULL, 0);
+
+	if (pthread_join(thread, NULL) != 0)
+		err(1, "pthread_join");
+
+	if (failures) {
+		printf("[FAIL]\t%d of %d iterations failed\n", failures, iters);
+		nerrs++;
+	} else {
+		printf("[OK]\tAll %d iterations succeeded\n", iters);
+	}
+}
+
+int main()
+{
+	do_simple_tests();
+
+	do_multicpu_tests();
+
+	return nerrs ? 1 : 0;
+}
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 130+ messages in thread

* [PATCH v4.1 3.3] selftests/x86, x86/ldt: Add a selftest for modify_ldt
  2015-07-25  5:36 ` Andy Lutomirski
  2015-07-27 15:52   ` [PATCH v4.1 3.3] " Andy Lutomirski
@ 2015-07-27 15:52   ` Andy Lutomirski
  1 sibling, 0 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-27 15:52 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: security, Andy Lutomirski, Andrew Cooper, X86 ML, linux-kernel,
	xen-devel, Borislav Petkov, Jan Beulich, Sasha Levin,
	Boris Ostrovsky

This tests general modify_ldt behavior (only writes, so far) as
well as synchronous updates via IPI.  It fails on old kernels.

I called this ldt_gdt because I'll add set_thread_area tests to
it at some point.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

Oops.  This is what I meant to send as v4.  Patches 1 and 2 were correct
but I sent the wrong version of patch 3.
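
For readers following the test, the selector and limit arithmetic it
relies on can be sketched in isolation.  The helper names make_selector
and expected_limit below are hypothetical (the test computes both values
inline), so treat this as an illustration under that assumption, not as
part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper names; the selftest computes these values inline. */

/* Build a ring-3 selector: index in bits 15..3, table indicator in bit 2, RPL 3. */
static inline uint16_t make_selector(uint16_t index, bool ldt)
{
	assert(index < 8192);	/* a descriptor table holds at most 8192 entries */
	return (uint16_t)((index << 3) | ((ldt ? 1 : 0) << 2) | 3);
}

/* Byte-granular limit that LSL should report for a user_desc-style limit. */
static inline uint32_t expected_limit(uint32_t limit, bool limit_in_pages)
{
	/* A page-granular limit covers whole 4 KiB pages, so the low 12 bits are all ones. */
	return limit_in_pages ? (limit << 12) + 4095 : limit;
}
```

With these, make_selector(0, true) is 0x7, the value the multi-CPU test
loads into SS, and a page-granular limit of 0xfffff expands to
0xffffffff.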

 tools/testing/selftests/x86/Makefile  |   2 +-
 tools/testing/selftests/x86/ldt_gdt.c | 520 ++++++++++++++++++++++++++++++++++
 2 files changed, 521 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/ldt_gdt.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index caa60d56d7d1..4138387b892c 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -4,7 +4,7 @@ include ../lib.mk
 
 .PHONY: all all_32 all_64 warn_32bit_failure clean
 
-TARGETS_C_BOTHBITS := sigreturn single_step_syscall sysret_ss_attrs
+TARGETS_C_BOTHBITS := sigreturn single_step_syscall sysret_ss_attrs ldt_gdt
 TARGETS_C_32BIT_ONLY := entry_from_vm86
 
 TARGETS_C_32BIT_ALL := $(TARGETS_C_BOTHBITS) $(TARGETS_C_32BIT_ONLY)
diff --git a/tools/testing/selftests/x86/ldt_gdt.c b/tools/testing/selftests/x86/ldt_gdt.c
new file mode 100644
index 000000000000..c27adfc9ae72
--- /dev/null
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -0,0 +1,520 @@
+/*
+ * ldt_gdt.c - Test cases for LDT and GDT access
+ * Copyright (c) 2015 Andrew Lutomirski
+ */
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <signal.h>
+#include <setjmp.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <asm/ldt.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <stdbool.h>
+#include <pthread.h>
+#include <sched.h>
+#include <linux/futex.h>
+
+#define AR_ACCESSED		(1<<8)
+
+#define AR_TYPE_RODATA		(0 * (1<<9))
+#define AR_TYPE_RWDATA		(1 * (1<<9))
+#define AR_TYPE_RODATA_EXPDOWN	(2 * (1<<9))
+#define AR_TYPE_RWDATA_EXPDOWN	(3 * (1<<9))
+#define AR_TYPE_XOCODE		(4 * (1<<9))
+#define AR_TYPE_XRCODE		(5 * (1<<9))
+#define AR_TYPE_XOCODE_CONF	(6 * (1<<9))
+#define AR_TYPE_XRCODE_CONF	(7 * (1<<9))
+
+#define AR_DPL3			(3 * (1<<13))
+
+#define AR_S			(1 << 12)
+#define AR_P			(1 << 15)
+#define AR_AVL			(1 << 20)
+#define AR_L			(1 << 21)
+#define AR_DB			(1 << 22)
+#define AR_G			(1 << 23)
+
+static int nerrs;
+
+static void check_invalid_segment(uint16_t index, int ldt)
+{
+	uint32_t has_limit = 0, has_ar = 0, limit, ar;
+	uint32_t selector = (index << 3) | (ldt << 2) | 3;
+
+	asm ("lsl %[selector], %[limit]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_limit]\n\t"
+	     "1:"
+	     : [limit] "=r" (limit), [has_limit] "+rm" (has_limit)
+	     : [selector] "r" (selector));
+	asm ("larl %[selector], %[ar]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_ar]\n\t"
+	     "1:"
+	     : [ar] "=r" (ar), [has_ar] "+rm" (has_ar)
+	     : [selector] "r" (selector));
+
+	if (has_limit || has_ar) {
+		printf("[FAIL]\t%s entry %hu is valid but should be invalid\n",
+		       (ldt ? "LDT" : "GDT"), index);
+		nerrs++;
+	} else {
+		printf("[OK]\t%s entry %hu is invalid\n",
+		       (ldt ? "LDT" : "GDT"), index);
+	}
+}
+
+static void check_valid_segment(uint16_t index, int ldt,
+				uint32_t expected_ar, uint32_t expected_limit,
+				bool verbose)
+{
+	uint32_t has_limit = 0, has_ar = 0, limit, ar;
+	uint32_t selector = (index << 3) | (ldt << 2) | 3;
+
+	asm ("lsl %[selector], %[limit]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_limit]\n\t"
+	     "1:"
+	     : [limit] "=r" (limit), [has_limit] "+rm" (has_limit)
+	     : [selector] "r" (selector));
+	asm ("larl %[selector], %[ar]\n\t"
+	     "jnz 1f\n\t"
+	     "movl $1, %[has_ar]\n\t"
+	     "1:"
+	     : [ar] "=r" (ar), [has_ar] "+rm" (has_ar)
+	     : [selector] "r" (selector));
+
+	if (!has_limit || !has_ar) {
+		printf("[FAIL]\t%s entry %hu is invalid but should be valid\n",
+		       (ldt ? "LDT" : "GDT"), index);
+		nerrs++;
+		return;
+	}
+
+	if (ar != expected_ar) {
+		printf("[FAIL]\t%s entry %hu has AR 0x%08X but expected 0x%08X\n",
+		       (ldt ? "LDT" : "GDT"), index, ar, expected_ar);
+		nerrs++;
+	} else if (limit != expected_limit) {
+		printf("[FAIL]\t%s entry %hu has limit 0x%08X but expected 0x%08X\n",
+		       (ldt ? "LDT" : "GDT"), index, limit, expected_limit);
+		nerrs++;
+	} else if (verbose) {
+		printf("[OK]\t%s entry %hu has AR 0x%08X and limit 0x%08X\n",
+		       (ldt ? "LDT" : "GDT"), index, ar, limit);
+	}
+}
+
+static bool install_valid_mode(const struct user_desc *desc, uint32_t ar,
+			       bool oldmode)
+{
+	int ret = syscall(SYS_modify_ldt, oldmode ? 1 : 0x11,
+			  desc, sizeof(*desc));
+	if (ret < -1)
+		errno = -ret;
+	if (ret == 0) {
+		uint32_t limit = desc->limit;
+		if (desc->limit_in_pages)
+			limit = (limit << 12) + 4095;
+		check_valid_segment(desc->entry_number, 1, ar, limit, true);
+		return true;
+	} else if (errno == ENOSYS) {
+		printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+		return false;
+	} else {
+		if (desc->seg_32bit) {
+			printf("[FAIL]\tUnexpected modify_ldt failure %d\n",
+			       errno);
+			nerrs++;
+			return false;
+		} else {
+			printf("[OK]\tmodify_ldt rejected 16 bit segment\n");
+			return false;
+		}
+	}
+}
+
+static bool install_valid(const struct user_desc *desc, uint32_t ar)
+{
+	return install_valid_mode(desc, ar, false);
+}
+
+static void install_invalid(const struct user_desc *desc, bool oldmode)
+{
+	int ret = syscall(SYS_modify_ldt, oldmode ? 1 : 0x11,
+			  desc, sizeof(*desc));
+	if (ret < -1)
+		errno = -ret;
+	if (ret == 0) {
+		check_invalid_segment(desc->entry_number, 1);
+	} else if (errno == ENOSYS) {
+		printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+	} else {
+		if (desc->seg_32bit) {
+			printf("[FAIL]\tUnexpected modify_ldt failure %d\n",
+			       errno);
+			nerrs++;
+		} else {
+			printf("[OK]\tmodify_ldt rejected 16 bit segment\n");
+		}
+	}
+}
+
+static int safe_modify_ldt(int func, struct user_desc *ptr,
+			   unsigned long bytecount)
+{
+	int ret = syscall(SYS_modify_ldt, func, ptr, bytecount);
+	if (ret < -1)
+		errno = -ret;
+	return ret;
+}
+
+static void fail_install(struct user_desc *desc)
+{
+	if (safe_modify_ldt(0x11, desc, sizeof(*desc)) == 0) {
+		printf("[FAIL]\tmodify_ldt accepted a bad descriptor\n");
+		nerrs++;
+	} else if (errno == ENOSYS) {
+		printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+	} else {
+		printf("[OK]\tmodify_ldt failure %d\n", errno);
+	}
+}
+
+static void do_simple_tests(void)
+{
+	struct user_desc desc = {
+		.entry_number    = 0,
+		.base_addr       = 0,
+		.limit           = 10,
+		.seg_32bit       = 1,
+		.contents        = 2, /* Code, not conforming */
+		.read_exec_only  = 0,
+		.limit_in_pages  = 0,
+		.seg_not_present = 0,
+		.useable         = 0
+	};
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE | AR_S | AR_P | AR_DB);
+
+	desc.limit_in_pages = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G);
+
+	check_invalid_segment(1, 1);
+
+	desc.entry_number = 2;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G);
+
+	check_invalid_segment(1, 1);
+
+	desc.base_addr = 0xf0000000;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G);
+
+	desc.useable = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_P | AR_DB | AR_G | AR_AVL);
+
+	desc.seg_not_present = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.seg_32bit = 0;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_G | AR_AVL);
+
+	desc.seg_32bit = 1;
+	desc.contents = 0;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.read_exec_only = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.contents = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA_EXPDOWN |
+		      AR_S | AR_DB | AR_G | AR_AVL);
+
+	desc.read_exec_only = 0;
+	desc.limit_in_pages = 0;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA_EXPDOWN |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.contents = 3;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE_CONF |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.read_exec_only = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE_CONF |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.read_exec_only = 0;
+	desc.contents = 2;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+		      AR_S | AR_DB | AR_AVL);
+
+	desc.read_exec_only = 1;
+
+#ifdef __x86_64__
+	desc.lm = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE |
+		      AR_S | AR_DB | AR_AVL);
+	desc.lm = 0;
+#endif
+
+	bool entry1_okay = install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE |
+					 AR_S | AR_DB | AR_AVL);
+
+	if (entry1_okay) {
+		printf("[RUN]\tTest fork\n");
+		pid_t child = fork();
+		if (child == 0) {
+			nerrs = 0;
+			check_valid_segment(desc.entry_number, 1,
+					    AR_DPL3 | AR_TYPE_XOCODE |
+					    AR_S | AR_DB | AR_AVL, desc.limit,
+					    true);
+			check_invalid_segment(1, 1);
+			exit(nerrs ? 1 : 0);
+		} else {
+			int status;
+			if (waitpid(child, &status, 0) != child ||
+			    !WIFEXITED(status)) {
+				printf("[FAIL]\tChild died\n");
+				nerrs++;
+			} else if (WEXITSTATUS(status) != 0) {
+				printf("[FAIL]\tChild failed\n");
+				nerrs++;
+			} else {
+				printf("[OK]\tChild succeeded\n");
+			}
+		}
+
+		printf("[RUN]\tTest size\n");
+		int i;
+		for (i = 0; i < 8192; i++) {
+			desc.entry_number = i;
+			desc.limit = i;
+			if (safe_modify_ldt(0x11, &desc, sizeof(desc)) != 0) {
+				printf("[FAIL]\tFailed to install entry %d\n", i);
+				nerrs++;
+				break;
+			}
+		}
+		for (int j = 0; j < i; j++) {
+			check_valid_segment(j, 1, AR_DPL3 | AR_TYPE_XOCODE |
+					    AR_S | AR_DB | AR_AVL, j, false);
+		}
+		printf("[DONE]\tSize test\n");
+	} else {
+		printf("[SKIP]\tSkipping fork and size tests because we have no LDT\n");
+	}
+
+	/* Test entry_number too high. */
+	desc.entry_number = 8192;
+	fail_install(&desc);
+
+	/* Test deletion and actions mistakeable for deletion. */
+	memset(&desc, 0, sizeof(desc));
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S | AR_P);
+
+	desc.seg_not_present = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S);
+
+	desc.seg_not_present = 0;
+	desc.read_exec_only = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S | AR_P);
+
+	desc.read_exec_only = 0;
+	desc.seg_not_present = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S);
+
+	desc.read_exec_only = 1;
+	desc.limit = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S);
+
+	desc.limit = 0;
+	desc.base_addr = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S);
+
+	desc.base_addr = 0;
+	install_invalid(&desc, false);
+
+	desc.seg_not_present = 0;
+	desc.read_exec_only = 0;
+	desc.seg_32bit = 1;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S | AR_P | AR_DB);
+	install_invalid(&desc, true);
+}
+
+/*
+ * 0: thread is idle
+ * 1: thread armed
+ * 2: thread should clear LDT entry 0
+ * >= 3: thread should exit (the main thread writes 100)
+ */
+static volatile unsigned int ftx;
+
+static void *threadproc(void *ctx)
+{
+	cpu_set_t cpuset;
+	CPU_ZERO(&cpuset);
+	CPU_SET(1, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
+		err(1, "sched_setaffinity to CPU 1");	/* should never fail */
+
+	while (1) {
+		syscall(SYS_futex, &ftx, FUTEX_WAIT, 0, NULL, NULL, 0);
+		while (ftx != 2) {
+			if (ftx >= 3)
+				return NULL;
+		}
+
+		/* clear LDT entry 0 */
+		const struct user_desc desc = {};
+		if (syscall(SYS_modify_ldt, 1, &desc, sizeof(desc)) != 0)
+			err(1, "modify_ldt");
+
+		/* If ftx == 2, set it to zero.  If ftx == 100, quit. */
+		unsigned int x = -2;
+		asm volatile ("lock xaddl %[x], %[ftx]" :
+			      [x] "+r" (x), [ftx] "+m" (ftx));
+		if (x != 2)
+			return NULL;
+	}
+}
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+		       int flags)
+{
+	struct sigaction sa;
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_sigaction = handler;
+	sa.sa_flags = SA_SIGINFO | flags;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+
+}
+
+static jmp_buf jmpbuf;
+
+static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
+{
+	siglongjmp(jmpbuf, 1);
+}
+
+static void do_multicpu_tests(void)
+{
+	cpu_set_t cpuset;
+	pthread_t thread;
+	int failures = 0, iters = 5, i;
+	unsigned short orig_ss;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(1, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
+		printf("[SKIP]\tCannot set affinity to CPU 1\n");
+		return;
+	}
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
+		printf("[SKIP]\tCannot set affinity to CPU 0\n");
+		return;
+	}
+
+	sethandler(SIGSEGV, sigsegv, 0);
+#ifdef __i386__
+	/* True 32-bit kernels send SIGILL instead of SIGSEGV on IRET faults. */
+	sethandler(SIGILL, sigsegv, 0);
+#endif
+
+	printf("[RUN]\tCross-CPU LDT invalidation\n");
+
+	if (pthread_create(&thread, 0, threadproc, 0) != 0)
+		err(1, "pthread_create");
+
+	asm volatile ("mov %%ss, %0" : "=rm" (orig_ss));
+
+	for (i = 0; i < iters; i++) {
+		if (sigsetjmp(jmpbuf, 1) != 0)
+			continue;
+
+		/* Make sure the thread is ready after the last test. */
+		while (ftx != 0)
+			;
+
+		struct user_desc desc = {
+			.entry_number    = 0,
+			.base_addr       = 0,
+			.limit           = 0xfffff,
+			.seg_32bit       = 1,
+			.contents        = 0, /* Data */
+			.read_exec_only  = 0,
+			.limit_in_pages  = 1,
+			.seg_not_present = 0,
+			.useable         = 0
+		};
+
+		if (safe_modify_ldt(0x11, &desc, sizeof(desc)) != 0) {
+			if (errno != ENOSYS)
+				err(1, "modify_ldt");
+			printf("[SKIP]\tmodify_ldt unavailable\n");
+			break;
+		}
+
+		/* Arm the thread. */
+		ftx = 1;
+		syscall(SYS_futex, &ftx, FUTEX_WAKE, 0, NULL, NULL, 0);
+
+		asm volatile ("mov %0, %%ss" : : "r" (0x7));
+
+		/* Go! */
+		ftx = 2;
+
+		while (ftx != 0)
+			;
+
+		/*
+		 * On success, modify_ldt will segfault us synchronously,
+		 * and we'll escape via siglongjmp.
+		 */
+
+		failures++;
+		asm volatile ("mov %0, %%ss" : : "rm" (orig_ss));
+	}
+
+	ftx = 100;  /* Kill the thread. */
+	syscall(SYS_futex, &ftx, FUTEX_WAKE, 0, NULL, NULL, 0);
+
+	if (pthread_join(thread, NULL) != 0)
+		err(1, "pthread_join");
+
+	if (failures) {
+		printf("[FAIL]\t%d of %d iterations failed\n", failures, iters);
+		nerrs++;
+	} else {
+		printf("[OK]\tAll %d iterations succeeded\n", iters);
+	}
+}
+
+int main()
+{
+	do_simple_tests();
+
+	do_multicpu_tests();
+
+	return nerrs ? 1 : 0;
+}
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-27 15:36 ` Boris Ostrovsky
@ 2015-07-27 15:53   ` Andy Lutomirski
  2015-07-27 16:18     ` Boris Ostrovsky
  2015-07-27 16:18     ` Boris Ostrovsky
  2015-07-27 15:53   ` Andy Lutomirski
  1 sibling, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-27 15:53 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Andrew Cooper, Jan Beulich, xen-devel

On Mon, Jul 27, 2015 at 8:36 AM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 07/25/2015 01:36 AM, Andy Lutomirski wrote:
>>
>> Here's v3.  It fixes the "dazed and confused" issue, I hope.  It's also
>> probably a good general attack surface reduction, and it replaces some
>> scary code with IMO less scary code.
>>
>> Also, servers and embedded systems should probably turn off modify_ldt.
>> This makes that possible.
>>
>> Xen people, can you take a look at this?
>>
>> Willy and Kees: I left the config option alone.  The -tiny people will
>> like it, and we can always add a sysctl of some sort later.
>>
>> Changes from v3:
>>   - Hopefully fixed Xen.
>
>
> 32b-on-32b fails in the same manner. (but non-zero LDT is taken care of)
>
>>   - Fixed 32-bit test case on 32-bit native kernel.
>
>
> I am not sure I see what changed.

I put the fix in the wrong git commit, so I failed to send it.  Oops.

I just sent v4.1 of patch 3.  Can you try that?

>
> -boris



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-27 15:53   ` Andy Lutomirski
@ 2015-07-27 16:18     ` Boris Ostrovsky
  2015-07-28  2:20       ` Andy Lutomirski
  2015-07-28  2:20       ` Andy Lutomirski
  2015-07-27 16:18     ` Boris Ostrovsky
  1 sibling, 2 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-27 16:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Andrew Cooper, Jan Beulich, xen-devel

On 07/27/2015 11:53 AM, Andy Lutomirski wrote:
> On Mon, Jul 27, 2015 at 8:36 AM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 07/25/2015 01:36 AM, Andy Lutomirski wrote:
>>> Here's v3.  It fixes the "dazed and confused" issue, I hope.  It's also
>>> probably a good general attack surface reduction, and it replaces some
>>> scary code with IMO less scary code.
>>>
>>> Also, servers and embedded systems should probably turn off modify_ldt.
>>> This makes that possible.
>>>
>>> Xen people, can you take a look at this?
>>>
>>> Willy and Kees: I left the config option alone.  The -tiny people will
>>> like it, and we can always add a sysctl of some sort later.
>>>
>>> Changes from v3:
>>>    - Hopefully fixed Xen.
>>
>> 32b-on-32b fails in the same manner. (but non-zero LDT is taken care of)
>>
>>>    - Fixed 32-bit test case on 32-bit native kernel.
>>
>> I am not sure I see what changed.
> I put the fix in the wrong git commit, so I failed to send it.  Oops.
>
> I just sent v4.1 of patch 3.  Can you try that?


I am hitting BUG() in Xen code (returning from a hypercall) when freeing 
LDT in destroy_context(). Interestingly though when I run the test in 
the debugger I get SIGILL (just like before) but no BUG().

Let me get back to you on that later today.


-boris

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH 4/3] x86/ldt: allow to disable modify_ldt at runtime
  2015-07-25 13:03         ` Willy Tarreau
                             ` (2 preceding siblings ...)
  2015-07-27 19:04           ` Kees Cook
@ 2015-07-27 19:04           ` Kees Cook
  2015-07-27 21:37             ` Willy Tarreau
  2015-07-27 21:37             ` Willy Tarreau
  3 siblings, 2 replies; 130+ messages in thread
From: Kees Cook @ 2015-07-27 19:04 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andy Lutomirski, Andy Lutomirski, Peter Zijlstra, Steven Rostedt,
	security, X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel

On Sat, Jul 25, 2015 at 6:03 AM, Willy Tarreau <w@1wt.eu> wrote:
> On Sat, Jul 25, 2015 at 09:50:52AM +0200, Willy Tarreau wrote:
>> On Fri, Jul 24, 2015 at 11:44:52PM -0700, Andy Lutomirski wrote:
>> > I'm all for it, but I think it should be hard-disablable in config,
>> > too, for the -tiny people.
>>
>> I totally agree.
>>
>> > If we add a runtime disable, let's do a
>> > separate patch, and you and Kees can fight over how general it should
>> > be.
>>
>> Initially I was thinking about changing it for a 3-state option but
>> that would prevent X86_16BIT from being hard-disablable, so I'll do
>> something completely separate.
>
> So here comes the proposed patch. It adds a default setting for the
> sysctl when the option is not hard-disabled (eg: distros not wanting
> to take risks with legacy apps). It suggests leaving the option off.
> In case a syscall is blocked, a printk_ratelimited() is called with
> relevant info (program name, pid, uid) so that the admin can decide
> whether it's a legitimate call or not. Eg:
>
>   Denied a call to modify_ldt() from a.out[1736] (uid: 100). Adjust sysctl if this was not an exploit attempt.
>
> I personally think it completes your series well, hence the 4/3 numbering.
> Feel free to adopt it if you cycle another round and if you're OK with it
> of course.
>
> CCing Kees as well.

This patch looks reasonable, but I'd prefer a tri-state (enable,
disable, hard-disable). I do something like this for Yama's ptrace
zero to max_scope range (which "pins" to max_scope if set):

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/security/yama/yama_lsm.c#n361
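
The "pinning" behaviour described above can be modelled in a few lines of
user-space C. This is only a sketch under stated assumptions: the state
names, the helper names and the choice of -EINVAL are illustrative, and it
is neither the Yama code nor part of the proposed patch. The idea is that
once the value reaches its maximum, the allowed range collapses to that
maximum, so later writes of any other value are rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical tri-state values, named for illustration only. */
enum sketch_ldt_policy {
    SKETCH_LDT_ENABLED       = 0,
    SKETCH_LDT_DISABLED      = 1,
    SKETCH_LDT_HARD_DISABLED = 2,   /* maximum; sticky once set */
};

/*
 * Model of a range-checked write that "pins" at the maximum:
 * if the current value already equals the maximum, the minimum is
 * raised to the maximum, so only the maximum itself is still accepted.
 */
int sketch_policy_write(int *cur, int new_value)
{
    int min = SKETCH_LDT_ENABLED;
    int max = SKETCH_LDT_HARD_DISABLED;

    assert(cur != NULL);

    if (*cur == max)
        min = max;              /* lock the value in place */

    if (new_value < min || new_value > max)
        return -EINVAL;         /* illustrative errno */

    *cur = new_value;
    return 0;
}

/* The syscall is only allowed in the fully enabled state. */
int sketch_policy_allows_syscall(int cur)
{
    return cur == SKETCH_LDT_ENABLED;
}
```

With this model, a value can move freely between 0 and 1, but after a
write of 2 every attempt to go back to 0 or 1 fails.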

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH 4/3] x86/ldt: allow to disable modify_ldt at runtime
  2015-07-27 19:04           ` Kees Cook
  2015-07-27 21:37             ` Willy Tarreau
@ 2015-07-27 21:37             ` Willy Tarreau
  1 sibling, 0 replies; 130+ messages in thread
From: Willy Tarreau @ 2015-07-27 21:37 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Andy Lutomirski, Peter Zijlstra, Steven Rostedt,
	security, X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Andrew Cooper,
	Jan Beulich, xen-devel

On Mon, Jul 27, 2015 at 12:04:54PM -0700, Kees Cook wrote:
> On Sat, Jul 25, 2015 at 6:03 AM, Willy Tarreau <w@1wt.eu> wrote:
> > On Sat, Jul 25, 2015 at 09:50:52AM +0200, Willy Tarreau wrote:
> >> On Fri, Jul 24, 2015 at 11:44:52PM -0700, Andy Lutomirski wrote:
> >> > I'm all for it, but I think it should be hard-disablable in config,
> >> > too, for the -tiny people.
> >>
> >> I totally agree.
> >>
> >> > If we add a runtime disable, let's do a
> >> > separate patch, and you and Kees can fight over how general it should
> >> > be.
> >>
> >> Initially I was thinking about changing it for a 3-state option but
> >> that would prevent X86_16BIT from being hard-disablable, so I'll do
> >> something completely separate.
> >
> > So here comes the proposed patch. It adds a default setting for the
> > sysctl when the option is not hard-disabled (eg: distros not wanting
> > to take risks with legacy apps). It suggests leaving the option off.
> > In case a syscall is blocked, a printk_ratelimited() is called with
> > relevant info (program name, pid, uid) so that the admin can decide
> > whether it's a legitimate call or not. Eg:
> >
> >   Denied a call to modify_ldt() from a.out[1736] (uid: 100). Adjust sysctl if this was not an exploit attempt.
> >
> > I personally think it completes your series well, hence the 4/3 numbering.
> > Feel free to adopt it if you cycle another round and if you're OK with it
> > of course.
> >
> > CCing Kees as well.
> 
> This patch looks reasonable, but I'd prefer a tri-state (enable,
> disable, hard-disable).

That was my initial goal, until I realized that the current two
options also make it possible to get rid of X86_16BIT, as Andy did. I
don't see how to do this with the 3-state mode.

> I do something like this for Yama's ptrace
> zero to max_scope range (which "pins" to max_scope if set):
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/security/yama/yama_lsm.c#n361

I agree with this, and initially I intended to do something along these
lines, until I realized that this specific case doesn't match the
pattern. Here we have the opportunity to completely remove support for
LDT changes, not just the modify_ldt() syscall, so it makes sense to
have the two options.
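
The split between the two options can be sketched as follows. This is a
user-space model, not the patch: the macro and variable names are
hypothetical, and the -EPERM returned in the runtime-disabled case is an
assumption made for illustration. The point is that the build-time switch
removes the code path entirely, while the runtime switch only gates it.

```c
#include <errno.h>

/* Build-time switch, standing in for the Kconfig option (hypothetical name). */
#ifndef SKETCH_CONFIG_MODIFY_LDT
#define SKETCH_CONFIG_MODIFY_LDT 1
#endif

/* Runtime switch, standing in for the proposed sysctl; 0 means "off". */
int sketch_ldt_sysctl_enabled = 0;

/* Returns 0 if a modify_ldt()-style call may proceed, else a negative errno. */
int sketch_modify_ldt_gate(void)
{
#if SKETCH_CONFIG_MODIFY_LDT
    /* LDT support is compiled in; the admin decides at runtime. */
    return sketch_ldt_sysctl_enabled ? 0 : -EPERM;  /* errno is illustrative */
#else
    /* LDT support is compiled out; no runtime knob can bring it back. */
    return -ENOSYS;
#endif
}
```

A tri-state sysctl would cover the first branch only; the second branch
is what lets the LDT code (and X86_16BIT with it) disappear from the
build.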

Regards,
Willy


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-27 16:18     ` Boris Ostrovsky
  2015-07-28  2:20       ` Andy Lutomirski
@ 2015-07-28  2:20       ` Andy Lutomirski
  2015-07-28  3:16         ` Andy Lutomirski
  2015-07-28  3:16         ` Andy Lutomirski
  1 sibling, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-28  2:20 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Andrew Cooper, Jan Beulich, xen-devel

On Mon, Jul 27, 2015 at 9:18 AM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 07/27/2015 11:53 AM, Andy Lutomirski wrote:
>>
>> On Mon, Jul 27, 2015 at 8:36 AM, Boris Ostrovsky
>> <boris.ostrovsky@oracle.com> wrote:
>>>
>>> On 07/25/2015 01:36 AM, Andy Lutomirski wrote:
>>>>
>>>> Here's v3.  It fixes the "dazed and confused" issue, I hope.  It's also
>>>> probably a good general attack surface reduction, and it replaces some
>>>> scary code with IMO less scary code.
>>>>
>>>> Also, servers and embedded systems should probably turn off modify_ldt.
>>>> This makes that possible.
>>>>
>>>> Xen people, can you take a look at this?
>>>>
>>>> Willy and Kees: I left the config option alone.  The -tiny people will
>>>> like it, and we can always add a sysctl of some sort later.
>>>>
>>>> Changes from v3:
>>>>    - Hopefully fixed Xen.
>>>
>>>
>>> 32b-on-32b fails in the same manner. (but non-zero LDT is taken care of)
>>>
>>>>    - Fixed 32-bit test case on 32-bit native kernel.
>>>
>>>
>>> I am not sure I see what changed.
>>
>> I misplaced the fix in the wrong git commit, so I failed to send it.
>> Oops.
>>
>> I just sent v4.1 of patch 3.  Can you try that?
>
>
>
> I am hitting BUG() in Xen code (returning from a hypercall) when freeing LDT
> in destroy_context(). Interestingly though when I run the test in the
> debugger I get SIGILL (just like before) but no BUG().
>
> Let me get back to you on that later today.
>
>

After forward-porting my virtio patches, I got this thing to run on
Xen.  After several tries, I got:

[   53.985707] ------------[ cut here ]------------
[   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
[   53.986677] invalid opcode: 0000 [#1] SMP
[   53.986677] Modules linked in:
[   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
[   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
[   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
[   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
[   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
[   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
[   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
[   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
[   53.986677] Stack:
[   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000 00000b4a 00000200
[   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4 c1062310 c01861c0
[   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00 c2373a80 00000000
[   53.986677] Call Trace:
[   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
[   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
[   53.986677]  [<c1062735>] destroy_context+0x25/0x40
[   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
[   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
[   53.986677]  [<c1863736>] __schedule+0x316/0x950
[   53.986677]  [<c1863d96>] schedule+0x26/0x70
[   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
[   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
[   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
[   53.986677]  [<c186717a>] syscall_call+0x7/0x7
[   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55 89 e5
[   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP 0069:c0875e74
[   54.010069] ---[ end trace 89ac35b29c1c59bb ]---

Is that the error you're seeing?
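
A side note on the register dump above: EAX is ffffffea. Read as a signed
32-bit value this is -22, which is -EINVAL on Linux. Assuming EAX still
holds the return value of the hypercall issued in set_aliased_prot() at
the point of the BUG() (the trace alone does not prove this), the
hypervisor rejected the request rather than the guest faulting. A minimal
sketch of the conversion, with a hypothetical helper name:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Reinterpret a 32-bit register value from an oops as a signed return
 * code, without relying on implementation-defined integer conversion.
 */
int sketch_reg_to_retval(uint32_t reg)
{
    if (reg <= (uint32_t)INT32_MAX)
        return (int)reg;

    /* Two's complement: value = -(2^32 - reg) = -(UINT32_MAX - reg) - 1 */
    return -(int)(UINT32_MAX - reg) - 1;
}
```

Calling sketch_reg_to_retval(0xffffffea) yields -22, which compares equal
to -EINVAL.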

If I change xen_free_ldt to:

static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
{
    const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
    int i;

    vm_unmap_aliases();
    xen_mc_flush();

    for (i = 0; i < entries; i += entries_per_page)
        set_aliased_prot(ldt + i, PAGE_KERNEL);
}

then it works.  I don't know why this makes a difference.
(xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
doesn't.)
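
For reference, the stride in the loop above works out as follows,
assuming 4 KiB pages and 8-byte descriptors. Both values are hard-coded
assumptions in this sketch, and the helper name is hypothetical. Each
page then holds 512 entries, so set_aliased_prot() is called once per
page backing the LDT.

```c
/* Assumed constants for this sketch; the kernel takes them from its headers. */
#define SKETCH_PAGE_SIZE      4096u   /* 4 KiB pages */
#define SKETCH_LDT_ENTRY_SIZE 8u      /* size of one x86 descriptor in bytes */

/* Count how many pages the per-page loop visits for a given LDT size. */
unsigned sketch_ldt_pages_touched(unsigned entries)
{
    const unsigned entries_per_page =
        SKETCH_PAGE_SIZE / SKETCH_LDT_ENTRY_SIZE;   /* 512 */
    unsigned pages = 0;
    unsigned i;

    /* Same stride as the loop in the email: one step per page. */
    for (i = 0; i < entries; i += entries_per_page)
        pages++;

    return pages;
}
```

Under these assumptions a 512-entry LDT touches one page, 513 entries
touch two, and a full 8192-entry LDT touches sixteen.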

It's *possible* that there's some bug in my code that causes a CPU to
retain a reference to a stale LDT, but I don't see it.  Hmm.  Is it
possible that, when a process exits, we kill the mm without
synchronously unlazying it everywhere else?  Seems a bit hard to
imagine to me -- I don't see why this wouldn't blow up when the pgt
went away.

My best guess is that there's a silly race in which one CPU frees an
LDT before the other CPU flushes its hypercalls.  But I don't really
believe this, because I got this trace:

[   14.257546] Free LDT cb912000: CPU0 cb923000 CPU1 cb923000
[OK]    All 5 iterations succeeded
root@(none):/# [   15.824723] Free LDT cb923000: CPU0   (null) CPU1   (null)
[   15.827404] ------------[ cut here ]------------
[   15.828349] kernel BUG at arch/x86/xen/enlighten.c:497!
[   15.828349] invalid opcode: 0000 [#1] SMP

with this patch applied:

@@ -537,7 +542,9 @@ static void xen_set_ldt(const void *addr, unsigned entries)

        MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);

-       xen_mc_issue(PARAVIRT_LAZY_CPU);
+       xen_mc_flush();
+
+       this_cpu_write(cpu_ldt, addr);
 }

so both CPUs on my VM have definitely zeroed their LDTs before the
failing hypercall.
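
The bookkeeping behind the "Free LDT" lines can be modelled in user
space. This is only a sketch: the array, the two-CPU limit and the helper
names are made up for illustration, whereas the debugging patch above
uses a per-CPU variable written from xen_set_ldt(). The check is simply
whether any CPU still records the LDT that is about to be freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 2u   /* matches the two-CPU VM in the trace */

/* Last LDT address each CPU loaded; NULL means "no LDT". */
static const void *sketch_cpu_ldt[SKETCH_NR_CPUS];

/* Record the LDT a CPU has just switched to. */
void sketch_set_ldt(unsigned cpu, const void *addr)
{
    assert(cpu < SKETCH_NR_CPUS);
    sketch_cpu_ldt[cpu] = addr;
}

/* Return true if any CPU still records this LDT as loaded. */
bool sketch_ldt_in_use(const void *ldt)
{
    unsigned cpu;

    for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        if (sketch_cpu_ldt[cpu] == ldt)
            return true;

    return false;
}
```

In the trace, the second free happens while both CPUs report (null),
which in this model corresponds to sketch_ldt_in_use() returning false
for the LDT being freed.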

Hmm.  Looking at the hypervisor code, I don't see why setting the LDT
to NULL is handled correctly.  Am I missing something?

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28  2:20       ` Andy Lutomirski
  2015-07-28  3:16         ` Andy Lutomirski
@ 2015-07-28  3:16         ` Andy Lutomirski
  2015-07-28  3:23           ` Andy Lutomirski
                             ` (5 more replies)
  1 sibling, 6 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-28  3:16 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Andrew Cooper, Jan Beulich, xen-devel

On Mon, Jul 27, 2015 at 7:20 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jul 27, 2015 at 9:18 AM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 07/27/2015 11:53 AM, Andy Lutomirski wrote:
>>>
>>> On Mon, Jul 27, 2015 at 8:36 AM, Boris Ostrovsky
>>> <boris.ostrovsky@oracle.com> wrote:
>>>>
>>>> On 07/25/2015 01:36 AM, Andy Lutomirski wrote:
>>>>>
>>>>> Here's v3.  It fixes the "dazed and confused" issue, I hope.  It's also
>>>>> probably a good general attack surface reduction, and it replaces some
>>>>> scary code with IMO less scary code.
>>>>>
>>>>> Also, servers and embedded systems should probably turn off modify_ldt.
>>>>> This makes that possible.
>>>>>
>>>>> Xen people, can you take a look at this?
>>>>>
>>>>> Willy and Kees: I left the config option alone.  The -tiny people will
>>>>> like it, and we can always add a sysctl of some sort later.
>>>>>
>>>>> Changes from v3:
>>>>>    - Hopefully fixed Xen.
>>>>
>>>>
>>>> 32b-on-32b fails in the same manner. (but non-zero LDT is taken care of)
>>>>
>>>>>    - Fixed 32-bit test case on 32-bit native kernel.
>>>>
>>>>
>>>> I am not sure I see what changed.
>>>
>>> I misplaced the fix in the wrong git commit, so I failed to send it.
>>> Oops.
>>>
>>> I just sent v4.1 of patch 3.  Can you try that?
>>
>>
>>
>> I am hitting BUG() in Xen code (returning from a hypercall) when freeing LDT
>> in destroy_context(). Interestingly though when I run the test in the
>> debugger I get SIGILL (just like before) but no BUG().
>>
>> Let me get back to you on that later today.
>>
>>
>
> After forward-porting my virtio patches, I got this thing to run on
> Xen.  After several tries, I got:
>
> [   53.985707] ------------[ cut here ]------------
> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
> [   53.986677] invalid opcode: 0000 [#1] SMP
> [   53.986677] Modules linked in:
> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
> [   53.986677] Stack:
> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000 00000b4a 00000200
> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4 c1062310 c01861c0
> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00 c2373a80 00000000
> [   53.986677] Call Trace:
> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55 89 e5
> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP 0069:c0875e74
> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>
> Is that the error you're seeing?
>
> If I change xen_free_ldt to:
>
> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
> {
>     const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>     int i;
>
>     vm_unmap_aliases();
>     xen_mc_flush();
>
>     for (i = 0; i < entries; i += entries_per_page)
>         set_aliased_prot(ldt + i, PAGE_KERNEL);
> }
>
> then it works.  I don't know why this makes a difference.
> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
> doesn't.)
>

That fix makes sense if there's some way that the vmalloc area we're
freeing has an extra alias somewhere, which is very much possible.  On
the other hand, I don't see how this happens without first doing an
MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
expected that to blow up and/or result in test case failures.

But I'm still confused, because it seems like Xen will never populate
the actual (hidden) LDT mapping unless the pages backing it are
unaliased and well-formed, which makes me wonder why this stuff ever
worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
in segfaults?

The semantics seem to be very odd.  xen_free_ldt with an aliased
address might fail (and OOPS), but actual access to the LDT with an
aliased address page faults.

Also, using kzalloc for everything fixes the problem, which suggests
that there really is something to my theory that the problem involves
unexpected aliases.

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28  2:20       ` Andy Lutomirski
@ 2015-07-28  3:16         ` Andy Lutomirski
  2015-07-28  3:16         ` Andy Lutomirski
  1 sibling, 0 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-28  3:16 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: security, Jan Beulich, Peter Zijlstra, Andrew Cooper, X86 ML,
	linux-kernel, Steven Rostedt, xen-devel, Borislav Petkov,
	Andy Lutomirski, Sasha Levin

On Mon, Jul 27, 2015 at 7:20 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jul 27, 2015 at 9:18 AM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 07/27/2015 11:53 AM, Andy Lutomirski wrote:
>>>
>>> On Mon, Jul 27, 2015 at 8:36 AM, Boris Ostrovsky
>>> <boris.ostrovsky@oracle.com> wrote:
>>>>
>>>> On 07/25/2015 01:36 AM, Andy Lutomirski wrote:
>>>>>
>>>>> Here's v3.  It fixes the "dazed and confused" issue, I hope.  It's also
>>>>> probably a good general attack surface reduction, and it replaces some
>>>>> scary code with IMO less scary code.
>>>>>
>>>>> Also, servers and embedded systems should probably turn off modify_ldt.
>>>>> This makes that possible.
>>>>>
>>>>> Xen people, can you take a look at this?
>>>>>
>>>>> Willy and Kees: I left the config option alone.  The -tiny people will
>>>>> like it, and we can always add a sysctl of some sort later.
>>>>>
>>>>> Changes from v3:
>>>>>    - Hopefully fixed Xen.
>>>>
>>>>
>>>> 32b-on-32b fails in the same manner. (but non-zero LDT is taken care of)
>>>>
>>>>>    - Fixed 32-bit test case on 32-bit native kernel.
>>>>
>>>>
>>>> I am not sure I see what changed.
>>>
>>> I misplaced the fix in the wrong git commit, so I failed to send it.
>>> Oops.
>>>
>>> I just sent v4.1 of patch 3.  Can you try that?
>>
>>
>>
>> I am hitting BUG() in Xen code (returning from a hypercall) when freeing LDT
>> in destroy_context(). Interestingly though when I run the test in the
>> debugger I get SIGILL (just like before) but no BUG().
>>
>> Let me get back to you on that later today.
>>
>>
>
> After forward-porting my virtio patches, I got this thing to run on
> Xen.  After several tries, I got:
>
> [   53.985707] ------------[ cut here ]------------
> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
> [   53.986677] invalid opcode: 0000 [#1] SMP
> [   53.986677] Modules linked in:
> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
> [   53.986677] Stack:
> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000 00000b4a 00000200
> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4 c1062310 c01861c0
> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00 c2373a80 00000000
> [   53.986677] Call Trace:
> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
> 89 e5
> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP 0069:c0875e74
> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>
> Is that the error you're seeing?
>
> If I change xen_free_ldt to:
>
> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
> {
>     const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>     int i;
>
>     vm_unmap_aliases();
>     xen_mc_flush();
>
>     for(i = 0; i < entries; i += entries_per_page)
>         set_aliased_prot(ldt + i, PAGE_KERNEL);
> }
>
> then it works.  I don't know why this makes a difference.
> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
> doesn't.)
>

That fix makes sense if there's some way that the vmalloc area we're
freeing has an extra alias somewhere, which is very much possible.  On
the other hand, I don't see how this happens without first doing an
MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
expected that to blow up and/or result in test case failures.

But I'm still confused, because it seems like Xen will never populate
the actual (hidden) LDT mapping unless the pages backing it are
unaliased and well-formed, which makes me wonder why this stuff ever
worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
in segfaults?

The semantics seem to be very odd.  xen_free_ldt with an aliased
address might fail (and OOPS), but actual access to the LDT with an
aliased address page faults.

Also, using kzalloc for everything fixes the problem, which suggests
that there really is something to my theory that the problem involves
unexpected aliases.

--Andy
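
For readers following the xen_free_ldt() loop quoted above, here is a
minimal userspace sketch of how many pages that per-page walk visits
for a given entry count.  It assumes 4 KiB pages and 8-byte LDT
descriptors; pages_touched() is a hypothetical helper written for
illustration, not a kernel function, and it only counts iterations
where the kernel would call set_aliased_prot().

```c
/* Assumed x86 values: 4 KiB pages and 8-byte LDT descriptors. */
#define PAGE_SIZE      4096u
#define LDT_ENTRY_SIZE 8u

/*
 * Hypothetical helper: count the pages that a loop shaped like the one
 * in xen_free_ldt() would visit for 'entries' LDT entries.
 */
static unsigned pages_touched(unsigned entries)
{
	const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
	unsigned i, pages = 0;

	/* One iteration per page; the kernel calls set_aliased_prot() here. */
	for (i = 0; i < entries; i += entries_per_page)
		pages++;

	return pages;
}
```

With these assumptions, anything up to 512 entries is a single page, and
a full 8192-entry LDT is 16 pages, each of which needs its own
protection-change hypercall.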

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28  3:16         ` Andy Lutomirski
@ 2015-07-28  3:23           ` Andy Lutomirski
  2015-07-28  3:23           ` Andy Lutomirski
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-28  3:23 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Andrew Cooper, Jan Beulich, xen-devel

[-- Attachment #1: Type: text/plain, Size: 5781 bytes --]

On Mon, Jul 27, 2015 at 8:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jul 27, 2015 at 7:20 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Jul 27, 2015 at 9:18 AM, Boris Ostrovsky
>> <boris.ostrovsky@oracle.com> wrote:
>>> On 07/27/2015 11:53 AM, Andy Lutomirski wrote:
>>>>
>>>> On Mon, Jul 27, 2015 at 8:36 AM, Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>
>>>>> On 07/25/2015 01:36 AM, Andy Lutomirski wrote:
>>>>>>
>>>>>> Here's v3.  It fixes the "dazed and confused" issue, I hope.  It's also
>>>>>> probably a good general attack surface reduction, and it replaces some
>>>>>> scary code with IMO less scary code.
>>>>>>
>>>>>> Also, servers and embedded systems should probably turn off modify_ldt.
>>>>>> This makes that possible.
>>>>>>
>>>>>> Xen people, can you take a look at this?
>>>>>>
>>>>>> Willy and Kees: I left the config option alone.  The -tiny people will
>>>>>> like it, and we can always add a sysctl of some sort later.
>>>>>>
>>>>>> Changes from v3:
>>>>>>    - Hopefully fixed Xen.
>>>>>
>>>>>
>>>>> 32b-on-32b fails in the same manner. (but non-zero LDT is taken care of)
>>>>>
>>>>>>    - Fixed 32-bit test case on 32-bit native kernel.
>>>>>
>>>>>
>>>>> I am not sure I see what changed.
>>>>
>>>> I misplaced the fix in the wrong git commit, so I failed to send it.
>>>> Oops.
>>>>
>>>> I just sent v4.1 of patch 3.  Can you try that?
>>>
>>>
>>>
>>> I am hitting BUG() in Xen code (returning from a hypercall) when freeing LDT
>>> in destroy_context(). Interestingly though when I run the test in the
>>> debugger I get SIGILL (just like before) but no BUG().
>>>
>>> Let me get back to you on that later today.
>>>
>>>
>>
>> After forward-porting my virtio patches, I got this thing to run on
>> Xen.  After several tries, I got:
>>
>> [   53.985707] ------------[ cut here ]------------
>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>> [   53.986677] invalid opcode: 0000 [#1] SMP
>> [   53.986677] Modules linked in:
>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>> 04/01/2014
>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
>> [   53.986677] Stack:
>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
>> 00000b4a 00000200
>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
>> c1062310 c01861c0
>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
>> c2373a80 00000000
>> [   53.986677] Call Trace:
>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
>> 89 e5
>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP 0069:c0875e74
>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>
>> Is that the error you're seeing?
>>
>> If I change xen_free_ldt to:
>>
>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>> {
>>     const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>     int i;
>>
>>     vm_unmap_aliases();
>>     xen_mc_flush();
>>
>>     for(i = 0; i < entries; i += entries_per_page)
>>         set_aliased_prot(ldt + i, PAGE_KERNEL);
>> }
>>
>> then it works.  I don't know why this makes a difference.
>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>> doesn't.)
>>
>
> That fix makes sense if there's some way that the vmalloc area we're
> freeing has an extra alias somewhere, which is very much possible.  On
> the other hand, I don't see how this happens without first doing an
> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
> expected that to blow up and/or result in test case failures.
>
> But I'm still confused, because it seems like Xen will never populate
> the actual (hidden) LDT mapping unless the pages backing it are
> unaliased and well-formed, which makes me wonder why this stuff ever
> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
> in segfaults?
>
> The semantics seem to be very odd.  xen_free_ldt with an aliased
> address might fail (and OOPS), but actual access to the LDT with an
> aliased address page faults.
>
> Also, using kzalloc for everything fixes the problem, which suggests
> that there really is something to my theory that the problem involves
> unexpected aliases.

The attachment fixes the problem for me, and I could easily believe
that the attachment is correct.  I'd like to know why the code appears
to work without the xen_alloc_ldt change in there, though.

--Andy

[-- Attachment #2: xen.patch --]
[-- Type: text/x-diff, Size: 2238 bytes --]

commit b4772cf849f05d5ceab079b4e88497dd1b990acd
Author: Andy Lutomirski <luto@kernel.org>
Date:   Mon Jul 27 20:20:55 2015 -0700

    x86/xen: Unmap aliases in xen_alloc_ldt and xen_free_ldt
    
    The xen_free_ldt change fixes an OOPS in the new modify_ldt
    implementation.  I think the xen_alloc_ldt change should be
    necessary, too, but I can't seem to trigger any failures without it,
    which I find surprising.
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Andy Lutomirski <luto@kernel.org>

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 0b95c9b8283f..100a2e2294af 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -32,6 +32,7 @@
 #include <linux/gfp.h>
 #include <linux/memblock.h>
 #include <linux/edd.h>
+#include <linux/vmalloc.h>
 
 #include <xen/xen.h>
 #include <xen/events.h>
@@ -512,6 +513,10 @@ static void xen_alloc_ldt(struct desc_struct *ldt, unsigned entries)
 
 	for(i = 0; i < entries; i += entries_per_page)
 		set_aliased_prot(ldt + i, PAGE_KERNEL_RO);
+
+	/* If there are stray aliases, the LDT won't work. */
+	if (is_vmalloc_addr(ldt))
+		vm_unmap_aliases();
 }
 
 static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
@@ -519,6 +524,13 @@ static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
 	const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
 	int i;
 
+	/*
+	 * If there are stray aliases, hypervisor will fail the hypercalls,
+	 * causing us to OOPS.
+	 */
+	if (is_vmalloc_addr(ldt))
+		vm_unmap_aliases();
+
 	for(i = 0; i < entries; i += entries_per_page)
 		set_aliased_prot(ldt + i, PAGE_KERNEL);
 }
diff --git a/tools/testing/selftests/x86/ldt_gdt.c b/tools/testing/selftests/x86/ldt_gdt.c
index c27adfc9ae72..fba5bc133aa2 100644
--- a/tools/testing/selftests/x86/ldt_gdt.c
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -204,6 +204,10 @@ static void do_simple_tests(void)
 	};
 	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE | AR_S | AR_P | AR_DB);
 
+	desc.entry_number = 8191;
+	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE | AR_S | AR_P | AR_DB);
+	desc.entry_number = 0;
+
 	desc.limit_in_pages = 1;
 	install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
 		      AR_S | AR_P | AR_DB | AR_G);

^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28  3:16         ` Andy Lutomirski
                             ` (2 preceding siblings ...)
  2015-07-28  3:43           ` Boris Ostrovsky
@ 2015-07-28  3:43           ` Boris Ostrovsky
  2015-07-28 10:29           ` Andrew Cooper
  2015-07-28 10:29           ` Andrew Cooper
  5 siblings, 0 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-28  3:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Andrew Cooper, Jan Beulich, xen-devel

On 07/27/2015 11:16 PM, Andy Lutomirski wrote:
> On Mon, Jul 27, 2015 at 7:20 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Jul 27, 2015 at 9:18 AM, Boris Ostrovsky
>> <boris.ostrovsky@oracle.com> wrote:
>>> On 07/27/2015 11:53 AM, Andy Lutomirski wrote:
>>>> On Mon, Jul 27, 2015 at 8:36 AM, Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>> On 07/25/2015 01:36 AM, Andy Lutomirski wrote:
>>>>>> Here's v3.  It fixes the "dazed and confused" issue, I hope.  It's also
>>>>>> probably a good general attack surface reduction, and it replaces some
>>>>>> scary code with IMO less scary code.
>>>>>>
>>>>>> Also, servers and embedded systems should probably turn off modify_ldt.
>>>>>> This makes that possible.
>>>>>>
>>>>>> Xen people, can you take a look at this?
>>>>>>
>>>>>> Willy and Kees: I left the config option alone.  The -tiny people will
>>>>>> like it, and we can always add a sysctl of some sort later.
>>>>>>
>>>>>> Changes from v3:
>>>>>>     - Hopefully fixed Xen.
>>>>>
>>>>> 32b-on-32b fails in the same manner. (but non-zero LDT is taken care of)
>>>>>
>>>>>>     - Fixed 32-bit test case on 32-bit native kernel.
>>>>>
>>>>> I am not sure I see what changed.
>>>> I misplaced the fix in the wrong git commit, so I failed to send it.
>>>> Oops.
>>>>
>>>> I just sent v4.1 of patch 3.  Can you try that?
>>>
>>>
>>> I am hitting BUG() in Xen code (returning from a hypercall) when freeing LDT
>>> in destroy_context(). Interestingly though when I run the test in the
>>> debugger I get SIGILL (just like before) but no BUG().
>>>
>>> Let me get back to you on that later today.
>>>
>>>
>> After forward-porting my virtio patches, I got this thing to run on
>> Xen.  After several tries, I got:
>>
>> [   53.985707] ------------[ cut here ]------------
>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>> [   53.986677] invalid opcode: 0000 [#1] SMP
>> [   53.986677] Modules linked in:
>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>> 04/01/2014
>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
>> [   53.986677] Stack:
>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
>> 00000b4a 00000200
>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
>> c1062310 c01861c0
>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
>> c2373a80 00000000
>> [   53.986677] Call Trace:
>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
>> 89 e5
>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP 0069:c0875e74
>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>
>> Is that the error you're seeing?
>>
>> If I change xen_free_ldt to:
>>
>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>> {
>>      const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>      int i;
>>
>>      vm_unmap_aliases();
>>      xen_mc_flush();
>>
>>      for(i = 0; i < entries; i += entries_per_page)
>>          set_aliased_prot(ldt + i, PAGE_KERNEL);
>> }
>>
>> then it works.  I don't know why this makes a difference.
>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>> doesn't.)
>>
> That fix makes sense if there's some way that the vmalloc area we're
> freeing has an extra alias somewhere, which is very much possible.  On
> the other hand, I don't see how this happens without first doing an
> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
> expected that to blow up and/or result in test case failures.
>
> But I'm still confused, because it seems like Xen will never populate
> the actual (hidden) LDT mapping unless the pages backing it are
> unaliased and well-formed, which makes me wonder why this stuff ever
> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
> in segfaults?
>
> The semantics seem to be very odd.  xen_free_ldt with an aliased
> address might fail (and OOPS), but actual access to the LDT with an
> aliased address page faults.
>
> Also, using kzalloc for everything fixes the problem, which suggests
> that there really is something to my theory that the problem involves
> unexpected aliases.

Yes, this is as far as I got as well (I didn't try unaliasing, but now
that you found it, it does indeed work). I am not sure whether you are
saying this (I think you are, implicitly, since you are replacing vzalloc
with kzalloc), but the problem only happens when we have a multi-page LDT.

And it is reproducible with a single CPU.


-boris
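
The multi-page condition mentioned above can be sketched with a little
arithmetic.  This is only an illustration under stated assumptions:
4 KiB pages, 8-byte descriptors, and an LDT sized to cover exactly the
highest index written (the kernel may size it differently).  Both
helpers are hypothetical names, not kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed x86 values: 4 KiB pages and 8-byte LDT descriptors. */
#define PAGE_SIZE      4096u
#define LDT_ENTRY_SIZE 8u

/* Hypothetical: bytes needed for an LDT that can hold index entry_number. */
static size_t ldt_bytes_for_index(unsigned entry_number)
{
	return ((size_t)entry_number + 1) * LDT_ENTRY_SIZE;
}

/* Hypothetical: true if such an LDT no longer fits in a single page. */
static bool ldt_is_multi_page(unsigned entry_number)
{
	return ldt_bytes_for_index(entry_number) > PAGE_SIZE;
}
```

Under these assumptions, index 511 still fits in one page, index 512 is
the first that does not, and the selftest's entry_number = 8191 needs
the full 64 KiB.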


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28  3:16         ` Andy Lutomirski
  2015-07-28  3:23           ` Andy Lutomirski
  2015-07-28  3:23           ` Andy Lutomirski
@ 2015-07-28  3:43           ` Boris Ostrovsky
  2015-07-28  3:43           ` Boris Ostrovsky
                             ` (2 subsequent siblings)
  5 siblings, 0 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-28  3:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: security, Jan Beulich, Peter Zijlstra, Andrew Cooper, X86 ML,
	linux-kernel, Steven Rostedt, xen-devel, Borislav Petkov,
	Andy Lutomirski, Sasha Levin

On 07/27/2015 11:16 PM, Andy Lutomirski wrote:
> On Mon, Jul 27, 2015 at 7:20 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Jul 27, 2015 at 9:18 AM, Boris Ostrovsky
>> <boris.ostrovsky@oracle.com> wrote:
>>> On 07/27/2015 11:53 AM, Andy Lutomirski wrote:
>>>> On Mon, Jul 27, 2015 at 8:36 AM, Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>> On 07/25/2015 01:36 AM, Andy Lutomirski wrote:
>>>>>> Here's v3.  It fixes the "dazed and confused" issue, I hope.  It's also
>>>>>> probably a good general attack surface reduction, and it replaces some
>>>>>> scary code with IMO less scary code.
>>>>>>
>>>>>> Also, servers and embedded systems should probably turn off modify_ldt.
>>>>>> This makes that possible.
>>>>>>
>>>>>> Xen people, can you take a look at this?
>>>>>>
>>>>>> Willy and Kees: I left the config option alone.  The -tiny people will
>>>>>> like it, and we can always add a sysctl of some sort later.
>>>>>>
>>>>>> Changes from v3:
>>>>>>     - Hopefully fixed Xen.
>>>>>
>>>>> 32b-on-32b fails in the same manner. (but non-zero LDT is taken care of)
>>>>>
>>>>>>     - Fixed 32-bit test case on 32-bit native kernel.
>>>>>
>>>>> I am not sure I see what changed.
>>>> I misplaced the fix in the wrong git commit, so I failed to sent it.
>>>> Oops.
>>>>
>>>> I just sent v4.1 of patch 3.  Can you try that?
>>>
>>>
>>> I am hitting BUG() in Xen code (returning from a hypercall) when freeing LDT
>>> in destroy_context(). Interestingly though when I run the test in the
>>> debugger I get SIGILL (just like before) but no BUG().
>>>
>>> Let me get back to you on that later today.
>>>
>>>
>> After forward-porting my virtio patches, I got this thing to run on
>> Xen.  After several tries, I got:
>>
>> [   53.985707] ------------[ cut here ]------------
>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>> [   53.986677] invalid opcode: 0000 [#1] SMP
>> [   53.986677] Modules linked in:
>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>> 04/01/2014
>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
>> [   53.986677] Stack:
>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
>> 00000b4a 00000200
>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
>> c1062310 c01861c0
>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
>> c2373a80 00000000
>> [   53.986677] Call Trace:
>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
>> 89 e5
>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP 0069:c0875e74
>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>
>> Is that the error you're seeing?
>>
>> If I change xen_free_ldt to:
>>
>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>> {
>>      const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>      int i;
>>
>>      vm_unmap_aliases();
>>      xen_mc_flush();
>>
>>      for(i = 0; i < entries; i += entries_per_page)
>>          set_aliased_prot(ldt + i, PAGE_KERNEL);
>> }
>>
>> then it works.  I don't know why this makes a difference.
>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>> doesn't.)
>>
> That fix makes sense if there's some way that the vmalloc area we're
> freeing has an extra alias somewhere, which is very much possible.  On
> the other hand, I don't see how this happens without first doing an
> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
> expected that to blow up and/or result in test case failures.
>
> But I'm still confused, because it seems like Xen will never populate
> the actual (hidden) LDT mapping unless the pages backing it are
> unaliased and well-formed, which makes me wonder why this stuff ever
> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
> in segfaults?
>
> The semantics seem to be very odd.  xen_free_ldt with an aliased
> address might fail (and OOPS), but actual access to the LDT with an
> aliased address page faults.
>
> Also, using kzalloc for everything fixes the problem, which suggests
> that there really is something to my theory that the problem involves
> unexpected aliases.

Yes, this is as far as I got as well (I didn't try unaliasing, but now
that you found it, it does indeed work). I am not sure whether you are
saying this (I think you are, implicitly, since you are replacing vzalloc
with kzalloc), but the problem only happens when we have a multi-page LDT.

And it is reproducible with a single CPU.


-boris
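
A rough sketch of the arithmetic behind "multi-page LDT", for readers
following along.  It assumes the usual x86 values (4096-byte pages,
8-byte descriptors); the SKETCH_* names and both helper functions are
made up for illustration and are not kernel code.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE      4096u  /* assumed x86 page size in bytes */
#define SKETCH_LDT_ENTRY_SIZE 8u     /* assumed size of one descriptor */

/* Number of pages needed to hold an LDT with `entries` descriptors. */
static unsigned ldt_pages(unsigned entries)
{
    const unsigned per_page = SKETCH_PAGE_SIZE / SKETCH_LDT_ENTRY_SIZE;

    return (entries + per_page - 1) / per_page;
}

/*
 * Count how many times a loop of the form
 *     for (i = 0; i < entries; i += entries_per_page)
 * runs, i.e. how many pages such a loop would touch.
 */
static unsigned count_page_starts(unsigned entries)
{
    const unsigned per_page = SKETCH_PAGE_SIZE / SKETCH_LDT_ENTRY_SIZE;
    unsigned i, n = 0;

    for (i = 0; i < entries; i += per_page)
        n++;

    /* The loop visits exactly one entry per page. */
    assert(n == ldt_pages(entries));
    return n;
}
```

Under these assumptions an LDT of up to 512 entries fits in one page,
and anything larger spans several pages, which is the case in which the
failure was reported.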

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28  3:16         ` Andy Lutomirski
                             ` (4 preceding siblings ...)
  2015-07-28 10:29           ` Andrew Cooper
@ 2015-07-28 10:29           ` Andrew Cooper
  2015-07-28 14:05             ` Boris Ostrovsky
                               ` (3 more replies)
  5 siblings, 4 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-28 10:29 UTC (permalink / raw)
  To: Andy Lutomirski, Boris Ostrovsky
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Jan Beulich, xen-devel

On 28/07/15 04:16, Andy Lutomirski wrote:
> On Mon, Jul 27, 2015 at 7:20 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Jul 27, 2015 at 9:18 AM, Boris Ostrovsky
>> <boris.ostrovsky@oracle.com> wrote:
>>> On 07/27/2015 11:53 AM, Andy Lutomirski wrote:
>>>> On Mon, Jul 27, 2015 at 8:36 AM, Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>> On 07/25/2015 01:36 AM, Andy Lutomirski wrote:
>>>>>> Here's v3.  It fixes the "dazed and confused" issue, I hope.  It's also
>>>>>> probably a good general attack surface reduction, and it replaces some
>>>>>> scary code with IMO less scary code.
>>>>>>
>>>>>> Also, servers and embedded systems should probably turn off modify_ldt.
>>>>>> This makes that possible.
>>>>>>
>>>>>> Xen people, can you take a look at this?
>>>>>>
>>>>>> Willy and Kees: I left the config option alone.  The -tiny people will
>>>>>> like it, and we can always add a sysctl of some sort later.
>>>>>>
>>>>>> Changes from v3:
>>>>>>    - Hopefully fixed Xen.
>>>>>
>>>>> 32b-on-32b fails in the same manner. (but non-zero LDT is taken care of)
>>>>>
>>>>>>    - Fixed 32-bit test case on 32-bit native kernel.
>>>>>
>>>>> I am not sure I see what changed.
>>>> I misplaced the fix in the wrong git commit, so I failed to send it.
>>>> Oops.
>>>>
>>>> I just sent v4.1 of patch 3.  Can you try that?
>>>
>>>
>>> I am hitting BUG() in Xen code (returning from a hypercall) when freeing LDT
>>> in destroy_context(). Interestingly though when I run the test in the
>>> debugger I get SIGILL (just like before) but no BUG().
>>>
>>> Let me get back to you on that later today.
>>>
>>>
>> After forward-porting my virtio patches, I got this thing to run on
>> Xen.  After several tries, I got:
>>
>> [   53.985707] ------------[ cut here ]------------
>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>> [   53.986677] invalid opcode: 0000 [#1] SMP
>> [   53.986677] Modules linked in:
>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>> 04/01/2014
>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
>> [   53.986677] Stack:
>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
>> 00000b4a 00000200
>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
>> c1062310 c01861c0
>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
>> c2373a80 00000000
>> [   53.986677] Call Trace:
>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
>> 89 e5
>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP 0069:c0875e74
>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>
>> Is that the error you're seeing?
>>
>> If I change xen_free_ldt to:
>>
>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>> {
>>     const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>     int i;
>>
>>     vm_unmap_aliases();
>>     xen_mc_flush();
>>
>>     for(i = 0; i < entries; i += entries_per_page)
>>         set_aliased_prot(ldt + i, PAGE_KERNEL);
>> }
>>
>> then it works.  I don't know why this makes a difference.
>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>> doesn't.)
>>
> That fix makes sense if there's some way that the vmalloc area we're
> freeing has an extra alias somewhere, which is very much possible.  On
> the other hand, I don't see how this happens without first doing an
> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
> expected that to blow up and/or result in test case failures.
>
> But I'm still confused, because it seems like Xen will never populate
> the actual (hidden) LDT mapping unless the pages backing it are
> unaliased and well-formed, which makes me wonder why this stuff ever
> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
> in segfaults?
>
> The semantics seem to be very odd.  xen_free_ldt with an aliased
> address might fail (and OOPS), but actual access to the LDT with an
> aliased address page faults.
>
> Also, using kzalloc for everything fixes the problem, which suggests
> that there really is something to my theory that the problem involves
> unexpected aliases.

Xen does lazily populate the LDT frames.  The first time a page is ever
referenced via the LDT, Xen will perform a typechange.

Under Xen, guest mappings are reference counted with both a plain
reference, and a type count.  Types of writeable, segdesc and pagetables
are mutually exclusive.  This prevents the guest from having writeable
mappings of interesting data structures, but readable mappings are fine.
Typechanges may only occur when the type reference count is 0.

At the point of the typechange, no writeable mappings of the frame may
exist (and it must not be referenced by an L2 or greater page directory),
or the typechange will fail.  Additionally the descriptors are audited
at this point, so if Xen objects to any of the descriptors in the same
page, the typechange will also fail.

If the typechange fails, the pagefault gets propagated back to the guest.

The corollary to this is that, for xen_free_ldt() to create writeable
mappings again, a typechange back to writeable is needed.  This will
fail if the LDT frames are still referenced in any vcpu's LDT.

It would be interesting to know which of the two BUG()s in
set_aliased_prot() tripped.  If writeable aliases did exist then
xen_alloc_ldt() could indeed be insufficient to make the frames usable
as an LDT, but xen_free_ldt() wouldn't fail when trying to recreate the
writeable mappings.

~Andrew
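
The type-count rule described above can be modelled in a few lines of
user-space C.  This is only a toy sketch under stated assumptions: the
enum, struct and function names are hypothetical and do not correspond
to Xen's actual implementation, and the descriptor audit is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page types, loosely following the description above. */
enum page_type { PT_NONE, PT_WRITABLE, PT_SEG_DESC, PT_PAGE_TABLE };

struct frame {
    enum page_type type;      /* type the frame currently holds */
    unsigned type_count;      /* number of typed references to it */
};

/*
 * Take a typed reference on a frame.  The type may only change while
 * the type count is 0; otherwise the request fails.
 */
static bool get_page_type(struct frame *f, enum page_type want)
{
    if (f->type != want) {
        if (f->type_count != 0)
            return false;     /* still in use as a different type */
        f->type = want;       /* the "typechange" */
    }
    f->type_count++;
    return true;
}

/* Drop a typed reference previously taken with get_page_type(). */
static void put_page_type(struct frame *f)
{
    assert(f->type_count > 0);
    f->type_count--;
}
```

In this model a frame that still holds a writable type reference cannot
become a descriptor page, and a frame still counted as a descriptor page
cannot be made writable again until the last typed reference is dropped.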

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 10:29           ` Andrew Cooper
  2015-07-28 14:05             ` Boris Ostrovsky
@ 2015-07-28 14:05             ` Boris Ostrovsky
  2015-07-28 14:35               ` Andrew Cooper
  2015-07-28 14:35               ` Andrew Cooper
  2015-07-28 15:43             ` Andy Lutomirski
  2015-07-28 15:43             ` Andy Lutomirski
  3 siblings, 2 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-28 14:05 UTC (permalink / raw)
  To: Andrew Cooper, Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Jan Beulich, xen-devel

On 07/28/2015 06:29 AM, Andrew Cooper wrote:
>
>>> After forward-porting my virtio patches, I got this thing to run on
>>> Xen.  After several tries, I got:
>>>
>>> [   53.985707] ------------[ cut here ]------------
>>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>>> [   53.986677] invalid opcode: 0000 [#1] SMP
>>> [   53.986677] Modules linked in:
>>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>>> 04/01/2014
>>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
>>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
>>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
>>> [   53.986677] Stack:
>>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
>>> 00000b4a 00000200
>>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
>>> c1062310 c01861c0
>>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
>>> c2373a80 00000000
>>> [   53.986677] Call Trace:
>>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
>>> 89 e5
>>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP 0069:c0875e74
>>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>>
>>> Is that the error you're seeing?
>>>
>>> If I change xen_free_ldt to:
>>>
>>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>>> {
>>>      const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>>      int i;
>>>
>>>      vm_unmap_aliases();
>>>      xen_mc_flush();
>>>
>>>      for(i = 0; i < entries; i += entries_per_page)
>>>          set_aliased_prot(ldt + i, PAGE_KERNEL);
>>> }
>>>
>>> then it works.  I don't know why this makes a difference.
>>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>>> doesn't.)
>>>
>> That fix makes sense if there's some way that the vmalloc area we're
>> freeing has an extra alias somewhere, which is very much possible.  On
>> the other hand, I don't see how this happens without first doing an
>> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
>> expected that to blow up and/or result in test case failures.
>>
>> But I'm still confused, because it seems like Xen will never populate
>> the actual (hidden) LDT mapping unless the pages backing it are
>> unaliased and well-formed, which makes me wonder why this stuff ever
>> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
>> in segfaults?
>>
>> The semantics seem to be very odd.  xen_free_ldt with an aliased
>> address might fail (and OOPS), but actual access to the LDT with an
>> aliased address page faults.
>>
>> Also, using kzalloc for everything fixes the problem, which suggests
>> that there really is something to my theory that the problem involves
>> unexpected aliases.
> Xen does lazily populate the LDT frames.  The first time a page is ever
> referenced via the LDT, Xen will perform a typechange.
>
> Under Xen, guest mappings are reference counted with both a plain
> reference, and a type count.  Types of writeable, segdesc and pagetables
> are mutually exclusive.  This prevents the guest from having writeable
> mappings of interesting data structures, but readable mappings are fine.
> Typechanges may only occur when the type reference count is 0.
>
> At the point of the typechange, no writeable mappings of the frame may
> exist (and it must not be referenced by a L2 or greater page directory),
> or the typechange will fail.  Additionally the descriptors are audited
> at this point, so if Xen objects to any of the descriptors in the same
> page, the typechange will also fail.
>
> If the typechange fails, the pagefault gets propagated back to the guest.
>
> The corollary to this is that, for xen_free_ldt() to create writeable
> mappings again, a typechange back to writeable is needed.  This will
> fail if the LDT frames are still referenced in any vcpu's LDT.
>
> It would be interesting to know which of the two BUG()s in
> set_aliased_prot() tripped.

The first one (i.e. not the alias)

-boris

> If writeable aliases did exist then
> xen_alloc_ldt() could indeed be insufficient to make the frames usable
> as an LDT, but xen_free_ldt() wouldn't fail when trying to recreate the
> writeable mappings.
>
> ~Andrew


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 14:05             ` Boris Ostrovsky
  2015-07-28 14:35               ` Andrew Cooper
@ 2015-07-28 14:35               ` Andrew Cooper
  2015-07-28 14:50                 ` Boris Ostrovsky
  2015-07-28 14:50                 ` Boris Ostrovsky
  1 sibling, 2 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-28 14:35 UTC (permalink / raw)
  To: Boris Ostrovsky, Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Jan Beulich, xen-devel

On 28/07/15 15:05, Boris Ostrovsky wrote:
> On 07/28/2015 06:29 AM, Andrew Cooper wrote:
>>
>>>> After forward-porting my virtio patches, I got this thing to run on
>>>> Xen.  After several tries, I got:
>>>>
>>>> [   53.985707] ------------[ cut here ]------------
>>>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>>>> [   53.986677] invalid opcode: 0000 [#1] SMP
>>>> [   53.986677] Modules linked in:
>>>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>>>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>>>> 04/01/2014
>>>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>>>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>>>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>>>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
>>>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
>>>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>>>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
>>>> [   53.986677] Stack:
>>>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
>>>> 00000b4a 00000200
>>>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
>>>> c1062310 c01861c0
>>>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
>>>> c2373a80 00000000
>>>> [   53.986677] Call Trace:
>>>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>>>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>>>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>>>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>>>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>>>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>>>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>>>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>>>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>>>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>>>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>>>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>>>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>>>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
>>>> 89 e5
>>>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP
>>>> 0069:c0875e74
>>>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>>>
>>>> Is that the error you're seeing?
>>>>
>>>> If I change xen_free_ldt to:
>>>>
>>>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>>>> {
>>>>      const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>>>      int i;
>>>>
>>>>      vm_unmap_aliases();
>>>>      xen_mc_flush();
>>>>
>>>>      for(i = 0; i < entries; i += entries_per_page)
>>>>          set_aliased_prot(ldt + i, PAGE_KERNEL);
>>>> }
>>>>
>>>> then it works.  I don't know why this makes a difference.
>>>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>>>> doesn't.)
>>>>
>>> That fix makes sense if there's some way that the vmalloc area we're
>>> freeing has an extra alias somewhere, which is very much possible.  On
>>> the other hand, I don't see how this happens without first doing an
>>> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
>>> expected that to blow up and/or result in test case failures.
>>>
>>> But I'm still confused, because it seems like Xen will never populate
>>> the actual (hidden) LDT mapping unless the pages backing it are
>>> unaliased and well-formed, which make me wonder why this stuff ever
>>> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
>>> in segfaults?
>>>
>>> The semantics seem to be very odd.  xen_free_ldt with an aliased
>>> address might fail (and OOPS), but actual access to the LDT with an
>>> aliased address page faults.
>>>
>>> Also, using kzalloc for everything fixes the problem, which suggests
>>> that there really is something to my theory that the problem involves
>>> unexpected aliases.
>> Xen does lazily populate the LDT frames.  The first time a page is ever
>> referenced via the LDT, Xen will perform a typechange.
>>
>> Under Xen, guest mappings are reference counted with both a plain
>> reference, and a type count.  Types of writeable, segdec and pagetables
>> are mutually exclusive.  This prevents the guest from having writeable
>> mappings of interesting datastructures, but readable mappings are fine.
>> Typechanges may only occur when the type reference count is 0.
>>
>> At the point of the typechange, no writeable mappings of the frame may
>> exist (and it must not be referenced by a L2 or greater page directory),
>> or the typechange will fail.  Additionally the descriptors are audited
>> at this point, so if Xen objects to any of the descriptors in the same
>> page, the typechange will also fail.
>>
>> If the typechange fails, the pagefault gets propagated back to the
>> guest.
>>
>> The corollary to this is that, for xen_free_ldt() to create writeable
>> mappings again, a typechange back to writeable is needed.  This will
>> fail if the LDT frames are still referenced in any vcpus LDT.
>>
>> It would be interesting to know which of the two BUG()s in
>> set_aliased_prot() tripped.
>
> The first one (i.e. not the alias)
>

In which case the page in question is still referenced in an LDT
(perhaps on a different vcpu) or has been reused as a pagetable (I
really hope this is not the case).

A sufficiently-debug Xen might be persuaded into telling you exactly
what it didn't like about the attempted transition.
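
As a rough illustration of the type-count rule quoted above (a frame
holds one type at a time, and the type may change only once its type
count has dropped to zero), here is a toy user-space model in C.  It is
not Xen code: struct frame, get_page_type() and put_page_type() are
hypothetical names used only for this sketch, and it ignores both the
descriptor audit and the plain reference count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not Xen's real data structures. */
enum page_type { PT_NONE, PT_WRITABLE, PT_SEGDESC, PT_PAGETABLE };

struct frame {
	enum page_type type;	/* the single type the frame currently has */
	unsigned type_count;	/* number of users holding it with that type */
};

/*
 * Take a typed reference.  The type may change only when no typed
 * users remain; otherwise the request is refused.
 */
static bool get_page_type(struct frame *f, enum page_type want)
{
	if (f->type != want) {
		if (f->type_count != 0)
			return false;	/* still held with a conflicting type */
		f->type = want;
	}
	f->type_count++;
	return true;
}

/* Drop a typed reference taken with get_page_type(). */
static void put_page_type(struct frame *f)
{
	assert(f->type_count > 0);
	f->type_count--;
}
```

In this model, a frame held as PT_SEGDESC cannot be made writable until
the last typed reference is dropped, which is the failure mode suggested
above for xen_free_ldt().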

~Andrew

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 14:35               ` Andrew Cooper
  2015-07-28 14:50                 ` Boris Ostrovsky
@ 2015-07-28 14:50                 ` Boris Ostrovsky
  2015-07-28 15:15                   ` Konrad Rzeszutek Wilk
                                     ` (3 more replies)
  1 sibling, 4 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-28 14:50 UTC (permalink / raw)
  To: Andrew Cooper, Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Jan Beulich, xen-devel

On 07/28/2015 10:35 AM, Andrew Cooper wrote:
> On 28/07/15 15:05, Boris Ostrovsky wrote:
>> On 07/28/2015 06:29 AM, Andrew Cooper wrote:
>>>>> After forward-porting my virtio patches, I got this thing to run on
>>>>> Xen.  After several tries, I got:
>>>>>
>>>>> [   53.985707] ------------[ cut here ]------------
>>>>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>>>>> [   53.986677] invalid opcode: 0000 [#1] SMP
>>>>> [   53.986677] Modules linked in:
>>>>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>>>>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>>>>> 04/01/2014
>>>>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>>>>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>>>>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>>>>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
>>>>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
>>>>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>>>>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
>>>>> [   53.986677] Stack:
>>>>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
>>>>> 00000b4a 00000200
>>>>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
>>>>> c1062310 c01861c0
>>>>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
>>>>> c2373a80 00000000
>>>>> [   53.986677] Call Trace:
>>>>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>>>>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>>>>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>>>>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>>>>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>>>>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>>>>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>>>>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>>>>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>>>>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>>>>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>>>>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>>>>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>>>>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
>>>>> 89 e5
>>>>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP
>>>>> 0069:c0875e74
>>>>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>>>>
>>>>> Is that the error you're seeing?
>>>>>
>>>>> If I change xen_free_ldt to:
>>>>>
>>>>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>>>>> {
>>>>>       const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>>>>       int i;
>>>>>
>>>>>       vm_unmap_aliases();
>>>>>       xen_mc_flush();
>>>>>
>>>>>       for(i = 0; i < entries; i += entries_per_page)
>>>>>           set_aliased_prot(ldt + i, PAGE_KERNEL);
>>>>> }
>>>>>
>>>>> then it works.  I don't know why this makes a difference.
>>>>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>>>>> doesn't.)
>>>>>
>>>> That fix makes sense if there's some way that the vmalloc area we're
>>>> freeing has an extra alias somewhere, which is very much possible.  On
>>>> the other hand, I don't see how this happens without first doing an
>>>> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
>>>> expected that to blow up and/or result in test case failures.
>>>>
>>>> But I'm still confused, because it seems like Xen will never populate
>>>> the actual (hidden) LDT mapping unless the pages backing it are
>>>> unaliased and well-formed, which make me wonder why this stuff ever
>>>> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
>>>> in segfaults?
>>>>
>>>> The semantics seem to be very odd.  xen_free_ldt with an aliased
>>>> address might fail (and OOPS), but actual access to the LDT with an
>>>> aliased address page faults.
>>>>
>>>> Also, using kzalloc for everything fixes the problem, which suggests
>>>> that there really is something to my theory that the problem involves
>>>> unexpected aliases.
>>> Xen does lazily populate the LDT frames.  The first time a page is ever
>>> referenced via the LDT, Xen will perform a typechange.
>>>
>>> Under Xen, guest mappings are reference counted with both a plain
>>> reference, and a type count.  Types of writeable, segdec and pagetables
>>> are mutually exclusive.  This prevents the guest from having writeable
>>> mappings of interesting datastructures, but readable mappings are fine.
>>> Typechanges may only occur when the type reference count is 0.
>>>
>>> At the point of the typechange, no writeable mappings of the frame may
>>> exist (and it must not be referenced by a L2 or greater page directory),
>>> or the typechange will fail.  Additionally the descriptors are audited
>>> at this point, so if Xen objects to any of the descriptors in the same
>>> page, the typechange will also fail.
>>>
>>> If the typechange fails, the pagefault gets propagated back to the
>>> guest.
>>>
>>> The corollary to this is that, for xen_free_ldt() to create writeable
>>> mappings again, a typechange back to writeable is needed.  This will
>>> fail if the LDT frames are still referenced in any vcpus LDT.
>>>
>>> It would be interesting to know which of the two BUG()s in
>>> set_aliased_prot() tripped.
>> The first one (i.e. not the alias)
>>
> In which case the page in question is still referenced in an LDT
> (perhaps on a different vcpu)

The problem is reproducible on a UP guest so it's not that.

> or has been reused as a pagetable (I
> really hope this is not the case).
>
> A sufficiently-debug Xen might be persuaded into telling you exactly
> what it didn't like about the attempted transition.

It just can't find the l1 entry for the LDT address in
__do_update_va_mapping().

-boris


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 14:50                 ` Boris Ostrovsky
  2015-07-28 15:15                   ` Konrad Rzeszutek Wilk
@ 2015-07-28 15:15                   ` Konrad Rzeszutek Wilk
  2015-07-28 15:39                     ` Boris Ostrovsky
  2015-07-28 15:39                     ` Boris Ostrovsky
  2015-07-28 15:23                   ` Andrew Cooper
  2015-07-28 15:23                   ` Andrew Cooper
  3 siblings, 2 replies; 130+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-07-28 15:15 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andrew Cooper, Andy Lutomirski, Andy Lutomirski, Peter Zijlstra,
	Steven Rostedt, security, X86 ML, Borislav Petkov, Sasha Levin,
	linux-kernel, Jan Beulich, xen-devel

On Tue, Jul 28, 2015 at 10:50:39AM -0400, Boris Ostrovsky wrote:
> On 07/28/2015 10:35 AM, Andrew Cooper wrote:
> >On 28/07/15 15:05, Boris Ostrovsky wrote:
> >>On 07/28/2015 06:29 AM, Andrew Cooper wrote:
> >>>>>After forward-porting my virtio patches, I got this thing to run on
> >>>>>Xen.  After several tries, I got:
> >>>>>
> >>>>>[   53.985707] ------------[ cut here ]------------
> >>>>>[   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
> >>>>>[   53.986677] invalid opcode: 0000 [#1] SMP
> >>>>>[   53.986677] Modules linked in:
> >>>>>[   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
> >>>>>[   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> >>>>>BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
> >>>>>04/01/2014
> >>>>>[   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
> >>>>>[   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
> >>>>>[   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
> >>>>>[   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
> >>>>>[   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
> >>>>>[   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
> >>>>>[   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
> >>>>>[   53.986677] Stack:
> >>>>>[   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
> >>>>>00000b4a 00000200
> >>>>>[   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
> >>>>>c1062310 c01861c0
> >>>>>[   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
> >>>>>c2373a80 00000000
> >>>>>[   53.986677] Call Trace:
> >>>>>[   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
> >>>>>[   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
> >>>>>[   53.986677]  [<c1062735>] destroy_context+0x25/0x40
> >>>>>[   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
> >>>>>[   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
> >>>>>[   53.986677]  [<c1863736>] __schedule+0x316/0x950
> >>>>>[   53.986677]  [<c1863d96>] schedule+0x26/0x70
> >>>>>[   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
> >>>>>[   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
> >>>>>[   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
> >>>>>[   53.986677]  [<c186717a>] syscall_call+0x7/0x7
> >>>>>[   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
> >>>>>4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
> >>>>>c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
> >>>>>89 e5
> >>>>>[   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP
> >>>>>0069:c0875e74
> >>>>>[   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
> >>>>>
> >>>>>Is that the error you're seeing?
> >>>>>
> >>>>>If I change xen_free_ldt to:
> >>>>>
> >>>>>static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
> >>>>>{
> >>>>>      const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
> >>>>>      int i;
> >>>>>
> >>>>>      vm_unmap_aliases();
> >>>>>      xen_mc_flush();
> >>>>>
> >>>>>      for(i = 0; i < entries; i += entries_per_page)
> >>>>>          set_aliased_prot(ldt + i, PAGE_KERNEL);
> >>>>>}
> >>>>>
> >>>>>then it works.  I don't know why this makes a difference.
> >>>>>(xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
> >>>>>doesn't.)
> >>>>>
> >>>>That fix makes sense if there's some way that the vmalloc area we're
> >>>>freeing has an extra alias somewhere, which is very much possible.  On
> >>>>the other hand, I don't see how this happens without first doing an
> >>>>MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
> >>>>expected that to blow up and/or result in test case failures.
> >>>>
> >>>>But I'm still confused, because it seems like Xen will never populate
> >>>>the actual (hidden) LDT mapping unless the pages backing it are
> >>>>unaliased and well-formed, which make me wonder why this stuff ever
> >>>>worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
> >>>>in segfaults?
> >>>>
> >>>>The semantics seem to be very odd.  xen_free_ldt with an aliased
> >>>>address might fail (and OOPS), but actual access to the LDT with an
> >>>>aliased address page faults.
> >>>>
> >>>>Also, using kzalloc for everything fixes the problem, which suggests
> >>>>that there really is something to my theory that the problem involves
> >>>>unexpected aliases.
> >>>Xen does lazily populate the LDT frames.  The first time a page is ever
> >>>referenced via the LDT, Xen will perform a typechange.
> >>>
> >>>Under Xen, guest mappings are reference counted with both a plain
> >>>reference, and a type count.  Types of writeable, segdec and pagetables
> >>>are mutually exclusive.  This prevents the guest from having writeable
> >>>mappings of interesting datastructures, but readable mappings are fine.
> >>>Typechanges may only occur when the type reference count is 0.
> >>>
> >>>At the point of the typechange, no writeable mappings of the frame may
> >>>exist (and it must not be referenced by a L2 or greater page directory),
> >>>or the typechange will fail.  Additionally the descriptors are audited
> >>>at this point, so if Xen objects to any of the descriptors in the same
> >>>page, the typechange will also fail.
> >>>
> >>>If the typechange fails, the pagefault gets propagated back to the
> >>>guest.
> >>>
> >>>The corollary to this is that, for xen_free_ldt() to create writeable
> >>>mappings again, a typechange back to writeable is needed.  This will
> >>>fail if the LDT frames are still referenced in any vcpus LDT.
> >>>
> >>>It would be interesting to know which of the two BUG()s in
> >>>set_aliased_prot() tripped.
> >>The first one (i.e. not the alias)
> >>
> >In which case the page in question is still referenced in an LDT
> >(perhaps on a different vcpu)
> 
> The problem is reproducible on a UP guest so it's not that.

The Linux kernel does a bunch of lazy maps and unmaps, and we
may be getting an interrupt before the lazy unmap has been
flushed (i.e. before arch_leave_lazy_mmu_mode has been called).

Having the vm_unmap_aliases and then xen_mc_flush (which is what
arch_leave_lazy_mmu_mode ends up doing too, and more) would solve it.

Though I would have thought that vm_unmap_aliases would call
arch_leave_lazy_mmu_mode.
> 
> >or has been reused as a pagetable (I
> >really hope this is not the case).
> >
> >A sufficiently-debug Xen might be persuaded into telling you exactly
> >what it didn't like about the attempted transition.
> 
> It just can't find l1 entry for the LDT address in __do_update_va_mapping().

Which would imply that it has not been written in. Which corresponds
to the set_aliased_prot hitting the first BUG_ON.

The xen_mc_flush() also triggers the batched hypercalls - which means we
may have some hypercalls that have not yet gone to the hypervisor when
we then try to do an LDT hypercall (not batched).
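
To make the ordering hazard concrete, here is a toy user-space model in
C of a batch queue sitting in front of a "hypervisor".  It is only a
sketch of the hypothesis above, not the kernel's multicall code: struct
toy_hv, toy_queue_map(), toy_flush() and toy_update_mapping() are
hypothetical names, and the real batching logic is considerably more
involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BATCH 32	/* queue depth of the toy batch */
#define TOY_SLOTS 8	/* number of mappings the toy hypervisor tracks */

/* Toy model only: hypothetical names, not the kernel's multicall code. */
struct toy_hv {
	bool mapped[TOY_SLOTS];		/* what the "hypervisor" currently sees */
	int pending[TOY_BATCH];		/* queued map requests, not yet issued */
	size_t npending;
};

/* Queue a mapping request; it has no effect until toy_flush(). */
static void toy_queue_map(struct toy_hv *hv, int slot)
{
	assert(slot >= 0 && slot < TOY_SLOTS);
	assert(hv->npending < TOY_BATCH);
	hv->pending[hv->npending++] = slot;
}

/* Issue every queued request, as flushing the batch would. */
static void toy_flush(struct toy_hv *hv)
{
	for (size_t i = 0; i < hv->npending; i++)
		hv->mapped[hv->pending[i]] = true;
	hv->npending = 0;
}

/* An unbatched call: succeeds only if the mapping is already visible. */
static bool toy_update_mapping(const struct toy_hv *hv, int slot)
{
	return slot >= 0 && slot < TOY_SLOTS && hv->mapped[slot];
}
```

In this model, calling toy_update_mapping() for a slot that is still
sitting in the queue fails, and the same call succeeds once toy_flush()
has run first.  That is the ordering an explicit flush ahead of the
unbatched call is meant to enforce.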

You could try building with this debug:


diff --git a/arch/x86/xen/multicalls.c b/arch/x86/xen/multicalls.c
index ea54a08..5d214ce 100644
--- a/arch/x86/xen/multicalls.c
+++ b/arch/x86/xen/multicalls.c
@@ -28,9 +28,9 @@
 #include "multicalls.h"
 #include "debugfs.h"
 
-#define MC_BATCH	32
+#define MC_BATCH	1
 
-#define MC_DEBUG	0
+#define MC_DEBUG	1
 
 #define MC_ARGS		(MC_BATCH * 16)
 
> 
> -boris
> 

^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 14:50                 ` Boris Ostrovsky
@ 2015-07-28 15:15                   ` Konrad Rzeszutek Wilk
  2015-07-28 15:15                   ` Konrad Rzeszutek Wilk
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 130+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-07-28 15:15 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: security, Jan Beulich, Peter Zijlstra, Andrew Cooper, X86 ML,
	linux-kernel, Steven Rostedt, Andy Lutomirski, Borislav Petkov,
	Andy Lutomirski, Sasha Levin, xen-devel

On Tue, Jul 28, 2015 at 10:50:39AM -0400, Boris Ostrovsky wrote:
> On 07/28/2015 10:35 AM, Andrew Cooper wrote:
> >On 28/07/15 15:05, Boris Ostrovsky wrote:
> >>On 07/28/2015 06:29 AM, Andrew Cooper wrote:
> >>>>>After forward-porting my virtio patches, I got this thing to run on
> >>>>>Xen.  After several tries, I got:
> >>>>>
> >>>>>[   53.985707] ------------[ cut here ]------------
> >>>>>[   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
> >>>>>[   53.986677] invalid opcode: 0000 [#1] SMP
> >>>>>[   53.986677] Modules linked in:
> >>>>>[   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
> >>>>>[   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> >>>>>BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
> >>>>>04/01/2014
> >>>>>[   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
> >>>>>[   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
> >>>>>[   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
> >>>>>[   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
> >>>>>[   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
> >>>>>[   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
> >>>>>[   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
> >>>>>[   53.986677] Stack:
> >>>>>[   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
> >>>>>00000b4a 00000200
> >>>>>[   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
> >>>>>c1062310 c01861c0
> >>>>>[   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
> >>>>>c2373a80 00000000
> >>>>>[   53.986677] Call Trace:
> >>>>>[   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
> >>>>>[   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
> >>>>>[   53.986677]  [<c1062735>] destroy_context+0x25/0x40
> >>>>>[   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
> >>>>>[   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
> >>>>>[   53.986677]  [<c1863736>] __schedule+0x316/0x950
> >>>>>[   53.986677]  [<c1863d96>] schedule+0x26/0x70
> >>>>>[   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
> >>>>>[   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
> >>>>>[   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
> >>>>>[   53.986677]  [<c186717a>] syscall_call+0x7/0x7
> >>>>>[   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
> >>>>>4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
> >>>>>c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
> >>>>>89 e5
> >>>>>[   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP
> >>>>>0069:c0875e74
> >>>>>[   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
> >>>>>
> >>>>>Is that the error you're seeing?
> >>>>>
> >>>>>If I change xen_free_ldt to:
> >>>>>
> >>>>>static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
> >>>>>{
> >>>>>      const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
> >>>>>      int i;
> >>>>>
> >>>>>      vm_unmap_aliases();
> >>>>>      xen_mc_flush();
> >>>>>
> >>>>>      for(i = 0; i < entries; i += entries_per_page)
> >>>>>          set_aliased_prot(ldt + i, PAGE_KERNEL);
> >>>>>}
> >>>>>
> >>>>>then it works.  I don't know why this makes a difference.
> >>>>>(xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
> >>>>>doesn't.)
> >>>>>
> >>>>That fix makes sense if there's some way that the vmalloc area we're
> >>>>freeing has an extra alias somewhere, which is very much possible.  On
> >>>>the other hand, I don't see how this happens without first doing an
> >>>>MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
> >>>>expected that to blow up and/or result in test case failures.
> >>>>
> >>>>But I'm still confused, because it seems like Xen will never populate
> >>>>the actual (hidden) LDT mapping unless the pages backing it are
> >>>>unaliased and well-formed, which makes me wonder why this stuff ever
> >>>>worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
> >>>>in segfaults?
> >>>>
> >>>>The semantics seem to be very odd.  xen_free_ldt with an aliased
> >>>>address might fail (and OOPS), but actual access to the LDT with an
> >>>>aliased address page faults.
> >>>>
> >>>>Also, using kzalloc for everything fixes the problem, which suggests
> >>>>that there really is something to my theory that the problem involves
> >>>>unexpected aliases.
> >>>Xen does lazily populate the LDT frames.  The first time a page is ever
> >>>referenced via the LDT, Xen will perform a typechange.
> >>>
> >>>Under Xen, guest mappings are reference counted with both a plain
> >>>reference, and a type count.  Types of writeable, segdesc and pagetables
> >>>are mutually exclusive.  This prevents the guest from having writeable
> >>>mappings of interesting datastructures, but readable mappings are fine.
> >>>Typechanges may only occur when the type reference count is 0.
> >>>
> >>>At the point of the typechange, no writeable mappings of the frame may
> >>>exist (and it must not be referenced by a L2 or greater page directory),
> >>>or the typechange will fail.  Additionally the descriptors are audited
> >>>at this point, so if Xen objects to any of the descriptors in the same
> >>>page, the typechange will also fail.
> >>>
> >>>If the typechange fails, the pagefault gets propagated back to the
> >>>guest.
> >>>
> >>>The corollary to this is that, for xen_free_ldt() to create writeable
> >>>mappings again, a typechange back to writeable is needed.  This will
> >>>fail if the LDT frames are still referenced in any vcpu's LDT.
> >>>
> >>>It would be interesting to know which of the two BUG()s in
> >>>set_aliased_prot() tripped.
> >>The first one (i.e. not the alias)
> >>
> >In which case the page in question is still referenced in an LDT
> >(perhaps on a different vcpu)
> 
> The problem is reproducible on a UP guest so it's not that.

The Linux kernel does a bunch of lazy maps and unmaps and we
may be getting an interrupt while the lazy unmap hasn't been
called (arch_leave_lazy_mmu_mode).

Having vm_unmap_aliases and then xen_mc_flush (which is what
arch_leave_lazy_mmu_mode ends up doing too, and more) would solve it.

Though I would have thought that vm_unmap_aliases would call
arch_leave_lazy_mmu_mode.
> 
> >or has been reused as a pagetable (I
> >really hope this is not the case).
> >
> >A sufficiently-debug Xen might be persuaded into telling you exactly
> >what it didn't like about the attempted transition.
> 
> It just can't find l1 entry for the LDT address in __do_update_va_mapping().

Which would imply that it has not been written in. Which corresponds
to the set_aliased_prot hitting the first BUG_ON.

The xen_mc_flush() also triggers the batched hypercalls - which means we
may have some hypercalls that have not yet gone to the hypervisor and
then we try to do an LDT hypercall (not batched).
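
For illustration only, the ordering hazard described above can be modeled
in a few lines of C.  This is a toy sketch, not the code in
arch/x86/xen/multicalls.c: toy_mc_queue(), toy_mc_flush(), toy_hypercall()
and the operation names used with them are all made up, and the
"hypervisor" is just a log string.

```c
#include <assert.h>
#include <string.h>

#define TOY_MC_BATCH 32	/* hypothetical batch size for this model */

static const char *toy_queue[TOY_MC_BATCH];	/* queued, not yet issued */
static int toy_queued;
static char toy_log[256];	/* order in which calls reached the "hypervisor" */

/* Issue one call immediately (models a non-batched hypercall). */
static void toy_hypercall(const char *op)
{
	assert(strlen(toy_log) + strlen(op) + 2 <= sizeof(toy_log));
	strcat(toy_log, op);
	strcat(toy_log, ";");
}

/* Issue everything still sitting in the batch, in queue order. */
static void toy_mc_flush(void)
{
	for (int i = 0; i < toy_queued; i++)
		toy_hypercall(toy_queue[i]);
	toy_queued = 0;
}

/* Queue a call; it only reaches the "hypervisor" on the next flush. */
static void toy_mc_queue(const char *op)
{
	if (toy_queued == TOY_MC_BATCH)
		toy_mc_flush();
	toy_queue[toy_queued++] = op;
}

/* Reset the model between experiments. */
static void toy_reset(void)
{
	toy_queued = 0;
	toy_log[0] = '\0';
}
```

In this model, queuing "clear_ldt" and then calling
toy_hypercall("update_va_mapping") without a flush leaves the log reading
"update_va_mapping;clear_ldt;", i.e. the unbatched call overtakes the
queued one.  Calling toy_mc_flush() first restores program order.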

You could try building with this debug:


diff --git a/arch/x86/xen/multicalls.c b/arch/x86/xen/multicalls.c
index ea54a08..5d214ce 100644
--- a/arch/x86/xen/multicalls.c
+++ b/arch/x86/xen/multicalls.c
@@ -28,9 +28,9 @@
 #include "multicalls.h"
 #include "debugfs.h"
 
-#define MC_BATCH	32
+#define MC_BATCH	1
 
-#define MC_DEBUG	0
+#define MC_DEBUG	1
 
 #define MC_ARGS		(MC_BATCH * 16)
 
> 
> -boris
> 

^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 14:50                 ` Boris Ostrovsky
  2015-07-28 15:15                   ` Konrad Rzeszutek Wilk
  2015-07-28 15:15                   ` Konrad Rzeszutek Wilk
@ 2015-07-28 15:23                   ` Andrew Cooper
  2015-07-28 15:59                     ` [Xen-devel] " Boris Ostrovsky
  2015-07-28 15:59                     ` Boris Ostrovsky
  2015-07-28 15:23                   ` Andrew Cooper
  3 siblings, 2 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-28 15:23 UTC (permalink / raw)
  To: Boris Ostrovsky, Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Steven Rostedt, security,
	X86 ML, Borislav Petkov, Sasha Levin, linux-kernel,
	Konrad Rzeszutek Wilk, Jan Beulich, xen-devel

On 28/07/15 15:50, Boris Ostrovsky wrote:
> On 07/28/2015 10:35 AM, Andrew Cooper wrote:
>> On 28/07/15 15:05, Boris Ostrovsky wrote:
>>> On 07/28/2015 06:29 AM, Andrew Cooper wrote:
>>>>>> After forward-porting my virtio patches, I got this thing to run on
>>>>>> Xen.  After several tries, I got:
>>>>>>
>>>>>> [   53.985707] ------------[ cut here ]------------
>>>>>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>>>>>> [   53.986677] invalid opcode: 0000 [#1] SMP
>>>>>> [   53.986677] Modules linked in:
>>>>>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>>>>>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX,
>>>>>> 1996),
>>>>>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>>>>>> 04/01/2014
>>>>>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>>>>>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>>>>>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>>>>>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX:
>>>>>> 80000000
>>>>>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP:
>>>>>> c0875e74
>>>>>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>>>>>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4:
>>>>>> 00042660
>>>>>> [   53.986677] Stack:
>>>>>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001
>>>>>> cc3d2000
>>>>>> 00000b4a 00000200
>>>>>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000
>>>>>> c0875eb4
>>>>>> c1062310 c01861c0
>>>>>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e
>>>>>> c7007a00
>>>>>> c2373a80 00000000
>>>>>> [   53.986677] Call Trace:
>>>>>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>>>>>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>>>>>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>>>>>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>>>>>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>>>>>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>>>>>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>>>>>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>>>>>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>>>>>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>>>>>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>>>>>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>>>>>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>>>>>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74
>>>>>> 31 55
>>>>>> 89 e5
>>>>>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP
>>>>>> 0069:c0875e74
>>>>>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>>>>>
>>>>>> Is that the error you're seeing?
>>>>>>
>>>>>> If I change xen_free_ldt to:
>>>>>>
>>>>>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>>>>>> {
>>>>>>       const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>>>>>       int i;
>>>>>>
>>>>>>       vm_unmap_aliases();
>>>>>>       xen_mc_flush();
>>>>>>
>>>>>>       for(i = 0; i < entries; i += entries_per_page)
>>>>>>           set_aliased_prot(ldt + i, PAGE_KERNEL);
>>>>>> }
>>>>>>
>>>>>> then it works.  I don't know why this makes a difference.
>>>>>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>>>>>> doesn't.)
>>>>>>
>>>>> That fix makes sense if there's some way that the vmalloc area we're
>>>>> freeing has an extra alias somewhere, which is very much
>>>>> possible.  On
>>>>> the other hand, I don't see how this happens without first doing an
>>>>> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
>>>>> expected that to blow up and/or result in test case failures.
>>>>>
>>>>> But I'm still confused, because it seems like Xen will never populate
>>>>> the actual (hidden) LDT mapping unless the pages backing it are
>>>>> unaliased and well-formed, which makes me wonder why this stuff ever
>>>>> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
>>>>> in segfaults?
>>>>>
>>>>> The semantics seem to be very odd.  xen_free_ldt with an aliased
>>>>> address might fail (and OOPS), but actual access to the LDT with an
>>>>> aliased address page faults.
>>>>>
>>>>> Also, using kzalloc for everything fixes the problem, which suggests
>>>>> that there really is something to my theory that the problem involves
>>>>> unexpected aliases.
>>>> Xen does lazily populate the LDT frames.  The first time a page is
>>>> ever
>>>> referenced via the LDT, Xen will perform a typechange.
>>>>
>>>> Under Xen, guest mappings are reference counted with both a plain
>>>> reference, and a type count.  Types of writeable, segdesc and
>>>> pagetables
>>>> are mutually exclusive.  This prevents the guest from having writeable
>>>> mappings of interesting datastructures, but readable mappings are
>>>> fine.
>>>> Typechanges may only occur when the type reference count is 0.
>>>>
>>>> At the point of the typechange, no writeable mappings of the frame may
>>>> exist (and it must not be referenced by a L2 or greater page
>>>> directory),
>>>> or the typechange will fail.  Additionally the descriptors are audited
>>>> at this point, so if Xen objects to any of the descriptors in the same
>>>> page, the typechange will also fail.
>>>>
>>>> If the typechange fails, the pagefault gets propagated back to the
>>>> guest.
>>>>
>>>> The corollary to this is that, for xen_free_ldt() to create writeable
>>>> mappings again, a typechange back to writeable is needed.  This will
>>>> fail if the LDT frames are still referenced in any vcpu's LDT.
>>>>
>>>> It would be interesting to know which of the two BUG()s in
>>>> set_aliased_prot() tripped.
>>> The first one (i.e. not the alias)
>>>
>> In which case the page in question is still referenced in an LDT
>> (perhaps on a different vcpu)
>
> The problem is reproducible on a UP guest so it's not that.

Are you certain that the set_ldt(NULL, 0) has been flushed to Xen to
actually remove the LDT reference?  All of this is hidden behind some
lazy logic.
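
To make the concern concrete, here is a toy model of the type-count rule
quoted above.  It is only a sketch under assumptions: the names
(toy_frame, toy_get_type, toy_put_type) are invented for illustration,
and Xen's real accounting is considerably more involved.

```c
#include <assert.h>

/* Mutually exclusive frame types, per the description quoted above. */
enum toy_type { TOY_NONE, TOY_WRITABLE, TOY_SEGDESC, TOY_PAGETABLE };

struct toy_frame {
	enum toy_type type;
	unsigned int type_count;	/* references holding the current type */
};

/*
 * Take a typed reference.  The type may only change while the type
 * count is zero; otherwise the request fails and the frame is untouched.
 * Returns 0 on success, -1 on failure.
 */
static int toy_get_type(struct toy_frame *f, enum toy_type want)
{
	if (f->type != want) {
		if (f->type_count != 0)
			return -1;
		f->type = want;
	}
	f->type_count++;
	return 0;
}

/* Drop a typed reference, e.g. when an LDT stops referencing the frame. */
static void toy_put_type(struct toy_frame *f)
{
	assert(f->type_count > 0);
	f->type_count--;
}
```

In this model a frame still held as TOY_SEGDESC rejects
toy_get_type(&f, TOY_WRITABLE) until toy_put_type() has run, which
mirrors the question above about whether set_ldt(NULL, 0) has actually
reached Xen.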

>
>> or has been reused as a pagetable (I
>> really hope this is not the case).
>>
>> A sufficiently-debug Xen might be persuaded into telling you exactly
>> what it didn't like about the attempted transition.
>
> It just can't find l1 entry for the LDT address in
> __do_update_va_mapping().

Did you get the companion "Bad L1 flags" error message with that?

~Andrew

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 15:15                   ` Konrad Rzeszutek Wilk
  2015-07-28 15:39                     ` Boris Ostrovsky
@ 2015-07-28 15:39                     ` Boris Ostrovsky
  1 sibling, 0 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-28 15:39 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Andrew Cooper, Andy Lutomirski, Andy Lutomirski, Peter Zijlstra,
	Steven Rostedt, security, X86 ML, Borislav Petkov, Sasha Levin,
	linux-kernel, Jan Beulich, xen-devel

On 07/28/2015 11:15 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jul 28, 2015 at 10:50:39AM -0400, Boris Ostrovsky wrote:
>> On 07/28/2015 10:35 AM, Andrew Cooper wrote:
>>> On 28/07/15 15:05, Boris Ostrovsky wrote:
>>>> On 07/28/2015 06:29 AM, Andrew Cooper wrote:
>>>>>>> After forward-porting my virtio patches, I got this thing to run on
>>>>>>> Xen.  After several tries, I got:
>>>>>>>
>>>>>>> [   53.985707] ------------[ cut here ]------------
>>>>>>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>>>>>>> [   53.986677] invalid opcode: 0000 [#1] SMP
>>>>>>> [   53.986677] Modules linked in:
>>>>>>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>>>>>>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>>>>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>>>>>>> 04/01/2014
>>>>>>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>>>>>>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>>>>>>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>>>>>>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
>>>>>>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
>>>>>>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>>>>>>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
>>>>>>> [   53.986677] Stack:
>>>>>>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
>>>>>>> 00000b4a 00000200
>>>>>>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
>>>>>>> c1062310 c01861c0
>>>>>>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
>>>>>>> c2373a80 00000000
>>>>>>> [   53.986677] Call Trace:
>>>>>>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>>>>>>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>>>>>>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>>>>>>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>>>>>>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>>>>>>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>>>>>>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>>>>>>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>>>>>>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>>>>>>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>>>>>>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>>>>>>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>>>>>>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>>>>>>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
>>>>>>> 89 e5
>>>>>>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP
>>>>>>> 0069:c0875e74
>>>>>>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>>>>>>
>>>>>>> Is that the error you're seeing?
>>>>>>>
>>>>>>> If I change xen_free_ldt to:
>>>>>>>
>>>>>>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>>>>>>> {
>>>>>>>       const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>>>>>>       int i;
>>>>>>>
>>>>>>>       vm_unmap_aliases();
>>>>>>>       xen_mc_flush();
>>>>>>>
>>>>>>>       for(i = 0; i < entries; i += entries_per_page)
>>>>>>>           set_aliased_prot(ldt + i, PAGE_KERNEL);
>>>>>>> }
>>>>>>>
>>>>>>> then it works.  I don't know why this makes a difference.
>>>>>>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>>>>>>> doesn't.)
>>>>>>>
>>>>>> That fix makes sense if there's some way that the vmalloc area we're
>>>>>> freeing has an extra alias somewhere, which is very much possible.  On
>>>>>> the other hand, I don't see how this happens without first doing an
>>>>>> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
>>>>>> expected that to blow up and/or result in test case failures.
>>>>>>
>>>>>> But I'm still confused, because it seems like Xen will never populate
>>>>>> the actual (hidden) LDT mapping unless the pages backing it are
>>>>>> unaliased and well-formed, which makes me wonder why this stuff ever
>>>>>> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
>>>>>> in segfaults?
>>>>>>
>>>>>> The semantics seem to be very odd.  xen_free_ldt with an aliased
>>>>>> address might fail (and OOPS), but actual access to the LDT with an
>>>>>> aliased address page faults.
>>>>>>
>>>>>> Also, using kzalloc for everything fixes the problem, which suggests
>>>>>> that there really is something to my theory that the problem involves
>>>>>> unexpected aliases.
>>>>> Xen does lazily populate the LDT frames.  The first time a page is ever
>>>>> referenced via the LDT, Xen will perform a typechange.
>>>>>
>>>>> Under Xen, guest mappings are reference counted with both a plain
>>>>> reference, and a type count.  Types of writeable, segdesc and pagetables
>>>>> are mutually exclusive.  This prevents the guest from having writeable
>>>>> mappings of interesting datastructures, but readable mappings are fine.
>>>>> Typechanges may only occur when the type reference count is 0.
>>>>>
>>>>> At the point of the typechange, no writeable mappings of the frame may
>>>>> exist (and it must not be referenced by a L2 or greater page directory),
>>>>> or the typechange will fail.  Additionally the descriptors are audited
>>>>> at this point, so if Xen objects to any of the descriptors in the same
>>>>> page, the typechange will also fail.
>>>>>
>>>>> If the typechange fails, the pagefault gets propagated back to the
>>>>> guest.
>>>>>
>>>>> The corollary to this is that, for xen_free_ldt() to create writeable
>>>>> mappings again, a typechange back to writeable is needed.  This will
>>>>> fail if the LDT frames are still referenced in any vcpu's LDT.
>>>>>
>>>>> It would be interesting to know which of the two BUG()s in
>>>>> set_aliased_prot() tripped.
>>>> The first one (i.e. not the alias)
>>>>
>>> In which case the page in question is still referenced in an LDT
>>> (perhaps on a different vcpu)
>> The problem is reproducible on a UP guest so it's not that.
> The Linux kernel does a bunch of lazy maps and unmaps and we
> may be getting an interrupt while the lazy unmap hasn't been
> called (arch_leave_lazy_mmu_mode).
>
> Having vm_unmap_aliases and then xen_mc_flush (which is what
> arch_leave_lazy_mmu_mode ends up doing too, and more) would solve it.
>
> Though I would have thought that vm_unmap_aliases would call
> arch_leave_lazy_mmu_mode.
>>> or has been reused as a pagetable (I
>>> really hope this is not the case).
>>>
>>> A sufficiently-debug Xen might be persuaded into telling you exactly
>>> what it didn't like about the attempted transition.
>> It just can't find l1 entry for the LDT address in __do_update_va_mapping().
> Which would imply that it has not been written in. Which corresponds
> to the set_aliased_prot hitting the first BUG_ON.
>
> The xen_mc_flush() also triggers the batched hypercalls - which means we
> may have some hypercalls that have not yet gone to the hypervisor and
> then we try to do an LDT hypercall (not batched).

If this were true then having xen_mc_flush() in xen_free_ldt() would 
have fixed this problem, and it didn't (without preceding 
vm_unmap_aliases()).

In any case, the patch below doesn't help.

-boris

>
> You could try building with this debug:
>
>
> diff --git a/arch/x86/xen/multicalls.c b/arch/x86/xen/multicalls.c
> index ea54a08..5d214ce 100644
> --- a/arch/x86/xen/multicalls.c
> +++ b/arch/x86/xen/multicalls.c
> @@ -28,9 +28,9 @@
>   #include "multicalls.h"
>   #include "debugfs.h"
>   
> -#define MC_BATCH	32
> +#define MC_BATCH	1
>   
> -#define MC_DEBUG	0
> +#define MC_DEBUG	1
>   
>   #define MC_ARGS		(MC_BATCH * 16)
>   
>> -boris
>>


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 15:15                   ` Konrad Rzeszutek Wilk
@ 2015-07-28 15:39                     ` Boris Ostrovsky
  2015-07-28 15:39                     ` Boris Ostrovsky
  1 sibling, 0 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-28 15:39 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: security, Jan Beulich, Peter Zijlstra, Andrew Cooper, X86 ML,
	linux-kernel, Steven Rostedt, Andy Lutomirski, Borislav Petkov,
	Andy Lutomirski, Sasha Levin, xen-devel

On 07/28/2015 11:15 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jul 28, 2015 at 10:50:39AM -0400, Boris Ostrovsky wrote:
>> On 07/28/2015 10:35 AM, Andrew Cooper wrote:
>>> On 28/07/15 15:05, Boris Ostrovsky wrote:
>>>> On 07/28/2015 06:29 AM, Andrew Cooper wrote:
>>>>>>> After forward-porting my virtio patches, I got this thing to run on
>>>>>>> Xen.  After several tries, I got:
>>>>>>>
>>>>>>> [   53.985707] ------------[ cut here ]------------
>>>>>>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>>>>>>> [   53.986677] invalid opcode: 0000 [#1] SMP
>>>>>>> [   53.986677] Modules linked in:
>>>>>>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>>>>>>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>>>>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>>>>>>> 04/01/2014
>>>>>>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>>>>>>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>>>>>>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>>>>>>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
>>>>>>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
>>>>>>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>>>>>>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
>>>>>>> [   53.986677] Stack:
>>>>>>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
>>>>>>> 00000b4a 00000200
>>>>>>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
>>>>>>> c1062310 c01861c0
>>>>>>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
>>>>>>> c2373a80 00000000
>>>>>>> [   53.986677] Call Trace:
>>>>>>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>>>>>>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>>>>>>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>>>>>>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>>>>>>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>>>>>>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>>>>>>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>>>>>>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>>>>>>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>>>>>>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>>>>>>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>>>>>>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>>>>>>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>>>>>>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
>>>>>>> 89 e5
>>>>>>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP
>>>>>>> 0069:c0875e74
>>>>>>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>>>>>>
>>>>>>> Is that the error you're seeing?
>>>>>>>
>>>>>>> If I change xen_free_ldt to:
>>>>>>>
>>>>>>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>>>>>>> {
>>>>>>>       const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>>>>>>       int i;
>>>>>>>
>>>>>>>       vm_unmap_aliases();
>>>>>>>       xen_mc_flush();
>>>>>>>
>>>>>>>       for(i = 0; i < entries; i += entries_per_page)
>>>>>>>           set_aliased_prot(ldt + i, PAGE_KERNEL);
>>>>>>> }
>>>>>>>
>>>>>>> then it works.  I don't know why this makes a difference.
>>>>>>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>>>>>>> doesn't.)
>>>>>>>
>>>>>> That fix makes sense if there's some way that the vmalloc area we're
>>>>>> freeing has an extra alias somewhere, which is very much possible.  On
>>>>>> the other hand, I don't see how this happens without first doing an
>>>>>> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
>>>>>> expected that to blow up and/or result in test case failures.
>>>>>>
>>>>>> But I'm still confused, because it seems like Xen will never populate
>>>>>> the actual (hidden) LDT mapping unless the pages backing it are
>>>>>> unaliased and well-formed, which makes me wonder why this stuff ever
>>>>>> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
>>>>>> in segfaults?
>>>>>>
>>>>>> The semantics seem to be very odd.  xen_free_ldt with an aliased
>>>>>> address might fail (and OOPS), but actual access to the LDT with an
>>>>>> aliased address page faults.
>>>>>>
>>>>>> Also, using kzalloc for everything fixes the problem, which suggests
>>>>>> that there really is something to my theory that the problem involves
>>>>>> unexpected aliases.
>>>>> Xen does lazily populate the LDT frames.  The first time a page is ever
>>>>> referenced via the LDT, Xen will perform a typechange.
>>>>>
>>>>> Under Xen, guest mappings are reference counted with both a plain
>>>>> reference, and a type count.  Types of writeable, segdesc and pagetables
>>>>> are mutually exclusive.  This prevents the guest from having writeable
>>>>> mappings of interesting datastructures, but readable mappings are fine.
>>>>> Typechanges may only occur when the type reference count is 0.
>>>>>
>>>>> At the point of the typechange, no writeable mappings of the frame may
>>>>> exist (and it must not be referenced by a L2 or greater page directory),
>>>>> or the typechange will fail.  Additionally the descriptors are audited
>>>>> at this point, so if Xen objects to any of the descriptors in the same
>>>>> page, the typechange will also fail.
>>>>>
>>>>> If the typechange fails, the pagefault gets propagated back to the
>>>>> guest.
>>>>>
>>>>> The corollary to this is that, for xen_free_ldt() to create writeable
>>>>> mappings again, a typechange back to writeable is needed.  This will
>>>>> fail if the LDT frames are still referenced in any vcpu's LDT.
>>>>>
>>>>> It would be interesting to know which of the two BUG()s in
>>>>> set_aliased_prot() tripped.
>>>> The first one (i.e. not the alias)
>>>>
>>> In which case the page in question is still referenced in an LDT
>>> (perhaps on a different vcpu)
>> The problem is reproducible on a UP guest so it's not that.
> The Linux kernel does a bunch of lazy maps and unmaps and we
> may be getting an interrupt while the lazy unmap hasn't been
> called  (arch_leave_lazy_mmu_mode).
>
> Having the vm_unmap_aliases and then xen_mc_flush (which is what
> arch_leave_lazy_mmu_mode ends up doing too, and more) would solve it.
>
> Though I would have thought that vm_unmap_aliases would call
> arch_leave_lazy_mmu_mode.
>>> or has been reused as a pagetable (I
>>> really hope this is not the case).
>>>
>>> A sufficiently-debug Xen might be persuaded into telling you exactly
>>> what it didn't like about the attempted transition.
>> It just can't find l1 entry for the LDT address in __do_update_va_mapping().
> Which would imply that it has not been written in. Which corresponds
> to the set_aliased_prot hitting the first BUG_ON.
>
> The xen_mc_flush() also triggers the batched hypercalls - which means we
> may have some hypercalls that have not yet gone to the hypervisor and
> then we try to do an LDT hypercall (not batched).

If this were true then having xen_mc_flush() in xen_free_ldt() would 
have fixed this problem, and it didn't (without preceding 
vm_unmap_aliases()).
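
To make the ordering concern concrete, here is a toy userspace model. It is only a sketch and assumes nothing about the real multicall code: queued operations take effect only at flush time, so a direct (unbatched) call issued before the flush sees stale state. Every name in it (toy_hv, toy_mc_queue, toy_mc_flush, toy_make_writable) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not kernel code. The "hypervisor" state is a single flag
 * recording whether the LDT page is still referenced. */
#define TOY_BATCH 32

struct toy_hv {
	int ldt_referenced;	/* 1 while the "hypervisor" still uses the LDT page */
};

typedef void (*toy_op_fn)(struct toy_hv *hv);

struct toy_mc {
	toy_op_fn ops[TOY_BATCH];	/* operations queued but not yet issued */
	size_t n;
};

/* Hypothetical batched operation: drop the LDT reference. */
static void toy_clear_ldt(struct toy_hv *hv)
{
	hv->ldt_referenced = 0;
}

/* Queue an operation; it has no effect until toy_mc_flush() runs. */
static void toy_mc_queue(struct toy_mc *mc, toy_op_fn fn)
{
	assert(mc->n < TOY_BATCH);
	mc->ops[mc->n++] = fn;
}

/* Issue every queued operation, in the order it was queued. */
static void toy_mc_flush(struct toy_mc *mc, struct toy_hv *hv)
{
	for (size_t i = 0; i < mc->n; i++)
		mc->ops[i](hv);
	mc->n = 0;
}

/* Hypothetical direct (unbatched) call: making the page writable again
 * fails with -1 while the page is still referenced as an LDT. */
static int toy_make_writable(const struct toy_hv *hv)
{
	return hv->ldt_referenced ? -1 : 0;
}
```

In this model, queueing toy_clear_ldt and then calling toy_make_writable fails until toy_mc_flush has run, and a flush on its own is always sufficient. That is why a flush alone not helping argues against this explanation.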

In any case, the patch below doesn't help.

-boris

>
> You could try building with this debug:
>
>
> diff --git a/arch/x86/xen/multicalls.c b/arch/x86/xen/multicalls.c
> index ea54a08..5d214ce 100644
> --- a/arch/x86/xen/multicalls.c
> +++ b/arch/x86/xen/multicalls.c
> @@ -28,9 +28,9 @@
>   #include "multicalls.h"
>   #include "debugfs.h"
>   
> -#define MC_BATCH	32
> +#define MC_BATCH	1
>   
> -#define MC_DEBUG	0
> +#define MC_DEBUG	1
>   
>   #define MC_ARGS		(MC_BATCH * 16)
>   
>> -boris
>>

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 10:29           ` Andrew Cooper
                               ` (2 preceding siblings ...)
  2015-07-28 15:43             ` Andy Lutomirski
@ 2015-07-28 15:43             ` Andy Lutomirski
  2015-07-28 16:30               ` Andrew Cooper
  2015-07-28 16:30               ` Andrew Cooper
  3 siblings, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-28 15:43 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: security, Boris Ostrovsky, Borislav Petkov, X86 ML, xen-devel,
	Konrad Rzeszutek Wilk, Steven Rostedt, linux-kernel, Jan Beulich,
	Sasha Levin, Peter Zijlstra

On Jul 28, 2015 3:30 AM, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>
> On 28/07/15 04:16, Andy Lutomirski wrote:
> > On Mon, Jul 27, 2015 at 7:20 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> On Mon, Jul 27, 2015 at 9:18 AM, Boris Ostrovsky
> >> <boris.ostrovsky@oracle.com> wrote:
> >>> On 07/27/2015 11:53 AM, Andy Lutomirski wrote:
> >>>> On Mon, Jul 27, 2015 at 8:36 AM, Boris Ostrovsky
> >>>> <boris.ostrovsky@oracle.com> wrote:
> >>>>> On 07/25/2015 01:36 AM, Andy Lutomirski wrote:
> >>>>>> Here's v3.  It fixes the "dazed and confused" issue, I hope.  It's also
> >>>>>> probably a good general attack surface reduction, and it replaces some
> >>>>>> scary code with IMO less scary code.
> >>>>>>
> >>>>>> Also, servers and embedded systems should probably turn off modify_ldt.
> >>>>>> This makes that possible.
> >>>>>>
> >>>>>> Xen people, can you take a look at this?
> >>>>>>
> >>>>>> Willy and Kees: I left the config option alone.  The -tiny people will
> >>>>>> like it, and we can always add a sysctl of some sort later.
> >>>>>>
> >>>>>> Changes from v3:
> >>>>>>    - Hopefully fixed Xen.
> >>>>>
> >>>>> 32b-on-32b fails in the same manner. (but non-zero LDT is taken care of)
> >>>>>
> >>>>>>    - Fixed 32-bit test case on 32-bit native kernel.
> >>>>>
> >>>>> I am not sure I see what changed.
> >>>> I misplaced the fix in the wrong git commit, so I failed to send it.
> >>>> Oops.
> >>>>
> >>>> I just sent v4.1 of patch 3.  Can you try that?
> >>>
> >>>
> >>> I am hitting BUG() in Xen code (returning from a hypercall) when freeing LDT
> >>> in destroy_context(). Interestingly though when I run the test in the
> >>> debugger I get SIGILL (just like before) but no BUG().
> >>>
> >>> Let me get back to you on that later today.
> >>>
> >>>
> >> After forward-porting my virtio patches, I got this thing to run on
> >> Xen.  After several tries, I got:
> >>
> >> [   53.985707] ------------[ cut here ]------------
> >> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
> >> [   53.986677] invalid opcode: 0000 [#1] SMP
> >> [   53.986677] Modules linked in:
> >> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
> >> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> >> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
> >> 04/01/2014
> >> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
> >> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
> >> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
> >> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
> >> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
> >> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
> >> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
> >> [   53.986677] Stack:
> >> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
> >> 00000b4a 00000200
> >> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
> >> c1062310 c01861c0
> >> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
> >> c2373a80 00000000
> >> [   53.986677] Call Trace:
> >> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
> >> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
> >> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
> >> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
> >> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
> >> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
> >> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
> >> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
> >> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
> >> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
> >> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
> >> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
> >> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
> >> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
> >> 89 e5
> >> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP 0069:c0875e74
> >> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
> >>
> >> Is that the error you're seeing?
> >>
> >> If I change xen_free_ldt to:
> >>
> >> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
> >> {
> >>     const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
> >>     int i;
> >>
> >>     vm_unmap_aliases();
> >>     xen_mc_flush();
> >>
> >>     for(i = 0; i < entries; i += entries_per_page)
> >>         set_aliased_prot(ldt + i, PAGE_KERNEL);
> >> }
> >>
> >> then it works.  I don't know why this makes a difference.
> >> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
> >> doesn't.)
> >>
> > That fix makes sense if there's some way that the vmalloc area we're
> > freeing has an extra alias somewhere, which is very much possible.  On
> > the other hand, I don't see how this happens without first doing an
> > MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
> > expected that to blow up and/or result in test case failures.
> >
> > But I'm still confused, because it seems like Xen will never populate
> > the actual (hidden) LDT mapping unless the pages backing it are
> > unaliased and well-formed, which makes me wonder why this stuff ever
> > worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
> > in segfaults?
> >
> > The semantics seem to be very odd.  xen_free_ldt with an aliased
> > address might fail (and OOPS), but actual access to the LDT with an
> > aliased address page faults.
> >
> > Also, using kzalloc for everything fixes the problem, which suggests
> > that there really is something to my theory that the problem involves
> > unexpected aliases.
>
> Xen does lazily populate the LDT frames.  The first time a page is ever
> referenced via the LDT, Xen will perform a typechange.
>
> Under Xen, guest mappings are reference counted with both a plain
> reference, and a type count.  Types of writeable, segdesc and pagetables
> are mutually exclusive.  This prevents the guest from having writeable
> mappings of interesting datastructures, but readable mappings are fine.
> Typechanges may only occur when the type reference count is 0.

Makes sense.

>
> At the point of the typechange, no writeable mappings of the frame may
> exist (and it must not be referenced by a L2 or greater page directory),
> or the typechange will fail.  Additionally the descriptors are audited
> at this point, so if Xen objects to any of the descriptors in the same
> page, the typechange will also fail.
>
> If the typechange fails, the pagefault gets propagated back to the guest.

The part I don't understand is that I didn't observe any page faults.

>
> The corollary to this is that, for xen_free_ldt() to create writeable
> mappings again, a typechange back to writeable is needed.  This will
> fail if the LDT frames are still referenced in any vcpu's LDT.

And the mystery here is that I don't see how the typechange to LDT
would have succeeded in the first place if we had a writable alias.
In fact, I just fudged xen_set_ldt to probe the entire LDT using lsl,
and it didn't page fault, so I'm still quite confused as to what's
going on.  I've confirmed that my lsl hack really does work -- if I
disable xen_alloc_ldt, then xen_set_ldt blows up immediately with my
patch.

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/tmp&id=ce664ed73aac804ef2a16ddef45589dbeba55570
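
For reference, the type-count rule as described above can be sketched as a toy userspace model. This is an illustration of the description in this thread, not Xen's implementation: all names (toy_page, toy_get_type, toy_put_type) are hypothetical, and the descriptor audit and plain (non-type) references are left out.

```c
#include <assert.h>

/* Toy model of the rule described above: a page has one type at a time,
 * and the type may only change while the type count is zero. */
enum toy_type { TOY_NONE, TOY_WRITABLE, TOY_SEGDESC, TOY_PAGETABLE };

struct toy_page {
	enum toy_type type;		/* current type of the page */
	unsigned int type_count;	/* number of typed references held */
};

/* Take a typed reference. Returns 0 on success, -1 if the page is
 * currently held with a different, mutually exclusive type. */
static int toy_get_type(struct toy_page *pg, enum toy_type want)
{
	if (pg->type_count == 0)
		pg->type = want;	/* typechange: only legal at count 0 */
	else if (pg->type != want)
		return -1;		/* conflicting type still referenced */
	pg->type_count++;
	return 0;
}

/* Drop a typed reference taken with toy_get_type(). */
static void toy_put_type(struct toy_page *pg)
{
	assert(pg->type_count > 0);
	pg->type_count--;
}
```

Under this model, a page with an outstanding writable reference can never become a segdesc page, and a page still referenced as an LDT can never become writable again. That is the symmetric pair of failures being discussed here, and it is why a writable alias at set_ldt time would be expected to show up as a fault rather than only at free time.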

Oddly, the OOPS seems much more likely when I kill ldt_gdt_32 with
Ctrl-C than when I let it run to completion.  The bug is here:

        if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0))
                BUG();

so the only sensible explanation I have is that either the guest really
didn't drop all refs to the LDT or Xen hasn't noticed yet.  I
still don't see how aliases could be involved, because Xen really did
accept the LDT.  Are multicalls ordered?

Hrm.  Does Xen actually do something sensible with set_ldt(NULL, 0)?

Also, xen_mc_issue seems buggy.  Is lazy_mode an enum or a bit field?
What happens if two separate lazy modes are entered?  I suspect that
one of them clobbers the other one.
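
To spell out the worry, here is a toy comparison of the two representations. It is a sketch only and makes no claim about how the kernel actually stores its lazy mode; the names (toy_lazy, toy_single, toy_mask and the helpers) are invented for illustration.

```c
#include <assert.h>

/* Toy model only: two ways to remember which lazy modes are active. */
enum toy_lazy { TOY_LAZY_NONE = 0, TOY_LAZY_MMU = 1, TOY_LAZY_CPU = 2 };

/* Variant 1: a single value. Entering a second mode overwrites the
 * first, and leaving forgets both. */
struct toy_single {
	enum toy_lazy mode;
};

static void single_enter(struct toy_single *s, enum toy_lazy m)
{
	s->mode = m;
}

static void single_leave(struct toy_single *s)
{
	s->mode = TOY_LAZY_NONE;
}

/* Variant 2: a bit mask. Each mode is tracked independently. */
struct toy_mask {
	unsigned int modes;
};

static void mask_enter(struct toy_mask *s, enum toy_lazy m)
{
	s->modes |= (unsigned int)m;
}

static void mask_leave(struct toy_mask *s, enum toy_lazy m)
{
	s->modes &= ~(unsigned int)m;
}

static int mask_active(const struct toy_mask *s, enum toy_lazy m)
{
	return (s->modes & (unsigned int)m) != 0;
}
```

With the single-value variant, entering MMU mode, then CPU mode, then leaving once ends with no mode recorded at all, even though the outer MMU section is still logically open. The mask variant keeps the MMU bit set in the same sequence.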

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 15:23                   ` Andrew Cooper
@ 2015-07-28 15:59                     ` Boris Ostrovsky
  2015-07-28 15:59                     ` Boris Ostrovsky
  1 sibling, 0 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-28 15:59 UTC (permalink / raw)
  To: Andrew Cooper, Andy Lutomirski
  Cc: security, Jan Beulich, Peter Zijlstra, X86 ML, linux-kernel,
	Steven Rostedt, xen-devel, Borislav Petkov, Andy Lutomirski,
	Sasha Levin

On 07/28/2015 11:23 AM, Andrew Cooper wrote:
> On 28/07/15 15:50, Boris Ostrovsky wrote:
>> On 07/28/2015 10:35 AM, Andrew Cooper wrote:
>>> On 28/07/15 15:05, Boris Ostrovsky wrote:
>>>> On 07/28/2015 06:29 AM, Andrew Cooper wrote:
>>>>>>> After forward-porting my virtio patches, I got this thing to run on
>>>>>>> Xen.  After several tries, I got:
>>>>>>>
>>>>>>> [   53.985707] ------------[ cut here ]------------
>>>>>>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>>>>>>> [   53.986677] invalid opcode: 0000 [#1] SMP
>>>>>>> [   53.986677] Modules linked in:
>>>>>>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>>>>>>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>>>>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>>>>>>> 04/01/2014
>>>>>>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>>>>>>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>>>>>>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>>>>>>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
>>>>>>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
>>>>>>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>>>>>>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
>>>>>>> [   53.986677] Stack:
>>>>>>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
>>>>>>> 00000b4a 00000200
>>>>>>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
>>>>>>> c1062310 c01861c0
>>>>>>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
>>>>>>> c2373a80 00000000
>>>>>>> [   53.986677] Call Trace:
>>>>>>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>>>>>>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>>>>>>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>>>>>>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>>>>>>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>>>>>>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>>>>>>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>>>>>>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>>>>>>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>>>>>>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>>>>>>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>>>>>>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>>>>>>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>>>>>>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
>>>>>>> 89 e5
>>>>>>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP
>>>>>>> 0069:c0875e74
>>>>>>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>>>>>>
>>>>>>> Is that the error you're seeing?
>>>>>>>
>>>>>>> If I change xen_free_ldt to:
>>>>>>>
>>>>>>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>>>>>>> {
>>>>>>>        const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>>>>>>        int i;
>>>>>>>
>>>>>>>        vm_unmap_aliases();
>>>>>>>        xen_mc_flush();
>>>>>>>
>>>>>>>        for(i = 0; i < entries; i += entries_per_page)
>>>>>>>            set_aliased_prot(ldt + i, PAGE_KERNEL);
>>>>>>> }
>>>>>>>
>>>>>>> then it works.  I don't know why this makes a difference.
>>>>>>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>>>>>>> doesn't.)
>>>>>>>
>>>>>> That fix makes sense if there's some way that the vmalloc area we're
>>>>>> freeing has an extra alias somewhere, which is very much possible.  On
>>>>>> the other hand, I don't see how this happens without first doing an
>>>>>> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
>>>>>> expected that to blow up and/or result in test case failures.
>>>>>>
>>>>>> But I'm still confused, because it seems like Xen will never populate
>>>>>> the actual (hidden) LDT mapping unless the pages backing it are
>>>>>> unaliased and well-formed, which makes me wonder why this stuff ever
>>>>>> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
>>>>>> in segfaults?
>>>>>>
>>>>>> The semantics seem to be very odd.  xen_free_ldt with an aliased
>>>>>> address might fail (and OOPS), but actual access to the LDT with an
>>>>>> aliased address page faults.
>>>>>>
>>>>>> Also, using kzalloc for everything fixes the problem, which suggests
>>>>>> that there really is something to my theory that the problem involves
>>>>>> unexpected aliases.
>>>>> Xen does lazily populate the LDT frames.  The first time a page is ever
>>>>> referenced via the LDT, Xen will perform a typechange.
>>>>>
>>>>> Under Xen, guest mappings are reference counted with both a plain
>>>>> reference, and a type count.  Types of writeable, segdesc and pagetables
>>>>> are mutually exclusive.  This prevents the guest from having writeable
>>>>> mappings of interesting datastructures, but readable mappings are fine.
>>>>> Typechanges may only occur when the type reference count is 0.
>>>>>
>>>>> At the point of the typechange, no writeable mappings of the frame may
>>>>> exist (and it must not be referenced by a L2 or greater page directory),
>>>>> or the typechange will fail.  Additionally the descriptors are audited
>>>>> at this point, so if Xen objects to any of the descriptors in the same
>>>>> page, the typechange will also fail.
>>>>>
>>>>> If the typechange fails, the pagefault gets propagated back to the
>>>>> guest.
>>>>>
>>>>> The corollary to this is that, for xen_free_ldt() to create writeable
>>>>> mappings again, a typechange back to writeable is needed.  This will
>>>>> fail if the LDT frames are still referenced in any vcpu's LDT.
>>>>>
>>>>> It would be interesting to know which of the two BUG()s in
>>>>> set_aliased_prot() tripped.
>>>> The first one (i.e. not the alias)
>>>>
>>> In which case the page in question is still referenced in an LDT
>>> (perhaps on a different vcpu)
>> The problem is reproducible on a UP guest so it's not that.
> Are you certain that the set_ldt(NULL, 0) has been flushed to Xen to
> actually remove the LDT reference?  All of this is hidden behind some
> lazy logic.

Andy's patch actually removed clear_LDT().

I did put it back while debugging this, though, and it didn't make any
difference. (It was flushed once I added xen_mc_flush() to
xen_free_ldt(), which is called soon afterwards, before the LDT page
attributes are changed.)

>
>>> or has been reused as a pagetable (I
>>> really hope this is not the case).
>>>
>>> A sufficiently-debug Xen might be persuaded into telling you exactly
>>> what it didn't like about the attempted transition.
>> It just can't find l1 entry for the LDT address in
>> __do_update_va_mapping().
> Did you get the companion "Bad L1 flags" error message with that?
>

No.

-boris

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 15:43             ` Andy Lutomirski
  2015-07-28 16:30               ` Andrew Cooper
@ 2015-07-28 16:30               ` Andrew Cooper
  2015-07-28 17:07                 ` Andy Lutomirski
  2015-07-28 17:07                 ` Andy Lutomirski
  1 sibling, 2 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-28 16:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: security, Boris Ostrovsky, Borislav Petkov, X86 ML, xen-devel,
	Konrad Rzeszutek Wilk, Steven Rostedt, linux-kernel, Jan Beulich,
	Sasha Levin, Peter Zijlstra

On 28/07/15 16:43, Andy Lutomirski wrote:
>
>>>> After forward-porting my virtio patches, I got this thing to run on
>>>> Xen.  After several tries, I got:
>>>>
>>>> [   53.985707] ------------[ cut here ]------------
>>>> [   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
>>>> [   53.986677] invalid opcode: 0000 [#1] SMP
>>>> [   53.986677] Modules linked in:
>>>> [   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
>>>> [   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
>>>> 04/01/2014
>>>> [   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
>>>> [   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
>>>> [   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
>>>> [   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
>>>> [   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
>>>> [   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>>>> [   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
>>>> [   53.986677] Stack:
>>>> [   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
>>>> 00000b4a 00000200
>>>> [   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
>>>> c1062310 c01861c0
>>>> [   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
>>>> c2373a80 00000000
>>>> [   53.986677] Call Trace:
>>>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>>>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>>>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>>>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>>>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>>>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>>>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>>>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>>>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>>>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>>>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>>>> [   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
>>>> 4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
>>>> c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
>>>> 89 e5
>>>> [   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP 0069:c0875e74
>>>> [   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
>>>>
>>>> Is that the error you're seeing?
>>>>
>>>> If I change xen_free_ldt to:
>>>>
>>>> static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
>>>> {
>>>>     const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
>>>>     int i;
>>>>
>>>>     vm_unmap_aliases();
>>>>     xen_mc_flush();
>>>>
>>>>     for(i = 0; i < entries; i += entries_per_page)
>>>>         set_aliased_prot(ldt + i, PAGE_KERNEL);
>>>> }
>>>>
>>>> then it works.  I don't know why this makes a difference.
>>>> (xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
>>>> doesn't.)
>>>>
>>> That fix makes sense if there's some way that the vmalloc area we're
>>> freeing has an extra alias somewhere, which is very much possible.  On
>>> the other hand, I don't see how this happens without first doing an
>>> MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
>>> expected that to blow up and/or result in test case failures.
>>>
>>> But I'm still confused, because it seems like Xen will never populate
>>> the actual (hidden) LDT mapping unless the pages backing it are
>>> unaliased and well-formed, which make me wonder why this stuff ever
>>> worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
>>> in segfaults?
>>>
>>> The semantics seem to be very odd.  xen_free_ldt with an aliased
>>> address might fail (and OOPS), but actual access to the LDT with an
>>> aliased address page faults.
>>>
>>> Also, using kzalloc for everything fixes the problem, which suggests
>>> that there really is something to my theory that the problem involves
>>> unexpected aliases.
>> Xen does lazily populate the LDT frames.  The first time a page is ever
>> referenced via the LDT, Xen will perform a typechange.
>>
>> Under Xen, guest mappings are reference counted with both a plain
>> reference, and a type count.  Types of writeable, segdec and pagetables
>> are mutually exclusive.  This prevents the guest from having writeable
>> mappings of interesting datastructures, but readable mappings are fine.
>> Typechanges may only occur when the type reference count is 0.
> Makes sense.
>
>> At the point of the typechange, no writeable mappings of the frame may
>> exist (and it must not be referenced by a L2 or greater page directory),
>> or the typechange will fail.  Additionally the descriptors are audited
>> at this point, so if Xen objects to any of the descriptors in the same
>> page, the typechange will also fail.
>>
>> If the typechange fails, the pagefault gets propagated back to the guest.
> The part I don't understand is that I didn't observe any page faults.
>
>> The corollary to this is that, for xen_free_ldt() to create writeable
>> mappings again, a typechange back to writeable is needed.  This will
>> fail if the LDT frames are still referenced in any vcpus LDT.
> And the mystery here is that I don't see how the typechange to LDT
> would have succeeded in the first place if we had a writable alias.

It wouldn't have.  (Unless we have a serious bug in Xen).

> In fact, I just fudged xen_set_ldt to probe the entire LDT using lsl,
> and it didn't page fault, so I'm still quite confused as to what's
> going on.  I've confirmed that my lsl hack really does work -- if I
> disable xen_alloc_ldt, then xen_set_ldt blows up immediately with my
> patch.

In which case we can be fairly sure that the LDT was properly installed.

>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/tmp&id=ce664ed73aac804ef2a16ddef45589dbeba55570
>
> Oddly, the OOPS seems much more likely when I kill ldt_gdt_32 with
> Ctrl-C than when I let it run to completion.  The bug is here:
>
>         if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0))
>                 BUG();
>
> so the only sensible explanation I have is either the guest really
> didn't drop all refs to the LDT or that Xen hasn't noticed yet.  I
> still don't see how aliases could be involved, because Xen really did
> accept the LDT.

Agreed.

> Are multicalls ordered?

Once the multicall hypercall is made, the multicalls are processed in
order.  The hypercalls might be preempted in Xen to deliver interrupts
to the guest.

However, the xen_mc_* infrastructure in the kernel obscures a lot of this.

>
> Hrm.  Does Xen actually do something sensible with set_ldt(NULL, 0)?

(When the hypercall gets to Xen), any set_ldt() call flushes the current
LDT, including synchronously decrementing the typecount for every LDT
page faulted in thus far.

>
> Also, xen_mc_issue seems buggy.  Is lazy_mode an enum or a bit field?
> What happens if two separate lazy modes are entered?  I suspect that
> one of them clobbers the other one.

This is the first time I have really peered into the xen_mc_* stuff.  It
is hardly the most clear of code to follow.  I am afraid that I will
have to pass that question to the Xen maintainers in Linux.


I suspect that the set_ldt(NULL, 0) call hasn't reached Xen before
xen_free_ldt() is attempting to nab back the pages which Xen still has
mapped as an LDT.

~Andrew
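
The hypothesis above (a queued set_ldt(NULL, 0) that has not yet reached
Xen while xen_free_ldt() already asks for the pages back) can be sketched
as a toy model.  This is not Xen or Linux code: struct frame, struct batch
and every function name below are made up for illustration, and the only
behaviour taken from the thread is that a type may change only when the
type count is 0 and that a failed request comes back as an error
(-EINVAL here, which matches EAX: ffffffea in the oops).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: one page frame with a single type and type count. */
enum page_type { PT_NONE, PT_WRITABLE, PT_SEGDESC };

struct frame {
    enum page_type type;
    unsigned int type_count;    /* references holding the current type */
};

/* Take a type reference; the type may only change when the count is 0. */
static int get_type(struct frame *f, enum page_type want)
{
    if (f->type_count != 0 && f->type != want)
        return -EINVAL;
    f->type = want;
    f->type_count++;
    return 0;
}

/* Drop one type reference. */
static void put_type(struct frame *f)
{
    if (f->type_count > 0)
        f->type_count--;
}

/* Hypothetical guest-side batch: operations sit here until flushed. */
#define BATCH_MAX 8

struct batch {
    struct frame *drop_ref[BATCH_MAX];
    size_t n;
};

/* Queue "clear the LDT": the frame's reference is dropped only on flush. */
static bool batch_queue_clear_ldt(struct batch *b, struct frame *f)
{
    if (b->n == BATCH_MAX)
        return false;
    b->drop_ref[b->n++] = f;
    return true;
}

/* Deliver the queued operations to the model hypervisor, in order. */
static void batch_flush(struct batch *b)
{
    for (size_t i = 0; i < b->n; i++)
        put_type(b->drop_ref[i]);
    b->n = 0;
}

/* Model of asking for a writable mapping again when freeing an LDT page. */
static int make_writable(struct frame *f)
{
    return get_type(f, PT_WRITABLE);
}
```

In this model, calling make_writable() on a frame that still holds a
PT_SEGDESC reference fails until batch_flush() has run, which is the
ordering problem being suspected here.  It says nothing about whether
that is what actually happens in the guest.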

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 16:30               ` Andrew Cooper
  2015-07-28 17:07                 ` Andy Lutomirski
@ 2015-07-28 17:07                 ` Andy Lutomirski
  2015-07-28 17:10                   ` [Xen-devel] " Boris Ostrovsky
  2015-07-28 17:10                   ` Boris Ostrovsky
  1 sibling, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-28 17:07 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: security, Boris Ostrovsky, Borislav Petkov, X86 ML, xen-devel,
	Konrad Rzeszutek Wilk, Steven Rostedt, linux-kernel, Jan Beulich,
	Sasha Levin, Peter Zijlstra

On Tue, Jul 28, 2015 at 9:30 AM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> I suspect that the set_ldt(NULL, 0) call hasn't reached Xen before
> xen_free_ldt() is attempting to nab back the pages which Xen still has
> mapped as an LDT.
>

I just instrumented it with yet more LSL instructions.  I'm pretty
sure that set_ldt really is clearing at least LDT entry zero.
Nonetheless the free_ldt call still oopses.

--Andy

> ~Andrew



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 17:07                 ` Andy Lutomirski
@ 2015-07-28 17:10                   ` Boris Ostrovsky
  2015-07-29  0:21                     ` Andy Lutomirski
  2015-07-29  0:21                     ` [Xen-devel] " Andy Lutomirski
  2015-07-28 17:10                   ` Boris Ostrovsky
  1 sibling, 2 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-28 17:10 UTC (permalink / raw)
  To: Andy Lutomirski, Andrew Cooper
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, Jan Beulich, Sasha Levin

On 07/28/2015 01:07 PM, Andy Lutomirski wrote:
> On Tue, Jul 28, 2015 at 9:30 AM, Andrew Cooper
> <andrew.cooper3@citrix.com> wrote:
>> I suspect that the set_ldt(NULL, 0) call hasn't reached Xen before
>> xen_free_ldt() is attempting to nab back the pages which Xen still has
>> mapped as an LDT.
>>
> I just instrumented it with yet more LSL instructions.  I'm pretty
> sure that set_ldt really is clearing at least LDT entry zero.
> Nonetheless the free_ldt call still oopses.
>

Yes, I added some instrumentation to the hypervisor and we definitely 
set LDT to NULL before failing.

-boris

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-28 17:10                   ` [Xen-devel] " Boris Ostrovsky
  2015-07-29  0:21                     ` Andy Lutomirski
@ 2015-07-29  0:21                     ` Andy Lutomirski
  2015-07-29  0:47                       ` Andrew Cooper
  2015-07-29  0:47                       ` [Xen-devel] " Andrew Cooper
  1 sibling, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-29  0:21 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andrew Cooper, security, Peter Zijlstra, X86 ML, linux-kernel,
	Steven Rostedt, xen-devel, Borislav Petkov, Jan Beulich,
	Sasha Levin

On Tue, Jul 28, 2015 at 10:10 AM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 07/28/2015 01:07 PM, Andy Lutomirski wrote:
>>
>> On Tue, Jul 28, 2015 at 9:30 AM, Andrew Cooper
>> <andrew.cooper3@citrix.com> wrote:
>>>
>>> I suspect that the set_ldt(NULL, 0) call hasn't reached Xen before
>>> xen_free_ldt() is attempting to nab back the pages which Xen still has
>>> mapped as an LDT.
>>>
>> I just instrumented it with yet more LSL instructions.  I'm pretty
>> sure that set_ldt really is clearing at least LDT entry zero.
>> Nonetheless the free_ldt call still oopses.
>>
>
> Yes, I added some instrumentation to the hypervisor and we definitely set
> LDT to NULL before failing.
>
> -boris

Looking at map_ldt_shadow_page: what keeps shadow_ldt_mapcnt from
getting incremented once on each CPU at the same time if both CPUs
fault in the same shadow LDT page at the same time?  Similarly, what
keeps both CPUs from calling get_page_type at the same time and
therefore losing track of the page type reference count?

I don't see why vmalloc or vm_unmap_aliases would have anything to do
with this, though.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29  0:21                     ` [Xen-devel] " Andy Lutomirski
  2015-07-29  0:47                       ` Andrew Cooper
@ 2015-07-29  0:47                       ` Andrew Cooper
  2015-07-29  3:01                         ` Boris Ostrovsky
  2015-07-29  3:01                         ` [Xen-devel] " Boris Ostrovsky
  1 sibling, 2 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-29  0:47 UTC (permalink / raw)
  To: Andy Lutomirski, Boris Ostrovsky
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, Jan Beulich, Sasha Levin

On 29/07/2015 01:21, Andy Lutomirski wrote:
> On Tue, Jul 28, 2015 at 10:10 AM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 07/28/2015 01:07 PM, Andy Lutomirski wrote:
>>> On Tue, Jul 28, 2015 at 9:30 AM, Andrew Cooper
>>> <andrew.cooper3@citrix.com> wrote:
>>>> I suspect that the set_ldt(NULL, 0) call hasn't reached Xen before
>>>> xen_free_ldt() is attempting to nab back the pages which Xen still has
>>>> mapped as an LDT.
>>>>
>>> I just instrumented it with yet more LSL instructions.  I'm pretty
>>> sure that set_ldt really is clearing at least LDT entry zero.
>>> Nonetheless the free_ldt call still oopses.
>>>
>> Yes, I added some instrumentation to the hypervisor and we definitely set
>> LDT to NULL before failing.
>>
>> -boris
> Looking at map_ldt_shadow_page: what keeps shadow_ldt_mapcnt from
> getting incremented once on each CPU at the same time if both CPUs
> fault in the same shadow LDT page at the same time?

Nothing, but that is fine.  If a page is in use in two vcpus' LDTs, it is
expected to have a type refcount of 2.

> Similarly, what
> keeps both CPUs from calling get_page_type at the same time and
> therefore losing track of the page type reference count?

A cmpxchg() loop in the depths of __get_page_type().
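
To illustrate the point about get_page_type: a compare-and-swap retry
loop lets two CPUs take references at the same time without losing an
update.  What follows is only a minimal userspace sketch using C11
atomics, not Xen's actual __get_page_type(); struct page_model,
get_type_ref(), put_type_ref() and the max limit are hypothetical names
invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page's type reference count; not Xen's layout. */
struct page_model {
    atomic_uint type_count;
};

/* Take one reference, retrying if another CPU changed the count meanwhile. */
static bool get_type_ref(struct page_model *pg, unsigned int max)
{
    unsigned int old = atomic_load(&pg->type_count);

    for (;;) {
        if (old >= max)
            return false;   /* refuse rather than exceed the limit */
        /* On failure, 'old' is updated to the value currently stored. */
        if (atomic_compare_exchange_weak(&pg->type_count, &old, old + 1))
            return true;
    }
}

/* Drop one reference; the count must have been non-zero. */
static void put_type_ref(struct page_model *pg)
{
    unsigned int old = atomic_fetch_sub(&pg->type_count, 1);

    assert(old > 0);
}
```

If two callers race, one compare-and-swap fails, picks up the new value
and retries, so both increments are counted.  That matches the
expectation that a page used by two LDTs ends up with a count of 2.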

>
> I don't see why vmalloc or vm_unmap_aliases would have anything to do
> with this, though.

Nor me.  I have compiled your branch and will see about reproducing the
issue myself tomorrow.

~Andrew

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29  0:47                       ` [Xen-devel] " Andrew Cooper
  2015-07-29  3:01                         ` Boris Ostrovsky
@ 2015-07-29  3:01                         ` Boris Ostrovsky
  2015-07-29  4:26                           ` Andy Lutomirski
                                             ` (3 more replies)
  1 sibling, 4 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-29  3:01 UTC (permalink / raw)
  To: Andrew Cooper, Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, Jan Beulich, Sasha Levin

On 07/28/2015 08:47 PM, Andrew Cooper wrote:
> On 29/07/2015 01:21, Andy Lutomirski wrote:
>> On Tue, Jul 28, 2015 at 10:10 AM, Boris Ostrovsky
>> <boris.ostrovsky@oracle.com> wrote:
>>> On 07/28/2015 01:07 PM, Andy Lutomirski wrote:
>>>> On Tue, Jul 28, 2015 at 9:30 AM, Andrew Cooper
>>>> <andrew.cooper3@citrix.com> wrote:
>>>>> I suspect that the set_ldt(NULL, 0) call hasn't reached Xen before
>>>>> xen_free_ldt() is attempting to nab back the pages which Xen still has
>>>>> mapped as an LDT.
>>>>>
>>>> I just instrumented it with yet more LSL instructions.  I'm pretty
>>>> sure that set_ldt really is clearing at least LDT entry zero.
>>>> Nonetheless the free_ldt call still oopses.
>>>>
>>> Yes, I added some instrumentation to the hypervisor and we definitely set
>>> LDT to NULL before failing.
>>>
>>> -boris
>> Looking at map_ldt_shadow_page: what keeps shadow_ldt_mapcnt from
>> getting incremented once on each CPU at the same time if both CPUs
>> fault in the same shadow LDT page at the same time?
> Nothing, but that is fine.  If a page is in use in two vcpus LDTs, it is
> expected to have a type refcount of 2.
>
>> Similarly, what
>> keeps both CPUs from calling get_page_type at the same time and
>> therefore losing track of the page type reference count?
> a cmpxchg() loop in the depths of __get_page_type().
>
>> I don't see why vmalloc or vm_unmap_aliases would have anything to do
>> with this, though.

So just for kicks I made lazy_max_pages() return 0 to free vmaps 
immediately and the problem went away.

I also saw this warning, BTW:

[  178.686542] ------------[ cut here ]------------
[  178.686554] WARNING: CPU: 0 PID: 16440 at ./arch/x86/include/asm/mmu_context.h:96 load_mm_ldt+0x70/0x76()
[  178.686558] DEBUG_LOCKS_WARN_ON(!irqs_disabled())
[  178.686561] Modules linked in:
[  178.686566] CPU: 0 PID: 16440 Comm: kworker/u2:1 Not tainted 4.1.0-32b #80
[  178.686570]  00000000 00000000 ea4e3df8 c1670e71 00000000 ea4e3e28 c106ac1e c1814e43
[  178.686577]  ea4e3e54 00004038 c181bc2c 00000060 c166fd3b c166fd3b e6705dc0 00000000
[  178.686583]  ea665000 ea4e3e40 c106ad03 00000009 ea4e3e38 c1814e43 ea4e3e54 ea4e3e5c
[  178.686589] Call Trace:
[  178.686594]  [<c1670e71>] dump_stack+0x41/0x52
[  178.686598]  [<c106ac1e>] warn_slowpath_common+0x8e/0xd0
[  178.686602]  [<c166fd3b>] ? load_mm_ldt+0x70/0x76
[  178.686609]  [<c166fd3b>] ? load_mm_ldt+0x70/0x76
[  178.686612]  [<c106ad03>] warn_slowpath_fmt+0x33/0x40
[  178.686615]  [<c166fd3b>] load_mm_ldt+0x70/0x76
[  178.686619]  [<c11ad5e9>] flush_old_exec+0x6f9/0x750
[  178.686626]  [<c11efb54>] load_elf_binary+0x2b4/0x1040
[  178.686630]  [<c1173785>] ? page_address+0x15/0xf0
[  178.686633]  [<c106466f>] ? kunmap+0x1f/0x70
[  178.686636]  [<c11ac819>] search_binary_handler+0x89/0x1c0
[  178.686639]  [<c11add40>] do_execveat_common+0x4c0/0x620
[  178.686653]  [<c11673e3>] ? kmemdup+0x33/0x50
[  178.686659]  [<c10c5e3b>] ? __call_rcu.constprop.66+0xbb/0x220
[  178.686673]  [<c11adec4>] do_execve+0x24/0x30
[  178.686679]  [<c107c0be>] ____call_usermodehelper+0xde/0x120
[  178.686684]  [<c1677501>] ret_from_kernel_thread+0x21/0x30
[  178.686696]  [<c107bfe0>] ? __request_module+0x240/0x240
[  178.686701] ---[ end trace 8b3f5341f50e6c88 ]---


-boris

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29  3:01                         ` [Xen-devel] " Boris Ostrovsky
@ 2015-07-29  4:26                           ` Andy Lutomirski
  2015-07-29  4:26                           ` Andy Lutomirski
                                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-29  4:26 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andrew Cooper, security, Peter Zijlstra, X86 ML, linux-kernel,
	Steven Rostedt, xen-devel, Borislav Petkov, Jan Beulich,
	Sasha Levin

On Tue, Jul 28, 2015 at 8:01 PM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 07/28/2015 08:47 PM, Andrew Cooper wrote:
>>
>> On 29/07/2015 01:21, Andy Lutomirski wrote:
>>>
>>> On Tue, Jul 28, 2015 at 10:10 AM, Boris Ostrovsky
>>> <boris.ostrovsky@oracle.com> wrote:
>>>>
>>>> On 07/28/2015 01:07 PM, Andy Lutomirski wrote:
>>>>>
>>>>> On Tue, Jul 28, 2015 at 9:30 AM, Andrew Cooper
>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>
>>>>>> I suspect that the set_ldt(NULL, 0) call hasn't reached Xen before
>>>>>> xen_free_ldt() is attempting to nab back the pages which Xen still has
>>>>>> mapped as an LDT.
>>>>>>
>>>>> I just instrumented it with yet more LSL instructions.  I'm pretty
>>>>> sure that set_ldt really is clearing at least LDT entry zero.
>>>>> Nonetheless the free_ldt call still oopses.
>>>>>
>>>> Yes, I added some instrumentation to the hypervisor and we definitely
>>>> set
>>>> LDT to NULL before failing.
>>>>
>>>> -boris
>>>
>>> Looking at map_ldt_shadow_page: what keeps shadow_ldt_mapcnt from
>>> getting incremented once on each CPU at the same time if both CPUs
>>> fault in the same shadow LDT page at the same time?
>>
>> Nothing, but that is fine.  If a page is in use in two vcpus LDTs, it is
>> expected to have a type refcount of 2.
>>
>>> Similarly, what
>>> keeps both CPUs from calling get_page_type at the same time and
>>> therefore losing track of the page type reference count?
>>
>> a cmpxchg() loop in the depths of __get_page_type().
>>
>>> I don't see why vmalloc or vm_unmap_aliases would have anything to do
>>> with this, though.
>
>
> So just for kicks I made lazy_max_pages() return 0 to free vmaps immediately
> and the problem went away.
>
> I also saw this warning, BTW:
>
> [  178.686542] ------------[ cut here ]------------
> [  178.686554] WARNING: CPU: 0 PID: 16440 at
> ./arch/x86/include/asm/mmu_context.h:96 load_mm_ldt+0x70/0x76()
> [  178.686558] DEBUG_LOCKS_WARN_ON(!irqs_disabled())

Whoops!  That should be checking preemptible(), not irqs_disabled().
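
The difference between the two checks can be shown with a small
userspace model.  This is a sketch only: struct cpu_state and the
warns_old()/warns_new() helpers are hypothetical, and it assumes the
corrected assertion warns when preemptible() is true.  The preemptible()
helper follows the kernel's definition under CONFIG_PREEMPT_COUNT
(preempt count of zero and IRQs enabled).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-CPU state the kernel consults. */
struct cpu_state {
    int preempt_count;      /* > 0 means preemption is disabled */
    bool irqs_disabled;
};

static void preempt_disable(struct cpu_state *s)
{
    s->preempt_count++;
}

static void preempt_enable(struct cpu_state *s)
{
    assert(s->preempt_count > 0);
    s->preempt_count--;
}

/* Same condition as the kernel's preemptible() with CONFIG_PREEMPT_COUNT. */
static bool preemptible(const struct cpu_state *s)
{
    return s->preempt_count == 0 && !s->irqs_disabled;
}

/* Check as posted: warns whenever IRQs are enabled. */
static bool warns_old(const struct cpu_state *s)
{
    return !s->irqs_disabled;
}

/* Assumed corrected check: warns only if the task could be preempted. */
static bool warns_new(const struct cpu_state *s)
{
    return preemptible(s);
}
```

With preemption disabled but IRQs still enabled, warns_old() fires while
warns_new() stays quiet, which is consistent with the spurious warning
in the trace above.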

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29  3:01                         ` [Xen-devel] " Boris Ostrovsky
  2015-07-29  4:26                           ` Andy Lutomirski
  2015-07-29  4:26                           ` Andy Lutomirski
@ 2015-07-29  5:28                           ` Andy Lutomirski
  2015-07-29 14:21                             ` Andrew Cooper
  2015-07-29 14:21                             ` Andrew Cooper
  2015-07-29  5:28                           ` Andy Lutomirski
  3 siblings, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-29  5:28 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andrew Cooper, security, Peter Zijlstra, X86 ML, linux-kernel,
	Steven Rostedt, xen-devel, Borislav Petkov, Jan Beulich,
	Sasha Levin

On Tue, Jul 28, 2015 at 8:01 PM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 07/28/2015 08:47 PM, Andrew Cooper wrote:
>>
>> On 29/07/2015 01:21, Andy Lutomirski wrote:
>>>
>>> On Tue, Jul 28, 2015 at 10:10 AM, Boris Ostrovsky
>>> <boris.ostrovsky@oracle.com> wrote:
>>>>
>>>> On 07/28/2015 01:07 PM, Andy Lutomirski wrote:
>>>>>
>>>>> On Tue, Jul 28, 2015 at 9:30 AM, Andrew Cooper
>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>
>>>>>> I suspect that the set_ldt(NULL, 0) call hasn't reached Xen before
>>>>>> xen_free_ldt() is attempting to nab back the pages which Xen still has
>>>>>> mapped as an LDT.
>>>>>>
>>>>> I just instrumented it with yet more LSL instructions.  I'm pretty
>>>>> sure that set_ldt really is clearing at least LDT entry zero.
>>>>> Nonetheless the free_ldt call still oopses.
>>>>>
>>>> Yes, I added some instrumentation to the hypervisor and we definitely
>>>> set
>>>> LDT to NULL before failing.
>>>>
>>>> -boris
>>>
>>> Looking at map_ldt_shadow_page: what keeps shadow_ldt_mapcnt from
>>> getting incremented once on each CPU at the same time if both CPUs
>>> fault in the same shadow LDT page at the same time?
>>
>> Nothing, but that is fine.  If a page is in use in two vcpus LDTs, it is
>> expected to have a type refcount of 2.
>>
>>> Similarly, what
>>> keeps both CPUs from calling get_page_type at the same time and
>>> therefore losing track of the page type reference count?
>>
>> a cmpxchg() loop in the depths of __get_page_type().
>>
>>> I don't see why vmalloc or vm_unmap_aliases would have anything to do
>>> with this, though.
>
>
> So just for kicks I made lazy_max_pages() return 0 to free vmaps immediately
> and the problem went away.

As far as I can tell, this affects TLB flushes but not unmaps.  That
means that my patch is totally bogus -- vm_unmap_aliases() *flushes*
aliases but isn't involved in removing them from the page tables.
That must be why xen_alloc_ldt and xen_set_ldt work today.

So what does flushing the TLB have to do with anything?  The only
thing I can think of is that it might force some deferred hypercalls
out.  I can reproduce this easily on UP, so IPIs aren't involved.

The other odd thing is that it seems like this happens when clearing
the LDT and freeing the old one but not when setting the LDT and
freeing the old one.  This is plausibly related to the lazy mode in
effect at the time, but I have no evidence for that.

Two more data points:  Putting xen_flush_mc before and after the
SET_LDT multicall has no effect.  Putting flush_tlb_all() in
xen_free_ldt doesn't help either, while vm_unmap_aliases() in the
exact same place does help.
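
For reference, the lazy_max_pages() experiment can be pictured with a
toy model.  It assumes that freed vmap areas are counted in pages and
that a purge runs once the count exceeds the threshold; struct
lazy_vmap_model, purge() and lazy_free() are hypothetical names, and the
model says nothing about what a purge does on Xen.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of lazily freed vmap areas. */
struct lazy_vmap_model {
    size_t lazy_pages;      /* pages freed but not yet purged */
    size_t max_lazy;        /* stands in for lazy_max_pages() */
    unsigned int purges;    /* number of purges performed so far */
};

/* Purge everything that was deferred. */
static void purge(struct lazy_vmap_model *m)
{
    m->lazy_pages = 0;
    m->purges++;
}

/* Defer the free; purge only once the threshold is exceeded. */
static void lazy_free(struct lazy_vmap_model *m, size_t nr_pages)
{
    assert(nr_pages > 0);
    m->lazy_pages += nr_pages;
    if (m->lazy_pages > m->max_lazy)
        purge(m);
}
```

With max_lazy set to 0, every lazy_free() call purges immediately, which
is the effect of making lazy_max_pages() return 0.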

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29  5:28                           ` [Xen-devel] " Andy Lutomirski
@ 2015-07-29 14:21                             ` Andrew Cooper
  2015-07-29 14:43                               ` Boris Ostrovsky
  2015-07-29 14:43                               ` Boris Ostrovsky
  2015-07-29 14:21                             ` Andrew Cooper
  1 sibling, 2 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-29 14:21 UTC (permalink / raw)
  To: Andy Lutomirski, Boris Ostrovsky
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, Jan Beulich, Sasha Levin

On 29/07/15 06:28, Andy Lutomirski wrote:
> On Tue, Jul 28, 2015 at 8:01 PM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 07/28/2015 08:47 PM, Andrew Cooper wrote:
>>> On 29/07/2015 01:21, Andy Lutomirski wrote:
>>>> On Tue, Jul 28, 2015 at 10:10 AM, Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>> On 07/28/2015 01:07 PM, Andy Lutomirski wrote:
>>>>>> On Tue, Jul 28, 2015 at 9:30 AM, Andrew Cooper
>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>> I suspect that the set_ldt(NULL, 0) call hasn't reached Xen before
>>>>>>> xen_free_ldt() is attempting to nab back the pages which Xen still has
>>>>>>> mapped as an LDT.
>>>>>>>
>>>>>> I just instrumented it with yet more LSL instructions.  I'm pretty
>>>>>> sure that set_ldt really is clearing at least LDT entry zero.
>>>>>> Nonetheless the free_ldt call still oopses.
>>>>>>
>>>>> Yes, I added some instrumentation to the hypervisor and we definitely
>>>>> set
>>>>> LDT to NULL before failing.
>>>>>
>>>>> -boris
>>>> Looking at map_ldt_shadow_page: what keeps shadow_ldt_mapcnt from
>>>> getting incremented once on each CPU at the same time if both CPUs
>>>> fault in the same shadow LDT page at the same time?
>>> Nothing, but that is fine.  If a page is in use in two vcpus LDTs, it is
>>> expected to have a type refcount of 2.
>>>
>>>> Similarly, what
>>>> keeps both CPUs from calling get_page_type at the same time and
>>>> therefore losing track of the page type reference count?
>>> a cmpxchg() loop in the depths of __get_page_type().
>>>
>>>> I don't see why vmalloc or vm_unmap_aliases would have anything to do
>>>> with this, though.
>>
>> So just for kicks I made lazy_max_pages() return 0 to free vmaps immediately
>> and the problem went away.
> As far as I can tell, this affects TLB flushes but not unmaps.  That
> means that my patch is totally bogus -- vm_unmap_aliases() *flushed*
> aliases but isn't involved in removing them from the page tables.
> That must be why xen_alloc_ldt and xen_set_ldt work today.
>
> So what does flushing the TLB have to do with anything?  The only
> thing I can think of is that it might force some deferred hypercalls
> out.  I can reproduce this easily on UP, so IPIs aren't involved.
>
> The other odd thing is that it seems like this happens when clearing
> the LDT and freeing the old one but not when setting the LDT and
> freeing the old one.  This is plausibly related to the lazy mode in
> effect at the time, but I have no evidence for that.
>
> Two more data points:  Putting xen_flush_mc before and after the
> SET_LDT multicall has no effect.  Putting flush_tlb_all() in
> xen_free_ldt doesn't help either, while vm_unmap_aliases() in the
> exact same place does help.

FYI, I have got a repro now and am investigating.

~Andrew

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 14:21                             ` Andrew Cooper
@ 2015-07-29 14:43                               ` Boris Ostrovsky
  2015-07-29 19:03                                 ` Andrew Cooper
  2015-07-29 19:03                                 ` [Xen-devel] " Andrew Cooper
  2015-07-29 14:43                               ` Boris Ostrovsky
  1 sibling, 2 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-29 14:43 UTC (permalink / raw)
  To: Andrew Cooper, Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, Jan Beulich, Sasha Levin

On 07/29/2015 10:21 AM, Andrew Cooper wrote:
> On 29/07/15 06:28, Andy Lutomirski wrote:
>> On Tue, Jul 28, 2015 at 8:01 PM, Boris Ostrovsky
>> <boris.ostrovsky@oracle.com> wrote:
>>> On 07/28/2015 08:47 PM, Andrew Cooper wrote:
>>>> On 29/07/2015 01:21, Andy Lutomirski wrote:
>>>>> On Tue, Jul 28, 2015 at 10:10 AM, Boris Ostrovsky
>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>> On 07/28/2015 01:07 PM, Andy Lutomirski wrote:
>>>>>>> On Tue, Jul 28, 2015 at 9:30 AM, Andrew Cooper
>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>> I suspect that the set_ldt(NULL, 0) call hasn't reached Xen before
>>>>>>>> xen_free_ldt() is attempting to nab back the pages which Xen still has
>>>>>>>> mapped as an LDT.
>>>>>>>>
>>>>>>> I just instrumented it with yet more LSL instructions.  I'm pretty
>>>>>>> sure that set_ldt really is clearing at least LDT entry zero.
>>>>>>> Nonetheless the free_ldt call still oopses.
>>>>>>>
>>>>>> Yes, I added some instrumentation to the hypervisor and we definitely
>>>>>> set
>>>>>> LDT to NULL before failing.
>>>>>>
>>>>>> -boris
>>>>> Looking at map_ldt_shadow_page: what keeps shadow_ldt_mapcnt from
>>>>> getting incremented once on each CPU at the same time if both CPUs
>>>>> fault in the same shadow LDT page at the same time?
>>>> Nothing, but that is fine.  If a page is in use in two vcpus LDTs, it is
>>>> expected to have a type refcount of 2.
>>>>
>>>>> Similarly, what
>>>>> keeps both CPUs from calling get_page_type at the same time and
>>>>> therefore losing track of the page type reference count?
>>>> a cmpxchg() loop in the depths of __get_page_type().
>>>>
>>>>> I don't see why vmalloc or vm_unmap_aliases would have anything to do
>>>>> with this, though.
>>> So just for kicks I made lazy_max_pages() return 0 to free vmaps immediately
>>> and the problem went away.
>> As far as I can tell, this affects TLB flushes but not unmaps.  That
>> means that my patch is totally bogus -- vm_unmap_aliases() *flushed*
>> aliases but isn't involved in removing them from the page tables.
>> That must be why xen_alloc_ldt and xen_set_ldt work today.
>>
>> So what does flushing the TLB have to do with anything?  The only
>> thing I can think of is that it might force some deferred hypercalls
>> out.  I can reproduce this easily on UP, so IPIs aren't involved.
>>
>> The other odd thing is that it seems like this happens when clearing
>> the LDT and freeing the old one but not when setting the LDT and
>> freeing the old one.  This is plausibly related to the lazy mode in
>> effect at the time, but I have no evidence for that.
>>
>> Two more data points:  Putting xen_flush_mc before and after the
>> SET_LDT multicall has no effect.  Putting flush_tlb_all() in
>> xen_free_ldt doesn't help either, while vm_unmap_aliases() in the
>> exact same place does help.
> FYI, I have got a repro now and am investigating.


To simplify your test case, this is sufficient for me to trigger this:


#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/ldt.h>


int main(void)
{
         int i;

         struct user_desc desc = {
                 .entry_number    = 0,
                 .base_addr       = 0,
                 .limit           = 10,
                 .seg_32bit       = 1,
                 .contents        = 2, /* Code, not conforming */
                 .read_exec_only  = 0,
                 .limit_in_pages  = 0,
                 .seg_not_present = 0,
                 .useable         = 0
         };

         /* func 0x11: write an LDT entry (non-legacy mode) */
         for (i = 0; i < 500; i++)
                 syscall(SYS_modify_ldt, 0x11, &desc, sizeof(desc));

         return 0;
}


Run this program in a loop --- the error is triggered (again, for me), 
when it exits.
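
"Run this program in a loop" could be scripted from the shell, but a small C helper along these lines would also do.  It is only a sketch: run_repeatedly() is a hypothetical name, and the path of the compiled test program is whatever you built it as.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run 'path' (with no arguments) 'count' times, one after another.
 * Returns how many runs exited normally with status 0.
 */
static int run_repeatedly(const char *path, int count)
{
    int ok = 0;

    for (int i = 0; i < count; i++) {
        pid_t pid = fork();

        if (pid < 0)
            break;                      /* fork failed; stop early */
        if (pid == 0) {
            execl(path, path, (char *)NULL);
            _exit(127);                 /* only reached if exec failed */
        }

        int status;
        if (waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

Calling it as run_repeatedly("./ldt_repro", 100), with ./ldt_repro standing in for the compiled test above, exercises the process-exit path once per iteration, which is where the error shows up here.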

-boris

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 14:43                               ` Boris Ostrovsky
  2015-07-29 19:03                                 ` Andrew Cooper
@ 2015-07-29 19:03                                 ` Andrew Cooper
  2015-07-29 21:23                                   ` Boris Ostrovsky
  2015-07-29 21:23                                   ` Boris Ostrovsky
  1 sibling, 2 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-29 19:03 UTC (permalink / raw)
  To: Boris Ostrovsky, Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, Jan Beulich, Sasha Levin,
	David Vrabel, Konrad Wilk

On 29/07/15 15:43, Boris Ostrovsky wrote:
> FYI, I have got a repro now and am investigating.

Good and bad news.  This bug has nothing to do with LDTs themselves.

I have worked out what is going on, but this:

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 5abeaac..7e1a82e 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
 
        pte = pfn_pte(pfn, prot);
 
+       (void)*(volatile int*)v;
        if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
                pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
                BUG();

Is perhaps not the fix we are looking for, and every use of
HYPERVISOR_update_va_mapping() is susceptible to the same problem.

The update_va_mapping hypercall is designed to emulate writing the pte
for v, with auditing applied.  As part of this, it does a pagewalk on v
to locate and map the l1.  During this walk, Xen finds the l2 not
present, and fails the hypercall.  i.e. v is not reachable from the
current cr3.

Reading the virtual address immediately before issuing the hypercall
causes Linux's memory faulting logic to fault in the l2.  This also
explains why vm_unmap_aliases() appears to fix the issue; it is likely
to fault in enough of the paging structure for v to be reachable.

One solution might be to use MMU_NORMAL_PT_UPDATE hypercall instead,
which takes the physical address of pte to update.  This won't fail in
Xen if part of the paging structure is missing, and can be batched.
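
To make the difference between the two update styles concrete, here is a toy user-space model.  It is not Xen or Linux code: the two-level layout, the table size and every toy_* name are assumptions made purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ENTRIES 4   /* tiny tables keep the example readable */

/* Hypothetical two-level layout: a root whose slots point at leaf (l1) tables. */
struct toy_l1 { uint64_t pte[TOY_ENTRIES]; };
struct toy_root { struct toy_l1 *l1[TOY_ENTRIES]; };

/*
 * VA-based update: walk from the given root to find the leaf table,
 * in the spirit of the pagewalk described above.  Fails if the
 * intermediate level is not present in this root.
 */
static int toy_update_by_va(struct toy_root *root, unsigned va, uint64_t val)
{
    struct toy_l1 *l1 = root->l1[(va / TOY_ENTRIES) % TOY_ENTRIES];

    if (l1 == NULL)
        return -1;              /* walk failed: va not reachable from root */
    l1->pte[va % TOY_ENTRIES] = val;
    return 0;
}

/* Address-of-pte update: no walk, so a hole in any root is irrelevant. */
static int toy_update_by_pte(uint64_t *pte, uint64_t val)
{
    *pte = val;
    return 0;
}
```

With a root whose slot for va is still empty, toy_update_by_va() returns -1, while toy_update_by_pte() on the same leaf entry succeeds.  Filling in the slot first (the analogue of touching v so that the missing level gets faulted in) makes the VA-based call succeed as well.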

~Andrew

^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 19:03                                 ` [Xen-devel] " Andrew Cooper
@ 2015-07-29 21:23                                   ` Boris Ostrovsky
  2015-07-29 21:26                                     ` Andy Lutomirski
  2015-07-29 21:26                                     ` Andy Lutomirski
  2015-07-29 21:23                                   ` Boris Ostrovsky
  1 sibling, 2 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-29 21:23 UTC (permalink / raw)
  To: Andrew Cooper, Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, Jan Beulich, Sasha Levin,
	David Vrabel, Konrad Wilk

On 07/29/2015 03:03 PM, Andrew Cooper wrote:
> On 29/07/15 15:43, Boris Ostrovsky wrote:
>> FYI, I have got a repro now and am investigating.
> Good and bad news.  This bug has nothing to do with LDTs themselves.
>
> I have worked out what is going on, but this:
>
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index 5abeaac..7e1a82e 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>   
>          pte = pfn_pte(pfn, prot);
>   
> +       (void)*(volatile int*)v;
>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>                  pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>                  BUG();
>
> Is perhaps not the fix we are looking for, and every use of
> HYPERVISOR_update_va_mapping() is susceptible to the same problem.

I think in most cases we know that the page is mapped, so hopefully this is
the only site that we need to be careful about.

>
> The update_va_mapping hypercall is designed to emulate writing the pte
> for v, with auditing applied.  As part of this, it does a pagewalk on v
> to locate and map the l1.  During this walk, Xen finds the l2 not
> present, and fails the hypercall.  i.e. v is not reachable from the
> current cr3.
>
> Reading the virtual address immediately before issuing the hypercall
> causes Linux's memory faulting logic to fault in the l2.  This also
> explains why vm_unmap_aliases() appears to fix the issue; it is likely
> to fault in enough of the paging structure for v to be reachable.

We've just touched this page (in write_ldt()) in this test so why would 
it not be mapped?


>
> One solution might be to use MMU_NORMAL_PT_UPDATE hypercall instead,
> which takes the physical address of pte to update.  This won't fail in
> Xen if part of the paging structure is missing, and can be batched.

Yes, it does work. Thanks Andrew.


-boris

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 21:23                                   ` Boris Ostrovsky
@ 2015-07-29 21:26                                     ` Andy Lutomirski
  2015-07-29 21:33                                       ` Boris Ostrovsky
                                                         ` (3 more replies)
  2015-07-29 21:26                                     ` Andy Lutomirski
  1 sibling, 4 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-29 21:26 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andrew Cooper, security, Peter Zijlstra, X86 ML, linux-kernel,
	Steven Rostedt, xen-devel, Borislav Petkov, Jan Beulich,
	Sasha Levin, David Vrabel, Konrad Wilk

On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>
>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>
>>> FYI, I have got a repro now and am investigating.
>>
>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>
>> I have worked out what is going on, but this:
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index 5abeaac..7e1a82e 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>
>>          pte = pfn_pte(pfn, prot);
>>
>> +       (void)*(volatile int*)v;
>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>                  pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>                  BUG();
>>
>> Is perhaps not the fix we are looking for, and every use of
>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>
>
> I think in most cases we know that page is mapped so hopefully this is the
> only site that we need to be careful about.

Is there any chance we can get some kind of quick-and-dirty fix that
can go to x86/urgent in the next few days even if a clean fix isn't
available yet?

>
>>
>> The update_va_mapping hypercall is designed to emulate writing the pte
>> for v, with auditing applied.  As part of this, it does a pagewalk on v
>> to locate and map the l1.  During this walk, Xen finds the l2 not
>> present, and fails the hypercall.  i.e. v is not reachable from the
>> current cr3.
>>
>> Reading the virtual address immediately before issuing the hypercall
>> causes Linux's memory faulting logic to fault in the l2.  This also
>> explains why vm_unmap_aliases() appears to fix the issue; it is likely
>> to fault in enough of the paging structure for v to be reachable.
>
>
> We've just touched this page (in write_ldt()) in this test so why would it
> not be mapped?

With my patches applied, the LDT is never written via any paravirt
hook -- I write it once (possibly implicitly using kzalloc/vzalloc)
before paravirt_alloc_ldt(), and write_ldt() is never called.  We
could even remove write_ldt() :)
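
As an illustration of "write it once, then hand it over", here is a minimal user-space sketch of building a fresh table instead of mutating the one in use.  It is not the kernel code: struct toy_ldt and the toy_* helpers are hypothetical, and plain malloc/calloc stand in for the kernel allocators.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_LDT_ENTRIES 8192    /* x86 LDTs hold at most 8192 descriptors */

/* Hypothetical stand-in for the kernel's LDT bookkeeping. */
struct toy_ldt {
    unsigned int size;      /* number of entries */
    uint64_t *entries;      /* one 8-byte descriptor per entry */
};

static void toy_ldt_free(struct toy_ldt *ldt)
{
    if (ldt) {
        free(ldt->entries);
        free(ldt);
    }
}

/*
 * Build a new table holding old's entries plus entries[index] = desc.
 * 'old' may be NULL and is never modified.  Returns NULL if index is
 * out of range or an allocation fails.
 */
static struct toy_ldt *toy_ldt_with_entry(const struct toy_ldt *old,
                                          unsigned int index, uint64_t desc)
{
    unsigned int old_size = old ? old->size : 0;
    unsigned int new_size;
    struct toy_ldt *new_ldt;

    if (index >= TOY_LDT_ENTRIES)
        return NULL;
    new_size = index + 1 > old_size ? index + 1 : old_size;

    new_ldt = malloc(sizeof(*new_ldt));
    if (!new_ldt)
        return NULL;
    /* Zeroed allocation: untouched slots are empty descriptors. */
    new_ldt->entries = calloc(new_size, sizeof(*new_ldt->entries));
    if (!new_ldt->entries) {
        free(new_ldt);
        return NULL;
    }
    new_ldt->size = new_size;
    if (old_size)
        memcpy(new_ldt->entries, old->entries,
               old_size * sizeof(*old->entries));
    new_ldt->entries[index] = desc;
    return new_ldt;
}
```

The caller would install the returned table and only then release the previous one with toy_ldt_free(), so nothing ever writes to a table that is already live.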

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 21:26                                     ` Andy Lutomirski
  2015-07-29 21:33                                       ` Boris Ostrovsky
@ 2015-07-29 21:33                                       ` Boris Ostrovsky
  2015-07-29 21:37                                       ` Andrew Cooper
  2015-07-29 21:37                                       ` [Xen-devel] " Andrew Cooper
  3 siblings, 0 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-29 21:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Cooper, security, Peter Zijlstra, X86 ML, linux-kernel,
	Steven Rostedt, xen-devel, Borislav Petkov, Jan Beulich,
	Sasha Levin, David Vrabel, Konrad Wilk

On 07/29/2015 05:26 PM, Andy Lutomirski wrote:
> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>> FYI, I have got a repro now and am investigating.
>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>
>>> I have worked out what is going on, but this:
>>>
>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>> index 5abeaac..7e1a82e 100644
>>> --- a/arch/x86/xen/enlighten.c
>>> +++ b/arch/x86/xen/enlighten.c
>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>
>>>          pte = pfn_pte(pfn, prot);
>>>
>>> +       (void)*(volatile int*)v;
>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>                  BUG();
>>>
>>> Is perhaps not the fix we are looking for, and every use of
>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>
>> I think in most cases we know that page is mapped so hopefully this is the
>> only site that we need to be careful about.
> Is there any chance we can get some kind of quick-and-dirty fix that
> can go to x86/urgent in the next few days even if a clean fix isn't
> available yet?

I'll try to have it tomorrow.

>
>>> The update_va_mapping hypercall is designed to emulate writing the pte
>>> for v, with auditing applied.  As part of this, it does a pagewalk on v
>>> to locate and map the l1.  During this walk, Xen finds the l2 not
>>> present, and fails the hypercall.  i.e. v is not reachable from the
>>> current cr3.
>>>
>>> Reading the virtual address immediately before issuing the hypercall
>>> causes Linux's memory faulting logic to fault in the l2.  This also
>>> explains why vm_unmap_aliases() appears to fix the issue; it is likely
>>> to fault in enough of the paging structure for v to be reachable.
>>
>> We've just touched this page (in write_ldt()) in this test so why would it
>> not be mapped?
> With my patches applied, the LDT is never written via any paravirt
> hook -- I write it once (possibly implicitly using kzalloc/vzalloc)
> before paravirt_alloc_ldt(), and write_ldt() is never called.  We
> could even remove write_ldt() :)

I was referring to 'new_ldt->entries[ldt_info.entry_number] = ldt;' 
which we do write in this test, so it will fault the page in.

-boris


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 21:26                                     ` Andy Lutomirski
                                                         ` (2 preceding siblings ...)
  2015-07-29 21:37                                       ` Andrew Cooper
@ 2015-07-29 21:37                                       ` Andrew Cooper
  2015-07-29 22:05                                         ` Andy Lutomirski
  2015-07-29 22:05                                         ` [Xen-devel] " Andy Lutomirski
  3 siblings, 2 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-29 21:37 UTC (permalink / raw)
  To: Andy Lutomirski, Boris Ostrovsky
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, Jan Beulich, Sasha Levin,
	David Vrabel, Konrad Wilk

On 29/07/2015 22:26, Andy Lutomirski wrote:
> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>> FYI, I have got a repro now and am investigating.
>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>
>>> I have worked out what is going on, but this:
>>>
>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>> index 5abeaac..7e1a82e 100644
>>> --- a/arch/x86/xen/enlighten.c
>>> +++ b/arch/x86/xen/enlighten.c
>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>            pte = pfn_pte(pfn, prot);
>>>   +       (void)*(volatile int*)v;
>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode
>>> %u\n", paravirt_get_lazy_mode());
>>>                  BUG();
>>>
>>> Is perhaps not the fix we are looking for, and every use of
>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>
>> I think in most cases we know that page is mapped so hopefully this is the
>> only site that we need to be careful about.
> Is there any chance we can get some kind of quick-and-dirty fix that
> can go to x86/urgent in the next few days even if a clean fix isn't
> available yet?

Quick and dirty?

Reading from v is the quickest and most obvious way, for areas where we
are certain that v exists, is kernel memory, and is expected to have a
backing page.  I don't know offhand how many of the current
HYPERVISOR_update_va_mapping() call sites this applies to.

>
>>> The update_va_mapping hypercall is designed to emulate writing the pte
>>> for v, with auditing applied.  As part of this, it does a pagewalk on v
>>> to locate and map the l1.  During this walk, Xen finds the l2 not
>>> present, and fails the hypercall.  i.e. v is not reachable from the
>>> current cr3.
>>>
>>> Reading the virtual address immediately before issuing the hypercall
>>> causes Linux's memory faulting logic to fault in the l2.  This also
>>> explains why vm_unmap_aliases() appears to fix the issue; it is likely
>>> to fault in enough of the paging structure for v to be reachable.
>>
>> We've just touched this page (in write_ldt()) in this test so why would it
>> not be mapped?
> With my patches applied, the LDT is never written via any paravirt
> hook -- I write it once (possibly implicitly using kzalloc/vzalloc)
> before paravirt_alloc_ldt(), and write_ldt() is never called.  We
> could even remove write_ldt() :)

Even better!

~Andrew
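
The failure mode discussed above can be sketched in userspace.  This is
a toy model only, not kernel or Xen code: every name in it (toy_mmu,
toy_map_in_reference, toy_update_va_mapping, toy_touch, toy_set_prot) is
hypothetical, and the two-level layout and the lazy copy from a
reference table are simplifying assumptions.  It shows a hypercall-like
update that walks the current tables and fails when the upper level is
absent, and a prior read of v that faults the missing level in.

```c
/*
 * Toy model of the failure discussed in this thread; NOT kernel or Xen
 * code.  All names are hypothetical and exist only for illustration.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_L2_SLOTS 4	/* upper-level slots in the toy address space */
#define TOY_L1_SLOTS 4	/* PTEs per leaf table */

struct toy_mmu {
	/* Reference tables: where a new kernel mapping is first recorded. */
	bool ref_l2_present[TOY_L2_SLOTS];
	/* Tables reachable from the current "cr3": filled in lazily. */
	bool cur_l2_present[TOY_L2_SLOTS];
	/* Leaf PTE values, usable once the upper-level slot is present. */
	unsigned long pte[TOY_L2_SLOTS][TOY_L1_SLOTS];
};

static size_t toy_l2_index(unsigned long va)
{
	return (va / TOY_L1_SLOTS) % TOY_L2_SLOTS;
}

static size_t toy_l1_index(unsigned long va)
{
	return va % TOY_L1_SLOTS;
}

/* Record a mapping in the reference tables only (assumed allocation path). */
void toy_map_in_reference(struct toy_mmu *m, unsigned long va)
{
	m->ref_l2_present[toy_l2_index(va)] = true;
}

/*
 * Models the hypercall: walk from the current root and write the PTE.
 * Fails (returns -1) if the upper level is not present, i.e. va is not
 * reachable from the current "cr3".
 */
int toy_update_va_mapping(struct toy_mmu *m, unsigned long va,
			  unsigned long new_pte)
{
	size_t i2 = toy_l2_index(va);

	if (!m->cur_l2_present[i2])
		return -1;
	m->pte[i2][toy_l1_index(va)] = new_pte;
	return 0;
}

/*
 * Models a read of va: the "fault handler" copies a missing upper-level
 * slot from the reference tables.  Returns false if va is not mapped
 * anywhere, which is the case a real access could not recover from.
 */
bool toy_touch(struct toy_mmu *m, unsigned long va)
{
	size_t i2 = toy_l2_index(va);

	if (!m->cur_l2_present[i2]) {
		if (!m->ref_l2_present[i2])
			return false;
		m->cur_l2_present[i2] = true;
	}
	return true;
}

/* The quick fix in miniature: touch va first, then issue the update. */
int toy_set_prot(struct toy_mmu *m, unsigned long va, unsigned long new_pte)
{
	if (!toy_touch(m, va))
		return -1;
	return toy_update_va_mapping(m, va, new_pte);
}
```

In this model, calling toy_update_va_mapping() straight after
toy_map_in_reference() fails, while toy_set_prot() succeeds for the same
address.  The touch only helps for addresses that really are mapped,
which matches the caveat above about being certain that v exists.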

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 21:26                                     ` Andy Lutomirski
  2015-07-29 21:33                                       ` Boris Ostrovsky
  2015-07-29 21:33                                       ` [Xen-devel] " Boris Ostrovsky
@ 2015-07-29 21:37                                       ` Andrew Cooper
  2015-07-29 21:37                                       ` [Xen-devel] " Andrew Cooper
  3 siblings, 0 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-29 21:37 UTC (permalink / raw)
  To: Andy Lutomirski, Boris Ostrovsky
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, David Vrabel, Jan Beulich,
	Sasha Levin

On 29/07/2015 22:26, Andy Lutomirski wrote:
> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>> FYI, I have got a repro now and am investigating.
>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>
>>> I have worked out what is going on, but this:
>>>
>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>> index 5abeaac..7e1a82e 100644
>>> --- a/arch/x86/xen/enlighten.c
>>> +++ b/arch/x86/xen/enlighten.c
>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>            pte = pfn_pte(pfn, prot);
>>>   +       (void)*(volatile int*)v;
>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode
>>> %u\n", paravirt_get_lazy_mode());
>>>                  BUG();
>>>
>>> Is perhaps not the fix we are looking for, and every use of
>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>
>> I think in most cases we know that page is mapped so hopefully this is the
>> only site that we need to be careful about.
> Is there any chance we can get some kind of quick-and-dirty fix that
> can go to x86/urgent in the next few days even if a clean fix isn't
> available yet?

Quick and dirty?

Reading from v is the quickest and most obvious way, for areas where we
are certain that v exists, is kernel memory, and is expected to have a
backing page.  I don't know offhand how many of the current
HYPERVISOR_update_va_mapping() call sites this applies to.

>
>>> The update_va_mapping hypercall is designed to emulate writing the pte
>>> for v, with auditing applied.  As part of this, it does a pagewalk on v
>>> to locate and map the l1.  During this walk, Xen finds the l2 not
>>> present, and fails the hypercall.  i.e. v is not reachable from the
>>> current cr3.
>>>
>>> Reading the virtual address immediately before issuing the hypercall
>>> causes Linux's memory faulting logic to fault in the l2.  This also
>>> explains why vm_unmap_aliases() appears to fix the issue; it is likely
>>> to fault in enough of the paging structure for v to be reachable.
>>
>> We've just touched this page (in write_ldt()) in this test so why would it
>> not be mapped?
> With my patches applied, the LDT is never written via any paravirt
> hook -- I write it once (possibly implicitly using kzalloc/vzalloc)
> before paravirt_alloc_ldt(), and write_ldt() is never called.  We
> could even remove write_ldt() :)

Even better!

~Andrew

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 21:37                                       ` [Xen-devel] " Andrew Cooper
  2015-07-29 22:05                                         ` Andy Lutomirski
@ 2015-07-29 22:05                                         ` Andy Lutomirski
  2015-07-29 22:11                                           ` Andrew Cooper
  2015-07-29 22:11                                           ` Andrew Cooper
  1 sibling, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-29 22:05 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Boris Ostrovsky, security, Peter Zijlstra, X86 ML, linux-kernel,
	Steven Rostedt, xen-devel, Borislav Petkov, Jan Beulich,
	Sasha Levin, David Vrabel, Konrad Wilk

On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> On 29/07/2015 22:26, Andy Lutomirski wrote:
>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>> <boris.ostrovsky@oracle.com> wrote:
>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>> FYI, I have got a repro now and am investigating.
>>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>>
>>>> I have worked out what is going on, but this:
>>>>
>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>> index 5abeaac..7e1a82e 100644
>>>> --- a/arch/x86/xen/enlighten.c
>>>> +++ b/arch/x86/xen/enlighten.c
>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>            pte = pfn_pte(pfn, prot);
>>>>   +       (void)*(volatile int*)v;
>>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode
>>>> %u\n", paravirt_get_lazy_mode());
>>>>                  BUG();
>>>>
>>>> Is perhaps not the fix we are looking for, and every use of
>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>
>>> I think in most cases we know that page is mapped so hopefully this is the
>>> only site that we need to be careful about.
>> Is there any chance we can get some kind of quick-and-dirty fix that
>> can go to x86/urgent in the next few days even if a clean fix isn't
>> available yet?
>
> Quick and dirty?
>
> Reading from v is the most obvious and quick way, for areas where we are
> certain v exists, is kernel memory and is expected to have a backing
> page.  I don't know offhand how many of current
> HYPERVISOR_update_va_mapping() callsites this applies to.

__get_user(tmp, (char *)v), perhaps, unless there's something better
in the wings.  Keep in mind that we need this for -stable, and it's
likely to get backported quite quickly due to CVE-2015-5157.

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 21:37                                       ` [Xen-devel] " Andrew Cooper
@ 2015-07-29 22:05                                         ` Andy Lutomirski
  2015-07-29 22:05                                         ` [Xen-devel] " Andy Lutomirski
  1 sibling, 0 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-29 22:05 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, David Vrabel, Jan Beulich,
	Sasha Levin, Boris Ostrovsky

On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> On 29/07/2015 22:26, Andy Lutomirski wrote:
>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>> <boris.ostrovsky@oracle.com> wrote:
>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>> FYI, I have got a repro now and am investigating.
>>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>>
>>>> I have worked out what is going on, but this:
>>>>
>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>> index 5abeaac..7e1a82e 100644
>>>> --- a/arch/x86/xen/enlighten.c
>>>> +++ b/arch/x86/xen/enlighten.c
>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>            pte = pfn_pte(pfn, prot);
>>>>   +       (void)*(volatile int*)v;
>>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode
>>>> %u\n", paravirt_get_lazy_mode());
>>>>                  BUG();
>>>>
>>>> Is perhaps not the fix we are looking for, and every use of
>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>
>>> I think in most cases we know that page is mapped so hopefully this is the
>>> only site that we need to be careful about.
>> Is there any chance we can get some kind of quick-and-dirty fix that
>> can go to x86/urgent in the next few days even if a clean fix isn't
>> available yet?
>
> Quick and dirty?
>
> Reading from v is the most obvious and quick way, for areas where we are
> certain v exists, is kernel memory and is expected to have a backing
> page.  I don't know offhand how many of current
> HYPERVISOR_update_va_mapping() callsites this applies to.

__get_user(tmp, (char *)v), perhaps, unless there's something better
in the wings.  Keep in mind that we need this for -stable, and it's
likely to get backported quite quickly due to CVE-2015-5157.

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 22:05                                         ` [Xen-devel] " Andy Lutomirski
@ 2015-07-29 22:11                                           ` Andrew Cooper
  2015-07-29 22:40                                             ` Boris Ostrovsky
                                                               ` (2 more replies)
  2015-07-29 22:11                                           ` Andrew Cooper
  1 sibling, 3 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-29 22:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Boris Ostrovsky, security, Peter Zijlstra, X86 ML, linux-kernel,
	Steven Rostedt, xen-devel, Borislav Petkov, Jan Beulich,
	Sasha Levin, David Vrabel, Konrad Wilk

On 29/07/2015 23:05, Andy Lutomirski wrote:
> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
> <andrew.cooper3@citrix.com> wrote:
>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>> <boris.ostrovsky@oracle.com> wrote:
>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>> FYI, I have got a repro now and am investigating.
>>>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>>>
>>>>> I have worked out what is going on, but this:
>>>>>
>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>> index 5abeaac..7e1a82e 100644
>>>>> --- a/arch/x86/xen/enlighten.c
>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>            pte = pfn_pte(pfn, prot);
>>>>>   +       (void)*(volatile int*)v;
>>>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode
>>>>> %u\n", paravirt_get_lazy_mode());
>>>>>                  BUG();
>>>>>
>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>> I think in most cases we know that page is mapped so hopefully this is the
>>>> only site that we need to be careful about.
>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>> available yet?
>> Quick and dirty?
>>
>> Reading from v is the most obvious and quick way, for areas where we are
>> certain v exists, is kernel memory and is expected to have a backing
>> page.  I don't know offhand how many of current
>> HYPERVISOR_update_va_mapping() callsites this applies to.
> __get_user(tmp, (char *)v), perhaps, unless there's something better
> in the wings.  Keep in mind that we need this for -stable, and it's
> likely to get backported quite quickly due to CVE-2015-5157.

Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
would probably work, and certainly be minimal hassle for -stable.

Altering the hypercall used is certainly not something to backport, nor
are we sure it is a viable fix at this time.

~Andrew

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 22:05                                         ` [Xen-devel] " Andy Lutomirski
  2015-07-29 22:11                                           ` Andrew Cooper
@ 2015-07-29 22:11                                           ` Andrew Cooper
  1 sibling, 0 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-29 22:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, David Vrabel, Jan Beulich,
	Sasha Levin, Boris Ostrovsky

On 29/07/2015 23:05, Andy Lutomirski wrote:
> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
> <andrew.cooper3@citrix.com> wrote:
>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>> <boris.ostrovsky@oracle.com> wrote:
>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>> FYI, I have got a repro now and am investigating.
>>>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>>>
>>>>> I have worked out what is going on, but this:
>>>>>
>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>> index 5abeaac..7e1a82e 100644
>>>>> --- a/arch/x86/xen/enlighten.c
>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>            pte = pfn_pte(pfn, prot);
>>>>>   +       (void)*(volatile int*)v;
>>>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode
>>>>> %u\n", paravirt_get_lazy_mode());
>>>>>                  BUG();
>>>>>
>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>> I think in most cases we know that page is mapped so hopefully this is the
>>>> only site that we need to be careful about.
>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>> available yet?
>> Quick and dirty?
>>
>> Reading from v is the most obvious and quick way, for areas where we are
>> certain v exists, is kernel memory and is expected to have a backing
>> page.  I don't know offhand how many of current
>> HYPERVISOR_update_va_mapping() callsites this applies to.
> __get_user(tmp, (char *)v), perhaps, unless there's something better
> in the wings.  Keep in mind that we need this for -stable, and it's
> likely to get backported quite quickly due to CVE-2015-5157.

Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
would probably work, and certainly be minimal hassle for -stable.

Altering the hypercall used is certainly not something to backport, nor
are we sure it is a viable fix at this time.

~Andrew

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 22:11                                           ` Andrew Cooper
@ 2015-07-29 22:40                                             ` Boris Ostrovsky
  2015-07-29 22:40                                             ` Boris Ostrovsky
  2015-07-29 22:46                                               ` David Vrabel
  2 siblings, 0 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-29 22:40 UTC (permalink / raw)
  To: Andrew Cooper, Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, David Vrabel, Jan Beulich,
	Sasha Levin

On 07/29/2015 06:11 PM, Andrew Cooper wrote:
> On 29/07/2015 23:05, Andy Lutomirski wrote:
>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>> <andrew.cooper3@citrix.com> wrote:
>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>>>>
>>>>>> I have worked out what is going on, but this:
>>>>>>
>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>> index 5abeaac..7e1a82e 100644
>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>             pte = pfn_pte(pfn, prot);
>>>>>>    +       (void)*(volatile int*)v;
>>>>>>           if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>                   pr_err("set_aliased_prot va update failed w/ lazy mode
>>>>>> %u\n", paravirt_get_lazy_mode());
>>>>>>                   BUG();
>>>>>>
>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>> I think in most cases we know that page is mapped so hopefully this is the
>>>>> only site that we need to be careful about.
>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>> available yet?
>>> Quick and dirty?
>>>
>>> Reading from v is the most obvious and quick way, for areas where we are
>>> certain v exists, is kernel memory and is expected to have a backing
>>> page.  I don't know offhand how many of current
>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>> __get_user(tmp, (char *)v), perhaps, unless there's something better
>> in the wings.  Keep in mind that we need this for -stable, and it's
>> likely to get backported quite quickly due to CVE-2015-5157.
> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
> would probably work, and certainly be minimal hassle for -stable.
>
> Altering the hypercall used is certainly not something to backport, nor
> are we sure it is a viable fix at this time.

OK, I'll test this quick fix tonight and defer a more proper patch to
4.3 then.

-boris



^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 22:11                                           ` Andrew Cooper
  2015-07-29 22:40                                             ` Boris Ostrovsky
@ 2015-07-29 22:40                                             ` Boris Ostrovsky
  2015-07-29 22:46                                               ` David Vrabel
  2 siblings, 0 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-29 22:40 UTC (permalink / raw)
  To: Andrew Cooper, Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, David Vrabel, Jan Beulich,
	Sasha Levin

On 07/29/2015 06:11 PM, Andrew Cooper wrote:
> On 29/07/2015 23:05, Andy Lutomirski wrote:
>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>> <andrew.cooper3@citrix.com> wrote:
>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>>>>
>>>>>> I have worked out what is going on, but this:
>>>>>>
>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>> index 5abeaac..7e1a82e 100644
>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>             pte = pfn_pte(pfn, prot);
>>>>>>    +       (void)*(volatile int*)v;
>>>>>>           if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>                   pr_err("set_aliased_prot va update failed w/ lazy mode
>>>>>> %u\n", paravirt_get_lazy_mode());
>>>>>>                   BUG();
>>>>>>
>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>> I think in most cases we know that page is mapped so hopefully this is the
>>>>> only site that we need to be careful about.
>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>> available yet?
>>> Quick and dirty?
>>>
>>> Reading from v is the most obvious and quick way, for areas where we are
>>> certain v exists, is kernel memory and is expected to have a backing
>>> page.  I don't know offhand how many of current
>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>> __get_user(tmp, (char *)v), perhaps, unless there's something better
>> in the wings.  Keep in mind that we need this for -stable, and it's
>> likely to get backported quite quickly due to CVE-2015-5157.
> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
> would probably work, and certainly be minimal hassle for -stable.
>
> Altering the hypercall used is certainly not something to backport, nor
> are we sure it is a viable fix at this time.

OK, I'll test this quick fix tonight and defer a more proper patch to
4.3 then.

-boris

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 22:11                                           ` Andrew Cooper
@ 2015-07-29 22:46                                               ` David Vrabel
  2015-07-29 22:40                                             ` Boris Ostrovsky
  2015-07-29 22:46                                               ` David Vrabel
  2 siblings, 0 replies; 130+ messages in thread
From: David Vrabel @ 2015-07-29 22:46 UTC (permalink / raw)
  To: Andrew Cooper, Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, David Vrabel, Jan Beulich,
	Sasha Levin, Boris Ostrovsky



On 29/07/2015 23:11, Andrew Cooper wrote:
> On 29/07/2015 23:05, Andy Lutomirski wrote:
>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>> <andrew.cooper3@citrix.com> wrote:
>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>>>>
>>>>>> I have worked out what is going on, but this:
>>>>>>
>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>> index 5abeaac..7e1a82e 100644
>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>            pte = pfn_pte(pfn, prot);
>>>>>>   +       (void)*(volatile int*)v;
>>>>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode
>>>>>> %u\n", paravirt_get_lazy_mode());
>>>>>>                  BUG();
>>>>>>
>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>> I think in most cases we know that page is mapped so hopefully this is the
>>>>> only site that we need to be careful about.
>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>> available yet?
>>> Quick and dirty?
>>>
>>> Reading from v is the most obvious and quick way, for areas where we are
>>> certain v exists, is kernel memory and is expected to have a backing
>>> page.  I don't know offhand how many of current
>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>> __get_user(tmp, (char *)v), perhaps, unless there's something better
>> in the wings.  Keep in mind that we need this for -stable, and it's
>> likely to get backported quite quickly due to CVE-2015-5157.
> 
> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
> would probably work, and certainly be minimal hassle for -stable.
> 
> Altering the hypercall used is certainly not something to backport, nor
> are we sure it is a viable fix at this time.

Changing this one use of update_va_mapping to use mmu_update_normal_pt
is the correct fix to unblock this LDT series.  I see no reason why this
cannot be backported.

We can address any other potential update_va_mapping calls at a later
date (if they are shown to be problematic).

David

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
@ 2015-07-29 22:46                                               ` David Vrabel
  0 siblings, 0 replies; 130+ messages in thread
From: David Vrabel @ 2015-07-29 22:46 UTC (permalink / raw)
  To: Andrew Cooper, Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, David Vrabel, Jan Beulich,
	Sasha Levin, Boris Ostrovsky



On 29/07/2015 23:11, Andrew Cooper wrote:
> On 29/07/2015 23:05, Andy Lutomirski wrote:
>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>> <andrew.cooper3@citrix.com> wrote:
>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>>>>
>>>>>> I have worked out what is going on, but this:
>>>>>>
>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>> index 5abeaac..7e1a82e 100644
>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>            pte = pfn_pte(pfn, prot);
>>>>>>   +       (void)*(volatile int*)v;
>>>>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode
>>>>>> %u\n", paravirt_get_lazy_mode());
>>>>>>                  BUG();
>>>>>>
>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>> I think in most cases we know that page is mapped so hopefully this is the
>>>>> only site that we need to be careful about.
>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>> available yet?
>>> Quick and dirty?
>>>
>>> Reading from v is the most obvious and quick way, for areas where we are
>>> certain v exists, is kernel memory and is expected to have a backing
>>> page.  I don't know offhand how many of current
>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>> __get_user(tmp, (char *)v), perhaps, unless there's something better
>> in the wings.  Keep in mind that we need this for -stable, and it's
>> likely to get backported quite quickly due to CVE-2015-5157.
> 
> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
> would probably work, and certainly be minimal hassle for -stable.
> 
> Altering the hypercall used is certainly not something to backport, nor
> are we sure it is a viable fix at this time.

Changing this one use of update_va_mapping to use mmu_update_normal_pt
is the correct fix to unblock this LDT series.  I see no reason why this
cannot be backported.

We can address any other potential update_va_mapping calls at a later
date (if they are shown to be problematic).

David

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 22:46                                               ` David Vrabel
  (?)
  (?)
@ 2015-07-29 22:49                                               ` Boris Ostrovsky
  2015-07-29 22:55                                                 ` David Vrabel
                                                                   ` (3 more replies)
  -1 siblings, 4 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-29 22:49 UTC (permalink / raw)
  To: David Vrabel, Andrew Cooper, Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, David Vrabel, Jan Beulich,
	Sasha Levin

On 07/29/2015 06:46 PM, David Vrabel wrote:
>
> On 29/07/2015 23:11, Andrew Cooper wrote:
>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>> <andrew.cooper3@citrix.com> wrote:
>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>> Good and bad news.  This bug has nothing to do with LDTs themselves.
>>>>>>>
>>>>>>> I have worked out what is going on, but this:
>>>>>>>
>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>          pte = pfn_pte(pfn, prot);
>>>>>>> +        (void)*(volatile int*)v;
>>>>>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>                  BUG();
>>>>>>>
>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>> I think in most cases we know that page is mapped so hopefully this is the
>>>>>> only site that we need to be careful about.
>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>> available yet?
>>>> Quick and dirty?
>>>>
>>>> Reading from v is the most obvious and quick way, for areas where we are
>>>> certain v exists, is kernel memory and is expected to have a backing
>>>> page.  I don't know offhand how many of current
>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>> in the wings.  Keep in mind that we need this for -stable, and it's
>>> likely to get backported quite quickly due to CVE-2015-5157.
>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>> would probably work, and certainly be minimal hassle for -stable.
>>
>> Altering the hypercall used is certainly not something to backport, nor
>> are we sure it is a viable fix at this time.
> Changing this one use of update_va_mapping to use mmu_update_normal_pt
> is the correct fix to unblock this LDT series.  I see no reason why this
> cannot be backported.

A proper fix should include batching, and that is not something
that I think we should target for stable.

-boris
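
For readers following the quick fix quoted above: the idea of reading
from v before the hypercall can be tried out in user space.  The sketch
below is only an analogy, assuming Linux and the standard mmap() and
mincore() calls.  touch_page(), page_is_resident() and demo_touch() are
made-up names for illustration, and nothing here is kernel or Xen code.
It shows that a read through a volatile pointer is a real load the
compiler cannot drop, so any pending fault on the page is taken before
the next step runs.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: force one real load from the page containing v.
 * The volatile qualifier stops the compiler from discarding the unused
 * read, which is the same trick as (void)*(volatile int*)v in the diff.
 */
static inline void touch_page(const void *v)
{
	(void)*(const volatile int *)v;
}

/* Returns 1 if the page at addr is resident, 0 if not, -1 on error. */
static int page_is_resident(void *addr, size_t page_size)
{
	unsigned char vec = 0;

	if (mincore(addr, page_size, &vec) != 0)
		return -1;
	return vec & 1;
}

/*
 * Map one anonymous page, touch it, and report residency before and
 * after the touch.  Returns 0 on success, -1 if the mapping failed.
 */
static int demo_touch(int *before, int *after)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *p;

	assert(page_size > 0);

	p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	*before = page_is_resident(p, (size_t)page_size);
	touch_page(p);		/* takes the fault here, not later */
	*after = page_is_resident(p, (size_t)page_size);

	munmap(p, (size_t)page_size);
	return 0;
}
```

As in the thread, this only makes sense for an address that is known to
exist and is expected to have a backing page; for anything else the read
itself would be the crash.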


>
> We can address any other potential update_va_mapping calls at a later
> date (if they are shown to be problematic).
>
> David


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 22:49                                               ` [Xen-devel] " Boris Ostrovsky
@ 2015-07-29 22:55                                                 ` David Vrabel
  2015-07-29 22:55                                                 ` David Vrabel
                                                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 130+ messages in thread
From: David Vrabel @ 2015-07-29 22:55 UTC (permalink / raw)
  To: Boris Ostrovsky, Andrew Cooper, Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, David Vrabel, Jan Beulich,
	Sasha Levin



On 29/07/2015 23:49, Boris Ostrovsky wrote:
> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>
>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>> <andrew.cooper3@citrix.com> wrote:
>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>> themselves.
>>>>>>>>
>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>          pte = pfn_pte(pfn, prot);
>>>>>>>> +        (void)*(volatile int*)v;
>>>>>>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>                  BUG();
>>>>>>>>
>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>> this is the
>>>>>>> only site that we need to be careful about.
>>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>>> available yet?
>>>>> Quick and dirty?
>>>>>
>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>> we are
>>>>> certain v exists, is kernel memory and is expected to have a backing
>>>>> page.  I don't know offhand how many of current
>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>>> in the wings.  Keep in mind that we need this for -stable, and it's
>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>>> would probably work, and certainly be minimal hassle for -stable.
>>>
>>> Altering the hypercall used is certainly not something to backport, nor
>>> are we sure it is a viable fix at this time.
>> Changing this one use of update_va_mapping to use mmu_update_normal_pt
>> is the correct fix to unblock this LDT series.  I see no reason why this
>> cannot be backported.
> 
> To properly fix it should include batching and that is not something
> that I think we should target for stable.

The original call isn't batched, so its replacement doesn't need to be.

David

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 22:49                                               ` [Xen-devel] " Boris Ostrovsky
  2015-07-29 22:55                                                 ` David Vrabel
  2015-07-29 22:55                                                 ` David Vrabel
@ 2015-07-29 23:02                                                 ` Andrew Cooper
  2015-07-29 23:13                                                   ` Andy Lutomirski
  2015-07-29 23:13                                                   ` Andy Lutomirski
  2015-07-29 23:02                                                 ` Andrew Cooper
  3 siblings, 2 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-29 23:02 UTC (permalink / raw)
  To: Boris Ostrovsky, David Vrabel, Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, Borislav Petkov, David Vrabel, Jan Beulich,
	Sasha Levin

On 29/07/2015 23:49, Boris Ostrovsky wrote:
> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>
>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>> <andrew.cooper3@citrix.com> wrote:
>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>> themselves.
>>>>>>>>
>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>          pte = pfn_pte(pfn, prot);
>>>>>>>> +        (void)*(volatile int*)v;
>>>>>>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>                  BUG();
>>>>>>>>
>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>> this is the
>>>>>>> only site that we need to be careful about.
>>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>>> available yet?
>>>>> Quick and dirty?
>>>>>
>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>> we are
>>>>> certain v exists, is kernel memory and is expected to have a backing
>>>>> page.  I don't know offhand how many of current
>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>>> in the wings.  Keep in mind that we need this for -stable, and it's
>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>>> would probably work, and certainly be minimal hassle for -stable.
>>>
>>> Altering the hypercall used is certainly not something to backport, nor
>>> are we sure it is a viable fix at this time.
>> Changing this one use of update_va_mapping to use mmu_update_normal_pt
>> is the correct fix to unblock this LDT series.  I see no reason why this
>> cannot be backported.
>
> To properly fix it should include batching and that is not something
> that I think we should target for stable.

Batching is absolutely not necessary to alter update_va_mapping to
mmu_update_normal_pt.  After all, update_va_mapping isn't batched.

However, this isn't the first issue we have had with lazy mmu faulting,
and I doubt it is the last.  There are not many callsites of
update_va_mapping - I will audit them tomorrow and see if any similar
issues are lurking elsewhere.

~Andrew
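
The batching point above can be pictured with a small user-space model.
This is only a sketch under stated assumptions: struct pt_update,
struct update_batch, fake_hypercall(), batch_add(), batch_flush() and
update_one() are hypothetical names invented for illustration, not Xen
or kernel interfaces, and the "hypercall" is just a counter.  It shows
that replacing one unbatched call with another unbatched call costs the
same single trap, and that batching only pays off when several updates
are queued before the flush.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 8

/* Hypothetical stand-in for one page-table update request. */
struct pt_update {
	unsigned long ptr;	/* which entry to change */
	unsigned long val;	/* new value for that entry */
};

struct update_batch {
	struct pt_update req[BATCH_MAX];
	size_t count;		/* requests queued, not yet issued */
	unsigned int calls;	/* simulated hypercalls issued so far */
};

/* Simulated hypercall: hands n queued requests over in one trap. */
static void fake_hypercall(struct update_batch *b, size_t n)
{
	(void)n;
	b->calls++;
}

/* Issue everything queued so far as a single call. */
static void batch_flush(struct update_batch *b)
{
	if (b->count == 0)
		return;
	fake_hypercall(b, b->count);
	b->count = 0;
}

/* Queue one request, flushing first if the queue is full. */
static void batch_add(struct update_batch *b, unsigned long ptr,
		      unsigned long val)
{
	if (b->count == BATCH_MAX)
		batch_flush(b);
	assert(b->count < BATCH_MAX);
	b->req[b->count].ptr = ptr;
	b->req[b->count].val = val;
	b->count++;
}

/* Unbatched path: one update, one call, nothing left queued. */
static void update_one(struct update_batch *b, unsigned long ptr,
		       unsigned long val)
{
	batch_add(b, ptr, val);
	batch_flush(b);
}
```

In this model a caller that only ever uses update_one() sees exactly one
call per update, which is the situation described for the existing
update_va_mapping call site.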

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 23:02                                                 ` [Xen-devel] " Andrew Cooper
@ 2015-07-29 23:13                                                   ` Andy Lutomirski
  2015-07-30  0:29                                                     ` Andrew Cooper
  2015-07-30  0:29                                                     ` [Xen-devel] " Andrew Cooper
  2015-07-29 23:13                                                   ` Andy Lutomirski
  1 sibling, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-29 23:13 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Boris Ostrovsky, David Vrabel, security, Peter Zijlstra, X86 ML,
	linux-kernel, Steven Rostedt, xen-devel, Borislav Petkov,
	David Vrabel, Jan Beulich, Sasha Levin

On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>
>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>> themselves.
>>>>>>>>>
>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>
>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>>          pte = pfn_pte(pfn, prot);
>>>>>>>>> +        (void)*(volatile int*)v;
>>>>>>>>>          if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>>                  pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>>                  BUG();
>>>>>>>>>
>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>> this is the
>>>>>>>> only site that we need to be careful about.
>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>>>> available yet?
>>>>>> Quick and dirty?
>>>>>>
>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>> we are
>>>>>> certain v exists, is kernel memory and is expected to have a backing
>>>>>> page.  I don't know offhand how many of current
>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>>>> in the wings.  Keep in mind that we need this for -stable, and it's
>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>
>>>> Altering the hypercall used is certainly not something to backport, nor
>>>> are we sure it is a viable fix at this time.
>>> Changing this one use of update_va_mapping to use mmu_update_normal_pt
>>> is the correct fix to unblock this LDT series.  I see no reason why this
>>> cannot be backported.
>>
>> To properly fix it should include batching and that is not something
>> that I think we should target for stable.
>
> Batching is absolutely not necessary to alter update_va_mapping to
> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>
> However, this isn't the first issue we have had with lazy mmu faulting,
> and I doubt it is the last.  There are not many callsites of
> update_va_mapping - I will audit them tomorrow and see if any similar
> issues are lurking elsewhere.

One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
access to fault.  Is this something we should be worried about?

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 23:13                                                   ` Andy Lutomirski
  2015-07-30  0:29                                                     ` Andrew Cooper
@ 2015-07-30  0:29                                                     ` Andrew Cooper
  2015-07-30 18:30                                                       ` Andy Lutomirski
  2015-07-30 18:30                                                       ` Andy Lutomirski
  1 sibling, 2 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-30  0:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Boris Ostrovsky, David Vrabel, security, Peter Zijlstra, X86 ML,
	linux-kernel, Steven Rostedt, xen-devel, Borislav Petkov,
	David Vrabel, Jan Beulich, Sasha Levin

On 30/07/2015 00:13, Andy Lutomirski wrote:
> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
> <andrew.cooper3@citrix.com> wrote:
>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>>> themselves.
>>>>>>>>>>
>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>>>         pte = pfn_pte(pfn, prot);
>>>>>>>>>> +       (void)*(volatile int*)v;
>>>>>>>>>>         if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>>>                 pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>                 BUG();
>>>>>>>>>>
>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>>> this is the only site that we need to be careful about.
>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>>>>> available yet?
>>>>>>> Quick and dirty?
>>>>>>>
>>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>>> we are certain v exists, is kernel memory and is expected to have a
>>>>>>> backing page.  I don't know offhand how many of the current
>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>>>>> in the wings.  Keep in mind that we need this for -stable, and it's
>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>
>>>>> Altering the hypercall used is certainly not something to backport, nor
>>>>> are we sure it is a viable fix at this time.
>>>> Changing this one use of update_va_mapping to use mmu_update_normal_pt
>>>> is the correct fix to unblock this LDT series.  I see no reason why this
>>>> cannot be backported.
>>> To properly fix it should include batching and that is not something
>>> that I think we should target for stable.
>> Batching is absolutely not necessary to alter update_va_mapping to
>> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>>
>> However this isn't the first issue we have had with lazy mmu faulting,
>> and I doubt it is the last.  There are not many callsites of
>> update_va_mapping - I will audit them tomorrow and see if any similar
>> issues are lurking elsewhere.
> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
> access to fault.  Is this something we should be worried about?

Yes.  update_va_mapping() will function perfectly well taking one RW
mapping to RO even if there is a second RW mapping.  In such a case, the
next LDT access will fault.

On closer inspection, Xen is rather unhelpful with the fault.  Xen's
lazy #PF will be bounced back to the guest with cr2 adjusted to appear
in the range passed to set_ldt().  The error code however will be
unmodified (and limited only by not-user and not-reserved), so will
appear as a non-present read or write supervisor access to an address
which the kernel has a valid read mapping of.

Unlike pagetables, there is no notion of pinning a segdesc page in the
Xen ABI.  Pinning to a type allows the guest to take a single extra type
ref, and as a side effect forces eager validation of the contents.  It
also prevents another unsuspecting vcpu from coming along, constructing
a writeable mapping and turning the soon-to-be-faulted-in LDT into a
plain writeable page and forcing a fault.

This frankly looks like an oversight, as pinning a segdesc page would
work fine in the existing page model; it is just that there isn't a
hypercall to make such an action happen.

Therefore, set_ldt() needs to be confident that there are no writeable
mappings to the frames used to make up the LDT.  It could proactively
fault them in by accessing one descriptor in each page inside the limit,
but by the time a fault is received it is probably too late to work out
where the other mapping is which prevented the typechange (or indeed,
whether Xen objected to one of the descriptors instead).
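
As a rough illustration of the proactive fault-in idea above (a user-space
sketch only, not kernel or Xen code): the helper name, the 4 KiB page size
and the requirement that the LDT starts on a page boundary are assumptions
made for the sketch.  It reads one byte of the first descriptor in every
page inside the limit, which in a real guest would be the access that makes
Xen validate that page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */
#define SKETCH_DESC_SIZE 8u     /* an x86 segment descriptor is 8 bytes */

/*
 * Hypothetical helper: touch the first descriptor of every page covered by
 * an LDT of nr_entries descriptors.  ldt is assumed to be page aligned.
 * Returns the number of pages touched.
 */
static size_t sketch_prefault_ldt(const void *ldt, size_t nr_entries)
{
    const volatile uint8_t *base = ldt;
    size_t bytes = nr_entries * SKETCH_DESC_SIZE;
    size_t touched = 0;

    assert(ldt != NULL || nr_entries == 0);

    for (size_t off = 0; off < bytes; off += SKETCH_PAGE_SIZE) {
        /* volatile read so the compiler cannot discard the access */
        (void)base[off];
        touched++;
    }
    return touched;
}
```

Even with every page touched this way, a fault only says that something went
wrong; as noted above, it does not say which other mapping or which
descriptor was responsible.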

This is all a little bit messy.

~Andrew

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-29 23:13                                                   ` Andy Lutomirski
@ 2015-07-30  0:29                                                     ` Andrew Cooper
  2015-07-30  0:29                                                     ` [Xen-devel] " Andrew Cooper
  1 sibling, 0 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-30  0:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, David Vrabel, Borislav Petkov, David Vrabel,
	Jan Beulich, Sasha Levin, Boris Ostrovsky

On 30/07/2015 00:13, Andy Lutomirski wrote:
> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
> <andrew.cooper3@citrix.com> wrote:
>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>>> themselves.
>>>>>>>>>>
>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>>>         pte = pfn_pte(pfn, prot);
>>>>>>>>>> +       (void)*(volatile int*)v;
>>>>>>>>>>         if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>>>                 pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>                 BUG();
>>>>>>>>>>
>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>>> this is the only site that we need to be careful about.
>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>>>>> available yet?
>>>>>>> Quick and dirty?
>>>>>>>
>>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>>> we are certain v exists, is kernel memory and is expected to have a
>>>>>>> backing page.  I don't know offhand how many of the current
>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>>>>> in the wings.  Keep in mind that we need this for -stable, and it's
>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>
>>>>> Altering the hypercall used is certainly not something to backport, nor
>>>>> are we sure it is a viable fix at this time.
>>>> Changing this one use of update_va_mapping to use mmu_update_normal_pt
>>>> is the correct fix to unblock this LDT series.  I see no reason why this
>>>> cannot be backported.
>>> To properly fix it should include batching and that is not something
>>> that I think we should target for stable.
>> Batching is absolutely not necessary to alter update_va_mapping to
>> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>>
>> However this isn't the first issue we have had with lazy mmu faulting,
>> and I doubt it is the last.  There are not many callsites of
>> update_va_mapping - I will audit them tomorrow and see if any similar
>> issues are lurking elsewhere.
> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
> access to fault.  Is this something we should be worried about?

Yes.  update_va_mapping() will function perfectly well taking one RW
mapping to RO even if there is a second RW mapping.  In such a case, the
next LDT access will fault.

On closer inspection, Xen is rather unhelpful with the fault.  Xen's
lazy #PF will be bounced back to the guest with cr2 adjusted to appear
in the range passed to set_ldt().  The error code however will be
unmodified (and limited only by not-user and not-reserved), so will
appear as a non-present read or write supervisor access to an address
which the kernel has a valid read mapping of.

Unlike pagetables, there is no notion of pinning a segdesc page in the
Xen ABI.  Pinning to a type allows the guest to take a single extra type
ref, and as a side effect forces eager validation of the contents.  It
also prevents another unsuspecting vcpu from coming along, constructing
a writeable mapping and turning the soon-to-be-faulted-in LDT into a
plain writeable page and forcing a fault.

This frankly looks like an oversight, as pinning a segdesc page would
work fine in the existing page model; it is just that there isn't a
hypercall to make such an action happen.

Therefore, set_ldt() needs to be confident that there are no writeable
mappings to the frames used to make up the LDT.  It could proactively
fault them in by accessing one descriptor in each page inside the limit,
but by the time a fault is received it is probably too late to work out
where the other mapping is which prevented the typechange (or indeed,
whether Xen objected to one of the descriptors instead).

This is all a little bit messy.

~Andrew

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-30  0:29                                                     ` [Xen-devel] " Andrew Cooper
@ 2015-07-30 18:30                                                       ` Andy Lutomirski
  2015-07-30 18:54                                                         ` Andrew Cooper
  2015-07-30 18:54                                                         ` [Xen-devel] " Andrew Cooper
  2015-07-30 18:30                                                       ` Andy Lutomirski
  1 sibling, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-30 18:30 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Boris Ostrovsky, David Vrabel, security, Peter Zijlstra, X86 ML,
	linux-kernel, Steven Rostedt, xen-devel, Borislav Petkov,
	David Vrabel, Jan Beulich, Sasha Levin

On Wed, Jul 29, 2015 at 5:29 PM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> On 30/07/2015 00:13, Andy Lutomirski wrote:
>> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
>> <andrew.cooper3@citrix.com> wrote:
>>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>>>> themselves.
>>>>>>>>>>>
>>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>>>>         pte = pfn_pte(pfn, prot);
>>>>>>>>>>> +       (void)*(volatile int*)v;
>>>>>>>>>>>         if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>>>>                 pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>>                 BUG();
>>>>>>>>>>>
>>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>>>> this is the only site that we need to be careful about.
>>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>>>>>> available yet?
>>>>>>>> Quick and dirty?
>>>>>>>>
>>>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>>>> we are certain v exists, is kernel memory and is expected to have a
>>>>>>>> backing page.  I don't know offhand how many of the current
>>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>>>>>> in the wings.  Keep in mind that we need this for -stable, and it's
>>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>>
>>>>>> Altering the hypercall used is certainly not something to backport, nor
>>>>>> are we sure it is a viable fix at this time.
>>>>> Changing this one use of update_va_mapping to use mmu_update_normal_pt
>>>>> is the correct fix to unblock this LDT series.  I see no reason why this
>>>>> cannot be backported.
>>>> To properly fix it should include batching and that is not something
>>>> that I think we should target for stable.
>>> Batching is absolutely not necessary to alter update_va_mapping to
>>> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>>>
>>> However this isn't the first issue we have had with lazy mmu faulting,
>>> and I doubt it is the last.  There are not many callsites of
>>> update_va_mapping - I will audit them tomorrow and see if any similar
>>> issues are lurking elsewhere.
>> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
>> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
>> access to fault.  Is this something we should be worried about?
>
> Yes.  update_va_mapping() will function perfectly well taking one RW
> mapping to RO even if there is a second RW mapping.  In such a case, the
> next LDT access will fault.

Which is a problem because that alias might still exist, and also
because Linux really doesn't expect that fault.

>
> On closer inspection, Xen is rather unhelpful with the fault.  Xen's
> lazy #PF will be bounced back to the guest with cr2 adjusted to appear
> in the range passed to set_ldt().  The error code however will be
> unmodified (and limited only by not-user and not-reserved), so will
> appear as a non-present read or write supervisor access to an address
> which the kernel has a valid read mapping of.

More yuck.

I think I'm just going to stick an unconditional vm_unmap_aliases() in alloc_ldt.

> Therefore, set_ldt() needs to be confident that there are no writeable
> mappings to the frames used to make up the LDT.  It could proactively
> fault them in by accessing one descriptor in each page inside the limit,
> but by the time a fault is received it is probably too late to work out
> where the other mapping is which prevented the typechange (or indeed,
> whether Xen objected to one of the descriptors instead).

This seems like overkill.

I'm still a bit confused, though: the failure is in xen_free_ldt.  How
do we make it all the way to xen_free_ldt without the vmapped page
existing in the guest's page tables?  After all, we had to survive
xen_alloc_ldt first, and ISTM that should fail in exactly the same
way.

Anyway, I'll send v6.

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-30  0:29                                                     ` [Xen-devel] " Andrew Cooper
  2015-07-30 18:30                                                       ` Andy Lutomirski
@ 2015-07-30 18:30                                                       ` Andy Lutomirski
  1 sibling, 0 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-30 18:30 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: security, Peter Zijlstra, X86 ML, linux-kernel, Steven Rostedt,
	xen-devel, David Vrabel, Borislav Petkov, David Vrabel,
	Jan Beulich, Sasha Levin, Boris Ostrovsky

On Wed, Jul 29, 2015 at 5:29 PM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> On 30/07/2015 00:13, Andy Lutomirski wrote:
>> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
>> <andrew.cooper3@citrix.com> wrote:
>>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>>>> themselves.
>>>>>>>>>>>
>>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>>>>         pte = pfn_pte(pfn, prot);
>>>>>>>>>>> +       (void)*(volatile int*)v;
>>>>>>>>>>>         if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>>>>                 pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>>                 BUG();
>>>>>>>>>>>
>>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>>>> this is the only site that we need to be careful about.
>>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>>>>>> available yet?
>>>>>>>> Quick and dirty?
>>>>>>>>
>>>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>>>> we are certain v exists, is kernel memory and is expected to have a
>>>>>>>> backing page.  I don't know offhand how many of the current
>>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>>>>>> in the wings.  Keep in mind that we need this for -stable, and it's
>>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>>
>>>>>> Altering the hypercall used is certainly not something to backport, nor
>>>>>> are we sure it is a viable fix at this time.
>>>>> Changing this one use of update_va_mapping to use mmu_update_normal_pt
>>>>> is the correct fix to unblock this LDT series.  I see no reason why this
>>>>> cannot be backported.
>>>> To properly fix it should include batching and that is not something
>>>> that I think we should target for stable.
>>> Batching is absolutely not necessary to alter update_va_mapping to
>>> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>>>
>>> However this isn't the first issue we have had with lazy mmu faulting,
>>> and I doubt it is the last.  There are not many callsites of
>>> update_va_mapping - I will audit them tomorrow and see if any similar
>>> issues are lurking elsewhere.
>> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
>> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
>> access to fault.  Is this something we should be worried about?
>
> Yes.  update_va_mapping() will function perfectly well taking one RW
> mapping to RO even if there is a second RW mapping.  In such a case, the
> next LDT access will fault.

Which is a problem because that alias might still exist, and also
because Linux really doesn't expect that fault.

>
> On closer inspection, Xen is rather unhelpful with the fault.  Xen's
> lazy #PF will be bounced back to the guest with cr2 adjusted to appear
> in the range passed to set_ldt().  The error code however will be
> unmodified (and limited only by not-user and not-reserved), so will
> appear as a non-present read or write supervisor access to an address
> which the kernel has a valid read mapping of.

More yuck.

I think I'm just going to stick an unconditional vm_unmap_aliases() in alloc_ldt.

> Therefore, set_ldt() needs to be confident that there are no writeable
> mappings to the frames used to make up the LDT.  It could proactively
> fault them in by accessing one descriptor in each page inside the limit,
> but by the time a fault is received it is probably too late to work out
> where the other mapping is which prevented the typechange (or indeed,
> whether Xen objected to one of the descriptors instead).

This seems like overkill.

I'm still a bit confused, though: the failure is in xen_free_ldt.  How
do we make it all the way to xen_free_ldt without the vmapped page
existing in the guest's page tables?  After all, we had to survive
xen_alloc_ldt first, and ISTM that should fail in exactly the same
way.

Anyway, I'll send v6.

--Andy

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-30 18:30                                                       ` Andy Lutomirski
  2015-07-30 18:54                                                         ` Andrew Cooper
@ 2015-07-30 18:54                                                         ` Andrew Cooper
  2015-07-30 20:01                                                           ` Boris Ostrovsky
  2015-07-30 20:01                                                           ` [Xen-devel] " Boris Ostrovsky
  1 sibling, 2 replies; 130+ messages in thread
From: Andrew Cooper @ 2015-07-30 18:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Boris Ostrovsky, David Vrabel, security, Peter Zijlstra, X86 ML,
	linux-kernel, Steven Rostedt, xen-devel, Borislav Petkov,
	David Vrabel, Jan Beulich, Sasha Levin

On 30/07/15 19:30, Andy Lutomirski wrote:
> On Wed, Jul 29, 2015 at 5:29 PM, Andrew Cooper
> <andrew.cooper3@citrix.com> wrote:
>> On 30/07/2015 00:13, Andy Lutomirski wrote:
>>> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
>>> <andrew.cooper3@citrix.com> wrote:
>>>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>>>>> themselves.
>>>>>>>>>>>>
>>>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>>>>>         pte = pfn_pte(pfn, prot);
>>>>>>>>>>>> +       (void)*(volatile int*)v;
>>>>>>>>>>>>         if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>>>>>                 pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>>>                 BUG();
>>>>>>>>>>>>
>>>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>>>>> this is the only site that we need to be careful about.
>>>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>>>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>>>>>>> available yet?
>>>>>>>>> Quick and dirty?
>>>>>>>>>
>>>>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>>>>> we are certain v exists, is kernel memory and is expected to have a
>>>>>>>>> backing page.  I don't know offhand how many of the current
>>>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>>>>>>> in the wings.  Keep in mind that we need this for -stable, and it's
>>>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>>>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>>>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>>>
>>>>>>> Altering the hypercall used is certainly not something to backport, nor
>>>>>>> are we sure it is a viable fix at this time.
>>>>>> Changing this one use of update_va_mapping to use mmu_update_normal_pt
>>>>>> is the correct fix to unblock this LDT series.  I see no reason why this
>>>>>> cannot be backported.
>>>>> To properly fix it should include batching and that is not something
>>>>> that I think we should target for stable.
>>>> Batching is absolutely not necessary to alter update_va_mapping to
>>>> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>>>>
>>>> However this isn't the first issue we have had with lazy mmu faulting,
>>>> and I doubt it is the last.  There are not many callsites of
>>>> update_va_mapping - I will audit them tomorrow and see if any similar
>>>> issues are lurking elsewhere.
>>> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
>>> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
>>> access to fault.  Is this something we should be worried about?
>> Yes.  update_va_mapping() will function perfectly well taking one RW
>> mapping to RO even if there is a second RW mapping.  In such a case, the
>> next LDT access will fault.
> Which is a problem because that alias might still exist, and also
> because Linux really doesn't expect that fault.
>
>> On closer inspection, Xen is rather unhelpful with the fault.  Xen's
>> lazy #PF will be bounced back to the guest with cr2 adjusted to appear
>> in the range passed to set_ldt().  The error code however will be
>> unmodified (and limited only by not-user and not-reserved), so will
>> appear as a non-present read or write supervisor access to an address
>> which the kernel has a valid read mapping of.
> More yuck.
>
> I think I'm just going to stick an unconditional vm_unmap_aliases() in alloc_ldt.
>
>> Therefore, set_ldt() needs to be confident that there are no writeable
>> mappings to the frames used to make up the LDT.  It could proactively
>> fault them in by accessing one descriptor in each page inside the limit,
>> but by the time a fault is received it is probably too late to work out
>> where the other mapping is which prevented the typechange (or indeed,
>> whether Xen objected to one of the descriptors instead).
> This seems like overkill.
>
> I'm still a bit confused, though: the failure is in xen_free_ldt.  How
> do we make it all the way to xen_free_ldt without the vmapped page
> existing in the guest's page tables?  After all, we had to survive
> xen_alloc_ldt first, and ISTM that should fail in exactly the same
> way.

(Summarising part of a discussion which has just occurred on IRC)

I presume that xen_free_ldt() is called while in the context of an mm
which doesn't have the particular area of the vmalloc() space faulted in.

This is (I presume) why reading 'v' (which occasionally causes a
pagefault to occur) fixes the issue.
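
A toy model of the presumed sequence, purely for illustration: none of these
names exist in Linux or Xen, and the model assumes (as presumed above) that a
vmalloc-style mapping lands in a reference table first and is copied into
each mm only when that mm faults on it.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 4

/* Toy page table: one "present" bit per slot of the vmalloc-like area. */
struct toy_mm {
    bool present[TOY_SLOTS];
};

/* Reference table that new mappings are installed into first. */
static struct toy_mm toy_reference;

/* Creating a mapping only updates the reference table. */
static void toy_vmap(size_t slot)
{
    toy_reference.present[slot] = true;
}

/*
 * A read from the given mm.  If the entry is missing, the toy fault handler
 * copies it from the reference table.  Returns false if the slot is
 * genuinely unmapped.
 */
static bool toy_read(struct toy_mm *mm, size_t slot)
{
    if (!mm->present[slot]) {
        if (!toy_reference.present[slot])
            return false;
        mm->present[slot] = true;   /* lazy sync on fault */
    }
    return true;
}

/*
 * Stand-in for the hypercall: it looks only at the current mm's table and
 * fails when the entry has not been faulted in yet.
 */
static int toy_update_va_mapping(const struct toy_mm *mm, size_t slot)
{
    return mm->present[slot] ? 0 : -1;
}
```

In this model the update fails in an mm that has never touched the slot and
succeeds once a read has pulled the entry in, which matches the effect of
the extra read of 'v' in the debugging patch.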

~Andrew

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-30 18:54                                                         ` [Xen-devel] " Andrew Cooper
  2015-07-30 20:01                                                           ` Boris Ostrovsky
@ 2015-07-30 20:01                                                           ` Boris Ostrovsky
  2015-07-30 20:05                                                             ` Andy Lutomirski
  2015-07-30 20:05                                                             ` Andy Lutomirski
  1 sibling, 2 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-30 20:01 UTC (permalink / raw)
  To: Andrew Cooper, Andy Lutomirski
  Cc: David Vrabel, security, Peter Zijlstra, X86 ML, linux-kernel,
	Steven Rostedt, xen-devel, Borislav Petkov, David Vrabel,
	Jan Beulich, Sasha Levin

On 07/30/2015 02:54 PM, Andrew Cooper wrote:
> On 30/07/15 19:30, Andy Lutomirski wrote:
>> On Wed, Jul 29, 2015 at 5:29 PM, Andrew Cooper
>> <andrew.cooper3@citrix.com> wrote:
>>> On 30/07/2015 00:13, Andy Lutomirski wrote:
>>>> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
>>>> <andrew.cooper3@citrix.com> wrote:
>>>>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>>>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>>>>>> themselves.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>>>>>>         pte = pfn_pte(pfn, prot);
>>>>>>>>>>>>> +       (void)*(volatile int*)v;
>>>>>>>>>>>>>         if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>>>>>>                 pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>>>>                 BUG();
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>>>>>> this is the
>>>>>>>>>>>> only site that we need to be careful about.
>>>>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>>>>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>>>>>>>> available yet?
>>>>>>>>>> Quick and dirty?
>>>>>>>>>>
>>>>>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>>>>>> we are
>>>>>>>>>> certain v exists, is kernel memory and is expected to have a backing
>>>>>>>>>> page.  I don't know offhand how many of current
>>>>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>>>>>>>> in the wings.  Keep in mind that we need this for -stable, and it's
>>>>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>>>>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>>>>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>>>>
>>>>>>>> Altering the hypercall used is certainly not something to backport, nor
>>>>>>>> are we sure it is a viable fix at this time.
>>>>>>> Changing this one use of update_va_mapping to use mmu_update_normal_pt
>>>>>>> is the correct fix to unblock this LDT series.  I see no reason why this
>>>>>>> cannot be backported.
>>>>>> To properly fix it should include batching and that is not something
>>>>>> that I think we should target for stable.
>>>>> Batching is absolutely not necessary to alter update_va_mapping to
>>>>> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>>>>>
>>>>> However this isn't the first issue we have had with lazy mmu faulting,
>>>>> and I doubt it is the last.  There are not many callsites of
>>>>> update_va_mapping - I will audit them tomorrow and see if any similar
>>>>> issues are lurking elsewhere.
>>>> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
>>>> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
>>>> access to fault.  Is this something we should be worried about?
>>> Yes.  update_va_mapping() will function perfectly well taking one RW
>>> mapping to RO even if there is a second RW mapping.  In such a case, the
>>> next LDT access will fault.
>> Which is a problem because that alias might still exist, and also
>> because Linux really doesn't expect that fault.
>>
>>> On closer inspection, Xen is rather unhelpful with the fault.  Xen's
>>> lazy #PF will be bounced back to the guest with cr2 adjusted to appear
>>> in the range passed to set_ldt().  The error code however will be
>>> unmodified (and limited only by not-user and not-reserved), so will
>>> appear as a non-present read or write supervisor access to an address
>>> which the kernel has a valid read mapping of.
>> More yuck.
>>
>> I think I'm just going to stick an unconditional vm_flush_aliases in alloc_ldt.
>>
>>> Therefore, set_ldt() needs to be confident that there are no writeable
>>> mappings to the frames used to make up the LDT.  It could proactively
>>> fault them in by accessing one descriptor in each page inside the limit,
>>> but by the time a fault is received it is probably too late to work out
>>> where the other mapping is which prevented the typechange (or indeed,
>>> whether Xen objected to one of the descriptors instead).
>> This seems like overkill.
>>
>> I'm still a bit confused, though: the failure is in xen_free_ldt.  How
>> do we make it all the way to xen_free_ldt without the vmapped page
>> existing in the guest's page tables?  After all, we had to survive
>> xen_alloc_ldt first, and ISTM that should fail in exactly the same
>> way.
> (Summarising part of a discussion which has just occurred on IRC)
>
> I presume that xen_free_ldt() is called while in the context of an mm
> which doesn't have the particular area of the vmalloc() space faulted in.

This is exactly what's happening --- the bug is only triggered during 
exit and xen_free_ldt() is called from someone else's context, e.g.:

[   53.986677] Call Trace:
[   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
[   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
[   53.986677]  [<c1062735>] destroy_context+0x25/0x40
[   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
[   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
[   53.986677]  [<c1863736>] __schedule+0x316/0x950
[   53.986677]  [<c1863d96>] schedule+0x26/0x70
[   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
[   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
[   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
[   53.986677]  [<c186717a>] syscall_call+0x7/0x7

But that would imply that this other context has mm->context.ldt of 
ldt_gdt_32. How is that possible?

-boris

>
> This is (I presume) why reading 'v' (which occasionally causes a
> pagefault to occur) fixes the issue.
>
> ~Andrew


* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-30 20:01                                                           ` [Xen-devel] " Boris Ostrovsky
@ 2015-07-30 20:05                                                             ` Andy Lutomirski
  2015-07-30 20:18                                                               ` Boris Ostrovsky
  2015-07-30 20:18                                                               ` [Xen-devel] " Boris Ostrovsky
  2015-07-30 20:05                                                             ` Andy Lutomirski
  1 sibling, 2 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-30 20:05 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andrew Cooper, David Vrabel, security, Peter Zijlstra, X86 ML,
	linux-kernel, Steven Rostedt, xen-devel, Borislav Petkov,
	David Vrabel, Jan Beulich, Sasha Levin

On Thu, Jul 30, 2015 at 1:01 PM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 07/30/2015 02:54 PM, Andrew Cooper wrote:
>>
>> On 30/07/15 19:30, Andy Lutomirski wrote:
>>>
>>> On Wed, Jul 29, 2015 at 5:29 PM, Andrew Cooper
>>> <andrew.cooper3@citrix.com> wrote:
>>>>
>>>> On 30/07/2015 00:13, Andy Lutomirski wrote:
>>>>>
>>>>> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>
>>>>>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>>>>>>
>>>>>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>>>>>>
>>>>>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>>>>>>
>>>>>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>>>>>>> themselves.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>>>>>>>         pte = pfn_pte(pfn, prot);
>>>>>>>>>>>>>> +       (void)*(volatile int*)v;
>>>>>>>>>>>>>>         if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>>>>>>>                 pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>>>>>                 BUG();
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same
>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>>>>>>> this is the
>>>>>>>>>>>>> only site that we need to be careful about.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix
>>>>>>>>>>>> that
>>>>>>>>>>>> can go to x86/urgent in the next few days even if a clean fix
>>>>>>>>>>>> isn't
>>>>>>>>>>>> available yet?
>>>>>>>>>>>
>>>>>>>>>>> Quick and dirty?
>>>>>>>>>>>
>>>>>>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>>>>>>> we are
>>>>>>>>>>> certain v exists, is kernel memory and is expected to have a
>>>>>>>>>>> backing
>>>>>>>>>>> page.  I don't know offhand how many of current
>>>>>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>>>>>>
>>>>>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something
>>>>>>>>>> better
>>>>>>>>>> in the wings.  Keep in mind that we need this for -stable, and
>>>>>>>>>> it's
>>>>>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>>>>>>
>>>>>>>>> Hmm - something like that tucked inside
>>>>>>>>> HYPERVISOR_update_va_mapping()
>>>>>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>>>>>
>>>>>>>>> Altering the hypercall used is certainly not something to backport,
>>>>>>>>> nor
>>>>>>>>> are we sure it is a viable fix at this time.
>>>>>>>>
>>>>>>>> Changing this one use of update_va_mapping to use
>>>>>>>> mmu_update_normal_pt
>>>>>>>> is the correct fix to unblock this LDT series.  I see no reason why
>>>>>>>> this
>>>>>>>> cannot be backported.
>>>>>>>
>>>>>>> To properly fix it should include batching and that is not something
>>>>>>> that I think we should target for stable.
>>>>>>
>>>>>> Batching is absolutely not necessary to alter update_va_mapping to
>>>>>> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>>>>>>
>>>>>> However this isn't the first issue we have had with lazy mmu
>>>>>> faulting,
>>>>>> and I doubt it is the last.  There are not many callsites of
>>>>>> update_va_mapping - I will audit them tomorrow and see if any similar
>>>>>> issues are lurking elsewhere.
>>>>>
>>>>> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
>>>>> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
>>>>> access to fault.  Is this something we should be worried about?
>>>>
>>>> Yes.  update_va_mapping() will function perfectly well taking one RW
>>>> mapping to RO even if there is a second RW mapping.  In such a case, the
>>>> next LDT access will fault.
>>>
>>> Which is a problem because that alias might still exist, and also
>>> because Linux really doesn't expect that fault.
>>>
>>>> On closer inspection, Xen is rather unhelpful with the fault.  Xen's
>>>> lazy #PF will be bounced back to the guest with cr2 adjusted to appear
>>>> in the range passed to set_ldt().  The error code however will be
>>>> unmodified (and limited only by not-user and not-reserved), so will
>>>> appear as a non-present read or write supervisor access to an address
>>>> which the kernel has a valid read mapping of.
>>>
>>> More yuck.
>>>
>>> I think I'm just going to stick an unconditional vm_flush_aliases in
>>> alloc_ldt.
>>>
>>>> Therefore, set_ldt() needs to be confident that there are no writeable
>>>> mappings to the frames used to make up the LDT.  It could proactively
>>>> fault them in by accessing one descriptor in each page inside the limit,
>>>> but by the time a fault is received it is probably too late to work out
>>>> where the other mapping is which prevented the typechange (or indeed,
>>>> whether Xen objected to one of the descriptors instead).
>>>
>>> This seems like overkill.
>>>
>>> I'm still a bit confused, though: the failure is in xen_free_ldt.  How
>>> do we make it all the way to xen_free_ldt without the vmapped page
>>> existing in the guest's page tables?  After all, we had to survive
>>> xen_alloc_ldt first, and ISTM that should fail in exactly the same
>>> way.
>>
>> (Summarising part of a discussion which has just occurred on IRC)
>>
>> I presume that xen_free_ldt() is called while in the context of an mm
>> which doesn't have the particular area of the vmalloc() space faulted in.
>
>
> This is exactly what's happening --- the bug is only triggered during exit
> and xen_free_ldt() is called from someone else's context, e.g.:
>
> [   53.986677] Call Trace:
> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>
> But that would imply that this other context has mm->context.ldt of
> ldt_gdt_32. How is that possible?
>

It's freed via destroy_context, which destroys someone else's LDT, right?
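
(Annotation: a toy userspace model of the ordering the trace suggests,
not kernel code.  All toy_* names are hypothetical.  The assumption is
that the CPU keeps a reference on whichever mm it is running on and
drops the previous one only after the next mm has been loaded, as the
finish_task_switch -> __mmdrop frames indicate.  If that drop is the
last reference, destroy_context runs while another mm's page tables
are active.)

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct; only what the sketch needs. */
struct toy_mm {
	int mm_count;                         /* references keeping the struct alive */
	int destroyed;                        /* set once toy_destroy_context() ran */
	const struct toy_mm *destroyed_under; /* mm that was loaded at that moment */
};

struct toy_cpu {
	struct toy_mm *active_mm;             /* mm whose page tables are loaded */
};

/* Records which mm was active when the context was torn down. */
void toy_destroy_context(struct toy_cpu *cpu, struct toy_mm *mm)
{
	mm->destroyed = 1;
	mm->destroyed_under = cpu->active_mm;
}

/* Drop one reference; the final drop tears the context down. */
void toy_mmdrop(struct toy_cpu *cpu, struct toy_mm *mm)
{
	assert(mm->mm_count > 0);
	if (--mm->mm_count == 0)
		toy_destroy_context(cpu, mm);
}

/* Load the next mm first, then release the CPU's reference on the
 * previous one, so any teardown happens under the next mm. */
void toy_switch_mm(struct toy_cpu *cpu, struct toy_mm *next)
{
	struct toy_mm *prev = cpu->active_mm;

	next->mm_count++;
	cpu->active_mm = next;
	if (prev != NULL)
		toy_mmdrop(cpu, prev);
}
```

Give the exiting task's mm one reference of its own, switch to it, drop
the task's reference (the exit), then switch to the waiter:
destroyed_under ends up pointing at the waiter's mm, not at the mm that
owned the LDT.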

--Andy

* Re: [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-30 20:01                                                           ` [Xen-devel] " Boris Ostrovsky
  2015-07-30 20:05                                                             ` Andy Lutomirski
@ 2015-07-30 20:05                                                             ` Andy Lutomirski
  1 sibling, 0 replies; 130+ messages in thread
From: Andy Lutomirski @ 2015-07-30 20:05 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: security, Peter Zijlstra, Andrew Cooper, X86 ML, linux-kernel,
	Steven Rostedt, xen-devel, David Vrabel, Borislav Petkov,
	David Vrabel, Jan Beulich, Sasha Levin

On Thu, Jul 30, 2015 at 1:01 PM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 07/30/2015 02:54 PM, Andrew Cooper wrote:
>>
>> On 30/07/15 19:30, Andy Lutomirski wrote:
>>>
>>> On Wed, Jul 29, 2015 at 5:29 PM, Andrew Cooper
>>> <andrew.cooper3@citrix.com> wrote:
>>>>
>>>> On 30/07/2015 00:13, Andy Lutomirski wrote:
>>>>>
>>>>> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>
>>>>>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>>>>>>
>>>>>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>>>>>>
>>>>>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>>>>>>
>>>>>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>>>>>>> themselves.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>>>>>>>         pte = pfn_pte(pfn, prot);
>>>>>>>>>>>>>> +       (void)*(volatile int*)v;
>>>>>>>>>>>>>>         if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>>>>>>>                 pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>>>>>                 BUG();
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same
>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>>>>>>> this is the
>>>>>>>>>>>>> only site that we need to be careful about.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix
>>>>>>>>>>>> that
>>>>>>>>>>>> can go to x86/urgent in the next few days even if a clean fix
>>>>>>>>>>>> isn't
>>>>>>>>>>>> available yet?
>>>>>>>>>>>
>>>>>>>>>>> Quick and dirty?
>>>>>>>>>>>
>>>>>>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>>>>>>> we are
>>>>>>>>>>> certain v exists, is kernel memory and is expected to have a
>>>>>>>>>>> backing
>>>>>>>>>>> page.  I don't know offhand how many of current
>>>>>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>>>>>>
>>>>>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something
>>>>>>>>>> better
>>>>>>>>>> in the wings.  Keep in mind that we need this for -stable, and
>>>>>>>>>> it's
>>>>>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>>>>>>
>>>>>>>>> Hmm - something like that tucked inside
>>>>>>>>> HYPERVISOR_update_va_mapping()
>>>>>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>>>>>
>>>>>>>>> Altering the hypercall used is certainly not something to backport,
>>>>>>>>> nor
>>>>>>>>> are we sure it is a viable fix at this time.
>>>>>>>>
>>>>>>>> Changing this one use of update_va_mapping to use
>>>>>>>> mmu_update_normal_pt
>>>>>>>> is the correct fix to unblock this LDT series.  I see no reason why
>>>>>>>> this
>>>>>>>> cannot be backported.
>>>>>>>
>>>>>>> To properly fix it should include batching and that is not something
>>>>>>> that I think we should target for stable.
>>>>>>
>>>>>> Batching is absolutely not necessary to alter update_va_mapping to
>>>>>> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>>>>>>
>>>>>> However this isn't the first issue we have had with lazy mmu
>>>>>> faulting,
>>>>>> and I doubt it is the last.  There are not many callsites of
>>>>>> update_va_mapping - I will audit them tomorrow and see if any similar
>>>>>> issues are lurking elsewhere.
>>>>>
>>>>> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
>>>>> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
>>>>> access to fault.  Is this something we should be worried about?
>>>>
>>>> Yes.  update_va_mapping() will function perfectly well taking one RW
>>>> mapping to RO even if there is a second RW mapping.  In such a case, the
>>>> next LDT access will fault.
>>>
>>> Which is a problem because that alias might still exist, and also
>>> because Linux really doesn't expect that fault.
>>>
>>>> On closer inspection, Xen is rather unhelpful with the fault.  Xen's
>>>> lazy #PF will be bounced back to the guest with cr2 adjusted to appear
>>>> in the range passed to set_ldt().  The error code however will be
>>>> unmodified (and limited only by not-user and not-reserved), so will
>>>> appear as a non-present read or write supervisor access to an address
>>>> which the kernel has a valid read mapping of.
>>>
>>> More yuck.
>>>
>>> I think I'm just going to stick an unconditional vm_flush_aliases in
>>> alloc_ldt.
>>>
>>>> Therefore, set_ldt() needs to be confident that there are no writeable
>>>> mappings to the frames used to make up the LDT.  It could proactively
>>>> fault them in by accessing one descriptor in each page inside the limit,
>>>> but by the time a fault is received it is probably too late to work out
>>>> where the other mapping is which prevented the typechange (or indeed,
>>>> whether Xen objected to one of the descriptors instead).
>>>
>>> This seems like overkill.
>>>
>>> I'm still a bit confused, though: the failure is in xen_free_ldt.  How
>>> do we make it all the way to xen_free_ldt without the vmapped page
>>> existing in the guest's page tables?  After all, we had to survive
>>> xen_alloc_ldt first, and ISTM that should fail in exactly the same
>>> way.
>>
>> (Summarising part of a discussion which has just occurred on IRC)
>>
>> I presume that xen_free_ldt() is called while in the context of an mm
>> which doesn't have the particular area of the vmalloc() space faulted in.
>
>
> This is exactly what's happening --- the bug is only triggered during exit
> and xen_free_ldt() is called from someone else's context, e.g.:
>
> [   53.986677] Call Trace:
> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>
> But that would imply that this other context has mm->context.ldt of
> ldt_gdt_32. How is that possible?
>

It's freed via destroy_context, which destroys someone else's LDT, right?

--Andy


* Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option
  2015-07-30 20:05                                                             ` Andy Lutomirski
  2015-07-30 20:18                                                               ` Boris Ostrovsky
@ 2015-07-30 20:18                                                               ` Boris Ostrovsky
  1 sibling, 0 replies; 130+ messages in thread
From: Boris Ostrovsky @ 2015-07-30 20:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Cooper, David Vrabel, security, Peter Zijlstra, X86 ML,
	linux-kernel, Steven Rostedt, xen-devel, Borislav Petkov,
	David Vrabel, Jan Beulich, Sasha Levin

On 07/30/2015 04:05 PM, Andy Lutomirski wrote:
> On Thu, Jul 30, 2015 at 1:01 PM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 07/30/2015 02:54 PM, Andrew Cooper wrote:
>>> On 30/07/15 19:30, Andy Lutomirski wrote:
>>>> On Wed, Jul 29, 2015 at 5:29 PM, Andrew Cooper
>>>> <andrew.cooper3@citrix.com> wrote:
>>>>> On 30/07/2015 00:13, Andy Lutomirski wrote:
>>>>>> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>>>>>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>>>>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>>>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>>>>>>>> <andrew.cooper3@citrix.com> wrote:
>>>>>>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>>>>>>>> themselves.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v, pgprot_t prot)
>>>>>>>>>>>>>>>         pte = pfn_pte(pfn, prot);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +       (void)*(volatile int*)v;
>>>>>>>>>>>>>>>         if (HYPERVISOR_update_va_mapping((unsigned long)v, pte, 0)) {
>>>>>>>>>>>>>>>                 pr_err("set_aliased_prot va update failed w/ lazy mode %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>>>>>>                 BUG();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same
>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>>>>>>>> this is the
>>>>>>>>>>>>>> only site that we need to be careful about.
>>>>>>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix
>>>>>>>>>>>>> that
>>>>>>>>>>>>> can go to x86/urgent in the next few days even if a clean fix
>>>>>>>>>>>>> isn't
>>>>>>>>>>>>> available yet?
>>>>>>>>>>>> Quick and dirty?
>>>>>>>>>>>>
>>>>>>>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>>>>>>>> we are
>>>>>>>>>>>> certain v exists, is kernel memory and is expected to have a
>>>>>>>>>>>> backing
>>>>>>>>>>>> page.  I don't know offhand how many of current
>>>>>>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something
>>>>>>>>>>> better
>>>>>>>>>>> in the wings.  Keep in mind that we need this for -stable, and
>>>>>>>>>>> it's
>>>>>>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>>>>>>> Hmm - something like that tucked inside
>>>>>>>>>> HYPERVISOR_update_va_mapping()
>>>>>>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>>>>>>
>>>>>>>>>> Altering the hypercall used is certainly not something to backport,
>>>>>>>>>> nor
>>>>>>>>>> are we sure it is a viable fix at this time.
>>>>>>>>> Changing this one use of update_va_mapping to use
>>>>>>>>> mmu_update_normal_pt
>>>>>>>>> is the correct fix to unblock this LDT series.  I see no reason why
>>>>>>>>> this
>>>>>>>>> cannot be backported.
>>>>>>>> To properly fix it should include batching and that is not something
>>>>>>>> that I think we should target for stable.
>>>>>>> Batching is absolutely not necessary to alter update_va_mapping to
>>>>>>> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>>>>>>>
>>>>>>> However this isn't the first issue we have had with lazy mmu
>>>>>>> faulting,
>>>>>>> and I doubt it is the last.  There are not many callsites of
>>>>>>> update_va_mapping - I will audit them tomorrow and see if any similar
>>>>>>> issues are lurking elsewhere.
>>>>>> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
>>>>>> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
>>>>>> access to fault.  Is this something we should be worried about?
>>>>> Yes.  update_va_mapping() will function perfectly well taking one RW
>>>>> mapping to RO even if there is a second RW mapping.  In such a case, the
>>>>> next LDT access will fault.
>>>> Which is a problem because that alias might still exist, and also
>>>> because Linux really doesn't expect that fault.
>>>>
>>>>> On closer inspection, Xen is rather unhelpful with the fault.  Xen's
>>>>> lazy #PF will be bounced back to the guest with cr2 adjusted to appear
>>>>> in the range passed to set_ldt().  The error code however will be
>>>>> unmodified (and limited only by not-user and not-reserved), so will
>>>>> appear as a non-present read or write supervisor access to an address
>>>>> which the kernel has a valid read mapping of.
>>>> More yuck.
>>>>
>>>> I think I'm just going to stick an unconditional vm_flush_aliases in
>>>> alloc_ldt.
>>>>
>>>>> Therefore, set_ldt() needs to be confident that there are no writeable
>>>>> mappings to the frames used to make up the LDT.  It could proactively
>>>>> fault them in by accessing one descriptor in each page inside the limit,
>>>>> but by the time a fault is received it is probably too late to work out
>>>>> where the other mapping is which prevented the typechange (or indeed,
>>>>> whether Xen objected to one of the descriptors instead).
>>>> This seems like overkill.
>>>>
>>>> I'm still a bit confused, though: the failure is in xen_free_ldt.  How
>>>> do we make it all the way to xen_free_ldt without the vmapped page
>>>> existing in the guest's page tables?  After all, we had to survive
>>>> xen_alloc_ldt first, and ISTM that should fail in exactly the same
>>>> way.
>>> (Summarising part of a discussion which has just occurred on IRC)
>>>
>>> I presume that xen_free_ldt() is called while in the context of an mm
>>> which doesn't have the particular area of the vmalloc() space faulted in.
>>
>> This is exactly what's happening --- the bug is only triggered during exit
>> and xen_free_ldt() is called from someone else's context, e.g.:
>>
>> [   53.986677] Call Trace:
>> [   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
>> [   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
>> [   53.986677]  [<c1062735>] destroy_context+0x25/0x40
>> [   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
>> [   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
>> [   53.986677]  [<c1863736>] __schedule+0x316/0x950
>> [   53.986677]  [<c1863d96>] schedule+0x26/0x70
>> [   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
>> [   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
>> [   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
>> [   53.986677]  [<c186717a>] syscall_call+0x7/0x7
>>
>> But that would imply that this other context has mm->context.ldt of
>> ldt_gdt_32. How is that possible?
>>
> It's freed via destroy_context, which destroys someone else's LDT, right?
>

Yes, that's what it appears to be.

-boris
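
[Editorial sketch, not part of the original message.]  The failure mode
agreed on above can be illustrated with a small userspace toy model in C.
It is only a sketch under stated assumptions, not kernel or Xen code:
every toy_* name, the two-mm setup and the page count are hypothetical.
It assumes that vmalloc mappings are installed in a reference page table
and copied into each mm lazily on first access, that the hypercall only
consults the currently loaded page tables, and that the LDT owner has
already touched the page before the alloc-side hypercall runs.

```c
/* Toy userspace model; all names are hypothetical, none are kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_MMS    2   /* mm 0 owns the LDT, mm 1 is an unrelated task */
#define NUM_VPAGES 4   /* pages in the modelled vmalloc area */

/* Reference page table: vmalloc mappings are always installed here. */
static bool init_present[NUM_VPAGES];
/* Per-mm page tables: vmalloc entries are copied in lazily. */
static bool mm_present[NUM_MMS][NUM_VPAGES];
/* Index of the mm whose page tables are currently loaded. */
static int current_mm;

/* Model of vmap()/vmalloc(): only the reference table is updated. */
static void toy_vmap(size_t vpage)
{
    assert(vpage < NUM_VPAGES);
    init_present[vpage] = true;
}

/* Model of a context switch to another address space. */
static void toy_switch_mm(int mm)
{
    assert(mm >= 0 && mm < NUM_MMS);
    current_mm = mm;
}

/*
 * Model of a plain memory access: a miss is fixed up from the reference
 * table, as a lazy vmalloc fault handler would do.  Returns false if
 * the page is not mapped anywhere.
 */
static bool toy_touch(size_t vpage)
{
    assert(vpage < NUM_VPAGES);
    if (!mm_present[current_mm][vpage]) {
        if (!init_present[vpage])
            return false;                      /* genuinely unmapped */
        mm_present[current_mm][vpage] = true;  /* lazy sync into this mm */
    }
    return true;
}

/*
 * Model of the hypercall: it looks only at the currently loaded page
 * tables and never goes through the guest's fault path, so a missing
 * entry is reported as an error (-1) instead of being fixed up.
 */
static int toy_update_va_mapping(size_t vpage)
{
    assert(vpage < NUM_VPAGES);
    return mm_present[current_mm][vpage] ? 0 : -1;
}

/* The quick workaround from the thread: read the address first. */
static int toy_update_va_mapping_probed(size_t vpage)
{
    if (!toy_touch(vpage))
        return -1;
    return toy_update_va_mapping(vpage);
}
```

In this model, calling toy_update_va_mapping() after toy_switch_mm(1)
fails for a page that only mm 0 has touched, which mirrors
xen_free_ldt() running from destroy_context() in someone else's context.
toy_update_va_mapping_probed() succeeds in the same situation, which is
the effect the (void)*(volatile int*)v; read in the quoted diff aims for.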

