LKML Archive on lore.kernel.org
 help / color / Atom feed
* [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3)
@ 2019-11-06 19:34 Thomas Gleixner
  2019-11-06 19:35 ` [patch 1/9] x86/ptrace: Prevent truncation of bitmap size Thomas Gleixner
                   ` (9 more replies)
  0 siblings, 10 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-06 19:34 UTC (permalink / raw)
  To: LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

This is the result of the previous discussion about assumptions that user
space always runs with interrupts enabled:

 https://lore.kernel.org/r/20191023123118.296135499@linutronix.de

The infinite wisdom of hardware designers coupled the I/O permission level
of accessing all 65536 I/O ports with the ability to use CLI/STI.

iopl(3), if granted gives this ability to user space. That's broken in
several ways:

  1) User space can lock up the machine when an interrupt disabled region
     runs into an infinite loop.

  2) Disabling interrupts in user space has no semantics, at least no well
     defined, consistent and understandable semantics.

     syscalls and exceptions ignore that state and can block, preempt etc.

#1 could be arguably achieved by fiddling with the wrong I/O ports as well.

#2 is the real issue:

It causes a problem in the user/kernel interface and in exception
handlers as it is a common assumption that user space executes with
interrupts enabled. But with IOPL(3) this assumption is not correct.
Neither for syscalls nor for exceptions.

There is code in the low level entry and exception handlers which
makes this assumption.  Even experienced kernel developers trip over
that as shown in the discussion referenced above.

Ideally we should delete iopl(), but there are existing users including
DPDK. None of those I checked rely on the CLI/STI ability. They all use it
for conveniance to access I/O ports.

The only thing I found using CLI/STI was some really ancient X
implementation. So dragons might be lurking, but that X stuff really won't
work on a current kernel anymore :)

After quite some discussion I came up with a solution to emulate IOPL via
the I/O bitmap mechanism without copying 8k of zeroed bitmap on every
context switch which is the main concern of people who prefer iopl() over
ioperm().

The trick is to use the io-bitmap offset in the TSS to point the CPU to a
bitmap with all bits cleared. This is slightly slower than just relying on
the IOPL magic in (E)FLAGS, but it's almost not noticeable.

The same trick can be used when switching away from a task which uses an
I/O bitmap to a task which does not. Instead of cleaning up the bitmap
storage, just point the I/O bitmap offset to a location which is outside of
the TSS limit. That puts the copy overhead solely on tasks which have
actually an I/O bitmap installed. The copy mechanism is quite stupid as
well as it starts always from 0 even if the first cleared bit is right at
the end of the bitmap.

The following series addresses this.

The first few patches are preparatory and consolidate needlessly duplicated
code to avoid duplicating all the changes for the IOPL emulation.

At the end it removes the legacy support completely which cleans up quite
some code all over the place including paravirt.

The improvement for switching away from an I/O bitmap using task to a sane
task w/o I/O bitmap is quite measurable in a microbench mark.

Also avoiding to copy several kilobytes just to update a tiny region has a
measurable impact.

Removing CLI/STI from iopl() allows us to consolidate and simplify the
entry and exception code instead of wasting time and racking nerves by
analysing the world and some more whether there is an implicit
assumption of user space having interrupts always enabled.

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/iopl

Thanks,

	tglx

8<---------------
 arch/x86/Kconfig                      |   26 ++++
 arch/x86/include/asm/paravirt.h       |    4 
 arch/x86/include/asm/paravirt_types.h |    2 
 arch/x86/include/asm/processor.h      |   92 ++++++++-------
 arch/x86/include/asm/ptrace.h         |    6 +
 arch/x86/include/asm/switch_to.h      |   10 +
 arch/x86/include/asm/xen/hypervisor.h |    2 
 arch/x86/kernel/cpu/common.c          |  176 +++++++++++------------------
 arch/x86/kernel/doublefault.c         |    2 
 arch/x86/kernel/ioport.c              |  203 ++++++++++++++++++++++++----------
 arch/x86/kernel/paravirt.c            |    2 
 arch/x86/kernel/process.c             |  177 ++++++++++++++++++++++++-----
 arch/x86/kernel/process_32.c          |   77 ------------
 arch/x86/kernel/process_64.c          |   86 --------------
 arch/x86/kernel/ptrace.c              |    2 
 arch/x86/xen/enlighten_pv.c           |   10 -
 tools/testing/selftests/x86/iopl.c    |  104 +++++++++++++++--
 17 files changed, 556 insertions(+), 425 deletions(-)


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch 1/9] x86/ptrace: Prevent truncation of bitmap size
  2019-11-06 19:34 [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3) Thomas Gleixner
@ 2019-11-06 19:35 ` Thomas Gleixner
  2019-11-07  7:31   ` Ingo Molnar
  2019-11-06 19:35 ` [patch 2/9] x86/process: Unify copy_thread_tls() Thomas Gleixner
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-06 19:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

The active() callback of the IO bitmap regset divides the IO bitmap size by
the word size (32/64 bit). As the I/O bitmap size is in bytes the active
check fails for bitmap sizes of 1-3 bytes on 32bit and 1-7 bytes on 64bit.

Use DIV_ROUND_UP() instead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/ptrace.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -697,7 +697,7 @@ static int ptrace_set_debugreg(struct ta
 static int ioperm_active(struct task_struct *target,
 			 const struct user_regset *regset)
 {
-	return target->thread.io_bitmap_max / regset->size;
+	return DIV_ROUND_UP(target->thread.io_bitmap_max, regset->size);
 }
 
 static int ioperm_get(struct task_struct *target,



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch 2/9] x86/process: Unify copy_thread_tls()
  2019-11-06 19:34 [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3) Thomas Gleixner
  2019-11-06 19:35 ` [patch 1/9] x86/ptrace: Prevent truncation of bitmap size Thomas Gleixner
@ 2019-11-06 19:35 ` Thomas Gleixner
  2019-11-08 22:31   ` Andy Lutomirski
  2019-11-06 19:35 ` [patch 3/9] x86/cpu: Unify cpu_init() Thomas Gleixner
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-06 19:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

While looking at the TSS io bitmap it turned out that any change in that
area would require identical changes to copy_thread_tls(). The 32 and 64
bit variants share sufficient code to consolidate them into a common
function to avoid duplication of upcoming modifications.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/ptrace.h    |    6 ++
 arch/x86/include/asm/switch_to.h |   10 ++++
 arch/x86/kernel/process.c        |   94 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/process_32.c     |   68 ----------------------------
 arch/x86/kernel/process_64.c     |   75 -------------------------------
 5 files changed, 110 insertions(+), 143 deletions(-)

--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -361,5 +361,11 @@ extern int do_get_thread_area(struct tas
 extern int do_set_thread_area(struct task_struct *p, int idx,
 			      struct user_desc __user *info, int can_allocate);
 
+#ifdef CONFIG_X86_64
+# define do_set_thread_area_64(p, s, t)	do_arch_prctl_64(p, s, t)
+#else
+# define do_set_thread_area_64(p, s, t)	(0)
+#endif
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_PTRACE_H */
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -103,7 +103,17 @@ static inline void update_task_stack(str
 	if (static_cpu_has(X86_FEATURE_XENPV))
 		load_sp0(task_top_of_stack(task));
 #endif
+}
 
+static inline void kthread_frame_init(struct inactive_task_frame *frame,
+				      unsigned long fun, unsigned long arg)
+{
+	frame->bx = fun;
+#ifdef CONFIG_X86_32
+	frame->di = arg;
+#else
+	frame->r12 = arg;
+#endif
 }
 
 #endif /* _ASM_X86_SWITCH_TO_H */
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -132,6 +132,100 @@ void exit_thread(struct task_struct *tsk
 	fpu__drop(fpu);
 }
 
+static int set_new_tls(struct task_struct *p, unsigned long tls)
+{
+	struct user_desc __user *utls = (struct user_desc __user *)tls;
+
+	if (in_ia32_syscall())
+		return do_set_thread_area(p, -1, utls, 0);
+	else
+		return do_set_thread_area_64(p, ARCH_SET_FS, tls);
+}
+
+static inline int copy_io_bitmap(struct task_struct *tsk)
+{
+	if (likely(!test_tsk_thread_flag(current, TIF_IO_BITMAP)))
+		return 0;
+
+	tsk->thread.io_bitmap_ptr = kmemdup(current->thread.io_bitmap_ptr,
+					    IO_BITMAP_BYTES, GFP_KERNEL);
+	if (!tsk->thread.io_bitmap_ptr) {
+		tsk->thread.io_bitmap_max = 0;
+		return -ENOMEM;
+	}
+	set_tsk_thread_flag(tsk, TIF_IO_BITMAP);
+	return 0;
+}
+
+static inline void free_io_bitmap(struct task_struct *tsk)
+{
+	if (tsk->thread.io_bitmap_ptr) {
+		kfree(tsk->thread.io_bitmap_ptr);
+		tsk->thread.io_bitmap_ptr = NULL;
+		tsk->thread.io_bitmap_max = 0;
+	}
+}
+
+int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
+		    unsigned long arg, struct task_struct *p, unsigned long tls)
+{
+	struct inactive_task_frame *frame;
+	struct fork_frame *fork_frame;
+	struct pt_regs *childregs;
+	int ret;
+
+	childregs = task_pt_regs(p);
+	fork_frame = container_of(childregs, struct fork_frame, regs);
+	frame = &fork_frame->frame;
+
+	frame->bp = 0;
+	frame->ret_addr = (unsigned long) ret_from_fork;
+	p->thread.sp = (unsigned long) fork_frame;
+	p->thread.io_bitmap_ptr = NULL;
+	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
+
+#ifdef CONFIG_X86_64
+	savesegment(gs, p->thread.gsindex);
+	p->thread.gsbase = p->thread.gsindex ? 0 : current->thread.gsbase;
+	savesegment(fs, p->thread.fsindex);
+	p->thread.fsbase = p->thread.fsindex ? 0 : current->thread.fsbase;
+	savesegment(es, p->thread.es);
+	savesegment(ds, p->thread.ds);
+#else
+	/* Clear all status flags including IF and set fixed bit. */
+	frame->flags = X86_EFLAGS_FIXED;
+#endif
+
+	/* Kernel thread ? */
+	if (unlikely(p->flags & PF_KTHREAD)) {
+		memset(childregs, 0, sizeof(struct pt_regs));
+		kthread_frame_init(frame, sp, arg);
+		return 0;
+	}
+
+	frame->bx = 0;
+	*childregs = *current_pt_regs();
+	childregs->ax = 0;
+	if (sp)
+		childregs->sp = sp;
+
+#ifdef CONFIG_X86_32
+	task_user_gs(p) = get_user_gs(current_pt_regs());
+#endif
+
+	ret = copy_io_bitmap(p);
+	if (ret)
+		return ret;
+
+	/* Set a new TLS for the child thread? */
+	if (clone_flags & CLONE_SETTLS) {
+		ret = set_new_tls(p, tls);
+		if (ret)
+			free_io_bitmap(p);
+	}
+	return ret;
+}
+
 void flush_thread(void)
 {
 	struct task_struct *tsk = current;
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -112,74 +112,6 @@ void release_thread(struct task_struct *
 	release_vm86_irqs(dead_task);
 }
 
-int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
-	unsigned long arg, struct task_struct *p, unsigned long tls)
-{
-	struct pt_regs *childregs = task_pt_regs(p);
-	struct fork_frame *fork_frame = container_of(childregs, struct fork_frame, regs);
-	struct inactive_task_frame *frame = &fork_frame->frame;
-	struct task_struct *tsk;
-	int err;
-
-	/*
-	 * For a new task use the RESET flags value since there is no before.
-	 * All the status flags are zero; DF and all the system flags must also
-	 * be 0, specifically IF must be 0 because we context switch to the new
-	 * task with interrupts disabled.
-	 */
-	frame->flags = X86_EFLAGS_FIXED;
-	frame->bp = 0;
-	frame->ret_addr = (unsigned long) ret_from_fork;
-	p->thread.sp = (unsigned long) fork_frame;
-	p->thread.sp0 = (unsigned long) (childregs+1);
-	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
-
-	if (unlikely(p->flags & PF_KTHREAD)) {
-		/* kernel thread */
-		memset(childregs, 0, sizeof(struct pt_regs));
-		frame->bx = sp;		/* function */
-		frame->di = arg;
-		p->thread.io_bitmap_ptr = NULL;
-		return 0;
-	}
-	frame->bx = 0;
-	*childregs = *current_pt_regs();
-	childregs->ax = 0;
-	if (sp)
-		childregs->sp = sp;
-
-	task_user_gs(p) = get_user_gs(current_pt_regs());
-
-	p->thread.io_bitmap_ptr = NULL;
-	tsk = current;
-	err = -ENOMEM;
-
-	if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) {
-		p->thread.io_bitmap_ptr = kmemdup(tsk->thread.io_bitmap_ptr,
-						IO_BITMAP_BYTES, GFP_KERNEL);
-		if (!p->thread.io_bitmap_ptr) {
-			p->thread.io_bitmap_max = 0;
-			return -ENOMEM;
-		}
-		set_tsk_thread_flag(p, TIF_IO_BITMAP);
-	}
-
-	err = 0;
-
-	/*
-	 * Set a new TLS for the child thread?
-	 */
-	if (clone_flags & CLONE_SETTLS)
-		err = do_set_thread_area(p, -1,
-			(struct user_desc __user *)tls, 0);
-
-	if (err && p->thread.io_bitmap_ptr) {
-		kfree(p->thread.io_bitmap_ptr);
-		p->thread.io_bitmap_max = 0;
-	}
-	return err;
-}
-
 void
 start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
 {
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -371,81 +371,6 @@ void x86_gsbase_write_task(struct task_s
 	task->thread.gsbase = gsbase;
 }
 
-int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p, unsigned long tls)
-{
-	int err;
-	struct pt_regs *childregs;
-	struct fork_frame *fork_frame;
-	struct inactive_task_frame *frame;
-	struct task_struct *me = current;
-
-	childregs = task_pt_regs(p);
-	fork_frame = container_of(childregs, struct fork_frame, regs);
-	frame = &fork_frame->frame;
-
-	frame->bp = 0;
-	frame->ret_addr = (unsigned long) ret_from_fork;
-	p->thread.sp = (unsigned long) fork_frame;
-	p->thread.io_bitmap_ptr = NULL;
-
-	savesegment(gs, p->thread.gsindex);
-	p->thread.gsbase = p->thread.gsindex ? 0 : me->thread.gsbase;
-	savesegment(fs, p->thread.fsindex);
-	p->thread.fsbase = p->thread.fsindex ? 0 : me->thread.fsbase;
-	savesegment(es, p->thread.es);
-	savesegment(ds, p->thread.ds);
-	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
-
-	if (unlikely(p->flags & PF_KTHREAD)) {
-		/* kernel thread */
-		memset(childregs, 0, sizeof(struct pt_regs));
-		frame->bx = sp;		/* function */
-		frame->r12 = arg;
-		return 0;
-	}
-	frame->bx = 0;
-	*childregs = *current_pt_regs();
-
-	childregs->ax = 0;
-	if (sp)
-		childregs->sp = sp;
-
-	err = -ENOMEM;
-	if (unlikely(test_tsk_thread_flag(me, TIF_IO_BITMAP))) {
-		p->thread.io_bitmap_ptr = kmemdup(me->thread.io_bitmap_ptr,
-						  IO_BITMAP_BYTES, GFP_KERNEL);
-		if (!p->thread.io_bitmap_ptr) {
-			p->thread.io_bitmap_max = 0;
-			return -ENOMEM;
-		}
-		set_tsk_thread_flag(p, TIF_IO_BITMAP);
-	}
-
-	/*
-	 * Set a new TLS for the child thread?
-	 */
-	if (clone_flags & CLONE_SETTLS) {
-#ifdef CONFIG_IA32_EMULATION
-		if (in_ia32_syscall())
-			err = do_set_thread_area(p, -1,
-				(struct user_desc __user *)tls, 0);
-		else
-#endif
-			err = do_arch_prctl_64(p, ARCH_SET_FS, tls);
-		if (err)
-			goto out;
-	}
-	err = 0;
-out:
-	if (err && p->thread.io_bitmap_ptr) {
-		kfree(p->thread.io_bitmap_ptr);
-		p->thread.io_bitmap_max = 0;
-	}
-
-	return err;
-}
-
 static void
 start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 		    unsigned long new_sp,



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch 3/9] x86/cpu: Unify cpu_init()
  2019-11-06 19:34 [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3) Thomas Gleixner
  2019-11-06 19:35 ` [patch 1/9] x86/ptrace: Prevent truncation of bitmap size Thomas Gleixner
  2019-11-06 19:35 ` [patch 2/9] x86/process: Unify copy_thread_tls() Thomas Gleixner
@ 2019-11-06 19:35 ` Thomas Gleixner
  2019-11-08 22:34   ` Andy Lutomirski
  2019-11-06 19:35 ` [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user Thomas Gleixner
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-06 19:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

Similar to copy_thread_tls() the 32bit and 64bit implementations of
cpu_init() are very similar and unification avoids duplicate changes in the
future.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/common.c |  172 ++++++++++++++++---------------------------
 1 file changed, 66 insertions(+), 106 deletions(-)

--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1749,7 +1749,7 @@ static void wait_for_master_cpu(int cpu)
 }
 
 #ifdef CONFIG_X86_64
-static void setup_getcpu(int cpu)
+static inline void setup_getcpu(int cpu)
 {
 	unsigned long cpudata = vdso_encode_cpunode(cpu, early_cpu_to_node(cpu));
 	struct desc_struct d = { };
@@ -1769,7 +1769,43 @@ static void setup_getcpu(int cpu)
 
 	write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_CPUNODE, &d, DESCTYPE_S);
 }
+
+static inline void ucode_cpu_init(int cpu)
+{
+	if (cpu)
+		load_ucode_ap();
+}
+
+static inline void tss_setup_ist(struct tss_struct *tss)
+{
+	/* Set up the per-CPU TSS IST stacks */
+	tss->x86_tss.ist[IST_INDEX_DF] = __this_cpu_ist_top_va(DF);
+	tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(NMI);
+	tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(DB);
+	tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(MCE);
+}
+
+static inline void gdt_setup_doublefault_tss(int cpu) { }
+
+#else /* CONFIG_X86_64 */
+
+static inline void setup_getcpu(int cpu) { }
+
+static inline void ucode_cpu_init(int cpu)
+{
+	show_ucode_info_early();
+}
+
+static inline void tss_setup_ist(struct tss_struct *tss) { }
+
+static inline void gdt_setup_doublefault_tss(int cpu)
+{
+#ifdef CONFIG_DOUBLEFAULT
+	/* Set up the doublefault TSS pointer in the GDT */
+	__set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS, &doublefault_tss);
 #endif
+}
+#endif /* !CONFIG_X86_64 */
 
 /*
  * cpu_init() initializes state that is per-CPU. Some data is already
@@ -1777,21 +1813,15 @@ static void setup_getcpu(int cpu)
  * and IDT. We reload them nevertheless, this function acts as a
  * 'CPU state barrier', nothing should get across.
  */
-#ifdef CONFIG_X86_64
-
 void cpu_init(void)
 {
+	struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
+	struct task_struct *cur = current;
 	int cpu = raw_smp_processor_id();
-	struct task_struct *me;
-	struct tss_struct *t;
-	int i;
 
 	wait_for_master_cpu(cpu);
 
-	if (cpu)
-		load_ucode_ap();
-
-	t = &per_cpu(cpu_tss_rw, cpu);
+	ucode_cpu_init(cpu);
 
 #ifdef CONFIG_NUMA
 	if (this_cpu_read(numa_node) == 0 &&
@@ -1800,63 +1830,48 @@ void cpu_init(void)
 #endif
 	setup_getcpu(cpu);
 
-	me = current;
-
 	pr_debug("Initializing CPU#%d\n", cpu);
 
-	cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
+	if (IS_ENABLED(CONFIG_X86_64) || cpu_feature_enabled(X86_FEATURE_VME) ||
+	    boot_cpu_has(X86_FEATURE_TSC) || boot_cpu_has(X86_FEATURE_DE))
+		cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
 
 	/*
 	 * Initialize the per-CPU GDT with the boot GDT,
 	 * and set up the GDT descriptor:
 	 */
-
 	switch_to_new_gdt(cpu);
-	loadsegment(fs, 0);
-
 	load_current_idt();
 
-	memset(me->thread.tls_array, 0, GDT_ENTRY_TLS_ENTRIES * 8);
-	syscall_init();
-
-	wrmsrl(MSR_FS_BASE, 0);
-	wrmsrl(MSR_KERNEL_GS_BASE, 0);
-	barrier();
+	if (IS_ENABLED(CONFIG_X86_64)) {
+		loadsegment(fs, 0);
+		memset(cur->thread.tls_array, 0, GDT_ENTRY_TLS_ENTRIES * 8);
+		syscall_init();
+
+		wrmsrl(MSR_FS_BASE, 0);
+		wrmsrl(MSR_KERNEL_GS_BASE, 0);
+		barrier();
 
-	x86_configure_nx();
-	x2apic_setup();
-
-	/*
-	 * set up and load the per-CPU TSS
-	 */
-	if (!t->x86_tss.ist[0]) {
-		t->x86_tss.ist[IST_INDEX_DF] = __this_cpu_ist_top_va(DF);
-		t->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(NMI);
-		t->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(DB);
-		t->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(MCE);
+		x2apic_setup();
 	}
 
-	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
-
-	/*
-	 * <= is required because the CPU will access up to
-	 * 8 bits beyond the end of the IO permission bitmap.
-	 */
-	for (i = 0; i <= IO_BITMAP_LONGS; i++)
-		t->io_bitmap[i] = ~0UL;
-
 	mmgrab(&init_mm);
-	me->active_mm = &init_mm;
-	BUG_ON(me->mm);
+	cur->active_mm = &init_mm;
+	BUG_ON(cur->mm);
 	initialize_tlbstate_and_flush();
-	enter_lazy_tlb(&init_mm, me);
+	enter_lazy_tlb(&init_mm, cur);
 
-	/*
-	 * Initialize the TSS.  sp0 points to the entry trampoline stack
-	 * regardless of what task is running.
-	 */
+	/* Initialize the TSS. */
+	tss_setup_ist(tss);
+	tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
+	memset(tss->io_bitmap, 0xff, sizeof(tss->io_bitmap));
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
+
 	load_TR_desc();
+	/*
+	 * sp0 points to the entry trampoline stack regardless of what task
+	 * is running.
+	 */
 	load_sp0((unsigned long)(cpu_entry_stack(cpu) + 1));
 
 	load_mm_ldt(&init_mm);
@@ -1864,6 +1879,8 @@ void cpu_init(void)
 	clear_all_debug_regs();
 	dbg_restore_debug_regs();
 
+	gdt_setup_doublefault_tss(cpu);
+
 	fpu__init_cpu();
 
 	if (is_uv_system())
@@ -1872,63 +1889,6 @@ void cpu_init(void)
 	load_fixmap_gdt(cpu);
 }
 
-#else
-
-void cpu_init(void)
-{
-	int cpu = smp_processor_id();
-	struct task_struct *curr = current;
-	struct tss_struct *t = &per_cpu(cpu_tss_rw, cpu);
-
-	wait_for_master_cpu(cpu);
-
-	show_ucode_info_early();
-
-	pr_info("Initializing CPU#%d\n", cpu);
-
-	if (cpu_feature_enabled(X86_FEATURE_VME) ||
-	    boot_cpu_has(X86_FEATURE_TSC) ||
-	    boot_cpu_has(X86_FEATURE_DE))
-		cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
-
-	load_current_idt();
-	switch_to_new_gdt(cpu);
-
-	/*
-	 * Set up and load the per-CPU TSS and LDT
-	 */
-	mmgrab(&init_mm);
-	curr->active_mm = &init_mm;
-	BUG_ON(curr->mm);
-	initialize_tlbstate_and_flush();
-	enter_lazy_tlb(&init_mm, curr);
-
-	/*
-	 * Initialize the TSS.  sp0 points to the entry trampoline stack
-	 * regardless of what task is running.
-	 */
-	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
-	load_TR_desc();
-	load_sp0((unsigned long)(cpu_entry_stack(cpu) + 1));
-
-	load_mm_ldt(&init_mm);
-
-	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
-
-#ifdef CONFIG_DOUBLEFAULT
-	/* Set up doublefault TSS pointer in the GDT */
-	__set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS, &doublefault_tss);
-#endif
-
-	clear_all_debug_regs();
-	dbg_restore_debug_regs();
-
-	fpu__init_cpu();
-
-	load_fixmap_gdt(cpu);
-}
-#endif
-
 /*
  * The microcode loader calls this upon late microcode load to recheck features,
  * only when microcode has been updated. Caller holds microcode_mutex and CPU



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user
  2019-11-06 19:34 [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3) Thomas Gleixner
                   ` (2 preceding siblings ...)
  2019-11-06 19:35 ` [patch 3/9] x86/cpu: Unify cpu_init() Thomas Gleixner
@ 2019-11-06 19:35 ` Thomas Gleixner
  2019-11-07  9:12   ` Peter Zijlstra
  2019-11-09  0:24   ` Andy Lutomirski
  2019-11-06 19:35 ` [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further Thomas Gleixner
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-06 19:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

There is no requirement to update the TSS I/O bitmap when a thread using it is
scheduled out and the incoming thread does not use it.

For the permission check based on the TSS I/O bitmap the CPU calculates the memory
location of the I/O bitmap by the address of the TSS and the io_bitmap_base member
of the tss_struct. The easiest way to invalidate the I/O bitmap is to switch the
offset to an address outside of the TSS limit.

If an I/O instruction is issued from user space the TSS limit causes #GP to be
raised in the same was as valid I/O bitmap with all bits set to 1 would do.

This removes the extra work when an I/O bitmap using task is scheduled out
and puts the burden on the rare I/O bitmap users when they are scheduled
in.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/processor.h |   38 +++++++++++++++++--------
 arch/x86/kernel/cpu/common.c     |    3 +
 arch/x86/kernel/doublefault.c    |    2 -
 arch/x86/kernel/process.c        |   59 +++++++++++++++++++++------------------
 4 files changed, 61 insertions(+), 41 deletions(-)

--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -330,8 +330,23 @@ struct x86_hw_tss {
 #define IO_BITMAP_BITS			65536
 #define IO_BITMAP_BYTES			(IO_BITMAP_BITS/8)
 #define IO_BITMAP_LONGS			(IO_BITMAP_BYTES/sizeof(long))
-#define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
-#define INVALID_IO_BITMAP_OFFSET	0x8000
+
+#define IO_BITMAP_OFFSET_VALID				\
+	(offsetof(struct tss_struct, io_bitmap) -	\
+	 offsetof(struct tss_struct, x86_tss))
+
+/*
+ * sizeof(unsigned long) coming from an extra "long" at the end
+ * of the iobitmap.
+ *
+ * -1? seg base+limit should be pointing to the address of the
+ * last valid byte
+ */
+#define __KERNEL_TSS_LIMIT	\
+	(IO_BITMAP_OFFSET_VALID + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)
+
+/* Base offset outside of TSS_LIMIT so unpriviledged IO causes #GP */
+#define IO_BITMAP_OFFSET_INVALID	(__KERNEL_TSS_LIMIT + 1)
 
 struct entry_stack {
 	unsigned long		words[64];
@@ -350,6 +365,15 @@ struct tss_struct {
 	struct x86_hw_tss	x86_tss;
 
 	/*
+	 * Store the dirty size of the last io bitmap offender. The next
+	 * one will have to do the cleanup as the switch out to a non
+	 * io bitmap user will just set x86_tss.io_bitmap_base to a value
+	 * outside of the TSS limit. So for sane tasks there is no need
+	 * to actually touch the io_bitmap at all.
+	 */
+	unsigned int		io_bitmap_prev_max;
+
+	/*
 	 * The extra 1 is there because the CPU will access an
 	 * additional byte beyond the end of the IO permission
 	 * bitmap. The extra byte must be all 1 bits, and must
@@ -360,16 +384,6 @@ struct tss_struct {
 
 DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw);
 
-/*
- * sizeof(unsigned long) coming from an extra "long" at the end
- * of the iobitmap.
- *
- * -1? seg base+limit should be pointing to the address of the
- * last valid byte
- */
-#define __KERNEL_TSS_LIMIT	\
-	(IO_BITMAP_OFFSET + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)
-
 /* Per CPU interrupt stacks */
 struct irq_stack {
 	char		stack[IRQ_STACK_SIZE];
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1863,7 +1863,8 @@ void cpu_init(void)
 
 	/* Initialize the TSS. */
 	tss_setup_ist(tss);
-	tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
+	tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
+	tss->io_bitmap_prev_max = 0;
 	memset(tss->io_bitmap, 0xff, sizeof(tss->io_bitmap));
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 
--- a/arch/x86/kernel/doublefault.c
+++ b/arch/x86/kernel/doublefault.c
@@ -54,7 +54,7 @@ struct x86_hw_tss doublefault_tss __cach
 	.sp0		= STACK_START,
 	.ss0		= __KERNEL_DS,
 	.ldt		= 0,
-	.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
+	.io_bitmap_base	= IO_BITMAP_OFFSET_INVALID,
 
 	.ip		= (unsigned long) doublefault_fn,
 	/* 0x2 bit is always set */
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -72,18 +72,9 @@
 #ifdef CONFIG_X86_32
 		.ss0 = __KERNEL_DS,
 		.ss1 = __KERNEL_CS,
-		.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
 #endif
+		.io_bitmap_base	= IO_BITMAP_OFFSET_INVALID,
 	 },
-#ifdef CONFIG_X86_32
-	 /*
-	  * Note that the .io_bitmap member must be extra-big. This is because
-	  * the CPU will access an additional byte beyond the end of the IO
-	  * permission bitmap. The extra byte must be all 1 bits, and must
-	  * be within the limit.
-	  */
-	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
-#endif
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 
@@ -112,18 +103,18 @@ void exit_thread(struct task_struct *tsk
 	struct thread_struct *t = &tsk->thread;
 	unsigned long *bp = t->io_bitmap_ptr;
 	struct fpu *fpu = &t->fpu;
+	struct tss_struct *tss;
 
 	if (bp) {
-		struct tss_struct *tss = &per_cpu(cpu_tss_rw, get_cpu());
+		preempt_disable();
+		tss = this_cpu_ptr(&cpu_tss_rw);
 
 		t->io_bitmap_ptr = NULL;
-		clear_thread_flag(TIF_IO_BITMAP);
-		/*
-		 * Careful, clear this in the TSS too:
-		 */
-		memset(tss->io_bitmap, 0xff, t->io_bitmap_max);
 		t->io_bitmap_max = 0;
-		put_cpu();
+		clear_thread_flag(TIF_IO_BITMAP);
+		/* Invalidate the io bitmap base in the TSS */
+		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
+		preempt_enable();
 		kfree(bp);
 	}
 
@@ -363,29 +354,43 @@ void arch_setup_new_exec(void)
 	}
 }
 
-static inline void switch_to_bitmap(struct thread_struct *prev,
-				    struct thread_struct *next,
+static inline void switch_to_bitmap(struct thread_struct *next,
 				    unsigned long tifp, unsigned long tifn)
 {
 	struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
 
 	if (tifn & _TIF_IO_BITMAP) {
 		/*
-		 * Copy the relevant range of the IO bitmap.
-		 * Normally this is 128 bytes or less:
+		 * Copy at least the size of the incoming tasks bitmap
+		 * which covers the last permitted I/O port.
+		 *
+		 * If the previous task which used an io bitmap had more
+		 * bits permitted, then the copy needs to cover those as
+		 * well so they get turned off.
 		 */
 		memcpy(tss->io_bitmap, next->io_bitmap_ptr,
-		       max(prev->io_bitmap_max, next->io_bitmap_max));
+		       max(tss->io_bitmap_prev_max, next->io_bitmap_max));
+
+		/* Store the new max and set io_bitmap_base valid */
+		tss->io_bitmap_prev_max = next->io_bitmap_max;
+		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
+
 		/*
-		 * Make sure that the TSS limit is correct for the CPU
-		 * to notice the IO bitmap.
+		 * Make sure that the TSS limit is covering the io bitmap.
+		 * It might have been cut down by a VMEXIT to 0x67 which
+		 * would cause a subsequent I/O access from user space to
+		 * trigger a #GP because tbe bitmap is outside the TSS
+		 * limit.
 		 */
 		refresh_tss_limit();
 	} else if (tifp & _TIF_IO_BITMAP) {
 		/*
-		 * Clear any possible leftover bits:
+		 * Do not touch the bitmap. Let the next bitmap using task
+		 * deal with the mess. Just make the io_bitmap_base invalid
+		 * by moving it outside the TSS limit so any subsequent I/O
+		 * access from user space will trigger a #GP.
 		 */
-		memset(tss->io_bitmap, 0xff, prev->io_bitmap_max);
+		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
 	}
 }
 
@@ -599,7 +604,7 @@ void __switch_to_xtra(struct task_struct
 
 	tifn = READ_ONCE(task_thread_info(next_p)->flags);
 	tifp = READ_ONCE(task_thread_info(prev_p)->flags);
-	switch_to_bitmap(prev, next, tifp, tifn);
+	switch_to_bitmap(next, tifp, tifn);
 
 	propagate_user_return_notify(prev_p, next_p);
 



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-06 19:34 [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3) Thomas Gleixner
                   ` (3 preceding siblings ...)
  2019-11-06 19:35 ` [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user Thomas Gleixner
@ 2019-11-06 19:35 ` Thomas Gleixner
  2019-11-07  1:11   ` Linus Torvalds
                     ` (3 more replies)
  2019-11-06 19:35 ` [patch 6/9] x86/iopl: Fixup misleading comment Thomas Gleixner
                   ` (4 subsequent siblings)
  9 siblings, 4 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-06 19:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

Sane ioperm() users only set the few bits in the I/O space which they need to
access. But the update logic of the TSS I/O bitmap copies always from byte
0 to the last byte in the tasks bitmap which contains a zero permission bit.

That means that for access only to port 65335 the full 8K bitmap has to be
copied even if all the bytes in the TSS I/O bitmap are already filled with
0xff.

Calculate both the position of the first zero bit and the last zero bit to
limit the range which needs to be copied. This does not solve the problem
when the previous tasked had only byte 0 cleared and the next one has only
byte 65535 cleared, but trying to solve that would be too complex and
heavyweight for the context switch path. As the ioperm() usage is very rare
the case which is optimized is the single task/process which uses ioperm().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/processor.h |   28 ++++---
 arch/x86/kernel/cpu/common.c     |    3 
 arch/x86/kernel/ioport.c         |  141 +++++++++++++++++++++++++++------------
 arch/x86/kernel/process.c        |   51 +++++++++-----
 arch/x86/kernel/ptrace.c         |    2 
 5 files changed, 152 insertions(+), 73 deletions(-)

--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -365,19 +365,19 @@ struct tss_struct {
 	struct x86_hw_tss	x86_tss;
 
 	/*
-	 * Store the dirty size of the last io bitmap offender. The next
-	 * one will have to do the cleanup as the switch out to a non
-	 * io bitmap user will just set x86_tss.io_bitmap_base to a value
-	 * outside of the TSS limit. So for sane tasks there is no need
-	 * to actually touch the io_bitmap at all.
+	 * Store the dirty byte range of the last io bitmap offender. The
+	 * next one will have to do the cleanup because the switch out to a
+	 * non I/O bitmap user will just set x86_tss.io_bitmap_base to a
+	 * value outside of the TSS limit to not penalize tasks which do
+	 * not use the I/O bitmap at all.
 	 */
-	unsigned int		io_bitmap_prev_max;
+	unsigned int		io_zerobits_start;
+	unsigned int		io_zerobits_end;
 
 	/*
-	 * The extra 1 is there because the CPU will access an
-	 * additional byte beyond the end of the IO permission
-	 * bitmap. The extra byte must be all 1 bits, and must
-	 * be within the limit.
+	 * The extra 1 is there because the CPU will access an additional
+	 * byte beyond the end of the I/O permission bitmap. The extra byte
+	 * must have all bits set and must be within the TSS limit.
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
 } __aligned(PAGE_SIZE);
@@ -496,8 +496,12 @@ struct thread_struct {
 	/* IO permissions: */
 	unsigned long		*io_bitmap_ptr;
 	unsigned long		iopl;
-	/* Max allowed port in the bitmap, in bytes: */
-	unsigned		io_bitmap_max;
+	/*
+	 * The byte range in the I/O permission bitmap which contains zero
+	 * bits.
+	 */
+	unsigned int		io_zerobits_start;
+	unsigned int		io_zerobits_end;
 
 	mm_segment_t		addr_limit;
 
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1864,7 +1864,8 @@ void cpu_init(void)
 	/* Initialize the TSS. */
 	tss_setup_ist(tss);
 	tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
-	tss->io_bitmap_prev_max = 0;
+	tss->io_zerobits_start = IO_BITMAP_BYTES;
+	tss->io_zerobits_end = 0;
 	memset(tss->io_bitmap, 0xff, sizeof(tss->io_bitmap));
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 
--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -26,9 +26,11 @@
  */
 long ksys_ioperm(unsigned long from, unsigned long num, int turn_on)
 {
+	unsigned int first, last, next, start, end, copy_start, copy_len;
 	struct thread_struct *t = &current->thread;
+	unsigned long *bitmap, *tofree = NULL;
 	struct tss_struct *tss;
-	unsigned int i, max_long, bytes, bytes_updated;
+	char *src, *dst;
 
 	if ((from + num <= from) || (from + num > IO_BITMAP_BITS))
 		return -EINVAL;
@@ -37,63 +39,120 @@ long ksys_ioperm(unsigned long from, uns
 		return -EPERM;
 
 	/*
-	 * If it's the first ioperm() call in this thread's lifetime, set the
-	 * IO bitmap up. ioperm() is much less timing critical than clone(),
-	 * this is why we delay this operation until now:
+	 * I/O bitmap storage is allocated on demand, but don't bother if
+	 * the task does not have one and permissions are only cleared.
 	 */
-	if (!t->io_bitmap_ptr) {
-		unsigned long *bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
+	if (!t->io_bitmap_ptr && !turn_on)
+		return 0;
 
-		if (!bitmap)
-			return -ENOMEM;
+	/*
+	 * Allocate a temporary bitmap to minimize the amount of work
+	 * in the atomic region.
+	 */
+	bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
+	if (!bitmap)
+		return -ENOMEM;
 
+	if (!t->io_bitmap_ptr)
 		memset(bitmap, 0xff, IO_BITMAP_BYTES);
-		t->io_bitmap_ptr = bitmap;
-		set_thread_flag(TIF_IO_BITMAP);
+	else
+		memcpy(bitmap, t->io_bitmap_ptr, IO_BITMAP_BYTES);
+
+	/* Update the bitmap */
+	if (turn_on) {
+		bitmap_clear(bitmap, from, num);
+	} else {
+		bitmap_set(bitmap, from, num);
+	}
+
+	/* Get the new range */
+	first = find_first_zero_bit(bitmap, IO_BITMAP_BITS);
+
+	for (last = next = first; next < IO_BITMAP_BITS; last = next) {
+		/* Find the next set bit and update last */
+		next = find_next_bit(bitmap, IO_BITMAP_BITS, last);
+		last = next - 1;
+		if (next == IO_BITMAP_BITS)
+			break;
+		/* Find the next zero bit and continue searching */
+		next = find_next_zero_bit(bitmap, IO_BITMAP_BITS, next);
+	}
+
+	/* Calculate the byte boundaries for the updated region */
+	copy_start = from / 8;
+	copy_len = (round_up(from + num, 8) / 8) - copy_start;
+
+	/*
+	 * Update the per thread storage and the TSS bitmap. This must be
+	 * done with preemption disabled to prevent racing against a
+	 * context switch.
+	 */
+	preempt_disable();
+	tss = this_cpu_ptr(&cpu_tss_rw);
 
+	if (!t->io_bitmap_ptr) {
+		unsigned int tss_start = tss->io_zerobits_start;
+		/*
+		 * If the task did not use the I/O bitmap yet then the
+		 * perhaps stale content in the TSS needs to be taken into
+		 * account. If tss start is out of bounds the TSS storage
+		 * does not contain a zero bit and it's sufficient just to
+		 * copy the new range over.
+		 */
+		if (tss_start < IO_BITMAP_BYTES) {
+			unsigned int tss_end =  tss->io_zerobits_end;
+			unsigned int copy_end = copy_start + copy_len;
+
+			copy_start = min(tss_start, copy_start);
+			copy_len = max(tss_end, copy_end) - copy_start;
+		}
+	}
+
+	/* Copy the changed range over to the TSS bitmap */
+	dst = (char *)tss->io_bitmap;
+	src = (char *)bitmap;
+	memcpy(dst + copy_start, src + copy_start, copy_len);
+
+	if (first >= IO_BITMAP_BITS) {
+		/*
+		 * If the resulting bitmap has all permissions dropped, clear
+		 * TIF_IO_BITMAP and set the IO bitmap offset in the TSS to
+		 * invalid. Deallocate both the new and the thread's bitmap.
+		 */
+		clear_thread_flag(TIF_IO_BITMAP);
+		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
+		tofree = bitmap;
+		bitmap = NULL;
+	} else {
 		/*
-		 * Now that we have an IO bitmap, we need our TSS limit to be
-		 * correct.  It's fine if we are preempted after doing this:
-		 * with TIF_IO_BITMAP set, context switches will keep our TSS
-		 * limit correct.
+		 * I/O bitmap contains zero bits. Set TIF_IO_BITMAP, make
+		 * the bitmap offset valid and make sure that the TSS limit
+		 * is correct. It might have been wreckaged by a VMEXiT.
 		 */
-		preempt_disable();
+		set_thread_flag(TIF_IO_BITMAP);
+		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
 		refresh_tss_limit();
-		preempt_enable();
 	}
 
 	/*
-	 * do it in the per-thread copy and in the TSS ...
+	 * Update the range in the thread and the TSS
 	 *
-	 * Disable preemption via get_cpu() - we must not switch away
-	 * because the ->io_bitmap_max value must match the bitmap
-	 * contents:
+	 * Get the byte position of the first zero bit and calculate
+	 * the length of the range in which zero bits exist.
 	 */
-	tss = &per_cpu(cpu_tss_rw, get_cpu());
-
-	if (turn_on)
-		bitmap_clear(t->io_bitmap_ptr, from, num);
-	else
-		bitmap_set(t->io_bitmap_ptr, from, num);
+	start = first / 8;
+	end = first < IO_BITMAP_BITS ? round_up(last, 8) / 8 : 0;
+	t->io_zerobits_start = tss->io_zerobits_start = start;
+	t->io_zerobits_end = tss->io_zerobits_end = end;
 
 	/*
-	 * Search for a (possibly new) maximum. This is simple and stupid,
-	 * to keep it obviously correct:
+	 * Finally exchange the bitmap pointer in the thread.
 	 */
-	max_long = 0;
-	for (i = 0; i < IO_BITMAP_LONGS; i++)
-		if (t->io_bitmap_ptr[i] != ~0UL)
-			max_long = i;
-
-	bytes = (max_long + 1) * sizeof(unsigned long);
-	bytes_updated = max(bytes, t->io_bitmap_max);
-
-	t->io_bitmap_max = bytes;
-
-	/* Update the TSS: */
-	memcpy(tss->io_bitmap, t->io_bitmap_ptr, bytes_updated);
+	bitmap = xchg(&t->io_bitmap_ptr, bitmap);
+	preempt_enable();
 
-	put_cpu();
+	kfree(bitmap);
+	kfree(tofree);
 
 	return 0;
 }
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -110,7 +110,8 @@ void exit_thread(struct task_struct *tsk
 		tss = this_cpu_ptr(&cpu_tss_rw);
 
 		t->io_bitmap_ptr = NULL;
-		t->io_bitmap_max = 0;
+		t->io_zerobits_start = IO_BITMAP_BYTES;
+		t->io_zerobits_end = 0;
 		clear_thread_flag(TIF_IO_BITMAP);
 		/* Invalidate the io bitmap base in the TSS */
 		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
@@ -141,7 +142,8 @@ static inline int copy_io_bitmap(struct
 	tsk->thread.io_bitmap_ptr = kmemdup(current->thread.io_bitmap_ptr,
 					    IO_BITMAP_BYTES, GFP_KERNEL);
 	if (!tsk->thread.io_bitmap_ptr) {
-		tsk->thread.io_bitmap_max = 0;
+		tsk->thread.io_zerobits_start = IO_BITMAP_BYTES;
+		tsk->thread.io_zerobits_end = 0;
 		return -ENOMEM;
 	}
 	set_tsk_thread_flag(tsk, TIF_IO_BITMAP);
@@ -153,7 +155,8 @@ static inline void free_io_bitmap(struct
 	if (tsk->thread.io_bitmap_ptr) {
 		kfree(tsk->thread.io_bitmap_ptr);
 		tsk->thread.io_bitmap_ptr = NULL;
-		tsk->thread.io_bitmap_max = 0;
+		tsk->thread.io_zerobits_start = IO_BITMAP_BYTES;
+		tsk->thread.io_zerobits_end = 0;
 	}
 }
 
@@ -354,27 +357,39 @@ void arch_setup_new_exec(void)
 	}
 }
 
+static void tss_update_io_bitmap(struct tss_struct *tss,
+				 struct thread_struct *thread)
+{
+	unsigned int start, len;
+	char *src, *dst;
+
+	/*
+	 * Copy at least the byte range of the incoming tasks bitmap which
+	 * covers the permitted I/O ports.
+	 *
+	 * If the previous task which used an I/O bitmap had more bits
+	 * permitted, then the copy needs to cover those as well so they
+	 * get turned off.
+	 */
+	start = min(tss->io_zerobits_start, thread->io_zerobits_start);
+	len = max(tss->io_zerobits_end, thread->io_zerobits_end) - start;
+	src = (char *)thread->io_bitmap_ptr;
+	dst = (char *)tss->io_bitmap_map;
+	memcpy(dst + start, dst + start, len);
+
+	/* Store the new start/end and set io_bitmap_base valid */
+	tss->io_zerobits_start = thread->io_zerobits_start;
+	tss->io_zerobits_end = thread->io_zerobits_end;
+	tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID_MAP;
+}
+
 static inline void switch_to_bitmap(struct thread_struct *next,
 				    unsigned long tifp, unsigned long tifn)
 {
 	struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
 
 	if (tifn & _TIF_IO_BITMAP) {
-		/*
-		 * Copy at least the size of the incoming tasks bitmap
-		 * which covers the last permitted I/O port.
-		 *
-		 * If the previous task which used an io bitmap had more
-		 * bits permitted, then the copy needs to cover those as
-		 * well so they get turned off.
-		 */
-		memcpy(tss->io_bitmap, next->io_bitmap_ptr,
-		       max(tss->io_bitmap_prev_max, next->io_bitmap_max));
-
-		/* Store the new max and set io_bitmap_base valid */
-		tss->io_bitmap_prev_max = next->io_bitmap_max;
-		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
-
+		tss_update_io_bitmap(tss, next);
 		/*
 		 * Make sure that the TSS limit is covering the io bitmap.
 		 * It might have been cut down by a VMEXIT to 0x67 which
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -697,7 +697,7 @@ static int ptrace_set_debugreg(struct ta
 static int ioperm_active(struct task_struct *target,
 			 const struct user_regset *regset)
 {
-	return DIV_ROUND_UP(target->thread.io_bitmap_max, regset->size);
+	return DIV_ROUND_UP(target->thread.io_zerobits_end, regset->size);
 }
 
 static int ioperm_get(struct task_struct *target,



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch 6/9] x86/iopl: Fixup misleading comment
  2019-11-06 19:34 [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3) Thomas Gleixner
                   ` (4 preceding siblings ...)
  2019-11-06 19:35 ` [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further Thomas Gleixner
@ 2019-11-06 19:35 ` Thomas Gleixner
  2019-11-06 19:35 ` [patch 7/9] x86/iopl: Restrict iopl() permission scope Thomas Gleixner
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-06 19:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

The comment for the sys_iopl() implementation is outdated and actively
misleading in some parts. Fix it up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/ioport.c |   35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -22,7 +22,7 @@
 #include <asm/desc.h>
 
 /*
- * this changes the io permissions bitmap in the current task.
+ * This changes the io permissions bitmap in the current task.
  */
 long ksys_ioperm(unsigned long from, unsigned long num, int turn_on)
 {
@@ -163,14 +163,24 @@ SYSCALL_DEFINE3(ioperm, unsigned long, f
 }
 
 /*
- * sys_iopl has to be used when you want to access the IO ports
- * beyond the 0x3ff range: to get the full 65536 ports bitmapped
- * you'd need 8kB of bitmaps/process, which is a bit excessive.
+ * The sys_iopl functionality depends on the level argument, which if
+ * granted for the task is used by the CPU to check I/O instruction and
+ * CLI/STI against the current priviledge level (CPL). If CPL is less than
+ * or equal the tasks IOPL level the instructions take effect. If not a #GP
+ * is raised. The default IOPL is 0, i.e. no permissions.
  *
- * Here we just change the flags value on the stack: we allow
- * only the super-user to do it. This depends on the stack-layout
- * on system-call entry - see also fork() and the signal handling
- * code.
+ * Setting IOPL to level 0-2 is disabling the userspace access. Only level
+ * 3 enables it. If set it allows the user space thread:
+ *
+ * - Unrestricted access to all 65535 I/O ports
+ * - The usage of CLI/STI instructions
+ *
+ * The advantage over ioperm is that the context switch does not require to
+ * update the I/O bitmap which is especially true when a large number of
+ * ports is accessed. But the allowance of CLI/STI in userspace is
+ * considered a major problem.
+ *
+ * IOPL is strictly per thread and inherited on fork.
  */
 SYSCALL_DEFINE1(iopl, unsigned int, level)
 {
@@ -191,9 +201,18 @@ SYSCALL_DEFINE1(iopl, unsigned int, leve
 		    security_locked_down(LOCKDOWN_IOPORT))
 			return -EPERM;
 	}
+	/*
+	 * Change the flags value on the return stack, which has been set
+	 * up on system-call entry. See also the fork and signal handling
+	 * code how this is handled.
+	 */
 	regs->flags = (regs->flags & ~X86_EFLAGS_IOPL) |
 		(level << X86_EFLAGS_IOPL_BIT);
+	/* Store the new level in the thread struct */
 	t->iopl = level << X86_EFLAGS_IOPL_BIT;
+	/*
+	 * X86_32 switches immediately and XEN handles it via emulation.
+	 */
 	set_iopl_mask(t->iopl);
 
 	return 0;



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch 7/9] x86/iopl: Restrict iopl() permission scope
  2019-11-06 19:34 [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3) Thomas Gleixner
                   ` (5 preceding siblings ...)
  2019-11-06 19:35 ` [patch 6/9] x86/iopl: Fixup misleading comment Thomas Gleixner
@ 2019-11-06 19:35 ` Thomas Gleixner
  2019-11-07  9:09   ` Peter Zijlstra
  2019-11-10 17:26   ` Andy Lutomirski
  2019-11-06 19:35 ` [patch 8/9] x86/iopl: Remove legacy IOPL option Thomas Gleixner
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-06 19:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

The access to the full I/O port range can be also provided by the TSS I/O
bitmap, but that would require to copy 8k of data on scheduling in the
task. As shown with the sched out optimization TSS.io_bitmap_base can be
used to switch the incoming task to a preallocated I/O bitmap which has all
bits zero, i.e. allows access to all I/O ports.

Implementing this allows to provide an iopl() emulation mode which restricts
the IOPL level 3 permissions to I/O port access but removes the STI/CLI
permission which is coming with the hardware IOPL mechansim.

Provide a config option to switch IOPL to emulation mode, make it the
default and while at it also provide an option to disable IOPL completely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/Kconfig                 |   32 ++++++++++++++++
 arch/x86/include/asm/processor.h |   24 +++++++++---
 arch/x86/kernel/cpu/common.c     |    4 +-
 arch/x86/kernel/ioport.c         |   75 ++++++++++++++++++++++++++++++---------
 arch/x86/kernel/process.c        |   15 +++++--
 5 files changed, 123 insertions(+), 27 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1254,6 +1254,38 @@ config X86_VSYSCALL_EMULATION
 	 Disabling this option saves about 7K of kernel size and
 	 possibly 4K of additional runtime pagetable memory.
 
+choice
+	prompt "IOPL"
+	default X86_IOPL_EMULATION
+
+config X86_IOPL_EMULATION
+	bool "IOPL Emulation"
+	---help---
+	  Legacy IOPL support is an overbroad mechanism which allows user
+	  space aside of accessing all 65536 I/O ports also to disable
+	  interrupts. To gain this access the caller needs CAP_SYS_RAWIO
+	  capabilities and permission from eventually active security
+	  modules.
+
+	  The emulation restricts the functionality of the syscall to
+	  only allowing the full range I/O port access, but prevents the
+	  ability to disable interrupts from user space.
+
+config X86_IOPL_LEGACY
+	bool "IOPL Legacy"
+	---help---
+	Allow the full IOPL permissions, i.e. user space access to all
+	65536 I/O ports and also the ability to disable interrupts, which
+	is overbroad and can result in system lockups.
+
+config X86_IOPL_NONE
+	bool "IOPL None"
+	---help---
+	Disable the IOPL permission syscall. That's the safest option as
+	no sane application should depend on this functionality.
+
+endchoice
+
 config TOSHIBA
 	tristate "Toshiba Laptop support"
 	depends on X86_32
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -331,8 +331,12 @@ struct x86_hw_tss {
 #define IO_BITMAP_BYTES			(IO_BITMAP_BITS/8)
 #define IO_BITMAP_LONGS			(IO_BITMAP_BYTES/sizeof(long))
 
-#define IO_BITMAP_OFFSET_VALID				\
-	(offsetof(struct tss_struct, io_bitmap) -	\
+#define IO_BITMAP_OFFSET_VALID_MAP			\
+	(offsetof(struct tss_struct, io_bitmap_map) -	\
+	 offsetof(struct tss_struct, x86_tss))
+
+#define IO_BITMAP_OFFSET_VALID_ALL			\
+	(offsetof(struct tss_struct, io_bitmap_all) -	\
 	 offsetof(struct tss_struct, x86_tss))
 
 /*
@@ -343,7 +347,7 @@ struct x86_hw_tss {
  * last valid byte
  */
 #define __KERNEL_TSS_LIMIT	\
-	(IO_BITMAP_OFFSET_VALID + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)
+	(IO_BITMAP_OFFSET_VALID_ALL + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)
 
 /* Base offset outside of TSS_LIMIT so unpriviledged IO causes #GP */
 #define IO_BITMAP_OFFSET_INVALID	(__KERNEL_TSS_LIMIT + 1)
@@ -379,7 +383,8 @@ struct tss_struct {
 	 * byte beyond the end of the I/O permission bitmap. The extra byte
 	 * must have all bits set and must be within the TSS limit.
 	 */
-	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
+	unsigned long		io_bitmap_map[IO_BITMAP_LONGS + 1];
+	unsigned long		io_bitmap_all[IO_BITMAP_LONGS + 1];
 } __aligned(PAGE_SIZE);
 
 DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw);
@@ -495,7 +500,6 @@ struct thread_struct {
 #endif
 	/* IO permissions: */
 	unsigned long		*io_bitmap_ptr;
-	unsigned long		iopl;
 	/*
 	 * The byte range in the I/O permission bitmap which contains zero
 	 * bits.
@@ -503,6 +507,13 @@ struct thread_struct {
 	unsigned int		io_zerobits_start;
 	unsigned int		io_zerobits_end;
 
+	/*
+	 * IOPL. Priviledge level dependent I/O permission which includes
+	 * user space CLI/STI when granted.
+	 */
+	unsigned long		iopl;
+	unsigned long		iopl_emul;
+
 	mm_segment_t		addr_limit;
 
 	unsigned int		sig_on_uaccess_err:1;
@@ -524,6 +535,9 @@ static inline void arch_thread_struct_wh
 	*size = fpu_kernel_xstate_size;
 }
 
+extern void tss_update_io_bitmap(struct tss_struct *tss,
+				 struct thread_struct *thread);
+
 /*
  * Thread-synchronous status.
  *
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1866,7 +1866,9 @@ void cpu_init(void)
 	tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
 	tss->io_zerobits_start = IO_BITMAP_BYTES;
 	tss->io_zerobits_end = 0;
-	memset(tss->io_bitmap, 0xff, sizeof(tss->io_bitmap));
+	memset(tss->io_bitmap_map, 0xff, sizeof(tss->io_bitmap_map));
+	/* Invalidate the extra tail entry of the allow all I/O bitmap */
+	tss->io_bitmap_all[IO_BITMAP_LONGS] = ~0UL;
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 
 	load_TR_desc();
--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -109,7 +109,7 @@ long ksys_ioperm(unsigned long from, uns
 	}
 
 	/* Copy the changed range over to the TSS bitmap */
-	dst = (char *)tss->io_bitmap;
+	dst = (char *)tss->io_bitmap_map;
 	src = (char *)bitmap;
 	memcpy(dst + copy_start, src + copy_start, copy_len);
 
@@ -130,7 +130,7 @@ long ksys_ioperm(unsigned long from, uns
 		 * is correct. It might have been wreckaged by a VMEXiT.
 		 */
 		set_thread_flag(TIF_IO_BITMAP);
-		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
+		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID_MAP;
 		refresh_tss_limit();
 	}
 
@@ -184,36 +184,77 @@ SYSCALL_DEFINE3(ioperm, unsigned long, f
  */
 SYSCALL_DEFINE1(iopl, unsigned int, level)
 {
-	struct pt_regs *regs = current_pt_regs();
 	struct thread_struct *t = &current->thread;
+	struct pt_regs *regs = current_pt_regs();
+	unsigned int old;
 
 	/*
 	 * Careful: the IOPL bits in regs->flags are undefined under Xen PV
 	 * and changing them has no effect.
 	 */
-	unsigned int old = t->iopl >> X86_EFLAGS_IOPL_BIT;
+	if (IS_ENABLED(CONFIG_X86_IOPL_NONE))
+		return -ENOSYS;
 
 	if (level > 3)
 		return -EINVAL;
+
+	if (IS_ENABLED(CONFIG_X86_IOPL_EMULATION))
+		old = t->iopl_emul;
+	else
+		old = t->iopl >> X86_EFLAGS_IOPL_BIT;
+
+	/* No point in going further if nothing changes */
+	if (level == old)
+		return 0;
+
 	/* Trying to gain more privileges? */
 	if (level > old) {
 		if (!capable(CAP_SYS_RAWIO) ||
 		    security_locked_down(LOCKDOWN_IOPORT))
 			return -EPERM;
 	}
-	/*
-	 * Change the flags value on the return stack, which has been set
-	 * up on system-call entry. See also the fork and signal handling
-	 * code how this is handled.
-	 */
-	regs->flags = (regs->flags & ~X86_EFLAGS_IOPL) |
-		(level << X86_EFLAGS_IOPL_BIT);
-	/* Store the new level in the thread struct */
-	t->iopl = level << X86_EFLAGS_IOPL_BIT;
-	/*
-	 * X86_32 switches immediately and XEN handles it via emulation.
-	 */
-	set_iopl_mask(t->iopl);
+
+	if (IS_ENABLED(CONFIG_X86_IOPL_EMULATION)) {
+		struct tss_struct *tss;
+		unsigned int tss_base;
+
+		/* Prevent racing against a task switch */
+		preempt_disable();
+		tss = this_cpu_ptr(&cpu_tss_rw);
+		if (level == 3) {
+			/* Grant access to all I/O ports */
+			set_thread_flag(TIF_IO_BITMAP);
+			tss_base = IO_BITMAP_OFFSET_VALID_ALL;
+		} else if (t->io_bitmap_ptr) {
+			/* Thread has a I/O bitmap */
+			tss_update_io_bitmap(tss, t);
+			set_thread_flag(TIF_IO_BITMAP);
+			tss_base = IO_BITMAP_OFFSET_VALID_MAP;
+		} else {
+			/* Take it out of the context switch work burden */
+			clear_thread_flag(TIF_IO_BITMAP);
+			tss_base = IO_BITMAP_OFFSET_INVALID;
+		}
+		tss->x86_tss.io_bitmap_base = tss_base;
+		t->iopl_emul = level;
+		preempt_enable();
+
+	} else {
+		/*
+		 * Change the flags value on the return stack, which has
+		 * been set up on system-call entry. See also the fork and
+		 * signal handling code how this is handled.
+		 */
+		regs->flags = (regs->flags & ~X86_EFLAGS_IOPL) |
+			(level << X86_EFLAGS_IOPL_BIT);
+		/* Store the new level in the thread struct */
+		t->iopl = level << X86_EFLAGS_IOPL_BIT;
+		/*
+		 * X86_32 switches immediately and XEN handles it via
+		 * emulation.
+		 */
+		set_iopl_mask(t->iopl);
+	}
 
 	return 0;
 }
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -357,8 +357,7 @@ void arch_setup_new_exec(void)
 	}
 }
 
-static void tss_update_io_bitmap(struct tss_struct *tss,
-				 struct thread_struct *thread)
+void tss_update_io_bitmap(struct tss_struct *tss, struct thread_struct *thread)
 {
 	unsigned int start, len;
 	char *src, *dst;
@@ -387,9 +386,17 @@ static inline void switch_to_bitmap(stru
 				    unsigned long tifp, unsigned long tifn)
 {
 	struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
+	u16 *base = &tss->x86_tss.io_bitmap_base;
 
 	if (tifn & _TIF_IO_BITMAP) {
-		tss_update_io_bitmap(tss, next);
+		/*
+		 * IF IOPL emulation is enabled and the emulated I/O
+		 * priviledge level is 3, switch to the 'grant all' bitmap.
+		 */
+		if (IS_ENABLED(CONFIG_IOPL_EMULATION) && next->iopl_emul == 3)
+			*base = IO_BITMAP_OFFSET_VALID_ALL;
+		else
+			tss_update_io_bitmap(tss, next);
 		/*
 		 * Make sure that the TSS limit is covering the io bitmap.
 		 * It might have been cut down by a VMEXIT to 0x67 which
@@ -405,7 +412,7 @@ static inline void switch_to_bitmap(stru
 		 * by moving it outside the TSS limit so any subsequent I/O
 		 * access from user space will trigger a #GP.
 		 */
-		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
+		*base = IO_BITMAP_OFFSET_INVALID;
 	}
 }
 



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch 8/9] x86/iopl: Remove legacy IOPL option
  2019-11-06 19:34 [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3) Thomas Gleixner
                   ` (6 preceding siblings ...)
  2019-11-06 19:35 ` [patch 7/9] x86/iopl: Restrict iopl() permission scope Thomas Gleixner
@ 2019-11-06 19:35 ` Thomas Gleixner
  2019-11-07  6:11   ` Jürgen Groß
  2019-11-07  9:13   ` Peter Zijlstra
  2019-11-06 19:35 ` [patch 9/9] selftests/x86/iopl: Verify that CLI/STI result in #GP Thomas Gleixner
  2019-11-07  7:28 ` [patch] x86/iopl: Remove unused local variable, update comments in ksys_ioperm() Ingo Molnar
  9 siblings, 2 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-06 19:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

The IOPL emulation via the I/O bitmap is sufficient. Remove the legacy
cruft dealing with the (e)flags based IOPL mechanism.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/Kconfig                      |   10 ----
 arch/x86/include/asm/paravirt.h       |    4 -
 arch/x86/include/asm/paravirt_types.h |    2 
 arch/x86/include/asm/processor.h      |   26 +---------
 arch/x86/include/asm/xen/hypervisor.h |    2 
 arch/x86/kernel/ioport.c              |   84 +++++++++++-----------------------
 arch/x86/kernel/paravirt.c            |    2 
 arch/x86/kernel/process_32.c          |    9 ---
 arch/x86/kernel/process_64.c          |   11 ----
 arch/x86/xen/enlighten_pv.c           |   10 ----
 10 files changed, 34 insertions(+), 126 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1269,14 +1269,8 @@ config X86_IOPL_EMULATION
 
 	  The emulation restricts the functionality of the syscall to
 	  only allowing the full range I/O port access, but prevents the
-	  ability to disable interrupts from user space.
-
-config X86_IOPL_LEGACY
-	bool "IOPL Legacy"
-	---help---
-	Allow the full IOPL permissions, i.e. user space access to all
-	65536 I/O ports and also the ability to disable interrupts, which
-	is overbroad and can result in system lockups.
+	  ability to disable interrupts from user space which would be
+	  granted if the hardware IOPL mechanism would be used.
 
 config X86_IOPL_NONE
 	bool "IOPL None"
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -294,10 +294,6 @@ static inline void write_idt_entry(gate_
 {
 	PVOP_VCALL3(cpu.write_idt_entry, dt, entry, g);
 }
-static inline void set_iopl_mask(unsigned mask)
-{
-	PVOP_VCALL1(cpu.set_iopl_mask, mask);
-}
 
 static inline void paravirt_activate_mm(struct mm_struct *prev,
 					struct mm_struct *next)
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -140,8 +140,6 @@ struct pv_cpu_ops {
 
 	void (*load_sp0)(unsigned long sp0);
 
-	void (*set_iopl_mask)(unsigned mask);
-
 	void (*wbinvd)(void);
 
 	/* cpuid emulation, mostly so that caps bits can be disabled */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -508,10 +508,10 @@ struct thread_struct {
 	unsigned int		io_zerobits_end;
 
 	/*
-	 * IOPL. Priviledge level dependent I/O permission which includes
-	 * user space CLI/STI when granted.
+	 * IOPL. Priviledge level dependent I/O permission which is
+	 * emulated via the I/O bitmap to prevent user space from disabling
+	 * interrupts.
 	 */
-	unsigned long		iopl;
 	unsigned long		iopl_emul;
 
 	mm_segment_t		addr_limit;
@@ -547,25 +547,6 @@ extern void tss_update_io_bitmap(struct
  */
 #define TS_COMPAT		0x0002	/* 32bit syscall active (64BIT)*/
 
-/*
- * Set IOPL bits in EFLAGS from given mask
- */
-static inline void native_set_iopl_mask(unsigned mask)
-{
-#ifdef CONFIG_X86_32
-	unsigned int reg;
-
-	asm volatile ("pushfl;"
-		      "popl %0;"
-		      "andl %1, %0;"
-		      "orl %2, %0;"
-		      "pushl %0;"
-		      "popfl"
-		      : "=&r" (reg)
-		      : "i" (~X86_EFLAGS_IOPL), "r" (mask));
-#endif
-}
-
 static inline void
 native_load_sp0(unsigned long sp0)
 {
@@ -605,7 +586,6 @@ static inline void load_sp0(unsigned lon
 	native_load_sp0(sp0);
 }
 
-#define set_iopl_mask native_set_iopl_mask
 #endif /* CONFIG_PARAVIRT_XXL */
 
 /* Free all resources held by a thread. */
--- a/arch/x86/include/asm/xen/hypervisor.h
+++ b/arch/x86/include/asm/xen/hypervisor.h
@@ -62,6 +62,4 @@ void xen_arch_register_cpu(int num);
 void xen_arch_unregister_cpu(int num);
 #endif
 
-extern void xen_set_iopl_mask(unsigned mask);
-
 #endif /* _ASM_X86_XEN_HYPERVISOR_H */
--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -164,21 +164,17 @@ SYSCALL_DEFINE3(ioperm, unsigned long, f
 
 /*
  * The sys_iopl functionality depends on the level argument, which if
- * granted for the task is used by the CPU to check I/O instruction and
- * CLI/STI against the current priviledge level (CPL). If CPL is less than
- * or equal the tasks IOPL level the instructions take effect. If not a #GP
- * is raised. The default IOPL is 0, i.e. no permissions.
+ * granted for the task is used to enable access to all 65536 I/O ports.
  *
- * Setting IOPL to level 0-2 is disabling the userspace access. Only level
- * 3 enables it. If set it allows the user space thread:
+ * This does not use the IOPL mechanism provided by the CPU as that would
+ * also allow the user space task to use the CLI/STI instructions.
  *
- * - Unrestricted access to all 65535 I/O ports
- * - The usage of CLI/STI instructions
+ * Disabling interrupts in a user space task is dangerous as it might lock
+ * up the machine and the semantics vs. syscalls and exceptions is
+ * undefined.
  *
- * The advantage over ioperm is that the context switch does not require to
- * update the I/O bitmap which is especially true when a large number of
- * ports is accessed. But the allowance of CLI/STI in userspace is
- * considered a major problem.
+ * Setting IOPL to level 0-2 is disabling I/O permissions. Level 3
+ * 3 enables them.
  *
  * IOPL is strictly per thread and inherited on fork.
  */
@@ -186,6 +182,8 @@ SYSCALL_DEFINE1(iopl, unsigned int, leve
 {
 	struct thread_struct *t = &current->thread;
 	struct pt_regs *regs = current_pt_regs();
+	struct tss_struct *tss;
+	unsigned int tss_base;
 	unsigned int old;
 
 	/*
@@ -198,10 +196,7 @@ SYSCALL_DEFINE1(iopl, unsigned int, leve
 	if (level > 3)
 		return -EINVAL;
 
-	if (IS_ENABLED(CONFIG_X86_IOPL_EMULATION))
-		old = t->iopl_emul;
-	else
-		old = t->iopl >> X86_EFLAGS_IOPL_BIT;
+	old = t->iopl_emul;
 
 	/* No point in going further if nothing changes */
 	if (level == old)
@@ -214,47 +209,26 @@ SYSCALL_DEFINE1(iopl, unsigned int, leve
 			return -EPERM;
 	}
 
-	if (IS_ENABLED(CONFIG_X86_IOPL_EMULATION)) {
-		struct tss_struct *tss;
-		unsigned int tss_base;
-
-		/* Prevent racing against a task switch */
-		preempt_disable();
-		tss = this_cpu_ptr(&cpu_tss_rw);
-		if (level == 3) {
-			/* Grant access to all I/O ports */
-			set_thread_flag(TIF_IO_BITMAP);
-			tss_base = IO_BITMAP_OFFSET_VALID_ALL;
-		} else if (t->io_bitmap_ptr) {
-			/* Thread has a I/O bitmap */
-			tss_update_io_bitmap(tss, t);
-			set_thread_flag(TIF_IO_BITMAP);
-			tss_base = IO_BITMAP_OFFSET_VALID_MAP;
-		} else {
-			/* Take it out of the context switch work burden */
-			clear_thread_flag(TIF_IO_BITMAP);
-			tss_base = IO_BITMAP_OFFSET_INVALID;
-		}
-		tss->x86_tss.io_bitmap_base = tss_base;
-		t->iopl_emul = level;
-		preempt_enable();
-
+	/* Prevent racing against a task switch */
+	preempt_disable();
+	tss = this_cpu_ptr(&cpu_tss_rw);
+	if (level == 3) {
+		/* Grant access to all I/O ports */
+		set_thread_flag(TIF_IO_BITMAP);
+		tss_base = IO_BITMAP_OFFSET_VALID_ALL;
+	} else if (t->io_bitmap_ptr) {
+		/* Thread has a I/O bitmap */
+		tss_update_io_bitmap(tss, t);
+		set_thread_flag(TIF_IO_BITMAP);
+		tss_base = IO_BITMAP_OFFSET_VALID_MAP;
 	} else {
-		/*
-		 * Change the flags value on the return stack, which has
-		 * been set up on system-call entry. See also the fork and
-		 * signal handling code how this is handled.
-		 */
-		regs->flags = (regs->flags & ~X86_EFLAGS_IOPL) |
-			(level << X86_EFLAGS_IOPL_BIT);
-		/* Store the new level in the thread struct */
-		t->iopl = level << X86_EFLAGS_IOPL_BIT;
-		/*
-		 * X86_32 switches immediately and XEN handles it via
-		 * emulation.
-		 */
-		set_iopl_mask(t->iopl);
+		/* Take it out of the context switch work burden */
+		clear_thread_flag(TIF_IO_BITMAP);
+		tss_base = IO_BITMAP_OFFSET_INVALID;
 	}
+	tss->x86_tss.io_bitmap_base = tss_base;
+	t->iopl_emul = level;
+	preempt_enable();
 
 	return 0;
 }
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -341,8 +341,6 @@ struct paravirt_patch_template pv_ops =
 	.cpu.iret		= native_iret,
 	.cpu.swapgs		= native_swapgs,
 
-	.cpu.set_iopl_mask	= native_set_iopl_mask,
-
 	.cpu.start_context_switch	= paravirt_nop,
 	.cpu.end_context_switch		= paravirt_nop,
 
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -187,15 +187,6 @@ EXPORT_SYMBOL_GPL(start_thread);
 	 */
 	load_TLS(next, cpu);
 
-	/*
-	 * Restore IOPL if needed.  In normal use, the flags restore
-	 * in the switch assembly will handle this.  But if the kernel
-	 * is running virtualized at a non-zero CPL, the popf will
-	 * not restore flags, so it must be done in a separate step.
-	 */
-	if (get_kernel_rpl() && unlikely(prev->iopl != next->iopl))
-		set_iopl_mask(next->iopl);
-
 	switch_to_extra(prev_p, next_p);
 
 	/*
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -497,17 +497,6 @@ void compat_start_thread(struct pt_regs
 
 	switch_to_extra(prev_p, next_p);
 
-#ifdef CONFIG_XEN_PV
-	/*
-	 * On Xen PV, IOPL bits in pt_regs->flags have no effect, and
-	 * current_pt_regs()->flags may not match the current task's
-	 * intended IOPL.  We need to switch it manually.
-	 */
-	if (unlikely(static_cpu_has(X86_FEATURE_XENPV) &&
-		     prev->iopl != next->iopl))
-		xen_set_iopl_mask(next->iopl);
-#endif
-
 	if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) {
 		/*
 		 * AMD CPUs have a misfeature: SYSRET sets the SS selector but
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -829,15 +829,6 @@ static void xen_load_sp0(unsigned long s
 	this_cpu_write(cpu_tss_rw.x86_tss.sp0, sp0);
 }
 
-void xen_set_iopl_mask(unsigned mask)
-{
-	struct physdev_set_iopl set_iopl;
-
-	/* Force the change at ring 0. */
-	set_iopl.iopl = (mask == 0) ? 1 : (mask >> 12) & 3;
-	HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl);
-}
-
 static void xen_io_delay(void)
 {
 }
@@ -1047,7 +1038,6 @@ static const struct pv_cpu_ops xen_cpu_o
 	.write_idt_entry = xen_write_idt_entry,
 	.load_sp0 = xen_load_sp0,
 
-	.set_iopl_mask = xen_set_iopl_mask,
 	.io_delay = xen_io_delay,
 
 	/* Xen takes care of %gs when switching to usermode for us */



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch 9/9] selftests/x86/iopl: Verify that CLI/STI result in #GP
  2019-11-06 19:34 [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3) Thomas Gleixner
                   ` (7 preceding siblings ...)
  2019-11-06 19:35 ` [patch 8/9] x86/iopl: Remove legacy IOPL option Thomas Gleixner
@ 2019-11-06 19:35 ` Thomas Gleixner
  2019-11-07  7:28 ` [patch] x86/iopl: Remove unused local variable, update comments in ksys_ioperm() Ingo Molnar
  9 siblings, 0 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-06 19:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

Add tests that the now emulated iopl() functionality does not longer
allow user space to disable interrupts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/testing/selftests/x86/iopl.c |  104 +++++++++++++++++++++++++++++++++----
 1 file changed, 93 insertions(+), 11 deletions(-)

--- a/tools/testing/selftests/x86/iopl.c
+++ b/tools/testing/selftests/x86/iopl.c
@@ -35,6 +35,16 @@ static void sethandler(int sig, void (*h
 
 }
 
+static void clearhandler(int sig)
+{
+	struct sigaction sa;
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_handler = SIG_DFL;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+}
+
 static jmp_buf jmpbuf;
 
 static void sigsegv(int sig, siginfo_t *si, void *ctx_void)
@@ -42,22 +52,100 @@ static void sigsegv(int sig, siginfo_t *
 	siglongjmp(jmpbuf, 1);
 }
 
+static bool try_outb(unsigned short port)
+{
+	sethandler(SIGSEGV, sigsegv, SA_RESETHAND);
+	if (sigsetjmp(jmpbuf, 1) != 0) {
+		return false;
+	} else {
+		asm volatile ("outb %%al, %w[port]"
+			      : : [port] "Nd" (port), "a" (0));
+		return true;
+	}
+	clearhandler(SIGSEGV);
+}
+
+static void expect_gp_outb(unsigned short port)
+{
+	if (try_outb(port)) {
+		printf("[FAIL]\toutb to 0x%02hx worked\n", port);
+		nerrs++;
+	}
+
+	printf("[OK]\toutb to 0x%02hx failed\n", port);
+}
+
+static bool try_cli(void)
+{
+	sethandler(SIGSEGV, sigsegv, SA_RESETHAND);
+	if (sigsetjmp(jmpbuf, 1) != 0) {
+		return false;
+	} else {
+		asm volatile ("cli");
+		return true;
+	}
+	clearhandler(SIGSEGV);
+}
+
+static bool try_sti(void)
+{
+	sethandler(SIGSEGV, sigsegv, SA_RESETHAND);
+	if (sigsetjmp(jmpbuf, 1) != 0) {
+		return false;
+	} else {
+		asm volatile ("sti");
+		return true;
+	}
+	clearhandler(SIGSEGV);
+}
+
+static void expect_gp_sti(void)
+{
+	if (try_sti()) {
+		printf("[FAIL]\tSTI worked\n");
+		nerrs++;
+	} else {
+		printf("[OK]\tSTI faulted\n");
+	}
+}
+
+static void expect_gp_cli(void)
+{
+	if (try_cli()) {
+		printf("[FAIL]\tCLI worked\n");
+		nerrs++;
+	} else {
+		printf("[OK]\tCLI faulted\n");
+	}
+}
+
 int main(void)
 {
 	cpu_set_t cpuset;
+
 	CPU_ZERO(&cpuset);
 	CPU_SET(0, &cpuset);
 	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
 		err(1, "sched_setaffinity to CPU 0");
 
 	/* Probe for iopl support.  Note that iopl(0) works even as nonroot. */
-	if (iopl(3) != 0) {
+	switch(iopl(3)) {
+	case 0:
+		break;
+	case -ENOSYS:
+		printf("[OK]\tiopl() nor supported\n");
+		return 0;
+	default:
 		printf("[OK]\tiopl(3) failed (%d) -- try running as root\n",
 		       errno);
 		return 0;
 	}
 
-	/* Restore our original state prior to starting the test. */
+	/* Make sure that CLI/STI are blocked even with IOPL level 3 */
+	expect_gp_cli();
+	expect_gp_sti();
+
+	/* Restore our original state prior to starting the fork test. */
 	if (iopl(0) != 0)
 		err(1, "iopl(0)");
 
@@ -90,14 +178,9 @@ int main(void)
 
 	printf("[RUN]\tparent: write to 0x80 (should fail)\n");
 
-	sethandler(SIGSEGV, sigsegv, 0);
-	if (sigsetjmp(jmpbuf, 1) != 0) {
-		printf("[OK]\twrite was denied\n");
-	} else {
-		asm volatile ("outb %%al, $0x80" : : "a" (0));
-		printf("[FAIL]\twrite was allowed\n");
-		nerrs++;
-	}
+	expect_gp_outb(0x80);
+	expect_gp_cli();
+	expect_gp_sti();
 
 	/* Test the capability checks. */
 	printf("\tiopl(3)\n");
@@ -133,4 +216,3 @@ int main(void)
 done:
 	return nerrs ? 1 : 0;
 }
-



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-06 19:35 ` [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further Thomas Gleixner
@ 2019-11-07  1:11   ` Linus Torvalds
  2019-11-07  7:44     ` Thomas Gleixner
  2019-11-07  8:25     ` Ingo Molnar
  2019-11-07  7:37   ` Ingo Molnar
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 62+ messages in thread
From: Linus Torvalds @ 2019-11-07  1:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Stephen Hemminger, Willy Tarreau,
	Juergen Gross, Sean Christopherson, H. Peter Anvin

On Wed, Nov 6, 2019 at 12:57 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Calculate both the position of the first zero bit and the last zero bit to
> limit the range which needs to be copied. This does not solve the problem
> when the previous tasked had only byte 0 cleared and the next one has only
> byte 65535 cleared, but trying to solve that would be too complex and
> heavyweight for the context switch path. As the ioperm() usage is very rare
> the case which is optimized is the single task/process which uses ioperm().

Hmm.

I may read this patch wrong, but from what I can tell, if we really
have just one process with an io bitmap, we're doing unnecessary
copies.

If we really have just one process that has an iobitmap, I think we
could just keep the bitmap of that process entirely unchanged. Then,
when we switch away from it, we set the io_bitmap_base to an invalid
base outside the TSS segment, and when we switch back, we set it back
to the valid one. No actual bitmap copies at all.

So I think that rather than the "begin/end offset" games, we should
perhaps have a "what was the last process that used the IO bitmap for
this TSS" pointer (and, I think, some sequence counter, so that when
the process updates its bitmap, it invalidates that case)?

 Of course, you can do *nboth*, but if we really think that the common
case is "one special process", then I think the begin/end offset is
useless, but a "last bitmap process" would be very useful.

Am I missing something?

               Linus

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 8/9] x86/iopl: Remove legacy IOPL option
  2019-11-06 19:35 ` [patch 8/9] x86/iopl: Remove legacy IOPL option Thomas Gleixner
@ 2019-11-07  6:11   ` Jürgen Groß
  2019-11-07  6:26     ` hpa
  2019-11-07 16:44     ` Stephen Hemminger
  2019-11-07  9:13   ` Peter Zijlstra
  1 sibling, 2 replies; 62+ messages in thread
From: Jürgen Groß @ 2019-11-07  6:11 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Sean Christopherson,
	Linus Torvalds, H. Peter Anvin

On 06.11.19 20:35, Thomas Gleixner wrote:
> The IOPL emulation via the I/O bitmap is sufficient. Remove the legacy
> cruft dealing with the (e)flags based IOPL mechanism.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Juergen Gross <jgross@suse.com> (Paravirt and Xen parts)


Juergen

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 8/9] x86/iopl: Remove legacy IOPL option
  2019-11-07  6:11   ` Jürgen Groß
@ 2019-11-07  6:26     ` hpa
  2019-11-07 16:44     ` Stephen Hemminger
  1 sibling, 0 replies; 62+ messages in thread
From: hpa @ 2019-11-07  6:26 UTC (permalink / raw)
  To: Jürgen Groß, Thomas Gleixner, LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Sean Christopherson,
	Linus Torvalds

On November 6, 2019 10:11:49 PM PST, "Jürgen Groß" <jgross@suse.com> wrote:
>On 06.11.19 20:35, Thomas Gleixner wrote:
>> The IOPL emulation via the I/O bitmap is sufficient. Remove the
>legacy
>> cruft dealing with the (e)flags based IOPL mechanism.
>> 
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>
>Reviewed-by: Juergen Gross <jgross@suse.com> (Paravirt and Xen parts)
>
>
>Juergen

Good grief, good riddance.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch] x86/iopl: Remove unused local variable, update comments in ksys_ioperm()
  2019-11-06 19:34 [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3) Thomas Gleixner
                   ` (8 preceding siblings ...)
  2019-11-06 19:35 ` [patch 9/9] selftests/x86/iopl: Verify that CLI/STI result in #GP Thomas Gleixner
@ 2019-11-07  7:28 ` Ingo Molnar
  9 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2019-11-07  7:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin


* Thomas Gleixner <tglx@linutronix.de> wrote:

> The series is also available from git:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/iopl

Very nice series - I fully agree with this simplification of ioperm 
legacies.


On x86-64 defconfig new warning in ioport.c:

    arch/x86/kernel/ioport.c:184:18: warning: unused variable ‘regs’ [-Wunused-variable]

This local variable can simply be removed, now that we don't rely on 
regs->flags anymore. See the patch below.

I also removed the now stale comment about the Xen PV 
quirk/incompatibility.

Thanks,

	Ingo

---
 arch/x86/kernel/ioport.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/x86/kernel/ioport.c b/arch/x86/kernel/ioport.c
index aad296a23170..78127087b1ed 100644
--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -181,15 +181,10 @@ SYSCALL_DEFINE3(ioperm, unsigned long, from, unsigned long, num, int, turn_on)
 SYSCALL_DEFINE1(iopl, unsigned int, level)
 {
 	struct thread_struct *t = &current->thread;
-	struct pt_regs *regs = current_pt_regs();
 	struct tss_struct *tss;
 	unsigned int tss_base;
 	unsigned int old;
 
-	/*
-	 * Careful: the IOPL bits in regs->flags are undefined under Xen PV
-	 * and changing them has no effect.
-	 */
 	if (IS_ENABLED(CONFIG_X86_IOPL_NONE))
 		return -ENOSYS;
 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 1/9] x86/ptrace: Prevent truncation of bitmap size
  2019-11-06 19:35 ` [patch 1/9] x86/ptrace: Prevent truncation of bitmap size Thomas Gleixner
@ 2019-11-07  7:31   ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2019-11-07  7:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin


* Thomas Gleixner <tglx@linutronix.de> wrote:

> The active() callback of the IO bitmap regset divides the IO bitmap size by
> the word size (32/64 bit). As the I/O bitmap size is in bytes the active
> check fails for bitmap sizes of 1-3 bytes on 32bit and 1-7 bytes on 64bit.
> 
> Use DIV_ROUND_UP() instead.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/kernel/ptrace.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -697,7 +697,7 @@ static int ptrace_set_debugreg(struct ta
>  static int ioperm_active(struct task_struct *target,
>  			 const struct user_regset *regset)
>  {
> -	return target->thread.io_bitmap_max / regset->size;
> +	return DIV_ROUND_UP(target->thread.io_bitmap_max, regset->size);
>  }
>  
>  static int ioperm_get(struct task_struct *target,

Reviewed-by: Ingo Molnar <mingo@kernel.org>


But the ioperm core dumping logic in ptrace.c looks a bit weird. For 
example why do we alias REGSET_IOPERM64 to REGSET_XFP:

enum x86_regset {
        REGSET_GENERAL,
        REGSET_FP,
        REGSET_XFP,
        REGSET_IOPERM64 = REGSET_XFP,
        REGSET_XSTATE,
        REGSET_TLS,
        REGSET_IOPERM32,
};

This has been so since the original regset commit (325af5fb1418).

Unless I'm misreading the code this makes either REGSET_XFP or 
REGSET_IOPERM64 misbehave? What am I missing?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-06 19:35 ` [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further Thomas Gleixner
  2019-11-07  1:11   ` Linus Torvalds
@ 2019-11-07  7:37   ` Ingo Molnar
  2019-11-07  7:45     ` Thomas Gleixner
  2019-11-07  8:16   ` Ingo Molnar
  2019-11-07 19:24   ` Brian Gerst
  3 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2019-11-07  7:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin


* Thomas Gleixner <tglx@linutronix.de> wrote:

> @@ -365,19 +365,19 @@ struct tss_struct {
>  	struct x86_hw_tss	x86_tss;
>  
>  	/*
> -	 * Store the dirty size of the last io bitmap offender. The next
> -	 * one will have to do the cleanup as the switch out to a non
> -	 * io bitmap user will just set x86_tss.io_bitmap_base to a value
> -	 * outside of the TSS limit. So for sane tasks there is no need
> -	 * to actually touch the io_bitmap at all.
> +	 * Store the dirty byte range of the last io bitmap offender. The
> +	 * next one will have to do the cleanup because the switch out to a
> +	 * non I/O bitmap user will just set x86_tss.io_bitmap_base to a
> +	 * value outside of the TSS limit to not penalize tasks which do
> +	 * not use the I/O bitmap at all.
>  	 */
> -	unsigned int		io_bitmap_prev_max;
> +	unsigned int		io_zerobits_start;
> +	unsigned int		io_zerobits_end;
>  
>  	/*
> -	 * The extra 1 is there because the CPU will access an
> -	 * additional byte beyond the end of the IO permission
> -	 * bitmap. The extra byte must be all 1 bits, and must
> -	 * be within the limit.
> +	 * The extra 1 is there because the CPU will access an additional
> +	 * byte beyond the end of the I/O permission bitmap. The extra byte
> +	 * must have all bits set and must be within the TSS limit.
>  	 */
>  	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
>  } __aligned(PAGE_SIZE);

Note that on 32-bit kernels this blows up our CPU area calculations:

./include/linux/compiler.h:350:38: error: call to ‘__compiletime_assert_181’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE
./include/linux/compiler.h:331:4: note: in definition of macro ‘__compiletime_assert’
./include/linux/compiler.h:350:2: note: in expansion of macro ‘_compiletime_assert’
./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
./include/linux/build_bug.h:50:2: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
arch/x86/mm/cpu_entry_area.c:181:2: note: in expansion of macro ‘BUILD_BUG_ON’
make[2]: *** [scripts/Makefile.build:265: arch/x86/mm/cpu_entry_area.o] Error 1

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07  1:11   ` Linus Torvalds
@ 2019-11-07  7:44     ` Thomas Gleixner
  2019-11-07  8:25     ` Ingo Molnar
  1 sibling, 0 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-07  7:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, the arch/x86 maintainers, Stephen Hemminger, Willy Tarreau,
	Juergen Gross, Sean Christopherson, H. Peter Anvin

On Wed, 6 Nov 2019, Linus Torvalds wrote:
> I may read this patch wrong, but from what I can tell, if we really
> have just one process with an io bitmap, we're doing unnecessary
> copies.
> 
> If we really have just one process that has an iobitmap, I think we
> could just keep the bitmap of that process entirely unchanged. Then,
> when we switch away from it, we set the io_bitmap_base to an invalid
> base outside the TSS segment, and when we switch back, we set it back
> to the valid one. No actual bitmap copies at all.
> 
> So I think that rather than the "begin/end offset" games, we should
> perhaps have a "what was the last process that used the IO bitmap for
> this TSS" pointer (and, I think, some sequence counter, so that when
> the process updates its bitmap, it invalidates that case)?
> 
>  Of course, you can do *nboth*, but if we really think that the common
> case is "one special process", then I think the begin/end offset is
> useless, but a "last bitmap process" would be very useful.
> 
> Am I missing something?

No. You are right. I'll have a look at that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07  7:37   ` Ingo Molnar
@ 2019-11-07  7:45     ` Thomas Gleixner
  0 siblings, 0 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-07  7:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

[-- Attachment #1: Type: text/plain, Size: 1438 bytes --]

On Thu, 7 Nov 2019, Ingo Molnar wrote:
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> >  	/*
> > -	 * The extra 1 is there because the CPU will access an
> > -	 * additional byte beyond the end of the IO permission
> > -	 * bitmap. The extra byte must be all 1 bits, and must
> > -	 * be within the limit.
> > +	 * The extra 1 is there because the CPU will access an additional
> > +	 * byte beyond the end of the I/O permission bitmap. The extra byte
> > +	 * must have all bits set and must be within the TSS limit.
> >  	 */
> >  	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
> >  } __aligned(PAGE_SIZE);
> 
> Note that on 32-bit kernels this blows up our CPU area calculations:
> 
> ./include/linux/compiler.h:350:38: error: call to ‘__compiletime_assert_181’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE
> ./include/linux/compiler.h:331:4: note: in definition of macro ‘__compiletime_assert’
> ./include/linux/compiler.h:350:2: note: in expansion of macro ‘_compiletime_assert’
> ./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
> ./include/linux/build_bug.h:50:2: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
> arch/x86/mm/cpu_entry_area.c:181:2: note: in expansion of macro ‘BUILD_BUG_ON’
> make[2]: *** [scripts/Makefile.build:265: arch/x86/mm/cpu_entry_area.o] Error 1

Duh. /me goes investigate.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-06 19:35 ` [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further Thomas Gleixner
  2019-11-07  1:11   ` Linus Torvalds
  2019-11-07  7:37   ` Ingo Molnar
@ 2019-11-07  8:16   ` Ingo Molnar
  2019-11-07 18:02     ` Thomas Gleixner
  2019-11-07 19:24   ` Brian Gerst
  3 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2019-11-07  8:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin


* Thomas Gleixner <tglx@linutronix.de> wrote:

> +	/* Update the bitmap */
> +	if (turn_on) {
> +		bitmap_clear(bitmap, from, num);
> +	} else {
> +		bitmap_set(bitmap, from, num);
> +	}
> +
> +	/* Get the new range */
> +	first = find_first_zero_bit(bitmap, IO_BITMAP_BITS);
> +
> +	for (last = next = first; next < IO_BITMAP_BITS; last = next) {
> +		/* Find the next set bit and update last */
> +		next = find_next_bit(bitmap, IO_BITMAP_BITS, last);
> +		last = next - 1;
> +		if (next == IO_BITMAP_BITS)
> +			break;
> +		/* Find the next zero bit and continue searching */
> +		next = find_next_zero_bit(bitmap, IO_BITMAP_BITS, next);
> +	}
> +
> +	/* Calculate the byte boundaries for the updated region */
> +	copy_start = from / 8;
> +	copy_len = (round_up(from + num, 8) / 8) - copy_start;

This might seem like a small detail, but since we do the range tracking 
and copying at byte granularity anyway, why not do the zero range search 
at byte granularity as well?

I bet it's faster and simpler as well than the bit-searching.

We could also change over the bitmap to a char or u8 based array and lose 
all the sizeof(long) indexing headaches, resulting type casts, for 
anything but the actual bitmap_set/clear() calls, etc.?

I.e. now that most of the logic is byte granular, the basic data 
structure might as well reflect that?

> +	/*
> +	 * Update the per thread storage and the TSS bitmap. This must be
> +	 * done with preemption disabled to prevent racing against a
> +	 * context switch.
> +	 */
> +	preempt_disable();
> +	tss = this_cpu_ptr(&cpu_tss_rw);
>  
> +	if (!t->io_bitmap_ptr) {
> +		unsigned int tss_start = tss->io_zerobits_start;
> +		/*
> +		 * If the task did not use the I/O bitmap yet then the
> +		 * perhaps stale content in the TSS needs to be taken into
> +		 * account. If tss start is out of bounds the TSS storage
> +		 * does not contain a zero bit and it's sufficient just to
> +		 * copy the new range over.
> +		 */

s/tss/TSS

> +		if (tss_start < IO_BITMAP_BYTES) {
> +			unsigned int tss_end =  tss->io_zerobits_end;
> +			unsigned int copy_end = copy_start + copy_len;
> +
> +			copy_start = min(tss_start, copy_start);
> +			copy_len = max(tss_end, copy_end) - copy_start;
> +		}
> +	}
> +
> +	/* Copy the changed range over to the TSS bitmap */
> +	dst = (char *)tss->io_bitmap;
> +	src = (char *)bitmap;
> +	memcpy(dst + copy_start, src + copy_start, copy_len);
> +
> +	if (first >= IO_BITMAP_BITS) {
> +		/*
> +		 * If the resulting bitmap has all permissions dropped, clear
> +		 * TIF_IO_BITMAP and set the IO bitmap offset in the TSS to
> +		 * invalid. Deallocate both the new and the thread's bitmap.
> +		 */
> +		clear_thread_flag(TIF_IO_BITMAP);
> +		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
> +		tofree = bitmap;
> +		bitmap = NULL;

BTW., wouldn't it be simpler to just say that if a thread uses IO ops 
even once, it gets a bitmap and that's it? I.e. we could further simplify 
this seldom used piece of code.

> +	} else {
>  		/*
> +		 * I/O bitmap contains zero bits. Set TIF_IO_BITMAP, make
> +		 * the bitmap offset valid and make sure that the TSS limit
> +		 * is correct. It might have been wreckaged by a VMEXiT.
>  		 */
> +		set_thread_flag(TIF_IO_BITMAP);
> +		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
>  		refresh_tss_limit();
>  	}

I'm wondering, shouldn't we call refresh_tss_limit() in both branches, or 
is a VMEXIT-wreckaged TSS limit harmless if we otherwise have 
io_bitmap_base set to IO_BITMAP_OFFSET_INVALID?


>  	/*
> +	 * Update the range in the thread and the TSS
>  	 *
> +	 * Get the byte position of the first zero bit and calculate
> +	 * the length of the range in which zero bits exist.
>  	 */
> +	start = first / 8;
> +	end = first < IO_BITMAP_BITS ? round_up(last, 8) / 8 : 0;
> +	t->io_zerobits_start = tss->io_zerobits_start = start;
> +	t->io_zerobits_end = tss->io_zerobits_end = end;
>  
>  	/*
> +	 * Finally exchange the bitmap pointer in the thread.
>  	 */
> +	bitmap = xchg(&t->io_bitmap_ptr, bitmap);
> +	preempt_enable();
>  
> +	kfree(bitmap);
> +	kfree(tofree);
>  
>  	return 0;


Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07  1:11   ` Linus Torvalds
  2019-11-07  7:44     ` Thomas Gleixner
@ 2019-11-07  8:25     ` Ingo Molnar
  2019-11-07  9:17       ` Willy Tarreau
  2019-11-10 17:17       ` Andy Lutomirski
  1 sibling, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2019-11-07  8:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, H. Peter Anvin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Nov 6, 2019 at 12:57 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Calculate both the position of the first zero bit and the last zero bit to
> > limit the range which needs to be copied. This does not solve the problem
> > when the previous tasked had only byte 0 cleared and the next one has only
> > byte 65535 cleared, but trying to solve that would be too complex and
> > heavyweight for the context switch path. As the ioperm() usage is very rare
> > the case which is optimized is the single task/process which uses ioperm().
> 
> Hmm.
> 
> I may read this patch wrong, but from what I can tell, if we really
> have just one process with an io bitmap, we're doing unnecessary
> copies.
> 
> If we really have just one process that has an iobitmap, I think we
> could just keep the bitmap of that process entirely unchanged. Then,
> when we switch away from it, we set the io_bitmap_base to an invalid
> base outside the TSS segment, and when we switch back, we set it back
> to the valid one. No actual bitmap copies at all.
> 
> So I think that rather than the "begin/end offset" games, we should
> perhaps have a "what was the last process that used the IO bitmap for
> this TSS" pointer (and, I think, some sequence counter, so that when
> the process updates its bitmap, it invalidates that case)?
> 
>  Of course, you can do *nboth*, but if we really think that the common
> case is "one special process", then I think the begin/end offset is
> useless, but a "last bitmap process" would be very useful.
> 
> Am I missing something?

In fact on SMP systems this would result in a very nice optimization: 
pretty quickly *all* TSS's would be populated with that single task's 
bitmap, and it would persist even across migrations from CPU to CPU.

I'd love to get rid of the offset caching and bit scanning games as well 
- it doesn't really help in a number of common scenarios and it 
complicates this non-trivial piece of code a *LOT* - and we probably 
don't really have the natural testing density of this code anymore to 
find any regressions quickly.

So intuitively I'd suggest we gravitate towards the simplest 
implementation, with a good caching optimization for the single-task 
case.

I.e. the model I'm suggesting is that if a task uses ioperm() or iopl() 
then it should have a bitmap from that point on until exit(), even if 
it's all zeroes or all ones. Most applications that are using those 
primitives really need it all the time and are using just a few ioports, 
so all the tracking doesn't help much anyway.

On a related note, another simplification would be that in principle we 
could also use just a single bitmap and emulate iopl() as an ioperm(all) 
or ioperm(none) calls. Yeah, it's not fully ABI compatible for mixed 
ioperm()/iopl() uses, but is that ABI actually being relied on in 
practice?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 7/9] x86/iopl: Restrict iopl() permission scope
  2019-11-06 19:35 ` [patch 7/9] x86/iopl: Restrict iopl() permission scope Thomas Gleixner
@ 2019-11-07  9:09   ` Peter Zijlstra
  2019-11-10 17:26   ` Andy Lutomirski
  1 sibling, 0 replies; 62+ messages in thread
From: Peter Zijlstra @ 2019-11-07  9:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On Wed, Nov 06, 2019 at 08:35:06PM +0100, Thomas Gleixner wrote:

Something like:

> @@ -379,7 +383,8 @@ struct tss_struct {
>  	 * byte beyond the end of the I/O permission bitmap. The extra byte
>  	 * must have all bits set and must be within the TSS limit.
>  	 */
> -	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];

#ifndef X86_IOPL_NONE
> +	unsigned long		io_bitmap_map[IO_BITMAP_LONGS + 1];
#ifdef X86_IOPL_EMLATION
> +	unsigned long		io_bitmap_all[IO_BITMAP_LONGS + 1];
#endif /* X86_IOPL_EMLATION */
#endif /* !X86_IOPL_NONE */

>  } __aligned(PAGE_SIZE);

Would allow us to reclaim those 8/16K bitmaps for LEGACY/NONE kernels.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user
  2019-11-06 19:35 ` [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user Thomas Gleixner
@ 2019-11-07  9:12   ` Peter Zijlstra
  2019-11-07 14:04     ` Thomas Gleixner
  2019-11-09  0:24   ` Andy Lutomirski
  1 sibling, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-11-07  9:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On Wed, Nov 06, 2019 at 08:35:03PM +0100, Thomas Gleixner wrote:
> There is no requirement to update the TSS I/O bitmap when a thread using it is
> scheduled out and the incoming thread does not use it.
> 
> For the permission check based on the TSS I/O bitmap the CPU calculates the memory
> location of the I/O bitmap by the address of the TSS and the io_bitmap_base member
> of the tss_struct. The easiest way to invalidate the I/O bitmap is to switch the
> offset to an address outside of the TSS limit.
> 
> If an I/O instruction is issued from user space the TSS limit causes #GP to be
> raised in the same was as valid I/O bitmap with all bits set to 1 would do.
> 
> This removes the extra work when an I/O bitmap using task is scheduled out
> and puts the burden on the rare I/O bitmap users when they are scheduled
> in.

This also nicely aligns with that the context switch time is accounted
to the next task. So by doing the expensive part on switch-in gets it
all accounted to the task that caused it.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 8/9] x86/iopl: Remove legacy IOPL option
  2019-11-06 19:35 ` [patch 8/9] x86/iopl: Remove legacy IOPL option Thomas Gleixner
  2019-11-07  6:11   ` Jürgen Groß
@ 2019-11-07  9:13   ` Peter Zijlstra
  1 sibling, 0 replies; 62+ messages in thread
From: Peter Zijlstra @ 2019-11-07  9:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On Wed, Nov 06, 2019 at 08:35:07PM +0100, Thomas Gleixner wrote:
> The IOPL emulation via the I/O bitmap is sufficient. Remove the legacy
> cruft dealing with the (e)flags based IOPL mechanism.

Much joy at seeing this code go!

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07  8:25     ` Ingo Molnar
@ 2019-11-07  9:17       ` Willy Tarreau
  2019-11-07 10:00         ` Thomas Gleixner
  2019-11-10 17:17       ` Andy Lutomirski
  1 sibling, 1 reply; 62+ messages in thread
From: Willy Tarreau @ 2019-11-07  9:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Thomas Gleixner, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Juergen Gross, Sean Christopherson,
	H. Peter Anvin

On Thu, Nov 07, 2019 at 09:25:41AM +0100, Ingo Molnar wrote:
> I.e. the model I'm suggesting is that if a task uses ioperm() or iopl() 
> then it should have a bitmap from that point on until exit(), even if 
> it's all zeroes or all ones. Most applications that are using those 
> primitives really need it all the time and are using just a few ioports, 
> so all the tracking doesn't help much anyway.

I'd go even further, considering that any task having called ioperm()
or iopl() once is granted access to all 64k ports for life: if the task
was granted access to any port, it will be able to request access for any
other port anyway. And we cannot claim that finely filtering accesses
brings any particular reliability in my opinion, considering that it's
generally possible to make the system really sick by starting to play
with most I/O ports. So for me that becomes a matter of trusted vs not
trusted task. Then we can simply have two pages of 0xFF to describe
their I/O access bitmap.

> On a related note, another simplification would be that in principle we 
> could also use just a single bitmap and emulate iopl() as an ioperm(all) 
> or ioperm(none) calls. Yeah, it's not fully ABI compatible for mixed 
> ioperm()/iopl() uses, but is that ABI actually being relied on in 
> practice?

You mean you'd have a unified map for all tasks ? In this case I think
it's simpler and equivalent to simply ignore the values in the calls
and grant full perms to the 64k ports range after the calls were
validated. I could be totally wrong and missing something obvious
though.

Regards,
Willy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07  9:17       ` Willy Tarreau
@ 2019-11-07 10:00         ` Thomas Gleixner
  2019-11-07 10:13           ` Willy Tarreau
  2019-11-07 10:19           ` hpa
  0 siblings, 2 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-07 10:00 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, Linus Torvalds, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Juergen Gross, Sean Christopherson,
	H. Peter Anvin

On Thu, 7 Nov 2019, Willy Tarreau wrote:
> On Thu, Nov 07, 2019 at 09:25:41AM +0100, Ingo Molnar wrote:
> > I.e. the model I'm suggesting is that if a task uses ioperm() or iopl() 
> > then it should have a bitmap from that point on until exit(), even if 
> > it's all zeroes or all ones. Most applications that are using those 
> > primitives really need it all the time and are using just a few ioports, 
> > so all the tracking doesn't help much anyway.
> 
> I'd go even further, considering that any task having called ioperm()
> or iopl() once is granted access to all 64k ports for life: if the task
> was granted access to any port, it will be able to request access for any
> other port anyway. And we cannot claim that finely filtering accesses
> brings any particular reliability in my opinion, considering that it's
> generally possible to make the system really sick by starting to play
> with most I/O ports. So for me that becomes a matter of trusted vs not
> trusted task. Then we can simply have two pages of 0xFF to describe
> their I/O access bitmap.
> 
> > On a related note, another simplification would be that in principle we 
> > could also use just a single bitmap and emulate iopl() as an ioperm(all) 
> > or ioperm(none) calls. Yeah, it's not fully ABI compatible for mixed 
> > ioperm()/iopl() uses, but is that ABI actually being relied on in 
> > practice?
> 
> You mean you'd have a unified map for all tasks ? In this case I think
> it's simpler and equivalent to simply ignore the values in the calls
> and grant full perms to the 64k ports range after the calls were
> validated. I could be totally wrong and missing something obvious
> though.

Changing ioperm(single port, port range) to be ioperm(all) is going to
break a bunch of test cases which actually check whether the permission is
restricted to a single I/O port or the requested port range.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 10:00         ` Thomas Gleixner
@ 2019-11-07 10:13           ` Willy Tarreau
  2019-11-07 10:19           ` hpa
  1 sibling, 0 replies; 62+ messages in thread
From: Willy Tarreau @ 2019-11-07 10:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Linus Torvalds, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Juergen Gross, Sean Christopherson,
	H. Peter Anvin

On Thu, Nov 07, 2019 at 11:00:27AM +0100, Thomas Gleixner wrote:
> Changing ioperm(single port, port range) to be ioperm(all) is going to
> break a bunch of test cases which actually check whether the permission is
> restricted to a single I/O port or the requested port range.

But out of curiosity, are these solely test cases or things that real
applications do ? We could imagine having a sysctl entry to indicate
whether or not we want strict compatibility with older code in which
case we'd take the slow path, or a modernized behavior using only the
fast path. If we managed to deal with mmap_min_addr over time, I think
it should be manageable to deal with the rare applications using ioperm().

Willy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 10:00         ` Thomas Gleixner
  2019-11-07 10:13           ` Willy Tarreau
@ 2019-11-07 10:19           ` hpa
  2019-11-07 10:27             ` Willy Tarreau
  1 sibling, 1 reply; 62+ messages in thread
From: hpa @ 2019-11-07 10:19 UTC (permalink / raw)
  To: Thomas Gleixner, Willy Tarreau
  Cc: Ingo Molnar, Linus Torvalds, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Juergen Gross, Sean Christopherson

On November 7, 2019 2:00:27 AM PST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Thu, 7 Nov 2019, Willy Tarreau wrote:
>> On Thu, Nov 07, 2019 at 09:25:41AM +0100, Ingo Molnar wrote:
>> > I.e. the model I'm suggesting is that if a task uses ioperm() or
>iopl() 
>> > then it should have a bitmap from that point on until exit(), even
>if 
>> > it's all zeroes or all ones. Most applications that are using those
>
>> > primitives really need it all the time and are using just a few
>ioports, 
>> > so all the tracking doesn't help much anyway.
>> 
>> I'd go even further, considering that any task having called ioperm()
>> or iopl() once is granted access to all 64k ports for life: if the
>task
>> was granted access to any port, it will be able to request access for
>any
>> other port anyway. And we cannot claim that finely filtering accesses
>> brings any particular reliability in my opinion, considering that
>it's
>> generally possible to make the system really sick by starting to play
>> with most I/O ports. So for me that becomes a matter of trusted vs
>not
>> trusted task. Then we can simply have two pages of 0xFF to describe
>> their I/O access bitmap.
>> 
>> > On a related note, another simplification would be that in
>principle we 
>> > could also use just a single bitmap and emulate iopl() as an
>ioperm(all) 
>> > or ioperm(none) calls. Yeah, it's not fully ABI compatible for
>mixed 
>> > ioperm()/iopl() uses, but is that ABI actually being relied on in 
>> > practice?
>> 
>> You mean you'd have a unified map for all tasks ? In this case I
>think
>> it's simpler and equivalent to simply ignore the values in the calls
>> and grant full perms to the 64k ports range after the calls were
>> validated. I could be totally wrong and missing something obvious
>> though.
>
>Changing ioperm(single port, port range) to be ioperm(all) is going to
>break a bunch of test cases which actually check whether the permission
>is
>restricted to a single I/O port or the requested port range.
>
>Thanks,
>
>	tglx

This seems very undesirable... as much as we might wish otherwise, the port bitmap is the equivalent to the MMU, and there are definitely users doing direct device I/O out there.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 10:19           ` hpa
@ 2019-11-07 10:27             ` Willy Tarreau
  2019-11-07 10:50               ` hpa
  0 siblings, 1 reply; 62+ messages in thread
From: Willy Tarreau @ 2019-11-07 10:27 UTC (permalink / raw)
  To: hpa
  Cc: Thomas Gleixner, Ingo Molnar, Linus Torvalds, LKML,
	the arch/x86 maintainers, Stephen Hemminger, Juergen Gross,
	Sean Christopherson

On Thu, Nov 07, 2019 at 02:19:19AM -0800, hpa@zytor.com wrote:
> >Changing ioperm(single port, port range) to be ioperm(all) is going to
> >break a bunch of test cases which actually check whether the permission
> >is restricted to a single I/O port or the requested port range.
> >
> >Thanks,
> >
> >	tglx
> 
> This seems very undesirable... as much as we might wish otherwise, the port
> bitmap is the equivalent to the MMU, and there are definitely users doing
> direct device I/O out there.

Doing these, sure, but doing these while ranges are really checked ?
I mean, the MMU grants you access to the pages you were assigned. Here
with the I/O bitmap you just have to ask for access to port X and you
get it. I could understand the benefit if we had EBUSY in return but
that's not the case, you can actually request access to a port range
another device driver or process is currently using, and mess up with
what it does even by accident. I remember streaming 1-bit music in
userland from the LED of my floppy drive in the late-90s, it used to
cause some trouble to the floppy driver when using mtools in parallel :-)

Willy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 10:27             ` Willy Tarreau
@ 2019-11-07 10:50               ` hpa
  2019-11-07 12:56                 ` Willy Tarreau
  0 siblings, 1 reply; 62+ messages in thread
From: hpa @ 2019-11-07 10:50 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Thomas Gleixner, Ingo Molnar, Linus Torvalds, LKML,
	the arch/x86 maintainers, Stephen Hemminger, Juergen Gross,
	Sean Christopherson

On November 7, 2019 2:27:56 AM PST, Willy Tarreau <w@1wt.eu> wrote:
>On Thu, Nov 07, 2019 at 02:19:19AM -0800, hpa@zytor.com wrote:
>> >Changing ioperm(single port, port range) to be ioperm(all) is going
>to
>> >break a bunch of test cases which actually check whether the
>permission
>> >is restricted to a single I/O port or the requested port range.
>> >
>> >Thanks,
>> >
>> >	tglx
>> 
>> This seems very undesirable... as much as we might wish otherwise,
>the port
>> bitmap is the equivalent to the MMU, and there are definitely users
>doing
>> direct device I/O out there.
>
>Doing these, sure, but doing these while ranges are really checked ?
>I mean, the MMU grants you access to the pages you were assigned. Here
>with the I/O bitmap you just have to ask for access to port X and you
>get it. I could understand the benefit if we had EBUSY in return but
>that's not the case, you can actually request access to a port range
>another device driver or process is currently using, and mess up with
>what it does even by accident. I remember streaming 1-bit music in
>userland from the LED of my floppy drive in the late-90s, it used to
>cause some trouble to the floppy driver when using mtools in parallel
>:-)
>
>Willy

You get access to the ports you are assigned, just like pages you are assigned... the rest is kernel policy, or, for that matter, privileged userspace (get permissions to the necessary ports, then drop privilege... the usual stuff.)

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 10:50               ` hpa
@ 2019-11-07 12:56                 ` Willy Tarreau
  2019-11-07 16:45                   ` ebiederm
  0 siblings, 1 reply; 62+ messages in thread
From: Willy Tarreau @ 2019-11-07 12:56 UTC (permalink / raw)
  To: hpa
  Cc: Thomas Gleixner, Ingo Molnar, Linus Torvalds, LKML,
	the arch/x86 maintainers, Stephen Hemminger, Juergen Gross,
	Sean Christopherson

On Thu, Nov 07, 2019 at 02:50:20AM -0800, hpa@zytor.com wrote:
> You get access to the ports you are assigned, just like pages you are
> assigned... the rest is kernel policy, or, for that matter, privileged
> userspace (get permissions to the necessary ports, then drop privilege... the
> usual stuff.)

I agree, my point is that there's already no policy checking at the
moment ports are assigned, hence a process having the permissions to
request just port 0x70-0x71 to read the hwclock will also have permission
to request access to the sensor chip a 0x2E and trigger a watchdog
reset or stop the CPU fan. Thus any policy enforcement is solely done
by the requesting process itself, assuming it doesn't simply use iopl()
already, which grants everything.

This is why I'm really wondering if the real use cases that need all
this stuff still exist at all in practice.

Willy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user
  2019-11-07  9:12   ` Peter Zijlstra
@ 2019-11-07 14:04     ` Thomas Gleixner
  2019-11-07 14:08       ` Thomas Gleixner
  0 siblings, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-07 14:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On Thu, 7 Nov 2019, Peter Zijlstra wrote:
> On Wed, Nov 06, 2019 at 08:35:03PM +0100, Thomas Gleixner wrote:
> > There is no requirement to update the TSS I/O bitmap when a thread using it is
> > scheduled out and the incoming thread does not use it.
> > 
> > For the permission check based on the TSS I/O bitmap the CPU calculates the memory
> > location of the I/O bitmap by the address of the TSS and the io_bitmap_base member
> > of the tss_struct. The easiest way to invalidate the I/O bitmap is to switch the
> > offset to an address outside of the TSS limit.
> > 
> > If an I/O instruction is issued from user space the TSS limit causes #GP to be
> > raised in the same was as valid I/O bitmap with all bits set to 1 would do.
> > 
> > This removes the extra work when an I/O bitmap using task is scheduled out
> > and puts the burden on the rare I/O bitmap users when they are scheduled
> > in.
> 
> This also nicely aligns with that the context switch time is accounted
> to the next task. So by doing the expensive part on switch-in gets it
> all accounted to the task that caused it.

Just that I can't add the storage to tss_struct due to the VMX insanity of
setting TSS limit hard to 0x67 on vmexit instead of restoring the host
value.

Did I say that already that virt creates more problems than it solves?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user
  2019-11-07 14:04     ` Thomas Gleixner
@ 2019-11-07 14:08       ` Thomas Gleixner
  2019-11-08 22:41         ` Andy Lutomirski
  0 siblings, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-07 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On Thu, 7 Nov 2019, Thomas Gleixner wrote:
> On Thu, 7 Nov 2019, Peter Zijlstra wrote:
> > On Wed, Nov 06, 2019 at 08:35:03PM +0100, Thomas Gleixner wrote:
> > > There is no requirement to update the TSS I/O bitmap when a thread using it is
> > > scheduled out and the incoming thread does not use it.
> > > 
> > > For the permission check based on the TSS I/O bitmap the CPU calculates the memory
> > > location of the I/O bitmap by the address of the TSS and the io_bitmap_base member
> > > of the tss_struct. The easiest way to invalidate the I/O bitmap is to switch the
> > > offset to an address outside of the TSS limit.
> > > 
> > > If an I/O instruction is issued from user space the TSS limit causes #GP to be
> > > raised in the same was as valid I/O bitmap with all bits set to 1 would do.
> > > 
> > > This removes the extra work when an I/O bitmap using task is scheduled out
> > > and puts the burden on the rare I/O bitmap users when they are scheduled
> > > in.
> > 
> > This also nicely aligns with that the context switch time is accounted
> > to the next task. So by doing the expensive part on switch-in gets it
> > all accounted to the task that caused it.
> 
> Just that I can't add the storage to tss_struct due to the VMX insanity of
> setting TSS limit hard to 0x67 on vmexit instead of restoring the host
> value.

Well, I can. The build bugon in vmx.c is just bogus.
 
> Did I say that already that virt creates more problems than it solves?

Still stands.

      tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 8/9] x86/iopl: Remove legacy IOPL option
  2019-11-07  6:11   ` Jürgen Groß
  2019-11-07  6:26     ` hpa
@ 2019-11-07 16:44     ` Stephen Hemminger
  1 sibling, 0 replies; 62+ messages in thread
From: Stephen Hemminger @ 2019-11-07 16:44 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: Thomas Gleixner, LKML, x86, Willy Tarreau, Sean Christopherson,
	Linus Torvalds, H. Peter Anvin

On Thu, 7 Nov 2019 07:11:49 +0100
Jürgen Groß <jgross@suse.com> wrote:

> On 06.11.19 20:35, Thomas Gleixner wrote:
> > The IOPL emulation via the I/O bitmap is sufficient. Remove the legacy
> > cruft dealing with the (e)flags based IOPL mechanism.
> > 
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>  
> 
> Reviewed-by: Juergen Gross <jgross@suse.com> (Paravirt and Xen parts)

I am happy to see this go. The DPDK only did initially because it
was only way virtio would work at the time.

Acked-by: Stephen Hemminger <stephen@networkplumber.org>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 12:56                 ` Willy Tarreau
@ 2019-11-07 16:45                   ` ebiederm
  2019-11-07 16:53                     ` Linus Torvalds
  2019-11-07 16:57                     ` Willy Tarreau
  0 siblings, 2 replies; 62+ messages in thread
From: ebiederm @ 2019-11-07 16:45 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: hpa, Thomas Gleixner, Ingo Molnar, Linus Torvalds, LKML,
	the arch/x86 maintainers, Stephen Hemminger, Juergen Gross,
	Sean Christopherson

Willy Tarreau <w@1wt.eu> writes:

> On Thu, Nov 07, 2019 at 02:50:20AM -0800, hpa@zytor.com wrote:
>> You get access to the ports you are assigned, just like pages you are
>> assigned... the rest is kernel policy, or, for that matter, privileged
>> userspace (get permissions to the necessary ports, then drop privilege... the
>> usual stuff.)
>
> I agree, my point is that there's already no policy checking at the
> moment ports are assigned, hence a process having the permissions to
> request just port 0x70-0x71 to read the hwclock will also have permission
> to request access to the sensor chip a 0x2E and trigger a watchdog
> reset or stop the CPU fan. Thus any policy enforcement is solely done
> by the requesting process itself, assuming it doesn't simply use iopl()
> already, which grants everything.
>
> This is why I'm really wondering if the real use cases that need all
> this stuff still exist at all in practice.

My memory is that the applications that didn't need fine grain access to
ports would just use iopl.

Further a quick look shows that dosemu uses ioperm in a fine grained
manner.  From memory it would allow a handful of ports to allow directly
accessing a device and depended on the rest of the port accesses to be
disallowed so it could trap and emulate them.

So yes I do believe making ioperm ioperm(all) will break userspace.

Eric


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 16:45                   ` ebiederm
@ 2019-11-07 16:53                     ` Linus Torvalds
  2019-11-07 16:57                     ` Willy Tarreau
  1 sibling, 0 replies; 62+ messages in thread
From: Linus Torvalds @ 2019-11-07 16:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Willy Tarreau, Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	the arch/x86 maintainers, Stephen Hemminger, Juergen Gross,
	Sean Christopherson

On Thu, Nov 7, 2019 at 8:45 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Further a quick look shows that dosemu uses ioperm in a fine grained
> manner.  From memory it would allow a handful of ports to allow directly
> accessing a device and depended on the rest of the port accesses to be
> disallowed so it could trap and emulate them.

Yes. Making ioperm() some all-or-nothing thing would be horribly bad,
and has no real advantages that I can see.

               Linus

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 16:45                   ` ebiederm
  2019-11-07 16:53                     ` Linus Torvalds
@ 2019-11-07 16:57                     ` Willy Tarreau
  1 sibling, 0 replies; 62+ messages in thread
From: Willy Tarreau @ 2019-11-07 16:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: hpa, Thomas Gleixner, Ingo Molnar, Linus Torvalds, LKML,
	the arch/x86 maintainers, Stephen Hemminger, Juergen Gross,
	Sean Christopherson

On Thu, Nov 07, 2019 at 10:45:09AM -0600, Eric W. Biederman wrote:
> Further a quick look shows that dosemu uses ioperm in a fine grained
> manner.  From memory it would allow a handful of ports to allow directly
> accessing a device and depended on the rest of the port accesses to be
> disallowed so it could trap and emulate them.
> 
> So yes I do believe making ioperm ioperm(all) will break userspace.

OK, and I must confess I almost forgot about dosemu, that's a good point!

Thanks,
Willy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07  8:16   ` Ingo Molnar
@ 2019-11-07 18:02     ` Thomas Gleixner
  0 siblings, 0 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-07 18:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On Thu, 7 Nov 2019, Ingo Molnar wrote:
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> This might seem like a small detail, but since we do the range tracking 
> and copying at byte granularity anyway, why not do the zero range search 
> at byte granularity as well?
> 
> I bet it's faster and simpler as well than the bit-searching.

Not necessarily. The bitmap search uses unsigned long internally at least
to the point where it finds a zero bit.

But probably we should just ditch that optimization and rather have the
sequence number to figure out whether something needs to be copied at all.

> > +	if (first >= IO_BITMAP_BITS) {
> > +		/*
> > +		 * If the resulting bitmap has all permissions dropped, clear
> > +		 * TIF_IO_BITMAP and set the IO bitmap offset in the TSS to
> > +		 * invalid. Deallocate both the new and the thread's bitmap.
> > +		 */
> > +		clear_thread_flag(TIF_IO_BITMAP);
> > +		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
> > +		tofree = bitmap;
> > +		bitmap = NULL;
> 
> BTW., wouldn't it be simpler to just say that if a thread uses IO ops 
> even once, it gets a bitmap and that's it? I.e. we could further simplify 
> this seldom used piece of code.

Maybe.
 
> > +	} else {
> >  		/*
> > +		 * I/O bitmap contains zero bits. Set TIF_IO_BITMAP, make
> > +		 * the bitmap offset valid and make sure that the TSS limit
> > +		 * is correct. It might have been wreckaged by a VMEXiT.
> >  		 */
> > +		set_thread_flag(TIF_IO_BITMAP);
> > +		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
> >  		refresh_tss_limit();
> >  	}
> 
> I'm wondering, shouldn't we call refresh_tss_limit() in both branches, or 
> is a VMEXIT-wreckaged TSS limit harmless if we otherwise have 
> io_bitmap_base set to IO_BITMAP_OFFSET_INVALID?

Yes. because the VMEXIT wreckage restricts TSS limit to 0x67 which is the
end of the hardware tss struct. As the invalid offset points in any case
outside the TSS limit it does not matter. It always #GP's when an I/O
access happens in user space.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-06 19:35 ` [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further Thomas Gleixner
                     ` (2 preceding siblings ...)
  2019-11-07  8:16   ` Ingo Molnar
@ 2019-11-07 19:24   ` Brian Gerst
  2019-11-07 19:54     ` Linus Torvalds
  3 siblings, 1 reply; 62+ messages in thread
From: Brian Gerst @ 2019-11-07 19:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Stephen Hemminger, Willy Tarreau,
	Juergen Gross, Sean Christopherson, Linus Torvalds,
	H. Peter Anvin

On Wed, Nov 6, 2019 at 4:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Sane ioperm() users only set the few bits in the I/O space which they need to
> access. But the update logic of the TSS I/O bitmap copies always from byte
> 0 to the last byte in the tasks bitmap which contains a zero permission bit.
>
> That means that for access only to port 65335 the full 8K bitmap has to be
> copied even if all the bytes in the TSS I/O bitmap are already filled with
> 0xff.
>
> Calculate both the position of the first zero bit and the last zero bit to
> limit the range which needs to be copied. This does not solve the problem
> when the previous tasked had only byte 0 cleared and the next one has only
> byte 65535 cleared, but trying to solve that would be too complex and
> heavyweight for the context switch path. As the ioperm() usage is very rare
> the case which is optimized is the single task/process which uses ioperm().

Here is a different idea:  We already map the TSS virtually in
cpu_entry_area.  Why not page-align the IO bitmap and remap it to the
task's bitmap on task switch?  That would avoid all copying on task
switch.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 19:24   ` Brian Gerst
@ 2019-11-07 19:54     ` Linus Torvalds
  2019-11-07 21:00       ` Brian Gerst
  0 siblings, 1 reply; 62+ messages in thread
From: Linus Torvalds @ 2019-11-07 19:54 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Thomas Gleixner, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, H. Peter Anvin

On Thu, Nov 7, 2019 at 11:24 AM Brian Gerst <brgerst@gmail.com> wrote:
>
> Here is a different idea:  We already map the TSS virtually in
> cpu_entry_area.  Why not page-align the IO bitmap and remap it to the
> task's bitmap on task switch?  That would avoid all copying on task
> switch.

We map the tss _once_, statically, percpu, without ever changing it,
and then we just (potentially) change a couple of fields in it on
process switch.

Your idea isn't horrible, but it would involve a TLB flush for the
page when the io bitmap changes. Which is almost certainly more
expensive than just copying the bitmap intelligently.

Particularly since I do think that the copy can basically be done
effectively never, assuming there really aren't multiple concurrent
users of ioperm() (and iopl).

               Linus

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 19:54     ` Linus Torvalds
@ 2019-11-07 21:00       ` Brian Gerst
  2019-11-07 21:32         ` Thomas Gleixner
  2019-11-07 21:44         ` Linus Torvalds
  0 siblings, 2 replies; 62+ messages in thread
From: Brian Gerst @ 2019-11-07 21:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, H. Peter Anvin

On Thu, Nov 7, 2019 at 2:54 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Nov 7, 2019 at 11:24 AM Brian Gerst <brgerst@gmail.com> wrote:
> >
> > Here is a different idea:  We already map the TSS virtually in
> > cpu_entry_area.  Why not page-align the IO bitmap and remap it to the
> > task's bitmap on task switch?  That would avoid all copying on task
> > switch.
>
> We map the tss _once_, statically, percpu, without ever changing it,
> and then we just (potentially) change a couple of fields in it on
> process switch.
>
> Your idea isn't horrible, but it would involve a TLB flush for the
> page when the io bitmap changes. Which is almost certainly more
> expensive than just copying the bitmap intelligently.
>
> Particularly since I do think that the copy can basically be done
> effectively never, assuming there really aren't multiple concurrent
> users of ioperm() (and iopl).

There wouldn't have to be a flush on every task switch.  If we make it
so that tasks that don't use a bitmap just unmap the pages in the
cpu_entry_area and set tss.io_bitmap_base to outside the segment
limit, we would only have to flush when switching from a task using
the bitmap (because the next task uses a different bitmap or we are
unmapping it).  If the previous task doesn't have a bitmap the pages
in cpu_entry_area were unmapped and can't be in the TLB, so no flush
is needed.

Going a step further, we could track which task is mapped to the
current cpu like proposed above, and only flush when a different task
needs the IO bitmap, or when the bitmap is being freed on task exit.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 21:00       ` Brian Gerst
@ 2019-11-07 21:32         ` Thomas Gleixner
  2019-11-07 23:20           ` hpa
  2019-11-07 21:44         ` Linus Torvalds
  1 sibling, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-07 21:32 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Linus Torvalds, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, H. Peter Anvin

On Thu, 7 Nov 2019, Brian Gerst wrote:
> On Thu, Nov 7, 2019 at 2:54 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, Nov 7, 2019 at 11:24 AM Brian Gerst <brgerst@gmail.com> wrote:
> > >
> > > Here is a different idea:  We already map the TSS virtually in
> > > cpu_entry_area.  Why not page-align the IO bitmap and remap it to the
> > > task's bitmap on task switch?  That would avoid all copying on task
> > > switch.
> >
> > We map the tss _once_, statically, percpu, without ever changing it,
> > and then we just (potentially) change a couple of fields in it on
> > process switch.
> >
> > Your idea isn't horrible, but it would involve a TLB flush for the
> > page when the io bitmap changes. Which is almost certainly more
> > expensive than just copying the bitmap intelligently.
> >
> > Particularly since I do think that the copy can basically be done
> > effectively never, assuming there really aren't multiple concurrent
> > users of ioperm() (and iopl).
> 
> There wouldn't have to be a flush on every task switch.  If we make it
> so that tasks that don't use a bitmap just unmap the pages in the
> cpu_entry_area and set tss.io_bitmap_base to outside the segment
> limit, we would only have to flush when switching from a task using
> the bitmap (because the next task uses a different bitmap or we are
> unmapping it).  If the previous task doesn't have a bitmap the pages
> in cpu_entry_area were unmapped and can't be in the TLB, so no flush
> is needed.

Funny. I was just debating exactly this with Peter Ziljstra over IRC :)
 
> Going a step further, we could track which task is mapped to the
> current cpu like proposed above, and only flush when a different task
> needs the IO bitmap, or when the bitmap is being freed on task exit.

Yes.

But, we really should check what aside of DoSemu is using this still. None
of my machines I checked have a single instance of ioperm()/iopl() usage.

So the real question is whether it's worth the trouble or if we are just
better off to copy if there is an actual user and the sequence count of the
bitmap is different than the one which was active last time.

Thanks,

	tglx





^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 21:00       ` Brian Gerst
  2019-11-07 21:32         ` Thomas Gleixner
@ 2019-11-07 21:44         ` Linus Torvalds
  2019-11-08  1:12           ` H. Peter Anvin
  2019-11-10 17:21           ` Andy Lutomirski
  1 sibling, 2 replies; 62+ messages in thread
From: Linus Torvalds @ 2019-11-07 21:44 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Thomas Gleixner, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, H. Peter Anvin

On Thu, Nov 7, 2019 at 1:00 PM Brian Gerst <brgerst@gmail.com> wrote:
>
> There wouldn't have to be a flush on every task switch.

No. But we'd have to flush on any switch that currently does that memcpy.

And my point is that a tlb flush (even the single-page case) is likely
more expensive than the memcpy.

> Going a step further, we could track which task is mapped to the
> current cpu like proposed above, and only flush when a different task
> needs the IO bitmap, or when the bitmap is being freed on task exit.

Well, that's exactly my "track the last task" optimization for copying
the thing.

IOW, it's the same optimization as avoiding the memcpy.

Which I think is likely very effective, but also makes it fairly
pointless to then try to be clever..

So the basic issue remains that playing VM games has almost
universally been slower and more complex than simply not playing VM
games. TLB flushes - even invlpg - tends to be pretty slow.

Of course, we probably end up invalidating the TLB's anyway, so maybe
in this case we don't care. The ioperm bitmap is _technically_
per-thread, though, so it should be flushed even if the VM isn't
flushed...

             Linus

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 21:32         ` Thomas Gleixner
@ 2019-11-07 23:20           ` hpa
  0 siblings, 0 replies; 62+ messages in thread
From: hpa @ 2019-11-07 23:20 UTC (permalink / raw)
  To: Thomas Gleixner, Brian Gerst
  Cc: Linus Torvalds, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson

On November 7, 2019 1:32:15 PM PST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Thu, 7 Nov 2019, Brian Gerst wrote:
>> On Thu, Nov 7, 2019 at 2:54 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> >
>> > On Thu, Nov 7, 2019 at 11:24 AM Brian Gerst <brgerst@gmail.com>
>wrote:
>> > >
>> > > Here is a different idea:  We already map the TSS virtually in
>> > > cpu_entry_area.  Why not page-align the IO bitmap and remap it to
>the
>> > > task's bitmap on task switch?  That would avoid all copying on
>task
>> > > switch.
>> >
>> > We map the tss _once_, statically, percpu, without ever changing
>it,
>> > and then we just (potentially) change a couple of fields in it on
>> > process switch.
>> >
>> > Your idea isn't horrible, but it would involve a TLB flush for the
>> > page when the io bitmap changes. Which is almost certainly more
>> > expensive than just copying the bitmap intelligently.
>> >
>> > Particularly since I do think that the copy can basically be done
>> > effectively never, assuming there really aren't multiple concurrent
>> > users of ioperm() (and iopl).
>> 
>> There wouldn't have to be a flush on every task switch.  If we make
>it
>> so that tasks that don't use a bitmap just unmap the pages in the
>> cpu_entry_area and set tss.io_bitmap_base to outside the segment
>> limit, we would only have to flush when switching from a task using
>> the bitmap (because the next task uses a different bitmap or we are
>> unmapping it).  If the previous task doesn't have a bitmap the pages
>> in cpu_entry_area were unmapped and can't be in the TLB, so no flush
>> is needed.
>
>Funny. I was just debating exactly this with Peter Ziljstra over IRC :)
> 
>> Going a step further, we could track which task is mapped to the
>> current cpu like proposed above, and only flush when a different task
>> needs the IO bitmap, or when the bitmap is being freed on task exit.
>
>Yes.
>
>But, we really should check what aside of DoSemu is using this still.
>None
>of my machines I checked have a single instance of ioperm()/iopl()
>usage.
>
>So the real question is whether it's worth the trouble or if we are
>just
>better off to copy if there is an actual user and the sequence count of
>the
>bitmap is different than the one which was active last time.
>
>Thanks,
>
>	tglx

I have written suffer using this, because of far better real time performance. I just want to punch a hole (just like mmapping an MMIO device.)

I do agree that let's not optimize for the rare case.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 21:44         ` Linus Torvalds
@ 2019-11-08  1:12           ` H. Peter Anvin
  2019-11-08  2:12             ` Brian Gerst
  2019-11-10 17:21           ` Andy Lutomirski
  1 sibling, 1 reply; 62+ messages in thread
From: H. Peter Anvin @ 2019-11-08  1:12 UTC (permalink / raw)
  To: Linus Torvalds, Brian Gerst
  Cc: Thomas Gleixner, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson

On 2019-11-07 13:44, Linus Torvalds wrote:
> On Thu, Nov 7, 2019 at 1:00 PM Brian Gerst <brgerst@gmail.com> wrote:
>>
>> There wouldn't have to be a flush on every task switch.
> 
> No. But we'd have to flush on any switch that currently does that memcpy.
> 
> And my point is that a tlb flush (even the single-page case) is likely
> more expensive than the memcpy.
> 
>> Going a step further, we could track which task is mapped to the
>> current cpu like proposed above, and only flush when a different task
>> needs the IO bitmap, or when the bitmap is being freed on task exit.
> 
> Well, that's exactly my "track the last task" optimization for copying
> the thing.
> 
> IOW, it's the same optimization as avoiding the memcpy.
> 
> Which I think is likely very effective, but also makes it fairly
> pointless to then try to be clever..
> 
> So the basic issue remains that playing VM games has almost
> universally been slower and more complex than simply not playing VM
> games. TLB flushes - even invlpg - tends to be pretty slow.
> 
> Of course, we probably end up invalidating the TLB's anyway, so maybe
> in this case we don't care. The ioperm bitmap is _technically_
> per-thread, though, so it should be flushed even if the VM isn't
> flushed...
> 

One option, probably a lot saner (if we care at all, after all, copying 8K
really isn't that much, but it might have some impact on real-time processes,
which is one of the rather few use cases for direct I/O) would be to keep the
bitmask in a pre-formatted TSS (ioperm being per thread, so no concerns about
the TSS being in use on another processor), and copy the TSS fields (88 bytes)
over if and only if the thread has been migrated to a different CPU, then
switch the TSS rather than switching  For the common case (no ioperms) we use
the standard per-cpu TSS.

That being said, I don't actually know that copying 88 bytes + LTR is any
cheaper than copying 8K.

	-hpa

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-08  1:12           ` H. Peter Anvin
@ 2019-11-08  2:12             ` Brian Gerst
  0 siblings, 0 replies; 62+ messages in thread
From: Brian Gerst @ 2019-11-08  2:12 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Thomas Gleixner, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson

On Thu, Nov 7, 2019 at 8:12 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 2019-11-07 13:44, Linus Torvalds wrote:
> > On Thu, Nov 7, 2019 at 1:00 PM Brian Gerst <brgerst@gmail.com> wrote:
> >>
> >> There wouldn't have to be a flush on every task switch.
> >
> > No. But we'd have to flush on any switch that currently does that memcpy.
> >
> > And my point is that a tlb flush (even the single-page case) is likely
> > more expensive than the memcpy.
> >
> >> Going a step further, we could track which task is mapped to the
> >> current cpu like proposed above, and only flush when a different task
> >> needs the IO bitmap, or when the bitmap is being freed on task exit.
> >
> > Well, that's exactly my "track the last task" optimization for copying
> > the thing.
> >
> > IOW, it's the same optimization as avoiding the memcpy.
> >
> > Which I think is likely very effective, but also makes it fairly
> > pointless to then try to be clever..
> >
> > So the basic issue remains that playing VM games has almost
> > universally been slower and more complex than simply not playing VM
> > games. TLB flushes - even invlpg - tends to be pretty slow.
> >
> > Of course, we probably end up invalidating the TLB's anyway, so maybe
> > in this case we don't care. The ioperm bitmap is _technically_
> > per-thread, though, so it should be flushed even if the VM isn't
> > flushed...
> >
>
> One option, probably a lot saner (if we care at all, after all, copying 8K
> really isn't that much, but it might have some impact on real-time processes,
> which is one of the rather few use cases for direct I/O) would be to keep the
> bitmask in a pre-formatted TSS (ioperm being per thread, so no concerns about
> the TSS being in use on another processor), and copy the TSS fields (88 bytes)
> over if and only if the thread has been migrated to a different CPU, then
> switch the TSS rather than switching  For the common case (no ioperms) we use
> the standard per-cpu TSS.
>
> That being said, I don't actually know that copying 88 bytes + LTR is any
> cheaper than copying 8K.

I don't think that can work.  The TSS has to be at a fixed address in
the cpu_entry_area so that it is visible when running in usermode
(thanks to Meltdown).

--
Brian Gerst

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 2/9] x86/process: Unify copy_thread_tls()
  2019-11-06 19:35 ` [patch 2/9] x86/process: Unify copy_thread_tls() Thomas Gleixner
@ 2019-11-08 22:31   ` Andy Lutomirski
  2019-11-08 23:43     ` Thomas Gleixner
  0 siblings, 1 reply; 62+ messages in thread
From: Andy Lutomirski @ 2019-11-08 22:31 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On 11/6/19 11:35 AM, Thomas Gleixner wrote:
> While looking at the TSS io bitmap it turned out that any change in that
> area would require identical changes to copy_thread_tls(). The 32 and 64
> bit variants share sufficient code to consolidate them into a common
> function to avoid duplication of upcoming modifications.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/include/asm/ptrace.h    |    6 ++
>  arch/x86/include/asm/switch_to.h |   10 ++++
>  arch/x86/kernel/process.c        |   94 +++++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/process_32.c     |   68 ----------------------------
>  arch/x86/kernel/process_64.c     |   75 -------------------------------
>  5 files changed, 110 insertions(+), 143 deletions(-)
> 
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -361,5 +361,11 @@ extern int do_get_thread_area(struct tas
>  extern int do_set_thread_area(struct task_struct *p, int idx,
>  			      struct user_desc __user *info, int can_allocate);
>  
> +#ifdef CONFIG_X86_64
> +# define do_set_thread_area_64(p, s, t)	do_arch_prctl_64(p, s, t)
> +#else
> +# define do_set_thread_area_64(p, s, t)	(0)
> +#endif
> +
>  #endif /* !__ASSEMBLY__ */
>  #endif /* _ASM_X86_PTRACE_H */
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -103,7 +103,17 @@ static inline void update_task_stack(str
>  	if (static_cpu_has(X86_FEATURE_XENPV))
>  		load_sp0(task_top_of_stack(task));
>  #endif
> +}
>  
> +static inline void kthread_frame_init(struct inactive_task_frame *frame,
> +				      unsigned long fun, unsigned long arg)
> +{
> +	frame->bx = fun;
> +#ifdef CONFIG_X86_32
> +	frame->di = arg;
> +#else
> +	frame->r12 = arg;
> +#endif
>  }
>  
>  #endif /* _ASM_X86_SWITCH_TO_H */
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -132,6 +132,100 @@ void exit_thread(struct task_struct *tsk
>  	fpu__drop(fpu);
>  }
>  
> +static int set_new_tls(struct task_struct *p, unsigned long tls)
> +{
> +	struct user_desc __user *utls = (struct user_desc __user *)tls;
> +
> +	if (in_ia32_syscall())
> +		return do_set_thread_area(p, -1, utls, 0);
> +	else
> +		return do_set_thread_area_64(p, ARCH_SET_FS, tls);
> +}
> +
> +static inline int copy_io_bitmap(struct task_struct *tsk)
> +{
> +	if (likely(!test_tsk_thread_flag(current, TIF_IO_BITMAP)))
> +		return 0;
> +
> +	tsk->thread.io_bitmap_ptr = kmemdup(current->thread.io_bitmap_ptr,
> +					    IO_BITMAP_BYTES, GFP_KERNEL);

tsk->thread.io_bitmap_max = current->thread.io_bitmap_max?

I realize you inherited this from the code you're refactoring, but it
does seem to be missing.

> +	if (!tsk->thread.io_bitmap_ptr) {
> +		tsk->thread.io_bitmap_max = 0;
> +		return -ENOMEM;
> +	}
> +	set_tsk_thread_flag(tsk, TIF_IO_BITMAP);
> +	return 0;
> +}
> +
> +static inline void free_io_bitmap(struct task_struct *tsk)
> +{
> +	if (tsk->thread.io_bitmap_ptr) {
> +		kfree(tsk->thread.io_bitmap_ptr);
> +		tsk->thread.io_bitmap_ptr = NULL;
> +		tsk->thread.io_bitmap_max = 0;
> +	}
> +}
> +
> +int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
> +		    unsigned long arg, struct task_struct *p, unsigned long tls)
> +{
> +	struct inactive_task_frame *frame;
> +	struct fork_frame *fork_frame;
> +	struct pt_regs *childregs;
> +	int ret;
> +
> +	childregs = task_pt_regs(p);
> +	fork_frame = container_of(childregs, struct fork_frame, regs);
> +	frame = &fork_frame->frame;
> +
> +	frame->bp = 0;
> +	frame->ret_addr = (unsigned long) ret_from_fork;
> +	p->thread.sp = (unsigned long) fork_frame;
> +	p->thread.io_bitmap_ptr = NULL;
> +	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
> +
> +#ifdef CONFIG_X86_64
> +	savesegment(gs, p->thread.gsindex);
> +	p->thread.gsbase = p->thread.gsindex ? 0 : current->thread.gsbase;
> +	savesegment(fs, p->thread.fsindex);
> +	p->thread.fsbase = p->thread.fsindex ? 0 : current->thread.fsbase;
> +	savesegment(es, p->thread.es);
> +	savesegment(ds, p->thread.ds);
> +#else
> +	/* Clear all status flags including IF and set fixed bit. */
> +	frame->flags = X86_EFLAGS_FIXED;
> +#endif

Want to do another commit to make the eflags fixup unconditional?  I'm
wondering if we have a bug.

Other than these questions:

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 3/9] x86/cpu: Unify cpu_init()
  2019-11-06 19:35 ` [patch 3/9] x86/cpu: Unify cpu_init() Thomas Gleixner
@ 2019-11-08 22:34   ` Andy Lutomirski
  0 siblings, 0 replies; 62+ messages in thread
From: Andy Lutomirski @ 2019-11-08 22:34 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On 11/6/19 11:35 AM, Thomas Gleixner wrote:
> Similar to copy_thread_tls() the 32bit and 64bit implementations of
> cpu_init() are very similar and unification avoids duplicate changes in the
> future.
> 

I believe you :)

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user
  2019-11-07 14:08       ` Thomas Gleixner
@ 2019-11-08 22:41         ` Andy Lutomirski
  2019-11-08 23:45           ` Thomas Gleixner
  0 siblings, 1 reply; 62+ messages in thread
From: Andy Lutomirski @ 2019-11-08 22:41 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On 11/7/19 6:08 AM, Thomas Gleixner wrote:
> On Thu, 7 Nov 2019, Thomas Gleixner wrote:
>> On Thu, 7 Nov 2019, Peter Zijlstra wrote:
>>> On Wed, Nov 06, 2019 at 08:35:03PM +0100, Thomas Gleixner wrote:
>>>> There is no requirement to update the TSS I/O bitmap when a thread using it is
>>>> scheduled out and the incoming thread does not use it.
>>>>
>>>> For the permission check based on the TSS I/O bitmap the CPU calculates the memory
>>>> location of the I/O bitmap by the address of the TSS and the io_bitmap_base member
>>>> of the tss_struct. The easiest way to invalidate the I/O bitmap is to switch the
>>>> offset to an address outside of the TSS limit.
>>>>
>>>> If an I/O instruction is issued from user space the TSS limit causes #GP to be
>>>> raised in the same was as valid I/O bitmap with all bits set to 1 would do.
>>>>
>>>> This removes the extra work when an I/O bitmap using task is scheduled out
>>>> and puts the burden on the rare I/O bitmap users when they are scheduled
>>>> in.
>>>
>>> This also nicely aligns with that the context switch time is accounted
>>> to the next task. So by doing the expensive part on switch-in gets it
>>> all accounted to the task that caused it.
>>
>> Just that I can't add the storage to tss_struct due to the VMX insanity of
>> setting TSS limit hard to 0x67 on vmexit instead of restoring the host
>> value.
> 
> Well, I can. The build bugon in vmx.c is just bogus.

SDM vol 3 27.5.2 says the BUILD_BUG_ON is right.  Or am I
misunderstanding you?

I'm reasonably confident that the TSS limit is indeed 0x67 after VM
exit, and I wrote the existing code that tries to optimize this to avoid
LTR when not needed.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 2/9] x86/process: Unify copy_thread_tls()
  2019-11-08 22:31   ` Andy Lutomirski
@ 2019-11-08 23:43     ` Thomas Gleixner
  2019-11-10 12:36       ` Thomas Gleixner
  0 siblings, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-08 23:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On Fri, 8 Nov 2019, Andy Lutomirski wrote:
> On 11/6/19 11:35 AM, Thomas Gleixner wrote:
> > +static inline int copy_io_bitmap(struct task_struct *tsk)
> > +{
> > +	if (likely(!test_tsk_thread_flag(current, TIF_IO_BITMAP)))
> > +		return 0;
> > +
> > +	tsk->thread.io_bitmap_ptr = kmemdup(current->thread.io_bitmap_ptr,
> > +					    IO_BITMAP_BYTES, GFP_KERNEL);
> 
> tsk->thread.io_bitmap_max = current->thread.io_bitmap_max?
> 
> I realize you inherited this from the code you're refactoring, but it
> does seem to be missing.

It already got copied with the task struct :)
 
> > +#ifdef CONFIG_X86_64
> > +	savesegment(gs, p->thread.gsindex);
> > +	p->thread.gsbase = p->thread.gsindex ? 0 : current->thread.gsbase;
> > +	savesegment(fs, p->thread.fsindex);
> > +	p->thread.fsbase = p->thread.fsindex ? 0 : current->thread.fsbase;
> > +	savesegment(es, p->thread.es);
> > +	savesegment(ds, p->thread.ds);
> > +#else
> > +	/* Clear all status flags including IF and set fixed bit. */
> > +	frame->flags = X86_EFLAGS_FIXED;
> > +#endif
> 
> Want to do another commit to make the eflags fixup unconditional?  I'm
> wondering if we have a bug.

Let me look at that.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user
  2019-11-08 22:41         ` Andy Lutomirski
@ 2019-11-08 23:45           ` Thomas Gleixner
  2019-11-09  3:32             ` Andy Lutomirski
  0 siblings, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-08 23:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, LKML, x86, Stephen Hemminger, Willy Tarreau,
	Juergen Gross, Sean Christopherson, Linus Torvalds,
	H. Peter Anvin

On Fri, 8 Nov 2019, Andy Lutomirski wrote:
> On 11/7/19 6:08 AM, Thomas Gleixner wrote:
> > On Thu, 7 Nov 2019, Thomas Gleixner wrote:
> >> Just that I can't add the storage to tss_struct due to the VMX insanity of
> >> setting TSS limit hard to 0x67 on vmexit instead of restoring the host
> >> value.
> > 
> > Well, I can. The build bugon in vmx.c is just bogus.
> 
> SDM vol 3 27.5.2 says the BUILD_BUG_ON is right.  Or am I
> misunderstanding you?
> 
> I'm reasonably confident that the TSS limit is indeed 0x67 after VM
> exit, and I wrote the existing code that tries to optimize this to avoid
> LTR when not needed.

The BUILD_BUG_ON(IO_BITMAP_OFFSET - 1 == 0x67) in the VMX code is bogus in
two aspects:

1) This wants to be in generic x86 code

2) The IO_BITMAP_OFFSET is not the right thing to check because it makes
   asssumptions about the layout of tss_struct. Nothing requires that the
   I/O bitmap is placed right after x86_tss, which is the hardware mandated
   tss structure. It pointlessly makes restrictions on the struct
   tss_struct layout.

The proper thing to check is:

    - Offset of x86_tss in tss_struct is 0
    - Size of x86_tss == 0x68

We already have the page alignment sanity check off TSS in
cpu_entry_area.c. That's where this should have gone into in the first
place.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user
  2019-11-06 19:35 ` [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user Thomas Gleixner
  2019-11-07  9:12   ` Peter Zijlstra
@ 2019-11-09  0:24   ` Andy Lutomirski
  1 sibling, 0 replies; 62+ messages in thread
From: Andy Lutomirski @ 2019-11-09  0:24 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On 11/6/19 11:35 AM, Thomas Gleixner wrote:
> There is no requirement to update the TSS I/O bitmap when a thread using it is
> scheduled out and the incoming thread does not use it.
> 
> For the permission check based on the TSS I/O bitmap the CPU calculates the memory
> location of the I/O bitmap by the address of the TSS and the io_bitmap_base member
> of the tss_struct. The easiest way to invalidate the I/O bitmap is to switch the
> offset to an address outside of the TSS limit.
> 
> If an I/O instruction is issued from user space the TSS limit causes #GP to be
> raised in the same was as valid I/O bitmap with all bits set to 1 would do.
> 
> This removes the extra work when an I/O bitmap using task is scheduled out
> and puts the burden on the rare I/O bitmap users when they are scheduled
> in.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/include/asm/processor.h |   38 +++++++++++++++++--------
>  arch/x86/kernel/cpu/common.c     |    3 +
>  arch/x86/kernel/doublefault.c    |    2 -
>  arch/x86/kernel/process.c        |   59 +++++++++++++++++++++------------------
>  4 files changed, 61 insertions(+), 41 deletions(-)
> 
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -330,8 +330,23 @@ struct x86_hw_tss {
>  #define IO_BITMAP_BITS			65536
>  #define IO_BITMAP_BYTES			(IO_BITMAP_BITS/8)
>  #define IO_BITMAP_LONGS			(IO_BITMAP_BYTES/sizeof(long))
> -#define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
> -#define INVALID_IO_BITMAP_OFFSET	0x8000
> +
> +#define IO_BITMAP_OFFSET_VALID				\
> +	(offsetof(struct tss_struct, io_bitmap) -	\
> +	 offsetof(struct tss_struct, x86_tss))
> +
> +/*
> + * sizeof(unsigned long) coming from an extra "long" at the end
> + * of the iobitmap.
> + *
> + * -1? seg base+limit should be pointing to the address of the
> + * last valid byte

What's with the '?'

> + */
> +#define __KERNEL_TSS_LIMIT	\
> +	(IO_BITMAP_OFFSET_VALID + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)
> +
> +/* Base offset outside of TSS_LIMIT so unpriviledged IO causes #GP */
> +#define IO_BITMAP_OFFSET_INVALID	(__KERNEL_TSS_LIMIT + 1)
>  
>  struct entry_stack {
>  	unsigned long		words[64];
> @@ -350,6 +365,15 @@ struct tss_struct {
>  	struct x86_hw_tss	x86_tss;
>  
>  	/*
> +	 * Store the dirty size of the last io bitmap offender. The next
> +	 * one will have to do the cleanup as the switch out to a non
> +	 * io bitmap user will just set x86_tss.io_bitmap_base to a value
> +	 * outside of the TSS limit. So for sane tasks there is no need
> +	 * to actually touch the io_bitmap at all.
> +	 */
> +	unsigned int		io_bitmap_prev_max;

Hmm.  I'm wondering if a clearer way to say this would be:

io_bitmap_max_allowed_byte: offset from the start of the io bitmap to
the first byte that might contain a zero bit.  If we switch from a task
that uses ioperm() to one that does not, we will invalidate the io
bitmap by changing the offset.  The next task that starts using the io
bitmap again will need to make sure it updates all the bytes through
io_bitmap_max_allowed_byte.

But your description is okay, too.

It's worth noting that, due to Meltdown, this patch leaks the io bitmap
of io bitmap-using tasks.  I'm not sure I care.

> +
> +	/*
>  	 * The extra 1 is there because the CPU will access an
>  	 * additional byte beyond the end of the IO permission
>  	 * bitmap. The extra byte must be all 1 bits, and must
> @@ -360,16 +384,6 @@ struct tss_struct {
>  
>  DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw);
>  
> -/*
> - * sizeof(unsigned long) coming from an extra "long" at the end
> - * of the iobitmap.
> - *
> - * -1? seg base+limit should be pointing to the address of the
> - * last valid byte
> - */
> -#define __KERNEL_TSS_LIMIT	\
> -	(IO_BITMAP_OFFSET + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)
> -
>  /* Per CPU interrupt stacks */
>  struct irq_stack {
>  	char		stack[IRQ_STACK_SIZE];
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1863,7 +1863,8 @@ void cpu_init(void)
>  
>  	/* Initialize the TSS. */
>  	tss_setup_ist(tss);
> -	tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
> +	tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
> +	tss->io_bitmap_prev_max = 0;
>  	memset(tss->io_bitmap, 0xff, sizeof(tss->io_bitmap));
>  	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
>  
> --- a/arch/x86/kernel/doublefault.c
> +++ b/arch/x86/kernel/doublefault.c
> @@ -54,7 +54,7 @@ struct x86_hw_tss doublefault_tss __cach
>  	.sp0		= STACK_START,
>  	.ss0		= __KERNEL_DS,
>  	.ldt		= 0,
> -	.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
> +	.io_bitmap_base	= IO_BITMAP_OFFSET_INVALID,
>  
>  	.ip		= (unsigned long) doublefault_fn,
>  	/* 0x2 bit is always set */
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -72,18 +72,9 @@
>  #ifdef CONFIG_X86_32
>  		.ss0 = __KERNEL_DS,
>  		.ss1 = __KERNEL_CS,
> -		.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
>  #endif
> +		.io_bitmap_base	= IO_BITMAP_OFFSET_INVALID,
>  	 },
> -#ifdef CONFIG_X86_32
> -	 /*
> -	  * Note that the .io_bitmap member must be extra-big. This is because
> -	  * the CPU will access an additional byte beyond the end of the IO
> -	  * permission bitmap. The extra byte must be all 1 bits, and must
> -	  * be within the limit.
> -	  */
> -	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
> -#endif
>  };
>  EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
>  
> @@ -112,18 +103,18 @@ void exit_thread(struct task_struct *tsk
>  	struct thread_struct *t = &tsk->thread;
>  	unsigned long *bp = t->io_bitmap_ptr;
>  	struct fpu *fpu = &t->fpu;
> +	struct tss_struct *tss;
>  
>  	if (bp) {
> -		struct tss_struct *tss = &per_cpu(cpu_tss_rw, get_cpu());
> +		preempt_disable();
> +		tss = this_cpu_ptr(&cpu_tss_rw);
>  
>  		t->io_bitmap_ptr = NULL;
> -		clear_thread_flag(TIF_IO_BITMAP);
> -		/*
> -		 * Careful, clear this in the TSS too:
> -		 */
> -		memset(tss->io_bitmap, 0xff, t->io_bitmap_max);
>  		t->io_bitmap_max = 0;
> -		put_cpu();
> +		clear_thread_flag(TIF_IO_BITMAP);
> +		/* Invalidate the io bitmap base in the TSS */
> +		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;

The first time I read this, I thought this code was all unnecessary.
But now I think I understand.  How about a comment:

/*
 * We're about to free the IO bitmap.  We need to disable it in the TSS
 * too so that switch_to() doesn't see an inconsistent state.
 */

In any case:

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user
  2019-11-08 23:45           ` Thomas Gleixner
@ 2019-11-09  3:32             ` Andy Lutomirski
  2019-11-10 12:43               ` Thomas Gleixner
  0 siblings, 1 reply; 62+ messages in thread
From: Andy Lutomirski @ 2019-11-09  3:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Peter Zijlstra, LKML, x86, Stephen Hemminger,
	Willy Tarreau, Juergen Gross, Sean Christopherson,
	Linus Torvalds, H. Peter Anvin



> On Nov 8, 2019, at 3:45 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Fri, 8 Nov 2019, Andy Lutomirski wrote:
>>> On 11/7/19 6:08 AM, Thomas Gleixner wrote:
>>> On Thu, 7 Nov 2019, Thomas Gleixner wrote:
>>>> Just that I can't add the storage to tss_struct due to the VMX insanity of
>>>> setting TSS limit hard to 0x67 on vmexit instead of restoring the host
>>>> value.
>>> 
>>> Well, I can. The build bugon in vmx.c is just bogus.
>> 
>> SDM vol 3 27.5.2 says the BUILD_BUG_ON is right.  Or am I
>> misunderstanding you?
>> 
>> I'm reasonably confident that the TSS limit is indeed 0x67 after VM
>> exit, and I wrote the existing code that tries to optimize this to avoid
>> LTR when not needed.
> 
> The BUILD_BUG_ON(IO_BITMAP_OFFSET - 1 == 0x67) in the VMX code is bogus in
> two aspects:
> 
> 1) This wants to be in generic x86 code

I think disagree. The only thing special about 0x67 is that VMX hard codes it. It’s specifically a VMX-ism. So I think the VMX code should indeed assert that 0x67 is a safe value.

> 
> 2) The IO_BITMAP_OFFSET is not the right thing to check because it makes
>   asssumptions about the layout of tss_struct. Nothing requires that the
>   I/O bitmap is placed right after x86_tss, which is the hardware mandated
>   tss structure. It pointlessly makes restrictions on the struct
>   tss_struct layout.

I agree with this.

> 
> The proper thing to check is:
> 
>    - Offset of x86_tss in tss_struct is 0
>    - Size of x86_tss == 0x68
> 
> We already have the page alignment sanity check off TSS in
> cpu_entry_area.c. That's where this should have gone into in the first
> place.

> 
> Thanks,
> 
>    tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 2/9] x86/process: Unify copy_thread_tls()
  2019-11-08 23:43     ` Thomas Gleixner
@ 2019-11-10 12:36       ` Thomas Gleixner
  2019-11-10 16:56         ` Andy Lutomirski
  0 siblings, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-10 12:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On Sat, 9 Nov 2019, Thomas Gleixner wrote:
> On Fri, 8 Nov 2019, Andy Lutomirski wrote:
> > On 11/6/19 11:35 AM, Thomas Gleixner wrote:
> > > +static inline int copy_io_bitmap(struct task_struct *tsk)
> > > +{
> > > +	if (likely(!test_tsk_thread_flag(current, TIF_IO_BITMAP)))
> > > +		return 0;
> > > +
> > > +	tsk->thread.io_bitmap_ptr = kmemdup(current->thread.io_bitmap_ptr,
> > > +					    IO_BITMAP_BYTES, GFP_KERNEL);
> > 
> > tsk->thread.io_bitmap_max = current->thread.io_bitmap_max?
> > 
> > I realize you inherited this from the code you're refactoring, but it
> > does seem to be missing.
> 
> It already got copied with the task struct :)
>  
> > > +#ifdef CONFIG_X86_64
> > > +	savesegment(gs, p->thread.gsindex);
> > > +	p->thread.gsbase = p->thread.gsindex ? 0 : current->thread.gsbase;
> > > +	savesegment(fs, p->thread.fsindex);
> > > +	p->thread.fsbase = p->thread.fsindex ? 0 : current->thread.fsbase;
> > > +	savesegment(es, p->thread.es);
> > > +	savesegment(ds, p->thread.ds);
> > > +#else
> > > +	/* Clear all status flags including IF and set fixed bit. */
> > > +	frame->flags = X86_EFLAGS_FIXED;
> > > +#endif
> > 
> > Want to do another commit to make the eflags fixup unconditional?  I'm
> > wondering if we have a bug.
> 
> Let me look at that.

64bit does not have flags in the inactive_task_frame ...

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user
  2019-11-09  3:32             ` Andy Lutomirski
@ 2019-11-10 12:43               ` Thomas Gleixner
  0 siblings, 0 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-10 12:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, LKML, x86, Stephen Hemminger,
	Willy Tarreau, Juergen Gross, Sean Christopherson,
	Linus Torvalds, H. Peter Anvin

[-- Attachment #1: Type: text/plain, Size: 976 bytes --]

On Fri, 8 Nov 2019, Andy Lutomirski wrote:
> > On Nov 8, 2019, at 3:45 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 8 Nov 2019, Andy Lutomirski wrote:
> >> SDM vol 3 27.5.2 says the BUILD_BUG_ON is right.  Or am I
> >> misunderstanding you?
> >> 
> >> I'm reasonably confident that the TSS limit is indeed 0x67 after VM
> >> exit, and I wrote the existing code that tries to optimize this to avoid
> >> LTR when not needed.
> > 
> > The BUILD_BUG_ON(IO_BITMAP_OFFSET - 1 == 0x67) in the VMX code is bogus in
> > two aspects:
> > 
> > 1) This wants to be in generic x86 code
> 
> I think disagree. The only thing special about 0x67 is that VMX hard
> codes it. It’s specifically a VMX-ism. So I think the VMX code should
> indeed assert that 0x67 is a safe value.

Yes, it is a VMX specific issue, but I really prefer the build to fail in
the common code without having to enable VMX if something changes the
TSS layout in incompatible ways.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 2/9] x86/process: Unify copy_thread_tls()
  2019-11-10 12:36       ` Thomas Gleixner
@ 2019-11-10 16:56         ` Andy Lutomirski
  2019-11-11  8:52           ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: Andy Lutomirski @ 2019-11-10 16:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Stephen Hemminger, Willy Tarreau,
	Juergen Gross, Sean Christopherson, Linus Torvalds,
	H. Peter Anvin

On Sun, Nov 10, 2019 at 4:36 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sat, 9 Nov 2019, Thomas Gleixner wrote:
> > On Fri, 8 Nov 2019, Andy Lutomirski wrote:
> > > On 11/6/19 11:35 AM, Thomas Gleixner wrote:
> > > > +static inline int copy_io_bitmap(struct task_struct *tsk)
> > > > +{
> > > > + if (likely(!test_tsk_thread_flag(current, TIF_IO_BITMAP)))
> > > > +         return 0;
> > > > +
> > > > + tsk->thread.io_bitmap_ptr = kmemdup(current->thread.io_bitmap_ptr,
> > > > +                                     IO_BITMAP_BYTES, GFP_KERNEL);
> > >
> > > tsk->thread.io_bitmap_max = current->thread.io_bitmap_max?
> > >
> > > I realize you inherited this from the code you're refactoring, but it
> > > does seem to be missing.
> >
> > It already got copied with the task struct :)
> >
> > > > +#ifdef CONFIG_X86_64
> > > > + savesegment(gs, p->thread.gsindex);
> > > > + p->thread.gsbase = p->thread.gsindex ? 0 : current->thread.gsbase;
> > > > + savesegment(fs, p->thread.fsindex);
> > > > + p->thread.fsbase = p->thread.fsindex ? 0 : current->thread.fsbase;
> > > > + savesegment(es, p->thread.es);
> > > > + savesegment(ds, p->thread.ds);
> > > > +#else
> > > > + /* Clear all status flags including IF and set fixed bit. */
> > > > + frame->flags = X86_EFLAGS_FIXED;
> > > > +#endif
> > >
> > > Want to do another commit to make the eflags fixup unconditional?  I'm
> > > wondering if we have a bug.
> >
> > Let me look at that.
>
> 64bit does not have flags in the inactive_task_frame ...
>

Hmm.  One more thing to unify, I guess.

--Andy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07  8:25     ` Ingo Molnar
  2019-11-07  9:17       ` Willy Tarreau
@ 2019-11-10 17:17       ` Andy Lutomirski
  1 sibling, 0 replies; 62+ messages in thread
From: Andy Lutomirski @ 2019-11-10 17:17 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds
  Cc: Thomas Gleixner, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, H. Peter Anvin

On 11/7/19 12:25 AM, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Wed, Nov 6, 2019 at 12:57 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>
>>> Calculate both the position of the first zero bit and the last zero bit to
>>> limit the range which needs to be copied. This does not solve the problem
>>> when the previous tasked had only byte 0 cleared and the next one has only
>>> byte 65535 cleared, but trying to solve that would be too complex and
>>> heavyweight for the context switch path. As the ioperm() usage is very rare
>>> the case which is optimized is the single task/process which uses ioperm().
>>
>> Hmm.
>>
>> I may read this patch wrong, but from what I can tell, if we really
>> have just one process with an io bitmap, we're doing unnecessary
>> copies.
>>
>> If we really have just one process that has an iobitmap, I think we
>> could just keep the bitmap of that process entirely unchanged. Then,
>> when we switch away from it, we set the io_bitmap_base to an invalid
>> base outside the TSS segment, and when we switch back, we set it back
>> to the valid one. No actual bitmap copies at all.
>>
>> So I think that rather than the "begin/end offset" games, we should
>> perhaps have a "what was the last process that used the IO bitmap for
>> this TSS" pointer (and, I think, some sequence counter, so that when
>> the process updates its bitmap, it invalidates that case)?
>>
>>  Of course, you can do *nboth*, but if we really think that the common
>> case is "one special process", then I think the begin/end offset is
>> useless, but a "last bitmap process" would be very useful.
>>
>> Am I missing something?
> 
> In fact on SMP systems this would result in a very nice optimization: 
> pretty quickly *all* TSS's would be populated with that single task's 
> bitmap, and it would persist even across migrations from CPU to CPU.
> 
> I'd love to get rid of the offset caching and bit scanning games as well 
> - it doesn't really help in a number of common scenarios and it 
> complicates this non-trivial piece of code a *LOT* - and we probably 
> don't really have the natural testing density of this code anymore to 
> find any regressions quickly.

I think we should not over-optimize this.  I am all for penalizing
ioperm() and iopl() users as much as is convenient for us.  There is
simply no legitimate use case.  Sorry, DPDK, but "virtio-legacy sucks,
let's optimize the crap out of something that is slow anyway and use
iopl()" is not a good excuse.  Just use the %*!7 syscall to write to
/sys/.../resource0 and suck up the probably negligible performance hit.
And tell your customers to upgrade their hypervisors.  And quite
kvetching about performance of the control place on an old
software-emulated NIC while you're at it.

For the TLB case, it's worth tracking who last used which ASID and
whether it's still up to date, since *everyone* uses the MMU.  For
ioperm, I don't really believe this is worth it.

> 
> So intuitively I'd suggest we gravitate towards the simplest 
> implementation, with a good caching optimization for the single-task 
> case.

I agree with the first bit, but caching on an SMP system is necessarily
subtle.  Some kind of invalidation is needed.

> 
> I.e. the model I'm suggesting is that if a task uses ioperm() or iopl() 
> then it should have a bitmap from that point on until exit(), even if 
> it's all zeroes or all ones. Most applications that are using those 
> primitives really need it all the time and are using just a few ioports, 
> so all the tracking doesn't help much anyway.
> 
> On a related note, another simplification would be that in principle we 
> could also use just a single bitmap and emulate iopl() as an ioperm(all) 
> or ioperm(none) calls. Yeah, it's not fully ABI compatible for mixed 
> ioperm()/iopl() uses, but is that ABI actually being relied on in 
> practice?
> 

Let's please keep the ABI.  Or rather, let's attempt to eventually
remove the ABI, but let's not change it in the mean time please.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further
  2019-11-07 21:44         ` Linus Torvalds
  2019-11-08  1:12           ` H. Peter Anvin
@ 2019-11-10 17:21           ` Andy Lutomirski
  1 sibling, 0 replies; 62+ messages in thread
From: Andy Lutomirski @ 2019-11-10 17:21 UTC (permalink / raw)
  To: Linus Torvalds, Brian Gerst
  Cc: Thomas Gleixner, LKML, the arch/x86 maintainers,
	Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, H. Peter Anvin

On 11/7/19 1:44 PM, Linus Torvalds wrote:
> On Thu, Nov 7, 2019 at 1:00 PM Brian Gerst <brgerst@gmail.com> wrote:
>>
>> There wouldn't have to be a flush on every task switch.
> 
> No. But we'd have to flush on any switch that currently does that memcpy.
> 
> And my point is that a tlb flush (even the single-page case) is likely
> more expensive than the memcpy.
> 
>> Going a step further, we could track which task is mapped to the
>> current cpu like proposed above, and only flush when a different task
>> needs the IO bitmap, or when the bitmap is being freed on task exit.
> 
> Well, that's exactly my "track the last task" optimization for copying
> the thing.
> 
> IOW, it's the same optimization as avoiding the memcpy.
> 
> Which I think is likely very effective, but also makes it fairly
> pointless to then try to be clever..
> 
> So the basic issue remains that playing VM games has almost
> universally been slower and more complex than simply not playing VM
> games. TLB flushes - even invlpg - tends to be pretty slow.
> 

With my TLB-handling-writing-and-reviewing code on, NAK to any VM games
here.

Honestly, I almost think we should unconditionally copy the whole 8K for
the case where the ioperm() syscall has been used (i.e. not emulating
iopl()).  The benefit simply does not justify the risk of getting it
wrong.  I'm okay, but barely, with optimizing the end of the copied
range.  Optimizing the start of the copied range is pushing it.  Playing
MMU tricks and getting all the per-task-ioperm and invalidation right is
way beyond reasonable.

Even the time spent discussing how to optimize a case that has literally
one known user that none of us can bring ourselves to care about much
seems wasteful.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 7/9] x86/iopl: Restrict iopl() permission scope
  2019-11-06 19:35 ` [patch 7/9] x86/iopl: Restrict iopl() permission scope Thomas Gleixner
  2019-11-07  9:09   ` Peter Zijlstra
@ 2019-11-10 17:26   ` Andy Lutomirski
  2019-11-10 20:31     ` Thomas Gleixner
  1 sibling, 1 reply; 62+ messages in thread
From: Andy Lutomirski @ 2019-11-10 17:26 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On 11/6/19 11:35 AM, Thomas Gleixner wrote:
> The access to the full I/O port range can be also provided by the TSS I/O
> bitmap, but that would require to copy 8k of data on scheduling in the
> task. As shown with the sched out optimization TSS.io_bitmap_base can be
> used to switch the incoming task to a preallocated I/O bitmap which has all
> bits zero, i.e. allows access to all I/O ports.
> 
> Implementing this allows to provide an iopl() emulation mode which restricts
> the IOPL level 3 permissions to I/O port access but removes the STI/CLI
> permission which is coming with the hardware IOPL mechansim.
> 
> Provide a config option to switch IOPL to emulation mode, make it the
> default and while at it also provide an option to disable IOPL completely.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/Kconfig                 |   32 ++++++++++++++++
>  arch/x86/include/asm/processor.h |   24 +++++++++---
>  arch/x86/kernel/cpu/common.c     |    4 +-
>  arch/x86/kernel/ioport.c         |   75 ++++++++++++++++++++++++++++++---------
>  arch/x86/kernel/process.c        |   15 +++++--
>  5 files changed, 123 insertions(+), 27 deletions(-)
> 
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1254,6 +1254,38 @@ config X86_VSYSCALL_EMULATION
>  	 Disabling this option saves about 7K of kernel size and
>  	 possibly 4K of additional runtime pagetable memory.
>  
> +choice
> +	prompt "IOPL"
> +	default X86_IOPL_EMULATION
> +
> +config X86_IOPL_EMULATION
> +	bool "IOPL Emulation"
> +	---help---
> +	  Legacy IOPL support is an overbroad mechanism which allows user
> +	  space aside of accessing all 65536 I/O ports also to disable
> +	  interrupts. To gain this access the caller needs CAP_SYS_RAWIO
> +	  capabilities and permission from eventually active security
> +	  modules.
> +
> +	  The emulation restricts the functionality of the syscall to
> +	  only allowing the full range I/O port access, but prevents the
> +	  ability to disable interrupts from user space.
> +
> +config X86_IOPL_LEGACY
> +	bool "IOPL Legacy"
> +	---help---
> +	Allow the full IOPL permissions, i.e. user space access to all
> +	65536 I/O ports and also the ability to disable interrupts, which
> +	is overbroad and can result in system lockups.
> +
> +config X86_IOPL_NONE
> +	bool "IOPL None"
> +	---help---
> +	Disable the IOPL permission syscall. That's the safest option as
> +	no sane application should depend on this functionality.
> +
> +endchoice
> +
>  config TOSHIBA
>  	tristate "Toshiba Laptop support"
>  	depends on X86_32
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -331,8 +331,12 @@ struct x86_hw_tss {
>  #define IO_BITMAP_BYTES			(IO_BITMAP_BITS/8)
>  #define IO_BITMAP_LONGS			(IO_BITMAP_BYTES/sizeof(long))
>  
> -#define IO_BITMAP_OFFSET_VALID				\
> -	(offsetof(struct tss_struct, io_bitmap) -	\
> +#define IO_BITMAP_OFFSET_VALID_MAP			\
> +	(offsetof(struct tss_struct, io_bitmap_map) -	\
> +	 offsetof(struct tss_struct, x86_tss))
> +
> +#define IO_BITMAP_OFFSET_VALID_ALL			\
> +	(offsetof(struct tss_struct, io_bitmap_all) -	\
>  	 offsetof(struct tss_struct, x86_tss))
>  
>  /*
> @@ -343,7 +347,7 @@ struct x86_hw_tss {
>   * last valid byte
>   */
>  #define __KERNEL_TSS_LIMIT	\
> -	(IO_BITMAP_OFFSET_VALID + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)
> +	(IO_BITMAP_OFFSET_VALID_ALL + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)
>  
>  /* Base offset outside of TSS_LIMIT so unpriviledged IO causes #GP */
>  #define IO_BITMAP_OFFSET_INVALID	(__KERNEL_TSS_LIMIT + 1)
> @@ -379,7 +383,8 @@ struct tss_struct {
>  	 * byte beyond the end of the I/O permission bitmap. The extra byte
>  	 * must have all bits set and must be within the TSS limit.
>  	 */
> -	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
> +	unsigned long		io_bitmap_map[IO_BITMAP_LONGS + 1];
> +	unsigned long		io_bitmap_all[IO_BITMAP_LONGS + 1];
>  } __aligned(PAGE_SIZE);
>  
>  DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw);
> @@ -495,7 +500,6 @@ struct thread_struct {
>  #endif
>  	/* IO permissions: */
>  	unsigned long		*io_bitmap_ptr;
> -	unsigned long		iopl;
>  	/*
>  	 * The byte range in the I/O permission bitmap which contains zero
>  	 * bits.
> @@ -503,6 +507,13 @@ struct thread_struct {
>  	unsigned int		io_zerobits_start;
>  	unsigned int		io_zerobits_end;
>  
> +	/*
> +	 * IOPL. Priviledge level dependent I/O permission which includes
> +	 * user space CLI/STI when granted.
> +	 */
> +	unsigned long		iopl;
> +	unsigned long		iopl_emul;
> +
>  	mm_segment_t		addr_limit;
>  
>  	unsigned int		sig_on_uaccess_err:1;
> @@ -524,6 +535,9 @@ static inline void arch_thread_struct_wh
>  	*size = fpu_kernel_xstate_size;
>  }
>  
> +extern void tss_update_io_bitmap(struct tss_struct *tss,
> +				 struct thread_struct *thread);
> +
>  /*
>   * Thread-synchronous status.
>   *
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1866,7 +1866,9 @@ void cpu_init(void)
>  	tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
>  	tss->io_zerobits_start = IO_BITMAP_BYTES;
>  	tss->io_zerobits_end = 0;
> -	memset(tss->io_bitmap, 0xff, sizeof(tss->io_bitmap));
> +	memset(tss->io_bitmap_map, 0xff, sizeof(tss->io_bitmap_map));
> +	/* Invalidate the extra tail entry of the allow all I/O bitmap */
> +	tss->io_bitmap_all[IO_BITMAP_LONGS] = ~0UL;
>  	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
>  
>  	load_TR_desc();
> --- a/arch/x86/kernel/ioport.c
> +++ b/arch/x86/kernel/ioport.c
> @@ -109,7 +109,7 @@ long ksys_ioperm(unsigned long from, uns
>  	}
>  
>  	/* Copy the changed range over to the TSS bitmap */
> -	dst = (char *)tss->io_bitmap;
> +	dst = (char *)tss->io_bitmap_map;
>  	src = (char *)bitmap;
>  	memcpy(dst + copy_start, src + copy_start, copy_len);
>  
> @@ -130,7 +130,7 @@ long ksys_ioperm(unsigned long from, uns
>  		 * is correct. It might have been wreckaged by a VMEXiT.
>  		 */
>  		set_thread_flag(TIF_IO_BITMAP);
> -		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
> +		tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID_MAP;
>  		refresh_tss_limit();
>  	}
>  
> @@ -184,36 +184,77 @@ SYSCALL_DEFINE3(ioperm, unsigned long, f
>   */
>  SYSCALL_DEFINE1(iopl, unsigned int, level)
>  {
> -	struct pt_regs *regs = current_pt_regs();
>  	struct thread_struct *t = &current->thread;
> +	struct pt_regs *regs = current_pt_regs();
> +	unsigned int old;
>  
>  	/*
>  	 * Careful: the IOPL bits in regs->flags are undefined under Xen PV
>  	 * and changing them has no effect.
>  	 */
> -	unsigned int old = t->iopl >> X86_EFLAGS_IOPL_BIT;
> +	if (IS_ENABLED(CONFIG_X86_IOPL_NONE))
> +		return -ENOSYS;
>  
>  	if (level > 3)
>  		return -EINVAL;
> +
> +	if (IS_ENABLED(CONFIG_X86_IOPL_EMULATION))
> +		old = t->iopl_emul;
> +	else
> +		old = t->iopl >> X86_EFLAGS_IOPL_BIT;
> +
> +	/* No point in going further if nothing changes */
> +	if (level == old)
> +		return 0;
> +
>  	/* Trying to gain more privileges? */
>  	if (level > old) {
>  		if (!capable(CAP_SYS_RAWIO) ||
>  		    security_locked_down(LOCKDOWN_IOPORT))
>  			return -EPERM;
>  	}
> -	/*
> -	 * Change the flags value on the return stack, which has been set
> -	 * up on system-call entry. See also the fork and signal handling
> -	 * code how this is handled.
> -	 */
> -	regs->flags = (regs->flags & ~X86_EFLAGS_IOPL) |
> -		(level << X86_EFLAGS_IOPL_BIT);
> -	/* Store the new level in the thread struct */
> -	t->iopl = level << X86_EFLAGS_IOPL_BIT;
> -	/*
> -	 * X86_32 switches immediately and XEN handles it via emulation.
> -	 */
> -	set_iopl_mask(t->iopl);
> +
> +	if (IS_ENABLED(CONFIG_X86_IOPL_EMULATION)) {
> +		struct tss_struct *tss;
> +		unsigned int tss_base;
> +
> +		/* Prevent racing against a task switch */
> +		preempt_disable();
> +		tss = this_cpu_ptr(&cpu_tss_rw);
> +		if (level == 3) {
> +			/* Grant access to all I/O ports */
> +			set_thread_flag(TIF_IO_BITMAP);
> +			tss_base = IO_BITMAP_OFFSET_VALID_ALL;

Where is the actual TSS updated?

> +		} else if (t->io_bitmap_ptr) {
> +			/* Thread has a I/O bitmap */
> +			tss_update_io_bitmap(tss, t);
> +			set_thread_flag(TIF_IO_BITMAP);
> +			tss_base = IO_BITMAP_OFFSET_VALID_MAP;
> +		} else {
> +			/* Take it out of the context switch work burden */
> +			clear_thread_flag(TIF_IO_BITMAP);
> +			tss_base = IO_BITMAP_OFFSET_INVALID;

Ditto.

I think what you need to do is have a single function, called by
exit_thread(), switch_to(), and here, that updates the TSS to match a
given task's IO bitmap state.  This is probably switch_to_bitmap() or
similar.

(Maybe it already is, but I swear I checked all the patches in the
series and I can't find the body of tss_update_io_bitmap().  But you
should call it in all branches of this if-else thing.)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 7/9] x86/iopl: Restrict iopl() permission scope
  2019-11-10 17:26   ` Andy Lutomirski
@ 2019-11-10 20:31     ` Thomas Gleixner
  2019-11-10 21:05       ` Andy Lutomirski
  0 siblings, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-10 20:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, x86, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On Sun, 10 Nov 2019, Andy Lutomirski wrote:
> On 11/6/19 11:35 AM, Thomas Gleixner wrote:
> > +
> > +	if (IS_ENABLED(CONFIG_X86_IOPL_EMULATION)) {
> > +		struct tss_struct *tss;
> > +		unsigned int tss_base;
> > +
> > +		/* Prevent racing against a task switch */
> > +		preempt_disable();
> > +		tss = this_cpu_ptr(&cpu_tss_rw);
> > +		if (level == 3) {
> > +			/* Grant access to all I/O ports */
> > +			set_thread_flag(TIF_IO_BITMAP);
> > +			tss_base = IO_BITMAP_OFFSET_VALID_ALL;
> 
> Where is the actual TSS updated?

Here. It sets the offset to the all zero bitmap. That's all it needs.

> > +		} else if (t->io_bitmap_ptr) {
> > +			/* Thread has a I/O bitmap */
> > +			tss_update_io_bitmap(tss, t);
> > +			set_thread_flag(TIF_IO_BITMAP);
> > +			tss_base = IO_BITMAP_OFFSET_VALID_MAP;
> > +		} else {
> > +			/* Take it out of the context switch work burden */
> > +			clear_thread_flag(TIF_IO_BITMAP);
> > +			tss_base = IO_BITMAP_OFFSET_INVALID;
> 
> Ditto.
> 
> I think what you need to do is have a single function, called by
> exit_thread(), switch_to(), and here, that updates the TSS to match a
> given task's IO bitmap state.  This is probably switch_to_bitmap() or
> similar.

Well, no. exit_thread() and this here actually fiddle with the TIF bit
which is not what the switch to case does. There is some stuff which can be
shared.

> (Maybe it already is, but I swear I checked all the patches in the
> series and I can't find the body of tss_update_io_bitmap().  But you
> should call it in all branches of this if-else thing.)

It's in that very same patch:

> -static void tss_update_io_bitmap(struct tss_struct *tss,
> -                                struct thread_struct *thread)
> +void tss_update_io_bitmap(struct tss_struct *tss, struct thread_struct *thread)
>  {

Let me try to get a bit more reuse. Which still leaves the TIF bit fiddling
in this code path.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 7/9] x86/iopl: Restrict iopl() permission scope
  2019-11-10 20:31     ` Thomas Gleixner
@ 2019-11-10 21:05       ` Andy Lutomirski
  2019-11-10 21:21         ` Thomas Gleixner
  0 siblings, 1 reply; 62+ messages in thread
From: Andy Lutomirski @ 2019-11-10 21:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Stephen Hemminger, Willy Tarreau,
	Juergen Gross, Sean Christopherson, Linus Torvalds,
	H. Peter Anvin

On Sun, Nov 10, 2019 at 12:31 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sun, 10 Nov 2019, Andy Lutomirski wrote:
> > On 11/6/19 11:35 AM, Thomas Gleixner wrote:
> > > +
> > > +   if (IS_ENABLED(CONFIG_X86_IOPL_EMULATION)) {
> > > +           struct tss_struct *tss;
> > > +           unsigned int tss_base;
> > > +
> > > +           /* Prevent racing against a task switch */
> > > +           preempt_disable();
> > > +           tss = this_cpu_ptr(&cpu_tss_rw);
> > > +           if (level == 3) {
> > > +                   /* Grant access to all I/O ports */
> > > +                   set_thread_flag(TIF_IO_BITMAP);
> > > +                   tss_base = IO_BITMAP_OFFSET_VALID_ALL;
> >
> > Where is the actual TSS updated?
>
> Here. It sets the offset to the all zero bitmap. That's all it needs.

Ah, I see.

> > I think what you need to do is have a single function, called by
> > exit_thread(), switch_to(), and here, that updates the TSS to match a
> > given task's IO bitmap state.  This is probably switch_to_bitmap() or
> > similar.
>
> Well, no. exit_thread() and this here actually fiddle with the TIF bit
> which is not what the switch to case does. There is some stuff which can be
> shared.

I was thinking that the code that read iopl_emul and t->io_bitmap_ptr
and updated x86_tss.io_bitmap_base could be factored out of
switch_to().  Suppose you call that tss_update_io_bitmap().  And you
could have another tiny helper:

static void update_tif_io_bitmap(void)
{
  if (...->iopl_emul || ...->io_bitmap_ptr)
    set_thread_flag(TIF_IO_BITMAP);
  else
    clear_thread_flag(TIF_IO_BITMAP);
}

Then the whole iopl emulation path becomes:

preempt_disable();
...->iopl_emul = iopl;
update_tif_io_bitmap();
tss_update_io_bitmap();

and switch_to() does

if (new or prev has TIF_IO_BITMAP)
  tss_update_io_bitmap();

and exit_thread() does:

...->iopl_emul = 0;
kfree(...);
...->io_bitmap_ptr = 0;
clear_thread_flag(TIF_IO_BITMAP) or update_tif_io_bitmap();
tss_update_io_bitmap();

and ioperm() does:

kmalloc (if needed);
update the thread_struct copy of the bitmap.
set_thread_flag(TIF_IO_BITMAP) or update_tif_io_bitmap();
tss_update_io_bitmap();

Does that make sense?

>
> > (Maybe it already is, but I swear I checked all the patches in the
> > series and I can't find the body of tss_update_io_bitmap().  But you
> > should call it in all branches of this if-else thing.)
>
> It's in that very same patch:
>
> > -static void tss_update_io_bitmap(struct tss_struct *tss,
> > -                                struct thread_struct *thread)

But where did the line you just deleted come from?  I'm obviously just
bad at reading emails somewhere.

> > +void tss_update_io_bitmap(struct tss_struct *tss, struct thread_struct *thread)
> >  {
>
> Let me try to get a bit more reuse. Which still leaves the TIF bit fiddling
> in this code path.
>
> Thanks,
>
>         tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 7/9] x86/iopl: Restrict iopl() permission scope
  2019-11-10 21:05       ` Andy Lutomirski
@ 2019-11-10 21:21         ` Thomas Gleixner
  0 siblings, 0 replies; 62+ messages in thread
From: Thomas Gleixner @ 2019-11-10 21:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Stephen Hemminger, Willy Tarreau, Juergen Gross,
	Sean Christopherson, Linus Torvalds, H. Peter Anvin

On Sun, 10 Nov 2019, Andy Lutomirski wrote:
> On Sun, Nov 10, 2019 at 12:31 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Sun, 10 Nov 2019, Andy Lutomirski wrote:
> > > On 11/6/19 11:35 AM, Thomas Gleixner wrote:
> > > > +
> > > > +   if (IS_ENABLED(CONFIG_X86_IOPL_EMULATION)) {
> > > > +           struct tss_struct *tss;
> > > > +           unsigned int tss_base;
> > > > +
> > > > +           /* Prevent racing against a task switch */
> > > > +           preempt_disable();
> > > > +           tss = this_cpu_ptr(&cpu_tss_rw);
> > > > +           if (level == 3) {
> > > > +                   /* Grant access to all I/O ports */
> > > > +                   set_thread_flag(TIF_IO_BITMAP);
> > > > +                   tss_base = IO_BITMAP_OFFSET_VALID_ALL;
> > >
> > > Where is the actual TSS updated?
> >
> > Here. It sets the offset to the all zero bitmap. That's all it needs.
> 
> Ah, I see.
> 
> > > I think what you need to do is have a single function, called by
> > > exit_thread(), switch_to(), and here, that updates the TSS to match a
> > > given task's IO bitmap state.  This is probably switch_to_bitmap() or
> > > similar.
> >
> > Well, no. exit_thread() and this here actually fiddle with the TIF bit
> > which is not what the switch to case does. There is some stuff which can be
> > shared.
> 
> I was thinking that the code that read iopl_emul and t->io_bitmap_ptr
> and updated x86_tss.io_bitmap_base could be factored out of
> switch_to().  Suppose you call that tss_update_io_bitmap().  And you
> could have another tiny helper:
> 
> static void update_tif_io_bitmap(void)
> {
>   if (...->iopl_emul || ...->io_bitmap_ptr)
>     set_thread_flag(TIF_IO_BITMAP);
>   else
>     clear_thread_flag(TIF_IO_BITMAP);
> }
> 
> Then the whole iopl emulation path becomes:

Yes. I'm almost done with that already :)
> > It's in that very same patch:
> >
> > > -static void tss_update_io_bitmap(struct tss_struct *tss,
> > > -                                struct thread_struct *thread)
> 
> But where did the line you just deleted come from?  I'm obviously just
> bad at reading emails somewhere.

patch 5/9 :)
 
Let me finish that stuff and send out another version which has that reuse
of the switch to code in a separate patch.

There is another new detail in the series. I split out all the io bitmap
data into a separate data struct, which also allows me to stick a refcount
into it which removes the kmemdup() from the fork path.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 2/9] x86/process: Unify copy_thread_tls()
  2019-11-10 16:56         ` Andy Lutomirski
@ 2019-11-11  8:52           ` Peter Zijlstra
  0 siblings, 0 replies; 62+ messages in thread
From: Peter Zijlstra @ 2019-11-11  8:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Stephen Hemminger, Willy Tarreau,
	Juergen Gross, Sean Christopherson, Linus Torvalds,
	H. Peter Anvin

On Sun, Nov 10, 2019 at 08:56:50AM -0800, Andy Lutomirski wrote:
> On Sun, Nov 10, 2019 at 4:36 AM Thomas Gleixner <tglx@linutronix.de> wrote:

> > 64bit does not have flags in the inactive_task_frame ...
> >
> 
> Hmm.  One more thing to unify, I guess.

All that takes is porting objtool to 32bit.

See commit: 64604d54d311 ("sched/x86_64: Don't save flags on context switch")


^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, back to index

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-06 19:34 [patch 0/9] x86/iopl: Prevent user space from using CLI/STI with iopl(3) Thomas Gleixner
2019-11-06 19:35 ` [patch 1/9] x86/ptrace: Prevent truncation of bitmap size Thomas Gleixner
2019-11-07  7:31   ` Ingo Molnar
2019-11-06 19:35 ` [patch 2/9] x86/process: Unify copy_thread_tls() Thomas Gleixner
2019-11-08 22:31   ` Andy Lutomirski
2019-11-08 23:43     ` Thomas Gleixner
2019-11-10 12:36       ` Thomas Gleixner
2019-11-10 16:56         ` Andy Lutomirski
2019-11-11  8:52           ` Peter Zijlstra
2019-11-06 19:35 ` [patch 3/9] x86/cpu: Unify cpu_init() Thomas Gleixner
2019-11-08 22:34   ` Andy Lutomirski
2019-11-06 19:35 ` [patch 4/9] x86/io: Speedup schedule out of I/O bitmap user Thomas Gleixner
2019-11-07  9:12   ` Peter Zijlstra
2019-11-07 14:04     ` Thomas Gleixner
2019-11-07 14:08       ` Thomas Gleixner
2019-11-08 22:41         ` Andy Lutomirski
2019-11-08 23:45           ` Thomas Gleixner
2019-11-09  3:32             ` Andy Lutomirski
2019-11-10 12:43               ` Thomas Gleixner
2019-11-09  0:24   ` Andy Lutomirski
2019-11-06 19:35 ` [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further Thomas Gleixner
2019-11-07  1:11   ` Linus Torvalds
2019-11-07  7:44     ` Thomas Gleixner
2019-11-07  8:25     ` Ingo Molnar
2019-11-07  9:17       ` Willy Tarreau
2019-11-07 10:00         ` Thomas Gleixner
2019-11-07 10:13           ` Willy Tarreau
2019-11-07 10:19           ` hpa
2019-11-07 10:27             ` Willy Tarreau
2019-11-07 10:50               ` hpa
2019-11-07 12:56                 ` Willy Tarreau
2019-11-07 16:45                   ` ebiederm
2019-11-07 16:53                     ` Linus Torvalds
2019-11-07 16:57                     ` Willy Tarreau
2019-11-10 17:17       ` Andy Lutomirski
2019-11-07  7:37   ` Ingo Molnar
2019-11-07  7:45     ` Thomas Gleixner
2019-11-07  8:16   ` Ingo Molnar
2019-11-07 18:02     ` Thomas Gleixner
2019-11-07 19:24   ` Brian Gerst
2019-11-07 19:54     ` Linus Torvalds
2019-11-07 21:00       ` Brian Gerst
2019-11-07 21:32         ` Thomas Gleixner
2019-11-07 23:20           ` hpa
2019-11-07 21:44         ` Linus Torvalds
2019-11-08  1:12           ` H. Peter Anvin
2019-11-08  2:12             ` Brian Gerst
2019-11-10 17:21           ` Andy Lutomirski
2019-11-06 19:35 ` [patch 6/9] x86/iopl: Fixup misleading comment Thomas Gleixner
2019-11-06 19:35 ` [patch 7/9] x86/iopl: Restrict iopl() permission scope Thomas Gleixner
2019-11-07  9:09   ` Peter Zijlstra
2019-11-10 17:26   ` Andy Lutomirski
2019-11-10 20:31     ` Thomas Gleixner
2019-11-10 21:05       ` Andy Lutomirski
2019-11-10 21:21         ` Thomas Gleixner
2019-11-06 19:35 ` [patch 8/9] x86/iopl: Remove legacy IOPL option Thomas Gleixner
2019-11-07  6:11   ` Jürgen Groß
2019-11-07  6:26     ` hpa
2019-11-07 16:44     ` Stephen Hemminger
2019-11-07  9:13   ` Peter Zijlstra
2019-11-06 19:35 ` [patch 9/9] selftests/x86/iopl: Verify that CLI/STI result in #GP Thomas Gleixner
2019-11-07  7:28 ` [patch] x86/iopl: Remove unused local variable, update comments in ksys_ioperm() Ingo Molnar

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git