* [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts
@ 2015-08-18 19:11 Andy Lutomirski
  2015-08-18 22:16 ` Frederic Weisbecker
  2015-08-19  0:30 ` Andy Lutomirski
  0 siblings, 2 replies; 10+ messages in thread
From: Andy Lutomirski @ 2015-08-18 19:11 UTC (permalink / raw)
  To: X86 ML
  Cc: Sasha Levin, Brian Gerst, Frédéric Weisbecker,
	Denys Vlasenko, linux-kernel, Oleg Nesterov, Borislav Petkov,
	Andy Lutomirski

This fixes a couple minor holes if we took an IRQ very early in syscall
processing:

 - We could enter the IRQ with CONTEXT_USER.  Everything worked (RCU
   was fine), but we could warn if all the debugging options were
   set.

 - We could have the IRQ regs overlap task_pt_regs.  I'm not aware
   of anything important that would break, but some of the /proc
   stuff could plausibly have gotten confused.

Fix it the straightforward way: finish filling in pt_regs and call
enter_from_user_mode before enabling interrupts if _TIF_NOHZ is set.

This should be the last piece of the puzzle needed to get rid of most
remaining exception_enter calls.  (vmalloc faults are still tricky,
but they're mostly fatal in the syscall prologue already.)

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

This is the last significant functionality change I'll send for 4.3, I
hope.  With this applied, context tracking for all non-NMI, non-debug
entries should be exact.

There's probably some (minor) performance regression on
CONFIG_CONTEXT_TRACKING=y kernels that aren't using nohz.  If so
(I'll benchmark it later this week), I'll try to rig up a simple
patch to NOP out the hooks if nohz is off.

Sasha, this should fix the intermittent DEBUG_LOCKS splat you're
seeing.

I don't intend to send v2 of the #BP stuff for 4.3.  The pile is plenty
big already.
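
For reference, the enter_from_user_mode() helper this patch calls from the
asm was added earlier in this series in arch/x86/entry/common.c; modulo my
memory, it's just:

#ifdef CONFIG_CONTEXT_TRACKING
/* Called on entry from user mode with IRQs off. */
__visible void enter_from_user_mode(void)
{
        CT_WARN_ON(ct_state() != CONTEXT_USER);  /* assert exact tracking */
        user_exit();                     /* CONTEXT_USER -> CONTEXT_KERNEL */
}
#endif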

 arch/x86/entry/common.c            | 12 +-------
 arch/x86/entry/entry_64.S          | 32 ++++++++++++++------
 arch/x86/entry/entry_64_compat.S   | 60 +++++++++++++++++++++++++++++---------
 arch/x86/include/asm/thread_info.h |  3 +-
 4 files changed, 71 insertions(+), 36 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 80dcc9261ca3..b570cea2f469 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -70,21 +70,11 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	u32 work;
 
 	BUG_ON(regs != task_pt_regs(current));
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 
 	work = ACCESS_ONCE(current_thread_info()->flags) &
 		_TIF_WORK_SYSCALL_ENTRY;
 
-#ifdef CONFIG_CONTEXT_TRACKING
-	/*
-	 * If TIF_NOHZ is set, we are required to call user_exit() before
-	 * doing anything that could touch RCU.
-	 */
-	if (work & _TIF_NOHZ) {
-		enter_from_user_mode();
-		work &= ~_TIF_NOHZ;
-	}
-#endif
-
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp first -- it should minimize exposure of other
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e2d078c9dfe4..6bf0c7ecf399 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -142,20 +142,16 @@ ENTRY(entry_SYSCALL_64)
 	 */
 GLOBAL(entry_SYSCALL_64_after_swapgs)
 
+	/*
+	 * IRQs must be off while we use rsp_scratch to keep it from
+	 * being clobbered by a different task.
+	 */
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
 	pushq	$__USER_DS			/* pt_regs->ss */
 	pushq	PER_CPU_VAR(rsp_scratch)	/* pt_regs->sp */
-	/*
-	 * Re-enable interrupts.
-	 * We use 'rsp_scratch' as a scratch space, hence irq-off block above
-	 * must execute atomically in the face of possible interrupt-driven
-	 * task preemption. We must enable interrupts only after we're done
-	 * with using rsp_scratch:
-	 */
-	ENABLE_INTERRUPTS(CLBR_NONE)
 	pushq	%r11				/* pt_regs->flags */
 	pushq	$__USER_CS			/* pt_regs->cs */
 	pushq	%rcx				/* pt_regs->ip */
@@ -171,8 +167,17 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
 	pushq	%r11				/* pt_regs->r11 */
 	sub	$(6*8), %rsp			/* pt_regs->bp, bx, r12-15 not saved */
 
-	testl	$_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	testl	$(_TIF_WORK_SYSCALL_ENTRY | _TIF_NOHZ), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
 	jnz	tracesys
+
+	/*
+	 * Re-enable interrupts.  IRQ tracing already thinks that IRQs are
+	 * on (since we treat user mode as having IRQs on), and the
+	 * prologue above is too short for it to be worth adding a
+	 * tracing round trip.
+	 */
+	ENABLE_INTERRUPTS(CLBR_NONE)
+
 entry_SYSCALL_64_fastpath:
 #if __SYSCALL_MASK == ~0
 	cmpq	$__NR_syscall_max, %rax
@@ -235,6 +240,15 @@ GLOBAL(int_ret_from_sys_call_irqs_off)
 
 	/* Do syscall entry tracing */
 tracesys:
+#ifdef CONFIG_CONTEXT_TRACKING
+	/* This is slow enough that it's worth tracing. */
+	TRACE_IRQS_OFF
+	call enter_from_user_mode
+	TRACE_IRQS_ON
+#endif
+
+	ENABLE_INTERRUPTS(CLBR_NONE)
+
 	movq	%rsp, %rdi
 	movl	$AUDIT_ARCH_X86_64, %esi
 	call	syscall_trace_enter_phase1
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index ff32a289b5d1..099ec1174ff9 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -103,11 +103,19 @@ ENTRY(entry_SYSENTER_compat)
 	jnz	sysenter_fix_flags
 sysenter_flags_fixed:
 
+#ifdef CONFIG_CONTEXT_TRACKING
+	/* This is slow enough that it's worth tracing. */
+	TRACE_IRQS_OFF
+	call enter_from_user_mode
+	TRACE_IRQS_ON
+#endif
+
 	/*
 	 * Re-enable interrupts.  IRQ tracing already thinks that IRQs are
 	 * on (since we treat user mode as having IRQs on), and the
 	 * prologue above is too short for it to be worth adding a
-	 * tracing round trip.
+	 * tracing round trip except in the CONFIG_CONTEXT_TRACKING
+	 * case.
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 
@@ -318,15 +326,10 @@ ENDPROC(entry_SYSENTER_compat)
  * with the int 0x80 path.
  */
 ENTRY(entry_SYSCALL_compat)
-	/*
-	 * Interrupts are off on entry.
-	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
-	 * it is too small to ever cause noticeable irq latency.
-	 */
+	/* Interrupts are off on entry. */
 	SWAPGS_UNSAFE_STACK
 	movl	%esp, %r8d
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
-	ENABLE_INTERRUPTS(CLBR_NONE)
 
 	/* Zero-extending 32-bit regs, do not remove */
 	movl	%eax, %eax
@@ -346,6 +349,22 @@ ENTRY(entry_SYSCALL_compat)
 	pushq	$-ENOSYS		/* pt_regs->ax */
 	sub	$(10*8), %rsp		/* pt_regs->r8-11, bp, bx, r12-15 not saved */
 
+#ifdef CONFIG_CONTEXT_TRACKING
+	/* This is slow enough that it's worth tracing. */
+	TRACE_IRQS_OFF
+	call enter_from_user_mode
+	TRACE_IRQS_ON
+#endif
+
+	/*
+	 * Re-enable interrupts.  IRQ tracing already thinks that IRQs are
+	 * on (since we treat user mode as having IRQs on), and the
+	 * prologue above is too short for it to be worth adding a
+	 * tracing round trip except in the CONFIG_CONTEXT_TRACKING
+	 * case.
+	 */
+	ENABLE_INTERRUPTS(CLBR_NONE)
+
 	/*
 	 * No need to do an access_ok check here because r8 has been
 	 * 32-bit zero extended:
@@ -354,6 +373,7 @@ ENTRY(entry_SYSCALL_compat)
 1:	movl	(%r8), %r9d
 	_ASM_EXTABLE(1b, ia32_badarg)
 	ASM_CLAC
+
 	orl	$TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
 	testl	$_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
 	jnz	cstar_tracesys
@@ -518,14 +538,9 @@ ia32_ret_from_sys_call:
  */
 
 ENTRY(entry_INT80_compat)
-	/*
-	 * Interrupts are off on entry.
-	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
-	 * it is too small to ever cause noticeable irq latency.
-	 */
+	/* Interrupts are off on entry. */
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
 	SWAPGS
-	ENABLE_INTERRUPTS(CLBR_NONE)
 
 	/* Zero-extending 32-bit regs, do not remove */
 	movl	%eax, %eax
@@ -545,9 +560,17 @@ ENTRY(entry_INT80_compat)
 	sub	$(6*8), %rsp /* pt_regs->bp, bx, r12-15 not saved */
 
 	orl	$TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
-	testl	$_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	testl	$(_TIF_WORK_SYSCALL_ENTRY | _TIF_NOHZ), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
 	jnz	ia32_tracesys
 
+	/*
+	 * Re-enable interrupts.  IRQ tracing already thinks that IRQs are
+	 * on (since we treat user mode as having IRQs on), and the
+	 * prologue above is too short for it to be worth adding a
+	 * tracing round trip.
+	 */
+	ENABLE_INTERRUPTS(CLBR_NONE)
+
 ia32_do_call:
 	/* 32-bit syscall -> 64-bit C ABI argument conversion */
 	movl	%edi, %r8d		/* arg5 */
@@ -564,6 +587,15 @@ ia32_do_call:
 	jmp	int_ret_from_sys_call
 
 ia32_tracesys:
+#ifdef CONFIG_CONTEXT_TRACKING
+	/* This is slow enough that it's worth tracing. */
+	TRACE_IRQS_OFF
+	call enter_from_user_mode
+	TRACE_IRQS_ON
+#endif
+
+	ENABLE_INTERRUPTS(CLBR_NONE)
+
 	SAVE_EXTRA_REGS
 	movq	%rsp, %rdi			/* &pt_regs -> arg1 */
 	call	syscall_trace_enter
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8afdc3e44247..3c5a96815dec 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -140,8 +140,7 @@ struct thread_info {
 /* work to do in syscall_trace_enter() */
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT |	\
-	 _TIF_NOHZ)
+	 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)
 
 /* work to do on any return to user space */
 #define _TIF_ALLWORK_MASK						\
-- 
2.4.3


* Re: [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts
  2015-08-18 19:11 [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts Andy Lutomirski
@ 2015-08-18 22:16 ` Frederic Weisbecker
  2015-08-18 22:35   ` Andy Lutomirski
  2015-08-19  0:30 ` Andy Lutomirski
  1 sibling, 1 reply; 10+ messages in thread
From: Frederic Weisbecker @ 2015-08-18 22:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Sasha Levin, Brian Gerst, Denys Vlasenko, linux-kernel,
	Oleg Nesterov, Borislav Petkov

On Tue, Aug 18, 2015 at 12:11:59PM -0700, Andy Lutomirski wrote:
> This fixes a couple minor holes if we took an IRQ very early in syscall
> processing:
> 
>  - We could enter the IRQ with CONTEXT_USER.  Everything worked (RCU
>    was fine), but we could warn if all the debugging options were
>    set.

So this is fixing issues after your changes that call user_exit() from
IRQs, right?

But the IRQs aren't supposed to call user_exit(), they have their own hooks.
That's where the real issue is.

* Re: [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts
  2015-08-18 22:16 ` Frederic Weisbecker
@ 2015-08-18 22:35   ` Andy Lutomirski
  2015-08-18 23:02     ` Frederic Weisbecker
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Lutomirski @ 2015-08-18 22:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, X86 ML, Sasha Levin, Brian Gerst,
	Denys Vlasenko, linux-kernel, Oleg Nesterov, Borislav Petkov,
	Rik van Riel

On Tue, Aug 18, 2015 at 3:16 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Tue, Aug 18, 2015 at 12:11:59PM -0700, Andy Lutomirski wrote:
>> This fixes a couple minor holes if we took an IRQ very early in syscall
>> processing:
>>
>>  - We could enter the IRQ with CONTEXT_USER.  Everything worked (RCU
>>    was fine), but we could warn if all the debugging options were
>>    set.
>
> So this is fixing issues after your changes that call user_exit() from
> IRQs, right?

Yes.  Here's an example splat, courtesy of Sasha:

https://gist.github.com/sashalevin/a006a44989312f6835e7

>
> But the IRQs aren't supposed to call user_exit(), they have their own hooks.
> That's where the real issue is.

In -tip, the assumption is that we *always* switch to CONTEXT_KERNEL
when entering the kernel for a non-NMI reason.  That means that we can
avoid all of the (expensive!) checks for what context we're in.  It
also means that (other than IRQs, which need further cleanup), we only
switch once per user/kernel switch.

The cost for doing so should be essentially zero, modulo artifacts from
poor inlining.  IMO the code is much more straightforward than it used
to be, and it has the potential to be quite fast.  For one thing, we
never invoke context tracking with IRQs on, and Rik had some profiles
suggesting that a bunch of the overhead involved dealing with repeated
irq flag manipulation.

One way or another, IRQs need to switch from RCU-not-watching to
RCU-watching, and I don't see what's wrong with user_exit for this
purpose.  Of course, if user_exit is slow, we should fix that.
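
To be concrete about the cost: user_exit() boils down to roughly this
(simplified sketch of kernel/context_tracking.c, details from memory):

void context_tracking_exit(enum ctx_state state)
{
        unsigned long flags;

        if (!context_tracking_is_enabled())
                return;         /* patched out via static key when off */

        local_irq_save(flags);
        if (__this_cpu_read(context_tracking.state) == state) {
                if (__this_cpu_read(context_tracking.active)) {
                        rcu_user_exit();           /* the expensive part */
                        vtime_user_exit(current);
                }
                __this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
        }
        local_irq_restore(flags);
}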

Also, this isn't really related to IRQs calling user_exit.  It's that
IRQs can recurse into other entries (#GP in Sasha's case) which also
validate the context.

None of the speedups that will be enabled are written yet, but I
strongly suspect they will be soon :)

In my book, the fact that we now have context tracking assertions all
over the place is a good thing.  It means we're much less likely to
break it.
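
And the assertions themselves are nearly free -- ct_state() is a
static-key-guarded per-cpu read, roughly (sketch, not verbatim):

static inline enum ctx_state ct_state(void)
{
        return context_tracking_is_enabled() ?
                this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
}

#define CT_WARN_ON(cond)  WARN_ON(context_tracking_is_enabled() && (cond))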

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC

* Re: [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts
  2015-08-18 22:35   ` Andy Lutomirski
@ 2015-08-18 23:02     ` Frederic Weisbecker
  2015-08-18 23:07       ` Andy Lutomirski
  0 siblings, 1 reply; 10+ messages in thread
From: Frederic Weisbecker @ 2015-08-18 23:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, Sasha Levin, Brian Gerst,
	Denys Vlasenko, linux-kernel, Oleg Nesterov, Borislav Petkov,
	Rik van Riel

On Tue, Aug 18, 2015 at 03:35:30PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 18, 2015 at 3:16 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > On Tue, Aug 18, 2015 at 12:11:59PM -0700, Andy Lutomirski wrote:
> >> This fixes a couple minor holes if we took an IRQ very early in syscall
> >> processing:
> >>
> >>  - We could enter the IRQ with CONTEXT_USER.  Everything worked (RCU
> >>    was fine), but we could warn if all the debugging options were
> >>    set.
> >
> > So this is fixing issues after your changes that call user_exit() from
> > IRQs, right?
> 
> Yes.  Here's an example splat, courtesy of Sasha:
> 
> https://gist.github.com/sashalevin/a006a44989312f6835e7
> 
> >
> > But the IRQs aren't supposed to call user_exit(), they have their own hooks.
> > That's where the real issue is.
> 
> In -tip, the assumption is that we *always* switch to CONTEXT_KERNEL
> when entering the kernel for a non-NMI reason.

Why? IRQs don't need that! We already have irq_enter()/irq_exit().

And we don't want to call rcu_user_*() pairs on IRQs, you're
introducing a serious performance regression here! And I'm talking about
the code that's currently in -tip.

> That means that we can
> avoid all of the (expensive!) checks for what context we're in.

If you're referring to context tracking, the context check is a per-cpu
read. Not something that's usually considered expensive.

> It also means that (other than IRQs, which need further cleanup), we only
> switch once per user/kernel switch.

???

> 
> The cost for doing so should be essentially zero, modulo artifacts from
> poor inlining.

And modulo rcu_user_*() that do multiple costly atomic_add_return() operations
implying full memory barriers. Plus the unnecessary vtime accounting that doubles
the existing one in irq_enter/exit() (those even imply a lock currently, which will
probably be turned to seqcount, but still, full memory barriers...).
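
Concretely, each rcu_user_exit()/rcu_user_enter() does something like this
to the dynticks counter (simplified sketch of kernel/rcu/tree.c):

static void rcu_eqs_exit_common(long long oldval, int user)
{
        struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);

        smp_mb__before_atomic();     /* order against the prior EQS */
        atomic_inc(&rdtp->dynticks); /* odd => RCU is watching this CPU */
        smp_mb__after_atomic();      /* order against later read sides */
        WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
}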

I'm sorry but I'm going to NACK any code that does that in IRQs (and again that
concerns current tip:x86/asm).

* Re: [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts
  2015-08-18 23:02     ` Frederic Weisbecker
@ 2015-08-18 23:07       ` Andy Lutomirski
  2015-08-19 17:10         ` Frederic Weisbecker
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Lutomirski @ 2015-08-18 23:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, X86 ML, Sasha Levin, Brian Gerst,
	Denys Vlasenko, linux-kernel, Oleg Nesterov, Borislav Petkov,
	Rik van Riel

On Tue, Aug 18, 2015 at 4:02 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Tue, Aug 18, 2015 at 03:35:30PM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 18, 2015 at 3:16 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>> > On Tue, Aug 18, 2015 at 12:11:59PM -0700, Andy Lutomirski wrote:
>> >> This fixes a couple minor holes if we took an IRQ very early in syscall
>> >> processing:
>> >>
>> >>  - We could enter the IRQ with CONTEXT_USER.  Everything worked (RCU
>> >>    was fine), but we could warn if all the debugging options were
>> >>    set.
>> >
>> > So this is fixing issues after your changes that call user_exit() from
>> > IRQs, right?
>>
>> Yes.  Here's an example splat, courtesy of Sasha:
>>
>> https://gist.github.com/sashalevin/a006a44989312f6835e7
>>
>> >
>> > But the IRQs aren't supposed to call user_exit(), they have their own hooks.
>> > That's where the real issue is.
>>
>> In -tip, the assumption is that we *always* switch to CONTEXT_KERNEL
>> when entering the kernel for a non-NMI reason.
>
> Why? IRQs don't need that! We already have irq_enter()/irq_exit().
>

Those are certainly redundant.  I want to have a real hook to call
that says "switch to IRQ context from CONTEXT_USER" or "switch to IRQ
context from CONTEXT_KERNEL" (aka noop), but that doesn't currently
exist.

> And we don't want to call rcu_user_*() pairs on IRQs, you're
> introducing a serious performance regression here! And I'm talking about
> the code that's currently in -tip.

Is there an easy way to fix it?  For example, could we figure out what
makes it take so long and make it faster?  If we need to, we could
back out the IRQ bit and change the assertions for 4.3, but I'd rather
keep the exact context tracking if at all possible.

>
>> That means that we can
>> avoid all of the (expensive!) checks for what context we're in.
>
> If you're referring to context tracking, the context check is a per-cpu
> read. Not something that's usually considered expensive.

In -tip, there aren't even extra branches, except those imposed by the
user_exit implementation.

>
>> It also means that (other than IRQs, which need further cleanup), we only
>> switch once per user/kernel switch.
>
> ???

In 4.2 and before, we can switch multiple times on the way out of the
kernel, via SCHEDULE_USER, do_notify_resume, etc.  In -tip, we do it
exactly once no matter what.
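
i.e. the -tip exit path is structured like this (sketch from memory; the
flag mask and the handle_work() helper are stand-ins, not the real names):

__visible void prepare_exit_to_usermode(struct pt_regs *regs)
{
        local_irq_disable();

        while (READ_ONCE(current_thread_info()->flags) &
               (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_RESUME)) {
                local_irq_enable();
                handle_work(regs);  /* schedule(), do_signal(), etc. */
                local_irq_disable();
        }

        user_enter();  /* the single CONTEXT_KERNEL -> CONTEXT_USER switch */
}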

>
>>
>> The cost for doing so should be essentially zero, modulo artifacts from
>> poor inlining.
>
> And modulo rcu_user_*() that do multiple costly atomic_add_return() operations
> implying full memory barriers. Plus the unnecessary vtime accounting that doubles
> the existing one in irq_enter/exit() (those even imply a lock currently, which will
> probably be turned to seqcount, but still, full memory barriers...).
>
> I'm sorry but I'm going to NACK any code that does that in IRQs (and again that
> concerns current tip:x86/asm).

Why do we need these heavyweight barriers?

If there's actually a measurable performance hit in IRQs in -tip, then
can we come up with a better fix?  For example, we could change all
the new CT_WARN_ON calls to check "are we in CONTEXT_KERNEL or in IRQ
context" and make the IRQ entry do a lighter weight context tracking
operation.
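
Something like this, say (hypothetical, just to illustrate the idea):

        /* Relaxed assertion: CONTEXT_KERNEL, or any context in an IRQ */
        CT_WARN_ON(ct_state() != CONTEXT_KERNEL && !in_irq());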

But I think I'm still missing something fundamental about the
performance: why is irq_enter() any faster than user_exit()?

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

* Re: [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts
  2015-08-18 19:11 [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts Andy Lutomirski
  2015-08-18 22:16 ` Frederic Weisbecker
@ 2015-08-19  0:30 ` Andy Lutomirski
  2015-08-19 15:54   ` Andy Lutomirski
  1 sibling, 1 reply; 10+ messages in thread
From: Andy Lutomirski @ 2015-08-19  0:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Sasha Levin, Brian Gerst, Frédéric Weisbecker,
	Denys Vlasenko, linux-kernel, Oleg Nesterov, Borislav Petkov

On Tue, Aug 18, 2015 at 12:11 PM, Andy Lutomirski <luto@kernel.org> wrote:
> This fixes a couple minor holes if we took an IRQ very early in syscall
> processing:
>

The patch is buggy.  v2 coming soon, hopefully.

--Andy

* Re: [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts
  2015-08-19  0:30 ` Andy Lutomirski
@ 2015-08-19 15:54   ` Andy Lutomirski
  0 siblings, 0 replies; 10+ messages in thread
From: Andy Lutomirski @ 2015-08-19 15:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Sasha Levin, Brian Gerst, Frédéric Weisbecker,
	Denys Vlasenko, linux-kernel, Oleg Nesterov, Borislav Petkov

On Tue, Aug 18, 2015 at 5:30 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Aug 18, 2015 at 12:11 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> This fixes a couple minor holes if we took an IRQ very early in syscall
>> processing:
>>
>
> The patch is buggy.  v2 coming soon, hopefully.
>

No, let's drop this for now.  It would be straightforward but quite
messy to fix it, and I don't want to introduce further
incomprehensibility into the asm.  I'll send the warning change
instead, and we can fix it better for 4.4.

--Andy

* Re: [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts
  2015-08-18 23:07       ` Andy Lutomirski
@ 2015-08-19 17:10         ` Frederic Weisbecker
  2015-08-21  7:50           ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Frederic Weisbecker @ 2015-08-19 17:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, Sasha Levin, Brian Gerst,
	Denys Vlasenko, linux-kernel, Oleg Nesterov, Borislav Petkov,
	Rik van Riel

On Tue, Aug 18, 2015 at 04:07:51PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 18, 2015 at 4:02 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > On Tue, Aug 18, 2015 at 03:35:30PM -0700, Andy Lutomirski wrote:
> >> On Tue, Aug 18, 2015 at 3:16 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >> > On Tue, Aug 18, 2015 at 12:11:59PM -0700, Andy Lutomirski wrote:
> >> >> This fixes a couple minor holes if we took an IRQ very early in syscall
> >> >> processing:
> >> >>
> >> >>  - We could enter the IRQ with CONTEXT_USER.  Everything worked (RCU
> >> >>    was fine), but we could warn if all the debugging options were
> >> >>    set.
> >> >
> >> > So this is fixing issues after your changes that call user_exit() from
> >> > IRQs, right?
> >>
> >> Yes.  Here's an example splat, courtesy of Sasha:
> >>
> >> https://gist.github.com/sashalevin/a006a44989312f6835e7
> >>
> >> >
> >> > But the IRQs aren't supposed to call user_exit(), they have their own hooks.
> >> > That's where the real issue is.
> >>
> >> In -tip, the assumption is that we *always* switch to CONTEXT_KERNEL
> >> when entering the kernel for a non-NMI reason.
> >
> > Why? IRQs don't need that! We already have irq_enter()/irq_exit().
> >
> 
> Those are certainly redundant.

So? What's the point in duplicating a hook in arch code that core code already
has?

> I want to have a real hook to call
> that says "switch to IRQ context from CONTEXT_USER" or "switch to IRQ
> context from CONTEXT_KERNEL" (aka noop), but that doesn't currently
> exist.

You're not answering _why_ you want that.

> 
> > And we don't want to call rcu_user_*() pairs on IRQs, you're
> > introducing a serious performance regression here! And I'm talking about
> > the code that's currently in -tip.
> 
> Is there an easy way to fix it?  For example, could we figure out what
> makes it take so long and make it faster?

Sure, just remove your arch IRQ hook.

> If we need to, we could
> back out the IRQ bit and change the assertions for 4.3, but I'd rather
> keep the exact context tracking if at all possible.

I have no idea what you mean by exact context tracking here.

But if we ever want to call irq_enter() using arch hooks, and I have no idea why
we would ever want to do that since that involves complexifying the code
by $NR_ARCHS and moving C code to ASM, we need serious reasons! And that's
certainly not something we are going to plan now for the next week's merge window.

> >> That means that we can
> >> avoid all of the (expensive!) checks for what context we're in.
> >
> > If you're referring to context tracking, the context check is a per-cpu
> > read. Not something that's usually considered expensive.
> 
> In -tip, there aren't even extra branches, except those imposed by the
> user_exit implementation.

No, there is the "call enter_from_user_mode" in the IRQ fast path.

> 
> >
> >> It also means that (other than IRQs, which need further cleanup), we only
> >> switch once per user/kernel switch.
> >
> > ???
> 
> In 4.2 and before, we can switch multiple times on the way out of the
> kernel, via SCHEDULE_USER, do_notify_resume, etc.  In -tip, we do it
> exactly once no matter what.

That's what we want for syscalls but not for IRQs.

> 
> >
> >>
> >> The cost for doing so should be essentially zero, modulo artifacts from
> >> poor inlining.
> >
> > And modulo rcu_user_*() that do multiple costly atomic_add_return() operations
> > implying full memory barriers. Plus the unnecessary vtime accounting that doubles
> > the existing one in irq_enter/exit() (those even imply a lock currently, which will
> > probably be turned to seqcount, but still, full memory barriers...).
> >
> > I'm sorry but I'm going to NACK any code that does that in IRQs (and again that
> > concerns current tip:x86/asm).
> 
> Why do we need these heavyweight barriers?

Actually it's not full barriers but atomic ones (smp_mb__after_atomic_stuff()).
I suspect we can't do much better given RCU requirements.

Still we don't need to call it twice.

> 
> If there's actually a measurable performance hit in IRQs in -tip, then
> can we come up with a better fix?

I'm sure it's very easily measurable.

> For example, we could change all
> the new CT_WARN_ON calls to check "are we in CONTEXT_KERNEL or in IRQ
> context" and make the IRQ entry do a lighter weight context tracking
> operation.

I don't see what we need to check actually. Context tracking can be in any
state while in IRQ.

> 
> But I think I'm still missing something fundamental about the
> performance: why is irq_enter() any faster than user_exit()?

It's slightly faster at least because it takes care of nesting IRQs which
is likely with softirqs that get interrupted.

Now of course we wouldn't call user_exit() in this case, but the hook is there
in generic code, no need for anything from the arch.
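
i.e. the generic hook is nesting-aware and takes the cheap path on all but
the outermost IRQ (simplified sketch of rcu_irq_enter(); the real code also
saves and restores IRQ flags):

void rcu_irq_enter(void)
{
        struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
        long long oldval = rdtp->dynticks_nesting;

        rdtp->dynticks_nesting++;
        if (oldval)
                return;                  /* nested: per-cpu increment only */
        rcu_eqs_exit_common(oldval, 1);  /* outermost: pay the barriers */
}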

* Re: [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts
  2015-08-19 17:10         ` Frederic Weisbecker
@ 2015-08-21  7:50           ` Ingo Molnar
  2015-08-21 13:14             ` Frederic Weisbecker
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2015-08-21  7:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Andy Lutomirski, X86 ML, Sasha Levin,
	Brian Gerst, Denys Vlasenko, linux-kernel, Oleg Nesterov,
	Borislav Petkov, Rik van Riel


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> > I want to have a real hook to call that says "switch to IRQ context from 
> > CONTEXT_USER" or "switch to IRQ context from CONTEXT_KERNEL" (aka noop), but 
> > that doesn't currently exist.
> 
> You're not answering _why_ you want that.

So we'd have a comprehensive, 100% coverage, self-sufficient set of callbacks that 
track the kernel's current context state at the points where the context switches 
actually occur - not just something cobbled together heterogeneously. The low level 
x86 asm code was rather messy in this area, better organization would be welcome, 
I don't think we can overdo it.

( I'm assuming here that it can all be done for zero or negative cost, and that
  the result will be correct and won't hurt existing users in any fashion. )

Thanks,

	Ingo

* Re: [PATCH] x86/entry/64: Context-track syscalls before enabling interrupts
  2015-08-21  7:50           ` Ingo Molnar
@ 2015-08-21 13:14             ` Frederic Weisbecker
  0 siblings, 0 replies; 10+ messages in thread
From: Frederic Weisbecker @ 2015-08-21 13:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Andy Lutomirski, X86 ML, Sasha Levin,
	Brian Gerst, Denys Vlasenko, linux-kernel, Oleg Nesterov,
	Borislav Petkov, Rik van Riel

On Fri, Aug 21, 2015 at 09:50:35AM +0200, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > > I want to have a real hook to call that says "switch to IRQ context from 
> > > CONTEXT_USER" or "switch to IRQ context from CONTEXT_KERNEL" (aka noop), but 
> > > that doesn't currently exist.
> > 
> > You're not answering _why_ you want that.
> 
> So we'd have a comprehensive, 100% coverage, self-sufficient set of callbacks that 
> track the kernel's current context state at the points where the context switches 
> > actually occur - not just something cobbled together heterogeneously. The low level 
> x86 asm code was rather messy in this area, better organization would be welcome, 
> I don't think we can overdo it.
> 
> ( I'm assuming here that it can all be done for zero or negative cost, and that
>   the result will be correct and won't hurt existing users in any fashion. )

Sure, if x86 needs such a thing for internal organization it's ok, but don't
do context tracking calls there. irq_enter() and irq_exit() already take care
of RCU and time accounting. What Andy is doing is roughly like calling these
callbacks twice for no reason. It's horrible for IRQ performance.

The merge window is like tomorrow and he keeps arguing against that obvious
regression instead of fixing it.

Thanks.

