[RFC,1/5] x86: introduce preemption disable prefix

Message ID 20181018005420.82993-2-namit@vmware.com
State New
Series
  • x86: dynamic indirect call promotion

Commit Message

Nadav Amit Oct. 18, 2018, 12:54 a.m. UTC
It is sometimes beneficial to prevent preemption for very few
instructions, or prevent preemption for some instructions that precede
a branch (this latter case will be introduced in the next patches).

To provide such functionality on x86-64, we use an empty REX-prefix
(opcode 0x40) as an indication that preemption is disabled for the
following instruction.

It is expected that this opcode is not in common use.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/entry/entry_64.S            | 10 ++++++++++
 arch/x86/include/asm/nospec-branch.h | 12 ++++++++++++
 2 files changed, 22 insertions(+)
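
As a C-level sketch of the mechanism (the actual check in this patch is the assembly hunk in entry_64.S below; the helper name here is invented for illustration), the kernel-preemption path peeks at the interrupted instruction before calling preempt_schedule_irq() and skips the reschedule if that instruction starts with the empty REX prefix:

#define PREEMPT_DISABLE_PREFIX	0x40	/* empty REX prefix */

/* Hypothetical helper, equivalent in spirit to the asm check in the patch. */
static inline bool preemption_hinted_off(struct pt_regs *regs)
{
	return *(unsigned char *)regs->ip == PREEMPT_DISABLE_PREFIX;
}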

Comments

Andy Lutomirski Oct. 18, 2018, 1:22 a.m. UTC | #1
> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
> 
> It is sometimes beneficial to prevent preemption for very few
> instructions, or prevent preemption for some instructions that precede
> a branch (this latter case will be introduced in the next patches).
> 
> To provide such functionality on x86-64, we use an empty REX-prefix
> (opcode 0x40) as an indication that preemption is disabled for the
> following instruction.

Nifty!

That being said, I think you have a few bugs.  First, you can’t just ignore a rescheduling interrupt, as you introduce unbounded latency when this happens — you’re effectively emulating preempt_enable_no_resched(), which is not a drop-in replacement for preempt_enable(). To fix this, you may need to jump to a slow-path trampoline that calls schedule() at the end or consider rewinding one instruction instead. Or use TF, which is only a little bit terrifying...
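
For readers less familiar with the distinction, the two primitives differ roughly as follows (a simplified sketch with renamed helpers, not the exact include/linux/preempt.h definitions):

/* sketch: what skipping the reschedule effectively gives you */
static inline void sketch_preempt_enable_no_resched(void)
{
	barrier();
	preempt_count_dec();		/* drop the count; a pending reschedule is ignored */
}

/* sketch: what a real preemption point does */
static inline void sketch_preempt_enable(void)
{
	barrier();
	if (unlikely(preempt_count_dec_and_test()))
		__preempt_schedule();	/* honour a reschedule requested in the meantime */
}

Skipping preempt_schedule_irq() whenever the prefix byte is seen behaves like the first helper, not the second, so a pending reschedule can be deferred indefinitely.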

You also aren’t accounting for the case where you get an exception that is, in turn, preempted.



> 
> It is expected that this opcode is not in common use.
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
> arch/x86/entry/entry_64.S            | 10 ++++++++++
> arch/x86/include/asm/nospec-branch.h | 12 ++++++++++++
> 2 files changed, 22 insertions(+)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cb8a5893fd33..31d59aad496e 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -643,6 +643,16 @@ retint_kernel:
>    jnc    1f
> 0:    cmpl    $0, PER_CPU_VAR(__preempt_count)
>    jnz    1f
> +
> +    /*
> +     * Allow to use hint to prevent preemption on a certain instruction.
> +     * Consider an instruction with the first byte having REX prefix
> +     * without any bits set as an indication for preemption disabled.
> +     */
> +    movq    RIP(%rsp), %rax
> +    cmpb    $PREEMPT_DISABLE_PREFIX, (%rax)
> +    jz    1f
> +
>    call    preempt_schedule_irq
>    jmp    0b
> 1:
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index 80dc14422495..0267611eb247 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -52,6 +52,12 @@
>    jnz    771b;                \
>    add    $(BITS_PER_LONG/8) * nr, sp;
> 
> +/*
> + * An empty REX-prefix is an indication that preemption should not take place on
> + * this instruction.
> + */
> +#define PREEMPT_DISABLE_PREFIX                 (0x40)
> +
> #ifdef __ASSEMBLY__
> 
> /*
> @@ -148,6 +154,12 @@
> #endif
> .endm
> 
> +.macro preempt_disable_prefix
> +#ifdef CONFIG_PREEMPT
> +    .byte    PREEMPT_DISABLE_PREFIX
> +#endif
> +.endm
> +
> #else /* __ASSEMBLY__ */
> 
> #define ANNOTATE_NOSPEC_ALTERNATIVE                \
> -- 
> 2.17.1
>
Nadav Amit Oct. 18, 2018, 3:12 a.m. UTC | #2
at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:

> 
>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>> 
>> It is sometimes beneficial to prevent preemption for very few
>> instructions, or prevent preemption for some instructions that precede
>> a branch (this latter case will be introduced in the next patches).
>> 
>> To provide such functionality on x86-64, we use an empty REX-prefix
>> (opcode 0x40) as an indication that preemption is disabled for the
>> following instruction.
> 
> Nifty!
> 
> That being said, I think you have a few bugs. First, you can’t just ignore
> a rescheduling interrupt, as you introduce unbounded latency when this
> happens — you’re effectively emulating preempt_enable_no_resched(), which
> is not a drop-in replacement for preempt_enable(). To fix this, you may
> need to jump to a slow-path trampoline that calls schedule() at the end or
> consider rewinding one instruction instead. Or use TF, which is only a
> little bit terrifying…

Yes, I didn’t pay enough attention here. For my use-case, I think that the
easiest solution would be to make synchronize_sched() ignore preemptions
that happen while the prefix is detected. It would slightly change the
meaning of the prefix.

> You also aren’t accounting for the case where you get an exception that
> is, in turn, preempted.

Hmm.. Can you give me an example for such an exception in my use-case? I
cannot think of an exception that might be preempted (assuming #BP, #MC
cannot be preempted).

I agree that for the super-general case this might be inappropriate.
Nadav Amit Oct. 18, 2018, 3:26 a.m. UTC | #3
at 8:11 PM, Nadav Amit <namit@vmware.com> wrote:

> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>> 
>>> It is sometimes beneficial to prevent preemption for very few
>>> instructions, or prevent preemption for some instructions that precede
>>> a branch (this latter case will be introduced in the next patches).
>>> 
>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>> (opcode 0x40) as an indication that preemption is disabled for the
>>> following instruction.
>> 
>> Nifty!
>> 
>> That being said, I think you have a few bugs. First, you can’t just ignore
>> a rescheduling interrupt, as you introduce unbounded latency when this
>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>> need to jump to a slow-path trampoline that calls schedule() at the end or
>> consider rewinding one instruction instead. Or use TF, which is only a
>> little bit terrifying…
> 
> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> easiest solution would be to make synchronize_sched() ignore preemptions
> that happen while the prefix is detected. It would slightly change the
> meaning of the prefix.

Ignore this nonsense that I wrote. I’ll try to come up with a decent
solution.

>> You also aren’t accounting for the case where you get an exception that
>> is, in turn, preempted.
> 
> Hmm.. Can you give me an example for such an exception in my use-case? I
> cannot think of an exception that might be preempted (assuming #BP, #MC
> cannot be preempted).
> 
> I agree that for the super-general case this might be inappropriate.
Andy Lutomirski Oct. 18, 2018, 3:51 a.m. UTC | #4
On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>
> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> >
> >> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
> >>
> >> It is sometimes beneficial to prevent preemption for very few
> >> instructions, or prevent preemption for some instructions that precede
> >> a branch (this latter case will be introduced in the next patches).
> >>
> >> To provide such functionality on x86-64, we use an empty REX-prefix
> >> (opcode 0x40) as an indication that preemption is disabled for the
> >> following instruction.
> >
> > Nifty!
> >
> > That being said, I think you have a few bugs. First, you can’t just ignore
> > a rescheduling interrupt, as you introduce unbounded latency when this
> > happens — you’re effectively emulating preempt_enable_no_resched(), which
> > is not a drop-in replacement for preempt_enable(). To fix this, you may
> > need to jump to a slow-path trampoline that calls schedule() at the end or
> > consider rewinding one instruction instead. Or use TF, which is only a
> > little bit terrifying…
>
> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> easiest solution would be to make synchronize_sched() ignore preemptions
> that happen while the prefix is detected. It would slightly change the
> meaning of the prefix.
>
> > You also aren’t accounting for the case where you get an exception that
> > is, in turn, preempted.
>
> Hmm.. Can you give me an example for such an exception in my use-case? I
> cannot think of an exception that might be preempted (assuming #BP, #MC
> cannot be preempted).
>

Look for cond_local_irq_enable().

--Andy
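
For context, cond_local_irq_enable() in arch/x86/kernel/traps.c is roughly the following sketch: it re-enables interrupts in an exception handler only when the interrupted context had them enabled, and those IRQs-on windows are exactly where such an exception can itself be hit by a rescheduling interrupt.

/* Sketch of the helper in arch/x86/kernel/traps.c (kernels of this era): */
static void cond_local_irq_enable(struct pt_regs *regs)
{
	if (regs->flags & X86_EFLAGS_IF)
		local_irq_enable();
}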
Peter Zijlstra Oct. 18, 2018, 7:54 a.m. UTC | #5
On Wed, Oct 17, 2018 at 06:22:48PM -0700, Andy Lutomirski wrote:
> 
> > On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
> > 
> > It is sometimes beneficial to prevent preemption for very few
> > instructions, or prevent preemption for some instructions that precede
> > a branch (this latter case will be introduced in the next patches).
> > 
> > To provide such functionality on x86-64, we use an empty REX-prefix
> > (opcode 0x40) as an indication that preemption is disabled for the
> > following instruction.
> 
> Nifty!
> 
> That being said, I think you have a few bugs.

> First, you can’t just ignore a rescheduling interrupt, as you
> introduce unbounded latency when this happens — you’re effectively
> emulating preempt_enable_no_resched(), which is not a drop-in
> replacement for preempt_enable().

> To fix this, you may need to jump to a slow-path trampoline that calls
> schedule() at the end or consider rewinding one instruction instead.
> Or use TF, which is only a little bit terrifying...

At which point we're very close to in-kernel rseq.
Nadav Amit Oct. 18, 2018, 4:47 p.m. UTC | #6
at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:

> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>> It is sometimes beneficial to prevent preemption for very few
>>>> instructions, or prevent preemption for some instructions that precede
>>>> a branch (this latter case will be introduced in the next patches).
>>>> 
>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>> following instruction.
>>> 
>>> Nifty!
>>> 
>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>> consider rewinding one instruction instead. Or use TF, which is only a
>>> little bit terrifying…
>> 
>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>> easiest solution would be to make synchronize_sched() ignore preemptions
>> that happen while the prefix is detected. It would slightly change the
>> meaning of the prefix.

So thinking about it further, rewinding the instruction seems the easiest
and most robust solution. I’ll do it.

>>> You also aren’t accounting for the case where you get an exception that
>>> is, in turn, preempted.
>> 
>> Hmm.. Can you give me an example for such an exception in my use-case? I
>> cannot think of an exception that might be preempted (assuming #BP, #MC
>> cannot be preempted).
> 
> Look for cond_local_irq_enable().

I looked at it. Yet, I still don’t see how exceptions might happen in my
use-case, but having said that - this can be fixed too.

To be frank, I paid relatively little attention to this subject. Any
feedback about the other parts and especially on the high-level approach? Is
modifying the retpolines in the proposed manner (assembly macros)
acceptable?

Thanks,
Nadav
Andy Lutomirski Oct. 18, 2018, 5 p.m. UTC | #7
> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
> 
> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> 
>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>> 
>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>> instructions, or prevent preemption for some instructions that precede
>>>>> a branch (this latter case will be introduced in the next patches).
>>>>> 
>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>> following instruction.
>>>> 
>>>> Nifty!
>>>> 
>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>> little bit terrifying…
>>> 
>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>> that happen while the prefix is detected. It would slightly change the
>>> meaning of the prefix.
> 
> So thinking about it further, rewinding the instruction seems the easiest
> and most robust solution. I’ll do it.
> 
>>>> You also aren’t accounting for the case where you get an exception that
>>>> is, in turn, preempted.
>>> 
>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>> cannot be preempted).
>> 
>> Look for cond_local_irq_enable().
> 
> I looked at it. Yet, I still don’t see how exceptions might happen in my
> use-case, but having said that - this can be fixed too.

I’m not totally certain there’s a case that matters.  But it’s worth checking 

> 
> To be frank, I paid relatively little attention to this subject. Any
> feedback about the other parts and especially on the high-level approach? Is
> modifying the retpolines in the proposed manner (assembly macros)
> acceptable?
> 

It’s certainly a neat idea, and it could be a real speedup.

> Thanks,
> Nadav
Nadav Amit Oct. 18, 2018, 5:25 p.m. UTC | #8
at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:

> 
> 
>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
>> 
>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> 
>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>>> 
>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>> 
>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>> following instruction.
>>>>> 
>>>>> Nifty!
>>>>> 
>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>> little bit terrifying…
>>>> 
>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>> that happen while the prefix is detected. It would slightly change the
>>>> meaning of the prefix.
>> 
>> So thinking about it further, rewinding the instruction seems the easiest
>> and most robust solution. I’ll do it.
>> 
>>>>> You also aren’t accounting for the case where you get an exception that
>>>>> is, in turn, preempted.
>>>> 
>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>> cannot be preempted).
>>> 
>>> Look for cond_local_irq_enable().
>> 
>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>> use-case, but having said that - this can be fixed too.
> 
> I’m not totally certain there’s a case that matters.  But it’s worth checking 
> 
>> To be frank, I paid relatively little attention to this subject. Any
>> feedback about the other parts and especially on the high-level approach? Is
>> modifying the retpolines in the proposed manner (assembly macros)
>> acceptable?
> 
> It’s certainly a neat idea, and it could be a real speedup.

Great. So I’ll try to shape things up, and I still wait for other comments
(from others).

I’ll just mention two more patches I need to cleanup (I know I still owe you some
work, so obviously it will be done later):

1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
BPF filters on the Redis server process that are invoked on each
system-call. Invoking each one requires an indirect branch. The patch keeps
a per-process kernel code-page that holds trampolines for these functions.

2. Binary-search for system-calls. Use the per-process kernel code-page also
to hold multiple trampolines for the 16 common system calls of a certain
process. The patch uses an indirection table and a binary-search to find the
proper trampoline.

Thanks again,
Nadav
Andy Lutomirski Oct. 18, 2018, 5:29 p.m. UTC | #9
On Thu, Oct 18, 2018 at 10:25 AM Nadav Amit <namit@vmware.com> wrote:
>
> at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> >
> >
> >> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
> >>
> >> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
> >>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>>>
> >>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
> >>>>>>
> >>>>>> It is sometimes beneficial to prevent preemption for very few
> >>>>>> instructions, or prevent preemption for some instructions that precede
> >>>>>> a branch (this latter case will be introduced in the next patches).
> >>>>>>
> >>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
> >>>>>> (opcode 0x40) as an indication that preemption is disabled for the
> >>>>>> following instruction.
> >>>>>
> >>>>> Nifty!
> >>>>>
> >>>>> That being said, I think you have a few bugs. First, you can’t just ignore
> >>>>> a rescheduling interrupt, as you introduce unbounded latency when this
> >>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
> >>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
> >>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
> >>>>> consider rewinding one instruction instead. Or use TF, which is only a
> >>>>> little bit terrifying…
> >>>>
> >>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> >>>> easiest solution would be to make synchronize_sched() ignore preemptions
> >>>> that happen while the prefix is detected. It would slightly change the
> >>>> meaning of the prefix.
> >>
> >> So thinking about it further, rewinding the instruction seems the easiest
> >> and most robust solution. I’ll do it.
> >>
> >>>>> You also aren’t accounting for the case where you get an exception that
> >>>>> is, in turn, preempted.
> >>>>
> >>>> Hmm.. Can you give me an example for such an exception in my use-case? I
> >>>> cannot think of an exception that might be preempted (assuming #BP, #MC
> >>>> cannot be preempted).
> >>>
> >>> Look for cond_local_irq_enable().
> >>
> >> I looked at it. Yet, I still don’t see how exceptions might happen in my
> >> use-case, but having said that - this can be fixed too.
> >
> > I’m not totally certain there’s a case that matters.  But it’s worth checking
> >
> >> To be frank, I paid relatively little attention to this subject. Any
> >> feedback about the other parts and especially on the high-level approach? Is
> >> modifying the retpolines in the proposed manner (assembly macros)
> >> acceptable?
> >
> > It’s certainly a neat idea, and it could be a real speedup.
>
> Great. So I’ll try to shape things up, and I still wait for other comments
> (from others).
>
> I’ll just mention two more patches I need to cleanup (I know I still owe you some
> work, so obviously it will be done later):
>
> 1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
> BPF filters on the Redis server process that are invoked on each
> system-call. Invoking each one requires an indirect branch. The patch keeps
> a per-process kernel code-page that holds trampolines for these functions.

I wonder how many levels of branches are needed before the branches
involved exceed the retpoline cost.

>
> 2. Binary-search for system-calls. Use the per-process kernel code-page also
> to hold multiple trampolines for the 16 common system calls of a certain
> process. The patch uses an indirection table and a binary-search to find the
> proper trampoline.

Same comment applies here.

>
> Thanks again,
> Nadav
Nadav Amit Oct. 18, 2018, 5:42 p.m. UTC | #10
at 10:29 AM, Andy Lutomirski <luto@amacapital.net> wrote:

> On Thu, Oct 18, 2018 at 10:25 AM Nadav Amit <namit@vmware.com> wrote:
>> at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> 
>>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> 
>>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>>>>> 
>>>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>>>> 
>>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>>>> following instruction.
>>>>>>> 
>>>>>>> Nifty!
>>>>>>> 
>>>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>>>> little bit terrifying…
>>>>>> 
>>>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>>>> that happen while the prefix is detected. It would slightly change the
>>>>>> meaning of the prefix.
>>>> 
>>>> So thinking about it further, rewinding the instruction seems the easiest
>>>> and most robust solution. I’ll do it.
>>>> 
>>>>>>> You also aren’t accounting for the case where you get an exception that
>>>>>>> is, in turn, preempted.
>>>>>> 
>>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>>>> cannot be preempted).
>>>>> 
>>>>> Look for cond_local_irq_enable().
>>>> 
>>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>>>> use-case, but having said that - this can be fixed too.
>>> 
>>> I’m not totally certain there’s a case that matters.  But it’s worth checking
>>> 
>>>> To be frank, I paid relatively little attention to this subject. Any
>>>> feedback about the other parts and especially on the high-level approach? Is
>>>> modifying the retpolines in the proposed manner (assembly macros)
>>>> acceptable?
>>> 
>>> It’s certainly a neat idea, and it could be a real speedup.
>> 
>> Great. So I’ll try to shape things up, and I still wait for other comments
>> (from others).
>> 
>> I’ll just mention two more patches I need to cleanup (I know I still owe you some
>> work, so obviously it will be done later):
>> 
>> 1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
>> BPF filters on the Redis server process that are invoked on each
>> system-call. Invoking each one requires an indirect branch. The patch keeps
>> a per-process kernel code-page that holds trampolines for these functions.
> 
> I wonder how many levels of branches are needed before the branches
> involved exceed the retpoline cost.

In this case there is no hierarchy, but a list of trampolines that are
called one after the other, as the seccomp filter order is predefined. It
does not work if different threads of the same process have different
filters.

>> 2. Binary-search for system-calls. Use the per-process kernel code-page also
>> to hold multiple trampolines for the 16 common system calls of a certain
>> process. The patch uses an indirection table and a binary-search to find the
>> proper trampoline.
> 
> Same comment applies here.

Branch misprediction wastes ~7 cycles and a retpoline takes at least 30. So,
assuming the branch predictor is not completely stupid, 3-4 levels should not
be too much.
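
As a rough worked example with those assumed numbers: a four-deep chain of direct conditional branches costs only a few cycles when predicted and roughly 4 + 7 ≈ 11 cycles with a single misprediction, still well under the ~30 cycles of one retpoline.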
Nadav Amit Oct. 18, 2018, 6:14 p.m. UTC | #11
at 12:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Oct 17, 2018 at 06:22:48PM -0700, Andy Lutomirski wrote:
>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>> 
>>> It is sometimes beneficial to prevent preemption for very few
>>> instructions, or prevent preemption for some instructions that precede
>>> a branch (this latter case will be introduced in the next patches).
>>> 
>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>> (opcode 0x40) as an indication that preemption is disabled for the
>>> following instruction.
>> 
>> Nifty!
>> 
>> That being said, I think you have a few bugs.
> 
>> First, you can’t just ignore a rescheduling interrupt, as you
>> introduce unbounded latency when this happens — you’re effectively
>> emulating preempt_enable_no_resched(), which is not a drop-in
>> replacement for preempt_enable().
> 
>> To fix this, you may need to jump to a slow-path trampoline that calls
>> schedule() at the end or consider rewinding one instruction instead.
>> Or use TF, which is only a little bit terrifying...
> 
> At which point we're very close to in-kernel rseq.

Interesting. I didn’t know about this feature. I’ll see if I can draw some
ideas from there.
Nadav Amit Oct. 19, 2018, 1:08 a.m. UTC | #12
at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:

> 
> 
>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
>> 
>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> 
>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>>> 
>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>> 
>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>> following instruction.
>>>>> 
>>>>> Nifty!
>>>>> 
>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>> little bit terrifying…
>>>> 
>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>> that happen while the prefix is detected. It would slightly change the
>>>> meaning of the prefix.
>> 
>> So thinking about it further, rewinding the instruction seems the easiest
>> and most robust solution. I’ll do it.
>> 
>>>>> You also aren’t accounting for the case where you get an exception that
>>>>> is, in turn, preempted.
>>>> 
>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>> cannot be preempted).
>>> 
>>> Look for cond_local_irq_enable().
>> 
>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>> use-case, but having said that - this can be fixed too.
> 
> I’m not totally certain there’s a case that matters.  But it’s worth checking 

I am still checking. But, I wanted to ask you whether the existing code is
correct, since it seems to me that others make the same mistake I did, unless
I don’t understand the code.

Consider for example do_int3(), and see my inlined comments:

dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
	...
	ist_enter(regs); 		// => preempt_disable()
	cond_local_irq_enable(regs);	// => assume it enables IRQs

	...
	// resched irq can be delivered here. It will not caused rescheduling
	// since preemption is disabled

	cond_local_irq_disable(regs);	// => assume it disables IRQs
	ist_exit(regs);			// => preempt_enable_no_resched()
}

At this point resched will not happen for unbounded length of time (unless
there is another point when exiting the trap handler that checks if
preemption should take place).

Another example is __BPF_PROG_RUN_ARRAY(), which also uses
preempt_enable_no_resched().

Am I missing something?

Thanks,
Nadav
Andy Lutomirski Oct. 19, 2018, 4:29 a.m. UTC | #13
> On Oct 18, 2018, at 6:08 PM, Nadav Amit <namit@vmware.com> wrote:
>
> at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>>
>>
>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
>>>
>>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>
>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>>>>
>>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>>>
>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>>> following instruction.
>>>>>>
>>>>>> Nifty!
>>>>>>
>>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>>> little bit terrifying…
>>>>>
>>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>>> that happen while the prefix is detected. It would slightly change the
>>>>> meaning of the prefix.
>>>
>>> So thinking about it further, rewinding the instruction seems the easiest
>>> and most robust solution. I’ll do it.
>>>
>>>>>> You also aren’t accounting for the case where you get an exception that
>>>>>> is, in turn, preempted.
>>>>>
>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>>> cannot be preempted).
>>>>
>>>> Look for cond_local_irq_enable().
>>>
>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>>> use-case, but having said that - this can be fixed too.
>>
>> I’m not totally certain there’s a case that matters.  But it’s worth checking
>
> I am still checking. But, I wanted to ask you whether the existing code is
> correct, since it seems to me that others make the same mistake I did, unless
> I don’t understand the code.
>
> Consider for example do_int3(), and see my inlined comments:
>
> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> {
>    ...
>    ist_enter(regs);        // => preempt_disable()
>    cond_local_irq_enable(regs);    // => assume it enables IRQs
>
>    ...
>    // resched irq can be delivered here. It will not caused rescheduling
>    // since preemption is disabled
>
>    cond_local_irq_disable(regs);    // => assume it disables IRQs
>    ist_exit(regs);            // => preempt_enable_no_resched()
> }
>
> At this point resched will not happen for unbounded length of time (unless
> there is another point when exiting the trap handler that checks if
> preemption should take place).

I think it's only a bug in the cases where someone uses extable to fix
up an int3 (which would be nuts) or that we oops.  But I should still
fix it.  In the normal case where int3 was in user code, we'll miss
the reschedule in do_trap(), but we'll reschedule in
prepare_exit_to_usermode() -> exit_to_usermode_loop().

>
> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> preempt_enable_no_resched().

Alexei, I think this code is just wrong. Do you know why it uses
preempt_enable_no_resched()?

Oleg, the code in kernel/signal.c:

                preempt_disable();
                read_unlock(&tasklist_lock);
                preempt_enable_no_resched();
                freezable_schedule();

looks bogus.  I don't get what it's trying to achieve with
preempt_disable(), and I also don't see why no_resched does anything.
Sure, it prevents a reschedule triggered during read_unlock() from
causing a reschedule, but it doesn't prevent an interrupt immediately
after the preempt_enable_no_resched() call from scheduling.

--Andy

>
> Am I missing something?
>
> Thanks,
> Nadav
Nadav Amit Oct. 19, 2018, 4:44 a.m. UTC | #14
at 9:29 PM, Andy Lutomirski <luto@kernel.org> wrote:

>> On Oct 18, 2018, at 6:08 PM, Nadav Amit <namit@vmware.com> wrote:
>> 
>> at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> 
>>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> 
>>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>>>>> 
>>>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>>>> 
>>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>>>> following instruction.
>>>>>>> 
>>>>>>> Nifty!
>>>>>>> 
>>>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>>>> little bit terrifying…
>>>>>> 
>>>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>>>> that happen while the prefix is detected. It would slightly change the
>>>>>> meaning of the prefix.
>>>> 
>>>> So thinking about it further, rewinding the instruction seems the easiest
>>>> and most robust solution. I’ll do it.
>>>> 
>>>>>>> You also aren’t accounting for the case where you get an exception that
>>>>>>> is, in turn, preempted.
>>>>>> 
>>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>>>> cannot be preempted).
>>>>> 
>>>>> Look for cond_local_irq_enable().
>>>> 
>>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>>>> use-case, but having said that - this can be fixed too.
>>> 
>>> I’m not totally certain there’s a case that matters.  But it’s worth checking
>> 
>> I am still checking. But, I wanted to ask you whether the existing code is
>> correct, since it seems to me that others make the same mistake I did, unless
>> I don’t understand the code.
>> 
>> Consider for example do_int3(), and see my inlined comments:
>> 
>> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>> {
>>   ...
>>   ist_enter(regs);        // => preempt_disable()
>>   cond_local_irq_enable(regs);    // => assume it enables IRQs
>> 
>>   ...
>>   // resched irq can be delivered here. It will not caused rescheduling
>>   // since preemption is disabled
>> 
>>   cond_local_irq_disable(regs);    // => assume it disables IRQs
>>   ist_exit(regs);            // => preempt_enable_no_resched()
>> }
>> 
>> At this point resched will not happen for unbounded length of time (unless
>> there is another point when exiting the trap handler that checks if
>> preemption should take place).
> 
> I think it's only a bug in the cases where someone uses extable to fix
> up an int3 (which would be nuts) or that we oops.  But I should still
> fix it.  In the normal case where int3 was in user code, we'll miss
> the reschedule in do_trap(), but we'll reschedule in
> prepare_exit_to_usermode() -> exit_to_usermode_loop().

Thanks for your quick response, and sorry for bothering instead of dealing
with it. Note that do_debug() does something similar to do_int3().

And then there is optimized_callback() that also uses
preempt_enable_no_resched(). I think the original use was correct, but then
a19b2e3d7839 ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized
kprobes") removed the IRQ disabling, while leaving
preempt_enable_no_resched(). No?
Alexei Starovoitov Oct. 19, 2018, 5 a.m. UTC | #15
> 
> >
> > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > preempt_enable_no_resched().
> 
> Alexei, I think this code is just wrong.

why 'just wrong' ?

> Do you know why it uses
> preempt_enable_no_resched()?

don't recall precisely.
we could be preemptable at the point where the macro is called.
I think the goal of no_resched was to avoid adding scheduling points
where they didn't exist before just because a prog ran for a few nsec.
May be Daniel or Roman remember.
Peter Zijlstra Oct. 19, 2018, 8:19 a.m. UTC | #16
On Thu, Oct 18, 2018 at 09:29:39PM -0700, Andy Lutomirski wrote:

> > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > preempt_enable_no_resched().
> 
> Alexei, I think this code is just wrong. Do you know why it uses
> preempt_enable_no_resched()?

Yes, that's a straight up bug.

It looks like I need to go fix up abuse again :/

> Oleg, the code in kernel/signal.c:
> 
>                 preempt_disable();
>                 read_unlock(&tasklist_lock);
>                 preempt_enable_no_resched();
>                 freezable_schedule();
> 

The purpose here is to avoid back-to-back schedule() calls, and this
pattern is one of the few correct uses of preempt_enable_no_resched().

Suppose we got a preemption while holding the read_lock(), then when
we'd do read_unlock(), we'd drop preempt_count to 0 and reschedule, then
when we get back we instantly call into schedule _again_.

What this code does, is it increments preempt_count such that
read_unlock() doesn't hit 0 and doesn't call schedule, then we lower it
to 0 without a call to schedule() and then call schedule() explicitly.
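
Annotating the quoted kernel/signal.c lines with that explanation (comments added here for illustration, not present in the kernel source):

preempt_disable();		/* bump preempt_count so the unlock below cannot hit 0 */
read_unlock(&tasklist_lock);	/* may set NEED_RESCHED, but cannot call schedule() here */
preempt_enable_no_resched();	/* drop the count back to 0 without an implicit schedule() */
freezable_schedule();		/* the single, explicit reschedule point */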
Peter Zijlstra Oct. 19, 2018, 8:22 a.m. UTC | #17
On Thu, Oct 18, 2018 at 10:00:53PM -0700, Alexei Starovoitov wrote:
> > 
> > >
> > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > > preempt_enable_no_resched().
> > 
> > Alexei, I think this code is just wrong.
> 
> why 'just wrong' ?

Because you lost a preemption point, this is a no-no.

> 
> > Do you know why it uses
> > preempt_enable_no_resched()?
> 
> don't recall precisely.
> we could be preemptable at the point where the macro is called.
> I think the goal of no_resched was to avoid adding scheduling points
> where they didn't exist before just because a prog ran for a few nsec.
> May be Daniel or Roman remember.

No, you did the exact opposite, where there previously was a preemption,
you just ate it. The band saw didn't get stopped in time, you lose your
hand, etc.
Peter Zijlstra Oct. 19, 2018, 8:33 a.m. UTC | #18
On Fri, Oct 19, 2018 at 01:08:23AM +0000, Nadav Amit wrote:
> Consider for example do_int3(), and see my inlined comments:
> 
> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> {
> 	...
> 	ist_enter(regs); 		// => preempt_disable()
> 	cond_local_irq_enable(regs);	// => assume it enables IRQs
> 
> 	...
> 	// resched irq can be delivered here. It will not caused rescheduling
> 	// since preemption is disabled
> 
> 	cond_local_irq_disable(regs);	// => assume it disables IRQs
> 	ist_exit(regs);			// => preempt_enable_no_resched()
> }
> 
> At this point resched will not happen for unbounded length of time (unless
> there is another point when exiting the trap handler that checks if
> preemption should take place).
> 
> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> preempt_enable_no_resched().
> 
> Am I missing something?

Would not the interrupt return then check for TIF_NEED_RESCHED and call
schedule() ?

I think (and this certainly wants a comment) that the ist_exit()
thing hard relies on the interrupt-return path doing the reschedule.
Oleg Nesterov Oct. 19, 2018, 10:38 a.m. UTC | #19
On 10/18, Andy Lutomirski wrote:
>
> Oleg, the code in kernel/signal.c:
>
>                 preempt_disable();
>                 read_unlock(&tasklist_lock);
>                 preempt_enable_no_resched();
>                 freezable_schedule();
>
> looks bogus.  I don't get what it's trying to achieve with
> preempt_disable(), and I also don't see why no_resched does anything.
> Sure, it prevents a reschedule triggered during read_unlock() from
> causing a reschedule,

Yes. Let's suppose we remove preempt_disable/enable.

The debugger was already woken up; if it runs on the same CPU it will quite possibly
preempt the tracee. After that, the debugger will spin in wait_task_inactive(),
until it is in turn preempted or calls schedule_timeout(1), so that the tracee
(current) can finally call __schedule(preempt = F) and call deactivate_task() to
become inactive.

> but it doesn't prevent an interrupt immediately
> after the preempt_enable_no_resched() call from scheduling.

Yes, but this is less likely.

Oleg.
Andy Lutomirski Oct. 19, 2018, 2:29 p.m. UTC | #20
> On Oct 19, 2018, at 1:33 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Fri, Oct 19, 2018 at 01:08:23AM +0000, Nadav Amit wrote:
>> Consider for example do_int3(), and see my inlined comments:
>> 
>> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>> {
>>    ...
>>    ist_enter(regs);        // => preempt_disable()
>>    cond_local_irq_enable(regs);    // => assume it enables IRQs
>> 
>>    ...
>>    // resched irq can be delivered here. It will not caused rescheduling
>>    // since preemption is disabled
>> 
>>    cond_local_irq_disable(regs);    // => assume it disables IRQs
>>    ist_exit(regs);            // => preempt_enable_no_resched()
>> }
>> 
>> At this point resched will not happen for unbounded length of time (unless
>> there is another point when exiting the trap handler that checks if
>> preemption should take place).
>> 
>> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
>> preempt_enable_no_resched().
>> 
>> Am I missing something?
> 
> Would not the interrupt return then check for TIF_NEED_RESCHED and call
> schedule() ?

The paranoid exit path doesn’t check TIF_NEED_RESCHED because it’s fundamentally atomic — it’s running on a percpu stack and it can’t schedule. In theory we could do some evil stack switching, but we don’t.

How does NMI handle this?  If an NMI that hit interruptible kernel code overflows a perf counter, how does the wake up work?

(do_int3() is special because it’s not actually IST.  But it can hit in odd places due to kprobes, and I’m nervous about recursing incorrectly into RCU and context tracking code if we were to use exception_enter().)

> 
> I think (and this certainly wants a comment) that the ist_exit()
> thing hard relies on the interrupt-return path doing the reschedule.
Alexei Starovoitov Oct. 19, 2018, 2:47 p.m. UTC | #21
On Fri, Oct 19, 2018 at 1:22 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 18, 2018 at 10:00:53PM -0700, Alexei Starovoitov wrote:
> > >
> > > >
> > > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > > > preempt_enable_no_resched().
> > >
> > > Alexei, I think this code is just wrong.
> >
> > why 'just wrong' ?
>
> Because you lost a preemption point, this is a no-no.
>
> >
> > > Do you know why it uses
> > > preempt_enable_no_resched()?
> >
> > don't recall precisely.
> > we could be preemptable at the point where the macro is called.
> > I think the goal of no_resched was to avoid adding scheduling points
> > where they didn't exist before just because a prog ran for a few nsec.
> > May be Daniel or Roman remember.
>
> No, you did the exact opposite, where there previously was a preemption,
> you just ate it. The band saw didn't get stopped in time, you lose your
> hand, etc.

Let me do a few experiments then.
We will fix it up.
Masami Hiramatsu Oct. 20, 2018, 1:22 a.m. UTC | #22
On Fri, 19 Oct 2018 04:44:33 +0000
Nadav Amit <namit@vmware.com> wrote:

> at 9:29 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> >> On Oct 18, 2018, at 6:08 PM, Nadav Amit <namit@vmware.com> wrote:
> >> 
> >> at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> 
> >>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
> >>>> 
> >>>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>>> 
> >>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
> >>>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>>>>> 
> >>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
> >>>>>>>> 
> >>>>>>>> It is sometimes beneficial to prevent preemption for very few
> >>>>>>>> instructions, or prevent preemption for some instructions that precede
> >>>>>>>> a branch (this latter case will be introduced in the next patches).
> >>>>>>>> 
> >>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
> >>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
> >>>>>>>> following instruction.
> >>>>>>> 
> >>>>>>> Nifty!
> >>>>>>> 
> >>>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
> >>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
> >>>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
> >>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
> >>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
> >>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
> >>>>>>> little bit terrifying…
> >>>>>> 
> >>>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> >>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
> >>>>>> that happen while the prefix is detected. It would slightly change the
> >>>>>> meaning of the prefix.
> >>>> 
> >>>> So thinking about it further, rewinding the instruction seems the easiest
> >>>> and most robust solution. I’ll do it.
> >>>> 
> >>>>>>> You also aren’t accounting for the case where you get an exception that
> >>>>>>> is, in turn, preempted.
> >>>>>> 
> >>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
> >>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
> >>>>>> cannot be preempted).
> >>>>> 
> >>>>> Look for cond_local_irq_enable().
> >>>> 
> >>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
> >>>> use-case, but having said that - this can be fixed too.
> >>> 
> >>> I’m not totally certain there’s a case that matters.  But it’s worth checking
> >> 
> >> I am still checking. But, I wanted to ask you whether the existing code is
> >> correct, since it seems to me that others make the same mistake I did, unless
> >> I don’t understand the code.
> >> 
> >> Consider for example do_int3(), and see my inlined comments:
> >> 
> >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> >> {
> >>   ...
> >>   ist_enter(regs);        // => preempt_disable()
> >>   cond_local_irq_enable(regs);    // => assume it enables IRQs
> >> 
> >>   ...
> >>   // resched irq can be delivered here. It will not caused rescheduling
> >>   // since preemption is disabled
> >> 
> >>   cond_local_irq_disable(regs);    // => assume it disables IRQs
> >>   ist_exit(regs);            // => preempt_enable_no_resched()
> >> }
> >> 
> >> At this point resched will not happen for unbounded length of time (unless
> >> there is another point when exiting the trap handler that checks if
> >> preemption should take place).
> > 
> > I think it's only a bug in the cases where someone uses extable to fix
> > up an int3 (which would be nuts) or that we oops.  But I should still
> > fix it.  In the normal case where int3 was in user code, we'll miss
> > the reschedule in do_trap(), but we'll reschedule in
> > prepare_exit_to_usermode() -> exit_to_usermode_loop().
> 
> Thanks for your quick response, and sorry for bothering instead of dealing
> with it. Note that do_debug() does something similar to do_int3().
> 
> And then there is optimized_callback() that also uses
> preempt_enable_no_resched(). I think the original use was correct, but then
> a19b2e3d7839 ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized
> kprobes") removed the IRQ disabling, while leaving
> preempt_enable_no_resched(). No?

Ah, good catch!
Indeed, we don't need to stick on no_resched anymore.

Thanks!
Peter Zijlstra Nov. 29, 2018, 9:46 a.m. UTC | #23
On Fri, Oct 19, 2018 at 07:29:45AM -0700, Andy Lutomirski wrote:
> > On Oct 19, 2018, at 1:33 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> >> On Fri, Oct 19, 2018 at 01:08:23AM +0000, Nadav Amit wrote:
> >> Consider for example do_int3(), and see my inlined comments:
> >> 
> >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> >> {
> >>    ...
> >>    ist_enter(regs);        // => preempt_disable()
> >>    cond_local_irq_enable(regs);    // => assume it enables IRQs
> >> 
> >>    ...
> >>    // resched irq can be delivered here. It will not caused rescheduling
> >>    // since preemption is disabled
> >> 
> >>    cond_local_irq_disable(regs);    // => assume it disables IRQs
> >>    ist_exit(regs);            // => preempt_enable_no_resched()
> >> }
> >> 
> >> At this point resched will not happen for unbounded length of time (unless
> >> there is another point when exiting the trap handler that checks if
> >> preemption should take place).
> >> 
> >> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> >> preempt_enable_no_resched().
> >> 
> >> Am I missing something?
> > 
> > Would not the interrupt return then check for TIF_NEED_RESCHED and call
> > schedule() ?
> 
> The paranoid exit path doesn’t check TIF_NEED_RESCHED because it’s
> fundamentally atomic — it’s running on a percpu stack and it can’t
> schedule. In theory we could do some evil stack switching, but we
> don’t.
> 
> How does NMI handle this?  If an NMI that hit interruptible kernel
> code overflows a perf counter, how does the wake up work?

NMIs should never set NEED_RESCHED. What perf does is self-IPI
(irq_work) and do the wakeup from there.
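
A minimal sketch of that pattern (illustrative names, not the actual perf code): the NMI handler only queues an irq_work, which raises a self-IPI, and the wakeup then runs in ordinary IRQ context where any resulting NEED_RESCHED is handled on the normal interrupt-return path.

#include <linux/irq_work.h>
#include <linux/sched.h>

static struct task_struct *target;		/* hypothetical task to wake */

static void deferred_wakeup(struct irq_work *work)
{
	wake_up_process(target);		/* runs from the self-IPI, not from the NMI */
}

static struct irq_work wakeup_work = { .func = deferred_wakeup };

/* called from NMI context: */
static void nmi_overflow_handler(void)
{
	irq_work_queue(&wakeup_work);		/* just queue; no wakeup inside the NMI */
}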

Patch

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cb8a5893fd33..31d59aad496e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -643,6 +643,16 @@  retint_kernel:
 	jnc	1f
 0:	cmpl	$0, PER_CPU_VAR(__preempt_count)
 	jnz	1f
+
+	/*
+	 * Allow to use hint to prevent preemption on a certain instruction.
+	 * Consider an instruction with the first byte having REX prefix
+	 * without any bits set as an indication for preemption disabled.
+	 */
+	movq    RIP(%rsp), %rax
+	cmpb    $PREEMPT_DISABLE_PREFIX, (%rax)
+	jz	1f
+
 	call	preempt_schedule_irq
 	jmp	0b
 1:
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 80dc14422495..0267611eb247 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -52,6 +52,12 @@ 
 	jnz	771b;				\
 	add	$(BITS_PER_LONG/8) * nr, sp;
 
+/*
+ * An empty REX-prefix is an indication that preemption should not take place on
+ * this instruction.
+ */
+#define PREEMPT_DISABLE_PREFIX                 (0x40)
+
 #ifdef __ASSEMBLY__
 
 /*
@@ -148,6 +154,12 @@ 
 #endif
 .endm
 
+.macro preempt_disable_prefix
+#ifdef CONFIG_PREEMPT
+	.byte	PREEMPT_DISABLE_PREFIX
+#endif
+.endm
+
 #else /* __ASSEMBLY__ */
 
 #define ANNOTATE_NOSPEC_ALTERNATIVE				\