All of lore.kernel.org
 help / color / mirror / Atom feed
* Question WRT early IRQ/NMI entry code
@ 2021-11-30 11:28 ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 42+ messages in thread
From: Nicolas Saenz Julienne @ 2021-11-30 11:28 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, rcu
  Cc: Thomas Gleixner, Peter Zijlstra, Mark Rutland, Steven Rostedt,
	paulmck, mtosatti, frederic

Hi All,
while going over the IRQ/NMI entry code I've found a small 'inconsistency':
while in the IRQ entry path, we inform RCU of the context change *before*
incrementing the preempt counter, the opposite happens for the NMI entry
path. This applies to both arm64 and x86[1].

Actually, rcu_nmi_enter() — which is also the main RCU context switch function
for the IRQ entry path — uses the preempt counter to verify it's not in NMI
context. So it would make sense to assume all callers have the same updated
view of the preempt count, which isn't true ATM.

I'm sure there an obscure/non-obvious reason for this, right?

Thanks!
Nicolas

[1] 
IRQ path:
  -> x86_64 asm (entry_64.S)
  -> irqentry_enter() -> rcu_irq_enter() -> *rcu_nmi_enter()*
  -> run_irq_on_irqstack_cond() -> irq_exit_rcu() -> *preempt_count_add(HARDIRQ_OFFSET)*
  -> // Run IRQ...

NMI path:
  -> x86_64 asm (entry_64.S)
  -> irqentry_nmi_enter() -> __nmi_enter() -> *__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET)*
                          -> *rcu_nmi_enter()*

For arm64, see 'arch/arm64/kernel/entry-common.c'.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Question WRT early IRQ/NMI entry code
@ 2021-11-30 11:28 ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 42+ messages in thread
From: Nicolas Saenz Julienne @ 2021-11-30 11:28 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, rcu
  Cc: Thomas Gleixner, Peter Zijlstra, Mark Rutland, Steven Rostedt,
	paulmck, mtosatti, frederic

Hi All,
while going over the IRQ/NMI entry code I've found a small 'inconsistency':
while in the IRQ entry path, we inform RCU of the context change *before*
incrementing the preempt counter, the opposite happens for the NMI entry
path. This applies to both arm64 and x86[1].

Actually, rcu_nmi_enter() — which is also the main RCU context switch function
for the IRQ entry path — uses the preempt counter to verify it's not in NMI
context. So it would make sense to assume all callers have the same updated
view of the preempt count, which isn't true ATM.

I'm sure there an obscure/non-obvious reason for this, right?

Thanks!
Nicolas

[1] 
IRQ path:
  -> x86_64 asm (entry_64.S)
  -> irqentry_enter() -> rcu_irq_enter() -> *rcu_nmi_enter()*
  -> run_irq_on_irqstack_cond() -> irq_exit_rcu() -> *preempt_count_add(HARDIRQ_OFFSET)*
  -> // Run IRQ...

NMI path:
  -> x86_64 asm (entry_64.S)
  -> irqentry_nmi_enter() -> __nmi_enter() -> *__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET)*
                          -> *rcu_nmi_enter()*

For arm64, see 'arch/arm64/kernel/entry-common.c'.


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Question WRT early IRQ/NMI entry code
  2021-11-30 11:28 ` Nicolas Saenz Julienne
@ 2021-11-30 12:05   ` Frederic Weisbecker
  -1 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2021-11-30 12:05 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: linux-kernel, linux-arm-kernel, rcu, Thomas Gleixner,
	Peter Zijlstra, Mark Rutland, Steven Rostedt, paulmck, mtosatti

On Tue, Nov 30, 2021 at 12:28:41PM +0100, Nicolas Saenz Julienne wrote:
> Hi All,
> while going over the IRQ/NMI entry code I've found a small 'inconsistency':
> while in the IRQ entry path, we inform RCU of the context change *before*
> incrementing the preempt counter, the opposite happens for the NMI entry
> path. This applies to both arm64 and x86[1].
> 
> Actually, rcu_nmi_enter() — which is also the main RCU context switch function
> for the IRQ entry path — uses the preempt counter to verify it's not in NMI
> context. So it would make sense to assume all callers have the same updated
> view of the preempt count, which isn't true ATM.
> 
> I'm sure there an obscure/non-obvious reason for this, right?

At least I can't find an immediate reason for it either :-)

Thanks!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Question WRT early IRQ/NMI entry code
@ 2021-11-30 12:05   ` Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2021-11-30 12:05 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: linux-kernel, linux-arm-kernel, rcu, Thomas Gleixner,
	Peter Zijlstra, Mark Rutland, Steven Rostedt, paulmck, mtosatti

On Tue, Nov 30, 2021 at 12:28:41PM +0100, Nicolas Saenz Julienne wrote:
> Hi All,
> while going over the IRQ/NMI entry code I've found a small 'inconsistency':
> while in the IRQ entry path, we inform RCU of the context change *before*
> incrementing the preempt counter, the opposite happens for the NMI entry
> path. This applies to both arm64 and x86[1].
> 
> Actually, rcu_nmi_enter() — which is also the main RCU context switch function
> for the IRQ entry path — uses the preempt counter to verify it's not in NMI
> context. So it would make sense to assume all callers have the same updated
> view of the preempt count, which isn't true ATM.
> 
> I'm sure there an obscure/non-obvious reason for this, right?

At least I can't find an immediate reason for it either :-)

Thanks!

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Question WRT early IRQ/NMI entry code
  2021-11-30 11:28 ` Nicolas Saenz Julienne
@ 2021-11-30 12:50   ` Mark Rutland
  -1 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2021-11-30 12:50 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: linux-kernel, linux-arm-kernel, rcu, Thomas Gleixner,
	Peter Zijlstra, Steven Rostedt, mtosatti, frederic, paulmck

On Tue, Nov 30, 2021 at 12:28:41PM +0100, Nicolas Saenz Julienne wrote:
> Hi All,

Hi Nicolas,

> while going over the IRQ/NMI entry code I've found a small 'inconsistency':
> while in the IRQ entry path, we inform RCU of the context change *before*
> incrementing the preempt counter, the opposite happens for the NMI entry
> path. This applies to both arm64 and x86[1].

For arm64, the style was copied from the x86 code, and (AFAIK) I had no
particular reason for following either order other than consistency with x86.

> Actually, rcu_nmi_enter() — which is also the main RCU context switch function
> for the IRQ entry path — uses the preempt counter to verify it's not in NMI
> context. So it would make sense to assume all callers have the same updated
> view of the preempt count, which isn't true ATM.

I agree consistency would be nice, assuming there's no issue preventing us from
moving the IRQ preempt_count logic earlier.

It sounds like today the ordering is only *required* when entering an NMI, and
we already do the right thing there. Do you see a case where something would go
wrong (or would behave differently with the flipped ordering) for IRQ today?

> I'm sure there an obscure/non-obvious reason for this, right?

TBH I suspect this is mostly oversight / legacy, and likely something we can
tighten up.

Thanks,
Mark.

> 
> Thanks!
> Nicolas
> 
> [1] 
> IRQ path:
>   -> x86_64 asm (entry_64.S)
>   -> irqentry_enter() -> rcu_irq_enter() -> *rcu_nmi_enter()*
>   -> run_irq_on_irqstack_cond() -> irq_exit_rcu() -> *preempt_count_add(HARDIRQ_OFFSET)*
>   -> // Run IRQ...
> 
> NMI path:
>   -> x86_64 asm (entry_64.S)
>   -> irqentry_nmi_enter() -> __nmi_enter() -> *__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET)*
>                           -> *rcu_nmi_enter()*
> 
> For arm64, see 'arch/arm64/kernel/entry-common.c'.
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Question WRT early IRQ/NMI entry code
@ 2021-11-30 12:50   ` Mark Rutland
  0 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2021-11-30 12:50 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: linux-kernel, linux-arm-kernel, rcu, Thomas Gleixner,
	Peter Zijlstra, Steven Rostedt, mtosatti, frederic, paulmck

On Tue, Nov 30, 2021 at 12:28:41PM +0100, Nicolas Saenz Julienne wrote:
> Hi All,

Hi Nicolas,

> while going over the IRQ/NMI entry code I've found a small 'inconsistency':
> while in the IRQ entry path, we inform RCU of the context change *before*
> incrementing the preempt counter, the opposite happens for the NMI entry
> path. This applies to both arm64 and x86[1].

For arm64, the style was copied from the x86 code, and (AFAIK) I had no
particular reason for following either order other than consistency with x86.

> Actually, rcu_nmi_enter() — which is also the main RCU context switch function
> for the IRQ entry path — uses the preempt counter to verify it's not in NMI
> context. So it would make sense to assume all callers have the same updated
> view of the preempt count, which isn't true ATM.

I agree consistency would be nice, assuming there's no issue preventing us from
moving the IRQ preempt_count logic earlier.

It sounds like today the ordering is only *required* when entering an NMI, and
we already do the right thing there. Do you see a case where something would go
wrong (or would behave differently with the flipped ordering) for IRQ today?

> I'm sure there an obscure/non-obvious reason for this, right?

TBH I suspect this is mostly oversight / legacy, and likely something we can
tighten up.

Thanks,
Mark.

> 
> Thanks!
> Nicolas
> 
> [1] 
> IRQ path:
>   -> x86_64 asm (entry_64.S)
>   -> irqentry_enter() -> rcu_irq_enter() -> *rcu_nmi_enter()*
>   -> run_irq_on_irqstack_cond() -> irq_exit_rcu() -> *preempt_count_add(HARDIRQ_OFFSET)*
>   -> // Run IRQ...
> 
> NMI path:
>   -> x86_64 asm (entry_64.S)
>   -> irqentry_nmi_enter() -> __nmi_enter() -> *__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET)*
>                           -> *rcu_nmi_enter()*
> 
> For arm64, see 'arch/arm64/kernel/entry-common.c'.
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Question WRT early IRQ/NMI entry code
  2021-11-30 11:28 ` Nicolas Saenz Julienne
@ 2021-11-30 13:47   ` Thomas Gleixner
  -1 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-30 13:47 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, linux-kernel, linux-arm-kernel, rcu
  Cc: Peter Zijlstra, Mark Rutland, Steven Rostedt, paulmck, mtosatti,
	frederic

On Tue, Nov 30 2021 at 12:28, Nicolas Saenz Julienne wrote:
> while going over the IRQ/NMI entry code I've found a small 'inconsistency':
> while in the IRQ entry path, we inform RCU of the context change *before*
> incrementing the preempt counter, the opposite happens for the NMI entry
> path. This applies to both arm64 and x86[1].
>
> Actually, rcu_nmi_enter() — which is also the main RCU context switch function
> for the IRQ entry path — uses the preempt counter to verify it's not in NMI
> context. So it would make sense to assume all callers have the same updated
> view of the preempt count, which isn't true ATM.
>
> I'm sure there an obscure/non-obvious reason for this, right?

There is.

> IRQ path:
>   -> x86_64 asm (entry_64.S)
>   -> irqentry_enter() -> rcu_irq_enter() -> *rcu_nmi_enter()*
>   -> run_irq_on_irqstack_cond() -> irq_exit_rcu() -> *preempt_count_add(HARDIRQ_OFFSET)*
>   -> // Run IRQ...
>
> NMI path:
>   -> x86_64 asm (entry_64.S)
>   -> irqentry_nmi_enter() -> __nmi_enter() -> *__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET)*
>                           -> *rcu_nmi_enter()*

The reason is symmetry vs. returning from interupt / exception:

 irqentry_enter()
      exit_rcu = false;

      if (user_mode(regs)) {
          irqentry_enter_from_user_mode(regs)
            __enter_from_user_mode(regs)
              user_exit_irqoff();       <- RCU handling for NOHZ full

      } else if (is_idle_task_current()) {
            rcu_irq_enter()
            exit_rcu = true;
      }

 irq_enter_rcu()
     __irq_enter_raw()
     preempt_count_add(HARDIRQ_OFFSET);

 irq_handler()

 irq_exit_rcu()
     preempt_count_sub(HARDIRQ_OFFSET);
     if (!in_interrupt() && local_softirq_pending())
     	 invoke_softirq();

 irqentry_exit(regs, exit_rcu)

     if (user_mode(regs)) {
         irqentry_exit_to_usermode(regs)
           user_enter_irqoff();     <- RCU handling for NOHZ full
     } else if (irqs_enabled(regs)) {
           if (exit_rcu) {          <- Idle task special case
               rcu_irq_exit();
           } else {
              irqentry_exit_cond_resched();
           }

     } else if (exit_rcu) {
         rcu_irq_exit();
     }

On return from interrupt HARDIRQ_OFFSET has to be removed _before_
handling soft interrupts. It's also required that the preempt count has
the original state _before_ reaching irqentry_exit() which
might schedule if the interrupt/exception hit user space or kernel space
with interrupts enabled.

So doing it symmetric makes sense.

For NMIs the above conditionals do not apply at all and we just do

    __nmi_enter()
        preempt_count_add(NMI_COUNT + HARDIRQ_COUNT);
    rcu_nmi_enter();

    handle_nmi();

    rcu_nmi_exit();
    __nmi_exit()
        preempt_count_sub(NMI_COUNT + HARDIRQ_COUNT);

The reason why preempt count is incremented before invoking
rcu_nmi_enter() is simply that RCU has to know about being in NMI
context, i.e. in_nmi() has to return the correct answer.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Question WRT early IRQ/NMI entry code
@ 2021-11-30 13:47   ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-30 13:47 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, linux-kernel, linux-arm-kernel, rcu
  Cc: Peter Zijlstra, Mark Rutland, Steven Rostedt, paulmck, mtosatti,
	frederic

On Tue, Nov 30 2021 at 12:28, Nicolas Saenz Julienne wrote:
> while going over the IRQ/NMI entry code I've found a small 'inconsistency':
> while in the IRQ entry path, we inform RCU of the context change *before*
> incrementing the preempt counter, the opposite happens for the NMI entry
> path. This applies to both arm64 and x86[1].
>
> Actually, rcu_nmi_enter() — which is also the main RCU context switch function
> for the IRQ entry path — uses the preempt counter to verify it's not in NMI
> context. So it would make sense to assume all callers have the same updated
> view of the preempt count, which isn't true ATM.
>
> I'm sure there an obscure/non-obvious reason for this, right?

There is.

> IRQ path:
>   -> x86_64 asm (entry_64.S)
>   -> irqentry_enter() -> rcu_irq_enter() -> *rcu_nmi_enter()*
>   -> run_irq_on_irqstack_cond() -> irq_exit_rcu() -> *preempt_count_add(HARDIRQ_OFFSET)*
>   -> // Run IRQ...
>
> NMI path:
>   -> x86_64 asm (entry_64.S)
>   -> irqentry_nmi_enter() -> __nmi_enter() -> *__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET)*
>                           -> *rcu_nmi_enter()*

The reason is symmetry vs. returning from interupt / exception:

 irqentry_enter()
      exit_rcu = false;

      if (user_mode(regs)) {
          irqentry_enter_from_user_mode(regs)
            __enter_from_user_mode(regs)
              user_exit_irqoff();       <- RCU handling for NOHZ full

      } else if (is_idle_task_current()) {
            rcu_irq_enter()
            exit_rcu = true;
      }

 irq_enter_rcu()
     __irq_enter_raw()
     preempt_count_add(HARDIRQ_OFFSET);

 irq_handler()

 irq_exit_rcu()
     preempt_count_sub(HARDIRQ_OFFSET);
     if (!in_interrupt() && local_softirq_pending())
     	 invoke_softirq();

 irqentry_exit(regs, exit_rcu)

     if (user_mode(regs)) {
         irqentry_exit_to_usermode(regs)
           user_enter_irqoff();     <- RCU handling for NOHZ full
     } else if (irqs_enabled(regs)) {
           if (exit_rcu) {          <- Idle task special case
               rcu_irq_exit();
           } else {
              irqentry_exit_cond_resched();
           }

     } else if (exit_rcu) {
         rcu_irq_exit();
     }

On return from interrupt HARDIRQ_OFFSET has to be removed _before_
handling soft interrupts. It's also required that the preempt count has
the original state _before_ reaching irqentry_exit() which
might schedule if the interrupt/exception hit user space or kernel space
with interrupts enabled.

So doing it symmetric makes sense.

For NMIs the above conditionals do not apply at all and we just do

    __nmi_enter()
        preempt_count_add(NMI_COUNT + HARDIRQ_COUNT);
    rcu_nmi_enter();

    handle_nmi();

    rcu_nmi_exit();
    __nmi_exit()
        preempt_count_sub(NMI_COUNT + HARDIRQ_COUNT);

The reason why preempt count is incremented before invoking
rcu_nmi_enter() is simply that RCU has to know about being in NMI
context, i.e. in_nmi() has to return the correct answer.

Thanks,

        tglx

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Question WRT early IRQ/NMI entry code
  2021-11-30 13:47   ` Thomas Gleixner
@ 2021-11-30 14:13     ` Steven Rostedt
  -1 siblings, 0 replies; 42+ messages in thread
From: Steven Rostedt @ 2021-11-30 14:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nicolas Saenz Julienne, linux-kernel, linux-arm-kernel, rcu,
	Peter Zijlstra, Mark Rutland, paulmck, mtosatti, frederic

On Tue, 30 Nov 2021 14:47:01 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:

> The reason is symmetry vs. returning from interupt / exception:
> 
>  irqentry_enter()
>       exit_rcu = false;
> 
>       if (user_mode(regs)) {
>           irqentry_enter_from_user_mode(regs)
>             __enter_from_user_mode(regs)
>               user_exit_irqoff();       <- RCU handling for NOHZ full
> 
>       } else if (is_idle_task_current()) {
>             rcu_irq_enter()
>             exit_rcu = true;
>       }
> 
>  irq_enter_rcu()
>      __irq_enter_raw()
>      preempt_count_add(HARDIRQ_OFFSET);
> 
>  irq_handler()
> 
>  irq_exit_rcu()
>      preempt_count_sub(HARDIRQ_OFFSET);
>      if (!in_interrupt() && local_softirq_pending())
>      	 invoke_softirq();
> 
>  irqentry_exit(regs, exit_rcu)
> 
>      if (user_mode(regs)) {
>          irqentry_exit_to_usermode(regs)
>            user_enter_irqoff();     <- RCU handling for NOHZ full
>      } else if (irqs_enabled(regs)) {
>            if (exit_rcu) {          <- Idle task special case
>                rcu_irq_exit();
>            } else {
>               irqentry_exit_cond_resched();
>            }
> 
>      } else if (exit_rcu) {
>          rcu_irq_exit();
>      }
> 
> On return from interrupt HARDIRQ_OFFSET has to be removed _before_
> handling soft interrupts. It's also required that the preempt count has
> the original state _before_ reaching irqentry_exit() which
> might schedule if the interrupt/exception hit user space or kernel space
> with interrupts enabled.
> 
> So doing it symmetric makes sense.
> 
> For NMIs the above conditionals do not apply at all and we just do
> 
>     __nmi_enter()
>         preempt_count_add(NMI_COUNT + HARDIRQ_COUNT);
>     rcu_nmi_enter();
> 
>     handle_nmi();
> 
>     rcu_nmi_exit();
>     __nmi_exit()
>         preempt_count_sub(NMI_COUNT + HARDIRQ_COUNT);
> 
> The reason why preempt count is incremented before invoking
> rcu_nmi_enter() is simply that RCU has to know about being in NMI
> context, i.e. in_nmi() has to return the correct answer.

Seems like there should be a comment in the code somewhere that explains
this.

-- Steve

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Question WRT early IRQ/NMI entry code
@ 2021-11-30 14:13     ` Steven Rostedt
  0 siblings, 0 replies; 42+ messages in thread
From: Steven Rostedt @ 2021-11-30 14:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nicolas Saenz Julienne, linux-kernel, linux-arm-kernel, rcu,
	Peter Zijlstra, Mark Rutland, paulmck, mtosatti, frederic

On Tue, 30 Nov 2021 14:47:01 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:

> The reason is symmetry vs. returning from interupt / exception:
> 
>  irqentry_enter()
>       exit_rcu = false;
> 
>       if (user_mode(regs)) {
>           irqentry_enter_from_user_mode(regs)
>             __enter_from_user_mode(regs)
>               user_exit_irqoff();       <- RCU handling for NOHZ full
> 
>       } else if (is_idle_task_current()) {
>             rcu_irq_enter()
>             exit_rcu = true;
>       }
> 
>  irq_enter_rcu()
>      __irq_enter_raw()
>      preempt_count_add(HARDIRQ_OFFSET);
> 
>  irq_handler()
> 
>  irq_exit_rcu()
>      preempt_count_sub(HARDIRQ_OFFSET);
>      if (!in_interrupt() && local_softirq_pending())
>      	 invoke_softirq();
> 
>  irqentry_exit(regs, exit_rcu)
> 
>      if (user_mode(regs)) {
>          irqentry_exit_to_usermode(regs)
>            user_enter_irqoff();     <- RCU handling for NOHZ full
>      } else if (irqs_enabled(regs)) {
>            if (exit_rcu) {          <- Idle task special case
>                rcu_irq_exit();
>            } else {
>               irqentry_exit_cond_resched();
>            }
> 
>      } else if (exit_rcu) {
>          rcu_irq_exit();
>      }
> 
> On return from interrupt HARDIRQ_OFFSET has to be removed _before_
> handling soft interrupts. It's also required that the preempt count has
> the original state _before_ reaching irqentry_exit() which
> might schedule if the interrupt/exception hit user space or kernel space
> with interrupts enabled.
> 
> So doing it symmetric makes sense.
> 
> For NMIs the above conditionals do not apply at all and we just do
> 
>     __nmi_enter()
>         preempt_count_add(NMI_COUNT + HARDIRQ_COUNT);
>     rcu_nmi_enter();
> 
>     handle_nmi();
> 
>     rcu_nmi_exit();
>     __nmi_exit()
>         preempt_count_sub(NMI_COUNT + HARDIRQ_COUNT);
> 
> The reason why preempt count is incremented before invoking
> rcu_nmi_enter() is simply that RCU has to know about being in NMI
> context, i.e. in_nmi() has to return the correct answer.

Seems like there should be a comment in the code somewhere that explains
this.

-- Steve

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Question WRT early IRQ/NMI entry code
  2021-11-30 13:47   ` Thomas Gleixner
@ 2021-11-30 15:13     ` Nicolas Saenz Julienne
  -1 siblings, 0 replies; 42+ messages in thread
From: Nicolas Saenz Julienne @ 2021-11-30 15:13 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel, linux-arm-kernel, rcu
  Cc: Peter Zijlstra, Mark Rutland, Steven Rostedt, paulmck, mtosatti,
	frederic

On Tue, 2021-11-30 at 14:47 +0100, Thomas Gleixner wrote:
> On Tue, Nov 30 2021 at 12:28, Nicolas Saenz Julienne wrote:
> > while going over the IRQ/NMI entry code I've found a small 'inconsistency':
> > while in the IRQ entry path, we inform RCU of the context change *before*
> > incrementing the preempt counter, the opposite happens for the NMI entry
> > path. This applies to both arm64 and x86[1].
> > 
> > Actually, rcu_nmi_enter() — which is also the main RCU context switch function
> > for the IRQ entry path — uses the preempt counter to verify it's not in NMI
> > context. So it would make sense to assume all callers have the same updated
> > view of the preempt count, which isn't true ATM.
> > 
> > I'm sure there an obscure/non-obvious reason for this, right?
> 
> There is.
> 
> > IRQ path:
> >   -> x86_64 asm (entry_64.S)
> >   -> irqentry_enter() -> rcu_irq_enter() -> *rcu_nmi_enter()*
> >   -> run_irq_on_irqstack_cond() -> irq_exit_rcu() -> *preempt_count_add(HARDIRQ_OFFSET)*
> >   -> // Run IRQ...
> > 
> > NMI path:
> >   -> x86_64 asm (entry_64.S)
> >   -> irqentry_nmi_enter() -> __nmi_enter() -> *__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET)*
> >                           -> *rcu_nmi_enter()*
> 
> The reason is symmetry vs. returning from interupt / exception:
> 
>  irqentry_enter()
>       exit_rcu = false;
> 
>       if (user_mode(regs)) {
>           irqentry_enter_from_user_mode(regs)
>             __enter_from_user_mode(regs)
>               user_exit_irqoff();       <- RCU handling for NOHZ full
> 
>       } else if (is_idle_task_current()) {
>             rcu_irq_enter()
>             exit_rcu = true;
>       }
> 
>  irq_enter_rcu()
>      __irq_enter_raw()
>      preempt_count_add(HARDIRQ_OFFSET);
> 
>  irq_handler()
> 
>  irq_exit_rcu()
>      preempt_count_sub(HARDIRQ_OFFSET);
>      if (!in_interrupt() && local_softirq_pending())
>      	 invoke_softirq();
> 
>  irqentry_exit(regs, exit_rcu)
> 
>      if (user_mode(regs)) {
>          irqentry_exit_to_usermode(regs)
>            user_enter_irqoff();     <- RCU handling for NOHZ full
>      } else if (irqs_enabled(regs)) {
>            if (exit_rcu) {          <- Idle task special case
>                rcu_irq_exit();
>            } else {
>               irqentry_exit_cond_resched();
>            }
> 
>      } else if (exit_rcu) {
>          rcu_irq_exit();
>      }
> 
> On return from interrupt HARDIRQ_OFFSET has to be removed _before_
> handling soft interrupts. It's also required that the preempt count has
> the original state _before_ reaching irqentry_exit() which
> might schedule if the interrupt/exception hit user space or kernel space
> with interrupts enabled.
> 
> So doing it symmetric makes sense.
> 
> For NMIs the above conditionals do not apply at all and we just do
> 
>     __nmi_enter()
>         preempt_count_add(NMI_COUNT + HARDIRQ_COUNT);
>     rcu_nmi_enter();
> 
>     handle_nmi();
> 
>     rcu_nmi_exit();
>     __nmi_exit()
>         preempt_count_sub(NMI_COUNT + HARDIRQ_COUNT);
> 
> The reason why preempt count is incremented before invoking
> rcu_nmi_enter() is simply that RCU has to know about being in NMI
> context, i.e. in_nmi() has to return the correct answer.
> 
> Thanks,

Thanks Thomas!

-- 
Nicolás Sáenz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Question WRT early IRQ/NMI entry code
@ 2021-11-30 15:13     ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 42+ messages in thread
From: Nicolas Saenz Julienne @ 2021-11-30 15:13 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel, linux-arm-kernel, rcu
  Cc: Peter Zijlstra, Mark Rutland, Steven Rostedt, paulmck, mtosatti,
	frederic

On Tue, 2021-11-30 at 14:47 +0100, Thomas Gleixner wrote:
> On Tue, Nov 30 2021 at 12:28, Nicolas Saenz Julienne wrote:
> > while going over the IRQ/NMI entry code I've found a small 'inconsistency':
> > while in the IRQ entry path, we inform RCU of the context change *before*
> > incrementing the preempt counter, the opposite happens for the NMI entry
> > path. This applies to both arm64 and x86[1].
> > 
> > Actually, rcu_nmi_enter() — which is also the main RCU context switch function
> > for the IRQ entry path — uses the preempt counter to verify it's not in NMI
> > context. So it would make sense to assume all callers have the same updated
> > view of the preempt count, which isn't true ATM.
> > 
> > I'm sure there an obscure/non-obvious reason for this, right?
> 
> There is.
> 
> > IRQ path:
> >   -> x86_64 asm (entry_64.S)
> >   -> irqentry_enter() -> rcu_irq_enter() -> *rcu_nmi_enter()*
> >   -> run_irq_on_irqstack_cond() -> irq_exit_rcu() -> *preempt_count_add(HARDIRQ_OFFSET)*
> >   -> // Run IRQ...
> > 
> > NMI path:
> >   -> x86_64 asm (entry_64.S)
> >   -> irqentry_nmi_enter() -> __nmi_enter() -> *__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET)*
> >                           -> *rcu_nmi_enter()*
> 
> The reason is symmetry vs. returning from interupt / exception:
> 
>  irqentry_enter()
>       exit_rcu = false;
> 
>       if (user_mode(regs)) {
>           irqentry_enter_from_user_mode(regs)
>             __enter_from_user_mode(regs)
>               user_exit_irqoff();       <- RCU handling for NOHZ full
> 
>       } else if (is_idle_task_current()) {
>             rcu_irq_enter()
>             exit_rcu = true;
>       }
> 
>  irq_enter_rcu()
>      __irq_enter_raw()
>      preempt_count_add(HARDIRQ_OFFSET);
> 
>  irq_handler()
> 
>  irq_exit_rcu()
>      preempt_count_sub(HARDIRQ_OFFSET);
>      if (!in_interrupt() && local_softirq_pending())
>      	 invoke_softirq();
> 
>  irqentry_exit(regs, exit_rcu)
> 
>      if (user_mode(regs)) {
>          irqentry_exit_to_usermode(regs)
>            user_enter_irqoff();     <- RCU handling for NOHZ full
>      } else if (irqs_enabled(regs)) {
>            if (exit_rcu) {          <- Idle task special case
>                rcu_irq_exit();
>            } else {
>               irqentry_exit_cond_resched();
>            }
> 
>      } else if (exit_rcu) {
>          rcu_irq_exit();
>      }
> 
> On return from interrupt HARDIRQ_OFFSET has to be removed _before_
> handling soft interrupts. It's also required that the preempt count has
> the original state _before_ reaching irqentry_exit() which
> might schedule if the interrupt/exception hit user space or kernel space
> with interrupts enabled.
> 
> So doing it symmetric makes sense.
> 
> For NMIs the above conditionals do not apply at all and we just do
> 
>     __nmi_enter()
>         preempt_count_add(NMI_COUNT + HARDIRQ_COUNT);
>     rcu_nmi_enter();
> 
>     handle_nmi();
> 
>     rcu_nmi_exit();
>     __nmi_exit()
>         preempt_count_sub(NMI_COUNT + HARDIRQ_COUNT);
> 
> The reason why preempt count is incremented before invoking
> rcu_nmi_enter() is simply that RCU has to know about being in NMI
> context, i.e. in_nmi() has to return the correct answer.
> 
> Thanks,

Thanks Thomas!

-- 
Nicolás Sáenz


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] Documentation: Fill the gaps about entry/noinstr constraints
  2021-11-30 14:13     ` Steven Rostedt
@ 2021-11-30 22:31       ` Thomas Gleixner
  -1 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-30 22:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Nicolas Saenz Julienne, linux-kernel, linux-arm-kernel, rcu,
	Peter Zijlstra, Mark Rutland, paulmck, mtosatti, frederic,
	Jonathan Corbet

The entry/exit handling for exceptions, interrupts, syscalls and KVM is
not really documented except for some comments.

Fill the gaps.

Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
 Documentation/core-api/index.rst |    8 +
 kernel/entry/common.c            |    1 
 3 files changed, 276 insertions(+), 1 deletion(-)

--- /dev/null
+++ b/Documentation/core-api/entry.rst
@@ -0,0 +1,268 @@
+Entry/exit handling for exceptions, interrupts, syscalls and KVM
+================================================================
+
+For any transition from one execution domain into another the kernel
+requires update of various states. The state updates have strict rules
+versus ordering.
+
+The states which need to be updated are:
+
+  * Lockdep
+  * RCU
+  * Preemption counter
+  * Tracing
+  * Time accounting
+
+The update order depends on the transition type and is explained below in
+the transition type sections.
+
+Non-instrumentable code - noinstr
+---------------------------------
+
+Low level transition code cannot be instrumented before RCU is watching and
+after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
+instrumentation facilities depend on RCU.
+
+Aside of that many architectures have to save register state, e.g. debug or
+cause registers before another exception of the same type can happen. A
+breakpoint in the breakpoint entry code would overwrite the debug registers
+of the inital breakpoint.
+
+Such code has to be marked with the 'noinstr' attribute. That places the
+code into a special section which is taboo for instrumentation and debug
+facilities.
+
+In a function which is marked 'noinstr' it's only allowed to call into
+non-instrumentable code except when the invocation of instrumentable code
+is annotated with a instrumentation_begin()/instrumentation_end() pair::
+
+  noinstr void entry(void)
+  {
+  	handle_entry();     <-- must be 'noinstr' or '__always_inline'
+	...
+	instrumentation_begin();
+	handle_context();   <-- instrumentable code
+	instrumentation_end();
+	...
+	handle_exit();     <-- must be 'noinstr' or '__always_inline'
+  }
+
+This allows verification of the 'noinstr' restrictions via objtool on
+supported architectures.
+
+Invoking non-instrumentable functions from instrumentable context has no
+restrictions and is useful to protect e.g. state switching which would
+cause malfunction if instrumented.
+
+All non-instrumentable entry/exit code sections before and after the RCU
+state transitions must run with interrupts disabled.
+
+Syscalls
+--------
+
+Syscall entry exit code starts obviously in low level architecture specific
+assembly code and calls out into C-code after establishing low level
+architecture specific state and stack frames. This low level code must not
+be instrumented. A typical syscall handling function invoked from low level
+assembly code looks like this::
+
+  noinstr void do_syscall(struct pt_regs \*regs, int nr)
+  {
+	arch_syscall_enter(regs);
+	nr = syscall_enter_from_user_mode(regs, nr);
+
+	instrumentation_begin();
+
+	if (!invoke_syscall(regs, nr) && nr != -1)
+	 	result_reg(regs) = __sys_ni_syscall(regs);
+
+	instrumentation_end();
+
+	syscall_exit_to_user_mode(regs);
+  }
+
+syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
+establishes state in the following order:
+
+  * Lockdep
+  * RCU / Context tracking
+  * Tracing
+
+and then invokes the various entry work functions like ptrace, seccomp,
+audit, syscall tracing etc. After the function returns instrumentable code
+can be invoked. After returning from the syscall handler the instrumentable
+code section ends and syscall_exit_to_user_mode() is invoked.
+
+syscall_exit_to_user_mode() handles all work which needs to be done before
+returning to user space like tracing, audit, signals, task work etc. After
+that it invokes exit_to_user_mode() which again handles the state
+transition in the reverse order:
+
+  * Tracing
+  * RCU / Context tracking
+  * Lockdep
+
+syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
+available as fine grained subfunctions in cases where the architecture code
+has to do extra work between the various steps. In such cases it has to
+ensure that enter_from_user_mode() is called first on entry and
+exit_to_user_mode() is called last on exit.
+
+
+KVM
+---
+
+Entering or exiting guest mode is very similar to syscalls. From the host
+kernel point of view the CPU goes off into user space when entering the
+guest and returns to the kernel on exit.
+
+kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
+and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
+The state operations have the same ordering.
+
+Task work handling is done separately for guest at the boundary of the
+vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
+the work handled on return to user space.
+
+Interrupts and regular exceptions
+---------------------------------
+
+Interrupts entry and exit handling is slightly more complex than syscalls
+and KVM transitions.
+
+If an interrupt is raised while the CPU executes in user space, the entry
+and exit handling is exactly the same as for syscalls.
+
+If the interrupt is raised while the CPU executes in kernel space the entry
+and exit handling is slightly different. RCU state is only updated when the
+interrupt was raised in context of the idle task because that's the only
+kernel context where RCU can be not watching on NOHZ enabled kernels.
+Lockdep and tracing have to be updated unconditionally.
+
+irqentry_enter() and irqentry_exit() provide the implementation for this.
+
+The architecture specific part looks similar to syscall handling::
+
+  noinstr void do_interrupt(struct pt_regs \*regs, int nr)
+  {
+	arch_interrupt_enter(regs);
+	state = irqentry_enter(regs);
+
+	instrumentation_begin();
+
+	irq_enter_rcu();
+	invoke_irq_handler(regs, nr);
+	irq_exit_rcu();
+
+	instrumentation_end();
+
+	irqentry_exit(regs, state);
+  }
+
+Note, that the invocation of the actual interrupt handler is within a
+irq_enter_rcu() and irq_exit_rcu() pair.
+
+irq_enter_rcu() updates the preemption count which makes in_hardirq()
+return true, handles NOHZ tick state and interrupt time accounting. This
+means that up to the point where irq_enter_rcu() is invoked in_hardirq()
+returns false.
+
+irq_exit_rcu() handles interrupt time accounting, undoes the preemption
+count update and eventually handles soft interrupts and NOHZ tick state.
+
+The preemption count could be established in irqentry_enter() already, but
+there is no real value to do so. This allows the preemption count to be
+traced and just puts a restriction on the early entry code up to
+irq_enter_rcu().
+
+This also keeps the handling vs. irq_exit_rcu() symmetric and
+irq_exit_rcu() must undo the preempt count elevation before handling soft
+interrupts and irqentry_exit() also requires that because it might
+schedule.
+
+
+NMI and NMI-like exceptions
+---------------------------
+
+NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug
+interrupts etc. can hit any context and have to be extra careful vs. the
+state.
+
+Debug exceptions can handle user space breakpoints or watchpoints in the
+same way as an interrupt which was raised while executing in user space,
+but kernel mode debug exceptions have to be treated like NMIs as they can
+even happen in NMI context, e.g. due to code patching.
+
+Also Machine check exceptions can handle user mode exceptions like regular
+interrupts, but for kernel mode exceptions they have to be treated like
+NMIs.
+
+NMIs and the other NMI-like exceptions handle state transitions in the most
+straight forward way and do not differentiate between user and kernel mode
+origin.
+
+The state update on entry is handled in irqentry_nmi_enter() which updates
+state in the following order:
+
+  * Preemption counter
+  * Lockdep
+  * RCU
+  * Tracing
+
+The exit counterpart irqenttry_nmi_exit() does the reverse operation in the
+reverse order.
+
+Note, that the update of the preemption counter has to be the first
+operation on enter and the last operation on exit. The reason is that both
+lockdep and RCU rely on in_nmi() returning true in this case. The
+preemption count modification in the NMI entry/exit case can obviously not
+be traced.
+
+Architecture specific code looks like this::
+
+  noinstr void do_nmi(struct pt_regs \*regs)
+  {
+	arch_nmi_enter(regs);
+	state = irqentry_nmi_enter(regs);
+
+	instrumentation_begin();
+
+	invoke_nmi_handler(regs);
+
+	instrumentation_end();
+	irqentry_nmi_exit(regs);
+  }
+
+and for e.g. a debug exception it can look like this::
+
+  noinstr void do_debug(struct pt_regs \*regs)
+  {
+	arch_nmi_enter(regs);
+
+	debug_regs = save_debug_regs();
+
+	if (user_mode(regs)) {
+		state = irqentry_enter(regs);
+
+		instrumentation_begin();
+
+		user_mode_debug_handler(regs, debug_regs);
+
+		instrumentation_end();
+
+		irqentry_exit(regs, state);
+  	} else {
+  		state = irqentry_nmi_enter(regs);
+
+		instrumentation_begin();
+
+		kernel_mode_debug_handler(regs, debug_regs);
+
+		instrumentation_end();
+
+		irqentry_nmi_exit(regs, state);
+	}
+  }
+
+There is no combined irqentry_nmi_if_kernel() function available as the
+above cannot be handled in an exception agnostic way.
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -44,6 +44,14 @@ Library functionality that is used throu
    timekeeping
    errseq
 
+Low level entry and exit
+========================
+
+.. toctree::
+   :maxdepth: 1
+
+   entry
+
 Concurrency primitives
 ======================
 
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -1,5 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0
-
 #include <linux/context_tracking.h>
 #include <linux/entry-common.h>
 #include <linux/highmem.h>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-11-30 22:31       ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-11-30 22:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Nicolas Saenz Julienne, linux-kernel, linux-arm-kernel, rcu,
	Peter Zijlstra, Mark Rutland, paulmck, mtosatti, frederic,
	Jonathan Corbet

The entry/exit handling for exceptions, interrupts, syscalls and KVM is
not really documented except for some comments.

Fill the gaps.

Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
 Documentation/core-api/index.rst |    8 +
 kernel/entry/common.c            |    1 
 3 files changed, 276 insertions(+), 1 deletion(-)

--- /dev/null
+++ b/Documentation/core-api/entry.rst
@@ -0,0 +1,268 @@
+Entry/exit handling for exceptions, interrupts, syscalls and KVM
+================================================================
+
+For any transition from one execution domain into another the kernel
+requires update of various states. The state updates have strict rules
+versus ordering.
+
+The states which need to be updated are:
+
+  * Lockdep
+  * RCU
+  * Preemption counter
+  * Tracing
+  * Time accounting
+
+The update order depends on the transition type and is explained below in
+the transition type sections.
+
+Non-instrumentable code - noinstr
+---------------------------------
+
+Low level transition code cannot be instrumented before RCU is watching and
+after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
+instrumentation facilities depend on RCU.
+
+Aside of that many architectures have to save register state, e.g. debug or
+cause registers before another exception of the same type can happen. A
+breakpoint in the breakpoint entry code would overwrite the debug registers
+of the inital breakpoint.
+
+Such code has to be marked with the 'noinstr' attribute. That places the
+code into a special section which is taboo for instrumentation and debug
+facilities.
+
+In a function which is marked 'noinstr' it's only allowed to call into
+non-instrumentable code except when the invocation of instrumentable code
+is annotated with a instrumentation_begin()/instrumentation_end() pair::
+
+  noinstr void entry(void)
+  {
+  	handle_entry();     <-- must be 'noinstr' or '__always_inline'
+	...
+	instrumentation_begin();
+	handle_context();   <-- instrumentable code
+	instrumentation_end();
+	...
+	handle_exit();     <-- must be 'noinstr' or '__always_inline'
+  }
+
+This allows verification of the 'noinstr' restrictions via objtool on
+supported architectures.
+
+Invoking non-instrumentable functions from instrumentable context has no
+restrictions and is useful to protect e.g. state switching which would
+cause malfunction if instrumented.
+
+All non-instrumentable entry/exit code sections before and after the RCU
+state transitions must run with interrupts disabled.
+
+Syscalls
+--------
+
+Syscall entry exit code starts obviously in low level architecture specific
+assembly code and calls out into C-code after establishing low level
+architecture specific state and stack frames. This low level code must not
+be instrumented. A typical syscall handling function invoked from low level
+assembly code looks like this::
+
+  noinstr void do_syscall(struct pt_regs \*regs, int nr)
+  {
+	arch_syscall_enter(regs);
+	nr = syscall_enter_from_user_mode(regs, nr);
+
+	instrumentation_begin();
+
+	if (!invoke_syscall(regs, nr) && nr != -1)
+	 	result_reg(regs) = __sys_ni_syscall(regs);
+
+	instrumentation_end();
+
+	syscall_exit_to_user_mode(regs);
+  }
+
+syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
+establishes state in the following order:
+
+  * Lockdep
+  * RCU / Context tracking
+  * Tracing
+
+and then invokes the various entry work functions like ptrace, seccomp,
+audit, syscall tracing etc. After the function returns instrumentable code
+can be invoked. After returning from the syscall handler the instrumentable
+code section ends and syscall_exit_to_user_mode() is invoked.
+
+syscall_exit_to_user_mode() handles all work which needs to be done before
+returning to user space like tracing, audit, signals, task work etc. After
+that it invokes exit_to_user_mode() which again handles the state
+transition in the reverse order:
+
+  * Tracing
+  * RCU / Context tracking
+  * Lockdep
+
+syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
+available as fine grained subfunctions in cases where the architecture code
+has to do extra work between the various steps. In such cases it has to
+ensure that enter_from_user_mode() is called first on entry and
+exit_to_user_mode() is called last on exit.
+
+
+KVM
+---
+
+Entering or exiting guest mode is very similar to syscalls. From the host
+kernel point of view the CPU goes off into user space when entering the
+guest and returns to the kernel on exit.
+
+kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
+and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
+The state operations have the same ordering.
+
+Task work handling is done separately for guest at the boundary of the
+vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
+the work handled on return to user space.
+
+Interrupts and regular exceptions
+---------------------------------
+
+Interrupts entry and exit handling is slightly more complex than syscalls
+and KVM transitions.
+
+If an interrupt is raised while the CPU executes in user space, the entry
+and exit handling is exactly the same as for syscalls.
+
+If the interrupt is raised while the CPU executes in kernel space the entry
+and exit handling is slightly different. RCU state is only updated when the
+interrupt was raised in context of the idle task because that's the only
+kernel context where RCU can be not watching on NOHZ enabled kernels.
+Lockdep and tracing have to be updated unconditionally.
+
+irqentry_enter() and irqentry_exit() provide the implementation for this.
+
+The architecture specific part looks similar to syscall handling::
+
+  noinstr void do_interrupt(struct pt_regs \*regs, int nr)
+  {
+	arch_interrupt_enter(regs);
+	state = irqentry_enter(regs);
+
+	instrumentation_begin();
+
+	irq_enter_rcu();
+	invoke_irq_handler(regs, nr);
+	irq_exit_rcu();
+
+	instrumentation_end();
+
+	irqentry_exit(regs, state);
+  }
+
+Note, that the invocation of the actual interrupt handler is within a
+irq_enter_rcu() and irq_exit_rcu() pair.
+
+irq_enter_rcu() updates the preemption count which makes in_hardirq()
+return true, handles NOHZ tick state and interrupt time accounting. This
+means that up to the point where irq_enter_rcu() is invoked in_hardirq()
+returns false.
+
+irq_exit_rcu() handles interrupt time accounting, undoes the preemption
+count update and eventually handles soft interrupts and NOHZ tick state.
+
+The preemption count could be established in irqentry_enter() already, but
+there is no real value to do so. This allows the preemption count to be
+traced and just puts a restriction on the early entry code up to
+irq_enter_rcu().
+
+This also keeps the handling vs. irq_exit_rcu() symmetric and
+irq_exit_rcu() must undo the preempt count elevation before handling soft
+interrupts and irqentry_exit() also requires that because it might
+schedule.
+
+
+NMI and NMI-like exceptions
+---------------------------
+
+NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug
+interrupts etc. can hit any context and have to be extra careful vs. the
+state.
+
+Debug exceptions can handle user space breakpoints or watchpoints in the
+same way as an interrupt which was raised while executing in user space,
+but kernel mode debug exceptions have to be treated like NMIs as they can
+even happen in NMI context, e.g. due to code patching.
+
+Also Machine check exceptions can handle user mode exceptions like regular
+interrupts, but for kernel mode exceptions they have to be treated like
+NMIs.
+
+NMIs and the other NMI-like exceptions handle state transitions in the most
+straight forward way and do not differentiate between user and kernel mode
+origin.
+
+The state update on entry is handled in irqentry_nmi_enter() which updates
+state in the following order:
+
+  * Preemption counter
+  * Lockdep
+  * RCU
+  * Tracing
+
+The exit counterpart irqenttry_nmi_exit() does the reverse operation in the
+reverse order.
+
+Note, that the update of the preemption counter has to be the first
+operation on enter and the last operation on exit. The reason is that both
+lockdep and RCU rely on in_nmi() returning true in this case. The
+preemption count modification in the NMI entry/exit case can obviously not
+be traced.
+
+Architecture specific code looks like this::
+
+  noinstr void do_nmi(struct pt_regs \*regs)
+  {
+	arch_nmi_enter(regs);
+	state = irqentry_nmi_enter(regs);
+
+	instrumentation_begin();
+
+	invoke_nmi_handler(regs);
+
+	instrumentation_end();
+	irqentry_nmi_exit(regs);
+  }
+
+and for e.g. a debug exception it can look like this::
+
+  noinstr void do_debug(struct pt_regs \*regs)
+  {
+	arch_nmi_enter(regs);
+
+	debug_regs = save_debug_regs();
+
+	if (user_mode(regs)) {
+		state = irqentry_enter(regs);
+
+		instrumentation_begin();
+
+		user_mode_debug_handler(regs, debug_regs);
+
+		instrumentation_end();
+
+		irqentry_exit(regs, state);
+  	} else {
+  		state = irqentry_nmi_enter(regs);
+
+		instrumentation_begin();
+
+		kernel_mode_debug_handler(regs, debug_regs);
+
+		instrumentation_end();
+
+		irqentry_nmi_exit(regs, state);
+	}
+  }
+
+There is no combined irqentry_nmi_if_kernel() function available as the
+above cannot be handled in an exception agnostic way.
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -44,6 +44,14 @@ Library functionality that is used throu
    timekeeping
    errseq
 
+Low level entry and exit
+========================
+
+.. toctree::
+   :maxdepth: 1
+
+   entry
+
 Concurrency primitives
 ======================
 
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -1,5 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0
-
 #include <linux/context_tracking.h>
 #include <linux/entry-common.h>
 #include <linux/highmem.h>

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Documentation: Fill the gaps about entry/noinstr constraints
  2021-11-30 22:31       ` Thomas Gleixner
@ 2021-12-01 10:56         ` Mark Rutland
  -1 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2021-12-01 10:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

Hi Thomas,

On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
> The entry/exit handling for exceptions, interrupts, syscalls and KVM is
> not really documented except for some comments.
> 
> Fill the gaps.

Thanks for this! Now there's less chance I'll get this wrong. :)

I have a couple of minor comments below -- mostly typo/formatting junk.

> 
> Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
>  Documentation/core-api/index.rst |    8 +
>  kernel/entry/common.c            |    1 

I think the change to kernel/entry/common.c got included by accident?

>  3 files changed, 276 insertions(+), 1 deletion(-)
> 
> --- /dev/null
> +++ b/Documentation/core-api/entry.rst
> @@ -0,0 +1,268 @@
> +Entry/exit handling for exceptions, interrupts, syscalls and KVM
> +================================================================
> +
> +For any transition from one execution domain into another the kernel
> +requires update of various states. The state updates have strict rules
> +versus ordering.
> +
> +The states which need to be updated are:
> +
> +  * Lockdep
> +  * RCU
> +  * Preemption counter
> +  * Tracing
> +  * Time accounting
> +
> +The update order depends on the transition type and is explained below in
> +the transition type sections.
> +
> +Non-instrumentable code - noinstr
> +---------------------------------
> +
> +Low level transition code cannot be instrumented before RCU is watching and
> +after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
> +instrumentation facilities depend on RCU.
> +
> +Aside of that many architectures have to save register state, e.g. debug or
> +cause registers before another exception of the same type can happen. A
> +breakpoint in the breakpoint entry code would overwrite the debug registers
> +of the inital breakpoint.
> +
> +Such code has to be marked with the 'noinstr' attribute. That places the
> +code into a special section which is taboo for instrumentation and debug
> +facilities.
> +
> +In a function which is marked 'noinstr' it's only allowed to call into
> +non-instrumentable code except when the invocation of instrumentable code
> +is annotated with a instrumentation_begin()/instrumentation_end() pair::
> +
> +  noinstr void entry(void)
> +  {
> +  	handle_entry();     <-- must be 'noinstr' or '__always_inline'
> +	...
> +	instrumentation_begin();
> +	handle_context();   <-- instrumentable code
> +	instrumentation_end();
> +	...
> +	handle_exit();     <-- must be 'noinstr' or '__always_inline'
> +  }
> +
> +This allows verification of the 'noinstr' restrictions via objtool on
> +supported architectures.
> +
> +Invoking non-instrumentable functions from instrumentable context has no
> +restrictions and is useful to protect e.g. state switching which would
> +cause malfunction if instrumented.
> +
> +All non-instrumentable entry/exit code sections before and after the RCU
> +state transitions must run with interrupts disabled.
> +
> +Syscalls
> +--------
> +
> +Syscall entry exit code starts obviously in low level architecture specific

As a small nit, can we remove the "obviously"? It's certainly obvious to you
and me, but it doesn't meaningfully affect the sentence either way.

> +assembly code and calls out into C-code after establishing low level
> +architecture specific state and stack frames. This low level code must not
> +be instrumented. A typical syscall handling function invoked from low level
> +assembly code looks like this::
> +
> +  noinstr void do_syscall(struct pt_regs \*regs, int nr)
                                            ^^

Is `\*` necessary here? ... and/or should this be an explicit code block (which
IIUC doesn't require this esacping), e.g.

.. code-block:: c

      noinstr void do_syscall(struct pt_regs *regs, int nr)
	{
		...
	}

Similar comment for the other code snippets in this patch.

> +  {
> +	arch_syscall_enter(regs);
> +	nr = syscall_enter_from_user_mode(regs, nr);
> +
> +	instrumentation_begin();
> +
> +	if (!invoke_syscall(regs, nr) && nr != -1)
> +	 	result_reg(regs) = __sys_ni_syscall(regs);
> +
> +	instrumentation_end();
> +
> +	syscall_exit_to_user_mode(regs);
> +  }
> +
> +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
> +establishes state in the following order:
> +
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Tracing
> +
> +and then invokes the various entry work functions like ptrace, seccomp,
> +audit, syscall tracing etc. After the function returns instrumentable code
> +can be invoked. After returning from the syscall handler the instrumentable
> +code section ends and syscall_exit_to_user_mode() is invoked.
> +
> +syscall_exit_to_user_mode() handles all work which needs to be done before
> +returning to user space like tracing, audit, signals, task work etc. After
> +that it invokes exit_to_user_mode() which again handles the state
> +transition in the reverse order:
> +
> +  * Tracing
> +  * RCU / Context tracking
> +  * Lockdep
> +
> +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> +available as fine grained subfunctions in cases where the architecture code
> +has to do extra work between the various steps. In such cases it has to
> +ensure that enter_from_user_mode() is called first on entry and
> +exit_to_user_mode() is called last on exit.
> +
> +
> +KVM
> +---
> +
> +Entering or exiting guest mode is very similar to syscalls. From the host
> +kernel point of view the CPU goes off into user space when entering the
> +guest and returns to the kernel on exit.
> +
> +kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
> +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
> +The state operations have the same ordering.
> +
> +Task work handling is done separately for guest at the boundary of the
> +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
> +the work handled on return to user space.
> +
> +Interrupts and regular exceptions
> +---------------------------------
> +
> +Interrupts entry and exit handling is slightly more complex than syscalls
> +and KVM transitions.
> +
> +If an interrupt is raised while the CPU executes in user space, the entry
> +and exit handling is exactly the same as for syscalls.
> +
> +If the interrupt is raised while the CPU executes in kernel space the entry
> +and exit handling is slightly different. RCU state is only updated when the
> +interrupt was raised in context of the idle task because that's the only

Since we have an idle task for each cpu, perhaps either:

  s/the idle task/an idle task/
  s/the idle task/the CPU's idle task/

> +kernel context where RCU can be not watching on NOHZ enabled kernels.
> +Lockdep and tracing have to be updated unconditionally.
> +
> +irqentry_enter() and irqentry_exit() provide the implementation for this.
> +
> +The architecture specific part looks similar to syscall handling::
> +
> +  noinstr void do_interrupt(struct pt_regs \*regs, int nr)
> +  {
> +	arch_interrupt_enter(regs);
> +	state = irqentry_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	irq_enter_rcu();
> +	invoke_irq_handler(regs, nr);
> +	irq_exit_rcu();
> +
> +	instrumentation_end();
> +
> +	irqentry_exit(regs, state);
> +  }
> +
> +Note, that the invocation of the actual interrupt handler is within a
> +irq_enter_rcu() and irq_exit_rcu() pair.
> +
> +irq_enter_rcu() updates the preemption count which makes in_hardirq()
> +return true, handles NOHZ tick state and interrupt time accounting. This
> +means that up to the point where irq_enter_rcu() is invoked in_hardirq()
> +returns false.
> +
> +irq_exit_rcu() handles interrupt time accounting, undoes the preemption
> +count update and eventually handles soft interrupts and NOHZ tick state.
> +
> +The preemption count could be established in irqentry_enter() already, but
> +there is no real value to do so. This allows the preemption count to be
> +traced and just puts a restriction on the early entry code up to
> +irq_enter_rcu().
> +
> +This also keeps the handling vs. irq_exit_rcu() symmetric and
> +irq_exit_rcu() must undo the preempt count elevation before handling soft
> +interrupts and irqentry_exit() also requires that because it might
> +schedule.
> +
> +
> +NMI and NMI-like exceptions
> +---------------------------
> +
> +NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug
> +interrupts etc. can hit any context and have to be extra careful vs. the
> +state.
> +
> +Debug exceptions can handle user space breakpoints or watchpoints in the
> +same way as an interrupt which was raised while executing in user space,
> +but kernel mode debug exceptions have to be treated like NMIs as they can
> +even happen in NMI context, e.g. due to code patching.
> +
> +Also Machine check exceptions can handle user mode exceptions like regular
> +interrupts, but for kernel mode exceptions they have to be treated like
> +NMIs.
> +
> +NMIs and the other NMI-like exceptions handle state transitions in the most
> +straight forward way and do not differentiate between user and kernel mode
> +origin.
> +
> +The state update on entry is handled in irqentry_nmi_enter() which updates
> +state in the following order:
> +
> +  * Preemption counter
> +  * Lockdep
> +  * RCU
> +  * Tracing
> +
> +The exit counterpart irqenttry_nmi_exit() does the reverse operation in the
                        ^^^^^^^^^

 s/irqenttry/irqentry/

> +reverse order.
> +
> +Note, that the update of the preemption counter has to be the first
> +operation on enter and the last operation on exit. The reason is that both
> +lockdep and RCU rely on in_nmi() returning true in this case. The
> +preemption count modification in the NMI entry/exit case can obviously not
> +be traced.

Could we say "must not" instead of "can not", e.g.

  The preemption count modification in the NMI entry/exit must not be traced.

That way it's clearly a requirement, rather than a limitation.

> +Architecture specific code looks like this::
> +
> +  noinstr void do_nmi(struct pt_regs \*regs)
> +  {
> +	arch_nmi_enter(regs);
> +	state = irqentry_nmi_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	invoke_nmi_handler(regs);
> +
> +	instrumentation_end();
> +	irqentry_nmi_exit(regs);
> +  }

To keep the begin/end and enter/exit calls visually balanced, should the
instrumentation_end() call have trailing a line space, e.g.

e.g.

	arch_nmi_enter(regs);
	state = irqentry_nmi_enter(regs);

	instrumentation_begin();

	invoke_nmi_handler(regs);

	instrumentation_end();

	irqentry_nmi_exit(regs);

... or sandwiched around invoke_nmi_handler(), e.g.

	arch_nmi_enter(regs);
	state = irqentry_nmi_enter(regs);

	instrumentation_begin();
	invoke_nmi_handler(regs);
	instrumentation_end();

	irqentry_nmi_exit(regs);

Since the examples in this file are fairly simple I'd suggest the latter.

> +and for e.g. a debug exception it can look like this::
> +
> +  noinstr void do_debug(struct pt_regs \*regs)
> +  {
> +	arch_nmi_enter(regs);
> +
> +	debug_regs = save_debug_regs();
> +
> +	if (user_mode(regs)) {
> +		state = irqentry_enter(regs);
> +
> +		instrumentation_begin();
> +
> +		user_mode_debug_handler(regs, debug_regs);
> +
> +		instrumentation_end();
> +
> +		irqentry_exit(regs, state);
> +  	} else {
> +  		state = irqentry_nmi_enter(regs);
> +
> +		instrumentation_begin();
> +
> +		kernel_mode_debug_handler(regs, debug_regs);
> +
> +		instrumentation_end();
> +
> +		irqentry_nmi_exit(regs, state);
> +	}
> +  }
> +
> +There is no combined irqentry_nmi_if_kernel() function available as the
> +above cannot be handled in an exception agnostic way.
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -44,6 +44,14 @@ Library functionality that is used throu
>     timekeeping
>     errseq
>  
> +Low level entry and exit
> +========================
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   entry
> +
>  Concurrency primitives
>  ======================
>  
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -1,5 +1,4 @@
>  // SPDX-License-Identifier: GPL-2.0
> -

Unrelated whitespace change?

Thanks,
Mark.

>  #include <linux/context_tracking.h>
>  #include <linux/entry-common.h>
>  #include <linux/highmem.h>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-01 10:56         ` Mark Rutland
  0 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2021-12-01 10:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

Hi Thomas,

On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
> The entry/exit handling for exceptions, interrupts, syscalls and KVM is
> not really documented except for some comments.
> 
> Fill the gaps.

Thanks for this! Now there's less chance I'll get this wrong. :)

I have a couple of minor comments below -- mostly typo/formatting junk.

> 
> Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
>  Documentation/core-api/index.rst |    8 +
>  kernel/entry/common.c            |    1 

I think the change to kernel/entry/common.c got included by accident?

>  3 files changed, 276 insertions(+), 1 deletion(-)
> 
> --- /dev/null
> +++ b/Documentation/core-api/entry.rst
> @@ -0,0 +1,268 @@
> +Entry/exit handling for exceptions, interrupts, syscalls and KVM
> +================================================================
> +
> +For any transition from one execution domain into another the kernel
> +requires update of various states. The state updates have strict rules
> +versus ordering.
> +
> +The states which need to be updated are:
> +
> +  * Lockdep
> +  * RCU
> +  * Preemption counter
> +  * Tracing
> +  * Time accounting
> +
> +The update order depends on the transition type and is explained below in
> +the transition type sections.
> +
> +Non-instrumentable code - noinstr
> +---------------------------------
> +
> +Low level transition code cannot be instrumented before RCU is watching and
> +after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
> +instrumentation facilities depend on RCU.
> +
> +Aside of that many architectures have to save register state, e.g. debug or
> +cause registers before another exception of the same type can happen. A
> +breakpoint in the breakpoint entry code would overwrite the debug registers
> +of the inital breakpoint.
> +
> +Such code has to be marked with the 'noinstr' attribute. That places the
> +code into a special section which is taboo for instrumentation and debug
> +facilities.
> +
> +In a function which is marked 'noinstr' it's only allowed to call into
> +non-instrumentable code except when the invocation of instrumentable code
> +is annotated with a instrumentation_begin()/instrumentation_end() pair::
> +
> +  noinstr void entry(void)
> +  {
> +  	handle_entry();     <-- must be 'noinstr' or '__always_inline'
> +	...
> +	instrumentation_begin();
> +	handle_context();   <-- instrumentable code
> +	instrumentation_end();
> +	...
> +	handle_exit();     <-- must be 'noinstr' or '__always_inline'
> +  }
> +
> +This allows verification of the 'noinstr' restrictions via objtool on
> +supported architectures.
> +
> +Invoking non-instrumentable functions from instrumentable context has no
> +restrictions and is useful to protect e.g. state switching which would
> +cause malfunction if instrumented.
> +
> +All non-instrumentable entry/exit code sections before and after the RCU
> +state transitions must run with interrupts disabled.
> +
> +Syscalls
> +--------
> +
> +Syscall entry exit code starts obviously in low level architecture specific

As a small nit, can we remove the "obviously"? It's certainly obvious to you
and me, but it doesn't meaningfully affect the sentence either way.

> +assembly code and calls out into C-code after establishing low level
> +architecture specific state and stack frames. This low level code must not
> +be instrumented. A typical syscall handling function invoked from low level
> +assembly code looks like this::
> +
> +  noinstr void do_syscall(struct pt_regs \*regs, int nr)
                                            ^^

Is `\*` necessary here? ... and/or should this be an explicit code block (which
IIUC doesn't require this esacping), e.g.

.. code-block:: c

      noinstr void do_syscall(struct pt_regs *regs, int nr)
	{
		...
	}

Similar comment for the other code snippets in this patch.

> +  {
> +	arch_syscall_enter(regs);
> +	nr = syscall_enter_from_user_mode(regs, nr);
> +
> +	instrumentation_begin();
> +
> +	if (!invoke_syscall(regs, nr) && nr != -1)
> +	 	result_reg(regs) = __sys_ni_syscall(regs);
> +
> +	instrumentation_end();
> +
> +	syscall_exit_to_user_mode(regs);
> +  }
> +
> +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
> +establishes state in the following order:
> +
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Tracing
> +
> +and then invokes the various entry work functions like ptrace, seccomp,
> +audit, syscall tracing etc. After the function returns instrumentable code
> +can be invoked. After returning from the syscall handler the instrumentable
> +code section ends and syscall_exit_to_user_mode() is invoked.
> +
> +syscall_exit_to_user_mode() handles all work which needs to be done before
> +returning to user space like tracing, audit, signals, task work etc. After
> +that it invokes exit_to_user_mode() which again handles the state
> +transition in the reverse order:
> +
> +  * Tracing
> +  * RCU / Context tracking
> +  * Lockdep
> +
> +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> +available as fine grained subfunctions in cases where the architecture code
> +has to do extra work between the various steps. In such cases it has to
> +ensure that enter_from_user_mode() is called first on entry and
> +exit_to_user_mode() is called last on exit.
> +
> +
> +KVM
> +---
> +
> +Entering or exiting guest mode is very similar to syscalls. From the host
> +kernel point of view the CPU goes off into user space when entering the
> +guest and returns to the kernel on exit.
> +
> +kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
> +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
> +The state operations have the same ordering.
> +
> +Task work handling is done separately for guest at the boundary of the
> +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
> +the work handled on return to user space.
> +
> +Interrupts and regular exceptions
> +---------------------------------
> +
> +Interrupts entry and exit handling is slightly more complex than syscalls
> +and KVM transitions.
> +
> +If an interrupt is raised while the CPU executes in user space, the entry
> +and exit handling is exactly the same as for syscalls.
> +
> +If the interrupt is raised while the CPU executes in kernel space the entry
> +and exit handling is slightly different. RCU state is only updated when the
> +interrupt was raised in context of the idle task because that's the only

Since we have an idle task for each cpu, perhaps either:

  s/the idle task/an idle task/
  s/the idle task/the CPU's idle task/

> +kernel context where RCU can be not watching on NOHZ enabled kernels.
> +Lockdep and tracing have to be updated unconditionally.
> +
> +irqentry_enter() and irqentry_exit() provide the implementation for this.
> +
> +The architecture specific part looks similar to syscall handling::
> +
> +  noinstr void do_interrupt(struct pt_regs \*regs, int nr)
> +  {
> +	arch_interrupt_enter(regs);
> +	state = irqentry_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	irq_enter_rcu();
> +	invoke_irq_handler(regs, nr);
> +	irq_exit_rcu();
> +
> +	instrumentation_end();
> +
> +	irqentry_exit(regs, state);
> +  }
> +
> +Note, that the invocation of the actual interrupt handler is within a
> +irq_enter_rcu() and irq_exit_rcu() pair.
> +
> +irq_enter_rcu() updates the preemption count which makes in_hardirq()
> +return true, handles NOHZ tick state and interrupt time accounting. This
> +means that up to the point where irq_enter_rcu() is invoked in_hardirq()
> +returns false.
> +
> +irq_exit_rcu() handles interrupt time accounting, undoes the preemption
> +count update and eventually handles soft interrupts and NOHZ tick state.
> +
> +The preemption count could be established in irqentry_enter() already, but
> +there is no real value to do so. This allows the preemption count to be
> +traced and just puts a restriction on the early entry code up to
> +irq_enter_rcu().
> +
> +This also keeps the handling vs. irq_exit_rcu() symmetric and
> +irq_exit_rcu() must undo the preempt count elevation before handling soft
> +interrupts and irqentry_exit() also requires that because it might
> +schedule.
> +
> +
> +NMI and NMI-like exceptions
> +---------------------------
> +
> +NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug
> +interrupts etc. can hit any context and have to be extra careful vs. the
> +state.
> +
> +Debug exceptions can handle user space breakpoints or watchpoints in the
> +same way as an interrupt which was raised while executing in user space,
> +but kernel mode debug exceptions have to be treated like NMIs as they can
> +even happen in NMI context, e.g. due to code patching.
> +
> +Also Machine check exceptions can handle user mode exceptions like regular
> +interrupts, but for kernel mode exceptions they have to be treated like
> +NMIs.
> +
> +NMIs and the other NMI-like exceptions handle state transitions in the most
> +straight forward way and do not differentiate between user and kernel mode
> +origin.
> +
> +The state update on entry is handled in irqentry_nmi_enter() which updates
> +state in the following order:
> +
> +  * Preemption counter
> +  * Lockdep
> +  * RCU
> +  * Tracing
> +
> +The exit counterpart irqenttry_nmi_exit() does the reverse operation in the
                        ^^^^^^^^^

 s/irqenttry/irqentry/

> +reverse order.
> +
> +Note, that the update of the preemption counter has to be the first
> +operation on enter and the last operation on exit. The reason is that both
> +lockdep and RCU rely on in_nmi() returning true in this case. The
> +preemption count modification in the NMI entry/exit case can obviously not
> +be traced.

Could we say "must not" instead of "can not", e.g.

  The preemption count modification in the NMI entry/exit must not be traced.

That way it's clearly a requirement, rather than a limitation.

> +Architecture specific code looks like this::
> +
> +  noinstr void do_nmi(struct pt_regs \*regs)
> +  {
> +	arch_nmi_enter(regs);
> +	state = irqentry_nmi_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	invoke_nmi_handler(regs);
> +
> +	instrumentation_end();
> +	irqentry_nmi_exit(regs);
> +  }

To keep the begin/end and enter/exit calls visually balanced, should the
instrumentation_end() call have trailing a line space, e.g.

e.g.

	arch_nmi_enter(regs);
	state = irqentry_nmi_enter(regs);

	instrumentation_begin();

	invoke_nmi_handler(regs);

	instrumentation_end();

	irqentry_nmi_exit(regs);

... or sandwiched around invoke_nmi_handler(), e.g.

	arch_nmi_enter(regs);
	state = irqentry_nmi_enter(regs);

	instrumentation_begin();
	invoke_nmi_handler(regs);
	instrumentation_end();

	irqentry_nmi_exit(regs);

Since the examples in this file are fairly simple I'd suggest the latter.

> +and for e.g. a debug exception it can look like this::
> +
> +  noinstr void do_debug(struct pt_regs \*regs)
> +  {
> +	arch_nmi_enter(regs);
> +
> +	debug_regs = save_debug_regs();
> +
> +	if (user_mode(regs)) {
> +		state = irqentry_enter(regs);
> +
> +		instrumentation_begin();
> +
> +		user_mode_debug_handler(regs, debug_regs);
> +
> +		instrumentation_end();
> +
> +		irqentry_exit(regs, state);
> +  	} else {
> +  		state = irqentry_nmi_enter(regs);
> +
> +		instrumentation_begin();
> +
> +		kernel_mode_debug_handler(regs, debug_regs);
> +
> +		instrumentation_end();
> +
> +		irqentry_nmi_exit(regs, state);
> +	}
> +  }
> +
> +There is no combined irqentry_nmi_if_kernel() function available as the
> +above cannot be handled in an exception agnostic way.
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -44,6 +44,14 @@ Library functionality that is used throu
>     timekeeping
>     errseq
>  
> +Low level entry and exit
> +========================
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   entry
> +
>  Concurrency primitives
>  ======================
>  
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -1,5 +1,4 @@
>  // SPDX-License-Identifier: GPL-2.0
> -

Unrelated whitespace change?

Thanks,
Mark.

>  #include <linux/context_tracking.h>
>  #include <linux/entry-common.h>
>  #include <linux/highmem.h>

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-01 10:56         ` Mark Rutland
@ 2021-12-01 18:14           ` Thomas Gleixner
  -1 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-12-01 18:14 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

Mark,

On Wed, Dec 01 2021 at 10:56, Mark Rutland wrote:
> On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
>> ---
>>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
>>  Documentation/core-api/index.rst |    8 +
>>  kernel/entry/common.c            |    1 
>
> I think the change to kernel/entry/common.c got included by accident?

That's what I get from doing such things 30 minutes before midnight...

>> +
>> +Syscall entry exit code starts obviously in low level architecture specific
>
> As a small nit, can we remove the "obviously"? It's certainly obvious to you
> and me, but it doesn't meaningfully affect the sentence either way.

Indeed.

>> +assembly code and calls out into C-code after establishing low level
>> +architecture specific state and stack frames. This low level code must not
>> +be instrumented. A typical syscall handling function invoked from low level
>> +assembly code looks like this::
>> +
>> +  noinstr void do_syscall(struct pt_regs \*regs, int nr)
>                                             ^^
>
> Is `\*` necessary here? ... and/or should this be an explicit code block (which
> IIUC doesn't require this esacping), e.g.
>
> .. code-block:: c

Right. Let me try that.

>       noinstr void do_syscall(struct pt_regs *regs, int nr)
>> +
>> +If the interrupt is raised while the CPU executes in kernel space the entry
>> +and exit handling is slightly different. RCU state is only updated when the
>> +interrupt was raised in context of the idle task because that's the only
>
> Since we have an idle task for each cpu, perhaps either:
>
>   s/the idle task/an idle task/
>   s/the idle task/the CPU's idle task/

Yes, that's more precise

>> +Note, that the update of the preemption counter has to be the first
>> +operation on enter and the last operation on exit. The reason is that both
>> +lockdep and RCU rely on in_nmi() returning true in this case. The
>> +preemption count modification in the NMI entry/exit case can obviously not
>> +be traced.
>
> Could we say "must not" instead of "can not", e.g.
>
>   The preemption count modification in the NMI entry/exit must not be traced.
>
> That way it's clearly a requirement, rather than a limitation.

Yes.

>> +Architecture specific code looks like this::
>> +
>> +  noinstr void do_nmi(struct pt_regs \*regs)
>> +  {
>> +	arch_nmi_enter(regs);
>> +	state = irqentry_nmi_enter(regs);
>> +
>> +	instrumentation_begin();
>> +
>> +	invoke_nmi_handler(regs);
>> +
>> +	instrumentation_end();
>> +	irqentry_nmi_exit(regs);
>> +  }
>
> To keep the begin/end and enter/exit calls visually balanced, should the
> instrumentation_end() call have trailing a line space, e.g.

Yup.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-01 18:14           ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-12-01 18:14 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

Mark,

On Wed, Dec 01 2021 at 10:56, Mark Rutland wrote:
> On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
>> ---
>>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
>>  Documentation/core-api/index.rst |    8 +
>>  kernel/entry/common.c            |    1 
>
> I think the change to kernel/entry/common.c got included by accident?

That's what I get from doing such things 30 minutes before midnight...

>> +
>> +Syscall entry exit code starts obviously in low level architecture specific
>
> As a small nit, can we remove the "obviously"? It's certainly obvious to you
> and me, but it doesn't meaningfully affect the sentence either way.

Indeed.

>> +assembly code and calls out into C-code after establishing low level
>> +architecture specific state and stack frames. This low level code must not
>> +be instrumented. A typical syscall handling function invoked from low level
>> +assembly code looks like this::
>> +
>> +  noinstr void do_syscall(struct pt_regs \*regs, int nr)
>                                             ^^
>
> Is `\*` necessary here? ... and/or should this be an explicit code block (which
> IIUC doesn't require this esacping), e.g.
>
> .. code-block:: c

Right. Let me try that.

>       noinstr void do_syscall(struct pt_regs *regs, int nr)
>> +
>> +If the interrupt is raised while the CPU executes in kernel space the entry
>> +and exit handling is slightly different. RCU state is only updated when the
>> +interrupt was raised in context of the idle task because that's the only
>
> Since we have an idle task for each cpu, perhaps either:
>
>   s/the idle task/an idle task/
>   s/the idle task/the CPU's idle task/

Yes, that's more precise

>> +Note, that the update of the preemption counter has to be the first
>> +operation on enter and the last operation on exit. The reason is that both
>> +lockdep and RCU rely on in_nmi() returning true in this case. The
>> +preemption count modification in the NMI entry/exit case can obviously not
>> +be traced.
>
> Could we say "must not" instead of "can not", e.g.
>
>   The preemption count modification in the NMI entry/exit must not be traced.
>
> That way it's clearly a requirement, rather than a limitation.

Yes.

>> +Architecture specific code looks like this::
>> +
>> +  noinstr void do_nmi(struct pt_regs \*regs)
>> +  {
>> +	arch_nmi_enter(regs);
>> +	state = irqentry_nmi_enter(regs);
>> +
>> +	instrumentation_begin();
>> +
>> +	invoke_nmi_handler(regs);
>> +
>> +	instrumentation_end();
>> +	irqentry_nmi_exit(regs);
>> +  }
>
> To keep the begin/end and enter/exit calls visually balanced, should the
> instrumentation_end() call have trailing a line space, e.g.

Yup.

Thanks,

        tglx

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-01 18:14           ` Thomas Gleixner
@ 2021-12-01 18:23             ` Mark Rutland
  -1 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2021-12-01 18:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

On Wed, Dec 01, 2021 at 07:14:41PM +0100, Thomas Gleixner wrote:
> Mark,
> 
> On Wed, Dec 01 2021 at 10:56, Mark Rutland wrote:
> > On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
> >> ---
> >>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
> >>  Documentation/core-api/index.rst |    8 +
> >>  kernel/entry/common.c            |    1 
> >
> > I think the change to kernel/entry/common.c got included by accident?
> 
> That's what I get from doing such things 30 minutes before midnight...

Ah, I had debugged it down to:

nobikeshed void do_rst(struct tglx *tglx);
{
	aargh_rst_enter(tglx);

	documentation_begin();
	invoke_editor(tglx);
 	documentation_end();
}

... where I think we forgot the:

	enter_from_sleep_mode(tglx);
	...
	exit_to_sleep_mode(tglx);

Mark.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-01 18:23             ` Mark Rutland
  0 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2021-12-01 18:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

On Wed, Dec 01, 2021 at 07:14:41PM +0100, Thomas Gleixner wrote:
> Mark,
> 
> On Wed, Dec 01 2021 at 10:56, Mark Rutland wrote:
> > On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
> >> ---
> >>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
> >>  Documentation/core-api/index.rst |    8 +
> >>  kernel/entry/common.c            |    1 
> >
> > I think the change to kernel/entry/common.c got included by accident?
> 
> That's what I get from doing such things 30 minutes before midnight...

Ah, I had debugged it down to:

nobikeshed void do_rst(struct tglx *tglx);
{
	aargh_rst_enter(tglx);

	documentation_begin();
	invoke_editor(tglx);
 	documentation_end();
}

... where I think we forgot the:

	enter_from_sleep_mode(tglx);
	...
	exit_to_sleep_mode(tglx);

Mark.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-01 18:23             ` Mark Rutland
@ 2021-12-01 20:28               ` Thomas Gleixner
  -1 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-12-01 20:28 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

On Wed, Dec 01 2021 at 18:23, Mark Rutland wrote:
> On Wed, Dec 01, 2021 at 07:14:41PM +0100, Thomas Gleixner wrote:
>> Mark,
>> 
>> On Wed, Dec 01 2021 at 10:56, Mark Rutland wrote:
>> > On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
>> >> ---
>> >>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
>> >>  Documentation/core-api/index.rst |    8 +
>> >>  kernel/entry/common.c            |    1 
>> >
>> > I think the change to kernel/entry/common.c got included by accident?
>> 
>> That's what I get from doing such things 30 minutes before midnight...
>
> Ah, I had debugged it down to:
>
> nobikeshed void do_rst(struct tglx *tglx);
> {
> 	aargh_rst_enter(tglx);
>
> 	documentation_begin();
> 	invoke_editor(tglx);
>  	documentation_end();
> }
>
> ... where I think we forgot the:
>
> 	enter_from_sleep_mode(tglx);
> 	...
> 	exit_to_sleep_mode(tglx);

ROTFL. You made my day!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-01 20:28               ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-12-01 20:28 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

On Wed, Dec 01 2021 at 18:23, Mark Rutland wrote:
> On Wed, Dec 01, 2021 at 07:14:41PM +0100, Thomas Gleixner wrote:
>> Mark,
>> 
>> On Wed, Dec 01 2021 at 10:56, Mark Rutland wrote:
>> > On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
>> >> ---
>> >>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
>> >>  Documentation/core-api/index.rst |    8 +
>> >>  kernel/entry/common.c            |    1 
>> >
>> > I think the change to kernel/entry/common.c got included by accident?
>> 
>> That's what I get from doing such things 30 minutes before midnight...
>
> Ah, I had debugged it down to:
>
> nobikeshed void do_rst(struct tglx *tglx);
> {
> 	aargh_rst_enter(tglx);
>
> 	documentation_begin();
> 	invoke_editor(tglx);
>  	documentation_end();
> }
>
> ... where I think we forgot the:
>
> 	enter_from_sleep_mode(tglx);
> 	...
> 	exit_to_sleep_mode(tglx);

ROTFL. You made my day!

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-01 20:28               ` Thomas Gleixner
@ 2021-12-01 20:35                 ` Thomas Gleixner
  -1 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-12-01 20:35 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

The entry/exit handling for exceptions, interrupts, syscalls and KVM is
not really documented except for some comments.

Fill the gaps.

Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Be more precise, fix formatting and typos - Mark 
---
 Documentation/core-api/entry.rst |  274 +++++++++++++++++++++++++++++++++++++++
 Documentation/core-api/index.rst |    8 +
 2 files changed, 282 insertions(+)

--- /dev/null
+++ b/Documentation/core-api/entry.rst
@@ -0,0 +1,274 @@
+Entry/exit handling for exceptions, interrupts, syscalls and KVM
+================================================================
+
+For any transition from one execution domain into another the kernel
+requires update of various states. The state updates have strict rules
+versus ordering.
+
+The states which need to be updated are:
+
+  * Lockdep
+  * RCU
+  * Preemption counter
+  * Tracing
+  * Time accounting
+
+The update order depends on the transition type and is explained below in
+the transition type sections.
+
+Non-instrumentable code - noinstr
+---------------------------------
+
+Low level transition code cannot be instrumented before RCU is watching and
+after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
+instrumentation facilities depend on RCU.
+
+Aside of that many architectures have to save register state, e.g. debug or
+cause registers before another exception of the same type can happen. A
+breakpoint in the breakpoint entry code would overwrite the debug registers
+of the inital breakpoint.
+
+Such code has to be marked with the 'noinstr' attribute. That places the
+code into a special section which is taboo for instrumentation and debug
+facilities.
+
+In a function which is marked 'noinstr' it's only allowed to call into
+non-instrumentable code except when the invocation of instrumentable code
+is annotated with a instrumentation_begin()/instrumentation_end() pair:
+
+.. code-block:: c
+
+  noinstr void entry(void)
+  {
+  	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
+	...
+
+	instrumentation_begin();
+	handle_context();   // <-- instrumentable code
+	instrumentation_end();
+
+	...
+	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
+  }
+
+This allows verification of the 'noinstr' restrictions via objtool on
+supported architectures.
+
+Invoking non-instrumentable functions from instrumentable context has no
+restrictions and is useful to protect e.g. state switching which would
+cause malfunction if instrumented.
+
+All non-instrumentable entry/exit code sections before and after the RCU
+state transitions must run with interrupts disabled.
+
+Syscalls
+--------
+
+Syscall entry code starts in low level architecture specific assembly code
+and calls out into C-code after establishing low level architecture
+specific state and stack frames. This low level code must not be
+instrumented. A typical syscall handling function invoked from low level
+assembly code looks like this:
+
+.. code-block:: c
+
+  noinstr void syscall(struct pt_regs *regs, int nr)
+  {
+	arch_syscall_enter(regs);
+	nr = syscall_enter_from_user_mode(regs, nr);
+
+	instrumentation_begin();
+	if (!invoke_syscall(regs, nr) && nr != -1)
+	 	result_reg(regs) = __sys_ni_syscall(regs);
+	instrumentation_end();
+
+	syscall_exit_to_user_mode(regs);
+  }
+
+syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
+establishes state in the following order:
+
+  * Lockdep
+  * RCU / Context tracking
+  * Tracing
+
+and then invokes the various entry work functions like ptrace, seccomp,
+audit, syscall tracing etc. After the function returns instrumentable code
+can be invoked. After returning from the syscall handler the instrumentable
+code section ends and syscall_exit_to_user_mode() is invoked.
+
+syscall_exit_to_user_mode() handles all work which needs to be done before
+returning to user space like tracing, audit, signals, task work etc. After
+that it invokes exit_to_user_mode() which again handles the state
+transition in the reverse order:
+
+  * Tracing
+  * RCU / Context tracking
+  * Lockdep
+
+syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
+available as fine grained subfunctions in cases where the architecture code
+has to do extra work between the various steps. In such cases it has to
+ensure that enter_from_user_mode() is called first on entry and
+exit_to_user_mode() is called last on exit.
+
+
+KVM
+---
+
+Entering or exiting guest mode is very similar to syscalls. From the host
+kernel point of view the CPU goes off into user space when entering the
+guest and returns to the kernel on exit.
+
+kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
+and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
+The state operations have the same ordering.
+
+Task work handling is done separately for guest at the boundary of the
+vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
+the work handled on return to user space.
+
+
+Interrupts and regular exceptions
+---------------------------------
+
+Interrupts entry and exit handling is slightly more complex than syscalls
+and KVM transitions.
+
+If an interrupt is raised while the CPU executes in user space, the entry
+and exit handling is exactly the same as for syscalls.
+
+If the interrupt is raised while the CPU executes in kernel space the entry
+and exit handling is slightly different. RCU state is only updated when the
+interrupt was raised in context of the CPU's idle task because that's the
+only kernel context where RCU can be not watching on NOHZ enabled kernels.
+Lockdep and tracing have to be updated unconditionally.
+
+irqentry_enter() and irqentry_exit() provide the implementation for this.
+
+The architecture specific part looks similar to syscall handling:
+
+.. code-block:: c
+
+  noinstr void interrupt(struct pt_regs *regs, int nr)
+  {
+	arch_interrupt_enter(regs);
+	state = irqentry_enter(regs);
+
+	instrumentation_begin();
+
+	irq_enter_rcu();
+	invoke_irq_handler(regs, nr);
+	irq_exit_rcu();
+
+	instrumentation_end();
+
+	irqentry_exit(regs, state);
+  }
+
+Note, that the invocation of the actual interrupt handler is within a
+irq_enter_rcu() and irq_exit_rcu() pair.
+
+irq_enter_rcu() updates the preemption count which makes in_hardirq()
+return true, handles NOHZ tick state and interrupt time accounting. This
+means that up to the point where irq_enter_rcu() is invoked in_hardirq()
+returns false.
+
+irq_exit_rcu() handles interrupt time accounting, undoes the preemption
+count update and eventually handles soft interrupts and NOHZ tick state.
+
+The preemption count could be established in irqentry_enter() already, but
+there is no real value to do so. This allows the preemption count to be
+traced and just puts a restriction on the early entry code up to
+irq_enter_rcu().
+
+This also keeps the handling vs. irq_exit_rcu() symmetric and
+irq_exit_rcu() must undo the preempt count elevation before handling soft
+interrupts and irqentry_exit() also requires that because it might
+schedule.
+
+
+NMI and NMI-like exceptions
+---------------------------
+
+NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug
+interrupts etc. can hit any context and have to be extra careful vs. the
+state.
+
+Debug exceptions can handle user space breakpoints or watchpoints in the
+same way as an interrupt which was raised while executing in user space,
+but kernel mode debug exceptions have to be treated like NMIs as they can
+even happen in NMI context, e.g. due to code patching.
+
+Also Machine check exceptions can handle user mode exceptions like regular
+interrupts, but for kernel mode exceptions they have to be treated like
+NMIs.
+
+NMIs and the other NMI-like exceptions handle state transitions in the most
+straight forward way and do not differentiate between user and kernel mode
+origin.
+
+The state update on entry is handled in irqentry_nmi_enter() which updates
+state in the following order:
+
+  * Preemption counter
+  * Lockdep
+  * RCU
+  * Tracing
+
+The exit counterpart irqentry_nmi_exit() does the reverse operation in the
+reverse order.
+
+Note, that the update of the preemption counter has to be the first
+operation on enter and the last operation on exit. The reason is that both
+lockdep and RCU rely on in_nmi() returning true in this case. The
+preemption count modification in the NMI entry/exit case must not be
+traced.
+
+Architecture specific code looks like this:
+
+.. code-block:: c
+
+  noinstr void nmi(struct pt_regs *regs)
+  {
+	arch_nmi_enter(regs);
+	state = irqentry_nmi_enter(regs);
+
+	instrumentation_begin();
+	nmi_handler(regs);
+	instrumentation_end();
+
+	irqentry_nmi_exit(regs);
+  }
+
+and for e.g. a debug exception it can look like this:
+
+.. code-block:: c
+
+  noinstr void debug(struct pt_regs *regs)
+  {
+	arch_nmi_enter(regs);
+
+	debug_regs = save_debug_regs();
+
+	if (user_mode(regs)) {
+		state = irqentry_enter(regs);
+
+		instrumentation_begin();
+		user_mode_debug_handler(regs, debug_regs);
+		instrumentation_end();
+
+		irqentry_exit(regs, state);
+  	} else {
+  		state = irqentry_nmi_enter(regs);
+
+		instrumentation_begin();
+		kernel_mode_debug_handler(regs, debug_regs);
+		instrumentation_end();
+
+		irqentry_nmi_exit(regs, state);
+	}
+  }
+
+There is no combined irqentry_nmi_if_kernel() function available as the
+above cannot be handled in an exception agnostic way.
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -44,6 +44,14 @@ Library functionality that is used throu
    timekeeping
    errseq
 
+Low level entry and exit
+========================
+
+.. toctree::
+   :maxdepth: 1
+
+   entry
+
 Concurrency primitives
 ======================
 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-01 20:35                 ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2021-12-01 20:35 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

The entry/exit handling for exceptions, interrupts, syscalls and KVM is
not really documented except for some comments.

Fill the gaps.

Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Be more precise, fix formatting and typos - Mark 
---
 Documentation/core-api/entry.rst |  274 +++++++++++++++++++++++++++++++++++++++
 Documentation/core-api/index.rst |    8 +
 2 files changed, 282 insertions(+)

--- /dev/null
+++ b/Documentation/core-api/entry.rst
@@ -0,0 +1,274 @@
+Entry/exit handling for exceptions, interrupts, syscalls and KVM
+================================================================
+
+For any transition from one execution domain into another the kernel
+requires update of various states. The state updates have strict rules
+versus ordering.
+
+The states which need to be updated are:
+
+  * Lockdep
+  * RCU
+  * Preemption counter
+  * Tracing
+  * Time accounting
+
+The update order depends on the transition type and is explained below in
+the transition type sections.
+
+Non-instrumentable code - noinstr
+---------------------------------
+
+Low level transition code cannot be instrumented before RCU is watching and
+after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
+instrumentation facilities depend on RCU.
+
+Aside of that many architectures have to save register state, e.g. debug or
+cause registers before another exception of the same type can happen. A
+breakpoint in the breakpoint entry code would overwrite the debug registers
+of the inital breakpoint.
+
+Such code has to be marked with the 'noinstr' attribute. That places the
+code into a special section which is taboo for instrumentation and debug
+facilities.
+
+In a function which is marked 'noinstr' it's only allowed to call into
+non-instrumentable code except when the invocation of instrumentable code
+is annotated with a instrumentation_begin()/instrumentation_end() pair:
+
+.. code-block:: c
+
+  noinstr void entry(void)
+  {
+  	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
+	...
+
+	instrumentation_begin();
+	handle_context();   // <-- instrumentable code
+	instrumentation_end();
+
+	...
+	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
+  }
+
+This allows verification of the 'noinstr' restrictions via objtool on
+supported architectures.
+
+Invoking non-instrumentable functions from instrumentable context has no
+restrictions and is useful to protect e.g. state switching which would
+cause malfunction if instrumented.
+
+All non-instrumentable entry/exit code sections before and after the RCU
+state transitions must run with interrupts disabled.
+
+Syscalls
+--------
+
+Syscall entry code starts in low level architecture specific assembly code
+and calls out into C-code after establishing low level architecture
+specific state and stack frames. This low level code must not be
+instrumented. A typical syscall handling function invoked from low level
+assembly code looks like this:
+
+.. code-block:: c
+
+  noinstr void syscall(struct pt_regs *regs, int nr)
+  {
+	arch_syscall_enter(regs);
+	nr = syscall_enter_from_user_mode(regs, nr);
+
+	instrumentation_begin();
+	if (!invoke_syscall(regs, nr) && nr != -1)
+	 	result_reg(regs) = __sys_ni_syscall(regs);
+	instrumentation_end();
+
+	syscall_exit_to_user_mode(regs);
+  }
+
+syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
+establishes state in the following order:
+
+  * Lockdep
+  * RCU / Context tracking
+  * Tracing
+
+and then invokes the various entry work functions like ptrace, seccomp,
+audit, syscall tracing etc. After the function returns instrumentable code
+can be invoked. After returning from the syscall handler the instrumentable
+code section ends and syscall_exit_to_user_mode() is invoked.
+
+syscall_exit_to_user_mode() handles all work which needs to be done before
+returning to user space like tracing, audit, signals, task work etc. After
+that it invokes exit_to_user_mode() which again handles the state
+transition in the reverse order:
+
+  * Tracing
+  * RCU / Context tracking
+  * Lockdep
+
+syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
+available as fine grained subfunctions in cases where the architecture code
+has to do extra work between the various steps. In such cases it has to
+ensure that enter_from_user_mode() is called first on entry and
+exit_to_user_mode() is called last on exit.
+
+
+KVM
+---
+
+Entering or exiting guest mode is very similar to syscalls. From the host
+kernel point of view the CPU goes off into user space when entering the
+guest and returns to the kernel on exit.
+
+kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
+and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
+The state operations have the same ordering.
+
+Task work handling is done separately for guest at the boundary of the
+vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
+the work handled on return to user space.
+
+
+Interrupts and regular exceptions
+---------------------------------
+
+Interrupts entry and exit handling is slightly more complex than syscalls
+and KVM transitions.
+
+If an interrupt is raised while the CPU executes in user space, the entry
+and exit handling is exactly the same as for syscalls.
+
+If the interrupt is raised while the CPU executes in kernel space the entry
+and exit handling is slightly different. RCU state is only updated when the
+interrupt was raised in context of the CPU's idle task because that's the
+only kernel context where RCU can be not watching on NOHZ enabled kernels.
+Lockdep and tracing have to be updated unconditionally.
+
+irqentry_enter() and irqentry_exit() provide the implementation for this.
+
+The architecture specific part looks similar to syscall handling:
+
+.. code-block:: c
+
+  noinstr void interrupt(struct pt_regs *regs, int nr)
+  {
+	arch_interrupt_enter(regs);
+	state = irqentry_enter(regs);
+
+	instrumentation_begin();
+
+	irq_enter_rcu();
+	invoke_irq_handler(regs, nr);
+	irq_exit_rcu();
+
+	instrumentation_end();
+
+	irqentry_exit(regs, state);
+  }
+
+Note, that the invocation of the actual interrupt handler is within a
+irq_enter_rcu() and irq_exit_rcu() pair.
+
+irq_enter_rcu() updates the preemption count which makes in_hardirq()
+return true, handles NOHZ tick state and interrupt time accounting. This
+means that up to the point where irq_enter_rcu() is invoked in_hardirq()
+returns false.
+
+irq_exit_rcu() handles interrupt time accounting, undoes the preemption
+count update and eventually handles soft interrupts and NOHZ tick state.
+
+The preemption count could be established in irqentry_enter() already, but
+there is no real value to do so. This allows the preemption count to be
+traced and just puts a restriction on the early entry code up to
+irq_enter_rcu().
+
+This also keeps the handling vs. irq_exit_rcu() symmetric and
+irq_exit_rcu() must undo the preempt count elevation before handling soft
+interrupts and irqentry_exit() also requires that because it might
+schedule.
+
+
+NMI and NMI-like exceptions
+---------------------------
+
+NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug
+interrupts etc. can hit any context and have to be extra careful vs. the
+state.
+
+Debug exceptions can handle user space breakpoints or watchpoints in the
+same way as an interrupt which was raised while executing in user space,
+but kernel mode debug exceptions have to be treated like NMIs as they can
+even happen in NMI context, e.g. due to code patching.
+
+Also Machine check exceptions can handle user mode exceptions like regular
+interrupts, but for kernel mode exceptions they have to be treated like
+NMIs.
+
+NMIs and the other NMI-like exceptions handle state transitions in the most
+straight forward way and do not differentiate between user and kernel mode
+origin.
+
+The state update on entry is handled in irqentry_nmi_enter() which updates
+state in the following order:
+
+  * Preemption counter
+  * Lockdep
+  * RCU
+  * Tracing
+
+The exit counterpart irqentry_nmi_exit() does the reverse operation in the
+reverse order.
+
+Note, that the update of the preemption counter has to be the first
+operation on enter and the last operation on exit. The reason is that both
+lockdep and RCU rely on in_nmi() returning true in this case. The
+preemption count modification in the NMI entry/exit case must not be
+traced.
+
+Architecture specific code looks like this:
+
+.. code-block:: c
+
+  noinstr void nmi(struct pt_regs *regs)
+  {
+	arch_nmi_enter(regs);
+	state = irqentry_nmi_enter(regs);
+
+	instrumentation_begin();
+	nmi_handler(regs);
+	instrumentation_end();
+
+	irqentry_nmi_exit(regs);
+  }
+
+and for e.g. a debug exception it can look like this:
+
+.. code-block:: c
+
+  noinstr void debug(struct pt_regs *regs)
+  {
+	arch_nmi_enter(regs);
+
+	debug_regs = save_debug_regs();
+
+	if (user_mode(regs)) {
+		state = irqentry_enter(regs);
+
+		instrumentation_begin();
+		user_mode_debug_handler(regs, debug_regs);
+		instrumentation_end();
+
+		irqentry_exit(regs, state);
+  	} else {
+  		state = irqentry_nmi_enter(regs);
+
+		instrumentation_begin();
+		kernel_mode_debug_handler(regs, debug_regs);
+		instrumentation_end();
+
+		irqentry_nmi_exit(regs, state);
+	}
+  }
+
+There is no combined irqentry_nmi_if_kernel() function available as the
+above cannot be handled in an exception agnostic way.
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -44,6 +44,14 @@ Library functionality that is used throu
    timekeeping
    errseq
 
+Low level entry and exit
+========================
+
+.. toctree::
+   :maxdepth: 1
+
+   entry
+
 Concurrency primitives
 ======================
 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-01 20:35                 ` Thomas Gleixner
@ 2021-12-02 10:03                   ` Mark Rutland
  -1 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2021-12-02 10:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

On Wed, Dec 01, 2021 at 09:35:37PM +0100, Thomas Gleixner wrote:
> The entry/exit handling for exceptions, interrupts, syscalls and KVM is
> not really documented except for some comments.
> 
> Fill the gaps.
> 
> Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

The content looks good, and I'm all out of nits to pick, so FWIW:

Reviewed-by: Mark Rutland <mark.rutland@arm.com>

Thanks,
Mark.

> ---
> V2: Be more precise, fix formatting and typos - Mark 
> ---
>  Documentation/core-api/entry.rst |  274 +++++++++++++++++++++++++++++++++++++++
>  Documentation/core-api/index.rst |    8 +
>  2 files changed, 282 insertions(+)
> 
> --- /dev/null
> +++ b/Documentation/core-api/entry.rst
> @@ -0,0 +1,274 @@
> +Entry/exit handling for exceptions, interrupts, syscalls and KVM
> +================================================================
> +
> +For any transition from one execution domain into another the kernel
> +requires update of various states. The state updates have strict rules
> +versus ordering.
> +
> +The states which need to be updated are:
> +
> +  * Lockdep
> +  * RCU
> +  * Preemption counter
> +  * Tracing
> +  * Time accounting
> +
> +The update order depends on the transition type and is explained below in
> +the transition type sections.
> +
> +Non-instrumentable code - noinstr
> +---------------------------------
> +
> +Low level transition code cannot be instrumented before RCU is watching and
> +after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
> +instrumentation facilities depend on RCU.
> +
> +Aside of that many architectures have to save register state, e.g. debug or
> +cause registers before another exception of the same type can happen. A
> +breakpoint in the breakpoint entry code would overwrite the debug registers
> +of the inital breakpoint.
> +
> +Such code has to be marked with the 'noinstr' attribute. That places the
> +code into a special section which is taboo for instrumentation and debug
> +facilities.
> +
> +In a function which is marked 'noinstr' it's only allowed to call into
> +non-instrumentable code except when the invocation of instrumentable code
> +is annotated with a instrumentation_begin()/instrumentation_end() pair:
> +
> +.. code-block:: c
> +
> +  noinstr void entry(void)
> +  {
> +  	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
> +	...
> +
> +	instrumentation_begin();
> +	handle_context();   // <-- instrumentable code
> +	instrumentation_end();
> +
> +	...
> +	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
> +  }
> +
> +This allows verification of the 'noinstr' restrictions via objtool on
> +supported architectures.
> +
> +Invoking non-instrumentable functions from instrumentable context has no
> +restrictions and is useful to protect e.g. state switching which would
> +cause malfunction if instrumented.
> +
> +All non-instrumentable entry/exit code sections before and after the RCU
> +state transitions must run with interrupts disabled.
> +
> +Syscalls
> +--------
> +
> +Syscall entry code starts in low level architecture specific assembly code
> +and calls out into C-code after establishing low level architecture
> +specific state and stack frames. This low level code must not be
> +instrumented. A typical syscall handling function invoked from low level
> +assembly code looks like this:
> +
> +.. code-block:: c
> +
> +  noinstr void syscall(struct pt_regs *regs, int nr)
> +  {
> +	arch_syscall_enter(regs);
> +	nr = syscall_enter_from_user_mode(regs, nr);
> +
> +	instrumentation_begin();
> +	if (!invoke_syscall(regs, nr) && nr != -1)
> +	 	result_reg(regs) = __sys_ni_syscall(regs);
> +	instrumentation_end();
> +
> +	syscall_exit_to_user_mode(regs);
> +  }
> +
> +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
> +establishes state in the following order:
> +
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Tracing
> +
> +and then invokes the various entry work functions like ptrace, seccomp,
> +audit, syscall tracing etc. After the function returns instrumentable code
> +can be invoked. After returning from the syscall handler the instrumentable
> +code section ends and syscall_exit_to_user_mode() is invoked.
> +
> +syscall_exit_to_user_mode() handles all work which needs to be done before
> +returning to user space like tracing, audit, signals, task work etc. After
> +that it invokes exit_to_user_mode() which again handles the state
> +transition in the reverse order:
> +
> +  * Tracing
> +  * RCU / Context tracking
> +  * Lockdep
> +
> +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> +available as fine grained subfunctions in cases where the architecture code
> +has to do extra work between the various steps. In such cases it has to
> +ensure that enter_from_user_mode() is called first on entry and
> +exit_to_user_mode() is called last on exit.
> +
> +
> +KVM
> +---
> +
> +Entering or exiting guest mode is very similar to syscalls. From the host
> +kernel point of view the CPU goes off into user space when entering the
> +guest and returns to the kernel on exit.
> +
> +kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
> +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
> +The state operations have the same ordering.
> +
> +Task work handling is done separately for guest at the boundary of the
> +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
> +the work handled on return to user space.
> +
> +
> +Interrupts and regular exceptions
> +---------------------------------
> +
> +Interrupts entry and exit handling is slightly more complex than syscalls
> +and KVM transitions.
> +
> +If an interrupt is raised while the CPU executes in user space, the entry
> +and exit handling is exactly the same as for syscalls.
> +
> +If the interrupt is raised while the CPU executes in kernel space the entry
> +and exit handling is slightly different. RCU state is only updated when the
> +interrupt was raised in context of the CPU's idle task because that's the
> +only kernel context where RCU can be not watching on NOHZ enabled kernels.
> +Lockdep and tracing have to be updated unconditionally.
> +
> +irqentry_enter() and irqentry_exit() provide the implementation for this.
> +
> +The architecture specific part looks similar to syscall handling:
> +
> +.. code-block:: c
> +
> +  noinstr void interrupt(struct pt_regs *regs, int nr)
> +  {
> +	arch_interrupt_enter(regs);
> +	state = irqentry_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	irq_enter_rcu();
> +	invoke_irq_handler(regs, nr);
> +	irq_exit_rcu();
> +
> +	instrumentation_end();
> +
> +	irqentry_exit(regs, state);
> +  }
> +
> +Note, that the invocation of the actual interrupt handler is within a
> +irq_enter_rcu() and irq_exit_rcu() pair.
> +
> +irq_enter_rcu() updates the preemption count which makes in_hardirq()
> +return true, handles NOHZ tick state and interrupt time accounting. This
> +means that up to the point where irq_enter_rcu() is invoked in_hardirq()
> +returns false.
> +
> +irq_exit_rcu() handles interrupt time accounting, undoes the preemption
> +count update and eventually handles soft interrupts and NOHZ tick state.
> +
> +The preemption count could be established in irqentry_enter() already, but
> +there is no real value to do so. This allows the preemption count to be
> +traced and just puts a restriction on the early entry code up to
> +irq_enter_rcu().
> +
> +This also keeps the handling vs. irq_exit_rcu() symmetric and
> +irq_exit_rcu() must undo the preempt count elevation before handling soft
> +interrupts and irqentry_exit() also requires that because it might
> +schedule.
> +
> +
> +NMI and NMI-like exceptions
> +---------------------------
> +
> +NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug
> +interrupts etc. can hit any context and have to be extra careful vs. the
> +state.
> +
> +Debug exceptions can handle user space breakpoints or watchpoints in the
> +same way as an interrupt which was raised while executing in user space,
> +but kernel mode debug exceptions have to be treated like NMIs as they can
> +even happen in NMI context, e.g. due to code patching.
> +
> +Also Machine check exceptions can handle user mode exceptions like regular
> +interrupts, but for kernel mode exceptions they have to be treated like
> +NMIs.
> +
> +NMIs and the other NMI-like exceptions handle state transitions in the most
> +straight forward way and do not differentiate between user and kernel mode
> +origin.
> +
> +The state update on entry is handled in irqentry_nmi_enter() which updates
> +state in the following order:
> +
> +  * Preemption counter
> +  * Lockdep
> +  * RCU
> +  * Tracing
> +
> +The exit counterpart irqentry_nmi_exit() does the reverse operation in the
> +reverse order.
> +
> +Note, that the update of the preemption counter has to be the first
> +operation on enter and the last operation on exit. The reason is that both
> +lockdep and RCU rely on in_nmi() returning true in this case. The
> +preemption count modification in the NMI entry/exit case must not be
> +traced.
> +
> +Architecture specific code looks like this:
> +
> +.. code-block:: c
> +
> +  noinstr void nmi(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +	state = irqentry_nmi_enter(regs);
> +
> +	instrumentation_begin();
> +	nmi_handler(regs);
> +	instrumentation_end();
> +
> +	irqentry_nmi_exit(regs);
> +  }
> +
> +and for e.g. a debug exception it can look like this:
> +
> +.. code-block:: c
> +
> +  noinstr void debug(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +
> +	debug_regs = save_debug_regs();
> +
> +	if (user_mode(regs)) {
> +		state = irqentry_enter(regs);
> +
> +		instrumentation_begin();
> +		user_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_exit(regs, state);
> +  	} else {
> +  		state = irqentry_nmi_enter(regs);
> +
> +		instrumentation_begin();
> +		kernel_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_nmi_exit(regs, state);
> +	}
> +  }
> +
> +There is no combined irqentry_nmi_if_kernel() function available as the
> +above cannot be handled in an exception agnostic way.
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -44,6 +44,14 @@ Library functionality that is used throu
>     timekeeping
>     errseq
>  
> +Low level entry and exit
> +========================
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   entry
> +
>  Concurrency primitives
>  ======================
>  

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-02 10:03                   ` Mark Rutland
  0 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2021-12-02 10:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

On Wed, Dec 01, 2021 at 09:35:37PM +0100, Thomas Gleixner wrote:
> The entry/exit handling for exceptions, interrupts, syscalls and KVM is
> not really documented except for some comments.
> 
> Fill the gaps.
> 
> Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

The content looks good, and I'm all out of nits to pick, so FWIW:

Reviewed-by: Mark Rutland <mark.rutland@arm.com>

Thanks,
Mark.

> ---
> V2: Be more precise, fix formatting and typos - Mark 
> ---
>  Documentation/core-api/entry.rst |  274 +++++++++++++++++++++++++++++++++++++++
>  Documentation/core-api/index.rst |    8 +
>  2 files changed, 282 insertions(+)
> 
> --- /dev/null
> +++ b/Documentation/core-api/entry.rst
> @@ -0,0 +1,274 @@
> +Entry/exit handling for exceptions, interrupts, syscalls and KVM
> +================================================================
> +
> +For any transition from one execution domain into another the kernel
> +requires update of various states. The state updates have strict rules
> +versus ordering.
> +
> +The states which need to be updated are:
> +
> +  * Lockdep
> +  * RCU
> +  * Preemption counter
> +  * Tracing
> +  * Time accounting
> +
> +The update order depends on the transition type and is explained below in
> +the transition type sections.
> +
> +Non-instrumentable code - noinstr
> +---------------------------------
> +
> +Low level transition code cannot be instrumented before RCU is watching and
> +after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
> +instrumentation facilities depend on RCU.
> +
> +Aside of that many architectures have to save register state, e.g. debug or
> +cause registers before another exception of the same type can happen. A
> +breakpoint in the breakpoint entry code would overwrite the debug registers
> +of the inital breakpoint.
> +
> +Such code has to be marked with the 'noinstr' attribute. That places the
> +code into a special section which is taboo for instrumentation and debug
> +facilities.
> +
> +In a function which is marked 'noinstr' it's only allowed to call into
> +non-instrumentable code except when the invocation of instrumentable code
> +is annotated with a instrumentation_begin()/instrumentation_end() pair:
> +
> +.. code-block:: c
> +
> +  noinstr void entry(void)
> +  {
> +  	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
> +	...
> +
> +	instrumentation_begin();
> +	handle_context();   // <-- instrumentable code
> +	instrumentation_end();
> +
> +	...
> +	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
> +  }
> +
> +This allows verification of the 'noinstr' restrictions via objtool on
> +supported architectures.
> +
> +Invoking non-instrumentable functions from instrumentable context has no
> +restrictions and is useful to protect e.g. state switching which would
> +cause malfunction if instrumented.
> +
> +All non-instrumentable entry/exit code sections before and after the RCU
> +state transitions must run with interrupts disabled.
> +
> +Syscalls
> +--------
> +
> +Syscall entry code starts in low level architecture specific assembly code
> +and calls out into C-code after establishing low level architecture
> +specific state and stack frames. This low level code must not be
> +instrumented. A typical syscall handling function invoked from low level
> +assembly code looks like this:
> +
> +.. code-block:: c
> +
> +  noinstr void syscall(struct pt_regs *regs, int nr)
> +  {
> +	arch_syscall_enter(regs);
> +	nr = syscall_enter_from_user_mode(regs, nr);
> +
> +	instrumentation_begin();
> +	if (!invoke_syscall(regs, nr) && nr != -1)
> +	 	result_reg(regs) = __sys_ni_syscall(regs);
> +	instrumentation_end();
> +
> +	syscall_exit_to_user_mode(regs);
> +  }
> +
> +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
> +establishes state in the following order:
> +
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Tracing
> +
> +and then invokes the various entry work functions like ptrace, seccomp,
> +audit, syscall tracing etc. After the function returns instrumentable code
> +can be invoked. After returning from the syscall handler the instrumentable
> +code section ends and syscall_exit_to_user_mode() is invoked.
> +
> +syscall_exit_to_user_mode() handles all work which needs to be done before
> +returning to user space like tracing, audit, signals, task work etc. After
> +that it invokes exit_to_user_mode() which again handles the state
> +transition in the reverse order:
> +
> +  * Tracing
> +  * RCU / Context tracking
> +  * Lockdep
> +
> +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> +available as fine grained subfunctions in cases where the architecture code
> +has to do extra work between the various steps. In such cases it has to
> +ensure that enter_from_user_mode() is called first on entry and
> +exit_to_user_mode() is called last on exit.
> +
> +
> +KVM
> +---
> +
> +Entering or exiting guest mode is very similar to syscalls. From the host
> +kernel point of view the CPU goes off into user space when entering the
> +guest and returns to the kernel on exit.
> +
> +kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
> +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
> +The state operations have the same ordering.
> +
> +Task work handling is done separately for guest at the boundary of the
> +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
> +the work handled on return to user space.
> +
> +
> +Interrupts and regular exceptions
> +---------------------------------
> +
> +Interrupts entry and exit handling is slightly more complex than syscalls
> +and KVM transitions.
> +
> +If an interrupt is raised while the CPU executes in user space, the entry
> +and exit handling is exactly the same as for syscalls.
> +
> +If the interrupt is raised while the CPU executes in kernel space the entry
> +and exit handling is slightly different. RCU state is only updated when the
> +interrupt was raised in context of the CPU's idle task because that's the
> +only kernel context where RCU can be not watching on NOHZ enabled kernels.
> +Lockdep and tracing have to be updated unconditionally.
> +
> +irqentry_enter() and irqentry_exit() provide the implementation for this.
> +
> +The architecture specific part looks similar to syscall handling:
> +
> +.. code-block:: c
> +
> +  noinstr void interrupt(struct pt_regs *regs, int nr)
> +  {
> +	arch_interrupt_enter(regs);
> +	state = irqentry_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	irq_enter_rcu();
> +	invoke_irq_handler(regs, nr);
> +	irq_exit_rcu();
> +
> +	instrumentation_end();
> +
> +	irqentry_exit(regs, state);
> +  }
> +
> +Note, that the invocation of the actual interrupt handler is within a
> +irq_enter_rcu() and irq_exit_rcu() pair.
> +
> +irq_enter_rcu() updates the preemption count which makes in_hardirq()
> +return true, handles NOHZ tick state and interrupt time accounting. This
> +means that up to the point where irq_enter_rcu() is invoked in_hardirq()
> +returns false.
> +
> +irq_exit_rcu() handles interrupt time accounting, undoes the preemption
> +count update and eventually handles soft interrupts and NOHZ tick state.
> +
> +The preemption count could be established in irqentry_enter() already, but
> +there is no real value to do so. This allows the preemption count to be
> +traced and just puts a restriction on the early entry code up to
> +irq_enter_rcu().
> +
> +This also keeps the handling vs. irq_exit_rcu() symmetric and
> +irq_exit_rcu() must undo the preempt count elevation before handling soft
> +interrupts and irqentry_exit() also requires that because it might
> +schedule.
> +
> +
> +NMI and NMI-like exceptions
> +---------------------------
> +
> +NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug
> +interrupts etc. can hit any context and have to be extra careful vs. the
> +state.
> +
> +Debug exceptions can handle user space breakpoints or watchpoints in the
> +same way as an interrupt which was raised while executing in user space,
> +but kernel mode debug exceptions have to be treated like NMIs as they can
> +even happen in NMI context, e.g. due to code patching.
> +
> +Also Machine check exceptions can handle user mode exceptions like regular
> +interrupts, but for kernel mode exceptions they have to be treated like
> +NMIs.
> +
> +NMIs and the other NMI-like exceptions handle state transitions in the most
> +straight forward way and do not differentiate between user and kernel mode
> +origin.
> +
> +The state update on entry is handled in irqentry_nmi_enter() which updates
> +state in the following order:
> +
> +  * Preemption counter
> +  * Lockdep
> +  * RCU
> +  * Tracing
> +
> +The exit counterpart irqentry_nmi_exit() does the reverse operation in the
> +reverse order.
> +
> +Note, that the update of the preemption counter has to be the first
> +operation on enter and the last operation on exit. The reason is that both
> +lockdep and RCU rely on in_nmi() returning true in this case. The
> +preemption count modification in the NMI entry/exit case must not be
> +traced.
> +
> +Architecture specific code looks like this:
> +
> +.. code-block:: c
> +
> +  noinstr void nmi(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +	state = irqentry_nmi_enter(regs);
> +
> +	instrumentation_begin();
> +	nmi_handler(regs);
> +	instrumentation_end();
> +
> +	irqentry_nmi_exit(regs);
> +  }
> +
> +and for e.g. a debug exception it can look like this:
> +
> +.. code-block:: c
> +
> +  noinstr void debug(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +
> +	debug_regs = save_debug_regs();
> +
> +	if (user_mode(regs)) {
> +		state = irqentry_enter(regs);
> +
> +		instrumentation_begin();
> +		user_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_exit(regs, state);
> +  	} else {
> +  		state = irqentry_nmi_enter(regs);
> +
> +		instrumentation_begin();
> +		kernel_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_nmi_exit(regs, state);
> +	}
> +  }
> +
> +There is no combined irqentry_nmi_if_kernel() function available as the
> +above cannot be handled in an exception agnostic way.
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -44,6 +44,14 @@ Library functionality that is used throu
>     timekeeping
>     errseq
>  
> +Low level entry and exit
> +========================
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   entry
> +
>  Concurrency primitives
>  ======================
>  

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-01 20:35                 ` Thomas Gleixner
@ 2021-12-03 20:08                   ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2021-12-03 20:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mark Rutland, Steven Rostedt, Nicolas Saenz Julienne,
	linux-kernel, linux-arm-kernel, rcu, Peter Zijlstra, mtosatti,
	frederic, Jonathan Corbet

On Wed, Dec 01, 2021 at 09:35:37PM +0100, Thomas Gleixner wrote:
> The entry/exit handling for exceptions, interrupts, syscalls and KVM is
> not really documented except for some comments.
> 
> Fill the gaps.
> 
> Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Good stuff, thank you very much!  I hope that this documentation helps
people bringing up new hardware, and from a more selfish viewpoint, I
especially hope that it reduces the number of "RCU is broken!" complaints
from them.  ;-)

Some native-English-speaker wordsmithing below.  Anything that might be
important from a technical viewpoint is marked "!!!".

							Thanx, Paul

> ---
> V2: Be more precise, fix formatting and typos - Mark 
> ---
>  Documentation/core-api/entry.rst |  274 +++++++++++++++++++++++++++++++++++++++
>  Documentation/core-api/index.rst |    8 +
>  2 files changed, 282 insertions(+)
> 
> --- /dev/null
> +++ b/Documentation/core-api/entry.rst
> @@ -0,0 +1,274 @@
> +Entry/exit handling for exceptions, interrupts, syscalls and KVM
> +================================================================
> +
> +For any transition from one execution domain into another the kernel
> +requires update of various states. The state updates have strict rules
> +versus ordering.
> +
> +
> +The states which need to be updated are:

I suggest reworking the above two paragraphs together something like
the following:

All transitions between execution domains require state updates which
are subject to strict ordering constraints.  State updates are required
for the following:

> +  * Lockdep
> +  * RCU

!!! As you do below, I suggest replacing "RCU" with:

  * RCU / Context tracking

With some luck, RCU's dyntick-idle tracking will be pulled into the
context-tracking code.  Doing this could allow a pair of atomic RMW
operations to be combined into one, at least if appropriate adjustments
can be applied.

> +  * Preemption counter
> +  * Tracing
> +  * Time accounting
> +
> +The update order depends on the transition type and is explained below in
> +the transition type sections.
@@@
> +
> +Non-instrumentable code - noinstr
> +---------------------------------
> +
> +Low level transition code cannot be instrumented before RCU is watching and
> +after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
> +instrumentation facilities depend on RCU.
> +
> +Aside of that many architectures have to save register state, e.g. debug or
> +cause registers before another exception of the same type can happen. A
> +breakpoint in the breakpoint entry code would overwrite the debug registers
> +of the inital breakpoint.

I suggest combining the above two paragraphs something like the following:

Most instrumentation facilities depend on RCU, so intrumentation is
prohibited for entry code before RCU starts watching and exit code
after RCU stops watching.  In addition, many architectures must save
and restore register state, which means that (for example) a breakpoint
in the breakpoint entry code would overwrite the debug registers of the
initial breakpoint.

> +Such code has to be marked with the 'noinstr' attribute. That places the
> +code into a special section which is taboo for instrumentation and debug
> +facilities.
> +
> +In a function which is marked 'noinstr' it's only allowed to call into
> +non-instrumentable code except when the invocation of instrumentable code
> +is annotated with a instrumentation_begin()/instrumentation_end() pair:

I suggest combining the above two paragraphs something like the following:

Such code must be marked with the 'noinstr' attribute, placing that
code into a special section inaccessible to instrumentation and debug
facilities.  Some functions are partially instrumentable, which is
handled by marking them nointr and using instrumentation_begin() and
instrumentation_end() to flag the instrumentable ranges of code:

> +.. code-block:: c
> +
> +  noinstr void entry(void)
> +  {
> +  	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
> +	...
> +
> +	instrumentation_begin();
> +	handle_context();   // <-- instrumentable code
> +	instrumentation_end();
> +
> +	...
> +	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
> +  }
> +
> +This allows verification of the 'noinstr' restrictions via objtool on
> +supported architectures.
> +
> +Invoking non-instrumentable functions from instrumentable context has no
> +restrictions and is useful to protect e.g. state switching which would
> +cause malfunction if instrumented.
> +
> +All non-instrumentable entry/exit code sections before and after the RCU
> +state transitions must run with interrupts disabled.
> +
> +Syscalls
> +--------
> +
> +Syscall entry code starts in low level architecture specific assembly code
> +and calls out into C-code after establishing low level architecture
> +specific state and stack frames. This low level code must not be
> +instrumented. A typical syscall handling function invoked from low level
> +assembly code looks like this:

(Nits, but whatever.  I am (perhaps misguidedly) asserting that assembly
code is by definition low-level and architecture-specific.  That lets
"low-level" tag the C code, helping the reader make the connection
between the two mentions of C code.  The above paragraph then looks
something like this.)

Syscall-entry code starts in assembly code and calls out into low-level
C code after establishing low-level architecture-specific state and stack
frames. This low-level C code must not be instrumented. A typical syscall
handling function invoked from low-level assembly code looks like this:

> +.. code-block:: c
> +
> +  noinstr void syscall(struct pt_regs *regs, int nr)
> +  {
> +	arch_syscall_enter(regs);
> +	nr = syscall_enter_from_user_mode(regs, nr);
> +
> +	instrumentation_begin();
> +	if (!invoke_syscall(regs, nr) && nr != -1)
> +	 	result_reg(regs) = __sys_ni_syscall(regs);
> +	instrumentation_end();
> +
> +	syscall_exit_to_user_mode(regs);
> +  }
> +
> +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
> +establishes state in the following order:
> +
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Tracing
> +
> +and then invokes the various entry work functions like ptrace, seccomp,
> +audit, syscall tracing etc. After the function returns instrumentable code
> +can be invoked. After returning from the syscall handler the instrumentable
> +code section ends and syscall_exit_to_user_mode() is invoked.

Not at all a big deal, but I suggest reworking the above paragraphs as
follows to avoid the ambiguous "the function" in the second sentence:

and then invokes the various entry work functions like ptrace, seccomp,
audit, syscall tracing, etc. After all that is done, the instrumentable
invoke_syscall function can be invoked. The instrumentable code section
then ends, after which syscall_exit_to_user_mode() is invoked.

> +syscall_exit_to_user_mode() handles all work which needs to be done before
> +returning to user space like tracing, audit, signals, task work etc. After
> +that it invokes exit_to_user_mode() which again handles the state
> +transition in the reverse order:
> +
> +  * Tracing
> +  * RCU / Context tracking
> +  * Lockdep
> +
> +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> +available as fine grained subfunctions in cases where the architecture code
> +has to do extra work between the various steps. In such cases it has to
> +ensure that enter_from_user_mode() is called first on entry and
> +exit_to_user_mode() is called last on exit.

!!! Here I have a question.  Can calls to enter_from_user_mode()
be nested?  RCU is OK with this, but I am not so sure that everything
else is.  If nesting is prohibited, this paragraph should explicitly
say that.  If nesting is theoretically possible, but should be avoided,
it would be good to say that as well.  (Otherwise "It looks like it
might work, so let's go for it!")

> +KVM
> +---
> +
> +Entering or exiting guest mode is very similar to syscalls. From the host
> +kernel point of view the CPU goes off into user space when entering the
> +guest and returns to the kernel on exit.
> +
> +kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
> +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
> +The state operations have the same ordering.
> +
> +Task work handling is done separately for guest at the boundary of the
> +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
> +the work handled on return to user space.
> +
> +
> +Interrupts and regular exceptions
> +---------------------------------
> +
> +Interrupts entry and exit handling is slightly more complex than syscalls
> +and KVM transitions.
> +
> +If an interrupt is raised while the CPU executes in user space, the entry
> +and exit handling is exactly the same as for syscalls.
> +
> +If the interrupt is raised while the CPU executes in kernel space the entry
> +and exit handling is slightly different. RCU state is only updated when the
> +interrupt was raised in context of the CPU's idle task because that's the
> +only kernel context where RCU can be not watching on NOHZ enabled kernels.
> +Lockdep and tracing have to be updated unconditionally.

!!! You lost me on this one.  Does that second-to-last sentence instead
want to end something like this?  "... where RCU will not be watching
when running on non-nohz_full CPUs."

> +irqentry_enter() and irqentry_exit() provide the implementation for this.
> +
> +The architecture specific part looks similar to syscall handling:
> +
> +.. code-block:: c
> +
> +  noinstr void interrupt(struct pt_regs *regs, int nr)
> +  {
> +	arch_interrupt_enter(regs);
> +	state = irqentry_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	irq_enter_rcu();
> +	invoke_irq_handler(regs, nr);
> +	irq_exit_rcu();
> +
> +	instrumentation_end();
> +
> +	irqentry_exit(regs, state);
> +  }
> +
> +Note, that the invocation of the actual interrupt handler is within a
> +irq_enter_rcu() and irq_exit_rcu() pair.

Nit:  You don't need the comma.  "Note that the invocation..."

> +irq_enter_rcu() updates the preemption count which makes in_hardirq()
> +return true, handles NOHZ tick state and interrupt time accounting. This
> +means that up to the point where irq_enter_rcu() is invoked in_hardirq()
> +returns false.
> +
> +irq_exit_rcu() handles interrupt time accounting, undoes the preemption
> +count update and eventually handles soft interrupts and NOHZ tick state.
> +
> +The preemption count could be established in irqentry_enter() already, but
> +there is no real value to do so. This allows the preemption count to be
> +traced and just puts a restriction on the early entry code up to
> +irq_enter_rcu().

It took me awhile to work out that the "This" refers to in "This
allows".  So I suggest the following for that last paragraph:

In theory, the preemption count could be updated in irqentry_enter().
In practice, deferring this update to irq_enter_rcu() allows the
preemption-count code to be traced, while also maintaining symmetry
with irq_exit_rcu() and irqentry_exit(), which are described in the
next paragraph.  The only downside is that the early entry code up to
irq_enter_rcu() must be aware that the preemption count has not yet been
updated with the HARDIRQ_OFFSET state.

> +This also keeps the handling vs. irq_exit_rcu() symmetric and
> +irq_exit_rcu() must undo the preempt count elevation before handling soft
> +interrupts and irqentry_exit() also requires that because it might
> +schedule.

I suggest reworking the preceding paragraph something like this:

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption
count before it handles soft interrupts, whose handlers must run in BH
context rather than irq-disabled context.  In addition, irqentry_exit()
might schedule, which also requires that HARDIRQ_OFFSET has been removed
from the preemption count.

> +NMI and NMI-like exceptions
> +---------------------------
> +
> +NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug
> +interrupts etc. can hit any context and have to be extra careful vs. the
> +state.

I suggest the following for the above paragraph, but at least please
translate "Maschine" the rest of the way to "machine".  ;-)

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

> +Debug exceptions can handle user space breakpoints or watchpoints in the
> +same way as an interrupt which was raised while executing in user space,
> +but kernel mode debug exceptions have to be treated like NMIs as they can
> +even happen in NMI context, e.g. due to code patching.
> +
> +Also Machine check exceptions can handle user mode exceptions like regular
> +interrupts, but for kernel mode exceptions they have to be treated like
> +NMIs.

If I understand correctly, for the purpose of state transitions, debug
exceptions work the same way as machine-check exceptions.  If so, I
suggest combining the above two paragraphs something like the following:

State changes for debug exceptions and machine-check exceptions depend
on whether these exceptions happened in userspace (breakpoints or
watchpoints) or in kernel mode (code patching).  From userspace, they
are treated like interrupts, while from kernel mode they are treated
like NMIs.

> +NMIs and the other NMI-like exceptions handle state transitions in the most
> +straight forward way and do not differentiate between user and kernel mode
> +origin.

I suggest something like the following for the above paragraph:

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

> +The state update on entry is handled in irqentry_nmi_enter() which updates
> +state in the following order:
> +
> +  * Preemption counter
> +  * Lockdep
> +  * RCU

Here too, I suggest replacing that "RCU" with this:

  * RCU / Context tracking

> +  * Tracing
> +
> +The exit counterpart irqentry_nmi_exit() does the reverse operation in the
> +reverse order.
> +
> +Note, that the update of the preemption counter has to be the first
> +operation on enter and the last operation on exit. The reason is that both
> +lockdep and RCU rely on in_nmi() returning true in this case. The
> +preemption count modification in the NMI entry/exit case must not be
> +traced.

Please remove the comma following "Note".

> +Architecture specific code looks like this:
> +
> +.. code-block:: c
> +
> +  noinstr void nmi(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +	state = irqentry_nmi_enter(regs);
> +
> +	instrumentation_begin();
> +	nmi_handler(regs);
> +	instrumentation_end();
> +
> +	irqentry_nmi_exit(regs);
> +  }
> +
> +and for e.g. a debug exception it can look like this:
> +
> +.. code-block:: c
> +
> +  noinstr void debug(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +
> +	debug_regs = save_debug_regs();
> +
> +	if (user_mode(regs)) {
> +		state = irqentry_enter(regs);
> +
> +		instrumentation_begin();
> +		user_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_exit(regs, state);
> +  	} else {
> +  		state = irqentry_nmi_enter(regs);
> +
> +		instrumentation_begin();
> +		kernel_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_nmi_exit(regs, state);
> +	}
> +  }
> +
> +There is no combined irqentry_nmi_if_kernel() function available as the
> +above cannot be handled in an exception agnostic way.
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -44,6 +44,14 @@ Library functionality that is used throu
>     timekeeping
>     errseq
>  
> +Low level entry and exit
> +========================
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   entry
> +
>  Concurrency primitives
>  ======================
>  

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-03 20:08                   ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2021-12-03 20:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mark Rutland, Steven Rostedt, Nicolas Saenz Julienne,
	linux-kernel, linux-arm-kernel, rcu, Peter Zijlstra, mtosatti,
	frederic, Jonathan Corbet

On Wed, Dec 01, 2021 at 09:35:37PM +0100, Thomas Gleixner wrote:
> The entry/exit handling for exceptions, interrupts, syscalls and KVM is
> not really documented except for some comments.
> 
> Fill the gaps.
> 
> Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Good stuff, thank you very much!  I hope that this documentation helps
people bringing up new hardware, and from a more selfish viewpoint, I
especially hope that it reduces the number of "RCU is broken!" complaints
from them.  ;-)

Some native-English-speaker wordsmithing below.  Anything that might be
important from a technical viewpoint is marked "!!!".

							Thanx, Paul

> ---
> V2: Be more precise, fix formatting and typos - Mark 
> ---
>  Documentation/core-api/entry.rst |  274 +++++++++++++++++++++++++++++++++++++++
>  Documentation/core-api/index.rst |    8 +
>  2 files changed, 282 insertions(+)
> 
> --- /dev/null
> +++ b/Documentation/core-api/entry.rst
> @@ -0,0 +1,274 @@
> +Entry/exit handling for exceptions, interrupts, syscalls and KVM
> +================================================================
> +
> +For any transition from one execution domain into another the kernel
> +requires update of various states. The state updates have strict rules
> +versus ordering.
> +
> +
> +The states which need to be updated are:

I suggest reworking the above two paragraphs together something like
the following:

All transitions between execution domains require state updates which
are subject to strict ordering constraints.  State updates are required
for the following:

> +  * Lockdep
> +  * RCU

!!! As you do below, I suggest replacing "RCU" with:

  * RCU / Context tracking

With some luck, RCU's dyntick-idle tracking will be pulled into the
context-tracking code.  Doing this could allow a pair of atomic RMW
operations to be combined into one, at least if appropriate adjustments
can be applied.

> +  * Preemption counter
> +  * Tracing
> +  * Time accounting
> +
> +The update order depends on the transition type and is explained below in
> +the transition type sections.
@@@
> +
> +Non-instrumentable code - noinstr
> +---------------------------------
> +
> +Low level transition code cannot be instrumented before RCU is watching and
> +after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
> +instrumentation facilities depend on RCU.
> +
> +Aside of that many architectures have to save register state, e.g. debug or
> +cause registers before another exception of the same type can happen. A
> +breakpoint in the breakpoint entry code would overwrite the debug registers
> +of the inital breakpoint.

I suggest combining the above two paragraphs something like the following:

Most instrumentation facilities depend on RCU, so intrumentation is
prohibited for entry code before RCU starts watching and exit code
after RCU stops watching.  In addition, many architectures must save
and restore register state, which means that (for example) a breakpoint
in the breakpoint entry code would overwrite the debug registers of the
initial breakpoint.

> +Such code has to be marked with the 'noinstr' attribute. That places the
> +code into a special section which is taboo for instrumentation and debug
> +facilities.
> +
> +In a function which is marked 'noinstr' it's only allowed to call into
> +non-instrumentable code except when the invocation of instrumentable code
> +is annotated with a instrumentation_begin()/instrumentation_end() pair:

I suggest combining the above two paragraphs something like the following:

Such code must be marked with the 'noinstr' attribute, placing that
code into a special section inaccessible to instrumentation and debug
facilities.  Some functions are partially instrumentable, which is
handled by marking them nointr and using instrumentation_begin() and
instrumentation_end() to flag the instrumentable ranges of code:

> +.. code-block:: c
> +
> +  noinstr void entry(void)
> +  {
> +  	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
> +	...
> +
> +	instrumentation_begin();
> +	handle_context();   // <-- instrumentable code
> +	instrumentation_end();
> +
> +	...
> +	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
> +  }
> +
> +This allows verification of the 'noinstr' restrictions via objtool on
> +supported architectures.
> +
> +Invoking non-instrumentable functions from instrumentable context has no
> +restrictions and is useful to protect e.g. state switching which would
> +cause malfunction if instrumented.
> +
> +All non-instrumentable entry/exit code sections before and after the RCU
> +state transitions must run with interrupts disabled.
> +
> +Syscalls
> +--------
> +
> +Syscall entry code starts in low level architecture specific assembly code
> +and calls out into C-code after establishing low level architecture
> +specific state and stack frames. This low level code must not be
> +instrumented. A typical syscall handling function invoked from low level
> +assembly code looks like this:

(Nits, but whatever.  I am (perhaps misguidedly) asserting that assembly
code is by definition low-level and architecture-specific.  That lets
"low-level" tag the C code, helping the reader make the connection
between the two mentions of C code.  The above paragraph then looks
something like this.)

Syscall-entry code starts in assembly code and calls out into low-level
C code after establishing low-level architecture-specific state and stack
frames. This low-level C code must not be instrumented. A typical syscall
handling function invoked from low-level assembly code looks like this:

> +.. code-block:: c
> +
> +  noinstr void syscall(struct pt_regs *regs, int nr)
> +  {
> +	arch_syscall_enter(regs);
> +	nr = syscall_enter_from_user_mode(regs, nr);
> +
> +	instrumentation_begin();
> +	if (!invoke_syscall(regs, nr) && nr != -1)
> +	 	result_reg(regs) = __sys_ni_syscall(regs);
> +	instrumentation_end();
> +
> +	syscall_exit_to_user_mode(regs);
> +  }
> +
> +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
> +establishes state in the following order:
> +
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Tracing
> +
> +and then invokes the various entry work functions like ptrace, seccomp,
> +audit, syscall tracing etc. After the function returns instrumentable code
> +can be invoked. After returning from the syscall handler the instrumentable
> +code section ends and syscall_exit_to_user_mode() is invoked.

Not at all a big deal, but I suggest reworking the above paragraphs as
follows to avoid the ambiguous "the function" in the second sentence:

and then invokes the various entry work functions like ptrace, seccomp,
audit, syscall tracing, etc. After all that is done, the instrumentable
invoke_syscall function can be invoked. The instrumentable code section
then ends, after which syscall_exit_to_user_mode() is invoked.

> +syscall_exit_to_user_mode() handles all work which needs to be done before
> +returning to user space like tracing, audit, signals, task work etc. After
> +that it invokes exit_to_user_mode() which again handles the state
> +transition in the reverse order:
> +
> +  * Tracing
> +  * RCU / Context tracking
> +  * Lockdep
> +
> +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> +available as fine grained subfunctions in cases where the architecture code
> +has to do extra work between the various steps. In such cases it has to
> +ensure that enter_from_user_mode() is called first on entry and
> +exit_to_user_mode() is called last on exit.

!!! Here I have a question.  Can calls to enter_from_user_mode()
be nested?  RCU is OK with this, but I am not so sure that everything
else is.  If nesting is prohibited, this paragraph should explicitly
say that.  If nesting is theoretically possible, but should be avoided,
it would be good to say that as well.  (Otherwise "It looks like it
might work, so let's go for it!")

> +KVM
> +---
> +
> +Entering or exiting guest mode is very similar to syscalls. From the host
> +kernel point of view the CPU goes off into user space when entering the
> +guest and returns to the kernel on exit.
> +
> +kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
> +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
> +The state operations have the same ordering.
> +
> +Task work handling is done separately for guest at the boundary of the
> +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
> +the work handled on return to user space.
> +
> +
> +Interrupts and regular exceptions
> +---------------------------------
> +
> +Interrupts entry and exit handling is slightly more complex than syscalls
> +and KVM transitions.
> +
> +If an interrupt is raised while the CPU executes in user space, the entry
> +and exit handling is exactly the same as for syscalls.
> +
> +If the interrupt is raised while the CPU executes in kernel space the entry
> +and exit handling is slightly different. RCU state is only updated when the
> +interrupt was raised in context of the CPU's idle task because that's the
> +only kernel context where RCU can be not watching on NOHZ enabled kernels.
> +Lockdep and tracing have to be updated unconditionally.

!!! You lost me on this one.  Does that second-to-last sentence instead
want to end something like this?  "... where RCU will not be watching
when running on non-nohz_full CPUs."

> +irqentry_enter() and irqentry_exit() provide the implementation for this.
> +
> +The architecture specific part looks similar to syscall handling:
> +
> +.. code-block:: c
> +
> +  noinstr void interrupt(struct pt_regs *regs, int nr)
> +  {
> +	arch_interrupt_enter(regs);
> +	state = irqentry_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	irq_enter_rcu();
> +	invoke_irq_handler(regs, nr);
> +	irq_exit_rcu();
> +
> +	instrumentation_end();
> +
> +	irqentry_exit(regs, state);
> +  }
> +
> +Note, that the invocation of the actual interrupt handler is within a
> +irq_enter_rcu() and irq_exit_rcu() pair.

Nit:  You don't need the comma.  "Note that the invocation..."

> +irq_enter_rcu() updates the preemption count which makes in_hardirq()
> +return true, handles NOHZ tick state and interrupt time accounting. This
> +means that up to the point where irq_enter_rcu() is invoked in_hardirq()
> +returns false.
> +
> +irq_exit_rcu() handles interrupt time accounting, undoes the preemption
> +count update and eventually handles soft interrupts and NOHZ tick state.
> +
> +The preemption count could be established in irqentry_enter() already, but
> +there is no real value to do so. This allows the preemption count to be
> +traced and just puts a restriction on the early entry code up to
> +irq_enter_rcu().

It took me awhile to work out that the "This" refers to in "This
allows".  So I suggest the following for that last paragraph:

In theory, the preemption count could be updated in irqentry_enter().
In practice, deferring this update to irq_enter_rcu() allows the
preemption-count code to be traced, while also maintaining symmetry
with irq_exit_rcu() and irqentry_exit(), which are described in the
next paragraph.  The only downside is that the early entry code up to
irq_enter_rcu() must be aware that the preemption count has not yet been
updated with the HARDIRQ_OFFSET state.

> +This also keeps the handling vs. irq_exit_rcu() symmetric and
> +irq_exit_rcu() must undo the preempt count elevation before handling soft
> +interrupts and irqentry_exit() also requires that because it might
> +schedule.

I suggest reworking the preceding paragraph something like this:

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption
count before it handles soft interrupts, whose handlers must run in BH
context rather than irq-disabled context.  In addition, irqentry_exit()
might schedule, which also requires that HARDIRQ_OFFSET has been removed
from the preemption count.

> +NMI and NMI-like exceptions
> +---------------------------
> +
> +NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug
> +interrupts etc. can hit any context and have to be extra careful vs. the
> +state.

I suggest the following for the above paragraph, but at least please
translate "Maschine" the rest of the way to "machine".  ;-)

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

> +Debug exceptions can handle user space breakpoints or watchpoints in the
> +same way as an interrupt which was raised while executing in user space,
> +but kernel mode debug exceptions have to be treated like NMIs as they can
> +even happen in NMI context, e.g. due to code patching.
> +
> +Also Machine check exceptions can handle user mode exceptions like regular
> +interrupts, but for kernel mode exceptions they have to be treated like
> +NMIs.

If I understand correctly, for the purpose of state transitions, debug
exceptions work the same way as machine-check exceptions.  If so, I
suggest combining the above two paragraphs something like the following:

State changes for debug exceptions and machine-check exceptions depend
on whether these exceptions happened in userspace (breakpoints or
watchpoints) or in kernel mode (code patching).  From userspace, they
are treated like interrupts, while from kernel mode they are treated
like NMIs.

> +NMIs and the other NMI-like exceptions handle state transitions in the most
> +straight forward way and do not differentiate between user and kernel mode
> +origin.

I suggest something like the following for the above paragraph:

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

> +The state update on entry is handled in irqentry_nmi_enter() which updates
> +state in the following order:
> +
> +  * Preemption counter
> +  * Lockdep
> +  * RCU

Here too, I suggest replacing that "RCU" with this:

  * RCU / Context tracking

> +  * Tracing
> +
> +The exit counterpart irqentry_nmi_exit() does the reverse operation in the
> +reverse order.
> +
> +Note, that the update of the preemption counter has to be the first
> +operation on enter and the last operation on exit. The reason is that both
> +lockdep and RCU rely on in_nmi() returning true in this case. The
> +preemption count modification in the NMI entry/exit case must not be
> +traced.

Please remove the comma following "Note".

> +Architecture specific code looks like this:
> +
> +.. code-block:: c
> +
> +  noinstr void nmi(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +	state = irqentry_nmi_enter(regs);
> +
> +	instrumentation_begin();
> +	nmi_handler(regs);
> +	instrumentation_end();
> +
> +	irqentry_nmi_exit(regs);
> +  }
> +
> +and for e.g. a debug exception it can look like this:
> +
> +.. code-block:: c
> +
> +  noinstr void debug(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +
> +	debug_regs = save_debug_regs();
> +
> +	if (user_mode(regs)) {
> +		state = irqentry_enter(regs);
> +
> +		instrumentation_begin();
> +		user_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_exit(regs, state);
> +  	} else {
> +  		state = irqentry_nmi_enter(regs);
> +
> +		instrumentation_begin();
> +		kernel_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_nmi_exit(regs, state);
> +	}
> +  }
> +
> +There is no combined irqentry_nmi_if_kernel() function available as the
> +above cannot be handled in an exception agnostic way.
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -44,6 +44,14 @@ Library functionality that is used throu
>     timekeeping
>     errseq
>  
> +Low level entry and exit
> +========================
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   entry
> +
>  Concurrency primitives
>  ======================
>  

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-01 20:35                 ` Thomas Gleixner
@ 2021-12-04  3:48                   ` Randy Dunlap
  -1 siblings, 0 replies; 42+ messages in thread
From: Randy Dunlap @ 2021-12-04  3:48 UTC (permalink / raw)
  To: Thomas Gleixner, Mark Rutland
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

Hi Thomas,

Just some editing. I read Paul's comments but I am mostly ignoring them
for this review.


On 12/1/21 12:35, Thomas Gleixner wrote:
> The entry/exit handling for exceptions, interrupts, syscalls and KVM is
> not really documented except for some comments.
> 
> Fill the gaps.
> 
> Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V2: Be more precise, fix formatting and typos - Mark 
> ---
>  Documentation/core-api/entry.rst |  274 +++++++++++++++++++++++++++++++++++++++
>  Documentation/core-api/index.rst |    8 +
>  2 files changed, 282 insertions(+)
> 
> --- /dev/null
> +++ b/Documentation/core-api/entry.rst
> @@ -0,0 +1,274 @@
> +Entry/exit handling for exceptions, interrupts, syscalls and KVM
> +================================================================
> +
> +For any transition from one execution domain into another the kernel
> +requires update of various states. The state updates have strict rules
> +versus ordering.
> +
> +The states which need to be updated are:
> +
> +  * Lockdep
> +  * RCU
> +  * Preemption counter
> +  * Tracing
> +  * Time accounting
> +
> +The update order depends on the transition type and is explained below in
> +the transition type sections.
> +
> +Non-instrumentable code - noinstr
> +---------------------------------
> +
> +Low level transition code cannot be instrumented before RCU is watching and

   Low-level

> +after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most

                         non-watching

> +instrumentation facilities depend on RCU.
> +
> +Aside of that many architectures have to save register state, e.g. debug or

                                                          state (e.g. debug) or

> +cause registers before another exception of the same type can happen. A

   ^^^^^ cannot parse (with or without the change to the previous line)

> +breakpoint in the breakpoint entry code would overwrite the debug registers
> +of the inital breakpoint.

          initial

> +
> +Such code has to be marked with the 'noinstr' attribute. That places the
> +code into a special section which is taboo for instrumentation and debug
> +facilities.
> +
> +In a function which is marked 'noinstr' it's only allowed to call into
> +non-instrumentable code except when the invocation of instrumentable code
> +is annotated with a instrumentation_begin()/instrumentation_end() pair:
> +
> +.. code-block:: c
> +
> +  noinstr void entry(void)
> +  {
> +  	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
> +	...
> +
> +	instrumentation_begin();
> +	handle_context();   // <-- instrumentable code
> +	instrumentation_end();
> +
> +	...
> +	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
> +  }
> +
> +This allows verification of the 'noinstr' restrictions via objtool on
> +supported architectures.
> +
> +Invoking non-instrumentable functions from instrumentable context has no
> +restrictions and is useful to protect e.g. state switching which would
> +cause malfunction if instrumented.
> +
> +All non-instrumentable entry/exit code sections before and after the RCU
> +state transitions must run with interrupts disabled.
> +
> +Syscalls
> +--------
> +
> +Syscall entry code starts in low level architecture specific assembly code

                                low-level architecture-specific

> +and calls out into C-code after establishing low level architecture

                                                low-level architecture-

> +specific state and stack frames. This low level code must not be

                                         low-level

> +instrumented. A typical syscall handling function invoked from low level

                                                                  low-level

> +assembly code looks like this:
> +
> +.. code-block:: c
> +
> +  noinstr void syscall(struct pt_regs *regs, int nr)
> +  {
> +	arch_syscall_enter(regs);
> +	nr = syscall_enter_from_user_mode(regs, nr);
> +
> +	instrumentation_begin();
> +	if (!invoke_syscall(regs, nr) && nr != -1)
> +	 	result_reg(regs) = __sys_ni_syscall(regs);
> +	instrumentation_end();
> +
> +	syscall_exit_to_user_mode(regs);
> +  }
> +
> +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
> +establishes state in the following order:
> +
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Tracing
> +
> +and then invokes the various entry work functions like ptrace, seccomp,
> +audit, syscall tracing etc. After the function returns instrumentable code
> +can be invoked. After returning from the syscall handler the instrumentable
> +code section ends and syscall_exit_to_user_mode() is invoked.
> +
> +syscall_exit_to_user_mode() handles all work which needs to be done before
> +returning to user space like tracing, audit, signals, task work etc. After
> +that it invokes exit_to_user_mode() which again handles the state
> +transition in the reverse order:
> +
> +  * Tracing
> +  * RCU / Context tracking
> +  * Lockdep
> +
> +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> +available as fine grained subfunctions in cases where the architecture code
> +has to do extra work between the various steps. In such cases it has to
> +ensure that enter_from_user_mode() is called first on entry and
> +exit_to_user_mode() is called last on exit.
> +
> +
> +KVM
> +---
> +
> +Entering or exiting guest mode is very similar to syscalls. From the host
> +kernel point of view the CPU goes off into user space when entering the
> +guest and returns to the kernel on exit.
> +
> +kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()

                                 KVM-specific

> +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
> +The state operations have the same ordering.
> +
> +Task work handling is done separately for guest at the boundary of the
> +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
> +the work handled on return to user space.
> +
> +
> +Interrupts and regular exceptions
> +---------------------------------
> +
> +Interrupts entry and exit handling is slightly more complex than syscalls
> +and KVM transitions.
> +
> +If an interrupt is raised while the CPU executes in user space, the entry
> +and exit handling is exactly the same as for syscalls.
> +
> +If the interrupt is raised while the CPU executes in kernel space the entry
> +and exit handling is slightly different. RCU state is only updated when the
> +interrupt was raised in context of the CPU's idle task because that's the
> +only kernel context where RCU can be not watching on NOHZ enabled kernels.
> +Lockdep and tracing have to be updated unconditionally.
> +
> +irqentry_enter() and irqentry_exit() provide the implementation for this.
> +
> +The architecture specific part looks similar to syscall handling:

       architecture-specific

> +
> +.. code-block:: c
> +
> +  noinstr void interrupt(struct pt_regs *regs, int nr)
> +  {
> +	arch_interrupt_enter(regs);
> +	state = irqentry_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	irq_enter_rcu();
> +	invoke_irq_handler(regs, nr);
> +	irq_exit_rcu();
> +
> +	instrumentation_end();
> +
> +	irqentry_exit(regs, state);
> +  }
> +
> +Note, that the invocation of the actual interrupt handler is within a

   Note that
(as Paul said)

> +irq_enter_rcu() and irq_exit_rcu() pair.
> +
> +irq_enter_rcu() updates the preemption count which makes in_hardirq()
> +return true, handles NOHZ tick state and interrupt time accounting. This
> +means that up to the point where irq_enter_rcu() is invoked in_hardirq()
> +returns false.
> +
> +irq_exit_rcu() handles interrupt time accounting, undoes the preemption
> +count update and eventually handles soft interrupts and NOHZ tick state.
> +
> +The preemption count could be established in irqentry_enter() already, but
> +there is no real value to do so. This allows the preemption count to be
> +traced and just puts a restriction on the early entry code up to
> +irq_enter_rcu().
> +
> +This also keeps the handling vs. irq_exit_rcu() symmetric and
> +irq_exit_rcu() must undo the preempt count elevation before handling soft
> +interrupts and irqentry_exit() also requires that because it might
> +schedule.
> +
> +
> +NMI and NMI-like exceptions
> +---------------------------
> +
> +NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug

            NMI-like

> +interrupts etc. can hit any context and have to be extra careful vs. the
> +state.
> +
> +Debug exceptions can handle user space breakpoints or watchpoints in the
> +same way as an interrupt which was raised while executing in user space,
> +but kernel mode debug exceptions have to be treated like NMIs as they can
> +even happen in NMI context, e.g. due to code patching.
> +
> +Also Machine check exceptions can handle user mode exceptions like regular
> +interrupts, but for kernel mode exceptions they have to be treated like
> +NMIs.
> +
> +NMIs and the other NMI-like exceptions handle state transitions in the most
> +straight forward way and do not differentiate between user and kernel mode

   straightforward

> +origin.
> +
> +The state update on entry is handled in irqentry_nmi_enter() which updates
> +state in the following order:
> +
> +  * Preemption counter
> +  * Lockdep
> +  * RCU
> +  * Tracing
> +
> +The exit counterpart irqentry_nmi_exit() does the reverse operation in the
> +reverse order.
> +
> +Note, that the update of the preemption counter has to be the first

   Note that
(as Paul said)

> +operation on enter and the last operation on exit. The reason is that both
> +lockdep and RCU rely on in_nmi() returning true in this case. The
> +preemption count modification in the NMI entry/exit case must not be
> +traced.
> +
> +Architecture specific code looks like this:
> +
> +.. code-block:: c
> +
> +  noinstr void nmi(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +	state = irqentry_nmi_enter(regs);
> +
> +	instrumentation_begin();
> +	nmi_handler(regs);
> +	instrumentation_end();
> +
> +	irqentry_nmi_exit(regs);
> +  }
> +
> +and for e.g. a debug exception it can look like this:
> +
> +.. code-block:: c
> +
> +  noinstr void debug(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +
> +	debug_regs = save_debug_regs();
> +
> +	if (user_mode(regs)) {
> +		state = irqentry_enter(regs);
> +
> +		instrumentation_begin();
> +		user_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_exit(regs, state);
> +  	} else {
> +  		state = irqentry_nmi_enter(regs);
> +
> +		instrumentation_begin();
> +		kernel_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_nmi_exit(regs, state);
> +	}
> +  }
> +
> +There is no combined irqentry_nmi_if_kernel() function available as the
> +above cannot be handled in an exception agnostic way.

                                 exception-agnostic


-- 
~Randy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-04  3:48                   ` Randy Dunlap
  0 siblings, 0 replies; 42+ messages in thread
From: Randy Dunlap @ 2021-12-04  3:48 UTC (permalink / raw)
  To: Thomas Gleixner, Mark Rutland
  Cc: Steven Rostedt, Nicolas Saenz Julienne, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, paulmck, mtosatti,
	frederic, Jonathan Corbet

Hi Thomas,

Just some editing. I read Paul's comments but I am mostly ignoring them
for this review.


On 12/1/21 12:35, Thomas Gleixner wrote:
> The entry/exit handling for exceptions, interrupts, syscalls and KVM is
> not really documented except for some comments.
> 
> Fill the gaps.
> 
> Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V2: Be more precise, fix formatting and typos - Mark 
> ---
>  Documentation/core-api/entry.rst |  274 +++++++++++++++++++++++++++++++++++++++
>  Documentation/core-api/index.rst |    8 +
>  2 files changed, 282 insertions(+)
> 
> --- /dev/null
> +++ b/Documentation/core-api/entry.rst
> @@ -0,0 +1,274 @@
> +Entry/exit handling for exceptions, interrupts, syscalls and KVM
> +================================================================
> +
> +For any transition from one execution domain into another the kernel
> +requires update of various states. The state updates have strict rules
> +versus ordering.
> +
> +The states which need to be updated are:
> +
> +  * Lockdep
> +  * RCU
> +  * Preemption counter
> +  * Tracing
> +  * Time accounting
> +
> +The update order depends on the transition type and is explained below in
> +the transition type sections.
> +
> +Non-instrumentable code - noinstr
> +---------------------------------
> +
> +Low level transition code cannot be instrumented before RCU is watching and

   Low-level

> +after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most

                         non-watching

> +instrumentation facilities depend on RCU.
> +
> +Aside of that many architectures have to save register state, e.g. debug or

                                                          state (e.g. debug) or

> +cause registers before another exception of the same type can happen. A

   ^^^^^ cannot parse (with or without the change to the previous line)

> +breakpoint in the breakpoint entry code would overwrite the debug registers
> +of the inital breakpoint.

          initial

> +
> +Such code has to be marked with the 'noinstr' attribute. That places the
> +code into a special section which is taboo for instrumentation and debug
> +facilities.
> +
> +In a function which is marked 'noinstr' it's only allowed to call into
> +non-instrumentable code except when the invocation of instrumentable code
> +is annotated with a instrumentation_begin()/instrumentation_end() pair:
> +
> +.. code-block:: c
> +
> +  noinstr void entry(void)
> +  {
> +  	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
> +	...
> +
> +	instrumentation_begin();
> +	handle_context();   // <-- instrumentable code
> +	instrumentation_end();
> +
> +	...
> +	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
> +  }
> +
> +This allows verification of the 'noinstr' restrictions via objtool on
> +supported architectures.
> +
> +Invoking non-instrumentable functions from instrumentable context has no
> +restrictions and is useful to protect e.g. state switching which would
> +cause malfunction if instrumented.
> +
> +All non-instrumentable entry/exit code sections before and after the RCU
> +state transitions must run with interrupts disabled.
> +
> +Syscalls
> +--------
> +
> +Syscall entry code starts in low level architecture specific assembly code

                                low-level architecture-specific

> +and calls out into C-code after establishing low level architecture

                                                low-level architecture-

> +specific state and stack frames. This low level code must not be

                                         low-level

> +instrumented. A typical syscall handling function invoked from low level

                                                                  low-level

> +assembly code looks like this:
> +
> +.. code-block:: c
> +
> +  noinstr void syscall(struct pt_regs *regs, int nr)
> +  {
> +	arch_syscall_enter(regs);
> +	nr = syscall_enter_from_user_mode(regs, nr);
> +
> +	instrumentation_begin();
> +	if (!invoke_syscall(regs, nr) && nr != -1)
> +	 	result_reg(regs) = __sys_ni_syscall(regs);
> +	instrumentation_end();
> +
> +	syscall_exit_to_user_mode(regs);
> +  }
> +
> +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
> +establishes state in the following order:
> +
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Tracing
> +
> +and then invokes the various entry work functions like ptrace, seccomp,
> +audit, syscall tracing etc. After the function returns instrumentable code
> +can be invoked. After returning from the syscall handler the instrumentable
> +code section ends and syscall_exit_to_user_mode() is invoked.
> +
> +syscall_exit_to_user_mode() handles all work which needs to be done before
> +returning to user space like tracing, audit, signals, task work etc. After
> +that it invokes exit_to_user_mode() which again handles the state
> +transition in the reverse order:
> +
> +  * Tracing
> +  * RCU / Context tracking
> +  * Lockdep
> +
> +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> +available as fine grained subfunctions in cases where the architecture code
> +has to do extra work between the various steps. In such cases it has to
> +ensure that enter_from_user_mode() is called first on entry and
> +exit_to_user_mode() is called last on exit.
> +
> +
> +KVM
> +---
> +
> +Entering or exiting guest mode is very similar to syscalls. From the host
> +kernel point of view the CPU goes off into user space when entering the
> +guest and returns to the kernel on exit.
> +
> +kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()

                                 KVM-specific

> +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
> +The state operations have the same ordering.
> +
> +Task work handling is done separately for guest at the boundary of the
> +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
> +the work handled on return to user space.
> +
> +
> +Interrupts and regular exceptions
> +---------------------------------
> +
> +Interrupts entry and exit handling is slightly more complex than syscalls
> +and KVM transitions.
> +
> +If an interrupt is raised while the CPU executes in user space, the entry
> +and exit handling is exactly the same as for syscalls.
> +
> +If the interrupt is raised while the CPU executes in kernel space the entry
> +and exit handling is slightly different. RCU state is only updated when the
> +interrupt was raised in context of the CPU's idle task because that's the
> +only kernel context where RCU can be not watching on NOHZ enabled kernels.
> +Lockdep and tracing have to be updated unconditionally.
> +
> +irqentry_enter() and irqentry_exit() provide the implementation for this.
> +
> +The architecture specific part looks similar to syscall handling:

       architecture-specific

> +
> +.. code-block:: c
> +
> +  noinstr void interrupt(struct pt_regs *regs, int nr)
> +  {
> +	arch_interrupt_enter(regs);
> +	state = irqentry_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	irq_enter_rcu();
> +	invoke_irq_handler(regs, nr);
> +	irq_exit_rcu();
> +
> +	instrumentation_end();
> +
> +	irqentry_exit(regs, state);
> +  }
> +
> +Note, that the invocation of the actual interrupt handler is within a

   Note that
(as Paul said)

> +irq_enter_rcu() and irq_exit_rcu() pair.
> +
> +irq_enter_rcu() updates the preemption count which makes in_hardirq()
> +return true, handles NOHZ tick state and interrupt time accounting. This
> +means that up to the point where irq_enter_rcu() is invoked in_hardirq()
> +returns false.
> +
> +irq_exit_rcu() handles interrupt time accounting, undoes the preemption
> +count update and eventually handles soft interrupts and NOHZ tick state.
> +
> +The preemption count could be established in irqentry_enter() already, but
> +there is no real value to do so. This allows the preemption count to be
> +traced and just puts a restriction on the early entry code up to
> +irq_enter_rcu().
> +
> +This also keeps the handling vs. irq_exit_rcu() symmetric and
> +irq_exit_rcu() must undo the preempt count elevation before handling soft
> +interrupts and irqentry_exit() also requires that because it might
> +schedule.
> +
> +
> +NMI and NMI-like exceptions
> +---------------------------
> +
> +NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug

            NMI-like

> +interrupts etc. can hit any context and have to be extra careful vs. the
> +state.
> +
> +Debug exceptions can handle user space breakpoints or watchpoints in the
> +same way as an interrupt which was raised while executing in user space,
> +but kernel mode debug exceptions have to be treated like NMIs as they can
> +even happen in NMI context, e.g. due to code patching.
> +
> +Also Machine check exceptions can handle user mode exceptions like regular
> +interrupts, but for kernel mode exceptions they have to be treated like
> +NMIs.
> +
> +NMIs and the other NMI-like exceptions handle state transitions in the most
> +straight forward way and do not differentiate between user and kernel mode

   straightforward

> +origin.
> +
> +The state update on entry is handled in irqentry_nmi_enter() which updates
> +state in the following order:
> +
> +  * Preemption counter
> +  * Lockdep
> +  * RCU
> +  * Tracing
> +
> +The exit counterpart irqentry_nmi_exit() does the reverse operation in the
> +reverse order.
> +
> +Note, that the update of the preemption counter has to be the first

   Note that
(as Paul said)

> +operation on enter and the last operation on exit. The reason is that both
> +lockdep and RCU rely on in_nmi() returning true in this case. The
> +preemption count modification in the NMI entry/exit case must not be
> +traced.
> +
> +Architecture specific code looks like this:
> +
> +.. code-block:: c
> +
> +  noinstr void nmi(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +	state = irqentry_nmi_enter(regs);
> +
> +	instrumentation_begin();
> +	nmi_handler(regs);
> +	instrumentation_end();
> +
> +	irqentry_nmi_exit(regs);
> +  }
> +
> +and for e.g. a debug exception it can look like this:
> +
> +.. code-block:: c
> +
> +  noinstr void debug(struct pt_regs *regs)
> +  {
> +	arch_nmi_enter(regs);
> +
> +	debug_regs = save_debug_regs();
> +
> +	if (user_mode(regs)) {
> +		state = irqentry_enter(regs);
> +
> +		instrumentation_begin();
> +		user_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_exit(regs, state);
> +  	} else {
> +  		state = irqentry_nmi_enter(regs);
> +
> +		instrumentation_begin();
> +		kernel_mode_debug_handler(regs, debug_regs);
> +		instrumentation_end();
> +
> +		irqentry_nmi_exit(regs, state);
> +	}
> +  }
> +
> +There is no combined irqentry_nmi_if_kernel() function available as the
> +above cannot be handled in an exception agnostic way.

                                 exception-agnostic


-- 
~Randy

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-04  3:48                   ` Randy Dunlap
@ 2021-12-06 17:36                     ` Mark Rutland
  -1 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2021-12-06 17:36 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Thomas Gleixner, Steven Rostedt, Nicolas Saenz Julienne,
	linux-kernel, linux-arm-kernel, rcu, Peter Zijlstra, paulmck,
	mtosatti, frederic, Jonathan Corbet

On Fri, Dec 03, 2021 at 07:48:08PM -0800, Randy Dunlap wrote:
> On 12/1/21 12:35, Thomas Gleixner wrote:
> > +Aside of that many architectures have to save register state, e.g. debug or
> 
>                                                           state (e.g. debug) or
> 
> > +cause registers before another exception of the same type can happen. A
> 
>    ^^^^^ cannot parse (with or without the change to the previous line)

I think the difficulty here is with "cause register"? That' a register which
indicates the cause of an exception, e.g.

* MIPS has `cause` (coprocessor 0 register 13)
* arm64 / AArch64 has `ESR_ELx` (Exception Syndrome Register, ELx)

We could probably clarify this as "exception cause registers" or "exception
status registers", if that helps?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-06 17:36                     ` Mark Rutland
  0 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2021-12-06 17:36 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Thomas Gleixner, Steven Rostedt, Nicolas Saenz Julienne,
	linux-kernel, linux-arm-kernel, rcu, Peter Zijlstra, paulmck,
	mtosatti, frederic, Jonathan Corbet

On Fri, Dec 03, 2021 at 07:48:08PM -0800, Randy Dunlap wrote:
> On 12/1/21 12:35, Thomas Gleixner wrote:
> > +Aside of that many architectures have to save register state, e.g. debug or
> 
>                                                           state (e.g. debug) or
> 
> > +cause registers before another exception of the same type can happen. A
> 
>    ^^^^^ cannot parse (with or without the change to the previous line)

I think the difficulty here is with "cause register"? That' a register which
indicates the cause of an exception, e.g.

* MIPS has `cause` (coprocessor 0 register 13)
* arm64 / AArch64 has `ESR_ELx` (Exception Syndrome Register, ELx)

We could probably clarify this as "exception cause registers" or "exception
status registers", if that helps?

Thanks,
Mark.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-06 17:36                     ` Mark Rutland
@ 2021-12-06 17:53                       ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2021-12-06 17:53 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Randy Dunlap, Thomas Gleixner, Steven Rostedt,
	Nicolas Saenz Julienne, linux-kernel, linux-arm-kernel, rcu,
	Peter Zijlstra, mtosatti, frederic, Jonathan Corbet

On Mon, Dec 06, 2021 at 05:36:51PM +0000, Mark Rutland wrote:
> On Fri, Dec 03, 2021 at 07:48:08PM -0800, Randy Dunlap wrote:
> > On 12/1/21 12:35, Thomas Gleixner wrote:
> > > +Aside of that many architectures have to save register state, e.g. debug or
> > 
> >                                                           state (e.g. debug) or
> > 
> > > +cause registers before another exception of the same type can happen. A
> > 
> >    ^^^^^ cannot parse (with or without the change to the previous line)
> 
> I think the difficulty here is with "cause register"? That' a register which
> indicates the cause of an exception, e.g.
> 
> * MIPS has `cause` (coprocessor 0 register 13)
> * arm64 / AArch64 has `ESR_ELx` (Exception Syndrome Register, ELx)
> 
> We could probably clarify this as "exception cause registers" or "exception
> status registers", if that helps?

Or to make it word-by-word unambiguous, "exception-cause registers"
and "exception-status registers".

							Thanx, Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-06 17:53                       ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2021-12-06 17:53 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Randy Dunlap, Thomas Gleixner, Steven Rostedt,
	Nicolas Saenz Julienne, linux-kernel, linux-arm-kernel, rcu,
	Peter Zijlstra, mtosatti, frederic, Jonathan Corbet

On Mon, Dec 06, 2021 at 05:36:51PM +0000, Mark Rutland wrote:
> On Fri, Dec 03, 2021 at 07:48:08PM -0800, Randy Dunlap wrote:
> > On 12/1/21 12:35, Thomas Gleixner wrote:
> > > +Aside of that many architectures have to save register state, e.g. debug or
> > 
> >                                                           state (e.g. debug) or
> > 
> > > +cause registers before another exception of the same type can happen. A
> > 
> >    ^^^^^ cannot parse (with or without the change to the previous line)
> 
> I think the difficulty here is with "cause register"? That' a register which
> indicates the cause of an exception, e.g.
> 
> * MIPS has `cause` (coprocessor 0 register 13)
> * arm64 / AArch64 has `ESR_ELx` (Exception Syndrome Register, ELx)
> 
> We could probably clarify this as "exception cause registers" or "exception
> status registers", if that helps?

Or to make it word-by-word unambiguous, "exception-cause registers"
and "exception-status registers".

							Thanx, Paul

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-06 17:53                       ` Paul E. McKenney
@ 2021-12-06 21:24                         ` Randy Dunlap
  -1 siblings, 0 replies; 42+ messages in thread
From: Randy Dunlap @ 2021-12-06 21:24 UTC (permalink / raw)
  To: paulmck, Mark Rutland
  Cc: Thomas Gleixner, Steven Rostedt, Nicolas Saenz Julienne,
	linux-kernel, linux-arm-kernel, rcu, Peter Zijlstra, mtosatti,
	frederic, Jonathan Corbet



On 12/6/21 09:53, Paul E. McKenney wrote:
> On Mon, Dec 06, 2021 at 05:36:51PM +0000, Mark Rutland wrote:
>> On Fri, Dec 03, 2021 at 07:48:08PM -0800, Randy Dunlap wrote:
>>> On 12/1/21 12:35, Thomas Gleixner wrote:
>>>> +Aside of that many architectures have to save register state, e.g. debug or
>>>
>>>                                                           state (e.g. debug) or
>>>
>>>> +cause registers before another exception of the same type can happen. A
>>>
>>>    ^^^^^ cannot parse (with or without the change to the previous line)
>>
>> I think the difficulty here is with "cause register"? That' a register which
>> indicates the cause of an exception, e.g.

Oh. I see. Thanks.

>> * MIPS has `cause` (coprocessor 0 register 13)
>> * arm64 / AArch64 has `ESR_ELx` (Exception Syndrome Register, ELx)
>>
>> We could probably clarify this as "exception cause registers" or "exception
>> status registers", if that helps?
> 
> Or to make it word-by-word unambiguous, "exception-cause registers"
> and "exception-status registers".

Any of those works. Or even 'cause' registers.

-- 
~Randy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-06 21:24                         ` Randy Dunlap
  0 siblings, 0 replies; 42+ messages in thread
From: Randy Dunlap @ 2021-12-06 21:24 UTC (permalink / raw)
  To: paulmck, Mark Rutland
  Cc: Thomas Gleixner, Steven Rostedt, Nicolas Saenz Julienne,
	linux-kernel, linux-arm-kernel, rcu, Peter Zijlstra, mtosatti,
	frederic, Jonathan Corbet



On 12/6/21 09:53, Paul E. McKenney wrote:
> On Mon, Dec 06, 2021 at 05:36:51PM +0000, Mark Rutland wrote:
>> On Fri, Dec 03, 2021 at 07:48:08PM -0800, Randy Dunlap wrote:
>>> On 12/1/21 12:35, Thomas Gleixner wrote:
>>>> +Aside of that many architectures have to save register state, e.g. debug or
>>>
>>>                                                           state (e.g. debug) or
>>>
>>>> +cause registers before another exception of the same type can happen. A
>>>
>>>    ^^^^^ cannot parse (with or without the change to the previous line)
>>
>> I think the difficulty here is with "cause register"? That' a register which
>> indicates the cause of an exception, e.g.

Oh. I see. Thanks.

>> * MIPS has `cause` (coprocessor 0 register 13)
>> * arm64 / AArch64 has `ESR_ELx` (Exception Syndrome Register, ELx)
>>
>> We could probably clarify this as "exception cause registers" or "exception
>> status registers", if that helps?
> 
> Or to make it word-by-word unambiguous, "exception-cause registers"
> and "exception-status registers".

Any of those works. Or even 'cause' registers.

-- 
~Randy

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-06 21:24                         ` Randy Dunlap
@ 2021-12-06 21:36                           ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2021-12-06 21:36 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Mark Rutland, Thomas Gleixner, Steven Rostedt,
	Nicolas Saenz Julienne, linux-kernel, linux-arm-kernel, rcu,
	Peter Zijlstra, mtosatti, frederic, Jonathan Corbet

On Mon, Dec 06, 2021 at 01:24:03PM -0800, Randy Dunlap wrote:
> 
> 
> On 12/6/21 09:53, Paul E. McKenney wrote:
> > On Mon, Dec 06, 2021 at 05:36:51PM +0000, Mark Rutland wrote:
> >> On Fri, Dec 03, 2021 at 07:48:08PM -0800, Randy Dunlap wrote:
> >>> On 12/1/21 12:35, Thomas Gleixner wrote:
> >>>> +Aside of that many architectures have to save register state, e.g. debug or
> >>>
> >>>                                                           state (e.g. debug) or
> >>>
> >>>> +cause registers before another exception of the same type can happen. A
> >>>
> >>>    ^^^^^ cannot parse (with or without the change to the previous line)
> >>
> >> I think the difficulty here is with "cause register"? That' a register which
> >> indicates the cause of an exception, e.g.
> 
> Oh. I see. Thanks.
> 
> >> * MIPS has `cause` (coprocessor 0 register 13)
> >> * arm64 / AArch64 has `ESR_ELx` (Exception Syndrome Register, ELx)
> >>
> >> We could probably clarify this as "exception cause registers" or "exception
> >> status registers", if that helps?
> > 
> > Or to make it word-by-word unambiguous, "exception-cause registers"
> > and "exception-status registers".
> 
> Any of those works. Or even 'cause' registers.

Agreed, careful use of quotation marks can also do the trick.  That is the
"wonderful" thing about English, so very many ways to say the same thing...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-06 21:36                           ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2021-12-06 21:36 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Mark Rutland, Thomas Gleixner, Steven Rostedt,
	Nicolas Saenz Julienne, linux-kernel, linux-arm-kernel, rcu,
	Peter Zijlstra, mtosatti, frederic, Jonathan Corbet

On Mon, Dec 06, 2021 at 01:24:03PM -0800, Randy Dunlap wrote:
> 
> 
> On 12/6/21 09:53, Paul E. McKenney wrote:
> > On Mon, Dec 06, 2021 at 05:36:51PM +0000, Mark Rutland wrote:
> >> On Fri, Dec 03, 2021 at 07:48:08PM -0800, Randy Dunlap wrote:
> >>> On 12/1/21 12:35, Thomas Gleixner wrote:
> >>>> +Aside of that many architectures have to save register state, e.g. debug or
> >>>
> >>>                                                           state (e.g. debug) or
> >>>
> >>>> +cause registers before another exception of the same type can happen. A
> >>>
> >>>    ^^^^^ cannot parse (with or without the change to the previous line)
> >>
> >> I think the difficulty here is with "cause register"? That' a register which
> >> indicates the cause of an exception, e.g.
> 
> Oh. I see. Thanks.
> 
> >> * MIPS has `cause` (coprocessor 0 register 13)
> >> * arm64 / AArch64 has `ESR_ELx` (Exception Syndrome Register, ELx)
> >>
> >> We could probably clarify this as "exception cause registers" or "exception
> >> status registers", if that helps?
> > 
> > Or to make it word-by-word unambiguous, "exception-cause registers"
> > and "exception-status registers".
> 
> Any of those works. Or even 'cause' registers.

Agreed, careful use of quotation marks can also do the trick.  That is the
"wonderful" thing about English, so very many ways to say the same thing...

							Thanx, Paul

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-03 20:08                   ` Paul E. McKenney
@ 2021-12-13 10:36                     ` Nicolas Saenz Julienne
  -1 siblings, 0 replies; 42+ messages in thread
From: Nicolas Saenz Julienne @ 2021-12-13 10:36 UTC (permalink / raw)
  To: paulmck, Thomas Gleixner
  Cc: Mark Rutland, Steven Rostedt, linux-kernel, linux-arm-kernel,
	rcu, Peter Zijlstra, mtosatti, frederic, Jonathan Corbet

Hi All,
now that this is good shape I'm taking over Thomas and preparing v3.

Paul, I introduced most (if not all) your paragraph corrections. Some questions
below.

On Fri, 2021-12-03 at 12:08 -0800, Paul E. McKenney wrote:
> > +The update order depends on the transition type and is explained below in
> > +the transition type sections.
> @@@

Sorry, I'm not 100% sure I get what you meant by this. Maybe introducing some
sort of link?

[...]

> > +syscall_exit_to_user_mode() handles all work which needs to be done before
> > +returning to user space like tracing, audit, signals, task work etc. After
> > +that it invokes exit_to_user_mode() which again handles the state
> > +transition in the reverse order:
> > +
> > +  * Tracing
> > +  * RCU / Context tracking
> > +  * Lockdep
> > +
> > +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> > +available as fine grained subfunctions in cases where the architecture code
> > +has to do extra work between the various steps. In such cases it has to
> > +ensure that enter_from_user_mode() is called first on entry and
> > +exit_to_user_mode() is called last on exit.
> 
> !!! Here I have a question.  Can calls to enter_from_user_mode()
> be nested?  RCU is OK with this, but I am not so sure that everything
> else is.  If nesting is prohibited, this paragraph should explicitly
> say that.  If nesting is theoretically possible, but should be avoided,
> it would be good to say that as well.  (Otherwise "It looks like it
> might work, so let's go for it!")


In __enter_from_user_mode() I see:

	CT_WARN_ON(ct_state() != CONTEXT_USER);

IIUC this signals that a nested syscall entry isn't expected from CT's point of
view. I remember reading through RCU's dyntick code that the rationale for
nesting in the syscall path was half interrupts (or upcalls). I did some
research, but couldn't find an example of this. Is this something we can
discard as an old technique not used anymore?

On the other hand, interrupts are prone to nesting:
 - Weird interrupt handlers that re-enable interrupts
 - NMIs interrupting Hard IRQ context
 - NMIs interrupting NMIs

Please let me know if I'm off-base, but I think the topic of nesting is worth a
sentence or two in each section.

[...]

> > +Interrupts and regular exceptions
> > +---------------------------------
> > +
> > +Interrupts entry and exit handling is slightly more complex than syscalls
> > +and KVM transitions.
> > +
> > +If an interrupt is raised while the CPU executes in user space, the entry
> > +and exit handling is exactly the same as for syscalls.
> > +
> > +If the interrupt is raised while the CPU executes in kernel space the entry
> > +and exit handling is slightly different. RCU state is only updated when the
> > +interrupt was raised in context of the CPU's idle task because that's the
> > +only kernel context where RCU can be not watching on NOHZ enabled kernels.
> > +Lockdep and tracing have to be updated unconditionally.
> 
> !!! You lost me on this one.  Does that second-to-last sentence instead
> want to end something like this?  "... where RCU will not be watching
> when running on non-nohz_full CPUs."

The paragraph covers IRQ entry from kernel space. In that context RCU is only
shut-off during idle. That only happens on a NOHZ-enabled kernel, be it
NO_HZ_IDLE or NO_HZ_FULL.

I'll try to reword it a bit so it's more explicit.

Thanks!

-- 
Nicolás Sáenz


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-13 10:36                     ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 42+ messages in thread
From: Nicolas Saenz Julienne @ 2021-12-13 10:36 UTC (permalink / raw)
  To: paulmck, Thomas Gleixner
  Cc: Mark Rutland, Steven Rostedt, linux-kernel, linux-arm-kernel,
	rcu, Peter Zijlstra, mtosatti, frederic, Jonathan Corbet

Hi All,
now that this is good shape I'm taking over Thomas and preparing v3.

Paul, I introduced most (if not all) your paragraph corrections. Some questions
below.

On Fri, 2021-12-03 at 12:08 -0800, Paul E. McKenney wrote:
> > +The update order depends on the transition type and is explained below in
> > +the transition type sections.
> @@@

Sorry, I'm not 100% sure I get what you meant by this. Maybe introducing some
sort of link?

[...]

> > +syscall_exit_to_user_mode() handles all work which needs to be done before
> > +returning to user space like tracing, audit, signals, task work etc. After
> > +that it invokes exit_to_user_mode() which again handles the state
> > +transition in the reverse order:
> > +
> > +  * Tracing
> > +  * RCU / Context tracking
> > +  * Lockdep
> > +
> > +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> > +available as fine grained subfunctions in cases where the architecture code
> > +has to do extra work between the various steps. In such cases it has to
> > +ensure that enter_from_user_mode() is called first on entry and
> > +exit_to_user_mode() is called last on exit.
> 
> !!! Here I have a question.  Can calls to enter_from_user_mode()
> be nested?  RCU is OK with this, but I am not so sure that everything
> else is.  If nesting is prohibited, this paragraph should explicitly
> say that.  If nesting is theoretically possible, but should be avoided,
> it would be good to say that as well.  (Otherwise "It looks like it
> might work, so let's go for it!")


In __enter_from_user_mode() I see:

	CT_WARN_ON(ct_state() != CONTEXT_USER);

IIUC this signals that a nested syscall entry isn't expected from CT's point of
view. I remember reading through RCU's dyntick code that the rationale for
nesting in the syscall path was half interrupts (or upcalls). I did some
research, but couldn't find an example of this. Is this something we can
discard as an old technique not used anymore?

On the other hand, interrupts are prone to nesting:
 - Weird interrupt handlers that re-enable interrupts
 - NMIs interrupting Hard IRQ context
 - NMIs interrupting NMIs

Please let me know if I'm off-base, but I think the topic of nesting is worth a
sentence or two in each section.

[...]

> > +Interrupts and regular exceptions
> > +---------------------------------
> > +
> > +Interrupts entry and exit handling is slightly more complex than syscalls
> > +and KVM transitions.
> > +
> > +If an interrupt is raised while the CPU executes in user space, the entry
> > +and exit handling is exactly the same as for syscalls.
> > +
> > +If the interrupt is raised while the CPU executes in kernel space the entry
> > +and exit handling is slightly different. RCU state is only updated when the
> > +interrupt was raised in context of the CPU's idle task because that's the
> > +only kernel context where RCU can be not watching on NOHZ enabled kernels.
> > +Lockdep and tracing have to be updated unconditionally.
> 
> !!! You lost me on this one.  Does that second-to-last sentence instead
> want to end something like this?  "... where RCU will not be watching
> when running on non-nohz_full CPUs."

The paragraph covers IRQ entry from kernel space. In that context RCU is only
shut-off during idle. That only happens on a NOHZ-enabled kernel, be it
NO_HZ_IDLE or NO_HZ_FULL.

I'll try to reword it a bit so it's more explicit.

Thanks!

-- 
Nicolás Sáenz


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
  2021-12-13 10:36                     ` Nicolas Saenz Julienne
@ 2021-12-13 16:41                       ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2021-12-13 16:41 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Thomas Gleixner, Mark Rutland, Steven Rostedt, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, mtosatti, frederic,
	Jonathan Corbet

On Mon, Dec 13, 2021 at 11:36:56AM +0100, Nicolas Saenz Julienne wrote:
> Hi All,
> now that this is good shape I'm taking over Thomas and preparing v3.
> 
> Paul, I introduced most (if not all) your paragraph corrections. Some questions
> below.

Thank you for taking this on!

> On Fri, 2021-12-03 at 12:08 -0800, Paul E. McKenney wrote:
> > > +The update order depends on the transition type and is explained below in
> > > +the transition type sections.
> > @@@
> 
> Sorry, I'm not 100% sure I get what you meant by this. Maybe introducing some
> sort of link?

What this sentence is trying to get across is that there are different
orders of state updates, depending on the type of transition.  It would
be good to link to the following sections, if that can be done reasonably.

> [...]
> 
> > > +syscall_exit_to_user_mode() handles all work which needs to be done before
> > > +returning to user space like tracing, audit, signals, task work etc. After
> > > +that it invokes exit_to_user_mode() which again handles the state
> > > +transition in the reverse order:
> > > +
> > > +  * Tracing
> > > +  * RCU / Context tracking
> > > +  * Lockdep
> > > +
> > > +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> > > +available as fine grained subfunctions in cases where the architecture code
> > > +has to do extra work between the various steps. In such cases it has to
> > > +ensure that enter_from_user_mode() is called first on entry and
> > > +exit_to_user_mode() is called last on exit.
> > 
> > !!! Here I have a question.  Can calls to enter_from_user_mode()
> > be nested?  RCU is OK with this, but I am not so sure that everything
> > else is.  If nesting is prohibited, this paragraph should explicitly
> > say that.  If nesting is theoretically possible, but should be avoided,
> > it would be good to say that as well.  (Otherwise "It looks like it
> > might work, so let's go for it!")
> 
> 
> In __enter_from_user_mode() I see:
> 
> 	CT_WARN_ON(ct_state() != CONTEXT_USER);
> 
> IIUC this signals that a nested syscall entry isn't expected from CT's point of
> view. I remember reading through RCU's dyntick code that the rationale for
> nesting in the syscall path was half interrupts (or upcalls). I did some
> research, but couldn't find an example of this. Is this something we can
> discard as an old technique not used anymore?

Indeed, there are thankfully no more half interrupts.

> On the other hand, interrupts are prone to nesting:
>  - Weird interrupt handlers that re-enable interrupts
>  - NMIs interrupting Hard IRQ context
>  - NMIs interrupting NMIs

Plus there are odd cases where (from RCU's viewpoint) an interrupt can
happen within an NMI handler.

> Please let me know if I'm off-base, but I think the topic of nesting is worth a
> sentence or two in each section.

I completely agree.  We should be clear on what nesting is permitted and
not.

> [...]
> 
> > > +Interrupts and regular exceptions
> > > +---------------------------------
> > > +
> > > +Interrupts entry and exit handling is slightly more complex than syscalls
> > > +and KVM transitions.
> > > +
> > > +If an interrupt is raised while the CPU executes in user space, the entry
> > > +and exit handling is exactly the same as for syscalls.
> > > +
> > > +If the interrupt is raised while the CPU executes in kernel space the entry
> > > +and exit handling is slightly different. RCU state is only updated when the
> > > +interrupt was raised in context of the CPU's idle task because that's the
> > > +only kernel context where RCU can be not watching on NOHZ enabled kernels.
> > > +Lockdep and tracing have to be updated unconditionally.
> > 
> > !!! You lost me on this one.  Does that second-to-last sentence instead
> > want to end something like this?  "... where RCU will not be watching
> > when running on non-nohz_full CPUs."
> 
> The paragraph covers IRQ entry from kernel space. In that context RCU is only
> shut-off during idle. That only happens on a NOHZ-enabled kernel, be it
> NO_HZ_IDLE or NO_HZ_FULL.

OK, good.  So the RCU-not-watching case is on a nohz-full CPU.

> I'll try to reword it a bit so it's more explicit.

Sounds good!  And thank you again for taking this on!

							Thanx, Paul

> Thanks!
> 
> -- 
> Nicolás Sáenz
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] Documentation: Fill the gaps about entry/noinstr constraints
@ 2021-12-13 16:41                       ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2021-12-13 16:41 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Thomas Gleixner, Mark Rutland, Steven Rostedt, linux-kernel,
	linux-arm-kernel, rcu, Peter Zijlstra, mtosatti, frederic,
	Jonathan Corbet

On Mon, Dec 13, 2021 at 11:36:56AM +0100, Nicolas Saenz Julienne wrote:
> Hi All,
> now that this is good shape I'm taking over Thomas and preparing v3.
> 
> Paul, I introduced most (if not all) your paragraph corrections. Some questions
> below.

Thank you for taking this on!

> On Fri, 2021-12-03 at 12:08 -0800, Paul E. McKenney wrote:
> > > +The update order depends on the transition type and is explained below in
> > > +the transition type sections.
> > @@@
> 
> Sorry, I'm not 100% sure I get what you meant by this. Maybe introducing some
> sort of link?

What this sentence is trying to get across is that there are different
orders of state updates, depending on the type of transition.  It would
be good to link to the following sections, if that can be done reasonably.

> [...]
> 
> > > +syscall_exit_to_user_mode() handles all work which needs to be done before
> > > +returning to user space like tracing, audit, signals, task work etc. After
> > > +that it invokes exit_to_user_mode() which again handles the state
> > > +transition in the reverse order:
> > > +
> > > +  * Tracing
> > > +  * RCU / Context tracking
> > > +  * Lockdep
> > > +
> > > +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> > > +available as fine grained subfunctions in cases where the architecture code
> > > +has to do extra work between the various steps. In such cases it has to
> > > +ensure that enter_from_user_mode() is called first on entry and
> > > +exit_to_user_mode() is called last on exit.
> > 
> > !!! Here I have a question.  Can calls to enter_from_user_mode()
> > be nested?  RCU is OK with this, but I am not so sure that everything
> > else is.  If nesting is prohibited, this paragraph should explicitly
> > say that.  If nesting is theoretically possible, but should be avoided,
> > it would be good to say that as well.  (Otherwise "It looks like it
> > might work, so let's go for it!")
> 
> 
> In __enter_from_user_mode() I see:
> 
> 	CT_WARN_ON(ct_state() != CONTEXT_USER);
> 
> IIUC this signals that a nested syscall entry isn't expected from CT's point of
> view. I remember reading through RCU's dyntick code that the rationale for
> nesting in the syscall path was half interrupts (or upcalls). I did some
> research, but couldn't find an example of this. Is this something we can
> discard as an old technique not used anymore?

Indeed, there are thankfully no more half interrupts.

> On the other hand, interrupts are prone to nesting:
>  - Weird interrupt handlers that re-enable interrupts
>  - NMIs interrupting Hard IRQ context
>  - NMIs interrupting NMIs

Plus there are odd cases where (from RCU's viewpoint) an interrupt can
happen within an NMI handler.

> Please let me know if I'm off-base, but I think the topic of nesting is worth a
> sentence or two in each section.

I completely agree.  We should be clear on what nesting is permitted and
not.

> [...]
> 
> > > +Interrupts and regular exceptions
> > > +---------------------------------
> > > +
> > > +Interrupts entry and exit handling is slightly more complex than syscalls
> > > +and KVM transitions.
> > > +
> > > +If an interrupt is raised while the CPU executes in user space, the entry
> > > +and exit handling is exactly the same as for syscalls.
> > > +
> > > +If the interrupt is raised while the CPU executes in kernel space the entry
> > > +and exit handling is slightly different. RCU state is only updated when the
> > > +interrupt was raised in context of the CPU's idle task because that's the
> > > +only kernel context where RCU can be not watching on NOHZ enabled kernels.
> > > +Lockdep and tracing have to be updated unconditionally.
> > 
> > !!! You lost me on this one.  Does that second-to-last sentence instead
> > want to end something like this?  "... where RCU will not be watching
> > when running on non-nohz_full CPUs."
> 
> The paragraph covers IRQ entry from kernel space. In that context RCU is only
> shut-off during idle. That only happens on a NOHZ-enabled kernel, be it
> NO_HZ_IDLE or NO_HZ_FULL.

OK, good.  So the RCU-not-watching case is on a nohz-full CPU.

> I'll try to reword it a bit so it's more explicit.

Sounds good!  And thank you again for taking this on!

							Thanx, Paul

> Thanks!
> 
> -- 
> Nicolás Sáenz
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2021-12-13 16:43 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-30 11:28 Question WRT early IRQ/NMI entry code Nicolas Saenz Julienne
2021-11-30 11:28 ` Nicolas Saenz Julienne
2021-11-30 12:05 ` Frederic Weisbecker
2021-11-30 12:05   ` Frederic Weisbecker
2021-11-30 12:50 ` Mark Rutland
2021-11-30 12:50   ` Mark Rutland
2021-11-30 13:47 ` Thomas Gleixner
2021-11-30 13:47   ` Thomas Gleixner
2021-11-30 14:13   ` Steven Rostedt
2021-11-30 14:13     ` Steven Rostedt
2021-11-30 22:31     ` [PATCH] Documentation: Fill the gaps about entry/noinstr constraints Thomas Gleixner
2021-11-30 22:31       ` Thomas Gleixner
2021-12-01 10:56       ` Mark Rutland
2021-12-01 10:56         ` Mark Rutland
2021-12-01 18:14         ` Thomas Gleixner
2021-12-01 18:14           ` Thomas Gleixner
2021-12-01 18:23           ` Mark Rutland
2021-12-01 18:23             ` Mark Rutland
2021-12-01 20:28             ` Thomas Gleixner
2021-12-01 20:28               ` Thomas Gleixner
2021-12-01 20:35               ` [PATCH v2] " Thomas Gleixner
2021-12-01 20:35                 ` Thomas Gleixner
2021-12-02 10:03                 ` Mark Rutland
2021-12-02 10:03                   ` Mark Rutland
2021-12-03 20:08                 ` Paul E. McKenney
2021-12-03 20:08                   ` Paul E. McKenney
2021-12-13 10:36                   ` Nicolas Saenz Julienne
2021-12-13 10:36                     ` Nicolas Saenz Julienne
2021-12-13 16:41                     ` Paul E. McKenney
2021-12-13 16:41                       ` Paul E. McKenney
2021-12-04  3:48                 ` Randy Dunlap
2021-12-04  3:48                   ` Randy Dunlap
2021-12-06 17:36                   ` Mark Rutland
2021-12-06 17:36                     ` Mark Rutland
2021-12-06 17:53                     ` Paul E. McKenney
2021-12-06 17:53                       ` Paul E. McKenney
2021-12-06 21:24                       ` Randy Dunlap
2021-12-06 21:24                         ` Randy Dunlap
2021-12-06 21:36                         ` Paul E. McKenney
2021-12-06 21:36                           ` Paul E. McKenney
2021-11-30 15:13   ` Question WRT early IRQ/NMI entry code Nicolas Saenz Julienne
2021-11-30 15:13     ` Nicolas Saenz Julienne

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.