Netdev Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH net 0/2] __napi_schedule_irqoff() used in wrong context
@ 2019-11-26 22:20 Sebastian Andrzej Siewior
  2019-11-26 22:20 ` [PATCH net 1/2] amd-xgbe: Use __napi_schedule() in BH context Sebastian Andrzej Siewior
  2019-11-26 22:20 ` [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs' Sebastian Andrzej Siewior
  0 siblings, 2 replies; 9+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-11-26 22:20 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Thomas Gleixner


I found three cases where __napi_schedule_irqoff() could be used with
enabled interrupts while looking for something else… There are patches
for two of them.

The third one, hyperv/netvsc, is wrong with the `threadirqs' commandline
option because the interrupt will be delivered directly via the vector
and not the thread. I will try to sort this out with the x86 folks.

Sebastian


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH net 1/2] amd-xgbe: Use __napi_schedule() in BH context
  2019-11-26 22:20 [PATCH net 0/2] __napi_schedule_irqoff() used in wrong context Sebastian Andrzej Siewior
@ 2019-11-26 22:20 ` Sebastian Andrzej Siewior
  2019-12-02 17:19   ` Tom Lendacky
  2019-11-26 22:20 ` [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs' Sebastian Andrzej Siewior
  1 sibling, 1 reply; 9+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-11-26 22:20 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Thomas Gleixner, Sebastian Andrzej Siewior,
	Tom Lendacky

The driver uses __napi_schedule_irqoff() which is fine as long as it is
invoked with disabled interrupts by everybody. Since the commit
mentioned below the driver may invoke xgbe_isr_task() in tasklet/softirq
context. This may lead to list corruption if another driver uses
__napi_schedule_irqoff() in IRQ context.

Use __napi_schedule() which safe to use from IRQ and softirq context.

Fixes: 85b85c853401d ("amd-xgbe: Re-issue interrupt if interrupt status not cleared")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index 98f8f20331544..3bd20f7651207 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -514,7 +514,7 @@ static void xgbe_isr_task(unsigned long data)
 				xgbe_disable_rx_tx_ints(pdata);
 
 				/* Turn on polling */
-				__napi_schedule_irqoff(&pdata->napi);
+				__napi_schedule(&pdata->napi);
 			}
 		} else {
 			/* Don't clear Rx/Tx status if doing per channel DMA
-- 
2.24.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs'
  2019-11-26 22:20 [PATCH net 0/2] __napi_schedule_irqoff() used in wrong context Sebastian Andrzej Siewior
  2019-11-26 22:20 ` [PATCH net 1/2] amd-xgbe: Use __napi_schedule() in BH context Sebastian Andrzej Siewior
@ 2019-11-26 22:20 ` Sebastian Andrzej Siewior
  2019-11-26 22:39   ` Eric Dumazet
  1 sibling, 1 reply; 9+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-11-26 22:20 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Thomas Gleixner, Sebastian Andrzej Siewior,
	Eric Dumazet

The timer callback (napi_watchdog()) invokes __napi_schedule_irqoff()
with disabled interrupts. With the `threadirqs' commandline option all
interrupt handler are threaded and using __napi_schedule_irqoff() is not
an issue because everyone is using it in threaded context which is
synchronised with local_bh_disable().
The napi_watchdog() timer is still expiring in hardirq context and may
interrupt a threaded handler which is in the middle of
__napi_schedule_irqoff() leading to list corruption.

Let the napi_watchdog() expire in softirq context if `threadirqs' is
used.

Fixes: 3b47d30396bae ("net: gro: add a per device gro flush timer")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/dev.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 99ac84ff398f4..ec533d20931bc 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5994,6 +5994,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 		napi_gro_flush(n, !!timeout);
 		if (timeout)
 			hrtimer_start(&n->timer, ns_to_ktime(timeout),
+				      force_irqthreads ?
+				      HRTIMER_MODE_REL_PINNED_SOFT :
 				      HRTIMER_MODE_REL_PINNED);
 	}
 	if (unlikely(!list_empty(&n->poll_list))) {
@@ -6225,7 +6227,9 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight)
 {
 	INIT_LIST_HEAD(&napi->poll_list);
-	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+	hrtimer_init(&napi->timer, CLOCK_MONOTONIC,
+		     force_irqthreads ?
+		     HRTIMER_MODE_REL_PINNED_SOFT : HRTIMER_MODE_REL_PINNED);
 	napi->timer.function = napi_watchdog;
 	init_gro_hash(napi);
 	napi->skb = NULL;
-- 
2.24.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs'
  2019-11-26 22:20 ` [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs' Sebastian Andrzej Siewior
@ 2019-11-26 22:39   ` Eric Dumazet
  2019-11-27  9:35     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2019-11-26 22:39 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: netdev, David S. Miller, Thomas Gleixner

On Tue, Nov 26, 2019 at 2:20 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> The timer callback (napi_watchdog()) invokes __napi_schedule_irqoff()
> with disabled interrupts. With the `threadirqs' commandline option all
> interrupt handler are threaded and using __napi_schedule_irqoff() is not
> an issue because everyone is using it in threaded context which is
> synchronised with local_bh_disable().
> The napi_watchdog() timer is still expiring in hardirq context and may
> interrupt a threaded handler which is in the middle of
> __napi_schedule_irqoff() leading to list corruption.

Sorry, I do not understand this changelog.

Where/how do you get list corruption  exactly ?

 __napi_schedule_irqoff() _has_ to be called with hard IRQ disabled.

Please post a stack trace.


>
> Let the napi_watchdog() expire in softirq context if `threadirqs' is
> used.
>
> Fixes: 3b47d30396bae ("net: gro: add a per device gro flush timer")

Are you sure this commit is the root cause of the problem you see ?



> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  net/core/dev.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 99ac84ff398f4..ec533d20931bc 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5994,6 +5994,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>                 napi_gro_flush(n, !!timeout);
>                 if (timeout)
>                         hrtimer_start(&n->timer, ns_to_ktime(timeout),
> +                                     force_irqthreads ?

Honestly something is wrong with this patch, force_irqthreads should
not be used in net/ territory,
that is some layering problem.

> +                                     HRTIMER_MODE_REL_PINNED_SOFT :
>                                       HRTIMER_MODE_REL_PINNED);
>         }
>         if (unlikely(!list_empty(&n->poll_list))) {
> @@ -6225,7 +6227,9 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>                     int (*poll)(struct napi_struct *, int), int weight)
>  {
>         INIT_LIST_HEAD(&napi->poll_list);
> -       hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
> +       hrtimer_init(&napi->timer, CLOCK_MONOTONIC,
> +                    force_irqthreads ?
> +                    HRTIMER_MODE_REL_PINNED_SOFT : HRTIMER_MODE_REL_PINNED);
>         napi->timer.function = napi_watchdog;
>         init_gro_hash(napi);
>         napi->skb = NULL;
> --
> 2.24.0
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs'
  2019-11-26 22:39   ` Eric Dumazet
@ 2019-11-27  9:35     ` Sebastian Andrzej Siewior
       [not found]       ` <CANn89iL=q2wwjdSj1=veBE0hDATm_K=akKhz3Dyddnk28DRJhg@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-11-27  9:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Thomas Gleixner

On 2019-11-26 14:39:47 [-0800], Eric Dumazet wrote:
> On Tue, Nov 26, 2019 at 2:20 PM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> >
> > The timer callback (napi_watchdog()) invokes __napi_schedule_irqoff()
> > with disabled interrupts. With the `threadirqs' commandline option all
> > interrupt handler are threaded and using __napi_schedule_irqoff() is not
> > an issue because everyone is using it in threaded context which is
> > synchronised with local_bh_disable().
> > The napi_watchdog() timer is still expiring in hardirq context and may
> > interrupt a threaded handler which is in the middle of
> > __napi_schedule_irqoff() leading to list corruption.
> 
> Sorry, I do not understand this changelog.
> 
> Where/how do you get list corruption  exactly ?
> 
>  __napi_schedule_irqoff() _has_ to be called with hard IRQ disabled.
> 
> Please post a stack trace.

I don't have a stack trace, this is based on review.
__napi_schedule_irqoff() is used in IRQ context and this is only the
primary IRQ handler. There is no other in-IRQ usage (like SMP-function
call or so (excluding the HV network driver)) but there is the hrtimer.

With `threadirqs' you don't have the in-IRQ usage from the IRQ handler
anymore. The IRQ-handler for two different NICs don't interrupt/
preempt each other because the handler is invoked with disabled softirq
(which also disables preemption). This is all safe.
The hrtimer fires always in IRQ context no matter if `threadirqs' is
used or not. Which brings us to the following race condition:

One CPU, 2 NICs:

    threaded_IRQ of NIC1                     hard-irq context, hrtimer
       local_bh_disable()
         nic_irq_handler()
	  if (napi_schedule_prep())
	    __napi_schedule_irqoff()
	       ____napi_schedule(this_cpu_ptr(&softnet_data), n);
->	         list_add_tail(, &sd->poll_list);

                                        hrtimer_interrupt()
                                          __hrtimer_run_queues()
                                             napi_watchdog()
                                               __napi_schedule_irqoff()
                                    	       ____napi_schedule(this_cpu_ptr(&softnet_data), n);
->	                                         list_add_tail(&napi->poll_list, &sd->poll_list);

The same per-CPU list modified again.
If the callback is moved to softirq instead:

    threaded_IRQ of NIC1                     hard-irq context, hrtimer
       local_bh_disable()
         nic_irq_handler()
	  if (napi_schedule_prep())
	    __napi_schedule_irqoff()
	       ____napi_schedule(this_cpu_ptr(&softnet_data), n);
->	         list_add_tail(, &sd->poll_list);

                                        hrtimer_interrupt()
                                          raise_softirq_irqoff(HRTIMER_SOFTIRQ);
                 list_add_tail() completes
       local_bh_enable()
          if (unlikely(!in_interrupt() && local_softirq_pending())) {
             do_softirq();
                                               napi_watchdog()
                                                 __napi_schedule_irqoff()
                                    	           ____napi_schedule(this_cpu_ptr(&softnet_data), n);
	                                             list_add_tail(, &sd->poll_list);

> > Let the napi_watchdog() expire in softirq context if `threadirqs' is
> > used.
> >
> > Fixes: 3b47d30396bae ("net: gro: add a per device gro flush timer")
> 
> Are you sure this commit is the root cause of the problem you see ?

This commit was introduced in v3.19-rc1 and threadirqs was introduced in
commit
   8d32a307e4faa ("genirq: Provide forced interrupt threading")

which is v2.6.39-rc1. Based on my explanation above this problem exists
with the timer and `threadirqs'. The hrtimer always expired in hardirq
context.

> > Cc: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > ---
> >  net/core/dev.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 99ac84ff398f4..ec533d20931bc 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5994,6 +5994,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> >                 napi_gro_flush(n, !!timeout);
> >                 if (timeout)
> >                         hrtimer_start(&n->timer, ns_to_ktime(timeout),
> > +                                     force_irqthreads ?
> 
> Honestly something is wrong with this patch, force_irqthreads should
> not be used in net/ territory,
> that is some layering problem.

I'm not aware of an other problems of this kind.
Most drivers do spin_lock_irqsave() and in that case in does not matter
if the interrupt is threaded or not vs the hrtimer callback. Which means
they do not assume that the IRQ handler runs with interrupts disabled.
The timeout timers are usually timer_list timer which run in softirq
context and since the force-IRQ-thread runs also with disabled softirq
it is fine using just spin_lock().

If you don't want this here maybe tglx can add some hrtimer annotation.

Sebastian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs'
       [not found]       ` <CANn89iL=q2wwjdSj1=veBE0hDATm_K=akKhz3Dyddnk28DRJhg@mail.gmail.com>
@ 2019-11-27 17:11         ` Eric Dumazet
  2019-11-27 17:37           ` Sebastian Andrzej Siewior
  2019-11-27 17:22         ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2019-11-27 17:11 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: netdev, David S. Miller, Thomas Gleixner

Resent in non HTML mode :/

Long story short, why hrtimer are not by default using threaded mode
in threadirqs mode ?

Idea of having some (but not all of them) hard irq handlers' now being
run from BH mode,
is rather scary.

Also, hrtimers got the SOFT thing only in 4.16, while the GRO patch
went in linux-3.19

What would be the plan for stable trees ?


On Wed, Nov 27, 2019 at 8:15 AM Eric Dumazet <edumazet@google.com> wrote:
>
>
>
> On Wed, Nov 27, 2019 at 1:35 AM Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>>
>> On 2019-11-26 14:39:47 [-0800], Eric Dumazet wrote:
>> > On Tue, Nov 26, 2019 at 2:20 PM Sebastian Andrzej Siewior
>> > <bigeasy@linutronix.de> wrote:
>> > >
>> > > The timer callback (napi_watchdog()) invokes __napi_schedule_irqoff()
>> > > with disabled interrupts. With the `threadirqs' commandline option all
>> > > interrupt handler are threaded and using __napi_schedule_irqoff() is not
>> > > an issue because everyone is using it in threaded context which is
>> > > synchronised with local_bh_disable().
>> > > The napi_watchdog() timer is still expiring in hardirq context and may
>> > > interrupt a threaded handler which is in the middle of
>> > > __napi_schedule_irqoff() leading to list corruption.
>> >
>> > Sorry, I do not understand this changelog.
>> >
>> > Where/how do you get list corruption  exactly ?
>> >
>> >  __napi_schedule_irqoff() _has_ to be called with hard IRQ disabled.
>> >
>> > Please post a stack trace.
>>
>> I don't have a stack trace, this is based on review.
>> __napi_schedule_irqoff() is used in IRQ context and this is only the
>> primary IRQ handler. There is no other in-IRQ usage (like SMP-function
>> call or so (excluding the HV network driver)) but there is the hrtimer.
>>
>> With `threadirqs' you don't have the in-IRQ usage from the IRQ handler
>> anymore. The IRQ-handler for two different NICs don't interrupt/
>> preempt each other because the handler is invoked with disabled softirq
>> (which also disables preemption). This is all safe.
>> The hrtimer fires always in IRQ context no matter if `threadirqs' is
>> used or not. Which brings us to the following race condition:
>>
>> One CPU, 2 NICs:
>>
>>     threaded_IRQ of NIC1                     hard-irq context, hrtimer
>>        local_bh_disable()
>>          nic_irq_handler()
>>           if (napi_schedule_prep())
>>             __napi_schedule_irqoff()
>
>
> And _which_ driver would call this variant without being sure hard irq are disabled ?
>
> Hard irq handlers are supposed to run with hard irq being disabled.
>
> Who decided that this was no longer the case ?
> This conflicts with hrtimer being delivered from hard irq context, this is the root cause of the problem.
>
> You are telling us napi_schedule_irqoff() should never be used, because we do not know if hard irqs are masked or not.
>
> Honestly this is a nightmare, we can not trust anymore stuff that has been settled 20 years ago.
>
> If you want to get rid of hard irq completely, make all handlers being run from threaded_IRQ,
> not only a subset of them ?
>
>
>>
>>                ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>> ->               list_add_tail(, &sd->poll_list);
>>
>>                                         hrtimer_interrupt()
>>                                           __hrtimer_run_queues()
>>                                              napi_watchdog()
>>                                                __napi_schedule_irqoff()
>>                                                ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>> ->                                               list_add_tail(&napi->poll_list, &sd->poll_list);
>>
>> The same per-CPU list modified again.
>> If the callback is moved to softirq instead:
>>
>>     threaded_IRQ of NIC1                     hard-irq context, hrtimer
>>        local_bh_disable()
>>          nic_irq_handler()
>>           if (napi_schedule_prep())
>>             __napi_schedule_irqoff()
>>                ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>> ->               list_add_tail(, &sd->poll_list);
>>
>>                                         hrtimer_interrupt()
>>                                           raise_softirq_irqoff(HRTIMER_SOFTIRQ);
>>                  list_add_tail() completes
>>        local_bh_enable()
>>           if (unlikely(!in_interrupt() && local_softirq_pending())) {
>>              do_softirq();
>>                                                napi_watchdog()
>>                                                  __napi_schedule_irqoff()
>>                                                    ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>>                                                      list_add_tail(, &sd->poll_list);
>>
>> > > Let the napi_watchdog() expire in softirq context if `threadirqs' is
>> > > used.
>> > >
>> > > Fixes: 3b47d30396bae ("net: gro: add a per device gro flush timer")
>> >
>> > Are you sure this commit is the root cause of the problem you see ?
>>
>> This commit was introduced in v3.19-rc1 and threadirqs was introduced in
>> commit
>>    8d32a307e4faa ("genirq: Provide forced interrupt threading")
>>
>> which is v2.6.39-rc1. Based on my explanation above this problem exists
>> with the timer and `threadirqs'. The hrtimer always expired in hardirq
>> context.
>>
>> > > Cc: Eric Dumazet <edumazet@google.com>
>> > > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> > > ---
>> > >  net/core/dev.c | 6 +++++-
>> > >  1 file changed, 5 insertions(+), 1 deletion(-)
>> > >
>> > > diff --git a/net/core/dev.c b/net/core/dev.c
>> > > index 99ac84ff398f4..ec533d20931bc 100644
>> > > --- a/net/core/dev.c
>> > > +++ b/net/core/dev.c
>> > > @@ -5994,6 +5994,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>> > >                 napi_gro_flush(n, !!timeout);
>> > >                 if (timeout)
>> > >                         hrtimer_start(&n->timer, ns_to_ktime(timeout),
>> > > +                                     force_irqthreads ?
>> >
>> > Honestly something is wrong with this patch, force_irqthreads should
>> > not be used in net/ territory,
>> > that is some layering problem.
>>
>> I'm not aware of an other problems of this kind.
>> Most drivers do spin_lock_irqsave() and in that case in does not matter
>> if the interrupt is threaded or not vs the hrtimer callback. Which means
>> they do not assume that the IRQ handler runs with interrupts disabled.
>
>
> Most NIC drivers simply raises a softirq.
>
>>
>> The timeout timers are usually timer_list timer which run in softirq
>> context and since the force-IRQ-thread runs also with disabled softirq
>> it is fine using just spin_lock().
>>
>> If you don't want this here maybe tglx can add some hrtimer annotation.
>
>
> I only want to understand how many other points in the stack we have to audit and ' fix' ...
>
>>
>>
>> Sebastian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs'
       [not found]       ` <CANn89iL=q2wwjdSj1=veBE0hDATm_K=akKhz3Dyddnk28DRJhg@mail.gmail.com>
  2019-11-27 17:11         ` Eric Dumazet
@ 2019-11-27 17:22         ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 9+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-11-27 17:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Thomas Gleixner

On 2019-11-27 08:15:52 [-0800], Eric Dumazet wrote:
> > One CPU, 2 NICs:
> >
> >     threaded_IRQ of NIC1                     hard-irq context, hrtimer
> >        local_bh_disable()
> >          nic_irq_handler()
> >           if (napi_schedule_prep())
> >             __napi_schedule_irqoff()
> >
> 
> And _which_ driver would call this variant without being sure hard irq are
> disabled ?

->
|$ git grep __napi_schedule_irqoff drivers/net/

returns for instance drivers/net/ethernet/rdc/r6040.c. It invokes
r6040_interrupt() from the hardirq handler. Everything is correct.

> Hard irq handlers are supposed to run with hard irq being disabled.

This is the case except for…

> Who decided that this was no longer the case ?

the system was bootet with the `threadirqs' on command line.

> This conflicts with hrtimer being delivered from hard irq context, this is
> the root cause of the problem.

correct. That is why moved them softirq context if the system is booted
with `threadirqs'.

> You are telling us napi_schedule_irqoff() should never be used, because we
> do not know if hard irqs are masked or not.

If interrupts are forced-threaded, correct. From kernel/irq/manage.c,
irq_forced_thread_fn() is used as the "handler" in this case:
| /*
|  * Interrupts which are not explicitly requested as threaded
|  * interrupts rely on the implicit bh/preempt disable of the hard irq
|  * context. So we need to disable bh here to avoid deadlocks and other
|  * side effects.
|  */
| static irqreturn_t
| irq_forced_thread_fn(struct irq_desc *desc, struct irqaction *action)
| {
|         irqreturn_t ret;
| 
|         local_bh_disable();
|         ret = action->thread_fn(action->irq, action->dev_id);
|         if (ret == IRQ_HANDLED)
|                 atomic_inc(&desc->threads_handled);
| 
|         irq_finalize_oneshot(desc, action);
|         local_bh_enable();
|         return ret;
| }

> Honestly this is a nightmare, we can not trust anymore stuff that has been
> settled 20 years ago.

This changes only if the `threadirqs' command line switch has been used.
We didn't have threaded interrupts 20 years ago.
Also, if the hrtimer was already expired at the time of programming then
the hrtimer used to fire in softirq context until commit
    c6eb3f70d4482 ("hrtimer: Get rid of hrtimer softirq")

This could happen if you program the timer to 50ns instead of 50us in
case you missed a few zeros in your echo to the sysfs file.

> If you want to get rid of hard irq completely, make all handlers being run
> from threaded_IRQ,
> not only a subset of them ?

`threadirqs' threads only IRQ handlers. All of them. The hrtimers
infrastructure is not affected by this change. Also smp_function call
and irqwork continue to fire in hardirq context. Also the direct
interrupts vectors as used by the HyperV driver is not affected by this
switch but I think it should. 

> > I'm not aware of an other problems of this kind.
> > Most drivers do spin_lock_irqsave() and in that case in does not matter
> > if the interrupt is threaded or not vs the hrtimer callback. Which means
> > they do not assume that the IRQ handler runs with interrupts disabled.
> >
> 
> Most NIC drivers simply raises a softirq.

Yes. For those it does not matter because they use __napi_schedule()
which disables interrupts.

> > The timeout timers are usually timer_list timer which run in softirq
> > context and since the force-IRQ-thread runs also with disabled softirq
> > it is fine using just spin_lock().
> >
> > If you don't want this here maybe tglx can add some hrtimer annotation.
> >
> 
> I only want to understand how many other points in the stack we have to
> audit and ' fix' ...
Okay. Looking now over all hrtimer in net/ look good, most of them
expire in softirq context. Looking at drivers/net they also look fine.
They acquire spin_lock_irqsave() within the driver, schedule a tasklet
or wake a net queue.

Sebastian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs'
  2019-11-27 17:11         ` Eric Dumazet
@ 2019-11-27 17:37           ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 9+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-11-27 17:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Thomas Gleixner

On 2019-11-27 09:11:40 [-0800], Eric Dumazet wrote:
> Resent in non HTML mode :/
don't worry, mutt handles both :)

> Long story short, why hrtimer are not by default using threaded mode
> in threadirqs mode ?

Because it is only documented to thread only interrupts. Not sure if we
want change this.
In RT we expire most of the hrtimers in softirq context for other
reasons. A subset of them still expire in hardirq context.

> Idea of having some (but not all of them) hard irq handlers' now being
> run from BH mode,
> is rather scary.

As I explained in my previous email: All IRQ-handlers fire in
threaded-mode if enabled. Only the hrtimer is not affected by this
change.

> Also, hrtimers got the SOFT thing only in 4.16, while the GRO patch
> went in linux-3.19
> 
> What would be the plan for stable trees ?
No idea yet. We could let __napi_schedule_irqoff() behave like
__napi_schedule(). 

Sebastian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net 1/2] amd-xgbe: Use __napi_schedule() in BH context
  2019-11-26 22:20 ` [PATCH net 1/2] amd-xgbe: Use __napi_schedule() in BH context Sebastian Andrzej Siewior
@ 2019-12-02 17:19   ` Tom Lendacky
  0 siblings, 0 replies; 9+ messages in thread
From: Tom Lendacky @ 2019-12-02 17:19 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, netdev; +Cc: David S. Miller, Thomas Gleixner

On 11/26/19 4:20 PM, Sebastian Andrzej Siewior wrote:
> The driver uses __napi_schedule_irqoff() which is fine as long as it is
> invoked with disabled interrupts by everybody. Since the commit
> mentioned below the driver may invoke xgbe_isr_task() in tasklet/softirq
> context. This may lead to list corruption if another driver uses
> __napi_schedule_irqoff() in IRQ context.
> 
> Use __napi_schedule() which safe to use from IRQ and softirq context.
> 
> Fixes: 85b85c853401d ("amd-xgbe: Re-issue interrupt if interrupt status not cleared")
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Acked-by: Tom Lendacky <thomas.lendacky@amd.com>

> ---
>  drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
> index 98f8f20331544..3bd20f7651207 100644
> --- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
> +++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
> @@ -514,7 +514,7 @@ static void xgbe_isr_task(unsigned long data)
>  				xgbe_disable_rx_tx_ints(pdata);
>  
>  				/* Turn on polling */
> -				__napi_schedule_irqoff(&pdata->napi);
> +				__napi_schedule(&pdata->napi);
>  			}
>  		} else {
>  			/* Don't clear Rx/Tx status if doing per channel DMA
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-26 22:20 [PATCH net 0/2] __napi_schedule_irqoff() used in wrong context Sebastian Andrzej Siewior
2019-11-26 22:20 ` [PATCH net 1/2] amd-xgbe: Use __napi_schedule() in BH context Sebastian Andrzej Siewior
2019-12-02 17:19   ` Tom Lendacky
2019-11-26 22:20 ` [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs' Sebastian Andrzej Siewior
2019-11-26 22:39   ` Eric Dumazet
2019-11-27  9:35     ` Sebastian Andrzej Siewior
     [not found]       ` <CANn89iL=q2wwjdSj1=veBE0hDATm_K=akKhz3Dyddnk28DRJhg@mail.gmail.com>
2019-11-27 17:11         ` Eric Dumazet
2019-11-27 17:37           ` Sebastian Andrzej Siewior
2019-11-27 17:22         ` Sebastian Andrzej Siewior

Netdev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/netdev/0 netdev/git/0.git
	git clone --mirror https://lore.kernel.org/netdev/1 netdev/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 netdev netdev/ https://lore.kernel.org/netdev \
		netdev@vger.kernel.org
	public-inbox-index netdev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.netdev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git