* [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
@ 2022-02-03 18:40 Yannick Vignon
  2022-02-03 18:40 ` [PATCH net-next 2/2] net: stmmac: move to threaded IRQ Yannick Vignon
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Yannick Vignon @ 2022-02-03 18:40 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S. Miller, Jakub Kicinski, Maxime Coquelin, Eric Dumazet,
	Antoine Tenart, Alexander Lobakin, Sebastian Andrzej Siewior,
	Paolo Abeni, Wei Wang, Kumar Kartikeya Dwivedi, Yunsheng Lin,
	Arnd Bergmann, netdev, Vladimir Oltean, Xiaoliang Yang,
	mingkai.hu, Joakim Zhang, sebastien.laveze
  Cc: Yannick Vignon

From: Yannick Vignon <yannick.vignon@nxp.com>

If NAPI was not scheduled from interrupt or softirq,
__raise_softirq_irqoff would mark the softirq pending, but not
wake up ksoftirqd. With force threaded IRQs, this is
compensated by the fact that the interrupt handlers are
protected inside a local_bh_disable()/local_bh_enable()
section, and bh_enable will call do_softirq if needed. With
normal threaded IRQs however, this is no longer the case
(unless the interrupt handler itself calls local_bh_enable()),
which results in a pending softirq not being handled, and the
following message being printed out from tick-sched.c:
"NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"

Call raise_softirq_irqoff instead to make sure ksoftirqd is
woken up in such a case, ensuring __napi_schedule, etc behave
normally in more situations than just from an interrupt,
softirq or from within a bh_disable/bh_enable section.

Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1baab07820f6..f93b3173454c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4239,7 +4239,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
 	}
 
 	list_add_tail(&napi->poll_list, &sd->poll_list);
-	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+	raise_softirq_irqoff(NET_RX_SOFTIRQ);
 }
 
 #ifdef CONFIG_RPS
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH net-next 2/2] net: stmmac: move to threaded IRQ
  2022-02-03 18:40 [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI Yannick Vignon
@ 2022-02-03 18:40 ` Yannick Vignon
  2022-02-03 19:08 ` [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI Eric Dumazet
  2022-02-03 19:41 ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 22+ messages in thread
From: Yannick Vignon @ 2022-02-03 18:40 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S. Miller, Jakub Kicinski, Maxime Coquelin, Eric Dumazet,
	Antoine Tenart, Alexander Lobakin, Sebastian Andrzej Siewior,
	Paolo Abeni, Wei Wang, Kumar Kartikeya Dwivedi, Yunsheng Lin,
	Arnd Bergmann, netdev, Vladimir Oltean, Xiaoliang Yang,
	mingkai.hu, Joakim Zhang, sebastien.laveze
  Cc: Yannick Vignon

From: Yannick Vignon <yannick.vignon@nxp.com>

When an IRQ is forced threaded (such as with PREEMPT_RT), execution of the
handler remains protected by local_bh_disable()/local_bh_enable() calls to
keep the semantics of the IRQ context and avoid deadlocks. However, this
also creates a contention point where a higher prio interrupt handler gets
blocked by a lower prio task already holding the lock. Even though priority
inheritance kicks in in such a case, the lower prio task can still execute
for an indefinite time.

Move the stmmac interrupts to be explicitly threaded, so that high priority
traffic can be processed without delay even if another piece of code was
already running with BH disabled.

Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
---
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 28 +++++++++----------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index bde76ea2deec..4bfc2cb89456 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3459,8 +3459,8 @@ static int stmmac_request_irq_multi_msi(struct net_device *dev)
 	/* For common interrupt */
 	int_name = priv->int_name_mac;
 	sprintf(int_name, "%s:%s", dev->name, "mac");
-	ret = request_irq(dev->irq, stmmac_mac_interrupt,
-			  0, int_name, dev);
+	ret = request_threaded_irq(dev->irq, NULL, stmmac_mac_interrupt,
+				   IRQF_ONESHOT, int_name, dev);
 	if (unlikely(ret < 0)) {
 		netdev_err(priv->dev,
 			   "%s: alloc mac MSI %d (error: %d)\n",
@@ -3475,9 +3475,9 @@ static int stmmac_request_irq_multi_msi(struct net_device *dev)
 	if (priv->wol_irq > 0 && priv->wol_irq != dev->irq) {
 		int_name = priv->int_name_wol;
 		sprintf(int_name, "%s:%s", dev->name, "wol");
-		ret = request_irq(priv->wol_irq,
-				  stmmac_mac_interrupt,
-				  0, int_name, dev);
+		ret = request_threaded_irq(priv->wol_irq,
+					   NULL, stmmac_mac_interrupt,
+					   IRQF_ONESHOT, int_name, dev);
 		if (unlikely(ret < 0)) {
 			netdev_err(priv->dev,
 				   "%s: alloc wol MSI %d (error: %d)\n",
@@ -3493,9 +3493,9 @@ static int stmmac_request_irq_multi_msi(struct net_device *dev)
 	if (priv->lpi_irq > 0 && priv->lpi_irq != dev->irq) {
 		int_name = priv->int_name_lpi;
 		sprintf(int_name, "%s:%s", dev->name, "lpi");
-		ret = request_irq(priv->lpi_irq,
-				  stmmac_mac_interrupt,
-				  0, int_name, dev);
+		ret = request_threaded_irq(priv->lpi_irq,
+					   NULL, stmmac_mac_interrupt,
+					   IRQF_ONESHOT, int_name, dev);
 		if (unlikely(ret < 0)) {
 			netdev_err(priv->dev,
 				   "%s: alloc lpi MSI %d (error: %d)\n",
@@ -3604,8 +3604,8 @@ static int stmmac_request_irq_single(struct net_device *dev)
 	enum request_irq_err irq_err;
 	int ret;
 
-	ret = request_irq(dev->irq, stmmac_interrupt,
-			  IRQF_SHARED, dev->name, dev);
+	ret = request_threaded_irq(dev->irq, NULL, stmmac_interrupt,
+				   IRQF_SHARED | IRQF_ONESHOT, dev->name, dev);
 	if (unlikely(ret < 0)) {
 		netdev_err(priv->dev,
 			   "%s: ERROR: allocating the IRQ %d (error: %d)\n",
@@ -3618,8 +3618,8 @@ static int stmmac_request_irq_single(struct net_device *dev)
 	 * is used for WoL
 	 */
 	if (priv->wol_irq > 0 && priv->wol_irq != dev->irq) {
-		ret = request_irq(priv->wol_irq, stmmac_interrupt,
-				  IRQF_SHARED, dev->name, dev);
+		ret = request_threaded_irq(priv->wol_irq, NULL, stmmac_interrupt,
+					   IRQF_SHARED | IRQF_ONESHOT, dev->name, dev);
 		if (unlikely(ret < 0)) {
 			netdev_err(priv->dev,
 				   "%s: ERROR: allocating the WoL IRQ %d (%d)\n",
@@ -3631,8 +3631,8 @@ static int stmmac_request_irq_single(struct net_device *dev)
 
 	/* Request the IRQ lines */
 	if (priv->lpi_irq > 0 && priv->lpi_irq != dev->irq) {
-		ret = request_irq(priv->lpi_irq, stmmac_interrupt,
-				  IRQF_SHARED, dev->name, dev);
+		ret = request_threaded_irq(priv->lpi_irq, NULL, stmmac_interrupt,
+					   IRQF_SHARED | IRQF_ONESHOT, dev->name, dev);
 		if (unlikely(ret < 0)) {
 			netdev_err(priv->dev,
 				   "%s: ERROR: allocating the LPI IRQ %d (%d)\n",
-- 
2.25.1



* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-03 18:40 [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI Yannick Vignon
  2022-02-03 18:40 ` [PATCH net-next 2/2] net: stmmac: move to threaded IRQ Yannick Vignon
@ 2022-02-03 19:08 ` Eric Dumazet
  2022-02-03 23:40   ` Yannick Vignon
  2022-02-03 19:41 ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2022-02-03 19:08 UTC (permalink / raw)
  To: Yannick Vignon
  Cc: Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S. Miller, Jakub Kicinski, Maxime Coquelin, Antoine Tenart,
	Alexander Lobakin, Sebastian Andrzej Siewior, Paolo Abeni,
	Wei Wang, Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann,
	netdev, Vladimir Oltean, Xiaoliang Yang, mingkai.hu,
	Joakim Zhang, sebastien.laveze, Yannick Vignon

On Thu, Feb 3, 2022 at 11:06 AM Yannick Vignon
<yannick.vignon@oss.nxp.com> wrote:
>
> From: Yannick Vignon <yannick.vignon@nxp.com>
>
> If NAPI was not scheduled from interrupt or softirq,
> __raise_softirq_irqoff would mark the softirq pending, but not
> wake up ksoftirqd. With force threaded IRQs, this is
> compensated by the fact that the interrupt handlers are
> protected inside a local_bh_disable()/local_bh_enable()
> section, and bh_enable will call do_softirq if needed. With
> normal threaded IRQs however, this is no longer the case
> (unless the interrupt handler itself calls local_bh_enable()),
> which results in a pending softirq not being handled, and the
> following message being printed out from tick-sched.c:
> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"
>
> Call raise_softirq_irqoff instead to make sure ksoftirqd is
> woken up in such a case, ensuring __napi_schedule, etc behave
> normally in more situations than just from an interrupt,
> softirq or from within a bh_disable/bh_enable section.
>

This is buggy. NAPI is called from the right context.

Can you provide a stack trace or something, so that the buggy driver
can be fixed ?

> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
> ---
>  net/core/dev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1baab07820f6..f93b3173454c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4239,7 +4239,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>         }
>
>         list_add_tail(&napi->poll_list, &sd->poll_list);
> -       __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> +       raise_softirq_irqoff(NET_RX_SOFTIRQ);
>  }
>
>  #ifdef CONFIG_RPS
> --
> 2.25.1
>


* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-03 18:40 [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI Yannick Vignon
  2022-02-03 18:40 ` [PATCH net-next 2/2] net: stmmac: move to threaded IRQ Yannick Vignon
  2022-02-03 19:08 ` [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI Eric Dumazet
@ 2022-02-03 19:41 ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-03 19:41 UTC (permalink / raw)
  To: Yannick Vignon
  Cc: Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S. Miller, Jakub Kicinski, Maxime Coquelin, Eric Dumazet,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon

On 2022-02-03 19:40:30 [+0100], Yannick Vignon wrote:
> From: Yannick Vignon <yannick.vignon@nxp.com>
> 
> If NAPI was not scheduled from interrupt or softirq,
> __raise_softirq_irqoff would mark the softirq pending, but not
> wake up ksoftirqd. With force threaded IRQs, this is
> compensated by the fact that the interrupt handlers are
> protected inside a local_bh_disable()/local_bh_enable()

This is not compensation; it is one of the reasons why the bh section was
added in the first place.

> section, and bh_enable will call do_softirq if needed. With
> normal threaded IRQs however, this is no longer the case
> (unless the interrupt handler itself calls local_bh_enable()),

Exactly.

> whic results in a pending softirq not being handled, and the
> following message being printed out from tick-sched.c:
> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"

Yes. This also includes various other call sites.

> Call raise_softirq_irqoff instead to make sure ksoftirqd is
> woken up in such a case, ensuring __napi_schedule, etc behave
> normally in more situations than just from an interrupt,
> softirq or from within a bh_disable/bh_enable section.

I would suggest adding a bh dis/en section around the function that is
known to raise BH. The change to ____napi_schedule() you suggest will
wake ksoftirqd, which is not what you want. What you want is to process
NAPI in your current context.

> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>

Sebastian


* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-03 19:08 ` [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI Eric Dumazet
@ 2022-02-03 23:40   ` Yannick Vignon
  2022-02-03 23:57     ` Eric Dumazet
  2022-02-04  1:09     ` Jakub Kicinski
  0 siblings, 2 replies; 22+ messages in thread
From: Yannick Vignon @ 2022-02-03 23:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S. Miller, Jakub Kicinski, Maxime Coquelin, Antoine Tenart,
	Alexander Lobakin, Sebastian Andrzej Siewior, Paolo Abeni,
	Wei Wang, Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann,
	netdev, Vladimir Oltean, Xiaoliang Yang, mingkai.hu,
	Joakim Zhang, sebastien.laveze, Yannick Vignon

On 2/3/2022 8:08 PM, Eric Dumazet wrote:
> On Thu, Feb 3, 2022 at 11:06 AM Yannick Vignon
> <yannick.vignon@oss.nxp.com> wrote:
>>
>> From: Yannick Vignon <yannick.vignon@nxp.com>
>>
>> If NAPI was not scheduled from interrupt or softirq,
>> __raise_softirq_irqoff would mark the softirq pending, but not
>> wake up ksoftirqd. With force threaded IRQs, this is
>> compensated by the fact that the interrupt handlers are
>> protected inside a local_bh_disable()/local_bh_enable()
>> section, and bh_enable will call do_softirq if needed. With
>> normal threaded IRQs however, this is no longer the case
>> (unless the interrupt handler itself calls local_bh_enable()),
>> which results in a pending softirq not being handled, and the
>> following message being printed out from tick-sched.c:
>> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"
>>
>> Call raise_softirq_irqoff instead to make sure ksoftirqd is
>> woken up in such a case, ensuring __napi_schedule, etc behave
>> normally in more situations than just from an interrupt,
>> softirq or from within a bh_disable/bh_enable section.
>>
> 
> This is buggy. NAPI is called from the right context.
> 
> Can you provide a stack trace or something, so that the buggy driver
> can be fixed ?
> 

Maybe some background on how I came to this would be helpful. I have 
been chasing down sources of latencies in processing rx packets on a 
PREEMPT_RT kernel and the stmmac driver. I observed that the main ones 
were bh_dis/en sections, preventing even my high-prio, (force-)threaded 
rx irq from being handled in a timely manner. Given that explicitly 
threaded irq handlers were not enclosed in a bh_dis/en section, and that 
from what I saw the stmmac interrupt handler didn't need such a 
protection anyway, I modified the stmmac driver to request threaded 
interrupts. This worked, save for that "NOHZ" message: because 
__napi_schedule was now called from a kernel thread context, the softirq 
was no longer triggered.
(note that the problem solves itself when enabling threaded NAPI)

Is there a rule saying we shouldn't call __napi_schedule from a regular 
kernel thread, and in particular a threaded interrupt handler?

>> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
>> ---
>>   net/core/dev.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 1baab07820f6..f93b3173454c 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4239,7 +4239,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>>          }
>>
>>          list_add_tail(&napi->poll_list, &sd->poll_list);
>> -       __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>> +       raise_softirq_irqoff(NET_RX_SOFTIRQ);
>>   }
>>
>>   #ifdef CONFIG_RPS
>> --
>> 2.25.1
>>



* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-03 23:40   ` Yannick Vignon
@ 2022-02-03 23:57     ` Eric Dumazet
  2022-02-04  1:09     ` Jakub Kicinski
  1 sibling, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2022-02-03 23:57 UTC (permalink / raw)
  To: Yannick Vignon
  Cc: Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S. Miller, Jakub Kicinski, Maxime Coquelin, Antoine Tenart,
	Alexander Lobakin, Sebastian Andrzej Siewior, Paolo Abeni,
	Wei Wang, Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann,
	netdev, Vladimir Oltean, Xiaoliang Yang, mingkai.hu,
	Joakim Zhang, sebastien.laveze, Yannick Vignon

On Thu, Feb 3, 2022 at 3:40 PM Yannick Vignon
<yannick.vignon@oss.nxp.com> wrote:
>
> On 2/3/2022 8:08 PM, Eric Dumazet wrote:
> > On Thu, Feb 3, 2022 at 11:06 AM Yannick Vignon
> > <yannick.vignon@oss.nxp.com> wrote:
> >>
> >> From: Yannick Vignon <yannick.vignon@nxp.com>
> >>
> >> If NAPI was not scheduled from interrupt or softirq,
> >> __raise_softirq_irqoff would mark the softirq pending, but not
> >> wake up ksoftirqd. With force threaded IRQs, this is
> >> compensated by the fact that the interrupt handlers are
> >> protected inside a local_bh_disable()/local_bh_enable()
> >> section, and bh_enable will call do_softirq if needed. With
> >> normal threaded IRQs however, this is no longer the case
> >> (unless the interrupt handler itself calls local_bh_enable()),
> >> which results in a pending softirq not being handled, and the
> >> following message being printed out from tick-sched.c:
> >> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"
> >>
> >> Call raise_softirq_irqoff instead to make sure ksoftirqd is
> >> woken up in such a case, ensuring __napi_schedule, etc behave
> >> normally in more situations than just from an interrupt,
> >> softirq or from within a bh_disable/bh_enable section.
> >>
> >
> > This is buggy. NAPI is called from the right context.
> >
> > Can you provide a stack trace or something, so that the buggy driver
> > can be fixed ?
> >
>
> Maybe some background on how I came to this would be helpful. I have
> been chasing down sources of latencies in processing rx packets on a
> PREEMPT_RT kernel and the stmmac driver. I observed that the main ones
> were bh_dis/en sections, preventing even my high-prio, (force-)threaded
> rx irq from being handled in a timely manner. Given that explicitly
> threaded irq handlers were not enclosed in a bh_dis/en section, and that
> from what I saw the stmmac interrupt handler didn't need such a
> protection anyway, I modified the stmmac driver to request threaded
> interrupts. This worked, save for that "NOHZ" message: because
> __napi_schedule was now called from a kernel thread context, the softirq
> was no longer triggered.
> (note that the problem solves itself when enabling threaded NAPI)
>
> Is there a rule saying we shouldn't call __napi_schedule from a regular
> kernel thread, and in particular a threaded interrupt handler?

The rule is that you need to be in a safe context before calling
__napi_schedule()

This has been the case for more than a decade.

We are not going to slow down the stack just in case a process context
does not care about BH.

Please check:

commit ec13ee80145ccb95b00e6e610044bbd94a170051
Author: Michael S. Tsirkin <mst@redhat.com>
Date:   Wed May 16 10:57:12 2012 +0300

    virtio_net: invoke softirqs after __napi_schedule


>
> >> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
> >> ---
> >>   net/core/dev.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index 1baab07820f6..f93b3173454c 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -4239,7 +4239,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> >>          }
> >>
> >>          list_add_tail(&napi->poll_list, &sd->poll_list);
> >> -       __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >> +       raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >>   }
> >>
> >>   #ifdef CONFIG_RPS
> >> --
> >> 2.25.1
> >>
>


* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-03 23:40   ` Yannick Vignon
  2022-02-03 23:57     ` Eric Dumazet
@ 2022-02-04  1:09     ` Jakub Kicinski
  2022-02-04  8:19       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2022-02-04  1:09 UTC (permalink / raw)
  To: Yannick Vignon
  Cc: Eric Dumazet, Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S. Miller, Maxime Coquelin, Antoine Tenart,
	Alexander Lobakin, Sebastian Andrzej Siewior, Paolo Abeni,
	Wei Wang, Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann,
	netdev, Vladimir Oltean, Xiaoliang Yang, mingkai.hu,
	Joakim Zhang, sebastien.laveze, Yannick Vignon

On Fri, 4 Feb 2022 00:40:41 +0100 Yannick Vignon wrote:
> Maybe some background on how I came to this would be helpful. I have 
> been chasing down sources of latencies in processing rx packets on a 
> PREEMPT_RT kernel and the stmmac driver. I observed that the main ones 
> were bh_dis/en sections, preventing even my high-prio, (force-)threaded 
> rx irq from being handled in a timely manner. Given that explicitly 
> threaded irq handlers were not enclosed in a bh_dis/en section, and that 
> from what I saw the stmmac interrupt handler didn't need such a 
> protection anyway, I modified the stmmac driver to request threaded 
> interrupts. This worked, save for that "NOHZ" message: because 
> __napi_schedule was now called from a kernel thread context, the softirq 
> was no longer triggered.
> (note that the problem solves itself when enabling threaded NAPI)

Let's be clear that the problem only exists when switching to threaded
IRQs on _non_ PREEMPT_RT kernel (or old kernels). We already have a
check in __napi_schedule_irqoff() which should handle your problem on
PREEMPT_RT.

We should slap a lockdep warning for non-irq contexts in
____napi_schedule(), I think; it was proposed but got lost.


* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-04  1:09     ` Jakub Kicinski
@ 2022-02-04  8:19       ` Sebastian Andrzej Siewior
  2022-02-04 15:43         ` Jakub Kicinski
  0 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-04  8:19 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Yannick Vignon, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon

On 2022-02-03 17:09:01 [-0800], Jakub Kicinski wrote:
> Let's be clear that the problem only exists when switching to threaded
> IRQs on _non_ PREEMPT_RT kernel (or old kernels). We already have a
> check in __napi_schedule_irqoff() which should handle your problem on
> PREEMPT_RT.

It does not. The problem is the missing bh-off/on around the call. The
forced-threaded handler has this. His explicit threaded-handler does not
and needs it.

> We should slap a lockdep warning for non-irq contexts in
> > ____napi_schedule(), I think; it was proposed but got lost.

Something like this perhaps?:

diff --git a/net/core/dev.c b/net/core/dev.c
index 1baab07820f65..11c5f003d1591 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4217,6 +4217,9 @@ static inline void ____napi_schedule(struct softnet_data *sd,
 {
 	struct task_struct *thread;
 
+	lockdep_assert_once(hardirq_count() | softirq_count());
+	lockdep_assert_irqs_disabled();
+
 	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
 		/* Paired with smp_mb__before_atomic() in
 		 * napi_enable()/dev_set_threaded().

Be aware that this (the first assert) will trigger in dev_cpu_dead() and
needs a bh-off/on around. I should have something in my RT tree :)

Sebastian


* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-04  8:19       ` Sebastian Andrzej Siewior
@ 2022-02-04 15:43         ` Jakub Kicinski
  2022-02-04 17:15           ` Yannick Vignon
  0 siblings, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2022-02-04 15:43 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Yannick Vignon, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon

On Fri, 4 Feb 2022 09:19:22 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-03 17:09:01 [-0800], Jakub Kicinski wrote:
> > Let's be clear that the problem only exists when switching to threaded
> > IRQs on _non_ PREEMPT_RT kernel (or old kernels). We already have a
> > check in __napi_schedule_irqoff() which should handle your problem on
> > PREEMPT_RT.  
> 
> It does not. The problem is the missing bh-off/on around the call. The
> forced-threaded handler has this. His explicit threaded-handler does not
> and needs it.

I see; what I was getting at is that on PREEMPT_RT, IRQs are already
threaded, so I thought the patch was only targeting non-RT. I didn't
realize that explicitly threading IRQs is advantageous on RT as well.

> > We should slap a lockdep warning for non-irq contexts in
> > ____napi_schedule(), I think; it was proposed but got lost.
> 
> Something like this perhaps?:
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1baab07820f65..11c5f003d1591 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4217,6 +4217,9 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>  {
>  	struct task_struct *thread;
>  
> +	lockdep_assert_once(hardirq_count() | softirq_count());
> +	lockdep_assert_irqs_disabled();
> +
>  	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
>  		/* Paired with smp_mb__before_atomic() in
>  		 * napi_enable()/dev_set_threaded().

👍 maybe with a comment above the first one saying that we want to make
sure softirq will be handled somewhere down the callstack. Possibly push
it as a helper in lockdep.h called "lockdep_assert_softirq_will_run()"
so it's self-explanatory?

> Be aware that this (the first assert) will trigger in dev_cpu_dead() and
> needs a bh-off/on around. I should have something in my RT tree :)

Or we could push the asserts only into the driver-facing helpers
(__napi_schedule(), __napi_schedule_irqoff()).


* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-04 15:43         ` Jakub Kicinski
@ 2022-02-04 17:15           ` Yannick Vignon
  2022-02-04 17:36             ` Jakub Kicinski
  2022-02-04 17:45             ` Jakub Kicinski
  0 siblings, 2 replies; 22+ messages in thread
From: Yannick Vignon @ 2022-02-04 17:15 UTC (permalink / raw)
  To: Jakub Kicinski, Sebastian Andrzej Siewior
  Cc: Eric Dumazet, Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S. Miller, Maxime Coquelin, Antoine Tenart,
	Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon

On 2/4/2022 4:43 PM, Jakub Kicinski wrote:
> On Fri, 4 Feb 2022 09:19:22 +0100 Sebastian Andrzej Siewior wrote:
>> On 2022-02-03 17:09:01 [-0800], Jakub Kicinski wrote:
>>> Let's be clear that the problem only exists when switching to threaded
>>> IRQs on _non_ PREEMPT_RT kernel (or old kernels). We already have a
>>> check in __napi_schedule_irqoff() which should handle your problem on
>>> PREEMPT_RT.
>>
>> It does not. The problem is the missing bh-off/on around the call. The
>> forced-threaded handler has this. His explicit threaded-handler does not
>> and needs it.
> 
> I see; what I was getting at is that on PREEMPT_RT, IRQs are already
> threaded, so I thought the patch was only targeting non-RT. I didn't
> realize that explicitly threading IRQs is advantageous on RT as well.
> 

Something I forgot to mention is that the final use case I care about 
uses threaded NAPI (because of the improvement it gives when processing 
latency-sensitive network streams). And in that case, __napi_schedule is 
simply waking up the NAPI thread, no softirq is needed, and my 
controversial change isn't even needed for the whole system to work 
properly.

>>> We should slap a lockdep warning for non-irq contexts in
>>> ____napi_schedule(), I think; it was proposed but got lost.
>>
>> Something like this perhaps?:
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 1baab07820f65..11c5f003d1591 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4217,6 +4217,9 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>>   {
>>   	struct task_struct *thread;
>>   
>> +	lockdep_assert_once(hardirq_count() | softirq_count());
>> +	lockdep_assert_irqs_disabled();
>> +
>>   	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
>>   		/* Paired with smp_mb__before_atomic() in
>>   		 * napi_enable()/dev_set_threaded().
> 
> 👍 maybe with a comment above the first one saying that we want to make
> sure softirq will be handled somewhere down the callstack. Possibly push
> it as a helper in lockdep.h called "lockdep_assert_softirq_will_run()"
> so it's self-explanatory?
> 
>> Be aware that this (the first assert) will trigger in dev_cpu_dead() and
>> needs a bh-off/on around. I should have something in my RT tree :)
> 
> Or we could push the asserts only into the driver-facing helpers
> (__napi_schedule(), __napi_schedule_irqoff()).

As I explained above, everything is working fine when using threaded 
NAPI. Why then forbid such a use case?

How about something like this instead, in the (stmmac) threaded
interrupt handler:
if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
	__napi_schedule(napi);
} else {
	local_bh_disable();
	__napi_schedule(napi);
	local_bh_enable();
}

Then in __napi_schedule, add the lockdep checks, but __below__ the "if 
(threaded) { ... }" block.

Would that be an acceptable change? Because really, the whole point of 
my patchqueue is to remove latencies imposed on network interrupts by 
bh_disable/enable sections. If moving to explicitly threaded IRQs means 
the bh_disable/enable section is simply moved down the path and around 
__napi_schedule, there is just no point.



* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-04 17:15           ` Yannick Vignon
@ 2022-02-04 17:36             ` Jakub Kicinski
  2022-02-04 17:45             ` Jakub Kicinski
  1 sibling, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2022-02-04 17:36 UTC (permalink / raw)
  To: Yannick Vignon
  Cc: Sebastian Andrzej Siewior, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon

On Fri, 4 Feb 2022 18:15:40 +0100 Yannick Vignon wrote:
> >> Be aware that this (the first assert) will trigger in dev_cpu_dead() and
> >> needs a bh-off/on around. I should have something in my RT tree :)  
> > 
> > Or we could push the asserts only into the driver-facing helpers
> > (__napi_schedule(), __napi_schedule_irqoff()).  
> 
> As I explained above, everything is working fine when using threaded 
> NAPI. Why then forbid such a use case?
> 
> How about something like this instead:
> in the (stmmac) threaded interrupt handler:
> if (test_bit(NAPI_STATE_THREADED, &napi->state))
> 	__napi_schedule();
> else {
> 	local_bh_disable();
> 	__napi_schedule();
> 	local_bh_enable();
> }

Looks slightly racy, we check the bit again in ____napi_schedule() and
it may change in between.

> Then in __napi_schedule, add the lockdep checks, but __below__ the "if 
> (threaded) { ... }" block.
> 
> Would that be an acceptable change? Because really, the whole point of 
> my patchqueue is to remove latencies imposed on network interrupts by 
> bh_disable/enable sections. If moving to explicitly threaded IRQs means 
> the bh_disable/enable section is simply moved down the path and around 
> __napi_schedule, there is just no point.

IMHO seems reasonable as long as it's coded up neatly.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-04 17:15           ` Yannick Vignon
  2022-02-04 17:36             ` Jakub Kicinski
@ 2022-02-04 17:45             ` Jakub Kicinski
  2022-02-04 18:03               ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2022-02-04 17:45 UTC (permalink / raw)
  To: Yannick Vignon
  Cc: Sebastian Andrzej Siewior, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon

On Fri, 4 Feb 2022 18:15:40 +0100 Yannick Vignon wrote:
> > I see, what I was getting at is on PREEMPT_RT IRQs are already threaded
> > so I thought the patch was only targeting non-RT, I didn't think that
> > explicitly threading IRQ is advantageous also on RT.
> 
> Something I forgot to mention is that the final use case I care about 
> uses threaded NAPI (because of the improvement it gives when processing 
> latency-sensitive network streams). And in that case, __napi_schedule is 
> simply waking up the NAPI thread, no softirq is needed, and my 
> controversial change isn't even needed for the whole system to work 
> properly.

Coincidentally, I believe the threaded NAPI wake up is buggy - 
we assume the thread is only woken up when NAPI gets scheduled,
but IIUC signal delivery and other rare paths may wake up kthreads,
randomly.
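
A defensive shape for the thread loop (a sketch only, not the actual
mainline code) would treat the wakeup as a hint and re-check an explicit
state bit, so a stray wakeup simply puts the thread back to sleep:

```c
/* Sketch: NAPI kthread loop that tolerates spurious wakeups. A
 * SCHED_THREADED-style state bit (name assumed here) records whether
 * NAPI was really scheduled; a wakeup from e.g. signal delivery finds
 * the bit clear and the thread sleeps again instead of polling.
 */
static int napi_thread_fn(void *data)
{
	struct napi_struct *napi = data;

	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (!test_bit(NAPI_STATE_SCHED_THREADED, &napi->state)) {
			schedule();	/* stray wakeups loop back here */
			continue;
		}
		__set_current_state(TASK_RUNNING);
		/* ... run the poll loop, then clear the bit ... */
	}
	return 0;
}
```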

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-04 17:45             ` Jakub Kicinski
@ 2022-02-04 18:03               ` Sebastian Andrzej Siewior
  2022-02-04 18:50                 ` Jakub Kicinski
  0 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-04 18:03 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Yannick Vignon, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon

On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> Coincidentally, I believe the threaded NAPI wake up is buggy - 
> we assume the thread is only woken up when NAPI gets scheduled,
> but IIUC signal delivery and other rare paths may wake up kthreads,
> randomly.

I had to look into NAPI-threads for some reason.
What I dislike is that after enabling it via sysfs I have to:
- adjust the task priority manually so it is preferred over other
  threads. This is usually important on RT. But then there is no
  overload protection.

- set an affinity-mask for the thread so it does not migrate from one
  CPU to the other. This is worse for an RT task, where the scheduler
  tries to keep the task running.

Wouldn't it work to utilize the threaded-IRQ API and use that instead
of the custom thread? Basically the primary handler would do what it
already does (disable the interrupt) and the threaded handler would feed
packets into the stack. In the overload case one would need to lower the
thread-priority.

Sebastian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-04 18:03               ` Sebastian Andrzej Siewior
@ 2022-02-04 18:50                 ` Jakub Kicinski
  2022-02-04 18:52                   ` Jakub Kicinski
                                     ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Jakub Kicinski @ 2022-02-04 18:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Yannick Vignon, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon

On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> > Coincidentally, I believe the threaded NAPI wake up is buggy - 
> > we assume the thread is only woken up when NAPI gets scheduled,
> > but IIUC signal delivery and other rare paths may wake up kthreads,
> > randomly.  
> 
> I had to look into NAPI-threads for some reason.
> What I dislike is that after enabling it via sysfs I have to:
> - adjust task priority manual so it is preferred over other threads.
>   This is usually important on RT. But then there is no overload
>   protection.
> 
> - set an affinity-mask for the thread so it does not migrate from one
>   CPU to the other. This is worse for a RT task where the scheduler
>   tries to keep the task running.
> 
> Wouldn't it work to utilize the threaded-IRQ API and use that instead
> the custom thread? Basically the primary handler would what it already
> does (disable the interrupt) and the threaded handler would feed packets
> into the stack. In the overload case one would need to lower the
> thread-priority.

Sounds like an interesting direction if you ask me! That said I have
not been able to make threaded NAPI useful in my experiments / with my
workloads so I'd defer to Wei for confirmation.

To be clear -- are you suggesting that drivers just switch to threaded
NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
dynamically engages a thread in a generic fashion?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-04 18:50                 ` Jakub Kicinski
@ 2022-02-04 18:52                   ` Jakub Kicinski
  2022-02-08 11:51                   ` Sebastian Andrzej Siewior
  2022-02-09 11:26                   ` Marc Kleine-Budde
  2 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2022-02-04 18:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Yannick Vignon, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon

On Fri, 4 Feb 2022 10:50:35 -0800 Jakub Kicinski wrote:
> To be clear -- are you suggesting that drivers just switch to threaded
> NAPI,

s/NAPI/IRQs/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-04 18:50                 ` Jakub Kicinski
  2022-02-04 18:52                   ` Jakub Kicinski
@ 2022-02-08 11:51                   ` Sebastian Andrzej Siewior
  2022-02-08 15:35                     ` Jakub Kicinski
  2022-02-08 15:57                     ` Paolo Abeni
  2022-02-09 11:26                   ` Marc Kleine-Budde
  2 siblings, 2 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-08 11:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Yannick Vignon, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon, Thomas Gleixner

On 2022-02-04 10:50:35 [-0800], Jakub Kicinski wrote:
> On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:
> > On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> > > Coincidentally, I believe the threaded NAPI wake up is buggy - 
> > > we assume the thread is only woken up when NAPI gets scheduled,
> > > but IIUC signal delivery and other rare paths may wake up kthreads,
> > > randomly.  
> > 
> > I had to look into NAPI-threads for some reason.
> > What I dislike is that after enabling it via sysfs I have to:
> > - adjust task priority manual so it is preferred over other threads.
> >   This is usually important on RT. But then there is no overload
> >   protection.
> > 
> > - set an affinity-mask for the thread so it does not migrate from one
> >   CPU to the other. This is worse for a RT task where the scheduler
> >   tries to keep the task running.
> > 
> > Wouldn't it work to utilize the threaded-IRQ API and use that instead
> > the custom thread? Basically the primary handler would what it already
> > does (disable the interrupt) and the threaded handler would feed packets
> > into the stack. In the overload case one would need to lower the
> > thread-priority.
> 
> Sounds like an interesting direction if you ask me! That said I have
> not been able to make threaded NAPI useful in my experiments / with my
> workloads so I'd defer to Wei for confirmation.
> 
> To be clear -- are you suggesting that drivers just switch to threaded
> NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
> dynamically engages a thread in a generic fashion?

Uhm, kind of, yes.

Now you have
	request_irq(, handler_irq);
	netif_napi_add(, , handler_napi);

The handler_irq() disables the interrupt line and schedules the softirq
to process handler_napi(). Once handler_napi() is done it re-enables the
interrupt line, otherwise it will be processed again on the next tick.
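
In code, the pattern described here (with foo_* names made up for
illustration) looks roughly like:

```c
/* Sketch of the conventional split: the hard-IRQ handler only masks
 * the interrupt and schedules NAPI; the poll function drains packets
 * into the stack and unmasks the interrupt only once it is done.
 */
static irqreturn_t foo_irq(int irq, void *dev_id)
{
	struct foo_priv *priv = dev_id;

	foo_disable_rx_irq(priv);	/* mask the line */
	napi_schedule(&priv->napi);	/* softirq or NAPI thread runs poll */
	return IRQ_HANDLED;
}

static int foo_poll(struct napi_struct *napi, int budget)
{
	struct foo_priv *priv = container_of(napi, struct foo_priv, napi);
	int work = foo_rx(priv, budget);	/* feed packets into the stack */

	if (work < budget && napi_complete_done(napi, work))
		foo_enable_rx_irq(priv);	/* unmask only when done */
	return work;
}
```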

If you enable threaded NAPI then you end up with a thread and the
softirq is no longer used. I don't know what the next action is but I
guess you search for that thread and pin it manually to CPU and assign a
RT priority (probably, otherwise it will compete with other tasks for
CPU resources).

Instead we could have
	request_threaded_irq(, handler_irq, handler_napi);

And we would have basically the same outcome, except that handler_napi()
runs at SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and
the CPU affinity is adjusted if the IRQ affinity is changed).
We would still have to work out the details of what handler_irq() is
allowed to do and how to handle one IRQ with multiple handler_napi().

If you wrap request_threaded_irq() in something like request_napi_irq()
then you could switch between the former (softirq-based) and the latter
(thread-based) NAPI handling (since you have all the needed details).
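
A hypothetical request_napi_irq() along those lines (not an existing
kernel API, purely a sketch; the napi_threaded switch is assumed) might
look like:

```c
/* Sketch only: one registration point that can run NAPI either via
 * the softirq (classic request_irq() path) or via a threaded IRQ
 * handler, e.g. driven by a sysfs knob.
 */
static int request_napi_irq(unsigned int irq, irq_handler_t quick_check,
			    irq_handler_t napi_fn, unsigned long flags,
			    const char *name, void *dev)
{
	if (napi_threaded)
		return request_threaded_irq(irq, quick_check, napi_fn,
					    flags, name, dev);
	/* softirq-based NAPI: quick_check masks the line and calls
	 * napi_schedule() as usual */
	return request_irq(irq, quick_check, flags, name, dev);
}
```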

Sebastian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-08 11:51                   ` Sebastian Andrzej Siewior
@ 2022-02-08 15:35                     ` Jakub Kicinski
  2022-02-08 17:45                       ` Sebastian Andrzej Siewior
  2022-02-08 15:57                     ` Paolo Abeni
  1 sibling, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2022-02-08 15:35 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Yannick Vignon, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon, Thomas Gleixner

On Tue, 8 Feb 2022 12:51:07 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-04 10:50:35 [-0800], Jakub Kicinski wrote:
> > On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:  
> > > I had to look into NAPI-threads for some reason.
> > > What I dislike is that after enabling it via sysfs I have to:
> > > - adjust task priority manual so it is preferred over other threads.
> > >   This is usually important on RT. But then there is no overload
> > >   protection.
> > > 
> > > - set an affinity-mask for the thread so it does not migrate from one
> > >   CPU to the other. This is worse for a RT task where the scheduler
> > >   tries to keep the task running.
> > > 
> > > Wouldn't it work to utilize the threaded-IRQ API and use that instead
> > > the custom thread? Basically the primary handler would what it already
> > > does (disable the interrupt) and the threaded handler would feed packets
> > > into the stack. In the overload case one would need to lower the
> > > thread-priority.  
> > 
> > Sounds like an interesting direction if you ask me! That said I have
> > not been able to make threaded NAPI useful in my experiments / with my
> > workloads so I'd defer to Wei for confirmation.
> > 
> > To be clear -- are you suggesting that drivers just switch to threaded
> > NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
> > dynamically engages a thread in a generic fashion?  
> 
> Uhm, kind of, yes.
> 
> Now you have
> 	request_irq(, handler_irq);
> 	netif_napi_add(, , handler_napi);
> 
> The handler_irq() disables the interrupt line and schedules the softirq
> to process handler_napi(). Once handler_napi() is it re-enables the
> interrupt line otherwise it will be processed again on the next tick.
> 
> If you enable threaded NAPI then you end up with a thread and the
> softirq is no longer used. I don't know what the next action is but I
> guess you search for that thread and pin it manually to CPU and assign a
> RT priority (probably, otherwise it will compete with other tasks for
> CPU resources).

FWIW I don't think servers would want RT prio.

> Instead we could have
> 	request_threaded_irq(, handler_irq, handler_napi);
> 
> And we would have basically the same outcome. Except that handler_napi()
> runs that SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and
> the CPU affinity is adjusted if the IRQ-affinity is changed).
> We would still have to work out the details what handler_irq() is
> allowed to do and how to handle one IRQ and multiple handler_napi().
> 
> If you wrap request_threaded_irq() in something like request_napi_irq()
> the you could switch between the former (softirq) and later (thread)
> based NAPI handling (since you have all the needed details).

One use case to watch out for is drivers which explicitly moved 
to threaded NAPI because they want to schedule multiple threads 
(NAPIs) from a single IRQ to spread processing across more cores.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-08 11:51                   ` Sebastian Andrzej Siewior
  2022-02-08 15:35                     ` Jakub Kicinski
@ 2022-02-08 15:57                     ` Paolo Abeni
  2022-02-08 18:21                       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 22+ messages in thread
From: Paolo Abeni @ 2022-02-08 15:57 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Jakub Kicinski
  Cc: Yannick Vignon, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon, Thomas Gleixner

On Tue, 2022-02-08 at 12:51 +0100, Sebastian Andrzej Siewior wrote:
> On 2022-02-04 10:50:35 [-0800], Jakub Kicinski wrote:
> > On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:
> > > On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> > > > Coincidentally, I believe the threaded NAPI wake up is buggy - 
> > > > we assume the thread is only woken up when NAPI gets scheduled,
> > > > but IIUC signal delivery and other rare paths may wake up kthreads,
> > > > randomly.  
> > > 
> > > I had to look into NAPI-threads for some reason.
> > > What I dislike is that after enabling it via sysfs I have to:
> > > - adjust task priority manual so it is preferred over other threads.
> > >   This is usually important on RT. But then there is no overload
> > >   protection.
> > > 
> > > - set an affinity-mask for the thread so it does not migrate from one
> > >   CPU to the other. This is worse for a RT task where the scheduler
> > >   tries to keep the task running.
> > > 
> > > Wouldn't it work to utilize the threaded-IRQ API and use that instead
> > > the custom thread? Basically the primary handler would what it already
> > > does (disable the interrupt) and the threaded handler would feed packets
> > > into the stack. In the overload case one would need to lower the
> > > thread-priority.
> > 
> > Sounds like an interesting direction if you ask me! That said I have
> > not been able to make threaded NAPI useful in my experiments / with my
> > workloads so I'd defer to Wei for confirmation.
> > 
> > To be clear -- are you suggesting that drivers just switch to threaded
> > NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
> > dynamically engages a thread in a generic fashion?
> 
> Uhm, kind of, yes.
> 
> Now you have
> 	request_irq(, handler_irq);
> 	netif_napi_add(, , handler_napi);
> 
> The handler_irq() disables the interrupt line and schedules the softirq
> to process handler_napi(). Once handler_napi() is it re-enables the
> interrupt line otherwise it will be processed again on the next tick.
> 
> If you enable threaded NAPI then you end up with a thread and the
> softirq is no longer used. I don't know what the next action is but I
> guess you search for that thread and pin it manually to CPU and assign a
> RT priority (probably, otherwise it will compete with other tasks for
> CPU resources).
> 
> Instead we could have
> 	request_threaded_irq(, handler_irq, handler_napi);
> 
> And we would have basically the same outcome. Except that handler_napi()
> runs that SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and
> the CPU affinity is adjusted if the IRQ-affinity is changed).
> We would still have to work out the details what handler_irq() is
> allowed to do and how to handle one IRQ and multiple handler_napi().
> 
> If you wrap request_threaded_irq() in something like request_napi_irq()
> the you could switch between the former (softirq) and later (thread)
> based NAPI handling (since you have all the needed details).

just for historic reference:

https://lkml.org/lkml/2016/6/15/460

I think that running the thread performing the NAPI loop with
SCHED_FIFO would be dangerous WRT DDoS. Even the affinity setting can
give mixed results depending on the workload - unless you do a good
static CPU allocation, pinning each process manually, it is not really
a generic setup.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-08 15:35                     ` Jakub Kicinski
@ 2022-02-08 17:45                       ` Sebastian Andrzej Siewior
  2022-02-09  0:16                         ` Jakub Kicinski
  0 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-08 17:45 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Yannick Vignon, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon, Thomas Gleixner

On 2022-02-08 07:35:20 [-0800], Jakub Kicinski wrote:
> > If you enable threaded NAPI then you end up with a thread and the
> > softirq is no longer used. I don't know what the next action is but I
> > guess you search for that thread and pin it manually to CPU and assign a
> > RT priority (probably, otherwise it will compete with other tasks for
> > CPU resources).
> 
> FWIW I don't think servers would want RT prio.

but then the NAPI thread is treated the same way as your xz -9.

> > Instead we could have
> > 	request_threaded_irq(, handler_irq, handler_napi);
> > 
> > And we would have basically the same outcome. Except that handler_napi()
> > runs that SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and
> > the CPU affinity is adjusted if the IRQ-affinity is changed).
> > We would still have to work out the details what handler_irq() is
> > allowed to do and how to handle one IRQ and multiple handler_napi().
> > 
> > If you wrap request_threaded_irq() in something like request_napi_irq()
> > the you could switch between the former (softirq) and later (thread)
> > based NAPI handling (since you have all the needed details).
> 
> One use case to watch out for is drivers which explicitly moved 
> to threaded NAPI because they want to schedule multiple threads 
> (NAPIs) from a single IRQ to spread processing across more cores.

the request_napi_irq() could have a sysfs switch (like we currently
have).
But you mentioned one IRQ and multiple NAPI threads, to distribute
processing across cores. The usual case is that you have one IRQ for a
network queue and this network queue has one NAPI struct, right?

So in that case you would have one IRQ but two network queues, each
with one NAPI struct? And then you use napi-threads and manually pin
queue-napi#1 to CPU#1 and queue-napi#2 to CPU#2, while the IRQ itself
fires on CPU#1?

Sebastian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-08 15:57                     ` Paolo Abeni
@ 2022-02-08 18:21                       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-08 18:21 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Jakub Kicinski, Yannick Vignon, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon, Thomas Gleixner

On 2022-02-08 16:57:59 [+0100], Paolo Abeni wrote:
> just for historic reference:
> 
> https://lkml.org/lkml/2016/6/15/460

that is
  https://lore.kernel.org/all/cover.1465996447.git.pabeni@redhat.com

let me digest that later…

> I think that running the thread performing the NAPI loop with
> SCHED_FIFO would be dangerous WRT DDOS. Even the affinity setting can
> give mixed results depending on the workload - unless you do good
> static CPUs allocation pinning each process manually, not really a
> generic setup.

The DDoS part is where I meant we need to figure out the details. Since
it is a threaded interrupt we could do msleep() as a break, which is
similar to what softirq does. Detecting such a case might be difficult
since the budget is per-thread only and does not involve other NAPI
structs on the same CPU, like the backlog.

The performance is usually best if the IRQ and threaded handler are
running on the same CPU. The win with the NAPI thread seems to be that
if two NAPIs fire on the same CPU then the scheduler will see two tasks
which will be moved (at some point) to different CPUs. And in a DDoS
like situation they can constantly push new skbs into the stack and
SCHED_OTHER ensures that the user can still do something. Without napi
threads, both NAPIs would be processed on the same CPU (in order) and
eventually moved to ksoftirqd where it will take a break under high
load.
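
The msleep() break could look something like this in a threaded handler
(sketch only; the foo_* names and FOO_BUDGET are made up):

```c
/* Sketch: yield under sustained load, similar to how ksoftirqd takes
 * a break, so an RT-priority NAPI thread cannot monopolize the CPU.
 */
static irqreturn_t foo_napi_thread_fn(int irq, void *dev_id)
{
	struct foo_priv *priv = dev_id;
	int work;

	do {
		work = foo_rx(priv, FOO_BUDGET);
		if (work == FOO_BUDGET)
			msleep(1);	/* overload break */
	} while (work == FOO_BUDGET);

	foo_enable_rx_irq(priv);	/* done, unmask the line */
	return IRQ_HANDLED;
}
```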

> Cheers,
> 
> Paolo

Sebastian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-08 17:45                       ` Sebastian Andrzej Siewior
@ 2022-02-09  0:16                         ` Jakub Kicinski
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2022-02-09  0:16 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Yannick Vignon, Eric Dumazet, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, David S. Miller, Maxime Coquelin,
	Antoine Tenart, Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon, Thomas Gleixner

On Tue, 8 Feb 2022 18:45:53 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-08 07:35:20 [-0800], Jakub Kicinski wrote:
> > > If you enable threaded NAPI then you end up with a thread and the
> > > softirq is no longer used. I don't know what the next action is but I
> > > guess you search for that thread and pin it manually to CPU and assign a
> > > RT priority (probably, otherwise it will compete with other tasks for
> > > CPU resources).  
> > 
> > FWIW I don't think servers would want RT prio.  
> 
> but then the NAPI thread is treated the same way like your xz -9.

You'd either pin the workload away from the network processing cores 
or, if there's no pinning, prefer requests to run to completion to
achieve lower latency.

> > > Instead we could have
> > > 	request_threaded_irq(, handler_irq, handler_napi);
> > > 
> > > And we would have basically the same outcome. Except that handler_napi()
> > > runs that SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and
> > > the CPU affinity is adjusted if the IRQ-affinity is changed).
> > > We would still have to work out the details what handler_irq() is
> > > allowed to do and how to handle one IRQ and multiple handler_napi().
> > > 
> > > If you wrap request_threaded_irq() in something like request_napi_irq()
> > > the you could switch between the former (softirq) and later (thread)
> > > based NAPI handling (since you have all the needed details).  
> > 
> > One use case to watch out for is drivers which explicitly moved 
> > to threaded NAPI because they want to schedule multiple threads 
> > (NAPIs) from a single IRQ to spread processing across more cores.  
> 
> the request_napi_irq() could have a sysfs switch (like we currently
> have).
> But you mentioned one IRQ and multiple NAPI threads, to distribute
> across core. The usual case is that you have one IRQ for a network queue
> and this network queue has one NAPI struct, right?
> 
> In the case where you would have one IRQ but two network queues, each
> with one NAPI struct? And then you use the napi-threads and pin manually
> queue-napi#1 to CPU#1 and queue-napi#2 to CPU#2 while the IRQ itself
> fires on CPU#1?

It was mt76, the WiFi driver, I looked in the morning and I think it
had a NAPI for Tx and 2 NAPIs for Rx. The scheduler can spread them
around. Felix will have the exact details.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
  2022-02-04 18:50                 ` Jakub Kicinski
  2022-02-04 18:52                   ` Jakub Kicinski
  2022-02-08 11:51                   ` Sebastian Andrzej Siewior
@ 2022-02-09 11:26                   ` Marc Kleine-Budde
  2 siblings, 0 replies; 22+ messages in thread
From: Marc Kleine-Budde @ 2022-02-09 11:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Sebastian Andrzej Siewior, Yannick Vignon, Eric Dumazet,
	Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S. Miller, Maxime Coquelin, Antoine Tenart,
	Alexander Lobakin, Paolo Abeni, Wei Wang,
	Kumar Kartikeya Dwivedi, Yunsheng Lin, Arnd Bergmann, netdev,
	Vladimir Oltean, Xiaoliang Yang, mingkai.hu, Joakim Zhang,
	sebastien.laveze, Yannick Vignon

On 04.02.2022 10:50:35, Jakub Kicinski wrote:
> > Wouldn't it work to utilize the threaded-IRQ API and use that instead
> > the custom thread? Basically the primary handler would what it already
> > does (disable the interrupt) and the threaded handler would feed packets
> > into the stack. In the overload case one would need to lower the
> > thread-priority.
> 
> Sounds like an interesting direction if you ask me! That said I have
> not been able to make threaded NAPI useful in my experiments / with my
> workloads so I'd defer to Wei for confirmation.

FWIW: We have a threaded NAPI use case. On a PREEMPT_RT enabled imx6 (4
core 32 bit ARM) the NAPI of the CAN controller (flexcan) is running in
threaded mode with raised priorities. This reduces latency spikes and
jitter of RX'ed CAN frames, which are part of a closed control loop.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-02-09 12:07 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-03 18:40 [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI Yannick Vignon
2022-02-03 18:40 ` [PATCH net-next 2/2] net: stmmac: move to threaded IRQ Yannick Vignon
2022-02-03 19:08 ` [PATCH net-next 1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI Eric Dumazet
2022-02-03 23:40   ` Yannick Vignon
2022-02-03 23:57     ` Eric Dumazet
2022-02-04  1:09     ` Jakub Kicinski
2022-02-04  8:19       ` Sebastian Andrzej Siewior
2022-02-04 15:43         ` Jakub Kicinski
2022-02-04 17:15           ` Yannick Vignon
2022-02-04 17:36             ` Jakub Kicinski
2022-02-04 17:45             ` Jakub Kicinski
2022-02-04 18:03               ` Sebastian Andrzej Siewior
2022-02-04 18:50                 ` Jakub Kicinski
2022-02-04 18:52                   ` Jakub Kicinski
2022-02-08 11:51                   ` Sebastian Andrzej Siewior
2022-02-08 15:35                     ` Jakub Kicinski
2022-02-08 17:45                       ` Sebastian Andrzej Siewior
2022-02-09  0:16                         ` Jakub Kicinski
2022-02-08 15:57                     ` Paolo Abeni
2022-02-08 18:21                       ` Sebastian Andrzej Siewior
2022-02-09 11:26                   ` Marc Kleine-Budde
2022-02-03 19:41 ` Sebastian Andrzej Siewior
