All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next] mlx4: do not fire tasklet unless necessary
@ 2017-02-10 12:27 Eric Dumazet
  2017-02-14 16:29 ` David Miller
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Eric Dumazet @ 2017-02-10 12:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tariq Toukan, Saeed Mahameed

From: Eric Dumazet <edumazet@google.com>

All rx and rx netdev interrupts are handled by respectively
by mlx4_en_rx_irq() and mlx4_en_tx_irq() which simply schedule a NAPI.

But mlx4_eq_int() also fires a tasklet to service all items that were
queued via mlx4_add_cq_to_tasklet(), but this handler was not called
unless user cqe was handled.

This is very confusing, as "mpstat -I SCPU ..." show huge number of
tasklet invocations.

This patch saves this overhead, by carefully firing the tasklet directly
from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/cq.c |    6 +++++-
 drivers/net/ethernet/mellanox/mlx4/eq.c |    9 +--------
 2 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index 6b8635378f1fcb2aae4e8ac390bcd09d552c2256..fa6d2354a0e910ee160863e3cbe21a512d77bf03 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -81,8 +81,9 @@ void mlx4_cq_tasklet_cb(unsigned long data)
 
 static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
 {
-	unsigned long flags;
 	struct mlx4_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
+	unsigned long flags;
+	bool kick;
 
 	spin_lock_irqsave(&tasklet_ctx->lock, flags);
 	/* When migrating CQs between EQs will be implemented, please note
@@ -92,7 +93,10 @@ static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
 	 */
 	if (list_empty_careful(&cq->tasklet_ctx.list)) {
 		atomic_inc(&cq->refcount);
+		kick = list_empty(&tasklet_ctx->list);
 		list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
+		if (kick)
+			tasklet_schedule(&tasklet_ctx->task);
 	}
 	spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index 0509996957d9664b612358dd805359f4bc67b8dc..39232b6a974f4b4b961d3b0b8634f04e6b9d0caa 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -494,7 +494,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_eqe *eqe;
-	int cqn = -1;
+	int cqn;
 	int eqes_found = 0;
 	int set_ci = 0;
 	int port;
@@ -840,13 +840,6 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
 
 	eq_set_ci(eq, 1);
 
-	/* cqn is 24bit wide but is initialized such that its higher bits
-	 * are ones too. Thus, if we got any event, cqn's high bits should be off
-	 * and we need to schedule the tasklet.
-	 */
-	if (!(cqn & ~0xffffff))
-		tasklet_schedule(&eq->tasklet_ctx.task);
-
 	return eqes_found;
 }
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-10 12:27 [PATCH net-next] mlx4: do not fire tasklet unless necessary Eric Dumazet
@ 2017-02-14 16:29 ` David Miller
  2017-02-15 11:10 ` Saeed Mahameed
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2017-02-14 16:29 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tariqt, saeedm

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 10 Feb 2017 04:27:58 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> All rx and rx netdev interrupts are handled by respectively
> by mlx4_en_rx_irq() and mlx4_en_tx_irq() which simply schedule a NAPI.
> 
> But mlx4_eq_int() also fires a tasklet to service all items that were
> queued via mlx4_add_cq_to_tasklet(), but this handler was not called
> unless user cqe was handled.
> 
> This is very confusing, as "mpstat -I SCPU ..." show huge number of
> tasklet invocations.
> 
> This patch saves this overhead, by carefully firing the tasklet directly
> from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>

Tariq/Saeed, please review.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-10 12:27 [PATCH net-next] mlx4: do not fire tasklet unless necessary Eric Dumazet
  2017-02-14 16:29 ` David Miller
@ 2017-02-15 11:10 ` Saeed Mahameed
  2017-02-15 13:29   ` Eric Dumazet
  2017-02-15 14:52   ` Matan Barak (External)
  2017-02-16 12:38 ` Saeed Mahameed
  2017-02-17 15:55 ` David Miller
  3 siblings, 2 replies; 22+ messages in thread
From: Saeed Mahameed @ 2017-02-15 11:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

On Fri, Feb 10, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> All rx and rx netdev interrupts are handled by respectively
> by mlx4_en_rx_irq() and mlx4_en_tx_irq() which simply schedule a NAPI.
>
> But mlx4_eq_int() also fires a tasklet to service all items that were
> queued via mlx4_add_cq_to_tasklet(), but this handler was not called
> unless user cqe was handled.
>
> This is very confusing, as "mpstat -I SCPU ..." show huge number of
> tasklet invocations.
>
> This patch saves this overhead, by carefully firing the tasklet directly
> from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/cq.c |    6 +++++-
>  drivers/net/ethernet/mellanox/mlx4/eq.c |    9 +--------
>  2 files changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
> index 6b8635378f1fcb2aae4e8ac390bcd09d552c2256..fa6d2354a0e910ee160863e3cbe21a512d77bf03 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/cq.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
> @@ -81,8 +81,9 @@ void mlx4_cq_tasklet_cb(unsigned long data)
>
>  static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
>  {
> -       unsigned long flags;
>         struct mlx4_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
> +       unsigned long flags;
> +       bool kick;
>
>         spin_lock_irqsave(&tasklet_ctx->lock, flags);
>         /* When migrating CQs between EQs will be implemented, please note
> @@ -92,7 +93,10 @@ static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
>          */
>         if (list_empty_careful(&cq->tasklet_ctx.list)) {
>                 atomic_inc(&cq->refcount);
> +               kick = list_empty(&tasklet_ctx->list);

So first one in would fire the tasklet, but wouldn't this cause CQE
processing loss
in the same mlx4_eq_int loop if the tasklet was fast enough to
schedule and while other CQEs are going to add themselves to the
tasklet_ctx->list ?

Anyway i tried to find race scenarios that could cause such thing but
synchronization looks good.

>                 list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
> +               if (kick)
> +                       tasklet_schedule(&tasklet_ctx->task);
>         }
>         spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
>  }
> diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
> index 0509996957d9664b612358dd805359f4bc67b8dc..39232b6a974f4b4b961d3b0b8634f04e6b9d0caa 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
> @@ -494,7 +494,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
>  {
>         struct mlx4_priv *priv = mlx4_priv(dev);
>         struct mlx4_eqe *eqe;
> -       int cqn = -1;
> +       int cqn;
>         int eqes_found = 0;
>         int set_ci = 0;
>         int port;
> @@ -840,13 +840,6 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
>
>         eq_set_ci(eq, 1);
>
> -       /* cqn is 24bit wide but is initialized such that its higher bits
> -        * are ones too. Thus, if we got any event, cqn's high bits should be off
> -        * and we need to schedule the tasklet.
> -        */
> -       if (!(cqn & ~0xffffff))

what if we simply change this condition to:
if (!list_empty_careful(eq->tasklet_ctx.list))

Wouldn't this be sort of equivalent to what you did ? and this way we
would simply fire the tasklet only when needed and not on every
handled CQE.

> -               tasklet_schedule(&eq->tasklet_ctx.task);
> -
>         return eqes_found;
>  }
>
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-15 11:10 ` Saeed Mahameed
@ 2017-02-15 13:29   ` Eric Dumazet
  2017-02-15 13:59     ` Saeed Mahameed
  2017-02-15 14:04     ` Eric Dumazet
  2017-02-15 14:52   ` Matan Barak (External)
  1 sibling, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2017-02-15 13:29 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

On Wed, 2017-02-15 at 13:10 +0200, Saeed Mahameed wrote:
> On Fri, Feb 10, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > All rx and rx netdev interrupts are handled by respectively
> > by mlx4_en_rx_irq() and mlx4_en_tx_irq() which simply schedule a NAPI.
> >
> > But mlx4_eq_int() also fires a tasklet to service all items that were
> > queued via mlx4_add_cq_to_tasklet(), but this handler was not called
> > unless user cqe was handled.
> >
> > This is very confusing, as "mpstat -I SCPU ..." show huge number of
> > tasklet invocations.
> >
> > This patch saves this overhead, by carefully firing the tasklet directly
> > from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>
> > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx4/cq.c |    6 +++++-
> >  drivers/net/ethernet/mellanox/mlx4/eq.c |    9 +--------
> >  2 files changed, 6 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
> > index 6b8635378f1fcb2aae4e8ac390bcd09d552c2256..fa6d2354a0e910ee160863e3cbe21a512d77bf03 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/cq.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
> > @@ -81,8 +81,9 @@ void mlx4_cq_tasklet_cb(unsigned long data)
> >
> >  static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
> >  {
> > -       unsigned long flags;
> >         struct mlx4_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
> > +       unsigned long flags;
> > +       bool kick;
> >
> >         spin_lock_irqsave(&tasklet_ctx->lock, flags);
> >         /* When migrating CQs between EQs will be implemented, please note
> > @@ -92,7 +93,10 @@ static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
> >          */
> >         if (list_empty_careful(&cq->tasklet_ctx.list)) {
> >                 atomic_inc(&cq->refcount);
> > +               kick = list_empty(&tasklet_ctx->list);
> 
> So first one in would fire the tasklet, but wouldn't this cause CQE
> processing loss
> in the same mlx4_eq_int loop if the tasklet was fast enough to
> schedule and while other CQEs are going to add themselves to the
> tasklet_ctx->list ?


mlx4_eq_int() is a hard irq handler.

How a tasklet could run in the middle of it ?

A tasklet is a softirq handler.

softirq must wait that the current hard irq handler is done.
> 
> Anyway i tried to find race scenarios that could cause such thing but
> synchronization looks good.
> 
> >                 list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
> > +               if (kick)
> > +                       tasklet_schedule(&tasklet_ctx->task);
> >         }
> >         spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
> >  }
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
> > index 0509996957d9664b612358dd805359f4bc67b8dc..39232b6a974f4b4b961d3b0b8634f04e6b9d0caa 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/eq.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
> > @@ -494,7 +494,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
> >  {
> >         struct mlx4_priv *priv = mlx4_priv(dev);
> >         struct mlx4_eqe *eqe;
> > -       int cqn = -1;
> > +       int cqn;
> >         int eqes_found = 0;
> >         int set_ci = 0;
> >         int port;
> > @@ -840,13 +840,6 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
> >
> >         eq_set_ci(eq, 1);
> >
> > -       /* cqn is 24bit wide but is initialized such that its higher bits
> > -        * are ones too. Thus, if we got any event, cqn's high bits should be off
> > -        * and we need to schedule the tasklet.
> > -        */
> > -       if (!(cqn & ~0xffffff))
> 
> what if we simply change this condition to:
> if (!list_empty_careful(eq->tasklet_ctx.list))
> 
> Wouldn't this be sort of equivalent to what you did ? and this way we
> would simply fire the tasklet only when needed and not on every
> handled CQE.

Still this test would be done one million time per second on my hosts.

What is the point exactly ?

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-15 13:29   ` Eric Dumazet
@ 2017-02-15 13:59     ` Saeed Mahameed
  2017-02-15 14:34       ` Eric Dumazet
  2017-02-15 14:04     ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Saeed Mahameed @ 2017-02-15 13:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

On Wed, Feb 15, 2017 at 3:29 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-02-15 at 13:10 +0200, Saeed Mahameed wrote:
>> On Fri, Feb 10, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > From: Eric Dumazet <edumazet@google.com>
>> >
>> > All rx and rx netdev interrupts are handled by respectively
>> > by mlx4_en_rx_irq() and mlx4_en_tx_irq() which simply schedule a NAPI.
>> >
>> > But mlx4_eq_int() also fires a tasklet to service all items that were
>> > queued via mlx4_add_cq_to_tasklet(), but this handler was not called
>> > unless user cqe was handled.
>> >
>> > This is very confusing, as "mpstat -I SCPU ..." show huge number of
>> > tasklet invocations.
>> >
>> > This patch saves this overhead, by carefully firing the tasklet directly
>> > from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
>> >
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> > Cc: Tariq Toukan <tariqt@mellanox.com>
>> > Cc: Saeed Mahameed <saeedm@mellanox.com>
>> > ---
>> >  drivers/net/ethernet/mellanox/mlx4/cq.c |    6 +++++-
>> >  drivers/net/ethernet/mellanox/mlx4/eq.c |    9 +--------
>> >  2 files changed, 6 insertions(+), 9 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
>> > index 6b8635378f1fcb2aae4e8ac390bcd09d552c2256..fa6d2354a0e910ee160863e3cbe21a512d77bf03 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/cq.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
>> > @@ -81,8 +81,9 @@ void mlx4_cq_tasklet_cb(unsigned long data)
>> >
>> >  static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
>> >  {
>> > -       unsigned long flags;
>> >         struct mlx4_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
>> > +       unsigned long flags;
>> > +       bool kick;
>> >
>> >         spin_lock_irqsave(&tasklet_ctx->lock, flags);
>> >         /* When migrating CQs between EQs will be implemented, please note
>> > @@ -92,7 +93,10 @@ static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
>> >          */
>> >         if (list_empty_careful(&cq->tasklet_ctx.list)) {
>> >                 atomic_inc(&cq->refcount);
>> > +               kick = list_empty(&tasklet_ctx->list);
>>
>> So first one in would fire the tasklet, but wouldn't this cause CQE
>> processing loss
>> in the same mlx4_eq_int loop if the tasklet was fast enough to
>> schedule and while other CQEs are going to add themselves to the
>> tasklet_ctx->list ?
>
>
> mlx4_eq_int() is a hard irq handler.
>
> How a tasklet could run in the middle of it ?
>

can the tasklet run on a different core ?

> A tasklet is a softirq handler.
>
> softirq must wait that the current hard irq handler is done.
>>
>> Anyway i tried to find race scenarios that could cause such thing but
>> synchronization looks good.
>>
>> >                 list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
>> > +               if (kick)
>> > +                       tasklet_schedule(&tasklet_ctx->task);
>> >         }
>> >         spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
>> >  }
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
>> > index 0509996957d9664b612358dd805359f4bc67b8dc..39232b6a974f4b4b961d3b0b8634f04e6b9d0caa 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/eq.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
>> > @@ -494,7 +494,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
>> >  {
>> >         struct mlx4_priv *priv = mlx4_priv(dev);
>> >         struct mlx4_eqe *eqe;
>> > -       int cqn = -1;
>> > +       int cqn;
>> >         int eqes_found = 0;
>> >         int set_ci = 0;
>> >         int port;
>> > @@ -840,13 +840,6 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
>> >
>> >         eq_set_ci(eq, 1);
>> >
>> > -       /* cqn is 24bit wide but is initialized such that its higher bits
>> > -        * are ones too. Thus, if we got any event, cqn's high bits should be off
>> > -        * and we need to schedule the tasklet.
>> > -        */
>> > -       if (!(cqn & ~0xffffff))
>>
>> what if we simply change this condition to:
>> if (!list_empty_careful(eq->tasklet_ctx.list))
>>
>> Wouldn't this be sort of equivalent to what you did ? and this way we
>> would simply fire the tasklet only when needed and not on every
>> handled CQE.
>
> Still this test would be done one million time per second on my hosts.
>
> What is the point exactly ?
>

the point is that if the EQ is full of CQEs from different CQs you would
do the "  kick = list_empty(&tasklet_ctx->list);" test per empty CQ
list rather than once at the end.

in mlx4_en case, you have only two CQs on each EQ but in RoCE/IB you
can have as many CQs as you want.

> Thanks.
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-15 13:29   ` Eric Dumazet
  2017-02-15 13:59     ` Saeed Mahameed
@ 2017-02-15 14:04     ` Eric Dumazet
  2017-02-16 12:44       ` Saeed Mahameed
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2017-02-15 14:04 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

On Wed, 2017-02-15 at 05:29 -0800, Eric Dumazet wrote:

> 
> mlx4_eq_int() is a hard irq handler.
> 
> How a tasklet could run in the middle of it ?
> 
> A tasklet is a softirq handler.

Speaking of mlx4_eq_int() , 50% of cycles are spent on mb() (mfence)
in eq_set_ci()

I wonder why this very expensive mb() is required, right before exiting
the interrupt handler.
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-15 13:59     ` Saeed Mahameed
@ 2017-02-15 14:34       ` Eric Dumazet
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2017-02-15 14:34 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

On Wed, 2017-02-15 at 15:59 +0200, Saeed Mahameed wrote:

> can the tasklet run on a different core ?

No. tasklets are scheduled on local cpu, like softirqs in general.

They are not migrated, unless cpu is removed (hotplug)


> 
> the point is that if the EQ is full of CQEs from different CQs you would
> do the "  kick = list_empty(&tasklet_ctx->list);" test per empty CQ
> list rather than once at the end.
> 
> in mlx4_en case, you have only two CQs on each EQ but in RoCE/IB you
> can have as many CQs as you want.

list_empty() before one list_add_tail() is a single instruction.

By doing this right before manipulating the list, it comes with a zero
cache line penalty.

While if you do it at the wrong place, it incurs one extra cache line
miss.

This extra tasklet invocations can cost 3 % of cpu cycles.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-15 11:10 ` Saeed Mahameed
  2017-02-15 13:29   ` Eric Dumazet
@ 2017-02-15 14:52   ` Matan Barak (External)
  2017-02-15 15:26     ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Matan Barak (External) @ 2017-02-15 14:52 UTC (permalink / raw)
  To: Saeed Mahameed, Eric Dumazet
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, jackm

On 15/02/2017 13:10, Saeed Mahameed wrote:
> On Fri, Feb 10, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> From: Eric Dumazet <edumazet@google.com>
>>
>> All rx and rx netdev interrupts are handled by respectively
>> by mlx4_en_rx_irq() and mlx4_en_tx_irq() which simply schedule a NAPI.
>>
>> But mlx4_eq_int() also fires a tasklet to service all items that were
>> queued via mlx4_add_cq_to_tasklet(), but this handler was not called
>> unless user cqe was handled.
>>
>> This is very confusing, as "mpstat -I SCPU ..." show huge number of
>> tasklet invocations.
>>
>> This patch saves this overhead, by carefully firing the tasklet directly
>> from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
>>

So, in case of RDMA CQs, we add some per-CQE overhead of comparing the 
list pointers and condition upon that. Maybe we could add an 
invoke_tasklet boolean field on mlx4_cq and return its value from 
mlx4_cq_completion.
That's way we could do invoke_tasklet |= mlx4_cq_completion(....);

Outside the while loop we could just
if (invoke_tasklet)
     tasklet_schedule

Anyway, I guess that even with per-CQE overhead, the performance impact 
here is pretty negligible - so I guess that's fine too :)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-15 14:52   ` Matan Barak (External)
@ 2017-02-15 15:26     ` Eric Dumazet
  2017-02-16 12:34       ` Saeed Mahameed
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2017-02-15 15:26 UTC (permalink / raw)
  To: Matan Barak (External)
  Cc: Saeed Mahameed, David Miller, netdev, Tariq Toukan,
	Saeed Mahameed, jackm

On Wed, 2017-02-15 at 16:52 +0200, Matan Barak (External) wrote:

> So, in case of RDMA CQs, we add some per-CQE overhead of comparing the 
> list pointers and condition upon that. Maybe we could add an 
> invoke_tasklet boolean field on mlx4_cq and return its value from 
> mlx4_cq_completion.
> That's way we could do invoke_tasklet |= mlx4_cq_completion(....);
> 
> Outside the while loop we could just
> if (invoke_tasklet)
>      tasklet_schedule
> 
> Anyway, I guess that even with per-CQE overhead, the performance impact 
> here is pretty negligible - so I guess that's fine too :)


Real question or suggestion would be to use/fire a tasklet only under
stress.

Firing a tasklet adds a lot of latencies for user-space CQ completion,
since softirqs might have to be handled by a kernel thread (ksoftirqd)

I would be surprised if no customer was hit by your commit,
( net/mlx4_core: Use tasklet for user-space CQ completion events )
especially when using specific (RT) scheduler classes.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-15 15:26     ` Eric Dumazet
@ 2017-02-16 12:34       ` Saeed Mahameed
  0 siblings, 0 replies; 22+ messages in thread
From: Saeed Mahameed @ 2017-02-16 12:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Matan Barak (External),
	David Miller, netdev, Tariq Toukan, Saeed Mahameed, jackm

On Wed, Feb 15, 2017 at 5:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-02-15 at 16:52 +0200, Matan Barak (External) wrote:
>
>> So, in case of RDMA CQs, we add some per-CQE overhead of comparing the
>> list pointers and condition upon that. Maybe we could add an
>> invoke_tasklet boolean field on mlx4_cq and return its value from
>> mlx4_cq_completion.
>> That's way we could do invoke_tasklet |= mlx4_cq_completion(....);
>>
>> Outside the while loop we could just
>> if (invoke_tasklet)
>>      tasklet_schedule
>>
>> Anyway, I guess that even with per-CQE overhead, the performance impact
>> here is pretty negligible - so I guess that's fine too :)
>
>
> Real question or suggestion would be to use/fire a tasklet only under
> stress.
>
> Firing a tasklet adds a lot of latencies for user-space CQ completion,
> since softirqs might have to be handled by a kernel thread (ksoftirqd)
>

At least for mlx4_en driver we don't need this tasklet and it is only
adding this overhead. (we have napi)

we must consider removing it for mlx4_en cqs and move the tasklet
handling to mlx4_ib.

I will ack the patch.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-10 12:27 [PATCH net-next] mlx4: do not fire tasklet unless necessary Eric Dumazet
  2017-02-14 16:29 ` David Miller
  2017-02-15 11:10 ` Saeed Mahameed
@ 2017-02-16 12:38 ` Saeed Mahameed
  2017-02-16 15:30   ` Eric Dumazet
  2017-02-17 15:55 ` David Miller
  3 siblings, 1 reply; 22+ messages in thread
From: Saeed Mahameed @ 2017-02-16 12:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed

On Fri, Feb 10, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> All rx and rx netdev interrupts are handled by respectively
> by mlx4_en_rx_irq() and mlx4_en_tx_irq() which simply schedule a NAPI.
>
> But mlx4_eq_int() also fires a tasklet to service all items that were
> queued via mlx4_add_cq_to_tasklet(), but this handler was not called
> unless user cqe was handled.
>
> This is very confusing, as "mpstat -I SCPU ..." show huge number of
> tasklet invocations.
>
> This patch saves this overhead, by carefully firing the tasklet directly
> from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-15 14:04     ` Eric Dumazet
@ 2017-02-16 12:44       ` Saeed Mahameed
  2017-02-16 15:49         ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Saeed Mahameed @ 2017-02-16 12:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

On Wed, Feb 15, 2017 at 4:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-02-15 at 05:29 -0800, Eric Dumazet wrote:
>
>>
>> mlx4_eq_int() is a hard irq handler.
>>
>> How a tasklet could run in the middle of it ?
>>
>> A tasklet is a softirq handler.
>
> Speaking of mlx4_eq_int() , 50% of cycles are spent on mb() (mfence)
> in eq_set_ci()
>

I wonder why you have so many interrupts ? don't you have some kind of
interrupt moderation ?
what test are you running that got your CPU so busy.

> I wonder why this very expensive mb() is required, right before exiting
> the interrupt handler.

to make sure the HW knows we handled Completions up to (ci) consumer
index. so it will generate next irq.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-16 12:38 ` Saeed Mahameed
@ 2017-02-16 15:30   ` Eric Dumazet
  2017-02-18  2:46     ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2017-02-16 15:30 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed

On Thu, 2017-02-16 at 14:38 +0200, Saeed Mahameed wrote:

> Acked-by: Saeed Mahameed <saeedm@mellanox.com>

Thanks for reviewing this Saeed !

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-16 12:44       ` Saeed Mahameed
@ 2017-02-16 15:49         ` Eric Dumazet
  2017-02-16 21:17           ` Saeed Mahameed
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2017-02-16 15:49 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

On Thu, 2017-02-16 at 14:44 +0200, Saeed Mahameed wrote:
> On Wed, Feb 15, 2017 at 4:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2017-02-15 at 05:29 -0800, Eric Dumazet wrote:
> >
> >>
> >> mlx4_eq_int() is a hard irq handler.
> >>
> >> How a tasklet could run in the middle of it ?
> >>
> >> A tasklet is a softirq handler.
> >
> > Speaking of mlx4_eq_int() , 50% of cycles are spent on mb() (mfence)
> > in eq_set_ci()
> >
> 
> I wonder why you have so many interrupts ? don't you have some kind of
> interrupt moderation ?
> what test are you running that got your CPU so busy.


Simply 8 RX queues. About 140,000 irq per second per RX queue,
for a moderate network load on 40Gbit NIC.

Interrupt moderation is a latency killer, we want our usec back.

> 
> > I wonder why this very expensive mb() is required, right before exiting
> > the interrupt handler.
> 
> to make sure the HW knows we handled Completions up to (ci) consumer
> index. so it will generate next irq.


So why a mb() is needed exactly ?

wmb() seems enough.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-16 15:49         ` Eric Dumazet
@ 2017-02-16 21:17           ` Saeed Mahameed
  2017-02-16 23:13             ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Saeed Mahameed @ 2017-02-16 21:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

On Thu, Feb 16, 2017 at 5:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-02-16 at 14:44 +0200, Saeed Mahameed wrote:
>> On Wed, Feb 15, 2017 at 4:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Wed, 2017-02-15 at 05:29 -0800, Eric Dumazet wrote:
>> >
>> >>
>> >> mlx4_eq_int() is a hard irq handler.
>> >>
>> >> How a tasklet could run in the middle of it ?
>> >>
>> >> A tasklet is a softirq handler.
>> >
>> > Speaking of mlx4_eq_int() , 50% of cycles are spent on mb() (mfence)
>> > in eq_set_ci()
>> >
>>
>> I wonder why you have so many interrupts ? don't you have some kind of
>> interrupt moderation ?
>> what test are you running that got your CPU so busy.
>
>
> Simply 8 RX queues. About 140,000 irq per second per RX queue,
> for a moderate network load on 40Gbit NIC.
>

so i guess you are not busy polling .. and adaptive moderation decided
to lower down
rx-usecs for you, and you are looking to improve latency.

> Interrupt moderation is a latency killer, we want our usec back.
>

well, for RX we have adaptive moderation.
and for TX we have xmit more.
so interrupt moderation can be good if it is done right.

do i understand from this that you are against interrupt moderation ?

>>
>> > I wonder why this very expensive mb() is required, right before exiting
>> > the interrupt handler.
>>
>> to make sure the HW knows we handled Completions up to (ci) consumer
>> index. so it will generate next irq.
>
>
> So why a mb() is needed exactly ?
>
> wmb() seems enough.
>

Yes seems right, not sure why it is mb() though, i will have to check
and get back to you on this.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-16 21:17           ` Saeed Mahameed
@ 2017-02-16 23:13             ` Eric Dumazet
  2017-02-16 23:23               ` [PATCH net] mlx4: fix potential divide by 0 in mlx4_en_auto_moderation() Eric Dumazet
  2017-02-19 22:33               ` [PATCH net-next] mlx4: do not fire tasklet unless necessary Saeed Mahameed
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2017-02-16 23:13 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

On Thu, 2017-02-16 at 23:17 +0200, Saeed Mahameed wrote:

> so i guess you are not busy polling .. and adaptive moderation decided
> to lower down
> rx-usecs for you, and you are looking to improve latency.
> 
> > Interrupt moderation is a latency killer, we want our usec back.
> >
> 
> well, for RX we have adaptive moderation.
> and for TX we have xmit more.
> so interrupt moderation can be good if it is done right.
> 
> do i understand from this that you are against interrupt moderation ?

Absolutely against it. We prefer gro_flush_timeout.

The defaults values in mlx4 never were tuned for 40Gbit.

If you want to be able to reach 40Gbit on a single TCP flow, you want :

ethtool -C ethX rx-frames 16 rx-usecs-high 32

Or disable interrupt mitigation of course.


Interrupt moderation comes for free with NAPI really :

If under stress, number of interrupts will naturally decrease,
so what's the point with hardware _delaying_ an incoming packet
exactly ???

BTW interrupt moderation in mlx4 has at least two bugs.

I will send a patch.


> 
> >>
> >> > I wonder why this very expensive mb() is required, right before exiting
> >> > the interrupt handler.
> >>
> >> to make sure the HW knows we handled Completions up to (ci) consumer
> >> index. so it will generate next irq.
> >
> >
> > So why a mb() is needed exactly ?
> >
> > wmb() seems enough.
> >
> 
> Yes seems right, not sure why it is mb() though, i will have to check
> and get back to you on this.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net] mlx4: fix potential divide by 0 in mlx4_en_auto_moderation()
  2017-02-16 23:13             ` Eric Dumazet
@ 2017-02-16 23:23               ` Eric Dumazet
  2017-02-19 22:21                 ` Saeed Mahameed
  2017-02-19 22:33               ` [PATCH net-next] mlx4: do not fire tasklet unless necessary Saeed Mahameed
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2017-02-16 23:23 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

From: Eric Dumazet <edumazet@google.com>

1) In the case where rate == priv->pkt_rate_low == priv->pkt_rate_high,
mlx4_en_auto_moderation() does a divide by zero.

2) We want to properly change the moderation parameters if rx_frames
was changed (like in ethtool -C eth0 rx-frames 16)
 
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   24 ++++++++-------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 3b4961a8e8e44d6987ebd23f9239e747c7fc6cd5..ab3801d724ede4ab19958b2a1a214cfbe901c370 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1384,6 +1384,7 @@ static void mlx4_en_set_default_moderation(struct mlx4_en_priv *priv)
 static void mlx4_en_auto_moderation(struct mlx4_en_priv *priv)
 {
 	unsigned long period = (unsigned long) (jiffies - priv->last_moder_jiffies);
+	u32 pkt_rate_high, pkt_rate_low;
 	struct mlx4_en_cq *cq;
 	unsigned long packets;
 	unsigned long rate;
@@ -1397,37 +1398,40 @@ static void mlx4_en_auto_moderation(struct mlx4_en_priv *priv)
 	if (!priv->adaptive_rx_coal || period < priv->sample_interval * HZ)
 		return;
 
+	pkt_rate_low = READ_ONCE(priv->pkt_rate_low);
+	pkt_rate_high = READ_ONCE(priv->pkt_rate_high);
+
 	for (ring = 0; ring < priv->rx_ring_num; ring++) {
 		rx_packets = READ_ONCE(priv->rx_ring[ring]->packets);
 		rx_bytes = READ_ONCE(priv->rx_ring[ring]->bytes);
 
-		rx_pkt_diff = ((unsigned long) (rx_packets -
-				priv->last_moder_packets[ring]));
+		rx_pkt_diff = rx_packets - priv->last_moder_packets[ring];
 		packets = rx_pkt_diff;
 		rate = packets * HZ / period;
-		avg_pkt_size = packets ? ((unsigned long) (rx_bytes -
-				priv->last_moder_bytes[ring])) / packets : 0;
+		avg_pkt_size = packets ? (rx_bytes -
+				priv->last_moder_bytes[ring]) / packets : 0;
 
 		/* Apply auto-moderation only when packet rate
 		 * exceeds a rate that it matters */
 		if (rate > (MLX4_EN_RX_RATE_THRESH / priv->rx_ring_num) &&
 		    avg_pkt_size > MLX4_EN_AVG_PKT_SMALL) {
-			if (rate < priv->pkt_rate_low)
+			if (rate <= pkt_rate_low)
 				moder_time = priv->rx_usecs_low;
-			else if (rate > priv->pkt_rate_high)
+			else if (rate >= pkt_rate_high)
 				moder_time = priv->rx_usecs_high;
 			else
-				moder_time = (rate - priv->pkt_rate_low) *
+				moder_time = (rate - pkt_rate_low) *
 					(priv->rx_usecs_high - priv->rx_usecs_low) /
-					(priv->pkt_rate_high - priv->pkt_rate_low) +
+					(pkt_rate_high - pkt_rate_low) +
 					priv->rx_usecs_low;
 		} else {
 			moder_time = priv->rx_usecs_low;
 		}
 
-		if (moder_time != priv->last_moder_time[ring]) {
+		cq = priv->rx_cq[ring];
+		if (moder_time != priv->last_moder_time[ring] ||
+		    cq->moder_cnt != priv->rx_frames) {
 			priv->last_moder_time[ring] = moder_time;
-			cq = priv->rx_cq[ring];
 			cq->moder_time = moder_time;
 			cq->moder_cnt = priv->rx_frames;
 			err = mlx4_en_set_cq_moder(priv, cq);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-10 12:27 [PATCH net-next] mlx4: do not fire tasklet unless necessary Eric Dumazet
                   ` (2 preceding siblings ...)
  2017-02-16 12:38 ` Saeed Mahameed
@ 2017-02-17 15:55 ` David Miller
  3 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2017-02-17 15:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tariqt, saeedm

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 10 Feb 2017 04:27:58 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> All rx and rx netdev interrupts are handled by respectively
> by mlx4_en_rx_irq() and mlx4_en_tx_irq() which simply schedule a NAPI.
> 
> But mlx4_eq_int() also fires a tasklet to service all items that were
> queued via mlx4_add_cq_to_tasklet(), but this handler was not called
> unless user cqe was handled.
> 
> This is very confusing, as "mpstat -I SCPU ..." show huge number of
> tasklet invocations.
> 
> This patch saves this overhead, by carefully firing the tasklet directly
> from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-16 15:30   ` Eric Dumazet
@ 2017-02-18  2:46     ` Eric Dumazet
  2017-02-19 22:39       ` Saeed Mahameed
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2017-02-18  2:46 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed

On Thu, 2017-02-16 at 07:30 -0800, Eric Dumazet wrote:
> On Thu, 2017-02-16 at 14:38 +0200, Saeed Mahameed wrote:
> 
> > Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> 
> Thanks for reviewing this Saeed !

Note that mlx4_add_cq_to_tasklet() is called from hard irq context, so
we could replace the spin_lock_irqsave() by spin_lock()

Not that I care, but you might be interested ;)

Thanks.

diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index fa6d2354a0e910ee160863e3cbe21a512d77bf03..9dcac3a367425f2804e933842698c66685baf626 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -79,13 +79,13 @@ void mlx4_cq_tasklet_cb(unsigned long data)
 		tasklet_schedule(&ctx->task);
 }
 
+/* Called from hard irq handler */
 static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
 {
 	struct mlx4_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
-	unsigned long flags;
 	bool kick;
 
-	spin_lock_irqsave(&tasklet_ctx->lock, flags);
+	spin_lock(&tasklet_ctx->lock);
 	/* When migrating CQs between EQs will be implemented, please note
 	 * that you need to sync this point. It is possible that
 	 * while migrating a CQ, completions on the old EQs could
@@ -98,7 +98,7 @@ static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
 		if (kick)
 			tasklet_schedule(&tasklet_ctx->task);
 	}
-	spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
+	spin_unlock(&tasklet_ctx->lock);
 }
 
 void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH net] mlx4: fix potential divide by 0 in mlx4_en_auto_moderation()
  2017-02-16 23:23               ` [PATCH net] mlx4: fix potential divide by 0 in mlx4_en_auto_moderation() Eric Dumazet
@ 2017-02-19 22:21                 ` Saeed Mahameed
  0 siblings, 0 replies; 22+ messages in thread
From: Saeed Mahameed @ 2017-02-19 22:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

On Fri, Feb 17, 2017 at 1:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> 1) In the case where rate == priv->pkt_rate_low == priv->pkt_rate_high,
> mlx4_en_auto_moderation() does a divide by zero.
>
> 2) We want to properly change the moderation parameters if rx_frames
> was changed (like in ethtool -C eth0 rx-frames 16)
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-16 23:13             ` Eric Dumazet
  2017-02-16 23:23               ` [PATCH net] mlx4: fix potential divide by 0 in mlx4_en_auto_moderation() Eric Dumazet
@ 2017-02-19 22:33               ` Saeed Mahameed
  1 sibling, 0 replies; 22+ messages in thread
From: Saeed Mahameed @ 2017-02-19 22:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Matan Barak, jackm

On Fri, Feb 17, 2017 at 1:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-02-16 at 23:17 +0200, Saeed Mahameed wrote:
>
>> so i guess you are not busy polling .. and adaptive moderation decided
>> to lower down
>> rx-usecs for you, and you are looking to improve latency.
>>
>> > Interrupt moderation is a latency killer, we want our usec back.
>> >
>>
>> well, for RX we have adaptive moderation.
>> and for TX we have xmit more.
>> so interrupt moderation can be good if it is done right.
>>
>> do i understand from this that you are against interrupt moderation ?
>
> Absolutely against it. We prefer gro_flush_timeout.
>
> The defaults values in mlx4 never were tuned for 40Gbit.
>
> If you want to be able to reach 40Gbit on a single TCP flow, you want :
>
> ethtool -C ethX rx-frames 16 rx-usecs-high 32
>
> Or disable interrupt mitigation of course.
>
>
> Interrupt moderation comes for free with NAPI really :
>
> If under stress, number of interrupts will naturally decrease,
> so what's the point with hardware _delaying_ an incoming packet
> exactly ???
>

Theoretically NAPI provides some kind of interrupt delaying mechanism
(only under stress), but is it as good as explicit HW interrupt
moderation ?

for example in mlx5 the adaptive mechanism is more intelligent than
the one we have in mlx4.
it always tries to improve BW/packet rate, and reduce # of interrupts
per second.
and it will park at the optimal values.

As you know the less the interrupts overhead the better the latency.

I assume there are many cases where napi will not be good enough and
will keep running at max interrupt rate.
while the above mechanism will deliberately look for and find a better
interrupt rate for at least the same packet rate or even better one.
(Napi will not do that for you).

But we will have to measure before we decide napi as is is good enough.

> BTW interrupt moderation in mlx4 has at least two bugs.
>
> I will send a patch.

Saw it, nice catch !!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] mlx4: do not fire tasklet unless necessary
  2017-02-18  2:46     ` Eric Dumazet
@ 2017-02-19 22:39       ` Saeed Mahameed
  0 siblings, 0 replies; 22+ messages in thread
From: Saeed Mahameed @ 2017-02-19 22:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed, Majd Dibbiny,
	Matan Barak

On Sat, Feb 18, 2017 at 4:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-02-16 at 07:30 -0800, Eric Dumazet wrote:
>> On Thu, 2017-02-16 at 14:38 +0200, Saeed Mahameed wrote:
>>
>> > Acked-by: Saeed Mahameed <saeedm@mellanox.com>
>>
>> Thanks for reviewing this Saeed !
>
> Note that mlx4_add_cq_to_tasklet() is called from hard irq context, so
> we could replace the spin_lock_irqsave() by spin_lock()
>
> Not that I care, but you might be interested ;)
>

Sure we are !
we will take it from here :-).

Thank you Eric.

> Thanks.
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
> index fa6d2354a0e910ee160863e3cbe21a512d77bf03..9dcac3a367425f2804e933842698c66685baf626 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/cq.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
> @@ -79,13 +79,13 @@ void mlx4_cq_tasklet_cb(unsigned long data)
>                 tasklet_schedule(&ctx->task);
>  }
>
> +/* Called from hard irq handler */
>  static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
>  {
>         struct mlx4_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
> -       unsigned long flags;
>         bool kick;
>
> -       spin_lock_irqsave(&tasklet_ctx->lock, flags);
> +       spin_lock(&tasklet_ctx->lock);
>         /* When migrating CQs between EQs will be implemented, please note
>          * that you need to sync this point. It is possible that
>          * while migrating a CQ, completions on the old EQs could
> @@ -98,7 +98,7 @@ static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
>                 if (kick)
>                         tasklet_schedule(&tasklet_ctx->task);
>         }
> -       spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
> +       spin_unlock(&tasklet_ctx->lock);
>  }
>
>  void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-02-19 22:39 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-10 12:27 [PATCH net-next] mlx4: do not fire tasklet unless necessary Eric Dumazet
2017-02-14 16:29 ` David Miller
2017-02-15 11:10 ` Saeed Mahameed
2017-02-15 13:29   ` Eric Dumazet
2017-02-15 13:59     ` Saeed Mahameed
2017-02-15 14:34       ` Eric Dumazet
2017-02-15 14:04     ` Eric Dumazet
2017-02-16 12:44       ` Saeed Mahameed
2017-02-16 15:49         ` Eric Dumazet
2017-02-16 21:17           ` Saeed Mahameed
2017-02-16 23:13             ` Eric Dumazet
2017-02-16 23:23               ` [PATCH net] mlx4: fix potential divide by 0 in mlx4_en_auto_moderation() Eric Dumazet
2017-02-19 22:21                 ` Saeed Mahameed
2017-02-19 22:33               ` [PATCH net-next] mlx4: do not fire tasklet unless necessary Saeed Mahameed
2017-02-15 14:52   ` Matan Barak (External)
2017-02-15 15:26     ` Eric Dumazet
2017-02-16 12:34       ` Saeed Mahameed
2017-02-16 12:38 ` Saeed Mahameed
2017-02-16 15:30   ` Eric Dumazet
2017-02-18  2:46     ` Eric Dumazet
2017-02-19 22:39       ` Saeed Mahameed
2017-02-17 15:55 ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.