* Re: [WIP] net+mlx4: auto doorbell
2016-11-29 6:58 ` [WIP] net+mlx4: auto doorbell Eric Dumazet
@ 2016-11-30 11:38 ` Jesper Dangaard Brouer
2016-11-30 15:56 ` Eric Dumazet
2016-11-30 13:50 ` Saeed Mahameed
2016-12-01 21:32 ` Alexander Duyck
2 siblings, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-30 11:38 UTC (permalink / raw)
To: Eric Dumazet
Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, brouer, Achiad Shochat
[-- Attachment #1: Type: text/plain, Size: 3942 bytes --]
I've played with a somewhat similar patch (from Achiad Shochat) for
mlx5 (attached). While it gives huge improvements, the problem I ran
into was that TX performance became a function of the TX completion
time/interrupt, and could easily be throttled if configured too
high/slow.
Can your patch be affected by this too?
Adjustable via:
ethtool -C mlx5p2 tx-usecs 16 tx-frames 32
On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> not used.
>
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average: eth0 9.50 5707925.60 0.99 585285.69 0.00 0.00 0.50
> lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average: eth0 2.40 9985214.90 0.31 1023874.60 0.00 0.00 0.50
This +75% number is pktgen without "burst", and it definitely shows
that your patch activates xmit_more.
What is the pps performance number when using pktgen "burst" option?
> And about 11 % improvement on a mono-flow UDP_STREAM test.
>
> skb_set_owner_w() is now the most consuming function.
>
>
> lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> [1] 13696
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average: eth0 9.00 1307380.50 1.00 308356.18 0.00 0.00 0.50
> lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average: eth0 3.10 1459558.20 0.44 344267.57 0.00 0.00 0.50
The +11% number seems consistent with my perf observations that approx
12% was misleadingly attributed to the xmit spin_lock.
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
[...]
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> {
> - return ring->prod - ring->cons > ring->full_size;
> + return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> }
[...]
> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> }
> send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>
> + /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> + * will happen shortly.
> + */
> + if (send_doorbell &&
> + dev->doorbell_opt &&
> + (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
It would be nice to have a function call with an appropriate name,
instead of an open-coded queue size check. I'm also confused by the "ncons" name.
> + send_doorbell = false;
> +
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> */
> u32 last_nr_txbb;
> u32 cons;
> + u32 ncons;
Maybe we can find a better name than "ncons" ?
> unsigned long wake_queue;
> struct netdev_queue *tx_queue;
> u32 (*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>
> /* cache line used and dirtied in mlx4_en_xmit() */
> u32 prod ____cacheline_aligned_in_smp;
> + u32 prod_bell;
Good descriptive variable name.
> unsigned int tx_dropped;
> unsigned long bytes;
> unsigned long packets;
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
[-- Attachment #2: net_mlx5e__force_tx_skb_bulking.patch --]
[-- Type: text/x-patch, Size: 5079 bytes --]
From: Tariq Toukan <tariqt@mellanox.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>,
Achiad Shochat <achiad@mellanox.com>,
Rana Shahout <ranas@mellanox.com>,
Saeed Mahameed <saeedm@mellanox.com>
Subject: [PATCH] net/mlx5e: force tx skb bulking
Date: Wed, 17 Aug 2016 12:14:51 +0300
Message-Id: <1471425291-1782-1-git-send-email-tariqt@mellanox.com>
From: Achiad Shochat <achiad@mellanox.com>
To improve the SW message rate in cases where the HW is faster.
Heuristically detect cases where the message rate is high but there is
still no skb bulking, and if so, stop the txq for a while, trying to
force bulking.
Change-Id: Icb925135e69b030943cb4666117c47d1cc04da97
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 5 +++++
drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 9 ++++++++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 74edd01..78a0661 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -394,6 +394,10 @@ enum {
MLX5E_SQ_STATE_TX_TIMEOUT,
};
+enum {
+ MLX5E_SQ_STOP_ONCE,
+};
+
struct mlx5e_ico_wqe_info {
u8 opcode;
u8 num_wqebbs;
@@ -403,6 +407,7 @@ struct mlx5e_sq {
/* data path */
/* dirtied @completion */
+ unsigned long flags;
u16 cc;
u32 dma_fifo_cc;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index e073bf59..034eef0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -351,8 +351,10 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
- if (unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_SQ_STOP_ROOM))) {
+ if (test_bit(MLX5E_SQ_STOP_ONCE, &sq->flags) ||
+ unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_SQ_STOP_ROOM))) {
netif_tx_stop_queue(sq->txq);
+ clear_bit(MLX5E_SQ_STOP_ONCE, &sq->flags);
sq->stats.stopped++;
}
@@ -429,6 +431,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
u32 dma_fifo_cc;
u32 nbytes;
u16 npkts;
+ u16 ncqes;
u16 sqcc;
int i;
@@ -439,6 +442,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
npkts = 0;
nbytes = 0;
+ ncqes = 0;
/* sq->cc must be updated only after mlx5_cqwq_update_db_record(),
* otherwise a cq overrun may occur
@@ -458,6 +462,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
break;
mlx5_cqwq_pop(&cq->wq);
+ ncqes++;
wqe_counter = be16_to_cpu(cqe->wqe_counter);
@@ -508,6 +513,8 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
sq->dma_fifo_cc = dma_fifo_cc;
sq->cc = sqcc;
+ if ((npkts > 7) && ((npkts >> (ilog2(ncqes))) < 8))
+ set_bit(MLX5E_SQ_STOP_ONCE, &sq->flags);
netdev_tx_completed_queue(sq->txq, npkts, nbytes);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-11-30 11:38 ` Jesper Dangaard Brouer
@ 2016-11-30 15:56 ` Eric Dumazet
2016-11-30 19:17 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 62+ messages in thread
From: Eric Dumazet @ 2016-11-30 15:56 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat
On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> I've played with a somewhat similar patch (from Achiad Shochat) for
> mlx5 (attached). While it gives huge improvements, the problem I ran
> into was that TX performance became a function of the TX completion
> time/interrupt and could easily be throttled if configured too
> high/slow.
>
> Can your patch be affected by this too?
Like all TX business, you should know this, Jesper.
No need to constantly remind us of something very well known.
>
> Adjustable via:
> ethtool -C mlx5p2 tx-usecs 16 tx-frames 32
>
Generally speaking, these settings impact TX throughput/latencies.
Since NIC IRQs then trigger softirqs, we already know that IRQ handling
is critical, and some tuning can help, depending on the particular load.
The trick is when you want a generic kernel, deployed on hosts
shared by millions of jobs.
>
> On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> > not used.
> >
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average: eth0 9.50 5707925.60 0.99 585285.69 0.00 0.00 0.50
> > lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average: eth0 2.40 9985214.90 0.31 1023874.60 0.00 0.00 0.50
>
> This +75% number is pktgen without "burst", and it definitely shows that
> your patch activates xmit_more.
> What is the pps performance number when using pktgen "burst" option?
About the same really. Almost all packets now get the xmit_more effect.
>
>
> > And about 11 % improvement on a mono-flow UDP_STREAM test.
> >
> > skb_set_owner_w() is now the most consuming function.
> >
> >
> > lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> > [1] 13696
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average: eth0 9.00 1307380.50 1.00 308356.18 0.00 0.00 0.50
> > lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average: eth0 3.10 1459558.20 0.44 344267.57 0.00 0.00 0.50
>
> The +11% number seems consistent with my perf observations that approx
> 12% was misleadingly attributed to the xmit spin_lock.
>
>
> [...]
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 4b597dca5c52..affebb435679 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> [...]
> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> > {
> > - return ring->prod - ring->cons > ring->full_size;
> > + return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> > }
> [...]
>
> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> > }
> > send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> >
> > + /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > + * will happen shortly.
> > + */
> > + if (send_doorbell &&
> > + dev->doorbell_opt &&
> > + (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
>
> It would be nice with a function call with an appropriate name, instead
> of an open-coded queue size check. I'm also confused by the "ncons" name.
>
> > + send_doorbell = false;
> > +
> [...]
>
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > index 574bcbb1b38f..c3fd0deda198 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> > */
> > u32 last_nr_txbb;
> > u32 cons;
> > + u32 ncons;
>
> Maybe we can find a better name than "ncons" ?
That's because 'cons' in this driver really means 'old cons',
and new cons = old cons + last_nr_txbb.
>
> > unsigned long wake_queue;
> > struct netdev_queue *tx_queue;
> > u32 (*free_tx_desc)(struct mlx4_en_priv *priv,
> > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> >
> > /* cache line used and dirtied in mlx4_en_xmit() */
> > u32 prod ____cacheline_aligned_in_smp;
> > + u32 prod_bell;
>
> Good descriptive variable name.
>
> > unsigned int tx_dropped;
> > unsigned long bytes;
> > unsigned long packets;
>
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-11-30 15:56 ` Eric Dumazet
@ 2016-11-30 19:17 ` Jesper Dangaard Brouer
2016-11-30 19:30 ` Eric Dumazet
0 siblings, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-30 19:17 UTC (permalink / raw)
To: Eric Dumazet
Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat, brouer
On Wed, 30 Nov 2016 07:56:26 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> > I've played with a somewhat similar patch (from Achiad Shochat) for
> > mlx5 (attached). While it gives huge improvements, the problem I ran
> > into was that TX performance became a function of the TX completion
> > time/interrupt and could easily be throttled if configured too
> > high/slow.
> >
> > Can your patch be affected by this too?
>
> Like all TX business, you should know this Jesper.
> No need to constantly remind us something very well known.
Don't take it as critique, Eric. I was hoping your patch would have
solved this issue of being sensitive to TX completion adjustments. You
usually have good solutions for difficult issues. I basically rejected
Achiad's approach/patch because it was too sensitive to these kinds of
adjustments.
> > On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
[...]
> >
> > This +75% number is pktgen without "burst", and it definitely shows that
> > your patch activates xmit_more.
> > What is the pps performance number when using pktgen "burst" option?
>
> About the same really. About all packets now get the xmit_more effect.
Perfect!
> > [...]
> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > index 4b597dca5c52..affebb435679 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > [...]
> > > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> > > {
> > > - return ring->prod - ring->cons > ring->full_size;
> > > + return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> > > }
> > [...]
> >
> > > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> > > }
> > > send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> > >
> > > + /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > > + * will happen shortly.
> > > + */
> > > + if (send_doorbell &&
> > > + dev->doorbell_opt &&
> > > + (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> >
> > It would be nice with a function call with an appropriate name, instead
> > of an open-coded queue size check. I'm also confused by the "ncons" name.
> >
> > > + send_doorbell = false;
> > > +
> > [...]
> >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > index 574bcbb1b38f..c3fd0deda198 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> > > */
> > > u32 last_nr_txbb;
> > > u32 cons;
> > > + u32 ncons;
> >
> > Maybe we can find a better name than "ncons" ?
>
> Thats because 'cons' in this driver really means 'old cons'
>
> and new cons = old cons + last_nr_txbb;
It was not clear to me that "n" meant "new". It was also not clear
that this driver has the quirk that "cons" (consumer) tracks the "old" cons.
> > > unsigned long wake_queue;
> > > struct netdev_queue *tx_queue;
> > > u32 (*free_tx_desc)(struct mlx4_en_priv *priv,
> > > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> > >
> > > /* cache line used and dirtied in mlx4_en_xmit() */
> > > u32 prod ____cacheline_aligned_in_smp;
> > > + u32 prod_bell;
> >
> > Good descriptive variable name.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-11-30 19:17 ` Jesper Dangaard Brouer
@ 2016-11-30 19:30 ` Eric Dumazet
2016-11-30 22:30 ` Jesper Dangaard Brouer
2016-12-01 0:27 ` Eric Dumazet
0 siblings, 2 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-11-30 19:30 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat
On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> Don't take it as critique, Eric. I was hoping your patch would have
> solved this issue of being sensitive to TX completion adjustments. You
> usually have good solutions for difficult issues. I basically rejected
> Achiad's approach/patch because it was too sensitive to these kind of
> adjustments.
Well, this patch can hurt latencies, because a doorbell can be delayed,
and softirqs can be delayed by many hundreds of usec in some cases.
I would not enable this behavior by default.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-11-30 19:30 ` Eric Dumazet
@ 2016-11-30 22:30 ` Jesper Dangaard Brouer
2016-11-30 22:40 ` Eric Dumazet
2016-12-01 0:27 ` Eric Dumazet
1 sibling, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-11-30 22:30 UTC (permalink / raw)
To: Eric Dumazet
Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat, brouer
On Wed, 30 Nov 2016 11:30:00 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
>
> > Don't take it as critique, Eric. I was hoping your patch would have
> > solved this issue of being sensitive to TX completion adjustments. You
> > usually have good solutions for difficult issues. I basically rejected
> > Achiad's approach/patch because it was too sensitive to these kind of
> > adjustments.
>
> Well, this patch can hurt latencies, because a doorbell can be delayed,
> > and softirqs can be delayed by many hundreds of usec in some cases.
>
> I would not enable this behavior by default.
What about another scheme, where dev_hard_start_xmit() can return an
indication that the driver chose not to flush (based on TX queue depth),
thereby requesting the stack to call flush at a later point.
Would that introduce fewer latency issues?
Patch mockup (not even compile tested):
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4ffcd874cc20..d7d15e4e6766 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -109,6 +109,7 @@ enum netdev_tx {
__NETDEV_TX_MIN = INT_MIN, /* make sure enum is signed */
NETDEV_TX_OK = 0x00, /* driver took care of packet */
NETDEV_TX_BUSY = 0x10, /* driver tx path was busy*/
+ NETDEV_TX_FLUSHME = 0x04, /* driver requests doorbell/flush later */
};
typedef enum netdev_tx netdev_tx_t;
@@ -536,6 +537,8 @@ enum netdev_queue_state_t {
__QUEUE_STATE_DRV_XOFF,
__QUEUE_STATE_STACK_XOFF,
__QUEUE_STATE_FROZEN,
+ // __QUEUE_STATE_NEED_FLUSH
+ // is it better to store in txq state?
};
#define QUEUE_STATE_DRV_XOFF (1 << __QUEUE_STATE_DRV_XOFF)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index e6aa0a249672..7480e44c5a50 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -75,6 +75,7 @@ struct Qdisc {
void *u32_node;
struct netdev_queue *dev_queue;
+ struct netdev_queue *flush_dev_queue; // store txq to flush here?
struct gnet_stats_rate_est64 rate_est;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
@@ -98,6 +99,20 @@ struct Qdisc {
spinlock_t busylock ____cacheline_aligned_in_smp;
};
+static inline void qdisc_request_txq_flush(struct Qdisc *qdisc,
+ struct netdev_queue *txq)
+{
+ if (qdisc->flush_dev_queue) {
+ if (likely(qdisc->flush_dev_queue == txq))
+ return;
+ /* Flush the previously stored txq before reassignment */
+ dev_flush_xmit(qdisc_dev(qdisc), qdisc->flush_dev_queue);
+ }
+ qdisc->flush_dev_queue = txq;
+}
+
static inline bool qdisc_is_running(const struct Qdisc *qdisc)
{
return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
@@ -117,6 +132,19 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
static inline void qdisc_run_end(struct Qdisc *qdisc)
{
+ /* flush device txq here, if needed */
+ if (qdisc->flush_dev_queue) {
+ struct netdev_queue *txq = qdisc->flush_dev_queue;
+ struct net_device *dev = qdisc_dev(qdisc);
+
+ qdisc->flush_dev_queue = NULL;
+ dev_flush_xmit(dev, txq);
+ /*
+ * DISCUSS: is it too soon to flush here? What about
+ * rescheduling a NAPI poll cycle for this device,
+ * before calling flush.
+ */
+ }
write_seqcount_end(&qdisc->running);
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 048b46b7c92a..70339c267f33 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2880,6 +2880,15 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
}
EXPORT_SYMBOL(netif_skb_features);
+static void dev_flush_xmit(struct net_device *dev,
+ struct netdev_queue *txq)
+{
+ const struct net_device_ops *ops = dev->netdev_ops;
+
+ ops->ndo_flush_xmit(dev, txq);
+ // Oh oh, do we need to take HARD_TX_LOCK ??
+}
+
static int xmit_one(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq, bool more)
{
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6cfb6e9038c2..55c01b6f6311 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -190,6 +190,13 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
if (dev_xmit_complete(ret)) {
/* Driver sent out skb successfully or skb was consumed */
+ if (ret == NETDEV_TX_FLUSHME) {
+ /* Driver chose to skip the TX-doorbell MMIO write.
+ * This makes taking the qdisc root_lock less expensive.
+ */
+ qdisc_request_txq_flush(q, txq);
+ // Flush happens later in qdisc_run_end()
+ }
ret = qdisc_qlen(q);
} else {
/* Driver returned NETDEV_TX_BUSY - requeue skb */
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-11-30 22:30 ` Jesper Dangaard Brouer
@ 2016-11-30 22:40 ` Eric Dumazet
0 siblings, 0 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-11-30 22:40 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat
On Wed, 2016-11-30 at 23:30 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 30 Nov 2016 11:30:00 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> >
> > > Don't take it as critique, Eric. I was hoping your patch would have
> > > solved this issue of being sensitive to TX completion adjustments. You
> > > usually have good solutions for difficult issues. I basically rejected
> > > Achiad's approach/patch because it was too sensitive to these kind of
> > > adjustments.
> >
> > Well, this patch can hurt latencies, because a doorbell can be delayed,
> > > and softirqs can be delayed by many hundreds of usec in some cases.
> >
> > I would not enable this behavior by default.
>
> What about another scheme, where dev_hard_start_xmit() can return an
> indication that driver choose not to flush (based on TX queue depth),
> and there by requesting stack to call flush at a later point.
>
> Would that introduce less latency issues?
Again, how is it going to change anything when your netperf UDP sends
one packet at a time?
The qdisc gets one packet and dequeues it immediately.
No pressure -> the doorbell will be sent as before.
Some TX producers do not even use a qdisc to begin with.
Please think of a solution that does not involve qdisc layer at all.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-11-30 19:30 ` Eric Dumazet
2016-11-30 22:30 ` Jesper Dangaard Brouer
@ 2016-12-01 0:27 ` Eric Dumazet
2016-12-01 1:16 ` Tom Herbert
1 sibling, 1 reply; 62+ messages in thread
From: Eric Dumazet @ 2016-12-01 0:27 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Tom Herbert, Willem de Bruijn
Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat
Another issue I found during my tests over the last few days is a problem
with BQL, and more generally with how the driver stops/starts the queue.
When under stress and BQL stops the queue, the driver TX completion does
a lot of work, and the servicing CPU also takes over further qdisc_run().
The work-flow is :
1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
unmap things, queue skbs for freeing.
2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
netif_schedule_queue(dev_queue);
This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
(They absolutely have no chance to grab it)
So we end up with one cpu doing the ndo_start_xmit() and TX completions,
and RX work.
This problem is magnified when XPS is used, if one mono-threaded application deals with
thousands of TCP sockets.
We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
to allow another cpu to service the qdisc and spare us.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 0:27 ` Eric Dumazet
@ 2016-12-01 1:16 ` Tom Herbert
2016-12-01 2:32 ` Eric Dumazet
0 siblings, 1 reply; 62+ messages in thread
From: Tom Herbert @ 2016-12-01 1:16 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
Achiad Shochat
On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Another issue I found during my tests over the last few days is a problem
> with BQL, and more generally with how the driver stops/starts the queue.
>
> When under stress and BQL stops the queue, driver TX completion does a
> lot of work, and servicing CPU also takes over further qdisc_run().
>
Hmm, hard to say if this is a problem or a feature, I think ;-). One way
to "solve" this would be to use IRQ round robin, that would spread the
load of networking across CPUs, but that gives us no additional
parallelism and reduces locality-- it's generally considered a bad
idea. The question might be: is it better to continuously ding one CPU
with all the networking work or try to artificially spread it out?
Note this is orthogonal to MQ also, so we should already have multiple
CPUs doing netif_schedule_queue for queues they manage.
Do you have a test or application that shows this is causing pain?
> The work-flow is :
>
> 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> unmap things, queue skbs for freeing.
>
> 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>
> if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
> netif_schedule_queue(dev_queue);
>
> This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> (They absolutely have no chance to grab it)
>
> So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> and RX work.
>
> This problem is magnified when XPS is used, if one mono-threaded application deals with
> thousands of TCP sockets.
>
Do you know of an application doing this? The typical way XPS and most
of the other steering mechanisms are configured assume that workloads
tend towards a normal distribution. Such mechanisms can become
problematic under asymmetric loads, but then I would expect these are
likely dedicated servers so that the mechanisms can be tuned
accordingly. For instance, XPS can be configured to map more than one
queue to a CPU. Alternatively, IIRC Windows has some functionality to
tune networking for the load (spin up queues, reconfigure things
similar to XPS/RPS, etc.) -- that's promising up to the point where we need
a lot of heuristics and measurement; but then we lose determinism and
risk edge cases where we get completely unsatisfying results (one of my
concerns with the recent attempt to put configuration in the kernel).
> We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> to allow another cpu to service the qdisc and spare us.
>
Wouldn't this need to be an active operation? That is, to queue the
qdisc on another CPU's output_queue?
Tom
>
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 1:16 ` Tom Herbert
@ 2016-12-01 2:32 ` Eric Dumazet
2016-12-01 2:50 ` Eric Dumazet
2016-12-01 5:03 ` Tom Herbert
0 siblings, 2 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-12-01 2:32 UTC (permalink / raw)
To: Tom Herbert
Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
Achiad Shochat
On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Another issue I found during my tests over the last few days is a problem
> > with BQL, and more generally with how the driver stops/starts the queue.
> >
> > When under stress and BQL stops the queue, driver TX completion does a
> > lot of work, and servicing CPU also takes over further qdisc_run().
> >
> Hmm, hard to say if this is a problem or a feature, I think ;-). One way
> to "solve" this would be to use IRQ round robin, that would spread the
> load of networking across CPUs, but that gives us no additional
> parallelism and reduces locality-- it's generally considered a bad
> idea. The question might be: is it better to continuously ding one CPU
> with all the networking work or try to artificially spread it out?
> Note this is orthogonal to MQ also, so we should already have multiple
> CPUs doing netif_schedule_queue for queues they manage.
>
> Do you have a test or application that shows this is causing pain?
Yes, just launch enough TCP senders (more than 10,000) to fully utilize
the NIC, with small messages.
super_netperf is not good for that, because you would need 10,000
processes and would spend too many cycles just dealing with an enormous
working set; you would not activate BQL.
>
> > The work-flow is :
> >
> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> > unmap things, queue skbs for freeing.
> >
> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
> >
> > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
> > netif_schedule_queue(dev_queue);
> >
> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> > (They absolutely have no chance to grab it)
> >
> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> > and RX work.
> >
> > This problem is magnified when XPS is used, if one mono-threaded application deals with
> > thousands of TCP sockets.
> >
> Do you know of an application doing this? The typical way XPS and most
> of the other steering mechanisms are configured assume that workloads
> tend towards a normal distribution. Such mechanisms can become
> problematic under asymmetric loads, but then I would expect these are
> likely dedicated servers so that the mechanisms can be tuned
> accordingly. For instance, XPS can be configured to map more than one
> queue to a CPU. Alternatively, IIRC Windows has some functionality to
> tune networking for the load (spin up queues, reconfigure things
> similar to XPS/RPS, etc.)-- that's promising up to the point that we need
> a lot of heuristics and measurement; but then we lose determinism and
> risk edge cases where we get completely unsatisfying results (one of my
> concerns with the recent attempt to put configuration in the kernel).
We have thousands of applications, and some of them 'kind of multicast'
events to a broad number of TCP sockets.
Very often, application writers use a single thread for doing this,
when all they need is to send small packets to 10,000 sockets, and they
do not really care about doing this very fast. They also do not want to
hurt other applications sharing the NIC.
Very often, the process scheduler will also run this single thread on a
single CPU, i.e. avoiding expensive migrations if they are not needed.
The problem is that this behavior locks one TX queue for the duration of
the multicast, since XPS will force all the TX packets to go to one TX
queue.
Other flows that would share the locked CPU experience high P99
latencies.
>
> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> > to allow another cpu to service the qdisc and spare us.
> >
> Wouldn't this need to be an active operation? That is to queue the
> qdisc on another CPU's output_queue?
I simply suggest we try to queue the qdisc for further servicing as we
do today, from net_tx_action(), but we might use a different bit, so
that we leave the opportunity for another CPU to get __QDISC_STATE_SCHED
before we grab it from net_tx_action(), maybe 100 usec later (time to
flush all skbs queued in napi_consume_skb(), and maybe RX processing,
since most NICs handle TX completion before doing RX processing from
their napi poll handler).
Should be doable with a few changes in __netif_schedule() and
net_tx_action(), plus some control paths that will need to take care of
the new bit at dismantle time, right?
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 2:32 ` Eric Dumazet
@ 2016-12-01 2:50 ` Eric Dumazet
2016-12-02 18:16 ` Eric Dumazet
2016-12-01 5:03 ` Tom Herbert
1 sibling, 1 reply; 62+ messages in thread
From: Eric Dumazet @ 2016-12-01 2:50 UTC (permalink / raw)
To: Tom Herbert
Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
Achiad Shochat
On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:
> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another CPU to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb(), and maybe RX processing,
> since most NICs handle TX completion before doing RX processing from
> their napi poll handler).
>
> Should be doable with a few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right?
Hmm... this is silly. Code already implements a different bit.
qdisc_run() seems to run more often from net_tx_action(), I have to find
out why.
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 2:50 ` Eric Dumazet
@ 2016-12-02 18:16 ` Eric Dumazet
0 siblings, 0 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-12-02 18:16 UTC (permalink / raw)
To: Tom Herbert
Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
Achiad Shochat
On Wed, 2016-11-30 at 18:50 -0800, Eric Dumazet wrote:
> On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:
>
> > I simply suggest we try to queue the qdisc for further servicing as we
> > do today, from net_tx_action(), but we might use a different bit, so
> > that we leave the opportunity for another CPU to get __QDISC_STATE_SCHED
> > before we grab it from net_tx_action(), maybe 100 usec later (time to
> > flush all skbs queued in napi_consume_skb(), and maybe RX processing,
> > since most NICs handle TX completion before doing RX processing from
> > their napi poll handler).
> >
> > Should be doable with a few changes in __netif_schedule() and
> > net_tx_action(), plus some control paths that will need to take care of
> > the new bit at dismantle time, right?
>
> Hmm... this is silly. Code already implements a different bit.
>
> qdisc_run() seems to run more often from net_tx_action(), I have to find
> out why.
After more analysis I believe TSQ was one of the bottlenecks.
I prepared a patch series that helped my use cases.
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 2:32 ` Eric Dumazet
2016-12-01 2:50 ` Eric Dumazet
@ 2016-12-01 5:03 ` Tom Herbert
2016-12-01 19:24 ` Willem de Bruijn
1 sibling, 1 reply; 62+ messages in thread
From: Tom Herbert @ 2016-12-01 5:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
Achiad Shochat
On Wed, Nov 30, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
>> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > Another issue I found during my tests last days, is a problem with BQL,
>> > and more generally when driver stops/starts the queue.
>> >
>> > When under stress and BQL stops the queue, driver TX completion does a
>> > lot of work, and servicing CPU also takes over further qdisc_run().
>> >
>> Hmm, hard to say if this is a problem or a feature, I think ;-). One way
>> to "solve" this would be to use IRQ round robin, that would spread the
>> load of networking across CPUs, but that gives us no additional
>> parallelism and reduces locality-- it's generally considered a bad
>> idea. The question might be: is it better to continuously ding one CPU
>> with all the networking work or try to artificially spread it out?
>> Note this is orthogonal to MQ also, so we should already have multiple
>> CPUs doing netif_schedule_queue for queues they manage.
>>
>> Do you have a test or application that shows this is causing pain?
>
> Yes, just launch enough TCP senders (more than 10,000) to fully utilize
> the NIC, with small messages.
>
> super_netperf is not good for that, because you would need 10,000
> processes and would spend too many cycles just dealing with an enormous
> working set; you would not activate BQL.
>
>
>>
>> > The work-flow is :
>> >
>> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
>> > unmap things, queue skbs for freeing.
>> >
>> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>> >
>> > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>> > netif_schedule_queue(dev_queue);
>> >
>> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
>> > (They absolutely have no chance to grab it)
>> >
>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>> > and RX work.
>> >
>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>> > thousands of TCP sockets.
>> >
>> Do you know of an application doing this? The typical way XPS and most
>> of the other steering mechanisms are configured assume that workloads
>> tend towards a normal distribution. Such mechanisms can become
>> problematic under asymmetric loads, but then I would expect these are
>> likely dedicated servers so that the mechanisms can be tuned
>> accordingly. For instance, XPS can be configured to map more than one
>> queue to a CPU. Alternatively, IIRC Windows has some functionality to
>> tune networking for the load (spin up queues, reconfigure things
>> similar to XPS/RPS, etc.)-- that's promising up to the point that we need
>> a lot of heuristics and measurement; but then we lose determinism and
>> risk edge cases where we get completely unsatisfying results (one of my
>> concerns with the recent attempt to put configuration in the kernel).
>
> We have thousands of applications, and some of them 'kind of multicast'
> events to a broad number of TCP sockets.
>
> Very often, application writers use a single thread for doing this,
> when all they need is to send small packets to 10,000 sockets, and they
> do not really care about doing this very fast. They also do not want to
> hurt other applications sharing the NIC.
>
> Very often, the process scheduler will also run this single thread on a
> single CPU, i.e. avoiding expensive migrations if they are not needed.
>
> The problem is that this behavior locks one TX queue for the duration of
> the multicast, since XPS will force all the TX packets to go to one TX
> queue.
>
The fact that XPS is forcing TX packets to go over one CPU is a result
of how you've configured XPS. There are other potential
configurations that present different tradeoffs. For instance, TX
queues are plentiful nowadays, to the point that we could map a number
of queues to each CPU while respecting NUMA locality between the
sending CPU and where TX completions occur. If the set is sufficiently
large, this would also help to address device lock contention. The
other thing I'm wondering is if Willem's concept of distributing DOS
attacks in RPS might be applicable to XPS. If we could detect that a
TX queue is "under attack", maybe we can automatically back off to
distributing the load to more queues in XPS somehow.
Tom
> Other flows that would share the locked CPU experience high P99
> latencies.
>
>
>>
>> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
>> > to allow another cpu to service the qdisc and spare us.
>> >
>> Wouldn't this need to be an active operation? That is to queue the
>> qdisc on another CPU's output_queue?
>
> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another CPU to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb(), and maybe RX processing,
> since most NICs handle TX completion before doing RX processing from
> their napi poll handler).
>
> Should be doable with a few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right?
>
>
>
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 5:03 ` Tom Herbert
@ 2016-12-01 19:24 ` Willem de Bruijn
0 siblings, 0 replies; 62+ messages in thread
From: Willem de Bruijn @ 2016-12-01 19:24 UTC (permalink / raw)
To: Tom Herbert
Cc: Eric Dumazet, Jesper Dangaard Brouer, Willem de Bruijn,
Rick Jones, Linux Kernel Network Developers, Saeed Mahameed,
Tariq Toukan, Achiad Shochat
>>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>>> > and RX work.
This problem is somewhat tangential to the doorbell avoidance discussion.
>>> >
>>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>>> > thousands of TCP sockets.
>>> >
>> We have thousands of applications, and some of them 'kind of multicast'
>> events to a broad number of TCP sockets.
>>
>> Very often, application writers use a single thread for doing this,
>> when all they need is to send small packets to 10,000 sockets, and they
>> do not really care about doing this very fast. They also do not want to
>> hurt other applications sharing the NIC.
>>
>> Very often, the process scheduler will also run this single thread on a
>> single CPU, i.e. avoiding expensive migrations if they are not needed.
>>
>> The problem is that this behavior locks one TX queue for the duration of
>> the multicast, since XPS will force all the TX packets to go to one TX
>> queue.
>>
> The fact that XPS is forcing TX packets to go over one CPU is a result
> of how you've configured XPS. There are other potential
> configurations that present different tradeoffs.
Right, XPS supports multiple txqueues mappings, using skb_tx_hash
to decide among them. Unfortunately cross-cpu is more expensive
than tx + completion on the same core, so this is suboptimal for
the common case where there is no excessive load imbalance.
> For instance, TX
> queues are plentiful nowadays, to the point that we could map a number
> of queues to each CPU while respecting NUMA locality between the
> sending CPU and where TX completions occur. If the set is sufficiently
> large, this would also help to address device lock contention. The
> other thing I'm wondering is if Willem's concept of distributing DOS
> attacks in RPS might be applicable to XPS. If we could detect that a
> TX queue is "under attack", maybe we can automatically back off to
> distributing the load to more queues in XPS somehow.
If only targeting states of imbalance, that indeed could work. For the
10,000 socket case, instead of load balancing qdisc servicing, we
could perhaps modify tx queue selection in __netdev_pick_tx to
choose another queue if the initial choice is paused or otherwise
backlogged.
* Re: [WIP] net+mlx4: auto doorbell
2016-11-29 6:58 ` [WIP] net+mlx4: auto doorbell Eric Dumazet
2016-11-30 11:38 ` Jesper Dangaard Brouer
@ 2016-11-30 13:50 ` Saeed Mahameed
2016-11-30 15:44 ` Eric Dumazet
2016-12-01 21:32 ` Alexander Duyck
2 siblings, 1 reply; 62+ messages in thread
From: Saeed Mahameed @ 2016-11-30 13:50 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
Saeed Mahameed, Tariq Toukan
On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>
>
>> Not sure if this has been tried before, but the doorbell avoidance could
>> be done by the driver itself, because it knows a TX completion will come
>> shortly (well... if softirqs are not delayed too much !)
>>
>> Doorbell would be forced only if :
>>
>> ( "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> OR
>> ( too many [1] packets were put in TX ring buffer, no point deferring
>> more)
>>
>> Start the pump, but once it is started, let the doorbells be done by
>> TX completion.
>>
>> ndo_start_xmit and TX completion handler would have to maintain a shared
>> state describing if packets were ready but doorbell deferred.
>>
>>
>> Note that TX completion means "if at least one packet was drained",
>> otherwise busy polling, constantly calling napi->poll() would force a
>> doorbell too soon for devices sharing a NAPI for both RX and TX.
>>
>> But then, maybe busy poll would like to force a doorbell...
>>
>> I could try these ideas on mlx4 shortly.
>>
>>
>> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>
> I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> not used.
Hi Eric, nice idea indeed, and we need something like this;
today we almost don't exploit TX bulking at all.
But please see below, I am not sure different contexts should share
the doorbell ringing; it is really risky.
> drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2
> drivers/net/ethernet/mellanox/mlx4/en_tx.c | 90 +++++++++++------
> drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 4
> include/linux/netdevice.h | 1
> net/core/net-sysfs.c | 18 +++
> 5 files changed, 83 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 6562f78b07f4..fbea83218fc0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>
> if (polled) {
> if (doorbell_pending)
> - mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> + mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>
> mlx4_cq_set_ci(&cq->mcq);
> wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
> ring->size = size;
> ring->size_mask = size - 1;
> ring->sp_stride = stride;
> - ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> + ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>
> tmp = size * sizeof(struct mlx4_en_tx_info);
> ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
> ring->sp_cqn = cq;
> ring->prod = 0;
> ring->cons = 0xffffffff;
> + ring->ncons = 0;
> ring->last_nr_txbb = 1;
> memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
> memset(ring->buf, 0, ring->buf_size);
> @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
> MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
> }
>
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> {
> - return ring->prod - ring->cons > ring->full_size;
> + return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> }
>
> static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>
> /* Skip last polled descriptor */
> ring->cons += ring->last_nr_txbb;
> + ring->ncons += ring->last_nr_txbb;
> en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
> ring->cons, ring->prod);
>
> @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> !!(ring->cons & ring->size), 0,
> 0 /* Non-NAPI caller */);
> ring->cons += ring->last_nr_txbb;
> + ring->ncons += ring->last_nr_txbb;
> cnt++;
> }
>
> @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> return cnt;
> }
>
> +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> + struct mlx4_en_tx_ring *ring)
> +{
> +
> + if (dev->doorbell_opt & 1) {
> + u32 oval = READ_ONCE(ring->prod_bell);
> + u32 nval = READ_ONCE(ring->prod);
> +
> + if (oval == nval)
> + return;
> +
> + /* I can not tell yet if a cmpxchg() is needed or not */
> + if (dev->doorbell_opt & 2)
> + WRITE_ONCE(ring->prod_bell, nval);
> + else
> + if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> + return;
> + }
> + /* Since there is no iowrite*_native() that writes the
> + * value as is, without byteswapping - using the one
> + * the doesn't do byteswapping in the relevant arch
> + * endianness.
> + */
> +#if defined(__LITTLE_ENDIAN)
> + iowrite32(
> +#else
> + iowrite32be(
> +#endif
> + ring->doorbell_qpn,
> + ring->bf.uar->map + MLX4_SEND_DOORBELL);
> +}
> +
> static bool mlx4_en_process_tx_cq(struct net_device *dev,
> struct mlx4_en_cq *cq, int napi_budget)
> {
> @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
> wmb();
>
> /* we want to dirty this cache line once */
> - ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> - ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> + WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> + ring_cons += txbbs_skipped;
> + WRITE_ONCE(ring->cons, ring_cons);
> + WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> +
> + if (dev->doorbell_opt)
> + mlx4_en_xmit_doorbell(dev, ring);
>
> if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
> return done < budget;
> @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
> __iowrite64_copy(dst, src, bytecnt / 8);
> }
>
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> -{
> - wmb();
You missed/removed this "wmb()" in the new function. Why? Where did
you compensate for it?
> - /* Since there is no iowrite*_native() that writes the
> - * value as is, without byteswapping - using the one
> - * the doesn't do byteswapping in the relevant arch
> - * endianness.
> - */
> -#if defined(__LITTLE_ENDIAN)
> - iowrite32(
> -#else
> - iowrite32be(
> -#endif
> - ring->doorbell_qpn,
> - ring->bf.uar->map + MLX4_SEND_DOORBELL);
> -}
>
> static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> struct mlx4_en_tx_desc *tx_desc,
> union mlx4_wqe_qpn_vlan qpn_vlan,
> int desc_size, int bf_index,
> __be32 op_own, bool bf_ok,
> - bool send_doorbell)
> + bool send_doorbell,
> + const struct net_device *dev, int nr_txbb)
> {
> tx_desc->ctrl.qpn_vlan = qpn_vlan;
>
> @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>
> wmb();
>
> + ring->prod += nr_txbb;
> mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
> desc_size);
>
> @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> */
> dma_wmb();
> tx_desc->ctrl.owner_opcode = op_own;
> + ring->prod += nr_txbb;
> if (send_doorbell)
> - mlx4_en_xmit_doorbell(ring);
> + mlx4_en_xmit_doorbell(dev, ring);
> else
> ring->xmit_more++;
> }
> @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
> }
>
> - ring->prod += nr_txbb;
> -
> /* If we used a bounce buffer then copy descriptor back into place */
> if (unlikely(bounce))
> tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> }
> send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>
> + /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> + * will happen shortly.
> + */
> + if (send_doorbell &&
> + dev->doorbell_opt &&
> + (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
Alexei already expressed his worries about synchronization, and I
think the problem sits here (in this exact line):
what if, exactly at this point, the TX completion handler has just
finished and rung the last doorbell?
You haven't written the new TX descriptor yet (mlx4_en_tx_write_desc),
so the last doorbell from the CQ handler missed it.
Even if you wrote the TX descriptor before the doorbell decision here,
you would need a memory barrier to make sure the HW sees
the new packet, which is typically done before ringing the doorbell.
All in all, this is risky business :). The right way to go is to
force the upper layer to use xmit_more and delay doorbells/use bulking,
but from the same context
(the xmit routine). For example, see Achiad's suggestion (attached in
Jesper's response): he used queue stop to force the stack to queue up
packets (TX bulking),
which would set xmit_more and use the next completion to release
the "stopped" ring TXQ rather than ring the doorbell on its behalf.
> + send_doorbell = false;
> +
> real_size = (real_size / 16) & 0x3f;
>
> bf_ok &= desc_size <= MAX_BF && send_doorbell;
> @@ -1043,7 +1076,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> qpn_vlan.fence_size = real_size;
>
> mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
> - op_own, bf_ok, send_doorbell);
> + op_own, bf_ok, send_doorbell, dev, nr_txbb);
>
> if (unlikely(stop_queue)) {
> /* If queue was emptied after the if (stop_queue) , and before
> @@ -1054,7 +1087,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> */
> smp_rmb();
>
> - ring_cons = ACCESS_ONCE(ring->cons);
> if (unlikely(!mlx4_en_is_tx_ring_full(ring))) {
> netif_tx_wake_queue(ring->tx_queue);
> ring->wake_queue++;
> @@ -1158,8 +1190,6 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
> rx_ring->xdp_tx++;
> AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
>
> - ring->prod += nr_txbb;
> -
> stop_queue = mlx4_en_is_tx_ring_full(ring);
> send_doorbell = stop_queue ||
> *doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
> @@ -1173,7 +1203,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
> qpn_vlan.fence_size = real_size;
>
> mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
> - op_own, bf_ok, send_doorbell);
> + op_own, bf_ok, send_doorbell, dev, nr_txbb);
> *doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
>
> return NETDEV_TX_OK;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> */
> u32 last_nr_txbb;
> u32 cons;
> + u32 ncons;
> unsigned long wake_queue;
> struct netdev_queue *tx_queue;
> u32 (*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>
> /* cache line used and dirtied in mlx4_en_xmit() */
> u32 prod ____cacheline_aligned_in_smp;
> + u32 prod_bell;
> unsigned int tx_dropped;
> unsigned long bytes;
> unsigned long packets;
> @@ -699,7 +701,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
> struct mlx4_en_rx_alloc *frame,
> struct net_device *dev, unsigned int length,
> int tx_ind, int *doorbell_pending);
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
> +void mlx4_en_xmit_doorbell(const struct net_device *dev, struct mlx4_en_tx_ring *ring);
> bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
> struct mlx4_en_rx_alloc *frame);
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 4ffcd874cc20..39565b5425a6 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1816,6 +1816,7 @@ struct net_device {
> DECLARE_HASHTABLE (qdisc_hash, 4);
> #endif
> unsigned long tx_queue_len;
> + unsigned long doorbell_opt;
> spinlock_t tx_global_lock;
> int watchdog_timeo;
>
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index b0c04cf4851d..df05f81f5150 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -367,6 +367,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
> }
> NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
>
> +static int change_doorbell_opt(struct net_device *dev, unsigned long val)
> +{
> + dev->doorbell_opt = val;
> + return 0;
> +}
> +
> +static ssize_t doorbell_opt_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + if (!capable(CAP_NET_ADMIN))
> + return -EPERM;
> +
> + return netdev_store(dev, attr, buf, len, change_doorbell_opt);
> +}
> +NETDEVICE_SHOW_RW(doorbell_opt, fmt_ulong);
> +
> static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
> const char *buf, size_t len)
> {
> @@ -531,6 +548,7 @@ static struct attribute *net_class_attrs[] = {
> &dev_attr_phys_port_name.attr,
> &dev_attr_phys_switch_id.attr,
> &dev_attr_proto_down.attr,
> + &dev_attr_doorbell_opt.attr,
> NULL,
> };
> ATTRIBUTE_GROUPS(net_class);
>
>
* Re: [WIP] net+mlx4: auto doorbell
2016-11-30 13:50 ` Saeed Mahameed
@ 2016-11-30 15:44 ` Eric Dumazet
2016-11-30 16:27 ` Saeed Mahameed
0 siblings, 1 reply; 62+ messages in thread
From: Eric Dumazet @ 2016-11-30 15:44 UTC (permalink / raw)
To: Saeed Mahameed
Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
Saeed Mahameed, Tariq Toukan
On Wed, 2016-11-30 at 15:50 +0200, Saeed Mahameed wrote:
> On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
> >
> >
> >> Not sure if this has been tried before, but the doorbell avoidance could
> >> be done by the driver itself, because it knows a TX completion will come
> >> shortly (well... if softirqs are not delayed too much !)
> >>
> >> Doorbell would be forced only if :
> >>
> >> ( "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
> >> OR
> >> ( too many [1] packets were put in TX ring buffer, no point deferring
> >> more)
> >>
> >> Start the pump, but once it is started, let the doorbells be done by
> >> TX completion.
> >>
> >> ndo_start_xmit and TX completion handler would have to maintain a shared
> >> state describing if packets were ready but doorbell deferred.
> >>
> >>
> >> Note that TX completion means "if at least one packet was drained",
> >> otherwise busy polling, constantly calling napi->poll() would force a
> >> doorbell too soon for devices sharing a NAPI for both RX and TX.
> >>
> >> But then, maybe busy poll would like to force a doorbell...
> >>
> >> I could try these ideas on mlx4 shortly.
> >>
> >>
> >> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
> >
> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> > not used.
>
> Hi Eric, nice idea indeed, and we need something like this;
> today we almost don't exploit TX bulking at all.
>
> But please see below, I am not sure different contexts should share
> the doorbell ringing; it is really risky.
>
> > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2
> > drivers/net/ethernet/mellanox/mlx4/en_tx.c | 90 +++++++++++------
> > drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 4
> > include/linux/netdevice.h | 1
> > net/core/net-sysfs.c | 18 +++
> > 5 files changed, 83 insertions(+), 32 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > index 6562f78b07f4..fbea83218fc0 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >
> > if (polled) {
> > if (doorbell_pending)
> > - mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> > + mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
> >
> > mlx4_cq_set_ci(&cq->mcq);
> > wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 4b597dca5c52..affebb435679 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
> > ring->size = size;
> > ring->size_mask = size - 1;
> > ring->sp_stride = stride;
> > - ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> > + ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
> >
> > tmp = size * sizeof(struct mlx4_en_tx_info);
> > ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> > @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
> > ring->sp_cqn = cq;
> > ring->prod = 0;
> > ring->cons = 0xffffffff;
> > + ring->ncons = 0;
> > ring->last_nr_txbb = 1;
> > memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
> > memset(ring->buf, 0, ring->buf_size);
> > @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
> > MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
> > }
> >
> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> > {
> > - return ring->prod - ring->cons > ring->full_size;
> > + return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> > }
> >
> > static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> > @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> >
> > /* Skip last polled descriptor */
> > ring->cons += ring->last_nr_txbb;
> > + ring->ncons += ring->last_nr_txbb;
> > en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
> > ring->cons, ring->prod);
> >
> > @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> > !!(ring->cons & ring->size), 0,
> > 0 /* Non-NAPI caller */);
> > ring->cons += ring->last_nr_txbb;
> > + ring->ncons += ring->last_nr_txbb;
> > cnt++;
> > }
> >
> > @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> > return cnt;
> > }
> >
> > +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> > + struct mlx4_en_tx_ring *ring)
> > +{
> > +
> > + if (dev->doorbell_opt & 1) {
> > + u32 oval = READ_ONCE(ring->prod_bell);
> > + u32 nval = READ_ONCE(ring->prod);
> > +
> > + if (oval == nval)
> > + return;
> > +
> > + /* I can not tell yet if a cmpxchg() is needed or not */
> > + if (dev->doorbell_opt & 2)
> > + WRITE_ONCE(ring->prod_bell, nval);
> > + else
> > + if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> > + return;
> > + }
> > + /* Since there is no iowrite*_native() that writes the
> > + * value as is, without byteswapping - using the one
> > + * the doesn't do byteswapping in the relevant arch
> > + * endianness.
> > + */
> > +#if defined(__LITTLE_ENDIAN)
> > + iowrite32(
> > +#else
> > + iowrite32be(
> > +#endif
> > + ring->doorbell_qpn,
> > + ring->bf.uar->map + MLX4_SEND_DOORBELL);
> > +}
> > +
> > static bool mlx4_en_process_tx_cq(struct net_device *dev,
> > struct mlx4_en_cq *cq, int napi_budget)
> > {
> > @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
> > wmb();
> >
> > /* we want to dirty this cache line once */
> > - ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> > - ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> > + WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> > + ring_cons += txbbs_skipped;
> > + WRITE_ONCE(ring->cons, ring_cons);
> > + WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> > +
> > + if (dev->doorbell_opt)
> > + mlx4_en_xmit_doorbell(dev, ring);
> >
> > if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
> > return done < budget;
> > @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
> > __iowrite64_copy(dst, src, bytecnt / 8);
> > }
> >
> > -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> > -{
> > - wmb();
>
> you missed/removed this "wmb()" in the new function, why? Where did
> you compensate for it?
I removed it because I had a cmpxchg() there if the barrier was needed.
My patch is a WIP, where you can set the bit 2 to ask to replace the
cmpxchg() by a simple write, only for performance testing/comparisons.
>
> > - /* Since there is no iowrite*_native() that writes the
> > - * value as is, without byteswapping - using the one
> > - * the doesn't do byteswapping in the relevant arch
> > - * endianness.
> > - */
> > -#if defined(__LITTLE_ENDIAN)
> > - iowrite32(
> > -#else
> > - iowrite32be(
> > -#endif
> > - ring->doorbell_qpn,
> > - ring->bf.uar->map + MLX4_SEND_DOORBELL);
> > -}
> >
> > static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> > struct mlx4_en_tx_desc *tx_desc,
> > union mlx4_wqe_qpn_vlan qpn_vlan,
> > int desc_size, int bf_index,
> > __be32 op_own, bool bf_ok,
> > - bool send_doorbell)
> > + bool send_doorbell,
> > + const struct net_device *dev, int nr_txbb)
> > {
> > tx_desc->ctrl.qpn_vlan = qpn_vlan;
> >
> > @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> >
> > wmb();
> >
> > + ring->prod += nr_txbb;
> > mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
> > desc_size);
> >
> > @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> > */
> > dma_wmb();
> > tx_desc->ctrl.owner_opcode = op_own;
> > + ring->prod += nr_txbb;
> > if (send_doorbell)
> > - mlx4_en_xmit_doorbell(ring);
> > + mlx4_en_xmit_doorbell(dev, ring);
> > else
> > ring->xmit_more++;
> > }
> > @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> > op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
> > }
> >
> > - ring->prod += nr_txbb;
> > -
> > /* If we used a bounce buffer then copy descriptor back into place */
> > if (unlikely(bounce))
> > tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> > }
> > send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> >
> > + /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > + * will happen shortly.
> > + */
> > + if (send_doorbell &&
> > + dev->doorbell_opt &&
> > + (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
>
> Alexei already expressed his worries about synchronization, and I
> think here (in this exact line) sits the problem,
> what about if exactly at this point the TX completion handler just
> finished and rang the last doorbell,
> you didn't write the new TX descriptor yet (mlx4_en_tx_write_desc), so
> the last doorbell from the CQ handler missed it.
> even if you wrote the TX descriptor before the doorbell decision here,
> you will need a memory barrier to make sure the HW sees
> the new packet, which was typically done before ringing the doorbell.
My patch is a WIP, meaning it is not completed ;)
Surely we can find a non racy way to handle this.
> All in all, this is risky business :), the right way to go is to
> force the upper layer to use xmit-more and delay doorbells/use bulking
> but from the same context
> (xmit routine). For example see Achiad's suggestion (attached in
> Jesper's response), he used stop queue to force the stack to queue up
> packets (TX bulking)
> which would set xmit-more and will use the next completion to release
> the "stopped" ring TXQ rather than hit the doorbell on behalf of it.
Well, you depend on having a higher level queue like a qdisc.
Some users do not use a qdisc.
If you stop the queue, they can no longer send anything -> drops.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-11-30 15:44 ` Eric Dumazet
@ 2016-11-30 16:27 ` Saeed Mahameed
2016-11-30 17:28 ` Eric Dumazet
2016-12-01 12:05 ` Jesper Dangaard Brouer
0 siblings, 2 replies; 62+ messages in thread
From: Saeed Mahameed @ 2016-11-30 16:27 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
Saeed Mahameed, Tariq Toukan
On Wed, Nov 30, 2016 at 5:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 15:50 +0200, Saeed Mahameed wrote:
>> On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>> >
>> >
>> >> Not sure it this has been tried before, but the doorbell avoidance could
>> >> be done by the driver itself, because it knows a TX completion will come
>> >> shortly (well... if softirqs are not delayed too much !)
>> >>
>> >> Doorbell would be forced only if :
>> >>
>> >> ( "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> >> OR
>> >> ( too many [1] packets were put in TX ring buffer, no point deferring
>> >> more)
>> >>
>> >> Start the pump, but once it is started, let the doorbells being done by
>> >> TX completion.
>> >>
>> >> ndo_start_xmit and TX completion handler would have to maintain a shared
>> >> state describing if packets were ready but doorbell deferred.
>> >>
>> >>
>> >> Note that TX completion means "if at least one packet was drained",
>> >> otherwise busy polling, constantly calling napi->poll() would force a
>> >> doorbell too soon for devices sharing a NAPI for both RX and TX.
>> >>
>> >> But then, maybe busy poll would like to force a doorbell...
>> >>
>> >> I could try these ideas on mlx4 shortly.
>> >>
>> >>
>> >> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>> >
>> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
>> > not used.
>>
>> Hi Eric, Nice Idea indeed and we need something like this,
>> today we almost don't exploit the TX bulking at all.
>>
>> But please see below, I am not sure different contexts should share
>> the doorbell ringing, it is really risky.
>>
>> > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2
>> > drivers/net/ethernet/mellanox/mlx4/en_tx.c | 90 +++++++++++------
>> > drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 4
>> > include/linux/netdevice.h | 1
>> > net/core/net-sysfs.c | 18 +++
>> > 5 files changed, 83 insertions(+), 32 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > index 6562f78b07f4..fbea83218fc0 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>> >
>> > if (polled) {
>> > if (doorbell_pending)
>> > - mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
>> > + mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>> >
>> > mlx4_cq_set_ci(&cq->mcq);
>> > wmb(); /* ensure HW sees CQ consumer before we post new buffers */
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > index 4b597dca5c52..affebb435679 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
>> > ring->size = size;
>> > ring->size_mask = size - 1;
>> > ring->sp_stride = stride;
>> > - ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
>> > + ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>> >
>> > tmp = size * sizeof(struct mlx4_en_tx_info);
>> > ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
>> > @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
>> > ring->sp_cqn = cq;
>> > ring->prod = 0;
>> > ring->cons = 0xffffffff;
>> > + ring->ncons = 0;
>> > ring->last_nr_txbb = 1;
>> > memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
>> > memset(ring->buf, 0, ring->buf_size);
>> > @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
>> > MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
>> > }
>> >
>> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
>> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>> > {
>> > - return ring->prod - ring->cons > ring->full_size;
>> > + return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>> > }
>> >
>> > static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
>> > @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> >
>> > /* Skip last polled descriptor */
>> > ring->cons += ring->last_nr_txbb;
>> > + ring->ncons += ring->last_nr_txbb;
>> > en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
>> > ring->cons, ring->prod);
>> >
>> > @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> > !!(ring->cons & ring->size), 0,
>> > 0 /* Non-NAPI caller */);
>> > ring->cons += ring->last_nr_txbb;
>> > + ring->ncons += ring->last_nr_txbb;
>> > cnt++;
>> > }
>> >
>> > @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> > return cnt;
>> > }
>> >
>> > +void mlx4_en_xmit_doorbell(const struct net_device *dev,
>> > + struct mlx4_en_tx_ring *ring)
>> > +{
>> > +
>> > + if (dev->doorbell_opt & 1) {
>> > + u32 oval = READ_ONCE(ring->prod_bell);
>> > + u32 nval = READ_ONCE(ring->prod);
>> > +
>> > + if (oval == nval)
>> > + return;
>> > +
>> > + /* I can not tell yet if a cmpxchg() is needed or not */
>> > + if (dev->doorbell_opt & 2)
>> > + WRITE_ONCE(ring->prod_bell, nval);
>> > + else
>> > + if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
>> > + return;
>> > + }
>> > + /* Since there is no iowrite*_native() that writes the
>> > + * value as is, without byteswapping - using the one
>> > + * the doesn't do byteswapping in the relevant arch
>> > + * endianness.
>> > + */
>> > +#if defined(__LITTLE_ENDIAN)
>> > + iowrite32(
>> > +#else
>> > + iowrite32be(
>> > +#endif
>> > + ring->doorbell_qpn,
>> > + ring->bf.uar->map + MLX4_SEND_DOORBELL);
>> > +}
>> > +
>> > static bool mlx4_en_process_tx_cq(struct net_device *dev,
>> > struct mlx4_en_cq *cq, int napi_budget)
>> > {
>> > @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
>> > wmb();
>> >
>> > /* we want to dirty this cache line once */
>> > - ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
>> > - ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
>> > + WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
>> > + ring_cons += txbbs_skipped;
>> > + WRITE_ONCE(ring->cons, ring_cons);
>> > + WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
>> > +
>> > + if (dev->doorbell_opt)
>> > + mlx4_en_xmit_doorbell(dev, ring);
>> >
>> > if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
>> > return done < budget;
>> > @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
>> > __iowrite64_copy(dst, src, bytecnt / 8);
>> > }
>> >
>> > -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
>> > -{
>> > - wmb();
>>
>> you missed/removed this "wmb()" in the new function, why ? where did
>> you compensate for it ?
>
> I removed it because I had a cmpxchg() there if the barrier was needed.
>
Cool, so the answer is yes: the barrier is needed in order for the HW
to see the last step of mlx4_en_tx_write_desc, where we write the
ownership bit (which marks the descriptor as valid for HW processing):
tx_desc->ctrl.owner_opcode = op_own;
Ringing the doorbell without this wmb might cause the HW to miss that
last packet.
> My patch is a WIP, where you can set the bit 2 to ask to replace the
> cmpxchg() by a simple write, only for performance testing/comparisons.
>
>
>>
>> > - /* Since there is no iowrite*_native() that writes the
>> > - * value as is, without byteswapping - using the one
>> > - * the doesn't do byteswapping in the relevant arch
>> > - * endianness.
>> > - */
>> > -#if defined(__LITTLE_ENDIAN)
>> > - iowrite32(
>> > -#else
>> > - iowrite32be(
>> > -#endif
>> > - ring->doorbell_qpn,
>> > - ring->bf.uar->map + MLX4_SEND_DOORBELL);
>> > -}
>> >
>> > static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> > struct mlx4_en_tx_desc *tx_desc,
>> > union mlx4_wqe_qpn_vlan qpn_vlan,
>> > int desc_size, int bf_index,
>> > __be32 op_own, bool bf_ok,
>> > - bool send_doorbell)
>> > + bool send_doorbell,
>> > + const struct net_device *dev, int nr_txbb)
>> > {
>> > tx_desc->ctrl.qpn_vlan = qpn_vlan;
>> >
>> > @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> >
>> > wmb();
>> >
>> > + ring->prod += nr_txbb;
>> > mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
>> > desc_size);
>> >
>> > @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> > */
>> > dma_wmb();
>> > tx_desc->ctrl.owner_opcode = op_own;
>> > + ring->prod += nr_txbb;
>> > if (send_doorbell)
>> > - mlx4_en_xmit_doorbell(ring);
>> > + mlx4_en_xmit_doorbell(dev, ring);
>> > else
>> > ring->xmit_more++;
>> > }
>> > @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>> > op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
>> > }
>> >
>> > - ring->prod += nr_txbb;
>> > -
>> > /* If we used a bounce buffer then copy descriptor back into place */
>> > if (unlikely(bounce))
>> > tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
>> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>> > }
>> > send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>> >
>> > + /* Doorbell avoidance : We can omit doorbell if we know a TX completion
>> > + * will happen shortly.
>> > + */
>> > + if (send_doorbell &&
>> > + dev->doorbell_opt &&
>> > + (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
>>
>> Alexei already expressed his worries about synchronization, and I
>> think here (in this exact line) sits the problem,
>> what about if exactly at this point the TX completion handler just
>> finished and rang the last doorbell,
>> you didn't write the new TX descriptor yet (mlx4_en_tx_write_desc), so
>> the last doorbell from the CQ handler missed it.
>> even if you wrote the TX descriptor before the doorbell decision here,
>> you will need a memory barrier to make sure the HW sees
>> the new packet, which was typically done before ringing the doorbell.
>
>
> My patch is a WIP, meaning it is not completed ;)
>
> Surely we can find a non racy way to handle this.
The question is, can it be done without locking? Maybe ring the
doorbell every N msecs, just in case.
>
>
>> All in all, this is risky business :), the right way to go is to
>> force the upper layer to use xmit-more and delay doorbells/use bulking
>> but from the same context
>> (xmit routine). For example see Achiad's suggestion (attached in
>> Jesper's response), he used stop queue to force the stack to queue up
>> packets (TX bulking)
>> which would set xmit-more and will use the next completion to release
>> the "stopped" ring TXQ rather than hit the doorbell on behalf of it.
>
>
>
> Well, you depend on having a higher level queue like a qdisc.
>
> Some users do not use a qdisc.
> If you stop the queue, they no longer can send anything -> drops.
>
In this case, I think they should implement their own bulking (pktgen
is not a good example), but XDP can predict whether it has more packets
to xmit, as long as all of them fall within the same NAPI cycle.
Others should try to do the same.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-11-30 16:27 ` Saeed Mahameed
@ 2016-11-30 17:28 ` Eric Dumazet
2016-12-01 12:05 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-11-30 17:28 UTC (permalink / raw)
To: Saeed Mahameed
Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
Saeed Mahameed, Tariq Toukan
On Wed, 2016-11-30 at 18:27 +0200, Saeed Mahameed wrote:
>
> In this case, i think they should implement their own bulking (pktgen
> is not a good example)
> but XDP can predict if it has more packets to xmit as long as all of
> them fall in the same NAPI cycle.
> Others should try and do the same.
We cannot trust user space to signal us when it is a good time to
ring a doorbell.
trafgen is using af_packet with qdisc bypass, for example.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-11-30 16:27 ` Saeed Mahameed
2016-11-30 17:28 ` Eric Dumazet
@ 2016-12-01 12:05 ` Jesper Dangaard Brouer
2016-12-01 14:24 ` Eric Dumazet
1 sibling, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-01 12:05 UTC (permalink / raw)
To: Saeed Mahameed
Cc: Eric Dumazet, Rick Jones, Linux Netdev List, Saeed Mahameed,
Tariq Toukan, brouer
On Wed, 30 Nov 2016 18:27:45 +0200
Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> >> All in all, this is risky business :), the right way to go is to
> >> force the upper layer to use xmit-more and delay doorbells/use bulking
> >> but from the same context (xmit routine). For example see
> >> Achiad's suggestion (attached in Jesper's response), he used stop
> >> queue to force the stack to queue up packets (TX bulking)
> >> which would set xmit-more and will use the next completion to
> >> release the "stopped" ring TXQ rather than hit the doorbell on
> >> behalf of it.
> >
> > Well, you depend on having a higher level queue like a qdisc.
> >
> > Some users do not use a qdisc.
> > If you stop the queue, they no longer can send anything -> drops.
> >
You do have a point that stopping the device might not be the best way
to create push-back (to allow the stack to queue packets).
netif_tx_stop_queue() / __QUEUE_STATE_DRV_XOFF
> In this case, i think they should implement their own bulking (pktgen
> is not a good example) but XDP can predict if it has more packets to
> xmit as long as all of them fall in the same NAPI cycle.
> Others should try and do the same.
I actually agree with Saeed here.
Maybe we can come up with another __QUEUE_STATE_xxx that informs the
upper layer what the driver is doing. Then users not using a qdisc can
use this indication (like the qdisc could). (qdisc-bypass users already
check the QUEUE_STATE flags e.g. via netif_xmit_frozen_or_drv_stopped).
My main objection is that this is a driver-local optimization. By not
involving the upper layers, the netstack loses the ability to amortize
its cost, as it still does per-packet handling.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 12:05 ` Jesper Dangaard Brouer
@ 2016-12-01 14:24 ` Eric Dumazet
2016-12-01 16:04 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 62+ messages in thread
From: Eric Dumazet @ 2016-12-01 14:24 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
Tariq Toukan
On Thu, 2016-12-01 at 13:05 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 30 Nov 2016 18:27:45 +0200
> Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>
> > >> All in all, this is risky business :), the right way to go is to
> > >> force the upper layer to use xmit-more and delay doorbells/use bulking
> > >> but from the same context (xmit routine). For example see
> > >> Achiad's suggestion (attached in Jesper's response), he used stop
> > >> queue to force the stack to queue up packets (TX bulking)
> > >> which would set xmit-more and will use the next completion to
> > >> release the "stopped" ring TXQ rather than hit the doorbell on
> > >> behalf of it.
> > >
> > > Well, you depend on having a higher level queue like a qdisc.
> > >
> > > Some users do not use a qdisc.
> > > If you stop the queue, they no longer can send anything -> drops.
> > >
>
> You do have a point that stopping the device might not be the best way
> to create a push-back (to allow stack queue packets).
>
> netif_tx_stop_queue() / __QUEUE_STATE_DRV_XOFF
>
>
> > In this case, i think they should implement their own bulking (pktgen
> > is not a good example) but XDP can predict if it has more packets to
> > xmit as long as all of them fall in the same NAPI cycle.
> > Others should try and do the same.
>
> I actually agree with Saeed here.
>
> Maybe we can come up with another __QUEUE_STATE_xxx that informs the
> upper layer what the driver is doing. Then users not using a qdisc can
> use this indication (like the qdisc could). (qdisc-bypass users already
> check the QUEUE_STATE flags e.g. via netif_xmit_frozen_or_drv_stopped).
Can you explain how this is going to help trafgen using AF_PACKET with
Qdisc bypass ?
Say trafgen wants to send 10 or 1000 packets back to back (as fast as
possible)
With my proposal, only the first is triggering a doorbell from
ndo_start_xmit(). Following ones are driven by TX completion logic, or
BQL if we can push packets faster than TX interrupt can be
delivered/handled.
If you stop the queue (with yet another atomic operations to stop/unstop
btw), packet_direct_xmit() will have to drop trafgen packets on the
floor.
We already have BQL stopping the queue at a fine granularity.
I suspect that Saeed proposal will interfere with BQL logic.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 14:24 ` Eric Dumazet
@ 2016-12-01 16:04 ` Jesper Dangaard Brouer
2016-12-01 17:04 ` Eric Dumazet
0 siblings, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-01 16:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
Tariq Toukan, brouer
On Thu, 01 Dec 2016 06:24:34 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-12-01 at 13:05 +0100, Jesper Dangaard Brouer wrote:
> > On Wed, 30 Nov 2016 18:27:45 +0200
> > Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> >
> > > >> All in all, this is risky business :), the right way to go is to
> > > >> force the upper layer to use xmit-more and delay doorbells/use bulking
> > > >> but from the same context (xmit routine). For example see
> > > >> Achiad's suggestion (attached in Jesper's response), he used stop
> > > >> queue to force the stack to queue up packets (TX bulking)
> > > >> which would set xmit-more and will use the next completion to
> > > >> release the "stopped" ring TXQ rather than hit the doorbell on
> > > >> behalf of it.
> > > >
> > > > Well, you depend on having a higher level queue like a qdisc.
> > > >
> > > > Some users do not use a qdisc.
> > > > If you stop the queue, they no longer can send anything -> drops.
> > > >
> >
> > You do have a point that stopping the device might not be the best way
> > to create a push-back (to allow stack queue packets).
> >
> > netif_tx_stop_queue() / __QUEUE_STATE_DRV_XOFF
> >
> >
> > > In this case, i think they should implement their own bulking (pktgen
> > > is not a good example) but XDP can predict if it has more packets to
> > > xmit as long as all of them fall in the same NAPI cycle.
> > > Others should try and do the same.
> >
> > I actually agree with Saeed here.
> >
> > Maybe we can come up with another __QUEUE_STATE_xxx that informs the
> > upper layer what the driver is doing. Then users not using a qdisc can
> > use this indication (like the qdisc could). (qdisc-bypass users already
> > check the QUEUE_STATE flags e.g. via netif_xmit_frozen_or_drv_stopped).
>
> Can you explain how this is going to help trafgen using AF_PACKET with
> Qdisc bypass ?
>
> Say trafgen wants to send 10 or 1000 packets back to back (as fast as
> possible)
>
> With my proposal, only the first is triggering a doorbell from
> ndo_start_xmit(). Following ones are driven by TX completion logic, or
> BQL if we can push packets faster than TX interrupt can be
> delivered/handled.
>
> If you stop the queue (with yet another atomic operations to stop/unstop
> btw), packet_direct_xmit() will have to drop trafgen packets on the
> floor.
I think you misunderstood my concept[1]. I don't want to stop the
queue. The new __QUEUE_STATE_FLUSH_NEEDED does not stop the queue; it
just indicates that someone needs to flush/ring the doorbell. Maybe it
needs another name, because it also indicates that the driver can see
that its TX queue is so busy that we don't need to flush it immediately.
The qdisc layer can then choose to enqueue instead of doing a direct xmit.
When the qdisc layer or trafgen/af_packet sees this indication, it knows
it should/must flush the queue when it has no more work left. Perhaps
through net_tx_action(), by registering itself, so that e.g. if qdisc_run()
is called and the queue is empty, it checks whether the queue needs a
flush. I would also allow the driver to flush and clear this bit.
I just see it as an extension of your solution, as we still need the
driver to figure out when the doorbell/flush can be delayed.
P.S. Don't be discouraged by this feedback; I'm just very excited and
happy that you are working on a solution in this area, as this is a
problem area that I've not been able to solve myself for the last
approx. 2 years. Keep up the good work!
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
[1] http://lkml.kernel.org/r/20161130233015.3de95356@redhat.com
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 16:04 ` Jesper Dangaard Brouer
@ 2016-12-01 17:04 ` Eric Dumazet
2016-12-01 19:17 ` Jesper Dangaard Brouer
` (2 more replies)
0 siblings, 3 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-12-01 17:04 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
Tariq Toukan
On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
> I think you misunderstood my concept[1]. I don't want to stop the
> queue. The new __QUEUE_STATE_FLUSH_NEEDED does not stop the queue, is
> it just indicating that someone need to flush/ring-doorbell. Maybe it
> need another name, because it also indicate that the driver can see
> that its TX queue is so busy that we don't need to call it immediately.
> The qdisc layer can then choose to enqueue instead if doing direct xmit.
But driver ndo_start_xmit() does not have a pointer to qdisc.
Also the concept of 'queue busy' just because we queued one packet is a
bit flaky.
>
> When qdisc layer or trafgen/af_packet see this indication it knows it
> should/must flush the queue when it don't have more work left. Perhaps
> through net_tx_action(), by registering itself and e.g. if qdisc_run()
> is called and queue is empty then check if queue needs a flush. I would
> also allow driver to flush and clear this bit.
net_tx_action() is not normally called, unless BQL limit is hit and/or
some qdiscs with throttling (HTB, TBF, FQ, ...)
>
> I just see it as an extension of your solution, as we still need the
> driver to figure out then the doorbell/flush can be delayed.
> p.s. don't be discouraged by this feedback, I'm just very excited and
> happy that your are working on a solution in this area. As this is a
> problem area that I've not been able to solve myself for the last
> approx 2 years. Keep up the good work!
Do not worry, I appreciate the feedbacks ;)
BTW, if you are doing tests on mlx4 40Gbit, would you check the
following quick/dirty hack, using lots of low-rate flows ?
mlx4 has really hard time to transmit small TSO packets (2 or 3 MSS)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 12ea3405f442..96940666abd3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2631,6 +2631,11 @@ static void mlx4_en_del_vxlan_port(struct net_device *dev,
queue_work(priv->mdev->workqueue, &priv->vxlan_del_task);
}
+static unsigned int mlx4_gso_segs_min = 4; /* TSO packets with fewer than 4 segments are segmented */
+module_param_named(mlx4_gso_segs_min, mlx4_gso_segs_min, uint, 0644);
+MODULE_PARM_DESC(mlx4_gso_segs_min, "threshold for software segmentation of small TSO packets");
+
+
static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
struct net_device *dev,
netdev_features_t features)
@@ -2651,6 +2656,8 @@ static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
(udp_hdr(skb)->dest != priv->vxlan_port))
features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
}
+ if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
+ features &= NETIF_F_GSO_MASK;
return features;
}
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 17:04 ` Eric Dumazet
@ 2016-12-01 19:17 ` Jesper Dangaard Brouer
2016-12-01 20:11 ` Eric Dumazet
2016-12-01 20:20 ` David Miller
2016-12-02 14:23 ` Eric Dumazet
2 siblings, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-01 19:17 UTC (permalink / raw)
To: Eric Dumazet
Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
Tariq Toukan, brouer
On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> BTW, if you are doing tests on mlx4 40Gbit,
I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the
machines too.
> would you check the
> following quick/dirty hack, using lots of low-rate flows ?
What tool should I use to send "low-rate flows"?
And what am I looking for?
> mlx4 has a really hard time transmitting small TSO packets (2 or 3 MSS)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 12ea3405f442..96940666abd3 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2631,6 +2631,11 @@ static void mlx4_en_del_vxlan_port(struct net_device *dev,
> queue_work(priv->mdev->workqueue, &priv->vxlan_del_task);
> }
>
> +static int mlx4_gso_segs_min = 4; /* TSO packets with less than 4 segments are segmented */
> +module_param_named(mlx4_gso_segs_min, mlx4_gso_segs_min, uint, 0644);
> +MODULE_PARM_DESC(mlx4_gso_segs_min, "threshold for software segmentation of small TSO packets");
> +
> +
> static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
> struct net_device *dev,
> netdev_features_t features)
> @@ -2651,6 +2656,8 @@ static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
> (udp_hdr(skb)->dest != priv->vxlan_port))
> features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
> }
> + if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
> + features &= NETIF_F_GSO_MASK;
>
> return features;
> }
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 19:17 ` Jesper Dangaard Brouer
@ 2016-12-01 20:11 ` Eric Dumazet
0 siblings, 0 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-12-01 20:11 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
Tariq Toukan
On Thu, 2016-12-01 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > BTW, if you are doing tests on mlx4 40Gbit,
>
> I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the
> machines too.
>
> > would you check the
> > following quick/dirty hack, using lots of low-rate flows ?
>
> What tool should I use to send "low-rate flows"?
>
You could use https://github.com/google/neper
It supports SO_MAX_PACING_RATE, and you could launch 1600 flows, rate
limited to 3028000 bytes per second (so sending one 2-MSS TSO packet
every ms per flow)
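For reference, a minimal userspace sketch of applying SO_MAX_PACING_RATE per
socket (a hypothetical helper, not neper's actual code; the 3028000 B/s figure
is the one above, i.e. one 2-MSS = 2 * 1514 byte packet per flow per ms):

```c
/* SO_MAX_PACING_RATE is available on Linux >= 3.13 and is enforced by the
 * fq qdisc (or TCP internal pacing on newer kernels). */
#include <assert.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47   /* value from <asm-generic/socket.h> */
#endif

/* rate is in bytes per second; returns 0 on success, -1 with errno set. */
int set_pacing_rate(int fd, unsigned int rate)
{
        return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                          &rate, sizeof(rate));
}
```

Each of the 1600 flows would get its own socket with this cap applied before
sending.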
> And what am I looking for?
Max throughput, in packets per second :/
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 17:04 ` Eric Dumazet
2016-12-01 19:17 ` Jesper Dangaard Brouer
@ 2016-12-01 20:20 ` David Miller
2016-12-01 22:10 ` Eric Dumazet
2016-12-02 14:23 ` Eric Dumazet
2 siblings, 1 reply; 62+ messages in thread
From: David Miller @ 2016-12-01 20:20 UTC (permalink / raw)
To: eric.dumazet; +Cc: brouer, saeedm, rick.jones2, netdev, saeedm, tariqt
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 01 Dec 2016 09:04:17 -0800
> On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
>
>> When qdisc layer or trafgen/af_packet see this indication it knows it
>> should/must flush the queue when it doesn't have more work left. Perhaps
>> through net_tx_action(), by registering itself and e.g. if qdisc_run()
>> is called and queue is empty then check if queue needs a flush. I would
>> also allow driver to flush and clear this bit.
>
> net_tx_action() is not normally called, unless BQL limit is hit and/or
> some qdiscs with throttling (HTB, TBF, FQ, ...)
The one thing I wonder about is whether we should "ramp up" into a mode
where the NAPI poll does the doorbells instead of going directly there.
Maybe I misunderstand your algorithm, but it looks to me like if there
are any active packets in the TX queue at enqueue time you will defer
the doorbell to the interrupt handler.
Let's say we put 1 packet in, and hit the doorbell.
Then another packet comes in and we defer the doorbell to the IRQ.
At this point there are a couple things I'm unclear about.
For example, if we didn't hit the doorbell, will the chip still take a
peek at the second descriptor? Depending upon how the doorbell works
it might, or it might not.
Either way, wouldn't there be a possible condition where the chip
wouldn't see the second enqueued packet and we'd thus have the wire
idle until the interrupt + NAPI runs and hits the doorbell?
This is why I think we should "ramp up" the doorbell deferral, in
order to avoid this potential wire idle time situation.
Maybe the situation I'm worried about is not possible, so please
explain it to me :-)
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 20:20 ` David Miller
@ 2016-12-01 22:10 ` Eric Dumazet
0 siblings, 0 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-12-01 22:10 UTC (permalink / raw)
To: David Miller; +Cc: brouer, saeedm, rick.jones2, netdev, saeedm, tariqt
On Thu, 2016-12-01 at 15:20 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 01 Dec 2016 09:04:17 -0800
>
> > On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
> >
> >> When qdisc layer or trafgen/af_packet see this indication it knows it
> >> should/must flush the queue when it doesn't have more work left. Perhaps
> >> through net_tx_action(), by registering itself and e.g. if qdisc_run()
> >> is called and queue is empty then check if queue needs a flush. I would
> >> also allow driver to flush and clear this bit.
> >
> > net_tx_action() is not normally called, unless BQL limit is hit and/or
> > some qdiscs with throttling (HTB, TBF, FQ, ...)
>
> The one thing I wonder about is whether we should "ramp up" into a mode
> where the NAPI poll does the doorbells instead of going directly there.
>
> Maybe I misunderstand your algorithm, but it looks to me like if there
> are any active packets in the TX queue at enqueue time you will defer
> the doorbell to the interrupt handler.
>
> Let's say we put 1 packet in, and hit the doorbell.
>
> Then another packet comes in and we defer the doorbell to the IRQ.
>
> At this point there are a couple things I'm unclear about.
>
> For example, if we didn't hit the doorbell, will the chip still take a
> peek at the second descriptor? Depending upon how the doorbell works
> it might, or it might not.
It might depend on the hardware. I can easily check on mlx4, by
increasing tx-usecs and tx-frames, and sending 2 packets back to back.
>
> Either way, wouldn't there be a possible condition where the chip
> wouldn't see the second enqueued packet and we'd thus have the wire
> idle until the interrupt + NAPI runs and hits the doorbell?
>
> This is why I think we should "ramp up" the doorbell deferral, in
> order to avoid this potential wire idle time situation.
>
> Maybe the situation I'm worried about is not possible, so please
> explain it to me :-)
This is absolutely the problem. We might need to enable this mode only
above a given load. We could have an EWMA of the number of packets
that TX completion runs can dequeue, and enable the auto doorbell only
if we have at least that many packets in the TX ring (instead of the
"1 packet threshold" of the WIP).
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 17:04 ` Eric Dumazet
2016-12-01 19:17 ` Jesper Dangaard Brouer
2016-12-01 20:20 ` David Miller
@ 2016-12-02 14:23 ` Eric Dumazet
2 siblings, 0 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-12-02 14:23 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
Tariq Toukan
On Thu, 2016-12-01 at 09:04 -0800, Eric Dumazet wrote:
> On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
>
> > I think you misunderstood my concept[1]. I don't want to stop the
> > queue. The new __QUEUE_STATE_FLUSH_NEEDED does not stop the queue, it
> > just indicates that someone needs to flush/ring the doorbell. Maybe it
> > needs another name, because it also indicates that the driver can see
> > that its TX queue is so busy that we don't need to call it immediately.
> > The qdisc layer can then choose to enqueue instead of doing a direct xmit.
>
> But driver ndo_start_xmit() does not have a pointer to qdisc.
>
> Also the concept of 'queue busy' just because we queued one packet is a
> bit flaky.
>
> >
> > When qdisc layer or trafgen/af_packet see this indication it knows it
> > should/must flush the queue when it doesn't have more work left. Perhaps
> > through net_tx_action(), by registering itself and e.g. if qdisc_run()
> > is called and queue is empty then check if queue needs a flush. I would
> > also allow driver to flush and clear this bit.
>
> net_tx_action() is not normally called, unless BQL limit is hit and/or
> some qdiscs with throttling (HTB, TBF, FQ, ...)
>
> >
> > I just see it as an extension of your solution, as we still need the
> > driver to figure out then the doorbell/flush can be delayed.
> > p.s. don't be discouraged by this feedback, I'm just very excited and
> > happy that you are working on a solution in this area. As this is a
> > problem area that I've not been able to solve myself for the last
> > approx 2 years. Keep up the good work!
>
> Do not worry, I appreciate the feedback ;)
>
> BTW, if you are doing tests on mlx4 40Gbit, would you check the
> following quick/dirty hack, using lots of low-rate flows ?
>
> mlx4 has a really hard time transmitting small TSO packets (2 or 3 MSS)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 12ea3405f442..96940666abd3 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2631,6 +2631,11 @@ static void mlx4_en_del_vxlan_port(struct net_device *dev,
> queue_work(priv->mdev->workqueue, &priv->vxlan_del_task);
> }
>
> +static int mlx4_gso_segs_min = 4; /* TSO packets with less than 4 segments are segmented */
> +module_param_named(mlx4_gso_segs_min, mlx4_gso_segs_min, uint, 0644);
> +MODULE_PARM_DESC(mlx4_gso_segs_min, "threshold for software segmentation of small TSO packets");
> +
> +
> static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
> struct net_device *dev,
> netdev_features_t features)
> @@ -2651,6 +2656,8 @@ static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
> (udp_hdr(skb)->dest != priv->vxlan_port))
> features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
> }
> + if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
> + features &= NETIF_F_GSO_MASK;
Sorry, stupid typo.
This should be "features &= ~NETIF_F_GSO_MASK;" of course
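For anyone skimming, the difference the typo makes, shown with made-up feature
bits (not the real NETIF_F_* values):

```c
#include <assert.h>

#define F_CSUM     0x01u
#define F_SG       0x02u
#define F_GSO_MASK 0xf0u    /* pretend GSO feature bits */

/* The typo: keeps ONLY the GSO bits, dropping CSUM, SG, etc. */
unsigned int keep_only_gso(unsigned int features)
{
        return features & F_GSO_MASK;
}

/* The intended fix: clears ONLY the GSO bits, forcing software
 * segmentation while leaving the other offloads intact. */
unsigned int drop_gso(unsigned int features)
{
        return features & ~F_GSO_MASK;
}
```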
>
> return features;
> }
>
>
>
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-11-29 6:58 ` [WIP] net+mlx4: auto doorbell Eric Dumazet
2016-11-30 11:38 ` Jesper Dangaard Brouer
2016-11-30 13:50 ` Saeed Mahameed
@ 2016-12-01 21:32 ` Alexander Duyck
2016-12-01 22:04 ` Eric Dumazet
2 siblings, 1 reply; 62+ messages in thread
From: Alexander Duyck @ 2016-12-01 21:32 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, Rick Jones, Netdev, Saeed Mahameed, Tariq Toukan
On Mon, Nov 28, 2016 at 10:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>
>
>> Not sure it this has been tried before, but the doorbell avoidance could
>> be done by the driver itself, because it knows a TX completion will come
>> shortly (well... if softirqs are not delayed too much !)
>>
>> Doorbell would be forced only if :
>>
>> ( "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> OR
>> ( too many [1] packets were put in TX ring buffer, no point deferring
>> more)
>>
>> Start the pump, but once it is started, let the doorbells being done by
>> TX completion.
>>
>> ndo_start_xmit and TX completion handler would have to maintain a shared
>> state describing if packets were ready but doorbell deferred.
>>
>>
>> Note that TX completion means "if at least one packet was drained",
>> otherwise busy polling, constantly calling napi->poll() would force a
>> doorbell too soon for devices sharing a NAPI for both RX and TX.
>>
>> But then, maybe busy poll would like to force a doorbell...
>>
>> I could try these ideas on mlx4 shortly.
>>
>>
>> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>
> I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> not used.
>
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:26 eth0 0.00 5822800.00 0.00 597064.41 0.00 0.00 1.00
> 22:43:27 eth0 24.00 5788237.00 2.09 593520.26 0.00 0.00 0.00
> 22:43:28 eth0 12.00 5817777.00 1.43 596551.47 0.00 0.00 1.00
> 22:43:29 eth0 22.00 5841516.00 1.61 598982.87 0.00 0.00 0.00
> 22:43:30 eth0 4.00 4389137.00 0.71 450058.08 0.00 0.00 1.00
> 22:43:31 eth0 4.00 5871008.00 0.72 602007.79 0.00 0.00 0.00
> 22:43:32 eth0 12.00 5891809.00 1.43 604142.60 0.00 0.00 1.00
> 22:43:33 eth0 10.00 5901904.00 1.12 605175.70 0.00 0.00 0.00
> 22:43:34 eth0 5.00 5907982.00 0.69 605798.99 0.00 0.00 1.00
> 22:43:35 eth0 2.00 5847086.00 0.12 599554.71 0.00 0.00 0.00
> Average: eth0 9.50 5707925.60 0.99 585285.69 0.00 0.00 0.50
> lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:47 eth0 9.00 10226424.00 1.02 1048608.05 0.00 0.00 1.00
> 22:43:48 eth0 1.00 10316955.00 0.06 1057890.89 0.00 0.00 0.00
> 22:43:49 eth0 1.00 10310104.00 0.10 1057188.32 0.00 0.00 1.00
> 22:43:50 eth0 0.00 10249423.00 0.00 1050966.23 0.00 0.00 0.00
> 22:43:51 eth0 0.00 10210441.00 0.00 1046969.05 0.00 0.00 1.00
> 22:43:52 eth0 2.00 10198389.00 0.16 1045733.17 0.00 0.00 1.00
> 22:43:53 eth0 8.00 10079257.00 1.43 1033517.83 0.00 0.00 0.00
> 22:43:54 eth0 0.00 7693509.00 0.00 788885.16 0.00 0.00 0.00
> 22:43:55 eth0 2.00 10343076.00 0.20 1060569.32 0.00 0.00 1.00
> 22:43:56 eth0 1.00 10224571.00 0.14 1048417.93 0.00 0.00 0.00
> Average: eth0 2.40 9985214.90 0.31 1023874.60 0.00 0.00 0.50
>
> And about 11 % improvement on an mono-flow UDP_STREAM test.
>
> skb_set_owner_w() is now the most consuming function.
A few years back when I did something like this on ixgbe I was told by
you that the issue was that doing something like this would add too
much latency. I was just wondering what the latency impact is on a
change like this and if this now isn't that much of an issue?
>
> lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> [1] 13696
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:50:47 eth0 3.00 1355422.00 0.45 319706.04 0.00 0.00 0.00
> 22:50:48 eth0 2.00 1344270.00 0.42 317035.21 0.00 0.00 1.00
> 22:50:49 eth0 3.00 1350503.00 0.51 318478.34 0.00 0.00 0.00
> 22:50:50 eth0 29.00 1348593.00 2.86 318113.02 0.00 0.00 1.00
> 22:50:51 eth0 14.00 1354855.00 1.83 319508.56 0.00 0.00 0.00
> 22:50:52 eth0 7.00 1357794.00 0.73 320226.89 0.00 0.00 1.00
> 22:50:53 eth0 5.00 1326130.00 0.63 312784.72 0.00 0.00 0.00
> 22:50:54 eth0 2.00 994584.00 0.12 234598.40 0.00 0.00 1.00
> 22:50:55 eth0 5.00 1318209.00 0.75 310932.46 0.00 0.00 0.00
> 22:50:56 eth0 20.00 1323445.00 1.73 312178.11 0.00 0.00 1.00
> Average: eth0 9.00 1307380.50 1.00 308356.18 0.00 0.00 0.50
> lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:51:03 eth0 4.00 1512055.00 0.54 356599.40 0.00 0.00 0.00
> 22:51:04 eth0 4.00 1507631.00 0.55 355609.46 0.00 0.00 1.00
> 22:51:05 eth0 4.00 1487789.00 0.42 350917.47 0.00 0.00 0.00
> 22:51:06 eth0 7.00 1474460.00 1.22 347811.16 0.00 0.00 1.00
> 22:51:07 eth0 2.00 1496529.00 0.24 352995.18 0.00 0.00 0.00
> 22:51:08 eth0 3.00 1485856.00 0.49 350425.65 0.00 0.00 1.00
> 22:51:09 eth0 1.00 1114808.00 0.06 262905.38 0.00 0.00 0.00
> 22:51:10 eth0 2.00 1510924.00 0.30 356397.53 0.00 0.00 1.00
> 22:51:11 eth0 2.00 1506408.00 0.30 355345.76 0.00 0.00 0.00
> 22:51:12 eth0 2.00 1499122.00 0.32 353668.75 0.00 0.00 1.00
> Average: eth0 3.10 1459558.20 0.44 344267.57 0.00 0.00 0.50
>
> drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2
> drivers/net/ethernet/mellanox/mlx4/en_tx.c | 90 +++++++++++------
> drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 4
> include/linux/netdevice.h | 1
> net/core/net-sysfs.c | 18 +++
> 5 files changed, 83 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 6562f78b07f4..fbea83218fc0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>
> if (polled) {
> if (doorbell_pending)
> - mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> + mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>
> mlx4_cq_set_ci(&cq->mcq);
> wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
> ring->size = size;
> ring->size_mask = size - 1;
> ring->sp_stride = stride;
> - ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> + ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>
Just wondering if this should be a separate change? Seems like something
that might apply outside of your changes, since it is reserving enough
room for 2 buffers.
> tmp = size * sizeof(struct mlx4_en_tx_info);
> ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
> ring->sp_cqn = cq;
> ring->prod = 0;
> ring->cons = 0xffffffff;
> + ring->ncons = 0;
> ring->last_nr_txbb = 1;
> memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
> memset(ring->buf, 0, ring->buf_size);
> @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
> MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
> }
>
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> {
> - return ring->prod - ring->cons > ring->full_size;
> + return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> }
>
With the use of READ_ONCE, are there some barriers that will need to
be added to make sure these are populated in the correct spots? Any
ordering constraints on the updates to these fields?
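As an aside, the prod/cons subtraction in the quoted hunk is safe even when
the free-running 32-bit counters overflow, because unsigned subtraction wraps.
A userspace sketch of that property (illustrative names, not the driver code):

```c
#include <assert.h>
#include <stdint.h>

/* prod and cons only ever increase (mod 2^32); their unsigned difference
 * is the ring occupancy even after either counter wraps past UINT32_MAX. */
uint32_t ring_inflight(uint32_t prod, uint32_t cons)
{
        return prod - cons;
}

int ring_full(uint32_t prod, uint32_t cons, uint32_t full_size)
{
        return ring_inflight(prod, cons) > full_size;
}
```

This is why the counters are never masked before the comparison: masking would
break the wraparound arithmetic.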
> static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>
> /* Skip last polled descriptor */
> ring->cons += ring->last_nr_txbb;
> + ring->ncons += ring->last_nr_txbb;
> en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
> ring->cons, ring->prod);
>
> @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> !!(ring->cons & ring->size), 0,
> 0 /* Non-NAPI caller */);
> ring->cons += ring->last_nr_txbb;
> + ring->ncons += ring->last_nr_txbb;
> cnt++;
> }
>
> @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> return cnt;
> }
>
> +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> + struct mlx4_en_tx_ring *ring)
> +{
> +
> + if (dev->doorbell_opt & 1) {
> + u32 oval = READ_ONCE(ring->prod_bell);
> + u32 nval = READ_ONCE(ring->prod);
> +
> + if (oval == nval)
> + return;
> +
> + /* I can not tell yet if a cmpxchg() is needed or not */
> + if (dev->doorbell_opt & 2)
> + WRITE_ONCE(ring->prod_bell, nval);
> + else
> + if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> + return;
> + }
> + /* Since there is no iowrite*_native() that writes the
> + * value as is, without byteswapping - using the one
> + * the doesn't do byteswapping in the relevant arch
> + * endianness.
> + */
> +#if defined(__LITTLE_ENDIAN)
> + iowrite32(
> +#else
> + iowrite32be(
> +#endif
> + ring->doorbell_qpn,
> + ring->bf.uar->map + MLX4_SEND_DOORBELL);
> +}
> +
I realize you are just copying from the function further down, but
couldn't this just be __raw_writel instead of iowrite32?
Also you may be able to squeeze a bit more out of this by doing some
barrier clean-ups. In most drivers the wmb() is needed between writes
of the DMA memory and the MMIO memory. So odds are you could hold off
on using a wmb() until you call this function and you wouldn't really
need it until just before the call to iowrite32 or __raw_writel. The
rest of it could use either an smp_wmb() or dma_wmb().
> static bool mlx4_en_process_tx_cq(struct net_device *dev,
> struct mlx4_en_cq *cq, int napi_budget)
> {
> @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
> wmb();
>
> /* we want to dirty this cache line once */
> - ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> - ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> + WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> + ring_cons += txbbs_skipped;
> + WRITE_ONCE(ring->cons, ring_cons);
> + WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> +
> + if (dev->doorbell_opt)
> + mlx4_en_xmit_doorbell(dev, ring);
>
It might be more readable to put the addition on ring_cons out in
front of all the WRITE_ONCE macro calls.
> if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
> return done < budget;
> @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
> __iowrite64_copy(dst, src, bytecnt / 8);
> }
>
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> -{
> - wmb();
> - /* Since there is no iowrite*_native() that writes the
> - * value as is, without byteswapping - using the one
> - * the doesn't do byteswapping in the relevant arch
> - * endianness.
> - */
> -#if defined(__LITTLE_ENDIAN)
> - iowrite32(
> -#else
> - iowrite32be(
> -#endif
> - ring->doorbell_qpn,
> - ring->bf.uar->map + MLX4_SEND_DOORBELL);
> -}
>
> static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> struct mlx4_en_tx_desc *tx_desc,
> union mlx4_wqe_qpn_vlan qpn_vlan,
> int desc_size, int bf_index,
> __be32 op_own, bool bf_ok,
> - bool send_doorbell)
> + bool send_doorbell,
> + const struct net_device *dev, int nr_txbb)
> {
> tx_desc->ctrl.qpn_vlan = qpn_vlan;
>
> @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>
> wmb();
>
> + ring->prod += nr_txbb;
> mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
> desc_size);
>
> @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> */
> dma_wmb();
> tx_desc->ctrl.owner_opcode = op_own;
> + ring->prod += nr_txbb;
> if (send_doorbell)
> - mlx4_en_xmit_doorbell(ring);
> + mlx4_en_xmit_doorbell(dev, ring);
> else
> ring->xmit_more++;
> }
> @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
> }
>
> - ring->prod += nr_txbb;
> -
> /* If we used a bounce buffer then copy descriptor back into place */
> if (unlikely(bounce))
> tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> }
> send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>
> + /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> + * will happen shortly.
> + */
> + if (send_doorbell &&
> + dev->doorbell_opt &&
> + (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> + send_doorbell = false;
> +
> real_size = (real_size / 16) & 0x3f;
>
> bf_ok &= desc_size <= MAX_BF && send_doorbell;
> @@ -1043,7 +1076,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> qpn_vlan.fence_size = real_size;
>
> mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
> - op_own, bf_ok, send_doorbell);
> + op_own, bf_ok, send_doorbell, dev, nr_txbb);
>
> if (unlikely(stop_queue)) {
> /* If queue was emptied after the if (stop_queue) , and before
> @@ -1054,7 +1087,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> */
> smp_rmb();
>
> - ring_cons = ACCESS_ONCE(ring->cons);
> if (unlikely(!mlx4_en_is_tx_ring_full(ring))) {
> netif_tx_wake_queue(ring->tx_queue);
> ring->wake_queue++;
> @@ -1158,8 +1190,6 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
> rx_ring->xdp_tx++;
> AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
>
> - ring->prod += nr_txbb;
> -
> stop_queue = mlx4_en_is_tx_ring_full(ring);
> send_doorbell = stop_queue ||
> *doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
> @@ -1173,7 +1203,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
> qpn_vlan.fence_size = real_size;
>
> mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
> - op_own, bf_ok, send_doorbell);
> + op_own, bf_ok, send_doorbell, dev, nr_txbb);
> *doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
>
> return NETDEV_TX_OK;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> */
> u32 last_nr_txbb;
> u32 cons;
> + u32 ncons;
> unsigned long wake_queue;
> struct netdev_queue *tx_queue;
> u32 (*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>
> /* cache line used and dirtied in mlx4_en_xmit() */
> u32 prod ____cacheline_aligned_in_smp;
> + u32 prod_bell;
> unsigned int tx_dropped;
> unsigned long bytes;
> unsigned long packets;
> @@ -699,7 +701,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
> struct mlx4_en_rx_alloc *frame,
> struct net_device *dev, unsigned int length,
> int tx_ind, int *doorbell_pending);
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
> +void mlx4_en_xmit_doorbell(const struct net_device *dev, struct mlx4_en_tx_ring *ring);
> bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
> struct mlx4_en_rx_alloc *frame);
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 4ffcd874cc20..39565b5425a6 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1816,6 +1816,7 @@ struct net_device {
> DECLARE_HASHTABLE (qdisc_hash, 4);
> #endif
> unsigned long tx_queue_len;
> + unsigned long doorbell_opt;
> spinlock_t tx_global_lock;
> int watchdog_timeo;
>
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index b0c04cf4851d..df05f81f5150 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -367,6 +367,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
> }
> NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
>
> +static int change_doorbell_opt(struct net_device *dev, unsigned long val)
> +{
> + dev->doorbell_opt = val;
> + return 0;
> +}
> +
> +static ssize_t doorbell_opt_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + if (!capable(CAP_NET_ADMIN))
> + return -EPERM;
> +
> + return netdev_store(dev, attr, buf, len, change_doorbell_opt);
> +}
> +NETDEVICE_SHOW_RW(doorbell_opt, fmt_ulong);
> +
> static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
> const char *buf, size_t len)
> {
> @@ -531,6 +548,7 @@ static struct attribute *net_class_attrs[] = {
> &dev_attr_phys_port_name.attr,
> &dev_attr_phys_switch_id.attr,
> &dev_attr_proto_down.attr,
> + &dev_attr_doorbell_opt.attr,
> NULL,
> };
> ATTRIBUTE_GROUPS(net_class);
>
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [WIP] net+mlx4: auto doorbell
2016-12-01 21:32 ` Alexander Duyck
@ 2016-12-01 22:04 ` Eric Dumazet
0 siblings, 0 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-12-01 22:04 UTC (permalink / raw)
To: Alexander Duyck
Cc: Jesper Dangaard Brouer, Rick Jones, Netdev, Saeed Mahameed, Tariq Toukan
On Thu, 2016-12-01 at 13:32 -0800, Alexander Duyck wrote:
> A few years back when I did something like this on ixgbe I was told by
> you that the issue was that doing something like this would add too
> much latency. I was just wondering what the latency impact is on a
> change like this and if this now isn't that much of an issue?
I believe it is clear that we can not use this trick without admin
choice.
In my experience, mlx4 can deliver way more interrupts than ixgbe.
(No idea why)
This "auto doorbell" is tied to proper "ethtool -c tx-usecs|tx-frames|
tx-frames-irq", or simply a choice for hosts dedicated to content
delivery (like video servers) with 40GBit+ cards.
Under stress, softirq can be handled by ksoftirqd, and might be delayed
by ~10 ms or even more.
This WIP was minimal, but we certainly need to add belts and suspenders.
1) <maybe> timestamps
use a ktime_get_ns() to remember a timestamp for the last doorbell,
and force a doorbell if it gets too old, checked in ndo_start_xmit()
every time we would like to not send the doorbell.
2) <maybe> a max number of packets not yet doorbelled.
But we cannot rely on another high resolution timer, since timers also
require softirqs being processed in a timely manner.
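A sketch of how those two belts could combine in ndo_start_xmit() (a
hypothetical helper with made-up thresholds, not the WIP code):

```c
#include <assert.h>
#include <stdint.h>

#define DOORBELL_MAX_AGE_NS  (100 * 1000ULL)  /* 100 us, made-up threshold */
#define DOORBELL_MAX_PENDING 32u              /* made-up threshold */

/* Called each time we would otherwise skip the doorbell: force it anyway if
 * the last one is too old (belt) or too many descriptors sit undelivered
 * between prod and the last doorbelled position (suspenders). */
int must_ring_doorbell(uint64_t now_ns, uint64_t last_bell_ns,
                       uint32_t prod, uint32_t prod_bell)
{
        if (now_ns - last_bell_ns > DOORBELL_MAX_AGE_NS)
                return 1;
        if (prod - prod_bell > DOORBELL_MAX_PENDING)
                return 1;
        return 0;
}
```

In the driver, now_ns would come from ktime_get_ns() as suggested above; the
counter check reuses the same wraparound-safe arithmetic as the ring-full test.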
^ permalink raw reply [flat|nested] 62+ messages in thread