* [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more
@ 2014-09-26  0:46 Alexei Starovoitov
  2014-09-26  1:20 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2014-09-26  0:46 UTC (permalink / raw)
  To: David S. Miller
  Cc: Jesper Dangaard Brouer, Eric Dumazet, John Fastabend, netdev

This patch demonstrates the effect of delaying update of HW tailptr.
(based on earlier patch by Jesper)

burst=1 is a default. It sends one packet with xmit_more=false
burst=2 sends one packet with xmit_more=true and
        2nd copy of the same packet with xmit_more=false
burst=3 sends two copies of the same packet with xmit_more=true and
        3rd copy with xmit_more=false
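
In driver terms this maps onto the following call pattern (a conceptual
sketch, not part of the patch itself):

	/* burst=3: the same skb is handed to the driver three times, and
	 * only the last call clears xmit_more, so the driver can postpone
	 * the doorbell / HW tailptr update until then.
	 */
	netdev_start_xmit(skb, odev, txq, true);   /* xmit_more = true  */
	netdev_start_xmit(skb, odev, txq, true);   /* xmit_more = true  */
	netdev_start_xmit(skb, odev, txq, false);  /* xmit_more = false */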

Performance with ixgbe:

usec 30:
burst=1  tx:9.2 Mpps
burst=2  tx:13.6 Mpps
burst=3  tx:14.5 Mpps full 10G line rate

usec 1 (default):
burst=1,4,100 tx:3.9 Mpps

usec 0:
burst=1  tx:4.9 Mpps
burst=2  tx:6.6 Mpps
burst=3  tx:7.9 Mpps
burst=4  tx:8.7 Mpps
burst=8  tx:10.3 Mpps
burst=128  tx:12.4 Mpps

Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---

tx queue size, irq affinity left in default.
pause frames are off.

Nice to finally see line rate generated by one cpu

Compared to Jesper's patch, this one amortizes the cost
of spin_lock and atomic_inc by doing HARD_TX_LOCK and
atomic_add(N) once across N packets.
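
For reference, a minimal user-space sketch of exercising the new knob
through pktgen's existing /proc interface (the device name eth0 is an
assumption, and the device must already be assigned to a kpktgend_N
thread; the string is parsed by pktgen_if_write() in the diff below):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/net/pktgen/eth0", "w");

	if (!f) {
		perror("fopen /proc/net/pktgen/eth0");
		return 1;
	}
	fprintf(f, "burst 8\n");	/* e.g. 8 copies per doorbell */
	fclose(f);
	return 0;
}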

 net/core/pktgen.c |   33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 5c728aa..47557ba 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -387,6 +387,7 @@ struct pktgen_dev {
 	u16 queue_map_min;
 	u16 queue_map_max;
 	__u32 skb_priority;	/* skb priority field */
+	int burst;		/* number of duplicated packets to burst */
 	int node;               /* Memory node */
 
 #ifdef CONFIG_XFRM
@@ -613,6 +614,9 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
 	if (pkt_dev->traffic_class)
 		seq_printf(seq, "     traffic_class: 0x%02x\n", pkt_dev->traffic_class);
 
+	if (pkt_dev->burst > 1)
+		seq_printf(seq, "     burst: %d\n", pkt_dev->burst);
+
 	if (pkt_dev->node >= 0)
 		seq_printf(seq, "     node: %d\n", pkt_dev->node);
 
@@ -1124,6 +1128,16 @@ static ssize_t pktgen_if_write(struct file *file,
 			pkt_dev->dst_mac_count);
 		return count;
 	}
+	if (!strcmp(name, "burst")) {
+		len = num_arg(&user_buffer[i], 10, &value);
+		if (len < 0)
+			return len;
+
+		i += len;
+		pkt_dev->burst = value < 1 ? 1 : value;
+		sprintf(pg_result, "OK: burst=%d", pkt_dev->burst);
+		return count;
+	}
 	if (!strcmp(name, "node")) {
 		len = num_arg(&user_buffer[i], 10, &value);
 		if (len < 0)
@@ -3299,7 +3313,8 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 {
 	struct net_device *odev = pkt_dev->odev;
 	struct netdev_queue *txq;
-	int ret;
+	int burst_cnt, ret;
+	bool more;
 
 	/* If device is offline, then don't send */
 	if (unlikely(!netif_running(odev) || !netif_carrier_ok(odev))) {
@@ -3347,8 +3362,14 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		pkt_dev->last_ok = 0;
 		goto unlock;
 	}
-	atomic_inc(&(pkt_dev->skb->users));
-	ret = netdev_start_xmit(pkt_dev->skb, odev, txq, false);
+	atomic_add(pkt_dev->burst, &pkt_dev->skb->users);
+
+	burst_cnt = 0;
+
+xmit_more:
+	more = ++burst_cnt < pkt_dev->burst;
+
+	ret = netdev_start_xmit(pkt_dev->skb, odev, txq, more);
 
 	switch (ret) {
 	case NETDEV_TX_OK:
@@ -3356,6 +3377,8 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		pkt_dev->sofar++;
 		pkt_dev->seq_num++;
 		pkt_dev->tx_bytes += pkt_dev->last_pkt_size;
+		if (more)
+			goto xmit_more;
 		break;
 	case NET_XMIT_DROP:
 	case NET_XMIT_CN:
@@ -3374,6 +3397,9 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		atomic_dec(&(pkt_dev->skb->users));
 		pkt_dev->last_ok = 0;
 	}
+
+	if (unlikely(pkt_dev->burst - burst_cnt > 0))
+		atomic_sub(pkt_dev->burst - burst_cnt, &pkt_dev->skb->users);
 unlock:
 	HARD_TX_UNLOCK(odev, txq);
 
@@ -3572,6 +3598,7 @@ static int pktgen_add_device(struct pktgen_thread *t, const char *ifname)
 	pkt_dev->svlan_p = 0;
 	pkt_dev->svlan_cfi = 0;
 	pkt_dev->svlan_id = 0xffff;
+	pkt_dev->burst = 1;
 	pkt_dev->node = -1;
 
 	err = pktgen_setup_dev(t->net, pkt_dev, ifname);
-- 
1.7.9.5


* Re: [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more
  2014-09-26  0:46 [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more Alexei Starovoitov
@ 2014-09-26  1:20 ` Eric Dumazet
  2014-09-26  7:42   ` Eric Dumazet
  2014-09-26  8:05 ` [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more Jesper Dangaard Brouer
  2014-09-27 20:59 ` Or Gerlitz
  2 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2014-09-26  1:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Jesper Dangaard Brouer, Eric Dumazet,
	John Fastabend, netdev

On Thu, 2014-09-25 at 17:46 -0700, Alexei Starovoitov wrote:
> This patch demonstrates the effect of delaying update of HW tailptr.
> (based on earlier patch by Jesper)
> 
> burst=1 is a default. It sends one packet with xmit_more=false
> burst=2 sends one packet with xmit_more=true and
>         2nd copy of the same packet with xmit_more=false
> burst=3 sends two copies of the same packet with xmit_more=true and
>         3rd copy with xmit_more=false

> 
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---

Perfect, this is what I had in mind, thanks !

Acked-by: Eric Dumazet <edumazet@google.com>


* Re: [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more
  2014-09-26  1:20 ` Eric Dumazet
@ 2014-09-26  7:42   ` Eric Dumazet
  2014-09-26 15:44     ` Eric Dumazet
  2014-09-27 20:43     ` Eric Dumazet
  0 siblings, 2 replies; 29+ messages in thread
From: Eric Dumazet @ 2014-09-26  7:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Jesper Dangaard Brouer, Eric Dumazet,
	John Fastabend, netdev

On Thu, 2014-09-25 at 18:20 -0700, Eric Dumazet wrote:
> On Thu, 2014-09-25 at 17:46 -0700, Alexei Starovoitov wrote:
> > This patch demonstrates the effect of delaying update of HW tailptr.
> > (based on earlier patch by Jesper)
> > 
> > burst=1 is a default. It sends one packet with xmit_more=false
> > burst=2 sends one packet with xmit_more=true and
> >         2nd copy of the same packet with xmit_more=false
> > burst=3 sends two copies of the same packet with xmit_more=true and
> >         3rd copy with xmit_more=false
> 
> > 
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> > ---
> 
> Perfect, this is what I had in mind, thanks !
> 
> Acked-by: Eric Dumazet <edumazet@google.com>
> 

By the way with this patch, I now reach 10 Mpps on mlx4 

base line : 5 Mpps

> + skb->xmit_more and quick hack in pktgen (spinlock/unlock per packet)
-> 7 Mpps

+ burst of 16 packets, no spinlock per packet -> 10 Mpps


* Re: [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more
  2014-09-26  0:46 [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more Alexei Starovoitov
  2014-09-26  1:20 ` Eric Dumazet
@ 2014-09-26  8:05 ` Jesper Dangaard Brouer
  2014-09-27 20:59 ` Or Gerlitz
  2 siblings, 0 replies; 29+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-26  8:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Eric Dumazet, John Fastabend, netdev, brouer

On Thu, 25 Sep 2014 17:46:22 -0700
Alexei Starovoitov <ast@plumgrid.com> wrote:

> This patch demonstrates the effect of delaying update of HW tailptr.
> (based on earlier patch by Jesper)
> 
> burst=1 is a default. It sends one packet with xmit_more=false
> burst=2 sends one packet with xmit_more=true and
>         2nd copy of the same packet with xmit_more=false
> burst=3 sends two copies of the same packet with xmit_more=true and
>         3rd copy with xmit_more=false
> 
> Performance with ixgbe:
> 
> usec 30:
> burst=1  tx:9.2 Mpps
> burst=2  tx:13.6 Mpps
> burst=3  tx:14.5 Mpps full 10G line rate

Perfect, full wirespeed! :-)

> usec 1 (default):
> burst=1,4,100 tx:3.9 Mpps

Here you are being limited by the TX ring queue cleanup being too slow.
As described here:
 http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html

> usec 0:
> burst=1  tx:4.9 Mpps
> burst=2  tx:6.6 Mpps
> burst=3  tx:7.9 Mpps
> burst=4  tx:8.7 Mpps
> burst=8  tx:10.3 Mpps
> burst=128  tx:12.4 Mpps
> 
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

> tx queue size, irq affinity left in default.
> pause frames are off.
> 
> Nice to finally see line rate generated by one cpu

Yes, 
 
> Compared to Jesper's patch, this one amortizes the cost
> of spin_lock and atomic_inc by doing HARD_TX_LOCK and
> atomic_add(N) once across N packets.

Nice additional optimizations :-)

>  net/core/pktgen.c |   33 ++++++++++++++++++++++++++++++---
>  1 file changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
> index 5c728aa..47557ba 100644
> --- a/net/core/pktgen.c
> +++ b/net/core/pktgen.c
> @@ -387,6 +387,7 @@ struct pktgen_dev {
>  	u16 queue_map_min;
>  	u16 queue_map_max;
>  	__u32 skb_priority;	/* skb priority field */
> +	int burst;		/* number of duplicated packets to burst */
>  	int node;               /* Memory node */
>  
>  #ifdef CONFIG_XFRM
[...]
> @@ -3299,7 +3313,8 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>  {
>  	struct net_device *odev = pkt_dev->odev;
>  	struct netdev_queue *txq;
> -	int ret;
> +	int burst_cnt, ret;
> +	bool more;
>  
>  	/* If device is offline, then don't send */
>  	if (unlikely(!netif_running(odev) || !netif_carrier_ok(odev))) {
> @@ -3347,8 +3362,14 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>  		pkt_dev->last_ok = 0;
>  		goto unlock;
>  	}
> -	atomic_inc(&(pkt_dev->skb->users));
> -	ret = netdev_start_xmit(pkt_dev->skb, odev, txq, false);
> +	atomic_add(pkt_dev->burst, &pkt_dev->skb->users);
> +
> +	burst_cnt = 0;
> +
> +xmit_more:
> +	more = ++burst_cnt < pkt_dev->burst;
> +
> +	ret = netdev_start_xmit(pkt_dev->skb, odev, txq, more);
>  
>  	switch (ret) {
>  	case NETDEV_TX_OK:
> @@ -3356,6 +3377,8 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>  		pkt_dev->sofar++;
>  		pkt_dev->seq_num++;
>  		pkt_dev->tx_bytes += pkt_dev->last_pkt_size;
> +		if (more)
> +			goto xmit_more;

I think this will break my VLAN hack mode, that allows me to shoot
pktgen after the qdisc layer, but I'm okay with that, as I can just
avoid using this new burst mode and then it will still work for me.


>  		break;
>  	case NET_XMIT_DROP:
>  	case NET_XMIT_CN:
> @@ -3374,6 +3397,9 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>  		atomic_dec(&(pkt_dev->skb->users));
>  		pkt_dev->last_ok = 0;
>  	}
> +
> +	if (unlikely(pkt_dev->burst - burst_cnt > 0))
> +		atomic_sub(pkt_dev->burst - burst_cnt, &pkt_dev->skb->users);
>  unlock:
>  	HARD_TX_UNLOCK(odev, txq);
>  

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more
  2014-09-26  7:42   ` Eric Dumazet
@ 2014-09-26 15:44     ` Eric Dumazet
  2014-09-26 15:59       ` Alexei Starovoitov
  2014-09-27 20:43     ` Eric Dumazet
  1 sibling, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2014-09-26 15:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Jesper Dangaard Brouer, Eric Dumazet,
	John Fastabend, netdev

On Fri, 2014-09-26 at 00:42 -0700, Eric Dumazet wrote:
> On Thu, 2014-09-25 at 18:20 -0700, Eric Dumazet wrote:
> > On Thu, 2014-09-25 at 17:46 -0700, Alexei Starovoitov wrote:
> > > This patch demonstrates the effect of delaying update of HW tailptr.
> > > (based on earlier patch by Jesper)
> > > 
> > > burst=1 is a default. It sends one packet with xmit_more=false
> > > burst=2 sends one packet with xmit_more=true and
> > >         2nd copy of the same packet with xmit_more=false
> > > burst=3 sends two copies of the same packet with xmit_more=true and
> > >         3rd copy with xmit_more=false
> > 
> > > 
> > > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > > Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> > > ---
> > 
> > Perfect, this is what I had in mind, thanks !
> > 
> > Acked-by: Eric Dumazet <edumazet@google.com>
> > 
> 
> By the way with this patch, I now reach 10 Mpps on mlx4 

On the other hand, bnx2x is not that happy

[  381.424736] bnx2x: [bnx2x_start_xmit:3772(eth0)]BUG! Tx ring full
when queue awake!
[  381.432660] bnx2x: [bnx2x_start_xmit:3772(eth0)]BUG! Tx ring full
when queue awake!
[  381.440337] bnx2x: [bnx2x_start_xmit:3772(eth0)]BUG! Tx ring full
when queue awake!
[  381.448255] bnx2x: [bnx2x_start_xmit:3772(eth0)]BUG! Tx ring full
when queue awake!

When TX_BUSY is replied, and/or  netif_tx_stop_queue(txq) was done by
the driver, the 'burst' loop should stop

So your :

	if (more)
		goto xmit_more;

Should instead be :

	if (more && !netif_xmit_frozen_or_drv_stopped(txq))
		goto xmit_more;
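
For completeness, the NETDEV_TX_OK arm would then read roughly as
follows (a sketch of the suggested change, not a tested hunk):

	case NETDEV_TX_OK:
		/* ... sofar/seq_num/tx_bytes accounting as in the patch ... */
		pkt_dev->tx_bytes += pkt_dev->last_pkt_size;
		if (more && !netif_xmit_frozen_or_drv_stopped(txq))
			goto xmit_more;
		break;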


* Re: [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more
  2014-09-26 15:44     ` Eric Dumazet
@ 2014-09-26 15:59       ` Alexei Starovoitov
  2014-09-26 16:06         ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Alexei Starovoitov @ 2014-09-26 15:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Jesper Dangaard Brouer, Eric Dumazet,
	John Fastabend, Network Development

On Fri, Sep 26, 2014 at 8:44 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-09-26 at 00:42 -0700, Eric Dumazet wrote:
>> On Thu, 2014-09-25 at 18:20 -0700, Eric Dumazet wrote:
>> > On Thu, 2014-09-25 at 17:46 -0700, Alexei Starovoitov wrote:
>> > > This patch demonstrates the effect of delaying update of HW tailptr.
>> > > (based on earlier patch by Jesper)
>> > >
>> > > burst=1 is a default. It sends one packet with xmit_more=false
>> > > burst=2 sends one packet with xmit_more=true and
>> > >         2nd copy of the same packet with xmit_more=false
>> > > burst=3 sends two copies of the same packet with xmit_more=true and
>> > >         3rd copy with xmit_more=false
>> >
>> > >
>> > > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>> > > Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>> > > ---
>> >
>> > Perfect, this is what I had in mind, thanks !
>> >
>> > Acked-by: Eric Dumazet <edumazet@google.com>
>> >
>>
>> By the way with this patch, I now reach 10 Mpps on mlx4
>
> On the other hand, bnx2x is not that happy
>
> [  381.424736] bnx2x: [bnx2x_start_xmit:3772(eth0)]BUG! Tx ring full
> when queue awake!
> [  381.432660] bnx2x: [bnx2x_start_xmit:3772(eth0)]BUG! Tx ring full
> when queue awake!
> [  381.440337] bnx2x: [bnx2x_start_xmit:3772(eth0)]BUG! Tx ring full
> when queue awake!
> [  381.448255] bnx2x: [bnx2x_start_xmit:3772(eth0)]BUG! Tx ring full
> when queue awake!
>
> When TX_BUSY is replied, and/or  netif_tx_stop_queue(txq) was done by
> the driver, the 'burst' loop should stop
>
> So your :
>
>         if (more)
>                 goto xmit_more;
>
> Should instead be :
>
>         if (more && !netif_xmit_frozen_or_drv_stopped(txq))
>                 goto xmit_more;
>

oh. thanks for testing! That makes sense.
I don't think I can find bnx2x around to double check this.
Do you mind applying this fix and resubmitting with two sob?
This one was rfc anyway :)


* Re: [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more
  2014-09-26 15:59       ` Alexei Starovoitov
@ 2014-09-26 16:06         ` Eric Dumazet
  0 siblings, 0 replies; 29+ messages in thread
From: Eric Dumazet @ 2014-09-26 16:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Jesper Dangaard Brouer, Eric Dumazet,
	John Fastabend, Network Development

On Fri, 2014-09-26 at 08:59 -0700, Alexei Starovoitov wrote:

> 
> oh. thanks for testing! That makes sense.
> I don't think I can find bnx2x around to double check this.
> Do you mind applying this fix and resubmitting with two sob?
> This one was rfc anyway :)

Yes, I'll also provide the bnx2x patch to support skb->xmit_more.

Thanks


* Re: [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more
  2014-09-26  7:42   ` Eric Dumazet
  2014-09-26 15:44     ` Eric Dumazet
@ 2014-09-27 20:43     ` Eric Dumazet
  2014-09-27 20:55       ` Or Gerlitz
  1 sibling, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2014-09-27 20:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Jesper Dangaard Brouer, Eric Dumazet,
	John Fastabend, netdev, Amir Vadai, Or Gerlitz

On Fri, 2014-09-26 at 00:42 -0700, Eric Dumazet wrote:

> By the way with this patch, I now reach 10 Mpps on mlx4 
> 
> base line : 5 Mpps
> 
> > + skb->xmit_more and quick hack in pktgen (spinlock/unlock per packet)
> -> 7 Mpps
> 
> + burst of 16 packets, no spinlock per packet -> 10 Mpps

With careful study of mlx4 driver to remove false sharing, I now get 14
Mpps.

(Note they have a special feature to 'inline' small packets in tx
descriptors: if this is used, the max rate is lower, because the cpu
spends more cycles performing the copies.)

I find it worrying that driver authors do not know how to properly use a
ring buffer, and do not place the producer and consumer indexes in
separate cache lines.
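
Concretely, the idea looks roughly like this (a simplified sketch with
illustrative field names, not the actual mlx4 structs, which follow in
the patch posted later in this thread):

struct tx_ring {
	/* dirtied only by the TX completion path */
	u32		cons ____cacheline_aligned_in_smp;
	u32		last_nr_txbb;

	/* dirtied only by the xmit path */
	u32		prod ____cacheline_aligned_in_smp;
	unsigned long	bytes;
	unsigned long	packets;

	/* read-mostly fields shared by both paths */
	u32		size ____cacheline_aligned_in_smp;
	u32		size_mask;
};

Keeping each writer's fields on their own cache line stops the
completion path and the xmit path from invalidating each other's line
on every update.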


* Re: [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more
  2014-09-27 20:43     ` Eric Dumazet
@ 2014-09-27 20:55       ` Or Gerlitz
  2014-09-27 21:30         ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Or Gerlitz @ 2014-09-27 20:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David S. Miller, Jesper Dangaard Brouer,
	Eric Dumazet, John Fastabend, Linux Netdev List, Amir Vadai,
	Or Gerlitz

On Sat, Sep 27, 2014 at 11:43 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-09-26 at 00:42 -0700, Eric Dumazet wrote:
>
>> By the way with this patch, I now reach 10 Mpps on mlx4
>>
>> base line : 5 Mpps
>>
>> + skb->xmit_more and quick hack in pktgen (spinlock/unlock per packet)
>> -> 7 Mpps
>>
>> + burst of 16 packets, no spinlock per packet -> 10 Mpps
>
> With careful study of mlx4 driver to remove false sharing, I now get 14
> Mpps.
>
> (Note they have a special feature to 'inline' small packets in tx
> descriptors: if this is used, the max rate is lower, because the cpu
> spends more cycles performing the copies.)
>
> I find it worrying that driver authors do not know how to properly use a
> ring buffer, and do not place the producer and consumer indexes in
> separate cache lines.

mmm, so the numbers sound good, but the comment suggests the basics are
still not fully behind us (so we have room to improve)... any mlx4
patch you want to share?


* Re: [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more
  2014-09-26  0:46 [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more Alexei Starovoitov
  2014-09-26  1:20 ` Eric Dumazet
  2014-09-26  8:05 ` [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more Jesper Dangaard Brouer
@ 2014-09-27 20:59 ` Or Gerlitz
  2 siblings, 0 replies; 29+ messages in thread
From: Or Gerlitz @ 2014-09-27 20:59 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet
  Cc: David S. Miller, Jesper Dangaard Brouer, John Fastabend,
	Linux Netdev List

On Fri, Sep 26, 2014 at 3:46 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> This patch demonstrates the effect of delaying update of HW tailptr.
> (based on earlier patch by Jesper)
>
> burst=1 is a default. It sends one packet with xmit_more=false
> burst=2 sends one packet with xmit_more=true and
>         2nd copy of the same packet with xmit_more=false
> burst=3 sends two copies of the same packet with xmit_more=true and
>         3rd copy with xmit_more=false


Guys, just to be sure, this pktgen config uses only one TX ring, right?


> Performance with ixgbe:
>
> usec 30:
> burst=1  tx:9.2 Mpps
> burst=2  tx:13.6 Mpps
> burst=3  tx:14.5 Mpps full 10G line rate
>
> usec 1 (default):
> burst=1,4,100 tx:3.9 Mpps
>
> usec 0:
> burst=1  tx:4.9 Mpps
> burst=2  tx:6.6 Mpps
> burst=3  tx:7.9 Mpps
> burst=4  tx:8.7 Mpps
> burst=8  tx:10.3 Mpps
> burst=128  tx:12.4 Mpps
>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>
> tx queue size, irq affinity left in default.
> pause frames are off.
>
> Nice to finally see line rate generated by one cpu
>
> Compared to Jesper's patch, this one amortizes the cost
> of spin_lock and atomic_inc by doing HARD_TX_LOCK and
> atomic_add(N) once across N packets.


* Re: [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more
  2014-09-27 20:55       ` Or Gerlitz
@ 2014-09-27 21:30         ` Eric Dumazet
  2014-09-27 22:56           ` [PATCH net-next] mlx4: optimize xmit path Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2014-09-27 21:30 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexei Starovoitov, David S. Miller, Jesper Dangaard Brouer,
	Eric Dumazet, John Fastabend, Linux Netdev List, Amir Vadai,
	Or Gerlitz

On Sat, 2014-09-27 at 23:55 +0300, Or Gerlitz wrote:

> mmm, so the numbers sound good, but the comment suggests the basics are
> still not fully behind us (so we have room to improve)... any mlx4
> patch you want to share?

Sure, I am definitely not playing around with mlx4 on a Saturday just to
throw away my work ;)

I'll send a patch maybe today if I am not too bored.


* [PATCH net-next] mlx4: optimize xmit path
  2014-09-27 21:30         ` Eric Dumazet
@ 2014-09-27 22:56           ` Eric Dumazet
  2014-09-27 23:44             ` Hannes Frederic Sowa
                               ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Eric Dumazet @ 2014-09-27 22:56 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexei Starovoitov, David S. Miller, Jesper Dangaard Brouer,
	Eric Dumazet, John Fastabend, Linux Netdev List, Amir Vadai,
	Or Gerlitz

From: Eric Dumazet <edumazet@google.com>

First I implemented skb->xmit_more support, and pktgen throughput
went from ~5Mpps to ~10Mpps.

Then, looking closely at this driver I found false sharing problems that
should be addressed by this patch, as my pktgen now reaches 14.7 Mpps
on a single TX queue, with a burst factor of 8.

So this patch as a whole improves raw performance on a single
TX queue from about 5 Mpps to 14.7 Mpps.

Note that if packets are below the inline_thold threshold (104 bytes),
the driver copies the packet contents into the tx descriptor, and
throughput drops to ~7 Mpps:
->	We might reconsider the inlining strategy in a followup patch.

I could split this patch into multiple components, but I prefer not to
spend days on this work.

Let's instead list all the changes I did:

1) align struct mlx4_en_tx_info to a cache line

2) add frag0_dma/frag0_byte_count into mlx4_en_tx_info to avoid a cache
   line miss in TX completion for frames having one dma element.
   (We avoid reading back the tx descriptor)
   Note this could be extended to 2/3 dma elements later,
   as we have free room in mlx4_en_tx_info

3) reorganize struct mlx4_en_tx_ring to have 
   3.1 - One cache line containing last_nr_txbb & cons,
     used by tx completion.
   3.2 - One cache line containing fields dirtied by mlx4_en_xmit()  
   3.3 - Following part is read mostly and shared by cpus.

4) struct mlx4_bf @offset field reduced to unsigned int to save space
   so that the 3.2 part is only 64 bytes, or one cache line on x86.

5) doorbell_qpn is stored in the cpu_to_be32() way to avoid bswap() in
   fast path

6) mdev->mr.key stored in ring->mr_key to also avoid bswap() and access
   to cold cache line.

7) mlx4_en_free_tx_desc() no longer accesses skb_shinfo(). We use a new
   nr_frags fields in mlx4_en_tx_info to avoid 2 or 3 cache misses.

8) mlx4_en_free_tx_desc() uses a prefetchw(&skb->users) to speed up
   consume_skb()

9) mlx4_en_process_tx_cq() carefully fetches and writes
   ring->last_nr_txbb & ring->cons only one time to avoid false sharing

10) mlx4_en_xmit() reads ring->cons once, and ahead of time to avoid
    stalls.

11) prefetchw(&ring->tx_queue->dql) to speed up BQL update

12) properly clears tx_info->ts_requested :
    This field was never cleared.

13) Support skb->xmit_more to avoid the expensive doorbell

14) reorganize code to call is_inline() once, so compiler can inline it.

15) rename @i variable to @i_frag to avoid confusion, as the
    "goto tx_drop_unmap;" relied on this @i variable.
 
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  312 ++++++++++-------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |   89 ++--
 include/linux/mlx4/device.h                  |    2 
 3 files changed, 241 insertions(+), 162 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index c44f4237b9be..fa29d53860a6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -191,12 +191,12 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 	ring->prod = 0;
 	ring->cons = 0xffffffff;
 	ring->last_nr_txbb = 1;
-	ring->poll_cnt = 0;
 	memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
 	memset(ring->buf, 0, ring->buf_size);
 
 	ring->qp_state = MLX4_QP_STATE_RST;
-	ring->doorbell_qpn = ring->qp.qpn << 8;
+	ring->doorbell_qpn = cpu_to_be32(ring->qp.qpn << 8);
+	ring->mr_key = cpu_to_be32(mdev->mr.key);
 
 	mlx4_en_fill_qp_context(priv, ring->size, ring->stride, 1, 0, ring->qpn,
 				ring->cqn, user_prio, &ring->context);
@@ -259,38 +259,45 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 				struct mlx4_en_tx_ring *ring,
 				int index, u8 owner, u64 timestamp)
 {
-	struct mlx4_en_dev *mdev = priv->mdev;
 	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
 	struct mlx4_en_tx_desc *tx_desc = ring->buf + index * TXBB_SIZE;
 	struct mlx4_wqe_data_seg *data = (void *) tx_desc + tx_info->data_offset;
-	struct sk_buff *skb = tx_info->skb;
-	struct skb_frag_struct *frag;
 	void *end = ring->buf + ring->buf_size;
-	int frags = skb_shinfo(skb)->nr_frags;
+	struct sk_buff *skb = tx_info->skb;
+	int nr_frags = tx_info->nr_frags;
 	int i;
-	struct skb_shared_hwtstamps hwts;
 
-	if (timestamp) {
-		mlx4_en_fill_hwtstamps(mdev, &hwts, timestamp);
+	/* We do not touch skb here, so prefetch skb->users location
+	 * to speedup consume_skb()
+	 */
+	prefetchw(&skb->users);
+
+	if (unlikely(timestamp)) {
+		struct skb_shared_hwtstamps hwts;
+
+		mlx4_en_fill_hwtstamps(priv->mdev, &hwts, timestamp);
 		skb_tstamp_tx(skb, &hwts);
 	}
 
 	/* Optimize the common case when there are no wraparounds */
 	if (likely((void *) tx_desc + tx_info->nr_txbb * TXBB_SIZE <= end)) {
 		if (!tx_info->inl) {
-			if (tx_info->linear) {
+			if (tx_info->linear)
 				dma_unmap_single(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data->addr),
-					 be32_to_cpu(data->byte_count),
-					 PCI_DMA_TODEVICE);
-				++data;
-			}
-
-			for (i = 0; i < frags; i++) {
-				frag = &skb_shinfo(skb)->frags[i];
+						tx_info->frag0_dma,
+						tx_info->frag0_byte_count,
+						PCI_DMA_TODEVICE);
+			else
+				dma_unmap_page(priv->ddev,
+					       tx_info->frag0_dma,
+					       tx_info->frag0_byte_count,
+					       PCI_DMA_TODEVICE);
+			for (i = 1; i < nr_frags; i++) {
+				data++;
 				dma_unmap_page(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data[i].addr),
-					skb_frag_size(frag), PCI_DMA_TODEVICE);
+					(dma_addr_t)be64_to_cpu(data->addr),
+					be32_to_cpu(data->byte_count),
+					PCI_DMA_TODEVICE);
 			}
 		}
 	} else {
@@ -299,22 +306,24 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 				data = ring->buf + ((void *)data - end);
 			}
 
-			if (tx_info->linear) {
+			if (tx_info->linear)
 				dma_unmap_single(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data->addr),
-					 be32_to_cpu(data->byte_count),
-					 PCI_DMA_TODEVICE);
-				++data;
-			}
-
-			for (i = 0; i < frags; i++) {
+						tx_info->frag0_dma,
+						tx_info->frag0_byte_count,
+						PCI_DMA_TODEVICE);
+			else
+				dma_unmap_page(priv->ddev,
+					       tx_info->frag0_dma,
+					       tx_info->frag0_byte_count,
+					       PCI_DMA_TODEVICE);
+			for (i = 1; i < nr_frags; i++) {
 				/* Check for wraparound before unmapping */
 				if ((void *) data >= end)
 					data = ring->buf;
-				frag = &skb_shinfo(skb)->frags[i];
 				dma_unmap_page(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data->addr),
-					 skb_frag_size(frag), PCI_DMA_TODEVICE);
+					(dma_addr_t)be64_to_cpu(data->addr),
+					be32_to_cpu(data->byte_count),
+					PCI_DMA_TODEVICE);
 				++data;
 			}
 		}
@@ -377,13 +386,18 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	u64 timestamp = 0;
 	int done = 0;
 	int budget = priv->tx_work_limit;
+	u32 last_nr_txbb;
+	u32 ring_cons;
 
 	if (!priv->port_up)
 		return true;
 
+	prefetchw(&ring->tx_queue->dql.limit);
 	index = cons_index & size_mask;
 	cqe = mlx4_en_get_cqe(buf, index, priv->cqe_size) + factor;
-	ring_index = ring->cons & size_mask;
+	last_nr_txbb = ACCESS_ONCE(ring->last_nr_txbb);
+	ring_cons = ACCESS_ONCE(ring->cons);
+	ring_index = ring_cons & size_mask;
 	stamp_index = ring_index;
 
 	/* Process all completed CQEs */
@@ -408,19 +422,19 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 		new_index = be16_to_cpu(cqe->wqe_index) & size_mask;
 
 		do {
-			txbbs_skipped += ring->last_nr_txbb;
-			ring_index = (ring_index + ring->last_nr_txbb) & size_mask;
+			txbbs_skipped += last_nr_txbb;
+			ring_index = (ring_index + last_nr_txbb) & size_mask;
 			if (ring->tx_info[ring_index].ts_requested)
 				timestamp = mlx4_en_get_cqe_ts(cqe);
 
 			/* free next descriptor */
-			ring->last_nr_txbb = mlx4_en_free_tx_desc(
+			last_nr_txbb = mlx4_en_free_tx_desc(
 					priv, ring, ring_index,
-					!!((ring->cons + txbbs_skipped) &
+					!!((ring_cons + txbbs_skipped) &
 					ring->size), timestamp);
 
 			mlx4_en_stamp_wqe(priv, ring, stamp_index,
-					  !!((ring->cons + txbbs_stamp) &
+					  !!((ring_cons + txbbs_stamp) &
 						ring->size));
 			stamp_index = ring_index;
 			txbbs_stamp = txbbs_skipped;
@@ -441,7 +455,11 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	mcq->cons_index = cons_index;
 	mlx4_cq_set_ci(mcq);
 	wmb();
-	ring->cons += txbbs_skipped;
+
+	/* we want to dirty this cache line once */
+	ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
+	ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
+
 	netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
 
 	/*
@@ -512,30 +530,35 @@ static struct mlx4_en_tx_desc *mlx4_en_bounce_to_desc(struct mlx4_en_priv *priv,
 	return ring->buf + index * TXBB_SIZE;
 }
 
-static int is_inline(int inline_thold, struct sk_buff *skb, void **pfrag)
+/* Decide if skb can be inlined in tx descriptor to avoid dma mapping
+ *
+ * It seems strange we do not simply use skb_copy_bits().
+ * This would allow to inline all skbs iff skb->len <= inline_thold
+ *
+ * Note that caller already checked skb was not a gso packet
+ */
+static bool is_inline(int inline_thold, const struct sk_buff *skb,
+		      const struct skb_shared_info *shinfo,
+		      void **pfrag)
 {
 	void *ptr;
 
-	if (inline_thold && !skb_is_gso(skb) && skb->len <= inline_thold) {
-		if (skb_shinfo(skb)->nr_frags == 1) {
-			ptr = skb_frag_address_safe(&skb_shinfo(skb)->frags[0]);
-			if (unlikely(!ptr))
-				return 0;
-
-			if (pfrag)
-				*pfrag = ptr;
+	if (skb->len > inline_thold || !inline_thold)
+		return false;
 
-			return 1;
-		} else if (unlikely(skb_shinfo(skb)->nr_frags))
-			return 0;
-		else
-			return 1;
+	if (shinfo->nr_frags == 1) {
+		ptr = skb_frag_address_safe(&shinfo->frags[0]);
+		if (unlikely(!ptr))
+			return false;
+		*pfrag = ptr;
+		return true;
 	}
-
-	return 0;
+	if (shinfo->nr_frags)
+		return false;
+	return true;
 }
 
-static int inline_size(struct sk_buff *skb)
+static int inline_size(const struct sk_buff *skb)
 {
 	if (skb->len + CTRL_SIZE + sizeof(struct mlx4_wqe_inline_seg)
 	    <= MLX4_INLINE_ALIGN)
@@ -546,18 +569,23 @@ static int inline_size(struct sk_buff *skb)
 			     sizeof(struct mlx4_wqe_inline_seg), 16);
 }
 
-static int get_real_size(struct sk_buff *skb, struct net_device *dev,
-			 int *lso_header_size)
+static int get_real_size(const struct sk_buff *skb,
+			 const struct skb_shared_info *shinfo,
+			 struct net_device *dev,
+			 int *lso_header_size,
+			 bool *inline_ok,
+			 void **pfrag)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int real_size;
 
-	if (skb_is_gso(skb)) {
+	if (shinfo->gso_size) {
+		*inline_ok = false;
 		if (skb->encapsulation)
 			*lso_header_size = (skb_inner_transport_header(skb) - skb->data) + inner_tcp_hdrlen(skb);
 		else
 			*lso_header_size = skb_transport_offset(skb) + tcp_hdrlen(skb);
-		real_size = CTRL_SIZE + skb_shinfo(skb)->nr_frags * DS_SIZE +
+		real_size = CTRL_SIZE + shinfo->nr_frags * DS_SIZE +
 			ALIGN(*lso_header_size + 4, DS_SIZE);
 		if (unlikely(*lso_header_size != skb_headlen(skb))) {
 			/* We add a segment for the skb linear buffer only if
@@ -572,17 +600,24 @@ static int get_real_size(struct sk_buff *skb, struct net_device *dev,
 		}
 	} else {
 		*lso_header_size = 0;
-		if (!is_inline(priv->prof->inline_thold, skb, NULL))
-			real_size = CTRL_SIZE + (skb_shinfo(skb)->nr_frags + 1) * DS_SIZE;
-		else
+		*inline_ok = is_inline(priv->prof->inline_thold, skb,
+				       shinfo, pfrag);
+
+		if (*inline_ok)
 			real_size = inline_size(skb);
+		else
+			real_size = CTRL_SIZE +
+				    (shinfo->nr_frags + 1) * DS_SIZE;
 	}
 
 	return real_size;
 }
 
-static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *skb,
-			     int real_size, u16 *vlan_tag, int tx_ind, void *fragptr)
+static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
+			     const struct sk_buff *skb,
+			     const struct skb_shared_info *shinfo,
+			     int real_size, u16 *vlan_tag,
+			     int tx_ind, void *fragptr)
 {
 	struct mlx4_wqe_inline_seg *inl = &tx_desc->inl;
 	int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof *inl;
@@ -596,9 +631,9 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 			       MIN_PKT_LEN - skb->len);
 		}
 		skb_copy_from_linear_data(skb, inl + 1, skb_headlen(skb));
-		if (skb_shinfo(skb)->nr_frags)
+		if (shinfo->nr_frags)
 			memcpy(((void *)(inl + 1)) + skb_headlen(skb), fragptr,
-			       skb_frag_size(&skb_shinfo(skb)->frags[0]));
+			       skb_frag_size(&shinfo->frags[0]));
 
 	} else {
 		inl->byte_count = cpu_to_be32(1 << 31 | spc);
@@ -616,9 +651,10 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 			inl = (void *) (inl + 1) + spc;
 			skb_copy_from_linear_data_offset(skb, spc, inl + 1,
 					skb_headlen(skb) - spc);
-			if (skb_shinfo(skb)->nr_frags)
+			if (shinfo->nr_frags)
 				memcpy(((void *)(inl + 1)) + skb_headlen(skb) - spc,
-					fragptr, skb_frag_size(&skb_shinfo(skb)->frags[0]));
+				       fragptr,
+				       skb_frag_size(&shinfo->frags[0]));
 		}
 
 		wmb();
@@ -642,7 +678,8 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
 	return fallback(dev, skb) % rings_p_up + up * rings_p_up;
 }
 
-static void mlx4_bf_copy(void __iomem *dst, unsigned long *src, unsigned bytecnt)
+static void mlx4_bf_copy(void __iomem *dst, const void *src,
+			 unsigned int bytecnt)
 {
 	__iowrite64_copy(dst, src, bytecnt / 8);
 }
@@ -663,15 +700,26 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	u32 index, bf_index;
 	__be32 op_own;
 	u16 vlan_tag = 0;
-	int i;
+	int i_frag;
 	int lso_header_size;
-	void *fragptr;
+	void *fragptr = NULL;
 	bool bounce = false;
+	bool send_doorbell;
+	u32 ring_cons;
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	bool inline_ok;
 
 	if (!priv->port_up)
 		goto tx_drop;
 
-	real_size = get_real_size(skb, dev, &lso_header_size);
+	tx_ind = skb_get_queue_mapping(skb);
+	ring = priv->tx_ring[tx_ind];
+
+	/* fetch ring->cons far ahead before needing it to avoid stall */
+	ring_cons = ACCESS_ONCE(ring->cons);
+
+	real_size = get_real_size(skb, shinfo, dev, &lso_header_size,
+				  &inline_ok, &fragptr);
 	if (unlikely(!real_size))
 		goto tx_drop;
 
@@ -684,13 +732,11 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto tx_drop;
 	}
 
-	tx_ind = skb->queue_mapping;
-	ring = priv->tx_ring[tx_ind];
 	if (vlan_tx_tag_present(skb))
 		vlan_tag = vlan_tx_tag_get(skb);
 
 	/* Check available TXBBs And 2K spare for prefetch */
-	if (unlikely(((int)(ring->prod - ring->cons)) >
+	if (unlikely(((int)(ring->prod - ring_cons)) >
 		     ring->size - HEADROOM - MAX_DESC_TXBBS)) {
 		/* every full Tx ring stops queue */
 		netif_tx_stop_queue(ring->tx_queue);
@@ -704,7 +750,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		 */
 		wmb();
 
-		if (unlikely(((int)(ring->prod - ring->cons)) <=
+		ring_cons = ACCESS_ONCE(ring->cons);
+		if (unlikely(((int)(ring->prod - ring_cons)) <=
 			     ring->size - HEADROOM - MAX_DESC_TXBBS)) {
 			netif_tx_wake_queue(ring->tx_queue);
 			ring->wake_queue++;
@@ -713,9 +760,11 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 	}
 
+	prefetchw(&ring->tx_queue->dql);
+
 	/* Track current inflight packets for performance analysis */
 	AVG_PERF_COUNTER(priv->pstats.inflight_avg,
-			 (u32) (ring->prod - ring->cons - 1));
+			 (u32)(ring->prod - ring_cons - 1));
 
 	/* Packet is good - grab an index and transmit it */
 	index = ring->prod & ring->size_mask;
@@ -735,31 +784,34 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	tx_info->skb = skb;
 	tx_info->nr_txbb = nr_txbb;
 
+	data = &tx_desc->data;
 	if (lso_header_size)
 		data = ((void *)&tx_desc->lso + ALIGN(lso_header_size + 4,
 						      DS_SIZE));
-	else
-		data = &tx_desc->data;
 
 	/* valid only for none inline segments */
 	tx_info->data_offset = (void *)data - (void *)tx_desc;
 
+	tx_info->inl = inline_ok;
+
 	tx_info->linear = (lso_header_size < skb_headlen(skb) &&
-			   !is_inline(ring->inline_thold, skb, NULL)) ? 1 : 0;
+			   !inline_ok) ? 1 : 0;
 
-	data += skb_shinfo(skb)->nr_frags + tx_info->linear - 1;
+	tx_info->nr_frags = shinfo->nr_frags + tx_info->linear;
+	data += tx_info->nr_frags - 1;
 
-	if (is_inline(ring->inline_thold, skb, &fragptr)) {
-		tx_info->inl = 1;
-	} else {
-		/* Map fragments */
-		for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
+	if (!tx_info->inl) {
+		dma_addr_t dma = 0;
+		u32 byte_count = 0;
+
+		/* Map fragments if any */
+		for (i_frag = shinfo->nr_frags - 1; i_frag >= 0; i_frag--) {
 			struct skb_frag_struct *frag;
-			dma_addr_t dma;
 
-			frag = &skb_shinfo(skb)->frags[i];
+			frag = &shinfo->frags[i_frag];
+			byte_count = skb_frag_size(frag);
 			dma = skb_frag_dma_map(ddev, frag,
-					       0, skb_frag_size(frag),
+					       0, byte_count,
 					       DMA_TO_DEVICE);
 			if (dma_mapping_error(ddev, dma))
 				goto tx_drop_unmap;
@@ -767,14 +819,13 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			data->addr = cpu_to_be64(dma);
 			data->lkey = cpu_to_be32(mdev->mr.key);
 			wmb();
-			data->byte_count = cpu_to_be32(skb_frag_size(frag));
+			data->byte_count = cpu_to_be32(byte_count);
 			--data;
 		}
 
-		/* Map linear part */
+		/* Map linear part if needed */
 		if (tx_info->linear) {
-			u32 byte_count = skb_headlen(skb) - lso_header_size;
-			dma_addr_t dma;
+			byte_count = skb_headlen(skb) - lso_header_size;
 
 			dma = dma_map_single(ddev, skb->data +
 					     lso_header_size, byte_count,
@@ -787,25 +838,24 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			wmb();
 			data->byte_count = cpu_to_be32(byte_count);
 		}
-		tx_info->inl = 0;
+		/* tx completion can avoid cache line miss for common cases */
+		tx_info->frag0_dma = dma;
+		tx_info->frag0_byte_count = byte_count;
 	}
 
 	/*
 	 * For timestamping add flag to skb_shinfo and
 	 * set flag for further reference
 	 */
-	if (ring->hwtstamp_tx_type == HWTSTAMP_TX_ON &&
-	    skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
-		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+	tx_info->ts_requested = 0;
+	if (unlikely(ring->hwtstamp_tx_type == HWTSTAMP_TX_ON &&
+		     shinfo->tx_flags & SKBTX_HW_TSTAMP)) {
+		shinfo->tx_flags |= SKBTX_IN_PROGRESS;
 		tx_info->ts_requested = 1;
 	}
 
 	/* Prepare ctrl segement apart opcode+ownership, which depends on
 	 * whether LSO is used */
-	tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
-	tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_VLAN *
-		!!vlan_tx_tag_present(skb);
-	tx_desc->ctrl.fence_size = (real_size / 16) & 0x3f;
 	tx_desc->ctrl.srcrb_flags = priv->ctrl_flags;
 	if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
 		tx_desc->ctrl.srcrb_flags |= cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM |
@@ -826,6 +876,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* Handle LSO (TSO) packets */
 	if (lso_header_size) {
+		int i;
+
 		/* Mark opcode as LSO */
 		op_own = cpu_to_be32(MLX4_OPCODE_LSO | (1 << 6)) |
 			((ring->prod & ring->size) ?
@@ -833,15 +885,15 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		/* Fill in the LSO prefix */
 		tx_desc->lso.mss_hdr_size = cpu_to_be32(
-			skb_shinfo(skb)->gso_size << 16 | lso_header_size);
+			shinfo->gso_size << 16 | lso_header_size);
 
 		/* Copy headers;
 		 * note that we already verified that it is linear */
 		memcpy(tx_desc->lso.header, skb->data, lso_header_size);
 
 		priv->port_stats.tso_packets++;
-		i = ((skb->len - lso_header_size) / skb_shinfo(skb)->gso_size) +
-			!!((skb->len - lso_header_size) % skb_shinfo(skb)->gso_size);
+		i = ((skb->len - lso_header_size) / shinfo->gso_size) +
+			!!((skb->len - lso_header_size) % shinfo->gso_size);
 		tx_info->nr_bytes = skb->len + (i - 1) * lso_header_size;
 		ring->packets += i;
 	} else {
@@ -851,16 +903,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			 cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
 		tx_info->nr_bytes = max_t(unsigned int, skb->len, ETH_ZLEN);
 		ring->packets++;
-
 	}
 	ring->bytes += tx_info->nr_bytes;
 	netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
 	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);
 
-	if (tx_info->inl) {
-		build_inline_wqe(tx_desc, skb, real_size, &vlan_tag, tx_ind, fragptr);
-		tx_info->inl = 1;
-	}
+	if (tx_info->inl)
+		build_inline_wqe(tx_desc, skb, shinfo, real_size,
+				 &vlan_tag, tx_ind, fragptr);
 
 	if (skb->encapsulation) {
 		struct iphdr *ipv4 = (struct iphdr *)skb_inner_network_header(skb);
@@ -873,35 +923,53 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	ring->prod += nr_txbb;
 
 	/* If we used a bounce buffer then copy descriptor back into place */
-	if (bounce)
+	if (unlikely(bounce))
 		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
 
 	skb_tx_timestamp(skb);
 
-	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce && !vlan_tx_tag_present(skb)) {
-		tx_desc->ctrl.bf_qpn |= cpu_to_be32(ring->doorbell_qpn);
+	real_size = (real_size / 16) & 0x3f;
+
+	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
+
+	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
+	    !vlan_tx_tag_present(skb) && send_doorbell) {
+		tx_desc->ctrl.bf_qpn = ring->doorbell_qpn |
+				       cpu_to_be32(real_size);
 
 		op_own |= htonl((bf_index & 0xffff) << 8);
-		/* Ensure new descirptor hits memory
-		* before setting ownership of this descriptor to HW */
+		/* Ensure new descriptor hits memory
+		 * before setting ownership of this descriptor to HW
+		 */
 		wmb();
 		tx_desc->ctrl.owner_opcode = op_own;
 
 		wmb();
 
-		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, (unsigned long *) &tx_desc->ctrl,
-		     desc_size);
+		mlx4_bf_copy(ring->bf.reg + ring->bf.offset,
+			     &tx_desc->ctrl,
+			     desc_size);
 
 		wmb();
 
 		ring->bf.offset ^= ring->bf.buf_size;
 	} else {
-		/* Ensure new descirptor hits memory
-		* before setting ownership of this descriptor to HW */
+		tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
+		tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_VLAN *
+			!!vlan_tx_tag_present(skb);
+		tx_desc->ctrl.fence_size = real_size;
+
+		/* Ensure new descriptor hits memory
+		 * before setting ownership of this descriptor to HW
+		 */
 		wmb();
 		tx_desc->ctrl.owner_opcode = op_own;
-		wmb();
-		iowrite32be(ring->doorbell_qpn, ring->bf.uar->map + MLX4_SEND_DOORBELL);
+
+		if (send_doorbell) {
+			wmb(); /* ensure owner_opcode is written */
+			iowrite32(ring->doorbell_qpn,
+				  ring->bf.uar->map + MLX4_SEND_DOORBELL);
+		}
 	}
 
 	return NETDEV_TX_OK;
@@ -909,8 +977,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 tx_drop_unmap:
 	en_err(priv, "DMA mapping error\n");
 
-	for (i++; i < skb_shinfo(skb)->nr_frags; i++) {
-		data++;
+	while (++i_frag < shinfo->nr_frags) {
+		++data;
 		dma_unmap_page(ddev, (dma_addr_t) be64_to_cpu(data->addr),
 			       be32_to_cpu(data->byte_count),
 			       PCI_DMA_TODEVICE);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 6a4fc2394cf2..8025e3c0b14e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -216,13 +216,16 @@ enum cq_type {
 
 struct mlx4_en_tx_info {
 	struct sk_buff *skb;
-	u32 nr_txbb;
-	u32 nr_bytes;
-	u8 linear;
-	u8 data_offset;
-	u8 inl;
-	u8 ts_requested;
-};
+	dma_addr_t	frag0_dma;
+	u32		frag0_byte_count;
+	u32		nr_txbb;
+	u32		nr_bytes;
+	u8		linear;
+	u8		data_offset;
+	u8		inl;
+	u8		ts_requested;
+	u8		nr_frags;
+} ____cacheline_aligned_in_smp;
 
 
 #define MLX4_EN_BIT_DESC_OWN	0x80000000
@@ -253,39 +256,47 @@ struct mlx4_en_rx_alloc {
 };
 
 struct mlx4_en_tx_ring {
+			/* cache line used and dirtied in tx completion
+			 * (mlx4_en_free_tx_buf())
+			 */
+	u32			last_nr_txbb;
+	u32			cons;
+
+			/* cache line used and dirtied in mlx4_en_xmit()
+			 */
+	u32			prod ____cacheline_aligned_in_smp;
+	unsigned long		bytes;
+	unsigned long		packets;
+	unsigned long		tx_csum;
+	struct mlx4_bf		bf;
+	unsigned long		queue_stopped;
+	unsigned long		wake_queue;
+
+			/* Following part should be mostly read
+			 */
+	cpumask_t		affinity_mask;
+	struct mlx4_qp		qp;
 	struct mlx4_hwq_resources wqres;
-	u32 size ; /* number of TXBBs */
-	u32 size_mask;
-	u16 stride;
-	u16 cqn;	/* index of port CQ associated with this ring */
-	u32 prod;
-	u32 cons;
-	u32 buf_size;
-	u32 doorbell_qpn;
-	void *buf;
-	u16 poll_cnt;
-	struct mlx4_en_tx_info *tx_info;
-	u8 *bounce_buf;
-	u8 queue_index;
-	cpumask_t affinity_mask;
-	u32 last_nr_txbb;
-	struct mlx4_qp qp;
-	struct mlx4_qp_context context;
-	int qpn;
-	enum mlx4_qp_state qp_state;
-	struct mlx4_srq dummy;
-	unsigned long bytes;
-	unsigned long packets;
-	unsigned long tx_csum;
-	unsigned long queue_stopped;
-	unsigned long wake_queue;
-	struct mlx4_bf bf;
-	bool bf_enabled;
-	bool bf_alloced;
-	struct netdev_queue *tx_queue;
-	int hwtstamp_tx_type;
-	int inline_thold;
-};
+	u32			size ; /* number of TXBBs */
+	u32			size_mask;
+	u16			stride;
+	u16			cqn;	/* index of port CQ associated with this ring */
+	u32			buf_size;
+	__be32			doorbell_qpn;
+	__be32			mr_key;
+	void			*buf;
+	struct mlx4_en_tx_info	*tx_info;
+	u8			*bounce_buf;
+	struct mlx4_qp_context	context;
+	int			qpn;
+	enum mlx4_qp_state	qp_state;
+	u8			queue_index;
+	bool			bf_enabled;
+	bool			bf_alloced;
+	struct netdev_queue	*tx_queue;
+	int			hwtstamp_tx_type;
+	int			inline_thold;
+} ____cacheline_aligned_in_smp;
 
 struct mlx4_en_rx_desc {
 	/* actual number of entries depends on rx ring stride */
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 03b5608a4329..5e5ad07548b8 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -583,7 +583,7 @@ struct mlx4_uar {
 };
 
 struct mlx4_bf {
-	unsigned long		offset;
+	unsigned int		offset;
 	int			buf_size;
 	struct mlx4_uar	       *uar;
 	void __iomem	       *reg;


* Re: [PATCH net-next] mlx4: optimize xmit path
  2014-09-27 22:56           ` [PATCH net-next] mlx4: optimize xmit path Eric Dumazet
@ 2014-09-27 23:44             ` Hannes Frederic Sowa
  2014-09-28  0:05               ` Eric Dumazet
  2014-09-28 12:42             ` Eric Dumazet
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: Hannes Frederic Sowa @ 2014-09-27 23:44 UTC (permalink / raw)
  To: Eric Dumazet, Or Gerlitz
  Cc: Alexei Starovoitov, David S. Miller, Jesper Dangaard Brouer,
	Eric Dumazet, John Fastabend, Linux Netdev List, Amir Vadai,
	Or Gerlitz

Hi Eric,

On Sun, Sep 28, 2014, at 00:56, Eric Dumazet wrote:
> -       ring->cons += txbbs_skipped;
> +
> +       /* we want to dirty this cache line once */
> +       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> +       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> +

Impressive work!

I wonder if another macro might be useful for those kind of
dereferences, because ACCESS_ONCE is associated with correctness in my
mind and those usages only try to optimize access patterns.
Does OPTIMIZER_HIDE_VAR generate the same code?

Bye,
Hannes


* Re: [PATCH net-next] mlx4: optimize xmit path
  2014-09-27 23:44             ` Hannes Frederic Sowa
@ 2014-09-28  0:05               ` Eric Dumazet
  2014-09-28  0:22                 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2014-09-28  0:05 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Jesper Dangaard Brouer, Eric Dumazet, John Fastabend,
	Linux Netdev List, Amir Vadai, Or Gerlitz

On Sun, 2014-09-28 at 01:44 +0200, Hannes Frederic Sowa wrote:
> Hi Eric,
> 
> On Sun, Sep 28, 2014, at 00:56, Eric Dumazet wrote:
> > -       ring->cons += txbbs_skipped;
> > +
> > +       /* we want to dirty this cache line once */
> > +       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> > +       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> > +
> 
> Impressive work!
> 
> I wonder if another macro might be useful for those kind of
> dereferences, because ACCESS_ONCE is associated with correctness in my
> mind and those usages only try to optimize access patterns.
> Does OPTIMIZER_HIDE_VAR generate the same code?


If we have 

ring->cons += txbbs_skipped;

Then compiler might issue a RMW instruction.

And this is bad in this case.

I really want to _write_ into this location, and it's fast because I
already have in ring_cons the content I fetched maybe hundreds of
nanoseconds before, or even thousands of nanoseconds before.

ACCESS_ONCE(XXXX) = y

Is not only for correctness.

It exactly documents the fact that we want to perform a single write.

I believe it is time that people understand how useful this helper is
(less than 700 occurrences in the whole kernel today, not including
Documentation/*).
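
Side by side, the difference (a minimal sketch mirroring the hunk quoted
earlier in this thread):

	/* may be compiled into a read-modify-write on the shared cache line */
	ring->cons += txbbs_skipped;

	/* vs. accumulate in locals, then issue one plain store per field */
	ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
	ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;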


* Re: [PATCH net-next] mlx4: optimize xmit path
  2014-09-28  0:05               ` Eric Dumazet
@ 2014-09-28  0:22                 ` Hannes Frederic Sowa
  0 siblings, 0 replies; 29+ messages in thread
From: Hannes Frederic Sowa @ 2014-09-28  0:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Jesper Dangaard Brouer, Eric Dumazet, John Fastabend,
	Linux Netdev List, Amir Vadai, Or Gerlitz

On Sun, Sep 28, 2014, at 02:05, Eric Dumazet wrote:
> On Sun, 2014-09-28 at 01:44 +0200, Hannes Frederic Sowa wrote:
> > Hi Eric,
> > 
> > On Sun, Sep 28, 2014, at 00:56, Eric Dumazet wrote:
> > > -       ring->cons += txbbs_skipped;
> > > +
> > > +       /* we want to dirty this cache line once */
> > > +       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> > > +       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> > > +
> > 
> > Impressive work!
> > 
> > I wonder if another macro might be useful for those kind of
> > dereferences, because ACCESS_ONCE is associated with correctness in my
> > mind and those usages only try to optimize access patterns.
> > Does OPTIMIZER_HIDE_VAR generate the same code?
> 
> 
> If we have 
> 
> ring->cons += txbbs_skipped;
> 
> Then compiler might issue a RMW instruction.
> 
> And this is bad in this case.
> 
> I really want to _write_ into this location, and it's fast because I
> already have in ring_cons the content I fetched maybe hundreds of
> nanoseconds before, or even thousands of nanoseconds before.
> 
> ACCESS_ONCE(XXXX) = y
> 
> Is not only for correctness.
> 
> It exactly documents the fact that we want to perform a single write.
> 
> I believe it is time that people understand how useful this helper is
> (less than 700 occurrences in the whole kernel today, not including
> Documentation/*).

Understood, thanks.

For me, ACCESS_ONCE was something that just slowed down code, until
today. Also, I have the feeling that instruction scheduling in the
compiler could do a better job in some places...

Now I wonder if it is worth it playing around with the restrict keyword
and strict-aliasing in networking. ;)

Bye,
Hannes


* Re: [PATCH net-next] mlx4: optimize xmit path
  2014-09-27 22:56           ` [PATCH net-next] mlx4: optimize xmit path Eric Dumazet
  2014-09-27 23:44             ` Hannes Frederic Sowa
@ 2014-09-28 12:42             ` Eric Dumazet
  2014-09-28 14:35             ` Or Gerlitz
  2014-09-29  4:19             ` [PATCH v2 " Eric Dumazet
  3 siblings, 0 replies; 29+ messages in thread
From: Eric Dumazet @ 2014-09-28 12:42 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexei Starovoitov, David S. Miller, Jesper Dangaard Brouer,
	Eric Dumazet, John Fastabend, Linux Netdev List, Amir Vadai,
	Or Gerlitz

On Sat, 2014-09-27 at 15:56 -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>

>  
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---


>  
>  struct mlx4_en_tx_ring {
> +			/* cache line used and dirtied in tx completion
> +			 * (mlx4_en_free_tx_buf())
> +			 */
> +	u32			last_nr_txbb;
> +	u32			cons;
> +
> +			/* cache line used and dirtied in mlx4_en_xmit()
> +			 */
> +	u32			prod ____cacheline_aligned_in_smp;
> +	unsigned long		bytes;
> +	unsigned long		packets;
> +	unsigned long		tx_csum;
> +	struct mlx4_bf		bf;
> +	unsigned long		queue_stopped;
> +	unsigned long		wake_queue;
> +


Hmm, I forgot to move wake_queue in the first cache line.

I also forgot to include <linux/prefetch.h>

I'll send a v2.


* Re: [PATCH net-next] mlx4: optimize xmit path
  2014-09-27 22:56           ` [PATCH net-next] mlx4: optimize xmit path Eric Dumazet
  2014-09-27 23:44             ` Hannes Frederic Sowa
  2014-09-28 12:42             ` Eric Dumazet
@ 2014-09-28 14:35             ` Or Gerlitz
  2014-09-28 16:03               ` Eric Dumazet
  2014-09-29  4:19             ` [PATCH v2 " Eric Dumazet
  3 siblings, 1 reply; 29+ messages in thread
From: Or Gerlitz @ 2014-09-28 14:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David S. Miller, Jesper Dangaard Brouer,
	Eric Dumazet, John Fastabend, Linux Netdev List, Amir Vadai,
	Or Gerlitz, saeedm, Yevgeny Petrilin, idos

On Sun, Sep 28, 2014 at 1:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> First I implemented skb->xmit_more support, and pktgen throughput
> went from ~5Mpps to ~10Mpps.
>
> Then, looking closely at this driver I found false sharing problems that
> should be addressed by this patch, as my pktgen now reaches 14.7 Mpps
> on a single TX queue, with a burst factor of 8.
>
> So this patch in a whole permits to improve raw performance on a single
> TX queue from about 5 Mpps to 14.7 Mpps.

Eric,

cool!! the team here will take a look this week. I assume we might
want to break the fifteen changes into multiple patches...

Thanks again for all your great work

Or.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next] mlx4: optimize xmit path
  2014-09-28 14:35             ` Or Gerlitz
@ 2014-09-28 16:03               ` Eric Dumazet
  0 siblings, 0 replies; 29+ messages in thread
From: Eric Dumazet @ 2014-09-28 16:03 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexei Starovoitov, David S. Miller, Jesper Dangaard Brouer,
	Eric Dumazet, John Fastabend, Linux Netdev List, Amir Vadai,
	Or Gerlitz, saeedm, Yevgeny Petrilin, idos

On Sun, 2014-09-28 at 17:35 +0300, Or Gerlitz wrote:
> On Sun, Sep 28, 2014 at 1:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > First I implemented skb->xmit_more support, and pktgen throughput
> > went from ~5Mpps to ~10Mpps.
> >
> > Then, looking closely at this driver I found false sharing problems that
> > should be addressed by this patch, as my pktgen now reaches 14.7 Mpps
> > on a single TX queue, with a burst factor of 8.
> >
> > So this patch in a whole permits to improve raw performance on a single
> > TX queue from about 5 Mpps to 14.7 Mpps.
> 
> Eric,
> 
> cool!! the team here will take a look this week. I assume we might
> want to break the fifteen changes into multiple patches...
> 
> Thanks again for all your great work

Another problem I noticed is the false sharing on
prot_stats.tso_packets.

Please add the following fix to your queue.

Thanks !

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
index c2cfb05e7290..5bd33e580b22 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
@@ -150,14 +150,17 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 	priv->port_stats.tx_chksum_offload = 0;
 	priv->port_stats.queue_stopped = 0;
 	priv->port_stats.wake_queue = 0;
+	priv->port_stats.tso_packets = 0;
 
 	for (i = 0; i < priv->tx_ring_num; i++) {
-		stats->tx_packets += priv->tx_ring[i]->packets;
-		stats->tx_bytes += priv->tx_ring[i]->bytes;
-		priv->port_stats.tx_chksum_offload += priv->tx_ring[i]->tx_csum;
-		priv->port_stats.queue_stopped +=
-			priv->tx_ring[i]->queue_stopped;
-		priv->port_stats.wake_queue += priv->tx_ring[i]->wake_queue;
+		struct mlx4_en_tx_ring *ring = priv->tx_ring[i];
+
+		stats->tx_packets += ring->packets;
+		stats->tx_bytes += ring->bytes;
+		priv->port_stats.tx_chksum_offload += ring->tx_csum;
+		priv->port_stats.queue_stopped += ring->queue_stopped;
+		priv->port_stats.wake_queue += ring->wake_queue;
+		priv->port_stats.tso_packets += ring->tso_packets;
 	}
 
 	stats->rx_errors = be64_to_cpu(mlx4_en_stats->PCS) +
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index c44f4237b9be..7bb156e99894 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -839,7 +839,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		 * note that we already verified that it is linear */
 		memcpy(tx_desc->lso.header, skb->data, lso_header_size);
 
-		priv->port_stats.tso_packets++;
+		ring->tso_packets++;
+
 		i = ((skb->len - lso_header_size) / skb_shinfo(skb)->gso_size) +
 			!!((skb->len - lso_header_size) % skb_shinfo(skb)->gso_size);
 		tx_info->nr_bytes = skb->len + (i - 1) * lso_header_size;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 6a4fc2394cf2..007645c4edc0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -277,6 +277,7 @@ struct mlx4_en_tx_ring {
 	unsigned long bytes;
 	unsigned long packets;
 	unsigned long tx_csum;
+	unsigned long tso_packets;
 	unsigned long queue_stopped;
 	unsigned long wake_queue;
 	struct mlx4_bf bf;

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 net-next] mlx4: optimize xmit path
  2014-09-27 22:56           ` [PATCH net-next] mlx4: optimize xmit path Eric Dumazet
                               ` (2 preceding siblings ...)
  2014-09-28 14:35             ` Or Gerlitz
@ 2014-09-29  4:19             ` Eric Dumazet
  2014-09-30 12:01               ` Amir Vadai
  2014-10-02  4:35               ` Eric Dumazet
  3 siblings, 2 replies; 29+ messages in thread
From: Eric Dumazet @ 2014-09-29  4:19 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexei Starovoitov, David S. Miller, Jesper Dangaard Brouer,
	Eric Dumazet, John Fastabend, Linux Netdev List, Amir Vadai,
	Or Gerlitz

From: Eric Dumazet <edumazet@google.com>

First I implemented skb->xmit_more support, and pktgen throughput
went from ~5Mpps to ~10Mpps.

Then, looking closely at this driver I found false sharing problems that
should be addressed by this patch, as my pktgen now reaches 14.7 Mpps
on a single TX queue, with a burst factor of 8.

So this patch as a whole improves raw performance on a single
TX queue from about 5 Mpps to 14.9 Mpps.

Note that if packets are below the inline_thold threshold (104 bytes),
the driver copies the packet contents into the tx descriptor, and
throughput is lowered to ~7.5 Mpps:
->      We might reconsider inline strategy in a followup patch.

I could split this patch into multiple components, but I prefer not
to spend days on this work.

Let's instead list all the changes I made:

1) align struct mlx4_en_tx_info to a cache line

2) add map0_dma/map0_byte_count into mlx4_en_tx_info to avoid a cache
   line miss in TX completion for frames having one dma element.
   (We avoid reading back the tx descriptor)
   Note this could be extended to 2/3 dma elements later,
   as we have free room in mlx4_en_tx_info

3) reorganize struct mlx4_en_tx_ring to have 
   3.1 - One cache line containing last_nr_txbb & cons & wake_queue,
     used by tx completion.
   3.2 - One cache line containing fields dirtied by mlx4_en_xmit()  
   3.3 - The remaining part is read-mostly and shared by all cpus.

4) struct mlx4_bf @offset field reduced to unsigned int to save space
 
5) doorbell_qpn is stored in cpu_to_be32() form to avoid a bswap() in
   the fast path.

6) mdev->mr.key stored in ring->mr_key to also avoid a bswap() and an
   access to a cold cache line.

7) mlx4_en_free_tx_desc() no longer accesses skb_shinfo(). We use a new
   nr_maps field in mlx4_en_tx_info to avoid 2 or 3 cache misses.

8) mlx4_en_free_tx_desc() uses a prefetchw(&skb->users) to speed up
   consume_skb()

9) mlx4_en_process_tx_cq() carefully fetches and writes
   ring->last_nr_txbb & ring->cons only one time to avoid false sharing

10) mlx4_en_xmit() reads ring->cons once, and ahead of time to avoid
    stalls.

11) prefetchw(&ring->tx_queue->dql) to speed up BQL update

12) properly clear tx_info->ts_requested:
    This field was never cleared before.

13) remove false sharing on prot_stats.tso_packets 

14) reorganize code to call is_inline() once, so compiler can inline it.

15) rename @i variable to @i_frag to avoid confusion, as the
    "goto tx_drop_unmap;" relied on this @i variable.
 
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
v2: prot_stats.tso_packets false sharing
    wake_queue moved into first cache line
    Missing <linux/prefetch.h>
    tx_info allocated with kmalloc() instead of vmalloc()
    A bug in mlx4_en_free_tx_desc() was fixed.
    xmit_more part was already merged by David.

 drivers/net/ethernet/mellanox/mlx4/en_port.c |   15 
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  327 ++++++++++-------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |   90 ++--
 include/linux/mlx4/device.h                  |    2 
 4 files changed, 254 insertions(+), 180 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
index c2cfb05e7290..132805799fca 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
@@ -150,14 +150,17 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 	priv->port_stats.tx_chksum_offload = 0;
 	priv->port_stats.queue_stopped = 0;
 	priv->port_stats.wake_queue = 0;
+	priv->port_stats.tso_packets = 0;
 
 	for (i = 0; i < priv->tx_ring_num; i++) {
-		stats->tx_packets += priv->tx_ring[i]->packets;
-		stats->tx_bytes += priv->tx_ring[i]->bytes;
-		priv->port_stats.tx_chksum_offload += priv->tx_ring[i]->tx_csum;
-		priv->port_stats.queue_stopped +=
-			priv->tx_ring[i]->queue_stopped;
-		priv->port_stats.wake_queue += priv->tx_ring[i]->wake_queue;
+		const struct mlx4_en_tx_ring *ring = priv->tx_ring[i];
+
+		stats->tx_packets += ring->packets;
+		stats->tx_bytes += ring->bytes;
+		priv->port_stats.tx_chksum_offload += ring->tx_csum;
+		priv->port_stats.queue_stopped += ring->queue_stopped;
+		priv->port_stats.wake_queue += ring->wake_queue;
+		priv->port_stats.tso_packets += ring->tso_packets;
 	}
 
 	stats->rx_errors = be64_to_cpu(mlx4_en_stats->PCS) +
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index adedc47e947d..8816592ed7be 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -37,6 +37,7 @@
 #include <linux/mlx4/qp.h>
 #include <linux/skbuff.h>
 #include <linux/if_vlan.h>
+#include <linux/prefetch.h>
 #include <linux/vmalloc.h>
 #include <linux/tcp.h>
 #include <linux/ip.h>
@@ -68,7 +69,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 	ring->inline_thold = priv->prof->inline_thold;
 
 	tmp = size * sizeof(struct mlx4_en_tx_info);
-	ring->tx_info = vmalloc_node(tmp, node);
+	ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
 	if (!ring->tx_info) {
 		ring->tx_info = vmalloc(tmp);
 		if (!ring->tx_info) {
@@ -151,7 +152,7 @@ err_bounce:
 	kfree(ring->bounce_buf);
 	ring->bounce_buf = NULL;
 err_info:
-	vfree(ring->tx_info);
+	kvfree(ring->tx_info);
 	ring->tx_info = NULL;
 err_ring:
 	kfree(ring);
@@ -174,7 +175,7 @@ void mlx4_en_destroy_tx_ring(struct mlx4_en_priv *priv,
 	mlx4_free_hwq_res(mdev->dev, &ring->wqres, ring->buf_size);
 	kfree(ring->bounce_buf);
 	ring->bounce_buf = NULL;
-	vfree(ring->tx_info);
+	kvfree(ring->tx_info);
 	ring->tx_info = NULL;
 	kfree(ring);
 	*pring = NULL;
@@ -191,12 +192,12 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 	ring->prod = 0;
 	ring->cons = 0xffffffff;
 	ring->last_nr_txbb = 1;
-	ring->poll_cnt = 0;
 	memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
 	memset(ring->buf, 0, ring->buf_size);
 
 	ring->qp_state = MLX4_QP_STATE_RST;
-	ring->doorbell_qpn = ring->qp.qpn << 8;
+	ring->doorbell_qpn = cpu_to_be32(ring->qp.qpn << 8);
+	ring->mr_key = cpu_to_be32(mdev->mr.key);
 
 	mlx4_en_fill_qp_context(priv, ring->size, ring->stride, 1, 0, ring->qpn,
 				ring->cqn, user_prio, &ring->context);
@@ -259,38 +260,45 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 				struct mlx4_en_tx_ring *ring,
 				int index, u8 owner, u64 timestamp)
 {
-	struct mlx4_en_dev *mdev = priv->mdev;
 	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
 	struct mlx4_en_tx_desc *tx_desc = ring->buf + index * TXBB_SIZE;
 	struct mlx4_wqe_data_seg *data = (void *) tx_desc + tx_info->data_offset;
-	struct sk_buff *skb = tx_info->skb;
-	struct skb_frag_struct *frag;
 	void *end = ring->buf + ring->buf_size;
-	int frags = skb_shinfo(skb)->nr_frags;
+	struct sk_buff *skb = tx_info->skb;
+	int nr_maps = tx_info->nr_maps;
 	int i;
-	struct skb_shared_hwtstamps hwts;
 
-	if (timestamp) {
-		mlx4_en_fill_hwtstamps(mdev, &hwts, timestamp);
+	/* We do not touch skb here, so prefetch skb->users location
+	 * to speedup consume_skb()
+	 */
+	prefetchw(&skb->users);
+
+	if (unlikely(timestamp)) {
+		struct skb_shared_hwtstamps hwts;
+
+		mlx4_en_fill_hwtstamps(priv->mdev, &hwts, timestamp);
 		skb_tstamp_tx(skb, &hwts);
 	}
 
 	/* Optimize the common case when there are no wraparounds */
 	if (likely((void *) tx_desc + tx_info->nr_txbb * TXBB_SIZE <= end)) {
 		if (!tx_info->inl) {
-			if (tx_info->linear) {
+			if (tx_info->linear)
 				dma_unmap_single(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data->addr),
-					 be32_to_cpu(data->byte_count),
-					 PCI_DMA_TODEVICE);
-				++data;
-			}
-
-			for (i = 0; i < frags; i++) {
-				frag = &skb_shinfo(skb)->frags[i];
+						tx_info->map0_dma,
+						tx_info->map0_byte_count,
+						PCI_DMA_TODEVICE);
+			else
+				dma_unmap_page(priv->ddev,
+					       tx_info->map0_dma,
+					       tx_info->map0_byte_count,
+					       PCI_DMA_TODEVICE);
+			for (i = 1; i < nr_maps; i++) {
+				data++;
 				dma_unmap_page(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data[i].addr),
-					skb_frag_size(frag), PCI_DMA_TODEVICE);
+					(dma_addr_t)be64_to_cpu(data->addr),
+					be32_to_cpu(data->byte_count),
+					PCI_DMA_TODEVICE);
 			}
 		}
 	} else {
@@ -299,23 +307,25 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 				data = ring->buf + ((void *)data - end);
 			}
 
-			if (tx_info->linear) {
+			if (tx_info->linear)
 				dma_unmap_single(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data->addr),
-					 be32_to_cpu(data->byte_count),
-					 PCI_DMA_TODEVICE);
-				++data;
-			}
-
-			for (i = 0; i < frags; i++) {
+						tx_info->map0_dma,
+						tx_info->map0_byte_count,
+						PCI_DMA_TODEVICE);
+			else
+				dma_unmap_page(priv->ddev,
+					       tx_info->map0_dma,
+					       tx_info->map0_byte_count,
+					       PCI_DMA_TODEVICE);
+			for (i = 1; i < nr_maps; i++) {
+				data++;
 				/* Check for wraparound before unmapping */
 				if ((void *) data >= end)
 					data = ring->buf;
-				frag = &skb_shinfo(skb)->frags[i];
 				dma_unmap_page(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data->addr),
-					 skb_frag_size(frag), PCI_DMA_TODEVICE);
-				++data;
+					(dma_addr_t)be64_to_cpu(data->addr),
+					be32_to_cpu(data->byte_count),
+					PCI_DMA_TODEVICE);
 			}
 		}
 	}
@@ -377,13 +387,18 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	u64 timestamp = 0;
 	int done = 0;
 	int budget = priv->tx_work_limit;
+	u32 last_nr_txbb;
+	u32 ring_cons;
 
 	if (!priv->port_up)
 		return true;
 
+	prefetchw(&ring->tx_queue->dql.limit);
 	index = cons_index & size_mask;
 	cqe = mlx4_en_get_cqe(buf, index, priv->cqe_size) + factor;
-	ring_index = ring->cons & size_mask;
+	last_nr_txbb = ACCESS_ONCE(ring->last_nr_txbb);
+	ring_cons = ACCESS_ONCE(ring->cons);
+	ring_index = ring_cons & size_mask;
 	stamp_index = ring_index;
 
 	/* Process all completed CQEs */
@@ -408,19 +423,19 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 		new_index = be16_to_cpu(cqe->wqe_index) & size_mask;
 
 		do {
-			txbbs_skipped += ring->last_nr_txbb;
-			ring_index = (ring_index + ring->last_nr_txbb) & size_mask;
+			txbbs_skipped += last_nr_txbb;
+			ring_index = (ring_index + last_nr_txbb) & size_mask;
 			if (ring->tx_info[ring_index].ts_requested)
 				timestamp = mlx4_en_get_cqe_ts(cqe);
 
 			/* free next descriptor */
-			ring->last_nr_txbb = mlx4_en_free_tx_desc(
+			last_nr_txbb = mlx4_en_free_tx_desc(
 					priv, ring, ring_index,
-					!!((ring->cons + txbbs_skipped) &
+					!!((ring_cons + txbbs_skipped) &
 					ring->size), timestamp);
 
 			mlx4_en_stamp_wqe(priv, ring, stamp_index,
-					  !!((ring->cons + txbbs_stamp) &
+					  !!((ring_cons + txbbs_stamp) &
 						ring->size));
 			stamp_index = ring_index;
 			txbbs_stamp = txbbs_skipped;
@@ -441,7 +456,11 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	mcq->cons_index = cons_index;
 	mlx4_cq_set_ci(mcq);
 	wmb();
-	ring->cons += txbbs_skipped;
+
+	/* we want to dirty this cache line once */
+	ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
+	ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
+
 	netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
 
 	/*
@@ -512,30 +531,35 @@ static struct mlx4_en_tx_desc *mlx4_en_bounce_to_desc(struct mlx4_en_priv *priv,
 	return ring->buf + index * TXBB_SIZE;
 }
 
-static int is_inline(int inline_thold, struct sk_buff *skb, void **pfrag)
+/* Decide if skb can be inlined in tx descriptor to avoid dma mapping
+ *
+ * It seems strange we do not simply use skb_copy_bits().
+ * This would allow to inline all skbs iff skb->len <= inline_thold
+ *
+ * Note that caller already checked skb was not a gso packet
+ */
+static bool is_inline(int inline_thold, const struct sk_buff *skb,
+		      const struct skb_shared_info *shinfo,
+		      void **pfrag)
 {
 	void *ptr;
 
-	if (inline_thold && !skb_is_gso(skb) && skb->len <= inline_thold) {
-		if (skb_shinfo(skb)->nr_frags == 1) {
-			ptr = skb_frag_address_safe(&skb_shinfo(skb)->frags[0]);
-			if (unlikely(!ptr))
-				return 0;
-
-			if (pfrag)
-				*pfrag = ptr;
+	if (skb->len > inline_thold || !inline_thold)
+		return false;
 
-			return 1;
-		} else if (unlikely(skb_shinfo(skb)->nr_frags))
-			return 0;
-		else
-			return 1;
+	if (shinfo->nr_frags == 1) {
+		ptr = skb_frag_address_safe(&shinfo->frags[0]);
+		if (unlikely(!ptr))
+			return false;
+		*pfrag = ptr;
+		return true;
 	}
-
-	return 0;
+	if (shinfo->nr_frags)
+		return false;
+	return true;
 }
 
-static int inline_size(struct sk_buff *skb)
+static int inline_size(const struct sk_buff *skb)
 {
 	if (skb->len + CTRL_SIZE + sizeof(struct mlx4_wqe_inline_seg)
 	    <= MLX4_INLINE_ALIGN)
@@ -546,18 +570,23 @@ static int inline_size(struct sk_buff *skb)
 			     sizeof(struct mlx4_wqe_inline_seg), 16);
 }
 
-static int get_real_size(struct sk_buff *skb, struct net_device *dev,
-			 int *lso_header_size)
+static int get_real_size(const struct sk_buff *skb,
+			 const struct skb_shared_info *shinfo,
+			 struct net_device *dev,
+			 int *lso_header_size,
+			 bool *inline_ok,
+			 void **pfrag)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int real_size;
 
-	if (skb_is_gso(skb)) {
+	if (shinfo->gso_size) {
+		*inline_ok = false;
 		if (skb->encapsulation)
 			*lso_header_size = (skb_inner_transport_header(skb) - skb->data) + inner_tcp_hdrlen(skb);
 		else
 			*lso_header_size = skb_transport_offset(skb) + tcp_hdrlen(skb);
-		real_size = CTRL_SIZE + skb_shinfo(skb)->nr_frags * DS_SIZE +
+		real_size = CTRL_SIZE + shinfo->nr_frags * DS_SIZE +
 			ALIGN(*lso_header_size + 4, DS_SIZE);
 		if (unlikely(*lso_header_size != skb_headlen(skb))) {
 			/* We add a segment for the skb linear buffer only if
@@ -572,20 +601,28 @@ static int get_real_size(struct sk_buff *skb, struct net_device *dev,
 		}
 	} else {
 		*lso_header_size = 0;
-		if (!is_inline(priv->prof->inline_thold, skb, NULL))
-			real_size = CTRL_SIZE + (skb_shinfo(skb)->nr_frags + 1) * DS_SIZE;
-		else
+		*inline_ok = is_inline(priv->prof->inline_thold, skb,
+				       shinfo, pfrag);
+
+		if (*inline_ok)
 			real_size = inline_size(skb);
+		else
+			real_size = CTRL_SIZE +
+				    (shinfo->nr_frags + 1) * DS_SIZE;
 	}
 
 	return real_size;
 }
 
-static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *skb,
-			     int real_size, u16 *vlan_tag, int tx_ind, void *fragptr)
+static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
+			     const struct sk_buff *skb,
+			     const struct skb_shared_info *shinfo,
+			     int real_size, u16 *vlan_tag,
+			     int tx_ind, void *fragptr)
 {
 	struct mlx4_wqe_inline_seg *inl = &tx_desc->inl;
 	int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof *inl;
+	unsigned int hlen = skb_headlen(skb);
 
 	if (skb->len <= spc) {
 		if (likely(skb->len >= MIN_PKT_LEN)) {
@@ -595,19 +632,19 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 			memset(((void *)(inl + 1)) + skb->len, 0,
 			       MIN_PKT_LEN - skb->len);
 		}
-		skb_copy_from_linear_data(skb, inl + 1, skb_headlen(skb));
-		if (skb_shinfo(skb)->nr_frags)
-			memcpy(((void *)(inl + 1)) + skb_headlen(skb), fragptr,
-			       skb_frag_size(&skb_shinfo(skb)->frags[0]));
+		skb_copy_from_linear_data(skb, inl + 1, hlen);
+		if (shinfo->nr_frags)
+			memcpy(((void *)(inl + 1)) + hlen, fragptr,
+			       skb_frag_size(&shinfo->frags[0]));
 
 	} else {
 		inl->byte_count = cpu_to_be32(1 << 31 | spc);
-		if (skb_headlen(skb) <= spc) {
-			skb_copy_from_linear_data(skb, inl + 1, skb_headlen(skb));
-			if (skb_headlen(skb) < spc) {
-				memcpy(((void *)(inl + 1)) + skb_headlen(skb),
-					fragptr, spc - skb_headlen(skb));
-				fragptr +=  spc - skb_headlen(skb);
+		if (hlen <= spc) {
+			skb_copy_from_linear_data(skb, inl + 1, hlen);
+			if (hlen < spc) {
+				memcpy(((void *)(inl + 1)) + hlen,
+					fragptr, spc - hlen);
+				fragptr +=  spc - hlen;
 			}
 			inl = (void *) (inl + 1) + spc;
 			memcpy(((void *)(inl + 1)), fragptr, skb->len - spc);
@@ -615,10 +652,11 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 			skb_copy_from_linear_data(skb, inl + 1, spc);
 			inl = (void *) (inl + 1) + spc;
 			skb_copy_from_linear_data_offset(skb, spc, inl + 1,
-					skb_headlen(skb) - spc);
-			if (skb_shinfo(skb)->nr_frags)
-				memcpy(((void *)(inl + 1)) + skb_headlen(skb) - spc,
-					fragptr, skb_frag_size(&skb_shinfo(skb)->frags[0]));
+							 hlen - spc);
+			if (shinfo->nr_frags)
+				memcpy(((void *)(inl + 1)) + hlen - spc,
+				       fragptr,
+				       skb_frag_size(&shinfo->frags[0]));
 		}
 
 		wmb();
@@ -642,15 +680,16 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
 	return fallback(dev, skb) % rings_p_up + up * rings_p_up;
 }
 
-static void mlx4_bf_copy(void __iomem *dst, unsigned long *src, unsigned bytecnt)
+static void mlx4_bf_copy(void __iomem *dst, const void *src,
+			 unsigned int bytecnt)
 {
 	__iowrite64_copy(dst, src, bytecnt / 8);
 }
 
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 {
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	struct mlx4_en_dev *mdev = priv->mdev;
 	struct device *ddev = priv->ddev;
 	struct mlx4_en_tx_ring *ring;
 	struct mlx4_en_tx_desc *tx_desc;
@@ -663,16 +702,25 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	u32 index, bf_index;
 	__be32 op_own;
 	u16 vlan_tag = 0;
-	int i;
+	int i_frag;
 	int lso_header_size;
-	void *fragptr;
+	void *fragptr = NULL;
 	bool bounce = false;
 	bool send_doorbell;
+	bool inline_ok;
+	u32 ring_cons;
 
 	if (!priv->port_up)
 		goto tx_drop;
 
-	real_size = get_real_size(skb, dev, &lso_header_size);
+	tx_ind = skb_get_queue_mapping(skb);
+	ring = priv->tx_ring[tx_ind];
+
+	/* fetch ring->cons far ahead before needing it to avoid stall */
+	ring_cons = ACCESS_ONCE(ring->cons);
+
+	real_size = get_real_size(skb, shinfo, dev, &lso_header_size,
+				  &inline_ok, &fragptr);
 	if (unlikely(!real_size))
 		goto tx_drop;
 
@@ -685,13 +733,11 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto tx_drop;
 	}
 
-	tx_ind = skb->queue_mapping;
-	ring = priv->tx_ring[tx_ind];
 	if (vlan_tx_tag_present(skb))
 		vlan_tag = vlan_tx_tag_get(skb);
 
 	/* Check available TXBBs And 2K spare for prefetch */
-	if (unlikely(((int)(ring->prod - ring->cons)) >
+	if (unlikely(((int)(ring->prod - ring_cons)) >
 		     ring->size - HEADROOM - MAX_DESC_TXBBS)) {
 		/* every full Tx ring stops queue */
 		netif_tx_stop_queue(ring->tx_queue);
@@ -705,7 +751,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		 */
 		wmb();
 
-		if (unlikely(((int)(ring->prod - ring->cons)) <=
+		ring_cons = ACCESS_ONCE(ring->cons);
+		if (unlikely(((int)(ring->prod - ring_cons)) <=
 			     ring->size - HEADROOM - MAX_DESC_TXBBS)) {
 			netif_tx_wake_queue(ring->tx_queue);
 			ring->wake_queue++;
@@ -714,9 +761,11 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 	}
 
+	prefetchw(&ring->tx_queue->dql);
+
 	/* Track current inflight packets for performance analysis */
 	AVG_PERF_COUNTER(priv->pstats.inflight_avg,
-			 (u32) (ring->prod - ring->cons - 1));
+			 (u32)(ring->prod - ring_cons - 1));
 
 	/* Packet is good - grab an index and transmit it */
 	index = ring->prod & ring->size_mask;
@@ -736,46 +785,48 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	tx_info->skb = skb;
 	tx_info->nr_txbb = nr_txbb;
 
+	data = &tx_desc->data;
 	if (lso_header_size)
 		data = ((void *)&tx_desc->lso + ALIGN(lso_header_size + 4,
 						      DS_SIZE));
-	else
-		data = &tx_desc->data;
 
 	/* valid only for none inline segments */
 	tx_info->data_offset = (void *)data - (void *)tx_desc;
 
+	tx_info->inl = inline_ok;
+
 	tx_info->linear = (lso_header_size < skb_headlen(skb) &&
-			   !is_inline(ring->inline_thold, skb, NULL)) ? 1 : 0;
+			   !inline_ok) ? 1 : 0;
 
-	data += skb_shinfo(skb)->nr_frags + tx_info->linear - 1;
+	tx_info->nr_maps = shinfo->nr_frags + tx_info->linear;
+	data += tx_info->nr_maps - 1;
 
-	if (is_inline(ring->inline_thold, skb, &fragptr)) {
-		tx_info->inl = 1;
-	} else {
-		/* Map fragments */
-		for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
-			struct skb_frag_struct *frag;
-			dma_addr_t dma;
+	if (!tx_info->inl) {
+		dma_addr_t dma = 0;
+		u32 byte_count = 0;
 
-			frag = &skb_shinfo(skb)->frags[i];
+		/* Map fragments if any */
+		for (i_frag = shinfo->nr_frags - 1; i_frag >= 0; i_frag--) {
+			const struct skb_frag_struct *frag;
+
+			frag = &shinfo->frags[i_frag];
+			byte_count = skb_frag_size(frag);
 			dma = skb_frag_dma_map(ddev, frag,
-					       0, skb_frag_size(frag),
+					       0, byte_count,
 					       DMA_TO_DEVICE);
 			if (dma_mapping_error(ddev, dma))
 				goto tx_drop_unmap;
 
 			data->addr = cpu_to_be64(dma);
-			data->lkey = cpu_to_be32(mdev->mr.key);
+			data->lkey = ring->mr_key;
 			wmb();
-			data->byte_count = cpu_to_be32(skb_frag_size(frag));
+			data->byte_count = cpu_to_be32(byte_count);
 			--data;
 		}
 
-		/* Map linear part */
+		/* Map linear part if needed */
 		if (tx_info->linear) {
-			u32 byte_count = skb_headlen(skb) - lso_header_size;
-			dma_addr_t dma;
+			byte_count = skb_headlen(skb) - lso_header_size;
 
 			dma = dma_map_single(ddev, skb->data +
 					     lso_header_size, byte_count,
@@ -784,29 +835,28 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 				goto tx_drop_unmap;
 
 			data->addr = cpu_to_be64(dma);
-			data->lkey = cpu_to_be32(mdev->mr.key);
+			data->lkey = ring->mr_key;
 			wmb();
 			data->byte_count = cpu_to_be32(byte_count);
 		}
-		tx_info->inl = 0;
+		/* tx completion can avoid cache line miss for common cases */
+		tx_info->map0_dma = dma;
+		tx_info->map0_byte_count = byte_count;
 	}
 
 	/*
 	 * For timestamping add flag to skb_shinfo and
 	 * set flag for further reference
 	 */
-	if (ring->hwtstamp_tx_type == HWTSTAMP_TX_ON &&
-	    skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
-		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+	tx_info->ts_requested = 0;
+	if (unlikely(ring->hwtstamp_tx_type == HWTSTAMP_TX_ON &&
+		     shinfo->tx_flags & SKBTX_HW_TSTAMP)) {
+		shinfo->tx_flags |= SKBTX_IN_PROGRESS;
 		tx_info->ts_requested = 1;
 	}
 
 	/* Prepare ctrl segement apart opcode+ownership, which depends on
 	 * whether LSO is used */
-	tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
-	tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_VLAN *
-		!!vlan_tx_tag_present(skb);
-	tx_desc->ctrl.fence_size = (real_size / 16) & 0x3f;
 	tx_desc->ctrl.srcrb_flags = priv->ctrl_flags;
 	if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
 		tx_desc->ctrl.srcrb_flags |= cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM |
@@ -827,6 +877,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* Handle LSO (TSO) packets */
 	if (lso_header_size) {
+		int i;
+
 		/* Mark opcode as LSO */
 		op_own = cpu_to_be32(MLX4_OPCODE_LSO | (1 << 6)) |
 			((ring->prod & ring->size) ?
@@ -834,15 +886,16 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		/* Fill in the LSO prefix */
 		tx_desc->lso.mss_hdr_size = cpu_to_be32(
-			skb_shinfo(skb)->gso_size << 16 | lso_header_size);
+			shinfo->gso_size << 16 | lso_header_size);
 
 		/* Copy headers;
 		 * note that we already verified that it is linear */
 		memcpy(tx_desc->lso.header, skb->data, lso_header_size);
 
-		priv->port_stats.tso_packets++;
-		i = ((skb->len - lso_header_size) / skb_shinfo(skb)->gso_size) +
-			!!((skb->len - lso_header_size) % skb_shinfo(skb)->gso_size);
+		ring->tso_packets++;
+
+		i = ((skb->len - lso_header_size) / shinfo->gso_size) +
+			!!((skb->len - lso_header_size) % shinfo->gso_size);
 		tx_info->nr_bytes = skb->len + (i - 1) * lso_header_size;
 		ring->packets += i;
 	} else {
@@ -852,16 +905,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			 cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
 		tx_info->nr_bytes = max_t(unsigned int, skb->len, ETH_ZLEN);
 		ring->packets++;
-
 	}
 	ring->bytes += tx_info->nr_bytes;
 	netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
 	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);
 
-	if (tx_info->inl) {
-		build_inline_wqe(tx_desc, skb, real_size, &vlan_tag, tx_ind, fragptr);
-		tx_info->inl = 1;
-	}
+	if (tx_info->inl)
+		build_inline_wqe(tx_desc, skb, shinfo, real_size,
+				 &vlan_tag, tx_ind, fragptr);
 
 	if (skb->encapsulation) {
 		struct iphdr *ipv4 = (struct iphdr *)skb_inner_network_header(skb);
@@ -874,16 +925,19 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	ring->prod += nr_txbb;
 
 	/* If we used a bounce buffer then copy descriptor back into place */
-	if (bounce)
+	if (unlikely(bounce))
 		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
 
 	skb_tx_timestamp(skb);
 
 	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
 
+	real_size = (real_size / 16) & 0x3f;
+
 	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
 	    !vlan_tx_tag_present(skb) && send_doorbell) {
-		tx_desc->ctrl.bf_qpn |= cpu_to_be32(ring->doorbell_qpn);
+		tx_desc->ctrl.bf_qpn = ring->doorbell_qpn |
+				       cpu_to_be32(real_size);
 
 		op_own |= htonl((bf_index & 0xffff) << 8);
 		/* Ensure new descriptor hits memory
@@ -894,13 +948,18 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		wmb();
 
-		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, (unsigned long *) &tx_desc->ctrl,
-		     desc_size);
+		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
+			     desc_size);
 
 		wmb();
 
 		ring->bf.offset ^= ring->bf.buf_size;
 	} else {
+		tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
+		tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_VLAN *
+			!!vlan_tx_tag_present(skb);
+		tx_desc->ctrl.fence_size = real_size;
+
 		/* Ensure new descriptor hits memory
 		 * before setting ownership of this descriptor to HW
 		 */
@@ -918,8 +977,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 tx_drop_unmap:
 	en_err(priv, "DMA mapping error\n");
 
-	for (i++; i < skb_shinfo(skb)->nr_frags; i++) {
-		data++;
+	while (++i_frag < shinfo->nr_frags) {
+		++data;
 		dma_unmap_page(ddev, (dma_addr_t) be64_to_cpu(data->addr),
 			       be32_to_cpu(data->byte_count),
 			       PCI_DMA_TODEVICE);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 6a4fc2394cf2..8731aac89244 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -216,13 +216,16 @@ enum cq_type {
 
 struct mlx4_en_tx_info {
 	struct sk_buff *skb;
-	u32 nr_txbb;
-	u32 nr_bytes;
-	u8 linear;
-	u8 data_offset;
-	u8 inl;
-	u8 ts_requested;
-};
+	dma_addr_t	map0_dma;
+	u32		map0_byte_count;
+	u32		nr_txbb;
+	u32		nr_bytes;
+	u8		linear;
+	u8		data_offset;
+	u8		inl;
+	u8		ts_requested;
+	u8		nr_maps;
+} ____cacheline_aligned_in_smp;
 
 
 #define MLX4_EN_BIT_DESC_OWN	0x80000000
@@ -253,39 +256,48 @@ struct mlx4_en_rx_alloc {
 };
 
 struct mlx4_en_tx_ring {
+			/* cache line used and dirtied in tx completion
+			 * (mlx4_en_free_tx_buf())
+			 */
+	u32			last_nr_txbb;
+	u32			cons;
+	unsigned long		wake_queue;
+
+			/* cache line used and dirtied in mlx4_en_xmit()
+			 */
+	u32			prod ____cacheline_aligned_in_smp;
+	unsigned long		bytes;
+	unsigned long		packets;
+	unsigned long		tx_csum;
+	unsigned long		tso_packets;
+	struct mlx4_bf		bf;
+	unsigned long		queue_stopped;
+
+			/* Following part should be mostly read
+			 */
+	cpumask_t		affinity_mask;
+	struct mlx4_qp		qp;
 	struct mlx4_hwq_resources wqres;
-	u32 size ; /* number of TXBBs */
-	u32 size_mask;
-	u16 stride;
-	u16 cqn;	/* index of port CQ associated with this ring */
-	u32 prod;
-	u32 cons;
-	u32 buf_size;
-	u32 doorbell_qpn;
-	void *buf;
-	u16 poll_cnt;
-	struct mlx4_en_tx_info *tx_info;
-	u8 *bounce_buf;
-	u8 queue_index;
-	cpumask_t affinity_mask;
-	u32 last_nr_txbb;
-	struct mlx4_qp qp;
-	struct mlx4_qp_context context;
-	int qpn;
-	enum mlx4_qp_state qp_state;
-	struct mlx4_srq dummy;
-	unsigned long bytes;
-	unsigned long packets;
-	unsigned long tx_csum;
-	unsigned long queue_stopped;
-	unsigned long wake_queue;
-	struct mlx4_bf bf;
-	bool bf_enabled;
-	bool bf_alloced;
-	struct netdev_queue *tx_queue;
-	int hwtstamp_tx_type;
-	int inline_thold;
-};
+	u32			size ; /* number of TXBBs */
+	u32			size_mask;
+	u16			stride;
+	u16			cqn;	/* index of port CQ associated with this ring */
+	u32			buf_size;
+	__be32			doorbell_qpn;
+	__be32			mr_key;
+	void			*buf;
+	struct mlx4_en_tx_info	*tx_info;
+	u8			*bounce_buf;
+	struct mlx4_qp_context	context;
+	int			qpn;
+	enum mlx4_qp_state	qp_state;
+	u8			queue_index;
+	bool			bf_enabled;
+	bool			bf_alloced;
+	struct netdev_queue	*tx_queue;
+	int			hwtstamp_tx_type;
+	int			inline_thold;
+} ____cacheline_aligned_in_smp;
 
 struct mlx4_en_rx_desc {
 	/* actual number of entries depends on rx ring stride */
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 03b5608a4329..5e5ad07548b8 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -583,7 +583,7 @@ struct mlx4_uar {
 };
 
 struct mlx4_bf {
-	unsigned long		offset;
+	unsigned int		offset;
 	int			buf_size;
 	struct mlx4_uar	       *uar;
 	void __iomem	       *reg;

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
  2014-09-29  4:19             ` [PATCH v2 " Eric Dumazet
@ 2014-09-30 12:01               ` Amir Vadai
  2014-09-30 12:11                 ` Eric Dumazet
  2014-10-02  4:35               ` Eric Dumazet
  1 sibling, 1 reply; 29+ messages in thread
From: Amir Vadai @ 2014-09-30 12:01 UTC (permalink / raw)
  To: Eric Dumazet, Or Gerlitz
  Cc: Alexei Starovoitov, David S. Miller, Jesper Dangaard Brouer,
	Eric Dumazet, John Fastabend, Linux Netdev List, Or Gerlitz

On 9/29/2014 7:19 AM, Eric Dumazet wrote:
> First I implemented skb->xmit_more support, and pktgen throughput
> went from ~5Mpps to ~10Mpps.
> 
> Then, looking closely at this driver I found false sharing problems that
> should be addressed by this patch, as my pktgen now reaches 14.7 Mpps
> on a single TX queue, with a burst factor of 8.
> 
> So this patch as a whole improves raw performance on a single
> TX queue from about 5 Mpps to 14.9 Mpps.
> 
> Note that if packets are below the inline_thold threshold (104 bytes),
> the driver copies the packet contents into the tx descriptor, and
> throughput is lowered to ~7.5 Mpps:
> ->      We might reconsider inline strategy in a followup patch.
> 
> I could split this patch into multiple components, but I prefer not
> to spend days on this work.

Hi,

Thanks for the great work Eric.

I'm working to split the patch and pass it through some regression tests
and performance benchmarking.
Will send the split patch-set in a few days.

Amir

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
  2014-09-30 12:01               ` Amir Vadai
@ 2014-09-30 12:11                 ` Eric Dumazet
  0 siblings, 0 replies; 29+ messages in thread
From: Eric Dumazet @ 2014-09-30 12:11 UTC (permalink / raw)
  To: Amir Vadai
  Cc: Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Jesper Dangaard Brouer, Eric Dumazet, John Fastabend,
	Linux Netdev List, Or Gerlitz

On Tue, 2014-09-30 at 15:01 +0300, Amir Vadai wrote:

> Hi,
> 
> Thanks for the great work Eric.
> 
> I'm working to split the patch and pass it through some regression tests
> and performance benchmarking.
> Will send the split patch-set in a few days.

This looks great, thanks Amir.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
  2014-09-29  4:19             ` [PATCH v2 " Eric Dumazet
  2014-09-30 12:01               ` Amir Vadai
@ 2014-10-02  4:35               ` Eric Dumazet
  2014-10-02  8:03                 ` Amir Vadai
  1 sibling, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2014-10-02  4:35 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexei Starovoitov, David S. Miller, Jesper Dangaard Brouer,
	Eric Dumazet, John Fastabend, Linux Netdev List, Amir Vadai,
	Or Gerlitz

On Sun, 2014-09-28 at 21:19 -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>

...
> 6) mdev->mr.key stored in ring->mr_key to also avoid bswap() and access
>    to cold cache line.
>  


>  		ring->bf.offset ^= ring->bf.buf_size;
>  	} else {
> +		tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
> +		tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_VLAN *
> +			!!vlan_tx_tag_present(skb);
> +		tx_desc->ctrl.fence_size = real_size;
> +
>  		/* Ensure new descriptor hits memory
>  		 * before setting ownership of this descriptor to HW
>  		 */


Sorry, there is a missing replacement of 

	iowrite32be(ring->doorbell_qpn,
		    ring->bf.uar->map + MLX4_SEND_DOORBELL);

by iowrite32(ring->doorbell_qpn,
	     ring->bf.uar->map + MLX4_SEND_DOORBELL);

Since doorbell_qpn was changed to a __be32 and is now set up in
mlx4_en_activate_tx_ring()

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
  2014-10-02  4:35               ` Eric Dumazet
@ 2014-10-02  8:03                 ` Amir Vadai
  2014-10-02  8:29                   ` Jesper Dangaard Brouer
  2014-10-02 11:45                   ` Eric Dumazet
  0 siblings, 2 replies; 29+ messages in thread
From: Amir Vadai @ 2014-10-02  8:03 UTC (permalink / raw)
  To: Eric Dumazet, Or Gerlitz
  Cc: Alexei Starovoitov, David S. Miller, Jesper Dangaard Brouer,
	Eric Dumazet, John Fastabend, Linux Netdev List, Or Gerlitz,
	amira, idos, Yevgeny Petrilin, eyalpe

On 10/2/2014 7:35 AM, Eric Dumazet wrote:
> On Sun, 2014-09-28 at 21:19 -0700, Eric Dumazet wrote:
>> From: Eric Dumazet <edumazet@google.com>
> 

[...]

> Sorry, there is a missing replacement of 
> 
> 	iowrite32be(ring->doorbell_qpn,
> 		    ring->bf.uar->map + MLX4_SEND_DOORBELL);
> 
> by iowrite32(ring->doorbell_qpn,
> 	     ring->bf.uar->map + MLX4_SEND_DOORBELL);
> 
> Since doorbell_qpn was changed to a __be32 and is now set up in
> mlx4_en_activate_tx_ring()
> 

Hi,

Will take it into the split patchset - we just hit this bug when we tried
to run benchmarks with blueflame disabled (easy to test by using the
ethtool blueflame priv flag).

I'm still working on it, but I can't reproduce the numbers that you
show. On my development machine, I get ~5.5Mpps with burst=8 and ~2Mpps
with burst=1.

In addition, I see no improvements when adding the optimization to the
xmit path.
I use the net-next kernel + pktgen burst support patch, with and without
this xmit path optimization patch.

Do you use other patches not upstream in your environment?
Can you share the .config/pktgen configuration?

One other note: we're checking now whether blueflame can be used with
xmit_more. It might result in packet reordering/drops. Still under
investigation.

Thanks,
Amir

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
  2014-10-02  8:03                 ` Amir Vadai
@ 2014-10-02  8:29                   ` Jesper Dangaard Brouer
  2014-10-02  8:57                     ` Amir Vadai
  2014-10-02 11:45                   ` Eric Dumazet
  1 sibling, 1 reply; 29+ messages in thread
From: Jesper Dangaard Brouer @ 2014-10-02  8:29 UTC (permalink / raw)
  To: Amir Vadai
  Cc: brouer, Eric Dumazet, Or Gerlitz, Alexei Starovoitov,
	David S. Miller, Eric Dumazet, John Fastabend, Linux Netdev List,
	Or Gerlitz, amira, idos, Yevgeny Petrilin, eyalpe

On Thu, 2 Oct 2014 11:03:13 +0300
Amir Vadai <amirv@mellanox.com> wrote:

> I'm still working on it, but I can't reproduce the numbers that you
> show. On my development machine, I get ~5.5Mpps with burst=8 and ~2Mpps
> with burst=1.

For pktgen it sounds like you are missing either:

1) flag QUEUE_MAP_CPU
 and/or
2) clone_skb $BIGNUMBER

My usage of pktgen:
 http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
  2014-10-02  8:29                   ` Jesper Dangaard Brouer
@ 2014-10-02  8:57                     ` Amir Vadai
  0 siblings, 0 replies; 29+ messages in thread
From: Amir Vadai @ 2014-10-02  8:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Eric Dumazet, John Fastabend, Linux Netdev List, Or Gerlitz,
	amira, idos, Yevgeny Petrilin, eyalpe

On 10/2/2014 11:29 AM, Jesper Dangaard Brouer wrote:
> On Thu, 2 Oct 2014 11:03:13 +0300
> Amir Vadai <amirv@mellanox.com> wrote:
> 
>> I'm still working on it, but I can't reproduce the numbers that you
>> show. On my development machine, I get ~5.5Mpps with burst=8 and ~2Mpps
>> with burst=1.
> 
> For pktgen it sounds like you are missing either:
> 
> 1) flag QUEUE_MAP_CPU

>  and/or
> 2) clone_skb $BIGNUMBER
Thanks,

This improved the numbers to ~7Mpps with burst=8 and ~4Mpps with burst=1
- still far away from 15Mpps, but a step in the right direction.

> 
> My usage of pktgen:
>  http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
  2014-10-02  8:03                 ` Amir Vadai
  2014-10-02  8:29                   ` Jesper Dangaard Brouer
@ 2014-10-02 11:45                   ` Eric Dumazet
  2014-10-02 11:56                     ` Amir Vadai
  1 sibling, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2014-10-02 11:45 UTC (permalink / raw)
  To: Amir Vadai
  Cc: Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Jesper Dangaard Brouer, Eric Dumazet, John Fastabend,
	Linux Netdev List, Or Gerlitz, amira, idos, Yevgeny Petrilin,
	eyalpe

On Thu, 2014-10-02 at 11:03 +0300, Amir Vadai wrote:

> Hi,
> 
> Will take it into the split patchset - we just hit this bug when we tried
> to run benchmarks with blueflame disabled (easy to test by using the
> ethtool blueflame priv flag).

Hmm, I do not know this ethtool command, please share ;)

> 
> I'm still working on it, but I can't reproduce the numbers that you
> show. On my development machine, I get ~5.5Mpps with burst=8 and ~2Mpps
> with burst=1.

You have to be careful with the 'clone X' : If you choose a too big
value, TX completion competes with the sender thread.

> 
> In addition, I see no improvements when adding the optimization to the
> xmit path.
> I use the net-next kernel + pktgen burst support patch, with and without
> this xmit path optimization patch.
> 
> Do you use other patches not upstream in your environment?

Nope, this is with David net-next tree.

> Can you share the .config/pktgen configuration?

Sure.

> 
> One other note: we're checking now whether blueflame can be used with
> xmit_more. It might result in packet reordering/drops. Still under
> investigation.

I noticed no reorders. I tweaked the stack to force a gso segmentation
(in software) instead of using NIC TSO for small packets (2 or 3 MSS)

200 concurrent netperf -t TCP_RR -- -r 2000,2000    performance was
increased by ~100%.


#!/bin/bash
#
# on the destination, drop packets with
#   iptables -A PREROUTING -t raw -p udp --dport 9 -j DROP
#   Or run a recent enough kernel with global ICMP rate limiting to 1000 packets/sec
#   ( http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=4cdf507d54525842dfd9f6313fdafba039084046 )
#
#### Configure

# Yeah, if you use PKTSIZE <= 104, performance is lower because of inline (copy whole frame content into tx desc)
PKTSIZE=105

echo "pktrate: $PKTRATE"

COUNT=20000000000
RUN_SECS=60
SRC_DEV=eth0
SRC_IP_MIN=7.0.0.1
SRC_IP_MAX=7.255.255.255
SRC_MAC=00:1a:11:c3:0d:7f 
DST_IP=10.246.7.152 
DST_MAC=00:1a:11:c3:0d:45 
DST_UDP=9

## END OF CONFIGURATION OPTIONS

#### Helper

## Configuration procfs inodes

DEV_INODE=/proc/net/pktgen/$SRC_DEV
MAIN_INODE=/proc/net/pktgen/pgctrl
THREAD_INODE=/proc/net/pktgen/kpktgend_2

# write to a procfs file
function pgset_ex()
{
  local result

  echo $2
  echo $2 > $1

  result=`cat $1 | fgrep "Result: OK:"`
  if [ "$result" = "" ]; then
       cat $1 | fgrep Result:
  fi
}


#### Pre: configure

# attach device exclusively
pgset_ex $THREAD_INODE "rem_device_all"
pgset_ex $THREAD_INODE "add_device $SRC_DEV"

# configure basics
pgset_ex $DEV_INODE "clone_skb 8"
pgset_ex $DEV_INODE "src_min $SRC_IP_MIN"
pgset_ex $DEV_INODE "src_max $SRC_IP_MAX"
pgset_ex $DEV_INODE "dst $DST_IP"
pgset_ex $DEV_INODE "dst_mac $DST_MAC"
pgset_ex $DEV_INODE "udp_dst_min $DST_UDP"
pgset_ex $DEV_INODE "udp_dst_max $DST_UDP"
pgset_ex $DEV_INODE "queue_map_min 0"
pgset_ex $DEV_INODE "queue_map_max 0"
pgset_ex $DEV_INODE "burst 8"

pgset_ex $DEV_INODE "pkt_size $PKTSIZE"

pgset_ex $DEV_INODE "delay 0"

# reset to continuous transmission
pgset_ex $DEV_INODE "count $COUNT"


#### Run: transmit

echo -e "UDP packet generator (based on linux pktgen)\n"
echo -e "  src:  mac=$SRC_MAC ip=$SRC_IP dev=$SRC_DEV"
echo -e "  dest: mac=$DST_MAC ip=$DST_IP port=$DST_UDP\n"

modprobe pktgen

#ethtool -C eth0 tx-usecs 16 tx-frames 16
#ethtool -C eth1 tx-usecs 16 tx-frames 16

# start thread(s)
# the write will block until Ctrl^C is pressed or a timeout kills the write
echo "Running for $RUN_SECS seconds"
#pgset_ex $MAIN_INODE "start"
echo "start" > $MAIN_INODE 2>/dev/null &


sleep $RUN_SECS
echo $DEV_INODE
cat $DEV_INODE

# stop
kill $!
pgset_ex $MAIN_INODE "stop"

echo "OK. All done"

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
  2014-10-02 11:45                   ` Eric Dumazet
@ 2014-10-02 11:56                     ` Amir Vadai
  2014-10-02 12:07                       ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Amir Vadai @ 2014-10-02 11:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Jesper Dangaard Brouer, Eric Dumazet, John Fastabend,
	Linux Netdev List, Or Gerlitz, amira, idos, Yevgeny Petrilin,
	eyalpe

On 10/2/2014 2:45 PM, Eric Dumazet wrote:
> On Thu, 2014-10-02 at 11:03 +0300, Amir Vadai wrote:
> 
>> Hi,
>>
>> Will take it into the split patchset - we just hit this bug when we tried
>> to run benchmarks with blueflame disabled (easy to test by using the
>> ethtool blueflame priv flag).
> 
> Hmm, I do not know this ethtool command, please share ;)

$ ethtool --set-priv-flags eth0 blueflame off

> 
>>
>> I'm still working on it, but I can't reproduce the numbers that you
>> show. On my development machine, I get ~5.5Mpps with burst=8 and ~2Mpps
>> with burst=1.
> 
> You have to be careful with the 'clone X' : If you choose a too big
> value, TX completion competes with the sender thread.
> 
>>
>> In addition, I see no improvements when adding the optimization to the
>> xmit path.
After making sure the sender thread and the TX completions are not on
the same CPU, I see the expected improvement. +0.5Mpps with tx
optimizations.

>> I use the net-next kernel + pktgen burst support patch, with and without
>> this xmit path optimization patch.
>>
>> Do you use other patches not upstream in your environment?
> 
> Nope, this is with David net-next tree.
> 
>> Can you share the .config/pktgen configuration?
> 
> Sure.
> 
>>
>> One other note: we're checking now whether blueflame can be used with
>> xmit_more. It might result in packet reordering/drops. Still under
>> investigation.
> 
> I noticed no reorders. I tweaked the stack to force a gso segmentation
> (in software) instead of using NIC TSO for small packets (2 or 3 MSS)
> 
> 200 concurrent netperf -t TCP_RR -- -r 2000,2000    performance was
> increased by ~100%.
> 
> 
> #!/bin/bash
> #
> # on the destination, drop packets with
> #   iptables -A PREROUTING -t raw -p udp --dport 9 -j DROP
> #   Or run a recent enough kernel with global ICMP rate limiting to 1000 packets/sec
> #   ( http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=4cdf507d54525842dfd9f6313fdafba039084046 )
> #
> #### Configure
> 
> # Yeah, if you use PKTSIZE <= 104, performance is lower because of inline (copy whole frame content into tx desc)
> PKTSIZE=105
You can also set the module parameter to turn it off:
$ modprobe mlx4_en inline_thold=17

> 
[...]

Thanks,
Amir

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
  2014-10-02 11:56                     ` Amir Vadai
@ 2014-10-02 12:07                       ` Eric Dumazet
  2014-10-02 12:45                         ` Amir Vadai
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2014-10-02 12:07 UTC (permalink / raw)
  To: Amir Vadai
  Cc: Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Jesper Dangaard Brouer, Eric Dumazet, John Fastabend,
	Linux Netdev List, Or Gerlitz, amira, idos, Yevgeny Petrilin,
	eyalpe

On Thu, 2014-10-02 at 14:56 +0300, Amir Vadai wrote:
> After making sure the sender thread and the TX completions are not on
> the same CPU, I see the expected improvement. +0.5Mpps with tx
> optimizations.

I got 40% here.

Hmm... both cpus are on the same socket, right ?

TX coalescing is properly setup ?

What interrupt rate do you get ?

You can take a look at where cycles are spent

perf record -a -g sleep 5
perf report

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
  2014-10-02 12:07                       ` Eric Dumazet
@ 2014-10-02 12:45                         ` Amir Vadai
  0 siblings, 0 replies; 29+ messages in thread
From: Amir Vadai @ 2014-10-02 12:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Jesper Dangaard Brouer, Eric Dumazet, John Fastabend,
	Linux Netdev List, Or Gerlitz, amira, idos, Yevgeny Petrilin,
	eyalpe

On 10/2/2014 3:07 PM, Eric Dumazet wrote:
> On Thu, 2014-10-02 at 14:56 +0300, Amir Vadai wrote:
>> After making sure the sender thread and the TX completions are not on
>> the same CPU, I see the expected improvement. +0.5Mpps with tx
>> optimizations.
> 
> I got 40% here.
> 
> Hmm... both cpus are on the same socket, right ?
Yes. And on a NUMA node close to the NIC

> 
> TX coalescing is properly setup ?
I will try to play with it; if it is too aggressive, the queue gets stopped.
I will play with it and the ring size again to find a better spot.

> 
> What interrupt rate do you get ?
~50K per second.

> 
> You can take a look at where cycles are spent
> 
> perf record -a -g sleep 5
> perf report
> 
> 
I will.

In case you're curious, I have this on the CPU doing completions:
+  44.86%      swapper  [kernel.kallsyms]  [k] consume_skb
+  21.38%      swapper  [kernel.kallsyms]  [k] mlx4_en_poll_tx_cq
+   6.81%      swapper  [kernel.kallsyms]  [k] mlx4_en_free_tx_desc.isra.24
+   6.07%      swapper  [kernel.kallsyms]  [k] intel_idle
+   2.98%      swapper  [kernel.kallsyms]  [k] __dev_kfree_skb_any
+   2.66%         rngd  rngd               [.] 0x0000000000002749
+   1.79%         rngd  [kernel.kallsyms]  [k] consume_skb
+   1.28%      swapper  [kernel.kallsyms]  [k] irq_entries_start
+   1.16%         rngd  [kernel.kallsyms]  [k] mlx4_en_poll_tx_cq
+   0.86%      swapper  [kernel.kallsyms]  [k] swiotlb_unmap_page
+   0.79%      swapper  [kernel.kallsyms]  [k] unmap_single
+   0.63%      swapper  [kernel.kallsyms]  [k] irqtime_account_irq
+   0.55%      swapper  [kernel.kallsyms]  [k] mlx4_eq_int
+   0.51%      swapper  [kernel.kallsyms]  [k] eq_set_ci.isra.14

and this on the xmit thread:
+  72.00%  kpktgend_6  [kernel.kallsyms]  [k] mlx4_en_xmit
+  13.11%  kpktgend_6  [kernel.kallsyms]  [k] pktgen_thread_worker
+   5.34%  kpktgend_6  [kernel.kallsyms]  [k] _raw_spin_lock
+   3.80%  kpktgend_6  [kernel.kallsyms]  [k] swiotlb_map_page
+   2.08%  kpktgend_6  [kernel.kallsyms]  [k] __iowrite64_copy
+   1.21%  kpktgend_6  [kernel.kallsyms]  [k] skb_clone_tx_timestamp
+   0.87%  kpktgend_6  [kernel.kallsyms]  [k] kthread_should_stop
+   0.64%  kpktgend_6  [kernel.kallsyms]  [k] swiotlb_dma_mapping_error
+   0.51%  kpktgend_6  [kernel.kallsyms]  [k] __local_bh_enable_ip

Thanks,
Amir

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2014-10-02 12:45 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-26  0:46 [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more Alexei Starovoitov
2014-09-26  1:20 ` Eric Dumazet
2014-09-26  7:42   ` Eric Dumazet
2014-09-26 15:44     ` Eric Dumazet
2014-09-26 15:59       ` Alexei Starovoitov
2014-09-26 16:06         ` Eric Dumazet
2014-09-27 20:43     ` Eric Dumazet
2014-09-27 20:55       ` Or Gerlitz
2014-09-27 21:30         ` Eric Dumazet
2014-09-27 22:56           ` [PATCH net-next] mlx4: optimize xmit path Eric Dumazet
2014-09-27 23:44             ` Hannes Frederic Sowa
2014-09-28  0:05               ` Eric Dumazet
2014-09-28  0:22                 ` Hannes Frederic Sowa
2014-09-28 12:42             ` Eric Dumazet
2014-09-28 14:35             ` Or Gerlitz
2014-09-28 16:03               ` Eric Dumazet
2014-09-29  4:19             ` [PATCH v2 " Eric Dumazet
2014-09-30 12:01               ` Amir Vadai
2014-09-30 12:11                 ` Eric Dumazet
2014-10-02  4:35               ` Eric Dumazet
2014-10-02  8:03                 ` Amir Vadai
2014-10-02  8:29                   ` Jesper Dangaard Brouer
2014-10-02  8:57                     ` Amir Vadai
2014-10-02 11:45                   ` Eric Dumazet
2014-10-02 11:56                     ` Amir Vadai
2014-10-02 12:07                       ` Eric Dumazet
2014-10-02 12:45                         ` Amir Vadai
2014-09-26  8:05 ` [RFC PATCH net-next] net: pktgen: packet bursting via skb->xmit_more Jesper Dangaard Brouer
2014-09-27 20:59 ` Or Gerlitz
