* [RFC PATCH net-next 0/3] Some deferred TX queue follow-ups
@ 2014-08-24 13:42 Daniel Borkmann
  2014-08-24 13:42 ` [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush() Daniel Borkmann
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Daniel Borkmann @ 2014-08-24 13:42 UTC (permalink / raw)
  To: davem; +Cc: netdev

This series sits on top of Dave's `[PATCH 0/3] Basic deferred TX queue
flushing infrastructure`. It adds an ixgbe implementation and a first
use case in AF_PACKET's QDISC_BYPASS mode.

Daniel Borkmann (3):
  ixgbe: support netdev_ops->ndo_xmit_flush()
  net: add __netdev_xmit_{only,flush} helpers
  packet: make use of deferred TX queue flushing

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 28 ++++++++++++---
 include/linux/netdevice.h                     | 15 ++++++++
 net/packet/af_packet.c                        | 49 ++++++++++++++++++++-------
 3 files changed, 76 insertions(+), 16 deletions(-)

-- 
1.7.11.7


* [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush()
  2014-08-24 13:42 [RFC PATCH net-next 0/3] Some deferred TX queue follow-ups Daniel Borkmann
@ 2014-08-24 13:42 ` Daniel Borkmann
  2014-08-25  5:55   ` David Miller
  2014-08-25 12:07   ` Jesper Dangaard Brouer
  2014-08-24 13:42 ` [RFC PATCH net-next 2/3] net: add __netdev_xmit_{only,flush} helpers Daniel Borkmann
  2014-08-24 13:42 ` [RFC PATCH net-next 3/3] packet: make use of deferred TX queue flushing Daniel Borkmann
  2 siblings, 2 replies; 16+ messages in thread
From: Daniel Borkmann @ 2014-08-24 13:42 UTC (permalink / raw)
  To: davem; +Cc: netdev

This implements the deferred tail pointer flush API for the ixgbe
driver. A similar version was also proposed some time ago by Alexander Duyck.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 28 +++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 87bd53f..4e073cf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6957,10 +6957,6 @@ static void ixgbe_tx_map(struct ixgbe_ring *tx_ring,
 		i = 0;
 
 	tx_ring->next_to_use = i;
-
-	/* notify HW of packet */
-	ixgbe_write_tail(tx_ring, i);
-
 	return;
 dma_error:
 	dev_err(tx_ring->dev, "TX DMA map failed\n");
@@ -7301,6 +7297,29 @@ static netdev_tx_t ixgbe_xmit_frame(struct sk_buff *skb,
 	return __ixgbe_xmit_frame(skb, netdev, NULL);
 }
 
+static inline struct ixgbe_ring *
+__ixgb_tx_queue_mapping(struct ixgbe_adapter *adapter, u16 queue)
+{
+	if (unlikely(queue >= adapter->num_tx_queues))
+		queue = queue % adapter->num_tx_queues;
+
+	return adapter->tx_ring[queue];
+}
+
+static void ixgbe_xmit_flush(struct net_device *netdev, u16 queue)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+	struct ixgbe_ring *tx_ring;
+
+	tx_ring = __ixgb_tx_queue_mapping(adapter, queue);
+	ixgbe_write_tail(tx_ring, tx_ring->next_to_use);
+
+	/* we need this if more than one processor can write to our tail
+	 * at a time, it synchronizes IO on IA64/Altix systems
+	 */
+	mmiowb();
+}
+
 /**
  * ixgbe_set_mac - Change the Ethernet Address of the NIC
  * @netdev: network interface device structure
@@ -7914,6 +7933,7 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
 	.ndo_start_xmit		= ixgbe_xmit_frame,
+	.ndo_xmit_flush		= ixgbe_xmit_flush,
 	.ndo_select_queue	= ixgbe_select_queue,
 	.ndo_set_rx_mode	= ixgbe_set_rx_mode,
 	.ndo_validate_addr	= eth_validate_addr,
-- 
1.7.11.7


* [RFC PATCH net-next 2/3] net: add __netdev_xmit_{only,flush} helpers
  2014-08-24 13:42 [RFC PATCH net-next 0/3] Some deferred TX queue follow-ups Daniel Borkmann
  2014-08-24 13:42 ` [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush() Daniel Borkmann
@ 2014-08-24 13:42 ` Daniel Borkmann
  2014-08-24 13:42 ` [RFC PATCH net-next 3/3] packet: make use of deferred TX queue flushing Daniel Borkmann
  2 siblings, 0 replies; 16+ messages in thread
From: Daniel Borkmann @ 2014-08-24 13:42 UTC (permalink / raw)
  To: davem; +Cc: netdev

This adds two helpers so that transmitting and flushing can be driven
separately from each other, as opposed to netdev_start_xmit(), which
always does both.
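
A batching caller would then look roughly like this (sketch only;
get_next_frame() is just a stand-in for whatever produces the frames):

	/* push out a batch without touching the tail pointer ... */
	while ((skb = get_next_frame()) != NULL)
		__netdev_xmit_only(skb, dev);

	/* ... then ring the doorbell once for the whole batch */
	__netdev_xmit_flush(dev, queue);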

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/linux/netdevice.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1d05932..33faa33 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3393,6 +3393,21 @@ static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_devi
 	return __netdev_start_xmit(ops, skb, dev);
 }
 
+static inline netdev_tx_t __netdev_xmit_only(struct sk_buff *skb, struct net_device *dev)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	return ops->ndo_start_xmit(skb, dev);
+}
+
+static inline void __netdev_xmit_flush(struct net_device *dev, u16 queue)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_xmit_flush)
+		ops->ndo_xmit_flush(dev, queue);
+}
+
 int netdev_class_create_file_ns(struct class_attribute *class_attr,
 				const void *ns);
 void netdev_class_remove_file_ns(struct class_attribute *class_attr,
-- 
1.7.11.7


* [RFC PATCH net-next 3/3] packet: make use of deferred TX queue flushing
  2014-08-24 13:42 [RFC PATCH net-next 0/3] Some deferred TX queue follow-ups Daniel Borkmann
  2014-08-24 13:42 ` [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush() Daniel Borkmann
  2014-08-24 13:42 ` [RFC PATCH net-next 2/3] net: add __netdev_xmit_{only,flush} helpers Daniel Borkmann
@ 2014-08-24 13:42 ` Daniel Borkmann
  2014-08-25  5:57   ` David Miller
  2014-08-25 13:54   ` Jesper Dangaard Brouer
  2 siblings, 2 replies; 16+ messages in thread
From: Daniel Borkmann @ 2014-08-24 13:42 UTC (permalink / raw)
  To: davem; +Cc: netdev

This adds a first use-case of deferred tail pointer flushing
for AF_PACKET's TX_RING in QDISC_BYPASS mode. Instead of ringing the
doorbell for every single frame, frames are pushed out via
__netdev_xmit_only(), and the queue is flushed once every
PACKET_FLUSH_THRESH successfully sent frames, before sleeping when the
TX ring runs empty, and once more when leaving the send loop.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/packet/af_packet.c | 49 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 0dfa990..27457e8 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -216,7 +216,8 @@ static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
 static void packet_flush_mclist(struct sock *sk);
 
 struct packet_skb_cb {
-	unsigned int origlen;
+	u32 enforce_flush:1,
+	    origlen:31;
 	union {
 		struct sockaddr_pkt pkt;
 		struct sockaddr_ll ll;
@@ -237,8 +238,11 @@ struct packet_skb_cb {
 static void __fanout_unlink(struct sock *sk, struct packet_sock *po);
 static void __fanout_link(struct sock *sk, struct packet_sock *po);
 
+#define PACKET_FLUSH_THRESH	8
+
 static int packet_direct_xmit(struct sk_buff *skb)
 {
+	bool flush = PACKET_SKB_CB(skb)->enforce_flush;
 	struct net_device *dev = skb->dev;
 	netdev_features_t features;
 	struct netdev_queue *txq;
@@ -261,9 +265,12 @@ static int packet_direct_xmit(struct sk_buff *skb)
 
 	HARD_TX_LOCK(dev, txq, smp_processor_id());
 	if (!netif_xmit_frozen_or_drv_stopped(txq)) {
-		ret = netdev_start_xmit(skb, dev);
-		if (ret == NETDEV_TX_OK)
+		ret = __netdev_xmit_only(skb, dev);
+		if (ret == NETDEV_TX_OK) {
+			if (flush)
+				__netdev_xmit_flush(dev, queue_map);
 			txq_trans_update(txq);
+		}
 	}
 	HARD_TX_UNLOCK(dev, txq);
 
@@ -313,7 +320,7 @@ static u16 __packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
 	return (u16) raw_smp_processor_id() % dev->real_num_tx_queues;
 }
 
-static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
+static u16 packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
 	u16 queue_index;
@@ -327,6 +334,7 @@ static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
 	}
 
 	skb_set_queue_mapping(skb, queue_index);
+	return queue_index;
 }
 
 /* register_prot_hook must be invoked with the po->bind_lock held,
@@ -2237,7 +2245,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	unsigned char *addr;
 	int len_sum = 0;
 	int status = TP_STATUS_AVAILABLE;
-	int hlen, tlen;
+	int hlen, tlen, pending = 0;
+	u16 last_queue = 0;
 
 	mutex_lock(&po->pg_vec_lock);
 
@@ -2276,18 +2285,22 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		ph = packet_current_frame(po, &po->tx_ring,
 					  TP_STATUS_SEND_REQUEST);
 		if (unlikely(ph == NULL)) {
-			if (need_wait && need_resched())
+			if (need_wait && need_resched()) {
+				if (packet_use_direct_xmit(po) && pending > 0) {
+					__netdev_xmit_flush(dev, last_queue);
+					pending = 0;
+				}
 				schedule();
+			}
 			continue;
 		}
 
 		status = TP_STATUS_SEND_REQUEST;
 		hlen = LL_RESERVED_SPACE(dev);
-		tlen = dev->needed_tailroom;
-		skb = sock_alloc_send_skb(&po->sk,
-				hlen + tlen + sizeof(struct sockaddr_ll),
-				0, &err);
 
+		tlen = dev->needed_tailroom;
+		skb = sock_alloc_send_skb(&po->sk, hlen + tlen +
+					  sizeof(struct sockaddr_ll), 0, &err);
 		if (unlikely(skb == NULL))
 			goto out_status;
 
@@ -2319,13 +2332,18 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			}
 		}
 
-		packet_pick_tx_queue(dev, skb);
+		last_queue = packet_pick_tx_queue(dev, skb);
 
 		skb->destructor = tpacket_destruct_skb;
 		__packet_set_status(po, ph, TP_STATUS_SENDING);
 		packet_inc_pending(&po->tx_ring);
 
 		status = TP_STATUS_SEND_REQUEST;
+		if (pending >= PACKET_FLUSH_THRESH) {
+			PACKET_SKB_CB(skb)->enforce_flush = 1;
+			pending = -1;
+		}
+
 		err = po->xmit(skb);
 		if (unlikely(err > 0)) {
 			err = net_xmit_errno(err);
@@ -2340,7 +2358,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			 * let's treat it like congestion or err < 0
 			 */
 			err = 0;
+		} else {
+			/* Successfully sent out. */
+			pending++;
 		}
+
 		packet_increment_head(&po->tx_ring);
 		len_sum += tp_len;
 	} while (likely((ph != NULL) ||
@@ -2354,11 +2376,12 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 
 	err = len_sum;
 	goto out_put;
-
 out_status:
 	__packet_set_status(po, ph, status);
 	kfree_skb(skb);
 out_put:
+	if (packet_use_direct_xmit(po) && pending > 0)
+		__netdev_xmit_flush(dev, last_queue);
 	dev_put(dev);
 out:
 	mutex_unlock(&po->pg_vec_lock);
@@ -2561,6 +2584,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	if (unlikely(extra_len == 4))
 		skb->no_fcs = 1;
 
+	PACKET_SKB_CB(skb)->enforce_flush = 1;
+
 	err = po->xmit(skb);
 	if (err > 0 && (err = net_xmit_errno(err)) != 0)
 		goto out_unlock;
-- 
1.7.11.7


* Re: [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush()
  2014-08-24 13:42 ` [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush() Daniel Borkmann
@ 2014-08-25  5:55   ` David Miller
  2014-08-25 12:07   ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 16+ messages in thread
From: David Miller @ 2014-08-25  5:55 UTC (permalink / raw)
  To: dborkman; +Cc: netdev

From: Daniel Borkmann <dborkman@redhat.com>
Date: Sun, 24 Aug 2014 15:42:16 +0200

> +	/* we need this if more than one processor can write to our tail
> +	 * at a time, it synchronizes IO on IA64/Altix systems
> +	 */
> +	mmiowb();

Unlike for IGB, this doesn't exist in the IXGBE driver; please do not
it.


* Re: [RFC PATCH net-next 3/3] packet: make use of deferred TX queue flushing
  2014-08-24 13:42 ` [RFC PATCH net-next 3/3] packet: make use of deferred TX queue flushing Daniel Borkmann
@ 2014-08-25  5:57   ` David Miller
  2014-08-25  6:40     ` Daniel Borkmann
  2014-08-25 13:54   ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 16+ messages in thread
From: David Miller @ 2014-08-25  5:57 UTC (permalink / raw)
  To: dborkman; +Cc: netdev

From: Daniel Borkmann <dborkman@redhat.com>
Date: Sun, 24 Aug 2014 15:42:18 +0200

> This adds a first use-case of deferred tail pointer flushing
> for AF_PACKET's TX_RING in QDISC_BYPASS mode.
> 
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>

If last_queue changes, you'll need to force a flush.  Does that
end up happening with your changes here?  I really couldn't
tell for sure.


* Re: [RFC PATCH net-next 3/3] packet: make use of deferred TX queue flushing
  2014-08-25  5:57   ` David Miller
@ 2014-08-25  6:40     ` Daniel Borkmann
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Borkmann @ 2014-08-25  6:40 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On 08/25/2014 07:57 AM, David Miller wrote:
> From: Daniel Borkmann <dborkman@redhat.com>
> Date: Sun, 24 Aug 2014 15:42:18 +0200
>
>> This adds a first use-case of deferred tail pointer flushing
>> for AF_PACKET's TX_RING in QDISC_BYPASS mode.
>>
>> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
>
> If last_queue changes, you'll need to force a flush.  Does that
> end up happening with your changes here?  I really couldn't
> tell for sure.

Yes indeed, I noticed that later on as well. :)

I will fix that up and resubmit the series, thanks.
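
Something like this (completely untested) in tpacket_snd(), right
after picking the tx queue, should cover it:

	queue_index = packet_pick_tx_queue(dev, skb);
	if (pending > 0 && queue_index != last_queue) {
		/* queue changed, flush out the previous one first */
		__netdev_xmit_flush(dev, last_queue);
		pending = 0;
	}
	last_queue = queue_index;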


* Re: [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush()
  2014-08-24 13:42 ` [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush() Daniel Borkmann
  2014-08-25  5:55   ` David Miller
@ 2014-08-25 12:07   ` Jesper Dangaard Brouer
  2014-08-25 22:32     ` David Miller
                       ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2014-08-25 12:07 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: brouer, davem, netdev

On Sun, 24 Aug 2014 15:42:16 +0200
Daniel Borkmann <dborkman@redhat.com> wrote:

> This implements the deferred tail pointer flush API for the ixgbe
> driver. A similar version was also proposed some time ago by Alexander Duyck.

I've run some benchmarks with this patch only, which actually show a
performance regression.

Using trafgen with QDISC_BYPASS and mmap mode, via cmdline:
 trafgen --cpp  --dev eth5 --conf udp_example01.trafgen -V --cpus 1

BASELINE(no-patch): trafgen QDISC_BYPASS and mmap:
 - tx:1562539 pps

(This patch only): ixgbe use of .ndo_xmit_flush.
 - tx:1532299 pps

Regression: -30240 pps
 * In nanosec: (1/1562539*10^9)-(1/1532299*10^9) = -12.63 ns


As DaveM points out, we might not need the mmiowb().
Result when not performing the mmiowb():
 - tx:1548352 pps

Still a small regression: -14187 pps
 * In nanosec: (1/1562539*10^9)-(1/1548352*10^9) = -5.86 ns


I was not expecting this "slowdown" with this rather simple use of the
new ndo_xmit_flush API.  Can anyone explain why this is happening?



 
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 28 +++++++++++++++++++++++----
>  1 file changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 87bd53f..4e073cf 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -6957,10 +6957,6 @@ static void ixgbe_tx_map(struct ixgbe_ring *tx_ring,
>  		i = 0;
>  
>  	tx_ring->next_to_use = i;
> -
> -	/* notify HW of packet */
> -	ixgbe_write_tail(tx_ring, i);
> -
>  	return;
>  dma_error:
>  	dev_err(tx_ring->dev, "TX DMA map failed\n");
> @@ -7301,6 +7297,29 @@ static netdev_tx_t ixgbe_xmit_frame(struct sk_buff *skb,
>  	return __ixgbe_xmit_frame(skb, netdev, NULL);
>  }
>  
> +static inline struct ixgbe_ring *
> +__ixgb_tx_queue_mapping(struct ixgbe_adapter *adapter, u16 queue)
> +{
> +	if (unlikely(queue >= adapter->num_tx_queues))
> +		queue = queue % adapter->num_tx_queues;
> +
> +	return adapter->tx_ring[queue];
> +}
> +
> +static void ixgbe_xmit_flush(struct net_device *netdev, u16 queue)
> +{
> +	struct ixgbe_adapter *adapter = netdev_priv(netdev);
> +	struct ixgbe_ring *tx_ring;
> +
> +	tx_ring = __ixgb_tx_queue_mapping(adapter, queue);
> +	ixgbe_write_tail(tx_ring, tx_ring->next_to_use);
> +
> +	/* we need this if more than one processor can write to our tail
> +	 * at a time, it synchronizes IO on IA64/Altix systems
> +	 */
> +	mmiowb();

This is the mmiowb() which is avoided in the above measurement.

> +}
> +
>  /**
>   * ixgbe_set_mac - Change the Ethernet Address of the NIC
>   * @netdev: network interface device structure
> @@ -7914,6 +7933,7 @@ static const struct net_device_ops ixgbe_netdev_ops = {
>  	.ndo_open		= ixgbe_open,
>  	.ndo_stop		= ixgbe_close,
>  	.ndo_start_xmit		= ixgbe_xmit_frame,
> +	.ndo_xmit_flush		= ixgbe_xmit_flush,
>  	.ndo_select_queue	= ixgbe_select_queue,
>  	.ndo_set_rx_mode	= ixgbe_set_rx_mode,
>  	.ndo_validate_addr	= eth_validate_addr,



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC PATCH net-next 3/3] packet: make use of deferred TX queue flushing
  2014-08-24 13:42 ` [RFC PATCH net-next 3/3] packet: make use of deferred TX queue flushing Daniel Borkmann
  2014-08-25  5:57   ` David Miller
@ 2014-08-25 13:54   ` Jesper Dangaard Brouer
  2014-08-25 15:16     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2014-08-25 13:54 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: brouer, davem, netdev

On Sun, 24 Aug 2014 15:42:18 +0200
Daniel Borkmann <dborkman@redhat.com> wrote:

> This adds a first use-case of deferred tail pointer flushing
> for AF_PACKET's TX_RING in QDISC_BYPASS mode.

Testing with trafgen.  For this testing I've updated patch 1/3 to NOT
call mmiowb(); see my other post for why.

trafgen cmdline:
 trafgen --cpp  --dev eth5 --conf udp_example01.trafgen -V --cpus 1
 * Only use 1 CPU
 * default is mmap
 * default is QDISC_BYPASS mode

BASELINE(no-patches): trafgen QDISC_BYPASS and mmap:
 - tx:1562539 pps

With PACKET_FLUSH_THRESH=8, and QDISC_BYPASS and mmap:
 - tx:1683746 pps

Improvement:
 + 121207 pps
 - 46 ns (1/1562539*10^9)-(1/1683746*10^9)

This is a significant improvement! :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Setup details:
--------------
Network overload testing setup according to:
 http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html

Trafgen input file:
 https://github.com/netoptimizer/network-testing/blob/master/trafgen/udp_example01.trafgen

Driver/NIC: ixgbe
CPU: E5-2695v2(ES)


* Re: [RFC PATCH net-next 3/3] packet: make use of deferred TX queue flushing
  2014-08-25 13:54   ` Jesper Dangaard Brouer
@ 2014-08-25 15:16     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2014-08-25 15:16 UTC (permalink / raw)
  Cc: brouer, Daniel Borkmann, davem, netdev

On Mon, 25 Aug 2014 15:54:02 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Sun, 24 Aug 2014 15:42:18 +0200
> Daniel Borkmann <dborkman@redhat.com> wrote:
> 
> > This adds a first use-case of deferred tail pointer flushing
> > for AF_PACKET's TX_RING in QDISC_BYPASS mode.
> 
> Testing with trafgen.  For this testing I've updated patch 1/3 to NOT
> call mmiowb(); see my other post for why.
> 
> trafgen cmdline:
>  trafgen --cpp  --dev eth5 --conf udp_example01.trafgen -V --cpus 1
>  * Only use 1 CPU
>  * default is mmap
>  * default is QDISC_BYPASS mode
> 
> BASELINE(no-patches): trafgen QDISC_BYPASS and mmap:
>  - tx:1562539 pps
> 
> With PACKET_FLUSH_THRESH=8, and QDISC_BYPASS and mmap:
>  - tx:1683746 pps
> 
> Improvement:
>  + 121207 pps
>  - 46 ns (1/1562539*10^9)-(1/1683746*10^9)
> 
> This is a significant improvement! :-)

I'm unfortunately seeing a regression if I'm NOT bypassing the qdisc
layer and still use mmap.  Trafgen has an option --qdisc-path for
this. (I believe most other solutions don't set the QDISC_BYPASS
socket option.)

trafgen command:
 # trafgen --cpp --dev eth5 --conf udp_example01.trafgen -V  --qdisc-path --cpus 1
 * still use mmap
 * choose normal qdisc code path via --qdisc-path

BASELINE(no-patches): trafgen using --qdisc-path and mmap:
 - tx:1371307 pps

(Patched): trafgen using --qdisc-path and mmap
 - tx:1345999 pps

Regression:
 * 25308 pps slower than before
 * 13.71 nanosec slower (1/1371307*10^9)-(1/1345999*10^9)

How can we explain this?!?

As can be deduced from the baseline numbers, the cost of the qdisc
path is fairly high, at 89.24 ns ((1/1562539*10^9)-(1/1371307*10^9)).
(This is a bit higher than I expected based on my data from:
http://people.netfilter.org/hawk/presentations/nfws2014/dp-accel-qdisc-lockless.pdf
where I measured it to be 60ns).

(Does this make sense?):  Above results say we can save 46ns by
delaying tailptr updates.  But the qdisc path itself will add 89ns of
delay between packets, which is then too large to take advantage of the
tailptr win.  (not sure this explains the issue... feel free to come up
with a better explanation)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush()
  2014-08-25 12:07   ` Jesper Dangaard Brouer
@ 2014-08-25 22:32     ` David Miller
  2014-08-25 23:31       ` David Miller
  2014-08-25 22:51     ` Alexander Duyck
  2014-08-27 11:34     ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2014-08-25 22:32 UTC (permalink / raw)
  To: brouer; +Cc: dborkman, netdev

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Mon, 25 Aug 2014 14:07:21 +0200

> I've run some benchmarks with this patch only, which actually show a
> performance regression.
> 
> Using trafgen with QDISC_BYPASS and mmap mode, via cmdline:
>  trafgen --cpp  --dev eth5 --conf udp_example01.trafgen -V --cpus 1
> 
> BASELINE(no-patch): trafgen QDISC_BYPASS and mmap:
>  - tx:1562539 pps
> 
> (This patch only): ixgbe use of .ndo_xmit_flush.
>  - tx:1532299 pps
> 
> Regression: -30240 pps
>  * In nanosec: (1/1562539*10^9)-(1/1532299*10^9) = -12.63 ns
> 
> 
> As DaveM points out, we might not need the mmiowb().
> Result when not performing the mmiowb():
>  - tx:1548352 pps
> 
> Still a small regression: -14187 pps
>  * In nanosec: (1/1562539*10^9)-(1/1548352*10^9) = -5.86 ns
> 
> I was not expecting this "slowdown" with this rather simple use of the
> new ndo_xmit_flush API.  Can anyone explain why this is happening?

Impressive amount of overhead for something that evaluates to just a
compiler barrier :-)

The extra indirect function call and walking down the data structures
to get to the queue pointer might account for the remaining cost.

This might be argument enough to contain the behavioral changes within
->ndo_start_xmit() itself.

Jesper, just for fun, could you revert all of the xmit flush stuff and
test this patch instead?

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 87bd53f..ba9ceaa 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6958,9 +6958,10 @@ static void ixgbe_tx_map(struct ixgbe_ring *tx_ring,
 
 	tx_ring->next_to_use = i;
 
-	/* notify HW of packet */
-	ixgbe_write_tail(tx_ring, i);
-
+	if (!skb->xmit_more) {
+		/* notify HW of packet */
+		ixgbe_write_tail(tx_ring, i);
+	}
 	return;
 dma_error:
 	dev_err(tx_ring->dev, "TX DMA map failed\n");
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 18ddf96..dc6141da 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -558,6 +558,7 @@ struct sk_buff {
 
 	__u16			queue_mapping;
 	kmemcheck_bitfield_begin(flags2);
+	__u8			xmit_more:1;
 #ifdef CONFIG_IPV6_NDISC_NODETYPE
 	__u8			ndisc_nodetype:2;
 #endif


* Re: [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush()
  2014-08-25 12:07   ` Jesper Dangaard Brouer
  2014-08-25 22:32     ` David Miller
@ 2014-08-25 22:51     ` Alexander Duyck
  2014-08-26  6:44       ` Jesper Dangaard Brouer
  2014-08-27 11:34     ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 16+ messages in thread
From: Alexander Duyck @ 2014-08-25 22:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Daniel Borkmann; +Cc: davem, netdev

On 08/25/2014 05:07 AM, Jesper Dangaard Brouer wrote:
> On Sun, 24 Aug 2014 15:42:16 +0200
> Daniel Borkmann <dborkman@redhat.com> wrote:
> 
>> This implements the deferred tail pointer flush API for the ixgbe
>> driver. A similar version was also proposed some time ago by Alexander Duyck.
> 
> I've run some benchmarks with this patch only, which actually show a
> performance regression.
> 
> Using trafgen with QDISC_BYPASS and mmap mode, via cmdline:
>  trafgen --cpp  --dev eth5 --conf udp_example01.trafgen -V --cpus 1
> 
> BASELINE(no-patch): trafgen QDISC_BYPASS and mmap:
>  - tx:1562539 pps
> 
> (This patch only): ixgbe use of .ndo_xmit_flush.
>  - tx:1532299 pps
> 
> Regression: -30240 pps
>  * In nanosec: (1/1562539*10^9)-(1/1532299*10^9) = -12.63 ns
> 
> 
> As DaveM points out, we might not need the mmiowb().
> Result when not performing the mmiowb():
>  - tx:1548352 pps
> 
> Still a small regression: -14187 pps
>  * In nanosec: (1/1562539*10^9)-(1/1548352*10^9) = -5.86 ns
> 
> 
> I was not expecting this "slowdown" with this rather simple use of the
> new ndo_xmit_flush API.  Can anyone explain why this is happening?

One possibility is that we are now doing less stuff between the time we
write tail and when we grab the qdisc lock (locked transactions are
stalled by MMIO) so that we are spending more time stuck waiting for the
write to complete and doing nothing.

Then of course there are always the funny oddball quirks, such as the
code changes altering the alignment of a loop and making Tx
cleanup more expensive than it was before.

Thanks,

Alex


* Re: [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush()
  2014-08-25 22:32     ` David Miller
@ 2014-08-25 23:31       ` David Miller
  2014-08-26  6:13         ` Daniel Borkmann
  0 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2014-08-25 23:31 UTC (permalink / raw)
  To: brouer; +Cc: dborkman, netdev

From: David Miller <davem@davemloft.net>
Date: Mon, 25 Aug 2014 15:32:48 -0700 (PDT)

> Jesper, just for fun, could you revert all of the xmit flush stuff and
> test this patch instead?

This doesn't work properly, sorry.  We need to explicitly set xmit_more
to zero before each ->ndo_start_xmit() call, because the initial zeroing
of that value isn't propagated in copy_skb_header() (nor do we want to
add that).
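
Roughly, as an untested sketch, that zeroing would live in the
netdev_start_xmit() wrapper, so that every ->ndo_start_xmit() call
sees a well-defined value:

	static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb,
						    struct net_device *dev)
	{
		const struct net_device_ops *ops = dev->netdev_ops;

		/* don't rely on skb allocation having zeroed this bit */
		skb->xmit_more = 0;
		return __netdev_start_xmit(ops, skb, dev);
	}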

But I'm beyond convinced now that ->ndo_xmit_flush() is not the way to
do this.

I'm about to post a set of patches which will go into net-next which:

1) Converts the tree to skb->xmit_more

2) Adds Daniel's IXGBE conversion, adjusted for xmit_more.


* Re: [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush()
  2014-08-25 23:31       ` David Miller
@ 2014-08-26  6:13         ` Daniel Borkmann
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Borkmann @ 2014-08-26  6:13 UTC (permalink / raw)
  To: David Miller; +Cc: brouer, netdev

On 08/26/2014 01:31 AM, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Mon, 25 Aug 2014 15:32:48 -0700 (PDT)
>
>> Jesper, just for fun, could you revert all of the xmit flush stuff and
>> test this patch instead?
>
> This doesn't work properly, sorry.  We need to explicitly set xmit_more
> to zero before each ->ndo_start_xmit() call, because the initial zeroing
> of that value isn't propagated in copy_skb_header() (nor do we want to
> add that).
>
> But I'm beyond convinced now that ->ndo_xmit_flush() is not the way to
> do this.
>
> I'm about to post a set of patches which will go into net-next which:
>
> 1) Converts the tree to skb->xmit_more
>
> 2) Adds Daniel's IXGBE conversion, adjusted for xmit_more.

This indeed seems a better, more lightweight way to accomplish it.


* Re: [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush()
  2014-08-25 22:51     ` Alexander Duyck
@ 2014-08-26  6:44       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2014-08-26  6:44 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Daniel Borkmann, davem, netdev, brouer

On Mon, 25 Aug 2014 15:51:50 -0700
Alexander Duyck <alexander.h.duyck@intel.com> wrote:

> On 08/25/2014 05:07 AM, Jesper Dangaard Brouer wrote:
> > On Sun, 24 Aug 2014 15:42:16 +0200
> > Daniel Borkmann <dborkman@redhat.com> wrote:
> > 
> >> This implements the deferred tail pointer flush API for the ixgbe
> >> driver. A similar version was also proposed some time ago by Alexander Duyck.
> > 
> > I've run some benchmarks with this patch only, which actually show a
> > performance regression.
> > 
> > Using trafgen with QDISC_BYPASS and mmap mode, via cmdline:
> >  trafgen --cpp  --dev eth5 --conf udp_example01.trafgen -V --cpus 1
> > 
> > BASELINE(no-patch): trafgen QDISC_BYPASS and mmap:
> >  - tx:1562539 pps
> > 
> > (This patch only): ixgbe use of .ndo_xmit_flush.
> >  - tx:1532299 pps
> > 
> > Regression: -30240 pps
> >  * In nanosec: (1/1562539*10^9)-(1/1532299*10^9) = -12.63 ns
> > 
> > 
> > As DaveM points out, we might not need the mmiowb().
> > Result when not performing the mmiowb():
> >  - tx:1548352 pps
> > 
> > Still a small regression: -14187 pps
> >  * In nanosec: (1/1562539*10^9)-(1/1548352*10^9) = -5.86 ns
> > 
> > 
> > I was not expecting this "slowdown" with this rather simple use of the
> > new ndo_xmit_flush API.  Can anyone explain why this is happening?
> 
> One possibility is that we are now doing less stuff between the time we
> write tail and when we grab the qdisc lock (locked transactions are
> stalled by MMIO) so that we are spending more time stuck waiting for the
> write to complete and doing nothing.

In this testcase we are bypassing the qdisc code path, but still taking
the HARD_TX_LOCK.  I was only expecting something in the area of -2ns due to the
extra function call overhead.

But when we start to include the qdisc code path, the performance
regression gets even worse.  I would like an explanation for that; see:

 http://thread.gmane.org/gmane.linux.network/327254/focus=327431


> Then of course there are always the funny oddball quirks, such as the
> code changes altering the alignment of a loop and making Tx
> cleanup more expensive than it was before.

Yes, this is when it gets hairy!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC PATCH net-next 1/3] ixgbe: support netdev_ops->ndo_xmit_flush()
  2014-08-25 12:07   ` Jesper Dangaard Brouer
  2014-08-25 22:32     ` David Miller
  2014-08-25 22:51     ` Alexander Duyck
@ 2014-08-27 11:34     ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2014-08-27 11:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, davem, netdev, Daniel Borkmann,
	Hannes Frederic Sowa, Florian Westphal

On Mon, 25 Aug 2014 14:07:21 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Sun, 24 Aug 2014 15:42:16 +0200
> Daniel Borkmann <dborkman@redhat.com> wrote:
> 
> > This implements the deferred tail pointer flush API for the ixgbe
> > driver. A similar version was also proposed some time ago by Alexander Duyck.
> 
> I've run some benchmarks with this patch only, which actually show a
> performance regression.
> 
[...]
>
> Still a small regression: -14187 pps
>  * In nanosec: (1/1562539*10^9)-(1/1548352*10^9) = -5.86 ns
>  
> I was not expecting this "slowdown" with this rather simple use of the
> new ndo_xmit_flush API.  Can anyone explain why this is happening?

I've re-run this experiment with more accuracy, e.g. C-state tuning, no
Hyper-Threading, and using pktgen.  See the description in the thread
with subject "Get rid of ndo_xmit_flush" [1].

DaveM was right in reverting this API.  According to my new, more
accurate measurements, the conclusion is the same: this API hurts performance.

Compared to baseline, with this patch (except not using mmiowb()):
 * (1/5609929*10^9)-(1/5388719*10^9) = -7.32 ns

Details below signature.

[1] http://thread.gmane.org/gmane.linux.network/327502/focus=327803
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Base setup
==========

BIOS: Disabled HT (Hyper-Threading)

Setup commands:
 sudo killall irqbalance
 base_device_setup.sh eth4 # calls set_irq_affinity
 base_device_setup.sh eth5
 netfilter_unload_modules.sh
 sudo ethtool -C eth5 rx-usecs 30
 sudo tuned-adm profile latency-performance

pktgen cmdline:
 ./example03.sh -i eth5 -d 192.168.21.4 -m 00:12:c0:80:1d:54
 (SKB_CLONE="100000" and no UDP port random)

Vanilla kernel for baselining, just **before**:
 * commit 4798248e4e02 ("net: Add ops->ndo_xmit_flush()").
Thus at:
 * commit 4c83acbc565d53 ("ipv6: White-space cleansing : gaps between function and symbol export").

With no HT:
 * ethtool -C eth5 rx-usecs 30
 * tuned-adm profile latency-performance
Results (pktgen):
 * instant rx:2 tx:5620736 pps n:120 average: rx:1 tx:5618140 pps
   (instant variation TX 0.082 ns (min:-0.088 max:0.147) RX 0.000 ns)
 * instant rx:1 tx:5622300 pps n:250 average: rx:1 tx:5619732 pps
   (instant variation TX 0.081 ns (min:-0.858 max:0.098) RX 0.000 ns)
 * accuracy: (1/5618140*10^9)-(1/5619732*10^9) = 0.05 ns
 * instant rx:1 tx:5618692 pps n:120 average: rx:1 tx:5617469 pps
   (instant variation TX 0.039 ns (min:-0.043 max:0.045) RX 0.000 ns)
 * accuracy: (1/5619732*10^9)-(1/5617469*10^9) = -0.072 ns
 * (reboot same kernel)
 * Some hiccup:
 * instant rx:1 tx:5610140 pps n:190 average: rx:1 tx:5587229 pps
   (instant variation TX 0.731 ns (min:-2.612 max:2.627) RX 0.000 ns)
 * accuracy: (1/5587229*10^9)-(1/5617469*10^9) = 0.963 ns
 * accuracy: (1/5587229*10^9)-(1/5619732*10^9) = 1.035 ns
 * instant rx:1 tx:5607568 pps n:120 average: rx:1 tx:5606006 pps
   (instant variation TX 0.050 ns (min:-0.855 max:0.066) RX 0.000 ns)
 * instant rx:1 tx:5608168 pps n:120 average: rx:1 tx:5611001 pps
   (instant variation TX -0.090 ns (min:-0.156 max:0.100) RX 0.000 ns)
 * Average: (5618140+5619732+5617469+5587229+5606006+5611001)/6 = 5609929 pps

Results: on branch 'ndo_xmit_flush'
-----------------------------------
Kernel at:
 * commit fe88e6dd8b9 ("Merge branch 'ndo_xmit_flush'")

Sending out via ixgbe, which in this kernel does not have the
ndo_xmit_flush function defined.

With no HT:
 * ethtool -C eth5 rx-usecs 30
 * tuned-adm profile latency-performance
Results (pktgen):
 * instant rx:1 tx:5600404 pps n:161 average: rx:1 tx:5600257 pps
  (instant variation TX 0.005 ns (min:-0.047 max:0.050) RX 0.000 ns)
 * instant rx:1 tx:5594840 pps n:120 average: rx:1 tx:5595316 pps
  (instant variation TX -0.015 ns (min:-0.028 max:0.025) RX 0.000 ns)
 * instant rx:1 tx:5599644 pps n:140 average: rx:1 tx:5599155 pps
  (instant variation TX 0.016 ns (min:-0.074 max:0.059) RX 0.000 ns)
 * instant rx:1 tx:5601296 pps n:75 average: rx:1 tx:5599074 pps
  (instant variation TX 0.071 ns (min:-0.051 max:0.087) RX 0.000 ns)
 * Averaged: (5600257+5595316+5599155+5599074)/4 = 5598450 pps

Compared to baseline: (averaged 5609929 pps)
 * (1/5609929*10^9)-(1/5598450*10^9) = -0.365ns

Conclusion: When ndo_xmit_flush is not active in the driver, performance
is the same, as the 0.365ns difference is below our accuracy level.

Results: on branch bulking01
----------------------------

Kernel at:
 * commit fe88e6dd8b9 ("Merge branch 'ndo_xmit_flush'")
 * Plus ixgbe support netdev_ops->ndo_xmit_flush()

With no HT:
 * ethtool -C eth5 rx-usecs 30
 * tuned-adm profile latency-performance
Results (pktgen):
 * instant rx:1 tx:5387528 pps n:170 average: rx:1 tx:5387842 pps
  (instant variation TX -0.011 ns (min:-0.193 max:0.125) RX 0.000 ns)
 * instant rx:1 tx:5387588 pps n:212 average: rx:1 tx:5387930 pps
  (instant variation TX -0.012 ns (min:-0.852 max:0.177) RX 0.000 ns)
 * instant rx:1 tx:5391172 pps n:70 average: rx:1 tx:5389684 pps
  (instant variation TX 0.051 ns (min:-0.097 max:0.087) RX 0.000 ns)
 * instant rx:1 tx:5388444 pps n:150 average: rx:1 tx:5389421 pps
  (instant variation TX -0.034 ns (min:-1.014 max:0.092) RX 0.000 ns
 * Average: (5387842+5387930+5389684+5389421)/4 = 5388719

Compared to baseline: (averaged 5609929 pps)
 * (1/5609929*10^9)-(1/5388719*10^9) = -7.32 ns

Conclusion: When ndo_xmit_flush is ACTIVE in the driver, this new
API of calling ndo_xmit_flush() hurts performance.
