* [RFC PATCH 0/2] Coalesce MMIO writes for transmits
@ 2012-07-12  0:25 Alexander Duyck
  2012-07-12  0:26 ` [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching Alexander Duyck
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Alexander Duyck @ 2012-07-12  0:25 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

This patch set is meant to address recent issues I found with ixgbe
performance being bound by Tx tail writes.  With these changes in place
and the dispatch_limit set to 1 or more I see a significant increase in
performance.

In the case of one of my systems I saw the routing rate for 7 queues jump
from 10.5 to 11.7 Mpps.  The overall increase I have seen on most systems
is on the order of 15%.  In the case of pktgen I have also seen a
noticeable increase: the previous limit for transmits was ~12.5 Mpps, but
with this patch set in place and the dispatch_limit enabled the value
increases to ~14.2 Mpps.

I expected there to be an increase in latency; however, so far I have not
run into one.  I have tried running NPtcp latency tests and have seen no
difference between the coalesced and non-coalesced transaction times.  I
welcome any suggestions for tests I might run that could expose latency
issues introduced by this patch set.

---

Alexander Duyck (2):
      ixgbe: Add functionality for delaying the MMIO write for Tx
      net: Add new network device function to allow for MMIO batching


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   22 +++++++-
 include/linux/netdevice.h                     |   57 +++++++++++++++++++++
 net/core/dev.c                                |   67 +++++++++++++++++++++++++
 net/core/net-sysfs.c                          |   36 +++++++++++++
 4 files changed, 180 insertions(+), 2 deletions(-)

-- 
Thanks,

Alex

* [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
  2012-07-12  0:25 [RFC PATCH 0/2] Coalesce MMIO writes for transmits Alexander Duyck
@ 2012-07-12  0:26 ` Alexander Duyck
  2012-07-12  7:14   ` Eric Dumazet
  2012-07-13  7:19   ` Eric Dumazet
  2012-07-12  0:26 ` [RFC PATCH 2/2] ixgbe: Add functionality for delaying the MMIO write for Tx Alexander Duyck
  2012-07-12 17:23 ` [RFC PATCH 0/2] Coalesce MMIO writes for transmits Stephen Hemminger
  2 siblings, 2 replies; 14+ messages in thread
From: Alexander Duyck @ 2012-07-12  0:26 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

This change adds the capability for a driver to batch the MMIO write
involved with transmits.  Most of the logic is based on the code for
qdisc scheduling.

What I did is break the transmit path into two parts.  We already had
the ndo_start_xmit function, which has been there all along.  The part I
added is ndo_complete_xmit, which is meant to handle notifying the
hardware that frames are ready for delivery.

To control all of this I added a net sysfs value for the Tx queues
called dispatch_limit (/sys/class/net/DEV/Q/dispatch_limit).  When it is
0, every frame notifies the hardware immediately.  When it is 1 or more,
the netdev_complete_xmit call will queue up to that number of packets;
once the value is exceeded it notifies the hardware and resets the
pending frame dispatch count.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 include/linux/netdevice.h |   57 ++++++++++++++++++++++++++++++++++++++
 net/core/dev.c            |   67 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-sysfs.c      |   36 ++++++++++++++++++++++++
 3 files changed, 160 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5a1a657..8d50fc4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -522,6 +522,8 @@ enum netdev_queue_state_t {
 	__QUEUE_STATE_DRV_XOFF,
 	__QUEUE_STATE_STACK_XOFF,
 	__QUEUE_STATE_FROZEN,
+	__QUEUE_STATE_DELAYED,
+	__QUEUE_STATE_DISPATCH,
 #define QUEUE_STATE_ANY_XOFF ((1 << __QUEUE_STATE_DRV_XOFF)		| \
 			      (1 << __QUEUE_STATE_STACK_XOFF))
 #define QUEUE_STATE_ANY_XOFF_OR_FROZEN (QUEUE_STATE_ANY_XOFF		| \
@@ -550,6 +552,7 @@ struct netdev_queue {
 #if defined(CONFIG_XPS) && defined(CONFIG_NUMA)
 	int			numa_node;
 #endif
+	unsigned int		dispatch_limit;
 /*
  * write mostly part
  */
@@ -561,6 +564,11 @@ struct netdev_queue {
 	unsigned long		trans_start;
 
 	/*
+	 * pointer to next Tx queue in dispatch_queue
+	 */
+	struct netdev_queue	*next_dispatch;
+
+	/*
 	 * Number of TX timeouts for this queue
 	 * (/sys/class/net/DEV/Q/trans_timeout)
 	 */
@@ -568,6 +576,8 @@ struct netdev_queue {
 
 	unsigned long		state;
 
+	unsigned int		dispatch_pending;
+
 #ifdef CONFIG_BQL
 	struct dql		dql;
 #endif
@@ -924,6 +934,8 @@ struct net_device_ops {
 	int			(*ndo_stop)(struct net_device *dev);
 	netdev_tx_t		(*ndo_start_xmit) (struct sk_buff *skb,
 						   struct net_device *dev);
+	void			(*ndo_complete_xmit) (struct net_device *dev,
+						      unsigned int queue);
 	u16			(*ndo_select_queue)(struct net_device *dev,
 						    struct sk_buff *skb);
 	void			(*ndo_change_rx_flags)(struct net_device *dev,
@@ -1760,6 +1772,9 @@ struct softnet_data {
 	unsigned int		dropped;
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
+
+	struct netdev_queue	*dispatch_queue;
+	struct netdev_queue	**dispatch_queue_tailp;
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
@@ -1779,6 +1794,44 @@ static inline void input_queue_tail_incr_save(struct softnet_data *sd,
 
 DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 
+static inline void netif_tx_delay_queue(struct netdev_queue *txq)
+{
+	set_bit(__QUEUE_STATE_DELAYED, &txq->state);
+}
+
+extern void __netif_tx_dispatch_queue(struct netdev_queue *txq);
+
+static inline void netif_tx_dispatch_queue(struct netdev_queue *txq)
+{
+	if (test_and_clear_bit(__QUEUE_STATE_DELAYED, &txq->state))
+		__netif_tx_dispatch_queue(txq);
+}
+
+static inline bool netif_tx_queue_delayed(const struct netdev_queue *txq)
+{
+	return test_bit(__QUEUE_STATE_DELAYED, &txq->state);
+}
+
+static inline void netdev_complete_xmit(struct netdev_queue *txq)
+{
+	struct net_device *dev = txq->dev;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (txq->dispatch_pending < txq->dispatch_limit) {
+		if (netif_tx_queue_delayed(txq)) {
+			txq->dispatch_pending++;
+			return;
+		}
+
+		/* start of delayed write sequence */
+		netif_tx_delay_queue(txq);
+	}
+
+	txq->dispatch_pending = 0;
+
+	ops->ndo_complete_xmit(dev, txq - &dev->_tx[0]);
+}
+
 extern void __netif_schedule(struct Qdisc *q);
 
 static inline void netif_schedule_queue(struct netdev_queue *txq)
@@ -1973,6 +2026,7 @@ static inline void netdev_completed_queue(struct net_device *dev,
 
 static inline void netdev_tx_reset_queue(struct netdev_queue *q)
 {
+	clear_bit(__QUEUE_STATE_DELAYED, &q->state);
 #ifdef CONFIG_BQL
 	clear_bit(__QUEUE_STATE_STACK_XOFF, &q->state);
 	dql_reset(&q->dql);
@@ -2482,6 +2536,9 @@ static inline void netif_tx_unlock_bh(struct net_device *dev)
 	}						\
 }
 
+#define HARD_TX_TRYLOCK(dev, txq)			\
+	((dev->features & NETIF_F_LLTX) || __netif_tx_trylock(txq))
+
 static inline void netif_tx_disable(struct net_device *dev)
 {
 	unsigned int i;
diff --git a/net/core/dev.c b/net/core/dev.c
index 93af533..a72669a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2032,6 +2032,27 @@ int netif_get_num_default_rss_queues(void)
 }
 EXPORT_SYMBOL(netif_get_num_default_rss_queues);
 
+static inline void __netif_tx_redispatch_queue(struct netdev_queue *txq)
+{
+	struct softnet_data *sd;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sd = &__get_cpu_var(softnet_data);
+	txq->next_dispatch = NULL;
+	sd->dispatch_queue = txq;
+	sd->dispatch_queue_tailp = &txq->next_dispatch;
+	raise_softirq_irqoff(NET_TX_SOFTIRQ);
+	local_irq_restore(flags);
+}
+
+void __netif_tx_dispatch_queue(struct netdev_queue *txq)
+{
+	if (!test_and_set_bit(__QUEUE_STATE_DISPATCH, &txq->state))
+		__netif_tx_redispatch_queue(txq);
+}
+EXPORT_SYMBOL(__netif_tx_dispatch_queue);
+
 static inline void __netif_reschedule(struct Qdisc *q)
 {
 	struct softnet_data *sd;
@@ -3268,6 +3289,41 @@ static void net_tx_action(struct softirq_action *h)
 			}
 		}
 	}
+
+	if (sd->dispatch_queue) {
+		struct netdev_queue *head;
+
+		local_irq_disable();
+		head = sd->dispatch_queue;
+		sd->dispatch_queue = NULL;
+		sd->dispatch_queue_tailp = &sd->dispatch_queue;
+		local_irq_enable();
+
+		while (head) {
+			struct netdev_queue *txq = head;
+			struct net_device *dev = txq->dev;
+			const struct net_device_ops *ops = dev->netdev_ops;
+
+			head = head->next_dispatch;
+
+			if (!HARD_TX_TRYLOCK(dev, txq)) {
+				__netif_tx_redispatch_queue(txq);
+				continue;
+			}
+
+			smp_mb__before_clear_bit();
+			clear_bit(__QUEUE_STATE_DISPATCH, &txq->state);
+
+			if (txq->dispatch_pending &&
+			    !netif_tx_queue_delayed(txq)) {
+				int index = txq - &dev->_tx[0];
+
+				ops->ndo_complete_xmit(dev, index);
+			}
+
+			HARD_TX_UNLOCK(dev, txq);
+		}
+	}
 }
 
 #if (defined(CONFIG_BRIDGE) || defined(CONFIG_BRIDGE_MODULE)) && \
@@ -6485,6 +6541,15 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 		oldsd->output_queue = NULL;
 		oldsd->output_queue_tailp = &oldsd->output_queue;
 	}
+
+	/* Append delayed xmit queue from offline CPU */
+	if (oldsd->dispatch_queue) {
+		*sd->dispatch_queue_tailp = oldsd->dispatch_queue;
+		sd->dispatch_queue_tailp = oldsd->dispatch_queue_tailp;
+		oldsd->dispatch_queue = NULL;
+		oldsd->dispatch_queue_tailp = &oldsd->dispatch_queue;
+	}
+
 	/* Append NAPI poll list from offline CPU. */
 	if (!list_empty(&oldsd->poll_list)) {
 		list_splice_init(&oldsd->poll_list, &sd->poll_list);
@@ -6772,6 +6837,8 @@ static int __init net_dev_init(void)
 		INIT_LIST_HEAD(&sd->poll_list);
 		sd->output_queue = NULL;
 		sd->output_queue_tailp = &sd->output_queue;
+		sd->dispatch_queue = NULL;
+		sd->dispatch_queue_tailp = &sd->dispatch_queue;
 #ifdef CONFIG_RPS
 		sd->csd.func = rps_trigger_softirq;
 		sd->csd.info = sd;
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 42bb496..4f7eb58 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -997,11 +997,47 @@ static struct netdev_queue_attribute xps_cpus_attribute =
     __ATTR(xps_cpus, S_IRUGO | S_IWUSR, show_xps_map, store_xps_map);
 #endif /* CONFIG_XPS */
 
+static ssize_t show_dispatch_limit(struct netdev_queue *queue,
+				   struct netdev_queue_attribute *attribute,
+				   char *buf)
+{
+	unsigned int dispatch_limit;
+
+	spin_lock_irq(&queue->_xmit_lock);
+	dispatch_limit = queue->dispatch_limit;
+	spin_unlock_irq(&queue->_xmit_lock);
+
+	return sprintf(buf, "%u\n", dispatch_limit);
+}
+
+static ssize_t store_dispatch_limit(struct netdev_queue *queue,
+				    struct netdev_queue_attribute *attribute,
+				    const char *buf, size_t len)
+{
+	unsigned int dispatch_limit;
+	int err;
+
+	err = kstrtouint(buf, 10, &dispatch_limit);
+	if (err < 0)
+		return err;
+
+	spin_lock_irq(&queue->_xmit_lock);
+	queue->dispatch_limit = dispatch_limit;
+	spin_unlock_irq(&queue->_xmit_lock);
+
+	return len;
+}
+
+static struct netdev_queue_attribute dispatch_limit_attribute =
+	__ATTR(dispatch_limit, S_IRUGO | S_IWUSR,
+	       show_dispatch_limit, store_dispatch_limit);
+
 static struct attribute *netdev_queue_default_attrs[] = {
 	&queue_trans_timeout.attr,
 #ifdef CONFIG_XPS
 	&xps_cpus_attribute.attr,
 #endif
+	&dispatch_limit_attribute.attr,
 	NULL
 };
 

* [RFC PATCH 2/2] ixgbe: Add functionality for delaying the MMIO write for Tx
  2012-07-12  0:25 [RFC PATCH 0/2] Coalesce MMIO writes for transmits Alexander Duyck
  2012-07-12  0:26 ` [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching Alexander Duyck
@ 2012-07-12  0:26 ` Alexander Duyck
  2012-07-12 17:23 ` [RFC PATCH 0/2] Coalesce MMIO writes for transmits Stephen Hemminger
  2 siblings, 0 replies; 14+ messages in thread
From: Alexander Duyck @ 2012-07-12  0:26 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

This change allows ixgbe to use the new framework for delaying the MMIO
write in the transmit path.  With this change in place we see a
significant reduction in CPU utilization and an increase in overall
packets-per-second throughput for bulk traffic tests.  In addition, I
have not seen any increase in latency as a result of this patch.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   22 ++++++++++++++++++++--
 1 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9ec65ee..e9b71b8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -884,6 +884,8 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 		return true;
 	}
 
+	netif_tx_dispatch_queue(txring_txq(tx_ring));
+
 	netdev_tx_completed_queue(txring_txq(tx_ring),
 				  total_packets, total_bytes);
 
@@ -5825,6 +5827,22 @@ static void ixgbe_service_task(struct work_struct *work)
 	ixgbe_service_event_complete(adapter);
 }
 
+static void ixgbe_complete_xmit_frame(struct net_device *dev,
+				      unsigned int index)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	struct ixgbe_ring *tx_ring = adapter->tx_ring[index];
+
+	/* notify HW of packet */
+	writel(tx_ring->next_to_use, tx_ring->tail);
+
+	/*
+	 * we need this if more than one processor can write to our tail
+	 * at a time, it synchronizes IO on IA64/Altix systems
+	 */
+	mmiowb();
+}
+
 static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 		     struct ixgbe_tx_buffer *first,
 		     u8 *hdr_len)
@@ -6150,8 +6168,7 @@ static void ixgbe_tx_map(struct ixgbe_ring *tx_ring,
 
 	tx_ring->next_to_use = i;
 
-	/* notify HW of packet */
-	writel(i, tx_ring->tail);
+	netdev_complete_xmit(txring_txq(tx_ring));
 
 	return;
 dma_error:
@@ -6961,6 +6978,7 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
 	.ndo_start_xmit		= ixgbe_xmit_frame,
+	.ndo_complete_xmit	= ixgbe_complete_xmit_frame,
 #ifdef IXGBE_FCOE
 	.ndo_select_queue	= ixgbe_select_queue,
 #endif

* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
  2012-07-12  0:26 ` [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching Alexander Duyck
@ 2012-07-12  7:14   ` Eric Dumazet
  2012-07-12 15:39     ` Alexander Duyck
  2012-07-13  7:19   ` Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2012-07-12  7:14 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

On Wed, 2012-07-11 at 17:26 -0700, Alexander Duyck wrote:
> This change adds capabilities to the driver for batching the MMIO write
> involved with transmits.  Most of the logic is based off of the code for
> the qdisc scheduling.
> 
> What I did is break the transmit path into two parts.  We already had the
> ndo_start_xmit function which has been there all along.  The part I added
> was ndo_complete_xmit which is meant to handle notifying the hardware that
> frames are ready for delivery.
> 
> To control all of this I added a net sysfs value for the Tx queues called
> dispatch_limit.  When 0 it indicates that all frames will notify hardware
> immediately.  When 1 or more the netdev_complete_xmit call will queue up to
> that number of packets, and when the value is exceeded it will notify the
> hardware and reset the pending frame dispatch count.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---

The idea is good, but do we really need such a complex scheme?

Most of the transmits are done from __qdisc_run()

We could add logic in __qdisc_run()/qdisc_restart()

qdisc_run_end() would then have to call ndo_complete_xmit() to make
sure the MMIO is done.
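
Something along these lines, perhaps (a rough, untested sketch only; it
ignores locking of dispatch_pending and a NULL ndo_complete_xmit, and
reuses the fields this patch already adds):

	static inline void qdisc_run_end(struct Qdisc *qdisc)
	{
		struct netdev_queue *txq = qdisc->dev_queue;
		struct net_device *dev = qdisc_dev(qdisc);

		/* ... existing "clear the RUNNING state" body ... */

		/* flush any coalesced MMIO tail write before we leave
		 * the transmit path
		 */
		if (txq->dispatch_pending) {
			txq->dispatch_pending = 0;
			dev->netdev_ops->ndo_complete_xmit(dev,
						txq - &dev->_tx[0]);
		}
	}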

* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
  2012-07-12  7:14   ` Eric Dumazet
@ 2012-07-12 15:39     ` Alexander Duyck
  2012-07-13  7:38       ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2012-07-12 15:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

On 07/12/2012 12:14 AM, Eric Dumazet wrote:
> On Wed, 2012-07-11 at 17:26 -0700, Alexander Duyck wrote:
>> This change adds capabilities to the driver for batching the MMIO write
>> involved with transmits.  Most of the logic is based off of the code for
>> the qdisc scheduling.
>>
>> What I did is break the transmit path into two parts.  We already had the
>> ndo_start_xmit function which has been there all along.  The part I added
>> was ndo_complete_xmit which is meant to handle notifying the hardware that
>> frames are ready for delivery.
>>
>> To control all of this I added a net sysfs value for the Tx queues called
>> dispatch_limit.  When 0 it indicates that all frames will notify hardware
>> immediately.  When 1 or more the netdev_complete_xmit call will queue up to
>> that number of packets, and when the value is exceeded it will notify the
>> hardware and reset the pending frame dispatch count.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> ---
> The idea is good, but do we really need so complex schem ?
>
> Most of the transmits are done from __qdisc_run()
>
> We could add logic in __qdisc_run()/qdisc_restart()
>
> qdisc_run_end() would then have to call ndo_complete_xmit() to make
> sure the MMIO is done.

The problem is that in both of the cases where I have seen the issue,
the qdisc is actually empty.

In the case of pktgen the qdisc layer is not used at all.  It just
calls ndo_start_xmit directly.

In the standard networking case we never fill the qdisc because the MMIO
write stalls the entire CPU, so the application never gets a chance to
get ahead of the hardware.  From what I can tell, the only case in which
the qdisc_run solution would work is if ndo_start_xmit were called on a
different CPU from the application that is doing the transmitting.

Thanks,

Alex

* Re: [RFC PATCH 0/2] Coalesce MMIO writes for transmits
  2012-07-12  0:25 [RFC PATCH 0/2] Coalesce MMIO writes for transmits Alexander Duyck
  2012-07-12  0:26 ` [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching Alexander Duyck
  2012-07-12  0:26 ` [RFC PATCH 2/2] ixgbe: Add functionality for delaying the MMIO write for Tx Alexander Duyck
@ 2012-07-12 17:23 ` Stephen Hemminger
  2012-07-12 19:01   ` Alexander Duyck
  2 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2012-07-12 17:23 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

On Wed, 11 Jul 2012 17:25:58 -0700
Alexander Duyck <alexander.h.duyck@intel.com> wrote:

> This patch set is meant to address recent issues I found with ixgbe
> performance being bound by Tx tail writes.  With these changes in place
> and the dispatch_limit set to 1 or more I see a significant increase in
> performance.
> 
> In the case of one of my systems I saw the routing rate for 7 queues jump
> from 10.5 to 11.7Mpps.  The overall increase I have seen on most systems is
> something on the order of about 15%.  In the case of pktgen I have also
> seen a noticeable increase as the previous limit for transmits was
> ~12.5Mpps, but with this patch set in place and the dispatch_limit enabled
> the value increases to ~14.2Mpps.
> 
> I expected there to be an increase in latency, however so far I have not
> ran into that.  I have tried running NPtcp tests for latency and seen no
> difference in the coalesced and non-coalesced transaction times.  I welcome
> any suggestions for tests I might run that might expose any latency issues
> as a result of this patch.
> 
> ---
> 
> Alexander Duyck (2):
>       ixgbe: Add functionality for delaying the MMIO write for Tx
>       net: Add new network device function to allow for MMIO batching
> 
> 
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   22 +++++++-
>  include/linux/netdevice.h                     |   57 +++++++++++++++++++++
>  net/core/dev.c                                |   67 +++++++++++++++++++++++++
>  net/core/net-sysfs.c                          |   36 +++++++++++++
>  4 files changed, 180 insertions(+), 2 deletions(-)
> 

This is a good idea.  I was thinking of adding a multi-skb operation to
net_device_ops to allow this, something like ndo_start_xmit_pkts, but
the problem is how to deal with the boundary case where there are only a
limited number of slots left in the ring.  Using a "that's all folks"
operation seems better.
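
For what it's worth, a purely illustrative signature for that kind of op
(not proposed code, just to make the idea concrete) might be:

	netdev_tx_t	(*ndo_start_xmit_pkts)(struct sk_buff **skbs,
					       unsigned int count,
					       struct net_device *dev);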

* Re: [RFC PATCH 0/2] Coalesce MMIO writes for transmits
  2012-07-12 17:23 ` [RFC PATCH 0/2] Coalesce MMIO writes for transmits Stephen Hemminger
@ 2012-07-12 19:01   ` Alexander Duyck
  0 siblings, 0 replies; 14+ messages in thread
From: Alexander Duyck @ 2012-07-12 19:01 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

On 07/12/2012 10:23 AM, Stephen Hemminger wrote:
> On Wed, 11 Jul 2012 17:25:58 -0700
> Alexander Duyck <alexander.h.duyck@intel.com> wrote:
>
>> This patch set is meant to address recent issues I found with ixgbe
>> performance being bound by Tx tail writes.  With these changes in place
>> and the dispatch_limit set to 1 or more I see a significant increase in
>> performance.
>>
>> In the case of one of my systems I saw the routing rate for 7 queues jump
>> from 10.5 to 11.7Mpps.  The overall increase I have seen on most systems is
>> something on the order of about 15%.  In the case of pktgen I have also
>> seen a noticeable increase as the previous limit for transmits was
>> ~12.5Mpps, but with this patch set in place and the dispatch_limit enabled
>> the value increases to ~14.2Mpps.
>>
>> I expected there to be an increase in latency, however so far I have not
>> ran into that.  I have tried running NPtcp tests for latency and seen no
>> difference in the coalesced and non-coalesced transaction times.  I welcome
>> any suggestions for tests I might run that might expose any latency issues
>> as a result of this patch.
>>
>> ---
>>
>> Alexander Duyck (2):
>>       ixgbe: Add functionality for delaying the MMIO write for Tx
>>       net: Add new network device function to allow for MMIO batching
>>
>>
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   22 +++++++-
>>  include/linux/netdevice.h                     |   57 +++++++++++++++++++++
>>  net/core/dev.c                                |   67 +++++++++++++++++++++++++
>>  net/core/net-sysfs.c                          |   36 +++++++++++++
>>  4 files changed, 180 insertions(+), 2 deletions(-)
>>
> This is a good idea. I was thinking of adding a multi-skb operation
> to netdevice_ops to allow this. Something like ndo_start_xmit_pkts but
> the problem is how to deal with the boundary case where there is only
> a limited number of slots in the ring.  Using a "that's all folks"
> operation seems better.
I had considered a multi-skb operation originally, but the problem was
that in my case I would have had to come up with a more complex buffering
mechanism to generate a stream of skbs before handing them off to the
device.  By letting the transmit path proceed normally I shouldn't have
any effect on things like the byte queue limits for the transmit queues
and such.

The weird bit is how this issue was showing up.  I don't know if you
recall my presentation from Plumbers last year, but one of the things I
had brought up was the qdisc spinlock being an issue.  However, it was
actually this MMIO write that was causing the problem: it was posting a
write to non-coherent memory, and the spinlock operation was then
stalling behind that write and couldn't complete until the write had
completed.  With this change in place and the dispatch_limit set to
something like 31 I see the CPU utilization for spinlocks drop from 15%
(90% sch_direct_xmit / 10% dev_queue_xmit) to 5% (66% sch_direct_xmit /
33% dev_queue_xmit).  It makes me wonder what other hotspots we have in
the drivers that can be resolved by avoiding MMIO followed by locked
operations.
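
For reference, the sequence that was hurting looks roughly like this
(simplified from sch_direct_xmit(); the annotations are mine):

	HARD_TX_LOCK(dev, txq, smp_processor_id());
	dev_hard_start_xmit(skb, dev, txq);	/* ixgbe_tx_map() used to end
						 * with writel(i, tx_ring->tail)
						 */
	HARD_TX_UNLOCK(dev, txq);
	spin_lock(root_lock);			/* qdisc lock: the locked op sits
						 * behind the uncached MMIO write
						 */

With the series applied the writel() is usually deferred by
netdev_complete_xmit(), so the qdisc lock is retaken without an MMIO
write still outstanding in front of it.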

Thanks,

Alex

* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
  2012-07-12  0:26 ` [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching Alexander Duyck
  2012-07-12  7:14   ` Eric Dumazet
@ 2012-07-13  7:19   ` Eric Dumazet
  2012-07-13 15:49     ` Alexander Duyck
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2012-07-13  7:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

On Wed, 2012-07-11 at 17:26 -0700, Alexander Duyck wrote:

> +static inline void netdev_complete_xmit(struct netdev_queue *txq)
> +{
> +	struct net_device *dev = txq->dev;
> +	const struct net_device_ops *ops = dev->netdev_ops;
> +
> +	if (txq->dispatch_pending < txq->dispatch_limit) {
> +		if (netif_tx_queue_delayed(txq)) {
> +			txq->dispatch_pending++;
> +			return;
> +		}
> +
> +		/* start of delayed write sequence */
> +		netif_tx_delay_queue(txq);

	I dont understand this part. Isnt a return missing here ?

> +	}
> +
> +	txq->dispatch_pending = 0;
> +
> +	ops->ndo_complete_xmit(dev, txq - &dev->_tx[0]);
> +}
> +

* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
  2012-07-12 15:39     ` Alexander Duyck
@ 2012-07-13  7:38       ` Eric Dumazet
  2012-07-13 15:37         ` Alexander Duyck
  2012-07-13 15:50         ` Stephen Hemminger
  0 siblings, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2012-07-13  7:38 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

On Thu, 2012-07-12 at 08:39 -0700, Alexander Duyck wrote:

> The problem is in both of the cases where I have seen the issue the
> qdisc is actually empty.
> 

You mean a router workload, with links of the same bandwidth
(so BQL doesn't trigger).

Frankly, what percentage of Linux-powered machines act as high-perf
routers?

> In the case of pktgen it does not use the qdisc layer at all.  It just
> directly calls ndo_start_xmit.

pktgen is in-kernel; adding a complete() call to it is certainly OK if
we can avoid kernel bloat.

I mean, pktgen represents less than 0.000001% of real workloads.

> 
> In the standard networking case we never fill the qdisc because the MMIO
> write stalls the entire CPU so the application never gets a chance to
> get ahead of the hardware.  From what I can tell the only case in which
> the qdisc_run solution would work is if the ndo_start_xmit was called on
> a different CPU from the application that is doing the transmitting.

Hey, I can tell you that the qdisc is not empty on many workloads, but
BQL and TSO mean we only send one or two packets per qdisc run.

I understand this MMIO batching helps router workloads, or workloads
using many small packets.

But on other workloads this adds a significant latency source
(NET_TX_SOFTIRQ).

It would be good to instrument the extra delay on a single UDP send.

(Entering the do_softirq() path is not just a few instructions...)

* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
  2012-07-13  7:38       ` Eric Dumazet
@ 2012-07-13 15:37         ` Alexander Duyck
  2012-07-13 15:50         ` Stephen Hemminger
  1 sibling, 0 replies; 14+ messages in thread
From: Alexander Duyck @ 2012-07-13 15:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

On 07/13/2012 12:38 AM, Eric Dumazet wrote:
> On Thu, 2012-07-12 at 08:39 -0700, Alexander Duyck wrote:
>
>> The problem is in both of the cases where I have seen the issue the
>> qdisc is actually empty.
>>
> You mean a router workload, with links of same bandwidth.
> (BQL doesnt trigger)
>
> Frankly what percentage of linux powered machines act as high perf
> routers ?
Actually, I was seeing this issue with the sending application on the
same CPU as the Tx cleanup.  The problem was that the CPU would stall
and consume cycles instead of putting work into placing more packets on
the queue.

>> In the case of pktgen it does not use the qdisc layer at all.  It just
>> directly calls ndo_start_xmit.
> pktgen is in kernel, adding a complete() call in it is certainly ok,
> if we can avoid kernel bloat.
>
> I mean, pktgen represents less than 0.000001 % of real workloads.
I realize that, but it does provide a valid means of stress testing an
interface and demonstrating that the MMIO writes are causing significant
stalls and bus utilization.

>> In the standard networking case we never fill the qdisc because the MMIO
>> write stalls the entire CPU so the application never gets a chance to
>> get ahead of the hardware.  From what I can tell the only case in which
>> the qdisc_run solution would work is if the ndo_start_xmit was called on
>> a different CPU from the application that is doing the transmitting.
> Hey, I can tell that qdisc is not empty on many workloads.
> But BQL and TSO mean we only send one or two packets per qdisc run.
>
> I understand this MMIO batching helps routers workloads, or workloads
> using many small packets.
>
> But on other workloads, this adds a significant latency source
> (NET_TX_SOFTIRQ)
>
> It would be good to instrument the extra delay on a single UDP send.
>
> (entering do_softirq() path is not a few instructions...)
These kinds of issues are one of the reasons why this feature is
disabled by default.  You have to explicitly enable it by setting the
dispatch_limit to something other than 0.

I suppose I could just make the flush a part of the Tx cleanup itself,
since I am only doing a trylock instead of waiting to take the full
lock.  I am open to any other suggestions for alternatives to
NET_TX_SOFTIRQ.
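
Roughly what I have in mind, as an untested sketch against this RFC
(locking and the stopped-queue case are glossed over):

	/* in ixgbe_clean_tx_irq(): flush directly here instead of
	 * kicking NET_TX_SOFTIRQ via netif_tx_dispatch_queue()
	 */
	struct netdev_queue *txq = txring_txq(tx_ring);
	struct net_device *netdev = tx_ring->netdev;

	if (netif_tx_queue_delayed(txq) && HARD_TX_TRYLOCK(netdev, txq)) {
		clear_bit(__QUEUE_STATE_DELAYED, &txq->state);
		if (txq->dispatch_pending) {
			txq->dispatch_pending = 0;
			ixgbe_complete_xmit_frame(netdev,
						  tx_ring->queue_index);
		}
		HARD_TX_UNLOCK(netdev, txq);
	}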

Thanks,

Alex

* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
  2012-07-13  7:19   ` Eric Dumazet
@ 2012-07-13 15:49     ` Alexander Duyck
  2012-07-13 16:18       ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2012-07-13 15:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

On 07/13/2012 12:19 AM, Eric Dumazet wrote:
> On Wed, 2012-07-11 at 17:26 -0700, Alexander Duyck wrote:
>
>> +static inline void netdev_complete_xmit(struct netdev_queue *txq)
>> +{
>> +	struct net_device *dev = txq->dev;
>> +	const struct net_device_ops *ops = dev->netdev_ops;
>> +
>> +	if (txq->dispatch_pending < txq->dispatch_limit) {
>> +		if (netif_tx_queue_delayed(txq)) {
>> +			txq->dispatch_pending++;
>> +			return;
>> +		}
>> +
>> +		/* start of delayed write sequence */
>> +		netif_tx_delay_queue(txq);
> 	I dont understand this part. Isnt a return missing here ?
>
>> +	}
>> +
>> +	txq->dispatch_pending = 0;
>> +
>> +	ops->ndo_complete_xmit(dev, txq - &dev->_tx[0]);
>> +}
>> +
>
There is intentionally no return there.  The idea is that the first
packet always gets through; it is what will later trigger the Tx
completion interrupt that forces the final flush if one is needed.  That
is one of the ways I am trying to reduce the latency of things such as
TSO, which will only be using one or two frames per interrupt anyway.
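
To spell it out, with dispatch_limit = 2 and an idle queue the sequence
looks like this (my annotation of the code above, no new logic):

	frame 1: dispatch_pending (0) < limit (2), queue not DELAYED
	         -> mark the queue DELAYED, fall through, pending = 0,
	            ndo_complete_xmit() writes the tail immediately
	frame 2: pending (0) < limit, queue is DELAYED
	         -> pending = 1, return without touching MMIO
	frame 3: pending (1) < limit, still DELAYED
	         -> pending = 2, return without touching MMIO
	frame 4: pending (2) is no longer < limit
	         -> pending = 0, ndo_complete_xmit() writes the tail,
	            covering frames 2-4 with a single MMIO write

The Tx clean-up triggered by frame 1's write then calls
netif_tx_dispatch_queue(), which clears DELAYED and schedules
NET_TX_SOFTIRQ so anything still pending is flushed even if no further
transmits arrive.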

Thanks,

Alex

* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
  2012-07-13  7:38       ` Eric Dumazet
  2012-07-13 15:37         ` Alexander Duyck
@ 2012-07-13 15:50         ` Stephen Hemminger
  2012-07-13 16:23           ` Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2012-07-13 15:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher, edumazet,
	bhutchings, therbert, alexander.duyck

On Fri, 13 Jul 2012 09:38:49 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Frankly what percentage of linux powered machines act as high perf
> routers ?

More than you think; every Linux machine acting as a hypervisor (Xen or
KVM) is also doing this.

* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
  2012-07-13 15:49     ` Alexander Duyck
@ 2012-07-13 16:18       ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2012-07-13 16:18 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

On Fri, 2012-07-13 at 08:49 -0700, Alexander Duyck wrote:
> On 07/13/2012 12:19 AM, Eric Dumazet wrote:
> > On Wed, 2012-07-11 at 17:26 -0700, Alexander Duyck wrote:
> >
> >> +static inline void netdev_complete_xmit(struct netdev_queue *txq)
> >> +{
> >> +	struct net_device *dev = txq->dev;
> >> +	const struct net_device_ops *ops = dev->netdev_ops;
> >> +
> >> +	if (txq->dispatch_pending < txq->dispatch_limit) {
> >> +		if (netif_tx_queue_delayed(txq)) {
> >> +			txq->dispatch_pending++;
> >> +			return;
> >> +		}
> >> +
> >> +		/* start of delayed write sequence */
> >> +		netif_tx_delay_queue(txq);
> > 	I dont understand this part. Isnt a return missing here ?
> >
> >> +	}
> >> +
> >> +	txq->dispatch_pending = 0;
> >> +
> >> +	ops->ndo_complete_xmit(dev, txq - &dev->_tx[0]);
> >> +}
> >> +
> >
> There is intentionally no return there.  The idea is that the first
> packet always gets through.  It is what is going to later force the
> interrupt that will force the final flush if it is needed.  That is one
> of the ways I am helping to reduce the latency of things such as TSO
> which will only be using one or two frames per interrupt anyway.


So for a single packet, we only trigger the TX softirq to do nothing at
all, or worse, ndo_complete_xmit() is done twice?

It looks like you need to add comments, because if I don't understand
this code, who will?

* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
  2012-07-13 15:50         ` Stephen Hemminger
@ 2012-07-13 16:23           ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2012-07-13 16:23 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher, edumazet,
	bhutchings, therbert, alexander.duyck

On Fri, 2012-07-13 at 08:50 -0700, Stephen Hemminger wrote:
> On Fri, 13 Jul 2012 09:38:49 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > Frankly what percentage of linux powered machines act as high perf
> > routers ?
> 
> More than you think, every Linux machine acting as hypervisor (Xen and KVM)
> is also doing this.


By "high perf router" I meant: the ability to route 10 Mpps without the
help of TSO.

If a hypervisor is unable to use TSO, I expect very bad performance
anyway.
