* [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
@ 2012-08-08 12:26 Claudiu Manoil
  2012-08-08 12:26 ` [RFC net-next 1/4] gianfar: Remove redundant programming of [rt]xic registers Claudiu Manoil
  2012-08-08 16:24 ` [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Paul Gortmaker
  0 siblings, 2 replies; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-08 12:26 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Paul Gortmaker

Hi all,
This set of patches basically splits the existing napi poll routine into
two separate napi functions, one for Rx processing (triggered by frame
receive interrupts only) and one for the Tx confirmation path processing
(triggered by Tx confirmation interrupts only). The underlying polling
algorithm remains much the same.

Important throughput improvements have been noted on low power boards with
this set of changes.
For instance, the following netperf test:
netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
yields a throughput gain from an oscillating ~500-~700 Mbps to a steady ~940 Mbps
(if the Rx/Tx paths are processed on different cores), with no increase in CPU utilization,
on a p1020rdb - 2 core machine featuring etsec2.0 (Multi-Queue Multi-Group
driver mode).

Also, this change, which should balance Rx and Tx processing, proves to
be effective against Rx busy interrupt occurrences.

Thanks for your review.
Claudiu


Claudiu Manoil (4):
  gianfar: Remove redundant programming of [rt]xic registers
  gianfar: Clear ievent from interrupt handler for [RT]x int
  gianfar: Separate out the Rx and Tx coalescing functions
  gianfar: Use separate NAPIs for Tx and Rx processing

 drivers/net/ethernet/freescale/gianfar.c |  220 +++++++++++++++++++++--------
 drivers/net/ethernet/freescale/gianfar.h |   16 ++-
 2 files changed, 171 insertions(+), 65 deletions(-)


* [RFC net-next 1/4] gianfar: Remove redundant programming of [rt]xic registers
  2012-08-08 12:26 [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Claudiu Manoil
@ 2012-08-08 12:26 ` Claudiu Manoil
  2012-08-08 12:26   ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Claudiu Manoil
  2012-08-08 16:24 ` [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Paul Gortmaker
  1 sibling, 1 reply; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-08 12:26 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Paul Gortmaker, Claudiu Manoil

In Multi Q Multi Group (MQ_MG_MODE) mode, the Rx/Tx coalescing registers [rt]xic
are aliased with the [rt]xic0 registers (the coalescing setting registers for Q0).
This change avoids programming the coalescing registers for the Rx/Tx hw Q0 twice
in a row. An inconsistent use of "unlikely" is also replaced in the process.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c |   24 ++++++++++++------------
 1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 4605f72..e9feeb9 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -1799,20 +1799,9 @@ void gfar_configure_coalescing(struct gfar_private *priv,
 {
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
 	u32 __iomem *baddr;
-	int i = 0;
-
-	/* Backward compatible case ---- even if we enable
-	 * multiple queues, there's only single reg to program
-	 */
-	gfar_write(&regs->txic, 0);
-	if (likely(priv->tx_queue[0]->txcoalescing))
-		gfar_write(&regs->txic, priv->tx_queue[0]->txic);
-
-	gfar_write(&regs->rxic, 0);
-	if (unlikely(priv->rx_queue[0]->rxcoalescing))
-		gfar_write(&regs->rxic, priv->rx_queue[0]->rxic);
 
 	if (priv->mode == MQ_MG_MODE) {
+		int i;
 		baddr = &regs->txic0;
 		for_each_set_bit(i, &tx_mask, priv->num_tx_queues) {
 			gfar_write(baddr + i, 0);
@@ -1826,6 +1815,17 @@ void gfar_configure_coalescing(struct gfar_private *priv,
 			if (likely(priv->rx_queue[i]->rxcoalescing))
 				gfar_write(baddr + i, priv->rx_queue[i]->rxic);
 		}
+	} else {
+		/* Backward compatible case ---- even if we enable
+		 * multiple queues, there's only single reg to program
+		 */
+		gfar_write(&regs->txic, 0);
+		if (likely(priv->tx_queue[0]->txcoalescing))
+			gfar_write(&regs->txic, priv->tx_queue[0]->txic);
+
+		gfar_write(&regs->rxic, 0);
+		if (likely(priv->rx_queue[0]->rxcoalescing))
+			gfar_write(&regs->rxic, priv->rx_queue[0]->rxic);
 	}
 }
 
-- 
1.6.6


* [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int
  2012-08-08 12:26 ` [RFC net-next 1/4] gianfar: Remove redundant programming of [rt]xic registers Claudiu Manoil
@ 2012-08-08 12:26   ` Claudiu Manoil
  2012-08-08 12:26     ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil
  2012-08-08 16:11     ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Paul Gortmaker
  0 siblings, 2 replies; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-08 12:26 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Paul Gortmaker, Claudiu Manoil

It's the interrupt handler's job to clear ievent for the Tx/Rx paths, as soon
as the corresponding interrupt sources have been masked.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c |   16 ++++++----------
 1 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index e9feeb9..ddd350a 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -2568,12 +2568,13 @@ static void gfar_schedule_cleanup(struct gfar_priv_grp *gfargrp)
 	if (napi_schedule_prep(&gfargrp->napi)) {
 		gfar_write(&gfargrp->regs->imask, IMASK_RTX_DISABLED);
 		__napi_schedule(&gfargrp->napi);
-	} else {
-		/* Clear IEVENT, so interrupts aren't called again
-		 * because of the packets that have already arrived.
-		 */
-		gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK);
 	}
+
+	/* Clear IEVENT, so interrupts aren't called again
+	 * because of the packets that have already arrived.
+	 */
+	gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK);
+
 	spin_unlock_irqrestore(&gfargrp->grplock, flags);
 
 }
@@ -2837,11 +2838,6 @@ static int gfar_poll(struct napi_struct *napi, int budget)
 	num_queues = gfargrp->num_rx_queues;
 	budget_per_queue = budget/num_queues;
 
-	/* Clear IEVENT, so interrupts aren't called again
-	 * because of the packets that have already arrived
-	 */
-	gfar_write(&regs->ievent, IEVENT_RTX_MASK);
-
 	while (num_queues && left_over_budget) {
 		budget_per_queue = left_over_budget/num_queues;
 		left_over_budget = 0;
-- 
1.6.6


* [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions
  2012-08-08 12:26   ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Claudiu Manoil
@ 2012-08-08 12:26     ` Claudiu Manoil
  2012-08-08 12:26       ` [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing Claudiu Manoil
                         ` (2 more replies)
  2012-08-08 16:11     ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Paul Gortmaker
  1 sibling, 3 replies; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-08 12:26 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Paul Gortmaker, Claudiu Manoil

Split the coalescing programming support by Rx and Tx h/w queues, in order to
introduce a separate NAPI for the Tx confirmation path (next patch). This way,
the Rx processing path will handle the coalescing settings for the Rx queues
only, while the Tx confirmation processing path will handle the Tx queues.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c |   36 +++++++++++++++++++++++------
 1 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index ddd350a..919acb3 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -1794,8 +1794,8 @@ void gfar_start(struct net_device *dev)
 	dev->trans_start = jiffies; /* prevent tx timeout */
 }
 
-void gfar_configure_coalescing(struct gfar_private *priv,
-			       unsigned long tx_mask, unsigned long rx_mask)
+static inline void gfar_configure_tx_coalescing(struct gfar_private *priv,
+						unsigned long mask)
 {
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
 	u32 __iomem *baddr;
@@ -1803,14 +1803,31 @@ void gfar_configure_coalescing(struct gfar_private *priv,
 	if (priv->mode == MQ_MG_MODE) {
 		int i;
 		baddr = &regs->txic0;
-		for_each_set_bit(i, &tx_mask, priv->num_tx_queues) {
+		for_each_set_bit(i, &mask, priv->num_tx_queues) {
 			gfar_write(baddr + i, 0);
 			if (likely(priv->tx_queue[i]->txcoalescing))
 				gfar_write(baddr + i, priv->tx_queue[i]->txic);
 		}
+	} else {
+		/* Backward compatible case ---- even if we enable
+		 * multiple queues, there's only single reg to program
+		 */
+		gfar_write(&regs->txic, 0);
+		if (likely(priv->tx_queue[0]->txcoalescing))
+			gfar_write(&regs->txic, priv->tx_queue[0]->txic);
+	}
+}
+
+static inline void gfar_configure_rx_coalescing(struct gfar_private *priv,
+						unsigned long mask)
+{
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 __iomem *baddr;
 
+	if (priv->mode == MQ_MG_MODE) {
+		int i;
 		baddr = &regs->rxic0;
-		for_each_set_bit(i, &rx_mask, priv->num_rx_queues) {
+		for_each_set_bit(i, &mask, priv->num_rx_queues) {
 			gfar_write(baddr + i, 0);
 			if (likely(priv->rx_queue[i]->rxcoalescing))
 				gfar_write(baddr + i, priv->rx_queue[i]->rxic);
@@ -1819,16 +1836,19 @@ void gfar_configure_coalescing(struct gfar_private *priv,
 		/* Backward compatible case ---- even if we enable
 		 * multiple queues, there's only single reg to program
 		 */
-		gfar_write(&regs->txic, 0);
-		if (likely(priv->tx_queue[0]->txcoalescing))
-			gfar_write(&regs->txic, priv->tx_queue[0]->txic);
-
 		gfar_write(&regs->rxic, 0);
 		if (likely(priv->rx_queue[0]->rxcoalescing))
 			gfar_write(&regs->rxic, priv->rx_queue[0]->rxic);
 	}
 }
 
+void gfar_configure_coalescing(struct gfar_private *priv,
+			       unsigned long tx_mask, unsigned long rx_mask)
+{
+	gfar_configure_tx_coalescing(priv, tx_mask);
+	gfar_configure_rx_coalescing(priv, rx_mask);
+}
+
 static int register_grp_irqs(struct gfar_priv_grp *grp)
 {
 	struct gfar_private *priv = grp->priv;
-- 
1.6.6


* [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing
  2012-08-08 12:26     ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil
@ 2012-08-08 12:26       ` Claudiu Manoil
  2012-08-14  0:51         ` Paul Gortmaker
  2012-08-08 15:44       ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Paul Gortmaker
  2012-08-15  1:29       ` Paul Gortmaker
  2 siblings, 1 reply; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-08 12:26 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Paul Gortmaker, Pankaj Chauhan, Claudiu Manoil

Add a separate NAPI poll routine for Tx cleanup, to be triggered by Tx
confirmation interrupts only. The existing poll function is modified to handle
only the Rx path processing. This allows parallel processing of the Rx and Tx
confirmation paths on an SMP machine (2 cores).
The split also results in simpler/cleaner NAPI poll function implementations,
where each processing path has its own budget, improving the fairness between
the processing paths at the same time.

Signed-off-by: Pankaj Chauhan <pankaj.chauhan@freescale.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c |  154 +++++++++++++++++++++++-------
 drivers/net/ethernet/freescale/gianfar.h |   16 +++-
 2 files changed, 130 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 919acb3..2774961 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -128,12 +128,14 @@ static void free_skb_resources(struct gfar_private *priv);
 static void gfar_set_multi(struct net_device *dev);
 static void gfar_set_hash_for_addr(struct net_device *dev, u8 *addr);
 static void gfar_configure_serdes(struct net_device *dev);
-static int gfar_poll(struct napi_struct *napi, int budget);
+static int gfar_poll_rx(struct napi_struct *napi, int budget);
+static int gfar_poll_tx(struct napi_struct *napi, int budget);
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void gfar_netpoll(struct net_device *dev);
 #endif
 int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit);
-static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue);
+static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue,
+			      int tx_work_limit);
 static int gfar_process_frame(struct net_device *dev, struct sk_buff *skb,
 			      int amount_pull, struct napi_struct *napi);
 void gfar_halt(struct net_device *dev);
@@ -543,16 +545,20 @@ static void disable_napi(struct gfar_private *priv)
 {
 	int i;
 
-	for (i = 0; i < priv->num_grps; i++)
-		napi_disable(&priv->gfargrp[i].napi);
+	for (i = 0; i < priv->num_grps; i++) {
+		napi_disable(&priv->gfargrp[i].napi_rx);
+		napi_disable(&priv->gfargrp[i].napi_tx);
+	}
 }
 
 static void enable_napi(struct gfar_private *priv)
 {
 	int i;
 
-	for (i = 0; i < priv->num_grps; i++)
-		napi_enable(&priv->gfargrp[i].napi);
+	for (i = 0; i < priv->num_grps; i++) {
+		napi_enable(&priv->gfargrp[i].napi_rx);
+		napi_enable(&priv->gfargrp[i].napi_tx);
+	}
 }
 
 static int gfar_parse_group(struct device_node *np,
@@ -1028,9 +1034,12 @@ static int gfar_probe(struct platform_device *ofdev)
 	dev->ethtool_ops = &gfar_ethtool_ops;
 
 	/* Register for napi ...We are registering NAPI for each grp */
-	for (i = 0; i < priv->num_grps; i++)
-		netif_napi_add(dev, &priv->gfargrp[i].napi, gfar_poll,
-			       GFAR_DEV_WEIGHT);
+	for (i = 0; i < priv->num_grps; i++) {
+		netif_napi_add(dev, &priv->gfargrp[i].napi_rx, gfar_poll_rx,
+			       GFAR_DEV_RX_WEIGHT);
+		netif_napi_add(dev, &priv->gfargrp[i].napi_tx, gfar_poll_tx,
+			       GFAR_DEV_TX_WEIGHT);
+	}
 
 	if (priv->device_flags & FSL_GIANFAR_DEV_HAS_CSUM) {
 		dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG |
@@ -2465,7 +2474,8 @@ static void gfar_align_skb(struct sk_buff *skb)
 }
 
 /* Interrupt Handler for Transmit complete */
-static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
+static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue,
+			      int tx_work_limit)
 {
 	struct net_device *dev = tx_queue->dev;
 	struct netdev_queue *txq;
@@ -2490,7 +2500,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 	bdp = tx_queue->dirty_tx;
 	skb_dirtytx = tx_queue->skb_dirtytx;
 
-	while ((skb = tx_queue->tx_skbuff[skb_dirtytx])) {
+	while ((skb = tx_queue->tx_skbuff[skb_dirtytx]) && tx_work_limit--) {
 		unsigned long flags;
 
 		frags = skb_shinfo(skb)->nr_frags;
@@ -2580,29 +2590,50 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 	return howmany;
 }
 
-static void gfar_schedule_cleanup(struct gfar_priv_grp *gfargrp)
+static void gfar_schedule_rx_cleanup(struct gfar_priv_grp *gfargrp)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&gfargrp->grplock, flags);
-	if (napi_schedule_prep(&gfargrp->napi)) {
-		gfar_write(&gfargrp->regs->imask, IMASK_RTX_DISABLED);
-		__napi_schedule(&gfargrp->napi);
+	if (napi_schedule_prep(&gfargrp->napi_rx)) {
+		u32 imask;
+		spin_lock_irqsave(&gfargrp->grplock, flags);
+		imask = gfar_read(&gfargrp->regs->imask);
+		imask &= ~(IMASK_RX_DEFAULT);
+		gfar_write(&gfargrp->regs->imask, imask);
+		__napi_schedule(&gfargrp->napi_rx);
+		spin_unlock_irqrestore(&gfargrp->grplock, flags);
 	}
 
 	/* Clear IEVENT, so interrupts aren't called again
 	 * because of the packets that have already arrived.
 	 */
-	gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK);
+	gfar_write(&gfargrp->regs->ievent, IEVENT_RX_MASK);
+}
 
-	spin_unlock_irqrestore(&gfargrp->grplock, flags);
+static void gfar_schedule_tx_cleanup(struct gfar_priv_grp *gfargrp)
+{
+	unsigned long flags;
+
+	if (napi_schedule_prep(&gfargrp->napi_tx)) {
+		u32 imask;
+		spin_lock_irqsave(&gfargrp->grplock, flags);
+		imask = gfar_read(&gfargrp->regs->imask);
+		imask &= ~(IMASK_TX_DEFAULT);
+		gfar_write(&gfargrp->regs->imask, imask);
+		__napi_schedule(&gfargrp->napi_tx);
+		spin_unlock_irqrestore(&gfargrp->grplock, flags);
+	}
 
+	/* Clear IEVENT, so interrupts aren't called again
+	 * because of the packets that have already arrived.
+	 */
+	gfar_write(&gfargrp->regs->ievent, IEVENT_TX_MASK);
 }
 
 /* Interrupt Handler for Transmit complete */
 static irqreturn_t gfar_transmit(int irq, void *grp_id)
 {
-	gfar_schedule_cleanup((struct gfar_priv_grp *)grp_id);
+	gfar_schedule_tx_cleanup((struct gfar_priv_grp *)grp_id);
 	return IRQ_HANDLED;
 }
 
@@ -2683,7 +2714,7 @@ static inline void count_errors(unsigned short status, struct net_device *dev)
 
 irqreturn_t gfar_receive(int irq, void *grp_id)
 {
-	gfar_schedule_cleanup((struct gfar_priv_grp *)grp_id);
+	gfar_schedule_rx_cleanup((struct gfar_priv_grp *)grp_id);
 	return IRQ_HANDLED;
 }
 
@@ -2813,7 +2844,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
 				rx_queue->stats.rx_bytes += pkt_len;
 				skb_record_rx_queue(skb, rx_queue->qindex);
 				gfar_process_frame(dev, skb, amount_pull,
-						   &rx_queue->grp->napi);
+						   &rx_queue->grp->napi_rx);
 
 			} else {
 				netif_warn(priv, rx_err, dev, "Missing skb!\n");
@@ -2842,21 +2873,19 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
 	return howmany;
 }
 
-static int gfar_poll(struct napi_struct *napi, int budget)
+static int gfar_poll_rx(struct napi_struct *napi, int budget)
 {
 	struct gfar_priv_grp *gfargrp =
-		container_of(napi, struct gfar_priv_grp, napi);
+		container_of(napi, struct gfar_priv_grp, napi_rx);
 	struct gfar_private *priv = gfargrp->priv;
 	struct gfar __iomem *regs = gfargrp->regs;
-	struct gfar_priv_tx_q *tx_queue = NULL;
 	struct gfar_priv_rx_q *rx_queue = NULL;
-	int rx_cleaned = 0, budget_per_queue = 0, rx_cleaned_per_queue = 0;
-	int tx_cleaned = 0, i, left_over_budget = budget;
+	int rx_cleaned = 0, budget_per_queue, rx_cleaned_per_queue;
+	int i, left_over_budget = budget;
 	unsigned long serviced_queues = 0;
 	int num_queues = 0;
 
 	num_queues = gfargrp->num_rx_queues;
-	budget_per_queue = budget/num_queues;
 
 	while (num_queues && left_over_budget) {
 		budget_per_queue = left_over_budget/num_queues;
@@ -2866,9 +2895,6 @@ static int gfar_poll(struct napi_struct *napi, int budget)
 			if (test_bit(i, &serviced_queues))
 				continue;
 			rx_queue = priv->rx_queue[i];
-			tx_queue = priv->tx_queue[rx_queue->qindex];
-
-			tx_cleaned += gfar_clean_tx_ring(tx_queue);
 			rx_cleaned_per_queue =
 				gfar_clean_rx_ring(rx_queue, budget_per_queue);
 			rx_cleaned += rx_cleaned_per_queue;
@@ -2882,27 +2908,83 @@ static int gfar_poll(struct napi_struct *napi, int budget)
 		}
 	}
 
-	if (tx_cleaned)
-		return budget;
-
 	if (rx_cleaned < budget) {
+		u32 imask;
 		napi_complete(napi);
 
 		/* Clear the halt bit in RSTAT */
 		gfar_write(&regs->rstat, gfargrp->rstat);
 
-		gfar_write(&regs->imask, IMASK_DEFAULT);
+		spin_lock_irq(&gfargrp->grplock);
+		imask = gfar_read(&regs->imask);
+		imask |= IMASK_RX_DEFAULT;
+		gfar_write(&regs->imask, imask);
+		spin_unlock_irq(&gfargrp->grplock);
 
 		/* If we are coalescing interrupts, update the timer
 		 * Otherwise, clear it
 		 */
-		gfar_configure_coalescing(priv, gfargrp->rx_bit_map,
-					  gfargrp->tx_bit_map);
+		gfar_configure_rx_coalescing(priv, gfargrp->rx_bit_map);
 	}
 
 	return rx_cleaned;
 }
 
+static int gfar_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct gfar_priv_grp *gfargrp =
+		container_of(napi, struct gfar_priv_grp, napi_tx);
+	struct gfar_private *priv = gfargrp->priv;
+	struct gfar __iomem *regs = gfargrp->regs;
+	struct gfar_priv_tx_q *tx_queue = NULL;
+	int tx_cleaned = 0, budget_per_queue, tx_cleaned_per_queue;
+	int i, left_over_budget = budget;
+	unsigned long serviced_queues = 0;
+	int num_queues = 0;
+
+	num_queues = gfargrp->num_tx_queues;
+
+	while (num_queues && left_over_budget) {
+		budget_per_queue = left_over_budget/num_queues;
+		left_over_budget = 0;
+
+		for_each_set_bit(i, &gfargrp->tx_bit_map, priv->num_tx_queues) {
+			if (test_bit(i, &serviced_queues))
+				continue;
+			tx_queue = priv->tx_queue[i];
+			tx_cleaned_per_queue =
+				gfar_clean_tx_ring(tx_queue, budget_per_queue);
+			tx_cleaned += tx_cleaned_per_queue;
+			if (tx_cleaned_per_queue < budget_per_queue) {
+				left_over_budget = left_over_budget +
+					(budget_per_queue -
+					 tx_cleaned_per_queue);
+				set_bit(i, &serviced_queues);
+				num_queues--;
+			}
+		}
+	}
+
+	if (tx_cleaned < budget) {
+		u32 imask;
+		napi_complete(napi);
+
+		gfar_write(&regs->imask, IMASK_DEFAULT);
+		spin_lock_irq(&gfargrp->grplock);
+		imask = gfar_read(&regs->imask);
+		imask |= IMASK_TX_DEFAULT;
+		gfar_write(&regs->imask, imask);
+		spin_unlock_irq(&gfargrp->grplock);
+
+		/* If we are coalescing interrupts, update the timer
+		 * Otherwise, clear it
+		 */
+		gfar_configure_tx_coalescing(priv, gfargrp->tx_bit_map);
+	}
+
+	return tx_cleaned;
+}
+
 #ifdef CONFIG_NET_POLL_CONTROLLER
 /* Polling 'interrupt' - used by things like netconsole to send skbs
  * without having to re-enable interrupts. It's not called while
diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h
index 2136c7f..f5be234 100644
--- a/drivers/net/ethernet/freescale/gianfar.h
+++ b/drivers/net/ethernet/freescale/gianfar.h
@@ -57,8 +57,10 @@ struct ethtool_rx_list {
 	unsigned int count;
 };
 
-/* The maximum number of packets to be handled in one call of gfar_poll */
-#define GFAR_DEV_WEIGHT 64
+/* The maximum number of packets to be handled in one call of gfar_poll_rx */
+#define GFAR_DEV_RX_WEIGHT 64
+/* The maximum number of packets to be handled in one call of gfar_poll_tx */
+#define GFAR_DEV_TX_WEIGHT 64
 
 /* Length for FCB */
 #define GMAC_FCB_LEN 8
@@ -366,6 +368,10 @@ extern const char gfar_driver_version[];
 		| IMASK_PERR)
 #define IMASK_RTX_DISABLED ((~(IMASK_RXFEN0 | IMASK_TXFEN | IMASK_BSY)) \
 			   & IMASK_DEFAULT)
+#define IMASK_RX_DEFAULT  (IMASK_RXFEN0 | IMASK_BSY)
+#define IMASK_TX_DEFAULT  (IMASK_TXFEN)
+#define IMASK_RX_DISABLED ((~(IMASK_RX_DEFAULT)) & IMASK_DEFAULT)
+#define IMASK_TX_DISABLED ((~(IMASK_TX_DEFAULT)) & IMASK_DEFAULT)
 
 /* Fifo management */
 #define FIFO_TX_THR_MASK	0x01ff
@@ -993,7 +999,8 @@ struct gfar_priv_rx_q {
 
 /**
  *	struct gfar_priv_grp - per group structure
- *	@napi: the napi poll function
+ *	@napi_rx: the RX napi poll function
+ *	@napi_tx: the TX confirmation napi poll function
  *	@priv: back pointer to the priv structure
  *	@regs: the ioremapped register space for this group
  *	@grp_id: group id for this group
@@ -1007,7 +1014,8 @@ struct gfar_priv_rx_q {
 
 struct gfar_priv_grp {
 	spinlock_t grplock __attribute__ ((aligned (SMP_CACHE_BYTES)));
-	struct	napi_struct napi;
+	struct	napi_struct napi_rx;
+	struct	napi_struct napi_tx;
 	struct gfar_private *priv;
 	struct gfar __iomem *regs;
 	unsigned int grp_id;
-- 
1.6.6


* Re: [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions
  2012-08-08 12:26     ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil
  2012-08-08 12:26       ` [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing Claudiu Manoil
@ 2012-08-08 15:44       ` Paul Gortmaker
  2012-08-09 16:24         ` Claudiu Manoil
  2012-08-15  1:29       ` Paul Gortmaker
  2 siblings, 1 reply; 21+ messages in thread
From: Paul Gortmaker @ 2012-08-08 15:44 UTC (permalink / raw)
  To: Claudiu Manoil; +Cc: netdev, David S. Miller

[[RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:

> Split the coalescing programming support by Rx and Tx h/w queues, in order to
> introduce a separate NAPI for the Tx confirmation path (next patch). This way,
> the Rx processing path will handle the coalescing settings for the Rx queues
> only, while the Tx confirmation processing path will handle the Tx queues.
> 
> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
> ---
>  drivers/net/ethernet/freescale/gianfar.c |   36 +++++++++++++++++++++++------
>  1 files changed, 28 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
> index ddd350a..919acb3 100644
> --- a/drivers/net/ethernet/freescale/gianfar.c
> +++ b/drivers/net/ethernet/freescale/gianfar.c
> @@ -1794,8 +1794,8 @@ void gfar_start(struct net_device *dev)
>  	dev->trans_start = jiffies; /* prevent tx timeout */
>  }
>  
> -void gfar_configure_coalescing(struct gfar_private *priv,
> -			       unsigned long tx_mask, unsigned long rx_mask)
> +static inline void gfar_configure_tx_coalescing(struct gfar_private *priv,

I believe the preference is to not specify inline when all the chunks in
play are present in the one C file -- i.e. let gcc figure it out.  Same
for the Rx instance below.

P.
--

> +						unsigned long mask)
>  {
>  	struct gfar __iomem *regs = priv->gfargrp[0].regs;
>  	u32 __iomem *baddr;
> @@ -1803,14 +1803,31 @@ void gfar_configure_coalescing(struct gfar_private *priv,
>  	if (priv->mode == MQ_MG_MODE) {
>  		int i;
>  		baddr = &regs->txic0;
> -		for_each_set_bit(i, &tx_mask, priv->num_tx_queues) {
> +		for_each_set_bit(i, &mask, priv->num_tx_queues) {
>  			gfar_write(baddr + i, 0);
>  			if (likely(priv->tx_queue[i]->txcoalescing))
>  				gfar_write(baddr + i, priv->tx_queue[i]->txic);
>  		}
> +	} else {
> +		/* Backward compatible case ---- even if we enable
> +		 * multiple queues, there's only single reg to program
> +		 */
> +		gfar_write(&regs->txic, 0);
> +		if (likely(priv->tx_queue[0]->txcoalescing))
> +			gfar_write(&regs->txic, priv->tx_queue[0]->txic);
> +	}
> +}
> +
> +static inline void gfar_configure_rx_coalescing(struct gfar_private *priv,
> +						unsigned long mask)
> +{
> +	struct gfar __iomem *regs = priv->gfargrp[0].regs;
> +	u32 __iomem *baddr;
>  
> +	if (priv->mode == MQ_MG_MODE) {
> +		int i;
>  		baddr = &regs->rxic0;
> -		for_each_set_bit(i, &rx_mask, priv->num_rx_queues) {
> +		for_each_set_bit(i, &mask, priv->num_rx_queues) {
>  			gfar_write(baddr + i, 0);
>  			if (likely(priv->rx_queue[i]->rxcoalescing))
>  				gfar_write(baddr + i, priv->rx_queue[i]->rxic);
> @@ -1819,16 +1836,19 @@ void gfar_configure_coalescing(struct gfar_private *priv,
>  		/* Backward compatible case ---- even if we enable
>  		 * multiple queues, there's only single reg to program
>  		 */
> -		gfar_write(&regs->txic, 0);
> -		if (likely(priv->tx_queue[0]->txcoalescing))
> -			gfar_write(&regs->txic, priv->tx_queue[0]->txic);
> -
>  		gfar_write(&regs->rxic, 0);
>  		if (likely(priv->rx_queue[0]->rxcoalescing))
>  			gfar_write(&regs->rxic, priv->rx_queue[0]->rxic);
>  	}
>  }
>  
> +void gfar_configure_coalescing(struct gfar_private *priv,
> +			       unsigned long tx_mask, unsigned long rx_mask)
> +{
> +	gfar_configure_tx_coalescing(priv, tx_mask);
> +	gfar_configure_rx_coalescing(priv, rx_mask);
> +}
> +
>  static int register_grp_irqs(struct gfar_priv_grp *grp)
>  {
>  	struct gfar_private *priv = grp->priv;
> -- 
> 1.6.6
> 
> 


* Re: [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int
  2012-08-08 12:26   ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Claudiu Manoil
  2012-08-08 12:26     ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil
@ 2012-08-08 16:11     ` Paul Gortmaker
  2012-08-09 16:04       ` Claudiu Manoil
  1 sibling, 1 reply; 21+ messages in thread
From: Paul Gortmaker @ 2012-08-08 16:11 UTC (permalink / raw)
  To: Claudiu Manoil; +Cc: netdev, David S. Miller

[[RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:

> It's the interrupt handler's job to clear ievent for the Tx/Rx paths, as soon
> as the corresponding interrupt sources have been masked.

What wasn't clear to me was whether we'd ever have an instance of
gfar_poll run without RTX_MASK being cleared (in less normal conditions,
like netconsole, KGDBoE etc), since the gfar_schedule_cleanup is only
called from rx/tx IRQ threads, and neither of those are used by
gfar_poll, it seems.

Paul.
--
> 
> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
> ---
>  drivers/net/ethernet/freescale/gianfar.c |   16 ++++++----------
>  1 files changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
> index e9feeb9..ddd350a 100644
> --- a/drivers/net/ethernet/freescale/gianfar.c
> +++ b/drivers/net/ethernet/freescale/gianfar.c
> @@ -2568,12 +2568,13 @@ static void gfar_schedule_cleanup(struct gfar_priv_grp *gfargrp)
>  	if (napi_schedule_prep(&gfargrp->napi)) {
>  		gfar_write(&gfargrp->regs->imask, IMASK_RTX_DISABLED);
>  		__napi_schedule(&gfargrp->napi);
> -	} else {
> -		/* Clear IEVENT, so interrupts aren't called again
> -		 * because of the packets that have already arrived.
> -		 */
> -		gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK);
>  	}
> +
> +	/* Clear IEVENT, so interrupts aren't called again
> +	 * because of the packets that have already arrived.
> +	 */
> +	gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK);
> +
>  	spin_unlock_irqrestore(&gfargrp->grplock, flags);
>  
>  }
> @@ -2837,11 +2838,6 @@ static int gfar_poll(struct napi_struct *napi, int budget)
>  	num_queues = gfargrp->num_rx_queues;
>  	budget_per_queue = budget/num_queues;
>  
> -	/* Clear IEVENT, so interrupts aren't called again
> -	 * because of the packets that have already arrived
> -	 */
> -	gfar_write(&regs->ievent, IEVENT_RTX_MASK);
> -
>  	while (num_queues && left_over_budget) {
>  		budget_per_queue = left_over_budget/num_queues;
>  		left_over_budget = 0;
> -- 
> 1.6.6
> 
> 


* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-08 12:26 [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Claudiu Manoil
  2012-08-08 12:26 ` [RFC net-next 1/4] gianfar: Remove redundant programming of [rt]xic registers Claudiu Manoil
@ 2012-08-08 16:24 ` Paul Gortmaker
  2012-08-08 16:44   ` Eric Dumazet
  1 sibling, 1 reply; 21+ messages in thread
From: Paul Gortmaker @ 2012-08-08 16:24 UTC (permalink / raw)
  To: Claudiu Manoil; +Cc: netdev, David S. Miller

[[RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:

> Hi all,
> This set of patches basically splits the existing napi poll routine into
> two separate napi functions, one for Rx processing (triggered by frame
> receive interrupts only) and one for the Tx confirmation path processing
> (triggered by Tx confirmation interrupts only). The underlying polling
> algorithm remains much the same.
> 
> Important throughput improvements have been noted on low power boards with
> this set of changes.
> For instance, for the following netperf test:
> netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> yields a throughput gain from oscillating ~500-~700 Mbps to steady ~940 Mbps,
> (if the Rx/Tx paths are processed on different cores), w/ no increase in CPU%,
> on a p1020rdb - 2 core machine featuring etsec2.0 (Multi-Queue Multi-Group
> driver mode).

It would be interesting to know more about what was causing that large
an oscillation -- presumably you will have it reappear once one core
becomes 100% utilized.  Also, any thoughts on how the change will change
performance on an older low power single core gianfar system (e.g.  83xx)?

P.
--

> 
> Also, this change, which should balance Rx and Tx processing, proves to
> be effective against Rx busy interrupt occurrences.
> 
> Thanks for your review.
> Claudiu
> 
> 
> Claudiu Manoil (4):
>   gianfar: Remove redundant programming of [rt]xic registers
>   gianfar: Clear ievent from interrupt handler for [RT]x int
>   gianfar: Separate out the Rx and Tx coalescing functions
>   gianfar: Use separate NAPIs for Tx and Rx processing
> 
>  drivers/net/ethernet/freescale/gianfar.c |  220 +++++++++++++++++++++--------
>  drivers/net/ethernet/freescale/gianfar.h |   16 ++-
>  2 files changed, 171 insertions(+), 65 deletions(-)
> 
> 


* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-08 16:24 ` [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Paul Gortmaker
@ 2012-08-08 16:44   ` Eric Dumazet
  2012-08-08 23:06     ` Tomas Hruby
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2012-08-08 16:44 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: Claudiu Manoil, netdev, David S. Miller

On Wed, 2012-08-08 at 12:24 -0400, Paul Gortmaker wrote:
> [[RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:
> 
> > Hi all,
> > This set of patches basically splits the existing napi poll routine into
> > two separate napi functions, one for Rx processing (triggered by frame
> > receive interrupts only) and one for the Tx confirmation path processing
> > (triggered by Tx confirmation interrupts only). The polling algorithm
> > behind remains much the same.
> > 
> > Important throughput improvements have been noted on low power boards with
> > this set of changes.
> > For instance, for the following netperf test:
> > netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> > yields a throughput gain from oscillating ~500-~700 Mbps to steady ~940 Mbps,
> > (if the Rx/Tx paths are processed on different cores), w/ no increase in CPU%,
> > on a p1020rdb - 2 core machine featuring etsec2.0 (Multi-Queue Multi-Group
> > driver mode).
> 
> It would be interesting to know more about what was causing that large
> an oscillation -- presumably you will have it reappear once one core
> becomes 100% utilized.  Also, any thoughts on how the change will change
> performance on an older low power single core gianfar system (e.g.  83xx)?

I also was wondering if this low performance could be caused by BQL

Since TCP stack is driven by incoming ACKS, a NAPI run could have to
handle 10 TCP acks in a row, and resulting xmits could hit BQL and
transit on qdisc (Because NAPI handler won't handle TX completions in the
middle of RX handler)

So experiments would be nice, maybe by reducing a
bit /proc/sys/net/ipv4/tcp_limit_output_bytes 
(from 131072 to 65536 or 32768)


* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-08 16:44   ` Eric Dumazet
@ 2012-08-08 23:06     ` Tomas Hruby
  2012-08-09 15:07       ` Claudiu Manoil
  0 siblings, 1 reply; 21+ messages in thread
From: Tomas Hruby @ 2012-08-08 23:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Paul Gortmaker, Claudiu Manoil, netdev, David S. Miller

On Wed, Aug 8, 2012 at 9:44 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2012-08-08 at 12:24 -0400, Paul Gortmaker wrote:
>> [[RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:
>>
>> > Hi all,
>> > This set of patches basically splits the existing napi poll routine into
>> > two separate napi functions, one for Rx processing (triggered by frame
>> > receive interrupts only) and one for the Tx confirmation path processing
>> > (triggered by Tx confirmation interrupts only). The polling algorithm
>> > behind remains much the same.
>> >
>> > Important throughput improvements have been noted on low power boards with
>> > this set of changes.
>> > For instance, for the following netperf test:
>> > netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
>> > yields a throughput gain from oscillating ~500-~700 Mbps to steady ~940 Mbps,
>> > (if the Rx/Tx paths are processed on different cores), w/ no increase in CPU%,
>> > on a p1020rdb - 2 core machine featuring etsec2.0 (Multi-Queue Multi-Group
>> > driver mode).
>>
>> It would be interesting to know more about what was causing that large
>> an oscillation -- presumably you will have it reappear once one core
>> becomes 100% utilized.  Also, any thoughts on how the change will change
>> performance on an older low power single core gianfar system (e.g.  83xx)?
>
> I also was wondering if this low performance could be caused by BQL
>
> Since TCP stack is driven by incoming ACKS, a NAPI run could have to
> handle 10 TCP acks in a row, and resulting xmits could hit BQL and
> transit on qdisc (Because NAPI handler won't handle TX completions in the
> middle of RX handler)

Does disabling BQL help? Is the BQL limit stable? To what value is it
set? I would be very much interested in more data if the issue is BQL
related.


* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-08 23:06     ` Tomas Hruby
@ 2012-08-09 15:07       ` Claudiu Manoil
  2012-08-13 16:23         ` Claudiu Manoil
  0 siblings, 1 reply; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-09 15:07 UTC (permalink / raw)
  To: Tomas Hruby, Eric Dumazet, Paul Gortmaker; +Cc: netdev, David S. Miller

On 8/9/2012 2:06 AM, Tomas Hruby wrote:
> On Wed, Aug 8, 2012 at 9:44 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Wed, 2012-08-08 at 12:24 -0400, Paul Gortmaker wrote:
>>> [[RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:
>>>
>>>> Hi all,
>>>> This set of patches basically splits the existing napi poll routine into
>>>> two separate napi functions, one for Rx processing (triggered by frame
>>>> receive interrupts only) and one for the Tx confirmation path processing
>>>> (triggered by Tx confirmation interrupts only). The polling algorithm
>>>> behind remains much the same.
>>>>
>>>> Important throughput improvements have been noted on low power boards with
>>>> this set of changes.
>>>> For instance, for the following netperf test:
>>>> netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
>>>> yields a throughput gain from oscillating ~500-~700 Mbps to steady ~940 Mbps,
>>>> (if the Rx/Tx paths are processed on different cores), w/ no increase in CPU%,
>>>> on a p1020rdb - 2 core machine featuring etsec2.0 (Multi-Queue Multi-Group
>>>> driver mode).
>>>
>>> It would be interesting to know more about what was causing that large
>>> an oscillation -- presumably you will have it reappear once one core
>>> becomes 100% utilized.  Also, any thoughts on how the change will change
>>> performance on an older low power single core gianfar system (e.g.  83xx)?
>>
>> I also was wondering if this low performance could be caused by BQL
>>
>> Since TCP stack is driven by incoming ACKS, a NAPI run could have to
>> handle 10 TCP acks in a row, and resulting xmits could hit BQL and
>> transit on qdisc (Because NAPI handler won't handle TX completions in the
>> middle of RX handler)
>
> Does disabling BQL help? Is the BQL limit stable? To what value is it
> set? I would be very much interested in more data if the issue is BQL
> related.
>
> .
>

I agree that more tests should be run to investigate why gianfar under-
performs on the low power p1020rdb platform, and BQL seems to be
a good starting point (thanks for the hint). What I can say now is that
the issue is not apparent on p2020rdb, for instance, which is a more
powerful platform: the CPUs - 1200 MHz instead of 800 MHz; twice the
size of L2 cache (512 KB), greater bus (CCB) frequency ... On this
board (p2020rdb) the netperf test reaches 940Mbps both w/ and w/o these
patches.

For a single core system I'm not expecting any performance degradation,
simply because I don't see why the proposed napi poll implementation
would be slower than the existing one. I'll do some measurements on a
p1010rdb too (single core, CPU:800 MHz) and get back to you with the
results.

Thanks.
Claudiu


* Re: [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int
  2012-08-08 16:11     ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Paul Gortmaker
@ 2012-08-09 16:04       ` Claudiu Manoil
  0 siblings, 0 replies; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-09 16:04 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: netdev, David S. Miller

On 8/8/2012 7:11 PM, Paul Gortmaker wrote:
> [[RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:
>
>> It's the interrupt handler's job to clear ievent for the Tx/Rx paths, as soon
>> as the corresponding interrupt sources have been masked.
>
> What wasn't clear to me was whether we'd ever have an instance of
> gfar_poll run without RTX_MASK being cleared (in less normal conditions,
> like netconsole, KGDBoE etc), since the gfar_schedule_cleanup is only
> called from rx/tx IRQ threads, and neither of those are used by
> gfar_poll, it seems.

Hi Paul,

As I see it, netconsole has the ndo_poll_controller hook, which points
to gfar_netpoll() -> gfar_interrupt() -> gfar_receive|transmit() ->
gfar_schedule_cleanup(), so it passes through schedule_cleanup.

My understanding is that gfar_poll() is scheduled for execution
only by "__napi_schedule(&gfargrp->napi);" from the Rx/Tx interrupt
handler (gfar_receive/transmit()->gfar_schedule_cleanup()), and that
it will be executed some time after the interrupt handler returns,
which means RTX_MASK has been cleared (and the interrupt sources are
already masked).

I think that we might have an issue if we don't clear IEVENT right away
in the interrupt handler, as this register might cause additional hw
interrupts to the PIC upon returning from the interrupt handler.


* Re: [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions
  2012-08-08 15:44       ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Paul Gortmaker
@ 2012-08-09 16:24         ` Claudiu Manoil
  0 siblings, 0 replies; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-09 16:24 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: netdev, David S. Miller

On 8/8/2012 6:44 PM, Paul Gortmaker wrote:
> [[RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:
>
>> Split the coalescing programming support by Rx and Tx h/w queues, in order to
>> introduce a separate NAPI for the Tx confirmation path (next patch). This way,
>> the Rx processing path will handle the coalescing settings for the Rx queues
>> only, resp. the Tx confirmation processing path will handle the Tx queues.
>>
>> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
>> ---
>>   drivers/net/ethernet/freescale/gianfar.c |   36 +++++++++++++++++++++++------
>>   1 files changed, 28 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
>> index ddd350a..919acb3 100644
>> --- a/drivers/net/ethernet/freescale/gianfar.c
>> +++ b/drivers/net/ethernet/freescale/gianfar.c
>> @@ -1794,8 +1794,8 @@ void gfar_start(struct net_device *dev)
>>   	dev->trans_start = jiffies; /* prevent tx timeout */
>>   }
>>
>> -void gfar_configure_coalescing(struct gfar_private *priv,
>> -			       unsigned long tx_mask, unsigned long rx_mask)
>> +static inline void gfar_configure_tx_coalescing(struct gfar_private *priv,
>
> I believe the preference is to not specify inline when all the chunks in
> play are present in the one C file -- i.e. let gcc figure it out.  Same
> for the Rx instance below.
>
> P.
> --
I agree with you.
thanks


* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-09 15:07       ` Claudiu Manoil
@ 2012-08-13 16:23         ` Claudiu Manoil
  2012-08-14  1:15           ` Paul Gortmaker
  0 siblings, 1 reply; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-13 16:23 UTC (permalink / raw)
  To: Tomas Hruby, Eric Dumazet, Paul Gortmaker; +Cc: netdev, David S. Miller

On 08/09/2012 06:07 PM, Claudiu Manoil wrote:
> On 8/9/2012 2:06 AM, Tomas Hruby wrote:
>> On Wed, Aug 8, 2012 at 9:44 AM, Eric Dumazet <eric.dumazet@gmail.com> 
>> wrote:
>>> On Wed, 2012-08-08 at 12:24 -0400, Paul Gortmaker wrote:
>>>> [[RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation 
>>>> processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:
>>>>
>>>>> Hi all,
>>>>> This set of patches basically splits the existing napi poll 
>>>>> routine into
>>>>> two separate napi functions, one for Rx processing (triggered by 
>>>>> frame
>>>>> receive interrupts only) and one for the Tx confirmation path 
>>>>> processing
>>>>> (triggered by Tx confirmation interrupts only). The polling 
>>>>> algorithm
>>>>> behind remains much the same.
>>>>>
>>>>> Important throughput improvements have been noted on low power 
>>>>> boards with
>>>>> this set of changes.
>>>>> For instance, for the following netperf test:
>>>>> netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
>>>>> yields a throughput gain from oscillating ~500-~700 Mbps to steady 
>>>>> ~940 Mbps,
>>>>> (if the Rx/Tx paths are processed on different cores), w/ no 
>>>>> increase in CPU%,
>>>>> on a p1020rdb - 2 core machine featuring etsec2.0 (Multi-Queue 
>>>>> Multi-Group
>>>>> driver mode).
>>>>
>>>> It would be interesting to know more about what was causing that large
>>>> an oscillation -- presumably you will have it reappear once one core
>>>> becomes 100% utilized.  Also, any thoughts on how the change will 
>>>> change
>>>> performance on an older low power single core gianfar system (e.g.  
>>>> 83xx)?
>>>
>>> I also was wondering if this low performance could be caused by BQL
>>>
>>> Since TCP stack is driven by incoming ACKS, a NAPI run could have to
>>> handle 10 TCP acks in a row, and resulting xmits could hit BQL and
>>> transit on qdisc (Because NAPI handler won't handle TX completions in 
>>> the
>>> middle of RX handler)
>>
>> Does disabling BQL help? Is the BQL limit stable? To what value is it
>> set? I would be very much interested in more data if the issue is BQL
>> related.
>>
>> .
>>
>
> I agree that more tests should be run to investigate why gianfar under-
> performs on the low power p1020rdb platform, and BQL seems to be
> a good starting point (thanks for the hint). What I can say now is that
> the issue is not apparent on p2020rdb, for instance, which is a more
> powerful platform: the CPUs - 1200 MHz instead of 800 MHz; twice the
> size of L2 cache (512 KB), greater bus (CCB) frequency ... On this
> board (p2020rdb) the netperf test reaches 940Mbps both w/ and w/o these
> patches.
>
> For a single core system I'm not expecting any performance degradation,
> simply because I don't see why the proposed napi poll implementation
> would be slower than the existing one. I'll do some measurements on a
> p1010rdb too (single core, CPU:800 MHz) and get back to you with the
> results.
>

Hi all,

Please find below the netperf measurements performed on a p1010rdb machine
(single core, low power).  Three kernel images were used:
1) Linux version 3.5.0-20970-gaae06bf  -- net-next commit aae06bf
2) Linux version 3.5.0-20974-g2920464 -- commit aae06bf + Tx NAPI patches
3) Linux version 3.5.0-20970-gaae06bf-dirty -- commit aae06bf + 
CONFIG_BQL set to 'n'

The results show that, on *Image 1)*, adjusting tcp_limit_output_bytes brings
no substantial improvement: the throughput stays in the 580-60x Mbps range.
By changing the coalescing settings from default* (rx coalescing off,
tx-usecs: 10, tx-frames: 16) to:
"ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32"
we get a throughput of ~710 Mbps.

For *Image 2)*, using the default tcp_limit_output_bytes value (131072)
(I've noticed that "tweaking" tcp_limit_output_bytes does not improve the
throughput in this case either), we get the following performance numbers:
* default coalescing settings: ~650 Mbps
* rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32: ~860-880 Mbps

For *Image 3)*, with BQL disabled (CONFIG_BQL = n), there's *no* relevant
performance improvement compared to Image 1).
(note:
For all the measurements, rx and tx BD ring sizes have been set to 64, 
for best performance.)

So I really tend to believe that the performance degradation comes primarily
from the driver, and that the napi poll processing is an important source of
it. The proposed patches show substantial improvement, especially for SMP
systems where Tx and Rx processing may be done in parallel.
What do you think?
Is it ok to proceed by re-spinning the patches? Do you recommend 
additional measurements?

Regards,
Claudiu

//=Image 1)================
root@p1010rdb:~# cat /proc/version
Linux version 3.5.0-20970-gaae06bf [...]

root@p1010rdb:~# zcat /proc/config.gz | grep BQL
CONFIG_BQL=y
root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
131072

root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       580.76   99.95    11.76 14.099  1.659
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       598.21   99.95    10.91 13.687  1.493
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       583.04   99.95    11.25 14.043  1.581


root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
65536

root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       604.29   99.95    11.15 13.550  1.512
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       603.52   99.50    12.57 13.506  1.706
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       596.18   99.95    12.81 13.734  1.760



root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
32768

root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       582.32   99.95    12.96 14.061  1.824
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       583.79   99.95    11.19 14.026  1.571
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       584.16   99.95    11.36 14.016  1.592



root@p1010rdb:~# ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 
tx-usecs 32

root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       708.77   99.85    13.32 11.541  1.540
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       710.50   99.95    12.46 11.524  1.437
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       709.95   99.95    14.15 11.533  1.633


//=Image 2)================

root@p1010rdb:~# cat /proc/version
Linux version 3.5.0-20974-g2920464 [...]

root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       652.60   99.95    13.05 12.547  1.638
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       657.47   99.95    11.81 12.454  1.471
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       655.77   99.95    11.80 12.486  1.474


root@p1010rdb:~# ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 
tx-usecs 32

root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
131072
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.01       882.42   99.20    18.06 9.209   1.676
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       867.02   99.75    16.21 9.425   1.531
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.01       874.29   99.85    15.25 9.356   1.429


//=Image 3)================

Linux version 3.5.0-20970-gaae06bf-dirty [...] //CONFIG_BQL = n

root@p1010rdb:~# cat /proc/version
Linux version 3.5.0-20970-gaae06bf-dirty 
(b08782@zro04-ws574.ea.freescale.net) (gcc version 4.6.2 (GCC) ) #3 Mon 
Aug 13 13:58:25 EEST 2012
root@p1010rdb:~# zcat /proc/config.gz | grep BQL
# CONFIG_BQL is not set

root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       595.08   99.95    12.51 13.759  1.722
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       593.95   99.95    10.96 13.785  1.511
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       595.30   99.90    11.11 13.747  1.528

root@p1010rdb:~# ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 
tx-usecs 32
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       710.46   99.95    12.46 11.525  1.437
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       714.27   99.95    14.05 11.463  1.611
root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 
(192.168.10.1) port 0 AF_INET
Recv   Send    Send                          Utilization Service Demand
Socket Socket  Message  Elapsed              Send     Recv Send    Recv
Size   Size    Size     Time     Throughput  local    remote local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB

  87380  16384   1500    20.00       717.69   99.95    12.56 11.409  1.433

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing
  2012-08-08 12:26       ` [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing Claudiu Manoil
@ 2012-08-14  0:51         ` Paul Gortmaker
  2012-08-14 16:08           ` Claudiu Manoil
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Gortmaker @ 2012-08-14  0:51 UTC (permalink / raw)
  To: Claudiu Manoil; +Cc: netdev, David S. Miller, Pankaj Chauhan, Eric Dumazet

[[RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:

> Add a separate napi poll routine for Tx cleanup, to be triggered by Tx
> confirmation interrupts only. The existing poll function is modified to
> handle only the Rx path processing. This allows parallel processing of
> the Rx and Tx confirmation paths on an SMP machine (2 cores).
> The split also results in simpler/cleaner napi poll function
> implementations, where each processing path has its own budget, thus
> improving the fairness between the processing paths at the same time.
> 
> Signed-off-by: Pankaj Chauhan <pankaj.chauhan@freescale.com>
> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
> ---
>  drivers/net/ethernet/freescale/gianfar.c |  154 +++++++++++++++++++++++-------
>  drivers/net/ethernet/freescale/gianfar.h |   16 +++-
>  2 files changed, 130 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
> index 919acb3..2774961 100644
> --- a/drivers/net/ethernet/freescale/gianfar.c
> +++ b/drivers/net/ethernet/freescale/gianfar.c
> @@ -128,12 +128,14 @@ static void free_skb_resources(struct gfar_private *priv);
>  static void gfar_set_multi(struct net_device *dev);
>  static void gfar_set_hash_for_addr(struct net_device *dev, u8 *addr);
>  static void gfar_configure_serdes(struct net_device *dev);
> -static int gfar_poll(struct napi_struct *napi, int budget);
> +static int gfar_poll_rx(struct napi_struct *napi, int budget);
> +static int gfar_poll_tx(struct napi_struct *napi, int budget);
>  #ifdef CONFIG_NET_POLL_CONTROLLER
>  static void gfar_netpoll(struct net_device *dev);
>  #endif
>  int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit);
> -static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue);
> +static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue,
> +			      int tx_work_limit);

I'm looking at this in a bit more detail now (was on vacation last wk).
With the above, you push a work limit down into the clean_tx_ring.
I'm wondering if the above is implicitly involved in the performance
difference you see, since...

>  static int gfar_process_frame(struct net_device *dev, struct sk_buff *skb,
>  			      int amount_pull, struct napi_struct *napi);
>  void gfar_halt(struct net_device *dev);
> @@ -543,16 +545,20 @@ static void disable_napi(struct gfar_private *priv)
>  {
>  	int i;
>  
> -	for (i = 0; i < priv->num_grps; i++)
> -		napi_disable(&priv->gfargrp[i].napi);
> +	for (i = 0; i < priv->num_grps; i++) {
> +		napi_disable(&priv->gfargrp[i].napi_rx);
> +		napi_disable(&priv->gfargrp[i].napi_tx);
> +	}
>  }
>  
>  static void enable_napi(struct gfar_private *priv)
>  {
>  	int i;
>  
> -	for (i = 0; i < priv->num_grps; i++)
> -		napi_enable(&priv->gfargrp[i].napi);
> +	for (i = 0; i < priv->num_grps; i++) {
> +		napi_enable(&priv->gfargrp[i].napi_rx);
> +		napi_enable(&priv->gfargrp[i].napi_tx);
> +	}
>  }
>  
>  static int gfar_parse_group(struct device_node *np,
> @@ -1028,9 +1034,12 @@ static int gfar_probe(struct platform_device *ofdev)
>  	dev->ethtool_ops = &gfar_ethtool_ops;
>  
>  	/* Register for napi ...We are registering NAPI for each grp */
> -	for (i = 0; i < priv->num_grps; i++)
> -		netif_napi_add(dev, &priv->gfargrp[i].napi, gfar_poll,
> -			       GFAR_DEV_WEIGHT);
> +	for (i = 0; i < priv->num_grps; i++) {
> +		netif_napi_add(dev, &priv->gfargrp[i].napi_rx, gfar_poll_rx,
> +			       GFAR_DEV_RX_WEIGHT);
> +		netif_napi_add(dev, &priv->gfargrp[i].napi_tx, gfar_poll_tx,
> +			       GFAR_DEV_TX_WEIGHT);
> +	}
>  
>  	if (priv->device_flags & FSL_GIANFAR_DEV_HAS_CSUM) {
>  		dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG |
> @@ -2465,7 +2474,8 @@ static void gfar_align_skb(struct sk_buff *skb)
>  }
>  
>  /* Interrupt Handler for Transmit complete */
> -static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
> +static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue,
> +			      int tx_work_limit)
>  {
>  	struct net_device *dev = tx_queue->dev;
>  	struct netdev_queue *txq;
> @@ -2490,7 +2500,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
>  	bdp = tx_queue->dirty_tx;
>  	skb_dirtytx = tx_queue->skb_dirtytx;
>  
> -	while ((skb = tx_queue->tx_skbuff[skb_dirtytx])) {
> +	while ((skb = tx_queue->tx_skbuff[skb_dirtytx]) && tx_work_limit--) {

...code like this provides a new exit point that did not exist before,
for the case of a massive transmit blast.  Do you have any data on how
often the work limit is hit?  The old Don Becker ether drivers which
originally introduced the idea of work limits (on IRQs) used to printk a
message when they hit it, since ideally it shouldn't be happening all
the time.

In any case, it might be worth while to split this change out into a
separate commit; something like:

   gianfar: push transmit cleanup work limit down to clean_tx_ring

The advantage being (1) we can test this change in isolation, and
(2) it makes your remaining rx/tx separate thread patch smaller and
easier to review.

Thanks,
Paul.
--

>  		unsigned long flags;
>  
>  		frags = skb_shinfo(skb)->nr_frags;
> @@ -2580,29 +2590,50 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
>  	return howmany;
>  }
>  
> -static void gfar_schedule_cleanup(struct gfar_priv_grp *gfargrp)
> +static void gfar_schedule_rx_cleanup(struct gfar_priv_grp *gfargrp)
>  {
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&gfargrp->grplock, flags);
> -	if (napi_schedule_prep(&gfargrp->napi)) {
> -		gfar_write(&gfargrp->regs->imask, IMASK_RTX_DISABLED);
> -		__napi_schedule(&gfargrp->napi);
> +	if (napi_schedule_prep(&gfargrp->napi_rx)) {
> +		u32 imask;
> +		spin_lock_irqsave(&gfargrp->grplock, flags);
> +		imask = gfar_read(&gfargrp->regs->imask);
> +		imask &= ~(IMASK_RX_DEFAULT);
> +		gfar_write(&gfargrp->regs->imask, imask);
> +		__napi_schedule(&gfargrp->napi_rx);
> +		spin_unlock_irqrestore(&gfargrp->grplock, flags);
>  	}
>  
>  	/* Clear IEVENT, so interrupts aren't called again
>  	 * because of the packets that have already arrived.
>  	 */
> -	gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK);
> +	gfar_write(&gfargrp->regs->ievent, IEVENT_RX_MASK);
> +}
>  
> -	spin_unlock_irqrestore(&gfargrp->grplock, flags);
> +static void gfar_schedule_tx_cleanup(struct gfar_priv_grp *gfargrp)
> +{
> +	unsigned long flags;
> +
> +	if (napi_schedule_prep(&gfargrp->napi_tx)) {
> +		u32 imask;
> +		spin_lock_irqsave(&gfargrp->grplock, flags);
> +		imask = gfar_read(&gfargrp->regs->imask);
> +		imask &= ~(IMASK_TX_DEFAULT);
> +		gfar_write(&gfargrp->regs->imask, imask);
> +		__napi_schedule(&gfargrp->napi_tx);
> +		spin_unlock_irqrestore(&gfargrp->grplock, flags);
> +	}
>  
> +	/* Clear IEVENT, so interrupts aren't called again
> +	 * because of the packets that have already arrived.
> +	 */
> +	gfar_write(&gfargrp->regs->ievent, IEVENT_TX_MASK);
>  }
>  
>  /* Interrupt Handler for Transmit complete */
>  static irqreturn_t gfar_transmit(int irq, void *grp_id)
>  {
> -	gfar_schedule_cleanup((struct gfar_priv_grp *)grp_id);
> +	gfar_schedule_tx_cleanup((struct gfar_priv_grp *)grp_id);
>  	return IRQ_HANDLED;
>  }
>  
> @@ -2683,7 +2714,7 @@ static inline void count_errors(unsigned short status, struct net_device *dev)
>  
>  irqreturn_t gfar_receive(int irq, void *grp_id)
>  {
> -	gfar_schedule_cleanup((struct gfar_priv_grp *)grp_id);
> +	gfar_schedule_rx_cleanup((struct gfar_priv_grp *)grp_id);
>  	return IRQ_HANDLED;
>  }
>  
> @@ -2813,7 +2844,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
>  				rx_queue->stats.rx_bytes += pkt_len;
>  				skb_record_rx_queue(skb, rx_queue->qindex);
>  				gfar_process_frame(dev, skb, amount_pull,
> -						   &rx_queue->grp->napi);
> +						   &rx_queue->grp->napi_rx);
>  
>  			} else {
>  				netif_warn(priv, rx_err, dev, "Missing skb!\n");
> @@ -2842,21 +2873,19 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
>  	return howmany;
>  }
>  
> -static int gfar_poll(struct napi_struct *napi, int budget)
> +static int gfar_poll_rx(struct napi_struct *napi, int budget)
>  {
>  	struct gfar_priv_grp *gfargrp =
> -		container_of(napi, struct gfar_priv_grp, napi);
> +		container_of(napi, struct gfar_priv_grp, napi_rx);
>  	struct gfar_private *priv = gfargrp->priv;
>  	struct gfar __iomem *regs = gfargrp->regs;
> -	struct gfar_priv_tx_q *tx_queue = NULL;
>  	struct gfar_priv_rx_q *rx_queue = NULL;
> -	int rx_cleaned = 0, budget_per_queue = 0, rx_cleaned_per_queue = 0;
> -	int tx_cleaned = 0, i, left_over_budget = budget;
> +	int rx_cleaned = 0, budget_per_queue, rx_cleaned_per_queue;
> +	int i, left_over_budget = budget;
>  	unsigned long serviced_queues = 0;
>  	int num_queues = 0;
>  
>  	num_queues = gfargrp->num_rx_queues;
> -	budget_per_queue = budget/num_queues;
>  
>  	while (num_queues && left_over_budget) {
>  		budget_per_queue = left_over_budget/num_queues;
> @@ -2866,9 +2895,6 @@ static int gfar_poll(struct napi_struct *napi, int budget)
>  			if (test_bit(i, &serviced_queues))
>  				continue;
>  			rx_queue = priv->rx_queue[i];
> -			tx_queue = priv->tx_queue[rx_queue->qindex];
> -
> -			tx_cleaned += gfar_clean_tx_ring(tx_queue);
>  			rx_cleaned_per_queue =
>  				gfar_clean_rx_ring(rx_queue, budget_per_queue);
>  			rx_cleaned += rx_cleaned_per_queue;
> @@ -2882,27 +2908,83 @@ static int gfar_poll(struct napi_struct *napi, int budget)
>  		}
>  	}
>  
> -	if (tx_cleaned)
> -		return budget;
> -
>  	if (rx_cleaned < budget) {
> +		u32 imask;
>  		napi_complete(napi);
>  
>  		/* Clear the halt bit in RSTAT */
>  		gfar_write(&regs->rstat, gfargrp->rstat);
>  
> -		gfar_write(&regs->imask, IMASK_DEFAULT);
> +		spin_lock_irq(&gfargrp->grplock);
> +		imask = gfar_read(&regs->imask);
> +		imask |= IMASK_RX_DEFAULT;
> +		gfar_write(&regs->imask, imask);
> +		spin_unlock_irq(&gfargrp->grplock);
>  
>  		/* If we are coalescing interrupts, update the timer
>  		 * Otherwise, clear it
>  		 */
> -		gfar_configure_coalescing(priv, gfargrp->rx_bit_map,
> -					  gfargrp->tx_bit_map);
> +		gfar_configure_rx_coalescing(priv, gfargrp->rx_bit_map);
>  	}
>  
>  	return rx_cleaned;
>  }
>  
> +static int gfar_poll_tx(struct napi_struct *napi, int budget)
> +{
> +	struct gfar_priv_grp *gfargrp =
> +		container_of(napi, struct gfar_priv_grp, napi_tx);
> +	struct gfar_private *priv = gfargrp->priv;
> +	struct gfar __iomem *regs = gfargrp->regs;
> +	struct gfar_priv_tx_q *tx_queue = NULL;
> +	int tx_cleaned = 0, budget_per_queue, tx_cleaned_per_queue;
> +	int i, left_over_budget = budget;
> +	unsigned long serviced_queues = 0;
> +	int num_queues = 0;
> +
> +	num_queues = gfargrp->num_tx_queues;
> +
> +	while (num_queues && left_over_budget) {
> +		budget_per_queue = left_over_budget/num_queues;
> +		left_over_budget = 0;
> +
> +		for_each_set_bit(i, &gfargrp->tx_bit_map, priv->num_tx_queues) {
> +			if (test_bit(i, &serviced_queues))
> +				continue;
> +			tx_queue = priv->tx_queue[i];
> +			tx_cleaned_per_queue =
> +				gfar_clean_tx_ring(tx_queue, budget_per_queue);
> +			tx_cleaned += tx_cleaned_per_queue;
> +			if (tx_cleaned_per_queue < budget_per_queue) {
> +				left_over_budget = left_over_budget +
> +					(budget_per_queue -
> +					 tx_cleaned_per_queue);
> +				set_bit(i, &serviced_queues);
> +				num_queues--;
> +			}
> +		}
> +	}
> +
> +	if (tx_cleaned < budget) {
> +		u32 imask;
> +		napi_complete(napi);
> +
> +		gfar_write(&regs->imask, IMASK_DEFAULT);
> +		spin_lock_irq(&gfargrp->grplock);
> +		imask = gfar_read(&regs->imask);
> +		imask |= IMASK_TX_DEFAULT;
> +		gfar_write(&regs->imask, imask);
> +		spin_unlock_irq(&gfargrp->grplock);
> +
> +		/* If we are coalescing interrupts, update the timer
> +		 * Otherwise, clear it
> +		 */
> +		gfar_configure_tx_coalescing(priv, gfargrp->tx_bit_map);
> +	}
> +
> +	return tx_cleaned;
> +}
> +
>  #ifdef CONFIG_NET_POLL_CONTROLLER
>  /* Polling 'interrupt' - used by things like netconsole to send skbs
>   * without having to re-enable interrupts. It's not called while
> diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h
> index 2136c7f..f5be234 100644
> --- a/drivers/net/ethernet/freescale/gianfar.h
> +++ b/drivers/net/ethernet/freescale/gianfar.h
> @@ -57,8 +57,10 @@ struct ethtool_rx_list {
>  	unsigned int count;
>  };
>  
> -/* The maximum number of packets to be handled in one call of gfar_poll */
> -#define GFAR_DEV_WEIGHT 64
> +/* The maximum number of packets to be handled in one call of gfar_poll_rx */
> +#define GFAR_DEV_RX_WEIGHT 64
> +/* The maximum number of packets to be handled in one call of gfar_poll_tx */
> +#define GFAR_DEV_TX_WEIGHT 64
>  
>  /* Length for FCB */
>  #define GMAC_FCB_LEN 8
> @@ -366,6 +368,10 @@ extern const char gfar_driver_version[];
>  		| IMASK_PERR)
>  #define IMASK_RTX_DISABLED ((~(IMASK_RXFEN0 | IMASK_TXFEN | IMASK_BSY)) \
>  			   & IMASK_DEFAULT)
> +#define IMASK_RX_DEFAULT  (IMASK_RXFEN0 | IMASK_BSY)
> +#define IMASK_TX_DEFAULT  (IMASK_TXFEN)
> +#define IMASK_RX_DISABLED ((~(IMASK_RX_DEFAULT)) & IMASK_DEFAULT)
> +#define IMASK_TX_DISABLED ((~(IMASK_TX_DEFAULT)) & IMASK_DEFAULT)
>  
>  /* Fifo management */
>  #define FIFO_TX_THR_MASK	0x01ff
> @@ -993,7 +999,8 @@ struct gfar_priv_rx_q {
>  
>  /**
>   *	struct gfar_priv_grp - per group structure
> - *	@napi: the napi poll function
> + *	@napi_rx: the RX napi poll function
> + *	@napi_tx: the TX confirmation napi poll function
>   *	@priv: back pointer to the priv structure
>   *	@regs: the ioremapped register space for this group
>   *	@grp_id: group id for this group
> @@ -1007,7 +1014,8 @@ struct gfar_priv_rx_q {
>  
>  struct gfar_priv_grp {
>  	spinlock_t grplock __attribute__ ((aligned (SMP_CACHE_BYTES)));
> -	struct	napi_struct napi;
> +	struct	napi_struct napi_rx;
> +	struct	napi_struct napi_tx;
>  	struct gfar_private *priv;
>  	struct gfar __iomem *regs;
>  	unsigned int grp_id;
> -- 
> 1.6.6
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-13 16:23         ` Claudiu Manoil
@ 2012-08-14  1:15           ` Paul Gortmaker
  2012-08-14 16:08             ` Claudiu Manoil
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Gortmaker @ 2012-08-14  1:15 UTC (permalink / raw)
  To: Claudiu Manoil; +Cc: Tomas Hruby, Eric Dumazet, netdev, David S. Miller

[Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 13/08/2012 (Mon 19:23) Claudiu Manoil wrote:

> On 08/09/2012 06:07 PM, Claudiu Manoil wrote:
> >On 8/9/2012 2:06 AM, Tomas Hruby wrote:
> >>On Wed, Aug 8, 2012 at 9:44 AM, Eric Dumazet
> >><eric.dumazet@gmail.com> wrote:
> >>>On Wed, 2012-08-08 at 12:24 -0400, Paul Gortmaker wrote:
> >>>>[[RFC net-next 0/4] gianfar: Use separate NAPI for Tx
> >>>>confirmation processing] On 08/08/2012 (Wed 15:26) Claudiu
> >>>>Manoil wrote:
> >>>>
> >>>>>Hi all,
> >>>>>This set of patches basically splits the existing napi
> >>>>>poll routine into
> >>>>>two separate napi functions, one for Rx processing
> >>>>>(triggered by frame
> >>>>>receive interrupts only) and one for the Tx confirmation
> >>>>>path processing
> >>>>>(triggerred by Tx confirmation interrupts only). The
> >>>>>polling algorithm
> >>>>>behind remains much the same.
> >>>>>
> >>>>>Important throughput improvements have been noted on low
> >>>>>power boards with
> >>>>>this set of changes.
> >>>>>For instance, for the following netperf test:
> >>>>>netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> >>>>>yields a throughput gain from oscillating ~500-~700 Mbps to
> >>>>>steady ~940 Mbps,
> >>>>>(if the Rx/Tx paths are processed on different cores), w/
> >>>>>no increase in CPU%,
> >>>>>on a p1020rdb - 2 core machine featuring etsec2.0
> >>>>>(Multi-Queue Multi-Group
> >>>>>driver mode).
> >>>>
> >>>>It would be interesting to know more about what was causing that large
> >>>>an oscillation -- presumably you will have it reappear once one core
> >>>>becomes 100% utilized.  Also, any thoughts on how the change
> >>>>will change
> >>>>performance on an older low power single core gianfar system
> >>>>(e.g.  83xx)?
> >>>
> >>>I also was wondering if this low performance could be caused by BQL
> >>>
> >>>Since the TCP stack is driven by incoming ACKs, a NAPI run could
> >>>have to handle 10 TCP ACKs in a row, and the resulting xmits could
> >>>hit BQL and transit on the qdisc (because the NAPI handler won't
> >>>handle TX completions in the middle of the RX handler)
> >>
> >>Does disabling BQL help? Is the BQL limit stable? To what value is it
> >>set? I would be very much interested in more data if the issue is BQL
> >>related.
> >>
> >>.
> >>
> >
> >I agree that more tests should be run to investigate why gianfar
> >underperforms on the low power p1020rdb platform, and BQL seems to be
> >a good starting point (thanks for the hint). What I can say now is that
> >the issue is not apparent on p2020rdb, for instance, which is a more
> >powerful platform: the CPUs - 1200 MHz instead of 800 MHz; twice the
> >size of L2 cache (512 KB), greater bus (CCB) frequency ... On this
> >board (p2020rdb) the netperf test reaches 940Mbps both w/ and w/o these
> >patches.
> >
> >For a single core system I'm not expecting any performance degradation,
> >simply because I don't see why the proposed napi poll implementation
> >would be slower than the existing one. I'll do some measurements on a
> >p1010rdb too (single core, CPU:800 MHz) and get back to you with the
> >results.
> >
> 
> Hi all,
> 
> Please find below the netperf measurements performed on a p1010rdb machine
> (single core, low power).  Three kernel images were used:
> 1) Linux version 3.5.0-20970-gaae06bf  -- net-next commit aae06bf
> 2) Linux version 3.5.0-20974-g2920464 -- commit aae06bf + Tx NAPI patches
> 3) Linux version 3.5.0-20970-gaae06bf-dirty -- commit aae06bf +
> CONFIG_BQL set to 'n'

For future reference, you don't need to dirty the tree to disable
CONFIG_BQL at compile time; there is a runtime disable:

http://permalink.gmane.org/gmane.linux.network/223107
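
For the archives, BQL's runtime knobs live under sysfs per tx queue. One
commonly used way to neutralize it without rebuilding the kernel is to
raise the per-queue minimum limit so the queue is never throttled; the
session below is a sketch under that assumption (the exact value is
arbitrary, and the interface/queue names match the eth1 setup used in
the tests above, not something from the linked post):

```shell
# Inspect the current BQL state for eth1 tx queue 0
# (assumes a BQL-enabled kernel with this sysfs layout).
cat /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit

# Raise limit_min to a very large byte count so BQL effectively
# never limits this queue; repeat per tx-<n> queue as needed.
echo 1500000000 > /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit_min
```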

> 
> The results show that, on *Image 1)*, adjusting tcp_limit_output_bytes
> yields no substantial improvement, as the throughput stays in the
> 580-60x Mbps range.

This is a lot lower variation than what you reported earlier (20 versus
200, I think).  It was the variation that raised a red flag for me...

> By changing the coalescing settings from default* (rx coalescing off,
> tx-usecs: 10, tx-frames: 16) to:
> "ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32"
> we get a throughput of ~710 Mbps.
> 
> For *Image 2)*, using the default tcp_limit_output_bytes value
> (131072) - I've noticed that "tweaking" tcp_limit_output_bytes does
> not improve the throughput - we get the following performance numbers:
> * default coalescing settings: ~650 Mbps
> * rx-frames tx-frames 22 rx-usecs 32 tx-usecs 32: ~860-880 Mbps
> 
> For *Image 3)*, with BQL disabled (CONFIG_BQL = n), there's *no*
> relevant performance improvement compared to Image 1).
> (note:
> For all the measurements, rx and tx BD ring sizes have been set to
> 64, for best performance.)
> 
> So, I really tend to believe that the performance degradation comes
> primarily from the driver, and the napi poll processing turns out to
> be an important source for that. The proposed patches

This would make sense, if the CPU was slammed at 100% load in dealing
with the tx processing, and the change made the driver considerably more
efficient.  But is that really the case?  Is the p1010 really going flat
out just to handle the Tx processing?  Have you done any sort of
profiling to confirm/deny where the CPU is spending its time?

> show substantial improvement, especially for SMP systems where Tx
> and Rx processing may be
> done in parallel.
> What do you think?
> Is it ok to proceed by re-spinning the patches? Do you recommend
> additional measurements?

Unfortunately Eric is out this week, so we will be without his input for
a while.  However, we are only at 3.6-rc1 -- meaning net-next will be
open for quite some time, hence no need to rush to try and jam stuff in.

Also, I have two targets I'm interested in testing your patches on.  The
1st is a 500MHz mpc8349 board -- which should replicate what you see on
your p1010 (slow, single core).  The other is an 8641D, which is
interesting since it will give us the SMP tx/rx as separate threads, but
without the MQ_MG_MODE support (is that a correct assumption?)

I don't have any fundamental problem with your patches (although 4/4
might be better as two patches) -- the above targets/tests are only
of interest, since I'm not convinced we yet understand _why_ your
changes give a performance boost, and there might be something
interesting hiding in there.

So, while Eric is out, let me see if I can collect some more data on
those two targets sometime this week.

Thanks,
Paul.
--

> 
> Regards,
> Claudiu
> 
> //=Image 1)================
> root@p1010rdb:~# cat /proc/version
> Linux version 3.5.0-20970-gaae06bf [...]
> 
> root@p1010rdb:~# zcat /proc/config.gz | grep BQL
> CONFIG_BQL=y
> root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
> 131072
> 
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       580.76   99.95    11.76 14.099  1.659
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       598.21   99.95    10.91 13.687  1.493
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       583.04   99.95    11.25 14.043  1.581
> 
> 
> root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
> 65536
> 
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       604.29   99.95    11.15 13.550  1.512
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       603.52   99.50    12.57 13.506  1.706
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       596.18   99.95    12.81 13.734  1.760
> 
> 
> 
> root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
> 32768
> 
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       582.32   99.95    12.96 14.061  1.824
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       583.79   99.95    11.19 14.026  1.571
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       584.16   99.95    11.36 14.016  1.592
> 
> 
> 
> root@p1010rdb:~# ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs
> 32 tx-usecs 32
> 
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       708.77   99.85    13.32 11.541  1.540
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       710.50   99.95    12.46 11.524  1.437
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       709.95   99.95    14.15 11.533  1.633
> 
> 
> //=Image 2)================
> 
> root@p1010rdb:~# cat /proc/version
> Linux version 3.5.0-20974-g2920464 [...]
> 
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       652.60   99.95    13.05 12.547  1.638
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       657.47   99.95    11.81 12.454  1.471
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       655.77   99.95    11.80 12.486  1.474
> 
> 
> root@p1010rdb:~# ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames
> 22 tx-usecs 32
> 
> root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
> 131072
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.01       882.42   99.20    18.06 9.209   1.676
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       867.02   99.75    16.21 9.425   1.531
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.01       874.29   99.85    15.25 9.356   1.429
> 
> 
> //=Image 3)================
> 
> Linux version 3.5.0-20970-gaae06bf-dirty [...] //CONFIG_BQL = n
> 
> root@p1010rdb:~# cat /proc/version
> Linux version 3.5.0-20970-gaae06bf-dirty
> (b08782@zro04-ws574.ea.freescale.net) (gcc version 4.6.2 (GCC) ) #3
> Mon Aug 13 13:58:25 EEST 2012
> root@p1010rdb:~# zcat /proc/config.gz | grep BQL
> # CONFIG_BQL is not set
> 
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       595.08   99.95    12.51 13.759  1.722
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       593.95   99.95    10.96 13.785  1.511
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       595.30   99.90    11.11 13.747  1.528
> 
> root@p1010rdb:~# ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 tx-usecs 32
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       710.46   99.95    12.46 11.525  1.437
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       714.27   99.95    14.05 11.463  1.611
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv   Send    Send                          Utilization Service Demand
> Socket Socket  Message  Elapsed              Send     Recv Send    Recv
> Size   Size    Size     Time     Throughput  local    remote local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S us/KB   us/KB
> 
>  87380  16384   1500    20.00       717.69   99.95    12.56 11.409  1.433
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-14  1:15           ` Paul Gortmaker
@ 2012-08-14 16:08             ` Claudiu Manoil
  2012-08-16 15:36               ` Paul Gortmaker
  0 siblings, 1 reply; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-14 16:08 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: Tomas Hruby, Eric Dumazet, netdev, David S. Miller

On 08/14/2012 04:15 AM, Paul Gortmaker wrote:
> This is a lot lower variation than what you reported earlier (20 
> versus 200, I think). It was the variation that raised a red flag for 
> me... 
Hi Paul,
The earlier variation, which is much bigger (indeed ~200), was observed
on a p1020 (slow, 2 cores, MQ_MG_MODE).
However, I did not collect measurement results for that board in as much
detail as I did for the p1010 (previous email).
The most important performance improvement I've noticed, though, was on
the p1020 platform.

>> By changing the coalescing settings from default* (rx coalescing off,
>> tx-usecs: 10, tx-frames: 16) to:
>> "ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32"
>> we get a throughput of ~710 Mbps.
>>
>> For *Image 2)*, using the default tcp_limit_output_bytes value
>> (131072) - I've noticed
>> that "tweaking" tcp_limit_output_bytes does not improve the
>> throughput -, we get the
>> following performance numbers:
>> * default coalescing settings: ~650 Mbps
>> * rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32: ~860-880 Mbps
>>
>> For *Image 3)*, by disabling BQL (CONFIG_BQL = n), there's *no*
>> relevant performance
>> improvement compared to Image 1).
>> (note:
>> For all the measurements, rx and tx BD ring sizes have been set to
>> 64, for best performance.)
>>
>> So, I really tend to believe that the performance degradation comes
>> primarily from the driver,
>> and the napi poll processing turns out to be an important source for
>> that. The proposed patches
> This would make sense, if the CPU was slammed at 100% load in dealing
> with the tx processing, and the change made the driver considerably more
> efficient.  But is that really the case?  Is the p1010 really going flat
> out just to handle the Tx processing?  Have you done any sort of
> profiling to confirm/deny where the CPU is spending its time?
The current gfar_poll implementation processes the Tx confirmation path
first, exhaustively, without a budget/work limit, and only then proceeds
with the Rx processing within the allotted budget. And this happens for
both Rx and Tx confirmation interrupts. I find this unfair and out of
balance. Maybe by letting Rx processing be triggered by Rx interrupts
only, the Tx confirmation path processing be triggered by Tx confirmation
interrupts only, and, on top of that, by imposing a work limit on the Tx
confirmation path too, we get a more responsive driver that performs
better. Indeed, some profiling data to confirm this would be great, but I
don't have it.
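
The budgeting difference described above can be illustrated with a small
user-space sketch. This is NOT the gianfar code: `ring`, `poll_combined`,
`poll_tx` and `poll_rx` are made-up names that only model how the budget
is applied before and after the proposed split.

```c
/* Simplified user-space model of the polling split discussed above.
 * Made-up names; only the budgeting logic is illustrated. */

struct ring {
	int pending;	/* descriptors waiting to be processed */
};

/* Current (combined) poll: the Tx confirmation path is drained
 * exhaustively, and only the Rx path is bounded by the NAPI budget. */
int poll_combined(struct ring *tx, struct ring *rx, int budget)
{
	int work = 0;

	while (tx->pending > 0)		/* no work limit on Tx cleanup */
		tx->pending--;

	while (rx->pending > 0 && work < budget) {
		rx->pending--;
		work++;
	}
	return work;	/* work < budget would mean napi_complete() */
}

/* Proposed split: each path gets its own NAPI instance and budget, so
 * a Tx backlog can no longer monopolize a poll round, and on SMP the
 * two polls may run on different cores. */
int poll_tx(struct ring *tx, int budget)
{
	int cleaned = 0;

	while (tx->pending > 0 && cleaned < budget) {
		tx->pending--;
		cleaned++;
	}
	return cleaned;
}

int poll_rx(struct ring *rx, int budget)
{
	int work = 0;

	while (rx->pending > 0 && work < budget) {
		rx->pending--;
		work++;
	}
	return work;
}
```

With the combined poll, a large Tx backlog is processed in full before
the budget even applies; with the split, each path returns after at most
its own budget, so the two can be scheduled fairly (and in parallel on
two cores).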

There's another issue that seems to be solved by this patchset, and I've
noticed it only on the p1020rdb (this time): excessive Rx busy interrupt
occurrences. Solving this issue may be another factor in the performance
improvement on the p1020. But maybe this is another discussion.

>
>> show substantial improvement, especially for SMP systems where Tx
>> and Rx processing may be
>> done in parallel.
>> What do you think?
>> Is it ok to proceed by re-spinning the patches? Do you recommend
>> additional measurements?
> Unfortunately Eric is out this week, so we will be without his input for
> a while.  However, we are only at 3.6-rc1 -- meaning net-next will be
> open for quite some time, hence no need to rush to try and jam stuff in.
>
> Also, I have two targets I'm interested in testing your patches on.  The
> 1st is a 500MHz mpc8349 board -- which should replicate what you see on
> your p1010 (slow, single core).  The other is an 8641D, which is
> interesting since it will give us the SMP tx/rx as separate threads, but
> without the MQ_MG_MODE support (is that a correct assumption?)
>
> I don't have any fundamental problem with your patches (although 4/4
> might be better as two patches) -- the above targets/tests are only
> of interest, since I'm not convinced we yet understand _why_ your
> changes give a performance boost, and there might be something
> interesting hiding in there.
>
> So, while Eric is out, let me see if I can collect some more data on
> those two targets sometime this week.
Great, I don't mean to rush. The more data we get on this, the better.
It would be great if you could do some measurements on your platforms too.
The 8641D is indeed a dual core with etsec 1.x (so without MQ_MG_MODE),
but I did run some tests on a p2020, which has the same features. However,
I'm eager to see your results.
Thanks for helping.

Regards,
Claudiu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing
  2012-08-14  0:51         ` Paul Gortmaker
@ 2012-08-14 16:08           ` Claudiu Manoil
  0 siblings, 0 replies; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-14 16:08 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: netdev, David S. Miller, Pankaj Chauhan, Eric Dumazet

On 08/14/2012 03:51 AM, Paul Gortmaker wrote:
> [[RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:
>
>> Add a separate napi poll routine for Tx cleanup, to be triggered by Tx
>> confirmation interrupts only. Existing poll function is modified to handle
>> only the Rx path processing. This allows parallel processing of Rx and Tx
>> confirmation paths on a smp machine (2 cores).
>> The split also results in simpler/cleaner napi poll function implementations,
>> where each processing path has its own budget, thus improving the fairness b/w
>> the processing paths at the same time.
>>
>> Signed-off-by: Pankaj Chauhan <pankaj.chauhan@freescale.com>
>> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
>> ---
>>   drivers/net/ethernet/freescale/gianfar.c |  154 +++++++++++++++++++++++-------
>>   drivers/net/ethernet/freescale/gianfar.h |   16 +++-
>>   2 files changed, 130 insertions(+), 40 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
>> index 919acb3..2774961 100644
>> --- a/drivers/net/ethernet/freescale/gianfar.c
>> +++ b/drivers/net/ethernet/freescale/gianfar.c
>> @@ -128,12 +128,14 @@ static void free_skb_resources(struct gfar_private *priv);
>>   static void gfar_set_multi(struct net_device *dev);
>>   static void gfar_set_hash_for_addr(struct net_device *dev, u8 *addr);
>>   static void gfar_configure_serdes(struct net_device *dev);
>> -static int gfar_poll(struct napi_struct *napi, int budget);
>> +static int gfar_poll_rx(struct napi_struct *napi, int budget);
>> +static int gfar_poll_tx(struct napi_struct *napi, int budget);
>>   #ifdef CONFIG_NET_POLL_CONTROLLER
>>   static void gfar_netpoll(struct net_device *dev);
>>   #endif
>>   int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit);
>> -static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue);
>> +static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue,
>> +			      int tx_work_limit);
> I'm looking at this in a bit more detail now (was on vacation last wk).
> With the above, you push a work limit down into the clean_tx_ring.
> I'm wondering if the above is implicitly involved in the performance
> difference you see, since...
>
>>   static int gfar_process_frame(struct net_device *dev, struct sk_buff *skb,
>>   			      int amount_pull, struct napi_struct *napi);
>>   void gfar_halt(struct net_device *dev);
>> @@ -543,16 +545,20 @@ static void disable_napi(struct gfar_private *priv)
>>   {
>>   	int i;
>>   
>> -	for (i = 0; i < priv->num_grps; i++)
>> -		napi_disable(&priv->gfargrp[i].napi);
>> +	for (i = 0; i < priv->num_grps; i++) {
>> +		napi_disable(&priv->gfargrp[i].napi_rx);
>> +		napi_disable(&priv->gfargrp[i].napi_tx);
>> +	}
>>   }
>>   
>>   static void enable_napi(struct gfar_private *priv)
>>   {
>>   	int i;
>>   
>> -	for (i = 0; i < priv->num_grps; i++)
>> -		napi_enable(&priv->gfargrp[i].napi);
>> +	for (i = 0; i < priv->num_grps; i++) {
>> +		napi_enable(&priv->gfargrp[i].napi_rx);
>> +		napi_enable(&priv->gfargrp[i].napi_tx);
>> +	}
>>   }
>>   
>>   static int gfar_parse_group(struct device_node *np,
>> @@ -1028,9 +1034,12 @@ static int gfar_probe(struct platform_device *ofdev)
>>   	dev->ethtool_ops = &gfar_ethtool_ops;
>>   
>>   	/* Register for napi ...We are registering NAPI for each grp */
>> -	for (i = 0; i < priv->num_grps; i++)
>> -		netif_napi_add(dev, &priv->gfargrp[i].napi, gfar_poll,
>> -			       GFAR_DEV_WEIGHT);
>> +	for (i = 0; i < priv->num_grps; i++) {
>> +		netif_napi_add(dev, &priv->gfargrp[i].napi_rx, gfar_poll_rx,
>> +			       GFAR_DEV_RX_WEIGHT);
>> +		netif_napi_add(dev, &priv->gfargrp[i].napi_tx, gfar_poll_tx,
>> +			       GFAR_DEV_TX_WEIGHT);
>> +	}
>>   
>>   	if (priv->device_flags & FSL_GIANFAR_DEV_HAS_CSUM) {
>>   		dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG |
>> @@ -2465,7 +2474,8 @@ static void gfar_align_skb(struct sk_buff *skb)
>>   }
>>   
>>   /* Interrupt Handler for Transmit complete */
>> -static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
>> +static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue,
>> +			      int tx_work_limit)
>>   {
>>   	struct net_device *dev = tx_queue->dev;
>>   	struct netdev_queue *txq;
>> @@ -2490,7 +2500,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
>>   	bdp = tx_queue->dirty_tx;
>>   	skb_dirtytx = tx_queue->skb_dirtytx;
>>   
>> -	while ((skb = tx_queue->tx_skbuff[skb_dirtytx])) {
>> +	while ((skb = tx_queue->tx_skbuff[skb_dirtytx]) && tx_work_limit--) {
> ...code like this provides a new exit point that did not exist before,
> for the case of a massive transmit blast.  Do you have any data on how
> often the work limit is hit?  The old Don Becker ether drivers which
> originally introduced the idea of work limits (on IRQs) used to printk a
> message when they hit it, since ideally it shouldn't be happening all
> the time.
>
> In any case, it might be worth while to split this change out into a
> separate commit; something like:
>
>     gianfar: push transmit cleanup work limit down to clean_tx_ring
>
> The advantage being (1) we can test this change in isolation, and
> (2) it makes your remaining rx/tx separate thread patch smaller and
> easier to review.
Sounds interesting, I think I'll give it a try.
Thanks,
Claudiu
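
As a rough illustration of what Paul suggests above — a bounded Tx clean
loop that also counts how often the limit is actually hit — here is a
user-space sketch. The names (`ring`, `tx_limit_hits`, `clean_tx_ring`)
are made up for illustration; the real loop walks BDs and frees skbs.

```c
/* Sketch of a work-limited Tx clean loop, with a counter for how
 * often the limit is exhausted (the new exit point the patch adds).
 * Hypothetical names; not the actual gianfar code. */

struct ring {
	int pending;	/* completed Tx descriptors awaiting cleanup */
};

unsigned long tx_limit_hits;	/* hypothetical stat: limit reached */

int clean_tx_ring(struct ring *txq, int tx_work_limit)
{
	int howmany = 0;

	while (txq->pending > 0 && tx_work_limit--) {
		txq->pending--;
		howmany++;
	}
	if (txq->pending > 0)	/* hit the limit with work left over */
		tx_limit_hits++;
	return howmany;
}
```

Instead of a printk, a counter like this could be exported via ethtool
stats to see how often a transmit blast actually hits the limit.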

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions
  2012-08-08 12:26     ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil
  2012-08-08 12:26       ` [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing Claudiu Manoil
  2012-08-08 15:44       ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Paul Gortmaker
@ 2012-08-15  1:29       ` Paul Gortmaker
  2 siblings, 0 replies; 21+ messages in thread
From: Paul Gortmaker @ 2012-08-15  1:29 UTC (permalink / raw)
  To: Claudiu Manoil; +Cc: netdev, David S. Miller

[[RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:

> Split the coalescing programming support by Rx and Tx h/w queues, in order to
> introduce a separate NAPI for the Tx confirmation path (next patch). This way,
> the Rx processing path will handle the coalescing settings for the Rx queues
> only, resp. the Tx confirmation processing path will handle the Tx queues.
> 
> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
> ---
>  drivers/net/ethernet/freescale/gianfar.c |   36 +++++++++++++++++++++++------
>  1 files changed, 28 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
> index ddd350a..919acb3 100644
> --- a/drivers/net/ethernet/freescale/gianfar.c
> +++ b/drivers/net/ethernet/freescale/gianfar.c
> @@ -1794,8 +1794,8 @@ void gfar_start(struct net_device *dev)
>  	dev->trans_start = jiffies; /* prevent tx timeout */
>  }
>  
> -void gfar_configure_coalescing(struct gfar_private *priv,
> -			       unsigned long tx_mask, unsigned long rx_mask)
> +static inline void gfar_configure_tx_coalescing(struct gfar_private *priv,
> +						unsigned long mask)

Hi Claudiu,

I had it in mind to mention this earlier, but forgot.  Align line two
(and three, and four...) of args with the 1st arg in line one, and
you'll save yourself needing a resend for basic formatting issues.

In other words, you need:

void foo(int some_really_long_line, struct blah *more_long_line,
         int b, ... )

i.e. allowing for mail mangling, make sure you have the the two int
directly over each other, in the example above.

Thanks,
Paul.
--

>  {
>  	struct gfar __iomem *regs = priv->gfargrp[0].regs;
>  	u32 __iomem *baddr;
> @@ -1803,14 +1803,31 @@ void gfar_configure_coalescing(struct gfar_private *priv,
>  	if (priv->mode == MQ_MG_MODE) {
>  		int i;
>  		baddr = &regs->txic0;
> -		for_each_set_bit(i, &tx_mask, priv->num_tx_queues) {
> +		for_each_set_bit(i, &mask, priv->num_tx_queues) {
>  			gfar_write(baddr + i, 0);
>  			if (likely(priv->tx_queue[i]->txcoalescing))
>  				gfar_write(baddr + i, priv->tx_queue[i]->txic);
>  		}
> +	} else {
> +		/* Backward compatible case ---- even if we enable
> +		 * multiple queues, there's only single reg to program
> +		 */
> +		gfar_write(&regs->txic, 0);
> +		if (likely(priv->tx_queue[0]->txcoalescing))
> +			gfar_write(&regs->txic, priv->tx_queue[0]->txic);
> +	}
> +}
> +
> +static inline void gfar_configure_rx_coalescing(struct gfar_private *priv,
> +						unsigned long mask)
> +{
> +	struct gfar __iomem *regs = priv->gfargrp[0].regs;
> +	u32 __iomem *baddr;
>  
> +	if (priv->mode == MQ_MG_MODE) {
> +		int i;
>  		baddr = &regs->rxic0;
> -		for_each_set_bit(i, &rx_mask, priv->num_rx_queues) {
> +		for_each_set_bit(i, &mask, priv->num_rx_queues) {
>  			gfar_write(baddr + i, 0);
>  			if (likely(priv->rx_queue[i]->rxcoalescing))
>  				gfar_write(baddr + i, priv->rx_queue[i]->rxic);
> @@ -1819,16 +1836,19 @@ void gfar_configure_coalescing(struct gfar_private *priv,
>  		/* Backward compatible case ---- even if we enable
>  		 * multiple queues, there's only single reg to program
>  		 */
> -		gfar_write(&regs->txic, 0);
> -		if (likely(priv->tx_queue[0]->txcoalescing))
> -			gfar_write(&regs->txic, priv->tx_queue[0]->txic);
> -
>  		gfar_write(&regs->rxic, 0);
>  		if (likely(priv->rx_queue[0]->rxcoalescing))
>  			gfar_write(&regs->rxic, priv->rx_queue[0]->rxic);
>  	}
>  }
>  
> +void gfar_configure_coalescing(struct gfar_private *priv,
> +			       unsigned long tx_mask, unsigned long rx_mask)
> +{
> +	gfar_configure_tx_coalescing(priv, tx_mask);
> +	gfar_configure_rx_coalescing(priv, rx_mask);
> +}
> +
>  static int register_grp_irqs(struct gfar_priv_grp *grp)
>  {
>  	struct gfar_private *priv = grp->priv;
> -- 
> 1.6.6
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-14 16:08             ` Claudiu Manoil
@ 2012-08-16 15:36               ` Paul Gortmaker
  2012-08-17 11:28                 ` Claudiu Manoil
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Gortmaker @ 2012-08-16 15:36 UTC (permalink / raw)
  To: Claudiu Manoil; +Cc: Tomas Hruby, Eric Dumazet, netdev, David S. Miller

[Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 14/08/2012 (Tue 19:08) Claudiu Manoil wrote:

> On 08/14/2012 04:15 AM, Paul Gortmaker wrote:
> >This is a lot lower variation than what you reported earlier (20
> >versus 200, I think). It was the variation that raised a red flag
> >for me...
> Hi Paul,
> The earlier variation, which is much bigger (indeed ~200), was
> observed on a p1020 (slow, 2 cores, MQ_MG_MODE).
> I did not collect however as detailed measurement results for that
> board, as I did for p1010 (previous email).
> The most important performance improvement I've noticed however was
> on the p1020 platform.
> 
> >>By changing the coalescing settings from default* (rx coalescing off,
> >>tx-usecs: 10, tx-frames: 16) to:
> >>"ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32"
> >>we get a throughput of ~710 Mbps.
> >>
> >>For *Image 2)*, using the default tcp_limit_output_bytes value
> >>(131072) - I've noticed
> >>that "tweaking" tcp_limit_output_bytes does not improve the
> >>throughput -, we get the
> >>following performance numbers:
> >>* default coalescing settings: ~650 Mbps
> >>* rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32: ~860-880 Mbps
> >>
> >>For *Image 3)*, by disabling BQL (CONFIG_BQL = n), there's *no*
> >>relevant performance
> >>improvement compared to Image 1).
> >>(note:
> >>For all the measurements, rx and tx BD ring sizes have been set to
> >>64, for best performance.)
> >>
> >>So, I really tend to believe that the performance degradation comes
> >>primarily from the driver,
> >>and the napi poll processing turns out to be an important source for
> >>that. The proposed patches
> >This would make sense, if the CPU was slammed at 100% load in dealing
> >with the tx processing, and the change made the driver considerably more
> >efficient.  But is that really the case?  Is the p1010 really going flat
> >out just to handle the Tx processing?  Have you done any sort of
> >profiling to confirm/deny where the CPU is spending its time?
> The current gfar_poll implementation processes first the tx
> confirmation path exhaustively, without a budget/ work limit,
> and only then proceeds with the rx processing within the allotted
> budget. And this happens for both Rx and Tx confirmation
> interrupts. I find this unfair and out of balance. Maybe by letting
> rx processing to be triggered by rx interrupts only, and
> the tx conf path processing to be triggered by tx confirmation
> interrupts only, and, on top of that, by imposing a work limit
> on the tx confirmation path too, we get a more responsive driver
> that performs better. Indeed some profiling data to
> confirm this would be great, but I don't have it.
> 
> There's another issue that seems to be solved by this patchset, and
> I've noticed it only on p1020rdb (this time).
> And that is excessive Rx busy interrupts occurrence. Solving this
> issue may be another factor for the performance
> improvement on p1020. But maybe this is another discussion.
> 
> >
> >>show substantial improvement, especially for SMP systems where Tx
> >>and Rx processing may be
> >>done in parallel.
> >>What do you think?
> >>Is it ok to proceed by re-spinning the patches? Do you recommend
> >>additional measurements?
> >Unfortunately Eric is out this week, so we will be without his input for
> >a while.  However, we are only at 3.6-rc1 -- meaning net-next will be
> >open for quite some time, hence no need to rush to try and jam stuff in.
> >
> >Also, I have two targets I'm interested in testing your patches on.  The
> >1st is a 500MHz mpc8349 board -- which should replicate what you see on
> >your p1010 (slow, single core).  The other is an 8641D, which is
> >interesting since it will give us the SMP tx/rx as separate threads, but
> >without the MQ_MG_MODE support (is that a correct assumption?)
> >
> >I don't have any fundamental problem with your patches (although 4/4
> >might be better as two patches) -- the above targets/tests are only
> >of interest, since I'm not convinced we yet understand _why_ your
> >changes give a performance boost, and there might be something
> >interesting hiding in there.
> >
> >So, while Eric is out, let me see if I can collect some more data on
> >those two targets sometime this week.
>
> Great, I don't mean to rush.  The more data we get on this the better.
> It would be great if you could do some measurements on your platforms too.
> 8641D is indeed a dual core with etsec 1.x (so without the MQ_MG_MODE),
> but I did run some tests on a p2020, which has the same features. However
> I'm eager to see your results.

So, I've collected data on 8349 (520MHz single core) and 8641D (1GHz
dual core) and the results are kind of surprising (to me).  The SMP
target, which in theory should have benefited from the change, actually
saw about an 8% reduction in throughput.  And the slower single core saw
about a 5% increase.

I also retested the 8641D with just your 1st 3 patches (i.e. drop the
"Use separate NAPIs for Tx and Rx processing" patch) and it recovered
about 1/2 the lost throughput, but not all.

I've used your patches exactly as posted, and the same netperf cmdline.
I briefly experimented with disabling BQL on the 8349 but didn't see any
impact from doing that (consistent with what you'd reported).  I didn't
see any real large variations either (target and server on same switch),
but I'm thinking the scatter could be reduced further if I isolated
the switch entirely to just the target and server.  I'll do that if I
end up doing any more testing on this, since the averages seem to be
reproducible to about +/- 2% at the moment...

Paul.

 --------------
Command: netperf -l 20 -cC -H 192.168.146.65 -t TCP_STREAM -- -m 1500
net-next baseline: commit 1f07b62f3205f6ed41759df2892eaf433bc051a1
fsl RFC:  http://patchwork.ozlabs.org/patch/175919/ applied to above.
Default queue sizes (256), BQL defaults.

8349 (528 MHz, single core):
   net-next 10 runs
   avg=123
   max=124
   min=121
   send utilization > 99%
   
   fsl RFC 13 runs:
   avg=129 (+ ~5%)
   max=131
   min=127
   send utilization > 99%
   
8641D: (1GHz, dual core) 
   net-next 10 runs
   avg=826
   max=839
   min=807
   send utilization ~ 70%

   fsl RFC 12 runs
   avg=762 (- ~8%)
   max=783
   min=698
   send utilization ~ 70%

   fsl RFC, _only_ 1st 3 of 4 patches, 13 runs
   avg=794 (- ~4%)
   max=816
   min=758
   send utilization ~ 70%
 --------------
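
For reference, avg/max/min figures like the ones above can be derived
from raw netperf output with a short helper along these lines. This is a
hypothetical sketch, not part of any posted patch; it assumes the
TCP_STREAM output format shown earlier in the thread, where throughput
is the fifth field of the result line.

```c
#include <stdio.h>

/* Parse the throughput field (10^6 bits/s) out of one netperf
 * TCP_STREAM result line, e.g.:
 *   " 87380  16384   1500    20.00       657.47   99.95 ..." */
int parse_throughput(const char *line, double *mbps)
{
	int rsock, ssock, msg;
	double secs;

	return sscanf(line, "%d %d %d %lf %lf",
		      &rsock, &ssock, &msg, &secs, mbps) == 5;
}

/* Compute the avg/max/min summary used in the tables above. */
void summarize(const double *v, int n, double *avg, double *max, double *min)
{
	double sum = 0.0;
	int i;

	*max = *min = v[0];
	for (i = 0; i < n; i++) {
		sum += v[i];
		if (v[i] > *max)
			*max = v[i];
		if (v[i] < *min)
			*min = v[i];
	}
	*avg = sum / n;
}
```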

> Thanks for helping.
> 
> Regards,
> Claudiu
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-16 15:36               ` Paul Gortmaker
@ 2012-08-17 11:28                 ` Claudiu Manoil
  0 siblings, 0 replies; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-17 11:28 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: Tomas Hruby, Eric Dumazet, netdev, David S. Miller

On 08/16/2012 06:36 PM, Paul Gortmaker wrote:
> [Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 14/08/2012 (Tue 19:08) Claudiu Manoil wrote:
>
>> On 08/14/2012 04:15 AM, Paul Gortmaker wrote:
>>> This is a lot lower variation than what you reported earlier (20
>>> versus 200, I think). It was the variation that raised a red flag
>>> for me...
>> Hi Paul,
>> The earlier variation, which is much bigger (indeed ~200), was
>> observed on a p1020 (slow, 2 cores, MQ_MG_MODE).
>> I did not collect however as detailed measurement results for that
>> board, as I did for p1010 (previous email).
>> The most important performance improvement I've noticed however was
>> on the p1020 platform.
>>
>>>> By changing the coalescing settings from default* (rx coalescing off,
>>>> tx-usecs: 10, tx-frames: 16) to:
>>>> "ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32"
>>>> we get a throughput of ~710 Mbps.
>>>>
>>>> For *Image 2)*, using the default tcp_limit_output_bytes value
>>>> (131072) - I've noticed
>>>> that "tweaking" tcp_limit_output_bytes does not improve the
>>>> throughput -, we get the
>>>> following performance numbers:
>>>> * default coalescing settings: ~650 Mbps
>>>> * rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32: ~860-880 Mbps
>>>>
>>>> For *Image 3)*, by disabling BQL (CONFIG_BQL = n), there's *no*
>>>> relevant performance
>>>> improvement compared to Image 1).
>>>> (note:
>>>> For all the measurements, rx and tx BD ring sizes have been set to
>>>> 64, for best performance.)
>>>>
>>>> So, I really tend to believe that the performance degradation comes
>>>> primarily from the driver,
>>>> and the napi poll processing turns out to be an important source for
>>>> that. The proposed patches
>>> This would make sense, if the CPU was slammed at 100% load in dealing
>>> with the tx processing, and the change made the driver considerably more
>>> efficient.  But is that really the case?  Is the p1010 really going flat
>>> out just to handle the Tx processing?  Have you done any sort of
>>> profiling to confirm/deny where the CPU is spending its time?
>> The current gfar_poll implementation processes first the tx
>> confirmation path exhaustively, without a budget/ work limit,
>> and only then proceeds with the rx processing within the allotted
>> budget. And this happens for both Rx and Tx confirmation
>> interrupts. I find this unfair and out of balance. Maybe by letting
>> rx processing to be triggered by rx interrupts only, and
>> the tx conf path processing to be triggered by tx confirmation
>> interrupts only, and, on top of that, by imposing a work limit
>> on the tx confirmation path too, we get a more responsive driver
>> that performs better. Indeed some profiling data to
>> confirm this would be great, but I don't have it.
>>
>> There's another issue that seems to be solved by this patchset, and
>> I've noticed it only on p1020rdb (this time).
>> And that is excessive Rx busy interrupts occurrence. Solving this
>> issue may be another factor for the performance
>> improvement on p1020. But maybe this is another discussion.
>>
>>>> show substantial improvement, especially for SMP systems where Tx
>>>> and Rx processing may be
>>>> done in parallel.
>>>> What do you think?
>>>> Is it ok to proceed by re-spinning the patches? Do you recommend
>>>> additional measurements?
>>> Unfortunately Eric is out this week, so we will be without his input for
>>> a while.  However, we are only at 3.6-rc1 -- meaning net-next will be
>>> open for quite some time, hence no need to rush to try and jam stuff in.
>>>
>>> Also, I have two targets I'm interested in testing your patches on.  The
>>> 1st is a 500MHz mpc8349 board -- which should replicate what you see on
>>> your p1010 (slow, single core).  The other is an 8641D, which is
>>> interesting since it will give us the SMP tx/rx as separate threads, but
>>> without the MQ_MG_MODE support (is that a correct assumption?)
>>>
>>> I don't have any fundamental problem with your patches (although 4/4
>>> might be better as two patches) -- the above targets/tests are only
>>> of interest, since I'm not convinced we yet understand _why_ your
>>> changes give a performance boost, and there might be something
>>> interesting hiding in there.
>>>
>>> So, while Eric is out, let me see if I can collect some more data on
>>> those two targets sometime this week.
>> Great, I don't mean to rush.  The more data we get on this the better.
>> It would be great if you could do some measurements on your platforms too.
>> 8641D is indeed a dual core with etsec 1.x (so without the MQ_MG_MODE),
>> but I did run some tests on a p2020, which has the same features. However
>> I'm eager to see your results.
> So, I've collected data on 8349 (520MHz single core) and 8641D (1GHz
> dual core) and the results are kind of surprising (to me).  The SMP
> target, which in theory should have benefited from the change, actually
> saw about an 8% reduction in throughput.  And the slower single core saw
> about a 5% increase.
>
> I also retested the 8641D with just your 1st 3 patches (i.e. drop the
> "Use separate NAPIs for Tx and Rx processing" patch) and it recovered
> about 1/2 the lost throughput, but not all.
>
> I've used your patches exactly as posted, and the same netperf cmdline.
> I briefly experimented with disabling BQL on the 8349 but didn't see any
> impact from doing that (consistent with what you'd reported).  I didn't
> see any real large variations either (target and server on same switch),
> but I'm thinking the scatter could be reduced further if I isolated
> the switch entirely to just the target and server.  I'll do that if I
> end up doing any more testing on this, since the averages seem to be
> reproducible to about +/- 2% at the moment...
>
> Paul.
>
>   --------------
> Command: netperf -l 20 -cC -H 192.168.146.65 -t TCP_STREAM -- -m 1500
> next-next baseline: commit 1f07b62f3205f6ed41759df2892eaf433bc051a1
> fsl RFC:  http://patchwork.ozlabs.org/patch/175919/ applied to above.
> Default queue sizes (256), BQL defaults.
>
> 8349 (528 MHz, single core):
>     net-next 10 runs
>     avg=123
>     max=124
>     min=121
>     send utilization > 99%
>     
>     fsl RFC 13 runs:
>     avg=129 (+ ~5%)
>     max=131
>     min=127
>     send utilization > 99%
>     
> 8641D: (1GHz, dual core)
>     net-next 10 runs
>     avg=826
>     max=839
>     min=807
>     send utilization ~ 70%
>
>     fsl RFC 12 runs
>     avg=762 (- ~8%)
>     max=783
>     min=698
>     send utilization ~ 70%
>
>     fsl RFC, _only_ 1st 3 of 4 patches, 13 runs
>     avg=794 (- ~4%)
>     max=816
>     min=758
>     send utilization ~ 70%
>   --------------
Hello Paul,
Thanks again for the measurements. It will take me some time to "digest"
the results and even do more tests/analysis on the platforms at my
disposal.
Your results are indeed surprising, but there are some noticeable
differences between our setups too.
First of all, as noted before, I'm using BD rings of size 64 for best
performance (as this proved to be an optimal setting over time). So,
before starting any tests, I was issuing: "ethtool -G <ethX> rx 64 tx 64".
Another point is that, to enhance the performance gain, I was using some
"decent" interrupt coalescing settings (at least to have rx coalescing
enabled too, which is off by default). So I've been using:
"ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 tx-usecs 32"
I think that the proposed code enhancement requires some balanced
interrupt coalescing settings too, for best results.
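
For reference, a minimal sketch of that pre-test tuning as a repeatable
script -- eth1, the ring size of 64, and the 22-frame/32-usec coalescing
values are the ones quoted in this mail. The script only prints the
commands (a dry run); drop the outer echo wrappers to actually apply them:

```shell
IFACE=eth1

tune() {
    # Shrink the Rx/Tx BD rings from the default 256 down to 64.
    echo "ethtool -G $IFACE rx 64 tx 64"
    # Enable Rx coalescing (off by default) with balanced Rx/Tx settings.
    echo "ethtool -C $IFACE rx-frames 22 rx-usecs 32 tx-frames 22 tx-usecs 32"
}

tune
```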

It's interesting that with these settings, I was able to reach ~940 Mbps
on a p2020rdb (which is also "single queue single group", but with two
1.2GHz e500v2 cores), both with and without the RFC patches.

Another point to consider when doing these measurements on SMP systems
is that Rx/Tx interrupt handling should happen on distinct CPUs.
I think this happens by default for netperf on the "non-MQ_MG_MODE"
systems (like 8641D, or p2020), but this condition must be verified for
the MQ_MG_MODE systems (like p1020) by checking /proc/interrupts, and
(if needed) forced by setting interrupt affinities accordingly.
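
A hedged sketch of what I mean by forcing the affinities on a two-core
board: the IRQ numbers below (29 for Rx, 30 for Tx) are placeholders --
the real ones must be read from /proc/interrupts for the interface under
test. The masks are CPU bitmasks (1 = CPU0, 2 = CPU1), and the helper
only prints the commands; drop the outer echo to apply them:

```shell
# grep eth1 /proc/interrupts   <- find the actual Rx/Tx IRQ numbers first

pin_irq() {
    irq=$1; mask=$2
    # Dry run: print the smp_affinity write instead of performing it.
    echo "echo $mask > /proc/irq/$irq/smp_affinity"
}

pin_irq 29 1   # Rx interrupt -> CPU0
pin_irq 30 2   # Tx confirmation interrupt -> CPU1
```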
Btw., do you happen to have a p1020 board at your disposal too?

Best regards,
Claudiu


end of thread, other threads:[~2012-08-17 11:28 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-08 12:26 [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Claudiu Manoil
2012-08-08 12:26 ` [RFC net-next 1/4] gianfar: Remove redundant programming of [rt]xic registers Claudiu Manoil
2012-08-08 12:26   ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Claudiu Manoil
2012-08-08 12:26     ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil
2012-08-08 12:26       ` [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing Claudiu Manoil
2012-08-14  0:51         ` Paul Gortmaker
2012-08-14 16:08           ` Claudiu Manoil
2012-08-08 15:44       ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Paul Gortmaker
2012-08-09 16:24         ` Claudiu Manoil
2012-08-15  1:29       ` Paul Gortmaker
2012-08-08 16:11     ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Paul Gortmaker
2012-08-09 16:04       ` Claudiu Manoil
2012-08-08 16:24 ` [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Paul Gortmaker
2012-08-08 16:44   ` Eric Dumazet
2012-08-08 23:06     ` Tomas Hruby
2012-08-09 15:07       ` Claudiu Manoil
2012-08-13 16:23         ` Claudiu Manoil
2012-08-14  1:15           ` Paul Gortmaker
2012-08-14 16:08             ` Claudiu Manoil
2012-08-16 15:36               ` Paul Gortmaker
2012-08-17 11:28                 ` Claudiu Manoil
