* [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
@ 2007-09-14  9:00 Krishna Kumar
  2007-09-14  9:01 ` [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching Krishna Kumar
                   ` (11 more replies)
  0 siblings, 12 replies; 107+ messages in thread
From: Krishna Kumar @ 2007-09-14  9:00 UTC (permalink / raw)
  To: johnpol, herbert, hadi, kaber, shemminger, davem
  Cc: jagana, Robert.Olsson, rick.jones2, xma, gaagaan, netdev,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general,
	kumarkr, tgraf, randy.dunlap, Krishna Kumar, sri

This set of patches implements the batching xmit capability and adds support
for batching in IPoIB and E1000 (the E1000 driver changes are ported from
Jamal's code for an older kernel, with thanks).

List of changes from previous revision:
----------------------------------------
1. [Dave] Enable batching by default (change in register_netdev).
2. [Randy] Update documentation (however, the ethtool command to get/set
	batching is not implemented yet, so the usage shown is a guess).
3. [KK] When changing tx_batch_skb, qdisc xmits need to be blocked, since
	qdisc_restart() drops queue_lock before calling the driver xmit and
	the driver could otherwise see blist change under it (see the sketch
	after this list).
4. [KK] sched: requeue could wrongly requeue an skb that was already put on
	the batching list (when a single skb was passed to the device but not
	sent because the device was full, the skb ended up added to blist).
	Fixing this also slightly optimizes batching: skbs #2 onwards no
	longer need to check for gso_skb, since that is always the first skb
	processed.
5. [KK] Change documentation to explain this behavior.
6. [KK] sched: Fix panic when GSO is enabled in the driver.
7. [KK] IPoIB: Small optimization in ipoib_ib_handle_tx_wc.
8. [KK] netdevice: Needed to change NETIF_F_GSO_SHIFT/NETIF_F_GSO_MASK as
	NETIF_F_BATCH_SKBS is now defined as 65536 (it earlier used 8192,
	which has since been taken by NETIF_F_NETNS_LOCAL).
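
(For reference, the blocking pattern that item 3 refers to looks roughly as
below. This is only a condensed illustration of dev_change_tx_batch_skb()
from patch 4/10; switch_blist is a made-up name and not part of the patches.)

	static int switch_blist(struct net_device *dev,
				struct sk_buff_head *new_blist)
	{
		/*
		 * Wait for any in-progress qdisc run to finish and keep new
		 * ones from starting, since qdisc_restart() drops queue_lock
		 * around the driver xmit call.
		 */
		qdisc_block(dev);

		spin_lock_bh(&dev->queue_lock);
		if (new_blist) {
			skb_queue_head_init(new_blist);
			dev->skb_blist = new_blist;
		} else {
			free_batching(dev);	/* purge and free old list */
		}
		spin_unlock_bh(&dev->queue_lock);

		qdisc_unblock(dev);
		return 0;
	}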


Will submit in the next 1-2 days:
---------------------------------
1. [Auke] Enable batching in e1000e.


Extras that I can do later:
---------------------------
1. [Patrick] Use skb_blist statically in netdevice. This could also be used
	to integrate GSO and batching.
2. [Evgeniy] Splice lists in dev_add_skb_to_blist (this can also be done
	for regular xmits of GSO skbs, for #1 above).

Patches are described as:
		 Mail 0/10:  This mail
		 Mail 1/10:  HOWTO documentation
		 Mail 2/10:  Introduce skb_blist, NETIF_F_BATCH_SKBS, use
		 	     single API for batching/no-batching, etc.
		 Mail 3/10:  Modify qdisc_run() to support batching
		 Mail 4/10:  Add ethtool support to enable/disable batching
		 Mail 5/10:  IPoIB: Header file changes to use batching
		 Mail 6/10:  IPoIB: CM & Multicast changes
		 Mail 7/10:  IPoIB: Verbs changes to use batching
		 Mail 8/10:  IPoIB: Internal post and work completion handler
		 Mail 9/10:  IPoIB: Implement the new batching capability
		 Mail 10/10: E1000: Implement the new batching capability

Issues:
--------
The retransmission problem reported earlier seems to happen when mthca is
used as the underlying device; when I tested ehca, retransmissions dropped
to normal levels (around 2 times that of the regular code). The performance
improvement is around 55% for TCP.

Please review and provide feedback; and consider for inclusion.

Thanks,

- KK

----------------------------------------------------
			TCP
			----
Test			Org	New	%Change
Size:32 Procs:1		2728	3544	29.91
Size:128 Procs:1	11803	13679	15.89
Size:512 Procs:1	43279	49665	14.75
Size:4096 Procs:1	147952	101246	-31.56
Size:16384 Procs:1	149852	141897	-5.30

Size:32 Procs:4		10562	11349	7.45
Size:128 Procs:4	41010	40832	-.43
Size:512 Procs:4	75374	130943	73.72
Size:4096 Procs:4	167996	368218	119.18
Size:16384 Procs:4	123176	379524	208.11

Size:32 Procs:8		21125	21990	4.09
Size:128 Procs:8	77419	78605	1.53
Size:512 Procs:8	234678	265047	12.94
Size:4096 Procs:8	218063	367604	68.57
Size:16384 Procs:8	184283	370972	101.30

Average:	1509300 -> 2345115 = 55.38%
----------------------------------------------------


* [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
@ 2007-09-14  9:01 ` Krishna Kumar
  2007-09-14 18:37   ` [ofa-general] " Randy Dunlap
  2007-09-14  9:01 ` [PATCH 2/10 REV5] [core] Add skb_blist & support " Krishna Kumar
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Krishna Kumar @ 2007-09-14  9:01 UTC (permalink / raw)
  To: johnpol, herbert, hadi, kaber, shemminger, davem
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, kumarkr, xma,
	gaagaan, netdev, rdreier, rick.jones2, mcarlson, jeff, general,
	mchan, tgraf, randy.dunlap, Krishna Kumar, sri

Add documentation describing the batching skb xmit capability.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 batching_skb_xmit.txt |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 107 insertions(+)

diff -ruNp org/Documentation/networking/batching_skb_xmit.txt new/Documentation/networking/batching_skb_xmit.txt
--- org/Documentation/networking/batching_skb_xmit.txt	1970-01-01 05:30:00.000000000 +0530
+++ new/Documentation/networking/batching_skb_xmit.txt	2007-09-14 10:25:36.000000000 +0530
@@ -0,0 +1,107 @@
+		 HOWTO for batching skb xmit support
+		 -----------------------------------
+
+Section 1: What is batching skb xmit
+Section 2: How batching xmit works vs the regular xmit
+Section 3: How drivers can support batching
+Section 4: Nitty gritty details for driver writers
+Section 5: How users can work with batching
+
+
+Introduction: Kernel support for batching skb
+----------------------------------------------
+
+A new capability to support xmit of multiple skbs is provided in the netdevice
+layer. Drivers which enable this capability should be able to process multiple
+skbs in a single call to their xmit handler.
+
+
+Section 1: What is batching skb xmit
+-------------------------------------
+
+	This capability is optionally enabled by a driver by setting the
+	NETIF_F_BATCH_SKBS bit in dev->features. The prerequisite for a
+	driver to use this capability is that it should have a reasonably-
+	sized hardware queue that can process multiple skbs.
+
+
+Section 2: How batching xmit works vs the regular xmit
+-------------------------------------------------------
+
+	The network stack gets called from upper layer protocols with a single
+	skb to transmit. This skb is first enqueued and an attempt is made to
+	transmit it immediately (via qdisc_run). However, events like tx lock
+	contention, the tx queue being stopped, etc., can result in the skb
+	not getting sent out, so it remains in the queue. When the next xmit
+	is called or when the queue is re-enabled, qdisc_run could potentially
+	find multiple packets in the queue and iteratively send them all out
+	one-by-one.
+
+	Batching skb xmit is a mechanism to exploit this situation by passing
+	all of those skbs to the device in one shot. This reduces driver
+	processing, amortizes locking in the driver (or in the stack for
+	~LLTX drivers) over multiple skbs, and, for drivers where every xmit
+	results in completion processing (like IPoIB), lets the driver
+	request a completion only for the last skb of the batch, saving an
+	interrupt for every other skb sent in the same batch.
+
+	Batching can result in significant performance gains for systems that
+	have multiple data stream paths over the same network interface card.
+
+
+Section 3: How drivers can support batching
+---------------------------------------------
+
+	Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in
+	dev->features.
+
+	The driver's xmit handler should be modified to process multiple skbs
+	instead of one skb. The xmit handler is called either with an skb to
+	transmit or with a NULL skb; the latter is a request to transmit
+	multiple skbs, which the driver handles by sending out all of the
+	skbs on the dev->skb_blist list (where they were added by the core
+	stack).
+
+
+Section 4: Nitty gritty details for driver writers
+--------------------------------------------------
+
+	Batching is used by the core networking stack only from softirq
+	context (NET_TX_SOFTIRQ); dev_queue_xmit() does not use batching.
+
+	This leads to the following situation:
+		An skb was not sent out because either the driver lock was
+		contended or the device was blocked. When the softirq handler
+		runs, it moves all skbs from the device queue to the batch
+		list, but it too could fail to send due to lock contention.
+		The next xmit (of a single skb) called from dev_queue_xmit()
+		will not use batching and will try to xmit that skb while
+		previous skbs are still present in the batch list. This
+		results in the receiver getting out-of-order packets, and in
+		the case of TCP the sender performing unnecessary
+		retransmissions.
+
+	To avoid this problem, a driver whose xmit handler is called with an
+	skb must handle error cases as follows:
+		1. If the driver cannot get the tx lock, return
+		   NETDEV_TX_LOCKED as usual. This allows the qdisc to
+		   requeue the skb.
+		2. If the driver got the lock but failed to send the skb, it
+		   should queue the skb on the batch list and then return
+		   NETDEV_TX_BUSY. In this case, the qdisc does not requeue
+		   the skb.
+
+
+Section 5: How users can work with batching
+--------------------------------------------
+
+	Batching can be disabled for a particular device, e.g. on desktop
+	systems where only one stream of network activity uses that device,
+	since the extra processing that batching adds could slightly affect
+	performance (unless packets are being sent fast enough that the
+	queue gets stopped). Batching can be enabled when more than one
+	stream of network activity per device is expected, e.g. on servers,
+	or even on desktops with multiple browser, chat and file transfer
+	sessions.
+
+	Per device batching can be enabled/disabled by:
+		ethtool <dev> batching on/off

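To make Sections 3 and 4 concrete, below is a rough sketch of what a
batching-aware xmit handler could look like under this API. It is not taken
from any patch in this series; my_xmit, my_priv, hw_queue_full() and
hw_tx_one() are hypothetical driver internals, and the trylock mimics an
LLTX driver such as IPoIB.

	static int my_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		struct my_priv *priv = netdev_priv(dev);

		if (!spin_trylock(&priv->tx_lock))
			return NETDEV_TX_LOCKED;	/* qdisc requeues */

		if (skb == NULL) {
			/* Batch call: drain everything on dev->skb_blist */
			while ((skb = __skb_dequeue(dev->skb_blist)) != NULL) {
				if (hw_queue_full(priv)) {
					/*
					 * Put it back on the blist; the qdisc
					 * must not requeue it (that would
					 * reorder packets).
					 */
					__skb_queue_head(dev->skb_blist, skb);
					netif_stop_queue(dev);
					spin_unlock(&priv->tx_lock);
					return NETDEV_TX_BUSY;
				}
				hw_tx_one(priv, skb);
			}
		} else {
			/* Single-skb call */
			if (hw_queue_full(priv)) {
				/* Rule 2 of Section 4: queue to the blist,
				 * then return BUSY. */
				__skb_queue_tail(dev->skb_blist, skb);
				netif_stop_queue(dev);
				spin_unlock(&priv->tx_lock);
				return NETDEV_TX_BUSY;
			}
			hw_tx_one(priv, skb);
		}

		spin_unlock(&priv->tx_lock);
		return NETDEV_TX_OK;
	}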

* [PATCH 2/10 REV5] [core] Add skb_blist & support for batching
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
  2007-09-14  9:01 ` [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching Krishna Kumar
@ 2007-09-14  9:01 ` Krishna Kumar
  2007-09-14 12:46   ` [ofa-general] " Evgeniy Polyakov
  2007-09-14  9:01 ` [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching Krishna Kumar
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Krishna Kumar @ 2007-09-14  9:01 UTC (permalink / raw)
  To: johnpol, herbert, hadi, kaber, shemminger, davem
  Cc: jagana, Robert.Olsson, rick.jones2, xma, gaagaan, kumarkr,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, Krishna Kumar,
	general, netdev, tgraf, randy.dunlap, mchan, sri

Introduce skb_blist, NETIF_F_BATCH_SKBS, use single API for
batching/no-batching, etc.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/linux/netdevice.h |    8 ++++++--
 net/core/dev.c            |   29 ++++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 5 deletions(-)

diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h	2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/netdevice.h	2007-09-14 10:26:21.000000000 +0530
@@ -439,10 +439,11 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BATCH_SKBS	65536	/* Driver supports multiple skbs/xmit */
 
 	/* Segmentation offload features */
-#define NETIF_F_GSO_SHIFT	16
-#define NETIF_F_GSO_MASK	0xffff0000
+#define NETIF_F_GSO_SHIFT	17
+#define NETIF_F_GSO_MASK	0xfffe0000
 #define NETIF_F_TSO		(SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)
 #define NETIF_F_UFO		(SKB_GSO_UDP << NETIF_F_GSO_SHIFT)
 #define NETIF_F_GSO_ROBUST	(SKB_GSO_DODGY << NETIF_F_GSO_SHIFT)
@@ -548,6 +549,9 @@ struct net_device
 	/* Partially transmitted GSO packet. */
 	struct sk_buff		*gso_skb;
 
+	/* List of batch skbs (optional, used if driver supports skb batching) */
+	struct sk_buff_head	*skb_blist;
+
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
 	struct Qdisc		*qdisc_ingress;
diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2007-09-14 10:24:27.000000000 +0530
+++ new/net/core/dev.c	2007-09-14 10:25:36.000000000 +0530
@@ -953,6 +953,16 @@ void netdev_state_change(struct net_devi
 	}
 }
 
+static void free_batching(struct net_device *dev)
+{
+	if (dev->skb_blist) {
+		if (!skb_queue_empty(dev->skb_blist))
+			skb_queue_purge(dev->skb_blist);
+		kfree(dev->skb_blist);
+		dev->skb_blist = NULL;
+	}
+}
+
 /**
  *	dev_load 	- load a network module
  *	@name: name of interface
@@ -1534,7 +1544,10 @@ static int dev_gso_segment(struct sk_buf
 
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-	if (likely(!skb->next)) {
+	if (likely(skb)) {
+		if (unlikely(skb->next))
+			goto gso;
+
 		if (!list_empty(&ptype_all))
 			dev_queue_xmit_nit(skb, dev);
 
@@ -1544,10 +1557,10 @@ int dev_hard_start_xmit(struct sk_buff *
 			if (skb->next)
 				goto gso;
 		}
-
-		return dev->hard_start_xmit(skb, dev);
 	}
 
+	return dev->hard_start_xmit(skb, dev);
+
 gso:
 	do {
 		struct sk_buff *nskb = skb->next;
@@ -3566,6 +3579,13 @@ int register_netdevice(struct net_device
 		}
 	}
 
+	if (dev->features & NETIF_F_BATCH_SKBS) {
+		/* Driver supports batching skb */
+		dev->skb_blist = kmalloc(sizeof *dev->skb_blist, GFP_KERNEL);
+		if (dev->skb_blist)
+			skb_queue_head_init(dev->skb_blist);
+	}
+
 	/*
 	 *	nil rebuild_header routine,
 	 *	that should be never called and used as just bug trap.
@@ -3901,6 +3921,9 @@ void unregister_netdevice(struct net_dev
 
 	synchronize_net();
 
+	/* Deallocate batching structure */
+	free_batching(dev);
+
 	/* Shutdown queueing discipline. */
 	dev_shutdown(dev);
 

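As a rough illustration of the single API above (not part of the patch): a
driver opts in with one feature bit before register_netdev(), after which
its xmit handler may be invoked in two ways. my_xmit is a placeholder name,
mirroring what the e1000 change in patch 10/10 does.

	/* In the driver's probe routine, before register_netdev(): */
	dev->hard_start_xmit = my_xmit;
	dev->features       |= NETIF_F_BATCH_SKBS;

	/*
	 * register_netdevice() then allocates dev->skb_blist, and the core
	 * may call the handler either way:
	 *	dev->hard_start_xmit(skb,  dev);  - transmit this one skb
	 *	dev->hard_start_xmit(NULL, dev);  - transmit all skbs queued
	 *					    on dev->skb_blist
	 */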

* [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
  2007-09-14  9:01 ` [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching Krishna Kumar
  2007-09-14  9:01 ` [PATCH 2/10 REV5] [core] Add skb_blist & support " Krishna Kumar
@ 2007-09-14  9:01 ` Krishna Kumar
  2007-09-14 12:15   ` [ofa-general] " Evgeniy Polyakov
  2007-09-14  9:02 ` [PATCH 4/10 REV5] [ethtool] Add ethtool support Krishna Kumar
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Krishna Kumar @ 2007-09-14  9:01 UTC (permalink / raw)
  To: johnpol, herbert, hadi, kaber, shemminger, davem
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, xma, gaagaan,
	kumarkr, rdreier, rick.jones2, mcarlson, jeff, mchan, general,
	netdev, tgraf, randy.dunlap, Krishna Kumar, sri

Modify qdisc_run() to support batching: callers of qdisc_run() pass in the
batch list, and qdisc_restart() implements the batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/linux/netdevice.h |    2 
 include/net/pkt_sched.h   |   17 +++++--
 net/core/dev.c            |   45 ++++++++++++++++++
 net/sched/sch_generic.c   |  109 ++++++++++++++++++++++++++++++++++++----------
 4 files changed, 145 insertions(+), 28 deletions(-)

diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
--- org/include/net/pkt_sched.h	2007-09-13 09:11:09.000000000 +0530
+++ new/include/net/pkt_sched.h	2007-09-14 10:25:36.000000000 +0530
@@ -80,13 +80,24 @@ extern struct qdisc_rate_table *qdisc_ge
 		struct rtattr *tab);
 extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
 
-extern void __qdisc_run(struct net_device *dev);
+static inline void qdisc_block(struct net_device *dev)
+{
+	while (test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
+		yield();
+}
+
+static inline void qdisc_unblock(struct net_device *dev)
+{
+	clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
+}
+
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
 
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	if (!netif_queue_stopped(dev) &&
 	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
-		__qdisc_run(dev);
+		__qdisc_run(dev, blist);
 }
 
 extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h	2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/netdevice.h	2007-09-14 10:26:21.000000000 +0530
@@ -1013,6 +1013,8 @@ extern int		dev_set_mac_address(struct n
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_add_skb_to_blist(struct sk_buff *skb,
+					     struct net_device *dev);
 
 extern int		netdev_budget;
 
diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
--- org/net/sched/sch_generic.c	2007-09-13 09:11:10.000000000 +0530
+++ new/net/sched/sch_generic.c	2007-09-14 10:25:36.000000000 +0530
@@ -59,26 +59,30 @@ static inline int qdisc_qlen(struct Qdis
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
-	if (unlikely(skb->next))
-		dev->gso_skb = skb;
-	else
-		q->ops->requeue(skb, q);
+	if (skb) {
+		if (unlikely(skb->next))
+			dev->gso_skb = skb;
+		else
+			q->ops->requeue(skb, q);
+	}
 
 	netif_schedule(dev);
 	return 0;
 }
 
-static inline struct sk_buff *dev_dequeue_skb(struct net_device *dev,
-					      struct Qdisc *q)
+static inline int dev_requeue_skb_wrapper(struct sk_buff *skb,
+					  struct net_device *dev,
+					  struct Qdisc *q)
 {
-	struct sk_buff *skb;
-
-	if ((skb = dev->gso_skb))
-		dev->gso_skb = NULL;
-	else
-		skb = q->dequeue(q);
+	if (dev->skb_blist) {
+		/*
+		 * In case of tx full, batching drivers would have put all
+		 * skbs into skb_blist so there is no skb to requeue.
+		 */
+		skb = NULL;
+	}
 
-	return skb;
+	return dev_requeue_skb(skb, dev, q);
 }
 
 static inline int handle_dev_cpu_collision(struct sk_buff *skb,
@@ -91,10 +95,15 @@ static inline int handle_dev_cpu_collisi
 		/*
 		 * Same CPU holding the lock. It may be a transient
 		 * configuration error, when hard_start_xmit() recurses. We
-		 * detect it by checking xmit owner and drop the packet when
-		 * deadloop is detected. Return OK to try the next skb.
+		 * detect it by checking xmit owner and drop the packet (or
+		 * all packets in batching case) when deadloop is detected.
+		 * Return OK to try the next skb.
 		 */
-		kfree_skb(skb);
+		if (likely(skb))
+			kfree_skb(skb);
+		else if (!skb_queue_empty(dev->skb_blist))
+			skb_queue_purge(dev->skb_blist);
+
 		if (net_ratelimit())
 			printk(KERN_WARNING "Dead loop on netdevice %s, "
 			       "fix it urgently!\n", dev->name);
@@ -111,6 +120,53 @@ static inline int handle_dev_cpu_collisi
 	return ret;
 }
 
+#define DEQUEUE_SKB(q)		(q->dequeue(q))
+
+static inline struct sk_buff *get_gso_skb(struct net_device *dev)
+{
+	struct sk_buff *skb;
+
+	if ((skb = dev->gso_skb))
+		dev->gso_skb = NULL;
+
+	return skb;
+}
+
+/*
+ * Algorithm to get skb(s) is:
+ *	- If gso skb present, return it.
+ *	- Non-batching drivers, or if the batch list is empty and there is at
+ *	  most one skb in the queue - dequeue the skb and put it in *skbp to
+ *	  tell the caller to use the single-skb xmit API.
+ *	- Batching drivers where the batch list already contains at least one
+ *	  skb, or if there are multiple skbs in the queue: keep dequeuing
+ *	  skbs up to a limit and set *skbp to NULL to tell the caller to use
+ *	  the multiple-skb xmit API.
+ *
+ * Returns:
+ *	1 - at least one skb is to be sent out, *skbp contains the skb or
+ *	    NULL (when >1 skbs are present in blist for batching)
+ *	0 - no skbs to be sent.
+ */
+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+			  struct sk_buff_head *blist, struct sk_buff **skbp)
+{
+	if ((*skbp = get_gso_skb(dev)) != NULL)
+		return 1;
+
+	if (!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
+		return likely((*skbp = DEQUEUE_SKB(q)) != NULL);
+	} else {
+		struct sk_buff *skb;
+		int max = dev->tx_queue_len - skb_queue_len(blist);
+
+		while (max > 0 && (skb = DEQUEUE_SKB(q)) != NULL)
+			max -= dev_add_skb_to_blist(skb, dev);
+
+		return 1;	/* there is at least one skb in skb_blist */
+	}
+}
+
 /*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
  *
@@ -130,7 +186,8 @@ static inline int handle_dev_cpu_collisi
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+static inline int qdisc_restart(struct net_device *dev,
+				struct sk_buff_head *blist)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
@@ -138,7 +195,7 @@ static inline int qdisc_restart(struct n
 	int ret;
 
 	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
+	if (unlikely(get_skb(dev, q, blist, &skb) == 0))
 		return 0;
 
 	/*
@@ -168,7 +225,7 @@ static inline int qdisc_restart(struct n
 
 	switch (ret) {
 	case NETDEV_TX_OK:
-		/* Driver sent out skb successfully */
+		/* Driver sent out skb (or entire skb_blist) successfully */
 		ret = qdisc_qlen(q);
 		break;
 
@@ -183,21 +240,21 @@ static inline int qdisc_restart(struct n
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
 
-		ret = dev_requeue_skb(skb, dev, q);
+		ret = dev_requeue_skb_wrapper(skb, dev, q);
 		break;
 	}
 
 	return ret;
 }
 
-void __qdisc_run(struct net_device *dev)
+void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	do {
-		if (!qdisc_restart(dev))
+		if (!qdisc_restart(dev, blist))
 			break;
 	} while (!netif_queue_stopped(dev));
 
-	clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
+	qdisc_unblock(dev);
 }
 
 static void dev_watchdog(unsigned long arg)
@@ -575,6 +632,12 @@ void dev_deactivate(struct net_device *d
 	qdisc = dev->qdisc;
 	dev->qdisc = &noop_qdisc;
 
+	if (dev->skb_blist) {
+		/* Release skbs on batch list */
+		if (!skb_queue_empty(dev->skb_blist))
+			skb_queue_purge(dev->skb_blist);
+	}
+
 	qdisc_reset(qdisc);
 
 	skb = dev->gso_skb;
diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2007-09-14 10:24:27.000000000 +0530
+++ new/net/core/dev.c	2007-09-14 10:25:36.000000000 +0530
@@ -1542,6 +1542,46 @@ static int dev_gso_segment(struct sk_buf
 	return 0;
 }
 
+/*
+ * Add skb (skbs in case segmentation is required) to dev->skb_blist. No one
+ * can add to this list simultaneously since we are holding the QDISC_RUNNING
+ * bit. The list is also safe from simultaneous deletes since skbs are
+ * dequeued only when the driver is invoked.
+ *
+ * Returns count of successful skb(s) added to skb_blist.
+ */
+int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
+{
+	if (!list_empty(&ptype_all))
+		dev_queue_xmit_nit(skb, dev);
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree_skb(skb);
+			return 0;
+		}
+
+		if (skb->next) {
+			int count = 0;
+
+			do {
+				struct sk_buff *nskb = skb->next;
+
+				skb->next = nskb->next;
+				__skb_queue_tail(dev->skb_blist, nskb);
+				count++;
+			} while (skb->next);
+
+			/* Reset destructor for kfree_skb to work */
+			skb->destructor = DEV_GSO_CB(skb)->destructor;
+			kfree_skb(skb);
+			return count;
+		}
+	}
+	__skb_queue_tail(dev->skb_blist, skb);
+	return 1;
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	if (likely(skb)) {
@@ -1697,7 +1737,7 @@ gso:
 			/* reset queue_mapping to zero */
 			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
-			qdisc_run(dev);
+			qdisc_run(dev, NULL);
 			spin_unlock(&dev->queue_lock);
 
 			rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
@@ -1895,7 +1935,8 @@ static void net_tx_action(struct softirq
 			clear_bit(__LINK_STATE_SCHED, &dev->state);
 
 			if (spin_trylock(&dev->queue_lock)) {
-				qdisc_run(dev);
+				/* Send all skbs if driver supports batching */
+				qdisc_run(dev, dev->skb_blist);
 				spin_unlock(&dev->queue_lock);
 			} else {
 				netif_schedule(dev);

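The dequeue policy in get_skb() above can be modelled in a few lines of
ordinary user-space C. This is only an illustration of the choice between
the single-skb and batch paths (the names are made up, and GSO segmentation,
which can add more than one skb per dequeue, is ignored):

	#include <stdio.h>

	/*
	 * Returns the number of skbs moved to the batch list, or -1 for
	 * "use the single-skb path" (one skb dequeued directly).
	 */
	static int model_get_skb(int has_blist, int blist_len, int qdisc_len,
				 int tx_queue_len)
	{
		if (!has_blist || (blist_len == 0 && qdisc_len <= 1))
			return -1;		/* single-skb xmit API */

		{
			int max = tx_queue_len - blist_len; /* room left */
			return (qdisc_len < max) ? qdisc_len : max;
		}
	}

	int main(void)
	{
		/* Non-batching driver: always the single-skb path */
		printf("%d\n", model_get_skb(0, 0, 5, 128));	/* -1 */
		/* Batching driver, one skb queued: still single-skb path */
		printf("%d\n", model_get_skb(1, 0, 1, 128));	/* -1 */
		/* Batching driver, several skbs queued: batch them */
		printf("%d\n", model_get_skb(1, 0, 5, 128));	/* 5 */
		/* Batch list nearly full: limited by tx_queue_len */
		printf("%d\n", model_get_skb(1, 126, 5, 128));	/* 2 */
		return 0;
	}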

* [PATCH 4/10 REV5] [ethtool] Add ethtool support
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
                   ` (2 preceding siblings ...)
  2007-09-14  9:01 ` [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching Krishna Kumar
@ 2007-09-14  9:02 ` Krishna Kumar
  2007-09-14  9:02 ` [PATCH 5/10 REV5] [IPoIB] Header file changes Krishna Kumar
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Krishna Kumar @ 2007-09-14  9:02 UTC (permalink / raw)
  To: johnpol, herbert, hadi, kaber, shemminger, davem
  Cc: jagana, Robert.Olsson, rick.jones2, xma, gaagaan, kumarkr,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
	tgraf, randy.dunlap, netdev, Krishna Kumar, sri

Add ethtool support to enable/disable batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/linux/ethtool.h   |    2 ++
 include/linux/netdevice.h |    2 ++
 net/core/dev.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
 net/core/ethtool.c        |   27 +++++++++++++++++++++++++++
 4 files changed, 75 insertions(+)

diff -ruNp org/include/linux/ethtool.h new/include/linux/ethtool.h
--- org/include/linux/ethtool.h	2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/ethtool.h	2007-09-14 10:25:36.000000000 +0530
@@ -440,6 +440,8 @@ struct ethtool_ops {
 #define ETHTOOL_SFLAGS		0x00000026 /* Set flags bitmap(ethtool_value) */
 #define ETHTOOL_GPFLAGS		0x00000027 /* Get driver-private flags bitmap */
 #define ETHTOOL_SPFLAGS		0x00000028 /* Set driver-private flags bitmap */
+#define ETHTOOL_GBATCH		0x00000029 /* Get Batching (ethtool_value) */
+#define ETHTOOL_SBATCH		0x00000030 /* Set Batching (ethtool_value) */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET		ETHTOOL_GSET
diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h	2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/netdevice.h	2007-09-14 10:26:21.000000000 +0530
@@ -1331,6 +1331,8 @@ extern void		dev_set_promiscuity(struct 
 extern void		dev_set_allmulti(struct net_device *dev, int inc);
 extern void		netdev_state_change(struct net_device *dev);
 extern void		netdev_features_change(struct net_device *dev);
+extern int		dev_change_tx_batch_skb(struct net_device *dev,
+						unsigned long new_batch_skb);
 /* Load a device via the kmod */
 extern void		dev_load(struct net *net, const char *name);
 extern void		dev_mcast_init(void);
diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2007-09-14 10:24:27.000000000 +0530
+++ new/net/core/dev.c	2007-09-14 10:25:36.000000000 +0530
@@ -963,6 +963,50 @@ void free_batching(struct net_dev
 	}
 }
 
+int dev_change_tx_batch_skb(struct net_device *dev, unsigned long new_batch_skb)
+{
+	int ret = 0;
+	struct sk_buff_head *blist = NULL;
+
+	if (!(dev->features & NETIF_F_BATCH_SKBS)) {
+		/* Driver doesn't support batching skb API */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Check if the new value is the same as the current one (the !! on
+	 * new_batch_skb is paranoia, since it arrives as a boolean via ethtool).
+	 */
+	if (!!dev->skb_blist == !!new_batch_skb)
+		goto out;
+
+	if (new_batch_skb &&
+	    (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * Block xmit as qdisc_restart() drops queue_lock before calling
+	 * driver xmit, and driver could find blist change under it.
+	 */
+	qdisc_block(dev);
+
+	spin_lock_bh(&dev->queue_lock);
+	if (new_batch_skb) {
+		skb_queue_head_init(blist);
+		dev->skb_blist = blist;
+	} else
+		free_batching(dev);
+	spin_unlock_bh(&dev->queue_lock);
+
+	qdisc_unblock(dev);
+
+out:
+	return ret;
+}
+
 /**
  *	dev_load 	- load a network module
  *	@name: name of interface
diff -ruNp org/net/core/ethtool.c new/net/core/ethtool.c
--- org/net/core/ethtool.c	2007-09-13 09:11:10.000000000 +0530
+++ new/net/core/ethtool.c	2007-09-14 10:25:36.000000000 +0530
@@ -556,6 +556,26 @@ static int ethtool_set_gso(struct net_de
 	return 0;
 }
 
+static int ethtool_get_batch(struct net_device *dev, char __user *useraddr)
+{
+	struct ethtool_value edata = { ETHTOOL_GBATCH };
+
+	edata.data = dev->skb_blist != NULL;
+	if (copy_to_user(useraddr, &edata, sizeof(edata)))
+		 return -EFAULT;
+	return 0;
+}
+
+static int ethtool_set_batch(struct net_device *dev, char __user *useraddr)
+{
+	struct ethtool_value edata;
+
+	if (copy_from_user(&edata, useraddr, sizeof(edata)))
+		return -EFAULT;
+
+	return dev_change_tx_batch_skb(dev, edata.data);
+}
+
 static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
 {
 	struct ethtool_test test;
@@ -813,6 +833,7 @@ int dev_ethtool(struct net *net, struct 
 	case ETHTOOL_GGSO:
 	case ETHTOOL_GFLAGS:
 	case ETHTOOL_GPFLAGS:
+	case ETHTOOL_GBATCH:
 		break;
 	default:
 		if (!capable(CAP_NET_ADMIN))
@@ -956,6 +977,12 @@ int dev_ethtool(struct net *net, struct 
 		rc = ethtool_set_value(dev, useraddr,
 				       dev->ethtool_ops->set_priv_flags);
 		break;
+	case ETHTOOL_GBATCH:
+		rc = ethtool_get_batch(dev, useraddr);
+		break;
+	case ETHTOOL_SBATCH:
+		rc = ethtool_set_batch(dev, useraddr);
+		break;
 	default:
 		rc = -EOPNOTSUPP;
 	}

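Since ethtool(8) has no "batching" sub-command yet (patch 0/10 notes the
usage shown in the documentation is a guess), the new ioctl can be driven
directly from user space. A minimal sketch, assuming a kernel with this
patch applied; ETHTOOL_SBATCH matches the value defined above:

	/* set_batch.c - toggle tx batching via the new ethtool ioctl.
	 * Usage: set_batch <ifname> <0|1>
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <sys/socket.h>
	#include <net/if.h>
	#include <linux/ethtool.h>
	#include <linux/sockios.h>

	#ifndef ETHTOOL_SBATCH
	#define ETHTOOL_SBATCH 0x00000030	/* from the patch above */
	#endif

	int main(int argc, char **argv)
	{
		struct ethtool_value eval;
		struct ifreq ifr;
		int fd;

		if (argc != 3) {
			fprintf(stderr, "usage: %s <ifname> <0|1>\n", argv[0]);
			return 1;
		}

		fd = socket(AF_INET, SOCK_DGRAM, 0);
		if (fd < 0) {
			perror("socket");
			return 1;
		}

		memset(&ifr, 0, sizeof(ifr));
		strncpy(ifr.ifr_name, argv[1], IFNAMSIZ - 1);

		eval.cmd = ETHTOOL_SBATCH;
		eval.data = atoi(argv[2]);	/* 0 = off, non-zero = on */
		ifr.ifr_data = (char *)&eval;

		if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
			perror("SIOCETHTOOL");
			return 1;
		}

		printf("batching %s on %s\n",
		       eval.data ? "enabled" : "disabled", argv[1]);
		return 0;
	}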

* [PATCH 5/10 REV5] [IPoIB] Header file changes
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
                   ` (3 preceding siblings ...)
  2007-09-14  9:02 ` [PATCH 4/10 REV5] [ethtool] Add ethtool support Krishna Kumar
@ 2007-09-14  9:02 ` Krishna Kumar
  2007-09-14  9:03 ` [PATCH 6/10 REV5] [IPoIB] CM & Multicast changes Krishna Kumar
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Krishna Kumar @ 2007-09-14  9:02 UTC (permalink / raw)
  To: johnpol, herbert, hadi, kaber, shemminger, davem
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, xma, gaagaan,
	kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
	tgraf, randy.dunlap, netdev, Krishna Kumar, sri

IPoIB header file changes to use batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib.h |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h new/drivers/infiniband/ulp/ipoib/ipoib.h
--- org/drivers/infiniband/ulp/ipoib/ipoib.h	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib.h	2007-09-14 10:25:36.000000000 +0530
@@ -271,8 +271,8 @@ struct ipoib_dev_priv {
 	struct ipoib_tx_buf *tx_ring;
 	unsigned             tx_head;
 	unsigned             tx_tail;
-	struct ib_sge        tx_sge;
-	struct ib_send_wr    tx_wr;
+	struct ib_sge        *tx_sge;
+	struct ib_send_wr    *tx_wr;
 
 	struct ib_wc ibwc[IPOIB_NUM_WC];
 
@@ -367,8 +367,11 @@ static inline void ipoib_put_ah(struct i
 int ipoib_open(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
 
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, struct ipoib_ah *address,
+		      u32 qpn, int wr_num);
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn);
+		struct ipoib_ah *address, u32 qpn, int num_skbs);
 void ipoib_reap_ah(struct work_struct *work);
 
 void ipoib_flush_paths(struct net_device *dev);


* [PATCH 6/10 REV5] [IPoIB] CM & Multicast changes
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
                   ` (4 preceding siblings ...)
  2007-09-14  9:02 ` [PATCH 5/10 REV5] [IPoIB] Header file changes Krishna Kumar
@ 2007-09-14  9:03 ` Krishna Kumar
  2007-09-14  9:03 ` [PATCH 7/10 REV5] [IPoIB] Verbs changes Krishna Kumar
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Krishna Kumar @ 2007-09-14  9:03 UTC (permalink / raw)
  To: johnpol, herbert, hadi, kaber, shemminger, davem
  Cc: jagana, Robert.Olsson, rick.jones2, xma, gaagaan, kumarkr,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
	tgraf, randy.dunlap, netdev, Krishna Kumar, sri

IPoIB CM & Multicast changes based on header file changes.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_cm.c        |   13 +++++++++----
 ipoib_multicast.c |    4 ++--
 2 files changed, 11 insertions(+), 6 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c new/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-09-14 10:25:36.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
 			    unsigned int wr_id,
 			    u64 addr, int len)
 {
+	int ret;
 	struct ib_send_wr *bad_wr;
 
-	priv->tx_sge.addr             = addr;
-	priv->tx_sge.length           = len;
+	priv->tx_sge[0].addr          = addr;
+	priv->tx_sge[0].length        = len;
+
+	priv->tx_wr[0].wr_id 	      = wr_id;
 
-	priv->tx_wr.wr_id 	      = wr_id;
+	priv->tx_wr[0].next = NULL;
+	ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+	priv->tx_wr[0].next = &priv->tx_wr[1];
 
-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ret;
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-09-14 10:25:36.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
 			}
 		}
 
-		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
 	}
 
 unlock:


* [PATCH 7/10 REV5] [IPoIB] Verbs changes
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
                   ` (5 preceding siblings ...)
  2007-09-14  9:03 ` [PATCH 6/10 REV5] [IPoIB] CM & Multicast changes Krishna Kumar
@ 2007-09-14  9:03 ` Krishna Kumar
  2007-09-14  9:03 ` [PATCH 8/10 REV5] [IPoIB] Post and work completion handler changes Krishna Kumar
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Krishna Kumar @ 2007-09-14  9:03 UTC (permalink / raw)
  To: johnpol, herbert, hadi, kaber, shemminger, davem
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, xma, gaagaan,
	kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
	tgraf, randy.dunlap, netdev, Krishna Kumar, sri

IPoIB verb changes to use batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_verbs.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-09-14 10:25:36.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
 			.max_send_sge = 1,
 			.max_recv_sge = 1
 		},
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.sq_sig_type = IB_SIGNAL_REQ_WR,	/* 11.2.4.1 */
 		.qp_type     = IB_QPT_UD
 	};
-
-	int ret, size;
+	struct ib_send_wr *next_wr = NULL;
+	int i, ret, size;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
-	priv->tx_sge.lkey 	= priv->mr->lkey;
-
-	priv->tx_wr.opcode 	= IB_WR_SEND;
-	priv->tx_wr.sg_list 	= &priv->tx_sge;
-	priv->tx_wr.num_sge 	= 1;
-	priv->tx_wr.send_flags 	= IB_SEND_SIGNALED;
+	for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+		priv->tx_sge[i].lkey		= priv->mr->lkey;
+		priv->tx_wr[i].opcode		= IB_WR_SEND;
+		priv->tx_wr[i].sg_list		= &priv->tx_sge[i];
+		priv->tx_wr[i].num_sge		= 1;
+		priv->tx_wr[i].send_flags	= 0;
+
+		/* Link the list properly for provider to use */
+		priv->tx_wr[i].next		= next_wr;
+		next_wr				= &priv->tx_wr[i];
+	}
 
 	return 0;
 


* [PATCH 8/10 REV5] [IPoIB] Post and work completion handler changes
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
                   ` (6 preceding siblings ...)
  2007-09-14  9:03 ` [PATCH 7/10 REV5] [IPoIB] Verbs changes Krishna Kumar
@ 2007-09-14  9:03 ` Krishna Kumar
  2007-09-14  9:04 ` [PATCH 9/10 REV5] [IPoIB] Implement batching Krishna Kumar
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Krishna Kumar @ 2007-09-14  9:03 UTC (permalink / raw)
  To: johnpol, herbert, hadi, kaber, shemminger, davem
  Cc: jagana, Robert.Olsson, rick.jones2, xma, gaagaan, kumarkr,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
	tgraf, randy.dunlap, netdev, Krishna Kumar, sri

IPoIB internal post and work completion handler changes.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_ib.c |  212 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 168 insertions(+), 44 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c new/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-09-14 10:25:36.000000000 +0530
@@ -242,6 +242,8 @@ repost:
 static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int num_completions, to_process;
+	unsigned int tx_ring_index;
 	unsigned int wr_id = wc->wr_id;
 	struct ipoib_tx_buf *tx_req;
 	unsigned long flags;
@@ -255,18 +257,51 @@ static void ipoib_ib_handle_tx_wc(struct
 		return;
 	}
 
-	tx_req = &priv->tx_ring[wr_id];
+	/* Get first WC to process (no one can update tx_tail at this time) */
+	tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
 
-	ib_dma_unmap_single(priv->ca, tx_req->mapping,
-			    tx_req->skb->len, DMA_TO_DEVICE);
+	/* Find number of WC's to process */
+	num_completions = wr_id - tx_ring_index + 1;
+	if (unlikely(num_completions <= 0))
+		num_completions += ipoib_sendq_size;
+	to_process = num_completions;
 
-	++priv->stats.tx_packets;
-	priv->stats.tx_bytes += tx_req->skb->len;
+	/*
+	 * Handle WCs from earlier (possibly multiple) post_sends in this
+	 * iteration as we move from tx_tail to wr_id: if the last WR of an
+	 * earlier request (the one that requested completion notification)
+	 * failed to be sent, no completion notification was generated for
+	 * the successful WRs of that earlier request. Use an infinite loop
+	 * so the common case of a single skb is processed faster.
+	 */
+	tx_req = &priv->tx_ring[tx_ring_index];
+	while (1) {
+		if (likely(tx_req->skb)) {
+			ib_dma_unmap_single(priv->ca, tx_req->mapping,
+					    tx_req->skb->len, DMA_TO_DEVICE);
+
+			++priv->stats.tx_packets;
+			priv->stats.tx_bytes += tx_req->skb->len;
+
+			dev_kfree_skb_any(tx_req->skb);
+		}
+		/*
+		 * else this skb failed synchronously when posted and was
+		 * freed immediately.
+		 */
+
+		if (--to_process == 0)
+			break;
 
-	dev_kfree_skb_any(tx_req->skb);
+		if (likely(++tx_ring_index != ipoib_sendq_size))
+			tx_req++;
+		else
+			tx_req = &priv->tx_ring[0];
+	}
 
 	spin_lock_irqsave(&priv->tx_lock, flags);
-	++priv->tx_tail;
+	priv->tx_tail += num_completions;
 	if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) &&
 	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
 		clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
@@ -335,29 +370,57 @@ void ipoib_ib_completion(struct ib_cq *c
 	netif_rx_schedule(dev, &priv->napi);
 }
 
-static inline int post_send(struct ipoib_dev_priv *priv,
-			    unsigned int wr_id,
-			    struct ib_ah *address, u32 qpn,
-			    u64 addr, int len)
+/*
+ * post_send : Post WR(s) to the device.
+ *
+ * num_skbs is the number of WR's, first_wr is the first slot in tx_wr[] (or
+ * tx_sge[]). first_wr is normally zero unless a previous post_send returned
+ * error and we are trying to post the untried WR's, in which case first_wr
+ * is the index to the first untried WR.
+ *
+ * Break the WR link before posting so that the provider knows how many WRs
+ * to process; the link is restored after the post.
+ */
+static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
+			    int first_wr, int num_skbs,
+			    struct ib_send_wr **bad_wr)
 {
-	struct ib_send_wr *bad_wr;
+	int ret;
+	struct ib_send_wr *last_wr, *next_wr;
+
+	last_wr = &priv->tx_wr[first_wr + num_skbs - 1];
+
+	/* Set Completion Notification for last WR */
+	last_wr->send_flags = IB_SEND_SIGNALED;
 
-	priv->tx_sge.addr             = addr;
-	priv->tx_sge.length           = len;
+	/* Terminate the last WR */
+	next_wr = last_wr->next;
+	last_wr->next = NULL;
 
-	priv->tx_wr.wr_id 	      = wr_id;
-	priv->tx_wr.wr.ud.remote_qpn  = qpn;
-	priv->tx_wr.wr.ud.ah 	      = address;
+	/* Send all the WR's in one doorbell */
+	ret = ib_post_send(priv->qp, &priv->tx_wr[first_wr], bad_wr);
 
-	return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);
+	/* Restore send_flags & WR chain */
+	last_wr->send_flags = 0;
+	last_wr->next = next_wr;
+
+	return ret;
 }
 
-void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn)
+/*
+ * Map skb & store skb/mapping in tx_ring; and details of the WR in tx_wr
+ * to pass to the provider.
+ *
+ * Returns:
+ *	1: Error and the skb is freed.
+ *	0 skb processed successfully.
+ */
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, struct ipoib_ah *address,
+		      u32 qpn, int wr_num)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_tx_buf *tx_req;
 	u64 addr;
+	unsigned int tx_ring_index;
 
 	if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
 		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
@@ -365,7 +428,7 @@ void ipoib_send(struct net_device *dev, 
 		++priv->stats.tx_dropped;
 		++priv->stats.tx_errors;
 		ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
-		return;
+		return 1;
 	}
 
 	ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n",
@@ -378,35 +441,96 @@ void ipoib_send(struct net_device *dev, 
 	 * means we have to make sure everything is properly recorded and
 	 * our state is consistent before we call post_send().
 	 */
-	tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
-	tx_req->skb = skb;
-	addr = ib_dma_map_single(priv->ca, skb->data, skb->len,
-				 DMA_TO_DEVICE);
+	addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE);
 	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
 		++priv->stats.tx_errors;
 		dev_kfree_skb_any(skb);
-		return;
+		return 1;
 	}
-	tx_req->mapping = addr;
 
-	if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
-			       address->ah, qpn, addr, skb->len))) {
-		ipoib_warn(priv, "post_send failed\n");
-		++priv->stats.tx_errors;
-		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
-		dev_kfree_skb_any(skb);
-	} else {
-		dev->trans_start = jiffies;
+	tx_ring_index = priv->tx_head & (ipoib_sendq_size - 1);
+
+	/* Save till completion handler executes */
+	priv->tx_ring[tx_ring_index].skb = skb;
+	priv->tx_ring[tx_ring_index].mapping = addr;
+
+	/* Set WR values for the provider to use */
+	priv->tx_sge[wr_num].addr = addr;
+	priv->tx_sge[wr_num].length = skb->len;
+
+	priv->tx_wr[wr_num].wr_id = tx_ring_index;
+	priv->tx_wr[wr_num].wr.ud.remote_qpn = qpn;
+	priv->tx_wr[wr_num].wr.ud.ah = address->ah;
+
+	priv->tx_head++;
+
+	if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size)) {
+		ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
+		netif_stop_queue(dev);
+		set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+	}
 
-		address->last_send = priv->tx_head;
-		++priv->tx_head;
+	return 0;
+}
 
-		if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
-			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
-			netif_stop_queue(dev);
-			set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+/*
+ * Send num_skbs to the device. If an skb is passed to this function, it is
+ * single, unprocessed skb send case; otherwise it means that all skbs are
+ * already processed and put on priv->tx_wr,tx_sge,tx_ring, etc.
+ */
+void ipoib_send(struct net_device *dev, struct sk_buff *skb,
+		struct ipoib_ah *address, u32 qpn, int num_skbs)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int first_wr = 0;
+
+	if (skb && ipoib_process_skb(dev, skb, priv, address, qpn, 0))
+		return;
+
+	/* Send all skb's in one post */
+	do {
+		struct ib_send_wr *bad_wr;
+
+		if (unlikely((post_send(priv, qpn, first_wr, num_skbs,
+					&bad_wr)))) {
+			int done;
+
+			ipoib_warn(priv, "post_send failed\n");
+
+			/* Get number of WR's that finished successfully */
+			done = bad_wr - &priv->tx_wr[first_wr];
+
+			/* Handle 1 error */
+			priv->stats.tx_errors++;
+			ib_dma_unmap_single(priv->ca,
+				priv->tx_sge[first_wr + done].addr,
+				priv->tx_sge[first_wr + done].length,
+				DMA_TO_DEVICE);
+
+			/* Free failed WR & reset for WC handler to recognize */
+			dev_kfree_skb_any(priv->tx_ring[bad_wr->wr_id].skb);
+			priv->tx_ring[bad_wr->wr_id].skb = NULL;
+
+			/* Handle 'n' successes */
+			if (done) {
+				dev->trans_start = jiffies;
+				address->last_send = priv->tx_head - (num_skbs -
+								      done) - 1;
+			}
+
+			/* Get count of skbs that were not tried */
+			num_skbs -= (done + 1);
+				/* + 1 for WR that was tried & failed */
+
+			/* Get start index for next iteration */
+			first_wr += (done + 1);
+		} else {
+			dev->trans_start = jiffies;
+
+			address->last_send = priv->tx_head - 1;
+			num_skbs = 0;
 		}
-	}
+	} while (num_skbs);
 }
 
 static void __ipoib_reap_ah(struct net_device *dev)

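The completion coalescing above boils down to simple ring arithmetic: wr_id
is the ring slot of the WR that requested a completion, and tx_tail (masked
by the ring size, a power of two) gives the first slot that has not yet been
reaped. A small stand-alone model of just that calculation:

	#include <stdio.h>

	/*
	 * How many sends does one signalled completion cover?  tx_tail
	 * counts all completed sends so far, wr_id is the ring slot of the
	 * WR that asked for a completion; sendq_size is a power of two.
	 */
	static unsigned int completions_covered(unsigned int tx_tail,
						unsigned int wr_id,
						unsigned int sendq_size)
	{
		unsigned int first = tx_tail & (sendq_size - 1);
		int n = (int)wr_id - (int)first + 1;

		if (n <= 0)		/* the ring index wrapped around */
			n += sendq_size;
		return (unsigned int)n;
	}

	int main(void)
	{
		/* No wrap: first unreaped slot 10, WC for slot 13 -> 4 */
		printf("%u\n", completions_covered(10, 13, 64));
		/* Wrap: first unreaped slot 60, WC for slot 3 -> 8 */
		printf("%u\n", completions_covered(60, 3, 64));
		return 0;
	}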

* [PATCH 9/10 REV5] [IPoIB] Implement batching
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
                   ` (7 preceding siblings ...)
  2007-09-14  9:03 ` [PATCH 8/10 REV5] [IPoIB] Post and work completion handler changes Krishna Kumar
@ 2007-09-14  9:04 ` Krishna Kumar
  2007-09-14  9:04 ` [PATCH 10/10 REV5] [E1000] " Krishna Kumar
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Krishna Kumar @ 2007-09-14  9:04 UTC (permalink / raw)
  To: johnpol, herbert, hadi, kaber, shemminger, davem
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, xma, gaagaan,
	kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
	tgraf, randy.dunlap, netdev, Krishna Kumar, sri

IPoIB: implement the new batching API.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_main.c |  248 +++++++++++++++++++++++++++++++++++++++--------------------
 1 files changed, 168 insertions(+), 80 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c new/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-09-14 10:25:36.000000000 +0530
@@ -563,7 +563,8 @@ static void neigh_add_path(struct sk_buf
 				goto err_drop;
 			}
 		} else
-			ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+			ipoib_send(dev, skb, path->ah,
+				   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 	} else {
 		neigh->ah  = NULL;
 
@@ -643,7 +644,7 @@ static void unicast_arp_send(struct sk_b
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));
 
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
 	} else if ((path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		/* put pseudoheader back on for next time */
@@ -657,105 +658,163 @@ static void unicast_arp_send(struct sk_b
 	spin_unlock(&priv->lock);
 }
 
+#define	XMIT_PROCESSED_SKBS()						\
+	do {								\
+		if (wr_num) {						\
+			ipoib_send(dev, NULL, old_neigh->ah, old_qpn,	\
+				   wr_num);				\
+			wr_num = 0;					\
+		}							\
+	} while (0)
+
 static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_neigh *neigh;
+	struct sk_buff_head *blist;
+	int max_skbs, wr_num = 0;
+	u32 qpn, old_qpn = 0;
+	struct ipoib_neigh *neigh, *old_neigh = NULL;
 	unsigned long flags;
 
 	if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
 		return NETDEV_TX_LOCKED;
 
-	/*
-	 * Check if our queue is stopped.  Since we have the LLTX bit
-	 * set, we can't rely on netif_stop_queue() preventing our
-	 * xmit function from being called with a full queue.
-	 */
-	if (unlikely(netif_queue_stopped(dev))) {
-		spin_unlock_irqrestore(&priv->tx_lock, flags);
-		return NETDEV_TX_BUSY;
+	blist = dev->skb_blist;
+	if (!skb || (blist && skb_queue_len(blist))) {
+		/*
+		 * Either batching xmit call, or single skb case but there are
+		 * skbs already in the batch list from previous failure to
+		 * xmit - send the earlier skbs first to avoid out of order.
+		 */
+
+		if (skb)
+			__skb_queue_tail(blist, skb);
+
+		/*
+		 * Figure out how many skbs can be sent. This prevents the
+		 * device from getting full and avoids checking for a stopped
+		 * queue after each iteration. Now the queue can get stopped
+		 * at most after xmit of the last skb.
+		 */
+		max_skbs = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
+		skb = __skb_dequeue(blist);
+	} else {
+		blist = NULL;
+		max_skbs = 1;
 	}
 
-	if (likely(skb->dst && skb->dst->neighbour)) {
-		if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
-			ipoib_path_lookup(skb, dev);
-			goto out;
-		}
-
-		neigh = *to_ipoib_neigh(skb->dst->neighbour);
-
-		if (ipoib_cm_get(neigh)) {
-			if (ipoib_cm_up(neigh)) {
-				ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
-				goto out;
-			}
-		} else if (neigh->ah) {
-			if (unlikely(memcmp(&neigh->dgid.raw,
-					    skb->dst->neighbour->ha + 4,
-					    sizeof(union ib_gid)))) {
-				spin_lock(&priv->lock);
-				/*
-				 * It's safe to call ipoib_put_ah() inside
-				 * priv->lock here, because we know that
-				 * path->ah will always hold one more reference,
-				 * so ipoib_put_ah() will never do more than
-				 * decrement the ref count.
-				 */
-				ipoib_put_ah(neigh->ah);
-				list_del(&neigh->list);
-				ipoib_neigh_free(dev, neigh);
-				spin_unlock(&priv->lock);
+	do {
+		if (likely(skb->dst && skb->dst->neighbour)) {
+			if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
+				XMIT_PROCESSED_SKBS();
 				ipoib_path_lookup(skb, dev);
-				goto out;
+				continue;
 			}
 
-			ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
-			goto out;
-		}
-
-		if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
-			spin_lock(&priv->lock);
-			__skb_queue_tail(&neigh->queue, skb);
-			spin_unlock(&priv->lock);
-		} else {
-			++priv->stats.tx_dropped;
-			dev_kfree_skb_any(skb);
-		}
-	} else {
-		struct ipoib_pseudoheader *phdr =
-			(struct ipoib_pseudoheader *) skb->data;
-		skb_pull(skb, sizeof *phdr);
+			neigh = *to_ipoib_neigh(skb->dst->neighbour);
 
-		if (phdr->hwaddr[4] == 0xff) {
-			/* Add in the P_Key for multicast*/
-			phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
-			phdr->hwaddr[9] = priv->pkey & 0xff;
+			if (ipoib_cm_get(neigh)) {
+				if (ipoib_cm_up(neigh)) {
+					XMIT_PROCESSED_SKBS();
+					ipoib_cm_send(dev, skb,
+						      ipoib_cm_get(neigh));
+					continue;
+				}
+			} else if (neigh->ah) {
+				if (unlikely(memcmp(&neigh->dgid.raw,
+						    skb->dst->neighbour->ha + 4,
+						    sizeof(union ib_gid)))) {
+					spin_lock(&priv->lock);
+					/*
+					 * It's safe to call ipoib_put_ah()
+					 * inside priv->lock here, because we
+					 * know that path->ah will always hold
+					 * one more reference, so ipoib_put_ah()
+					 * will never do more than decrement
+					 * the ref count.
+					 */
+					ipoib_put_ah(neigh->ah);
+					list_del(&neigh->list);
+					ipoib_neigh_free(dev, neigh);
+					spin_unlock(&priv->lock);
+					XMIT_PROCESSED_SKBS();
+					ipoib_path_lookup(skb, dev);
+					continue;
+				}
+
+				qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+				if (neigh != old_neigh || qpn != old_qpn) {
+					/*
+					 * Sending to a different destination
+					 * from earlier skb's (or this is the
+					 * first skb) - send all existing skbs.
+					 */
+					XMIT_PROCESSED_SKBS();
+					old_neigh = neigh;
+					old_qpn = qpn;
+				}
+
+				if (likely(!ipoib_process_skb(dev, skb, priv,
+							      neigh->ah, qpn,
+							      wr_num)))
+					wr_num++;
 
-			ipoib_mcast_send(dev, phdr->hwaddr + 4, skb);
-		} else {
-			/* unicast GID -- should be ARP or RARP reply */
+				continue;
+			}
 
-			if ((be16_to_cpup((__be16 *) skb->data) != ETH_P_ARP) &&
-			    (be16_to_cpup((__be16 *) skb->data) != ETH_P_RARP)) {
-				ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x "
-					   IPOIB_GID_FMT "\n",
-					   skb->dst ? "neigh" : "dst",
-					   be16_to_cpup((__be16 *) skb->data),
-					   IPOIB_QPN(phdr->hwaddr),
-					   IPOIB_GID_RAW_ARG(phdr->hwaddr + 4));
+			if (skb_queue_len(&neigh->queue) <
+			    IPOIB_MAX_PATH_REC_QUEUE) {
+				spin_lock(&priv->lock);
+				__skb_queue_tail(&neigh->queue, skb);
+				spin_unlock(&priv->lock);
+			} else {
 				dev_kfree_skb_any(skb);
 				++priv->stats.tx_dropped;
-				goto out;
 			}
-
-			unicast_arp_send(skb, dev, phdr);
+		} else {
+			struct ipoib_pseudoheader *phdr =
+				(struct ipoib_pseudoheader *) skb->data;
+			skb_pull(skb, sizeof *phdr);
+
+			if (phdr->hwaddr[4] == 0xff) {
+				/* Add in the P_Key for multicast*/
+				phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
+				phdr->hwaddr[9] = priv->pkey & 0xff;
+
+				XMIT_PROCESSED_SKBS();
+				ipoib_mcast_send(dev, phdr->hwaddr + 4, skb);
+			} else {
+				/* unicast GID -- should be ARP or RARP reply */
+
+				if ((be16_to_cpup((__be16 *) skb->data) !=
+				    ETH_P_ARP) &&
+				    (be16_to_cpup((__be16 *) skb->data) !=
+				    ETH_P_RARP)) {
+					ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x "
+						IPOIB_GID_FMT "\n",
+						skb->dst ? "neigh" : "dst",
+						be16_to_cpup((__be16 *)
+						skb->data),
+						IPOIB_QPN(phdr->hwaddr),
+						IPOIB_GID_RAW_ARG(phdr->hwaddr
+								  + 4));
+					dev_kfree_skb_any(skb);
+					++priv->stats.tx_dropped;
+					continue;
+				}
+				XMIT_PROCESSED_SKBS();
+				unicast_arp_send(skb, dev, phdr);
+			}
 		}
-	}
+	} while (--max_skbs > 0 && (skb = __skb_dequeue(blist)) != NULL);
+
+	/* Send out last packets (if any) */
+	XMIT_PROCESSED_SKBS();
 
-out:
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
-	return NETDEV_TX_OK;
+	return (!blist || !skb_queue_len(blist)) ? NETDEV_TX_OK :
+						   NETDEV_TX_BUSY;
 }
 
 static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
@@ -903,11 +962,35 @@ int ipoib_dev_init(struct net_device *de
 
 	/* priv->tx_head & tx_tail are already 0 */
 
-	if (ipoib_ib_dev_init(dev, ca, port))
+	/* Allocate tx_sge */
+	priv->tx_sge = kmalloc(ipoib_sendq_size * sizeof *priv->tx_sge,
+			       GFP_KERNEL);
+	if (!priv->tx_sge) {
+		printk(KERN_WARNING "%s: failed to allocate TX sge (%d entries)\n",
+		       ca->name, ipoib_sendq_size);
 		goto out_tx_ring_cleanup;
+	}
+
+	/* Allocate tx_wr */
+	priv->tx_wr = kmalloc(ipoib_sendq_size * sizeof *priv->tx_wr,
+			      GFP_KERNEL);
+	if (!priv->tx_wr) {
+		printk(KERN_WARNING "%s: failed to allocate TX wr (%d entries)\n",
+		       ca->name, ipoib_sendq_size);
+		goto out_tx_sge_cleanup;
+	}
+
+	if (ipoib_ib_dev_init(dev, ca, port))
+		goto out_tx_wr_cleanup;
 
 	return 0;
 
+out_tx_wr_cleanup:
+	kfree(priv->tx_wr);
+
+out_tx_sge_cleanup:
+	kfree(priv->tx_sge);
+
 out_tx_ring_cleanup:
 	kfree(priv->tx_ring);
 
@@ -935,9 +1018,13 @@ void ipoib_dev_cleanup(struct net_device
 
 	kfree(priv->rx_ring);
 	kfree(priv->tx_ring);
+	kfree(priv->tx_sge);
+	kfree(priv->tx_wr);
 
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
+	priv->tx_sge = NULL;
+	priv->tx_wr = NULL;
 }
 
 static void ipoib_setup(struct net_device *dev)
@@ -968,7 +1055,8 @@ static void ipoib_setup(struct net_devic
 	dev->addr_len 		 = INFINIBAND_ALEN;
 	dev->type 		 = ARPHRD_INFINIBAND;
 	dev->tx_queue_len 	 = ipoib_sendq_size * 2;
-	dev->features            = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX;
+	dev->features            = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX |
+				   NETIF_F_BATCH_SKBS;
 
 	/* MTU will be reset when mcast join happens */
 	dev->mtu 		 = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN;

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 10/10 REV5] [E1000] Implement batching
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
                   ` (8 preceding siblings ...)
  2007-09-14  9:04 ` [PATCH 9/10 REV5] [IPoIB] Implement batching Krishna Kumar
@ 2007-09-14  9:04 ` Krishna Kumar
  2007-09-14 12:47   ` [ofa-general] " Evgeniy Polyakov
  2007-11-13 21:28   ` [ofa-general] " Kok, Auke
  2007-09-14 12:49 ` [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Evgeniy Polyakov
  2007-09-16 23:17 ` David Miller
  11 siblings, 2 replies; 107+ messages in thread
From: Krishna Kumar @ 2007-09-14  9:04 UTC (permalink / raw)
  To: johnpol, herbert, hadi, kaber, shemminger, davem
  Cc: jagana, Robert.Olsson, rick.jones2, xma, gaagaan, kumarkr,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
	tgraf, randy.dunlap, netdev, Krishna Kumar, sri

E1000: Implement batching capability (ported thanks to changes taken from
	Jamal).

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 e1000_main.c |  104 ++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 75 insertions(+), 29 deletions(-)

diff -ruNp org/drivers/net/e1000/e1000_main.c new/drivers/net/e1000/e1000_main.c
--- org/drivers/net/e1000/e1000_main.c	2007-09-14 10:30:57.000000000 +0530
+++ new/drivers/net/e1000/e1000_main.c	2007-09-14 10:31:02.000000000 +0530
@@ -990,7 +990,7 @@ e1000_probe(struct pci_dev *pdev,
 	if (pci_using_dac)
 		netdev->features |= NETIF_F_HIGHDMA;
 
-	netdev->features |= NETIF_F_LLTX;
+	netdev->features |= NETIF_F_LLTX | NETIF_F_BATCH_SKBS;
 
 	adapter->en_mng_pt = e1000_enable_mng_pass_thru(&adapter->hw);
 
@@ -3092,6 +3092,17 @@ e1000_tx_map(struct e1000_adapter *adapt
 	return count;
 }
 
+static void e1000_kick_DMA(struct e1000_adapter *adapter,
+			   struct e1000_tx_ring *tx_ring, int i)
+{
+	wmb();
+
+	writel(i, adapter->hw.hw_addr + tx_ring->tdt);
+	/* we need this if more than one processor can write to our tail
+	 * at a time, it syncronizes IO on IA64/Altix systems */
+	mmiowb();
+}
+
 static void
 e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring,
                int tx_flags, int count)
@@ -3138,13 +3149,7 @@ e1000_tx_queue(struct e1000_adapter *ada
 	 * know there are new descriptors to fetch.  (Only
 	 * applicable for weak-ordered memory model archs,
 	 * such as IA-64). */
-	wmb();
-
 	tx_ring->next_to_use = i;
-	writel(i, adapter->hw.hw_addr + tx_ring->tdt);
-	/* we need this if more than one processor can write to our tail
-	 * at a time, it syncronizes IO on IA64/Altix systems */
-	mmiowb();
 }
 
 /**
@@ -3251,22 +3256,23 @@ static int e1000_maybe_stop_tx(struct ne
 }
 
 #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
+
+#define NETDEV_TX_DROPPED	-5
+
 static int
-e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+e1000_prep_queue_frame(struct sk_buff *skb, struct net_device *netdev)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_tx_ring *tx_ring;
 	unsigned int first, max_per_txd = E1000_MAX_DATA_PER_TXD;
 	unsigned int max_txd_pwr = E1000_MAX_TXD_PWR;
 	unsigned int tx_flags = 0;
-	unsigned int len = skb->len;
-	unsigned long flags;
-	unsigned int nr_frags = 0;
-	unsigned int mss = 0;
+	unsigned int len = skb->len - skb->data_len;
+	unsigned int nr_frags;
+	unsigned int mss;
 	int count = 0;
 	int tso;
 	unsigned int f;
-	len -= skb->data_len;
 
 	/* This goes back to the question of how to logically map a tx queue
 	 * to a flow.  Right now, performance is impacted slightly negatively
@@ -3276,7 +3282,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
 
 	if (unlikely(skb->len <= 0)) {
 		dev_kfree_skb_any(skb);
-		return NETDEV_TX_OK;
+		return NETDEV_TX_DROPPED;
 	}
 
 	/* 82571 and newer doesn't need the workaround that limited descriptor
@@ -3322,7 +3328,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
 					DPRINTK(DRV, ERR,
 						"__pskb_pull_tail failed.\n");
 					dev_kfree_skb_any(skb);
-					return NETDEV_TX_OK;
+					return NETDEV_TX_DROPPED;
 				}
 				len = skb->len - skb->data_len;
 				break;
@@ -3366,22 +3372,15 @@ e1000_xmit_frame(struct sk_buff *skb, st
 	    (adapter->hw.mac_type == e1000_82573))
 		e1000_transfer_dhcp_info(adapter, skb);
 
-	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags))
-		/* Collision - tell upper layer to requeue */
-		return NETDEV_TX_LOCKED;
-
 	/* need: count + 2 desc gap to keep tail from touching
 	 * head, otherwise try next time */
-	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) {
-		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
+	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2)))
 		return NETDEV_TX_BUSY;
-	}
 
 	if (unlikely(adapter->hw.mac_type == e1000_82547)) {
 		if (unlikely(e1000_82547_fifo_workaround(adapter, skb))) {
 			netif_stop_queue(netdev);
 			mod_timer(&adapter->tx_fifo_stall_timer, jiffies + 1);
-			spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
 			return NETDEV_TX_BUSY;
 		}
 	}
@@ -3396,8 +3395,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
 	tso = e1000_tso(adapter, tx_ring, skb);
 	if (tso < 0) {
 		dev_kfree_skb_any(skb);
-		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
-		return NETDEV_TX_OK;
+		return NETDEV_TX_DROPPED;
 	}
 
 	if (likely(tso)) {
@@ -3416,13 +3414,61 @@ e1000_xmit_frame(struct sk_buff *skb, st
 	               e1000_tx_map(adapter, tx_ring, skb, first,
 	                            max_per_txd, nr_frags, mss));
 
-	netdev->trans_start = jiffies;
+	return NETDEV_TX_OK;
+}
+
+static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+	struct sk_buff_head *blist;
+	int ret, skbs_done = 0;
+	unsigned long flags;
+
+	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
+		/* Collision - tell upper layer to requeue */
+		return NETDEV_TX_LOCKED;
+	}
 
-	/* Make sure there is space in the ring for the next send. */
-	e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
+	blist = netdev->skb_blist;
+
+	if (!skb || (blist && skb_queue_len(blist))) {
+		/*
+		 * Either batching xmit call, or single skb case but there are
+		 * skbs already in the batch list from previous failure to
+		 * xmit - send the earlier skbs first to avoid out of order.
+		 */
+		if (skb)
+			__skb_queue_tail(blist, skb);
+		skb = __skb_dequeue(blist);
+	} else {
+		blist = NULL;
+	}
+
+	do {
+		ret = e1000_prep_queue_frame(skb, netdev);
+		if (likely(ret == NETDEV_TX_OK))
+			skbs_done++;
+		else {
+			if (ret == NETDEV_TX_BUSY) {
+				if (blist)
+					__skb_queue_head(blist, skb);
+				break;
+			}
+			/* skb dropped, not a TX error */
+			ret = NETDEV_TX_OK;
+		}
+	} while (blist && (skb = __skb_dequeue(blist)) != NULL);
+
+	if (skbs_done) {
+		e1000_kick_DMA(adapter, tx_ring, adapter->tx_ring->next_to_use);
+		netdev->trans_start = jiffies;
+		/* Make sure there is space in the ring for the next send. */
+		e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
+	}
 
 	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
-	return NETDEV_TX_OK;
+	return ret;
 }
 
 /**

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching
  2007-09-14  9:01 ` [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching Krishna Kumar
@ 2007-09-14 12:15   ` Evgeniy Polyakov
  2007-09-17  3:49     ` Krishna Kumar2
  0 siblings, 1 reply; 107+ messages in thread
From: Evgeniy Polyakov @ 2007-09-14 12:15 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: jagana, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
	peter.p.waskiewicz.jr, hadi, netdev, kaber, randy.dunlap, jeff,
	general, mchan, tgraf, mcarlson, sri, shemminger, davem

Hi Krishna.

On Fri, Sep 14, 2007 at 02:31:56PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> +int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
> +{
> +	if (!list_empty(&ptype_all))
> +		dev_queue_xmit_nit(skb, dev);
> +
> +	if (netif_needs_gso(dev, skb)) {
> +		if (unlikely(dev_gso_segment(skb))) {
> +			kfree_skb(skb);
> +			return 0;
> +		}
> +
> +		if (skb->next) {
> +			int count = 0;
> +
> +			do {
> +				struct sk_buff *nskb = skb->next;
> +
> +				skb->next = nskb->next;
> +				__skb_queue_tail(dev->skb_blist, nskb);
> +				count++;
> +			} while (skb->next);

Could this be a list_move()-like function for skb lists?
I'm pretty sure that if you change the first and the last skbs and the
length of the queue in one shot, the result will be the same.
Actually, how many skbs are usually batched in your load?

> +			/* Reset destructor for kfree_skb to work */
> +			skb->destructor = DEV_GSO_CB(skb)->destructor;
> +			kfree_skb(skb);

Why do you free first skb in the chain?

> +			return count;
> +		}
> +	}
> +	__skb_queue_tail(dev->skb_blist, skb);
> +	return 1;
> +}

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 2/10 REV5] [core] Add skb_blist & support for batching
  2007-09-14  9:01 ` [PATCH 2/10 REV5] [core] Add skb_blist & support " Krishna Kumar
@ 2007-09-14 12:46   ` Evgeniy Polyakov
  2007-09-17  3:51     ` Krishna Kumar2
  0 siblings, 1 reply; 107+ messages in thread
From: Evgeniy Polyakov @ 2007-09-14 12:46 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, hadi, kaber, randy.dunlap, jeff, general,
	netdev, tgraf, mcarlson, sri, shemminger, davem, mchan

On Fri, Sep 14, 2007 at 02:31:37PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> @@ -3566,6 +3579,13 @@ int register_netdevice(struct net_device
>  		}
>  	}
>  
> +	if (dev->features & NETIF_F_BATCH_SKBS) {
> +		/* Driver supports batching skb */
> +		dev->skb_blist = kmalloc(sizeof *dev->skb_blist, GFP_KERNEL);
> +		if (dev->skb_blist)
> +			skb_queue_head_init(dev->skb_blist);
> +	}
> +

A nitpick is that you should use sizeof(struct ...), and I think this
requires clearing the flag in case of a failed initialization?

>  	/*
>  	 *	nil rebuild_header routine,
>  	 *	that should be never called and used as just bug trap.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 10/10 REV5] [E1000] Implement batching
  2007-09-14  9:04 ` [PATCH 10/10 REV5] [E1000] " Krishna Kumar
@ 2007-09-14 12:47   ` Evgeniy Polyakov
  2007-09-17  3:56     ` Krishna Kumar2
  2007-11-13 21:28   ` [ofa-general] " Kok, Auke
  1 sibling, 1 reply; 107+ messages in thread
From: Evgeniy Polyakov @ 2007-09-14 12:47 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, hadi, netdev, kaber, randy.dunlap, jeff,
	general, mchan, tgraf, mcarlson, sri, shemminger, davem

On Fri, Sep 14, 2007 at 02:34:42PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> @@ -3276,7 +3282,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
>  
>  	if (unlikely(skb->len <= 0)) {
>  		dev_kfree_skb_any(skb);
> -		return NETDEV_TX_OK;
> +		return NETDEV_TX_DROPPED;
>  	}

This change could actually go as its own patch, although I am not sure it
is ever used. Just a thought, not a stopper.

>  	/* 82571 and newer doesn't need the workaround that limited descriptor
> @@ -3322,7 +3328,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
>  					DPRINTK(DRV, ERR,
>  						"__pskb_pull_tail failed.\n");
>  					dev_kfree_skb_any(skb);
> -					return NETDEV_TX_OK;
> +					return NETDEV_TX_DROPPED;
>  				}
>  				len = skb->len - skb->data_len;
>  				break;
> @@ -3366,22 +3372,15 @@ e1000_xmit_frame(struct sk_buff *skb, st
>  	    (adapter->hw.mac_type == e1000_82573))
>  		e1000_transfer_dhcp_info(adapter, skb);
>  
> -	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags))
> -		/* Collision - tell upper layer to requeue */
> -		return NETDEV_TX_LOCKED;
> -
>  	/* need: count + 2 desc gap to keep tail from touching
>  	 * head, otherwise try next time */
> -	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) {
> -		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
> +	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2)))
>  		return NETDEV_TX_BUSY;
> -	}
>  
>  	if (unlikely(adapter->hw.mac_type == e1000_82547)) {
>  		if (unlikely(e1000_82547_fifo_workaround(adapter, skb))) {
>  			netif_stop_queue(netdev);
>  			mod_timer(&adapter->tx_fifo_stall_timer, jiffies + 1);
> -			spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
>  			return NETDEV_TX_BUSY;
>  		}
>  	}
> @@ -3396,8 +3395,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
>  	tso = e1000_tso(adapter, tx_ring, skb);
>  	if (tso < 0) {
>  		dev_kfree_skb_any(skb);
> -		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
> -		return NETDEV_TX_OK;
> +		return NETDEV_TX_DROPPED;
>  	}
>  
>  	if (likely(tso)) {
> @@ -3416,13 +3414,61 @@ e1000_xmit_frame(struct sk_buff *skb, st
>  	               e1000_tx_map(adapter, tx_ring, skb, first,
>  	                            max_per_txd, nr_frags, mss));
>  
> -	netdev->trans_start = jiffies;
> +	return NETDEV_TX_OK;
> +}
> +
> +static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
> +{
> +	struct e1000_adapter *adapter = netdev_priv(netdev);
> +	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
> +	struct sk_buff_head *blist;
> +	int ret, skbs_done = 0;
> +	unsigned long flags;
> +
> +	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
> +		/* Collision - tell upper layer to requeue */
> +		return NETDEV_TX_LOCKED;
> +	}
>  
> -	/* Make sure there is space in the ring for the next send. */
> -	e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
> +	blist = netdev->skb_blist;
> +
> +	if (!skb || (blist && skb_queue_len(blist))) {
> +		/*
> +		 * Either batching xmit call, or single skb case but there are
> +		 * skbs already in the batch list from previous failure to
> +		 * xmit - send the earlier skbs first to avoid out of order.
> +		 */
> +		if (skb)
> +			__skb_queue_tail(blist, skb);
> +		skb = __skb_dequeue(blist);

Why is it put at the end?


-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
                   ` (9 preceding siblings ...)
  2007-09-14  9:04 ` [PATCH 10/10 REV5] [E1000] " Krishna Kumar
@ 2007-09-14 12:49 ` Evgeniy Polyakov
  2007-09-16 23:17 ` David Miller
  11 siblings, 0 replies; 107+ messages in thread
From: Evgeniy Polyakov @ 2007-09-14 12:49 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, kumarkr, herbert,
	gaagaan, netdev, rdreier, hadi, kaber, randy.dunlap, jeff,
	general, mchan, tgraf, mcarlson, sri, shemminger, davem

Hi Krishna.

On Fri, Sep 14, 2007 at 02:30:58PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> --------
> The retransmission problem reported earlier seems to happen when mthca is
> used as the underlying device, but when I tested ehca the retransmissions
> dropped to normal levels (around 2 times the regular code). The performance
> improvement is around 55% for TCP.

And what about latency for this patchset?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching
  2007-09-14  9:01 ` [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching Krishna Kumar
@ 2007-09-14 18:37   ` Randy Dunlap
  2007-09-17  4:10     ` Krishna Kumar2
  0 siblings, 1 reply; 107+ messages in thread
From: Randy Dunlap @ 2007-09-14 18:37 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: johnpol, jagana, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
	netdev, tgraf, mcarlson, sri, shemminger, davem, mchan

On Fri, 14 Sep 2007 14:31:18 +0530 Krishna Kumar wrote:

> Add Documentation describing batching skb xmit capability.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
>  batching_skb_xmit.txt |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 107 insertions(+)
> 
> diff -ruNp org/Documentation/networking/batching_skb_xmit.txt new/Documentation/networking/batching_skb_xmit.txt
> --- org/Documentation/networking/batching_skb_xmit.txt	1970-01-01 05:30:00.000000000 +0530
> +++ new/Documentation/networking/batching_skb_xmit.txt	2007-09-14 10:25:36.000000000 +0530
> @@ -0,0 +1,107 @@
> +
> +Section 4: Nitty gritty details for driver writers
> +--------------------------------------------------
> +
> +	Batching is enabled from core networking stack only from softirq
> +	context (NET_TX_SOFTIRQ), and dev_queue_xmit() doesn't use batching.
> +
> +	This leads to the following situation:
> +		A skb was not sent out as either driver lock was contested or
> +		the device was blocked. When the softirq handler runs, it
> +		moves all skbs from the device queue to the batch list, but
> +		then it too could fail to send due to lock contention. The
> +		next xmit (of a single skb) called from dev_queue_xmit() will
> +		not use batching and try to xmit skb, while previous skbs are
> +		still present in the batch list. This results in the receiver
> +		getting out-of-order packets, and in case of TCP the sender
> +		would have unnecessary retransmissions.
> +
> +	To fix this problem, error cases where driver xmit gets called with a
> +	skb must code as follows:
> +		1. If driver xmit cannot get tx lock, return NETDEV_TX_LOCKED
> +		   as usual. This allows qdisc to requeue the skb.
> +		2. If driver xmit got the lock but failed to send the skb, it
> +		   should return NETDEV_TX_BUSY but before that it should have
> +		   queue'd the skb to the batch list. In this case, the qdisc

                   queued

> +		   does not requeue the skb.

and then
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>

Thanks,
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
                   ` (10 preceding siblings ...)
  2007-09-14 12:49 ` [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Evgeniy Polyakov
@ 2007-09-16 23:17 ` David Miller
  2007-09-17  0:29   ` jamal
  2007-09-17  4:08   ` [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar2
  11 siblings, 2 replies; 107+ messages in thread
From: David Miller @ 2007-09-16 23:17 UTC (permalink / raw)
  To: krkumar2
  Cc: johnpol, jagana, peter.p.waskiewicz.jr, kumarkr, herbert,
	gaagaan, Robert.Olsson, netdev, rdreier, hadi, mcarlson, jeff,
	general, mchan, tgraf, randy.dunlap, sri, shemminger, kaber

From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Fri, 14 Sep 2007 14:30:58 +0530

> This set of patches implements the batching xmit capability, and
> adds support for batching in IPoIB and E1000 (E1000 driver changes
> is ported, thanks to changes taken from Jamal's code from an old
> kernel).

The only major complaint I have about this patch series is that
the IPoIB part should just be one big changeset.  Otherwise the
tree is not bisectable, for example the initial ipoib header file
change breaks the build.

The tree must compile and work properly after every single patch.

On a lower priority, I question the indirection of skb_blist by making
it a pointer.  For what?  Saving 12 bytes on 64-bit?  That kmalloc()'d
thing is a nearly guaranteed cache and/or TLB miss.  Just inline the
thing, we generally don't do crap like this anywhere else.
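
For illustration, the two layouts being compared (editor's sketch, not
from the patches; the wrapper struct names are only stand-ins, the
skb_blist member is the point of contention):

#include <linux/skbuff.h>

/* As posted: the batch list is kmalloc()'d, so every xmit goes through
 * a pointer into a separately allocated object. */
struct netdev_blist_as_pointer {
        struct sk_buff_head *skb_blist;        /* NULL => batching off */
};

/* As suggested: embed the list head in struct net_device itself, so it
 * shares the device's cache lines and needs no allocation at all. */
struct netdev_blist_embedded {
        struct sk_buff_head skb_blist;
};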

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-16 23:17 ` David Miller
@ 2007-09-17  0:29   ` jamal
  2007-09-17  1:02     ` David Miller
  2007-09-23 17:53     ` [PATCHES] TX batching jamal
  2007-09-17  4:08   ` [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar2
  1 sibling, 2 replies; 107+ messages in thread
From: jamal @ 2007-09-17  0:29 UTC (permalink / raw)
  To: David Miller
  Cc: krkumar2, johnpol, herbert, kaber, shemminger, jagana,
	Robert.Olsson, rick.jones2, xma, gaagaan, netdev, rdreier,
	peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general, kumarkr,
	tgraf, randy.dunlap, sri

On Sun, 2007-16-09 at 16:17 -0700, David Miller wrote:

> The only major complaint I have about this patch series is that
> the IPoIB part should just be one big changeset. 

Dave, you do realize that i have been investing my time working on
batching as well, right? 

cheers,
jamal



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-17  0:29   ` jamal
@ 2007-09-17  1:02     ` David Miller
  2007-09-17  2:14       ` [ofa-general] " jamal
  2007-09-23 17:53     ` [PATCHES] TX batching jamal
  1 sibling, 1 reply; 107+ messages in thread
From: David Miller @ 2007-09-17  1:02 UTC (permalink / raw)
  To: hadi
  Cc: krkumar2, johnpol, herbert, kaber, shemminger, jagana,
	Robert.Olsson, rick.jones2, xma, gaagaan, netdev, rdreier,
	peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general, kumarkr,
	tgraf, randy.dunlap, sri

From: jamal <hadi@cyberus.ca>
Date: Sun, 16 Sep 2007 20:29:18 -0400

> On Sun, 2007-16-09 at 16:17 -0700, David Miller wrote:
> 
> > The only major complaint I have about this patch series is that
> > the IPoIB part should just be one big changeset. 
> 
> Dave, you do realize that i have been investing my time working on
> batching as well, right? 

I do.

And I'm reviewing and applying several hundred patches a day.

What's the point? :-)

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-17  1:02     ` David Miller
@ 2007-09-17  2:14       ` jamal
  2007-09-17  2:25         ` David Miller
  0 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-09-17  2:14 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

On Sun, 2007-16-09 at 18:02 -0700, David Miller wrote:

> I do.
> 
> And I'm reviewing and applying several hundred patches a day.
> 
> What's the point? :-)

Reading the commentary made me think you were about to swallow that, with
one more change, by the time I wake up ;->
I still think this work - despite my vested interest - needs more
scrutiny from a performance perspective.
I tend to send a URL to my work, but it may be time to start posting
patches.

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-17  2:14       ` [ofa-general] " jamal
@ 2007-09-17  2:25         ` David Miller
  2007-09-17  3:01           ` jamal
  2007-09-17  4:46           ` Krishna Kumar2
  0 siblings, 2 replies; 107+ messages in thread
From: David Miller @ 2007-09-17  2:25 UTC (permalink / raw)
  To: hadi
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

From: jamal <hadi@cyberus.ca>
Date: Sun, 16 Sep 2007 22:14:21 -0400

> I still think this work - despite my vested interest - needs more
> scrutiny from a performance perspective.

Absolutely.

There are tertiary issues I'm personally interested in, for example
how well this stuff works when we enable software GSO on a non-TSO
capable card.

In such a case the GSO segment should be split right before we hit the
driver and then all the sub-segments of the original GSO frame batched
in one shot down to the device driver.

In this way you'll get a large chunk of the benefit of TSO without
explicit hardware support for the feature.

There are several cards (some even 10GB) that will benefit immensely
from this.
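
As a rough sketch of that idea (editorial example, not from the patches;
enable_sw_gso() is a made-up helper, the NETIF_F_* bits are the standard
feature flags): a NIC without TSO but with scatter-gather and checksum
offload could simply be given the software GSO bit, so the stack builds
large skbs and segments them just above the driver, where this series
can then push all the sub-segments down as one batch.

#include <linux/netdevice.h>

/* Hypothetical helper: give a NIC that has no TSO, but can do
 * scatter-gather and checksum offload, the software GSO bit so that
 * large skbs are segmented right before the driver and the resulting
 * sub-segments can be handed down together. */
static void enable_sw_gso(struct net_device *dev)
{
        int can_sg   = dev->features & NETIF_F_SG;
        int can_csum = dev->features & (NETIF_F_IP_CSUM | NETIF_F_HW_CSUM);

        if (can_sg && can_csum && !(dev->features & NETIF_F_TSO))
                dev->features |= NETIF_F_GSO;
}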

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-17  2:25         ` David Miller
@ 2007-09-17  3:01           ` jamal
  2007-09-17  3:13             ` David Miller
  2007-09-17  4:46           ` Krishna Kumar2
  1 sibling, 1 reply; 107+ messages in thread
From: jamal @ 2007-09-17  3:01 UTC (permalink / raw)
  To: David Miller
  Cc: krkumar2, johnpol, herbert, kaber, shemminger, jagana,
	Robert.Olsson, rick.jones2, xma, gaagaan, netdev, rdreier,
	peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general, kumarkr,
	tgraf, randy.dunlap, sri

On Sun, 2007-16-09 at 19:25 -0700, David Miller wrote:

> There are tertiary issues I'm personally interested in, for example
> how well this stuff works when we enable software GSO on a non-TSO
> capable card.
> 
> In such a case the GSO segment should be split right before we hit the
> driver and then all the sub-segments of the original GSO frame batched
> in one shot down to the device driver.

I think GSO is still useful on top of this.
In my patches anything with GSO gets put into the batch list and shot
down to the driver. I've never considered checking whether the NIC is
TSO capable; that may be something worth checking into. The netiron
allows you to shove up to 128 skbs utilizing one tx descriptor, which
makes for interesting possibilities.

> In this way you'll get a large chunk of the benefit of TSO without
> explicit hardware support for the feature.
> 
> There are several cards (some even 10GB) that will benefit immensely
> from this.

indeed - ive always wondered if batching this way would make the NICs
behave differently from the way TSO does.

On a side note: my observation is that with large packets on a very busy
system running a bulk-transfer type app, one approaches wire speed with
or without batching, and the apps are mostly idling (I've seen up to 90%
idle time polling at the socket level for writes to complete on a really
busy system). CPU usage seems a little better with batching. As the
aggregation of the apps gets more aggressive (achievable by reducing
their packet sizes), one can achieve improved throughput and reduced CPU
utilization. This is all with UDP; I am still studying TCP.
 
cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-17  3:01           ` jamal
@ 2007-09-17  3:13             ` David Miller
  2007-09-17 12:51               ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: David Miller @ 2007-09-17  3:13 UTC (permalink / raw)
  To: hadi
  Cc: krkumar2, johnpol, herbert, kaber, shemminger, jagana,
	Robert.Olsson, rick.jones2, xma, gaagaan, netdev, rdreier,
	peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general, kumarkr,
	tgraf, randy.dunlap, sri

From: jamal <hadi@cyberus.ca>
Date: Sun, 16 Sep 2007 23:01:43 -0400

> I think GSO is still useful on top of this.
> In my patches anything with gso gets put into the batch list and shot
> down the driver. Ive never considered checking whether the nic is TSO
> capable, that may be something worth checking into. The netiron allows
> you to shove upto 128 skbs utilizing one tx descriptor, which makes for
> interesting possibilities.

We're talking past each other, but I'm happy to hear that for
sure your code does the right thing :-)

Right now only TSO capable hardware sets the TSO capable bit,
except perhaps for the XEN netfront driver.

What Herbert and I want to do is basically turn on TSO for
devices that can't do it in hardware, and rely upon the GSO
framework to do the segmenting in software right before we
hit the device.

This only makes sense for devices which can 1) scatter-gather
and 2) checksum on transmit.  Otherwise we make too many
copies and/or passes over the data.

And we can only get the full benefit if we can pass all the
sub-segments down to the driver in one ->hard_start_xmit()
call.

> On a side note: My observation is that with large packets on a very busy
> system; bulk transfer type app, one approaches wire speed; with or
> without batching, the apps are mostly idling (Ive seen upto 90% idle
> time polling at the socket level for write to complete with a really
> busy system). This is the case with or without batching. cpu seems a
> little better with batching. As the aggregation of the apps gets more
> aggressive (achievable by reducing their packet sizes), one can achieve
> improved throughput and reduced cpu utilization. This all with UDP; i am
> still studying tcp. 

UDP apps spraying data tend to naturally batch well and load balance
amongst themselves because each socket fills up to it's socket send
buffer limit, then sleeps, and we then get a stream from the next UDP
socket up to it's limit, and so on and so forth.

UDP is too easy a test case in fact :-)

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching
  2007-09-14 12:15   ` [ofa-general] " Evgeniy Polyakov
@ 2007-09-17  3:49     ` Krishna Kumar2
  0 siblings, 0 replies; 107+ messages in thread
From: Krishna Kumar2 @ 2007-09-17  3:49 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
	randy.dunlap, rdreier, rick.jones2, Robert.Olsson, shemminger,
	sri, tgraf, xma

Hi Evgeniy,

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 09/14/2007 05:45:19 PM:

> > +      if (skb->next) {
> > +         int count = 0;
> > +
> > +         do {
> > +            struct sk_buff *nskb = skb->next;
> > +
> > +            skb->next = nskb->next;
> > +            __skb_queue_tail(dev->skb_blist, nskb);
> > +            count++;
> > +         } while (skb->next);
>
> Could it be list_move()-like function for skb lists?
> I'm pretty sure if you change first and the last skbs and ke of the
> queue in one shot, result will be the same.

I have to do a bit more, like updating the count, etc., but I agree it is
do-able. I had mentioned in my PATCH 0/10 that I will later try this
suggestion that you provided last time.

> Actually how many skbs are usually batched in your load?

It depends; e.g. when the tx lock is not acquired, I get batching of up
to 8-10 skbs (assuming the tx lock was missed quite a few times). But
when the queue gets blocked, I have seen batching of up to 4K skbs (if
tx_queue_len is 4K).

> > +         /* Reset destructor for kfree_skb to work */
> > +         skb->destructor = DEV_GSO_CB(skb)->destructor;
> > +         kfree_skb(skb);
>
> Why do you free first skb in the chain?

This is the GSO code path, where 'skb' has been segmented into skbs 1-n;
those segments are sent out and freed by the driver, which means the
dummy 'skb' (without any data) remains to be freed.

Thanks,

- KK
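
For reference, a minimal sketch of the one-shot move being discussed
(editorial example; skb_queue_move() is a made-up name, locking is left
to the caller, and it does much the same job as the skb_queue_splice()
helper that later appeared in mainline). Note it assumes both ends are
proper sk_buff_head queues; a GSO chain linked only via ->next would
still need a walk to fix up the prev pointers and the count.

#include <linux/skbuff.h>

/* Move every skb on 'from' to the tail of 'to' in O(1): only the first
 * skb, the last skb and the queue lengths are touched.  Caller holds
 * whatever locks protect both queues. */
static void skb_queue_move(struct sk_buff_head *from, struct sk_buff_head *to)
{
        struct sk_buff *first, *last, *tail;

        if (skb_queue_empty(from))
                return;

        first = from->next;             /* oldest skb on 'from' */
        last  = from->prev;             /* newest skb on 'from' */
        tail  = to->prev;               /* current tail of 'to' */

        /* splice the whole chain behind 'to's tail ... */
        tail->next  = first;
        first->prev = tail;
        last->next  = (struct sk_buff *)to;
        to->prev    = last;
        to->qlen   += from->qlen;

        /* ... and reset 'from' to empty */
        from->next = from->prev = (struct sk_buff *)from;
        from->qlen = 0;
}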


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 2/10 REV5] [core] Add skb_blist & support for batching
  2007-09-14 12:46   ` [ofa-general] " Evgeniy Polyakov
@ 2007-09-17  3:51     ` Krishna Kumar2
  0 siblings, 0 replies; 107+ messages in thread
From: Krishna Kumar2 @ 2007-09-17  3:51 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
	randy.dunlap, rdreier, rick.jones2, Robert.Olsson, shemminger,
	sri, tgraf, xma

Hi Evgeniy,

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 09/14/2007 06:16:38 PM:

> > +   if (dev->features & NETIF_F_BATCH_SKBS) {
> > +      /* Driver supports batching skb */
> > +      dev->skb_blist = kmalloc(sizeof *dev->skb_blist, GFP_KERNEL);
> > +      if (dev->skb_blist)
> > +         skb_queue_head_init(dev->skb_blist);
> > +   }
> > +
>
> A nitpick is that you should use sizeof(struct ...) and I think it
> requires flag clearing in cae of failed initialization?

I thought it better to use 'sizeof *var' in case the name of the structure
changes. Also, the flag is not cleared since batching could be enabled
later, and the allocation could succeed at that time. When skb_blist is
allocated, batching is enabled; otherwise it is disabled (the features
flag just indicates that the driver supports batching).

Thanks,

- KK
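
In other words, the convention is roughly the following (editorial
sketch; netdev_try_enable_batching() is a made-up name, while
NETIF_F_BATCH_SKBS and skb_blist come from this patch set):

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/slab.h>

/* Feature bit == "driver can batch"; an allocated skb_blist == "batching
 * is currently on".  A failed allocation just leaves batching off and
 * can be retried later (e.g. from the ethtool set operation). */
static void netdev_try_enable_batching(struct net_device *dev)
{
        if (!(dev->features & NETIF_F_BATCH_SKBS) || dev->skb_blist)
                return;

        dev->skb_blist = kmalloc(sizeof(struct sk_buff_head), GFP_KERNEL);
        if (dev->skb_blist)
                skb_queue_head_init(dev->skb_blist);
}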


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/10 REV5] [E1000] Implement batching
  2007-09-14 12:47   ` [ofa-general] " Evgeniy Polyakov
@ 2007-09-17  3:56     ` Krishna Kumar2
  0 siblings, 0 replies; 107+ messages in thread
From: Krishna Kumar2 @ 2007-09-17  3:56 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
	randy.dunlap, rdreier, rick.jones2, Robert.Olsson, shemminger,
	sri, tgraf, xma

Hi Evgeniy,

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 09/14/2007 06:17:14 PM:

> >     if (unlikely(skb->len <= 0)) {
> >        dev_kfree_skb_any(skb);
> > -      return NETDEV_TX_OK;
> > +      return NETDEV_TX_DROPPED;
> >     }
>
> This changes could actually go as own patch, although not sure it is
> ever used. just a though, not a stopper.

Since this flag is new and useful only for batching, I feel it is OK to
include it in this patch.

> > +   if (!skb || (blist && skb_queue_len(blist))) {
> > +      /*
> > +       * Either batching xmit call, or single skb case but there are
> > +       * skbs already in the batch list from previous failure to
> > +       * xmit - send the earlier skbs first to avoid out of order.
> > +       */
> > +      if (skb)
> > +         __skb_queue_tail(blist, skb);
> > +      skb = __skb_dequeue(blist);
>
> Why is it put at the end?

There is a bug that I had explained in rev4 (see XXX below) resulting
in sending out skbs out of order. The fix is that if the driver gets
called with a skb but there are older skbs already in the batch list
(which failed to get sent out), send those skbs first before this one.

Thanks,

- KK

[XXX] Dave had suggested using batching only in the net_tx_action case.
When I implemented that in earlier revisions, there were lots of TCP
retransmissions (about 18,000 to every 1 in the regular code). I found
the reason for part of that problem: skbs get queued up in dev->qdisc
(when the tx lock was not acquired or the queue was blocked); when
net_tx_action is called later, it passes the batch list as an argument
to qdisc_run and this results in skbs being moved to the batch list;
then the batching xmit also fails due to tx lock failure; the next many
regular xmits of a single skb will go through the fast path (pass a NULL
batch list to qdisc_run) and send those skbs out to the device while the
previous skbs are cooling their heels in the batch list.

The first fix was to not pass NULL/batch-list to qdisc_run() but to
always check whether skbs are present in the batch list when trying to
xmit. This reduced retransmissions by a third (from 18,000 to around
12,000), but led to another problem while testing - iperf transmits
almost zero data for a higher number of parallel flows, like 64 or more
(and when I run iperf for a 2 min run, it takes about 5-6 mins, and
reports that it ran for 0 secs and that the amount of data transferred
is a few MB's). I don't know why this happens with this being the only
change (any ideas are very appreciated).

The second fix, which resolved this, was to revert back to Dave's
suggestion to use batching only in the net_tx_action case, and to modify
the driver to check whether skbs are present in the batch list and send
them out first before sending the current skb. I still see huge
retransmissions for IPoIB (but not for E1000), though they have reduced
to 12,000 from the earlier 18,000.
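
Schematically, the driver-side rule being described looks roughly like
this (editorial sketch, not from the patches: queue_one_frame() is a
made-up stand-in for the hardware-specific work, the driver tx lock and
the NETDEV_TX_LOCKED path are omitted, and dev->skb_blist is the batch
list from this series):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Stand-in for the hardware-specific work: map the skb, fill the
 * descriptors, return NETDEV_TX_BUSY when the ring has no room. */
static int queue_one_frame(struct net_device *dev, struct sk_buff *skb)
{
        return NETDEV_TX_OK;            /* real driver work elided */
}

static int batching_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct sk_buff_head *blist = dev->skb_blist;
        int ret = NETDEV_TX_OK;

        if (blist && (!skb || skb_queue_len(blist))) {
                /* Older skbs are waiting: send them before the new one
                 * so the receiver never sees out-of-order packets. */
                if (skb)
                        __skb_queue_tail(blist, skb);
                skb = __skb_dequeue(blist);
        }

        while (skb) {
                ret = queue_one_frame(dev, skb);
                if (ret == NETDEV_TX_BUSY) {
                        /* Park it on the batch list and return BUSY;
                         * the qdisc is expected not to requeue. */
                        if (blist)
                                __skb_queue_head(blist, skb);
                        break;
                }
                skb = blist ? __skb_dequeue(blist) : NULL;
        }
        return ret;
}

The key point is that a new skb never jumps ahead of skbs already
parked on the batch list.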



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-16 23:17 ` David Miller
  2007-09-17  0:29   ` jamal
@ 2007-09-17  4:08   ` Krishna Kumar2
  1 sibling, 0 replies; 107+ messages in thread
From: Krishna Kumar2 @ 2007-09-17  4:08 UTC (permalink / raw)
  To: David Miller
  Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
	randy.dunlap, rdreier, rick.jones2, Robert.Olsson, shemminger,
	sri, tgraf, xma

Hi Dave,

David Miller <davem@davemloft.net> wrote on 09/17/2007 04:47:48 AM:

> The only major complaint I have about this patch series is that
> the IPoIB part should just be one big changeset.  Otherwise the
> tree is not bisectable, for example the initial ipoib header file
> change breaks the build.

Right, I will change it accordingly.

> On a lower priority, I question the indirection of skb_blist by making
> it a pointer.  For what?  Saving 12 bytes on 64-bit?  That kmalloc()'d
> thing is a nearly guarenteed cache and/or TLB miss.  Just inline the
> thing, we generally don't do crap like this anywhere else.

The intention was to avoid having two flags (one to indicate that the
driver supports batching and a second to indicate whether batching is
on/off), so I could test skb_blist as an indication of whether batching
is on/off. But your point on the cache miss is absolutely correct, and I
will change this part to be inline.

thanks,

- KK


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching
  2007-09-14 18:37   ` [ofa-general] " Randy Dunlap
@ 2007-09-17  4:10     ` Krishna Kumar2
  2007-09-17  4:13       ` [ofa-general] " Jeff Garzik
  0 siblings, 1 reply; 107+ messages in thread
From: Krishna Kumar2 @ 2007-09-17  4:10 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, johnpol,
	kaber, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
	rdreier, rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

Hi Randy,

Randy Dunlap <randy.dunlap@oracle.com> wrote on 09/15/2007 12:07:09 AM:

> > +   To fix this problem, error cases where driver xmit gets called with a
> > +   skb must code as follows:
> > +      1. If driver xmit cannot get tx lock, return NETDEV_TX_LOCKED
> > +         as usual. This allows qdisc to requeue the skb.
> > +      2. If driver xmit got the lock but failed to send the skb, it
> > +         should return NETDEV_TX_BUSY but before that it should have
> > +         queue'd the skb to the batch list. In this case, the qdisc
>
>                    queued
>
> > +         does not requeue the skb.

Since this was a new section that I added to the documentation, this error
crept in. Thanks for catching it, and for the review comments/ack :)

thanks,

- KK


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching
  2007-09-17  4:10     ` Krishna Kumar2
@ 2007-09-17  4:13       ` Jeff Garzik
  0 siblings, 0 replies; 107+ messages in thread
From: Jeff Garzik @ 2007-09-17  4:13 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Randy Dunlap, jagana, herbert, gaagaan, Robert.Olsson, kumarkr,
	mcarlson, peter.p.waskiewicz.jr, hadi, kaber, netdev, sri,
	general, mchan, tgraf, johnpol, shemminger, davem, rdreier

Please remove me from the CC list.

I get this via netdev, and not having said a single thing in the thread, 
I don't feel the need to be CC'd on every email.

The CC list is pretty massive as it is, anyway.

	Jeff

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-17  2:25         ` David Miller
  2007-09-17  3:01           ` jamal
@ 2007-09-17  4:46           ` Krishna Kumar2
  1 sibling, 0 replies; 107+ messages in thread
From: Krishna Kumar2 @ 2007-09-17  4:46 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general,
	mchan, tgraf, randy.dunlap, shemminger, kaber, sri

[Removing Jeff as requested from thread :) ]

Hi Dave,

David Miller <davem@davemloft.net> wrote on 09/17/2007 07:55:02 AM:

> From: jamal <hadi@cyberus.ca>
> Date: Sun, 16 Sep 2007 22:14:21 -0400
>
> > I still think this work - despite my vested interest - needs more
> > scrutiny from a performance perspective.
>
> Absolutely.
>
> There are tertiary issues I'm personally interested in, for example
> how well this stuff works when we enable software GSO on a non-TSO
> capable card.
>
> In such a case the GSO segment should be split right before we hit the
> driver and then all the sub-segments of the original GSO frame batched
> in one shot down to the device driver.
>
> In this way you'll get a large chunk of the benefit of TSO without
> explicit hardware support for the feature.
>
> There are several cards (some even 10GB) that will benefit immensely
> from this.

I have tried this on ehca, which does not support TSO. I added the GSO
flag at the IPoIB layer (and that resulted in a panic/fix that is
mentioned in this patchset). I will re-run tests for this and submit
results.

Thanks,

- KK

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-17  3:13             ` David Miller
@ 2007-09-17 12:51               ` jamal
  2007-09-17 16:37                 ` [ofa-general] " David Miller
  0 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-09-17 12:51 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, herbert, kaber, shemminger, jagana, Robert.Olsson,
	rick.jones2, xma, gaagaan, netdev, rdreier,
	peter.p.waskiewicz.jr, mcarlson, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri

On Sun, 2007-16-09 at 20:13 -0700, David Miller wrote:

> What Herbert and I want to do is basically turn on TSO for
> devices that can't do it in hardware, and rely upon the GSO
> framework to do the segmenting in software right before we
> hit the device.

Sensible. 

> This only makes sense for devices which can 1) scatter-gather
> and 2) checksum on transmit.  

If you know there are enough descriptors in the driver to cover all the
skbs you are passing, do you need to have #1?
Note I don't touch fragments; I am assuming the driver is smart enough to
handle them, otherwise it won't advertise that it can handle scatter-gather.

> Otherwise we make too many copies and/or passes over the data.

I didn't understand this last bit - you are still going to go over the
list regardless of whether you call ->hard_start_xmit() once or
multiple times over the same list, no? In the latter case I am assuming
a trimmed-down ->hard_start_xmit().

> UDP is too easy a test case in fact :-)

I learnt a lot about the behavior from doing UDP (and before that with
pktgen); there are a lot of driver habits that may need to be tuned
before batching becomes really effective - which is easier to see with
UDP than with TCP.

cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
  2007-09-17 12:51               ` jamal
@ 2007-09-17 16:37                 ` David Miller
  0 siblings, 0 replies; 107+ messages in thread
From: David Miller @ 2007-09-17 16:37 UTC (permalink / raw)
  To: hadi
  Cc: johnpol, jagana, peter.p.waskiewicz.jr, kumarkr, herbert,
	gaagaan, Robert.Olsson, netdev, rdreier, mcarlson, general,
	mchan, tgraf, randy.dunlap, sri, shemminger, kaber

From: jamal <hadi@cyberus.ca>
Date: Mon, 17 Sep 2007 08:51:40 -0400

> On Sun, 2007-16-09 at 20:13 -0700, David Miller wrote:
> 
> > This only makes sense for devices which can 1) scatter-gather
> > and 2) checksum on transmit.  
> 
> If you have knowledge there are enough descriptors in the driver to
> cover all skbs you are passing, do you need to have #1? 
> Note i dont touch fragments, i am assuming the driver is smart enough to
> handle them otherwise it wont advertise it can handle scatter-gather

Yes, because you can have multiple descriptors per SKB
because we have the head part in skb->data and the rest
in the page vector.

Thus the device must be able to handle multiple descriptors
representing one packet.

> > Otherwise we make too many copies and/or passes over the data.
> 
> I didnt understand this last bit - you are still going to go over the
> list regardless of whether you call ->hard_start_xmit() once or
> multiple times over the same list, no? In the later case i am assuming
> a trimmed down ->hard_start_xmit()

If the device can't checksum, we have to pass over the data to
compute the checksum and stick it into the headers.

If the device can't scatter-gather, we have to allocate and
copy into a linear buffer.

Otherwise it's just bumping page reference counts and adjusting
offsets, no data touching at all.
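
A small sketch of that accounting (editorial example; a real driver such
as e1000 also splits large linear areas across several descriptors, so
the true requirement can be higher than this lower bound):

#include <linux/skbuff.h>

/* Lower bound on descriptors needed to post every skb on the batch
 * list: one for the linear part at skb->data plus one per page frag. */
static unsigned int blist_descriptors_needed(struct sk_buff_head *blist)
{
        struct sk_buff *skb;
        unsigned int descs = 0;

        skb_queue_walk(blist, skb)
                descs += 1 + skb_shinfo(skb)->nr_frags;

        return descs;
}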

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCHES] TX batching
  2007-09-17  0:29   ` jamal
  2007-09-17  1:02     ` David Miller
@ 2007-09-23 17:53     ` jamal
  2007-09-23 17:56       ` [ofa-general] [PATCH 1/4] [NET_SCHED] explict hold dev tx lock jamal
                         ` (3 more replies)
  1 sibling, 4 replies; 107+ messages in thread
From: jamal @ 2007-09-23 17:53 UTC (permalink / raw)
  To: David Miller
  Cc: krkumar2, johnpol, herbert, kaber, shemminger, jagana,
	Robert.Olsson, rick.jones2, xma, gaagaan, netdev, rdreier,
	peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general, kumarkr,
	tgraf, randy.dunlap, sri


I had plenty of time this weekend, so I have been doing a _lot_ of
testing.  My next emails will send a set of patches:
 
Patch 1: Introduces explicit tx locking
Patch 2: Introduces batching interface
Patch 3: Core uses batching interface
Patch 4: get rid of dev->gso_skb

Testing
-------
Each of these patches has been performance tested and the results
are in the logs on a per-patch basis. 
My system under test hardware is a 2x dual-core Opteron with a couple of
tg3s.
My test tool generates UDP traffic of different sizes for up to 60
seconds per run, or a total of 30M packets. I have 4 threads, each
running on a specific CPU, which keep all the CPUs as busy as they can
sending packets targeted at a directly connected box's UDP discard
port.

All 4 CPUs target a single tg3 to send. The receiving box has a tc rule
which counts and drops all incoming UDP packets to the discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each
packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

I have not run testing on patch #4 because I had to let the machine
go, but I will have some access to it tomorrow early morning, when I can
run some tests.

Comments
--------
I am trying to kill ->hard_batch_xmit(), but it would be tricky to do
without it for LLTX drivers. Anything I try will require a few extra
checks. OTOH, I could kill LLTX for the drivers I am using that
are LLTX and then drop that interface, or I could say "no support
for LLTX". I am in a dilemma.

Dave, please let me know if this meets your desire to allow devices
which are SG and able to compute CSUM to benefit, just in case I
misunderstood.
Herbert, if you can look at at least patch 4, I will appreciate it.

More patches to follow - I didn't want to overload people by dumping
too many patches. Most of the patches below are ready to go; some
need more testing and others need a little porting from an earlier
kernel:
- tg3 driver (tested and works well, but dont want to send 
- tun driver
- pktgen
- netiron driver
- e1000 driver
- ethtool interface
- There is at least one other driver promised to me

I am also going to update the two documents I posted earlier.
Hopefully I can do that today.

cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-23 17:53     ` [PATCHES] TX batching jamal
@ 2007-09-23 17:56       ` jamal
  2007-09-23 17:58         ` [ofa-general] [PATCH 2/4] [NET_BATCH] Introduce batching interface jamal
  2007-09-24 19:12         ` [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock Waskiewicz Jr, Peter P
  2007-09-23 18:19       ` [PATCHES] TX batching Jeff Garzik
                         ` (2 subsequent siblings)
  3 siblings, 2 replies; 107+ messages in thread
From: jamal @ 2007-09-23 17:56 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 141 bytes --]


I have submitted this before; but here it is again.
Against net-2.6.24 from yesterday for this and all following patches. 


cheers,
jamal


[-- Attachment #2: patch10f4 --]
[-- Type: text/plain, Size: 3260 bytes --]

[NET_SCHED] explict hold dev tx lock

For N CPUs, with full-throttle traffic on all N CPUs funneling traffic
to the same ethernet device, the device's queue lock is contended by all
N CPUs constantly. The TX lock is only contended by a max of 2 CPUs.
In the current mode of qdisc operation, when all N CPUs contend for
the dequeue region and one of them (after all the work) enters the
dequeue region, we may end up aborting the path if we are unable to get
the tx lock, and go back to contend for the queue lock. As N goes up,
this gets worse.

The changes in this patch result in a small increase in performance
on a 4-CPU (2x dual-core) system with no irq binding. My tests are UDP
based and keep all 4 CPUs busy all the time for the period of the test.
Both e1000 and tg3 showed similar behavior.
I expect higher gains with more CPUs. A summary is below with different
UDP packet sizes and the resulting pps seen. Note that at around 200 bytes
the two don't seem that much different, and we are approaching wire
speed (with plenty of CPU available; e.g. at 512B the app is sitting
at 80% idle in both cases).

        +------------+--------------+-------------+------------+--------+
pktsize |       64B  |  128B        | 256B        | 512B       |1024B   |
        +------------+--------------+-------------+------------+--------+
Original| 467482     | 463061       | 388267      | 216308     | 114704 |
        |            |              |             |            |        |
txlock  | 468922     | 464060       | 388298      | 216316     | 114709 |
        -----------------------------------------------------------------

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

---
commit b0e36991c5850dfe930f80ee508b08fdcabc18d1
tree b1787bba26f80a325298f89d1ec882cc5ab524ae
parent 42765047105fdd496976bc1784d22eec1cd9b9aa
author Jamal Hadi Salim <hadi@cyberus.ca> Sun, 23 Sep 2007 09:09:17 -0400
committer Jamal Hadi Salim <hadi@cyberus.ca> Sun, 23 Sep 2007 09:09:17 -0400

 net/sched/sch_generic.c |   19 ++-----------------
 1 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e970e8e..95ae119 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -134,34 +134,19 @@ static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
-	unsigned lockless;
 	int ret;
 
 	/* Dequeue packet */
 	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
 		return 0;
 
-	/*
-	 * When the driver has LLTX set, it does its own locking in
-	 * start_xmit. These checks are worth it because even uncongested
-	 * locks can be quite expensive. The driver can do a trylock, as
-	 * is being done here; in case of lock contention it should return
-	 * NETDEV_TX_LOCKED and the packet will be requeued.
-	 */
-	lockless = (dev->features & NETIF_F_LLTX);
-
-	if (!lockless && !netif_tx_trylock(dev)) {
-		/* Another CPU grabbed the driver tx lock */
-		return handle_dev_cpu_collision(skb, dev, q);
-	}
 
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
 
+	HARD_TX_LOCK(dev, smp_processor_id());
 	ret = dev_hard_start_xmit(skb, dev);
-
-	if (!lockless)
-		netif_tx_unlock(dev);
+	HARD_TX_UNLOCK(dev);
 
 	spin_lock(&dev->queue_lock);
 	q = dev->qdisc;

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [ofa-general] [PATCH 2/4] [NET_BATCH] Introduce batching interface
  2007-09-23 17:56       ` [ofa-general] [PATCH 1/4] [NET_SCHED] explict hold dev tx lock jamal
@ 2007-09-23 17:58         ` jamal
  2007-09-23 18:00           ` [PATCH 3/4][NET_BATCH] net core use batching jamal
                             ` (2 more replies)
  2007-09-24 19:12         ` [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock Waskiewicz Jr, Peter P
  1 sibling, 3 replies; 107+ messages in thread
From: jamal @ 2007-09-23 17:58 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 77 bytes --]


This patch introduces the netdevice interface for batching.

cheers,
jamal


[-- Attachment #2: patch20f4 --]
[-- Type: text/plain, Size: 8823 bytes --]

[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

A typical driver dev->hard_start_xmit() has 4 parts:
a) packet formatting (for example vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete the packet transmit, tell the DMA engine to
chew on it, tx completion interrupts, etc.

[For code cleanliness/readability's sake, regardless of this work,
one should break dev->hard_start_xmit() into those 4 functions
anyway].

With the API introduced in this patch, a driver which has all
4 parts and needs to support batching is advised to split its
dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want
to do this)
Note: there are drivers which may not need to support either of the two
methods (for example the tun driver I patched), so the two methods are
optional.

The core will first do the packet formatting by invoking your supplied
dev->hard_prep_xmit() method. It will then pass you the packet via
your dev->hard_start_xmit() method and lastly will invoke your
dev->hard_end_xmit() when it completes passing you all the packets
queued for you. dev->hard_prep_xmit() is invoked without holding any
tx lock but the rest are under TX_LOCK().

LLTX presents a challenge in that we have to introduce a deviation
from the norm and introduce the ->hard_batch_xmit() method. An LLTX
driver presents us with ->hard_batch_xmit(), to which we pass a list
of packets in a dev->blist skb queue. It is then the responsibility
of ->hard_batch_xmit() to exercise steps #b and #c for all packets,
and #d when the batching is complete. Step #a is already done for you
by the time you get the packets in dev->blist.
Lastly, the xmit_win variable is introduced to ensure that when we pass
the driver a list of packets it will swallow all of them - which is
useful because we don't requeue to the qdisc (and avoids burning
unnecessary CPU cycles or introducing any strange re-ordering). The driver
tells us, when it invokes netif_wake_queue, how much space it has for
descriptors by setting this variable.
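
To make the split concrete, here is a minimal sketch of such a driver.
Every xxx_* name is a hypothetical placeholder and not part of this patch;
only the three methods and xmit_win come from the API described above.

/* Minimal sketch; every xxx_* helper is a hypothetical placeholder. */
static int xxx_prep_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* step #a: per-packet formatting, invoked with no tx lock held */
	xxx_store_tx_flags_in_cb(skb);		/* e.g. vlan/mss into skb->cb */
	return NETDEV_TX_OK;
}

static int xxx_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* steps #b and #c: chip formatting + DMA ring enqueue, under tx lock */
	if (xxx_tx_ring_space(dev) <= skb_shinfo(skb)->nr_frags + 1) {
		netif_stop_queue(dev);
		dev->xmit_win = 1;
		return NETDEV_TX_BUSY;
	}
	xxx_queue_on_ring(skb, dev);
	return NETDEV_TX_OK;
}

static void xxx_end_xmit(struct net_device *dev)
{
	/* step #d: one IO operation (producer index write etc.) per batch */
	xxx_kick_dma(dev);
	dev->xmit_win = xxx_tx_ring_space(dev) - (MAX_SKB_FRAGS + 1);
	if (dev->xmit_win < 1)
		dev->xmit_win = 1;
}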

Some decisions I had to make:
- every driver will have an xmit_win variable and the core will set it
to 1, which means the behavior of non-batching drivers stays the same.
- the batch list, blist, is no longer a pointer; this wastes a little extra
memory which I plan to recoup by killing gso_skb in later patches.

There's a lot of history and reasoning about why batching in a document
I am writing which I may submit as a patch.
Thomas Graf (who probably doesn't know this) gave me the impetus to
start looking at this back in 2004 when he invited me to the Linux
conference he was organizing. Parts of what I presented at SUCON in
2004 talk about batching. Herbert Xu forced me to take a second look around
2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided
me with more motivation in May 2007 when he posted on netdev and engaged me.
Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan,
Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, and
David Miller have contributed in one or more of {bug fixes, enhancements,
testing, lively discussion}. The Broadcom and netiron folks have been
outstanding in their help.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

---
commit ab4b07ef2e4069c115c9c1707d86ae2344a5ded5
tree 994b42b03bbfcc09ac8b7670c53c12e0b2a71dc7
parent b0e36991c5850dfe930f80ee508b08fdcabc18d1
author Jamal Hadi Salim <hadi@cyberus.ca> Sun, 23 Sep 2007 10:30:32 -0400
committer Jamal Hadi Salim <hadi@cyberus.ca> Sun, 23 Sep 2007 10:30:32 -0400

 include/linux/netdevice.h |   17 +++++++
 net/core/dev.c            |  106 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf89ce6..443cded 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -453,6 +453,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -578,6 +579,15 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	/* hard_batch_xmit is needed for LLTX, kill it when those
+	 * disappear or better kill it now and dont support LLTX
+	*/
+	int			(*hard_batch_xmit) (struct net_device *dev);
+	int			(*hard_prep_xmit) (struct sk_buff *skb,
+						   struct net_device *dev);
+	void			(*hard_end_xmit) (struct net_device *dev);
+	int			xmit_win;
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -592,6 +602,7 @@ struct net_device
 
 	/* delayed register/unregister */
 	struct list_head	todo_list;
+	struct sk_buff_head     blist;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
 
@@ -1022,6 +1033,12 @@ extern int		dev_set_mac_address(struct net_device *,
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_batch_xmit(struct net_device *dev);
+extern int		prepare_gso_skb(struct sk_buff *skb,
+					struct net_device *dev,
+					struct sk_buff_head *skbs);
+extern int		xmit_prepare_skb(struct sk_buff *skb,
+					 struct net_device *dev);
 
 extern int		netdev_budget;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 91c31e6..25d01fd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1531,6 +1531,110 @@ static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }
 
+int prepare_gso_skb(struct sk_buff *skb, struct net_device *dev,
+		    struct sk_buff_head *skbs)
+{
+	int tdq = 0;
+	do {
+		struct sk_buff *nskb = skb->next;
+
+		skb->next = nskb->next;
+		nskb->next = NULL;
+
+		if (dev->hard_prep_xmit) {
+			/* note: skb->cb is set in hard_prep_xmit(),
+			 * it should not be trampled somewhere
+			 * between here and the driver picking it
+			 * The VLAN code wrongly assumes it owns it
+			 * so the driver needs to be careful; for
+			 * good handling look at tg3 driver ..
+			*/
+			int ret = dev->hard_prep_xmit(nskb, dev);
+			if (ret != NETDEV_TX_OK)
+				continue;
+		}
+		/* Driver likes this packet .. */
+		tdq++;
+		__skb_queue_tail(skbs, nskb);
+	} while (skb->next);
+	skb->destructor = DEV_GSO_CB(skb)->destructor;
+	kfree_skb(skb);
+
+	return tdq;
+}
+
+int xmit_prepare_skb(struct sk_buff *skb, struct net_device *dev)
+{
+	struct sk_buff_head *skbs = &dev->blist;
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree_skb(skb);
+			return 0;
+		}
+		if (skb->next)
+			return prepare_gso_skb(skb, dev, skbs);
+	}
+
+	if (dev->hard_prep_xmit) {
+		int ret = dev->hard_prep_xmit(skb, dev);
+		if (ret != NETDEV_TX_OK)
+			return 0;
+	}
+	__skb_queue_tail(skbs, skb);
+	return 1;
+}
+
+int dev_batch_xmit(struct net_device *dev)
+{
+	struct sk_buff_head *skbs = &dev->blist;
+	int rc = NETDEV_TX_OK;
+	struct sk_buff *skb;
+	int orig_w = dev->xmit_win;
+	int orig_pkts = skb_queue_len(skbs);
+
+	if (dev->hard_batch_xmit) { /* only for LLTX devices */
+		rc = dev->hard_batch_xmit(dev);
+	} else {
+		while ((skb = __skb_dequeue(skbs)) != NULL) {
+			if (!list_empty(&ptype_all))
+				dev_queue_xmit_nit(skb, dev);
+			rc = dev->hard_start_xmit(skb, dev);
+			if (unlikely(rc))
+				break;
+			/*
+			 * XXX: multiqueue may need closer srutiny..
+			*/
+			if (unlikely(netif_queue_stopped(dev) ||
+			     netif_subqueue_stopped(dev, skb->queue_mapping))) {
+				rc = NETDEV_TX_BUSY;
+				break;
+			}
+		}
+	}
+
+	/* driver is likely buggy and lied to us on how much
+	 * space it had. Damn you driver ..
+	*/
+	if (unlikely(skb_queue_len(skbs))) {
+		printk(KERN_WARNING "Likely bug %s %s (%d) "
+				"left %d/%d window now %d, orig %d\n",
+			dev->name, rc?"busy":"locked",
+			netif_queue_stopped(dev),
+			skb_queue_len(skbs),
+			orig_pkts,
+			dev->xmit_win,
+			orig_w);
+			rc = NETDEV_TX_BUSY;
+	}
+
+	if (orig_pkts > skb_queue_len(skbs))
+		if (dev->hard_end_xmit)
+			dev->hard_end_xmit(dev);
+
+	return rc;
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	if (likely(!skb->next)) {
@@ -3565,6 +3669,8 @@ int register_netdevice(struct net_device *dev)
 		}
 	}
 
+	dev->xmit_win = 1;
+	skb_queue_head_init(&dev->blist);
 	/*
 	 *	nil rebuild_header routine,
 	 *	that should be never called and used as just bug trap.

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 3/4][NET_BATCH] net core use batching
  2007-09-23 17:58         ` [ofa-general] [PATCH 2/4] [NET_BATCH] Introduce batching interface jamal
@ 2007-09-23 18:00           ` jamal
  2007-09-23 18:02             ` [ofa-general] [PATCH 4/4][NET_SCHED] kill dev->gso_skb jamal
                               ` (2 more replies)
  2007-09-30 18:51           ` [ofa-general] [PATCH 1/4] [NET_BATCH] Introduce batching interface jamal
  2007-10-07 18:36           ` [ofa-general] " jamal
  2 siblings, 3 replies; 107+ messages in thread
From: jamal @ 2007-09-23 18:00 UTC (permalink / raw)
  To: David Miller
  Cc: krkumar2, johnpol, herbert, kaber, shemminger, jagana,
	Robert.Olsson, rick.jones2, xma, gaagaan, netdev, rdreier,
	peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general, kumarkr,
	tgraf, randy.dunlap, sri

[-- Attachment #1: Type: text/plain, Size: 71 bytes --]

This patch adds the usage of batching within the core.

cheers,
jamal


[-- Attachment #2: patch30f4 --]
[-- Type: text/plain, Size: 6919 bytes --]

[NET_BATCH] net core use batching

This patch adds the usage of batching within the core.
The same test methodology used in introducing txlock is used, with
the following results on different kernels:

        +------------+--------------+-------------+------------+--------+
        |       64B  |  128B        | 256B        | 512B       |1024B   |
        +------------+--------------+-------------+------------+--------+
Original| 467482     | 463061       | 388267      | 216308     | 114704 |
        |            |              |             |            |        |
txlock  | 468922     | 464060       | 388298      | 216316     | 114709 |
        |            |              |             |            |        |
tg3nobtx| 468012     | 464079       | 388293      | 216314     | 114704 |
        |            |              |             |            |        |
tg3btxdr| 480794     | 475102       | 388298      | 216316     | 114705 |
        |            |              |             |            |        |
tg3btxco| 481059     | 475423       | 388285      | 216308     | 114706 |
        +------------+--------------+-------------+------------+--------+

The first two columns, "Original" and "txlock", were introduced in an earlier
patch and demonstrate a slight increase in performance with txlock.
"tg3nobtx" shows the tg3 driver with no changes to support batching.
The purpose of this test is to demonstrate the effect of introducing
the core changes to a driver that doesn't support them.
Although this patch brings down performance slightly compared to txlock
for such netdevices, it is still better than the original kernel.
"tg3btxdr" demonstrates the effect of using ->hard_batch_xmit() with the tg3
driver. "tg3btxco" demonstrates the effect of letting the core do all the
work. As can be seen, the last two are not very different in performance.
The difference is that ->hard_batch_xmit() introduces a new method which
is intrusive.

I have #if-0ed some of the old functions so the patch is more readable.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

---
commit e26705f6ef7db034df7af3f4fccd7cd40b8e46e0
tree b99c469497a0145ca5c0651dc4229ce17da5b31c
parent 6b8e2f76f86c35a6b2cee3698c633d20495ae0c0
author Jamal Hadi Salim <hadi@cyberus.ca> Sun, 23 Sep 2007 11:35:25 -0400
committer Jamal Hadi Salim <hadi@cyberus.ca> Sun, 23 Sep 2007 11:35:25 -0400

 net/sched/sch_generic.c |  127 +++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 115 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..86a3f9d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 
 	return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+				"Dead loop on netdevice %s, fix it urgently!\n",
+				dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+	       struct Qdisc *q)
+{
+
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+	netif_schedule(dev);
+	return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+	    struct Qdisc *q)
+{
+	int ret = handle_dev_cpu_collision(dev);
+
+	if (ret) {
+		if (!skb_queue_empty(skbs))
+			skb_queue_purge(skbs);
+		return qdisc_qlen(q);
+	}
+
+	return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			   struct Qdisc *q,
+			   struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count  && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	return skb_queue_len(pktlist);
+}
+
+static int xmit_prepare_pkts(struct net_device *dev,
+			     struct sk_buff_head *tlist)
+{
+	struct sk_buff *skb;
+	struct sk_buff_head *flist = &dev->blist;
+
+	while ((skb = __skb_dequeue(tlist)) != NULL)
+		xmit_prepare_skb(skb, dev);
+
+	return skb_queue_len(flist);
+}
 
 /*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
@@ -130,22 +222,27 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+
+static inline int qdisc_restart(struct net_device *dev,
+				struct sk_buff_head *tpktlist)
 {
 	struct Qdisc *q = dev->qdisc;
-	struct sk_buff *skb;
-	int ret;
+	int ret = 0;
 
-	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
-		return 0;
+	ret = xmit_get_pkts(dev, q, tpktlist);
 
+	if (!ret)
+		return 0;
 
-	/* And release queue */
+	/* We got em packets */
 	spin_unlock(&dev->queue_lock);
 
+	/* prepare to embark */
+	xmit_prepare_pkts(dev, tpktlist);
+
+	/* bye packets ....*/
 	HARD_TX_LOCK(dev, smp_processor_id());
-	ret = dev_hard_start_xmit(skb, dev);
+	ret = dev_batch_xmit(dev);
 	HARD_TX_UNLOCK(dev);
 
 	spin_lock(&dev->queue_lock);
@@ -158,8 +255,8 @@ static inline int qdisc_restart(struct net_device *dev)
 		break;
 
 	case NETDEV_TX_LOCKED:
-		/* Driver try lock failed */
-		ret = handle_dev_cpu_collision(skb, dev, q);
+		/* Driver lock failed */
+		ret = xmit_islocked(&dev->blist, dev, q);
 		break;
 
 	default:
@@ -168,7 +265,7 @@ static inline int qdisc_restart(struct net_device *dev)
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
 
-		ret = dev_requeue_skb(skb, dev, q);
+		ret = dev_requeue_skbs(&dev->blist, dev, q);
 		break;
 	}
 
@@ -177,8 +274,11 @@ static inline int qdisc_restart(struct net_device *dev)
 
 void __qdisc_run(struct net_device *dev)
 {
+	struct sk_buff_head tpktlist;
+	skb_queue_head_init(&tpktlist);
+
 	do {
-		if (!qdisc_restart(dev))
+		if (!qdisc_restart(dev, &tpktlist))
 			break;
 	} while (!netif_queue_stopped(dev));
 
@@ -564,6 +664,9 @@ void dev_deactivate(struct net_device *dev)
 
 	skb = dev->gso_skb;
 	dev->gso_skb = NULL;
+	if (!skb_queue_empty(&dev->blist))
+		skb_queue_purge(&dev->blist);
+	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);
 
 	kfree_skb(skb);

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [ofa-general] [PATCH 4/4][NET_SCHED] kill dev->gso_skb
  2007-09-23 18:00           ` [PATCH 3/4][NET_BATCH] net core use batching jamal
@ 2007-09-23 18:02             ` jamal
  2007-09-30 18:53               ` [ofa-general] [PATCH 3/3][NET_SCHED] " jamal
  2007-10-07 18:39               ` [ofa-general] [PATCH 3/3][NET_BATCH] " jamal
  2007-09-30 18:52             ` [ofa-general] [PATCH 2/3][NET_BATCH] net core use batching jamal
  2007-10-07 18:38             ` [ofa-general] " jamal
  2 siblings, 2 replies; 107+ messages in thread
From: jamal @ 2007-09-23 18:02 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 98 bytes --]


This patch removes dev->gso_skb as it is no longer necessary with
batching code.

cheers,
jamal


[-- Attachment #2: patch40f4 --]
[-- Type: text/plain, Size: 2277 bytes --]

[NET_SCHED] kill dev->gso_skb
The batching code now does in the core what GSO used to batch at the drivers,
so there is no more need for gso_skb. If for whatever reason
requeueing turns out to be a bad idea, we can leave packets in dev->blist
(and still not need dev->gso_skb).

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

---
commit c6d2d61a73e1df5daaa294876f62454413fcb0af
tree 1d7bf650096a922a6b6a4e7d6810f83320eb94dd
parent e26705f6ef7db034df7af3f4fccd7cd40b8e46e0
author Jamal Hadi Salim <hadi@cyberus.ca> Sun, 23 Sep 2007 12:25:10 -0400
committer Jamal Hadi Salim <hadi@cyberus.ca> Sun, 23 Sep 2007 12:25:10 -0400

 include/linux/netdevice.h |    3 ---
 net/sched/sch_generic.c   |   12 ------------
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 443cded..7811729 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -560,9 +560,6 @@ struct net_device
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
 
-	/* Partially transmitted GSO packet. */
-	struct sk_buff		*gso_skb;
-
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
 	struct Qdisc		*qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 86a3f9d..b4e1607 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;
 
-	if (count  && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -654,7 +647,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;
 
 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -662,15 +654,11 @@ void dev_deactivate(struct net_device *dev)
 
 	qdisc_reset(qdisc);
 
-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(&dev->blist))
 		skb_queue_purge(&dev->blist);
 	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);
 
-	kfree_skb(skb);
-
 	dev_watchdog_down(dev);
 
 	/* Wait for outstanding dev_queue_xmit calls. */

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [PATCHES] TX batching
  2007-09-23 17:53     ` [PATCHES] TX batching jamal
  2007-09-23 17:56       ` [ofa-general] [PATCH 1/4] [NET_SCHED] explict hold dev tx lock jamal
@ 2007-09-23 18:19       ` Jeff Garzik
  2007-09-23 19:11         ` [ofa-general] " jamal
  2007-09-30 18:50       ` [ofa-general] " jamal
  2007-10-07 18:34       ` [ofa-general] " jamal
  3 siblings, 1 reply; 107+ messages in thread
From: Jeff Garzik @ 2007-09-23 18:19 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, krkumar2, johnpol, herbert, kaber, shemminger,
	jagana, Robert.Olsson, rick.jones2, xma, gaagaan, netdev,
	rdreier, peter.p.waskiewicz.jr, mcarlson, mchan, general,
	kumarkr, tgraf, randy.dunlap, sri

jamal wrote:
> More patches to follow  - i didnt want to overload people by dumping 
> too many patches. Most of these patches below are ready to go; some are
> need some testing and others need a little porting from an earlier
> kernel: 
> - tg3 driver (tested and works well, but dont want to send 
> - tun driver
> - pktgen
> - netiron driver
> - e1000 driver


You should post at least a couple driver patches to see how it's used on 
Real Hardware(tm)...   :)

The batching idea has always seemed like a no-brainer to me, so I'm very 
interested to see how this turns out.

	Jeff



^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCHES] TX batching
  2007-09-23 18:19       ` [PATCHES] TX batching Jeff Garzik
@ 2007-09-23 19:11         ` jamal
  2007-09-23 19:36           ` Kok, Auke
                             ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: jamal @ 2007-09-23 19:11 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, kaber, jagana, general,
	mchan, tgraf, randy.dunlap, sri, shemminger, David Miller

[-- Attachment #1: Type: text/plain, Size: 599 bytes --]

On Sun, 2007-23-09 at 14:19 -0400, Jeff Garzik wrote:

> 
> You should post at least a couple driver patches to see how its used on 
> Real Hardware(tm)...   :)

This is the tg3 patch I used for the testing - against what's in Dave's
net-2.6.24 tree. The patch may be a bit hard to read.
For an example of an LLTX version look at the e1000 in the older git
tree at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

If the Intel folks will accept the patch I'd really like to kill 
the e1000 LLTX interface.
The tg3 in that tree used the old style batch_xmit() interface.

cheers,
jamal

[-- Attachment #2: tg3.p --]
[-- Type: text/x-patch, Size: 16359 bytes --]

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index d4ac6e9..ba0b49e 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3103,6 +3103,13 @@ static inline u32 tg3_tx_avail(struct tg3 *tp)
 		((tp->tx_prod - tp->tx_cons) & (TG3_TX_RING_SIZE - 1)));
 }
 
+static inline void tg3_set_win(struct tg3 *tp)
+{
+	tp->dev->xmit_win = tg3_tx_avail(tp) - (MAX_SKB_FRAGS + 1);
+	if (tp->dev->xmit_win < 1)
+		tp->dev->xmit_win = 1;
+}
+
 /* Tigon3 never reports partial packet sends.  So we do not
  * need special logic to handle SKBs that have not had all
  * of their frags sent yet, like SunGEM does.
@@ -3165,8 +3172,10 @@ static void tg3_tx(struct tg3 *tp)
 		     (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))) {
 		netif_tx_lock(tp->dev);
 		if (netif_queue_stopped(tp->dev) &&
-		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
+		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))) {
+			tg3_set_win(tp);
 			netif_wake_queue(tp->dev);
+		}
 		netif_tx_unlock(tp->dev);
 	}
 }
@@ -3910,47 +3919,67 @@ static void tg3_set_txd(struct tg3 *tp, int entry,
 	txd->vlan_tag = vlan_tag << TXD_VLAN_TAG_SHIFT;
 }
 
-/* hard_start_xmit for devices that don't have any bugs and
- * support TG3_FLG2_HW_TSO_2 only.
- */
-static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+struct tg3_tx_cbdata {
+	u32 base_flags;
+	unsigned int mss;
+};
+#define TG3_SKB_CB(__skb)       ((struct tg3_tx_cbdata *)&((__skb)->cb[0]))
+#define NETDEV_TX_DROPPED       -5
+
+static int tg3_prep_bug_frame(struct sk_buff *skb, struct net_device *dev)
 {
+	struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb);
 	struct tg3 *tp = netdev_priv(dev);
-	dma_addr_t mapping;
-	u32 len, entry, base_flags, mss;
-
-	len = skb_headlen(skb);
+	u32 vlantag = 0;
 
-	/* We are running in BH disabled context with netif_tx_lock
-	 * and TX reclaim runs via tp->napi.poll inside of a software
-	 * interrupt.  Furthermore, IRQ processing runs lockless so we have
-	 * no IRQ context deadlocks to worry about either.  Rejoice!
-	 */
-	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
-		if (!netif_queue_stopped(dev)) {
-			netif_stop_queue(dev);
+#if TG3_VLAN_TAG_USED
+	if (tp->vlgrp != NULL && vlan_tx_tag_present(skb))
+		vlantag = (TXD_FLAG_VLAN | (vlan_tx_tag_get(skb) << 16));
+#endif
 
-			/* This is a hard error, log it. */
-			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
-			       "queue awake!\n", dev->name);
+	cb->base_flags = vlantag;
+	cb->mss = skb_shinfo(skb)->gso_size;
+	if (cb->mss != 0) {
+		if (skb_header_cloned(skb) &&
+		    pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) {
+			dev_kfree_skb(skb);
+			return NETDEV_TX_DROPPED;
 		}
-		return NETDEV_TX_BUSY;
+
+		cb->base_flags |= (TXD_FLAG_CPU_PRE_DMA |
+			       TXD_FLAG_CPU_POST_DMA);
 	}
 
-	entry = tp->tx_prod;
-	base_flags = 0;
-	mss = 0;
-	if ((mss = skb_shinfo(skb)->gso_size) != 0) {
+	if (skb->ip_summed == CHECKSUM_PARTIAL)
+		cb->base_flags |= TXD_FLAG_TCPUDP_CSUM;
+
+	return NETDEV_TX_OK;
+}
+
+static int tg3_prep_frame(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb);
+	struct tg3 *tp = netdev_priv(dev);
+	u32 vlantag = 0;
+
+#if TG3_VLAN_TAG_USED
+	if (tp->vlgrp != NULL && vlan_tx_tag_present(skb))
+		vlantag = (TXD_FLAG_VLAN | (vlan_tx_tag_get(skb) << 16));
+#endif
+
+	cb->base_flags = vlantag;
+	cb->mss = skb_shinfo(skb)->gso_size;
+	if (cb->mss != 0) {
 		int tcp_opt_len, ip_tcp_len;
 
 		if (skb_header_cloned(skb) &&
 		    pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) {
 			dev_kfree_skb(skb);
-			goto out_unlock;
+			return NETDEV_TX_DROPPED;
 		}
 
 		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
-			mss |= (skb_headlen(skb) - ETH_HLEN) << 9;
+			cb->mss |= (skb_headlen(skb) - ETH_HLEN) << 9;
 		else {
 			struct iphdr *iph = ip_hdr(skb);
 
@@ -3958,32 +3987,63 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			ip_tcp_len = ip_hdrlen(skb) + sizeof(struct tcphdr);
 
 			iph->check = 0;
-			iph->tot_len = htons(mss + ip_tcp_len + tcp_opt_len);
-			mss |= (ip_tcp_len + tcp_opt_len) << 9;
+			iph->tot_len = htons(cb->mss + ip_tcp_len
+					     + tcp_opt_len);
+			cb->mss |= (ip_tcp_len + tcp_opt_len) << 9;
 		}
 
-		base_flags |= (TXD_FLAG_CPU_PRE_DMA |
+		cb->base_flags |= (TXD_FLAG_CPU_PRE_DMA |
 			       TXD_FLAG_CPU_POST_DMA);
 
 		tcp_hdr(skb)->check = 0;
 
 	}
 	else if (skb->ip_summed == CHECKSUM_PARTIAL)
-		base_flags |= TXD_FLAG_TCPUDP_CSUM;
-#if TG3_VLAN_TAG_USED
-	if (tp->vlgrp != NULL && vlan_tx_tag_present(skb))
-		base_flags |= (TXD_FLAG_VLAN |
-			       (vlan_tx_tag_get(skb) << 16));
-#endif
+		cb->base_flags |= TXD_FLAG_TCPUDP_CSUM;
+
+	return NETDEV_TX_OK;
+}
+
+void tg3_kick_DMA(struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+	u32 entry = tp->tx_prod;
+
+	/* Packets are ready, update Tx producer idx local and on card. */
+	tw32_tx_mbox((MAILBOX_SNDHOST_PROD_IDX_0 + TG3_64BIT_REG_LOW), entry);
+
+	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
+		netif_stop_queue(dev);
+		dev->xmit_win = 1;
+		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp)) {
+			tg3_set_win(tp);
+			netif_wake_queue(dev);
+		}
+	} else {
+		tg3_set_win(tp);
+	}
 
+	mmiowb();
+	dev->trans_start = jiffies;
+}
+
+static int tg3_enqueue(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+	dma_addr_t mapping;
+	u32 len, entry;
+	struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb);
+
+	entry = tp->tx_prod;
+	len = skb_headlen(skb);
 	/* Queue skb data, a.k.a. the main skb fragment. */
 	mapping = pci_map_single(tp->pdev, skb->data, len, PCI_DMA_TODEVICE);
 
 	tp->tx_buffers[entry].skb = skb;
 	pci_unmap_addr_set(&tp->tx_buffers[entry], mapping, mapping);
 
-	tg3_set_txd(tp, entry, mapping, len, base_flags,
-		    (skb_shinfo(skb)->nr_frags == 0) | (mss << 1));
+	tg3_set_txd(tp, entry, mapping, len, cb->base_flags,
+		    (skb_shinfo(skb)->nr_frags == 0) | (cb->mss << 1));
 
 	entry = NEXT_TX(entry);
 
@@ -4005,28 +4065,71 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			pci_unmap_addr_set(&tp->tx_buffers[entry], mapping, mapping);
 
 			tg3_set_txd(tp, entry, mapping, len,
-				    base_flags, (i == last) | (mss << 1));
+				    cb->base_flags,
+				    (i == last) | (cb->mss << 1));
 
 			entry = NEXT_TX(entry);
 		}
 	}
 
-	/* Packets are ready, update Tx producer idx local and on card. */
-	tw32_tx_mbox((MAILBOX_SNDHOST_PROD_IDX_0 + TG3_64BIT_REG_LOW), entry);
-
 	tp->tx_prod = entry;
-	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
-		netif_stop_queue(dev);
-		if (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))
-			netif_wake_queue(tp->dev);
+	return NETDEV_TX_OK;
+}
+
+/* hard_start_xmit for devices that don't have any bugs and
+ * support TG3_FLG2_HW_TSO_2 only.
+ */
+static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+	int ret = tg3_prep_frame(skb, dev);
+	/* XXX: original code did mmiowb(); on failure,
+	* I dont think thats necessary
+	*/
+	if (unlikely(ret != NETDEV_TX_OK))
+	       return NETDEV_TX_OK;
+
+	/* We are running in BH disabled context with netif_tx_lock
+	 * and TX reclaim runs via tp->poll inside of a software
+	 * interrupt.  Furthermore, IRQ processing runs lockless so we have
+	 * no IRQ context deadlocks to worry about either.  Rejoice!
+	 */
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			tp->dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
 	}
 
-out_unlock:
-    	mmiowb();
+	ret = tg3_enqueue(skb, dev);
+	if (ret == NETDEV_TX_OK)
+		tg3_kick_DMA(dev);
 
-	dev->trans_start = jiffies;
+	return ret;
+}
 
-	return NETDEV_TX_OK;
+static int tg3_start_bxmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	return tg3_enqueue(skb, dev);
 }
 
 static int tg3_start_xmit_dma_bug(struct sk_buff *, struct net_device *);
@@ -4041,9 +4144,11 @@ static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
 	/* Estimate the number of fragments in the worst case */
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))) {
 		netif_stop_queue(tp->dev);
+		tp->dev->xmit_win = 1;
 		if (tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))
 			return NETDEV_TX_BUSY;
 
+		tg3_set_win(tp);
 		netif_wake_queue(tp->dev);
 	}
 
@@ -4067,46 +4172,19 @@ tg3_tso_bug_end:
 /* hard_start_xmit for devices that have the 4G bug and/or 40-bit bug and
  * support TG3_FLG2_HW_TSO_1 or firmware TSO only.
  */
-static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
+static int tg3_enqueue_buggy(struct sk_buff *skb, struct net_device *dev)
 {
 	struct tg3 *tp = netdev_priv(dev);
 	dma_addr_t mapping;
-	u32 len, entry, base_flags, mss;
+	u32 len, entry;
 	int would_hit_hwbug;
+	struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb);
 
-	len = skb_headlen(skb);
-
-	/* We are running in BH disabled context with netif_tx_lock
-	 * and TX reclaim runs via tp->napi.poll inside of a software
-	 * interrupt.  Furthermore, IRQ processing runs lockless so we have
-	 * no IRQ context deadlocks to worry about either.  Rejoice!
-	 */
-	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
-		if (!netif_queue_stopped(dev)) {
-			netif_stop_queue(dev);
-
-			/* This is a hard error, log it. */
-			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
-			       "queue awake!\n", dev->name);
-		}
-		return NETDEV_TX_BUSY;
-	}
 
-	entry = tp->tx_prod;
-	base_flags = 0;
-	if (skb->ip_summed == CHECKSUM_PARTIAL)
-		base_flags |= TXD_FLAG_TCPUDP_CSUM;
-	mss = 0;
-	if ((mss = skb_shinfo(skb)->gso_size) != 0) {
+	if (cb->mss != 0) {
 		struct iphdr *iph;
 		int tcp_opt_len, ip_tcp_len, hdr_len;
 
-		if (skb_header_cloned(skb) &&
-		    pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) {
-			dev_kfree_skb(skb);
-			goto out_unlock;
-		}
-
 		tcp_opt_len = tcp_optlen(skb);
 		ip_tcp_len = ip_hdrlen(skb) + sizeof(struct tcphdr);
 
@@ -4115,15 +4193,13 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 			     (tp->tg3_flags2 & TG3_FLG2_TSO_BUG))
 			return (tg3_tso_bug(tp, skb));
 
-		base_flags |= (TXD_FLAG_CPU_PRE_DMA |
-			       TXD_FLAG_CPU_POST_DMA);
 
 		iph = ip_hdr(skb);
 		iph->check = 0;
-		iph->tot_len = htons(mss + hdr_len);
+		iph->tot_len = htons(cb->mss + hdr_len);
 		if (tp->tg3_flags2 & TG3_FLG2_HW_TSO) {
 			tcp_hdr(skb)->check = 0;
-			base_flags &= ~TXD_FLAG_TCPUDP_CSUM;
+			cb->base_flags &= ~TXD_FLAG_TCPUDP_CSUM;
 		} else
 			tcp_hdr(skb)->check = ~csum_tcpudp_magic(iph->saddr,
 								 iph->daddr, 0,
@@ -4136,22 +4212,19 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 				int tsflags;
 
 				tsflags = (iph->ihl - 5) + (tcp_opt_len >> 2);
-				mss |= (tsflags << 11);
+				cb->mss |= (tsflags << 11);
 			}
 		} else {
 			if (tcp_opt_len || iph->ihl > 5) {
 				int tsflags;
 
 				tsflags = (iph->ihl - 5) + (tcp_opt_len >> 2);
-				base_flags |= tsflags << 12;
+				cb->base_flags |= tsflags << 12;
 			}
 		}
 	}
-#if TG3_VLAN_TAG_USED
-	if (tp->vlgrp != NULL && vlan_tx_tag_present(skb))
-		base_flags |= (TXD_FLAG_VLAN |
-			       (vlan_tx_tag_get(skb) << 16));
-#endif
+	len = skb_headlen(skb);
+	entry = tp->tx_prod;
 
 	/* Queue skb data, a.k.a. the main skb fragment. */
 	mapping = pci_map_single(tp->pdev, skb->data, len, PCI_DMA_TODEVICE);
@@ -4164,8 +4237,8 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 	if (tg3_4g_overflow_test(mapping, len))
 		would_hit_hwbug = 1;
 
-	tg3_set_txd(tp, entry, mapping, len, base_flags,
-		    (skb_shinfo(skb)->nr_frags == 0) | (mss << 1));
+	tg3_set_txd(tp, entry, mapping, len, cb->base_flags,
+		    (skb_shinfo(skb)->nr_frags == 0) | (cb->mss << 1));
 
 	entry = NEXT_TX(entry);
 
@@ -4194,10 +4267,11 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 
 			if (tp->tg3_flags2 & TG3_FLG2_HW_TSO)
 				tg3_set_txd(tp, entry, mapping, len,
-					    base_flags, (i == last)|(mss << 1));
+					    cb->base_flags,
+					    (i == last)|(cb->mss << 1));
 			else
 				tg3_set_txd(tp, entry, mapping, len,
-					    base_flags, (i == last));
+					    cb->base_flags, (i == last));
 
 			entry = NEXT_TX(entry);
 		}
@@ -4214,28 +4288,68 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 		 * failure, silently drop this packet.
 		 */
 		if (tigon3_dma_hwbug_workaround(tp, skb, last_plus_one,
-						&start, base_flags, mss))
-			goto out_unlock;
+						&start, cb->base_flags,
+						cb->mss)) {
+			mmiowb();
+			return NETDEV_TX_OK;
+		}
 
 		entry = start;
 	}
 
-	/* Packets are ready, update Tx producer idx local and on card. */
-	tw32_tx_mbox((MAILBOX_SNDHOST_PROD_IDX_0 + TG3_64BIT_REG_LOW), entry);
-
 	tp->tx_prod = entry;
-	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
-		netif_stop_queue(dev);
-		if (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))
-			netif_wake_queue(tp->dev);
+	return NETDEV_TX_OK;
+}
+
+static int tg3_start_bxmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
 	}
 
-out_unlock:
-    	mmiowb();
+	return  tg3_enqueue_buggy(skb, dev);
+}
 
-	dev->trans_start = jiffies;
+static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+	int ret = tg3_prep_bug_frame(skb, dev);
 
-	return NETDEV_TX_OK;
+	if (unlikely(ret != NETDEV_TX_OK))
+	       return NETDEV_TX_OK;
+
+	/* We are running in BH disabled context with netif_tx_lock
+	 * and TX reclaim runs via tp->poll inside of a software
+	 * interrupt.  Furthermore, IRQ processing runs lockless so we have
+	 * no IRQ context deadlocks to worry about either.  Rejoice!
+	 */
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	ret = tg3_enqueue_buggy(skb, dev);
+	if (ret == NETDEV_TX_OK)
+		tg3_kick_DMA(dev);
+
+	return ret;
 }
 
 static inline void tg3_set_mtu(struct net_device *dev, struct tg3 *tp,
@@ -11039,15 +11153,19 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 	else
 		tp->tg3_flags &= ~TG3_FLAG_POLL_SERDES;
 
+	tp->dev->hard_end_xmit = tg3_kick_DMA;
 	/* All chips before 5787 can get confused if TX buffers
 	 * straddle the 4GB address boundary in some cases.
 	 */
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5755 ||
 	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5787 ||
-	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
-		tp->dev->hard_start_xmit = tg3_start_xmit;
-	else
-		tp->dev->hard_start_xmit = tg3_start_xmit_dma_bug;
+	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906) {
+		tp->dev->hard_start_xmit = tg3_start_bxmit;
+		tp->dev->hard_prep_xmit = tg3_prep_frame;
+	} else {
+		tp->dev->hard_start_xmit = tg3_start_bxmit_dma_bug;
+		tp->dev->hard_prep_xmit = tg3_prep_bug_frame;
+	}
 
 	tp->rx_offset = 2;
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701 &&
@@ -11895,6 +12013,8 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
 	dev->watchdog_timeo = TG3_TX_TIMEOUT;
 	dev->change_mtu = tg3_change_mtu;
 	dev->irq = pdev->irq;
+	dev->features |= NETIF_F_BTX;
+	dev->xmit_win = tp->tx_pending >> 2;
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	dev->poll_controller = tg3_poll_controller;
 #endif

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCHES] TX batching
  2007-09-23 19:11         ` [ofa-general] " jamal
@ 2007-09-23 19:36           ` Kok, Auke
  2007-09-23 21:20             ` jamal
  2007-09-24 22:54           ` [DOC] Net batching driver howto jamal
  2007-09-25  0:15           ` [PATCHES] TX batching Jeff Garzik
  2 siblings, 1 reply; 107+ messages in thread
From: Kok, Auke @ 2007-09-23 19:36 UTC (permalink / raw)
  To: hadi
  Cc: randy.dunlap, Robert.Olsson, gaagaan, kumarkr,
	peter.p.waskiewicz.jr, shemminger, johnpol, herbert, Jeff Garzik,
	rdreier, mcarlson, general, sri, jagana, mchan, netdev, kaber,
	tgraf, David Miller

jamal wrote:
> On Sun, 2007-23-09 at 14:19 -0400, Jeff Garzik wrote:
> 
>> You should post at least a couple driver patches to see how its used on 
>> Real Hardware(tm)...   :)
> 
> This is the tg3 patch i used for the testing - against whats in Daves
> net-2.6.24 tree. Patch may be a bit hard to read.
> For an example of an LLTX version look at the e1000 in the older git
> tree at:
> git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git
> 
> If the intel folks will accept the patch i'd really like to kill 
> the e1000 LLTX interface.
> The tg3 in that tree used the old style batch_xmit() interface.

please be reminded that we're going to strip down e1000 and most of the features
should go into e1000e, which has far fewer hardware workarounds. I'm still
reluctant to put new stuff in e1000 - I really want to chop it down first ;)

Auke

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCHES] TX batching
  2007-09-23 19:36           ` Kok, Auke
@ 2007-09-23 21:20             ` jamal
  2007-09-24  7:00               ` Kok, Auke
  0 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-09-23 21:20 UTC (permalink / raw)
  To: Kok, Auke
  Cc: randy.dunlap, Robert.Olsson, gaagaan, kumarkr,
	peter.p.waskiewicz.jr, shemminger, johnpol, herbert, Jeff Garzik,
	rdreier, mcarlson, general, sri, jagana, mchan, netdev, kaber,
	tgraf, David Miller

On Sun, 2007-23-09 at 12:36 -0700, Kok, Auke wrote:

> please be reminded that we're going to strip down e1000 and most of the features
> should go into e1000e, which has much less hardware workarounds. I'm still
> reluctant to putting in new stuff in e1000 - I really want to chop it down first ;)

sure - the question then is, will you take those changes if I use
e1000e? There are a few cleanups that have nothing to do with batching;
take a look at the modified e1000 in the git tree.

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCHES] TX batching
  2007-09-23 21:20             ` jamal
@ 2007-09-24  7:00               ` Kok, Auke
  2007-09-24 22:38                 ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: Kok, Auke @ 2007-09-24  7:00 UTC (permalink / raw)
  To: hadi
  Cc: randy.dunlap, Robert.Olsson, gaagaan, kumarkr,
	peter.p.waskiewicz.jr, shemminger, johnpol, herbert, Jeff Garzik,
	rdreier, mcarlson, general, sri, jagana, mchan, Kok, Auke,
	netdev, David Miller, tgraf, kaber

jamal wrote:
> On Sun, 2007-23-09 at 12:36 -0700, Kok, Auke wrote:
> 
>> please be reminded that we're going to strip down e1000 and most of the features
>> should go into e1000e, which has much less hardware workarounds. I'm still
>> reluctant to putting in new stuff in e1000 - I really want to chop it down first ;)
> 
> sure - the question then is, will you take those changes if i use
> e1000e? theres a few cleanups that have nothing to do with batching;
> take a look at the modified e1000 on the git tree.

that's bad to begin with :) - please send those separately so I can fasttrack them
into e1000e and e1000 where applicable.

But yes, I'm very inclined to merge more features into e1000e than e1000. I intend
to put multiqueue support into e1000e, as *all* of the hardware that it will
support has multiple queues. Putting in any other performance feature like tx
batching would absolutely be interesting.

Auke

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-23 17:56       ` [ofa-general] [PATCH 1/4] [NET_SCHED] explict hold dev tx lock jamal
  2007-09-23 17:58         ` [ofa-general] [PATCH 2/4] [NET_BATCH] Introduce batching interface jamal
@ 2007-09-24 19:12         ` Waskiewicz Jr, Peter P
  2007-09-24 22:51           ` jamal
  1 sibling, 1 reply; 107+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-09-24 19:12 UTC (permalink / raw)
  To: hadi, David Miller
  Cc: johnpol, kumarkr, herbert, gaagaan, Robert.Olsson, netdev,
	rdreier, mcarlson, randy.dunlap, jagana, general, mchan, tgraf,
	jeff, sri, shemminger, kaber

> I have submitted this before; but here it is again.
> Against net-2.6.24 from yesterday for this and all following patches. 
> 
> 
> cheers,
> jamal

Hi Jamal,
	I've been (slowly) working on resurrecting the original design
of my multiqueue patches to address this exact issue of the queue_lock
being a hot item.  I added a queue_lock to each queue in the subqueue
struct, and in the enqueue and dequeue, just lock that queue instead of
the global device queue_lock.  The only two issues to overcome are the
QDISC_RUNNING state flag, since that also serializes entry into the
qdisc_restart() function, and the qdisc statistics maintenance, which
needs to be serialized.  Do you think this work along with your patch
will benefit from one another?  I apologize for not having working
patches right now, but I am working on them slowly as I have some blips
of spare time.

Thanks,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCHES] TX batching
  2007-09-24  7:00               ` Kok, Auke
@ 2007-09-24 22:38                 ` jamal
  2007-09-24 22:52                   ` [ofa-general] " Kok, Auke
  0 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-09-24 22:38 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Jeff Garzik, David Miller, krkumar2, johnpol, herbert, kaber,
	shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
	netdev, rdreier, peter.p.waskiewicz.jr, mcarlson, mchan, general,
	kumarkr, tgraf, randy.dunlap, sri

On Mon, 2007-24-09 at 00:00 -0700, Kok, Auke wrote:

> that's bad to begin with :) - please send those separately so I can fasttrack them
> into e1000e and e1000 where applicable.

I've been CCing you ;-> Most of the changes are readability and
reusability improvements related to the batching.

> But yes, I'm very inclined to merge more features into e1000e than e1000. I intend
> to put multiqueue support into e1000e, as *all* of the hardware that it will
> support has multiple queues. Putting in any other performance feature like tx
> batching would absolutely be interesting.

I looked at the e1000e and it is very close to e1000, so I should be able
to move the changes easily. Most importantly, can I kill LLTX?
For tx batching, we have to wait to see how Dave wants to move forward;
I will have the patches, but it is not something you need to push until
we see where that is going.

cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-24 19:12         ` [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock Waskiewicz Jr, Peter P
@ 2007-09-24 22:51           ` jamal
  2007-09-24 22:57             ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-09-24 22:51 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P
  Cc: David Miller, krkumar2, johnpol, herbert, kaber, shemminger,
	jagana, Robert.Olsson, rick.jones2, xma, gaagaan, netdev,
	rdreier, mcarlson, jeff, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri

On Mon, 2007-24-09 at 12:12 -0700, Waskiewicz Jr, Peter P wrote:

> Hi Jamal,
> 	I've been (slowly) working on resurrecting the original design
> of my multiqueue patches to address this exact issue of the queue_lock
> being a hot item.  I added a queue_lock to each queue in the subqueue
> struct, and in the enqueue and dequeue, just lock that queue instead of
> the global device queue_lock.  The only two issues to overcome are the
> QDISC_RUNNING state flag, since that also serializes entry into the
> qdisc_restart() function, and the qdisc statistics maintenance, which
> needs to be serialized.  Do you think this work along with your patch
> will benefit from one another? 

The one thing that seems obvious is to use dev->hard_prep_xmit() in the
patches I posted to select the xmit ring. You should be able to
figure out the xmit ring without holding any lock.
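
Purely as an illustration, something along these lines; xxx_hash() and
xxx_nr_tx_rings() are made-up stand-ins for whatever the driver really uses:

/* Illustrative only: select the tx ring lock-free in ->hard_prep_xmit() */
static int xxx_prep_xmit(struct sk_buff *skb, struct net_device *dev)
{
	skb->queue_mapping = xxx_hash(skb) % xxx_nr_tx_rings(dev);
	return NETDEV_TX_OK;
}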

I lost track of how/where things went since the last discussion, so I
need to wrap my mind around it to make sensible suggestions - I know
the core patches are in the kernel but haven't paid attention to the details.
If you look at my second patch you'd see a comment in
dev_batch_xmit() which says I need to scrutinize multiqueue more.

cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCHES] TX batching
  2007-09-24 22:38                 ` jamal
@ 2007-09-24 22:52                   ` Kok, Auke
  0 siblings, 0 replies; 107+ messages in thread
From: Kok, Auke @ 2007-09-24 22:52 UTC (permalink / raw)
  To: hadi
  Cc: randy.dunlap, Robert.Olsson, gaagaan, kumarkr,
	peter.p.waskiewicz.jr, shemminger, johnpol, herbert, Jeff Garzik,
	rdreier, mcarlson, general, sri, jagana, mchan, Kok, Auke,
	netdev, David Miller, tgraf, kaber

jamal wrote:
> On Mon, 2007-24-09 at 00:00 -0700, Kok, Auke wrote:
> 
>> that's bad to begin with :) - please send those separately so I can fasttrack them
>> into e1000e and e1000 where applicable.
> 
> Ive been CCing you ;-> Most of the changes are readability and
> reusability with the batching.
> 
>> But yes, I'm very inclined to merge more features into e1000e than e1000. I intend
>> to put multiqueue support into e1000e, as *all* of the hardware that it will
>> support has multiple queues. Putting in any other performance feature like tx
>> batching would absolutely be interesting.
> 
> I looked at the e1000e and it is very close to e1000 so i should be able
> to move the changes easily. Most importantly, can i kill LLTX?
> For tx batching, we have to wait to see how Dave wants to move forward;
> i will have the patches but it is not something you need to push until
> we see where that is going.

hmm, I thought I had already removed that, but now I see some remnants of it.

By all means, please send a separate patch for that!

Auke

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [DOC] Net batching driver howto
  2007-09-23 19:11         ` [ofa-general] " jamal
  2007-09-23 19:36           ` Kok, Auke
@ 2007-09-24 22:54           ` jamal
  2007-09-25 20:16             ` [ofa-general] " Randy Dunlap
  2007-09-25  0:15           ` [PATCHES] TX batching Jeff Garzik
  2 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-09-24 22:54 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: David Miller, krkumar2, johnpol, herbert, kaber, shemminger,
	jagana, Robert.Olsson, rick.jones2, xma, gaagaan, netdev,
	rdreier, peter.p.waskiewicz.jr, mcarlson, mchan, general,
	kumarkr, tgraf, randy.dunlap, sri

[-- Attachment #1: Type: text/plain, Size: 100 bytes --]


I have updated the driver howto to match the patches I posted yesterday;
it is attached. 

cheers,
jamal

[-- Attachment #2: batch-driver-howto.txt --]
[-- Type: text/plain, Size: 9307 bytes --]

Here's the beginning of a howto for driver authors.

The intended audience for this howto is people already
familiar with netdevices.

1.0  Netdevice Pre-requisites
------------------------------

For hardware-based netdevices, you must at least have hardware that
is capable of doing DMA with many descriptors; i.e. hardware
with a queue length of 3 (as in some fscked ethernet hardware) is
not very useful in this case.

2.0  What is new in the driver API
-----------------------------------

There are 3 new methods and one new variable introduced. These are:
1)dev->hard_prep_xmit()
2)dev->hard_end_xmit()
3)dev->hard_batch_xmit()
4)dev->xmit_win

2.1 Using Core driver changes
-----------------------------

To provide context, let's look at a typical driver abstraction
for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (for example VLAN, mss, descriptor counting, etc.)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell the DMA engine to chew
on, tx completion interrupts, etc.

[For code cleanliness/readability sake, regardless of this work,
one should break dev->hard_start_xmit() into those 4 functions
anyway.]
A driver which has all 4 parts and needs to support batching is
advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you
want to do this)
Note: there are drivers which may not need to support either of the two
methods (for example the tun driver I patched), so the two methods are
essentially optional.

2.1.1 Theory of operation
--------------------------

The core will first do the packet formatting by invoking your
supplied dev->hard_prep_xmit() method. It will then pass you packets
via your dev->hard_start_xmit() method, up to as many packets as you
have advertised (via dev->xmit_win) that you can consume. Lastly it will
invoke your dev->hard_end_xmit() once it has passed you all the
packets queued for you.
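
In rough terms, one core batching round looks like the sketch below. This is
a condensed paraphrase of the posted xmit_prepare_skb()/dev_batch_xmit()
patches, not the exact code; error paths are omitted.

----
  /* Condensed sketch of one core batching round; pkts came off the qdisc */
  static void sketch_one_batch(struct net_device *dev,
			       struct sk_buff_head *pkts)
  {
	struct sk_buff *skb;
	int sent = 0;

	/* step #a: prep each dequeued packet, no tx lock held */
	while ((skb = __skb_dequeue(pkts)) != NULL) {
		if (dev->hard_prep_xmit && dev->hard_prep_xmit(skb, dev))
			continue;	/* driver dropped/consumed it */
		__skb_queue_tail(&dev->blist, skb);
	}

	HARD_TX_LOCK(dev, smp_processor_id());
	while ((skb = __skb_dequeue(&dev->blist)) != NULL) {
		if (dev->hard_start_xmit(skb, dev))	/* steps #b and #c */
			break;
		sent++;
	}
	if (sent && dev->hard_end_xmit)
		dev->hard_end_xmit(dev);		/* step #d, once */
	HARD_TX_UNLOCK(dev);
  }
------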


2.1.1.1 Locking rules
---------------------

dev->hard_prep_xmit() is invoked without holding any
tx lock but the rest are under TX_LOCK(). So you have to ensure that
whatever you put in dev->hard_prep_xmit() doesn't require locking.

2.1.1.2 The slippery LLTX
-------------------------

LLTX drivers present a challenge in that we have to introduce a deviation
from the norm and require the ->hard_batch_xmit() method. An LLTX
driver presents us with ->hard_batch_xmit(), to which we pass a list
of packets in a dev->blist skb queue. It is then the responsibility
of ->hard_batch_xmit() to exercise steps #b and #c for all packets
passed in dev->blist.
Steps #a and #d are done by the core, provided you register
dev->hard_prep_xmit() and dev->hard_end_xmit() in your setup.

2.1.1.3 xmit_win
----------------

The dev->xmit_win variable is set by the driver to tell us how
much space it has in its rings/queues. dev->xmit_win is introduced to
ensure that when we pass the driver a list of packets it will swallow
all of them - which is useful because we don't requeue to the qdisc (and
avoids burning unnecessary CPU cycles or introducing any strange
re-ordering). The driver tells us, whenever it invokes netif_wake_queue,
how much space it has for descriptors by setting this variable.
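
For reference, this is roughly how the core turns xmit_win into a budget
when it pulls packets from the qdisc; it is condensed from the posted
xmit_get_pkts(), with each skb charged one descriptor plus one per fragment.

----
  static int sketch_get_pkts(struct net_device *dev, struct Qdisc *q,
			     struct sk_buff_head *pktlist)
  {
	struct sk_buff *skb;
	int budget = dev->xmit_win;	/* descriptors the driver advertised */

	while (budget > 0) {
		skb = q->dequeue(q);
		if (!skb)
			break;
		budget -= skb_shinfo(skb)->nr_frags + 1;
		__skb_queue_tail(pktlist, skb);
	}
	return skb_queue_len(pktlist);
  }
------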

3.0 Driver Essentials
---------------------

The typical driver tx state machine is:

----
-1-> +Core sends packets
     +--> Driver puts packet onto hardware queue
     +    if hardware queue is full, netif_stop_queue(dev)
     +
-2-> +core stops sending because of netif_stop_queue(dev)
..
.. time passes ...
..
-3-> +---> driver has transmitted packets, opens up tx path by
          invoking netif_wake_queue(dev)
-1-> +Cycle repeats and core sends more packets (step 1).
----

3.1  Driver pre-requisite
--------------------------

This is _a very important_ requirement for making batching useful.
The pre-requisite for the batching changes is that the driver should
provide a low threshold to open up the tx path.
Drivers such as tg3 and e1000 already do this.
Before you invoke netif_wake_queue(dev), you check whether there is
enough free space to insert new packets.

Here's an example of how I added it to the tun driver. Observe the
setting of dev->xmit_win:

---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---

Here's how the batching e1000 driver does it:

--
if (unlikely(cleaned && netif_carrier_ok(netdev) &&
     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {

	if (netif_queue_stopped(netdev)) {
	       int rspace =  E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS +  2);
	       netdev->xmit_win = rspace;
	       netif_wake_queue(netdev);
       }
---

The same check in tg3 code (with no batching changes) looks like:

-----
	if (netif_queue_stopped(tp->dev) &&
		(tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
			netif_wake_queue(tp->dev);
---

3.2 Driver Setup
-----------------

a) On initialization (before netdev registration)
 i) set NETIF_F_BTX in dev->features,
  i.e. dev->features |= NETIF_F_BTX
  This makes the core do the proper initialization.

 ii) set dev->xmit_win to something reasonable,
  e.g. maybe half the tx DMA ring size.

b) set up proper pointers to the new methods described above if
you need them.

3.3 Annotation on the different methods 
----------------------------------------
This section shows examples and offers suggestions on how the different 
methods and the xmit_win variable could be used.

3.3.1 The dev->hard_prep_xmit() method
---------------------------------------

Use this method to do only the pre-processing of the skb passed, i.e.
whatever your current dev->hard_start_xmit() does to a packet before
holding any locks (e.g. formatting it to be put in a descriptor, etc.).
Look at e1000_prep_queue_frame() for an example.
You may use skb->cb to store any state that you need to know
about later when batching.
PS: I have found when discussing with Michael Chan and Matt Carlson
that skb->cb[0] (8 bytes of it) is used by the VLAN code to pass VLAN
info to the driver.
I think this is a violation of the usage of the cb scratch pad.
To work around this, you could use skb->cb[8] onwards or do what the Broadcom
tg3 batching driver does, which is to glean the VLAN info first and then
re-use skb->cb.
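
A minimal sketch of such a method, in the spirit of tg3_prep_frame() from
the tg3 patch posted in this thread; the xxx_ names and the flag value are
illustrative, not from any real driver.

----
  struct xxx_tx_cbdata {
	u32		base_flags;	/* chip tx flags, computed lock-free */
	unsigned int	mss;
  };
  #define XXX_SKB_CB(__skb) ((struct xxx_tx_cbdata *)&((__skb)->cb[0]))
  #define XXX_TXD_CSUM	0x1		/* illustrative chip flag */

  static int xxx_prep_frame(struct sk_buff *skb, struct net_device *dev)
  {
	struct xxx_tx_cbdata *cb = XXX_SKB_CB(skb);

	cb->base_flags = 0;
	cb->mss = skb_shinfo(skb)->gso_size;
	if (skb->ip_summed == CHECKSUM_PARTIAL)
		cb->base_flags |= XXX_TXD_CSUM;

	return NETDEV_TX_OK;
  }
------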

3.3.2 dev->hard_start_xmit()
----------------------------
  
Here's an example of a tx routine that is similar to the one I added
to the current tun driver. The bxmit suffix is kept so that you can turn
off batching if needed and fall back to the already existing interface.

----
  static int xxx_net_bxmit(struct sk_buff *skb, struct net_device *dev)
  {
  ....
  ....
	enqueue onto hardware ring
				           
	if (hardware ring full) {
		  netif_stop_queue(dev);
		  dev->xmit_win = 1;
	}

   .......
   ..
   .
  }
------

All return codes like NETDEV_TX_OK etc still apply.

3.3.3 The LLTX batching method, dev->hard_batch_xmit()
-------------------------------------------------------
  
Here's an example of a batch tx routine that is similar
to the one I added to the older tun driver. Essentially
this is what you'd do if you wanted to support LLTX.

----
  static int xxx_net_bxmit(struct net_device *dev)
  {
  ....
  ....
        while (skb_queue_len(&dev->blist)) {
	        dequeue from dev->blist
		enqueue onto hardware ring
		if hardware ring full break
        }
				           
	if (hardware ring full) {
		  netif_stop_queue(dev);
		  dev->xmit_win = 1;
	}

       .......
       ..
       .
  }
------

All return codes like NETDEV_TX_OK etc. still apply.

3.3.4 The tx complete, dev->hard_end_xmit()
-------------------------------------------------
  
In this method, if there are any IO operations that apply to a
set of packets, such as kicking DMA or setting interrupt thresholds,
leave them to the end and apply them once if you have successfully enqueued.
This provides a mechanism for saving a lot of CPU cycles, since IO
is cycle-expensive.
For an example of this, look at the e1000 driver's e1000_kick_DMA() function.
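
A hypothetical sketch of such a completion method is shown below; the
xxx_* names and the tail-pointer register are invented, the point is
only that the expensive kick happens once per batch:

----
  #define XXX_TX_TAIL	0x10		/* invented register offset */

  struct xxx_priv {
	void __iomem	*regs;		/* chip register window */
	u32		tx_tail;	/* next-to-use tx descriptor index */
  };

  static void xxx_hard_end_xmit(struct net_device *dev)
  {
	struct xxx_priv *priv = netdev_priv(dev);

	/* all packets of this batch already sit on the DMA ring;
	 * do the IO exactly once for the whole batch
	 */
	wmb();				/* descriptors visible before the kick */
	writel(priv->tx_tail, priv->regs + XXX_TX_TAIL);

	/* per-batch interrupt mitigation could also be tuned here */
  }
------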

3.3.5 setting the dev->xmit_win 
-----------------------------

As mentioned earlier, this variable provides hints on how much
data to send from the core to the driver. Here are the obvious ways:
a) on doing a netif_stop, set it to 1. By default all drivers have
this value set to 1 to emulate the old behavior where a driver only
receives one packet at a time.
b) on netif_wake_queue, set it to the max available space. You have
to be careful if your hardware does scatter-gather, since the core
will pass you scatter-gatherable skbs, so you want to at least
leave enough space for the maximum allowed. Look at the tg3 and
e1000 code to see how this is implemented.

The variable is important because it avoids the core sending
any more than what the driver can handle, therefore avoiding
any need to muck with packet scheduling mechanisms. A rough sketch
of the stop/wake pattern from (a) and (b) follows.
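
In this hypothetical sketch, xxx_desc_unused() and
XXX_TX_WAKE_THRESHOLD are placeholders (not real helpers), and the
scatter-gather headroom mirrors the e1000 example earlier:

----
  /* tx path: after enqueueing an skb onto the hardware ring */
  if (xxx_desc_unused(priv) < MAX_SKB_FRAGS + 2) {
	netif_stop_queue(dev);
	dev->xmit_win = 1;	/* (a) closed: back to one packet at a time */
  }

  /* tx completion path: once enough descriptors have been reclaimed */
  if (netif_queue_stopped(dev) &&
      xxx_desc_unused(priv) >= XXX_TX_WAKE_THRESHOLD) {
	/* (b) leave room for one maximally fragmented (SG) skb */
	dev->xmit_win = xxx_desc_unused(priv) - (MAX_SKB_FRAGS + 2);
	netif_wake_queue(dev);
  }
------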

Appendix 1: History
-------------------
June 11/2007: Initial revision
June 11/2007: Fixed typo on e1000 netif_wake description ..
Aug  08/2007: Added info on VLAN and the skb->cb[] danger ..
Sep  24/2007: Revised and cleaned up

^ permalink raw reply	[flat|nested] 107+ messages in thread

* RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-24 22:51           ` jamal
@ 2007-09-24 22:57             ` Waskiewicz Jr, Peter P
  2007-09-24 23:38               ` [ofa-general] " jamal
  0 siblings, 1 reply; 107+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-09-24 22:57 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, krkumar2, johnpol, herbert, kaber, shemminger,
	jagana, Robert.Olsson, rick.jones2, xma, gaagaan, netdev,
	rdreier, mcarlson, jeff, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri

> The one thing that seems obvious is to use 
> dev->hard_prep_xmit() in the patches i posted to select the 
> xmit ring. You should be able to do figure out the txmit ring 
> without holding any lock. 

I've looked at that as a candidate to use.  The lock for enqueue would
be needed when actually placing the skb into the appropriate software
queue for the qdisc, so it'd be quick.

> I lost track of how/where things went since the last 
> discussion; so i need to wrap my mind around it to make 
> sensisble suggestions - I know the core patches are in the 
> kernel but havent paid attention to details and if you look 
> at my second patch youd see a comment in
> dev_batch_xmit() which says i need to scrutinize multiqueue more. 

No worries.  I'll try to get things together on my end and provide some
patches to add a per-queue lock.  In the meantime, I'll take a much
closer look at the batching code, since I stopped looking at the
patches in depth about a month ago.  :-(

Thanks,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-24 22:57             ` Waskiewicz Jr, Peter P
@ 2007-09-24 23:38               ` jamal
  2007-09-24 23:47                 ` Waskiewicz Jr, Peter P
  2007-10-08  4:51                 ` [ofa-general] " David Miller
  0 siblings, 2 replies; 107+ messages in thread
From: jamal @ 2007-09-24 23:38 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P
  Cc: johnpol, kumarkr, herbert, gaagaan, Robert.Olsson, netdev,
	rdreier, mcarlson, kaber, randy.dunlap, jagana, general, mchan,
	tgraf, jeff, sri, shemminger, David Miller

On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote:

> I've looked at that as a candidate to use.  The lock for enqueue would
> be needed when actually placing the skb into the appropriate software
> queue for the qdisc, so it'd be quick.

The enqueue is easy to comprehend. The single device queue lock should
suffice. The dequeue is interesting:
Maybe you can point me to some doc or describe to me the dequeue aspect;
are you planning to have an array of txlocks, one per ring?
And what is the policy for how the qdisc queues are locked/mapped to tx rings? 

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-24 23:38               ` [ofa-general] " jamal
@ 2007-09-24 23:47                 ` Waskiewicz Jr, Peter P
  2007-09-25  0:14                   ` [ofa-general] " Stephen Hemminger
  2007-09-25 13:08                   ` [ofa-general] " jamal
  2007-10-08  4:51                 ` [ofa-general] " David Miller
  1 sibling, 2 replies; 107+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-09-24 23:47 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, krkumar2, johnpol, herbert, kaber, shemminger,
	jagana, Robert.Olsson, rick.jones2, xma, gaagaan, netdev,
	rdreier, mcarlson, jeff, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri

> On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote:
> 
> > I've looked at that as a candidate to use.  The lock for 
> enqueue would 
> > be needed when actually placing the skb into the 
> appropriate software 
> > queue for the qdisc, so it'd be quick.
> 
> The enqueue is easy to comprehend. The single device queue 
> lock should suffice. The dequeue is interesting:

We should make sure we're symmetric with the locking on enqueue to
dequeue.  If we use the single device queue lock on enqueue, then
dequeue will also need to check that lock in addition to the individual
queue lock.  Those details are more trivial than making the actual
dequeue efficient, though.

> Maybe you can point me to some doc or describe to me the 
> dequeue aspect; are you planning to have an array of txlocks 
> per, one per ring?
> How is the policy to define the qdisc queues locked/mapped to 
> tx rings? 

The dequeue locking would be pushed into the qdisc itself.  This is how
I had it originally, and it did make the code more complex, but it was
successful at breaking the heavily-contended queue_lock apart.  I have a
subqueue structure right now in netdev, which only has queue_state (for
netif_{start|stop}_subqueue).  This state is checked in sch_prio right
now in the dequeue for both prio and rr.  My approach is to add a
queue_lock in that struct, so each queue allocated by the driver would
have a lock per queue.  Then in dequeue, that lock would be taken when
the skb is about to be dequeued.

The skb->queue_mapping field also maps directly to the queue index
itself, so it can be unlocked easily outside of the context of the
dequeue function.  The policy would be to use a spin_trylock() in
dequeue, so that dequeue can still do work if enqueue or another dequeue
is busy.  And the allocation of qdisc queues to device queues is assumed
to be one-to-one (that's how the qdisc behaves now).

I really just need to put my nose to the grindstone and get the patches
together and to the list...stay tuned.

Thanks,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-24 23:47                 ` Waskiewicz Jr, Peter P
@ 2007-09-25  0:14                   ` Stephen Hemminger
  2007-09-25  0:31                     ` [ofa-general] " Waskiewicz Jr, Peter P
  2007-09-25 13:15                     ` [ofa-general] " jamal
  2007-09-25 13:08                   ` [ofa-general] " jamal
  1 sibling, 2 replies; 107+ messages in thread
From: Stephen Hemminger @ 2007-09-25  0:14 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P
  Cc: johnpol, jeff, kumarkr, herbert, gaagaan, Robert.Olsson, netdev,
	rdreier, hadi, kaber, randy.dunlap, jagana, general, mchan,
	tgraf, mcarlson, sri, David Miller

On Mon, 24 Sep 2007 16:47:06 -0700
"Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com> wrote:

> > On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote:
> > 
> > > I've looked at that as a candidate to use.  The lock for 
> > enqueue would 
> > > be needed when actually placing the skb into the 
> > appropriate software 
> > > queue for the qdisc, so it'd be quick.
> > 
> > The enqueue is easy to comprehend. The single device queue 
> > lock should suffice. The dequeue is interesting:
> 
> We should make sure we're symmetric with the locking on enqueue to
> dequeue.  If we use the single device queue lock on enqueue, then
> dequeue will also need to check that lock in addition to the individual
> queue lock.  The details of this are more trivial than the actual
> dequeue to make it efficient though.
> 
> > Maybe you can point me to some doc or describe to me the 
> > dequeue aspect; are you planning to have an array of txlocks 
> > per, one per ring?
> > How is the policy to define the qdisc queues locked/mapped to 
> > tx rings? 
> 
> The dequeue locking would be pushed into the qdisc itself.  This is how
> I had it originally, and it did make the code more complex, but it was
> successful at breaking the heavily-contended queue_lock apart.  I have a
> subqueue structure right now in netdev, which only has queue_state (for
> netif_{start|stop}_subqueue).  This state is checked in sch_prio right
> now in the dequeue for both prio and rr.  My approach is to add a
> queue_lock in that struct, so each queue allocated by the driver would
> have a lock per queue.  Then in dequeue, that lock would be taken when
> the skb is about to be dequeued.
> 
> The skb->queue_mapping field also maps directly to the queue index
> itself, so it can be unlocked easily outside of the context of the
> dequeue function.  The policy would be to use a spin_trylock() in
> dequeue, so that dequeue can still do work if enqueue or another dequeue
> is busy.  And the allocation of qdisc queues to device queues is assumed
> to be one-to-one (that's how the qdisc behaves now).
> 
> I really just need to put my nose to the grindstone and get the patches
> together and to the list...stay tuned.
> 
> Thanks,
> -PJ Waskiewicz
> -


Since we are redoing this, is there any way to make the whole TX path
more lockless?  The existing model seems to be more of a monitor than
a real locking model.
-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCHES] TX batching
  2007-09-23 19:11         ` [ofa-general] " jamal
  2007-09-23 19:36           ` Kok, Auke
  2007-09-24 22:54           ` [DOC] Net batching driver howto jamal
@ 2007-09-25  0:15           ` Jeff Garzik
  2 siblings, 0 replies; 107+ messages in thread
From: Jeff Garzik @ 2007-09-25  0:15 UTC (permalink / raw)
  To: hadi, David Miller
  Cc: krkumar2, johnpol, herbert, kaber, shemminger, jagana,
	Robert.Olsson, rick.jones2, xma, gaagaan, netdev, rdreier,
	peter.p.waskiewicz.jr, mcarlson, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri

jamal wrote:
> If the intel folks will accept the patch i'd really like to kill 
> the e1000 LLTX interface.


If I understood DaveM correctly, it sounds like we want to 
deprecate all use of LLTX on "real" hardware?  If so, several such 
projects might be considered, as well as possibly simplifying the TX 
batching work.

Also, WRT e1000 specifically, I was hoping to minimize changes, and 
focus people on e1000e.

e1000e replaces (deprecates) large portions of e1000, namely the support 
for the PCI Express modern chips.  When e1000e has proven itself in the 
field, we can potentially look at several e1000 simplifications, during 
the large scale code removal that becomes possible.

	Jeff



^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-25  0:14                   ` [ofa-general] " Stephen Hemminger
@ 2007-09-25  0:31                     ` Waskiewicz Jr, Peter P
  2007-09-25 13:15                     ` [ofa-general] " jamal
  1 sibling, 0 replies; 107+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-09-25  0:31 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: johnpol, jeff, kumarkr, herbert, gaagaan, Robert.Olsson, netdev,
	rdreier, hadi, kaber, randy.dunlap, jagana, general, mchan,
	tgraf, mcarlson, sri, David Miller

> > I really just need to put my nose to the grindstone and get the 
> > patches together and to the list...stay tuned.
> > 
> > Thanks,
> > -PJ Waskiewicz
> > -
> 
> 
> Since we are redoing this, is there any way to make the whole 
> TX path more lockless?  The existing model seems to be more 
> of a monitor than a real locking model.

That seems quite reasonable.  I will certainly see what I can do.

Thanks Stephen,

-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-24 23:47                 ` Waskiewicz Jr, Peter P
  2007-09-25  0:14                   ` [ofa-general] " Stephen Hemminger
@ 2007-09-25 13:08                   ` jamal
  1 sibling, 0 replies; 107+ messages in thread
From: jamal @ 2007-09-25 13:08 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P
  Cc: johnpol, kumarkr, herbert, gaagaan, Robert.Olsson, netdev,
	rdreier, mcarlson, kaber, randy.dunlap, jagana, general, mchan,
	tgraf, jeff, sri, shemminger, David Miller

On Mon, 2007-24-09 at 16:47 -0700, Waskiewicz Jr, Peter P wrote:

> We should make sure we're symmetric with the locking on enqueue to
> dequeue.  If we use the single device queue lock on enqueue, then
> dequeue will also need to check that lock in addition to the individual
> queue lock.  The details of this are more trivial than the actual
> dequeue to make it efficient though.

It would be interesting to observe the performance implications.

> The dequeue locking would be pushed into the qdisc itself.  This is how
> I had it originally, and it did make the code more complex, but it was
> successful at breaking the heavily-contended queue_lock apart.  I have a
> subqueue structure right now in netdev, which only has queue_state (for
> netif_{start|stop}_subqueue).  This state is checked in sch_prio right
> now in the dequeue for both prio and rr.  My approach is to add a
> queue_lock in that struct, so each queue allocated by the driver would
> have a lock per queue.  Then in dequeue, that lock would be taken when
> the skb is about to be dequeued.

More locks imply degraded performance. If only one processor can enter
that region, presumably after acquiring the outer lock, why the
secondary lock per queue?

> The skb->queue_mapping field also maps directly to the queue index
> itself, so it can be unlocked easily outside of the context of the
> dequeue function.  The policy would be to use a spin_trylock() in
> dequeue, so that dequeue can still do work if enqueue or another dequeue
> is busy.  

So there could be a parallel cpu dequeueing at the same time?
Wouldn't this have implications depending on what scheduling
algorithm is used? If for example i was doing priority queueing, i would
want to make sure the highest priority is being dequeued first AND by
all means goes out first to the driver; i don't want a parallel cpu
dequeueing a lower prio packet at the same time.

> And the allocation of qdisc queues to device queues is assumed
> to be one-to-one (that's how the qdisc behaves now).

Ok, that brings back the discussion we had; my thinking was something
like dev->hard_prep_xmit() would select the ring, and i think you
statically map the ring to a qdisc queue already. So i don't think 
dev->hard_prep_xmit() is useful to you.
In any case, there is nothing the batching patches do that interferes
with or prevents you from going the path you intend to. Instead of
dequeueing one packet, you dequeue several, and instead of sending the
driver one packet, you send several. And using the xmit_win, you should
never ever have to requeue.

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-25  0:14                   ` [ofa-general] " Stephen Hemminger
  2007-09-25  0:31                     ` [ofa-general] " Waskiewicz Jr, Peter P
@ 2007-09-25 13:15                     ` jamal
  2007-09-25 15:24                       ` Stephen Hemminger
  1 sibling, 1 reply; 107+ messages in thread
From: jamal @ 2007-09-25 13:15 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: johnpol, kumarkr, herbert, gaagaan, Robert.Olsson, netdev,
	rdreier, Waskiewicz Jr, Peter P, mcarlson, kaber, randy.dunlap,
	jagana, general, mchan, tgraf, jeff, sri, David Miller

On Mon, 2007-24-09 at 17:14 -0700, Stephen Hemminger wrote:

> Since we are redoing this, 
> is there any way to make the whole TX path
> more lockless?  The existing model seems to be more of a monitor than
> a real locking model.

What do you mean it is "more of a monitor"?

On the challenge of making it lockless:
About every NAPI driver combines the tx pruning with rx polling. If you
are dealing with tx resources on the receive thread as well as the tx
thread, _you need_ locking. The only other way we can avoid it is to
separate the rx path interrupts from the ones for tx-related resources;
the last NAPI driver that did that was tulip; i think the e1000 for a
short period in its life did the same as well. But that has been frowned
upon and people have evolved away from it.

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-25 13:15                     ` [ofa-general] " jamal
@ 2007-09-25 15:24                       ` Stephen Hemminger
  2007-09-25 22:14                         ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: Stephen Hemminger @ 2007-09-25 15:24 UTC (permalink / raw)
  To: hadi
  Cc: johnpol, kumarkr, herbert, gaagaan, Robert.Olsson, netdev,
	rdreier, Waskiewicz Jr, Peter P, mcarlson, kaber, randy.dunlap,
	jagana, general, mchan, tgraf, jeff, sri, David Miller

On Tue, 25 Sep 2007 09:15:38 -0400
jamal <hadi@cyberus.ca> wrote:

> On Mon, 2007-24-09 at 17:14 -0700, Stephen Hemminger wrote:
> 
> > Since we are redoing this, 
> > is there any way to make the whole TX path
> > more lockless?  The existing model seems to be more of a monitor than
> > a real locking model.
> 

http://en.wikipedia.org/wiki/Monitor_(synchronization)
> What do you mean it is "more of a monitor"?

The transmit code path is locked as a code region, rather than with object locking
on the transmit queue or some other fine-grained object. This leads to moderately long
lock hold times when multiple qdiscs and classification are being done.

> 
> On the challenge of making it lockless:
> About every NAPI driver combines the tx prunning with rx polling. If you
> are dealing with tx resources on receive thread as well as tx thread,
> _you need_ locking. The only other way we can do avoid it is to separate
> the rx path interupts from ones on tx related resources; the last NAPI
> driver that did that was tulip; i think the e1000 for a short period in
> its life did the same as well. But that has been frowned on and people
> have evolved away from it.

If we went to finer-grained locking, it would also mean changes to all network
devices using the new locking model. My assumption is that we would use
something like the features flag to do the transition for backward compatibility.

Take this as a purely "what if" or "it would be nice if" kind of suggestion,
not a requirement or some grand plan.


-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [DOC] Net batching driver howto
  2007-09-24 22:54           ` [DOC] Net batching driver howto jamal
@ 2007-09-25 20:16             ` Randy Dunlap
  2007-09-25 22:28               ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: Randy Dunlap @ 2007-09-25 20:16 UTC (permalink / raw)
  To: hadi
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, Jeff Garzik,
	Robert.Olsson, netdev, rdreier, mcarlson, David Miller, gaagaan,
	jagana, general, mchan, tgraf, sri, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 536 bytes --]

On Mon, 24 Sep 2007 18:54:19 -0400 jamal wrote:

> I have updated the driver howto to match the patches i posted yesterday.
> attached. 

Thanks for sending this.

This is an early draft, right?

I'll fix some typos etc. in it (patch attached) and add some whitespace.
Please see RD: in the patch for more questions/comments.

IMO it needs some changes to eliminate words like "we", "you",
and "your" (words that personify code).  Those words are OK
when talking about yourself.


---
~Randy
Phaedrus says that Quality is about caring.

[-- Attachment #2: batch-howto.patch --]
[-- Type: text/x-patch, Size: 11579 bytes --]


 batch-driver-howto.txt |  116 +++++++++++++++++++++------------------
 1 file changed, 63 insertions(+), 53 deletions(-)

diff -Naurp tmp/batch-driver-howto.txt~fix1 tmp/batch-driver-howto.txt
--- tmp/batch-driver-howto.txt~fix1	2007-09-25 12:44:10.000000000 -0700
+++ tmp/batch-driver-howto.txt	2007-09-25 13:14:27.000000000 -0700
@@ -1,13 +1,13 @@
-Heres the begining of a howto for driver authors.
+Here's the beginning of a howto for driver authors.
 
 The intended audience for this howto is people already
 familiar with netdevices.
 
-1.0  Netdevice Pre-requisites
+1.0  Netdevice Prerequisites
 ------------------------------
 
-For hardware based netdevices, you must have at least hardware that 
-is capable of doing DMA with many descriptors; i.e having hardware 
+For hardware-based netdevices, you must have at least hardware that 
+is capable of doing DMA with many descriptors; i.e., having hardware 
 with a queue length of 3 (as in some fscked ethernet hardware) is 
 not very useful in this case.
 
@@ -15,33 +15,36 @@ not very useful in this case.
 -----------------------------------
 
 There are 3 new methods and one new variable introduced. These are:
-1)dev->hard_prep_xmit()
-2)dev->hard_end_xmit()
-3)dev->hard_batch_xmit()
-4)dev->xmit_win
+1) dev->hard_prep_xmit()
+2) dev->hard_end_xmit()
+3) dev->hard_batch_xmit()
+4) dev->xmit_win
 
 2.1 Using Core driver changes
 -----------------------------
 
-To provide context, lets look at a typical driver abstraction
+To provide context, let's look at a typical driver abstraction
 for dev->hard_start_xmit(). It has 4 parts:
-a) packet formating (example vlan, mss, descriptor counting etc)
-b) chip specific formatting
+a) packet formatting (example: vlan, mss, descriptor counting, etc.)
+b) chip-specific formatting
 c) enqueueing the packet on a DMA ring
 d) IO operations to complete packet transmit, tell DMA engine to chew 
-on, tx completion interupts etc
+on, tx completion interrupts, etc.
 
 [For code cleanliness/readability sake, regardless of this work,
 one should break the dev->hard_start_xmit() into those 4 functions
 anyways].
+
 A driver which has all 4 parts and needing to support batching is 
 advised to split its dev->hard_start_xmit() in the following manner:
-1)use its dev->hard_prep_xmit() method to achieve #a
-2)use its dev->hard_end_xmit() method to achieve #d
-3)#b and #c can stay in ->hard_start_xmit() (or whichever way you 
+
+1) use its dev->hard_prep_xmit() method to achieve #a
+2) use its dev->hard_end_xmit() method to achieve #d
+3) #b and #c can stay in ->hard_start_xmit() (or whichever way you 
 want to do this)
+
 Note: There are drivers which may need not support any of the two
-methods (example the tun driver i patched) so the two methods are
+methods (for example, the tun driver I patched), so the two methods are
 essentially optional.
 
 2.1.1 Theory of operation
@@ -49,7 +52,7 @@ essentially optional.
 
 The core will first do the packet formatting by invoking your 
 supplied dev->hard_prep_xmit() method. It will then pass you the packet 
-via your dev->hard_start_xmit() method for as many as packets you
+via your dev->hard_start_xmit() method for as many packets as you
 have advertised (via dev->xmit_win) you can consume. Lastly it will 
 invoke your dev->hard_end_xmit() when it completes passing you all the 
 packets queued for you. 
@@ -58,16 +61,16 @@ packets queued for you. 
 2.1.1.1 Locking rules
 ---------------------
 
-dev->hard_prep_xmit() is invoked without holding any
-tx lock but the rest are under TX_LOCK(). So you have to ensure that
-whatever you put it dev->hard_prep_xmit() doesnt require locking.
+dev->hard_prep_xmit() is invoked without holding any tx lock
+but the rest are under TX_LOCK(). So you have to ensure that
+whatever you put it dev->hard_prep_xmit() doesn't require locking.
 
 2.1.1.2 The slippery LLTX
 -------------------------
 
 LLTX drivers present a challenge in that we have to introduce a deviation
 from the norm and require the ->hard_batch_xmit() method. An LLTX
-driver presents us with ->hard_batch_xmit() to which we pass it a list
+driver presents us with ->hard_batch_xmit() to which we pass in a list
 of packets in a dev->blist skb queue. It is then the responsibility
 of the ->hard_batch_xmit() to exercise steps #b and #c for all packets
 passed in the dev->blist.
@@ -80,11 +83,14 @@ dev->hard_prep_xmit() and dev->hard_end_
 dev->xmit_win variable is set by the driver to tell us how
 much space it has in its rings/queues. dev->xmit_win is introduced to 
 ensure that when we pass the driver a list of packets it will swallow 
-all of them - which is useful because we dont requeue to the qdisc (and 
-avoids burning unnecessary cpu cycles or introducing any strange 
+all of them -- which is useful because we don't requeue to the qdisc (and 
+avoids burning unnecessary CPU cycles or introducing any strange 
 re-ordering). The driver tells us, whenever it invokes netif_wake_queue,
 how much space it has for descriptors by setting this variable.
 
+RD:  so xmit_win is not total queue size, it's the available queue size
+when calling netif_wake_queue(), right?  I guess that's explained below.
+
 3.0 Driver Essentials
 ---------------------
 
@@ -104,18 +110,18 @@ The typical driver tx state machine is:
 -1-> +Cycle repeats and core sends more packets (step 1).
 ----
 
-3.1  Driver pre-requisite
+3.1  Driver prerequisite
 --------------------------
 
 This is _a very important_ requirement in making batching useful.
-The pre-requisite for batching changes is that the driver should 
+The prerequisite for batching changes is that the driver should 
 provide a low threshold to open up the tx path.
 Drivers such as tg3 and e1000 already do this.
 Before you invoke netif_wake_queue(dev) you check if there is a
 threshold of space reached to insert new packets.
 
-Heres an example of how i added it to tun driver. Observe the
-setting of dev->xmit_win
+Here's an example of how I added it to tun driver. Observe the
+setting of dev->xmit_win.
 
 ---
 +#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
@@ -153,14 +159,14 @@ in tg3 code (with no batching changes) l
 -----------------
 
 a) On initialization (before netdev registration)
- i) set NETIF_F_BTX in dev->features 
-  i.e dev->features |= NETIF_F_BTX
+ 1) set NETIF_F_BTX in dev->features 
+  i.e., dev->features |= NETIF_F_BTX
   This makes the core do proper initialization.
 
-ii) set dev->xmit_win to something reasonable like
+ 2) set dev->xmit_win to something reasonable like
   maybe half the tx DMA ring size etc.
 
-b) create proper pointer to the new methods desribed above if
+b) create proper pointer to the new methods described above if
 you need them.
 
 3.3 Annotation on the different methods 
@@ -171,27 +177,30 @@ methods and variable could be used.
 3.3.1 The dev->hard_prep_xmit() method
 ---------------------------------------
 
-Use this method to only do pre-processing of the skb passed.
-If in the current dev->hard_start_xmit() you are pre-processing
-packets before holding any locks (eg formating them to be put in
-any descriptor etc).
+Use this method only to do preprocessing of the skb passed.
+If in the current dev->hard_start_xmit() you are preprocessing
+packets before holding any locks (e.g., formatting them to be put in
+any descriptor etc.).
+RD: incomplete sentence above.
 Look at e1000_prep_queue_frame() for an example.
 You may use the skb->cb to store any state that you need to know
-of later when batching.
-PS: I have found when discussing with Michael Chan and Matt Carlson
-that skb->cb[0] (8bytes of it) is used by the VLAN code to pass VLAN 
+later when batching.
+
+NOTE: I have found when discussing with Michael Chan and Matt Carlson
+that skb->cb[0] (8 bytes of it) is used by the VLAN code to pass VLAN 
 info to the driver.
 I think this is a violation of the usage of the cb scratch pad. 
 To work around this, you could use skb->cb[8] or do what the broadcom
-tg3 bacthing driver does which is to glean the vlan info first then
-re-use the skb->cb.
+tg3 batching driver does which is to glean the vlan info first then
+reuse the skb->cb.
 
 3.3.2 dev->hard_start_xmit()
 ----------------------------
   
-Heres an example of tx routine that is similar to the one i added 
+Here's an example of tx routine that is similar to the one I added 
 to the current tun driver. bxmit suffix is kept so that you can turn
 off batching if needed via and call already existing interface.
+RD: off batching via ethtool if needed ??
 
 ----
   static int xxx_net_bxmit(struct net_device *dev)
@@ -211,14 +220,14 @@ off batching if needed via and call alre
   }
 ------
 
-All return codes like NETDEV_TX_OK etc still apply.
+All return codes like NETDEV_TX_OK etc. still apply.
 
 3.3.3 The LLTX batching method, dev->batch_xmit()
 -------------------------------------------------
   
-Heres an example of a batch tx routine that is similar
-to the one i added to the older tun driver. Essentially
-this is what youd do if you wanted to support LLTX.
+Here's an example of a batch tx routine that is similar
+to the one I added to the older tun driver. Essentially
+this is what you would do if you wanted to support LLTX.
 
 ----
   static int xxx_net_bxmit(struct net_device *dev)
@@ -228,7 +237,7 @@ this is what youd do if you wanted to su
         while (skb_queue_len(dev->blist)) {
 	        dequeue from dev->blist
 		enqueue onto hardware ring
-		if hardware ring full break
+		if hardware ring full, break
         }
 				           
 	if (hardware ring full) {
@@ -242,34 +251,35 @@ this is what youd do if you wanted to su
   }
 ------
 
-All return codes like NETDEV_TX_OK etc still apply.
+All return codes like NETDEV_TX_OK etc. still apply.
 
 3.3.4 The tx complete, dev->hard_end_xmit()
 -------------------------------------------------
   
 In this method, if there are any IO operations that apply to a 
-set of packets such as kicking DMA, setting of interupt thresholds etc,
+set of packets such as kicking DMA, setting of interrupt thresholds etc.,
 leave them to the end and apply them once if you have successfully enqueued. 
-This provides a mechanism for saving a lot of cpu cycles since IO
+This provides a mechanism for saving a lot of CPU cycles since IO
 is cycle expensive.
-For an example of this look e1000 driver e1000_kick_DMA() function.
+For an example of this look at the e1000 driver e1000_kick_DMA() function.
 
 3.3.5 setting the dev->xmit_win 
 -----------------------------
 
 As mentioned earlier this variable provides hints on how much
 data to send from the core to the driver. Here are the obvious ways:
-a)on doing a netif_stop, set it to 1. By default all drivers have 
+
+a) on doing a netif_stop, set it to 1. By default all drivers have 
 this value set to 1 to emulate old behavior where a driver only
 receives one packet at a time.
-b)on netif_wake_queue set it to the max available space. You have
+b) on netif_wake_queue set it to the max available space. You have
 to be careful if your hardware does scatter-gather since the core
 will pass you scatter-gatherable skbs and so you want to at least
 leave enough space for the maximum allowed. Look at the tg3 and
 e1000 to see how this is implemented.
 
 The variable is important because it avoids the core sending
-any more than what the driver can handle therefore avoiding 
+any more than what the driver can handle, therefore avoiding 
 any need to muck with packet scheduling mechanisms.
 
 Appendix 1: History

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-25 15:24                       ` Stephen Hemminger
@ 2007-09-25 22:14                         ` jamal
  2007-09-25 22:43                           ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-09-25 22:14 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: johnpol, kumarkr, herbert, gaagaan, Robert.Olsson, netdev,
	rdreier, Waskiewicz Jr, Peter P, mcarlson, kaber, randy.dunlap,
	jagana, general, mchan, tgraf, jeff, sri, David Miller

On Tue, 2007-25-09 at 08:24 -0700, Stephen Hemminger wrote:

> The transmit code path is locked as a code region, rather than just object locking
> on the transmit queue or other fine grained object. This leads to moderately long
> lock hold times when multiple qdisc's and classification is being done.

It will be pretty tricky to optimize that path given the dependencies
between the queues, classifiers, and actions in enqueues; schedulers in
dequeues; as well as their config/queries from user space, which could
happen concurrently on all "N" CPUs.
The txlock optimization i added in patch1 lets go of the queue
lock sooner, when we enter the dequeue region, to reduce the contention.

A further optimization i made was to reduce the time it takes to hold
the tx lock at the driver by moving gunk that doesn't need lock-holding
into the new method dev->hard_end_xmit() (refer to patch #2)

> If we went to finer grain locking it would also mean changes to all network
> devices using the new locking model. My assumption is that we would use
> something like the features flag to do the transition for backward compatibility.
> Take this as a purely "what if" or "it would be nice if" kind of suggestion
> not a requirement or some grand plan.

Ok, hopefully someone will demonstrate how to achieve it; it seems a hard
thing to achieve.

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [DOC] Net batching driver howto
  2007-09-25 20:16             ` [ofa-general] " Randy Dunlap
@ 2007-09-25 22:28               ` jamal
  0 siblings, 0 replies; 107+ messages in thread
From: jamal @ 2007-09-25 22:28 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, Jeff Garzik,
	Robert.Olsson, netdev, rdreier, mcarlson, David Miller, gaagaan,
	jagana, general, mchan, tgraf, sri, shemminger, kaber

On Tue, 2007-25-09 at 13:16 -0700, Randy Dunlap wrote:
> On Mon, 24 Sep 2007 18:54:19 -0400 jamal wrote:
> 
> > I have updated the driver howto to match the patches i posted yesterday.
> > attached. 
> 
> Thanks for sending this.

Thank you for reading it Randy. 

> This is an early draft, right?

It's a third revision - but you could call it early. When it is done, i
will probably put a pointer to it in some patch.

> I'll fix some typos etc. in it (patch attached) and add some whitespace.
> Please see RD: in the patch for more questions/comments.

Thanks, will do and changes will show up in the next update.

> IMO it needs some changes to eliminate words like "we", "you",
> and "your" (words that personify code).  Those words are OK
> when talking about yourself.

The narrative intent is for it to read as if i (or someone doing the
description) were sitting there with a pen and paper and maybe a laptop,
walking through the details with someone who needs to understand those
details. If you think it is important to make it formal, then by all
means be my guest.

Again, thanks for taking the time.

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-25 22:14                         ` jamal
@ 2007-09-25 22:43                           ` jamal
  0 siblings, 0 replies; 107+ messages in thread
From: jamal @ 2007-09-25 22:43 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: johnpol, kumarkr, herbert, gaagaan, Robert.Olsson, netdev,
	rdreier, Waskiewicz Jr, Peter P, mcarlson, kaber, randy.dunlap,
	jagana, general, mchan, tgraf, jeff, sri, David Miller

On Tue, 2007-25-09 at 18:15 -0400, jamal wrote:

> A further optimization i made was to reduce the time it takes to hold
> the tx lock at the driver by moving gunk that doesnt need lock-holding
> into the new method dev->hard_end_xmit() (refer to patch #2)

Sorry, that should have read dev->hard_prep_xmit()

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] [PATCHES] TX batching
  2007-09-23 17:53     ` [PATCHES] TX batching jamal
  2007-09-23 17:56       ` [ofa-general] [PATCH 1/4] [NET_SCHED] explict hold dev tx lock jamal
  2007-09-23 18:19       ` [PATCHES] TX batching Jeff Garzik
@ 2007-09-30 18:50       ` jamal
  2007-09-30 19:19         ` [ofa-general] " jamal
  2007-10-07 18:34       ` [ofa-general] " jamal
  3 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-09-30 18:50 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber


The latest net-2.6.24 breaks the patches i posted last week, so this is an
update to resolve that. If you are receiving these emails and are
finding them overwhelming, please give me a shout and i will remove your
name.

Please provide feedback on the code and/or architecture.
Last time i posted them i received none. They are now updated to 
work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

I have decided i will kill ->hard_batch_xmit() and not support any
more LLTX drivers. These are the last patches that will have
->hard_batch_xmit(), as i am supporting an e1000 that is LLTX.

Dave, please let me know if this meets your desire to let devices
which are SG and able to compute CSUM benefit, just in case i
misunderstood.
Herbert, if you can look at at least patch 3 i will appreciate it
(since it kills dev->gso_skb that you introduced).

More patches to follow later if i get some feedback - i didn't want to 
overload people by dumping too many patches. Most of the patches 
mentioned below are ready to go; some need re-testing and others 
need a little porting from an earlier kernel: 
- tg3 driver (tested and works well, but dont want to send 
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver howto i wrote that was posted on netdev last week,
as well as one that describes the architectural decisions made.

Each of these patches has been performance tested (last with DaveM's
tree from last weekend) and the results are in the logs on a per-patch 
basis.  My system under test hardware is a 2xdual core opteron with a 
couple of tg3s. I have not re-run the tests with this morning's tree
but i suspect not much difference.
My test tool generates udp traffic of different sizes for up to 60 
seconds per run, or a total of 30M packets. I have 4 threads, each 
running on a specific CPU, which keep all the CPUs as busy as they can 
sending packets targeted at a directly connected box's udp discard
port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule 
which counts and drops all incoming udp packets to the discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each
packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] [PATCH 1/4] [NET_BATCH] Introduce batching interface
  2007-09-23 17:58         ` [ofa-general] [PATCH 2/4] [NET_BATCH] Introduce batching interface jamal
  2007-09-23 18:00           ` [PATCH 3/4][NET_BATCH] net core use batching jamal
@ 2007-09-30 18:51           ` jamal
  2007-09-30 18:54             ` [ofa-general] Re: [PATCH 1/3] " jamal
  2007-10-07 18:36           ` [ofa-general] " jamal
  2 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-09-30 18:51 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 78 bytes --]


This patch introduces the netdevice interface for batching.

cheers,
jamal



[-- Attachment #2: sep30-p10f3 --]
[-- Type: text/plain, Size: 8794 bytes --]

[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

A typical driver dev->hard_start_xmit() has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew on,
tx completion interrupts, etc.

[For code cleanliness/readability sake, regardless of this work,
one should break the dev->hard_start_xmit() into those 4 functions
anyways].

With the API introduced in this patch, a driver which has all
4 parts and needs to support batching is advised to split its
dev->hard_start_xmit() in the following manner:
1)use its dev->hard_prep_xmit() method to achieve #a
2)use its dev->hard_end_xmit() method to achieve #d
3)#b and #c can stay in ->hard_start_xmit() (or whichever way you want
to do this)
Note: There are drivers which may not need to support either of the two
methods (for example the tun driver i patched), so the two methods are
optional.

The core will first do the packet formatting by invoking your supplied
dev->hard_prep_xmit() method. It will then pass you the packet via
your dev->hard_start_xmit() method and lastly will invoke your
dev->hard_end_xmit() when it completes passing you all the packets
queued for you. dev->hard_prep_xmit() is invoked without holding any
tx lock but the rest are under TX_LOCK().

LLTX presents a challenge in that we have to introduce a deviation
from the norm and introduce the ->hard_batch_xmit() method. An LLTX
driver presents us with ->hard_batch_xmit(), to which we pass in a list
of packets in a dev->blist skb queue. It is then the responsibility
of the ->hard_batch_xmit() to exercise steps #b and #c for all packets
and #d when the batching is complete. Step #a is already done for you
by the time you get the packets in dev->blist.
And lastly, the xmit_win variable is introduced to ensure that when we pass
the driver a list of packets it will swallow all of them - which is
useful because we don't requeue to the qdisc (and avoids burning
unnecessary cpu cycles or introducing any strange re-ordering). The driver
tells us when it invokes netif_wake_queue how much space it has for
descriptors by setting this variable.

Some decisions i had to make:
- every driver will have an xmit_win variable and the core will set it
to 1, which means the behavior of non-batching drivers stays the same.
- the batch list, blist, is no longer a pointer; this wastes a little extra
memory which i plan to recoup by killing gso_skb in later patches.

There's a lot of history and reasoning on why batching in a document
i am writing which i may submit as a patch.
Thomas Graf (who probably doesn't know this) gave me the impetus to
start looking at this back in 2004 when he invited me to the linux
conference he was organizing. Parts of what i presented in SUCON in
2004 talk about batching. Herbert Xu forced me to take a second look around
2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided
me with more motivation in May 2007 when he posted on netdev and engaged
me.
Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan,
Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, and
David Miller, have contributed in one or more of {bug fixes, enhancements,
testing, lively discussion}. The Broadcom and netiron folks have been
outstanding in their help.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

---
commit 624a0bfeb971c9aa58496c7372df01f0ed750def
tree c1c0ee53453392866a5241631a7502ce6569b2cc
parent 260dbcc4b0195897c539c5ff79d95afdddeb3378
author Jamal Hadi Salim <hadi@cyberus.ca> Sun, 30 Sep 2007 14:23:31 -0400
committer Jamal Hadi Salim <hadi@cyberus.ca> Sun, 30 Sep 2007 14:23:31 -0400

 include/linux/netdevice.h |   17 +++++++
 net/core/dev.c            |  106 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..df1fb61 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -467,6 +467,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -595,6 +596,15 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	/* hard_batch_xmit is needed for LLTX, kill it when those
+	 * disappear or better kill it now and dont support LLTX
+	*/
+	int			(*hard_batch_xmit) (struct net_device *dev);
+	int			(*hard_prep_xmit) (struct sk_buff *skb,
+						   struct net_device *dev);
+	void			(*hard_end_xmit) (struct net_device *dev);
+	int			xmit_win;
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -609,6 +619,7 @@ struct net_device
 
 	/* delayed register/unregister */
 	struct list_head	todo_list;
+	struct sk_buff_head     blist;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
 
@@ -1044,6 +1055,12 @@ extern int		dev_set_mac_address(struct net_device *,
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_batch_xmit(struct net_device *dev);
+extern int		prepare_gso_skb(struct sk_buff *skb,
+					struct net_device *dev,
+					struct sk_buff_head *skbs);
+extern int		xmit_prepare_skb(struct sk_buff *skb,
+					 struct net_device *dev);
 
 extern int		netdev_budget;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 833f060..f82aff7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1517,6 +1517,110 @@ static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }
 
+int prepare_gso_skb(struct sk_buff *skb, struct net_device *dev,
+		    struct sk_buff_head *skbs)
+{
+	int tdq = 0;
+	do {
+		struct sk_buff *nskb = skb->next;
+
+		skb->next = nskb->next;
+		nskb->next = NULL;
+
+		if (dev->hard_prep_xmit) {
+			/* note: skb->cb is set in hard_prep_xmit(),
+			 * it should not be trampled somewhere
+			 * between here and the driver picking it
+			 * The VLAN code wrongly assumes it owns it
+			 * so the driver needs to be careful; for
+			 * good handling look at tg3 driver ..
+			*/
+			int ret = dev->hard_prep_xmit(nskb, dev);
+			if (ret != NETDEV_TX_OK)
+				continue;
+		}
+		/* Driver likes this packet .. */
+		tdq++;
+		__skb_queue_tail(skbs, nskb);
+	} while (skb->next);
+	skb->destructor = DEV_GSO_CB(skb)->destructor;
+	kfree_skb(skb);
+
+	return tdq;
+}
+
+int xmit_prepare_skb(struct sk_buff *skb, struct net_device *dev)
+{
+	struct sk_buff_head *skbs = &dev->blist;
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree_skb(skb);
+			return 0;
+		}
+		if (skb->next)
+			return prepare_gso_skb(skb, dev, skbs);
+	}
+
+	if (dev->hard_prep_xmit) {
+		int ret = dev->hard_prep_xmit(skb, dev);
+		if (ret != NETDEV_TX_OK)
+			return 0;
+	}
+	__skb_queue_tail(skbs, skb);
+	return 1;
+}
+
+int dev_batch_xmit(struct net_device *dev)
+{
+	struct sk_buff_head *skbs = &dev->blist;
+	int rc = NETDEV_TX_OK;
+	struct sk_buff *skb;
+	int orig_w = dev->xmit_win;
+	int orig_pkts = skb_queue_len(skbs);
+
+	if (dev->hard_batch_xmit) { /* only for LLTX devices */
+		rc = dev->hard_batch_xmit(dev);
+	} else {
+		while ((skb = __skb_dequeue(skbs)) != NULL) {
+			if (!list_empty(&ptype_all))
+				dev_queue_xmit_nit(skb, dev);
+			rc = dev->hard_start_xmit(skb, dev);
+			if (unlikely(rc))
+				break;
+			/*
+			 * XXX: multiqueue may need closer srutiny..
+			*/
+			if (unlikely(netif_queue_stopped(dev) ||
+			     netif_subqueue_stopped(dev, skb->queue_mapping))) {
+				rc = NETDEV_TX_BUSY;
+				break;
+			}
+		}
+	}
+
+	/* driver is likely buggy and lied to us on how much
+	 * space it had. Damn you driver ..
+	*/
+	if (unlikely(skb_queue_len(skbs))) {
+		printk(KERN_WARNING "Likely bug %s %s (%d) "
+				"left %d/%d window now %d, orig %d\n",
+			dev->name, rc?"busy":"locked",
+			netif_queue_stopped(dev),
+			skb_queue_len(skbs),
+			orig_pkts,
+			dev->xmit_win,
+			orig_w);
+			rc = NETDEV_TX_BUSY;
+	}
+
+	if (orig_pkts > skb_queue_len(skbs))
+		if (dev->hard_end_xmit)
+			dev->hard_end_xmit(dev);
+
+	return rc;
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	if (likely(!skb->next)) {
@@ -3551,6 +3655,8 @@ int register_netdevice(struct net_device *dev)
 		}
 	}
 
+	dev->xmit_win = 1;
+	skb_queue_head_init(&dev->blist);
 	ret = netdev_register_kobject(dev);
 	if (ret)
 		goto err_uninit;

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [ofa-general] [PATCH 2/3][NET_BATCH] net core use batching
  2007-09-23 18:00           ` [PATCH 3/4][NET_BATCH] net core use batching jamal
  2007-09-23 18:02             ` [ofa-general] [PATCH 4/4][NET_SCHED] kill dev->gso_skb jamal
@ 2007-09-30 18:52             ` jamal
  2007-10-01  4:11               ` Bill Fink
  2007-10-01 10:42               ` [ofa-general] " Patrick McHardy
  2007-10-07 18:38             ` [ofa-general] " jamal
  2 siblings, 2 replies; 107+ messages in thread
From: jamal @ 2007-09-30 18:52 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 74 bytes --]


This patch adds the usage of batching within the core.

cheers,
jamal




[-- Attachment #2: sep30-p2of3 --]
[-- Type: text/plain, Size: 6919 bytes --]

[NET_BATCH] net core use batching

This patch adds the usage of batching within the core.
The same test methodology used in introducing txlock is used, with
the following results on different kernels:

        +------------+--------------+-------------+------------+--------+
        |       64B  |  128B        | 256B        | 512B       |1024B   |
        +------------+--------------+-------------+------------+--------+
Original| 467482     | 463061       | 388267      | 216308     | 114704 |
        |            |              |             |            |        |
txlock  | 468922     | 464060       | 388298      | 216316     | 114709 |
        |            |              |             |            |        |
tg3nobtx| 468012     | 464079       | 388293      | 216314     | 114704 |
        |            |              |             |            |        |
tg3btxdr| 480794     | 475102       | 388298      | 216316     | 114705 |
        |            |              |             |            |        |
tg3btxco| 481059     | 475423       | 388285      | 216308     | 114706 |
        +------------+--------------+-------------+------------+--------+

The first two columns "Original" and "txlock" were introduced in an earlier
patch and demonstrate a slight increase in performance with txlock.
"tg3nobtx" shows the tg3 driver with no changes to support batching.
The purpose of this test is to demonstrate the effect of introducing
the core changes to a driver that doesn't support them.
Although this patch brings down performance slightly compared to txlock
for such netdevices, it is still better compared to just the original
kernel.
"tg3btxdr" demonstrates the effect of using ->hard_batch_xmit() with the tg3
driver. "tg3btxco" demonstrates the effect of letting the core do all the
work. As can be seen, the last two are not very different in performance.
The difference is that ->hard_batch_xmit() introduces a new method which
is intrusive.

I have #if-0ed some of the old functions so the patch is more readable.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

---
commit 9b4a8fb190278d388c0a622fb5529d184ac8c7dc
tree 053e8dda02b5d26fe7cc778823306a8a526df513
parent 624a0bfeb971c9aa58496c7372df01f0ed750def
author Jamal Hadi Salim <hadi@cyberus.ca> Sun, 30 Sep 2007 14:38:11 -0400
committer Jamal Hadi Salim <hadi@cyberus.ca> Sun, 30 Sep 2007 14:38:11 -0400

 net/sched/sch_generic.c |  127 +++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 115 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..86a3f9d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 
 	return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+				"Dead loop on netdevice %s, fix it urgently!\n",
+				dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+	       struct Qdisc *q)
+{
+
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+	netif_schedule(dev);
+	return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+	    struct Qdisc *q)
+{
+	int ret = handle_dev_cpu_collision(dev);
+
+	if (ret) {
+		if (!skb_queue_empty(skbs))
+			skb_queue_purge(skbs);
+		return qdisc_qlen(q);
+	}
+
+	return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			   struct Qdisc *q,
+			   struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count  && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	return skb_queue_len(pktlist);
+}
+
+static int xmit_prepare_pkts(struct net_device *dev,
+			     struct sk_buff_head *tlist)
+{
+	struct sk_buff *skb;
+	struct sk_buff_head *flist = &dev->blist;
+
+	while ((skb = __skb_dequeue(tlist)) != NULL)
+		xmit_prepare_skb(skb, dev);
+
+	return skb_queue_len(flist);
+}
 
 /*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
@@ -130,22 +222,27 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+
+static inline int qdisc_restart(struct net_device *dev,
+				struct sk_buff_head *tpktlist)
 {
 	struct Qdisc *q = dev->qdisc;
-	struct sk_buff *skb;
-	int ret;
+	int ret = 0;
 
-	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
-		return 0;
+	ret = xmit_get_pkts(dev, q, tpktlist);
 
+	if (!ret)
+		return 0;
 
-	/* And release queue */
+	/* We got em packets */
 	spin_unlock(&dev->queue_lock);
 
+	/* prepare to embark */
+	xmit_prepare_pkts(dev, tpktlist);
+
+	/* bye packets ....*/
 	HARD_TX_LOCK(dev, smp_processor_id());
-	ret = dev_hard_start_xmit(skb, dev);
+	ret = dev_batch_xmit(dev);
 	HARD_TX_UNLOCK(dev);
 
 	spin_lock(&dev->queue_lock);
@@ -158,8 +255,8 @@ static inline int qdisc_restart(struct net_device *dev)
 		break;
 
 	case NETDEV_TX_LOCKED:
-		/* Driver try lock failed */
-		ret = handle_dev_cpu_collision(skb, dev, q);
+		/* Driver lock failed */
+		ret = xmit_islocked(&dev->blist, dev, q);
 		break;
 
 	default:
@@ -168,7 +265,7 @@ static inline int qdisc_restart(struct net_device *dev)
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
 
-		ret = dev_requeue_skb(skb, dev, q);
+		ret = dev_requeue_skbs(&dev->blist, dev, q);
 		break;
 	}
 
@@ -177,8 +274,11 @@ static inline int qdisc_restart(struct net_device *dev)
 
 void __qdisc_run(struct net_device *dev)
 {
+	struct sk_buff_head tpktlist;
+	skb_queue_head_init(&tpktlist);
+
 	do {
-		if (!qdisc_restart(dev))
+		if (!qdisc_restart(dev, &tpktlist))
 			break;
 	} while (!netif_queue_stopped(dev));
 
@@ -564,6 +664,9 @@ void dev_deactivate(struct net_device *dev)
 
 	skb = dev->gso_skb;
 	dev->gso_skb = NULL;
+	if (!skb_queue_empty(&dev->blist))
+		skb_queue_purge(&dev->blist);
+	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);
 
 	kfree_skb(skb);




^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [ofa-general] [PATCH 3/3][NET_SCHED] kill dev->gso_skb
  2007-09-23 18:02             ` [ofa-general] [PATCH 4/4][NET_SCHED] kill dev->gso_skb jamal
@ 2007-09-30 18:53               ` jamal
  2007-10-07 18:39               ` [ofa-general] [PATCH 3/3][NET_BATCH] " jamal
  1 sibling, 0 replies; 107+ messages in thread
From: jamal @ 2007-09-30 18:53 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 99 bytes --]


This patch removes dev->gso_skb as it is no longer necessary with
batching code.

cheers,
jamal



[-- Attachment #2: sep30-p3of3 --]
[-- Type: text/plain, Size: 2277 bytes --]

[NET_SCHED] kill dev->gso_skb
The batching code does what gso used to batch at the drivers.
There is no more need for gso_skb. If for whatever reason the
requeueing is a bad idea we are going to leave packets in dev->blist
(and still not need dev->gso_skb)

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

---
commit c2916c550d228472ddcdd676c2689fa6c8ecfcc0
tree 5beaf197fd08a038d83501f405017f48712d0318
parent 9b4a8fb190278d388c0a622fb5529d184ac8c7dc
author Jamal Hadi Salim <hadi@cyberus.ca> Sun, 30 Sep 2007 14:38:58 -0400
committer Jamal Hadi Salim <hadi@cyberus.ca> Sun, 30 Sep 2007 14:38:58 -0400

 include/linux/netdevice.h |    3 ---
 net/sched/sch_generic.c   |   12 ------------
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index df1fb61..cea400a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -577,9 +577,6 @@ struct net_device
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
 
-	/* Partially transmitted GSO packet. */
-	struct sk_buff		*gso_skb;
-
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
 	struct Qdisc		*qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 86a3f9d..b4e1607 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;
 
-	if (count  && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -654,7 +647,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;
 
 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -662,15 +654,11 @@ void dev_deactivate(struct net_device *dev)
 
 	qdisc_reset(qdisc);
 
-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(&dev->blist))
 		skb_queue_purge(&dev->blist);
 	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);
 
-	kfree_skb(skb);
-
 	dev_watchdog_down(dev);
 
 	/* Wait for outstanding dev_queue_xmit calls. */




^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface
  2007-09-30 18:51           ` [ofa-general] [PATCH 1/4] [NET_BATCH] Introduce batching interface jamal
@ 2007-09-30 18:54             ` jamal
  0 siblings, 0 replies; 107+ messages in thread
From: jamal @ 2007-09-30 18:54 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

Fixed subject - should be 1/3 not 1/4

On Sun, 2007-30-09 at 14:51 -0400, jamal wrote:
> This patch introduces the netdevice interface for batching.
> 
> cheers,
> jamal
> 
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCHES] TX batching
  2007-09-30 18:50       ` [ofa-general] " jamal
@ 2007-09-30 19:19         ` jamal
  0 siblings, 0 replies; 107+ messages in thread
From: jamal @ 2007-09-30 19:19 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 149 bytes --]

And here's a patch that provides a sample of the usage of batching with
tg3.
It requires the patch "[TG3]Some cleanups" i posted earlier.

cheers,
jamal


[-- Attachment #2: tg3.potoc --]
[-- Type: text/x-patch, Size: 5252 bytes --]

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 5a864bd..9aafb78 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3103,6 +3103,13 @@ static inline u32 tg3_tx_avail(struct tg3 *tp)
 		((tp->tx_prod - tp->tx_cons) & (TG3_TX_RING_SIZE - 1)));
 }
 
+static inline void tg3_set_win(struct tg3 *tp)
+{
+	tp->dev->xmit_win = tg3_tx_avail(tp) - (MAX_SKB_FRAGS + 1);
+	if (tp->dev->xmit_win < 1)
+		tp->dev->xmit_win = 1;
+}
+
 /* Tigon3 never reports partial packet sends.  So we do not
  * need special logic to handle SKBs that have not had all
  * of their frags sent yet, like SunGEM does.
@@ -3165,8 +3172,10 @@ static void tg3_tx(struct tg3 *tp)
 		     (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))) {
 		netif_tx_lock(tp->dev);
 		if (netif_queue_stopped(tp->dev) &&
-		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
+		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))) {
+			tg3_set_win(tp);
 			netif_wake_queue(tp->dev);
+		}
 		netif_tx_unlock(tp->dev);
 	}
 }
@@ -4007,8 +4016,13 @@ void tg3_kick_DMA(struct net_device *dev)
 
 	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
 		netif_stop_queue(dev);
-		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp))
+		dev->xmit_win = 1;
+		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp)) {
+			tg3_set_win(tp);
 			netif_wake_queue(dev);
+		}
+	} else {
+		tg3_set_win(tp);
 	}
 
 	mmiowb();
@@ -4085,6 +4099,7 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
+			tp->dev->xmit_win = 1;
 
 			/* This is a hard error, log it. */
 			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
@@ -4100,6 +4115,25 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	return ret;
 }
 
+static int tg3_start_bxmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	return tg3_enqueue(skb, dev);
+}
+
 static int tg3_start_xmit_dma_bug(struct sk_buff *, struct net_device *);
 
 /* Use GSO to workaround a rare TSO bug that may be triggered when the
@@ -4112,9 +4146,11 @@ static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
 	/* Estimate the number of fragments in the worst case */
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))) {
 		netif_stop_queue(tp->dev);
+		tp->dev->xmit_win = 1;
 		if (tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))
 			return NETDEV_TX_BUSY;
 
+		tg3_set_win(tp);
 		netif_wake_queue(tp->dev);
 	}
 
@@ -4267,6 +4303,25 @@ static int tg3_enqueue_buggy(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+static int tg3_start_bxmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	return  tg3_enqueue_buggy(skb, dev);
+}
+
 static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 {
 	struct tg3 *tp = netdev_priv(dev);
@@ -4283,6 +4338,7 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
+			dev->xmit_win = 1;
 
 			/* This is a hard error, log it. */
 			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
@@ -11099,15 +11155,19 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 	else
 		tp->tg3_flags &= ~TG3_FLAG_POLL_SERDES;
 
+	tp->dev->hard_end_xmit = tg3_kick_DMA;
 	/* All chips before 5787 can get confused if TX buffers
 	 * straddle the 4GB address boundary in some cases.
 	 */
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5755 ||
 	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5787 ||
-	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
-		tp->dev->hard_start_xmit = tg3_start_xmit;
-	else
-		tp->dev->hard_start_xmit = tg3_start_xmit_dma_bug;
+	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906) {
+		tp->dev->hard_start_xmit = tg3_start_bxmit;
+		tp->dev->hard_prep_xmit = tg3_prep_frame;
+	} else {
+		tp->dev->hard_start_xmit = tg3_start_bxmit_dma_bug;
+		tp->dev->hard_prep_xmit = tg3_prep_bug_frame;
+	}
 
 	tp->rx_offset = 2;
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701 &&
@@ -11955,6 +12015,8 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
 	dev->watchdog_timeo = TG3_TX_TIMEOUT;
 	dev->change_mtu = tg3_change_mtu;
 	dev->irq = pdev->irq;
+	dev->features |= NETIF_F_BTX;
+	dev->xmit_win = tp->tx_pending >> 2;
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	dev->poll_controller = tg3_poll_controller;
 #endif




^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-09-30 18:52             ` [ofa-general] [PATCH 2/3][NET_BATCH] net core use batching jamal
@ 2007-10-01  4:11               ` Bill Fink
  2007-10-01 13:30                 ` jamal
  2007-10-01 10:42               ` [ofa-general] " Patrick McHardy
  1 sibling, 1 reply; 107+ messages in thread
From: Bill Fink @ 2007-10-01  4:11 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, krkumar2, johnpol, herbert, kaber, shemminger,
	jagana, Robert.Olsson, rick.jones2, xma, gaagaan, netdev,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general,
	kumarkr, tgraf, randy.dunlap, sri

On Sun, 30 Sep 2007, jamal wrote:

> This patch adds the usage of batching within the core.
> 
> cheers,
> jamal



> [sep30-p2of3  text/plain (6.8KB)]
> [NET_BATCH] net core use batching
> 
> This patch adds the usage of batching within the core.
> The same test methodology used in introducing txlock is used, with
> the following results on different kernels:
> 
>         +------------+--------------+-------------+------------+--------+
>         |       64B  |  128B        | 256B        | 512B       |1024B   |
>         +------------+--------------+-------------+------------+--------+
> Original| 467482     | 463061       | 388267      | 216308     | 114704 |
>         |            |              |             |            |        |
> txlock  | 468922     | 464060       | 388298      | 216316     | 114709 |
>         |            |              |             |            |        |
> tg3nobtx| 468012     | 464079       | 388293      | 216314     | 114704 |
>         |            |              |             |            |        |
> tg3btxdr| 480794     | 475102       | 388298      | 216316     | 114705 |
>         |            |              |             |            |        |
> tg3btxco| 481059     | 475423       | 388285      | 216308     | 114706 |
>         +------------+--------------+-------------+------------+--------+
> 
> The first two columns "Original" and "txlock" were introduced in an earlier
> patch and demonstrate a slight increase in performance with txlock.
> "tg3nobtx" shows the tg3 driver with no changes to support batching.
> The purpose of this test is to demonstrate the effect of introducing
> the core changes to a driver that doesn't support them.
> Although this patch brings down performance slightly compared to txlock
> for such netdevices, it is still better compared to just the original
> kernel.
> "tg3btxdr" demonstrates the effect of using ->hard_batch_xmit() with tg3
> driver. "tg3btxco" demonstrates the effect of letting the core do all the
> work. As can be seen the last two are not very different in performance.
> The difference is ->hard_batch_xmit() introduces a new method which
> is intrusive.

Have you done performance comparisons for the case of using 9000-byte
jumbo frames?

						-Bill

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-09-30 18:52             ` [ofa-general] [PATCH 2/3][NET_BATCH] net core use batching jamal
  2007-10-01  4:11               ` Bill Fink
@ 2007-10-01 10:42               ` Patrick McHardy
  2007-10-01 13:21                 ` jamal
  1 sibling, 1 reply; 107+ messages in thread
From: Patrick McHardy @ 2007-10-01 10:42 UTC (permalink / raw)
  To: hadi
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, David Miller

jamal wrote:
> +static inline int
> +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
> +	       struct Qdisc *q)
> +{
> +
> +	struct sk_buff *skb;
> +
> +	while ((skb = __skb_dequeue(skbs)) != NULL)
> +		q->ops->requeue(skb, q);


->requeue queues at the head, so this looks like it would reverse
the order of the skbs.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-10-01 10:42               ` [ofa-general] " Patrick McHardy
@ 2007-10-01 13:21                 ` jamal
  2007-10-08  5:03                   ` Krishna Kumar2
  0 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-10-01 13:21 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, David Miller

On Mon, 2007-01-10 at 12:42 +0200, Patrick McHardy wrote:
> jamal wrote:

> > +	while ((skb = __skb_dequeue(skbs)) != NULL)
> > +		q->ops->requeue(skb, q);
> 
> 
> ->requeue queues at the head, so this looks like it would reverse
> the order of the skbs.

Excellent catch!  thanks; i will fix.
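
One way to fix it (and what the later oct07 repost in this thread does)
is to requeue from the tail of the batch list, so that ->requeue(),
which inserts at the head, ends up restoring the original order; a
minimal sketch:

	while ((skb = __skb_dequeue_tail(skbs)) != NULL)
		q->ops->requeue(skb, q);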

As a side note: any batching driver should _never_ have to requeue; if
it does, it is buggy. And if the non-batching ones ever requeue, it will
be a single packet, so not much reordering.

Thanks again Patrick.

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-10-01  4:11               ` Bill Fink
@ 2007-10-01 13:30                 ` jamal
  2007-10-02  4:25                   ` [ofa-general] " Bill Fink
  0 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-10-01 13:30 UTC (permalink / raw)
  To: Bill Fink
  Cc: David Miller, krkumar2, johnpol, herbert, kaber, shemminger,
	jagana, Robert.Olsson, rick.jones2, xma, gaagaan, netdev,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general,
	kumarkr, tgraf, randy.dunlap, sri

On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote:

> Have you done performance comparisons for the case of using 9000-byte
> jumbo frames?

I havent, but will try if any of the gige cards i have support it.

As a side note: I have not seen any useful gains or losses as the packet
size approaches even 1500B MTU. For example, post about 256B neither the
batching nor the non-batching give much difference in either throughput
or cpu use. Below 256B, theres a noticeable gain for batching.
Note, in the cases of my tests all 4 CPUs are in full-throttle UDP and
so the occupancy of both the qdisc queue(s) and ethernet ring is
constantly high. For example at 512B, the app is 80% idle on all 4 CPUs
and we are hitting in the range of wire speed. We are at 90% idle at
1024B. This is the case with or without batching.  So my suspicion is
that with that trend a 9000B packet will just follow the same pattern.


cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-10-01 13:30                 ` jamal
@ 2007-10-02  4:25                   ` Bill Fink
  2007-10-02 13:20                     ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: Bill Fink @ 2007-10-02  4:25 UTC (permalink / raw)
  To: hadi
  Cc: randy.dunlap, Robert.Olsson, gaagaan, kumarkr,
	peter.p.waskiewicz.jr, shemminger, johnpol, herbert, jeff,
	rdreier, mcarlson, general, sri, jagana, mchan, netdev,
	David Miller, tgraf, kaber

On Mon, 01 Oct 2007, jamal wrote:

> On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote:
> 
> > Have you done performance comparisons for the case of using 9000-byte
> > jumbo frames?
> 
> I havent, but will try if any of the gige cards i have support it.
> 
> As a side note: I have not seen any useful gains or losses as the packet
> size approaches even 1500B MTU. For example, post about 256B neither the
> batching nor the non-batching give much difference in either throughput
> or cpu use. Below 256B, theres a noticeable gain for batching.
> Note, in the cases of my tests all 4 CPUs are in full-throttle UDP and
> so the occupancy of both the qdisc queue(s) and ethernet ring is
> constantly high. For example at 512B, the app is 80% idle on all 4 CPUs
> and we are hitting in the range of wire speed. We are at 90% idle at
> 1024B. This is the case with or without batching.  So my suspicion is
> that with that trend a 9000B packet will just follow the same pattern.

One reason I ask, is that on an earlier set of alternative batching
xmit patches by Krishna Kumar, his performance testing showed a 30 %
performance hit for TCP for a single process and a size of 4 KB, and
a performance hit of 5 % for a single process and a size of 16 KB
(a size of 8 KB wasn't tested).  Unfortunately I was too busy at the
time to inquire further about it, but it would be a major potential
concern for me in my 10-GigE network testing with 9000-byte jumbo
frames.  Of course the single process and 4 KB or larger size was
the only case that showed a significant performance hit in Krishna
Kumar's latest reported test results, so it might be acceptable to
just have a switch to disable the batching feature for that specific
usage scenario.  So it would be useful to know if your xmit batching
changes would have similar issues.

Also for your xmit batching changes, I think it would be good to see
performance comparisons for TCP and IP forwarding in addition to your
UDP pktgen tests, including various packet sizes up to and including
9000-byte jumbo frames.

						-Bill

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-10-02  4:25                   ` [ofa-general] " Bill Fink
@ 2007-10-02 13:20                     ` jamal
  2007-10-03  5:29                       ` [ofa-general] " Bill Fink
  0 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-10-02 13:20 UTC (permalink / raw)
  To: Bill Fink
  Cc: David Miller, krkumar2, johnpol, herbert, kaber, shemminger,
	jagana, Robert.Olsson, rick.jones2, xma, gaagaan, netdev,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general,
	kumarkr, tgraf, randy.dunlap, sri

On Tue, 2007-02-10 at 00:25 -0400, Bill Fink wrote:

> One reason I ask, is that on an earlier set of alternative batching
> xmit patches by Krishna Kumar, his performance testing showed a 30 %
> performance hit for TCP for a single process and a size of 4 KB, and
> a performance hit of 5 % for a single process and a size of 16 KB
> (a size of 8 KB wasn't tested).  Unfortunately I was too busy at the
> time to inquire further about it, but it would be a major potential
> concern for me in my 10-GigE network testing with 9000-byte jumbo
> frames.  Of course the single process and 4 KB or larger size was
> the only case that showed a significant performance hit in Krishna
> Kumar's latest reported test results, so it might be acceptable to
> just have a switch to disable the batching feature for that specific
> usage scenario.  So it would be useful to know if your xmit batching
> changes would have similar issues.


There were many times while testing that i noticed inconsistencies and
in each case when i analysed[1], i found it to be due to some variable
other than batching which needed some resolving, always via some
parametrization or other. I suspect what KK posted is in the same class.
To give you an example, with UDP, batching was giving worse results at
around 256B compared to 64B or 512B; investigating i found that the
receiver just wasnt able to keep up and the udp layer dropped a lot of
packets so both iperf and netperf reported bad numbers. Fixing the
receiver ended up with consistency coming back. On why 256B was the one
that overwhelmed the receiver more than 64B(which sent more pps)? On
some limited investigation, it seemed to me to be the effect of the
choice of the tg3 driver's default tx mitigation parameters as well tx
ring size; which is something i plan to revisit (but neutralizing it
helps me focus on just batching). In the end i dropped both netperf and
iperf for similar reasons and wrote my own app. What i am trying to
achieve is demonstrate if batching is a GoodThing. In experimentation
like this, it is extremely valuable to reduce the variables. Batching
may expose other orthogonal issues - those need to be resolved or fixed
as they are found. I hope that sounds sensible.

Back to the >=9K packet size you raise above:
I dont have a 10Gige card so i am theorizing. Given that theres an
observed benefit to batching for a saturated link with "smaller" packets
(in my results "small" is anything below 256B which maps to about
380Kpps anything above that seems to approach wire speed and the link is
the bottleneck); then i theorize that 10Gige with 9K jumbo frames if
already achieving wire rate, should continue to do so. And sizes below
that will see improvements if they were not already hitting wire rate.
So i would say that with 10G NICS, there will be more observed
improvements with batching with apps that do bulk transfers (assuming
those apps are not seeing wire speed already). Note that this hasnt been
quite the case even with TSO given the bottlenecks in the Linux
receivers that J Heffner put nicely in a response to some results you
posted - but that exposes an issue with Linux receivers rather than TSO.

> Also for your xmit batching changes, I think it would be good to see
> performance comparisons for TCP and IP forwarding in addition to your
> UDP pktgen tests, 

That is not pktgen - it is a udp app running in process context
utilizing all 4 CPUs to send traffic. pktgen bypasses the stack entirely
and has its own merits in proving that batching in fact is a GoodThing
even if it is just for traffic generation ;->

> including various packet sizes up to and including
> 9000-byte jumbo frames.

I will do TCP and forwarding tests in the near future. 

cheers,
jamal

[1] On average i spend 10x more time performance testing and analysing
results than writing code.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-10-02 13:20                     ` jamal
@ 2007-10-03  5:29                       ` Bill Fink
  2007-10-03 13:42                         ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: Bill Fink @ 2007-10-03  5:29 UTC (permalink / raw)
  To: hadi
  Cc: randy.dunlap, Robert.Olsson, gaagaan, kumarkr,
	peter.p.waskiewicz.jr, shemminger, johnpol, herbert, jeff,
	rdreier, mcarlson, general, sri, jagana, mchan, netdev,
	David Miller, tgraf, kaber

On Tue, 02 Oct 2007, jamal wrote:

> On Tue, 2007-02-10 at 00:25 -0400, Bill Fink wrote:
> 
> > One reason I ask, is that on an earlier set of alternative batching
> > xmit patches by Krishna Kumar, his performance testing showed a 30 %
> > performance hit for TCP for a single process and a size of 4 KB, and
> > a performance hit of 5 % for a single process and a size of 16 KB
> > (a size of 8 KB wasn't tested).  Unfortunately I was too busy at the
> > time to inquire further about it, but it would be a major potential
> > concern for me in my 10-GigE network testing with 9000-byte jumbo
> > frames.  Of course the single process and 4 KB or larger size was
> > the only case that showed a significant performance hit in Krishna
> > Kumar's latest reported test results, so it might be acceptable to
> > just have a switch to disable the batching feature for that specific
> > usage scenario.  So it would be useful to know if your xmit batching
> > changes would have similar issues.
> 
> There were many times while testing that i noticed inconsistencies and
> in each case when i analysed[1], i found it to be due to some variable
> other than batching which needed some resolving, always via some
> parametrization or other. I suspect what KK posted is in the same class.
> To give you an example, with UDP, batching was giving worse results at
> around 256B compared to 64B or 512B; investigating i found that the
> receiver just wasnt able to keep up and the udp layer dropped a lot of
> packets so both iperf and netperf reported bad numbers. Fixing the
> receiver ended up with consistency coming back. On why 256B was the one
> that overwhelmed the receiver more than 64B(which sent more pps)? On
> some limited investigation, it seemed to me to be the effect of the
> choice of the tg3 driver's default tx mitigation parameters as well tx
> ring size; which is something i plan to revisit (but neutralizing it
> helps me focus on just batching). In the end i dropped both netperf and
> iperf for similar reasons and wrote my own app. What i am trying to
> achieve is demonstrate if batching is a GoodThing. In experimentation
> like this, it is extremely valuable to reduce the variables. Batching
> may expose other orthogonal issues - those need to be resolved or fixed
> as they are found. I hope that sounds sensible.

It does sound sensible.  My own decidedly non-expert speculation
was that the big 30 % performance hit right at 4 KB may be related
to memory allocation issues or having to split the skb across
multiple 4 KB pages.  And perhaps it only affected the single
process case because with multiple processes lock contention may
be a bigger issue and the xmit batching changes would presumably
help with that.  I am admittedly a novice when it comes to the
detailed internals of TCP/skb processing, although I have been
slowly slogging my way through parts of the TCP kernel code to
try and get a better understanding, so I don't know if these
thoughts have any merit.

BTW does anyone know of a good book they would recommend that has
substantial coverage of the Linux kernel TCP code, that's fairly
up-to-date and gives both an overall view of the code and packet
flow as well as details on individual functions and algorithms,
and hopefully covers basic issues like locking and synchronization,
concurrency of different parts of the stack, and memory allocation.
I have several books already on Linux kernel and networking internals,
but they seem to only cover the IP (and perhaps UDP) portions of the
network stack, and none have more than a cursory reference to TCP.  
The most useful documentation on the Linux TCP stack that I have
found thus far is some of Dave Miller's excellent web pages and
a few other web references, but overall it seems fairly skimpy
for such an important part of the Linux network code.

> Back to the >=9K packet size you raise above:
> I dont have a 10Gige card so i am theorizing. Given that theres an
> observed benefit to batching for a saturated link with "smaller" packets
> (in my results "small" is anything below 256B which maps to about
> 380Kpps anything above that seems to approach wire speed and the link is
> the bottleneck); then i theorize that 10Gige with 9K jumbo frames if
> already achieving wire rate, should continue to do so. And sizes below
> that will see improvements if they were not already hitting wire rate.
> So i would say that with 10G NICS, there will be more observed
> improvements with batching with apps that do bulk transfers (assuming
> those apps are not seeing wire speed already). Note that this hasnt been
> quite the case even with TSO given the bottlenecks in the Linux
> receivers that J Heffner put nicely in a response to some results you
> posted - but that exposes an issue with Linux receivers rather than TSO.

It would be good to see some empirical evidence that there aren't
any unforeseen gotchas for larger packet sizes, that at least the
same level of performance can be obtained with no greater CPU
utilization.

> > Also for your xmit batching changes, I think it would be good to see
> > performance comparisons for TCP and IP forwarding in addition to your
> > UDP pktgen tests, 
> 
> That is not pktgen - it is a udp app running in process context
> utilizing all 4 CPUs to send traffic. pktgen bypasses the stack entirely
> and has its own merits in proving that batching in fact is a GoodThing
> even if it is just for traffic generation ;->
> 
> > including various packet sizes up to and including
> > 9000-byte jumbo frames.
> 
> I will do TCP and forwarding tests in the near future. 

Looking forward to it.

> cheers,
> jamal
> 
> [1] On average i spend 10x more time performance testing and analysing
> results than writing code.

As you have written previously, and I heartily agree with, this is a
very good practice for developing performance enhancement patches.

						-Thanks

						-Bill

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-10-03  5:29                       ` [ofa-general] " Bill Fink
@ 2007-10-03 13:42                         ` jamal
  0 siblings, 0 replies; 107+ messages in thread
From: jamal @ 2007-10-03 13:42 UTC (permalink / raw)
  To: Bill Fink
  Cc: David Miller, krkumar2, johnpol, herbert, kaber, shemminger,
	jagana, Robert.Olsson, rick.jones2, xma, gaagaan, netdev,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general,
	kumarkr, tgraf, randy.dunlap, sri

On Wed, 2007-03-10 at 01:29 -0400, Bill Fink wrote:

> It does sound sensible.  My own decidedly non-expert speculation
> was that the big 30 % performance hit right at 4 KB may be related
> to memory allocation issues or having to split the skb across
> multiple 4 KB pages.  

Plausible. But i also worry it could be 10 other things; for example, could
it be the driver used? I noted in my udp test the oddity that turned out
to be related to the tx coalescing parameter.
In any case, I will attempt to run those tests later.

> And perhaps it only affected the single
> process case because with multiple processes lock contention may
> be a bigger issue and the xmit batching changes would presumably
> help with that.  I am admittedly a novice when it comes to the
> detailed internals of TCP/skb processing, although I have been
> slowly slogging my way through parts of the TCP kernel code to
> try and get a better understanding, so I don't know if these
> thoughts have any merit.

You do bring up issues that need to be looked into and i will run those
tests.
Note, the effectiveness of batching becomes evident as the number of
flows grows. Actually, scratch that: it becomes evident if you can keep
the tx path busied out, something that multiple running users contribute to. If i
can have a user per CPU with lots of traffic to send, i can create that
condition. It's a little boring in the scenario where the bottleneck is
the wire but it needs to be checked.

> BTW does anyone know of a good book they would recommend that has
> substantial coverage of the Linux kernel TCP code, that's fairly
> up-to-date and gives both an overall view of the code and packet
> flow as well as details on individual functions and algorithms,
> and hopefully covers basic issues like locking and synchronization,
> concurrency of different parts of the stack, and memory allocation.
> I have several books already on Linux kernel and networking internals,
> but they seem to only cover the IP (and perhaps UDP) portions of the
> network stack, and none have more than a cursory reference to TCP.  
> The most useful documentation on the Linux TCP stack that I have
> found thus far is some of Dave Miller's excellent web pages and
> a few other web references, but overall it seems fairly skimpy
> for such an important part of the Linux network code.

Reading books or magazines may end up busying you out with some small
gains of knowledge at the end. They tend to be outdated fast. My advice
is if you start with a focus on one thing, watch the patches that fly
around on that area and learn that way. Read the code to further
understand things then ask questions when its not clear. Other folks may
have different views. The other way to do it is pick yourself some task
to either add or improve something and get your hands dirty that way. 

> It would be good to see some empirical evidence that there aren't
> any unforeseen gotchas for larger packet sizes, that at least the
> same level of performance can be obtained with no greater CPU
> utilization.

Reasonable - I will try with 9K after i move over to the new tree from
Dave and make sure nothing else broke in the previous tests.
And when all looks good, i will move to TCP.


> > [1] On average i spend 10x more time performance testing and analysing
> > results than writting code.
> 
> As you have written previously, and I heartily agree with, this is a
> very good practice for developing performance enhancement patches.

To give you a perspective, the results i posted were each run 10
iterations per packet size per kernel. Each run is 60 seconds long. I
think i am past that stage for resolving or fixing anything for UDP or
pktgen, but i need to keep checking for any new regressions when Dave
updates his tree. Now multiply that by 5 packet sizes (I am going to add
2 more) and multiply that by 3-4 kernels. Then add the time it takes to
sift through the data and collect it then analyze it and go back to the
drawing table when something doesnt look right.  Essentially, it needs a
weekend ;->

cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] [PATCHES] TX batching
  2007-09-23 17:53     ` [PATCHES] TX batching jamal
                         ` (2 preceding siblings ...)
  2007-09-30 18:50       ` [ofa-general] " jamal
@ 2007-10-07 18:34       ` jamal
  2007-10-08 12:51         ` [ofa-general] " Evgeniy Polyakov
  3 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-10-07 18:34 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber


Please provide feedback on the code and/or architecture.
Last time i posted them i received little. They are now updated to 
work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

What has changed since i posted last:
1) Fix a bug eyeballed by Patrick McHardy on requeue reordering.
2) Killed ->hard_batch_xmit() 
3) I am going one step back and making this set of patches even simpler
so i can make it easier to review. I am therefore killing dev->hard_prep_xmit()
and focussing just on batching. I plan to re-introduce dev->hard_prep_xmit()
but from now on i will make that a separate effort. (it seems to be creating
confusion in relation to the general work).

Dave, please let me know if this meets your desire to allow devices
which are SG and CSUM capable to benefit, just in case i misunderstood.
Herbert, if you can look at at least patch 3 i will appreciate it
(since it kills dev->gso_skb that you introduced).

UPCOMING PATCHES
---------------
As before:
More patches to follow later if i get some feedback - i didnt want to 
overload people by dumping too many patches. Most of these patches 
mentioned below are ready to go; some need some re-testing and others 
need a little porting from an earlier kernel: 
- tg3 driver 
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

Theres also a driver-howto i wrote that was posted on netdev last week
as well as one that describes the architectural decisions made.

PERFORMANCE TESTING
--------------------
I started testing yesterday, but these tests take a long time,
so i will probably post results at the end of the day sometime and
may stop running more tests and just compare batch vs non-batch results.
I have optimized the kernel-config so i expect my overall performance
numbers to look better than the last test results i posted for both
batch and non-batch.
My system under test hardware is still a 2xdual core opteron with a 
couple of tg3s. 
A test tool generates udp traffic of different sizes for up to 60
seconds per run or a total of 30M packets. I have 4 threads, each
running on a specific CPU, which keep all the CPUs as busy as they can
sending packets targeted at a directly connected box's udp discard port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule 
which counts and drops all incoming udp packets to discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}. 
Each packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.
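
For reference, a minimal sketch of this kind of per-CPU udp sender
(purely illustrative and hypothetical, not the actual tool; the
destination address and packet size come from the command line, and
error handling is omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdlib.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

static struct sockaddr_in dst;
static int pkt_size = 64;

/* one sender thread, pinned to its own CPU, blasting fixed-size
 * datagrams at the peer's udp discard port until the process is killed
 */
static void *blast(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t set;
	char *buf = calloc(1, pkt_size);
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);

	for (;;)
		sendto(fd, buf, pkt_size, 0,
		       (struct sockaddr *)&dst, sizeof(dst));
	return NULL;
}

/* usage: ./udpblast <dest-ip> [pkt-size] */
int main(int argc, char **argv)
{
	pthread_t tid[4];
	long i;

	dst.sin_family = AF_INET;
	dst.sin_port = htons(9);	/* discard port */
	inet_pton(AF_INET, argv[1], &dst.sin_addr);
	if (argc > 2)
		pkt_size = atoi(argv[2]);

	for (i = 0; i < 4; i++)
		pthread_create(&tid[i], NULL, blast, (void *)i);
	for (i = 0; i < 4; i++)
		pthread_join(tid[i], NULL);
	return 0;
}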

I do plan also to run forwarding and TCP tests in the future when the
dust settles.

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface
  2007-09-23 17:58         ` [ofa-general] [PATCH 2/4] [NET_BATCH] Introduce batching interface jamal
  2007-09-23 18:00           ` [PATCH 3/4][NET_BATCH] net core use batching jamal
  2007-09-30 18:51           ` [ofa-general] [PATCH 1/4] [NET_BATCH] Introduce batching interface jamal
@ 2007-10-07 18:36           ` jamal
  2007-10-08  9:59             ` Krishna Kumar2
  2 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-10-07 18:36 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 77 bytes --]

This patch introduces the netdevice interface for batching.

cheers,
jamal



[-- Attachment #2: oct07-p1of3 --]
[-- Type: text/plain, Size: 7403 bytes --]

[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

BACKGROUND
---------

A driver dev->hard_start_xmit() has 4 typical parts:
a) packet formatting (for example vlan, mss, descriptor counting, etc.)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete the packet transmit: tell the DMA engine to
chew on it, tx completion interrupts, set last tx time, etc.

[For code cleanliness/readability's sake, regardless of this work,
one should break dev->hard_start_xmit() into those 4 functions anyway.]

INTRODUCING API
---------------

With the API introduced in this patch, a driver which has all
4 parts and needs to support batching is advised to split its
dev->hard_start_xmit() in the following manner:
1) Remove #d from dev->hard_start_xmit() and put it in the
dev->hard_end_xmit() method.
2) #b and #c can stay in ->hard_start_xmit() (or whichever way you want
to do this).
3) #a is deferred to future work to reduce confusion (since it holds
on its own).

Note: there are drivers which may not need to support either of the two
approaches (for example the tun driver i patched), so the methods are
optional.

The xmit_win variable is set by the driver to tell the core how much space
it has to take on new skbs. It is introduced to ensure that when we pass
the driver a list of packets it will swallow all of them - which is
useful because we dont requeue to the qdisc (and it avoids burning
unnecessary cpu cycles or introducing any strange re-ordering). The driver
tells us how much space it has for descriptors by setting this variable
when it invokes netif_wake_queue.

Refer to the driver howto for more details.
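
As a rough sketch of the resulting driver shape (a hypothetical driver
"foo"; struct foo_priv and the ring helpers foo_tx_avail(),
foo_queue_on_ring() and foo_kick_dma() are placeholders, and the
xmit_win arithmetic follows the tg3 sample posted elsewhere in this
thread):

static int foo_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct foo_priv *fp = netdev_priv(dev);

	/* #b + #c: chip specific formatting and placing the skb on the ring */
	if (foo_tx_avail(fp) <= skb_shinfo(skb)->nr_frags + 1) {
		netif_stop_queue(dev);
		dev->xmit_win = 1;
		return NETDEV_TX_BUSY;
	}
	foo_queue_on_ring(fp, skb);
	return NETDEV_TX_OK;
}

static void foo_hard_end_xmit(struct net_device *dev)
{
	struct foo_priv *fp = netdev_priv(dev);

	/* #d: one IO kick for the whole batch, then advertise how many
	 * descriptors are left so the core knows how much to dequeue next
	 */
	foo_kick_dma(fp);
	dev->xmit_win = foo_tx_avail(fp) - (MAX_SKB_FRAGS + 1);
	if (dev->xmit_win < 1)
		dev->xmit_win = 1;
}

/* and at probe time: */
dev->hard_start_xmit = foo_hard_start_xmit;
dev->hard_end_xmit = foo_hard_end_xmit;
dev->features |= NETIF_F_BTX;
dev->xmit_win = 1;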

THEORY OF OPERATION
-------------------

1. Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented
and GSO packets are accounted for as well.
2. Core grabs TX_LOCK
3. Core loop for all skbs:
                    invokes driver dev->hard_start_xmit()
4. Core invokes driver dev->hard_end_xmit()
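
A condensed view of steps 3 and 4 (the real code is dev_batch_xmit() in
the diff below; locking, dev_queue_xmit_nit() and the error/requeue
handling are omitted here):

	while ((skb = __skb_dequeue(&dev->blist)) != NULL)
		if (dev->hard_start_xmit(skb, dev))
			break;			/* driver filled up or failed */
	if (dev->hard_end_xmit)
		dev->hard_end_xmit(dev);	/* one IO kick for the batch */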

ACKNOWLEDGEMENT AND SOME HISTORY
--------------------------------

There's a lot of history and reasoning of "why batching" in a document
i am writing which i may submit as a patch.
Thomas Graf (who doesnt know this probably) gave me the impetus to
start looking at this back in 2004 when he invited me to the linux
conference he was organizing. Parts of what i presented in SUCON in
2004 talk about batching. Herbert Xu forced me to take a second look around
2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided
me with more motivation in May 2007 when he posted on netdev and engaged
me.
Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan,
Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller,
and Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or 
more of {bug fixes, enhancements, testing, lively discussion}. The 
Broadcom and neterion folks have been outstanding in their help.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

---
commit 0a0762e2c615a980af284e86d9729d233e1bf7f4
tree c27fec824a9e75ffbb791647bdb595c082a54990
parent 190674ff1fe0b7bddf038c2bfddf45b9c6418e2a
author Jamal Hadi Salim <hadi@cyberus.ca> Sun, 07 Oct 2007 08:51:10 -0400
committer Jamal Hadi Salim <hadi@cyberus.ca> Sun, 07 Oct 2007 08:51:10 -0400

 include/linux/netdevice.h |   11 ++++++
 net/core/dev.c            |   83 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..b31df5c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -467,6 +467,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -595,6 +596,9 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	void			(*hard_end_xmit) (struct net_device *dev);
+	int			xmit_win;
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -609,6 +613,7 @@ struct net_device
 
 	/* delayed register/unregister */
 	struct list_head	todo_list;
+	struct sk_buff_head     blist;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
 
@@ -1044,6 +1049,12 @@ extern int		dev_set_mac_address(struct net_device *,
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_batch_xmit(struct net_device *dev);
+extern int		prepare_gso_skb(struct sk_buff *skb,
+					struct net_device *dev,
+					struct sk_buff_head *skbs);
+extern int		xmit_prepare_skb(struct sk_buff *skb,
+					 struct net_device *dev);
 
 extern int		netdev_budget;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index d998646..04df3fb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1517,6 +1517,87 @@ static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }
 
+int prepare_gso_skb(struct sk_buff *skb, struct net_device *dev,
+		    struct sk_buff_head *skbs)
+{
+	int tdq = 0;
+	do {
+		struct sk_buff *nskb = skb->next;
+
+		skb->next = nskb->next;
+		nskb->next = NULL;
+
+		/* Driver likes this packet .. */
+		tdq++;
+		__skb_queue_tail(skbs, nskb);
+	} while (skb->next);
+	skb->destructor = DEV_GSO_CB(skb)->destructor;
+	kfree_skb(skb);
+
+	return tdq;
+}
+
+int xmit_prepare_skb(struct sk_buff *skb, struct net_device *dev)
+{
+	struct sk_buff_head *skbs = &dev->blist;
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree_skb(skb);
+			return 0;
+		}
+		if (skb->next)
+			return prepare_gso_skb(skb, dev, skbs);
+	}
+
+	__skb_queue_tail(skbs, skb);
+	return 1;
+}
+
+int dev_batch_xmit(struct net_device *dev)
+{
+	struct sk_buff_head *skbs = &dev->blist;
+	int rc = NETDEV_TX_OK;
+	struct sk_buff *skb;
+	int orig_w = dev->xmit_win;
+	int orig_pkts = skb_queue_len(skbs);
+
+	while ((skb = __skb_dequeue(skbs)) != NULL) {
+		if (!list_empty(&ptype_all))
+			dev_queue_xmit_nit(skb, dev);
+		rc = dev->hard_start_xmit(skb, dev);
+		if (unlikely(rc))
+			break;
+		/* * XXX: multiqueue may need closer srutiny.. */
+		if (unlikely(netif_queue_stopped(dev) ||
+		     netif_subqueue_stopped(dev, skb->queue_mapping))) {
+			rc = NETDEV_TX_BUSY;
+			break;
+		}
+	}
+
+	/* driver is likely buggy and lied to us on how much
+	 * space it had. Damn you driver ..
+	*/
+	if (unlikely(skb_queue_len(skbs))) {
+		printk(KERN_WARNING "Likely bug %s %s (%d) "
+				"left %d/%d window now %d, orig %d\n",
+			dev->name, rc?"busy":"locked",
+			netif_queue_stopped(dev),
+			skb_queue_len(skbs),
+			orig_pkts,
+			dev->xmit_win,
+			orig_w);
+			rc = NETDEV_TX_BUSY;
+	}
+
+	if (orig_pkts > skb_queue_len(skbs))
+		if (dev->hard_end_xmit)
+			dev->hard_end_xmit(dev);
+
+	return rc;
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	if (likely(!skb->next)) {
@@ -3553,6 +3634,8 @@ int register_netdevice(struct net_device *dev)
 		}
 	}
 
+	dev->xmit_win = 1;
+	skb_queue_head_init(&dev->blist);
 	ret = netdev_register_kobject(dev);
 	if (ret)
 		goto err_uninit;




^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [ofa-general] [PATCH 2/3][NET_BATCH] net core use batching
  2007-09-23 18:00           ` [PATCH 3/4][NET_BATCH] net core use batching jamal
  2007-09-23 18:02             ` [ofa-general] [PATCH 4/4][NET_SCHED] kill dev->gso_skb jamal
  2007-09-30 18:52             ` [ofa-general] [PATCH 2/3][NET_BATCH] net core use batching jamal
@ 2007-10-07 18:38             ` jamal
  2 siblings, 0 replies; 107+ messages in thread
From: jamal @ 2007-10-07 18:38 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 73 bytes --]

This patch adds the usage of batching within the core.

cheers,
jamal




[-- Attachment #2: oct07-p2of3 --]
[-- Type: text/plain, Size: 5401 bytes --]

[NET_BATCH] net core use batching

This patch adds the usage of batching within the core.
Performance results demonstrating improvement are provided separately.

I have #if-0ed some of the old functions so the patch is more readable.
A future patch will remove all if-0ed content.
Patrick McHardy eyeballed a bug that will cause re-ordering in case
of a requeue.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

---
commit cd602aa5f84fcef6359852cd99c95863eeb91015
tree f31d2dde4f138ff6789682163624bc0f8541aa77
parent 0a0762e2c615a980af284e86d9729d233e1bf7f4
author Jamal Hadi Salim <hadi@cyberus.ca> Sun, 07 Oct 2007 09:13:04 -0400
committer Jamal Hadi Salim <hadi@cyberus.ca> Sun, 07 Oct 2007 09:13:04 -0400

 net/sched/sch_generic.c |  132 +++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 120 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..80ac56b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 
 	return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+				"Dead loop on netdevice %s, fix it urgently!\n",
+				dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+	       struct Qdisc *q)
+{
+
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue_tail(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+	netif_schedule(dev);
+	return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+	    struct Qdisc *q)
+{
+	int ret = handle_dev_cpu_collision(dev);
+
+	if (ret) {
+		if (!skb_queue_empty(skbs))
+			skb_queue_purge(skbs);
+		return qdisc_qlen(q);
+	}
+
+	return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			   struct Qdisc *q,
+			   struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count  && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	return skb_queue_len(pktlist);
+}
+
+static int xmit_prepare_pkts(struct net_device *dev,
+			     struct sk_buff_head *tlist)
+{
+	struct sk_buff *skb;
+	struct sk_buff_head *flist = &dev->blist;
+
+	while ((skb = __skb_dequeue(tlist)) != NULL)
+		xmit_prepare_skb(skb, dev);
+
+	return skb_queue_len(flist);
+}
 
 /*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
@@ -130,22 +222,32 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+
+static inline int qdisc_restart(struct net_device *dev,
+				struct sk_buff_head *tpktlist)
 {
 	struct Qdisc *q = dev->qdisc;
-	struct sk_buff *skb;
-	int ret;
+	int ret = 0;
 
-	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
-		return 0;
+	/* use of tpktlist reduces the amount of time we sit
+	 * holding the queue_lock
+	*/
+	ret = xmit_get_pkts(dev, q, tpktlist);
 
+	if (!ret)
+		return 0;
 
-	/* And release queue */
+	/* We got em packets */
 	spin_unlock(&dev->queue_lock);
 
+	/* prepare to embark, no locks held moves packets
+	* to dev->blist
+	* */
+	xmit_prepare_pkts(dev, tpktlist);
+
+	/* bye packets ....*/
 	HARD_TX_LOCK(dev, smp_processor_id());
-	ret = dev_hard_start_xmit(skb, dev);
+	ret = dev_batch_xmit(dev);
 	HARD_TX_UNLOCK(dev);
 
 	spin_lock(&dev->queue_lock);
@@ -158,8 +260,8 @@ static inline int qdisc_restart(struct net_device *dev)
 		break;
 
 	case NETDEV_TX_LOCKED:
-		/* Driver try lock failed */
-		ret = handle_dev_cpu_collision(skb, dev, q);
+		/* Driver lock failed */
+		ret = xmit_islocked(&dev->blist, dev, q);
 		break;
 
 	default:
@@ -168,7 +270,7 @@ static inline int qdisc_restart(struct net_device *dev)
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
 
-		ret = dev_requeue_skb(skb, dev, q);
+		ret = dev_requeue_skbs(&dev->blist, dev, q);
 		break;
 	}
 
@@ -177,8 +279,11 @@ static inline int qdisc_restart(struct net_device *dev)
 
 void __qdisc_run(struct net_device *dev)
 {
+	struct sk_buff_head tpktlist;
+	skb_queue_head_init(&tpktlist);
+
 	do {
-		if (!qdisc_restart(dev))
+		if (!qdisc_restart(dev, &tpktlist))
 			break;
 	} while (!netif_queue_stopped(dev));
 
@@ -564,6 +669,9 @@ void dev_deactivate(struct net_device *dev)
 
 	skb = dev->gso_skb;
 	dev->gso_skb = NULL;
+	if (!skb_queue_empty(&dev->blist))
+		skb_queue_purge(&dev->blist);
+	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);
 
 	kfree_skb(skb);




^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [ofa-general] [PATCH 3/3][NET_BATCH] kill dev->gso_skb
  2007-09-23 18:02             ` [ofa-general] [PATCH 4/4][NET_SCHED] kill dev->gso_skb jamal
  2007-09-30 18:53               ` [ofa-general] [PATCH 3/3][NET_SCHED] " jamal
@ 2007-10-07 18:39               ` jamal
  1 sibling, 0 replies; 107+ messages in thread
From: jamal @ 2007-10-07 18:39 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 96 bytes --]

This patch removes dev->gso_skb as it is no longer necessary with
batching code.

cheers,
jamal

[-- Attachment #2: oct07-p3of3 --]
[-- Type: text/plain, Size: 2277 bytes --]

[NET_BATCH] kill dev->gso_skb
The batching code does what gso used to batch at the drivers.
There is no more need for gso_skb. If for whatever reason the
requeueing is a bad idea we are going to leave packets in dev->blist
(and still not need dev->gso_skb)

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

---
commit 7ebf50f0f43edd4897b88601b4133612fc36af61
tree 5d942ecebc14de6254ab3c812d542d524e148e92
parent cd602aa5f84fcef6359852cd99c95863eeb91015
author Jamal Hadi Salim <hadi@cyberus.ca> Sun, 07 Oct 2007 09:30:19 -0400
committer Jamal Hadi Salim <hadi@cyberus.ca> Sun, 07 Oct 2007 09:30:19 -0400

 include/linux/netdevice.h |    3 ---
 net/sched/sch_generic.c   |   12 ------------
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b31df5c..4ddc6eb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -577,9 +577,6 @@ struct net_device
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
 
-	/* Partially transmitted GSO packet. */
-	struct sk_buff		*gso_skb;
-
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
 	struct Qdisc		*qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 80ac56b..772e7fe 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;
 
-	if (count  && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -659,7 +652,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;
 
 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -667,15 +659,11 @@ void dev_deactivate(struct net_device *dev)
 
 	qdisc_reset(qdisc);
 
-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(&dev->blist))
 		skb_queue_purge(&dev->blist);
 	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);
 
-	kfree_skb(skb);
-
 	dev_watchdog_down(dev);
 
 	/* Wait for outstanding dev_queue_xmit calls. */




^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-09-24 23:38               ` [ofa-general] " jamal
  2007-09-24 23:47                 ` Waskiewicz Jr, Peter P
@ 2007-10-08  4:51                 ` David Miller
  2007-10-08 13:34                   ` jamal
  1 sibling, 1 reply; 107+ messages in thread
From: David Miller @ 2007-10-08  4:51 UTC (permalink / raw)
  To: hadi
  Cc: johnpol, kumarkr, herbert, gaagaan, Robert.Olsson, netdev,
	rdreier, peter.p.waskiewicz.jr, mcarlson, randy.dunlap, jagana,
	general, mchan, tgraf, jeff, sri, shemminger, kaber

From: jamal <hadi@cyberus.ca>
Date: Mon, 24 Sep 2007 19:38:19 -0400

> How is the policy to define the qdisc queues locked/mapped to tx rings? 

For these high performance 10Gbit cards it's a load balancing
function, really, as all of the transmit queues go out to the same
physical port so you could:

1) Load balance on CPU number.
2) Load balance on "flow"
3) Load balance on destination MAC

etc. etc. etc.

It's something that really sits logically between the qdisc and the
card, not something that is a qdisc thing.

In some ways it's similar to bonding, but using anything similar to
bonding's infrastructure (stacking devices) is way overkill for this.

And then we have the virtualization network devices where the queue
selection has to be made precisely, in order for the packet to
reach the proper destination, rather than a performance improvement.
It is also a situation where the TX queue selection is something
to be made between qdisc activity and hitting the device.

I think we will initially have to live with taking the centralized
qdisc lock for the device, get in and out of that as fast as possible,
then only take the TX queue lock of the queue selected.

After we get things that far we can try to find some clever lockless
algorithm for handling the qdisc to get rid of that hot spot.

These queue selection schemes want a common piece of generic code.  A
set of load balancing algorithms, a "select TX queue by MAC with a
default fallback on no match" for virtualization, and interfaces for
both drivers and userspace to change the queue selection scheme.
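
As a rough sketch only (nothing below is from the posted patches; the
function name and the nr_tx_rings parameter are invented), such a generic
selection hook could look something like:

	/* Hypothetical TX ring selection: cheap hash on the destination
	 * MAC (option 3 above), falling back to the submitting CPU
	 * (option 1) when there is no link-layer header to look at.
	 */
	static u16 pick_tx_ring(struct sk_buff *skb, unsigned int nr_tx_rings)
	{
		const struct ethhdr *eth = (const struct ethhdr *)skb->data;
		u32 hash;

		if (skb->len >= ETH_HLEN)
			hash = eth->h_dest[4] ^ eth->h_dest[5];
		else
			hash = smp_processor_id();

		return nr_tx_rings ? hash % nr_tx_rings : 0;
	}

A virtualization-aware scheme would replace the hash with an exact MAC
lookup and only fall back to something like the above when no match is
found.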

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-10-01 13:21                 ` jamal
@ 2007-10-08  5:03                   ` Krishna Kumar2
  2007-10-08 13:17                     ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: Krishna Kumar2 @ 2007-10-08  5:03 UTC (permalink / raw)
  To: hadi
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, Patrick McHardy, netdev,
	general, mchan, tgraf, randy.dunlap, sri, shemminger,
	David Miller, herbert

> jamal wrote:
>
> > > +   while ((skb = __skb_dequeue(skbs)) != NULL)
> > > +      q->ops->requeue(skb, q);
> >
> >
> > ->requeue queues at the head, so this looks like it would reverse
> > the order of the skbs.
>
> Excellent catch!  thanks; i will fix.
>
> As a side note: Any batching driver should _never_ have to requeue; if
> it does it is buggy. And the non-batching ones if they ever requeue will
> be a single packet, so not much reordering.

On the contrary, batching LLTX drivers (if that is not ruled out) will very
often requeue resulting in heavy reordering. Fix looks good though.

- KK

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface
  2007-10-07 18:36           ` [ofa-general] " jamal
@ 2007-10-08  9:59             ` Krishna Kumar2
  2007-10-08 13:49               ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: Krishna Kumar2 @ 2007-10-08  9:59 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, gaagaan, general, herbert, jagana, jeff, johnpol,
	kaber, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
	randy.dunlap, rdreier, rick.jones2, Robert.Olsson, shemminger,
	sri, tgraf, xma

Hi Jamal,

If you don't mind, I am trying to run your approach vs mine to get some
results for comparison.

For starters, I am having issues with iperf when using your infrastructure
code with my IPoIB driver - about 100MB is sent and then everything stops
for some reason.
The changes in the IPoIB driver that I made to support batching are to
set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit and
increase it on every xmit completion. Is there anything else that is
required from the driver?

thanks,

- KK

J Hadi Salim <j.hadi123@gmail.com> wrote on 10/08/2007 12:06:23 AM:

> This patch introduces the netdevice interface for batching.
>
> cheers,
> jamal
>
>
> [attachment "oct07-p1of3" deleted by Krishna Kumar2/India/IBM]


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCHES] TX batching
  2007-10-07 18:34       ` [ofa-general] " jamal
@ 2007-10-08 12:51         ` Evgeniy Polyakov
  2007-10-08 14:05           ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: Evgeniy Polyakov @ 2007-10-08 12:51 UTC (permalink / raw)
  To: jamal
  Cc: jagana, peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, David Miller, jeff,
	general, mchan, tgraf, randy.dunlap, sri, shemminger, kaber

Hi Jamal.

On Sun, Oct 07, 2007 at 02:34:53PM -0400, jamal (hadi@cyberus.ca) wrote:
> 
> Please provide feedback on the code and/or architecture.
> Last time i posted them i received little. They are now updated to 
> work with the latest net-2.6.24 from a few hours ago.
> 
> Patch 1: Introduces batching interface
> Patch 2: Core uses batching interface
> Patch 3: get rid of dev->gso_skb

it looks like you and Krishna use the same requeueing method - get one
skb from the qdisc, queue it into the blist, get the next from the qdisc,
queue it, and eventually start transmitting, where you dequeue them
one-by-one and send (or prepare and commit). This is not the 100% optimal
approach, but if you have proved it does not hurt usual network
processing, it is ok.
The number of comments has dwindled to very few - that's a sign, but I'm
a bit lost: did you and Krishna create competing approaches, or can they
co-exist? In the former case I doubt you can push until all the
problematic places are resolved; in the latter case this is probably
ready.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-10-08  5:03                   ` Krishna Kumar2
@ 2007-10-08 13:17                     ` jamal
  2007-10-09  3:09                       ` [ofa-general] " Krishna Kumar2
  0 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-10-08 13:17 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David Miller, gaagaan, general, herbert, jagana, jeff, johnpol,
	Patrick McHardy, kumarkr, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, randy.dunlap, rdreier, rick.jones2,
	Robert.Olsson, shemminger, sri, tgraf, xma

On Mon, 2007-08-10 at 10:33 +0530, Krishna Kumar2 wrote:

> > As a side note: Any batching driver should _never_ have to requeue; if
> > it does it is buggy. And the non-batching ones if they ever requeue will
> > be a single packet, so not much reordering.
> 
> On the contrary, batching LLTX drivers (if that is not ruled out) will very
> often requeue resulting in heavy reordering. Fix looks good though.

Two things:
one, LLTX is deprecated (I think i saw a patch which says no more new
drivers should do LLTX) and i plan to kill LLTX in e1000 RSN if nobody
else does. So for that reason i removed all code that existed to
support LLTX.
two, there should _never_ be any requeueing, even with LLTX, in the
previous patches where i supported it; if there is, it is a bug. This is
because we don't send more than what the driver asked for via xmit_win.
So if it asked for more than it can handle, that is a bug. If its
available space changes while we are sending to it, that too is a bug.

cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-10-08  4:51                 ` [ofa-general] " David Miller
@ 2007-10-08 13:34                   ` jamal
  2007-10-08 14:22                     ` parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock) Jeff Garzik
  2007-10-08 21:05                     ` [PATCH 1/4] [NET_SCHED] explict hold dev tx lock David Miller
  0 siblings, 2 replies; 107+ messages in thread
From: jamal @ 2007-10-08 13:34 UTC (permalink / raw)
  To: David Miller
  Cc: peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
	shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
	netdev, rdreier, mcarlson, jeff, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri

On Sun, 2007-07-10 at 21:51 -0700, David Miller wrote:

> For these high performance 10Gbit cards it's a load balancing
> function, really, as all of the transmit queues go out to the same
> physical port so you could:
> 
> 1) Load balance on CPU number.
> 2) Load balance on "flow"
> 3) Load balance on destination MAC
> 
> etc. etc. etc.

The brain-block i am having is the parallelization aspect of it.
Whatever scheme it is - it needs to ensure the scheduler works as
expected. For example, if it was a strict prio scheduler i would expect
that whatever goes out is always high priority first and never ever
allow a low prio packet out at any time theres something high prio
needing to go out. If i have the two priorities running on two cpus,
then i cant guarantee that effect.
IOW, i see the scheduler/qdisc level as not being split across parallel
cpus. Do i make any sense?

The rest of my understanding hinges on the above, so let me stop here.

cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface
  2007-10-08  9:59             ` Krishna Kumar2
@ 2007-10-08 13:49               ` jamal
  0 siblings, 0 replies; 107+ messages in thread
From: jamal @ 2007-10-08 13:49 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David Miller, gaagaan, general, herbert, jagana, jeff, johnpol,
	kaber, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
	randy.dunlap, rdreier, rick.jones2, Robert.Olsson, shemminger,
	sri, tgraf, xma

On Mon, 2007-08-10 at 15:29 +0530, Krishna Kumar2 wrote:
> Hi Jamal,
> 
> If you don't mind, I am trying to run your approach vs mine to get some
> results for comparison.

Please provide an analysis when you get the results. IOW, explain why
one vs the other get different results.

> For starters, I am having issues with iperf when using your infrastructure
> code with my IPoIB driver - about 100MB is sent and then everything stops
> for some reason.

I havent tested with iperf in a while.
Can you post the netstat on both sides when the driver stops?
It does sound like a driver issue to me.

> The changes in the IPoIB driver that I made to support batching are to
> set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit
> and increase xmit_win on every xmit completion.

From driver howto:
---
This variable should be set during xmit path shutdown(netif_stop),
wakeup(netif_wake) and ->hard_end_xmit(). In the case of the first
one the value is set to 1 and in the other two it is set to whatever
the driver deems to be available space on the ring.
----
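
As a rough illustration of that bookkeeping (hypothetical driver
fragments, not taken from any posted driver; ring_free_slots(),
MAX_DESC_PER_SKB and TX_WAKE_THRESHOLD are invented names, only
dev->xmit_win is real):

	/* in ->hard_start_xmit(), after posting the descriptors: */
	if (ring_free_slots(ring) < MAX_DESC_PER_SKB) {
		netif_stop_queue(dev);
		dev->xmit_win = 1;	/* stopped: advertise no space */
	} else {
		/* tell the core how much more we can take */
		dev->xmit_win = ring_free_slots(ring);
	}

	/* in the TX completion (wake) path: */
	if (netif_queue_stopped(dev) &&
	    ring_free_slots(ring) >= TX_WAKE_THRESHOLD) {
		dev->xmit_win = ring_free_slots(ring);
		netif_wake_queue(dev);
	}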

> Is there anything else that is required from the
> driver?

Your driver needs to also support wake thresholding.
I will post the driver howto later today.

cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCHES] TX batching
  2007-10-08 12:51         ` [ofa-general] " Evgeniy Polyakov
@ 2007-10-08 14:05           ` jamal
  2007-10-09  8:14             ` Krishna Kumar2
  0 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-10-08 14:05 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, krkumar2, herbert, kaber, shemminger, jagana,
	Robert.Olsson, rick.jones2, xma, gaagaan, netdev, rdreier,
	peter.p.waskiewicz.jr, mcarlson, jeff, mchan, general, kumarkr,
	tgraf, randy.dunlap, sri

On Mon, 2007-08-10 at 16:51 +0400, Evgeniy Polyakov wrote:

> it looks like you and Krishna use the same requeueing method - get one
> skb from the qdisc, queue it into the blist, get the next from the qdisc,
> queue it, and eventually start transmitting, where you dequeue them
> one-by-one and send (or prepare and commit). This is not the 100% optimal
> approach, but if you have proved it does not hurt usual network
> processing, it is ok.

There are probably other bottlenecks that hide the need to optimize
further.

> The number of comments has dwindled to very few - that's a sign, but I'm
> a bit lost: did you and Krishna create competing approaches, or can they
> co-exist? In the former case I doubt you can push until all the
> problematic places are resolved; in the latter case this is probably
> ready.

Thanks. I would like to make one more cleanup and get rid of the
temporary pkt list in qdisc restart; now that i have deferred the skb
pre-format interface it is unnecessary.  I have a day off today, so i
will make changes, re-run tests and post again.

I don't see something from Krishna's approach that i can take and reuse.
This may be because my old approaches have evolved from the same path.
There is a long list, but as a sample: i used to do a lot more work while
holding the queue lock, which i have now moved post queue lock; i don't
have any special interfaces/tricks just for batching; i provide hints
to the core of how much the driver can take, etc etc. I have offered
Krishna co-authorship if he makes the IPoIB driver work on my
patches; that offer still stands if he chooses to take it.

cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock)
  2007-10-08 13:34                   ` jamal
@ 2007-10-08 14:22                     ` Jeff Garzik
  2007-10-08 15:18                         ` [ofa-general] " jamal
  2007-10-08 21:11                         ` [ofa-general] " David Miller
  2007-10-08 21:05                     ` [PATCH 1/4] [NET_SCHED] explict hold dev tx lock David Miller
  1 sibling, 2 replies; 107+ messages in thread
From: Jeff Garzik @ 2007-10-08 14:22 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert,
	kaber, shemminger, jagana, Robert.Olsson, rick.jones2, xma,
	gaagaan, netdev, rdreier, Ingo Molnar, mchan, general, kumarkr,
	tgraf, randy.dunlap, sri, Linux Kernel Mailing List

jamal wrote:
> On Sun, 2007-07-10 at 21:51 -0700, David Miller wrote:
> 
>> For these high performance 10Gbit cards it's a load balancing
>> function, really, as all of the transmit queues go out to the same
>> physical port so you could:
>>
>> 1) Load balance on CPU number.
>> 2) Load balance on "flow"
>> 3) Load balance on destination MAC
>>
>> etc. etc. etc.
> 
> The brain-block i am having is the parallelization aspect of it.
> Whatever scheme it is - it needs to ensure the scheduler works as
> expected. For example, if it was a strict prio scheduler i would expect
> that whatever goes out is always high priority first and never ever
> allow a low prio packet out at any time theres something high prio
> needing to go out. If i have the two priorities running on two cpus,
> then i cant guarantee that effect.

Any chance the NIC hardware could provide that guarantee?

8139cp, for example, has two TX DMA rings, with hardcoded 
characteristics:  one is a high prio q, and one a low prio q.  The logic 
is pretty simple:   empty the high prio q first (potentially starving 
low prio q, in worst case).
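
(Just to make that concrete - a sketch of such a drain loop, not the
actual 8139cp code; the nic structure, ring_empty() and xmit_one() are
invented here:)

	/* strict priority: always service the high-prio ring first,
	 * starving the low-prio ring in the worst case
	 */
	while (!ring_empty(&nic->tx_hi) || !ring_empty(&nic->tx_lo)) {
		if (!ring_empty(&nic->tx_hi))
			xmit_one(nic, &nic->tx_hi);
		else
			xmit_one(nic, &nic->tx_lo);
	}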


In terms of overall parallelization, both for TX as well as RX, my gut 
feeling is that we want to move towards an MSI-X, multi-core friendly 
model where packets are LIKELY to be sent and received by the same set 
of [cpus | cores | packages | nodes] that the [userland] processes 
dealing with the data.

There are already some primitive NUMA bits in skbuff allocation, but 
with modern MSI-X and RX/TX flow hashing we could do a whole lot more, 
along the lines of better CPU scheduling decisions, directing flows to 
clusters of cpus, and generally doing a better job of maximizing cache 
efficiency in a modern multi-thread environment.

IMO the current model where each NIC's TX completion and RX processes 
are both locked to the same CPU is outmoded in a multi-core world with 
modern NICs.  :)

But I readily admit general ignorance about the kernel process 
scheduling stuff, so my only idea about a starting point was to see how 
far to go with the concept of "skb affinity" -- a mask in sk_buff that 
is a hint about which cpu(s) on which the NIC should attempt to send and 
receive packets.  When going through bonding or netfilter, it is trivial 
to 'or' together affinity masks.  All the various layers of net stack 
should attempt to honor the skb affinity, where feasible (requires 
interaction with CFS scheduler?).
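
(A minimal sketch of that hint, assuming an invented cpu_affinity field
in struct sk_buff plus the existing cpumask helpers - nothing like this
exists today:)

	/* bonding, netfilter etc. would simply OR their own preference
	 * into the skb's hint as it passes through
	 */
	static inline void skb_merge_affinity(struct sk_buff *skb,
					      cpumask_t layer_hint)
	{
		cpus_or(skb->cpu_affinity, skb->cpu_affinity, layer_hint);
	}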

Or maybe skb affinity is a dumb idea.  I wanted to get people thinking 
on the bigger picture.  Parallelization starts at the user process.

	Jeff



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock)
  2007-10-08 14:22                     ` parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock) Jeff Garzik
@ 2007-10-08 15:18                         ` jamal
  2007-10-08 21:11                         ` [ofa-general] " David Miller
  1 sibling, 0 replies; 107+ messages in thread
From: jamal @ 2007-10-08 15:18 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: David Miller, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert,
	kaber, shemminger, jagana, Robert.Olsson, rick.jones2, xma,
	gaagaan, netdev, rdreier, Ingo Molnar, mchan, general, kumarkr,
	tgraf, randy.dunlap, sri, Linux Kernel Mailing List

On Mon, 2007-08-10 at 10:22 -0400, Jeff Garzik wrote:

> Any chance the NIC hardware could provide that guarantee?

If you can get the scheduling/dequeuing to run on one CPU (as we do
today) it should work; alternatively you can totally bypass the qdisc
subsystem and go direct to the hardware for devices that are capable,
and that would work but would require huge changes.
My fear is that there are mini-scheduler pieces running on multiple cpus,
which is what i understood as being described.

> 8139cp, for example, has two TX DMA rings, with hardcoded 
> characteristics:  one is a high prio q, and one a low prio q.  The logic 
> is pretty simple:   empty the high prio q first (potentially starving 
> low prio q, in worst case).

sounds like strict prio scheduling to me which says "if low prio starves
so be it"

> In terms of overall parallelization, both for TX as well as RX, my gut 
> feeling is that we want to move towards an MSI-X, multi-core friendly 
> model where packets are LIKELY to be sent and received by the same set 
> of [cpus | cores | packages | nodes] that the [userland] processes 
> dealing with the data.

Does putting things in the same core help? But overall i agree with your
views. 

> There are already some primitive NUMA bits in skbuff allocation, but 
> with modern MSI-X and RX/TX flow hashing we could do a whole lot more, 
> along the lines of better CPU scheduling decisions, directing flows to 
> clusters of cpus, and generally doing a better job of maximizing cache 
> efficiency in a modern multi-thread environment.

I think i see the receive side with a lot of clarity; i am still foggy on
the xmit path, mostly because of the qos/scheduling issues.

> IMO the current model where each NIC's TX completion and RX processes 
> are both locked to the same CPU is outmoded in a multi-core world with 
> modern NICs.  :)

In fact even with the status quo there's a case that can be made to not
bind to interrupts.
In my recent experience with batching, due to the nature of my test app,
if i let the interrupts float across multiple cpus i benefit.
My app runs/binds a thread per CPU and so benefits from having more
juice to send more packets per unit of time - something i wouldn't get if
i was always running on one cpu.
But when i do this i found that just because i have bound a thread to
cpu3 doesn't mean that thread will always run on cpu3. If netif_wakeup
happens on cpu1, the scheduler will put the thread on cpu1 if it is to be
run. It made sense to do that, it just took me a while to digest.

> But I readily admit general ignorance about the kernel process 
> scheduling stuff, so my only idea about a starting point was to see how 
> far to go with the concept of "skb affinity" -- a mask in sk_buff that 
> is a hint about which cpu(s) on which the NIC should attempt to send and 
> receive packets.  When going through bonding or netfilter, it is trivial 
> to 'or' together affinity masks.  All the various layers of net stack 
> should attempt to honor the skb affinity, where feasible (requires 
> interaction with CFS scheduler?).

There would be cache benefits if you can free the packet on the same cpu
it was allocated on; so the idea of skb affinity is useful, at least in
that minimal sense, if you can pull it off. Assuming hardware is capable,
even if you just tagged the skb on xmit to say which cpu it was sent out
on, and made sure that's where it is freed, that would be a good start.

Note: The majority of the packet processing overhead is _still_ the
memory subsystem latency; in my tests with batched pktgen, improving the
xmit subsystem meant the overhead of allocating and freeing the packets
went to something > 80%.
So something along the lines of parallelizing the alloc/free of skbs
across more cpus than where xmit/receive run would, IMO, see more
performance improvements.

> Or maybe skb affinity is a dumb idea.  I wanted to get people thinking 
> on the bigger picture.  Parallelization starts at the user process.


cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread


* Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
  2007-10-08 13:34                   ` jamal
  2007-10-08 14:22                     ` parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock) Jeff Garzik
@ 2007-10-08 21:05                     ` David Miller
  1 sibling, 0 replies; 107+ messages in thread
From: David Miller @ 2007-10-08 21:05 UTC (permalink / raw)
  To: hadi
  Cc: peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
	shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
	netdev, rdreier, mcarlson, jeff, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri

From: jamal <hadi@cyberus.ca>
Date: Mon, 08 Oct 2007 09:34:50 -0400

> The brain-block i am having is the parallelization aspect of it.
> Whatever scheme it is - it needs to ensure the scheduler works as
> expected. For example, if it was a strict prio scheduler i would expect
> that whatever goes out is always high priority first and never ever
> allow a low prio packet out at any time theres something high prio
> needing to go out. If i have the two priorities running on two cpus,
> then i cant guarantee that effect.
> IOW, i see the scheduler/qdisc level as not being split across parallel
> cpus. Do i make any sense?

Picture it like N tubes you stick packets into, and the tubes are
processed using DRR.

So packets within a tube won't be reordered, but reordering amongst
tubes is definitely possible.
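
(For illustration, a hedged sketch of that DRR idea - the tube structure
and send_one() are invented names:)

	struct tube {
		struct sk_buff_head q;
		int quantum;	/* bytes credited per round */
		int deficit;	/* bytes this tube may still send */
	};

	static void drr_one_round(struct tube *tubes, int ntubes)
	{
		int i;

		for (i = 0; i < ntubes; i++) {
			struct tube *t = &tubes[i];
			struct sk_buff *skb;

			t->deficit += t->quantum;
			while ((skb = skb_peek(&t->q)) != NULL &&
			       skb->len <= t->deficit) {
				t->deficit -= skb->len;
				send_one(__skb_dequeue(&t->q));
			}
			if (skb_queue_empty(&t->q))
				t->deficit = 0;	/* no banking while idle */
		}
	}

Packets within one tube keep their order; across tubes the interleaving
depends on the quanta, which is exactly the reordering mentioned above.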

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: parallel networking
  2007-10-08 14:22                     ` parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock) Jeff Garzik
@ 2007-10-08 21:11                         ` David Miller
  2007-10-08 21:11                         ` [ofa-general] " David Miller
  1 sibling, 0 replies; 107+ messages in thread
From: David Miller @ 2007-10-08 21:11 UTC (permalink / raw)
  To: jeff
  Cc: hadi, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
	shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
	netdev, rdreier, mingo, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri, linux-kernel

From: Jeff Garzik <jeff@garzik.org>
Date: Mon, 08 Oct 2007 10:22:28 -0400

> In terms of overall parallelization, both for TX as well as RX, my gut 
> feeling is that we want to move towards an MSI-X, multi-core friendly 
> model where packets are LIKELY to be sent and received by the same set 
> of [cpus | cores | packages | nodes] that the [userland] processes 
> dealing with the data.

The problem is that the packet schedulers want global guarantees
on packet ordering, not flow centric ones.

That is the issue Jamal is concerned about.

The more I think about it, the more inevitable it seems that we really
might need multiple qdiscs, one for each TX queue, to pull this full
parallelization off.

But the semantics of that don't smell so nice either.  If the user
attaches a new qdisc to "ethN", does it go to all the TX queues, or
what?

All of the traffic shaping technology deals with the device as a unary
object.  It doesn't fit to multi-queue at all.

^ permalink raw reply	[flat|nested] 107+ messages in thread


* Re: parallel networking
  2007-10-08 21:11                         ` [ofa-general] " David Miller
  (?)
@ 2007-10-08 22:30                         ` jamal
  2007-10-08 22:33                           ` David Miller
  -1 siblings, 1 reply; 107+ messages in thread
From: jamal @ 2007-10-08 22:30 UTC (permalink / raw)
  To: David Miller
  Cc: jeff, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
	shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
	netdev, rdreier, mingo, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri, linux-kernel

On Mon, 2007-08-10 at 14:11 -0700, David Miller wrote:

> The problem is that the packet schedulers want global guarantees
> on packet ordering, not flow centric ones.
> 
> That is the issue Jamal is concerned about.

indeed, thank you for giving it better wording. 

> The more I think about it, the more inevitable it seems that we really
> might need multiple qdiscs, one for each TX queue, to pull this full
> parallelization off.
> 
> But the semantics of that don't smell so nice either.  If the user
> attaches a new qdisc to "ethN", does it go to all the TX queues, or
> what?
> 
> All of the traffic shaping technology deals with the device as a unary
> object.  It doesn't fit to multi-queue at all.

If you let only one CPU at a time access the "xmit path" you solve all
the reordering. If you want to be more fine-grained you make the
serialization point as low as possible in the stack - perhaps in the
driver.
But I think even what we have today with only one cpu entering the
dequeue/scheduler region, _for starters_, is not bad actually ;->  What
i am finding (and i can tell you i have been trying hard ;->) is that a
sufficiently fast cpu doesn't sit in the dequeue area for "too long" (and
batching reduces the time spent further). Very quickly there are no more
packets for it to dequeue from the qdisc, or the driver is stopped and it
has to get out of there. If you don't have any interrupt tied to a
specific cpu then you can have many cpus enter and leave that region all
the time.

cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: parallel networking
  2007-10-08 22:30                         ` jamal
@ 2007-10-08 22:33                           ` David Miller
  2007-10-08 22:35                               ` [ofa-general] " Waskiewicz Jr, Peter P
  2007-10-08 23:42                               ` [ofa-general] " jamal
  0 siblings, 2 replies; 107+ messages in thread
From: David Miller @ 2007-10-08 22:33 UTC (permalink / raw)
  To: hadi
  Cc: jeff, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
	shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
	netdev, rdreier, mingo, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri, linux-kernel

From: jamal <hadi@cyberus.ca>
Date: Mon, 08 Oct 2007 18:30:18 -0400

> Very quickly there are no more packets for it to dequeue from the
> qdisc or the driver is stoped and it has to get out of there. If you
> dont have any interupt tied to a specific cpu then you can have many
> cpus enter and leave that region all the time.

With the lock shuttling back and forth between those cpus, which is
what we're trying to avoid.

Multiply whatever effect you think you might be able to measure due to
that on your 2 or 4 way system, and multiple it up to 64 cpus or so
for machines I am using.  This is where machines are going, and is
going to become the norm.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* RE: parallel networking
  2007-10-08 22:33                           ` David Miller
@ 2007-10-08 22:35                               ` Waskiewicz Jr, Peter P
  2007-10-08 23:42                               ` [ofa-general] " jamal
  1 sibling, 0 replies; 107+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-10-08 22:35 UTC (permalink / raw)
  To: David Miller, hadi
  Cc: jeff, krkumar2, johnpol, herbert, kaber, shemminger, jagana,
	Robert.Olsson, rick.jones2, xma, gaagaan, netdev, rdreier, mingo,
	mchan, general, kumarkr, tgraf, randy.dunlap, sri, linux-kernel

> Multiply whatever effect you think you might be able to 
> measure due to that on your 2 or 4 way system, and multiple 
> it up to 64 cpus or so for machines I am using.  This is 
> where machines are going, and is going to become the norm.

That, along with speeds going to 10 GbE with multiple Tx/Rx queues (and
40 and 100 GbE under discussion now), is where multiple CPUs hitting the
driver are needed to push line rate without cratering the entire
machine.

-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 107+ messages in thread


* Re: parallel networking
  2007-10-08 22:33                           ` David Miller
@ 2007-10-08 23:42                               ` jamal
  2007-10-08 23:42                               ` [ofa-general] " jamal
  1 sibling, 0 replies; 107+ messages in thread
From: jamal @ 2007-10-08 23:42 UTC (permalink / raw)
  To: David Miller
  Cc: jeff, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
	shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
	netdev, rdreier, mingo, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri, linux-kernel

On Mon, 2007-08-10 at 15:33 -0700, David Miller wrote:

> Multiply whatever effect you think you might be able to measure due to
> that on your 2 or 4 way system, and multiple it up to 64 cpus or so
> for machines I am using.  This is where machines are going, and is
> going to become the norm.

Yes, i keep forgetting that ;-> I need to train my brain to remember
that.

cheers,
jamal




^ permalink raw reply	[flat|nested] 107+ messages in thread


* Re: parallel networking
  2007-10-08 21:11                         ` [ofa-general] " David Miller
@ 2007-10-09  1:53                           ` Jeff Garzik
  -1 siblings, 0 replies; 107+ messages in thread
From: Jeff Garzik @ 2007-10-09  1:53 UTC (permalink / raw)
  To: David Miller
  Cc: hadi, peter.p.waskiewicz.jr, krkumar2, johnpol, herbert, kaber,
	shemminger, jagana, Robert.Olsson, rick.jones2, xma, gaagaan,
	netdev, rdreier, mingo, mchan, general, kumarkr, tgraf,
	randy.dunlap, sri, linux-kernel

David Miller wrote:
> From: Jeff Garzik <jeff@garzik.org>
> Date: Mon, 08 Oct 2007 10:22:28 -0400
> 
>> In terms of overall parallelization, both for TX as well as RX, my gut 
>> feeling is that we want to move towards an MSI-X, multi-core friendly 
>> model where packets are LIKELY to be sent and received by the same set 
>> of [cpus | cores | packages | nodes] that the [userland] processes 
>> dealing with the data.
> 
> The problem is that the packet schedulers want global guarantees
> on packet ordering, not flow centric ones.
> 
> That is the issue Jamal is concerned about.

Oh, absolutely.

I think, fundamentally, any amount of cross-flow resource management 
done in software is an obstacle to concurrency.

That's not a value judgement, just a statement of fact.

"traffic cops" are intentional bottlenecks we add to the process, to 
enable features like priority flows, filtering, or even simple socket 
fairness guarantees.  Each of those bottlenecks serves a valid purpose, 
but at the end of the day, it's still a bottleneck.

So, improving concurrency may require turning off useful features that 
nonetheless hurt concurrency.


> The more I think about it, the more inevitable it seems that we really
> might need multiple qdiscs, one for each TX queue, to pull this full
> parallelization off.
> 
> But the semantics of that don't smell so nice either.  If the user
> attaches a new qdisc to "ethN", does it go to all the TX queues, or
> what?
> 
> All of the traffic shaping technology deals with the device as a unary
> object.  It doesn't fit to multi-queue at all.

Well the easy solutions to networking concurrency are

* use virtualization to carve up the machine into chunks

* use multiple net devices

Since new NIC hardware is actively trying to be friendly to 
multi-channel/virt scenarios, either of these is reasonably 
straightforward given the current state of the Linux net stack.  Using 
multiple net devices is especially attractive because it works very well 
with the existing packet scheduling.

Both unfortunately impose a burden on the developer and admin, to force 
their apps to distribute flows across multiple [VMs | net devs].


The third alternative is to use a single net device, with SMP-friendly 
packet scheduling.  Here you run into the problems you described "device 
as a unary object" etc. with the current infrastructure.

With multiple TX rings, consider that we are pushing the packet 
scheduling from software to hardware...  which implies
* hardware-specific packet scheduling
* some TC/shaping features not available, because hardware doesn't 
support it

	Jeff





^ permalink raw reply	[flat|nested] 107+ messages in thread


* [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-10-08 13:17                     ` jamal
@ 2007-10-09  3:09                       ` Krishna Kumar2
  2007-10-09 13:10                         ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: Krishna Kumar2 @ 2007-10-09  3:09 UTC (permalink / raw)
  To: hadi
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, Patrick McHardy, netdev,
	general, mchan, tgraf, randy.dunlap, sri, shemminger,
	David Miller, herbert

J Hadi Salim <j.hadi123@gmail.com> wrote on 10/08/2007 06:47:24 PM:

> two, there should _never_ be any requeueing, even with LLTX, in the
> previous patches where i supported it; if there is, it is a bug. This is
> because we don't send more than what the driver asked for via xmit_win.
> So if it asked for more than it can handle, that is a bug. If its
> available space changes while we are sending to it, that too is a bug.

Driver might ask for 10 and we send 10, but an LLTX driver might fail to
get the lock and return TX_LOCKED. I haven't seen your code in greater
detail, but don't you requeue in that case too?
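
(For illustration, here is roughly what the core would have to do in that
case - push the whole unsent batch back into the qdisc. This is a sketch
of the handling being discussed, not code from either patch set; note the
tail-first walk so that ->requeue(), which inserts at the head, does not
reverse the batch:)

	case NETDEV_TX_LOCKED:
		/* LLTX driver lost the race for its tx lock: requeue every
		 * unsent skb and let the softirq retry later
		 */
		while ((skb = __skb_dequeue_tail(&dev->blist)) != NULL)
			q->ops->requeue(skb, q);
		netif_schedule(dev);
		break;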

- KK

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCHES] TX batching
  2007-10-08 14:05           ` jamal
@ 2007-10-09  8:14             ` Krishna Kumar2
  2007-10-09 13:25               ` jamal
  0 siblings, 1 reply; 107+ messages in thread
From: Krishna Kumar2 @ 2007-10-09  8:14 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, gaagaan, general, herbert, jagana, jeff,
	Evgeniy Polyakov, kaber, kumarkr, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, randy.dunlap, rdreier, rick.jones2,
	Robert.Olsson, shemminger, sri, tgraf, xma

J Hadi Salim <j.hadi123@gmail.com> wrote on 10/08/2007 07:35:20 PM:

> I don't see something from Krishna's approach that i can take and reuse.
> This may be because my old approaches have evolved from the same path.
> There is a long list, but as a sample: i used to do a lot more work while
> holding the queue lock, which i have now moved post queue lock; i don't
> have any special interfaces/tricks just for batching; i provide hints
> to the core of how much the driver can take, etc etc. I have offered
> Krishna co-authorship if he makes the IPoIB driver work on my
> patches; that offer still stands if he chooses to take it.

My feeling is that since the approaches are very different, it would
be a good idea to test the two for performance. Do you mind me doing
that? Of course, others and/or you are more than welcome to do the same.

I had sent a note to you yesterday about this, please let me know
either way.

******************* Previous mail ******************

Hi Jamal,

If you don't mind, I am trying to run your approach vs mine to get some
results for comparison.

For starters, I am having issues with iperf when using your infrastructure
code with my IPoIB driver - about 100MB is sent and then everything stops
for some reason.
The changes in the IPoIB driver that I made to support batching are to
set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit and
increase it on every xmit completion. Is there anything else that is
required from the driver?

thanks,

- KK


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
  2007-10-09  3:09                       ` [ofa-general] " Krishna Kumar2
@ 2007-10-09 13:10                         ` jamal
  0 siblings, 0 replies; 107+ messages in thread
From: jamal @ 2007-10-09 13:10 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, Patrick McHardy, netdev,
	general, mchan, tgraf, randy.dunlap, sri, shemminger,
	David Miller, herbert

On Tue, 2007-09-10 at 08:39 +0530, Krishna Kumar2 wrote:

> Driver might ask for 10 and we send 10, but LLTX driver might fail to get
> lock and return TX_LOCKED. I haven't seen your code in greater detail, but
> don't you requeue in that case too?

For other drivers that are non-batching and LLTX, it is possible - at
the moment in my patch i whine that the driver is buggy. I will fix this
up so it checks for NETIF_F_BTX. Thanks for pointing out the above use
case.

cheers,
jamal

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCHES] TX batching
  2007-10-09  8:14             ` Krishna Kumar2
@ 2007-10-09 13:25               ` jamal
  0 siblings, 0 replies; 107+ messages in thread
From: jamal @ 2007-10-09 13:25 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David Miller, gaagaan, general, herbert, jagana, jeff,
	Evgeniy Polyakov, kaber, kumarkr, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, randy.dunlap, rdreier, rick.jones2,
	Robert.Olsson, shemminger, sri, tgraf, xma

On Tue, 2007-09-10 at 13:44 +0530, Krishna Kumar2 wrote:

> My feeling is that since the approaches are very different, 

My concern is that the approaches are different only for short periods of
time. For example, I do requeueing, have xmit_win, have ->end_xmit,
do batching from the core etc; if you see value in any of these concepts,
they will appear in your patches and this goes on in a loop. Perhaps what
we need is a referee and to use our energies in something more positive.

> it would be a good idea to test the two for performance. 

Which i don't mind as long as it has an analysis that goes with it.
If all you post is "here's what netperf showed", it is not useful at all.
There are also a lot of affecting variables. For example, is the
receiver a bottleneck? To make it worse, I could demonstrate to you that
if i slowed down the driver and allowed more packets to queue up on the
qdisc, batching will do well. In the past my feeling is you glossed over
such details and i am a sucker for things like that - hence the conflict.

> Do you mind me doing
> that? Ofcourse others and/or you are more than welcome to do the same.
> 
> I had sent a note to you yesterday about this, please let me know
> either way.
> 

I responded to you - but it may have been lost in the noise; here's a
copy:
http://marc.info/?l=linux-netdev&m=119185137124008&w=2

cheers,
jamal


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [ofa-general] Re: parallel networking
  2007-10-09  1:53                           ` [ofa-general] " Jeff Garzik
  (?)
@ 2007-10-09 14:59                           ` Michael Krause
  -1 siblings, 0 replies; 107+ messages in thread
From: Michael Krause @ 2007-10-09 14:59 UTC (permalink / raw)
  To: Jeff Garzik, David Miller
  Cc: randy.dunlap, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
	peter.p.waskiewicz.jr, hadi, linux-kernel, kaber, jagana,
	general, mchan, tgraf, mingo, johnpol, shemminger, netdev, sri


[-- Attachment #1.1: Type: text/plain, Size: 5833 bytes --]

At 06:53 PM 10/8/2007, Jeff Garzik wrote:
>David Miller wrote:
>>From: Jeff Garzik <jeff@garzik.org>
>>Date: Mon, 08 Oct 2007 10:22:28 -0400
>>
>>>In terms of overall parallelization, both for TX as well as RX, my gut 
>>>feeling is that we want to move towards an MSI-X, multi-core friendly 
>>>model where packets are LIKELY to be sent and received by the same set 
>>>of [cpus | cores | packages | nodes] that the [userland] processes 
>>>dealing with the data.
>>The problem is that the packet schedulers want global guarantees
>>on packet ordering, not flow centric ones.
>>That is the issue Jamal is concerned about.
>
>Oh, absolutely.
>
>I think, fundamentally, any amount of cross-flow resource management done 
>in software is an obstacle to concurrency.
>
>That's not a value judgement, just a statement of fact.

Correct.


>"traffic cops" are intentional bottlenecks we add to the process, to 
>enable features like priority flows, filtering, or even simple socket 
>fairness guarantees.  Each of those bottlenecks serves a valid purpose, 
>but at the end of the day, it's still a bottleneck.
>
>So, improving concurrency may require turning off useful features that 
>nonetheless hurt concurrency.

Software needs to get out of the main data path - another fact of life.



>>The more I think about it, the more inevitable it seems that we really
>>might need multiple qdiscs, one for each TX queue, to pull this full
>>parallelization off.
>>But the semantics of that don't smell so nice either.  If the user
>>attaches a new qdisc to "ethN", does it go to all the TX queues, or
>>what?
>>All of the traffic shaping technology deals with the device as a unary
>>object.  It doesn't fit to multi-queue at all.
>
>Well the easy solutions to networking concurrency are
>
>* use virtualization to carve up the machine into chunks
>
>* use multiple net devices
>
>Since new NIC hardware is actively trying to be friendly to 
>multi-channel/virt scenarios, either of these is reasonably 
>straightforward given the current state of the Linux net stack.  Using 
>multiple net devices is especially attractive because it works very well 
>with the existing packet scheduling.
>
>Both unfortunately impose a burden on the developer and admin, to force 
>their apps to distribute flows across multiple [VMs | net devs].

Not the most optimal approach.

>The third alternative is to use a single net device, with SMP-friendly 
>packet scheduling.  Here you run into the problems you described "device 
>as a unary object" etc. with the current infrastructure.
>
>With multiple TX rings, consider that we are pushing the packet scheduling 
>from software to hardware...  which implies
>* hardware-specific packet scheduling
>* some TC/shaping features not available, because hardware doesn't support it

For a number of years now, we have designed interconnects to support a
reasonable range of arbitration capabilities among hardware resource
sets.  With reasonable classification by software to identify hardware
resource sets (ideally an interpretation of the application's view of its
priority combined with policy management software that determines how that
should map among competing application views), one can eliminate most of
the CPU cycles spent in today's implementations.   I and others presented
a number of these concepts many years ago during the development which
eventually led to IB and iWARP.

- Each resource set can be assigned to a unique PCIe function or function 
group to enable function / group arbitration for the PCIe link.

- Each resource set can be assigned to a unique PCIe TC and, with improved 
ordering hints (coming soon), can be used to eliminate false ordering 
dependencies.

- Each resource set can be assigned a unique IB TC / SL or iWARP 802.1p 
value to signal priority.  These can then be used to program the respective 
link arbitration as well as path selection to enable multi-path load balancing.

- Many IHVs have picked up on the arbitration capabilities and extended them, 
as a number of us showed years ago, to enable resource-set arbitration and 
a variety of QoS-based policies.  If software defines a reasonable (i.e. 
small) number of management and control knobs, then these can be easily 
mapped to most h/w implementations (a rough sketch of such knobs follows 
below).  Some of us are working on how to do this for virtualized 
environments and I expect the results to be applicable to all environments 
in the end.
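
To make the "small number of knobs" idea concrete, below is a minimal
sketch of per-resource-set parameters that software might expose and map
onto hardware arbitration; the type, field and enum names are illustrative
assumptions, not an existing kernel or IB/iWARP API:

enum hw_sched_policy {
	HW_SCHED_STRICT_PRIO,	/* strict priority between resource sets */
	HW_SCHED_WRR,		/* weighted round-robin between resource sets */
};

struct hw_resource_set_qos {
	unsigned int		set_id;		/* HW ring / queue-pair group */
	unsigned int		pcie_tc;	/* PCIe traffic class for this set */
	unsigned int		fabric_prio;	/* IB SL or 802.1p value */
	enum hw_sched_policy	policy;		/* arbitration policy for this set */
	unsigned int		weight;		/* share used when policy is WRR */
	unsigned int		max_rate_mbps;	/* rate cap, 0 = unlimited */
};

A policy manager would fill in one such record per resource set and hand it
to the driver, which then programs the corresponding PCIe / fabric
arbitration registers.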

One other key item to keep in mind: unless there is contention in the 
system, most QoS mechanisms are meaningless, and in a very large percentage 
of customer environments they simply don't scale with device and 
interconnect performance.  Many applications in fact remain processor / 
memory constrained and therefore do not stress the I/O subsystem or the 
external interconnects, which makes most of the software mechanisms rather 
moot in real customer environments.

The simple truth is that it is nearly always cheaper to over-provision the 
I/O and interconnects than to rely on the software approach, which, while 
quite applicable in many environments at 1 Gbps and below, generally has 
less value as speeds move from 10 to 40 to 100 Gbps.  It does not really 
matter whether one believes in protocol off-load or protocol on-load; the 
interconnects will be able to handle all commercial workloads and perhaps 
all but the most extreme HPC (and even there one might contend that any 
software intermediary would be discarded in favor of reducing OS / kernel 
overhead in the main data path).

This isn't to say that software has no role to play, only that its role 
needs to shift from main-data-path overhead to policy shaping and 
programming of h/w-based arbitration.  This will hold true for both 
virtualized and non-virtualized environments.

Mike 

[-- Attachment #1.2: Type: text/html, Size: 6660 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 107+ messages in thread

* [ofa-general] Re: [PATCH 10/10 REV5] [E1000] Implement batching
  2007-09-14  9:04 ` [PATCH 10/10 REV5] [E1000] " Krishna Kumar
  2007-09-14 12:47   ` [ofa-general] " Evgeniy Polyakov
@ 2007-11-13 21:28   ` Kok, Auke
  2007-11-14  8:30     ` Krishna Kumar2
  1 sibling, 1 reply; 107+ messages in thread
From: Kok, Auke @ 2007-11-13 21:28 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: randy.dunlap, Robert.Olsson, gaagaan, kumarkr,
	peter.p.waskiewicz.jr, shemminger, johnpol, herbert, jeff,
	rdreier, mcarlson, general, sri, jagana, hadi, mchan, netdev,
	davem, tgraf, kaber

Krishna Kumar wrote:
> E1000: Implement batching capability (ported thanks to changes taken from
> 	Jamal).
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>


This doesn't apply anymore; it would help if you could re-spin it against
e1000e, which I assume is what most people want to test with. I don't know
what the status of merging the batched xmit patches is right now, but there
are also significant changes upstream right now in jgarzik/netdev-2.6
#upstream...

I'm still very interested in these patches BTW.

Auke



> ---
>  e1000_main.c |  104 ++++++++++++++++++++++++++++++++++++++++++-----------------
>  1 files changed, 75 insertions(+), 29 deletions(-)
> 
> diff -ruNp org/drivers/net/e1000/e1000_main.c new/drivers/net/e1000/e1000_main.c
> --- org/drivers/net/e1000/e1000_main.c	2007-09-14 10:30:57.000000000 +0530
> +++ new/drivers/net/e1000/e1000_main.c	2007-09-14 10:31:02.000000000 +0530
> @@ -990,7 +990,7 @@ e1000_probe(struct pci_dev *pdev,
>  	if (pci_using_dac)
>  		netdev->features |= NETIF_F_HIGHDMA;
>  
> -	netdev->features |= NETIF_F_LLTX;
> +	netdev->features |= NETIF_F_LLTX | NETIF_F_BATCH_SKBS;
>  
>  	adapter->en_mng_pt = e1000_enable_mng_pass_thru(&adapter->hw);
>  
> @@ -3092,6 +3092,17 @@ e1000_tx_map(struct e1000_adapter *adapt
>  	return count;
>  }
>  
> +static void e1000_kick_DMA(struct e1000_adapter *adapter,
> +			   struct e1000_tx_ring *tx_ring, int i)
> +{
> +	wmb();
> +
> +	writel(i, adapter->hw.hw_addr + tx_ring->tdt);
> +	/* we need this if more than one processor can write to our tail
> +	 * at a time, it syncronizes IO on IA64/Altix systems */
> +	mmiowb();
> +}
> +
>  static void
>  e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring,
>                 int tx_flags, int count)
> @@ -3138,13 +3149,7 @@ e1000_tx_queue(struct e1000_adapter *ada
>  	 * know there are new descriptors to fetch.  (Only
>  	 * applicable for weak-ordered memory model archs,
>  	 * such as IA-64). */
> -	wmb();
> -
>  	tx_ring->next_to_use = i;
> -	writel(i, adapter->hw.hw_addr + tx_ring->tdt);
> -	/* we need this if more than one processor can write to our tail
> -	 * at a time, it syncronizes IO on IA64/Altix systems */
> -	mmiowb();
>  }
>  
>  /**
> @@ -3251,22 +3256,23 @@ static int e1000_maybe_stop_tx(struct ne
>  }
>  
>  #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
> +
> +#define NETDEV_TX_DROPPED	-5
> +
>  static int
> -e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
> +e1000_prep_queue_frame(struct sk_buff *skb, struct net_device *netdev)
>  {
>  	struct e1000_adapter *adapter = netdev_priv(netdev);
>  	struct e1000_tx_ring *tx_ring;
>  	unsigned int first, max_per_txd = E1000_MAX_DATA_PER_TXD;
>  	unsigned int max_txd_pwr = E1000_MAX_TXD_PWR;
>  	unsigned int tx_flags = 0;
> -	unsigned int len = skb->len;
> -	unsigned long flags;
> -	unsigned int nr_frags = 0;
> -	unsigned int mss = 0;
> +	unsigned int len = skb->len - skb->data_len;
> +	unsigned int nr_frags;
> +	unsigned int mss;
>  	int count = 0;
>  	int tso;
>  	unsigned int f;
> -	len -= skb->data_len;
>  
>  	/* This goes back to the question of how to logically map a tx queue
>  	 * to a flow.  Right now, performance is impacted slightly negatively
> @@ -3276,7 +3282,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
>  
>  	if (unlikely(skb->len <= 0)) {
>  		dev_kfree_skb_any(skb);
> -		return NETDEV_TX_OK;
> +		return NETDEV_TX_DROPPED;
>  	}
>  
>  	/* 82571 and newer doesn't need the workaround that limited descriptor
> @@ -3322,7 +3328,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
>  					DPRINTK(DRV, ERR,
>  						"__pskb_pull_tail failed.\n");
>  					dev_kfree_skb_any(skb);
> -					return NETDEV_TX_OK;
> +					return NETDEV_TX_DROPPED;
>  				}
>  				len = skb->len - skb->data_len;
>  				break;
> @@ -3366,22 +3372,15 @@ e1000_xmit_frame(struct sk_buff *skb, st
>  	    (adapter->hw.mac_type == e1000_82573))
>  		e1000_transfer_dhcp_info(adapter, skb);
>  
> -	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags))
> -		/* Collision - tell upper layer to requeue */
> -		return NETDEV_TX_LOCKED;
> -
>  	/* need: count + 2 desc gap to keep tail from touching
>  	 * head, otherwise try next time */
> -	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) {
> -		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
> +	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2)))
>  		return NETDEV_TX_BUSY;
> -	}
>  
>  	if (unlikely(adapter->hw.mac_type == e1000_82547)) {
>  		if (unlikely(e1000_82547_fifo_workaround(adapter, skb))) {
>  			netif_stop_queue(netdev);
>  			mod_timer(&adapter->tx_fifo_stall_timer, jiffies + 1);
> -			spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
>  			return NETDEV_TX_BUSY;
>  		}
>  	}
> @@ -3396,8 +3395,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
>  	tso = e1000_tso(adapter, tx_ring, skb);
>  	if (tso < 0) {
>  		dev_kfree_skb_any(skb);
> -		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
> -		return NETDEV_TX_OK;
> +		return NETDEV_TX_DROPPED;
>  	}
>  
>  	if (likely(tso)) {
> @@ -3416,13 +3414,61 @@ e1000_xmit_frame(struct sk_buff *skb, st
>  	               e1000_tx_map(adapter, tx_ring, skb, first,
>  	                            max_per_txd, nr_frags, mss));
>  
> -	netdev->trans_start = jiffies;
> +	return NETDEV_TX_OK;
> +}
> +
> +static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
> +{
> +	struct e1000_adapter *adapter = netdev_priv(netdev);
> +	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
> +	struct sk_buff_head *blist;
> +	int ret, skbs_done = 0;
> +	unsigned long flags;
> +
> +	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
> +		/* Collision - tell upper layer to requeue */
> +		return NETDEV_TX_LOCKED;
> +	}
>  
> -	/* Make sure there is space in the ring for the next send. */
> -	e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
> +	blist = netdev->skb_blist;
> +
> +	if (!skb || (blist && skb_queue_len(blist))) {
> +		/*
> +		 * Either batching xmit call, or single skb case but there are
> +		 * skbs already in the batch list from previous failure to
> +		 * xmit - send the earlier skbs first to avoid out of order.
> +		 */
> +		if (skb)
> +			__skb_queue_tail(blist, skb);
> +		skb = __skb_dequeue(blist);
> +	} else {
> +		blist = NULL;
> +	}
> +
> +	do {
> +		ret = e1000_prep_queue_frame(skb, netdev);
> +		if (likely(ret == NETDEV_TX_OK))
> +			skbs_done++;
> +		else {
> +			if (ret == NETDEV_TX_BUSY) {
> +				if (blist)
> +					__skb_queue_head(blist, skb);
> +				break;
> +			}
> +			/* skb dropped, not a TX error */
> +			ret = NETDEV_TX_OK;
> +		}
> +	} while (blist && (skb = __skb_dequeue(blist)) != NULL);
> +
> +	if (skbs_done) {
> +		e1000_kick_DMA(adapter, tx_ring, adapter->tx_ring->next_to_use);
> +		netdev->trans_start = jiffies;
> +		/* Make sure there is space in the ring for the next send. */
> +		e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
> +	}
>  
>  	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
> -	return NETDEV_TX_OK;
> +	return ret;
>  }
>  
>  /**
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/10 REV5] [E1000] Implement batching
  2007-11-13 21:28   ` [ofa-general] " Kok, Auke
@ 2007-11-14  8:30     ` Krishna Kumar2
  0 siblings, 0 replies; 107+ messages in thread
From: Krishna Kumar2 @ 2007-11-14  8:30 UTC (permalink / raw)
  To: Kok, Auke
  Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, johnpol,
	kaber, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
	randy.dunlap, rdreier, rick.jones2, Robert.Olsson, shemminger,
	sri, tgraf, xma

Hi Auke,

"Kok, Auke" <auke-jan.h.kok@intel.com> wrote on 11/14/2007 02:58:14 AM:

> This doesn't apply anymore; it would help if you could re-spin it against
> e1000e, which I assume is what most people want to test with. I don't know
> what the status of merging the batched xmit patches is right now, but there
> are also significant changes upstream right now in jgarzik/netdev-2.6
> #upstream...
>
> I'm still very interested in these patches BTW.

I will put together an updated version, test it to get some numbers, and
try to send it out this week.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 107+ messages in thread

end of thread, other threads:[~2007-11-14  8:28 UTC | newest]

Thread overview: 107+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-09-14  9:00 [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar
2007-09-14  9:01 ` [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching Krishna Kumar
2007-09-14 18:37   ` [ofa-general] " Randy Dunlap
2007-09-17  4:10     ` Krishna Kumar2
2007-09-17  4:13       ` [ofa-general] " Jeff Garzik
2007-09-14  9:01 ` [PATCH 2/10 REV5] [core] Add skb_blist & support " Krishna Kumar
2007-09-14 12:46   ` [ofa-general] " Evgeniy Polyakov
2007-09-17  3:51     ` Krishna Kumar2
2007-09-14  9:01 ` [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching Krishna Kumar
2007-09-14 12:15   ` [ofa-general] " Evgeniy Polyakov
2007-09-17  3:49     ` Krishna Kumar2
2007-09-14  9:02 ` [PATCH 4/10 REV5] [ethtool] Add ethtool support Krishna Kumar
2007-09-14  9:02 ` [PATCH 5/10 REV5] [IPoIB] Header file changes Krishna Kumar
2007-09-14  9:03 ` [PATCH 6/10 REV5] [IPoIB] CM & Multicast changes Krishna Kumar
2007-09-14  9:03 ` [PATCH 7/10 REV5] [IPoIB] Verbs changes Krishna Kumar
2007-09-14  9:03 ` [PATCH 8/10 REV5] [IPoIB] Post and work completion handler changes Krishna Kumar
2007-09-14  9:04 ` [PATCH 9/10 REV5] [IPoIB] Implement batching Krishna Kumar
2007-09-14  9:04 ` [PATCH 10/10 REV5] [E1000] " Krishna Kumar
2007-09-14 12:47   ` [ofa-general] " Evgeniy Polyakov
2007-09-17  3:56     ` Krishna Kumar2
2007-11-13 21:28   ` [ofa-general] " Kok, Auke
2007-11-14  8:30     ` Krishna Kumar2
2007-09-14 12:49 ` [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Evgeniy Polyakov
2007-09-16 23:17 ` David Miller
2007-09-17  0:29   ` jamal
2007-09-17  1:02     ` David Miller
2007-09-17  2:14       ` [ofa-general] " jamal
2007-09-17  2:25         ` David Miller
2007-09-17  3:01           ` jamal
2007-09-17  3:13             ` David Miller
2007-09-17 12:51               ` jamal
2007-09-17 16:37                 ` [ofa-general] " David Miller
2007-09-17  4:46           ` Krishna Kumar2
2007-09-23 17:53     ` [PATCHES] TX batching jamal
2007-09-23 17:56       ` [ofa-general] [PATCH 1/4] [NET_SCHED] explict hold dev tx lock jamal
2007-09-23 17:58         ` [ofa-general] [PATCH 2/4] [NET_BATCH] Introduce batching interface jamal
2007-09-23 18:00           ` [PATCH 3/4][NET_BATCH] net core use batching jamal
2007-09-23 18:02             ` [ofa-general] [PATCH 4/4][NET_SCHED] kill dev->gso_skb jamal
2007-09-30 18:53               ` [ofa-general] [PATCH 3/3][NET_SCHED] " jamal
2007-10-07 18:39               ` [ofa-general] [PATCH 3/3][NET_BATCH] " jamal
2007-09-30 18:52             ` [ofa-general] [PATCH 2/3][NET_BATCH] net core use batching jamal
2007-10-01  4:11               ` Bill Fink
2007-10-01 13:30                 ` jamal
2007-10-02  4:25                   ` [ofa-general] " Bill Fink
2007-10-02 13:20                     ` jamal
2007-10-03  5:29                       ` [ofa-general] " Bill Fink
2007-10-03 13:42                         ` jamal
2007-10-01 10:42               ` [ofa-general] " Patrick McHardy
2007-10-01 13:21                 ` jamal
2007-10-08  5:03                   ` Krishna Kumar2
2007-10-08 13:17                     ` jamal
2007-10-09  3:09                       ` [ofa-general] " Krishna Kumar2
2007-10-09 13:10                         ` jamal
2007-10-07 18:38             ` [ofa-general] " jamal
2007-09-30 18:51           ` [ofa-general] [PATCH 1/4] [NET_BATCH] Introduce batching interface jamal
2007-09-30 18:54             ` [ofa-general] Re: [PATCH 1/3] " jamal
2007-10-07 18:36           ` [ofa-general] " jamal
2007-10-08  9:59             ` Krishna Kumar2
2007-10-08 13:49               ` jamal
2007-09-24 19:12         ` [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock Waskiewicz Jr, Peter P
2007-09-24 22:51           ` jamal
2007-09-24 22:57             ` Waskiewicz Jr, Peter P
2007-09-24 23:38               ` [ofa-general] " jamal
2007-09-24 23:47                 ` Waskiewicz Jr, Peter P
2007-09-25  0:14                   ` [ofa-general] " Stephen Hemminger
2007-09-25  0:31                     ` [ofa-general] " Waskiewicz Jr, Peter P
2007-09-25 13:15                     ` [ofa-general] " jamal
2007-09-25 15:24                       ` Stephen Hemminger
2007-09-25 22:14                         ` jamal
2007-09-25 22:43                           ` jamal
2007-09-25 13:08                   ` [ofa-general] " jamal
2007-10-08  4:51                 ` [ofa-general] " David Miller
2007-10-08 13:34                   ` jamal
2007-10-08 14:22                     ` parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock) Jeff Garzik
2007-10-08 15:18                       ` jamal
2007-10-08 15:18                         ` [ofa-general] " jamal
2007-10-08 21:11                       ` parallel networking David Miller
2007-10-08 21:11                         ` [ofa-general] " David Miller
2007-10-08 22:30                         ` jamal
2007-10-08 22:33                           ` David Miller
2007-10-08 22:35                             ` Waskiewicz Jr, Peter P
2007-10-08 22:35                               ` [ofa-general] " Waskiewicz Jr, Peter P
2007-10-08 23:42                             ` jamal
2007-10-08 23:42                               ` [ofa-general] " jamal
2007-10-09  1:53                         ` Jeff Garzik
2007-10-09  1:53                           ` [ofa-general] " Jeff Garzik
2007-10-09 14:59                           ` Michael Krause
2007-10-08 21:05                     ` [PATCH 1/4] [NET_SCHED] explict hold dev tx lock David Miller
2007-09-23 18:19       ` [PATCHES] TX batching Jeff Garzik
2007-09-23 19:11         ` [ofa-general] " jamal
2007-09-23 19:36           ` Kok, Auke
2007-09-23 21:20             ` jamal
2007-09-24  7:00               ` Kok, Auke
2007-09-24 22:38                 ` jamal
2007-09-24 22:52                   ` [ofa-general] " Kok, Auke
2007-09-24 22:54           ` [DOC] Net batching driver howto jamal
2007-09-25 20:16             ` [ofa-general] " Randy Dunlap
2007-09-25 22:28               ` jamal
2007-09-25  0:15           ` [PATCHES] TX batching Jeff Garzik
2007-09-30 18:50       ` [ofa-general] " jamal
2007-09-30 19:19         ` [ofa-general] " jamal
2007-10-07 18:34       ` [ofa-general] " jamal
2007-10-08 12:51         ` [ofa-general] " Evgeniy Polyakov
2007-10-08 14:05           ` jamal
2007-10-09  8:14             ` Krishna Kumar2
2007-10-09 13:25               ` jamal
2007-09-17  4:08   ` [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 Krishna Kumar2
