* [ofa-general] [PATCH 00/10] Implement batching skb API
@ 2007-07-20  6:31 Krishna Kumar
  2007-07-20  6:32 ` [ofa-general] [PATCH 01/10] HOWTO documentation for Batching SKB Krishna Kumar
                   ` (12 more replies)
  0 siblings, 13 replies; 55+ messages in thread
From: Krishna Kumar @ 2007-07-20  6:31 UTC (permalink / raw)
  To: davem, rdreier
  Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr, jagana,
	peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan,
	tgraf, sri, hadi, netdev

Hi Dave, Roland, everyone,

In May, I proposed an API for sending 'n' skbs to a driver in a single
call, to reduce lock overhead and DMA operations and, for drivers that
have completion notification like IPoIB, to reduce completion handling
("[RFC] New driver API to speed up small packets xmits" @
http://marc.info/?l=linux-netdev&m=117880900818960&w=2). I had also sent
initial test results for E1000, which showed minor improvements (but also
some degradations) @ http://marc.info/?l=linux-netdev&m=117887698405795&w=2.

After fine-tuning the qdisc and other changes, I modified IPoIB to use
this API, and now get good gains. Summary for TCP with No Delay: the
1-process case improves in all runs, from 1.4% to 49.5%; the 4-process
case improves similarly, from -1.7% to 59.1%; the 16-process case also
improves, in the range of -1.2% to 33.4%; the 64-process case doesn't
show much improvement (-3.3% to 12.4%). UDP was tested with a 1-process
netperf, showing a small increase in BW but a big improvement in Service
Demand. Netperf latency tests show a small drop in transaction rate
(results in a separate attachment).

To verify that performance does not degrade with batching turned off (as
is the case for all existing drivers), I ran tests with tx_batch_skbs=0
against the original code and saw no real degradation. I also enabled all
kernel debug options to catch panics, warnings, use-after-free bugs, etc.,
and simulated driver errors to get coverage of the core and IPoIB error
paths. Testing was done on 2-CPU xSeries systems and 8-CPU PPC64 Power5
systems using IPoIB over mthca, and on E1000 (using the driver that Jamal
had converted, which showed no improvement). On i386, the size of the
kernel (drivers are modules) increased by:
	text: 0.007% data: 0.007% bss: 0% total: 0.03%.

There is a parallel work-in-progress by Jamal, but the two implementations
are completely different since the code bases were separate from the start.
Key changes and differences:
	- Use a single qdisc interface to avoid code duplication and improve
	  maintainability (sch_generic.c size reduces by ~9%).
	- A per-device configurable parameter to turn batching on/off.
	- qdisc_restart is only slightly modified and stays simple, with no
	  checks for batching vs regular code (in fact only two lines have
	  changed: 1. a new batch-aware function is called instead of
	  dev_dequeue_skb; and 2. an extra call to hard_start_xmit_batch is
	  made; see the sketch below).
	- The batching algorithm/processing is different (e.g. if
	  qdisc_restart() finds one skb in the batch list, it tries to batch
	  more, up to a limit, instead of sending that one out and batching
	  the rest in the next call).
	- No change in __qdisc_run other than a new argument (from DM's idea).
	- My patches apply to the latest net-2.6.23 tree, as compared to the
	  2.6.22-rc4 code base.
	- Jamal's code has a separate hw prep handler called from the stack,
	  and its results are accessed in the driver during the later xmit.
	- Jamal's code has dev->xmit_win, which is cached by the driver. Mine
	  has dev->xmit_slots, but this is used only by the driver, while the
	  core has a different mechanism to find how many skbs to batch.
	- Completely different structure/design and coding styles.
(This patch set will work with the drivers updated by Jamal, Matt & Michael
Chan with minor modifications: rename xmit_win to xmit_slots and rename the
batch handler.)
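
For illustration, a minimal sketch of the qdisc_restart() change mentioned
above (condensed from patch 05/10; the locking and requeue handling of the
real function are elided):

	static inline int qdisc_restart(struct net_device *dev,
					struct sk_buff_head *blist)
	{
		struct sk_buff *skb;
		int ret;

		/* Change 1: batch-aware dequeue replaces dev_dequeue_skb() */
		if (unlikely(!get_skb(dev, dev->qdisc, blist, &skb)))
			return 0;

		/*
		 * Change 2: if get_skb() filled dev->skb_blist instead of
		 * returning a single skb, call the batch handler.
		 */
		if (likely(skb))
			ret = dev_hard_start_xmit(skb, dev);
		else
			ret = dev->hard_start_xmit_batch(dev);

		return ret;
	}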

Patches are described as:
	Mail 0/10  : This mail.
	Mail 1/10  : HOWTO documentation.
	Mail 2/10  : Networking include file changes.
	Mail 3/10  : dev.c changes.
	Mail 4/10  : net-sysfs.c changes.
	Mail 5/10  : sch_generic.c changes.
	Mail 6/10  : IPoIB include file changes.
	Mail 7/10  : IPoIB verbs changes.
	Mail 8/10  : IPoIB multicast, CM changes.
	Mail 9/10  : IPoIB xmit API addition.
	Mail 10/10 : IPoIB xmit internals changes (ipoib_ib.c)

I am also sending separately an attachment with the results (across a
10-run cycle), the test scripts, and a script to analyze the results.

Thanks to Sridhar & Shirley Ma for code reviews; to Evgeniy, Jamal &
Sridhar for suggesting that the driver skb list be put on the netdev
instead of on the skb to avoid requeues; and to David Miller for the
explanation on using batching only when the queue is woken up.

Please review and provide feedback/ideas; and consider for inclusion.

Thanks,

- KK


* [ofa-general] [PATCH 01/10] HOWTO documentation for Batching SKB.
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
@ 2007-07-20  6:32 ` Krishna Kumar
  2007-07-20  6:32 ` [PATCH 02/10] Networking include file changes Krishna Kumar
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Krishna Kumar @ 2007-07-20  6:32 UTC (permalink / raw)
  To: davem, rdreier
  Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, mcarlson, netdev, jagana, general, mchan, tgraf, jeff,
	hadi, kaber, sri

Add HOWTO documentation on what batching is, how drivers can be
implemented to use it, and how users can enable/disable batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 Batching_skb_API.txt |   91 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 91 insertions(+)

diff -ruNp org/Documentation/networking/Batching_skb_API.txt new/Documentation/networking/Batching_skb_API.txt
--- org/Documentation/networking/Batching_skb_API.txt	1970-01-01 05:30:00.000000000 +0530
+++ new/Documentation/networking/Batching_skb_API.txt	2007-07-20 08:30:22.000000000 +0530
@@ -0,0 +1,91 @@
+		 HOWTO for batching skb API support
+		 -----------------------------------
+
+Section 1: What is the batching skb API?
+Section 2: How does the batching API work vs the original API?
+Section 3: How can drivers support this API?
+Section 4: How can users work with this API?
+
+
+Introduction: Kernel support for batching skb
+-----------------------------------------------
+
+The netdevice layer supports an extended API which is very similar to the
+existing hard_start_xmit() API. Drivers that wish to take advantage of
+this new API should implement the new routine much as they implement the
+hard_start_xmit handler. The difference between the two APIs is that while
+the existing hard_start_xmit processes one skb, the new API can process
+multiple skbs (or even one) in a single call. The driver writer can also
+reuse most of the code from the existing API in the new one without
+duplicating code.
+
+
+Section 1: What is the batching skb API?
+-------------------------------------
+
+	This is a new API that a driver can optionally export. The
+	prerequisite for a driver to use this API is that it should have
+	a reasonably sized hardware queue that can process multiple skbs.
+
+
+Section 2: How does the batching API work vs the original API?
+-------------------------------------------------------
+
+	The networking stack is normally called from upper layer protocols
+	with a single skb to xmit. This skb is first enqueued and an
+	attempt is then made to transmit it immediately (via qdisc_run).
+	However, events like driver lock contention or a stopped queue can
+	result in the skb not getting sent out, so it remains in the
+	queue. When a new xmit is called or the queue is re-enabled,
+	qdisc_run could find multiple packets in the queue and has to
+	send them all out one by one, iteratively.
+
+	The batching skb API was added to exploit this situation: if there
+	are multiple skbs, all of them can be sent to the device in one
+	shot. This reduces driver processing; locking at the driver (or
+	in the stack for non-LLTX drivers) gets amortized over multiple
+	skbs; and, for drivers where every xmit results in completion
+	processing (like IPoIB), the driver can be optimized to request a
+	completion only for the last skb that was sent, which saves an
+	interrupt for every skb sent in the same batch except the last
+	one.
+
+	This batching can result in significant performance gains for
+	systems that have multiple data stream paths over the same network
+	interface card.
+
+
+Section 3: How can drivers support this API?
+---------------------------------------------
+
+	The new API, dev->hard_start_xmit_batch(struct net_device *dev),
+	can be written almost identically to the regular xmit API
+	(hard_start_xmit), except that the driver should process all skbs
+	on dev->skb_blist instead of just one skb. The new API does not
+	get an skb as an argument; instead, it picks up all the skbs from
+	dev->skb_blist, where the stack added them, and tries to send
+	them out.
+
+	Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in
+	dev->features and to point dev->hard_start_xmit_batch at the new
+	routine implemented for that driver.
+
+
+Section 4: How can users work with this API?
+---------------------------------------------
+
+	Batching can be disabled for a particular device, e.g. on desktop
+	systems where only one stream of network activity takes place on
+	that device, since the extra processing that batching adds could
+	slightly affect performance. Batching can be enabled when more
+	than one stream of network activity per device is expected, e.g.
+	on servers, or even for desktop usage with multiple browser, chat
+	and file transfer sessions running.
+
+	Per device batching can be enabled/disabled using:
+
+	echo 1 > /sys/class/net/<device-name>/tx_batch_skbs (enable)
+	echo 0 > /sys/class/net/<device-name>/tx_batch_skbs (disable)
+
+	E.g. to enable batching on eth0, run:
+		echo 1 > /sys/class/net/eth0/tx_batch_skbs
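
As an illustration of Section 3 above, a hypothetical driver might hook up
and implement the batch handler as sketched below. This is only a sketch:
foo_setup() and foo_xmit_one() are assumed stand-ins for the driver's real
setup routine and per-skb transmit path, and error handling is minimal.

	static int foo_hard_start_xmit_batch(struct net_device *dev)
	{
		struct sk_buff *skb;

		/* Drain the skbs the stack queued on dev->skb_blist */
		while ((skb = __skb_dequeue(dev->skb_blist)) != NULL) {
			if (foo_xmit_one(skb, dev)) {
				/* Out of descriptors: requeue and stop */
				__skb_queue_head(dev->skb_blist, skb);
				netif_stop_queue(dev);
				return NETDEV_TX_BUSY;
			}
		}
		return NETDEV_TX_OK;
	}

	static void foo_setup(struct net_device *dev)
	{
		dev->hard_start_xmit       = foo_hard_start_xmit;
		dev->hard_start_xmit_batch = foo_hard_start_xmit_batch;
		dev->features             |= NETIF_F_BATCH_SKBS;
	}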


* [PATCH 02/10] Networking include file changes.
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
  2007-07-20  6:32 ` [ofa-general] [PATCH 01/10] HOWTO documentation for Batching SKB Krishna Kumar
@ 2007-07-20  6:32 ` Krishna Kumar
  2007-07-20  9:59   ` Patrick McHardy
  2007-07-20 17:25   ` [ofa-general] " Sridhar Samudrala
  2007-07-20  6:32 ` [ofa-general] [PATCH 03/10] dev.c changes Krishna Kumar
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: Krishna Kumar @ 2007-07-20  6:32 UTC (permalink / raw)
  To: davem, rdreier
  Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, xma, rick.jones2, mcarlson, netdev, jagana, general,
	mchan, tgraf, jeff, hadi, kaber, Krishna Kumar, sri

Networking include file changes for batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 linux/netdevice.h |   10 ++++++++++
 net/pkt_sched.h   |    6 +++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h	2007-07-20 07:49:28.000000000 +0530
+++ new/include/linux/netdevice.h	2007-07-20 08:30:55.000000000 +0530
@@ -264,6 +264,8 @@ enum netdev_state_t
 	__LINK_STATE_QDISC_RUNNING,
 };
 
+/* Minimum length of device hardware queue for batching to work */
+#define MIN_QUEUE_LEN_BATCH	16
 
 /*
  * This structure holds at boot time configured netdevice settings. They
@@ -340,6 +342,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_BATCH_SKBS	8192	/* Driver supports batch skbs API */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
@@ -452,6 +455,8 @@ struct net_device
 	struct Qdisc		*qdisc_sleeping;
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
+	unsigned long		xmit_slots;	/* Device free slots */
+	struct sk_buff_head	*skb_blist;	/* List of batch skbs */
 
 	/* Partially transmitted GSO packet. */
 	struct sk_buff		*gso_skb;
@@ -472,6 +477,9 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	int			(*hard_start_xmit_batch) (struct net_device
+							  *dev);
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -832,6 +840,8 @@ extern int		dev_set_mac_address(struct n
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_add_skb_to_blist(struct sk_buff *skb,
+					     struct net_device *dev);
 
 extern void		dev_init(void);
 
diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
--- org/include/net/pkt_sched.h	2007-07-20 07:49:28.000000000 +0530
+++ new/include/net/pkt_sched.h	2007-07-20 08:30:22.000000000 +0530
@@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
 		struct rtattr *tab);
 extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
 
-extern void __qdisc_run(struct net_device *dev);
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
 
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	if (!netif_queue_stopped(dev) &&
 	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
-		__qdisc_run(dev);
+		__qdisc_run(dev, blist);
 }
 
 extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
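
For context, the two call sites of the changed qdisc_run() (both in patch
03/10) show how the new argument is used; as a sketch:

	/* dev_queue_xmit(): regular xmit path, no batching */
	qdisc_run(dev, NULL);

	/* net_tx_action(): softirq path, batches via dev->skb_blist */
	qdisc_run(dev, dev->skb_blist);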


* [ofa-general] [PATCH 03/10] dev.c changes.
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
  2007-07-20  6:32 ` [ofa-general] [PATCH 01/10] HOWTO documentation for Batching SKB Krishna Kumar
  2007-07-20  6:32 ` [PATCH 02/10] Networking include file changes Krishna Kumar
@ 2007-07-20  6:32 ` Krishna Kumar
  2007-07-20 10:04   ` [ofa-general] " Patrick McHardy
  2007-07-20 17:44   ` Sridhar Samudrala
  2007-07-20  6:32 ` [PATCH 04/10] net-sysfs.c changes Krishna Kumar
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: Krishna Kumar @ 2007-07-20  6:32 UTC (permalink / raw)
  To: davem, rdreier
  Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, mcarlson, jagana, general, netdev, tgraf, jeff, hadi,
	kaber, mchan, sri

Changes in dev.c to support batching: add dev_add_skb_to_blist(), make
register_netdevice() recognize batch-aware drivers, and make
net_tx_action() the sole user of batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 dev.c |   77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 74 insertions(+), 3 deletions(-)

diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2007-07-20 07:49:28.000000000 +0530
+++ new/net/core/dev.c	2007-07-20 08:31:35.000000000 +0530
@@ -1414,6 +1414,45 @@ static int dev_gso_segment(struct sk_buf
 	return 0;
 }
 
+/*
+ * Add the skb (or multiple skbs if segmentation is required) to
+ * dev->skb_blist. We hold the QDISC_RUNNING bit, so no one else can add
+ * to this list. Also, skbs are dequeued from this list only when we call
+ * the driver, so the list is safe from simultaneous deletes too.
+ *
+ * Returns the count of skbs successfully added to skb_blist.
+ */
+int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
+{
+	if (!list_empty(&ptype_all))
+		dev_queue_xmit_nit(skb, dev);
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree_skb(skb);
+			return 0;
+		}
+
+		if (skb->next) {
+			int count = 0;
+
+			do {
+				struct sk_buff *nskb = skb->next;
+
+				skb->next = nskb->next;
+				__skb_queue_tail(dev->skb_blist, nskb);
+				count++;
+			} while (skb->next);
+
+			skb->destructor = DEV_GSO_CB(skb)->destructor;
+			kfree_skb(skb);
+			return count;
+		}
+	}
+	__skb_queue_tail(dev->skb_blist, skb);
+	return 1;
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	if (likely(!skb->next)) {
@@ -1566,7 +1605,7 @@ gso:
 			/* reset queue_mapping to zero */
 			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
-			qdisc_run(dev);
+			qdisc_run(dev, NULL);
 			spin_unlock(&dev->queue_lock);
 
 			rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
@@ -1763,7 +1802,11 @@ static void net_tx_action(struct softirq
 			clear_bit(__LINK_STATE_SCHED, &dev->state);
 
 			if (spin_trylock(&dev->queue_lock)) {
-				qdisc_run(dev);
+				/*
+				 * Try to send out all skbs if batching is
+				 * enabled.
+				 */
+				qdisc_run(dev, dev->skb_blist);
 				spin_unlock(&dev->queue_lock);
 			} else {
 				netif_schedule(dev);
@@ -3397,6 +3440,28 @@ int register_netdevice(struct net_device
 		}
 	}
 
+	if (dev->features & NETIF_F_BATCH_SKBS) {
+		if (!dev->hard_start_xmit_batch ||
+		    dev->tx_queue_len < MIN_QUEUE_LEN_BATCH) {
+			/*
+			 * Batch TX requires API support in the driver plus
+			 * a minimum-sized queue.
+			 */
+			printk(KERN_ERR "%s: Dropping NETIF_F_BATCH_SKBS "
+					"due to missing API support or a "
+					"queue length smaller than %d.\n",
+					dev->name, MIN_QUEUE_LEN_BATCH);
+			dev->features &= ~NETIF_F_BATCH_SKBS;
+		} else {
+			dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
+						 GFP_KERNEL);
+			if (dev->skb_blist) {
+				skb_queue_head_init(dev->skb_blist);
+				dev->tx_queue_len >>= 1;
+			}
+		}
+	}
+
 	/*
 	 *	nil rebuild_header routine,
 	 *	that should be never called and used as just bug trap.
@@ -3732,10 +3797,16 @@ void unregister_netdevice(struct net_dev
 
 	synchronize_net();
 
+	/* Deallocate batching structure */
+	if (dev->skb_blist) {
+		skb_queue_purge(dev->skb_blist);
+		kfree(dev->skb_blist);
+		dev->skb_blist = NULL;
+	}
+
 	/* Shutdown queueing discipline. */
 	dev_shutdown(dev);
 
-
 	/* Notify protocols, that we are about to destroy
 	   this device. They should clean all the things.
 	*/
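
For context, the batch-aware caller added in patch 05/10 uses the return
value of dev_add_skb_to_blist() (1 normally, or the number of segments
when GSO splits an skb) to bound the batch size; condensed from get_skb():

	int max = dev->tx_queue_len - skb_queue_len(dev->skb_blist);
	struct sk_buff *skb;

	while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
		max -= dev_add_skb_to_blist(skb, dev);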


* [PATCH 04/10] net-sysfs.c changes.
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
                   ` (2 preceding siblings ...)
  2007-07-20  6:32 ` [ofa-general] [PATCH 03/10] dev.c changes Krishna Kumar
@ 2007-07-20  6:32 ` Krishna Kumar
  2007-07-20 10:07   ` [ofa-general] " Patrick McHardy
  2007-07-20  6:32 ` [ofa-general] [PATCH 05/10] sch_generic.c changes Krishna Kumar
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar @ 2007-07-20  6:32 UTC (permalink / raw)
  To: davem, rdreier
  Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, xma, rick.jones2, mcarlson, jagana, mchan, general,
	netdev, tgraf, jeff, hadi, kaber, Krishna Kumar, sri

Support to turn on/off batching from /sys.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net-sysfs.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 70 insertions(+)

diff -ruNp org/net/core/net-sysfs.c new/net/core/net-sysfs.c
--- org/net/core/net-sysfs.c	2007-07-20 07:49:28.000000000 +0530
+++ new/net/core/net-sysfs.c	2007-07-20 08:34:45.000000000 +0530
@@ -230,6 +230,74 @@ static ssize_t store_weight(struct devic
 	return netdev_store(dev, attr, buf, len, change_weight);
 }
 
+static ssize_t show_tx_batch_skbs(struct device *dev,
+				  struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+
+	return sprintf(buf, fmt_dec, netdev->skb_blist ? 1 : 0);
+}
+
+static int change_tx_batch_skbs(struct net_device *net,
+				unsigned long new_tx_batch_skbs)
+{
+	int ret = 0;
+	struct sk_buff_head *blist;
+
+	if (!(net->features & NETIF_F_BATCH_SKBS) ||
+	    (new_tx_batch_skbs && net->tx_queue_len < MIN_QUEUE_LEN_BATCH)) {
+		/*
+		 * Driver doesn't support batching skbs, or the queue length
+		 * is insufficient. TODO: add a similar check to disable
+		 * batching in change_tx_queue_len() if queue_len becomes
+		 * smaller than MIN_QUEUE_LEN_BATCH.
+		 */
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	/*
+	 * The value was parsed as an unsigned long, so no negative-value
+	 * check is needed: any non-zero value means "enable" and is
+	 * normalized to 0/1 below.
+	 */
+
+	/* Check if new value is same as the current */
+	new_tx_batch_skbs = !!new_tx_batch_skbs;
+	if (!!net->skb_blist == new_tx_batch_skbs)
+		goto out;
+
+	if (new_tx_batch_skbs &&
+	    (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock(&net->queue_lock);
+	if (new_tx_batch_skbs) {
+		skb_queue_head_init(blist);
+		net->skb_blist = blist;
+		net->tx_queue_len >>= 1;
+	} else {
+		if (!skb_queue_empty(net->skb_blist))
+			skb_queue_purge(net->skb_blist);
+		kfree(net->skb_blist);
+		net->skb_blist = NULL;
+		net->tx_queue_len <<= 1;
+	}
+	spin_unlock(&net->queue_lock);
+
+out:
+	return ret;
+}
+
+static ssize_t store_tx_batch_skbs(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, change_tx_batch_skbs);
+}
+
 static struct device_attribute net_class_attributes[] = {
 	__ATTR(addr_len, S_IRUGO, show_addr_len, NULL),
 	__ATTR(iflink, S_IRUGO, show_iflink, NULL),
@@ -246,6 +314,8 @@ static struct device_attribute net_class
 	__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
+	__ATTR(tx_batch_skbs, S_IRUGO | S_IWUSR, show_tx_batch_skbs,
+	       store_tx_batch_skbs),
 	__ATTR(weight, S_IRUGO | S_IWUSR, show_weight, store_weight),
 	{}
 };
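
A design note on the tx_queue_len shifts above: enabling batching halves
dev->tx_queue_len and disabling doubles it back, matching the halving done
by register_netdevice() in patch 03/10, presumably so that the qdisc queue
plus the batch list together stay within the original bound. A worked
example with IPoIB, which sets tx_queue_len = ipoib_sendq_size * 2 in
patch 09/10 (assuming the default ipoib_sendq_size of 64):

	128 ->  64 on enable  (tx_queue_len >>= 1)
	 64 -> 128 on disable (tx_queue_len <<= 1)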


* [ofa-general] [PATCH 05/10] sch_generic.c changes.
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
                   ` (3 preceding siblings ...)
  2007-07-20  6:32 ` [PATCH 04/10] net-sysfs.c changes Krishna Kumar
@ 2007-07-20  6:32 ` Krishna Kumar
  2007-07-20 10:11   ` [ofa-general] " Patrick McHardy
  2007-07-20 18:16   ` Patrick McHardy
  2007-07-20  6:33 ` [ofa-general] [PATCH 06/10] IPoIB header file changes Krishna Kumar
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: Krishna Kumar @ 2007-07-20  6:32 UTC (permalink / raw)
  To: davem, rdreier
  Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, mcarlson, netdev, jagana, general, mchan, tgraf, jeff,
	hadi, kaber, sri

net/sched/sch_generic.c changes to support batching. Adds a batch-aware
function (get_skb) to fetch the skbs to send.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 sch_generic.c |   94 +++++++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 71 insertions(+), 23 deletions(-)

diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
--- org/net/sched/sch_generic.c	2007-07-20 07:49:28.000000000 +0530
+++ new/net/sched/sch_generic.c	2007-07-20 08:30:22.000000000 +0530
@@ -9,6 +9,11 @@
  * Authors:	Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
  *              Jamal Hadi Salim, <hadi@cyberus.ca> 990601
  *              - Ingress support
+ *
+ * New functionality:
+ *		Krishna Kumar, <krkumar2@in.ibm.com>, July 2007
+ *		- Support for sending multiple skbs to devices that support
+ *		  the new API - dev->hard_start_xmit_batch()
  */
 
 #include <linux/bitops.h>
@@ -59,10 +64,12 @@ static inline int qdisc_qlen(struct Qdis
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
-	if (unlikely(skb->next))
-		dev->gso_skb = skb;
-	else
-		q->ops->requeue(skb, q);
+	if (likely(skb)) {
+		if (unlikely(skb->next))
+			dev->gso_skb = skb;
+		else
+			q->ops->requeue(skb, q);
+	}
 
 	netif_schedule(dev);
 	return 0;
@@ -91,18 +98,23 @@ static inline int handle_dev_cpu_collisi
 		/*
 		 * Same CPU holding the lock. It may be a transient
 		 * configuration error, when hard_start_xmit() recurses. We
-		 * detect it by checking xmit owner and drop the packet when
-		 * deadloop is detected. Return OK to try the next skb.
+		 * detect it by checking xmit owner and drop skb (or all
+		 * skbs in batching case) when deadloop is detected. Return
+		 * OK to try the next skb.
 		 */
-		kfree_skb(skb);
+		if (likely(skb))
+			kfree_skb(skb);
+		else if (!skb_queue_empty(dev->skb_blist))
+			skb_queue_purge(dev->skb_blist);
+
 		if (net_ratelimit())
 			printk(KERN_WARNING "Dead loop on netdevice %s, "
 			       "fix it urgently!\n", dev->name);
 		ret = qdisc_qlen(q);
 	} else {
 		/*
-		 * Another cpu is holding lock, requeue & delay xmits for
-		 * some time.
+		 * Another cpu is holding lock. Requeue skb and delay xmits
+		 * for some time.
 		 */
 		__get_cpu_var(netdev_rx_stat).cpu_collision++;
 		ret = dev_requeue_skb(skb, dev, q);
@@ -112,6 +124,39 @@ static inline int handle_dev_cpu_collisi
 }
 
 /*
+ * The algorithm to get skb(s) is:
+ *	- Non-batching drivers, or if the batch list is empty and there is
+ *	  at most one skb in the queue - dequeue the skb and put it in
+ *	  *skbp to tell the caller to use the regular API.
+ *	- Batching drivers where the batch list already contains at least
+ *	  one skb, or where there are multiple skbs in the queue: keep
+ *	  dequeuing skbs up to a limit and set *skbp to NULL to tell the
+ *	  caller to use the new API.
+ *
+ * Returns:
+ *	1 - at least one skb is to be sent out; *skbp contains the skb, or
+ *	    NULL if >1 skbs are present in blist for batching
+ *	0 - no skbs to be sent.
+ */
+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+			  struct sk_buff_head *blist,
+			  struct sk_buff **skbp)
+{
+	if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
+		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
+	} else {
+		int max = dev->tx_queue_len - skb_queue_len(blist);
+		struct sk_buff *skb;
+
+		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
+			max -= dev_add_skb_to_blist(skb, dev);
+
+		*skbp = NULL;
+		return 1;	/* we have at least one skb in blist */
+	}
+}
+
+/*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
  *
  * __LINK_STATE_QDISC_RUNNING guarantees only one CPU can process this
@@ -130,27 +175,28 @@ static inline int handle_dev_cpu_collisi
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+static inline int qdisc_restart(struct net_device *dev,
+				struct sk_buff_head *blist)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
-	unsigned lockless;
+	unsigned getlock;		/* whether we need to get lock or not */
 	int ret;
 
 	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
+	if (unlikely(!get_skb(dev, q, blist, &skb)))
 		return 0;
 
 	/*
 	 * When the driver has LLTX set, it does its own locking in
-	 * start_xmit. These checks are worth it because even uncongested
+	 * start_xmit. These checks are worth it because even uncontested
 	 * locks can be quite expensive. The driver can do a trylock, as
 	 * is being done here; in case of lock contention it should return
 	 * NETDEV_TX_LOCKED and the packet will be requeued.
 	 */
-	lockless = (dev->features & NETIF_F_LLTX);
+	getlock = !(dev->features & NETIF_F_LLTX);
 
-	if (!lockless && !netif_tx_trylock(dev)) {
+	if (getlock && !netif_tx_trylock(dev)) {
 		/* Another CPU grabbed the driver tx lock */
 		return handle_dev_cpu_collision(skb, dev, q);
 	}
@@ -158,9 +204,12 @@ static inline int qdisc_restart(struct n
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
 
-	ret = dev_hard_start_xmit(skb, dev);
+	if (likely(skb))
+		ret = dev_hard_start_xmit(skb, dev);
+	else
+		ret = dev->hard_start_xmit_batch(dev);
 
-	if (!lockless)
+	if (getlock)
 		netif_tx_unlock(dev);
 
 	spin_lock(&dev->queue_lock);
@@ -168,7 +217,7 @@ static inline int qdisc_restart(struct n
 
 	switch (ret) {
 	case NETDEV_TX_OK:
-		/* Driver sent out skb successfully */
+		/* Driver sent out skb (or entire skb_blist) successfully */
 		ret = qdisc_qlen(q);
 		break;
 
@@ -179,10 +228,9 @@ static inline int qdisc_restart(struct n
 
 	default:
 		/* Driver returned NETDEV_TX_BUSY - requeue skb */
-		if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
-			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
+		if (unlikely(ret != NETDEV_TX_BUSY) && net_ratelimit())
+			printk(KERN_WARNING "%s: BUG, code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
-
 		ret = dev_requeue_skb(skb, dev, q);
 		break;
 	}
@@ -190,10 +238,10 @@ static inline int qdisc_restart(struct n
 	return ret;
 }
 
-void __qdisc_run(struct net_device *dev)
+void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	do {
-		if (!qdisc_restart(dev))
+		if (!qdisc_restart(dev, blist))
 			break;
 	} while (!netif_queue_stopped(dev));
 


* [ofa-general] [PATCH 06/10] IPoIB header file changes.
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
                   ` (4 preceding siblings ...)
  2007-07-20  6:32 ` [ofa-general] [PATCH 05/10] sch_generic.c changes Krishna Kumar
@ 2007-07-20  6:33 ` Krishna Kumar
  2007-07-20  6:33 ` [ofa-general] [PATCH 07/10] IPoIB verb changes Krishna Kumar
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Krishna Kumar @ 2007-07-20  6:33 UTC (permalink / raw)
  To: davem, rdreier
  Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr,
	peter.p.waskiewicz.jr, mcarlson, jagana, general, netdev, tgraf,
	jeff, sri, hadi, kaber, mchan

IPoIB header file changes.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib.h |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h new/drivers/infiniband/ulp/ipoib/ipoib.h
--- org/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-20 08:30:22.000000000 +0530
@@ -269,8 +269,8 @@ struct ipoib_dev_priv {
 	struct ipoib_tx_buf *tx_ring;
 	unsigned             tx_head;
 	unsigned             tx_tail;
-	struct ib_sge        tx_sge;
-	struct ib_send_wr    tx_wr;
+	struct ib_sge        *tx_sge;
+	struct ib_send_wr    *tx_wr;
 
 	struct ib_wc ibwc[IPOIB_NUM_WC];
 
@@ -365,8 +365,11 @@ static inline void ipoib_put_ah(struct i
 int ipoib_open(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
 
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, int snum, int tx_index,
+		      struct ipoib_ah *address, u32 qpn);
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn);
+		struct ipoib_ah *address, u32 qpn, int num_skbs);
 void ipoib_reap_ah(struct work_struct *work);
 
 void ipoib_flush_paths(struct net_device *dev);


* [ofa-general] [PATCH 07/10] IPoIB verb changes.
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
                   ` (5 preceding siblings ...)
  2007-07-20  6:33 ` [ofa-general] [PATCH 06/10] IPoIB header file changes Krishna Kumar
@ 2007-07-20  6:33 ` Krishna Kumar
  2007-07-20  6:33 ` [ofa-general] [PATCH 08/10] IPoIB multicast/CM changes Krishna Kumar
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Krishna Kumar @ 2007-07-20  6:33 UTC (permalink / raw)
  To: davem, rdreier
  Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, mcarlson, jagana, general, netdev, tgraf, jeff, hadi,
	kaber, mchan, sri

IPoIB verb changes to support batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_verbs.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-20 08:30:22.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
 			.max_send_sge = 1,
 			.max_recv_sge = 1
 		},
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.sq_sig_type = IB_SIGNAL_REQ_WR,	/* 11.2.4.1 */
 		.qp_type     = IB_QPT_UD
 	};
-
-	int ret, size;
+	struct ib_send_wr *next_wr = NULL;
+	int i, ret, size;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
-	priv->tx_sge.lkey 	= priv->mr->lkey;
-
-	priv->tx_wr.opcode 	= IB_WR_SEND;
-	priv->tx_wr.sg_list 	= &priv->tx_sge;
-	priv->tx_wr.num_sge 	= 1;
-	priv->tx_wr.send_flags 	= IB_SEND_SIGNALED;
+	for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+		priv->tx_sge[i].lkey		= priv->mr->lkey;
+		priv->tx_wr[i].opcode		= IB_WR_SEND;
+		priv->tx_wr[i].sg_list		= &priv->tx_sge[i];
+		priv->tx_wr[i].num_sge		= 1;
+		priv->tx_wr[i].send_flags	= 0;
+
+		/* Link the list properly for provider to use */
+		priv->tx_wr[i].next		= next_wr;
+		next_wr				= &priv->tx_wr[i];
+	}
 
 	return 0;
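
The WR chain initialized above is what allows patch 10/10 to post an
entire batch with a single doorbell; a condensed sketch of how post_send()
there uses it (declarations shortened, error handling elided):

	struct ib_send_wr *last_wr = &priv->tx_wr[start_index + num_skbs - 1];
	struct ib_send_wr *next_wr = last_wr->next;
	struct ib_send_wr *bad_wr;
	int ret;

	last_wr->send_flags = IB_SEND_SIGNALED;	/* signal only the last WR */
	last_wr->next = NULL;			/* terminate the chain */

	ret = ib_post_send(priv->qp, &priv->tx_wr[start_index], &bad_wr);

	last_wr->send_flags = 0;		/* restore for the next batch */
	last_wr->next = next_wr;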
 


* [ofa-general] [PATCH 08/10] IPoIB multicast/CM changes.
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
                   ` (6 preceding siblings ...)
  2007-07-20  6:33 ` [ofa-general] [PATCH 07/10] IPoIB verb changes Krishna Kumar
@ 2007-07-20  6:33 ` Krishna Kumar
  2007-07-20  6:33 ` [PATCH 09/10] IPoIB batching xmit handler support Krishna Kumar
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Krishna Kumar @ 2007-07-20  6:33 UTC (permalink / raw)
  To: davem, rdreier
  Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr,
	peter.p.waskiewicz.jr, mcarlson, kaber, jagana, general, mchan,
	tgraf, jeff, sri, hadi, netdev

IPoIB Multicast and CM changes for batching support.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_cm.c        |   13 +++++++++----
 ipoib_multicast.c |    4 ++--
 2 files changed, 11 insertions(+), 6 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c new/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-20 08:30:22.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
 			    unsigned int wr_id,
 			    u64 addr, int len)
 {
+	int ret;
 	struct ib_send_wr *bad_wr;
 
-	priv->tx_sge.addr             = addr;
-	priv->tx_sge.length           = len;
+	priv->tx_sge[0].addr          = addr;
+	priv->tx_sge[0].length        = len;
+
+	priv->tx_wr[0].wr_id 	      = wr_id;
 
-	priv->tx_wr.wr_id 	      = wr_id;
+	priv->tx_wr[0].next = NULL;
+	ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+	priv->tx_wr[0].next = &priv->tx_wr[1];
 
-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ret;
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-20 08:30:22.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
 			}
 		}
 
-		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
 	}
 
 unlock:


* [PATCH 09/10] IPoIB batching xmit handler support.
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
                   ` (7 preceding siblings ...)
  2007-07-20  6:33 ` [ofa-general] [PATCH 08/10] IPoIB multicast/CM changes Krishna Kumar
@ 2007-07-20  6:33 ` Krishna Kumar
  2007-07-20  6:33 ` [PATCH 10/10] IPoIB batching in internal xmit/handler routines Krishna Kumar
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Krishna Kumar @ 2007-07-20  6:33 UTC (permalink / raw)
  To: davem, rdreier
  Cc: johnpol, Robert.Olsson, rick.jones2, herbert, gaagaan, kumarkr,
	peter.p.waskiewicz.jr, mcarlson, kaber, jagana, general, mchan,
	tgraf, jeff, sri, hadi, netdev, Krishna Kumar, xma

Add an IPoIB batching xmit handler.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_main.c |  215 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 210 insertions(+), 5 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c new/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-20 08:30:22.000000000 +0530
@@ -558,7 +558,8 @@ static void neigh_add_path(struct sk_buf
 				goto err_drop;
 			}
 		} else
-			ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+			ipoib_send(dev, skb, path->ah,
+				   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 	} else {
 		neigh->ah  = NULL;
 
@@ -638,7 +639,7 @@ static void unicast_arp_send(struct sk_b
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));
 
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
 	} else if ((path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		/* put pseudoheader back on for next time */
@@ -704,7 +705,8 @@ static int ipoib_start_xmit(struct sk_bu
 				goto out;
 			}
 
-			ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+			ipoib_send(dev, skb, neigh->ah,
+				   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 			goto out;
 		}
 
@@ -753,6 +755,177 @@ out:
 	return NETDEV_TX_OK;
 }
 
+#define	XMIT_QUEUED_SKBS()						\
+	do {								\
+		if (num_skbs) {						\
+			ipoib_send(dev, NULL, old_neigh->ah, old_qpn,	\
+				   num_skbs);				\
+			num_skbs = 0;					\
+		}							\
+	} while (0)
+
+/*
+ * TODO: Merge with ipoib_start_xmit so that both paths share the same
+ * code, with a transparent wrapper around the xmit handlers, etc.
+ */
+static int ipoib_start_xmit_frames(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct sk_buff *skb;
+	struct sk_buff_head *blist;
+	int max_skbs, num_skbs = 0, tx_ring_index = -1;
+	u32 qpn, old_qpn = 0;
+	struct ipoib_neigh *neigh, *old_neigh = NULL;
+	unsigned long flags;
+
+	if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
+		return NETDEV_TX_LOCKED;
+
+	blist = dev->skb_blist;
+
+	/*
+	 * Send at most xmit_slots skbs. This also prevents the device from
+	 * getting full, as ipoib_send() updates xmit_slots and we use the
+	 * same value to figure out how many skbs to send.
+	 */
+	max_skbs = dev->xmit_slots;
+
+	while (max_skbs-- > 0 && (skb = __skb_dequeue(blist)) != NULL) {
+		/*
+		 * From here on, ipoib_send() cannot stop the queue, as it
+		 * uses the same value that 'max_skbs' was initialized to.
+		 * So we can avoid checking for a stopped queue per skb.
+		 */
+		if (likely(skb->dst && skb->dst->neighbour)) {
+			if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
+				XMIT_QUEUED_SKBS();
+				ipoib_path_lookup(skb, dev);
+				continue;
+			}
+
+			neigh = *to_ipoib_neigh(skb->dst->neighbour);
+
+			if (ipoib_cm_get(neigh)) {
+				if (ipoib_cm_up(neigh)) {
+					XMIT_QUEUED_SKBS();
+					ipoib_cm_send(dev, skb,
+						      ipoib_cm_get(neigh));
+					continue;
+				}
+			} else if (neigh->ah) {
+				if (unlikely(memcmp(&neigh->dgid.raw,
+						    skb->dst->neighbour->ha + 4,
+						    sizeof(union ib_gid)))) {
+					spin_lock(&priv->lock);
+					/*
+					 * It's safe to call ipoib_put_ah()
+					 * inside priv->lock here, because we
+					 * know that path->ah will always hold
+					 * one more reference, so ipoib_put_ah()
+					 * will never do more than decrement
+					 * the ref count.
+					 */
+					ipoib_put_ah(neigh->ah);
+					list_del(&neigh->list);
+					ipoib_neigh_free(dev, neigh);
+					spin_unlock(&priv->lock);
+					XMIT_QUEUED_SKBS();
+					ipoib_path_lookup(skb, dev);
+					continue;
+				}
+
+				qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+				if (neigh != old_neigh || qpn != old_qpn) {
+					/*
+					 * Sending to a different destination
+					 * from earlier skb's - send all
+					 * existing skbs (if any).
+					 */
+					if (tx_ring_index == -1) {
+						/*
+						 * First time, find where to
+						 * store skb.
+						 */
+						tx_ring_index = priv->tx_head &
+							(ipoib_sendq_size - 1);
+					} else {
+						/* Some skbs to send */
+						XMIT_QUEUED_SKBS();
+					}
+					old_neigh = neigh;
+					old_qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+				}
+
+				if (ipoib_process_skb(dev, skb, priv, num_skbs,
+						      tx_ring_index, neigh->ah,
+						      qpn))
+					continue;
+
+				num_skbs++;
+
+				/* Queue'd one skb, get index for next skb */
+				if (max_skbs)
+					tx_ring_index = (tx_ring_index + 1) &
+							(ipoib_sendq_size - 1);
+				continue;
+			}
+
+			if (skb_queue_len(&neigh->queue) <
+			    IPOIB_MAX_PATH_REC_QUEUE) {
+				spin_lock(&priv->lock);
+				__skb_queue_tail(&neigh->queue, skb);
+				spin_unlock(&priv->lock);
+			} else {
+				dev_kfree_skb_any(skb);
+				++priv->stats.tx_dropped;
+				++max_skbs;
+			}
+		} else {
+			struct ipoib_pseudoheader *phdr =
+				(struct ipoib_pseudoheader *) skb->data;
+			skb_pull(skb, sizeof *phdr);
+
+			if (phdr->hwaddr[4] == 0xff) {
+				/* Add in the P_Key for multicast */
+				phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
+				phdr->hwaddr[9] = priv->pkey & 0xff;
+
+				XMIT_QUEUED_SKBS();
+				ipoib_mcast_send(dev, phdr->hwaddr + 4, skb);
+			} else {
+				/* unicast GID -- should be ARP or RARP reply */
+
+				if ((be16_to_cpup((__be16 *) skb->data) !=
+				    ETH_P_ARP) &&
+				    (be16_to_cpup((__be16 *) skb->data) !=
+				    ETH_P_RARP)) {
+					ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x "
+						IPOIB_GID_FMT "\n",
+						skb->dst ? "neigh" : "dst",
+						be16_to_cpup((__be16 *)
+						skb->data),
+						IPOIB_QPN(phdr->hwaddr),
+						IPOIB_GID_RAW_ARG(phdr->hwaddr
+								  + 4));
+					dev_kfree_skb_any(skb);
+					++priv->stats.tx_dropped;
+					++max_skbs;
+					continue;
+				}
+				XMIT_QUEUED_SKBS();
+				unicast_arp_send(skb, dev, phdr);
+			}
+		}
+	}
+
+	/* Send out last packets (if any) */
+	XMIT_QUEUED_SKBS();
+
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+
+	return skb_queue_empty(blist) ? NETDEV_TX_OK : NETDEV_TX_BUSY;
+}
+
 static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -898,11 +1071,35 @@ int ipoib_dev_init(struct net_device *de
 
 	/* priv->tx_head & tx_tail are already 0 */
 
-	if (ipoib_ib_dev_init(dev, ca, port))
+	/* Allocate tx_sge */
+	priv->tx_sge = kmalloc(ipoib_sendq_size * sizeof *priv->tx_sge,
+			       GFP_KERNEL);
+	if (!priv->tx_sge) {
+		printk(KERN_WARNING "%s: failed to allocate TX sge (%d entries)\n",
+		       ca->name, ipoib_sendq_size);
 		goto out_tx_ring_cleanup;
+	}
+
+	/* Allocate tx_wr */
+	priv->tx_wr = kmalloc(ipoib_sendq_size * sizeof *priv->tx_wr,
+			      GFP_KERNEL);
+	if (!priv->tx_wr) {
+		printk(KERN_WARNING "%s: failed to allocate TX wr (%d entries)\n",
+		       ca->name, ipoib_sendq_size);
+		goto out_tx_sge_cleanup;
+	}
+
+	if (ipoib_ib_dev_init(dev, ca, port))
+		goto out_tx_wr_cleanup;
 
 	return 0;
 
+out_tx_wr_cleanup:
+	kfree(priv->tx_wr);
+
+out_tx_sge_cleanup:
+	kfree(priv->tx_sge);
+
 out_tx_ring_cleanup:
 	kfree(priv->tx_ring);
 
@@ -930,9 +1127,13 @@ void ipoib_dev_cleanup(struct net_device
 
 	kfree(priv->rx_ring);
 	kfree(priv->tx_ring);
+	kfree(priv->tx_sge);
+	kfree(priv->tx_wr);
 
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
+	priv->tx_sge = NULL;
+	priv->tx_wr = NULL;
 }
 
 static void ipoib_setup(struct net_device *dev)
@@ -943,6 +1144,7 @@ static void ipoib_setup(struct net_devic
 	dev->stop 		 = ipoib_stop;
 	dev->change_mtu 	 = ipoib_change_mtu;
 	dev->hard_start_xmit 	 = ipoib_start_xmit;
+	dev->hard_start_xmit_batch = ipoib_start_xmit_frames;
 	dev->get_stats 		 = ipoib_get_stats;
 	dev->tx_timeout 	 = ipoib_timeout;
 	dev->hard_header 	 = ipoib_hard_header;
@@ -963,7 +1165,10 @@ static void ipoib_setup(struct net_devic
 	dev->addr_len 		 = INFINIBAND_ALEN;
 	dev->type 		 = ARPHRD_INFINIBAND;
 	dev->tx_queue_len 	 = ipoib_sendq_size * 2;
-	dev->features            = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX;
+	dev->features            = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX |
+					NETIF_F_BATCH_SKBS;
+
+	dev->xmit_slots		= ipoib_sendq_size;
 
 	/* MTU will be reset when mcast join happens */
 	dev->mtu 		 = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN;


* [PATCH 10/10] IPoIB batching in internal xmit/handler routines.
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
                   ` (8 preceding siblings ...)
  2007-07-20  6:33 ` [PATCH 09/10] IPoIB batching xmit handler support Krishna Kumar
@ 2007-07-20  6:33 ` Krishna Kumar
  2007-07-20  7:18 ` [ofa-general] Re: [PATCH 00/10] Implement batching skb API Stephen Hemminger
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Krishna Kumar @ 2007-07-20  6:33 UTC (permalink / raw)
  To: davem, rdreier
  Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, xma, rick.jones2, mcarlson, netdev, jagana, general,
	mchan, tgraf, jeff, hadi, kaber, Krishna Kumar, sri

Add batching support to IPoIB post_send and TX completion handler.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_ib.c |  233 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 187 insertions(+), 46 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c new/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-20 08:30:22.000000000 +0530
@@ -242,8 +242,9 @@ repost:
 static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i = 0, num_completions;
+	int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
 	unsigned int wr_id = wc->wr_id;
-	struct ipoib_tx_buf *tx_req;
 	unsigned long flags;
 
 	ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
@@ -255,23 +256,60 @@ static void ipoib_ib_handle_tx_wc(struct
 		return;
 	}
 
-	tx_req = &priv->tx_ring[wr_id];
+	num_completions = wr_id - tx_ring_index + 1;
+	if (num_completions <= 0)
+		num_completions += ipoib_sendq_size;
+
+	/*
+	 * Handle skb completions from tx_tail to wr_id. It is possible to
+	 * handle WCs from multiple earlier post_sends in this iteration,
+	 * as we move from tx_tail to wr_id, because if the last WR of an
+	 * earlier request (the only WR of that request with a completion
+	 * request set) failed to be sent, no completion notification is
+	 * generated for the successful WRs of that earlier request
+	 * either.
+	 */
+	while (1) {
+		/*
+		 * Could use while (i < num_completions), but it is costly
+		 * since in most cases there is 1 completion, and we end up
+		 * doing an extra "index = (index+1) & (ipoib_sendq_size-1)"
+		 */
+		struct ipoib_tx_buf *tx_req = &priv->tx_ring[tx_ring_index];
+
+		if (likely(tx_req->skb)) {
+			ib_dma_unmap_single(priv->ca, tx_req->mapping,
+					    tx_req->skb->len, DMA_TO_DEVICE);
 
-	ib_dma_unmap_single(priv->ca, tx_req->mapping,
-			    tx_req->skb->len, DMA_TO_DEVICE);
+			++priv->stats.tx_packets;
+			priv->stats.tx_bytes += tx_req->skb->len;
 
-	++priv->stats.tx_packets;
-	priv->stats.tx_bytes += tx_req->skb->len;
+			dev_kfree_skb_any(tx_req->skb);
+		}
+		/*
+		 * else this skb failed synchronously when posted and was
+		 * freed immediately.
+		 */
+
+		if (++i == num_completions)
+			break;
 
-	dev_kfree_skb_any(tx_req->skb);
+		/* More WC's to handle */
+		tx_ring_index = (tx_ring_index + 1) & (ipoib_sendq_size - 1);
+	}
 
 	spin_lock_irqsave(&priv->tx_lock, flags);
-	++priv->tx_tail;
+
+	priv->tx_tail += num_completions;
 	if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) &&
 	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
 		clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
 		netif_wake_queue(dev);
 	}
+
+	/* Make more slots available for posts */
+	dev->xmit_slots = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
+
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (wc->status != IB_WC_SUCCESS &&
@@ -340,78 +378,181 @@ void ipoib_ib_completion(struct ib_cq *c
 	netif_rx_schedule(dev_ptr);
 }
 
-static inline int post_send(struct ipoib_dev_priv *priv,
-			    unsigned int wr_id,
-			    struct ib_ah *address, u32 qpn,
-			    u64 addr, int len)
+/*
+ * post_send : Post WR(s) to the device.
+ *
+ * num_skbs is the number of WRs, and 'start_index' is the first slot in
+ * tx_wr[] or tx_sge[]. Note: 'start_index' is normally zero, unless a
+ * previous post_send returned an error and we are retrying the untried
+ * WRs, in which case start_index points to the first untried WR.
+ *
+ * We also break the WR link before posting so that the driver knows how
+ * many WRs to process; the link is restored after the post.
+ */
+static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
+			    int start_index, int num_skbs,
+			    struct ib_send_wr **bad_wr)
 {
-	struct ib_send_wr *bad_wr;
+	int ret;
+	struct ib_send_wr *last_wr, *next_wr;
+
+	last_wr = &priv->tx_wr[start_index + num_skbs - 1];
+
+	/* Set Completion Notification for last WR */
+	last_wr->send_flags = IB_SEND_SIGNALED;
 
-	priv->tx_sge.addr             = addr;
-	priv->tx_sge.length           = len;
+	/* Terminate the last WR */
+	next_wr = last_wr->next;
+	last_wr->next = NULL;
 
-	priv->tx_wr.wr_id 	      = wr_id;
-	priv->tx_wr.wr.ud.remote_qpn  = qpn;
-	priv->tx_wr.wr.ud.ah 	      = address;
+	/* Send all the WR's in one doorbell */
+	ret = ib_post_send(priv->qp, &priv->tx_wr[start_index], bad_wr);
 
-	return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);
+	/* Restore send_flags & WR chain */
+	last_wr->send_flags = 0;
+	last_wr->next = next_wr;
+
+	return ret;
 }
 
-void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn)
+/*
+ * Map skb & store skb/mapping in tx_req; and details of the WR in tx_wr
+ * to pass to the driver.
+ *
+ * Returns :
+ *	- 0 on successful processing of the skb
+ *	- 1 if the skb was freed.
+ */
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, int wr_num,
+		      int tx_ring_index, struct ipoib_ah *address, u32 qpn)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_tx_buf *tx_req;
 	u64 addr;
+	struct ipoib_tx_buf *tx_req;
 
 	if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
-		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
+		ipoib_warn(priv, "packet len %d (> %d) too long to "
+			   "send, dropping\n",
 			   skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN);
 		++priv->stats.tx_dropped;
 		++priv->stats.tx_errors;
 		ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
-		return;
+		return 1;
 	}
 
-	ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n",
+	ipoib_dbg_data(priv, "sending packet, length=%d address=%p "
+		       "qpn=0x%06x\n",
 		       skb->len, address, qpn);
 
 	/*
 	 * We put the skb into the tx_ring _before_ we call post_send()
 	 * because it's entirely possible that the completion handler will
-	 * run before we execute anything after the post_send().  That
+	 * run before we execute anything after the post_send(). That
 	 * means we have to make sure everything is properly recorded and
 	 * our state is consistent before we call post_send().
 	 */
-	tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
-	tx_req->skb = skb;
-	addr = ib_dma_map_single(priv->ca, skb->data, skb->len,
-				 DMA_TO_DEVICE);
+	addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE);
 	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
 		++priv->stats.tx_errors;
 		dev_kfree_skb_any(skb);
-		return;
+		return 1;
 	}
+
+	tx_req = &priv->tx_ring[tx_ring_index];
+	tx_req->skb = skb;
 	tx_req->mapping = addr;
+	priv->tx_sge[wr_num].addr = addr;
+	priv->tx_sge[wr_num].length = skb->len;
+	priv->tx_wr[wr_num].wr_id = tx_ring_index;
+	priv->tx_wr[wr_num].wr.ud.remote_qpn = qpn;
+	priv->tx_wr[wr_num].wr.ud.ah = address->ah;
 
-	if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
-			       address->ah, qpn, addr, skb->len))) {
-		ipoib_warn(priv, "post_send failed\n");
-		++priv->stats.tx_errors;
-		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
-		dev_kfree_skb_any(skb);
-	} else {
-		dev->trans_start = jiffies;
+	return 0;
+}
 
-		address->last_send = priv->tx_head;
-		++priv->tx_head;
+/*
+ * If an skb is passed to this function, it is the single, unprocessed skb
+ * send case. Otherwise if skb is NULL, it means that all skbs are already
+ * processed and put on the priv->tx_wr,tx_sge,tx_ring, etc.
+ */
+void ipoib_send(struct net_device *dev, struct sk_buff *skb,
+		struct ipoib_ah *address, u32 qpn, int num_skbs)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int start_index = 0;
+
+	if (skb && ipoib_process_skb(dev, skb, priv, 0, priv->tx_head &
+				     (ipoib_sendq_size - 1), address, qpn))
+		return;
+
+	/* Send out all the skb's in one post */
+	while (num_skbs) {
+		struct ib_send_wr *bad_wr;
+
+		if (unlikely((post_send(priv, qpn, start_index, num_skbs,
+					&bad_wr)))) {
+			int done;
+
+			/*
+			 * Better error handling can be done here, like free
+			 * all untried skbs if err == -ENOMEM. However at this
+			 * time, we re-try all the skbs, all of which will
+			 * likely fail anyway (unless device finished sending
+			 * some out in the meantime). This is not a regression
+			 * since the earlier code is not doing this either.
+			 */
+			ipoib_warn(priv, "post_send failed\n");
 
-		if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
-			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
-			netif_stop_queue(dev);
-			set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+			/* Get #WR's that finished successfully */
+			done = bad_wr - &priv->tx_wr[start_index];
+
+			/* Handle 1 error */
+			priv->stats.tx_errors++;
+			ib_dma_unmap_single(priv->ca,
+				priv->tx_sge[start_index + done].addr,
+				priv->tx_sge[start_index + done].length,
+				DMA_TO_DEVICE);
+
+			/* Handle 'n' successes */
+			if (done) {
+				dev->trans_start = jiffies;
+				address->last_send = priv->tx_head;
+			}
+
+			/* Free failed WR & reset for WC handler to recognize */
+			dev_kfree_skb_any(priv->tx_ring[bad_wr->wr_id].skb);
+			priv->tx_ring[bad_wr->wr_id].skb = NULL;
+
+			/* Move head to first untried WR */
+			priv->tx_head += (done + 1);
+				/* + 1 for WR that was tried & failed */
+
+			/* Get count of skbs that were not tried */
+			num_skbs -= (done + 1);
+
+			/* Get start index for next iteration */
+			start_index += (done + 1);
+		} else {
+			dev->trans_start = jiffies;
+
+			address->last_send = priv->tx_head;
+			priv->tx_head += num_skbs;
+			num_skbs = 0;
 		}
 	}
+
+	if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size)) {
+		/*
+		 * Not accurate as some intermediate slots could have been
+		 * freed on error, but no harm - only queue stopped earlier.
+		 */
+		ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
+		netif_stop_queue(dev);
+		set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+	}
+
+	/* Reduce the number of slots for sends */
+	dev->xmit_slots = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
 }
 
 static void __ipoib_reap_ah(struct net_device *dev)


* [ofa-general] Re: [PATCH 00/10] Implement batching skb API
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
                   ` (9 preceding siblings ...)
  2007-07-20  6:33 ` [PATCH 10/10] IPoIB batching in internal xmit/handler routines Krishna Kumar
@ 2007-07-20  7:18 ` Stephen Hemminger
  2007-07-20  7:30   ` Krishna Kumar2
                     ` (2 more replies)
  2007-07-20 12:54 ` [ofa-general] " Evgeniy Polyakov
  2007-07-21 13:18 ` [ofa-general] " jamal
  12 siblings, 3 replies; 55+ messages in thread
From: Stephen Hemminger @ 2007-07-20  7:18 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, jagana, kaber, jeff, general,
	mchan, tgraf, netdev, sri, hadi, davem

On Fri, 20 Jul 2007 12:01:49 +0530
Krishna Kumar <krkumar2@in.ibm.com> wrote:

> Hi Dave, Roland, everyone,
> 
> In May, I had proposed creating an API for sending 'n' skbs to a driver to
> reduce lock overhead, DMA operations, and specific to drivers that have
> completion notification like IPoIB - reduce completion handling ("[RFC] New
> driver API to speed up small packets xmits" @
> http://marc.info/?l=linux-netdev&m=117880900818960&w=2). I had also sent
> initial test results for E1000 which showed minor improvements (but also
> got degradations) @http://marc.info/?l=linux-netdev&m=117887698405795&w=2.
> 
> After fine-tuning qdisc and other changes, I modified IPoIB to use this API,
> and now get good gains. Summary for TCP & No Delay: 1 process improves for
> all cases from 1.4% to 49.5%; 4 process has almost identical improvements
> from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to
> 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP
> was tested with 1 process netperf with small increase in BW but big
> improvement in Service Demand. Netperf latency tests show small drop in
> transaction rate (results in separate attachment).
> 

You may see worse performance with batching in the real world when
running over WANs.  Like TSO, batching will generate back-to-back packet
trains that are subject to multi-packet synchronized loss. The problem is
that intermediate router queues are often close to full, and when a long
string of packets arrives back to back only the first ones will get in;
the rest get dropped.  Normal sends have at least minimal pacing, so they
are less likely to get synchronized drops.


* Re: [PATCH 00/10] Implement batching skb API
  2007-07-20  7:18 ` [ofa-general] Re: [PATCH 00/10] Implement batching skb API Stephen Hemminger
@ 2007-07-20  7:30   ` Krishna Kumar2
  2007-07-20  7:57     ` [ofa-general] " Stephen Hemminger
  2007-07-20  7:47   ` Krishna Kumar2
  2007-07-21 13:46   ` [ofa-general] TCP and batching WAS(Re: " jamal
  2 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-20  7:30 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, johnpol,
	kaber, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
	rdreier, rick.jones2, Robert.Olsson, sri, tgraf, xma

Stephen Hemminger <shemminger@linux-foundation.org> wrote on 07/20/2007 12:48:48 PM:

> You may see worse performance with batching in the real world when
> running over WANs.  Like TSO, batching will generate back-to-back packet
> trains that are subject to multi-packet synchronized loss. The problem is
> that intermediate router queues are often close to full, and when a long
> string of packets arrives back to back only the first ones will get in;
> the rest get dropped.  Normal sends have at least minimal pacing, so they
> are less likely to get synchronized drops.

Hi Stephen,

OK. The difference I can see is that the existing code's "minimal
pacing" could also lead to (possibly slightly less) loss, since sends
are quick iterations at the IP layer, while with batching the iteration
happens at the driver layer.

Is it an issue? Any suggestions?

Thanks,

- KK



* [ofa-general] Re: [PATCH 00/10] Implement batching skb API
  2007-07-20  7:18 ` [ofa-general] Re: [PATCH 00/10] Implement batching skb API Stephen Hemminger
  2007-07-20  7:30   ` Krishna Kumar2
@ 2007-07-20  7:47   ` Krishna Kumar2
  2007-07-21 13:46   ` [ofa-general] TCP and batching WAS(Re: " jamal
  2 siblings, 0 replies; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-20  7:47 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	mcarlson, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
	mchan, tgraf, netdev, sri, davem, rdreier

Stephen Hemminger <shemminger@linux-foundation.org> wrote on 07/20/2007 12:48:48 PM:

> You may see worse performance with batching in the real world when
> running over WANs.  Like TSO, batching will generate back-to-back packet
> trains that are subject to multi-packet synchronized loss. The problem is
> that intermediate router queues are often close to full, and when a long
> string of packets arrives back to back only the first ones will get in;
> the rest get dropped.  Normal sends have at least minimal pacing, so they
> are less likely to get synchronized drops.

Also, I forgot to mention in the previous mail: if performance is seen
to dip, batching can be disabled on WANs with:

echo 0 > /sys/class/net/<dev>/tx_batch_skbs

and batching can still be used on local/site networks in that case.


* [ofa-general] Re: [PATCH 00/10] Implement batching skb API
  2007-07-20  7:30   ` Krishna Kumar2
@ 2007-07-20  7:57     ` Stephen Hemminger
  0 siblings, 0 replies; 55+ messages in thread
From: Stephen Hemminger @ 2007-07-20  7:57 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	mcarlson, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
	mchan, tgraf, netdev, sri, davem, rdreier

On Fri, 20 Jul 2007 13:00:25 +0530
Krishna Kumar2 <krkumar2@in.ibm.com> wrote:

> Stephen Hemminger <shemminger@linux-foundation.org> wrote on 07/20/2007 12:48:48 PM:
> 
> > You may see worse performance with batching in the real world when
> > running over WANs.  Like TSO, batching will generate back-to-back packet
> > trains that are subject to multi-packet synchronized loss. The problem is
> > that intermediate router queues are often close to full, and when a long
> > string of packets arrives back to back only the first ones will get in;
> > the rest get dropped.  Normal sends have at least minimal pacing, so they
> > are less likely to get synchronized drops.
> 
> Hi Stephen,
> 
> OK. The difference I can see is that the existing code's "minimal
> pacing" could also lead to (possibly slightly less) loss, since sends
> are quick iterations at the IP layer, while with batching the iteration
> happens at the driver layer.
> 
> Is it an issue? Any suggestions?

Not an immediate issue, but it is the kind of thing that could cause performance
regression reports if it were used on every interface by default.


* Re: [PATCH 02/10] Networking include file changes.
  2007-07-20  6:32 ` [PATCH 02/10] Networking include file changes Krishna Kumar
@ 2007-07-20  9:59   ` Patrick McHardy
  2007-07-20 17:25   ` [ofa-general] " Sridhar Samudrala
  1 sibling, 0 replies; 55+ messages in thread
From: Patrick McHardy @ 2007-07-20  9:59 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: davem, rdreier, johnpol, Robert.Olsson, peter.p.waskiewicz.jr,
	herbert, gaagaan, kumarkr, xma, rick.jones2, mcarlson, netdev,
	jagana, general, mchan, tgraf, jeff, hadi, sri

Krishna Kumar wrote:
> diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
> --- org/include/linux/netdevice.h	2007-07-20 07:49:28.000000000 +0530
> +++ new/include/linux/netdevice.h	2007-07-20 08:30:55.000000000 +0530
> @@ -264,6 +264,8 @@ enum netdev_state_t
>  	__LINK_STATE_QDISC_RUNNING,
>  };
>  
> +/* Minimum length of device hardware queue for batching to work */
> +#define MIN_QUEUE_LEN_BATCH	16


Is there any downside in using batching with smaller queue sizes?


* [ofa-general] Re: [PATCH 03/10] dev.c changes.
  2007-07-20  6:32 ` [ofa-general] [PATCH 03/10] dev.c changes Krishna Kumar
@ 2007-07-20 10:04   ` Patrick McHardy
  2007-07-20 10:27     ` Krishna Kumar2
  2007-07-20 17:44   ` Sridhar Samudrala
  1 sibling, 1 reply; 55+ messages in thread
From: Patrick McHardy @ 2007-07-20 10:04 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, mcarlson, jagana, general, netdev, tgraf, jeff,
	hadi, davem, mchan, sri

Krishna Kumar wrote:
> @@ -3397,6 +3440,28 @@ int register_netdevice(struct net_device
>  		}
>  	}
>  
> +	if (dev->features & NETIF_F_BATCH_SKBS) {
> +		if (!dev->hard_start_xmit_batch ||
> +		    dev->tx_queue_len < MIN_QUEUE_LEN_BATCH) {
> +			/*
> +			 * Batch TX requires API support in driver plus have
> +			 * a minimum sized queue.
> +			 */
> +			printk(KERN_ERR "%s: Dropping NETIF_F_BATCH_SKBS "
> +					"since no API support or queue len "
> +					"is smaller than %d.\n",
> +					dev->name, MIN_QUEUE_LEN_BATCH);
> +			dev->features &= ~NETIF_F_BATCH_SKBS;


The queue length can be changed through multiple interfaces; if that
really is important, you need to catch those cases too.

> +		} else {
> +			dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
> +						 GFP_KERNEL);


Why not simply put the head in struct net_device? It seems to me that
this could also be used for gso_skb.

> +			if (dev->skb_blist) {
> +				skb_queue_head_init(dev->skb_blist);
> +				dev->tx_queue_len >>= 1;
> +			}
> +		}
> +	}
> +
>  	/*
>  	 *	nil rebuild_header routine,
>  	 *	that should be never called and used as just bug trap.
> @@ -3732,10 +3797,16 @@ void unregister_netdevice(struct net_dev
>  
>  	synchronize_net();
>  
> +	/* Deallocate batching structure */
> +	if (dev->skb_blist) {
> +		skb_queue_purge(dev->skb_blist);
> +		kfree(dev->skb_blist);
> +		dev->skb_blist = NULL;
> +	}
> +

Queue purging should be done in dev_deactivate.


* [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
  2007-07-20  6:32 ` [PATCH 04/10] net-sysfs.c changes Krishna Kumar
@ 2007-07-20 10:07   ` Patrick McHardy
  2007-07-20 10:28     ` Krishna Kumar2
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick McHardy @ 2007-07-20 10:07 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: johnpol, Robert.Olsson, jeff, herbert, gaagaan, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, jagana, general, mchan, tgraf,
	netdev, sri, hadi, davem

Krishna Kumar wrote:
> Support to turn on/off batching from /sys.


rtnetlink support seems more important than sysfs to me.


* [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
  2007-07-20  6:32 ` [ofa-general] [PATCH 05/10] sch_generic.c changes Krishna Kumar
@ 2007-07-20 10:11   ` Patrick McHardy
  2007-07-20 10:32     ` Krishna Kumar2
  2007-07-20 18:16   ` Patrick McHardy
  1 sibling, 1 reply; 55+ messages in thread
From: Patrick McHardy @ 2007-07-20 10:11 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, mcarlson, netdev, jagana, general, mchan,
	tgraf, jeff, hadi, davem, sri

Krishna Kumar wrote:
> diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
> --- org/net/sched/sch_generic.c	2007-07-20 07:49:28.000000000 +0530
> +++ new/net/sched/sch_generic.c	2007-07-20 08:30:22.000000000 +0530
> @@ -9,6 +9,11 @@
>   * Authors:	Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
>   *              Jamal Hadi Salim, <hadi@cyberus.ca> 990601
>   *              - Ingress support
> + *
> + * New functionality:
> + *		Krishna Kumar, <krkumar2@in.ibm.com>, July 2007
> + *		- Support for sending multiple skbs to devices that support
> + *		  new api - dev->hard_start_xmit_batch()


No new changelogs in source code please, git keeps track of that.

> -static inline int qdisc_restart(struct net_device *dev)
> +static inline int qdisc_restart(struct net_device *dev,
> +				struct sk_buff_head *blist)
>  {
>  	struct Qdisc *q = dev->qdisc;
>  	struct sk_buff *skb;
> -	unsigned lockless;
> +	unsigned getlock;		/* whether we need to get lock or not */


Unrelated rename, please get rid of this to reduce the noise.


* Re: [PATCH 03/10] dev.c changes.
  2007-07-20 10:04   ` [ofa-general] " Patrick McHardy
@ 2007-07-20 10:27     ` Krishna Kumar2
  2007-07-20 11:20       ` [ofa-general] " Patrick McHardy
  0 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-20 10:27 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, johnpol,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, sri, tgraf, xma

Hi Patrick,

Thanks for your comments.

Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 03:34:30 PM:

> The queue length can be changed through multiple interfaces, if that
> really is important you need to catch these cases too.

I have a TODO comment in net-sysfs.c which is to catch this case.

> > +      } else {
> > +         dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
> > +                   GFP_KERNEL);
>
>
> Why not simply put the head in struct net_device? It seems to me that
> this could also be used for gso_skb.

Without going into GSO, it is wasting some 32 bytes on i386, since most
drivers don't export this API.

> Queue purging should be done in dev_deactivate.

I originally had it in dev_deactivate, but when I did an ifdown eth0 /
ifup eth0, the system panic'd. The first solution I thought of was to
initialize the skb_blist in dev_change_flags() rather than in
register_netdev(), but then felt that a series of ifup/ifdowns would
unnecessarily check/malloc/free/initialize stuff, and so thought of
putting it in unregister_netdev (where it is balanced with
register_netdev).

Is there any reason to move this?

Thanks,

- KK



* [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
  2007-07-20 10:07   ` [ofa-general] " Patrick McHardy
@ 2007-07-20 10:28     ` Krishna Kumar2
  2007-07-20 11:21       ` Patrick McHardy
  0 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-20 10:28 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 03:37:20 PM:

> Krishna Kumar wrote:
> > Support to turn on/off batching from /sys.
>
>
> rtnetlink support seems more important than sysfs to me.

Thanks, I will add that as a patch. The reason to add to sysfs is that
it is easier to change for a user (and similar to tx_queue_len).

- KK


* [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
  2007-07-20 10:11   ` [ofa-general] " Patrick McHardy
@ 2007-07-20 10:32     ` Krishna Kumar2
  2007-07-20 11:24       ` Patrick McHardy
  0 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-20 10:32 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 03:41:01 PM:

> Krishna Kumar wrote:
> > diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
> > --- org/net/sched/sch_generic.c   2007-07-20 07:49:28.000000000 +0530
> > +++ new/net/sched/sch_generic.c   2007-07-20 08:30:22.000000000 +0530
> > @@ -9,6 +9,11 @@
> >   * Authors:   Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
> >   *              Jamal Hadi Salim, <hadi@cyberus.ca> 990601
> >   *              - Ingress support
> > + *
> > + * New functionality:
> > + *      Krishna Kumar, <krkumar2@in.ibm.com>, July 2007
> > + *      - Support for sending multiple skbs to devices that support
> > + *        new api - dev->hard_start_xmit_batch()
>
>
> No new changelogs in source code please, git keeps track of that.

Ah, didn't know this, thanks for letting me know.

> > -static inline int qdisc_restart(struct net_device *dev)
> > +static inline int qdisc_restart(struct net_device *dev,
> > +            struct sk_buff_head *blist)
> >  {
> >     struct Qdisc *q = dev->qdisc;
> >     struct sk_buff *skb;
> > -   unsigned lockless;
> > +   unsigned getlock;      /* whether we need to get lock or not */
>
>
> Unrelated rename, please get rid of this to reduce the noise.

OK, I guess I should have sent that change earlier :) The reason for
changing the name was to avoid double-negative checks, i.e. turning

      if (!lockless)
into
      if (getlock)

I will remove these changes.

thanks,

- KK


* [ofa-general] Re: [PATCH 03/10] dev.c changes.
  2007-07-20 10:27     ` Krishna Kumar2
@ 2007-07-20 11:20       ` Patrick McHardy
  2007-07-20 11:52         ` Krishna Kumar2
                           ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Patrick McHardy @ 2007-07-20 11:20 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Krishna Kumar2 wrote:
> Hi Patrick,
>
> Thanks for your comments.
>
> Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 03:34:30 PM:
>
>   
>> The queue length can be changed through multiple interfaces, if that
>> really is important you need to catch these cases too.
>>     
>
> I have a TODO comment in net-sysfs.c which is to catch this case.
>   

I noticed that. Still wondering why it is important at all though.

>   
>>> +      } else {
>>> +         dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
>>> +                   GFP_KERNEL);
>>>       
>> Why not simply put the head in struct net_device? It seems to me that
>> this could also be used for gso_skb.
>>     
>
> Without going into GSO, it is wasting some 32 bytes on i386 since most
> drivers don't export this API.
>   

32 bytes? I count 16; minus 4 for the pointer, that's 12 bytes of waste.
If you'd use it for gso_skb it would come down to 8 bytes. struct
net_device is a pig already, and there are better ways to reduce this
than starting to allocate single members of a few bytes IMO.

>   
>> Queue purging should be done in dev_deactivate.
>>     
>
> I originally had it in dev_deactivate, but when I did a ifdown eth0, ifup
> eth0,
> the system panic'd. The first solution I thought was to initialize the
> skb_blist
> in dev_change_flags() rather than in register_netdev(), but then felt that
> a
> series of ifup/ifdown will unnecessarily check stuff/malloc/free/initialize
> stuff,
> and so thought of putting it in unregister_netdev (where it is balanced
> with
> register_netdev).
>
> Is there any reason to move this ?
>   

Yes, packets can be holding references to various stuff and
these should be released on device down. As I said above I
don't really like the allocation, but even if you want to
keep it, just do the purging in dev_deactivate and keep the
freeing in unregister_netdev (actually I guess it should be
free_netdev to handle register_netdevice errors).
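
(For illustration, a minimal sketch of that split - the helper names are
made up; skb_blist, skb_queue_purge() and kfree() are the pieces already
used by the patch:)

	/* Called from dev_deactivate(): drop queued batch packets so
	 * they don't pin references while the device is down. */
	static void batch_list_purge(struct net_device *dev)
	{
		if (dev->skb_blist)
			skb_queue_purge(dev->skb_blist);
	}

	/* Called from free_netdev(): release the list itself. */
	static void batch_list_free(struct net_device *dev)
	{
		kfree(dev->skb_blist);	/* kfree(NULL) is a no-op */
		dev->skb_blist = NULL;
	}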


* [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
  2007-07-20 10:28     ` Krishna Kumar2
@ 2007-07-20 11:21       ` Patrick McHardy
  2007-07-20 16:22         ` Stephen Hemminger
  0 siblings, 1 reply; 55+ messages in thread
From: Patrick McHardy @ 2007-07-20 11:21 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Krishna Kumar2 wrote:
> Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 03:37:20 PM:
>
>
>   
>> rtnetlink support seems more important than sysfs to me.
>>     
>
> Thanks, I will add that as a patch. The reason to add to sysfs is that
> it is easier to change for a user (and similar to tx_queue_len).
>   

Thanks.


* [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
  2007-07-20 10:32     ` Krishna Kumar2
@ 2007-07-20 11:24       ` Patrick McHardy
  0 siblings, 0 replies; 55+ messages in thread
From: Patrick McHardy @ 2007-07-20 11:24 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Krishna Kumar2 wrote:
> Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 03:41:01 PM:
>   
>>> -static inline int qdisc_restart(struct net_device *dev)
>>> +static inline int qdisc_restart(struct net_device *dev,
>>> +            struct sk_buff_head *blist)
>>>  {
>>>     struct Qdisc *q = dev->qdisc;
>>>     struct sk_buff *skb;
>>> -   unsigned lockless;
>>> +   unsigned getlock;      /* whether we need to get lock or not */
>>>       
>> Unrelated rename, please get rid of this to reduce the noise.
>>     
>
> OK, I guess I should have sent that change earlier :) The reason to change
> the name is to avoid (double-negative) checks like :
>
>       if (!lockless)
> to
>       if (getlock).
>
> I will remove these changes.
>   

I guess you could put it in another patch. But frankly, I think
the biggest ugliness is the conditional locking, not the naming or
the double negation, so it won't really make the code any nicer :)


* [ofa-general] Re: [PATCH 03/10] dev.c changes.
  2007-07-20 11:20       ` [ofa-general] " Patrick McHardy
@ 2007-07-20 11:52         ` Krishna Kumar2
  2007-07-20 11:55           ` Patrick McHardy
  2007-07-20 12:09         ` Krishna Kumar2
  2007-07-20 12:25         ` Krishna Kumar2
  2 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-20 11:52 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Hi Patrick,

Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 04:50:37 PM:

> > I have a TODO comment in net-sysfs.c which is to catch this case.
> >
>
> I noticed that. Still wondering why it is important at all though.

I saw another mail of yours on the marc list on this same topic (which
still hasn't come to me in the mail), so I will answer both:

> Is there any downside in using batching with smaller queue sizes?

I think there is, but as yet I don't have any data (and 16 is probably
higher than required) to show it. If the queue size is very small (like
4), the extra processing to maintain this list may take more cycles than
the performance gain from sending out a few skbs, especially since most
xmits will send out 1 skb and skb batching takes place less often (when
the tx lock fails or the queue gets full).

OTOH, there might be a gain even in sending out 2 skbs; the problem is
the extra processing done before the xmit, not at the time of the xmit.

Does this sound OK? If so, I will add the code to implement the TODO
for tx_queue_len checking too.

> > Without going into GSO, it is wasting some 32 bytes on i386, since most
> > drivers don't export this API.
>
> 32 bytes? I count 16; minus 4 for the pointer, that's 12 bytes of waste.
> If you'd use it for gso_skb it would come down to 8 bytes. struct
> net_device is a pig already, and there are better ways to reduce this
> than starting to allocate single members of a few bytes IMO.

Sorry, I wanted to say 12 bytes on a 32-bit system but mixed it up and
said 32 bytes. So I guess static allocation is better then, and it will
also help performance, as an extra memory access is not required
(offsetof should work).

> Yes, packets can be holding references to various stuff and
> these should be released on device down. As I said above I
> don't really like the allocation, but even if you want to
> keep it, just do the purging in dev_deactivate and keep the
> freeing in unregister_netdev (actually I guess it should be
> free_netdev to handle register_netdevice errors).

Right, that makes it clean to do (and avoids stale packets on down).
I will make both of these changes now.

Thanks for these suggestions,

- KK


* [ofa-general] Re: [PATCH 03/10] dev.c changes.
  2007-07-20 11:52         ` Krishna Kumar2
@ 2007-07-20 11:55           ` Patrick McHardy
  0 siblings, 0 replies; 55+ messages in thread
From: Patrick McHardy @ 2007-07-20 11:55 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Krishna Kumar2 wrote:
> Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 04:50:37 PM
>> Is there any downside in using batching with smaller queue sizes?
>>     
>
> I think there is, but as yet I don't have any data (and 16 is probably
> higher than required) to show it. If the queue size is very small (like
> 4), the extra processing to maintain this list may take more cycles than
> the performance gain from sending out a few skbs, especially since most
> xmits will send out 1 skb and skb batching takes place less often (when
> the tx lock fails or the queue gets full).
>
> OTOH, there might be a gain even in sending out 2 skbs; the problem is
> the extra processing done before the xmit, not at the time of the xmit.
>
> Does this sound OK? If so, I will add the code to implement the TODO
> for tx_queue_len checking too.

I can't really argue about the numbers, but it seems to me that only
devices which *usually* have a sufficient queue length will support
this, and anyone setting the queue length of a gbit device to <16 is
begging for trouble anyway. So it doesn't really seem worth bloating
the code to handle an insane configuration, as long as it doesn't
break.


* [ofa-general] Re: [PATCH 03/10] dev.c changes.
  2007-07-20 11:20       ` [ofa-general] " Patrick McHardy
  2007-07-20 11:52         ` Krishna Kumar2
@ 2007-07-20 12:09         ` Krishna Kumar2
  2007-07-20 12:25         ` Krishna Kumar2
  2 siblings, 0 replies; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-20 12:09 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Patrick McHardy <kaber@trash.net> wrote on 07/20/2007:

> I can't really argue about the numbers, but it seems to me that only
> devices which *usually* have a sufficient queue length will support
> this, and anyone setting the queue length of a gbit device to <16 is
> begging for trouble anyway. So it doesn't really seem worth bloating
> the code to handle an insane configuration, as long as it doesn't
> break.

Ah, I get your point now. So if the driver sets BATCHING and the user
then sets queue_len to (say) 4, then poor results are expected (and the
kernel doesn't need to try to fix it). Same for the driver setting
BATCHING when its queue is small in the first place, which no driver
writer should do anyway. I think it makes the code a lot easier too.
Will update.

thanks,

- KK


* [ofa-general] Re: [PATCH 03/10] dev.c changes.
  2007-07-20 11:20       ` [ofa-general] " Patrick McHardy
  2007-07-20 11:52         ` Krishna Kumar2
  2007-07-20 12:09         ` Krishna Kumar2
@ 2007-07-20 12:25         ` Krishna Kumar2
  2007-07-20 12:37           ` Patrick McHardy
  2 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-20 12:25 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 04:50:37 PM:

> 32 bytes? I count 16; minus 4 for the pointer, that's 12 bytes of waste.
> If you'd use it for gso_skb it would come down to 8 bytes. struct
> net_device is a pig already, and there are better ways to reduce this
> than starting to allocate single members of a few bytes IMO.

Currently, this allocated pointer is an indication to let kernel users
(qdisc_restart, setting/resetting tx_batch_skbs) know whether batching
is enabled or disabled. Removing the pointer and making it static means
those users cannot figure out this information. Adding another field to
netdev may be a bad idea, so I am thinking of overloading dev->features
with a new flag (other than NETIF_F_BATCH_SKBS, since that is a driver
capabilities flag) which can be set/cleared based on the
NETIF_F_BATCH_SKBS bit. Does this approach sound OK?
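
(For illustration, a rough sketch of what I am thinking of - the
NETIF_F_BATCH_ON name and its value are made up:)

	#define NETIF_F_BATCH_ON	32768	/* batching currently on */

	/* NETIF_F_BATCH_SKBS stays a driver capability bit; the new bit
	 * only tracks whether batching is enabled right now. */
	static inline int netif_batching_on(const struct net_device *dev)
	{
		return (dev->features & NETIF_F_BATCH_ON) != 0;
	}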

Thanks,

- KK


* [ofa-general] Re: [PATCH 03/10] dev.c changes.
  2007-07-20 12:25         ` Krishna Kumar2
@ 2007-07-20 12:37           ` Patrick McHardy
  0 siblings, 0 replies; 55+ messages in thread
From: Patrick McHardy @ 2007-07-20 12:37 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Krishna Kumar2 wrote:
> Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 04:50:37 PM:
>
>   
>> 32 bytes? I count 16; minus 4 for the pointer, that's 12 bytes of waste.
>> If you'd use it for gso_skb it would come down to 8 bytes. struct
>> net_device is a pig already, and there are better ways to reduce this
>> than starting to allocate single members of a few bytes IMO.
>>     
>
> Currently, this allocated pointer is an indication to let kernel users
> (qdisc_restart, setting/resetting tx_batch_skbs) know whether batching
> is enabled or disabled. Removing the pointer and making it static means
> those users cannot figure out this information. Adding another field to
> netdev may be a bad idea, so I am thinking of overloading dev->features
> with a new flag (other than NETIF_F_BATCH_SKBS, since that is a driver
> capabilities flag) which can be set/cleared based on the
> NETIF_F_BATCH_SKBS bit. Does this approach sound OK?
>   

I guess so. It would be more consistent with things like HW checksumming,
though, to handle this through ethtool and have the ethtool callbacks
set or clear just the one feature bit. That would mean you don't need
to provide any further indication of the device's capabilities to the
stack, since only the driver enables or disables the feature.
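
(For illustration, a minimal sketch of such a callback, analogous to the
existing set_tso/set_sg offload toggles - the mydrv_set_batching name is
made up and the ethtool_ops wiring is omitted:)

	static int mydrv_set_batching(struct net_device *dev, u32 data)
	{
		if (data)
			dev->features |= NETIF_F_BATCH_SKBS;
		else
			dev->features &= ~NETIF_F_BATCH_SKBS;
		return 0;
	}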


* [ofa-general] Re: [PATCH 00/10] Implement batching skb API
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
                   ` (10 preceding siblings ...)
  2007-07-20  7:18 ` [ofa-general] Re: [PATCH 00/10] Implement batching skb API Stephen Hemminger
@ 2007-07-20 12:54 ` Evgeniy Polyakov
  2007-07-20 13:02   ` Krishna Kumar2
  2007-07-23  4:23   ` Krishna Kumar2
  2007-07-21 13:18 ` [ofa-general] " jamal
  12 siblings, 2 replies; 55+ messages in thread
From: Evgeniy Polyakov @ 2007-07-20 12:54 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: jagana, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan,
	tgraf, netdev, sri, hadi, davem

Hi Krishna.

On Fri, Jul 20, 2007 at 12:01:49PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> After fine-tuning qdisc and other changes, I modified IPoIB to use this API,
> and now get good gains. Summary for TCP & No Delay: 1 process improves for
> all cases from 1.4% to 49.5%; 4 process has almost identical improvements
> from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to
> 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP
> was tested with 1 process netperf with small increase in BW but big
> improvement in Service Demand. Netperf latency tests show small drop in
> transaction rate (results in separate attachment).

What about a round-robin TCP time and latency test? In theory such a
batching mode should not change those timings, but practice can reveal
new aspects. I will review the code later this week (likely tomorrow)
and will get back to you if there are any issues.

-- 
	Evgeniy Polyakov


* [ofa-general] Re: [PATCH 00/10] Implement batching skb API
  2007-07-20 12:54 ` [ofa-general] " Evgeniy Polyakov
@ 2007-07-20 13:02   ` Krishna Kumar2
  2007-07-23  4:23   ` Krishna Kumar2
  1 sibling, 0 replies; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-20 13:02 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: jagana, mcarlson, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
	mchan, tgraf, netdev, davem, sri

Hi Evgeniy,

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 07/20/2007 06:24:23 PM:

> > After fine-tuning qdisc and other changes, I modified IPoIB to use this
API,
> > and now get good gains. Summary for TCP & No Delay: 1 process improves
for
> > all cases from 1.4% to 49.5%; 4 process has almost identical
improvements
> > from -1.7% to 59.1%; 16 process case also improves in the range of
-1.2% to
> > 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%).
UDP
> > was tested with 1 process netperf with small increase in BW but big
> > improvement in Service Demand. Netperf latency tests show small drop in
> > transaction rate (results in separate attachment).
>
> What about a round-robin TCP time and latency test? In theory such a
> batching mode should not change those timings, but practice can reveal
> new aspects. I will review the code later this week (likely tomorrow)
> and will get back to you if there are any issues.

I had run the RR test quite some time back and don't have the result at
hand, other than remembering it was almost the same as the original. As
I am running some tests on those systems at this time, I can send the RR
results tomorrow.

Thanks,

- KK


* [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
  2007-07-20 11:21       ` Patrick McHardy
@ 2007-07-20 16:22         ` Stephen Hemminger
  2007-07-21  6:46           ` Krishna Kumar2
  0 siblings, 1 reply; 55+ messages in thread
From: Stephen Hemminger @ 2007-07-20 16:22 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: jagana, herbert, gaagaan, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general, mchan,
	tgraf, netdev, johnpol, davem, sri

On Fri, 20 Jul 2007 13:21:51 +0200
Patrick McHardy <kaber@trash.net> wrote:

> Krishna Kumar2 wrote:
> > Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 03:37:20 PM:
> >
> >
> >   
> >> rtnetlink support seems more important than sysfs to me.
> >>     
> >
> > Thanks, I will add that as a patch. The reason to add to sysfs is that
> > it is easier to change for a user (and similar to tx_queue_len).
> >   
> 

But since batching is so similar to TSO, it really should be part of the
flags and controlled by ethtool like other offload flags.


* [ofa-general] Re: [PATCH 02/10] Networking include file changes.
  2007-07-20  6:32 ` [PATCH 02/10] Networking include file changes Krishna Kumar
  2007-07-20  9:59   ` Patrick McHardy
@ 2007-07-20 17:25   ` Sridhar Samudrala
  2007-07-21  6:30     ` Krishna Kumar2
  1 sibling, 1 reply; 55+ messages in thread
From: Sridhar Samudrala @ 2007-07-20 17:25 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, jagana, general, netdev,
	tgraf, jeff, hadi, davem, mchan

On Fri, 2007-07-20 at 12:02 +0530, Krishna Kumar wrote:
> Networking include file changes for batching.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
>  linux/netdevice.h |   10 ++++++++++
>  net/pkt_sched.h   |    6 +++---
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
> --- org/include/linux/netdevice.h	2007-07-20 07:49:28.000000000 +0530
> +++ new/include/linux/netdevice.h	2007-07-20 08:30:55.000000000 +0530
> @@ -264,6 +264,8 @@ enum netdev_state_t
>  	__LINK_STATE_QDISC_RUNNING,
>  };
> 
> +/* Minimum length of device hardware queue for batching to work */
> +#define MIN_QUEUE_LEN_BATCH	16
> 
>  /*
>   * This structure holds at boot time configured netdevice settings. They
> @@ -340,6 +342,7 @@ struct net_device
>  #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
>  #define NETIF_F_GSO		2048	/* Enable software GSO. */
>  #define NETIF_F_LLTX		4096	/* LockLess TX */
> +#define NETIF_F_BATCH_SKBS	8192	/* Driver supports batch skbs API */
>  #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
> 
>  	/* Segmentation offload features */
> @@ -452,6 +455,8 @@ struct net_device
>  	struct Qdisc		*qdisc_sleeping;
>  	struct list_head	qdisc_list;
>  	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
> +	unsigned long		xmit_slots;	/* Device free slots */
> +	struct sk_buff_head	*skb_blist;	/* List of batch skbs */
> 
>  	/* Partially transmitted GSO packet. */
>  	struct sk_buff		*gso_skb;
> @@ -472,6 +477,9 @@ struct net_device
>  	void			*priv;	/* pointer to private data	*/
>  	int			(*hard_start_xmit) (struct sk_buff *skb,
>  						    struct net_device *dev);
> +	int			(*hard_start_xmit_batch) (struct net_device
> +							  *dev);
> +
>  	/* These may be needed for future network-power-down code. */
>  	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
> 
> @@ -832,6 +840,8 @@ extern int		dev_set_mac_address(struct n
>  					    struct sockaddr *);
>  extern int		dev_hard_start_xmit(struct sk_buff *skb,
>  					    struct net_device *dev);
> +extern int		dev_add_skb_to_blist(struct sk_buff *skb,
> +					     struct net_device *dev);
> 
>  extern void		dev_init(void);
> 
> diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
> --- org/include/net/pkt_sched.h	2007-07-20 07:49:28.000000000 +0530
> +++ new/include/net/pkt_sched.h	2007-07-20 08:30:22.000000000 +0530
> @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
>  		struct rtattr *tab);
>  extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
> 
> -extern void __qdisc_run(struct net_device *dev);
> +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);

Why do we need this additional 'blist' argument?
Is this different from dev->skb_blist?

> 
> -static inline void qdisc_run(struct net_device *dev)
> +static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
>  {
>  	if (!netif_queue_stopped(dev) &&
>  	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
> -		__qdisc_run(dev);
> +		__qdisc_run(dev, blist);
>  }
> 
>  extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,


* [ofa-general] Re: [PATCH 03/10] dev.c changes.
  2007-07-20  6:32 ` [ofa-general] [PATCH 03/10] dev.c changes Krishna Kumar
  2007-07-20 10:04   ` [ofa-general] " Patrick McHardy
@ 2007-07-20 17:44   ` Sridhar Samudrala
  2007-07-21  6:44     ` Krishna Kumar2
  1 sibling, 1 reply; 55+ messages in thread
From: Sridhar Samudrala @ 2007-07-20 17:44 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, mcarlson, kaber, jagana, general, netdev,
	tgraf, jeff, hadi, davem, mchan

On Fri, 2007-07-20 at 12:02 +0530, Krishna Kumar wrote:
> Changes in dev.c to support batching : add dev_add_skb_to_blist,
> register_netdev recognizes batch aware drivers, and net_tx_action is
> the sole user of batching.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
>  dev.c |   77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 files changed, 74 insertions(+), 3 deletions(-)
> 
> diff -ruNp org/net/core/dev.c new/net/core/dev.c
> --- org/net/core/dev.c	2007-07-20 07:49:28.000000000 +0530
> +++ new/net/core/dev.c	2007-07-20 08:31:35.000000000 +0530

<snip>

> @@ -1566,7 +1605,7 @@ gso:
>  			/* reset queue_mapping to zero */
>  			skb->queue_mapping = 0;
>  			rc = q->enqueue(skb, q);
> -			qdisc_run(dev);
> +			qdisc_run(dev, NULL);

OK. So you are passing a NULL blist here. However, I am
not sure why batching is not used in this situation.

Thanks
Sridhar

>  			spin_unlock(&dev->queue_lock);
> 
>  			rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
> @@ -1763,7 +1802,11 @@ static void net_tx_action(struct softirq
>  			clear_bit(__LINK_STATE_SCHED, &dev->state);
> 
>  			if (spin_trylock(&dev->queue_lock)) {
> -				qdisc_run(dev);
> +				/*
> +				 * Try to send out all skbs if batching is
> +				 * enabled.
> +				 */
> +				qdisc_run(dev, dev->skb_blist);
>  				spin_unlock(&dev->queue_lock);
>  			} else {
>  				netif_schedule(dev);
> @@ -3397,6 +3440,28 @@ int register_netdevice(struct net_device
>  		}
>  	}
> 
> +	if (dev->features & NETIF_F_BATCH_SKBS) {
> +		if (!dev->hard_start_xmit_batch ||
> +		    dev->tx_queue_len < MIN_QUEUE_LEN_BATCH) {
> +			/*
> +			 * Batch TX requires API support in driver plus have
> +			 * a minimum sized queue.
> +			 */
> +			printk(KERN_ERR "%s: Dropping NETIF_F_BATCH_SKBS "
> +					"since no API support or queue len "
> +					"is smaller than %d.\n",
> +					dev->name, MIN_QUEUE_LEN_BATCH);
> +			dev->features &= ~NETIF_F_BATCH_SKBS;
> +		} else {
> +			dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
> +						 GFP_KERNEL);
> +			if (dev->skb_blist) {
> +				skb_queue_head_init(dev->skb_blist);
> +				dev->tx_queue_len >>= 1;
> +			}
> +		}
> +	}
> +
>  	/*
>  	 *	nil rebuild_header routine,
>  	 *	that should be never called and used as just bug trap.
> @@ -3732,10 +3797,16 @@ void unregister_netdevice(struct net_dev
> 
>  	synchronize_net();
> 
> +	/* Deallocate batching structure */
> +	if (dev->skb_blist) {
> +		skb_queue_purge(dev->skb_blist);
> +		kfree(dev->skb_blist);
> +		dev->skb_blist = NULL;
> +	}
> +
>  	/* Shutdown queueing discipline. */
>  	dev_shutdown(dev);
> 
> -
>  	/* Notify protocols, that we are about to destroy
>  	   this device. They should clean all the things.
>  	*/


* [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
  2007-07-20  6:32 ` [ofa-general] [PATCH 05/10] sch_generic.c changes Krishna Kumar
  2007-07-20 10:11   ` [ofa-general] " Patrick McHardy
@ 2007-07-20 18:16   ` Patrick McHardy
  2007-07-21  6:56     ` Krishna Kumar2
  1 sibling, 1 reply; 55+ messages in thread
From: Patrick McHardy @ 2007-07-20 18:16 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, mcarlson, netdev, jagana, general, mchan,
	tgraf, jeff, hadi, davem, sri

Krishna Kumar wrote:
> +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> +			  struct sk_buff_head *blist,
> +			  struct sk_buff **skbp)
> +{
> +	if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
> +		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> +	} else {
> +		int max = dev->tx_queue_len - skb_queue_len(blist);


I'm assuming the driver will simply leave excess packets in the
blist for the next run. The check for tx_queue_len is wrong though;
it's only a default which can be overridden, and some qdiscs don't
care for it at all.
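
(For illustration, a minimal sketch of that assumption - the mydrv_*
helpers are made up and locking is elided:)

	static int mydrv_hard_start_xmit_batch(struct net_device *dev)
	{
		struct sk_buff *skb;

		/* Drain the batch list until the TX ring fills up;
		 * excess packets stay queued for the next run. */
		while ((skb = __skb_dequeue(dev->skb_blist)) != NULL) {
			if (mydrv_tx_ring_full(dev)) {
				/* put it back; the next run retries it */
				__skb_queue_head(dev->skb_blist, skb);
				netif_stop_queue(dev);
				return NETDEV_TX_BUSY;
			}
			mydrv_queue_skb(dev, skb);
		}
		return NETDEV_TX_OK;
	}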

> +		struct sk_buff *skb;
> +
> +		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> +			max -= dev_add_skb_to_blist(skb, dev);
> +
> +		*skbp = NULL;
> +		return 1;	/* we have atleast one skb in blist */
> +	}
> +}



> -void __qdisc_run(struct net_device *dev)
> +void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)


And the patches should really be restructured so that this change is
in the same patch that changes the header and the caller, for example.

>  {
>  	do {
> -		if (!qdisc_restart(dev))
> +		if (!qdisc_restart(dev, blist))
>  			break;
>  	} while (!netif_queue_stopped(dev));
>  
> -


* [ofa-general] Re: [PATCH 02/10] Networking include file changes.
  2007-07-20 17:25   ` [ofa-general] " Sridhar Samudrala
@ 2007-07-21  6:30     ` Krishna Kumar2
  2007-07-23  5:59       ` Sridhar Samudrala
  0 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-21  6:30 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	mcarlson, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
	mchan, tgraf, netdev, davem, rdreier

Hi Sridhar,

Sridhar Samudrala <sri@us.ibm.com> wrote on 07/20/2007 10:55:05 PM:
> > diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
> > --- org/include/net/pkt_sched.h   2007-07-20 07:49:28.000000000 +0530
> > +++ new/include/net/pkt_sched.h   2007-07-20 08:30:22.000000000 +0530
> > @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
> >        struct rtattr *tab);
> >  extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
> >
> > -extern void __qdisc_run(struct net_device *dev);
> > +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
>
> Why do we need this additional 'blist' argument?
> Is this different from dev->skb_blist?

It is the same, but I want to call it mostly with NULL and rarely with
the batch list pointer (so it is related to your other question). My
original code didn't have this and was trying batching in all cases. But
in most xmits (probably almost all), there will be only one packet in
the queue to send and batching will never happen. When there is lock
contention or the queue is stopped, the next iteration will find >1
packets. But I will still skip batching for the lock-failure case, as
there will probably be only 2 packets (one from the previous time and 1
from this time, or 3 after two failures, etc), and try batching only
when the queue was stopped, from net_tx_action (this was based on Dave
Miller's idea).

Thanks,

- KK


* [ofa-general] Re: [PATCH 03/10] dev.c changes.
  2007-07-20 17:44   ` Sridhar Samudrala
@ 2007-07-21  6:44     ` Krishna Kumar2
  0 siblings, 0 replies; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-21  6:44 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	mcarlson, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
	mchan, tgraf, netdev, davem, rdreier

Hi Sridhar,

Sridhar Samudrala <sri@us.ibm.com> wrote on 07/20/2007 11:14:19 PM:

> > @@ -1566,7 +1605,7 @@ gso:
> >           /* reset queue_mapping to zero */
> >           skb->queue_mapping = 0;
> >           rc = q->enqueue(skb, q);
> > -         qdisc_run(dev);
> > +         qdisc_run(dev, NULL);
>
> OK. So you are passing a NULL blist here. However, I am
> not sure why batching is not used in this situation.

Actually it could be used, but in most cases there will be only one
skb. If I pass the blist here, the result (for the batching case) will
be to put one single skb into the blist and call the new xmit API. That
wastes cycles: we take an skb off the queue (as in the regular code),
then add it to the blist (different in the new code), and then the
driver has to remove this skb from the blist (different in the new
code). I could try batching but then require more than one skb before
adding to the blist (or unless the blist already has skbs, in which case
adding even one skb makes sense). Also, it has a slight impact on
regular drivers, where each xmit makes one extra dereference of
dev->skb_blist (which is always NULL) - another reason to always pass
NULL.
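
(For illustration, a minimal sketch of that policy - this helper is
hypothetical, not in the posted patches:)

	/* Hand out the blist only when batching can actually help:
	 * either it already holds skbs, or >1 packets are queued. */
	static inline struct sk_buff_head *xmit_blist(struct net_device *dev,
						      struct Qdisc *q)
	{
		if (dev->skb_blist &&
		    (skb_queue_len(dev->skb_blist) || qdisc_qlen(q) > 1))
			return dev->skb_blist;
		return NULL;
	}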

I will check what the results are by passing the blist here too and make
the above change. I will run tests for that (as well as the netperf RR
test that Evgeniy asked for).

Thanks,

- KK


* [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
  2007-07-20 16:22         ` Stephen Hemminger
@ 2007-07-21  6:46           ` Krishna Kumar2
  2007-07-23  9:56             ` Stephen Hemminger
  0 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-21  6:46 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	mcarlson, peter.p.waskiewicz.jr, hadi, Patrick McHardy, jeff,
	general, mchan, tgraf, netdev, sri, davem, rdreier

Stephen Hemminger <shemminger@linux-foundation.org> wrote on 07/20/2007 09:52:03 PM:
> Patrick McHardy <kaber@trash.net> wrote:
>
> > Krishna Kumar2 wrote:
> > > Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 03:37:20 PM:
> > >
> > >
> > >
> > >> rtnetlink support seems more important than sysfs to me.
> > >>
> > >
> > > Thanks, I will add that as a patch. The reason to add to sysfs is
that
> > > it is easier to change for a user (and similar to tx_queue_len).
> > >
> >
>
> But since batching is so similar to TSO, it really should be part of the
> flags and controlled by ethtool like other offload flags.

So should I add all three interfaces (or which ones):

      1. /sys (like for tx_queue_len)
      2. netlink
      3. ethtool

Or are only 2 & 3 enough?

thanks,

- KK


* [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
  2007-07-20 18:16   ` Patrick McHardy
@ 2007-07-21  6:56     ` Krishna Kumar2
  2007-07-22 17:03       ` Patrick McHardy
  0 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-21  6:56 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Hi Patrick,

Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 11:46:36 PM:

> Krishna Kumar wrote:
> > +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> > +           struct sk_buff_head *blist,
> > +           struct sk_buff **skbp)
> > +{
> > +   if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
> > +      return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> > +   } else {
> > +      int max = dev->tx_queue_len - skb_queue_len(blist);
>
>
> I'm assuming the driver will simply leave excess packets in the
> blist for the next run.

Yes, and the next run will be scheduled even if no more xmits are
called, depending on what qdisc_restart()'s call into the driver returns:
      BUSY : the driver failed to send everything; net_tx_action will
             handle this later (the case you mentioned)
      OK   : if qlen is > 0, return 1 and __qdisc_run() will retry
             (by which point the blist length will be zero, as the
             driver processed EVERYTHING on the blist)

> The check for tx_queue_len is wrong though;
> it's only a default which can be overridden, and some qdiscs don't
> care for it at all.

I think it should not matter whether qdiscs use this or not, or even if
it is modified (unless it is made zero, in which case this breaks). The
intention behind this check is to make sure that no more than
tx_queue_len skbs are in all queues put together (q->qdisc +
dev->skb_blist); otherwise the blist can become too large, breaking the
idea of tx_queue_len. Is that a good justification?

> > -void __qdisc_run(struct net_device *dev)
> > +void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
>
>
> And the patches should really be restructured so this change is
> in the same patch changing the header and the caller, for example.

Ah, OK.

Thanks,

- KK


* [ofa-general] Re: [PATCH 00/10] Implement batching skb API
  2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
                   ` (11 preceding siblings ...)
  2007-07-20 12:54 ` [ofa-general] " Evgeniy Polyakov
@ 2007-07-21 13:18 ` jamal
  2007-07-22  6:27   ` Krishna Kumar2
  12 siblings, 1 reply; 55+ messages in thread
From: jamal @ 2007-07-21 13:18 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan,
	tgraf, netdev, sri, jagana, davem


I am (and have been) in extreme travel mode, so I will have high latency
in follow-ups.

On Fri, 2007-20-07 at 12:01 +0530, Krishna Kumar wrote:
> Hi Dave, Roland, everyone,
> 
> In May, I had proposed creating an API for sending 'n' skbs to a driver to
> reduce lock overhead, DMA operations, and specific to drivers that have
> completion notification like IPoIB - reduce completion handling ("[RFC] New
> driver API to speed up small packets xmits" @
> http://marc.info/?l=linux-netdev&m=117880900818960&w=2). I had also sent
> initial test results for E1000 which showed minor improvements (but also
> got degradations) @http://marc.info/?l=linux-netdev&m=117887698405795&w=2.
> 

Add to that context that I have been putting out patches on this over
the last 3+ years, as well as several public presentations - the last
one being http://vger.kernel.org/jamal_netconf2006.sxi

My main problem (and obstacle to submitting the patches) has been a
result of not doing the appropriate testing - I had been testing the
forwarding path (in all my results since the latest patches) when I
should really have been testing the improvement of the tx path.

> There is a parallel WIP by Jamal but the two implementations are completely
> different since the code bases from the start were separate. Key changes:
> 	- Use a single qdisc interface to avoid code duplication and reduce
> 	  maintainability (sch_generic.c size reduces by ~9%).
> 	- Has per device configurable parameter to turn on/off batching.
> 	- qdisc_restart gets slightly modified while looking simple without
> 	  any checks for batching vs regular code (infact only two lines have
> 	  changed - 1. instead of dev_dequeue_skb, a new batch-aware function
> 	  is called; and 2. an extra call to hard_start_xmit_batch.

> 	- No change in__qdisc_run other than a new argument (from DM's idea).
> 	- Applies to latest net-2.6.23 compared to 2.6.22-rc4 code.

All the above are cosmetic differences. To me the highest priority is
making sure that batching is useful and finding what its limitations
are. At some point, when all looks good, I don't mind adding an ethtool
interface to turn batching on/off, merging with the new qdisc restart
path instead of having a parallel path, soliciting feedback on naming,
where to allocate structs, etc. All that is low priority if batching
across a variety of hardware and applications doesn't prove useful. At
the moment, I am unsure there is enough consistency to justify pushing
batching in.

Having said that, below are the main architectural differences we have,
which is what we really need to discuss to see what proves useful:

>         - Batching algo/processing is different (eg. if
>           qdisc_restart() finds
> 	  one skb in the batch list, it will try to batch more (upto a limit)
> 	  instead of sending that out and batching the rest in the next call.

This sounds a little more aggressive but maybe useful.
I have experimented with setting upper bound limits (current patches
have a pktgen interface to set the max to send) and have concluded that
it is unneeded. Probing by letting the driver tell you what space is
available has proven to be the best approach. I have been meaning to
remove the code in pktgen which allows these limits.
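
(For illustration, a minimal sketch of the probing idea - the driver
hook and its argument are made up; only dev->xmit_win is from my
patches:)

	/* TX completion path: republish how many skbs the core may
	 * safely hand us on the next batch. */
	static void mydrv_tx_complete(struct net_device *dev, int ring_free)
	{
		dev->xmit_win = ring_free;
		if (netif_queue_stopped(dev) && ring_free > 0)
			netif_wake_queue(dev);
	}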
 
> 	- Jamal's code has a separate hw prep handler called from the stack,
> 	  and results are accessed in driver during xmit later.

I have explained the reasoning to this a few times. A recent response to
Michael Chan is here:
http://marc.info/?l=linux-netdev&m=118346921316657&w=2
And heres a response to you that i havent heard back on:
http://marc.info/?l=linux-netdev&m=118355539503924&w=2

My tests so far indicate this interface is useful. It doesnt apply well
to some drivers (for example i dont use it in tun) - which makes it
optional but useful nevertheless. I will be more than happy to kill this
if i can find cases where it proves to be a bad idea.

> 	- Jamal's code has dev->xmit_win which is cached by the driver. Mine
> 	  has dev->xmit_slots but this is used only by the driver while the
> 	  core has a different mechanism to find how many skbs to batch.

This is related to the first item.

> 	- Completely different structure/design & coding styles.
> (This patch will work with drivers updated by Jamal, Matt & Michael Chan with
> minor modifications - rename xmit_win to xmit_slots & rename batch handler)

Again, cosmetics (and indication you are morphing towards me).

So if i was to sum up (it would be useful to have a discussion on
these), the real differences are:

a) you have an extra check on refilling the skb list when you find that
it has a single skb. I tagged this as being potentially useful.
b) You have a check for some upper bound on the number of skbs to send
to the driver. I tagged this as unnecessary - the interface is still on
in my current code, so it shouldnt be hard to show one way or other.
c) You dont have prep_xmit()

Add to that list any other architectural differences i may have missed
and lets discuss and hopefully make some good progress.

cheers,
jamal


* [ofa-general] TCP and batching WAS(Re: [PATCH 00/10] Implement batching skb API
  2007-07-20  7:18 ` [ofa-general] Re: [PATCH 00/10] Implement batching skb API Stephen Hemminger
  2007-07-20  7:30   ` Krishna Kumar2
  2007-07-20  7:47   ` Krishna Kumar2
@ 2007-07-21 13:46   ` jamal
  2007-07-23  9:44     ` Stephen Hemminger
  2 siblings, 1 reply; 55+ messages in thread
From: jamal @ 2007-07-21 13:46 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: johnpol, herbert, gaagaan, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan,
	tgraf, netdev, sri, jagana, davem

On Fri, 2007-20-07 at 08:18 +0100, Stephen Hemminger wrote:

> You may see worse performance with batching in the real world when
> running over WAN's.  Like TSO, batching will generate back to back packet
> trains that are subject to multi-packet synchronized loss. 

Has someone done any study on the TSO effect? Doesnt ECN with a RED router
help on something like this?
I find it surprising that a single flow doing TSO would overwhelm a
router's buffer. I actually think the value of batching as far as TCP is
concerned is proportional to the number of flows, i.e the more flows you
have the more batching you will end up doing. And if TCP's fairness is
the legend it has been made out to be, then i dont see this as
problematic.

BTW, something i noticed with regards to GSO when testing batching:
For TCP packets slightly above the MTU (upto 2K), GSO gives worse
performance than non-GSO. This actually has nothing to do with batching -
it works the same way with or without the batching changes.

Another oddity:
Looking at the flow rate from a purely packets/second view (I know thats a
router centric view, but i found it strange nevertheless) - you see that
as packet size goes up, the pps also goes up. I tried mucking around
with nagle etc, but saw no observable changes. Any insight?
My expectation was that the pps would stay at least the same or get
better with smaller packets (assuming theres less data to push around).

cheers,
jamal


* [ofa-general] Re: [PATCH 00/10] Implement batching skb API
  2007-07-21 13:18 ` [ofa-general] " jamal
@ 2007-07-22  6:27   ` Krishna Kumar2
  2007-07-22 12:51     ` jamal
  0 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-22  6:27 UTC (permalink / raw)
  To: hadi
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, netdev, general, mchan,
	tgraf, sri, davem, herbert

Hi Jamal,

J Hadi Salim <j.hadi123@gmail.com> wrote on 07/21/2007 06:48:41 PM:

> >    - Use a single qdisc interface to avoid code duplication and reduce
> >      maintainability (sch_generic.c size reduces by ~9%).
> >    - Has per device configurable parameter to turn on/off batching.
> >    - qdisc_restart gets slightly modified while looking simple without
> >      any checks for batching vs regular code (infact only two lines have
> >      changed - 1. instead of dev_dequeue_skb, a new batch-aware function
> >      is called; and 2. an extra call to hard_start_xmit_batch.
>
> >    - No change in __qdisc_run other than a new argument (from DM's idea).
> >    - Applies to latest net-2.6.23 compared to 2.6.22-rc4 code.
>
> All the above are cosmetic differences. To me the highest priority
> is making sure that batching is useful and what the limitations are.
> At some point, when all looks good - i dont mind adding an ethtool
> interface to turn off/on batching, merge with the new qdisc restart path
> instead of having a parallel path, solicit feedback on naming, where to
> allocate structs etc etc. All that is low prio if batching across a
> variety of hardware and applications doesnt prove useful. At the moment,
> i am unsure theres enough consistency to justify pushing batching in.

Batching need not be useful for every hardware. If there is hardware that
is useful to exploit batching (like clearly IPoIB is a good candidate as
both the TX and the TX completion path can handle multiple skb processing,
and I haven't looked at other drivers to see if any of them can do something
similar), then IMHO it makes sense to enable batching for that hardware. It
is upto the other drivers to determine whether converting to the batching
API makes sense or not. And as indicated, the total size increase for adding
the kernel support is also insignificant - 0.03%, or 1164 bytes (using the
'size' command).

> Having said that, below are the main architectural differences we have,
> which are what we really need to discuss and see what proves useful:
>
> >         - Batching algo/processing is different (eg. if qdisc_restart() finds
> >      one skb in the batch list, it will try to batch more (upto a limit)
> >      instead of sending that out and batching the rest in the next call.
>
> This sounds a little more aggressive but maybe useful.
> I have experimented with setting upper bound limits (current patches
> have a pktgen interface to set the max to send) and have concluded that
> it is unneeded. Probing by letting the driver tell you what space is
> available has proven to be the best approach. I have been meaning to
> remove the code in pktgen which allows these limits.

I don't quite agree with that approach, eg, if the blist is empty and the
driver tells there is space for one packet, you will add one packet and
the driver sends it out and the device is stopped (with potentially lot of
skbs on dev->q). Then no packets are added till the queue is enabled, at
which time a flood of skbs will be processed increasing latency and holding
lock for a single longer duration. My approach will mitigate holding lock
for longer times and instead send skbs to the device as long as we are
within the limits.
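
A sketch of the contrast being drawn here - fill the batch list up to a
configured bound rather than to the driver's momentary free space, so the
device keeps being fed while the lock is held once. Illustrative only,
not the posted patch:

/* Sketch of the bounded-fill approach described above.  'limit' is
 * the configurable bound; skb_blist is from the posted series. */
static void refill_blist(struct net_device *dev, int limit)
{
	while (skb_queue_len(dev->skb_blist) < limit) {
		struct sk_buff *skb = dev->qdisc->dequeue(dev->qdisc);

		if (!skb)
			break;
		__skb_queue_tail(dev->skb_blist, skb);
	}
}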

Infact in my rev2 patch (being sent today or tomorrow after handling
Patrick's and Stephen's comments), I am even removing the driver specific
xmit_slots as I find it is adding bloat and requires more cycles than
calculating the value each time xmit is done (ofcourse in your approach it
is required since the stack uses it).

> >    - Jamal's code has a separate hw prep handler called from the stack,
> >      and results are accessed in driver during xmit later.
>
> I have explained the reasoning to this a few times. A recent response to
> Michael Chan is here:
> http://marc.info/?l=linux-netdev&m=118346921316657&w=2

Since E1000 doesn't seem to use the TX lock on RX (atleast I couldn't find
it), I feel having prep will not help as no other cpu can execute the
queue/xmit code anyway (E1000 is also a LLTX driver). Other drivers that
hold the tx lock could get an improvement however.

> And heres a response to you that i havent heard back on:
> http://marc.info/?l=linux-netdev&m=118355539503924&w=2

That is because it answered my query :) It is what I was expecting, but
thanks for the explanation.

> My tests so far indicate this interface is useful. It doesnt apply well

I wonder if you tried enabling/disabling 'prep' on E1000 to see how the
performance is affected. If it helps, I guess you could send me a patch to
add that and I can also test it to see what the effect is. I didn't add it
since IPoIB wouldn't be able to exploit it (unless someone is kind enough
to show me how to).

> So if i was to sum up this, (it would be useful discussion to have on
> these) the real difference is:
>
> a) you have an extra check on refilling the skb list when you find that
> it has a single skb. I tagged this as being potentially useful.

It is very useful since extra processing is not required for the one skb
case - you remove it from the list, unnecessarily add it to a different
list and then delete it immediately in the driver, when all that was
required is to pass the skb directly to the driver using its original API
(ofcourse the caveat is that I also have a check to add that *single* skb
to the blist in case there are already earlier skbs on the blist; this
helps in batching and more importantly - to send skbs in order).
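
The single-skb fast path described above amounts to something like the
following sketch (hard_start_xmit_batch and skb_blist are names from the
posted series; the function itself is illustrative):

/* Sketch: a lone skb bypasses the batch list entirely unless earlier
 * skbs are already queued there, which would reorder traffic. */
static int xmit_one_or_batch(struct net_device *dev, struct sk_buff *skb)
{
	if (skb_queue_empty(dev->skb_blist))
		/* common case: hand the skb straight to the driver
		 * via its original API */
		return dev->hard_start_xmit(skb, dev);

	/* earlier skbs pending: append to preserve ordering, then
	 * hand the whole list to the driver */
	__skb_queue_tail(dev->skb_blist, skb);
	return dev->hard_start_xmit_batch(dev);
}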

> b) You have a check for some upper bound on the number of skbs to send
> to the driver. I tagged this as unnecessary - the interface is still on
> in my current code, so it shouldnt be hard to show one way or other.

Explained earlier wrt latency.

> c) You dont have prep_xmit()
>
> Add to that list any other architectural differences i may have missed
> and lets discuss and hopefully make some good progress.

I think the code I have is ready and stable, and the issues pointed out so
far are also incorporated and will be sent out today. Please let me know if
you want to add something to it.

Thanks for your review/comments,

- KK


* [ofa-general] Re: [PATCH 00/10] Implement batching skb API
  2007-07-22  6:27   ` Krishna Kumar2
@ 2007-07-22 12:51     ` jamal
  2007-07-23  4:49       ` Krishna Kumar2
  0 siblings, 1 reply; 55+ messages in thread
From: jamal @ 2007-07-22 12:51 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, netdev, general, mchan,
	tgraf, sri, davem, herbert

KK,

On Sun, 2007-22-07 at 11:57 +0530, Krishna Kumar2 wrote:

> Batching need not be useful for every hardware. 

My concern is there is no consistency in results. I see improvements on
something which you say dont. You see improvement in something that
Evgeniy doesnt etc. 
There are many knobs and we need at the minimum to find out how those
play.

> I don't quite agree with that approach, eg, if the blist is empty and the
> driver tells there is space for one packet, you will add one packet and
> the driver sends it out and the device is stopped (with potentially lot of
> skbs on dev->q). Then no packets are added till the queue is enabled, at
> which time a flood of skbs will be processed increasing latency and holding
> lock for a single longer duration. My approach will mitigate holding lock
> for longer times and instead send skbs to the device as long as we are
> within the limits.

Just as a side note _I do have this feature_ in the pktgen piece.
Infact, you can tell pktgen what that bound is as opposed to the hard
coding (look at the pktgen "batchl" parameter). I have not found it to be
useful experimentally; actually, i should say i could not "zone in" on a
useful value by experimenting and it was better to turn it off.
I never tried adding it to the qdisc path - but this is something i
could try and as i said it may prove useful.

> Since E1000 doesn't seem to use the TX lock on RX (atleast I couldn't find
> it), I feel having prep will not help as no other cpu can execute the
> queue/xmit code anyway (E1000 is also a LLTX driver).

My experiments show it is useful (in a very visible way using pktgen)
for e1000 to have the prep() interface.

>  Other drivers that hold the tx lock could get an improvement however.

So you do see the value then with non LLTX drivers, right? ;-> 
The value is also there in LLTX drivers even if in just formating a skb
ready for transmit. If this is not clear i could do a much longer
writeup on my thought evolution towards adding prep().

> I wonder if you tried enabling/disabling 'prep' on E1000 to see how the
> performance is affected. 

Absolutely. And regardless of whether its beneficial or not for e1000,
theres clear benefit in the tg3 for example.

> If it helps, I guess you could send me a patch to
> add that and I can also test it to see what the effect is. I didn't add it
> since IPoIB wouldn't be able to exploit it (unless someone is kind enough
> to show me how to).

Such core code should not just be focussed on IPOIB.

> I think the code I have is ready and stable, 

I am not sure how to interpret that - are you saying all-is-good and we
should just push your code in? It sounds disingenuous but i may have
misread you. 

cheers,
jamal


* Re: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
  2007-07-21  6:56     ` Krishna Kumar2
@ 2007-07-22 17:03       ` Patrick McHardy
  0 siblings, 0 replies; 55+ messages in thread
From: Patrick McHardy @ 2007-07-22 17:03 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, davem, sri

Krishna Kumar2 wrote:
> Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 11:46:36 PM:
> 
>>The check for tx_queue_len is wrong though,
>>its only a default which can be overriden and some qdiscs don't
>>care for it at all.
> 
> 
> I think it should not matter whether qdiscs use this or not, or even if it
> is modified (unless it is made zero in which case this breaks). The
> intention behind this check is to make sure that not more than tx_queue_len
> skbs are in all queues put together (q->qdisc + dev->skb_blist), otherwise
> the blist can become too large and breaks the idea of tx_queue_len. Is that
> a good justification ?


Its a good justification, but on second thought the entire idea of
a single queue after the qdiscs that is refilled independently of
transmission times etc. makes me worry a bit. By changing the
timing you're effectively changing the qdisc's behaviour, at least
in some cases. SFQ is a good example, but I believe it affects most
work-conserving qdiscs. Think of this situation:

100 packets of flow 1 arrive
50 packets of flow 1 are sent
100 packets for flow 2 arrive
remaining packets are sent

On the wire you'll first see 50 packets of flow 1, then 100 packets
alternating between flows 1 and 2, then 50 packets of flow 2.

With your additional queue all packets of flow 1 are pulled out of
the qdisc immediately and put in the fifo. When the 100 packets of
the second flow arrive they will also get pulled out immediately
and are put in the fifo behind the remaining 50 packets of flow 1.
So what you get on the wire is:

100 packets of flow 1
100 packets of flow 2

So SFQ is without any effect. This is not completely avoidable of
course, but you can and should limit the damage by only pulling
out as many packets as the driver can take and have the driver
stop the queue afterwards.
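
On the driver side, the behaviour suggested here would look roughly like
the sketch below: send what fits, stop the queue when the ring fills, and
leave the rest in the qdisc where SFQ and friends can still do their job.
The mydrv_* helpers and the priv structure are hypothetical:

/* Sketch only: batch xmit handler that stops the queue when the tx
 * ring is full, instead of letting the core drain the qdisc. */
static int mydrv_hard_start_xmit_batch(struct net_device *dev)
{
	struct mydrv_priv *priv = netdev_priv(dev);
	struct sk_buff *skb;

	while ((skb = __skb_dequeue(dev->skb_blist)) != NULL) {
		mydrv_queue_to_ring(priv, skb);		/* hypothetical */
		if (!mydrv_ring_slots_free(priv)) {	/* hypothetical */
			netif_stop_queue(dev);	/* excess stays queued */
			break;
		}
	}
	return 0;
}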



* Re: [PATCH 00/10] Implement batching skb API
  2007-07-20 12:54 ` [ofa-general] " Evgeniy Polyakov
  2007-07-20 13:02   ` Krishna Kumar2
@ 2007-07-23  4:23   ` Krishna Kumar2
  1 sibling, 0 replies; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-23  4:23 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, sri, tgraf, xma

Hi Evgeniy,

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 07/20/2007 06:24:23 PM:

> Hi Krishna.
>
> On Fri, Jul 20, 2007 at 12:01:49PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> > After fine-tuning qdisc and other changes, I modified IPoIB to use this API,
> > and now get good gains. Summary for TCP & No Delay: 1 process improves for
> > all cases from 1.4% to 49.5%; 4 process has almost identical improvements
> > from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to
> > 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP
> > was tested with 1 process netperf with small increase in BW but big
> > improvement in Service Demand. Netperf latency tests show small drop in
> > transaction rate (results in separate attachment).
>
> What about round-robin tcp time and latency tests? In theory such batching
> mode should not change those timings, but practice can show new aspects.

The TCP RR results show a slight impact, however the service demand shows
good improvement. The results are (I did TCP RR - 1 process, 1,8,32,128,512
buffer sizes; and UDP RR - 1 process, 1 byte buffer size):

        Results for TCP RR (1 process) ORG code:
Size        R-R                   CPU%            S.Demand
------------------------------------------------------------
1         521346.02               5.48            1346.145
8         129463.14               6.74            418.370
32        128899.73               7.51            467.106
128       127230.15               5.42            340.876
512       119605.68               6.48            435.650


        Results for TCP RR (1 process) NEW code (and change%):
Size        R-R                   CPU%            S.Demand
--------------------------------------------------------------------
1         516596.62 (-0.91%)      5.74            1423.819 (5.77%)
8         129184.46 (-.22%)       5.43            336.747 (-19.51%)
32        128238.35 (-.51%)       5.43            339.213 (-27.38%)
128       126545.79 (-.54%)       5.36            339.188 (-0.50%)
512       119297.49 (-.26%)       5.16            346.185 (-20.54%)


              Results for UDP RR 1 process ORG & NEW code:
Code   Size      R-R                CPU%      S.Demand
----------------------------------------------------------------------
ORG     1        539327.86          5.68      1348.985
NEW     1        540669.33 (0.25%)  6.05      1434.180 (6.32%)


> I will review the code later this week (likely tomorrow) and if there
> are some issues I will get back.

Thanks! I had just submitted Rev2 on Sunday, please let me know what you
find.

Regards,

- KK



* [ofa-general] Re: [PATCH 00/10] Implement batching skb API
  2007-07-22 12:51     ` jamal
@ 2007-07-23  4:49       ` Krishna Kumar2
  2007-07-23 12:32         ` jamal
  0 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-23  4:49 UTC (permalink / raw)
  To: hadi
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, netdev, general, mchan,
	tgraf, sri, davem, herbert

Hi Jamal,

J Hadi Salim <j.hadi123@gmail.com> wrote on 07/22/2007 06:21:09 PM:

> My concern is there is no consistency in results. I see improvements on
> something which you say dont. You see improvement in something that
> Evgeniy doesnt etc.

Hmmm ? Evgeniy has not even tested my code to find some regression :) And
you may possibly not find much improvement in E1000 when you run iperf
(which is what I do) compared to pktgen. I can re-run and confirm this
since my last E1000 run was quite some time back.

My point is that batching not being viable for E1000 (or tg3) need not be
the sole criteria for inclusion. If IPoIB or other drivers can take
advantage of it and get better results, then batching can be considered.
Maybe E1000 too can get improvements if someone with more expertise tries
to add this API (not judging your driver writing capabilities - just
stating that driver writers will know more knobs to exploit a complex
device like E1000).

> > Since E1000 doesn't seem to use the TX lock on RX (atleast I couldn't find
> > it), I feel having prep will not help as no other cpu can execute the
> > queue/xmit code anyway (E1000 is also a LLTX driver).
>
> My experiments show it is useful (in a very visible way using pktgen)
> for e1000 to have the prep() interface.

I meant: have you compared the results of batching with prep on vs prep
off, and what is the difference in BW?

> >  Other drivers that hold the tx lock could get an improvement however.
>
> So you do see the value then with non LLTX drivers, right? ;->

No. I see value only in non-LLTX drivers which also get the same TX lock
in the RX path. If different locks are taken by TX/RX, then since you are
holding queue_lock before calling 'prep', this excludes other TX from
running at the same time. In that case, preponing the taking of the
tx_lock to do the 'prep' will not cause any degradation (since no other tx
can run anyway, while rx can run as it gets a different lock).

> The value is also there in LLTX drivers even if in just formating a skb
> ready for transmit. If this is not clear i could do a much longer
> writeup on my thought evolution towards adding prep().

In LLTX drivers, the driver does the 'prep' without holding the tx_lock in
any case, so there should be no improvement. Could you send the write-up,
since I really don't see the value in prep unless the driver is non-LLTX
*and* TX/RX hold the same TX lock. I think that is the sole criteria,
right?
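
To make the locking argument concrete, the prep idea as discussed would
sit roughly as in this sketch; hard_prep_xmit is jamal's proposed hook as
described in this thread, and nothing here is a merged kernel API:

/* Sketch only: why prep can help a non-LLTX driver whose RX path
 * takes the same tx lock. */
static int batch_xmit_with_prep(struct net_device *dev)
{
	struct sk_buff *skb;
	int ret;

	/* under dev->queue_lock only: per-skb descriptor formatting
	 * can proceed even while RX holds the tx lock */
	skb_queue_walk(dev->skb_blist, skb)
		dev->hard_prep_xmit(skb, dev);

	netif_tx_lock(dev);			/* may spin on busy RX */
	ret = dev->hard_start_xmit_batch(dev);	/* uses prepped state */
	netif_tx_unlock(dev);
	return ret;
}

For an LLTX driver the tx lock above disappears (the driver takes its own
private lock inside its xmit routine), which is exactly the case being
debated here.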

> > If it helps, I guess you could send me a patch to
> > add that and I can also test it to see what the effect is. I didn't add it
> > since IPoIB wouldn't be able to exploit it (unless someone is kind enough
> > to show me how to).
>
> Such core code should not just be focussed on IPOIB.

There is *nothing* IPoIB specific, or IPoIB focused, in my code. I said
adding prep doesn't work for IPoIB and so it is pointless to add bloat to
the code until some code can actually take advantage of this feature (I am
sure you will agree). Which is why I also mentioned to please send me a
patch if you find it useful for any driver, rather than rejecting this
idea.

> > I think the code I have is ready and stable,
>
> I am not sure how to interpret that - are you saying all-is-good and we
> should just push your code in?

I am only too well aware that Dave will not accept just any code (having
experienced this with Mobile IPv6 a long time back when he said to move
most of it to userspace and he was absolutely correct :). What I meant to
say is that there isn't much point in saying that your code is not ready,
or that you are using an old code base, or have multiple restart functions,
or have not tested enough, etc, and then saying let's re-do/rethink the
whole implementation when my code is already working and giving good
results. Unless you have some design issues with it, or the code is
written badly, is not maintainable, is not linux style compliant, is
buggy, or will not handle some case/workload - those types of issues.

OTOH, if you find some cases that are better handled with :
      1. prep handler
      2. xmit_win (which I don't have now),
then please send me patches and I will also test out and incorporate.

> It sounds disingenuous but i may have misread you.

("lacking in frankness, candor, or sincerity; falsely or hypocritically
ingenuous; insincere") ???? Sorry, no response to personal comments and
have a flame-war :)

Thanks,

- KK


* [ofa-general] Re: [PATCH 02/10] Networking include file changes.
  2007-07-21  6:30     ` Krishna Kumar2
@ 2007-07-23  5:59       ` Sridhar Samudrala
  2007-07-23  6:27         ` Krishna Kumar2
  0 siblings, 1 reply; 55+ messages in thread
From: Sridhar Samudrala @ 2007-07-23  5:59 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	mcarlson, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
	mchan, tgraf, netdev, davem, rdreier

Krishna Kumar2 wrote:
> Hi Sridhar,
> 
> Sridhar Samudrala <sri@us.ibm.com> wrote on 07/20/2007 10:55:05 PM:
>>> diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
>>> --- org/include/net/pkt_sched.h   2007-07-20 07:49:28.000000000 +0530
>>> +++ new/include/net/pkt_sched.h   2007-07-20 08:30:22.000000000 +0530
>>> @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
>>>        struct rtattr *tab);
>>>  extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
>>>
>>> -extern void __qdisc_run(struct net_device *dev);
>>> +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
>> Why do we need this additional 'blist' argument?
>> Is this different from dev->skb_blist?
> 
> It is the same, but I want to call it mostly with NULL and rarely with the
> batch list pointer (so it is related to your other question). My original
> code didn't have this and was trying batching in all cases. But in most
> xmit's (probably almost all), there will be only one packet in the queue to
> send and batching will never happen. When there is a lock contention or if
> the queue is stopped, then the next iteration will find >1 packets. But I
> still will try no batching for the lock failure case as there will probably
> be 2 packets (one from previous time and 1 from this time, or 3 if two
> failures, etc), and try batching only when the queue was stopped from
> net_tx_action (this was based on Dave Miller's idea).


Is it right to say that the above change is to get this behavior?
   If qdisc_run() is called from dev_queue_xmit(), don't use batching.
   If qdisc_run() is called from net_tx_action(), do batching.

Isn't it possible to have multiple skb's in the qdisc queue in the
first case?

If this additional argument is used to indicate whether we should do
batching or not, then passing a flag may be much cleaner than passing
the blist.

Thanks
Sridhar
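
For reference, the calling convention under discussion boils down to
something like this sketch; only the __qdisc_run prototype is from the
posted patch, the call sites paraphrase KK's description, and a boolean
flag as suggested above would simply replace the second argument:

/* dev_queue_xmit() path: almost always a single skb is queued,
 * so batching is skipped */
__qdisc_run(dev, NULL);

/* net_tx_action() path: the queue was stopped earlier, so many
 * skbs have usually accumulated - worth batching */
__qdisc_run(dev, dev->skb_blist);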


* [ofa-general] Re: [PATCH 02/10] Networking include file changes.
  2007-07-23  5:59       ` Sridhar Samudrala
@ 2007-07-23  6:27         ` Krishna Kumar2
  0 siblings, 0 replies; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-23  6:27 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	mcarlson, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
	mchan, tgraf, netdev, davem, rdreier

Hi Sridhar,

Sridhar Samudrala <sri@us.ibm.com> wrote on 07/23/2007 11:29:39 AM:

> Krishna Kumar2 wrote:
> > Hi Sridhar,
> >
> > Sridhar Samudrala <sri@us.ibm.com> wrote on 07/20/2007 10:55:05 PM:
> >>> diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
> >>> --- org/include/net/pkt_sched.h   2007-07-20 07:49:28.000000000 +0530
> >>> +++ new/include/net/pkt_sched.h   2007-07-20 08:30:22.000000000 +0530
> >>> @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
> >>>        struct rtattr *tab);
> >>>  extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
> >>>
> >>> -extern void __qdisc_run(struct net_device *dev);
> >>> +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
> >> Why do we need this additional 'blist' argument?
> >> Is this different from dev->skb_blist?
> >
> > It is the same, but I want to call it mostly with NULL and rarely with the
> > batch list pointer (so it is related to your other question). My original
> > code didn't have this and was trying batching in all cases. But in most
> > xmit's (probably almost all), there will be only one packet in the queue to
> > send and batching will never happen. When there is a lock contention or if
> > the queue is stopped, then the next iteration will find >1 packets. But I
> > still will try no batching for the lock failure case as there will probably
> > be 2 packets (one from previous time and 1 from this time, or 3 if two
> > failures, etc), and try batching only when the queue was stopped from
> > net_tx_action (this was based on Dave Miller's idea).
>
> Is it right to say that the above change is to get this behavior?
>    If qdisc_run() is called from dev_queue_xmit(), don't use batching.
>    If qdisc_run() is called from net_tx_action(), do batching.

Correct.

> Isn't it possible to have multiple skb's in the qdisc queue in the
> first case?

It is possible but rarer (so unnecessary to check most of the time). From
net_tx_action you are guaranteed to have multiple skbs, but from xmit you
will almost always get one skb (since most sends of 1 skb will go out OK).
And also in the xmit path, it is more likely to have few skbs compared to
possibly hundreds in the net_tx_action path.

> If this additional argument is used to indicate whether we should do
> batching or not, then passing a flag may be much cleaner than passing
> the blist.

OK, I will add this as another action item to check (along with Patrick's
suggestion to use a single API) and will get back.

- KK


* Re: TCP and batching WAS(Re: [PATCH 00/10] Implement batching skb API
  2007-07-21 13:46   ` [ofa-general] TCP and batching WAS(Re: " jamal
@ 2007-07-23  9:44     ` Stephen Hemminger
  0 siblings, 0 replies; 55+ messages in thread
From: Stephen Hemminger @ 2007-07-23  9:44 UTC (permalink / raw)
  To: hadi
  Cc: Krishna Kumar, davem, rdreier, johnpol, Robert.Olsson,
	peter.p.waskiewicz.jr, kumarkr, herbert, gaagaan, mcarlson, xma,
	rick.jones2, jeff, general, mchan, tgraf, netdev, jagana, kaber,
	sri

On Sat, 21 Jul 2007 09:46:19 -0400
jamal <hadi@cyberus.ca> wrote:

> On Fri, 2007-20-07 at 08:18 +0100, Stephen Hemminger wrote:
> 
> > You may see worse performance with batching in the real world when
> > running over WAN's.  Like TSO, batching will generate back to back packet
> > trains that are subject to multi-packet synchronized loss. 
> 
> Has someone done any study on TSO effect? 
Not that I have seen; TCP research tends to turn off NAPI and TSO because
they cause other effects which are too confusing for measurement. The
discussion of TSO usually shows up in discussions of pacing. I have seen
arguments both pro and con for pacing. The most convincing arguments are
that pacing doesn't help in the general case (and therefore TSO would be
ok). 

> Doesnt ECN with a RED router
> help on something like this?
Yes, but RED is not deployed on the backbone, and ECN only slightly.
Most common are oversized FIFO queues.

> I find it surprising that a single flow doing TSO would overwhelm a
> router's buffer. I actually think the value of batching as far as TCP is
> concerned is proportional to the number of flows, i.e the more flows you
> have the more batching you will end up doing. And if TCP's fairness is
> the legend it has been made out to be, then i dont see this as
> problematic.

It is not that TSO would overwhelm the router by itself, just that any
congested link will have periods when there is only a small number of
available slots left. When this happens a TSO burst will get truncated.

The argument against pacing, and for TSO, is that the busy sender with a
large congestion window is the one most likely to send large bursts.
For fairness, the system works better if the busy sender gets penalized more,
and dropping the latter part of the burst does that.  With pacing, the sender
may be able to saturate the router more and not detect that it is monopolizing
the bandwidth.


> BTW, something i noticed with regards to GSO when testing batching:
> For TCP packets slightly above the MTU (upto 2K), GSO gives worse
> performance than non-GSO. This actually has nothing to do with batching -
> it works the same way with or without the batching changes.
> 
> Another oddity:
> Looking at the flow rate from a purely packets/second view (I know thats a
> router centric view, but i found it strange nevertheless) - you see that
> as packet size goes up, the pps also goes up. I tried mucking around
> with nagle etc, but saw no observable changes. Any insight?
> My expectation was that the pps would stay at least the same or get
> better with smaller packets (assuming theres less data to push around).
> 
> cheers,
> jamal
> 
> 
> 


* Re: [PATCH 04/10] net-sysfs.c changes.
  2007-07-21  6:46           ` Krishna Kumar2
@ 2007-07-23  9:56             ` Stephen Hemminger
  0 siblings, 0 replies; 55+ messages in thread
From: Stephen Hemminger @ 2007-07-23  9:56 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, johnpol,
	Patrick McHardy, kumarkr, mcarlson, mchan, netdev,
	peter.p.waskiewicz.jr, rdreier, rick.jones2, Robert.Olsson, sri,
	tgraf, xma

On Sat, 21 Jul 2007 12:16:30 +0530
Krishna Kumar2 <krkumar2@in.ibm.com> wrote:

> Stephen Hemminger <shemminger@linux-foundation.org> wrote on 07/20/2007
> 09:52:03 PM:
> > Patrick McHardy <kaber@trash.net> wrote:
> >
> > > Krishna Kumar2 wrote:
> > > > Patrick McHardy <kaber@trash.net> wrote on 07/20/2007 03:37:20 PM:
> > > >
> > > >
> > > >
> > > >> rtnetlink support seems more important than sysfs to me.
> > > >>
> > > >
> > > > Thanks, I will add that as a patch. The reason to add to sysfs is
> that
> > > > it is easier to change for a user (and similar to tx_queue_len).
> > > >
> > >
> >
> > But since batching is so similar to TSO, it really should be part of the
> > flags and controlled by ethtool like other offload flags.
> 
> So should I add all three interfaces (or which ones) :
> 
>       1. /sys (like for tx_queue_len)
>       2. netlink
>       3. ethtool.
> 
> Or only 2 & 3 are enough ?
> 

Yes, please do #3 and maybe #2.
Sysfs APIs are a long term ABI problem.
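
An ethtool knob along the lines suggested here could look roughly like
the sketch below, modeled on the existing get_tso/set_tso pair; the
NETIF_F_BATCH_SKBS bit and the mydrv_* callbacks are hypothetical names
for illustration, not an existing API:

#define NETIF_F_BATCH_SKBS	0x10000	/* hypothetical feature bit */

/* Sketch only: per-device on/off knob for batching, analogous to
 * how TSO is toggled through ethtool_ops. */
static u32 mydrv_get_batch(struct net_device *dev)
{
	return (dev->features & NETIF_F_BATCH_SKBS) != 0;
}

static int mydrv_set_batch(struct net_device *dev, u32 data)
{
	if (data)
		dev->features |= NETIF_F_BATCH_SKBS;
	else
		dev->features &= ~NETIF_F_BATCH_SKBS;
	return 0;
}

These would then be wired up through the driver's ethtool_ops alongside
the existing offload getters/setters such as get_tso/set_tso.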


* Re: [PATCH 00/10] Implement batching skb API
  2007-07-23  4:49       ` Krishna Kumar2
@ 2007-07-23 12:32         ` jamal
  2007-07-24  3:44           ` [ofa-general] " Krishna Kumar2
  0 siblings, 1 reply; 55+ messages in thread
From: jamal @ 2007-07-23 12:32 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: davem, gaagaan, general, herbert, jagana, jeff, johnpol, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, sri, tgraf, xma

KK,

On Mon, 2007-23-07 at 10:19 +0530, Krishna Kumar2 wrote:

> Hmmm ? Evgeniy has not even tested my code to find some regression :) And
> you may possibly not find much improvement in E1000 when you run iperf
> (which is what I do) compared to pktgen. 

Pktgen is the correct test (or the closest to correct) because it tests
the driver tx path. iperf/netperf test the effect of batching on
tcp/udp. Infact i would start with udp first. What you need to do if
testing end-2-end is see where the effects occur. For example, it is
feasible that batching is a little too aggressive and the receiver cant
keep up (netstat -s before and after will be helpful).
Maybe by such insight we can improve things.

> > My experiments show it is useful (in a very visible way using pktgen)
> > for e1000 to have the prep() interface.
> 
> I meant : have you compared results of batching with prep on vs prep off,
> and
> what is the difference in BW ?

Yes, and these results were sent to you as well a while back.
When i get back and have the time i will look em up on my test machine
and resend.

> No. I see value only in non-LLTX drivers which also gets the same TX lock
> in the RX path.

So _which_ non-LLTX driver doesnt do that? ;->

> > The value is also there in LLTX drivers even if in just formating a skb
> > ready for transmit. If this is not clear i could do a much longer
> > writeup on my thought evolution towards adding prep().
> 
> In LLTX drivers, the driver does the 'prep' without holding the tx_lock in
> any case, so there should be no improvement. Could you send the write-up

I will - please give me sometime; i am overloaded at the moment.

> There is *nothing* IPoIB specific or focus in my code. 
> I said adding prep
> doesn't
> work for IPoIB and so it is pointless to add bloat to the code until some
> code can

tun driver doesnt use it either - but i doubt that makes it "bloat"

>  What I meant to say
> is that there isn't much point in saying that your code is not ready or
> you are using old code base, or has multiple restart functions, or is not
> tested enough, etc, and then say let's re-do/rethink the whole
> implementation when my code is already working and giving good results.

The suggestive hand gesturing is the kind of thing that bothers me. What
do you think: Would i be submitting patches based on 2.6.22-rc4? Would
it make sense to include parallel qdisc paths? For heavens sake, i have
told you i would be fine with accepting such changes when the qdisc
restart changes went in first.
You waltz in, have the luxury of looking at my code, presentations, many
discussions with me etc ...
When i ask for differences to code you produced, they now seem to sum up
to the two below. You dont think theres some honest issue with this
picture?

> OTOH, if you find some cases that are better handled with :
>       1. prep handler
>       2. xmit_win (which I don't have now),
> then please send me patches and I will also test out and incorporate.
> 

And then of course you will end up adding those because they are both
useful, just calling them some other name. And then you will end up
incorporating all the drivers i invested many hours (as a gratuitous
volunteer) to change and test - maybe you will change variable names or
rearrange some function. 
I am a very compromising person; i have no problem coauthoring these
patches if you actually invest useful time like fixing things up and
doing proper tests. But you are not doing that - instead you are being
extremely aggressive and hijacking the whole thing. It is courteous, if
you find somebody else has a patch, to point out whats wrong, preferably
with some proof. 

> > It sounds disingenuous but i may have misread you.
> 
> ("lacking in frankness, candor, or sincerity; falsely or hypocritically
> ingenuous; insincere") ???? Sorry, no response to personal comments and
> have a flame-war :)

Give me a better description. 

cheers,
jamal



* [ofa-general] Re: [PATCH 00/10] Implement batching skb API
  2007-07-23 12:32         ` jamal
@ 2007-07-24  3:44           ` Krishna Kumar2
  2007-07-24 19:28             ` jamal
  0 siblings, 1 reply; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-24  3:44 UTC (permalink / raw)
  To: hadi
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, netdev, general, mchan,
	tgraf, sri, davem, herbert

Hi Jamal,

J Hadi Salim <j.hadi123@gmail.com> wrote on 07/23/2007 06:02:01 PM:

> Yes, and these results were sent to you as well a while back.
> When i get the time when i get back i will look em up in my test machine
> and resend.

Actually you have not sent netperf results with prep and without prep.

> > No. I see value only in non-LLTX drivers which also get the same TX lock
> > in the RX path.
>
> So _which_ non-LLTX driver doesnt do that? ;->

I have no idea since I haven't looked at all drivers. Can you tell which
all non-LLTX drivers do that? I stated this as the sole criteria.

> tun driver doesnt use it either - but i doubt that makes it "bloat"

Adding extra code that is currently not usable (esp from a submission
point) is bloat.

> You waltz in, have the luxury of looking at my code, presentations, many
> discussions with me etc ...

"luxury" ? I had implemented the entire thing even before knowing that you
are working on something similar! and I had sent the first proposal to
netdev,
*after* which you told that you have your own code and presentations (which
I had never seen earlier - I joined netdev a few months back, earlier I was
working on RDMA, Infiniband as you know). And it didn't give me any great
ideas either, remember I had posted results for E1000 at the time of
sending
the proposals. However I do give credit in my proposal to you for what
ideas
that your provided (without actual code), and the same I did for other
people
who did the same, like Dave, Sridhar. BTW, you too had discussions with me,
and I sent some patches to improve your code too, so it looks like a two
way
street to me (and that is how open source works and should).

> When i ask for differences to code you produced, they now seem to sum up
> to the two below. You dont think theres some honest issue with this
> picture?

Two changes ? That's it ? I gave a big list of changes between our
implementations but you twist my words to conclude there are just two (by
conveniently labelling everything else "cosmetic" or "potentially
useful"!). Even my restart routine used a single API from the first day;
I would never imagine using multiple APIs. Our codes probably don't have
even one line that looks remotely similar!

To clarify: I suggested that you could send patches for the two *missing*
items if you can show they add value (and not the rest, as I consider
those will not improve the code/logic/algo).

> > ("lacking in frankness, candor, or sincerity; falsely or hypocritically
> > ingenuous; insincere") ???? Sorry, no response to personal comments and
> > have a flame-war :)
>
> Give me a better description.

Sorry, no personal comments. Infact I will avoid responding to baits and
innuendoes from now on.

Thanks,

- KK


* Re: [PATCH 00/10] Implement batching skb API
  2007-07-24  3:44           ` [ofa-general] " Krishna Kumar2
@ 2007-07-24 19:28             ` jamal
  2007-07-25  2:41               ` Krishna Kumar2
  0 siblings, 1 reply; 55+ messages in thread
From: jamal @ 2007-07-24 19:28 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: davem, gaagaan, general, herbert, jagana, jeff, johnpol, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, sri, tgraf, xma

KK,

On Tue, 2007-24-07 at 09:14 +0530, Krishna Kumar2 wrote:

> 
> J Hadi Salim <j.hadi123@gmail.com> wrote on 07/23/2007 06:02:01 PM:


> Actually you have not sent netperf results with prep and without prep.

My results were based on pktgen (which i explained as testing the
driver). I think depending on netperf without further analysis is
simplistic. It was like me doing forwarding tests on these patches.

> > So _which_ non-LLTX driver doesnt do that? ;->
> 
> I have no idea since I haven't looked at all drivers. Can you tell which
> all non-LLTX drivers does that ? I stated this as the sole criterea.

The few i have peeked at all do it. I also think the e1000 should be
converted to be non-LLTX. The rest of netdev is screaming to kill LLTX. 

> > tun driver doesnt use it either - but i doubt that makes it "bloat"
> 
> Adding extra code that is currently not usable (esp from a submission
> point) is bloat.

So far i have converted 3 drivers, 1 of them doesnt use it. Two more
driver conversions are on the way, they will both use it. How is this
bloat again? 
A few emails back you said if only IPOIB can use batching then thats
good enough justification. 

> > You waltz in, have the luxury of looking at my code, presentations, many
> > discussions with me etc ...
> 
> "luxury" ? 
> I had implemented the entire thing even before knowing that you
> are working on something similar! and I had sent the first proposal to
> netdev,

I saw your patch at the end of may (or at least 2 weeks after you said
it existed). That patch has very little resemblance to what you just
posted conceptwise or codewise. I could post it if you would give me
permission.

> *after* which you told that you have your own code and presentations (which
> I had never seen earlier - I joined netdev a few months back, earlier I was
> working on RDMA, Infiniband as you know).

I am gonna assume you didnt know of my work - which i have been making
public for about 3 years. Infact i talked about this topic when i
visited your office in 2006 on a day you were not present, so it is
plausible you didnt hear of it.

>  And it didn't give me any great
> ideas either, remember I had posted results for E1000 at the time of
> sending the proposals. 

In mid-June you sent me a series of patches which included anything from
changing variable names to combining qdisc_restart and about everything
i referred to as being "cosmetic differences" in your posted patches. I
took two of those and incorporated them in. One was an "XXX" in my code
already to allocate the dev->blist 
(Commit: bb4464c5f67e2a69ffb233fcf07aede8657e4f63). 
The other one was a mechanical removal of the blist being passed
(Commit: 0e9959e5ee6f6d46747c97ca8edc91b3eefa0757). 
Some of the others i asked you to defer. For example, the reason i gave
you for not merging any qdisc_restart_combine changes is because i was
waiting for Dave to swallow the qdisc_restart changes i made; otherwise
maintainance becomes extremely painful for me. 
Sridhar actually provided a lot more valuable comments and fixes but has
not planted a flag on behalf of the queen of spain like you did. 

> However I do give credit in my proposal to you for what
> ideas that your provided (without actual code), and the same I did for other
> people who did the same, like Dave, Sridhar. BTW, you too had discussions with me,
> and I sent some patches to improve your code too, 

I incorporated two of your patches and asked for deferral of others.
These patches have now shown up in what you claim as "the difference". I
just call them "cosmetic differences" not to downplay the importance of
having an ethtool interface but because they do not make batching
perform any better. The real differences are those two items. I am
surprised you havent cannibalized those changes as well. I thought you
renamed them to something else; according to your posting:
"This patch will work with drivers updated by Jamal, Matt & Michael Chan
with minor modifications - rename xmit_win to xmit_slots & rename batch
handler". Or maybe thats a "future plan" you have in mind?

> so it looks like a two
> way street to me (and that is how open source works and should).

Open source is a lot more transparent than that.

You posted a question, which was part of your research. I responded and
told you i have patches; you asked me for them and i promptly ported
them from pre-2.6.18 to the latest kernel at the time. 

The nature of this batching work is one of performance. So numbers are
important. If you had some strong disagreements on something in the
architecture, then it would be of great value to explain it in
technical detail - and more importantly to provide some numbers to say
why it is a bad idea. You get numbers by running some tests. 
You did none of the above. Your effort has been to produce "your patch"
for whatever reasons. This would not have been problematic to me if it
actually was based within reasons of optimization because the end goal
would have been achieved.

I have deleted the rest of the email because it goes back and forth on
the same points. 

I am gonna continue work on the current tree i have. I will put more
time when i get back next week (and hopefully no travel right after).
I will upgrade to Daves tree later when i get the two new drivers in. I
am probably gonna hold on until the new NAPI stuff settles in first. You
are welcome to submit the ipoib changes in. You are also welcome to
co-author with me but you will have to work for it this time.

cheers,
jamal



* Re: [PATCH 00/10] Implement batching skb API
  2007-07-24 19:28             ` jamal
@ 2007-07-25  2:41               ` Krishna Kumar2
  0 siblings, 0 replies; 55+ messages in thread
From: Krishna Kumar2 @ 2007-07-25  2:41 UTC (permalink / raw)
  To: hadi
  Cc: davem, gaagaan, general, herbert, jagana, jeff, johnpol, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, sri, tgraf, xma

Jamal,

This is silly. I am not responding to this type of presumptuous and
insulting mail.

Regards,

- KK


end of thread, other threads:[~2007-07-25  2:40 UTC | newest]

Thread overview: 55+ messages
2007-07-20  6:31 [ofa-general] [PATCH 00/10] Implement batching skb API Krishna Kumar
2007-07-20  6:32 ` [ofa-general] [PATCH 01/10] HOWTO documentation for Batching SKB Krishna Kumar
2007-07-20  6:32 ` [PATCH 02/10] Networking include file changes Krishna Kumar
2007-07-20  9:59   ` Patrick McHardy
2007-07-20 17:25   ` [ofa-general] " Sridhar Samudrala
2007-07-21  6:30     ` Krishna Kumar2
2007-07-23  5:59       ` Sridhar Samudrala
2007-07-23  6:27         ` Krishna Kumar2
2007-07-20  6:32 ` [ofa-general] [PATCH 03/10] dev.c changes Krishna Kumar
2007-07-20 10:04   ` [ofa-general] " Patrick McHardy
2007-07-20 10:27     ` Krishna Kumar2
2007-07-20 11:20       ` [ofa-general] " Patrick McHardy
2007-07-20 11:52         ` Krishna Kumar2
2007-07-20 11:55           ` Patrick McHardy
2007-07-20 12:09         ` Krishna Kumar2
2007-07-20 12:25         ` Krishna Kumar2
2007-07-20 12:37           ` Patrick McHardy
2007-07-20 17:44   ` Sridhar Samudrala
2007-07-21  6:44     ` Krishna Kumar2
2007-07-20  6:32 ` [PATCH 04/10] net-sysfs.c changes Krishna Kumar
2007-07-20 10:07   ` [ofa-general] " Patrick McHardy
2007-07-20 10:28     ` Krishna Kumar2
2007-07-20 11:21       ` Patrick McHardy
2007-07-20 16:22         ` Stephen Hemminger
2007-07-21  6:46           ` Krishna Kumar2
2007-07-23  9:56             ` Stephen Hemminger
2007-07-20  6:32 ` [ofa-general] [PATCH 05/10] sch_generic.c changes Krishna Kumar
2007-07-20 10:11   ` [ofa-general] " Patrick McHardy
2007-07-20 10:32     ` Krishna Kumar2
2007-07-20 11:24       ` Patrick McHardy
2007-07-20 18:16   ` Patrick McHardy
2007-07-21  6:56     ` Krishna Kumar2
2007-07-22 17:03       ` Patrick McHardy
2007-07-20  6:33 ` [ofa-general] [PATCH 06/10] IPoIB header file changes Krishna Kumar
2007-07-20  6:33 ` [ofa-general] [PATCH 07/10] IPoIB verb changes Krishna Kumar
2007-07-20  6:33 ` [ofa-general] [PATCH 08/10] IPoIB multicast/CM changes Krishna Kumar
2007-07-20  6:33 ` [PATCH 09/10] IPoIB batching xmit handler support Krishna Kumar
2007-07-20  6:33 ` [PATCH 10/10] IPoIB batching in internal xmit/handler routines Krishna Kumar
2007-07-20  7:18 ` [ofa-general] Re: [PATCH 00/10] Implement batching skb API Stephen Hemminger
2007-07-20  7:30   ` Krishna Kumar2
2007-07-20  7:57     ` [ofa-general] " Stephen Hemminger
2007-07-20  7:47   ` Krishna Kumar2
2007-07-21 13:46   ` [ofa-general] TCP and batching WAS(Re: " jamal
2007-07-23  9:44     ` Stephen Hemminger
2007-07-20 12:54 ` [ofa-general] " Evgeniy Polyakov
2007-07-20 13:02   ` Krishna Kumar2
2007-07-23  4:23   ` Krishna Kumar2
2007-07-21 13:18 ` [ofa-general] " jamal
2007-07-22  6:27   ` Krishna Kumar2
2007-07-22 12:51     ` jamal
2007-07-23  4:49       ` Krishna Kumar2
2007-07-23 12:32         ` jamal
2007-07-24  3:44           ` [ofa-general] " Krishna Kumar2
2007-07-24 19:28             ` jamal
2007-07-25  2:41               ` Krishna Kumar2
