* [net-next PATCH 0/5] Optimizing "pktgen" for single CPU performance
@ 2014-05-14 14:17 Jesper Dangaard Brouer
  2014-05-14 14:17 ` [net-next PATCH 1/5] ixgbe: trivial fixes while reading code Jesper Dangaard Brouer
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-05-14 14:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
	David S. Miller, Stephen Hemminger, Paul E. McKenney,
	Robert Olsson, Ben Greear, John Fastabend, danieltt, zhouzhouyi

I'm on a quest to push the packets-per-second (pps) limits of our
network stack, with a special focus on single CPU performance.

My first step is to measure and identify bottlenecks in the transmit
path.  To achieve this goal, I need a fast in-kernel packet generator,
like "pktgen".  It turned out that "pktgen" was too slow.

Thus, this series focuses on optimizing "pktgen" for single CPU performance.

Overview 1xCPU performance Packet Per Sec (pps) stats:
 * baseline: 3,930,068 pps
 * patch2:   5,362,722 pps -- TXSZ=1024
 * patch3:   5,608,781 pps --> 178.29ns per pkt
 * patch4:   5,857,065 pps --> 170.73ns ( -7.56ns)
 * patch5:   6,346,500 pps --> 157.56ns (-13.17ns)
 * No-lock:  6,642,948 pps --> 150.53ns ( -7.03ns)

The last result, "No-lock", removes the HARD_TX_{UN}LOCK and is not
applicable upstream.  It removes two "LOCK" instructions (costing 8ns
each), so I was expecting to see an improvement of 16ns, but we only
see 7ns.  This leads me to believe that we have reached the
single-queue limit of the ixgbe driver.
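
For reference, the nanoseconds-per-packet numbers above are simply the
inverse of the pps numbers.  A minimal sketch (my illustration, not
part of the patches) of the conversion used throughout this series:

  /* ns per packet = 10^9 / pps, e.g. 10^9 / 5608781 = 178.29 ns */
  static double pps_to_ns(double pps)
  {
          return 1e9 / pps;
  }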

Setup according to blogpost:
 http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html

Hardware:
 System: CPU E5-2630
 NIC: Intel ixgbe/82599 chip

Testing done with net-next git tree on top of
 commit 79e0f1c9f (ipv6: Need to sock_put on csum error).

Pktgen script exercising race condition:
 https://github.com/netoptimizer/network-testing/blob/master/pktgen/unit_test01_race_add_rem_device_loop.sh

---

Jesper Dangaard Brouer (5):
      pktgen: RCU'ify "if_list" to remove lock in next_to_run()
      pktgen: avoid expensive set_current_state() call in loop
      pktgen: avoid atomic_inc per packet in xmit loop
      ixgbe: increase default TX ring buffer to 1024
      ixgbe: trivial fixes while reading code


 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    2 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    2 
 net/core/pktgen.c                             |  115 +++++++++++++------------
 3 files changed, 61 insertions(+), 58 deletions(-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [net-next PATCH 1/5] ixgbe: trivial fixes while reading code
  2014-05-14 14:17 [net-next PATCH 0/5] Optimizing "pktgen" for single CPU performance Jesper Dangaard Brouer
@ 2014-05-14 14:17 ` Jesper Dangaard Brouer
  2014-05-14 14:17 ` [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024 Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-05-14 14:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
	David S. Miller, Stephen Hemminger, Paul E. McKenney,
	Robert Olsson, Ben Greear, John Fastabend, danieltt, zhouzhouyi

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 8089ea9..5e5df9f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2097,7 +2097,7 @@ dma_sync:
  * This function provides a "bounce buffer" approach to Rx interrupt
  * processing.  The advantage to this is that on systems that have
  * expensive overhead for IOMMU access this provides a means of avoiding
- * it by maintaining the mapping of the page to the syste.
+ * it by maintaining the mapping of the page to the system.
  *
  * Returns amount of work completed
  **/

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024
  2014-05-14 14:17 [net-next PATCH 0/5] Optimizing "pktgen" for single CPU performance Jesper Dangaard Brouer
  2014-05-14 14:17 ` [net-next PATCH 1/5] ixgbe: trivial fixes while reading code Jesper Dangaard Brouer
@ 2014-05-14 14:17 ` Jesper Dangaard Brouer
  2014-05-14 14:28   ` David Laight
  2014-05-14 16:28   ` Alexander Duyck
  2014-05-14 14:17 ` [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-05-14 14:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
	David S. Miller, Stephen Hemminger, Paul E. McKenney,
	Robert Olsson, Ben Greear, John Fastabend, danieltt, zhouzhouyi

Using pktgen I'm seeing the ixgbe driver "push-back" due to the TX
ring running full.  Thus, the TX ring is artificially limiting pktgen.

Diagnose via "ethtool -S", looking for the "tx_restart_queue" or
"tx_busy" counters.
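
For example (eth8 is the test interface used in this series):

  # ethtool -S eth8 | egrep 'tx_restart_queue|tx_busy'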

Increasing the TX ring buffer should be done carefully, as it comes at
a higher memory cost, which can also negatively influence performance.
E.g. the ring buffer array of struct ixgbe_tx_buffer (currently 48
bytes per entry) grows from 512*48=24576 bytes to 1024*48=49152 bytes,
which is larger than the L1 data cache (32KB on my E5-2630), thus
increasing the L1->L2 cache-references.

Adjusting the TX ring buffer (TXSZ) measured over 10 sec with ifpps
 (single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * cmd: ethtool -G eth8 tx $TXSZ
 * 3,930,065 pps -- TXSZ= 512
 * 5,312,249 pps -- TXSZ= 768
 * 5,362,722 pps -- TXSZ=1024
 * 5,361,390 pps -- TXSZ=1536
 * 5,362,439 pps -- TXSZ=2048
 * 5,359,744 pps -- TXSZ=4096

Choosing size 1024, because 768 is not enough for the next
optimizations.

Notice after commit 6f25cd47d (pktgen: fix xmit test for BQL enabled
devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This allow us to put
more pressure on the TX ring buffers.

It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
pktgen respects this via the call to netif_xmit_frozen_or_drv_stopped(txq).
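
As a simplified sketch (not the exact pktgen code) of how pktgen backs
off when the driver has stopped the queue:

  txq = netdev_get_tx_queue(odev, queue_map);
  ...
  if (netif_xmit_frozen_or_drv_stopped(txq)) {
          /* driver stopped the queue, e.g. via ixgbe_maybe_stop_tx() */
          pkt_dev->last_ok = 0;
          goto unlock;
  }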

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index c688c8a..bf078fe 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -63,7 +63,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 /* TX/RX descriptor defines */
-#define IXGBE_DEFAULT_TXD		    512
+#define IXGBE_DEFAULT_TXD		   1024
 #define IXGBE_DEFAULT_TX_WORK		    256
 #define IXGBE_MAX_TXD			   4096
 #define IXGBE_MIN_TXD			     64

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop
  2014-05-14 14:17 [net-next PATCH 0/5] Optimizing "pktgen" for single CPU performance Jesper Dangaard Brouer
  2014-05-14 14:17 ` [net-next PATCH 1/5] ixgbe: trivial fixes while reading code Jesper Dangaard Brouer
  2014-05-14 14:17 ` [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024 Jesper Dangaard Brouer
@ 2014-05-14 14:17 ` Jesper Dangaard Brouer
  2014-05-14 14:35   ` Eric Dumazet
  2014-05-14 14:17 ` [net-next PATCH 4/5] pktgen: avoid expensive set_current_state() call in loop Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-05-14 14:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
	David S. Miller, Stephen Hemminger, Paul E. McKenney,
	Robert Olsson, Ben Greear, John Fastabend, danieltt, zhouzhouyi

Avoid the expensive atomic refcnt increase in the pktgen xmit loop, by
setting the refcnt only when a new SKB gets allocated.  Set it
according to how many times we will spin on the same SKB (handling the
case of clone_skb=0).

Performance data with CLONE_SKB==100000 and TX ring buffer size=1024:
 (single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Before: 5,362,722 pps --> 186.47ns per pkt (1/5362722*10^9)
 * Now:    5,608,781 pps --> 178.29ns per pkt (1/5608781*10^9)
 * Diff:    +246,059 pps -->  -8.18ns

The performance increase, converted to nanoseconds (8.18ns),
corresponds well with the measured overhead of LOCK-prefixed assembler
instructions on my E5-2630 CPU, which is measured to be 8.23ns.

Note, with TX ring size 768 I see some "tx_restart_queue" events.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/core/pktgen.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 0304f98..7752806 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3327,6 +3327,9 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 			pkt_dev->clone_count--;	/* back out increment, OOM */
 			return;
 		}
+		/* Avoid atomic inc for every packet before xmit call */
+		atomic_set(&(pkt_dev->skb->users),
+			   max(2,(pkt_dev->clone_skb+1)));
 		pkt_dev->last_pkt_size = pkt_dev->skb->len;
 		pkt_dev->allocated_skbs++;
 		pkt_dev->clone_count = 0;	/* reset counter */
@@ -3347,7 +3350,6 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		pkt_dev->last_ok = 0;
 		goto unlock;
 	}
-	atomic_inc(&(pkt_dev->skb->users));
 	ret = (*xmit)(pkt_dev->skb, odev);
 
 	switch (ret) {

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [net-next PATCH 4/5] pktgen: avoid expensive set_current_state() call in loop
  2014-05-14 14:17 [net-next PATCH 0/5] Optimizing "pktgen" for single CPU performance Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2014-05-14 14:17 ` [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop Jesper Dangaard Brouer
@ 2014-05-14 14:17 ` Jesper Dangaard Brouer
  2014-05-14 14:18 ` [net-next PATCH 5/5] pktgen: RCU'ify "if_list" to remove lock in next_to_run() Jesper Dangaard Brouer
  2014-06-26 11:16 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance Jesper Dangaard Brouer
  5 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-05-14 14:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
	David S. Miller, Stephen Hemminger, Paul E. McKenney,
	Robert Olsson, Ben Greear, John Fastabend, danieltt, zhouzhouyi

I request review, as I'm uncertain about this change; I don't know
the API of set_current_state() very well.

set_current_state(TASK_INTERRUPTIBLE) uses an xchg, which is
implicitly LOCK prefixed.

Avoid calling set_current_state() inside the busy loop in
pktgen_thread_worker().  In case of pkt_dev->delay, it is still
used in pktgen_xmit() via the spin() call.
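
To illustrate the difference (a sketch, not code from this patch):

  set_current_state(TASK_INTERRUPTIBLE); /* store + memory barrier, on
                                          * x86 an implicitly LOCK
                                          * prefixed xchg */
  __set_current_state(TASK_RUNNING);     /* plain store, no barrier */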

Performance data with CLONE_SKB==100000 and TX ring buffer size=1024:
 (single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Prev:  5608781 pps --> 178.29ns (1/5608781*10^9)
 * Now:   5857065 pps --> 170.73ns (1/5857065*10^9)
 * Diff:  +248284 pps -->  -7.56ns

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/core/pktgen.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 7752806..cae7e0c 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3409,10 +3409,10 @@ static int pktgen_thread_worker(void *arg)
 
 	pr_debug("starting pktgen/%d:  pid=%d\n", cpu, task_pid_nr(current));
 
-	set_current_state(TASK_INTERRUPTIBLE);
-
 	set_freezable();
 
+	__set_current_state(TASK_RUNNING);
+
 	while (!kthread_should_stop()) {
 		pkt_dev = next_to_run(t);
 
@@ -3426,8 +3426,6 @@ static int pktgen_thread_worker(void *arg)
 			continue;
 		}
 
-		__set_current_state(TASK_RUNNING);
-
 		if (likely(pkt_dev)) {
 			pktgen_xmit(pkt_dev);
 
@@ -3458,9 +3456,8 @@ static int pktgen_thread_worker(void *arg)
 		}
 
 		try_to_freeze();
-
-		set_current_state(TASK_INTERRUPTIBLE);
 	}
+	set_current_state(TASK_INTERRUPTIBLE);
 
 	pr_debug("%s stopping all device\n", t->tsk->comm);
 	pktgen_stop(t);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [net-next PATCH 5/5] pktgen: RCU'ify "if_list" to remove lock in next_to_run()
  2014-05-14 14:17 [net-next PATCH 0/5] Optimizing "pktgen" for single CPU performance Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2014-05-14 14:17 ` [net-next PATCH 4/5] pktgen: avoid expensive set_current_state() call in loop Jesper Dangaard Brouer
@ 2014-05-14 14:18 ` Jesper Dangaard Brouer
  2014-06-26 11:16 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance Jesper Dangaard Brouer
  5 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-05-14 14:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
	David S. Miller, Stephen Hemminger, Paul E. McKenney,
	Robert Olsson, Ben Greear, John Fastabend, danieltt, zhouzhouyi

The if_lock()/if_unlock() in next_to_run() adds significant
overhead, because it is called for every packet in the busy loop of
pktgen_thread_worker().

Removing these two "LOCK" operations should in theory save us approx
16ns (8ns x 2).  As illustrated below, we save 15ns when just removing
the locks, and 13ns when introducing RCU protection.

Performance data with CLONE_SKB==100000 and TX ring buffer size=1024:
 (single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Prev    :  5857065 pps --> 170.73ns (1/5857065*10^9)
 * No-lock :  6424246 pps --> 155.66ns (1/6424246*10^9)
 * Diff    :  +567181 pps --> -15.07ns (1/6424246*10^9)-(1/5857065*10^9)
 * RCU-fix :  6346500 pps --> 157.56ns
 * RCU-cost:   -77746 pps -->  +1.90ns
 * Diff    :  +489435 pps --> -13.17ns (1/6346500*10^9)-(1/5857065*10^9)

To understand this RCU patch, I describe the pktgen thread model
below.

In pktgen there are several kernel threads, but only one CPU runs
each kernel thread.  Communication with the kernel threads is done
through thread control flags.  This allows a thread to change data
structures at a known synchronization point; see the main thread
function pktgen_thread_worker().

Userspace changes are communicated through proc-file writes.  There
are three types of changes: general control changes "pgctrl"
(func: pgctrl_write), thread changes "kpktgend_X"
(func: pktgen_thread_write), and interface config changes "ethX@N"
(func: pktgen_if_write).

Userspace "pgctrl" and "thread" changes are synchronized via the mutex
pktgen_thread_lock, thus only a single userspace instance can run.
The mutex is taken while the packet generator is running, by pgctrl
"start".  Thus e.g. "add_device" cannot be invoked when pktgen is
running/started.

All "pgctrl" and all "thread" changes, except thread "add_device",
communicate via the thread control flags.  The main problem is the
exception "add_device", that modifies threads "if_list" directly.

Fortunately "add_device" cannot be invoked while pktgen is running.
But there exists a race between "rem_device_all" and "add_device"
(which normally don't occur, because "rem_device_all" waits 125ms
before returning). Background'ing "rem_device_all" and running
"add_device" immediately allow the race to occur.

The race affects the thread's list of devices, "if_list".  The
if_lock is used for protecting this "if_list".  Other readers are
given lock-free access to the list under RCU read sections.
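
A minimal sketch of the resulting reader/updater pattern (simplified
from the patch below):

  /* Reader side, lock-free: */
  rcu_read_lock();
  list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
          if (pkt_dev->running)
                  /* ... use pkt_dev ... */;
  rcu_read_unlock();

  /* Updater side, writers serialized by if_lock: */
  if_lock(t);
  list_del_rcu(&p->list);
  if_unlock(t);
  kfree_rcu(p, rcu);	/* freed after an RCU grace period */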

Note, interface config changes (via proc) can occur while pktgen is
running, which worries me a bit.  I'm assuming proc_remove() takes
appropriate locks, to assure no writers exist after proc_remove()
finishes.

I've been running a script exercising the race condition (leading me
to fix the proc_remove order) without any issues.  The script also
exercises concurrent proc writes while the interface config is
getting removed.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/core/pktgen.c |  102 ++++++++++++++++++++++++++++-------------------------
 1 files changed, 53 insertions(+), 49 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index cae7e0c..915b162 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -69,8 +69,9 @@
  * for running devices in the if_list and sends packets until count is 0 it
  * also the thread checks the thread->control which is used for inter-process
  * communication. controlling process "posts" operations to the threads this
- * way. The if_lock should be possible to remove when add/rem_device is merged
- * into this too.
+ * way.
+ * The if_list is RCU protected, and the if_lock remains to protect updating
+ * of if_list, from "add_device" as it invoked from userspace (via proc write).
  *
  * By design there should only be *one* "controlling" process. In practice
  * multiple write accesses gives unpredictable result. Understood by "write"
@@ -208,7 +209,7 @@
 #define T_REMDEVALL   (1<<2)	/* Remove all devs */
 #define T_REMDEV      (1<<3)	/* Remove one dev */
 
-/* If lock -- can be removed after some work */
+/* If lock -- protects updating of if_list */
 #define   if_lock(t)           spin_lock(&(t->if_lock));
 #define   if_unlock(t)           spin_unlock(&(t->if_lock));
 
@@ -241,6 +242,7 @@ struct pktgen_dev {
 	struct proc_dir_entry *entry;	/* proc file */
 	struct pktgen_thread *pg_thread;/* the owner */
 	struct list_head list;		/* chaining in the thread's run-queue */
+	struct rcu_head	 rcu;		/* freed by RCU */
 
 	int running;		/* if false, the test will stop */
 
@@ -1737,14 +1739,15 @@ static int pktgen_thread_show(struct seq_file *seq, void *v)
 
 	seq_printf(seq, "Running: ");
 
-	if_lock(t);
-	list_for_each_entry(pkt_dev, &t->if_list, list)
+	/* Should we use if_lock here? */
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
 		if (pkt_dev->running)
 			seq_printf(seq, "%s ", pkt_dev->odevname);
 
 	seq_printf(seq, "\nStopped: ");
 
-	list_for_each_entry(pkt_dev, &t->if_list, list)
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
 		if (!pkt_dev->running)
 			seq_printf(seq, "%s ", pkt_dev->odevname);
 
@@ -1753,7 +1756,7 @@ static int pktgen_thread_show(struct seq_file *seq, void *v)
 	else
 		seq_printf(seq, "\nResult: NA\n");
 
-	if_unlock(t);
+	rcu_read_unlock();
 
 	return 0;
 }
@@ -1878,10 +1881,8 @@ static struct pktgen_dev *__pktgen_NN_threads(const struct pktgen_net *pn,
 		pkt_dev = pktgen_find_dev(t, ifname, exact);
 		if (pkt_dev) {
 			if (remove) {
-				if_lock(t);
 				pkt_dev->removal_mark = 1;
 				t->control |= T_REMDEV;
-				if_unlock(t);
 			}
 			break;
 		}
@@ -1931,7 +1932,8 @@ static void pktgen_change_name(const struct pktgen_net *pn, struct net_device *d
 	list_for_each_entry(t, &pn->pktgen_threads, th_list) {
 		struct pktgen_dev *pkt_dev;
 
-		list_for_each_entry(pkt_dev, &t->if_list, list) {
+		rcu_read_lock();
+		list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 			if (pkt_dev->odev != dev)
 				continue;
 
@@ -1946,6 +1948,7 @@ static void pktgen_change_name(const struct pktgen_net *pn, struct net_device *d
 				       dev->name);
 			break;
 		}
+		rcu_read_unlock();
 	}
 }
 
@@ -2997,8 +3000,8 @@ static void pktgen_run(struct pktgen_thread *t)
 
 	func_enter();
 
-	if_lock(t);
-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 
 		/*
 		 * setup odev and create initial packet.
@@ -3007,18 +3010,18 @@ static void pktgen_run(struct pktgen_thread *t)
 
 		if (pkt_dev->odev) {
 			pktgen_clear_counters(pkt_dev);
-			pkt_dev->running = 1;	/* Cranke yeself! */
 			pkt_dev->skb = NULL;
 			pkt_dev->started_at = pkt_dev->next_tx = ktime_get();
 
 			set_pkt_overhead(pkt_dev);
 
 			strcpy(pkt_dev->result, "Starting");
+			pkt_dev->running = 1;	/* Cranke yeself! */
 			started++;
 		} else
 			strcpy(pkt_dev->result, "Error starting");
 	}
-	if_unlock(t);
+	rcu_read_unlock();
 	if (started)
 		t->control &= ~(T_STOP);
 }
@@ -3041,27 +3044,25 @@ static int thread_is_running(const struct pktgen_thread *t)
 {
 	const struct pktgen_dev *pkt_dev;
 
-	list_for_each_entry(pkt_dev, &t->if_list, list)
-		if (pkt_dev->running)
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
+		if (pkt_dev->running) {
+			rcu_read_unlock();
 			return 1;
+		}
+	rcu_read_unlock();
 	return 0;
 }
 
 static int pktgen_wait_thread_run(struct pktgen_thread *t)
 {
-	if_lock(t);
-
 	while (thread_is_running(t)) {
 
-		if_unlock(t);
-
 		msleep_interruptible(100);
 
 		if (signal_pending(current))
 			goto signal;
-		if_lock(t);
 	}
-	if_unlock(t);
 	return 1;
 signal:
 	return 0;
@@ -3166,10 +3167,10 @@ static int pktgen_stop_device(struct pktgen_dev *pkt_dev)
 		return -EINVAL;
 	}
 
+	pkt_dev->running = 0;
 	kfree_skb(pkt_dev->skb);
 	pkt_dev->skb = NULL;
 	pkt_dev->stopped_at = ktime_get();
-	pkt_dev->running = 0;
 
 	show_results(pkt_dev, nr_frags);
 
@@ -3180,9 +3181,8 @@ static struct pktgen_dev *next_to_run(struct pktgen_thread *t)
 {
 	struct pktgen_dev *pkt_dev, *best = NULL;
 
-	if_lock(t);
-
-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 		if (!pkt_dev->running)
 			continue;
 		if (best == NULL)
@@ -3190,7 +3190,8 @@ static struct pktgen_dev *next_to_run(struct pktgen_thread *t)
 		else if (ktime_compare(pkt_dev->next_tx, best->next_tx) < 0)
 			best = pkt_dev;
 	}
-	if_unlock(t);
+	rcu_read_unlock();
+
 	return best;
 }
 
@@ -3200,13 +3201,13 @@ static void pktgen_stop(struct pktgen_thread *t)
 
 	func_enter();
 
-	if_lock(t);
+	rcu_read_lock();
 
-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 		pktgen_stop_device(pkt_dev);
 	}
 
-	if_unlock(t);
+	rcu_read_unlock();
 }
 
 /*
@@ -3220,8 +3221,6 @@ static void pktgen_rem_one_if(struct pktgen_thread *t)
 
 	func_enter();
 
-	if_lock(t);
-
 	list_for_each_safe(q, n, &t->if_list) {
 		cur = list_entry(q, struct pktgen_dev, list);
 
@@ -3235,8 +3234,6 @@ static void pktgen_rem_one_if(struct pktgen_thread *t)
 
 		break;
 	}
-
-	if_unlock(t);
 }
 
 static void pktgen_rem_all_ifs(struct pktgen_thread *t)
@@ -3248,8 +3245,6 @@ static void pktgen_rem_all_ifs(struct pktgen_thread *t)
 
 	/* Remove all devices, free mem */
 
-	if_lock(t);
-
 	list_for_each_safe(q, n, &t->if_list) {
 		cur = list_entry(q, struct pktgen_dev, list);
 
@@ -3258,8 +3253,6 @@ static void pktgen_rem_all_ifs(struct pktgen_thread *t)
 
 		pktgen_remove_device(t, cur);
 	}
-
-	if_unlock(t);
 }
 
 static void pktgen_rem_thread(struct pktgen_thread *t)
@@ -3484,8 +3477,8 @@ static struct pktgen_dev *pktgen_find_dev(struct pktgen_thread *t,
 	struct pktgen_dev *p, *pkt_dev = NULL;
 	size_t len = strlen(ifname);
 
-	if_lock(t);
-	list_for_each_entry(p, &t->if_list, list)
+	rcu_read_lock();
+	list_for_each_entry_rcu(p, &t->if_list, list)
 		if (strncmp(p->odevname, ifname, len) == 0) {
 			if (p->odevname[len]) {
 				if (exact || p->odevname[len] != '@')
@@ -3495,7 +3488,7 @@ static struct pktgen_dev *pktgen_find_dev(struct pktgen_thread *t,
 			break;
 		}
 
-	if_unlock(t);
+	rcu_read_unlock();
 	pr_debug("find_dev(%s) returning %p\n", ifname, pkt_dev);
 	return pkt_dev;
 }
@@ -3509,6 +3502,12 @@ static int add_dev_to_thread(struct pktgen_thread *t,
 {
 	int rv = 0;
 
+	/* This function cannot be called concurrently, as its called
+	 * under pktgen_thread_lock mutex, but it can run from
+	 * userspace on another CPU than the kthread.  The if_lock()
+	 * is used here to sync with concurrent instances of
+	 * _rem_dev_from_if_list() invoked via kthread, which is also
+	 * updating the if_list */
 	if_lock(t);
 
 	if (pkt_dev->pg_thread) {
@@ -3517,9 +3516,9 @@ static int add_dev_to_thread(struct pktgen_thread *t,
 		goto out;
 	}
 
-	list_add(&pkt_dev->list, &t->if_list);
-	pkt_dev->pg_thread = t;
 	pkt_dev->running = 0;
+	pkt_dev->pg_thread = t;
+	list_add_rcu(&pkt_dev->list, &t->if_list);
 
 out:
 	if_unlock(t);
@@ -3674,11 +3673,13 @@ static void _rem_dev_from_if_list(struct pktgen_thread *t,
 	struct list_head *q, *n;
 	struct pktgen_dev *p;
 
+	if_lock(t);
 	list_for_each_safe(q, n, &t->if_list) {
 		p = list_entry(q, struct pktgen_dev, list);
 		if (p == pkt_dev)
-			list_del(&p->list);
+			list_del_rcu(&p->list);
 	}
+	if_unlock(t);
 }
 
 static int pktgen_remove_device(struct pktgen_thread *t,
@@ -3698,20 +3699,22 @@ static int pktgen_remove_device(struct pktgen_thread *t,
 		pkt_dev->odev = NULL;
 	}
 
-	/* And update the thread if_list */
-
-	_rem_dev_from_if_list(t, pkt_dev);
-
+	/* Remove proc before if_list entry, because add_device uses
+	 * list to determine if interface already exist, avoid race
+	 * with proc_create_data() */
 	if (pkt_dev->entry)
 		proc_remove(pkt_dev->entry);
 
+	/* And update the thread if_list */
+	_rem_dev_from_if_list(t, pkt_dev);
+
 #ifdef CONFIG_XFRM
 	free_SAs(pkt_dev);
 #endif
 	vfree(pkt_dev->flows);
 	if (pkt_dev->page)
 		put_page(pkt_dev->page);
-	kfree(pkt_dev);
+	kfree_rcu(pkt_dev, rcu);
 	return 0;
 }
 
@@ -3811,6 +3814,7 @@ static void __exit pg_cleanup(void)
 {
 	unregister_netdevice_notifier(&pktgen_notifier_block);
 	unregister_pernet_subsys(&pg_net_ops);
+	/* Don't need rcu_barrier() due to use of kfree_rcu() */
 }
 
 module_init(pg_init);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* RE: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024
  2014-05-14 14:17 ` [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024 Jesper Dangaard Brouer
@ 2014-05-14 14:28   ` David Laight
  2014-05-14 19:25     ` Jesper Dangaard Brouer
  2014-05-14 16:28   ` Alexander Duyck
  1 sibling, 1 reply; 22+ messages in thread
From: David Laight @ 2014-05-14 14:28 UTC (permalink / raw)
  To: 'Jesper Dangaard Brouer', netdev
  Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
	David S. Miller, Stephen Hemminger, Paul E. McKenney,
	Robert Olsson, Ben Greear, John Fastabend, danieltt, zhouzhouyi

From: Jesper Dangaard Brouer
> Using pktgen I'm seeing the ixgbe driver "push-back", due TX ring
> running full.  Thus, the TX ring is artificially limiting pktgen.
> 
> Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
> counters.

Have you tried reducing the tx interrupt mitigation delay?
It might just be that the end-of-tx interrupt is delayed too long.

	David


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop
  2014-05-14 14:17 ` [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop Jesper Dangaard Brouer
@ 2014-05-14 14:35   ` Eric Dumazet
  2014-05-14 15:13     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2014-05-14 14:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Alexander Duyck, Jeff Kirsher, Daniel Borkmann,
	Florian Westphal, David S. Miller, Stephen Hemminger,
	Paul E. McKenney, Robert Olsson, Ben Greear, John Fastabend,
	danieltt, zhouzhouyi

On Wed, 2014-05-14 at 16:17 +0200, Jesper Dangaard Brouer wrote:
> Avoid the expensive atomic refcnt increase in the pktgen xmit loop, by
> simply setting the refcnt only when a new SKB gets allocated. Setting
> it according to how many times we are spinning the same SKB (and
> handling the case of skb_clone=0).
> 
> Performance data with CLONE_SKB==100000 and TX ring buffer size=1024:
>  (single CPU performance, ixgbe 10Gbit/s, E5-2630)
>  * Before: 5,362,722 pps --> 186.47ns per pkt (1/5362722*10^9)
>  * Now:    5,608,781 pps --> 178.29ns per pkt (1/5608781*10^9)
>  * Diff:    +246,059 pps -->  -8.18ns
> 
> The performance increase converted to nanoseconds (8.18ns), correspond
> well to the measured overhead of LOCK prefixed assembler instructions
> on my E5-2630 CPU which is measured to be 8.23ns.
> 
> Note, with TX ring size 768 I see some "tx_restart_queue" events.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---

OK, but then you need to properly undo the refcnt at the end of the
pktgen loop, otherwise you leak one skb?

Say you run pktgen for 995 sends, and clone_count is 10.

If done properly, you probably can avoid the kfree_skb(pkt_dev->skb) in
pktgen_xmit(), and set initial skb->users to max(1, pkt_dev->clone_skb);

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop
  2014-05-14 14:35   ` Eric Dumazet
@ 2014-05-14 15:13     ` Jesper Dangaard Brouer
  2014-05-14 15:35       ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-05-14 15:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: brouer, netdev, Alexander Duyck, Jeff Kirsher, Daniel Borkmann,
	Florian Westphal, David S. Miller, Stephen Hemminger,
	Paul E. McKenney, Robert Olsson, Ben Greear, John Fastabend,
	danieltt, zhouzhouyi

On Wed, 14 May 2014 07:35:31 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2014-05-14 at 16:17 +0200, Jesper Dangaard Brouer wrote:
> > Avoid the expensive atomic refcnt increase in the pktgen xmit loop, by
> > simply setting the refcnt only when a new SKB gets allocated. Setting
> > it according to how many times we are spinning the same SKB (and
> > handling the case of skb_clone=0).
> > 
> > Performance data with CLONE_SKB==100000 and TX ring buffer size=1024:
> >  (single CPU performance, ixgbe 10Gbit/s, E5-2630)
> >  * Before: 5,362,722 pps --> 186.47ns per pkt (1/5362722*10^9)
> >  * Now:    5,608,781 pps --> 178.29ns per pkt (1/5608781*10^9)
> >  * Diff:    +246,059 pps -->  -8.18ns
> > 
> > The performance increase converted to nanoseconds (8.18ns), correspond
> > well to the measured overhead of LOCK prefixed assembler instructions
> > on my E5-2630 CPU which is measured to be 8.23ns.
> > 
> > Note, with TX ring size 768 I see some "tx_restart_queue" events.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> 
> OK, but then you need to properly undo the refcnt at the end of pktgen
> loop, otherwise you leak one skb ?
> 
> Say you run pktgen for 995 sends, and clone_count is 10.

Ah I see, if pktgen gets stopped, I might leak an SKB.


> If done properly, you probably can avoid the kfree_skb(pkt_dev->skb) in
> pktgen_xmit(), and set initial skb->users to max(1, pkt_dev->clone_skb);

Just setting max(1, pkt_dev->clone_skb) killed my machine (and my
netconsole didn't send the entire output before dying).

kernel:[91899.988338] skbuff: skb_under_panic: text:ffffffffa03a0b5f len:56 put:14 head:ffff88081bdeb400 data:ffff88081bdeb3f4 tail:0x2c end:0xc0 dev:eth8
 kernel:[91899.988505] ------------[ cut here ]------------
 kernel:[91899.988643] invalid opcode: 0000 [#1] SMP 
 kernel:[91899.988711] Modules linked in: pktgen netconsole ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
 kernel:[91899.988887] general protection fault: 0000 [#2] SMP 
 kernel:[91899.988888] Modules linked in: pktgen netconsole ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc vhost_net macvtap macvlan vhost tun kvm_intel kvm microcode pcspkr ipmi_si ipmi_msghandler hid_generic i2c_i801 sg ixgbe mlx4_en mdio vxlan mlx4_core igb i2c_algo_bit i2c_core ptp pps_core sd_mod crc_t10dif crct10dif_common ahci libahci libata dm_mirror dm_region_hash dm_log dm_mod [last unloaded: pktgen]
 kernel:[91899.988912] CPU: 0 PID: 21807 Comm: kpktgend_0 Tainted: G        W     3.15.0-rc1-dpaccel01-pktgen+ #469
 kernel:[91899.988913] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
 kernel:[91899.988914] task: ffff88081e1b6480 ti: ffff880807374000 task.ti: ffff880807374000
 kernel:[91899.988915] RIP: 0010:[<ffffffff8119ecdc>]  [<ffffffff8119ecdc>] __kmalloc_node_track_caller+0xdc/0x240
 kernel:[91899.988920] RSP: 0018:ffff880807375798  EFLAGS: 00010046
 kernel:[91899.988921] RAX: 0000000000000000 RBX: 0000000000000020 RCX: dc0a9af1def08800
 kernel:[91899.988922] RDX: 0000000000046786 RSI: 0000000000000000 RDI: 0000000000016340
 kernel:[91899.988922] RBP: ffff8808073757f8 R08: ffff88085fc16340 R09: ffffffff815a8367
 kernel:[91899.988923] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000010220
 kernel:[91899.988924] R13: ffff88085f803600 R14: 0000000000000180 R15: 00000000ffffffff
 kernel:[91899.988925] FS:  0000000000000000(0000) GS:ffff88085fc00000(0000) knlGS:0000000000000000
 kernel:[91899.988925] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 kernel:[91899.988926] CR2: 00007f311af84000 CR3: 00000000019f0000 CR4: 00000000000407f0
 kernel:[91899.988926] Stack:
 kernel:[91899.988927]  00000020ffffffff ffff88081a78e800 ffff8808594c6800 ffff8808594c6000
 kernel:[91899.988928]  ffffffff815a839b dc0a9af1def08800 ffff8808073757e8 0000000000000020
 kernel:[91899.988930]  00000000ffffffff 0000000000000180 ffff880807375867 0000000000000180
 kernel:[91899.988931] Call Trace:
 kernel:[91899.988932]  [<ffffffff815a839b>] ? __alloc_skb+0x8b/0x1e0
 kernel:[91899.988936]  [<ffffffff815a805b>] __kmalloc_reserve+0x3b/0xa0
 kernel:[91899.988938]  [<ffffffff815a839b>] __alloc_skb+0x8b/0x1e0
 kernel:[91899.988940]  [<ffffffff815d508c>] netpoll_send_udp+0x8c/0x400
 kernel:[91899.988943]  [<ffffffffa03962c3>] write_msg+0xd3/0x130 [netconsole]
 kernel:[91899.988946]  [<ffffffff810b5e75>] call_console_drivers.clone.1+0xa5/0x110
 kernel:[91899.988950]  [<ffffffff810b5fc7>] console_cont_flush.clone.0+0xe7/0x1a0
 kernel:[91899.988952]  [<ffffffff810b60b2>] console_unlock+0x32/0x2a0
 kernel:[91899.988953]  [<ffffffff810b6805>] vprintk_emit+0x325/0x510
 kernel:[91899.988954]  [<ffffffff810b577c>] ? wake_up_klogd+0x3c/0x50
 kernel:[91899.988956]  [<ffffffff816be2a1>] printk+0x77/0x79
 kernel:[91899.988959]  [<ffffffff816c6e95>] ? notifier_call_chain+0x55/0x80
 kernel:[91899.988962]  [<ffffffff810d4547>] print_modules+0xb7/0xd0
 kernel:[91899.988965]  [<ffffffff816c6f3e>] ? notify_die+0x2e/0x30
 kernel:[91899.988966]  [<ffffffff816c3f7b>] __die+0xab/0x100
 kernel:[91899.988967]  [<ffffffff81005e78>] die+0x48/0x90
 kernel:[91899.988970]  [<ffffffff816c3b9b>] do_trap+0xcb/0x170
 kernel:[91899.988971]  [<ffffffff816c6eda>] ? __atomic_notifier_call_chain+0x1a/0x30
 kernel:[91899.988972]  [<ffffffff81003615>] do_invalid_op+0x95/0xb0
 kernel:[91899.988974]  [<ffffffff815a7f45>] ? skb_push+0x85/0x90
 kernel:[91899.988976]  [<ffffffff810b66c8>] ? vprintk_emit+0x1e8/0x510
 kernel:[91899.988977]  [<ffffffff816cd378>] invalid_op+0x18/0x20
 kernel:[91899.988979]  [<ffffffff815a7f45>] ? skb_push+0x85/0x90
 kernel:[91899.988981]  [<ffffffff815a7f45>] ? skb_push+0x85/0x90
 kernel:[91899.988982]  [<ffffffffa03a0b5f>] fill_packet_ipv4+0xaf/0x470 [pktgen]
 kernel:[91899.988985]  [<ffffffff815a8eff>] ? __kfree_skb+0x3f/0xc0
 kernel:[91899.988987]  [<ffffffffa0167ed0>] ? ixgbe_xmit_frame_ring+0x610/0x610 [ixgbe]
 kernel:[91899.988993]  [<ffffffffa03a1438>] pktgen_xmit+0x98/0x370 [pktgen]
 kernel:[91899.988995]  [<ffffffffa03a1540>] ? pktgen_xmit+0x1a0/0x370 [pktgen]
 kernel:[91899.988997]  [<ffffffffa03a1887>] pktgen_thread_worker+0x177/0x500 [pktgen]
 kernel:[91899.988999]  [<ffffffff810af2b0>] ? bit_waitqueue+0xd0/0xd0
 kernel:[91899.989001]  [<ffffffff810af2b0>] ? bit_waitqueue+0xd0/0xd0
 kernel:[91899.989002]  [<ffffffffa03a1710>] ? pktgen_xmit+0x370/0x370 [pktgen]
 kernel:[91899.989004]  [<ffffffff8108e7ae>] kthread+0xce/0xf0
 kernel:[91899.989006]  [<ffffffff8108e6e0>] ? kthread_worker_fn+0x120/0x120
 kernel:[91899.989008]  [<ffffffff816cbd6c>] ret_from_fork+0x7c/0xb0
 kernel:[91899.989011]  [<ffffffff8108e6e0>] ? kthread_worker_fn+0x120/0x120
 kernel:[91899.989012] Code: 48 8b 4d c0 44 89 fa 44 89 e6 4c 89 ef e8 4d c5 ff ff 48 89 45 c8 eb 37 0f 1f 80 00 00 00 00 49 63 45 20 48 8b 4d c8 49 8b 7d 00 <48> 8b 1c 01 48 8d 4a 01 48 8b 45 c8 65 48 0f c7 0f 0f 94 c0 3c 
 kernel:[91899.989030]  RSP <ffff880807375798>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop
  2014-05-14 15:13     ` Jesper Dangaard Brouer
@ 2014-05-14 15:35       ` Eric Dumazet
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2014-05-14 15:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Alexander Duyck, Jeff Kirsher, Daniel Borkmann,
	Florian Westphal, David S. Miller, Stephen Hemminger,
	Paul E. McKenney, Robert Olsson, Ben Greear, John Fastabend,
	danieltt, zhouzhouyi

On Wed, 2014-05-14 at 17:13 +0200, Jesper Dangaard Brouer wrote:

> Ah I see, then pktgen gets stopped, I might leak a SKB.
> 
> 
> > If done properly, you probably can avoid the kfree_skb(pkt_dev->skb) in
> > pktgen_xmit(), and set initial skb->users to max(1, pkt_dev->clone_skb);
> 
> Just setting max(1, pkt_dev->clone_skb) killed my machine (and my
> netconsole didn't send the entire output before dying).

Well, of course, I did not provide a patch, only a hint ;)

If skb->users is set to the _exact_ number of
kfree_skb()/consume_skb() calls done by the driver, then no
kfree_skb() is needed in pktgen.

The thing is: clone_skb == 0 basically means that you need to init
skb->users to the number of times you plan to send this packet.

This probably depends on pkt_dev->count - pkt_dev->sofar, and other
things.

min(dev->clone_skb, pkt_dev->count - pkt_dev->sofar)

But don't 'try' that without making sure it's the right thing !!!

Make sure to replace atomic_dec(&(pkt_dev->skb->users)) by kfree_skb()
in case of NETDEV_TX_LOCKED/NETDEV_TX_BUSY
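
An untested sketch of the exact-refcount idea (assumes a finite run,
i.e. pkt_dev->count > 0, and still needs the NETDEV_TX_BUSY fix-up
above -- this is a hint, not a proposed patch):

  /* UNTESTED: size the refcount to the sends actually planned */
  u64 left  = pkt_dev->count - pkt_dev->sofar;
  u64 users = min_t(u64, max_t(u64, pkt_dev->clone_skb, 1), left);
  atomic_set(&pkt_dev->skb->users, (int)users);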

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024
  2014-05-14 14:17 ` [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024 Jesper Dangaard Brouer
  2014-05-14 14:28   ` David Laight
@ 2014-05-14 16:28   ` Alexander Duyck
  2014-05-14 17:49     ` David Miller
  1 sibling, 1 reply; 22+ messages in thread
From: Alexander Duyck @ 2014-05-14 16:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: Jeff Kirsher, Daniel Borkmann, Florian Westphal, David S. Miller,
	Stephen Hemminger, Paul E. McKenney, Robert Olsson, Ben Greear,
	John Fastabend, danieltt, zhouzhouyi

On 05/14/2014 07:17 AM, Jesper Dangaard Brouer wrote:
> Using pktgen I'm seeing the ixgbe driver "push-back", due TX ring
> running full.  Thus, the TX ring is artificially limiting pktgen.
> 
> Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
> counters.
> 
> Increasing the TX ring buffer should be done carefully, as it comes at
> a higher memory cost, which can also negatively influence performance.
> E.g. ring buffer array of struct ixgbe_tx_buffer (current size 48bytes)
> increase from 512*48=24576bytes to 1024*48=49152bytes which is larger
> than the L1 data cache (32KB on my E5-2630), thus increasing the L1->L2
> cache-references.
> 
> Adjusting the TX ring buffer (TXSZ) measured over 10 sec with ifpps
>  (single CPU performance, ixgbe 10Gbit/s, E5-2630)
>  * cmd: ethtool -G eth8 tx $TXSZ
>  * 3,930,065 pps -- TXSZ= 512
>  * 5,312,249 pps -- TXSZ= 768
>  * 5,362,722 pps -- TXSZ=1024
>  * 5,361,390 pps -- TXSZ=1536
>  * 5,362,439 pps -- TXSZ=2048
>  * 5,359,744 pps -- TXSZ=4096
> 
> Choosing size 1024 because for the next optimizations 768 is not
> enough.
> 
> Notice after commit 6f25cd47d (pktgen: fix xmit test for BQL enabled
> devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
> the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This allow us to put
> more pressure on the TX ring buffers.
> 
> It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
> pktgen respecting this in the call to netif_xmit_frozen_or_drv_stopped(txq).
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
> 
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index c688c8a..bf078fe 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -63,7 +63,7 @@
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>  
>  /* TX/RX descriptor defines */
> -#define IXGBE_DEFAULT_TXD		    512
> +#define IXGBE_DEFAULT_TXD		   1024
>  #define IXGBE_DEFAULT_TX_WORK		    256
>  #define IXGBE_MAX_TXD			   4096
>  #define IXGBE_MIN_TXD			     64
> 

What is the point of optimizing ixgbe for a synthetic benchmark?  In my
experience the full stack can only handle about 2Mpps at 60B packets
with a single queue.  Updating the defaults for a pktgen test seems
unrealistic, as that isn't really a standard use case for the driver.

I'd say that it might be better to just add a note to the documentation
folder indicating what configuration is optimal for pktgen, rather than
changing everyone's defaults to support one specific test.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024
  2014-05-14 16:28   ` Alexander Duyck
@ 2014-05-14 17:49     ` David Miller
  2014-05-14 19:09       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2014-05-14 17:49 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: brouer, netdev, jeffrey.t.kirsher, dborkman, fw, shemminger,
	paulmck, robert, greearb, john.r.fastabend, danieltt, zhouzhouyi

From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Wed, 14 May 2014 09:28:50 -0700

> I'd say that it might be better to just add a note to the documentation
> folder indicating what configuration is optimal for pktgen rather then
> changing everyone's defaults to support one specific test.

We could have drivers provide a pktgen config adjustment mechanism,
so if someone starts pktgen then the device auto-adjusts to a pktgen
optimal configuration (whatever that may entail).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024
  2014-05-14 17:49     ` David Miller
@ 2014-05-14 19:09       ` Jesper Dangaard Brouer
  2014-05-14 19:54         ` David Miller
                           ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-05-14 19:09 UTC (permalink / raw)
  To: David Miller
  Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, dborkman, fw,
	shemminger, paulmck, robert, greearb, john.r.fastabend, danieltt,
	zhouzhouyi, brouer

On Wed, 14 May 2014 13:49:50 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Wed, 14 May 2014 09:28:50 -0700
> 
> > I'd say that it might be better to just add a note to the documentation
> > folder indicating what configuration is optimal for pktgen rather then
> > changing everyone's defaults to support one specific test.
> 
> We could have drivers provide a pktgen config adjustment mechanism,
> so if someone starts pktgen then the device auto-adjusts to a pktgen
> optimal configuration (whatever that may entail).

That might be problematic, because changing the TX queue size causes
the ixgbe driver to reset the link.

Notice that pktgen is ignoring BQL.  I'm sort of hoping that BQL will
push back for real use-cases, to avoid the bad effects of increasing
the TX size.

One of the bad effects I'm hoping BQL will mitigate is the case of
filling the TX queue with large frames.  Consider 9K jumbo frames: how
long will it take to empty 1024 jumbo frames on a 10G link?

 (9000*8)/(10000*10^6)*1000*1024 = 7.37ms

But with 9K MTU and a 512 ring, we already have:
 (9000*8)/(10000*10^6)*1000*512 = 3.69ms

A more normal use-case would be 1500+38 (Ethernet overhead):
 (1538*8)/(10000*10^6)*1000*1024 = 1.25ms

And then again, these calculations should in theory be multiplied by
the number of TX queues.
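
The same drain-time calculation as a small helper sketch (for
illustration only, not kernel code):

  /* drain_ms = frame_bits / link_bps * 1000 * ring_entries */
  static double ring_drain_ms(double frame_bytes, double ring_entries,
                              double link_bps)
  {
          return (frame_bytes * 8.0) / link_bps * 1000.0 * ring_entries;
  }
  /* e.g. ring_drain_ms(9000, 1024, 10e9) = 7.37 ms, as above */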


I know increasing these limits should not be taken lightly, but we
just have to be crystal clear that the current 512 limit is
artificially limiting the capabilities of your hardware.

We can postpone this increase, because I also observe a 2Mpps limit
when actually allocating/freeing real SKBs.  The alloc/free cost is
currently just hiding this limitation.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024
  2014-05-14 14:28   ` David Laight
@ 2014-05-14 19:25     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-05-14 19:25 UTC (permalink / raw)
  To: David Laight
  Cc: brouer, netdev, Alexander Duyck, Jeff Kirsher, Daniel Borkmann,
	Florian Westphal, David S. Miller, Stephen Hemminger,
	Paul E. McKenney, Robert Olsson, Ben Greear, John Fastabend,
	danieltt, zhouzhouyi

On Wed, 14 May 2014 14:28:24 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Jesper Dangaard Brouer
> > Using pktgen I'm seeing the ixgbe driver "push-back", due TX ring
> > running full.  Thus, the TX ring is artificially limiting pktgen.
> > 
> > Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
> > counters.
> 
> Have you tried reducing the tx interrupt mitigation delay.
> It might just be that the end of tx interrupt is delayed too long.

Does the ixgbe have TX interrupt mitigation delays?

I don't "see" them:

$ sudo ethtool -c eth8 
Coalesce parameters for eth8:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 1
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024
  2014-05-14 19:09       ` Jesper Dangaard Brouer
@ 2014-05-14 19:54         ` David Miller
  2014-05-15  9:16         ` David Laight
  2014-05-29 15:29         ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2014-05-14 19:54 UTC (permalink / raw)
  To: brouer
  Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, dborkman, fw,
	shemminger, paulmck, robert, greearb, john.r.fastabend, danieltt,
	zhouzhouyi

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 14 May 2014 21:09:35 +0200

> On Wed, 14 May 2014 13:49:50 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
> 
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>> Date: Wed, 14 May 2014 09:28:50 -0700
>> 
>> > I'd say that it might be better to just add a note to the documentation
>> > folder indicating what configuration is optimal for pktgen rather then
>> > changing everyone's defaults to support one specific test.
>> 
>> We could have drivers provide a pktgen config adjustment mechanism,
>> so if someone starts pktgen then the device auto-adjusts to a pktgen
>> optimal configuration (whatever that may entail).
> 
> That might be problematic because changing the TX queue size cause the
> ixgbe driver to reset the link.

A minor issue for someone firing up a specialized network test tool
like pktgen.

> Notice that pktgen is ignoring BQL.  I'm sort of hoping that BQL will
> push back for real use-cases, to avoid the bad effects of increasing
> the TX size.

I think we need to think carefully about drivers configuring such huge
per-queue TX queue sizes.

If anything, we should be decreasing their size in the default
configuration.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024
  2014-05-14 19:09       ` Jesper Dangaard Brouer
  2014-05-14 19:54         ` David Miller
@ 2014-05-15  9:16         ` David Laight
  2014-05-29 15:29         ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 22+ messages in thread
From: David Laight @ 2014-05-15  9:16 UTC (permalink / raw)
  To: 'Jesper Dangaard Brouer', David Miller
  Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, dborkman, fw,
	shemminger, paulmck, robert, greearb, john.r.fastabend, danieltt,
	zhouzhouyi

From: Jesper Dangaard Brouer
> On Wed, 14 May 2014 13:49:50 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> > Date: Wed, 14 May 2014 09:28:50 -0700
> >
> > > I'd say that it might be better to just add a note to the documentation
> > > folder indicating what configuration is optimal for pktgen rather then
> > > changing everyone's defaults to support one specific test.
> >
> > We could have drivers provide a pktgen config adjustment mechanism,
> > so if someone starts pktgen then the device auto-adjusts to a pktgen
> > optimal configuration (whatever that may entail).
> 
> That might be problematic because changing the TX queue size cause the
> ixgbe driver to reset the link.
> 
> Notice that pktgen is ignoring BQL.  I'm sort of hoping that BQL will
> push back for real use-cases, to avoid the bad effects of increasing
> the TX size.
> 
> One of the bad effects, I'm hoping BQL will mitigate, is the case of
> filling the TX queue with large frames.  Consider 9K jumbo frames, how
> long time will it take to empty 1024 jumbo frames on a 10G link:
> 
> (9000*8)/(10000*10^6)*1000*1024 = 7.37ms

Never mind 9k 'jumbo' frames - I'm pretty sure ixgbe supports TCP
segmentation offload, so you can have 64k frames in the tx ring.
Since each takes (about) 44 ethernet frames, that is about 55ms
to clear the queue.

	David

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024
  2014-05-14 19:09       ` Jesper Dangaard Brouer
  2014-05-14 19:54         ` David Miller
  2014-05-15  9:16         ` David Laight
@ 2014-05-29 15:29         ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-05-29 15:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, alexander.h.duyck, netdev, jeffrey.t.kirsher,
	dborkman, fw, shemminger, paulmck, robert, greearb,
	john.r.fastabend, danieltt, zhouzhouyi, Thomas Graf

On Wed, 14 May 2014 21:09:35 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> > Date: Wed, 14 May 2014 09:28:50 -0700
> > 
> > > I'd say that it might be better to just add a note to the documentation
> > > folder indicating what configuration is optimal for pktgen rather then
> > > changing everyone's defaults to support one specific test.
>
> I know, increasing these limits should not be taken lightly, but we
> just have to be crystal clear that the current 512 limit, is
> artificially limiting the capabilities of your hardware.

The above statement is mine and it is wrong ;-)

 I'm dropping this patch because of the following understanding:

Alexander Duyck pointed out to me that interrupt throttling might be
the reason behind the need to increase the TX ring size.  I tested
this, and Alex is right.

The TX ring size ("ethtool -g") needed for max performance is
directly correlated with how fast/often the TX cleanup runs.

Adjusting the "ethtool -C <ethx> rx-usecs" value affect how often we
cleanup the ring(s).  The default value "1" is some auto interrupt throttling.
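
For example, to clean up the TX ring more often:

  # ethtool -C eth8 rx-usecs 30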

Notice that with this coalesce tuning, the performance even increases
from 6.7Mpps to 7.1Mpps on top of patchset V1.

On top of V1 patchset:
 - 6,747,016 pps - rx-usecs:  1 tx-ring: 1024 (irqs: 9492)
 - 6,684,612 pps - rx-usecs: 10 tx-ring: 1024 (irqs:99322)
 - 7,005,226 pps - rx-usecs: 20 tx-ring: 1024 (irqs:50444)
 - 7,113,048 pps - rx-usecs: 30 tx-ring: 1024 (irqs:34004)
 - 7,133,019 pps - rx-usecs: 40 tx-ring: 1024 (irqs:25845)
 - 7,168,399 pps - rx-usecs: 50 tx-ring: 1024 (irqs:20896)

Notice the same performance with a 512 TX ring.

Lowering TX ring size to (default) 512:
 (On top of V1 patchset)
 - 3,934,674 pps - rx-usecs:  1 tx-ring: 512 (irqs: 9602)
 - 6,684,066 pps - rx-usecs: 10 tx-ring: 512 (irqs:99370)
 - 7,001,235 pps - rx-usecs: 20 tx-ring: 512 (irqs:50567)
 - 7,115,047 pps - rx-usecs: 30 tx-ring: 512 (irqs:34105)
 - 7,130,250 pps - rx-usecs: 40 tx-ring: 512 (irqs:25741)
 - 7,165,296 pps - rx-usecs: 50 tx-ring: 512 (irqs:20898)

Notice how even a 256 TX ring is enough, if we clean up the TX ring
fast enough, and how performance decreases if we clean up too slowly.

Lowering TX ring size to (default) 256:
 (On top of V1 patchset)
 - 1,883,360 pps - rx-usecs:  1 tx-ring: 256 (irqs: 9800)
 - 6,683,552 pps - rx-usecs: 10 tx-ring: 256 (irqs:99786)
 - 7,005,004 pps - rx-usecs: 20 tx-ring: 256 (irqs:50749)
 - 7,108,776 pps - rx-usecs: 30 tx-ring: 256 (irqs:34536)
 - 5,734,301 pps - rx-usecs: 40 tx-ring: 256 (irqs:25909)
 - 4,590,093 pps - rx-usecs: 50 tx-ring: 256 (irqs:21183)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance
  2014-05-14 14:17 [net-next PATCH 0/5] Optimizing "pktgen" for single CPU performance Jesper Dangaard Brouer
                   ` (4 preceding siblings ...)
  2014-05-14 14:18 ` [net-next PATCH 5/5] pktgen: RCU'ify "if_list" to remove lock in next_to_run() Jesper Dangaard Brouer
@ 2014-06-26 11:16 ` Jesper Dangaard Brouer
  2014-06-26 11:16   ` [net-next PATCH V2 1/3] pktgen: document tuning for max NIC performance Jesper Dangaard Brouer
                     ` (3 more replies)
  5 siblings, 4 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-06-26 11:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David S. Miller, netdev
  Cc: Daniel Borkmann, Florian Westphal, Hannes Frederic Sowa,
	Thomas Graf, Eric Dumazet, zoltan.kiss, Alexander Duyck,
	Jeff Kirsher

This series focus on optimizing "pktgen" for single CPU performance.

V2-series:
 - Removed some patches
 - Documented the real reason for the TX ring buffer filling up

NIC tuning for pktgen:
 http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html

General overload setup according to:
 http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html

Hardware:
 System: CPU E5-2630
 NIC: Intel ixgbe/82599 chip

Testing done with net-next git tree on top of
 commit 6623b41944 ("Merge branch 'master' of...jkirsher/net-next")

Pktgen script exercising race condition:
 https://github.com/netoptimizer/network-testing/blob/master/pktgen/unit_test01_race_add_rem_device_loop.sh

Tool for measuring LOCK overhead:
 https://github.com/netoptimizer/network-testing/blob/master/src/overhead_cmpxchg.c

---

Jesper Dangaard Brouer (3):
      pktgen: RCU-ify "if_list" to remove lock in next_to_run()
      pktgen: avoid expensive set_current_state() call in loop
      pktgen: document tuning for max NIC performance


 Documentation/networking/pktgen.txt |   28 +++++++++
 net/core/pktgen.c                   |  110 ++++++++++++++++++-----------------
 2 files changed, 83 insertions(+), 55 deletions(-)

-- 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [net-next PATCH V2 1/3] pktgen: document tuning for max NIC performance
  2014-06-26 11:16 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance Jesper Dangaard Brouer
@ 2014-06-26 11:16   ` Jesper Dangaard Brouer
  2014-06-26 11:16   ` [net-next PATCH V2 2/3] pktgen: avoid expensive set_current_state() call in loop Jesper Dangaard Brouer
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-06-26 11:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David S. Miller, netdev
  Cc: Daniel Borkmann, Florian Westphal, Hannes Frederic Sowa,
	Thomas Graf, Eric Dumazet, zoltan.kiss, Alexander Duyck,
	Jeff Kirsher

Using pktgen I'm seeing the ixgbe driver "push-back", due to the TX
ring running full.  Thus, the TX ring is artificially limiting pktgen.
(Diagnose via "ethtool -S", looking for the "tx_restart_queue" or
"tx_busy" counters.)

With ixgbe, the real reason behind the TX ring running full is that
the TX ring is not being cleaned up fast enough.  The ixgbe driver
combines TX+RX ring cleanups, and the cleanup interval is affected by
the ethtool --coalesce setting of the parameter "rx-usecs".

Do not increase the default NIC TX ring buffer or the default cleanup
interval.  Instead, simply document that pktgen needs special NIC
tuning for maximum packet-per-sec performance.

Performance results with pktgen with clone_skb=100000.
TX ring size 512 (default), adjusting "rx-usecs":
 (Single CPU performance, E5-2630, ixgbe)
 - 3935002 pps - rx-usecs:  1 (irqs:  9346)
 - 5132350 pps - rx-usecs: 10 (irqs: 99157)
 - 5375111 pps - rx-usecs: 20 (irqs: 50154)
 - 5454050 pps - rx-usecs: 30 (irqs: 33872)
 - 5496320 pps - rx-usecs: 40 (irqs: 26197)
 - 5502510 pps - rx-usecs: 50 (irqs: 21527)

Adjusting the TX ring size (ethtool -G), with "rx-usecs==1" (default):
 - 3935002 pps - tx-size:  512
 - 5354401 pps - tx-size:  768
 - 5356847 pps - tx-size: 1024
 - 5327595 pps - tx-size: 1536
 - 5356779 pps - tx-size: 2048
 - 5353438 pps - tx-size: 4096

Notice that after commit 6f25cd47d (pktgen: fix xmit test for BQL
enabled devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and
ignores the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This
allows us to put more pressure on the TX ring buffers.

It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
pktgen respects this via the call to netif_xmit_frozen_or_drv_stopped(txq).
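
As a rough sketch (simplified, and not the exact upstream code), the
transmit loop effectively gates on the driver state like this:

	/* Only the driver's stopped/frozen queue state halts transmission
	 * here; the BQL stack XOFF flag is ignored. */
	while (netif_xmit_frozen_or_drv_stopped(txq))
		cpu_relax();	/* TX ring full; wait for driver cleanup */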

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 Documentation/networking/pktgen.txt |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt
index 0e30c78..0dffc6e 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.txt
@@ -24,6 +24,34 @@ For monitoring and control pktgen creates:
         /proc/net/pktgen/ethX
 
 
+Tuning NIC for max performance
+==============================
+
+The default NIC settings are (likely) not tuned for pktgen's artificial
+overload type of benchmarking, as such tuning could hurt the normal use-case.
+
+Specifically increasing the TX ring buffer in the NIC:
+ # ethtool -G ethX tx 1024
+
+A larger TX ring can improve pktgen's performance, while it can hurt
+in the general case, 1) because the TX ring buffer might get larger
+than the CPU's L1/L2 cache, and 2) because it allows more queueing in
+the NIC HW layer (which is bad for bufferbloat).
+
+One should be careful about concluding that packets/descriptors in the
+HW TX ring cause delay.  Drivers usually delay cleaning up the
+ring-buffers (for various performance reasons); thus packets stalling
+in the TX ring might just be waiting for cleanup.
+
+This cleanup issue is specifically the case for the ixgbe driver
+(Intel 82599 chip).  This driver combines TX+RX ring cleanups,
+and the cleanup interval is affected by the ethtool --coalesce setting
+of the parameter "rx-usecs".
+
+For ixgbe use e.g. "30", resulting in approx 33K interrupts/sec (1/30*10^6):
+ # ethtool -C ethX rx-usecs 30
+
+
 Viewing threads
 ===============
 /proc/net/pktgen/kpktgend_0 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [net-next PATCH V2 2/3] pktgen: avoid expensive set_current_state() call in loop
  2014-06-26 11:16 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance Jesper Dangaard Brouer
  2014-06-26 11:16   ` [net-next PATCH V2 1/3] pktgen: document tuning for max NIC performance Jesper Dangaard Brouer
@ 2014-06-26 11:16   ` Jesper Dangaard Brouer
  2014-06-26 11:16   ` [net-next PATCH V2 3/3] pktgen: RCU-ify "if_list" to remove lock in next_to_run() Jesper Dangaard Brouer
  2014-07-01 22:51   ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance David Miller
  3 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-06-26 11:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David S. Miller, netdev
  Cc: Daniel Borkmann, Florian Westphal, Hannes Frederic Sowa,
	Thomas Graf, Eric Dumazet, zoltan.kiss, Alexander Duyck,
	Jeff Kirsher

Avoid calling set_current_state() inside the busy-loop in
pktgen_thread_worker().  In the case of pkt_dev->delay, it is still
used/enabled in pktgen_xmit() via the spin() call.

The set_current_state(TASK_INTERRUPTIBLE) uses an xchg, which is
implicitly LOCK prefixed.  I've measured the asm LOCK operation to
take approx 8ns on this E5-2630 CPU.  The performance increase
correlates with this measurement.

Performance data with CLONE_SKB==100000, rx-usecs=30:
 (single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Prev:  5454050 pps --> 183.35ns (1/5454050*10^9)
 * Now:   5684009 pps --> 175.93ns (1/5684009*10^9)
 * Diff:  +229959 pps -->  -7.42ns
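
A minimal userspace microbenchmark in the same spirit (a sketch only,
not the overhead_cmpxchg.c tool linked in the cover letter) could look
like this:

	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		long state = 0, i;
		const long loops = 100000000;	/* 100M iterations */
		struct timespec t0, t1;
		double ns;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < loops; i++)
			__atomic_exchange_n(&state, i, __ATOMIC_SEQ_CST); /* LOCK xchg */
		clock_gettime(CLOCK_MONOTONIC, &t1);

		ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
		printf("%.1f ns per atomic xchg\n", ns / loops);
		return 0;
	}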

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/core/pktgen.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index fc17a9d..b61f553 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3407,10 +3407,10 @@ static int pktgen_thread_worker(void *arg)
 
 	pr_debug("starting pktgen/%d:  pid=%d\n", cpu, task_pid_nr(current));
 
-	set_current_state(TASK_INTERRUPTIBLE);
-
 	set_freezable();
 
+	__set_current_state(TASK_RUNNING);
+
 	while (!kthread_should_stop()) {
 		pkt_dev = next_to_run(t);
 
@@ -3424,8 +3424,6 @@ static int pktgen_thread_worker(void *arg)
 			continue;
 		}
 
-		__set_current_state(TASK_RUNNING);
-
 		if (likely(pkt_dev)) {
 			pktgen_xmit(pkt_dev);
 
@@ -3456,9 +3454,8 @@ static int pktgen_thread_worker(void *arg)
 		}
 
 		try_to_freeze();
-
-		set_current_state(TASK_INTERRUPTIBLE);
 	}
+	set_current_state(TASK_INTERRUPTIBLE);
 
 	pr_debug("%s stopping all device\n", t->tsk->comm);
 	pktgen_stop(t);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [net-next PATCH V2 3/3] pktgen: RCU-ify "if_list" to remove lock in next_to_run()
  2014-06-26 11:16 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance Jesper Dangaard Brouer
  2014-06-26 11:16   ` [net-next PATCH V2 1/3] pktgen: document tuning for max NIC performance Jesper Dangaard Brouer
  2014-06-26 11:16   ` [net-next PATCH V2 2/3] pktgen: avoid expensive set_current_state() call in loop Jesper Dangaard Brouer
@ 2014-06-26 11:16   ` Jesper Dangaard Brouer
  2014-07-01 22:51   ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance David Miller
  3 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-06-26 11:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David S. Miller, netdev
  Cc: Daniel Borkmann, Florian Westphal, Hannes Frederic Sowa,
	Thomas Graf, Eric Dumazet, zoltan.kiss, Alexander Duyck,
	Jeff Kirsher

The if_lock()/if_unlock() in next_to_run() adds significant
overhead, because it is called for every packet in the busy loop of
pktgen_thread_worker().  (Thomas Graf originally pointed me at this
lock problem.)

Removing these two "LOCK" operations should in theory save us approx
16ns (8ns x 2).  As illustrated below, we do save 16ns when removing
the locks and introducing RCU protection.

Performance data with CLONE_SKB==100000, TX-size=512, rx-usecs=30:
 (single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Prev   : 5684009 pps --> 175.93ns (1/5684009*10^9)
 * RCU-fix: 6272204 pps --> 159.43ns (1/6272204*10^9)
 * Diff   : +588195 pps --> -16.50ns

To understand this RCU patch, I describe the pktgen thread model
below.

In pktgen there are several kernel threads, but only one CPU runs
each kernel thread.  Communication with the kernel threads is done
through some thread control flags.  This allows a thread to change
data structures at a known synchronization point; see the main thread
func pktgen_thread_worker().

Userspace changes are communicated through proc-file writes.  There
are three types of changes: general control changes "pgctrl"
(func:pgctrl_write), thread changes "kpktgend_X"
(func:pktgen_thread_write), and interface config changes "ethX@N"
(func:pktgen_if_write).

Userspace "pgctrl" and "thread" changes are synchronized via the mutex
pktgen_thread_lock, thus only a single userspace instance can run.
The mutex is taken while the packet generator is running, by pgctrl
"start".  Thus e.g. "add_device" cannot be invoked when pktgen is
running/started.

All "pgctrl" and all "thread" changes, except thread "add_device",
communicate via the thread control flags.  The main problem is the
exception "add_device", that modifies threads "if_list" directly.

Fortunately "add_device" cannot be invoked while pktgen is running.
But there exists a race between "rem_device_all" and "add_device"
(which normally don't occur, because "rem_device_all" waits 125ms
before returning). Background'ing "rem_device_all" and running
"add_device" immediately allow the race to occur.

The race affects the thread's list of devices, "if_list".  The
if_lock is used for protecting updates to this "if_list".  Readers are
given lock-free access to the list under RCU read sections.
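
In kernel-style pseudocode, the pattern applied here is roughly the
following (an illustrative sketch, not the patch itself; the sketch_*
names are made up):

	/* Updaters serialize on if_lock; readers traverse lock-free. */
	static void sketch_add(struct pktgen_thread *t, struct pktgen_dev *p)
	{
		if_lock(t);
		list_add_rcu(&p->list, &t->if_list);
		if_unlock(t);
	}

	static void sketch_del(struct pktgen_thread *t, struct pktgen_dev *p)
	{
		if_lock(t);
		list_del_rcu(&p->list);
		if_unlock(t);
		kfree_rcu(p, rcu);	/* freed only after a grace period */
	}

	static bool sketch_any_running(struct pktgen_thread *t)
	{
		struct pktgen_dev *p;
		bool found = false;

		rcu_read_lock();
		list_for_each_entry_rcu(p, &t->if_list, list)
			if (p->running) {
				found = true;
				break;
			}
		rcu_read_unlock();
		return found;
	}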

Note, interface config changes (via proc) can occur while pktgen is
running, which worries me a bit.  I'm assuming proc_remove() takes
appropriate locks, to assure no writers exist after proc_remove()
finishes.

I've been running a script that exercises the race condition (which
led me to fix the proc_remove ordering), without any issues.  The
script also exercises concurrent proc writes while the interface
config is getting removed.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
---

 net/core/pktgen.c |  101 +++++++++++++++++++++++++++--------------------------
 1 files changed, 52 insertions(+), 49 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index b61f553..5b05e36 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -69,8 +69,9 @@
  * for running devices in the if_list and sends packets until count is 0 it
  * also the thread checks the thread->control which is used for inter-process
  * communication. controlling process "posts" operations to the threads this
- * way. The if_lock should be possible to remove when add/rem_device is merged
- * into this too.
+ * way.
+ * The if_list is RCU protected, and the if_lock remains to protect updates
+ * to if_list from "add_device", as it is invoked from userspace (via proc write).
  *
  * By design there should only be *one* "controlling" process. In practice
  * multiple write accesses gives unpredictable result. Understood by "write"
@@ -208,7 +209,7 @@
 #define T_REMDEVALL   (1<<2)	/* Remove all devs */
 #define T_REMDEV      (1<<3)	/* Remove one dev */
 
-/* If lock -- can be removed after some work */
+/* If lock -- protects updating of if_list */
 #define   if_lock(t)           spin_lock(&(t->if_lock));
 #define   if_unlock(t)           spin_unlock(&(t->if_lock));
 
@@ -241,6 +242,7 @@ struct pktgen_dev {
 	struct proc_dir_entry *entry;	/* proc file */
 	struct pktgen_thread *pg_thread;/* the owner */
 	struct list_head list;		/* chaining in the thread's run-queue */
+	struct rcu_head	 rcu;		/* freed by RCU */
 
 	int running;		/* if false, the test will stop */
 
@@ -1737,14 +1739,14 @@ static int pktgen_thread_show(struct seq_file *seq, void *v)
 
 	seq_puts(seq, "Running: ");
 
-	if_lock(t);
-	list_for_each_entry(pkt_dev, &t->if_list, list)
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
 		if (pkt_dev->running)
 			seq_printf(seq, "%s ", pkt_dev->odevname);
 
 	seq_puts(seq, "\nStopped: ");
 
-	list_for_each_entry(pkt_dev, &t->if_list, list)
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
 		if (!pkt_dev->running)
 			seq_printf(seq, "%s ", pkt_dev->odevname);
 
@@ -1753,7 +1755,7 @@ static int pktgen_thread_show(struct seq_file *seq, void *v)
 	else
 		seq_puts(seq, "\nResult: NA\n");
 
-	if_unlock(t);
+	rcu_read_unlock();
 
 	return 0;
 }
@@ -1878,10 +1880,8 @@ static struct pktgen_dev *__pktgen_NN_threads(const struct pktgen_net *pn,
 		pkt_dev = pktgen_find_dev(t, ifname, exact);
 		if (pkt_dev) {
 			if (remove) {
-				if_lock(t);
 				pkt_dev->removal_mark = 1;
 				t->control |= T_REMDEV;
-				if_unlock(t);
 			}
 			break;
 		}
@@ -1931,7 +1931,8 @@ static void pktgen_change_name(const struct pktgen_net *pn, struct net_device *d
 	list_for_each_entry(t, &pn->pktgen_threads, th_list) {
 		struct pktgen_dev *pkt_dev;
 
-		list_for_each_entry(pkt_dev, &t->if_list, list) {
+		rcu_read_lock();
+		list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 			if (pkt_dev->odev != dev)
 				continue;
 
@@ -1946,6 +1947,7 @@ static void pktgen_change_name(const struct pktgen_net *pn, struct net_device *d
 				       dev->name);
 			break;
 		}
+		rcu_read_unlock();
 	}
 }
 
@@ -2997,8 +2999,8 @@ static void pktgen_run(struct pktgen_thread *t)
 
 	func_enter();
 
-	if_lock(t);
-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 
 		/*
 		 * setup odev and create initial packet.
@@ -3007,18 +3009,18 @@ static void pktgen_run(struct pktgen_thread *t)
 
 		if (pkt_dev->odev) {
 			pktgen_clear_counters(pkt_dev);
-			pkt_dev->running = 1;	/* Cranke yeself! */
 			pkt_dev->skb = NULL;
 			pkt_dev->started_at = pkt_dev->next_tx = ktime_get();
 
 			set_pkt_overhead(pkt_dev);
 
 			strcpy(pkt_dev->result, "Starting");
+			pkt_dev->running = 1;	/* Cranke yeself! */
 			started++;
 		} else
 			strcpy(pkt_dev->result, "Error starting");
 	}
-	if_unlock(t);
+	rcu_read_unlock();
 	if (started)
 		t->control &= ~(T_STOP);
 }
@@ -3041,27 +3043,25 @@ static int thread_is_running(const struct pktgen_thread *t)
 {
 	const struct pktgen_dev *pkt_dev;
 
-	list_for_each_entry(pkt_dev, &t->if_list, list)
-		if (pkt_dev->running)
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
+		if (pkt_dev->running) {
+			rcu_read_unlock();
 			return 1;
+		}
+	rcu_read_unlock();
 	return 0;
 }
 
 static int pktgen_wait_thread_run(struct pktgen_thread *t)
 {
-	if_lock(t);
-
 	while (thread_is_running(t)) {
 
-		if_unlock(t);
-
 		msleep_interruptible(100);
 
 		if (signal_pending(current))
 			goto signal;
-		if_lock(t);
 	}
-	if_unlock(t);
 	return 1;
 signal:
 	return 0;
@@ -3166,10 +3166,10 @@ static int pktgen_stop_device(struct pktgen_dev *pkt_dev)
 		return -EINVAL;
 	}
 
+	pkt_dev->running = 0;
 	kfree_skb(pkt_dev->skb);
 	pkt_dev->skb = NULL;
 	pkt_dev->stopped_at = ktime_get();
-	pkt_dev->running = 0;
 
 	show_results(pkt_dev, nr_frags);
 
@@ -3180,9 +3180,8 @@ static struct pktgen_dev *next_to_run(struct pktgen_thread *t)
 {
 	struct pktgen_dev *pkt_dev, *best = NULL;
 
-	if_lock(t);
-
-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 		if (!pkt_dev->running)
 			continue;
 		if (best == NULL)
@@ -3190,7 +3189,8 @@ static struct pktgen_dev *next_to_run(struct pktgen_thread *t)
 		else if (ktime_compare(pkt_dev->next_tx, best->next_tx) < 0)
 			best = pkt_dev;
 	}
-	if_unlock(t);
+	rcu_read_unlock();
+
 	return best;
 }
 
@@ -3200,13 +3200,13 @@ static void pktgen_stop(struct pktgen_thread *t)
 
 	func_enter();
 
-	if_lock(t);
+	rcu_read_lock();
 
-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 		pktgen_stop_device(pkt_dev);
 	}
 
-	if_unlock(t);
+	rcu_read_unlock();
 }
 
 /*
@@ -3220,8 +3220,6 @@ static void pktgen_rem_one_if(struct pktgen_thread *t)
 
 	func_enter();
 
-	if_lock(t);
-
 	list_for_each_safe(q, n, &t->if_list) {
 		cur = list_entry(q, struct pktgen_dev, list);
 
@@ -3235,8 +3233,6 @@ static void pktgen_rem_one_if(struct pktgen_thread *t)
 
 		break;
 	}
-
-	if_unlock(t);
 }
 
 static void pktgen_rem_all_ifs(struct pktgen_thread *t)
@@ -3248,8 +3244,6 @@ static void pktgen_rem_all_ifs(struct pktgen_thread *t)
 
 	/* Remove all devices, free mem */
 
-	if_lock(t);
-
 	list_for_each_safe(q, n, &t->if_list) {
 		cur = list_entry(q, struct pktgen_dev, list);
 
@@ -3258,8 +3252,6 @@ static void pktgen_rem_all_ifs(struct pktgen_thread *t)
 
 		pktgen_remove_device(t, cur);
 	}
-
-	if_unlock(t);
 }
 
 static void pktgen_rem_thread(struct pktgen_thread *t)
@@ -3482,8 +3474,8 @@ static struct pktgen_dev *pktgen_find_dev(struct pktgen_thread *t,
 	struct pktgen_dev *p, *pkt_dev = NULL;
 	size_t len = strlen(ifname);
 
-	if_lock(t);
-	list_for_each_entry(p, &t->if_list, list)
+	rcu_read_lock();
+	list_for_each_entry_rcu(p, &t->if_list, list)
 		if (strncmp(p->odevname, ifname, len) == 0) {
 			if (p->odevname[len]) {
 				if (exact || p->odevname[len] != '@')
@@ -3493,7 +3485,7 @@ static struct pktgen_dev *pktgen_find_dev(struct pktgen_thread *t,
 			break;
 		}
 
-	if_unlock(t);
+	rcu_read_unlock();
 	pr_debug("find_dev(%s) returning %p\n", ifname, pkt_dev);
 	return pkt_dev;
 }
@@ -3507,6 +3499,12 @@ static int add_dev_to_thread(struct pktgen_thread *t,
 {
 	int rv = 0;
 
+	/* This function cannot be called concurrently, as it is called
+	 * under the pktgen_thread_lock mutex, but it can run from
+	 * userspace on another CPU than the kthread.  The if_lock()
+	 * is used here to sync with concurrent instances of
+	 * _rem_dev_from_if_list() invoked via the kthread, which is
+	 * also updating the if_list */
 	if_lock(t);
 
 	if (pkt_dev->pg_thread) {
@@ -3515,9 +3513,9 @@ static int add_dev_to_thread(struct pktgen_thread *t,
 		goto out;
 	}
 
-	list_add(&pkt_dev->list, &t->if_list);
-	pkt_dev->pg_thread = t;
 	pkt_dev->running = 0;
+	pkt_dev->pg_thread = t;
+	list_add_rcu(&pkt_dev->list, &t->if_list);
 
 out:
 	if_unlock(t);
@@ -3672,11 +3670,13 @@ static void _rem_dev_from_if_list(struct pktgen_thread *t,
 	struct list_head *q, *n;
 	struct pktgen_dev *p;
 
+	if_lock(t);
 	list_for_each_safe(q, n, &t->if_list) {
 		p = list_entry(q, struct pktgen_dev, list);
 		if (p == pkt_dev)
-			list_del(&p->list);
+			list_del_rcu(&p->list);
 	}
+	if_unlock(t);
 }
 
 static int pktgen_remove_device(struct pktgen_thread *t,
@@ -3696,20 +3696,22 @@ static int pktgen_remove_device(struct pktgen_thread *t,
 		pkt_dev->odev = NULL;
 	}
 
-	/* And update the thread if_list */
-
-	_rem_dev_from_if_list(t, pkt_dev);
-
+	/* Remove proc before the if_list entry, because add_device uses
+	 * the list to determine if the interface already exists, avoiding
+	 * a race with proc_create_data() */
 	if (pkt_dev->entry)
 		proc_remove(pkt_dev->entry);
 
+	/* And update the thread if_list */
+	_rem_dev_from_if_list(t, pkt_dev);
+
 #ifdef CONFIG_XFRM
 	free_SAs(pkt_dev);
 #endif
 	vfree(pkt_dev->flows);
 	if (pkt_dev->page)
 		put_page(pkt_dev->page);
-	kfree(pkt_dev);
+	kfree_rcu(pkt_dev, rcu);
 	return 0;
 }
 
@@ -3809,6 +3811,7 @@ static void __exit pg_cleanup(void)
 {
 	unregister_netdevice_notifier(&pktgen_notifier_block);
 	unregister_pernet_subsys(&pg_net_ops);
+	/* Don't need rcu_barrier() due to use of kfree_rcu() */
 }
 
 module_init(pg_init);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance
  2014-06-26 11:16 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance Jesper Dangaard Brouer
                     ` (2 preceding siblings ...)
  2014-06-26 11:16   ` [net-next PATCH V2 3/3] pktgen: RCU-ify "if_list" to remove lock in next_to_run() Jesper Dangaard Brouer
@ 2014-07-01 22:51   ` David Miller
  3 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2014-07-01 22:51 UTC (permalink / raw)
  To: brouer
  Cc: netdev, dborkman, fw, hannes, tgraf, eric.dumazet, zoltan.kiss,
	alexander.h.duyck, jeffrey.t.kirsher

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Thu, 26 Jun 2014 13:16:17 +0200

> This series focuses on optimizing "pktgen" for single CPU performance.
> 
> V2-series:
>  - Removed some patches
>  - Doc real reason for TX ring buffer filling up

Looks great, series applied, thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-07-01 22:51 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-14 14:17 [net-next PATCH 0/5] Optimizing "pktgen" for single CPU performance Jesper Dangaard Brouer
2014-05-14 14:17 ` [net-next PATCH 1/5] ixgbe: trivial fixes while reading code Jesper Dangaard Brouer
2014-05-14 14:17 ` [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024 Jesper Dangaard Brouer
2014-05-14 14:28   ` David Laight
2014-05-14 19:25     ` Jesper Dangaard Brouer
2014-05-14 16:28   ` Alexander Duyck
2014-05-14 17:49     ` David Miller
2014-05-14 19:09       ` Jesper Dangaard Brouer
2014-05-14 19:54         ` David Miller
2014-05-15  9:16         ` David Laight
2014-05-29 15:29         ` Jesper Dangaard Brouer
2014-05-14 14:17 ` [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop Jesper Dangaard Brouer
2014-05-14 14:35   ` Eric Dumazet
2014-05-14 15:13     ` Jesper Dangaard Brouer
2014-05-14 15:35       ` Eric Dumazet
2014-05-14 14:17 ` [net-next PATCH 4/5] pktgen: avoid expensive set_current_state() call in loop Jesper Dangaard Brouer
2014-05-14 14:18 ` [net-next PATCH 5/5] pktgen: RCU'ify "if_list" to remove lock in next_to_run() Jesper Dangaard Brouer
2014-06-26 11:16 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance Jesper Dangaard Brouer
2014-06-26 11:16   ` [net-next PATCH V2 1/3] pktgen: document tuning for max NIC performance Jesper Dangaard Brouer
2014-06-26 11:16   ` [net-next PATCH V2 2/3] pktgen: avoid expensive set_current_state() call in loop Jesper Dangaard Brouer
2014-06-26 11:16   ` [net-next PATCH V2 3/3] pktgen: RCU-ify "if_list" to remove lock in next_to_run() Jesper Dangaard Brouer
2014-07-01 22:51   ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance David Miller
