* [net-next PATCH 0/1 V4] qdisc bulk dequeuing and utilizing delayed tailptr updates @ 2014-09-24 16:10 Jesper Dangaard Brouer 2014-09-24 16:12 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer 0 siblings, 1 reply; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-24 16:10 UTC (permalink / raw) To: netdev, therbert, David S. Miller, Eric Dumazet, Jesper Dangaard Brouer Cc: Alexander Duyck, John Fastabend, toke, jhs, Dave Taht This patch uses DaveM's recent API changes to dev_hard_start_xmit(), from the qdisc layer, to implement dequeue bulking. In this V4 iteration we are choosing a conservative approach. Patch V4: - Patch rewritten in the Red Hat Neuchatel office, joint work with Hannes, Daniel and Florian. - Conservative approach of only bulking on BQL enabled drivers - No user tunable parameter, but limit bulking to 8 packets. - For now, avoid bulking GSO packets. --- Jesper Dangaard Brouer (1): qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE include/net/sch_generic.h | 16 +++++++++++++++ net/sched/sch_generic.c | 47 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 61 insertions(+), 2 deletions(-) -- ^ permalink raw reply [flat|nested] 31+ messages in thread
* [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 16:10 [net-next PATCH 0/1 V4] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer @ 2014-09-24 16:12 ` Jesper Dangaard Brouer 2014-09-24 17:23 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-24 16:12 UTC (permalink / raw) To: netdev, therbert, David S. Miller, Eric Dumazet, Jesper Dangaard Brouer Cc: Alexander Duyck, toke, Florian Westphal, jhs, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa Based on DaveM's recent API work on dev_hard_start_xmit(), which allows sending/processing an entire skb list. This patch implements qdisc bulk dequeue, by allowing multiple packets to be dequeued in dequeue_skb(). The optimization principle for this is twofold: (1) amortize locking cost and (2) avoid the expensive tailptr update for notifying HW. (1) Several packets are dequeued while holding the qdisc root_lock, amortizing locking cost over several packets. The dequeued SKB list is processed under the TXQ lock in dev_hard_start_xmit(), thus also amortizing the cost of the TXQ lock. (2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more API to delay the HW tailptr update, which also reduces the cost per packet. One restriction of the new API is that every SKB must belong to the same TXQ. This patch takes the easy way out, by restricting bulk dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that the qdisc has only a single TXQ attached. Some detail about the flow; dev_hard_start_xmit() will process the skb list, and transmit packets individually towards the driver (see xmit_one()). In case the driver stops midway in the list, the remaining skb list is returned by dev_hard_start_xmit(). In sch_direct_xmit() this returned list is requeued by dev_requeue_skb(). 
To avoid overshooting the HW limits, which results in requeuing, the patch limits the amount of bytes dequeued, based on the driver's BQL limits. In effect, bulking will only happen for BQL enabled drivers. Besides the bytelimit from BQL, also limit bulking to a maximum of 8 packets, to avoid any issues with available descriptors in HW. For now, as a conservative approach, don't bulk dequeue GSO and segmented GSO packets, because testing showed requeuing occurring with segmented GSO packets. Joint work with Hannes, Daniel and Florian. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> --- V4: - Patch rewritten in the Red Hat Neuchatel office, joint work with Hannes, Daniel and Florian. - Conservative approach of only bulking on BQL enabled drivers - No user tunable parameter, but limit bulking to 8 packets. - For now, avoid bulking GSO packets. V3: - Correct use of BQL - Some minor adjustments based on feedback. - Default setting only bulk dequeue 1 extra packet (2 packets). 
V2: - Restruct functions, split out functionality - Use BQL bytelimit to avoid overshooting driver limits --- include/net/sch_generic.h | 16 +++++++++++++++ net/sched/sch_generic.c | 47 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 61 insertions(+), 2 deletions(-) diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index a3cfb8e..4e39a3e 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -6,6 +6,7 @@ #include <linux/rcupdate.h> #include <linux/pkt_sched.h> #include <linux/pkt_cls.h> +#include <linux/dynamic_queue_limits.h> #include <net/gen_stats.h> #include <net/rtnetlink.h> @@ -111,6 +112,21 @@ static inline void qdisc_run_end(struct Qdisc *qdisc) qdisc->__state &= ~__QDISC___STATE_RUNNING; } +static inline bool qdisc_may_bulk(const struct Qdisc *qdisc) +{ + return qdisc->flags & TCQ_F_ONETXQUEUE; +} + +static inline int qdisc_avail_bulklimit(const struct netdev_queue *txq) +{ +#ifdef CONFIG_BQL + /* Non-BQL migrated drivers will return 0, too. */ + return dql_avail(&txq->dql); +#else + return 0; +#endif +} + static inline bool qdisc_is_throttled(const struct Qdisc *qdisc) { return test_bit(__QDISC_STATE_THROTTLED, &qdisc->state) ? 
true : false; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 19696eb..6fba089 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q) return 0; } +static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q, + struct sk_buff *head_skb, + int bytelimit) +{ + struct sk_buff *skb, *tail_skb = head_skb; + int budget = 8; /* Arbitrary, but conservatively choosen limit */ + + while (bytelimit > 0 && --budget > 0) { + /* For now, don't bulk dequeue GSO (or GSO segmented) pkts */ + if (tail_skb->next || skb_is_gso(tail_skb)) + break; + + skb = q->dequeue(q); + if (!skb) + break; + + bytelimit -= skb->len; /* covers GSO len */ + skb = validate_xmit_skb(skb, qdisc_dev(q)); + if (!skb) + break; + + /* "skb" can be a skb list after validate call above + * (GSO segmented), but it is okay to append it to + * current tail_skb->next, because next round will exit + * in-case "tail_skb->next" is a skb list. + */ + tail_skb->next = skb; + tail_skb = skb; + } + + return head_skb; +} + +/* Note that dequeue_skb can possibly return a SKB list (via skb->next). + * A requeued skb (via q->gso_skb) can also be a SKB list. + */ static inline struct sk_buff *dequeue_skb(struct Qdisc *q) { struct sk_buff *skb = q->gso_skb; @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q) } else skb = NULL; } else { - if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) { + if (!(q->flags & TCQ_F_ONETXQUEUE) || + !netif_xmit_frozen_or_stopped(txq)) { + int bytelimit = qdisc_avail_bulklimit(txq); + skb = q->dequeue(q); - if (skb) + if (skb) { + bytelimit -= skb->len; skb = validate_xmit_skb(skb, qdisc_dev(q)); + } + if (skb && qdisc_may_bulk(q)) + skb = try_bulk_dequeue_skb(q, skb, bytelimit); } } ^ permalink raw reply related [flat|nested] 31+ messages in thread
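The interaction between the BQL byte limit and the hard packet budget in the patch is compact but easy to misread. The following is a minimal userspace model (toy code, not the kernel implementation; plain packet lengths stand in for sk_buffs) of how the two limits combine in dequeue_skb()/try_bulk_dequeue_skb():

```c
#include <assert.h>

/* Toy model of the budget logic: the head packet is always taken
 * (bytelimit may go negative), then further packets are pulled only
 * while the BQL-derived byte budget stays positive and the hard
 * packet budget is not exhausted. */
#define PKT_BUDGET 8	/* same arbitrary cap as in the patch */

static int bulk_count(const int *pkt_len, int npkts, int bytelimit)
{
	int taken = 0, budget = PKT_BUDGET;

	if (npkts == 0)
		return 0;
	bytelimit -= pkt_len[taken++];	/* head packet, always dequeued */

	while (taken < npkts && bytelimit > 0 && --budget > 0) {
		bytelimit -= pkt_len[taken];
		taken++;
	}
	return taken;
}
```

With a byte limit of 0 (a non-BQL driver, per qdisc_avail_bulklimit()) only the head packet is taken, which is why bulking is effectively restricted to BQL-enabled drivers.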
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 16:12 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer @ 2014-09-24 17:23 ` Eric Dumazet 2014-09-24 17:58 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2014-09-24 17:23 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: netdev, therbert, David S. Miller, Alexander Duyck, toke, Florian Westphal, jhs, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, 2014-09-24 at 18:12 +0200, Jesper Dangaard Brouer wrote: > Based on DaveM's recent API work on dev_hard_start_xmit(), that allows > sending/processing an entire skb list. > > This patch implements qdisc bulk dequeue, by allowing multiple packets > to be dequeued in dequeue_skb(). > > The optimization principle for this is two fold, (1) to amortize > locking cost and (2) avoid expensive tailptr update for notifying HW. > (1) Several packets are dequeued while holding the qdisc root_lock, > amortizing locking cost over several packet. The dequeued SKB list is > processed under the TXQ lock in dev_hard_start_xmit(), thus also > amortizing the cost of the TXQ lock. > (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more > API to delay HW tailptr update, which also reduces the cost per > packet. > > One restriction of the new API is that every SKB must belong to the > same TXQ. This patch takes the easy way out, by restricting bulk > dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the > qdisc only have attached a single TXQ. > > Some detail about the flow; dev_hard_start_xmit() will process the skb > list, and transmit packets individually towards the driver (see > xmit_one()). In case the driver stops midway in the list, the > remaining skb list is returned by dev_hard_start_xmit(). In > sch_direct_xmit() this returned list is requeued by dev_requeue_skb(). 
> > To avoid overshooting the HW limits, which results in requeuing, the > patch limits the amount of bytes dequeued, based on the drivers BQL > limits. In-effect bulking will only happen for BQL enabled drivers. > Besides the bytelimit from BQL, also limit bulking to maximum 8 > packets to avoid any issues with available descriptor in HW. > > For now, as a conservative approach, don't bulk dequeue GSO and > segmented GSO packets, because testing showed requeuing occuring > with segmented GSO packets. > > Jointed work with Hannes, Daniel and Florian. > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> > Signed-off-by: Daniel Borkmann <dborkman@redhat.com> > Signed-off-by: Florian Westphal <fw@strlen.de> > > --- > V4: > - Patch rewritten in the Red Hat Neuchatel office jointed work with > Hannes, Daniel and Florian. > - Conservative approach of only using on BQL enabled drivers > - No user tunable parameter, but limit bulking to 8 packets. > - For now, avoid bulking GSO packets packets. > > V3: > - Correct use of BQL > - Some minor adjustments based on feedback. > - Default setting only bulk dequeue 1 extra packet (2 packets). 
> > V2: > - Restruct functions, split out functionality > - Use BQL bytelimit to avoid overshooting driver limits > --- > include/net/sch_generic.h | 16 +++++++++++++++ > net/sched/sch_generic.c | 47 +++++++++++++++++++++++++++++++++++++++++++-- > 2 files changed, 61 insertions(+), 2 deletions(-) > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h > index a3cfb8e..4e39a3e 100644 > --- a/include/net/sch_generic.h > +++ b/include/net/sch_generic.h > @@ -6,6 +6,7 @@ > #include <linux/rcupdate.h> > #include <linux/pkt_sched.h> > #include <linux/pkt_cls.h> > +#include <linux/dynamic_queue_limits.h> > #include <net/gen_stats.h> > #include <net/rtnetlink.h> > > @@ -111,6 +112,21 @@ static inline void qdisc_run_end(struct Qdisc *qdisc) > qdisc->__state &= ~__QDISC___STATE_RUNNING; > } > > +static inline bool qdisc_may_bulk(const struct Qdisc *qdisc) > +{ > + return qdisc->flags & TCQ_F_ONETXQUEUE; > +} > + > +static inline int qdisc_avail_bulklimit(const struct netdev_queue *txq) > +{ > +#ifdef CONFIG_BQL > + /* Non-BQL migrated drivers will return 0, too. */ > + return dql_avail(&txq->dql); > +#else > + return 0; > +#endif > +} > + > static inline bool qdisc_is_throttled(const struct Qdisc *qdisc) > { > return test_bit(__QDISC_STATE_THROTTLED, &qdisc->state) ? true : false; > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > index 19696eb..6fba089 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q) > return 0; > } > > +static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q, > + struct sk_buff *head_skb, > + int bytelimit) > +{ > + struct sk_buff *skb, *tail_skb = head_skb; > + int budget = 8; /* Arbitrary, but conservatively choosen limit */ > + > + while (bytelimit > 0 && --budget > 0) { > + /* For now, don't bulk dequeue GSO (or GSO segmented) pkts */ Hmm... this is a serious limitation. 
> + if (tail_skb->next || skb_is_gso(tail_skb)) > + break; > + > + skb = q->dequeue(q); > + if (!skb) > + break; > + > + bytelimit -= skb->len; /* covers GSO len */ Not really, use qdisc_pkt_len(skb) instead, to have a better byte count. > + skb = validate_xmit_skb(skb, qdisc_dev(q)); > + if (!skb) > + break; > + > + /* "skb" can be a skb list after validate call above > + * (GSO segmented), but it is okay to append it to > + * current tail_skb->next, because next round will exit > + * in-case "tail_skb->next" is a skb list. > + */ It would be trivial to change validate_xmit_skb() to return the head and tail of the chain. Walking the chain would hit hot cache lines, but it is better not having to walk it. > + tail_skb->next = skb; > + tail_skb = skb; So that here we do : tail_skb = tail; > + } > + > + return head_skb; > +} > + > +/* Note that dequeue_skb can possibly return a SKB list (via skb->next). > + * A requeued skb (via q->gso_skb) can also be a SKB list. > + */ > static inline struct sk_buff *dequeue_skb(struct Qdisc *q) > { > struct sk_buff *skb = q->gso_skb; > @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q) > } else > skb = NULL; > } else { > - if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) { > + if (!(q->flags & TCQ_F_ONETXQUEUE) || > + !netif_xmit_frozen_or_stopped(txq)) { > + int bytelimit = qdisc_avail_bulklimit(txq); > + > skb = q->dequeue(q); > - if (skb) > + if (skb) { > + bytelimit -= skb->len; qdisc_pkt_len(skb) > skb = validate_xmit_skb(skb, qdisc_dev(q)); > + } > + if (skb && qdisc_may_bulk(q)) > + skb = try_bulk_dequeue_skb(q, skb, bytelimit); > } > } > It looks good, but we really need to take care of TSO packets. pktgen is nice, but do not represent the majority of the traffic we send from high performance host where we want this bulk dequeue thing ;) Thanks guys ! ^ permalink raw reply [flat|nested] 31+ messages in thread
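Eric's head/tail suggestion is worth spelling out. A sketch with a toy node type (illustrative only; the real change would be to validate_xmit_skb()'s return convention): if the callee hands back both ends of the chain, the caller can append it in O(1) instead of walking skb->next to find the end.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for sk_buff: only the ->next chaining matters here. */
struct node {
	struct node *next;
	int id;
};

/* Append the chain [head..tail] after list_tail and return the new
 * list tail: the one-line version of "tail_skb = tail" from the
 * review, with no walk over the (possibly GSO-segmented) chain. */
static struct node *append_chain(struct node *list_tail,
				 struct node *head, struct node *tail)
{
	list_tail->next = head;		/* tail_skb->next = skb */
	return tail;			/* tail_skb = tail      */
}
```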
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 17:23 ` Eric Dumazet @ 2014-09-24 17:58 ` Jesper Dangaard Brouer 2014-09-24 18:34 ` Tom Herbert 2014-09-24 22:13 ` Jamal Hadi Salim 0 siblings, 2 replies; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-24 17:58 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, therbert, David S. Miller, Alexander Duyck, toke, Florian Westphal, jhs, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer On Wed, 24 Sep 2014 10:23:15 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Wed, 2014-09-24 at 18:12 +0200, Jesper Dangaard Brouer wrote: > > Based on DaveM's recent API work on dev_hard_start_xmit(), that allows > > sending/processing an entire skb list. > > > > This patch implements qdisc bulk dequeue, by allowing multiple packets > > to be dequeued in dequeue_skb(). > > > > The optimization principle for this is two fold, (1) to amortize > > locking cost and (2) avoid expensive tailptr update for notifying HW. > > (1) Several packets are dequeued while holding the qdisc root_lock, > > amortizing locking cost over several packet. The dequeued SKB list is > > processed under the TXQ lock in dev_hard_start_xmit(), thus also > > amortizing the cost of the TXQ lock. > > (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more > > API to delay HW tailptr update, which also reduces the cost per > > packet. > > > > One restriction of the new API is that every SKB must belong to the > > same TXQ. This patch takes the easy way out, by restricting bulk > > dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the > > qdisc only have attached a single TXQ. > > > > Some detail about the flow; dev_hard_start_xmit() will process the skb > > list, and transmit packets individually towards the driver (see > > xmit_one()). 
In case the driver stops midway in the list, the > > remaining skb list is returned by dev_hard_start_xmit(). In > > sch_direct_xmit() this returned list is requeued by dev_requeue_skb(). > > > > To avoid overshooting the HW limits, which results in requeuing, the > > patch limits the amount of bytes dequeued, based on the drivers BQL > > limits. In-effect bulking will only happen for BQL enabled drivers. > > Besides the bytelimit from BQL, also limit bulking to maximum 8 > > packets to avoid any issues with available descriptor in HW. > > > > For now, as a conservative approach, don't bulk dequeue GSO and > > segmented GSO packets, because testing showed requeuing occuring > > with segmented GSO packets. > > > > Jointed work with Hannes, Daniel and Florian. > > > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> > > Signed-off-by: Daniel Borkmann <dborkman@redhat.com> > > Signed-off-by: Florian Westphal <fw@strlen.de> > > > > --- > > V4: > > - Patch rewritten in the Red Hat Neuchatel office jointed work with > > Hannes, Daniel and Florian. > > - Conservative approach of only using on BQL enabled drivers > > - No user tunable parameter, but limit bulking to 8 packets. > > - For now, avoid bulking GSO packets packets. > > [...] > > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > > index 19696eb..6fba089 100644 > > --- a/net/sched/sch_generic.c > > +++ b/net/sched/sch_generic.c > > @@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q) > > return 0; > > } > > > > +static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q, > > + struct sk_buff *head_skb, > > + int bytelimit) > > +{ > > + struct sk_buff *skb, *tail_skb = head_skb; > > + int budget = 8; /* Arbitrary, but conservatively choosen limit */ > > + > > + while (bytelimit > 0 && --budget > 0) { > > + /* For now, don't bulk dequeue GSO (or GSO segmented) pkts */ > > Hmm... 
this is a serious limitation. We plan to remove this at a later point, this is just to be conservative. > > + if (tail_skb->next || skb_is_gso(tail_skb)) > > + break; > > + > > + skb = q->dequeue(q); > > + if (!skb) > > + break; > > + > > + bytelimit -= skb->len; /* covers GSO len */ > > Not really, use qdisc_pkt_len(skb) instead, to have a better byte count. Understood, will update patch tomorrow. > > + skb = validate_xmit_skb(skb, qdisc_dev(q)); > > + if (!skb) > > + break; > > + > > + /* "skb" can be a skb list after validate call above > > + * (GSO segmented), but it is okay to append it to > > + * current tail_skb->next, because next round will exit > > + * in-case "tail_skb->next" is a skb list. > > + */ > > It would be trivial to change validate_xmit_skb() to return the head and > tail of the chain. Walking the chain would hit hot cache lines, but it > is better not having to walk it. Yes, we could do this when we later add support for GSO segmented packets. > > + tail_skb->next = skb; > > + tail_skb = skb; > > So that here we do : tail_skb = tail; > > > + } > > + > > + return head_skb; > > +} > > + > > +/* Note that dequeue_skb can possibly return a SKB list (via skb->next). > > + * A requeued skb (via q->gso_skb) can also be a SKB list. > > + */ > > static inline struct sk_buff *dequeue_skb(struct Qdisc *q) > > { > > struct sk_buff *skb = q->gso_skb; > > @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q) > > } else > > skb = NULL; > > } else { > > - if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) { > > + if (!(q->flags & TCQ_F_ONETXQUEUE) || > > + !netif_xmit_frozen_or_stopped(txq)) { > > + int bytelimit = qdisc_avail_bulklimit(txq); > > + > > skb = q->dequeue(q); > > - if (skb) > > + if (skb) { > > + bytelimit -= skb->len; > > qdisc_pkt_len(skb) Okay, will update patch tomorrow. 
> > skb = validate_xmit_skb(skb, qdisc_dev(q)); > > + } > > + if (skb && qdisc_may_bulk(q)) > > + skb = try_bulk_dequeue_skb(q, skb, bytelimit); > > } > > } > > > > It looks good, but we really need to take care of TSO packets. As notes above, TSO packets are planned as a future feature. Lets first see if this works well, without introducing any HoL blocking issues or other regressions. > pktgen is nice, but do not represent the majority of the traffic we send > from high performance host where we want this bulk dequeue thing ;) This patch is actually targetted towards more normal use-cases. Pktgen cannot even use this work, as it bypass the qdisc layer... -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 17:58 ` Jesper Dangaard Brouer @ 2014-09-24 18:34 ` Tom Herbert 2014-09-24 19:22 ` Eric Dumazet 2014-09-24 22:13 ` Jamal Hadi Salim 1 sibling, 1 reply; 31+ messages in thread From: Tom Herbert @ 2014-09-24 18:34 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, Sep 24, 2014 at 10:58 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Wed, 24 Sep 2014 10:23:15 -0700 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> On Wed, 2014-09-24 at 18:12 +0200, Jesper Dangaard Brouer wrote: >> > Based on DaveM's recent API work on dev_hard_start_xmit(), that allows >> > sending/processing an entire skb list. >> > >> > This patch implements qdisc bulk dequeue, by allowing multiple packets >> > to be dequeued in dequeue_skb(). >> > >> > The optimization principle for this is two fold, (1) to amortize >> > locking cost and (2) avoid expensive tailptr update for notifying HW. >> > (1) Several packets are dequeued while holding the qdisc root_lock, >> > amortizing locking cost over several packet. The dequeued SKB list is >> > processed under the TXQ lock in dev_hard_start_xmit(), thus also >> > amortizing the cost of the TXQ lock. >> > (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more >> > API to delay HW tailptr update, which also reduces the cost per >> > packet. >> > >> > One restriction of the new API is that every SKB must belong to the >> > same TXQ. This patch takes the easy way out, by restricting bulk >> > dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the >> > qdisc only have attached a single TXQ. 
>> > >> > Some detail about the flow; dev_hard_start_xmit() will process the skb >> > list, and transmit packets individually towards the driver (see >> > xmit_one()). In case the driver stops midway in the list, the >> > remaining skb list is returned by dev_hard_start_xmit(). In >> > sch_direct_xmit() this returned list is requeued by dev_requeue_skb(). >> > >> > To avoid overshooting the HW limits, which results in requeuing, the >> > patch limits the amount of bytes dequeued, based on the drivers BQL >> > limits. In-effect bulking will only happen for BQL enabled drivers. >> > Besides the bytelimit from BQL, also limit bulking to maximum 8 >> > packets to avoid any issues with available descriptor in HW. >> > >> > For now, as a conservative approach, don't bulk dequeue GSO and >> > segmented GSO packets, because testing showed requeuing occuring >> > with segmented GSO packets. >> > >> > Jointed work with Hannes, Daniel and Florian. >> > >> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> >> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> >> > Signed-off-by: Daniel Borkmann <dborkman@redhat.com> >> > Signed-off-by: Florian Westphal <fw@strlen.de> >> > >> > --- >> > V4: >> > - Patch rewritten in the Red Hat Neuchatel office jointed work with >> > Hannes, Daniel and Florian. >> > - Conservative approach of only using on BQL enabled drivers >> > - No user tunable parameter, but limit bulking to 8 packets. >> > - For now, avoid bulking GSO packets packets. >> > > [...] 
>> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c >> > index 19696eb..6fba089 100644 >> > --- a/net/sched/sch_generic.c >> > +++ b/net/sched/sch_generic.c >> > @@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q) >> > return 0; >> > } >> > >> > +static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q, >> > + struct sk_buff *head_skb, >> > + int bytelimit) >> > +{ >> > + struct sk_buff *skb, *tail_skb = head_skb; >> > + int budget = 8; /* Arbitrary, but conservatively choosen limit */ >> > + >> > + while (bytelimit > 0 && --budget > 0) { >> > + /* For now, don't bulk dequeue GSO (or GSO segmented) pkts */ >> >> Hmm... this is a serious limitation. > > We plan to remove this at a later point, this is just to be conservative. > > >> > + if (tail_skb->next || skb_is_gso(tail_skb)) >> > + break; >> > + >> > + skb = q->dequeue(q); >> > + if (!skb) >> > + break; >> > + >> > + bytelimit -= skb->len; /* covers GSO len */ >> >> Not really, use qdisc_pkt_len(skb) instead, to have a better byte count. > > Understood, will update patch tomorrow. > I believe drivers typically use skb->len for BQL tracking. Since bytelimit is based on BQL here, it might be more correct to use skb->len. > >> > + skb = validate_xmit_skb(skb, qdisc_dev(q)); >> > + if (!skb) >> > + break; >> > + >> > + /* "skb" can be a skb list after validate call above >> > + * (GSO segmented), but it is okay to append it to >> > + * current tail_skb->next, because next round will exit >> > + * in-case "tail_skb->next" is a skb list. >> > + */ >> >> It would be trivial to change validate_xmit_skb() to return the head and >> tail of the chain. Walking the chain would hit hot cache lines, but it >> is better not having to walk it. > > Yes, we could do this when we later add support for GSO segmented packets. 
> > >> > + tail_skb->next = skb; >> > + tail_skb = skb; >> >> So that here we do : tail_skb = tail; >> >> > + } >> > + >> > + return head_skb; >> > +} >> > + >> > +/* Note that dequeue_skb can possibly return a SKB list (via skb->next). >> > + * A requeued skb (via q->gso_skb) can also be a SKB list. >> > + */ >> > static inline struct sk_buff *dequeue_skb(struct Qdisc *q) >> > { >> > struct sk_buff *skb = q->gso_skb; >> > @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q) >> > } else >> > skb = NULL; >> > } else { >> > - if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) { >> > + if (!(q->flags & TCQ_F_ONETXQUEUE) || >> > + !netif_xmit_frozen_or_stopped(txq)) { >> > + int bytelimit = qdisc_avail_bulklimit(txq); >> > + >> > skb = q->dequeue(q); >> > - if (skb) >> > + if (skb) { >> > + bytelimit -= skb->len; >> >> qdisc_pkt_len(skb) > > Okay, will update patch tomorrow. > >> > skb = validate_xmit_skb(skb, qdisc_dev(q)); >> > + } >> > + if (skb && qdisc_may_bulk(q)) >> > + skb = try_bulk_dequeue_skb(q, skb, bytelimit); >> > } >> > } >> > >> >> It looks good, but we really need to take care of TSO packets. > > As notes above, TSO packets are planned as a future feature. Lets > first see if this works well, without introducing any HoL blocking > issues or other regressions. > > >> pktgen is nice, but do not represent the majority of the traffic we send >> from high performance host where we want this bulk dequeue thing ;) > > This patch is actually targetted towards more normal use-cases. > Pktgen cannot even use this work, as it bypass the qdisc layer... > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Sr. Network Kernel Developer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 18:34 ` Tom Herbert @ 2014-09-24 19:22 ` Eric Dumazet 2014-09-25 2:12 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2014-09-24 19:22 UTC (permalink / raw) To: Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, 2014-09-24 at 11:34 -0700, Tom Herbert wrote: > > > I believe drivers typically use skb->len for BQL tracking. Since > bytelimit is based on BQL here, it might be more correct to use > skb->len. That is only because drivers don't have access to qdisc_pkt_len(skb), as: - We can call them without any qdisc traversal - We have escaped from the qdisc layer, and GSO packets do not have the field set And drivers actually update BQL inflight, while bulk dequeue does not update it (it's a local copy) I do not particularly care, as BQL does not have to be 100% precise. ^ permalink raw reply [flat|nested] 31+ messages in thread
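To make the skb->len versus post-segmentation byte-count distinction concrete, here is a rough back-of-envelope model (all numbers assumed for illustration: 54-byte Ethernet+IPv4+TCP headers, 1448-byte MSS) of how much a GSO super-packet's wire footprint exceeds what a single header's worth of skb->len accounting would suggest:

```c
#include <assert.h>

/* Wire bytes for a GSO packet after segmentation: every segment
 * repeats the protocol headers, so the total exceeds the single
 * header counted before segmentation. Illustrative model only. */
static int gso_wire_bytes(int payload, int mss, int hdr)
{
	int segs = (payload + mss - 1) / mss;	/* ceil(payload / mss) */

	return payload + segs * hdr;
}
```

For a 64240-byte payload this gives 45 segments and 66670 bytes on the wire, versus 64294 bytes counted with a single header: about 2.3 KB of replicated headers that per-superpacket length accounting never sees.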
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 19:22 ` Eric Dumazet @ 2014-09-25 2:12 ` Eric Dumazet 2014-09-25 2:38 ` Tom Herbert 2014-09-25 23:40 ` Eric Dumazet 0 siblings, 2 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-25 2:12 UTC (permalink / raw) To: Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, 2014-09-24 at 12:22 -0700, Eric Dumazet wrote: > On Wed, 2014-09-24 at 11:34 -0700, Tom Herbert wrote: > > > > > I believe drivers typically use skb->len for BQL tracking. Since > > bytelimit is based on BQL here, it might be more correct to use > > skb->len. Speaking of BQL, I wonder if we now should try to not wakeup queues as soon some room was made, and instead have a 50% threshold ? This would probably increase probability to have bulk dequeues ;) diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h index 5621547d631b..c0be7ff5ae97 100644 --- a/include/linux/dynamic_queue_limits.h +++ b/include/linux/dynamic_queue_limits.h @@ -83,6 +83,13 @@ static inline int dql_avail(const struct dql *dql) return dql->adj_limit - dql->num_queued; } +/* Returns true if a queue occupancy is less than half capacity */ +static inline bool dql_half_avail(const struct dql *dql) +{ + return dql->adj_limit >= (dql->num_queued << 1); +} + + /* Record number of completed objects and recalculate the limit. 
*/ void dql_completed(struct dql *dql, unsigned int count); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index c8e388e5fccc..1f7541284b32 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2413,7 +2413,7 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue, smp_mb(); /* check again in case another CPU has just made room avail */ - if (unlikely(dql_avail(&dev_queue->dql) >= 0)) + if (unlikely(dql_half_avail(&dev_queue->dql))) clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state); #endif } @@ -2448,7 +2448,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue, */ smp_mb(); - if (dql_avail(&dev_queue->dql) < 0) + if (!dql_half_avail(&dev_queue->dql)) return; if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state)) ^ permalink raw reply related [flat|nested] 31+ messages in thread
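A self-contained restatement of the threshold test in Eric's sketch, with a minimal struct carrying just the two fields the helper reads (field types simplified relative to the kernel's struct dql):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's struct dql. */
struct dql {
	unsigned int num_queued;	/* bytes queued to the device */
	unsigned int adj_limit;		/* current BQL limit (+ slack) */
};

/* Same predicate as dql_half_avail() in the proposed diff: wake the
 * stack only once occupancy has dropped to half the limit or less. */
static bool dql_half_avail(const struct dql *dql)
{
	return dql->adj_limit >= (dql->num_queued << 1);
}
```

Compared with the old `dql_avail() >= 0` wakeup, this keeps the queue stopped until half the adjusted limit is free, so each wakeup has room for a larger bulk.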
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 2:12 ` Eric Dumazet @ 2014-09-25 2:38 ` Tom Herbert 2014-09-25 2:58 ` Dave Taht 2014-09-25 23:40 ` Eric Dumazet 1 sibling, 1 reply; 31+ messages in thread From: Tom Herbert @ 2014-09-25 2:38 UTC (permalink / raw) To: Eric Dumazet Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, Sep 24, 2014 at 7:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Wed, 2014-09-24 at 12:22 -0700, Eric Dumazet wrote: >> On Wed, 2014-09-24 at 11:34 -0700, Tom Herbert wrote: >> > > >> > I believe drivers typically use skb->len for BQL tracking. Since >> > bytelimit is based on BQL here, it might be more correct to use >> > skb->len. > > Speaking of BQL, I wonder if we now should try to not wakeup queues as > soon some room was made, and instead have a 50% threshold ? > > This would probably increase probability to have bulk dequeues ;) > It would be good to have data on that. In the absence of TSO, I've seen BQL limits at around 30K for "standard" interrupt rates on 10G. This should mean that ~15K becomes available every interrupt period (the math is actually straightforward), so that should already give 10-packet batches, which isn't bad! It's also probably true that we can trade off batching for latency in many ways: more batching increases latency, less batching reduces it. For instance, the interrupt rate can be modulated to balance between latency and batching (CPU utilization). 
Tom > > diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h > index 5621547d631b..c0be7ff5ae97 100644 > --- a/include/linux/dynamic_queue_limits.h > +++ b/include/linux/dynamic_queue_limits.h > @@ -83,6 +83,13 @@ static inline int dql_avail(const struct dql *dql) > return dql->adj_limit - dql->num_queued; > } > > +/* Returns true if a queue occupancy is less than half capacity */ > +static inline bool dql_half_avail(const struct dql *dql) > +{ > + return dql->adj_limit >= (dql->num_queued << 1); > +} > + > + > /* Record number of completed objects and recalculate the limit. */ > void dql_completed(struct dql *dql, unsigned int count); > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index c8e388e5fccc..1f7541284b32 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -2413,7 +2413,7 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue, > smp_mb(); > > /* check again in case another CPU has just made room avail */ > - if (unlikely(dql_avail(&dev_queue->dql) >= 0)) > + if (unlikely(dql_half_avail(&dev_queue->dql))) > clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state); > #endif > } > @@ -2448,7 +2448,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue, > */ > smp_mb(); > > - if (dql_avail(&dev_queue->dql) < 0) > + if (!dql_half_avail(&dev_queue->dql)) > return; > > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state)) > > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 2:38 ` Tom Herbert @ 2014-09-25 2:58 ` Dave Taht 2014-09-26 18:06 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Dave Taht @ 2014-09-25 2:58 UTC (permalink / raw) To: Tom Herbert Cc: Eric Dumazet, Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, Sep 24, 2014 at 7:38 PM, Tom Herbert <therbert@google.com> wrote: > On Wed, Sep 24, 2014 at 7:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> On Wed, 2014-09-24 at 12:22 -0700, Eric Dumazet wrote: >>> On Wed, 2014-09-24 at 11:34 -0700, Tom Herbert wrote: >>> > > >>> > I believe drivers typically use skb->len for BQL tracking. Since >>> > bytelimit is based on BQL here, it might be more correct to use >>> > skb->len. >> >> Speaking of BQL, I wonder if we now should try to not wakeup queues as >> soon some room was made, and instead have a 50% threshold ? >> >> This would probably increase probability to have bulk dequeues ;) >> > It would be good to have data on that. +1. Yes, I'd be very carefully observing BQL's behavior as these changes are made. netperf-wrapper could rather easily sample bql's limit on as low as 10ms interval I think.... and... me being me, I volunteer to watch 100Mbit and down. > In the absence of TSO, I've > seen BQL limits at around 30K for "standard" interrupt rates on 10G. It is typically 3k at 100Mbit, 1500 at 10Mbit. It thus usually goes overlimit by a MTU. I have simple tons of data on this on tons of kernels for various platforms. > This should mean that ~15K becomes available every interrupt period > (the math is actually straightforward), so that should already have 10 > packet batches which isn't bad! 
> > It's also probably true that we can tradeoff batching for latency in > many ways-- more batching increase latency, less batching helps > latency. For instance, the interrupt rate can be modulated to balance > between latency and batching (CPU utilization). I have long hoped that the actual BQL limit in play would feed into TCP small queues when there are a lot of flows to make each tcp "small" queue gradually smaller... > Tom > >> >> diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h >> index 5621547d631b..c0be7ff5ae97 100644 >> --- a/include/linux/dynamic_queue_limits.h >> +++ b/include/linux/dynamic_queue_limits.h >> @@ -83,6 +83,13 @@ static inline int dql_avail(const struct dql *dql) >> return dql->adj_limit - dql->num_queued; >> } >> >> +/* Returns true if a queue occupancy is less than half capacity */ >> +static inline bool dql_half_avail(const struct dql *dql) >> +{ >> + return dql->adj_limit >= (dql->num_queued << 1); >> +} >> + >> + >> /* Record number of completed objects and recalculate the limit. 
*/ >> void dql_completed(struct dql *dql, unsigned int count); >> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h >> index c8e388e5fccc..1f7541284b32 100644 >> --- a/include/linux/netdevice.h >> +++ b/include/linux/netdevice.h >> @@ -2413,7 +2413,7 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue, >> smp_mb(); >> >> /* check again in case another CPU has just made room avail */ >> - if (unlikely(dql_avail(&dev_queue->dql) >= 0)) >> + if (unlikely(dql_half_avail(&dev_queue->dql))) >> clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state); >> #endif >> } >> @@ -2448,7 +2448,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue, >> */ >> smp_mb(); >> >> - if (dql_avail(&dev_queue->dql) < 0) >> + if (!dql_half_avail(&dev_queue->dql)) >> return; >> >> if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state)) >> >> -- Dave Täht https://www.bufferbloat.net/projects/make-wifi-fast ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 2:58 ` Dave Taht @ 2014-09-26 18:06 ` Eric Dumazet 0 siblings, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-26 18:06 UTC (permalink / raw) To: Dave Taht Cc: Tom Herbert, Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, 2014-09-24 at 19:58 -0700, Dave Taht wrote: > I have long hoped that the actual BQL limit in play would feed into > TCP small queues when there are a lot of flows to make each tcp > "small" queue gradually smaller... Yep, but what do you do if you have 64 TX queues, each one having different BQL limits ? This looks like a NP problem when you also have millions of tcp flows. Basically BQL limit is more a sign of local host being able to drain TX completions fast or not. Here at Google, we have a very tiny queue of at most 2 packets per flow, and srtt being in usec units gives us a very dynamic signal, based on end to end measures. I am experimenting moving TSQ processing from tasklet to workqueue, because we need faster response to {soft}irq when we so often enter TCP stack. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 2:12 ` Eric Dumazet 2014-09-25 2:38 ` Tom Herbert @ 2014-09-25 23:40 ` Eric Dumazet 2014-09-26 6:04 ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet 2014-09-26 9:23 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE David Laight 1 sibling, 2 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-25 23:40 UTC (permalink / raw) To: Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, 2014-09-24 at 19:12 -0700, Eric Dumazet wrote: > On Wed, 2014-09-24 at 12:22 -0700, Eric Dumazet wrote: > > On Wed, 2014-09-24 at 11:34 -0700, Tom Herbert wrote: > > > > > > > I believe drivers typically use skb->len for BQL tracking. Since > > > bytelimit is based on BQL here, it might be more correct to use > > > skb->len. > > Speaking of BQL, I wonder if we now should try to not wakeup queues as > soon some room was made, and instead have a 50% threshold ? > > This would probably increase probability to have bulk dequeues ;) It turned out the problem I noticed was caused by compiler trying to be smart, but involving a bad MESI transaction. 0.05 │ mov 0xc0(%rax),%edi // LOAD dql->num_queued 0.48 │ mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count 58.23 │ add %edx,%edi 0.58 │ cmp %edi,0xc4(%rax) 0.76 │ mov %edi,0xc0(%rax) // STORE dql->num_queued += count 0.72 │ js bd8 I get an incredible 10 % gain by making sure cpu wont get the cache line in Shared mode. 
(I also tried a barrier() in netdev_tx_sent_queue() between the

	dql_queued(&dev_queue->dql, bytes);
-
+	barrier();
	if (likely(dql_avail(&dev_queue->dql) >= 0))

But following patch seems cleaner

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index 5621547d631b..978fbe332090 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -80,7 +80,7 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
 /* Returns how many objects can be queued, < 0 indicates over limit. */
 static inline int dql_avail(const struct dql *dql)
 {
-	return dql->adj_limit - dql->num_queued;
+	return ACCESS_ONCE(dql->adj_limit) - ACCESS_ONCE(dql->num_queued);
 }

^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions 2014-09-25 23:40 ` Eric Dumazet @ 2014-09-26 6:04 ` Eric Dumazet 2014-09-26 8:47 ` Jesper Dangaard Brouer ` (2 more replies) 2014-09-26 9:23 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE David Laight 1 sibling, 3 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-26 6:04 UTC (permalink / raw) To: Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa From: Eric Dumazet <edumazet@google.com> While doing high throughput test on a BQL enabled NIC, I found a very high cost in ndo_start_xmit() when accessing BQL data. It turned out the problem was caused by compiler trying to be smart, but involving a bad MESI transaction : 0.05 │ mov 0xc0(%rax),%edi // LOAD dql->num_queued 0.48 │ mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count 58.23 │ add %edx,%edi 0.58 │ cmp %edi,0xc4(%rax) 0.76 │ mov %edi,0xc0(%rax) // STORE dql->num_queued += count 0.72 │ js bd8 I got an incredible 10 % gain [1] by making sure cpu do not attempt to get the cache line in Shared mode, but directly requests for ownership. New code : mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count add %edx,0xc0(%rax) // RMW dql->num_queued += count mov 0xc4(%rax),%ecx // LOAD dql->adj_limit mov 0xc0(%rax),%edx // LOAD dql->num_queued cmp %edx,%ecx The TX completion was running from another cpu, with high interrupts rate. Note that I am using barrier() as a soft hint, as mb() here could be too heavy cost. [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled. 
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/dynamic_queue_limits.h | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index 5621547d631b..a4be70398ce1 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -73,14 +73,22 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
 {
 	BUG_ON(count > DQL_MAX_OBJECT);

-	dql->num_queued += count;
 	dql->last_obj_cnt = count;
+
+	/* We want to force a write first, so that cpu do not attempt
+	 * to get cache line containing last_obj_cnt, num_queued, adj_limit
+	 * in Shared state, but directly does a Request For Ownership
+	 * It is only a hint, we use barrier() only.
+	 */
+	barrier();
+
+	dql->num_queued += count;
 }

 /* Returns how many objects can be queued, < 0 indicates over limit. */
 static inline int dql_avail(const struct dql *dql)
 {
-	return dql->adj_limit - dql->num_queued;
+	return ACCESS_ONCE(dql->adj_limit) - ACCESS_ONCE(dql->num_queued);
 }

 /* Record number of completed objects and recalculate the limit. */

^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions 2014-09-26 6:04 ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet @ 2014-09-26 8:47 ` Jesper Dangaard Brouer 2014-09-26 11:06 ` Hannes Frederic Sowa 2014-09-28 21:43 ` David Miller 2 siblings, 0 replies; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-26 8:47 UTC (permalink / raw) To: Eric Dumazet Cc: Tom Herbert, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer On Thu, 25 Sep 2014 23:04:56 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote: [...] > > It turned out the problem was caused by compiler trying to be > smart, but involving a bad MESI transaction : > [...] > > I got an incredible 10 % gain [1] by making sure cpu do not attempt > to get the cache line in Shared mode, but directly requests for > ownership. [...] > > Signed-off-by: Eric Dumazet <edumazet@google.com> > --- I'm very impressed - thank you Eric for finding this!!! Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions 2014-09-26 6:04 ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet 2014-09-26 8:47 ` Jesper Dangaard Brouer @ 2014-09-26 11:06 ` Hannes Frederic Sowa 2014-09-26 12:02 ` Eric Dumazet 2014-09-28 21:43 ` David Miller 2 siblings, 1 reply; 31+ messages in thread From: Hannes Frederic Sowa @ 2014-09-26 11:06 UTC (permalink / raw) To: Eric Dumazet, Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann On Fri, Sep 26, 2014, at 08:04, Eric Dumazet wrote: > From: Eric Dumazet <edumazet@google.com> > > While doing high throughput test on a BQL enabled NIC, > I found a very high cost in ndo_start_xmit() when accessing BQL data. > > It turned out the problem was caused by compiler trying to be > smart, but involving a bad MESI transaction : > > 0.05 │ mov 0xc0(%rax),%edi // LOAD dql->num_queued > 0.48 │ mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > 58.23 │ add %edx,%edi > 0.58 │ cmp %edi,0xc4(%rax) > 0.76 │ mov %edi,0xc0(%rax) // STORE dql->num_queued += count > 0.72 │ js bd8 > > > I got an incredible 10 % gain [1] by making sure cpu do not attempt > to get the cache line in Shared mode, but directly requests for > ownership. > > New code : > mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > add %edx,0xc0(%rax) // RMW dql->num_queued += count > mov 0xc4(%rax),%ecx // LOAD dql->adj_limit > mov 0xc0(%rax),%edx // LOAD dql->num_queued > cmp %edx,%ecx > > The TX completion was running from another cpu, with high interrupts > rate. > > Note that I am using barrier() as a soft hint, as mb() here could be > too heavy cost. > > [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled. 
> > Signed-off-by: Eric Dumazet <edumazet@google.com> > --- > include/linux/dynamic_queue_limits.h | 12 ++++++++++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/include/linux/dynamic_queue_limits.h > b/include/linux/dynamic_queue_limits.h > index 5621547d631b..a4be70398ce1 100644 > --- a/include/linux/dynamic_queue_limits.h > +++ b/include/linux/dynamic_queue_limits.h > @@ -73,14 +73,22 @@ static inline void dql_queued(struct dql *dql, > unsigned int count) > { > BUG_ON(count > DQL_MAX_OBJECT); > > - dql->num_queued += count; > dql->last_obj_cnt = count; > + > + /* We want to force a write first, so that cpu do not attempt > + * to get cache line containing last_obj_cnt, num_queued, > adj_limit > + * in Shared state, but directly does a Request For Ownership > + * It is only a hint, we use barrier() only. > + */ > + barrier(); > + > + dql->num_queued += count; > } I thought that prefetchw() would be the canonical way to solve write stalls in CPUs and prepare the specific cache line to be written to. Interesting, thanks Eric. Thanks, Hannes ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions 2014-09-26 11:06 ` Hannes Frederic Sowa @ 2014-09-26 12:02 ` Eric Dumazet 0 siblings, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-26 12:02 UTC (permalink / raw) To: Hannes Frederic Sowa Cc: Tom Herbert, Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann On Fri, 2014-09-26 at 13:06 +0200, Hannes Frederic Sowa wrote: > I thought that prefetchw() would be the canonical way to solve write > stalls in CPUs and prepare the specific cache line to be written to. > Interesting, thanks Eric. True, and I have a tcp_ack() (tcp_clean_rtx_queue() more exactly) patch that indeed uses prefetchw() to bring tcp_shinfo() in cache before the tcp_unlink_write_queue(skb, sk); Performance went from 570000 packets per second to 820000 packets per second on a TSO=off GSO=off workload. But this applies after the https://patchwork.ozlabs.org/patch/392877/ patch (tcp: change tcp_skb_pcount() location ) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions 2014-09-26 6:04 ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet 2014-09-26 8:47 ` Jesper Dangaard Brouer 2014-09-26 11:06 ` Hannes Frederic Sowa @ 2014-09-28 21:43 ` David Miller 2 siblings, 0 replies; 31+ messages in thread From: David Miller @ 2014-09-28 21:43 UTC (permalink / raw) To: eric.dumazet Cc: therbert, brouer, netdev, alexander.h.duyck, toke, fw, jhs, dave.taht, john.r.fastabend, dborkman, hannes From: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu, 25 Sep 2014 23:04:56 -0700 > From: Eric Dumazet <edumazet@google.com> > > While doing high throughput test on a BQL enabled NIC, > I found a very high cost in ndo_start_xmit() when accessing BQL data. > > It turned out the problem was caused by compiler trying to be > smart, but involving a bad MESI transaction : > > 0.05 │ mov 0xc0(%rax),%edi // LOAD dql->num_queued > 0.48 │ mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > 58.23 │ add %edx,%edi > 0.58 │ cmp %edi,0xc4(%rax) > 0.76 │ mov %edi,0xc0(%rax) // STORE dql->num_queued += count > 0.72 │ js bd8 > > > I got an incredible 10 % gain [1] by making sure cpu do not attempt > to get the cache line in Shared mode, but directly requests for > ownership. > > New code : > mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > add %edx,0xc0(%rax) // RMW dql->num_queued += count > mov 0xc4(%rax),%ecx // LOAD dql->adj_limit > mov 0xc0(%rax),%edx // LOAD dql->num_queued > cmp %edx,%ecx > > The TX completion was running from another cpu, with high interrupts > rate. > > Note that I am using barrier() as a soft hint, as mb() here could be > too heavy cost. > > [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled. > > Signed-off-by: Eric Dumazet <edumazet@google.com> Ok now you're just showing off :-) Applied, thanks! ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 23:40 ` Eric Dumazet 2014-09-26 6:04 ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet @ 2014-09-26 9:23 ` David Laight 2014-09-26 13:16 ` David Laight 1 sibling, 1 reply; 31+ messages in thread From: David Laight @ 2014-09-26 9:23 UTC (permalink / raw) To: 'Eric Dumazet', Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa From: Eric Dumazet > On Wed, 2014-09-24 at 19:12 -0700, Eric Dumazet wrote: ... > It turned out the problem I noticed was caused by compiler trying to be > smart, but involving a bad MESI transaction. > > 0.05 mov 0xc0(%rax),%edi // LOAD dql->num_queued > 0.48 mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > 58.23 add %edx,%edi > 0.58 cmp %edi,0xc4(%rax) > 0.76 mov %edi,0xc0(%rax) // STORE dql->num_queued += count > 0.72 js bd8 > > > I get an incredible 10 % gain by making sure cpu wont get the cache line > in Shared mode. That is a stunning difference between requesting 'exclusive' access and upgrading 'shared' to exclusive. Stinks of a cpu bug? Or is the reported stall a side effect of waiting for the earlier 'cache line read' to complete in order to issue the 'upgrade to exclusive'. In which case gcc's instruction scheduler probably needs to be taught to schedule writes before reads. 
> (I also tried a barrier() in netdev_tx_sent_queue() between the > > dql_queued(&dev_queue->dql, bytes); > - > + barrier(); > if (likely(dql_avail(&dev_queue->dql) >= 0)) > > But following patch seems cleaner > > diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h > index 5621547d631b..978fbe332090 100644 > --- a/include/linux/dynamic_queue_limits.h > +++ b/include/linux/dynamic_queue_limits.h > @@ -80,7 +80,7 @@ static inline void dql_queued(struct dql *dql, unsigned int count) > /* Returns how many objects can be queued, < 0 indicates over limit. */ > static inline int dql_avail(const struct dql *dql) > { > - return dql->adj_limit - dql->num_queued; > + return ACCESS_ONCE(dql->adj_limit) - ACCESS_ONCE(dql->num_queued); Dunno, that could have an impact on other calls where the values are already in registers - I suspect ACCESS_ONCE() forces an access. David ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-26 9:23 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE David Laight @ 2014-09-26 13:16 ` David Laight 2014-09-26 13:38 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: David Laight @ 2014-09-26 13:16 UTC (permalink / raw) To: David Laight, 'Eric Dumazet', Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa From: David Laight > From: Eric Dumazet > > On Wed, 2014-09-24 at 19:12 -0700, Eric Dumazet wrote: > ... > > It turned out the problem I noticed was caused by compiler trying to be > > smart, but involving a bad MESI transaction. > > > > 0.05 mov 0xc0(%rax),%edi // LOAD dql->num_queued > > 0.48 mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > > 58.23 add %edx,%edi > > 0.58 cmp %edi,0xc4(%rax) > > 0.76 mov %edi,0xc0(%rax) // STORE dql->num_queued += count > > 0.72 js bd8 > > > > > > I get an incredible 10 % gain by making sure cpu wont get the cache line > > in Shared mode. > > That is a stunning difference between requesting 'exclusive' access > and upgrading 'shared' to exclusive. > Stinks of a cpu bug? > > Or is the reported stall a side effect of waiting for the earlier > 'cache line read' to complete in order to issue the 'upgrade to exclusive'. > In which case gcc's instruction scheduler probably needs to be taught > to schedule writes before reads. Thinking further. gcc is probably moving memory reads before writes under the assumption that the cpu might stall waiting for the read to complete but that the write can be buffered by the hardware. That assumption is true for simple cpus (like the Nios2), but for x86 with its multiple instructions in flight (etc) may make little difference if the memory is in the cache. 
OTOH it looks as though there is a big hit if you read then write a non-present cache line. (This may even depend on which instructions get executed in parallel, minor changes to the code could easily change that.) I wonder how easy it would be to modify gcc to remove (or even reverse) that memory ordering 'optimisation'. David ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-26 13:16 ` David Laight @ 2014-09-26 13:38 ` Eric Dumazet 0 siblings, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-26 13:38 UTC (permalink / raw) To: David Laight Cc: Tom Herbert, Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Fri, 2014-09-26 at 13:16 +0000, David Laight wrote: > OTOH it looks as though there is a big hit if you read then write > a non-present cache line. Well, this is exactly what it is about. Welcome to the club. This is a well known problem. Look at following old commit : http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3f9d35b9514da6757ca98831372518f9eeb71b33 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 17:58 ` Jesper Dangaard Brouer 2014-09-24 18:34 ` Tom Herbert @ 2014-09-24 22:13 ` Jamal Hadi Salim 2014-09-25 8:25 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 31+ messages in thread From: Jamal Hadi Salim @ 2014-09-24 22:13 UTC (permalink / raw) To: Jesper Dangaard Brouer, Eric Dumazet Cc: netdev, therbert, David S. Miller, Alexander Duyck, toke, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On 09/24/14 13:58, Jesper Dangaard Brouer wrote: > On Wed, 24 Sep 2014 10:23:15 -0700 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > >> pktgen is nice, but do not represent the majority of the traffic we send >> from high performance host where we want this bulk dequeue thing ;) > > This patch is actually targetted towards more normal use-cases. > Pktgen cannot even use this work, as it bypass the qdisc layer... When you post these patches - can you please also post basic performance numbers? You dont have to show improvement if it is hard for bulking to kick in, but you need to show no harm in at least latency for the general use case (i.e not pktgen maybe forwarding activity or something sourced from tcp). cheers, jamal ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 22:13 ` Jamal Hadi Salim @ 2014-09-25 8:25 ` Jesper Dangaard Brouer 2014-09-25 12:48 ` Jamal Hadi Salim 2014-09-25 13:52 ` Dave Taht 0 siblings, 2 replies; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-25 8:25 UTC (permalink / raw) To: Jamal Hadi Salim Cc: Eric Dumazet, netdev, therbert, David S. Miller, Alexander Duyck, toke, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer On Wed, 24 Sep 2014 18:13:57 -0400 Jamal Hadi Salim <jhs@mojatatu.com> wrote: > On 09/24/14 13:58, Jesper Dangaard Brouer wrote: > > On Wed, 24 Sep 2014 10:23:15 -0700 > > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > > > >> pktgen is nice, but do not represent the majority of the traffic we send > >> from high performance host where we want this bulk dequeue thing ;) > > > > This patch is actually targetted towards more normal use-cases. > > Pktgen cannot even use this work, as it bypass the qdisc layer... > > When you post these patches - can you please also post basic performance > numbers? You dont have to show improvement if it is hard for bulking > to kick in, but you need to show no harm in at least latency for the > general use case (i.e not pktgen maybe forwarding activity or something > sourced from tcp). I've done measurements with netperf-wrapper: http://netoptimizer.blogspot.dk/2014/09/mini-tutorial-for-netperf-wrapper-setup.html I have already previously posted my measurements here: http://people.netfilter.org/hawk/qdisc/ http://people.netfilter.org/hawk/qdisc/measure01/ http://people.netfilter.org/hawk/qdisc/experiment01/ Please, see my previous mail where I described each graph. The above measurements is for 10Gbit/s, but I've also done measurements on 1Gbit/s driver igb, and 10Mbit/s by forcing igb to use 10Mbit/s. Those results I forgot upload (and I cannot upload them right now, as I'm currently in Switzerland). 
-- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 8:25 ` Jesper Dangaard Brouer @ 2014-09-25 12:48 ` Jamal Hadi Salim 2014-09-25 14:40 ` Tom Herbert 2014-09-25 13:52 ` Dave Taht 1 sibling, 1 reply; 31+ messages in thread From: Jamal Hadi Salim @ 2014-09-25 12:48 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Eric Dumazet, netdev, therbert, David S. Miller, Alexander Duyck, toke, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On 09/25/14 04:25, Jesper Dangaard Brouer wrote: > On Wed, 24 Sep 2014 18:13:57 -0400 > I've done measurements with netperf-wrapper: > http://netoptimizer.blogspot.dk/2014/09/mini-tutorial-for-netperf-wrapper-setup.html > > I have already previously posted my measurements here: > http://people.netfilter.org/hawk/qdisc/ > http://people.netfilter.org/hawk/qdisc/measure01/ > http://people.netfilter.org/hawk/qdisc/experiment01/ > Theres a lot of data there to digest; all good looking. I will try to pay attention to detail and get back to you. What you have on those urls is fit for a paper, but what someone like me (with ADD) needs is a summary somewhere maybe in the commit logs. > Please, see my previous mail where I described each graph. > Is this patch 0? I think a simple statement in the commit log that no cats were harmed ^W^W^W performance was affected. A paragraph at most but if you want to do more, DaveM actually has no problems if you write the details of a novel in patch 0. cheers, jamal ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 12:48 ` Jamal Hadi Salim @ 2014-09-25 14:40 ` Tom Herbert 2014-09-25 14:57 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 31+ messages in thread From: Tom Herbert @ 2014-09-25 14:40 UTC (permalink / raw) To: Jamal Hadi Salim Cc: Jesper Dangaard Brouer, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Thu, Sep 25, 2014 at 5:48 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote: > On 09/25/14 04:25, Jesper Dangaard Brouer wrote: >> >> On Wed, 24 Sep 2014 18:13:57 -0400 > > >> I've done measurements with netperf-wrapper: >> >> http://netoptimizer.blogspot.dk/2014/09/mini-tutorial-for-netperf-wrapper-setup.html >> >> I have already previously posted my measurements here: >> http://people.netfilter.org/hawk/qdisc/ >> http://people.netfilter.org/hawk/qdisc/measure01/ >> http://people.netfilter.org/hawk/qdisc/experiment01/ >> > > Theres a lot of data there to digest; all good looking. > I will try to pay attention to detail and get back to you. > What you have on those urls is fit for a paper, but what someone > like me (with ADD) needs is a summary somewhere maybe in the commit > logs. > >> Please, see my previous mail where I described each graph. >> > > Is this patch 0? > I think a simple statement in the commit log that no cats > were harmed ^W^W^W performance was affected. A paragraph > at most but if you want to do more, DaveM actually has no > problems if you write the details of a novel in patch 0. > +1 A few test results in patch 0 are good. I like to have results for with and without patch. These should show two things: 1) Any regressions caused by the patch 2) Performance gains (in that order of importance :-) ). There doesn't need to be a lot here, just something reasonably representative, simple, and should be easily reproducible.
My expectation in bulk dequeue is that we should see no obvious regression and hopefully an improvement in CPU utilization-- are you able to verify this? > cheers, > jamal ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 14:40 ` Tom Herbert @ 2014-09-25 14:57 ` Jesper Dangaard Brouer 2014-09-25 15:05 ` Tom Herbert 2014-09-25 15:12 ` Eric Dumazet 0 siblings, 2 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-25 14:57 UTC (permalink / raw)
To: Tom Herbert
Cc: Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer

On Thu, 25 Sep 2014 07:40:33 -0700
Tom Herbert <therbert@google.com> wrote:

> A few test results in patch 0 are good. I like to have results for
> with and without patch. These should two things: 1) Any regressions
> caused by the patch 2) Performance gains (in that order of importance
> :-) ). There doesn't need to be a lot here, just something reasonably
> representative, simple, and should be easily reproducible. My
> expectation in bulk dequeue is that we should see no obvious
> regression and hopefully an improvement in CPU utilization-- are you
> able to verify this?

We are saving 3% CPU, as I described in my post with subject
"qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue":
 http://thread.gmane.org/gmane.linux.network/331152/focus=331154

Using UDP_STREAM on the 1Gbit/s driver igb, I can show that the
_raw_spin_lock calls are reduced by approx 3% when enabling bulking
of just 2 packets.

This test can only demonstrate a CPU usage reduction, as the
throughput is already at maximum link (bandwidth) capacity.

Notice the netperf option "-m 1472", which makes sure we are not
sending UDP IP-fragments::

 netperf -H 192.168.111.2 -t UDP_STREAM -l 120 -- -m 1472

Results from perf diff::

 # Command: perf diff
 # Event 'cycles'
 # Baseline    Delta  Symbol
 # no-bulk   bulk(1)
 # ........  .......  .........................................
 #
     7.05%   -3.03%   [k] _raw_spin_lock
     6.34%   +0.23%   [k] copy_user_enhanced_fast_string
     6.30%   +0.26%   [k] fib_table_lookup
     3.03%   +0.01%   [k] __slab_free
     3.00%   +0.08%   [k] intel_idle
     2.49%   +0.05%   [k] sock_alloc_send_pskb
     2.31%   +0.30%   netperf [.] send_omni_inner
     2.12%   +0.12%   netperf [.] send_data
     2.11%   +0.10%   [k] udp_sendmsg
     1.96%   +0.02%   [k] __ip_append_data
     1.48%   -0.01%   [k] __alloc_skb
     1.46%   +0.07%   [k] __mkroute_output
     1.34%   +0.05%   [k] __ip_select_ident
     1.29%   +0.03%   [k] check_leaf
     1.27%   +0.09%   [k] __skb_get_hash

A nitpick is that this testing was done on V2 of the patchset.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 31+ messages in thread
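[Editorial note: for readers wanting to reproduce the comparison, the perf diff workflow sketched below is a plausible reconstruction; the profile filenames and sampling duration are assumptions, not taken from the mail.]

```shell
# Sketch: compare CPU profiles of a no-bulking run vs a bulking run.
# Start the netperf UDP_STREAM load first, then sample system-wide.

# Run 1 (bulking disabled kernel): 60s of system-wide cycle samples.
perf record -a -o perf.data.old -- sleep 60

# ... reboot into the patched (bulking-enabled) kernel, restart load ...

# Run 2 (bulking enabled).
perf record -a -o perf.data -- sleep 60

# perf diff reports per-symbol deltas, baseline first.
perf diff perf.data.old perf.data
```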
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 14:57 ` Jesper Dangaard Brouer @ 2014-09-25 15:05 ` Tom Herbert 2014-09-25 15:23 ` Jesper Dangaard Brouer 2014-09-25 15:12 ` Eric Dumazet 1 sibling, 1 reply; 31+ messages in thread From: Tom Herbert @ 2014-09-25 15:05 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Thu, Sep 25, 2014 at 7:57 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Thu, 25 Sep 2014 07:40:33 -0700 > Tom Herbert <therbert@google.com> wrote: > >> A few test results in patch 0 are good. I like to have results for >> with and without patch. These should two things: 1) Any regressions >> caused by the patch 2) Performance gains (in that order of importance >> :-) ). There doesn't need to be a lot here, just something reasonably >> representative, simple, and should be easily reproducible. My >> expectation in bulk dequeue is that we should see no obvious >> regression and hopefully an improvement in CPU utilization-- are you >> able to verify this? > > We are saving 3% CPU, as I described in my post with subject: > "qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue": > http://thread.gmane.org/gmane.linux.network/331152/focus=331154 > > Using UDP_STREAM on 1Gbit/s driver igb, I can show that the > _raw_spin_lock calls are reduced with approx 3%, when enabling > bulking of just 2 packets. > That's great. In commit log, would be good to have results with TCP_STREAM also and please report aggregate CPU utilization changes (like from mpstat). Thanks, Tom > This test can only demonstrates a CPU usage reduction, as the > throughput is already at maximum link (bandwidth) capacity. 
> > Notice netperf option "-m 1472" which makes sure we are not sending > UDP IP-fragments:: > > netperf -H 192.168.111.2 -t UDP_STREAM -l 120 -- -m 1472 > > Results from perf diff:: > > # Command: perf diff > # Event 'cycles' > # Baseline Delta Symbol > # no-bulk bulk(1) > # ........ ....... ......................................... > # > 7.05% -3.03% [k] _raw_spin_lock > 6.34% +0.23% [k] copy_user_enhanced_fast_string > 6.30% +0.26% [k] fib_table_lookup > 3.03% +0.01% [k] __slab_free > 3.00% +0.08% [k] intel_idle > 2.49% +0.05% [k] sock_alloc_send_pskb > 2.31% +0.30% netperf [.] send_omni_inner > 2.12% +0.12% netperf [.] send_data > 2.11% +0.10% [k] udp_sendmsg > 1.96% +0.02% [k] __ip_append_data > 1.48% -0.01% [k] __alloc_skb > 1.46% +0.07% [k] __mkroute_output > 1.34% +0.05% [k] __ip_select_ident > 1.29% +0.03% [k] check_leaf > 1.27% +0.09% [k] __skb_get_hash > > A nitpick is that, this testing were done on V2 of the patchset. > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Sr. Network Kernel Developer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 15:05 ` Tom Herbert @ 2014-09-25 15:23 ` Jesper Dangaard Brouer 2014-09-25 15:58 ` Tom Herbert 2014-09-29 20:14 ` Jesper Dangaard Brouer 0 siblings, 2 replies; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-25 15:23 UTC (permalink / raw) To: Tom Herbert Cc: Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer On Thu, 25 Sep 2014 08:05:38 -0700 Tom Herbert <therbert@google.com> wrote: > On Thu, Sep 25, 2014 at 7:57 AM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > On Thu, 25 Sep 2014 07:40:33 -0700 > > Tom Herbert <therbert@google.com> wrote: > > > >> A few test results in patch 0 are good. I like to have results for > >> with and without patch. These should two things: 1) Any regressions > >> caused by the patch 2) Performance gains (in that order of importance > >> :-) ). There doesn't need to be a lot here, just something reasonably > >> representative, simple, and should be easily reproducible. My > >> expectation in bulk dequeue is that we should see no obvious > >> regression and hopefully an improvement in CPU utilization-- are you > >> able to verify this? > > > > We are saving 3% CPU, as I described in my post with subject: > > "qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue": > > http://thread.gmane.org/gmane.linux.network/331152/focus=331154 > > > > Using UDP_STREAM on 1Gbit/s driver igb, I can show that the > > _raw_spin_lock calls are reduced with approx 3%, when enabling > > bulking of just 2 packets. > > > > That's great. In commit log, would be good to have results with > TCP_STREAM also and please report aggregate CPU utilization changes > (like from mpstat). 
TCP_STREAM is not a good test for this, because unless both TSO and
GSO are disabled, the packets will not hit the code path that this
patch changes. When we later add support for TSO and GSO bulking,
it will make sense to include TCP_STREAM testing, not before.

I will redo the tests once I get home to my testlab, as the remote lab
I'm using now is annoyingly slow at rebooting machines, since we no
longer have a runtime option to enable/disable bulking (I'm currently
in Switzerland).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 31+ messages in thread
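[Editorial note: disabling the offloads so TCP packets traverse the qdisc dequeue path one MTU-sized skb at a time can be done at runtime with ethtool; the interface name below is an assumption.]

```shell
# Sender side: turn off TSO and GSO (eth1 is an assumed interface name).
ethtool -K eth1 tso off gso off

# Verify the resulting offload settings.
ethtool -k eth1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'
```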
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 15:23 ` Jesper Dangaard Brouer @ 2014-09-25 15:58 ` Tom Herbert 2014-09-29 20:23 ` Jesper Dangaard Brouer 2014-09-29 20:14 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 31+ messages in thread From: Tom Herbert @ 2014-09-25 15:58 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Thu, Sep 25, 2014 at 8:23 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Thu, 25 Sep 2014 08:05:38 -0700 > Tom Herbert <therbert@google.com> wrote: > >> On Thu, Sep 25, 2014 at 7:57 AM, Jesper Dangaard Brouer >> <brouer@redhat.com> wrote: >> > On Thu, 25 Sep 2014 07:40:33 -0700 >> > Tom Herbert <therbert@google.com> wrote: >> > >> >> A few test results in patch 0 are good. I like to have results for >> >> with and without patch. These should two things: 1) Any regressions >> >> caused by the patch 2) Performance gains (in that order of importance >> >> :-) ). There doesn't need to be a lot here, just something reasonably >> >> representative, simple, and should be easily reproducible. My >> >> expectation in bulk dequeue is that we should see no obvious >> >> regression and hopefully an improvement in CPU utilization-- are you >> >> able to verify this? >> > >> > We are saving 3% CPU, as I described in my post with subject: >> > "qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue": >> > http://thread.gmane.org/gmane.linux.network/331152/focus=331154 >> > >> > Using UDP_STREAM on 1Gbit/s driver igb, I can show that the >> > _raw_spin_lock calls are reduced with approx 3%, when enabling >> > bulking of just 2 packets. >> > >> >> That's great. 
In commit log, would be good to have results with >> TCP_STREAM also and please report aggregate CPU utilization changes >> (like from mpstat). > > The TCP_STREAM is not a good test for this, because unless disabling > both TSO and GSO the packets will not hit the code path (that this > patch changes). When we later add support for TSO and GSO bulking, > then it will make sense to include TCP_STREAM testing, not before. > Disabling TSO and GSO is fine. I'm interested to see interactions with TCP. > I will redo the tests, once I get home to my testlab, as the remote lab > I'm using now is annoyingly slow rebooting machines, as we not longer > have a runtime option for enable/disable (I'm currently in Switzerland). > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Sr. Network Kernel Developer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 15:58 ` Tom Herbert @ 2014-09-29 20:23 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-29 20:23 UTC (permalink / raw)
To: Tom Herbert
Cc: Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer

Hi Tom,

On Thu, 25 Sep 2014 08:58:54 -0700
Tom Herbert <therbert@google.com> wrote:

> On Thu, Sep 25, 2014 at 8:23 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > On Thu, 25 Sep 2014 08:05:38 -0700 Tom Herbert <therbert@google.com> wrote:
> >
> >> On Thu, Sep 25, 2014 at 7:57 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >> > On Thu, 25 Sep 2014 07:40:33 -0700 Tom Herbert <therbert@google.com> wrote:
> >> >
[...]
> >>
> >> That's great. In commit log, would be good to have results with
> >> TCP_STREAM also and please report aggregate CPU utilization changes
> >> (like from mpstat).
> >
> > The TCP_STREAM is not a good test for this, because unless disabling
> > both TSO and GSO the packets will not hit the code path (that this
> > patch changes). When we later add support for TSO and GSO bulking,
> > then it will make sense to include TCP_STREAM testing, not before.
> >
> Disabling TSO and GSO is fine. I'm interested to see interactions with TCP.

TCP already benefits from bulking via TSO, and especially via GSO
segmented packets, and this patch does not bulk TSO and GSO packets.
Testing the effect for TCP involves disabling TSO and GSO, but I had
to enable GRO on the receiver, which reduces the number of ACK
packets; otherwise the system could not reach the 10Gbit/s link
capacity even without bulking.
Test cmd:
 * netperf -H 192.168.8.2 -t TCP_STREAM -l 1000 -D 1 -n -N

The measured perf diff benefit for TCP_STREAM was 4.66% less CPU used
on calls to _raw_spin_lock() (mostly from sch_direct_xmit()).

Tool mpstat, while stressing the system with 24x netperf TCP_STREAM,
shows:
 * Disabled bulking: 8.30% soft, 88.75% idle
 * Enabled bulking:  7.80% soft, 89.36% idle

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 31+ messages in thread
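[Editorial note: a sketch of how such a parallel TCP_STREAM stress run with mpstat sampling might look; the stream count and sampling interval are taken from or suggested by the mail, the rest are assumptions.]

```shell
# Launch 24 parallel TCP_STREAM tests against the netserver host,
# then sample aggregate CPU utilization while they run.
for i in $(seq 1 24); do
    netperf -H 192.168.8.2 -t TCP_STREAM -l 1000 -D 1 -n -N &
done

# 10 samples, 5 seconds apart; the final "Average" row gives the
# aggregate %soft and %idle figures quoted above.
mpstat 5 10

wait
```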
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 15:23 ` Jesper Dangaard Brouer 2014-09-25 15:58 ` Tom Herbert @ 2014-09-29 20:14 ` Jesper Dangaard Brouer 1 sibling, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-29 20:14 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Tom Herbert, Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa

On Thu, 25 Sep 2014 17:23:29 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> I will redo the tests, once I get home to my testlab, as the remote lab
> I'm using now is annoyingly slow rebooting machines, as we not longer
> have a runtime option for enable/disable (I'm currently in Switzerland).

I'm going to change the bulking "budget" from 8 packets to 7 packets,
based on my netperf-wrapper tests.

I've designed some tests for the tool "netperf-wrapper" that try to
measure the HoL (head-of-line) blocking effect. I've turned down the
link speed to 100Mbit/s on driver igb, and use a single TXQ setup
(cmdline: ethtool -L eth1 combined 1). I can now show some kind of HoL
blocking effect. The strange part is that the HoL effect only occurs
above 8 packets; 7 packets and below show no bad effect of this
bulking patch.
Results of the 100Mbit/s test qdisc_prio_hol, latency in the high prio band:
 * GSO:         stable average on 2.23ms
 * TSO:         varies between min 4.13ms to 4.41ms (range 0.28ms)
 * No bulking:  stable average on 1.71ms (3x outliers on 1.95ms)
 * Bulking(8):  varies between 1.71ms to 1.95ms (range 0.24ms)
 * Bulking(7):  stable average on 1.71ms (1x outlier on 1.95ms)
 * Bulking(6):  stable average on 1.71ms (3x outliers on 1.91ms)
 * Bulking(5):  stable average on 1.71ms (1x outlier on 1.91ms)
 * Bulking(4):  stable average on 1.71ms (no outliers)

Bulking(8) calculation:
 * Added delay was 0.24ms (1.95ms - 1.71ms), corresponding to 3000 bytes
 * 1500 bytes * 8 bits/byte / 100Mbit/s = 0.12 ms
 * (100*10^6)*(1.95/10^3)/8 = 24375 bytes
 * (100*10^6)*(1.71/10^3)/8 = 21375 bytes
 * Added HoL: 3000 bytes
 * Still, compared to TSO and GSO, the added HoL blocking is small.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 31+ messages in thread
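[Editorial note: the delay-to-bytes arithmetic above can be double-checked with a small shell/awk sketch restating the numbers from the mail.]

```shell
# Recompute the HoL arithmetic from the mail at 100 Mbit/s.
LINK_BPS=100000000

# Bytes serialized on the wire during a given delay in ms:
# bps * (ms/1000) / 8
bytes_for_delay() {
    awk -v bps="$LINK_BPS" -v ms="$1" 'BEGIN { printf "%.0f\n", bps * (ms/1000) / 8 }'
}

bytes_for_delay 1.95   # 24375 bytes buffered at the 1.95ms latency
bytes_for_delay 1.71   # 21375 bytes buffered at the 1.71ms baseline
# Difference: 3000 bytes of added head-of-line blocking.

# Serialization time of one 1500-byte MTU packet, approx 0.12 ms:
awk -v bps="$LINK_BPS" 'BEGIN { printf "%.2f ms\n", 1500*8/bps*1000 }'
```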
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 14:57 ` Jesper Dangaard Brouer 2014-09-25 15:05 ` Tom Herbert @ 2014-09-25 15:12 ` Eric Dumazet 1 sibling, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-25 15:12 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Tom Herbert, Jamal Hadi Salim, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Thu, 2014-09-25 at 16:57 +0200, Jesper Dangaard Brouer wrote: > We are saving 3% CPU, as I described in my post with subject: > "qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue": > http://thread.gmane.org/gmane.linux.network/331152/focus=331154 > > Using UDP_STREAM on 1Gbit/s driver igb, I can show that the > _raw_spin_lock calls are reduced with approx 3%, when enabling > bulking of just 2 packets. > > This test can only demonstrates a CPU usage reduction, as the > throughput is already at maximum link (bandwidth) capacity. > > Notice netperf option "-m 1472" which makes sure we are not sending > UDP IP-fragments:: > > netperf -H 192.168.111.2 -t UDP_STREAM -l 120 -- -m 1472 > > Results from perf diff:: > > # Command: perf diff > # Event 'cycles' > # Baseline Delta Symbol > # no-bulk bulk(1) > # ........ ....... ......................................... > # > 7.05% -3.03% [k] _raw_spin_lock > 6.34% +0.23% [k] copy_user_enhanced_fast_string > 6.30% +0.26% [k] fib_table_lookup > 3.03% +0.01% [k] __slab_free > 3.00% +0.08% [k] intel_idle > 2.49% +0.05% [k] sock_alloc_send_pskb > 2.31% +0.30% netperf [.] send_omni_inner > 2.12% +0.12% netperf [.] 
send_data > 2.11% +0.10% [k] udp_sendmsg > 1.96% +0.02% [k] __ip_append_data > 1.48% -0.01% [k] __alloc_skb > 1.46% +0.07% [k] __mkroute_output > 1.34% +0.05% [k] __ip_select_ident > 1.29% +0.03% [k] check_leaf > 1.27% +0.09% [k] __skb_get_hash > > A nitpick is that, this testing were done on V2 of the patchset. > You could avoid the fib_table_lookup() cost by using netperf -- -N -n (connected UDP sockets) And of course reduce message sizes to increase pps ^ permalink raw reply [flat|nested] 31+ messages in thread
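[Editorial note: Eric's two suggestions combined would give a command line roughly like the sketch below; the "-m 64" message size is an arbitrary example value, not from the mail.]

```shell
# UDP_STREAM with connected sockets (test-specific -N -n after the
# "--" marker), avoiding the per-packet route/FIB lookup, plus a
# small message size to push the packets-per-second rate up.
netperf -H 192.168.111.2 -t UDP_STREAM -l 120 -- -m 64 -N -n
```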
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 8:25 ` Jesper Dangaard Brouer 2014-09-25 12:48 ` Jamal Hadi Salim @ 2014-09-25 13:52 ` Dave Taht 1 sibling, 0 replies; 31+ messages in thread From: Dave Taht @ 2014-09-25 13:52 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Jamal Hadi Salim, Eric Dumazet, netdev, Tom Herbert, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Thu, Sep 25, 2014 at 1:25 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Wed, 24 Sep 2014 18:13:57 -0400 > Jamal Hadi Salim <jhs@mojatatu.com> wrote: > >> On 09/24/14 13:58, Jesper Dangaard Brouer wrote: >> > On Wed, 24 Sep 2014 10:23:15 -0700 >> > Eric Dumazet <eric.dumazet@gmail.com> wrote: >> > >> >> > >> >> pktgen is nice, but do not represent the majority of the traffic we send >> >> from high performance host where we want this bulk dequeue thing ;) >> > >> > This patch is actually targetted towards more normal use-cases. >> > Pktgen cannot even use this work, as it bypass the qdisc layer... >> >> When you post these patches - can you please also post basic performance >> numbers? You dont have to show improvement if it is hard for bulking >> to kick in, but you need to show no harm in at least latency for the >> general use case (i.e not pktgen maybe forwarding activity or something >> sourced from tcp). > > I've done measurements with netperf-wrapper: > http://netoptimizer.blogspot.dk/2014/09/mini-tutorial-for-netperf-wrapper-setup.html > > I have already previously posted my measurements here: > http://people.netfilter.org/hawk/qdisc/ > http://people.netfilter.org/hawk/qdisc/measure01/ > http://people.netfilter.org/hawk/qdisc/experiment01/ > > Please, see my previous mail where I described each graph. 
Stuff like this:

http://people.netfilter.org/hawk/qdisc/experiment01/compare_TSO_vs_TSO_with_rxusec30__rr_latency.png

will no doubt make many a high speed trader happy.

My point was that by repeating this experiment for each successive
change (Eric's 1/2 BQL patch, better batching, sch_fq, different
ethernet drivers, etc), you (or people duplicating the experiment)
can produce ongoing comparison plots...

> The above measurements is for 10Gbit/s, but I've also done measurements
> on 1Gbit/s driver igb, and 10Mbit/s by forcing igb to use 10Mbit/s.
> Those results I forgot upload (and I cannot upload them right now, as
> I'm currently in Switzerland).
>
> -- 
> Best regards,
>  Jesper Dangaard Brouer
>  MSc.CS, Sr. Network Kernel Developer at Red Hat
>  Author of http://www.iptv-analyzer.org
>  LinkedIn: http://www.linkedin.com/in/brouer

-- 
Dave Täht

https://www.bufferbloat.net/projects/make-wifi-fast

^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2014-09-29 20:24 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-24 16:10 [net-next PATCH 0/1 V4] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer
2014-09-24 16:12 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
2014-09-24 17:23   ` Eric Dumazet
2014-09-24 17:58     ` Jesper Dangaard Brouer
2014-09-24 18:34       ` Tom Herbert
2014-09-24 19:22         ` Eric Dumazet
2014-09-25  2:12           ` Eric Dumazet
2014-09-25  2:38             ` Tom Herbert
2014-09-25  2:58               ` Dave Taht
2014-09-26 18:06                 ` Eric Dumazet
2014-09-25 23:40           ` Eric Dumazet
2014-09-26  6:04             ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet
2014-09-26  8:47               ` Jesper Dangaard Brouer
2014-09-26 11:06               ` Hannes Frederic Sowa
2014-09-26 12:02                 ` Eric Dumazet
2014-09-28 21:43               ` David Miller
2014-09-26  9:23             ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE David Laight
2014-09-26 13:16               ` David Laight
2014-09-26 13:38                 ` Eric Dumazet
2014-09-24 22:13       ` Jamal Hadi Salim
2014-09-25  8:25         ` Jesper Dangaard Brouer
2014-09-25 12:48           ` Jamal Hadi Salim
2014-09-25 14:40             ` Tom Herbert
2014-09-25 14:57               ` Jesper Dangaard Brouer
2014-09-25 15:05                 ` Tom Herbert
2014-09-25 15:23                   ` Jesper Dangaard Brouer
2014-09-25 15:58                     ` Tom Herbert
2014-09-29 20:23                       ` Jesper Dangaard Brouer
2014-09-29 20:14                     ` Jesper Dangaard Brouer
2014-09-25 15:12                 ` Eric Dumazet
2014-09-25 13:52           ` Dave Taht