* [net-next PATCH 0/1 V4] qdisc bulk dequeuing and utilizing delayed tailptr updates @ 2014-09-24 16:10 Jesper Dangaard Brouer 2014-09-24 16:12 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer 0 siblings, 1 reply; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-24 16:10 UTC (permalink / raw) To: netdev, therbert, David S. Miller, Eric Dumazet, Jesper Dangaard Brouer Cc: Alexander Duyck, John Fastabend, toke, jhs, Dave Taht This patch uses DaveM's recent API changes to dev_hard_start_xmit(), from the qdisc layer, to implement dequeue bulking. In this V4 iteration we are choosing a conservative approach. Patch V4: - Patch rewritten in the Red Hat Neuchatel office, joint work with Hannes, Daniel and Florian. - Conservative approach of only bulking on BQL enabled drivers - No user tunable parameter, but limit bulking to 8 packets. - For now, avoid bulking GSO packets. --- Jesper Dangaard Brouer (1): qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE include/net/sch_generic.h | 16 +++++++++++++++ net/sched/sch_generic.c | 47 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 61 insertions(+), 2 deletions(-) -- ^ permalink raw reply [flat|nested] 31+ messages in thread
* [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 16:10 [net-next PATCH 0/1 V4] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer @ 2014-09-24 16:12 ` Jesper Dangaard Brouer 2014-09-24 17:23 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-24 16:12 UTC (permalink / raw) To: netdev, therbert, David S. Miller, Eric Dumazet, Jesper Dangaard Brouer Cc: Alexander Duyck, toke, Florian Westphal, jhs, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa Based on DaveM's recent API work on dev_hard_start_xmit(), which allows sending/processing an entire skb list. This patch implements qdisc bulk dequeue, by allowing multiple packets to be dequeued in dequeue_skb(). The optimization principle for this is twofold: (1) amortize locking cost and (2) avoid the expensive tailptr update for notifying HW. (1) Several packets are dequeued while holding the qdisc root_lock, amortizing locking cost over several packets. The dequeued SKB list is processed under the TXQ lock in dev_hard_start_xmit(), thus also amortizing the cost of the TXQ lock. (2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more API to delay the HW tailptr update, which also reduces the cost per packet. One restriction of the new API is that every SKB must belong to the same TXQ. This patch takes the easy way out, by restricting bulk dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that the qdisc has only a single TXQ attached. Some detail about the flow; dev_hard_start_xmit() will process the skb list, and transmit packets individually towards the driver (see xmit_one()). In case the driver stops midway in the list, the remaining skb list is returned by dev_hard_start_xmit(). In sch_direct_xmit() this returned list is requeued by dev_requeue_skb(). 
To avoid overshooting the HW limits, which results in requeuing, the patch limits the amount of bytes dequeued, based on the driver's BQL limits. In effect, bulking will only happen for BQL enabled drivers. Besides the bytelimit from BQL, also limit bulking to a maximum of 8 packets, to avoid any issues with available descriptors in HW. For now, as a conservative approach, don't bulk dequeue GSO and segmented GSO packets, because testing showed requeuing occurring with segmented GSO packets. Joint work with Hannes, Daniel and Florian. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> --- V4: - Patch rewritten in the Red Hat Neuchatel office, joint work with Hannes, Daniel and Florian. - Conservative approach of only bulking on BQL enabled drivers - No user tunable parameter, but limit bulking to 8 packets. - For now, avoid bulking GSO packets. V3: - Correct use of BQL - Some minor adjustments based on feedback. - Default setting only bulk dequeue 1 extra packet (2 packets). 
V2: - Restruct functions, split out functionality - Use BQL bytelimit to avoid overshooting driver limits --- include/net/sch_generic.h | 16 +++++++++++++++ net/sched/sch_generic.c | 47 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 61 insertions(+), 2 deletions(-) diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index a3cfb8e..4e39a3e 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -6,6 +6,7 @@ #include <linux/rcupdate.h> #include <linux/pkt_sched.h> #include <linux/pkt_cls.h> +#include <linux/dynamic_queue_limits.h> #include <net/gen_stats.h> #include <net/rtnetlink.h> @@ -111,6 +112,21 @@ static inline void qdisc_run_end(struct Qdisc *qdisc) qdisc->__state &= ~__QDISC___STATE_RUNNING; } +static inline bool qdisc_may_bulk(const struct Qdisc *qdisc) +{ + return qdisc->flags & TCQ_F_ONETXQUEUE; +} + +static inline int qdisc_avail_bulklimit(const struct netdev_queue *txq) +{ +#ifdef CONFIG_BQL + /* Non-BQL migrated drivers will return 0, too. */ + return dql_avail(&txq->dql); +#else + return 0; +#endif +} + static inline bool qdisc_is_throttled(const struct Qdisc *qdisc) { return test_bit(__QDISC_STATE_THROTTLED, &qdisc->state) ? 
true : false; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 19696eb..6fba089 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q) return 0; } +static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q, + struct sk_buff *head_skb, + int bytelimit) +{ + struct sk_buff *skb, *tail_skb = head_skb; + int budget = 8; /* Arbitrary, but conservatively choosen limit */ + + while (bytelimit > 0 && --budget > 0) { + /* For now, don't bulk dequeue GSO (or GSO segmented) pkts */ + if (tail_skb->next || skb_is_gso(tail_skb)) + break; + + skb = q->dequeue(q); + if (!skb) + break; + + bytelimit -= skb->len; /* covers GSO len */ + skb = validate_xmit_skb(skb, qdisc_dev(q)); + if (!skb) + break; + + /* "skb" can be a skb list after validate call above + * (GSO segmented), but it is okay to append it to + * current tail_skb->next, because next round will exit + * in-case "tail_skb->next" is a skb list. + */ + tail_skb->next = skb; + tail_skb = skb; + } + + return head_skb; +} + +/* Note that dequeue_skb can possibly return a SKB list (via skb->next). + * A requeued skb (via q->gso_skb) can also be a SKB list. + */ static inline struct sk_buff *dequeue_skb(struct Qdisc *q) { struct sk_buff *skb = q->gso_skb; @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q) } else skb = NULL; } else { - if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) { + if (!(q->flags & TCQ_F_ONETXQUEUE) || + !netif_xmit_frozen_or_stopped(txq)) { + int bytelimit = qdisc_avail_bulklimit(txq); + skb = q->dequeue(q); - if (skb) + if (skb) { + bytelimit -= skb->len; skb = validate_xmit_skb(skb, qdisc_dev(q)); + } + if (skb && qdisc_may_bulk(q)) + skb = try_bulk_dequeue_skb(q, skb, bytelimit); } } ^ permalink raw reply related [flat|nested] 31+ messages in thread
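The interaction between the BQL byte limit and the hard packet budget in the patch is compact but easy to misread. The following is a minimal userspace model (toy code, not the kernel implementation; plain packet lengths stand in for sk_buffs) of how the two limits combine in dequeue_skb()/try_bulk_dequeue_skb():

```c
#include <assert.h>

/* Toy model of the budget logic: the head packet is always taken
 * (bytelimit may go negative), then further packets are pulled only
 * while the BQL-derived byte budget stays positive and the hard
 * packet budget is not exhausted. */
#define PKT_BUDGET 8	/* same arbitrary cap as in the patch */

static int bulk_count(const int *pkt_len, int npkts, int bytelimit)
{
	int taken = 0, budget = PKT_BUDGET;

	if (npkts == 0)
		return 0;
	bytelimit -= pkt_len[taken++];	/* head packet, always dequeued */

	while (taken < npkts && bytelimit > 0 && --budget > 0) {
		bytelimit -= pkt_len[taken];
		taken++;
	}
	return taken;
}
```

With a byte limit of 0 (a non-BQL driver, per qdisc_avail_bulklimit()) only the head packet is taken, which is why bulking is effectively restricted to BQL-enabled drivers.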
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 16:12 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer @ 2014-09-24 17:23 ` Eric Dumazet 2014-09-24 17:58 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2014-09-24 17:23 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: netdev, therbert, David S. Miller, Alexander Duyck, toke, Florian Westphal, jhs, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, 2014-09-24 at 18:12 +0200, Jesper Dangaard Brouer wrote: > Based on DaveM's recent API work on dev_hard_start_xmit(), that allows > sending/processing an entire skb list. > > This patch implements qdisc bulk dequeue, by allowing multiple packets > to be dequeued in dequeue_skb(). > > The optimization principle for this is two fold, (1) to amortize > locking cost and (2) avoid expensive tailptr update for notifying HW. > (1) Several packets are dequeued while holding the qdisc root_lock, > amortizing locking cost over several packet. The dequeued SKB list is > processed under the TXQ lock in dev_hard_start_xmit(), thus also > amortizing the cost of the TXQ lock. > (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more > API to delay HW tailptr update, which also reduces the cost per > packet. > > One restriction of the new API is that every SKB must belong to the > same TXQ. This patch takes the easy way out, by restricting bulk > dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the > qdisc only have attached a single TXQ. > > Some detail about the flow; dev_hard_start_xmit() will process the skb > list, and transmit packets individually towards the driver (see > xmit_one()). In case the driver stops midway in the list, the > remaining skb list is returned by dev_hard_start_xmit(). In > sch_direct_xmit() this returned list is requeued by dev_requeue_skb(). 
> > To avoid overshooting the HW limits, which results in requeuing, the > patch limits the amount of bytes dequeued, based on the drivers BQL > limits. In-effect bulking will only happen for BQL enabled drivers. > Besides the bytelimit from BQL, also limit bulking to maximum 8 > packets to avoid any issues with available descriptor in HW. > > For now, as a conservative approach, don't bulk dequeue GSO and > segmented GSO packets, because testing showed requeuing occuring > with segmented GSO packets. > > Jointed work with Hannes, Daniel and Florian. > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> > Signed-off-by: Daniel Borkmann <dborkman@redhat.com> > Signed-off-by: Florian Westphal <fw@strlen.de> > > --- > V4: > - Patch rewritten in the Red Hat Neuchatel office jointed work with > Hannes, Daniel and Florian. > - Conservative approach of only using on BQL enabled drivers > - No user tunable parameter, but limit bulking to 8 packets. > - For now, avoid bulking GSO packets packets. > > V3: > - Correct use of BQL > - Some minor adjustments based on feedback. > - Default setting only bulk dequeue 1 extra packet (2 packets). 
> > V2: > - Restruct functions, split out functionality > - Use BQL bytelimit to avoid overshooting driver limits > --- > include/net/sch_generic.h | 16 +++++++++++++++ > net/sched/sch_generic.c | 47 +++++++++++++++++++++++++++++++++++++++++++-- > 2 files changed, 61 insertions(+), 2 deletions(-) > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h > index a3cfb8e..4e39a3e 100644 > --- a/include/net/sch_generic.h > +++ b/include/net/sch_generic.h > @@ -6,6 +6,7 @@ > #include <linux/rcupdate.h> > #include <linux/pkt_sched.h> > #include <linux/pkt_cls.h> > +#include <linux/dynamic_queue_limits.h> > #include <net/gen_stats.h> > #include <net/rtnetlink.h> > > @@ -111,6 +112,21 @@ static inline void qdisc_run_end(struct Qdisc *qdisc) > qdisc->__state &= ~__QDISC___STATE_RUNNING; > } > > +static inline bool qdisc_may_bulk(const struct Qdisc *qdisc) > +{ > + return qdisc->flags & TCQ_F_ONETXQUEUE; > +} > + > +static inline int qdisc_avail_bulklimit(const struct netdev_queue *txq) > +{ > +#ifdef CONFIG_BQL > + /* Non-BQL migrated drivers will return 0, too. */ > + return dql_avail(&txq->dql); > +#else > + return 0; > +#endif > +} > + > static inline bool qdisc_is_throttled(const struct Qdisc *qdisc) > { > return test_bit(__QDISC_STATE_THROTTLED, &qdisc->state) ? true : false; > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > index 19696eb..6fba089 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q) > return 0; > } > > +static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q, > + struct sk_buff *head_skb, > + int bytelimit) > +{ > + struct sk_buff *skb, *tail_skb = head_skb; > + int budget = 8; /* Arbitrary, but conservatively choosen limit */ > + > + while (bytelimit > 0 && --budget > 0) { > + /* For now, don't bulk dequeue GSO (or GSO segmented) pkts */ Hmm... this is a serious limitation. 
> + if (tail_skb->next || skb_is_gso(tail_skb)) > + break; > + > + skb = q->dequeue(q); > + if (!skb) > + break; > + > + bytelimit -= skb->len; /* covers GSO len */ Not really, use qdisc_pkt_len(skb) instead, to have a better byte count. > + skb = validate_xmit_skb(skb, qdisc_dev(q)); > + if (!skb) > + break; > + > + /* "skb" can be a skb list after validate call above > + * (GSO segmented), but it is okay to append it to > + * current tail_skb->next, because next round will exit > + * in-case "tail_skb->next" is a skb list. > + */ It would be trivial to change validate_xmit_skb() to return the head and tail of the chain. Walking the chain would hit hot cache lines, but it is better not having to walk it. > + tail_skb->next = skb; > + tail_skb = skb; So that here we do : tail_skb = tail; > + } > + > + return head_skb; > +} > + > +/* Note that dequeue_skb can possibly return a SKB list (via skb->next). > + * A requeued skb (via q->gso_skb) can also be a SKB list. > + */ > static inline struct sk_buff *dequeue_skb(struct Qdisc *q) > { > struct sk_buff *skb = q->gso_skb; > @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q) > } else > skb = NULL; > } else { > - if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) { > + if (!(q->flags & TCQ_F_ONETXQUEUE) || > + !netif_xmit_frozen_or_stopped(txq)) { > + int bytelimit = qdisc_avail_bulklimit(txq); > + > skb = q->dequeue(q); > - if (skb) > + if (skb) { > + bytelimit -= skb->len; qdisc_pkt_len(skb) > skb = validate_xmit_skb(skb, qdisc_dev(q)); > + } > + if (skb && qdisc_may_bulk(q)) > + skb = try_bulk_dequeue_skb(q, skb, bytelimit); > } > } > It looks good, but we really need to take care of TSO packets. pktgen is nice, but do not represent the majority of the traffic we send from high performance host where we want this bulk dequeue thing ;) Thanks guys ! ^ permalink raw reply [flat|nested] 31+ messages in thread
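Eric's head/tail suggestion is worth spelling out. A sketch with a toy node type (illustrative only; the real change would be to validate_xmit_skb()'s return convention): if the callee hands back both ends of the chain, the caller can append it in O(1) instead of walking skb->next to find the end.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for sk_buff: only the ->next chaining matters here. */
struct node {
	struct node *next;
	int id;
};

/* Append the chain [head..tail] after list_tail and return the new
 * list tail: the one-line version of "tail_skb = tail" from the
 * review, with no walk over the (possibly GSO-segmented) chain. */
static struct node *append_chain(struct node *list_tail,
				 struct node *head, struct node *tail)
{
	list_tail->next = head;		/* tail_skb->next = skb */
	return tail;			/* tail_skb = tail      */
}
```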
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 17:23 ` Eric Dumazet @ 2014-09-24 17:58 ` Jesper Dangaard Brouer 2014-09-24 18:34 ` Tom Herbert 2014-09-24 22:13 ` Jamal Hadi Salim 0 siblings, 2 replies; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-24 17:58 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, therbert, David S. Miller, Alexander Duyck, toke, Florian Westphal, jhs, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer On Wed, 24 Sep 2014 10:23:15 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Wed, 2014-09-24 at 18:12 +0200, Jesper Dangaard Brouer wrote: > > Based on DaveM's recent API work on dev_hard_start_xmit(), that allows > > sending/processing an entire skb list. > > > > This patch implements qdisc bulk dequeue, by allowing multiple packets > > to be dequeued in dequeue_skb(). > > > > The optimization principle for this is two fold, (1) to amortize > > locking cost and (2) avoid expensive tailptr update for notifying HW. > > (1) Several packets are dequeued while holding the qdisc root_lock, > > amortizing locking cost over several packet. The dequeued SKB list is > > processed under the TXQ lock in dev_hard_start_xmit(), thus also > > amortizing the cost of the TXQ lock. > > (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more > > API to delay HW tailptr update, which also reduces the cost per > > packet. > > > > One restriction of the new API is that every SKB must belong to the > > same TXQ. This patch takes the easy way out, by restricting bulk > > dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the > > qdisc only have attached a single TXQ. > > > > Some detail about the flow; dev_hard_start_xmit() will process the skb > > list, and transmit packets individually towards the driver (see > > xmit_one()). 
In case the driver stops midway in the list, the > > remaining skb list is returned by dev_hard_start_xmit(). In > > sch_direct_xmit() this returned list is requeued by dev_requeue_skb(). > > > > To avoid overshooting the HW limits, which results in requeuing, the > > patch limits the amount of bytes dequeued, based on the drivers BQL > > limits. In-effect bulking will only happen for BQL enabled drivers. > > Besides the bytelimit from BQL, also limit bulking to maximum 8 > > packets to avoid any issues with available descriptor in HW. > > > > For now, as a conservative approach, don't bulk dequeue GSO and > > segmented GSO packets, because testing showed requeuing occuring > > with segmented GSO packets. > > > > Jointed work with Hannes, Daniel and Florian. > > > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> > > Signed-off-by: Daniel Borkmann <dborkman@redhat.com> > > Signed-off-by: Florian Westphal <fw@strlen.de> > > > > --- > > V4: > > - Patch rewritten in the Red Hat Neuchatel office jointed work with > > Hannes, Daniel and Florian. > > - Conservative approach of only using on BQL enabled drivers > > - No user tunable parameter, but limit bulking to 8 packets. > > - For now, avoid bulking GSO packets packets. > > [...] > > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > > index 19696eb..6fba089 100644 > > --- a/net/sched/sch_generic.c > > +++ b/net/sched/sch_generic.c > > @@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q) > > return 0; > > } > > > > +static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q, > > + struct sk_buff *head_skb, > > + int bytelimit) > > +{ > > + struct sk_buff *skb, *tail_skb = head_skb; > > + int budget = 8; /* Arbitrary, but conservatively choosen limit */ > > + > > + while (bytelimit > 0 && --budget > 0) { > > + /* For now, don't bulk dequeue GSO (or GSO segmented) pkts */ > > Hmm... 
this is a serious limitation. We plan to remove this at a later point, this is just to be conservative. > > + if (tail_skb->next || skb_is_gso(tail_skb)) > > + break; > > + > > + skb = q->dequeue(q); > > + if (!skb) > > + break; > > + > > + bytelimit -= skb->len; /* covers GSO len */ > > Not really, use qdisc_pkt_len(skb) instead, to have a better byte count. Understood, will update patch tomorrow. > > + skb = validate_xmit_skb(skb, qdisc_dev(q)); > > + if (!skb) > > + break; > > + > > + /* "skb" can be a skb list after validate call above > > + * (GSO segmented), but it is okay to append it to > > + * current tail_skb->next, because next round will exit > > + * in-case "tail_skb->next" is a skb list. > > + */ > > It would be trivial to change validate_xmit_skb() to return the head and > tail of the chain. Walking the chain would hit hot cache lines, but it > is better not having to walk it. Yes, we could do this when we later add support for GSO segmented packets. > > + tail_skb->next = skb; > > + tail_skb = skb; > > So that here we do : tail_skb = tail; > > > + } > > + > > + return head_skb; > > +} > > + > > +/* Note that dequeue_skb can possibly return a SKB list (via skb->next). > > + * A requeued skb (via q->gso_skb) can also be a SKB list. > > + */ > > static inline struct sk_buff *dequeue_skb(struct Qdisc *q) > > { > > struct sk_buff *skb = q->gso_skb; > > @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q) > > } else > > skb = NULL; > > } else { > > - if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) { > > + if (!(q->flags & TCQ_F_ONETXQUEUE) || > > + !netif_xmit_frozen_or_stopped(txq)) { > > + int bytelimit = qdisc_avail_bulklimit(txq); > > + > > skb = q->dequeue(q); > > - if (skb) > > + if (skb) { > > + bytelimit -= skb->len; > > qdisc_pkt_len(skb) Okay, will update patch tomorrow. 
> > skb = validate_xmit_skb(skb, qdisc_dev(q)); > > + } > > + if (skb && qdisc_may_bulk(q)) > > + skb = try_bulk_dequeue_skb(q, skb, bytelimit); > > } > > } > > > > It looks good, but we really need to take care of TSO packets. As notes above, TSO packets are planned as a future feature. Lets first see if this works well, without introducing any HoL blocking issues or other regressions. > pktgen is nice, but do not represent the majority of the traffic we send > from high performance host where we want this bulk dequeue thing ;) This patch is actually targetted towards more normal use-cases. Pktgen cannot even use this work, as it bypass the qdisc layer... -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 17:58 ` Jesper Dangaard Brouer @ 2014-09-24 18:34 ` Tom Herbert 2014-09-24 19:22 ` Eric Dumazet 2014-09-24 22:13 ` Jamal Hadi Salim 1 sibling, 1 reply; 31+ messages in thread From: Tom Herbert @ 2014-09-24 18:34 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, Sep 24, 2014 at 10:58 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Wed, 24 Sep 2014 10:23:15 -0700 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> On Wed, 2014-09-24 at 18:12 +0200, Jesper Dangaard Brouer wrote: >> > Based on DaveM's recent API work on dev_hard_start_xmit(), that allows >> > sending/processing an entire skb list. >> > >> > This patch implements qdisc bulk dequeue, by allowing multiple packets >> > to be dequeued in dequeue_skb(). >> > >> > The optimization principle for this is two fold, (1) to amortize >> > locking cost and (2) avoid expensive tailptr update for notifying HW. >> > (1) Several packets are dequeued while holding the qdisc root_lock, >> > amortizing locking cost over several packet. The dequeued SKB list is >> > processed under the TXQ lock in dev_hard_start_xmit(), thus also >> > amortizing the cost of the TXQ lock. >> > (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more >> > API to delay HW tailptr update, which also reduces the cost per >> > packet. >> > >> > One restriction of the new API is that every SKB must belong to the >> > same TXQ. This patch takes the easy way out, by restricting bulk >> > dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the >> > qdisc only have attached a single TXQ. 
>> > >> > Some detail about the flow; dev_hard_start_xmit() will process the skb >> > list, and transmit packets individually towards the driver (see >> > xmit_one()). In case the driver stops midway in the list, the >> > remaining skb list is returned by dev_hard_start_xmit(). In >> > sch_direct_xmit() this returned list is requeued by dev_requeue_skb(). >> > >> > To avoid overshooting the HW limits, which results in requeuing, the >> > patch limits the amount of bytes dequeued, based on the drivers BQL >> > limits. In-effect bulking will only happen for BQL enabled drivers. >> > Besides the bytelimit from BQL, also limit bulking to maximum 8 >> > packets to avoid any issues with available descriptor in HW. >> > >> > For now, as a conservative approach, don't bulk dequeue GSO and >> > segmented GSO packets, because testing showed requeuing occuring >> > with segmented GSO packets. >> > >> > Jointed work with Hannes, Daniel and Florian. >> > >> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> >> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> >> > Signed-off-by: Daniel Borkmann <dborkman@redhat.com> >> > Signed-off-by: Florian Westphal <fw@strlen.de> >> > >> > --- >> > V4: >> > - Patch rewritten in the Red Hat Neuchatel office jointed work with >> > Hannes, Daniel and Florian. >> > - Conservative approach of only using on BQL enabled drivers >> > - No user tunable parameter, but limit bulking to 8 packets. >> > - For now, avoid bulking GSO packets packets. >> > > [...] 
>> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c >> > index 19696eb..6fba089 100644 >> > --- a/net/sched/sch_generic.c >> > +++ b/net/sched/sch_generic.c >> > @@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q) >> > return 0; >> > } >> > >> > +static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q, >> > + struct sk_buff *head_skb, >> > + int bytelimit) >> > +{ >> > + struct sk_buff *skb, *tail_skb = head_skb; >> > + int budget = 8; /* Arbitrary, but conservatively choosen limit */ >> > + >> > + while (bytelimit > 0 && --budget > 0) { >> > + /* For now, don't bulk dequeue GSO (or GSO segmented) pkts */ >> >> Hmm... this is a serious limitation. > > We plan to remove this at a later point, this is just to be conservative. > > >> > + if (tail_skb->next || skb_is_gso(tail_skb)) >> > + break; >> > + >> > + skb = q->dequeue(q); >> > + if (!skb) >> > + break; >> > + >> > + bytelimit -= skb->len; /* covers GSO len */ >> >> Not really, use qdisc_pkt_len(skb) instead, to have a better byte count. > > Understood, will update patch tomorrow. > I believe drivers typically use skb->len for BQL tracking. Since bytelimit is based on BQL here, it might be more correct to use skb->len. > >> > + skb = validate_xmit_skb(skb, qdisc_dev(q)); >> > + if (!skb) >> > + break; >> > + >> > + /* "skb" can be a skb list after validate call above >> > + * (GSO segmented), but it is okay to append it to >> > + * current tail_skb->next, because next round will exit >> > + * in-case "tail_skb->next" is a skb list. >> > + */ >> >> It would be trivial to change validate_xmit_skb() to return the head and >> tail of the chain. Walking the chain would hit hot cache lines, but it >> is better not having to walk it. > > Yes, we could do this when we later add support for GSO segmented packets. 
> > >> > + tail_skb->next = skb; >> > + tail_skb = skb; >> >> So that here we do : tail_skb = tail; >> >> > + } >> > + >> > + return head_skb; >> > +} >> > + >> > +/* Note that dequeue_skb can possibly return a SKB list (via skb->next). >> > + * A requeued skb (via q->gso_skb) can also be a SKB list. >> > + */ >> > static inline struct sk_buff *dequeue_skb(struct Qdisc *q) >> > { >> > struct sk_buff *skb = q->gso_skb; >> > @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q) >> > } else >> > skb = NULL; >> > } else { >> > - if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) { >> > + if (!(q->flags & TCQ_F_ONETXQUEUE) || >> > + !netif_xmit_frozen_or_stopped(txq)) { >> > + int bytelimit = qdisc_avail_bulklimit(txq); >> > + >> > skb = q->dequeue(q); >> > - if (skb) >> > + if (skb) { >> > + bytelimit -= skb->len; >> >> qdisc_pkt_len(skb) > > Okay, will update patch tomorrow. > >> > skb = validate_xmit_skb(skb, qdisc_dev(q)); >> > + } >> > + if (skb && qdisc_may_bulk(q)) >> > + skb = try_bulk_dequeue_skb(q, skb, bytelimit); >> > } >> > } >> > >> >> It looks good, but we really need to take care of TSO packets. > > As notes above, TSO packets are planned as a future feature. Lets > first see if this works well, without introducing any HoL blocking > issues or other regressions. > > >> pktgen is nice, but do not represent the majority of the traffic we send >> from high performance host where we want this bulk dequeue thing ;) > > This patch is actually targetted towards more normal use-cases. > Pktgen cannot even use this work, as it bypass the qdisc layer... > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Sr. Network Kernel Developer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 18:34 ` Tom Herbert @ 2014-09-24 19:22 ` Eric Dumazet 2014-09-25 2:12 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2014-09-24 19:22 UTC (permalink / raw) To: Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, 2014-09-24 at 11:34 -0700, Tom Herbert wrote: > > > I believe drivers typically use skb->len for BQL tracking. Since > bytelimit is based on BQL here, it might be more correct to use > skb->len. That is only because drivers don't have access to qdisc_pkt_len(skb), as: - We can call them without any qdisc traversal - We have escaped from the qdisc layer, and GSO packets do not have the field set And drivers actually update BQL inflight, while bulk dequeue does not update it (it's a local copy) I do not particularly care, as BQL does not have to be 100% precise. ^ permalink raw reply [flat|nested] 31+ messages in thread
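To make the skb->len versus post-segmentation byte-count distinction concrete, here is a rough back-of-envelope model (all numbers assumed for illustration: 54-byte Ethernet+IPv4+TCP headers, 1448-byte MSS) of how much a GSO super-packet's wire footprint exceeds what a single header's worth of skb->len accounting would suggest:

```c
#include <assert.h>

/* Wire bytes for a GSO packet after segmentation: every segment
 * repeats the protocol headers, so the total exceeds the single
 * header counted before segmentation. Illustrative model only. */
static int gso_wire_bytes(int payload, int mss, int hdr)
{
	int segs = (payload + mss - 1) / mss;	/* ceil(payload / mss) */

	return payload + segs * hdr;
}
```

For a 64240-byte payload this gives 45 segments and 66670 bytes on the wire, versus 64294 bytes counted with a single header: about 2.3 KB of replicated headers that per-superpacket length accounting never sees.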
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 19:22 ` Eric Dumazet @ 2014-09-25 2:12 ` Eric Dumazet 2014-09-25 2:38 ` Tom Herbert 2014-09-25 23:40 ` Eric Dumazet 0 siblings, 2 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-25 2:12 UTC (permalink / raw) To: Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, 2014-09-24 at 12:22 -0700, Eric Dumazet wrote: > On Wed, 2014-09-24 at 11:34 -0700, Tom Herbert wrote: > > > > > I believe drivers typically use skb->len for BQL tracking. Since > > bytelimit is based on BQL here, it might be more correct to use > > skb->len. Speaking of BQL, I wonder if we now should try to not wakeup queues as soon some room was made, and instead have a 50% threshold ? This would probably increase probability to have bulk dequeues ;) diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h index 5621547d631b..c0be7ff5ae97 100644 --- a/include/linux/dynamic_queue_limits.h +++ b/include/linux/dynamic_queue_limits.h @@ -83,6 +83,13 @@ static inline int dql_avail(const struct dql *dql) return dql->adj_limit - dql->num_queued; } +/* Returns true if a queue occupancy is less than half capacity */ +static inline bool dql_half_avail(const struct dql *dql) +{ + return dql->adj_limit >= (dql->num_queued << 1); +} + + /* Record number of completed objects and recalculate the limit. 
*/ void dql_completed(struct dql *dql, unsigned int count); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index c8e388e5fccc..1f7541284b32 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2413,7 +2413,7 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue, smp_mb(); /* check again in case another CPU has just made room avail */ - if (unlikely(dql_avail(&dev_queue->dql) >= 0)) + if (unlikely(dql_half_avail(&dev_queue->dql))) clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state); #endif } @@ -2448,7 +2448,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue, */ smp_mb(); - if (dql_avail(&dev_queue->dql) < 0) + if (!dql_half_avail(&dev_queue->dql)) return; if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state)) ^ permalink raw reply related [flat|nested] 31+ messages in thread
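A self-contained restatement of the threshold test in Eric's sketch, with a minimal struct carrying just the two fields the helper reads (field types simplified relative to the kernel's struct dql):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's struct dql. */
struct dql {
	unsigned int num_queued;	/* bytes queued to the device */
	unsigned int adj_limit;		/* current BQL limit (+ slack) */
};

/* Same predicate as dql_half_avail() in the proposed diff: wake the
 * stack only once occupancy has dropped to half the limit or less. */
static bool dql_half_avail(const struct dql *dql)
{
	return dql->adj_limit >= (dql->num_queued << 1);
}
```

Compared with the old `dql_avail() >= 0` wakeup, this keeps the queue stopped until half the adjusted limit is free, so each wakeup has room for a larger bulk.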
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 2:12 ` Eric Dumazet @ 2014-09-25 2:38 ` Tom Herbert 2014-09-25 2:58 ` Dave Taht 2014-09-25 23:40 ` Eric Dumazet 1 sibling, 1 reply; 31+ messages in thread From: Tom Herbert @ 2014-09-25 2:38 UTC (permalink / raw) To: Eric Dumazet Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, Sep 24, 2014 at 7:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Wed, 2014-09-24 at 12:22 -0700, Eric Dumazet wrote: >> On Wed, 2014-09-24 at 11:34 -0700, Tom Herbert wrote: >> > > >> > I believe drivers typically use skb->len for BQL tracking. Since >> > bytelimit is based on BQL here, it might be more correct to use >> > skb->len. > > Speaking of BQL, I wonder if we now should try to not wakeup queues as > soon some room was made, and instead have a 50% threshold ? > > This would probably increase probability to have bulk dequeues ;) > It would be good to have data on that. In the absence of TSO, I've seen BQL limits at around 30K for "standard" interrupt rates on 10G. This should mean that ~15K becomes available every interrupt period (the math is actually straightforward), so that should already give 10-packet batches, which isn't bad! It's also probably true that we can trade off batching for latency in many ways: more batching increases latency, less batching reduces it. For instance, the interrupt rate can be modulated to balance between latency and batching (CPU utilization). 
Tom > > diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h > index 5621547d631b..c0be7ff5ae97 100644 > --- a/include/linux/dynamic_queue_limits.h > +++ b/include/linux/dynamic_queue_limits.h > @@ -83,6 +83,13 @@ static inline int dql_avail(const struct dql *dql) > return dql->adj_limit - dql->num_queued; > } > > +/* Returns true if a queue occupancy is less than half capacity */ > +static inline bool dql_half_avail(const struct dql *dql) > +{ > + return dql->adj_limit >= (dql->num_queued << 1); > +} > + > + > /* Record number of completed objects and recalculate the limit. */ > void dql_completed(struct dql *dql, unsigned int count); > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index c8e388e5fccc..1f7541284b32 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -2413,7 +2413,7 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue, > smp_mb(); > > /* check again in case another CPU has just made room avail */ > - if (unlikely(dql_avail(&dev_queue->dql) >= 0)) > + if (unlikely(dql_half_avail(&dev_queue->dql))) > clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state); > #endif > } > @@ -2448,7 +2448,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue, > */ > smp_mb(); > > - if (dql_avail(&dev_queue->dql) < 0) > + if (!dql_half_avail(&dev_queue->dql)) > return; > > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state)) > > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 2:38 ` Tom Herbert @ 2014-09-25 2:58 ` Dave Taht 2014-09-26 18:06 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Dave Taht @ 2014-09-25 2:58 UTC (permalink / raw) To: Tom Herbert Cc: Eric Dumazet, Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, Sep 24, 2014 at 7:38 PM, Tom Herbert <therbert@google.com> wrote: > On Wed, Sep 24, 2014 at 7:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> On Wed, 2014-09-24 at 12:22 -0700, Eric Dumazet wrote: >>> On Wed, 2014-09-24 at 11:34 -0700, Tom Herbert wrote: >>> > > >>> > I believe drivers typically use skb->len for BQL tracking. Since >>> > bytelimit is based on BQL here, it might be more correct to use >>> > skb->len. >> >> Speaking of BQL, I wonder if we now should try to not wakeup queues as >> soon some room was made, and instead have a 50% threshold ? >> >> This would probably increase probability to have bulk dequeues ;) >> > It would be good to have data on that. +1. Yes, I'd be very carefully observing BQL's behavior as these changes are made. netperf-wrapper could rather easily sample bql's limit on as low as 10ms interval I think.... and... me being me, I volunteer to watch 100Mbit and down. > In the absence of TSO, I've > seen BQL limits at around 30K for "standard" interrupt rates on 10G. It is typically 3k at 100Mbit, 1500 at 10Mbit. It thus usually goes overlimit by a MTU. I have simple tons of data on this on tons of kernels for various platforms. > This should mean that ~15K becomes available every interrupt period > (the math is actually straightforward), so that should already have 10 > packet batches which isn't bad! 
> > It's also probably true that we can tradeoff batching for latency in > many ways-- more batching increase latency, less batching helps > latency. For instance, the interrupt rate can be modulated to balance > between latency and batching (CPU utilization). I have long hoped that the actual BQL limit in play would feed into TCP small queues when there are a lot of flows to make each tcp "small" queue gradually smaller... > Tom > >> >> diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h >> index 5621547d631b..c0be7ff5ae97 100644 >> --- a/include/linux/dynamic_queue_limits.h >> +++ b/include/linux/dynamic_queue_limits.h >> @@ -83,6 +83,13 @@ static inline int dql_avail(const struct dql *dql) >> return dql->adj_limit - dql->num_queued; >> } >> >> +/* Returns true if a queue occupancy is less than half capacity */ >> +static inline bool dql_half_avail(const struct dql *dql) >> +{ >> + return dql->adj_limit >= (dql->num_queued << 1); >> +} >> + >> + >> /* Record number of completed objects and recalculate the limit. 
*/ >> void dql_completed(struct dql *dql, unsigned int count); >> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h >> index c8e388e5fccc..1f7541284b32 100644 >> --- a/include/linux/netdevice.h >> +++ b/include/linux/netdevice.h >> @@ -2413,7 +2413,7 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue, >> smp_mb(); >> >> /* check again in case another CPU has just made room avail */ >> - if (unlikely(dql_avail(&dev_queue->dql) >= 0)) >> + if (unlikely(dql_half_avail(&dev_queue->dql))) >> clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state); >> #endif >> } >> @@ -2448,7 +2448,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue, >> */ >> smp_mb(); >> >> - if (dql_avail(&dev_queue->dql) < 0) >> + if (!dql_half_avail(&dev_queue->dql)) >> return; >> >> if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state)) >> >> -- Dave Täht https://www.bufferbloat.net/projects/make-wifi-fast ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 2:58 ` Dave Taht @ 2014-09-26 18:06 ` Eric Dumazet 0 siblings, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-26 18:06 UTC (permalink / raw) To: Dave Taht Cc: Tom Herbert, Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, 2014-09-24 at 19:58 -0700, Dave Taht wrote: > I have long hoped that the actual BQL limit in play would feed into > TCP small queues when there are a lot of flows to make each tcp > "small" queue gradually smaller... Yep, but what do you do if you have 64 TX queues, each one having different BQL limits ? This looks like a NP problem when you also have millions of tcp flows. Basically BQL limit is more a sign of local host being able to drain TX completions fast or not. Here at Google, we have a very tiny queue of at most 2 packets per flow, and srtt being in usec units gives us a very dynamic signal, based on end to end measures. I am experimenting moving TSQ processing from tasklet to workqueue, because we need faster response to {soft}irq when we so often enter TCP stack. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 2:12 ` Eric Dumazet 2014-09-25 2:38 ` Tom Herbert @ 2014-09-25 23:40 ` Eric Dumazet 2014-09-26 6:04 ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet 2014-09-26 9:23 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE David Laight 1 sibling, 2 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-25 23:40 UTC (permalink / raw) To: Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Wed, 2014-09-24 at 19:12 -0700, Eric Dumazet wrote: > On Wed, 2014-09-24 at 12:22 -0700, Eric Dumazet wrote: > > On Wed, 2014-09-24 at 11:34 -0700, Tom Herbert wrote: > > > > > > > I believe drivers typically use skb->len for BQL tracking. Since > > > bytelimit is based on BQL here, it might be more correct to use > > > skb->len. > > Speaking of BQL, I wonder if we now should try to not wakeup queues as > soon some room was made, and instead have a 50% threshold ? > > This would probably increase probability to have bulk dequeues ;) It turned out the problem I noticed was caused by compiler trying to be smart, but involving a bad MESI transaction. 0.05 │ mov 0xc0(%rax),%edi // LOAD dql->num_queued 0.48 │ mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count 58.23 │ add %edx,%edi 0.58 │ cmp %edi,0xc4(%rax) 0.76 │ mov %edi,0xc0(%rax) // STORE dql->num_queued += count 0.72 │ js bd8 I get an incredible 10 % gain by making sure cpu wont get the cache line in Shared mode. 
(I also tried a barrier() in netdev_tx_sent_queue() between the

	dql_queued(&dev_queue->dql, bytes);
-
+	barrier();
	if (likely(dql_avail(&dev_queue->dql) >= 0))

But following patch seems cleaner

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index 5621547d631b..978fbe332090 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -80,7 +80,7 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
 /* Returns how many objects can be queued, < 0 indicates over limit. */
 static inline int dql_avail(const struct dql *dql)
 {
-	return dql->adj_limit - dql->num_queued;
+	return ACCESS_ONCE(dql->adj_limit) - ACCESS_ONCE(dql->num_queued);
 }

^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions 2014-09-25 23:40 ` Eric Dumazet @ 2014-09-26 6:04 ` Eric Dumazet 2014-09-26 8:47 ` Jesper Dangaard Brouer ` (2 more replies) 2014-09-26 9:23 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE David Laight 1 sibling, 3 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-26 6:04 UTC (permalink / raw) To: Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa From: Eric Dumazet <edumazet@google.com> While doing high throughput test on a BQL enabled NIC, I found a very high cost in ndo_start_xmit() when accessing BQL data. It turned out the problem was caused by compiler trying to be smart, but involving a bad MESI transaction : 0.05 │ mov 0xc0(%rax),%edi // LOAD dql->num_queued 0.48 │ mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count 58.23 │ add %edx,%edi 0.58 │ cmp %edi,0xc4(%rax) 0.76 │ mov %edi,0xc0(%rax) // STORE dql->num_queued += count 0.72 │ js bd8 I got an incredible 10 % gain [1] by making sure cpu do not attempt to get the cache line in Shared mode, but directly requests for ownership. New code : mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count add %edx,0xc0(%rax) // RMW dql->num_queued += count mov 0xc4(%rax),%ecx // LOAD dql->adj_limit mov 0xc0(%rax),%edx // LOAD dql->num_queued cmp %edx,%ecx The TX completion was running from another cpu, with high interrupts rate. Note that I am using barrier() as a soft hint, as mb() here could be too heavy cost. [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled. 
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/dynamic_queue_limits.h | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index 5621547d631b..a4be70398ce1 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -73,14 +73,22 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
 {
 	BUG_ON(count > DQL_MAX_OBJECT);

-	dql->num_queued += count;
 	dql->last_obj_cnt = count;
+
+	/* We want to force a write first, so that cpu do not attempt
+	 * to get cache line containing last_obj_cnt, num_queued, adj_limit
+	 * in Shared state, but directly does a Request For Ownership
+	 * It is only a hint, we use barrier() only.
+	 */
+	barrier();
+
+	dql->num_queued += count;
 }

 /* Returns how many objects can be queued, < 0 indicates over limit. */
 static inline int dql_avail(const struct dql *dql)
 {
-	return dql->adj_limit - dql->num_queued;
+	return ACCESS_ONCE(dql->adj_limit) - ACCESS_ONCE(dql->num_queued);
 }

 /* Record number of completed objects and recalculate the limit. */

^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions 2014-09-26 6:04 ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet @ 2014-09-26 8:47 ` Jesper Dangaard Brouer 2014-09-26 11:06 ` Hannes Frederic Sowa 2014-09-28 21:43 ` David Miller 2 siblings, 0 replies; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-26 8:47 UTC (permalink / raw) To: Eric Dumazet Cc: Tom Herbert, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer On Thu, 25 Sep 2014 23:04:56 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote: [...] > > It turned out the problem was caused by compiler trying to be > smart, but involving a bad MESI transaction : > [...] > > I got an incredible 10 % gain [1] by making sure cpu do not attempt > to get the cache line in Shared mode, but directly requests for > ownership. [...] > > Signed-off-by: Eric Dumazet <edumazet@google.com> > --- I'm very impressed - thank you Eric for finding this!!! Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions 2014-09-26 6:04 ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet 2014-09-26 8:47 ` Jesper Dangaard Brouer @ 2014-09-26 11:06 ` Hannes Frederic Sowa 2014-09-26 12:02 ` Eric Dumazet 2014-09-28 21:43 ` David Miller 2 siblings, 1 reply; 31+ messages in thread From: Hannes Frederic Sowa @ 2014-09-26 11:06 UTC (permalink / raw) To: Eric Dumazet, Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann On Fri, Sep 26, 2014, at 08:04, Eric Dumazet wrote: > From: Eric Dumazet <edumazet@google.com> > > While doing high throughput test on a BQL enabled NIC, > I found a very high cost in ndo_start_xmit() when accessing BQL data. > > It turned out the problem was caused by compiler trying to be > smart, but involving a bad MESI transaction : > > 0.05 │ mov 0xc0(%rax),%edi // LOAD dql->num_queued > 0.48 │ mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > 58.23 │ add %edx,%edi > 0.58 │ cmp %edi,0xc4(%rax) > 0.76 │ mov %edi,0xc0(%rax) // STORE dql->num_queued += count > 0.72 │ js bd8 > > > I got an incredible 10 % gain [1] by making sure cpu do not attempt > to get the cache line in Shared mode, but directly requests for > ownership. > > New code : > mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > add %edx,0xc0(%rax) // RMW dql->num_queued += count > mov 0xc4(%rax),%ecx // LOAD dql->adj_limit > mov 0xc0(%rax),%edx // LOAD dql->num_queued > cmp %edx,%ecx > > The TX completion was running from another cpu, with high interrupts > rate. > > Note that I am using barrier() as a soft hint, as mb() here could be > too heavy cost. > > [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled. 
> > Signed-off-by: Eric Dumazet <edumazet@google.com> > --- > include/linux/dynamic_queue_limits.h | 12 ++++++++++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/include/linux/dynamic_queue_limits.h > b/include/linux/dynamic_queue_limits.h > index 5621547d631b..a4be70398ce1 100644 > --- a/include/linux/dynamic_queue_limits.h > +++ b/include/linux/dynamic_queue_limits.h > @@ -73,14 +73,22 @@ static inline void dql_queued(struct dql *dql, > unsigned int count) > { > BUG_ON(count > DQL_MAX_OBJECT); > > - dql->num_queued += count; > dql->last_obj_cnt = count; > + > + /* We want to force a write first, so that cpu do not attempt > + * to get cache line containing last_obj_cnt, num_queued, > adj_limit > + * in Shared state, but directly does a Request For Ownership > + * It is only a hint, we use barrier() only. > + */ > + barrier(); > + > + dql->num_queued += count; > } I thought that prefetchw() would be the canonical way to solve write stalls in CPUs and prepare the specific cache line to be written to. Interesting, thanks Eric. Thanks, Hannes ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions 2014-09-26 11:06 ` Hannes Frederic Sowa @ 2014-09-26 12:02 ` Eric Dumazet 0 siblings, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-26 12:02 UTC (permalink / raw) To: Hannes Frederic Sowa Cc: Tom Herbert, Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann On Fri, 2014-09-26 at 13:06 +0200, Hannes Frederic Sowa wrote: > I thought that prefetchw() would be the canonical way to solve write > stalls in CPUs and prepare the specific cache line to be written to. > Interesting, thanks Eric. True, and I have a tcp_ack() (tcp_clean_rtx_queue() more exactly) patch that indeed uses prefetchw() to bring tcp_shinfo() in cache before the tcp_unlink_write_queue(skb, sk); Performance went from 570000 packets per second to 820000 packets per second on a TSO=off GSO=off workload. But this applies after the https://patchwork.ozlabs.org/patch/392877/ patch (tcp: change tcp_skb_pcount() location ) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions 2014-09-26 6:04 ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet 2014-09-26 8:47 ` Jesper Dangaard Brouer 2014-09-26 11:06 ` Hannes Frederic Sowa @ 2014-09-28 21:43 ` David Miller 2 siblings, 0 replies; 31+ messages in thread From: David Miller @ 2014-09-28 21:43 UTC (permalink / raw) To: eric.dumazet Cc: therbert, brouer, netdev, alexander.h.duyck, toke, fw, jhs, dave.taht, john.r.fastabend, dborkman, hannes From: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu, 25 Sep 2014 23:04:56 -0700 > From: Eric Dumazet <edumazet@google.com> > > While doing high throughput test on a BQL enabled NIC, > I found a very high cost in ndo_start_xmit() when accessing BQL data. > > It turned out the problem was caused by compiler trying to be > smart, but involving a bad MESI transaction : > > 0.05 │ mov 0xc0(%rax),%edi // LOAD dql->num_queued > 0.48 │ mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > 58.23 │ add %edx,%edi > 0.58 │ cmp %edi,0xc4(%rax) > 0.76 │ mov %edi,0xc0(%rax) // STORE dql->num_queued += count > 0.72 │ js bd8 > > > I got an incredible 10 % gain [1] by making sure cpu do not attempt > to get the cache line in Shared mode, but directly requests for > ownership. > > New code : > mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > add %edx,0xc0(%rax) // RMW dql->num_queued += count > mov 0xc4(%rax),%ecx // LOAD dql->adj_limit > mov 0xc0(%rax),%edx // LOAD dql->num_queued > cmp %edx,%ecx > > The TX completion was running from another cpu, with high interrupts > rate. > > Note that I am using barrier() as a soft hint, as mb() here could be > too heavy cost. > > [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled. > > Signed-off-by: Eric Dumazet <edumazet@google.com> Ok now you're just showing off :-) Applied, thanks! ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 23:40 ` Eric Dumazet 2014-09-26 6:04 ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet @ 2014-09-26 9:23 ` David Laight 2014-09-26 13:16 ` David Laight 1 sibling, 1 reply; 31+ messages in thread From: David Laight @ 2014-09-26 9:23 UTC (permalink / raw) To: 'Eric Dumazet', Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa From: Eric Dumazet > On Wed, 2014-09-24 at 19:12 -0700, Eric Dumazet wrote: ... > It turned out the problem I noticed was caused by compiler trying to be > smart, but involving a bad MESI transaction. > > 0.05 mov 0xc0(%rax),%edi // LOAD dql->num_queued > 0.48 mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > 58.23 add %edx,%edi > 0.58 cmp %edi,0xc4(%rax) > 0.76 mov %edi,0xc0(%rax) // STORE dql->num_queued += count > 0.72 js bd8 > > > I get an incredible 10 % gain by making sure cpu wont get the cache line > in Shared mode. That is a stunning difference between requesting 'exclusive' access and upgrading 'shared' to exclusive. Stinks of a cpu bug? Or is the reported stall a side effect of waiting for the earlier 'cache line read' to complete in order to issue the 'upgrade to exclusive'. In which case gcc's instruction scheduler probably needs to be taught to schedule writes before reads. 
> (I also tried a barrier() in netdev_tx_sent_queue() between the > > dql_queued(&dev_queue->dql, bytes); > - > + barrier(); > if (likely(dql_avail(&dev_queue->dql) >= 0)) > > But following patch seems cleaner > > diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h > index 5621547d631b..978fbe332090 100644 > --- a/include/linux/dynamic_queue_limits.h > +++ b/include/linux/dynamic_queue_limits.h > @@ -80,7 +80,7 @@ static inline void dql_queued(struct dql *dql, unsigned int count) > /* Returns how many objects can be queued, < 0 indicates over limit. */ > static inline int dql_avail(const struct dql *dql) > { > - return dql->adj_limit - dql->num_queued; > + return ACCESS_ONCE(dql->adj_limit) - ACCESS_ONCE(dql->num_queued); Dunno, that could have an impact on other calls where the values are already in registers - I suspect ACCESS_ONCE() forces an access. David ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-26 9:23 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE David Laight @ 2014-09-26 13:16 ` David Laight 2014-09-26 13:38 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: David Laight @ 2014-09-26 13:16 UTC (permalink / raw) To: David Laight, 'Eric Dumazet', Tom Herbert Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa From: David Laight > From: Eric Dumazet > > On Wed, 2014-09-24 at 19:12 -0700, Eric Dumazet wrote: > ... > > It turned out the problem I noticed was caused by compiler trying to be > > smart, but involving a bad MESI transaction. > > > > 0.05 mov 0xc0(%rax),%edi // LOAD dql->num_queued > > 0.48 mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count > > 58.23 add %edx,%edi > > 0.58 cmp %edi,0xc4(%rax) > > 0.76 mov %edi,0xc0(%rax) // STORE dql->num_queued += count > > 0.72 js bd8 > > > > > > I get an incredible 10 % gain by making sure cpu wont get the cache line > > in Shared mode. > > That is a stunning difference between requesting 'exclusive' access > and upgrading 'shared' to exclusive. > Stinks of a cpu bug? > > Or is the reported stall a side effect of waiting for the earlier > 'cache line read' to complete in order to issue the 'upgrade to exclusive'. > In which case gcc's instruction scheduler probably needs to be taught > to schedule writes before reads. Thinking further. gcc is probably moving memory reads before writes under the assumption that the cpu might stall waiting for the read to complete but that the write can be buffered by the hardware. That assumption is true for simple cpus (like the Nios2), but for x86 with its multiple instructions in flight (etc) may make little difference if the memory is in the cache. 
OTOH it looks as though there is a big hit if you read then write a non-present cache line. (This may even depend on which instructions get executed in parallel, minor changes to the code could easily change that.) I wonder how easy it would be to modify gcc to remove (or even reverse) that memory ordering 'optimisation'. David ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-26 13:16 ` David Laight @ 2014-09-26 13:38 ` Eric Dumazet 0 siblings, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-26 13:38 UTC (permalink / raw) To: David Laight Cc: Tom Herbert, Jesper Dangaard Brouer, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Fri, 2014-09-26 at 13:16 +0000, David Laight wrote: > OTOH it looks as though there is a big hit if you read then write > a non-present cache line. Well, this is exactly what it is about. Welcome to the club. This is a well known problem. Look at following old commit : http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3f9d35b9514da6757ca98831372518f9eeb71b33 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 17:58 ` Jesper Dangaard Brouer 2014-09-24 18:34 ` Tom Herbert @ 2014-09-24 22:13 ` Jamal Hadi Salim 2014-09-25 8:25 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 31+ messages in thread From: Jamal Hadi Salim @ 2014-09-24 22:13 UTC (permalink / raw) To: Jesper Dangaard Brouer, Eric Dumazet Cc: netdev, therbert, David S. Miller, Alexander Duyck, toke, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On 09/24/14 13:58, Jesper Dangaard Brouer wrote: > On Wed, 24 Sep 2014 10:23:15 -0700 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > >> pktgen is nice, but do not represent the majority of the traffic we send >> from high performance host where we want this bulk dequeue thing ;) > > This patch is actually targetted towards more normal use-cases. > Pktgen cannot even use this work, as it bypass the qdisc layer... When you post these patches - can you please also post basic performance numbers? You dont have to show improvement if it is hard for bulking to kick in, but you need to show no harm in at least latency for the general use case (i.e not pktgen maybe forwarding activity or something sourced from tcp). cheers, jamal ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-24 22:13 ` Jamal Hadi Salim @ 2014-09-25 8:25 ` Jesper Dangaard Brouer 2014-09-25 12:48 ` Jamal Hadi Salim 2014-09-25 13:52 ` Dave Taht 0 siblings, 2 replies; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-25 8:25 UTC (permalink / raw) To: Jamal Hadi Salim Cc: Eric Dumazet, netdev, therbert, David S. Miller, Alexander Duyck, toke, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer On Wed, 24 Sep 2014 18:13:57 -0400 Jamal Hadi Salim <jhs@mojatatu.com> wrote: > On 09/24/14 13:58, Jesper Dangaard Brouer wrote: > > On Wed, 24 Sep 2014 10:23:15 -0700 > > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > > > >> pktgen is nice, but do not represent the majority of the traffic we send > >> from high performance host where we want this bulk dequeue thing ;) > > > > This patch is actually targetted towards more normal use-cases. > > Pktgen cannot even use this work, as it bypass the qdisc layer... > > When you post these patches - can you please also post basic performance > numbers? You dont have to show improvement if it is hard for bulking > to kick in, but you need to show no harm in at least latency for the > general use case (i.e not pktgen maybe forwarding activity or something > sourced from tcp). I've done measurements with netperf-wrapper: http://netoptimizer.blogspot.dk/2014/09/mini-tutorial-for-netperf-wrapper-setup.html I have already previously posted my measurements here: http://people.netfilter.org/hawk/qdisc/ http://people.netfilter.org/hawk/qdisc/measure01/ http://people.netfilter.org/hawk/qdisc/experiment01/ Please, see my previous mail where I described each graph. The above measurements is for 10Gbit/s, but I've also done measurements on 1Gbit/s driver igb, and 10Mbit/s by forcing igb to use 10Mbit/s. Those results I forgot upload (and I cannot upload them right now, as I'm currently in Switzerland). 
-- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 8:25 ` Jesper Dangaard Brouer @ 2014-09-25 12:48 ` Jamal Hadi Salim 2014-09-25 14:40 ` Tom Herbert 2014-09-25 13:52 ` Dave Taht 1 sibling, 1 reply; 31+ messages in thread From: Jamal Hadi Salim @ 2014-09-25 12:48 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Eric Dumazet, netdev, therbert, David S. Miller, Alexander Duyck, toke, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On 09/25/14 04:25, Jesper Dangaard Brouer wrote: > On Wed, 24 Sep 2014 18:13:57 -0400 > I've done measurements with netperf-wrapper: > http://netoptimizer.blogspot.dk/2014/09/mini-tutorial-for-netperf-wrapper-setup.html > > I have already previously posted my measurements here: > http://people.netfilter.org/hawk/qdisc/ > http://people.netfilter.org/hawk/qdisc/measure01/ > http://people.netfilter.org/hawk/qdisc/experiment01/ > Theres a lot of data there to digest; all good looking. I will try to pay attention to detail and get back to you. What you have on those urls is fit for a paper, but what someone like me (with ADD) needs is a summary somewhere maybe in the commit logs. > Please, see my previous mail where I described each graph. > Is this patch 0? I think a simple statement in the commit log that no cats were harmed ^W^W^W performance was affected. A paragraph at most but if you want to do more, DaveM actually has no problems if you write the details of a novel in patch 0. cheers, jamal ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 12:48 ` Jamal Hadi Salim @ 2014-09-25 14:40 ` Tom Herbert 2014-09-25 14:57 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 31+ messages in thread From: Tom Herbert @ 2014-09-25 14:40 UTC (permalink / raw) To: Jamal Hadi Salim Cc: Jesper Dangaard Brouer, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Thu, Sep 25, 2014 at 5:48 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote: > On 09/25/14 04:25, Jesper Dangaard Brouer wrote: >> >> On Wed, 24 Sep 2014 18:13:57 -0400 > > >> I've done measurements with netperf-wrapper: >> >> http://netoptimizer.blogspot.dk/2014/09/mini-tutorial-for-netperf-wrapper-setup.html >> >> I have already previously posted my measurements here: >> http://people.netfilter.org/hawk/qdisc/ >> http://people.netfilter.org/hawk/qdisc/measure01/ >> http://people.netfilter.org/hawk/qdisc/experiment01/ >> > > Theres a lot of data there to digest; all good looking. > I will try to pay attention to detail and get back to you. > What you have on those urls is fit for a paper, but what someone > like me (with ADD) needs is a summary somewhere maybe in the commit > logs. > >> Please, see my previous mail where I described each graph. >> > > Is this patch 0? > I think a simple statement in the commit log that no cats > were harmed ^W^W^W performance was affected. A paragraph > at most but if you want to do more, DaveM actually has no > problems if you write the details of a novel in patch 0. > +1 A few test results in patch 0 are good. I like to have results for with and without patch. These should show two things: 1) Any regressions caused by the patch 2) Performance gains (in that order of importance :-) ). There doesn't need to be a lot here, just something reasonably representative, simple, and should be easily reproducible.
My expectation in bulk dequeue is that we should see no obvious regression and hopefully an improvement in CPU utilization-- are you able to verify this? > cheers, > jamal ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 14:40 ` Tom Herbert @ 2014-09-25 14:57 ` Jesper Dangaard Brouer 2014-09-25 15:05 ` Tom Herbert 2014-09-25 15:12 ` Eric Dumazet 0 siblings, 2 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-25 14:57 UTC (permalink / raw)
To: Tom Herbert
Cc: Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer

On Thu, 25 Sep 2014 07:40:33 -0700
Tom Herbert <therbert@google.com> wrote:

> A few test results in patch 0 are good. I like to have results for
> with and without patch. These should two things: 1) Any regressions
> caused by the patch 2) Performance gains (in that order of importance
> :-) ). There doesn't need to be a lot here, just something reasonably
> representative, simple, and should be easily reproducible. My
> expectation in bulk dequeue is that we should see no obvious
> regression and hopefully an improvement in CPU utilization-- are you
> able to verify this?

We are saving 3% CPU, as I described in my post with subject
"qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue":
 http://thread.gmane.org/gmane.linux.network/331152/focus=331154

Using UDP_STREAM on the 1Gbit/s driver igb, I can show that the
_raw_spin_lock calls are reduced by approx 3% when enabling bulking
of just 2 packets.

This test can only demonstrate a CPU usage reduction, as the
throughput is already at maximum link (bandwidth) capacity.

Notice the netperf option "-m 1472", which makes sure we are not
sending UDP IP-fragments::

 netperf -H 192.168.111.2 -t UDP_STREAM -l 120 -- -m 1472

Results from perf diff::

 # Command: perf diff
 # Event 'cycles'
 # Baseline    Delta  Symbol
 # no-bulk   bulk(1)
 # ........  .......  .........................................
 #
     7.05%   -3.03%   [k] _raw_spin_lock
     6.34%   +0.23%   [k] copy_user_enhanced_fast_string
     6.30%   +0.26%   [k] fib_table_lookup
     3.03%   +0.01%   [k] __slab_free
     3.00%   +0.08%   [k] intel_idle
     2.49%   +0.05%   [k] sock_alloc_send_pskb
     2.31%   +0.30%   netperf [.] send_omni_inner
     2.12%   +0.12%   netperf [.] send_data
     2.11%   +0.10%   [k] udp_sendmsg
     1.96%   +0.02%   [k] __ip_append_data
     1.48%   -0.01%   [k] __alloc_skb
     1.46%   +0.07%   [k] __mkroute_output
     1.34%   +0.05%   [k] __ip_select_ident
     1.29%   +0.03%   [k] check_leaf
     1.27%   +0.09%   [k] __skb_get_hash

A nitpick is that this testing was done on V2 of the patchset.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 31+ messages in thread
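[Editorial note: for readers wanting to reproduce the comparison, the perf diff workflow sketched below is a plausible reconstruction; the profile filenames and sampling duration are assumptions, not taken from the mail.]

```shell
# Sketch: compare CPU profiles of a no-bulking run vs a bulking run.
# Start the netperf UDP_STREAM load first, then sample system-wide.

# Run 1 (bulking disabled kernel): 60s of system-wide cycle samples.
perf record -a -o perf.data.old -- sleep 60

# ... reboot into the patched (bulking-enabled) kernel, restart load ...

# Run 2 (bulking enabled).
perf record -a -o perf.data -- sleep 60

# perf diff reports per-symbol deltas, baseline first.
perf diff perf.data.old perf.data
```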
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 14:57 ` Jesper Dangaard Brouer @ 2014-09-25 15:05 ` Tom Herbert 2014-09-25 15:23 ` Jesper Dangaard Brouer 2014-09-25 15:12 ` Eric Dumazet 1 sibling, 1 reply; 31+ messages in thread From: Tom Herbert @ 2014-09-25 15:05 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Thu, Sep 25, 2014 at 7:57 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Thu, 25 Sep 2014 07:40:33 -0700 > Tom Herbert <therbert@google.com> wrote: > >> A few test results in patch 0 are good. I like to have results for >> with and without patch. These should two things: 1) Any regressions >> caused by the patch 2) Performance gains (in that order of importance >> :-) ). There doesn't need to be a lot here, just something reasonably >> representative, simple, and should be easily reproducible. My >> expectation in bulk dequeue is that we should see no obvious >> regression and hopefully an improvement in CPU utilization-- are you >> able to verify this? > > We are saving 3% CPU, as I described in my post with subject: > "qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue": > http://thread.gmane.org/gmane.linux.network/331152/focus=331154 > > Using UDP_STREAM on 1Gbit/s driver igb, I can show that the > _raw_spin_lock calls are reduced with approx 3%, when enabling > bulking of just 2 packets. > That's great. In commit log, would be good to have results with TCP_STREAM also and please report aggregate CPU utilization changes (like from mpstat). Thanks, Tom > This test can only demonstrates a CPU usage reduction, as the > throughput is already at maximum link (bandwidth) capacity. 
> > Notice netperf option "-m 1472" which makes sure we are not sending > UDP IP-fragments:: > > netperf -H 192.168.111.2 -t UDP_STREAM -l 120 -- -m 1472 > > Results from perf diff:: > > # Command: perf diff > # Event 'cycles' > # Baseline Delta Symbol > # no-bulk bulk(1) > # ........ ....... ......................................... > # > 7.05% -3.03% [k] _raw_spin_lock > 6.34% +0.23% [k] copy_user_enhanced_fast_string > 6.30% +0.26% [k] fib_table_lookup > 3.03% +0.01% [k] __slab_free > 3.00% +0.08% [k] intel_idle > 2.49% +0.05% [k] sock_alloc_send_pskb > 2.31% +0.30% netperf [.] send_omni_inner > 2.12% +0.12% netperf [.] send_data > 2.11% +0.10% [k] udp_sendmsg > 1.96% +0.02% [k] __ip_append_data > 1.48% -0.01% [k] __alloc_skb > 1.46% +0.07% [k] __mkroute_output > 1.34% +0.05% [k] __ip_select_ident > 1.29% +0.03% [k] check_leaf > 1.27% +0.09% [k] __skb_get_hash > > A nitpick is that, this testing were done on V2 of the patchset. > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Sr. Network Kernel Developer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 15:05 ` Tom Herbert @ 2014-09-25 15:23 ` Jesper Dangaard Brouer 2014-09-25 15:58 ` Tom Herbert 2014-09-29 20:14 ` Jesper Dangaard Brouer 0 siblings, 2 replies; 31+ messages in thread From: Jesper Dangaard Brouer @ 2014-09-25 15:23 UTC (permalink / raw) To: Tom Herbert Cc: Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer On Thu, 25 Sep 2014 08:05:38 -0700 Tom Herbert <therbert@google.com> wrote: > On Thu, Sep 25, 2014 at 7:57 AM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > On Thu, 25 Sep 2014 07:40:33 -0700 > > Tom Herbert <therbert@google.com> wrote: > > > >> A few test results in patch 0 are good. I like to have results for > >> with and without patch. These should two things: 1) Any regressions > >> caused by the patch 2) Performance gains (in that order of importance > >> :-) ). There doesn't need to be a lot here, just something reasonably > >> representative, simple, and should be easily reproducible. My > >> expectation in bulk dequeue is that we should see no obvious > >> regression and hopefully an improvement in CPU utilization-- are you > >> able to verify this? > > > > We are saving 3% CPU, as I described in my post with subject: > > "qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue": > > http://thread.gmane.org/gmane.linux.network/331152/focus=331154 > > > > Using UDP_STREAM on 1Gbit/s driver igb, I can show that the > > _raw_spin_lock calls are reduced with approx 3%, when enabling > > bulking of just 2 packets. > > > > That's great. In commit log, would be good to have results with > TCP_STREAM also and please report aggregate CPU utilization changes > (like from mpstat). 
TCP_STREAM is not a good test for this, because unless both TSO and
GSO are disabled, the packets will not hit the code path that this
patch changes. When we later add support for TSO and GSO bulking,
it will make sense to include TCP_STREAM testing, not before.

I will redo the tests once I get home to my testlab, as the remote lab
I'm using now is annoyingly slow at rebooting machines, since we no
longer have a runtime option to enable/disable bulking (I'm currently
in Switzerland).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 31+ messages in thread
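[Editorial note: disabling the offloads so TCP packets traverse the qdisc dequeue path one MTU-sized skb at a time can be done at runtime with ethtool; the interface name below is an assumption.]

```shell
# Sender side: turn off TSO and GSO (eth1 is an assumed interface name).
ethtool -K eth1 tso off gso off

# Verify the resulting offload settings.
ethtool -k eth1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'
```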
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 15:23 ` Jesper Dangaard Brouer @ 2014-09-25 15:58 ` Tom Herbert 2014-09-29 20:23 ` Jesper Dangaard Brouer 2014-09-29 20:14 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 31+ messages in thread From: Tom Herbert @ 2014-09-25 15:58 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Thu, Sep 25, 2014 at 8:23 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Thu, 25 Sep 2014 08:05:38 -0700 > Tom Herbert <therbert@google.com> wrote: > >> On Thu, Sep 25, 2014 at 7:57 AM, Jesper Dangaard Brouer >> <brouer@redhat.com> wrote: >> > On Thu, 25 Sep 2014 07:40:33 -0700 >> > Tom Herbert <therbert@google.com> wrote: >> > >> >> A few test results in patch 0 are good. I like to have results for >> >> with and without patch. These should two things: 1) Any regressions >> >> caused by the patch 2) Performance gains (in that order of importance >> >> :-) ). There doesn't need to be a lot here, just something reasonably >> >> representative, simple, and should be easily reproducible. My >> >> expectation in bulk dequeue is that we should see no obvious >> >> regression and hopefully an improvement in CPU utilization-- are you >> >> able to verify this? >> > >> > We are saving 3% CPU, as I described in my post with subject: >> > "qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue": >> > http://thread.gmane.org/gmane.linux.network/331152/focus=331154 >> > >> > Using UDP_STREAM on 1Gbit/s driver igb, I can show that the >> > _raw_spin_lock calls are reduced with approx 3%, when enabling >> > bulking of just 2 packets. >> > >> >> That's great. 
In commit log, would be good to have results with >> TCP_STREAM also and please report aggregate CPU utilization changes >> (like from mpstat). > > The TCP_STREAM is not a good test for this, because unless disabling > both TSO and GSO the packets will not hit the code path (that this > patch changes). When we later add support for TSO and GSO bulking, > then it will make sense to include TCP_STREAM testing, not before. > Disabling TSO and GSO is fine. I'm interested to see interactions with TCP. > I will redo the tests, once I get home to my testlab, as the remote lab > I'm using now is annoyingly slow rebooting machines, as we not longer > have a runtime option for enable/disable (I'm currently in Switzerland). > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Sr. Network Kernel Developer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 15:58 ` Tom Herbert @ 2014-09-29 20:23 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-29 20:23 UTC (permalink / raw)
To: Tom Herbert
Cc: Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa, brouer

Hi Tom,

On Thu, 25 Sep 2014 08:58:54 -0700
Tom Herbert <therbert@google.com> wrote:

> On Thu, Sep 25, 2014 at 8:23 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > On Thu, 25 Sep 2014 08:05:38 -0700 Tom Herbert <therbert@google.com> wrote:
> >
> >> On Thu, Sep 25, 2014 at 7:57 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >> > On Thu, 25 Sep 2014 07:40:33 -0700 Tom Herbert <therbert@google.com> wrote:
> >> >
[...]
> >>
> >> That's great. In commit log, would be good to have results with
> >> TCP_STREAM also and please report aggregate CPU utilization changes
> >> (like from mpstat).
> >
> > The TCP_STREAM is not a good test for this, because unless disabling
> > both TSO and GSO the packets will not hit the code path (that this
> > patch changes). When we later add support for TSO and GSO bulking,
> > then it will make sense to include TCP_STREAM testing, not before.
> >
> Disabling TSO and GSO is fine. I'm interested to see interactions with TCP.

TCP already benefits from bulking via TSO, and especially via GSO
segmented packets, and this patch does not bulk TSO and GSO packets.
Testing the effect for TCP involves disabling TSO and GSO, but I had
to enable GRO on the receiver, which reduces the number of ACK
packets; otherwise the system could not reach the 10Gbit/s link
capacity even without bulking.
Test cmd:
 * netperf -H 192.168.8.2 -t TCP_STREAM -l 1000 -D 1 -n -N

The measured perf diff benefit for TCP_STREAM was 4.66% less CPU used
on calls to _raw_spin_lock() (mostly from sch_direct_xmit()).

Tool mpstat, while stressing the system with 24x netperf TCP_STREAM,
shows:
 * Disabled bulking: 8.30% soft, 88.75% idle
 * Enabled bulking:  7.80% soft, 89.36% idle

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 31+ messages in thread
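[Editorial note: a sketch of how such a parallel TCP_STREAM stress run with mpstat sampling might look; the stream count and sampling interval are taken from or suggested by the mail, the rest are assumptions.]

```shell
# Launch 24 parallel TCP_STREAM tests against the netserver host,
# then sample aggregate CPU utilization while they run.
for i in $(seq 1 24); do
    netperf -H 192.168.8.2 -t TCP_STREAM -l 1000 -D 1 -n -N &
done

# 10 samples, 5 seconds apart; the final "Average" row gives the
# aggregate %soft and %idle figures quoted above.
mpstat 5 10

wait
```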
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 15:23 ` Jesper Dangaard Brouer 2014-09-25 15:58 ` Tom Herbert @ 2014-09-29 20:14 ` Jesper Dangaard Brouer 1 sibling, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-29 20:14 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Tom Herbert, Jamal Hadi Salim, Eric Dumazet, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa

On Thu, 25 Sep 2014 17:23:29 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> I will redo the tests, once I get home to my testlab, as the remote lab
> I'm using now is annoyingly slow rebooting machines, as we not longer
> have a runtime option for enable/disable (I'm currently in Switzerland).

I'm going to change the bulking "budget" from 8 packets to 7 packets,
based on my netperf-wrapper tests.

I've designed some tests for the tool "netperf-wrapper" that try to
measure the HoL (head-of-line) blocking effect. I've turned down the
link speed to 100Mbit/s on driver igb, and use a single TXQ setup
(cmdline: ethtool -L eth1 combined 1). I can now show some kind of HoL
blocking effect. The strange part is that the HoL effect only occurs
above 8 packets; 7 packets and below show no bad effect of this
bulking patch.
Results of the 100Mbit/s test qdisc_prio_hol, latency in the high prio band:
 * GSO:         stable average on 2.23ms
 * TSO:         varies between min 4.13ms to 4.41ms (range 0.28ms)
 * No bulking:  stable average on 1.71ms (3x outliers on 1.95ms)
 * Bulking(8):  varies between 1.71ms to 1.95ms (range 0.24ms)
 * Bulking(7):  stable average on 1.71ms (1x outlier on 1.95ms)
 * Bulking(6):  stable average on 1.71ms (3x outliers on 1.91ms)
 * Bulking(5):  stable average on 1.71ms (1x outlier on 1.91ms)
 * Bulking(4):  stable average on 1.71ms (no outliers)

Bulking(8) calculation:
 * Added delay was 0.24ms (1.95ms - 1.71ms), corresponding to 3000 bytes
 * 1500 bytes * 8 bits/byte / 100Mbit/s = 0.12 ms
 * (100*10^6)*(1.95/10^3)/8 = 24375 bytes
 * (100*10^6)*(1.71/10^3)/8 = 21375 bytes
 * Added HoL: 3000 bytes
 * Still, compared to TSO and GSO, the added HoL blocking is small.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 31+ messages in thread
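[Editorial note: the delay-to-bytes arithmetic above can be double-checked with a small shell/awk sketch restating the numbers from the mail.]

```shell
# Recompute the HoL arithmetic from the mail at 100 Mbit/s.
LINK_BPS=100000000

# Bytes serialized on the wire during a given delay in ms:
# bps * (ms/1000) / 8
bytes_for_delay() {
    awk -v bps="$LINK_BPS" -v ms="$1" 'BEGIN { printf "%.0f\n", bps * (ms/1000) / 8 }'
}

bytes_for_delay 1.95   # 24375 bytes buffered at the 1.95ms latency
bytes_for_delay 1.71   # 21375 bytes buffered at the 1.71ms baseline
# Difference: 3000 bytes of added head-of-line blocking.

# Serialization time of one 1500-byte MTU packet, approx 0.12 ms:
awk -v bps="$LINK_BPS" 'BEGIN { printf "%.2f ms\n", 1500*8/bps*1000 }'
```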
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 14:57 ` Jesper Dangaard Brouer 2014-09-25 15:05 ` Tom Herbert @ 2014-09-25 15:12 ` Eric Dumazet 1 sibling, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2014-09-25 15:12 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Tom Herbert, Jamal Hadi Salim, Linux Netdev List, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, Dave Taht, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Thu, 2014-09-25 at 16:57 +0200, Jesper Dangaard Brouer wrote: > We are saving 3% CPU, as I described in my post with subject: > "qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue": > http://thread.gmane.org/gmane.linux.network/331152/focus=331154 > > Using UDP_STREAM on 1Gbit/s driver igb, I can show that the > _raw_spin_lock calls are reduced with approx 3%, when enabling > bulking of just 2 packets. > > This test can only demonstrates a CPU usage reduction, as the > throughput is already at maximum link (bandwidth) capacity. > > Notice netperf option "-m 1472" which makes sure we are not sending > UDP IP-fragments:: > > netperf -H 192.168.111.2 -t UDP_STREAM -l 120 -- -m 1472 > > Results from perf diff:: > > # Command: perf diff > # Event 'cycles' > # Baseline Delta Symbol > # no-bulk bulk(1) > # ........ ....... ......................................... > # > 7.05% -3.03% [k] _raw_spin_lock > 6.34% +0.23% [k] copy_user_enhanced_fast_string > 6.30% +0.26% [k] fib_table_lookup > 3.03% +0.01% [k] __slab_free > 3.00% +0.08% [k] intel_idle > 2.49% +0.05% [k] sock_alloc_send_pskb > 2.31% +0.30% netperf [.] send_omni_inner > 2.12% +0.12% netperf [.] 
send_data > 2.11% +0.10% [k] udp_sendmsg > 1.96% +0.02% [k] __ip_append_data > 1.48% -0.01% [k] __alloc_skb > 1.46% +0.07% [k] __mkroute_output > 1.34% +0.05% [k] __ip_select_ident > 1.29% +0.03% [k] check_leaf > 1.27% +0.09% [k] __skb_get_hash > > A nitpick is that, this testing were done on V2 of the patchset. > You could avoid the fib_table_lookup() cost by using netperf -- -N -n (connected UDP sockets) And of course reduce message sizes to increase pps ^ permalink raw reply [flat|nested] 31+ messages in thread
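[Editorial note: Eric's two suggestions combined would give a command line roughly like the sketch below; the "-m 64" message size is an arbitrary example value, not from the mail.]

```shell
# UDP_STREAM with connected sockets (test-specific -N -n after the
# "--" marker), avoiding the per-packet route/FIB lookup, plus a
# small message size to push the packets-per-second rate up.
netperf -H 192.168.111.2 -t UDP_STREAM -l 120 -- -m 64 -N -n
```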
* Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE 2014-09-25 8:25 ` Jesper Dangaard Brouer 2014-09-25 12:48 ` Jamal Hadi Salim @ 2014-09-25 13:52 ` Dave Taht 1 sibling, 0 replies; 31+ messages in thread From: Dave Taht @ 2014-09-25 13:52 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Jamal Hadi Salim, Eric Dumazet, netdev, Tom Herbert, David S. Miller, Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal, John Fastabend, Daniel Borkmann, Hannes Frederic Sowa On Thu, Sep 25, 2014 at 1:25 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Wed, 24 Sep 2014 18:13:57 -0400 > Jamal Hadi Salim <jhs@mojatatu.com> wrote: > >> On 09/24/14 13:58, Jesper Dangaard Brouer wrote: >> > On Wed, 24 Sep 2014 10:23:15 -0700 >> > Eric Dumazet <eric.dumazet@gmail.com> wrote: >> > >> >> > >> >> pktgen is nice, but do not represent the majority of the traffic we send >> >> from high performance host where we want this bulk dequeue thing ;) >> > >> > This patch is actually targetted towards more normal use-cases. >> > Pktgen cannot even use this work, as it bypass the qdisc layer... >> >> When you post these patches - can you please also post basic performance >> numbers? You dont have to show improvement if it is hard for bulking >> to kick in, but you need to show no harm in at least latency for the >> general use case (i.e not pktgen maybe forwarding activity or something >> sourced from tcp). > > I've done measurements with netperf-wrapper: > http://netoptimizer.blogspot.dk/2014/09/mini-tutorial-for-netperf-wrapper-setup.html > > I have already previously posted my measurements here: > http://people.netfilter.org/hawk/qdisc/ > http://people.netfilter.org/hawk/qdisc/measure01/ > http://people.netfilter.org/hawk/qdisc/experiment01/ > > Please, see my previous mail where I described each graph. 
Stuff like this:

http://people.netfilter.org/hawk/qdisc/experiment01/compare_TSO_vs_TSO_with_rxusec30__rr_latency.png

will no doubt make many a high speed trader happy.

My point was that by repeating this experiment for each successive
change (Eric's 1/2 BQL patch, better batching, sch_fq, different
ethernet drivers, etc), you (or people duplicating the experiment)
can produce ongoing comparison plots...

> The above measurements is for 10Gbit/s, but I've also done measurements
> on 1Gbit/s driver igb, and 10Mbit/s by forcing igb to use 10Mbit/s.
> Those results I forgot upload (and I cannot upload them right now, as
> I'm currently in Switzerland).
>
> -- 
> Best regards,
>  Jesper Dangaard Brouer
>  MSc.CS, Sr. Network Kernel Developer at Red Hat
>  Author of http://www.iptv-analyzer.org
>  LinkedIn: http://www.linkedin.com/in/brouer

-- 
Dave Täht

https://www.bufferbloat.net/projects/make-wifi-fast

^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2014-09-29 20:24 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-24 16:10 [net-next PATCH 0/1 V4] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer
2014-09-24 16:12 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
2014-09-24 17:23   ` Eric Dumazet
2014-09-24 17:58     ` Jesper Dangaard Brouer
2014-09-24 18:34       ` Tom Herbert
2014-09-24 19:22         ` Eric Dumazet
2014-09-25  2:12           ` Eric Dumazet
2014-09-25  2:38             ` Tom Herbert
2014-09-25  2:58               ` Dave Taht
2014-09-26 18:06                 ` Eric Dumazet
2014-09-25 23:40           ` Eric Dumazet
2014-09-26  6:04             ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet
2014-09-26  8:47               ` Jesper Dangaard Brouer
2014-09-26 11:06               ` Hannes Frederic Sowa
2014-09-26 12:02                 ` Eric Dumazet
2014-09-28 21:43               ` David Miller
2014-09-26  9:23             ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE David Laight
2014-09-26 13:16               ` David Laight
2014-09-26 13:38                 ` Eric Dumazet
2014-09-24 22:13       ` Jamal Hadi Salim
2014-09-25  8:25         ` Jesper Dangaard Brouer
2014-09-25 12:48           ` Jamal Hadi Salim
2014-09-25 14:40             ` Tom Herbert
2014-09-25 14:57               ` Jesper Dangaard Brouer
2014-09-25 15:05                 ` Tom Herbert
2014-09-25 15:23                   ` Jesper Dangaard Brouer
2014-09-25 15:58                     ` Tom Herbert
2014-09-29 20:23                       ` Jesper Dangaard Brouer
2014-09-29 20:14                     ` Jesper Dangaard Brouer
2014-09-25 15:12                 ` Eric Dumazet
2014-09-25 13:52           ` Dave Taht