* [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
@ 2010-03-23 20:25 Alexander Duyck
  2010-03-23 20:31 ` David Miller
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Alexander Duyck @ 2010-03-23 20:25 UTC (permalink / raw)
  To: netdev

The qdisc layer shows a significant issue when you start transmitting from
multiple CPUs.  The issue is that the transmit rate drops significantly, and I
believe it is due to the fact that the spinlock is shared between the one
dequeuing CPU and the n-1 enqueuing CPU threads.  In order to improve this
situation I am adding one additional lock which will need to be obtained
during the enqueue portion of the path.  This essentially allows
sch_direct_xmit to jump to near the head of the line when attempting to
obtain the lock after completing a transmit.
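
To illustrate the intended effect, here is a rough sketch of the locking
order (illustration only, not the literal kernel code; the names follow the
patch below):

/* enqueue path (n-1 cpus): take the new lock first */
spin_lock(&q->enqueue_lock);
spin_lock(root_lock);
/* ... qdisc_enqueue_root(skb, q) ... */
spin_unlock(&q->enqueue_lock);
spin_unlock(root_lock);

/* dequeue path (one cpu, inside sch_direct_xmit): the root lock is dropped
 * around the driver transmit and re-taken afterwards */
spin_unlock(root_lock);
/* ... dev_hard_start_xmit(skb, dev, txq) ... */
spin_lock(root_lock);	/* at most one enqueuer competes for this */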

Running the script below I saw an increase from 200K packets per second to
1.07M packets per second as a result of this patch.

for j in `seq 0 15`; do
	for i in `seq 0 7`; do
		netperf -H <ip> -t UDP_STREAM -l 600 -N -T $i -- -m 6 &
	done
done

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 include/net/sch_generic.h |    3 ++-
 net/core/dev.c            |    7 ++++++-
 net/sched/sch_generic.c   |    1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 67dc08e..f5088a9 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -51,7 +51,6 @@ struct Qdisc {
 	struct list_head	list;
 	u32			handle;
 	u32			parent;
-	atomic_t		refcnt;
 	struct gnet_stats_rate_est	rate_est;
 	int			(*reshape_fail)(struct sk_buff *skb,
 					struct Qdisc *q);
@@ -65,6 +64,8 @@ struct Qdisc {
 	struct netdev_queue	*dev_queue;
 	struct Qdisc		*next_sched;
 
+	atomic_t		refcnt;
+	spinlock_t		enqueue_lock;
 	struct sk_buff		*gso_skb;
 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
diff --git a/net/core/dev.c b/net/core/dev.c
index a03aab4..8a2d508 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2006,8 +2006,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	spinlock_t *root_lock = qdisc_lock(q);
 	int rc;
 
+	spin_lock(&q->enqueue_lock);
 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+		spin_unlock(&q->enqueue_lock);
 		kfree_skb(skb);
 		rc = NET_XMIT_DROP;
 	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
@@ -2018,7 +2020,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		 * xmit the skb directly.
 		 */
 		__qdisc_update_bstats(q, skb->len);
-		if (sch_direct_xmit(skb, q, dev, txq, root_lock))
+		spin_unlock(&q->enqueue_lock);
+		rc = sch_direct_xmit(skb, q, dev, txq, root_lock);
+		if (rc)
 			__qdisc_run(q);
 		else
 			clear_bit(__QDISC_STATE_RUNNING, &q->state);
@@ -2026,6 +2030,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		rc = NET_XMIT_SUCCESS;
 	} else {
 		rc = qdisc_enqueue_root(skb, q);
+		spin_unlock(&q->enqueue_lock);
 		qdisc_run(q);
 	}
 	spin_unlock(root_lock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 5173c1e..4b5e125 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -538,6 +538,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	sch = (struct Qdisc *) QDISC_ALIGN((unsigned long) p);
 	sch->padded = (char *) sch - (char *) p;
 
+	spin_lock_init(&sch->enqueue_lock);
 	INIT_LIST_HEAD(&sch->list);
 	skb_queue_head_init(&sch->q);
 	sch->ops = ops;



* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-23 20:25 [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness Alexander Duyck
@ 2010-03-23 20:31 ` David Miller
  2010-03-24  2:12   ` Andi Kleen
  2010-03-23 20:40 ` Rick Jones
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2010-03-23 20:31 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: netdev

From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Tue, 23 Mar 2010 13:25:53 -0700

> The qdisc layer shows a significant issue when you start transmitting from
> multiple CPUs.

Why are you hitting any central qdisc lock on a multiqueue
configuration?


* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-23 20:25 [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness Alexander Duyck
  2010-03-23 20:31 ` David Miller
@ 2010-03-23 20:40 ` Rick Jones
  2010-03-23 20:54 ` Eric Dumazet
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Rick Jones @ 2010-03-23 20:40 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev

Alexander Duyck wrote:
> The qdisc layer shows a significant issue when you start transmitting from
> multiple CPUs.  The issue is that the transmit rate drops significantly, and I
> believe it is due to the fact that the spinlock is shared between the 1
> dequeue, and n-1 enqueue cpu threads.  In order to improve this situation I am
> adding one additional lock which will need to be obtained during the enqueue
> portion of the path.  This essentially allows sch_direct_xmit to jump to
> near the head of the line when attempting to obtain the lock after
> completing a transmit.
> 
> Running the script below I saw an increase from 200K packets per second to
> 1.07M packets per second as a result of this patch.
> 
> for j in `seq 0 15`; do
> 	for i in `seq 0 7`; do
> 		netperf -H <ip> -t UDP_STREAM -l 600 -N -T $i -- -m 6 &
> 	done
> done


Wow - someone is using the no control connection mode in netperf :)  Or were you 
actually trying to get the test-specific -N option to control whether or not 
netperf "connects" the UDP socket?

rick jones


* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-23 20:25 [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness Alexander Duyck
  2010-03-23 20:31 ` David Miller
  2010-03-23 20:40 ` Rick Jones
@ 2010-03-23 20:54 ` Eric Dumazet
  2010-03-23 21:18   ` Eric Dumazet
  2010-03-23 21:45   ` David Miller
  2010-03-24 14:24 ` [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness Eric Dumazet
  2010-05-21 15:08 ` [PATCH] net: add additional lock to qdisc to increase throughput Eric Dumazet
  4 siblings, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2010-03-23 20:54 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev

Le mardi 23 mars 2010 à 13:25 -0700, Alexander Duyck a écrit :
> The qdisc layer shows a significant issue when you start transmitting from
> multiple CPUs.  The issue is that the transmit rate drops significantly, and I
> believe it is due to the fact that the spinlock is shared between the 1
> dequeue, and n-1 enqueue cpu threads.  In order to improve this situation I am
> adding one additional lock which will need to be obtained during the enqueue
> portion of the path.  This essentially allows sch_direct_xmit to jump to
> near the head of the line when attempting to obtain the lock after
> completing a transmit.
> 
> Running the script below I saw an increase from 200K packets per second to
> 1.07M packets per second as a result of this patch.
> 
> for j in `seq 0 15`; do
> 	for i in `seq 0 7`; do
> 		netperf -H <ip> -t UDP_STREAM -l 600 -N -T $i -- -m 6 &
> 	done
> done
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 

Hi Alexander

That's a pretty good topic :)

So to speed up a pathological case (dozens of cpus all sending to the same
queue), you suggest adding a spin_lock to the fast path, slowing down the
normal cases?

Quite frankly, the real problem in this case is not the reduced
throughput, but the fact that one cpu can spend a long time doing the
xmits to the device for skbs queued by other cpus. This can hurt latencies
a lot, for real-time threads for example...

I wonder if ticket spinlocks are not the problem. Maybe we want a
variant of spinlocks, so that cpu doing transmits can get the lock
before other cpus...





* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-23 20:54 ` Eric Dumazet
@ 2010-03-23 21:18   ` Eric Dumazet
  2010-03-23 21:45   ` David Miller
  1 sibling, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2010-03-23 21:18 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev

Le mardi 23 mars 2010 à 21:54 +0100, Eric Dumazet a écrit :
> I wonder if ticket spinlocks are not the problem. Maybe we want a
> variant of spinlocks, so that cpu doing transmits can get the lock
> before other cpus...

Something like this portable implementation:

struct spinprio {
	spinlock_t	lock;
	int		highprio_cnt;	/* set while the high-prio user wants the lock */
};

/* normal (enqueue-side) acquisition: back off while a high-prio user is pending */
void spinprio_lock(struct spinprio *l)
{
	while (1) {
		spin_lock(&l->lock);
		if (!l->highprio_cnt)
			break;
		spin_unlock(&l->lock);
		cpu_relax();
	}
}

void spinprio_unlock(struct spinprio *l)
{
	spin_unlock(&l->lock);
}

/* high-prio (dequeue-side) re-acquisition: push normal lockers away until we own the lock */
void spinprio_relock(struct spinprio *l)
{
	l->highprio_cnt = 1;
	spin_lock(&l->lock);
	l->highprio_cnt = 0;
}

We would have to use spinprio_unlock()/spinprio_relock() in
sch_direct_xmit()
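
Something like this, roughly (sketch only, assuming the qdisc root lock
becomes a struct spinprio):

spinprio_unlock(root_lock);	/* enqueuers may grab the lock meanwhile */
/* ... HARD_TX_LOCK + dev_hard_start_xmit() + HARD_TX_UNLOCK ... */
spinprio_relock(root_lock);	/* raise highprio_cnt, then take the lock
				 * ahead of cpus looping in spinprio_lock() */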





* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-23 20:54 ` Eric Dumazet
  2010-03-23 21:18   ` Eric Dumazet
@ 2010-03-23 21:45   ` David Miller
  2010-03-23 22:08     ` Duyck, Alexander H
  2010-03-23 22:13     ` Eric Dumazet
  1 sibling, 2 replies; 26+ messages in thread
From: David Miller @ 2010-03-23 21:45 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.h.duyck, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 23 Mar 2010 21:54:27 +0100

> Quite frankly, the real problem in this case is not the reduced
> throughput, but fact that one cpu can stay a long time doing the xmits
> to device, of skb queued by other cpus. This can hurt latencies a lot,
> for real time threads for example...
> 
> I wonder if ticket spinlocks are not the problem. Maybe we want a
> variant of spinlocks, so that cpu doing transmits can get the lock
> before other cpus...

I want to note that things operate the way they do now
intentionally.

Herbert Xu and Jamal Hadi Salim were active in this area
about 4 years ago.


* RE: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-23 21:45   ` David Miller
@ 2010-03-23 22:08     ` Duyck, Alexander H
  2010-03-24  2:58       ` David Miller
  2010-03-23 22:13     ` Eric Dumazet
  1 sibling, 1 reply; 26+ messages in thread
From: Duyck, Alexander H @ 2010-03-23 22:08 UTC (permalink / raw)
  To: David Miller, eric.dumazet; +Cc: netdev

David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 23 Mar 2010 21:54:27 +0100
> 
>> Quite frankly, the real problem in this case is not the reduced
>> throughput, but fact that one cpu can stay a long time doing the
>> xmits to device, of skb queued by other cpus. This can hurt
>> latencies a lot, for real time threads for example...
>> 
>> I wonder if ticket spinlocks are not the problem. Maybe we want a
>> variant of spinlocks, so that cpu doing transmits can get the lock
>> before other cpus...
> 
> I want to note that things operate the way they do now
> intentionally.
> 
> Herbert Xu and Jamal Hadi Salim were active in this area
> about 4 years ago.

I'll do some digging on this to see what I can find out.  I'm not sure why we would want to do things this way since it seems like the CPU that has qdisc_running is sitting idle due to the fact that the ticket spinlocks have us waiting for all of the other CPU enqueues before we can get around to another dequeue.  It seems like it would be in our best interest to keep qdisc_running going while we are still enqueuing on the other CPUs.
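
As a toy illustration of that FIFO behaviour (a minimal userspace model of a ticket lock, not the kernel's arch_spin_lock implementation): the cpu that unlocks and immediately asks for the lock again draws a fresh ticket, and therefore queues behind every enqueuer already waiting.

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint serving;	/* ticket currently allowed to own the lock */
};

void tl_lock(struct ticket_lock *l)
{
	unsigned int t = atomic_fetch_add(&l->next, 1);	/* draw a ticket */

	while (atomic_load(&l->serving) != t)
		;	/* strictly FIFO: no overtaking possible */
}

void tl_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->serving, 1);	/* hand over to the next waiter */
}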

I'll also look into the idea of possibly coming up with a more portable solution such as a priority based spinlock solution.

Thanks,

Alex



* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-23 21:45   ` David Miller
  2010-03-23 22:08     ` Duyck, Alexander H
@ 2010-03-23 22:13     ` Eric Dumazet
  2010-05-21 15:43       ` Stephen Hemminger
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2010-03-23 22:13 UTC (permalink / raw)
  To: David Miller; +Cc: alexander.h.duyck, netdev

Le mardi 23 mars 2010 à 14:45 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 23 Mar 2010 21:54:27 +0100
> 
> > Quite frankly, the real problem in this case is not the reduced
> > throughput, but fact that one cpu can stay a long time doing the xmits
> > to device, of skb queued by other cpus. This can hurt latencies a lot,
> > for real time threads for example...
> > 
> > I wonder if ticket spinlocks are not the problem. Maybe we want a
> > variant of spinlocks, so that cpu doing transmits can get the lock
> > before other cpus...
> 
> I want to note that things operate the way they do now
> intentionally.
> 
> Herbert Xu and Jamal Hadi Salim were active in this area
> about 4 years ago.

Yes, but ticket spinlocks were added after their work (in 2008, in 2.6.25
if I remember well) and they change things.

We want the cpu owning __QDISC_STATE_RUNNING to be able to re-acquire the
lock as fast as possible. Alexander's results show the possible speedup.





* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-23 20:31 ` David Miller
@ 2010-03-24  2:12   ` Andi Kleen
  0 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2010-03-24  2:12 UTC (permalink / raw)
  To: David Miller; +Cc: alexander.h.duyck, netdev

David Miller <davem@davemloft.net> writes:

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Tue, 23 Mar 2010 13:25:53 -0700
>
>> The qdisc layer shows a significant issue when you start transmitting from
>> multiple CPUs.
>
> Why are you hitting any central qdisc lock on a multiqueue
> configuration?

One thing to remember is that systems can have a lot more CPUs than a NIC
has queues (e.g. 64 vs 8).

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-23 22:08     ` Duyck, Alexander H
@ 2010-03-24  2:58       ` David Miller
  2010-03-24  5:42         ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2010-03-24  2:58 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: eric.dumazet, netdev

From: "Duyck, Alexander H" <alexander.h.duyck@intel.com>
Date: Tue, 23 Mar 2010 15:08:12 -0700

> I'll do some digging on this to see what I can find out.

Please see:  http://vger.kernel.org/jamal_netconf2006.sxi


* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-24  2:58       ` David Miller
@ 2010-03-24  5:42         ` Eric Dumazet
  2010-03-24  6:10           ` David Miller
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2010-03-24  5:42 UTC (permalink / raw)
  To: David Miller; +Cc: alexander.h.duyck, netdev

Le mardi 23 mars 2010 à 19:58 -0700, David Miller a écrit :
> From: "Duyck, Alexander H" <alexander.h.duyck@intel.com>
> Date: Tue, 23 Mar 2010 15:08:12 -0700
> 
> > I'll do some digging on this to see what I can find out.
> 
> Please see:  http://vger.kernel.org/jamal_netconf2006.sxi
> --

Thanks David, this is very informative!

Do we have a list of such network-related documents?




* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-24  5:42         ` Eric Dumazet
@ 2010-03-24  6:10           ` David Miller
  0 siblings, 0 replies; 26+ messages in thread
From: David Miller @ 2010-03-24  6:10 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.h.duyck, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 24 Mar 2010 06:42:04 +0100

> Do we have a list of such network related documents ?

Not really, just scan http://vger.kernel.org/netconf200${N}.html
for the papers :-)


* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-23 20:25 [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness Alexander Duyck
                   ` (2 preceding siblings ...)
  2010-03-23 20:54 ` Eric Dumazet
@ 2010-03-24 14:24 ` Eric Dumazet
  2010-05-21 15:08 ` [PATCH] net: add additional lock to qdisc to increase throughput Eric Dumazet
  4 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2010-03-24 14:24 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev

Le mardi 23 mars 2010 à 13:25 -0700, Alexander Duyck a écrit :
> The qdisc layer shows a significant issue when you start transmitting from
> multiple CPUs.  The issue is that the transmit rate drops significantly, and I
> believe it is due to the fact that the spinlock is shared between the 1
> dequeue, and n-1 enqueue cpu threads.  In order to improve this situation I am
> adding one additional lock which will need to be obtained during the enqueue
> portion of the path.  This essentially allows sch_direct_xmit to jump to
> near the head of the line when attempting to obtain the lock after
> completing a transmit.
> 

This two-stage locking gives the dequeuing cpu a 50% chance of getting
q.lock, since all other cpus (but one) are competing for enqueue_lock
instead.

> Running the script below I saw an increase from 200K packets per second to
> 1.07M packets per second as a result of this patch.
> 
> for j in `seq 0 15`; do
> 	for i in `seq 0 7`; do
> 		netperf -H <ip> -t UDP_STREAM -l 600 -N -T $i -- -m 6 &
> 	done
> done
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 
>  include/net/sch_generic.h |    3 ++-
>  net/core/dev.c            |    7 ++++++-
>  net/sched/sch_generic.c   |    1 +
>  3 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 67dc08e..f5088a9 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -51,7 +51,6 @@ struct Qdisc {
>  	struct list_head	list;
>  	u32			handle;
>  	u32			parent;
> -	atomic_t		refcnt;
>  	struct gnet_stats_rate_est	rate_est;
>  	int			(*reshape_fail)(struct sk_buff *skb,
>  					struct Qdisc *q);
> @@ -65,6 +64,8 @@ struct Qdisc {
>  	struct netdev_queue	*dev_queue;
>  	struct Qdisc		*next_sched;
>  
> +	atomic_t		refcnt;
> +	spinlock_t		enqueue_lock;
>  	struct sk_buff		*gso_skb;
>  	/*


Could you at least try not to fill the hole, and instead place enqueue_lock
right after "struct sk_buff_head q;"?

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 67dc08e..6079c70 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -71,6 +71,7 @@ struct Qdisc {
         */
        unsigned long           state;
        struct sk_buff_head     q;
+       spinlock_t              enqueue_lock;
        struct gnet_stats_basic_packed bstats;
        struct gnet_stats_queue qstats;
 };


So that on x86_64, all these fields use one cache line instead of two.

offsetof(struct Qdisc, state)=0x80
offsetof(struct Qdisc, q)=0x88
offsetof(struct Qdisc, enqueue_lock)=0xa0

It would be nice to have a benchmark for the non-pathological case (one cpu
doing a flood xmit of small packets).
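
For what it is worth, a compile-time check along these lines could pin the
layout assumption down (sketch only: it presumes the enqueue_lock placement
suggested above and a 64-byte cache line; pahole is the usual tool for
eyeballing the actual layout):

#include <linux/cache.h>
#include <linux/kernel.h>
#include <net/sch_generic.h>

static inline void qdisc_hot_layout_check(void)
{
	/* state, q and enqueue_lock should share a single cache line */
	BUILD_BUG_ON(offsetof(struct Qdisc, enqueue_lock) -
		     offsetof(struct Qdisc, state) >= SMP_CACHE_BYTES);
}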





* [PATCH] net: add additional lock to qdisc to increase throughput
  2010-03-23 20:25 [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness Alexander Duyck
                   ` (3 preceding siblings ...)
  2010-03-24 14:24 ` [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness Eric Dumazet
@ 2010-05-21 15:08 ` Eric Dumazet
  2010-05-21 20:04   ` Duyck, Alexander H
  4 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2010-05-21 15:08 UTC (permalink / raw)
  To: Alexander Duyck, David Miller; +Cc: netdev

Le mardi 23 mars 2010 à 13:25 -0700, Alexander Duyck a écrit :
> The qdisc layer shows a significant issue when you start transmitting from
> multiple CPUs.  The issue is that the transmit rate drops significantly, and I
> believe it is due to the fact that the spinlock is shared between the 1
> dequeue, and n-1 enqueue cpu threads.  In order to improve this situation I am
> adding one additional lock which will need to be obtained during the enqueue
> portion of the path.  This essentially allows sch_direct_xmit to jump to
> near the head of the line when attempting to obtain the lock after
> completing a transmit.
> 
> Running the script below I saw an increase from 200K packets per second to
> 1.07M packets per second as a result of this patch.
> 
> for j in `seq 0 15`; do
> 	for i in `seq 0 7`; do
> 		netperf -H <ip> -t UDP_STREAM -l 600 -N -T $i -- -m 6 &
> 	done
> done
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 
>  include/net/sch_generic.h |    3 ++-
>  net/core/dev.c            |    7 ++++++-
>  net/sched/sch_generic.c   |    1 +
>  3 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 67dc08e..f5088a9 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -51,7 +51,6 @@ struct Qdisc {
>  	struct list_head	list;
>  	u32			handle;
>  	u32			parent;
> -	atomic_t		refcnt;
>  	struct gnet_stats_rate_est	rate_est;
>  	int			(*reshape_fail)(struct sk_buff *skb,
>  					struct Qdisc *q);
> @@ -65,6 +64,8 @@ struct Qdisc {
>  	struct netdev_queue	*dev_queue;
>  	struct Qdisc		*next_sched;
>  
> +	atomic_t		refcnt;
> +	spinlock_t		enqueue_lock;
>  	struct sk_buff		*gso_skb;
>  	/*
>  	 * For performance sake on SMP, we put highly modified fields at the end
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a03aab4..8a2d508 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2006,8 +2006,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  	spinlock_t *root_lock = qdisc_lock(q);
>  	int rc;
>  
> +	spin_lock(&q->enqueue_lock);
>  	spin_lock(root_lock);
>  	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> +		spin_unlock(&q->enqueue_lock);
>  		kfree_skb(skb);
>  		rc = NET_XMIT_DROP;
>  	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> @@ -2018,7 +2020,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  		 * xmit the skb directly.
>  		 */
>  		__qdisc_update_bstats(q, skb->len);
> -		if (sch_direct_xmit(skb, q, dev, txq, root_lock))
> +		spin_unlock(&q->enqueue_lock);
> +		rc = sch_direct_xmit(skb, q, dev, txq, root_lock);
> +		if (rc)
>  			__qdisc_run(q);
>  		else
>  			clear_bit(__QDISC_STATE_RUNNING, &q->state);
> @@ -2026,6 +2030,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  		rc = NET_XMIT_SUCCESS;
>  	} else {
>  		rc = qdisc_enqueue_root(skb, q);
> +		spin_unlock(&q->enqueue_lock);
>  		qdisc_run(q);
>  	}
>  	spin_unlock(root_lock);
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 5173c1e..4b5e125 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -538,6 +538,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
>  	sch = (struct Qdisc *) QDISC_ALIGN((unsigned long) p);
>  	sch->padded = (char *) sch - (char *) p;
>  
> +	spin_lock_init(&sch->enqueue_lock);
>  	INIT_LIST_HEAD(&sch->list);
>  	skb_queue_head_init(&sch->q);
>  	sch->ops = ops;
> 

Hi Alexander

What about the following patch instead, adding a heuristic to avoid an
extra atomic operation in the fast path?

I also try to release the 'busylock' after the main lock, to favor the
worker as much as possible. It only has to be released before the main
lock when we are about to call __qdisc_run().

Another refinement would be to do the spinning outside of BH-disabled
context, because we currently delay TX completion processing (if NAPI is
used by the driver), which could make room in the device TX ring buffer.
This would also allow cpus to process incoming traffic...

Thanks

[PATCH] net: add additional lock to qdisc to increase throughput

When many cpus compete to send frames on a given qdisc, the qdisc
spinlock suffers from very high contention.

The cpu owning the __QDISC_STATE_RUNNING bit has the same priority to
acquire the lock as everyone else, and cannot dequeue packets fast enough,
since it must wait for this lock for each dequeued packet.

One solution to this problem is to force all cpus to spin on a second
lock before trying to get the main lock, when/if they see
__QDISC_STATE_RUNNING already set.

The owning cpu then competes with at most one other cpu for the main
lock, allowing for a higher dequeueing rate.

Based on a previous patch from Alexander Duyck. I added the heuristic to
avoid the atomic in the fast path, and put the new lock far away from the
cache line used by the dequeue worker. I also try to release the busylock
as late as possible.

Tests with the following script gave a boost from ~50,000 pps to ~600,000
pps on a dual quad-core machine (E5450 @3.00GHz), tg3 driver.
(A single netperf flow can reach ~800,000 pps on this platform.)

for j in `seq 0 3`; do
  for i in `seq 0 7`; do
    netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
  done
done


Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sch_generic.h |    3 ++-
 net/core/dev.c            |   29 +++++++++++++++++++++++++----
 net/sched/sch_generic.c   |    1 +
 3 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 03ca5d8..2d55649 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -73,7 +73,8 @@ struct Qdisc {
 	struct sk_buff_head	q;
 	struct gnet_stats_basic_packed bstats;
 	struct gnet_stats_queue	qstats;
-	struct rcu_head     rcu_head;
+	struct rcu_head		rcu_head;
+	spinlock_t		busylock;
 };
 
 struct Qdisc_class_ops {
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c82065..1b313eb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2040,6 +2040,16 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 {
 	spinlock_t *root_lock = qdisc_lock(q);
 	int rc;
+	bool contended = test_bit(__QDISC_STATE_RUNNING, &q->state);
+
+	/*
+	 * Heuristic to force contended enqueues to serialize on a
+	 * separate lock before trying to get qdisc main lock.
+	 * This permits __QDISC_STATE_RUNNING owner to get the lock more often
+	 * and dequeue packets faster.
+	 */
+	if (unlikely(contended))
+		spin_lock(&q->busylock);
 
 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
@@ -2055,19 +2065,30 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		if (!(dev->priv_flags & IFF_XMIT_DST_RELEASE))
 			skb_dst_force(skb);
 		__qdisc_update_bstats(q, skb->len);
-		if (sch_direct_xmit(skb, q, dev, txq, root_lock))
+		if (sch_direct_xmit(skb, q, dev, txq, root_lock)) {
+			if (unlikely(contended)) {
+				spin_unlock(&q->busylock);
+				contended = false;
+			}
 			__qdisc_run(q);
-		else
+		} else
 			clear_bit(__QDISC_STATE_RUNNING, &q->state);
 
 		rc = NET_XMIT_SUCCESS;
 	} else {
 		skb_dst_force(skb);
 		rc = qdisc_enqueue_root(skb, q);
-		qdisc_run(q);
+		if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {
+			if (unlikely(contended)) {
+				spin_unlock(&q->busylock);
+				contended = false;
+			}
+			__qdisc_run(q);
+		}
 	}
 	spin_unlock(root_lock);
-
+	if (unlikely(contended))
+		spin_unlock(&q->busylock);
 	return rc;
 }
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index a63029e..0a44290 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -543,6 +543,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 
 	INIT_LIST_HEAD(&sch->list);
 	skb_queue_head_init(&sch->q);
+	spin_lock_init(&sch->busylock);
 	sch->ops = ops;
 	sch->enqueue = ops->enqueue;
 	sch->dequeue = ops->dequeue;





* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-03-23 22:13     ` Eric Dumazet
@ 2010-05-21 15:43       ` Stephen Hemminger
  2010-05-21 16:44         ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Stephen Hemminger @ 2010-05-21 15:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, alexander.h.duyck, netdev

On Tue, 23 Mar 2010 23:13:00 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mardi 23 mars 2010 à 14:45 -0700, David Miller a écrit :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Tue, 23 Mar 2010 21:54:27 +0100
> > 
> > > Quite frankly, the real problem in this case is not the reduced
> > > throughput, but fact that one cpu can stay a long time doing the xmits
> > > to device, of skb queued by other cpus. This can hurt latencies a lot,
> > > for real time threads for example...
> > > 
> > > I wonder if ticket spinlocks are not the problem. Maybe we want a
> > > variant of spinlocks, so that cpu doing transmits can get the lock
> > > before other cpus...
> > 
> > I want to note that things operate the way they do now
> > intentionally.
> > 
> > Herbert Xu and Jamal Hadi Salim were active in this area
> > about 4 years ago.
> 
> Yes, but ticket spinlocks were added after their work (in 2008 - 2.6.25
> if I remember well) and change things.
> 
> We want cpu owning __QDISC_STATE_RUNNING being able to re-get the lock
> as fast as possible. Alexander results can show the possible speedup.

What about having a special function (spin_lock_greedy?) that just ignores
the ticket mechanism and always assumes it has the right to the next ticket?



-- 


* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-05-21 15:43       ` Stephen Hemminger
@ 2010-05-21 16:44         ` Eric Dumazet
  2010-05-21 17:38           ` Stephen Hemminger
  2010-05-21 17:40           ` Eric Dumazet
  0 siblings, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2010-05-21 16:44 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, alexander.h.duyck, netdev

Le vendredi 21 mai 2010 à 08:43 -0700, Stephen Hemminger a écrit :

> What about having a special function (spin_lock_greedy?) that just ignores
> the ticket mechanism and always assumes it has right to next ticket.
> 

I am not sure we want to do this for a single use case...
Adding a new lock primitive to Linux needs a really strong motivation.

x86 ticket spinlocks are nice because we are sure no cpu is going to
stay forever in this code. For other arches, I don't know how this is
handled.

I thought about this before re-working this patch, but found it was
easier to slightly change Alexander's code, i.e. adding a regular
spinlock for the slow path.

We could use cmpxchg() and manipulate several bits at once in the fast
path (__QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED, ...), but making the
crowd of cpus spin on the same bits/cacheline as the dequeue worker would
definitely slow down the worker.
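
For concreteness, such a cmpxchg() variant might look roughly like this
(pure sketch: __QDISC_STATE_LOCKED is a made-up bit standing in for the
enqueue lock, and it is exactly the everyone-spins-on-one-cacheline scheme
described above):

static inline bool qdisc_state_try_lock(struct Qdisc *q)
{
	unsigned long old, new;

	do {
		old = q->state;
		if (old & (1UL << __QDISC_STATE_LOCKED))
			return false;	/* another cpu owns the enqueue side */
		new = old | (1UL << __QDISC_STATE_LOCKED);
		/* __QDISC_STATE_RUNNING could be folded into the same cmpxchg */
	} while (cmpxchg(&q->state, old, new) != old);

	return true;
}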





* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-05-21 16:44         ` Eric Dumazet
@ 2010-05-21 17:38           ` Stephen Hemminger
  2010-05-21 17:40           ` Eric Dumazet
  1 sibling, 0 replies; 26+ messages in thread
From: Stephen Hemminger @ 2010-05-21 17:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, alexander.h.duyck, netdev

On Fri, 21 May 2010 18:44:35 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le vendredi 21 mai 2010 à 08:43 -0700, Stephen Hemminger a écrit :
> 
> > What about having a special function (spin_lock_greedy?) that just ignores
> > the ticket mechanism and always assumes it has right to next ticket.
> > 
> 
> I am not sure we want to do this, for a single use case...
> Adding a new lock primitive to linux should be really really motivated.
> 
> x86 ticket spinlocks are nice because we are sure no cpu is going to
> stay forever in this code. For other arches, I dont know how this is
> coped.
> 
> I thought about this before re-working on this patch, but found it was
> easier to slightly change Alexander code, ie adding a regular spinlock
> for the slowpath.
> 
> We could use cmpxchg() and manipulate several bits at once in fast path.
> ( __QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED ... ) but making the crowd
> of cpus spin on the same bits/cacheline than dequeue worker would
> definitely slowdown the worker.

Your solution is okay, but it is a special-case performance hack,
and we seem to be getting more and more of these lately.
This is a general problem with any producer/consumer case;
other code will have the same problem (like disk requests)
and can't use the same solution.

Maybe the RT and scheduler folks have some input, because it does
seem like the same problem has occurred in other contexts before.

-- 


* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-05-21 16:44         ` Eric Dumazet
  2010-05-21 17:38           ` Stephen Hemminger
@ 2010-05-21 17:40           ` Eric Dumazet
  2010-06-02  9:48             ` Eric Dumazet
                               ` (2 more replies)
  1 sibling, 3 replies; 26+ messages in thread
From: Eric Dumazet @ 2010-05-21 17:40 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, alexander.h.duyck, netdev

Le vendredi 21 mai 2010 à 18:44 +0200, Eric Dumazet a écrit :

> We could use cmpxchg() and manipulate several bits at once in fast path.
> ( __QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED ... ) but making the crowd
> of cpus spin on the same bits/cacheline than dequeue worker would
> definitely slowdown the worker.
> 
> 

Maybe I am missing something, but __QDISC_STATE_RUNNING is always
manipulated with the lock held...

We might avoid two atomic ops when changing this state (if it is moved to
a separate container) in the fast path (when a cpu sends only one packet
and returns).





* RE: [PATCH] net: add additional lock to qdisc to increase throughput
  2010-05-21 15:08 ` [PATCH] net: add additional lock to qdisc to increase throughput Eric Dumazet
@ 2010-05-21 20:04   ` Duyck, Alexander H
  2010-06-02 12:10     ` David Miller
  0 siblings, 1 reply; 26+ messages in thread
From: Duyck, Alexander H @ 2010-05-21 20:04 UTC (permalink / raw)
  To: Eric Dumazet, David Miller; +Cc: netdev

Eric Dumazet wrote:
> Tests with following script gave a boost from ~50.000 pps to ~600.000
> pps on a dual quad core machine (E5450 @3.00GHz), tg3 driver.
> (A single netperf flow can reach ~800.000 pps on this platform)
> 
> for j in `seq 0 3`; do
>   for i in `seq 0 7`; do
>     netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
>   done
> done

Running the same script with your patch, my results went from 200Kpps to 1.2Mpps on a dual Xeon 5570.

Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>



* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
  2010-05-21 17:40           ` Eric Dumazet
@ 2010-06-02  9:48             ` Eric Dumazet
  2010-06-02  9:49             ` [PATCH 1/2 net-next-2.6] net: Define accessors to manipulate QDISC_STATE_RUNNING Eric Dumazet
  2010-06-02  9:50             ` [PATCH 2/2 net-next-2.6] net: QDISC_STATE_RUNNING dont need atomic bit ops Eric Dumazet
  2 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2010-06-02  9:48 UTC (permalink / raw)
  To: Stephen Hemminger, David Miller; +Cc: alexander.h.duyck, netdev

Le vendredi 21 mai 2010 à 19:40 +0200, Eric Dumazet a écrit :
> Le vendredi 21 mai 2010 à 18:44 +0200, Eric Dumazet a écrit :
> 
> > We could use cmpxchg() and manipulate several bits at once in fast path.
> > ( __QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED ... ) but making the crowd
> > of cpus spin on the same bits/cacheline than dequeue worker would
> > definitely slowdown the worker.
> > 
> > 
> 
> Maybe I am missing something, but __QDISC_STATE_RUNNING is always
> manipulated with the lock held...
> 
> We might avoid two atomic ops when changing this state (if moved to a
> separate container) in fast path (when a cpu sends only one packet and
> returns)
> 
> 

Here are two patches to implement this idea.

First patch to abstract QDISC_STATE_RUNNING access.

Second patch to add a __qstate container and remove the atomic ops.





* [PATCH 1/2 net-next-2.6] net: Define accessors to manipulate QDISC_STATE_RUNNING
  2010-05-21 17:40           ` Eric Dumazet
  2010-06-02  9:48             ` Eric Dumazet
@ 2010-06-02  9:49             ` Eric Dumazet
  2010-06-02 10:24               ` David Miller
  2010-06-02  9:50             ` [PATCH 2/2 net-next-2.6] net: QDISC_STATE_RUNNING dont need atomic bit ops Eric Dumazet
  2 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2010-06-02  9:49 UTC (permalink / raw)
  To: Stephen Hemminger, David Miller; +Cc: alexander.h.duyck, netdev

Define three helpers to manipulate the QDISC_STATE_RUNNING flag, which a
second patch will move to another location.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/pkt_sched.h   |    2 +-
 include/net/sch_generic.h |   15 +++++++++++++++
 net/core/dev.c            |    4 ++--
 net/sched/sch_generic.c   |    4 ++--
 4 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 9d4d87c..d9549af 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -95,7 +95,7 @@ extern void __qdisc_run(struct Qdisc *q);
 
 static inline void qdisc_run(struct Qdisc *q)
 {
-	if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+	if (qdisc_run_begin(q))
 		__qdisc_run(q);
 }
 
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 03ca5d8..9707dae 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -76,6 +76,21 @@ struct Qdisc {
 	struct rcu_head     rcu_head;
 };
 
+static inline bool qdisc_is_running(struct Qdisc *qdisc)
+{
+	return test_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+}
+
+static inline bool qdisc_run_begin(struct Qdisc *qdisc)
+{
+	return !test_and_set_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+}
+
+static inline void qdisc_run_end(struct Qdisc *qdisc)
+{
+	clear_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+}
+
 struct Qdisc_class_ops {
 	/* Child qdisc manipulation */
 	struct netdev_queue *	(*select_queue)(struct Qdisc *, struct tcmsg *);
diff --git a/net/core/dev.c b/net/core/dev.c
index 983a3c1..2733226 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2047,7 +2047,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		kfree_skb(skb);
 		rc = NET_XMIT_DROP;
 	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
-		   !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {
+		   qdisc_run_begin(q)) {
 		/*
 		 * This is a work-conserving queue; there are no old skbs
 		 * waiting to be sent out; and the qdisc is not running -
@@ -2059,7 +2059,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		if (sch_direct_xmit(skb, q, dev, txq, root_lock))
 			__qdisc_run(q);
 		else
-			clear_bit(__QDISC_STATE_RUNNING, &q->state);
+			qdisc_run_end(q);
 
 		rc = NET_XMIT_SUCCESS;
 	} else {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index bd1892f..37b86ea 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -205,7 +205,7 @@ void __qdisc_run(struct Qdisc *q)
 		}
 	}
 
-	clear_bit(__QDISC_STATE_RUNNING, &q->state);
+	qdisc_run_end(q);
 }
 
 unsigned long dev_trans_start(struct net_device *dev)
@@ -797,7 +797,7 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 
 		spin_lock_bh(root_lock);
 
-		val = (test_bit(__QDISC_STATE_RUNNING, &q->state) ||
+		val = (qdisc_is_running(q) ||
 		       test_bit(__QDISC_STATE_SCHED, &q->state));
 
 		spin_unlock_bh(root_lock);




* [PATCH 2/2 net-next-2.6] net: QDISC_STATE_RUNNING dont need atomic bit ops
  2010-05-21 17:40           ` Eric Dumazet
  2010-06-02  9:48             ` Eric Dumazet
  2010-06-02  9:49             ` [PATCH 1/2 net-next-2.6] net: Define accessors to manipulate QDISC_STATE_RUNNING Eric Dumazet
@ 2010-06-02  9:50             ` Eric Dumazet
  2010-06-02 10:25               ` David Miller
  2 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2010-06-02  9:50 UTC (permalink / raw)
  To: Stephen Hemminger, David Miller; +Cc: alexander.h.duyck, netdev

__QDISC_STATE_RUNNING is always changed while the qdisc lock is held.

We can avoid two atomic operations in the xmit path if we move this bit
into a new __state container.

The location of this __state container is carefully chosen so that the
fast path only dirties one qdisc cache line.

The THROTTLED bit could later be moved into this __state location too,
to avoid dirtying the first qdisc cache line.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sch_generic.h |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9707dae..b3591e4 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -23,11 +23,17 @@ struct qdisc_rate_table {
 };
 
 enum qdisc_state_t {
-	__QDISC_STATE_RUNNING,
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
 };
 
+/*
+ * following bits are only changed while qdisc lock is held
+ */
+enum qdisc___state_t {
+	__QDISC___STATE_RUNNING,
+};
+
 struct qdisc_size_table {
 	struct list_head	list;
 	struct tc_sizespec	szopts;
@@ -72,23 +78,24 @@ struct Qdisc {
 	unsigned long		state;
 	struct sk_buff_head	q;
 	struct gnet_stats_basic_packed bstats;
+	unsigned long		__state;
 	struct gnet_stats_queue	qstats;
 	struct rcu_head     rcu_head;
 };
 
 static inline bool qdisc_is_running(struct Qdisc *qdisc)
 {
-	return test_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+	return test_bit(__QDISC___STATE_RUNNING, &qdisc->__state);
 }
 
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
-	return !test_and_set_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+	return !__test_and_set_bit(__QDISC___STATE_RUNNING, &qdisc->__state);
 }
 
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
-	clear_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+	__clear_bit(__QDISC___STATE_RUNNING, &qdisc->__state);
 }
 
 struct Qdisc_class_ops {




* Re: [PATCH 1/2 net-next-2.6] net: Define accessors to manipulate QDISC_STATE_RUNNING
  2010-06-02  9:49             ` [PATCH 1/2 net-next-2.6] net: Define accessors to manipulate QDISC_STATE_RUNNING Eric Dumazet
@ 2010-06-02 10:24               ` David Miller
  0 siblings, 0 replies; 26+ messages in thread
From: David Miller @ 2010-06-02 10:24 UTC (permalink / raw)
  To: eric.dumazet; +Cc: shemminger, alexander.h.duyck, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 02 Jun 2010 11:49:14 +0200

> Define three helpers to manipulate the QDISC_STATE_RUNNING flag, which a
> second patch will move to another location.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good, applied.


* Re: [PATCH 2/2 net-next-2.6] net: QDISC_STATE_RUNNING dont need atomic bit ops
  2010-06-02  9:50             ` [PATCH 2/2 net-next-2.6] net: QDISC_STATE_RUNNING dont need atomic bit ops Eric Dumazet
@ 2010-06-02 10:25               ` David Miller
  0 siblings, 0 replies; 26+ messages in thread
From: David Miller @ 2010-06-02 10:25 UTC (permalink / raw)
  To: eric.dumazet; +Cc: shemminger, alexander.h.duyck, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 02 Jun 2010 11:50:33 +0200

> __QDISC_STATE_RUNNING is always changed while qdisc lock is held.
> 
> We can avoid two atomic operations in xmit path, if we move this bit in
> a new __state container.
> 
> Location of this __state container is carefully chosen so that fast path
> only dirties one qdisc cache line.
> 
> THROTTLED bit could later be moved into this __state location too, to
> avoid dirtying first qdisc cache line.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Also looks good, applied.

One thing about naming.  Here, even though we name the type and the
state member with two leading underscores, the things that actually
modify these state members are helper functions with names that lack
double underscores.

So really, reading the code, you don't see that special considerations
for these state changes might be necessary.


* Re: [PATCH] net: add additional lock to qdisc to increase throughput
  2010-05-21 20:04   ` Duyck, Alexander H
@ 2010-06-02 12:10     ` David Miller
  2010-06-02 14:52       ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2010-06-02 12:10 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: eric.dumazet, netdev

From: "Duyck, Alexander H" <alexander.h.duyck@intel.com>
Date: Fri, 21 May 2010 13:04:20 -0700

> Eric Dumazet wrote:
>> Tests with following script gave a boost from ~50.000 pps to ~600.000
>> pps on a dual quad core machine (E5450 @3.00GHz), tg3 driver.
>> (A single netperf flow can reach ~800.000 pps on this platform)
>> 
>> for j in `seq 0 3`; do
>>   for i in `seq 0 7`; do
>>     netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
>>   done
>> done
> 
> Running the same script with your patch my results went from 200Kpps to 1.2Mpps on a dual Xeon 5570.
> 
> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>

Applied, thanks guys.


* Re: [PATCH] net: add additional lock to qdisc to increase throughput
  2010-06-02 12:10     ` David Miller
@ 2010-06-02 14:52       ` Eric Dumazet
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2010-06-02 14:52 UTC (permalink / raw)
  To: David Miller; +Cc: alexander.h.duyck, netdev

Le mercredi 02 juin 2010 à 05:10 -0700, David Miller a écrit :
> From: "Duyck, Alexander H" <alexander.h.duyck@intel.com>
> Date: Fri, 21 May 2010 13:04:20 -0700
> 
> > Eric Dumazet wrote:
> >> Tests with following script gave a boost from ~50.000 pps to ~600.000
> >> pps on a dual quad core machine (E5450 @3.00GHz), tg3 driver.
> >> (A single netperf flow can reach ~800.000 pps on this platform)
> >> 
> >> for j in `seq 0 3`; do
> >>   for i in `seq 0 7`; do
> >>     netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
> >>   done
> >> done
> > 
> > Running the same script with your patch my results went from 200Kpps to 1.2Mpps on a dual Xeon 5570.
> > 
> > Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> Applied, thanks guys.

Thanks David, I realize you did all the necessary changes after the
qdisc_run_begin()/qdisc_is_running() integration ;)


