* [RFC] net: remove busylock
@ 2016-05-19 17:08 Eric Dumazet
  2016-05-19 18:03 ` Alexander Duyck
  2016-05-19 18:12 ` David Miller
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2016-05-19 17:08 UTC (permalink / raw)
  To: netdev; +Cc: Alexander Duyck

busylock was added at the time we had expensive ticket spinlocks

(commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
lock to qdisc to increase throughput"))

Now that kernel spinlocks are MCS, this busylock thing is no longer
relevant. It is slowing things down a bit.


With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host with 48 hyperthreads.

lpaa5:~# sar -n DEV 4 4 |grep eth0
10:05:44         eth0 798951.25 798951.75  52276.22  52275.26      0.00      0.00      0.50
10:05:48         eth0 798576.00 798572.75  52251.24  52250.39      0.00      0.00      0.75
10:05:52         eth0 798746.00 798748.75  52262.89  52262.13      0.00      0.00      0.50
10:05:56         eth0 798303.25 798291.50  52235.22  52233.10      0.00      0.00      0.50
Average:         eth0 798644.12 798641.19  52256.39  52255.22      0.00      0.00      0.56

Disabling busylock (by using a local sysctl)

lpaa5:~# sar -n DEV 4 4 |grep eth0
10:05:14         eth0 864085.75 864091.50  56538.09  56537.46      0.00      0.00      0.50
10:05:18         eth0 864734.75 864729.25  56580.35  56579.05      0.00      0.00      0.75
10:05:22         eth0 864366.00 864361.50  56556.74  56555.00      0.00      0.00      0.50
10:05:26         eth0 864246.50 864248.75  56549.19  56547.65      0.00      0.00      0.50
Average:         eth0 864358.25 864357.75  56556.09  56554.79      0.00      0.00      0.56

That would be an 8% increase.
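
For reference, the heuristic being removed lives in __dev_xmit_skb() and
looks roughly like this (simplified from memory; the deactivated and
bypass paths are elided):

        contended = qdisc_is_running(q);
        if (unlikely(contended))
                spin_lock(&q->busylock);        /* serialize contended enqueuers */

        spin_lock(root_lock);                   /* qdisc_lock(q) */
        rc = q->enqueue(skb, q) & NET_XMIT_MASK;
        if (qdisc_run_begin(q)) {
                if (unlikely(contended)) {
                        spin_unlock(&q->busylock);
                        contended = false;
                }
                __qdisc_run(q);                 /* dequeue/xmit loop */
        }
        spin_unlock(root_lock);
        if (unlikely(contended))
                spin_unlock(&q->busylock);

Contended enqueuers first serialize on busylock, so the cpu owning
__QDISC___STATE_RUNNING can re-grab the root lock quickly and keep
draining the qdisc.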

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-19 17:08 [RFC] net: remove busylock Eric Dumazet
@ 2016-05-19 18:03 ` Alexander Duyck
  2016-05-19 18:41   ` Rick Jones
                     ` (2 more replies)
  2016-05-19 18:12 ` David Miller
  1 sibling, 3 replies; 22+ messages in thread
From: Alexander Duyck @ 2016-05-19 18:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Alexander Duyck

On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> busylock was added at the time we had expensive ticket spinlocks
>
> (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> lock to qdisc to increase throughput")
>
> Now kernel spinlocks are MCS, this busylock things is no longer
> relevant. It is slowing down things a bit.
>
>
> With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host with 48 hyperthreads.
>
> lpaa5:~# sar -n DEV 4 4 |grep eth0
> 10:05:44         eth0 798951.25 798951.75  52276.22  52275.26      0.00      0.00      0.50
> 10:05:48         eth0 798576.00 798572.75  52251.24  52250.39      0.00      0.00      0.75
> 10:05:52         eth0 798746.00 798748.75  52262.89  52262.13      0.00      0.00      0.50
> 10:05:56         eth0 798303.25 798291.50  52235.22  52233.10      0.00      0.00      0.50
> Average:         eth0 798644.12 798641.19  52256.39  52255.22      0.00      0.00      0.56
>
> Disabling busylock (by using a local sysctl)
>
> lpaa5:~# sar -n DEV 4 4 |grep eth0
> 10:05:14         eth0 864085.75 864091.50  56538.09  56537.46      0.00      0.00      0.50
> 10:05:18         eth0 864734.75 864729.25  56580.35  56579.05      0.00      0.00      0.75
> 10:05:22         eth0 864366.00 864361.50  56556.74  56555.00      0.00      0.00      0.50
> 10:05:26         eth0 864246.50 864248.75  56549.19  56547.65      0.00      0.00      0.50
> Average:         eth0 864358.25 864357.75  56556.09  56554.79      0.00      0.00      0.56
>
> That would be a 8 % increase.

The main point of the busy lock is to deal with the bulk throughput
case, not the latency case which would be relatively well behaved.
The problem wasn't really related to lock bouncing slowing things
down.  It was the fairness between the threads that was killing us
because the dequeue needs to have priority.

The main problem that the busy lock solved was that you could
start a number of stream tests equal to the number of CPUs in a given
system, and performance would drop off a cliff: you would drop almost
all the packets for almost all the streams, because the qdisc never had
a chance to drain, since each pass would be (number of CPUs - 1)
enqueues followed by 1 dequeue.

What we need, if we are going to get rid of the busy lock, is some
sort of priority locking mechanism that would allow the dequeue thread
to jump to the head of the line when it is attempting to take the lock.
Otherwise you end up spending all your time enqueuing packets into
oblivion, because the qdiscs just overflow without the busy lock in
place.
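
Very roughly, a userspace sketch of the kind of priority I mean (made-up
names, C11 atomics plus pthreads, not kernel code):

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

struct prio_lock {
        pthread_mutex_t lock;
        atomic_int      dequeuer_waiting;
};

/* Enqueue side: back off while the dequeuer has announced itself.
 * There is a small race window if the flag is set just after the check,
 * but this is only a priority heuristic, much like busylock itself.
 */
static void enqueue_lock(struct prio_lock *pl)
{
        while (atomic_load(&pl->dequeuer_waiting))
                sched_yield();
        pthread_mutex_lock(&pl->lock);
}

/* Dequeue side: announce intent so enqueuers back off, then take the lock. */
static void dequeue_lock(struct prio_lock *pl)
{
        atomic_store(&pl->dequeuer_waiting, 1);
        pthread_mutex_lock(&pl->lock);
        atomic_store(&pl->dequeuer_waiting, 0);
}

static void prio_unlock(struct prio_lock *pl)
{
        pthread_mutex_unlock(&pl->lock);
}

The point is just that the dequeue path never waits behind a long line of
fair waiters; whether that is done with a flag, a separate lock, or
something smarter is an open question.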

- Alex

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-19 17:08 [RFC] net: remove busylock Eric Dumazet
  2016-05-19 18:03 ` Alexander Duyck
@ 2016-05-19 18:12 ` David Miller
  2016-05-19 18:44   ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: David Miller @ 2016-05-19 18:12 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, aduyck

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 19 May 2016 10:08:36 -0700

> busylock was added at the time we had expensive ticket spinlocks
> 
> (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> lock to qdisc to increase throughput")
> 
> Now kernel spinlocks are MCS, this busylock things is no longer
> relevant. It is slowing down things a bit.
 ...
> That would be a 8 % increase.

Presumably only for x86-64 and other platforms that actually are using
the ticket spinlocks.

For others it would be a regression.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-19 18:03 ` Alexander Duyck
@ 2016-05-19 18:41   ` Rick Jones
  2016-05-19 18:56   ` Eric Dumazet
  2016-05-20  7:29   ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 22+ messages in thread
From: Rick Jones @ 2016-05-19 18:41 UTC (permalink / raw)
  To: Alexander Duyck, Eric Dumazet; +Cc: netdev, Alexander Duyck

On 05/19/2016 11:03 AM, Alexander Duyck wrote:
> On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host with 48 hyperthreads.
...
>>
>> That would be a 8 % increase.
>
> The main point of the busy lock is to deal with the bulk throughput
> case, not the latency case which would be relatively well behaved.
> The problem wasn't really related to lock bouncing slowing things
> down.  It was the fairness between the threads that was killing us
> because the dequeue needs to have priority.

Quibbledrift... While the origins of the netperf TCP_RR test center on 
measuring latency, I'm not sure I'd call 200 of them running 
concurrently a latency test.  Indeed it may be neither fish nor fowl, 
but it will certainly be exercising the basic packet send/receive path 
rather fully and is likely a reasonable proxy for aggregate small packet 
performance.

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-19 18:12 ` David Miller
@ 2016-05-19 18:44   ` Eric Dumazet
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2016-05-19 18:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, aduyck

On Thu, 2016-05-19 at 11:12 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 19 May 2016 10:08:36 -0700
> 
> > busylock was added at the time we had expensive ticket spinlocks
> > 
> > (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> > lock to qdisc to increase throughput")
> > 
> > Now kernel spinlocks are MCS, this busylock things is no longer
> > relevant. It is slowing down things a bit.
>  ...
> > That would be a 8 % increase.
> 
> Presumably only for x86-64 and other platforms that actually are using
> the ticket spinlocks.
> 
> For others it would be a regression.

Ticket spinlocks are actually gone; these are now MCS.

But yes, point taken ;)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-19 18:03 ` Alexander Duyck
  2016-05-19 18:41   ` Rick Jones
@ 2016-05-19 18:56   ` Eric Dumazet
  2016-05-19 19:35     ` Eric Dumazet
  2016-05-20  7:29   ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2016-05-19 18:56 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, Alexander Duyck

On Thu, 2016-05-19 at 11:03 -0700, Alexander Duyck wrote:
> On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > busylock was added at the time we had expensive ticket spinlocks
> >
> > (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> > lock to qdisc to increase throughput")
> >
> > Now kernel spinlocks are MCS, this busylock things is no longer
> > relevant. It is slowing down things a bit.
> >
> >
> > With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host with 48 hyperthreads.
> >
> > lpaa5:~# sar -n DEV 4 4 |grep eth0
> > 10:05:44         eth0 798951.25 798951.75  52276.22  52275.26      0.00      0.00      0.50
> > 10:05:48         eth0 798576.00 798572.75  52251.24  52250.39      0.00      0.00      0.75
> > 10:05:52         eth0 798746.00 798748.75  52262.89  52262.13      0.00      0.00      0.50
> > 10:05:56         eth0 798303.25 798291.50  52235.22  52233.10      0.00      0.00      0.50
> > Average:         eth0 798644.12 798641.19  52256.39  52255.22      0.00      0.00      0.56
> >
> > Disabling busylock (by using a local sysctl)
> >
> > lpaa5:~# sar -n DEV 4 4 |grep eth0
> > 10:05:14         eth0 864085.75 864091.50  56538.09  56537.46      0.00      0.00      0.50
> > 10:05:18         eth0 864734.75 864729.25  56580.35  56579.05      0.00      0.00      0.75
> > 10:05:22         eth0 864366.00 864361.50  56556.74  56555.00      0.00      0.00      0.50
> > 10:05:26         eth0 864246.50 864248.75  56549.19  56547.65      0.00      0.00      0.50
> > Average:         eth0 864358.25 864357.75  56556.09  56554.79      0.00      0.00      0.56
> >
> > That would be a 8 % increase.
> 
> The main point of the busy lock is to deal with the bulk throughput
> case, not the latency case which would be relatively well behaved.
> The problem wasn't really related to lock bouncing slowing things
> down.  It was the fairness between the threads that was killing us
> because the dequeue needs to have priority.



> 
> The main problem that the busy lock solved was the fact that you could
> start a number of stream tests equal to the number of CPUs in a given
> system and the result was that the performance would drop off a cliff
> and you would drop almost all the packets for almost all the streams
> because the qdisc never had a chance to drain because it would be CPU
> - 1 enqueues, followed by 1 dequeue.
> 
> What we need if we are going to get rid of busy lock would be some
> sort of priority locking mechanism that would allow the dequeue thread
> to jump to the head of the line if it is attempting to take the lock.
> Otherwise you end up spending all your time enqueuing packets into
> oblivion because the qdiscs just overflow without the busy lock in
> place.


Removing busylock helped in all cases I tested (at least on x86, as
David pointed out).

As I said, we need to revisit busylock now that spinlocks are different.

In one case (20 concurrent UDP netperf sessions), I even got a 500% increase.

With busylock :

lpaa5:~# sar -n DEV 4 4|grep eth0
11:33:34         eth0      9.00 115057.00      1.60  38426.92      0.00      0.00      0.50
11:33:38         eth0     13.50 113237.75      2.04  37819.69      0.00      0.00      0.75
11:33:42         eth0     13.50 111492.25      1.76  37236.58      0.00      0.00      0.75
11:33:46         eth0     12.75 111401.50      2.40  37205.93      0.00      0.00      0.75
Average:         eth0     12.19 112797.12      1.95  37672.28      0.00      0.00      0.69

Packets are dropped in HTB because we hit a limit of 1000 packets there

- 100.00%  netperf  [kernel.kallsyms]  [k] kfree_skb
   - kfree_skb
      -  100.00% htb_enqueue
            __dev_queue_xmit
            dev_queue_xmit
            ip_finish_output2
            ip_finish_output
            ip_output
            ip_local_out
         +  ip_send_skb


Presumably it would tremendously help if the actual kfree_skb()
was done after the qdisc lock is released, i.e. not from the qdisc->enqueue()
method.


Without busylock :

lpaa5:~# sar -n DEV 4 4|grep eth0
11:41:12         eth0     11.00 669053.50      1.99 223452.30      0.00      0.00      0.75
11:41:16         eth0      8.50 669513.25      2.27 223605.55      0.00      0.00      0.75
11:41:20         eth0      3.50 669426.50      0.90 223577.19      0.00      0.00      0.50
11:41:24         eth0      8.25 669284.00      1.42 223529.79      0.00      0.00      0.50
Average:         eth0      7.81 669319.31      1.65 223541.21      0.00      0.00      0.62

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-19 18:56   ` Eric Dumazet
@ 2016-05-19 19:35     ` Eric Dumazet
  2016-05-19 20:39       ` Alexander Duyck
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2016-05-19 19:35 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, Alexander Duyck

On Thu, 2016-05-19 at 11:56 -0700, Eric Dumazet wrote:

> Removing busylock helped in all cases I tested. (at least on x86 as
> David pointed out)
> 
> As I said, we need to revisit busylock now that spinlocks are different.
> 
> In one case (20 concurrent UDP netperf), I even got a 500 % increase.
> 
> With busylock :
> 
> lpaa5:~# sar -n DEV 4 4|grep eth0
> Average:         eth0     12.19 112797.12      1.95  37672.28      0.00      0.00      0.69
> 


Hmpf, my sysctl logic was inverted. Really these results made little
sense.

Sorry for the noise. At least we have a confirmed 8% gain with this
stuff ;)

> Presumably it would tremendously help if the actual kfree_skb()
> was done after qdisc lock is released, ie not from the qdisc->enqueue()
> method.
> 

This part is still valid.

We could have per-cpu storage of one skb pointer, so that we do not
have to change all the ->enqueue() prototypes.
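
Something like this, very roughly (completely untested, names invented
for illustration):

/* Qdiscs would park the skb they want to drop here instead of calling
 * kfree_skb() under the qdisc spinlock.  Preemption is disabled because
 * the qdisc lock is held, so __this_cpu_*() is fine.
 */
DEFINE_PER_CPU(struct sk_buff *, qdisc_deferred_drop);

static inline void qdisc_drop_deferred(struct sk_buff *skb)
{
        __this_cpu_write(qdisc_deferred_drop, skb);
}

/* Called by __dev_xmit_skb() right after releasing the root lock. */
static inline void qdisc_flush_deferred_drop(void)
{
        struct sk_buff *skb = __this_cpu_read(qdisc_deferred_drop);

        if (unlikely(skb)) {
                __this_cpu_write(qdisc_deferred_drop, NULL);
                kfree_skb(skb);
        }
}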

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-19 19:35     ` Eric Dumazet
@ 2016-05-19 20:39       ` Alexander Duyck
  2016-05-20  4:49         ` John Fastabend
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Duyck @ 2016-05-19 20:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Alexander Duyck, Jesper Dangaard Brouer, John Fastabend

On Thu, May 19, 2016 at 12:35 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-05-19 at 11:56 -0700, Eric Dumazet wrote:
>
>> Removing busylock helped in all cases I tested. (at least on x86 as
>> David pointed out)
>>
>> As I said, we need to revisit busylock now that spinlocks are different.
>>
>> In one case (20 concurrent UDP netperf), I even got a 500 % increase.
>>
>> With busylock :
>>
>> lpaa5:~# sar -n DEV 4 4|grep eth0
>> Average:         eth0     12.19 112797.12      1.95  37672.28      0.00      0.00      0.69
>>
>
>
> Hmpf, my sysctl logic was inverted. Really these results made little
> sense.
>
> Sorry for the noise. At least we have 8% confirmed gain with this
> stuff ;)

Unfortunately I see a 21% regression when you place the qdisc under
load from multiple CPUs/nodes.  In my testing I dropped from around
1.05 Mpps to around 827 Kpps when I removed the busylock.

My test setup is pretty straightforward and the results I have seen
are consistent between runs.  I have a system with 32 threads / 16
cores spread over 2 NUMA nodes.  I reduce the number of queues on a
10Gb/s NIC to 1.  I kill irqbalance and disable CPU power states.  I
then start a quick "for" loop that schedules one UDP_STREAM
netperf session on each CPU using a small payload size such as 32 bytes.

On a side note, if I move everything onto one node I can push about
2.4 Mpps and the busylock doesn't seem to really impact things, but if
I go to both nodes then I see the performance regression as described
above.  I was thinking about it and I don't think the MCS type locks
would have that much of an impact.  If anything I think that xmit_more
probably has a bigger effect given that it allows us to grab multiple
frames with each fetch and thereby reduce the lock contention on the
dequeue side.

>> Presumably it would tremendously help if the actual kfree_skb()
>> was done after qdisc lock is released, ie not from the qdisc->enqueue()
>> method.
>>
>
> This part is still valid.
>
> We could have a per cpu storage of one skb pointer, so that we do not
> have to change all ->enqueue() prototypes.

I fully agree with that.

One thought I had is that if we could move to a lockless dequeue path for
the qdisc, then we could also get rid of the busy lock.  I know there
has been talk in the past about doing away with qdisc locks or changing
the inner mechanics of the qdisc itself; I'm CCing Jesper and John
for that reason.  Maybe it is time for us to start looking into that
so we can start cleaning out some of the old cruft we have in this
path.

- Alex

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-19 20:39       ` Alexander Duyck
@ 2016-05-20  4:49         ` John Fastabend
  2016-05-20  4:56           ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: John Fastabend @ 2016-05-20  4:49 UTC (permalink / raw)
  To: Alexander Duyck, Eric Dumazet
  Cc: netdev, Alexander Duyck, Jesper Dangaard Brouer, John Fastabend

On 16-05-19 01:39 PM, Alexander Duyck wrote:
> On Thu, May 19, 2016 at 12:35 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Thu, 2016-05-19 at 11:56 -0700, Eric Dumazet wrote:
>>
>>> Removing busylock helped in all cases I tested. (at least on x86 as
>>> David pointed out)
>>>
>>> As I said, we need to revisit busylock now that spinlocks are different.
>>>
>>> In one case (20 concurrent UDP netperf), I even got a 500 % increase.
>>>
>>> With busylock :
>>>
>>> lpaa5:~# sar -n DEV 4 4|grep eth0
>>> Average:         eth0     12.19 112797.12      1.95  37672.28      0.00      0.00      0.69
>>>
>>
>>
>> Hmpf, my sysctl logic was inverted. Really these results made little
>> sense.
>>
>> Sorry for the noise. At least we have 8% confirmed gain with this
>> stuff ;)
> 
> Unfortunately I see a 21% regression when you place the qdisc under
> load from multiple CPUs/nodes.  In my testing I dropped from around
> 1.05 Mpps to around 827 Kpps when I removed the busylock.
> 
> My test setup is pretty straight forward and the results I have seen
> are consistent between runs.  I have a system with 32 threads / 16
> cores spread over 2 NUMA nodes.  I reduce the number of queues on a
> 10Gb/s NIC to 1.  I kill irqbalance and disable CPU power states.  I
> then start a quick "for" loop that will schedule one UDP_STREAM
> netperf session on each CPU using a small payload size like 32.
> 
> On a side note, if I move everything onto one node I can push about
> 2.4 Mpps and the busylock doesn't seem to really impact things, but if
> I go to both nodes then I see the performance regression as described
> above.  I was thinking about it and I don't think the MCS type locks
> would have that much of an impact.  If anything I think that xmit_more
> probably has a bigger effect given that it allows us to grab multiple
> frames with each fetch and thereby reduce the lock contention on the
> dequeue side.
> 
>>> Presumably it would tremendously help if the actual kfree_skb()
>>> was done after qdisc lock is released, ie not from the qdisc->enqueue()
>>> method.
>>>
>>
>> This part is still valid.
>>
>> We could have a per cpu storage of one skb pointer, so that we do not
>> have to change all ->enqueue() prototypes.
> 
> I fully agree with that.
> 
> One thought I had is if we could move to a lockless dequeue path for
> the qdisc then we could also get rid of the busy lock.  I know there
> has been talk about doing away with qdisc locks or changing the inner
> mechanics of the qdisc itself in the past, I'm CCing Jesper and John
> for that reason.  Maybe it is time for us to start looking into that
> so we can start cleaning out some of the old cruft we have in this
> path.
> 
> - Alex
> 

I plan to start looking at this again in June when I have some
more time, FWIW. The last set of RFCs I sent out bypassed both the
qdisc lock and the busy poll lock. I remember thinking this was a
net win at the time, but I only did very basic testing, e.g. firing
up n sessions of pktgen.

And since we are talking about cruft, I always thought the gso_skb
requeue logic could be done away with as well. As far as I can tell
it must be left over from some historic code that was refactored
or deleted in pre-git days. It would be much better, I think (no data),
to use byte queue limits or some other way to ensure the driver can
consume the packet, versus popping and pushing skbs around.

.John

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-20  4:49         ` John Fastabend
@ 2016-05-20  4:56           ` Eric Dumazet
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2016-05-20  4:56 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexander Duyck, netdev, Alexander Duyck, Jesper Dangaard Brouer,
	John Fastabend

On Thu, 2016-05-19 at 21:49 -0700, John Fastabend wrote:

> I plan to start looking at this again in June when I have some
> more time FWIW. The last set of RFCs I sent out bypassed both the
> qdisc lock and the busy poll lock. I remember thinking this was a
> net win at the time but I only did very basic testing e.g. firing
> up n sessions of pktgen.
> 
> And because we are talking about cruft I always thought the gso_skb
> requeue logic could be done away with as well. As far as I can tell
> it must be there from some historic code that has been re-factored
> or deleted pre-git days. It would be much better I think (no data)
> to use byte queue limits or some other way to ensure the driver can
> consume the packet vs popping and pushing skbs around.

Problem is: byte queue limits can tell the qdisc to send one packet, which
happens to be a GSO packet needing software segmentation.

(BQL does not know the size of the next packet to be dequeued from the
qdisc.)

Let's say this GSO packet has 45 segs.

Then the driver has limited TX ring space and only accepts 10 segs,
or the BQL budget is simply consumed after 10 segs.

You need to requeue the remaining 35 segs.

So, for example, the following patch does not even help the requeue syndrome.

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 269dd71b3828..a440c059fbcf 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -60,7 +60,7 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 				 const struct netdev_queue *txq,
 				 int *packets)
 {
-	int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
+	int bytelimit = qdisc_avail_bulklimit(txq) - qdisc_skb_cb(skb)->pkt_len;
 
 	while (bytelimit > 0) {
 		struct sk_buff *nskb = q->dequeue(q);
@@ -68,7 +68,7 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 		if (!nskb)
 			break;
 
-		bytelimit -= nskb->len; /* covers GSO len */
+		bytelimit -= qdisc_skb_cb(nskb)->pkt_len;
 		skb->next = nskb;
 		skb = nskb;
 		(*packets)++; /* GSO counts as one pkt */

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-19 18:03 ` Alexander Duyck
  2016-05-19 18:41   ` Rick Jones
  2016-05-19 18:56   ` Eric Dumazet
@ 2016-05-20  7:29   ` Jesper Dangaard Brouer
  2016-05-20 13:11     ` Eric Dumazet
  2 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-20  7:29 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: brouer, Eric Dumazet, netdev, Alexander Duyck, John Fastabend,
	Jamal Hadi Salim

On Thu, 19 May 2016 11:03:32 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > busylock was added at the time we had expensive ticket spinlocks
> >
> > (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> > lock to qdisc to increase throughput")
> >
> > Now kernel spinlocks are MCS, this busylock things is no longer
> > relevant. It is slowing down things a bit.
> >
> >
> > With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host with 48 hyperthreads.
> >
[...]
> 
> The main point of the busy lock is to deal with the bulk throughput
> case, not the latency case which would be relatively well behaved.
> The problem wasn't really related to lock bouncing slowing things
> down.  It was the fairness between the threads that was killing us
> because the dequeue needs to have priority.

Yes, exactly.
 
> The main problem that the busy lock solved was the fact that you could
> start a number of stream tests equal to the number of CPUs in a given
> system and the result was that the performance would drop off a cliff
> and you would drop almost all the packets for almost all the streams
> because the qdisc never had a chance to drain because it would be CPU
> - 1 enqueues, followed by 1 dequeue.

Notice 1 enqueue does not guarantee 1 dequeue.  If one CPU has entered
dequeue/xmit state (setting __QDISC___STATE_RUNNING) then other CPUs
will just enqueue.  Thus, N CPUs will enqueue (a packet each), and 1 CPU
will dequeue a packet.

The problem arises due to the fairness of the ticket spinlock, and AFAIK
the MCS lock is even more fair.  All enqueues get their turn before
the dequeue can move forward.  And the qdisc dequeue CPU has an
unfortunate pattern of releasing the qdisc root_lock and acquiring it
again (qdisc_restart + sch_direct_xmit). Thus, while dequeue is waiting
to re-acquire the root_lock, N CPUs will enqueue packets.

The whole idea behind allowing bulk qdisc dequeue was to mitigate this,
by allowing dequeue to do more work while holding the lock.

You mention HTB.  Notice HTB does not take advantage of bulk dequeue.
Have you tried to enable/allow HTB to bulk dequeue?


> What we need if we are going to get rid of busy lock would be some
> sort of priority locking mechanism that would allow the dequeue thread
> to jump to the head of the line if it is attempting to take the lock.
> Otherwise you end up spending all your time enqueuing packets into
> oblivion because the qdiscs just overflow without the busy lock in
> place.

Exactly. The qdisc locking scheme is designed to only allow one
dequeuing CPU to run (via state __QDISC___STATE_RUNNING).  Jamal told
me this was an optimization.  Maybe this optimization "broke" when
locking got fair?

I don't want to offend the original qdisc designers, but when I look at
the qdisc locking code, I keep thinking this scheme is broken.

Wouldn't it be better to have a separate enqueue lock and a dequeue
lock (with a producer/consumer queue)?

Once we get John's lockless qdisc scheme, the individual qdiscs
can implement a locking scheme like this, and we can keep the old
locking scheme intact for legacy qdiscs. Or we can clean up this locking
scheme and update the legacy qdiscs to use an MPMC queue?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-20  7:29   ` Jesper Dangaard Brouer
@ 2016-05-20 13:11     ` Eric Dumazet
  2016-05-20 13:47       ` Eric Dumazet
  2016-05-20 16:01       ` John Fastabend
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2016-05-20 13:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, netdev, Alexander Duyck, John Fastabend,
	Jamal Hadi Salim

On Fri, 2016-05-20 at 09:29 +0200, Jesper Dangaard Brouer wrote:


> The hole idea behind allowing bulk qdisc dequeue, was to mitigate this,
> by allowing dequeue to do more work, while holding the lock.
> 
> You mention HTB.  Notice HTB does not take advantage of bulk dequeue.
> Have you tried to enable/allow HTB to bulk dequeue?
> 

Well, __QDISC___STATE_RUNNING means exactly that: one CPU is dequeueing
many packets from the qdisc and transmitting them to the device.

It is generic for any kind of qdisc.

HTB bulk dequeue would have to call ->dequeue() multiple times. If you do
this while holding the qdisc spinlock, you block other CPUs from doing
concurrent ->enqueue(), adding latencies (always the same trade-off...).

HTB won't anytime soon have separate protections for ->enqueue()
and ->dequeue(). Have you looked at this monster? I did, many
times...

Note that I am working on a patch to transform __QDISC___STATE_RUNNING
to a seqcount so that we can grab stats without holding the qdisc lock.
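
The rough shape is (not the actual patch, and the 'running' field name is
just for illustration):

        /* dequeue owner, where we currently set/clear __QDISC___STATE_RUNNING */
        write_seqcount_begin(&q->running);
        /* ... dequeue / xmit loop ... */
        write_seqcount_end(&q->running);

        /* stats reader, no qdisc root lock needed */
        unsigned int seq;

        do {
                seq = read_seqcount_begin(&q->running);
                /* copy qdisc counters into the dump buffer */
        } while (read_seqcount_retry(&q->running, seq));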

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-20 13:11     ` Eric Dumazet
@ 2016-05-20 13:47       ` Eric Dumazet
  2016-05-20 14:16         ` Eric Dumazet
  2016-05-20 16:01       ` John Fastabend
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2016-05-20 13:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, netdev, Alexander Duyck, John Fastabend,
	Jamal Hadi Salim

On Fri, 2016-05-20 at 06:11 -0700, Eric Dumazet wrote:
> On Fri, 2016-05-20 at 09:29 +0200, Jesper Dangaard Brouer wrote:
> 
> 
> > The hole idea behind allowing bulk qdisc dequeue, was to mitigate this,
> > by allowing dequeue to do more work, while holding the lock.
> > 
> > You mention HTB.  Notice HTB does not take advantage of bulk dequeue.
> > Have you tried to enable/allow HTB to bulk dequeue?
> > 
> 
> Well, __QDISC___STATE_RUNNING means exactly that : one cpu is dequeueing
> many packets from the qdisc and tx them to the device.
> 
> It is generic for any kind of qdisc.
> 
> HTB bulk dequeue would have to call ->dequeue() mutiple times. If you do
> this while holding qdisc spinlock, you block other cpus from doing
> concurrent ->enqueue(), adding latencies (always the same trade off...)
> 
> HTB wont be anytime soon have separate protections for the ->enqueue()
> and the ->dequeue(). Have you looked at this monster ? I did, many
> times...
> 
> Note that I am working on a patch to transform __QDISC___STATE_RUNNING
> to a seqcount do that we can grab stats without holding the qdisc lock.

Side note: __qdisc_run() could probably avoid a __netif_schedule()
when it breaks the loop, if another CPU is busy spinning on the qdisc lock.

-> Fewer (spurious) TX softirq invocations, so less chance to trigger the
infamous ksoftirqd bug we discussed lately.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-20 13:47       ` Eric Dumazet
@ 2016-05-20 14:16         ` Eric Dumazet
  2016-05-20 17:49           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2016-05-20 14:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, netdev, Alexander Duyck, John Fastabend,
	Jamal Hadi Salim

On Fri, 2016-05-20 at 06:47 -0700, Eric Dumazet wrote:
> On Fri, 2016-05-20 at 06:11 -0700, Eric Dumazet wrote:
> > On Fri, 2016-05-20 at 09:29 +0200, Jesper Dangaard Brouer wrote:
> > 
> > 
> > > The hole idea behind allowing bulk qdisc dequeue, was to mitigate this,
> > > by allowing dequeue to do more work, while holding the lock.
> > > 
> > > You mention HTB.  Notice HTB does not take advantage of bulk dequeue.
> > > Have you tried to enable/allow HTB to bulk dequeue?
> > > 
> > 
> > Well, __QDISC___STATE_RUNNING means exactly that : one cpu is dequeueing
> > many packets from the qdisc and tx them to the device.
> > 
> > It is generic for any kind of qdisc.
> > 
> > HTB bulk dequeue would have to call ->dequeue() mutiple times. If you do
> > this while holding qdisc spinlock, you block other cpus from doing
> > concurrent ->enqueue(), adding latencies (always the same trade off...)
> > 
> > HTB wont be anytime soon have separate protections for the ->enqueue()
> > and the ->dequeue(). Have you looked at this monster ? I did, many
> > times...
> > 
> > Note that I am working on a patch to transform __QDISC___STATE_RUNNING
> > to a seqcount do that we can grab stats without holding the qdisc lock.
> 
> Slide note : __qdisc_run() could probably avoid a __netif_schedule()
> when it breaks the loop, if another cpu is busy spinning on qdisc lock.
> 
> -> Less (spurious) TX softirq invocations, so less chance to trigger the
> infamous ksoftirqd bug we discussed lately.

Also note that in our case, we have HTB on a bonding device, and
FQ/pacing on slaves.

Since bonding pretends to be multiqueue, TCQ_F_ONETXQUEUE is not set
on sch->flags when HTB is installed at the bonding device root.

We might add a flag to tell the qdisc layer that a device is virtual and
could benefit from bulk dequeue, since the ultimate TX queue is located
on another (physical) netdev, possibly MQ enabled.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-20 13:11     ` Eric Dumazet
  2016-05-20 13:47       ` Eric Dumazet
@ 2016-05-20 16:01       ` John Fastabend
  1 sibling, 0 replies; 22+ messages in thread
From: John Fastabend @ 2016-05-20 16:01 UTC (permalink / raw)
  To: Eric Dumazet, Jesper Dangaard Brouer
  Cc: Alexander Duyck, netdev, Alexander Duyck, John Fastabend,
	Jamal Hadi Salim

On 16-05-20 06:11 AM, Eric Dumazet wrote:
> On Fri, 2016-05-20 at 09:29 +0200, Jesper Dangaard Brouer wrote:
> 
> 
>> The hole idea behind allowing bulk qdisc dequeue, was to mitigate this,
>> by allowing dequeue to do more work, while holding the lock.
>>
>> You mention HTB.  Notice HTB does not take advantage of bulk dequeue.
>> Have you tried to enable/allow HTB to bulk dequeue?
>>
> 
> Well, __QDISC___STATE_RUNNING means exactly that : one cpu is dequeueing
> many packets from the qdisc and tx them to the device.
> 
> It is generic for any kind of qdisc.
> 
> HTB bulk dequeue would have to call ->dequeue() mutiple times. If you do
> this while holding qdisc spinlock, you block other cpus from doing
> concurrent ->enqueue(), adding latencies (always the same trade off...)
> 
> HTB wont be anytime soon have separate protections for the ->enqueue()
> and the ->dequeue(). Have you looked at this monster ? I did, many
> times...
> 

I came to the conclusion that we just need to write a new, modern
version of HTB at some point. Easier said than done, however.

> Note that I am working on a patch to transform __QDISC___STATE_RUNNING
> to a seqcount do that we can grab stats without holding the qdisc lock.
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-20 14:16         ` Eric Dumazet
@ 2016-05-20 17:49           ` Jesper Dangaard Brouer
  2016-05-20 21:32             ` Eric Dumazet
  2016-05-24 13:50             ` [RFC] net: remove busylock David Laight
  0 siblings, 2 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-20 17:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, netdev, Alexander Duyck, John Fastabend,
	Jamal Hadi Salim, brouer

On Fri, 20 May 2016 07:16:55 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Since bonding pretends to be multiqueue, TCQ_F_ONETXQUEUE is not set
> on sch->flags when HTB is installed at the bonding device root.

It would be cool if you could run a test with busylock removed and
HTB allowed to bulk dequeue.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-20 17:49           ` Jesper Dangaard Brouer
@ 2016-05-20 21:32             ` Eric Dumazet
  2016-05-23  9:50               ` Jesper Dangaard Brouer
  2016-05-24 13:50             ` [RFC] net: remove busylock David Laight
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2016-05-20 21:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, netdev, Alexander Duyck, John Fastabend,
	Jamal Hadi Salim

On Fri, 2016-05-20 at 19:49 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 20 May 2016 07:16:55 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > Since bonding pretends to be multiqueue, TCQ_F_ONETXQUEUE is not set
> > on sch->flags when HTB is installed at the bonding device root.
> 
> If would be cool if you could run a test with removed busylock and
> allow HTB to bulk dequeue.

I added a /sys/class/net/eth0/tx_bulk_limit to tune the number of
packets that we could bulk dequeue from a virtual device 
(no BQL on them)

200 TCP_RR through HTB on eth0 (bonding device)

1) busylock enabled
--------------------

With tx_bulk_limit set to 8, we get a 12.7% increase.

lpaa23:~# for f in `seq 1 16`; do echo $f|tee /sys/class/net/eth0/tx_bulk_limit; sar -n DEV 3 3|grep eth0|grep Average; done
1
Average:         eth0 868625.67 868487.44  57055.69  56826.37      0.00      0.00      0.56
2
Average:         eth0 927081.67 926920.78  60923.83  60649.90      0.00      0.00      0.44
3
Average:         eth0 957678.11 957554.00  62877.04  62653.89      0.00      0.00      0.56
4
Average:         eth0 966912.44 966747.33  63532.72  63255.51      0.00      0.00      0.56
5
Average:         eth0 973137.56 972950.44  63958.31  63661.39      0.00      0.00      0.44
6
Average:         eth0 958589.22 958449.44  62961.79  62712.56      0.00      0.00      0.67
7
Average:         eth0 960737.67 960672.22  62968.34  62857.97      0.00      0.00      0.44
8
Average:         eth0 979271.78 979201.67  64199.47  64070.84      0.00      0.00      0.56
9
Average:         eth0 982464.33 982390.33  64418.42  64278.93      0.00      0.00      0.56
10
Average:         eth0 982698.00 982578.22  64495.25  64291.28      0.00      0.00      0.44
11
Average:         eth0 981862.22 981746.00  64438.16  64236.31      0.00      0.00      0.56
12
Average:         eth0 983277.44 983096.33  64632.79  64327.79      0.00      0.00      0.44
13
Average:         eth0 981221.11 981018.00  64521.82  64189.26      0.00      0.00      0.67
14
Average:         eth0 981754.11 981555.89  64553.19  64224.39      0.00      0.00      0.44
15
Average:         eth0 982484.33 982301.67  64572.00  64273.38      0.00      0.00      0.44
16
Average:         eth0 978529.56 978326.67  64350.89  64013.39      0.00      0.00      0.67


2) busylock disabled
--------------------

Well, bulk dequeue helps, but does not close the gap. 
(busylock is needed)

lpaa23:~# for f in `seq 1 16`; do echo $f|tee /sys/class/net/eth0/tx_bulk_limit; sar -n DEV 3 3|grep eth0|grep Average; done
1
Average:         eth0 795408.44 795407.67  52044.66  52045.93      0.00      0.00      0.56
2
Average:         eth0 843411.78 843415.11  55185.23  55184.51      0.00      0.00      0.56
3
Average:         eth0 876175.89 876175.00  57329.50  57327.98      0.00      0.00      0.44
4
Average:         eth0 890631.22 890629.44  58274.58  58274.25      0.00      0.00      0.67
5
Average:         eth0 900672.00 900668.89  58931.29  58930.54      0.00      0.00      0.44
6
Average:         eth0 908325.78 908328.22  59432.97  59431.76      0.00      0.00      0.56
7
Average:         eth0 913895.33 913885.11  59796.89  59795.46      0.00      0.00      0.56
8
Average:         eth0 914429.11 914433.56  59832.26  59831.23      0.00      0.00      0.67
9
Average:         eth0 918701.11 918699.67  60110.68  60110.36      0.00      0.00      0.55
10
Average:         eth0 920382.33 920376.56  60223.31  60220.54      0.00      0.00      0.67
11
Average:         eth0 914341.67 914344.67  59826.25  59825.90      0.00      0.00      0.67
12
Average:         eth0 912697.00 912693.78  59718.77  59717.44      0.00      0.00      0.44
13
Average:         eth0 917392.56 917385.00  60025.79  60024.34      0.00      0.00      0.44
14
Average:         eth0 918232.89 918233.78  60081.04  60079.94      0.00      0.00      0.67
15
Average:         eth0 918377.11 918381.00  60091.14  60089.79      0.00      0.00      0.44
16
Average:         eth0 913817.56 913812.33  59792.09  59790.66      0.00      0.00      0.56

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-20 21:32             ` Eric Dumazet
@ 2016-05-23  9:50               ` Jesper Dangaard Brouer
  2016-05-23 21:24                 ` [PATCH net] net_sched: avoid too many hrtimer_start() calls Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-23  9:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, netdev, Alexander Duyck, John Fastabend,
	Jamal Hadi Salim, brouer

On Fri, 20 May 2016 14:32:40 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Fri, 2016-05-20 at 19:49 +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 20 May 2016 07:16:55 -0700
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >   
> > > Since bonding pretends to be multiqueue, TCQ_F_ONETXQUEUE is not set
> > > on sch->flags when HTB is installed at the bonding device root.  
> > 
> > If would be cool if you could run a test with removed busylock and
> > allow HTB to bulk dequeue.  
> 
> I added a /sys/class/net/eth0/tx_bulk_limit to tune the number of
> packets that we could bulk dequeue from a virtual device 
> (no BQL on them)
> 
> 200 TCP_RR through HTB on eth0 (bonding device)
> 
> 1) busylock enabled
> --------------------
> 
> With tx_bulk_limit set to 8, we get 12.7 % increase.

That is actually quite a good performance increase for this workload.


> lpaa23:~# for f in `seq 1 16`; do echo $f|tee /sys/class/net/eth0/tx_bulk_limit; sar -n DEV 3 3|grep eth0|grep Average; done
> 1
> Average:         eth0 868625.67 868487.44  57055.69  56826.37      0.00      0.00      0.56
> 2
> Average:         eth0 927081.67 926920.78  60923.83  60649.90      0.00      0.00      0.44
> 3
> Average:         eth0 957678.11 957554.00  62877.04  62653.89      0.00      0.00      0.56
> 4
> Average:         eth0 966912.44 966747.33  63532.72  63255.51      0.00      0.00      0.56
> 5
> Average:         eth0 973137.56 972950.44  63958.31  63661.39      0.00      0.00      0.44
> 6
> Average:         eth0 958589.22 958449.44  62961.79  62712.56      0.00      0.00      0.67
> 7
> Average:         eth0 960737.67 960672.22  62968.34  62857.97      0.00      0.00      0.44
> 8
> Average:         eth0 979271.78 979201.67  64199.47  64070.84      0.00      0.00      0.56
> 9
> Average:         eth0 982464.33 982390.33  64418.42  64278.93      0.00      0.00      0.56
> 10
> Average:         eth0 982698.00 982578.22  64495.25  64291.28      0.00      0.00      0.44
> 11
> Average:         eth0 981862.22 981746.00  64438.16  64236.31      0.00      0.00      0.56
> 12
> Average:         eth0 983277.44 983096.33  64632.79  64327.79      0.00      0.00      0.44
> 13
> Average:         eth0 981221.11 981018.00  64521.82  64189.26      0.00      0.00      0.67
> 14
> Average:         eth0 981754.11 981555.89  64553.19  64224.39      0.00      0.00      0.44
> 15
> Average:         eth0 982484.33 982301.67  64572.00  64273.38      0.00      0.00      0.44
> 16
> Average:         eth0 978529.56 978326.67  64350.89  64013.39      0.00      0.00      0.67
> 
> 
> 2) busylock disabled
> --------------------
> 
> Well, bulk dequeue helps, but does not close the gap. 
> (busylock is needed)

Interesting observation that the busylock trick is still needed.  I would
have liked to see this get cleaned up, but I guess it is not that
trivial to "fix"/remove.


> lpaa23:~# for f in `seq 1 16`; do echo $f|tee /sys/class/net/eth0/tx_bulk_limit; sar -n DEV 3 3|grep eth0|grep Average; done
> 1
> Average:         eth0 795408.44 795407.67  52044.66  52045.93      0.00      0.00      0.56
> 2
> Average:         eth0 843411.78 843415.11  55185.23  55184.51      0.00      0.00      0.56
> 3
> Average:         eth0 876175.89 876175.00  57329.50  57327.98      0.00      0.00      0.44
> 4
> Average:         eth0 890631.22 890629.44  58274.58  58274.25      0.00      0.00      0.67
> 5
> Average:         eth0 900672.00 900668.89  58931.29  58930.54      0.00      0.00      0.44
> 6
> Average:         eth0 908325.78 908328.22  59432.97  59431.76      0.00      0.00      0.56
> 7
> Average:         eth0 913895.33 913885.11  59796.89  59795.46      0.00      0.00      0.56
> 8
> Average:         eth0 914429.11 914433.56  59832.26  59831.23      0.00      0.00      0.67
> 9
> Average:         eth0 918701.11 918699.67  60110.68  60110.36      0.00      0.00      0.55
> 10
> Average:         eth0 920382.33 920376.56  60223.31  60220.54      0.00      0.00      0.67
> 11
> Average:         eth0 914341.67 914344.67  59826.25  59825.90      0.00      0.00      0.67
> 12
> Average:         eth0 912697.00 912693.78  59718.77  59717.44      0.00      0.00      0.44
> 13
> Average:         eth0 917392.56 917385.00  60025.79  60024.34      0.00      0.00      0.44
> 14
> Average:         eth0 918232.89 918233.78  60081.04  60079.94      0.00      0.00      0.67
> 15
> Average:         eth0 918377.11 918381.00  60091.14  60089.79      0.00      0.00      0.44
> 16
> Average:         eth0 913817.56 913812.33  59792.09  59790.66      0.00      0.00      0.56
> 

Thanks for doing the experiment.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net] net_sched: avoid too many hrtimer_start() calls
  2016-05-23  9:50               ` Jesper Dangaard Brouer
@ 2016-05-23 21:24                 ` Eric Dumazet
  2016-05-24 21:49                   ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2016-05-23 21:24 UTC (permalink / raw)
  To: David Miller
  Cc: Alexander Duyck, netdev, Alexander Duyck, John Fastabend,
	Jamal Hadi Salim, Jesper Dangaard Brouer

From: Eric Dumazet <edumazet@google.com>

I found a serious performance bug in packet schedulers using hrtimers.

sch_htb and sch_fq are definitely impacted by this problem.

We constantly rearm high-resolution timers if some packets are throttled
in one (or more) class, while other packets are flying through the qdisc on
another (non-throttled) class.

hrtimer_start() does not have the mod_timer() trick of doing nothing if
the expires value does not change:

	if (timer_pending(timer) &&
            timer->expires == expires)
                return 1;

This issue is particularly visible when multiple cpus can queue/dequeue
packets on the same qdisc, as hrtimer code has to lock a remote base.

I used the following fix:

1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding
it.

2) Cache the watchdog's prior expiration. hrtimer might provide this, but I
prefer not to rely on some hrtimer internal.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/pkt_sched.h |    1 +
 net/sched/sch_api.c     |    4 ++++
 net/sched/sch_htb.c     |   13 +++----------
 3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 401038d2f9b8..fea53f4d92ca 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -61,6 +61,7 @@ psched_tdiff_bounded(psched_time_t tv1, psched_time_t tv2, psched_time_t bound)
 }
 
 struct qdisc_watchdog {
+	u64		last_expires;
 	struct hrtimer	timer;
 	struct Qdisc	*qdisc;
 };
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 64f71a2155f3..ddf047df5361 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -607,6 +607,10 @@ void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires, bool thr
 	if (throttle)
 		qdisc_throttled(wd->qdisc);
 
+	if (wd->last_expires == expires)
+		return;
+
+	wd->last_expires = expires;
 	hrtimer_start(&wd->timer,
 		      ns_to_ktime(expires),
 		      HRTIMER_MODE_ABS_PINNED);
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index f6bf5818ed4d..d4b4218af6b1 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -928,17 +928,10 @@ ok:
 		}
 	}
 	qdisc_qstats_overlimit(sch);
-	if (likely(next_event > q->now)) {
-		if (!test_bit(__QDISC_STATE_DEACTIVATED,
-			      &qdisc_root_sleeping(q->watchdog.qdisc)->state)) {
-			ktime_t time = ns_to_ktime(next_event);
-			qdisc_throttled(q->watchdog.qdisc);
-			hrtimer_start(&q->watchdog.timer, time,
-				      HRTIMER_MODE_ABS_PINNED);
-		}
-	} else {
+	if (likely(next_event > q->now))
+		qdisc_watchdog_schedule_ns(&q->watchdog, next_event, true);
+	else
 		schedule_work(&q->work);
-	}
 fin:
 	return skb;
 }

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* RE: [RFC] net: remove busylock
  2016-05-20 17:49           ` Jesper Dangaard Brouer
  2016-05-20 21:32             ` Eric Dumazet
@ 2016-05-24 13:50             ` David Laight
  2016-05-24 14:37               ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: David Laight @ 2016-05-24 13:50 UTC (permalink / raw)
  To: 'Jesper Dangaard Brouer', Eric Dumazet
  Cc: Alexander Duyck, netdev, Alexander Duyck, John Fastabend,
	Jamal Hadi Salim

From: Jesper Dangaard Brouer
> Sent: 20 May 2016 18:50
...
> If would be cool if you could run a test with removed busylock and
> allow HTB to bulk dequeue.

(Without having looked ....)
Could you have two queues and separate enqueue and dequeue locks?

The enqueue code would acquire the enqueue lock and add the packet
to the first queue.

The dequeue code would acquire the dequeue lock and try to remove
a packet from the 2nd queue; if nothing is present, it would acquire
the enqueue lock and move the entire 1st queue to the 2nd queue.

The obvious downside is two lock/unlocks for a single packet dequeue.
If you can guarantee a single dequeue thread, the 2nd lock isn't needed.
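
Roughly, as a userspace sketch (illustrative only):

#include <pthread.h>
#include <stddef.h>

struct pkt { struct pkt *next; };

struct split_queue {
        pthread_mutex_t enq_lock;
        pthread_mutex_t deq_lock;
        struct pkt *in_head, *in_tail;  /* 1st queue: filled by enqueuers */
        struct pkt *out_head;           /* 2nd queue: drained by the dequeuer */
};

static void sq_enqueue(struct split_queue *q, struct pkt *p)
{
        p->next = NULL;
        pthread_mutex_lock(&q->enq_lock);
        if (q->in_tail)
                q->in_tail->next = p;
        else
                q->in_head = p;
        q->in_tail = p;
        pthread_mutex_unlock(&q->enq_lock);
}

static struct pkt *sq_dequeue(struct split_queue *q)
{
        struct pkt *p;

        pthread_mutex_lock(&q->deq_lock);
        if (!q->out_head) {
                /* 2nd queue empty: steal the whole 1st queue in one shot */
                pthread_mutex_lock(&q->enq_lock);
                q->out_head = q->in_head;
                q->in_head = q->in_tail = NULL;
                pthread_mutex_unlock(&q->enq_lock);
        }
        p = q->out_head;
        if (p)
                q->out_head = p->next;
        pthread_mutex_unlock(&q->deq_lock);
        return p;
}

The dequeuer only touches the enqueue lock when its own queue runs dry,
so enqueuers rarely collide with the dequeue path.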

	David

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] net: remove busylock
  2016-05-24 13:50             ` [RFC] net: remove busylock David Laight
@ 2016-05-24 14:37               ` Eric Dumazet
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2016-05-24 14:37 UTC (permalink / raw)
  To: David Laight
  Cc: 'Jesper Dangaard Brouer',
	Alexander Duyck, netdev, Alexander Duyck, John Fastabend,
	Jamal Hadi Salim

On Tue, 2016-05-24 at 13:50 +0000, David Laight wrote:
> From: Jesper Dangaard Brouer
> > Sent: 20 May 2016 18:50
> ...
> > If would be cool if you could run a test with removed busylock and
> > allow HTB to bulk dequeue.
> 
> (Without having looked ....)
> Could you have two queues and separate queue and dequeue locks.
> 
> The enqueue code would acquire the enqueue lock and add the packet
> to the first queue.
> 
> The dequeue code would acquire the dequeue lock and try to remove
> a packet from the 2nd queue, if nothing present it would acquire
> the enqueue lock and move the entire 1st queue to the 2nd queue.
> 
> The obvious downside is two lock/unlocks for single packet dequeue.
> If you can guarantee a single dequeue thread the 2nd lock isn't needed.
> 

Not with HTB or any 'complex' qdisc hierarchy, where we have many
'queues' and strict limits (packets per qdisc)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net] net_sched: avoid too many hrtimer_start() calls
  2016-05-23 21:24                 ` [PATCH net] net_sched: avoid too many hrtimer_start() calls Eric Dumazet
@ 2016-05-24 21:49                   ` David Miller
  0 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2016-05-24 21:49 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alexander.duyck, netdev, aduyck, john.r.fastabend, jhs, brouer

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 23 May 2016 14:24:56 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> I found a serious performance bug in packet schedulers using hrtimers.
> 
> sch_htb and sch_fq are definitely impacted by this problem.
> 
> We constantly rearm high resolution timers if some packets are throttled
> in one (or more) class, and other packets are flying through qdisc on
> another (non throttled) class.
> 
> hrtimer_start() does not have the mod_timer() trick of doing nothing if
> expires value does not change :
> 
> 	if (timer_pending(timer) &&
>             timer->expires == expires)
>                 return 1;
> 
> This issue is particularly visible when multiple cpus can queue/dequeue
> packets on the same qdisc, as hrtimer code has to lock a remote base.
> 
> I used following fix :
> 
> 1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding
> it.
> 
> 2) Cache watchdog prior expiration. hrtimer might provide this, but I
> prefer to not rely on some hrtimer internal.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

This looks fine, applied, thanks Eric.

^ permalink raw reply	[flat|nested] 22+ messages in thread

Thread overview: 22+ messages
2016-05-19 17:08 [RFC] net: remove busylock Eric Dumazet
2016-05-19 18:03 ` Alexander Duyck
2016-05-19 18:41   ` Rick Jones
2016-05-19 18:56   ` Eric Dumazet
2016-05-19 19:35     ` Eric Dumazet
2016-05-19 20:39       ` Alexander Duyck
2016-05-20  4:49         ` John Fastabend
2016-05-20  4:56           ` Eric Dumazet
2016-05-20  7:29   ` Jesper Dangaard Brouer
2016-05-20 13:11     ` Eric Dumazet
2016-05-20 13:47       ` Eric Dumazet
2016-05-20 14:16         ` Eric Dumazet
2016-05-20 17:49           ` Jesper Dangaard Brouer
2016-05-20 21:32             ` Eric Dumazet
2016-05-23  9:50               ` Jesper Dangaard Brouer
2016-05-23 21:24                 ` [PATCH net] net_sched: avoid too many hrtimer_start() calls Eric Dumazet
2016-05-24 21:49                   ` David Miller
2016-05-24 13:50             ` [RFC] net: remove busylock David Laight
2016-05-24 14:37               ` Eric Dumazet
2016-05-20 16:01       ` John Fastabend
2016-05-19 18:12 ` David Miller
2016-05-19 18:44   ` Eric Dumazet
