Re: [RFC] net: remove busylock

From: Eric Dumazet <eric.dumazet@gmail.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: netdev <netdev@vger.kernel.org>, Alexander Duyck <aduyck@mirantis.com>
Subject: Re: [RFC] net: remove busylock
Date: Thu, 19 May 2016 11:56:30 -0700	[thread overview]
Message-ID: <1463684190.18194.228.camel@edumazet-glaptop3.roam.corp.google.com> (raw)
In-Reply-To: <CAKgT0UfBxW=KpqJux+tjyNpHQUHhZ5Laiqnt5FPs=jpkBJWrHA@mail.gmail.com>

On Thu, 2016-05-19 at 11:03 -0700, Alexander Duyck wrote:
> On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > busylock was added at the time we had expensive ticket spinlocks
> >
> > (commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527 ("net: add additional
> > lock to qdisc to increase throughput")
> >
> > Now kernel spinlocks are MCS, this busylock things is no longer
> > relevant. It is slowing down things a bit.
> >
> >
> > With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host with 48 hyperthreads.
> >
> > lpaa5:~# sar -n DEV 4 4 |grep eth0
> > 10:05:44         eth0 798951.25 798951.75  52276.22  52275.26      0.00      0.00      0.50
> > 10:05:48         eth0 798576.00 798572.75  52251.24  52250.39      0.00      0.00      0.75
> > 10:05:52         eth0 798746.00 798748.75  52262.89  52262.13      0.00      0.00      0.50
> > 10:05:56         eth0 798303.25 798291.50  52235.22  52233.10      0.00      0.00      0.50
> > Average:         eth0 798644.12 798641.19  52256.39  52255.22      0.00      0.00      0.56
> >
> > Disabling busylock (by using a local sysctl)
> >
> > lpaa5:~# sar -n DEV 4 4 |grep eth0
> > 10:05:14         eth0 864085.75 864091.50  56538.09  56537.46      0.00      0.00      0.50
> > 10:05:18         eth0 864734.75 864729.25  56580.35  56579.05      0.00      0.00      0.75
> > 10:05:22         eth0 864366.00 864361.50  56556.74  56555.00      0.00      0.00      0.50
> > 10:05:26         eth0 864246.50 864248.75  56549.19  56547.65      0.00      0.00      0.50
> > Average:         eth0 864358.25 864357.75  56556.09  56554.79      0.00      0.00      0.56
> >
> > That would be a 8 % increase.
> 
> The main point of the busy lock is to deal with the bulk throughput
> case, not the latency case which would be relatively well behaved.
> The problem wasn't really related to lock bouncing slowing things
> down.  It was the fairness between the threads that was killing us
> because the dequeue needs to have priority.

> 
> The main problem that the busy lock solved was the fact that you could
> start a number of stream tests equal to the number of CPUs in a given
> system and the result was that the performance would drop off a cliff
> and you would drop almost all the packets for almost all the streams
> because the qdisc never had a chance to drain because it would be CPU
> - 1 enqueues, followed by 1 dequeue.
> 
> What we need if we are going to get rid of busy lock would be some
> sort of priority locking mechanism that would allow the dequeue thread
> to jump to the head of the line if it is attempting to take the lock.
> Otherwise you end up spending all your time enqueuing packets into
> oblivion because the qdiscs just overflow without the busy lock in
> place.

Removing busylock helped in all cases I tested. (at least on x86 as
David pointed out)

As I said, we need to revisit busylock now that spinlocks are different.

In one case (20 concurrent UDP netperf), I even got a 500 % increase.

With busylock :

lpaa5:~# sar -n DEV 4 4|grep eth0
11:33:34         eth0      9.00 115057.00      1.60  38426.92      0.00      0.00      0.50
11:33:38         eth0     13.50 113237.75      2.04  37819.69      0.00      0.00      0.75
11:33:42         eth0     13.50 111492.25      1.76  37236.58      0.00      0.00      0.75
11:33:46         eth0     12.75 111401.50      2.40  37205.93      0.00      0.00      0.75
Average:         eth0     12.19 112797.12      1.95  37672.28      0.00      0.00      0.69

Packets are dropped in HTB because we hit a limit of 1000 packets there

- 100.00%  netperf  [kernel.kallsyms]  [k] kfree_skb           ▒
   - kfree_skb                                                 ▒
      -  100.00% htb_enqueue                                   ▒
            __dev_queue_xmit                                   ▒
            dev_queue_xmit                                     ▒
            ip_finish_output2                                  ▒
            ip_finish_output                                   ▒
            ip_output                                          ▒
            ip_local_out                                       ▒
         +  ip_send_skb      

Presumably it would tremendously help if the actual kfree_skb()
was done after qdisc lock is released, ie not from the qdisc->enqueue()
method.

Without busylock :

lpaa5:~# sar -n DEV 4 4|grep eth0
11:41:12         eth0     11.00 669053.50      1.99 223452.30      0.00      0.00      0.75
11:41:16         eth0      8.50 669513.25      2.27 223605.55      0.00      0.00      0.75
11:41:20         eth0      3.50 669426.50      0.90 223577.19      0.00      0.00      0.50
11:41:24         eth0      8.25 669284.00      1.42 223529.79      0.00      0.00      0.50
Average:         eth0      7.81 669319.31      1.65 223541.21      0.00      0.00      0.62