From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alexander Duyck
Subject: Re: [RFC] net: remove busylock
Date: Thu, 19 May 2016 13:39:22 -0700
Message-ID:
References: <1463677716.18194.203.camel@edumazet-glaptop3.roam.corp.google.com>
 <1463684190.18194.228.camel@edumazet-glaptop3.roam.corp.google.com>
 <1463686523.18194.232.camel@edumazet-glaptop3.roam.corp.google.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Cc: netdev, Alexander Duyck, Jesper Dangaard Brouer, John Fastabend
To: Eric Dumazet
Return-path:
Received: from mail-io0-f181.google.com ([209.85.223.181]:36133 "EHLO
 mail-io0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1753942AbcESUjX (ORCPT); Thu, 19 May 2016 16:39:23 -0400
Received: by mail-io0-f181.google.com with SMTP id b78so5990707ioj.3
 for ; Thu, 19 May 2016 13:39:23 -0700 (PDT)
In-Reply-To: <1463686523.18194.232.camel@edumazet-glaptop3.roam.corp.google.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Thu, May 19, 2016 at 12:35 PM, Eric Dumazet wrote:
> On Thu, 2016-05-19 at 11:56 -0700, Eric Dumazet wrote:
>
>> Removing busylock helped in all cases I tested. (at least on x86 as
>> David pointed out)
>>
>> As I said, we need to revisit busylock now that spinlocks are different.
>>
>> In one case (20 concurrent UDP netperf), I even got a 500 % increase.
>>
>> With busylock :
>>
>> lpaa5:~# sar -n DEV 4 4|grep eth0
>> Average:        eth0     12.19 112797.12      1.95  37672.28      0.00      0.00      0.69
>>
>
>
> Hmpf, my sysctl logic was inverted. Really these results made little
> sense.
>
> Sorry for the noise. At least we have 8% confirmed gain with this
> stuff ;)

Unfortunately I see a 21% regression when the qdisc is placed under
load from multiple CPUs/nodes.  In my testing I dropped from around
1.05 Mpps to around 827 Kpps when I removed the busylock.

My test setup is pretty straightforward, and the results I have seen
are consistent between runs.  I have a system with 32 threads / 16
cores spread over 2 NUMA nodes.  I reduce the number of queues on a
10Gb/s NIC to 1, kill irqbalance, and disable CPU power states.  I
then start a quick "for" loop that schedules one UDP_STREAM netperf
session on each CPU using a small payload size, like 32 bytes.

On a side note, if I move everything onto one node I can push about
2.4 Mpps and the busylock doesn't seem to really impact things, but
if I go to both nodes then I see the performance regression described
above.

I was thinking about it, and I don't think the MCS-type locks would
have that much of an impact.  If anything, I think xmit_more probably
has a bigger effect, given that it allows us to grab multiple frames
with each fetch and thereby reduces the lock contention on the
dequeue side.

>> Presumably it would tremendously help if the actual kfree_skb()
>> was done after qdisc lock is released, ie not from the qdisc->enqueue()
>> method.
>>
>
> This part is still valid.
>
> We could have a per cpu storage of one skb pointer, so that we do not
> have to change all ->enqueue() prototypes.

I fully agree with that.

One thought I had is that if we could move to a lockless dequeue path
for the qdisc, then we could also get rid of the busylock.  I know
there has been talk in the past about doing away with qdisc locks, or
changing the inner mechanics of the qdisc itself; I'm CCing Jesper
and John for that reason.  Maybe it is time for us to start looking
into that so we can start cleaning out some of the old cruft we have
in this path.

- Alex
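
P.S. For anyone reading along without the tree handy, the pattern we
are talking about removing looks roughly like this (simplified and
abbreviated from __dev_xmit_skb() in net/core/dev.c, not a verbatim
copy):

	spinlock_t *root_lock = qdisc_lock(q);
	bool contended = qdisc_is_running(q);
	int rc;

	/* Make contended enqueuers serialize on busylock first, so the
	 * CPU currently running the qdisc can re-take the root lock and
	 * keep dequeuing without fighting every other sender.
	 */
	if (unlikely(contended))
		spin_lock(&q->busylock);

	spin_lock(root_lock);
	rc = q->enqueue(skb, q) & NET_XMIT_MASK;
	spin_unlock(root_lock);

	if (unlikely(contended))
		spin_unlock(&q->busylock);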
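
P.P.S. To make the per-cpu deferred-free idea concrete, here is a
rough, untested sketch.  The qdisc_defer_free()/qdisc_flush_deferred()
names are made up for illustration (DEFINE_PER_CPU and the this_cpu_*
ops are the real per-cpu APIs), and it relies on ->enqueue() running
with BH disabled so nothing else touches the slot in between:

#include <linux/percpu.h>
#include <linux/skbuff.h>

static DEFINE_PER_CPU(struct sk_buff *, qdisc_deferred_skb);

/* ->enqueue() would call this instead of kfree_skb(); a single slot
 * matches the "one skb pointer" suggestion above.
 */
static inline void qdisc_defer_free(struct sk_buff *skb)
{
	this_cpu_write(qdisc_deferred_skb, skb);
}

/* The caller of ->enqueue() runs this after releasing the qdisc
 * lock, so the actual free happens outside the locked section.
 */
static inline void qdisc_flush_deferred(void)
{
	struct sk_buff *skb = this_cpu_xchg(qdisc_deferred_skb, NULL);

	if (skb)
		kfree_skb(skb);
}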