From: Malcolm Crossley
Subject: Re: [PATCHv2 0/3] Implement per-cpu reader-writer locks
Date: Wed, 25 Nov 2015 08:58:47 +0000
Message-ID: <56557847.1060707@citrix.com>
In-Reply-To: <5654ACDD.5050107@citrix.com>
References: <1448035423-24242-1-git-send-email-malcolm.crossley@citrix.com>
 <5654A98D.3050801@citrix.com> <5654ACDD.5050107@citrix.com>
To: George Dunlap, JBeulich@suse.com, ian.campbell@citrix.com,
 andrew.cooper3@citrix.com, Marcos.Matsunaga@oracle.com, keir@xen.org,
 konrad.wilk@oracle.com, george.dunlap@eu.citrix.com
Cc: xen-devel@lists.xenproject.org, stefano.stabellini@citrix.com
List-Id: xen-devel@lists.xenproject.org

On 24/11/15 18:30, George Dunlap wrote:
> On 24/11/15 18:16, George Dunlap wrote:
>> On 20/11/15 16:03, Malcolm Crossley wrote:
>>> This patch series adds per-cpu reader-writer locks as a generic lock
>>> implementation and then converts the grant table and p2m rwlocks to
>>> use the per-cpu rwlocks, in order to improve multi-socket host
>>> performance.
>>>
>>> CPU profiling has revealed that the rwlocks themselves suffer from
>>> severe cache line bouncing due to the cmpxchg operation used even
>>> when taking a read lock. Multiqueue paravirtualised I/O results in
>>> heavy contention on the grant table and p2m read locks of a specific
>>> domain, and so I/O throughput is bottlenecked by the overhead of the
>>> cache line bouncing itself.
>>>
>>> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu
>>> data area to record that a CPU has taken the read lock. Correctness
>>> is enforced for the write lock by using a per-lock barrier, which
>>> forces the per-cpu read lock to revert to using a standard read
>>> lock. The write lock then polls all the per-cpu data areas until all
>>> active readers for the lock have exited.
>>>
>>> Removing the cache line bouncing on a multi-socket Haswell-EP system
>>> dramatically improves performance, with 16-vCPU network I/O
>>> performance going from 15 Gb/s to 64 Gb/s! The host under test was
>>> fully utilising all 40 logical CPUs at 64 Gb/s, so a host with more
>>> logical CPUs may see an even bigger I/O improvement.
>>
>> Impressive -- thanks for doing this work.

Thanks, I think the key to isolating the problem was using profiling
tools. The scale of the overhead would not have been clear without them.

>>
>> One question: your description here sounds like you've tested with a
>> single large domain, but what happens with multiple domains?
>>
>> It looks like the "per-cpu-rwlock" is shared by *all* locks of a
>> particular type (e.g., all domains share the per-cpu p2m rwlock).
>> (Correct me if I'm wrong here.)
>
> Sorry, looking in more detail at the code, it seems I am wrong. The
> fast path stores which "slow" lock has been grabbed in the per-cpu
> variable, so the writer only needs to wait for readers that have
> grabbed the particular lock it's interested in. So the scenarios I
> outline below shouldn't really be issues.
>
> The description of the algorithm in the changelog could do with a bit
> more detail.
:-) I'll enhance the description to say "per-lock local variable" to
make it clearer that not all readers will be affected. The rough shape
of the algorithm is illustrated in the sketch below.

BTW, I added you to the "To" list because I need your ACK for the patch
to the p2m code. Do you have any review comments for that patch?

Thanks

Malcolm
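For reference, here is a minimal standalone sketch of the scheme using
C11 atomics. This is an illustration, not the code from the patch
series: the names (percpu_rwlock_t, held_lock, writer_activating,
NR_CPUS and the sketch_* functions) are invented, and the fallback to a
conventional rwlock plus writer-vs-writer exclusion are elided to keep
it short. The sequentially consistent defaults of the C11 atomics
provide the store-then-load ordering the scheme relies on.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 64                      /* illustrative fixed size */

typedef struct percpu_rwlock {
    atomic_bool writer_activating;      /* per-lock barrier flag */
} percpu_rwlock_t;

/* Per-CPU slot recording which lock (if any) this CPU holds for read. */
static percpu_rwlock_t *_Atomic held_lock[NR_CPUS];

void sketch_read_lock(percpu_rwlock_t *l, unsigned int cpu)
{
    for ( ; ; )
    {
        /* Fast path: publish the lock in our own per-CPU slot.  No
         * shared cache line is written, so nothing bounces. */
        atomic_store(&held_lock[cpu], l);
        if ( !atomic_load(&l->writer_activating) )
            return;
        /* A writer is active: retract our claim and wait.  The real
         * implementation takes a conventional fallback read lock here
         * rather than spinning. */
        atomic_store(&held_lock[cpu], (percpu_rwlock_t *)NULL);
        while ( atomic_load(&l->writer_activating) )
            ;                           /* spin */
    }
}

void sketch_read_unlock(percpu_rwlock_t *l, unsigned int cpu)
{
    (void)l;
    atomic_store(&held_lock[cpu], (percpu_rwlock_t *)NULL);
}

void sketch_write_lock(percpu_rwlock_t *l)
{
    /* Raise the per-lock barrier: new readers take the slow path. */
    atomic_store(&l->writer_activating, true);
    /* Wait only for CPUs whose slot names *this* lock; readers of
     * other per-cpu rwlocks are never waited on. */
    for ( unsigned int cpu = 0; cpu < NR_CPUS; cpu++ )
        while ( atomic_load(&held_lock[cpu]) == l )
            ;                           /* spin */
}

void sketch_write_unlock(percpu_rwlock_t *l)
{
    atomic_store(&l->writer_activating, false);
}

The point George raised is visible in sketch_write_lock(): because each
CPU's slot records which lock it holds, the writer only polls slots
naming its own lock, so readers of other domains' locks are unaffected.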