From: Malcolm Crossley <malcolm.crossley@citrix.com>
To: George Dunlap <george.dunlap@citrix.com>,
	JBeulich@suse.com, ian.campbell@citrix.com,
	andrew.cooper3@citrix.com, Marcos.Matsunaga@oracle.com,
	keir@xen.org, konrad.wilk@oracle.com,
	george.dunlap@eu.citrix.com
Cc: xen-devel@lists.xenproject.org, stefano.stabellini@citrix.com
Subject: Re: [PATCHv2 0/3] Implement per-cpu reader-writer locks
Date: Wed, 25 Nov 2015 08:58:47 +0000
Message-ID: <56557847.1060707@citrix.com>
In-Reply-To: <5654ACDD.5050107@citrix.com>

On 24/11/15 18:30, George Dunlap wrote:
> On 24/11/15 18:16, George Dunlap wrote:
>> On 20/11/15 16:03, Malcolm Crossley wrote:
>>> This patch series adds per-cpu reader-writer locks as a generic lock
>>> implementation and then converts the grant table and p2m rwlocks to
>>> use the percpu rwlocks, in order to improve multi-socket host performance.
>>>
>>> CPU profiling has revealed that the rwlocks themselves suffer from severe
>>> cache line bouncing, due to the cmpxchg operation used even when taking a
>>> read lock. Multiqueue paravirtualised I/O results in heavy contention on a
>>> specific domain's grant table and p2m read locks, and so I/O throughput is
>>> bottlenecked by the overhead of the cache line bouncing itself.
>>>
>>> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
>>> area to record that a CPU has taken the read lock. Correctness is enforced
>>> for the write lock by using a per-lock barrier, which forces the per-cpu
>>> read lock to revert to using a standard read lock. The write lock then
>>> polls all the per-cpu data areas until all active readers for the lock
>>> have exited.
>>>
>>> Removing the cache line bouncing on a multi-socket Haswell-EP system
>>> dramatically improves performance, with 16-vCPU network I/O throughput
>>> going from 15 Gbit/s to 64 Gbit/s! The host under test was fully utilising
>>> all 40 logical CPUs at 64 Gbit/s, so a host with more logical CPUs may see
>>> an even bigger I/O improvement.
>>
>> Impressive -- thanks for doing this work.

Thanks. I think the key to isolating the problem was using profiling tools;
the scale of the overhead would not have been clear without them.
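
For anyone following along, the bounce comes from the read path of a
conventional rwlock. Roughly (illustrative C only -- a made-up
demo_rwlock_t, not Xen's actual rwlock code, and the writer path is
omitted):

    #include <stdint.h>

    /* Illustrative generic rwlock: negative count means a writer holds it. */
    typedef struct { volatile int32_t cnts; } demo_rwlock_t;

    static void demo_read_lock(demo_rwlock_t *l)
    {
        int32_t old;

        for ( ;; )
        {
            old = l->cnts;
            if ( old < 0 )      /* writer active: keep spinning */
                continue;
            /*
             * This cmpxchg is the expensive part: every reader on every
             * CPU performs an atomic RMW on the same word, so the cache
             * line holding 'cnts' ping-pongs between sockets even when
             * there is no writer at all.
             */
            if ( __sync_val_compare_and_swap(&l->cnts, old, old + 1) == old )
                return;         /* read lock taken */
        }
    }

Even read-only critical sections therefore serialise on that one cache
line, which is exactly what the profiles showed.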

>>
>> One question: Your description here sounds like you've tested with a
>> single large domain, but what happens with multiple domains?
>>
>> It looks like the "per-cpu-rwlock" is shared by *all* locks of a
>> particular type (e.g., all domains share the per-cpu p2m rwlock).
>> (Correct me if I'm wrong here.)
> 
> Sorry, looking in more detail at the code, it seems I am wrong.  The
> fast-path stores which "slow" lock has been grabbed in the per-cpu
> variable; so the writer only needs to wait for readers that have grabbed
> the particular lock it's interested in.  So the scenarios I outline
> below shouldn't really be issues.
> 
> The description of the algorithm in the changelog could do with a bit
> more detail. :-)

I'll enhance the description to say "per-lock local variable", to make it
clearer that not all readers will be affected.
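
In case it helps with review, here is roughly how the two paths fit
together (a simplified sketch only: barriers are approximate, the unlock
paths are omitted, and THIS_CPU()/PER_CPU() stand in for the real per-cpu
accessors):

    /*
     * Each lock *type* (p2m, grant table, ...) has one per-cpu pointer
     * slot; its value records which lock instance this CPU is currently
     * reading, or NULL if none.
     */

    void percpu_read_lock(percpu_rwlock_t **per_cpudata,
                          percpu_rwlock_t *lock)
    {
        THIS_CPU(per_cpudata) = lock;           /* CPU-local write only */
        smp_mb();
        if ( unlikely(lock->writer_activating) )
        {
            /* A writer is waiting: fall back to the standard rwlock. */
            THIS_CPU(per_cpudata) = NULL;
            read_lock(&lock->rwlock);
            THIS_CPU(per_cpudata) = lock;
            read_unlock(&lock->rwlock);
        }
    }

    void percpu_write_lock(percpu_rwlock_t **per_cpudata,
                           percpu_rwlock_t *lock)
    {
        unsigned int cpu;

        write_lock(&lock->rwlock);    /* exclude other writers */
        lock->writer_activating = 1;  /* push new readers onto the slow path */
        smp_mb();
        for_each_online_cpu ( cpu )
            /* Wait only for readers holding *this* lock instance. */
            while ( PER_CPU(per_cpudata, cpu) == lock )
                cpu_relax();
    }

So a writer on domain A's p2m lock never waits for a CPU that is reading
domain B's p2m lock: that CPU's slot holds a different pointer.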

BTW, I added you to the "To" list because I need your ACK for the patch to the p2m code.

Do you have any review comments for that patch?

Thanks

Malcolm

> 
>  -George
> 
