From: George Dunlap
Subject: Re: [PATCHv2 0/3] Implement per-cpu reader-writer locks
Date: Wed, 25 Nov 2015 09:49:17 +0000
To: Malcolm Crossley, JBeulich@suse.com, ian.campbell@citrix.com, andrew.cooper3@citrix.com, Marcos.Matsunaga@oracle.com, keir@xen.org, konrad.wilk@oracle.com, george.dunlap@eu.citrix.com
Cc: xen-devel@lists.xenproject.org, stefano.stabellini@citrix.com

On 25/11/15 08:58, Malcolm Crossley wrote:
> On 24/11/15 18:30, George Dunlap wrote:
>> On 24/11/15 18:16, George Dunlap wrote:
>>> On 20/11/15 16:03, Malcolm Crossley wrote:
>>>> This patch series adds per-cpu reader-writer locks as a generic lock
>>>> implementation and then converts the grant table and p2m rwlocks to
>>>> use the percpu rwlocks, in order to improve multi-socket host performance.
>>>>
>>>> CPU profiling has revealed that the rwlocks themselves suffer from severe cache
>>>> line bouncing due to the cmpxchg operation used even when taking a read lock.
>>>> Multiqueue paravirtualised I/O results in heavy contention of the grant table
>>>> and p2m read locks of a specific domain, and so I/O throughput is bottlenecked
>>>> by the overhead of the cache line bouncing itself.
>>>>
>>>> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
>>>> area to record that a CPU has taken the read lock. Correctness is enforced for
>>>> the write lock by using a per-lock barrier which forces the per-cpu read lock
>>>> to revert to using a standard read lock. The write lock then polls all
>>>> the per-cpu data areas until active readers for the lock have exited.
>>>>
>>>> Removing the cache line bouncing on a multi-socket Haswell-EP system
>>>> dramatically improves performance, with 16 vCPU network IO performance going
>>>> from 15 Gb/s to 64 Gb/s! The host under test was fully utilising all 40
>>>> logical CPUs at 64 Gb/s, so a host with more logical CPUs may see an even
>>>> better IO improvement.
>>>
>>> Impressive -- thanks for doing this work.
>
> Thanks, I think the key to isolating the problem was using profiling tools. The scale
> of the overhead would not have been clear without them.
>
>>>
>>> One question: Your description here sounds like you've tested with a
>>> single large domain, but what happens with multiple domains?
>>>
>>> It looks like the "per-cpu-rwlock" is shared by *all* locks of a
>>> particular type (e.g., all domains share the per-cpu p2m rwlock).
>>> (Correct me if I'm wrong here.)
>>
>> Sorry, looking in more detail at the code, it seems I am wrong. The
>> fast path stores which "slow" lock has been grabbed in the per-cpu
>> variable, so the writer only needs to wait for readers that have grabbed
>> the particular lock it's interested in. So the scenarios I outline
>> below shouldn't really be issues.
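
To spell out my understanding of the scheme, here is a minimal sketch. It is
not the code from the series -- the names my_percpu_rwlock, reader_slot and
NR_CPUS are made up, and a pthread rwlock stands in for the ordinary rwlock
used as the slow path:

/* Per-cpu rwlock sketch: C11 atomics, pthread rwlock as the slow path. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 64

typedef struct my_percpu_rwlock {
    atomic_bool writer_activating;   /* barrier: push readers to the slow path */
    pthread_rwlock_t rwlock;         /* conventional rwlock used as the slow path */
} my_percpu_rwlock;

/*
 * Per-cpu data area (emulated here with an array indexed by a caller-supplied
 * cpu id): records which lock, if any, this CPU currently holds for read via
 * the fast path.  Storing the lock pointer rather than a flag is what lets a
 * writer wait only for readers of *its own* lock.
 */
static _Atomic(my_percpu_rwlock *) reader_slot[NR_CPUS];

static void my_percpu_rwlock_init(my_percpu_rwlock *l)
{
    atomic_init(&l->writer_activating, false);
    pthread_rwlock_init(&l->rwlock, NULL);
}

static void my_read_lock(my_percpu_rwlock *l, unsigned int cpu)
{
    /* Fast path: a plain store to our own CPU's slot -- no shared cmpxchg. */
    atomic_store(&reader_slot[cpu], l);

    /* Seq_cst store/load pair orders the slot store against the flag load. */
    if ( !atomic_load(&l->writer_activating) )
        return;

    /* A writer is activating: back out and fall back to the slow path. */
    atomic_store(&reader_slot[cpu], NULL);
    pthread_rwlock_rdlock(&l->rwlock);
}

static void my_read_unlock(my_percpu_rwlock *l, unsigned int cpu)
{
    if ( atomic_load(&reader_slot[cpu]) == l )
        atomic_store(&reader_slot[cpu], NULL);   /* fast-path reader */
    else
        pthread_rwlock_unlock(&l->rwlock);       /* slow-path reader */
}

static void my_write_lock(my_percpu_rwlock *l)
{
    unsigned int cpu;

    /* Exclude other writers and any slow-path readers. */
    pthread_rwlock_wrlock(&l->rwlock);

    /* Raise the barrier so new readers of this lock take the slow path... */
    atomic_store(&l->writer_activating, true);

    /* ...then poll the per-cpu area until readers of *this* lock have exited. */
    for ( cpu = 0; cpu < NR_CPUS; cpu++ )
        while ( atomic_load(&reader_slot[cpu]) == l )
            ;   /* spin; a real implementation would use cpu_relax() */
}

static void my_write_unlock(my_percpu_rwlock *l)
{
    atomic_store(&l->writer_activating, false);
    pthread_rwlock_unlock(&l->rwlock);
}

The property that matters for the numbers above is that the read-lock fast
path only writes its own CPU's slot and reads writer_activating, so there is
no shared cache line being cmpxchg'd on every read acquisition; the cost
moves to the (much rarer) writer, which has to scan every CPU's slot.
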
>>
>> The description of the algorithm in the changelog could do with a bit
>> more detail. :-)
>
> I'll enhance the description to say "per lock local variable" to make it clearer
> that not all readers will be affected.
>
> BTW, I added you to the "To" list because I need your ACK for the patch to the p2m code.
>
> Do you have any review comments for that patch?

Yes, I realize that, and I'll get to it. :-)

 -George