From: Andrew Cooper <andrew.cooper3@citrix.com>
To: George Dunlap <george.dunlap@citrix.com>,
	Malcolm Crossley <malcolm.crossley@citrix.com>,
	JBeulich@suse.com, ian.campbell@citrix.com,
	Marcos.Matsunaga@oracle.com, keir@xen.org,
	konrad.wilk@oracle.com, george.dunlap@eu.citrix.com
Cc: xen-devel@lists.xenproject.org, stefano.stabellini@citrix.com
Subject: Re: [PATCHv2 0/3] Implement per-cpu reader-writer locks
Date: Tue, 24 Nov 2015 18:32:50 +0000	[thread overview]
Message-ID: <5654AD52.1080804@citrix.com> (raw)
In-Reply-To: <5654A98D.3050801@citrix.com>

On 24/11/15 18:16, George Dunlap wrote:
> On 20/11/15 16:03, Malcolm Crossley wrote:
>> This patch series adds per-cpu reader-writer locks as a generic lock
>> implementation and then converts the grant table and p2m rwlocks to
>> use the percpu rwlocks, in order to improve multi-socket host performance.
>>
>> CPU profiling has revealed that the rwlocks themselves suffer from severe cache
>> line bouncing, due to the cmpxchg operation used even when taking a read lock.
>> Multiqueue paravirtualised I/O results in heavy contention on the grant table
>> and p2m read locks of a specific domain, and so I/O throughput is bottlenecked
>> by the overhead of the cache line bouncing itself.
>>
>> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
>> area to record that a CPU has taken the read lock. Correctness is enforced for
>> the write lock by using a per-lock barrier, which forces the per-cpu read lock
>> to revert to using a standard read lock. The write lock then polls all of
>> the per-cpu data areas until all active readers for the lock have exited.
>>
>> Removing the cache line bouncing on a multi-socket Haswell-EP system
>> dramatically improves performance, with 16 vCPU network I/O performance going
>> from 15 Gb/s to 64 Gb/s! The host under test was fully utilising all 40
>> logical CPUs at 64 Gb/s, so a host with more logical CPUs may see an even
>> better I/O improvement.
> Impressive -- thanks for doing this work.
>
> One question: Your description here sounds like you've tested with a
> single large domain, but what happens with multiple domains?

The test was with two domUs, doing network traffic between them.

>
> It looks like the "per-cpu-rwlock" is shared by *all* locks of a
> particular type (e.g., all domains share the per-cpu p2m rwlock).
> (Correct me if I'm wrong here.)

The per-pcpu pointer is shared by all locks of a particular type, but the
individual lock can still be distinguished by following the pointer back.

Therefore, the locks are not actually shared.
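
For clarity, a rough sketch of the data layout (illustrative names and
simplified types, not the verbatim patch):

/* One per-cpu slot per lock *type*, e.g. one for p2m, one for grant table. */
typedef struct percpu_rwlock {
    rwlock_t rwlock;            /* fallback rwlock used on the slow path    */
    bool writer_activating;     /* set while a writer is acquiring the lock */
} percpu_rwlock_t;

DECLARE_PER_CPU(percpu_rwlock_t *, p2m_percpu_rwlock);

/*
 * A fast-path reader stores the address of the specific percpu_rwlock_t it
 * is taking into its own slot, so a writer walking the slots can tell
 * "reading my lock" apart from "reading another lock of the same type".
 */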

>
> Which means two things:
>
>  1) *Any* writer will have to wait for the rwlock of that type to be
> released on *all* domains before being able to write.  Is there any risk
> that on a busy system, this will be an unusually long wait?

No.  The write-locker looks through all pcpus and ignores any whose
per-cpu pointer is not pointing at the intended percpu_rwlock.

From _percpu_write_lock():

/*
 * Remove any percpu readers not contending on this rwlock
 * from our check mask.
 */
if ( per_cpu_ptr(per_cpudata, cpu) != percpu_rwlock )
    cpumask_clear_cpu(cpu, &this_cpu(percpu_rwlock_readers));

As a result, two calls to _percpu_write_lock() against different
percpu_rwlock_t's will not interact.
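
For context, the surrounding logic is roughly as follows (simplified
sketch; see the patch itself for the precise form):

unsigned int cpu;

write_lock(&percpu_rwlock->rwlock);     /* exclude other writers            */
percpu_rwlock->writer_activating = 1;   /* divert new readers to slow path  */
smp_mb();
cpumask_copy(&this_cpu(percpu_rwlock_readers), &cpu_online_map);

do {
    for_each_cpu ( cpu, &this_cpu(percpu_rwlock_readers) )
    {
        /*
         * Remove any percpu readers not contending on this rwlock
         * from our check mask.
         */
        if ( per_cpu_ptr(per_cpudata, cpu) != percpu_rwlock )
            cpumask_clear_cpu(cpu, &this_cpu(percpu_rwlock_readers));
    }
    cpu_relax();                        /* poll until all readers have left */
} while ( !cpumask_empty(&this_cpu(percpu_rwlock_readers)) );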

>
> 2) *All* domains will have to take the slow path for reading when *any*
> domain has or is trying to acquire the write lock.  What is the
> probability that, on a busy system, that turns out to be "most of the time"?

_percpu_read_lock() will only take the slow path if:

1) The intended lock is (or is in the process of being) taken for writing
2) The call path has nested _percpu_read_lock() calls on the same type of lock

By observation, 2) does not currently occur in the code.
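
Roughly, the reader side looks like this (simplified sketch, illustrative
rather than the verbatim patch):

/* Case 2: this pcpu already holds a lock of this type -- use plain rwlock. */
if ( this_cpu_ptr(per_cpudata) != NULL )
{
    read_lock(&percpu_rwlock->rwlock);
    return;
}

/* Fast path: advertise which lock we are reading, then re-check for writers. */
this_cpu_ptr(per_cpudata) = percpu_rwlock;
smp_mb();

/* Case 1: a writer is active -- back off and queue behind it. */
if ( percpu_rwlock->writer_activating )
{
    this_cpu_ptr(per_cpudata) = NULL;
    read_lock(&percpu_rwlock->rwlock);
    this_cpu_ptr(per_cpudata) = percpu_rwlock;
    read_unlock(&percpu_rwlock->rwlock);
}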

~Andrew


Thread overview: 19+ messages
2015-11-20 16:03 [PATCHv2 0/3] Implement per-cpu reader-writer locks Malcolm Crossley
2015-11-20 16:03 ` [PATCHv2 1/3] rwlock: Add " Malcolm Crossley
2015-11-25 11:12   ` George Dunlap
2015-11-26 12:17   ` Jan Beulich
2015-11-20 16:03 ` [PATCHv2 2/3] grant_table: convert grant table rwlock to percpu rwlock Malcolm Crossley
2015-11-25 12:35   ` Jan Beulich
2015-11-25 13:43     ` Malcolm Crossley
2015-11-25 13:53       ` Jan Beulich
2015-11-25 14:11         ` Malcolm Crossley
2015-11-20 16:03 ` [PATCHv2 3/3] p2m: " Malcolm Crossley
2015-11-25 12:00   ` George Dunlap
2015-11-25 12:38   ` Jan Beulich
2015-11-25 12:54     ` Malcolm Crossley
2015-11-24 18:16 ` [PATCHv2 0/3] Implement per-cpu reader-writer locks George Dunlap
2015-11-24 18:30   ` George Dunlap
2015-11-25  8:58     ` Malcolm Crossley
2015-11-25  9:49       ` George Dunlap
2015-11-26  9:48       ` Dario Faggioli
2015-11-24 18:32   ` Andrew Cooper [this message]
