* [PATCH] use unfair spinlock when running on hypervisor.
@ 2010-06-01  9:35 Gleb Natapov
  2010-06-01 15:53 ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: Gleb Natapov @ 2010-06-01  9:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, avi, hpa, mingo, npiggin, tglx, mtosatti

The ticket spinlock ensures fairness by introducing a FIFO of cpus waiting
for the spinlock to be released. This works great on real HW, but when running
on a hypervisor it introduces a very heavy performance hit if physical cpus
are overcommitted (up to 35% in my test). The reason for the performance
hit is that the vcpu at the head of the FIFO can be unscheduled while no
other vcpu waiting on the lock can proceed. This results in big time
gaps where no vcpu is in the critical section. Even worse, the waiters are
constantly spinning and not allowing the vcpu at the head of the FIFO to be
scheduled to run.

The patch below allows the ticket spinlock code to be patched to behave
similarly to the old unfair spinlock when a hypervisor is detected. After
patching, the unlocked condition remains the same (next == owner), but if a
vcpu detects that the lock is already taken it spins until
(next == owner == 0) and, when that condition is met, tries to reacquire the
lock. Unlock simply zeroes head and tail. Trylock stays the same since the
unlocked condition stays the same.
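
For illustration only, here is a C-level sketch of the behaviour described
above (it is not the patch itself, which is the inline asm further down; the
struct, the xadd16() helper and the cpu_relax() definition are stand-ins for
the real kernel primitives, and the 8-bit ticket layout assumes fewer than
256 cpus):

struct sketch_lock {
	volatile unsigned short slock;	/* low byte: owner (head), high byte: next (tail) */
};

static inline unsigned short xadd16(volatile unsigned short *p, unsigned short v)
{
	return __sync_fetch_and_add(p, v);	/* atomic xadd, returns the old value */
}

#define cpu_relax()	asm volatile("rep; nop" ::: "memory")

static void unfair_ticket_lock(struct sketch_lock *lock)
{
	for (;;) {
		unsigned short old = xadd16(&lock->slock, 0x0100);

		if ((old & 0xff) == (old >> 8))
			return;			/* next == owner: lock acquired */

		/* Lock is taken: wait for unlock to zero head and tail ... */
		while (lock->slock != 0)
			cpu_relax();
		/* ... then race for the lock again (this is the unfair part). */
	}
}

static void unfair_ticket_unlock(struct sketch_lock *lock)
{
	lock->slock = 0;			/* zero head and tail */
}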

I ran two guests with 16 vcpus each, one guest with this patch and one
without. Inside those guests I ran a kernel compilation: "time make -j 16 all".
Here are the results of two runs:

patched:                        unpatched:
real 13m34.674s                 real 19m35.827s
user 96m2.638s                  user 102m38.665s
sys 56m14.991s                  sys 158m22.470s

real 13m3.768s                  real 19m4.375s
user 95m34.509s                 user 111m9.903s
sys 53m40.550s                  sys 141m59.370s

Note that the unpatched guest ran its last ~6 minutes without cpu
oversubscription, since the patched guest was already idle.

Running kernbench on the host with and without the patch:

Unpatched:                                            Patched:
Average Half load -j 8 Run (std deviation):
Elapsed Time 312.538 (0.779404)                       Elapsed Time 311.974 (0.258031)
User Time 2154.89 (6.62261)                           User Time 2151.94 (1.96702)
System Time 233.78 (1.96642)                          System Time 236.472 (0.695572)
Percent CPU 763.8 (1.64317)                           Percent CPU 765 (0.707107)
Context Switches 39348.2 (1542.87)                    Context Switches 41522.4 (1193.13)
Sleeps 871688 (5596.52)                               Sleeps 877317 (1135.83)

Average Optimal load -j 16 Run (std deviation):
Elapsed Time 259.878 (0.444826)                       Elapsed Time 258.842 (0.413122)
User Time 2549.22 (415.685)                           User Time 2538.42 (407.383)
System Time 261.084 (28.8145)                         System Time 263.847 (28.8638)
Percent CPU 1003.4 (252.566)                          Percent CPU 1003.5 (251.407)
Context Switches 228418 (199308)                      Context Switches 228894 (197533)
Sleeps 898721 (28757.2)                               Sleeps 902020 (26074.5)

Signed-off-by: Gleb Natapov <gleb@redhat.com>
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 3089f70..b919b54 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -60,19 +60,27 @@
 
 static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 {
-	short inc = 0x0100;
+	short inc;
 
 	asm volatile (
+		"1:\t\n"
+		"mov $0x100, %0\n\t"
 		LOCK_PREFIX "xaddw %w0, %1\n"
-		"1:\t"
+		"2:\t"
 		"cmpb %h0, %b0\n\t"
-		"je 2f\n\t"
+		"je 4f\n\t"
+		"3:\t\n"
 		"rep ; nop\n\t"
-		"movb %1, %b0\n\t"
 		/* don't need lfence here, because loads are in-order */
-		"jmp 1b\n"
-		"2:"
-		: "+Q" (inc), "+m" (lock->slock)
+		ALTERNATIVE(
+		"movb %1, %b0\n\t"
+		"jmp 2b\n",
+		"nop", X86_FEATURE_HYPERVISOR)"\n\t"
+		"cmpw $0, %1\n\t"
+		"jne 3b\n\t"
+		"jmp 1b\n\t"
+		"4:"
+		: "=Q" (inc), "+m" (lock->slock)
 		:
 		: "memory", "cc");
 }
@@ -98,10 +106,13 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
 
 static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
 {
-	asm volatile(UNLOCK_LOCK_PREFIX "incb %0"
-		     : "+m" (lock->slock)
-		     :
-		     : "memory", "cc");
+	asm volatile(
+		ALTERNATIVE(UNLOCK_LOCK_PREFIX "incb (%0);"ASM_NOP3,
+			    UNLOCK_LOCK_PREFIX "movw $0, (%0)",
+			    X86_FEATURE_HYPERVISOR)
+		:
+		: "Q" (&lock->slock)
+		: "memory", "cc");
 }
 #else
 #define TICKET_SHIFT 16
--
			Gleb.


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01  9:35 [PATCH] use unfair spinlock when running on hypervisor Gleb Natapov
@ 2010-06-01 15:53 ` Andi Kleen
  2010-06-01 16:24   ` Gleb Natapov
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2010-06-01 15:53 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: linux-kernel, kvm, avi, hpa, mingo, npiggin, tglx, mtosatti

Gleb Natapov <gleb@redhat.com> writes:
>
> The patch below allows to patch ticket spinlock code to behave similar to
> old unfair spinlock when hypervisor is detected. After patching unlocked

The question is what happens when you have a system with unfair
memory and you run the hypervisor on that. There it could be much worse.

Your new code would starve again, right?

There's a reason the ticket spinlocks were added in the first place.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 15:53 ` Andi Kleen
@ 2010-06-01 16:24   ` Gleb Natapov
  2010-06-01 16:38     ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: Gleb Natapov @ 2010-06-01 16:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, kvm, avi, hpa, mingo, npiggin, tglx, mtosatti

On Tue, Jun 01, 2010 at 05:53:09PM +0200, Andi Kleen wrote:
> Gleb Natapov <gleb@redhat.com> writes:
> >
> > The patch below allows to patch ticket spinlock code to behave similar to
> > old unfair spinlock when hypervisor is detected. After patching unlocked
> 
> The question is what happens when you have a system with unfair
> memory and you run the hypervisor on that. There it could be much worse.
> 
How much worse could the performance hit be?

> Your new code would starve again, right?
> 
Yes, of course it may starve with an unfair spinlock. But since vcpus are not
always running, there is a much smaller chance that a vcpu on a remote memory
node will starve forever. Old kernels with unfair spinlocks are running
fine in VMs on NUMA machines with various loads.

> There's a reason the ticket spinlocks were added in the first place.
> 
I understand that reason and do not propose going back to the old spinlock
on physical HW! But with virtualization the performance hit is unbearable.

--
			Gleb.


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 16:24   ` Gleb Natapov
@ 2010-06-01 16:38     ` Andi Kleen
  2010-06-01 16:52       ` Avi Kivity
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2010-06-01 16:38 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Andi Kleen, linux-kernel, kvm, avi, hpa, mingo, npiggin, tglx, mtosatti

On Tue, Jun 01, 2010 at 07:24:14PM +0300, Gleb Natapov wrote:
> On Tue, Jun 01, 2010 at 05:53:09PM +0200, Andi Kleen wrote:
> > Gleb Natapov <gleb@redhat.com> writes:
> > >
> > > The patch below allows to patch ticket spinlock code to behave similar to
> > > old unfair spinlock when hypervisor is detected. After patching unlocked
> > 
> > The question is what happens when you have a system with unfair
> > memory and you run the hypervisor on that. There it could be much worse.
> > 
> How much worse performance hit could be?

It depends on the workload. Overall it means that a contended
lock can have much higher latencies.

If you want to study some examples see the locking problems the
RT people have with their heavy weight mutex-spinlocks.

But the main problem is that in the worst case you
can see extremely long stalls (up to a second has been observed),
which then turns into a correctness issue.
> 
> > Your new code would starve again, right?
> > 
> Yes, of course it may starve with unfair spinlock. Since vcpus are not
> always running there is much smaller chance then vcpu on remote memory
> node will starve forever. Old kernels with unfair spinlocks are running
> fine in VMs on NUMA machines with various loads.

Try it on a NUMA system with unfair memory.

> > There's a reason the ticket spinlocks were added in the first place.
> > 
> I understand that reason and do not propose to get back to old spinlock
> on physical HW! But with virtualization performance hit is unbearable.

Extreme unfairness can be unbearable too.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 16:38     ` Andi Kleen
@ 2010-06-01 16:52       ` Avi Kivity
  2010-06-01 17:27         ` Andi Kleen
                           ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Avi Kivity @ 2010-06-01 16:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin, tglx, mtosatti

On 06/01/2010 07:38 PM, Andi Kleen wrote:
>>> Your new code would starve again, right?
>>>
>>>        
>> Yes, of course it may starve with unfair spinlock. Since vcpus are not
>> always running there is much smaller chance then vcpu on remote memory
>> node will starve forever. Old kernels with unfair spinlocks are running
>> fine in VMs on NUMA machines with various loads.
>>      
> Try it on a NUMA system with unfair memory.
>    

We are running everything on NUMA (since all modern machines are now 
NUMA).  At what scale do the issues become observable?

>> I understand that reason and do not propose to get back to old spinlock
>> on physical HW! But with virtualization performance hit is unbearable.
>>      
> Extreme unfairness can be unbearable too.
>    

Well, the question is what happens first.  In our experience, vcpu 
overcommit is a lot more painful.  People will never see the NUMA 
unfairness issue if they can't use kvm due to the vcpu overcommit problem.

What I'd like to see eventually is a short-term-unfair, long-term-fair 
spinlock.  Might make sense for bare metal as well.  But it won't be 
easy to write.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 16:52       ` Avi Kivity
@ 2010-06-01 17:27         ` Andi Kleen
  2010-06-02  2:51           ` Avi Kivity
  2010-06-01 17:39         ` Valdis.Kletnieks
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2010-06-01 17:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin,
	tglx, mtosatti

On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:
> We are running everything on NUMA (since all modern machines are now NUMA). 
>  At what scale do the issues become observable?

On Intel platforms it's visible starting with 4 sockets.

>
>>> I understand that reason and do not propose to get back to old spinlock
>>> on physical HW! But with virtualization performance hit is unbearable.
>>>      
>> Extreme unfairness can be unbearable too.
>>    
>
> Well, the question is what happens first.  In our experience, vcpu 
> overcommit is a lot more painful.  People will never see the NUMA 
> unfairness issue if they can't use kvm due to the vcpu overcommit problem.

You really have to address both; if you don't fix them both,
users will eventually run into one of them and be unhappy.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 16:52       ` Avi Kivity
  2010-06-01 17:27         ` Andi Kleen
@ 2010-06-01 17:39         ` Valdis.Kletnieks
  2010-06-02  2:46           ` Avi Kivity
  2010-06-02  7:39           ` H. Peter Anvin
  2010-06-01 17:54         ` john cooper
  2010-06-01 21:39         ` Eric Dumazet
  3 siblings, 2 replies; 34+ messages in thread
From: Valdis.Kletnieks @ 2010-06-01 17:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin,
	tglx, mtosatti


On Tue, 01 Jun 2010 19:52:28 +0300, Avi Kivity said:
> On 06/01/2010 07:38 PM, Andi Kleen wrote:
> >>> Your new code would starve again, right?
> > Try it on a NUMA system with unfair memory.

> We are running everything on NUMA (since all modern machines are now 
> NUMA).  At what scale do the issues become observable?

My 6-month-old laptop is NUMA? Comes as a surprise to me, and to the
perfectly-running NUMA=n kernel I'm running.

Or did you mean a less broad phrase than "all modern machines"?




* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 16:52       ` Avi Kivity
  2010-06-01 17:27         ` Andi Kleen
  2010-06-01 17:39         ` Valdis.Kletnieks
@ 2010-06-01 17:54         ` john cooper
  2010-06-01 19:36           ` Andi Kleen
  2010-06-01 21:39         ` Eric Dumazet
  3 siblings, 1 reply; 34+ messages in thread
From: john cooper @ 2010-06-01 17:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin,
	tglx, mtosatti, john cooper

Avi Kivity wrote:
> On 06/01/2010 07:38 PM, Andi Kleen wrote:
>>>> Your new code would starve again, right?
>>>>
>>>>        
>>> Yes, of course it may starve with unfair spinlock. Since vcpus are not
>>> always running there is much smaller chance then vcpu on remote memory
>>> node will starve forever. Old kernels with unfair spinlocks are running
>>> fine in VMs on NUMA machines with various loads.
>>>      
>> Try it on a NUMA system with unfair memory.
>>    
> 
> We are running everything on NUMA (since all modern machines are now
> NUMA).  At what scale do the issues become observable?
> 
>>> I understand that reason and do not propose to get back to old spinlock
>>> on physical HW! But with virtualization performance hit is unbearable.
>>>      
>> Extreme unfairness can be unbearable too.
>>    
> 
> Well, the question is what happens first.  In our experience, vcpu
> overcommit is a lot more painful.  People will never see the NUMA
> unfairness issue if they can't use kvm due to the vcpu overcommit problem.

Gleb's observed performance hit seems to be a rather mild
throughput depression compared with creating a worst case by
enforcing vcpu overcommit.  Running a single guest with 2:1
overcommit on a 4 core machine I saw over an order of magnitude
slowdown vs. 1:1 commit with the same kernel build test.
Others have reported similar results.

How close you'll get to that scenario depends on host
scheduling dynamics and, statistically, on the number of open
and stalled lock-holding paths waiting to be contended.  So
I'd expect to see quite variable numbers for guest-guest
aggravation of this problem.

> What I'd like to see eventually is a short-term-unfair, long-term-fair
> spinlock.  Might make sense for bare metal as well.  But it won't be
> easy to write.

Collecting the contention/usage statistics on a per-spinlock
basis seems complex.  I believe a practical approximation
to this is adaptive mutexes, where upon hitting a spin-time
threshold we punt and let the scheduler reconcile fairness.
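
As a minimal sketch of that adaptive idea (illustrative only: the threshold
is an arbitrary placeholder and hypervisor_yield() is a made-up stand-in for
whatever paravirt yield hook would actually be used):

#define SPIN_THRESHOLD	1024	/* arbitrary placeholder; would need tuning */

static void adaptive_spin_lock(spinlock_t *lock)
{
	unsigned int spins = 0;

	while (!spin_trylock(lock)) {
		if (++spins >= SPIN_THRESHOLD) {
			/* Punt: give our cycles back so the (possibly
			 * preempted) lock holder can run. */
			hypervisor_yield();	/* hypothetical paravirt hook */
			spins = 0;
		} else {
			cpu_relax();
		}
	}
}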

-john

-- 
john.cooper@third-harmonic.com


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 17:54         ` john cooper
@ 2010-06-01 19:36           ` Andi Kleen
  2010-06-03 11:06             ` David Woodhouse
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2010-06-01 19:36 UTC (permalink / raw)
  To: john cooper
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, npiggin, tglx, mtosatti, john cooper

> Collecting the contention/usage statistics on a per spinlock
> basis seems complex.  I believe a practical approximation
> to this are adaptive mutexes where upon hitting a spin
> time threshold, punt and let the scheduler reconcile fairness.

That would probably work, except: how do you get the
adaptive spinlock into a paravirt op without slowing
down a standard kernel?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 16:52       ` Avi Kivity
                           ` (2 preceding siblings ...)
  2010-06-01 17:54         ` john cooper
@ 2010-06-01 21:39         ` Eric Dumazet
  3 siblings, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2010-06-01 21:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin,
	tglx, mtosatti, netdev

Le mardi 01 juin 2010 à 19:52 +0300, Avi Kivity a écrit :

> What I'd like to see eventually is a short-term-unfair, long-term-fair 
> spinlock.  Might make sense for bare metal as well.  But it won't be 
> easy to write.
> 

This thread rings a bell here :)

Yes, ticket spinlocks are sometimes slower, especially in workloads where
a spinlock needs to be taken several times to handle one unit of work
and many cpus are competing.

We currently have kind of a similar problem in the network stack, and we
have a patch to speed up the xmit path by an order of magnitude, letting one
cpu (the consumer cpu) get unfair access to the (ticket) spinlock.
(It then competes with no more than one other cpu.)

Boost from ~50,000 to ~600,000 pps on a dual quad core machine (E5450
@3.00GHz) on a particular workload (many cpus want to xmit their
packets).

( patch : http://patchwork.ozlabs.org/patch/53163/ )


Could it be possible to write such a generic beast, with a cascade of
regular ticket spinlocks?

One ticket spinlock at the first stage (only taken if some conditions are met,
aka the slow path), then a 'primary' spinlock at the second stage.


// generic implementation
// (x86 could use 16bit fields for users_in & user_out)
struct cascade_lock {
	atomic_t 	users_in;
	int		users_out;
	spinlock_t	primlock;
	spinlock_t	slowpathlock; // could be outside of this structure, shared by many 'cascade_locks'
};

/*
 * In the kvm case, you might call into the hypervisor when slowpathlock is about to be taken ?
 * When a cascade lock is unlocked and relocked right after, this cpu has unfair
 * priority and could get the lock before cpus blocked in slowpathlock (especially if
 * a hypervisor call was done)
 *
 * In network xmit path, the dequeue thread would use highprio_user=true mode
 * In network xmit path, the 'contended' enqueueing thread would set a negative threshold,
 *  to force a 'lowprio_user' mode.
 */
void cascade_lock(struct cascade_lock *l, bool highprio_user, int threshold)
{
	bool slowpath = false;

	atomic_inc(&l->users_in); // no real need for atomic_inc_return()
	if (atomic_read(&l->users_in) - l->users_out > threshold && !highprio_user) {
		spin_lock(&l->slowpathlock);
		slowpath = true;
	}
	spin_lock(&l->primlock);
	if (slowpath)
		spin_unlock(&l->slowpathlock);
}

void cascade_unlock(struct cascade_lock *l)
{
	l->users_out++;
	spin_unlock(&l->primlock);
}
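
To make the intended use concrete, a hypothetical caller along the lines of
the xmit example above could look like this (everything here is invented for
illustration; the consumer passes highprio_user=true, while a contended
enqueuer forces the slow path with a negative threshold):

struct cascade_lock txlock;	/* assume it is initialized elsewhere */

void consumer_dequeue_and_xmit(void)
{
	cascade_lock(&txlock, true, 0);		/* high-prio user: never takes slowpathlock */
	/* ... dequeue packets and hand them to the NIC ... */
	cascade_unlock(&txlock);
}

void producer_enqueue(void)
{
	cascade_lock(&txlock, false, -1);	/* negative threshold: always go through slowpathlock */
	/* ... queue one packet ... */
	cascade_unlock(&txlock);
}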





* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 17:39         ` Valdis.Kletnieks
@ 2010-06-02  2:46           ` Avi Kivity
  2010-06-02  7:39           ` H. Peter Anvin
  1 sibling, 0 replies; 34+ messages in thread
From: Avi Kivity @ 2010-06-02  2:46 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin,
	tglx, mtosatti

On 06/01/2010 08:39 PM, Valdis.Kletnieks@vt.edu wrote:
>> We are running everything on NUMA (since all modern machines are now
>> NUMA).  At what scale do the issues become observable?
>>      
> My 6-month-old laptop is NUMA? Comes as a surprise to me, and to the
> perfectly-running NUMA=n kernel I'm running.
>
> Or did you mean a less broad phrase than "all modern machines"?
>
>    

All modern two socket and above boards, sorry.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 17:27         ` Andi Kleen
@ 2010-06-02  2:51           ` Avi Kivity
  2010-06-02  5:26             ` Srivatsa Vaddagiri
  2010-06-02  8:50             ` Andi Kleen
  0 siblings, 2 replies; 34+ messages in thread
From: Avi Kivity @ 2010-06-02  2:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin, tglx, mtosatti

On 06/01/2010 08:27 PM, Andi Kleen wrote:
> On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:
>    
>> We are running everything on NUMA (since all modern machines are now NUMA).
>>   At what scale do the issues become observable?
>>      
> On Intel platforms it's visible starting with 4 sockets.
>    

Can you recommend a benchmark that shows bad behaviour?  I'll run it 
with ticket spinlocks and Gleb's patch.  I have a 4-way Nehalem-EX, 
presumably the huge number of threads will magnify the problem even more 
there.

>>>> I understand that reason and do not propose to get back to old spinlock
>>>> on physical HW! But with virtualization performance hit is unbearable.
>>>>
>>>>          
>>> Extreme unfairness can be unbearable too.
>>>
>>>        
>> Well, the question is what happens first.  In our experience, vcpu
>> overcommit is a lot more painful.  People will never see the NUMA
>> unfairness issue if they can't use kvm due to the vcpu overcommit problem.
>>      
> You really have to address both, if you don't fix them both
> users will eventually into one of them and be unhappy.
>    

That's definitely the long term plan.  I consider Gleb's patch the first 
step.

Do  you have any idea how we can tackle both problems?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-02  2:51           ` Avi Kivity
@ 2010-06-02  5:26             ` Srivatsa Vaddagiri
  2010-06-02  8:50             ` Andi Kleen
  1 sibling, 0 replies; 34+ messages in thread
From: Srivatsa Vaddagiri @ 2010-06-02  5:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin,
	tglx, mtosatti

On Wed, Jun 02, 2010 at 05:51:14AM +0300, Avi Kivity wrote:
> That's definitely the long term plan.  I consider Gleb's patch the
> first step.
> 
> Do  you have any idea how we can tackle both problems?

I recall Xen posting some solution for a similar problem:

http://lkml.org/lkml/2010/1/29/45

Wouldn't a similar approach help KVM as well?

- vatsa


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 17:39         ` Valdis.Kletnieks
  2010-06-02  2:46           ` Avi Kivity
@ 2010-06-02  7:39           ` H. Peter Anvin
  1 sibling, 0 replies; 34+ messages in thread
From: H. Peter Anvin @ 2010-06-02  7:39 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, mingo,
	npiggin, tglx, mtosatti

On 06/01/2010 10:39 AM, Valdis.Kletnieks@vt.edu wrote:
> On Tue, 01 Jun 2010 19:52:28 +0300, Avi Kivity said:
>> On 06/01/2010 07:38 PM, Andi Kleen wrote:
>>>>> Your new code would starve again, right?
>>> Try it on a NUMA system with unfair memory.
>
>> We are running everything on NUMA (since all modern machines are now
>> NUMA).  At what scale do the issues become observable?
>
> My 6-month-old laptop is NUMA? Comes as a surprise to me, and to the
> perfectly-running NUMA=n kernel I'm running.
>
> Or did you mean a less broad phrase than "all modern machines"?
>

All modern multisocket machines, unless configured in interleaved memory 
mode.

	-hpa



* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-02  2:51           ` Avi Kivity
  2010-06-02  5:26             ` Srivatsa Vaddagiri
@ 2010-06-02  8:50             ` Andi Kleen
  2010-06-02  9:00               ` Avi Kivity
  1 sibling, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2010-06-02  8:50 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin,
	tglx, mtosatti

On Wed, Jun 02, 2010 at 05:51:14AM +0300, Avi Kivity wrote:
> On 06/01/2010 08:27 PM, Andi Kleen wrote:
>> On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:
>>    
>>> We are running everything on NUMA (since all modern machines are now NUMA).
>>>   At what scale do the issues become observable?
>>>      
>> On Intel platforms it's visible starting with 4 sockets.
>>    
>
> Can you recommend a benchmark that shows bad behaviour?  I'll run it with 

Pretty much anything with high lock contention.

> ticket spinlocks and Gleb's patch.  I have a 4-way Nehalem-EX, presumably 
> the huge number of threads will magnify the problem even more there.

Yes more threads cause more lock contention too.

> Do  you have any idea how we can tackle both problems?

Apparently Xen has something, perhaps that can be leveraged
(but I haven't looked at their solution in detail)

Otherwise I would probably try to start with an adaptive
spinlock that at some point calls into the HV (or updates
shared memory?), like john cooper suggested. The tricky part here would
be to find the thresholds and fit that state into
paravirt ops and the standard spinlock_t.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-02  8:50             ` Andi Kleen
@ 2010-06-02  9:00               ` Avi Kivity
  2010-06-03  4:20                 ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 34+ messages in thread
From: Avi Kivity @ 2010-06-02  9:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin, tglx, mtosatti

On 06/02/2010 11:50 AM, Andi Kleen wrote:
> On Wed, Jun 02, 2010 at 05:51:14AM +0300, Avi Kivity wrote:
>    
>> On 06/01/2010 08:27 PM, Andi Kleen wrote:
>>      
>>> On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:
>>>
>>>        
>>>> We are running everything on NUMA (since all modern machines are now NUMA).
>>>>    At what scale do the issues become observable?
>>>>
>>>>          
>>> On Intel platforms it's visible starting with 4 sockets.
>>>
>>>        
>> Can you recommend a benchmark that shows bad behaviour?  I'll run it with
>>      
> Pretty much anything with high lock contention.
>    

Okay, we'll try to measure it here as soon as we can switch it into numa 
mode.

>> Do  you have any idea how we can tackle both problems?
>>      
> Apparently Xen has something, perhaps that can be leveraged
> (but I haven't looked at their solution in detail)
>
> Otherwise I would probably try to start with a adaptive
> spinlock that at some point calls into the HV (or updates
> shared memory?), like john cooper suggested. The tricky part here would
> be to find the thresholds and fit that state into
> paravirt ops and the standard spinlock_t.
>
>    

There are two separate problems: the more general problem is that the 
hypervisor can put a vcpu to sleep while holding a lock, causing other 
vcpus to spin until the end of their time slice.  This can only be 
addressed with hypervisor help.  The second problem is that the extreme 
fairness of ticket locks causes lots of context switches if the 
hypervisor helps, and aggravates the first problem horribly if it 
doesn't (since now a vcpu will spin waiting for its ticket even if the 
lock is unlocked).

So yes, we'll need hypervisor assistance, but even with that we'll need 
to reduce ticket lock fairness (retaining global fairness but 
sacrificing some local fairness).  I imagine that will be helpful for 
non-virt as well as local unfairness reduces bounciness.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-02  9:00               ` Avi Kivity
@ 2010-06-03  4:20                 ` Srivatsa Vaddagiri
  2010-06-03  4:51                   ` Eric Dumazet
                                     ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Srivatsa Vaddagiri @ 2010-06-03  4:20 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin,
	tglx, mtosatti

On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
> 
> There are two separate problems: the more general problem is that
> the hypervisor can put a vcpu to sleep while holding a lock, causing
> other vcpus to spin until the end of their time slice.  This can
> only be addressed with hypervisor help.

Fyi - I have an early patch ready to address this issue. Basically I am using
host-kernel memory (mmap'ed into the guest as io-memory via the ivshmem driver)
to hint to the host whenever the guest is in a spin-locked section; the host
scheduler reads this to defer preemption.

Guest side:

static inline void spin_lock(spinlock_t *lock)
{
	raw_spin_lock(&lock->rlock);
+       __get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
}

static inline void spin_unlock(spinlock_t *lock)
{
+	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
        raw_spin_unlock(&lock->rlock);
}

[similar changes to other spinlock variants]

Host side:


@@ -860,6 +866,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq
 	ideal_runtime = sched_slice(cfs_rq, curr);
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
 	if (delta_exec > ideal_runtime) {
+		if ((sched_feat(DEFER_PREEMPT)) && (rq_of(cfs_rq)->curr->ghptr)) {
+			int defer_preempt =  rq_of(cfs_rq)->curr->ghptr->defer_preempt;
+			if (((defer_preempt & 0xFFFF0000) == 0xfeed0000) && ((defer_preempt & 0x0000FFFF) != 0)) {
+				if  ((rq_of(cfs_rq)->curr->grace_defer++ < sysctl_sched_preempt_defer_count)) {
+					rq_of(cfs_rq)->defer_preempt++;
+					return;
+				} else
+					rq_of(cfs_rq)->force_preempt++;
+			}
+		}
 		resched_task(rq_of(cfs_rq)->curr);
 		/*
 		 * The current task ran long enough, ensure it doesn't get

[similar changes introduced at other preemption points in sched_fair.c]


Note that the guest can only request that preemption be deferred (not disabled)
via this mechanism. I have seen a good improvement (~15%) in a kernel compile
benchmark with sysctl_sched_preempt_defer_count set to a low value of just 2
(i.e. we can defer preemption by a maximum of two ticks). I intend to clean up
and post the patches pretty soon for comments.

One pathological case where this may actually hurt is routines in the guest
like flush_tlb_others_ipi(), which take a spinlock and then enter a while()
loop waiting for other cpus to ack something. In this case, deferring
preemption just because the guest is in a critical section actually hurts!
Hopefully the upper bound on deferring preemption and the fact that such
routines may not be hit frequently should help alleviate such situations.

- vatsa



* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03  4:20                 ` Srivatsa Vaddagiri
@ 2010-06-03  4:51                   ` Eric Dumazet
  2010-06-03  5:38                     ` Srivatsa Vaddagiri
  2010-06-03  8:52                   ` Andi Kleen
  2010-06-03 10:38                   ` Nick Piggin
  2 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2010-06-03  4:51 UTC (permalink / raw)
  To: vatsa
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, npiggin, tglx, mtosatti

Le jeudi 03 juin 2010 à 09:50 +0530, Srivatsa Vaddagiri a écrit :
> On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
> > 
> > There are two separate problems: the more general problem is that
> > the hypervisor can put a vcpu to sleep while holding a lock, causing
> > other vcpus to spin until the end of their time slice.  This can
> > only be addressed with hypervisor help.
> 
> Fyi - I have a early patch ready to address this issue. Basically I am using
> host-kernel memory (mmap'ed into guest as io-memory via ivshmem driver) to hint 
> host whenever guest is in spin-lock'ed section, which is read by host scheduler 
> to defer preemption.
> 
> Guest side:
> 
> static inline void spin_lock(spinlock_t *lock)
> {
> 	raw_spin_lock(&lock->rlock);
> +       __get_cpu_var(gh_vcpu_ptr)->defer_preempt++;

1) __this_cpu_inc() should be faster

2) Isn't it a bit late to do this increment _after_
raw_spin_lock(&lock->rlock)?

> }
> 
> static inline void spin_unlock(spinlock_t *lock)
> {
> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
>         raw_spin_unlock(&lock->rlock);
> }





* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03  4:51                   ` Eric Dumazet
@ 2010-06-03  5:38                     ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 34+ messages in thread
From: Srivatsa Vaddagiri @ 2010-06-03  5:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, npiggin, tglx, mtosatti

On Thu, Jun 03, 2010 at 06:51:51AM +0200, Eric Dumazet wrote:
> > Guest side:
> > 
> > static inline void spin_lock(spinlock_t *lock)
> > {
> > 	raw_spin_lock(&lock->rlock);
> > +       __get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
> 
> 1) __this_cpu_inc() should be faster

Ok ..thx for that tip.

> 2) Isnt a bit late to do this increment _after_
> raw_spin_lock(&lock->rlock);

I think so; my worry about doing it earlier is that we may set the defer_preempt
hint for the wrong vcpu (if, let's say, the guest application thread is preempted
by the guest kernel and later migrated to another vcpu after it sets the hint and
before it acquires the lock).

- vatsa


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03  4:20                 ` Srivatsa Vaddagiri
  2010-06-03  4:51                   ` Eric Dumazet
@ 2010-06-03  8:52                   ` Andi Kleen
  2010-06-03  9:26                     ` Srivatsa Vaddagiri
  2010-06-03 10:22                     ` Nick Piggin
  2010-06-03 10:38                   ` Nick Piggin
  2 siblings, 2 replies; 34+ messages in thread
From: Andi Kleen @ 2010-06-03  8:52 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, npiggin, tglx, mtosatti

On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
> On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
> > 
> > There are two separate problems: the more general problem is that
> > the hypervisor can put a vcpu to sleep while holding a lock, causing
> > other vcpus to spin until the end of their time slice.  This can
> > only be addressed with hypervisor help.
> 
> Fyi - I have a early patch ready to address this issue. Basically I am using
> host-kernel memory (mmap'ed into guest as io-memory via ivshmem driver) to hint 
> host whenever guest is in spin-lock'ed section, which is read by host scheduler 
> to defer preemption.

Looks like a nice, simple way to handle this for the kernel.

However I suspect user space will hit the same issue sooner
or later. I assume your way is not easily extensible to futexes?

> One pathological case where this may actually hurt is routines in guest like 
> flush_tlb_others_ipi() which take a spinlock and then enter a while() loop
> waiting for other cpus to ack something. In this case, deferring preemption just
> because guest is in critical section actually hurts! Hopefully the upper bound
> for deferring preemtion and the fact that such routines may not be frequently
> hit should help alleviate such situations.

So do you defer during the whole spinlock region or just during the spin?

I assume the first?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03  8:52                   ` Andi Kleen
@ 2010-06-03  9:26                     ` Srivatsa Vaddagiri
  2010-06-03 10:22                     ` Nick Piggin
  1 sibling, 0 replies; 34+ messages in thread
From: Srivatsa Vaddagiri @ 2010-06-03  9:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Avi Kivity, Gleb Natapov, linux-kernel, kvm, hpa, mingo, npiggin,
	tglx, mtosatti

On Thu, Jun 03, 2010 at 10:52:51AM +0200, Andi Kleen wrote:
> > Fyi - I have a early patch ready to address this issue. Basically I am using
> > host-kernel memory (mmap'ed into guest as io-memory via ivshmem driver) to hint 
> > host whenever guest is in spin-lock'ed section, which is read by host scheduler 
> > to defer preemption.
> 
> Looks like a ni.ce simple way to handle this for the kernel.

The idea is not new. It has been discussed for example at [1].

> However I suspect user space will hit the same issue sooner
> or later. I assume your way is not easily extensable to futexes?

I had thought that most userspace lock implementations avoid spinning for long
times, i.e. they spin for a short while and sleep beyond a threshold?
If that is the case, we shouldn't be burning a lot of cycles unnecessarily
spinning in userspace.

> So do you defer during the whole spinlock region or just during the spin?
> 
> I assume the the first?

My current implementation just blindly defers by a tick and checks if it is safe
to preempt in the next tick - otherwise it gives more grace ticks until the
threshold is crossed (after which we forcibly preempt it).

In the future, I was thinking that the host scheduler could hint back to the
guest that it was given some "grace" time, which the guest can use to yield
when it comes out of the locked section.

- vatsa

1.  http://l4ka.org/publications/2004/Towards-Scalable-Multiprocessor-Virtual-Machines-VM04.pdf


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03  8:52                   ` Andi Kleen
  2010-06-03  9:26                     ` Srivatsa Vaddagiri
@ 2010-06-03 10:22                     ` Nick Piggin
  1 sibling, 0 replies; 34+ messages in thread
From: Nick Piggin @ 2010-06-03 10:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Srivatsa Vaddagiri, Avi Kivity, Gleb Natapov, linux-kernel, kvm,
	hpa, mingo, tglx, mtosatti

On Thu, Jun 03, 2010 at 10:52:51AM +0200, Andi Kleen wrote:
> On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
> > On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
> > > 
> > > There are two separate problems: the more general problem is that
> > > the hypervisor can put a vcpu to sleep while holding a lock, causing
> > > other vcpus to spin until the end of their time slice.  This can
> > > only be addressed with hypervisor help.
> > 
> > Fyi - I have a early patch ready to address this issue. Basically I am using
> > host-kernel memory (mmap'ed into guest as io-memory via ivshmem driver) to hint 
> > host whenever guest is in spin-lock'ed section, which is read by host scheduler 
> > to defer preemption.
> 
> Looks like a ni.ce simple way to handle this for the kernel.
> 
> However I suspect user space will hit the same issue sooner
> or later. I assume your way is not easily extensable to futexes?

Well userspace has always had the problem, hypervisor or not. So
sleeping locks obviously help a lot with that.

But we do hit the problem at times. The MySQL sysbench scalability
problem was a fine example:

http://ozlabs.org/~anton/linux/sysbench/

Performance would tank when threads oversubscribe CPUs because lock
holders would start getting preempted.

This was due to nasty locking in MySQL as well, mind you.

There are some ways to improve it. glibc I believe has an option to
increase thread priority when taking a mutex, which is similar to
what we have here.

But it's a fairly broad problem for userspace. The resource may not
be just a lock but it could be IO as well.

 


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03  4:20                 ` Srivatsa Vaddagiri
  2010-06-03  4:51                   ` Eric Dumazet
  2010-06-03  8:52                   ` Andi Kleen
@ 2010-06-03 10:38                   ` Nick Piggin
  2010-06-03 12:04                     ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2010-06-03 10:38 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, tglx, mtosatti

On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
> On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
> > 
> > There are two separate problems: the more general problem is that
> > the hypervisor can put a vcpu to sleep while holding a lock, causing
> > other vcpus to spin until the end of their time slice.  This can
> > only be addressed with hypervisor help.
> 
> Fyi - I have a early patch ready to address this issue. Basically I am using
> host-kernel memory (mmap'ed into guest as io-memory via ivshmem driver) to hint 
> host whenever guest is in spin-lock'ed section, which is read by host scheduler 
> to defer preemption.
> 
> Guest side:
> 
> static inline void spin_lock(spinlock_t *lock)
> {
> 	raw_spin_lock(&lock->rlock);
> +       __get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
> }
> 
> static inline void spin_unlock(spinlock_t *lock)
> {
> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
>         raw_spin_unlock(&lock->rlock);
> }
> 
> [similar changes to other spinlock variants]

Great, this is a nice way to improve it.

You might want to consider playing with first taking a ticket, and
then, if we fail to acquire the lock immediately, incrementing
defer_preempt before we start spinning.

The downside of this would be if we waste all our slice on spinning
and are then preempted in the critical section. But with ticket locks
you can easily see how many entries are in the queue in front of you.
So you could experiment with starting to defer preemption when we
notice we are getting toward the head of the queue.
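
A rough sketch of that, purely illustrative - arch_spin_take_ticket() and
arch_spin_owner() are made-up helpers exposing the ticket internals (ticket
wraparound is ignored for brevity), while gh_vcpu_ptr/defer_preempt are the
hints from vatsa's snippet above:

static inline void ticket_lock_defer_near_head(raw_spinlock_t *lock)
{
	unsigned int me = arch_spin_take_ticket(lock);	/* atomically bump the tail, return our ticket */
	int hinted = 0;

	while (arch_spin_owner(lock) != me) {
		/* Only request preemption deferral once we are near the head,
		 * so we don't pin this vcpu while it merely burns its slice. */
		if (!hinted && me - arch_spin_owner(lock) <= 1) {
			__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
			hinted = 1;
		}
		cpu_relax();
	}
	if (!hinted)		/* acquired without queueing behind anyone */
		__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
	/* the matching spin_unlock() would decrement defer_preempt as before */
}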

Have you also looked at how s390 checks whether the owning vcpu is running
and, if so, spins, otherwise yields to the hypervisor? Something like
turning it into an adaptive lock. This could be applicable as well.



* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-01 19:36           ` Andi Kleen
@ 2010-06-03 11:06             ` David Woodhouse
  2010-06-03 15:15               ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2010-06-03 11:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: john cooper, Avi Kivity, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, npiggin, tglx, mtosatti, john cooper

On Tue, 2010-06-01 at 21:36 +0200, Andi Kleen wrote:
> > Collecting the contention/usage statistics on a per spinlock
> > basis seems complex.  I believe a practical approximation
> > to this are adaptive mutexes where upon hitting a spin
> > time threshold, punt and let the scheduler reconcile fairness.
> 
> That would probably work, except: how do you get the
> adaptive spinlock into a paravirt op without slowing
> down a standard kernel?

It only ever comes into play in the case where the spinlock is contended
anyway -- surely it shouldn't be _that_ much of a performance issue?

See the way that ppc64 handles it -- on a machine with overcommitted
virtual cpus, it will call __spin_yield (arch/powerpc/lib/locks.c) on
contention, which may cause the virtual CPU to donate its hypervisor
timeslice to the vCPU which is actually holding the lock in question.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation



* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03 10:38                   ` Nick Piggin
@ 2010-06-03 12:04                     ` Srivatsa Vaddagiri
  2010-06-03 12:38                       ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Srivatsa Vaddagiri @ 2010-06-03 12:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, tglx, mtosatti

On Thu, Jun 03, 2010 at 08:38:55PM +1000, Nick Piggin wrote:
> > Guest side:
> > 
> > static inline void spin_lock(spinlock_t *lock)
> > {
> > 	raw_spin_lock(&lock->rlock);
> > +       __get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
> > }
> > 
> > static inline void spin_unlock(spinlock_t *lock)
> > {
> > +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
> >         raw_spin_unlock(&lock->rlock);
> > }
> > 
> > [similar changes to other spinlock variants]
> 
> Great, this is a nice way to improve it.
> 
> You might want to consider playing with first taking a ticket, and
> then if we fail to acquire the lock immediately, then increment
> defer_preempt before we start spinning.
>
> The downside of this would be if we waste all our slice on spinning
> and then preempted in the critical section. But with ticket locks
> you can easily see how many entries in the queue in front of you.
> So you could experiment with starting to defer preempt when we
> notice we are getting toward the head of the queue.

Mm - my goal is to avoid long spin times in the first place (because the
owning vcpu was descheduled at an unfortunate time, i.e. while it was holding a
lock). In that sense, I am targeting preemption-defer of the lock *holder*
rather than of the lock acquirer. So ideally, whenever somebody tries to grab a
lock it should be free most of the time; it can be held only if the owner is
currently running - which means we won't have to spin too long for the lock.

> Have you also looked at how s390 checks if the owning vcpu is running
> and if so it spins, if not yields to the hypervisor. Something like
> turning it into an adaptive lock. This could be applicable as well.

I don't think even s390 does adaptive spinlocks. Also, afaik s390 zVM does gang
scheduling of vcpus, which reduces the severity of this problem very much -
essentially the lock acquirer/holder are run simultaneously on different cpus all
the time. Gang scheduling is on my list of things to look at much later
(although I have been warned that it's a scalability nightmare!).

- vatsa


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03 12:04                     ` Srivatsa Vaddagiri
@ 2010-06-03 12:38                       ` Nick Piggin
  2010-06-03 12:58                         ` Srivatsa Vaddagiri
  2010-06-03 15:17                         ` Andi Kleen
  0 siblings, 2 replies; 34+ messages in thread
From: Nick Piggin @ 2010-06-03 12:38 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, tglx, mtosatti

On Thu, Jun 03, 2010 at 05:34:50PM +0530, Srivatsa Vaddagiri wrote:
> On Thu, Jun 03, 2010 at 08:38:55PM +1000, Nick Piggin wrote:
> > > Guest side:
> > > 
> > > static inline void spin_lock(spinlock_t *lock)
> > > {
> > > 	raw_spin_lock(&lock->rlock);
> > > +       __get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
> > > }
> > > 
> > > static inline void spin_unlock(spinlock_t *lock)
> > > {
> > > +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
> > >         raw_spin_unlock(&lock->rlock);
> > > }
> > > 
> > > [similar changes to other spinlock variants]
> > 
> > Great, this is a nice way to improve it.
> > 
> > You might want to consider playing with first taking a ticket, and
> > then if we fail to acquire the lock immediately, then increment
> > defer_preempt before we start spinning.
> >
> > The downside of this would be if we waste all our slice on spinning
> > and then preempted in the critical section. But with ticket locks
> > you can easily see how many entries in the queue in front of you.
> > So you could experiment with starting to defer preempt when we
> > notice we are getting toward the head of the queue.
> 
> Mm - my goal is to avoid long spin times in the first place (because the 
> owning vcpu was descheduled at an unfortunate time i.e while it was holding a
> lock). From that sense, I am targetting preemption-defer of lock *holder*
> rather than of lock acquirer. So ideally whenever somebody tries to grab a lock,
> it should be free most of the time, it can be held only if the owner is
> currently running - which means we won't have to spin too long for the lock.

Holding a ticket in the queue is effectively the same as holding the
lock, from the pov of processes waiting behind.

The difference of course is that CPU cycles do not directly reduce
latency of ticket holders (only the owner). Spinlock critical sections
should tend to be several orders of magnitude shorter than context
switch times. So if you preempt the guy waiting at the head of the
queue, then it's almost as bad as preempting the lock holder.

 
> > Have you also looked at how s390 checks if the owning vcpu is running
> > and if so it spins, if not yields to the hypervisor. Something like
> > turning it into an adaptive lock. This could be applicable as well.
> 
> I don't think even s390 does adaptive spinlocks. Also afaik s390 zVM does gang
> scheduling of vcpus, which reduces the severity of this problem very much -
> essentially lock acquirer/holder are run simultaneously on different cpus all
> the time. Gang scheduling is on my list of things to look at much later
> (although I have been warned that its a scalablility nightmare!).

It effectively is pretty well an adaptive lock. The spinlock itself
doesn't sleep of course, but it yields to the hypervisor if the owner
has been preempted. This is pretty close to analogous with Linux
adaptive mutexes.

s390 also has the diag9c instruction which I suppose somehow boosts
priority of a preempted contended lock holder. In spite of any other
possible optimizations in their hypervisor like gang scheduling,
diag9c apparently provides quite a large improvement in some cases.

And they aren't even using ticket spinlocks!!

So I think these things are fairly important to look at.
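
For concreteness, that owner-aware behaviour could be sketched like this (the
helpers are invented: spin_lock_owner_cpu() assumes the lock records its owner,
s390-style, vcpu_is_running() would need hypervisor help such as shared memory,
and yield_to_hypervisor() would be a hypercall):

void owner_aware_spin_lock(spinlock_t *lock)
{
	while (!spin_trylock(lock)) {
		int owner = spin_lock_owner_cpu(lock);	/* -1 if unknown */

		if (owner >= 0 && !vcpu_is_running(owner))
			yield_to_hypervisor(owner);	/* owner preempted: donate our slice */
		else
			cpu_relax();			/* owner is running: keep spinning */
	}
}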



* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03 12:38                       ` Nick Piggin
@ 2010-06-03 12:58                         ` Srivatsa Vaddagiri
  2010-06-03 13:04                           ` Srivatsa Vaddagiri
  2010-06-03 13:45                           ` Nick Piggin
  2010-06-03 15:17                         ` Andi Kleen
  1 sibling, 2 replies; 34+ messages in thread
From: Srivatsa Vaddagiri @ 2010-06-03 12:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, tglx, mtosatti

On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
> Holding a ticket in the queue is effectively the same as holding the
> lock, from the pov of processes waiting behind.
> 
> The difference of course is that CPU cycles do not directly reduce
> latency of ticket holders (only the owner). Spinlock critical sections
> should tend to be several orders of magnitude shorter than context
> switch times. So if you preempt the guy waiting at the head of the
> queue, then it's almost as bad as preempting the lock holder.

Ok got it - although that approach is not advisable in some cases, for example
when the lock holder vcpu and the lock acquirer vcpu are scheduled on the same
pcpu by the hypervisor (which was experimented with in [1], where they found a
huge perf hit).

I agree that in general we should look at deferring preemption of the lock
acquirer, especially when it's at the "head" as you suggest - I will consider
that approach as the next step (I want to progress incrementally, basically!).

> > > Have you also looked at how s390 checks if the owning vcpu is running
> > > and if so it spins, if not yields to the hypervisor. Something like
> > > turning it into an adaptive lock. This could be applicable as well.
> > 
> > I don't think even s390 does adaptive spinlocks. Also afaik s390 zVM does gang
> > scheduling of vcpus, which reduces the severity of this problem very much -
> > essentially lock acquirer/holder are run simultaneously on different cpus all
> > the time. Gang scheduling is on my list of things to look at much later
> > (although I have been warned that its a scalablility nightmare!).
> 
> It effectively is pretty well an adaptive lock. The spinlock itself
> doesn't sleep of course, but it yields to the hypervisor if the owner
> has been preempted. This is pretty close to analogous with Linux adaptive mutexes.

Oops, you are right - sorry, I should have checked more closely earlier. Given
that we may not always be able to guarantee that locked critical sections will
not be preempted (e.g. when a real-time task takes over), we will need a
combination of both approaches (i.e. request preemption deferral on the
lock-hold path + yield on the lock-acquire path if the owner is not scheduled).
The advantage of the former approach is that it could reduce job turnaround
times in most cases (as the lock is available when we want it, or we don't have
to wait too long for it).

> s390 also has the diag9c instruction which I suppose somehow boosts
> priority of a preempted contended lock holder. In spite of any other
> possible optimizations in their hypervisor like gang scheduling,
> diag9c apparently provides quite a large improvement in some cases.

Ok - thx for that pointer - will have a look at diag9c.

> So I think these things are fairly important to look at.

I agree ..

- vatsa


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03 12:58                         ` Srivatsa Vaddagiri
@ 2010-06-03 13:04                           ` Srivatsa Vaddagiri
  2010-06-03 13:45                           ` Nick Piggin
  1 sibling, 0 replies; 34+ messages in thread
From: Srivatsa Vaddagiri @ 2010-06-03 13:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, tglx, mtosatti

On Thu, Jun 03, 2010 at 06:28:21PM +0530, Srivatsa Vaddagiri wrote:
> Ok got it - although that approach is not advisable in some cases for ex: when
> the lock holder vcpu and lock acquired vcpu are scheduled on the same pcpu by
> the hypervisor (which was experimented with in [1] where they foud a huge hit in
> perf).

1. http://lkml.org/lkml/2010/4/13/464

- vatsa


* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03 12:58                         ` Srivatsa Vaddagiri
  2010-06-03 13:04                           ` Srivatsa Vaddagiri
@ 2010-06-03 13:45                           ` Nick Piggin
  2010-06-03 14:48                             ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2010-06-03 13:45 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, tglx, mtosatti

On Thu, Jun 03, 2010 at 06:28:21PM +0530, Srivatsa Vaddagiri wrote:
> On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
> > Holding a ticket in the queue is effectively the same as holding the
> > lock, from the pov of processes waiting behind.
> > 
> > The difference of course is that CPU cycles do not directly reduce
> > latency of ticket holders (only the owner). Spinlock critical sections
> > should tend to be several orders of magnitude shorter than context
> > switch times. So if you preempt the guy waiting at the head of the
> > queue, then it's almost as bad as preempting the lock holder.
> 
> Ok, got it - although that approach is not advisable in some cases, for example
> when the lock holder vcpu and the lock acquirer vcpu are scheduled on the same
> pcpu by the hypervisor (which was experimented with in [1], where they found a
> huge hit in perf).

Sure but if you had adaptive yielding, that solves that problem.

 
> I agree that in general we should look at deferring preemption of the lock
> acquirer, especially when it is at the "head", as you suggest - I will consider
> that approach as the next step (I want to progress incrementally, basically!).
>
> > > > Have you also looked at how s390 checks whether the owning vcpu is running,
> > > > and if so spins, but if not yields to the hypervisor? Something like
> > > > turning it into an adaptive lock. This could be applicable as well.
> > > 
> > > I don't think even s390 does adaptive spinlocks. Also afaik s390 zVM does gang
> > > scheduling of vcpus, which reduces the severity of this problem very much -
> > > essentially lock acquirer/holder are run simultaneously on different cpus all
> > > the time. Gang scheduling is on my list of things to look at much later
> > > (although I have been warned that it's a scalability nightmare!).
> > 
> > It effectively is pretty well an adaptive lock. The spinlock itself
> > doesn't sleep of course, but it yields to the hypervisor if the owner
> > has been preempted. This is pretty closely analogous to Linux adaptive mutexes.
> 
> Oops, you are right - sorry, I should have checked more closely earlier. Given that
> we may not always be able to guarantee that locked critical sections will not be
> preempted (e.g. when a real-time task takes over), we will need a combination of
> both approaches (i.e. request preemption deferral on the lock hold path + yield on
> the lock acquire path if the owner is not scheduled). The advantage of the former
> approach is that it could reduce job turnaround times in most cases (as the lock is
> available when we want it, or we don't have to wait too long for it).

Both I think would be good. It might be interesting to talk with the
s390 guys and see if they can look at ticket locks and preempt defer
techniques too (considering they already do the other half of the
equation well).
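
To make the "preempt defer on the hold path" half concrete, it would be roughly
the sketch below (again illustrative only - preempt_defer_request() and
preempt_defer_clear() are invented names for a guest->hypervisor hint, e.g. a bit
in a shared per-vcpu area that the host scheduler honours for a bounded time;
nothing like that exists today):

	static inline void pv_spin_lock(arch_spinlock_t *lock)
	{
		preempt_defer_request();	/* hint: please don't preempt me for a bit */
		arch_spin_lock(lock);
	}

	static inline void pv_spin_unlock(arch_spinlock_t *lock)
	{
		arch_spin_unlock(lock);
		preempt_defer_clear();		/* critical section done, fair game again */
	}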

 
> > s390 also has the diag9c instruction which I suppose somehow boosts
> > priority of a preempted contended lock holder. In spite of any other
> > possible optimizations in their hypervisor like gang scheduling,
> > diag9c apparently provides quite a large improvement in some cases.
> 
> Ok - thx for that pointer - will have a look at diag9c.
> 
> > So I think these things are fairly important to look at.
> 
> I agree ..
> 
> - vatsa

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03 13:45                           ` Nick Piggin
@ 2010-06-03 14:48                             ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 34+ messages in thread
From: Srivatsa Vaddagiri @ 2010-06-03 14:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Andi Kleen, Gleb Natapov, linux-kernel, kvm, hpa,
	mingo, tglx, mtosatti, schwidefsky, heiko.carstens

On Thu, Jun 03, 2010 at 11:45:00PM +1000, Nick Piggin wrote:
> > Ok, got it - although that approach is not advisable in some cases, for example
> > when the lock holder vcpu and the lock acquirer vcpu are scheduled on the same
> > pcpu by the hypervisor (which was experimented with in [1], where they found a
> > huge hit in perf).
> 
> Sure but if you had adaptive yielding, that solves that problem.

I guess so.

> > Oops, you are right - sorry, I should have checked more closely earlier. Given that
> > we may not always be able to guarantee that locked critical sections will not be
> > preempted (e.g. when a real-time task takes over), we will need a combination of
> > both approaches (i.e. request preemption deferral on the lock hold path + yield on
> > the lock acquire path if the owner is not scheduled). The advantage of the former
> > approach is that it could reduce job turnaround times in most cases (as the lock is
> > available when we want it, or we don't have to wait too long for it).
> 
> Both I think would be good. It might be interesting to talk with the
> s390 guys and see if they can look at ticket locks and preempt defer
> techniques too (considering they already do the other half of the
> equation well).

Martin/Heiko,
	Do you want to comment on this?

- vatsa

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03 11:06             ` David Woodhouse
@ 2010-06-03 15:15               ` Andi Kleen
  0 siblings, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2010-06-03 15:15 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Andi Kleen, john cooper, Avi Kivity, Gleb Natapov, linux-kernel,
	kvm, hpa, mingo, npiggin, tglx, mtosatti, john cooper

On Thu, Jun 03, 2010 at 12:06:39PM +0100, David Woodhouse wrote:
> On Tue, 2010-06-01 at 21:36 +0200, Andi Kleen wrote:
> > > Collecting the contention/usage statistics on a per-spinlock
> > > basis seems complex.  I believe a practical approximation
> > > to this is adaptive mutexes, where upon hitting a spin-time
> > > threshold we punt and let the scheduler reconcile fairness.
> > 
> > That would probably work, except: how do you get the
> > adaptive spinlock into a paravirt op without slowing
> > down a standard kernel?
> 
> It only ever comes into play in the case where the spinlock is contended
> anyway -- surely it shouldn't be _that_ much of a performance issue?

The problem is fitting the state into the u32.

Also "lightly contended" is not that uncommon.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03 12:38                       ` Nick Piggin
  2010-06-03 12:58                         ` Srivatsa Vaddagiri
@ 2010-06-03 15:17                         ` Andi Kleen
  2010-06-03 15:35                           ` Nick Piggin
  1 sibling, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2010-06-03 15:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Srivatsa Vaddagiri, Avi Kivity, Andi Kleen, Gleb Natapov,
	linux-kernel, kvm, hpa, mingo, tglx, mtosatti

On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
> And they aren't even using ticket spinlocks!!

I suppose they simply don't have unfair memory. Makes things easier.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03 15:17                         ` Andi Kleen
@ 2010-06-03 15:35                           ` Nick Piggin
  2010-06-03 17:25                             ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2010-06-03 15:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Srivatsa Vaddagiri, Avi Kivity, Gleb Natapov, linux-kernel, kvm,
	hpa, mingo, tglx, mtosatti

On Thu, Jun 03, 2010 at 05:17:30PM +0200, Andi Kleen wrote:
> On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
> > And they aren't even using ticket spinlocks!!
> 
> I suppose they simply don't have unfair memory. Makes things easier.

That would certainly be a part of it; I'm sure they have stronger
fairness guarantees at the expense of some performance. We saw the
spinlock starvation first on 8-16 core Opterons, I think, whereas Altix
has been run at over 1024 cores and POWER7 at 1024 threads now, apparently
without reported problems.

However I think more is needed than simply "fair" memory at the cache
coherency level, considering that for example s390 implements it simply
by retrying cas until it succeeds. So you could perfectly round-robin
all cache requests for the lock word, but one core could quite easily
always find it is granted the cacheline when the lock is already taken.

So I think actively enforcing fairness at the lock level would be
required. Something like if it is detected that a core is not making
progress on a tight cas loop, then it will need to enter a queue of
cores where the head of the queue is always granted the cacheline first
after it has been dirtied. Interrupts would need to be ignored by this
logic. This still doesn't solve the problem of an owner unfairly
releasing and grabbing the lock again; they could have more detection to
handle that.

I don't know how far hardware goes. Maybe it is enough to statistically
avoid starvation if memory is pretty fair. But it does seem a lot easier
to enforce fairness in software.
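
For clarity, by "retrying cas" I mean essentially the kind of loop below (sketch
only, written with the generic kernel cmpxchg()/cpu_relax() rather than the
actual s390 primitives). Note there is no ordering among the waiters at all -
whoever wins the next cas gets the lock:

	static void unfair_spin_lock(unsigned int *lock)
	{
		/* keep trying to swing the lock word 0 -> 1 until we win */
		while (cmpxchg(lock, 0, 1) != 0)
			cpu_relax();
	}

	static void unfair_spin_unlock(unsigned int *lock)
	{
		*lock = 0;	/* plus whatever barrier the arch needs */
	}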


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] use unfair spinlock when running on hypervisor.
  2010-06-03 15:35                           ` Nick Piggin
@ 2010-06-03 17:25                             ` Andi Kleen
  0 siblings, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2010-06-03 17:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Srivatsa Vaddagiri, Avi Kivity, Gleb Natapov,
	linux-kernel, kvm, hpa, mingo, tglx, mtosatti

> That would certainly be a part of it; I'm sure they have stronger
> fairness guarantees at the expense of some performance. We saw the
> spinlock starvation first on 8-16 core Opterons, I think, whereas Altix
> has been run at over 1024 cores and POWER7 at 1024 threads now, apparently
> without reported problems.

I suppose P7 handles that in the HV through the pvcall.

Altix AFAIK has special hardware for this in the interconnect, 
but as individual nodes get larger and have more cores you'll start
seeing it there too.

In general we now have the problem that, with increasing core counts
per socket, each NUMA node can be a fairly large SMP system by itself,
and several of the old SMP scalability problems that were fixed
by having per-node data structures are back now.

For example, this is now a serious problem with the zone locks in some
workloads on 8-core+HT systems.

> So I think actively enforcing fairness at the lock level would be
> required. Something like if it is detected that a core is not making

I suppose how exactly that works is IBM's secret sauce. Anyway,
as long as there are no reports I wouldn't worry about it.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2010-06-03 17:25 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-01  9:35 [PATCH] use unfair spinlock when running on hypervisor Gleb Natapov
2010-06-01 15:53 ` Andi Kleen
2010-06-01 16:24   ` Gleb Natapov
2010-06-01 16:38     ` Andi Kleen
2010-06-01 16:52       ` Avi Kivity
2010-06-01 17:27         ` Andi Kleen
2010-06-02  2:51           ` Avi Kivity
2010-06-02  5:26             ` Srivatsa Vaddagiri
2010-06-02  8:50             ` Andi Kleen
2010-06-02  9:00               ` Avi Kivity
2010-06-03  4:20                 ` Srivatsa Vaddagiri
2010-06-03  4:51                   ` Eric Dumazet
2010-06-03  5:38                     ` Srivatsa Vaddagiri
2010-06-03  8:52                   ` Andi Kleen
2010-06-03  9:26                     ` Srivatsa Vaddagiri
2010-06-03 10:22                     ` Nick Piggin
2010-06-03 10:38                   ` Nick Piggin
2010-06-03 12:04                     ` Srivatsa Vaddagiri
2010-06-03 12:38                       ` Nick Piggin
2010-06-03 12:58                         ` Srivatsa Vaddagiri
2010-06-03 13:04                           ` Srivatsa Vaddagiri
2010-06-03 13:45                           ` Nick Piggin
2010-06-03 14:48                             ` Srivatsa Vaddagiri
2010-06-03 15:17                         ` Andi Kleen
2010-06-03 15:35                           ` Nick Piggin
2010-06-03 17:25                             ` Andi Kleen
2010-06-01 17:39         ` Valdis.Kletnieks
2010-06-02  2:46           ` Avi Kivity
2010-06-02  7:39           ` H. Peter Anvin
2010-06-01 17:54         ` john cooper
2010-06-01 19:36           ` Andi Kleen
2010-06-03 11:06             ` David Woodhouse
2010-06-03 15:15               ` Andi Kleen
2010-06-01 21:39         ` Eric Dumazet
