linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 0/4] Paravirtual spinlocks
@ 2008-07-07 19:07 Jeremy Fitzhardinge
  2008-07-07 19:07 ` [PATCH RFC 1/4] x86/paravirt: add hooks for spinlock operations Jeremy Fitzhardinge
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-07 19:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: LKML, Ingo Molnar, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel

At the most recent Xen Summit, Thomas Friebel presented a paper
("Preventing Guests from Spinning Around",
http://xen.org/files/xensummitboston08/LHP.pdf) investigating the
interactions between spinlocks and virtual machines.  Specifically, he
looked at what happens when a lock-holding VCPU gets involuntarily
preempted.

The obvious first-order effect is that while the VCPU is not running,
the effective critical region time goes from microseconds to
milliseconds, until it gets scheduled again.  This increases the
chance that there will be contention, and that the contending VCPUs
will waste time spinning.

This is a measurable effect, but not terribly serious.  After all,
since Linux tends to hold locks for very short periods of time,
the likelihood of being preempted while holding a lock is low.

The real eye-opener is the set of secondary effects specific to ticket locks.

Clearly ticket locks suffer the same problem as all spinlocks.  But
when the lock holder releases the lock, the real fun begins.

By design, ticket locks are strictly fair, imposing a FIFO order on
lock holders.  The micro-architectural effect of this is that the
lock cache line will bounce around between the contending CPUs until
it finds the next in line, which then takes the lock and carries on.

When running in a virtual machine, a similar effect happens at the
VCPU level.  If not all of the contending VCPUs are currently running
on real CPUs, the VCPU scheduler will run some arbitrary subset of
them.  If it isn't a given VCPU's turn to take the lock, it will spin,
burning a VCPU timeslice.  Eventually the next-in-line will get
scheduled, take the lock, release it, and the remaining contending
VCPUs will repeat the process until the next in line is scheduled.

This means that the effective contention time of the lock is not
merely the time it takes the original lock-holder to take and release
the lock - including any preemption it may suffer - but also the
spin-scheduling storm that follows to schedule the right VCPU to next
take the lock.  This can happen even if the original contention was
not the result of preemption, but just normal spinlock-level
contention.
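
As a purely illustrative sketch (C11 atomics standing in for the
kernel's actual asm implementation), a ticket lock looks roughly like
this; the strict FIFO hand-off in ticket_unlock() is what turns a
preempted next-in-line VCPU into a stall for every later ticket holder:

	#include <stdatomic.h>

	struct ticket_lock {
		atomic_ushort next;	/* next ticket to hand out */
		atomic_ushort owner;	/* ticket currently being served */
	};

	static void ticket_lock(struct ticket_lock *lock)
	{
		/* atomically take the next ticket */
		unsigned short me = atomic_fetch_add(&lock->next, 1);

		/* spin until it's our turn; in a guest this burns a whole
		 * vcpu timeslice whenever the owner (or the next-in-line
		 * vcpu) isn't running on a physical cpu */
		while (atomic_load(&lock->owner) != me)
			;	/* rep;nop ("pause") in the real code */
	}

	static void ticket_unlock(struct ticket_lock *lock)
	{
		/* strictly FIFO: only the holder of the next ticket may
		 * proceed, no matter which contending vcpu runs first */
		atomic_fetch_add(&lock->owner, 1);
	}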

One of the results Thomas presents is a kernbench run which normally
takes less than a minute instead taking 45 minutes, with 99+% of the
time spent in ticket lock contention.  I've reproduced similar
results.

This series has:
 - a paravirt_ops spinlock interface, which defaults to the standard
   ticket lock algorithm,
 - a second spinlock implementation based on the pre-ticket-lock
   "lock-byte" algorithm,
 - And a Xen-specific spinlock algorithm which voluntarily preempts a
   VCPU if it spins for too long. [FOR REFERENCE ONLY: will not apply
   to a current git tree.]

When running on native hardware, the overhead of enabling
CONFIG_PARAVIRT is an extra direct call/return on the lock/unlock
paths; the paravirt-ops patching machinery eliminates any indirect
calls.  With a small amount of restructuring, this overhead could be
eliminated (by making spin_lock()/unlock() inline functions,
containing calls to __raw_spin_lock/unlock).
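
Roughly, that restructuring would amount to a pair of inline wrappers
like the sketch below (ignoring the lockdep and irq-flags variants);
after pv-ops patching, __raw_spin_lock/unlock become direct calls, so
the lock and unlock paths would still see a single call each:

	/* Sketch only: spin_lock()/unlock() as inline wrappers around the
	 * (patched) paravirt entry points. */
	static __always_inline void spin_lock(spinlock_t *lock)
	{
		preempt_disable();
		__raw_spin_lock(&lock->raw_lock);	/* direct call after patching */
	}

	static __always_inline void spin_unlock(spinlock_t *lock)
	{
		__raw_spin_unlock(&lock->raw_lock);	/* direct call after patching */
		preempt_enable();
	}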

My experiments show that using a Xen-specific lock helps guest
performance a bit (reduction in elapsed and system time in a kernbench
run), but most significantly, reduces overall physical CPU consumption
by 10%, and so increases overall system scalability.

   J
-- 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RFC 1/4] x86/paravirt: add hooks for spinlock operations
  2008-07-07 19:07 [PATCH RFC 0/4] Paravirtual spinlocks Jeremy Fitzhardinge
@ 2008-07-07 19:07 ` Jeremy Fitzhardinge
  2008-07-07 19:07 ` [PATCH RFC 2/4] paravirt: introduce a "lock-byte" spinlock implementation Jeremy Fitzhardinge
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-07 19:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: LKML, Ingo Molnar, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel,
	Jeremy Fitzhardinge

[-- Attachment #1: paravirt-spinlocks.patch --]
[-- Type: text/plain, Size: 8933 bytes --]

Ticket spinlocks have absolutely ghastly worst-case performance
characteristics in a virtual environment.  If there is any contention
for physical CPUs (ie, there are more runnable vcpus than cpus), then
ticket locks can cause the system to end up spending 90+% of its time
spinning.

The problem is that (v)cpus waiting on a ticket spinlock will be
granted access to the lock in the strict order in which they got their
tickets.  If the hypervisor scheduler doesn't give the vcpus time in
that order, they will burn timeslices waiting for the scheduler to
give the right vcpu some time.  In the worst case it could take O(n^2)
vcpu scheduler timeslices for everyone waiting on the lock to get it
(for each handoff, every one of the n waiters may get scheduled and
spin uselessly before the right vcpu finally runs), not counting new
cpus trying to take the lock while the log-jam is sorted out.

These hooks allow a paravirt backend to replace the spinlock
implementation.

At the very least, this could revert the implementation to the old
lock algorithm, which allows whichever vcpu is scheduled next to take
the lock, and has basically good performance.

It also allows the spinlocks to take advantage of hypervisor features
to make locks more efficient (spin-and-block, for example).

The cost to native execution is an extra direct call when using a
spinlock function.  There's no overhead if CONFIG_PARAVIRT is turned
off.

The lock structure is fixed at a single "unsigned int", initialized to
zero, but the spinlock implementation can use it as it wishes.

Thanks to Thomas Friebel's Xen Summit talk "Preventing Guests from
Spinning Around" for pointing out this problem.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Thomas Friebel <thomas.friebel@amd.com>
---
 arch/x86/kernel/paravirt.c       |   10 ++++++
 include/asm-x86/paravirt.h       |   37 +++++++++++++++++++++++++
 include/asm-x86/spinlock.h       |   55 +++++++++++++++++++++++++++-----------
 include/asm-x86/spinlock_types.h |    2 -
 4 files changed, 88 insertions(+), 16 deletions(-)

===================================================================
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -123,6 +123,7 @@
 		.pv_irq_ops = pv_irq_ops,
 		.pv_apic_ops = pv_apic_ops,
 		.pv_mmu_ops = pv_mmu_ops,
+		.pv_lock_ops = pv_lock_ops,
 	};
 	return *((void **)&tmpl + type);
 }
@@ -445,6 +446,15 @@
 	.set_fixmap = native_set_fixmap,
 };
 
+struct pv_lock_ops pv_lock_ops = {
+	.spin_is_locked = __ticket_spin_is_locked,
+	.spin_is_contended = __ticket_spin_is_contended,
+
+	.spin_lock = __ticket_spin_lock,
+	.spin_trylock = __ticket_spin_trylock,
+	.spin_unlock = __ticket_spin_unlock,
+};
+
 EXPORT_SYMBOL_GPL(pv_time_ops);
 EXPORT_SYMBOL    (pv_cpu_ops);
 EXPORT_SYMBOL    (pv_mmu_ops);
===================================================================
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -329,6 +329,15 @@
 			   unsigned long phys, pgprot_t flags);
 };
 
+struct raw_spinlock;
+struct pv_lock_ops {
+	int (*spin_is_locked)(struct raw_spinlock *lock);
+	int (*spin_is_contended)(struct raw_spinlock *lock);
+	void (*spin_lock)(struct raw_spinlock *lock);
+	int (*spin_trylock)(struct raw_spinlock *lock);
+	void (*spin_unlock)(struct raw_spinlock *lock);
+};
+
 /* This contains all the paravirt structures: we get a convenient
  * number for each function using the offset which we use to indicate
  * what to patch. */
@@ -339,6 +348,7 @@
 	struct pv_irq_ops pv_irq_ops;
 	struct pv_apic_ops pv_apic_ops;
 	struct pv_mmu_ops pv_mmu_ops;
+	struct pv_lock_ops pv_lock_ops;
 };
 
 extern struct pv_info pv_info;
@@ -348,6 +358,7 @@
 extern struct pv_irq_ops pv_irq_ops;
 extern struct pv_apic_ops pv_apic_ops;
 extern struct pv_mmu_ops pv_mmu_ops;
+extern struct pv_lock_ops pv_lock_ops;
 
 #define PARAVIRT_PATCH(x)					\
 	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
@@ -1377,6 +1388,31 @@
 void _paravirt_nop(void);
 #define paravirt_nop	((void *)_paravirt_nop)
 
+static inline int __raw_spin_is_locked(struct raw_spinlock *lock)
+{
+	return PVOP_CALL1(int, pv_lock_ops.spin_is_locked, lock);
+}
+
+static inline int __raw_spin_is_contended(struct raw_spinlock *lock)
+{
+	return PVOP_CALL1(int, pv_lock_ops.spin_is_contended, lock);
+}
+
+static __always_inline void __raw_spin_lock(struct raw_spinlock *lock)
+{
+	return PVOP_VCALL1(pv_lock_ops.spin_lock, lock);
+}
+
+static __always_inline int __raw_spin_trylock(struct raw_spinlock *lock)
+{
+	return PVOP_CALL1(int, pv_lock_ops.spin_trylock, lock);
+}
+
+static __always_inline void __raw_spin_unlock(struct raw_spinlock *lock)
+{
+	return PVOP_VCALL1(pv_lock_ops.spin_unlock, lock);
+}
+
 /* These all sit in the .parainstructions section to tell us what to patch. */
 struct paravirt_patch_site {
 	u8 *instr; 		/* original instructions */
@@ -1461,6 +1497,7 @@
 	return f;
 }
 
+
 /* Make sure as little as possible of this mess escapes. */
 #undef PARAVIRT_CALL
 #undef __PVOP_CALL
===================================================================
--- a/include/asm-x86/spinlock.h
+++ b/include/asm-x86/spinlock.h
@@ -6,7 +6,7 @@
 #include <asm/page.h>
 #include <asm/processor.h>
 #include <linux/compiler.h>
-
+#include <asm/paravirt.h>
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  *
@@ -54,21 +54,21 @@
  * much between them in performance though, especially as locks are out of line.
  */
 #if (NR_CPUS < 256)
-static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
+static inline int __ticket_spin_is_locked(raw_spinlock_t *lock)
 {
 	int tmp = ACCESS_ONCE(lock->slock);
 
 	return (((tmp >> 8) & 0xff) != (tmp & 0xff));
 }
 
-static inline int __raw_spin_is_contended(raw_spinlock_t *lock)
+static inline int __ticket_spin_is_contended(raw_spinlock_t *lock)
 {
 	int tmp = ACCESS_ONCE(lock->slock);
 
 	return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1;
 }
 
-static __always_inline void __raw_spin_lock(raw_spinlock_t *lock)
+static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
 {
 	short inc = 0x0100;
 
@@ -87,9 +87,7 @@
 		: "memory", "cc");
 }
 
-#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
-
-static __always_inline int __raw_spin_trylock(raw_spinlock_t *lock)
+static __always_inline int __ticket_spin_trylock(raw_spinlock_t *lock)
 {
 	int tmp;
 	short new;
@@ -110,7 +108,7 @@
 	return tmp;
 }
 
-static __always_inline void __raw_spin_unlock(raw_spinlock_t *lock)
+static __always_inline void __ticket_spin_unlock(raw_spinlock_t *lock)
 {
 	asm volatile(UNLOCK_LOCK_PREFIX "incb %0"
 		     : "+m" (lock->slock)
@@ -118,21 +116,21 @@
 		     : "memory", "cc");
 }
 #else
-static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
+static inline int __ticket_spin_is_locked(raw_spinlock_t *lock)
 {
 	int tmp = ACCESS_ONCE(lock->slock);
 
 	return (((tmp >> 16) & 0xffff) != (tmp & 0xffff));
 }
 
-static inline int __raw_spin_is_contended(raw_spinlock_t *lock)
+static inline int __ticket_spin_is_contended(raw_spinlock_t *lock)
 {
 	int tmp = ACCESS_ONCE(lock->slock);
 
 	return (((tmp >> 16) & 0xffff) - (tmp & 0xffff)) > 1;
 }
 
-static __always_inline void __raw_spin_lock(raw_spinlock_t *lock)
+static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
 {
 	int inc = 0x00010000;
 	int tmp;
@@ -153,9 +151,7 @@
 		     : "memory", "cc");
 }
 
-#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
-
-static __always_inline int __raw_spin_trylock(raw_spinlock_t *lock)
+static __always_inline int __ticket_spin_trylock(raw_spinlock_t *lock)
 {
 	int tmp;
 	int new;
@@ -177,7 +173,7 @@
 	return tmp;
 }
 
-static __always_inline void __raw_spin_unlock(raw_spinlock_t *lock)
+static __always_inline void __ticket_spin_unlock(raw_spinlock_t *lock)
 {
 	asm volatile(UNLOCK_LOCK_PREFIX "incw %0"
 		     : "+m" (lock->slock)
@@ -185,6 +181,35 @@
 		     : "memory", "cc");
 }
 #endif
+
+#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
+
+#ifndef CONFIG_PARAVIRT
+static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
+{
+	return __ticket_spin_is_locked(lock);
+}
+
+static inline int __raw_spin_is_contended(raw_spinlock_t *lock)
+{
+	return __ticket_spin_is_contended(lock);
+}
+
+static __always_inline void __raw_spin_lock(raw_spinlock_t *lock)
+{
+	__ticket_spin_lock(lock);
+}
+
+static __always_inline int __raw_spin_trylock(raw_spinlock_t *lock)
+{
+	return __ticket_spin_trylock(lock);
+}
+
+static __always_inline void __raw_spin_unlock(raw_spinlock_t *lock)
+{
+	__ticket_spin_unlock(lock);
+}
+#endif	/* CONFIG_PARAVIRT */
 
 static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
 {
===================================================================
--- a/include/asm-x86/spinlock_types.h
+++ b/include/asm-x86/spinlock_types.h
@@ -5,7 +5,7 @@
 # error "please don't include this file directly"
 #endif
 
-typedef struct {
+typedef struct raw_spinlock {
 	unsigned int slock;
 } raw_spinlock_t;
 

-- 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RFC 2/4] paravirt: introduce a "lock-byte" spinlock implementation
  2008-07-07 19:07 [PATCH RFC 0/4] Paravirtual spinlocks Jeremy Fitzhardinge
  2008-07-07 19:07 ` [PATCH RFC 1/4] x86/paravirt: add hooks for spinlock operations Jeremy Fitzhardinge
@ 2008-07-07 19:07 ` Jeremy Fitzhardinge
  2008-07-07 19:07 ` [PATCH RFC 3/4] xen: use lock-byte " Jeremy Fitzhardinge
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-07 19:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: LKML, Ingo Molnar, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel,
	Jeremy Fitzhardinge

[-- Attachment #1: paravirt-byte-spinlock.patch --]
[-- Type: text/plain, Size: 4062 bytes --]

Implement a version of the old spinlock algorithm, in which everyone
spins waiting for a lock byte.  In order to be compatible with the
ticket-lock's use of a zero initializer, this uses the convention of
'0' for unlocked and '1' for locked.

This algorithm is much better than ticket locks in a virtual
environment, because it doesn't interact badly with the vcpu scheduler.
If there are multiple vcpus spinning on a lock and the lock is
released, the next vcpu to be scheduled will take the lock, rather
than cycling around until the next ticketed vcpu gets it.

To use this, you must call paravirt_use_bytelocks() very early, before
any spinlocks have been taken.
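
For example (sketch only; the function name below is made up, and the
Xen hook-up in the next patch is the real user), a hypervisor port
would do something like:

	/* hypothetical early platform-setup hook, run before any spinlock
	 * is ever taken (e.g. from the guest's smp_prepare_boot_cpu) */
	static void __init myhv_init_locks(void)
	{
		paravirt_use_bytelocks();	/* switch pv_lock_ops to byte locks */
	}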

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Thomas Friebel <thomas.friebel@amd.com>
---
 arch/x86/kernel/paravirt.c |    9 ++++++
 include/asm-x86/paravirt.h |    2 +
 include/asm-x86/spinlock.h |   65 +++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 75 insertions(+), 1 deletion(-)

===================================================================
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -269,6 +269,15 @@
 	return __get_cpu_var(paravirt_lazy_mode);
 }
 
+void __init paravirt_use_bytelocks(void)
+{
+	pv_lock_ops.spin_is_locked = __byte_spin_is_locked;
+	pv_lock_ops.spin_is_contended = __byte_spin_is_contended;
+	pv_lock_ops.spin_lock = __byte_spin_lock;
+	pv_lock_ops.spin_trylock = __byte_spin_trylock;
+	pv_lock_ops.spin_unlock = __byte_spin_unlock;
+}
+
 struct pv_info pv_info = {
 	.name = "bare hardware",
 	.paravirt_enabled = 0,
===================================================================
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -1388,6 +1388,8 @@
 void _paravirt_nop(void);
 #define paravirt_nop	((void *)_paravirt_nop)
 
+void paravirt_use_bytelocks(void);
+
 static inline int __raw_spin_is_locked(struct raw_spinlock *lock)
 {
 	return PVOP_CALL1(int, pv_lock_ops.spin_is_locked, lock);
===================================================================
--- a/include/asm-x86/spinlock.h
+++ b/include/asm-x86/spinlock.h
@@ -184,7 +184,70 @@
 
 #define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
 
-#ifndef CONFIG_PARAVIRT
+#ifdef CONFIG_PARAVIRT
+/*
+ * Define virtualization-friendly old-style lock byte lock, for use in
+ * pv_lock_ops if desired.
+ *
+ * This differs from the pre-2.6.24 spinlock by always using xchgb
+ * rather than decb to take the lock; this allows it to use a
+ * zero-initialized lock structure.  It also maintains a 1-byte
+ * contention counter, so that we can implement
+ * __byte_spin_is_contended.
+ */
+struct __byte_spinlock {
+	s8 lock;
+	s8 spinners;
+};
+
+static inline int __byte_spin_is_locked(raw_spinlock_t *lock)
+{
+	struct __byte_spinlock *bl = (struct __byte_spinlock *)lock;
+	return bl->lock != 0;
+}
+
+static inline int __byte_spin_is_contended(raw_spinlock_t *lock)
+{
+	struct __byte_spinlock *bl = (struct __byte_spinlock *)lock;
+	return bl->spinners != 0;
+}
+
+static inline void __byte_spin_lock(raw_spinlock_t *lock)
+{
+	struct __byte_spinlock *bl = (struct __byte_spinlock *)lock;
+	s8 val = 1;
+
+	asm("1: xchgb %1, %0\n"
+	    "   test %1,%1\n"
+	    "   jz 3f\n"
+	    "   " LOCK_PREFIX "incb %2\n"
+	    "2: rep;nop\n"
+	    "   cmpb $1, %0\n"
+	    "   je 2b\n"
+	    "   " LOCK_PREFIX "decb %2\n"
+	    "   jmp 1b\n"
+	    "3:"
+	    : "+m" (bl->lock), "+q" (val), "+m" (bl->spinners): : "memory");
+}
+
+static inline int __byte_spin_trylock(raw_spinlock_t *lock)
+{
+	struct __byte_spinlock *bl = (struct __byte_spinlock *)lock;
+	u8 old = 1;
+
+	asm("xchgb %1,%0"
+	    : "+m" (bl->lock), "+q" (old) : : "memory");
+
+	return old == 0;
+}
+
+static inline void __byte_spin_unlock(raw_spinlock_t *lock)
+{
+	struct __byte_spinlock *bl = (struct __byte_spinlock *)lock;
+	smp_wmb();
+	bl->lock = 0;
+}
+#else  /* !CONFIG_PARAVIRT */
 static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
 {
 	return __ticket_spin_is_locked(lock);

-- 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RFC 3/4] xen: use lock-byte spinlock implementation
  2008-07-07 19:07 [PATCH RFC 0/4] Paravirtual spinlocks Jeremy Fitzhardinge
  2008-07-07 19:07 ` [PATCH RFC 1/4] x86/paravirt: add hooks for spinlock operations Jeremy Fitzhardinge
  2008-07-07 19:07 ` [PATCH RFC 2/4] paravirt: introduce a "lock-byte" spinlock implementation Jeremy Fitzhardinge
@ 2008-07-07 19:07 ` Jeremy Fitzhardinge
  2008-07-07 19:07 ` [PATCH RFC 4/4] xen: implement Xen-specific spinlocks Jeremy Fitzhardinge
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-07 19:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: LKML, Ingo Molnar, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel,
	Jeremy Fitzhardinge

[-- Attachment #1: xen-use-byte-locks.patch --]
[-- Type: text/plain, Size: 525 bytes --]

Switch to using the lock-byte spinlock implementation, to avoid the
worst of the performance hit from ticket locks.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Thomas Friebel <thomas.friebel@amd.com>
---
 arch/x86/xen/smp.c |    1 +
 1 file changed, 1 insertion(+)

===================================================================
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -430,4 +430,5 @@
 {
 	smp_ops = xen_smp_ops;
 	xen_fill_possible_map();
+	paravirt_use_bytelocks();
 }

-- 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RFC 4/4] xen: implement Xen-specific spinlocks
  2008-07-07 19:07 [PATCH RFC 0/4] Paravirtual spinlocks Jeremy Fitzhardinge
                   ` (2 preceding siblings ...)
  2008-07-07 19:07 ` [PATCH RFC 3/4] xen: use lock-byte " Jeremy Fitzhardinge
@ 2008-07-07 19:07 ` Jeremy Fitzhardinge
  2008-07-08  6:37   ` Johannes Weiner
  2008-07-08  0:29 ` [PATCH RFC 0/4] Paravirtual spinlocks Rusty Russell
  2008-07-09 12:28 ` Ingo Molnar
  5 siblings, 1 reply; 20+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-07 19:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: LKML, Ingo Molnar, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel,
	Jeremy Fitzhardinge

[-- Attachment #1: xen-pv-spinlocks.patch --]
[-- Type: text/plain, Size: 8599 bytes --]

The standard ticket spinlocks are very expensive in a virtual
environment, because their performance depends on Xen's scheduler
giving vcpus time in the order that they're supposed to take the
spinlock.

This implements a Xen-specific spinlock, which should be much more
efficient.

The fast-path is essentially the old Linux-x86 locks, using a single
lock byte.  The locker decrements the byte; if the result is 0, then
it has the lock.  If the result is negative, then the locker must spin
until the lock byte is positive again.

When there's contention, the locker spins for 2^16[*] iterations
waiting to get the lock.  If it fails to get the lock in that time, it
adds itself to the contention count in the lock and blocks on a
per-cpu event channel.

When unlocking the spinlock, the unlocker looks to see if anyone is
blocked waiting for the lock by checking for a non-zero waiter count.
If there is a waiter, it traverses the per-cpu "lock_spinners"
variable, which records which lock each CPU is waiting on.  It picks
one CPU waiting on the lock and sends it an event to wake it up.

This keeps the fast-path spinlock operation efficient, while letting
spinning vcpus give up their processor time while waiting for a
contended lock.

[*] 2^16 iterations is the threshold at which 98% of locks have been
taken, according to Thomas Friebel's Xen Summit talk "Preventing
Guests from Spinning Around".  Therefore, we'd expect the lock and
unlock slow paths to be entered only 2% of the time.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Thomas Friebel <thomas.friebel@amd.com>
---
 arch/x86/xen/smp.c           |  172 +++++++++++++++++++++++++++++++++++++++++-
 drivers/xen/events.c         |   27 ++++++
 include/asm-x86/xen/events.h |    1 
 include/xen/events.h         |    7 +
 4 files changed, 206 insertions(+), 1 deletion(-)

===================================================================
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -15,6 +15,7 @@
  * This does not handle HOTPLUG_CPU yet.
  */
 #include <linux/sched.h>
+#include <linux/kernel_stat.h>
 #include <linux/err.h>
 #include <linux/smp.h>
 
@@ -34,6 +35,8 @@
 
 #include "xen-ops.h"
 #include "mmu.h"
+
+static void __cpuinit xen_init_lock_cpu(int cpu);
 
 cpumask_t xen_cpu_initialized_map;
 
@@ -179,6 +182,8 @@
 {
 	unsigned cpu;
 
+	xen_init_lock_cpu(0);
+
 	smp_store_cpu_info(0);
 	cpu_data(0).x86_max_cores = 1;
 	set_cpu_sibling_map(0);
@@ -301,6 +306,7 @@
 	clear_tsk_thread_flag(idle, TIF_FORK);
 #endif
 	xen_setup_timer(cpu);
+	xen_init_lock_cpu(cpu);
 
 	per_cpu(cpu_state, cpu) = CPU_UP_PREPARE;
 
@@ -413,6 +419,170 @@
 	return IRQ_HANDLED;
 }
 
+struct xen_spinlock {
+	unsigned char lock;		/* 0 -> free; 1 -> locked */
+	unsigned short spinners;	/* count of waiting cpus */
+};
+
+static int xen_spin_is_locked(struct raw_spinlock *lock)
+{
+	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
+
+	return xl->lock != 0;
+}
+
+static int xen_spin_is_contended(struct raw_spinlock *lock)
+{
+	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
+
+	/* Not strictly true; this is only the count of contended
+	   lock-takers entering the slow path. */
+	return xl->spinners != 0;
+}
+
+static int xen_spin_trylock(struct raw_spinlock *lock)
+{
+	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
+	u8 old = 1;
+
+	asm("xchgb %b0,%1"
+	    : "+q" (old), "+m" (xl->lock) : : "memory");
+
+	return old == 0;
+}
+
+static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);
+
+static inline void spinning_lock(struct xen_spinlock *xl)
+{
+	__get_cpu_var(lock_spinners) = xl;
+	wmb();			/* set lock of interest before count */
+	asm(LOCK_PREFIX " incw %0"
+	    : "+m" (xl->spinners) : : "memory");
+}
+
+static inline void unspinning_lock(struct xen_spinlock *xl)
+{
+	asm(LOCK_PREFIX " decw %0"
+	    : "+m" (xl->spinners) : : "memory");
+	wmb();			/* decrement count before clearing lock */
+	__get_cpu_var(lock_spinners) = NULL;
+}
+
+static noinline int xen_spin_lock_slow(struct raw_spinlock *lock)
+{
+	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
+	int irq = __get_cpu_var(lock_kicker_irq);
+	int ret;
+
+	/* If kicker interrupts not initialized yet, just spin */
+	if (irq == -1)
+		return 0;
+
+	/* announce we're spinning */
+	spinning_lock(xl);
+
+	/* clear pending */
+	xen_clear_irq_pending(irq);
+
+	/* check again make sure it didn't become free while
+	   we weren't looking  */
+	ret = xen_spin_trylock(lock);
+	if (ret)
+		goto out;
+
+	/* block until irq becomes pending */
+	xen_poll_irq(irq);
+	kstat_this_cpu.irqs[irq]++;
+
+out:
+	unspinning_lock(xl);
+	return ret;
+}
+
+static void xen_spin_lock(struct raw_spinlock *lock)
+{
+	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
+	int timeout;
+	u8 oldval;
+
+	do {
+		timeout = 1 << 10;
+
+		asm("1: xchgb %1,%0\n"
+		    "   testb %1,%1\n"
+		    "   jz 3f\n"
+		    "2: rep;nop\n"
+		    "   cmpb $0,%0\n"
+		    "   je 1b\n"
+		    "   dec %2\n"
+		    "   jnz 2b\n"
+		    "3:\n"
+		    : "+m" (xl->lock), "=q" (oldval), "+r" (timeout)
+		    : "1" (1)
+		    : "memory");
+
+	} while (unlikely(oldval != 0 && !xen_spin_lock_slow(lock)));
+}
+
+static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		/* XXX should mix up next cpu selection */
+		if (per_cpu(lock_spinners, cpu) == xl) {
+			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+			break;
+		}
+	}
+}
+
+static void xen_spin_unlock(struct raw_spinlock *lock)
+{
+	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
+
+	smp_wmb();		/* make sure no writes get moved after unlock */
+	xl->lock = 0;		/* release lock */
+
+	/* make sure unlock happens before kick */
+	barrier();
+
+	if (unlikely(xl->spinners))
+		xen_spin_unlock_slow(xl);
+}
+
+static __cpuinit void xen_init_lock_cpu(int cpu)
+{
+	int irq;
+	const char *name;
+
+	name = kasprintf(GFP_KERNEL, "spinlock%d", cpu);
+	irq = bind_ipi_to_irqhandler(XEN_SPIN_UNLOCK_VECTOR,
+				     cpu,
+				     xen_reschedule_interrupt,
+				     IRQF_DISABLED|IRQF_PERCPU|IRQF_NOBALANCING,
+				     name,
+				     NULL);
+
+	if (irq >= 0) {
+		disable_irq(irq); /* make sure it's never delivered */
+		per_cpu(lock_kicker_irq, cpu) = irq;
+	}
+
+	printk("cpu %d spinlock event irq %d\n", cpu, irq);
+}
+
+static void __init xen_init_spinlocks(void)
+{
+	pv_lock_ops.spin_is_locked = xen_spin_is_locked;
+	pv_lock_ops.spin_is_contended = xen_spin_is_contended;
+	pv_lock_ops.spin_lock = xen_spin_lock;
+	pv_lock_ops.spin_trylock = xen_spin_trylock;
+	pv_lock_ops.spin_unlock = xen_spin_unlock;
+}
+
 static const struct smp_ops xen_smp_ops __initdata = {
 	.smp_prepare_boot_cpu = xen_smp_prepare_boot_cpu,
 	.smp_prepare_cpus = xen_smp_prepare_cpus,
@@ -430,5 +600,5 @@
 {
 	smp_ops = xen_smp_ops;
 	xen_fill_possible_map();
-	paravirt_use_bytelocks();
+	xen_init_spinlocks();
 }
===================================================================
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -741,6 +741,33 @@
 	}
 }
 
+/* Clear an irq's pending state, in preparation for polling on it */
+void xen_clear_irq_pending(int irq)
+{
+	int evtchn = evtchn_from_irq(irq);
+
+	if (VALID_EVTCHN(evtchn))
+		clear_evtchn(evtchn);
+}
+
+/* Poll waiting for an irq to become pending.  In the usual case, the
+   irq will be disabled so it won't deliver an interrupt. */
+void xen_poll_irq(int irq)
+{
+	evtchn_port_t evtchn = evtchn_from_irq(irq);
+
+	if (VALID_EVTCHN(evtchn)) {
+		struct sched_poll poll;
+
+		poll.nr_ports = 1;
+		poll.timeout = 0;
+		poll.ports = &evtchn;
+
+		if (HYPERVISOR_sched_op(SCHEDOP_poll, &poll) != 0)
+			BUG();
+	}
+}
+
 void xen_irq_resume(void)
 {
 	unsigned int cpu, irq, evtchn;
===================================================================
--- a/include/asm-x86/xen/events.h
+++ b/include/asm-x86/xen/events.h
@@ -5,6 +5,7 @@
 	XEN_RESCHEDULE_VECTOR,
 	XEN_CALL_FUNCTION_VECTOR,
 	XEN_CALL_FUNCTION_SINGLE_VECTOR,
+	XEN_SPIN_UNLOCK_VECTOR,
 
 	XEN_NR_IPIS,
 };
===================================================================
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -45,4 +45,11 @@
 extern void xen_irq_resume(void);
 extern void xen_evtchn_do_upcall(struct pt_regs *regs);
 
+/* Clear an irq's pending state, in preparation for polling on it */
+void xen_clear_irq_pending(int irq);
+
+/* Poll waiting for an irq to become pending.  In the usual case, the
+   irq will be disabled so it won't deliver an interrupt. */
+void xen_poll_irq(int irq);
+
 #endif	/* _XEN_EVENTS_H */

-- 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/4] Paravirtual spinlocks
  2008-07-07 19:07 [PATCH RFC 0/4] Paravirtual spinlocks Jeremy Fitzhardinge
                   ` (3 preceding siblings ...)
  2008-07-07 19:07 ` [PATCH RFC 4/4] xen: implement Xen-specific spinlocks Jeremy Fitzhardinge
@ 2008-07-08  0:29 ` Rusty Russell
  2008-07-08  0:37   ` Jeremy Fitzhardinge
  2008-07-08  4:51   ` Nick Piggin
  2008-07-09 12:28 ` Ingo Molnar
  5 siblings, 2 replies; 20+ messages in thread
From: Rusty Russell @ 2008-07-08  0:29 UTC (permalink / raw)
  To: virtualization
  Cc: Jeremy Fitzhardinge, Nick Piggin, Jens Axboe, Xen devel,
	Peter Zijlstra, Christoph Lameter, Petr Tesarik, LKML,
	Thomas Friebel, Ingo Molnar

On Tuesday 08 July 2008 05:07:49 Jeremy Fitzhardinge wrote:
> At the most recent Xen Summit, Thomas Friebel presented a paper
> ("Preventing Guests from Spinning Around",
> http://xen.org/files/xensummitboston08/LHP.pdf) investigating the
> interactions between spinlocks and virtual machines.  Specifically, he
> looked at what happens when a lock-holding VCPU gets involuntarily
> preempted.

I find it interesting that gang scheduling the guest was not suggested as an 
obvious solution.

Anyway, concept looks fine; lguest's solution is more elegant of course :)

A little disappointing that you can't patch your version inline.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/4] Paravirtual spinlocks
  2008-07-08  0:29 ` [PATCH RFC 0/4] Paravirtual spinlocks Rusty Russell
@ 2008-07-08  0:37   ` Jeremy Fitzhardinge
  2008-07-08  1:01     ` Rusty Russell
  2008-07-08  4:51   ` Nick Piggin
  1 sibling, 1 reply; 20+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-08  0:37 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtualization, Nick Piggin, Jens Axboe, Xen devel,
	Peter Zijlstra, Christoph Lameter, Petr Tesarik, LKML,
	Thomas Friebel, Ingo Molnar

Rusty Russell wrote:
> On Tuesday 08 July 2008 05:07:49 Jeremy Fitzhardinge wrote:
>   
>> At the most recent Xen Summit, Thomas Friebel presented a paper
>> ("Preventing Guests from Spinning Around",
>> http://xen.org/files/xensummitboston08/LHP.pdf) investigating the
>> interactions between spinlocks and virtual machines.  Specifically, he
>> looked at what happens when a lock-holding VCPU gets involuntarily
>> preempted.
>>     
>
> I find it interesting that gang scheduling the guest was not suggested as an 
> obvious solution.
>   

It's an obvious answer, but not an obvious solution.  You trade off 
wasting time spinning vs wasting time waiting for N vcpus to be free for 
scheduling.  Or something; seems much more complex, particularly if you 
can do a small guest tweak to solve the problem.

> Anyway, concept looks fine; lguest's solution is more elegant of course :)
>   

You could remove all mutable state and call it "erlang".

> A little disappointing that you can't patch your version inline.

Spinlock code isn't inlined currently, so I hadn't considered it.  The 
fast path code for both lock and unlock is nearly small enough to 
consider it, but it seems a bit fiddly.

If the "spin_lock" and "spin_unlock" functions were inlined functions 
which called the out of line __raw_spin_lock/unlock functions, then 
after patching they would result in a direct call to the backend lock 
functions, which would be exactly equivalent to what happens now (since 
I hook __raw_spin_lock into calls via pv_lock_ops).

    J


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/4] Paravirtual spinlocks
  2008-07-08  0:37   ` Jeremy Fitzhardinge
@ 2008-07-08  1:01     ` Rusty Russell
  0 siblings, 0 replies; 20+ messages in thread
From: Rusty Russell @ 2008-07-08  1:01 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: virtualization, Nick Piggin, Jens Axboe, Xen devel,
	Peter Zijlstra, Christoph Lameter, Petr Tesarik, LKML,
	Thomas Friebel, Ingo Molnar

On Tuesday 08 July 2008 10:37:54 Jeremy Fitzhardinge wrote:
> Rusty Russell wrote:
> > On Tuesday 08 July 2008 05:07:49 Jeremy Fitzhardinge wrote:
> >> At the most recent Xen Summit, Thomas Friebel presented a paper
> >> ("Preventing Guests from Spinning Around",
> >> http://xen.org/files/xensummitboston08/LHP.pdf) investigating the
> >> interactions between spinlocks and virtual machines.  Specifically, he
> >> looked at what happens when a lock-holding VCPU gets involuntarily
> >> preempted.
> >
> > I find it interesting that gang scheduling the guest was not suggested as
> > an obvious solution.
>
> It's an obvious answer, but not an obvious solution.  You trade off
> wasting time spinning vs wasting time waiting for N vcpus to be free for
> scheduling.

Perhaps, but with huge numbers of cores (as The Future seems to promise) and 
significant overcommit, I'm not sure how bad this would be.

> Or something; seems much more complex, particularly if you
> can do a small guest tweak to solve the problem.

But AFAICT it's one of a related set of problems where all VCPUs are required 
for a task.  Hackbench comes to mind.  There's going to be a lot of 
ping-ponging and you'll approach gang scheduling to get decent performance.

> > A little disappointing that you can't patch your version inline.
>
> Spinlock code isn't inlined currently, so I hadn't considered it.  The
> fast path code for both lock and unlock is nearly small enough to
> consider it, but it seems a bit fiddly.

Yeah, OK.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/4] Paravirtual spinlocks
  2008-07-08  0:29 ` [PATCH RFC 0/4] Paravirtual spinlocks Rusty Russell
  2008-07-08  0:37   ` Jeremy Fitzhardinge
@ 2008-07-08  4:51   ` Nick Piggin
  2008-07-08  5:28     ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 20+ messages in thread
From: Nick Piggin @ 2008-07-08  4:51 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtualization, Jeremy Fitzhardinge, Jens Axboe, Xen devel,
	Peter Zijlstra, Christoph Lameter, Petr Tesarik, LKML,
	Thomas Friebel, Ingo Molnar

On Tuesday 08 July 2008 10:29, Rusty Russell wrote:
> On Tuesday 08 July 2008 05:07:49 Jeremy Fitzhardinge wrote:
> > At the most recent Xen Summit, Thomas Friebel presented a paper
> > ("Preventing Guests from Spinning Around",
> > http://xen.org/files/xensummitboston08/LHP.pdf) investigating the
> > interactions between spinlocks and virtual machines.  Specifically, he
> > looked at what happens when a lock-holding VCPU gets involuntarily
> > preempted.
>
> I find it interesting that gang scheduling the guest was not suggested as
> an obvious solution.
>
> Anyway, concept looks fine; lguest's solution is more elegant of course :)

What's lguest's solution?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/4] Paravirtual spinlocks
  2008-07-08  4:51   ` Nick Piggin
@ 2008-07-08  5:28     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 20+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-08  5:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rusty Russell, virtualization, Jens Axboe, Xen devel,
	Peter Zijlstra, Christoph Lameter, Petr Tesarik, LKML,
	Thomas Friebel, Ingo Molnar

Nick Piggin wrote:
> What's lguest's solution?
>   

Uniprocessor only.

    J

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 4/4] xen: implement Xen-specific spinlocks
  2008-07-07 19:07 ` [PATCH RFC 4/4] xen: implement Xen-specific spinlocks Jeremy Fitzhardinge
@ 2008-07-08  6:37   ` Johannes Weiner
  2008-07-08  7:15     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2008-07-08  6:37 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, LKML, Ingo Molnar, Jens Axboe, Peter Zijlstra,
	Christoph Lameter, Petr Tesarik, Virtualization, Xen devel,
	Thomas Friebel, Jeremy Fitzhardinge

Hi,

Jeremy Fitzhardinge <jeremy@goop.org> writes:

> The standard ticket spinlocks are very expensive in a virtual
> environment, because their performance depends on Xen's scheduler
> giving vcpus time in the order that they're supposed to take the
> spinlock.
>
> This implements a Xen-specific spinlock, which should be much more
> efficient.
>
> The fast-path is essentially the old Linux-x86 locks, using a single
> lock byte.  The locker decrements the byte; if the result is 0, then
> they have the lock.  If the lock is negative, then locker must spin
> until the lock is positive again.
>
> When there's contention, the locker spin for 2^16[*] iterations waiting
> to get the lock.  If it fails to get the lock in that time, it adds
> itself to the contention count in the lock and blocks on a per-cpu
> event channel.
>
> When unlocking the spinlock, the locker looks to see if there's anyone
> blocked waiting for the lock by checking for a non-zero waiter count.
> If there's a waiter, it traverses the per-cpu "lock_spinners"
> variable, which contains which lock each CPU is waiting on.  It picks
> one CPU waiting on the lock and sends it an event to wake it up.
>
> This allows efficient fast-path spinlock operation, while allowing
> spinning vcpus to give up their processor time while waiting for a
> contended lock.
>
> [*] 2^16 iterations is threshold at which 98% locks have been taken
> according to Thomas Friebel's Xen Summit talk "Preventing Guests from
> Spinning Around".  Therefore, we'd expect the lock and unlock slow
> paths will only be entered 2% of the time.
>
> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> Cc: Thomas Friebel <thomas.friebel@amd.com>
> ---
>  arch/x86/xen/smp.c           |  172 +++++++++++++++++++++++++++++++++++++++++-
>  drivers/xen/events.c         |   27 ++++++
>  include/asm-x86/xen/events.h |    1 
>  include/xen/events.h         |    7 +
>  4 files changed, 206 insertions(+), 1 deletion(-)
>
> ===================================================================
> --- a/arch/x86/xen/smp.c
> +++ b/arch/x86/xen/smp.c
> @@ -15,6 +15,7 @@
>   * This does not handle HOTPLUG_CPU yet.
>   */
>  #include <linux/sched.h>
> +#include <linux/kernel_stat.h>
>  #include <linux/err.h>
>  #include <linux/smp.h>
>  
> @@ -34,6 +35,8 @@
>  
>  #include "xen-ops.h"
>  #include "mmu.h"
> +
> +static void __cpuinit xen_init_lock_cpu(int cpu);
>  
>  cpumask_t xen_cpu_initialized_map;
>  
> @@ -179,6 +182,8 @@
>  {
>  	unsigned cpu;
>  
> +	xen_init_lock_cpu(0);
> +
>  	smp_store_cpu_info(0);
>  	cpu_data(0).x86_max_cores = 1;
>  	set_cpu_sibling_map(0);
> @@ -301,6 +306,7 @@
>  	clear_tsk_thread_flag(idle, TIF_FORK);
>  #endif
>  	xen_setup_timer(cpu);
> +	xen_init_lock_cpu(cpu);
>  
>  	per_cpu(cpu_state, cpu) = CPU_UP_PREPARE;
>  
> @@ -413,6 +419,170 @@
>  	return IRQ_HANDLED;
>  }
>  
> +struct xen_spinlock {
> +	unsigned char lock;		/* 0 -> free; 1 -> locked */
> +	unsigned short spinners;	/* count of waiting cpus */
> +};
> +
> +static int xen_spin_is_locked(struct raw_spinlock *lock)
> +{
> +	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
> +
> +	return xl->lock != 0;
> +}
> +
> +static int xen_spin_is_contended(struct raw_spinlock *lock)
> +{
> +	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
> +
> +	/* Not strictly true; this is only the count of contended
> +	   lock-takers entering the slow path. */
> +	return xl->spinners != 0;
> +}
> +
> +static int xen_spin_trylock(struct raw_spinlock *lock)
> +{
> +	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
> +	u8 old = 1;
> +
> +	asm("xchgb %b0,%1"
> +	    : "+q" (old), "+m" (xl->lock) : : "memory");
> +
> +	return old == 0;
> +}
> +
> +static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
> +static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

The plural is a bit misleading, as this is a single pointer per CPU.

> +static inline void spinning_lock(struct xen_spinlock *xl)
> +{
> +	__get_cpu_var(lock_spinners) = xl;
> +	wmb();			/* set lock of interest before count */
> +	asm(LOCK_PREFIX " incw %0"
> +	    : "+m" (xl->spinners) : : "memory");
> +}
> +
> +static inline void unspinning_lock(struct xen_spinlock *xl)
> +{
> +	asm(LOCK_PREFIX " decw %0"
> +	    : "+m" (xl->spinners) : : "memory");
> +	wmb();			/* decrement count before clearing lock */
> +	__get_cpu_var(lock_spinners) = NULL;
> +}
> +
> +static noinline int xen_spin_lock_slow(struct raw_spinlock *lock)
> +{
> +	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
> +	int irq = __get_cpu_var(lock_kicker_irq);
> +	int ret;
> +
> +	/* If kicker interrupts not initialized yet, just spin */
> +	if (irq == -1)
> +		return 0;
> +
> +	/* announce we're spinning */
> +	spinning_lock(xl);
> +
> +	/* clear pending */
> +	xen_clear_irq_pending(irq);
> +
> +	/* check again make sure it didn't become free while
> +	   we weren't looking  */
> +	ret = xen_spin_trylock(lock);
> +	if (ret)
> +		goto out;
> +
> +	/* block until irq becomes pending */
> +	xen_poll_irq(irq);
> +	kstat_this_cpu.irqs[irq]++;
> +
> +out:
> +	unspinning_lock(xl);
> +	return ret;
> +}
> +
> +static void xen_spin_lock(struct raw_spinlock *lock)
> +{
> +	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
> +	int timeout;
> +	u8 oldval;
> +
> +	do {
> +		timeout = 1 << 10;
> +
> +		asm("1: xchgb %1,%0\n"
> +		    "   testb %1,%1\n"
> +		    "   jz 3f\n"
> +		    "2: rep;nop\n"
> +		    "   cmpb $0,%0\n"
> +		    "   je 1b\n"
> +		    "   dec %2\n"
> +		    "   jnz 2b\n"
> +		    "3:\n"
> +		    : "+m" (xl->lock), "=q" (oldval), "+r" (timeout)
> +		    : "1" (1)
> +		    : "memory");
> +
> +	} while (unlikely(oldval != 0 && !xen_spin_lock_slow(lock)));
> +}
> +
> +static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {

Would it be feasible to have a bitmap for the spinning CPUs in order to
do a for_each_spinning_cpu() here instead?  Or is setting a bit in
spinning_lock() and unsetting it in unspinning_lock() more overhead than
going over all CPUs here?

> +		/* XXX should mix up next cpu selection */
> +		if (per_cpu(lock_spinners, cpu) == xl) {
> +			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
> +			break;
> +		}
> +	}
> +}
> +
> +static void xen_spin_unlock(struct raw_spinlock *lock)
> +{
> +	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
> +
> +	smp_wmb();		/* make sure no writes get moved after unlock */
> +	xl->lock = 0;		/* release lock */
> +
> +	/* make sure unlock happens before kick */
> +	barrier();
> +
> +	if (unlikely(xl->spinners))
> +		xen_spin_unlock_slow(xl);
> +}
> +
> +static __cpuinit void xen_init_lock_cpu(int cpu)
> +{
> +	int irq;
> +	const char *name;
> +
> +	name = kasprintf(GFP_KERNEL, "spinlock%d", cpu);
> +	irq = bind_ipi_to_irqhandler(XEN_SPIN_UNLOCK_VECTOR,
> +				     cpu,
> +				     xen_reschedule_interrupt,
> +				     IRQF_DISABLED|IRQF_PERCPU|IRQF_NOBALANCING,
> +				     name,
> +				     NULL);
> +
> +	if (irq >= 0) {
> +		disable_irq(irq); /* make sure it's never delivered */
> +		per_cpu(lock_kicker_irq, cpu) = irq;
> +	}
> +
> +	printk("cpu %d spinlock event irq %d\n", cpu, irq);
> +}
> +
> +static void __init xen_init_spinlocks(void)
> +{
> +	pv_lock_ops.spin_is_locked = xen_spin_is_locked;
> +	pv_lock_ops.spin_is_contended = xen_spin_is_contended;
> +	pv_lock_ops.spin_lock = xen_spin_lock;
> +	pv_lock_ops.spin_trylock = xen_spin_trylock;
> +	pv_lock_ops.spin_unlock = xen_spin_unlock;
> +}
> +
>  static const struct smp_ops xen_smp_ops __initdata = {
>  	.smp_prepare_boot_cpu = xen_smp_prepare_boot_cpu,
>  	.smp_prepare_cpus = xen_smp_prepare_cpus,
> @@ -430,5 +600,5 @@
>  {
>  	smp_ops = xen_smp_ops;
>  	xen_fill_possible_map();
> -	paravirt_use_bytelocks();
> +	xen_init_spinlocks();
>  }
> ===================================================================
> --- a/drivers/xen/events.c
> +++ b/drivers/xen/events.c
> @@ -741,6 +741,33 @@
>  	}
>  }
>  
> +/* Clear an irq's pending state, in preparation for polling on it */
> +void xen_clear_irq_pending(int irq)
> +{
> +	int evtchn = evtchn_from_irq(irq);
> +
> +	if (VALID_EVTCHN(evtchn))
> +		clear_evtchn(evtchn);
> +}
> +
> +/* Poll waiting for an irq to become pending.  In the usual case, the
> +   irq will be disabled so it won't deliver an interrupt. */
> +void xen_poll_irq(int irq)
> +{
> +	evtchn_port_t evtchn = evtchn_from_irq(irq);
> +
> +	if (VALID_EVTCHN(evtchn)) {
> +		struct sched_poll poll;
> +
> +		poll.nr_ports = 1;
> +		poll.timeout = 0;
> +		poll.ports = &evtchn;
> +
> +		if (HYPERVISOR_sched_op(SCHEDOP_poll, &poll) != 0)
> +			BUG();
> +	}
> +}
> +
>  void xen_irq_resume(void)
>  {
>  	unsigned int cpu, irq, evtchn;
> ===================================================================
> --- a/include/asm-x86/xen/events.h
> +++ b/include/asm-x86/xen/events.h
> @@ -5,6 +5,7 @@
>  	XEN_RESCHEDULE_VECTOR,
>  	XEN_CALL_FUNCTION_VECTOR,
>  	XEN_CALL_FUNCTION_SINGLE_VECTOR,
> +	XEN_SPIN_UNLOCK_VECTOR,
>  
>  	XEN_NR_IPIS,
>  };
> ===================================================================
> --- a/include/xen/events.h
> +++ b/include/xen/events.h
> @@ -45,4 +45,11 @@
>  extern void xen_irq_resume(void);
>  extern void xen_evtchn_do_upcall(struct pt_regs *regs);
>  
> +/* Clear an irq's pending state, in preparation for polling on it */
> +void xen_clear_irq_pending(int irq);
> +
> +/* Poll waiting for an irq to become pending.  In the usual case, the
> +   irq will be disabled so it won't deliver an interrupt. */
> +void xen_poll_irq(int irq);
> +
>  #endif	/* _XEN_EVENTS_H */

	Hannes

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 4/4] xen: implement Xen-specific spinlocks
  2008-07-08  6:37   ` Johannes Weiner
@ 2008-07-08  7:15     ` Jeremy Fitzhardinge
  2008-07-08  7:30       ` Johannes Weiner
  0 siblings, 1 reply; 20+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-08  7:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Nick Piggin, LKML, Ingo Molnar, Jens Axboe, Peter Zijlstra,
	Christoph Lameter, Petr Tesarik, Virtualization, Xen devel,
	Thomas Friebel, Jeremy Fitzhardinge

Johannes Weiner wrote:
>> +static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
>> +static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);
>>     
>
> The plural is a bit misleading, as this is a single pointer per CPU.
>   

Yeah.  And it's wrong because it's specifically *not* spinning, but 
blocking.

>> +static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
>> +{
>> +	int cpu;
>> +
>> +	for_each_online_cpu(cpu) {
>>     
>
> Would it be feasible to have a bitmap for the spinning CPUs in order to
> do a for_each_spinning_cpu() here instead?  Or is setting a bit in
> spinning_lock() and unsetting it in unspinning_lock() more overhead than
> going over all CPUs here?
>   

Not worthwhile, I think.  This is a very rare path: it will only happen 
if 1) there's lock contention that 2) wasn't resolved within the 
timeout.  In practice, this gets called a few thousand times per cpu 
over a kernbench, which is nothing.

My very original version of this code kept a bitmask of interested CPUs 
within the lock, but there's only space for 24 cpus if we still use a 
byte for the lock itself.  It all turned out fairly awkward, and this 
version is a marked improvement.
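
For illustration only, the sort of layout that implies: one byte of
lock state in a 32-bit lock word leaves just 24 bits of waiter
bitmask, so anything beyond 24 cpus would have needed a bigger lock
word:

	struct xen_spinlock_bitmask {		/* hypothetical, not in this series */
		unsigned char lock;		/* 0 -> free; 1 -> locked */
		unsigned char waiters[3];	/* bitmask of spinning cpus 0..23 */
	};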

    J

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 4/4] xen: implement Xen-specific spinlocks
  2008-07-08  7:15     ` Jeremy Fitzhardinge
@ 2008-07-08  7:30       ` Johannes Weiner
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Weiner @ 2008-07-08  7:30 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, LKML, Ingo Molnar, Jens Axboe, Peter Zijlstra,
	Christoph Lameter, Petr Tesarik, Virtualization, Xen devel,
	Thomas Friebel, Jeremy Fitzhardinge

Hi,

Jeremy Fitzhardinge <jeremy@goop.org> writes:

> Johannes Weiner wrote:
>>> +static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
>>> +static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);
>>>     
>>
>> The plural is a bit misleading, as this is a single pointer per CPU.
>>   
>
> Yeah.  And it's wrong because it's specifically *not* spinning, but
> blocking.

I thought of it as `virtually spinning', so had no problems with the
naming itself :)

>>> +static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
>>> +{
>>> +	int cpu;
>>> +
>>> +	for_each_online_cpu(cpu) {
>>>     
>>
>> Would it be feasible to have a bitmap for the spinning CPUs in order to
>> do a for_each_spinning_cpu() here instead?  Or is setting a bit in
>> spinning_lock() and unsetting it in unspinning_lock() more overhead than
>> going over all CPUs here?
>>   
>
> Not worthwhile, I think.  This is a very rare path: it will only
> happen if 1) there's lock contention, that 2) wasn't resolved within
> the timeout.  In practice, this gets called a few thousand times per
> cpu over a kernbench, which is nothing.

Okay, I agree.

	Hannes

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/4] Paravirtual spinlocks
  2008-07-07 19:07 [PATCH RFC 0/4] Paravirtual spinlocks Jeremy Fitzhardinge
                   ` (4 preceding siblings ...)
  2008-07-08  0:29 ` [PATCH RFC 0/4] Paravirtual spinlocks Rusty Russell
@ 2008-07-09 12:28 ` Ingo Molnar
  2008-07-09 12:35   ` [patch] x86: paravirt spinlocks, !CONFIG_SMP build fixes (was: Re: [PATCH RFC 0/4] Paravirtual spinlocks) Ingo Molnar
                     ` (2 more replies)
  5 siblings, 3 replies; 20+ messages in thread
From: Ingo Molnar @ 2008-07-09 12:28 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, LKML, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel,
	Avi Kivity


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> My experiments show that using a Xen-specific lock helps guest 
> performance a bit (reduction in elapsed and system time in a kernbench 
> run), but most significantly, reduces overall physical CPU consumption 
> by 10%, and so increases overall system scalability.

that's rather impressive and looks nice, considering the fairly low 
impact.

as there were no fundamental objections in this thread i've created a 
tip/x86/paravirt-spinlocks topic branch for these patches and started 
testing them.

i based the topic branch on tip/xen-64bit, so you should be able to get 
the latest code by doing:

  git-merge tip/x86/paravirt-spinlocks

on tip/master.

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch] x86: paravirt spinlocks, !CONFIG_SMP build fixes (was: Re: [PATCH RFC 0/4] Paravirtual spinlocks)
  2008-07-09 12:28 ` Ingo Molnar
@ 2008-07-09 12:35   ` Ingo Molnar
  2008-07-09 12:39   ` [patch] x86: paravirt spinlocks, modular build fix " Ingo Molnar
  2008-07-09 13:33   ` [PATCH RFC 0/4] Paravirtual spinlocks Ingo Molnar
  2 siblings, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2008-07-09 12:35 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, LKML, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel,
	Avi Kivity


* Ingo Molnar <mingo@elte.hu> wrote:

> as there were no fundamental objections in this thread i've created a 
> tip/x86/paravirt-spinlocks topic branch for these patches and started 
> testing them.

small UP build fixes below.

	Ingo

----------------->
commit c77bc635bb0486b0ddbe03012a8662b399ae9cdb
Author: Ingo Molnar <mingo@elte.hu>
Date:   Wed Jul 9 14:33:33 2008 +0200

    x86: paravirt spinlocks, !CONFIG_SMP build fixes
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index bba4041..6aa8aed 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -270,11 +270,13 @@ enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
 
 void __init paravirt_use_bytelocks(void)
 {
+#ifdef CONFIG_SMP
 	pv_lock_ops.spin_is_locked = __byte_spin_is_locked;
 	pv_lock_ops.spin_is_contended = __byte_spin_is_contended;
 	pv_lock_ops.spin_lock = __byte_spin_lock;
 	pv_lock_ops.spin_trylock = __byte_spin_trylock;
 	pv_lock_ops.spin_unlock = __byte_spin_unlock;
+#endif
 }
 
 struct pv_info pv_info = {
@@ -461,12 +463,14 @@ struct pv_mmu_ops pv_mmu_ops = {
 };
 
 struct pv_lock_ops pv_lock_ops = {
+#ifdef CONFIG_SMP
 	.spin_is_locked = __ticket_spin_is_locked,
 	.spin_is_contended = __ticket_spin_is_contended,
 
 	.spin_lock = __ticket_spin_lock,
 	.spin_trylock = __ticket_spin_trylock,
 	.spin_unlock = __ticket_spin_unlock,
+#endif
 };
 
 EXPORT_SYMBOL_GPL(pv_time_ops);
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
index 65ed02c..b2aba8f 100644
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -1387,6 +1387,8 @@ void _paravirt_nop(void);
 
 void paravirt_use_bytelocks(void);
 
+#ifdef CONFIG_SMP
+
 static inline int __raw_spin_is_locked(struct raw_spinlock *lock)
 {
 	return PVOP_CALL1(int, pv_lock_ops.spin_is_locked, lock);
@@ -1412,6 +1414,8 @@ static __always_inline void __raw_spin_unlock(struct raw_spinlock *lock)
 	return PVOP_VCALL1(pv_lock_ops.spin_unlock, lock);
 }
 
+#endif
+
 /* These all sit in the .parainstructions section to tell us what to patch. */
 struct paravirt_patch_site {
 	u8 *instr; 		/* original instructions */


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [patch] x86: paravirt spinlocks, modular build fix  (was: Re: [PATCH RFC 0/4] Paravirtual spinlocks)
  2008-07-09 12:28 ` Ingo Molnar
  2008-07-09 12:35   ` [patch] x86: paravirt spinlocks, !CONFIG_SMP build fixes (was: Re: [PATCH RFC 0/4] Paravirtual spinlocks) Ingo Molnar
@ 2008-07-09 12:39   ` Ingo Molnar
  2008-07-09 13:33   ` [PATCH RFC 0/4] Paravirtual spinlocks Ingo Molnar
  2 siblings, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2008-07-09 12:39 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, LKML, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel,
	Avi Kivity


another small fixlet.

	Ingo

----------------->
commit 0ea0f032767f41fb0a9586ab6acb5c1408ef16c5
Author: Ingo Molnar <mingo@elte.hu>
Date:   Wed Jul 9 14:39:15 2008 +0200

    x86: paravirt spinlocks, modular build fix
    
    fix:
    
      MODPOST 408 modules
    ERROR: "pv_lock_ops" [net/dccp/dccp.ko] undefined!
    ERROR: "pv_lock_ops" [fs/jbd2/jbd2.ko] undefined!
    ERROR: "pv_lock_ops" [drivers/media/common/saa7146_vv.ko] undefined!
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 6aa8aed..3edfd7a 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -472,6 +472,7 @@ struct pv_lock_ops pv_lock_ops = {
 	.spin_unlock = __ticket_spin_unlock,
 #endif
 };
+EXPORT_SYMBOL_GPL(pv_lock_ops);
 
 EXPORT_SYMBOL_GPL(pv_time_ops);
 EXPORT_SYMBOL    (pv_cpu_ops);

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/4] Paravirtual spinlocks
  2008-07-09 12:28 ` Ingo Molnar
  2008-07-09 12:35   ` [patch] x86: paravirt spinlocks, !CONFIG_SMP build fixes (was: Re: [PATCH RFC 0/4] Paravirtual spinlocks) Ingo Molnar
  2008-07-09 12:39   ` [patch] x86: paravirt spinlocks, modular build fix " Ingo Molnar
@ 2008-07-09 13:33   ` Ingo Molnar
  2008-07-09 13:49     ` [patch] x86, paravirt-spinlocks: fix boot hang (was: Re: [PATCH RFC 0/4] Paravirtual spinlocks) Ingo Molnar
  2 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2008-07-09 13:33 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, LKML, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel,
	Avi Kivity


the patches caused a boot hang with this config:

 http://redhat.com/~mingo/misc/config-Wed_Jul__9_14_47_04_CEST_2008.bad

i have bisected it down to:

  commit e17b58c2e85bc2ad2afc07fb8d898017c2b75ed1
  Author: Jeremy Fitzhardinge <jeremy@goop.org>
  Date:   Mon Jul 7 12:07:53 2008 -0700

      xen: implement Xen-specific spinlocks

i.e. applying that patch alone causes the hang. The hang happens in the 
ftrace self-test:

  initcall utsname_sysctl_init+0x0/0x19 returned 0 after 0 msecs
  calling  init_sched_switch_trace+0x0/0x4c
  Testing tracer sched_switch: PASSED
  initcall init_sched_switch_trace+0x0/0x4c returned 0 after 167 msecs
  calling  init_function_trace+0x0/0x12
  Testing tracer ftrace: 
  [hard hang]

it should have continued like this:

  Testing tracer ftrace: PASSED
  initcall init_function_trace+0x0/0x12 returned 0 after 198 msecs
  calling  init_irqsoff_tracer+0x0/0x14
  Testing tracer irqsoff: PASSED
  initcall init_irqsoff_tracer+0x0/0x14 returned 0 after 3 msecs
  calling  init_mmio_trace+0x0/0x12
  initcall init_mmio_trace+0x0/0x12 returned 0 after 0 msecs

note: ftrace is a user of raw spinlocks. Another peculiarity of the 
config is that it has paravirt enabled (obviously), and that it also has 
lock debugging enabled.
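
One plausible way to read the hang (the code below is purely an
illustration, not taken from the report): the tracer itself takes raw
spinlocks, so if the spinlock path is compiled with -pg, every lock
acquisition re-enters the tracer before the lock is even taken, and the
chain never terminates. Roughly:

struct sketch_lock { int taken; };

static struct sketch_lock trace_buffer_lock;	/* hypothetical lock inside the tracer */

static void sketch_spin_lock(struct sketch_lock *lock);

/* rough stand-in for the ftrace callback that mcount ends up invoking */
static void sketch_tracer_record(void *ip)
{
	sketch_spin_lock(&trace_buffer_lock);	/* the tracer needs a raw spinlock ... */
	/* ... record the entry, unlock, and so on */
}

/* imagine this is the lock primitive, built in an object with -pg */
static void sketch_spin_lock(struct sketch_lock *lock)
{
	sketch_tracer_record((void *)sketch_spin_lock);	/* the entry hook fires ... */
	while (__sync_lock_test_and_set(&lock->taken, 1))	/* ... before we ever spin */
		;
}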

	Ingo


* [patch] x86, paravirt-spinlocks: fix boot hang (was: Re: [PATCH RFC 0/4] Paravirtual spinlocks)
  2008-07-09 13:33   ` [PATCH RFC 0/4] Paravirtual spinlocks Ingo Molnar
@ 2008-07-09 13:49     ` Ingo Molnar
  2008-07-09 15:55       ` [patch] x86, paravirt-spinlocks: fix boot hang Jeremy Fitzhardinge
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2008-07-09 13:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, LKML, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel,
	Avi Kivity, the arch/x86 maintainers


* Ingo Molnar <mingo@elte.hu> wrote:

> the patches caused a boot hang with this config:
> 
>  http://redhat.com/~mingo/misc/config-Wed_Jul__9_14_47_04_CEST_2008.bad

ok, fixed it via the patch below.

	Ingo

-------------->
commit 6412367fe22dc54cc727e7bec5bd65c36c1a0754
Author: Ingo Molnar <mingo@elte.hu>
Date:   Wed Jul 9 15:42:09 2008 +0200

    x86, paravirt-spinlocks: fix boot hang
    
    the paravirt-spinlock patches caused a boot hang with this config:
    
     http://redhat.com/~mingo/misc/config-Wed_Jul__9_14_47_04_CEST_2008.bad
    
    i have bisected it down to:
    
    |  commit e17b58c2e85bc2ad2afc07fb8d898017c2b75ed1
    |  Author: Jeremy Fitzhardinge <jeremy@goop.org>
    |  Date:   Mon Jul 7 12:07:53 2008 -0700
    |
    |      xen: implement Xen-specific spinlocks
    
    i.e. applying that patch alone causes the hang. The hang happens in the
    ftrace self-test:
    
      initcall utsname_sysctl_init+0x0/0x19 returned 0 after 0 msecs
      calling  init_sched_switch_trace+0x0/0x4c
      Testing tracer sched_switch: PASSED
      initcall init_sched_switch_trace+0x0/0x4c returned 0 after 167 msecs
      calling  init_function_trace+0x0/0x12
      Testing tracer ftrace:
      [hard hang]
    
    it should have continued like this:
    
      Testing tracer ftrace: PASSED
      initcall init_function_trace+0x0/0x12 returned 0 after 198 msecs
      calling  init_irqsoff_tracer+0x0/0x14
      Testing tracer irqsoff: PASSED
      initcall init_irqsoff_tracer+0x0/0x14 returned 0 after 3 msecs
      calling  init_mmio_trace+0x0/0x12
      initcall init_mmio_trace+0x0/0x12 returned 0 after 0 msecs
    
    the problem is that such lowlevel primitives as spinlocks should never
    be built with -pg (which ftrace does). Marking paravirt.o as non-pg and
    marking all spinlock ops as always-inline solve the hang.
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 2dd1ccd..faba669 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -7,10 +7,11 @@ extra-y                := head_$(BITS).o head$(BITS).o head.o init_task.o vmlinu
 CPPFLAGS_vmlinux.lds += -U$(UTS_MACHINE)
 
 ifdef CONFIG_FTRACE
-# Do not profile debug utilities
+# Do not profile debug and lowlevel utilities
 CFLAGS_REMOVE_tsc_64.o = -pg
 CFLAGS_REMOVE_tsc_32.o = -pg
 CFLAGS_REMOVE_rtc.o = -pg
+CFLAGS_REMOVE_paravirt.o = -pg
 endif
 
 #
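
For the "always-inline" half of the fix, here is a small sketch of the
two standard annotations involved (the function names are made up; only
__always_inline and notrace from <linux/compiler.h> are real):

#include <linux/compiler.h>

struct raw_spinlock;				/* illustrative forward declaration */
extern void hypothetical_lock_op(struct raw_spinlock *lock);

/* forced inline: the body lands in the caller, so no separate,
 * mcount-instrumented copy of the wrapper is ever emitted */
static __always_inline void sketch_lock_fastpath(struct raw_spinlock *lock)
{
	hypothetical_lock_op(lock);
}

/* notrace: compiled without the -pg entry hook even when the rest
 * of the object file is instrumented */
static notrace void sketch_lock_slowpath(struct raw_spinlock *lock)
{
	hypothetical_lock_op(lock);
}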


* Re: [patch] x86, paravirt-spinlocks: fix boot hang
  2008-07-09 13:49     ` [patch] x86, paravirt-spinlocks: fix boot hang (was: Re: [PATCH RFC 0/4] Paravirtual spinlocks) Ingo Molnar
@ 2008-07-09 15:55       ` Jeremy Fitzhardinge
  2008-07-09 19:26         ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 15:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, LKML, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel,
	Avi Kivity, the arch/x86 maintainers

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>   
>> the patches caused a boot hang with this config:
>>
>>  http://redhat.com/~mingo/misc/config-Wed_Jul__9_14_47_04_CEST_2008.bad
>>     
>
> ok, fixed it via the patch below.
>
> 	Ingo
>
> -------------->
> commit 6412367fe22dc54cc727e7bec5bd65c36c1a0754
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Wed Jul 9 15:42:09 2008 +0200
>
>     x86, paravirt-spinlocks: fix boot hang
>     
>     the paravirt-spinlock patches caused a boot hang with this config:
>     
>      http://redhat.com/~mingo/misc/config-Wed_Jul__9_14_47_04_CEST_2008.bad
>     
>     i have bisected it down to:
>     
>     |  commit e17b58c2e85bc2ad2afc07fb8d898017c2b75ed1
>     |  Author: Jeremy Fitzhardinge <jeremy@goop.org>
>     |  Date:   Mon Jul 7 12:07:53 2008 -0700
>     |
>     |      xen: implement Xen-specific spinlocks
>     
>     i.e. applying that patch alone causes the hang. The hang happens in the
>     ftrace self-test:
>     
>       initcall utsname_sysctl_init+0x0/0x19 returned 0 after 0 msecs
>       calling  init_sched_switch_trace+0x0/0x4c
>       Testing tracer sched_switch: PASSED
>       initcall init_sched_switch_trace+0x0/0x4c returned 0 after 167 msecs
>       calling  init_function_trace+0x0/0x12
>       Testing tracer ftrace:
>       [hard hang]
>     
>     it should have continued like this:
>     
>       Testing tracer ftrace: PASSED
>       initcall init_function_trace+0x0/0x12 returned 0 after 198 msecs
>       calling  init_irqsoff_tracer+0x0/0x14
>       Testing tracer irqsoff: PASSED
>       initcall init_irqsoff_tracer+0x0/0x14 returned 0 after 3 msecs
>       calling  init_mmio_trace+0x0/0x12
>       initcall init_mmio_trace+0x0/0x12 returned 0 after 0 msecs
>     
>     the problem is that such lowlevel primitives as spinlocks should never
>     be built with -pg (which ftrace does). Marking paravirt.o as non-pg and
>     marking all spinlock ops as always-inline solve the hang.
>    

Thanks Ingo, that would have taken me a while to work out.

Presumably that means xen/smp.o should also be built without -pg.  In 
fact, should I move the spinlock code into its own file, so that ftrace 
can cover as much as possible?

    J


* Re: [patch] x86, paravirt-spinlocks: fix boot hang
  2008-07-09 15:55       ` [patch] x86, paravirt-spinlocks: fix boot hang Jeremy Fitzhardinge
@ 2008-07-09 19:26         ` Ingo Molnar
  0 siblings, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2008-07-09 19:26 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, LKML, Jens Axboe, Peter Zijlstra, Christoph Lameter,
	Petr Tesarik, Virtualization, Xen devel, Thomas Friebel,
	Avi Kivity, the arch/x86 maintainers


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

>> should never
>>     be built with -pg (which ftrace does). Marking paravirt.o as 
>>     non-pg and marking all spinlock ops as always-inline solve the 
>>     hang.
>>    
>
> Thanks Ingo, that would have taken me a while to work out.

yeah, i figured i'd be the better candidate to debug ftrace-related 
problems ;-)

> Presumably that means that xen/smp.o should also be built without -pg.  
> In fact, should I move the spinlock code into its own file, so that 
> ftrace can cover as much as it can?

hm, yeah. Anything that can be called via the lowest-level spinlock 
functions should either be notrace or, better, live in a -pg-excluded 
.o file. Putting it all into a separate file would cover it nicely, i 
think.
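
A hypothetical kbuild fragment for that arrangement (the file split and
names are illustrative; the CFLAGS_REMOVE convention is the same one
used in the arch/x86/kernel/Makefile hunk earlier in the thread):

obj-$(CONFIG_SMP)		+= smp.o spinlock.o

ifdef CONFIG_FTRACE
# keep only the lowlevel lock code out of mcount instrumentation,
# so ftrace still covers the rest of the Xen code
CFLAGS_REMOVE_spinlock.o	= -pg
endif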

	Ingo


end of thread

Thread overview: 20+ messages
2008-07-07 19:07 [PATCH RFC 0/4] Paravirtual spinlocks Jeremy Fitzhardinge
2008-07-07 19:07 ` [PATCH RFC 1/4] x86/paravirt: add hooks for spinlock operations Jeremy Fitzhardinge
2008-07-07 19:07 ` [PATCH RFC 2/4] paravirt: introduce a "lock-byte" spinlock implementation Jeremy Fitzhardinge
2008-07-07 19:07 ` [PATCH RFC 3/4] xen: use lock-byte " Jeremy Fitzhardinge
2008-07-07 19:07 ` [PATCH RFC 4/4] xen: implement Xen-specific spinlocks Jeremy Fitzhardinge
2008-07-08  6:37   ` Johannes Weiner
2008-07-08  7:15     ` Jeremy Fitzhardinge
2008-07-08  7:30       ` Johannes Weiner
2008-07-08  0:29 ` [PATCH RFC 0/4] Paravirtual spinlocks Rusty Russell
2008-07-08  0:37   ` Jeremy Fitzhardinge
2008-07-08  1:01     ` Rusty Russell
2008-07-08  4:51   ` Nick Piggin
2008-07-08  5:28     ` Jeremy Fitzhardinge
2008-07-09 12:28 ` Ingo Molnar
2008-07-09 12:35   ` [patch] x86: paravirt spinlocks, !CONFIG_SMP build fixes (was: Re: [PATCH RFC 0/4] Paravirtual spinlocks) Ingo Molnar
2008-07-09 12:39   ` [patch] x86: paravirt spinlocks, modular build fix " Ingo Molnar
2008-07-09 13:33   ` [PATCH RFC 0/4] Paravirtual spinlocks Ingo Molnar
2008-07-09 13:49     ` [patch] x86, paravirt-spinlocks: fix boot hang (was: Re: [PATCH RFC 0/4] Paravirtual spinlocks) Ingo Molnar
2008-07-09 15:55       ` [patch] x86, paravirt-spinlocks: fix boot hang Jeremy Fitzhardinge
2008-07-09 19:26         ` Ingo Molnar
