* [RFC] Implement Batched (group) ticket lock
@ 2014-05-28 12:16 ` Raghavendra K T
From: Raghavendra K T @ 2014-05-28 12:16 UTC (permalink / raw)
  To: tglx, mingo, hpa, konrad.wilk, pbonzini, gleb, peterz, paulmck
  Cc: torvalds, waiman.long, davej, oleg, x86, jeremy, paul.gortmaker,
	ak, jasowang, fernando_b1, linux-kernel, kvm, virtualization,
	xen-devel, riel, mtosatti, chegu_vinod, raghavendra.kt

In a virtualized environment there are mainly three spinlock-related
problems that affect performance:
1. Lock Holder Preemption (LHP)
2. Lock Waiter Preemption (LWP)
3. Starvation/fairness

 Though ticketlocks solve the fairness problem, they worsen the LWP and LHP
problems. pv-ticketlocks tried to address this, but we can improve further
at the cost of relaxed fairness.

In this patch, we form batches of eligible lock holders; the batches are
served in FIFO order, but the lock waiters within an eligible batch are
served in an unfair manner. This increases the probability of an eligible
lock holder being in the running state (to an average of (batch_size/2) - 1).
It also provides the needed bounded starvation, since any lock requester
cannot acquire the lock more than batch_size times in a row during
contention. On the negative side, we need an extra cmpxchg.

 The patch uses a batch size of 4. (Increasing the batch size takes us closer
to unfair locks, and a batch size of 1 is the plain ticketlock.)
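
To make the batching concrete, here is a minimal user-space sketch (added
for illustration, not part of the patch) of the eligibility test, assuming
the 8-bit __ticket_t case; the macro values mirror those added to
spinlock_types.h below, while main()/printf() and the uint8_t stand-in for
__ticket_t are demo-only:

/*
 * Tail tickets advance in steps of 2; bit 0 of head is used by the lock
 * slowpath as the "lock held by someone in the current batch" flag.
 * The mask hides bit 0 and the two batch-window bits, so only waiters
 * whose tickets fall in the same group of 4 compare equal against head.
 */
#include <stdio.h>
#include <stdint.h>

#define TICKET_LOCK_INC_SHIFT  1
#define TICKET_LOCK_TAIL_INC   (1 << TICKET_LOCK_INC_SHIFT)             /* 2 */
#define TICKET_BATCH           4
#define TICKET_LOCK_BATCH_MASK ((uint8_t)(~(TICKET_BATCH << TICKET_LOCK_INC_SHIFT) + \
                                          TICKET_LOCK_TAIL_INC - 1))    /* 0xf8 */

int main(void)
{
	uint8_t head = 0;	/* ticket currently being served */
	uint8_t tail;

	for (tail = 0; tail < 3 * TICKET_BATCH * TICKET_LOCK_TAIL_INC;
	     tail += TICKET_LOCK_TAIL_INC) {
		int eligible = (head & TICKET_LOCK_BATCH_MASK) ==
			       (tail & TICKET_LOCK_BATCH_MASK);

		/* Only the first TICKET_BATCH tickets may spin on the
		 * cmpxchg now; later ones wait for their batch's turn. */
		printf("ticket %2u: batch %u, eligible now: %s\n",
		       (unsigned)tail,
		       (unsigned)(tail / (TICKET_BATCH * TICKET_LOCK_TAIL_INC)),
		       eligible ? "yes" : "no");
	}
	return 0;
}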

Result:
Test system: 32-cpu, 2-node machine w/ 64GB each (16-pcpu machine + HT).
Guests: 8GB, 16-vcpu guests (average of 8 iterations)

 % Improvements with kvm guests (batch size = 4):
  ebizzy_0.5x   4.3
  ebizzy_1.0x   7.8
  ebizzy_1.5x  23.4
  ebizzy_2.0x  48.6

Bare metal:
ebizzy showed very high stdev; the kernbench result was good, but neither of
them showed much difference.

ebizzy: rec/sec higher is better
base    50452
patched 50703

kernbench: time in sec (lower is better)
base    48.9 sec
patched 48.8 sec

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/include/asm/spinlock.h       | 35 +++++++++++++++++++++++++----------
 arch/x86/include/asm/spinlock_types.h | 14 ++++++++++----
 arch/x86/kernel/kvm.c                 |  6 ++++--
 3 files changed, 39 insertions(+), 16 deletions(-)

TODO:
- We need an intelligent way to nullify the effect of batching on bare metal
  (where the extra cmpxchg is not required).

- We may have to make the batch size a kernel arg to solve the above problem
  (so that the same kernel can run on host and guest); a rough sketch of such
  a boot parameter follows after these notes. Increasing the batch size also
  seems to help virtualized guests more, so we would have the flexibility of
  tuning depending on VM size.

- My kernbench/ebizzy tests on bare metal (32-cpu + HT Sandy Bridge) did not
  seem to show the impact of the extra cmpxchg, but there should be some
  effect from it.

- Virtualized guests showed a slight impact in the 1x cases of some
  benchmarks, but we got impressive performance for the >1x cases. Overall,
  the patch needs exhaustive testing.

- We can further add a dynamically changing batch_size implementation
  (inspiration and hint from Paul McKenney) as necessary.
 
 I have found that increasing the batch size gives excellent improvements
 for overcommitted guests. I understand that we need more exhaustive testing.

 Please provide your suggestions and comments.
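
 For the kernel-arg idea above, here is a rough, untested sketch (the
 ticket_batch / parse_ticket_batch names and the runtime mask are made up
 for illustration and are not part of this patch). It assumes the code sits
 next to the spinlock implementation so __ticket_t and the TICKET_* macros
 are in scope; the lock fast path would then have to load the mask at
 runtime instead of using a compile-time constant:

#include <linux/cache.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/log2.h>

/* Hypothetical: "ticket_batch=1" on bare metal would restore the plain
 * fair ticketlock behaviour; larger values relax fairness further. */
static unsigned int __read_mostly ticket_batch = TICKET_BATCH;
static __ticket_t __read_mostly ticket_batch_mask = TICKET_LOCK_BATCH_MASK;

static int __init parse_ticket_batch(char *arg)
{
	unsigned int val;

	if (kstrtouint(arg, 0, &val) || !is_power_of_2(val))
		return -EINVAL;

	ticket_batch = val;
	/* Same formula as TICKET_LOCK_BATCH_MASK, computed at boot time. */
	ticket_batch_mask = ~(val << TICKET_LOCK_INC_SHIFT) +
			    TICKET_LOCK_TAIL_INC - 1;
	return 0;
}
early_param("ticket_batch", parse_ticket_batch);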

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 0f62f54..87685f1 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -81,23 +81,36 @@ static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
  */
 static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
 {
-	register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
+	register struct __raw_tickets inc = { .tail = TICKET_LOCK_TAIL_INC };
+	struct __raw_tickets new;
 
 	inc = xadd(&lock->tickets, inc);
-	if (likely(inc.head == inc.tail))
-		goto out;
 
 	inc.tail &= ~TICKET_SLOWPATH_FLAG;
 	for (;;) {
 		unsigned count = SPIN_THRESHOLD;
 
 		do {
-			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
-				goto out;
+			if ((inc.head & TICKET_LOCK_BATCH_MASK) == (inc.tail &
+							TICKET_LOCK_BATCH_MASK))
+				goto spin;
 			cpu_relax();
+			inc.head = ACCESS_ONCE(lock->tickets.head);
 		} while (--count);
 		__ticket_lock_spinning(lock, inc.tail);
 	}
+spin:
+	for (;;) {
+		inc.head = ACCESS_ONCE(lock->tickets.head);
+		if (!(inc.head & TICKET_LOCK_HEAD_INC)) {
+			new.head = inc.head | TICKET_LOCK_HEAD_INC;
+			if (cmpxchg(&lock->tickets.head, inc.head, new.head)
+					== inc.head)
+				goto out;
+		}
+		cpu_relax();
+	}
+
 out:	barrier();	/* make sure nothing creeps before the lock is taken */
 }
 
@@ -109,7 +122,8 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
 	if (old.tickets.head != (old.tickets.tail & ~TICKET_SLOWPATH_FLAG))
 		return 0;
 
-	new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
+	new.head_tail = old.head_tail + (TICKET_LOCK_TAIL_INC << TICKET_SHIFT);
+	new.head_tail |= TICKET_LOCK_HEAD_INC;
 
 	/* cmpxchg is a full barrier, so nothing can move before it */
 	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
@@ -123,7 +137,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
 	BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
 
 	/* Perform the unlock on the "before" copy */
-	old.tickets.head += TICKET_LOCK_INC;
+	old.tickets.head += TICKET_LOCK_HEAD_INC;
 
 	/* Clear the slowpath flag */
 	new.head_tail = old.head_tail & ~(TICKET_SLOWPATH_FLAG << TICKET_SHIFT);
@@ -150,14 +164,15 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 		arch_spinlock_t prev;
 
 		prev = *lock;
-		add_smp(&lock->tickets.head, TICKET_LOCK_INC);
+		add_smp(&lock->tickets.head, TICKET_LOCK_HEAD_INC);
 
 		/* add_smp() is a full mb() */
 
 		if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
 			__ticket_unlock_slowpath(lock, prev);
 	} else
-		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
+		__add(&lock->tickets.head,
+		      TICKET_LOCK_HEAD_INC, UNLOCK_LOCK_PREFIX);
 }
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)
@@ -171,7 +186,7 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 {
 	struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 
-	return (__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_INC;
+	return (__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_TAIL_INC;
 }
 #define arch_spin_is_contended	arch_spin_is_contended
 
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..b04c03d 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -3,15 +3,16 @@
 
 #include <linux/types.h>
 
+#define TICKET_LOCK_INC_SHIFT 1
+#define __TICKET_LOCK_TAIL_INC (1<<TICKET_LOCK_INC_SHIFT)
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define __TICKET_LOCK_INC	2
 #define TICKET_SLOWPATH_FLAG	((__ticket_t)1)
 #else
-#define __TICKET_LOCK_INC	1
 #define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
 #endif
 
-#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
+#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_TAIL_INC))
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
 #else
@@ -19,7 +20,12 @@ typedef u16 __ticket_t;
 typedef u32 __ticketpair_t;
 #endif
 
-#define TICKET_LOCK_INC	((__ticket_t)__TICKET_LOCK_INC)
+#define TICKET_LOCK_TAIL_INC ((__ticket_t)__TICKET_LOCK_TAIL_INC)
+
+#define TICKET_LOCK_HEAD_INC ((__ticket_t)1)
+#define TICKET_BATCH    0x4 /* 4 waiters can contend simultaneously */
+#define TICKET_LOCK_BATCH_MASK (~(TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) + \
+				  TICKET_LOCK_TAIL_INC - 1)
 
 #define TICKET_SHIFT	(sizeof(__ticket_t) * 8)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 0331cb3..3dd41f7 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -755,7 +755,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 	 * check again make sure it didn't become free while
 	 * we weren't looking.
 	 */
-	if (ACCESS_ONCE(lock->tickets.head) == want) {
+	if ((ACCESS_ONCE(lock->tickets.head) & TICKET_LOCK_BATCH_MASK) ==
+				(want & TICKET_LOCK_BATCH_MASK)) {
 		add_stats(TAKEN_SLOW_PICKUP, 1);
 		goto out;
 	}
@@ -787,7 +788,8 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 	for_each_cpu(cpu, &waiting_cpus) {
 		const struct kvm_lock_waiting *w = &per_cpu(klock_waiting, cpu);
 		if (ACCESS_ONCE(w->lock) == lock &&
-		    ACCESS_ONCE(w->want) == ticket) {
+		    (ACCESS_ONCE(w->want) & TICKET_LOCK_BATCH_MASK) ==
+			(ticket & TICKET_LOCK_BATCH_MASK)) {
 			add_stats(RELEASED_SLOW_KICKED, 1);
 			kvm_kick_cpu(cpu);
 			break;
-- 
1.7.11.7


* Re: [RFC] Implement Batched (group) ticket lock
  2014-05-28 12:16 ` Raghavendra K T
@ 2014-05-28 21:55   ` Rik van Riel
From: Rik van Riel @ 2014-05-28 21:55 UTC (permalink / raw)
  To: Raghavendra K T, tglx, mingo, hpa, konrad.wilk, pbonzini, gleb,
	peterz, paulmck
  Cc: torvalds, waiman.long, davej, oleg, x86, jeremy, paul.gortmaker,
	ak, jasowang, fernando_b1, linux-kernel, kvm, virtualization,
	xen-devel, mtosatti, chegu_vinod

On 05/28/2014 08:16 AM, Raghavendra K T wrote:

This patch looks very promising.

> TODO:
> - we need an intelligent way to nullify the effect of batching for baremetal
>  (because extra cmpxchg is not required).

On (larger?) NUMA systems, the unfairness may be a nice performance
benefit, reducing cache line bouncing through the system, and it
could well outweigh the extra cmpxchg at times.

> - we may have to make batch size as kernel arg to solve above problem
>  (to run same kernel for host/guest). Increasing batch size also seem to help
>  virtualized guest more, so we will have flexibility of tuning depending on vm size.
> 
> - My kernbench/ebizzy test on baremetal (32 cpu +ht sandybridge) did not seem to
>   show the impact of extra cmpxchg. but there should be effect of extra cmpxchg.

Canceled out by better NUMA locality?

Or maybe cmpxchg is cheap once you already own the cache line
exclusively?

> - virtualized guest had slight impact on 1x cases of some benchmarks but we have got
>  impressive performance for >1x cases. So overall, patch needs exhaustive tesing.
> 
> - we can further add dynamically changing batch_size implementation (inspiration and
>   hint by Paul McKenney) as necessary.

I could see a larger batch size being beneficial.

Currently the maximum wait time for a spinlock on a system
with N CPUs is N times the length of the largest critical
section.

Having the batch size set equal to the number of CPUs would only
double that, and better locality (CPUs local to the current
lock holder winning the spinlock operation) might speed things
up enough to cancel that part of that out again...

>  I have found that increasing  batch size gives excellent improvements for 
>  overcommitted guests. I understand that we need more exhaustive testing.
> 
>  Please provide your suggestion and comments.

I have only nitpicks so far...

> diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
> index 4f1bea1..b04c03d 100644
> --- a/arch/x86/include/asm/spinlock_types.h
> +++ b/arch/x86/include/asm/spinlock_types.h
> @@ -3,15 +3,16 @@
>  
>  #include <linux/types.h>
>  
> +#define TICKET_LOCK_INC_SHIFT 1
> +#define __TICKET_LOCK_TAIL_INC (1<<TICKET_LOCK_INC_SHIFT)
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
> -#define __TICKET_LOCK_INC	2
>  #define TICKET_SLOWPATH_FLAG	((__ticket_t)1)
>  #else
> -#define __TICKET_LOCK_INC	1
>  #define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
>  #endif

For the !CONFIG_PARAVIRT case, TICKET_LOCK_INC_SHIFT used to be 0,
now you are making it one. Probably not an issue, since even people
who compile with 128 < CONFIG_NR_CPUS <= 256 will likely have their
spinlocks padded out to 32 or 64 bits anyway in most data structures.

> -#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
> +#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_TAIL_INC))
>  typedef u8  __ticket_t;
>  typedef u16 __ticketpair_t;
>  #else
> @@ -19,7 +20,12 @@ typedef u16 __ticket_t;
>  typedef u32 __ticketpair_t;
>  #endif
>  
> -#define TICKET_LOCK_INC	((__ticket_t)__TICKET_LOCK_INC)
> +#define TICKET_LOCK_TAIL_INC ((__ticket_t)__TICKET_LOCK_TAIL_INC)
> +
> +#define TICKET_LOCK_HEAD_INC ((__ticket_t)1)
> +#define TICKET_BATCH    0x4 /* 4 waiters can contend simultaneously */
> +#define TICKET_LOCK_BATCH_MASK (~(TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) + \
> +				  TICKET_LOCK_TAIL_INC - 1)

I do not see the value in having TICKET_BATCH declared with a
hexadecimal number, and it may be worth making sure the code
does not compile if someone tried a TICKET_BATCH value that
is not a power of 2.

-- 
All rights reversed

* Re: [RFC] Implement Batched (group) ticket lock
  2014-05-28 21:55   ` Rik van Riel
@ 2014-05-28 22:19     ` Linus Torvalds
From: Linus Torvalds @ 2014-05-28 22:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Raghavendra K T, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	Konrad Rzeszutek Wilk, Paolo Bonzini, Gleb Natapov,
	Peter Zijlstra, Paul McKenney, Waiman Long, Dave Jones,
	Oleg Nesterov, the arch/x86 maintainers, Jeremy Fitzhardinge,
	Paul Gortmaker, Andi Kleen, Jason Wang, fernando_b1,
	Linux Kernel Mailing List, KVM list, virtualization, xen-devel,
	Marcelo Tosatti, Vinod, Chegu

On Wed, May 28, 2014 at 2:55 PM, Rik van Riel <riel@redhat.com> wrote:
>
> Or maybe cmpxchg is cheap once you already own the cache line
> exclusively?

A locked cmpxchg ends up being anything between ~15-50 cycles
depending on microarchitecture if things are already exclusively in
the cache (with the P4 being an outlier, and all locked instructions
tend to take ~100+ cycles, but I can't say I can really find it in
myself to even care about netburst any more).

The most noticeable downside we've seen has been when we've used
"read-op-cmpxchg" as a _replacement_ for something like "lock [x]add",
when that "read+cmpxchg" has caused two cacheline ops (cacheline first
loaded shared by the read, then exclusive by the cmpxchg). That's bad.

But if preceded by a write (or, in this case, an xadd), that doesn't
happen. Still, those roughly 15-50 cycles can certainly be noticeable
(especially at the high end), but you need to have some load that
doesn't bounce the lock, and largely fit in the caches to see it. And
you probably want to test one of the older CPU's, I think Haswell is
the lower end (ie in the ~20 cycle range).
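
For concreteness, the two access patterns being contrasted look roughly like
this with GCC atomic builtins (user-space sketch added for illustration, not
from the thread; the take_ticket_* names are made up):

#include <stdint.h>

static uint32_t ticket_counter;

/* "lock xadd" style: one exclusive cache-line acquisition. */
static inline uint32_t take_ticket_xadd(void)
{
	return __atomic_fetch_add(&ticket_counter, 1, __ATOMIC_SEQ_CST);
}

/* "read + cmpxchg" style: the initial load may pull the line in shared
 * state, and the cmpxchg then has to upgrade it to exclusive, costing a
 * second coherence operation. */
static inline uint32_t take_ticket_cmpxchg(void)
{
	uint32_t old = __atomic_load_n(&ticket_counter, __ATOMIC_RELAXED);

	while (!__atomic_compare_exchange_n(&ticket_counter, &old, old + 1,
					    0, __ATOMIC_SEQ_CST,
					    __ATOMIC_RELAXED))
		;	/* a failed cmpxchg refreshes 'old' with the current value */
	return old;
}

In the batched lock, the cmpxchg follows the xadd that already made the line
exclusive, which is the "preceded by a write" case described above.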

If somebody has a P4 still, that's likely the worst case by far.

              Linus

* Re: [RFC] Implement Batched (group) ticket lock
  2014-05-28 22:19     ` Linus Torvalds
@ 2014-05-28 22:29       ` Thomas Gleixner
From: Thomas Gleixner @ 2014-05-28 22:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Raghavendra K T, Ingo Molnar, Peter Anvin,
	Konrad Rzeszutek Wilk, Paolo Bonzini, Gleb Natapov,
	Peter Zijlstra, Paul McKenney, Waiman Long, Dave Jones,
	Oleg Nesterov, the arch/x86 maintainers, Jeremy Fitzhardinge,
	Paul Gortmaker, Andi Kleen, Jason Wang, fernando_b1,
	Linux Kernel Mailing List, KVM list, virtualization, xen-devel,
	Marcelo Tosatti, Vinod, Chegu

On Wed, 28 May 2014, Linus Torvalds wrote:
> 
> If somebody has a P4 still, that's likely the worst case by far.

I do, but I'm only using it during winter and only if the ia64 machine
does not provide sufficient heating. So you have to wait at least half
a year until I'm able to test it.

* Re: [RFC] Implement Batched (group) ticket lock
  2014-05-28 22:19     ` Linus Torvalds
@ 2014-05-29  1:18       ` Rik van Riel
From: Rik van Riel @ 2014-05-29  1:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Raghavendra K T, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	Konrad Rzeszutek Wilk, Paolo Bonzini, Gleb Natapov,
	Peter Zijlstra, Paul McKenney, Waiman Long, Dave Jones,
	Oleg Nesterov, the arch/x86 maintainers, Jeremy Fitzhardinge,
	Paul Gortmaker, Andi Kleen, Jason Wang, fernando_b1,
	Linux Kernel Mailing List, KVM list, virtualization, xen-devel,
	Marcelo Tosatti, Vinod, Chegu

On 05/28/2014 06:19 PM, Linus Torvalds wrote:

> If somebody has a P4 still, that's likely the worst case by far.

I'm sure cmpxchg isn't the only thing making P4 the worst case :)

-- 
All rights reversed

* Re: [RFC] Implement Batched (group) ticket lock
  2014-05-28 12:16 ` Raghavendra K T
@ 2014-05-29  6:46   ` Peter Zijlstra
From: Peter Zijlstra @ 2014-05-29  6:46 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: tglx, mingo, hpa, konrad.wilk, pbonzini, gleb, paulmck, torvalds,
	waiman.long, davej, oleg, x86, jeremy, paul.gortmaker, ak,
	jasowang, fernando_b1, linux-kernel, kvm, virtualization,
	xen-devel, riel, mtosatti, chegu_vinod

On Wed, May 28, 2014 at 05:46:39PM +0530, Raghavendra K T wrote:
> In virtualized environment there are mainly three problems
> related to spinlocks that affect performance.
> 1. LHP (lock holder preemption)
> 2. Lock Waiter Preemption (LWP)
> 3. Starvation/fairness
> 
>  Though ticketlocks solve the fairness problem, it worsens LWP, LHP problems.
> pv-ticketlocks tried to address this. But we can further improve at the
> cost of relaxed fairness.

So I really hate the idea of having different locks for paravirt and
normal kernels.

And we're looking to move to that queued lock for normal kernels.

* Re: [RFC] Implement Batched (group) ticket lock
  2014-05-28 21:55   ` Rik van Riel
@ 2014-05-29  9:44     ` Raghavendra K T
From: Raghavendra K T @ 2014-05-29  9:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: tglx, mingo, hpa, konrad.wilk, pbonzini, gleb, peterz, paulmck,
	torvalds, waiman.long, davej, oleg, x86, jeremy, paul.gortmaker,
	ak, jasowang, fernando_b1, linux-kernel, kvm, virtualization,
	xen-devel, mtosatti, chegu_vinod

On 05/29/2014 03:25 AM, Rik van Riel wrote:
> On 05/28/2014 08:16 AM, Raghavendra K T wrote:
>
> This patch looks very promising.

Thank you Rik.

[...]
>>
>> - My kernbench/ebizzy test on baremetal (32 cpu +ht sandybridge) did not seem to
>>    show the impact of extra cmpxchg. but there should be effect of extra cmpxchg.
>
> Canceled out by better NUMA locality?

Yes, perhaps. It was even slightly better.

[...]
>> - we can further add dynamically changing batch_size implementation (inspiration and
>>    hint by Paul McKenney) as necessary.
>
> I could see a larger batch size being beneficial.
>
> Currently the maximum wait time for a spinlock on a system
> with N CPUs is N times the length of the largest critical
> section.
>
> Having the batch size set equal to the number of CPUs would only
> double that, and better locality (CPUs local to the current
> lock holder winning the spinlock operation) might speed things
> up enough to cancel that part of that out again...

Having batch size = number of cpus would definitely help contended cases,
especially on larger machines (from my experience testing on a 4-node,
32-core machine). The +HT case should make it even more beneficial.

My only concern was the overhead in undercommit cases because of the extra
cmpxchg. So maybe batch_size = total cpus / numa nodes would be optimal?...

[...]
>> +#define TICKET_LOCK_INC_SHIFT 1
>> +#define __TICKET_LOCK_TAIL_INC (1<<TICKET_LOCK_INC_SHIFT)
>> +
>>   #ifdef CONFIG_PARAVIRT_SPINLOCKS
>> -#define __TICKET_LOCK_INC	2
>>   #define TICKET_SLOWPATH_FLAG	((__ticket_t)1)
>>   #else
>> -#define __TICKET_LOCK_INC	1
>>   #define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
>>   #endif
>
> For the !CONFIG_PARAVIRT case, TICKET_LOCK_INC_SHIFT used to be 0,
> now you are making it one. Probably not an issue, since even people
> who compile with 128 < CONFIG_NR_CPUS <= 256 will likely have their
> spinlocks padded out to 32 or 64 bits anyway in most data structures.

Yes..

[...]
>> +#define TICKET_BATCH    0x4 /* 4 waiters can contend simultaneously */
>> +#define TICKET_LOCK_BATCH_MASK (~(TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) + \
>> +				  TICKET_LOCK_TAIL_INC - 1)
>
> I do not see the value in having TICKET_BATCH declared with a
> hexadecimal number,

Yes.. it only helped me make the idea readable to myself; I can get rid of
it if needed.

> and it may be worth making sure the code
> does not compile if someone tried a TICKET_BATCH value that
> is not a power of 2.

I agree. I will have a BUILD_BUG for non-power-of-2 values in the next
version. It also reminds me that I wanted to have TICKET_BATCH = 1 for
!CONFIG_PARAVIRT so that we continue to have the original fair lock version;
a rough sketch of both points is below. Does that make sense? I had left it
out after considering that the same kernel running on host/guest would
anyway have CONFIG_PARAVIRT on.
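
A minimal sketch (added for illustration only, not the actual next version)
of how spinlock_types.h could enforce both points:

#ifdef CONFIG_PARAVIRT_SPINLOCKS
#define TICKET_BATCH	4	/* 4 waiters can contend simultaneously */
#else
#define TICKET_BATCH	1	/* batch of 1 == original fair ticketlock */
#endif

/* Refuse to build if the batch size is not a power of 2. */
#if (TICKET_BATCH & (TICKET_BATCH - 1)) != 0
#error "TICKET_BATCH must be a power of 2"
#endif

#define TICKET_LOCK_BATCH_MASK (~(TICKET_BATCH << TICKET_LOCK_INC_SHIFT) + \
				TICKET_LOCK_TAIL_INC - 1)

The #if/#error form works at the preprocessor level, whereas a BUILD_BUG_ON()
would have to sit inside a function (e.g. arch_spin_lock()); with
TICKET_BATCH == 1 the mask only hides the head "held" bit, so eligibility
degenerates to an exact ticket match, i.e. the original fair ordering.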


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Implement Batched (group) ticket lock
  2014-05-29  6:46   ` Peter Zijlstra
@ 2014-05-29  9:51     ` Raghavendra K T
  -1 siblings, 0 replies; 31+ messages in thread
From: Raghavendra K T @ 2014-05-29  9:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, hpa, konrad.wilk, pbonzini, gleb, paulmck, torvalds,
	waiman.long, davej, oleg, x86, jeremy, paul.gortmaker, ak,
	jasowang, fernando_b1, linux-kernel, kvm, virtualization,
	xen-devel, riel, mtosatti, chegu_vinod

On 05/29/2014 12:16 PM, Peter Zijlstra wrote:
> On Wed, May 28, 2014 at 05:46:39PM +0530, Raghavendra K T wrote:
>> In virtualized environment there are mainly three problems
>> related to spinlocks that affect performance.
>> 1. LHP (lock holder preemption)
>> 2. Lock Waiter Preemption (LWP)
>> 3. Starvation/fairness
>>
>>   Though ticketlocks solve the fairness problem, it worsens LWP, LHP problems.
>> pv-ticketlocks tried to address this. But we can further improve at the
>> cost of relaxed fairness.
>
> So I really hate the idea of having different locks for paravirt and
> normal kernels.

Yes. I understand that a queued lock for the normal kernel and an unfair
version of the queued spinlock for virtual guests would do better.

Since strict serialization of lock waiters (in both ticketlock and queued
spinlock) does not work well for virtualized guests, my idea was to offer
an alternative that has bounded starvation and performs as well as the
unfair version in a virtualized guest.
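To make the bounded-starvation point concrete, a small userspace sketch
(mine, not kernel code) of how the batching test groups tickets: tail
tickets advance by 2, and only waiters whose masked ticket matches the
masked head, i.e. the current batch of 4, get to contend; later batches
keep strict FIFO order. The mask below is written in the simplified form
that is equivalent to the patch's while TICKET_LOCK_TAIL_INC is 2.

#include <stdio.h>

#define TICKET_LOCK_INC_SHIFT	1
#define TICKET_LOCK_TAIL_INC	(1 << TICKET_LOCK_INC_SHIFT)	/* 2 */
#define TICKET_BATCH		4
#define TICKET_LOCK_BATCH_MASK	(~((TICKET_BATCH << TICKET_LOCK_INC_SHIFT) - 1))

int main(void)
{
	unsigned int head = 0;	/* lock currently serving batch 0 */

	for (unsigned int ticket = 0; ticket < 8 * TICKET_LOCK_TAIL_INC;
	     ticket += TICKET_LOCK_TAIL_INC)
		printf("ticket %2u: %s\n", ticket,
		       (ticket & TICKET_LOCK_BATCH_MASK) ==
		       (head & TICKET_LOCK_BATCH_MASK) ?
		       "eligible, races for the bit lock" :
		       "waits for its batch (FIFO)");
	return 0;
}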

>
> And we're looking to move to that queued lock for normal kernels.

Agreed, and I have tested that too.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Implement Batched (group) ticket lock
  2014-05-28 12:16 ` Raghavendra K T
@ 2014-05-29 22:45   ` Waiman Long
  -1 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2014-05-29 22:45 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: tglx, mingo, hpa, konrad.wilk, pbonzini, gleb, peterz, paulmck,
	torvalds, davej, oleg, x86, jeremy, paul.gortmaker, ak, jasowang,
	fernando_b1, linux-kernel, kvm, virtualization, xen-devel, riel,
	mtosatti, chegu_vinod

On 05/28/2014 08:16 AM, Raghavendra K T wrote:
>
> TODO:
> - we need an intelligent way to nullify the effect of batching for baremetal
>   (because extra cmpxchg is not required).

To do this, you will need to have 2 slightly different algorithms 
depending on the paravirt_ticketlocks_enabled jump label.
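Roughly the shape being suggested, sketched as a standalone program; the
paravirt_ticketlocks_enabled static key is stood in for by a plain bool,
and the two helper names are made up for illustration:

#include <stdbool.h>
#include <stdio.h>

static bool pv_ticketlocks_enabled;	/* stand-in for the jump label */

static void lock_slowpath_batched(void)
{
	puts("batched path: race for the head LSB with cmpxchg");
}

static void lock_slowpath_ticket(void)
{
	puts("plain path: spin until head == my ticket, no cmpxchg");
}

static void arch_spin_lock_demo(void)
{
	if (pv_ticketlocks_enabled)	/* static_key_false() in real code */
		lock_slowpath_batched();
	else
		lock_slowpath_ticket();
}

int main(void)
{
	arch_spin_lock_demo();		/* bare metal flavour */
	pv_ticketlocks_enabled = true;
	arch_spin_lock_demo();		/* paravirt flavour */
	return 0;
}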

>
> - My kernbench/ebizzy test on baremetal (32 cpu +ht sandybridge) did not seem to
>    show the impact of extra cmpxchg. but there should be effect of extra cmpxchg.

It will depend on the micro-benchmark and the test system used. I have seen
a test case where the extra cmpxchg did not really impact performance on a
Westmere system but had a noticeable adverse impact on an IvyBridge system
with the same micro-benchmark.

>   Please provide your suggestion and comments.
>
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 0f62f54..87685f1 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -81,23 +81,36 @@ static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>    */
>   static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
>   {
> -	register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
> +	register struct __raw_tickets inc = { .tail = TICKET_LOCK_TAIL_INC };
> +	struct __raw_tickets new;
>
>   	inc = xadd(&lock->tickets, inc);
> -	if (likely(inc.head == inc.tail))
> -		goto out;
>
>   	inc.tail&= ~TICKET_SLOWPATH_FLAG;
>   	for (;;) {
>   		unsigned count = SPIN_THRESHOLD;
>
>   		do {
> -			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
> -				goto out;
> +			if ((inc.head&  TICKET_LOCK_BATCH_MASK) == (inc.tail&
> +							TICKET_LOCK_BATCH_MASK))
> +				goto spin;
>   			cpu_relax();
> +			inc.head = ACCESS_ONCE(lock->tickets.head);
>   		} while (--count);
>   		__ticket_lock_spinning(lock, inc.tail);
>   	}
> +spin:
> +	for (;;) {
> +		inc.head = ACCESS_ONCE(lock->tickets.head);
> +		if (!(inc.head&  TICKET_LOCK_HEAD_INC)) {
> +			new.head = inc.head | TICKET_LOCK_HEAD_INC;
> +			if (cmpxchg(&lock->tickets.head, inc.head, new.head)
> +					== inc.head)
> +				goto out;
> +		}
> +		cpu_relax();
> +	}
> +

It took me some time to figure out that the LSB of inc.head is used as a
bit lock for the contending tasks in the spin loop. I would suggest adding
a comment here to make that easier to see.
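For readers following along, a tiny standalone illustration of that encoding
(my sketch, not the patch): the winner of the cmpxchg race sets the LSB of
head, and the unlock's add of 1 clears it again while advancing head to the
next even value.

#include <stdio.h>

#define TICKET_LOCK_HEAD_INC 1

int main(void)
{
	unsigned int head = 0;	/* even value: inner bit lock is free */
	int i;

	for (i = 0; i < 3; i++) {
		head |= TICKET_LOCK_HEAD_INC;	/* cmpxchg winner takes the bit lock */
		printf("acquired: head = %u (LSB set)\n", head);

		head += TICKET_LOCK_HEAD_INC;	/* unlock clears LSB, head moves on */
		printf("released: head = %u (LSB clear)\n", head);
	}
	return 0;
}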

> diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
> index 4f1bea1..b04c03d 100644
> --- a/arch/x86/include/asm/spinlock_types.h
> +++ b/arch/x86/include/asm/spinlock_types.h
> @@ -3,15 +3,16 @@
>
>   #include<linux/types.h>
>
> +#define TICKET_LOCK_INC_SHIFT 1
> +#define __TICKET_LOCK_TAIL_INC (1<<TICKET_LOCK_INC_SHIFT)
> +
>   #ifdef CONFIG_PARAVIRT_SPINLOCKS
> -#define __TICKET_LOCK_INC	2
>   #define TICKET_SLOWPATH_FLAG	((__ticket_t)1)
>   #else
> -#define __TICKET_LOCK_INC	1
>   #define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
>   #endif
>
> -#if (CONFIG_NR_CPUS<  (256 / __TICKET_LOCK_INC))
> +#if (CONFIG_NR_CPUS<  (256 / __TICKET_LOCK_TAIL_INC))
>   typedef u8  __ticket_t;
>   typedef u16 __ticketpair_t;
>   #else
> @@ -19,7 +20,12 @@ typedef u16 __ticket_t;
>   typedef u32 __ticketpair_t;
>   #endif
>
> -#define TICKET_LOCK_INC	((__ticket_t)__TICKET_LOCK_INC)
> +#define TICKET_LOCK_TAIL_INC ((__ticket_t)__TICKET_LOCK_TAIL_INC)
> +
> +#define TICKET_LOCK_HEAD_INC ((__ticket_t)1)
> +#define TICKET_BATCH    0x4 /* 4 waiters can contend simultaneously */
> +#define TICKET_LOCK_BATCH_MASK (~(TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) + \
> +				  TICKET_LOCK_TAIL_INC - 1)

I don't think TAIL_INC has anything to do with setting the BATCH_MASK. 
It works here because TAIL_INC is 2. I think it is clearer to define it 
as either "(~(TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) + 1)" or 
"(~((TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) - 1))".
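A quick standalone check of that observation; the values mirror the RFC, and
the TAIL_INC = 4 case is hypothetical, included only to show where the posted
form stops being a proper mask:

#include <stdio.h>

#define SHIFT	1
#define BATCH	4

static unsigned int mask_posted(unsigned int tail_inc)
{
	/* form used in the RFC: ~(BATCH << SHIFT) + TAIL_INC - 1 */
	return ~(BATCH << SHIFT) + tail_inc - 1;
}

static unsigned int mask_suggested(void)
{
	/* form that does not depend on TAIL_INC at all */
	return ~((BATCH << SHIFT) - 1);
}

int main(void)
{
	printf("TAIL_INC=2: posted %#x, suggested %#x\n",
	       mask_posted(2), mask_suggested());
	printf("TAIL_INC=4: posted %#x, suggested %#x\n",
	       mask_posted(4), mask_suggested());
	return 0;
}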

-Longman

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Implement Batched (group) ticket lock
  2014-05-29 22:45   ` Waiman Long
@ 2014-05-30  8:53     ` Raghavendra K T
  -1 siblings, 0 replies; 31+ messages in thread
From: Raghavendra K T @ 2014-05-30  8:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: tglx, mingo, hpa, konrad.wilk, pbonzini, gleb, peterz, paulmck,
	torvalds, davej, oleg, x86, jeremy, paul.gortmaker, ak, jasowang,
	fernando_b1, linux-kernel, kvm, virtualization, xen-devel, riel,
	mtosatti, chegu_vinod

On 05/30/2014 04:15 AM, Waiman Long wrote:
> On 05/28/2014 08:16 AM, Raghavendra K T wrote:
>> - we need an intelligent way to nullify the effect of batching for
>> baremetal
>>   (because extra cmpxchg is not required).
>
> To do this, you will need to have 2 slightly different algorithms
> depending on the paravirt_ticketlocks_enabled jump label.

Thanks for the hint Waiman.

[...]
>> +spin:
>> +    for (;;) {
>> +        inc.head = ACCESS_ONCE(lock->tickets.head);
>> +        if (!(inc.head&  TICKET_LOCK_HEAD_INC)) {
>> +            new.head = inc.head | TICKET_LOCK_HEAD_INC;
>> +            if (cmpxchg(&lock->tickets.head, inc.head, new.head)
>> +                    == inc.head)
>> +                goto out;
>> +        }
>> +        cpu_relax();
>> +    }
>> +
>
> It had taken me some time to figure out the the LSB of inc.head is used
> as a bit lock for the contending tasks in the spin loop. I would suggest
> adding some comment here to make it easier to look at.

Agreed. I'll add a comment.

[...]
>> +#define TICKET_BATCH    0x4 /* 4 waiters can contend simultaneously */
>> +#define TICKET_LOCK_BATCH_MASK
>> (~(TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) + \
>> +                  TICKET_LOCK_TAIL_INC - 1)
>
> I don't think TAIL_INC has anything to do with setting the BATCH_MASK.
> It works here because TAIL_INC is 2. I think it is clearer to define it
> as either "(~(TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) + 1)" or
> (~((TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) - 1)).

You are right, thanks for pointing that out. Your expressions are simpler
and clearer; I'll use one of them.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC] Implement Batched (group) ticket lock
@ 2014-05-28 12:16 Raghavendra K T
  0 siblings, 0 replies; 31+ messages in thread
From: Raghavendra K T @ 2014-05-28 12:16 UTC (permalink / raw)
  To: tglx, mingo, hpa, konrad.wilk, pbonzini, gleb, peterz, paulmck
  Cc: waiman.long, jeremy, ak, kvm, linux-kernel, jasowang, x86, oleg,
	fernando_b1, paul.gortmaker, chegu_vinod, raghavendra.kt,
	xen-devel, davej, virtualization, torvalds, mtosatti

In virtualized environment there are mainly three problems
related to spinlocks that affect performance.
1. LHP (lock holder preemption)
2. Lock Waiter Preemption (LWP)
3. Starvation/fairness

 Though ticketlocks solve the fairness problem, it worsens LWP, LHP problems.
pv-ticketlocks tried to address this. But we can further improve at the
cost of relaxed fairness.

In this patch, we form a batch of eligible lock holders and we serve the eligible
(to hold the lock) batches in FIFO, but the lock-waiters within eligible batch are served
in an unfair manner. This increases probability of any eligible lock-holder being
in running state (to an average of (batch_size/2)-1). It also provides needed
bounded starvation since any lock requester cannot acquire more than batch_size
times repeatedly during contention. On the negative side we would need an extra cmpxchg.

 The patch has the batch size of 4. (As we know increasing  batch size means we are
closer to unfair locks and batch size of 1 = ticketlock).

Result:
Test system: 32cpu 2node machine w/ 64GB each (16 pcpu machine +ht).
Guests:  8GB  16vcpu guests (average of 8 iterations)

 % Improvements with kvm guests (batch size = 4):
  ebizzy_0.5x   4.3
  ebizzy_1.0x   7.8
  ebizzy_1.5x  23.4
  ebizzy_2.0x  48.6

Baremetal:
ebizzy showed very high stdev, kernbench result was good but both of them did
not show much difference.

ebizzy: rec/sec higher is better
base    50452
patched 50703

kernbench  time in sec (lesser is better)
base    48.9 sec
patched 48.8 sec

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/include/asm/spinlock.h       | 35 +++++++++++++++++++++++++----------
 arch/x86/include/asm/spinlock_types.h | 14 ++++++++++----
 arch/x86/kernel/kvm.c                 |  6 ++++--
 3 files changed, 39 insertions(+), 16 deletions(-)

TODO:
- we need an intelligent way to nullify the effect of batching for baremetal
 (because extra cmpxchg is not required).

- we may have to make batch size as kernel arg to solve above problem
 (to run same kernel for host/guest). Increasing batch size also seem to help
 virtualized guest more, so we will have flexibility of tuning depending on vm size.

- My kernbench/ebizzy test on baremetal (32 cpu +ht sandybridge) did not seem to
  show the impact of extra cmpxchg. but there should be effect of extra cmpxchg.

- virtualized guest had slight impact on 1x cases of some benchmarks but we have got
 impressive performance for >1x cases. So overall, the patch needs exhaustive testing.

- we can further add dynamically changing batch_size implementation (inspiration and
  hint by Paul McKenney) as necessary.
 
 I have found that increasing  batch size gives excellent improvements for 
 overcommitted guests. I understand that we need more exhaustive testing.

 Please provide your suggestion and comments.

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 0f62f54..87685f1 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -81,23 +81,36 @@ static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
  */
 static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
 {
-	register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
+	register struct __raw_tickets inc = { .tail = TICKET_LOCK_TAIL_INC };
+	struct __raw_tickets new;
 
 	inc = xadd(&lock->tickets, inc);
-	if (likely(inc.head == inc.tail))
-		goto out;
 
 	inc.tail &= ~TICKET_SLOWPATH_FLAG;
 	for (;;) {
 		unsigned count = SPIN_THRESHOLD;
 
 		do {
-			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
-				goto out;
+			if ((inc.head & TICKET_LOCK_BATCH_MASK) == (inc.tail &
+							TICKET_LOCK_BATCH_MASK))
+				goto spin;
 			cpu_relax();
+			inc.head = ACCESS_ONCE(lock->tickets.head);
 		} while (--count);
 		__ticket_lock_spinning(lock, inc.tail);
 	}
+spin:
+	for (;;) {
+		inc.head = ACCESS_ONCE(lock->tickets.head);
+		if (!(inc.head & TICKET_LOCK_HEAD_INC)) {
+			new.head = inc.head | TICKET_LOCK_HEAD_INC;
+			if (cmpxchg(&lock->tickets.head, inc.head, new.head)
+					== inc.head)
+				goto out;
+		}
+		cpu_relax();
+	}
+
 out:	barrier();	/* make sure nothing creeps before the lock is taken */
 }
 
@@ -109,7 +122,8 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
 	if (old.tickets.head != (old.tickets.tail & ~TICKET_SLOWPATH_FLAG))
 		return 0;
 
-	new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
+	new.head_tail = old.head_tail + (TICKET_LOCK_TAIL_INC << TICKET_SHIFT);
+	new.head_tail |= TICKET_LOCK_HEAD_INC;
 
 	/* cmpxchg is a full barrier, so nothing can move before it */
 	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
@@ -123,7 +137,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
 	BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
 
 	/* Perform the unlock on the "before" copy */
-	old.tickets.head += TICKET_LOCK_INC;
+	old.tickets.head += TICKET_LOCK_HEAD_INC;
 
 	/* Clear the slowpath flag */
 	new.head_tail = old.head_tail & ~(TICKET_SLOWPATH_FLAG << TICKET_SHIFT);
@@ -150,14 +164,15 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 		arch_spinlock_t prev;
 
 		prev = *lock;
-		add_smp(&lock->tickets.head, TICKET_LOCK_INC);
+		add_smp(&lock->tickets.head, TICKET_LOCK_HEAD_INC);
 
 		/* add_smp() is a full mb() */
 
 		if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
 			__ticket_unlock_slowpath(lock, prev);
 	} else
-		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
+		__add(&lock->tickets.head,
+		      TICKET_LOCK_HEAD_INC, UNLOCK_LOCK_PREFIX);
 }
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)
@@ -171,7 +186,7 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 {
 	struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 
-	return (__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_INC;
+	return (__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_TAIL_INC;
 }
 #define arch_spin_is_contended	arch_spin_is_contended
 
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..b04c03d 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -3,15 +3,16 @@
 
 #include <linux/types.h>
 
+#define TICKET_LOCK_INC_SHIFT 1
+#define __TICKET_LOCK_TAIL_INC (1<<TICKET_LOCK_INC_SHIFT)
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define __TICKET_LOCK_INC	2
 #define TICKET_SLOWPATH_FLAG	((__ticket_t)1)
 #else
-#define __TICKET_LOCK_INC	1
 #define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
 #endif
 
-#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
+#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_TAIL_INC))
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
 #else
@@ -19,7 +20,12 @@ typedef u16 __ticket_t;
 typedef u32 __ticketpair_t;
 #endif
 
-#define TICKET_LOCK_INC	((__ticket_t)__TICKET_LOCK_INC)
+#define TICKET_LOCK_TAIL_INC ((__ticket_t)__TICKET_LOCK_TAIL_INC)
+
+#define TICKET_LOCK_HEAD_INC ((__ticket_t)1)
+#define TICKET_BATCH    0x4 /* 4 waiters can contend simultaneously */
+#define TICKET_LOCK_BATCH_MASK (~(TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) + \
+				  TICKET_LOCK_TAIL_INC - 1)
 
 #define TICKET_SHIFT	(sizeof(__ticket_t) * 8)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 0331cb3..3dd41f7 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -755,7 +755,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 	 * check again make sure it didn't become free while
 	 * we weren't looking.
 	 */
-	if (ACCESS_ONCE(lock->tickets.head) == want) {
+	if ((ACCESS_ONCE(lock->tickets.head) & TICKET_LOCK_BATCH_MASK) ==
+				(want & TICKET_LOCK_BATCH_MASK)) {
 		add_stats(TAKEN_SLOW_PICKUP, 1);
 		goto out;
 	}
@@ -787,7 +788,8 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 	for_each_cpu(cpu, &waiting_cpus) {
 		const struct kvm_lock_waiting *w = &per_cpu(klock_waiting, cpu);
 		if (ACCESS_ONCE(w->lock) == lock &&
-		    ACCESS_ONCE(w->want) == ticket) {
+		    (ACCESS_ONCE(w->want) & TICKET_LOCK_BATCH_MASK) ==
+			(ticket & TICKET_LOCK_BATCH_MASK)) {
 			add_stats(RELEASED_SLOW_KICKED, 1);
 			kvm_kick_cpu(cpu);
 			break;
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 31+ messages in thread
