linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
@ 2011-03-23 15:37 Tejun Heo
  2011-03-23 15:40 ` Tejun Heo
From: Tejun Heo @ 2011-03-23 15:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Chris Mason
  Cc: linux-kernel, linux-btrfs

Hello, guys.

I've been playing with locking in btrfs, which has developed custom
locking to avoid excessive context switches in its btree
implementation.

Generally, doing away with the custom implementation and just using
the mutex's adaptive owner spinning seems better; however, there's an
interesting distinction in the custom implementation of trylock.  It
distinguishes between simple trylock and tryspin, where the former
just tries once and then fails while the latter does some spinning
before giving up.
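
To make the distinction concrete, here is a minimal sketch (the helper
names are made up for illustration; btrfs's actual locking helpers
differ):

	/*
	 * Illustrative only, not the btrfs API.  try_once() gives up
	 * immediately on contention; try_spin() burns a bounded number
	 * of iterations first, hoping the holder lets go soon.
	 */
	static bool try_once(struct mutex *m)
	{
		return mutex_trylock(m);
	}

	static bool try_spin(struct mutex *m, unsigned int max_iters)
	{
		unsigned int i;

		for (i = 0; i < max_iters; i++) {
			if (mutex_trylock(m))
				return true;
			cpu_relax();	/* CPU-friendly busy-wait hint */
		}
		return false;
	}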

Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
just once.  I got curious whether using adaptive spinning on
mutex_trylock() would be beneficial and it seems so, at least for
btrfs anyway.

The following results are from "dbench 50" run on a two-socket,
eight-core Opteron machine with 4GiB of memory and an OCZ Vertex SSD.
During the run, the disk stays mostly idle, all CPUs are fully
occupied, and the difference in locking performance becomes quite visible.

SIMPLE is with the locking simplification patch[1] applied, i.e. it
basically just uses mutexes.  SPIN is with the patch at the end of this
message applied on top - mutex_trylock() uses adaptive spinning.

        USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
 SIMPLE 61107  354977    217  8099529  845.100 MB/sec
 SPIN   63140  364888    214  6840527  879.077 MB/sec

I've been running various configs since yesterday and the
adaptive spinning trylock consistently posts higher throughput.  The
amount of difference varies, but it outperforms consistently.

In general, I think using adaptive spinning on trylock makes sense as
trylock failure usually leads to a costly unlock-relock sequence (see
the sketch below).
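
For two locks A and B with established lock order B -> A, the pattern
looks roughly like this (an illustration, not btrfs code):

	/* Take B while holding A without violating the B -> A order. */
	static void lock_both(struct mutex *A, struct mutex *B)
	{
	again:
		mutex_lock(A);
		if (!mutex_trylock(B)) {
			mutex_unlock(A);	/* the costly unlock... */
			goto again;		/* ...relock sequence */
		}
		/* both A and B are held here */
	}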

Also, contrary to the comment, the adaptive spinning doesn't seem to
check whether there are pending waiters or not.  Is this intended, or
did the test get lost somehow?

Thanks.

NOT-Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/mutex.c |   98 +++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 61 insertions(+), 37 deletions(-)

Index: work/kernel/mutex.c
===================================================================
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -126,39 +126,33 @@ void __sched mutex_unlock(struct mutex *
 
 EXPORT_SYMBOL(mutex_unlock);
 
-/*
- * Lock a mutex (possibly interruptible), slowpath:
+/**
+ * mutex_spin - optimistic spinning on mutex
+ * @lock: mutex to spin on
+ *
+ * This function implements optimistic spin for acquisition of @lock when
+ * we find that there are no pending waiters and the lock owner is
+ * currently running on a (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation doesn't
+ * track the owner atomically in the lock field, we need to track it
+ * non-atomically.
+ *
+ * We can't do this for DEBUG_MUTEXES because that relies on wait_lock to
+ * serialize everything.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if @lock is acquired, %false otherwise.
  */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-	       	unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-	struct task_struct *task = current;
-	struct mutex_waiter waiter;
-	unsigned long flags;
-
-	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	/*
-	 * Optimistic spinning.
-	 *
-	 * We try to spin for acquisition when we find that there are no
-	 * pending waiters and the lock owner is currently running on a
-	 * (different) CPU.
-	 *
-	 * The rationale is that if the lock owner is running, it is likely to
-	 * release the lock soon.
-	 *
-	 * Since this needs the lock owner, and this mutex implementation
-	 * doesn't track the owner atomically in the lock field, we need to
-	 * track it non-atomically.
-	 *
-	 * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
-	 * to serialize everything.
-	 */
-
 	for (;;) {
 		struct thread_info *owner;
 
@@ -177,12 +171,8 @@ __mutex_lock_common(struct mutex *lock,
 		if (owner && !mutex_spin_on_owner(lock, owner))
 			break;
 
-		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
-			lock_acquired(&lock->dep_map, ip);
-			mutex_set_owner(lock);
-			preempt_enable();
-			return 0;
-		}
+		if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
+			return true;
 
 		/*
 		 * When there's no owner, we might have preempted between the
@@ -190,7 +180,7 @@ __mutex_lock_common(struct mutex *lock,
 		 * we're an RT task that will live-lock because we won't let
 		 * the owner complete.
 		 */
-		if (!owner && (need_resched() || rt_task(task)))
+		if (!owner && (need_resched() || rt_task(current)))
 			break;
 
 		/*
@@ -202,6 +192,30 @@ __mutex_lock_common(struct mutex *lock,
 		cpu_relax();
 	}
 #endif
+	return false;
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static inline int __sched
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
+	       	unsigned long ip)
+{
+	struct task_struct *task = current;
+	struct mutex_waiter waiter;
+	unsigned long flags;
+
+	preempt_disable();
+	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+
+	if (mutex_spin(lock)) {
+		lock_acquired(&lock->dep_map, ip);
+		mutex_set_owner(lock);
+		preempt_enable();
+		return 0;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	debug_mutex_lock_common(lock, &waiter);
@@ -430,6 +444,15 @@ static inline int __mutex_trylock_slowpa
 	unsigned long flags;
 	int prev;
 
+	preempt_disable();
+
+	if (mutex_spin(lock)) {
+		mutex_set_owner(lock);
+		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+		preempt_enable();
+		return 1;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	prev = atomic_xchg(&lock->count, -1);
@@ -443,6 +466,7 @@ static inline int __mutex_trylock_slowpa
 		atomic_set(&lock->count, 0);
 
 	spin_unlock_mutex(&lock->wait_lock, flags);
+	preempt_enable();
 
 	return prev == 1;
 }


* Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-23 15:37 [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
@ 2011-03-23 15:40 ` Tejun Heo
  2011-03-23 15:48 ` Linus Torvalds
From: Tejun Heo @ 2011-03-23 15:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Chris Mason
  Cc: linux-kernel, linux-btrfs


Oops, sorry, I forgot the link to the simplification patch.  Attaching
the .config as well.

On Wed, Mar 23, 2011 at 04:37:27PM +0100, Tejun Heo wrote:
> SIMPLE is with the locking simplification patch[1] applied.  i.e. it
> basically just uses mutex.  SPIN is with the patch at the end of this
> message applied on top - mutex_trylock() uses adaptive spinning.

[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658

-- 
tejun

[-- Attachment #2: .config --]
[-- Type: application/x-config, Size: 54281 bytes --]


* Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-23 15:37 [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
  2011-03-23 15:40 ` Tejun Heo
@ 2011-03-23 15:48 ` Linus Torvalds
  2011-03-23 15:52   ` Tejun Heo
  2011-03-23 19:46   ` Andrey Kuzmin
  2011-03-24  8:18 ` Ingo Molnar
From: Linus Torvalds @ 2011-03-23 15:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Chris Mason,
	linux-kernel, linux-btrfs

On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo <tj@kernel.org> wrote:
>
> Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
> just once.  I got curious whether using adaptive spinning on
> mutex_trylock() would be beneficial and it seems so, at least for
> btrfs anyway.

Hmm. Seems reasonable to me. The patch looks clean, although part of
that is just the mutex_spin() cleanup that is independent of actually
using it in trylock.

So no objections from me.

                    Linus


* Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-23 15:48 ` Linus Torvalds
@ 2011-03-23 15:52   ` Tejun Heo
  2011-03-23 19:46   ` Andrey Kuzmin
From: Tejun Heo @ 2011-03-23 15:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Chris Mason,
	linux-kernel, linux-btrfs

On Wed, Mar 23, 2011 at 08:48:01AM -0700, Linus Torvalds wrote:
> On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo <tj@kernel.org> wrote:
> >
> > Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
> > just once.  I got curious whether using adaptive spinning on
> > mutex_trylock() would be beneficial and it seems so, at least for
> > btrfs anyway.
> 
> Hmm. Seems reasonable to me. The patch looks clean, although part of
> that is just the mutex_spin() cleanup that is independent of actually
> using it in trylock.

Oh, I have two split patches.  Posted the combined one for comments.

> So no objections from me.

Awesome.  Peter, what do you think?  Are there other tests which
would be useful?

Thanks.

-- 
tejun


* Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-23 15:48 ` Linus Torvalds
  2011-03-23 15:52   ` Tejun Heo
@ 2011-03-23 19:46   ` Andrey Kuzmin
From: Andrey Kuzmin @ 2011-03-23 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Chris Mason, linux-kernel, linux-btrfs

On Wed, Mar 23, 2011 at 6:48 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo <tj@kernel.org> wrote:
>>
>> Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
>> just once.  I got curious whether using adaptive spinning on
>> mutex_trylock() would be beneficial and it seems so, at least for
>> btrfs anyway.
>
> Hmm. Seems reasonable to me.

TAS/spin with exponential back-off has been the preferred locking approach
in Postgres (and, I believe, other DBMSes) for years, at least since '04
when I last touched the Postgres code. Even with the 'false negative' cost
in user-space being much higher than in the kernel, it's still just a
question of scale (no wonder a measurable improvement is reported here
from dbench on an SSD capable of a few dozen thousand IOPS).
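
(For reference, a minimal user-space sketch of TAS with exponential
back-off; it assumes x86 and GCC/Clang for the pause hint, and is not
Postgres's actual code:)

	#include <stdatomic.h>

	static void tas_lock(atomic_flag *lock)
	{
		unsigned int backoff = 1, i;

		while (atomic_flag_test_and_set_explicit(lock,
							 memory_order_acquire)) {
			for (i = 0; i < backoff; i++)
				__builtin_ia32_pause();	/* x86 spin hint */
			if (backoff < 1024)
				backoff <<= 1;	/* double the wait per failure */
		}
	}

	static void tas_unlock(atomic_flag *lock)
	{
		atomic_flag_clear_explicit(lock, memory_order_release);
	}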

Regards,
Andrey

> The patch looks clean, although part of that is just the mutex_spin()
> cleanup that is independent of actually using it in trylock.
>
> So no objections from me.
>
>                    Linus


* Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-23 15:37 [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
  2011-03-23 15:40 ` Tejun Heo
  2011-03-23 15:48 ` Linus Torvalds
@ 2011-03-24  8:18 ` Ingo Molnar
  2011-03-25  3:24   ` Steven Rostedt
  2011-03-24  9:41 ` [PATCH 1/2] Subject: mutex: Separate out mutex_spin() Tejun Heo
From: Ingo Molnar @ 2011-03-24  8:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Chris Mason, linux-kernel, linux-btrfs


* Tejun Heo <tj@kernel.org> wrote:

> NOT-Signed-off-by: Tejun Heo <tj@kernel.org>

s/NOT-// ?

	Ingo


* [PATCH 1/2] Subject: mutex: Separate out mutex_spin()
  2011-03-23 15:37 [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
  2011-03-24  8:18 ` Ingo Molnar
@ 2011-03-24  9:41 ` Tejun Heo
  2011-03-24  9:41   ` [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
  2011-03-24  9:42   ` [PATCH 1/2] Subject: mutex: Separate out mutex_spin() Tejun Heo
  2011-03-24 16:18 ` [tip:core/urgent] " tip-bot for Tejun Heo
  2011-03-24 16:19 ` [tip:core/urgent] mutex: Do adaptive spinning in mutex_trylock() tip-bot for Tejun Heo
From: Tejun Heo @ 2011-03-24  9:41 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Chris Mason
  Cc: linux-kernel, linux-btrfs

Separate mutex_spin() out of __mutex_lock_common().  The fat
comment is converted to a docbook function description.

While at it, drop the part of the comment which explains that adaptive
spinning considers whether there are pending waiters, which doesn't
match the code.

This patch is to prepare for using adaptive spinning in
mutex_trylock() and doesn't cause any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <20110323153727.GB12003@htj.dyndns.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
Here are the split patches with SOBs.  Ingo, it's probably best to route
these through -tip, I suppose?

Thanks.

 kernel/mutex.c |   87 ++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 50 insertions(+), 37 deletions(-)

Index: work/kernel/mutex.c
===================================================================
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -126,39 +126,32 @@ void __sched mutex_unlock(struct mutex *
 
 EXPORT_SYMBOL(mutex_unlock);
 
-/*
- * Lock a mutex (possibly interruptible), slowpath:
+/**
+ * mutex_spin - optimistic spinning on mutex
+ * @lock: mutex to spin on
+ *
+ * This function implements optimistic spin for acquisition of @lock when
+ * the lock owner is currently running on a (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation doesn't
+ * track the owner atomically in the lock field, we need to track it
+ * non-atomically.
+ *
+ * We can't do this for DEBUG_MUTEXES because that relies on wait_lock to
+ * serialize everything.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if @lock is acquired, %false otherwise.
  */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-	       	unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-	struct task_struct *task = current;
-	struct mutex_waiter waiter;
-	unsigned long flags;
-
-	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	/*
-	 * Optimistic spinning.
-	 *
-	 * We try to spin for acquisition when we find that there are no
-	 * pending waiters and the lock owner is currently running on a
-	 * (different) CPU.
-	 *
-	 * The rationale is that if the lock owner is running, it is likely to
-	 * release the lock soon.
-	 *
-	 * Since this needs the lock owner, and this mutex implementation
-	 * doesn't track the owner atomically in the lock field, we need to
-	 * track it non-atomically.
-	 *
-	 * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
-	 * to serialize everything.
-	 */
-
 	for (;;) {
 		struct thread_info *owner;
 
@@ -177,12 +170,8 @@ __mutex_lock_common(struct mutex *lock,
 		if (owner && !mutex_spin_on_owner(lock, owner))
 			break;
 
-		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
-			lock_acquired(&lock->dep_map, ip);
-			mutex_set_owner(lock);
-			preempt_enable();
-			return 0;
-		}
+		if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
+			return true;
 
 		/*
 		 * When there's no owner, we might have preempted between the
@@ -190,7 +179,7 @@ __mutex_lock_common(struct mutex *lock,
 		 * we're an RT task that will live-lock because we won't let
 		 * the owner complete.
 		 */
-		if (!owner && (need_resched() || rt_task(task)))
+		if (!owner && (need_resched() || rt_task(current)))
 			break;
 
 		/*
@@ -202,6 +191,30 @@ __mutex_lock_common(struct mutex *lock,
 		arch_mutex_cpu_relax();
 	}
 #endif
+	return false;
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static inline int __sched
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
+	       	unsigned long ip)
+{
+	struct task_struct *task = current;
+	struct mutex_waiter waiter;
+	unsigned long flags;
+
+	preempt_disable();
+	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+
+	if (mutex_spin(lock)) {
+		lock_acquired(&lock->dep_map, ip);
+		mutex_set_owner(lock);
+		preempt_enable();
+		return 0;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	debug_mutex_lock_common(lock, &waiter);


* [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-24  9:41 ` [PATCH 1/2] Subject: mutex: Separate out mutex_spin() Tejun Heo
@ 2011-03-24  9:41   ` Tejun Heo
  2011-03-25  3:39     ` Steven Rostedt
  2011-03-25 10:12     ` Tejun Heo
  2011-03-24  9:42   ` [PATCH 1/2] Subject: mutex: Separate out mutex_spin() Tejun Heo
From: Tejun Heo @ 2011-03-24  9:41 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Chris Mason
  Cc: linux-kernel, linux-btrfs

Adaptive owner spinning used to be applied only to mutex_lock().  This
patch applies it also to mutex_trylock().

btrfs has developed custom locking to avoid excessive context switches
in its btree implementation.  Generally, doing away with the custom
implementation and just using the mutex shows better behavior;
however, there's an interesting distinction in the custom implementation
of trylock.  It distinguishes between simple trylock and tryspin,
where the former just tries once and then fails while the latter does
some spinning before giving up.

Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
just once.  I got curious whether using adaptive spinning on
mutex_trylock() would be beneficial and it seems so, for btrfs anyway.

The following results are from "dbench 50" run on a two-socket,
eight-core Opteron machine with 4GiB of memory and an OCZ Vertex SSD.
During the run, the disk stays mostly idle, all CPUs are fully
occupied, and the difference in locking performance becomes quite visible.

SIMPLE is with the locking simplification patch[1] applied, i.e. it
basically just uses mutexes.  SPIN is with this patch applied on top -
mutex_trylock() uses adaptive spinning.

        USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
 SIMPLE 61107  354977    217  8099529  845.100 MB/sec
 SPIN   63140  364888    214  6840527  879.077 MB/sec

On various runs, the adaptive spinning trylock consistently posts
higher throughput.  The amount of difference varies, but it outperforms
consistently.

In general, using adaptive spinning on trylock makes sense as trylock
failure usually leads to a costly unlock-relock sequence.

[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <20110323153727.GB12003@htj.dyndns.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chris Mason <chris.mason@oracle.com>
---
 kernel/mutex.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: work/kernel/mutex.c
===================================================================
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -443,6 +443,15 @@ static inline int __mutex_trylock_slowpa
 	unsigned long flags;
 	int prev;
 
+	preempt_disable();
+
+	if (mutex_spin(lock)) {
+		mutex_set_owner(lock);
+		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+		preempt_enable();
+		return 1;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	prev = atomic_xchg(&lock->count, -1);
@@ -456,6 +465,7 @@ static inline int __mutex_trylock_slowpa
 		atomic_set(&lock->count, 0);
 
 	spin_unlock_mutex(&lock->wait_lock, flags);
+	preempt_enable();
 
 	return prev == 1;
 }


* Re: [PATCH 1/2] Subject: mutex: Separate out mutex_spin()
  2011-03-24  9:41 ` [PATCH 1/2] Subject: mutex: Separate out mutex_spin() Tejun Heo
  2011-03-24  9:41   ` [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
@ 2011-03-24  9:42   ` Tejun Heo
From: Tejun Heo @ 2011-03-24  9:42 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Chris Mason
  Cc: linux-kernel, linux-btrfs

Ugh... Please drop the extra "Subject: " from the subject before applying.
Thanks.

-- 
tejun


* [tip:core/urgent] mutex: Separate out mutex_spin()
  2011-03-23 15:37 [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
  2011-03-24  9:41 ` [PATCH 1/2] Subject: mutex: Separate out mutex_spin() Tejun Heo
@ 2011-03-24 16:18 ` tip-bot for Tejun Heo
  2011-03-24 16:19 ` [tip:core/urgent] mutex: Do adaptive spinning in mutex_trylock() tip-bot for Tejun Heo
From: tip-bot for Tejun Heo @ 2011-03-24 16:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, chris.mason, tj, tglx, mingo

Commit-ID:  d41228115b150ce0f813122a518d8349f68d3b85
Gitweb:     http://git.kernel.org/tip/d41228115b150ce0f813122a518d8349f68d3b85
Author:     Tejun Heo <tj@kernel.org>
AuthorDate: Thu, 24 Mar 2011 10:41:19 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 24 Mar 2011 11:16:49 +0100

mutex: Separate out mutex_spin()

Separate mutex_spin() out of __mutex_lock_common(). The fat
comment is converted to a docbook function description.

While at it, drop the part of the comment which explains that
adaptive spinning considers whether there are pending waiters,
which doesn't match the code.

This patch is to prepare for using adaptive spinning in
mutex_trylock() and doesn't cause any change in behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Mason <chris.mason@oracle.com>
LKML-Reference: <20110323153727.GB12003@htj.dyndns.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/mutex.c |   86 ++++++++++++++++++++++++++++++++------------------------
 1 files changed, 49 insertions(+), 37 deletions(-)

diff --git a/kernel/mutex.c b/kernel/mutex.c
index a5889fb..03465e8 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -126,39 +126,32 @@ void __sched mutex_unlock(struct mutex *lock)
 
 EXPORT_SYMBOL(mutex_unlock);
 
-/*
- * Lock a mutex (possibly interruptible), slowpath:
+/**
+ * mutex_spin - optimistic spinning on mutex
+ * @lock: mutex to spin on
+ *
+ * This function implements optimistic spin for acquisition of @lock when
+ * the lock owner is currently running on a (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation doesn't
+ * track the owner atomically in the lock field, we need to track it
+ * non-atomically.
+ *
+ * We can't do this for DEBUG_MUTEXES because that relies on wait_lock to
+ * serialize everything.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if @lock is acquired, %false otherwise.
  */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-	       	unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-	struct task_struct *task = current;
-	struct mutex_waiter waiter;
-	unsigned long flags;
-
-	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	/*
-	 * Optimistic spinning.
-	 *
-	 * We try to spin for acquisition when we find that there are no
-	 * pending waiters and the lock owner is currently running on a
-	 * (different) CPU.
-	 *
-	 * The rationale is that if the lock owner is running, it is likely to
-	 * release the lock soon.
-	 *
-	 * Since this needs the lock owner, and this mutex implementation
-	 * doesn't track the owner atomically in the lock field, we need to
-	 * track it non-atomically.
-	 *
-	 * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
-	 * to serialize everything.
-	 */
-
 	for (;;) {
 		struct thread_info *owner;
 
@@ -177,12 +170,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		if (owner && !mutex_spin_on_owner(lock, owner))
 			break;
 
-		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
-			lock_acquired(&lock->dep_map, ip);
-			mutex_set_owner(lock);
-			preempt_enable();
-			return 0;
-		}
+		if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
+			return true;
 
 		/*
 		 * When there's no owner, we might have preempted between the
@@ -190,7 +179,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 * we're an RT task that will live-lock because we won't let
 		 * the owner complete.
 		 */
-		if (!owner && (need_resched() || rt_task(task)))
+		if (!owner && (need_resched() || rt_task(current)))
 			break;
 
 		/*
@@ -202,6 +191,29 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		arch_mutex_cpu_relax();
 	}
 #endif
+	return false;
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static inline int __sched
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, unsigned long ip)
+{
+	struct task_struct *task = current;
+	struct mutex_waiter waiter;
+	unsigned long flags;
+
+	preempt_disable();
+	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+
+	if (mutex_spin(lock)) {
+		lock_acquired(&lock->dep_map, ip);
+		mutex_set_owner(lock);
+		preempt_enable();
+		return 0;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	debug_mutex_lock_common(lock, &waiter);


* [tip:core/urgent] mutex: Do adaptive spinning in mutex_trylock()
  2011-03-23 15:37 [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
  2011-03-24 16:18 ` [tip:core/urgent] " tip-bot for Tejun Heo
@ 2011-03-24 16:19 ` tip-bot for Tejun Heo
From: tip-bot for Tejun Heo @ 2011-03-24 16:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, chris.mason, tj, tglx, mingo

Commit-ID:  bcedb5fa5e7cfeea30b6effe0c47d8f09ffc82df
Gitweb:     http://git.kernel.org/tip/bcedb5fa5e7cfeea30b6effe0c47d8f09ffc82df
Author:     Tejun Heo <tj@kernel.org>
AuthorDate: Thu, 24 Mar 2011 10:41:51 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 24 Mar 2011 11:16:49 +0100

mutex: Do adaptive spinning in mutex_trylock()

Adaptive owner spinning is used in mutex_lock().
This patch also applies it to mutex_trylock().

btrfs has developed custom locking to avoid excessive context
switches in its btree implementation.  Generally, doing away
with the custom implementation and just using the mutex shows
better behavior; however, there's an interesting distinction in
the custom implementation of trylock.  It distinguishes between
simple trylock and tryspin, where the former just tries once and
then fails while the latter does some spinning before giving up.

Currently, mutex_trylock() doesn't use adaptive spinning.  It
tries just once.  I got curious whether using adaptive spinning
on mutex_trylock() would be beneficial and it seems so, for
btrfs anyway.

The following results are from "dbench 50" run on a two-socket,
eight-core Opteron machine with 4GiB of memory and an OCZ Vertex
SSD. During the run, the disk stays mostly idle, all CPUs are
fully occupied, and the difference in locking performance becomes
quite visible.

SIMPLE is with the locking simplification patch[1] applied,
i.e. it basically just uses mutexes.  SPIN is with this patch
applied on top - mutex_trylock() uses adaptive spinning.

        USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
 SIMPLE 61107  354977    217  8099529  845.100 MB/sec
 SPIN   63140  364888    214  6840527  879.077 MB/sec

On various runs, the adaptive spinning trylock consistently
posts higher throughput.  The amount of difference varies, but it
outperforms consistently.

In general, using adaptive spinning on trylock makes sense as
trylock failure usually leads to a costly unlock-relock sequence.

[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Mason <chris.mason@oracle.com>
LKML-Reference: <20110323153727.GB12003@htj.dyndns.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/mutex.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/mutex.c b/kernel/mutex.c
index 03465e8..2510cd1 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -442,6 +442,15 @@ static inline int __mutex_trylock_slowpath(atomic_t *lock_count)
 	unsigned long flags;
 	int prev;
 
+	preempt_disable();
+
+	if (mutex_spin(lock)) {
+		mutex_set_owner(lock);
+		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+		preempt_enable();
+		return 1;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	prev = atomic_xchg(&lock->count, -1);
@@ -455,6 +464,7 @@ static inline int __mutex_trylock_slowpath(atomic_t *lock_count)
 		atomic_set(&lock->count, 0);
 
 	spin_unlock_mutex(&lock->wait_lock, flags);
+	preempt_enable();
 
 	return prev == 1;
 }


* Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-24  8:18 ` Ingo Molnar
@ 2011-03-25  3:24   ` Steven Rostedt
  2011-03-25 10:29     ` Ingo Molnar
From: Steven Rostedt @ 2011-03-25  3:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, Linus Torvalds,
	Andrew Morton, Chris Mason, linux-kernel, linux-btrfs

On Thu, Mar 24, 2011 at 09:18:16AM +0100, Ingo Molnar wrote:
> 
> * Tejun Heo <tj@kernel.org> wrote:
> 
> > NOT-Signed-off-by: Tejun Heo <tj@kernel.org>
> 
> s/NOT-// ?
> 

Perhaps because it is still in RFC context?

-- Steve



* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-24  9:41   ` [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
@ 2011-03-25  3:39     ` Steven Rostedt
  2011-03-25  4:38       ` Linus Torvalds
  2011-03-25 11:13       ` Andrey Kuzmin
  2011-03-25 10:12     ` Tejun Heo
From: Steven Rostedt @ 2011-03-25  3:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Chris Mason, linux-kernel, linux-btrfs

On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
> Adaptive owner spinning used to be applied only to mutex_lock().  This
> patch applies it also to mutex_trylock().
> 
> btrfs has developed custom locking to avoid excessive context switches
> in its btree implementation.  Generally, doing away with the custom
> implementation and just using the mutex shows better behavior;
> however, there's an interesting distinction in the custom implementation
> of trylock.  It distinguishes between simple trylock and tryspin,
> where the former just tries once and then fails while the latter does
> some spinning before giving up.
> 
> Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
> just once.  I got curious whether using adaptive spinning on
> mutex_trylock() would be beneficial and it seems so, for btrfs anyway.
> 
> The following results are from "dbench 50" run on an opteron two
> socket eight core machine with 4GiB of memory and an OCZ vertex SSD.
> During the run, disk stays mostly idle and all CPUs are fully occupied
> and the difference in locking performance becomes quite visible.
> 
> SIMPLE is with the locking simplification patch[1] applied.  i.e. it
> basically just uses mutex.  SPIN is with this patch applied on top -
> mutex_trylock() uses adaptive spinning.
> 
>         USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
>  SIMPLE 61107  354977    217  8099529  845.100 MB/sec
>  SPIN   63140  364888    214  6840527  879.077 MB/sec
> 
> On various runs, the adaptive spinning trylock consistently posts
> higher throughput.  The amount of difference varies but it outperforms
> consistently.
> 
> In general, using adaptive spinning on trylock makes sense as trylock
> failure usually leads to costly unlock-relock sequence.
> 
> [1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

I'm curious about the effects that this has on those places that do:

again:
	mutex_lock(A);
	if (!mutex_trylock(B)) {
		mutex_unlock(A);
		goto again;
	}


Where the normal locking order is:
 B -> A

If another location does:

	mutex_lock(B);
	[...]
	mutex_lock(A);

But if another process already holds A and is running, this second
location may spin waiting for A, as A's owner is still running.

But now mutex_trylock(B) becomes a spinner too: since B's owner is
running (spinning on A), the trylock caller will spin as well, waiting
for B's owner to release B.  Unfortunately, B's owner is in turn waiting
for the trylock caller, A's owner, to release A.

If both A's and B's owners are real-time tasks, then boom! Deadlock.
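
Spelled out as a two-task timeline (a sketch of the scenario above):

	/*
	 * Task 1 (backoff order)         Task 2 (normal order B -> A)
	 * ----------------------         ----------------------------
	 * mutex_lock(A);    <- got A     mutex_lock(B);     <- got B
	 * mutex_trylock(B);              mutex_lock(A);
	 *   B's owner (task 2) is          A's owner (task 1) is
	 *   on-CPU, so spin...             on-CPU, so spin...
	 *
	 * Each task looks "running" to the other's owner check, so
	 * neither spin ever breaks; if nothing can preempt them (RT),
	 * both CPUs are stuck for good.
	 */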

-- Steve



* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25  3:39     ` Steven Rostedt
@ 2011-03-25  4:38       ` Linus Torvalds
  2011-03-25  6:53         ` Tejun Heo
  2011-03-25 11:13       ` Andrey Kuzmin
From: Linus Torvalds @ 2011-03-25  4:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Chris Mason, linux-kernel, linux-btrfs

On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> But now mutex_trylock(B) becomes a spinner too: since B's owner is
> running (spinning on A), the trylock caller will spin as well, waiting
> for B's owner to release B.  Unfortunately, B's owner is in turn waiting
> for the trylock caller, A's owner, to release A.
>
> If both A's and B's owners are real-time tasks, then boom! Deadlock.

Hmm. I think you're right. And it looks pretty fundamental - I don't
see any reasonable approach to avoid it.

I think the RT issue is a red herring too - afaik, you can get a
deadlock with two perfectly normal processes too. Of course, for
non-RT tasks, any other process will eventually disturb the situation
and you'd get kicked out due to need_resched(), but even that might be
avoided for a long time if there are other CPUs - leading to tons of
wasted CPU time.

                       Linus


* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25  4:38       ` Linus Torvalds
@ 2011-03-25  6:53         ` Tejun Heo
  2011-03-25 13:10           ` Steven Rostedt
From: Tejun Heo @ 2011-03-25  6:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Chris Mason, linux-kernel, linux-btrfs

Hello, Steven, Linus.

On Thu, Mar 24, 2011 at 09:38:58PM -0700, Linus Torvalds wrote:
> On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > But now mutex_trylock(B) becomes a spinner too: since B's owner is
> > running (spinning on A), the trylock caller will spin as well, waiting
> > for B's owner to release B.  Unfortunately, B's owner is in turn waiting
> > for the trylock caller, A's owner, to release A.
> >
> > If both A's and B's owners are real-time tasks, then boom! Deadlock.
> 
> Hmm. I think you're right. And it looks pretty fundamental - I don't
> see any reasonable approach to avoid it.

Hmmm... I have an idea.  Will play with it a bit and post if it works
out okay.

> I think the RT issue is a red herring too - afaik, you can get a
> deadlock with two perfectly normal processes too. Of course, for
> non-RT tasks, any other process will eventually disturb the situation
> and you'd get kicked out due to need_resched(), but even that might be
> avoided for a long time if there are other CPU's - leading to tons of
> wasted CPU time.

Yeap, need_resched() currently is the only thing which limits the
duration of spinning when the owner continues to run.
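
Conceptually the bound looks like this (a loose paraphrase of the
mutex_spin_on_owner() loop, not its exact body; owner_still_running()
is a stand-in name):

	/* Paraphrase: spin while the owner stays on a CPU. */
	while (owner_still_running(lock, owner)) {
		if (need_resched())
			return false;	/* sole exit while the owner runs */
		arch_mutex_cpu_relax();
	}
	return true;	/* owner released the lock or blocked */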

Thanks.

-- 
tejun


* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-24  9:41   ` [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
  2011-03-25  3:39     ` Steven Rostedt
@ 2011-03-25 10:12     ` Tejun Heo
  2011-03-25 10:31       ` Ingo Molnar
  2011-03-29 16:37       ` Tejun Heo
From: Tejun Heo @ 2011-03-25 10:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Chris Mason
  Cc: linux-kernel, linux-btrfs

Hello,

On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
>         USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
>  SIMPLE 61107  354977    217  8099529  845.100 MB/sec
>  SPIN   63140  364888    214  6840527  879.077 MB/sec
> 
> On various runs, the adaptive spinning trylock consistently posts
> higher throughput.  The amount of difference varies but it outperforms
> consistently.

I've been running more of these tests and am having doubts about the
consistency.  It seems that, even on a fresh filesystem, some random
initial condition has a persistent effect on the whole run.
I'll run more tests and report back.

Thanks.

-- 
tejun


* Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25  3:24   ` Steven Rostedt
@ 2011-03-25 10:29     ` Ingo Molnar
From: Ingo Molnar @ 2011-03-25 10:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, Linus Torvalds,
	Andrew Morton, Chris Mason, linux-kernel, linux-btrfs


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, Mar 24, 2011 at 09:18:16AM +0100, Ingo Molnar wrote:
> > 
> > * Tejun Heo <tj@kernel.org> wrote:
> > 
> > > NOT-Signed-off-by: Tejun Heo <tj@kernel.org>
> > 
> > s/NOT-// ?
> > 
> 
> Perhaps because it is still in RFC context?

Ok, I guess I was a bit too cryptic about it: the discussion was in support of
the changes, so I asked Tejun above to send a SOB version, and I did that in a
geeky way.

And I paid the price for the shortcut: I've had to write two other emails about
it already ;-)

Thanks,

	Ingo


* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25 10:12     ` Tejun Heo
@ 2011-03-25 10:31       ` Ingo Molnar
  2011-03-29 16:37       ` Tejun Heo
From: Ingo Molnar @ 2011-03-25 10:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Chris Mason, linux-kernel, linux-btrfs


* Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
> >         USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
> >  SIMPLE 61107  354977    217  8099529  845.100 MB/sec
> >  SPIN   63140  364888    214  6840527  879.077 MB/sec
> > 
> > On various runs, the adaptive spinning trylock consistently posts
> > higher throughput.  The amount of difference varies but it outperforms
> > consistently.
> 
> I've been running more of these tests and am having doubts about the
> consistency.  It seems that, even on a fresh filesystem, some random
> initial condition seems to have persistent effect on the whole run.
> I'll run more tests and report back.

Ok, and there's the deadlock issue as well, which Steve noticed.

I'll zap the patches from tip:core/urgent and let's do this via tip:core/locking
with a .40 timeframe and plenty of time for testing.

Thanks,

	Ingo


* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25  3:39     ` Steven Rostedt
  2011-03-25  4:38       ` Linus Torvalds
@ 2011-03-25 11:13       ` Andrey Kuzmin
  2011-03-25 13:12         ` Steven Rostedt
From: Andrey Kuzmin @ 2011-03-25 11:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, Linus Torvalds,
	Andrew Morton, Chris Mason, linux-kernel, linux-btrfs

On Fri, Mar 25, 2011 at 6:39 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
>> Adaptive owner spinning used to be applied only to mutex_lock().  This
>> patch applies it also to mutex_trylock().
>>
>> btrfs has developed custom locking to avoid excessive context switches
>> in its btree implementation.  Generally, doing away with the custom
>> implementation and just using the mutex shows better behavior;
>> however, there's an interesting distinction in the custom implementation
>> of trylock.  It distinguishes between simple trylock and tryspin,
>> where the former just tries once and then fails while the latter does
>> some spinning before giving up.
>>
>> Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
>> just once.  I got curious whether using adaptive spinning on
>> mutex_trylock() would be beneficial and it seems so, for btrfs anyway.
>>
>> The following results are from "dbench 50" run on an opteron two
>> socket eight core machine with 4GiB of memory and an OCZ vertex SSD.
>> During the run, disk stays mostly idle and all CPUs are fully occupied
>> and the difference in locking performance becomes quite visible.
>>
>> SIMPLE is with the locking simplification patch[1] applied.  i.e. it
>> basically just uses mutex.  SPIN is with this patch applied on top -
>> mutex_trylock() uses adaptive spinning.
>>
>>         USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
>>  SIMPLE 61107  354977    217  8099529  845.100 MB/sec
>>  SPIN   63140  364888    214  6840527  879.077 MB/sec
>>
>> On various runs, the adaptive spinning trylock consistently posts
>> higher throughput.  The amount of difference varies but it outperforms
>> consistently.
>>
>> In general, using adaptive spinning on trylock makes sense as trylock
>> failure usually leads to costly unlock-relock sequence.
>>
>> [1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658
>>
>> Signed-off-by: Tejun Heo <tj@kernel.org>
>
> I'm curious about the effects that this has on those places that do:
>
> again:
>        mutex_lock(A);
>        if (!mutex_trylock(B)) {
>                mutex_unlock(A);
>                goto again;
>        }
>
>
> Where the normal locking order is:
>  B -> A
>
> If another location does:
>
>        mutex_lock(B);
>        [...]
>        mutex_lock(A);
>
> But another process has A already, and is running, it may spin waiting
> for A as A's owner is still running.
>
> But now, mutex_trylock(B) becomes a spinner too, and since the B's owner
> is running (spinning on A) it will spin as well waiting for A's owner to
> release it. Unfortunately, A's owner is also spinning waiting for B to
> release it.
>
> If both A and B's owners are real time tasks, then boom! deadlock.

Turning try_lock into an indefinitely spinning one breaks its semantics,
so deadlock is to be expected. But what goes wrong in this scenario if
try_lock spins a bit before giving up?

Regards,
Andrey

>
> -- Steve


* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25  6:53         ` Tejun Heo
@ 2011-03-25 13:10           ` Steven Rostedt
  2011-03-25 13:29             ` Steven Rostedt
From: Steven Rostedt @ 2011-03-25 13:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Chris Mason, linux-kernel, linux-btrfs

On Fri, 2011-03-25 at 07:53 +0100, Tejun Heo wrote:
> Hello, Steven, Linus.
> 
> On Thu, Mar 24, 2011 at 09:38:58PM -0700, Linus Torvalds wrote:
> > On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > But now mutex_trylock(B) becomes a spinner too: since B's owner is
> > > running (spinning on A), the trylock caller will spin as well, waiting
> > > for B's owner to release B.  Unfortunately, B's owner is in turn waiting
> > > for the trylock caller, A's owner, to release A.
> > >
> > > If both A's and B's owners are real-time tasks, then boom! Deadlock.
> > 
> > Hmm. I think you're right. And it looks pretty fundamental - I don't
> > see any reasonable approach to avoid it.
> 
> Hmmm... I have an idea.  Will play with it a bit and post if it works
> out okay.

One solution is to only do this on explicit request. Perhaps
introduce a mutex_trylock_spin()? Then, when the developer knows that
this scenario does not exist, they can convert mutex_trylock() calls
into this spinning version, along the lines of the sketch below.
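
(A hedged sketch only; mutex_trylock_spin() is a proposed name, not an
existing API, and it reuses mutex_spin() from patch 1/2:)

	int __sched mutex_trylock_spin(struct mutex *lock)
	{
		preempt_disable();
		if (mutex_spin(lock)) {	/* adaptive spin, may acquire */
			mutex_set_owner(lock);
			mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
			preempt_enable();
			return 1;
		}
		preempt_enable();

		return mutex_trylock(lock);	/* plain one-shot attempt */
	}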

> 
> > I think the RT issue is a red herring too - afaik, you can get a
> > deadlock with two perfectly normal processes too. Of course, for
> > non-RT tasks, any other process will eventually disturb the situation
> > and you'd get kicked out due to need_resched(), but even that might be
> > avoided for a long time if there are other CPU's - leading to tons of
> > wasted CPU time.
> 
> Yeap, need_resched() currently is the only thing which limits the
> duration of spinning when the owner continues to run.

Yeah, I was about to complain about the long latencies that this could
cause; then I realized that RT tasks would in fact deadlock the system
here, which I thought was the bigger problem, so I focused on that issue.

-- Steve




* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25 11:13       ` Andrey Kuzmin
@ 2011-03-25 13:12         ` Steven Rostedt
  2011-03-25 13:50           ` Andrey Kuzmin
From: Steven Rostedt @ 2011-03-25 13:12 UTC (permalink / raw)
  To: Andrey Kuzmin
  Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, Linus Torvalds,
	Andrew Morton, Chris Mason, linux-kernel, linux-btrfs

On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
> Turning try_lock into indefinitely spinning one breaks its semantics,
> so deadlock is to be expected. But what's wrong in this scenario if
> try_lock spins a bit before giving up? 

Because that will always cause this scenario to spin that "little
longer", introducing latencies that did not exist before. Either the
solution does not break this scenario, or it should not go in.

-- Steve




* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25 13:10           ` Steven Rostedt
@ 2011-03-25 13:29             ` Steven Rostedt
From: Steven Rostedt @ 2011-03-25 13:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Chris Mason, linux-kernel, linux-btrfs

On Fri, 2011-03-25 at 09:10 -0400, Steven Rostedt wrote:

> One solution is to have this be only done on explicit trylocks. Perhaps
> introduce a mutex_trylock_spin()? Then when the developer knows that
> this scenario does not exist, they can convert mutex_trylocks() into
> this spinning version.
> 

I'm not sure this is even worth it; looking at the btrfs
extent-tree.c code, this pattern is the main reason it uses mutex_trylock().

I guess what you see in your benchmarks is that trylock contention
happens mostly in the non-deadlock scenario. But I bet you do get
latencies when it does happen, though the benefit seems to outweigh
them in the results.

I wonder what happens if you run dbench as an RT task.

-- Steve




* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25 13:12         ` Steven Rostedt
@ 2011-03-25 13:50           ` Andrey Kuzmin
  2011-03-25 14:05             ` Steven Rostedt
From: Andrey Kuzmin @ 2011-03-25 13:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, Linus Torvalds,
	Andrew Morton, Chris Mason, linux-kernel, linux-btrfs

On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
>> Turning try_lock into indefinitely spinning one breaks its semantics,
>> so deadlock is to be expected. But what's wrong in this scenario if
>> try_lock spins a bit before giving up?
>
> Because that will cause this scenario to spin that "little longer"
> always, and introduce latencies that did not exist before. Either the
> solution does not break this scenario, or it should not go in.

Broken semantics and extra latency are two separate issues. If the
former is fixed, the latter is easily handled by introducing a new
mutex_trylock_spin() call that lets one either stick to the existing
behavior (try/fail) or choose a new one where the latency penalty is
justified by the locking patterns.

Regards,
Andrey

>
> -- Steve
>
>
>


* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25 13:50           ` Andrey Kuzmin
@ 2011-03-25 14:05             ` Steven Rostedt
  2011-03-25 19:51               ` Ingo Molnar
From: Steven Rostedt @ 2011-03-25 14:05 UTC (permalink / raw)
  To: Andrey Kuzmin
  Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, Linus Torvalds,
	Andrew Morton, Chris Mason, linux-kernel, linux-btrfs

On Fri, 2011-03-25 at 16:50 +0300, Andrey Kuzmin wrote:
> On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
> >> Turning try_lock into indefinitely spinning one breaks its semantics,
> >> so deadlock is to be expected. But what's wrong in this scenario if
> >> try_lock spins a bit before giving up?
> >
> > Because that will cause this scenario to spin that "little longer"
> > always, and introduce latencies that did not exist before. Either the
> > solution does not break this scenario, or it should not go in.
> 
> Broken semantics and extra latency are two separate issues. If the
> former is fixed, the latter is easily handled by introducing new
> mutex_trylock_spin call that lets one either stick to existing
> behavior (try/fail) or choose a new one where latency penalty is
> justified by locking patterns.
> 

For those wanting a more RT deterministic OS, I will argue against
latency penalties.

-- Steve




* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25 14:05             ` Steven Rostedt
@ 2011-03-25 19:51               ` Ingo Molnar
From: Ingo Molnar @ 2011-03-25 19:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrey Kuzmin, Tejun Heo, Peter Zijlstra, Ingo Molnar,
	Linus Torvalds, Andrew Morton, Chris Mason, linux-kernel,
	linux-btrfs


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 2011-03-25 at 16:50 +0300, Andrey Kuzmin wrote:
> > On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > > On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
> > >> Turning try_lock into indefinitely spinning one breaks its semantics,
> > >> so deadlock is to be expected. But what's wrong in this scenario if
> > >> try_lock spins a bit before giving up?
> > >
> > > Because that will cause this scenario to spin that "little longer"
> > > always, and introduce latencies that did not exist before. Either the
> > > solution does not break this scenario, or it should not go in.
> > 
> > Broken semantics and extra latency are two separate issues. If the
> > former is fixed, the latter is easily handled by introducing new
> > mutex_trylock_spin call that lets one either stick to existing
> > behavior (try/fail) or choose a new one where latency penalty is
> > justified by locking patterns.
> > 
> 
> For those wanting a more RT deterministic OS, I will argue against
> latency penalties.

Later mails from Tejun suggest that the benchmark results are varied, and that 
it's not a clear win after all.

It's possible that if useless spinning is introduced, that might explain 
such workload variations. I.e. it's not really 'latencies' but 'unnecessary 
overhead spent spinning' - and also 'extra non-deterministic noise' - none of 
which helps consistent performance.

Thanks,

	Ingo


* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-25 10:12     ` Tejun Heo
  2011-03-25 10:31       ` Ingo Molnar
@ 2011-03-29 16:37       ` Tejun Heo
  2011-03-29 17:09         ` Tejun Heo
  2011-03-30 11:46         ` Chris Mason
From: Tejun Heo @ 2011-03-29 16:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Chris Mason
  Cc: linux-kernel, linux-btrfs

Hello, guys.

I've been running dbench 50 for a few days now and the result is,
well, I don't know what to call it.

The problem was that the original patch didn't do anything because the x86
fastpath code didn't call into the generic slowpath at all.

  static inline int __mutex_fastpath_trylock(atomic_t *count,
					     int (*fail_fn)(atomic_t *))
  {
	  if (likely(atomic_cmpxchg(count, 1, 0) == 1))
		  return 1;
	  else
		  return 0;
  }
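
For the spin-fixed configurations below, the fastpath was changed
along these lines (a sketch: fall through to the generic slowpath via
fail_fn() instead of hardcoding failure):

	static inline int __mutex_fastpath_trylock(atomic_t *count,
						   int (*fail_fn)(atomic_t *))
	{
		if (likely(atomic_cmpxchg(count, 1, 0) == 1))
			return 1;
		return fail_fn(count);	/* let the slowpath spin */
	}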

So I thought that I had probably been doing unconscious data selection
while running the tests before sending out the patches.  Maybe I
was seeing what I wanted to see, so I ran tests at a larger scale, more
methodically.

I started with ten consecutive runs, then doubled that with an
intervening reboot, and basically ended up doing that twice for each of
four configurations (I didn't do two runs of simple and refactor but
just averaged the two).

The hardware is mostly the same except that I switched to a hard drive
instead of the SSD, as hard drives tend to be slower but more consistent
in their performance numbers.  For each run, the filesystem was recreated,
and the system was rebooted after every ten runs.  The numbers are the
reported throughput in MiB/s at the end of each run.

  https://spreadsheets.google.com/ccc?key=0AsbaQh2SFt66dDdxOGZWVVlIbEdIOWRQLURVVUNYSXc&hl=en

Here are the descriptions of the eight columns.

  simple	only with patch to make btrfs use mutex
  refactor	mutex_spin() factored out
  spin		mutex_spin() applied to the unused trylock slowpath
  spin-1	ditto
  spin-fixed	x86 trylock fastpath updated to use generic slowpath
  spin-fixed-1	ditto
  code-layout	refactor + dummy function added to mutex.c
  code-layout-1	ditto

After running the simple, refactor and spin ones, I was convinced that
there definitely was something causing the difference.  The
averages were more than 1.5 sigma apart, but I couldn't explain
what could have caused such a difference.

The code-layout runs were my desperate attempts to find an explanation
for what's going on.  Adding mutex_spin() to the unused generic trylock
path makes gcc arrange the functions differently.  Without it, the
trylock functions end up in between the lock and unlock functions; with
it, they are located at the end.  I commented out the unused trylock
slowpath function and added a dummy function at the end to make gcc
generate a similar assembly layout.
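
For illustration, a sketch of that layout trick (the function name is
hypothetical; it is never called and exists only to pad the layout):

  /*
   * Never-called dummy appended to mutex.c so gcc emits the lock and
   * unlock paths adjacently, as in the "spin" build.
   */
  static noinline void __used mutex_layout_pad(void)
  {
  }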

At this point, the only conclusions I can draw are,

* Using adaptive spinning on mutex_trylock() doesn't seem to buy
  anything according to btrfs dbench 50 runs.

and much more importantly,

* btrfs dbench 50 runs are probably not good for measuring subtle
  mutex performance differences.  Maybe it's too macro a benchmark and
  there are larger-scale tendencies which skew the results unless the
  number of runs is vastly increased (but 40 runs already take over
  eight hours).

If anyone can provide an explanation of what's going on, I'll be super
happy.  Otherwise, for now, I'll just leave it alone.  :-(

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-29 16:37       ` Tejun Heo
@ 2011-03-29 17:09         ` Tejun Heo
  2011-03-29 17:37           ` Peter Zijlstra
  2011-03-30 11:46         ` Chris Mason
  1 sibling, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2011-03-29 17:09 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Chris Mason
  Cc: linux-kernel, linux-btrfs

Here's the combined patch I was planning on testing but didn't get to
(yet).  It implements two things - a hard limit on the spin duration
and an early break if the owner is also spinning on a mutex.

Thanks.

Index: work1/include/linux/sched.h
===================================================================
--- work1.orig/include/linux/sched.h
+++ work1/include/linux/sched.h
@@ -359,7 +359,7 @@ extern signed long schedule_timeout_inte
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
 
 struct nsproxy;
 struct user_namespace;
Index: work1/kernel/sched.c
===================================================================
--- work1.orig/kernel/sched.c
+++ work1/kernel/sched.c
@@ -536,6 +536,10 @@ struct rq {
 	struct hrtimer hrtick_timer;
 #endif
 
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+	bool spinning_on_mutex;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	/* latency stats */
 	struct sched_info rq_sched_info;
@@ -4021,16 +4025,44 @@ EXPORT_SYMBOL(schedule);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 /*
- * Look out! "owner" is an entirely speculative pointer
- * access and not reliable.
+ * Maximum mutex owner spin duration in nsecs.  Don't spin more than
+ * DEF_TIMESLICE.
+ */
+#define MAX_MUTEX_SPIN_NS	(DEF_TIMESLICE * 1000000000LLU / HZ)
+
+/**
+ * mutex_spin_on_owner - optimistic adaptive spinning on locked mutex
+ * @lock: the mutex to spin on
+ * @owner: the current owner (speculative pointer)
+ *
+ * The caller is trying to acquire @lock held by @owner.  If @owner is
+ * currently running, it might get unlocked soon and spinning on it can
+ * save the overhead of sleeping and waking up.
+ *
+ * Note that @owner is completely speculative and may be completely
+ * invalid.  It should be accessed very carefully.
+ *
+ * Forward progress is guaranteed regardless of locking ordering by never
+ * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
+ * mutex_trylock(), which doesn't have to follow the usual locking
+ * ordering, also uses this function.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if the lock was released and the caller should retry locking.
+ * %false if the caller better go sleeping.
  */
-int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
+bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
 {
+	unsigned long start;
 	unsigned int cpu;
 	struct rq *rq;
+	bool ret = true;
 
 	if (!sched_feat(OWNER_SPIN))
-		return 0;
+		return false;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 	/*
@@ -4039,7 +4071,7 @@ int mutex_spin_on_owner(struct mutex *lo
 	 * the mutex owner just released it and exited.
 	 */
 	if (probe_kernel_address(&owner->cpu, cpu))
-		return 0;
+		return false;
 #else
 	cpu = owner->cpu;
 #endif
@@ -4049,15 +4081,17 @@ int mutex_spin_on_owner(struct mutex *lo
 	 * the cpu field may no longer be valid.
 	 */
 	if (cpu >= nr_cpumask_bits)
-		return 0;
+		return false;
 
 	/*
 	 * We need to validate that we can do a
 	 * get_cpu() and that we have the percpu area.
 	 */
 	if (!cpu_online(cpu))
-		return 0;
+		return false;
 
+	this_rq()->spinning_on_mutex = true;
+	start = local_clock();
 	rq = cpu_rq(cpu);
 
 	for (;;) {
@@ -4070,21 +4104,30 @@ int mutex_spin_on_owner(struct mutex *lo
 			 * we likely have heavy contention. Return 0 to quit
 			 * optimistic spinning and not contend further:
 			 */
-			if (lock->owner)
-				return 0;
+			ret = !lock->owner;
 			break;
 		}
 
 		/*
-		 * Is that owner really running on that cpu?
+		 * Quit spinning if any of the following is true.
+		 *
+		 * - The owner isn't running on that cpu.
+		 * - The owner also is spinning on a mutex.
+		 * - Someone else wants to use this cpu.
+		 * - We've been spinning for too long.
 		 */
-		if (task_thread_info(rq->curr) != owner || need_resched())
-			return 0;
+		if (task_thread_info(rq->curr) != owner ||
+		    rq->spinning_on_mutex || need_resched() ||
+		    local_clock() > start + MAX_MUTEX_SPIN_NS) {
+			ret = false;
+			break;
+		}
 
 		arch_mutex_cpu_relax();
 	}
 
-	return 1;
+	this_rq()->spinning_on_mutex = false;
+	return ret;
 }
 #endif
 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-29 17:09         ` Tejun Heo
@ 2011-03-29 17:37           ` Peter Zijlstra
  2011-03-30  8:17             ` Tejun Heo
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2011-03-29 17:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Chris Mason,
	linux-kernel, linux-btrfs

On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
> Here's the combined patch I was planning on testing but didn't get to
> (yet).  It implements two things - a hard limit on the spin duration
> and an early break if the owner is also spinning on a mutex.

This is going to give massive conflicts with

 https://lkml.org/lkml/2011/3/2/286
 https://lkml.org/lkml/2011/3/2/282

which I was planning to stuff into .40


> @@ -4021,16 +4025,44 @@ EXPORT_SYMBOL(schedule);
>  
>  #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
>  /*
> + * Maximum mutex owner spin duration in nsecs.  Don't spin more than
> + * DEF_TIMESLICE.
> + */
> +#define MAX_MUTEX_SPIN_NS	(DEF_TIMESLICE * 1000000000LLU / HZ)

DEF_TIMESLICE is SCHED_RR only, so its use here is dubious at best.
Also, I bet we have something like NSEC_PER_SEC to avoid counting '0's.
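
Something like the following, perhaps - a sketch of the substitution,
assuming NSEC_PER_SEC from <linux/time.h>; whether DEF_TIMESLICE is the
right basis at all is a separate question:

  /* same duration, without the hand-counted zeroes */
  #define MAX_MUTEX_SPIN_NS	((u64)DEF_TIMESLICE * NSEC_PER_SEC / HZ)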

> +
> +/**
> + * mutex_spin_on_owner - optimistic adaptive spinning on locked mutex
> + * @lock: the mutex to spin on
> + * @owner: the current owner (speculative pointer)
> + *
> + * The caller is trying to acquire @lock held by @owner.  If @owner is
> + * currently running, it might get unlocked soon and spinning on it can
> + * save the overhead of sleeping and waking up.
> + *
> + * Note that @owner is completely speculative and may be completely
> + * invalid.  It should be accessed very carefully.
> + *
> + * Forward progress is guaranteed regardless of locking ordering by never
> + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
> + * mutex_trylock(), which doesn't have to follow the usual locking
> + * ordering, also uses this function.

While that puts a limit on things, it'll still waste time.  I'd much
rather pass a trylock argument to mutex_spin_on_owner() and then bail
when the owner is also spinning.
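
Roughly like so - a sketch of the suggested interface change, reusing
the spinning_on_mutex flag from the patch above; the existing owner and
need_resched() checks are elided:

  bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner,
			   bool trylock)
  {
	  ...
	  for (;;) {
		  /*
		   * Only trylock callers give up when the owner (whose
		   * runqueue is rq) is itself spinning on a mutex;
		   * regular lock callers keep the current behavior.
		   */
		  if (trylock && rq->spinning_on_mutex) {
			  ret = false;
			  break;
		  }
		  ...
		  arch_mutex_cpu_relax();
	  }
	  this_rq()->spinning_on_mutex = false;
	  return ret;
  }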

> + * CONTEXT:
> + * Preemption disabled.
> + *
> + * RETURNS:
> + * %true if the lock was released and the caller should retry locking.
> + * %false if the caller better go sleeping.
>   */
> -int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
> +bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
>  {

> @@ -4070,21 +4104,30 @@ int mutex_spin_on_owner(struct mutex *lo
>  			 * we likely have heavy contention. Return 0 to quit
>  			 * optimistic spinning and not contend further:
>  			 */
> +			ret = !lock->owner;
>  			break;
>  		}
>  
>  		/*
> -		 * Is that owner really running on that cpu?
> +		 * Quit spinning if any of the following is true.
> +		 *
> +		 * - The owner isn't running on that cpu.
> +		 * - The owner also is spinning on a mutex.
> +		 * - Someone else wants to use this cpu.
> +		 * - We've been spinning for too long.
>  		 */
> +		if (task_thread_info(rq->curr) != owner ||
> +		    rq->spinning_on_mutex || need_resched() ||
> +		    local_clock() > start + MAX_MUTEX_SPIN_NS) {

While we did our best with making local_clock() cheap, I'm still fairly
uncomfortable with putting it in such a tight loop.

> +			ret = false;
> +			break;
> +		}
>  
>  		arch_mutex_cpu_relax();
>  	}
>  
> +	this_rq()->spinning_on_mutex = false;
> +	return ret;
>  }
>  #endif
>  




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-29 17:37           ` Peter Zijlstra
@ 2011-03-30  8:17             ` Tejun Heo
  2011-03-30 11:29               ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2011-03-30  8:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Chris Mason,
	linux-kernel, linux-btrfs

Hey, Peter.

On Tue, Mar 29, 2011 at 07:37:33PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
> > Here's the combined patch I was planning on testing but didn't get to
> > (yet).  It implements two things - a hard limit on the spin duration
> > and an early break if the owner is also spinning on a mutex.
> 
> This is going to give massive conflicts with
> 
>  https://lkml.org/lkml/2011/3/2/286
>  https://lkml.org/lkml/2011/3/2/282
> 
> which I was planning to stuff into .40

I see.  Adapting shouldn't be hard.  The patch is in proof-of-concept
stage anyway.

> > + * Forward progress is guaranteed regardless of locking ordering by never
> > + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
> > + * mutex_trylock(), which doesn't have to follow the usual locking
> > + * ordering, also uses this function.
> 
> While that puts a limit on things, it'll still waste time.  I'd much
> rather pass a trylock argument to mutex_spin_on_owner() and then bail
> when the owner is also spinning.

Do we guarantee or enforce that the lock ownership can't be
transferred to a different task?  If we do, the recursive spinning
detection is enough to guarantee forward progress.

> > +		if (task_thread_info(rq->curr) != owner ||
> > +		    rq->spinning_on_mutex || need_resched() ||
> > +		    local_clock() > start + MAX_MUTEX_SPIN_NS) {
> 
> While we did our best with making local_clock() cheap, I'm still fairly
> uncomfortable with putting it in such a tight loop.

That's one thing I didn't really understand.  The spinning code seems
to try to be light on CPU cycle usage, but we're wasting CPU cycles
there anyway.  If the spinning can become smarter by using some CPU
cycles, isn't that a gain?  Why is the spinning code optimizing for
fewer CPU cycles?

Also, it would be great if you could share what you're using for locking
performance measurements.  My attempts with dbench didn't end too
well. :-(

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-30  8:17             ` Tejun Heo
@ 2011-03-30 11:29               ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2011-03-30 11:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Chris Mason,
	linux-kernel, linux-btrfs

On Wed, 2011-03-30 at 10:17 +0200, Tejun Heo wrote:
> Hey, Peter.
> 
> On Tue, Mar 29, 2011 at 07:37:33PM +0200, Peter Zijlstra wrote:
> > On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
> > > Here's the combined patch I was planning on testing but didn't get to
> > > (yet).  It implements two things - a hard limit on the spin duration
> > > and an early break if the owner is also spinning on a mutex.
> > 
> > This is going to give massive conflicts with
> > 
> >  https://lkml.org/lkml/2011/3/2/286
> >  https://lkml.org/lkml/2011/3/2/282
> > 
> > which I was planning to stuff into .40
> 
> I see.  Adapting shouldn't be hard.  The patch is in proof-of-concept
> stage anyway.
> 
> > > + * Forward progress is guaranteed regardless of locking ordering by never
> > > + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
> > > + * mutex_trylock(), which doesn't have to follow the usual locking
> > > + * ordering, also uses this function.
> > 
> > While that puts a limit on things, it'll still waste time.  I'd much
> > rather pass a trylock argument to mutex_spin_on_owner() and then bail
> > when the owner is also spinning.
> 
> Do we guarantee or enforce that the lock ownership can't be
> transferred to a different task?  If we do, the recursive spinning
> detection is enough to guarantee forward progress.

The only way to switch owners is for the current owner to release the
lock and a new owner to acquire it.  Also, we already bail out of the
spin loop when the owner changes.

> > > +		if (task_thread_info(rq->curr) != owner ||
> > > +		    rq->spinning_on_mutex || need_resched() ||
> > > +		    local_clock() > start + MAX_MUTEX_SPIN_NS) {
> > 
> > While we did our best with making local_clock() cheap, I'm still fairly
> > uncomfortable with putting it in such a tight loop.
> 
> That's one thing I didn't really understand.  The spinning code seems
> to try to be light on CPU cycle usage, but we're wasting CPU cycles
> there anyway.  If the spinning can become smarter by using some CPU
> cycles, isn't that a gain?  Why is the spinning code optimizing for
> fewer CPU cycles?

Loop exit latency, mostly: the lighter the loop, the faster you get
through it and the less time you waste once the condition becomes true.
But yeah, what you say is true too; it's a balancing game, as always.

> Also, it would be great if you could share what you're using for locking
> performance measurements.  My attempts with dbench didn't end too
> well. :-(

The thing I used for the initial implementation is mentioned in the
changelog (0d66bf6d3):

    Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
    gave a 345% boost for VFS scalability on my testbox:
    
     # ./test-mutex-shm V 16 10 | grep "^avg ops"
     avg ops/sec:               296604
    
     # ./test-mutex-shm V 16 10 | grep "^avg ops"
     avg ops/sec:               85870

I've no idea how heavy that is on trylock, though; you'd have to look
at that.

Chris did some improvements using dbench in ac6e60ee4 but clearly that
isn't working for you.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-29 16:37       ` Tejun Heo
  2011-03-29 17:09         ` Tejun Heo
@ 2011-03-30 11:46         ` Chris Mason
  2011-03-30 11:52           ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Chris Mason @ 2011-03-30 11:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-btrfs

Excerpts from Tejun Heo's message of 2011-03-29 12:37:02 -0400:
> [...]
> 
> After running the simple, refactor and spin ones, I was convinced that
> there definitely was something causing the difference.  The averages
> were more than 1.5 sigma apart, but I couldn't explain what could have
> caused such a difference.

I have another workload that is interesting for this: basically N (100
or so) procs all doing stat()s on a bunch of files in the same
directory.  This hammers very hard on the btree lookups.
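
As a rough illustration, a minimal userspace sketch of that kind of
workload (the directory name and the process/file counts are made up):

  /* stat-hammer: many procs stat()ing one directory's files in a loop */
  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
	  enum { NPROC = 100, NFILES = 1000 };
	  int i, f;

	  for (i = 0; i < NPROC; i++) {
		  if (fork() == 0) {
			  struct stat st;
			  char path[64];

			  for (;;) {	/* run until killed */
				  for (f = 0; f < NFILES; f++) {
					  snprintf(path, sizeof(path),
						   "testdir/file-%d", f);
					  stat(path, &st);
				  }
			  }
		  }
	  }
	  while (wait(NULL) > 0)	/* parent just waits */
		  ;
	  return 0;
  }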

During the first set of mutex spinning patches, doing any kind of
breakout on the owner spin (when the owner hasn't changed) made it much
slower.

You might want to try again with the 2.6.39-rc1 btrfs (or pull my
for-linus-unmerged tree into .38) because this gets rid of an unneeded
spinlock for the root node.

All your work convinced me to dig out my seqlock patches for read-only
tree block operations.  I think we can take your current btrfs patches
and combine them with the seqlock stuff for a huge benefit.

In this case, the only thing we're really missing is a way to mutex_lock
without the cond_resched()

The biggest trylock user in btrfs is only there so we can keep the
spinning lock instead of the blocking lock.  Since you're getting rid
of the whole spin-vs-block setup, we can switch directly to a lock.
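
Concretely, something like this - a sketch assuming the
pre-simplification btrfs locking helpers:

  /* before: try to stay on the cheap spinning lock, else block */
  if (!btrfs_try_spin_lock(eb))
	  btrfs_tree_lock(eb);

  /* after: the mutex spins adaptively on its own, so just lock */
  btrfs_tree_lock(eb);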

-chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-30 11:46         ` Chris Mason
@ 2011-03-30 11:52           ` Peter Zijlstra
  2011-03-30 11:59             ` Chris Mason
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2011-03-30 11:52 UTC (permalink / raw)
  To: Chris Mason
  Cc: Tejun Heo, Ingo Molnar, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-btrfs

On Wed, 2011-03-30 at 07:46 -0400, Chris Mason wrote:
> 
> In this case, the only thing we're really missing is a way to mutex_lock
> without the cond_resched() 

So you're trying to explicitly avoid a voluntary preemption point? Seems
like a bad idea, normally people add those :-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
  2011-03-30 11:52           ` Peter Zijlstra
@ 2011-03-30 11:59             ` Chris Mason
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Mason @ 2011-03-30 11:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-btrfs

Excerpts from Peter Zijlstra's message of 2011-03-30 07:52:04 -0400:
> On Wed, 2011-03-30 at 07:46 -0400, Chris Mason wrote:
> > 
> > In this case, the only thing we're really missing is a way to mutex_lock
> > without the cond_resched() 
> 
> So you're trying to explicitly avoid a voluntary preemption point? Seems
> like a bad idea, normally people add those :-)

Yeah, but the btrfs fast path (when we're able to spin today) looks like
this:

spin_lock(parent)
binary search, select slot
pull block for that slot out of cache
spin_lock(child)
spin_unlock(parent)

If we switch it all to mutexes:

mutex_lock(parent)
binary search, select slot
pull block for that slot out of cache
mutex_lock(child)
mutex_unlock(parent)

Most of the spinning vs blocking benefits in btrfs came from doing
special things (like dropping the parent lock) when we know we're going
to block with the parent lock held.  Surprise blocking in mutex_lock
isn't great.

It would probably be enough to just move the cond_resched() to after
the spinning portion of the mutex_lock().
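
Something along these lines, maybe - a sketch with hypothetical helper
names, since the real fastpath/slowpath split is per-arch:

  void __sched mutex_lock_spin_first(struct mutex *lock)
  {
	  /*
	   * Fastpath and adaptive spin run with no preemption point.
	   * mutex_trylock_or_spin() is hypothetical and would set the
	   * owner itself on success.
	   */
	  if (mutex_trylock_or_spin(lock))
		  return;

	  /* reschedule only when we are really going to block */
	  might_sleep();
	  __mutex_lock_slowpath(&lock->count);
	  mutex_set_owner(lock);
  }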

-chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2011-03-30 12:00 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-23 15:37 [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
2011-03-23 15:40 ` Tejun Heo
2011-03-23 15:48 ` Linus Torvalds
2011-03-23 15:52   ` Tejun Heo
2011-03-23 19:46   ` Andrey Kuzmin
2011-03-24  8:18 ` Ingo Molnar
2011-03-25  3:24   ` Steven Rostedt
2011-03-25 10:29     ` Ingo Molnar
2011-03-24  9:41 ` [PATCH 1/2] Subject: mutex: Separate out mutex_spin() Tejun Heo
2011-03-24  9:41   ` [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock() Tejun Heo
2011-03-25  3:39     ` Steven Rostedt
2011-03-25  4:38       ` Linus Torvalds
2011-03-25  6:53         ` Tejun Heo
2011-03-25 13:10           ` Steven Rostedt
2011-03-25 13:29             ` Steven Rostedt
2011-03-25 11:13       ` Andrey Kuzmin
2011-03-25 13:12         ` Steven Rostedt
2011-03-25 13:50           ` Andrey Kuzmin
2011-03-25 14:05             ` Steven Rostedt
2011-03-25 19:51               ` Ingo Molnar
2011-03-25 10:12     ` Tejun Heo
2011-03-25 10:31       ` Ingo Molnar
2011-03-29 16:37       ` Tejun Heo
2011-03-29 17:09         ` Tejun Heo
2011-03-29 17:37           ` Peter Zijlstra
2011-03-30  8:17             ` Tejun Heo
2011-03-30 11:29               ` Peter Zijlstra
2011-03-30 11:46         ` Chris Mason
2011-03-30 11:52           ` Peter Zijlstra
2011-03-30 11:59             ` Chris Mason
2011-03-24  9:42   ` [PATCH 1/2] Subject: mutex: Separate out mutex_spin() Tejun Heo
2011-03-24 16:18 ` [tip:core/urgent] " tip-bot for Tejun Heo
2011-03-24 16:19 ` [tip:core/urgent] mutex: Do adaptive spinning in mutex_trylock() tip-bot for Tejun Heo
