* linux-next: manual merge of the akpm-current tree with the tip tree @ 2016-06-15 5:23 Stephen Rothwell 2016-06-18 19:39 ` Manfred Spraul 2016-06-18 20:02 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul 0 siblings, 2 replies; 17+ messages in thread From: Stephen Rothwell @ 2016-06-15 5:23 UTC (permalink / raw) To: Andrew Morton, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra Cc: linux-next, linux-kernel, Manfred Spraul Hi Andrew, Today's linux-next merge of the akpm-current tree got a conflict in: ipc/sem.c between commit: 33ac279677dc ("locking/barriers: Introduce smp_acquire__after_ctrl_dep()") from the tip tree and commit: a1c58ea067cb ("ipc/sem.c: Fix complex_count vs. simple op race") from the akpm-current tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc ipc/sem.c index ae72b3cddc8d,11d9e605a619..000000000000 --- a/ipc/sem.c +++ b/ipc/sem.c @@@ -260,13 -267,20 +267,10 @@@ static void sem_rcu_free(struct rcu_hea } /* - * Wait until all currently ongoing simple ops have completed. - * spin_unlock_wait() and !spin_is_locked() are not memory barriers, they - * are only control barriers. - * The code must pair with spin_unlock(&sem->lock) or - * spin_unlock(&sem_perm.lock), thus just the control barrier is insufficient. - * - * smp_rmb() is sufficient, as writes cannot pass the control barrier. - */ -#define ipc_smp_acquire__after_spin_is_unlocked() smp_rmb() - -/* + * Enter the mode suitable for non-simple operations: * Caller must own sem_perm.lock. - * New simple ops cannot start, because simple ops first check - * that sem_perm.lock is free. 
- * that a) sem_perm.lock is free and b) complex_count is 0. */ - static void sem_wait_array(struct sem_array *sma) + static void complexmode_enter(struct sem_array *sma) { int i; struct sem *sem; ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: linux-next: manual merge of the akpm-current tree with the tip tree 2016-06-15 5:23 linux-next: manual merge of the akpm-current tree with the tip tree Stephen Rothwell @ 2016-06-18 19:39 ` Manfred Spraul 2016-06-18 20:02 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul 1 sibling, 0 replies; 17+ messages in thread From: Manfred Spraul @ 2016-06-18 19:39 UTC (permalink / raw) To: Stephen Rothwell, Andrew Morton, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra Cc: linux-next, linux-kernel Hi, On 06/15/2016 07:23 AM, Stephen Rothwell wrote: > Hi Andrew, > > Today's linux-next merge of the akpm-current tree got a conflict in: > > ipc/sem.c > > between commit: > > 33ac279677dc ("locking/barriers: Introduce smp_acquire__after_ctrl_dep()") > > from the tip tree and commit: > > a1c58ea067cb ("ipc/sem.c: Fix complex_count vs. simple op race") > > from the akpm-current tree. Just in case, I have created a rediff of my patch against -tip. And the patch with hysteresis would be ready as well. I will send both patches. More testers would be welcome, I can only test it on my laptop. -- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-15 5:23 linux-next: manual merge of the akpm-current tree with the tip tree Stephen Rothwell 2016-06-18 19:39 ` Manfred Spraul @ 2016-06-18 20:02 ` Manfred Spraul 2016-06-18 20:02 ` [PATCH 2/2] ipc/sem: sem_lock with hysteresis Manfred Spraul ` (2 more replies) 1 sibling, 3 replies; 17+ messages in thread From: Manfred Spraul @ 2016-06-18 20:02 UTC (permalink / raw) To: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton Cc: LKML, linux-next, 1vier1, Davidlohr Bueso, felixh, Manfred Spraul, stable Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: sem_lock has a fast path that allows parallel simple operations. There are two reasons why a simple operation cannot run in parallel: - a non-simple operations is ongoing (sma->sem_perm.lock held) - a complex operation is sleeping (sma->complex_count != 0) As both facts are stored independently, a thread can bypass the current checks by sleeping in the right positions. See below for more details (or kernel bugzilla 105651). The patch fixes that by creating one variable (complex_mode) that tracks both reasons why parallel operations are not possible. The patch also updates stale documentation regarding the locking. With regards to stable kernels: The patch is required for all kernels that include the commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?) The alternative is to revert the patch that introduced the race. Background: Here is the race of the current implementation: Thread A: (simple op) - does the first "sma->complex_count == 0" test Thread B: (complex op) - does sem_lock(): This includes an array scan. But the scan can't find Thread A, because Thread A does not own sem->lock yet. 
- the thread does the operation, increases complex_count, drops sem_lock, sleeps Thread A: - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock) - sleeps before the complex_count test Thread C: (complex op) - does sem_lock (no array scan, complex_count==1) - wakes up Thread B. - decrements complex_count Thread A: - does the complex_count test Bug: Now both thread A and thread C operate on the same array, without any synchronization. Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") Reported-by: felixh@informatik.uni-bremen.de Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> --- diff --git a/include/linux/sem.h b/include/linux/sem.h index 976ce3a..d0efd6e 100644 --- a/include/linux/sem.h +++ b/include/linux/sem.h @@ -21,6 +21,7 @@ struct sem_array { struct list_head list_id; /* undo requests on this array */ int sem_nsems; /* no. of semaphores in array */ int complex_count; /* pending complex operations */ + bool complex_mode; /* no parallel simple ops */ }; #ifdef CONFIG_SYSVIPC diff --git a/ipc/sem.c b/ipc/sem.c index ae72b3c..db2e6fc 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it); /* * Locking: + * a) global sem_lock() for read/write * sem_undo.id_next, * sem_array.complex_count, - * sem_array.pending{_alter,_cont}, - * sem_array.sem_undo: global sem_lock() for read/write - * sem_undo.proc_next: only "current" is allowed to read/write that field. 
+ * sem_array.complex_mode + * sem_array.pending{_alter,_const}, + * sem_array.sem_undo * + * b) global or semaphore sem_lock() for read/write: * sem_array.sem_base[i].pending_{const,alter}: - * global or semaphore sem_lock() for read/write + * sem_array.complex_mode (for read) + * + * c) special: + * sem_undo_list.list_proc: + * * undo_list->lock for write + * * rcu for read */ #define sc_semmsl sem_ctls[0] @@ -260,23 +267,25 @@ static void sem_rcu_free(struct rcu_head *head) } /* - * Wait until all currently ongoing simple ops have completed. + * Enter the mode suitable for non-simple operations: * Caller must own sem_perm.lock. - * New simple ops cannot start, because simple ops first check - * that sem_perm.lock is free. - * that a) sem_perm.lock is free and b) complex_count is 0. */ -static void sem_wait_array(struct sem_array *sma) +static void complexmode_enter(struct sem_array *sma) { int i; struct sem *sem; - if (sma->complex_count) { - /* The thread that increased sma->complex_count waited on - * all sem->lock locks. Thus we don't need to wait again. - */ + if (sma->complex_mode) { + /* We are already in complex_mode. Nothing to do */ return; } + WRITE_ONCE(sma->complex_mode, true); + + /* We need a full barrier: + * The write to complex_mode must be visible + * before we read the first sem->lock spinlock state. + */ + smp_mb(); for (i = 0; i < sma->sem_nsems; i++) { sem = sma->sem_base + i; @@ -285,6 +294,29 @@ static void sem_wait_array(struct sem_array *sma) } /* + * Try to leave the mode that disallows simple operations: + * Caller must own sem_perm.lock. + */ +static void complexmode_tryleave(struct sem_array *sma) +{ + if (sma->complex_count) { + /* Complex ops are sleeping. + * We must stay in complex mode + */ + return; + } + /* + * Immediately after setting complex_mode to false, + * a simple op can start. Thus: all memory writes + * performed by the current operation must be visible + * before we set complex_mode to false. 
+ */ + smp_wmb(); + + WRITE_ONCE(sma->complex_mode, false); +} + +/* * If the request contains only one semaphore operation, and there are * no complex transactions pending, lock only the semaphore involved. * Otherwise, lock the entire semaphore array, since we either have @@ -300,56 +332,38 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Complex operation - acquire a full lock */ ipc_lock_object(&sma->sem_perm); - /* And wait until all simple ops that are processed - * right now have dropped their locks. - */ - sem_wait_array(sma); + /* Prevent parallel simple ops */ + complexmode_enter(sma); return -1; } /* * Only one semaphore affected - try to optimize locking. - * The rules are: - * - optimized locking is possible if no complex operation - * is either enqueued or processed right now. - * - The test for enqueued complex ops is simple: - * sma->complex_count != 0 - * - Testing for complex ops that are processed right now is - * a bit more difficult. Complex ops acquire the full lock - * and first wait that the running simple ops have completed. - * (see above) - * Thus: If we own a simple lock and the global lock is free - * and complex_count is now 0, then it will stay 0 and - * thus just locking sem->lock is sufficient. + * Optimized locking is possible if no complex operation + * is either enqueued or processed right now. + * + * Both facts are tracked by complex_mode. */ sem = sma->sem_base + sops->sem_num; - if (sma->complex_count == 0) { + /* + * Initial check for complex_mode. Just an optimization, + * no locking. + */ + if (!READ_ONCE(sma->complex_mode)) { /* * It appears that no complex operation is around. * Acquire the per-semaphore lock. 
*/ spin_lock(&sem->lock); - /* Then check that the global lock is free */ - if (!spin_is_locked(&sma->sem_perm.lock)) { - /* - * We need a memory barrier with acquire semantics, - * otherwise we can race with another thread that does: - * complex_count++; - * spin_unlock(sem_perm.lock); - */ - smp_acquire__after_ctrl_dep(); - - /* - * Now repeat the test of complex_count: - * It can't change anymore until we drop sem->lock. - * Thus: if is now 0, then it will stay 0. - */ - if (sma->complex_count == 0) { - /* fast path successful! */ - return sops->sem_num; - } + /* Now repeat the test for complex_mode. + * A memory barrier is provided by the spin_lock() + * above. + */ + if (!READ_ONCE(sma->complex_mode)) { + /* fast path successful! */ + return sops->sem_num; } spin_unlock(&sem->lock); } @@ -369,7 +383,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Not a false alarm, thus complete the sequence for a * full lock. */ - sem_wait_array(sma); + complexmode_enter(sma); return -1; } } @@ -378,6 +392,7 @@ static inline void sem_unlock(struct sem_array *sma, int locknum) { if (locknum == -1) { unmerge_queues(sma); + complexmode_tryleave(sma); ipc_unlock_object(&sma->sem_perm); } else { struct sem *sem = sma->sem_base + locknum; @@ -529,6 +544,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params) } sma->complex_count = 0; + sma->complex_mode = true; /* dropped by sem_unlock below */ INIT_LIST_HEAD(&sma->pending_alter); INIT_LIST_HEAD(&sma->pending_const); INIT_LIST_HEAD(&sma->list_id); @@ -2184,10 +2200,10 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) /* * The proc interface isn't aware of sem_lock(), it calls * ipc_lock_object() directly (in sysvipc_find_ipc). - * In order to stay compatible with sem_lock(), we must wait until - * all simple semop() calls have left their critical regions. + * In order to stay compatible with sem_lock(), we must + * enter / leave complex_mode. 
*/ - sem_wait_array(sma); + complexmode_enter(sma); sem_otime = get_semotime(sma); @@ -2204,6 +2220,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) sem_otime, sma->sem_ctime); + complexmode_tryleave(sma); + return 0; } #endif ^ permalink raw reply related [flat|nested] 17+ messages in thread
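The state machine behind this fix can be condensed into a small user-space sketch (names mirror the patch, but the sem_perm.lock, the memory barriers, and the scan over the per-semaphore spinlocks are elided; only the decision logic for the fast path remains — this is not the kernel code itself):

```c
/* User-space sketch of the complex_mode state machine from the patch
 * above. Spinlocks and barriers are elided; names are illustrative.
 */
#include <stdbool.h>

struct sem_array_model {
	int complex_count;	/* sleeping complex operations */
	bool complex_mode;	/* true: no parallel simple ops */
};

/* Caller is assumed to hold the (elided) global lock. */
static void complexmode_enter(struct sem_array_model *sma)
{
	if (sma->complex_mode)
		return;		/* already in complex mode, nothing to do */
	sma->complex_mode = true;
	/* kernel code: smp_mb(), then spin_unlock_wait() on every sem->lock */
}

/* Called at unlock time; must stay in complex mode while complex ops sleep. */
static void complexmode_tryleave(struct sem_array_model *sma)
{
	if (sma->complex_count)
		return;		/* complex ops are sleeping: stay in complex mode */
	/* kernel code: smp_wmb() before the store below */
	sma->complex_mode = false;
}

/* The per-semaphore fast path is legal only outside complex mode. */
static bool simple_op_may_run(const struct sem_array_model *sma)
{
	return !sma->complex_mode;
}
```

Both reasons that forbid the fast path — a running non-simple op and a sleeping complex op — now funnel through the single complex_mode flag, which is the point of the fix.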
* [PATCH 2/2] ipc/sem: sem_lock with hysteresis 2016-06-18 20:02 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul @ 2016-06-18 20:02 ` Manfred Spraul 2016-06-21 20:29 ` Davidlohr Bueso 2016-06-20 23:04 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Andrew Morton 2016-06-21 0:30 ` Davidlohr Bueso 2 siblings, 1 reply; 17+ messages in thread From: Manfred Spraul @ 2016-06-18 20:02 UTC (permalink / raw) To: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton Cc: LKML, linux-next, 1vier1, Davidlohr Bueso, felixh, Manfred Spraul sysv sem has two lock modes: One with per-semaphore locks, one lock mode with a single big lock for the whole array. When switching from the per-semaphore locks to the big lock, all per-semaphore locks must be scanned for ongoing operations. The patch adds a hysteresis for switching from the big lock to the per semaphore locks. This reduces how often the per-semaphore locks must be scanned. Passed stress testing with sem-scalebench. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> --- include/linux/sem.h | 2 +- ipc/sem.c | 91 ++++++++++++++++++++++++++++------------------------- 2 files changed, 49 insertions(+), 44 deletions(-) diff --git a/include/linux/sem.h b/include/linux/sem.h index d0efd6e..6fb3227 100644 --- a/include/linux/sem.h +++ b/include/linux/sem.h @@ -21,7 +21,7 @@ struct sem_array { struct list_head list_id; /* undo requests on this array */ int sem_nsems; /* no. 
of semaphores in array */ int complex_count; /* pending complex operations */ - bool complex_mode; /* no parallel simple ops */ + int complex_mode; /* >0: no parallel simple ops */ }; #ifdef CONFIG_SYSVIPC diff --git a/ipc/sem.c b/ipc/sem.c index 11d9e60..1f43fb8 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -161,6 +161,13 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it); #define SEMOPM_FAST 64 /* ~ 372 bytes on stack */ /* + * Switching from the mode suitable for simple ops + * to the mode for complex ops is costly. Therefore: + * use some hysteresis + */ +#define COMPLEX_MODE_ENTER 10 + +/* * Locking: * a) global sem_lock() for read/write * sem_undo.id_next, @@ -279,17 +286,25 @@ static void sem_rcu_free(struct rcu_head *head) /* * Enter the mode suitable for non-simple operations: * Caller must own sem_perm.lock. + * Note: + * There is no leave complex mode function. Leaving + * happens in sem_lock, with some hysteresis. */ static void complexmode_enter(struct sem_array *sma) { int i; struct sem *sem; - if (sma->complex_mode) { - /* We are already in complex_mode. Nothing to do */ + if (sma->complex_mode > 0) { + /* + * We are already in complex_mode. + * Nothing to do, just increase + * counter until we return to simple mode + */ + WRITE_ONCE(sma->complex_mode, COMPLEX_MODE_ENTER); return; } - WRITE_ONCE(sma->complex_mode, true); + WRITE_ONCE(sma->complex_mode, COMPLEX_MODE_ENTER); /* We need a full barrier: * The write to complex_mode must be visible @@ -305,29 +320,6 @@ static void complexmode_enter(struct sem_array *sma) } /* - * Try to leave the mode that disallows simple operations: - * Caller must own sem_perm.lock. - */ -static void complexmode_tryleave(struct sem_array *sma) -{ - if (sma->complex_count) { - /* Complex ops are sleeping. - * We must stay in complex mode - */ - return; - } - /* - * Immediately after setting complex_mode to false, - * a simple op can start. 
Thus: all memory writes - * performed by the current operation must be visible - * before we set complex_mode to false. - */ - smp_wmb(); - - WRITE_ONCE(sma->complex_mode, false); -} - -/* * If the request contains only one semaphore operation, and there are * no complex transactions pending, lock only the semaphore involved. * Otherwise, lock the entire semaphore array, since we either have @@ -383,27 +375,42 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, ipc_lock_object(&sma->sem_perm); if (sma->complex_count == 0) { - /* False alarm: - * There is no complex operation, thus we can switch - * back to the fast path. - */ - spin_lock(&sem->lock); - ipc_unlock_object(&sma->sem_perm); - return sops->sem_num; - } else { - /* Not a false alarm, thus complete the sequence for a - * full lock. + /* + * Check if fast path is possible: + * There is no complex operation, check hysteresis + * If 0, switch back to the fast path. */ - complexmode_enter(sma); - return -1; + if (sma->complex_mode > 0) { + /* Note: + * Immediately after setting complex_mode to 0, + * a simple op could start. + * The data it would access was written by the + * previous owner of sem->sem_perm.lock, i.e + * a release and an acquire memory barrier ago. + * No need for another barrier. + */ + WRITE_ONCE(sma->complex_mode, sma->complex_mode-1); + } + if (sma->complex_mode == 0) { + spin_lock(&sem->lock); + ipc_unlock_object(&sma->sem_perm); + return sops->sem_num; + } } + /* + * Not a false alarm, full lock is required. + * Since we are already in complex_mode (either because of waiting + * complex ops or due to hysteresis), there is not need for a + * complexmode_enter(). 
+ */ + WARN_ON(sma->complex_mode == 0); + return -1; } static inline void sem_unlock(struct sem_array *sma, int locknum) { if (locknum == -1) { unmerge_queues(sma); - complexmode_tryleave(sma); ipc_unlock_object(&sma->sem_perm); } else { struct sem *sem = sma->sem_base + locknum; @@ -555,7 +562,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params) } sma->complex_count = 0; - sma->complex_mode = true; /* dropped by sem_unlock below */ + WRITE_ONCE(sma->complex_mode, COMPLEX_MODE_ENTER); INIT_LIST_HEAD(&sma->pending_alter); INIT_LIST_HEAD(&sma->pending_const); INIT_LIST_HEAD(&sma->list_id); @@ -2212,7 +2219,7 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) * The proc interface isn't aware of sem_lock(), it calls * ipc_lock_object() directly (in sysvipc_find_ipc). * In order to stay compatible with sem_lock(), we must - * enter / leave complex_mode. + * enter complex_mode. */ complexmode_enter(sma); @@ -2231,8 +2238,6 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) sem_otime, sma->sem_ctime); - complexmode_tryleave(sma); - return 0; } #endif -- 2.5.5 ^ permalink raw reply related [flat|nested] 17+ messages in thread
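The hysteresis itself reduces to a counter protocol; a user-space sketch (locking elided, names illustrative, not the kernel code):

```c
/* Sketch of the hysteresis in the patch above: complex_mode becomes a
 * counter armed to COMPLEX_MODE_ENTER and burned down by one on each
 * "false alarm" (a full-lock acquisition that finds complex_count == 0).
 * Only when it reaches zero does the array drop back to per-semaphore
 * locking.
 */
#include <stdbool.h>

#define COMPLEX_MODE_ENTER 10

struct sem_array_model {
	int complex_count;	/* sleeping complex operations */
	int complex_mode;	/* >0: no parallel simple ops */
};

static void complexmode_enter(struct sem_array_model *sma)
{
	/* entering (or re-entering) complex mode re-arms the counter */
	sma->complex_mode = COMPLEX_MODE_ENTER;
}

/* Returns true when the caller may downgrade to the per-semaphore lock. */
static bool false_alarm_may_downgrade(struct sem_array_model *sma)
{
	if (sma->complex_count)
		return false;		/* real complex work is pending */
	if (sma->complex_mode > 0)
		sma->complex_mode--;	/* burn one unit of hysteresis */
	return sma->complex_mode == 0;
}
```

Note there is deliberately no tryleave() anymore: leaving complex mode happens only through repeated false alarms, which is what spares the repeated per-semaphore scans.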
* Re: [PATCH 2/2] ipc/sem: sem_lock with hysteresis 2016-06-18 20:02 ` [PATCH 2/2] ipc/sem: sem_lock with hysteresis Manfred Spraul @ 2016-06-21 20:29 ` Davidlohr Bueso 2016-06-25 17:37 ` Manfred Spraul 0 siblings, 1 reply; 17+ messages in thread From: Davidlohr Bueso @ 2016-06-21 20:29 UTC (permalink / raw) To: Manfred Spraul Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh On Sat, 18 Jun 2016, Manfred Spraul wrote: >sysv sem has two lock modes: One with per-semaphore locks, one lock mode >with a single big lock for the whole array. >When switching from the per-semaphore locks to the big lock, all >per-semaphore locks must be scanned for ongoing operations. > >The patch adds a hysteresis for switching from the big lock to the per >semaphore locks. This reduces how often the per-semaphore locks must >be scanned. Isn't this very arbitrary depending on the workload? Ie the other way around: when we have a lot more simple ops going on not so good. While I'm more worried about combinations that could cause enough complex ops to always delay taking the finer grained lock, this change also obviously makes simple ops more expensive on newly created segments. In general I don't trust magic numbers much. What sort of numbers have you seen with this patch? Is this a real concern (particularly because a lot of the sem->lock work was because real world workloads were doing a lot more simple ops afaicr)? Thanks, Davidlohr ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] ipc/sem: sem_lock with hysteresis 2016-06-21 20:29 ` Davidlohr Bueso @ 2016-06-25 17:37 ` Manfred Spraul 2016-06-28 17:54 ` Davidlohr Bueso 0 siblings, 1 reply; 17+ messages in thread From: Manfred Spraul @ 2016-06-25 17:37 UTC (permalink / raw) To: Davidlohr Bueso Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh On 06/21/2016 10:29 PM, Davidlohr Bueso wrote: > On Sat, 18 Jun 2016, Manfred Spraul wrote: > >> sysv sem has two lock modes: One with per-semaphore locks, one lock mode >> with a single big lock for the whole array. >> When switching from the per-semaphore locks to the big lock, all >> per-semaphore locks must be scanned for ongoing operations. >> >> The patch adds a hysteresis for switching from the big lock to the per >> semaphore locks. This reduces how often the per-semaphore locks must >> be scanned. > > Isn't this very arbitrary depending on the workload? Ie the other way > around: > when we have a lot more simple ops going on not so good. While I'm > more worried > about combinations that could cause enough complex ops to always delay > taking > the finer grained lock, this change also obviously makes simple ops > more expensive > on newly created segments. I Entering complex mode requires a scan of sem_base[].sem_lock. > for (i = 0; i < sma->sem_nsems; i++) { > sem = sma->sem_base + i; > spin_unlock_wait(&sem->lock); This is what the patch tries to avoid. > > In general I don't trust magic numbers much. What sort of numbers have > you seen > with this patch? Is this a real concern (particularly because a lot of > the sem->lock > work was because real world workloads were doing a lot more simple ops > afaicr)? 
>
With a microbenchmark: as much improvement as you want :-)
- Only simple ops: the patch has no impact (the first 10 semops do not matter).
- Sleeping complex ops: the patch has no impact; we are always in complex mode.
- Non-sleeping complex ops: depends on the size of the array. With a 4,000-semaphore array, I see an improvement of factor 20.

There is obviously one case where the patch causes a slowdown:
- a complex op, then 11 simple ops, then repeat.

Perhaps: set COMPLEX_MODE_ENTER to 1 or 2, then make it configurable from user space.
Or do not merge the patch and wait until someone comes up with a profile that shows complexmode_enter().

-- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
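The trade-off quantified above can be modeled with a toy op-sequence counter (purely illustrative; 'C' is a non-sleeping complex op, 'S' a simple op, and a "scan" is complexmode_enter()'s walk over all per-semaphore locks):

```c
/* Toy model: how many full scans of the per-semaphore locks does each
 * policy perform for a given op mix? Not kernel code.
 */
#define COMPLEX_MODE_ENTER 10

/* Patch 1 leaves complex mode at every unlock, so every complex op
 * re-enters it and re-scans the array. */
static int scans_leave_immediately(const char *ops)
{
	int scans = 0;

	for (; *ops; ops++)
		if (*ops == 'C')
			scans++;
	return scans;
}

/* Patch 2 keeps a hysteresis counter; a scan happens only when a
 * complex op arrives while the array is back in simple mode. */
static int scans_hysteresis(const char *ops)
{
	int scans = 0, mode = 0;

	for (; *ops; ops++) {
		if (*ops == 'C') {
			if (mode == 0)
				scans++;	/* entering from simple mode */
			mode = COMPLEX_MODE_ENTER;
		} else if (mode > 0) {
			mode--;		/* simple op under big lock burns one */
		}
	}
	return scans;
}
```

A burst of complex ops ("CCCCC") costs five scans without hysteresis but only one with it; an alternating pattern with enough simple ops in between gets no scan benefit, matching the slowdown case described above.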
* Re: [PATCH 2/2] ipc/sem: sem_lock with hysteresis 2016-06-25 17:37 ` Manfred Spraul @ 2016-06-28 17:54 ` Davidlohr Bueso 0 siblings, 0 replies; 17+ messages in thread From: Davidlohr Bueso @ 2016-06-28 17:54 UTC (permalink / raw) To: Manfred Spraul Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh On Sat, 25 Jun 2016, Manfred Spraul wrote: >- Only simple ops: patch has no impact (the first 10 semops do not matter) Agreed. >- sleeping complex ops: patch has no impact, we are always in complex mode >- not sleeping complex ops: depends on the size of the array. >With a 4.000 semaphore array, I see an improvement of factor 20. > >There is obviously one case where the patch causes a slowdown: >- complex op, then 11 simple ops, then repeat. Yeah, you reset the counter to COMPLEX_MODE_ENTER every time we do a complexmode_enter(), so if you have interleaving of complex and simple ops you could end up always taking the big lock. This is my main concern and not an unusual scenario at all. I wonder if we could be a bit less aggressive if already in complex_mode and do something like: if (sma->complex_mode > 0) { WRITE_ONCE(sma->complex_mode, min(complex_mode + 1, COMPLEX_MODE_ENTER)); return; } and still be constrained by COMPLEX_MODE_ENTER... of course that has its own additional overhead, albeit less contention on the big lock :/ > >Perhaps: set COMPLEX_MODE_ENTER to 1 or 2, then allow to configure it >from user space. Nah, I don't think it's a good idea to expose such internals to userspace. >Or do not merge the patch and wait until someone come with a profile >that shows complexmode_enter(). Testing/benchmarking this patch is on my todo list mainly because I'm lacking a decent box to test it on. But I'm not conformable having this one without any numbers for say at least a rdbms benchmark. 
I'm very eager to add the first patch (complex_mode) once the nits are settled, as it fixes a real bug. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 17+ messages in thread
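Davidlohr's capped-increment alternative can be contrasted with the full reset in a small sketch (illustrative only; it counts how many simple ops would still get the per-semaphore fast path under an interleaved workload, with 'C' a complex op and 'S' a simple op):

```c
/* Contrast the two re-arm policies discussed above: patch 2 resets the
 * hysteresis counter to its maximum on every complex op, while the
 * capped increment only nudges it up by one. Not kernel code.
 */
#include <stdbool.h>

#define COMPLEX_MODE_ENTER 10

static int fast_simple_ops(const char *ops, bool capped_increment)
{
	int mode = 0, fast = 0;

	for (; *ops; ops++) {
		if (*ops == 'C') {
			if (capped_increment)
				mode = mode < COMPLEX_MODE_ENTER ?
				       mode + 1 : COMPLEX_MODE_ENTER;
			else
				mode = COMPLEX_MODE_ENTER; /* full reset */
		} else if (mode == 0) {
			fast++;		/* fast path taken */
		} else {
			mode--;		/* false alarm: big lock, burn one */
		}
	}
	return fast;
}
```

Under the interleaved mix that worries Davidlohr, the full reset never lets a simple op through, while the capped increment recovers a good share of them.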
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-18 20:02 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul 2016-06-18 20:02 ` [PATCH 2/2] ipc/sem: sem_lock with hysteresis Manfred Spraul @ 2016-06-20 23:04 ` Andrew Morton 2016-06-23 18:55 ` Manfred Spraul 2016-06-21 0:30 ` Davidlohr Bueso 2 siblings, 1 reply; 17+ messages in thread From: Andrew Morton @ 2016-06-20 23:04 UTC (permalink / raw) To: Manfred Spraul Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, LKML, linux-next, 1vier1, Davidlohr Bueso, felixh, stable On Sat, 18 Jun 2016 22:02:21 +0200 Manfred Spraul <manfred@colorfullife.com> wrote: > Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: > > sem_lock has a fast path that allows parallel simple operations. > There are two reasons why a simple operation cannot run in parallel: > - a non-simple operations is ongoing (sma->sem_perm.lock held) > - a complex operation is sleeping (sma->complex_count != 0) > > As both facts are stored independently, a thread can bypass the current > checks by sleeping in the right positions. See below for more details > (or kernel bugzilla 105651). > > The patch fixes that by creating one variable (complex_mode) > that tracks both reasons why parallel operations are not possible. > > The patch also updates stale documentation regarding the locking. > > With regards to stable kernels: > The patch is required for all kernels that include the commit 6d07b68ce16a > ("ipc/sem.c: optimize sem_lock()") (3.10?) I've had this in -mm (and -next) since January 4, without issues. I put it on hold because Davidlohr expressed concern about performance regressions. Your [2/2] should prevent those regressions (yes?) so I assume that any kernel which has [1/2] really should have [2/2] as well. But without any quantitative information, this is all mad guesswork. What to do? 
(The [2/2] changelog should explain that it is the cure to [1/2]'s regressions, btw). ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-20 23:04 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Andrew Morton @ 2016-06-23 18:55 ` Manfred Spraul 0 siblings, 0 replies; 17+ messages in thread From: Manfred Spraul @ 2016-06-23 18:55 UTC (permalink / raw) To: Andrew Morton Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, LKML, linux-next, 1vier1, Davidlohr Bueso, felixh, stable On 06/21/2016 01:04 AM, Andrew Morton wrote: > On Sat, 18 Jun 2016 22:02:21 +0200 Manfred Spraul <manfred@colorfullife.com> wrote: > >> Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: >> >> sem_lock has a fast path that allows parallel simple operations. >> There are two reasons why a simple operation cannot run in parallel: >> - a non-simple operations is ongoing (sma->sem_perm.lock held) >> - a complex operation is sleeping (sma->complex_count != 0) >> >> As both facts are stored independently, a thread can bypass the current >> checks by sleeping in the right positions. See below for more details >> (or kernel bugzilla 105651). >> >> The patch fixes that by creating one variable (complex_mode) >> that tracks both reasons why parallel operations are not possible. >> >> The patch also updates stale documentation regarding the locking. >> >> With regards to stable kernels: >> The patch is required for all kernels that include the commit 6d07b68ce16a >> ("ipc/sem.c: optimize sem_lock()") (3.10?) > I've had this in -mm (and -next) since January 4, without issues. I > put it on hold because Davidlohr expressed concern about performance > regressions. I had several ideas how to fix it. The initial ideas probably had performance issue. The current one doesn't have any issues. It just took longer than expected to test it. > Your [2/2] should prevent those regressions (yes?) so I assume that any > kernel which has [1/2] really should have [2/2] as well. 
> But without any quantitative information, this is all mad guesswork.
>
> What to do?
[2/2] is an improvement: it handles one case better than the current code.

Put another way: 3.10 improved scalability, but it introduced a performance regression for one use case. [With 3.10, simple ops became parallel, but complex ops had to perform a "for_each_sem() {spin_unlock_wait()}".] The patch fixes that.

-- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-18 20:02 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul 2016-06-18 20:02 ` [PATCH 2/2] ipc/sem: sem_lock with hysteresis Manfred Spraul 2016-06-20 23:04 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Andrew Morton @ 2016-06-21 0:30 ` Davidlohr Bueso 2016-06-23 19:22 ` Manfred Spraul 2 siblings, 1 reply; 17+ messages in thread From: Davidlohr Bueso @ 2016-06-21 0:30 UTC (permalink / raw) To: Manfred Spraul Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh, stable On Sat, 18 Jun 2016, Manfred Spraul wrote: >diff --git a/include/linux/sem.h b/include/linux/sem.h >index 976ce3a..d0efd6e 100644 >--- a/include/linux/sem.h >+++ b/include/linux/sem.h >@@ -21,6 +21,7 @@ struct sem_array { > struct list_head list_id; /* undo requests on this array */ > int sem_nsems; /* no. of semaphores in array */ > int complex_count; /* pending complex operations */ I see this patch also takes care of complex_count needing READ/WRITE_ONCE as you change that to always be done with the big sem_perm lock. >+ bool complex_mode; /* no parallel simple ops */ But I'm wondering about races with this one. Doesn't complex_mode need acquire/release semantics? > }; > [...] > /* >- * Wait until all currently ongoing simple ops have completed. >+ * Enter the mode suitable for non-simple operations: > * Caller must own sem_perm.lock. >- * New simple ops cannot start, because simple ops first check >- * that sem_perm.lock is free. >- * that a) sem_perm.lock is free and b) complex_count is 0. > */ >-static void sem_wait_array(struct sem_array *sma) >+static void complexmode_enter(struct sem_array *sma) > { > int i; > struct sem *sem; > >- if (sma->complex_count) { >- /* The thread that increased sma->complex_count waited on >- * all sem->lock locks. Thus we don't need to wait again. 
>- */ >+ if (sma->complex_mode) { >+ /* We are already in complex_mode. Nothing to do */ This complex_mode load is serialized because both complexmode_enter() and _tryleave(), which are the main calls that modify the variable, are serialized by sem_perm lock, right? > return; > } Btw, I like that this logic is much simpler, just by reading the comments :) >+ WRITE_ONCE(sma->complex_mode, true); >+ >+ /* We need a full barrier: >+ * The write to complex_mode must be visible >+ * before we read the first sem->lock spinlock state. >+ */ >+ smp_mb(); > > for (i = 0; i < sma->sem_nsems; i++) { > sem = sma->sem_base + i; >@@ -285,6 +294,29 @@ static void sem_wait_array(struct sem_array *sma) > } > > /* >+ * Try to leave the mode that disallows simple operations: >+ * Caller must own sem_perm.lock. >+ */ >+static void complexmode_tryleave(struct sem_array *sma) >+{ >+ if (sma->complex_count) { >+ /* Complex ops are sleeping. >+ * We must stay in complex mode >+ */ >+ return; >+ } >+ /* >+ * Immediately after setting complex_mode to false, >+ * a simple op can start. Thus: all memory writes >+ * performed by the current operation must be visible >+ * before we set complex_mode to false. >+ */ >+ smp_wmb(); >+ >+ WRITE_ONCE(sma->complex_mode, false); smp_store_release()? See below. >+} >+ >+/* > * If the request contains only one semaphore operation, and there are > * no complex transactions pending, lock only the semaphore involved. > * Otherwise, lock the entire semaphore array, since we either have >@@ -300,56 +332,38 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, > /* Complex operation - acquire a full lock */ > ipc_lock_object(&sma->sem_perm); > >- /* And wait until all simple ops that are processed >- * right now have dropped their locks. >- */ >- sem_wait_array(sma); >+ /* Prevent parallel simple ops */ >+ complexmode_enter(sma); > return -1; > } > > /* > * Only one semaphore affected - try to optimize locking. 
>- * The rules are: >- * - optimized locking is possible if no complex operation >- * is either enqueued or processed right now. >- * - The test for enqueued complex ops is simple: >- * sma->complex_count != 0 >- * - Testing for complex ops that are processed right now is >- * a bit more difficult. Complex ops acquire the full lock >- * and first wait that the running simple ops have completed. >- * (see above) >- * Thus: If we own a simple lock and the global lock is free >- * and complex_count is now 0, then it will stay 0 and >- * thus just locking sem->lock is sufficient. >+ * Optimized locking is possible if no complex operation >+ * is either enqueued or processed right now. >+ * >+ * Both facts are tracked by complex_mode. > */ > sem = sma->sem_base + sops->sem_num; > >- if (sma->complex_count == 0) { >+ /* >+ * Initial check for complex_mode. Just an optimization, >+ * no locking. >+ */ >+ if (!READ_ONCE(sma->complex_mode)) { We have no lock (which is the whole point), I think we need stronger guarantees here to avoid racing with another thread that holds sem_perm lock and is entering complexmode for the first time or vice versa with tryleave(). An smp_load_acquire here would pair with the suggested smp_store_release calls. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-21 0:30 ` Davidlohr Bueso @ 2016-06-23 19:22 ` Manfred Spraul 2016-06-28 5:27 ` Davidlohr Bueso 0 siblings, 1 reply; 17+ messages in thread From: Manfred Spraul @ 2016-06-23 19:22 UTC (permalink / raw) To: Davidlohr Bueso Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh, stable On 06/21/2016 02:30 AM, Davidlohr Bueso wrote: > On Sat, 18 Jun 2016, Manfred Spraul wrote: > >> diff --git a/include/linux/sem.h b/include/linux/sem.h >> index 976ce3a..d0efd6e 100644 >> --- a/include/linux/sem.h >> +++ b/include/linux/sem.h >> @@ -21,6 +21,7 @@ struct sem_array { >> struct list_head list_id; /* undo requests on this array */ >> int sem_nsems; /* no. of semaphores in array */ >> int complex_count; /* pending complex operations */ > > I see this patch also takes care of complex_count needing READ/WRITE_ONCE > as you change that to always be done with the big sem_perm lock. > >> + bool complex_mode; /* no parallel simple ops */ > > But I'm wondering about races with this one. Doesn't complex_mode need > acquire/release semantics? > Yes. complex_mode needs acquire/release. >> }; >> > > [...] > >> /* >> - * Wait until all currently ongoing simple ops have completed. >> + * Enter the mode suitable for non-simple operations: >> * Caller must own sem_perm.lock. >> - * New simple ops cannot start, because simple ops first check >> - * that sem_perm.lock is free. >> - * that a) sem_perm.lock is free and b) complex_count is 0. >> */ >> -static void sem_wait_array(struct sem_array *sma) >> +static void complexmode_enter(struct sem_array *sma) >> { >> int i; >> struct sem *sem; >> >> - if (sma->complex_count) { >> - /* The thread that increased sma->complex_count waited on >> - * all sem->lock locks. Thus we don't need to wait again. >> - */ >> + if (sma->complex_mode) { >> + /* We are already in complex_mode. 
Nothing to do */ > > This complex_mode load is serialized because both complexmode_enter() and > _tryleave(), which are the main calls that modify the variable, are > serialized > by sem_perm lock, right? > Exactly. Changes to complex_mode are protected by perm.lock. >> return; >> } > > Btw, I like that this logic is much simpler, just by reading the > comments :) > >> + WRITE_ONCE(sma->complex_mode, true); >> + >> + /* We need a full barrier: >> + * The write to complex_mode must be visible >> + * before we read the first sem->lock spinlock state. >> + */ >> + smp_mb(); >> Theoretically: smp_store_acquire. but this doesn't exist, so smp_mb() >> for (i = 0; i < sma->sem_nsems; i++) { >> sem = sma->sem_base + i; >> @@ -285,6 +294,29 @@ static void sem_wait_array(struct sem_array *sma) >> } >> >> /* >> + * Try to leave the mode that disallows simple operations: >> + * Caller must own sem_perm.lock. >> + */ >> +static void complexmode_tryleave(struct sem_array *sma) >> +{ >> + if (sma->complex_count) { >> + /* Complex ops are sleeping. >> + * We must stay in complex mode >> + */ >> + return; >> + } >> + /* >> + * Immediately after setting complex_mode to false, >> + * a simple op can start. Thus: all memory writes >> + * performed by the current operation must be visible >> + * before we set complex_mode to false. >> + */ >> + smp_wmb(); >> + >> + WRITE_ONCE(sma->complex_mode, false); > > smp_store_release()? See below. > Yes >> +} >> + >> +/* >> * If the request contains only one semaphore operation, and there are >> * no complex transactions pending, lock only the semaphore involved. >> * Otherwise, lock the entire semaphore array, since we either have >> @@ -300,56 +332,38 @@ static inline int sem_lock(struct sem_array >> *sma, struct sembuf *sops, >> /* Complex operation - acquire a full lock */ >> ipc_lock_object(&sma->sem_perm); >> >> - /* And wait until all simple ops that are processed >> - * right now have dropped their locks. 
>> - */ >> - sem_wait_array(sma); >> + /* Prevent parallel simple ops */ >> + complexmode_enter(sma); >> return -1; >> } >> >> /* >> * Only one semaphore affected - try to optimize locking. >> - * The rules are: >> - * - optimized locking is possible if no complex operation >> - * is either enqueued or processed right now. >> - * - The test for enqueued complex ops is simple: >> - * sma->complex_count != 0 >> - * - Testing for complex ops that are processed right now is >> - * a bit more difficult. Complex ops acquire the full lock >> - * and first wait that the running simple ops have completed. >> - * (see above) >> - * Thus: If we own a simple lock and the global lock is free >> - * and complex_count is now 0, then it will stay 0 and >> - * thus just locking sem->lock is sufficient. >> + * Optimized locking is possible if no complex operation >> + * is either enqueued or processed right now. >> + * >> + * Both facts are tracked by complex_mode. >> */ >> sem = sma->sem_base + sops->sem_num; >> >> - if (sma->complex_count == 0) { >> + /* >> + * Initial check for complex_mode. Just an optimization, >> + * no locking. >> + */ >> + if (!READ_ONCE(sma->complex_mode)) { > > We have no lock (which is the whole point), I think we need stronger > guarantees here to avoid racing with another thread that holds sem_perm > lock and is entering complexmode for the first time or vice versa with > tryleave(). An smp_load_acquire here would pair with the suggested > smp_store_release calls. > Yes,you are right. What I'm not sure yet is if smp_load_acquire() is sufficient: Thread A: > if (!READ_ONCE(sma->complex_mode)) { The code is test_and_test, no barrier requirements for first test > /* > * It appears that no complex operation is around. > * Acquire the per-semaphore lock. > */ > spin_lock(&sem->lock); > > if (!smp_load_acquire(&sma->complex_mode)) { > /* fast path successful! 
*/ > return sops->sem_num; > } > spin_unlock(&sem->lock); > } Thread B: > WRITE_ONCE(sma->complex_mode, true); > > /* We need a full barrier: > * The write to complex_mode must be visible > * before we read the first sem->lock spinlock state. > */ > smp_mb(); > > for (i = 0; i < sma->sem_nsems; i++) { > sem = sma->sem_base + i; > spin_unlock_wait(&sem->lock); > } If thread A is allowed to issue read_spinlock;read complex_mode;write spinlock, then thread B would not notice that thread A is in the critical section > Thanks, > Davidlohr I'll update the patch. -- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-23 19:22 ` Manfred Spraul @ 2016-06-28 5:27 ` Davidlohr Bueso 2016-06-30 19:28 ` Manfred Spraul 0 siblings, 1 reply; 17+ messages in thread From: Davidlohr Bueso @ 2016-06-28 5:27 UTC (permalink / raw) To: Manfred Spraul Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh, stable On Thu, 23 Jun 2016, Manfred Spraul wrote: >What I'm not sure yet is if smp_load_acquire() is sufficient: > >Thread A: >> if (!READ_ONCE(sma->complex_mode)) { >The code is test_and_test, no barrier requirements for first test Yeah, it would just make us take the big lock unnecessarily, nothing fatal and I agree its probably worth the optimization. It still might be worth commenting. >> /* >> * It appears that no complex operation is around. >> * Acquire the per-semaphore lock. >> */ >> spin_lock(&sem->lock); >> >> if (!smp_load_acquire(&sma->complex_mode)) { >> /* fast path successful! */ >> return sops->sem_num; >> } >> spin_unlock(&sem->lock); >> } > >Thread B: >> WRITE_ONCE(sma->complex_mode, true); >> >> /* We need a full barrier: >> * The write to complex_mode must be visible >> * before we read the first sem->lock spinlock state. >> */ >> smp_mb(); >> >> for (i = 0; i < sma->sem_nsems; i++) { >> sem = sma->sem_base + i; >> spin_unlock_wait(&sem->lock); >> } > >If thread A is allowed to issue read_spinlock;read complex_mode;write >spinlock, then thread B would not notice that thread A is in the >critical section Are you referring to the sem->lock word not being visibly locked before we read complex_mode (for the second time)? This issue was fixed in 2c610022711 (locking/qspinlock: Fix spin_unlock_wait() some more). So smp_load_acquire should be enough afaict, or are you referring to something else? Thanks, Davidlohr ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-28 5:27 ` Davidlohr Bueso @ 2016-06-30 19:28 ` Manfred Spraul 2016-07-01 16:52 ` Davidlohr Bueso 0 siblings, 1 reply; 17+ messages in thread From: Manfred Spraul @ 2016-06-30 19:28 UTC (permalink / raw) To: Davidlohr Bueso Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh, stable On 06/28/2016 07:27 AM, Davidlohr Bueso wrote: > On Thu, 23 Jun 2016, Manfred Spraul wrote: > >> What I'm not sure yet is if smp_load_acquire() is sufficient: >> >> Thread A: >>> if (!READ_ONCE(sma->complex_mode)) { >> The code is test_and_test, no barrier requirements for first test > > Yeah, it would just make us take the big lock unnecessarily, nothing > fatal > and I agree its probably worth the optimization. It still might be worth > commenting. > I'll extend the comment: "no locking and no memory barrier" >>> /* >>> * It appears that no complex operation is around. >>> * Acquire the per-semaphore lock. >>> */ >>> spin_lock(&sem->lock); >>> >>> if (!smp_load_acquire(&sma->complex_mode)) { >>> /* fast path successful! */ >>> return sops->sem_num; >>> } >>> spin_unlock(&sem->lock); >>> } >> >> Thread B: >>> WRITE_ONCE(sma->complex_mode, true); >>> >>> /* We need a full barrier: >>> * The write to complex_mode must be visible >>> * before we read the first sem->lock spinlock state. >>> */ >>> smp_mb(); >>> >>> for (i = 0; i < sma->sem_nsems; i++) { >>> sem = sma->sem_base + i; >>> spin_unlock_wait(&sem->lock); >>> } >> >> If thread A is allowed to issue read_spinlock;read complex_mode;write >> spinlock, then thread B would not notice that thread A is in the >> critical section > > Are you referring to the sem->lock word not being visibly locked > before we > read complex_mode (for the second time)? This issue was fixed in > 2c610022711 > (locking/qspinlock: Fix spin_unlock_wait() some more). 
So > smp_load_acquire > should be enough afaict, or are you referring to something else? > You are right, I didn't read this patch fully. If I understand it right, it means that spin_lock() is both an acquire and a release - for qspinlocks. Is this valid for all spinlock implementations, for all architectures? Otherwise: How can I detect in generic code if I can rely on a release inside spin_lock()? -- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-30 19:28 ` Manfred Spraul @ 2016-07-01 16:52 ` Davidlohr Bueso 0 siblings, 0 replies; 17+ messages in thread From: Davidlohr Bueso @ 2016-07-01 16:52 UTC (permalink / raw) To: Manfred Spraul Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh, stable On Thu, 30 Jun 2016, Manfred Spraul wrote: >On 06/28/2016 07:27 AM, Davidlohr Bueso wrote: >If I understand it right, it means that spin_lock() is both an acquire >and a release - for qspinlocks. I wouldn't say that: the _Q_LOCKED_VAL stores are still unordered inside the acquire region but no further than the unlock store-release; which is fine for regular mutual exclusion. >Is this valid for all spinlock implementations, for all architectures? >Otherwise: How can I detect in generic code if I can rely on a release >inside spin_lock()? With ticket locks this was never an issue as special lock readers (spin_unlock_wait(), spin_is_locked()) will always see the lock taken as they observe waiters by adding themselves to the tail -- something the above commit mimics and piggybacks on to save expensive smp_rmb(). In addition, see 726328d92a4 (locking/spinlock, arch: Update and fix spin_unlock_wait() implementations). Thanks, Davidlohr ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 0/2] ipc/sem.c: sem_lock fixes @ 2016-06-25 17:37 Manfred Spraul 2016-06-25 17:37 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul 0 siblings, 1 reply; 17+ messages in thread From: Manfred Spraul @ 2016-06-25 17:37 UTC (permalink / raw) To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul Hi Andrew, Hi Peter, next version of the sem_lock() fixes / improvement: The patches are now vs. tip. Patch 1 is ready for merging, patch 2 is new and for discussion. Patch 1 fixes the race that was found by Felix. It also adds smp_mb() to fully synchronize WRITE_ONCE(status, 1); <<< smp_mb(); spin_unlock_wait(); vs. spin_lock(); <<< smp_mb(); READ_ONCE(status); Patch 2 tries to close a performance regression: sem_wait_array() must perform a spin_unlock_wait() for every semaphore in the array. If the array is large, then this is slow. Two open points: - Do we need it? Factor 20 improvement is nice, but is the test realistic? - COMPLEX_MODE_ENTER=10 is a magic number, without any rationale. (2+sem_nsems/128) would be another option, but also arbitrary. Test: https://github.com/manfred-colorfu/ipcscale # ./sem-scalebench -o 4 -f -m 0 -h 4 -i 1 -p 2 -c 8 -t 5 -d 500 Before: 214732 After: 4008635 -- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-25 17:37 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul @ 2016-06-25 17:37 ` Manfred Spraul 0 siblings, 0 replies; 17+ messages in thread From: Manfred Spraul @ 2016-06-25 17:37 UTC (permalink / raw) To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul, stable Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: sem_lock has a fast path that allows parallel simple operations. There are two reasons why a simple operation cannot run in parallel: - a non-simple operation is ongoing (sma->sem_perm.lock held) - a complex operation is sleeping (sma->complex_count != 0) As both facts are stored independently, a thread can bypass the current checks by sleeping in the right positions. See below for more details (or kernel bugzilla 105651). The patch fixes that by creating one variable (complex_mode) that tracks both reasons why parallel operations are not possible. The patch also updates stale documentation regarding the locking. With regards to stable kernels: The patch is required for all kernels that include the commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?) The alternative is to revert the patch that introduced the race. Background: Here is the race of the current implementation: Thread A: (simple op) - does the first "sma->complex_count == 0" test Thread B: (complex op) - does sem_lock(): This includes an array scan. But the scan can't find Thread A, because Thread A does not own sem->lock yet. - the thread does the operation, increases complex_count, drops sem_lock, sleeps Thread A: - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock) - sleeps before the complex_count test Thread C: (complex op) - does sem_lock (no array scan, complex_count==1) - wakes up Thread B.
- decrements complex_count Thread A: - does the complex_count test Bug: Now both thread A and thread C operate on the same array, without any synchronization. Full memory barriers are required to synchronize changes of complex_mode and the lock operations. Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") Reported-by: felixh@informatik.uni-bremen.de Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> --- include/linux/sem.h | 1 + ipc/sem.c | 122 ++++++++++++++++++++++++++++++---------------------- 2 files changed, 71 insertions(+), 52 deletions(-) diff --git a/include/linux/sem.h b/include/linux/sem.h index 976ce3a..d0efd6e 100644 --- a/include/linux/sem.h +++ b/include/linux/sem.h @@ -21,6 +21,7 @@ struct sem_array { struct list_head list_id; /* undo requests on this array */ int sem_nsems; /* no. of semaphores in array */ int complex_count; /* pending complex operations */ + bool complex_mode; /* no parallel simple ops */ }; #ifdef CONFIG_SYSVIPC diff --git a/ipc/sem.c b/ipc/sem.c index ae72b3c..538f43a 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it); /* * Locking: + * a) global sem_lock() for read/write * sem_undo.id_next, * sem_array.complex_count, - * sem_array.pending{_alter,_cont}, - * sem_array.sem_undo: global sem_lock() for read/write - * sem_undo.proc_next: only "current" is allowed to read/write that field.
+ * sem_array.complex_mode + * sem_array.pending{_alter,_const}, + * sem_array.sem_undo * + * b) global or semaphore sem_lock() for read/write: * sem_array.sem_base[i].pending_{const,alter}: - * global or semaphore sem_lock() for read/write + * sem_array.complex_mode (for read) + * + * c) special: + * sem_undo_list.list_proc: + * * undo_list->lock for write + * * rcu for read */ #define sc_semmsl sem_ctls[0] @@ -260,23 +267,25 @@ static void sem_rcu_free(struct rcu_head *head) } /* - * Wait until all currently ongoing simple ops have completed. + * Enter the mode suitable for non-simple operations: * Caller must own sem_perm.lock. - * New simple ops cannot start, because simple ops first check - * that sem_perm.lock is free. - * that a) sem_perm.lock is free and b) complex_count is 0. */ -static void sem_wait_array(struct sem_array *sma) +static void complexmode_enter(struct sem_array *sma) { int i; struct sem *sem; - if (sma->complex_count) { - /* The thread that increased sma->complex_count waited on - * all sem->lock locks. Thus we don't need to wait again. - */ + if (sma->complex_mode) { + /* We are already in complex_mode. Nothing to do */ return; } + WRITE_ONCE(sma->complex_mode, true); + + /* We need a full barrier: + * The write to complex_mode must be visible + * before we read the first sem->lock spinlock state. + */ + smp_mb(); for (i = 0; i < sma->sem_nsems; i++) { sem = sma->sem_base + i; @@ -285,6 +294,27 @@ static void sem_wait_array(struct sem_array *sma) } /* + * Try to leave the mode that disallows simple operations: + * Caller must own sem_perm.lock. + */ +static void complexmode_tryleave(struct sem_array *sma) +{ + if (sma->complex_count) { + /* Complex ops are sleeping. + * We must stay in complex mode + */ + return; + } + /* + * Immediately after setting complex_mode to false, + * a simple op can start. Thus: all memory writes + * performed by the current operation must be visible + * before we set complex_mode to false. 
+ */ + smp_store_release(&sma->complex_mode, false); +} + +/* * If the request contains only one semaphore operation, and there are * no complex transactions pending, lock only the semaphore involved. * Otherwise, lock the entire semaphore array, since we either have @@ -300,56 +330,40 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Complex operation - acquire a full lock */ ipc_lock_object(&sma->sem_perm); - /* And wait until all simple ops that are processed - * right now have dropped their locks. - */ - sem_wait_array(sma); + /* Prevent parallel simple ops */ + complexmode_enter(sma); return -1; } /* * Only one semaphore affected - try to optimize locking. - * The rules are: - * - optimized locking is possible if no complex operation - * is either enqueued or processed right now. - * - The test for enqueued complex ops is simple: - * sma->complex_count != 0 - * - Testing for complex ops that are processed right now is - * a bit more difficult. Complex ops acquire the full lock - * and first wait that the running simple ops have completed. - * (see above) - * Thus: If we own a simple lock and the global lock is free - * and complex_count is now 0, then it will stay 0 and - * thus just locking sem->lock is sufficient. + * Optimized locking is possible if no complex operation + * is either enqueued or processed right now. + * + * Both facts are tracked by complex_mode. */ sem = sma->sem_base + sops->sem_num; - if (sma->complex_count == 0) { + /* + * Initial check for complex_mode. Just an optimization, + * no locking. + */ + if (!READ_ONCE(sma->complex_mode)) { /* * It appears that no complex operation is around. * Acquire the per-semaphore lock. 
*/ spin_lock(&sem->lock); - /* Then check that the global lock is free */ - if (!spin_is_locked(&sma->sem_perm.lock)) { - /* - * We need a memory barrier with acquire semantics, - * otherwise we can race with another thread that does: - * complex_count++; - * spin_unlock(sem_perm.lock); - */ - smp_acquire__after_ctrl_dep(); + /* + * A full barrier is required: the write of sem->lock + * must be visible before the read is executed + */ + smp_mb(); - /* - * Now repeat the test of complex_count: - * It can't change anymore until we drop sem->lock. - * Thus: if is now 0, then it will stay 0. - */ - if (sma->complex_count == 0) { - /* fast path successful! */ - return sops->sem_num; - } + if (!READ_ONCE(sma->complex_mode)) { + /* fast path successful! */ + return sops->sem_num; } spin_unlock(&sem->lock); } @@ -369,7 +383,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Not a false alarm, thus complete the sequence for a * full lock. */ - sem_wait_array(sma); + complexmode_enter(sma); return -1; } } @@ -378,6 +392,7 @@ static inline void sem_unlock(struct sem_array *sma, int locknum) { if (locknum == -1) { unmerge_queues(sma); + complexmode_tryleave(sma); ipc_unlock_object(&sma->sem_perm); } else { struct sem *sem = sma->sem_base + locknum; @@ -529,6 +544,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params) } sma->complex_count = 0; + sma->complex_mode = true; /* dropped by sem_unlock below */ INIT_LIST_HEAD(&sma->pending_alter); INIT_LIST_HEAD(&sma->pending_const); INIT_LIST_HEAD(&sma->list_id); @@ -2184,10 +2200,10 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) /* * The proc interface isn't aware of sem_lock(), it calls * ipc_lock_object() directly (in sysvipc_find_ipc). - * In order to stay compatible with sem_lock(), we must wait until - * all simple semop() calls have left their critical regions. + * In order to stay compatible with sem_lock(), we must + * enter / leave complex_mode. 
*/ - sem_wait_array(sma); + complexmode_enter(sma); sem_otime = get_semotime(sma); @@ -2204,6 +2220,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) sem_otime, sma->sem_ctime); + complexmode_tryleave(sma); + return 0; } #endif -- 2.5.5 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 0/2] ipc/sem.c: sem_lock fixes @ 2016-07-13 5:06 Manfred Spraul 2016-07-13 5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul 0 siblings, 1 reply; 17+ messages in thread From: Manfred Spraul @ 2016-07-13 5:06 UTC (permalink / raw) To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul Hi Andrew, Hi Peter, next version of the sem_lock() fixes: The patches are again vs. tip. Patch 1 is ready for merging, Patch 2 is for review. - Patch 1 is the patch as in -next since January It fixes the race that was found by Felix. - Patch 2 removes the memory barriers that are part of the qspinlock code. - (The hysteresis patch would be patch 3. The risk of regressions can't be ruled out, thus it must wait for benchmarks from real workload tests) Patch 1+2 were one patch, I've split patches so that review and backporting are simpler. -- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-07-13 5:06 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul @ 2016-07-13 5:06 ` Manfred Spraul 2016-07-16 1:27 ` Davidlohr Bueso 0 siblings, 1 reply; 17+ messages in thread From: Manfred Spraul @ 2016-07-13 5:06 UTC (permalink / raw) To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul, stable Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: sem_lock has a fast path that allows parallel simple operations. There are two reasons why a simple operation cannot run in parallel: - a non-simple operation is ongoing (sma->sem_perm.lock held) - a complex operation is sleeping (sma->complex_count != 0) As both facts are stored independently, a thread can bypass the current checks by sleeping in the right positions. See below for more details (or kernel bugzilla 105651). The patch fixes that by creating one variable (complex_mode) that tracks both reasons why parallel operations are not possible. The patch also updates stale documentation regarding the locking. With regards to stable kernels: The patch is required for all kernels that include the commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?) The alternative is to revert the patch that introduced the race. The patch is safe for backporting, i.e. it makes no assumptions about memory barriers in spin_unlock_wait() or that the acquire within spin_lock() is after setting the lock variable. Background: Here is the race of the current implementation: Thread A: (simple op) - does the first "sma->complex_count == 0" test Thread B: (complex op) - does sem_lock(): This includes an array scan. But the scan can't find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count, drops sem_lock, sleeps Thread A: - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock) - sleeps before the complex_count test Thread C: (complex op) - does sem_lock (no array scan, complex_count==1) - wakes up Thread B. - decrements complex_count Thread A: - does the complex_count test Bug: Now both thread A and thread C operate on the same array, without any synchronization. Full memory barriers are required to synchronize changes of complex_mode and the lock operations. Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") Reported-by: felixh@informatik.uni-bremen.de Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> --- include/linux/sem.h | 1 + ipc/sem.c | 130 +++++++++++++++++++++++++++++++--------------------- 2 files changed, 79 insertions(+), 52 deletions(-) diff --git a/include/linux/sem.h b/include/linux/sem.h index 976ce3a..d0efd6e 100644 --- a/include/linux/sem.h +++ b/include/linux/sem.h @@ -21,6 +21,7 @@ struct sem_array { struct list_head list_id; /* undo requests on this array */ int sem_nsems; /* no. of semaphores in array */ int complex_count; /* pending complex operations */ + bool complex_mode; /* no parallel simple ops */ }; #ifdef CONFIG_SYSVIPC diff --git a/ipc/sem.c b/ipc/sem.c index ae72b3c..0da63c8 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it); /* * Locking: + * a) global sem_lock() for read/write * sem_undo.id_next, * sem_array.complex_count, - * sem_array.pending{_alter,_cont}, - * sem_array.sem_undo: global sem_lock() for read/write - * sem_undo.proc_next: only "current" is allowed to read/write that field.
+ * sem_array.complex_mode + * sem_array.pending{_alter,_const}, + * sem_array.sem_undo * + * b) global or semaphore sem_lock() for read/write: * sem_array.sem_base[i].pending_{const,alter}: - * global or semaphore sem_lock() for read/write + * sem_array.complex_mode (for read) + * + * c) special: + * sem_undo_list.list_proc: + * * undo_list->lock for write + * * rcu for read */ #define sc_semmsl sem_ctls[0] @@ -260,28 +267,59 @@ static void sem_rcu_free(struct rcu_head *head) } /* - * Wait until all currently ongoing simple ops have completed. + * Enter the mode suitable for non-simple operations: * Caller must own sem_perm.lock. - * New simple ops cannot start, because simple ops first check - * that sem_perm.lock is free. - * that a) sem_perm.lock is free and b) complex_count is 0. */ -static void sem_wait_array(struct sem_array *sma) +static void complexmode_enter(struct sem_array *sma) { int i; struct sem *sem; - if (sma->complex_count) { - /* The thread that increased sma->complex_count waited on - * all sem->lock locks. Thus we don't need to wait again. - */ + if (sma->complex_mode) { - /* We are already in complex_mode. Nothing to do */ return; } + WRITE_ONCE(sma->complex_mode, true); + + /* We need a full barrier: + * The write to complex_mode must be visible + * before we read the first sem->lock spinlock state. + */ + smp_mb(); for (i = 0; i < sma->sem_nsems; i++) { sem = sma->sem_base + i; spin_unlock_wait(&sem->lock); } + /* + * spin_unlock_wait() is not a memory barrier, it is only a + * control barrier. The code must pair with spin_unlock(&sem->lock), + * thus just the control barrier is insufficient. + * + * smp_rmb() is sufficient, as writes cannot pass the control barrier. + */ + smp_rmb(); +} + +/* + * Try to leave the mode that disallows simple operations: + * Caller must own sem_perm.lock. + */ +static void complexmode_tryleave(struct sem_array *sma) +{ + if (sma->complex_count) { + /* Complex ops are sleeping.
+ * We must stay in complex mode + */ + return; + } + /* + * Immediately after setting complex_mode to false, + * a simple op can start. Thus: all memory writes + * performed by the current operation must be visible + * before we set complex_mode to false. + */ + smp_store_release(&sma->complex_mode, false); } /* @@ -300,56 +338,40 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Complex operation - acquire a full lock */ ipc_lock_object(&sma->sem_perm); - /* And wait until all simple ops that are processed - * right now have dropped their locks. - */ - sem_wait_array(sma); + /* Prevent parallel simple ops */ + complexmode_enter(sma); return -1; } /* * Only one semaphore affected - try to optimize locking. - * The rules are: - * - optimized locking is possible if no complex operation - * is either enqueued or processed right now. - * - The test for enqueued complex ops is simple: - * sma->complex_count != 0 - * - Testing for complex ops that are processed right now is - * a bit more difficult. Complex ops acquire the full lock - * and first wait that the running simple ops have completed. - * (see above) - * Thus: If we own a simple lock and the global lock is free - * and complex_count is now 0, then it will stay 0 and - * thus just locking sem->lock is sufficient. + * Optimized locking is possible if no complex operation + * is either enqueued or processed right now. + * + * Both facts are tracked by complex_mode. */ sem = sma->sem_base + sops->sem_num; - if (sma->complex_count == 0) { + /* + * Initial check for complex_mode. Just an optimization, + * no locking, no memory barrier. + */ + if (!READ_ONCE(sma->complex_mode)) { /* * It appears that no complex operation is around. * Acquire the per-semaphore lock. 
*/ spin_lock(&sem->lock); - /* Then check that the global lock is free */ - if (!spin_is_locked(&sma->sem_perm.lock)) { - /* - * We need a memory barrier with acquire semantics, - * otherwise we can race with another thread that does: - * complex_count++; - * spin_unlock(sem_perm.lock); - */ - smp_acquire__after_ctrl_dep(); + /* + * A full barrier is required: the write of sem->lock + * must be visible before the read is executed + */ + smp_mb(); - /* - * Now repeat the test of complex_count: - * It can't change anymore until we drop sem->lock. - * Thus: if is now 0, then it will stay 0. - */ - if (sma->complex_count == 0) { - /* fast path successful! */ - return sops->sem_num; - } + if (!smp_load_acquire(&sma->complex_mode)) { + /* fast path successful! */ + return sops->sem_num; } spin_unlock(&sem->lock); } @@ -369,7 +391,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Not a false alarm, thus complete the sequence for a * full lock. */ - sem_wait_array(sma); + complexmode_enter(sma); return -1; } } @@ -378,6 +400,7 @@ static inline void sem_unlock(struct sem_array *sma, int locknum) { if (locknum == -1) { unmerge_queues(sma); + complexmode_tryleave(sma); ipc_unlock_object(&sma->sem_perm); } else { struct sem *sem = sma->sem_base + locknum; @@ -529,6 +552,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params) } sma->complex_count = 0; + sma->complex_mode = true; /* dropped by sem_unlock below */ INIT_LIST_HEAD(&sma->pending_alter); INIT_LIST_HEAD(&sma->pending_const); INIT_LIST_HEAD(&sma->list_id); @@ -2184,10 +2208,10 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) /* * The proc interface isn't aware of sem_lock(), it calls * ipc_lock_object() directly (in sysvipc_find_ipc). - * In order to stay compatible with sem_lock(), we must wait until - * all simple semop() calls have left their critical regions. + * In order to stay compatible with sem_lock(), we must + * enter / leave complex_mode. 
*/ - sem_wait_array(sma); + complexmode_enter(sma); sem_otime = get_semotime(sma); @@ -2204,6 +2228,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) sem_otime, sma->sem_ctime); + complexmode_tryleave(sma); + return 0; } #endif -- 2.5.5 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-07-13  5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
@ 2016-07-16  1:27   ` Davidlohr Bueso
  0 siblings, 0 replies; 17+ messages in thread
From: Davidlohr Bueso @ 2016-07-16  1:27 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML,
	Thomas Gleixner, Ingo Molnar, 1vier1, felixh, stable

On Wed, 13 Jul 2016, Manfred Spraul wrote:

>-static void sem_wait_array(struct sem_array *sma)
>+static void complexmode_enter(struct sem_array *sma)
> {
> 	int i;
> 	struct sem *sem;
>
>-	if (sma->complex_count) {
>-		/* The thread that increased sma->complex_count waited on
>-		 * all sem->lock locks. Thus we don't need to wait again.
>-		 */
>+	if (sma->complex_mode) {
>+		/* We are already in complex_mode. Nothing to do */
> 		return;
> 	}
>+	WRITE_ONCE(sma->complex_mode, true);

So we can actually save those READ/WRITE_ONCE calls for complex_mode as
it's a bool and therefore tearing is not an issue.

>+
>+	/* We need a full barrier:
>+	 * The write to complex_mode must be visible
>+	 * before we read the first sem->lock spinlock state.
>+	 */
>+	smp_mb();

smp_store_mb()?

> /*
>@@ -300,56 +338,40 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
> 		/* Complex operation - acquire a full lock */
> 		ipc_lock_object(&sma->sem_perm);
>
>-		/* And wait until all simple ops that are processed
>-		 * right now have dropped their locks.
>-		 */
>-		sem_wait_array(sma);
>+		/* Prevent parallel simple ops */
>+		complexmode_enter(sma);
> 		return -1;

nit and unrelated: we should probably use some better label here than a
raw -1 (although I don't see it changing, just for nicer reading), ie:
SEM_OBJECT_LOCKED

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2016-07-16  1:27 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-15  5:23 linux-next: manual merge of the akpm-current tree with the tip tree Stephen Rothwell
2016-06-18 19:39 ` Manfred Spraul
2016-06-18 20:02   ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
2016-06-18 20:02     ` [PATCH 2/2] ipc/sem: sem_lock with hysteresis Manfred Spraul
2016-06-21 20:29       ` Davidlohr Bueso
2016-06-25 17:37         ` Manfred Spraul
2016-06-28 17:54           ` Davidlohr Bueso
2016-06-20 23:04     ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Andrew Morton
2016-06-23 18:55       ` Manfred Spraul
2016-06-21  0:30     ` Davidlohr Bueso
2016-06-23 19:22       ` Manfred Spraul
2016-06-28  5:27         ` Davidlohr Bueso
2016-06-30 19:28           ` Manfred Spraul
2016-07-01 16:52             ` Davidlohr Bueso
2016-06-25 17:37 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul
2016-06-25 17:37 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
2016-07-13  5:06 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul
2016-07-13  5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
2016-07-16  1:27   ` Davidlohr Bueso