All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] ipc/sem.c: sem_lock fixes
@ 2016-07-13  5:06 Manfred Spraul
  2016-07-13  5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
  2016-07-13 22:05 ` [PATCH 0/2] ipc/sem.c: sem_lock fixes Andrew Morton
  0 siblings, 2 replies; 20+ messages in thread
From: Manfred Spraul @ 2016-07-13  5:06 UTC (permalink / raw)
  To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul

Hi Andrew, Hi Peter,

next version of the sem_lock() fixes:
The patches are again vs. tip.

Patch 1 is ready for merging, Patch 2 is for review.

- Patch 1 is the patch as in -next since January
  It fixes the race that was found by Felix.
- Patch 2 removes the memory barriers that are part of the qspinlock
  code.
- (The hysteresis patch would be patch 3. The risk of regressions
  can't be ruled out, thus it must wait for benchmarks from real
  workload tests)

Patch 1+2 were one patch, I've split patches so that review and
backporting are simpler.

--
        Manfred

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-07-13  5:06 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul
@ 2016-07-13  5:06 ` Manfred Spraul
  2016-07-13  5:06   ` [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers Manfred Spraul
  2016-07-16  1:27   ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Davidlohr Bueso
  2016-07-13 22:05 ` [PATCH 0/2] ipc/sem.c: sem_lock fixes Andrew Morton
  1 sibling, 2 replies; 20+ messages in thread
From: Manfred Spraul @ 2016-07-13  5:06 UTC (permalink / raw)
  To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh,
	Manfred Spraul, stable

Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
race:

sem_lock has a fast path that allows parallel simple operations.
There are two reasons why a simple operation cannot run in parallel:
- a non-simple operations is ongoing (sma->sem_perm.lock held)
- a complex operation is sleeping (sma->complex_count != 0)

As both facts are stored independently, a thread can bypass the current
checks by sleeping in the right positions. See below for more details
(or kernel bugzilla 105651).

The patch fixes that by creating one variable (complex_mode)
that tracks both reasons why parallel operations are not possible.

The patch also updates stale documentation regarding the locking.

With regards to stable kernels:
The patch is required for all kernels that include the
commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)

The alternative is to revert the patch that introduced the race.

The patch is safe for backporting, i.e. it makes no assumptions
about memory barriers in spin_unlock_wait() or that the acquire
within spin_lock() is after setting the lock variable.

Background:
Here is the race of the current implementation:

Thread A: (simple op)
- does the first "sma->complex_count == 0" test

Thread B: (complex op)
- does sem_lock(): This includes an array scan. But the scan can't
  find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count,
  drops sem_lock, sleeps

Thread A:
- spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
- sleeps before the complex_count test

Thread C: (complex op)
- does sem_lock (no array scan, complex_count==1)
- wakes up Thread B.
- decrements complex_count

Thread A:
- does the complex_count test

Bug:
Now both thread A and thread C operate on the same array, without
any synchronization.

Full memory barrier are required to synchronize changes of
complex_mode and the lock operations.

Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
Reported-by: felixh@informatik.uni-bremen.de
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/sem.h |   1 +
 ipc/sem.c           | 130 +++++++++++++++++++++++++++++++---------------------
 2 files changed, 79 insertions(+), 52 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 976ce3a..d0efd6e 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -21,6 +21,7 @@ struct sem_array {
 	struct list_head	list_id;	/* undo requests on this array */
 	int			sem_nsems;	/* no. of semaphores in array */
 	int			complex_count;	/* pending complex operations */
+	bool			complex_mode;	/* no parallel simple ops */
 };
 
 #ifdef CONFIG_SYSVIPC
diff --git a/ipc/sem.c b/ipc/sem.c
index ae72b3c..0da63c8 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 
 /*
  * Locking:
+ * a) global sem_lock() for read/write
  *	sem_undo.id_next,
  *	sem_array.complex_count,
- *	sem_array.pending{_alter,_cont},
- *	sem_array.sem_undo: global sem_lock() for read/write
- *	sem_undo.proc_next: only "current" is allowed to read/write that field.
+ *	sem_array.complex_mode
+ *	sem_array.pending{_alter,_const},
+ *	sem_array.sem_undo
  *
+ * b) global or semaphore sem_lock() for read/write:
  *	sem_array.sem_base[i].pending_{const,alter}:
- *		global or semaphore sem_lock() for read/write
+ *	sem_array.complex_mode (for read)
+ *
+ * c) special:
+ *	sem_undo_list.list_proc:
+ *	* undo_list->lock for write
+ *	* rcu for read
  */
 
 #define sc_semmsl	sem_ctls[0]
@@ -260,28 +267,59 @@ static void sem_rcu_free(struct rcu_head *head)
 }
 
 /*
- * Wait until all currently ongoing simple ops have completed.
+ * Enter the mode suitable for non-simple operations:
  * Caller must own sem_perm.lock.
- * New simple ops cannot start, because simple ops first check
- * that sem_perm.lock is free.
- * that a) sem_perm.lock is free and b) complex_count is 0.
  */
-static void sem_wait_array(struct sem_array *sma)
+static void complexmode_enter(struct sem_array *sma)
 {
 	int i;
 	struct sem *sem;
 
-	if (sma->complex_count)  {
-		/* The thread that increased sma->complex_count waited on
-		 * all sem->lock locks. Thus we don't need to wait again.
-		 */
+	if (sma->complex_mode)  {
+		/* We are already in complex_mode. Nothing to do */
 		return;
 	}
+	WRITE_ONCE(sma->complex_mode, true);
+
+	/* We need a full barrier:
+	 * The write to complex_mode must be visible
+	 * before we read the first sem->lock spinlock state.
+	 */
+	smp_mb();
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
 		spin_unlock_wait(&sem->lock);
 	}
+	/*
+	 * spin_unlock_wait() is not a memory barriers, it is only a
+	 * control barrier. The code must pair with spin_unlock(&sem->lock),
+	 * thus just the control barrier is insufficient.
+	 *
+	 * smp_rmb() is sufficient, as writes cannot pass the control barrier.
+	 */
+	smp_rmb();
+}
+
+/*
+ * Try to leave the mode that disallows simple operations:
+ * Caller must own sem_perm.lock.
+ */
+static void complexmode_tryleave(struct sem_array *sma)
+{
+	if (sma->complex_count)  {
+		/* Complex ops are sleeping.
+		 * We must stay in complex mode
+		 */
+		return;
+	}
+	/*
+	 * Immediately after setting complex_mode to false,
+	 * a simple op can start. Thus: all memory writes
+	 * performed by the current operation must be visible
+	 * before we set complex_mode to false.
+	 */
+	smp_store_release(&sma->complex_mode, false);
 }
 
 /*
@@ -300,56 +338,40 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Complex operation - acquire a full lock */
 		ipc_lock_object(&sma->sem_perm);
 
-		/* And wait until all simple ops that are processed
-		 * right now have dropped their locks.
-		 */
-		sem_wait_array(sma);
+		/* Prevent parallel simple ops */
+		complexmode_enter(sma);
 		return -1;
 	}
 
 	/*
 	 * Only one semaphore affected - try to optimize locking.
-	 * The rules are:
-	 * - optimized locking is possible if no complex operation
-	 *   is either enqueued or processed right now.
-	 * - The test for enqueued complex ops is simple:
-	 *      sma->complex_count != 0
-	 * - Testing for complex ops that are processed right now is
-	 *   a bit more difficult. Complex ops acquire the full lock
-	 *   and first wait that the running simple ops have completed.
-	 *   (see above)
-	 *   Thus: If we own a simple lock and the global lock is free
-	 *	and complex_count is now 0, then it will stay 0 and
-	 *	thus just locking sem->lock is sufficient.
+	 * Optimized locking is possible if no complex operation
+	 * is either enqueued or processed right now.
+	 *
+	 * Both facts are tracked by complex_mode.
 	 */
 	sem = sma->sem_base + sops->sem_num;
 
-	if (sma->complex_count == 0) {
+	/*
+	 * Initial check for complex_mode. Just an optimization,
+	 * no locking, no memory barrier.
+	 */
+	if (!READ_ONCE(sma->complex_mode)) {
 		/*
 		 * It appears that no complex operation is around.
 		 * Acquire the per-semaphore lock.
 		 */
 		spin_lock(&sem->lock);
 
-		/* Then check that the global lock is free */
-		if (!spin_is_locked(&sma->sem_perm.lock)) {
-			/*
-			 * We need a memory barrier with acquire semantics,
-			 * otherwise we can race with another thread that does:
-			 *	complex_count++;
-			 *	spin_unlock(sem_perm.lock);
-			 */
-			smp_acquire__after_ctrl_dep();
+		/*
+		 * A full barrier is required: the write of sem->lock
+		 * must be visible before the read is executed
+		 */
+		smp_mb();
 
-			/*
-			 * Now repeat the test of complex_count:
-			 * It can't change anymore until we drop sem->lock.
-			 * Thus: if is now 0, then it will stay 0.
-			 */
-			if (sma->complex_count == 0) {
-				/* fast path successful! */
-				return sops->sem_num;
-			}
+		if (!smp_load_acquire(&sma->complex_mode)) {
+			/* fast path successful! */
+			return sops->sem_num;
 		}
 		spin_unlock(&sem->lock);
 	}
@@ -369,7 +391,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Not a false alarm, thus complete the sequence for a
 		 * full lock.
 		 */
-		sem_wait_array(sma);
+		complexmode_enter(sma);
 		return -1;
 	}
 }
@@ -378,6 +400,7 @@ static inline void sem_unlock(struct sem_array *sma, int locknum)
 {
 	if (locknum == -1) {
 		unmerge_queues(sma);
+		complexmode_tryleave(sma);
 		ipc_unlock_object(&sma->sem_perm);
 	} else {
 		struct sem *sem = sma->sem_base + locknum;
@@ -529,6 +552,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	}
 
 	sma->complex_count = 0;
+	sma->complex_mode = true; /* dropped by sem_unlock below */
 	INIT_LIST_HEAD(&sma->pending_alter);
 	INIT_LIST_HEAD(&sma->pending_const);
 	INIT_LIST_HEAD(&sma->list_id);
@@ -2184,10 +2208,10 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 	/*
 	 * The proc interface isn't aware of sem_lock(), it calls
 	 * ipc_lock_object() directly (in sysvipc_find_ipc).
-	 * In order to stay compatible with sem_lock(), we must wait until
-	 * all simple semop() calls have left their critical regions.
+	 * In order to stay compatible with sem_lock(), we must
+	 * enter / leave complex_mode.
 	 */
-	sem_wait_array(sma);
+	complexmode_enter(sma);
 
 	sem_otime = get_semotime(sma);
 
@@ -2204,6 +2228,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 		   sem_otime,
 		   sma->sem_ctime);
 
+	complexmode_tryleave(sma);
+
 	return 0;
 }
 #endif
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers.
  2016-07-13  5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
@ 2016-07-13  5:06   ` Manfred Spraul
  2016-07-13 16:16     ` Davidlohr Bueso
  2016-07-16  1:27   ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Davidlohr Bueso
  1 sibling, 1 reply; 20+ messages in thread
From: Manfred Spraul @ 2016-07-13  5:06 UTC (permalink / raw)
  To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul

With 2c610022711 (locking/qspinlock: Fix spin_unlock_wait() some more),
memory barriers were added into spin_unlock_wait().
Thus another barrier is not required.

And as explained in 055ce0fd1b8 (locking/qspinlock: Add comments),
spin_lock() provides a barrier so that reads within the critical
section cannot happen before the write for the lock is visible.
i.e. spin_lock provides an acquire barrier after the write of the lock
variable, this barrier pairs with the smp_mb() in complexmode_enter().

Please review!
For x86, the patch is safe. But I don't know enough about all archs
that support SMP.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 0da63c8..d7b4212 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -291,14 +291,6 @@ static void complexmode_enter(struct sem_array *sma)
 		sem = sma->sem_base + i;
 		spin_unlock_wait(&sem->lock);
 	}
-	/*
-	 * spin_unlock_wait() is not a memory barriers, it is only a
-	 * control barrier. The code must pair with spin_unlock(&sem->lock),
-	 * thus just the control barrier is insufficient.
-	 *
-	 * smp_rmb() is sufficient, as writes cannot pass the control barrier.
-	 */
-	smp_rmb();
 }
 
 /*
@@ -363,12 +355,6 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		 */
 		spin_lock(&sem->lock);
 
-		/*
-		 * A full barrier is required: the write of sem->lock
-		 * must be visible before the read is executed
-		 */
-		smp_mb();
-
 		if (!smp_load_acquire(&sma->complex_mode)) {
 			/* fast path successful! */
 			return sops->sem_num;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers.
  2016-07-13  5:06   ` [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers Manfred Spraul
@ 2016-07-13 16:16     ` Davidlohr Bueso
  2016-07-13 18:37       ` Manfred Spraul
  0 siblings, 1 reply; 20+ messages in thread
From: Davidlohr Bueso @ 2016-07-13 16:16 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML,
	Thomas Gleixner, Ingo Molnar, 1vier1, felixh

Manfred, shouldn't this patch be part of patch 1 (as you add the unnecessary barriers there? 
Iow, can we have a single patch for all this?

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers.
  2016-07-13 16:16     ` Davidlohr Bueso
@ 2016-07-13 18:37       ` Manfred Spraul
  0 siblings, 0 replies; 20+ messages in thread
From: Manfred Spraul @ 2016-07-13 18:37 UTC (permalink / raw)
  To: Davidlohr Bueso, Michael Ellerman
  Cc: H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML,
	Thomas Gleixner, Ingo Molnar, 1vier1, felixh

Hi Davidlohr,

On 07/13/2016 06:16 PM, Davidlohr Bueso wrote:
> Manfred, shouldn't this patch be part of patch 1 (as you add the 
> unnecessary barriers there? Iow, can we have a single patch for all this?
>
Two reasons:
- patch 1 is safe for backporting, patch 2 not.
- patch 1 is safe on all architectures, for patch 2 I would like to get 
some review feedback.

e.g. I just found/read 51d7d5205d33 ("powerpc: Add smp_mb() to 
arch_spin_is_locked()"):
For powerpc, a smp_mb() was added into spin_is_locked(), more or less 
for ipc/sem.c.

Patch 1 replaces the spin_is_locked() with smp_load_acquire().
Isn't that the proof that smp_mb() is required?



--
     Manfred

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] ipc/sem.c: sem_lock fixes
  2016-07-13  5:06 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul
  2016-07-13  5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
@ 2016-07-13 22:05 ` Andrew Morton
  2016-07-14 16:40   ` Manfred Spraul
  1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2016-07-13 22:05 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: H. Peter Anvin, Peter Zijlstra, Davidlohr Bueso, LKML,
	Thomas Gleixner, Ingo Molnar, 1vier1, felixh

On Wed, 13 Jul 2016 07:06:50 +0200 Manfred Spraul <manfred@colorfullife.com> wrote:

> Hi Andrew, Hi Peter,
> 
> next version of the sem_lock() fixes:
> The patches are again vs. tip.
> 
> Patch 1 is ready for merging, Patch 2 is for review.
> 
> - Patch 1 is the patch as in -next since January
>   It fixes the race that was found by Felix.
> - Patch 2 removes the memory barriers that are part of the qspinlock
>   code.
> - (The hysteresis patch would be patch 3. The risk of regressions
>   can't be ruled out, thus it must wait for benchmarks from real
>   workload tests)

I think you're saying that if these two patches cause performance
regressions, we will need ipc-sem-sem_lock-with-hysteresis.patch?

Is that even necessary?  If your testing shows that
ipc-sem-sem_lock-with-hysteresis.patch makes things faster then in it
goes, surely?  

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] ipc/sem.c: sem_lock fixes
  2016-07-13 22:05 ` [PATCH 0/2] ipc/sem.c: sem_lock fixes Andrew Morton
@ 2016-07-14 16:40   ` Manfred Spraul
  0 siblings, 0 replies; 20+ messages in thread
From: Manfred Spraul @ 2016-07-14 16:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: H. Peter Anvin, Peter Zijlstra, Davidlohr Bueso, LKML,
	Thomas Gleixner, Ingo Molnar, 1vier1, felixh

Hi Andrew,

On 07/14/2016 12:05 AM, Andrew Morton wrote:
> On Wed, 13 Jul 2016 07:06:50 +0200 Manfred Spraul <manfred@colorfullife.com> wrote:
>
>> Hi Andrew, Hi Peter,
>>
>> next version of the sem_lock() fixes:
>> The patches are again vs. tip.
>>
>> Patch 1 is ready for merging, Patch 2 is for review.
>>
>> - Patch 1 is the patch as in -next since January
>>    It fixes the race that was found by Felix.
>> - Patch 2 removes the memory barriers that are part of the qspinlock
>>    code.
>> - (The hysteresis patch would be patch 3. The risk of regressions
>>    can't be ruled out, thus it must wait for benchmarks from real
>>    workload tests)
> I think you're saying that if these two patches cause performance
> regressions, we will need ipc-sem-sem_lock-with-hysteresis.patch?
No, the two patches will not cause any regressions.

Commit 6062a8dc051 ("ipc,sem: fine grained locking for semtimedop") was 
a huge scalability improvement for 99% of the users, but introduced a 
regression for one workload.
Noone complained about it, so the workload must be rare.
Patch 3 now fixes the regression.
But we live with the introduced regression for 3 years, so give me (and 
Davidlohr, and anyone else who wants to support) some more time.
Also on my TODO list: The queue merge/unmerge logic, i.e. commit 
f269f40ad5ae ("ipc/sem.c: always use only one queue for alter 
operations") might also win from some hysteresis.

My proposal:
- patch 1: merge towards main tree.
- patch 2: needs update. The barrier in sem_lock() is required. I'm not 
yet sure about complexmode_enter(), perhaps with #ifndef 
CONFIG_QUEUED_SPINLOCKS
- patch 3 should wait.

--

     Manfred

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-07-13  5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
  2016-07-13  5:06   ` [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers Manfred Spraul
@ 2016-07-16  1:27   ` Davidlohr Bueso
  1 sibling, 0 replies; 20+ messages in thread
From: Davidlohr Bueso @ 2016-07-16  1:27 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML,
	Thomas Gleixner, Ingo Molnar, 1vier1, felixh, stable

On Wed, 13 Jul 2016, Manfred Spraul wrote:

>-static void sem_wait_array(struct sem_array *sma)
>+static void complexmode_enter(struct sem_array *sma)
> {
> 	int i;
> 	struct sem *sem;
>
>-	if (sma->complex_count)  {
>-		/* The thread that increased sma->complex_count waited on
>-		 * all sem->lock locks. Thus we don't need to wait again.
>-		 */
>+	if (sma->complex_mode)  {
>+		/* We are already in complex_mode. Nothing to do */
> 		return;
> 	}
>+	WRITE_ONCE(sma->complex_mode, true);

So we can actually save those READ/WRITE_ONCE calls for complex_mode as it's
a bool and therefore tearing is not an issue.

>+
>+	/* We need a full barrier:
>+	 * The write to complex_mode must be visible
>+	 * before we read the first sem->lock spinlock state.
>+	 */
>+	smp_mb();

smp_store_mb()?

> /*
>@@ -300,56 +338,40 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
> 		/* Complex operation - acquire a full lock */
> 		ipc_lock_object(&sma->sem_perm);
>
>-		/* And wait until all simple ops that are processed
>-		 * right now have dropped their locks.
>-		 */
>-		sem_wait_array(sma);
>+		/* Prevent parallel simple ops */
>+		complexmode_enter(sma);
> 		return -1;

nit and unrelated: we should probably use some better label here than a raw
-1 (although I don't see it changing, just for nicer reading), ie: SEM_OBJECT_LOCKED

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-30 19:28         ` Manfred Spraul
@ 2016-07-01 16:52           ` Davidlohr Bueso
  0 siblings, 0 replies; 20+ messages in thread
From: Davidlohr Bueso @ 2016-07-01 16:52 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh,
	stable

On Thu, 30 Jun 2016, Manfred Spraul wrote:

>On 06/28/2016 07:27 AM, Davidlohr Bueso wrote:

>If I understand it right, it means that spin_lock() is both an acquire 
>and a release - for qspinlocks.

I wouldn't say that: the _Q_LOCKED_VAL stores are still unordered inside
the acquire region but no further than the the unlock store-release; which
is fine for regular mutual exclusion.

>It this valid for all spinlock implementations, for all architectures?
>Otherwise: How can I detect in generic code if I can rely on a release 
>inside spin_lock()?

With ticket locks this was never an issue as special lock readers
(spin_unlock_wait(), spin_is_locked()) will always see the lock taken as
they observe waiters by adding itself to the tail -- something the above
commit mimics and piggy backs on to save expensive smp_rmb().

In addition, see 726328d92a4 (locking/spinlock, arch: Update and fix
spin_unlock_wait() implementations).

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-28  5:27       ` Davidlohr Bueso
@ 2016-06-30 19:28         ` Manfred Spraul
  2016-07-01 16:52           ` Davidlohr Bueso
  0 siblings, 1 reply; 20+ messages in thread
From: Manfred Spraul @ 2016-06-30 19:28 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh,
	stable

On 06/28/2016 07:27 AM, Davidlohr Bueso wrote:
> On Thu, 23 Jun 2016, Manfred Spraul wrote:
>
>> What I'm not sure yet is if smp_load_acquire() is sufficient:
>>
>> Thread A:
>>>       if (!READ_ONCE(sma->complex_mode)) {
>> The code is test_and_test, no barrier requirements for first test
>
> Yeah, it would just make us take the big lock unnecessarily, nothing 
> fatal
> and I agree its probably worth the optimization. It still might be worth
> commenting.
>
I'll extend the comment: "no locking and no memory barrier"
>>>                /*
>>>                 * It appears that no complex operation is around.
>>>                 * Acquire the per-semaphore lock.
>>>                 */
>>>                spin_lock(&sem->lock);
>>>
>>>                if (!smp_load_acquire(&sma->complex_mode)) {
>>>                        /* fast path successful! */
>>>                        return sops->sem_num;
>>>                }
>>>                spin_unlock(&sem->lock);
>>>        }
>>
>> Thread B:
>>>       WRITE_ONCE(sma->complex_mode, true);
>>>
>>>        /* We need a full barrier:
>>>         * The write to complex_mode must be visible
>>>         * before we read the first sem->lock spinlock state.
>>>         */
>>>        smp_mb();
>>>
>>>        for (i = 0; i < sma->sem_nsems; i++) {
>>>                sem = sma->sem_base + i;
>>>                spin_unlock_wait(&sem->lock);
>>>        }
>>
>> If thread A is allowed to issue read_spinlock;read complex_mode;write 
>> spinlock, then thread B would not notice that thread A is in the 
>> critical section
>
> Are you referring to the sem->lock word not being visibly locked 
> before we
> read complex_mode (for the second time)? This issue was fixed in 
> 2c610022711
> (locking/qspinlock: Fix spin_unlock_wait() some more). So 
> smp_load_acquire
> should be enough afaict, or are you referring to something else?
>
You are right, I didn't read this patch fully.
If I understand it right, it means that spin_lock() is both an acquire 
and a release - for qspinlocks.

It this valid for all spinlock implementations, for all architectures?
Otherwise: How can I detect in generic code if I can rely on a release 
inside spin_lock()?

--
     Manfred

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-23 19:22     ` Manfred Spraul
@ 2016-06-28  5:27       ` Davidlohr Bueso
  2016-06-30 19:28         ` Manfred Spraul
  0 siblings, 1 reply; 20+ messages in thread
From: Davidlohr Bueso @ 2016-06-28  5:27 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh,
	stable

On Thu, 23 Jun 2016, Manfred Spraul wrote:

>What I'm not sure yet is if smp_load_acquire() is sufficient:
>
>Thread A:
>>       if (!READ_ONCE(sma->complex_mode)) {
>The code is test_and_test, no barrier requirements for first test

Yeah, it would just make us take the big lock unnecessarily, nothing fatal
and I agree its probably worth the optimization. It still might be worth
commenting.

>>                /*
>>                 * It appears that no complex operation is around.
>>                 * Acquire the per-semaphore lock.
>>                 */
>>                spin_lock(&sem->lock);
>>
>>                if (!smp_load_acquire(&sma->complex_mode)) {
>>                        /* fast path successful! */
>>                        return sops->sem_num;
>>                }
>>                spin_unlock(&sem->lock);
>>        }
>
>Thread B:
>>       WRITE_ONCE(sma->complex_mode, true);
>>
>>        /* We need a full barrier:
>>         * The write to complex_mode must be visible
>>         * before we read the first sem->lock spinlock state.
>>         */
>>        smp_mb();
>>
>>        for (i = 0; i < sma->sem_nsems; i++) {
>>                sem = sma->sem_base + i;
>>                spin_unlock_wait(&sem->lock);
>>        }
>
>If thread A is allowed to issue read_spinlock;read complex_mode;write 
>spinlock, then thread B would not notice that thread A is in the 
>critical section

Are you referring to the sem->lock word not being visibly locked before we
read complex_mode (for the second time)? This issue was fixed in 2c610022711
(locking/qspinlock: Fix spin_unlock_wait() some more). So smp_load_acquire
should be enough afaict, or are you referring to something else?

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-25 17:37 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
@ 2016-06-25 19:41   ` Holger Hoffstätte
  0 siblings, 0 replies; 20+ messages in thread
From: Holger Hoffstätte @ 2016-06-25 19:41 UTC (permalink / raw)
  To: stable; +Cc: linux-kernel

On Sat, 25 Jun 2016 19:37:51 +0200, Manfred Spraul wrote:

> Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
> race:
> 
> sem_lock has a fast path that allows parallel simple operations.
> There are two reasons why a simple operation cannot run in parallel:
> - a non-simple operations is ongoing (sma->sem_perm.lock held)
> - a complex operation is sleeping (sma->complex_count != 0)
> 
> As both facts are stored independently, a thread can bypass the current
> checks by sleeping in the right positions. See below for more details
> (or kernel bugzilla 105651).
> 
> The patch fixes that by creating one variable (complex_mode)
> that tracks both reasons why parallel operations are not possible.
> 
> The patch also updates stale documentation regarding the locking.
> 
> With regards to stable kernels:
> The patch is required for all kernels that include the
> commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)
> 
> The alternative is to revert the patch that introduced the race.
> 
> Background:
> Here is the race of the current implementation:
> 
> Thread A: (simple op)
> - does the first "sma->complex_count == 0" test
> 
> Thread B: (complex op)
> - does sem_lock(): This includes an array scan. But the scan can't
>   find Thread A, because Thread A does not own sem->lock yet.
> - the thread does the operation, increases complex_count,
>   drops sem_lock, sleeps
> 
> Thread A:
> - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
> - sleeps before the complex_count test
> 
> Thread C: (complex op)
> - does sem_lock (no array scan, complex_count==1)
> - wakes up Thread B.
> - decrements complex_count
> 
> Thread A:
> - does the complex_count test
> 
> Bug:
> Now both thread A and thread C operate on the same array, without
> any synchronization.
> 
> Full memory barrier are required to synchronize changes of
> complex_mode and the lock operations.
> 
> Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
> Reported-by: felixh@informatik.uni-bremen.de
> Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
> Cc: <stable@vger.kernel.org>

<snip>

>  
> -		/* Then check that the global lock is free */
> -		if (!spin_is_locked(&sma->sem_perm.lock)) {
> -			/*
> -			 * We need a memory barrier with acquire semantics,
> -			 * otherwise we can race with another thread that does:
> -			 *	complex_count++;
> -			 *	spin_unlock(sem_perm.lock);
> -			 */
> -			smp_acquire__after_ctrl_dep();

This won't apply to -stable because smp_acquire__after_ctrl_dep() was only
recently added. I could merge this over 4.4.x by replacing it with the previous
definition ipc_smp_acquire__after_spin_is_unlocked() (just as you did in the
first version of the patch).

hth,
Holger


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-25 17:37 Manfred Spraul
@ 2016-06-25 17:37 ` Manfred Spraul
  2016-06-25 19:41   ` Holger Hoffstätte
  0 siblings, 1 reply; 20+ messages in thread
From: Manfred Spraul @ 2016-06-25 17:37 UTC (permalink / raw)
  To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh,
	Manfred Spraul, stable

Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
race:

sem_lock has a fast path that allows parallel simple operations.
There are two reasons why a simple operation cannot run in parallel:
- a non-simple operations is ongoing (sma->sem_perm.lock held)
- a complex operation is sleeping (sma->complex_count != 0)

As both facts are stored independently, a thread can bypass the current
checks by sleeping in the right positions. See below for more details
(or kernel bugzilla 105651).

The patch fixes that by creating one variable (complex_mode)
that tracks both reasons why parallel operations are not possible.

The patch also updates stale documentation regarding the locking.

With regards to stable kernels:
The patch is required for all kernels that include the
commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)

The alternative is to revert the patch that introduced the race.

Background:
Here is the race of the current implementation:

Thread A: (simple op)
- does the first "sma->complex_count == 0" test

Thread B: (complex op)
- does sem_lock(): This includes an array scan. But the scan can't
  find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count,
  drops sem_lock, sleeps

Thread A:
- spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
- sleeps before the complex_count test

Thread C: (complex op)
- does sem_lock (no array scan, complex_count==1)
- wakes up Thread B.
- decrements complex_count

Thread A:
- does the complex_count test

Bug:
Now both thread A and thread C operate on the same array, without
any synchronization.

Full memory barrier are required to synchronize changes of
complex_mode and the lock operations.

Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
Reported-by: felixh@informatik.uni-bremen.de
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/sem.h |   1 +
 ipc/sem.c           | 122 ++++++++++++++++++++++++++++++----------------------
 2 files changed, 71 insertions(+), 52 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 976ce3a..d0efd6e 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -21,6 +21,7 @@ struct sem_array {
 	struct list_head	list_id;	/* undo requests on this array */
 	int			sem_nsems;	/* no. of semaphores in array */
 	int			complex_count;	/* pending complex operations */
+	bool			complex_mode;	/* no parallel simple ops */
 };
 
 #ifdef CONFIG_SYSVIPC
diff --git a/ipc/sem.c b/ipc/sem.c
index ae72b3c..538f43a 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 
 /*
  * Locking:
+ * a) global sem_lock() for read/write
  *	sem_undo.id_next,
  *	sem_array.complex_count,
- *	sem_array.pending{_alter,_cont},
- *	sem_array.sem_undo: global sem_lock() for read/write
- *	sem_undo.proc_next: only "current" is allowed to read/write that field.
+ *	sem_array.complex_mode
+ *	sem_array.pending{_alter,_const},
+ *	sem_array.sem_undo
  *
+ * b) global or semaphore sem_lock() for read/write:
  *	sem_array.sem_base[i].pending_{const,alter}:
- *		global or semaphore sem_lock() for read/write
+ *	sem_array.complex_mode (for read)
+ *
+ * c) special:
+ *	sem_undo_list.list_proc:
+ *	* undo_list->lock for write
+ *	* rcu for read
  */
 
 #define sc_semmsl	sem_ctls[0]
@@ -260,23 +267,25 @@ static void sem_rcu_free(struct rcu_head *head)
 }
 
 /*
- * Wait until all currently ongoing simple ops have completed.
+ * Enter the mode suitable for non-simple operations:
  * Caller must own sem_perm.lock.
- * New simple ops cannot start, because simple ops first check
- * that sem_perm.lock is free.
- * that a) sem_perm.lock is free and b) complex_count is 0.
  */
-static void sem_wait_array(struct sem_array *sma)
+static void complexmode_enter(struct sem_array *sma)
 {
 	int i;
 	struct sem *sem;
 
-	if (sma->complex_count)  {
-		/* The thread that increased sma->complex_count waited on
-		 * all sem->lock locks. Thus we don't need to wait again.
-		 */
+	if (sma->complex_mode)  {
+		/* We are already in complex_mode. Nothing to do */
 		return;
 	}
+	WRITE_ONCE(sma->complex_mode, true);
+
+	/* We need a full barrier:
+	 * The write to complex_mode must be visible
+	 * before we read the first sem->lock spinlock state.
+	 */
+	smp_mb();
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
@@ -285,6 +294,27 @@ static void sem_wait_array(struct sem_array *sma)
 }
 
 /*
+ * Try to leave the mode that disallows simple operations:
+ * Caller must own sem_perm.lock.
+ */
+static void complexmode_tryleave(struct sem_array *sma)
+{
+	if (sma->complex_count)  {
+		/* Complex ops are sleeping.
+		 * We must stay in complex mode
+		 */
+		return;
+	}
+	/*
+	 * Immediately after setting complex_mode to false,
+	 * a simple op can start. Thus: all memory writes
+	 * performed by the current operation must be visible
+	 * before we set complex_mode to false.
+	 */
+	smp_store_release(&sma->complex_mode, false);
+}
+
+/*
  * If the request contains only one semaphore operation, and there are
  * no complex transactions pending, lock only the semaphore involved.
  * Otherwise, lock the entire semaphore array, since we either have
@@ -300,56 +330,40 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Complex operation - acquire a full lock */
 		ipc_lock_object(&sma->sem_perm);
 
-		/* And wait until all simple ops that are processed
-		 * right now have dropped their locks.
-		 */
-		sem_wait_array(sma);
+		/* Prevent parallel simple ops */
+		complexmode_enter(sma);
 		return -1;
 	}
 
 	/*
 	 * Only one semaphore affected - try to optimize locking.
-	 * The rules are:
-	 * - optimized locking is possible if no complex operation
-	 *   is either enqueued or processed right now.
-	 * - The test for enqueued complex ops is simple:
-	 *      sma->complex_count != 0
-	 * - Testing for complex ops that are processed right now is
-	 *   a bit more difficult. Complex ops acquire the full lock
-	 *   and first wait that the running simple ops have completed.
-	 *   (see above)
-	 *   Thus: If we own a simple lock and the global lock is free
-	 *	and complex_count is now 0, then it will stay 0 and
-	 *	thus just locking sem->lock is sufficient.
+	 * Optimized locking is possible if no complex operation
+	 * is either enqueued or processed right now.
+	 *
+	 * Both facts are tracked by complex_mode.
 	 */
 	sem = sma->sem_base + sops->sem_num;
 
-	if (sma->complex_count == 0) {
+	/*
+	 * Initial check for complex_mode. Just an optimization,
+	 * no locking.
+	 */
+	if (!READ_ONCE(sma->complex_mode)) {
 		/*
 		 * It appears that no complex operation is around.
 		 * Acquire the per-semaphore lock.
 		 */
 		spin_lock(&sem->lock);
 
-		/* Then check that the global lock is free */
-		if (!spin_is_locked(&sma->sem_perm.lock)) {
-			/*
-			 * We need a memory barrier with acquire semantics,
-			 * otherwise we can race with another thread that does:
-			 *	complex_count++;
-			 *	spin_unlock(sem_perm.lock);
-			 */
-			smp_acquire__after_ctrl_dep();
+		/*
+		 * A full barrier is required: the write of sem->lock
+		 * must be visible before the read is executed
+		 */
+		smp_mb();
 
-			/*
-			 * Now repeat the test of complex_count:
-			 * It can't change anymore until we drop sem->lock.
-			 * Thus: if is now 0, then it will stay 0.
-			 */
-			if (sma->complex_count == 0) {
-				/* fast path successful! */
-				return sops->sem_num;
-			}
+		if (!READ_ONCE(sma->complex_mode)) {
+			/* fast path successful! */
+			return sops->sem_num;
 		}
 		spin_unlock(&sem->lock);
 	}
@@ -369,7 +383,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Not a false alarm, thus complete the sequence for a
 		 * full lock.
 		 */
-		sem_wait_array(sma);
+		complexmode_enter(sma);
 		return -1;
 	}
 }
@@ -378,6 +392,7 @@ static inline void sem_unlock(struct sem_array *sma, int locknum)
 {
 	if (locknum == -1) {
 		unmerge_queues(sma);
+		complexmode_tryleave(sma);
 		ipc_unlock_object(&sma->sem_perm);
 	} else {
 		struct sem *sem = sma->sem_base + locknum;
@@ -529,6 +544,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	}
 
 	sma->complex_count = 0;
+	sma->complex_mode = true; /* dropped by sem_unlock below */
 	INIT_LIST_HEAD(&sma->pending_alter);
 	INIT_LIST_HEAD(&sma->pending_const);
 	INIT_LIST_HEAD(&sma->list_id);
@@ -2184,10 +2200,10 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 	/*
 	 * The proc interface isn't aware of sem_lock(), it calls
 	 * ipc_lock_object() directly (in sysvipc_find_ipc).
-	 * In order to stay compatible with sem_lock(), we must wait until
-	 * all simple semop() calls have left their critical regions.
+	 * In order to stay compatible with sem_lock(), we must
+	 * enter / leave complex_mode.
 	 */
-	sem_wait_array(sma);
+	complexmode_enter(sma);
 
 	sem_otime = get_semotime(sma);
 
@@ -2204,6 +2220,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 		   sem_otime,
 		   sma->sem_ctime);
 
+	complexmode_tryleave(sma);
+
 	return 0;
 }
 #endif
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-21  0:30   ` Davidlohr Bueso
@ 2016-06-23 19:22     ` Manfred Spraul
  2016-06-28  5:27       ` Davidlohr Bueso
  0 siblings, 1 reply; 20+ messages in thread
From: Manfred Spraul @ 2016-06-23 19:22 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh,
	stable

On 06/21/2016 02:30 AM, Davidlohr Bueso wrote:
> On Sat, 18 Jun 2016, Manfred Spraul wrote:
>
>> diff --git a/include/linux/sem.h b/include/linux/sem.h
>> index 976ce3a..d0efd6e 100644
>> --- a/include/linux/sem.h
>> +++ b/include/linux/sem.h
>> @@ -21,6 +21,7 @@ struct sem_array {
>>     struct list_head    list_id;    /* undo requests on this array */
>>     int            sem_nsems;    /* no. of semaphores in array */
>>     int            complex_count;    /* pending complex operations */
>
> I see this patch also takes care of complex_count needing READ/WRITE_ONCE
> as you change that to always be done with the big sem_perm lock.
>
>> +    bool            complex_mode;    /* no parallel simple ops */
>
> But I'm wondering about races with this one. Doesn't complex_mode need
> acquire/release semantics?
>
Yes. complex_mode needs acquire/release.

>> };
>>
>
> [...]
>
>> /*
>> - * Wait until all currently ongoing simple ops have completed.
>> + * Enter the mode suitable for non-simple operations:
>>  * Caller must own sem_perm.lock.
>> - * New simple ops cannot start, because simple ops first check
>> - * that sem_perm.lock is free.
>> - * that a) sem_perm.lock is free and b) complex_count is 0.
>>  */
>> -static void sem_wait_array(struct sem_array *sma)
>> +static void complexmode_enter(struct sem_array *sma)
>> {
>>     int i;
>>     struct sem *sem;
>>
>> -    if (sma->complex_count)  {
>> -        /* The thread that increased sma->complex_count waited on
>> -         * all sem->lock locks. Thus we don't need to wait again.
>> -         */
>> +    if (sma->complex_mode)  {
>> +        /* We are already in complex_mode. Nothing to do */
>
> This complex_mode load is serialized because both complexmode_enter() and
> _tryleave(), which are the main calls that modify the variable, are 
> serialized
> by sem_perm lock, right?
>
Exactly. Changes to complex_mode are protected by perm.lock.

>>         return;
>>     }
>
> Btw, I like that this logic is much simpler, just by reading the 
> comments :)
>
>> +    WRITE_ONCE(sma->complex_mode, true);
>> +
>> +    /* We need a full barrier:
>> +     * The write to complex_mode must be visible
>> +     * before we read the first sem->lock spinlock state.
>> +     */
>> +    smp_mb();
>>
Theoretically: smp_store_acquire. but this doesn't exist, so smp_mb()
>>     for (i = 0; i < sma->sem_nsems; i++) {
>>         sem = sma->sem_base + i;
>> @@ -285,6 +294,29 @@ static void sem_wait_array(struct sem_array *sma)
>> }
>>
>> /*
>> + * Try to leave the mode that disallows simple operations:
>> + * Caller must own sem_perm.lock.
>> + */
>> +static void complexmode_tryleave(struct sem_array *sma)
>> +{
>> +    if (sma->complex_count)  {
>> +        /* Complex ops are sleeping.
>> +         * We must stay in complex mode
>> +         */
>> +        return;
>> +    }
>> +    /*
>> +     * Immediately after setting complex_mode to false,
>> +     * a simple op can start. Thus: all memory writes
>> +     * performed by the current operation must be visible
>> +     * before we set complex_mode to false.
>> +     */
>> +    smp_wmb();
>> +
>> +    WRITE_ONCE(sma->complex_mode, false);
>
> smp_store_release()? See below.
>
Yes
>> +}
>> +
>> +/*
>>  * If the request contains only one semaphore operation, and there are
>>  * no complex transactions pending, lock only the semaphore involved.
>>  * Otherwise, lock the entire semaphore array, since we either have
>> @@ -300,56 +332,38 @@ static inline int sem_lock(struct sem_array 
>> *sma, struct sembuf *sops,
>>         /* Complex operation - acquire a full lock */
>>         ipc_lock_object(&sma->sem_perm);
>>
>> -        /* And wait until all simple ops that are processed
>> -         * right now have dropped their locks.
>> -         */
>> -        sem_wait_array(sma);
>> +        /* Prevent parallel simple ops */
>> +        complexmode_enter(sma);
>>         return -1;
>>     }
>>
>>     /*
>>      * Only one semaphore affected - try to optimize locking.
>> -     * The rules are:
>> -     * - optimized locking is possible if no complex operation
>> -     *   is either enqueued or processed right now.
>> -     * - The test for enqueued complex ops is simple:
>> -     *      sma->complex_count != 0
>> -     * - Testing for complex ops that are processed right now is
>> -     *   a bit more difficult. Complex ops acquire the full lock
>> -     *   and first wait that the running simple ops have completed.
>> -     *   (see above)
>> -     *   Thus: If we own a simple lock and the global lock is free
>> -     *    and complex_count is now 0, then it will stay 0 and
>> -     *    thus just locking sem->lock is sufficient.
>> +     * Optimized locking is possible if no complex operation
>> +     * is either enqueued or processed right now.
>> +     *
>> +     * Both facts are tracked by complex_mode.
>>      */
>>     sem = sma->sem_base + sops->sem_num;
>>
>> -    if (sma->complex_count == 0) {
>> +    /*
>> +     * Initial check for complex_mode. Just an optimization,
>> +     * no locking.
>> +     */
>> +    if (!READ_ONCE(sma->complex_mode)) {
>
> We have no lock (which is the whole point), I think we need stronger
> guarantees here to avoid racing with another thread that holds sem_perm
> lock and is entering complexmode for the first time or vice versa with
> tryleave(). An smp_load_acquire here would pair with the suggested
> smp_store_release calls.
>
Yes,you are right.
What I'm not sure yet is if smp_load_acquire() is sufficient:

Thread A:
>        if (!READ_ONCE(sma->complex_mode)) {
The code is test_and_test, no barrier requirements for first test
>                 /*
>                  * It appears that no complex operation is around.
>                  * Acquire the per-semaphore lock.
>                  */
>                 spin_lock(&sem->lock);
>
>                 if (!smp_load_acquire(&sma->complex_mode)) {
>                         /* fast path successful! */
>                         return sops->sem_num;
>                 }
>                 spin_unlock(&sem->lock);
>         }

Thread B:
>        WRITE_ONCE(sma->complex_mode, true);
>
>         /* We need a full barrier:
>          * The write to complex_mode must be visible
>          * before we read the first sem->lock spinlock state.
>          */
>         smp_mb();
>
>         for (i = 0; i < sma->sem_nsems; i++) {
>                 sem = sma->sem_base + i;
>                 spin_unlock_wait(&sem->lock);
>         }

If thread A is allowed to issue read_spinlock;read complex_mode;write 
spinlock, then thread B would not notice that thread A is in the 
critical section

> Thanks,
> Davidlohr


I'll update the patch.


--

     Manfred

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-20 23:04     ` Andrew Morton
  (?)
@ 2016-06-23 18:55     ` Manfred Spraul
  -1 siblings, 0 replies; 20+ messages in thread
From: Manfred Spraul @ 2016-06-23 18:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, LKML, linux-next, 1vier1, Davidlohr Bueso,
	felixh, stable

On 06/21/2016 01:04 AM, Andrew Morton wrote:
> On Sat, 18 Jun 2016 22:02:21 +0200 Manfred Spraul <manfred@colorfullife.com> wrote:
>
>> Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race:
>>
>> sem_lock has a fast path that allows parallel simple operations.
>> There are two reasons why a simple operation cannot run in parallel:
>> - a non-simple operations is ongoing (sma->sem_perm.lock held)
>> - a complex operation is sleeping (sma->complex_count != 0)
>>
>> As both facts are stored independently, a thread can bypass the current
>> checks by sleeping in the right positions. See below for more details
>> (or kernel bugzilla 105651).
>>
>> The patch fixes that by creating one variable (complex_mode)
>> that tracks both reasons why parallel operations are not possible.
>>
>> The patch also updates stale documentation regarding the locking.
>>
>> With regards to stable kernels:
>> The patch is required for all kernels that include the commit 6d07b68ce16a
>> ("ipc/sem.c: optimize sem_lock()") (3.10?)
> I've had this in -mm (and -next) since January 4, without issues.  I
> put it on hold because Davidlohr expressed concern about performance
> regressions.
I had several ideas how to fix it. The initial ideas probably had 
performance issue.

The current one doesn't have any issues. It just took longer than 
expected to test it.
> Your [2/2] should prevent those regressions (yes?) so I assume that any
> kernel which has [1/2] really should have [2/2] as well.  But without
> any quantitative information, this is all mad guesswork.
>
> What to do?
[2/2] is an improvement, it handles one case better than the current code.
If you want:
3.10 improved scalability, but it introduced a performance regression 
for one use case.
[with 3.10, simple ops got parallel, but complex ops had to perform a 
"for_each_sem() {spin_unlock_wait()}"]
The patch fixes that.


--
     Manfred

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-18 20:02   ` Manfred Spraul
  (?)
  (?)
@ 2016-06-21  0:30   ` Davidlohr Bueso
  2016-06-23 19:22     ` Manfred Spraul
  -1 siblings, 1 reply; 20+ messages in thread
From: Davidlohr Bueso @ 2016-06-21  0:30 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh,
	stable

On Sat, 18 Jun 2016, Manfred Spraul wrote:

>diff --git a/include/linux/sem.h b/include/linux/sem.h
>index 976ce3a..d0efd6e 100644
>--- a/include/linux/sem.h
>+++ b/include/linux/sem.h
>@@ -21,6 +21,7 @@ struct sem_array {
> 	struct list_head	list_id;	/* undo requests on this array */
> 	int			sem_nsems;	/* no. of semaphores in array */
> 	int			complex_count;	/* pending complex operations */

I see this patch also takes care of complex_count needing READ/WRITE_ONCE
as you change that to always be done with the big sem_perm lock.

>+	bool			complex_mode;	/* no parallel simple ops */

But I'm wondering about races with this one. Doesn't complex_mode need
acquire/release semantics?

> };
>

[...]

> /*
>- * Wait until all currently ongoing simple ops have completed.
>+ * Enter the mode suitable for non-simple operations:
>  * Caller must own sem_perm.lock.
>- * New simple ops cannot start, because simple ops first check
>- * that sem_perm.lock is free.
>- * that a) sem_perm.lock is free and b) complex_count is 0.
>  */
>-static void sem_wait_array(struct sem_array *sma)
>+static void complexmode_enter(struct sem_array *sma)
> {
> 	int i;
> 	struct sem *sem;
>
>-	if (sma->complex_count)  {
>-		/* The thread that increased sma->complex_count waited on
>-		 * all sem->lock locks. Thus we don't need to wait again.
>-		 */
>+	if (sma->complex_mode)  {
>+		/* We are already in complex_mode. Nothing to do */

This complex_mode load is serialized because both complexmode_enter() and
_tryleave(), which are the main calls that modify the variable, are serialized
by sem_perm lock, right?

> 		return;
> 	}

Btw, I like that this logic is much simpler, just by reading the comments :)

>+	WRITE_ONCE(sma->complex_mode, true);
>+
>+	/* We need a full barrier:
>+	 * The write to complex_mode must be visible
>+	 * before we read the first sem->lock spinlock state.
>+	 */
>+	smp_mb();
>
> 	for (i = 0; i < sma->sem_nsems; i++) {
> 		sem = sma->sem_base + i;
>@@ -285,6 +294,29 @@ static void sem_wait_array(struct sem_array *sma)
> }
>
> /*
>+ * Try to leave the mode that disallows simple operations:
>+ * Caller must own sem_perm.lock.
>+ */
>+static void complexmode_tryleave(struct sem_array *sma)
>+{
>+	if (sma->complex_count)  {
>+		/* Complex ops are sleeping.
>+		 * We must stay in complex mode
>+		 */
>+		return;
>+	}
>+	/*
>+	 * Immediately after setting complex_mode to false,
>+	 * a simple op can start. Thus: all memory writes
>+	 * performed by the current operation must be visible
>+	 * before we set complex_mode to false.
>+	 */
>+	smp_wmb();
>+
>+	WRITE_ONCE(sma->complex_mode, false);

smp_store_release()? See below.

>+}
>+
>+/*
>  * If the request contains only one semaphore operation, and there are
>  * no complex transactions pending, lock only the semaphore involved.
>  * Otherwise, lock the entire semaphore array, since we either have
>@@ -300,56 +332,38 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
> 		/* Complex operation - acquire a full lock */
> 		ipc_lock_object(&sma->sem_perm);
>
>-		/* And wait until all simple ops that are processed
>-		 * right now have dropped their locks.
>-		 */
>-		sem_wait_array(sma);
>+		/* Prevent parallel simple ops */
>+		complexmode_enter(sma);
> 		return -1;
> 	}
>
> 	/*
> 	 * Only one semaphore affected - try to optimize locking.
>-	 * The rules are:
>-	 * - optimized locking is possible if no complex operation
>-	 *   is either enqueued or processed right now.
>-	 * - The test for enqueued complex ops is simple:
>-	 *      sma->complex_count != 0
>-	 * - Testing for complex ops that are processed right now is
>-	 *   a bit more difficult. Complex ops acquire the full lock
>-	 *   and first wait that the running simple ops have completed.
>-	 *   (see above)
>-	 *   Thus: If we own a simple lock and the global lock is free
>-	 *	and complex_count is now 0, then it will stay 0 and
>-	 *	thus just locking sem->lock is sufficient.
>+	 * Optimized locking is possible if no complex operation
>+	 * is either enqueued or processed right now.
>+	 *
>+	 * Both facts are tracked by complex_mode.
> 	 */
> 	sem = sma->sem_base + sops->sem_num;
>
>-	if (sma->complex_count == 0) {
>+	/*
>+	 * Initial check for complex_mode. Just an optimization,
>+	 * no locking.
>+	 */
>+	if (!READ_ONCE(sma->complex_mode)) {

We have no lock (which is the whole point), I think we need stronger
guarantees here to avoid racing with another thread that holds sem_perm
lock and is entering complexmode for the first time or vice versa with
tryleave(). An smp_load_acquire here would pair with the suggested
smp_store_release calls.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-18 20:02   ` Manfred Spraul
@ 2016-06-20 23:04     ` Andrew Morton
  -1 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2016-06-20 23:04 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, LKML, linux-next, 1vier1, Davidlohr Bueso,
	felixh, stable

On Sat, 18 Jun 2016 22:02:21 +0200 Manfred Spraul <manfred@colorfullife.com> wrote:

> Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race:
> 
> sem_lock has a fast path that allows parallel simple operations.
> There are two reasons why a simple operation cannot run in parallel:
> - a non-simple operations is ongoing (sma->sem_perm.lock held)
> - a complex operation is sleeping (sma->complex_count != 0)
> 
> As both facts are stored independently, a thread can bypass the current
> checks by sleeping in the right positions. See below for more details
> (or kernel bugzilla 105651).
> 
> The patch fixes that by creating one variable (complex_mode)
> that tracks both reasons why parallel operations are not possible.
> 
> The patch also updates stale documentation regarding the locking.
> 
> With regards to stable kernels:
> The patch is required for all kernels that include the commit 6d07b68ce16a
> ("ipc/sem.c: optimize sem_lock()") (3.10?)

I've had this in -mm (and -next) since January 4, without issues.  I
put it on hold because Davidlohr expressed concern about performance
regressions.

Your [2/2] should prevent those regressions (yes?) so I assume that any
kernel which has [1/2] really should have [2/2] as well.  But without
any quantitative information, this is all mad guesswork.

What to do?

(The [2/2] changelog should explain that it is the cure to [1/2]'s
regressions, btw).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
@ 2016-06-20 23:04     ` Andrew Morton
  0 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2016-06-20 23:04 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, LKML, linux-next, 1vier1, Davidlohr Bueso,
	felixh, stable

On Sat, 18 Jun 2016 22:02:21 +0200 Manfred Spraul <manfred@colorfullife.com> wrote:

> Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race:
> 
> sem_lock has a fast path that allows parallel simple operations.
> There are two reasons why a simple operation cannot run in parallel:
> - a non-simple operations is ongoing (sma->sem_perm.lock held)
> - a complex operation is sleeping (sma->complex_count != 0)
> 
> As both facts are stored independently, a thread can bypass the current
> checks by sleeping in the right positions. See below for more details
> (or kernel bugzilla 105651).
> 
> The patch fixes that by creating one variable (complex_mode)
> that tracks both reasons why parallel operations are not possible.
> 
> The patch also updates stale documentation regarding the locking.
> 
> With regards to stable kernels:
> The patch is required for all kernels that include the commit 6d07b68ce16a
> ("ipc/sem.c: optimize sem_lock()") (3.10?)

I've had this in -mm (and -next) since January 4, without issues.  I
put it on hold because Davidlohr expressed concern about performance
regressions.

Your [2/2] should prevent those regressions (yes?) so I assume that any
kernel which has [1/2] really should have [2/2] as well.  But without
any quantitative information, this is all mad guesswork.

What to do?

(The [2/2] changelog should explain that it is the cure to [1/2]'s
regressions, btw).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-15  5:23 linux-next: manual merge of the akpm-current tree with the tip tree Stephen Rothwell
@ 2016-06-18 20:02   ` Manfred Spraul
  0 siblings, 0 replies; 20+ messages in thread
From: Manfred Spraul @ 2016-06-18 20:02 UTC (permalink / raw)
  To: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Andrew Morton
  Cc: LKML, linux-next, 1vier1, Davidlohr Bueso, felixh,
	Manfred Spraul, stable

Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race:

sem_lock has a fast path that allows parallel simple operations.
There are two reasons why a simple operation cannot run in parallel:
- a non-simple operations is ongoing (sma->sem_perm.lock held)
- a complex operation is sleeping (sma->complex_count != 0)

As both facts are stored independently, a thread can bypass the current
checks by sleeping in the right positions. See below for more details
(or kernel bugzilla 105651).

The patch fixes that by creating one variable (complex_mode)
that tracks both reasons why parallel operations are not possible.

The patch also updates stale documentation regarding the locking.

With regards to stable kernels:
The patch is required for all kernels that include the commit 6d07b68ce16a
("ipc/sem.c: optimize sem_lock()") (3.10?)

The alternative is to revert the patch that introduced the race.

Background:
Here is the race of the current implementation:

Thread A: (simple op)
- does the first "sma->complex_count == 0" test

Thread B: (complex op)
- does sem_lock(): This includes an array scan. But the scan can't
  find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count,
  drops sem_lock, sleeps

Thread A:
- spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
- sleeps before the complex_count test

Thread C: (complex op)
- does sem_lock (no array scan, complex_count==1)
- wakes up Thread B.
- decrements complex_count

Thread A:
- does the complex_count test

Bug:
Now both thread A and thread C operate on the same array, without
any synchronization.

Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
Reported-by: felixh@informatik.uni-bremen.de
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
---

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 976ce3a..d0efd6e 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -21,6 +21,7 @@ struct sem_array {
 	struct list_head	list_id;	/* undo requests on this array */
 	int			sem_nsems;	/* no. of semaphores in array */
 	int			complex_count;	/* pending complex operations */
+	bool			complex_mode;	/* no parallel simple ops */
 };
 
 #ifdef CONFIG_SYSVIPC
diff --git a/ipc/sem.c b/ipc/sem.c
index ae72b3c..db2e6fc 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 
 /*
  * Locking:
+ * a) global sem_lock() for read/write
  *	sem_undo.id_next,
  *	sem_array.complex_count,
- *	sem_array.pending{_alter,_cont},
- *	sem_array.sem_undo: global sem_lock() for read/write
- *	sem_undo.proc_next: only "current" is allowed to read/write that field.
+ *	sem_array.complex_mode
+ *	sem_array.pending{_alter,_const},
+ *	sem_array.sem_undo
  *
+ * b) global or semaphore sem_lock() for read/write:
  *	sem_array.sem_base[i].pending_{const,alter}:
- *		global or semaphore sem_lock() for read/write
+ *	sem_array.complex_mode (for read)
+ *
+ * c) special:
+ *	sem_undo_list.list_proc:
+ *	* undo_list->lock for write
+ *	* rcu for read
  */
 
 #define sc_semmsl	sem_ctls[0]
@@ -260,23 +267,25 @@ static void sem_rcu_free(struct rcu_head *head)
 }
 
 /*
- * Wait until all currently ongoing simple ops have completed.
+ * Enter the mode suitable for non-simple operations:
  * Caller must own sem_perm.lock.
- * New simple ops cannot start, because simple ops first check
- * that sem_perm.lock is free.
- * that a) sem_perm.lock is free and b) complex_count is 0.
  */
-static void sem_wait_array(struct sem_array *sma)
+static void complexmode_enter(struct sem_array *sma)
 {
 	int i;
 	struct sem *sem;
 
-	if (sma->complex_count)  {
-		/* The thread that increased sma->complex_count waited on
-		 * all sem->lock locks. Thus we don't need to wait again.
-		 */
+	if (sma->complex_mode)  {
+		/* We are already in complex_mode. Nothing to do */
 		return;
 	}
+	WRITE_ONCE(sma->complex_mode, true);
+
+	/* We need a full barrier:
+	 * The write to complex_mode must be visible
+	 * before we read the first sem->lock spinlock state.
+	 */
+	smp_mb();
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
@@ -285,6 +294,29 @@ static void sem_wait_array(struct sem_array *sma)
 }
 
 /*
+ * Try to leave the mode that disallows simple operations:
+ * Caller must own sem_perm.lock.
+ */
+static void complexmode_tryleave(struct sem_array *sma)
+{
+	if (sma->complex_count)  {
+		/* Complex ops are sleeping.
+		 * We must stay in complex mode
+		 */
+		return;
+	}
+	/*
+	 * Immediately after setting complex_mode to false,
+	 * a simple op can start. Thus: all memory writes
+	 * performed by the current operation must be visible
+	 * before we set complex_mode to false.
+	 */
+	smp_wmb();
+
+	WRITE_ONCE(sma->complex_mode, false);
+}
+
+/*
  * If the request contains only one semaphore operation, and there are
  * no complex transactions pending, lock only the semaphore involved.
  * Otherwise, lock the entire semaphore array, since we either have
@@ -300,56 +332,38 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Complex operation - acquire a full lock */
 		ipc_lock_object(&sma->sem_perm);
 
-		/* And wait until all simple ops that are processed
-		 * right now have dropped their locks.
-		 */
-		sem_wait_array(sma);
+		/* Prevent parallel simple ops */
+		complexmode_enter(sma);
 		return -1;
 	}
 
 	/*
 	 * Only one semaphore affected - try to optimize locking.
-	 * The rules are:
-	 * - optimized locking is possible if no complex operation
-	 *   is either enqueued or processed right now.
-	 * - The test for enqueued complex ops is simple:
-	 *      sma->complex_count != 0
-	 * - Testing for complex ops that are processed right now is
-	 *   a bit more difficult. Complex ops acquire the full lock
-	 *   and first wait that the running simple ops have completed.
-	 *   (see above)
-	 *   Thus: If we own a simple lock and the global lock is free
-	 *	and complex_count is now 0, then it will stay 0 and
-	 *	thus just locking sem->lock is sufficient.
+	 * Optimized locking is possible if no complex operation
+	 * is either enqueued or processed right now.
+	 *
+	 * Both facts are tracked by complex_mode.
 	 */
 	sem = sma->sem_base + sops->sem_num;
 
-	if (sma->complex_count == 0) {
+	/*
+	 * Initial check for complex_mode. Just an optimization,
+	 * no locking.
+	 */
+	if (!READ_ONCE(sma->complex_mode)) {
 		/*
 		 * It appears that no complex operation is around.
 		 * Acquire the per-semaphore lock.
 		 */
 		spin_lock(&sem->lock);
 
-		/* Then check that the global lock is free */
-		if (!spin_is_locked(&sma->sem_perm.lock)) {
-			/*
-			 * We need a memory barrier with acquire semantics,
-			 * otherwise we can race with another thread that does:
-			 *	complex_count++;
-			 *	spin_unlock(sem_perm.lock);
-			 */
-			smp_acquire__after_ctrl_dep();
-
-			/*
-			 * Now repeat the test of complex_count:
-			 * It can't change anymore until we drop sem->lock.
-			 * Thus: if is now 0, then it will stay 0.
-			 */
-			if (sma->complex_count == 0) {
-				/* fast path successful! */
-				return sops->sem_num;
-			}
+		/* Now repeat the test for complex_mode.
+		 * A memory barrier is provided by the spin_lock()
+		 * above.
+		 */
+		if (!READ_ONCE(sma->complex_mode)) {
+			/* fast path successful! */
+			return sops->sem_num;
 		}
 		spin_unlock(&sem->lock);
 	}
@@ -369,7 +383,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Not a false alarm, thus complete the sequence for a
 		 * full lock.
 		 */
-		sem_wait_array(sma);
+		complexmode_enter(sma);
 		return -1;
 	}
 }
@@ -378,6 +392,7 @@ static inline void sem_unlock(struct sem_array *sma, int locknum)
 {
 	if (locknum == -1) {
 		unmerge_queues(sma);
+		complexmode_tryleave(sma);
 		ipc_unlock_object(&sma->sem_perm);
 	} else {
 		struct sem *sem = sma->sem_base + locknum;
@@ -529,6 +544,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	}
 
 	sma->complex_count = 0;
+	sma->complex_mode = true; /* dropped by sem_unlock below */
 	INIT_LIST_HEAD(&sma->pending_alter);
 	INIT_LIST_HEAD(&sma->pending_const);
 	INIT_LIST_HEAD(&sma->list_id);
@@ -2184,10 +2200,10 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 	/*
 	 * The proc interface isn't aware of sem_lock(), it calls
 	 * ipc_lock_object() directly (in sysvipc_find_ipc).
-	 * In order to stay compatible with sem_lock(), we must wait until
-	 * all simple semop() calls have left their critical regions.
+	 * In order to stay compatible with sem_lock(), we must
+	 * enter / leave complex_mode.
 	 */
-	sem_wait_array(sma);
+	complexmode_enter(sma);
 
 	sem_otime = get_semotime(sma);
 
@@ -2204,6 +2220,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 		   sem_otime,
 		   sma->sem_ctime);
 
+	complexmode_tryleave(sma);
+
 	return 0;
 }
 #endif

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
@ 2016-06-18 20:02   ` Manfred Spraul
  0 siblings, 0 replies; 20+ messages in thread
From: Manfred Spraul @ 2016-06-18 20:02 UTC (permalink / raw)
  To: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Andrew Morton
  Cc: LKML, linux-next, 1vier1, Davidlohr Bueso, felixh,
	Manfred Spraul, stable

Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race:

sem_lock has a fast path that allows parallel simple operations.
There are two reasons why a simple operation cannot run in parallel:
- a non-simple operations is ongoing (sma->sem_perm.lock held)
- a complex operation is sleeping (sma->complex_count != 0)

As both facts are stored independently, a thread can bypass the current
checks by sleeping in the right positions. See below for more details
(or kernel bugzilla 105651).

The patch fixes that by creating one variable (complex_mode)
that tracks both reasons why parallel operations are not possible.

The patch also updates stale documentation regarding the locking.

With regards to stable kernels:
The patch is required for all kernels that include the commit 6d07b68ce16a
("ipc/sem.c: optimize sem_lock()") (3.10?)

The alternative is to revert the patch that introduced the race.

Background:
Here is the race of the current implementation:

Thread A: (simple op)
- does the first "sma->complex_count == 0" test

Thread B: (complex op)
- does sem_lock(): This includes an array scan. But the scan can't
  find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count,
  drops sem_lock, sleeps

Thread A:
- spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
- sleeps before the complex_count test

Thread C: (complex op)
- does sem_lock (no array scan, complex_count==1)
- wakes up Thread B.
- decrements complex_count

Thread A:
- does the complex_count test

Bug:
Now both thread A and thread C operate on the same array, without
any synchronization.

Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
Reported-by: felixh@informatik.uni-bremen.de
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
---

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 976ce3a..d0efd6e 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -21,6 +21,7 @@ struct sem_array {
 	struct list_head	list_id;	/* undo requests on this array */
 	int			sem_nsems;	/* no. of semaphores in array */
 	int			complex_count;	/* pending complex operations */
+	bool			complex_mode;	/* no parallel simple ops */
 };
 
 #ifdef CONFIG_SYSVIPC
diff --git a/ipc/sem.c b/ipc/sem.c
index ae72b3c..db2e6fc 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 
 /*
  * Locking:
+ * a) global sem_lock() for read/write
  *	sem_undo.id_next,
  *	sem_array.complex_count,
- *	sem_array.pending{_alter,_cont},
- *	sem_array.sem_undo: global sem_lock() for read/write
- *	sem_undo.proc_next: only "current" is allowed to read/write that field.
+ *	sem_array.complex_mode
+ *	sem_array.pending{_alter,_const},
+ *	sem_array.sem_undo
  *
+ * b) global or semaphore sem_lock() for read/write:
  *	sem_array.sem_base[i].pending_{const,alter}:
- *		global or semaphore sem_lock() for read/write
+ *	sem_array.complex_mode (for read)
+ *
+ * c) special:
+ *	sem_undo_list.list_proc:
+ *	* undo_list->lock for write
+ *	* rcu for read
  */
 
 #define sc_semmsl	sem_ctls[0]
@@ -260,23 +267,25 @@ static void sem_rcu_free(struct rcu_head *head)
 }
 
 /*
- * Wait until all currently ongoing simple ops have completed.
+ * Enter the mode suitable for non-simple operations:
  * Caller must own sem_perm.lock.
- * New simple ops cannot start, because simple ops first check
- * that sem_perm.lock is free.
- * that a) sem_perm.lock is free and b) complex_count is 0.
  */
-static void sem_wait_array(struct sem_array *sma)
+static void complexmode_enter(struct sem_array *sma)
 {
 	int i;
 	struct sem *sem;
 
-	if (sma->complex_count)  {
-		/* The thread that increased sma->complex_count waited on
-		 * all sem->lock locks. Thus we don't need to wait again.
-		 */
+	if (sma->complex_mode)  {
+		/* We are already in complex_mode. Nothing to do */
 		return;
 	}
+	WRITE_ONCE(sma->complex_mode, true);
+
+	/* We need a full barrier:
+	 * The write to complex_mode must be visible
+	 * before we read the first sem->lock spinlock state.
+	 */
+	smp_mb();
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
@@ -285,6 +294,29 @@ static void sem_wait_array(struct sem_array *sma)
 }
 
 /*
+ * Try to leave the mode that disallows simple operations:
+ * Caller must own sem_perm.lock.
+ */
+static void complexmode_tryleave(struct sem_array *sma)
+{
+	if (sma->complex_count)  {
+		/* Complex ops are sleeping.
+		 * We must stay in complex mode
+		 */
+		return;
+	}
+	/*
+	 * Immediately after setting complex_mode to false,
+	 * a simple op can start. Thus: all memory writes
+	 * performed by the current operation must be visible
+	 * before we set complex_mode to false.
+	 */
+	smp_wmb();
+
+	WRITE_ONCE(sma->complex_mode, false);
+}
+
+/*
  * If the request contains only one semaphore operation, and there are
  * no complex transactions pending, lock only the semaphore involved.
  * Otherwise, lock the entire semaphore array, since we either have
@@ -300,56 +332,38 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Complex operation - acquire a full lock */
 		ipc_lock_object(&sma->sem_perm);
 
-		/* And wait until all simple ops that are processed
-		 * right now have dropped their locks.
-		 */
-		sem_wait_array(sma);
+		/* Prevent parallel simple ops */
+		complexmode_enter(sma);
 		return -1;
 	}
 
 	/*
 	 * Only one semaphore affected - try to optimize locking.
-	 * The rules are:
-	 * - optimized locking is possible if no complex operation
-	 *   is either enqueued or processed right now.
-	 * - The test for enqueued complex ops is simple:
-	 *      sma->complex_count != 0
-	 * - Testing for complex ops that are processed right now is
-	 *   a bit more difficult. Complex ops acquire the full lock
-	 *   and first wait that the running simple ops have completed.
-	 *   (see above)
-	 *   Thus: If we own a simple lock and the global lock is free
-	 *	and complex_count is now 0, then it will stay 0 and
-	 *	thus just locking sem->lock is sufficient.
+	 * Optimized locking is possible if no complex operation
+	 * is either enqueued or processed right now.
+	 *
+	 * Both facts are tracked by complex_mode.
 	 */
 	sem = sma->sem_base + sops->sem_num;
 
-	if (sma->complex_count == 0) {
+	/*
+	 * Initial check for complex_mode. Just an optimization,
+	 * no locking.
+	 */
+	if (!READ_ONCE(sma->complex_mode)) {
 		/*
 		 * It appears that no complex operation is around.
 		 * Acquire the per-semaphore lock.
 		 */
 		spin_lock(&sem->lock);
 
-		/* Then check that the global lock is free */
-		if (!spin_is_locked(&sma->sem_perm.lock)) {
-			/*
-			 * We need a memory barrier with acquire semantics,
-			 * otherwise we can race with another thread that does:
-			 *	complex_count++;
-			 *	spin_unlock(sem_perm.lock);
-			 */
-			smp_acquire__after_ctrl_dep();
-
-			/*
-			 * Now repeat the test of complex_count:
-			 * It can't change anymore until we drop sem->lock.
-			 * Thus: if is now 0, then it will stay 0.
-			 */
-			if (sma->complex_count == 0) {
-				/* fast path successful! */
-				return sops->sem_num;
-			}
+		/* Now repeat the test for complex_mode.
+		 * A memory barrier is provided by the spin_lock()
+		 * above.
+		 */
+		if (!READ_ONCE(sma->complex_mode)) {
+			/* fast path successful! */
+			return sops->sem_num;
 		}
 		spin_unlock(&sem->lock);
 	}
@@ -369,7 +383,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Not a false alarm, thus complete the sequence for a
 		 * full lock.
 		 */
-		sem_wait_array(sma);
+		complexmode_enter(sma);
 		return -1;
 	}
 }
@@ -378,6 +392,7 @@ static inline void sem_unlock(struct sem_array *sma, int locknum)
 {
 	if (locknum == -1) {
 		unmerge_queues(sma);
+		complexmode_tryleave(sma);
 		ipc_unlock_object(&sma->sem_perm);
 	} else {
 		struct sem *sem = sma->sem_base + locknum;
@@ -529,6 +544,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	}
 
 	sma->complex_count = 0;
+	sma->complex_mode = true; /* dropped by sem_unlock below */
 	INIT_LIST_HEAD(&sma->pending_alter);
 	INIT_LIST_HEAD(&sma->pending_const);
 	INIT_LIST_HEAD(&sma->list_id);
@@ -2184,10 +2200,10 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 	/*
 	 * The proc interface isn't aware of sem_lock(), it calls
 	 * ipc_lock_object() directly (in sysvipc_find_ipc).
-	 * In order to stay compatible with sem_lock(), we must wait until
-	 * all simple semop() calls have left their critical regions.
+	 * In order to stay compatible with sem_lock(), we must
+	 * enter / leave complex_mode.
 	 */
-	sem_wait_array(sma);
+	complexmode_enter(sma);
 
 	sem_otime = get_semotime(sma);
 
@@ -2204,6 +2220,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 		   sem_otime,
 		   sma->sem_ctime);
 
+	complexmode_tryleave(sma);
+
 	return 0;
 }
 #endif

^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-07-16  1:27 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-13  5:06 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul
2016-07-13  5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
2016-07-13  5:06   ` [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers Manfred Spraul
2016-07-13 16:16     ` Davidlohr Bueso
2016-07-13 18:37       ` Manfred Spraul
2016-07-16  1:27   ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Davidlohr Bueso
2016-07-13 22:05 ` [PATCH 0/2] ipc/sem.c: sem_lock fixes Andrew Morton
2016-07-14 16:40   ` Manfred Spraul
  -- strict thread matches above, loose matches on Subject: below --
2016-06-25 17:37 Manfred Spraul
2016-06-25 17:37 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
2016-06-25 19:41   ` Holger Hoffstätte
2016-06-15  5:23 linux-next: manual merge of the akpm-current tree with the tip tree Stephen Rothwell
2016-06-18 20:02 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
2016-06-18 20:02   ` Manfred Spraul
2016-06-20 23:04   ` Andrew Morton
2016-06-20 23:04     ` Andrew Morton
2016-06-23 18:55     ` Manfred Spraul
2016-06-21  0:30   ` Davidlohr Bueso
2016-06-23 19:22     ` Manfred Spraul
2016-06-28  5:27       ` Davidlohr Bueso
2016-06-30 19:28         ` Manfred Spraul
2016-07-01 16:52           ` Davidlohr Bueso

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.