linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] ipc/sem.c: sem_lock fixes
@ 2016-10-01 18:40 Manfred Spraul
  2016-10-01 18:40 ` [PATCH 1/2] ipc/sem.c: Avoid using spin_unlock_wait() Manfred Spraul
  2016-10-01 18:40 ` [PATCH 2/2] ipc/sem: Add hysteresis Manfred Spraul
  0 siblings, 2 replies; 7+ messages in thread
From: Manfred Spraul @ 2016-10-01 18:40 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, 1vier1,
	felixh, Manfred Spraul

Hi Andrew, Hi Peter, Hi Davidlohr,

New idea for ipc/sem:
The ACQUIRE from spin_lock() will continue to apply only for the load,
not for the store.

Thus: If we don't want to add arch dependencies into ipc/sem, the only
safe option is to use spin_lock()/spin_unlock() instead of spin_unlock_wait().

Or we must stay with the current code, which is a ~9% regression.
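
A condensed sketch of the two racing paths (the existing barriers are
stripped out here, purely for illustration):

        /* complex op, complexmode_enter(), for each sem: */
        sma->complex_mode = true;
        spin_unlock_wait(&sem->lock);   /* wait for simple ops to drain */

        /* simple op, sem_lock() fast path: */
        spin_lock(&sem->lock);
        if (!smp_load_acquire(&sma->complex_mode))
                return sops->sem_num;   /* per-semaphore lock only */

If the ACQUIRE of spin_lock() orders only the load side, then the store
that makes sem->lock visibly held need not be visible before the load of
complex_mode, so both sides can take their fast path at the same time
unless an arch-dependent smp_mb() stays after the spin_lock().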

Thus:
- Patch 1 replaces spin_unlock_wait() with spin_lock()/spin_unlock() and
  removes all memory barriers that are then unnecessary.

- Patch 2 adds the hysteresis code.

What do you think?

The patches passed stress-testing.

Andrew: Could you add the series to mmots? Perhaps aiming for 4.10.

--
        Manfred

* [PATCH 1/2] ipc/sem.c: Avoid using spin_unlock_wait()
  2016-10-01 18:40 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul
@ 2016-10-01 18:40 ` Manfred Spraul
  2016-10-01 18:40 ` [PATCH 2/2] ipc/sem: Add hysteresis Manfred Spraul
  1 sibling, 0 replies; 7+ messages in thread
From: Manfred Spraul @ 2016-10-01 18:40 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, 1vier1,
	felixh, Manfred Spraul

a) The ACQUIRE in spin_lock() applies to the read, not to the store,
at least for powerpc. This forces us to add an smp_mb() into the fast
path.

b) The memory barrier provided by spin_unlock_wait() is right now
arch dependent.

Therefore: Use spin_lock()/spin_unlock() instead of spin_unlock_wait().

Advantage: faster single-op semop() calls, observed +8.9% on
x86. (The other solution would be arch dependencies in ipc/sem.)

Disadvantage: slower complex-op semop() calls, if (and only if)
there are no sleeping operations.

The next patch adds hysteresis; this further reduces the
probability that the slow path is used.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c | 25 +++----------------------
 1 file changed, 3 insertions(+), 22 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 5e318c5..d5f2710 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -280,24 +280,13 @@ static void complexmode_enter(struct sem_array *sma)
 		return;
 	}
 
-	/* We need a full barrier after seting complex_mode:
-	 * The write to complex_mode must be visible
-	 * before we read the first sem->lock spinlock state.
-	 */
-	smp_store_mb(sma->complex_mode, true);
+	sma->complex_mode = true;
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
-		spin_unlock_wait(&sem->lock);
+		spin_lock(&sem->lock);
+		spin_unlock(&sem->lock);
 	}
-	/*
-	 * spin_unlock_wait() is not a memory barriers, it is only a
-	 * control barrier. The code must pair with spin_unlock(&sem->lock),
-	 * thus just the control barrier is insufficient.
-	 *
-	 * smp_rmb() is sufficient, as writes cannot pass the control barrier.
-	 */
-	smp_rmb();
 }
 
 /*
@@ -363,14 +352,6 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		 */
 		spin_lock(&sem->lock);
 
-		/*
-		 * See 51d7d5205d33
-		 * ("powerpc: Add smp_mb() to arch_spin_is_locked()"):
-		 * A full barrier is required: the write of sem->lock
-		 * must be visible before the read is executed
-		 */
-		smp_mb();
-
 		if (!smp_load_acquire(&sma->complex_mode)) {
 			/* fast path successful! */
 			return sops->sem_num;
-- 
2.7.4

* [PATCH 2/2] ipc/sem: Add hysteresis.
  2016-10-01 18:40 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul
  2016-10-01 18:40 ` [PATCH 1/2] ipc/sem.c: Avoid using spin_unlock_wait() Manfred Spraul
@ 2016-10-01 18:40 ` Manfred Spraul
  1 sibling, 0 replies; 7+ messages in thread
From: Manfred Spraul @ 2016-10-01 18:40 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, 1vier1,
	felixh, Manfred Spraul

sysv sem has two lock modes: one with per-semaphore locks, and one with
a single global lock for the whole array.
When switching from the per-semaphore locks to the global lock, all
per-semaphore locks must be scanned for ongoing operations.

The patch adds hysteresis for switching back from the global lock to the
per-semaphore locks. This reduces how often the per-semaphore locks
must be scanned.

Compared to the initial patch, this is a simplified solution:
Setting USE_GLOBAL_LOCK_HYSTERESIS to 1 restores the current behavior.

In theory, a workload with exactly 10 simple sops and then
one complex op now scales a bit worse, but this is pure theory:
if there is concurrency, then it won't be exactly 10:1:10:1:10:1:...;
if there is no concurrency, then there is no need for scalability.
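
As a rough illustration of the intended behaviour (the counts are
approximate, the exact rules are in the code below):

        complex op:             use_global_lock = 10, all sem->lock scanned once
        next ~9 simple ops:     still take the global lock, each unlock
                                decrements the counter, no scanning
        counter hits 1 -> 0:    smp_store_release(), fast path re-enabled
        further simple ops:     per-semaphore locks again, until the next
                                complex op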

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 include/linux/sem.h |  2 +-
 ipc/sem.c           | 87 ++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 63 insertions(+), 26 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index d0efd6e..4fc222f 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -21,7 +21,7 @@ struct sem_array {
 	struct list_head	list_id;	/* undo requests on this array */
 	int			sem_nsems;	/* no. of semaphores in array */
 	int			complex_count;	/* pending complex operations */
-	bool			complex_mode;	/* no parallel simple ops */
+	unsigned int		use_global_lock;/* >0: global lock required */
 };
 
 #ifdef CONFIG_SYSVIPC
diff --git a/ipc/sem.c b/ipc/sem.c
index d5f2710..0982c4d 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -161,22 +161,43 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #define SEMOPM_FAST	64  /* ~ 372 bytes on stack */
 
 /*
+ * Switching from the mode suitable for simple ops
+ * to the mode for complex ops is costly. Therefore:
+ * use some hysteresis
+ */
+#define USE_GLOBAL_LOCK_HYSTERESIS	10
+
+/*
  * Locking:
  * a) global sem_lock() for read/write
  *	sem_undo.id_next,
  *	sem_array.complex_count,
- *	sem_array.complex_mode
  *	sem_array.pending{_alter,_const},
  *	sem_array.sem_undo
  *
  * b) global or semaphore sem_lock() for read/write:
  *	sem_array.sem_base[i].pending_{const,alter}:
- *	sem_array.complex_mode (for read)
+ *	sem_array.use_global_lock(for read)
  *
  * c) special:
  *	sem_undo_list.list_proc:
  *	* undo_list->lock for write
  *	* rcu for read
+ *	use_global_lock:
+ *	* global sem_lock() for write
+ *	* either local or global sem_lock() for read.
+ *
+ * Memory ordering:
+ * Most ordering is enforced by using spin_lock() and spin_unlock().
+ * The special case is use_global_lock:
+ * Setting it from non-zero to 0 is a RELEASE, this is ensured by
+ * using smp_store_release().
+ * Testing if it is non-zero is an ACQUIRE, this is ensured by using
+ * smp_load_acquire().
+ * Setting it from 0 to non-zero must be ordered with regards to
+ * this smp_load_acquire(), this is guaranteed because the smp_load_acquire()
+ * is inside a spin_lock() and after a write from 0 to non-zero a
+ * spin_lock()+spin_unlock() is done.
  */
 
 #define sc_semmsl	sem_ctls[0]
@@ -275,12 +296,16 @@ static void complexmode_enter(struct sem_array *sma)
 	int i;
 	struct sem *sem;
 
-	if (sma->complex_mode)  {
-		/* We are already in complex_mode. Nothing to do */
+	if (sma->use_global_lock > 0)  {
+		/*
+		 * We are already in global lock mode.
+		 * Nothing to do, just reset the
+		 * counter until we return to simple mode.
+		 */
+		sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS;
 		return;
 	}
-
-	sma->complex_mode = true;
+	sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS;
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
@@ -301,13 +326,17 @@ static void complexmode_tryleave(struct sem_array *sma)
 		 */
 		return;
 	}
-	/*
-	 * Immediately after setting complex_mode to false,
-	 * a simple op can start. Thus: all memory writes
-	 * performed by the current operation must be visible
-	 * before we set complex_mode to false.
-	 */
-	smp_store_release(&sma->complex_mode, false);
+	if (sma->use_global_lock == 1) {
+		/*
+		 * Immediately after setting use_global_lock to 0,
+		 * a simple op can start. Thus: all memory writes
+		 * performed by the current operation must be visible
+		 * before we set use_global_lock to 0.
+		 */
+		smp_store_release(&sma->use_global_lock, 0);
+	} else {
+		sma->use_global_lock--;
+	}
 }
 
 #define SEM_GLOBAL_LOCK	(-1)
@@ -337,22 +366,23 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 	 * Optimized locking is possible if no complex operation
 	 * is either enqueued or processed right now.
 	 *
-	 * Both facts are tracked by complex_mode.
+	 * Both facts are tracked by use_global_lock.
 	 */
 	sem = sma->sem_base + sops->sem_num;
 
 	/*
-	 * Initial check for complex_mode. Just an optimization,
+	 * Initial check for use_global_lock. Just an optimization,
 	 * no locking, no memory barrier.
 	 */
-	if (!sma->complex_mode) {
+	if (!sma->use_global_lock) {
 		/*
 		 * It appears that no complex operation is around.
 		 * Acquire the per-semaphore lock.
 		 */
 		spin_lock(&sem->lock);
 
-		if (!smp_load_acquire(&sma->complex_mode)) {
+		/* pairs with smp_store_release() */
+		if (!smp_load_acquire(&sma->use_global_lock)) {
 			/* fast path successful! */
 			return sops->sem_num;
 		}
@@ -362,19 +392,26 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 	/* slow path: acquire the full lock */
 	ipc_lock_object(&sma->sem_perm);
 
-	if (sma->complex_count == 0) {
-		/* False alarm:
-		 * There is no complex operation, thus we can switch
-		 * back to the fast path.
+	if (sma->use_global_lock == 0) {
+		/*
+		 * The use_global_lock mode ended while we waited for
+		 * sma->sem_perm.lock. Thus we must switch to locking
+		 * with sem->lock.
+		 * Unlike in the fast path, there is no need to recheck
+		 * sma->use_global_lock after we have acquired sem->lock:
+		 * We own sma->sem_perm.lock, thus use_global_lock cannot
+		 * change.
 		 */
 		spin_lock(&sem->lock);
+
 		ipc_unlock_object(&sma->sem_perm);
 		return sops->sem_num;
 	} else {
-		/* Not a false alarm, thus complete the sequence for a
-		 * full lock.
+		/*
+		 * Not a false alarm, thus continue to use the global lock
+		 * mode. No need for complexmode_enter(), this was done by
+		 * the caller that has set use_global_lock to non-zero.
 		 */
-		complexmode_enter(sma);
 		return SEM_GLOBAL_LOCK;
 	}
 }
@@ -535,7 +572,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	}
 
 	sma->complex_count = 0;
-	sma->complex_mode = true; /* dropped by sem_unlock below */
+	WRITE_ONCE(sma->use_global_lock, USE_GLOBAL_LOCK_HYSTERESIS);
 	INIT_LIST_HEAD(&sma->pending_alter);
 	INIT_LIST_HEAD(&sma->pending_const);
 	INIT_LIST_HEAD(&sma->list_id);
-- 
2.7.4

* Re: [PATCH 0/2] ipc/sem.c: sem_lock fixes
  2016-07-13 22:05 ` Andrew Morton
@ 2016-07-14 16:40   ` Manfred Spraul
  0 siblings, 0 replies; 7+ messages in thread
From: Manfred Spraul @ 2016-07-14 16:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: H. Peter Anvin, Peter Zijlstra, Davidlohr Bueso, LKML,
	Thomas Gleixner, Ingo Molnar, 1vier1, felixh

Hi Andrew,

On 07/14/2016 12:05 AM, Andrew Morton wrote:
> On Wed, 13 Jul 2016 07:06:50 +0200 Manfred Spraul <manfred@colorfullife.com> wrote:
>
>> Hi Andrew, Hi Peter,
>>
>> next version of the sem_lock() fixes:
>> The patches are again vs. tip.
>>
>> Patch 1 is ready for merging, Patch 2 is for review.
>>
>> - Patch 1 is the patch as in -next since January
>>    It fixes the race that was found by Felix.
>> - Patch 2 removes the memory barriers that are part of the qspinlock
>>    code.
>> - (The hysteresis patch would be patch 3. The risk of regressions
>>    can't be ruled out, thus it must wait for benchmarks from real
>>    workload tests)
> I think you're saying that if these two patches cause performance
> regressions, we will need ipc-sem-sem_lock-with-hysteresis.patch?
No, the two patches will not cause any regressions.

Commit 6062a8dc051 ("ipc,sem: fine grained locking for semtimedop") was 
a huge scalability improvement for 99% of the users, but introduced a 
regression for one workload.
No one complained about it, so the workload must be rare.
Patch 3 now fixes the regression.
But we have lived with that regression for 3 years, so give me (and
Davidlohr, and anyone else who wants to help) some more time.
Also on my TODO list: the queue merge/unmerge logic, i.e. commit
f269f40ad5ae ("ipc/sem.c: always use only one queue for alter
operations"), might also benefit from some hysteresis.

My proposal:
- Patch 1: merge towards the main tree.
- Patch 2: needs an update. The barrier in sem_lock() is required. I'm
  not yet sure about complexmode_enter(), perhaps with #ifndef
  CONFIG_QUEUED_SPINLOCKS (roughly as sketched below).
- Patch 3 should wait.
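
Purely as an illustration of the complexmode_enter() point above (a
sketch, not a worked-out patch):

        smp_store_mb(sma->complex_mode, true);

        for (i = 0; i < sma->sem_nsems; i++)
                spin_unlock_wait(&sma->sem_base[i].lock);

#ifndef CONFIG_QUEUED_SPINLOCKS
        /* only where spin_unlock_wait() is a pure control barrier */
        smp_rmb();
#endif

Whether the barrier really can be dropped in the qspinlock case is
exactly the part that still needs checking.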

--

     Manfred

* Re: [PATCH 0/2] ipc/sem.c: sem_lock fixes
  2016-07-13  5:06 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul
@ 2016-07-13 22:05 ` Andrew Morton
  2016-07-14 16:40   ` Manfred Spraul
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2016-07-13 22:05 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: H. Peter Anvin, Peter Zijlstra, Davidlohr Bueso, LKML,
	Thomas Gleixner, Ingo Molnar, 1vier1, felixh

On Wed, 13 Jul 2016 07:06:50 +0200 Manfred Spraul <manfred@colorfullife.com> wrote:

> Hi Andrew, Hi Peter,
> 
> next version of the sem_lock() fixes:
> The patches are again vs. tip.
> 
> Patch 1 is ready for merging, Patch 2 is for review.
> 
> - Patch 1 is the patch as in -next since January
>   It fixes the race that was found by Felix.
> - Patch 2 removes the memory barriers that are part of the qspinlock
>   code.
> - (The hysteresis patch would be patch 3. The risk of regressions
>   can't be ruled out, thus it must wait for benchmarks from real
>   workload tests)

I think you're saying that if these two patches cause performance
regressions, we will need ipc-sem-sem_lock-with-hysteresis.patch?

Is that even necessary?  If your testing shows that
ipc-sem-sem_lock-with-hysteresis.patch makes things faster, then in it
goes, surely?

* [PATCH 0/2] ipc/sem.c: sem_lock fixes
@ 2016-07-13  5:06 Manfred Spraul
  2016-07-13 22:05 ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Manfred Spraul @ 2016-07-13  5:06 UTC (permalink / raw)
  To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul

Hi Andrew, Hi Peter,

next version of the sem_lock() fixes:
The patches are again vs. tip.

Patch 1 is ready for merging, Patch 2 is for review.

- Patch 1 is the patch as in -next since January
  It fixes the race that was found by Felix.
- Patch 2 removes the memory barriers that are part of the qspinlock
  code.
- (The hysteresis patch would be patch 3. The risk of regressions
  can't be ruled out, thus it must wait for benchmarks from real
  workload tests)

Patches 1+2 were one patch; I've split them so that review and
backporting are simpler.

--
        Manfred

* [PATCH 0/2] ipc/sem.c: sem_lock fixes
@ 2016-06-25 17:37 Manfred Spraul
  0 siblings, 0 replies; 7+ messages in thread
From: Manfred Spraul @ 2016-06-25 17:37 UTC (permalink / raw)
  To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul

Hi Andrew, Hi Peter,

next version of the sem_lock() fixes / improvement:
The patches are now vs. tip.

Patch 1 is ready for merging, patch 2 is new and for discussion.

Patch 1 fixes the race that was found by Felix.
It also adds smp_mb() to fully synchronize

        WRITE_ONCE(status, 1);
        <<< smp_mb();
        spin_unlock_wait();

vs.
        spin_lock();
        <<< smp_mb();
        READ_ONCE(status);

Patch 2 tries to close a performance regression:
        sem_wait_array() must perform a spin_unlock_wait() for every
        semaphore in the array. If the array is large, then this is
        slow.
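
        For reference, the loop in question is roughly the following
        (a simplified sketch, not necessarily the exact code the
        numbers below were taken on):

                if (sma->complex_count)
                        return;         /* a complex op already waited */

                for (i = 0; i < sma->sem_nsems; i++)
                        spin_unlock_wait(&sma->sem_base[i].lock);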

        Two open points:
        - Do we need it? A factor-20 improvement is nice, but is the
          test realistic?
        - COMPLEX_MODE_ENTER=10 is a magic number, without any
          rationale. (2+sem_nsems/128) would be another option,
          but also arbitrary.
 
        Test:
         https://github.com/manfred-colorfu/ipcscale
         # ./sem-scalebench -o 4 -f -m 0 -h 4 -i 1 -p 2 -c 8 -t 5 -d 500
          Before: 214732
          After: 4008635

--
        Manfred
