* [PATCH 0/2] ipc/sem.c: sem_lock fixes
@ 2016-10-01 18:40 Manfred Spraul
  2016-10-01 18:40 ` [PATCH 1/2] ipc/sem.c: Avoid using spin_unlock_wait() Manfred Spraul
  2016-10-01 18:40 ` [PATCH 2/2] ipc/sem: Add hysteresis Manfred Spraul
  0 siblings, 2 replies; 4+ messages in thread
From: Manfred Spraul @ 2016-10-01 18:40 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, 1vier1,
	felixh, Manfred Spraul

Hi Andrew, Hi Peter, Hi Davidlohr,

New idea for ipc/sem:
The ACQUIRE from spin_lock() will continue to apply only to the load,
not to the store.

Thus: if we don't want to add arch dependencies to ipc/sem, the only
safe option is to use spin_lock()/spin_unlock() instead of spin_unlock_wait().

The alternative is to stay with the current code, which is a ~9% regression.
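
To illustrate the handshake, here is a rough user-space model (pthread
mutexes plus C11 atomics; complex_mode mirrors the ipc/sem.c field,
everything else is simplified - this is not the kernel code):

/* user-space model only, not ipc/sem.c; build: gcc -O2 -pthread model.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t sem_lock = PTHREAD_MUTEX_INITIALIZER; /* one per-semaphore lock */
static atomic_bool complex_mode = false;                     /* models sma->complex_mode */

/* complex-op side: make sure every simple op notices complex_mode */
static void complexmode_enter(void)
{
	atomic_store_explicit(&complex_mode, true, memory_order_relaxed);
	/*
	 * lock+unlock instead of spin_unlock_wait(): a simple op that
	 * already owns the lock has finished by the time we acquire it,
	 * and any later simple op sees complex_mode == true because its
	 * lock acquisition orders after our unlock.
	 */
	pthread_mutex_lock(&sem_lock);
	pthread_mutex_unlock(&sem_lock);
}

/* simple-op side: per-semaphore fast path with a re-check under the lock */
static bool simple_op(void)
{
	pthread_mutex_lock(&sem_lock);
	if (!atomic_load_explicit(&complex_mode, memory_order_acquire)) {
		/* fast path: single-semaphore work would go here */
		pthread_mutex_unlock(&sem_lock);
		return true;
	}
	pthread_mutex_unlock(&sem_lock);
	return false;	/* caller would fall back to the global lock */
}

int main(void)
{
	printf("fast path: %d\n", simple_op());	/* 1 */
	complexmode_enter();
	printf("fast path: %d\n", simple_op());	/* 0 */
	return 0;
}

The point of the sketch: once both sides take the same lock, no
standalone smp_mb() is needed, which is what patch 1 exploits.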

Thus:
- Patch 1 replaces spin_unlock_wait() with spin_lock()/spin_unlock() and
  removes all memory barriers that are then unnecessary.

- Patch 2 adds the hysteresis code.

What do you think?

The patches passed stress-testing.

Andrew: Could you add the series to mmots, perhaps aiming for 4.10?

--
        Manfred


* [PATCH 1/2] ipc/sem.c: Avoid using spin_unlock_wait()
  2016-10-01 18:40 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul
@ 2016-10-01 18:40 ` Manfred Spraul
  2016-10-01 18:40 ` [PATCH 2/2] ipc/sem: Add hysteresis Manfred Spraul
  1 sibling, 0 replies; 4+ messages in thread
From: Manfred Spraul @ 2016-10-01 18:40 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, 1vier1,
	felixh, Manfred Spraul

a) The ACQUIRE in spin_lock() applies to the read, not to the store,
at least for powerpc. This forces us to add an smp_mb() to the fast
path.

b) The memory barrier provided by spin_unlock_wait() is currently
arch dependent.

Therefore: Use spin_lock()/spin_unlock() instead of spin_unlock_wait().

Advantage: faster single-op semop() calls, observed +8.9% on
x86. (The alternative would be arch dependencies in ipc/sem.)

Disadvantage: slower complex-op semop() calls, if (and only if)
there are no sleeping operations.

The next patch adds hysteresis; this further reduces the
probability that the slow path is used.
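
The single-op workload can be exercised with a trivial user-space loop
like the one below (a sketch only; this is not the benchmark behind the
+8.9% figure, and the loop count is arbitrary):

/* illustrative micro-loop, not the benchmark used for the quoted numbers */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>

int main(void)
{
	int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
	struct sembuf up = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
	struct sembuf down = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
	struct timespec t0, t1;
	long i, loops = 1000000;

	if (id < 0) {
		perror("semget");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < loops; i++) {
		/* one sembuf per call: these are simple ops, fast path */
		semop(id, &up, 1);
		semop(id, &down, 1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%.1f ns per semop() pair\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / loops);
	semctl(id, 0, IPC_RMID);
	return 0;
}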

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c | 25 +++----------------------
 1 file changed, 3 insertions(+), 22 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 5e318c5..d5f2710 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -280,24 +280,13 @@ static void complexmode_enter(struct sem_array *sma)
 		return;
 	}
 
-	/* We need a full barrier after seting complex_mode:
-	 * The write to complex_mode must be visible
-	 * before we read the first sem->lock spinlock state.
-	 */
-	smp_store_mb(sma->complex_mode, true);
+	sma->complex_mode = true;
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
-		spin_unlock_wait(&sem->lock);
+		spin_lock(&sem->lock);
+		spin_unlock(&sem->lock);
 	}
-	/*
-	 * spin_unlock_wait() is not a memory barriers, it is only a
-	 * control barrier. The code must pair with spin_unlock(&sem->lock),
-	 * thus just the control barrier is insufficient.
-	 *
-	 * smp_rmb() is sufficient, as writes cannot pass the control barrier.
-	 */
-	smp_rmb();
 }
 
 /*
@@ -363,14 +352,6 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		 */
 		spin_lock(&sem->lock);
 
-		/*
-		 * See 51d7d5205d33
-		 * ("powerpc: Add smp_mb() to arch_spin_is_locked()"):
-		 * A full barrier is required: the write of sem->lock
-		 * must be visible before the read is executed
-		 */
-		smp_mb();
-
 		if (!smp_load_acquire(&sma->complex_mode)) {
 			/* fast path successful! */
 			return sops->sem_num;
-- 
2.7.4


* [PATCH 2/2] ipc/sem: Add hysteresis.
  2016-10-01 18:40 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul
  2016-10-01 18:40 ` [PATCH 1/2] ipc/sem.c: Avoid using spin_unlock_wait() Manfred Spraul
@ 2016-10-01 18:40 ` Manfred Spraul
  1 sibling, 0 replies; 4+ messages in thread
From: Manfred Spraul @ 2016-10-01 18:40 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Davidlohr Bueso
  Cc: LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, 1vier1,
	felixh, Manfred Spraul

sysv sem has two lock modes: one with per-semaphore locks and one
with a single global lock for the whole array.
When switching from the per-semaphore locks to the global lock, all
per-semaphore locks must be scanned for ongoing operations.

The patch adds hysteresis for switching back from the global lock to
the per-semaphore locks. This reduces how often the per-semaphore locks
must be scanned.

Compared to the initial patch, this is a simplified solution:
Setting USE_GLOBAL_LOCK_HYSTERESIS to 1 restores the current behavior.

In theory, a workload with exactly 10 simple sops followed by one
complex op now scales a bit worse, but this is pure theory:
if there is concurrency, the pattern won't be exactly 10:1:10:1:10:1:...;
if there is no concurrency, then there is no need for scalability.
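
The counter behaviour can be illustrated with a small user-space model
(a sketch; the names mirror ipc/sem.c, but the code is simplified, e.g.
the complex_count check and the per-semaphore lock scan are left out):

/* user-space model only, not ipc/sem.c */
#include <stdatomic.h>
#include <stdio.h>

#define USE_GLOBAL_LOCK_HYSTERESIS	10

struct sem_array_model {
	atomic_uint use_global_lock;	/* >0: global lock required */
};

/* called with the global lock held, when a complex op arrives */
static void complexmode_enter(struct sem_array_model *sma)
{
	/* re-arm (or start) the hysteresis counter */
	atomic_store_explicit(&sma->use_global_lock,
			      USE_GLOBAL_LOCK_HYSTERESIS,
			      memory_order_relaxed);
}

/* called with the global lock held, when a global-lock op completes */
static void complexmode_tryleave(struct sem_array_model *sma)
{
	unsigned int v = atomic_load_explicit(&sma->use_global_lock,
					      memory_order_relaxed);

	if (v == 1) {
		/* last grace op: simple ops may use per-sem locks again */
		atomic_store_explicit(&sma->use_global_lock, 0,
				      memory_order_release);
	} else if (v > 1) {
		atomic_store_explicit(&sma->use_global_lock, v - 1,
				      memory_order_relaxed);
	}
}

int main(void)
{
	struct sem_array_model sma = { .use_global_lock = 0 };
	int i;

	complexmode_enter(&sma);	/* one complex op: global mode */
	for (i = 0; i < USE_GLOBAL_LOCK_HYSTERESIS; i++)
		complexmode_tryleave(&sma);
	/* after 10 further global-lock ops the array is back to 0 */
	printf("use_global_lock = %u\n",
	       atomic_load(&sma.use_global_lock));
	return 0;
}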

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 include/linux/sem.h |  2 +-
 ipc/sem.c           | 87 ++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 63 insertions(+), 26 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index d0efd6e..4fc222f 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -21,7 +21,7 @@ struct sem_array {
 	struct list_head	list_id;	/* undo requests on this array */
 	int			sem_nsems;	/* no. of semaphores in array */
 	int			complex_count;	/* pending complex operations */
-	bool			complex_mode;	/* no parallel simple ops */
+	unsigned int		use_global_lock;/* >0: global lock required */
 };
 
 #ifdef CONFIG_SYSVIPC
diff --git a/ipc/sem.c b/ipc/sem.c
index d5f2710..0982c4d 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -161,22 +161,43 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #define SEMOPM_FAST	64  /* ~ 372 bytes on stack */
 
 /*
+ * Switching from the mode suitable for simple ops
+ * to the mode for complex ops is costly. Therefore:
+ * use some hysteresis
+ */
+#define USE_GLOBAL_LOCK_HYSTERESIS	10
+
+/*
  * Locking:
  * a) global sem_lock() for read/write
  *	sem_undo.id_next,
  *	sem_array.complex_count,
- *	sem_array.complex_mode
  *	sem_array.pending{_alter,_const},
  *	sem_array.sem_undo
  *
  * b) global or semaphore sem_lock() for read/write:
  *	sem_array.sem_base[i].pending_{const,alter}:
- *	sem_array.complex_mode (for read)
+ *	sem_array.use_global_lock (for read)
  *
  * c) special:
  *	sem_undo_list.list_proc:
  *	* undo_list->lock for write
  *	* rcu for read
+ *	use_global_lock:
+ *	* global sem_lock() for write
+ *	* either local or global sem_lock() for read.
+ *
+ * Memory ordering:
+ * Most ordering is enforced by using spin_lock() and spin_unlock().
+ * The special case is use_global_lock:
+ * Setting it from non-zero to 0 is a RELEASE; this is ensured by
+ * using smp_store_release().
+ * Testing if it is non-zero is an ACQUIRE; this is ensured by using
+ * smp_load_acquire().
+ * Setting it from 0 to non-zero must be ordered with regard to
+ * this smp_load_acquire(); this is guaranteed because the smp_load_acquire()
+ * is inside a spin_lock() and after a write from 0 to non-zero a
+ * spin_lock()+spin_unlock() is done.
  */
 
 #define sc_semmsl	sem_ctls[0]
@@ -275,12 +296,16 @@ static void complexmode_enter(struct sem_array *sma)
 	int i;
 	struct sem *sem;
 
-	if (sma->complex_mode)  {
-		/* We are already in complex_mode. Nothing to do */
+	if (sma->use_global_lock > 0)  {
+		/*
+		 * We are already in global lock mode.
+		 * Nothing to do, just reset the
+		 * counter until we return to simple mode.
+		 */
+		sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS;
 		return;
 	}
-
-	sma->complex_mode = true;
+	sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS;
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
@@ -301,13 +326,17 @@ static void complexmode_tryleave(struct sem_array *sma)
 		 */
 		return;
 	}
-	/*
-	 * Immediately after setting complex_mode to false,
-	 * a simple op can start. Thus: all memory writes
-	 * performed by the current operation must be visible
-	 * before we set complex_mode to false.
-	 */
-	smp_store_release(&sma->complex_mode, false);
+	if (sma->use_global_lock == 1) {
+		/*
+		 * Immediately after setting use_global_lock to 0,
+		 * a simple op can start. Thus: all memory writes
+		 * performed by the current operation must be visible
+		 * before we set use_global_lock to 0.
+		 */
+		smp_store_release(&sma->use_global_lock, 0);
+	} else {
+		sma->use_global_lock--;
+	}
 }
 
 #define SEM_GLOBAL_LOCK	(-1)
@@ -337,22 +366,23 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 	 * Optimized locking is possible if no complex operation
 	 * is either enqueued or processed right now.
 	 *
-	 * Both facts are tracked by complex_mode.
+	 * Both facts are tracked by use_global_lock.
 	 */
 	sem = sma->sem_base + sops->sem_num;
 
 	/*
-	 * Initial check for complex_mode. Just an optimization,
+	 * Initial check for use_global_lock. Just an optimization,
 	 * no locking, no memory barrier.
 	 */
-	if (!sma->complex_mode) {
+	if (!sma->use_global_lock) {
 		/*
 		 * It appears that no complex operation is around.
 		 * Acquire the per-semaphore lock.
 		 */
 		spin_lock(&sem->lock);
 
-		if (!smp_load_acquire(&sma->complex_mode)) {
+		/* pairs with smp_store_release() */
+		if (!smp_load_acquire(&sma->use_global_lock)) {
 			/* fast path successful! */
 			return sops->sem_num;
 		}
@@ -362,19 +392,26 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 	/* slow path: acquire the full lock */
 	ipc_lock_object(&sma->sem_perm);
 
-	if (sma->complex_count == 0) {
-		/* False alarm:
-		 * There is no complex operation, thus we can switch
-		 * back to the fast path.
+	if (sma->use_global_lock == 0) {
+		/*
+		 * The use_global_lock mode ended while we waited for
+		 * sma->sem_perm.lock. Thus we must switch to locking
+		 * with sem->lock.
+		 * Unlike in the fast path, there is no need to recheck
+		 * sma->use_global_lock after we have acquired sem->lock:
+		 * We own sma->sem_perm.lock, thus use_global_lock cannot
+		 * change.
 		 */
 		spin_lock(&sem->lock);
+
 		ipc_unlock_object(&sma->sem_perm);
 		return sops->sem_num;
 	} else {
-		/* Not a false alarm, thus complete the sequence for a
-		 * full lock.
+		/*
+		 * Not a false alarm, thus continue to use the global lock
+		 * mode. No need for complexmode_enter(), this was done by
+	 * the caller that has set use_global_lock to non-zero.
 		 */
-		complexmode_enter(sma);
 		return SEM_GLOBAL_LOCK;
 	}
 }
@@ -535,7 +572,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	}
 
 	sma->complex_count = 0;
-	sma->complex_mode = true; /* dropped by sem_unlock below */
+	WRITE_ONCE(sma->use_global_lock, USE_GLOBAL_LOCK_HYSTERESIS);
 	INIT_LIST_HEAD(&sma->pending_alter);
 	INIT_LIST_HEAD(&sma->pending_const);
 	INIT_LIST_HEAD(&sma->list_id);
-- 
2.7.4


* [PATCH 1/2] ipc/sem.c: Avoid using spin_unlock_wait()
  2016-10-19  4:38 ` [lkp] [ipc/sem.c] 5864a2fd30: aim9.shared_memory.ops_per_sec -13.0% Manfred Spraul
@ 2016-10-19  4:38   ` Manfred Spraul
  0 siblings, 0 replies; 4+ messages in thread
From: Manfred Spraul @ 2016-10-19  4:38 UTC (permalink / raw)
  To: LKML, Peter Zijlstra, Davidlohr Bueso
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, 1vier1,
	Andrew Morton, torvalds, xiaolong.ye, felixh, Manfred Spraul

a) The ACQUIRE in spin_lock() applies to the read, not to the store,
at least for powerpc. This forces us to add an smp_mb() to the fast
path.

b) The memory barrier provided by spin_unlock_wait() is currently
arch dependent.

Therefore: Use spin_lock()/spin_unlock() instead of spin_unlock_wait().

Advantage: faster single-op semop() calls, observed +8.9% on
x86. (The alternative would be arch dependencies in ipc/sem.)

Disadvantage: slower complex-op semop() calls, if (and only if)
there are no sleeping operations.

The next patch adds hysteresis; this further reduces the
probability that the slow path is used.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c | 25 +++----------------------
 1 file changed, 3 insertions(+), 22 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 5e318c5..d5f2710 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -280,24 +280,13 @@ static void complexmode_enter(struct sem_array *sma)
 		return;
 	}
 
-	/* We need a full barrier after seting complex_mode:
-	 * The write to complex_mode must be visible
-	 * before we read the first sem->lock spinlock state.
-	 */
-	smp_store_mb(sma->complex_mode, true);
+	sma->complex_mode = true;
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
-		spin_unlock_wait(&sem->lock);
+		spin_lock(&sem->lock);
+		spin_unlock(&sem->lock);
 	}
-	/*
-	 * spin_unlock_wait() is not a memory barriers, it is only a
-	 * control barrier. The code must pair with spin_unlock(&sem->lock),
-	 * thus just the control barrier is insufficient.
-	 *
-	 * smp_rmb() is sufficient, as writes cannot pass the control barrier.
-	 */
-	smp_rmb();
 }
 
 /*
@@ -363,14 +352,6 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		 */
 		spin_lock(&sem->lock);
 
-		/*
-		 * See 51d7d5205d33
-		 * ("powerpc: Add smp_mb() to arch_spin_is_locked()"):
-		 * A full barrier is required: the write of sem->lock
-		 * must be visible before the read is executed
-		 */
-		smp_mb();
-
 		if (!smp_load_acquire(&sma->complex_mode)) {
 			/* fast path successful! */
 			return sops->sem_num;
-- 
2.7.4


