* [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
From: Manfred Spraul @ 2013-06-10 17:16 UTC
  To: Andrew Morton
  Cc: LKML, Rik van Riel, Davidlohr Bueso, hhuang, Linus Torvalds,
	Mike Galbraith, Manfred Spraul

Hi Andrew,

I have cleaned up/improved my updates to SysV sem.
Could you replace my patches in -akpm with this series?

- 1: cacheline align output from ipc_rcu_alloc
- 2: cacheline align semaphore structures
- 3: separate-wait-for-zero-and-alter-tasks
- 4: Always-use-only-one-queue-for-alter-operations
- 5: Replace the global sem_otime with a distributed otime
- 6: Rename-try_atomic_semop-to-perform_atomic

The first two patches yield up to a ~30% performance improvement on a 2-core i3.

Patches 3 and 4 remove unnecessary loops in do_smart_update() and restore FIFO behavior.

I would expect the 5th patch to yield an improvement on multi-socket
systems, but I don't have a test setup.

Patch 6 is just a cleanup/function rename.

--
	Manfred


* [PATCH 1/6] ipc/util.c, ipc_rcu_alloc: cacheline align allocation
From: Manfred Spraul @ 2013-06-10 17:16 UTC
  To: Andrew Morton
  Cc: LKML, Rik van Riel, Davidlohr Bueso, hhuang, Linus Torvalds,
	Mike Galbraith, Manfred Spraul

Enforce that ipc_rcu_alloc returns a cacheline aligned pointer on SMP.

Rationale:
The SysV sem code tries to move the main spinlock into a separate cacheline
(____cacheline_aligned_in_smp). This works only if ipc_rcu_alloc returns
cacheline aligned pointers.
vmalloc and kmalloc return cacheline-aligned pointers; the implementation
of ipc_rcu_alloc breaks that.
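
For illustration only, a minimal userspace sketch of the same pattern
(hypothetical names, assuming a 64-byte cacheline; the real change is in
the diff below):

#include <stdlib.h>

#define CACHELINE 64	/* assumed cacheline size */

/* Header stored in front of the payload. Padding it to a full cacheline
 * guarantees that the payload that follows starts cacheline aligned. */
struct hdr {
	int refcount;
} __attribute__((aligned(CACHELINE)));

static void *demo_alloc(size_t size)
{
	struct hdr *h;

	/* sizeof(*h) is a multiple of CACHELINE due to the alignment */
	if (posix_memalign((void **)&h, CACHELINE, sizeof(*h) + size))
		return NULL;
	h->refcount = 1;
	return h + 1;	/* skip the header, as "return out+1" does below */
}

static struct hdr *demo_hdr(void *ptr)
{
	return (struct hdr *)ptr - 1;	/* as "((struct ipc_rcu*)ptr)-1" */
}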

Andrew: Could you merge it into -akpm and then forward it towards Linus' tree?

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/util.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 809ec5e..9623c8e 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -469,9 +469,7 @@ void ipc_free(void* ptr, int size)
 struct ipc_rcu {
 	struct rcu_head rcu;
 	atomic_t refcount;
-	/* "void *" makes sure alignment of following data is sane. */
-	void *data[0];
-};
+} ____cacheline_aligned_in_smp;
 
 /**
  *	ipc_rcu_alloc	-	allocate ipc and rcu space 
@@ -489,12 +487,14 @@ void *ipc_rcu_alloc(int size)
 	if (unlikely(!out))
 		return NULL;
 	atomic_set(&out->refcount, 1);
-	return out->data;
+	return out+1;
 }
 
 int ipc_rcu_getref(void *ptr)
 {
-	return atomic_inc_not_zero(&container_of(ptr, struct ipc_rcu, data)->refcount);
+	struct ipc_rcu *p = ((struct ipc_rcu*)ptr)-1;
+
+	return atomic_inc_not_zero(&p->refcount);
 }
 
 /**
@@ -508,7 +508,7 @@ static void ipc_schedule_free(struct rcu_head *head)
 
 void ipc_rcu_putref(void *ptr)
 {
-	struct ipc_rcu *p = container_of(ptr, struct ipc_rcu, data);
+	struct ipc_rcu *p = ((struct ipc_rcu*)ptr)-1;
 
 	if (!atomic_dec_and_test(&p->refcount))
 		return;
-- 
1.8.1.4



* [PATCH 2/6] ipc/sem.c: cacheline align the semaphore structures
From: Manfred Spraul @ 2013-06-10 17:16 UTC
  To: Andrew Morton
  Cc: LKML, Rik van Riel, Davidlohr Bueso, hhuang, Linus Torvalds,
	Mike Galbraith, Manfred Spraul

Now that each semaphore has its own spinlock and parallel operations
are possible, give each semaphore its own cacheline.

On an i3 laptop, this gives up to 28% better performance:

#semscale 10 | grep "interleave 2"
- before:
Cpus 1, interleave 2 delay 0: 36109234 in 10 secs
Cpus 2, interleave 2 delay 0: 55276317 in 10 secs
Cpus 3, interleave 2 delay 0: 62411025 in 10 secs
Cpus 4, interleave 2 delay 0: 81963928 in 10 secs

-after:
Cpus 1, interleave 2 delay 0: 35527306 in 10 secs
Cpus 2, interleave 2 delay 0: 70922909 in 10 secs <<< + 28%
Cpus 3, interleave 2 delay 0: 80518538 in 10 secs
Cpus 4, interleave 2 delay 0: 89115148 in 10 secs <<< + 8.7%

The i3 has 2 cores, with hyperthreading enabled.
Interleave 2 is used in order to load the full cores first.
HT partially hides the delay from cacheline thrashing, thus
the improvement is "only" 8.7% when 4 threads are running.
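
Not part of the patch: a self-contained userspace demo of the false
sharing effect that ____cacheline_aligned_in_smp avoids (assumes a
64-byte cacheline). Timing it with and without the pad member shows the
kind of penalty the numbers above measure in the kernel:

#include <pthread.h>
#include <stdio.h>

#define CACHELINE 64
#define ITERS 100000000UL

/* Two counters; the pad keeps them on separate cachelines.
 * Remove the pad to provoke false sharing between the threads. */
struct counters {
	volatile long a;
	char pad[CACHELINE - sizeof(long)];
	volatile long b;
};

static struct counters c;

static void *bump_a(void *arg)
{
	unsigned long i;

	(void)arg;
	for (i = 0; i < ITERS; i++)
		c.a++;
	return NULL;
}

static void *bump_b(void *arg)
{
	unsigned long i;

	(void)arg;
	for (i = 0; i < ITERS; i++)
		c.b++;
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, bump_a, NULL);
	pthread_create(&t2, NULL, bump_b, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("%ld %ld\n", c.a, c.b);
	return 0;
}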

Andrew: Could you merge it into -akpm and then forward it towards Linus' tree?

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 70480a3..1afbc57 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -96,7 +96,7 @@ struct sem {
 	int	sempid;		/* pid of last operation */
 	spinlock_t	lock;	/* spinlock for fine-grained semtimedop */
 	struct list_head sem_pending; /* pending single-sop operations */
-};
+} ____cacheline_aligned_in_smp;
 
 /* One queue for each sleeping process in the system. */
 struct sem_queue {
-- 
1.8.1.4




* [PATCH 3/6] ipc/sem: separate wait-for-zero and alter tasks into separate queues
From: Manfred Spraul @ 2013-06-10 17:16 UTC
  To: Andrew Morton
  Cc: LKML, Rik van Riel, Davidlohr Bueso, hhuang, Linus Torvalds,
	Mike Galbraith, Manfred Spraul

Introduce separate queues for operations that do not modify the
semaphore values.
Advantages:
- Simpler logic in check_restart().
- Faster update_queue(): Right now, all wait-for-zero operations
  are always tested, even if the semaphore value is not 0.
- wait-for-zero regains priority, as in Linux <= 3.0.9.
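
For reference, the two classes as seen from userspace; a minimal sketch
using the standard SysV API (the helper names are made up):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* wait-for-zero: sem_op == 0 blocks until semval reaches 0 and never
 * modifies the semaphore - these go on the pending_const lists. */
static int wait_for_zero(int semid, unsigned short num)
{
	struct sembuf sop = { .sem_num = num, .sem_op = 0, .sem_flg = 0 };

	return semop(semid, &sop, 1);
}

/* alter: sem_op != 0 modifies semval (decrements may block) - these go
 * on the pending_alter lists. */
static int decrement(int semid, unsigned short num)
{
	struct sembuf sop = { .sem_num = num, .sem_op = -1, .sem_flg = 0 };

	return semop(semid, &sop, 1);
}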

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 include/linux/sem.h |   5 +-
 ipc/sem.c           | 209 +++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 153 insertions(+), 61 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 53d4265..55e17f6 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -15,7 +15,10 @@ struct sem_array {
 	time_t			sem_otime;	/* last semop time */
 	time_t			sem_ctime;	/* last change time */
 	struct sem		*sem_base;	/* ptr to first semaphore in array */
-	struct list_head	sem_pending;	/* pending operations to be processed */
+	struct list_head	pending_alter;	/* pending operations */
+						/* that alter the array */
+	struct list_head	pending_const;	/* pending complex operations */
+						/* that do not alter semvals */
 	struct list_head	list_id;	/* undo requests on this array */
 	int			sem_nsems;	/* no. of semaphores in array */
 	int			complex_count;	/* pending complex operations */
diff --git a/ipc/sem.c b/ipc/sem.c
index 1afbc57..e7f3d64 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -95,7 +95,10 @@ struct sem {
 	int	semval;		/* current value */
 	int	sempid;		/* pid of last operation */
 	spinlock_t	lock;	/* spinlock for fine-grained semtimedop */
-	struct list_head sem_pending; /* pending single-sop operations */
+	struct list_head pending_alter; /* pending single-sop operations */
+					/* that alter the semaphore */
+	struct list_head pending_const; /* pending single-sop operations */
+					/* that do not alter the semaphore*/
 } ____cacheline_aligned_in_smp;
 
 /* One queue for each sleeping process in the system. */
@@ -152,7 +155,7 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 /*
  * linked list protection:
  *	sem_undo.id_next,
- *	sem_array.sem_pending{,last},
+ *	sem_array.pending{_alter,_const},
  *	sem_array.sem_undo: sem_lock() for read/write
  *	sem_undo.proc_next: only "current" is allowed to read/write that field.
  *	
@@ -337,7 +340,7 @@ static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
  * Without the check/retry algorithm a lockless wakeup is possible:
  * - queue.status is initialized to -EINTR before blocking.
  * - wakeup is performed by
- *	* unlinking the queue entry from sma->sem_pending
+ *	* unlinking the queue entry from the pending list
  *	* setting queue.status to IN_WAKEUP
  *	  This is the notification for the blocked thread that a
  *	  result value is imminent.
@@ -418,12 +421,14 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	sma->sem_base = (struct sem *) &sma[1];
 
 	for (i = 0; i < nsems; i++) {
-		INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
+		INIT_LIST_HEAD(&sma->sem_base[i].pending_alter);
+		INIT_LIST_HEAD(&sma->sem_base[i].pending_const);
 		spin_lock_init(&sma->sem_base[i].lock);
 	}
 
 	sma->complex_count = 0;
-	INIT_LIST_HEAD(&sma->sem_pending);
+	INIT_LIST_HEAD(&sma->pending_alter);
+	INIT_LIST_HEAD(&sma->pending_const);
 	INIT_LIST_HEAD(&sma->list_id);
 	sma->sem_nsems = nsems;
 	sma->sem_ctime = get_seconds();
@@ -609,60 +614,130 @@ static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
  * update_queue is O(N^2) when it restarts scanning the whole queue of
  * waiting operations. Therefore this function checks if the restart is
  * really necessary. It is called after a previously waiting operation
- * was completed.
+ * modified the array.
+ * Note that wait-for-zero operations are handled without restart.
  */
 static int check_restart(struct sem_array *sma, struct sem_queue *q)
 {
-	struct sem *curr;
-	struct sem_queue *h;
-
-	/* if the operation didn't modify the array, then no restart */
-	if (q->alter == 0)
-		return 0;
-
-	/* pending complex operations are too difficult to analyse */
-	if (sma->complex_count)
+	/* pending complex alter operations are too difficult to analyse */
+	if (!list_empty(&sma->pending_alter))
 		return 1;
 
 	/* we were a sleeping complex operation. Too difficult */
 	if (q->nsops > 1)
 		return 1;
 
-	curr = sma->sem_base + q->sops[0].sem_num;
+	/* It is impossible that someone waits for the new value:
+	 * - complex operations always restart.
+	 * - wait-for-zero are handled separately.
+	 * - q is a previously sleeping simple operation that
+	 *   altered the array. It must be a decrement, because
+	 *   simple increments never sleep.
+	 * - If there are older (higher priority) decrements
+	 *   in the queue, then they have observed the original
+	 *   semval value and couldn't proceed. The operation
+	 *   decremented the value - thus they won't proceed either.
+	 */
+	return 0;
+}
 
-	/* No-one waits on this queue */
-	if (list_empty(&curr->sem_pending))
-		return 0;
+/**
+ * wake_const_ops(sma, semnum, pt) - Wake up non-alter tasks
+ * @sma: semaphore array.
+ * @semnum: semaphore that was modified.
+ * @pt: list head for the tasks that must be woken up.
+ *
+ * wake_const_ops must be called after a semaphore in a semaphore array
+ * was set to 0. If complex const operations are pending, wake_const_ops must
+ * be called with semnum = -1, as well as with the number of each modified
+ * semaphore.
+ * The tasks that must be woken up are added to @pt. The return code
+ * is stored in q->pid.
+ * The function returns 1 if at least one operation was completed successfully.
+ */
+static int wake_const_ops(struct sem_array *sma, int semnum,
+				struct list_head *pt)
+{
+	struct sem_queue *q;
+	struct list_head *walk;
+	struct list_head *pending_list;
+	int semop_completed = 0;
+
+	if (semnum == -1)
+		pending_list = &sma->pending_const;
+	else
+		pending_list = &sma->sem_base[semnum].pending_const;
+
+	walk = pending_list->next;
+	while (walk != pending_list) {
+		int error;
+
+		q = container_of(walk, struct sem_queue, list);
+		walk = walk->next;
+
+		error = try_atomic_semop(sma, q->sops, q->nsops,
+						q->undo, q->pid);
+
+		if (error <= 0) {
+			/* operation completed, remove from queue & wakeup */
+
+			unlink_queue(sma, q);
+
+			wake_up_sem_queue_prepare(pt, q, error);
+			if (error == 0)
+				semop_completed = 1;
+		}
+	}
+	return semop_completed;
+}
 
-	/* the new semaphore value */
-	if (curr->semval) {
-		/* It is impossible that someone waits for the new value:
-		 * - q is a previously sleeping simple operation that
-		 *   altered the array. It must be a decrement, because
-		 *   simple increments never sleep.
-		 * - The value is not 0, thus wait-for-zero won't proceed.
-		 * - If there are older (higher priority) decrements
-		 *   in the queue, then they have observed the original
-		 *   semval value and couldn't proceed. The operation
-		 *   decremented to value - thus they won't proceed either.
+/**
+ * do_smart_wakeup_zero(sma, sops, nsops, pt) - wakeup all wait for zero tasks
+ * @sma: semaphore array
+ * @sops: operations that were performed
+ * @nsops: number of operations
+ * @pt: list head of the tasks that must be woken up.
+ *
+ * do_smart_wakeup_zero() checks all required queues for wait-for-zero
+ * operations, based on the actual changes that were performed on the
+ * semaphore array.
+ * The function returns 1 if at least one operation was completed successfully.
+ */
+static int do_smart_wakeup_zero(struct sem_array *sma, struct sembuf *sops,
+					int nsops, struct list_head *pt)
+{
+	int i;
+	int semop_completed = 0;
+	int got_zero = 0;
+
+	/* first: the per-semaphore queues, if known */
+	if (sops) {
+		for (i = 0; i < nsops; i++) {
+			int num = sops[i].sem_num;
+
+			if (sma->sem_base[num].semval == 0) {
+				got_zero = 1;
+				semop_completed |= wake_const_ops(sma, num, pt);
+			}
+		}
+	} else {
+		/*
+		 * No sops means modified semaphores not known.
+		 * Assume all were changed.
 		 */
-		BUG_ON(q->sops[0].sem_op >= 0);
-		return 0;
+		for (i = 0; i < sma->sem_nsems; i++) {
+			if (sma->sem_base[i].semval == 0)
+				semop_completed |= wake_const_ops(sma, i, pt);
+		}
 	}
 	/*
-	 * semval is 0. Check if there are wait-for-zero semops.
-	 * They must be the first entries in the per-semaphore queue
+	 * If one of the modified semaphores got 0,
+	 * then check the global queue, too.
 	 */
-	h = list_first_entry(&curr->sem_pending, struct sem_queue, list);
-	BUG_ON(h->nsops != 1);
-	BUG_ON(h->sops[0].sem_num != q->sops[0].sem_num);
+	if (got_zero)
+		semop_completed |= wake_const_ops(sma, -1, pt);
 
-	/* Yes, there is a wait-for-zero semop. Restart */
-	if (h->sops[0].sem_op == 0)
-		return 1;
-
-	/* Again - no-one is waiting for the new value. */
-	return 0;
+	return semop_completed;
 }
 
 
@@ -678,6 +753,8 @@ static int check_restart(struct sem_array *sma, struct sem_queue *q)
  * semaphore.
  * The tasks that must be woken up are added to @pt. The return code
  * is stored in q->pid.
+ * The function internally checks if const operations can now succeed.
+ *
  * The function return 1 if at least one semop was completed successfully.
  */
 static int update_queue(struct sem_array *sma, int semnum, struct list_head *pt)
@@ -688,9 +765,9 @@ static int update_queue(struct sem_array *sma, int semnum, struct list_head *pt)
 	int semop_completed = 0;
 
 	if (semnum == -1)
-		pending_list = &sma->sem_pending;
+		pending_list = &sma->pending_alter;
 	else
-		pending_list = &sma->sem_base[semnum].sem_pending;
+		pending_list = &sma->sem_base[semnum].pending_alter;
 
 again:
 	walk = pending_list->next;
@@ -702,13 +779,12 @@ again:
 
 		/* If we are scanning the single sop, per-semaphore list of
 		 * one semaphore and that semaphore is 0, then it is not
-		 * necessary to scan the "alter" entries: simple increments
+		 * necessary to scan further: simple increments
 		 * that affect only one entry succeed immediately and cannot
 		 * be in the  per semaphore pending queue, and decrements
 		 * cannot be successful if the value is already 0.
 		 */
-		if (semnum != -1 && sma->sem_base[semnum].semval == 0 &&
-				q->alter)
+		if (semnum != -1 && sma->sem_base[semnum].semval == 0)
 			break;
 
 		error = try_atomic_semop(sma, q->sops, q->nsops,
@@ -724,6 +800,7 @@ again:
 			restart = 0;
 		} else {
 			semop_completed = 1;
+			do_smart_wakeup_zero(sma, q->sops, q->nsops, pt);
 			restart = check_restart(sma, q);
 		}
 
@@ -742,8 +819,8 @@ again:
  * @otime: force setting otime
  * @pt: list head of the tasks that must be woken up.
  *
- * do_smart_update() does the required called to update_queue, based on the
- * actual changes that were performed on the semaphore array.
+ * do_smart_update() does the required calls to update_queue and wakeup_zero,
+ * based on the actual changes that were performed on the semaphore array.
  * Note that the function does not do the actual wake-up: the caller is
  * responsible for calling wake_up_sem_queue_do(@pt).
  * It is safe to perform this call after dropping all locks.
@@ -754,6 +831,8 @@ static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsop
 	int i;
 	int progress;
 
+	otime |= do_smart_wakeup_zero(sma, sops, nsops, pt);
+
 	progress = 1;
 retry_global:
 	if (sma->complex_count) {
@@ -813,14 +892,14 @@ static int count_semncnt (struct sem_array * sma, ushort semnum)
 	struct sem_queue * q;
 
 	semncnt = 0;
-	list_for_each_entry(q, &sma->sem_base[semnum].sem_pending, list) {
+	list_for_each_entry(q, &sma->sem_base[semnum].pending_alter, list) {
 		struct sembuf * sops = q->sops;
 		BUG_ON(sops->sem_num != semnum);
 		if ((sops->sem_op < 0) && !(sops->sem_flg & IPC_NOWAIT))
 			semncnt++;
 	}
 
-	list_for_each_entry(q, &sma->sem_pending, list) {
+	list_for_each_entry(q, &sma->pending_alter, list) {
 		struct sembuf * sops = q->sops;
 		int nsops = q->nsops;
 		int i;
@@ -839,14 +918,14 @@ static int count_semzcnt (struct sem_array * sma, ushort semnum)
 	struct sem_queue * q;
 
 	semzcnt = 0;
-	list_for_each_entry(q, &sma->sem_base[semnum].sem_pending, list) {
+	list_for_each_entry(q, &sma->sem_base[semnum].pending_const, list) {
 		struct sembuf * sops = q->sops;
 		BUG_ON(sops->sem_num != semnum);
 		if ((sops->sem_op == 0) && !(sops->sem_flg & IPC_NOWAIT))
 			semzcnt++;
 	}
 
-	list_for_each_entry(q, &sma->sem_pending, list) {
+	list_for_each_entry(q, &sma->pending_const, list) {
 		struct sembuf * sops = q->sops;
 		int nsops = q->nsops;
 		int i;
@@ -884,13 +963,22 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 
 	/* Wake up all pending processes and let them fail with EIDRM. */
 	INIT_LIST_HEAD(&tasks);
-	list_for_each_entry_safe(q, tq, &sma->sem_pending, list) {
+	list_for_each_entry_safe(q, tq, &sma->pending_const, list) {
+		unlink_queue(sma, q);
+		wake_up_sem_queue_prepare(&tasks, q, -EIDRM);
+	}
+
+	list_for_each_entry_safe(q, tq, &sma->pending_alter, list) {
 		unlink_queue(sma, q);
 		wake_up_sem_queue_prepare(&tasks, q, -EIDRM);
 	}
 	for (i = 0; i < sma->sem_nsems; i++) {
 		struct sem *sem = sma->sem_base + i;
-		list_for_each_entry_safe(q, tq, &sem->sem_pending, list) {
+		list_for_each_entry_safe(q, tq, &sem->pending_const, list) {
+			unlink_queue(sma, q);
+			wake_up_sem_queue_prepare(&tasks, q, -EIDRM);
+		}
+		list_for_each_entry_safe(q, tq, &sem->pending_alter, list) {
 			unlink_queue(sma, q);
 			wake_up_sem_queue_prepare(&tasks, q, -EIDRM);
 		}
@@ -1654,14 +1742,15 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		curr = &sma->sem_base[sops->sem_num];
 
 		if (alter)
-			list_add_tail(&queue.list, &curr->sem_pending);
+			list_add_tail(&queue.list, &curr->pending_alter);
 		else
-			list_add(&queue.list, &curr->sem_pending);
+			list_add_tail(&queue.list, &curr->pending_const);
 	} else {
 		if (alter)
-			list_add_tail(&queue.list, &sma->sem_pending);
+			list_add_tail(&queue.list, &sma->pending_alter);
 		else
-			list_add(&queue.list, &sma->sem_pending);
+			list_add_tail(&queue.list, &sma->pending_const);
+
 		sma->complex_count++;
 	}
 
-- 
1.8.1.4



* [PATCH 4/6] ipc/sem.c: Always use only one queue for alter operations.
From: Manfred Spraul @ 2013-06-10 17:16 UTC
  To: Andrew Morton
  Cc: LKML, Rik van Riel, Davidlohr Bueso, hhuang, Linus Torvalds,
	Mike Galbraith, Manfred Spraul

There are two places that can contain alter operations:
- the global queue: sma->pending_alter
- the per-semaphore queues: sma->sem_base[].pending_alter.

Since one of the queues must be processed first, this causes an odd
prioritization of the wakeups:
Right now, complex operations have priority over simple ops.

The patch restores the behavior of Linux <= 3.0.9: The longest
waiting operation has the highest priority.

This is done by using only one queue:
- if there are complex ops, then sma->pending_alter is used.
- otherwise, the per-semaphore queues are used.
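
Condensed, the queue selection for a sleeping single-sop alter operation
becomes (paraphrasing the semtimedop() hunk at the end of this patch):

	if (alter) {
		if (sma->complex_count)	/* complex ops around: global FIFO */
			list_add_tail(&queue.list, &sma->pending_alter);
		else			/* fast path: per-semaphore queue */
			list_add_tail(&queue.list, &curr->pending_alter);
	}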

As a side effect, do_smart_update() becomes much simpler:
no more goto logic.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c | 128 ++++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 88 insertions(+), 40 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index e7f3d64..dcf99ef 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -192,6 +192,53 @@ void __init sem_init (void)
 				IPC_SEM_IDS, sysvipc_sem_proc_show);
 }
 
+/**
+ * unmerge_queues - unmerge queues, if possible.
+ * @sma: semaphore array
+ *
+ * The function unmerges the wait queues if complex_count is 0.
+ * It must be called prior to dropping the global semaphore array lock.
+ */
+static void unmerge_queues(struct sem_array *sma)
+{
+	struct sem_queue *q, *tq;
+
+	/* complex operations still around? */
+	if (sma->complex_count)
+		return;
+	/*
+	 * We will switch back to simple mode.
+	 * Move all pending operation back into the per-semaphore
+	 * queues.
+	 */
+	list_for_each_entry_safe(q, tq, &sma->pending_alter, list) {
+		struct sem *curr;
+		curr = &sma->sem_base[q->sops[0].sem_num];
+
+		list_add_tail(&q->list, &curr->pending_alter);
+	}
+	INIT_LIST_HEAD(&sma->pending_alter);
+}
+
+/**
+ * merge_queues - Merge single semop queues into global queue
+ * @sma: semaphore array
+ *
+ * This function merges all per-semaphore queues into the global queue.
+ * It is necessary to achieve FIFO ordering for the pending single-sop
+ * operations when a multi-semop operation must sleep.
+ * Only the alter operations must be moved, the const operations can stay.
+ */
+static void merge_queues(struct sem_array *sma)
+{
+	int i;
+	for (i = 0; i < sma->sem_nsems; i++) {
+		struct sem *sem = sma->sem_base + i;
+
+		list_splice_init(&sem->pending_alter, &sma->pending_alter);
+	}
+}
+
 /*
  * If the request contains only one semaphore operation, and there are
  * no complex transactions pending, lock only the semaphore involved.
@@ -262,6 +309,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 static inline void sem_unlock(struct sem_array *sma, int locknum)
 {
 	if (locknum == -1) {
+		unmerge_queues(sma);
 		spin_unlock(&sma->sem_perm.lock);
 	} else {
 		struct sem *sem = sma->sem_base + locknum;
@@ -829,49 +877,38 @@ static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsop
 			int otime, struct list_head *pt)
 {
 	int i;
-	int progress;
 
 	otime |= do_smart_wakeup_zero(sma, sops, nsops, pt);
 
-	progress = 1;
-retry_global:
-	if (sma->complex_count) {
-		if (update_queue(sma, -1, pt)) {
-			progress = 1;
-			otime = 1;
-			sops = NULL;
-		}
-	}
-	if (!progress)
-		goto done;
-
-	if (!sops) {
-		/* No semops; something special is going on. */
-		for (i = 0; i < sma->sem_nsems; i++) {
-			if (update_queue(sma, i, pt)) {
-				otime = 1;
-				progress = 1;
+	if (!list_empty(&sma->pending_alter)) {
+		/* semaphore array uses the global queue - just process it. */
+		otime |= update_queue(sma, -1, pt);
+	} else {
+		if (!sops) {
+			/*
+			 * No sops, thus the modified semaphores are not
+			 * known. Check all.
+			 */
+			for (i = 0; i < sma->sem_nsems; i++)
+				otime |= update_queue(sma, i, pt);
+		} else {
+			/*
+			 * Check the semaphores that were increased:
+			 * - No complex ops, thus all sleeping ops are
+			 *   decreases.
+			 * - if we decreased the value, then any sleeping
+			 *   semaphore ops won't be able to run: If the
+			 *   previous value was too small, then the new
+			 *   value will be too small, too.
+			 */
+			for (i = 0; i < nsops; i++) {
+				if (sops[i].sem_op > 0) {
+					otime |= update_queue(sma,
+							sops[i].sem_num, pt);
+				}
 			}
 		}
-		goto done_checkretry;
-	}
-
-	/* Check the semaphores that were modified. */
-	for (i = 0; i < nsops; i++) {
-		if (sops[i].sem_op > 0 ||
-			(sops[i].sem_op < 0 &&
-				sma->sem_base[sops[i].sem_num].semval == 0))
-			if (update_queue(sma, sops[i].sem_num, pt)) {
-				otime = 1;
-				progress = 1;
-			}
-	}
-done_checkretry:
-	if (progress) {
-		progress = 0;
-		goto retry_global;
 	}
-done:
 	if (otime)
 		sma->sem_otime = get_seconds();
 }
@@ -1741,11 +1778,22 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		struct sem *curr;
 		curr = &sma->sem_base[sops->sem_num];
 
-		if (alter)
-			list_add_tail(&queue.list, &curr->pending_alter);
-		else
+		if (alter) {
+			if (sma->complex_count) {
+				list_add_tail(&queue.list,
+						&sma->pending_alter);
+			} else {
+
+				list_add_tail(&queue.list,
+						&curr->pending_alter);
+			}
+		} else {
 			list_add_tail(&queue.list, &curr->pending_const);
+		}
 	} else {
+		if (!sma->complex_count)
+			merge_queues(sma);
+
 		if (alter)
 			list_add_tail(&queue.list, &sma->pending_alter);
 		else
-- 
1.8.1.4



* [PATCH 5/6] ipc/sem.c: Replace shared sem_otime with per-semaphore value
From: Manfred Spraul @ 2013-06-10 17:16 UTC
  To: Andrew Morton
  Cc: LKML, Rik van Riel, Davidlohr Bueso, hhuang, Linus Torvalds,
	Mike Galbraith, Manfred Spraul

sem_otime contains the time of the last semaphore operation
that completed successfully. Every operation updates this
value, thus access from multiple CPUs can cause thrashing.

Therefore the patch replaces the variable with a per-semaphore
variable. The per-array sem_otime is only calculated when required.

No performance improvement on a single-socket i3 - only important
for larger systems.
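
For reference, this is how the aggregated value reaches userspace; a
sketch using the standard semctl API (defining union semun is the
caller's job per the standard):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <stdio.h>
#include <time.h>

union semun {
	int val;
	struct semid_ds *buf;
	unsigned short *array;
};

/* sem_otime as reported by IPC_STAT is now computed on demand as the
 * maximum of the per-semaphore candidates (see get_semotime() below). */
static void print_otime(int semid)
{
	struct semid_ds ds;
	union semun arg = { .buf = &ds };

	if (semctl(semid, 0, IPC_STAT, arg) == 0)
		printf("last semop: %s", ctime(&ds.sem_otime));
}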

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 include/linux/sem.h |  1 -
 ipc/sem.c           | 37 +++++++++++++++++++++++++++++++------
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 55e17f6..976ce3a 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -12,7 +12,6 @@ struct task_struct;
 struct sem_array {
 	struct kern_ipc_perm	____cacheline_aligned_in_smp
 				sem_perm;	/* permissions .. see ipc.h */
-	time_t			sem_otime;	/* last semop time */
 	time_t			sem_ctime;	/* last change time */
 	struct sem		*sem_base;	/* ptr to first semaphore in array */
 	struct list_head	pending_alter;	/* pending operations */
diff --git a/ipc/sem.c b/ipc/sem.c
index dcf99ef..e6d21f6 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -99,6 +99,7 @@ struct sem {
 					/* that alter the semaphore */
 	struct list_head pending_const; /* pending single-sop operations */
 					/* that do not alter the semaphore*/
+	time_t	sem_otime;	/* candidate for sem_otime */
 } ____cacheline_aligned_in_smp;
 
 /* One queue for each sleeping process in the system. */
@@ -909,8 +910,14 @@ static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsop
 			}
 		}
 	}
-	if (otime)
-		sma->sem_otime = get_seconds();
+	if (otime) {
+		if (sops == NULL) {
+			sma->sem_base[0].sem_otime = get_seconds();
+		} else {
+			sma->sem_base[sops[0].sem_num].sem_otime =
+								get_seconds();
+		}
+	}
 }
 
 
@@ -1056,6 +1063,21 @@ static unsigned long copy_semid_to_user(void __user *buf, struct semid64_ds *in,
 	}
 }
 
+static time_t get_semotime(struct sem_array *sma)
+{
+	int i;
+	time_t res;
+
+	res = sma->sem_base[0].sem_otime;
+	for (i = 1; i < sma->sem_nsems; i++) {
+		time_t to = sma->sem_base[i].sem_otime;
+
+		if (to > res)
+			res = to;
+	}
+	return res;
+}
+
 static int semctl_nolock(struct ipc_namespace *ns, int semid,
 			 int cmd, int version, void __user *p)
 {
@@ -1129,9 +1151,9 @@ static int semctl_nolock(struct ipc_namespace *ns, int semid,
 			goto out_unlock;
 
 		kernel_to_ipc64_perm(&sma->sem_perm, &tbuf.sem_perm);
-		tbuf.sem_otime  = sma->sem_otime;
-		tbuf.sem_ctime  = sma->sem_ctime;
-		tbuf.sem_nsems  = sma->sem_nsems;
+		tbuf.sem_otime = get_semotime(sma);
+		tbuf.sem_ctime = sma->sem_ctime;
+		tbuf.sem_nsems = sma->sem_nsems;
 		rcu_read_unlock();
 		if (copy_semid_to_user(p, &tbuf, version))
 			return -EFAULT;
@@ -2019,6 +2041,9 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 {
 	struct user_namespace *user_ns = seq_user_ns(s);
 	struct sem_array *sma = it;
+	time_t sem_otime;
+
+	sem_otime = get_semotime(sma);
 
 	return seq_printf(s,
 			  "%10d %10d  %4o %10u %5u %5u %5u %5u %10lu %10lu\n",
@@ -2030,7 +2055,7 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 			  from_kgid_munged(user_ns, sma->sem_perm.gid),
 			  from_kuid_munged(user_ns, sma->sem_perm.cuid),
 			  from_kgid_munged(user_ns, sma->sem_perm.cgid),
-			  sma->sem_otime,
+			  sem_otime,
 			  sma->sem_ctime);
 }
 #endif
-- 
1.8.1.4



* [PATCH 6/6] ipc/sem.c: Rename try_atomic_semop() to perform_atomic_semop(), docu update
From: Manfred Spraul @ 2013-06-10 17:16 UTC
  To: Andrew Morton
  Cc: LKML, Rik van Riel, Davidlohr Bueso, hhuang, Linus Torvalds,
	Mike Galbraith, Manfred Spraul

Cleanup: some minor points that I noticed while writing the
previous patches.

1) The name try_atomic_semop() is misleading: The function performs the
   operation (if it is possible).

2) Some documentation updates.

No real code change, a rename and documentation changes.
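
The return convention that the new kerneldoc spells out, as the callers
in ipc/sem.c consume it (shape only, condensed from existing code):

	error = perform_atomic_semop(sma, sops, nsops, un, pid);
	if (error == 0) {
		/* all sops were applied atomically: update the queues */
	} else if (error > 0) {
		/* would block: queue the task and sleep */
	} else {
		/* negative errno, e.g. -EAGAIN (IPC_NOWAIT) or -ERANGE */
	}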

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index e6d21f6..f9d1c06 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -154,12 +154,15 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #define SEMOPM_FAST	64  /* ~ 372 bytes on stack */
 
 /*
- * linked list protection:
+ * Locking:
  *	sem_undo.id_next,
+ *	sem_array.complex_count,
 *	sem_array.pending{_alter,_const},
- *	sem_array.sem_undo: sem_lock() for read/write
+ *	sem_array.sem_undo: global sem_lock() for read/write
  *	sem_undo.proc_next: only "current" is allowed to read/write that field.
  *	
+ *	sem_array.sem_base[i].pending_{const,alter}:
+ *		global or semaphore sem_lock() for read/write
  */
 
 #define sc_semmsl	sem_ctls[0]
@@ -536,12 +539,19 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
 	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
 }
 
-/*
- * Determine whether a sequence of semaphore operations would succeed
- * all at once. Return 0 if yes, 1 if need to sleep, else return error code.
+/** perform_atomic_semop - Perform (if possible) a semaphore operation
+ * @sma: semaphore array
+ * @sops: array with operations that should be checked
+ * @nsops: number of sops
+ * @un: undo array
+ * @pid: pid that did the change
+ *
+ * Returns 0 if the operation was possible.
+ * Returns 1 if the operation is impossible, the caller must sleep.
+ * Negative values are error codes.
  */
 
-static int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
+static int perform_atomic_semop(struct sem_array *sma, struct sembuf *sops,
 			     int nsops, struct sem_undo *un, int pid)
 {
 	int result, sem_op;
@@ -724,8 +734,8 @@ static int wake_const_ops(struct sem_array *sma, int semnum,
 		q = container_of(walk, struct sem_queue, list);
 		walk = walk->next;
 
-		error = try_atomic_semop(sma, q->sops, q->nsops,
-						q->undo, q->pid);
+		error = perform_atomic_semop(sma, q->sops, q->nsops,
+						 q->undo, q->pid);
 
 		if (error <= 0) {
 			/* operation completed, remove from queue & wakeup */
@@ -836,7 +846,7 @@ again:
 		if (semnum != -1 && sma->sem_base[semnum].semval == 0)
 			break;
 
-		error = try_atomic_semop(sma, q->sops, q->nsops,
+		error = perform_atomic_semop(sma, q->sops, q->nsops,
 					 q->undo, q->pid);
 
 		/* Does q->sleeper still need to sleep? */
@@ -1680,7 +1690,6 @@ static int get_queue_result(struct sem_queue *q)
 	return error;
 }
 
-
 SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		unsigned, nsops, const struct timespec __user *, timeout)
 {
@@ -1778,7 +1787,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	if (un && un->semid == -1)
 		goto out_unlock_free;
 
-	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
+	error = perform_atomic_semop(sma, sops, nsops, un,
+					task_tgid_vnr(current));
 	if (error <= 0) {
 		if (alter && error == 0)
 			do_smart_update(sma, sops, nsops, 1, &tasks);
-- 
1.8.1.4



* Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
From: Manfred Spraul @ 2013-06-14 15:38 UTC
  To: LKML
  Cc: Andrew Morton, Rik van Riel, Davidlohr Bueso, hhuang,
	Linus Torvalds, Mike Galbraith

Hi all,

On 06/10/2013 07:16 PM, Manfred Spraul wrote:
> Hi Andrew,
>
> I have cleaned up/improved my updates to sysv sem.
> Could you replace my patches in -akpm with this series?
>
> - 1: cacheline align output from ipc_rcu_alloc
> - 2: cacheline align semaphore structures
> - 3: separate-wait-for-zero-and-alter-tasks
> - 4: Always-use-only-one-queue-for-alter-operations
> - 5: Replace the global sem_otime with a distributed otime
> - 6: Rename-try_atomic_semop-to-perform_atomic
Just to keep everyone updated:
I have updated my testapp:
https://github.com/manfred-colorfu/ipcscale/blob/master/sem-waitzero.cpp

Something like this gives a nice output:

     # sem-waitzero -t 5 -m 0 | grep 'Cpus' | gawk '{printf("%f - %s\n",$7/$2,$0);}' | sort -n -r

The first number is the number of operations per cpu during 5 seconds.
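
Each thread presumably runs a loop along these lines (an assumption
based on the test's name, not a copy of the real source - see the link
above):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>

/* Repeatedly issue a wait-for-zero operation on a semaphore whose value
 * is already 0: semop() returns immediately, so the measured cost is
 * pure syscall plus ipc/sem.c locking overhead. */
static unsigned long bench_loop(int semid, unsigned short num, int secs)
{
	struct sembuf sop = { .sem_num = num, .sem_op = 0, .sem_flg = 0 };
	time_t end = time(NULL) + secs;
	unsigned long ops = 0;

	while (time(NULL) < end) {
		semop(semid, &sop, 1);
		ops++;
	}
	return ops;
}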

Mike was kind enough to run it on a 32-core (4-socket) Intel system:
- master doesn't scale at all when multiple sockets are used:
     interleave 4: (i.e.: use cpu 0, then 4, then 8 (2nd socket), then 12):
         34,717586.000000 - Cpus 1, interleave 4 delay 0: 34717586 in 5 secs
         24,507337.500000 - Cpus 2, interleave 4 delay 0: 49014675 in 5 secs
          3,487540.000000 - Cpus 3, interleave 4 delay 0: 10462620 in 5 secs
          2,708145.000000 - Cpus 4, interleave 4 delay 0: 10832580 in 5 secs
     interleave 8: (i.e.: use cpu 0, then 8 (2nd socket)):
         34,587329.000000 - Cpus 1, interleave 8 delay 0: 34587329 in 5 secs
          7,746981.500000 - Cpus 2, interleave 8 delay 0: 15493963 in 5 secs

- with my patches applied, it scales linearly - but only sometimes
     example for good scaling (18 threads in parallel - linear scaling):
         33,928616.111111 - Cpus 18, interleave 8 delay 0: 610715090 in 5 secs
     example for bad scaling:
         5,829109.600000 - Cpus 5, interleave 8 delay 0: 29145548 in 5 secs

For me, it looks like a livelock somewhere:
Good example: all threads contribute the same amount to the final result:
> Result matrix:
>   Thread   0: 33476433
>   Thread   1: 33697100
>   Thread   2: 33514249
>   Thread   3: 33657413
>   Thread   4: 33727959
>   Thread   5: 33580684
>   Thread   6: 33530294
>   Thread   7: 33666761
>   Thread   8: 33749836
>   Thread   9: 32636493
>   Thread  10: 33550620
>   Thread  11: 33403314
>   Thread  12: 33594457
>   Thread  13: 33331920
>   Thread  14: 33503588
>   Thread  15: 33585348
> Cpus 16, interleave 8 delay 0: 536206469 in 5 secs
Bad example: one thread is as fast as it should be, others are slow:
> Result matrix:
>   Thread   0: 31629540
>   Thread   1:  5336968
>   Thread   2:  6404314
>   Thread   3:  9190595
>   Thread   4:  9681006
>   Thread   5:  9935421
>   Thread   6:  9424324
> Cpus 7, interleave 8 delay 0: 81602168 in 5 secs

The results are not stable: the same test is sometimes fast, sometimes slow.
I have no idea where the livelock could be and I wasn't able to notice
anything on my i3 laptop.

Thus: Who has an idea?
What I can say is that the livelock can't be in do_smart_update(): The
function is never called.

--
     Manfred



* Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
From: Mike Galbraith @ 2013-06-14 19:05 UTC
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Rik van Riel, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Fri, 2013-06-14 at 17:38 +0200, Manfred Spraul wrote: 
> [...]
> Thus: Who has an idea?
> What I can say is that the livelock can't be in do_smart_update(): The
> function is never called.

64 core DL980, using all cores is stable at being horribly _unstable_,
much worse than the 32 core UV2000, but if using only 32 cores, it
becomes considerably more stable than the newer/faster UV box.

32 of 64 cores of the DL980, without the -rt killing goto again loop
removal I showed you: unstable, not wonderful throughput.

Result matrix:
  Thread   0:  7253945
  Thread   1:  9050395
  Thread   2:  7708921
  Thread   3:  7274316
  Thread   4:  9815215
  Thread   5:  9924773
  Thread   6:  7743325
  Thread   7:  8643970
  Thread   8: 11268731
  Thread   9:  9610031
  Thread  10:  7540230
  Thread  11:  8432077
  Thread  12: 11071762
  Thread  13: 10436946
  Thread  14:  8051919
  Thread  15:  7461884
  Thread  16: 11706359
  Thread  17: 10512449
  Thread  18:  8225636
  Thread  19:  7809035
  Thread  20: 10465783
  Thread  21: 10072878
  Thread  22:  7632289
  Thread  23:  6758903
  Thread  24: 10763830
  Thread  25:  8974703
  Thread  26:  7054996
  Thread  27:  7367430
  Thread  28:  9816388
  Thread  29:  9622796
  Thread  30:  6500835
  Thread  31:  7959901

# Events: 802K cycles
#
# Overhead                                      Symbol
# ........  ..........................................
#
    18.42%  [k] SYSC_semtimedop
    15.39%  [k] sem_lock
    10.26%  [k] _raw_spin_lock
     9.00%  [k] perform_atomic_semop
     7.89%  [k] system_call
     7.70%  [k] ipc_obtain_object_check
     6.95%  [k] ipcperms
     6.62%  [k] copy_user_generic_string
     4.16%  [.] __semop
     2.57%  [.] worker_thread(void*)
     2.30%  [k] copy_from_user
     1.75%  [k] sem_unlock
     1.25%  [k] ipc_obtain_object

With the goto again loop whacked, it's nearly stable, but not quite, and
throughput mostly looks like so:

Result matrix:
  Thread   0: 24164305
  Thread   1: 24224024
  Thread   2: 24112445
  Thread   3: 24076559
  Thread   4: 24364901
  Thread   5: 24249681
  Thread   6: 24048409
  Thread   7: 24267064
  Thread   8: 24614799
  Thread   9: 24330378
  Thread  10: 24132766
  Thread  11: 24158460
  Thread  12: 24456538
  Thread  13: 24300952
  Thread  14: 24079298
  Thread  15: 24100075
  Thread  16: 24643074
  Thread  17: 24369761
  Thread  18: 24151657
  Thread  19: 24143953
  Thread  20: 24575677
  Thread  21: 24169945
  Thread  22: 24055378
  Thread  23: 24016710
  Thread  24: 24548028
  Thread  25: 24290316
  Thread  26: 24169379
  Thread  27: 24119776
  Thread  28: 24399737
  Thread  29: 24256724
  Thread  30: 23914777
  Thread  31: 24215780

and profile like so.

# Events: 802K cycles
#
# Overhead                           Symbol
# ........  ...............................
#
    17.38%  [k] SYSC_semtimedop
    13.26%  [k] system_call
    11.31%  [k] copy_user_generic_string
     7.62%  [.] __semop
     7.18%  [k] _raw_spin_lock
     5.66%  [k] ipcperms
     5.40%  [k] sem_lock
     4.65%  [k] perform_atomic_semop
     4.22%  [k] ipc_obtain_object_check
     4.08%  [.] worker_thread(void*)
     4.06%  [k] copy_from_user
     2.40%  [k] ipc_obtain_object
     1.98%  [k] pid_vnr
     1.45%  [k] wake_up_sem_queue_do
     1.39%  [k] sys_semop
     1.35%  [k] sys_semtimedop
     1.30%  [k] sem_unlock
     1.14%  [k] security_ipc_permission

So that goto again loop is not only an -rt killer, it seems to be part
of the instability picture too.

Back to virgin source + your patch series

Using 64 cores with or without the loop removed, it's uniformly unstable as
hell.  With the goto again loop removed, it improves some, but not much, so
the loop isn't the biggest deal, except to -rt, where it's utterly deadly.
Result matrix:
  Thread   0:   997088
  Thread   1:  1962065
  Thread   2:   117899
  Thread   3:   125918
  Thread   4:    80233
  Thread   5:    85001
  Thread   6:    88413
  Thread   7:   104424
  Thread   8:  1549782
  Thread   9:  2172206
  Thread  10:   119314
  Thread  11:   127109
  Thread  12:    81179
  Thread  13:    89026
  Thread  14:    91497
  Thread  15:   103410
  Thread  16:  1661969
  Thread  17:  2223131
  Thread  18:   119739
  Thread  19:   126294
  Thread  20:    81172
  Thread  21:    87850
  Thread  22:    90621
  Thread  23:   102964
  Thread  24:  1641042
  Thread  25:  2152851
  Thread  26:   118818
  Thread  27:   125801
  Thread  28:    79316
  Thread  29:    99029
  Thread  30:   101513
  Thread  31:    91206
  Thread  32:  1825614
  Thread  33:  2432801
  Thread  34:   120599
  Thread  35:   131854
  Thread  36:    81346
  Thread  37:   103464
  Thread  38:   105223
  Thread  39:   101554
  Thread  40:  1980013
  Thread  41:  2574055
  Thread  42:   122887
  Thread  43:   131096
  Thread  44:    80521
  Thread  45:   105162
  Thread  46:   110329
  Thread  47:   104078
  Thread  48:  1925173
  Thread  49:  2552441
  Thread  50:   123806
  Thread  51:   134857
  Thread  52:    82148
  Thread  53:   105312
  Thread  54:   109728
  Thread  55:   107766
  Thread  56:  1999696
  Thread  57:  2699455
  Thread  58:   128375
  Thread  59:   128289
  Thread  60:    80071
  Thread  61:   106968
  Thread  62:   111768
  Thread  63:   115243

# Events: 1M cycles
#
# Overhead                                   Symbol
# ........  .......................................
#
    30.73%  [k] ipc_obtain_object_check
    29.46%  [k] sem_lock
    25.12%  [k] ipcperms
     4.93%  [k] SYSC_semtimedop
     4.35%  [k] perform_atomic_semop
     2.83%  [k] _raw_spin_lock
     0.40%  [k] system_call

ipc_obtain_object_check():

         :         * Call inside the RCU critical section.
         :         * The ipc object is *not* locked on exit.
         :         */
         :        struct kern_ipc_perm *ipc_obtain_object_check(struct ipc_ids *ids, int id)
         :        {
         :                struct kern_ipc_perm *out = ipc_obtain_object(ids, id);
    0.00 :        ffffffff81256a2b:       48 89 c2                mov    %rax,%rdx
         :
         :                if (IS_ERR(out))
    0.02 :        ffffffff81256a2e:       77 20                   ja     ffffffff81256a50 <ipc_obtain_object_check+0x40>
         :                        goto out;
         :
         :                if (ipc_checkid(out, id))
    0.00 :        ffffffff81256a30:       8d 83 ff 7f 00 00       lea    0x7fff(%rbx),%eax
    0.00 :        ffffffff81256a36:       85 db                   test   %ebx,%ebx
    0.00 :        ffffffff81256a38:       0f 48 d8                cmovs  %eax,%ebx
    0.02 :        ffffffff81256a3b:       c1 fb 0f                sar    $0xf,%ebx
    0.00 :        ffffffff81256a3e:       48 63 c3                movslq %ebx,%rax
    0.00 :        ffffffff81256a41:       48 3b 42 28             cmp    0x28(%rdx),%rax
   99.84 :        ffffffff81256a45:       48 c7 c0 d5 ff ff ff    mov    $0xffffffffffffffd5,%rax
    0.00 :        ffffffff81256a4c:       48 0f 45 d0             cmovne %rax,%rdx
         :                        return ERR_PTR(-EIDRM);
         :        out:
         :                return out;
         :        }
    0.03 :        ffffffff81256a50:       48 83 c4 08             add    $0x8,%rsp
    0.00 :        ffffffff81256a54:       48 89 d0                mov    %rdx,%rax
    0.02 :        ffffffff81256a57:       5b                      pop    %rbx
    0.00 :        ffffffff81256a58:       c9                      leaveq

sem_lock():

         :        static inline void spin_lock(spinlock_t *lock)
         :        {
         :                raw_spin_lock(&lock->rlock);
    0.10 :        ffffffff81258a7c:       4c 8d 6b 08             lea    0x8(%rbx),%r13
    0.01 :        ffffffff81258a80:       4c 89 ef                mov    %r13,%rdi
    0.01 :        ffffffff81258a83:       e8 08 4f 35 00          callq  ffffffff815ad990 <_raw_spin_lock>
         :
         :                        /*
         :                         * If sma->complex_count was set while we were spinning,
         :                         * we may need to look at things we did not lock here.
         :                         */
         :                        if (unlikely(sma->complex_count)) {
    0.02 :        ffffffff81258a88:       41 8b 44 24 7c          mov    0x7c(%r12),%eax
    6.18 :        ffffffff81258a8d:       85 c0                   test   %eax,%eax
    0.00 :        ffffffff81258a8f:       75 29                   jne    ffffffff81258aba <sem_lock+0x7a>
         :                __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
         :        }
         :
         :        static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
         :        {
         :                struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
    0.00 :        ffffffff81258a91:       41 0f b7 54 24 02       movzwl 0x2(%r12),%edx
   84.33 :        ffffffff81258a97:       41 0f b7 04 24          movzwl (%r12),%eax
         :                        /*
         :                         * Another process is holding the global lock on the
         :                         * sem_array; we cannot enter our critical section,
         :                         * but have to wait for the global lock to be released.
         :                         */
         :                        if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
    0.42 :        ffffffff81258a9c:       66 39 c2                cmp    %ax,%dx
    0.01 :        ffffffff81258a9f:       75 76                   jne    ffffffff81258b17 <sem_lock+0xd7>
         :                                spin_unlock(&sem->lock);
         :                                spin_unlock_wait(&sma->sem_perm.lock);
         :                                goto again;


ipcperms():

         :        static inline int audit_dummy_context(void)
         :        {
         :                void *p = current->audit_context;
    0.01 :        ffffffff81255f9e:       48 8b 82 d0 05 00 00    mov    0x5d0(%rdx),%rax
         :                return !p || *(int *)p;
    0.01 :        ffffffff81255fa5:       48 85 c0                test   %rax,%rax
    0.00 :        ffffffff81255fa8:       74 06                   je     ffffffff81255fb0 <ipcperms+0x50>
    0.00 :        ffffffff81255faa:       8b 00                   mov    (%rax),%eax
    0.00 :        ffffffff81255fac:       85 c0                   test   %eax,%eax
    0.00 :        ffffffff81255fae:       74 60                   je     ffffffff81256010 <ipcperms+0xb0>
         :                int requested_mode, granted_mode;
         :
         :                audit_ipc_obj(ipcp);
         :                requested_mode = (flag >> 6) | (flag >> 3) | flag;
         :                granted_mode = ipcp->mode;
         :                if (uid_eq(euid, ipcp->cuid) ||
    0.02 :        ffffffff81255fb0:       45 3b 6c 24 18          cmp    0x18(%r12),%r13d
         :                kuid_t euid = current_euid();
         :                int requested_mode, granted_mode;
         :
         :                audit_ipc_obj(ipcp);
         :                requested_mode = (flag >> 6) | (flag >> 3) | flag;
         :                granted_mode = ipcp->mode;
   99.18 :        ffffffff81255fb5:       41 0f b7 5c 24 20       movzwl 0x20(%r12),%ebx
         :                if (uid_eq(euid, ipcp->cuid) ||
    0.46 :        ffffffff81255fbb:       74 07                   je     ffffffff81255fc4 <ipcperms+0x64>
    0.00 :        ffffffff81255fbd:       45 3b 6c 24 10          cmp    0x10(%r12),%r13d
    0.00 :        ffffffff81255fc2:       75 5c                   jne    ffffffff81256020 <ipcperms+0xc0>
         :                    uid_eq(euid, ipcp->uid))
         :                        granted_mode >>= 6;
    0.02 :        ffffffff81255fc4:       c1 fb 06                sar    $0x6,%ebx
         :                else if (in_group_p(ipcp->cgid) || in_group_p(ipcp->gid))
         :                        granted_mode >>= 3;
         :                /* is there some bit set in requested_mode but not in granted_mode? */
         :                if ((requested_mode & ~granted_mode & 0007) &&
    0.00 :        ffffffff81255fc7:       44 89 f0                mov    %r14d,%eax
    0.00 :        ffffffff81255fca:       44 89 f2                mov    %r14d,%edx
    0.00 :        ffffffff81255fcd:       f7 d3                   not    %ebx
    0.02 :        ffffffff81255fcf:       66 c1 f8 06             sar    $0x6,%ax
    0.00 :        ffffffff81255fd3:       66 c1 fa 03             sar    $0x3,%dx
    0.00 :        ffffffff81255fd7:       09 d0                   or     %edx,%eax
    0.02 :        ffffffff81255fd9:       44 09 f0                or     %r14d,%eax
    0.00 :        ffffffff81255fdc:       83 e0 07                and    $0x7,%eax
    0.00 :        ffffffff81255fdf:       85 d8                   test   %ebx,%eax
    0.00 :        ffffffff81255fe1:       75 75                   jne    ffffffff81256058 <ipcperms+0xf8>
         :                    !ns_capable(ns-


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
  2013-06-14 19:05   ` Mike Galbraith
@ 2013-06-15  5:27     ` Manfred Spraul
  2013-06-15  5:48       ` Mike Galbraith
  2013-06-15 11:10     ` Manfred Spraul
  1 sibling, 1 reply; 18+ messages in thread
From: Manfred Spraul @ 2013-06-15  5:27 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: LKML, Andrew Morton, Rik van Riel, Davidlohr Bueso, hhuang,
	Linus Torvalds

On 06/14/2013 09:05 PM, Mike Galbraith wrote:
> 32 of 64 cores DL980 without the -rt killing goto again loop removal I
> showed you.  Unstable, not wonderful throughput.
Unfortunately the -rt approach is definitively unstable:
> @@ -285,9 +274,29 @@ static inline int sem_lock(struct sem_ar
>                  * but have to wait for the global lock to be released.
>                  */
>                 if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
> -                       spin_unlock(&sem->lock);
> -                       spin_unlock_wait(&sma->sem_perm.lock);
> -                       goto again;
> +                       spin_lock(&sma->sem_perm.lock);
> +                       if (sma->complex_count)
> +                               goto wait_array;
> +
> +                       /*
> +                        * Acquiring our sem->lock under the global lock
> +                        * forces new complex operations to wait for us
> +                        * to exit our critical section.
> +                        */
> +                       spin_lock(&sem->lock);
> +                       spin_unlock(&sma->sem_perm.lock);

Assume there is one op (semctl(), whatever) that acquires the global 
lock - and a continuous stream of simple ops.
- spin_is_locked() returns true due to the semctl().
- then simple ops will switch to spin_lock(&sma->sem_perm.lock).
- since the spinlock is acquired, the next operation will get true from 
spin_is_locked().

It will stay that way - as long as there is at least one op
waiting for sma->sem_perm.lock.
With enough cpus, it will stay like this forever.
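
A minimal sketch of the interleaving described above (hypothetical
timeline, not from the thread; names as in ipc/sem.c):

/*
 *  CPU0 (complex op)                 CPU1, CPU2, ... (simple ops)
 *  -----------------                 ----------------------------
 *  spin_lock(&sma->sem_perm.lock)
 *                                    spin_is_locked() -> true
 *                                    spin_lock(&sma->sem_perm.lock)  <- spins
 *  spin_unlock(&sma->sem_perm.lock)
 *                                    CPU1 now owns sem_perm.lock, so the
 *                                    next simple op sees it locked again and
 *                                    also takes the slow path: the global
 *                                    lock never falls idle.
 */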

--
     Manfred

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
  2013-06-15  5:27     ` Manfred Spraul
@ 2013-06-15  5:48       ` Mike Galbraith
  2013-06-15  7:30         ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2013-06-15  5:48 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Rik van Riel, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Sat, 2013-06-15 at 07:27 +0200, Manfred Spraul wrote:

> Assume there is one op (semctl(), whatever) that acquires the global 
> lock - and a continuous stream of simple ops.
> - spin_is_locked() returns true due to the semctl().
> - then simple ops will switch to spin_lock(&sma->sem_perm.lock).
> - since the spinlock is acquired, the next operation will get true from 
> spin_is_locked().
> 
> It will stay that way - as long as there is at least one op
> waiting for sma->sem_perm.lock.
> With enough cpus, it will stay like this forever.

Yup, pondered that yesterday, scratching my head over how to do better.
Hints highly welcome.  Maybe if I figure out how to scratch dual lock
thingy properly for -rt, non-rt will start acting sane too, as that spot
seems to be itchy in both kernels.

-Mike


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
  2013-06-15  5:48       ` Mike Galbraith
@ 2013-06-15  7:30         ` Mike Galbraith
  2013-06-15  8:36           ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2013-06-15  7:30 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Rik van Riel, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Sat, 2013-06-15 at 07:48 +0200, Mike Galbraith wrote: 
> On Sat, 2013-06-15 at 07:27 +0200, Manfred Spraul wrote:
> 
> > Assume there is one op (semctl(), whatever) that acquires the global 
> > lock - and a continuous stream of simple ops.
> > - spin_is_locked() returns true due to the semctl().
> > - then simple ops will switch to spin_lock(&sma->sem_perm.lock).
> > - since the spinlock is acquired, the next operation will get true from 
> > spin_is_locked().
> > 
> > It will stay that way - as long as there is at least one op
> > waiting for sma->sem_perm.lock.
> > With enough cpus, it will stay like this forever.
> 
> Yup, pondered that yesterday, scratching my head over how to do better.
> Hints highly welcome.  Maybe if I figure out how to scratch dual lock
> thingy properly for -rt, non-rt will start acting sane too, as that spot
> seems to be itchy in both kernels.

Gee, just flipping back to single semaphore lock mode after having had
to do the global wait thing fixed up -rt.  10 consecutive sem-waitzero 5
8 64 runs with the 3.8-rt9 kernel went like so, which is one hell of an
improvement.

Result matrix:
  Thread   0: 20209311
  Thread   1: 20255372
  Thread   2: 20082611
...
  Thread  61: 20162924
  Thread  62: 20048995
  Thread  63: 20142689

I must have screwed up something :)

static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
                              int nsops)
{
        struct sem *sem;
        int locknum;

        if (nsops == 1 && !sma->complex_count) {
                sem = sma->sem_base + sops->sem_num;

                /*
                 * Another process is holding the global lock on the
                 * sem_array; we cannot enter our critical section,
                 * but have to wait for the global lock to be released.
                 */
                if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
                        spin_lock(&sma->sem_perm.lock);
                        if (sma->complex_count)
                                goto wait_array;

                        /*
                         * Acquiring our sem->lock under the global lock
                         * forces new complex operations to wait for us
                         * to exit our critical section.
                         */
                        spin_lock(&sem->lock);
                        spin_unlock(&sma->sem_perm.lock);
                } else {
                        /* Lock just the semaphore we are interested in. */
                        spin_lock(&sem->lock);

                        /*
                         * If sma->complex_count was set prior to acquisition,
                         * we must fall back to the global array lock.
                         */
                        if (unlikely(sma->complex_count)) {
                                spin_unlock(&sem->lock);
                                goto lock_array;
                        }
                }

                locknum = sops->sem_num;
        } else {
                int i;
                /*
                 * Lock the semaphore array, and wait for all of the
                 * individual semaphore locks to go away.  The code
                 * above ensures no new single-lock holders will enter
                 * their critical section while the array lock is held.
                 */
 lock_array:
                spin_lock(&sma->sem_perm.lock);
 wait_array:
                for (i = 0; i < sma->sem_nsems; i++) {
                        sem = sma->sem_base + i;
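                        /*
                         * On -rt, only wait when the (sleeping) per-sem
                         * lock is actually observed held; non-rt builds
                         * always call spin_unlock_wait().
                         */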
#ifdef CONFIG_PREEMPT_RT_BASE
                        if (spin_is_locked(&sem->lock))
#endif
                        spin_unlock_wait(&sem->lock);
                }
                locknum = -1;

                if (nsops == 1 && !sma->complex_count) {
                        sem = sma->sem_base + sops->sem_num;
                        spin_lock(&sem->lock);
                        spin_unlock(&sma->sem_perm.lock);
                        locknum = sops->sem_num;
                }
        }
        return locknum;
}

Not very pretty, but it works markedly better.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
  2013-06-15  7:30         ` Mike Galbraith
@ 2013-06-15  8:36           ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2013-06-15  8:36 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Rik van Riel, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Sat, 2013-06-15 at 09:30 +0200, Mike Galbraith wrote:

> Gee, just flipping back to single semaphore lock mode after having had
> to do the global wait thing fixed up -rt.

But master wants a sharper rock tossed at it, it's still unstable.

Result matrix:
  Thread   0: 11190947
  Thread   1: 11743861
  Thread   2: 10039223
  Thread   3:  9776111
  Thread   4:  9734815
  Thread   5:  9941121
  Thread   6: 11604118
  Thread   7: 11716281
  Thread   8:  9828587
  Thread   9:  9589199
...

And that's one of the better samples.  Probably -rt sleepy locks help
get us flipped back to single sem lock mode.

Hohum.  On the bright side, -rt gets to scale better than mainline for a
little while, that's kinda different.

-Mike


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
  2013-06-14 19:05   ` Mike Galbraith
  2013-06-15  5:27     ` Manfred Spraul
@ 2013-06-15 11:10     ` Manfred Spraul
  2013-06-15 11:37       ` Mike Galbraith
  2013-06-18  6:48       ` Mike Galbraith
  1 sibling, 2 replies; 18+ messages in thread
From: Manfred Spraul @ 2013-06-15 11:10 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: LKML, Andrew Morton, Rik van Riel, Davidlohr Bueso, hhuang,
	Linus Torvalds

On 06/14/2013 09:05 PM, Mike Galbraith wrote:
> # Events: 802K cycles
> #
> # Overhead                                      Symbol
> # ........  ..........................................
> #
>      18.42%  [k] SYSC_semtimedop
>      15.39%  [k] sem_lock
>      10.26%  [k] _raw_spin_lock
>       9.00%  [k] perform_atomic_semop
>       7.89%  [k] system_call
>       7.70%  [k] ipc_obtain_object_check
>       6.95%  [k] ipcperms
>       6.62%  [k] copy_user_generic_string
>       4.16%  [.] __semop
>       2.57%  [.] worker_thread(void*)
>       2.30%  [k] copy_from_user
>       1.75%  [k] sem_unlock
>       1.25%  [k] ipc_obtain_object
~ 280 mio ops.
2.3% copy_from_user,
9% perform_atomic_semop.

> # Events: 802K cycles
> #
> # Overhead                           Symbol
> # ........  ...............................
> #
>      17.38%  [k] SYSC_semtimedop
>      13.26%  [k] system_call
>      11.31%  [k] copy_user_generic_string
>       7.62%  [.] __semop
>       7.18%  [k] _raw_spin_lock
>       5.66%  [k] ipcperms
>       5.40%  [k] sem_lock
>       4.65%  [k] perform_atomic_semop
>       4.22%  [k] ipc_obtain_object_check
>       4.08%  [.] worker_thread(void*)
>       4.06%  [k] copy_from_user
>       2.40%  [k] ipc_obtain_object
>       1.98%  [k] pid_vnr
>       1.45%  [k] wake_up_sem_queue_do
>       1.39%  [k] sys_semop
>       1.35%  [k] sys_semtimedop
>       1.30%  [k] sem_unlock
>       1.14%  [k] security_ipc_permission
~ 700 mio ops.
4% copy_from_user -> as expected a bit more
4.6% perform_atomic_semop --> less.

Thus: Could you send the oprofile output from perform_atomic_semop()?

Perhaps that gives us a hint.

My current guess:
sem_lock() somehow ends up in lock_array.
Lock_array scans all struct sem -> transfer of that cacheline from all 
cpus to the cpu that does the lock_array..
Then the next write by the "correct" cpu causes a transfer back when 
setting sem->pid.
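
Sketched in code (a hypothetical illustration of that guess; the loop
is the wait_array path from sem_lock() quoted earlier):

	/* complex-op cpu: reads every per-sem lock ... */
	for (i = 0; i < sma->sem_nsems; i++)
		spin_unlock_wait(&sma->sem_base[i].lock);
	/*
	 * ... pulling each struct sem cacheline to this cpu.  The next
	 * sma->sem_base[i].sempid = pid on the owning cpu must then pull
	 * its line back: two transfers per semaphore and scan.
	 */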

--
     Manfred

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
  2013-06-15 11:10     ` Manfred Spraul
@ 2013-06-15 11:37       ` Mike Galbraith
  2013-06-18  6:48       ` Mike Galbraith
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2013-06-15 11:37 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Rik van Riel, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Sat, 2013-06-15 at 13:10 +0200, Manfred Spraul wrote: 
> On 06/14/2013 09:05 PM, Mike Galbraith wrote:
> > # Events: 802K cycles
> > #
> > # Overhead                                      Symbol
> > # ........  ..........................................
> > #
> >      18.42%  [k] SYSC_semtimedop
> >      15.39%  [k] sem_lock
> >      10.26%  [k] _raw_spin_lock
> >       9.00%  [k] perform_atomic_semop
> >       7.89%  [k] system_call
> >       7.70%  [k] ipc_obtain_object_check
> >       6.95%  [k] ipcperms
> >       6.62%  [k] copy_user_generic_string
> >       4.16%  [.] __semop
> >       2.57%  [.] worker_thread(void*)
> >       2.30%  [k] copy_from_user
> >       1.75%  [k] sem_unlock
> >       1.25%  [k] ipc_obtain_object
> ~ 280 mio ops.
> 2.3% copy_from_user,
> 9% perform_atomic_semop.
> 
> > # Events: 802K cycles
> > #
> > # Overhead                           Symbol
> > # ........  ...............................
> > #
> >      17.38%  [k] SYSC_semtimedop
> >      13.26%  [k] system_call
> >      11.31%  [k] copy_user_generic_string
> >       7.62%  [.] __semop
> >       7.18%  [k] _raw_spin_lock
> >       5.66%  [k] ipcperms
> >       5.40%  [k] sem_lock
> >       4.65%  [k] perform_atomic_semop
> >       4.22%  [k] ipc_obtain_object_check
> >       4.08%  [.] worker_thread(void*)
> >       4.06%  [k] copy_from_user
> >       2.40%  [k] ipc_obtain_object
> >       1.98%  [k] pid_vnr
> >       1.45%  [k] wake_up_sem_queue_do
> >       1.39%  [k] sys_semop
> >       1.35%  [k] sys_semtimedop
> >       1.30%  [k] sem_unlock
> >       1.14%  [k] security_ipc_permission
> ~ 700 mio ops.
> 4% copy_from_user -> as expected a bit more
> 4.6% perform_atomic_semop --> less.
> 
> Thus: Could you send the oprofile output from perform_atomic_semop()?

Ok, newly profiled 32 core run.


Percent |	Source code & Disassembly of vmlinux
------------------------------------------------
         :
         :
         :
         :	Disassembly of section .text:
         :
         :	ffffffff812584d0 <perform_atomic_semop>:
         :	 * Negative values are error codes.
         :	 */
         :
         :	static int perform_atomic_semop(struct sem_array *sma, struct sembuf *sops,
         :	                             int nsops, struct sem_undo *un, int pid)
         :	{
    3.70 :	ffffffff812584d0:       55                      push   %rbp
    0.00 :	ffffffff812584d1:       48 89 e5                mov    %rsp,%rbp
    0.00 :	ffffffff812584d4:       41 54                   push   %r12
    3.40 :	ffffffff812584d6:       53                      push   %rbx
    0.00 :	ffffffff812584d7:       e8 64 dc 35 00          callq  ffffffff815b6140 <mcount>
         :	        int result, sem_op;
         :	        struct sembuf *sop;
         :	        struct sem * curr;
         :
         :	        for (sop = sops; sop < sops + nsops; sop++) {
    0.00 :	ffffffff812584dc:       48 63 d2                movslq %edx,%rdx
         :	 * Negative values are error codes.
         :	 */
         :
         :	static int perform_atomic_semop(struct sem_array *sma, struct sembuf *sops,
         :	                             int nsops, struct sem_undo *un, int pid)
         :	{
    0.00 :	ffffffff812584df:       45 89 c4                mov    %r8d,%r12d
    3.62 :	ffffffff812584e2:       48 89 cb                mov    %rcx,%rbx
         :	        int result, sem_op;
         :	        struct sembuf *sop;
         :	        struct sem * curr;
         :
         :	        for (sop = sops; sop < sops + nsops; sop++) {
    0.00 :	ffffffff812584e5:       48 8d 14 52             lea    (%rdx,%rdx,2),%rdx
    0.00 :	ffffffff812584e9:       49 89 f2                mov    %rsi,%r10
    0.00 :	ffffffff812584ec:       4c 8d 04 56             lea    (%rsi,%rdx,2),%r8
    3.53 :	ffffffff812584f0:       4c 39 c6                cmp    %r8,%rsi
    0.00 :	ffffffff812584f3:       0f 83 17 01 00 00       jae    ffffffff81258610 <perform_atomic_semop+0x140>
         :	                curr = sma->sem_base + sop->sem_num;
    0.00 :	ffffffff812584f9:       0f b7 0e                movzwl (%rsi),%ecx
         :	                sem_op = sop->sem_op;
    0.00 :	ffffffff812584fc:       0f bf 56 02             movswl 0x2(%rsi),%edx
         :	        int result, sem_op;
         :	        struct sembuf *sop;
         :	        struct sem * curr;
         :
         :	        for (sop = sops; sop < sops + nsops; sop++) {
         :	                curr = sma->sem_base + sop->sem_num;
    0.00 :	ffffffff81258500:       49 89 c9                mov    %rcx,%r9
    3.75 :	ffffffff81258503:       49 c1 e1 06             shl    $0x6,%r9
    0.00 :	ffffffff81258507:       4c 03 4f 40             add    0x40(%rdi),%r9
         :	                sem_op = sop->sem_op;
         :	                result = curr->semval;
         :	  
         :	                if (!sem_op && result)
    4.52 :	ffffffff8125850b:       85 d2                   test   %edx,%edx
         :	        struct sem * curr;
         :
         :	        for (sop = sops; sop < sops + nsops; sop++) {
         :	                curr = sma->sem_base + sop->sem_num;
         :	                sem_op = sop->sem_op;
         :	                result = curr->semval;
    0.00 :	ffffffff8125850d:       41 8b 01                mov    (%r9),%eax
         :	  
         :	                if (!sem_op && result)
   18.66 :	ffffffff81258510:       0f 84 e2 00 00 00       je     ffffffff812585f8 <perform_atomic_semop+0x128>
         :	                        goto would_block;
         :
         :	                result += sem_op;
         :	                if (result < 0)
    3.52 :	ffffffff81258516:       41 89 d3                mov    %edx,%r11d
    0.00 :	ffffffff81258519:       41 01 c3                add    %eax,%r11d
    0.00 :	ffffffff8125851c:       0f 88 de 00 00 00       js     ffffffff81258600 <perform_atomic_semop+0x130>
         :	                        goto would_block;
         :	                if (result > SEMVMX)
    0.00 :	ffffffff81258522:       41 81 fb ff 7f 00 00    cmp    $0x7fff,%r11d
    3.84 :	ffffffff81258529:       49 89 f2                mov    %rsi,%r10
    0.00 :	ffffffff8125852c:       0f 8f bb 00 00 00       jg     ffffffff812585ed <perform_atomic_semop+0x11d>
    0.00 :	ffffffff81258532:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
         :	                        goto out_of_range;
         :	                if (sop->sem_flg & SEM_UNDO) {
    0.00 :	ffffffff81258538:       41 f6 42 05 10          testb  $0x10,0x5(%r10)
    3.66 :	ffffffff8125853d:       74 1a                   je     ffffffff81258559 <perform_atomic_semop+0x89>
         :	                        int undo = un->semadj[sop->sem_num] - sem_op;
         :	                        /*
         :	                         *      Exceeding the undo range is an error.
         :	                         */
         :	                        if (undo < (-SEMAEM - 1) || undo > SEMAEM)
    0.00 :	ffffffff8125853f:       48 8b 43 40             mov    0x40(%rbx),%rax
    0.00 :	ffffffff81258543:       0f bf 04 48             movswl (%rax,%rcx,2),%eax
    0.00 :	ffffffff81258547:       29 d0                   sub    %edx,%eax
    0.00 :	ffffffff81258549:       05 00 80 00 00          add    $0x8000,%eax
    0.00 :	ffffffff8125854e:       3d ff ff 00 00          cmp    $0xffff,%eax
    0.00 :	ffffffff81258553:       0f 87 94 00 00 00       ja     ffffffff812585ed <perform_atomic_semop+0x11d>
         :	{
         :	        int result, sem_op;
         :	        struct sembuf *sop;
         :	        struct sem * curr;
         :
         :	        for (sop = sops; sop < sops + nsops; sop++) {
    3.70 :	ffffffff81258559:       49 83 c2 06             add    $0x6,%r10
         :	                         *      Exceeding the undo range is an error.
         :	                         */
         :	                        if (undo < (-SEMAEM - 1) || undo > SEMAEM)
         :	                                goto out_of_range;
         :	                }
         :	                curr->semval = result;
    0.01 :	ffffffff8125855d:       45 89 19                mov    %r11d,(%r9)
         :	{
         :	        int result, sem_op;
         :	        struct sembuf *sop;
         :	        struct sem * curr;
         :
         :	        for (sop = sops; sop < sops + nsops; sop++) {
    0.01 :	ffffffff81258560:       4d 39 c2                cmp    %r8,%r10
    0.00 :	ffffffff81258563:       0f 83 a7 00 00 00       jae    ffffffff81258610 <perform_atomic_semop+0x140>
         :	                curr = sma->sem_base + sop->sem_num;
    0.00 :	ffffffff81258569:       41 0f b7 0a             movzwl (%r10),%ecx
         :	                sem_op = sop->sem_op;
    0.00 :	ffffffff8125856d:       41 0f bf 52 02          movswl 0x2(%r10),%edx
         :	        int result, sem_op;
         :	        struct sembuf *sop;
         :	        struct sem * curr;
         :
         :	        for (sop = sops; sop < sops + nsops; sop++) {
         :	                curr = sma->sem_base + sop->sem_num;
    0.00 :	ffffffff81258572:       49 89 c9                mov    %rcx,%r9
    0.00 :	ffffffff81258575:       49 c1 e1 06             shl    $0x6,%r9
    0.00 :	ffffffff81258579:       4c 03 4f 40             add    0x40(%rdi),%r9
         :	                sem_op = sop->sem_op;
         :	                result = curr->semval;
         :	  
         :	                if (!sem_op && result)
    0.00 :	ffffffff8125857d:       85 d2                   test   %edx,%edx
         :	        struct sem * curr;
         :
         :	        for (sop = sops; sop < sops + nsops; sop++) {
         :	                curr = sma->sem_base + sop->sem_num;
         :	                sem_op = sop->sem_op;
         :	                result = curr->semval;
    0.00 :	ffffffff8125857f:       41 8b 01                mov    (%r9),%eax
         :	  
         :	                if (!sem_op && result)
    0.00 :	ffffffff81258582:       75 54                   jne    ffffffff812585d8 <perform_atomic_semop+0x108>
    0.00 :	ffffffff81258584:       85 c0                   test   %eax,%eax
    0.00 :	ffffffff81258586:       74 50                   je     ffffffff812585d8 <perform_atomic_semop+0x108>
         :
         :	out_of_range:
         :	        result = -ERANGE;
         :	        goto undo;
         :
         :	would_block:
    0.00 :	ffffffff81258588:       4c 89 d0                mov    %r10,%rax
         :	        if (sop->sem_flg & IPC_NOWAIT)
    0.00 :	ffffffff8125858b:       0f bf 40 04             movswl 0x4(%rax),%eax
    0.00 :	ffffffff8125858f:       25 00 08 00 00          and    $0x800,%eax
    0.00 :	ffffffff81258594:       83 f8 01                cmp    $0x1,%eax
    0.00 :	ffffffff81258597:       45 19 c0                sbb    %r8d,%r8d
    0.00 :	ffffffff8125859a:       41 83 e0 0c             and    $0xc,%r8d
    0.00 :	ffffffff8125859e:       41 83 e8 0b             sub    $0xb,%r8d
         :	                result = -EAGAIN;
         :	        else
         :	                result = 1;
         :
         :	undo:
         :	        sop--;
    0.00 :	ffffffff812585a2:       49 8d 4a fa             lea    -0x6(%r10),%rcx
         :	        while (sop >= sops) {
    0.00 :	ffffffff812585a6:       48 39 ce                cmp    %rcx,%rsi
    0.00 :	ffffffff812585a9:       77 1f                   ja     ffffffff812585ca <perform_atomic_semop+0xfa>
    0.00 :	ffffffff812585ab:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
         :	                sma->sem_base[sop->sem_num].semval -= sop->sem_op;
    0.00 :	ffffffff812585b0:       0f b7 01                movzwl (%rcx),%eax
    0.00 :	ffffffff812585b3:       0f bf 51 02             movswl 0x2(%rcx),%edx
         :	                sop--;
    0.00 :	ffffffff812585b7:       48 83 e9 06             sub    $0x6,%rcx
         :	                result = 1;
         :
         :	undo:
         :	        sop--;
         :	        while (sop >= sops) {
         :	                sma->sem_base[sop->sem_num].semval -= sop->sem_op;
    0.00 :	ffffffff812585bb:       48 c1 e0 06             shl    $0x6,%rax
    0.00 :	ffffffff812585bf:       48 03 47 40             add    0x40(%rdi),%rax
    0.00 :	ffffffff812585c3:       29 10                   sub    %edx,(%rax)
         :	        else
         :	                result = 1;
         :
         :	undo:
         :	        sop--;
         :	        while (sop >= sops) {
    0.00 :	ffffffff812585c5:       48 39 ce                cmp    %rcx,%rsi
    0.00 :	ffffffff812585c8:       76 e6                   jbe    ffffffff812585b0 <perform_atomic_semop+0xe0>
         :	                sma->sem_base[sop->sem_num].semval -= sop->sem_op;
         :	                sop--;
         :	        }
         :
         :	        return result;
         :	}
    0.00 :	ffffffff812585ca:       5b                      pop    %rbx
    0.00 :	ffffffff812585cb:       44 89 c0                mov    %r8d,%eax
    0.00 :	ffffffff812585ce:       41 5c                   pop    %r12
    0.00 :	ffffffff812585d0:       c9                      leaveq 
    0.00 :	ffffffff812585d1:       c3                      retq   
    0.00 :	ffffffff812585d2:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
         :	  
         :	                if (!sem_op && result)
         :	                        goto would_block;
         :
         :	                result += sem_op;
         :	                if (result < 0)
    0.00 :	ffffffff812585d8:       41 89 d3                mov    %edx,%r11d
    0.00 :	ffffffff812585db:       41 01 c3                add    %eax,%r11d
    0.00 :	ffffffff812585de:       78 a8                   js     ffffffff81258588 <perform_atomic_semop+0xb8>
         :	                        goto would_block;
         :	                if (result > SEMVMX)
    0.00 :	ffffffff812585e0:       41 81 fb ff 7f 00 00    cmp    $0x7fff,%r11d
    0.00 :	ffffffff812585e7:       0f 8e 4b ff ff ff       jle    ffffffff81258538 <perform_atomic_semop+0x68>
         :	        if (sop->sem_flg & IPC_NOWAIT)
         :	                result = -EAGAIN;
         :	        else
         :	                result = 1;
         :
         :	undo:
    0.00 :	ffffffff812585ed:       41 b8 de ff ff ff       mov    $0xffffffde,%r8d
    0.00 :	ffffffff812585f3:       eb ad                   jmp    ffffffff812585a2 <perform_atomic_semop+0xd2>
    0.00 :	ffffffff812585f5:       0f 1f 00                nopl   (%rax)
         :	        for (sop = sops; sop < sops + nsops; sop++) {
         :	                curr = sma->sem_base + sop->sem_num;
         :	                sem_op = sop->sem_op;
         :	                result = curr->semval;
         :	  
         :	                if (!sem_op && result)
    3.56 :	ffffffff812585f8:       85 c0                   test   %eax,%eax
    0.00 :	ffffffff812585fa:       0f 84 16 ff ff ff       je     ffffffff81258516 <perform_atomic_semop+0x46>
         :
         :	out_of_range:
         :	        result = -ERANGE;
         :	        goto undo;
         :
         :	would_block:
    0.00 :	ffffffff81258600:       48 89 f0                mov    %rsi,%rax
    0.00 :	ffffffff81258603:       49 89 f2                mov    %rsi,%r10
    0.00 :	ffffffff81258606:       e9 80 ff ff ff          jmpq   ffffffff8125858b <perform_atomic_semop+0xbb>
    0.00 :	ffffffff8125860b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
         :	                                goto out_of_range;
         :	                }
         :	                curr->semval = result;
         :	        }
         :
         :	        sop--;
    3.58 :	ffffffff81258610:       4d 8d 4a fa             lea    -0x6(%r10),%r9
         :	        while (sop >= sops) {
    0.00 :	ffffffff81258614:       4c 39 ce                cmp    %r9,%rsi
    0.00 :	ffffffff81258617:       77 3b                   ja     ffffffff81258654 <perform_atomic_semop+0x184>
    0.00 :	ffffffff81258619:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
         :	                sma->sem_base[sop->sem_num].sempid = pid;
    0.00 :	ffffffff81258620:       41 0f b7 01             movzwl (%r9),%eax
    3.51 :	ffffffff81258624:       48 8b 57 40             mov    0x40(%rdi),%rdx
   22.37 :	ffffffff81258628:       48 c1 e0 06             shl    $0x6,%rax
    0.00 :	ffffffff8125862c:       44 89 64 02 04          mov    %r12d,0x4(%rdx,%rax,1)
         :	                if (sop->sem_flg & SEM_UNDO)
    3.79 :	ffffffff81258631:       41 f6 41 05 10          testb  $0x10,0x5(%r9)
    0.00 :	ffffffff81258636:       74 13                   je     ffffffff8125864b <perform_atomic_semop+0x17b>
         :	                        un->semadj[sop->sem_num] -= sop->sem_op;
    0.00 :	ffffffff81258638:       41 0f b7 01             movzwl (%r9),%eax
    0.00 :	ffffffff8125863c:       41 0f b7 51 02          movzwl 0x2(%r9),%edx
    0.00 :	ffffffff81258641:       48 01 c0                add    %rax,%rax
    0.00 :	ffffffff81258644:       48 03 43 40             add    0x40(%rbx),%rax
    0.00 :	ffffffff81258648:       66 29 10                sub    %dx,(%rax)
         :	                sop--;
    3.58 :	ffffffff8125864b:       49 83 e9 06             sub    $0x6,%r9
         :	                }
         :	                curr->semval = result;
         :	        }
         :
         :	        sop--;
         :	        while (sop >= sops) {
    0.00 :	ffffffff8125864f:       4c 39 ce                cmp    %r9,%rsi
    0.00 :	ffffffff81258652:       76 cc                   jbe    ffffffff81258620 <perform_atomic_semop+0x150>
         :	                sma->sem_base[sop->sem_num].semval -= sop->sem_op;
         :	                sop--;
         :	        }
         :
         :	        return result;
         :	}
    0.00 :	ffffffff81258654:       5b                      pop    %rbx
         :	        else
         :	                result = 1;
         :
         :	undo:
         :	        sop--;
         :	        while (sop >= sops) {
    0.00 :	ffffffff81258655:       45 31 c0                xor    %r8d,%r8d
         :	                sma->sem_base[sop->sem_num].semval -= sop->sem_op;
         :	                sop--;
         :	        }
         :
         :	        return result;
         :	}
    3.67 :	ffffffff81258658:       44 89 c0                mov    %r8d,%eax
    0.00 :	ffffffff8125865b:       41 5c                   pop    %r12
    0.00 :	ffffffff8125865d:       c9                      leaveq 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
  2013-06-15 11:10     ` Manfred Spraul
  2013-06-15 11:37       ` Mike Galbraith
@ 2013-06-18  6:48       ` Mike Galbraith
  2013-06-18  7:14         ` Mike Galbraith
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2013-06-18  6:48 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Rik van Riel, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Sat, 2013-06-15 at 13:10 +0200, Manfred Spraul wrote: 
> On 06/14/2013 09:05 PM, Mike Galbraith wrote:
> > # Events: 802K cycles
> > #
> > # Overhead                                      Symbol
> > # ........  ..........................................
> > #
> >      18.42%  [k] SYSC_semtimedop
> >      15.39%  [k] sem_lock
> >      10.26%  [k] _raw_spin_lock
> >       9.00%  [k] perform_atomic_semop
> >       7.89%  [k] system_call
> >       7.70%  [k] ipc_obtain_object_check
> >       6.95%  [k] ipcperms
> >       6.62%  [k] copy_user_generic_string
> >       4.16%  [.] __semop
> >       2.57%  [.] worker_thread(void*)
> >       2.30%  [k] copy_from_user
> >       1.75%  [k] sem_unlock
> >       1.25%  [k] ipc_obtain_object
> ~ 280 mio ops.
> 2.3% copy_from_user,
> 9% perform_atomic_semop.
> 
> > # Events: 802K cycles
> > #
> > # Overhead                           Symbol
> > # ........  ...............................
> > #
> >      17.38%  [k] SYSC_semtimedop
> >      13.26%  [k] system_call
> >      11.31%  [k] copy_user_generic_string
> >       7.62%  [.] __semop
> >       7.18%  [k] _raw_spin_lock
> >       5.66%  [k] ipcperms
> >       5.40%  [k] sem_lock
> >       4.65%  [k] perform_atomic_semop
> >       4.22%  [k] ipc_obtain_object_check
> >       4.08%  [.] worker_thread(void*)
> >       4.06%  [k] copy_from_user
> >       2.40%  [k] ipc_obtain_object
> >       1.98%  [k] pid_vnr
> >       1.45%  [k] wake_up_sem_queue_do
> >       1.39%  [k] sys_semop
> >       1.35%  [k] sys_semtimedop
> >       1.30%  [k] sem_unlock
> >       1.14%  [k] security_ipc_permission
> ~ 700 mio ops.
> 4% copy_from_user -> as expected a bit more
> 4.6% perform_atomic_semop --> less.
> 
> Thus: Could you send the oprofile output from perform_atomic_semop()?
> 
> Perhaps that gives us a hint.
> 
> My current guess:
> sem_lock() somehow ends up in lock_array.
> Lock_array scans all struct sem -> transfer of that cacheline from all 
> cpus to the cpu that does the lock_array..
> Then the next write by the "correct" cpu causes a transfer back when 
> setting sem->pid.

Profiling sem_lock(), observe sma->complex_count:

         :       again:
         :              if (nsops == 1 && !sma->complex_count) {
    0.00 :      ffffffff81258a64:       75 5a                   jne    ffffffff81258ac0 <sem_lock+0x80>
    0.50 :      ffffffff81258a66:       41 8b 44 24 7c          mov    0x7c(%r12),%eax
   23.04 :      ffffffff81258a6b:       85 c0                   test   %eax,%eax
    0.00 :      ffffffff81258a6d:       75 51                   jne    ffffffff81258ac0 <sem_lock+0x80>
         :                      struct sem *sem = sma->sem_base + sops->sem_num;
    0.01 :      ffffffff81258a6f:       41 0f b7 1e             movzwl (%r14),%ebx
    0.48 :      ffffffff81258a73:       48 c1 e3 06             shl    $0x6,%rbx
    0.52 :      ffffffff81258a77:       49 03 5c 24 40          add    0x40(%r12),%rbx
         :              raw_spin_lock_init(&(_lock)->rlock);            \
         :      } while (0)
         :
         :      static inline void spin_lock(spinlock_t *lock)
         :      {
         :              raw_spin_lock(&lock->rlock);
    1.45 :      ffffffff81258a7c:       4c 8d 6b 08             lea    0x8(%rbx),%r13
    0.47 :      ffffffff81258a80:       4c 89 ef                mov    %r13,%rdi
    0.50 :      ffffffff81258a83:       e8 08 4f 35 00          callq  ffffffff815ad990 <_raw_spin_lock>
         :
         :                      /*
         :                       * If sma->complex_count was set while we were spinning,
         :                       * we may need to look at things we did not lock here.
         :                       */
         :                      if (unlikely(sma->complex_count)) {
    0.53 :      ffffffff81258a88:       41 8b 44 24 7c          mov    0x7c(%r12),%eax
   34.33 :      ffffffff81258a8d:       85 c0                   test   %eax,%eax
    0.02 :      ffffffff81258a8f:       75 29                   jne    ffffffff81258aba <sem_lock+0x7a>
         :              __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
         :      }
         :

We're taking cache misses for _both_ reads, so I speculated that
somebody somewhere has to be banging on that structure.  I tried
putting complex_count on a different cache line, and indeed, the
overhead disappeared, and the box started scaling linearly and reliably
to 32 cores.  64 is still unstable, but 32 became rock solid.

Moving ____cacheline_aligned_in_smp upward one field at a time to find
out which field was causing trouble, I ended up at sem_base... a pointer
that's never modified.  That makes zero sense to me; does anybody have
an idea why having sem_base and complex_count in the same cache line
would cause this?

---
 include/linux/sem.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/sem.h
===================================================================
--- linux-2.6.orig/include/linux/sem.h
+++ linux-2.6/include/linux/sem.h
@@ -10,10 +10,10 @@ struct task_struct;

 /* One sem_array data structure for each set of semaphores in the system. */
 struct sem_array {
-       struct kern_ipc_perm    ____cacheline_aligned_in_smp
-                               sem_perm;       /* permissions .. see ipc.h */
+       struct kern_ipc_perm    sem_perm;       /* permissions .. see ipc.h */
        time_t                  sem_ctime;      /* last change time */
-       struct sem              *sem_base;      /* ptr to first semaphore in array */
+       struct sem              ____cacheline_aligned_in_smp
+                               *sem_base;      /* ptr to first semaphore in array */
        struct list_head        pending_alter;  /* pending operations */
                                                /* that alter the array */
        struct list_head        pending_const;  /* pending complex operations */
@@ -21,7 +21,7 @@ struct sem_array {
        struct list_head        list_id;        /* undo requests on this array */
        int                     sem_nsems;      /* no. of semaphores in array */
        int                     complex_count;  /* pending complex operations */
-};
+} ____cacheline_aligned_in_smp;

 #ifdef CONFIG_SYSVIPC

If I put the ____cacheline_aligned_in_smp anyplace below sem_base, the
overhead goes *poof* gone.  For a 64-core run, massive overhead
reappears, but at spin_is_locked(), and we thrash badly again, with the
characteristic wildly uneven distribution... but total throughput is
still much better than without the move.
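
One quick way to see which fields share a 64-byte line is an offsetof()
dump (standalone userspace sketch, not from the thread; the stub below
only mimics the layout quoted above, and the kern_ipc_perm and
list_head sizes are assumed placeholders):

#include <stddef.h>
#include <stdio.h>

struct sem_array_stub {
	char	sem_perm[192];		/* placeholder for kern_ipc_perm */
	long	sem_ctime;
	void	*sem_base;
	char	pending_alter[16];	/* placeholder for list_head */
	char	pending_const[16];
	char	list_id[16];
	int	sem_nsems;
	int	complex_count;
};

int main(void)
{
	printf("sem_base:      offset %3zu, line %zu\n",
	       offsetof(struct sem_array_stub, sem_base),
	       offsetof(struct sem_array_stub, sem_base) / 64);
	printf("complex_count: offset %3zu, line %zu\n",
	       offsetof(struct sem_array_stub, complex_count),
	       offsetof(struct sem_array_stub, complex_count) / 64);
	return 0;
}

Equal line numbers mean the two fields can false-share; pahole on a
real vmlinux gives the authoritative layout.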

In -rt, even without moving complex_count, only using my livelock-free
version of sem_lock(), we magically scale to 64 cores with your patches,
returning a 650-fold throughput improvement for sem-waitzero.  I say
"magically" because that patch should not have the effect it does, yet
it has that same effect on mainline up to 32 cores.  Changing memory
access patterns around a little has a wild and fully repeatable impact
on throughput.

Something is rotten in Denmark; any ideas out there as to what that
something might be?  How the heck can moving that pointer have the
effect it does?  How can my livelock fix for -rt have the same effect on
mainline as moving complex_count, and much more effect for -rt, to the
point that we start scaling perfectly to 64 cores?  AFAICT, it should be
a noop or maybe cost a bit, but it's confused: it insists on being a
performance patch, when it's really only a "remove the darn livelockable
loop" patch.

-Mike


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
  2013-06-18  6:48       ` Mike Galbraith
@ 2013-06-18  7:14         ` Mike Galbraith
  2013-06-19 12:57           ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2013-06-18  7:14 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Rik van Riel, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Tue, 2013-06-18 at 08:48 +0200, Mike Galbraith wrote: 
> On Sat, 2013-06-15 at 13:10 +0200, Manfred Spraul wrote: 

P.S.

> > My current guess:
> > sem_lock() somehow ends up in lock_array.

Tracing shows that happens precisely once, at the end of the benchmark.
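
(A sketch of one way to confirm that, assuming the sem_lock() variant
quoted earlier in the thread - a trace_printk() in the global path:)

 lock_array:
	trace_printk("sem_lock: global path, complex_count=%d\n",
		     sma->complex_count);
	spin_lock(&sma->sem_perm.lock);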

-Mike


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO
  2013-06-18  7:14         ` Mike Galbraith
@ 2013-06-19 12:57           ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2013-06-19 12:57 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Rik van Riel, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Tue, 2013-06-18 at 09:14 +0200, Mike Galbraith wrote: 
> On Tue, 2013-06-18 at 08:48 +0200, Mike Galbraith wrote: 
> > On Sat, 2013-06-15 at 13:10 +0200, Manfred Spraul wrote: 
> 
> P.S.
> 
> > > My current guess:
> > > sem_lock() somehow ends up in lock_array.
> 
> Tracing shows that happens precisely one time, at end of benchmark.

FWIW, below is a profile of 3.8-rt scaling to 64 cores.  No mucking
about with sem_array, only the livelock hack doing its thing... whatever
that is.

# Events: 1M cycles
#
# Overhead                                   Symbol
# ........  .......................................
#
    16.71%  [k] sys_semtimedop
    11.31%  [k] system_call
    11.23%  [k] copy_user_generic_string
     7.59%  [.] __semop
     4.88%  [k] sem_lock
     4.52%  [k] rt_spin_lock
     4.48%  [k] rt_spin_unlock
     3.75%  [.] worker_thread(void*)
     3.56%  [k] perform_atomic_semop
     3.15%  [k] idr_find
     3.15%  [k] ipc_obtain_object_check
     3.04%  [k] pid_vnr
     2.97%  [k] migrate_enable
     2.79%  [k] migrate_disable
     2.46%  [k] ipcperms
     2.01%  [k] sysret_check
     1.86%  [k] pin_current_cpu
     1.49%  [k] unpin_current_cpu
     1.13%  [k] __rcu_read_lock
     1.11%  [k] __rcu_read_unlock
     1.11%  [k] ipc_obtain_object

 Percent |      Source code & Disassembly of vmlinux
------------------------------------------------
         :
         :
         :
         :      Disassembly of section .text:
         :
         :      ffffffff81273390 <sem_lock>:
         :       * checking each local lock once. This means that the local lock paths
         :       * cannot start their critical sections while the global lock is held.
         :       */
         :      static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
         :                                    int nsops)
         :      {
    3.90 :      ffffffff81273390:       41 55                   push   %r13
    0.00 :      ffffffff81273392:       49 89 f5                mov    %rsi,%r13
    0.00 :      ffffffff81273395:       41 54                   push   %r12
    0.00 :      ffffffff81273397:       41 89 d4                mov    %edx,%r12d
    3.86 :      ffffffff8127339a:       55                      push   %rbp
    0.00 :      ffffffff8127339b:       48 89 fd                mov    %rdi,%rbp
    0.00 :      ffffffff8127339e:       53                      push   %rbx
    0.00 :      ffffffff8127339f:       48 83 ec 08             sub    $0x8,%rsp
         :              struct sem *sem;
         :              int locknum;
         :
         :              if (nsops == 1 && !sma->complex_count) {
    3.76 :      ffffffff812733a3:       83 fa 01                cmp    $0x1,%edx
    0.00 :      ffffffff812733a6:       75 0c                   jne    ffffffff812733b4 <sem_lock+0x24>
    0.00 :      ffffffff812733a8:       44 8b 87 a4 00 00 00    mov    0xa4(%rdi),%r8d
    8.92 :      ffffffff812733af:       45 85 c0                test   %r8d,%r8d
    0.00 :      ffffffff812733b2:       74 5c                   je     ffffffff81273410 <sem_lock+0x80>
         :                       * individual semaphore locks to go away.  The code
         :                       * above ensures no new single-lock holders will enter
         :                       * their critical section while the array lock is held.
         :                       */
         :       lock_array:
         :                      spin_lock(&sma->sem_perm.lock);
    0.00 :      ffffffff812733b4:       e8 57 4b e1 ff          callq  ffffffff81087f10 <migrate_disable>
    0.00 :      ffffffff812733b9:       48 89 ef                mov    %rbp,%rdi
    0.00 :      ffffffff812733bc:       e8 1f d4 34 00          callq  ffffffff815c07e0 <rt_spin_lock>
         :       wait_array:
         :                      for (i = 0; i < sma->sem_nsems; i++) {
    0.00 :      ffffffff812733c1:       8b 8d a0 00 00 00       mov    0xa0(%rbp),%ecx
    0.00 :      ffffffff812733c7:       85 c9                   test   %ecx,%ecx
    0.00 :      ffffffff812733c9:       7e 2b                   jle    ffffffff812733f6 <sem_lock+0x66>
    0.00 :      ffffffff812733cb:       31 db                   xor    %ebx,%ebx
    0.00 :      ffffffff812733cd:       0f 1f 00                nopl   (%rax)
         :                              sem = sma->sem_base + i;
    0.00 :      ffffffff812733d0:       48 63 c3                movslq %ebx,%rax
    0.00 :      ffffffff812733d3:       48 c1 e0 07             shl    $0x7,%rax
    0.00 :      ffffffff812733d7:       48 03 45 68             add    0x68(%rbp),%rax
         :      #ifdef CONFIG_PREEMPT_RT_BASE
         :                              if (spin_is_locked(&sem->lock))
    0.00 :      ffffffff812733db:       48 83 78 20 00          cmpq   $0x0,0x20(%rax)
    0.00 :      ffffffff812733e0:       74 09                   je     ffffffff812733eb <sem_lock+0x5b>
         :      #endif
         :                              spin_unlock_wait(&sem->lock);
    0.00 :      ffffffff812733e2:       48 8d 78 08             lea    0x8(%rax),%rdi
    0.00 :      ffffffff812733e6:       e8 b5 d4 34 00          callq  ffffffff815c08a0 <rt_spin_unlock_wait>
         :                       * their critical section while the array lock is held.
         :                       */
         :       lock_array:
         :                      spin_lock(&sma->sem_perm.lock);
         :       wait_array:
         :                      for (i = 0; i < sma->sem_nsems; i++) {
    0.00 :      ffffffff812733eb:       83 c3 01                add    $0x1,%ebx
    0.00 :      ffffffff812733ee:       39 9d a0 00 00 00       cmp    %ebx,0xa0(%rbp)
    0.00 :      ffffffff812733f4:       7f da                   jg     ffffffff812733d0 <sem_lock+0x40>
         :      #endif
         :                              spin_unlock_wait(&sem->lock);
         :                      }
         :                      locknum = -1;
         :
         :                      if (nsops == 1 && !sma->complex_count) {
    0.00 :      ffffffff812733f6:       41 83 ec 01             sub    $0x1,%r12d
    0.00 :      ffffffff812733fa:       74 51                   je     ffffffff8127344d <sem_lock+0xbd>
         :                              spin_unlock(&sma->sem_perm.lock);
         :                              locknum = sops->sem_num;
         :                      }
         :              }
         :              return locknum;
         :      }
    0.00 :      ffffffff812733fc:       48 83 c4 08             add    $0x8,%rsp
         :
         :                      if (nsops == 1 && !sma->complex_count) {
         :                              sem = sma->sem_base + sops->sem_num;
         :                              spin_lock(&sem->lock);
         :                              spin_unlock(&sma->sem_perm.lock);
         :                              locknum = sops->sem_num;
    0.00 :      ffffffff81273400:       b8 ff ff ff ff          mov    $0xffffffff,%eax
         :                      }
         :              }
         :              return locknum;
         :      }
    0.00 :      ffffffff81273405:       5b                      pop    %rbx
    0.00 :      ffffffff81273406:       5d                      pop    %rbp
    0.00 :      ffffffff81273407:       41 5c                   pop    %r12
    0.00 :      ffffffff81273409:       41 5d                   pop    %r13
    0.00 :      ffffffff8127340b:       c3                      retq   
    0.00 :      ffffffff8127340c:       0f 1f 40 00             nopl   0x0(%rax)
         :      {
         :              struct sem *sem;
         :              int locknum;
         :
         :              if (nsops == 1 && !sma->complex_count) {
         :                      sem = sma->sem_base + sops->sem_num;
    3.69 :      ffffffff81273410:       0f b7 1e                movzwl (%rsi),%ebx
    0.01 :      ffffffff81273413:       48 c1 e3 07             shl    $0x7,%rbx
    3.98 :      ffffffff81273417:       48 03 5f 68             add    0x68(%rdi),%rbx
         :                      /*
         :                       * Another process is holding the global lock on the
         :                       * sem_array; we cannot enter our critical section,
         :                       * but have to wait for the global lock to be released.
         :                       */
         :                      if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
   10.68 :      ffffffff8127341b:       48 83 7f 18 00          cmpq   $0x0,0x18(%rdi)
   22.07 :      ffffffff81273420:       75 60                   jne    ffffffff81273482 <sem_lock+0xf2>
         :                               */
         :                              spin_lock(&sem->lock);
         :                              spin_unlock(&sma->sem_perm.lock);
         :                      } else {
         :                              /* Lock just the semaphore we are interested in. */
         :                              spin_lock(&sem->lock);
    0.01 :      ffffffff81273422:       48 83 c3 08             add    $0x8,%rbx
    3.69 :      ffffffff81273426:       e8 e5 4a e1 ff          callq  ffffffff81087f10 <migrate_disable>
    7.42 :      ffffffff8127342b:       48 89 df                mov    %rbx,%rdi
    0.00 :      ffffffff8127342e:       e8 ad d3 34 00          callq  ffffffff815c07e0 <rt_spin_lock>
         :
         :                              /*
         :                               * If sma->complex_count was set prior to acquisition,
         :                               * we must fall back to the global array lock.
         :                               */
         :                              if (unlikely(sma->complex_count)) {
    3.75 :      ffffffff81273433:       8b b5 a4 00 00 00       mov    0xa4(%rbp),%esi
   16.44 :      ffffffff81273439:       85 f6                   test   %esi,%esi
    0.00 :      ffffffff8127343b:       75 68                   jne    ffffffff812734a5 <sem_lock+0x115>
         :
         :                      if (nsops == 1 && !sma->complex_count) {
         :                              sem = sma->sem_base + sops->sem_num;
         :                              spin_lock(&sem->lock);
         :                              spin_unlock(&sma->sem_perm.lock);
         :                              locknum = sops->sem_num;
    0.00 :      ffffffff8127343d:       41 0f b7 45 00          movzwl 0x0(%r13),%eax
         :                      }
         :              }
         :              return locknum;
         :      }
    0.00 :      ffffffff81273442:       48 83 c4 08             add    $0x8,%rsp
    3.94 :      ffffffff81273446:       5b                      pop    %rbx
    0.00 :      ffffffff81273447:       5d                      pop    %rbp
    0.01 :      ffffffff81273448:       41 5c                   pop    %r12
    3.86 :      ffffffff8127344a:       41 5d                   pop    %r13
    0.00 :      ffffffff8127344c:       c3                      retq   
         :      #endif
         :                              spin_unlock_wait(&sem->lock);
         :                      }
         :                      locknum = -1;
         :
         :                      if (nsops == 1 && !sma->complex_count) {
    0.00 :      ffffffff8127344d:       8b 95 a4 00 00 00       mov    0xa4(%rbp),%edx
    0.00 :      ffffffff81273453:       85 d2                   test   %edx,%edx
    0.00 :      ffffffff81273455:       75 a5                   jne    ffffffff812733fc <sem_lock+0x6c>
         :                              sem = sma->sem_base + sops->sem_num;
    0.00 :      ffffffff81273457:       41 0f b7 5d 00          movzwl 0x0(%r13),%ebx
    0.00 :      ffffffff8127345c:       48 c1 e3 07             shl    $0x7,%rbx
    0.00 :      ffffffff81273460:       48 03 5d 68             add    0x68(%rbp),%rbx
         :                              spin_lock(&sem->lock);
    0.00 :      ffffffff81273464:       e8 a7 4a e1 ff          callq  ffffffff81087f10 <migrate_disable>
    0.00 :      ffffffff81273469:       48 8d 7b 08             lea    0x8(%rbx),%rdi
    0.00 :      ffffffff8127346d:       e8 6e d3 34 00          callq  ffffffff815c07e0 <rt_spin_lock>
         :                              spin_unlock(&sma->sem_perm.lock);
    0.00 :      ffffffff81273472:       48 89 ef                mov    %rbp,%rdi
    0.00 :      ffffffff81273475:       e8 e6 d3 34 00          callq  ffffffff815c0860 <rt_spin_unlock>
    0.00 :      ffffffff8127347a:       e8 41 48 e1 ff          callq  ffffffff81087cc0 <migrate_enable>
    0.00 :      ffffffff8127347f:       90                      nop
    0.00 :      ffffffff81273480:       eb bb                   jmp    ffffffff8127343d <sem_lock+0xad>
    0.00 :      ffffffff81273482:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
         :                       * Another process is holding the global lock on the
         :                       * sem_array; we cannot enter our critical section,
         :                       * but have to wait for the global lock to be released.
         :                       */
         :                      if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
         :                              spin_lock(&sma->sem_perm.lock);
    0.00 :      ffffffff81273488:       e8 83 4a e1 ff          callq  ffffffff81087f10 <migrate_disable>
    0.00 :      ffffffff8127348d:       48 89 ef                mov    %rbp,%rdi
    0.00 :      ffffffff81273490:       e8 4b d3 34 00          callq  ffffffff815c07e0 <rt_spin_lock>
         :                              if (sma->complex_count)
    0.00 :      ffffffff81273495:       8b bd a4 00 00 00       mov    0xa4(%rbp),%edi
    0.00 :      ffffffff8127349b:       85 ff                   test   %edi,%edi
    0.00 :      ffffffff8127349d:       0f 85 1e ff ff ff       jne    ffffffff812733c1 <sem_lock+0x31>
    0.00 :      ffffffff812734a3:       eb bf                   jmp    ffffffff81273464 <sem_lock+0xd4>
         :                              /*
         :                               * If sma->complex_count was set prior to acquisition,
         :                               * we must fall back to the global array lock.
         :                               */
         :                              if (unlikely(sma->complex_count)) {
         :                                      spin_unlock(&sem->lock);
    0.00 :      ffffffff812734a5:       48 89 df                mov    %rbx,%rdi
    0.00 :      ffffffff812734a8:       e8 b3 d3 34 00          callq  ffffffff815c0860 <rt_spin_unlock>
    0.00 :      ffffffff812734ad:       0f 1f 00                nopl   (%rax)
    0.00 :      ffffffff812734b0:       e8 0b 48 e1 ff          callq  ffffffff81087cc0 <migrate_enable>
         :                                      goto lock_array;
    0.00 :      ffffffff812734b5:       e9 fa fe ff ff          jmpq   ffffffff812733b4 <sem_lock+0x24>
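
For readability, here is the same function pieced back together from
the interleaved source above -- a sketch of the 3.8-rt variant for
reading alongside the profile, not verbatim source:

static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
			   int nsops)
{
	struct sem *sem;
	int locknum, i;

	if (nsops == 1 && !sma->complex_count) {
		sem = sma->sem_base + sops->sem_num;

		/*
		 * Another process is holding the global lock on the
		 * sem_array; we cannot enter our critical section,
		 * but have to wait for the global lock to be released.
		 */
		if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
			spin_lock(&sma->sem_perm.lock);
			if (sma->complex_count)
				goto wait_array;
			/* Global lock is ours and no complex ops are
			 * pending: take the per-semaphore lock, then
			 * release the global lock. */
			spin_lock(&sem->lock);
			spin_unlock(&sma->sem_perm.lock);
			return sops->sem_num;
		}

		/* Lock just the semaphore we are interested in. */
		spin_lock(&sem->lock);

		/*
		 * If sma->complex_count was set prior to acquisition,
		 * we must fall back to the global array lock.
		 */
		if (unlikely(sma->complex_count)) {
			spin_unlock(&sem->lock);
			goto lock_array;
		}
		return sops->sem_num;
	}

lock_array:
	spin_lock(&sma->sem_perm.lock);
wait_array:
	for (i = 0; i < sma->sem_nsems; i++) {
		sem = sma->sem_base + i;
#ifdef CONFIG_PREEMPT_RT_BASE
		/* the livelock guard: skip locks nobody holds */
		if (spin_is_locked(&sem->lock))
#endif
			spin_unlock_wait(&sem->lock);
	}
	locknum = -1;

	if (nsops == 1 && !sma->complex_count) {
		sem = sma->sem_base + sops->sem_num;
		spin_lock(&sem->lock);
		spin_unlock(&sma->sem_perm.lock);
		locknum = sops->sem_num;
	}
	return locknum;
}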




Thread overview: 18+ messages
2013-06-10 17:16 [PATCH 0/6] ipc/sem.c: performance improvements, FIFO Manfred Spraul
2013-06-10 17:16 ` [PATCH 1/6] ipc/util.c, ipc_rcu_alloc: cacheline align allocation Manfred Spraul
2013-06-10 17:16   ` [PATCH 2/6] ipc/sem.c: cacheline align the semaphore structures Manfred Spraul
2013-06-10 17:16     ` [PATCH 3/6] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues Manfred Spraul
2013-06-10 17:16       ` [PATCH 4/6] ipc/sem.c: Always use only one queue for alter operations Manfred Spraul
2013-06-10 17:16         ` [PATCH 5/6] ipc/sem.c: Replace shared sem_otime with per-semaphore value Manfred Spraul
2013-06-10 17:16           ` [PATCH 6/6] ipc/sem.c: Rename try_atomic_semop() to perform_atomic_semop(), docu update Manfred Spraul
2013-06-14 15:38 ` [PATCH 0/6] ipc/sem.c: performance improvements, FIFO Manfred Spraul
2013-06-14 19:05   ` Mike Galbraith
2013-06-15  5:27     ` Manfred Spraul
2013-06-15  5:48       ` Mike Galbraith
2013-06-15  7:30         ` Mike Galbraith
2013-06-15  8:36           ` Mike Galbraith
2013-06-15 11:10     ` Manfred Spraul
2013-06-15 11:37       ` Mike Galbraith
2013-06-18  6:48       ` Mike Galbraith
2013-06-18  7:14         ` Mike Galbraith
2013-06-19 12:57           ` Mike Galbraith
