linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC] Optimize semtimedop
@ 2010-04-12 18:49 Chris Mason
  2010-04-12 18:49 ` [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop Chris Mason
  2010-04-12 18:49 ` [PATCH 2/2] ipc semaphores: order wakeups based on waiter CPU Chris Mason
  0 siblings, 2 replies; 25+ messages in thread
From: Chris Mason @ 2010-04-12 18:49 UTC (permalink / raw)
  To: chris.mason, zach.brown, jens.axboe, linux-kernel, Nick Piggin,
	Manfred Spraul

We've been poking around in semtimedop for a while now, mostly because
it is consistently showing up at the top of the CPU profiles for benchmarking
runs on big numa systems.  The biggest problem seems to be the IPC lock, and
the fact that we hold it for a long time while we loop over different lists and
try to do semaphore operations.

Zach Brown came up with a set of patches a while ago that switched away from
the global pending list, and semtimedop was recently optimized for the
single sop case by Nick and Manfred.

This patch series tries to build on ideas from all of these patches.  The
list of pending semaphore operations is pushed down to the individual
semaphore and the locking is also pushed down into the semaphore.  The
result is much faster with my microbenchmark:

http://oss.oracle.com/~mason/sembench.c

It more than doubles the total number of post/wait cycles the benchmark
is able to get in 30s.  Before this patch, semtimedop scored about the
same as futexes for post/wait cycles; now it is 2x faster.

I did run this code through all of the ltp ipc tests, and later this week I
hope to get a full tpc database benchmark on it.



* [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-12 18:49 [PATCH RFC] Optimize semtimedop Chris Mason
@ 2010-04-12 18:49 ` Chris Mason
  2010-04-13 17:15   ` Manfred Spraul
  2010-04-16 11:26   ` Manfred Spraul
  2010-04-12 18:49 ` [PATCH 2/2] ipc semaphores: order wakeups based on waiter CPU Chris Mason
  1 sibling, 2 replies; 25+ messages in thread
From: Chris Mason @ 2010-04-12 18:49 UTC (permalink / raw)
  To: chris.mason, zach.brown, jens.axboe, linux-kernel, Nick Piggin,
	Manfred Spraul

One feature of ipc semaphores is that they are defined to be
atomic across the full set of operations done per syscall.  So if you do a
semtimedop syscall changing 100 semaphores, the kernel needs to try all
100 changes and only apply them when all 100 are able to succeed without
putting the process to sleep.
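
For illustration, a multi-operation call from userland looks roughly
like this (a hypothetical sketch, not part of this patch):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* change 100 semaphores in one atomic call: either all 100 operations
 * succeed together, or the process sleeps and none of them are applied */
static int take_all(int semid)
{
	struct sembuf sops[100];
	int i;

	for (i = 0; i < 100; i++) {
		sops[i].sem_num = i;
		sops[i].sem_op = -1;	/* decrement each semaphore */
		sops[i].sem_flg = 0;
	}
	/* NULL timeout: block until the whole set can succeed */
	return semtimedop(semid, sops, 100, NULL);
}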

Today we use a single lock per semaphore array (the ipc lock).  This lock
is held every time we try a set of operations requested by userland, and
taken again when a process is woken up.

Whenever a given set of changes sent to semtimedop would sleep, that
set is queued up on a big list of pending changes for the entire
semaphore array.

Whenever a semtimedop call changes a single semaphore value, it
walks the entire list of pending operations to see if any of them
can now succeed.  The ipc lock is held for this entire loop.

This patch makes two major changes, pushing both the list of pending
operations and a spinlock down to each individual semaphore.  Now:

Whenever a given semaphore modification is going to block, the set of
operations semtimedop wants to do is saved onto that semaphore's list.

Whenever a given semtimedop call changes a single semaphore value, it
walks the list of pending operations on that single semaphore to see if
they can now succeed.  If any of the operations will block on a
different semaphore, they are moved to that semaphore's list.

The locking is now done per-semaphore.  In order to get the changes done
atomically, the lock of every semaphore being changed is taken while we
test the requested operations.  We sort the operations by semaphore id
to make sure we don't deadlock in the kernel.

I have a microbenchmark to test how quickly we can post and wait in
bulk.  With this change, semtimedop is able to do more than twice
as much work in the same run.  On a large numa machine, it brings
the IPC lock system time (reported by perf) down from 85% to 15%.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
---
 include/linux/sem.h |    4 +-
 ipc/sem.c           |  504 +++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 372 insertions(+), 136 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 8a4adbe..8b97b51 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -86,6 +86,7 @@ struct task_struct;
 struct sem {
 	int	semval;		/* current value */
 	int	sempid;		/* pid of last operation */
+	spinlock_t		lock;
 	struct list_head sem_pending; /* pending single-sop operations */
 };
 
@@ -95,15 +96,12 @@ struct sem_array {
 	time_t			sem_otime;	/* last semop time */
 	time_t			sem_ctime;	/* last change time */
 	struct sem		*sem_base;	/* ptr to first semaphore in array */
-	struct list_head	sem_pending;	/* pending operations to be processed */
 	struct list_head	list_id;	/* undo requests on this array */
 	int			sem_nsems;	/* no. of semaphores in array */
-	int			complex_count;	/* pending complex operations */
 };
 
 /* One queue for each sleeping process in the system. */
 struct sem_queue {
-	struct list_head	simple_list; /* queue of pending operations */
 	struct list_head	list;	 /* queue of pending operations */
 	struct task_struct	*sleeper; /* this process */
 	struct sem_undo		*undo;	 /* undo structure */
diff --git a/ipc/sem.c b/ipc/sem.c
index dbef95b..335cd35 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -83,6 +83,8 @@
 #include <linux/rwsem.h>
 #include <linux/nsproxy.h>
 #include <linux/ipc_namespace.h>
+#include <linux/sort.h>
+#include <linux/list_sort.h>
 
 #include <asm/uaccess.h>
 #include "util.h"
@@ -195,24 +197,23 @@ static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
  * Without the check/retry algorithm a lockless wakeup is possible:
  * - queue.status is initialized to -EINTR before blocking.
  * - wakeup is performed by
- *	* unlinking the queue entry from sma->sem_pending
  *	* setting queue.status to IN_WAKEUP
  *	  This is the notification for the blocked thread that a
  *	  result value is imminent.
  *	* call wake_up_process
  *	* set queue.status to the final value.
  * - the previously blocked thread checks queue.status:
- *   	* if it's IN_WAKEUP, then it must wait until the value changes
- *   	* if it's not -EINTR, then the operation was completed by
- *   	  update_queue. semtimedop can return queue.status without
- *   	  performing any operation on the sem array.
- *   	* otherwise it must acquire the spinlock and check what's up.
+ *	* if it's IN_WAKEUP, then it must wait until the value changes
+ *	* if it's not -EINTR, then the operation was completed by
+ *	  update_queue. semtimedop can return queue.status without
+ *	  performing any operation on the sem array.
+ *	* otherwise it must find itself on the list of pending operations.
  *
  * The two-stage algorithm is necessary to protect against the following
  * races:
  * - if queue.status is set after wake_up_process, then the woken up idle
- *   thread could race forward and try (and fail) to acquire sma->lock
- *   before update_queue had a chance to set queue.status
+ *   thread could race forward and not realize its semaphore operation had
+ *   happened.
  * - if queue.status is written before wake_up_process and if the
  *   blocked process is woken up by a signal between writing
  *   queue.status and the wake_up_process, then the woken up
@@ -275,11 +276,11 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 
 	sma->sem_base = (struct sem *) &sma[1];
 
-	for (i = 0; i < nsems; i++)
+	for (i = 0; i < nsems; i++) {
 		INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
+		spin_lock_init(&sma->sem_base[i].lock);
+	}
 
-	sma->complex_count = 0;
-	INIT_LIST_HEAD(&sma->sem_pending);
 	INIT_LIST_HEAD(&sma->list_id);
 	sma->sem_nsems = nsems;
 	sma->sem_ctime = get_seconds();
@@ -338,34 +339,115 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
 }
 
 /*
+ * when a semaphore is modified, we want to retry the series of operations
+ * for anyone that was blocking on that semaphore.  This breaks down into
+ * a few different common operations:
+ *
+ * 1) One modification releases one or more waiters for zero.
+ * 2) Many waiters are trying to get a single lock, only one will get it.
+ * 3) Many modifications to the count will succeed.
+ *
+ * For case one, we copy over anyone waiting for zero when the semval is
+ * zero.  We don't bother copying them over if the semval isn't zero yet.
+ *
+ * For case two, we copy over the first queue trying to modify the semaphore,
+ * assuming it is trying to get a lock.
+ *
+ * For case three, after the first queue trying to change this semaphore is
+ * run, it will call this function again.  It'll find the next queue
+ * that wants to change things at that time.
+ *
+ * The goal behind all of this is to avoid retrying atomic ops that have
+ * no hope of actually completing.  It is optimized for the case where a
+ * call modifies a single semaphore at a time.
+ */
+static void copy_sem_queue(unsigned long semval,
+			   unsigned short sem_num, struct list_head *queue,
+			   struct list_head *dest)
+{
+	struct sem_queue *q;
+	struct sem_queue *safe;
+
+	list_for_each_entry_safe(q, safe, queue, list) {
+		/*
+		 * if this is a complex operation, we don't really know what is
+		 * going on.  Splice the whole list over to preserve the queue
+		 * order.
+		 */
+		if (q->sops[0].sem_num != sem_num) {
+			list_splice_tail_init(queue, dest);
+			break;
+		}
+
+		/*
+		 * they are waiting for zero, leave it on the list if
+		 * we're not at zero yet, otherwise copy it over
+		 */
+		if (q->sops[0].sem_op == 0) {
+			if (semval == 0) {
+				list_del(&q->list);
+				list_add_tail(&q->list, dest);
+			}
+			continue;
+		}
+
+		/*
+		 * at this point we know the first sop in the queue is
+		 * changing this semaphore.  Copy this one queue over
+		 * and leave the rest.  If more than one alter is going
+		 * to succeed, the others will bubble in after each
+		 * one is able to modify the queue.
+		 */
+		list_del(&q->list);
+		list_add_tail(&q->list, dest);
+		break;
+	}
+}
+
+/*
  * Determine whether a sequence of semaphore operations would succeed
  * all at once. Return 0 if yes, 1 if need to sleep, else return error code.
  */
-
-static int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
-			     int nsops, struct sem_undo *un, int pid)
+static noinline int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
+			     int nsops, struct sem_undo *un, int pid,
+			     struct list_head *pending, struct sem **blocker)
 {
 	int result, sem_op;
 	struct sembuf *sop;
 	struct sem * curr;
+	int last = 0;
 
 	for (sop = sops; sop < sops + nsops; sop++) {
 		curr = sma->sem_base + sop->sem_num;
+
+		/*
+		 * deal with userland sending the same
+		 * sem_num twice.  Thanks to sort they will
+		 * be adjacent.  We unlock in the loops below.
+		 */
+		if (sop == sops || last != sop->sem_num)
+			spin_lock(&curr->lock);
+
+		last = sop->sem_num;
 		sem_op = sop->sem_op;
 		result = curr->semval;
-  
-		if (!sem_op && result)
+
+		if (!sem_op && result) {
+			*blocker = curr;
 			goto would_block;
+		}
 
 		result += sem_op;
-		if (result < 0)
+		if (result < 0) {
+			*blocker = curr;
 			goto would_block;
+		}
 		if (result > SEMVMX)
 			goto out_of_range;
 		if (sop->sem_flg & SEM_UNDO) {
 			int undo = un->semadj[sop->sem_num] - sem_op;
 			/*
-	 		 *	Exceeding the undo range is an error.
+			 *	Exceeding the undo range is an error.
 			 */
 			if (undo < (-SEMAEM - 1) || undo > SEMAEM)
 				goto out_of_range;
@@ -380,7 +462,27 @@ static int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
 			un->semadj[sop->sem_num] -= sop->sem_op;
 		sop--;
 	}
-	
+
+	/*
+	 * our operation is going to succeed, do any list splicing
+	 * required so that we can try to wakeup people waiting on the
+	 * sems we've changed.
+	 */
+	for (sop = sops; sop < sops + nsops; sop++) {
+		/* if there are duplicate sem_nums in the list
+		 * we only want to process the first one
+		 */
+		if (sop != sops && last == sop->sem_num)
+			continue;
+
+		curr = sma->sem_base + sop->sem_num;
+		if (sop->sem_op)
+			copy_sem_queue(curr->semval, sop->sem_num,
+				       &curr->sem_pending, pending);
+		spin_unlock(&curr->lock);
+		last = sop->sem_num;
+	}
+
 	sma->sem_otime = get_seconds();
 	return 0;
 
@@ -389,15 +491,32 @@ out_of_range:
 	goto undo;
 
 would_block:
-	if (sop->sem_flg & IPC_NOWAIT)
+	if (sop->sem_flg & IPC_NOWAIT) {
 		result = -EAGAIN;
-	else
+		if (*blocker) {
+			/*
+			 * the blocker doesn't put itself on any
+			 * list for -EAGAIN, unlock it here
+			 */
+			spin_unlock(&(*blocker)->lock);
+			*blocker = NULL;
+		}
+	} else
 		result = 1;
 
 undo:
 	sop--;
 	while (sop >= sops) {
-		sma->sem_base[sop->sem_num].semval -= sop->sem_op;
+		curr = sma->sem_base + sop->sem_num;
+
+		curr->semval -= sop->sem_op;
+		/* we leave the blocker locked, and we make sure not
+		 * to unlock duplicates in the list twice
+		 */
+		if (curr != *blocker &&
+		    (sop == sops || (sop - 1)->sem_num != sop->sem_num)) {
+			spin_unlock(&curr->lock);
+		}
 		sop--;
 	}
 
@@ -425,88 +544,60 @@ static void wake_up_sem_queue(struct sem_queue *q, int error)
 	preempt_enable();
 }
 
-static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
-{
-	list_del(&q->list);
-	if (q->nsops == 1)
-		list_del(&q->simple_list);
-	else
-		sma->complex_count--;
-}
-
-
 /**
  * update_queue(sma, semnum): Look for tasks that can be completed.
  * @sma: semaphore array.
- * @semnum: semaphore that was modified.
+ * @pending_list: list of struct sem_queues to try
  *
  * update_queue must be called after a semaphore in a semaphore array
- * was modified. If multiple semaphore were modified, then @semnum
- * must be set to -1.
+ * was modified.
  */
-static void update_queue(struct sem_array *sma, int semnum)
+static void update_queue(struct sem_array *sma, struct list_head *pending_list)
 {
 	struct sem_queue *q;
-	struct list_head *walk;
-	struct list_head *pending_list;
-	int offset;
+	LIST_HEAD(new_pending);
+	LIST_HEAD(work_list);
 
-	/* if there are complex operations around, then knowing the semaphore
-	 * that was modified doesn't help us. Assume that multiple semaphores
-	 * were modified.
+	/*
+	 * this seems strange, but what we want to do is process everything
+	 * on the pending list, and then process any queues that have a chance
+	 * to finish because of processing the pending list.
+	 *
+	 * So, we send new_pending to try_atomic_semop each time, and it
+	 * splices any additional queues we have to try into new_pending.
+	 * When the work list is empty, we splice new_pending into the
+	 * work list and loop again.
+	 *
+	 * At the end of the whole thing, after we've built the largest
+	 * possible list of tasks to wake up, we wake them in bulk.
 	 */
-	if (sma->complex_count)
-		semnum = -1;
-
-	if (semnum == -1) {
-		pending_list = &sma->sem_pending;
-		offset = offsetof(struct sem_queue, list);
-	} else {
-		pending_list = &sma->sem_base[semnum].sem_pending;
-		offset = offsetof(struct sem_queue, simple_list);
-	}
-
+	list_splice_init(pending_list, &work_list);
 again:
-	walk = pending_list->next;
-	while (walk != pending_list) {
-		int error, alter;
-
-		q = (struct sem_queue *)((char *)walk - offset);
-		walk = walk->next;
-
-		/* If we are scanning the single sop, per-semaphore list of
-		 * one semaphore and that semaphore is 0, then it is not
-		 * necessary to scan the "alter" entries: simple increments
-		 * that affect only one entry succeed immediately and cannot
-		 * be in the  per semaphore pending queue, and decrements
-		 * cannot be successful if the value is already 0.
-		 */
-		if (semnum != -1 && sma->sem_base[semnum].semval == 0 &&
-				q->alter)
-			break;
+	while (!list_empty(&work_list)) {
+		struct sem *blocker;
+		int error;
 
+		q = list_entry(work_list.next, struct sem_queue, list);
+		list_del_init(&q->list);
+
+		blocker = NULL;
 		error = try_atomic_semop(sma, q->sops, q->nsops,
-					 q->undo, q->pid);
+					 q->undo, q->pid, &new_pending,
+					 &blocker);
 
 		/* Does q->sleeper still need to sleep? */
-		if (error > 0)
+		if (error > 0) {
+			list_add_tail(&q->list, &blocker->sem_pending);
+			spin_unlock(&blocker->lock);
 			continue;
+		}
+		wake_up_sem_queue(q, error);
 
-		unlink_queue(sma, q);
+	}
 
-		/*
-		 * The next operation that must be checked depends on the type
-		 * of the completed operation:
-		 * - if the operation modified the array, then restart from the
-		 *   head of the queue and check for threads that might be
-		 *   waiting for the new semaphore values.
-		 * - if the operation didn't modify the array, then just
-		 *   continue.
-		 */
-		alter = q->alter;
-		wake_up_sem_queue(q, error);
-		if (alter && !error)
-			goto again;
+	if (!list_empty(&new_pending)) {
+		list_splice_init(&new_pending, &work_list);
+		goto again;
 	}
 }
 
@@ -523,9 +614,11 @@ static int count_semncnt (struct sem_array * sma, ushort semnum)
 {
 	int semncnt;
 	struct sem_queue * q;
+	struct sem *curr;
 
+	curr = &sma->sem_base[semnum];
 	semncnt = 0;
-	list_for_each_entry(q, &sma->sem_pending, list) {
+	list_for_each_entry(q, &curr->sem_pending, list) {
 		struct sembuf * sops = q->sops;
 		int nsops = q->nsops;
 		int i;
@@ -542,9 +635,12 @@ static int count_semzcnt (struct sem_array * sma, ushort semnum)
 {
 	int semzcnt;
 	struct sem_queue * q;
+	struct sem *curr;
+
+	curr = &sma->sem_base[semnum];
 
 	semzcnt = 0;
-	list_for_each_entry(q, &sma->sem_pending, list) {
+	list_for_each_entry(q, &curr->sem_pending, list) {
 		struct sembuf * sops = q->sops;
 		int nsops = q->nsops;
 		int i;
@@ -572,6 +668,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	struct sem_undo *un, *tu;
 	struct sem_queue *q, *tq;
 	struct sem_array *sma = container_of(ipcp, struct sem_array, sem_perm);
+	int i;
 
 	/* Free the existing undo structures for this semaphore set.  */
 	assert_spin_locked(&sma->sem_perm.lock);
@@ -584,10 +681,15 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 		call_rcu(&un->rcu, free_un);
 	}
 
-	/* Wake up all pending processes and let them fail with EIDRM. */
-	list_for_each_entry_safe(q, tq, &sma->sem_pending, list) {
-		unlink_queue(sma, q);
-		wake_up_sem_queue(q, -EIDRM);
+
+	for (i = 0; i < sma->sem_nsems; i++) {
+		struct sem *curr = sma->sem_base + i;
+		spin_lock(&curr->lock);
+		list_for_each_entry_safe(q, tq, &curr->sem_pending, list) {
+			list_del_init(&q->list);
+			wake_up_sem_queue(q, -EIDRM);
+		}
+		spin_unlock(&curr->lock);
 	}
 
 	/* Remove the semaphore set from the IDR */
@@ -766,6 +868,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 	{
 		int i;
 		struct sem_undo *un;
+		LIST_HEAD(pending);
 
 		sem_getref_and_unlock(sma);
 
@@ -797,8 +900,15 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 			goto out_free;
 		}
 
-		for (i = 0; i < nsems; i++)
-			sma->sem_base[i].semval = sem_io[i];
+		for (i = 0; i < nsems; i++) {
+			curr = &sma->sem_base[i];
+
+			spin_lock(&curr->lock);
+			curr->semval = sem_io[i];
+			copy_sem_queue(curr->semval, i,
+				       &curr->sem_pending, &pending);
+			spin_unlock(&curr->lock);
+		}
 
 		assert_spin_locked(&sma->sem_perm.lock);
 		list_for_each_entry(un, &sma->list_id, list_id) {
@@ -807,7 +917,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 		}
 		sma->sem_ctime = get_seconds();
 		/* maybe some queued-up processes were waiting for this */
-		update_queue(sma, -1);
+		update_queue(sma, &pending);
 		err = 0;
 		goto out_unlock;
 	}
@@ -836,6 +946,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 	{
 		int val = arg.val;
 		struct sem_undo *un;
+		LIST_HEAD(pending);
 
 		err = -ERANGE;
 		if (val > SEMVMX || val < 0)
@@ -845,11 +956,16 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 		list_for_each_entry(un, &sma->list_id, list_id)
 			un->semadj[semnum] = 0;
 
+		spin_lock(&curr->lock);
 		curr->semval = val;
+		copy_sem_queue(curr->semval, semnum,
+				       &curr->sem_pending, &pending);
 		curr->sempid = task_tgid_vnr(current);
+		spin_unlock(&curr->lock);
+
 		sma->sem_ctime = get_seconds();
 		/* maybe some queued-up processes were waiting for this */
-		update_queue(sma, semnum);
+		update_queue(sma, &pending);
 		err = 0;
 		goto out_unlock;
 	}
@@ -1117,6 +1233,67 @@ out:
 	return un;
 }
 
+/*
+ * since we take spinlocks on the semaphores based on the
+ * values from userland, we have to sort them to make sure
+ * we lock them in order
+ */
+static int sembuf_compare(const void *a, const void *b)
+{
+	const struct sembuf *abuf = a;
+	const struct sembuf *bbuf = b;
+
+	if (abuf->sem_num < bbuf->sem_num)
+		return -1;
+	if (abuf->sem_num > bbuf->sem_num)
+		return 1;
+	return 0;
+}
+
+/*
+ * if a process wakes up on its own while on a semaphore list
+ * we have to take it off the list before that process can exit.
+ *
+ * We check all the semaphores the sem_queue was trying to modify
+ * and if we find the sem_queue, we remove it and return.
+ *
+ * If we don't find the sem_queue, it's because someone is about to
+ * wake us up, and they have removed us from the list.
+ * We schedule and try again in hopes that they do it real soon now.
+ *
+ * We check queue->status to detect if someone did actually manage to
+ * wake us up.
+ */
+static int remove_queue_from_lists(struct sem_array *sma,
+				   struct sem_queue *queue)
+{
+	struct sembuf *sops = queue->sops;
+	struct sembuf *sop;
+	struct sem * curr;
+	struct sem_queue *test;
+
+again:
+	for (sop = sops; sop < sops + queue->nsops; sop++) {
+		curr = sma->sem_base + sop->sem_num;
+		spin_lock(&curr->lock);
+		list_for_each_entry(test, &curr->sem_pending, list) {
+			if (test == queue) {
+				list_del(&test->list);
+				spin_unlock(&curr->lock);
+				goto found;
+			}
+		}
+		spin_unlock(&curr->lock);
+	}
+	if (queue->status == -EINTR) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+		goto again;
+	}
+found:
+	return 0;
+}
+
 SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		unsigned, nsops, const struct timespec __user *, timeout)
 {
@@ -1129,6 +1306,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	struct sem_queue queue;
 	unsigned long jiffies_left = 0;
 	struct ipc_namespace *ns;
+	struct sem *blocker = NULL;
+	LIST_HEAD(pending);
 
 	ns = current->nsproxy->ipc_ns;
 
@@ -1168,6 +1347,14 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 			alter = 1;
 	}
 
+	/*
+	 * try_atomic_semop takes all the locks of all the semaphores in
+	 * the sops array.  We have to make sure we don't deadlock if userland
+	 * happens to send them out of order, so we sort them by semnum.
+	 */
+	if (nsops > 1)
+		sort(sops, nsops, sizeof(*sops), sembuf_compare, NULL);
+
 	if (undos) {
 		un = find_alloc_undo(ns, semid);
 		if (IS_ERR(un)) {
@@ -1222,45 +1409,52 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	if (error)
 		goto out_unlock_free;
 
-	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
+	/*
+	 * undos are scary, keep the lock if we have to deal with undos.
+	 * Otherwise, drop the big fat ipc lock and use the fine grained
+	 * per-semaphore locks instead.
+	 */
+	if (!un)
+		sem_getref_and_unlock(sma);
+
+	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current),
+				  &pending, &blocker);
 	if (error <= 0) {
 		if (alter && error == 0)
-			update_queue(sma, (nsops == 1) ? sops[0].sem_num : -1);
-
-		goto out_unlock_free;
+			update_queue(sma, &pending);
+		if (un)
+			goto out_unlock_free;
+		else
+			goto out_putref;
 	}
 
 	/* We need to sleep on this operation, so we put the current
 	 * task into the pending queue and go to sleep.
 	 */
-		
+
 	queue.sops = sops;
 	queue.nsops = nsops;
 	queue.undo = un;
 	queue.pid = task_tgid_vnr(current);
 	queue.alter = alter;
-	if (alter)
-		list_add_tail(&queue.list, &sma->sem_pending);
-	else
-		list_add(&queue.list, &sma->sem_pending);
-
-	if (nsops == 1) {
-		struct sem *curr;
-		curr = &sma->sem_base[sops->sem_num];
-
-		if (alter)
-			list_add_tail(&queue.simple_list, &curr->sem_pending);
-		else
-			list_add(&queue.simple_list, &curr->sem_pending);
-	} else {
-		INIT_LIST_HEAD(&queue.simple_list);
-		sma->complex_count++;
-	}
-
 	queue.status = -EINTR;
 	queue.sleeper = current;
+
 	current->state = TASK_INTERRUPTIBLE;
-	sem_unlock(sma);
+
+	/*
+	 * we could be woken up at any time after we add ourselves to the
+	 * blocker's list and unlock the spinlock.  So, all queue setup
+	 * must be done before this point
+	 */
+	if (alter)
+		list_add_tail(&queue.list, &blocker->sem_pending);
+	else
+		list_add(&queue.list, &blocker->sem_pending);
+	spin_unlock(&blocker->lock);
+
+	if (un)
+		sem_getref_and_unlock(sma);
 
 	if (timeout)
 		jiffies_left = schedule_timeout(jiffies_left);
@@ -1268,40 +1462,76 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		schedule();
 
 	error = queue.status;
+
 	while(unlikely(error == IN_WAKEUP)) {
 		cpu_relax();
 		error = queue.status;
 	}
 
-	if (error != -EINTR) {
+	/*
+	 * we are lock free right here, and we could have timed out or
+	 * gotten a signal, so we need to be really careful with how we
+	 * play with queue.status.  It has three possible states:
+	 *
+	 * -EINTR, which means nobody has changed it since we slept.  This
+	 * means we woke up on our own.
+	 *
+	 * IN_WAKEUP, someone is currently waking us up.  We need to loop
+	 * here until they change it to the operation error value.  If
+	 * we don't loop, our process could exit before they are done waking us
+	 *
+	 * operation error value: we've been properly woken up and can exit
+	 * at any time.
+	 *
+	 * If queue.status is currently -EINTR, we are still being processed
+	 * by the semtimedop core.  Someone either has us on a list head
+	 * or is currently poking our queue struct.  We need to find that
+	 * reference and remove it, which is what remove_queue_from_lists
+	 * does.
+	 *
+	 * We always check for both -EINTR and IN_WAKEUP because we have no
+	 * locks held.  Someone could change us from -EINTR to IN_WAKEUP at
+	 * any time.
+	 */
+	if (error != -EINTR && error != IN_WAKEUP) {
 		/* fast path: update_queue already obtained all requested
 		 * resources */
-		goto out_free;
-	}
-
-	sma = sem_lock(ns, semid);
-	if (IS_ERR(sma)) {
-		error = -EIDRM;
-		goto out_free;
+		goto out_putref;
 	}
 
 	/*
-	 * If queue.status != -EINTR we are woken up by another process
+	 * Someone has a reference on us, let's find it.
 	 */
+	remove_queue_from_lists(sma, &queue);
+
+	/* check the status again in case we were woken up */
 	error = queue.status;
-	if (error != -EINTR) {
-		goto out_unlock_free;
+	while(unlikely(error == IN_WAKEUP)) {
+		cpu_relax();
+		error = queue.status;
 	}
 
 	/*
+	 * at this point we know nobody can possibly wake us up, if error
+	 * isn't -EINTR, the wakeup did happen and our semaphore operation is
+	 * complete.  Otherwise, we return -EAGAIN.
+	 */
+	if (error != -EINTR)
+		goto out_putref;
+
+	/*
 	 * If an interrupt occurred we have to clean up the queue
 	 */
 	if (timeout && jiffies_left == 0)
 		error = -EAGAIN;
-	unlink_queue(sma, &queue);
+
+out_putref:
+	sem_putref(sma);
+	goto out_free;
 
 out_unlock_free:
 	sem_unlock(sma);
+
 out_free:
 	if(sops != fast_sops)
 		kfree(sops);
@@ -1360,11 +1590,14 @@ void exit_sem(struct task_struct *tsk)
 		return;
 
 	for (;;) {
+		struct list_head pending;
 		struct sem_array *sma;
 		struct sem_undo *un;
 		int semid;
 		int i;
 
+		INIT_LIST_HEAD(&pending);
+
 		rcu_read_lock();
 		un = list_entry_rcu(ulp->list_proc.next,
 				    struct sem_undo, list_proc);
@@ -1404,6 +1637,7 @@ void exit_sem(struct task_struct *tsk)
 		for (i = 0; i < sma->sem_nsems; i++) {
 			struct sem * semaphore = &sma->sem_base[i];
 			if (un->semadj[i]) {
+				spin_lock(&semaphore->lock);
 				semaphore->semval += un->semadj[i];
 				/*
 				 * Range checks of the new semaphore value,
@@ -1423,11 +1657,15 @@ void exit_sem(struct task_struct *tsk)
 				if (semaphore->semval > SEMVMX)
 					semaphore->semval = SEMVMX;
 				semaphore->sempid = task_tgid_vnr(current);
+				copy_sem_queue(semaphore->semval, i,
+					       &semaphore->sem_pending,
+					       &pending);
+				spin_unlock(&semaphore->lock);
 			}
 		}
 		sma->sem_otime = get_seconds();
 		/* maybe some queued-up processes were waiting for this */
-		update_queue(sma, -1);
+		update_queue(sma, &pending);
 		sem_unlock(sma);
 
 		call_rcu(&un->rcu, free_un);
-- 
1.7.0.3



* [PATCH 2/2] ipc semaphores: order wakeups based on waiter CPU
  2010-04-12 18:49 [PATCH RFC] Optimize semtimedop Chris Mason
  2010-04-12 18:49 ` [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop Chris Mason
@ 2010-04-12 18:49 ` Chris Mason
  2010-04-17 10:24   ` Manfred Spraul
  1 sibling, 1 reply; 25+ messages in thread
From: Chris Mason @ 2010-04-12 18:49 UTC (permalink / raw)
  To: chris.mason, zach.brown, jens.axboe, linux-kernel, Nick Piggin,
	Manfred Spraul

When IPC semaphores are used in a bulk post and wait system, we
can end up waking a very large number of processes per semtimedop call.
At least one major database will use a single process to kick hundreds
of other processes at a time.

This patch tries to reduce the runqueue lock contention by ordering the
wakeups based on the CPU the waiting process was on when it went to
sleep.

A later patch could add some code in the scheduler to help
wake these up in bulk and take the various runqueue locks less often.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
---
 include/linux/sem.h |    1 +
 ipc/sem.c           |   37 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 8b97b51..4a37319 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -104,6 +104,7 @@ struct sem_array {
 struct sem_queue {
 	struct list_head	list;	 /* queue of pending operations */
 	struct task_struct	*sleeper; /* this process */
+	unsigned long		sleep_cpu;
 	struct sem_undo		*undo;	 /* undo structure */
 	int    			pid;	 /* process id of requesting process */
 	int    			status;	 /* completion status of operation */
diff --git a/ipc/sem.c b/ipc/sem.c
index 335cd35..07fe1d5 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -544,6 +544,25 @@ static void wake_up_sem_queue(struct sem_queue *q, int error)
 	preempt_enable();
 }
 
+/*
+ * sorting helper for struct sem_queues in a list.  This is used to
+ * sort by the CPU they are likely to be on when waking them.
+ */
+int list_comp(void *priv, struct list_head *a, struct list_head *b)
+{
+	struct sem_queue *qa;
+	struct sem_queue *qb;
+
+	qa = list_entry(a, struct sem_queue, list);
+	qb = list_entry(b, struct sem_queue, list);
+
+	if (qa->sleep_cpu < qb->sleep_cpu)
+		return -1;
+	if (qa->sleep_cpu > qb->sleep_cpu)
+		return 1;
+	return 0;
+}
+
 /**
  * update_queue(sma, semnum): Look for tasks that can be completed.
  * @sma: semaphore array.
@@ -557,6 +576,7 @@ static void update_queue(struct sem_array *sma, struct list_head *pending_list)
 	struct sem_queue *q;
 	LIST_HEAD(new_pending);
 	LIST_HEAD(work_list);
+	LIST_HEAD(wake_list);
 
 	/*
 	 * this seems strange, but what we want to do is process everything
@@ -591,7 +611,10 @@ again:
 			spin_unlock(&blocker->lock);
 			continue;
 		}
-		wake_up_sem_queue(q, error);
+		if (error)
+			wake_up_sem_queue(q, error);
+		else
+			list_add_tail(&q->list, &wake_list);
 
 	}
 
@@ -599,6 +622,13 @@ again:
 		list_splice_init(&new_pending, &work_list);
 		goto again;
 	}
+
+	list_sort(NULL, &wake_list, list_comp);
+	while (!list_empty(&wake_list)) {
+		q = list_entry(wake_list.next, struct sem_queue, list);
+		list_del_init(&q->list);
+		wake_up_sem_queue(q, 0);
+	}
 }
 
 /* The following counts are associated to each semaphore:
@@ -1440,6 +1470,11 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	queue.status = -EINTR;
 	queue.sleeper = current;
 
+	/*
+	 * the sleep_cpu number allows sorting by the CPU we expect
+	 * their runqueue entry to be on... hopefully faster for waking up
+	 */
+	queue.sleep_cpu = my_cpu_offset;
 	current->state = TASK_INTERRUPTIBLE;
 
 	/*
-- 
1.7.0.3



* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-12 18:49 ` [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop Chris Mason
@ 2010-04-13 17:15   ` Manfred Spraul
  2010-04-13 17:39     ` Chris Mason
  2010-04-16 11:26   ` Manfred Spraul
  1 sibling, 1 reply; 25+ messages in thread
From: Manfred Spraul @ 2010-04-13 17:15 UTC (permalink / raw)
  To: Chris Mason; +Cc: zach.brown, jens.axboe, linux-kernel, Nick Piggin

Hi Chris,


On 04/12/2010 08:49 PM, Chris Mason wrote:
>   /*
> + * when a semaphore is modified, we want to retry the series of operations
> + * for anyone that was blocking on that semaphore.  This breaks down into
> + * a few different common operations:
> + *
> + * 1) One modification releases one or more waiters for zero.
> + * 2) Many waiters are trying to get a single lock, only one will get it.
> + * 3) Many modifications to the count will succeed.
> + *
>    
Have you thought about odd corner cases:
Nick noticed the last time that it is possible to wait for arbitrary values:
in one semop:
     - decrease semaphore 5 by 10
     - wait until semaphore 5 is 0
     - increase semaphore 5 by 10.
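
Expressed as the sops array itself (a sketch for illustration), all
three entries target the same sem_num and their relative order matters:

	/* a sort by sem_num alone could permute these three entries */
	struct sembuf sops[3] = {
		{ .sem_num = 5, .sem_op = -10, .sem_flg = 0 },	/* decrease by 10 */
		{ .sem_num = 5, .sem_op =   0, .sem_flg = 0 },	/* wait for zero */
		{ .sem_num = 5, .sem_op =  10, .sem_flg = 0 },	/* increase by 10 */
	};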

>   SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
>   		unsigned, nsops, const struct timespec __user *, timeout)
>   {
> @@ -1129,6 +1306,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
>   	struct sem_queue queue;
>   	unsigned long jiffies_left = 0;
>   	struct ipc_namespace *ns;
> +	struct sem *blocker = NULL;
> +	LIST_HEAD(pending);
>
>   	ns = current->nsproxy->ipc_ns;
>
> @@ -1168,6 +1347,14 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
>   			alter = 1;
>   	}
>
> +	/*
> +	 * try_atomic_semop takes all the locks of all the semaphores in
> +	 * the sops array.  We have to make sure we don't deadlock if userland
> +	 * happens to send them out of order, so we sort them by semnum.
> +	 */
> +	if (nsops>  1)
> +		sort(sops, nsops, sizeof(*sops), sembuf_compare, NULL);
> +
>    
Does sorting preserve the behavior?


* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-13 17:15   ` Manfred Spraul
@ 2010-04-13 17:39     ` Chris Mason
  2010-04-13 18:09       ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Mason @ 2010-04-13 17:39 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: zach.brown, jens.axboe, linux-kernel, Nick Piggin

On Tue, Apr 13, 2010 at 07:15:30PM +0200, Manfred Spraul wrote:
> Hi Chris,
> 
> 
> On 04/12/2010 08:49 PM, Chris Mason wrote:
> >  /*
> >+ * when a semaphore is modified, we want to retry the series of operations
> >+ * for anyone that was blocking on that semaphore.  This breaks down into
> >+ * a few different common operations:
> >+ *
> >+ * 1) One modification releases one or more waiters for zero.
> >+ * 2) Many waiters are trying to get a single lock, only one will get it.
> >+ * 3) Many modifications to the count will succeed.
> >+ *
> Have you thought about odd corner cases:
> Nick noticed the last time that it is possible to wait for arbitrary values:
> in one semop:
>     - decrease semaphore 5 by 10
>     - wait until semaphore 5 is 0
>     - increase semaphore 5 by 10.

Do you mean within a single sop array doing all three of these?  I don't
know if the sort is going to leave the three operations on semaphore 5
in the same order (it probably won't).

But I could change that by having it include the slot in the original
sop array in the sorting.  That way if we have duplicate semnums in the
array, they will end up in the same position relative to each other in
the sorted result.
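
Something like this, perhaps (a hypothetical sketch, not in the posted
patches): carry the original slot next to each sembuf so the comparator
can break ties, keeping duplicate semnums in their submitted order.

struct sorted_sop {
	struct sembuf sop;
	int orig_slot;	/* index in the array userland passed in */
};

static int sorted_sop_compare(const void *a, const void *b)
{
	const struct sorted_sop *sa = a;
	const struct sorted_sop *sb = b;

	if (sa->sop.sem_num != sb->sop.sem_num)
		return sa->sop.sem_num < sb->sop.sem_num ? -1 : 1;
	/* duplicate sem_nums keep their original relative order */
	return sa->orig_slot - sb->orig_slot;
}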

(ewwww ;)

-chris


* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-13 17:39     ` Chris Mason
@ 2010-04-13 18:09       ` Nick Piggin
  2010-04-13 18:19         ` Chris Mason
  2010-04-13 18:24         ` Zach Brown
  0 siblings, 2 replies; 25+ messages in thread
From: Nick Piggin @ 2010-04-13 18:09 UTC (permalink / raw)
  To: Chris Mason, Manfred Spraul, zach.brown, jens.axboe, linux-kernel

On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> On Tue, Apr 13, 2010 at 07:15:30PM +0200, Manfred Spraul wrote:
> > Hi Chris,
> > 
> > 
> > On 04/12/2010 08:49 PM, Chris Mason wrote:
> > >  /*
> > >+ * when a semaphore is modified, we want to retry the series of operations
> > >+ * for anyone that was blocking on that semaphore.  This breaks down into
> > >+ * a few different common operations:
> > >+ *
> > >+ * 1) One modification releases one or more waiters for zero.
> > >+ * 2) Many waiters are trying to get a single lock, only one will get it.
> > >+ * 3) Many modifications to the count will succeed.
> > >+ *
> > Have you thought about odd corner cases:
> > Nick noticed the last time that it is possible to wait for arbitrary values:
> > in one semop:
> >     - decrease semaphore 5 by 10
> >     - wait until semaphore 5 is 0
> >     - increase semaphore 5 by 10.
> 
> Do you mean within a single sop array doing all three of these?  I don't
> know if the sort is going to leave the three operations on semaphore 5
> in the same order (it probably won't).
> 
> But I could change that by having it include the slot in the original
> sop array in the sorting.  That way if we have duplicate semnums in the
> array, they will end up in the same position relative to each other in
> the sorted result.
> 
> (ewwww ;)

I had a bit of a hack at doing per-semaphore stuff when I was looking
at the first optimization, but it was tricky to make it work.

The other thing I don't know if your patch gets right is requeueing
of the operations. When you requeue from one list to another, then you
seem to lose ordering with other pending operations, so that would
seem to break the API as well (can't remember if the API strictly
mandates FIFO, but anyway it can open up starvation cases).

I was looking at doing a sequence number to be able to sort these, but
it ended up getting over complex (and SAP was only using simple ops so
it didn't seem to need much better).

We want to be careful not to change semantics at all. And it gets
tricky quickly :( What about Zach's simpler wakeup API?




* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-13 18:09       ` Nick Piggin
@ 2010-04-13 18:19         ` Chris Mason
  2010-04-13 18:57           ` Nick Piggin
  2010-04-14 16:16           ` Manfred Spraul
  2010-04-13 18:24         ` Zach Brown
  1 sibling, 2 replies; 25+ messages in thread
From: Chris Mason @ 2010-04-13 18:19 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Manfred Spraul, zach.brown, jens.axboe, linux-kernel

On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
> On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> > On Tue, Apr 13, 2010 at 07:15:30PM +0200, Manfred Spraul wrote:
> > > Hi Chris,
> > > 
> > > 
> > > On 04/12/2010 08:49 PM, Chris Mason wrote:
> > > >  /*
> > > >+ * when a semaphore is modified, we want to retry the series of operations
> > > >+ * for anyone that was blocking on that semaphore.  This breaks down into
> > > >+ * a few different common operations:
> > > >+ *
> > > >+ * 1) One modification releases one or more waiters for zero.
> > > >+ * 2) Many waiters are trying to get a single lock, only one will get it.
> > > >+ * 3) Many modifications to the count will succeed.
> > > >+ *
> > > Have you thought about odd corner cases:
> > > Nick noticed the last time that it is possible to wait for arbitrary values:
> > > in one semop:
> > >     - decrease semaphore 5 by 10
> > >     - wait until semaphore 5 is 0
> > >     - increase semaphore 5 by 10.
> > 
> > Do you mean within a single sop array doing all three of these?  I don't
> > know if the sort is going to leave the three operations on semaphore 5
> > in the same order (it probably won't).
> > 
> > But I could change that by having it include the slot in the original
> > sop array in the sorting.  That way if we have duplicate semnums in the
> > array, they will end up in the same position relative to each other in
> > the sorted result.
> > 
> > (ewwww ;)
> 
> I had a bit of a hack at doing per-semaphore stuff when I was looking
> at the first optimization, but it was tricky to make it work.
> 
> > The other thing I don't know if your patch gets right is requeueing
> of the operations. When you requeue from one list to another, then you
> seem to lose ordering with other pending operations, so that would
> seem to break the API as well (can't remember if the API strictly
> mandates FIFO, but anyway it can open up starvation cases).

I don't see anything in the docs about the FIFO order.  I could add an
extra sort on sequence number pretty easily, but is the starvation case
really that bad?

> 
> I was looking at doing a sequence number to be able to sort these, but
> it ended up getting over complex (and SAP was only using simple ops so
> it didn't seem to need much better).
> 
> We want to be careful not to change semantics at all. And it gets
> tricky quickly :( What about Zach's simpler wakeup API?

Yeah, that's why my patches include code to handle userland sending
duplicate semids.  Zach's simpler API is cooking too, but if I can get
this done without insane complexity it helps with more than just the
post/wait oracle workload.

-chris



* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-13 18:09       ` Nick Piggin
  2010-04-13 18:19         ` Chris Mason
@ 2010-04-13 18:24         ` Zach Brown
  1 sibling, 0 replies; 25+ messages in thread
From: Zach Brown @ 2010-04-13 18:24 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Chris Mason, Manfred Spraul, jens.axboe, linux-kernel

> What about Zach's simpler wakeup API?

It's making slow progress in the background as a longer-term experiment.

http://oss.oracle.com/~zab/wake-many/

That URL still has an API description, patches, and little test  
utilities for the simple first draft.

- z


* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-13 18:19         ` Chris Mason
@ 2010-04-13 18:57           ` Nick Piggin
  2010-04-13 19:01             ` Chris Mason
  2010-05-16 16:57             ` Manfred Spraul
  2010-04-14 16:16           ` Manfred Spraul
  1 sibling, 2 replies; 25+ messages in thread
From: Nick Piggin @ 2010-04-13 18:57 UTC (permalink / raw)
  To: Chris Mason, Manfred Spraul, zach.brown, jens.axboe, linux-kernel

On Tue, Apr 13, 2010 at 02:19:37PM -0400, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
> > On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> > > On Tue, Apr 13, 2010 at 07:15:30PM +0200, Manfred Spraul wrote:
> > > > Hi Chris,
> > > > 
> > > > 
> > > > On 04/12/2010 08:49 PM, Chris Mason wrote:
> > > > >  /*
> > > > >+ * when a semaphore is modified, we want to retry the series of operations
> > > > >+ * for anyone that was blocking on that semaphore.  This breaks down into
> > > > >+ * a few different common operations:
> > > > >+ *
> > > > >+ * 1) One modification releases one or more waiters for zero.
> > > > >+ * 2) Many waiters are trying to get a single lock, only one will get it.
> > > > >+ * 3) Many modifications to the count will succeed.
> > > > >+ *
> > > > Have you thought about odd corner cases:
> > > > Nick noticed the last time that it is possible to wait for arbitrary values:
> > > > in one semop:
> > > >     - decrease semaphore 5 by 10
> > > >     - wait until semaphore 5 is 0
> > > >     - increase semaphore 5 by 10.
> > > 
> > > Do you mean within a single sop array doing all three of these?  I don't
> > > know if the sort is going to leave the three operations on semaphore 5
> > > in the same order (it probably won't).
> > > 
> > > But I could change that by having it include the slot in the original
> > > sop array in the sorting.  That way if we have duplicate semnums in the
> > > array, they will end up in the same position relative to each other in
> > > the sorted result.
> > > 
> > > (ewwww ;)
> > 
> > I had a bit of a hack at doing per-semaphore stuff when I was looking
> > at the first optimization, but it was tricky to make it work.
> > 
> > > The other thing I don't know if your patch gets right is requeueing
> > of the operations. When you requeue from one list to another, then you
> > seem to lose ordering with other pending operations, so that would
> > seem to break the API as well (can't remember if the API strictly
> > mandates FIFO, but anyway it can open up starvation cases).
> 
> I don't see anything in the docs about the FIFO order.  I could add an
> extra sort on sequence number pretty easily, but is the starvation case
> really that bad?

Yes, because it's not just a theoretical livelock, it can be basically
a certainty, given the right pattern of semops.

You could have two mostly-independent groups of processes, each taking
and releasing a different sem, which are always contended (eg. if it is
being used for a producer-consumer type situation, or even just mutual
exclusion with high contention).

Then you could have some overall management process for example which
tries to take both sems. It will never get it.
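
As a concrete sketch (hypothetical, for illustration): with semaphores
0 and 1 each permanently contended by their own group of processes,
this two-op call can always find one of the two busy:

	struct sembuf both[2] = {
		{ .sem_num = 0, .sem_op = -1, .sem_flg = 0 },
		{ .sem_num = 1, .sem_op = -1, .sem_flg = 0 },
	};
	/* without FIFO ordering it can requeue behind newly arrived
	 * single-sem waiters every time and never succeed */
	semtimedop(semid, both, 2, NULL);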


> > I was looking at doing a sequence number to be able to sort these, but
> > it ended up getting over complex (and SAP was only using simple ops so
> > it didn't seem to need much better).
> > 
> > We want to be careful not to change semantics at all. And it gets
> > tricky quickly :( What about Zach's simpler wakeup API?
> 
> Yeah, that's why my patches include code to handle userland sending
> duplicate semids.

Duplicate semids? What do you mean?


>  Zach's simpler API is cooking too, but if I can get
> this done without insane complexity it helps with more than just the
> post/wait oracle workload.

I am worried about complexity and slowing other cases, given that Oracle
DB seems willing to adapt to the (better suited) new API. So I'd be
interested to know what it helps outside Oracle.



* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-13 18:57           ` Nick Piggin
@ 2010-04-13 19:01             ` Chris Mason
  2010-04-13 19:25               ` Nick Piggin
  2010-05-16 16:57             ` Manfred Spraul
  1 sibling, 1 reply; 25+ messages in thread
From: Chris Mason @ 2010-04-13 19:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Manfred Spraul, zach.brown, jens.axboe, linux-kernel

On Wed, Apr 14, 2010 at 04:57:56AM +1000, Nick Piggin wrote:
> On Tue, Apr 13, 2010 at 02:19:37PM -0400, Chris Mason wrote:
> > On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
> > > On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> > > > On Tue, Apr 13, 2010 at 07:15:30PM +0200, Manfred Spraul wrote:
> > > > > Hi Chris,
> > > > > 
> > > > > 
> > > > > On 04/12/2010 08:49 PM, Chris Mason wrote:
> > > > > >  /*
> > > > > >+ * when a semaphore is modified, we want to retry the series of operations
> > > > > >+ * for anyone that was blocking on that semaphore.  This breaks down into
> > > > > >+ * a few different common operations:
> > > > > >+ *
> > > > > >+ * 1) One modification releases one or more waiters for zero.
> > > > > >+ * 2) Many waiters are trying to get a single lock, only one will get it.
> > > > > >+ * 3) Many modifications to the count will succeed.
> > > > > >+ *
> > > > > Have you thought about odd corner cases:
> > > > > Nick noticed the last time that it is possible to wait for arbitrary values:
> > > > > in one semop:
> > > > >     - decrease semaphore 5 by 10
> > > > >     - wait until semaphore 5 is 0
> > > > >     - increase semaphore 5 by 10.
> > > > 
> > > > Do you mean within a single sop array doing all three of these?  I don't
> > > > know if the sort is going to leave the three operations on semaphore 5
> > > > in the same order (it probably won't).
> > > > 
> > > > But I could change that by having it include the slot in the original
> > > > sop array in the sorting.  That way if we have duplicate semnums in the
> > > > array, they will end up in the same position relative to each other in
> > > > the sorted result.
> > > > 
> > > > (ewwww ;)
> > > 
> > > I had a bit of a hack at doing per-semaphore stuff when I was looking
> > > at the first optimization, but it was tricky to make it work.
> > > 
> > > The other thing I don't know if your patch gets right is requeueing
> > > of the operations. When you requeue from one list to another, then you
> > > seem to lose ordering with other pending operations, so that would
> > > seem to break the API as well (can't remember if the API strictly
> > > mandates FIFO, but anyway it can open up starvation cases).
> > 
> > I don't see anything in the docs about the FIFO order.  I could add an
> > extra sort on sequence number pretty easily, but is the starvation case
> > really that bad?
> 
> Yes, because it's not just a theoretical livelock, it can be basically
> a certainty, given the right pattern of semops.
> 
> You could have two mostly-independent groups of processes, each taking
> and releasing a different sem, which are always contended (eg. if it is
> being used for a producer-consumer type situation, or even just mutual
> exclusion with high contention).
> 
> Then you could have some overall management process for example which
> tries to take both sems. It will never get it.

Ok, fair enough, I'll add the sequence number.

> 
> 
> > > I was looking at doing a sequence number to be able to sort these, but
> > > it ended up getting over complex (and SAP was only using simple ops so
> > > it didn't seem to need much better).
> > > 
> > > We want to be careful not to change semantics at all. And it gets
> > > tricky quickly :( What about Zach's simpler wakeup API?
> > 
> > Yeah, that's why my patches include code to handle userland sending
> > duplicate semids.
> 
> Duplicate semids? What do you mean?

Sorry, semnums...index into the array of semaphores.

> 
> 
> >  Zach's simpler API is cooking too, but if I can get
> > this done without insane complexity it helps with more than just the
> > post/wait oracle workload.
> 
> > I am worried about complexity and slowing other cases, given that Oracle
> DB seems willing to adapt to the (better suited) new API. So I'd be
> interested to know what it helps outside Oracle.
> 

Sure, I'd hope that your benchmark from last time around is faster now.

-chris



* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-13 19:01             ` Chris Mason
@ 2010-04-13 19:25               ` Nick Piggin
  2010-04-13 19:38                 ` Chris Mason
  0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2010-04-13 19:25 UTC (permalink / raw)
  To: Chris Mason, Manfred Spraul, zach.brown, jens.axboe, linux-kernel

On Tue, Apr 13, 2010 at 03:01:10PM -0400, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 04:57:56AM +1000, Nick Piggin wrote:
> > Yes, because it's not just a theoretical livelock, it can be basically
> > a certainty, given the right pattern of semops.
> > 
> > You could have two mostly-independent groups of processes, each taking
> > and releasing a different sem, which are always contended (eg. if it is
> > being used for a producer-consumer type situation, or even just mutual
> > exclusion with high contention).
> > 
> > Then you could have some overall management process for example which
> > tries to take both sems. It will never get it.
> 
> Ok, fair enough, I'll add the sequence number.
> 
> > 
> > 
> > > > I was looking at doing a sequence number to be able to sort these, but
> > > > it ended up getting over complex (and SAP was only using simple ops so
> > > > it didn't seem to need much better).
> > > > 
> > > > We want to be careful not to change semantics at all. And it gets
> > > > tricky quickly :( What about Zach's simpler wakeup API?
> > > 
> > > Yeah, that's why my patches include code to handle userland sending
> > > duplicate semids.
> > 
> > Duplicate semids? What do you mean?
> 
> Sorry, semnums...index into the array of semaphores.

OK, I wonder just how much it helps, and what.


> > >  Zach's simpler API is cooking too, but if I can get
> > > this done without insane complexity it helps with more than just the
> > > post/wait oracle workload.
> > 
> > > I am worried about complexity and slowing other cases, given that Oracle
> > DB seems willing to adapt to the (better suited) new API. So I'd be
> > interested to know what it helps outside Oracle.
> > 
> 
> Sure, I'd hope that your benchmark from last time around is faster now.

I didn't actually reproduce it here, I think it was a customer or
partner workload. But SAP only seemed to have one contended semnum in
its array, and it was being operated on with "simple" semops (so that's
about as far as the patches went).

I didn't notice anything that should make that go faster?

Yes, with such a workload, using semops is basically legacy and simple
mutexes should work better. So I'm not outright against improving sysv
sem performance for more complex cases where nothing else we have works
as well.



* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-13 19:25               ` Nick Piggin
@ 2010-04-13 19:38                 ` Chris Mason
  2010-04-13 20:05                   ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Mason @ 2010-04-13 19:38 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Manfred Spraul, zach.brown, jens.axboe, linux-kernel

On Wed, Apr 14, 2010 at 05:25:51AM +1000, Nick Piggin wrote:
> On Tue, Apr 13, 2010 at 03:01:10PM -0400, Chris Mason wrote:
> > On Wed, Apr 14, 2010 at 04:57:56AM +1000, Nick Piggin wrote:
> > > Yes, because it's not just a theoretical livelock, it can be basically
> > > a certainty, given the right pattern of semops.
> > > 
> > > You could have two mostly-independent groups of processes, each taking
> > > and releasing a different sem, which are always contended (eg. if it is
> > > being used for a producer-consumer type situation, or even just mutual
> > > exclusion with high contention).
> > > 
> > > Then you could have some overall management process for example which
> > > tries to take both sems. It will never get it.
> > 
> > Ok, fair enough, I'll add the sequence number.
> > 
> > > 
> > > 
> > > > > I was looking at doing a sequence number to be able to sort these, but
> > > > > it ended up getting over complex (and SAP was only using simple ops so
> > > > > it didn't seem to need much better).
> > > > > 
> > > > > We want to be careful not to change semantics at all. And it gets
> > > > > tricky quickly :( What about Zach's simpler wakeup API?
> > > > 
> > > > Yeah, that's why my patches include code to handle userland sending
> > > > duplicate semids.
> > > 
> > > Duplicate semids? What do you mean?
> > 
> > Sorry, semnums...index into the array of semaphores.
> 
> OK, I wonder just how much it helps, and what.

Detecting the dups just keeps me from deadlocking.  I'm locking each
individual semaphore in sequence, so if userland does something strange
and sends two updates to the same semaphore, the code detects that and
only locks the first one.

> 
> 
> > > >  Zach's simpler API is cooking too, but if I can get
> > > > this done without insane complexity it helps with more than just the
> > > > post/wait oracle workload.
> > > 
> > > I am worried about complexity and slowing other cases, given that Oracle
> > > DB seems willing to adapt to the (better suited) new API. So I'd be
> > > interested to know what it helps outside Oracle.
> > > 
> > 
> > Sure, I'd hope that your benchmark from last time around is faster now.
> 
> I didn't actually reproduce it here, I think it was a customer or
> partner workload. But SAP only seemed to have one contended semnum in
> its array, and it was being operated on with "simple" semops (so that's
> about as far as the patches went).
> 
> I didn't notice anything that should make that go faster?

Since I'm avoiding the ipc lock while operating on the array, it'll help
any workload that hits on two or more semaphores in the array at
once.

> 
> Yes, with such a workload, using semops is basically legacy and simple
> mutexes should work better. So I'm not outright against improving sysv
> sem performance for more complex cases where nothing else we have works
> as well.
> 

I'm not in a hurry to overhaul a part of the kernel that has been stable
for a long time.  But it really needs some love I think.  I'll have more
numbers from a tpc run later this week.

-chris


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-13 19:38                 ` Chris Mason
@ 2010-04-13 20:05                   ` Nick Piggin
  0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2010-04-13 20:05 UTC (permalink / raw)
  To: Chris Mason, Manfred Spraul, zach.brown, jens.axboe, linux-kernel

On Tue, Apr 13, 2010 at 03:38:01PM -0400, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 05:25:51AM +1000, Nick Piggin wrote:
> > I didn't notice anything that should make that go faster?
> 
> Since I'm avoiding the ipc lock while operating on the array, it'll help
> any workload that hits on two or more semaphores in the array at
> once.

Yeah, I don't think SAP did that significantly enough to matter.
Possibly some others (aside from Oracle, of course) do, though.


> > Yes, with such a workload, using semops is basically legacy and simple
> > mutexes should work better. So I'm not outright against improving sysv
> > sem performance for more complex cases where nothing else we have works
> > as well.
> > 
> 
> I'm not in a hurry to overhaul a part of the kernel that has been stable
> for a long time.  But it really needs some love I think.  I'll have more
> numbers from a tpc run later this week.

Yep, I'm not against it. "industry standard benchmark" numbers would
be great.

I do think we need to be really careful with semantics though. The
API's been around for long enough that it is going to have been
(ab)used in every way possible :)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-13 18:19         ` Chris Mason
  2010-04-13 18:57           ` Nick Piggin
@ 2010-04-14 16:16           ` Manfred Spraul
  2010-04-14 17:33             ` Chris Mason
  1 sibling, 1 reply; 25+ messages in thread
From: Manfred Spraul @ 2010-04-14 16:16 UTC (permalink / raw)
  To: Chris Mason, Nick Piggin, zach.brown, jens.axboe, linux-kernel

On 04/13/2010 08:19 PM, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
>    
>> On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
>>      
>> The other thing I don't know if your patch gets right is the requeueing
>> of the operations. When you requeue from one list to another, then you
>> seem to lose ordering with other pending operations, so that would
>> seem to break the API as well (can't remember if the API strictly
>> mandates FIFO, but anyway it can open up starvation cases).
>>      
> I don't see anything in the docs about the FIFO order.  I could add an
> extra sort on sequence number pretty easily, but is the starvation case
> really that bad?
>
>    
How do you want to determine the sequence number?
Is atomic_inc_return() on a per-semaphore array counter sufficiently fast?
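
Something like this is what I have in mind (untested sketch; both the
per-array seq_counter and the seq field in sem_queue would be new):

static void tag_sem_queue(struct sem_array *sma, struct sem_queue *q)
{
	/* assumed new fields: atomic_t seq_counter in sem_array,
	 * unsigned long seq in sem_queue.  Each sleeper gets a
	 * ticket when queued; wakeups can then be sorted by ticket
	 * to keep the FIFO semantics. */
	q->seq = atomic_inc_return(&sma->seq_counter);
}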

>> I was looking at doing a sequence number to be able to sort these, but
>> it ended up getting over complex (and SAP was only using simple ops so
>> it didn't seem to need much better).
>>
>> We want to be careful not to change semantics at all. And it gets
>> tricky quickly :( What about Zach's simpler wakeup API?
>>      
> Yeah, that's why my patches include code to handle userland sending
> duplicate semids.  Zach's simpler API is cooking too, but if I can get
> this done without insane complexity it helps with more than just the
> post/wait oracle workload.
>
>    
What is the Oracle workload? Which multi-sembuf operations does it use?
How many semaphores are in one array?

When the last optimizations were written, I've searched a bit:
- postgres uses per-process semaphores, with small semaphore arrays.
     [process sleeps on its own semaphore and is woken up by someone
else when it can make progress]
- with Google, I couldn't find anything relevant that uses multi-sembuf
semop() calls.

And I agree with Nick: We should be careful about changing the API.

--
     Manfred

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-14 16:16           ` Manfred Spraul
@ 2010-04-14 17:33             ` Chris Mason
  2010-04-14 19:11               ` Manfred Spraul
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Mason @ 2010-04-14 17:33 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Nick Piggin, zach.brown, jens.axboe, linux-kernel

On Wed, Apr 14, 2010 at 06:16:53PM +0200, Manfred Spraul wrote:
> On 04/13/2010 08:19 PM, Chris Mason wrote:
> >On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
> >>On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> >>The other thing I don't know if your patch gets right is the requeueing
> >>of the operations. When you requeue from one list to another, then you
> >>seem to lose ordering with other pending operations, so that would
> >>seem to break the API as well (can't remember if the API strictly
> >>mandates FIFO, but anyway it can open up starvation cases).
> >I don't see anything in the docs about the FIFO order.  I could add an
> >extra sort on sequence number pretty easily, but is the starvation case
> >really that bad?
> >
> How do you want to determine the sequence number?
> Is atomic_inc_return() on a per-semaphore array counter sufficiently fast?

I haven't tried yet, but hopefully it won't be a problem.  A later patch
does atomics on the reference count and it doesn't show up in the
profiles.

> 
> >>I was looking at doing a sequence number to be able to sort these, but
> >>it ended up getting over complex (and SAP was only using simple ops so
> >>it didn't seem to need much better).
> >>
> >>We want to be careful not to change semantics at all. And it gets
> >>tricky quickly :( What about Zach's simpler wakeup API?
> >Yeah, that's why my patches include code to handle userland sending
> >duplicate semids.  Zach's simpler API is cooking too, but if I can get
> >this done without insane complexity it helps with more than just the
> >post/wait oracle workload.
> >
> What is the Oracle workload? Which multi-sembuf operations does it use?
> How many semaphores are in one array?
> 
> When the last optimizations were written, I've searched a bit:
> - postgres uses per-process semaphores, with small semaphore arrays.
>     [process sleeps on its own semaphore and is woken up by someone
> else when it can make progress]

This is similar to Oracle (and the sembench program).  Each process has
a semaphore and when it is waiting for a commit it goes to sleep on it.
They are woken up in bulk with semtimedop calls from a single process.
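
In userspace terms the pattern is roughly this (a simplified sketch, not
the actual database code; error handling left out):

#define _GNU_SOURCE
#include <sys/ipc.h>
#include <sys/sem.h>

/* waiter: each process sleeps on its own semaphore */
static void wait_for_commit(int semid, unsigned short my_sem)
{
	struct sembuf sb = { my_sem, -1, 0 };

	semtimedop(semid, &sb, 1, NULL);
}

/* poster: one process wakes a whole batch with a single syscall */
static void post_commits(int semid, unsigned short *sems, int nr)
{
	struct sembuf sb[64];
	int i;

	for (i = 0; i < nr && i < 64; i++) {
		sb[i].sem_num = sems[i];
		sb[i].sem_op = 1;
		sb[i].sem_flg = IPC_NOWAIT;
	}
	if (i)
		semtimedop(semid, sb, i, NULL);
}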

But Oracle also uses semaphores for locking in a traditional sense.

Putting the waiters into a per-semaphore list is really only part of the
speedup.  The real boost comes from the patch to break up the locks into
a per semaphore lock.

We gain another 10-15% from a later patch that gets uses atomics on the
refcount, which lets us do sem_putref without a lock (meaning we're
lockless once we get woken up).

I'm cleaning up fixes based on suggestions here and will repost.

> - with Google, I couldn't find anything relevant that uses
> multi-sembuf semop() calls.
> 

I think this should help any workload that has more than one semaphore
per array, even if they only do one sem per call.

> And I agree with Nick: We should be careful about changing the API.

Definitely, thanks for reading through it.

-chris


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-14 17:33             ` Chris Mason
@ 2010-04-14 19:11               ` Manfred Spraul
  2010-04-14 19:50                 ` Chris Mason
  0 siblings, 1 reply; 25+ messages in thread
From: Manfred Spraul @ 2010-04-14 19:11 UTC (permalink / raw)
  To: Chris Mason, Nick Piggin, zach.brown, jens.axboe, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3388 bytes --]

On 04/14/2010 07:33 PM, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 06:16:53PM +0200, Manfred Spraul wrote:
>    
>> On 04/13/2010 08:19 PM, Chris Mason wrote:
>>      
>>> On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
>>>        
>>>> On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
>>>> The other thing I don't know if your patch gets right is the requeueing
>>>> of the operations. When you requeue from one list to another, then you
>>>> seem to lose ordering with other pending operations, so that would
>>>> seem to break the API as well (can't remember if the API strictly
>>>> mandates FIFO, but anyway it can open up starvation cases).
>>>>          
>>> I don't see anything in the docs about the FIFO order.  I could add an
>>> extra sort on sequence number pretty easily, but is the starvation case
>>> really that bad?
>>>
>>>        
>> How do you want to determine the sequence number?
>> Is atomic_inc_return() on a per-semaphore array counter sufficiently fast?
>>      
> I haven't tried yet, but hopefully it won't be a problem.  A later patch
> does atomics on the reference count and it doesn't show up in the
> profiles.
>
>    
>>      
>>>> I was looking at doing a sequence number to be able to sort these, but
>>>> it ended up getting over complex (and SAP was only using simple ops so
>>>> it didn't seem to need much better).
>>>>
>>>> We want to be careful not to change semantics at all. And it gets
>>>> tricky quickly :( What about Zach's simpler wakeup API?
>>>>          
>>> Yeah, that's why my patches include code to handle userland sending
>>> duplicate semids.  Zach's simpler API is cooking too, but if I can get
>>> this done without insane complexity it helps with more than just the
>>> post/wait oracle workload.
>>>
>>>        
>> What is the Oracle workload? Which multi-sembuf operations does it use?
>> How many semaphores are in one array?
>>
>> When the last optimizations were written, I've searched a bit:
>> - postgres uses per-process semaphores, with small semaphore arrays.
>>      [process sleeps on its own semaphore and is woken up by someone
>> else when it can make progress]
>>      
> This is similar to Oracle (and the sembench program).  Each process has
> a semaphore and when it is waiting for a commit it goes to sleep on it.
> They are woken up in bulk with semtimedop calls from a single process.
>
>    
Hmm. Thus you have:
- single sembuf decrease operations that are waiting frequently.
- multi-sembuf  increase operations.

What about optimizing for that case?
Increase operations succeed immediately. Thus complex_count is 0.

If we have performed an update operation, then we can scan all 
simple_lists that have seen an increase instead of checking the global 
list - as long as there are no complex operations waiting.
Right now, we give up if the update operation was a complex operation - 
but that does not matter.
All that matters are the sleeping operations, not the operation that did 
the wakeup.
I've attached an untested idea.

> But Oracle also uses semaphores for locking in a traditional sense.
>
> Putting the waiters into a per-semaphore list is really only part of the
> speedup.  The real boost comes from the patch to break up the locks into
> a per semaphore lock.
>
>    
Ok. Then simple tricks won't help.
How many semaphores are in one array?

--
     Manfred

[-- Attachment #2: patch-ipc-optimize_bulkwakeup --]
[-- Type: text/plain, Size: 646 bytes --]

diff --git a/ipc/sem.c b/ipc/sem.c
index dbef95b..8986239 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1224,8 +1224,18 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 
 	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
 	if (error <= 0) {
-		if (alter && error == 0)
-			update_queue(sma, (nsops == 1) ? sops[0].sem_num : -1);
+		if (alter && error == 0) {
+			if (sma->complex_count) {
+				update_queue(sma, -1);
+			} else {
+				int i;
+				
+				for (i=0;i<nsops;i++) {
+					if (sops[i].sem_op > 0)
+						update_queue(sma, sops[i].sem_num);
+				}
+			}
+		}
 
 		goto out_unlock_free;
 	}

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-14 19:11               ` Manfred Spraul
@ 2010-04-14 19:50                 ` Chris Mason
  2010-04-15 16:33                   ` Manfred Spraul
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Mason @ 2010-04-14 19:50 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Nick Piggin, zach.brown, jens.axboe, linux-kernel

On Wed, Apr 14, 2010 at 09:11:44PM +0200, Manfred Spraul wrote:
> On 04/14/2010 07:33 PM, Chris Mason wrote:
> >On Wed, Apr 14, 2010 at 06:16:53PM +0200, Manfred Spraul wrote:
> >>On 04/13/2010 08:19 PM, Chris Mason wrote:
> >>>On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
> >>>>On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> >>>>The other thing I don't know if your patch gets right is the requeueing
> >>>>of the operations. When you requeue from one list to another, then you
> >>>>seem to lose ordering with other pending operations, so that would
> >>>>seem to break the API as well (can't remember if the API strictly
> >>>>mandates FIFO, but anyway it can open up starvation cases).
> >>>I don't see anything in the docs about the FIFO order.  I could add an
> >>>extra sort on sequence number pretty easily, but is the starvation case
> >>>really that bad?
> >>>
> >>How do you want to determine the sequence number?
> >>Is atomic_inc_return() on a per-semaphore array counter sufficiently fast?
> >I haven't tried yet, but hopefully it won't be a problem.  A later patch
> >does atomics on the reference count and it doesn't show up in the
> >profiles.
> >
> >>>>I was looking at doing a sequence number to be able to sort these, but
> >>>>it ended up getting over complex (and SAP was only using simple ops so
> >>>>it didn't seem to need much better).
> >>>>
> >>>>We want to be careful not to change semantics at all. And it gets
> >>>>tricky quickly :( What about Zach's simpler wakeup API?
> >>>Yeah, that's why my patches include code to handle userland sending
> >>>duplicate semids.  Zach's simpler API is cooking too, but if I can get
> >>>this done without insane complexity it helps with more than just the
> >>>post/wait oracle workload.
> >>>
> >>What is the Oracle workload? Which multi-sembuf operations does it use?
> >>How many semaphores are in one array?
> >>
> >>When the last optimizations were written, I've searched a bit:
> >>- postgres uses per-process semaphores, with small semaphore arrays.
> >>     [process sleeps on its own semaphore and is woken up by someone
> >>else when it can make progress]
> >This is similar to Oracle (and the sembench program).  Each process has
> >a semaphore and when it is waiting for a commit it goes to sleep on it.
> >They are woken up in bulk with semtimedop calls from a single process.
> >
> Hmm. Thus you have:
> - single sembuf decrease operations that are waiting frequently.
> - multi-sembuf  increase operations.
> 
> What about optimizing for that case?
> Increase operations succeed immediately. Thus complex_count is 0.

I've been wondering about that.  I can optimize the patch to
special-case the increase operations.  The only problem I saw was
checking for the range overflow.  Current behavior will abort the whole
set if the range overflow happens.
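
The check I mean is essentially the one try_atomic_semop does today,
paraphrased here (a sketch, not a verbatim copy of the kernel code):

/* paraphrase of the range check in try_atomic_semop(): an increase
 * that would push semval past SEMVMX fails the whole set with
 * -ERANGE, so increases can't simply be assumed to always succeed */
static int sem_op_result(struct sem *curr, struct sembuf *sop)
{
	int result = curr->semval + sop->sem_op;

	if (result > SEMVMX)
		return -ERANGE;	/* aborts the whole set */
	if (result < 0)
		return 1;	/* would block */
	return 0;		/* can proceed */
}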

> 
> If we have performed an update operation, then we can scan all
> simple_lists that have seen an increase instead of checking the
> global list - as long as there are no complex operations waiting.
> Right now, we give up if the update operation was a complex
> operation - but that does not matter.
> All that matters are the sleeping operations, not the operation that
> did the wakeup.
> I've attached an untested idea.

Zach Brown's original patch set tried just the list magic and not
the spinlocks.  I'm afraid it didn't help very much over all.

> 
> >But Oracle also uses semaphores for locking in a traditional sense.
> >
> >Putting the waiters into a per-semaphore list is really only part of the
> >speedup.  The real boost comes from the patch to break up the locks into
> >a per semaphore lock.
> >
> Ok. Then simple tricks won't help.
> How many semaphores are in one array?

On a big system I saw about 4000 semaphores total.  The database will
just allocate as many as it can into a single array and keep creating
arrays until it has all it needs.

-chris


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-14 19:50                 ` Chris Mason
@ 2010-04-15 16:33                   ` Manfred Spraul
  2010-04-15 16:34                     ` Chris Mason
  0 siblings, 1 reply; 25+ messages in thread
From: Manfred Spraul @ 2010-04-15 16:33 UTC (permalink / raw)
  To: Chris Mason, Nick Piggin, zach.brown, jens.axboe, linux-kernel

On 04/14/2010 09:50 PM, Chris Mason wrote:
> On a big system I saw about 4000 semaphores total.  The database will
> just allocate as many as it can into a single array and keep creating
> arrays until it has all it needs.
>
>    
What happens if SEMMSL is reduced (first entry in /proc/sys/kernel/sem)?

--
     Manfred

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-15 16:33                   ` Manfred Spraul
@ 2010-04-15 16:34                     ` Chris Mason
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Mason @ 2010-04-15 16:34 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Nick Piggin, zach.brown, jens.axboe, linux-kernel

On Thu, Apr 15, 2010 at 06:33:13PM +0200, Manfred Spraul wrote:
> On 04/14/2010 09:50 PM, Chris Mason wrote:
> >On a big system I saw about 4000 semaphores total.  The database will
> >just allocate as many as it can into a single array and keep creating
> >arrays until it has all it needs.
> >
> What happens if SEMMSL is reduced (first entry in /proc/sys/kernel/sem)?

Performance improves slightly but the ipc lock is still at the top of the
profile ;)

-chris


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-12 18:49 ` [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop Chris Mason
  2010-04-13 17:15   ` Manfred Spraul
@ 2010-04-16 11:26   ` Manfred Spraul
  2010-04-16 11:45     ` Chris Mason
  1 sibling, 1 reply; 25+ messages in thread
From: Manfred Spraul @ 2010-04-16 11:26 UTC (permalink / raw)
  To: Chris Mason; +Cc: zach.brown, jens.axboe, linux-kernel, Nick Piggin

[-- Attachment #1: Type: text/plain, Size: 2764 bytes --]

On 04/12/2010 08:49 PM, Chris Mason wrote:
> I have a microbenchmark to test how quickly we can post and wait in
> bulk.  With this change, semtimedop is able to do more than twice
> as much work in the same run.  On a large numa machine, it brings
> the IPC lock system time (reported by perf) down from 85% to 15%.
>
>    
Looking at the current code:
- update_queue() can be O(N^2) if only some of the waiting tasks are 
woken up.
Actually: all non-woken up tasks are rescanned after a task that can be 
woken up is found.

- Your test app tests the best case for the current code:
You wake up the tasks in the same order as they called semop().
If you invert the order (i.e.: worklist_add() adds to head instead of
tail), I would expect even worse performance from the current code.

The O(N^2) is simple to fix, I've attached a patch.
For your micro-benchmark, the patch does not change much: you wake up
in order, thus the current code does not misbehave.

Do you know how Oracle wakes up the tasks?
FIFO, LIFO, un-ordered?

> 	while(unlikely(error == IN_WAKEUP)) {
>   		cpu_relax();
>   		error = queue.status;
>   	}
>
> -	if (error != -EINTR) {
> +	/*
> +	 * we are lock free right here, and we could have timed out or
> +	 * gotten a signal, so we need to be really careful with how we
> +	 * play with queue.status.  It has three possible states:
> +	 *
> +	 * -EINTR, which means nobody has changed it since we slept.  This
> +	 * means we woke up on our own.
> +	 *
> +	 * IN_WAKEUP, someone is currently waking us up.  We need to loop
> +	 * here until they change it to the operation error value.  If
> +	 * we don't loop, our process could exit before they are done waking us
> +	 *
> +	 * operation error value: we've been properly woken up and can exit
> +	 * at any time.
> +	 *
> +	 * If queue.status is currently -EINTR, we are still being processed
> +	 * by the semtimedop core.  Someone either has us on a list head
> +	 * or is currently poking our queue struct.  We need to find that
> +	 * reference and remove it, which is what remove_queue_from_lists
> +	 * does.
> +	 *
> +	 * We always check for both -EINTR and IN_WAKEUP because we have no
> +	 * locks held.  Someone could change us from -EINTR to IN_WAKEUP at
> +	 * any time.
> +	 */
> +	if (error != -EINTR&&  error != IN_WAKEUP) {
>   		/* fast path: update_queue already obtained all requested
>   		 * resources */
No: The code accesses a local variable. The loop above the comment 
guarantees that the error can't be IN_WAKEUP.

> +
> +out_putref:
> +	sem_putref(sma);
> +	goto out_free;
>    
Is it possible to move the sem_putref into wakeup_sem_queue()?
Right now, the exit path of semtimedop doesn't touch the spinlock.
You remove that optimization.

--
     Manfred

[-- Attachment #2: patch-ipc-optimize_bulkwakeup --]
[-- Type: text/plain, Size: 4403 bytes --]

diff --git a/ipc/sem.c b/ipc/sem.c
index dbef95b..efadb6d 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -434,6 +434,69 @@ static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
 		sma->complex_count--;
 }
 
+/** check_restart(sma, q)
+ * @sma: semaphore array
+ * @q: the operation that just completed
+ *
+ * update_queue is O(N^2) when it restarts scanning the whole queue of
+ * waiting operations. Therefore this function checks if the restart is
+ * really necessary. It is called after a previously waiting operation
+ * was completed.
+ */
+static int check_restart(struct sem_array *sma, struct sem_queue *q)
+{
+	struct sem * curr;
+	struct sem_queue *h;
+
+	/* if the operation didn't modify the array, then no restart */
+	if (q->alter == 0)
+		return 0;
+
+	/* pending complex operations are too difficult to analyse */
+	if (sma->complex_count)
+		return 1;
+
+	/* we were a sleeping complex operation. Too difficult */
+	if (q->nsops > 1)
+		return 1;
+
+	curr = sma->sem_base + q->sops[0].sem_num;
+
+	/* No-one waits on this queue */
+	if (list_empty(&curr->sem_pending))
+		return 0;
+
+	/* the new semaphore value */
+	if (curr->semval) {
+		/* It is impossible that someone waits for the new value:
+		 * - q is a previously sleeping simple operation that
+		 *   altered the array. It must be a decrement, because
+		 *   simple increments never sleep.
+		 * - The value is not 0, thus wait-for-zero won't proceed.
+		 * - If there are older (higher priority) decrements
+		 *   in the queue, then they have observed the original
+		 *   semval value and couldn't proceed. The operation
+		 *   decremented to value - thus they won't proceed either.
+		 */
+		BUG_ON(q->sops[0].sem_op >= 0);
+		return 0;
+	}
+	/*
+	 * semval is 0. Check if there are wait-for-zero semops.
+	 * They must be the first entries in the per-semaphore simple queue
+	 */
+	h=list_first_entry(&curr->sem_pending, struct sem_queue, simple_list);
+	BUG_ON(h->nsops != 1);
+	BUG_ON(h->sops[0].sem_num != q->sops[0].sem_num);
+	
+	/* Yes, there is a wait-for-zero semop. Restart */	
+	if (h->sops[0].sem_op == 0)
+		return 1;
+
+	/* Again - no-one is waiting for the new value. */
+	return 0;
+}		
+
 
 /**
  * update_queue(sma, semnum): Look for tasks that can be completed.
@@ -469,7 +532,7 @@ static void update_queue(struct sem_array *sma, int semnum)
 again:
 	walk = pending_list->next;
 	while (walk != pending_list) {
-		int error, alter;
+		int error, restart;
 
 		q = (struct sem_queue *)((char *)walk - offset);
 		walk = walk->next;
@@ -494,22 +557,42 @@ again:
 
 		unlink_queue(sma, q);
 
-		/*
-		 * The next operation that must be checked depends on the type
-		 * of the completed operation:
-		 * - if the operation modified the array, then restart from the
-		 *   head of the queue and check for threads that might be
-		 *   waiting for the new semaphore values.
-		 * - if the operation didn't modify the array, then just
-		 *   continue.
-		 */
-		alter = q->alter;
+		if (error)
+			restart = 0;
+		else
+			restart = check_restart(sma, q);
+
 		wake_up_sem_queue(q, error);
-		if (alter && !error)
+		if (restart)
 			goto again;
 	}
 }
 
+/** do_smart_update(sma, sops, nsops): Optimized update_queue
+ * @sma: semaphore array
+ * @sops: operations that were performed
+ * @nsops: number of operations
+ *
+ * do_smart_update() makes the required calls to update_queue, based on the
+ * actual changes that were performed on the semaphore array.
+ */
+void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsops)
+{
+	int i;
+
+	if (sma->complex_count) {
+		update_queue(sma, -1);
+		return;
+	}
+				
+	for (i=0;i<nsops;i++) {
+		if (sops[i].sem_op > 0 ||
+			(sops[i].sem_op < 0 && sma->sem_base[sops[i].sem_num].semval == 0))
+			update_queue(sma, sops[i].sem_num);
+	}
+}
+
+
 /* The following counts are associated to each semaphore:
  *   semncnt        number of tasks waiting on semval being nonzero
  *   semzcnt        number of tasks waiting on semval being zero
@@ -1224,8 +1307,9 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 
 	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
 	if (error <= 0) {
-		if (alter && error == 0)
-			update_queue(sma, (nsops == 1) ? sops[0].sem_num : -1);
+		if (alter && error == 0) {
+			do_smart_update(sma, sops, nsops);
+		}
 
 		goto out_unlock_free;
 	}

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-16 11:26   ` Manfred Spraul
@ 2010-04-16 11:45     ` Chris Mason
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Mason @ 2010-04-16 11:45 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: zach.brown, jens.axboe, linux-kernel, Nick Piggin

On Fri, Apr 16, 2010 at 01:26:15PM +0200, Manfred Spraul wrote:
> On 04/12/2010 08:49 PM, Chris Mason wrote:
> >I have a microbenchmark to test how quickly we can post and wait in
> >bulk.  With this change, semtimedop is able to do more than twice
> >as much work in the same run.  On a large numa machine, it brings
> >the IPC lock system time (reported by perf) down from 85% to 15%.
> >
> Looking at the current code:
> - update_queue() can be O(N^2) if only some of the waiting tasks are
> woken up.
> Actually: all non-woken up tasks are rescanned after a task that can
> be woken up is found.
> 
> - Your test app tests the best case for the current code:
> You wake up the tasks in the same order as the called semop().
> If you invert the order (i.e.: worklist_add() adds to head instead
> of tail), I would expect an even worse performance of the current
> code.
> 
> The O(N^2) is simple to fix, I've attached a patch.

Good point.

> For your micro-benchmark, the patch does not change much: you
> wake-up in-order, thus the current code does not misbehave.
> 
> Do you know how Oracle wakes up the tasks?
> FIFO, LIFO, un-ordered?

Ordering in terms of the sem array?  I had them try many variations ;) I
don't think it will be ordered as well as sembench most of the time.

> 
> >	while(unlikely(error == IN_WAKEUP)) {
> >  		cpu_relax();
> >  		error = queue.status;
> >  	}
> >
> >-	if (error != -EINTR) {
> >+	/*
> >+	 * we are lock free right here, and we could have timed out or
> >+	 * gotten a signal, so we need to be really careful with how we
> >+	 * play with queue.status.  It has three possible states:
> >+	 *
> >+	 * -EINTR, which means nobody has changed it since we slept.  This
> >+	 * means we woke up on our own.
> >+	 *
> >+	 * IN_WAKEUP, someone is currently waking us up.  We need to loop
> >+	 * here until they change it to the operation error value.  If
> >+	 * we don't loop, our process could exit before they are done waking us
> >+	 *
> >+	 * operation error value: we've been properly woken up and can exit
> >+	 * at any time.
> >+	 *
> >+	 * If queue.status is currently -EINTR, we are still being processed
> >+	 * by the semtimedop core.  Someone either has us on a list head
> >+	 * or is currently poking our queue struct.  We need to find that
> >+	 * reference and remove it, which is what remove_queue_from_lists
> >+	 * does.
> >+	 *
> >+	 * We always check for both -EINTR and IN_WAKEUP because we have no
> >+	 * locks held.  Someone could change us from -EINTR to IN_WAKEUP at
> >+	 * any time.
> >+	 */
> >+	if (error != -EINTR&&  error != IN_WAKEUP) {
> >  		/* fast path: update_queue already obtained all requested
> >  		 * resources */
> No: The code accesses a local variable. The loop above the comment
> guarantees that the error can't be IN_WAKEUP.

Whoops, thanks.

> 
> >+
> >+out_putref:
> >+	sem_putref(sma);
> >+	goto out_free;
> Is it possible to move the sem_putref into wakeup_sem_queue()?
> Right now, the exit path of semtimedop doesn't touch the spinlock.
> You remove that optimization.

I'll look at this; we need to be able to go through the sma to remove
the process from the lists if it woke up on its own, but I don't see why
we can't putref in wakeup.

My current revision of the patch uses an atomic instead of the lock, so it
restores the lockless wakeup either way.  Still it is better to putref in
wakeup.
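
The atomic variant is basically this (a sketch only; the refcount field
and the free helper stand in for the real ipc_rcu machinery):

static void sem_putref(struct sem_array *sma)
{
	/* refcount as an atomic_t and free_sem_array() are both
	 * placeholders: the last put frees without taking the
	 * ipc lock, which keeps the wakeup path lockless */
	if (atomic_dec_and_test(&sma->refcount))
		free_sem_array(sma);
}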

-chris

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] ipc semaphores: order wakeups based on waiter CPU
  2010-04-12 18:49 ` [PATCH 2/2] ipc semaphores: order wakeups based on waiter CPU Chris Mason
@ 2010-04-17 10:24   ` Manfred Spraul
  0 siblings, 0 replies; 25+ messages in thread
From: Manfred Spraul @ 2010-04-17 10:24 UTC (permalink / raw)
  To: Chris Mason; +Cc: zach.brown, jens.axboe, linux-kernel, Nick Piggin

[-- Attachment #1: Type: text/plain, Size: 2033 bytes --]

Hi Chris,

On 04/12/2010 08:49 PM, Chris Mason wrote:
> @@ -599,6 +622,13 @@ again:
>   		list_splice_init(&new_pending,&work_list);
>   		goto again;
>   	}
> +
> +	list_sort(NULL,&wake_list, list_comp);
> +	while (!list_empty(&wake_list)) {
> +		q = list_entry(wake_list.next, struct sem_queue, list);
> +		list_del_init(&q->list);
> +		wake_up_sem_queue(q, 0);
> +	}
>   }
>    
What about moving this step much later?

There is no need to hold any locks for the actual wake_up_process().

I've updated my patch:
- improved update_queue so that it guarantees no O(N^2) for your workload
- moved the actual wake-up after dropping all locks
- optimized setting sem_otime
- cacheline-aligned the ipc spinlock

But the odd thing:
It doesn't improve the sembench result at all (AMD Phenom X4).
The only thing that is reduced is the system time:
From ~1 min system time for "sembench -t 250 -w 250 -r 30 -o 0" to ~30 sec.

CPU binding the sembench threads results in an improvement of ~50% - at
the cost of a significant increase in the system time (from 30 seconds
to 1 min) and the user time (from 2 seconds to 14 seconds).

Are you sure that the problem is contention on the semaphore array spinlock?
With the above changes, the code that is under the spin_lock is very short.
Especially:
- Why does optimizing ipc/sem.c only reduce the system time [reported by 
time] and not the sembench output?
- Why is there no improvement from the ____cacheline_aligned_in_smp?
If there were contention, then there should be thrashing from
accessing the lock and writing sem_otime and reading sem_base.
- Additionally: you wrote that reducing the array size does not help much.
But: the arrays are 100% independent; the ipc code scales linearly.
Spreading the work over multiple spinlocks is - like cache line aligning 
- usually a 100% guaranteed improvement if there is contention.

I've attached a modified sembench.c and the proposal for ipc/sem.c
Could you try it?
What do you think?
How many cores do you have in your test system?

--
     Manfred

[-- Attachment #2: patch-ipc-optimize_bulkwakeup-3 --]
[-- Type: text/plain, Size: 11774 bytes --]

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 8a4adbe..66b5fc6 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -78,6 +78,7 @@ struct  seminfo {
 
 #ifdef __KERNEL__
 #include <asm/atomic.h>
+#include <linux/cache.h>
 #include <linux/rcupdate.h>
 
 struct task_struct;
@@ -91,7 +92,8 @@ struct sem {
 
 /* One sem_array data structure for each set of semaphores in the system. */
 struct sem_array {
-	struct kern_ipc_perm	sem_perm;	/* permissions .. see ipc.h */
+	struct kern_ipc_perm ____cacheline_aligned_in_smp
+				sem_perm;	/* permissions .. see ipc.h */
 	time_t			sem_otime;	/* last semop time */
 	time_t			sem_ctime;	/* last change time */
 	struct sem		*sem_base;	/* ptr to first semaphore in array */
diff --git a/ipc/sem.c b/ipc/sem.c
index dbef95b..34ae151 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -381,7 +381,6 @@ static int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
 		sop--;
 	}
 	
-	sma->sem_otime = get_seconds();
 	return 0;
 
 out_of_range:
@@ -404,25 +403,41 @@ undo:
 	return result;
 }
 
-/*
- * Wake up a process waiting on the sem queue with a given error.
- * The queue is invalid (may not be accessed) after the function returns.
+/** wake_up_sem_queue_prepare(pt, q, error): Prepare wake-up
+ * @q: queue entry that must be signaled
+ * @error: Error value for the signal
+ *
+ * Prepare the wake-up of the queue entry q.
  */
-static void wake_up_sem_queue(struct sem_queue *q, int error)
+static void wake_up_sem_queue_prepare(struct list_head *pt, struct sem_queue *q, int error)
 {
-	/*
-	 * Hold preempt off so that we don't get preempted and have the
-	 * wakee busy-wait until we're scheduled back on. We're holding
-	 * locks here so it may not strictly be needed, however if the
-	 * locks become preemptible then this prevents such a problem.
-	 */
-	preempt_disable();
+	if (list_empty(pt)) {
+		/*
+		 * Hold preempt off so that we don't get preempted and have the
+		 * wakee busy-wait until we're scheduled back on.
+		 */
+		preempt_disable();
+	}
 	q->status = IN_WAKEUP;
-	wake_up_process(q->sleeper);
-	/* hands-off: q can disappear immediately after writing q->status. */
-	smp_wmb();
-	q->status = error;
-	preempt_enable();
+	q->pid = error;	/* stash the status in pid until the actual wake-up */
+
+	list_add_tail(&q->simple_list, pt);
+}
+
+static void wake_up_sem_queue_do(struct list_head *pt)
+{
+	struct sem_queue *q, *t;
+	int did_something;
+
+	did_something = !list_empty(pt);
+	list_for_each_entry_safe(q, t, pt, simple_list) {
+		wake_up_process(q->sleeper);
+		/* hands-off: q can disappear immediately after writing q->status. */
+		smp_wmb();
+		q->status = q->pid;
+	}
+	if (did_something)
+		preempt_enable();
 }
 
 static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
@@ -434,22 +449,90 @@ static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
 		sma->complex_count--;
 }
 
+/** check_restart(sma, q)
+ * @sma: semaphore array
+ * @q: the operation that just completed
+ *
+ * update_queue is O(N^2) when it restarts scanning the whole queue of
+ * waiting operations. Therefore this function checks if the restart is
+ * really necessary. It is called after a previously waiting operation
+ * was completed.
+ */
+static int check_restart(struct sem_array *sma, struct sem_queue *q)
+{
+	struct sem * curr;
+	struct sem_queue *h;
+
+	/* if the operation didn't modify the array, then no restart */
+	if (q->alter == 0)
+		return 0;
+
+	/* pending complex operations are too difficult to analyse */
+	if (sma->complex_count)
+		return 1;
+
+	/* we were a sleeping complex operation. Too difficult */
+	if (q->nsops > 1)
+		return 1;
+
+	curr = sma->sem_base + q->sops[0].sem_num;
+
+	/* No-one waits on this queue */
+	if (list_empty(&curr->sem_pending))
+		return 0;
+
+	/* the new semaphore value */
+	if (curr->semval) {
+		/* It is impossible that someone waits for the new value:
+		 * - q is a previously sleeping simple operation that
+		 *   altered the array. It must be a decrement, because
+		 *   simple increments never sleep.
+		 * - The value is not 0, thus wait-for-zero won't proceed.
+		 * - If there are older (higher priority) decrements
+		 *   in the queue, then they have observed the original
+		 *   semval value and couldn't proceed. The operation
+		 *   decremented to value - thus they won't proceed either.
+		 */
+		BUG_ON(q->sops[0].sem_op >= 0);
+		return 0;
+	}
+	/*
+	 * semval is 0. Check if there are wait-for-zero semops.
+	 * They must be the first entries in the per-semaphore simple queue
+	 */
+	h=list_first_entry(&curr->sem_pending, struct sem_queue, simple_list);
+	BUG_ON(h->nsops != 1);
+	BUG_ON(h->sops[0].sem_num != q->sops[0].sem_num);
+	
+	/* Yes, there is a wait-for-zero semop. Restart */	
+	if (h->sops[0].sem_op == 0)
+		return 1;
+
+	/* Again - no-one is waiting for the new value. */
+	return 0;
+}		
+
 
 /**
  * update_queue(sma, semnum): Look for tasks that can be completed.
  * @sma: semaphore array.
  * @semnum: semaphore that was modified.
+ * @pt: list head for the tasks that must be woken up.
  *
  * update_queue must be called after a semaphore in a semaphore array
  * was modified. If multiple semaphore were modified, then @semnum
  * must be set to -1.
+ * The tasks that must be woken up are added to @pt. The return code
+ * is stored in q->pid.
+ * The function returns 1 if at least one array variable was modified.
  */
-static void update_queue(struct sem_array *sma, int semnum)
+static int update_queue(struct sem_array *sma, int semnum, struct list_head *pt)
 {
 	struct sem_queue *q;
 	struct list_head *walk;
 	struct list_head *pending_list;
 	int offset;
+	int retval = 0;
 
 	/* if there are complex operations around, then knowing the semaphore
 	 * that was modified doesn't help us. Assume that multiple semaphores
@@ -469,7 +552,7 @@ static void update_queue(struct sem_array *sma, int semnum)
 again:
 	walk = pending_list->next;
 	while (walk != pending_list) {
-		int error, alter;
+		int error, restart;
 
 		q = (struct sem_queue *)((char *)walk - offset);
 		walk = walk->next;
@@ -493,23 +576,57 @@ again:
 			continue;
 
 		unlink_queue(sma, q);
+		if (q->alter)
+			retval = 1;
 
-		/*
-		 * The next operation that must be checked depends on the type
-		 * of the completed operation:
-		 * - if the operation modified the array, then restart from the
-		 *   head of the queue and check for threads that might be
-		 *   waiting for the new semaphore values.
-		 * - if the operation didn't modify the array, then just
-		 *   continue.
-		 */
-		alter = q->alter;
-		wake_up_sem_queue(q, error);
-		if (alter && !error)
+		if (error)
+			restart = 0;
+		else
+			restart = check_restart(sma, q);
+
+		wake_up_sem_queue_prepare(pt, q, error);
+		if (restart)
 			goto again;
 	}
+	return retval;
 }
 
+/** do_smart_update(sma, sops, nsops, otime, pt): Optimized update_queue
+ * @sma: semaphore array
+ * @sops: operations that were performed
+ * @nsops: number of operations
+ * @otime: force setting otime
+ * @pt: list head of the tasks that must be woken up.
+ *
+ * do_smart_update() makes the required calls to update_queue, based on the
+ * actual changes that were performed on the semaphore array.
+ * Note that the function does not do the actual wake-up: the caller is
+ * responsible for calling wake_up_sem_queue_do(@pt).
+ * It is safe to perform this call after dropping all locks.
+ */
+void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsops,
+			int otime, struct list_head *pt)
+{
+	int i;
+
+	if (sma->complex_count) {
+		if (update_queue(sma, -1, pt))
+			otime = 1;
+		goto done;
+	}
+				
+	for (i=0;i<nsops;i++) {
+		if (sops[i].sem_op > 0 ||
+			(sops[i].sem_op < 0 && sma->sem_base[sops[i].sem_num].semval == 0))
+			if (update_queue(sma, sops[i].sem_num, pt))
+				otime = 1;
+	}
+done:
+	if (otime)
+		sma->sem_otime = get_seconds();
+}
+
+
 /* The following counts are associated to each semaphore:
  *   semncnt        number of tasks waiting on semval being nonzero
  *   semzcnt        number of tasks waiting on semval being zero
@@ -572,6 +689,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	struct sem_undo *un, *tu;
 	struct sem_queue *q, *tq;
 	struct sem_array *sma = container_of(ipcp, struct sem_array, sem_perm);
+	struct list_head tasks;
 
 	/* Free the existing undo structures for this semaphore set.  */
 	assert_spin_locked(&sma->sem_perm.lock);
@@ -585,15 +703,17 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	}
 
 	/* Wake up all pending processes and let them fail with EIDRM. */
+	INIT_LIST_HEAD(&tasks);
 	list_for_each_entry_safe(q, tq, &sma->sem_pending, list) {
 		unlink_queue(sma, q);
-		wake_up_sem_queue(q, -EIDRM);
+		wake_up_sem_queue_prepare(&tasks, q, -EIDRM);
 	}
 
 	/* Remove the semaphore set from the IDR */
 	sem_rmid(ns, sma);
 	sem_unlock(sma);
 
+	wake_up_sem_queue_do(&tasks);
 	ns->used_sems -= sma->sem_nsems;
 	security_sem_free(sma);
 	ipc_rcu_putref(sma);
@@ -715,11 +835,13 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 	ushort fast_sem_io[SEMMSL_FAST];
 	ushort* sem_io = fast_sem_io;
 	int nsems;
+	struct list_head tasks;
 
 	sma = sem_lock_check(ns, semid);
 	if (IS_ERR(sma))
 		return PTR_ERR(sma);
 
+	INIT_LIST_HEAD(&tasks);
 	nsems = sma->sem_nsems;
 
 	err = -EACCES;
@@ -807,7 +929,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 		}
 		sma->sem_ctime = get_seconds();
 		/* maybe some queued-up processes were waiting for this */
-		update_queue(sma, -1);
+		do_smart_update(sma, NULL, 0, 0, &tasks);
 		err = 0;
 		goto out_unlock;
 	}
@@ -849,13 +971,15 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 		curr->sempid = task_tgid_vnr(current);
 		sma->sem_ctime = get_seconds();
 		/* maybe some queued-up processes were waiting for this */
-		update_queue(sma, semnum);
+		do_smart_update(sma, NULL, 0, 0, &tasks);
 		err = 0;
 		goto out_unlock;
 	}
 	}
 out_unlock:
 	sem_unlock(sma);
+	wake_up_sem_queue_do(&tasks);
+
 out_free:
 	if(sem_io != fast_sem_io)
 		ipc_free(sem_io, sizeof(ushort)*nsems);
@@ -1129,6 +1253,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	struct sem_queue queue;
 	unsigned long jiffies_left = 0;
 	struct ipc_namespace *ns;
+	struct list_head tasks;
 
 	ns = current->nsproxy->ipc_ns;
 
@@ -1177,6 +1302,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	} else
 		un = NULL;
 
+	INIT_LIST_HEAD(&tasks);
+
 	sma = sem_lock_check(ns, semid);
 	if (IS_ERR(sma)) {
 		if (un)
@@ -1225,7 +1352,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
 	if (error <= 0) {
 		if (alter && error == 0)
-			update_queue(sma, (nsops == 1) ? sops[0].sem_num : -1);
+			do_smart_update(sma, sops, nsops, 1, &tasks);
 
 		goto out_unlock_free;
 	}
@@ -1302,6 +1429,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 
 out_unlock_free:
 	sem_unlock(sma);
+
+	wake_up_sem_queue_do(&tasks);
 out_free:
 	if(sops != fast_sops)
 		kfree(sops);
@@ -1362,6 +1491,7 @@ void exit_sem(struct task_struct *tsk)
 	for (;;) {
 		struct sem_array *sma;
 		struct sem_undo *un;
+		struct list_head tasks;
 		int semid;
 		int i;
 
@@ -1425,10 +1555,11 @@ void exit_sem(struct task_struct *tsk)
 				semaphore->sempid = task_tgid_vnr(current);
 			}
 		}
-		sma->sem_otime = get_seconds();
 		/* maybe some queued-up processes were waiting for this */
-		update_queue(sma, -1);
+		INIT_LIST_HEAD(&tasks);
+		do_smart_update(sma, NULL, 0, 1, &tasks);
 		sem_unlock(sma);
+		wake_up_sem_queue_do(&tasks);
 
 		call_rcu(&un->rcu, free_un);
 	}

[-- Attachment #3: sembench.c --]
[-- Type: text/plain, Size: 12582 bytes --]

/*
 * copyright Oracle 2007.  Licensed under GPLv2
 * To compile: gcc -Wall -o sembench sembench.c -lpthread
 *
 * usage: sembench -t thread count -w wakenum -r runtime -o op
 * op can be: 0 (ipc sem) 1 (nanosleep) 2 (futexes)
 *
 * example:
 *	sembench -t 1024 -w 512 -r 60 -o 2
 * runs 1024 threads, waking up 512 at a time, running for 60 seconds using
 * futex locking.
 *
 */
#define  _GNU_SOURCE
#define _POSIX_C_SOURCE 199309
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/sem.h>
#include <sys/ipc.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <pthread.h>
#include <unistd.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/syscall.h>
#include <errno.h>

#define VERSION "0.2"

static int g_cpu_bind = 0;
/* futexes have been around since 2.5.something, but it still seems I
 * need to make my own syscall.  Sigh.
 */
#define FUTEX_WAIT              0
#define FUTEX_WAKE              1
#define FUTEX_FD                2
#define FUTEX_REQUEUE           3
#define FUTEX_CMP_REQUEUE       4
#define FUTEX_WAKE_OP           5
static inline int futex (int *uaddr, int op, int val,
			 const struct timespec *timeout,
			 int *uaddr2, int val3)
{
	return syscall(__NR_futex, uaddr, op, val, timeout, uaddr2, val3);
}

static int all_done = 0;
static int timeout_test = 0;

#define SEMS_PERID 250

struct sem_operations;

struct lockinfo {
	unsigned long id;
	unsigned long index;
	int data;
	pthread_t tid;
	struct lockinfo *next;
	struct sem_operations *ops;
};

struct sem_wakeup_info {
	int wakeup_count;
	struct sembuf sb[SEMS_PERID];
};

struct sem_operations {
	void (*wait)(struct lockinfo *l);
	int (*wake)(struct sem_wakeup_info *wi, int num_semids, int num);
	void (*setup)(struct sem_wakeup_info **wi, int num_semids);
	void (*cleanup)(int num_semids);
	char *name;
};

int *semid_lookup = NULL;

#if 0
pthread_mutex_t worklist_mutex = PTHREAD_MUTEX_INITIALIZER;

void do_lock(void)
{
	pthread_mutex_lock(&worklist_mutex);
}

void do_unlock(void)
{
	pthread_mutex_unlock(&worklist_mutex);
}

#else
static volatile long lock = 1;

void do_lock(void)
{
	int res;

again:
	res = 0;
	asm volatile(
		"lock; xchgb %b0,%1\n\t"
		: "=q" (res)
		: "m" (lock), "0" (res)
		: "memory");
	if (res == 1) {
		if(lock == 1) {
			printf("Lock error 1.\n");
			exit(1);
		}
		return;
	}
	asm volatile("rep; nop \n\t" : : :);
	goto again;
}

void do_unlock(void)
{
	int res = 1;
	
	if(lock) {
		printf("Lock error 2a.\n");
		exit(1);
	}

	asm volatile(
		"lock; xchgb %b0,%1\n\t"
		: "=q" (res)
		: "m" (lock), "0" (res)
		: "memory");
	if (res == 1) {
		printf("lock error 2b.\n");
		exit(1);
	}
}
#endif

static unsigned long total_burns = 0;
static unsigned long min_burns = ~0UL;
static unsigned long max_burns = 0;
static int thread_count = 0;
struct lockinfo *worklist_head = NULL;
struct lockinfo *worklist_tail = NULL;

static void worklist_add(struct lockinfo *l)
{
	do_lock();
	l->next = NULL;
	if (!worklist_head)
		worklist_head = l;
	else
		worklist_tail->next = l;
	worklist_tail = l;
	do_unlock();
}

static struct lockinfo *worklist_rm(void)
{
	struct lockinfo *ret;
	do_lock();
	if (!worklist_head) {
		ret = NULL;
	} else {
		ret = worklist_head;
		worklist_head = ret->next;
		if (!worklist_head)
			worklist_tail = NULL;
	}
	do_unlock();
	return ret;
}

static void do_cpu_bind(int master)
{
	if (g_cpu_bind)
	{
		cpu_set_t cpus;
		pthread_t thread;
		int ret;

		CPU_ZERO(&cpus);

		thread = pthread_self();		

		if (master) {
			CPU_SET(0, &cpus);
		} else {
			ret = pthread_getaffinity_np(thread, sizeof(cpus), &cpus);
			if (ret < 0) {
				printf("pthread_getaffinity_np() failed for thread %p with errno %d.\n",
						thread, errno);
				fflush(stdout);
				return;
			}
 			CPU_CLR(0, &cpus);
		}

		ret = pthread_setaffinity_np(thread, sizeof(cpus), &cpus);
		if (ret < 0) {
			printf("pthread_setaffinity_np failed for thread %p with errno %d.\n",
					thread, errno);
			fflush(stdout);
			return;
		}

		ret = pthread_getaffinity_np(thread, sizeof(cpus), &cpus);
		if (ret < 0) {
			printf("pthread_getaffinity_np() failed for thread %p with errno %d.\n",
					thread, errno);
		} else {
			printf("thread %p: type %d bound to %04lxh\n",thread, master,
					cpus.__bits[0]);
		}
		fflush(stdout);
	}
}

/* ipc semaphore post& wait */
void wait_ipc_sem(struct lockinfo *l)
{
	struct sembuf sb;
	int ret;
	struct timespec *tvp = NULL;
	struct timespec tv = { 0, 1 };

	sb.sem_num = l->index;
	sb.sem_flg = 0;

	sb.sem_op = -1;
	l->data = 1;

	if (timeout_test && (l->id % 5) == 0)
		tvp = &tv;

	worklist_add(l);
	ret = semtimedop(semid_lookup[l->id], &sb, 1, tvp);

	while(l->data != 0 && tvp) {
		struct timespec tv2 = { 0, 500 };
		nanosleep(&tv2, NULL);
	}

	if (l->data != 0) {
		if (tvp)
			return;
		fprintf(stderr, "wakeup without data update\n");
		exit(1);
	}
	if (ret) {
		if (errno == EAGAIN && tvp)
			return;
		perror("semtimed op");
		exit(1);
	}
}

int ipc_wake_some(struct sem_wakeup_info *wi, int num_semids, int num)
{
	int i;
	int ret;
	struct lockinfo *l;
	int found = 0;

	for (i = 0; i < num_semids; i++) {
		wi[i].wakeup_count = 0;
	}
	while(num > 0) {
		struct sembuf *sb;
		l = worklist_rm();
		if (!l)
			break;
		if (l->data != 1)
			fprintf(stderr, "warning, lockinfo data was %d\n",
				l->data);
		l->data = 0;
		sb = wi[l->id].sb + wi[l->id].wakeup_count;
		sb->sem_num = l->index;
		sb->sem_op = 1;
		sb->sem_flg = IPC_NOWAIT;
		wi[l->id].wakeup_count++;
		found++;
		num--;
	}
	if (!found)
		return 0;
	for (i = 0; i < num_semids; i++) {
		int wakeup_total;
		int cur;
		int offset = 0;
		if (!wi[i].wakeup_count)
			continue;
		wakeup_total = wi[i].wakeup_count;
		while(wakeup_total > 0) {
			cur = wakeup_total > 64 ? 64 : wakeup_total;
			ret = semtimedop(semid_lookup[i], wi[i].sb + offset,
					 cur, NULL);
			if (ret) {
				perror("semtimedop");
				exit(1);
			}
			offset += cur;
			wakeup_total -= cur;
		}
	}
	return found;
}

void setup_ipc_sems(struct sem_wakeup_info **wi, int num_semids)
{
	int i;
	*wi = malloc(sizeof(**wi) * num_semids);
	semid_lookup = malloc(num_semids * sizeof(int));
	for(i = 0; i < num_semids; i++) {
		semid_lookup[i] = semget(IPC_PRIVATE, SEMS_PERID,
					 IPC_CREAT | 0777);
		if (semid_lookup[i] < 0) {
			perror("semget");
			exit(1);
		}
	}
	usleep(200);
}

void cleanup_ipc_sems(int num)
{
	int i;
	for (i = 0; i < num; i++) {
		semctl(semid_lookup[i], 0, IPC_RMID);
	}
}

struct sem_operations ipc_sem_ops = {
	.wait = wait_ipc_sem,
	.wake = ipc_wake_some,
	.setup = setup_ipc_sems,
	.cleanup = cleanup_ipc_sems,
	.name = "ipc sem operations",
};

/* futex post & wait */
void wait_futex_sem(struct lockinfo *l)
{
	int ret;
	l->data = 1;
	worklist_add(l);
	while(l->data == 1) {
		ret = futex(&l->data, FUTEX_WAIT, 1, NULL, NULL, 0);
		if (ret && ret != EWOULDBLOCK) {
			perror("futex wait");
			exit(1);
		}
	}
}

int futex_wake_some(struct sem_wakeup_info *wi, int num_semids, int num)
{
	int i;
	int ret;
	struct lockinfo *l;
	int found = 0;

	for (i = 0; i < num; i++) {
		l = worklist_rm();
		if (!l)
			break;
		if (l->data != 1)
			fprintf(stderr, "warning, lockinfo data was %d\n",
				l->data);
		l->data = 0;
		ret = futex(&l->data, FUTEX_WAKE, 1, NULL, NULL, 0);
		if (ret < 0) {
			perror("futex wake");
			exit(1);
		}
		found++;
	}
	return found;
}

void setup_futex_sems(struct sem_wakeup_info **wi, int num_semids)
{
	return;
}

void cleanup_futex_sems(int num)
{
	return;
}

struct sem_operations futex_sem_ops = {
	.wait = wait_futex_sem,
	.wake = futex_wake_some,
	.setup = setup_futex_sems,
	.cleanup = cleanup_futex_sems,
	.name = "futex sem operations",
};

/* nanosleep sems here */
void wait_nanosleep_sem(struct lockinfo *l)
{
	int ret;
	struct timespec tv = { 0, 1000000 };
	int count = 0;

	l->data = 1;
	worklist_add(l);
	while(l->data) {
		ret = nanosleep(&tv, NULL);
		if (ret) {
			perror("nanosleep");
			exit(1);
		}
		count++;
	}
}

int nanosleep_wake_some(struct sem_wakeup_info *wi, int num_semids, int num)
{
	int i;
	struct lockinfo *l;

	for (i = 0; i < num; i++) {
		l = worklist_rm();
		if (!l)
			break;
		if (l->data != 1)
			fprintf(stderr, "warning, lockinfo data was %d\n",
				l->data);
		l->data = 0;
	}
	return i;
}

void setup_nanosleep_sems(struct sem_wakeup_info **wi, int num_semids)
{
	return;
}

void cleanup_nanosleep_sems(int num)
{
	return;
}

struct sem_operations nanosleep_sem_ops = {
	.wait = wait_nanosleep_sem,
	.wake = nanosleep_wake_some,
	.setup = setup_nanosleep_sems,
	.cleanup = cleanup_nanosleep_sems,
	.name = "nano sleep sem operations",
};

void *worker(void *arg)
{
	struct lockinfo *l = (struct lockinfo *)arg;
	int burn_count = 0;
	pthread_t tid = pthread_self();
	size_t pagesize = getpagesize();
	char *buf = malloc(pagesize);
	struct lockinfo *node_local_l;

	if (!buf) {
		perror("malloc");
		exit(1);
	}

	node_local_l = malloc(sizeof(struct lockinfo));
	if (!node_local_l) {
		perror("malloc");
		exit(1);
	}
	memcpy(node_local_l, l, sizeof(*l));
	l = node_local_l;
	node_local_l = (struct lockinfo *)arg;

	do_cpu_bind(0);

	do_lock();
	thread_count++;
	do_unlock();

	l->tid = tid;
	while(!all_done) {
		l->ops->wait(l);
		if (all_done)
			break;
		burn_count++;
	}
	do_lock();
	total_burns += burn_count;
	if (burn_count < min_burns)
		min_burns = burn_count;
	if (burn_count > max_burns)
		max_burns = burn_count;
	thread_count--;
	do_unlock();

	free(node_local_l);
	return (void *)0;
}

void print_usage(void)
{
	printf("usage: sembench [-t threads] [-w wake incr] [-r runtime]\n");
	printf("                [-o num] (0=ipc, 1=nanosleep, 2=futex)\n");
	printf("                [-b] (cpu bind) [-T] (time test)\n");
	exit(1);
}

#define NUM_OPERATIONS 3
struct sem_operations *allops[NUM_OPERATIONS] = { &ipc_sem_ops,
						&nanosleep_sem_ops,
						&futex_sem_ops};

int main(int ac, char **av) {
	int ret;
	int i;
	int semid = 0;
	int sem_num = 0;
	int burn_count = 0;
	struct sem_wakeup_info *wi = NULL;
	struct timeval start;
	struct timeval now;
	int num_semids = 0;
	int num_threads = 2048;
	int wake_num = 256;
	int run_secs = 30;
	int pagesize = getpagesize();
	char *buf = malloc(pagesize);
	struct sem_operations *ops = allops[0];

	if (!buf) {
		perror("malloc");
		exit(1);
	}
	for (i = 1; i < ac; i++) {
		if (strcmp(av[i], "-t") == 0) {
			if (i == ac -1)
				print_usage();
			num_threads = atoi(av[i+1]);
			i++;
		} else if (strcmp(av[i], "-w") == 0) {
			if (i == ac -1)
				print_usage();
			wake_num = atoi(av[i+1]);
			i++;
		} else if (strcmp(av[i], "-r") == 0) {
			if (i == ac -1)
				print_usage();
			run_secs = atoi(av[i+1]);
			i++;
		} else if (strcmp(av[i], "-o") == 0) {
			int index;
			if (i == ac -1)
				print_usage();
			index = atoi(av[i+1]);
			if (index >= NUM_OPERATIONS) {
				fprintf(stderr, "invalid operations %d\n",
					index);
				exit(1);
			}
			ops = allops[index];
			i++;
		} else if (strcmp(av[i], "-b") == 0) {
			g_cpu_bind = 1;
		}  else if (strcmp(av[i], "-T") == 0) {
			timeout_test = 1;
		} else if (strcmp(av[i], "-h") == 0) {
			print_usage();
		}
	}
	num_semids = (num_threads + SEMS_PERID - 1) / SEMS_PERID;
	ops->setup(&wi, num_semids);

	for (i = 0; i < num_threads; i++) {
		struct lockinfo *l;
		pthread_t tid;

		l = malloc(sizeof(*l));
		if (!l) {
			perror("malloc");
			exit(1);
		}
		l->id = semid;
		l->index = sem_num++;
		l->ops = ops;
		if (sem_num >= SEMS_PERID) {
			semid++;
			sem_num = 0;
		}
		ret = pthread_create(&tid, NULL, worker, (void *)l);
		if (ret) {
			perror("pthread_create");
			exit(1);
		}
		ret = pthread_detach(tid);
		if (ret) {
			perror("pthread_detach");
			exit(1);
		}
	}
	do_cpu_bind(1);
	/* wait for every worker to check in */
	while(thread_count != num_threads)
		usleep(200);

	/* and for at least one waiter to queue itself */
	while(!worklist_head)
		usleep(200);

	gettimeofday(&start, NULL);
	fprintf(stderr, "main loop going\n");
	while(1) {
		ops->wake(wi, num_semids, wake_num);
		burn_count++;
		gettimeofday(&now, NULL);
		if (now.tv_sec - start.tv_sec >= run_secs)
			break;
	}
	fprintf(stderr, "all done\n");
	all_done = 1;
	/* keep waking until every worker has seen all_done and exited */
	while(thread_count > 0) {
		ops->wake(wi, num_semids, wake_num);
		usleep(200);
	}
	printf("%d threads, waking %d at a time\n", num_threads, wake_num);
	printf("using %s\n", ops->name);
	printf("main thread burns: %d\n", burn_count);
	printf("worker burn count total %lu min %lu max %lu avg %lu\n",
	       total_burns, min_burns, max_burns, total_burns / num_threads);
	printf("run time %d seconds %lu worker burns per second\n",
		(int)(now.tv_sec - start.tv_sec),
		total_burns / (now.tv_sec - start.tv_sec));
	ops->cleanup(num_semids);
	return 0;
}
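
/*
 * Example: the defaults above are equivalent to running
 *
 *	./sembench -t 2048 -w 256 -r 30 -o 0
 *
 * i.e. 2048 worker threads woken 256 at a time via sysv semaphores,
 * for a 30 second run.  (Invocation shown here for illustration.)
 */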

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-04-13 18:57           ` Nick Piggin
  2010-04-13 19:01             ` Chris Mason
@ 2010-05-16 16:57             ` Manfred Spraul
  2010-05-16 22:40               ` Chris Mason
  2010-05-17  7:20               ` Nick Piggin
  1 sibling, 2 replies; 25+ messages in thread
From: Manfred Spraul @ 2010-05-16 16:57 UTC (permalink / raw)
  To: Nick Piggin, Chris Mason; +Cc: zach.brown, jens.axboe, linux-kernel

On 04/13/2010 08:57 PM, Nick Piggin wrote:
> On Tue, Apr 13, 2010 at 02:19:37PM -0400, Chris Mason wrote:
>    
>> I don't see anything in the docs about the FIFO order.  I could add an
>> extra sort on sequence number pretty easily, but is the starvation case
>> really that bad?
>>      
> Yes, because it's not just a theoretical livelock, it can be basically
> a certainty, given the right pattern of semops.
>
> You could have two mostly-independent groups of processes, each taking
> and releasing a different sem, which are always contended (eg. if it is
> being used for a producer-consumer type situation, or even just mutual
> exclusion with high contention).
>
> Then you could have some overall management process for example which
> tries to take both sems. It will never get it.
>
>    
The management process won't get the sem on Linux either:
Linux implements FIFO, but there is no protection at all against starvation.
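
(Sketched, the manager's request in Nick's scenario is a two-sop
semop(); the sem numbers and the wrapper below are illustrative.  The
call only succeeds when both decrements can be applied atomically, so
while either sem stays contended it loses every race.)

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* the manager's atomic "grab both" attempt */
static int take_both(int semid)
{
	struct sembuf sops[2] = {
		{ .sem_num = 0, .sem_op = -1, .sem_flg = 0 },
		{ .sem_num = 1, .sem_op = -1, .sem_flg = 0 },
	};
	/* blocks until sem 0 and sem 1 can both be taken in one shot */
	return semop(semid, sops, 2);
}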

If I understand the benchmark numbers correctly, a 4-core, 2 GHz Phenom 
is able to do ~ 2 million semaphore operations per second in one 
semaphore array.
That's the limit - cache line thrashing on the sma structure prevents
higher numbers.

For a NUMA system, the limit is probably lower.

Chris:
Do you have an estimate of how many semop() calls your app will perform on one array?

Perhaps we should really remove the per-array list, sma->sem_perm.lock 
and sma->sem_otime.
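
For reference, the structure in question, heavily abbreviated (the
field list is from memory of the 2.6.3x include/linux/sem.h and may
not be exact):

struct sem_array {
	struct kern_ipc_perm	sem_perm;	/* ipc perms, incl. the lock */
	time_t			sem_otime;	/* last semop time */
	struct sem		*sem_base;	/* first semaphore in the array */
	struct list_head	sem_pending;	/* the per-array pending list */
	int			sem_nsems;	/* number of semaphores */
	/* ... */
};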

--
     Manfred

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-05-16 16:57             ` Manfred Spraul
@ 2010-05-16 22:40               ` Chris Mason
  2010-05-17  7:20               ` Nick Piggin
  1 sibling, 0 replies; 25+ messages in thread
From: Chris Mason @ 2010-05-16 22:40 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Nick Piggin, zach.brown, jens.axboe, linux-kernel

On Sun, May 16, 2010 at 06:57:38PM +0200, Manfred Spraul wrote:
> On 04/13/2010 08:57 PM, Nick Piggin wrote:
> >On Tue, Apr 13, 2010 at 02:19:37PM -0400, Chris Mason wrote:
> >>I don't see anything in the docs about the FIFO order.  I could add an
> >>extra sort on sequence number pretty easily, but is the starvation case
> >>really that bad?
> >Yes, because it's not just a theoretical livelock, it can be basically
> >a certainty, given the right pattern of semops.
> >
> >You could have two mostly-independent groups of processes, each taking
> >and releasing a different sem, which are always contended (eg. if it is
> >being used for a producer-consumer type situation, or even just mutual
> >exclusion with high contention).
> >
> >Then you could have some overall management process for example which
> >tries to take both sems. It will never get it.
> >
> The management process won't get the sem on Linux either:
> Linux implements FIFO, but there is no protection at all against starvation.
> 
> If I understand the benchmark numbers correctly, a 4-core, 2 GHz
> Phenom is able to do ~ 2 million semaphore operations per second in
> one semaphore array.
> That's the limit - cache line thrashing on the sma structure prevents
> higher numbers.
> 
> For a NUMA system, the limit is probably lower.
> 
> Chris:
> Do you have an estimate of how many semop() calls your app will perform on one array?

There are two different workloads at play.  The first just uses
semaphores as a lock, a traditional mutex-type operation (one holder
at a time).  This isn't a problem with the current code, aside from
the lock contention created by the second case.

The second case is batched wakeup.  One process will wake hundreds of
others (or more) at once, each of which is waiting on its own
semaphore.
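
A minimal sketch of that second case (the key setup is omitted, and
NWAKE plus the error handling are illustrative rather than taken from
the real app; note the kernel caps nsops per call at SEMOPM, so a
real waker may have to split the batch):

#define _GNU_SOURCE	/* for semtimedop() */
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define NWAKE 256	/* how many waiters to post in one call */

static void wake_batch(int semid)
{
	struct sembuf sops[NWAKE];
	int i;

	for (i = 0; i < NWAKE; i++) {
		sops[i].sem_num = i;	/* one sem per sleeping worker */
		sops[i].sem_op = 1;	/* post it */
		sops[i].sem_flg = 0;
	}
	/* all NWAKE posts are applied as a single atomic operation */
	if (semtimedop(semid, sops, NWAKE, NULL) < 0)
		perror("semtimedop");
}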

> 
> Perhaps we should really remove the per-array list,
> sma->sem_perm.lock and sma->sem_otime.

So far, for our uses, the per-array list has been the biggest trouble.
I tried to benchmark your patches on Friday, but these are
preproduction systems and they appear to have woken up in a bad mood.

The hardware guys are giving it some TLC and I'll be able to run on
Monday.

-chris


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop
  2010-05-16 16:57             ` Manfred Spraul
  2010-05-16 22:40               ` Chris Mason
@ 2010-05-17  7:20               ` Nick Piggin
  1 sibling, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2010-05-17  7:20 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Chris Mason, zach.brown, jens.axboe, linux-kernel

On Sun, May 16, 2010 at 06:57:38PM +0200, Manfred Spraul wrote:
> On 04/13/2010 08:57 PM, Nick Piggin wrote:
> >On Tue, Apr 13, 2010 at 02:19:37PM -0400, Chris Mason wrote:
> >>I don't see anything in the docs about the FIFO order.  I could add an
> >>extra sort on sequence number pretty easily, but is the starvation case
> >>really that bad?
> >Yes, because it's not just a theoretical livelock, it can be basically
> >a certainty, given the right pattern of semops.
> >
> >You could have two mostly-independent groups of processes, each taking
> >and releasing a different sem, which are always contended (eg. if it is
> >being used for a producer-consumer type situation, or even just mutual
> >exclusion with high contention).
> >
> >Then you could have some overall management process for example which
> >tries to take both sems. It will never get it.
> >
> The management process won't get the sem on Linux either:
> Linux implements FIFO, but there is no protection at all against starvation.

Yeah, I did realise this after I posted.  But anyway I think FIFO
is reasonable to have, although you *may* be able to justify
removing it, given your research of other UNIXes, if there are
sufficient gains.

> 
> If I understand the benchmark numbers correctly, a 4-core, 2 GHz
> Phenom is able to do ~ 2 million semaphore operations per second in
> one semaphore array.
> That's the limit - cache line thrashing on the sma structure prevents
> higher numbers.
> 
> For a NUMA system, the limit is probably lower.
> 
> Chris:
> Do you have an estimate of how many semop() calls your app will perform on one array?
> 
> Perhaps we should really remove the per-array list,
> sma->sem_perm.lock and sma->sem_otime.
> 
> --
>     Manfred

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2010-05-18  6:23 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-12 18:49 [PATCH RFC] Optimize semtimedop Chris Mason
2010-04-12 18:49 ` [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop Chris Mason
2010-04-13 17:15   ` Manfred Spraul
2010-04-13 17:39     ` Chris Mason
2010-04-13 18:09       ` Nick Piggin
2010-04-13 18:19         ` Chris Mason
2010-04-13 18:57           ` Nick Piggin
2010-04-13 19:01             ` Chris Mason
2010-04-13 19:25               ` Nick Piggin
2010-04-13 19:38                 ` Chris Mason
2010-04-13 20:05                   ` Nick Piggin
2010-05-16 16:57             ` Manfred Spraul
2010-05-16 22:40               ` Chris Mason
2010-05-17  7:20               ` Nick Piggin
2010-04-14 16:16           ` Manfred Spraul
2010-04-14 17:33             ` Chris Mason
2010-04-14 19:11               ` Manfred Spraul
2010-04-14 19:50                 ` Chris Mason
2010-04-15 16:33                   ` Manfred Spraul
2010-04-15 16:34                     ` Chris Mason
2010-04-13 18:24         ` Zach Brown
2010-04-16 11:26   ` Manfred Spraul
2010-04-16 11:45     ` Chris Mason
2010-04-12 18:49 ` [PATCH 2/2] ipc semaphores: order wakeups based on waiter CPU Chris Mason
2010-04-17 10:24   ` Manfred Spraul

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).