* [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3
@ 2013-05-26  9:08 Manfred Spraul
  2013-05-26  9:08 ` [PATCH 1/4] ipc/sem.c: Fix missing wakeups in do_smart_update_queue() Manfred Spraul
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Manfred Spraul @ 2013-05-26  9:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds,
	Manfred Spraul

I've split my patch into 4 parts:
- 1: Fix-missing-wakeups-in-do_smart_update
- 2: separate-wait-for-zero-and-alter-tasks
- 3: Always-use-only-one-queue-for-alter-operations
- 4: Rename-try_atomic_semop-to-perform_atomic

Linus:
- Patch 1 should be merged immediately: it fixes bugs;
  the current code misses wakeups.

- Patches 2 and 3 restore the behavior of linux <=3.0.9.
  I would propose that they are merged, too: I can't rule out that
  changing the priority of the wakeups breaks user space apps.

- Patch 4 is trivial, no code changes at all.
  If 2+3 are merged, then 4 should be merged, too.

I have tested patch 1 separately and 1+2+3+4:
With patch 1 applied, there are no more missed wakeups.

With all 4 applied, linux-3.0.10-rc1 behaves as linux <=3.0.9.

With regard to scalability, I do not expect any degradation:
operations on separate semaphores in an array remain parallelized.

Apps that use lots of wait-for-zero semops are probably even faster,
because the wait-for-zero ops are now only scanned if a semval is 0.

--
	Manfred

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/4] ipc/sem.c: Fix missing wakeups in do_smart_update_queue()
  2013-05-26  9:08 [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3 Manfred Spraul
@ 2013-05-26  9:08 ` Manfred Spraul
  2013-05-27 17:51   ` Rik van Riel
  2013-05-26  9:08 ` [PATCH 2/4] ipc/sem: separate wait-for-zero and alter tasks into separate queues Manfred Spraul
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Manfred Spraul @ 2013-05-26  9:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds,
	Manfred Spraul

do_smart_update_queue() is called when an operation
(semop, semctl(SETVAL), semctl(SETALL), ...) modified the array.
It must check which of the sleeping tasks can proceed.

do_smart_update_queue() missed a few wakeups:
- if a sleeping complex op was completed, then all per-semaphore queues
  must be scanned - not only those that were modified by *sops
- if a sleeping simple op proceeded, then the global queue
  must be scanned again

And:
- the test for "sops == NULL" before scanning the global
  queue is not required: if the global queue is empty, then
  it doesn't need to be scanned - regardless of the reason
  for calling do_smart_update_queue().

The patch is not optimized, i.e. even completing a wait-for-zero
operation causes a rescan. This is done to keep the patch as simple as
possible.
Avoiding unnecessary scans is implemented in the following patches.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index a7e40ed..70480a3 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -752,19 +752,29 @@ static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsop
 			int otime, struct list_head *pt)
 {
 	int i;
+	int progress;
 
-	if (sma->complex_count || sops == NULL) {
-		if (update_queue(sma, -1, pt))
+	progress = 1;
+retry_global:
+	if (sma->complex_count) {
+		if (update_queue(sma, -1, pt)) {
+			progress = 1;
 			otime = 1;
+			sops = NULL;
+		}
 	}
+	if (!progress)
+		goto done;
 
 	if (!sops) {
 		/* No semops; something special is going on. */
 		for (i = 0; i < sma->sem_nsems; i++) {
-			if (update_queue(sma, i, pt))
+			if (update_queue(sma, i, pt)) {
 				otime = 1;
+				progress = 1;
+			}
 		}
-		goto done;
+		goto done_checkretry;
 	}
 
 	/* Check the semaphores that were modified. */
@@ -772,8 +782,15 @@ static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsop
 		if (sops[i].sem_op > 0 ||
 			(sops[i].sem_op < 0 &&
 				sma->sem_base[sops[i].sem_num].semval == 0))
-			if (update_queue(sma, sops[i].sem_num, pt))
+			if (update_queue(sma, sops[i].sem_num, pt)) {
 				otime = 1;
+				progress = 1;
+			}
+	}
+done_checkretry:
+	if (progress) {
+		progress = 0;
+		goto retry_global;
 	}
 done:
 	if (otime)
-- 
1.8.1.4



* [PATCH 2/4] ipc/sem: separate wait-for-zero and alter tasks into separate queues
  2013-05-26  9:08 [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3 Manfred Spraul
  2013-05-26  9:08 ` [PATCH 1/4] ipc/sem.c: Fix missing wakeups in do_smart_update_queue() Manfred Spraul
@ 2013-05-26  9:08 ` Manfred Spraul
  2013-05-27 17:57   ` Rik van Riel
  2013-05-26  9:08 ` [PATCH 3/4] ipc/sem.c: Always use only one queue for alter operations Manfred Spraul
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Manfred Spraul @ 2013-05-26  9:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds,
	Manfred Spraul

Introduce separate queues for operations that do not modify the
semaphore values.
Advantages:
- Simpler logic in check_restart().
- Faster update_queue(): right now, all wait-for-zero operations
  are always tested, even if the semaphore value is not 0.
- wait-for-zero operations regain their priority, as in linux <=3.0.9.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 include/linux/sem.h |   5 +-
 ipc/sem.c           | 209 +++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 153 insertions(+), 61 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 53d4265..55e17f6 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -15,7 +15,10 @@ struct sem_array {
 	time_t			sem_otime;	/* last semop time */
 	time_t			sem_ctime;	/* last change time */
 	struct sem		*sem_base;	/* ptr to first semaphore in array */
-	struct list_head	sem_pending;	/* pending operations to be processed */
+	struct list_head	pending_alter;	/* pending operations */
+						/* that alter the array */
+	struct list_head	pending_const;	/* pending complex operations */
+						/* that do not alter semvals */
 	struct list_head	list_id;	/* undo requests on this array */
 	int			sem_nsems;	/* no. of semaphores in array */
 	int			complex_count;	/* pending complex operations */
diff --git a/ipc/sem.c b/ipc/sem.c
index 70480a3..ce25da3 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -95,7 +95,10 @@ struct sem {
 	int	semval;		/* current value */
 	int	sempid;		/* pid of last operation */
 	spinlock_t	lock;	/* spinlock for fine-grained semtimedop */
-	struct list_head sem_pending; /* pending single-sop operations */
+	struct list_head pending_alter; /* pending single-sop operations */
+					/* that alter the semaphore */
+	struct list_head pending_const; /* pending single-sop operations */
+					/* that do not alter the semaphore*/
 };
 
 /* One queue for each sleeping process in the system. */
@@ -152,7 +155,7 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 /*
  * linked list protection:
  *	sem_undo.id_next,
- *	sem_array.sem_pending{,last},
+ *	sem_array.pending{_alter,_const},
  *	sem_array.sem_undo: sem_lock() for read/write
  *	sem_undo.proc_next: only "current" is allowed to read/write that field.
  *	
@@ -337,7 +340,7 @@ static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
  * Without the check/retry algorithm a lockless wakeup is possible:
  * - queue.status is initialized to -EINTR before blocking.
  * - wakeup is performed by
- *	* unlinking the queue entry from sma->sem_pending
+ *	* unlinking the queue entry from the pending list
  *	* setting queue.status to IN_WAKEUP
  *	  This is the notification for the blocked thread that a
  *	  result value is imminent.
@@ -418,12 +421,14 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	sma->sem_base = (struct sem *) &sma[1];
 
 	for (i = 0; i < nsems; i++) {
-		INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
+		INIT_LIST_HEAD(&sma->sem_base[i].pending_alter);
+		INIT_LIST_HEAD(&sma->sem_base[i].pending_const);
 		spin_lock_init(&sma->sem_base[i].lock);
 	}
 
 	sma->complex_count = 0;
-	INIT_LIST_HEAD(&sma->sem_pending);
+	INIT_LIST_HEAD(&sma->pending_alter);
+	INIT_LIST_HEAD(&sma->pending_const);
 	INIT_LIST_HEAD(&sma->list_id);
 	sma->sem_nsems = nsems;
 	sma->sem_ctime = get_seconds();
@@ -609,60 +614,130 @@ static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
  * update_queue is O(N^2) when it restarts scanning the whole queue of
  * waiting operations. Therefore this function checks if the restart is
  * really necessary. It is called after a previously waiting operation
- * was completed.
+ * modified the array.
+ * Note that wait-for-zero operations are handled without restart.
  */
 static int check_restart(struct sem_array *sma, struct sem_queue *q)
 {
-	struct sem *curr;
-	struct sem_queue *h;
-
-	/* if the operation didn't modify the array, then no restart */
-	if (q->alter == 0)
-		return 0;
-
-	/* pending complex operations are too difficult to analyse */
-	if (sma->complex_count)
+	/* pending complex alter operations are too difficult to analyse */
+	if (!list_empty(&sma->pending_alter))
 		return 1;
 
 	/* we were a sleeping complex operation. Too difficult */
 	if (q->nsops > 1)
 		return 1;
 
-	curr = sma->sem_base + q->sops[0].sem_num;
+	/* It is impossible that someone waits for the new value:
+	 * - complex operations always restart.
+	 * - wait-for-zero ops are handled separately.
+	 * - q is a previously sleeping simple operation that
+	 *   altered the array. It must be a decrement, because
+	 *   simple increments never sleep.
+	 * - If there are older (higher priority) decrements
+	 *   in the queue, then they have observed the original
+	 *   semval value and couldn't proceed. The operation
+	 *   decremented the value - thus they won't proceed either.
+	 */
+	return 0;
+}
 
-	/* No-one waits on this queue */
-	if (list_empty(&curr->sem_pending))
-		return 0;
+/**
+ * wake_const_ops(sma, semnum, pt) - Wake up non-alter tasks
+ * @sma: semaphore array.
+ * @semnum: semaphore that was modified.
+ * @pt: list head for the tasks that must be woken up.
+ *
+ * wake_const_ops must be called after a semaphore in a semaphore array
+ * was set to 0. If complex const operations are pending, wake_const_ops must
+ * be called with semnum = -1, as well as with the number of each modified
+ * semaphore.
+ * The tasks that must be woken up are added to @pt. The return code
+ * is stored in q->pid.
+ * The function returns 1 if at least one operation was completed successfully.
+ */
+static int wake_const_ops(struct sem_array *sma, int semnum,
+				struct list_head *pt)
+{
+	struct sem_queue *q;
+	struct list_head *walk;
+	struct list_head *pending_list;
+	int semop_completed = 0;
+
+	if (semnum == -1)
+		pending_list = &sma->pending_const;
+	else
+		pending_list = &sma->sem_base[semnum].pending_const;
+
+	walk = pending_list->next;
+	while (walk != pending_list) {
+		int error;
+
+		q = container_of(walk, struct sem_queue, list);
+		walk = walk->next;
+
+		error = try_atomic_semop(sma, q->sops, q->nsops,
+						q->undo, q->pid);
+
+		if (error <= 0) {
+			/* operation completed, remove from queue & wakeup */
+
+			unlink_queue(sma, q);
+
+			wake_up_sem_queue_prepare(pt, q, error);
+			if (error == 0)
+				semop_completed = 1;
+		}
+	}
+	return semop_completed;
+}
 
-	/* the new semaphore value */
-	if (curr->semval) {
-		/* It is impossible that someone waits for the new value:
-		 * - q is a previously sleeping simple operation that
-		 *   altered the array. It must be a decrement, because
-		 *   simple increments never sleep.
-		 * - The value is not 0, thus wait-for-zero won't proceed.
-		 * - If there are older (higher priority) decrements
-		 *   in the queue, then they have observed the original
-		 *   semval value and couldn't proceed. The operation
-		 *   decremented to value - thus they won't proceed either.
+/**
+ * do_smart_wakeup_zero(sma, sops, nsops, pt) - wakeup all wait for zero tasks
+ * @sma: semaphore array
+ * @sops: operations that were performed
+ * @nsops: number of operations
+ * @pt: list head of the tasks that must be woken up.
+ *
+ * do_smart_wakeup_zero() checks all required queues for wait-for-zero
+ * operations, based on the actual changes that were performed on the
+ * semaphore array.
+ * The function returns 1 if at least one operation was completed successfully.
+ */
+static int do_smart_wakeup_zero(struct sem_array *sma, struct sembuf *sops,
+					int nsops, struct list_head *pt)
+{
+	int i;
+	int semop_completed = 0;
+	int got_zero = 0;
+
+	/* first: the per-semaphore queues, if known */
+	if (sops) {
+		for (i = 0; i < nsops; i++) {
+			int num = sops[i].sem_num;
+
+			if (sma->sem_base[num].semval == 0) {
+				got_zero = 1;
+				semop_completed |= wake_const_ops(sma, num, pt);
+			}
+		}
+	} else {
+		/*
+		 * No sops means modified semaphores not known.
+		 * Assume all were changed.
 		 */
-		BUG_ON(q->sops[0].sem_op >= 0);
-		return 0;
+		for (i = 0; i < sma->sem_nsems; i++) {
+			if (sma->sem_base[i].semval == 0)
+				semop_completed |= wake_const_ops(sma, i, pt);
+		}
 	}
 	/*
-	 * semval is 0. Check if there are wait-for-zero semops.
-	 * They must be the first entries in the per-semaphore queue
+	 * If one of the modified semaphores got 0,
+	 * then check the global queue, too.
 	 */
-	h = list_first_entry(&curr->sem_pending, struct sem_queue, list);
-	BUG_ON(h->nsops != 1);
-	BUG_ON(h->sops[0].sem_num != q->sops[0].sem_num);
+	if (got_zero)
+		semop_completed |= wake_const_ops(sma, -1, pt);
 
-	/* Yes, there is a wait-for-zero semop. Restart */
-	if (h->sops[0].sem_op == 0)
-		return 1;
-
-	/* Again - no-one is waiting for the new value. */
-	return 0;
+	return semop_completed;
 }
 
 
@@ -678,6 +753,8 @@ static int check_restart(struct sem_array *sma, struct sem_queue *q)
  * semaphore.
  * The tasks that must be woken up are added to @pt. The return code
  * is stored in q->pid.
+ * The function internally checks if const operations can now succeed.
+ *
  * The function return 1 if at least one semop was completed successfully.
  */
 static int update_queue(struct sem_array *sma, int semnum, struct list_head *pt)
@@ -688,9 +765,9 @@ static int update_queue(struct sem_array *sma, int semnum, struct list_head *pt)
 	int semop_completed = 0;
 
 	if (semnum == -1)
-		pending_list = &sma->sem_pending;
+		pending_list = &sma->pending_alter;
 	else
-		pending_list = &sma->sem_base[semnum].sem_pending;
+		pending_list = &sma->sem_base[semnum].pending_alter;
 
 again:
 	walk = pending_list->next;
@@ -702,13 +779,12 @@ again:
 
 		/* If we are scanning the single sop, per-semaphore list of
 		 * one semaphore and that semaphore is 0, then it is not
-		 * necessary to scan the "alter" entries: simple increments
+		 * necessary to scan further: simple increments
 		 * that affect only one entry succeed immediately and cannot
 		 * be in the  per semaphore pending queue, and decrements
 		 * cannot be successful if the value is already 0.
 		 */
-		if (semnum != -1 && sma->sem_base[semnum].semval == 0 &&
-				q->alter)
+		if (semnum != -1 && sma->sem_base[semnum].semval == 0)
 			break;
 
 		error = try_atomic_semop(sma, q->sops, q->nsops,
@@ -724,6 +800,7 @@ again:
 			restart = 0;
 		} else {
 			semop_completed = 1;
+			do_smart_wakeup_zero(sma, q->sops, q->nsops, pt);
 			restart = check_restart(sma, q);
 		}
 
@@ -742,8 +819,8 @@ again:
  * @otime: force setting otime
  * @pt: list head of the tasks that must be woken up.
  *
- * do_smart_update() does the required called to update_queue, based on the
- * actual changes that were performed on the semaphore array.
+ * do_smart_update() does the required calls to update_queue and wakeup_zero,
+ * based on the actual changes that were performed on the semaphore array.
  * Note that the function does not do the actual wake-up: the caller is
  * responsible for calling wake_up_sem_queue_do(@pt).
  * It is safe to perform this call after dropping all locks.
@@ -754,6 +831,8 @@ static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsop
 	int i;
 	int progress;
 
+	otime |= do_smart_wakeup_zero(sma, sops, nsops, pt);
+
 	progress = 1;
 retry_global:
 	if (sma->complex_count) {
@@ -813,14 +892,14 @@ static int count_semncnt (struct sem_array * sma, ushort semnum)
 	struct sem_queue * q;
 
 	semncnt = 0;
-	list_for_each_entry(q, &sma->sem_base[semnum].sem_pending, list) {
+	list_for_each_entry(q, &sma->sem_base[semnum].pending_alter, list) {
 		struct sembuf * sops = q->sops;
 		BUG_ON(sops->sem_num != semnum);
 		if ((sops->sem_op < 0) && !(sops->sem_flg & IPC_NOWAIT))
 			semncnt++;
 	}
 
-	list_for_each_entry(q, &sma->sem_pending, list) {
+	list_for_each_entry(q, &sma->pending_alter, list) {
 		struct sembuf * sops = q->sops;
 		int nsops = q->nsops;
 		int i;
@@ -839,14 +918,14 @@ static int count_semzcnt (struct sem_array * sma, ushort semnum)
 	struct sem_queue * q;
 
 	semzcnt = 0;
-	list_for_each_entry(q, &sma->sem_base[semnum].sem_pending, list) {
+	list_for_each_entry(q, &sma->sem_base[semnum].pending_const, list) {
 		struct sembuf * sops = q->sops;
 		BUG_ON(sops->sem_num != semnum);
 		if ((sops->sem_op == 0) && !(sops->sem_flg & IPC_NOWAIT))
 			semzcnt++;
 	}
 
-	list_for_each_entry(q, &sma->sem_pending, list) {
+	list_for_each_entry(q, &sma->pending_const, list) {
 		struct sembuf * sops = q->sops;
 		int nsops = q->nsops;
 		int i;
@@ -884,13 +963,22 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 
 	/* Wake up all pending processes and let them fail with EIDRM. */
 	INIT_LIST_HEAD(&tasks);
-	list_for_each_entry_safe(q, tq, &sma->sem_pending, list) {
+	list_for_each_entry_safe(q, tq, &sma->pending_const, list) {
+		unlink_queue(sma, q);
+		wake_up_sem_queue_prepare(&tasks, q, -EIDRM);
+	}
+
+	list_for_each_entry_safe(q, tq, &sma->pending_alter, list) {
 		unlink_queue(sma, q);
 		wake_up_sem_queue_prepare(&tasks, q, -EIDRM);
 	}
 	for (i = 0; i < sma->sem_nsems; i++) {
 		struct sem *sem = sma->sem_base + i;
-		list_for_each_entry_safe(q, tq, &sem->sem_pending, list) {
+		list_for_each_entry_safe(q, tq, &sem->pending_const, list) {
+			unlink_queue(sma, q);
+			wake_up_sem_queue_prepare(&tasks, q, -EIDRM);
+		}
+		list_for_each_entry_safe(q, tq, &sem->pending_alter, list) {
 			unlink_queue(sma, q);
 			wake_up_sem_queue_prepare(&tasks, q, -EIDRM);
 		}
@@ -1654,14 +1742,15 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		curr = &sma->sem_base[sops->sem_num];
 
 		if (alter)
-			list_add_tail(&queue.list, &curr->sem_pending);
+			list_add_tail(&queue.list, &curr->pending_alter);
 		else
-			list_add(&queue.list, &curr->sem_pending);
+			list_add_tail(&queue.list, &curr->pending_const);
 	} else {
 		if (alter)
-			list_add_tail(&queue.list, &sma->sem_pending);
+			list_add_tail(&queue.list, &sma->pending_alter);
 		else
-			list_add(&queue.list, &sma->sem_pending);
+			list_add_tail(&queue.list, &sma->pending_const);
+
 		sma->complex_count++;
 	}
 
-- 
1.8.1.4



* [PATCH 3/4] ipc/sem.c: Always use only one queue for alter operations.
  2013-05-26  9:08 [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3 Manfred Spraul
  2013-05-26  9:08 ` [PATCH 1/4] ipc/sem.c: Fix missing wakeups in do_smart_update_queue() Manfred Spraul
  2013-05-26  9:08 ` [PATCH 2/4] ipc/sem: separate wait-for-zero and alter tasks into separate queues Manfred Spraul
@ 2013-05-26  9:08 ` Manfred Spraul
  2013-05-27 18:00   ` Rik van Riel
  2013-05-26  9:08 ` [PATCH 4/4] ipc/sem.c: Rename try_atomic_semop() to perform_atomic_semop(), docu update Manfred Spraul
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Manfred Spraul @ 2013-05-26  9:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds,
	Manfred Spraul

There are two places that can contain alter operations:
- the global queue: sma->pending_alter
- the per-semaphore queues: sma->sem_base[].pending_alter.

Since one of the queues must be processed first, this causes an odd
prioritization of the wakeups:
Right now, complex operations have priority over simple ops.

The patch restores the behavior of linux <=3.0.9: The longest
waiting operation has the highest priority.

This is done by using only one queue:
- if there are complex ops, then sma->pending_alter is used.
- otherwise, the per-semaphore queues are used.

As a side effect, do_smart_update_queue() becomes much simpler:
No more goto logic.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c | 128 ++++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 88 insertions(+), 40 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index ce25da3..9ed3853 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -192,6 +192,53 @@ void __init sem_init (void)
 				IPC_SEM_IDS, sysvipc_sem_proc_show);
 }
 
+/**
+ * unmerge_queues - unmerge queues, if possible.
+ * @sma: semaphore array
+ *
+ * The function unmerges the wait queues if complex_count is 0.
+ * It must be called prior to dropping the global semaphore array lock.
+ */
+static void unmerge_queues(struct sem_array *sma)
+{
+	struct sem_queue *q, *tq;
+
+	/* complex operations still around? */
+	if (sma->complex_count)
+		return;
+	/*
+	 * We will switch back to simple mode.
+	 * Move all pending operation back into the per-semaphore
+	 * queues.
+	 */
+	list_for_each_entry_safe(q, tq, &sma->pending_alter, list) {
+		struct sem *curr;
+		curr = &sma->sem_base[q->sops[0].sem_num];
+
+		list_add_tail(&q->list, &curr->pending_alter);
+	}
+	INIT_LIST_HEAD(&sma->pending_alter);
+}
+
+/**
+ * merge_queues - Merge single semop queues into global queue
+ * @sma: semaphore array
+ *
+ * This function merges all per-semaphore queues into the global queue.
+ * It is necessary to achieve FIFO ordering for the pending single-sop
+ * operations when a multi-semop operation must sleep.
+ * Only the alter operations must be moved, the const operations can stay.
+ */
+static void merge_queues(struct sem_array *sma)
+{
+	int i;
+	for (i = 0; i < sma->sem_nsems; i++) {
+		struct sem *sem = sma->sem_base + i;
+
+		list_splice_init(&sem->pending_alter, &sma->pending_alter);
+	}
+}
+
 /*
  * If the request contains only one semaphore operation, and there are
  * no complex transactions pending, lock only the semaphore involved.
@@ -262,6 +309,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 static inline void sem_unlock(struct sem_array *sma, int locknum)
 {
 	if (locknum == -1) {
+		unmerge_queues(sma);
 		spin_unlock(&sma->sem_perm.lock);
 	} else {
 		struct sem *sem = sma->sem_base + locknum;
@@ -829,49 +877,38 @@ static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsop
 			int otime, struct list_head *pt)
 {
 	int i;
-	int progress;
 
 	otime |= do_smart_wakeup_zero(sma, sops, nsops, pt);
 
-	progress = 1;
-retry_global:
-	if (sma->complex_count) {
-		if (update_queue(sma, -1, pt)) {
-			progress = 1;
-			otime = 1;
-			sops = NULL;
-		}
-	}
-	if (!progress)
-		goto done;
-
-	if (!sops) {
-		/* No semops; something special is going on. */
-		for (i = 0; i < sma->sem_nsems; i++) {
-			if (update_queue(sma, i, pt)) {
-				otime = 1;
-				progress = 1;
+	if (!list_empty(&sma->pending_alter)) {
+		/* semaphore array uses the global queue - just process it. */
+		otime |= update_queue(sma, -1, pt);
+	} else {
+		if (!sops) {
+			/*
+			 * No sops, thus the modified semaphores are not
+			 * known. Check all.
+			 */
+			for (i = 0; i < sma->sem_nsems; i++)
+				otime |= update_queue(sma, i, pt);
+		} else {
+			/*
+			 * Check the semaphores that were increased:
+			 * - No complex ops, thus all sleeping ops are
+			 *   decrements.
+			 * - if we decreased the value, then any sleeping
+			 *   semaphore ops won't be able to run: If the
+			 *   previous value was too small, then the new
+			 *   value will be too small, too.
+			 */
+			for (i = 0; i < nsops; i++) {
+				if (sops[i].sem_op > 0) {
+					otime |= update_queue(sma,
+							sops[i].sem_num, pt);
+				}
 			}
 		}
-		goto done_checkretry;
-	}
-
-	/* Check the semaphores that were modified. */
-	for (i = 0; i < nsops; i++) {
-		if (sops[i].sem_op > 0 ||
-			(sops[i].sem_op < 0 &&
-				sma->sem_base[sops[i].sem_num].semval == 0))
-			if (update_queue(sma, sops[i].sem_num, pt)) {
-				otime = 1;
-				progress = 1;
-			}
-	}
-done_checkretry:
-	if (progress) {
-		progress = 0;
-		goto retry_global;
 	}
-done:
 	if (otime)
 		sma->sem_otime = get_seconds();
 }
@@ -1741,11 +1778,22 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		struct sem *curr;
 		curr = &sma->sem_base[sops->sem_num];
 
-		if (alter)
-			list_add_tail(&queue.list, &curr->pending_alter);
-		else
+		if (alter) {
+			if (sma->complex_count) {
+				list_add_tail(&queue.list,
+						&sma->pending_alter);
+			} else {
+
+				list_add_tail(&queue.list,
+						&curr->pending_alter);
+			}
+		} else {
 			list_add_tail(&queue.list, &curr->pending_const);
+		}
 	} else {
+		if (!sma->complex_count)
+			merge_queues(sma);
+
 		if (alter)
 			list_add_tail(&queue.list, &sma->pending_alter);
 		else
-- 
1.8.1.4



* [PATCH 4/4] ipc/sem.c: Rename try_atomic_semop() to perform_atomic_semop(), docu update
  2013-05-26  9:08 [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3 Manfred Spraul
                   ` (2 preceding siblings ...)
  2013-05-26  9:08 ` [PATCH 3/4] ipc/sem.c: Always use only one queue for alter operations Manfred Spraul
@ 2013-05-26  9:08 ` Manfred Spraul
  2013-05-27 17:59   ` Rik van Riel
  2013-05-26 17:37 ` [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3 Linus Torvalds
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Manfred Spraul @ 2013-05-26  9:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds,
	Manfred Spraul

Final cleanup: some minor points that I noticed while writing the
previous 3 patches:

1) The name try_atomic_semop() is misleading: The function performs the
   operation (if it is possible).

2) Some documentation updates.

No real code change, a rename and documentation changes.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 9ed3853..1dbb2fa 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -153,12 +153,15 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #define SEMOPM_FAST	64  /* ~ 372 bytes on stack */
 
 /*
- * linked list protection:
+ * Locking:
  *	sem_undo.id_next,
+ *	sem_array.complex_count,
 *	sem_array.pending{_alter,_const},
- *	sem_array.sem_undo: sem_lock() for read/write
+ *	sem_array.sem_undo: global sem_lock() for read/write
  *	sem_undo.proc_next: only "current" is allowed to read/write that field.
  *	
+ *	sem_array.sem_base[i].pending_{const,alter}:
+ *		global or semaphore sem_lock() for read/write
  */
 
 #define sc_semmsl	sem_ctls[0]
@@ -535,12 +538,19 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
 	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
 }
 
-/*
- * Determine whether a sequence of semaphore operations would succeed
- * all at once. Return 0 if yes, 1 if need to sleep, else return error code.
+/** perform_atomic_semop - Perform (if possible) a semaphore operation
+ * @sma: semaphore array
+ * @sops: array with operations that should be checked
+ * @nsems: number of sops
+ * @un: undo array
+ * @pid: pid that did the change
+ *
+ * Returns 0 if the operation was possible.
+ * Returns 1 if the operation is impossible, the caller must sleep.
+ * Negative values are error codes.
  */
 
-static int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
+static int perform_atomic_semop(struct sem_array *sma, struct sembuf *sops,
 			     int nsops, struct sem_undo *un, int pid)
 {
 	int result, sem_op;
@@ -723,8 +733,8 @@ static int wake_const_ops(struct sem_array *sma, int semnum,
 		q = container_of(walk, struct sem_queue, list);
 		walk = walk->next;
 
-		error = try_atomic_semop(sma, q->sops, q->nsops,
-						q->undo, q->pid);
+		error = perform_atomic_semop(sma, q->sops, q->nsops,
+						 q->undo, q->pid);
 
 		if (error <= 0) {
 			/* operation completed, remove from queue & wakeup */
@@ -835,7 +845,7 @@ again:
 		if (semnum != -1 && sma->sem_base[semnum].semval == 0)
 			break;
 
-		error = try_atomic_semop(sma, q->sops, q->nsops,
+		error = perform_atomic_semop(sma, q->sops, q->nsops,
 					 q->undo, q->pid);
 
 		/* Does q->sleeper still need to sleep? */
@@ -1658,7 +1668,6 @@ static int get_queue_result(struct sem_queue *q)
 	return error;
 }
 
-
 SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		unsigned, nsops, const struct timespec __user *, timeout)
 {
@@ -1756,7 +1765,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	if (un && un->semid == -1)
 		goto out_unlock_free;
 
-	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
+	error = perform_atomic_semop(sma, sops, nsops, un,
+					task_tgid_vnr(current));
 	if (error <= 0) {
 		if (alter && error == 0)
 			do_smart_update(sma, sops, nsops, 1, &tasks);
-- 
1.8.1.4



* Re: [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3
  2013-05-26  9:08 [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3 Manfred Spraul
                   ` (3 preceding siblings ...)
  2013-05-26  9:08 ` [PATCH 4/4] ipc/sem.c: Rename try_atomic_semop() to perform_atomic_semop(), docu update Manfred Spraul
@ 2013-05-26 17:37 ` Linus Torvalds
  2013-05-26 20:50 ` Davidlohr Bueso
  2013-05-27  2:04 ` Greg KH
  6 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2013-05-26 17:37 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Rik van Riel, LKML, Andrew Morton, Davidlohr Bueso, Hai Huang

On Sun, May 26, 2013 at 2:08 AM, Manfred Spraul
<manfred@colorfullife.com> wrote:
> I've split my patch into 4 parts:
> - 1: Fix-missing-wakeups-in-do_smart_update
> - 2: separate-wait-for-zero-and-alter-tasks
> - 3: Always-use-only-one-queue-for-alter-operations
> - 4: Rename-try_atomic_semop-to-perform_atomic

Rik, Davidlohr, acks/reviews please?

                Linus


* Re: [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3
  2013-05-26  9:08 [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3 Manfred Spraul
                   ` (4 preceding siblings ...)
  2013-05-26 17:37 ` [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3 Linus Torvalds
@ 2013-05-26 20:50 ` Davidlohr Bueso
  2013-05-26 22:59   ` Linus Torvalds
  2013-05-27 15:57   ` Manfred Spraul
  2013-05-27  2:04 ` Greg KH
  6 siblings, 2 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2013-05-26 20:50 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Rik van Riel, LKML, Andrew Morton, hhuang, Linus Torvalds

On Sun, 2013-05-26 at 11:08 +0200, Manfred Spraul wrote:
> I've split my patch into 4 parts:
> - 1: Fix-missing-wakeups-in-do_smart_update
> - 2: separate-wait-for-zero-and-alter-tasks
> - 3: Always-use-only-one-queue-for-alter-operations
> - 4: Rename-try_atomic_semop-to-perform_atomic
> 
> Linus:
> - Patch 1 should be merged immediately: It fixes bugs,
>   the current code misses wakeups.

Nothing against this.

> - Patch 2 and 3 restore the behavior of linux <=3.0.9.
>   I would propose that they are merged, too: I can't rule out that
>   changing the priority of the wakeups breaks user space apps.
> 
> - Patch 4 is trivial, no code changes at all.
>   If 2+3 are merged, then 4 should be merged, too.
> 
> I have tested patch 1 seperately and 1+2+3+4:
> With patch 1 applied, there are no more missed wakeups.
> 
> With all 4 applied, linux-3.0.10-rc1 behaves as linux <=3.0.9.
> 
> With regards to the scalability, I do not expect any degradation:
> Operations on seperate semaphores in an array remain parallelized.

Lacking my swingbench DSS environment, I ran these changes against
the semop-multi program on my laptop. For 256 threads, with Manfred's
patchset the ops/sec drops by around 7.3%.

3.10-rc2-baseline:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 325289276, ops/sec 10842975

-  18.14%          a.out  [kernel.kallsyms]  [k] SYSC_semtimedop
   - SYSC_semtimedop
      + 97.54% SyS_semtimedop
      + 2.46% SyS_semop
-   5.24%          a.out  [kernel.kallsyms]  [k] ipc_obtain_object_check
   - ipc_obtain_object_check
      + 92.37% SYSC_semtimedop
      + 7.63% SyS_semtimedop
-   4.67%          a.out  [kernel.kallsyms]  [k] _raw_spin_lock
   - _raw_spin_lock
      + 91.98% SYSC_semtimedop
      + 7.89% SyS_semtimedop


3.10-rc2-manfred:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 303314830, ops/sec 10110494

-  17.10%            a.out  [kernel.kallsyms]  [k] SYSC_semtimedop
   - SYSC_semtimedop
      + 97.47% SyS_semtimedop
      + 2.53% SyS_semop
-   4.79%            a.out  [kernel.kallsyms]  [k] ipc_obtain_object_check
   - ipc_obtain_object_check
      + 91.88% SYSC_semtimedop
      + 8.12% SyS_semtimedop
-   4.50%            a.out  [kernel.kallsyms]  [k] _raw_spin_lock
   - _raw_spin_lock
      + 90.92% SYSC_semtimedop
      + 8.95% SyS_semtimedop

3.9:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 151293714, ops/sec 5043123

-  59.73%           a.out  [kernel.kallsyms]  [k] _raw_spin_lock
   - _raw_spin_lock
      + 98.86% ipc_lock
      + 1.13% ipc_lock_check
-   6.48%           a.out  [kernel.kallsyms]  [k] sys_semtimedop
   - sys_semtimedop
      + 95.26% sys_semop
      + 4.74% system_call_fastpath


While I'm not happy about the [smallish] throughput impact, I'm not as
against this patchset as I was originally. I still think that such
changes, if applied, should go through the linux-next/3.11 phase, since much
testing is still needed. I'd also like to see how the Oracle benchmark
behaves (yes, it should be more or less faithful to how semop-multi is
impacted).

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3
  2013-05-26 20:50 ` Davidlohr Bueso
@ 2013-05-26 22:59   ` Linus Torvalds
  2013-05-27 15:57   ` Manfred Spraul
  1 sibling, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2013-05-26 22:59 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Manfred Spraul, Rik van Riel, LKML, Andrew Morton, Hai Huang

On Sun, May 26, 2013 at 1:50 PM, Davidlohr Bueso <davidlohr.bueso@hp.com> wrote:
>
> Lacking my swingbench DSS environment, I ran these changes against
> the semop-multi program on my laptop. For 256 threads, with Manfred's
> patchset the ops/sec drops by around 7.3%.

Hmm. That test program only tests simple ops, right? So the slowdown
comes from either the code simply being that much slower due to extra
complexity, or from the fair wakeups...

Anyway, I applied 1/4, we can continue to discuss the rest.

          Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3
  2013-05-26  9:08 [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3 Manfred Spraul
                   ` (5 preceding siblings ...)
  2013-05-26 20:50 ` Davidlohr Bueso
@ 2013-05-27  2:04 ` Greg KH
  2013-05-27 17:50   ` Rik van Riel
  6 siblings, 1 reply; 23+ messages in thread
From: Greg KH @ 2013-05-27  2:04 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Rik van Riel, LKML, Andrew Morton, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Sun, May 26, 2013 at 11:08:51AM +0200, Manfred Spraul wrote:
> I've split my patch into 4 parts:
> - 1: Fix-missing-wakeups-in-do_smart_update
> - 2: seperate-wait-for-zero-and-alter-tasks
> - 3: Always-use-only-one-queue-for-alter-operations
> - 4: Rename-try_atomic_semop-to-perform_atomic
> 
> Linus:
> - Patch 1 should be merged immediately: It fixes bugs,
>   the current code misses wakeups.

Should this patch be backported to 3.9 and maybe older kernels as well?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3
  2013-05-26 20:50 ` Davidlohr Bueso
  2013-05-26 22:59   ` Linus Torvalds
@ 2013-05-27 15:57   ` Manfred Spraul
  2013-05-28 20:37     ` Davidlohr Bueso
  1 sibling, 1 reply; 23+ messages in thread
From: Manfred Spraul @ 2013-05-27 15:57 UTC (permalink / raw)
  To: Davidlohr Bueso; +Cc: Rik van Riel, LKML, Andrew Morton, hhuang, Linus Torvalds

Hi Davidlohr,

On 05/26/2013 10:50 PM, Davidlohr Bueso wrote:
>
> Lacking my swingbench DSS environment, I ran these changes against
> the semop-multi program on my laptop. For 256 threads, with Manfred's
> patchset the ops/sec drops by around 7.3%.

Could you also check the performance of only patch#1?
I fear that it might be slower than all 4 together.

With regards to semop-multi:
Is this the tool?
http://marc.info/?l=linux-kernel&m=136208613626892&q=p3

I think the logic is wrong:
Locking a semaphore is a subtraction, unlocking an addition.
Thus multiple tasks can run in parallel - and the task-switch code is
never triggered.
Could you double-check that the number of context switches matches the
output?

I usually use this tool:
http://marc.info/?l=linux-kernel&m=125038376609750


--
     Manfred

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3
  2013-05-27  2:04 ` Greg KH
@ 2013-05-27 17:50   ` Rik van Riel
  2013-05-27 19:57     ` Greg KH
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2013-05-27 17:50 UTC (permalink / raw)
  To: Greg KH
  Cc: Manfred Spraul, LKML, Andrew Morton, Davidlohr Bueso, hhuang,
	Linus Torvalds

On 05/26/2013 10:04 PM, Greg KH wrote:
> On Sun, May 26, 2013 at 11:08:51AM +0200, Manfred Spraul wrote:
>> I've split my patch into 4 parts:
>> - 1: Fix-missing-wakeups-in-do_smart_update
>> - 2: seperate-wait-for-zero-and-alter-tasks
>> - 3: Always-use-only-one-queue-for-alter-operations
>> - 4: Rename-try_atomic_semop-to-perform_atomic
>>
>> Linus:
>> - Patch 1 should be merged immediately: It fixes bugs,
>>    the current code misses wakeups.
>
> Should this patch be backported to 3.9 and maybe older kernels as well?

No need, the new semaphore code (with this last regression)
did not get merged until the 3.10 merge window.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] ipc/sem.c: Fix missing wakeups in do_smart_update_queue()
  2013-05-26  9:08 ` [PATCH 1/4] ipc/sem.c: Fix missing wakeups in do_smart_update_queue() Manfred Spraul
@ 2013-05-27 17:51   ` Rik van Riel
  2013-05-27 19:53     ` Manfred Spraul
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2013-05-27 17:51 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds

On 05/26/2013 05:08 AM, Manfred Spraul wrote:
> do_smart_update_queue() is called when an operation
> (semop, semctl(SETVAL), semctl(SETALL), ...) modified the array.
> It must check which of the sleeping tasks can proceed.
>
> do_smart_update_queue() missed a few wakeups:
> - if a sleeping complex op was completed, then all per-semaphore queues
>    must be scanned - not only those that were modified by *sops
> - if a sleeping simple op proceeded, then the global queue
>    must be scanned again
>
> And:
> - the test for "(sops == NULL)" before scanning the global
>    queue is not required: If the global queue is empty, then
>    it doesn't need to be scanned - regardless of the reason
>    for calling do_smart_update_queue()
>
> The patch is not optimized, i.e. even completing a wait-for-zero
> operation causes a rescan. This is done to keep the patch as simple as
> possible.
> Avoiding unnecessary scans is implemented in the following patches.
>
> Signed-off-by: Manfred Spraul <manfred@colorfullife.com>

Very much not optimized, but we need to fix the regression.

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues
  2013-05-26  9:08 ` [PATCH 2/4] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues Manfred Spraul
@ 2013-05-27 17:57   ` Rik van Riel
  2013-06-01  9:20     ` Manfred Spraul
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2013-05-27 17:57 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds

On 05/26/2013 05:08 AM, Manfred Spraul wrote:
> Introduce seperate queues for operations that do not modify the
> semaphore values.
> Advantages:
> - Simpler logic in check_restart().
> - Faster update_queue(): Right now, all wait-for-zero operations
>    are always tested, even if the semaphore value is not 0.
> - wait-for-zero gets again priority, as in linux <=3.0.9

Whether this complexity is wanted is not for
me to decide, as I am not the ipc/sem.c
maintainer. I'll leave that up to Andrew and Linus.

The code looks correct, though.

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] ipc/sem.c: Rename try_atomic_semop() to perform_atomic_semop(), docu update
  2013-05-26  9:08 ` [PATCH 4/4] ipc/sem.c: Rename try_atomic_semop() to perform_atomic_semop(), docu update Manfred Spraul
@ 2013-05-27 17:59   ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2013-05-27 17:59 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds

On 05/26/2013 05:08 AM, Manfred Spraul wrote:
> Final cleanup: Some minor points that I noticed while writing the
> previous 3 patches
>
> 1) The name try_atomic_semop() is misleading: The function performs the
>     operation (if it is possible).
>
> 2) Some documentation updates.
>
> No real code change, a rename and documentation changes.

I like this.

> Signed-off-by: Manfred Spraul <manfred@colorfullife.com>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] ipc/sem.c: Always use only one queue for alter operations.
  2013-05-26  9:08 ` [PATCH 3/4] ipc/sem.c: Always use only one queue for alter operations Manfred Spraul
@ 2013-05-27 18:00   ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2013-05-27 18:00 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds

On 05/26/2013 05:08 AM, Manfred Spraul wrote:
> There are two places that can contain alter operations:
> - the global queue: sma->pending_alter
> - the per-semaphore queues: sma->sem_base[].pending_alter.
>
> Since one of the queues must be processed first, this causes an odd
> priorization of the wakeups:
> Right now, complex operations have priority over simple ops.
>
> The patch restores the behavior of linux <=3.0.9: The longest
> waiting operation has the highest priority.
>
> This is done by using only one queue:
> - if there are complex ops, then sma->pending_alter is used.
> - otherwise, the per-semaphore queues are used.
>
> As a side effect, do_smart_update_queue() becomes much simpler:
> No more goto logic.
>
> Signed-off-by: Manfred Spraul <manfred@colorfullife.com>

This is the one where I really wonder whether the
complexity is warranted.

Again, the code does look correct.

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] ipc/sem.c: Fix missing wakeups in do_smart_update_queue()
  2013-05-27 17:51   ` Rik van Riel
@ 2013-05-27 19:53     ` Manfred Spraul
  0 siblings, 0 replies; 23+ messages in thread
From: Manfred Spraul @ 2013-05-27 19:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds

Hi Rik,

On 05/27/2013 07:51 PM, Rik van Riel wrote:
>
> Very much not optimized, but we need to fix the regression.
>
> Acked-by: Rik van Riel <riel@redhat.com>

Thanks for the review.

I was afraid that the differences would be noticeable but they aren't:
At least on my laptop, the differences between the patches are within 
the noise:

# cat /proc/stat | grep ctxt; ./osim 2 2 1000000 0 0; cat /proc/stat | grep ctxt
runtime (secs): stock      patch 1    patch 1-4
                2.0088875  2.041084   1.997367

# cat /proc/stat | grep ctxt; ./osim 2 20 1000000 0 0; cat /proc/stat | grep ctxt
                4.32541    4.295956   4.277817

(assuming 2 cpus, otherwise increase "2" and "20" accordingly)

Here is the link to the tool:
http://marc.info/?l=linux-kernel&m=125038376609750

--
     Manfred


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3
  2013-05-27 17:50   ` Rik van Riel
@ 2013-05-27 19:57     ` Greg KH
  0 siblings, 0 replies; 23+ messages in thread
From: Greg KH @ 2013-05-27 19:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Manfred Spraul, LKML, Andrew Morton, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Mon, May 27, 2013 at 01:50:38PM -0400, Rik van Riel wrote:
> On 05/26/2013 10:04 PM, Greg KH wrote:
> >On Sun, May 26, 2013 at 11:08:51AM +0200, Manfred Spraul wrote:
> >>I've split my patch into 4 parts:
> >>- 1: Fix-missing-wakeups-in-do_smart_update
> >>- 2: seperate-wait-for-zero-and-alter-tasks
> >>- 3: Always-use-only-one-queue-for-alter-operations
> >>- 4: Rename-try_atomic_semop-to-perform_atomic
> >>
> >>Linus:
> >>- Patch 1 should be merged immediately: It fixes bugs,
> >>   the current code misses wakeups.
> >
> >Should this patch be backported to 3.9 and maybe older kernels as well?
> 
> No need, the new semaphore code (with this last regression)
> did not get merged until the 3.10 merge window.

Thanks, I didn't realize that.

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3
  2013-05-27 15:57   ` Manfred Spraul
@ 2013-05-28 20:37     ` Davidlohr Bueso
  0 siblings, 0 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2013-05-28 20:37 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Rik van Riel, LKML, Andrew Morton, hhuang, Linus Torvalds

*sigh* it seems that the email this morning wasn't sent, resending...

On Mon, 2013-05-27 at 17:57 +0200, Manfred Spraul wrote:
> Hi Davidlohr,
> 
> On 05/26/2013 10:50 PM, Davidlohr Bueso wrote:
> >
> > Lacking my swingbench DSS environment, I ran these changes against
> > the semop-multi program on my laptop. For 256 threads, with Manfred's
> > patchset the ops/sec drops by around 7.3%.
> 
> Could you also check the performance of only patch#1?
> I fear that it might be slower than all 4 together.

Performance-wise, patch 1 actually doesn't make any difference compared to
what was already upstream.

> 
> With regards to semop-multi:
> Is this the tool?
> http://marc.info/?l=linux-kernel&m=136208613626892&q=p3

Yep.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues
  2013-05-27 17:57   ` Rik van Riel
@ 2013-06-01  9:20     ` Manfred Spraul
  2013-06-01 10:03       ` Mike Galbraith
  2013-06-01 11:52       ` Manfred Spraul
  0 siblings, 2 replies; 23+ messages in thread
From: Manfred Spraul @ 2013-06-01  9:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds,
	Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 2189 bytes --]

Hi Rik,

On 05/27/2013 07:57 PM, Rik van Riel wrote:
> On 05/26/2013 05:08 AM, Manfred Spraul wrote:
>> Introduce seperate queues for operations that do not modify the
>> semaphore values.
>> Advantages:
>> - Simpler logic in check_restart().
>> - Faster update_queue(): Right now, all wait-for-zero operations
>>    are always tested, even if the semaphore value is not 0.
>> - wait-for-zero gets again priority, as in linux <=3.0.9
>
> Whether this complexity is wanted is not for
> me to decide, as I am not the ipc/sem.c
> maintainer. I'll leave that up to Andrew and Linus.
>
We can only have one of the two: either more logic or unoptimized loops.
But I don't think that the complexity increases that much; some parts
(e.g. check_restart()) even get much simpler.

But:
Mike Galbraith ran 3.10-rc3 with and without my changes on a 4-socket
64-core system, and to me the results appear to be quite slow:
- semop-multi 256 64: around 600,000 ops/sec, both with and without my
additional patches [difference around 1%].
     That is slower than my 1.4 GHz i3 with 3.9 - there I get around
1,000,000 ops/sec.
     Is that expected?
     My only idea would be thrashing from writing sma->sem_otime.

- osim [i.e. with reschedules] is much slower: around 21 us per schedule.
     Perhaps the scheduler didn't pair the threads optimally: intra-cpu
reschedules take around 2 us on my i3, inter-cpu reschedules around 16 us.

Thus I have attached my test apps.
- psem: tests sleeping semaphore operations.
     Pairs of threads perform ping-pong operations, starting with 1
semaphore and increasing up to the given max.
     The threads are either bound to the same cpu ("intra-cpu") or to
different cpus ("inter-cpu").
     The inter-cpu placement is hardcoded (distance max_cpus/2), probably
always a different socket.

- semscale performs operations that never block, i.e. like your
semop-multi.c. It does two things:
     - it adds delays in user space, to figure out the maximum number of
operations possible when user space also performs some work.
     - it uses interleaving, to force the threads onto different
cores/sockets.

Perhaps something in 3.10-rc3 breaks the scalability?

--
     Manfred

[-- Attachment #2: psem.cpp --]
[-- Type: text/x-c++src, Size: 7123 bytes --]

/*
 * psem.cpp, parallel sysv sem pingpong
 *
 * Copyright (C) 1999, 2001, 2005, 2008 by Manfred Spraul.
 *	All rights reserved except the rights granted by the GPL.
 *
 * Redistribution of this file is permitted under the terms of the GNU 
 * General Public License (GPL) version 2 or later.
 * $Header$
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <pthread.h>

#ifdef __sun
	 #include <sys/pset.h> /* P_PID, processor_bind() */
#endif

#undef VERBOSE

//////////////////////////////////////////////////////////////////////////////

static enum {
	WAITING,
	RUNNING,
	STOPPED,
} volatile g_state = WAITING;

unsigned long long *g_results;
int *g_svsem_ids;
int *g_svsem_nrs;
pthread_t *g_threads;

struct taskinfo {
	int svsem_id;
	int svsem_nr;
	int threadid;
	int cpubind;
	int sender;
};

void bind_cpu(int cpunr)
{
#if __sun
	int ret;
	ret = processor_bind(P_PID, getpid(), cpunr, NULL);
	if (ret == -1) {
		perror("bind_thread:processor_bind");
		printf(" Binding to cpu %d failed.\n", cpunr);
	}
#else
	int ret;
	cpu_set_t cpus;
	cpu_set_t v;
	CPU_ZERO(&cpus);
	CPU_SET(cpunr, &cpus);
	pthread_t self;

	self = pthread_self();

	ret = pthread_setaffinity_np(self, sizeof(cpus), &cpus);
	if (ret != 0) {
		printf("pthread_setaffinity_np failed for thread %p with error %d.\n",
				(void*)self, ret);
	}

	ret = pthread_getaffinity_np(self, sizeof(v), &v);
	if (ret != 0) {
		printf("pthread_getaffinity_np() failed for thread %p with error %d.\n",
				(void*)self, ret);
		fflush(stdout);
	}
	if (memcmp(&v, &cpus, sizeof(cpus)) != 0) {
		printf("Note: Actual affinity does not match intention: got 0x%08lx, expected 0x%08lx.\n",
			(unsigned long)v.__bits[0], (unsigned long)cpus.__bits[0]);
	}
	fflush(stdout);
#endif
}
#define DATASIZE	8

void* worker_thread(void *arg)
{
	struct taskinfo *ti = (struct taskinfo*)arg;
	unsigned long long rounds;
	int ret;

	bind_cpu(ti->cpubind);
#ifdef VERBOSE
	printf("thread %d: sysvsem %8d, off %8d type %d bound to cpu %d\n",ti->threadid,
			ti->svsem_id, ti->svsem_nr,
			ti->sender, ti->cpubind);
#endif
	
	rounds = 0;
	while(g_state == WAITING) {
#ifdef __GNUC__
#if defined(__i386__) || defined (__x86_64__)
		__asm__ __volatile__("pause": : :"memory");
#else
		__asm__ __volatile__("": : :"memory");
#endif
#endif
	}

	if (ti->sender) {
		struct sembuf sop[1];

		/* 1) insert token */
		sop[0].sem_num=ti->svsem_nr+0;
		sop[0].sem_op=1;
		sop[0].sem_flg=0;
		ret = semop(ti->svsem_id,sop,1);
	
		if (ret != 0) {
			printf("Initial semop failed, ret %d, errno %d.\n", ret, errno);
			exit(1);
		}
	}
	while(g_state == RUNNING) {
		struct sembuf sop[1];

		/* 1) retrieve token */
		sop[0].sem_num=ti->svsem_nr+ti->sender;
		sop[0].sem_op=-1;
		sop[0].sem_flg=0;
		ret = semop(ti->svsem_id,sop,1);
		if (ret != 0) {
			/* EIDRM can happen */
			if (errno == EIDRM)
				break;

			/* Some OS do not report EIDRM properly */
			if (g_state != RUNNING)
				break;
			printf("main semop failed, ret %d errno %d.\n", ret, errno);
			printf(" round %lld sop: num %d op %d flg %d.\n",
					rounds,
					sop[0].sem_num, sop[0].sem_op, sop[0].sem_flg);
			fflush(stdout);
			exit(1);
		}

		/* 2) reinsert token */
		sop[0].sem_num=ti->svsem_nr+1-ti->sender;
		sop[0].sem_op=1;
		sop[0].sem_flg=0;
		ret = semop(ti->svsem_id,sop,1);
		if (ret != 0) {
			/* EIDRM can happen */
			if (errno == EIDRM)
				break;
			/* Some OS do not report EIDRM properly */
			if (g_state != RUNNING)
				break;
			printf("main semop failed, ret %d errno %d.\n", ret, errno);
			printf(" round %lld sop: num %d op %d flg %d.\n",
					rounds,
					sop[0].sem_num, sop[0].sem_op, sop[0].sem_flg);
			fflush(stdout);
			exit(1);
		}
		rounds++;
	}
	g_results[ti->threadid] = rounds;

	pthread_exit(0);
	return NULL;
}

void init_threads(int cpu, int cpus, int sems, int shared)
{
	int ret;
	struct taskinfo *ti1, *ti2;

	ti1 = (struct taskinfo*)malloc(sizeof(struct taskinfo));
	ti2 = (struct taskinfo*)malloc(sizeof(struct taskinfo));
	if (!ti1 || !ti2) {
		printf("Could not allocate task info\n");
		exit(1);
	}
	if (cpu % sems == 0) {
		int i;
		g_svsem_ids[cpu] = semget(IPC_PRIVATE,2*sems,0777|IPC_CREAT);
		if(g_svsem_ids[cpu] == -1) {
			printf("sem array create failed.\n");
			exit(1);
		}
		for (i=0;i<sems;i++) {
			g_svsem_ids[cpu+i] = g_svsem_ids[cpu];
			g_svsem_nrs[cpu+i] = 2*i;
		}
	}

	g_results[cpu] = 0;
	g_results[cpu+cpus] = 0;

	ti1->svsem_id = g_svsem_ids[cpu];
	ti1->svsem_nr = g_svsem_nrs[cpu];
	ti1->threadid = cpu;
	ti1->cpubind = cpu;
	ti1->sender = 1;
	ti2->svsem_id = g_svsem_ids[cpu];
	ti2->svsem_nr = g_svsem_nrs[cpu];
	ti2->threadid = cpu+cpus;
	if (shared) {
		ti2->cpubind = cpu;
	} else {
		ti2->cpubind = cpus+cpu;
	}
	ti2->sender = 0;

	ret = pthread_create(&g_threads[ti1->threadid], NULL, worker_thread, ti1);
	if (ret) {
		printf(" pthread_create failed with error code %d\n", ret);
		exit(1);
	}
	ret = pthread_create(&g_threads[ti2->threadid], NULL, worker_thread, ti2);
	if (ret) {
		printf(" pthread_create failed with error code %d\n", ret);
		exit(1);
	}
}

//////////////////////////////////////////////////////////////////////////////

void do_psem(int queues, int timeout, int shared)
{
	unsigned long long totals;
	int i;
	int sems = queues; /* No need to test multiple arrays: that part scales linearly */

	g_state = WAITING;

	g_results = (unsigned long long*)malloc(sizeof(unsigned long long)*2*queues);
	g_svsem_ids = (int*)malloc(sizeof(int)*(queues+sems));
	g_svsem_nrs = (int*)malloc(sizeof(int)*(queues+sems));
	g_threads = (pthread_t*)malloc(sizeof(pthread_t)*2*queues);
	for (i=0;i<queues;i++) {
		init_threads(i, queues, sems, shared);
	}

	usleep(10000);
	g_state = RUNNING;
	sleep(timeout);
	g_state = STOPPED;
	usleep(10000);
	for (i=0;i<queues;i++) {
		int res;
		if (g_svsem_nrs[i] == 0) {
			res = semctl(g_svsem_ids[i],1,IPC_RMID,NULL);
			if (res < 0) {
				printf("semctl(IPC_RMID) failed for %d, errno%d.\n",
					g_svsem_ids[i], errno);
			}
		}
	}
	for (i=0;i<2*queues;i++)
		pthread_join(g_threads[i], NULL);

#ifdef VERBOSE
	printf("Result matrix:\n");
#endif
	totals = 0;
	for (i=0;i<queues;i++) {
#ifdef VERBOSE
		printf("  Thread %3d: %8lld     %3d: %8lld\n",
				i, g_results[i], i+queues, g_results[i+queues]);
#endif
		totals += g_results[i] + g_results[i+queues];
	}
	printf("Queues: %d (%s) %lld in %d secs\n", queues, shared ? "intra-cpu" : "inter-cpu",
			totals, timeout);

	free(g_results);
	free(g_svsem_ids);
	free(g_svsem_nrs);
	free(g_threads);
}

//////////////////////////////////////////////////////////////////////////////

int main(int argc, char **argv)
{
	int max_cpus;
	int timeout;
	int i;

	printf("psem [max cpus] [timeout]\n");
	if (argc != 3) {
		printf(" Invalid parameters.\n");
		return 0;
	}
	max_cpus = atoi(argv[1]);
	timeout = atoi(argv[2]);
	/* Intra-cpu */
	for (i=1;i<=max_cpus;i++) {
		usleep(10000);
		do_psem(i, timeout, 1);
	}
	/* Inter-cpu */
	for (i=1;i<=max_cpus/2;i++) {
		usleep(10000);
		do_psem(i, timeout, 0);
	}

}

[-- Attachment #3: semscale.cpp --]
[-- Type: text/x-c++src, Size: 7757 bytes --]

/*
 * semscale.cpp - sysv scaling test
 *
 * Copyright (C) 1999, 2001, 2005, 2008, 2010 by Manfred Spraul.
 *	All rights reserved except the rights granted by the GPL.
 *
 * Redistribution of this file is permitted under the terms of the GNU 
 * General Public License (GPL) version 2 or later.
 * $Header$
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <pthread.h>

#ifdef __sun
	 #include <sys/pset.h> /* P_PID, processor_bind() */
#endif

#define VERBOSE
#undef VERBOSE

//////////////////////////////////////////////////////////////////////////////

#define DELAY_LOOPS	20

static volatile int g_numerator = 12345678;
static volatile int g_denominator = 123456;

unsigned long long do_delay(int loops)
{
	unsigned long long sum;
	int i, j;

	sum = loops;
	for (i=0;i<loops;i++) {
		for (j=0;j<DELAY_LOOPS*loops;j++) {
			sum += g_numerator/g_denominator;
		}
	}
	return sum;
}

//////////////////////////////////////////////////////////////////////////////

#define DELAY_10MS	(10000)

static enum {
	WAITING,
	RUNNING,
	STOPPED,
} volatile g_state = WAITING;

unsigned long long *g_results;
int g_svsem_id;
int g_max_cpus;
int *g_svsem_nrs;
pthread_t *g_threads;

struct taskinfo {
	int svsem_id;
	int svsem_nr;
	int threadid;
	int cpubind;
	int interleave;
	int delay;
};

int get_cpunr(int cpunr, int interleave)
{
	int off = 0;
	int ret = 0;

#ifdef VERBOSE
	printf("get_cpunr %p: cpunr %d max_cpu %d interleave %d.\n",
		(void*)pthread_self(), cpunr, g_max_cpus, interleave);
#endif

	while (cpunr>0) {
		ret += interleave;
		if (ret >=g_max_cpus) {
			off++;
			ret = off;
		}
		cpunr--;
	}
#ifdef VERBOSE
	printf("get_cpunr %p: result %d.\n", (void*)pthread_self(), ret);
#endif
	return ret;
}

void bind_cpu(int cpunr)
{
	int ret;
#if __sun
	ret = processor_bind(P_PID, getpid(), cpunr, NULL);
	if (ret == -1) {
		perror("bind_thread:processor_bind");
		printf(" Binding to cpu %d failed.\n", cpunr);
	}
#else
	cpu_set_t cpus;
	cpu_set_t v;
	CPU_ZERO(&cpus);
	CPU_SET(cpunr, &cpus);
	pthread_t self;

	self = pthread_self();

	ret = pthread_setaffinity_np(self, sizeof(cpus), &cpus);
	if (ret != 0) {
		printf("pthread_setaffinity_np failed for thread %p with error %d.\n",
				(void*)self, ret);
	}

	ret = pthread_getaffinity_np(self, sizeof(v), &v);
	if (ret != 0) {
		printf("pthread_getaffinity_np() failed for thread %p with error %d.\n",
				(void*)self, ret);
		fflush(stdout);
	}
	if (memcmp(&v, &cpus, sizeof(cpus)) != 0) {
		printf("Note: Actual affinity does not match intention: got 0x%08lx, expected 0x%08lx.\n",
			(unsigned long)v.__bits[0], (unsigned long)cpus.__bits[0]);
	}
	fflush(stdout);
#endif
}

void* worker_thread(void *arg)
{
	struct taskinfo *ti = (struct taskinfo*)arg;
	unsigned long long rounds;
	int ret;
	int cpu = get_cpunr(ti->cpubind, ti->interleave);

	bind_cpu(cpu);
#ifdef VERBOSE
	printf("thread %d: sysvsem %8d, off %8d bound to cpu %d\n",ti->threadid,
			ti->svsem_id, ti->svsem_nr,cpu);
#endif
	
	rounds = 0;
	while(g_state == WAITING) {
#ifdef __GNUC__
#if defined(__i386__) || defined (__x86_64__)
		__asm__ __volatile__("pause": : :"memory");
#else
		__asm__ __volatile__("": : :"memory");
#endif
#endif
	}

	while(g_state == RUNNING) {
		struct sembuf sop[1];

		/* 1) check if the semaphore value is 0 */
		sop[0].sem_num=ti->svsem_nr;
		sop[0].sem_op=0;
		sop[0].sem_flg=0;
		ret = semop(ti->svsem_id,sop,1);
		if (ret != 0) {
			/* EIDRM can happen */
			if (errno == EIDRM)
				break;

			printf("main semop failed, ret %d errno %d.\n", ret, errno);

			/* Some OS do not report EIDRM properly */
			if (g_state != RUNNING)
				break;
			printf(" round %lld sop: num %d op %d flg %d.\n",
					rounds,
					sop[0].sem_num, sop[0].sem_op, sop[0].sem_flg);
			fflush(stdout);
			exit(1);
		}
		if (ti->delay)
			do_delay(ti->delay);
		rounds++;
	}
	g_results[ti->threadid] = rounds;

	pthread_exit(0);
	return NULL;
}

void init_threads(int cpu, int cpus, int delay, int interleave)
{
	int ret;
	struct taskinfo *ti;

	ti = (struct taskinfo*)malloc(sizeof(struct taskinfo));
	if (!ti) {
		printf("Could not allocate task info\n");
		exit(1);
	}
	if (cpu == 0) {
		int i;
		g_svsem_id = semget(IPC_PRIVATE,cpus,0777|IPC_CREAT);
		if(g_svsem_id == -1) {
			printf("sem array create failed.\n");
			exit(1);
		}
		for (i=0;i<cpus;i++)
			g_svsem_nrs[i] = i;
	}

	g_results[cpu] = 0;

	ti->svsem_id = g_svsem_id;
	ti->svsem_nr = g_svsem_nrs[cpu];
	ti->threadid = cpu;
	ti->cpubind = cpu;
	ti->interleave = interleave;
	ti->delay = delay;

	ret = pthread_create(&g_threads[ti->threadid], NULL, worker_thread, ti);
	if (ret) {
		printf(" pthread_create failed with error code %d\n", ret);
		exit(1);
	}
}

//////////////////////////////////////////////////////////////////////////////

unsigned long long do_psem(int cpus, int timeout, int delay, int interleave)
{
	unsigned long long totals;
	int i;
	int res;

	g_state = WAITING;

	g_results = (unsigned long long*)malloc(sizeof(unsigned long long)*cpus);
	g_svsem_nrs = (int*)malloc(sizeof(int)*cpus);
	g_threads = (pthread_t*)malloc(sizeof(pthread_t)*cpus);

	for (i=0;i<cpus;i++)
		init_threads(i, cpus, delay, interleave);

	usleep(DELAY_10MS);
	g_state = RUNNING;
	sleep(timeout);
	g_state = STOPPED;
	usleep(DELAY_10MS);

	res = semctl(g_svsem_id, 0, IPC_RMID, NULL);
	if (res < 0) {
		printf("semctl(IPC_RMID) failed for %d, errno %d.\n",
			g_svsem_id, errno);
	}

	for (i=0;i<cpus;i++)
		pthread_join(g_threads[i], NULL);

#ifdef VERBOSE
	printf("Result matrix:\n");
#endif
	totals = 0;
	for (i=0;i<cpus;i++) {
#ifdef VERBOSE
		printf("  Thread %3d: %8lld\n",
				i, g_results[i]);
#endif
		totals += g_results[i];
	}
	printf("Cpus %d, interleave %d delay %d: %lld in %d secs\n",
			cpus, interleave, delay,
			totals, timeout);

	free(g_results);
	free(g_svsem_nrs);
	free(g_threads);

	return totals;
}

//////////////////////////////////////////////////////////////////////////////

int main(int argc, char **argv)
{
	int timeout;
	unsigned long long totals, max_totals;
	int max_interleave;
	int fastest;
	int i, j, k;

	printf("semscale <timeout> [max interleave] [max cpus]\n");

	if (argc < 2) {
		printf(" Invalid parameters.\n");
		return 0;
	}
	timeout = atoi(argv[1]);

	if (argc > 2) {
		max_interleave = atoi(argv[2]);
	} else {
		max_interleave = 16;
	}

	if (argc > 3) {
		g_max_cpus = atoi(argv[3]);
	} else {
		cpu_set_t cpus;
		int ret;

		ret = pthread_getaffinity_np(pthread_self(), sizeof(cpus), &cpus);
		if (ret != 0) {
			printf("pthread_getaffinity_np() failed with error %d.\n", ret);
			fflush(stdout);
			g_max_cpus = 4;
		} else {
			g_max_cpus = 0;
			while (CPU_ISSET(g_max_cpus, &cpus))
				g_max_cpus++;
		}
		if (g_max_cpus == 0) {
			printf("Autodetection of the number of cpus failed.\n");
			return 1;
		}
		printf("Autodetected number of cpus: %d.\n", g_max_cpus);
	}
	if (g_max_cpus >= 2) {
		while (max_interleave >= g_max_cpus) {
			max_interleave = max_interleave/2;
		}
	} else {
		max_interleave = 1;
	}
	printf("Adjusted max interleave: %d.\n", max_interleave);

	for (k=1;k<=max_interleave;k*=2) {
		for (j=0;;) {
			max_totals = 0;
			fastest = 0;
			for (i=1;;) {
				totals = do_psem(i, timeout, j, k);
				if (totals > max_totals) {
					max_totals = totals;
					fastest = i;
				} else {
					if (totals < 0.5*max_totals && i > 1.5*fastest)
						break;
				}
				if (i== g_max_cpus)
					break;
				i += i * 0.2 + 1;
				if (i > g_max_cpus)
					i = g_max_cpus;
			}
			printf("Interleave %d, delay %d: Max total: %lld with %d cpus\n",
					k, j, max_totals, fastest);
			if (fastest == g_max_cpus)
				break;
			j += j * 0.2 + 1;
		}
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues
  2013-06-01  9:20     ` Manfred Spraul
@ 2013-06-01 10:03       ` Mike Galbraith
  2013-06-01 10:05         ` Mike Galbraith
  2013-06-01 11:04         ` Mike Galbraith
  2013-06-01 11:52       ` Manfred Spraul
  1 sibling, 2 replies; 23+ messages in thread
From: Mike Galbraith @ 2013-06-01 10:03 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Rik van Riel, LKML, Andrew Morton, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Sat, 2013-06-01 at 11:20 +0200, Manfred Spraul wrote: 
> Hi Rik,
> 
> On 05/27/2013 07:57 PM, Rik van Riel wrote:
> > On 05/26/2013 05:08 AM, Manfred Spraul wrote:
> >> Introduce seperate queues for operations that do not modify the
> >> semaphore values.
> >> Advantages:
> >> - Simpler logic in check_restart().
> >> - Faster update_queue(): Right now, all wait-for-zero operations
> >>    are always tested, even if the semaphore value is not 0.
> >> - wait-for-zero gets again priority, as in linux <=3.0.9
> >
> > Whether this complexity is wanted is not for
> > me to decide, as I am not the ipc/sem.c
> > maintainer. I'll leave that up to Andrew and Linus.
> >
> We can have only one: Either more logic or unoptimized loops.
> But I don't think that the complexity increases that much, e.g. some 
> parts (e.g. check_restart()) get much simpler.
> 
> But:
> Mike Galbraith ran 3.10-rc3 with and without my changes on a 4-socket 
> 64-core system, and for me the results appears to be quite slow:
> - semop-multi 256 64: around 600.000 ops/sec, both with and without my 
> additional patches [difference around 1%]
>      That is slower than my 1.4 GHz i3 with 3.9 - I get around 1.000.000 
> ops/sec
>      Is that expected?
>      My only idea would be trashing from writing sma->sem_otime.
> 
> - osim [i.e.: with reschedules] is much slower: around 21 us per schedule.
>      Perhaps the scheduler didn't pair the threads optimally: intra-cpu 
> reschedules
>      take around 2 us on my i3, inter-cpu reschedules around 16 us.

I got -rt backports working, and it's faster at semop-multi, but one
hell of a lot slower at osim.  The goto again loop in sem_lock() is a
livelock in -rt, I had to do that differently.

-rtm is with our patches added, -rt is backport without.

vogelweide:/abuild/mike/:[0]# uname -r
3.8.13-rt9-rtm
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 33553800, ops/sec 1118460
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 33344598, ops/sec 1111486
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 33655348, ops/sec 1121844
vogelweide:/abuild/mike/:[0]# 
vogelweide:/abuild/mike/:[130]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 12.296215 seconds for 1000192 loops
per loop execution time: 12.293 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 11.613663 seconds for 1000192 loops
per loop execution time: 11.611 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 13.755537 seconds for 1000192 loops
per loop execution time: 13.752 usec

vogelweide:/abuild/mike/:[0]# uname -r
3.8.13-rt9-rt
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 37343656, ops/sec 1244788
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 37226496, ops/sec 1240883
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 36730816, ops/sec 1224360
vogelweide:/abuild/mike/:[0]# 
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 12.676632 seconds for 1000192 loops
per loop execution time: 12.674 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 14.166756 seconds for 1000192 loops
per loop execution time: 14.164 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 15.116200 seconds for 1000192 loops
per loop execution time: 15.113 usec



* Re: [PATCH 2/4] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues
  2013-06-01 10:03       ` Mike Galbraith
@ 2013-06-01 10:05         ` Mike Galbraith
  2013-06-01 11:04         ` Mike Galbraith
  1 sibling, 0 replies; 23+ messages in thread
From: Mike Galbraith @ 2013-06-01 10:05 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Rik van Riel, LKML, Andrew Morton, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Sat, 2013-06-01 at 12:03 +0200, Mike Galbraith wrote:

> -rtm is with our patches added, -rt is backport without.

'y' key in 'your' not fully depressed.



* Re: [PATCH 2/4] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues
  2013-06-01 10:03       ` Mike Galbraith
  2013-06-01 10:05         ` Mike Galbraith
@ 2013-06-01 11:04         ` Mike Galbraith
  1 sibling, 0 replies; 23+ messages in thread
From: Mike Galbraith @ 2013-06-01 11:04 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Rik van Riel, LKML, Andrew Morton, Davidlohr Bueso, hhuang,
	Linus Torvalds

On Sat, 2013-06-01 at 12:03 +0200, Mike Galbraith wrote:

> -rtm is with our patches added, -rt is backport without.

Now same tree, turn off PREEMPT_FULL so the kernel is plain old PREEMPT.
Again, -rt is without your patches, -rtm with, followed by backing out
all ipc backports to end up at pre scalability series to show what the
set gains on this box.

vogelweide:/abuild/mike/:[0]# uname -r
3.8.13-rt9-rt
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 19102372, ops/sec 636745
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 19877668, ops/sec 662588
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 19610632, ops/sec 653687
vogelweide:/abuild/mike/:[0]# 
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 1.893317 seconds for 1000192 loops
per loop execution time: 1.892 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 1.976055 seconds for 1000192 loops
per loop execution time: 1.975 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 1.651162 seconds for 1000192 loops
per loop execution time: 1.650 usec


vogelweide:/abuild/mike/:[0]# uname -r
3.8.13-rt9-rtm
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 22053968, ops/sec 735132
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 21295310, ops/sec 709843
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 21458002, ops/sec 715266
vogelweide:/abuild/mike/:[0]# 
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 1.858765 seconds for 1000192 loops
per loop execution time: 1.858 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 1.806943 seconds for 1000192 loops
per loop execution time: 1.806 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 1.854004 seconds for 1000192 loops
per loop execution time: 1.853 usec

All ipc patches backed out.

vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 8885026, ops/sec 296167
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 8791676, ops/sec 293055
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 8301092, ops/sec 276703
vogelweide:/abuild/mike/:[0]# 
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 10.750488 seconds for 1000192 loops
per loop execution time: 10.748 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 10.737264 seconds for 1000192 loops
per loop execution time: 10.735 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 10.756395 seconds for 1000192 loops
per loop execution time: 10.754 usec





* Re: [PATCH 2/4] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues
  2013-06-01  9:20     ` Manfred Spraul
  2013-06-01 10:03       ` Mike Galbraith
@ 2013-06-01 11:52       ` Manfred Spraul
  1 sibling, 0 replies; 23+ messages in thread
From: Manfred Spraul @ 2013-06-01 11:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, Andrew Morton, Davidlohr Bueso, hhuang, Linus Torvalds,
	Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 559 bytes --]

Hi all,

On 06/01/2013 11:20 AM, Manfred Spraul wrote:
>
> - osim [i.e.: with reschedules] is much slower: around 21 us per 
> schedule.
>     Perhaps the scheduler didn't pair the threads optimally: intra-cpu 
> reschedules
>     take around 2 us on my i3, inter-cpu reschedules around 16 us.
>
I mixed up numbers.
osim reports around 1.6 us for the 64-thread system.
It is still odd that it is only a factor 1.5 faster than my 4-thread i3,
but at least it is not slower.

Anyway: I have attached the table of all results that I have so far.

--
     Manfred

[-- Attachment #2: Eval-ipc.ods --]
[-- Type: application/vnd.oasis.opendocument.spreadsheet, Size: 22645 bytes --]


end of thread, other threads:[~2013-06-01 11:52 UTC | newest]

Thread overview: 23+ messages
2013-05-26  9:08 [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3 Manfred Spraul
2013-05-26  9:08 ` [PATCH 1/4] ipc/sem.c: Fix missing wakeups in do_smart_update_queue() Manfred Spraul
2013-05-27 17:51   ` Rik van Riel
2013-05-27 19:53     ` Manfred Spraul
2013-05-26  9:08 ` [PATCH 2/4] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues Manfred Spraul
2013-05-27 17:57   ` Rik van Riel
2013-06-01  9:20     ` Manfred Spraul
2013-06-01 10:03       ` Mike Galbraith
2013-06-01 10:05         ` Mike Galbraith
2013-06-01 11:04         ` Mike Galbraith
2013-06-01 11:52       ` Manfred Spraul
2013-05-26  9:08 ` [PATCH 3/4] ipc/sem.c: Always use only one queue for alter operations Manfred Spraul
2013-05-27 18:00   ` Rik van Riel
2013-05-26  9:08 ` [PATCH 4/4] ipc/sem.c: Rename try_atomic_semop() to perform_atomic_semop(), docu update Manfred Spraul
2013-05-27 17:59   ` Rik van Riel
2013-05-26 17:37 ` [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3 Linus Torvalds
2013-05-26 20:50 ` Davidlohr Bueso
2013-05-26 22:59   ` Linus Torvalds
2013-05-27 15:57   ` Manfred Spraul
2013-05-28 20:37     ` Davidlohr Bueso
2013-05-27  2:04 ` Greg KH
2013-05-27 17:50   ` Rik van Riel
2013-05-27 19:57     ` Greg KH
