* [PATCH RT 0/3] hwlat-detector: Have it actually find hardware latency
@ 2013-08-19 21:33 Steven Rostedt
  2013-08-19 21:33 ` [PATCH RT 1/3] hwlat-detector: Update hwlat_detector to add outer loop detection Steven Rostedt
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Steven Rostedt @ 2013-08-19 21:33 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Clark Williams

This patch series fixes up hwlat-detector to check the entire time
window for a latency, instead of only the quick check between
two time stamps.

It also uses trace_clock_local() if available, which is much lighter
weight than ktime_get(), which might produce false positives. As
the trace clock is already used to detect irq latencies, it's suitable for
hardware latency.

Finally, stop machine is removed and the test is run by a single thread.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RT 1/3] hwlat-detector: Update hwlat_detector to add outer loop detection
  2013-08-19 21:33 [PATCH RT 0/3] hwlat-detector: Have it actually find hardware latency Steven Rostedt
@ 2013-08-19 21:33 ` Steven Rostedt
  2013-08-19 21:33 ` [PATCH RT 2/3] hwlat-detector: Use trace_clock_local if available Steven Rostedt
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2013-08-19 21:33 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Clark Williams, Steven Rostedt

[-- Attachment #1: hwlat-outerloop-update.patch --]
[-- Type: text/plain, Size: 3704 bytes --]

The hwlat_detector reads two timestamps in a row, then reports any
gap between those calls. The problem is, it misses everything between
the second reading of the time stamp and the first reading of the time stamp
in the next loop. That's where most of the time is spent, which means
it will most likely miss any hardware latency that does occur. This
defeats the purpose.

By also testing the first time stamp against the previous loop's second
time stamp (the outer loop), we are more likely to find a latency.

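In other words, the outer-loop check measures the gap from one
iteration's second timestamp to the next iteration's first timestamp.
A minimal sketch of the idea (illustrative only, simplified from the
actual patch below; the function name is made up):

	/*
	 * Illustrative sketch: track both the inner gap (t1 -> t2) and the
	 * outer gap (previous t2 -> next t1) over one sample width.
	 */
	static void hwlat_sample_sketch(u64 width_us, s64 *inner_max, s64 *outer_max)
	{
		ktime_t start, t1, t2, last_t2;
		s64 diff;

		last_t2.tv64 = 0;
		*inner_max = *outer_max = 0;
		start = ktime_get();

		do {
			t1 = ktime_get();
			t2 = ktime_get();

			if (last_t2.tv64) {
				/* outer loop: previous t2 to this t1 */
				diff = ktime_to_us(ktime_sub(t1, last_t2));
				if (diff > *outer_max)
					*outer_max = diff;
			}
			last_t2 = t2;

			/* inner loop: t1 to t2 */
			diff = ktime_to_us(ktime_sub(t2, t1));
			if (diff > *inner_max)
				*inner_max = diff;
		} while (ktime_to_us(ktime_sub(t2, start)) <= width_us);
	}
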
Setting the threshold to 1, here's what the report now looks like:

1347415723.0232202770	0	2
1347415725.0234202822	0	2
1347415727.0236202875	0	2
1347415729.0238202928	0	2
1347415731.0240202980	0	2
1347415734.0243203061	0	2
1347415736.0245203113	0	2
1347415738.0247203166	2	0
1347415740.0249203219	0	3
1347415742.0251203272	0	3
1347415743.0252203299	0	3
1347415745.0254203351	0	2
1347415747.0256203404	0	2
1347415749.0258203457	0	2
1347415751.0260203510	0	2
1347415754.0263203589	0	2
1347415756.0265203642	0	2
1347415758.0267203695	0	2
1347415760.0269203748	0	2
1347415762.0271203801	0	2
1347415764.0273203853	2	0

There's some hardware latency that takes 2 microseconds to run.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>

Index: linux-rt.git/drivers/misc/hwlat_detector.c
===================================================================
--- linux-rt.git.orig/drivers/misc/hwlat_detector.c
+++ linux-rt.git/drivers/misc/hwlat_detector.c
@@ -143,6 +143,7 @@ static void detector_exit(void);
 struct sample {
 	u64		seqnum;		/* unique sequence */
 	u64		duration;	/* ktime delta */
+	u64		outer_duration;	/* ktime delta (outer loop) */
 	struct timespec	timestamp;	/* wall time */
 	unsigned long   lost;
 };
@@ -219,11 +220,13 @@ static struct sample *buffer_get_sample(
  */
 static int get_sample(void *unused)
 {
-	ktime_t start, t1, t2;
+	ktime_t start, t1, t2, last_t2;
 	s64 diff, total = 0;
 	u64 sample = 0;
+	u64 outer_sample = 0;
 	int ret = 1;
 
+	last_t2.tv64 = 0;
 	start = ktime_get(); /* start timestamp */
 
 	do {
@@ -231,7 +234,22 @@ static int get_sample(void *unused)
 		t1 = ktime_get();	/* we'll look for a discontinuity */
 		t2 = ktime_get();
 
+		if (last_t2.tv64) {
+			/* Check the delta from the outer loop (t2 to next t1) */
+			diff = ktime_to_us(ktime_sub(t1, last_t2));
+			/* This shouldn't happen */
+			if (diff < 0) {
+				printk(KERN_ERR BANNER "time running backwards\n");
+				goto out;
+			}
+			if (diff > outer_sample)
+				outer_sample = diff;
+		}
+		last_t2 = t2;
+
 		total = ktime_to_us(ktime_sub(t2, start)); /* sample width */
+
+		/* This checks the inner loop (t1 to t2) */
 		diff = ktime_to_us(ktime_sub(t2, t1));     /* current diff */
 
 		/* This shouldn't happen */
@@ -246,12 +264,13 @@ static int get_sample(void *unused)
 	} while (total <= data.sample_width);
 
 	/* If we exceed the threshold value, we have found a hardware latency */
-	if (sample > data.threshold) {
+	if (sample > data.threshold || outer_sample > data.threshold) {
 		struct sample s;
 
 		data.count++;
 		s.seqnum = data.count;
 		s.duration = sample;
+		s.outer_duration = outer_sample;
 		s.timestamp = CURRENT_TIME;
 		__buffer_add_sample(&s);
 
@@ -738,10 +757,11 @@ static ssize_t debug_sample_fread(struct
 		}
 	}
 
-	len = snprintf(buf, sizeof(buf), "%010lu.%010lu\t%llu\n",
-		      sample->timestamp.tv_sec,
-		      sample->timestamp.tv_nsec,
-		      sample->duration);
+	len = snprintf(buf, sizeof(buf), "%010lu.%010lu\t%llu\t%llu\n",
+		       sample->timestamp.tv_sec,
+		       sample->timestamp.tv_nsec,
+		       sample->duration,
+		       sample->outer_duration);
 
 
 	/* handling partial reads is more trouble than it's worth */


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RT 2/3] hwlat-detector: Use trace_clock_local if available
  2013-08-19 21:33 [PATCH RT 0/3] hwlat-detector: Have it actually find hardware latency Steven Rostedt
  2013-08-19 21:33 ` [PATCH RT 1/3] hwlat-detector: Update hwlat_detector to add outer loop detection Steven Rostedt
@ 2013-08-19 21:33 ` Steven Rostedt
  2013-08-19 21:33 ` [PATCH RT 3/3] hwlat-detector: Use thread instead of stop machine Steven Rostedt
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2013-08-19 21:33 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Clark Williams, Steven Rostedt

[-- Attachment #1: hwlat-use-trace-clock-3.10.patch --]
[-- Type: text/plain, Size: 2807 bytes --]

As ktime_get() calls into the timing code, which does a read_seq(), it
may be affected by other CPUs that touch that lock. To remove this
dependency, use trace_clock_local(), which is already exported
for module use. If CONFIG_TRACING is enabled, use that as the clock,
otherwise use ktime_get().

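For context, the contention comes from the seqlock-style read loop in the
timekeeping code: readers retry while a writer (a timekeeping update on
another CPU) holds the sequence. A rough sketch of that read pattern, using
hypothetical names (tk_seq, base_ns, clocksource_delta_ns) rather than the
real timekeeping internals:

	/* Rough shape of a seqcount-protected clock read (hypothetical names). */
	static u64 read_clock_sketch(void)
	{
		unsigned int seq;
		u64 ns;

		do {
			seq = read_seqcount_begin(&tk_seq);	/* hypothetical seqcount */
			ns = base_ns + clocksource_delta_ns();	/* hypothetical helper */
		} while (read_seqcount_retry(&tk_seq, seq));	/* retry if a writer intervened */

		return ns;
	}
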
Signed-off-by: Steven Rostedt <srostedt@redhat.com>

Index: linux-rt.git/drivers/misc/hwlat_detector.c
===================================================================
--- linux-rt.git.orig/drivers/misc/hwlat_detector.c
+++ linux-rt.git/drivers/misc/hwlat_detector.c
@@ -51,6 +51,7 @@
 #include <linux/version.h>
 #include <linux/delay.h>
 #include <linux/slab.h>
+#include <linux/trace_clock.h>
 
 #define BUF_SIZE_DEFAULT	262144UL		/* 8K*(sizeof(entry)) */
 #define BUF_FLAGS		(RB_FL_OVERWRITE)	/* no block on full */
@@ -211,6 +212,21 @@ static struct sample *buffer_get_sample(
 	return sample;
 }
 
+#ifndef CONFIG_TRACING
+#define time_type	ktime_t
+#define time_get()	ktime_get()
+#define time_to_us(x)	ktime_to_us(x)
+#define time_sub(a, b)	ktime_sub(a, b)
+#define init_time(a, b)	(a).tv64 = b
+#define time_u64(a)	(a).tv64
+#else
+#define time_type	u64
+#define time_get()	trace_clock_local()
+#define time_to_us(x)	((x) / 1000)
+#define time_sub(a, b)	((a) - (b))
+#define init_time(a, b)	a = b
+#define time_u64(a)	a
+#endif
 /**
  * get_sample - sample the CPU TSC and look for likely hardware latencies
  * @unused: This is not used but is a part of the stop_machine API
@@ -220,23 +236,23 @@ static struct sample *buffer_get_sample(
  */
 static int get_sample(void *unused)
 {
-	ktime_t start, t1, t2, last_t2;
+	time_type start, t1, t2, last_t2;
 	s64 diff, total = 0;
 	u64 sample = 0;
 	u64 outer_sample = 0;
 	int ret = 1;
 
-	last_t2.tv64 = 0;
-	start = ktime_get(); /* start timestamp */
+	init_time(last_t2, 0);
+	start = time_get(); /* start timestamp */
 
 	do {
 
-		t1 = ktime_get();	/* we'll look for a discontinuity */
-		t2 = ktime_get();
+		t1 = time_get();	/* we'll look for a discontinuity */
+		t2 = time_get();
 
-		if (last_t2.tv64) {
+		if (time_u64(last_t2)) {
 			/* Check the delta from the outer loop (t2 to next t1) */
-			diff = ktime_to_us(ktime_sub(t1, last_t2));
+			diff = time_to_us(time_sub(t1, last_t2));
 			/* This shouldn't happen */
 			if (diff < 0) {
 				printk(KERN_ERR BANNER "time running backwards\n");
@@ -247,10 +263,10 @@ static int get_sample(void *unused)
 		}
 		last_t2 = t2;
 
-		total = ktime_to_us(ktime_sub(t2, start)); /* sample width */
+		total = time_to_us(time_sub(t2, start)); /* sample width */
 
 		/* This checks the inner loop (t1 to t2) */
-		diff = ktime_to_us(ktime_sub(t2, t1));     /* current diff */
+		diff = time_to_us(time_sub(t2, t1));     /* current diff */
 
 		/* This shouldn't happen */
 		if (diff < 0) {


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RT 3/3] hwlat-detector: Use thread instead of stop machine
  2013-08-19 21:33 [PATCH RT 0/3] hwlat-detector: Have it actually find hardware latency Steven Rostedt
  2013-08-19 21:33 ` [PATCH RT 1/3] hwlat-detector: Update hwlat_detector to add outer loop detection Steven Rostedt
  2013-08-19 21:33 ` [PATCH RT 2/3] hwlat-detector: Use trace_clock_local if available Steven Rostedt
@ 2013-08-19 21:33 ` Steven Rostedt
  2013-08-21 15:59 ` [PATCH RT 0/3] hwlat-detector: Have it actually find hardware latency Sebastian Andrzej Siewior
  2013-08-30  5:57 ` [PATCH RT] hwlat-detector: Don't ignore threshold module parameter Mike Galbraith
  4 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2013-08-19 21:33 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Clark Williams

[-- Attachment #1: hwlat-thread.patch --]
[-- Type: text/plain, Size: 6116 bytes --]

There's no reason to use stop machine to search for hardware latency.
Simply disabling interrupts while running the loop is enough to catch
anything that comes in that wasn't prevented by interrupts being
off, which is exactly the check stop machine was providing.

Instead of using stop machine, just have the thread disable interrupts
while it checks for hardware latency.

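Roughly, each sampling window then looks like the following (a simplified
sketch of the end result, not the literal driver code; see the diff below
for the real change):

	/* Simplified sketch of one pass of the sampling kthread. */
	static int hwlat_kthread_sketch(void *unused)
	{
		u64 interval_ms;
		int ret;

		while (!kthread_should_stop()) {
			mutex_lock(&data.lock);

			local_irq_disable();	/* replaces stop_machine() */
			ret = get_sample();
			local_irq_enable();

			if (ret > 0)		/* a latency above the threshold was found */
				wake_up(&data.wq);

			/* sleep for the remainder of the sample window */
			interval_ms = (data.sample_window - data.sample_width) / USEC_PER_MSEC;

			mutex_unlock(&data.lock);

			if (msleep_interruptible(interval_ms))
				break;
		}
		return 0;
	}
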
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


Index: linux-rt.git/drivers/misc/hwlat_detector.c
===================================================================
--- linux-rt.git.orig/drivers/misc/hwlat_detector.c
+++ linux-rt.git/drivers/misc/hwlat_detector.c
@@ -41,7 +41,6 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/ring_buffer.h>
-#include <linux/stop_machine.h>
 #include <linux/time.h>
 #include <linux/hrtimer.h>
 #include <linux/kthread.h>
@@ -107,7 +106,6 @@ struct data;					/* Global state */
 /* Sampling functions */
 static int __buffer_add_sample(struct sample *sample);
 static struct sample *buffer_get_sample(struct sample *sample);
-static int get_sample(void *unused);
 
 /* Threading and state */
 static int kthread_fn(void *unused);
@@ -149,7 +147,7 @@ struct sample {
 	unsigned long   lost;
 };
 
-/* keep the global state somewhere. Mostly used under stop_machine. */
+/* keep the global state somewhere. */
 static struct data {
 
 	struct mutex lock;		/* protect changes */
@@ -172,7 +170,7 @@ static struct data {
  * @sample: The new latency sample value
  *
  * This receives a new latency sample and records it in a global ring buffer.
- * No additional locking is used in this case - suited for stop_machine use.
+ * No additional locking is used in this case.
  */
 static int __buffer_add_sample(struct sample *sample)
 {
@@ -229,18 +227,17 @@ static struct sample *buffer_get_sample(
 #endif
 /**
  * get_sample - sample the CPU TSC and look for likely hardware latencies
- * @unused: This is not used but is a part of the stop_machine API
  *
  * Used to repeatedly capture the CPU TSC (or similar), looking for potential
- * hardware-induced latency. Called under stop_machine, with data.lock held.
+ * hardware-induced latency. Called with interrupts disabled and with data.lock held.
  */
-static int get_sample(void *unused)
+static int get_sample(void)
 {
 	time_type start, t1, t2, last_t2;
 	s64 diff, total = 0;
 	u64 sample = 0;
 	u64 outer_sample = 0;
-	int ret = 1;
+	int ret = -1;
 
 	init_time(last_t2, 0);
 	start = time_get(); /* start timestamp */
@@ -279,10 +276,14 @@ static int get_sample(void *unused)
 
 	} while (total <= data.sample_width);
 
+	ret = 0;
+
 	/* If we exceed the threshold value, we have found a hardware latency */
 	if (sample > data.threshold || outer_sample > data.threshold) {
 		struct sample s;
 
+		ret = 1;
+
 		data.count++;
 		s.seqnum = data.count;
 		s.duration = sample;
@@ -295,7 +296,6 @@ static int get_sample(void *unused)
 			data.max_sample = sample;
 	}
 
-	ret = 0;
 out:
 	return ret;
 }
@@ -305,32 +305,30 @@ out:
  * @unused: A required part of the kthread API.
  *
  * Used to periodically sample the CPU TSC via a call to get_sample. We
- * use stop_machine, whith does (intentionally) introduce latency since we
+ * disable interrupts, which does (intentionally) introduce latency since we
  * need to ensure nothing else might be running (and thus pre-empting).
  * Obviously this should never be used in production environments.
  *
- * stop_machine will schedule us typically only on CPU0 which is fine for
- * almost every real-world hardware latency situation - but we might later
- * generalize this if we find there are any actualy systems with alternate
- * SMI delivery or other non CPU0 hardware latencies.
+ * Currently this runs on whichever CPU it was scheduled on, but most
+ * real-world hardware latency situations occur across several CPUs,
+ * though we might later generalize this if we find there are any actual
+ * systems with alternate SMI delivery or other hardware latencies.
  */
 static int kthread_fn(void *unused)
 {
-	int err = 0;
-	u64 interval = 0;
+	int ret;
+	u64 interval;
 
 	while (!kthread_should_stop()) {
 
 		mutex_lock(&data.lock);
 
-		err = stop_machine(get_sample, unused, 0);
-		if (err) {
-			/* Houston, we have a problem */
-			mutex_unlock(&data.lock);
-			goto err_out;
-		}
+		local_irq_disable();
+		ret = get_sample();
+		local_irq_enable();
 
-		wake_up(&data.wq); /* wake up reader(s) */
+		if (ret > 0)
+			wake_up(&data.wq); /* wake up reader(s) */
 
 		interval = data.sample_window - data.sample_width;
 		do_div(interval, USEC_PER_MSEC); /* modifies interval value */
@@ -338,15 +336,10 @@ static int kthread_fn(void *unused)
 		mutex_unlock(&data.lock);
 
 		if (msleep_interruptible(interval))
-			goto out;
+			break;
 	}
-		goto out;
-err_out:
-	printk(KERN_ERR BANNER "could not call stop_machine, disabling\n");
-	enabled = 0;
-out:
-	return err;
 
+	return 0;
 }
 
 /**
@@ -442,8 +435,7 @@ out:
  * This function provides a generic read implementation for the global state
  * "data" structure debugfs filesystem entries. It would be nice to use
  * simple_attr_read directly, but we need to make sure that the data.lock
- * spinlock is held during the actual read (even though we likely won't ever
- * actually race here as the updater runs under a stop_machine context).
+ * is held during the actual read.
  */
 static ssize_t simple_data_read(struct file *filp, char __user *ubuf,
 				size_t cnt, loff_t *ppos, const u64 *entry)
@@ -478,8 +470,7 @@ static ssize_t simple_data_read(struct f
  * This function provides a generic write implementation for the global state
  * "data" structure debugfs filesystem entries. It would be nice to use
  * simple_attr_write directly, but we need to make sure that the data.lock
- * spinlock is held during the actual write (even though we likely won't ever
- * actually race here as the updater runs under a stop_machine context).
+ * is held during the actual write.
  */
 static ssize_t simple_data_write(struct file *filp, const char __user *ubuf,
 				 size_t cnt, loff_t *ppos, u64 *entry)


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT 0/3] hwlat-detector: Have it actually find hardware latency
  2013-08-19 21:33 [PATCH RT 0/3] hwlat-detector: Have it actually find hardware latency Steven Rostedt
                   ` (2 preceding siblings ...)
  2013-08-19 21:33 ` [PATCH RT 3/3] hwlat-detector: Use thread instead of stop machine Steven Rostedt
@ 2013-08-21 15:59 ` Sebastian Andrzej Siewior
  2013-08-30  5:57 ` [PATCH RT] hwlat-detector: Don't ignore threshold module parameter Mike Galbraith
  4 siblings, 0 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-08-21 15:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-rt-users, Thomas Gleixner, Carsten Emde,
	John Kacur, Clark Williams

* Steven Rostedt | 2013-08-19 17:33:24 [-0400]:

>This patch series fixes up hwlat-detector to check the entire time
>window for a latency, instead of only the quick check between
>two time stamps.
>
>It also uses trace_clock_local() if available, which is much lighter
>weight than ktime_get(), which might produce false positives. As
>the trace clock is already used to detect irq latencies, it's suitable for
>hardware latency.
>
>Finally, stop machine is removed and the test is run by a single thread.

Applied, thanks.

Sebastian

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RT]  hwlat-detector: Don't ignore threshold module parameter
  2013-08-19 21:33 [PATCH RT 0/3] hwlat-detector: Have it actually find hardware latency Steven Rostedt
                   ` (3 preceding siblings ...)
  2013-08-21 15:59 ` [PATCH RT 0/3] hwlat-detector: Have it actually find hardware latency Sebastian Andrzej Siewior
@ 2013-08-30  5:57 ` Mike Galbraith
  2013-08-30 14:39   ` [PATCH RT] rt,ipc,sem: fix -rt livelock Mike Galbraith
                     ` (2 more replies)
  4 siblings, 3 replies; 20+ messages in thread
From: Mike Galbraith @ 2013-08-30  5:57 UTC (permalink / raw)
  To: linux-rt-users; +Cc: Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior

If the user specified a threshold at module load time, use it.
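
For reference, the "?:" below is the GNU C conditional with the middle
operand omitted, so the hunk is equivalent to:

	/* use the module parameter if it was set, otherwise keep the default */
	data.threshold = threshold ? threshold : DEFAULT_LAT_THRESHOLD;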

Signed-off-by: Mike Galbraith <bitbucket@online.de>

--- a/drivers/misc/hwlat_detector.c	2013-08-30 07:16:05.387959392 +0200
+++ b/drivers/misc/hwlat_detector.c	2013-08-30 07:10:52.958670183 +0200
@@ -413,7 +413,7 @@ static int init_stats(void)
 		goto out;
 
 	__reset_stats();
-	data.threshold = DEFAULT_LAT_THRESHOLD;	    /* threshold us */
+	data.threshold = threshold ?: DEFAULT_LAT_THRESHOLD;	    /* threshold us */
 	data.sample_window = DEFAULT_SAMPLE_WINDOW; /* window us */
 	data.sample_width = DEFAULT_SAMPLE_WIDTH;   /* width us */
 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-08-30  5:57 ` [PATCH RT] hwlat-detector: Don't ignore threshold module parameter Mike Galbraith
@ 2013-08-30 14:39   ` Mike Galbraith
  2013-08-31  5:27     ` Mike Galbraith
  2013-09-12 18:23   ` [PATCH RT] hwlat-detector: Don't ignore threshold module parameter Steven Rostedt
  2013-10-04 10:30   ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 20+ messages in thread
From: Mike Galbraith @ 2013-08-30 14:39 UTC (permalink / raw)
  To: linux-rt-users; +Cc: Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior

Greetings,

The semaphore scalability changes have a less than wonderful trait for
-rt.  The below.. makes it not do that.  I originally did this to test
(and actually survive) scalability patches backported to 3.0-rt on a 64
core box, and was just rudely reminded of the problem after putting
3.10-rt onto the same box for latency comparison.


The goto again loop can and does induce livelock in -rt.  Remove it.

Signed-off-by: Mike Galbraith <bitbucket@online.de>

---
 ipc/sem.c |   56 ++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 18 deletions(-)

Index: linux-2.6/ipc/sem.c
===================================================================
--- linux-2.6.orig/ipc/sem.c
+++ linux-2.6/ipc/sem.c
@@ -208,22 +208,11 @@ void __init sem_init (void)
 static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 			      int nsops)
 {
+	struct sem *sem;
 	int locknum;
- again:
-	if (nsops == 1 && !sma->complex_count) {
-		struct sem *sem = sma->sem_base + sops->sem_num;
-
-		/* Lock just the semaphore we are interested in. */
-		spin_lock(&sem->lock);
 
-		/*
-		 * If sma->complex_count was set while we were spinning,
-		 * we may need to look at things we did not lock here.
-		 */
-		if (unlikely(sma->complex_count)) {
-			spin_unlock(&sem->lock);
-			goto lock_array;
-		}
+	if (nsops == 1 && !sma->complex_count) {
+		sem = sma->sem_base + sops->sem_num;
 
 		/*
 		 * Another process is holding the global lock on the
@@ -231,9 +220,29 @@ static inline int sem_lock(struct sem_ar
 		 * but have to wait for the global lock to be released.
 		 */
 		if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
-			spin_unlock(&sem->lock);
-			spin_unlock_wait(&sma->sem_perm.lock);
-			goto again;
+			spin_lock(&sma->sem_perm.lock);
+			if (sma->complex_count)
+				goto wait_array;
+
+			/*
+			 * Acquiring our sem->lock under the global lock
+			 * forces new complex operations to wait for us
+			 * to exit our critical section.
+			 */
+			spin_lock(&sem->lock);
+			spin_unlock(&sma->sem_perm.lock);
+		} else {
+			/* Lock just the semaphore we are interested in. */
+			spin_lock(&sem->lock);
+
+			/*
+			 * If sma->complex_count was set prior to acquisition,
+			 * we must fall back to the global array lock.
+			 */
+			if (unlikely(sma->complex_count)) {
+				spin_unlock(&sem->lock);
+				goto lock_array;
+			}
 		}
 
 		locknum = sops->sem_num;
@@ -247,11 +256,22 @@ static inline int sem_lock(struct sem_ar
 		 */
  lock_array:
 		spin_lock(&sma->sem_perm.lock);
+ wait_array:
 		for (i = 0; i < sma->sem_nsems; i++) {
-			struct sem *sem = sma->sem_base + i;
+			sem = sma->sem_base + i;
+#ifdef CONFIG_PREEMPT_RT_BASE
+			if (spin_is_locked(&sem->lock))
+#endif
 			spin_unlock_wait(&sem->lock);
 		}
 		locknum = -1;
+
+		if (nsops == 1 && !sma->complex_count) {
+			sem = sma->sem_base + sops->sem_num;
+			spin_lock(&sem->lock);
+			spin_unlock(&sma->sem_perm.lock);
+			locknum = sops->sem_num;
+		}
 	}
 	return locknum;
 }



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-08-30 14:39   ` [PATCH RT] rt,ipc,sem: fix -rt livelock Mike Galbraith
@ 2013-08-31  5:27     ` Mike Galbraith
  2013-09-06  7:08       ` Mike Galbraith
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Galbraith @ 2013-08-31  5:27 UTC (permalink / raw)
  To: linux-rt-users; +Cc: Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior

[-- Attachment #1: Type: text/plain, Size: 4385 bytes --]

The attached proggy (Manfred Spraul) is one reproducer. 

./osim 16 32 1000000 0 0 on my little Q6600 box produced instant
gratification.  IIRC, you don't need that many tasks, fiddle with it,
you'll see the livelock pretty quickly.

There's perhaps a more subtle way to fix it up than loop-ectomy with an
axe, but patient does survive when so treated.

-Mike


Box playing osim unkillable forever-loop.

   PerfTop:    5592 irqs/sec  kernel:97.6%  exact:  0.0% [4000Hz cycles],  (all, 4 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------                            

    13.90%  [kernel]               [k] _raw_spin_lock_irqsave        
     8.28%  [kernel]               [k] add_preempt_count.part.78     
     6.76%  [kernel]               [k] _raw_spin_lock                
     6.69%  [kernel]               [k] sub_preempt_count             
     6.45%  [kernel]               [k] _raw_spin_unlock_irqrestore   
     5.12%  [kernel]               [k] rt_spin_unlock                
     5.10%  [kernel]               [k] rt_spin_lock                  
     5.04%  [kernel]               [k] add_preempt_count             
     4.01%  [kernel]               [k] get_parent_ip                 
     3.33%  [kernel]               [k] __try_to_take_rt_mutex        
     3.18%  [kernel]               [k] in_lock_functions             
     2.88%  [kernel]               [k] migrate_enable                
     2.24%  [kernel]               [k] debug_smp_processor_id        
     2.23%  [kernel]               [k] pin_current_cpu               
     2.14%  [kernel]               [k] wakeup_next_waiter            
     2.00%  [kernel]               [k] migrate_disable               
     1.94%  [kernel]               [k] __try_to_take_rt_mutex.part.10
     1.48%  [kernel]               [k] rt_spin_lock_slowlock         
     1.42%  [kernel]               [k] __raw_spin_unlock             
     1.15%  [kernel]               [k] plist_del                     
     1.15%  [kernel]               [k] unpin_current_cpu             
     1.14%  [kernel]               [k] SYSC_semtimedop               
     0.94%  [kernel]               [k] try_to_wake_up                
     0.94%  [kernel]               [k] rt_spin_lock_slowunlock       
     0.79%  [kernel]               [k] plist_add                     
     0.73%  [kernel]               [k] __schedule

   PerfTop:    5776 irqs/sec  kernel:97.6%  exact:  0.0% [4000Hz cycles],  (all, 4 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------                            
Showing cycles for SYSC_semtimedop
  Events  Pcnt (>=5%)
 Percent |      Source code & Disassembly of vmlinux
-----------------------------------------------------
         :                       * If sma->complex_count was set while we were spinning,
         :                       * we may need to look at things we did not lock here.
         :                       */
         :                      if (unlikely(sma->complex_count)) {
    5.27 :        ffffffff811c8072:       mov    0x9c(%r14),%eax
         :                       */
         :                      if (unlikely(sma->complex_count)) {
    4.82 :        ffffffff811c8285:       mov    -0x258(%rbp),%rax
    2.76 :        ffffffff811c828c:       mov    0x9c(%rax),%ecx
    7.58 :        ffffffff811c8292:       test   %ecx,%ecx
         :                      if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
         :                              spin_unlock(&sem->lock);
    3.21 :        ffffffff811c860b:       mov    %r15,%rdi
    0.00 :        ffffffff811c860e:       callq  ffffffff814be060 <rt_spin_unlock>
    6.36 :        ffffffff811c8613:       callq  ffffffff81069170 <migrate_enable>
         :              int locknum;
         :       again:
         :              if (nsops == 1 && !sma->complex_count) {
    2.12 :        ffffffff811c8660:       mov    -0x258(%rbp),%rax
    5.53 :        ffffffff811c8667:       cmpl   $0x0,0x9c(%rax)
   10.41 :        ffffffff811c866e:       je     ffffffff811c825c <SYSC_semtimedop+0x4cc>


[-- Attachment #2: osim.c --]
[-- Type: text/x-csrc, Size: 3754 bytes --]

/*
 * Copyright (C) 1999,2001 by Manfred Spraul.
 * 
 * Redistribution of this file is permitted under the terms of the GNU 
 * General Public License (GPL)
 */

#include <sys/sem.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <assert.h>
#include <signal.h>
#include <unistd.h>

#define TRUE	1
#define FALSE	0

union semun {
	int val;
	struct semid_ds *buf;
	unsigned short int *array;
	struct seminfo* __buf;
};

#define barrier()	__asm__ __volatile__("": : : "memory")

int g_loops;
int g_busy_in;
int g_busy_out;
int g_sem;
int g_completedsem;

static void thread_fnc(int id)
{
	int i;
	volatile int j;
	int res;
	struct sembuf sop[1];

	for (i=0;i<g_loops;i++) {

		sop[0].sem_num=id;
		sop[0].sem_op=-1;
		sop[0].sem_flg=0;
		res = semop(g_sem,sop,1);
		if(res==-1) {
			printf("semop -1 failed, errno %d.\n", errno);
			return;
		}
		for(j=0;j<g_busy_in;j++);
			barrier();
		sop[0].sem_num=id;
		sop[0].sem_op=1;
		sop[0].sem_flg=0;
		res = semop(g_sem,sop,1);
		if(res==-1) {
			printf("semop +1 failed, errno %d.\n", errno);
			return;
		}
		for(j=0;j<g_busy_out;j++);
			barrier();
	}

	sop[0].sem_num=g_completedsem;
	sop[0].sem_op=-1;
	sop[0].sem_flg=IPC_NOWAIT;
	res = semop(g_sem,sop,1);
	if(res==-1) {
		printf("semop -1 on completedsem returned %d, errno %d.\n", res, errno);
		return;
	}
	return;
}

int main(int argc,char** argv)
{
	int nsems;
	int tasks;
	int res;
	pid_t *pids;
	unsigned short *psems;
	struct timeval t_before, t_after;
	unsigned long long delta;
	union semun arg;
	int i;

	printf("osim <sems> <tasks> <loops> <busy-in> <busy-out>\n");
	if(argc != 6) {
		printf("Invalid parameters.\n");
		return 1;
	}
	nsems=atoi(argv[1]);
	tasks=atoi(argv[2]);
	g_loops=atoi(argv[3]);
	g_loops = (g_loops+tasks-1)/tasks;
	g_busy_in=atoi(argv[4]);
	g_busy_out=atoi(argv[5]);
	g_completedsem = nsems;

	res = semget(IPC_PRIVATE, nsems+1, 0777 | IPC_CREAT);
	if(res == -1) {
		printf(" create failed.\n");
		return 1;
	}
	g_sem = res;
	fflush(stdout);

	pids = malloc(sizeof(pid_t)*tasks);
	for (i=0;i<tasks;i++) {
		res = fork();
		if (res == 0) {
			thread_fnc(i%nsems);
			exit(0);
		} 
		if (res == -1) {
			printf("fork() failed, errno now %d.\n", errno);
			return 1;
		}
		pids[i] = res;
	}

	printf("osim: using a semaphore array with %d semaphores.\n", nsems);
	printf("osim: using %d tasks.\n", tasks);
	printf("osim: each thread loops %d times\n", g_loops);
	printf("osim: each thread busyloops %d loops outside and %d loops inside.\n", g_busy_out, g_busy_in);
	fflush(stdout);

	/* nsems worker semaphores plus the completion semaphore */
	psems = malloc(sizeof(unsigned short)*(nsems+1));
	for (i=0;i<nsems;i++)
		psems[i] = 1;
	psems[i] = tasks;

	{
		struct sembuf sop[1];

		gettimeofday(&t_before, NULL);
		arg.array = psems;
		semctl(g_sem, 0, SETALL, arg);

		sop[0].sem_num=g_completedsem;
		sop[0].sem_op=0;
		sop[0].sem_flg=0;
		res = semop(g_sem,sop,1);
		if(res==-1) {
			printf("semop 0 failed, errno %d.\n", errno);
			return 1;
		}
		gettimeofday(&t_after, NULL);
	}
	for (i=0;i<tasks;i++) {
		res = waitpid(pids[i], NULL, 0);
		if (res != pids[i]) {
			printf("waitpid() failed, errno now %d.\n", errno);
			return 1;
		}
	}

	delta = t_after.tv_sec - t_before.tv_sec;
	delta = delta*1000000L;
	delta += t_after.tv_usec - t_before.tv_usec;

	printf("total execution time: %Ld.%03Ld%03Ld seconds for %d loops\n",
		(delta/1000000),
		(delta/1000)%1000,
		(delta)%1000,
		tasks*g_loops);

	delta = delta*1000;
	delta = delta/(tasks*g_loops);

	printf("per loop execution time: %Ld.%03Ld usec\n",
		(delta/1000),
		(delta)%1000);

	res = semctl(g_sem, 1, IPC_RMID, arg);
	if(res == -1) {
		printf(" semctl failed.\n");
		return 1;
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-08-31  5:27     ` Mike Galbraith
@ 2013-09-06  7:08       ` Mike Galbraith
  2013-09-10  6:30         ` Mike Galbraith
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Galbraith @ 2013-09-06  7:08 UTC (permalink / raw)
  To: linux-rt-users; +Cc: Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior

On Sat, 2013-08-31 at 07:27 +0200, Mike Galbraith wrote: 
> The attached proggy (Manfred Spraul) is one reproducer. 

Just in case it's not obvious, the cause is spin_unlock_wait() usage.  

All it takes is one task to start the ball rolling while enough others
exist to react by then taking the lock themselves, and you instantly get
"Woohoo, let's all lock the lock we stare at to see if we need to lock
the lock that we stare at.....".

(loop doesn't look lovely for non-rt either, seems you could stimulate
it if you tried it hard enough)


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-09-06  7:08       ` Mike Galbraith
@ 2013-09-10  6:30         ` Mike Galbraith
  2013-09-11 14:03           ` Manfred Spraul
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Galbraith @ 2013-09-10  6:30 UTC (permalink / raw)
  To: linux-rt-users
  Cc: Steven Rostedt, Thomas Gleixner, Sebastian Andrzej Siewior,
	Manfred Spraul, Peter Zijlstra

On Fri, 2013-09-06 at 09:08 +0200, Mike Galbraith wrote:
> On Sat, 2013-08-31 at 07:27 +0200, Mike Galbraith wrote: 
> > The attached proggy (Manfred Spraul) is one reproducer. 
> 
> Just in case it's not obvious, the cause is spin_unlock_wait() usage.  

(readers are encouraged to skip this bit, go straight to second diff)

The below will prevent livelock without the unpleasant down-sides of /me
trying to let PI see any thumb twiddling going on (also induces thumb
twiddling, just not the endless variety), but surely won't fly, and...

---
 ipc/sem.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -232,7 +232,8 @@ static inline int sem_lock(struct sem_ar
 		 */
 		if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
 			spin_unlock(&sem->lock);
-			spin_unlock_wait(&sma->sem_perm.lock);
+			while(spin_is_locked(&sma->sem_perm.lock))
+				cpu_relax();
 			goto again;
 		}
 
@@ -249,7 +250,8 @@ static inline int sem_lock(struct sem_ar
 		spin_lock(&sma->sem_perm.lock);
 		for (i = 0; i < sma->sem_nsems; i++) {
 			struct sem *sem = sma->sem_base + i;
-			spin_unlock_wait(&sem->lock);
+			while(spin_is_locked(&sem->lock))
+				cpu_relax();
 		}
 		locknum = -1;
 	}


You can kill this particular (osim induced) livelock by using Manfred's
completion wakeup scheme.  It may not make it all better, but it does
eliminate this trigger, despite that not being what the patch is about.

Beware, contains 3.10.10-rt7 hammer marks made by /me to test.

From 7fdbcc018b85d5804ea51323f3cefd5fe3032f82 Mon Sep 17 00:00:00 2001
From: Manfred Spraul <manfred@colorfullife.com>
Date: Sat, 28 Jan 2012 18:42:58 +0100
Subject: [PATCH] ipc/sem.c: Add -RT compatible wakeup scheme.

ipc/sem.c uses a custom wakeup scheme that relies on preempt_disable() to
prevent livelocks.
On -RT, this causes increased latencies and debug warnings.

The patch:
- separates the wakeup scheme into helper functions
- adds a scheme built around a completion.

The preferred solution (use a spinlock instead of the completion) does not
work, because there is a limit of at most 256 concurrent spinlocks.

--
        Manfred

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c |  210 +++++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 127 insertions(+), 83 deletions(-)

--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -61,8 +61,8 @@
  * - A woken up task may not even touch the semaphore array anymore, it may
  *   have been destroyed already by a semctl(RMID).
  * - The synchronizations between wake-ups due to a timeout/signal and a
- *   wake-up due to a completed semaphore operation is achieved by using an
- *   intermediate state (IN_WAKEUP).
+ *   wake-up due to a completed semaphore operation is achieved by using a
+ *   special wakeup scheme (queuewakeup_wait and support functions)
  * - UNDO values are stored in an array (one per process and per
  *   semaphore array, lazily allocated). For backwards compatibility, multiple
  *   modes for the UNDO variables are supported (per process, per thread)
@@ -90,6 +90,85 @@
 #include <asm/uaccess.h>
 #include "util.h"
 
+
+#ifdef CONFIG_PREEMPT_RT_BASE
+	#define SYSVSEM_COMPLETION 1
+#else
+	#define SYSVSEM_CUSTOM 1
+#endif
+
+#ifdef SYSVSEM_COMPLETION
+	/* Using a completion causes some overhead, but avoids a busy loop
+	 * that increases the worst case latency.
+	 */
+	struct queue_done {
+		struct completion done;
+	};
+
+	static void queuewakeup_prepare(void)
+	{
+		/* no preparation necessary */
+	}
+
+	static void queuewakeup_completed(void)
+	{
+		/* empty */
+	}
+
+	static void queuewakeup_handsoff(struct queue_done *qd)
+	{
+		complete_all(&qd->done);
+	}
+
+	static void queuewakeup_init(struct queue_done *qd)
+	{
+		init_completion(&qd->done);
+	}
+
+	static void queuewakeup_wait(struct queue_done *qd)
+	{
+		wait_for_completion(&qd->done);
+	}
+
+#elif defined(SYSVSEM_CUSTOM)
+	struct queue_done {
+		atomic_t done;
+	};
+
+	static void queuewakeup_prepare(void)
+	{
+		preempt_disable();
+	}
+
+	static void queuewakeup_completed(void)
+	{
+		preempt_enable();
+	}
+
+	static void queuewakeup_handsoff(struct queue_done *qd)
+	{
+		BUG_ON(atomic_read(&qd->done) != 2);
+		smp_mb();
+		atomic_set(&qd->done, 1);
+	}
+
+	static void queuewakeup_init(struct queue_done *qd)
+	{
+		atomic_set(&qd->done, 2);
+	}
+
+	static void queuewakeup_wait(struct queue_done *qd)
+	{
+		while (atomic_read(&qd->done) != 1)
+			cpu_relax();
+
+		smp_mb();
+	}
+#else
+#error Unknown locking primitive
+#endif
+
+
 /* One semaphore structure for each semaphore in the system. */
 struct sem {
 	int	semval;		/* current value */
@@ -108,6 +187,7 @@ struct sem_queue {
 	struct sembuf		*sops;	 /* array of pending operations */
 	int			nsops;	 /* number of operations */
 	int			alter;	 /* does *sops alter the array? */
+	struct queue_done	done;	 /* completion synchronization */
 };
 
 /* Each task has a list of undo requests. They are executed automatically
@@ -334,27 +414,34 @@ static inline void sem_rmid(struct ipc_n
 
 /*
  * Lockless wakeup algorithm:
- * Without the check/retry algorithm a lockless wakeup is possible:
+ * The whole task of choosing tasks to wake up is done by the thread that
+ * does the wakeup. For the woken up thread, this makes the following
+ * lockless wakeup possible:
  * - queue.status is initialized to -EINTR before blocking.
  * - wakeup is performed by
- *	* unlinking the queue entry from sma->sem_pending
- *	* setting queue.status to IN_WAKEUP
- *	  This is the notification for the blocked thread that a
- *	  result value is imminent.
+ * 	* call queuewakeup_prepare() once. This is necessary to prevent
+ *	  livelocks.
+ *	* setting queue.status to the actual result code
  *	* call wake_up_process
- *	* set queue.status to the final value.
+ *	* queuewakeup_handsoff(&q->done);
+ *	* call queuewakeup_completed() once.
  * - the previously blocked thread checks queue.status:
- *   	* if it's IN_WAKEUP, then it must wait until the value changes
- *   	* if it's not -EINTR, then the operation was completed by
- *   	  update_queue. semtimedop can return queue.status without
- *   	  performing any operation on the sem array.
- *   	* otherwise it must acquire the spinlock and check what's up.
+ *	* if it's not -EINTR, then someone completed the operation.
+ *	  First, queuewakeup_wait() must be called. Afterwards,
+ *	  semtimedop must return queue.status without performing any
+ *	  operation on the sem array.
+ *	* otherwise it must acquire the spinlock and repeat the test
+ *	    (including: call queuewakeup_wait() if there is a return code)
+ *	* If it is still -EINTR, then no update_queue() completed the
+ *	    operation, thus semtimedop() can proceed normally.
  *
- * The two-stage algorithm is necessary to protect against the following
+ * queuewakeup_wait() is necessary to protect against the following
  * races:
  * - if queue.status is set after wake_up_process, then the woken up idle
  *   thread could race forward and try (and fail) to acquire sma->lock
- *   before update_queue had a chance to set queue.status
+ *   before update_queue had a chance to set queue.status.
+ *   More importantly, it would mean that wake_up_process must be done
+ *   while holding sma->lock, i.e. this would reduce the scalability.
  * - if queue.status is written before wake_up_process and if the
  *   blocked process is woken up by a signal between writing
  *   queue.status and the wake_up_process, then the woken up
@@ -364,7 +451,6 @@ static inline void sem_rmid(struct ipc_n
  *   (yes, this happened on s390 with sysv msg).
  *
  */
-#define IN_WAKEUP	1
 
 /**
  * newary - Create a new semaphore set
@@ -557,22 +643,17 @@ static int try_atomic_semop (struct sem_
 static void wake_up_sem_queue_prepare(struct list_head *pt,
 				struct sem_queue *q, int error)
 {
-#ifdef CONFIG_PREEMPT_RT_BASE
+#ifdef CONFIG_PREEMPT_RT_BASE_MIKE
 	struct task_struct *p = q->sleeper;
 	get_task_struct(p);
 	q->status = error;
 	wake_up_process(p);
 	put_task_struct(p);
 #else
-	if (list_empty(pt)) {
-		/*
-		 * Hold preempt off so that we don't get preempted and have the
-		 * wakee busy-wait until we're scheduled back on.
-		 */
-		preempt_disable();
-	}
-	q->status = IN_WAKEUP;
-	q->pid = error;
+	if (list_empty(pt))
+		queuewakeup_prepare();
+
+	q->status = error;
 
 	list_add_tail(&q->list, pt);
 #endif
@@ -584,24 +665,23 @@ static void wake_up_sem_queue_prepare(st
  *
  * Do the actual wake-up.
  * The function is called without any locks held, thus the semaphore array
- * could be destroyed already and the tasks can disappear as soon as the
- * status is set to the actual return code.
+ * could be destroyed already and the tasks can disappear as soon as
+ * queuewakeup_handsoff() is called.
  */
 static void wake_up_sem_queue_do(struct list_head *pt)
 {
-#ifndef CONFIG_PREEMPT_RT_BASE
+#ifndef CONFIG_PREEMPT_RT_BASE_MIKE
 	struct sem_queue *q, *t;
 	int did_something;
 
 	did_something = !list_empty(pt);
 	list_for_each_entry_safe(q, t, pt, list) {
 		wake_up_process(q->sleeper);
-		/* q can disappear immediately after writing q->status. */
-		smp_wmb();
-		q->status = q->pid;
+		/* q can disappear immediately after completing q->done */
+		queuewakeup_handsoff(&q->done);
 	}
 	if (did_something)
-		preempt_enable();
+		queuewakeup_completed();
 #endif
 }
 
@@ -1517,33 +1597,6 @@ static struct sem_undo *find_alloc_undo(
 	return un;
 }
 
-
-/**
- * get_queue_result - Retrieve the result code from sem_queue
- * @q: Pointer to queue structure
- *
- * Retrieve the return code from the pending queue. If IN_WAKEUP is found in
- * q->status, then we must loop until the value is replaced with the final
- * value: This may happen if a task is woken up by an unrelated event (e.g.
- * signal) and in parallel the task is woken up by another task because it got
- * the requested semaphores.
- *
- * The function can be called with or without holding the semaphore spinlock.
- */
-static int get_queue_result(struct sem_queue *q)
-{
-	int error;
-
-	error = q->status;
-	while (unlikely(error == IN_WAKEUP)) {
-		cpu_relax();
-		error = q->status;
-	}
-
-	return error;
-}
-
-
 SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		unsigned, nsops, const struct timespec __user *, timeout)
 {
@@ -1677,6 +1730,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid,
 
 	queue.status = -EINTR;
 	queue.sleeper = current;
+	queuewakeup_init(&queue.done);
 
 sleep_again:
 	current->state = TASK_INTERRUPTIBLE;
@@ -1688,17 +1742,14 @@ SYSCALL_DEFINE4(semtimedop, int, semid,
 	else
 		schedule();
 
-	error = get_queue_result(&queue);
+	error = queue.status;
 
 	if (error != -EINTR) {
 		/* fast path: update_queue already obtained all requested
-		 * resources.
-		 * Perform a smp_mb(): User space could assume that semop()
-		 * is a memory barrier: Without the mb(), the cpu could
-		 * speculatively read in user space stale data that was
-		 * overwritten by the previous owner of the semaphore.
+		 * resources. Just ensure that update_queue completed
+		 * its access to &queue.
 		 */
-		smp_mb();
+		queuewakeup_wait(&queue.done);
 
 		goto out_free;
 	}
@@ -1709,26 +1760,19 @@ SYSCALL_DEFINE4(semtimedop, int, semid,
 	/*
 	 * Wait until it's guaranteed that no wakeup_sem_queue_do() is ongoing.
 	 */
-	error = get_queue_result(&queue);
-
-	/*
-	 * Array removed? If yes, leave without sem_unlock().
-	 */
-	if (IS_ERR(sma)) {
+	error = queue.status;
+	if (error != -EINTR) {
+		/* If there is a return code, then we can leave immediately. */
+		if (!IS_ERR(sma)) {
+			/* sem_lock() succeeded - then unlock */
+			sem_unlock(sma, locknum);
+		}
 		rcu_read_unlock();
+		/* Except that we must wait for the hands-off */
+		queuewakeup_wait(&queue.done);
 		goto out_free;
 	}
 
-
-	/*
-	 * If queue.status != -EINTR we are woken up by another process.
-	 * Leave without unlink_queue(), but with sem_unlock().
-	 */
-
-	if (error != -EINTR) {
-		goto out_unlock_free;
-	}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-09-10  6:30         ` Mike Galbraith
@ 2013-09-11 14:03           ` Manfred Spraul
  2013-09-12  7:40             ` Mike Galbraith
  0 siblings, 1 reply; 20+ messages in thread
From: Manfred Spraul @ 2013-09-11 14:03 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Peter Zijlstra

Hi Mike,

On 09/10/2013 08:30 AM, Mike Galbraith wrote:
>
> You can kill this particular (osim induced) livelock by using Manfred's
> completion wakeup scheme.  It may not make it all better, but it does
> eliminate this trigger, despite that not being what the patch is about.

[snip]

> +#elif defined(SYSVSEM_CUSTOM)
> +	struct queue_done {
> +		atomic_t done;
> +	};
> +
> +	static void queuewakeup_prepare(void)
> +	{
> +		preempt_disable();
> +	}
> +
> +	static void queuewakeup_completed(void)
> +	{
> +		preempt_enable();
> +	}
> +
> +	static void queuewakeup_handsoff(struct queue_done *qd)
> +	{
> +		BUG_ON(atomic_read(&qd->done) != 2);
> +		smp_mb();
> +		atomic_set(&qd->done, 1);
> +	}
> +
> +	static void queuewakeup_init(struct queue_done *qd)
> +	{
> +		atomic_set(&qd->done, 2);
> +	}
> +
> +	static void queuewakeup_wait(struct queue_done *qd)
> +	{
> +		while (atomic_read(&qd->done) != 1)
> +			cpu_relax();
> +
> +		smp_mb();
> +	}
Two remarks:
- The reason why a standard waitqueue is not used should be mentioned in 
the commit comment:

A standard wait queue that waits with a timeout (or with 
TASK_INTERRUPTIBLE) doesn't allow the caller of add_wait_queue to figure 
out if the thread was woken up due to expiry of the timer or due to an 
explicit wake_up() call.
a) The ipc code must know that in order to decide if the task must 
return with -EAGAIN/-EINTR or with 0. Therefore ipc/sem.c, ipc/msg.c and 
ipc/mqueue.c use custom queues.
b) These custom queues have a lockless fast-path for a successful 
operation - and this fast path must handle the various races that may 
happen when the timer expires in parallel with a wake_up() call.

- the ipc/msg.c and ipc/mqueue.c should be converted as well (just 
search for cpu_relax()).

--
     Manfred


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-09-11 14:03           ` Manfred Spraul
@ 2013-09-12  7:40             ` Mike Galbraith
  2013-09-12 19:15               ` Steven Rostedt
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Galbraith @ 2013-09-12  7:40 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner,
	Sebastian Andrzej Siewior, Peter Zijlstra

On Wed, 2013-09-11 at 16:03 +0200, Manfred Spraul wrote: 
> Hi Mike,
> 
> On 09/10/2013 08:30 AM, Mike Galbraith wrote:
> >
> > You can kill this particular (osim induced) livelock by using Manfred's
> > completion wakeup scheme.  It may not make it all better, but it does
> > eliminate this trigger, despite that not being what the patch is about.
> 
> [snip]
> 
> > +#elif defined(SYSVSEM_CUSTOM)
> > +	struct queue_done {
> > +		atomic_t done;
> > +	};
> > +
> > +	static void queuewakeup_prepare(void)
> > +	{
> > +		preempt_disable();
> > +	}
> > +
> > +	static void queuewakeup_completed(void)
> > +	{
> > +		preempt_enable();
> > +	}
> > +
> > +	static void queuewakeup_handsoff(struct queue_done *qd)
> > +	{
> > +		BUG_ON(atomic_read(&qd->done) != 2);
> > +		smp_mb();
> > +		atomic_set(&qd->done, 1);
> > +	}
> > +
> > +	static void queuewakeup_init(struct queue_done *qd)
> > +	{
> > +		atomic_set(&qd->done, 2);
> > +	}
> > +
> > +	static void queuewakeup_wait(struct queue_done *qd)
> > +	{
> > +		while (atomic_read(&qd->done) != 1)
> > +			cpu_relax();
> > +
> > +		smp_mb();
> > +	}
> Two remarks:
> - The reason why a standard waitqueue is not used should be mentioned in 
> the commit comment:
> 
> A standard wait queue that waits with a timeout (or with 
> TASK_INTERRUPTIBLE) doesn't allow the caller of add_wait_queue to figure 
> out if the thread was woken up due to expiry of the timer or due to an 
> explicit wake_up() call.
> a) The ipc code must know that in order to decide if the task must 
> return with -EAGAIN/-EINTR or with 0. Therefore ipc/sem.c, ipc/msg.c and 
> ipc/mqueue.c use custom queues.
> b) These custom queues have a lockless fast-path for a successful 
> operation - and this fast path must handle the various races that may 
> happen when the timer expires in parallel with a wake_up() call.

If maintainers are interested, I can add that and clean up hammer marks
so it can be applied in lieu of the existing rt wakeup patch instead of
being crammed in on top of it, or they can trivially do that themselves,
both options presuming your (implied) approval.

> - the ipc/msg.c and ipc/mqueue.c should be converted as well (just 
> search for cpu_relax()).

Thanks, I'll save this.

ATM, I'm more mystified by 3.10.10-rt wild variance, independent of
hacked up/virgin/PREEMPT/NOPREEMPT whatever whereas 3.0-rt with the same
semaphore changes wedged in scaled perfectly/stably to 64 cores
regardless of same variables on same box.  That shouldn't have happened,
so I'm going to try out mainline, see what box does with that.

-Mike


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  hwlat-detector: Don't ignore threshold module parameter
  2013-08-30  5:57 ` [PATCH RT] hwlat-detector: Don't ignore threshold module parameter Mike Galbraith
  2013-08-30 14:39   ` [PATCH RT] rt,ipc,sem: fix -rt livelock Mike Galbraith
@ 2013-09-12 18:23   ` Steven Rostedt
  2013-10-04 10:30   ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2013-09-12 18:23 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, Sebastian Andrzej Siewior, LKML

Don't be afraid to Cc LKML with rt kernel patches. linux-rt-users is
for users not kernel developers. People like Peter Zijlstra will never
see this patch.

On Fri, 30 Aug 2013 07:57:25 +0200
Mike Galbraith <bitbucket@online.de> wrote:

> If the user specified a threshold at module load time, use it.

This could use a little better change log.

Other than that,

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve

> 
> Signed-off-by: Mike Galbraith <bitbucket@online.de>
> 
> --- a/drivers/misc/hwlat_detector.c	2013-08-30 07:16:05.387959392 +0200
> +++ b/drivers/misc/hwlat_detector.c	2013-08-30 07:10:52.958670183 +0200
> @@ -413,7 +413,7 @@ static int init_stats(void)
>  		goto out;
>  
>  	__reset_stats();
> -	data.threshold = DEFAULT_LAT_THRESHOLD;	    /* threshold us */
> +	data.threshold = threshold ?: DEFAULT_LAT_THRESHOLD;	    /* threshold us */
>  	data.sample_window = DEFAULT_SAMPLE_WINDOW; /* window us */
>  	data.sample_width = DEFAULT_SAMPLE_WIDTH;   /* width us */
>  
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-09-12  7:40             ` Mike Galbraith
@ 2013-09-12 19:15               ` Steven Rostedt
  2013-09-13  3:20                 ` Mike Galbraith
  0 siblings, 1 reply; 20+ messages in thread
From: Steven Rostedt @ 2013-09-12 19:15 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Manfred Spraul, linux-rt-users, Thomas Gleixner,
	Sebastian Andrzej Siewior, Peter Zijlstra

On Thu, 12 Sep 2013 09:40:28 +0200
Mike Galbraith <bitbucket@online.de> wrote:


> If maintainers are interested, I can add that and clean up hammer marks
> so it can be applied in lieu of the existing rt wakeup patch instead of
> being crammed in on top of it, or they can trivially do that themselves,
> both options presuming your (implied) approval.
> 

Mike,

What's the final patch for this?

Could you post another one with what you believe is the correct fix.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-09-12 19:15               ` Steven Rostedt
@ 2013-09-13  3:20                 ` Mike Galbraith
  2013-09-13  3:33                   ` Steven Rostedt
  2013-09-13  4:36                   ` Mike Galbraith
  0 siblings, 2 replies; 20+ messages in thread
From: Mike Galbraith @ 2013-09-13  3:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Manfred Spraul, linux-rt-users, Thomas Gleixner,
	Sebastian Andrzej Siewior, Peter Zijlstra

On Thu, 2013-09-12 at 15:15 -0400, Steven Rostedt wrote: 
> On Thu, 12 Sep 2013 09:40:28 +0200
> Mike Galbraith <bitbucket@online.de> wrote:
> 
> 
> > If maintainers are interested, I can add that and clean up hammer marks
> > so it can be applied in lieu of the existing rt wakeup patch instead of
> > being crammed in on top of it, or they can trivially do that themselves,
> > both options presuming your (implied) approval.
> > 
> 
> Mike,
> 
> What's the final patch for this?

Both the loop removal patch or something like it (but prettier), and
Manfred's wakeup scheme to prevent the locked array lock being
seen and reacted to by wakees, who can then contend on that lock,
defeating the whole purpose for a while.

> Could you post another one with what you believe is the correct fix.

As noted, I _think_ they're both needed.  Manfred's wakeup scheme cures
a trigger, but not the bug methinks (non-deterministic trylock loop).
Would be nice if it turned out that it cured all possible livelocks, but
I don't think it does.  Even if it did, crawling over a large array
always taking and releasing every lock will suck, which is why I did the
look at it before taking it thing.  That ain't pretty, but it's less
painful.

-Mike


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-09-13  3:20                 ` Mike Galbraith
@ 2013-09-13  3:33                   ` Steven Rostedt
  2013-09-13  4:48                     ` Mike Galbraith
  2013-09-13  4:36                   ` Mike Galbraith
  1 sibling, 1 reply; 20+ messages in thread
From: Steven Rostedt @ 2013-09-13  3:33 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Manfred Spraul, linux-rt-users, Thomas Gleixner,
	Sebastian Andrzej Siewior, Peter Zijlstra

On Fri, 13 Sep 2013 05:20:21 +0200
Mike Galbraith <bitbucket@online.de> wrote:

> On Thu, 2013-09-12 at 15:15 -0400, Steven Rostedt wrote: 
> > On Thu, 12 Sep 2013 09:40:28 +0200
> > Mike Galbraith <bitbucket@online.de> wrote:
> > 
> > 
> > > If maintainers are interested, I can add that and clean up hammer marks
> > > so it can be applied in lieu of the existing rt wakeup patch instead of
> > > being crammed in on top of it, or they can trivially do that themselves,
> > > both options presuming your (implied) approval.
> > > 
> > 
> > Mike,
> > 
> > What's the final patch for this?
> 
> Both loop removal patch or something like it (but prettier), and
> Manfred's wakeup scheme to prevent the locked array lock being being
> seen and reacted to by wakees, who can then contend on that lock,
> defeating the whole purpose for a while.
> 
> > Could you post another one with what you believe is the correct fix.
> 
> As noted, I _think_ they're both needed.  

Can you repost as a patch series please.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-09-13  3:20                 ` Mike Galbraith
  2013-09-13  3:33                   ` Steven Rostedt
@ 2013-09-13  4:36                   ` Mike Galbraith
  2013-09-13 12:56                     ` Mike Galbraith
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Galbraith @ 2013-09-13  4:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Manfred Spraul, linux-rt-users, Thomas Gleixner,
	Sebastian Andrzej Siewior, Peter Zijlstra

P.S.

While I have your attention, let me provide some related info.

for i in 1 2 3 4 5; do ipcscale/sem-waitzero -t 30 -c 1,8,16,24,32,40,48,56,64 -i 1 -m 0 -d 2 -f 2>&1|tee -a log-blah; done

3.0-rt (SLE11-SP3-RT with sem backports) scales perfectly to 64 cores on
my 8 node (alas, 1 memory node) DL980 rt test box, whether configured
PREEMPT_RT/1000 or NOPREEMPT/250.  (livelock bug is irrelevant)

3.10.10-rt7 scales horridly, is erratic as all hell whether configured
PREEMPT_FULL/1000 or NOPREEMPT/250.  Ditto pure 3.10.10 NOPREEMPT/250.

(3.10-rt kernel is where I started, Manfred asked me to try something, I
happened to have this kernel on box to take a peek at future workhorse) 

Master NOPREEMPT/250 as of yesterday with same config taken from tests
above scales perfectly.

So, that _alleges_ that 3.10.10, a.k.a next longterm stable kernel, is
sick, missing something.  It appears something went wonky post 3.0, but
that's only possibly interesting for finding the origin for -stable iff
the thing turns out to be _really_ stable/bisectable in the first place,
which I'm going to find out, hopefully today.  It _looks_ stable.

Would be nice if someone would confirm/deny bad big box behavior, but...

-Mike


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-09-13  3:33                   ` Steven Rostedt
@ 2013-09-13  4:48                     ` Mike Galbraith
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Galbraith @ 2013-09-13  4:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Manfred Spraul, linux-rt-users, Thomas Gleixner,
	Sebastian Andrzej Siewior, Peter Zijlstra

On Thu, 2013-09-12 at 23:33 -0400, Steven Rostedt wrote:

> Can you repost as a patch series please.

Sure, will do.

-Mike


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  rt,ipc,sem: fix -rt livelock
  2013-09-13  4:36                   ` Mike Galbraith
@ 2013-09-13 12:56                     ` Mike Galbraith
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Galbraith @ 2013-09-13 12:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Manfred Spraul, linux-rt-users, Thomas Gleixner,
	Sebastian Andrzej Siewior, Peter Zijlstra

On Fri, 2013-09-13 at 06:36 +0200, Mike Galbraith wrote:

> 3.10.10-rt7 scales horridly, is erratic as all hell whether configured
> PREEMPT_FULL/1000 or NOPREEMPT/250.  Ditto pure 3.10.10 NOPREEMPT/250.

Bah.  Forget 3.10 horrid behavior, Manfred's patches went into 3.11, not
3.10.. so 3.10-longterm has stable horrid behavior.

Sometimes stability doesn't sound all that wonderful.

-Mike


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RT]  hwlat-detector: Don't ignore threshold module parameter
  2013-08-30  5:57 ` [PATCH RT] hwlat-detector: Don't ignore threshold module parameter Mike Galbraith
  2013-08-30 14:39   ` [PATCH RT] rt,ipc,sem: fix -rt livelock Mike Galbraith
  2013-09-12 18:23   ` [PATCH RT] hwlat-detector: Don't ignore threshold module parameter Steven Rostedt
@ 2013-10-04 10:30   ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-10-04 10:30 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-rt-users, Steven Rostedt, Thomas Gleixner

* Mike Galbraith | 2013-08-30 07:57:25 [+0200]:

>--- a/drivers/misc/hwlat_detector.c	2013-08-30 07:16:05.387959392 +0200
>+++ b/drivers/misc/hwlat_detector.c	2013-08-30 07:10:52.958670183 +0200
>@@ -413,7 +413,7 @@ static int init_stats(void)
> 		goto out;
> 
> 	__reset_stats();
>-	data.threshold = DEFAULT_LAT_THRESHOLD;	    /* threshold us */
>+	data.threshold = threshold ?: DEFAULT_LAT_THRESHOLD;	    /* threshold us */
> 	data.sample_window = DEFAULT_SAMPLE_WINDOW; /* window us */
> 	data.sample_width = DEFAULT_SAMPLE_WIDTH;   /* width us */

Interesting. According to my history, the parameter has been used
before.
Applied, thanks.

Sebastian

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2013-10-04 10:30 UTC | newest]

Thread overview: 20+ messages
2013-08-19 21:33 [PATCH RT 0/3] hwlat-detector: Have it actually find hardware latency Steven Rostedt
2013-08-19 21:33 ` [PATCH RT 1/3] hwlat-detector: Update hwlat_detector to add outer loop detection Steven Rostedt
2013-08-19 21:33 ` [PATCH RT 2/3] hwlat-detector: Use trace_clock_local if available Steven Rostedt
2013-08-19 21:33 ` [PATCH RT 3/3] hwlat-detector: Use thread instead of stop machine Steven Rostedt
2013-08-21 15:59 ` [PATCH RT 0/3] hwlat-detector: Have it actually find hardware latency Sebastian Andrzej Siewior
2013-08-30  5:57 ` [PATCH RT] hwlat-detector: Don't ignore threshold module parameter Mike Galbraith
2013-08-30 14:39   ` [PATCH RT] rt,ipc,sem: fix -rt livelock Mike Galbraith
2013-08-31  5:27     ` Mike Galbraith
2013-09-06  7:08       ` Mike Galbraith
2013-09-10  6:30         ` Mike Galbraith
2013-09-11 14:03           ` Manfred Spraul
2013-09-12  7:40             ` Mike Galbraith
2013-09-12 19:15               ` Steven Rostedt
2013-09-13  3:20                 ` Mike Galbraith
2013-09-13  3:33                   ` Steven Rostedt
2013-09-13  4:48                     ` Mike Galbraith
2013-09-13  4:36                   ` Mike Galbraith
2013-09-13 12:56                     ` Mike Galbraith
2013-09-12 18:23   ` [PATCH RT] hwlat-detector: Don't ignore threshold module parameter Steven Rostedt
2013-10-04 10:30   ` Sebastian Andrzej Siewior
