linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC, PATCH 1/2] qrcu: "quick" srcu implementation
@ 2006-11-29 23:53 Oleg Nesterov
  2006-11-30  1:57 ` Oleg Nesterov
  2006-12-01  0:14 ` Oleg Nesterov
  0 siblings, 2 replies; 8+ messages in thread
From: Oleg Nesterov @ 2006-11-29 23:53 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Paul E. McKenney, Alan Stern, Josh Triplett, linux-kernel

Very much based on ideas, corrections, and patient explanations from
Alan and Paul.

The current srcu implementation is very good for readers, lock/unlock
are extremely cheap. But for that reason it is not possible to avoid
synchronize_sched() and polling in synchronize_srcu().

Jens Axboe wrote:
>
> It works for me, but the overhead is still large. Before it would take
> 8-12 jiffies for a synchronize_srcu() to complete without there actually
> being any reader locks active, now it takes 2-3 jiffies. So it's
> definitely faster, and as suspected the loss of two of three
> synchronize_sched() cut down the overhead to a third.

'qrcu' behaves the same as srcu but optimized for writers. The fast path
for synchronize_qrcu() is mutex_lock() + atomic_read() + mutex_unlock().
The slow path is __wait_event(), no polling. However, the reader does
atomic inc/dec on lock/unlock, and the counters are not per-cpu.

Also, unlike srcu, qrcu read lock/unlock can be used in interrupt context,
and 'qrcu_struct' can be compile-time initialized.

See also (a long) discussion:
	http://marc.theaimsgroup.com/?t=116370857600003

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>

--- 19-rc6/include/linux/srcu.h~1_qrcu	2006-11-17 19:42:31.000000000 +0300
+++ 19-rc6/include/linux/srcu.h	2006-11-29 20:22:37.000000000 +0300
@@ -27,6 +27,8 @@
 #ifndef _LINUX_SRCU_H
 #define _LINUX_SRCU_H
 
+#include <linux/wait.h>
+
 struct srcu_struct_array {
 	int c[2];
 };
@@ -50,4 +52,24 @@ void srcu_read_unlock(struct srcu_struct
 void synchronize_srcu(struct srcu_struct *sp);
 long srcu_batches_completed(struct srcu_struct *sp);
 
+/*
+ * fully compatible with srcu, but optimized for writers.
+ */
+
+struct qrcu_struct {
+	int completed;
+	atomic_t ctr[2];
+	wait_queue_head_t wq;
+	struct mutex mutex;
+};
+
+int init_qrcu_struct(struct qrcu_struct *qp);
+int qrcu_read_lock(struct qrcu_struct *qp);
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx);
+void synchronize_qrcu(struct qrcu_struct *qp);
+
+static inline void cleanup_qrcu_struct(struct qrcu_struct *qp)
+{
+}
+
 #endif
--- 19-rc6/kernel/srcu.c~1_qrcu	2006-11-17 19:42:31.000000000 +0300
+++ 19-rc6/kernel/srcu.c	2006-11-29 20:09:49.000000000 +0300
@@ -256,3 +256,55 @@ EXPORT_SYMBOL_GPL(srcu_read_unlock);
 EXPORT_SYMBOL_GPL(synchronize_srcu);
 EXPORT_SYMBOL_GPL(srcu_batches_completed);
 EXPORT_SYMBOL_GPL(srcu_readers_active);
+
+int init_qrcu_struct(struct qrcu_struct *qp)
+{
+	qp->completed = 0;
+	atomic_set(qp->ctr + 0, 1);
+	atomic_set(qp->ctr + 1, 0);
+	init_waitqueue_head(&qp->wq);
+	mutex_init(&qp->mutex);
+
+	return 0;
+}
+
+int qrcu_read_lock(struct qrcu_struct *qp)
+{
+	for (;;) {
+		int idx = qp->completed & 0x1;
+		if (likely(atomic_inc_not_zero(qp->ctr + idx)))
+			return idx;
+	}
+}
+
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx)
+{
+	if (atomic_dec_and_test(qp->ctr + idx))
+		wake_up(&qp->wq);
+}
+
+void synchronize_qrcu(struct qrcu_struct *qp)
+{
+	int idx;
+
+	smp_mb();
+	mutex_lock(&qp->mutex);
+
+	idx = qp->completed & 0x1;
+	if (atomic_read(qp->ctr + idx) == 1)
+		goto out;
+
+	atomic_inc(qp->ctr + (idx ^ 0x1));
+	qp->completed++;
+
+	atomic_dec(qp->ctr + idx);
+	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
+out:
+	mutex_unlock(&qp->mutex);
+	smp_mb();
+}
+
+EXPORT_SYMBOL_GPL(init_qrcu_struct);
+EXPORT_SYMBOL_GPL(qrcu_read_lock);
+EXPORT_SYMBOL_GPL(qrcu_read_unlock);
+EXPORT_SYMBOL_GPL(synchronize_qrcu);


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC, PATCH 1/2] qrcu: "quick" srcu implementation
  2006-11-29 23:53 [RFC, PATCH 1/2] qrcu: "quick" srcu implementation Oleg Nesterov
@ 2006-11-30  1:57 ` Oleg Nesterov
  2006-11-30  2:46   ` Paul E. McKenney
  2006-12-01  0:14 ` Oleg Nesterov
  1 sibling, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2006-11-30  1:57 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Paul E. McKenney, Alan Stern, Josh Triplett, linux-kernel

(the same patch + comments from Paul)

[RFC, PATCH 1/2] qrcu: "quick" srcu implementation

Very much based on ideas, corrections, and patient explanations from
Alan and Paul.

The current srcu implementation is very good for readers, lock/unlock
are extremely cheap. But for that reason it is not possible to avoid
synchronize_sched() and polling in synchronize_srcu().

Jens Axboe wrote:
>
> It works for me, but the overhead is still large. Before it would take
> 8-12 jiffies for a synchronize_srcu() to complete without there actually
> being any reader locks active, now it takes 2-3 jiffies. So it's
> definitely faster, and as suspected the loss of two of three
> synchronize_sched() cut down the overhead to a third.

'qrcu' behaves the same as srcu but optimized for writers. The fast path
for synchronize_qrcu() is mutex_lock() + atomic_read() + mutex_unlock().
The slow path is __wait_event(), no polling. However, the reader does
atomic inc/dec on lock/unlock, and the counters are not per-cpu.

Also, unlike srcu, qrcu read lock/unlock can be used in interrupt context,
and 'qrcu_struct' can be compile-time initialized.

See also (a long) discussion:
	http://marc.theaimsgroup.com/?t=116370857600003

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>

--- 19-rc6/include/linux/srcu.h~1_qrcu	2006-10-22 18:24:03.000000000 +0400
+++ 19-rc6/include/linux/srcu.h	2006-11-30 04:32:42.000000000 +0300
@@ -27,6 +27,8 @@
 #ifndef _LINUX_SRCU_H
 #define _LINUX_SRCU_H
 
+#include <linux/wait.h>
+
 struct srcu_struct_array {
 	int c[2];
 };
@@ -50,4 +52,32 @@ void srcu_read_unlock(struct srcu_struct
 void synchronize_srcu(struct srcu_struct *sp);
 long srcu_batches_completed(struct srcu_struct *sp);
 
+/*
+ * fully compatible with srcu, but optimized for writers.
+ */
+
+struct qrcu_struct {
+	int completed;
+	atomic_t ctr[2];
+	wait_queue_head_t wq;
+	struct mutex mutex;
+};
+
+int init_qrcu_struct(struct qrcu_struct *qp);
+int qrcu_read_lock(struct qrcu_struct *qp);
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx);
+void synchronize_qrcu(struct qrcu_struct *qp);
+
+/**
+ * cleanup_qrcu_struct - deconstruct a quick-RCU structure
+ * @qp: structure to clean up.
+ *
+ * Must invoke this after you are finished using a given qrcu_struct that
+ * was initialized via init_qrcu_struct().  We reserve the right to
+ * leak memory should you fail to do this!
+ */
+static inline void cleanup_qrcu_struct(struct qrcu_struct *qp)
+{
+}
+
 #endif
--- 19-rc6/kernel/srcu.c~1_qrcu	2006-10-22 18:24:03.000000000 +0400
+++ 19-rc6/kernel/srcu.c	2006-11-30 04:39:53.000000000 +0300
@@ -256,3 +256,94 @@ EXPORT_SYMBOL_GPL(srcu_read_unlock);
 EXPORT_SYMBOL_GPL(synchronize_srcu);
 EXPORT_SYMBOL_GPL(srcu_batches_completed);
 EXPORT_SYMBOL_GPL(srcu_readers_active);
+
+/**
+ * init_qrcu_struct - initialize a quick-RCU structure.
+ * @qp: structure to initialize.
+ *
+ * Must invoke this on a given qrcu_struct before passing that qrcu_struct
+ * to any other function.  Each qrcu_struct represents a separate domain
+ * of QRCU protection.
+ */
+int init_qrcu_struct(struct qrcu_struct *qp)
+{
+	qp->completed = 0;
+	atomic_set(qp->ctr + 0, 1);
+	atomic_set(qp->ctr + 1, 0);
+	init_waitqueue_head(&qp->wq);
+	mutex_init(&qp->mutex);
+
+	return 0;
+}
+
+/**
+ * qrcu_read_lock - register a new reader for an QRCU-protected structure.
+ * @qp: qrcu_struct in which to register the new reader.
+ *
+ * Counts the new reader in the appropriate element of the qrcu_struct.
+ * Returns an index that must be passed to the matching qrcu_read_unlock().
+ */
+int qrcu_read_lock(struct qrcu_struct *qp)
+{
+	for (;;) {
+		int idx = qp->completed & 0x1;
+		if (likely(atomic_inc_not_zero(qp->ctr + idx)))
+			return idx;
+	}
+}
+
+/**
+ * qrcu_read_unlock - unregister a old reader from an QRCU-protected structure.
+ * @qp: qrcu_struct in which to unregister the old reader.
+ * @idx: return value from corresponding qrcu_read_lock().
+ *
+ * Removes the count for the old reader from the appropriate element of
+ * the qrcu_struct.
+ */
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx)
+{
+	if (atomic_dec_and_test(qp->ctr + idx))
+		wake_up(&qp->wq);
+}
+
+/**
+ * synchronize_qrcu - wait for prior QRCU read-side critical-section completion
+ * @qp: qrcu_struct with which to synchronize.
+ *
+ * Flip the completed counter, and wait for the old count to drain to zero.
+ * As with classic RCU, the updater must use some separate means of
+ * synchronizing concurrent updates.  Can block; must be called from
+ * process context.
+ *
+ * Note that it is illegal to call synchronize_qrcu() from the corresponding
+ * QRCU read-side critical section; doing so will result in deadlock.
+ * However, it is perfectly legal to call synchronize_qrcu() on one
+ * qrcu_struct from some other qrcu_struct's read-side critical section.
+ */
+void synchronize_qrcu(struct qrcu_struct *qp)
+{
+	int idx;
+
+	smp_mb();
+	mutex_lock(&qp->mutex);
+
+	idx = qp->completed & 0x1;
+	if (atomic_read(qp->ctr + idx) == 1)
+		goto out;
+
+	atomic_inc(qp->ctr + (idx ^ 0x1));
+	/* Reduce the likelihood that qrcu_read_lock() will loop */
+	smp_mb__after_atomic_inc();
+	qp->completed++;
+
+	atomic_dec(qp->ctr + idx);
+	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
+out:
+	mutex_unlock(&qp->mutex);
+	smp_mb();
+}
+
+EXPORT_SYMBOL_GPL(init_qrcu_struct);
+EXPORT_SYMBOL_GPL(qrcu_read_lock);
+EXPORT_SYMBOL_GPL(qrcu_read_unlock);
+EXPORT_SYMBOL_GPL(synchronize_qrcu);


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC, PATCH 1/2] qrcu: "quick" srcu implementation
  2006-11-30  1:57 ` Oleg Nesterov
@ 2006-11-30  2:46   ` Paul E. McKenney
  2006-11-30  3:22     ` Oleg Nesterov
  0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2006-11-30  2:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Jens Axboe, Paul E. McKenney, Alan Stern,
	Josh Triplett, linux-kernel

On Thu, Nov 30, 2006 at 04:57:14AM +0300, Oleg Nesterov wrote:
> (the same patch + comments from Paul)
> 
> [RFC, PATCH 1/2] qrcu: "quick" srcu implementation
> 
> Very much based on ideas, corrections, and patient explanations from
> Alan and Paul.
> 
> The current srcu implementation is very good for readers, lock/unlock
> are extremely cheap. But for that reason it is not possible to avoid
> synchronize_sched() and polling in synchronize_srcu().
> 
> Jens Axboe wrote:
> >
> > It works for me, but the overhead is still large. Before it would take
> > 8-12 jiffies for a synchronize_srcu() to complete without there actually
> > being any reader locks active, now it takes 2-3 jiffies. So it's
> > definitely faster, and as suspected the loss of two of three
> > synchronize_sched() cut down the overhead to a third.
> 
> 'qrcu' behaves the same as srcu but optimized for writers. The fast path
> for synchronize_qrcu() is mutex_lock() + atomic_read() + mutex_unlock().
> The slow path is __wait_event(), no polling. However, the reader does
> atomic inc/dec on lock/unlock, and the counters are not per-cpu.
> 
> Also, unlike srcu, qrcu read lock/unlock can be used in interrupt context,
> and 'qrcu_struct' can be compile-time initialized.
> 
> See also (a long) discussion:
> 	http://marc.theaimsgroup.com/?t=116370857600003

With the addition of a comment for the smp_mb() at the beginning of
synchronize_qrcu(), shown below:

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
> 
> --- 19-rc6/include/linux/srcu.h~1_qrcu	2006-10-22 18:24:03.000000000 +0400
> +++ 19-rc6/include/linux/srcu.h	2006-11-30 04:32:42.000000000 +0300
> @@ -27,6 +27,8 @@
>  #ifndef _LINUX_SRCU_H
>  #define _LINUX_SRCU_H
> 
> +#include <linux/wait.h>
> +
>  struct srcu_struct_array {
>  	int c[2];
>  };
> @@ -50,4 +52,32 @@ void srcu_read_unlock(struct srcu_struct
>  void synchronize_srcu(struct srcu_struct *sp);
>  long srcu_batches_completed(struct srcu_struct *sp);
> 
> +/*
> + * fully compatible with srcu, but optimized for writers.
> + */
> +
> +struct qrcu_struct {
> +	int completed;
> +	atomic_t ctr[2];
> +	wait_queue_head_t wq;
> +	struct mutex mutex;
> +};
> +
> +int init_qrcu_struct(struct qrcu_struct *qp);
> +int qrcu_read_lock(struct qrcu_struct *qp);
> +void qrcu_read_unlock(struct qrcu_struct *qp, int idx);
> +void synchronize_qrcu(struct qrcu_struct *qp);
> +
> +/**
> + * cleanup_qrcu_struct - deconstruct a quick-RCU structure
> + * @qp: structure to clean up.
> + *
> + * Must invoke this after you are finished using a given qrcu_struct that
> + * was initialized via init_qrcu_struct().  We reserve the right to
> + * leak memory should you fail to do this!
> + */
> +static inline void cleanup_qrcu_struct(struct qrcu_struct *qp)
> +{
> +}
> +
>  #endif
> --- 19-rc6/kernel/srcu.c~1_qrcu	2006-10-22 18:24:03.000000000 +0400
> +++ 19-rc6/kernel/srcu.c	2006-11-30 04:39:53.000000000 +0300
> @@ -256,3 +256,94 @@ EXPORT_SYMBOL_GPL(srcu_read_unlock);
>  EXPORT_SYMBOL_GPL(synchronize_srcu);
>  EXPORT_SYMBOL_GPL(srcu_batches_completed);
>  EXPORT_SYMBOL_GPL(srcu_readers_active);
> +
> +/**
> + * init_qrcu_struct - initialize a quick-RCU structure.
> + * @qp: structure to initialize.
> + *
> + * Must invoke this on a given qrcu_struct before passing that qrcu_struct
> + * to any other function.  Each qrcu_struct represents a separate domain
> + * of QRCU protection.
> + */
> +int init_qrcu_struct(struct qrcu_struct *qp)
> +{
> +	qp->completed = 0;
> +	atomic_set(qp->ctr + 0, 1);
> +	atomic_set(qp->ctr + 1, 0);
> +	init_waitqueue_head(&qp->wq);
> +	mutex_init(&qp->mutex);
> +
> +	return 0;
> +}
> +
> +/**
> + * qrcu_read_lock - register a new reader for an QRCU-protected structure.
> + * @qp: qrcu_struct in which to register the new reader.
> + *
> + * Counts the new reader in the appropriate element of the qrcu_struct.
> + * Returns an index that must be passed to the matching qrcu_read_unlock().
> + */
> +int qrcu_read_lock(struct qrcu_struct *qp)
> +{
> +	for (;;) {
> +		int idx = qp->completed & 0x1;
> +		if (likely(atomic_inc_not_zero(qp->ctr + idx)))
> +			return idx;
> +	}
> +}
> +
> +/**
> + * qrcu_read_unlock - unregister a old reader from an QRCU-protected structure.
> + * @qp: qrcu_struct in which to unregister the old reader.
> + * @idx: return value from corresponding qrcu_read_lock().
> + *
> + * Removes the count for the old reader from the appropriate element of
> + * the qrcu_struct.
> + */
> +void qrcu_read_unlock(struct qrcu_struct *qp, int idx)
> +{
> +	if (atomic_dec_and_test(qp->ctr + idx))
> +		wake_up(&qp->wq);
> +}
> +
> +/**
> + * synchronize_qrcu - wait for prior QRCU read-side critical-section completion
> + * @qp: qrcu_struct with which to synchronize.
> + *
> + * Flip the completed counter, and wait for the old count to drain to zero.
> + * As with classic RCU, the updater must use some separate means of
> + * synchronizing concurrent updates.  Can block; must be called from
> + * process context.
> + *
> + * Note that it is illegal to call synchronize_qrcu() from the corresponding
> + * QRCU read-side critical section; doing so will result in deadlock.
> + * However, it is perfectly legal to call synchronize_qrcu() on one
> + * qrcu_struct from some other qrcu_struct's read-side critical section.
> + */
> +void synchronize_qrcu(struct qrcu_struct *qp)
> +{
> +	int idx;

	/*
	 * The following memory barrier is needed to ensure that
	 * any prior data-structure manipulation is seen by other
	 * CPUs to happen before picking up the value of
	 * qp->completed.
	 */

> +	smp_mb();
> +	mutex_lock(&qp->mutex);
> +
> +	idx = qp->completed & 0x1;
> +	if (atomic_read(qp->ctr + idx) == 1)
> +		goto out;
> +
> +	atomic_inc(qp->ctr + (idx ^ 0x1));
> +	/* Reduce the likelihood that qrcu_read_lock() will loop */
> +	smp_mb__after_atomic_inc();
> +	qp->completed++;
> +
> +	atomic_dec(qp->ctr + idx);
> +	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> +out:
> +	mutex_unlock(&qp->mutex);

	/*
	 * The following memory barrier is needed to ensure that
	 * and subsequent freeing of data elements previously
	 * removed is seen by other CPUs after the wait completes.
	 */

Hmmm...  Now I am wondering if the memory barriers inherent in the
__wait_event() suffice for this last barrier...  :-/  Thoughts?

> +	smp_mb();
> +}
> +
> +EXPORT_SYMBOL_GPL(init_qrcu_struct);
> +EXPORT_SYMBOL_GPL(qrcu_read_lock);
> +EXPORT_SYMBOL_GPL(qrcu_read_unlock);
> +EXPORT_SYMBOL_GPL(synchronize_qrcu);
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC, PATCH 1/2] qrcu: "quick" srcu implementation
  2006-11-30  2:46   ` Paul E. McKenney
@ 2006-11-30  3:22     ` Oleg Nesterov
  2006-11-30  3:37       ` Oleg Nesterov
  0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2006-11-30  3:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Jens Axboe, Alan Stern, Josh Triplett, linux-kernel

On 11/29, Paul E. McKenney wrote:
>
> On Thu, Nov 30, 2006 at 04:57:14AM +0300, Oleg Nesterov wrote:
> > (the same patch + comments from Paul)
> > 
> With the addition of a comment for the smp_mb() at the beginning of
> synchronize_qrcu(), shown below:
> 
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Thanks!

> 	/*
> 	 * The following memory barrier is needed to ensure that
> 	 * and subsequent freeing of data elements previously
> 	 * removed is seen by other CPUs after the wait completes.
> 	 */

I think we have another reason for mb(), but I can't suggest a clear
comment.

	struct data {
		...
		int in_use;
		...
	}

	void free_data(struct data *p)
	{
		BUG_ON(p->in_use);
		kfree(p);
	}

	struct data *DATA;

Reader:

	qrcu_read_lock();
	data = rcu_dereference(DATA);

	data->in_use = 1;
	do_something(data);
	data->in_use = 0;

	qrcu_read_unlock();

Writer:

	old = DATA;
	DATA = alloc_new_data();
	
	synchronize_qrcu();
	free_data(old);

qrcu_read_unlock() does (implicit) mb() on reader's side, but we must pair
it on our side, otherwise we can't be sure (of course, _only_ in theory) we
are seeing all the changes (->in_use == 0) made by the reader.

> Hmmm...  Now I am wondering if the memory barriers inherent in the
> __wait_event() suffice for this last barrier...  :-/  Thoughts?
> 
> > +	smp_mb();

Fastpath skips __wait_event(), and it is possible that the reader does
lock/unlock between the first 'mb()' and 'if (atomic_read() == 1)'.

Oleg.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC, PATCH 1/2] qrcu: "quick" srcu implementation
  2006-11-30  3:22     ` Oleg Nesterov
@ 2006-11-30  3:37       ` Oleg Nesterov
  2006-11-30 17:05         ` Paul E. McKenney
  0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2006-11-30  3:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Jens Axboe, Alan Stern, Josh Triplett, linux-kernel

On 11/30, Oleg Nesterov wrote:
>
> On 11/29, Paul E. McKenney wrote:
> >
> > Hmmm...  Now I am wondering if the memory barriers inherent in the
> > __wait_event() suffice for this last barrier...  :-/  Thoughts?
> > 
> > > +	smp_mb();
> 
> Fastpath skips __wait_event(), and it is possible that the reader does
> lock/unlock between the first 'mb()' and 'if (atomic_read() == 1)'.

In fact, a slow path needs (I think) it too. We can have an unrelated
wakeup, and then the reader does unlock() before we check !atomic_read()
in the __wait_event()'s loop. The reader removes us from ->wq, in that
case finish_wait() does nothing.

Oleg.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC, PATCH 1/2] qrcu: "quick" srcu implementation
  2006-11-30  3:37       ` Oleg Nesterov
@ 2006-11-30 17:05         ` Paul E. McKenney
  0 siblings, 0 replies; 8+ messages in thread
From: Paul E. McKenney @ 2006-11-30 17:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Andrew Morton, Jens Axboe, Alan Stern,
	Josh Triplett, linux-kernel

On Thu, Nov 30, 2006 at 06:37:57AM +0300, Oleg Nesterov wrote:
> On 11/30, Oleg Nesterov wrote:
> >
> > On 11/29, Paul E. McKenney wrote:
> > >
> > > Hmmm...  Now I am wondering if the memory barriers inherent in the
> > > __wait_event() suffice for this last barrier...  :-/  Thoughts?
> > >
> > > > +	smp_mb();
> >
> > Fastpath skips __wait_event(), and it is possible that the reader does
> > lock/unlock between the first 'mb()' and 'if (atomic_read() == 1)'.
> 
> In fact, a slow path needs (I think) it too. We can have an unrelated
> wakeup, and then the reader does unlock() before we check !atomic_read()
> in the __wait_event()'s loop. The reader removes us from ->wq, in that
> case finish_wait() does nothing.

Good point -- I was forgetting about the fastpath checks in __wait_event().

How about something like this?

	/*
	 * The following memory barrier is needed to perserve ordering
	 * in the case where __wait_event() follows its fastpath,
	 * which includes neither locks nor memory barriers.
	 */

						Thanx, Paul

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC, PATCH 1/2] qrcu: "quick" srcu implementation
  2006-11-29 23:53 [RFC, PATCH 1/2] qrcu: "quick" srcu implementation Oleg Nesterov
  2006-11-30  1:57 ` Oleg Nesterov
@ 2006-12-01  0:14 ` Oleg Nesterov
  2006-12-07 17:16   ` Paul E. McKenney
  1 sibling, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2006-12-01  0:14 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Paul E. McKenney, Alan Stern, Josh Triplett, linux-kernel

Hopefully the final version: added a comment (thanks Paul!) for the
2-nd smp_mb().

[RFC, PATCH 1/2] qrcu: "quick" srcu implementation

Very much based on ideas, corrections, and patient explanations from
Alan and Paul.

The current srcu implementation is very good for readers, lock/unlock
are extremely cheap. But for that reason it is not possible to avoid
synchronize_sched() and polling in synchronize_srcu().

Jens Axboe wrote:
>
> It works for me, but the overhead is still large. Before it would take
> 8-12 jiffies for a synchronize_srcu() to complete without there actually
> being any reader locks active, now it takes 2-3 jiffies. So it's
> definitely faster, and as suspected the loss of two of three
> synchronize_sched() cut down the overhead to a third.

'qrcu' behaves the same as srcu but optimized for writers. The fast path
for synchronize_qrcu() is mutex_lock() + atomic_read() + mutex_unlock().
The slow path is __wait_event(), no polling. However, the reader does
atomic inc/dec on lock/unlock, and the counters are not per-cpu.

Also, unlike srcu, qrcu read lock/unlock can be used in interrupt context,
and 'qrcu_struct' can be compile-time initialized.

See also (a long) discussion:
	http://marc.theaimsgroup.com/?t=116370857600003

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>

--- 19-rc6/include/linux/srcu.h~1_qrcu	2006-10-22 18:24:03.000000000 +0400
+++ 19-rc6/include/linux/srcu.h	2006-11-30 04:32:42.000000000 +0300
@@ -27,6 +27,8 @@
 #ifndef _LINUX_SRCU_H
 #define _LINUX_SRCU_H
 
+#include <linux/wait.h>
+
 struct srcu_struct_array {
 	int c[2];
 };
@@ -50,4 +52,32 @@ void srcu_read_unlock(struct srcu_struct
 void synchronize_srcu(struct srcu_struct *sp);
 long srcu_batches_completed(struct srcu_struct *sp);
 
+/*
+ * fully compatible with srcu, but optimized for writers.
+ */
+
+struct qrcu_struct {
+	int completed;
+	atomic_t ctr[2];
+	wait_queue_head_t wq;
+	struct mutex mutex;
+};
+
+int init_qrcu_struct(struct qrcu_struct *qp);
+int qrcu_read_lock(struct qrcu_struct *qp);
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx);
+void synchronize_qrcu(struct qrcu_struct *qp);
+
+/**
+ * cleanup_qrcu_struct - deconstruct a quick-RCU structure
+ * @qp: structure to clean up.
+ *
+ * Must invoke this after you are finished using a given qrcu_struct that
+ * was initialized via init_qrcu_struct().  We reserve the right to
+ * leak memory should you fail to do this!
+ */
+static inline void cleanup_qrcu_struct(struct qrcu_struct *qp)
+{
+}
+
 #endif
--- 19-rc6/kernel/srcu.c~1_qrcu	2006-10-22 18:24:03.000000000 +0400
+++ 19-rc6/kernel/srcu.c	2006-12-01 01:54:30.000000000 +0300
@@ -256,3 +256,108 @@ EXPORT_SYMBOL_GPL(srcu_read_unlock);
 EXPORT_SYMBOL_GPL(synchronize_srcu);
 EXPORT_SYMBOL_GPL(srcu_batches_completed);
 EXPORT_SYMBOL_GPL(srcu_readers_active);
+
+/**
+ * init_qrcu_struct - initialize a quick-RCU structure.
+ * @qp: structure to initialize.
+ *
+ * Must invoke this on a given qrcu_struct before passing that qrcu_struct
+ * to any other function.  Each qrcu_struct represents a separate domain
+ * of QRCU protection.
+ */
+int init_qrcu_struct(struct qrcu_struct *qp)
+{
+	qp->completed = 0;
+	atomic_set(qp->ctr + 0, 1);
+	atomic_set(qp->ctr + 1, 0);
+	init_waitqueue_head(&qp->wq);
+	mutex_init(&qp->mutex);
+
+	return 0;
+}
+
+/**
+ * qrcu_read_lock - register a new reader for an QRCU-protected structure.
+ * @qp: qrcu_struct in which to register the new reader.
+ *
+ * Counts the new reader in the appropriate element of the qrcu_struct.
+ * Returns an index that must be passed to the matching qrcu_read_unlock().
+ */
+int qrcu_read_lock(struct qrcu_struct *qp)
+{
+	for (;;) {
+		int idx = qp->completed & 0x1;
+		if (likely(atomic_inc_not_zero(qp->ctr + idx)))
+			return idx;
+	}
+}
+
+/**
+ * qrcu_read_unlock - unregister a old reader from an QRCU-protected structure.
+ * @qp: qrcu_struct in which to unregister the old reader.
+ * @idx: return value from corresponding qrcu_read_lock().
+ *
+ * Removes the count for the old reader from the appropriate element of
+ * the qrcu_struct.
+ */
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx)
+{
+	if (atomic_dec_and_test(qp->ctr + idx))
+		wake_up(&qp->wq);
+}
+
+/**
+ * synchronize_qrcu - wait for prior QRCU read-side critical-section completion
+ * @qp: qrcu_struct with which to synchronize.
+ *
+ * Flip the completed counter, and wait for the old count to drain to zero.
+ * As with classic RCU, the updater must use some separate means of
+ * synchronizing concurrent updates.  Can block; must be called from
+ * process context.
+ *
+ * Note that it is illegal to call synchronize_qrcu() from the corresponding
+ * QRCU read-side critical section; doing so will result in deadlock.
+ * However, it is perfectly legal to call synchronize_qrcu() on one
+ * qrcu_struct from some other qrcu_struct's read-side critical section.
+ */
+void synchronize_qrcu(struct qrcu_struct *qp)
+{
+	int idx;
+
+	/*
+	 * The following memory barrier is needed to ensure that
+	 * any prior data-structure manipulation is seen by other
+	 * CPUs to happen before picking up the value of
+	 * qp->completed.
+	 */
+	smp_mb();
+	mutex_lock(&qp->mutex);
+
+	idx = qp->completed & 0x1;
+	if (atomic_read(qp->ctr + idx) == 1)
+		goto out;
+
+	atomic_inc(qp->ctr + (idx ^ 0x1));
+	/* Reduce the likelihood that qrcu_read_lock() will loop */
+	smp_mb__after_atomic_inc();
+	qp->completed++;
+
+	atomic_dec(qp->ctr + idx);
+	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
+out:
+	mutex_unlock(&qp->mutex);
+	smp_mb();
+	/*
+	 * The above smp_mb() is needed in the case that we
+	 * see the counter reaching zero, so that we do not
+	 * need to block.  In this case, we need to make
+	 * sure that the CPU does not re-order any subsequent
+	 * changes made by the caller to occur prior to the
+	 * test, as seen by other CPUs.
+	 */
+}
+
+EXPORT_SYMBOL_GPL(init_qrcu_struct);
+EXPORT_SYMBOL_GPL(qrcu_read_lock);
+EXPORT_SYMBOL_GPL(qrcu_read_unlock);
+EXPORT_SYMBOL_GPL(synchronize_qrcu);


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC, PATCH 1/2] qrcu: "quick" srcu implementation
  2006-12-01  0:14 ` Oleg Nesterov
@ 2006-12-07 17:16   ` Paul E. McKenney
  0 siblings, 0 replies; 8+ messages in thread
From: Paul E. McKenney @ 2006-12-07 17:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Jens Axboe, Alan Stern, Josh Triplett, linux-kernel

On Fri, Dec 01, 2006 at 03:14:05AM +0300, Oleg Nesterov wrote:
> Hopefully the final version: added a comment (thanks Paul!) for the
> 2-nd smp_mb().
> 
> [RFC, PATCH 1/2] qrcu: "quick" srcu implementation
> 
> Very much based on ideas, corrections, and patient explanations from
> Alan and Paul.
> 
> The current srcu implementation is very good for readers, lock/unlock
> are extremely cheap. But for that reason it is not possible to avoid
> synchronize_sched() and polling in synchronize_srcu().
> 
> Jens Axboe wrote:
> >
> > It works for me, but the overhead is still large. Before it would take
> > 8-12 jiffies for a synchronize_srcu() to complete without there actually
> > being any reader locks active, now it takes 2-3 jiffies. So it's
> > definitely faster, and as suspected the loss of two of three
> > synchronize_sched() cut down the overhead to a third.
> 
> 'qrcu' behaves the same as srcu but optimized for writers. The fast path
> for synchronize_qrcu() is mutex_lock() + atomic_read() + mutex_unlock().
> The slow path is __wait_event(), no polling. However, the reader does
> atomic inc/dec on lock/unlock, and the counters are not per-cpu.
> 
> Also, unlike srcu, qrcu read lock/unlock can be used in interrupt context,
> and 'qrcu_struct' can be compile-time initialized.
> 
> See also (a long) discussion:
> 	http://marc.theaimsgroup.com/?t=116370857600003

I was concerned about the fastpath on weak-memory machines, and this seems
to tie into the endless memory-ordering thread (more on this later).

Passed a 12-hour rcutorture run on a ppc64 machine (weak memory-ordering
model), and my Alpha docs claim this works, so...

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
> 
> --- 19-rc6/include/linux/srcu.h~1_qrcu	2006-10-22 18:24:03.000000000 +0400
> +++ 19-rc6/include/linux/srcu.h	2006-11-30 04:32:42.000000000 +0300
> @@ -27,6 +27,8 @@
>  #ifndef _LINUX_SRCU_H
>  #define _LINUX_SRCU_H
> 
> +#include <linux/wait.h>
> +
>  struct srcu_struct_array {
>  	int c[2];
>  };
> @@ -50,4 +52,32 @@ void srcu_read_unlock(struct srcu_struct
>  void synchronize_srcu(struct srcu_struct *sp);
>  long srcu_batches_completed(struct srcu_struct *sp);
> 
> +/*
> + * fully compatible with srcu, but optimized for writers.
> + */
> +
> +struct qrcu_struct {
> +	int completed;
> +	atomic_t ctr[2];
> +	wait_queue_head_t wq;
> +	struct mutex mutex;
> +};
> +
> +int init_qrcu_struct(struct qrcu_struct *qp);
> +int qrcu_read_lock(struct qrcu_struct *qp);
> +void qrcu_read_unlock(struct qrcu_struct *qp, int idx);
> +void synchronize_qrcu(struct qrcu_struct *qp);
> +
> +/**
> + * cleanup_qrcu_struct - deconstruct a quick-RCU structure
> + * @qp: structure to clean up.
> + *
> + * Must invoke this after you are finished using a given qrcu_struct that
> + * was initialized via init_qrcu_struct().  We reserve the right to
> + * leak memory should you fail to do this!
> + */
> +static inline void cleanup_qrcu_struct(struct qrcu_struct *qp)
> +{
> +}
> +
>  #endif
> --- 19-rc6/kernel/srcu.c~1_qrcu	2006-10-22 18:24:03.000000000 +0400
> +++ 19-rc6/kernel/srcu.c	2006-12-01 01:54:30.000000000 +0300
> @@ -256,3 +256,108 @@ EXPORT_SYMBOL_GPL(srcu_read_unlock);
>  EXPORT_SYMBOL_GPL(synchronize_srcu);
>  EXPORT_SYMBOL_GPL(srcu_batches_completed);
>  EXPORT_SYMBOL_GPL(srcu_readers_active);
> +
> +/**
> + * init_qrcu_struct - initialize a quick-RCU structure.
> + * @qp: structure to initialize.
> + *
> + * Must invoke this on a given qrcu_struct before passing that qrcu_struct
> + * to any other function.  Each qrcu_struct represents a separate domain
> + * of QRCU protection.
> + */
> +int init_qrcu_struct(struct qrcu_struct *qp)
> +{
> +	qp->completed = 0;
> +	atomic_set(qp->ctr + 0, 1);
> +	atomic_set(qp->ctr + 1, 0);
> +	init_waitqueue_head(&qp->wq);
> +	mutex_init(&qp->mutex);
> +
> +	return 0;
> +}
> +
> +/**
> + * qrcu_read_lock - register a new reader for an QRCU-protected structure.
> + * @qp: qrcu_struct in which to register the new reader.
> + *
> + * Counts the new reader in the appropriate element of the qrcu_struct.
> + * Returns an index that must be passed to the matching qrcu_read_unlock().
> + */
> +int qrcu_read_lock(struct qrcu_struct *qp)
> +{
> +	for (;;) {
> +		int idx = qp->completed & 0x1;
> +		if (likely(atomic_inc_not_zero(qp->ctr + idx)))
> +			return idx;
> +	}
> +}
> +
> +/**
> + * qrcu_read_unlock - unregister a old reader from an QRCU-protected structure.
> + * @qp: qrcu_struct in which to unregister the old reader.
> + * @idx: return value from corresponding qrcu_read_lock().
> + *
> + * Removes the count for the old reader from the appropriate element of
> + * the qrcu_struct.
> + */
> +void qrcu_read_unlock(struct qrcu_struct *qp, int idx)
> +{
> +	if (atomic_dec_and_test(qp->ctr + idx))
> +		wake_up(&qp->wq);
> +}
> +
> +/**
> + * synchronize_qrcu - wait for prior QRCU read-side critical-section completion
> + * @qp: qrcu_struct with which to synchronize.
> + *
> + * Flip the completed counter, and wait for the old count to drain to zero.
> + * As with classic RCU, the updater must use some separate means of
> + * synchronizing concurrent updates.  Can block; must be called from
> + * process context.
> + *
> + * Note that it is illegal to call synchronize_qrcu() from the corresponding
> + * QRCU read-side critical section; doing so will result in deadlock.
> + * However, it is perfectly legal to call synchronize_qrcu() on one
> + * qrcu_struct from some other qrcu_struct's read-side critical section.
> + */
> +void synchronize_qrcu(struct qrcu_struct *qp)
> +{
> +	int idx;
> +
> +	/*
> +	 * The following memory barrier is needed to ensure that
> +	 * any prior data-structure manipulation is seen by other
> +	 * CPUs to happen before picking up the value of
> +	 * qp->completed.
> +	 */
> +	smp_mb();
> +	mutex_lock(&qp->mutex);
> +
> +	idx = qp->completed & 0x1;
> +	if (atomic_read(qp->ctr + idx) == 1)
> +		goto out;
> +
> +	atomic_inc(qp->ctr + (idx ^ 0x1));
> +	/* Reduce the likelihood that qrcu_read_lock() will loop */
> +	smp_mb__after_atomic_inc();
> +	qp->completed++;
> +
> +	atomic_dec(qp->ctr + idx);
> +	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> +out:
> +	mutex_unlock(&qp->mutex);
> +	smp_mb();
> +	/*
> +	 * The above smp_mb() is needed in the case that we
> +	 * see the counter reaching zero, so that we do not
> +	 * need to block.  In this case, we need to make
> +	 * sure that the CPU does not re-order any subsequent
> +	 * changes made by the caller to occur prior to the
> +	 * test, as seen by other CPUs.
> +	 */
> +}
> +
> +EXPORT_SYMBOL_GPL(init_qrcu_struct);
> +EXPORT_SYMBOL_GPL(qrcu_read_lock);
> +EXPORT_SYMBOL_GPL(qrcu_read_unlock);
> +EXPORT_SYMBOL_GPL(synchronize_qrcu);
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2006-12-07 17:15 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-11-29 23:53 [RFC, PATCH 1/2] qrcu: "quick" srcu implementation Oleg Nesterov
2006-11-30  1:57 ` Oleg Nesterov
2006-11-30  2:46   ` Paul E. McKenney
2006-11-30  3:22     ` Oleg Nesterov
2006-11-30  3:37       ` Oleg Nesterov
2006-11-30 17:05         ` Paul E. McKenney
2006-12-01  0:14 ` Oleg Nesterov
2006-12-07 17:16   ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).