* [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem
@ 2015-07-11 23:35 Oleg Nesterov
  2015-07-11 23:35 ` [PATCH 1/7] rcu: Create rcu_sync infrastructure Oleg Nesterov
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-11 23:35 UTC (permalink / raw)
  To: Linus Torvalds, Paul McKenney, Peter Zijlstra
  Cc: Daniel Wagner, Davidlohr Bueso, Ingo Molnar, Tejun Heo, linux-kernel

Hello,

Let me make another attempt to push rcu_sync and add a _simple_
improvement into percpu-rwsem. It already has another user (cgroups)
and I think it can have more. Peter has some use-cases. sb->s_writers
(which afaics is buggy btw) can be turned into percpu-rwsem too I think.

Linus, I am mostly trying to convince you. Nobody else objected so far.
Could you please comment?

Peter, if you agree with 5-7, can I add your Signed-off-by's ?

To me, the most annoying problem with percpu_rw_semaphore is
synchronize_sched_expedited() which is called twice by every
down_write/up_write. I think it would be really nice to avoid it.
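
To recap, the current write side does roughly the following (a
simplified sketch; the real code is what patch 5/7 removes):

	void percpu_down_write(struct percpu_rw_semaphore *brw)
	{
		atomic_inc(&brw->write_ctr);	/* switch readers to the slow path */
		synchronize_sched_expedited();	/* expedited GP #1 */
		down_write(&brw->rw_sem);	/* block new readers */
		/* ... fold the per-cpu counters and wait for old readers ... */
	}

	void percpu_up_write(struct percpu_rw_semaphore *brw)
	{
		up_write(&brw->rw_sem);
		synchronize_sched_expedited();	/* expedited GP #2 */
		atomic_dec(&brw->write_ctr);	/* re-enable the reader fast path */
	}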

Let's start with the simple test-case,

	#!/bin/bash

	perf probe -x /lib/libc.so.6 syscall

	for i in {1..1000}; do
		echo 1 >| /sys/kernel/debug/tracing/events/probe_libc/syscall/enable
		echo 0 >| /sys/kernel/debug/tracing/events/probe_libc/syscall/enable
	done

It needs ~ 13.5 seconds (2 CPUs, KVM). If we simply replace
synchronize_sched_expedited() with synchronize_sched() it takes
~ 67.5 seconds. This is not good.

With these patches it takes around 13.3 seconds again (a little
bit faster), and it doesn't use _expedited. synchronize_sched()
is called 1-2 (max 3) times on average. And now it does not
disturb the whole system.

And just in case, I also measured

	for (i = 0; i < 1000000; ++i) {
		percpu_down_write(&dup_mmap_sem);
		percpu_up_write(&dup_mmap_sem);
	}

and it runs more than 1.5 times faster (to remind, only 2 CPUs),
but this is not that interesting, I agree.

And note that the actual change in percpu-rwsem is really simple,
and imo it even makes the code simpler. (the last patch is off-
topic cleanup).

So the only complication is rcu_sync itself. But, rightly or not (I
am obviously biased), I believe this new rcu infrastructure is natural
and useful, and I think it can have more users too.

And. We can do more improvements in rcu_sync and percpu-rwsem, and
I don't only mean other optimizations from Peter. In particular, we
can extract the "wait for gp pass" from rcu_sync_enter() into another
helper, we can teach percpu_down_write() to allow multiple writers,
and more.

Oleg.

 include/linux/percpu-rwsem.h  |    3 +-
 include/linux/rcusync.h       |   57 +++++++++++++++
 kernel/locking/percpu-rwsem.c |   78 ++++++---------------
 kernel/rcu/Makefile           |    2 +-
 kernel/rcu/sync.c             |  152 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 235 insertions(+), 57 deletions(-)



* [PATCH 1/7] rcu: Create rcu_sync infrastructure
  2015-07-11 23:35 [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Oleg Nesterov
@ 2015-07-11 23:35 ` Oleg Nesterov
  2015-07-15 18:05   ` Paul E. McKenney
  2015-07-11 23:35 ` [PATCH 2/7] rcusync: Introduce struct rcu_sync_ops Oleg Nesterov
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-11 23:35 UTC (permalink / raw)
  To: Linus Torvalds, Paul McKenney, Peter Zijlstra
  Cc: Daniel Wagner, Davidlohr Bueso, Ingo Molnar, Tejun Heo, linux-kernel

It is functionally equivalent to

        struct rcu_sync_struct {
                atomic_t counter;
        };

        static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
        {
                return atomic_read(&rss->counter) == 0;
        }

        static inline void rcu_sync_enter(struct rcu_sync_struct *rss)
        {
                atomic_inc(&rss->counter);
                synchronize_sched();
        }

        static inline void rcu_sync_exit(struct rcu_sync_struct *rss)
        {
                synchronize_sched();
                atomic_dec(&rss->counter);
        }

except: it records the state and synchronize_sched() is only called by
rcu_sync_enter() and only if necessary.
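
For illustration only (this sketch is not part of the patch, the
names are made up), a typical usage pattern with the sched flavour
looks like this, with the reader-side check done under
preempt_disable():

        static DEFINE_RCU_SCHED_SYNC(my_sync);

        static void my_reader(void)
        {
                preempt_disable();      /* rcu-sched read-side section */
                if (rcu_sync_is_idle(&my_sync)) {
                        /* fast path: no writer is active or recently was */
                } else {
                        /* slow path: fall back to heavier synchronization */
                }
                preempt_enable();
        }

        static void my_writer(void)
        {
                rcu_sync_enter(&my_sync);  /* force readers onto the slow path */
                /* ... do the exclusive update ... */
                rcu_sync_exit(&my_sync);   /* re-enable the fast path after a GP */
        }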

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rcusync.h |   64 ++++++++++++++++++++++++++++
 kernel/rcu/Makefile     |    2 +-
 kernel/rcu/sync.c       |  108 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 173 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/rcusync.h
 create mode 100644 kernel/rcu/sync.c

diff --git a/include/linux/rcusync.h b/include/linux/rcusync.h
new file mode 100644
index 0000000..7858491
--- /dev/null
+++ b/include/linux/rcusync.h
@@ -0,0 +1,64 @@
+#ifndef _LINUX_RCUSYNC_H_
+#define _LINUX_RCUSYNC_H_
+
+#include <linux/wait.h>
+#include <linux/rcupdate.h>
+
+struct rcu_sync_struct {
+	int			gp_state;
+	int			gp_count;
+	wait_queue_head_t	gp_wait;
+
+	int			cb_state;
+	struct rcu_head		cb_head;
+
+	void (*sync)(void);
+	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+};
+
+#define ___RCU_SYNC_INIT(name)						\
+	.gp_state = 0,							\
+	.gp_count = 0,							\
+	.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),		\
+	.cb_state = 0
+
+#define __RCU_SCHED_SYNC_INIT(name) {					\
+	___RCU_SYNC_INIT(name),						\
+	.sync = synchronize_sched,					\
+	.call = call_rcu_sched,						\
+}
+
+#define __RCU_BH_SYNC_INIT(name) {					\
+	___RCU_SYNC_INIT(name),						\
+	.sync = synchronize_rcu_bh,					\
+	.call = call_rcu_bh,						\
+}
+
+#define __RCU_SYNC_INIT(name) {						\
+	___RCU_SYNC_INIT(name),						\
+	.sync = synchronize_rcu,					\
+	.call = call_rcu,						\
+}
+
+#define DEFINE_RCU_SCHED_SYNC(name)					\
+	struct rcu_sync_struct name = __RCU_SCHED_SYNC_INIT(name)
+
+#define DEFINE_RCU_BH_SYNC(name)					\
+	struct rcu_sync_struct name = __RCU_BH_SYNC_INIT(name)
+
+#define DEFINE_RCU_SYNC(name)						\
+	struct rcu_sync_struct name = __RCU_SYNC_INIT(name)
+
+static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
+{
+	return !rss->gp_state; /* GP_IDLE */
+}
+
+enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
+
+extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
+extern void rcu_sync_enter(struct rcu_sync_struct *);
+extern void rcu_sync_exit(struct rcu_sync_struct *);
+
+#endif /* _LINUX_RCUSYNC_H_ */
+
diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
index 50a8084..61a1656 100644
--- a/kernel/rcu/Makefile
+++ b/kernel/rcu/Makefile
@@ -1,4 +1,4 @@
-obj-y += update.o
+obj-y += update.o sync.o
 obj-$(CONFIG_SRCU) += srcu.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TREE_RCU) += tree.o
diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
new file mode 100644
index 0000000..f84176a
--- /dev/null
+++ b/kernel/rcu/sync.c
@@ -0,0 +1,108 @@
+
+#include <linux/rcusync.h>
+#include <linux/sched.h>
+
+enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
+enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
+
+#define	rss_lock	gp_wait.lock
+
+void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
+{
+	memset(rss, 0, sizeof(*rss));
+	init_waitqueue_head(&rss->gp_wait);
+
+	switch (type) {
+	case RCU_SYNC:
+		rss->sync = synchronize_rcu;
+		rss->call = call_rcu;
+		break;
+
+	case RCU_SCHED_SYNC:
+		rss->sync = synchronize_sched;
+		rss->call = call_rcu_sched;
+		break;
+
+	case RCU_BH_SYNC:
+		rss->sync = synchronize_rcu_bh;
+		rss->call = call_rcu_bh;
+		break;
+	}
+}
+
+void rcu_sync_enter(struct rcu_sync_struct *rss)
+{
+	bool need_wait, need_sync;
+
+	spin_lock_irq(&rss->rss_lock);
+	need_wait = rss->gp_count++;
+	need_sync = rss->gp_state == GP_IDLE;
+	if (need_sync)
+		rss->gp_state = GP_PENDING;
+	spin_unlock_irq(&rss->rss_lock);
+
+	BUG_ON(need_wait && need_sync);
+
+	if (need_sync) {
+		rss->sync();
+		rss->gp_state = GP_PASSED;
+		wake_up_all(&rss->gp_wait);
+	} else if (need_wait) {
+		wait_event(rss->gp_wait, rss->gp_state == GP_PASSED);
+	} else {
+		/*
+		 * Possible when there's a pending CB from a rcu_sync_exit().
+		 * Nobody has yet been allowed the 'fast' path and thus we can
+		 * avoid doing any sync(). The callback will get 'dropped'.
+		 */
+		BUG_ON(rss->gp_state != GP_PASSED);
+	}
+}
+
+static void rcu_sync_func(struct rcu_head *rcu)
+{
+	struct rcu_sync_struct *rss =
+		container_of(rcu, struct rcu_sync_struct, cb_head);
+	unsigned long flags;
+
+
+	BUG_ON(rss->gp_state != GP_PASSED);
+	BUG_ON(rss->cb_state == CB_IDLE);
+
+	spin_lock_irqsave(&rss->rss_lock, flags);
+	if (rss->gp_count) {
+		/*
+		 * A new rcu_sync_enter() has happened; drop the callback.
+		 */
+		rss->cb_state = CB_IDLE;
+	} else if (rss->cb_state == CB_REPLAY) {
+		/*
+		 * A new rcu_sync_exit() has happened; requeue the callback
+		 * to catch a later GP.
+		 */
+		rss->cb_state = CB_PENDING;
+		rss->call(&rss->cb_head, rcu_sync_func);
+	} else {
+		/*
+		 * We're at least a GP after rcu_sync_exit(); everybody will now
+		 * have observed the write side critical section. Let 'em rip!
+		 */
+		rss->cb_state = CB_IDLE;
+		rss->gp_state = GP_IDLE;
+	}
+	spin_unlock_irqrestore(&rss->rss_lock, flags);
+}
+
+void rcu_sync_exit(struct rcu_sync_struct *rss)
+{
+	spin_lock_irq(&rss->rss_lock);
+	if (!--rss->gp_count) {
+		if (rss->cb_state == CB_IDLE) {
+			rss->cb_state = CB_PENDING;
+			rss->call(&rss->cb_head, rcu_sync_func);
+		} else if (rss->cb_state == CB_PENDING) {
+			rss->cb_state = CB_REPLAY;
+		}
+	}
+	spin_unlock_irq(&rss->rss_lock);
+}
-- 
1.5.5.1



* [PATCH 2/7] rcusync: Introduce struct rcu_sync_ops
  2015-07-11 23:35 [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Oleg Nesterov
  2015-07-11 23:35 ` [PATCH 1/7] rcu: Create rcu_sync infrastructure Oleg Nesterov
@ 2015-07-11 23:35 ` Oleg Nesterov
  2015-07-11 23:35 ` [PATCH 3/7] rcusync: Add the CONFIG_PROVE_RCU checks Oleg Nesterov
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-11 23:35 UTC (permalink / raw)
  To: Linus Torvalds, Paul McKenney, Peter Zijlstra
  Cc: Daniel Wagner, Davidlohr Bueso, Ingo Molnar, Tejun Heo, linux-kernel

Add the new struct rcu_sync_ops which holds sync/call methods, and
turn the function pointers in rcu_sync_struct into an array of struct
rcu_sync_ops.

This simplifies the "init" helpers and makes it easier to add the new
methods we need, especially the ones that have to be ifdef'ed.

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rcusync.h |   60 ++++++++++++++++++-----------------------------
 kernel/rcu/sync.c       |   43 +++++++++++++++++----------------
 2 files changed, 45 insertions(+), 58 deletions(-)

diff --git a/include/linux/rcusync.h b/include/linux/rcusync.h
index 7858491..988ec33 100644
--- a/include/linux/rcusync.h
+++ b/include/linux/rcusync.h
@@ -4,6 +4,8 @@
 #include <linux/wait.h>
 #include <linux/rcupdate.h>
 
+enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
+
 struct rcu_sync_struct {
 	int			gp_state;
 	int			gp_count;
@@ -12,53 +14,37 @@ struct rcu_sync_struct {
 	int			cb_state;
 	struct rcu_head		cb_head;
 
-	void (*sync)(void);
-	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+	enum rcu_sync_type	gp_type;
 };
 
-#define ___RCU_SYNC_INIT(name)						\
-	.gp_state = 0,							\
-	.gp_count = 0,							\
-	.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),		\
-	.cb_state = 0
-
-#define __RCU_SCHED_SYNC_INIT(name) {					\
-	___RCU_SYNC_INIT(name),						\
-	.sync = synchronize_sched,					\
-	.call = call_rcu_sched,						\
-}
-
-#define __RCU_BH_SYNC_INIT(name) {					\
-	___RCU_SYNC_INIT(name),						\
-	.sync = synchronize_rcu_bh,					\
-	.call = call_rcu_bh,						\
-}
-
-#define __RCU_SYNC_INIT(name) {						\
-	___RCU_SYNC_INIT(name),						\
-	.sync = synchronize_rcu,					\
-	.call = call_rcu,						\
-}
-
-#define DEFINE_RCU_SCHED_SYNC(name)					\
-	struct rcu_sync_struct name = __RCU_SCHED_SYNC_INIT(name)
-
-#define DEFINE_RCU_BH_SYNC(name)					\
-	struct rcu_sync_struct name = __RCU_BH_SYNC_INIT(name)
-
-#define DEFINE_RCU_SYNC(name)						\
-	struct rcu_sync_struct name = __RCU_SYNC_INIT(name)
-
 static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
 {
 	return !rss->gp_state; /* GP_IDLE */
 }
 
-enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
-
 extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
 extern void rcu_sync_enter(struct rcu_sync_struct *);
 extern void rcu_sync_exit(struct rcu_sync_struct *);
 
+#define __RCU_SYNC_INITIALIZER(name, type) {				\
+		.gp_state = 0,						\
+		.gp_count = 0,						\
+		.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),	\
+		.cb_state = 0,						\
+		.gp_type = type,					\
+	}
+
+#define	__DEFINE_RCU_SYNC(name, type)	\
+	struct rcu_sync_struct name = __RCU_SYNC_INITIALIZER(name, type)
+
+#define DEFINE_RCU_SYNC(name)		\
+	__DEFINE_RCU_SYNC(name, RCU_SYNC)
+
+#define DEFINE_RCU_SCHED_SYNC(name)	\
+	__DEFINE_RCU_SYNC(name, RCU_SCHED_SYNC)
+
+#define DEFINE_RCU_BH_SYNC(name)	\
+	__DEFINE_RCU_SYNC(name, RCU_BH_SYNC)
+
 #endif /* _LINUX_RCUSYNC_H_ */
 
diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index f84176a..99051b7 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -1,7 +1,24 @@
-
 #include <linux/rcusync.h>
 #include <linux/sched.h>
 
+static const struct {
+	void (*sync)(void);
+	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+} gp_ops[] = {
+	[RCU_SYNC] = {
+		.sync = synchronize_rcu,
+		.call = call_rcu,
+	},
+	[RCU_SCHED_SYNC] = {
+		.sync = synchronize_sched,
+		.call = call_rcu_sched,
+	},
+	[RCU_BH_SYNC] = {
+		.sync = synchronize_rcu_bh,
+		.call = call_rcu_bh,
+	},
+};
+
 enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
 enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
 
@@ -11,23 +28,7 @@ void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
 {
 	memset(rss, 0, sizeof(*rss));
 	init_waitqueue_head(&rss->gp_wait);
-
-	switch (type) {
-	case RCU_SYNC:
-		rss->sync = synchronize_rcu;
-		rss->call = call_rcu;
-		break;
-
-	case RCU_SCHED_SYNC:
-		rss->sync = synchronize_sched;
-		rss->call = call_rcu_sched;
-		break;
-
-	case RCU_BH_SYNC:
-		rss->sync = synchronize_rcu_bh;
-		rss->call = call_rcu_bh;
-		break;
-	}
+	rss->gp_type = type;
 }
 
 void rcu_sync_enter(struct rcu_sync_struct *rss)
@@ -44,7 +45,7 @@ void rcu_sync_enter(struct rcu_sync_struct *rss)
 	BUG_ON(need_wait && need_sync);
 
 	if (need_sync) {
-		rss->sync();
+		gp_ops[rss->gp_type].sync();
 		rss->gp_state = GP_PASSED;
 		wake_up_all(&rss->gp_wait);
 	} else if (need_wait) {
@@ -81,7 +82,7 @@ static void rcu_sync_func(struct rcu_head *rcu)
 		 * to catch a later GP.
 		 */
 		rss->cb_state = CB_PENDING;
-		rss->call(&rss->cb_head, rcu_sync_func);
+		gp_ops[rss->gp_type].call(&rss->cb_head, rcu_sync_func);
 	} else {
 		/*
 		 * We're at least a GP after rcu_sync_exit(); eveybody will now
@@ -99,7 +100,7 @@ void rcu_sync_exit(struct rcu_sync_struct *rss)
 	if (!--rss->gp_count) {
 		if (rss->cb_state == CB_IDLE) {
 			rss->cb_state = CB_PENDING;
-			rss->call(&rss->cb_head, rcu_sync_func);
+			gp_ops[rss->gp_type].call(&rss->cb_head, rcu_sync_func);
 		} else if (rss->cb_state == CB_PENDING) {
 			rss->cb_state = CB_REPLAY;
 		}
-- 
1.5.5.1



* [PATCH 3/7] rcusync: Add the CONFIG_PROVE_RCU checks
  2015-07-11 23:35 [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Oleg Nesterov
  2015-07-11 23:35 ` [PATCH 1/7] rcu: Create rcu_sync infrastructure Oleg Nesterov
  2015-07-11 23:35 ` [PATCH 2/7] rcusync: Introduce struct rcu_sync_ops Oleg Nesterov
@ 2015-07-11 23:35 ` Oleg Nesterov
  2015-07-11 23:35 ` [PATCH 4/7] rcusync: Introduce rcu_sync_dtor() Oleg Nesterov
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-11 23:35 UTC (permalink / raw)
  To: Linus Torvalds, Paul McKenney, Peter Zijlstra
  Cc: Daniel Wagner, Davidlohr Bueso, Ingo Molnar, Tejun Heo, linux-kernel

It would be nice to validate that the caller of rcu_sync_is_idle()
holds the corresponding type of RCU read-side lock. Add the new
rcu_sync_ops->held() method and change rcu_sync_is_idle() to
WARN() if it returns false.

This obviously penalizes the readers (the fast-path), but only if
CONFIG_PROVE_RCU is enabled.
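
For example, with an RCU_SCHED_SYNC object the new check catches a
caller which forgets the read-side protection (hypothetical callers,
not part of this patch):

        static bool buggy(struct rcu_sync_struct *rss)
        {
                /* no read-side protection: WARN() fires under CONFIG_PROVE_RCU */
                return rcu_sync_is_idle(rss);
        }

        static bool correct(struct rcu_sync_struct *rss)
        {
                bool idle;

                preempt_disable();      /* makes rcu_read_lock_sched_held() true */
                idle = rcu_sync_is_idle(rss);
                preempt_enable();
                return idle;
        }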

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rcusync.h |    6 ++++++
 kernel/rcu/sync.c       |   21 +++++++++++++++++++++
 2 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/include/linux/rcusync.h b/include/linux/rcusync.h
index 988ec33..a51e5c7 100644
--- a/include/linux/rcusync.h
+++ b/include/linux/rcusync.h
@@ -17,9 +17,15 @@ struct rcu_sync_struct {
 	enum rcu_sync_type	gp_type;
 };
 
+extern bool __rcu_sync_is_idle(struct rcu_sync_struct *);
+
 static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
 {
+#ifdef CONFIG_PROVE_RCU
+	return __rcu_sync_is_idle(rss);
+#else
 	return !rss->gp_state; /* GP_IDLE */
+#endif
 }
 
 extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index 99051b7..32cdbb8 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -1,21 +1,33 @@
 #include <linux/rcusync.h>
 #include <linux/sched.h>
 
+#ifdef CONFIG_PROVE_RCU
+#define __INIT_HELD(func)	.held = func,
+#else
+#define __INIT_HELD(func)
+#endif
+
 static const struct {
 	void (*sync)(void);
 	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+#ifdef CONFIG_PROVE_RCU
+	int  (*held)(void);
+#endif
 } gp_ops[] = {
 	[RCU_SYNC] = {
 		.sync = synchronize_rcu,
 		.call = call_rcu,
+		__INIT_HELD(rcu_read_lock_held)
 	},
 	[RCU_SCHED_SYNC] = {
 		.sync = synchronize_sched,
 		.call = call_rcu_sched,
+		__INIT_HELD(rcu_read_lock_sched_held)
 	},
 	[RCU_BH_SYNC] = {
 		.sync = synchronize_rcu_bh,
 		.call = call_rcu_bh,
+		__INIT_HELD(rcu_read_lock_bh_held)
 	},
 };
 
@@ -24,6 +36,15 @@ enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
 
 #define	rss_lock	gp_wait.lock
 
+#ifdef CONFIG_PROVE_RCU
+bool __rcu_sync_is_idle(struct rcu_sync_struct *rss)
+{
+	WARN_ON(!gp_ops[rss->gp_type].held());
+	return rss->gp_state == GP_IDLE;
+}
+EXPORT_SYMBOL_GPL(__rcu_sync_is_idle);
+#endif
+
 void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
 {
 	memset(rss, 0, sizeof(*rss));
-- 
1.5.5.1



* [PATCH 4/7] rcusync: Introduce rcu_sync_dtor()
  2015-07-11 23:35 [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Oleg Nesterov
                   ` (2 preceding siblings ...)
  2015-07-11 23:35 ` [PATCH 3/7] rcusync: Add the CONFIG_PROVE_RCU checks Oleg Nesterov
@ 2015-07-11 23:35 ` Oleg Nesterov
  2015-07-11 23:36 ` [PATCH 5/7] percpu-rwsem: change it to rely on rcu_sync infrastructure Oleg Nesterov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-11 23:35 UTC (permalink / raw)
  To: Linus Torvalds, Paul McKenney, Peter Zijlstra
  Cc: Daniel Wagner, Davidlohr Bueso, Ingo Molnar, Tejun Heo, linux-kernel

Add the new rcu_sync_ops->wait() method and the new helper,
rcu_sync_dtor().

It is needed if you are going to, say, kfree(rcu_sync_object).
It simply calls ops->wait() to "flush" the potentially pending
rcu callback.
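
For example (a hypothetical user, not part of this patch):

        struct foo {
                struct rcu_sync_struct rss;
                /* ... */
        };

        static void free_foo(struct foo *foo)
        {
                rcu_sync_dtor(&foo->rss);  /* flush a possibly pending callback */
                kfree(foo);
        }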

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rcusync.h |    1 +
 kernel/rcu/sync.c       |   22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/rcusync.h b/include/linux/rcusync.h
index a51e5c7..0135838 100644
--- a/include/linux/rcusync.h
+++ b/include/linux/rcusync.h
@@ -31,6 +31,7 @@ static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
 extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
 extern void rcu_sync_enter(struct rcu_sync_struct *);
 extern void rcu_sync_exit(struct rcu_sync_struct *);
+extern void rcu_sync_dtor(struct rcu_sync_struct *);
 
 #define __RCU_SYNC_INITIALIZER(name, type) {				\
 		.gp_state = 0,						\
diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index 32cdbb8..8835ad1 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -10,6 +10,7 @@
 static const struct {
 	void (*sync)(void);
 	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+	void (*wait)(void);
 #ifdef CONFIG_PROVE_RCU
 	int  (*held)(void);
 #endif
@@ -17,16 +18,19 @@ static const struct {
 	[RCU_SYNC] = {
 		.sync = synchronize_rcu,
 		.call = call_rcu,
+		.wait = rcu_barrier,
 		__INIT_HELD(rcu_read_lock_held)
 	},
 	[RCU_SCHED_SYNC] = {
 		.sync = synchronize_sched,
 		.call = call_rcu_sched,
+		.wait = rcu_barrier_sched,
 		__INIT_HELD(rcu_read_lock_sched_held)
 	},
 	[RCU_BH_SYNC] = {
 		.sync = synchronize_rcu_bh,
 		.call = call_rcu_bh,
+		.wait = rcu_barrier_bh,
 		__INIT_HELD(rcu_read_lock_bh_held)
 	},
 };
@@ -128,3 +132,21 @@ void rcu_sync_exit(struct rcu_sync_struct *rss)
 	}
 	spin_unlock_irq(&rss->rss_lock);
 }
+
+void rcu_sync_dtor(struct rcu_sync_struct *rss)
+{
+	int cb_state;
+
+	BUG_ON(rss->gp_count);
+
+	spin_lock_irq(&rss->rss_lock);
+	if (rss->cb_state == CB_REPLAY)
+		rss->cb_state = CB_PENDING;
+	cb_state = rss->cb_state;
+	spin_unlock_irq(&rss->rss_lock);
+
+	if (cb_state != CB_IDLE) {
+		gp_ops[rss->gp_type].wait();
+		BUG_ON(rss->cb_state != CB_IDLE);
+	}
+}
-- 
1.5.5.1



* [PATCH 5/7] percpu-rwsem: change it to rely on rcu_sync infrastructure
  2015-07-11 23:35 [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Oleg Nesterov
                   ` (3 preceding siblings ...)
  2015-07-11 23:35 ` [PATCH 4/7] rcusync: Introduce rcu_sync_dtor() Oleg Nesterov
@ 2015-07-11 23:36 ` Oleg Nesterov
  2015-07-15 18:15   ` Paul E. McKenney
  2015-07-11 23:36 ` [PATCH 6/7] percpu-rwsem: fix the comments outdated by rcu_sync Oleg Nesterov
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-11 23:36 UTC (permalink / raw)
  To: Linus Torvalds, Paul McKenney, Peter Zijlstra
  Cc: Daniel Wagner, Davidlohr Bueso, Ingo Molnar, Tejun Heo, linux-kernel

Currently down_write/up_write calls synchronize_sched_expedited()
twice which is evil. Change this code to rely on rcu-sync primitives.
This avoids the _expedited "big hammer", and this can be faster in
the contended case or even in the case when a single thread does
down_write/up_write in a loop.

Of course, a single down_write() will take more time, but otoh it
will be much more friendly to the whole system.

To simplify the review this patch doesn't update the comments, fixed
by the next change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/percpu-rwsem.h  |    3 ++-
 kernel/locking/percpu-rwsem.c |   18 +++++++-----------
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 3e88c9a..3e58226 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -5,11 +5,12 @@
 #include <linux/rwsem.h>
 #include <linux/percpu.h>
 #include <linux/wait.h>
+#include <linux/rcusync.h>
 #include <linux/lockdep.h>
 
 struct percpu_rw_semaphore {
+	struct rcu_sync_struct	rss;
 	unsigned int __percpu	*fast_read_ctr;
-	atomic_t		write_ctr;
 	struct rw_semaphore	rw_sem;
 	atomic_t		slow_read_ctr;
 	wait_queue_head_t	write_waitq;
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 652a8ee..69a7314 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -17,7 +17,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
 
 	/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
 	__init_rwsem(&brw->rw_sem, name, rwsem_key);
-	atomic_set(&brw->write_ctr, 0);
+	rcu_sync_init(&brw->rss, RCU_SCHED_SYNC);
 	atomic_set(&brw->slow_read_ctr, 0);
 	init_waitqueue_head(&brw->write_waitq);
 	return 0;
@@ -25,6 +25,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
 
 void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
 {
+	rcu_sync_dtor(&brw->rss);
 	free_percpu(brw->fast_read_ctr);
 	brw->fast_read_ctr = NULL; /* catch use after free bugs */
 }
@@ -54,13 +55,12 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
  */
 static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
 {
-	bool success = false;
+	bool success;
 
 	preempt_disable();
-	if (likely(!atomic_read(&brw->write_ctr))) {
+	success = rcu_sync_is_idle(&brw->rss);
+	if (likely(success))
 		__this_cpu_add(*brw->fast_read_ctr, val);
-		success = true;
-	}
 	preempt_enable();
 
 	return success;
@@ -126,8 +126,6 @@ static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
  */
 void percpu_down_write(struct percpu_rw_semaphore *brw)
 {
-	/* tell update_fast_ctr() there is a pending writer */
-	atomic_inc(&brw->write_ctr);
 	/*
 	 * 1. Ensures that write_ctr != 0 is visible to any down_read/up_read
 	 *    so that update_fast_ctr() can't succeed.
@@ -139,7 +137,7 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
 	 *    fast-path, it executes a full memory barrier before we return.
 	 *    See R_W case in the comment above update_fast_ctr().
 	 */
-	synchronize_sched_expedited();
+	rcu_sync_enter(&brw->rss);
 
 	/* exclude other writers, and block the new readers completely */
 	down_write(&brw->rw_sem);
@@ -159,7 +157,5 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
 	 * Insert the barrier before the next fast-path in down_read,
 	 * see W_R case in the comment above update_fast_ctr().
 	 */
-	synchronize_sched_expedited();
-	/* the last writer unblocks update_fast_ctr() */
-	atomic_dec(&brw->write_ctr);
+	rcu_sync_exit(&brw->rss);
 }
-- 
1.5.5.1



* [PATCH 6/7] percpu-rwsem: fix the comments outdated by rcu_sync
  2015-07-11 23:35 [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Oleg Nesterov
                   ` (4 preceding siblings ...)
  2015-07-11 23:36 ` [PATCH 5/7] percpu-rwsem: change it to rely on rcu_sync infrastructure Oleg Nesterov
@ 2015-07-11 23:36 ` Oleg Nesterov
  2015-07-11 23:36 ` [PATCH 7/7] percpu-rwsem: cleanup the lockdep annotations in percpu_down_read() Oleg Nesterov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-11 23:36 UTC (permalink / raw)
  To: Linus Torvalds, Paul McKenney, Peter Zijlstra
  Cc: Daniel Wagner, Davidlohr Bueso, Ingo Molnar, Tejun Heo, linux-kernel

Update the comments broken by the previous change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/locking/percpu-rwsem.c |   50 +++++++++--------------------------------
 1 files changed, 11 insertions(+), 39 deletions(-)

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 69a7314..705aefd 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -31,27 +31,12 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
 }
 
 /*
- * This is the fast-path for down_read/up_read, it only needs to ensure
- * there is no pending writer (atomic_read(write_ctr) == 0) and inc/dec the
- * fast per-cpu counter. The writer uses synchronize_sched_expedited() to
- * serialize with the preempt-disabled section below.
- *
- * The nontrivial part is that we should guarantee acquire/release semantics
- * in case when
- *
- *	R_W: down_write() comes after up_read(), the writer should see all
- *	     changes done by the reader
- * or
- *	W_R: down_read() comes after up_write(), the reader should see all
- *	     changes done by the writer
+ * This is the fast-path for down_read/up_read. If it succeeds we rely
+ * on the barriers provided by rcu_sync_enter/exit; see the comments in
+ * percpu_down_write() and percpu_up_write().
  *
  * If this helper fails the callers rely on the normal rw_semaphore and
  * atomic_dec_and_test(), so in this case we have the necessary barriers.
- *
- * But if it succeeds we do not have any barriers, atomic_read(write_ctr) or
- * __this_cpu_add() below can be reordered with any LOAD/STORE done by the
- * reader inside the critical section. See the comments in down_write and
- * up_write below.
  */
 static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
 {
@@ -113,29 +98,15 @@ static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
 	return sum;
 }
 
-/*
- * A writer increments ->write_ctr to force the readers to switch to the
- * slow mode, note the atomic_read() check in update_fast_ctr().
- *
- * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
- * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
- * counter it represents the number of active readers.
- *
- * Finally the writer takes ->rw_sem for writing and blocks the new readers,
- * then waits until the slow counter becomes zero.
- */
 void percpu_down_write(struct percpu_rw_semaphore *brw)
 {
 	/*
-	 * 1. Ensures that write_ctr != 0 is visible to any down_read/up_read
-	 *    so that update_fast_ctr() can't succeed.
-	 *
-	 * 2. Ensures we see the result of every previous this_cpu_add() in
-	 *    update_fast_ctr().
+	 * Make rcu_sync_is_idle() == F and thus disable the fast-path in
+	 * percpu_down_read() and percpu_up_read(), and wait for gp pass.
 	 *
-	 * 3. Ensures that if any reader has exited its critical section via
-	 *    fast-path, it executes a full memory barrier before we return.
-	 *    See R_W case in the comment above update_fast_ctr().
+	 * The latter synchronises us with the preceding readers which used
+	 * the fast-path, so we cannot miss the result of __this_cpu_add()
+	 * or anything else inside their critical sections.
 	 */
 	rcu_sync_enter(&brw->rss);
 
@@ -154,8 +125,9 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
 	/* release the lock, but the readers can't use the fast-path */
 	up_write(&brw->rw_sem);
 	/*
-	 * Insert the barrier before the next fast-path in down_read,
-	 * see W_R case in the comment above update_fast_ctr().
+	 * Enable the fast-path in percpu_down_read() and percpu_up_read()
+	 * but only after another gp pass; this adds the necessary barrier
+	 * to ensure the reader can't miss the changes done by us.
 	 */
 	rcu_sync_exit(&brw->rss);
 }
-- 
1.5.5.1



* [PATCH 7/7] percpu-rwsem: cleanup the lockdep annotations in percpu_down_read()
  2015-07-11 23:35 [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Oleg Nesterov
                   ` (5 preceding siblings ...)
  2015-07-11 23:36 ` [PATCH 6/7] percpu-rwsem: fix the comments outdated by rcu_sync Oleg Nesterov
@ 2015-07-11 23:36 ` Oleg Nesterov
  2015-07-11 23:47 ` [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Linus Torvalds
  2015-07-15 18:27 ` Paul E. McKenney
  8 siblings, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-11 23:36 UTC (permalink / raw)
  To: Linus Torvalds, Paul McKenney, Peter Zijlstra
  Cc: Daniel Wagner, Davidlohr Bueso, Ingo Molnar, Tejun Heo, linux-kernel

Stolen from Peter's patch.

Change percpu_down_read() to use __down_read(); this way we can
do rwsem_acquire_read() unconditionally at the start, which makes
this code more symmetric and clean.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/locking/percpu-rwsem.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 705aefd..2c54c64 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -62,14 +62,14 @@ static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
 void percpu_down_read(struct percpu_rw_semaphore *brw)
 {
 	might_sleep();
-	if (likely(update_fast_ctr(brw, +1))) {
-		rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 0, _RET_IP_);
+	rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 0, _RET_IP_);
+
+	if (likely(update_fast_ctr(brw, +1)))
 		return;
-	}
 
-	down_read(&brw->rw_sem);
+	/* Avoid rwsem_acquire_read() and rwsem_release() */
+	__down_read(&brw->rw_sem);
 	atomic_inc(&brw->slow_read_ctr);
-	/* avoid up_read()->rwsem_release() */
 	__up_read(&brw->rw_sem);
 }
 
-- 
1.5.5.1



* Re: [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem
  2015-07-11 23:35 [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Oleg Nesterov
                   ` (6 preceding siblings ...)
  2015-07-11 23:36 ` [PATCH 7/7] percpu-rwsem: cleanup the lockdep annotations in percpu_down_read() Oleg Nesterov
@ 2015-07-11 23:47 ` Linus Torvalds
  2015-07-15 18:27 ` Paul E. McKenney
  8 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2015-07-11 23:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul McKenney, Peter Zijlstra, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, Linux Kernel Mailing List

On Sat, Jul 11, 2015 at 4:35 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Linus, I am mostly trying to convince you. Nobody else objected so far.
> Could you please comment?

I don't mind this part of the series.

It's the whole "do we really want to put the effort into percpu-rwsem"
that I worry about, as there just aren't that many users.

The conversions made that "too damn special" thing go away, but the
conversions (particularly the big _real_ user, namely fs/locks.c) seem
to have serious performance problems that are quite possibly not
fixable.

So my objection isn't to your change, my objection is to the whole
"right now there are two users, and they both use a global lock, so
*of course* they scale like shit, and this is all just papering over
that much more fundamental problem".

I hate one-off locking. One-off locking with one global lock? Yeah,
that just smells.

                  Linus


* Re: [PATCH 1/7] rcu: Create rcu_sync infrastructure
  2015-07-11 23:35 ` [PATCH 1/7] rcu: Create rcu_sync infrastructure Oleg Nesterov
@ 2015-07-15 18:05   ` Paul E. McKenney
  2015-07-15 18:15     ` Peter Zijlstra
  2015-07-15 18:41     ` Oleg Nesterov
  0 siblings, 2 replies; 22+ messages in thread
From: Paul E. McKenney @ 2015-07-15 18:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Peter Zijlstra, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On Sun, Jul 12, 2015 at 01:35:48AM +0200, Oleg Nesterov wrote:
> It is functionally equivalent to
> 
>         struct rcu_sync_struct {
>                 atomic_t counter;
>         };
> 
>         static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
>         {

If you add an smp_mb() here...

>                 return atomic_read(&rss->counter) == 0;
>         }
> 
>         static inline void rcu_sync_enter(struct rcu_sync_struct *rss)
>         {
>                 atomic_inc(&rss->counter);
>                 synchronize_sched();
>         }
> 
>         static inline void rcu_sync_exit(struct rcu_sync_struct *rss)
>         {
>                 synchronize_sched();

You should be able to demote the above synchronize_sched() to an
smp_mb__before_atomic().  Even rare writes should make this tradeoff
worthwhile.

I will try to call out the equivalent transformation below, for your
amusement if for no other reason.  ;-)

>                 atomic_dec(&rss->counter);
>         }
> 
> except: it records the state and synchronize_sched() is only called by
> rcu_sync_enter() and only if necessary.
> 
> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/rcusync.h |   64 ++++++++++++++++++++++++++++
>  kernel/rcu/Makefile     |    2 +-
>  kernel/rcu/sync.c       |  108 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 173 insertions(+), 1 deletions(-)
>  create mode 100644 include/linux/rcusync.h
>  create mode 100644 kernel/rcu/sync.c
> 
> diff --git a/include/linux/rcusync.h b/include/linux/rcusync.h
> new file mode 100644
> index 0000000..7858491
> --- /dev/null
> +++ b/include/linux/rcusync.h
> @@ -0,0 +1,64 @@
> +#ifndef _LINUX_RCUSYNC_H_
> +#define _LINUX_RCUSYNC_H_
> +
> +#include <linux/wait.h>
> +#include <linux/rcupdate.h>
> +
> +struct rcu_sync_struct {
> +	int			gp_state;
> +	int			gp_count;
> +	wait_queue_head_t	gp_wait;
> +
> +	int			cb_state;
> +	struct rcu_head		cb_head;
> +
> +	void (*sync)(void);
> +	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
> +};
> +
> +#define ___RCU_SYNC_INIT(name)						\
> +	.gp_state = 0,							\
> +	.gp_count = 0,							\
> +	.gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait),		\
> +	.cb_state = 0
> +
> +#define __RCU_SCHED_SYNC_INIT(name) {					\
> +	___RCU_SYNC_INIT(name),						\
> +	.sync = synchronize_sched,					\
> +	.call = call_rcu_sched,						\
> +}
> +
> +#define __RCU_BH_SYNC_INIT(name) {					\
> +	___RCU_SYNC_INIT(name),						\
> +	.sync = synchronize_rcu_bh,					\
> +	.call = call_rcu_bh,						\
> +}
> +
> +#define __RCU_SYNC_INIT(name) {						\
> +	___RCU_SYNC_INIT(name),						\
> +	.sync = synchronize_rcu,					\
> +	.call = call_rcu,						\
> +}
> +
> +#define DEFINE_RCU_SCHED_SYNC(name)					\
> +	struct rcu_sync_struct name = __RCU_SCHED_SYNC_INIT(name)
> +
> +#define DEFINE_RCU_BH_SYNC(name)					\
> +	struct rcu_sync_struct name = __RCU_BH_SYNC_INIT(name)
> +
> +#define DEFINE_RCU_SYNC(name)						\
> +	struct rcu_sync_struct name = __RCU_SYNC_INIT(name)
> +
> +static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
> +{

	smp_mb();  /* A: Ensure that reader sees last update. */
		   /* Pairs with B. */

> +	return !rss->gp_state; /* GP_IDLE */
> +}
> +
> +enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
> +
> +extern void rcu_sync_init(struct rcu_sync_struct *, enum rcu_sync_type);
> +extern void rcu_sync_enter(struct rcu_sync_struct *);
> +extern void rcu_sync_exit(struct rcu_sync_struct *);
> +
> +#endif /* _LINUX_RCUSYNC_H_ */
> +
> diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> index 50a8084..61a1656 100644
> --- a/kernel/rcu/Makefile
> +++ b/kernel/rcu/Makefile
> @@ -1,4 +1,4 @@
> -obj-y += update.o
> +obj-y += update.o sync.o
>  obj-$(CONFIG_SRCU) += srcu.o
>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>  obj-$(CONFIG_TREE_RCU) += tree.o
> diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
> new file mode 100644
> index 0000000..f84176a
> --- /dev/null
> +++ b/kernel/rcu/sync.c
> @@ -0,0 +1,108 @@
> +
> +#include <linux/rcusync.h>
> +#include <linux/sched.h>
> +
> +enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> +enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> +
> +#define	rss_lock	gp_wait.lock
> +
> +void rcu_sync_init(struct rcu_sync_struct *rss, enum rcu_sync_type type)
> +{
> +	memset(rss, 0, sizeof(*rss));
> +	init_waitqueue_head(&rss->gp_wait);
> +
> +	switch (type) {
> +	case RCU_SYNC:
> +		rss->sync = synchronize_rcu;
> +		rss->call = call_rcu;
> +		break;
> +
> +	case RCU_SCHED_SYNC:
> +		rss->sync = synchronize_sched;
> +		rss->call = call_rcu_sched;
> +		break;
> +
> +	case RCU_BH_SYNC:
> +		rss->sync = synchronize_rcu_bh;
> +		rss->call = call_rcu_bh;
> +		break;
> +	}
> +}
> +
> +void rcu_sync_enter(struct rcu_sync_struct *rss)
> +{
> +	bool need_wait, need_sync;
> +
> +	spin_lock_irq(&rss->rss_lock);
> +	need_wait = rss->gp_count++;
> +	need_sync = rss->gp_state == GP_IDLE;
> +	if (need_sync)
> +		rss->gp_state = GP_PENDING;
> +	spin_unlock_irq(&rss->rss_lock);
> +
> +	BUG_ON(need_wait && need_sync);
> +
> +	if (need_sync) {
> +		rss->sync();
> +		rss->gp_state = GP_PASSED;
> +		wake_up_all(&rss->gp_wait);
> +	} else if (need_wait) {
> +		wait_event(rss->gp_wait, rss->gp_state == GP_PASSED);
> +	} else {
> +		/*
> +		 * Possible when there's a pending CB from a rcu_sync_exit().
> +		 * Nobody has yet been allowed the 'fast' path and thus we can
> +		 * avoid doing any sync(). The callback will get 'dropped'.
> +		 */
> +		BUG_ON(rss->gp_state != GP_PASSED);
> +	}
> +}
> +
> +static void rcu_sync_func(struct rcu_head *rcu)
> +{
> +	struct rcu_sync_struct *rss =
> +		container_of(rcu, struct rcu_sync_struct, cb_head);
> +	unsigned long flags;
> +
> +
> +	BUG_ON(rss->gp_state != GP_PASSED);
> +	BUG_ON(rss->cb_state == CB_IDLE);
> +
> +	spin_lock_irqsave(&rss->rss_lock, flags);
> +	if (rss->gp_count) {
> +		/*
> +		 * A new rcu_sync_enter() has happened; drop the callback.
> +		 */
> +		rss->cb_state = CB_IDLE;
> +	} else if (rss->cb_state == CB_REPLAY) {
> +		/*
> +		 * A new rcu_sync_exit() has happened; requeue the callback
> +		 * to catch a later GP.
> +		 */
> +		rss->cb_state = CB_PENDING;
> +		rss->call(&rss->cb_head, rcu_sync_func);
> +	} else {
> +		/*
> +		 * We're at least a GP after rcu_sync_exit(); everybody will now
> +		 * have observed the write side critical section. Let 'em rip!
> +		 */
> +		rss->cb_state = CB_IDLE;
> +		rss->gp_state = GP_IDLE;
> +	}
> +	spin_unlock_irqrestore(&rss->rss_lock, flags);
> +}
> +
> +void rcu_sync_exit(struct rcu_sync_struct *rss)
> +{
> +	spin_lock_irq(&rss->rss_lock);

	smp_mb(); /* B: Make sure next readers see critical section. */
		  /* Pairs with A. */

> +	if (!--rss->gp_count) {

At which point, I believe you can ditch the callback entirely, along
with ->cb_state.

So, what am I missing here?  Are readers really so frequent that the
added read-side memory barrier is visible?  Given that the current
code forces the readers to grab ->rss_lock during the trailing grace
period, any writer frequency at all will overwhelm the cost of the
read-side memory barrier.
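
In other words, the transformation I have in mind is roughly the
following (a sketch against your pseudo-code above, not against the
actual patch):

	struct rcu_sync_struct {
		atomic_t counter;
	};

	static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
	{
		smp_mb();  /* A: Ensure that reader sees last update. */
		return atomic_read(&rss->counter) == 0;
	}

	static inline void rcu_sync_enter(struct rcu_sync_struct *rss)
	{
		atomic_inc(&rss->counter);
		synchronize_sched();  /* still needed on the acquire side */
	}

	static inline void rcu_sync_exit(struct rcu_sync_struct *rss)
	{
		smp_mb__before_atomic();  /* B: Pairs with A. */
		atomic_dec(&rss->counter);
	}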

							Thanx, Paul

> +		if (rss->cb_state == CB_IDLE) {
> +			rss->cb_state = CB_PENDING;
> +			rss->call(&rss->cb_head, rcu_sync_func);
> +		} else if (rss->cb_state == CB_PENDING) {
> +			rss->cb_state = CB_REPLAY;
> +		}
> +	}
> +	spin_unlock_irq(&rss->rss_lock);
> +}
> -- 
> 1.5.5.1
> 



* Re: [PATCH 5/7] percpu-rwsem: change it to rely on rcu_sync infrastructure
  2015-07-11 23:36 ` [PATCH 5/7] percpu-rwsem: change it to rely on rcu_sync infrastructure Oleg Nesterov
@ 2015-07-15 18:15   ` Paul E. McKenney
  2015-07-15 18:59     ` Oleg Nesterov
  0 siblings, 1 reply; 22+ messages in thread
From: Paul E. McKenney @ 2015-07-15 18:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Peter Zijlstra, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On Sun, Jul 12, 2015 at 01:36:01AM +0200, Oleg Nesterov wrote:
> Currently down_write/up_write calls synchronize_sched_expedited()
> twice which is evil. Change this code to rely on rcu-sync primitives.
> This avoids the _expedited "big hammer", and this can be faster in
> the contended case or even in the case when a single thread does
> down_write/up_write in a loop.

But "evil" is such a strong word!  ;-)

More seriously, introducing a read-side smp_mb() should allow the
write-release synchronize_sched_expedited() to be demoted to an
smp_mb() as well.  But yes, you would still have the write-acquire
synchronize_sched_expedited().  That of course could be eliminated
in the writer-only common case in the same way you eliminated the
write-acquisition grace-period delay in rcu_sync.

My main concern would be the introduction of an exclusive lock for the
writer, but if you are hitting the write-side acquisition that hard,
perhaps you are having other bigger problems.

Another concern would be performance in a workload where there are way
more readers than writers, but enough writers so that there is not much
more than one grace period's worth of spacing between them.  In that case,
the readers are having to hit the global lock more frequently with this
change than in the original.  Cue debate over tradeoff between disturbing
non-idle non-nohz CPUs on the one hand and forcing readers to hit the
global lock more frequently on the other.  ;-)

							Thanx, Paul

> Of course, a single down_write() will take more time, but otoh it
> will be much more friendly to the whole system.
> 
> To simplify the review this patch doesn't update the comments, fixed
> by the next change.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/percpu-rwsem.h  |    3 ++-
>  kernel/locking/percpu-rwsem.c |   18 +++++++-----------
>  2 files changed, 9 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 3e88c9a..3e58226 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -5,11 +5,12 @@
>  #include <linux/rwsem.h>
>  #include <linux/percpu.h>
>  #include <linux/wait.h>
> +#include <linux/rcusync.h>
>  #include <linux/lockdep.h>
> 
>  struct percpu_rw_semaphore {
> +	struct rcu_sync_struct	rss;
>  	unsigned int __percpu	*fast_read_ctr;
> -	atomic_t		write_ctr;
>  	struct rw_semaphore	rw_sem;
>  	atomic_t		slow_read_ctr;
>  	wait_queue_head_t	write_waitq;
> diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
> index 652a8ee..69a7314 100644
> --- a/kernel/locking/percpu-rwsem.c
> +++ b/kernel/locking/percpu-rwsem.c
> @@ -17,7 +17,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
> 
>  	/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
>  	__init_rwsem(&brw->rw_sem, name, rwsem_key);
> -	atomic_set(&brw->write_ctr, 0);
> +	rcu_sync_init(&brw->rss, RCU_SCHED_SYNC);
>  	atomic_set(&brw->slow_read_ctr, 0);
>  	init_waitqueue_head(&brw->write_waitq);
>  	return 0;
> @@ -25,6 +25,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
> 
>  void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
>  {
> +	rcu_sync_dtor(&brw->rss);
>  	free_percpu(brw->fast_read_ctr);
>  	brw->fast_read_ctr = NULL; /* catch use after free bugs */
>  }
> @@ -54,13 +55,12 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
>   */
>  static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
>  {
> -	bool success = false;
> +	bool success;
> 
>  	preempt_disable();
> -	if (likely(!atomic_read(&brw->write_ctr))) {
> +	success = rcu_sync_is_idle(&brw->rss);
> +	if (likely(success))
>  		__this_cpu_add(*brw->fast_read_ctr, val);
> -		success = true;
> -	}
>  	preempt_enable();
> 
>  	return success;
> @@ -126,8 +126,6 @@ static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
>   */
>  void percpu_down_write(struct percpu_rw_semaphore *brw)
>  {
> -	/* tell update_fast_ctr() there is a pending writer */
> -	atomic_inc(&brw->write_ctr);
>  	/*
>  	 * 1. Ensures that write_ctr != 0 is visible to any down_read/up_read
>  	 *    so that update_fast_ctr() can't succeed.
> @@ -139,7 +137,7 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
>  	 *    fast-path, it executes a full memory barrier before we return.
>  	 *    See R_W case in the comment above update_fast_ctr().
>  	 */
> -	synchronize_sched_expedited();
> +	rcu_sync_enter(&brw->rss);
> 
>  	/* exclude other writers, and block the new readers completely */
>  	down_write(&brw->rw_sem);
> @@ -159,7 +157,5 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
>  	 * Insert the barrier before the next fast-path in down_read,
>  	 * see W_R case in the comment above update_fast_ctr().
>  	 */
> -	synchronize_sched_expedited();
> -	/* the last writer unblocks update_fast_ctr() */
> -	atomic_dec(&brw->write_ctr);
> +	rcu_sync_exit(&brw->rss);
>  }
> -- 
> 1.5.5.1
> 



* Re: [PATCH 1/7] rcu: Create rcu_sync infrastructure
  2015-07-15 18:05   ` Paul E. McKenney
@ 2015-07-15 18:15     ` Peter Zijlstra
  2015-07-15 18:28       ` Paul E. McKenney
  2015-07-15 18:41     ` Oleg Nesterov
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-07-15 18:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Linus Torvalds, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On Wed, Jul 15, 2015 at 11:05:46AM -0700, Paul E. McKenney wrote:
> On Sun, Jul 12, 2015 at 01:35:48AM +0200, Oleg Nesterov wrote:
> > It is functionally equivalent to
> > 
> >         struct rcu_sync_struct {
> >                 atomic_t counter;
> >         };
> > 
> >         static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
> >         {
> 
> If you add an smp_mb() here...
> 
> >                 return atomic_read(&rss->counter) == 0;
> >         }
> > 
> >         static inline void rcu_sync_enter(struct rcu_sync_struct *rss)
> >         {
> >                 atomic_inc(&rss->counter);
> >                 synchronize_sched();
> >         }
> > 
> >         static inline void rcu_sync_exit(struct rcu_sync_struct *rss)
> >         {
> >                 synchronize_sched();
> 
> You should be able to demote the above synchronize_sched() to an
> smp_mb__before_atomic().  Even rare writes should make this tradeoff
> worthwhile.

No, it makes the read-side primitive contain an unconditional memory
barrier, which forgoes the entire point.

The writers are stupidly expensive already, for they need global
serialization; optimizing them in any way doesn't make sense.


* Re: [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem
  2015-07-11 23:35 [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Oleg Nesterov
                   ` (7 preceding siblings ...)
  2015-07-11 23:47 ` [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Linus Torvalds
@ 2015-07-15 18:27 ` Paul E. McKenney
  2015-07-15 19:36   ` Oleg Nesterov
  8 siblings, 1 reply; 22+ messages in thread
From: Paul E. McKenney @ 2015-07-15 18:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Peter Zijlstra, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On Sun, Jul 12, 2015 at 01:35:35AM +0200, Oleg Nesterov wrote:
> Hello,
> 
> Let me make another attempt to push rcu_sync and add a _simple_
> improvement into percpu-rwsem. It already has another user (cgroups)
> and I think it can have more. Peter has some use-cases. sb->s_writers
> (which afaics is buggy btw) can be turned into percpu-rwsem too I think.
> 
> Linus, I am mostly trying to convince you. Nobody else objected so far.
> Could you please comment?
> 
> Peter, if you agree with 5-7, can I add your Signed-off-by's ?
> 
> To me, the most annoying problem with percpu_rw_semaphore is
> synchronize_sched_expedited() which is called twice by every
> down_write/up_write. I think it would be really nice to avoid it.
> 
> Let's start with the simple test-case,
> 
> 	#!/bin/bash
> 
> 	perf probe -x /lib/libc.so.6 syscall
> 
> 	for i in {1..1000}; do
> 		echo 1 >| /sys/kernel/debug/tracing/events/probe_libc/syscall/enable
> 		echo 0 >| /sys/kernel/debug/tracing/events/probe_libc/syscall/enable
> 	done
> 
> It needs ~ 13.5 seconds (2 CPUs, KVM). If we simply replace
> synchronize_sched_expedited() with synchronize_sched() it takes
> ~ 67.5 seconds. This is not good.

Yep, even if you avoided the write-release grace period, you would
still be looking at something like 40 seconds, which is 3x.  Some
might consider that to be a performance regression.  ;-)

> With these patches it takes around 13.3 seconds again (a little
> bit faster), and it doesn't use _expedited. synchronize_sched()
> is called 1-2 (max 3) times on average. And now it does not
> disturb the whole system.
> 
> And just in case, I also measured
> 
> 	for (i = 0; i < 1000000; ++i) {
> 		percpu_down_write(&dup_mmap_sem);
> 		percpu_up_write(&dup_mmap_sem);
> 	}
> 
> and it runs more than 1.5 times faster (to remind, only 2 CPUs),
> but this is not that interesting, I agree.

Your trick of avoiding the grace periods during a writer-to-writer
handoff is cute, and it is helping a lot here.  Concurrent readers
would have a tough time with this workload, though; they would all
be serialized.

> And note that the actual change in percpu-rwsem is really simple,
> and imo it even makes the code simpler. (the last patch is off-
> topic cleanup).
> 
> So the only complication is rcu_sync itself. But, rightly or not (I
> am obviously biased), I believe this new rcu infrastructure is natural
> and useful, and I think it can have more users too.

I don't have an objection to it, even in its current form (I did
review it long ago), but it does need to have a user!

> And. We can do more improvements in rcu_sync and percpu-rwsem, and
> I don't only mean other optimizations from Peter. In particular, we
> can extract the "wait for gp pass" from rcu_sync_enter() into another
> helper, we can teach percpu_down_write() to allow multiple writers,
> and more.

As in a percpu_down_write() that allows up to (say) five concurrent
write-holders?  (Which can be useful, don't get me wrong.)  Or do
you mean as an internal optimization of some sort?

							Thanx, Paul

> Oleg.
> 
>  include/linux/percpu-rwsem.h  |    3 +-
>  include/linux/rcusync.h       |   57 +++++++++++++++
>  kernel/locking/percpu-rwsem.c |   78 ++++++---------------
>  kernel/rcu/Makefile           |    2 +-
>  kernel/rcu/sync.c             |  152 +++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 235 insertions(+), 57 deletions(-)
> 



* Re: [PATCH 1/7] rcu: Create rcu_sync infrastructure
  2015-07-15 18:15     ` Peter Zijlstra
@ 2015-07-15 18:28       ` Paul E. McKenney
  2015-07-15 19:08         ` Oleg Nesterov
  0 siblings, 1 reply; 22+ messages in thread
From: Paul E. McKenney @ 2015-07-15 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Linus Torvalds, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On Wed, Jul 15, 2015 at 08:15:11PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 15, 2015 at 11:05:46AM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 12, 2015 at 01:35:48AM +0200, Oleg Nesterov wrote:
> > > It is functionally equivalent to
> > > 
> > >         struct rcu_sync_struct {
> > >                 atomic_t counter;
> > >         };
> > > 
> > >         static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
> > >         {
> > 
> > If you add an smp_mb() here...
> > 
> > >                 return atomic_read(&rss->counter) == 0;
> > >         }
> > > 
> > >         static inline void rcu_sync_enter(struct rcu_sync_struct *rss)
> > >         {
> > >                 atomic_inc(&rss->counter);
> > >                 synchronize_sched();
> > >         }
> > > 
> > >         static inline void rcu_sync_exit(struct rcu_sync_struct *rss)
> > >         {
> > >                 synchronize_sched();
> > 
> > You should be able to demote the above synchronize_sched() to an
> > smp_mb__before_atomic().  Even rare writes should make this tradeoff
> > worthwhile.
> 
> No, it makes the read-side primitive contain an unconditional memory
> barrier, which forgoes the entire point.
> 
> The writers are stupidly expensive already, for they need global
> serialization; optimizing them in any way doesn't make sense.

That could well be the case, but it would be good to see the numbers.

							Thanx, Paul



* Re: [PATCH 1/7] rcu: Create rcu_sync infrastructure
  2015-07-15 18:05   ` Paul E. McKenney
  2015-07-15 18:15     ` Peter Zijlstra
@ 2015-07-15 18:41     ` Oleg Nesterov
  1 sibling, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-15 18:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Peter Zijlstra, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On 07/15, Paul E. McKenney wrote:
>
> On Sun, Jul 12, 2015 at 01:35:48AM +0200, Oleg Nesterov wrote:
> > It is functionally equivalent to
> >
> >         struct rcu_sync_struct {
> >                 atomic_t counter;
> >         };
> >
> >         static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
> >         {
>
> If you add an smp_mb() here...

I don't think so, please see below...

> >         static inline void rcu_sync_exit(struct rcu_sync_struct *rss)
> >         {
> >                 synchronize_sched();
>
> You should be able to demote the above synchronize_sched() to an
> smp_mb__before_atomic().  Even rare writes should make this tradeoff
> worthwhile.

I think this is irrelevant; this (pseudo) code just tries to explain
what this interface does.

> > +static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
> > +{
>
> 	smp_mb();  /* A: Ensure that reader sees last update. */
> 		   /* Pairs with B. */
>

Let me remind you about your f0a0e6f282c72247e7c8ec "rcu: Clarify
memory-ordering properties of grace-period primitives" documentation
patch ;)

We do not need any barrier, assuming that this is called under
preempt_disable/etc.

rcu_sync_is_idle() becomes true after another gp pass. The reader
should see all updates after that.
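
To illustrate the point, here is a rough pseudo-code sketch (not the code
in this series; the struct members and helper names below are made up),
showing that the whole check + increment runs inside a single
preempt-disabled, i.e. rcu-sched read-side, section:

	/*
	 * Sketch only.  A writer does rcu_sync_enter() -> synchronize_sched()
	 * before it can proceed, so it cannot miss a fast-path reader which
	 * observed rcu_sync_is_idle() == true inside this section.
	 */
	static void percpu_down_read_sketch(struct percpu_rw_semaphore *sem)
	{
		bool fast;

		preempt_disable();			/* acts as rcu_read_lock_sched() */
		fast = rcu_sync_is_idle(&sem->rss);
		if (fast)
			__this_cpu_inc(*sem->fast_read_ctr);	/* no mb(), no lock */
		preempt_enable();

		if (!fast)
			down_read(&sem->rw_sem);	/* slow path, a writer is around */
	}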

> > +void rcu_sync_exit(struct rcu_sync_struct *rss)
> > +{
> > +	spin_lock_irq(&rss->rss_lock);
>
> 	smp_mb(); /* B: Make sure next readers see critical section. */
> 		  /* Pairs with A. */
>
> > +	if (!--rss->gp_count) {
>
> At which point, I believe you can ditch the callback entirely, along
> with ->cb_state.
>
> So, what am I missing here?

Please see above. We start another gp before "unlock" to avoid mb's in
the reader's code.

> Are readers really so frequent that the
> added read-side memory barrier is visible?

But this code is heavily optimized for the readers. And please see the
other discussion about sb_writers and percpu_rw_semaphore. I was
surprised, but the mb() in sb_start_write() is actually noticeable.

> Given that the current
> code forces the readers to grab ->rss_lock

Where? The readers never take this lock.

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/7] percpu-rwsem: change it to rely on rss_sync infrastructure
  2015-07-15 18:15   ` Paul E. McKenney
@ 2015-07-15 18:59     ` Oleg Nesterov
  0 siblings, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-15 18:59 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Peter Zijlstra, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On 07/15, Paul E. McKenney wrote:
>
> On Sun, Jul 12, 2015 at 01:36:01AM +0200, Oleg Nesterov wrote:
> > Currently down_write/up_write calls synchronize_sched_expedited()
> > twice which is evil. Change this code to rely on rcu-sync primitives.
> > This avoids the _expedited "big hammer", and this can be faster in
> > the contended case or even in the case when a single thread does
> > down_write/up_write in a loop.
>
> But "evil" is such a strong word!  ;-)

I tried to be convincing ;)

But yes, to me this synchronize_sched_expedited() looks really annoying
because it is not friendly to other workloads.

> More seriously, introducing a read-side smp_mb()

Sure, we can penalize the readers and simplify this code. But the
main purpose of this primitive was "make the readers fast".

I'll write another email about this...

> My main concern would be the introduction of an exclusive lock for the
> writer, but if you are hitting the write-side acquisition that hard,
> perhaps you are having other bigger problems.

Sorry, I don't really understand... could you explain?

> Another concern would be performance in a workload where there are way
> more readers than writers, but enough writers so that there is not much
> more than one grace period's worth of spacing between them.  In that case,
> the readers are having to hit the global lock more frequently with this
> change than in the original.

Of course, this is true.

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/7] rcu: Create rcu_sync infrastructure
  2015-07-15 18:28       ` Paul E. McKenney
@ 2015-07-15 19:08         ` Oleg Nesterov
  2015-07-15 19:15           ` Paul E. McKenney
  0 siblings, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-15 19:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Linus Torvalds, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On 07/15, Paul E. McKenney wrote:
>
> On Wed, Jul 15, 2015 at 08:15:11PM +0200, Peter Zijlstra wrote:
> >
> > No, it makes the read-side primitive contain an unconditional memory
> > barrier, that forgoes the entire point.
> >
> > The writers are stupidly expensive already for they need global
> > serialization, optimizing them in any way doesn't make sense.
>
> That could well be the case, but it would be good to see the numbers.

Please see the discussion in the other thread, "change sb_writers to use
percpu_rw_semaphore".

The simple test-case from Dave:

	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <string.h>
	#include <assert.h>

	#define BUFLEN 1
	#define FILESIZE (1 * 1024 * 1024)

	char *testcase_description = "Separate file write";

	void testcase(unsigned long long *iterations)
	{
		char buf[BUFLEN];
		char tmpfile[] = "/run/user/1000/willitscale.XXXXXX";
		int fd = mkstemp(tmpfile);
		unsigned long size = 0;

		memset(buf, 0, sizeof(buf));
		assert(fd >= 0);
		unlink(tmpfile);

		while (1) {
			int ret = write(fd, buf, BUFLEN);
			assert(ret >= 0);
			size += ret;
			if (size >= FILESIZE) {
				size = 0;
				lseek(fd, 0, SEEK_SET);
			}

			(*iterations)++;
		}
	}

runs 12% faster if we "simply" remove mb's from sb_start/end_write().
percpu_rw_semaphore does this too and has approximately the same
performance, and we can (hopefully) remove this nontrivial, currently
not 100% correct, and very "special" code in fs/super.c.
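
(For reference, the current fast path looks roughly like the sketch below;
this is simplified from memory, not the exact fs/super.c code, and the
field names are assumptions:)

	static bool sb_start_write_sketch(struct super_block *sb, int level)
	{
		percpu_counter_inc(&sb->s_writers.counter[level - 1]);
		/* this smp_mb(), pairing with a barrier in freeze_super(),
		 * is where the 12% above comes from */
		smp_mb();
		if (unlikely(sb->s_writers.frozen >= level)) {
			/* raced with freeze_super(): undo and take the slow path */
			percpu_counter_dec(&sb->s_writers.counter[level - 1]);
			return false;
		}
		return true;
	}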

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/7] rcu: Create rcu_sync infrastructure
  2015-07-15 19:08         ` Oleg Nesterov
@ 2015-07-15 19:15           ` Paul E. McKenney
  0 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2015-07-15 19:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Linus Torvalds, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On Wed, Jul 15, 2015 at 09:08:15PM +0200, Oleg Nesterov wrote:
> On 07/15, Paul E. McKenney wrote:
> >
> > On Wed, Jul 15, 2015 at 08:15:11PM +0200, Peter Zijlstra wrote:
> > >
> > > No, it makes the read-side primitive contain an unconditional memory
> > > barrier, that forgoes the entire point.
> > >
> > > The writers are stupidly expensive already for they need global
> > > serialization, optimizing them in any way doesn't make sense.
> >
> > That could well be the case, but it would be good to see the numbers.
> 
> Please see the discussion in another "change sb_writers to use
> percpu_rw_semaphore".
> 
> The simple test-case from Dave
> 
> 	#include <fcntl.h>
> 	#include <stdlib.h>
> 	#include <unistd.h>
> 	#include <string.h>
> 	#include <assert.h>
> 
> 	#define BUFLEN 1
> 	#define FILESIZE (1 * 1024 * 1024)
> 
> 	char *testcase_description = "Separate file write";
> 
> 	void testcase(unsigned long long *iterations)
> 	{
> 		char buf[BUFLEN];
> 		char tmpfile[] = "/run/user/1000/willitscale.XXXXXX";
> 		int fd = mkstemp(tmpfile);
> 		unsigned long size = 0;
> 
> 		memset(buf, 0, sizeof(buf));
> 		assert(fd >= 0);
> 		unlink(tmpfile);
> 
> 		while (1) {
> 			int ret = write(fd, buf, BUFLEN);
> 			assert(ret >= 0);
> 			size += ret;
> 			if (size >= FILESIZE) {
> 				size = 0;
> 				lseek(fd, 0, SEEK_SET);
> 			}
> 
> 			(*iterations)++;
> 		}
> 	}
> 
> runs 12% faster if we "simply" remove mb's from sb_start/end_write().
> percpu_rw_semaphore does this too and has the approximately same
> performance, and we can (hopefully) remove this nontrivial, currently
> not 100% correct, and very "special" code in fs/super.c.

OK, if that is the type of workload you are using this stuff for,
you really don't want read-side memory barriers.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem
  2015-07-15 18:27 ` Paul E. McKenney
@ 2015-07-15 19:36   ` Oleg Nesterov
  2015-07-15 21:59     ` Paul E. McKenney
  0 siblings, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-15 19:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Peter Zijlstra, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On 07/15, Paul E. McKenney wrote:
>
> On Sun, Jul 12, 2015 at 01:35:35AM +0200, Oleg Nesterov wrote:
>
> > Let's start with the simple test-case,
> >
> > 	#!/bin/bash
> >
> > 	perf probe -x /lib/libc.so.6 syscall
> >
> > 	for i in {1..1000}; do
> > 		echo 1 >| /sys/kernel/debug/tracing/events/probe_libc/syscall/enable
> > 		echo 0 >| /sys/kernel/debug/tracing/events/probe_libc/syscall/enable
> > 	done
> >
> > It needs ~ 13.5 seconds (2 CPUs, KVM). If we simply replace
> > synchronize_sched_expedited() with synchronize_sched() it takes
> > ~ 67.5 seconds. This is not good.
>
> Yep, even if you avoided the write-release grace period, you would
> still be looking at something like 40 seconds, which is 3x.  Some
> might consider that to be a performance regression.  ;-)

Yes ;)

> > And just in case, I also measured
> >
> > 	for (i = 0; i < 1000000; ++i) {
> > 		percpu_down_write(&dup_mmap_sem);
> > 		percpu_up_write(&dup_mmap_sem);
> > 	}
> >
> > and it runs more than 1.5 times faster (to remind, only 2 CPUs),
> > but this is not that interesting, I agree.
>
> Your trick avoiding the grace periods during a writer-to-writer handoff
> are cute, and they are helping a lot here.

Yes. And even the fact that a single writer doesn't need to sleep in
percpu_up_write()->synchronize_sched() looks good imo.

Yes, yes, we can remove it if we penalize the readers, but I'd like to
avoid this.

> Concurrent readers would
> have a tough time of it with this workload, though.  They would all
> be serialized.

Sure. In this case it is not better than the normal rw_semaphore. Worse,
actually.

> > And note that the actual change in percpu-rwsem is really simple,
> > and imo it even makes the code simpler. (the last patch is off-
> > topic cleanup).
> >
> > So the only complication is rcu_sync itself. But, rightly or not (I
> > am obviously biased), I believe this new rcu infrastructure is natural
> > and useful, and I think it can have more users too.
>
> I don't have an objection to it, even in its current form (I did
> review it long ago), but it does need to have a user!

Do you mean you need another user besides percpu_rw_semaphore? I do
not see any right now...

Let me remind you about sb_writers again. It actually has 3 rw_sem's,
and I am trying to turn them into percpu_rw_semaphore's.

In this case freeze_super() will need 6 synchronize_sched_expedited().
This just looks ugly. But if we have the rcu_sync primitives, all 3 sem's
in struct super_block can share the same "struct rcu_sync", and
freeze_super() will need only one synchronize_sched().
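
Roughly, I mean something like this (a hypothetical sketch only; the
"shared rcu_sync" arrangement below does not exist in this series and the
names are made up):

	struct sb_writers_sketch {
		struct rcu_sync			rss;	/* shared by all levels */
		struct percpu_rw_semaphore	rw_sem[SB_FREEZE_LEVELS];
	};

	static void freeze_super_sketch(struct sb_writers_sketch *sbw)
	{
		int level;

		/* one synchronize_sched() switches all levels to the slow path */
		rcu_sync_enter(&sbw->rss);

		for (level = 0; level < SB_FREEZE_LEVELS; level++)
			percpu_down_write(&sbw->rw_sem[level]);	/* no more GPs */
	}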

> > And. We can do more improvements in rcu_sync and percpu-rwsem, and
> > I don't only mean other optimizations from Peter. In particular, we
> > can extract the "wait for gp pass" from rcu_sync_enter() into another
> > helper, we can teach percpu_down_write() to allow multiple writers,
> > and more.
>
> As in a percpu_down_write() that allows up to (say) five concurrent
> write-holders?

Yes. Because this is what uprobes (and probably cgroups) actually needs.
It does not need a global lock; it just needs to exclude the "readers"
(dup_mmap) globally.
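
Something like this hypothetical sketch (not part of this series; it uses
the inner rw_sem "inverted", writers sharing it while slow-path readers
would take it exclusively, and it glosses over draining the in-flight
fast-path readers):

	static void percpu_down_write_shared_sketch(struct percpu_rw_semaphore *sem)
	{
		rcu_sync_enter(&sem->rss);	/* push new readers to the slow path */
		down_read(&sem->rw_sem);	/* shared among the write-holders */
	}

	static void percpu_up_write_shared_sketch(struct percpu_rw_semaphore *sem)
	{
		up_read(&sem->rw_sem);
		rcu_sync_exit(&sem->rss);	/* last writer re-enables the fast path */
	}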

And in fact the very first version I sent worked this way. Then I removed
this because a) it was a bit "unusual" for reviewers ;) and b) I raced
with another commit which had already added the initial (and sub-optimal)
version of percpu_rw_semaphore.

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem
  2015-07-15 19:36   ` Oleg Nesterov
@ 2015-07-15 21:59     ` Paul E. McKenney
  2015-07-17 23:29       ` Oleg Nesterov
  0 siblings, 1 reply; 22+ messages in thread
From: Paul E. McKenney @ 2015-07-15 21:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Peter Zijlstra, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On Wed, Jul 15, 2015 at 09:36:01PM +0200, Oleg Nesterov wrote:
> On 07/15, Paul E. McKenney wrote:
> >
> > On Sun, Jul 12, 2015 at 01:35:35AM +0200, Oleg Nesterov wrote:
> >
> > > Let's start with the simple test-case,
> > >
> > > 	#!/bin/bash
> > >
> > > 	perf probe -x /lib/libc.so.6 syscall
> > >
> > > 	for i in {1..1000}; do
> > > 		echo 1 >| /sys/kernel/debug/tracing/events/probe_libc/syscall/enable
> > > 		echo 0 >| /sys/kernel/debug/tracing/events/probe_libc/syscall/enable
> > > 	done
> > >
> > > It needs ~ 13.5 seconds (2 CPUs, KVM). If we simply replace
> > > synchronize_sched_expedited() with synchronize_sched() it takes
> > > ~ 67.5 seconds. This is not good.
> >
> > Yep, even if you avoided the write-release grace period, you would
> > still be looking at something like 40 seconds, which is 3x.  Some
> > might consider that to be a performance regression.  ;-)
> 
> Yes ;)
> 
> > > And just in case, I also measured
> > >
> > > 	for (i = 0; i < 1000000; ++i) {
> > > 		percpu_down_write(&dup_mmap_sem);
> > > 		percpu_up_write(&dup_mmap_sem);
> > > 	}
> > >
> > > and it runs more than 1.5 times faster (to remind, only 2 CPUs),
> > > but this is not that interesting, I agree.
> >
> > Your trick avoiding the grace periods during a writer-to-writer handoff
> > are cute, and they are helping a lot here.
> 
> Yes. And even the fact that a single writer doesn't need to sleep in
> percpu_up_write()->synchronize_sched() looks good imo.
> 
> Yes, yes, we can remove it if we penalize the readers, but I'd like to
> avoid this.
> 
> > Concurrent readers would
> > have a tough time of it with this workload, though.  They would all
> > be serialized.
> 
> Sure. in this case it is not better than the normal rw_semaphore. Worse
> actually.
> 
> > > And note that the actual change in percpu-rwsem is really simple,
> > > and imo it even makes the code simpler. (the last patch is off-
> > > topic cleanup).
> > >
> > > So the only complication is rcu_sync itself. But, rightly or not (I
> > > am obviously biased), I believe this new rcu infrastructure is natural
> > > and useful, and I think it can have more users too.
> >
> > I don't have an objection to it, even in its current form (I did
> > review it long ago), but it does need to have a user!
> 
> Do you mean you need another user except percpu_rw_semaphore? I do
> not see any right now...

Not asking for more than one use, but it does need a use.  I believe
that percpu_rw_semaphore suffices.

> Let me remind about sb_writers again. It actually has 3 rw_sem's
> and I am trying to turn then into percpu_rw_semaphore's.
> 
> In this case freeze_super() will need 6 synchronize_sched_expedited().
> This just looks ugly. But if we have rcu_sync primitives, all 3 sem's
> in struct super_block can share the same "struct rcu_sync", and
> freeze_super() will need only once synchronize_sched().

Makes sense.

> > > And. We can do more improvements in rcu_sync and percpu-rwsem, and
> > > I don't only mean other optimizations from Peter. In particular, we
> > > can extract the "wait for gp pass" from rcu_sync_enter() into another
> > > helper, we can teach percpu_down_write() to allow multiple writers,
> > > and more.
> >
> > As in a percpu_down_write() that allows up to (say) five concurrent
> > write-holders?
> 
> Yes. Because this is what uprobes (and probably cgroups) actually needs.
> It does not need the global lock. Just it needs to exclude the "readers"
> (dup_mmap) globally.
> 
> And in fact the very first version I sent worked this way. Then I removed
> this because a) this was a bit "unusual" for reviewers ;) and b) because
> I raced with another commit which has already added the initial (and
> sub-optimal) version of percpu_rw_semaphore.

;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem
  2015-07-15 21:59     ` Paul E. McKenney
@ 2015-07-17 23:29       ` Oleg Nesterov
  2015-07-17 23:47         ` Paul E. McKenney
  0 siblings, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2015-07-17 23:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Peter Zijlstra, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On 07/15, Paul E. McKenney wrote:
>
> On Wed, Jul 15, 2015 at 09:36:01PM +0200, Oleg Nesterov wrote:
> >
> > Do you mean you need another user except percpu_rw_semaphore? I do
> > not see any right now...
>
> Not asking for more than one use, but it does need a use.  I believe
> that percpu_rw_semaphore suffices.
>
> > Let me remind about sb_writers again. It actually has 3 rw_sem's
> > and I am trying to turn then into percpu_rw_semaphore's.
> >
> > In this case freeze_super() will need 6 synchronize_sched_expedited().
> > This just looks ugly. But if we have rcu_sync primitives, all 3 sem's
> > in struct super_block can share the same "struct rcu_sync", and
> > freeze_super() will need only once synchronize_sched().
>
> Makes sense.

Great, thanks. And IIUC Linus doesn't object to this particular change.
Plus I see the "Make checkpatch.pl warn on expedited RCU grace periods"
patch ;)

So can I assume you will take these changes?

I do not need them right now; I just need to know what I should do in
destroy_super() and (much more importantly) what I should say in the
changelogs if I try to convert sb_writers to use percpu_rw_semaphore.

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem
  2015-07-17 23:29       ` Oleg Nesterov
@ 2015-07-17 23:47         ` Paul E. McKenney
  0 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2015-07-17 23:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Peter Zijlstra, Daniel Wagner, Davidlohr Bueso,
	Ingo Molnar, Tejun Heo, linux-kernel

On Sat, Jul 18, 2015 at 01:29:30AM +0200, Oleg Nesterov wrote:
> On 07/15, Paul E. McKenney wrote:
> >
> > On Wed, Jul 15, 2015 at 09:36:01PM +0200, Oleg Nesterov wrote:
> > >
> > > Do you mean you need another user except percpu_rw_semaphore? I do
> > > not see any right now...
> >
> > Not asking for more than one use, but it does need a use.  I believe
> > that percpu_rw_semaphore suffices.
> >
> > > Let me remind about sb_writers again. It actually has 3 rw_sem's
> > > and I am trying to turn then into percpu_rw_semaphore's.
> > >
> > > In this case freeze_super() will need 6 synchronize_sched_expedited().
> > > This just looks ugly. But if we have rcu_sync primitives, all 3 sem's
> > > in struct super_block can share the same "struct rcu_sync", and
> > > freeze_super() will need only once synchronize_sched().
> >
> > Makes sense.
> 
> Great, thanks. And iiuc Linus doesn't object to this particular change.
> Plus I see the "Make checkpatch.pl warn on expedited RCU grace periods"
> patch ;)

Note that it is a warning rather than an error.  ;-)

> So can I assume you will take these changes?
> 
> I do not need them right now, just I need to know what should I do in
> destroy_super() and (much more importantly) what should I say in the
> changelogs if I try to convert sb_writers to use percpu_rw_semaphore.

Yes, given a real use case, which you do appear to have, I will take
these changes.

								Thanx, Paul


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-07-17 23:47 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-11 23:35 [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Oleg Nesterov
2015-07-11 23:35 ` [PATCH 1/7] rcu: Create rcu_sync infrastructure Oleg Nesterov
2015-07-15 18:05   ` Paul E. McKenney
2015-07-15 18:15     ` Peter Zijlstra
2015-07-15 18:28       ` Paul E. McKenney
2015-07-15 19:08         ` Oleg Nesterov
2015-07-15 19:15           ` Paul E. McKenney
2015-07-15 18:41     ` Oleg Nesterov
2015-07-11 23:35 ` [PATCH 2/7] rcusync: Introduce struct rcu_sync_ops Oleg Nesterov
2015-07-11 23:35 ` [PATCH 3/7] rcusync: Add the CONFIG_PROVE_RCU checks Oleg Nesterov
2015-07-11 23:35 ` [PATCH 4/7] rcusync: Introduce rcu_sync_dtor() Oleg Nesterov
2015-07-11 23:36 ` [PATCH 5/7] percpu-rwsem: change it to rely on rss_sync infrastructure Oleg Nesterov
2015-07-15 18:15   ` Paul E. McKenney
2015-07-15 18:59     ` Oleg Nesterov
2015-07-11 23:36 ` [PATCH 6/7] percpu-rwsem: fix the comments outdated by rcu_sync Oleg Nesterov
2015-07-11 23:36 ` [PATCH 7/7] percpu-rwsem: cleanup the lockdep annotations in percpu_down_read() Oleg Nesterov
2015-07-11 23:47 ` [PATCH 0/7] Add rcu_sync infrastructure to avoid _expedited() in percpu-rwsem Linus Torvalds
2015-07-15 18:27 ` Paul E. McKenney
2015-07-15 19:36   ` Oleg Nesterov
2015-07-15 21:59     ` Paul E. McKenney
2015-07-17 23:29       ` Oleg Nesterov
2015-07-17 23:47         ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).