* [PATCH v2] membarrier: provide register sync core cmd
@ 2017-08-27 20:50 Mathieu Desnoyers
  2017-08-27 22:53 ` Andy Lutomirski
  2017-08-30  4:01 ` Boqun Feng
  0 siblings, 2 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2017-08-27 20:50 UTC (permalink / raw)
  To: Paul E . McKenney, Peter Zijlstra
  Cc: linux-kernel, Mathieu Desnoyers, Boqun Feng, Andrew Hunter,
	Maged Michael, gromer, Avi Kivity, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Dave Watson, Andy Lutomirski,
	Will Deacon, Hans Boehm

Add a new MEMBARRIER_CMD_REGISTER_SYNC_CORE command to the membarrier
system call. It allows processes to register their intent to have their
threads issue core serializing barriers in addition to memory barriers
whenever a membarrier command is performed.

It is relevant for the reclaim of JIT code, which requires issuing core
serializing barriers on all threads running on behalf of a process
after ensuring the old code is no longer visible, before reusing the
memory for new code.

When a process returns from a MEMBARRIER_CMD_REGISTER_SYNC_CORE
command, it is guaranteed that all subsequent MEMBARRIER_CMD_SHARED and
MEMBARRIER_CMD_PRIVATE_EXPEDITED commands issue core serializing
barriers, in addition to the memory barriers, on all of its running
threads.
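
* Example Usage

A minimal user-space sketch of the intended JIT reclaim flow
(illustrative only; it assumes the updated uapi header from this patch
and a trivial membarrier() syscall wrapper):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	/* Once, at JIT start-up: opt in to core serializing barriers. */
	if (membarrier(MEMBARRIER_CMD_REGISTER_SYNC_CORE, 0))
		return 1;

	/*
	 * Later, after making the old JIT code unreachable and before
	 * reusing its memory: core serialize every running thread of
	 * this process, in addition to the memory barrier.
	 */
	if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0))
		return 1;
	return 0;
}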

* Scheduler Overhead Benchmarks

Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
taskset 01 ./perf bench sched pipe -T
Linux v4.13-rc6

                       Avg. usecs/op         Std.Dev. usecs/op
Before this change:         2.75                   0.12
Non-registered processes:   2.73                   0.08
Registered processes:       3.07                   0.02

Changes since v1:
- Add missing MEMBARRIER_CMD_REGISTER_SYNC_CORE header documentation,
- Add benchmarks to commit message.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: gromer@google.com
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Will Deacon <will.deacon@arm.com>
CC: Hans Boehm <hboehm@google.com>
---
 fs/exec.c                       |  1 +
 include/linux/sched.h           | 52 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/membarrier.h |  8 +++++++
 kernel/fork.c                   |  2 ++
 kernel/sched/core.c             |  3 +++
 kernel/sched/membarrier.c       | 37 ++++++++++++++++++++++++++---
 6 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 62175cbcc801..a4ab3253bac7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1794,6 +1794,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	membarrier_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8337e2db0bb2..b1ecdc4e8b84 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1086,6 +1086,9 @@ struct task_struct {
 	/* Used by LSM modules for access restriction: */
 	void				*security;
 #endif
+#ifdef CONFIG_MEMBARRIER
+	int membarrier_sync_core;
+#endif
 
 	/*
 	 * New fields for task_struct should be added above here, so that
@@ -1623,4 +1626,53 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
+#ifdef CONFIG_MEMBARRIER
+static inline void membarrier_fork(struct task_struct *t,
+		unsigned long clone_flags)
+{
+	/*
+	 * Coherence of membarrier_sync_core against thread fork is
+	 * protected by siglock. membarrier_fork is called with siglock
+	 * held.
+	 */
+	t->membarrier_sync_core = current->membarrier_sync_core;
+}
+static inline void membarrier_execve(struct task_struct *t)
+{
+	t->membarrier_sync_core = 0;
+}
+static inline void membarrier_sched_out(struct task_struct *t)
+{
+	/*
+	 * Core serialization is performed before the memory barrier
+	 * preceding the store to rq->curr.
+	 */
+	if (unlikely(READ_ONCE(t->membarrier_sync_core)))
+		sync_core();
+}
+static inline void membarrier_sched_in(struct task_struct *t)
+{
+	/*
+	 * Core serialization is performed after the memory barrier
+	 * following the store to rq->curr.
+	 */
+	if (unlikely(READ_ONCE(t->membarrier_sync_core)))
+		sync_core();
+}
+#else
+static inline void membarrier_fork(struct task_struct *t,
+		unsigned long clone_flags)
+{
+}
+static inline void membarrier_execve(struct task_struct *t)
+{
+}
+static inline void membarrier_sched_out(struct task_struct *t)
+{
+}
+static inline void membarrier_sched_in(struct task_struct *t)
+{
+}
+#endif
+
 #endif
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index 6d47b3249d8a..933c35ebcc10 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -56,6 +56,13 @@
  *                          complete faster than the non-expedited ones,
  *                          they never block, but have the downside of
  *                          causing extra overhead.
+ * @MEMBARRIER_CMD_REGISTER_SYNC_CORE:
+ *                          Register the caller process so all of its
+ *                          threads will issue core serialization
+ *                          barriers in addition to memory barriers upon
+ *                          SHARED and PRIVATE barriers targeting
+ *                          its threads issued after this registration
+ *                          returns.
  *
  * Command to be passed to the membarrier system call. The commands need to
  * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
@@ -67,6 +74,7 @@ enum membarrier_cmd {
 	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
 	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
 	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
+	MEMBARRIER_CMD_REGISTER_SYNC_CORE	= (1 << 4),
 };
 
 #endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 17921b0390b4..713b3c932671 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1840,6 +1840,8 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	copy_seccomp(p);
 
+	membarrier_fork(p, clone_flags);
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0e36d9960d91..3ca27d066950 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3292,6 +3292,8 @@ static void __sched notrace __schedule(bool preempt)
 	local_irq_disable();
 	rcu_note_context_switch(preempt);
 
+	membarrier_sched_out(prev);
+
 	/*
 	 * Make sure that signal_pending_state()->signal_pending() below
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
@@ -3364,6 +3366,7 @@ static void __sched notrace __schedule(bool preempt)
 
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
+		membarrier_sched_in(next);
 	} else {
 		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
 		rq_unlock_irq(rq, &rf);
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 7eec6914d2d2..f128caf64b49 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -25,12 +25,16 @@
  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
  * except MEMBARRIER_CMD_QUERY.
  */
-#define MEMBARRIER_CMD_BITMASK	\
-	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
+#define MEMBARRIER_CMD_BITMASK			\
+	(MEMBARRIER_CMD_SHARED			\
+	| MEMBARRIER_CMD_PRIVATE_EXPEDITED	\
+	| MEMBARRIER_CMD_REGISTER_SYNC_CORE)
 
 static void ipi_mb(void *info)
 {
-	smp_mb();	/* IPIs should be serializing but paranoid. */
+	/* IPIs should be serializing but paranoid. */
+	smp_mb();
+	sync_core();
 }
 
 static void membarrier_private_expedited(void)
@@ -96,6 +100,30 @@ static void membarrier_private_expedited(void)
 	smp_mb();	/* exit from system call is not a mb */
 }
 
+static void membarrier_register_sync_core(void)
+{
+	struct task_struct *p = current, *t;
+
+	if (get_nr_threads(p) == 1) {
+		p->membarrier_sync_core = 1;
+		return;
+	}
+
+	/*
+	 * Coherence of membarrier_sync_core against thread fork is
+	 * protected by siglock.
+	 */
+	spin_lock(&p->sighand->siglock);
+	for_each_thread(p, t)
+		WRITE_ONCE(t->membarrier_sync_core, 1);
+	spin_unlock(&p->sighand->siglock);
+	/*
+	 * Ensure all future scheduler execution will observe the new
+	 * membarrier_sync_core state for this process.
+	 */
+	synchronize_sched();
+}
+
 /**
  * sys_membarrier - issue memory barriers on a set of threads
  * @cmd:   Takes command values defined in enum membarrier_cmd.
@@ -146,6 +174,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
 	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
 		membarrier_private_expedited();
 		return 0;
+	case MEMBARRIER_CMD_REGISTER_SYNC_CORE:
+		membarrier_register_sync_core();
+		return 0;
 	default:
 		return -EINVAL;
 	}
-- 
2.11.0


* Re: [PATCH v2] membarrier: provide register sync core cmd
  2017-08-27 20:50 [PATCH v2] membarrier: provide register sync core cmd Mathieu Desnoyers
@ 2017-08-27 22:53 ` Andy Lutomirski
  2017-08-28  3:05   ` Mathieu Desnoyers
  2017-08-30  4:01 ` Boqun Feng
  1 sibling, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2017-08-27 22:53 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E . McKenney, Peter Zijlstra, linux-kernel, Boqun Feng,
	Andrew Hunter, Maged Michael, gromer, Avi Kivity,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Dave Watson, Andy Lutomirski, Will Deacon, Hans Boehm



> On Aug 27, 2017, at 1:50 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> Add a new MEMBARRIER_CMD_REGISTER_SYNC_CORE command to the membarrier
> system call. It allows processes to register their intent to have their
> threads issue core serializing barriers in addition to memory barriers
> whenever a membarrier command is performed.
> 

Why is this stateful?  That is, why not just have a new membarrier command to sync every thread's icache?

> It is relevant for the reclaim of JIT code, which requires issuing core
> serializing barriers on all threads running on behalf of a process
> after ensuring the old code is no longer visible, before reusing the
> memory for new code.
> 
> When a process returns from a MEMBARRIER_CMD_REGISTER_SYNC_CORE
> command, it is guaranteed that all subsequent MEMBARRIER_CMD_SHARED and
> MEMBARRIER_CMD_PRIVATE_EXPEDITED commands issue core serializing
> barriers, in addition to the memory barriers, on all of its running
> threads.
> 
> * Scheduler Overhead Benchmarks
> 
> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> taskset 01 ./perf bench sched pipe -T
> Linux v4.13-rc6
> 
>                       Avg. usecs/op         Std.Dev. usecs/op
> Before this change:         2.75                   0.12
> Non-registered processes:   2.73                   0.08
> Registered processes:       3.07                   0.02
> 
> Changes since v1:
> - Add missing MEMBARRIER_CMD_REGISTER_SYNC_CORE header documentation,
> - Add benchmarks to commit message.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> CC: Boqun Feng <boqun.feng@gmail.com>
> CC: Andrew Hunter <ahh@google.com>
> CC: Maged Michael <maged.michael@gmail.com>
> CC: gromer@google.com
> CC: Avi Kivity <avi@scylladb.com>
> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> CC: Paul Mackerras <paulus@samba.org>
> CC: Michael Ellerman <mpe@ellerman.id.au>
> CC: Dave Watson <davejwatson@fb.com>
> CC: Andy Lutomirski <luto@kernel.org>
> CC: Will Deacon <will.deacon@arm.com>
> CC: Hans Boehm <hboehm@google.com>
> ---
> fs/exec.c                       |  1 +
> include/linux/sched.h           | 52 +++++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/membarrier.h |  8 +++++++
> kernel/fork.c                   |  2 ++
> kernel/sched/core.c             |  3 +++
> kernel/sched/membarrier.c       | 37 ++++++++++++++++++++++++++---
> 6 files changed, 100 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 62175cbcc801..a4ab3253bac7 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1794,6 +1794,7 @@ static int do_execveat_common(int fd, struct filename *filename,
>    /* execve succeeded */
>    current->fs->in_exec = 0;
>    current->in_execve = 0;
> +    membarrier_execve(current);
>    acct_update_integrals(current);
>    task_numa_free(current);
>    free_bprm(bprm);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8337e2db0bb2..b1ecdc4e8b84 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1086,6 +1086,9 @@ struct task_struct {
>    /* Used by LSM modules for access restriction: */
>    void                *security;
> #endif
> +#ifdef CONFIG_MEMBARRIER
> +    int membarrier_sync_core;
> +#endif
> 
>    /*
>     * New fields for task_struct should be added above here, so that
> @@ -1623,4 +1626,53 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
> #define TASK_SIZE_OF(tsk)    TASK_SIZE
> #endif
> 
> +#ifdef CONFIG_MEMBARRIER
> +static inline void membarrier_fork(struct task_struct *t,
> +        unsigned long clone_flags)
> +{
> +    /*
> +     * Coherence of membarrier_sync_core against thread fork is
> +     * protected by siglock. membarrier_fork is called with siglock
> +     * held.
> +     */
> +    t->membarrier_sync_core = current->membarrier_sync_core;
> +}
> +static inline void membarrier_execve(struct task_struct *t)
> +{
> +    t->membarrier_sync_core = 0;
> +}
> +static inline void membarrier_sched_out(struct task_struct *t)
> +{
> +    /*
> +     * Core serialization is performed before the memory barrier
> +     * preceding the store to rq->curr.
> +     */
> +    if (unlikely(READ_ONCE(t->membarrier_sync_core)))
> +        sync_core();
> +}
> +static inline void membarrier_sched_in(struct task_struct *t)
> +{
> +    /*
> +     * Core serialization is performed after the memory barrier
> +     * following the store to rq->curr.
> +     */
> +    if (unlikely(READ_ONCE(t->membarrier_sync_core)))
> +        sync_core();
> +}
> +#else
> +static inline void membarrier_fork(struct task_struct *t,
> +        unsigned long clone_flags)
> +{
> +}
> +static inline void membarrier_execve(struct task_struct *t)
> +{
> +}
> +static inline void membarrier_sched_out(struct task_struct *t)
> +{
> +}
> +static inline void membarrier_sched_in(struct task_struct *t)
> +{
> +}
> +#endif
> +
> #endif
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> index 6d47b3249d8a..933c35ebcc10 100644
> --- a/include/uapi/linux/membarrier.h
> +++ b/include/uapi/linux/membarrier.h
> @@ -56,6 +56,13 @@
>  *                          complete faster than the non-expedited ones,
>  *                          they never block, but have the downside of
>  *                          causing extra overhead.
> + * @MEMBARRIER_CMD_REGISTER_SYNC_CORE:
> + *                          Register the caller process so all of its
> + *                          threads will issue core serialization
> + *                          barriers in addition to memory barriers upon
> + *                          SHARED and PRIVATE barriers targeting
> + *                          its threads issued after this registration
> + *                          returns.
>  *
>  * Command to be passed to the membarrier system call. The commands need to
>  * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
> @@ -67,6 +74,7 @@ enum membarrier_cmd {
>    /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
>    /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
>    MEMBARRIER_CMD_PRIVATE_EXPEDITED    = (1 << 3),
> +    MEMBARRIER_CMD_REGISTER_SYNC_CORE    = (1 << 4),
> };
> 
> #endif /* _UAPI_LINUX_MEMBARRIER_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 17921b0390b4..713b3c932671 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1840,6 +1840,8 @@ static __latent_entropy struct task_struct *copy_process(
>     */
>    copy_seccomp(p);
> 
> +    membarrier_fork(p, clone_flags);
> +
>    /*
>     * Process group and session signals need to be delivered to just the
>     * parent before the fork or both the parent and the child after the
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0e36d9960d91..3ca27d066950 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3292,6 +3292,8 @@ static void __sched notrace __schedule(bool preempt)
>    local_irq_disable();
>    rcu_note_context_switch(preempt);
> 
> +    membarrier_sched_out(prev);
> +
>    /*
>     * Make sure that signal_pending_state()->signal_pending() below
>     * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> @@ -3364,6 +3366,7 @@ static void __sched notrace __schedule(bool preempt)
> 
>        /* Also unlocks the rq: */
>        rq = context_switch(rq, prev, next, &rf);
> +        membarrier_sched_in(next);
>    } else {
>        rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
>        rq_unlock_irq(rq, &rf);
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 7eec6914d2d2..f128caf64b49 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -25,12 +25,16 @@
>  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>  * except MEMBARRIER_CMD_QUERY.
>  */
> -#define MEMBARRIER_CMD_BITMASK    \
> -    (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> +#define MEMBARRIER_CMD_BITMASK            \
> +    (MEMBARRIER_CMD_SHARED            \
> +    | MEMBARRIER_CMD_PRIVATE_EXPEDITED    \
> +    | MEMBARRIER_CMD_REGISTER_SYNC_CORE)
> 
> static void ipi_mb(void *info)
> {
> -    smp_mb();    /* IPIs should be serializing but paranoid. */
> +    /* IPIs should be serializing but paranoid. */
> +    smp_mb();
> +    sync_core();
> }
> 
> static void membarrier_private_expedited(void)
> @@ -96,6 +100,30 @@ static void membarrier_private_expedited(void)
>    smp_mb();    /* exit from system call is not a mb */
> }
> 
> +static void membarrier_register_sync_core(void)
> +{
> +    struct task_struct *p = current, *t;
> +
> +    if (get_nr_threads(p) == 1) {
> +        p->membarrier_sync_core = 1;
> +        return;
> +    }
> +
> +    /*
> +     * Coherence of membarrier_sync_core against thread fork is
> +     * protected by siglock.
> +     */
> +    spin_lock(&p->sighand->siglock);
> +    for_each_thread(p, t)
> +        WRITE_ONCE(t->membarrier_sync_core, 1);
> +    spin_unlock(&p->sighand->siglock);
> +    /*
> +     * Ensure all future scheduler execution will observe the new
> +     * membarrier_sync_core state for this process.
> +     */
> +    synchronize_sched();
> +}
> +
> /**
>  * sys_membarrier - issue memory barriers on a set of threads
>  * @cmd:   Takes command values defined in enum membarrier_cmd.
> @@ -146,6 +174,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>    case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
>        membarrier_private_expedited();
>        return 0;
> +    case MEMBARRIER_CMD_REGISTER_SYNC_CORE:
> +        membarrier_register_sync_core();
> +        return 0;
>    default:
>        return -EINVAL;
>    }
> -- 
> 2.11.0
> 


* Re: [PATCH v2] membarrier: provide register sync core cmd
  2017-08-27 22:53 ` Andy Lutomirski
@ 2017-08-28  3:05   ` Mathieu Desnoyers
  2017-08-30  5:01     ` Andy Lutomirski
  2017-08-31 17:00     ` Will Deacon
  0 siblings, 2 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2017-08-28  3:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paul E. McKenney, Peter Zijlstra, linux-kernel, Boqun Feng,
	Andrew Hunter, maged michael, gromer, Avi Kivity,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Dave Watson, Andy Lutomirski, Will Deacon, Hans Boehm

----- On Aug 27, 2017, at 3:53 PM, Andy Lutomirski luto@amacapital.net wrote:

>> On Aug 27, 2017, at 1:50 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> wrote:
>> 
>> Add a new MEMBARRIER_CMD_REGISTER_SYNC_CORE command to the membarrier
>> system call. It allows processes to register their intent to have their
>> threads issue core serializing barriers in addition to memory barriers
>> whenever a membarrier command is performed.
>> 
> 
> Why is this stateful?  That is, why not just have a new membarrier command to
> sync every thread's icache?

If we'd do it on every CPU icache, it would be as trivial as you say. The
concern here is sending IPIs only to CPUs running threads that belong to the
same process, so we don't disturb unrelated processes.

If we could just grab each CPU's runqueue lock, it would be fairly simple
to do. But we want to avoid hitting each runqueue with exclusive atomic
access associated with grabbing the lock. (cache-line bouncing)

So the "private" membarrier command ends up reading the rq->curr->mm
pointer value for each runqueue and comparing each to its own
current->mm value. However, this means that whenever we skip a CPU, we
are not sending an IPI to that CPU. So we rely on the scheduler to
provide the required full barrier both before storing to rq->curr
(after the user-space memory accesses performed by "prev") and after
storing to rq->curr (before the user-space memory accesses performed
by "next").

The IPI of the private membarrier can issue both smp_mb() and
sync_core() (that's what my implementation does).
However, having sys_membarrier issue core serializing barriers adds
extra constraints on entry into the scheduler and on resuming to
user-space. It is not sufficient to order user-space memory accesses
wrt storing to rq->curr; we also want to serialize the core execution.
This is why I'm adding sync_core before the full barrier on entry, and
sync_core after the full barrier on exit. Arguably, some architectures
may not need the extra sync_core on exit (e.g. x86 has iret, which
implies core serialization), but there are cases where it is not
guaranteed (AFAIK sysexit), and it is rarely guaranteed on entry.

So we end up with the possibility of adding the core serialization
unconditionally on scheduler entry and exit. However, as my numbers
below show, performance is slightly impacted in heavy benchmarks.
Therefore, I propose making processes register their intent to have
the scheduler issue core serializing barriers on their behalf when it
schedules them out/in.
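
For reference, a rough sketch of that CPU selection, in the style of
kernel/sched/membarrier.c (simplified and illustrative only; the helper
name is made up and this is not the exact code from this patch):

static void membarrier_private_expedited_sketch(void)
{
	int cpu;
	cpumask_var_t tmpmask;

	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
		return;

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p;

		/* The caller's own CPU orders itself via the syscall. */
		if (cpu == raw_smp_processor_id())
			continue;
		rcu_read_lock();
		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
		/* Only IPI CPUs whose current task shares our mm. */
		if (p && p->mm == current->mm)
			__cpumask_set_cpu(cpu, tmpmask);
		rcu_read_unlock();
	}
	/* ipi_mb() is the smp_mb() + sync_core() handler from this patch. */
	preempt_disable();
	smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
	preempt_enable();
	cpus_read_unlock();
	free_cpumask_var(tmpmask);
}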

>> * Scheduler Overhead Benchmarks
>> 
>> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>> taskset 01 ./perf bench sched pipe -T
>> Linux v4.13-rc6
>> 
>>                       Avg. usecs/op         Std.Dev. usecs/op
>> Before this change:         2.75                   0.12
>> Non-registered processes:   2.73                   0.08
>> Registered processes:       3.07                   0.02
>> 

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH v2] membarrier: provide register sync core cmd
  2017-08-27 20:50 [PATCH v2] membarrier: provide register sync core cmd Mathieu Desnoyers
  2017-08-27 22:53 ` Andy Lutomirski
@ 2017-08-30  4:01 ` Boqun Feng
  2017-08-30 17:25   ` Mathieu Desnoyers
  1 sibling, 1 reply; 9+ messages in thread
From: Boqun Feng @ 2017-08-30  4:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E . McKenney, Peter Zijlstra, linux-kernel, Andrew Hunter,
	Maged Michael, gromer, Avi Kivity, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Dave Watson, Andy Lutomirski,
	Will Deacon, Hans Boehm


Hi Mathieu,

On Sun, Aug 27, 2017 at 01:50:35PM -0700, Mathieu Desnoyers wrote:
> Add a new MEMBARRIER_CMD_REGISTER_SYNC_CORE command to the membarrier
> system call. It allows processes to register their intent to have their
> threads issue core serializing barriers in addition to memory barriers
> whenever a membarrier command is performed.
> 
> It is relevant for the reclaim of JIT code, which requires issuing core
> serializing barriers on all threads running on behalf of a process
> after ensuring the old code is no longer visible, before reusing the
> memory for new code.
> 
> When a process returns from a MEMBARRIER_CMD_REGISTER_SYNC_CORE
> command, it is guaranteed that all subsequent MEMBARRIER_CMD_SHARED and
> MEMBARRIER_CMD_PRIVATE_EXPEDITED commands issue core serializing
> barriers, in addition to the memory barriers, on all of its running
> threads.
> 
> * Scheduler Overhead Benchmarks
> 
> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> taskset 01 ./perf bench sched pipe -T
> Linux v4.13-rc6
> 
>                        Avg. usecs/op         Std.Dev. usecs/op
> Before this change:         2.75                   0.12
> Non-registered processes:   2.73                   0.08
> Registered processes:       3.07                   0.02
> 
> Changes since v1:
> - Add missing MEMBARRIER_CMD_REGISTER_SYNC_CORE header documentation,
> - Add benchmarks to commit message.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> CC: Boqun Feng <boqun.feng@gmail.com>
> CC: Andrew Hunter <ahh@google.com>
> CC: Maged Michael <maged.michael@gmail.com>
> CC: gromer@google.com
> CC: Avi Kivity <avi@scylladb.com>
> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> CC: Paul Mackerras <paulus@samba.org>
> CC: Michael Ellerman <mpe@ellerman.id.au>
> CC: Dave Watson <davejwatson@fb.com>
> CC: Andy Lutomirski <luto@kernel.org>
> CC: Will Deacon <will.deacon@arm.com>
> CC: Hans Boehm <hboehm@google.com>
> ---
>  fs/exec.c                       |  1 +
>  include/linux/sched.h           | 52 +++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/membarrier.h |  8 +++++++
>  kernel/fork.c                   |  2 ++
>  kernel/sched/core.c             |  3 +++
>  kernel/sched/membarrier.c       | 37 ++++++++++++++++++++++++++---
>  6 files changed, 100 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 62175cbcc801..a4ab3253bac7 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1794,6 +1794,7 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	/* execve succeeded */
>  	current->fs->in_exec = 0;
>  	current->in_execve = 0;
> +	membarrier_execve(current);
>  	acct_update_integrals(current);
>  	task_numa_free(current);
>  	free_bprm(bprm);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8337e2db0bb2..b1ecdc4e8b84 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1086,6 +1086,9 @@ struct task_struct {
>  	/* Used by LSM modules for access restriction: */
>  	void				*security;
>  #endif
> +#ifdef CONFIG_MEMBARRIER
> +	int membarrier_sync_core;

Why put this in task_struct rather than mm_struct? If the reclaim of JIT
code is one of the target users, don't you want to make this an
attribute of the address space?

Am I missing something subtle here?

Regards,
Boqun

> +#endif
>  
>  	/*
>  	 * New fields for task_struct should be added above here, so that
> @@ -1623,4 +1626,53 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
>  #define TASK_SIZE_OF(tsk)	TASK_SIZE
>  #endif
>  
> +#ifdef CONFIG_MEMBARRIER
> +static inline void membarrier_fork(struct task_struct *t,
> +		unsigned long clone_flags)
> +{
> +	/*
> +	 * Coherence of membarrier_sync_core against thread fork is
> +	 * protected by siglock. membarrier_fork is called with siglock
> +	 * held.
> +	 */
> +	t->membarrier_sync_core = current->membarrier_sync_core;
> +}
> +static inline void membarrier_execve(struct task_struct *t)
> +{
> +	t->membarrier_sync_core = 0;
> +}
> +static inline void membarrier_sched_out(struct task_struct *t)
> +{
> +	/*
> +	 * Core serialization is performed before the memory barrier
> +	 * preceding the store to rq->curr.
> +	 */
> +	if (unlikely(READ_ONCE(t->membarrier_sync_core)))
> +		sync_core();
> +}
> +static inline void membarrier_sched_in(struct task_struct *t)
> +{
> +	/*
> +	 * Core serialization is performed after the memory barrier
> +	 * following the store to rq->curr.
> +	 */
> +	if (unlikely(READ_ONCE(t->membarrier_sync_core)))
> +		sync_core();
> +}
> +#else
> +static inline void membarrier_fork(struct task_struct *t,
> +		unsigned long clone_flags)
> +{
> +}
> +static inline void membarrier_execve(struct task_struct *t)
> +{
> +}
> +static inline void membarrier_sched_out(struct task_struct *t)
> +{
> +}
> +static inline void membarrier_sched_in(struct task_struct *t)
> +{
> +}
> +#endif
> +
>  #endif
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> index 6d47b3249d8a..933c35ebcc10 100644
> --- a/include/uapi/linux/membarrier.h
> +++ b/include/uapi/linux/membarrier.h
> @@ -56,6 +56,13 @@
>   *                          complete faster than the non-expedited ones,
>   *                          they never block, but have the downside of
>   *                          causing extra overhead.
> + * @MEMBARRIER_CMD_REGISTER_SYNC_CORE:
> + *                          Register the caller process so all of its
> + *                          threads will issue core serialization
> + *                          barriers in addition to memory barriers upon
> + *                          SHARED and PRIVATE barriers targeting
> + *                          its threads issued after this registration
> + *                          returns.
>   *
>   * Command to be passed to the membarrier system call. The commands need to
>   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
> @@ -67,6 +74,7 @@ enum membarrier_cmd {
>  	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
>  	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
>  	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
> +	MEMBARRIER_CMD_REGISTER_SYNC_CORE	= (1 << 4),
>  };
>  
>  #endif /* _UAPI_LINUX_MEMBARRIER_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 17921b0390b4..713b3c932671 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1840,6 +1840,8 @@ static __latent_entropy struct task_struct *copy_process(
>  	 */
>  	copy_seccomp(p);
>  
> +	membarrier_fork(p, clone_flags);
> +
>  	/*
>  	 * Process group and session signals need to be delivered to just the
>  	 * parent before the fork or both the parent and the child after the
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0e36d9960d91..3ca27d066950 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3292,6 +3292,8 @@ static void __sched notrace __schedule(bool preempt)
>  	local_irq_disable();
>  	rcu_note_context_switch(preempt);
>  
> +	membarrier_sched_out(prev);
> +
>  	/*
>  	 * Make sure that signal_pending_state()->signal_pending() below
>  	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> @@ -3364,6 +3366,7 @@ static void __sched notrace __schedule(bool preempt)
>  
>  		/* Also unlocks the rq: */
>  		rq = context_switch(rq, prev, next, &rf);
> +		membarrier_sched_in(next);
>  	} else {
>  		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
>  		rq_unlock_irq(rq, &rf);
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 7eec6914d2d2..f128caf64b49 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -25,12 +25,16 @@
>   * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>   * except MEMBARRIER_CMD_QUERY.
>   */
> -#define MEMBARRIER_CMD_BITMASK	\
> -	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> +#define MEMBARRIER_CMD_BITMASK			\
> +	(MEMBARRIER_CMD_SHARED			\
> +	| MEMBARRIER_CMD_PRIVATE_EXPEDITED	\
> +	| MEMBARRIER_CMD_REGISTER_SYNC_CORE)
>  
>  static void ipi_mb(void *info)
>  {
> -	smp_mb();	/* IPIs should be serializing but paranoid. */
> +	/* IPIs should be serializing but paranoid. */
> +	smp_mb();
> +	sync_core();
>  }
>  
>  static void membarrier_private_expedited(void)
> @@ -96,6 +100,30 @@ static void membarrier_private_expedited(void)
>  	smp_mb();	/* exit from system call is not a mb */
>  }
>  
> +static void membarrier_register_sync_core(void)
> +{
> +	struct task_struct *p = current, *t;
> +
> +	if (get_nr_threads(p) == 1) {
> +		p->membarrier_sync_core = 1;
> +		return;
> +	}
> +
> +	/*
> +	 * Coherence of membarrier_sync_core against thread fork is
> +	 * protected by siglock.
> +	 */
> +	spin_lock(&p->sighand->siglock);
> +	for_each_thread(p, t)
> +		WRITE_ONCE(t->membarrier_sync_core, 1);
> +	spin_unlock(&p->sighand->siglock);
> +	/*
> +	 * Ensure all future scheduler execution will observe the new
> +	 * membarrier_sync_core state for this process.
> +	 */
> +	synchronize_sched();
> +}
> +
>  /**
>   * sys_membarrier - issue memory barriers on a set of threads
>   * @cmd:   Takes command values defined in enum membarrier_cmd.
> @@ -146,6 +174,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>  	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
>  		membarrier_private_expedited();
>  		return 0;
> +	case MEMBARRIER_CMD_REGISTER_SYNC_CORE:
> +		membarrier_register_sync_core();
> +		return 0;
>  	default:
>  		return -EINVAL;
>  	}
> -- 
> 2.11.0
> 



* Re: [PATCH v2] membarrier: provide register sync core cmd
  2017-08-28  3:05   ` Mathieu Desnoyers
@ 2017-08-30  5:01     ` Andy Lutomirski
  2017-08-30 14:46       ` Paul E. McKenney
  2017-08-31 17:00     ` Will Deacon
  1 sibling, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2017-08-30  5:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Peter Zijlstra, linux-kernel, Boqun Feng,
	Andrew Hunter, maged michael, gromer, Avi Kivity,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Dave Watson, Andy Lutomirski, Will Deacon, Hans Boehm

> On Aug 27, 2017, at 8:05 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Aug 27, 2017, at 3:53 PM, Andy Lutomirski luto@amacapital.net wrote:
>
>>> On Aug 27, 2017, at 1:50 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>> wrote:
>>>
>>> Add a new MEMBARRIER_CMD_REGISTER_SYNC_CORE command to the membarrier
>>> system call. It allows processes to register their intent to have their
>>> threads issue core serializing barriers in addition to memory barriers
>>> whenever a membarrier command is performed.
>>>
>>
>> Why is this stateful?  That is, why not just have a new membarrier command to
>> sync every thread's icache?
>
> If we'd do it on every CPU icache, it would be as trivial as you say. The
> concern here is sending IPIs only to CPUs running threads that belong to the
> same process, so we don't disturb unrelated processes.
>
> If we could just grab each CPU's runqueue lock, it would be fairly simple
> to do. But we want to avoid hitting each runqueue with exclusive atomic
> access associated with grabbing the lock. (cache-line bouncing)

Hmm.  Are there really arches where there is no clean implementation
without this hackery?  It seems rather unfortunate that munmap() can be
done efficiently but this barrier can't be.

At the very least, could there be a register command *and* a special
sync command?  I dislike the idea that the sync command does something
different depending on some other state.  Even better (IMO) would be a
design where you ask for an isync and, if the arch can do it
efficiently (x86), you get an efficient isync and, if the arch can't
(arm64?) you take all the rq locks?
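
E.g. something along these lines (purely illustrative; these names are
made up and are not part of this patch):

/* Hypothetical uapi sketch only -- not proposed in this patch. */
enum membarrier_cmd_sketch {
	MEMBARRIER_CMD_REGISTER_SYNC_CORE		= (1 << 4),
	MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE	= (1 << 5),
};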

--Andy


* Re: [PATCH v2] membarrier: provide register sync core cmd
  2017-08-30  5:01     ` Andy Lutomirski
@ 2017-08-30 14:46       ` Paul E. McKenney
  2017-08-30 17:33         ` Mathieu Desnoyers
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2017-08-30 14:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathieu Desnoyers, Peter Zijlstra, linux-kernel, Boqun Feng,
	Andrew Hunter, maged michael, gromer, Avi Kivity,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Dave Watson, Andy Lutomirski, Will Deacon, Hans Boehm

On Tue, Aug 29, 2017 at 10:01:56PM -0700, Andy Lutomirski wrote:
> > On Aug 27, 2017, at 8:05 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> > ----- On Aug 27, 2017, at 3:53 PM, Andy Lutomirski luto@amacapital.net wrote:
> >
> >>> On Aug 27, 2017, at 1:50 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >>> wrote:
> >>>
> >>> Add a new MEMBARRIER_CMD_REGISTER_SYNC_CORE command to the membarrier
> >>> system call. It allows processes to register their intent to have their
> >>> threads issue core serializing barriers in addition to memory barriers
> >>> whenever a membarrier command is performed.
> >>>
> >>
> >> Why is this stateful?  That is, why not just have a new membarrier command to
> >> sync every thread's icache?
> >
> > If we'd do it on every CPU icache, it would be as trivial as you say. The
> > concern here is sending IPIs only to CPUs running threads that belong to the
> > same process, so we don't disturb unrelated processes.
> >
> > If we could just grab each CPU's runqueue lock, it would be fairly simple
> > to do. But we want to avoid hitting each runqueue with exclusive atomic
> > access associated with grabbing the lock. (cache-line bouncing)
> 
> Hmm.  Are there really arches where there is no clean implementation
> > without this hackery?  It seems rather unfortunate that munmap() can be
> done efficiently but this barrier can't be.
> 
> At the very least, could there be a register command *and* a special
> sync command?  I dislike the idea that the sync command does something
> different depending on some other state.  Even better (IMO) would be a
> design where you ask for an isync and, if the arch can do it
> efficiently (x86), you get an efficient isync and, if the arch can't
> (arm64?) you take all the rq locks?

In some cases I suspect that IPIs might be required.  Regardless of
that, we might well need to provide a way for architectures to do
special things.

But I must defer to Mathieu on this.

							Thanx, Paul


* Re: [PATCH v2] membarrier: provide register sync core cmd
  2017-08-30  4:01 ` Boqun Feng
@ 2017-08-30 17:25   ` Mathieu Desnoyers
  0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2017-08-30 17:25 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul E. McKenney, Peter Zijlstra, linux-kernel, Andrew Hunter,
	maged michael, gromer, Avi Kivity, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Dave Watson, Andy Lutomirski,
	Will Deacon, Hans Boehm

----- On Aug 30, 2017, at 12:01 AM, Boqun Feng boqun.feng@gmail.com wrote:

> Hi Mathieu,
> 
> On Sun, Aug 27, 2017 at 01:50:35PM -0700, Mathieu Desnoyers wrote:
>> Add a new MEMBARRIER_CMD_REGISTER_SYNC_CORE command to the membarrier
>> system call. It allows processes to register their intent to have their
>> threads issue core serializing barriers in addition to memory barriers
>> whenever a membarrier command is performed.
>> 
>> It is relevant for the reclaim of JIT code, which requires issuing core
>> serializing barriers on all threads running on behalf of a process
>> after ensuring the old code is no longer visible, before reusing the
>> memory for new code.
>> 
>> When a process returns from a MEMBARRIER_CMD_REGISTER_SYNC_CORE
>> command, it is guaranteed that all subsequent MEMBARRIER_CMD_SHARED and
>> MEMBARRIER_CMD_PRIVATE_EXPEDITED commands issue core serializing
>> barriers, in addition to the memory barriers, on all of its running
>> threads.
>> 
>> * Scheduler Overhead Benchmarks
>> 
>> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>> taskset 01 ./perf bench sched pipe -T
>> Linux v4.13-rc6
>> 
>>                        Avg. usecs/op         Std.Dev. usecs/op
>> Before this change:         2.75                   0.12
>> Non-registered processes:   2.73                   0.08
>> Registered processes:       3.07                   0.02
>> 
>> Changes since v1:
>> - Add missing MEMBARRIER_CMD_REGISTER_SYNC_CORE header documentation,
>> - Add benchmarks to commit message.
>> 
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> CC: Peter Zijlstra <peterz@infradead.org>
>> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> CC: Boqun Feng <boqun.feng@gmail.com>
>> CC: Andrew Hunter <ahh@google.com>
>> CC: Maged Michael <maged.michael@gmail.com>
>> CC: gromer@google.com
>> CC: Avi Kivity <avi@scylladb.com>
>> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> CC: Paul Mackerras <paulus@samba.org>
>> CC: Michael Ellerman <mpe@ellerman.id.au>
>> CC: Dave Watson <davejwatson@fb.com>
>> CC: Andy Lutomirski <luto@kernel.org>
>> CC: Will Deacon <will.deacon@arm.com>
>> CC: Hans Boehm <hboehm@google.com>
>> ---
>>  fs/exec.c                       |  1 +
>>  include/linux/sched.h           | 52 +++++++++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/membarrier.h |  8 +++++++
>>  kernel/fork.c                   |  2 ++
>>  kernel/sched/core.c             |  3 +++
>>  kernel/sched/membarrier.c       | 37 ++++++++++++++++++++++++++---
>>  6 files changed, 100 insertions(+), 3 deletions(-)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 62175cbcc801..a4ab3253bac7 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1794,6 +1794,7 @@ static int do_execveat_common(int fd, struct filename
>> *filename,
>>  	/* execve succeeded */
>>  	current->fs->in_exec = 0;
>>  	current->in_execve = 0;
>> +	membarrier_execve(current);
>>  	acct_update_integrals(current);
>>  	task_numa_free(current);
>>  	free_bprm(bprm);
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 8337e2db0bb2..b1ecdc4e8b84 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1086,6 +1086,9 @@ struct task_struct {
>>  	/* Used by LSM modules for access restriction: */
>>  	void				*security;
>>  #endif
>> +#ifdef CONFIG_MEMBARRIER
>> +	int membarrier_sync_core;
> 
> Why put this in task_struct rather than mm_struct? If the reclaim of JIT
> code is one of the target users, don't you want to make this an
> attribute of the address space?
> 
> Am I missing something subtle here?

Placing this in the task struct rather than in the mm struct saves a
pointer dereference ((task_struct *)->mm) on scheduler fast paths.

Thanks,

Mathieu

> 
> Regards,
> Boqun
> 
>> +#endif
>>  
>>  	/*
>>  	 * New fields for task_struct should be added above here, so that
>> @@ -1623,4 +1626,53 @@ extern long sched_getaffinity(pid_t pid, struct cpumask
>> *mask);
>>  #define TASK_SIZE_OF(tsk)	TASK_SIZE
>>  #endif
>>  
>> +#ifdef CONFIG_MEMBARRIER
>> +static inline void membarrier_fork(struct task_struct *t,
>> +		unsigned long clone_flags)
>> +{
>> +	/*
>> +	 * Coherence of membarrier_sync_core against thread fork is
>> +	 * protected by siglock. membarrier_fork is called with siglock
>> +	 * held.
>> +	 */
>> +	t->membarrier_sync_core = current->membarrier_sync_core;
>> +}
>> +static inline void membarrier_execve(struct task_struct *t)
>> +{
>> +	t->membarrier_sync_core = 0;
>> +}
>> +static inline void membarrier_sched_out(struct task_struct *t)
>> +{
>> +	/*
>> +	 * Core serialization is performed before the memory barrier
>> +	 * preceding the store to rq->curr.
>> +	 */
>> +	if (unlikely(READ_ONCE(t->membarrier_sync_core)))
>> +		sync_core();
>> +}
>> +static inline void membarrier_sched_in(struct task_struct *t)
>> +{
>> +	/*
>> +	 * Core serialization is performed after the memory barrier
>> +	 * following the store to rq->curr.
>> +	 */
>> +	if (unlikely(READ_ONCE(t->membarrier_sync_core)))
>> +		sync_core();
>> +}
>> +#else
>> +static inline void membarrier_fork(struct task_struct *t,
>> +		unsigned long clone_flags)
>> +{
>> +}
>> +static inline void membarrier_execve(struct task_struct *t)
>> +{
>> +}
>> +static inline void membarrier_sched_out(struct task_struct *t)
>> +{
>> +}
>> +static inline void membarrier_sched_in(struct task_struct *t)
>> +{
>> +}
>> +#endif
>> +
>>  #endif
>> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>> index 6d47b3249d8a..933c35ebcc10 100644
>> --- a/include/uapi/linux/membarrier.h
>> +++ b/include/uapi/linux/membarrier.h
>> @@ -56,6 +56,13 @@
>>   *                          complete faster than the non-expedited ones,
>>   *                          they never block, but have the downside of
>>   *                          causing extra overhead.
>> + * @MEMBARRIER_CMD_REGISTER_SYNC_CORE:
>> + *                          Register the caller process so all of its
>> + *                          threads will issue core serialization
>> + *                          barriers in addition to memory barriers upon
>> + *                          SHARED and PRIVATE barriers targeting
>> + *                          its threads issued after this registration
>> + *                          returns.
>>   *
>>   * Command to be passed to the membarrier system call. The commands need to
>>   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>> @@ -67,6 +74,7 @@ enum membarrier_cmd {
>>  	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
>>  	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
>>  	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
>> +	MEMBARRIER_CMD_REGISTER_SYNC_CORE	= (1 << 4),
>>  };
>>  
>>  #endif /* _UAPI_LINUX_MEMBARRIER_H */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 17921b0390b4..713b3c932671 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1840,6 +1840,8 @@ static __latent_entropy struct task_struct *copy_process(
>>  	 */
>>  	copy_seccomp(p);
>>  
>> +	membarrier_fork(p, clone_flags);
>> +
>>  	/*
>>  	 * Process group and session signals need to be delivered to just the
>>  	 * parent before the fork or both the parent and the child after the
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0e36d9960d91..3ca27d066950 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3292,6 +3292,8 @@ static void __sched notrace __schedule(bool preempt)
>>  	local_irq_disable();
>>  	rcu_note_context_switch(preempt);
>>  
>> +	membarrier_sched_out(prev);
>> +
>>  	/*
>>  	 * Make sure that signal_pending_state()->signal_pending() below
>>  	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
>> @@ -3364,6 +3366,7 @@ static void __sched notrace __schedule(bool preempt)
>>  
>>  		/* Also unlocks the rq: */
>>  		rq = context_switch(rq, prev, next, &rf);
>> +		membarrier_sched_in(next);
>>  	} else {
>>  		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
>>  		rq_unlock_irq(rq, &rf);
>> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
>> index 7eec6914d2d2..f128caf64b49 100644
>> --- a/kernel/sched/membarrier.c
>> +++ b/kernel/sched/membarrier.c
>> @@ -25,12 +25,16 @@
>>   * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>>   * except MEMBARRIER_CMD_QUERY.
>>   */
>> -#define MEMBARRIER_CMD_BITMASK	\
>> -	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
>> +#define MEMBARRIER_CMD_BITMASK			\
>> +	(MEMBARRIER_CMD_SHARED			\
>> +	| MEMBARRIER_CMD_PRIVATE_EXPEDITED	\
>> +	| MEMBARRIER_CMD_REGISTER_SYNC_CORE)
>>  
>>  static void ipi_mb(void *info)
>>  {
>> -	smp_mb();	/* IPIs should be serializing but paranoid. */
>> +	/* IPIs should be serializing but paranoid. */
>> +	smp_mb();
>> +	sync_core();
>>  }
>>  
>>  static void membarrier_private_expedited(void)
>> @@ -96,6 +100,30 @@ static void membarrier_private_expedited(void)
>>  	smp_mb();	/* exit from system call is not a mb */
>>  }
>>  
>> +static void membarrier_register_sync_core(void)
>> +{
>> +	struct task_struct *p = current, *t;
>> +
>> +	if (get_nr_threads(p) == 1) {
>> +		p->membarrier_sync_core = 1;
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Coherence of membarrier_sync_core against thread fork is
>> +	 * protected by siglock.
>> +	 */
>> +	spin_lock(&p->sighand->siglock);
>> +	for_each_thread(p, t)
>> +		WRITE_ONCE(t->membarrier_sync_core, 1);
>> +	spin_unlock(&p->sighand->siglock);
>> +	/*
>> +	 * Ensure all future scheduler execution will observe the new
>> +	 * membarrier_sync_core state for this process.
>> +	 */
>> +	synchronize_sched();
>> +}
>> +
>>  /**
>>   * sys_membarrier - issue memory barriers on a set of threads
>>   * @cmd:   Takes command values defined in enum membarrier_cmd.
>> @@ -146,6 +174,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>>  	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
>>  		membarrier_private_expedited();
>>  		return 0;
>> +	case MEMBARRIER_CMD_REGISTER_SYNC_CORE:
>> +		membarrier_register_sync_core();
>> +		return 0;
>>  	default:
>>  		return -EINVAL;
>>  	}
>> --
>> 2.11.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH v2] membarrier: provide register sync core cmd
  2017-08-30 14:46       ` Paul E. McKenney
@ 2017-08-30 17:33         ` Mathieu Desnoyers
  0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2017-08-30 17:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andy Lutomirski, Peter Zijlstra, linux-kernel, Boqun Feng,
	Andrew Hunter, maged michael, gromer, Avi Kivity,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Dave Watson, Andy Lutomirski, Will Deacon, Hans Boehm

----- On Aug 30, 2017, at 10:46 AM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:

> On Tue, Aug 29, 2017 at 10:01:56PM -0700, Andy Lutomirski wrote:
>> > On Aug 27, 2017, at 8:05 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> > wrote:
>> >
>> > ----- On Aug 27, 2017, at 3:53 PM, Andy Lutomirski luto@amacapital.net wrote:
>> >
>> >>> On Aug 27, 2017, at 1:50 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> >>> wrote:
>> >>>
>> >>> Add a new MEMBARRIER_CMD_REGISTER_SYNC_CORE command to the membarrier
>> >>> system call. It allows processes to register their intent to have their
>> >>> threads issue core serializing barriers in addition to memory barriers
>> >>> whenever a membarrier command is performed.
>> >>>
>> >>
>> >> Why is this stateful?  That is, why not just have a new membarrier command to
>> >> sync every thread's icache?
>> >
>> > If we'd do it on every CPU icache, it would be as trivial as you say. The
>> > concern here is sending IPIs only to CPUs running threads that belong to the
>> > same process, so we don't disturb unrelated processes.
>> >
>> > If we could just grab each CPU's runqueue lock, it would be fairly simple
>> > to do. But we want to avoid hitting each runqueue with exclusive atomic
>> > access associated with grabbing the lock. (cache-line bouncing)
>> 
>> Hmm.  Are there really arches where there is no clean implementation
>> without this hacker?  It seems rather unfortunate that munmap() can be
>> done efficiently but this barrier can't be.
>> 
>> At the very least, could there be a register command *and* a special
>> sync command?  I dislike the idea that the sync command does something
>> different depending on some other state.  Even better (IMO) would be a
>> design where you ask for an isync and, if the arch can do it
>> efficiently (x86), you get an efficient isync and, if the arch can't
>> (arm64?) you take all the rq locks?
> 
> In some cases I suspect that IPIs might be required.  Regardless of
> that, we might well need to provide a way for architectures to do
> special things.
> 
> But I must defer to Mathieu on this.

Yes, I think you are both correct. It's better to expose a new command
for code sync, so architectures have more freedom in how they choose to
implement it.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH v2] membarrier: provide register sync core cmd
  2017-08-28  3:05   ` Mathieu Desnoyers
  2017-08-30  5:01     ` Andy Lutomirski
@ 2017-08-31 17:00     ` Will Deacon
  1 sibling, 0 replies; 9+ messages in thread
From: Will Deacon @ 2017-08-31 17:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, Paul E. McKenney, Peter Zijlstra, linux-kernel,
	Boqun Feng, Andrew Hunter, maged michael, gromer, Avi Kivity,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Dave Watson, Andy Lutomirski, Hans Boehm

On Mon, Aug 28, 2017 at 03:05:46AM +0000, Mathieu Desnoyers wrote:
> ----- On Aug 27, 2017, at 3:53 PM, Andy Lutomirski luto@amacapital.net wrote:
> 
> >> On Aug 27, 2017, at 1:50 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >> wrote:
> >> 
> >> Add a new MEMBARRIER_CMD_REGISTER_SYNC_CORE command to the membarrier
> >> system call. It allows processes to register their intent to have their
> >> threads issue core serializing barriers in addition to memory barriers
> >> whenever a membarrier command is performed.
> >> 
> > 
> > Why is this stateful?  That is, why not just have a new membarrier command to
> > sync every thread's icache?
> 
> If we'd do it on every CPU icache, it would be as trivial as you say. The
> concern here is sending IPIs only to CPUs running threads that belong to the
> same process, so we don't disturb unrelated processes.
> 
> If we could just grab each CPU's runqueue lock, it would be fairly simple
> to do. But we want to avoid hitting each runqueue with exclusive atomic
> access associated with grabbing the lock. (cache-line bouncing)

I'm still trying to get my head around this for arm64, where we have the
following properties:

  * Return to userspace is context-synchronizing
  * We have a heavy barrier in switch_to

so it would seem to me that we could avoid taking RQ locks if the mm_cpumask
was kept up to date. The problematic case is where a CPU is not observed in
the mask (maybe the write is buffered), but it is running in userspace.
However, that can't occur with the barrier in switch_to.

So we only need to IPI those CPUs that were in userspace for this task
at the point when the syscall was made, and the mm_cpumask should reflect
that.
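
Something along these lines is what I have in mind (a rough sketch
only, assuming mm_cpumask stays precise; ipi_mb() is the
smp_mb() + sync_core() handler from this patch):

static void membarrier_sync_core_mm_cpumask_sketch(void)
{
	/* IPI only the CPUs tracked for the caller's mm. */
	on_each_cpu_mask(mm_cpumask(current->mm), ipi_mb, NULL, 1);
}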

What am I missing?

Will

