linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>,
	Ingo Molnar <mingo@kernel.org>,
	linux-kernel@vger.kernel.org, Oleg Nesterov <oleg@redhat.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Russell King - ARM Linux admin <linux@armlinux.org.uk>,
	Chris Metcalf <cmetcalf@ezchip.com>,
	Christoph Lameter <cl@linux.com>, Kirill Tkhai <tkhai@yandex.ru>,
	Mike Galbraith <efault@gmx.de>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load
Date: Fri, 6 Sep 2019 10:23:05 +0200	[thread overview]
Message-ID: <20190906082305.GU2349@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <20190906031300.1647-5-mathieu.desnoyers@efficios.com>

On Thu, Sep 05, 2019 at 11:13:00PM -0400, Mathieu Desnoyers wrote:

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6a7a1083b6fb..7020572eb605 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -382,6 +382,9 @@ struct mm_struct {
>  		unsigned long task_size;	/* size of task vm space */
>  		unsigned long highest_vm_end;	/* highest vma end address */
>  		pgd_t * pgd;
> +#ifdef CONFIG_MEMBARRIER

Stick in a comment, on why here. To be close to data already used by
switch_mm().

> +		atomic_t membarrier_state;
> +#endif
>  
>  		/**
>  		 * @mm_users: The number of users including userspace.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 010d578118d6..1cffc1aa403c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3038,6 +3038,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  	perf_event_task_sched_out(prev, next);
>  	rseq_preempt(prev);
>  	fire_sched_out_preempt_notifiers(prev, next);
> +	membarrier_prepare_task_switch(rq, prev, next);

This had me confused for a while, because I initially thought we'd only
do this for switch_mm(), but you're made it agressive and track kernel
threads too.

I think we can do that slightly different. See below...

>  	prepare_task(next);
>  	prepare_arch_switch(next);
>  }

> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 7e0a0d6535f3..5744c300d29e 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -30,6 +30,28 @@ static void ipi_mb(void *info)

> +void membarrier_execve(struct task_struct *t)
> +{
> +	atomic_set(&t->mm->membarrier_state, 0);
> +	WRITE_ONCE(this_rq()->membarrier_state, 0);

It is the callsite of this one that had me puzzled and confused. I
think it works by accident more than anything else.

You see; I thought the rules were that we'd change it near/before
switch_mm(), and this is quite a way _after_.

I think it might be best to place the call in exec_mmap(), right before
activate_mm().

But that then had me wonder about the membarrier_prepate_task_switch()
thing...

> +/*
> + * The scheduler provides memory barriers required by membarrier between:
> + * - prior user-space memory accesses and store to rq->membarrier_state,
> + * - store to rq->membarrier_state and following user-space memory accesses.
> + * In the same way it provides those guarantees around store to rq->curr.
> + */
> +static inline void membarrier_prepare_task_switch(struct rq *rq,
> +						  struct task_struct *prev,
> +						  struct task_struct *next)
> +{
> +	int membarrier_state = 0;
> +	struct mm_struct *next_mm = next->mm;
> +
> +	if (prev->mm == next_mm)
> +		return;
> +	if (next_mm)
> +		membarrier_state = atomic_read(&next_mm->membarrier_state);
> +	if (READ_ONCE(rq->membarrier_state) != membarrier_state)
> +		WRITE_ONCE(rq->membarrier_state, membarrier_state);
> +}

So if you make the above something like:

static inline void
membarrier_switch_mm(struct rq *rq, struct mm_struct *prev_mm, struct mm_struct *next_mm)
{
	int membarrier_state;

	if (prev_mm == next_mm)
		return;

	membarrier_state = atomic_read(&next_mm->membarrier_state);
	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
		return;

	WRITE_ONCE(rq->membarrier_state, membarrier_state);
}

And put it right in front of switch_mm() in context_switch() then we'll
deal with kernel on the other side, like so:

> @@ -70,16 +90,13 @@ static int membarrier_global_expedited(void)
>  		if (cpu == raw_smp_processor_id())
>  			continue;
>  
> -		rcu_read_lock();
> -		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> -		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
> -				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
> +		if (READ_ONCE(cpu_rq(cpu)->membarrier_state) &
> +		    MEMBARRIER_STATE_GLOBAL_EXPEDITED) {

		p = rcu_dereference(rq->curr);
		if ((READ_ONCE(cpu_rq(cpu)->membarrier_state) & MEMBARRIER_STATE_GLOBAL_EXPEDITED) &&
		    !(p->flags & PF_KTHREAD))

>  			if (!fallback)
>  				__cpumask_set_cpu(cpu, tmpmask);
>  			else
>  				smp_call_function_single(cpu, ipi_mb, NULL, 1);
>  		}
> -		rcu_read_unlock();
>  	}
>  	if (!fallback) {
>  		preempt_disable();

does that make sense?

(also, I hate how long all these membarrier names are)

  reply	other threads:[~2019-09-06  8:24 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-06  3:12 [RFC PATCH 0/4] Membarrier fixes/cleanups Mathieu Desnoyers
2019-09-06  3:12 ` [RFC PATCH 1/4] Fix: sched/membarrier: private expedited registration check Mathieu Desnoyers
2019-09-06  3:12 ` [RFC PATCH 2/4] Cleanup: sched/membarrier: remove redundant check Mathieu Desnoyers
2019-09-06  3:12 ` [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm Mathieu Desnoyers
2019-09-06  7:41   ` Peter Zijlstra
2019-09-06 13:40     ` Mathieu Desnoyers
2019-09-06  3:13 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load Mathieu Desnoyers
2019-09-06  8:23   ` Peter Zijlstra [this message]
2019-09-08 13:49     ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) Mathieu Desnoyers
2019-09-08 16:51       ` Linus Torvalds
2019-09-10  9:48         ` Mathieu Desnoyers
2019-09-12 13:48           ` Will Deacon
2019-09-12 14:24             ` Linus Torvalds
2019-09-12 15:47               ` Will Deacon
2019-09-13 14:22                 ` Mathieu Desnoyers
2019-09-19 16:26                   ` Will Deacon
2019-09-19 17:33                     ` Mathieu Desnoyers
2019-09-09 11:00       ` Oleg Nesterov
2019-09-13 15:20         ` Mathieu Desnoyers
2019-09-13 16:04           ` Oleg Nesterov
2019-09-13 17:07             ` Mathieu Desnoyers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190906082305.GU2349@hirez.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=cl@linux.com \
    --cc=cmetcalf@ezchip.com \
    --cc=ebiederm@xmission.com \
    --cc=efault@gmx.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@armlinux.org.uk \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@kernel.org \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.ibm.com \
    --cc=tglx@linutronix.de \
    --cc=tkhai@yandex.ru \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).