From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>,
	tj@kernel.org, mingo@redhat.com, linux-kernel@vger.kernel.org,
	der.herr@hofr.at, dave@stgolabs.net, riel@redhat.com,
	viro@ZenIV.linux.org.uk, torvalds@linux-foundation.org
Subject: Re: [RFC][PATCH 12/13] stop_machine: Remove lglock
Date: Tue, 23 Jun 2015 10:30:38 -0700	[thread overview]
Message-ID: <20150623173038.GJ3892@linux.vnet.ibm.com> (raw)
In-Reply-To: <20150623130826.GG18673@twins.programming.kicks-ass.net>

On Tue, Jun 23, 2015 at 03:08:26PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2015 at 01:20:41PM +0200, Peter Zijlstra wrote:
> > Paul, why does this use stop_machine anyway? I seemed to remember you
> > sending resched IPIs around.

It used to, but someone submitted a patch long ago that switched it
to try_stop_cpus().  At that time, RCU didn't unconditionally do the
dyntick-idle thing for CONFIG_NO_HZ=n kernels, so try_stop_cpus() was
quite a bit simpler.

That said, I do use your new-age resched-IPI API in other cases.

> > The rcu_sched_qs() thing would set passed_quiesce, which you can then
> > collect to gauge progress.
> > 
> > Shooting IPIs around is bad enough, but running a full blown
> > stop_machine is really blunt and heavy.
> 
> Is there anything obviously amiss with the below? It does stop_one_cpu()
> in a loop instead of the multi cpu stop_machine and is therefore much
> friendlier (albeit still heavier than bare resched IPIs) since the CPUs
> do not have to go and sync up.
> 
> After all, all we're really interested in is that each CPU has
> scheduled at least once; we do not care about the cross-CPU syncup.

This was on my list.  I was thinking of using smp_call_function_single()
combined with polling in order to avoid the double context switch, but
the approach below is of course simpler.  I was intending to fix
up the rest of RCU's relationship with CPU hotplug first, as this would
allow fully covering the incoming and outgoing code paths.

But perhaps a bit too simple.  A few comments below...

							Thanx, Paul

> ---
>  include/linux/stop_machine.h |  7 ----
>  kernel/rcu/tree.c            | 99 +++++---------------------------------------
>  kernel/stop_machine.c        | 30 --------------
>  3 files changed, 10 insertions(+), 126 deletions(-)
> 
> diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
> index d2abbdb8c6aa..f992da7ee492 100644
> --- a/include/linux/stop_machine.h
> +++ b/include/linux/stop_machine.h
> @@ -32,7 +32,6 @@ int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *
>  void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
>  			 struct cpu_stop_work *work_buf);
>  int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
> -int try_stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
> 
>  #else	/* CONFIG_SMP */
> 
> @@ -83,12 +82,6 @@ static inline int stop_cpus(const struct cpumask *cpumask,
>  	return -ENOENT;
>  }
> 
> -static inline int try_stop_cpus(const struct cpumask *cpumask,
> -				cpu_stop_fn_t fn, void *arg)
> -{
> -	return stop_cpus(cpumask, fn, arg);
> -}
> -
>  #endif	/* CONFIG_SMP */
> 
>  /*
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index add042926a66..4a8cde155dce 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3257,7 +3257,7 @@ static int synchronize_sched_expedited_cpu_stop(void *data)
>  {
>  	/*
>  	 * There must be a full memory barrier on each affected CPU
> -	 * between the time that try_stop_cpus() is called and the
> +	 * between the time that stop_one_cpu() is called and the
>  	 * time that it returns.
>  	 *
>  	 * In the current initial implementation of cpu_stop, the
> @@ -3291,25 +3291,12 @@ static int synchronize_sched_expedited_cpu_stop(void *data)
>   * grace period.  We are then done, so we use atomic_cmpxchg() to
>   * update sync_sched_expedited_done to match our snapshot -- but
>   * only if someone else has not already advanced past our snapshot.
> - *
> - * On the other hand, if try_stop_cpus() fails, we check the value
> - * of sync_sched_expedited_done.  If it has advanced past our
> - * initial snapshot, then someone else must have forced a grace period
> - * some time after we took our snapshot.  In this case, our work is
> - * done for us, and we can simply return.  Otherwise, we try again,
> - * but keep our initial snapshot for purposes of checking for someone
> - * doing our work for us.
> - *
> - * If we fail too many times in a row, we fall back to synchronize_sched().
>   */
>  void synchronize_sched_expedited(void)
>  {
> -	cpumask_var_t cm;
> -	bool cma = false;
> -	int cpu;
> -	long firstsnap, s, snap;
> -	int trycount = 0;
>  	struct rcu_state *rsp = &rcu_sched_state;
> +	long s, snap;
> +	int cpu;
> 
>  	/*
>  	 * If we are in danger of counter wrap, just do synchronize_sched().
> @@ -3332,7 +3319,6 @@ void synchronize_sched_expedited(void)
>  	 * full memory barrier.
>  	 */
>  	snap = atomic_long_inc_return(&rsp->expedited_start);
> -	firstsnap = snap;

Hmmm...

>  	if (!try_get_online_cpus()) {
>  		/* CPU hotplug operation in flight, fall back to normal GP. */
>  		wait_rcu_gp(call_rcu_sched);
> @@ -3341,82 +3327,17 @@ void synchronize_sched_expedited(void)
>  	}
>  	WARN_ON_ONCE(cpu_is_offline(raw_smp_processor_id()));
> 
> -	/* Offline CPUs, idle CPUs, and any CPU we run on are quiescent. */
> -	cma = zalloc_cpumask_var(&cm, GFP_KERNEL);
> -	if (cma) {
> -		cpumask_copy(cm, cpu_online_mask);
> -		cpumask_clear_cpu(raw_smp_processor_id(), cm);
> -		for_each_cpu(cpu, cm) {
> -			struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> -
> -			if (!(atomic_add_return(0, &rdtp->dynticks) & 0x1))
> -				cpumask_clear_cpu(cpu, cm);
> -		}
> -		if (cpumask_weight(cm) == 0)
> -			goto all_cpus_idle;
> -	}

Good, you don't need this because you can check the dynticks state later.
You will, however, need to check for offline CPUs.

If you had lots of CPUs coming and going, you could argue that tracking
them would help, but synchronize_sched_expedited() should run fast enough
that there isn't time for CPUs to come or go, at least in the common case.
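
For illustration only (untested, and assuming the loop might one day run
without the hotplug lock held throughout), that check could be a simple
recheck just ahead of the stop_one_cpu():

	for_each_online_cpu(cpu) {
		/* An offline CPU is quiescent by definition, so a CPU
		 * that went away after we sampled the online mask can
		 * simply be skipped rather than stopped. */
		if (cpu_is_offline(cpu))
			continue;

		/* ... dynticks check and stop_one_cpu() as in the patch ... */
	}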

> -	/*
> -	 * Each pass through the following loop attempts to force a
> -	 * context switch on each CPU.
> -	 */
> -	while (try_stop_cpus(cma ? cm : cpu_online_mask,
> -			     synchronize_sched_expedited_cpu_stop,
> -			     NULL) == -EAGAIN) {
> -		put_online_cpus();
> -		atomic_long_inc(&rsp->expedited_tryfail);
> -
> -		/* Check to see if someone else did our work for us. */
> -		s = atomic_long_read(&rsp->expedited_done);
> -		if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) {
> -			/* ensure test happens before caller kfree */
> -			smp_mb__before_atomic(); /* ^^^ */
> -			atomic_long_inc(&rsp->expedited_workdone1);
> -			free_cpumask_var(cm);
> -			return;

Here you lose batching.  Yeah, I know that synchronize_sched_expedited()
is -supposed- to be used sparingly, but it is not cool for the kernel
to melt down just because some creative user found a way to heat up a
code path.  We need a mutex_trylock() combined with a counter, plus a
check for whether others have already done the needed work.
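
For concreteness, an untested sketch of that check, reusing the
->expedited_done counter from the code being deleted below (the helper
name is mine, purely for illustration):

/* Has somebody else's expedited grace period covered our snapshot? */
static bool sync_sched_exp_done(struct rcu_state *rsp, long snap)
{
	long s = atomic_long_read(&rsp->expedited_done);

	if (ULONG_CMP_GE((ulong)s, (ulong)snap)) {
		/* Ensure test happens before caller kfree. */
		smp_mb__before_atomic(); /* ^^^ */
		atomic_long_inc(&rsp->expedited_workdone1);
		return true;
	}
	return false;
}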

> -		}
> -
> -		/* No joy, try again later.  Or just synchronize_sched(). */
> -		if (trycount++ < 10) {
> -			udelay(trycount * num_online_cpus());
> -		} else {
> -			wait_rcu_gp(call_rcu_sched);
> -			atomic_long_inc(&rsp->expedited_normal);
> -			free_cpumask_var(cm);
> -			return;
> -		}

And we still need to be able to drop back to synchronize_sched()
(AKA wait_rcu_gp(call_rcu_sched) in this case) in case we have both a
creative user and a long-running RCU-sched read-side critical section.
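
Putting that check together with a trylock, roughly (untested;
->expedited_mutex is a field I am assuming here for illustration, and
the retry limit is just the old one):

	int trycount = 0;
	long snap;

	snap = atomic_long_inc_return(&rsp->expedited_start);

	while (!mutex_trylock(&rsp->expedited_mutex)) {
		if (sync_sched_exp_done(rsp, snap))
			return;	/* Someone else did our work for us. */
		if (trycount++ < 10) {
			udelay(trycount * num_online_cpus());
		} else {
			/* Creative user and/or long-running reader,
			 * so fall back to a normal grace period. */
			wait_rcu_gp(call_rcu_sched);
			atomic_long_inc(&rsp->expedited_normal);
			return;
		}
		if (sync_sched_exp_done(rsp, snap))
			return;	/* Recheck after the delay. */
	}

	/* ... stop_one_cpu() loop as in the patch, then advance
	 *     ->expedited_done and drop the mutex ... */
	mutex_unlock(&rsp->expedited_mutex);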

> +	for_each_online_cpu(cpu) {
> +		struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> 
> -		/* Recheck to see if someone else did our work for us. */
> -		s = atomic_long_read(&rsp->expedited_done);
> -		if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) {
> -			/* ensure test happens before caller kfree */
> -			smp_mb__before_atomic(); /* ^^^ */
> -			atomic_long_inc(&rsp->expedited_workdone2);
> -			free_cpumask_var(cm);
> -			return;
> -		}
> +		/* Offline CPUs, idle CPUs, and any CPU we run on are quiescent. */
> +		if (!(atomic_add_return(0, &rdtp->dynticks) & 0x1))
> +			continue;

Let's see...  This does work for idle CPUs and for nohz_full CPUs running
in userspace.

It does not work for the current CPU, though, so we need an additional
check against raw_smp_processor_id(), which is easy enough to add.
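
Something like this, right next to the dynticks test (per the patch's
own comment, any CPU we run on is quiescent):

		/* Any CPU we are running on is quiescent. */
		if (cpu == raw_smp_processor_id())
			continue;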

There has always been a race window involving CPU hotplug.  My recent
CPU_DYING_IDLE change allows things to be exact on the outgoing side,
and I need to make a similar change on the incoming side.  There will
continue to be a window where RCU needs to pay attention to the CPU,
but neither IPIs nor scheduling works, and I guess I just do a timed
wait in that case.  Rare race anyway, so should be fine.

> -		/*
> -		 * Refetching sync_sched_expedited_started allows later
> -		 * callers to piggyback on our grace period.  We retry
> -		 * after they started, so our grace period works for them,
> -		 * and they started after our first try, so their grace
> -		 * period works for us.
> -		 */
> -		if (!try_get_online_cpus()) {
> -			/* CPU hotplug operation in flight, use normal GP. */
> -			wait_rcu_gp(call_rcu_sched);
> -			atomic_long_inc(&rsp->expedited_normal);
> -			free_cpumask_var(cm);
> -			return;
> -		}
> -		snap = atomic_long_read(&rsp->expedited_start);
> -		smp_mb(); /* ensure read is before try_stop_cpus(). */
> +		stop_one_cpu(cpu, synchronize_sched_expedited_cpu_stop, NULL);

My thought was to use smp_call_function_single() and to have the called
function recheck the dyntick-idle state, skipping set_tsk_need_resched()
if the CPU turns out to be idle.  This would result in a single pass
through schedule() instead
of stop_one_cpu()'s double context switch.  It would likely also require
some rework of rcu_note_context_switch(), which stop_one_cpu() avoids
the need for.
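
The sort of handler I have in mind, as an untested sketch (the polling
side and the rcu_note_context_switch() rework are hand-waved away):

/* IPI handler: if this CPU was not already quiescent, make it become so. */
static void synchronize_sched_expedited_ipi(void *info)
{
	/* If we interrupted the idle loop, this CPU was dyntick-idle and
	 * hence already quiescent.  (The dynticks counter itself reads
	 * as non-idle here because the IPI's rcu_irq_enter() has already
	 * bumped it, so use the interrupted task as a proxy.) */
	if (is_idle_task(current))
		return;

	/* Otherwise, force a pass through schedule(), which is a
	 * quiescent state for RCU-sched. */
	set_tsk_need_resched(current);
	set_preempt_need_resched();
}

The caller would then smp_call_function_single() this at each CPU that
is not obviously quiescent and poll for the resulting quiescent states,
giving a single trip through schedule() instead of stop_one_cpu()'s two
context switches.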

>  	}
> -	atomic_long_inc(&rsp->expedited_stoppedcpus);
> 
> -all_cpus_idle:
> -	free_cpumask_var(cm);
> +	atomic_long_inc(&rsp->expedited_stoppedcpus);
> 
>  	/*
>  	 * Everyone up to our most recent fetch is covered by our grace
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index fd643d8c4b42..b1329a213503 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -371,36 +371,6 @@ int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg)
>  	return ret;
>  }
> 
> -/**
> - * try_stop_cpus - try to stop multiple cpus
> - * @cpumask: cpus to stop
> - * @fn: function to execute
> - * @arg: argument to @fn
> - *
> - * Identical to stop_cpus() except that it fails with -EAGAIN if
> - * someone else is already using the facility.
> - *
> - * CONTEXT:
> - * Might sleep.
> - *
> - * RETURNS:
> - * -EAGAIN if someone else is already stopping cpus, -ENOENT if
> - * @fn(@arg) was not executed at all because all cpus in @cpumask were
> - * offline; otherwise, 0 if all executions of @fn returned 0, any non
> - * zero return value if any returned non zero.
> - */
> -int try_stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg)
> -{
> -	int ret;
> -
> -	/* static works are used, process one request at a time */
> -	if (!mutex_trylock(&stop_cpus_mutex))
> -		return -EAGAIN;
> -	ret = __stop_cpus(cpumask, fn, arg);
> -	mutex_unlock(&stop_cpus_mutex);
> -	return ret;
> -}
> -
>  static int cpu_stop_should_run(unsigned int cpu)
>  {
>  	struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
> 

