RCU Archive on lore.kernel.org
From: "Paul E. McKenney" <paulmck@kernel.org>
To: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Dave Chinner <david@fromorbit.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	rcu@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com, mingo@kernel.org, jiangshanlai@gmail.com,
	dipankar@in.ibm.com, mathieu.desnoyers@efficios.com,
	josh@joshtriplett.org, tglx@linutronix.de, peterz@infradead.org,
	rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com,
	fweisbec@gmail.com, oleg@redhat.com, joel@joelfernandes.org,
	viro@zeniv.linux.org.uk
Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode
Date: Wed, 13 May 2020 05:52:24 -0700
Message-ID: <20200513125224.GP2869@paulmck-ThinkPad-P72>
In-Reply-To: <1705751d-0be0-1f7f-116b-e5aea44e8efb@yandex-team.ru>

On Wed, May 13, 2020 at 07:35:25AM +0300, Konstantin Khlebnikov wrote:
> On 13/05/2020 06.18, Paul E. McKenney wrote:
> > On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote:
> > > On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote:
> > > > On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote:
> > > > > On 08/05/2020 17.46, Paul E. McKenney wrote:
> > > > > > Easy for me to provide "start fast and inefficient mode" and "stop fast
> > > > > > and inefficient mode" APIs for MM to call!
> > > > > > 
> > > > > > How about rcu_mempressure_start() and rcu_mempressure_end()?  I would
> > > > > > expect them not to nest (as in if you need them to nest, please let
> > > > > > me know).  I would not expect these to be invoked all that often (as in
> > > > > > if you do need them to be fast and scalable, please let me know).
> > > > > > 
> > > > > > RCU would then be in fast/inefficient mode if either MM told it to be
> > > > > > or if RCU had detected callback overload on at least one CPU.
> > > > > > 
> > > > > > Seem reasonable?
> > > > > 
> > > > > Not exactly nested calls, but kswapd threads are per numa node.
> > > > > So, at some level nodes under pressure must be counted.
> > > > 
> > > > Easy enough, especially given that RCU already "counts" CPUs having
> > > > excessive numbers of callbacks.  But assuming that the transitions to/from
> > > > OOM are rare, I would start by just counting them with a global counter.
> > > > If the counter is non-zero, RCU is in fast and inefficient mode.
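
For concreteness, here is a minimal userspace sketch of that global
counter, not kernel code.  The sketch_ prefix marks every name as
hypothetical, and C11 atomics stand in for whatever the kernel
implementation would actually use.  The cb_overload argument models
RCU's existing per-CPU callback-overload detection.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Number of memory-pressure sources (for example, kswapd instances) active. */
static atomic_int sketch_mempressure_count;

/* Hypothetical stand-in for the proposed rcu_mempressure_start(). */
static void sketch_rcu_mempressure_start(void)
{
	atomic_fetch_add(&sketch_mempressure_count, 1);
}

/* Hypothetical stand-in for the proposed rcu_mempressure_end(). */
static void sketch_rcu_mempressure_end(void)
{
	int old = atomic_fetch_sub(&sketch_mempressure_count, 1);

	assert(old > 0);	/* Each end must pair with an earlier start. */
}

/*
 * Fast/inefficient mode is wanted if MM has asked for it (non-zero
 * counter) or if RCU itself has seen callback overload on some CPU.
 */
static bool sketch_rcu_fast_mode(bool cb_overload)
{
	return cb_overload || atomic_load(&sketch_mempressure_count) != 0;
}
```

With this shape, two NUMA nodes under pressure each call start once,
and RCU leaves fast mode only after both have called end.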
> > > > 
> > > > > Also forcing rcu calls only for cpus in one numa node might be useful.
> > > > 
> > > > Interesting.  RCU currently evaluates a given CPU by comparing the
> > > > number of callbacks against a fixed cutoff that can be set at boot using
> > > > rcutree.qhimark, which defaults to 10,000.  When this cutoff is exceeded,
> > > > RCU becomes more aggressive about invoking callbacks on that CPU, for
> > > > example, by sacrificing some degree of real-time response.  I believe
> > > > that this heuristic would also serve the OOM use case well.
> > > 
> > > So one of the things that I'm not sure people have connected here is
> > > that memory reclaim done by shrinkers is one of the things that
> > > drives huge numbers of call_rcu() callbacks to free memory via rcu.
> > > If we are reclaiming dentries and inodes, then we can be pushing
> > > thousands to hundreds of thousands of objects into kfree_rcu()
> > > and/or direct call_rcu() calls to free these objects in a single
> > > reclaim pass.
> > 
> > Good point!
> 
> Indeed
> 
> > 
> > > Hence the trigger for RCU going into "excessive callback" mode
> > > might, in fact, be kswapd running a pass over the shrinkers. i.e.
> > > memory reclaim itself can be responsible for pushing RCU into this "OOM
> > > pressure" situation.
> > > 
> > > So perhaps we've missed a trick here by not having the memory
> > > reclaim routines trigger RCU callbacks at the end of a priority
> > > scan. The shrinkers have queued the objects for freeing, but they
> > > haven't actually been freed yet and so things like slab pages
> > > haven't actually been returned to the free pool even though the
> > > shrinkers have said "freed this many objects"...
> > > 
> > > i.e. perhaps the right solution here is a "rcu_run_callbacks()"
> > > function that memory reclaim calls before backing off and/or winding
> > > up reclaim priority.
> > 
> > It would not be hard to make something that put RCU into fast/inefficient
> > mode for a couple of grace periods.  I will also look into the possibility
> > of speeding up callback invocation.
> > 
> > It might also make sense to put RCU grace periods into fast mode while
> > running the shrinkers that are freeing dentries and inodes.  However,
> > kbuild test robot reports ugly regressions when putting RCU into
> > fast/inefficient mode too quickly and too often.  As in 78.5% degradation
> > on one of the benchmarks.
> 
> I think fast/inefficient mode here is just an optimization for freeing
> memory faster. It doesn't solve the problem itself.
> 
> First we have to close the loop in the reclaimer and actually wait for or
> run rcu callbacks which might free memory before increasing priority and
> invoking the OOM killer.

That is easy, just invoke rcu_barrier(), which will block until all
prior call_rcu()/kfree_rcu() callbacks have been invoked.
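
As a rough illustration, the following userspace model shows the shape
of such a check.  It is only a sketch: the model_ functions are
hypothetical stand-ins for call_rcu()/kfree_rcu(), rcu_oom_count() and
rcu_barrier(), the threshold is an assumed tunable, and the "barrier"
here simply drains counters instead of waiting for a grace period.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4

/* Pending callbacks per CPU, standing in for RCU's per-CPU callback lists. */
static unsigned long model_cb_pending[MODEL_NR_CPUS];
/* Objects actually returned to the allocator so far. */
static unsigned long model_objs_freed;

/* Stand-in for call_rcu()/kfree_rcu(): queue one deferred free on @cpu. */
static void model_call_rcu(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_cb_pending[cpu]++;
}

/* Stand-in for rcu_oom_count(): total callbacks queued on all CPUs. */
static unsigned long model_rcu_oom_count(void)
{
	unsigned long sum = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += model_cb_pending[cpu];
	return sum;
}

/* Stand-in for rcu_barrier(): returns once all prior callbacks have run. */
static void model_rcu_barrier(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		model_objs_freed += model_cb_pending[cpu];
		model_cb_pending[cpu] = 0;
	}
}

/*
 * Hypothetical hook for reclaim to call before raising priority or
 * invoking the OOM killer.  Returns the number of objects freed by waiting.
 */
static unsigned long model_reclaim_wait_for_rcu(unsigned long threshold)
{
	unsigned long backlog = model_rcu_oom_count();

	if (backlog < threshold)
		return 0;	/* Backlog too small to be worth blocking for. */
	model_rcu_barrier();
	return backlog;
}
```

The threshold matters because rcu_barrier() blocks the caller, so
reclaim would presumably want to pay that cost only when the backlog
is large relative to the memory it is trying to recover.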

> > > > > I wonder if direct-reclaim should at some stage simply wait for RCU QS.
> > > > > I.e. call rcu_barrier() or similar somewhere before invoking OOM.
> > > > 
> > > > The rcu_oom_count() function in the patch starting this thread returns the
> > > > total number of outstanding callbacks queued on all CPUs.  So one approach
> > > > would be to invoke this function, and if the return value was truly
> > > > huge (taking size of memory and who knows what all else into account),
> > > > do the rcu_barrier() to wait for RCU to clear its current backlog.
> > > 
> > > The shrinker scan control structure has a node mask in it to
> > > indicate what node (and hence CPUs) it should be reclaiming from.
> > > This information comes from the main reclaim scan routine, so it
> > > would be trivial to feed straight into the RCU code to have it
> > > act on just the CPUs/node that we are reclaiming memory from...
> > 
> > For the callbacks, RCU can operate on CPUs, in theory anyway.  The
> > grace period itself, however, is inherently global.
> > 
> > > > On the NUMA point, it would be dead easy for me to supply a function
> > > > that returned the number of callbacks on a given CPU, which would allow
> > > > you to similarly evaluate a NUMA node, a cgroup, or whatever.
> > > 
> > > I'd think it runs the other way around - we optimistically call the
> > > RCU layer to do cleanup, and the RCU layer decides if there's enough
> > > queued callbacks on the cpus/node to run callbacks immediately. It
> > > would even be provided with the scan priority to indicate the level
> > > of desperation memory reclaim is under....
> > 
> > Easy for RCU to count the number of callbacks.  That said, it has no
> > idea which callbacks are which.  Perhaps kfree_rcu() could gather that
> > information from the slab allocator, though.
> 
> It's simple to mark slab shrinkers that free objects through RCU and
> count the freed objects in the reclaimer:
> 
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -536,6 +536,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>         else
>                 new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
> 
> +       if (shrinker->flags & SHRINKER_KFREE_RCU)
> +               shrinkctl->nr_kfree_rcu += freed;
> +
>         trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
>         return freed;
>  }
> 
> And when enough has accumulated, do some synchronization.
> 
> It's probably better to sum freed objects in a per-cpu variable,
> and accumulate size rather than count.

RCU currently has no notion of size outside of possibly kfree_rcu(),
so that would be new information to RCU.
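
One way the reclaim side could track size on its own is sketched below
as a userspace model.  All of the sketch_ names are hypothetical, a
plain array stands in for real per-CPU variables, and the byte limit is
an assumed tunable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4

/* Bytes handed to RCU for deferred freeing, one slot per CPU. */
static size_t sketch_rcu_bytes[SKETCH_NR_CPUS];

/* Record that a shrinker on @cpu queued @nr objects of @obj_size bytes. */
static void sketch_account_rcu_free(int cpu, size_t nr, size_t obj_size)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	sketch_rcu_bytes[cpu] += nr * obj_size;
}

/* Sum the per-CPU slots; intended for the slow path only. */
static size_t sketch_rcu_bytes_total(void)
{
	size_t sum = 0;

	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sum += sketch_rcu_bytes[cpu];
	return sum;
}

/*
 * Return true, and reset the accounting, once at least @limit_bytes
 * are waiting on a grace period.  The caller would then synchronize.
 */
static bool sketch_should_sync(size_t limit_bytes)
{
	if (sketch_rcu_bytes_total() < limit_bytes)
		return false;
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sketch_rcu_bytes[cpu] = 0;
	return true;
}
```

Counting bytes rather than objects lets a thousand small dentries and a
handful of large objects be weighed on the same scale.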

							Thanx, Paul

> > > > > All GFP_NOFAIL users should allow direct-reclaim, thus this loop
> > > > > in page_alloc shouldn't block RCU and doesn't need special care.
> > > > 
> > > > I must defer to you guys on this.  The main caution is the duration of
> > > > direct reclaim.  After all, if it is too long, the kfree_rcu() instance
> > > > would have been better off just invoking synchronize_rcu().
> > > 
> > > Individual callers of kfree_rcu() have no idea of the load on RCU,
> > > nor how long direct reclaim is taking. Calling synchronize_rcu()
> > > incorrectly has pretty major downsides to it, so nobody should be
> > > trying to expedite kfree_rcu() unless there is a good reason to do
> > > so (e.g. at unmount to ensure everything allocated by a filesystem
> > > has actually been freed). Hence I'd much prefer the decision to
> > > expedite callbacks is made by the RCU subsystem based on its known
> > > callback load and some indication of how close memory reclaim is to
> > > declaring OOM...
> > 
> > Sorry, I was unclear.  There is a new single-argument kfree_rcu() under
> > way that does not require an rcu_head in the structure being freed.
> > However, in this case, kfree_rcu() might either allocate the memory
> > that is needed to track the memory to be freed on the one hand or just
> > invoke synchronize_rcu() on the other.  So this decision would be taken
> > inside kfree_rcu(), and not be visible to either core RCU or the caller
> > of kfree_rcu().
> > 
> > This decision is made based on whether or not the allocator provides
> > kfree_rcu() the memory needed.  The tradeoff is what GFP flags are
> > supplied.  So the question kfree_rcu() has to answer is "Would it be
> > better to give myself to reclaim as an additional task, or would it
> > instead be better to just invoke synchronize_rcu() and then immediately
> > free()?"
> > 
> > I am probably still unclear, but hopefully at least one step in the
> > right direction.
> > 
> > 							Thanx, Paul
> > 

