All of lore.kernel.org
 help / color / mirror / Atom feed
From: Leonardo Bras <leobras@redhat.com>
To: Leonardo Bras <leobras@redhat.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>,
	Sean Christopherson <seanjc@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Frederic Weisbecker <frederic@kernel.org>,
	Neeraj Upadhyay <quic_neeraju@quicinc.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	Josh Triplett <josh@joshtriplett.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Zqiang <qiang.zhang1211@gmail.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	rcu@vger.kernel.org
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu
Date: Fri, 10 May 2024 18:15:14 -0300	[thread overview]
Message-ID: <Zj6OYvivXF7tUIqV@LeoBras> (raw)
In-Reply-To: <Zj56kVxuTJm4EsAn@LeoBras>

On Fri, May 10, 2024 at 04:50:41PM -0300, Leonardo Bras wrote:
> On Fri, May 10, 2024 at 10:41:53AM -0700, Paul E. McKenney wrote:
> > On Fri, May 10, 2024 at 02:12:32PM -0300, Leonardo Bras wrote:
> > > On Fri, May 10, 2024 at 09:21:59AM -0700, Paul E. McKenney wrote:
> > > > On Fri, May 10, 2024 at 01:06:40PM -0300, Leonardo Bras wrote:
> > > > > On Thu, May 09, 2024 at 04:45:53PM -0700, Paul E. McKenney wrote:
> > > > > > On Thu, May 09, 2024 at 07:14:18AM -0300, Leonardo Bras wrote:
> > > > > > > On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> > > > > > 
> > > > > > [ . . . ]
> > > > > > 
> > > > > > > > Here I suppose something like this can take care of not needing to convert 
> > > > > > > > ms -> jiffies every rcu_pending():
> > > > > > > > 
> > > > > > > > +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > > > > > 
> > > > > > > 
> > > > > > > Uh, there is more to it, actually. We need to make sure the user 
> > > > > > > understands that we are rounding-down the value to multiple of a jiffy 
> > > > > > > period, so it's not a surprise if the delay value is not exactly the same 
> > > > > > > as the passed on kernel cmdline.
> > > > > > > 
> > > > > > > So something like bellow diff should be ok, as this behavior is explained 
> > > > > > > in the docs, and pr_info() will print the effective value.
> > > > > > > 
> > > > > > > What do you think?
> > > > > > 
> > > > > > Good point, and I have taken your advice on making the documentation
> > > > > > say what it does.
> > > > > 
> > > > > Thanks :)
> > > > > 
> > > > > > 
> > > > > > > Thanks!
> > > > > > > Leo
> > > > > > > 
> > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > > > > index 0a3b0fd1910e..9a50be9fd9eb 100644
> > > > > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > > > > @@ -4974,20 +4974,28 @@
> > > > > > >                         otherwise be caused by callback floods through
> > > > > > >                         use of the ->nocb_bypass list.  However, in the
> > > > > > >                         common non-flooded case, RCU queues directly to
> > > > > > >                         the main ->cblist in order to avoid the extra
> > > > > > >                         overhead of the ->nocb_bypass list and its lock.
> > > > > > >                         But if there are too many callbacks queued during
> > > > > > >                         a single jiffy, RCU pre-queues the callbacks into
> > > > > > >                         the ->nocb_bypass queue.  The definition of "too
> > > > > > >                         many" is supplied by this kernel boot parameter.
> > > > > > >  
> > > > > > > +       rcutree.nocb_patience_delay= [KNL]
> > > > > > > +                       On callback-offloaded (rcu_nocbs) CPUs, avoid
> > > > > > > +                       disturbing RCU unless the grace period has
> > > > > > > +                       reached the specified age in milliseconds.
> > > > > > > +                       Defaults to zero.  Large values will be capped
> > > > > > > +                       at five seconds. Values rounded-down to a multiple
> > > > > > > +                       of a jiffy period.
> > > > > > > +
> > > > > > >         rcutree.qhimark= [KNL]
> > > > > > >                         Set threshold of queued RCU callbacks beyond which
> > > > > > >                         batch limiting is disabled.
> > > > > > >  
> > > > > > >         rcutree.qlowmark= [KNL]
> > > > > > >                         Set threshold of queued RCU callbacks below which
> > > > > > >                         batch limiting is re-enabled.
> > > > > > >  
> > > > > > >         rcutree.qovld= [KNL]
> > > > > > >                         Set threshold of queued RCU callbacks beyond which
> > > > > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > > > > > index fcf2b4aa3441..62ede401420f 100644
> > > > > > > --- a/kernel/rcu/tree.h
> > > > > > > +++ b/kernel/rcu/tree.h
> > > > > > > @@ -512,20 +512,21 @@ do {                                                              \
> > > > > > >         local_irq_save(flags);                                  \
> > > > > > >         if (rcu_segcblist_is_offloaded(&(rdp)->cblist)) \
> > > > > > >                 raw_spin_lock(&(rdp)->nocb_lock);               \
> > > > > > >  } while (0)
> > > > > > >  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
> > > > > > >  #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
> > > > > > >  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> > > > > > >  
> > > > > > >  static void rcu_bind_gp_kthread(void);
> > > > > > >  static bool rcu_nohz_full_cpu(void);
> > > > > > > +static bool rcu_on_patience_delay(void);
> > > > > > 
> > > > > > I don't think we need an access function, but will check below.
> > > > > > 
> > > > > > >  /* Forward declarations for tree_stall.h */
> > > > > > >  static void record_gp_stall_check_time(void);
> > > > > > >  static void rcu_iw_handler(struct irq_work *iwp);
> > > > > > >  static void check_cpu_stall(struct rcu_data *rdp);
> > > > > > >  static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > > > >                                      const unsigned long gpssdelay);
> > > > > > >  
> > > > > > >  /* Forward declarations for tree_exp.h. */
> > > > > > >  static void sync_rcu_do_polled_gp(struct work_struct *wp);
> > > > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > > > index 340bbefe5f65..639243b0410f 100644
> > > > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > > > @@ -5,20 +5,21 @@
> > > > > > >   * or preemptible semantics.
> > > > > > >   *
> > > > > > >   * Copyright Red Hat, 2009
> > > > > > >   * Copyright IBM Corporation, 2009
> > > > > > >   *
> > > > > > >   * Author: Ingo Molnar <mingo@elte.hu>
> > > > > > >   *        Paul E. McKenney <paulmck@linux.ibm.com>
> > > > > > >   */
> > > > > > >  
> > > > > > >  #include "../locking/rtmutex_common.h"
> > > > > > > +#include <linux/jiffies.h>
> > > > > > 
> > > > > > This is already pulled in by the enclosing tree.c file, so it should not
> > > > > > be necessary to include it again. 
> > > > > 
> > > > > Even better :)
> > > > > 
> > > > > > (Or did you get a build failure when
> > > > > > leaving this out?)
> > > > > 
> > > > > I didn't, it's just that my editor complained the symbols were not getting 
> > > > > properly resolved, so I included it and it was fixed. But since clangd is 
> > > > > know to make some mistakes, I should have compile-test'd before adding it.
> > > > 
> > > > Ah, got it!  ;-)
> > > > 
> > > > > > >  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
> > > > > > >  {
> > > > > > >         /*
> > > > > > >          * In order to read the offloaded state of an rdp in a safe
> > > > > > >          * and stable way and prevent from its value to be changed
> > > > > > >          * under us, we must either hold the barrier mutex, the cpu
> > > > > > >          * hotplug lock (read or write) or the nocb lock. Local
> > > > > > >          * non-preemptible reads are also safe. NOCB kthreads and
> > > > > > >          * timers have their own means of synchronization against the
> > > > > > > @@ -86,20 +87,33 @@ static void __init rcu_bootup_announce_oddness(void)
> > > > > > >         if (rcu_kick_kthreads)
> > > > > > >                 pr_info("\tKick kthreads if too-long grace period.\n");
> > > > > > >         if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
> > > > > > >                 pr_info("\tRCU callback double-/use-after-free debug is enabled.\n");
> > > > > > >         if (gp_preinit_delay)
> > > > > > >                 pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
> > > > > > >         if (gp_init_delay)
> > > > > > >                 pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> > > > > > >         if (gp_cleanup_delay)
> > > > > > >                 pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > > > > > > +       if (nocb_patience_delay < 0) {
> > > > > > > +               pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n",
> > > > > > > +                       nocb_patience_delay);
> > > > > > > +               nocb_patience_delay = 0;
> > > > > > > +       } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > > > > > > +               pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n",
> > > > > > > +                       nocb_patience_delay, 5 * MSEC_PER_SEC);
> > > > > > > +               nocb_patience_delay = msecs_to_jiffies(5 * MSEC_PER_SEC);
> > > > > > > +       } else if (nocb_patience_delay) {
> > > > > > > +               nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > > > > +               pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n",
> > > > > > > +                       jiffies_to_msecs(nocb_patience_delay);
> > > > > > > +       }
> > > > > > 
> > > > > > I just did this here at the end:
> > > > > > 
> > > > > > 	nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
> > > > > > 
> > > > > > Ah, you are wanting to print out the milliseconds after the rounding
> > > > > > to jiffies.
> > > > > 
> > > > > That's right, just to make sure the user gets the effective patience time, 
> > > > > instead of the before-rounding one, which was on input.
> > > > > 
> > > > > > I am going to hold off on that for the moment, but I hear your request
> > > > > > and I have not yet said "no".  ;-)
> > > > > 
> > > > > Sure :)
> > > > > It's just something I think it's nice to have (as a user).
> > > > 
> > > > If you would like to do a separate patch adding this, here are the
> > > > requirements:
> > > > 
> > > > o	If the current code prints nothing, nothing additional should
> > > > 	be printed.
> > > > 
> > > > o	If the rounding ended up with the same value (as it should in
> > > > 	systems with HZ=1000), nothing additional should be printed.
> > > > 
> > > > o	Your choice as to whether or not you want to print out the
> > > > 	jiffies value.
> > > > 
> > > > o	If the additional message is on a new line, it needs to be
> > > > 	indented so that it is clear that it is subordinate to the
> > > > 	previous message.
> > > > 
> > > > 	Otherwise, you can use pr_cont() to continue the previous
> > > > 	line, of course being careful about "\n".
> > > > 
> > > > Probably also something that I am forgetting, but that is most of it.
> > > 
> > > Thanks!
> > > I will work on a patch doing that :)
> > 
> > Very good, looking forward to seeing what you come up with!
> > 
> > My current state is on the "dev" branch of the -rcu tree, so please base
> > on that.
> 
> Thanks! I used it earlier to send the previous diff :)
> 
> > 
> > > > > > >         if (!use_softirq)
> > > > > > >                 pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> > > > > > >         if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> > > > > > >                 pr_info("\tRCU debug extended QS entry/exit.\n");
> > > > > > >         rcupdate_announce_bootup_oddness();
> > > > > > >  }
> > > > > > >  
> > > > > > >  #ifdef CONFIG_PREEMPT_RCU
> > > > > > >  
> > > > > > >  static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > > > > > @@ -1260,10 +1274,29 @@ static bool rcu_nohz_full_cpu(void)
> > > > > > >  
> > > > > > >  /*
> > > > > > >   * Bind the RCU grace-period kthreads to the housekeeping CPU.
> > > > > > >   */
> > > > > > >  static void rcu_bind_gp_kthread(void)
> > > > > > >  {
> > > > > > >         if (!tick_nohz_full_enabled())
> > > > > > >                 return;
> > > > > > >         housekeeping_affine(current, HK_TYPE_RCU);
> > > > > > >  }
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Is this CPU a NO_HZ_FULL CPU that should ignore RCU if the time since the
> > > > > > > + * start of current grace period is smaller than nocb_patience_delay ?
> > > > > > > + *
> > > > > > > + * This code relies on the fact that all NO_HZ_FULL CPUs are also
> > > > > > > + * RCU_NOCB_CPU CPUs.
> > > > > > > + */
> > > > > > > +static bool rcu_on_patience_delay(void)
> > > > > > > +{
> > > > > > > +#ifdef CONFIG_NO_HZ_FULL
> > > > > > 
> > > > > > You lost me on this one.  Why do we need the #ifdef instead of
> > > > > > IS_ENABLED()?  Also, please note that rcu_nohz_full_cpu() is already a
> > > > > > compile-time @false in CONFIG_NO_HZ_FULL=n kernels.
> > > > > 
> > > > > You are right. rcu_nohz_full_cpu() has a high chance of being inlined on
> > > > > 	if ((...) && rcu_nohz_full_cpu())
> > > > > And since it returns false, this whole statement will be compiled out, and 
> > > > > the new function will not exist in CONFIG_NO_HZ_FULL=n, so there  is no 
> > > > > need to test it.
> > > > 
> > > > Very good!  You had me going there for a bit.  ;-)
> > > > 
> > > > > > > +       if (!nocb_patience_delay)
> > > > > > > +               return false;
> > > > > > 
> > > > > > We get this automatically with the comparison below, right?
> > > > > 
> > > > > Right
> > > > > 
> > > > > >   If so, we
> > > > > > are not gaining much by creating the helper function.  Or am I missing
> > > > > > some trick here?
> > > > > 
> > > > > Well, it's a fastpath. Up to here, we just need to read 
> > > > > nocb_patience_delay{,_jiffies} from memory.
> > > > 
> > > > Just nocb_patience_delay_jiffies, correct?  Unless I am missing something,
> > > > nocb_patience_delay is unused after boot.
> > > 
> > > Right, I used both because I was referring to the older version and the 
> > > current version with _jiffies.
> > 
> > Fair enough!
> > 
> > > > > If we don't include the fastpath we have to read jiffies and 
> > > > > rcu_state.gp_start, which can take extra time: up to 2 cache misses.
> > > > > 
> > > > > I thought it could be relevant, as we reduce the overhead of the new 
> > > > > parameter when it's disabled (patience=0). 
> > > > > 
> > > > > Do you think that could be relevant?
> > > > 
> > > > Well, the hardware's opinion is what matters.  ;-)
> > > > 
> > > > But the caller's code path reads jiffies a few times, so it should
> > > > be hot in the cache, correct?
> > > 
> > > Right, but I wonder how are the chances of it getting updated between  
> > > caller's use and this function's. Same for gp_start.
> > 
> > Well, jiffies is updated at most once per millisecond, and gp_start is
> > updated at most once per few milliseconds.  So the chances of it being
> > updated within that code sequence are quite small.
> 
> Fair enough, and we probably don't need to worry about it getting 
> cached-out in this sequence, as well. 
> 
> Also time_before() is a macro and we don't need to worry on the function 
> call, so we just spend 2 extra L1-cache reads and a couple arithmetic 
> instructions which are not supposed to take long, so it's fair to assume 
> the fast-path would not be that much faster than the slow path, which means 
> we don't need a fast path after all.
> 
> Thanks for helping me notice that :)
> 
> > 
> > > > But that does lead to another topic, namely the possibility of tagging
> > > > nocb_patience_delay_jiffies with __read_mostly. 
> > > 
> > > Oh, right. This was supposed to be in the diff I sent earlier, but I 
> > > completelly forgot to change before sending. So, yeah, I agree on 
> > > nocb_patience_delay being __read_mostly; 
> > > 
> > > > And there might be
> > > > a number of other of RCU's variables that could be similarly tagged
> > > > in order to avoid false sharing.  (But is there any false sharing?
> > > > This might be worth testing.)
> > > 
> > > Maybe there isn't, but I wonder if it would hurt performance if they were 
> > > tagged as __read_only anyway. 
> > 
> > Let's be at least a little careful here.  It is just as easy to hurt
> > performance by marking things __read_mostly or __read_only as it is
> > to help performance.  ;-)
> 
> Fair enough :)
> 
> > 
> > 							Thanx, Paul
> > 
> 

Oh, btw, for what it's worth:
Reviewed-by: Leonardo Bras <leobras@redhat.com>

Thanks!
Leo


  reply	other threads:[~2024-05-10 21:15 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-28 17:19 [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu Leonardo Bras
2024-03-28 17:19 ` [RFC PATCH v1 1/2] kvm: Implement guest_exit_last_time() Leonardo Bras
2024-03-28 17:19 ` [RFC PATCH v1 2/2] rcu: Ignore RCU in nohz_full cpus if it was running a guest recently Leonardo Bras
2024-04-01 15:52   ` Paul E. McKenney
2024-04-01 20:21 ` [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu Sean Christopherson
2024-04-05 13:45   ` Marcelo Tosatti
2024-04-05 14:42     ` Sean Christopherson
2024-04-06  0:03       ` Paul E. McKenney
2024-04-08 17:16         ` Sean Christopherson
2024-04-08 18:42           ` Paul E. McKenney
2024-04-08 20:06             ` Sean Christopherson
2024-04-08 21:02               ` Paul E. McKenney
2024-04-08 21:56                 ` Sean Christopherson
2024-04-08 22:35                   ` Paul E. McKenney
2024-04-08 23:06                     ` Sean Christopherson
2024-04-08 23:20                       ` Paul E. McKenney
2024-04-10  2:39           ` Marcelo Tosatti
2024-04-15 19:47           ` Marcelo Tosatti
2024-04-15 21:29             ` Sean Christopherson
2024-04-16 12:36               ` Marcelo Tosatti
2024-04-16 14:07                 ` Sean Christopherson
2024-04-17 16:14                   ` Marcelo Tosatti
2024-04-17 17:22                     ` Sean Christopherson
2024-05-03 20:44                       ` Leonardo Bras
2024-05-06 18:47                         ` Marcelo Tosatti
2024-05-07 18:05                           ` Sean Christopherson
2024-05-07 22:36                             ` Leonardo Bras
2024-05-03 18:42   ` Leonardo Bras
2024-05-03 19:09     ` Leonardo Bras
2024-05-03 21:29     ` Sean Christopherson
2024-05-03 22:00       ` Leonardo Bras
2024-05-03 22:00       ` Paul E. McKenney
2024-05-07 17:55         ` Sean Christopherson
2024-05-07 19:15           ` Paul E. McKenney
2024-05-07 21:00             ` Sean Christopherson
2024-05-07 21:37               ` Paul E. McKenney
2024-05-07 23:47                 ` Sean Christopherson
2024-05-08  0:08                   ` Sean Christopherson
2024-05-08  2:51                     ` Leonardo Bras
2024-05-08  3:22                       ` Paul E. McKenney
2024-05-08  6:19                         ` Leonardo Bras
2024-05-08 14:01                           ` Sean Christopherson
2024-05-09  3:32                             ` Paul E. McKenney
2024-05-09  8:16                               ` Leonardo Bras
2024-05-09 10:14                                 ` Leonardo Bras
2024-05-09 23:45                                   ` Paul E. McKenney
2024-05-10 16:06                                     ` Leonardo Bras
2024-05-10 16:21                                       ` Paul E. McKenney
2024-05-10 17:12                                         ` Leonardo Bras
2024-05-10 17:41                                           ` Paul E. McKenney
2024-05-10 19:50                                             ` Leonardo Bras
2024-05-10 21:15                                               ` Leonardo Bras [this message]
2024-05-10 21:38                                                 ` Paul E. McKenney
2024-05-09 22:41                                 ` Paul E. McKenney
2024-05-09 23:07                                   ` Leonardo Bras Soares Passos
2024-05-11  2:08                             ` Leonardo Bras
2024-05-08  3:20                     ` Paul E. McKenney
2024-05-08  4:04                       ` Paul E. McKenney
2024-05-08 14:36                         ` Paul E. McKenney
2024-05-08 15:35                       ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zj6OYvivXF7tUIqV@LeoBras \
    --to=leobras@redhat.com \
    --cc=boqun.feng@gmail.com \
    --cc=frederic@kernel.org \
    --cc=jiangshanlai@gmail.com \
    --cc=joel@joelfernandes.org \
    --cc=josh@joshtriplett.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mtosatti@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=qiang.zhang1211@gmail.com \
    --cc=quic_neeraju@quicinc.com \
    --cc=rcu@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=seanjc@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.