* Re: Improving RCU wait in mini_qdisc_pair_swap()
From: Paul E. McKenney @ 2021-10-20 22:44 UTC
  To: Seth Forshee; +Cc: rcu

On Wed, Oct 20, 2021 at 04:22:52PM -0500, Seth Forshee wrote:
> Hi Paul,

Hello, Seth!

Adding the rcu list on CC in case others are seeing similar issues.

> You got some questions the other day from (I believe) Mathieu Desnoyers,
> who a colleague of mine had reached out to about this problem. Thanks a
> lot for your suggestions! I was headed in the right direction, but I
> wouldn't have got there as quickly, and I'm not sure I would have known
> to use start_poll_synchronize_rcu(). I do have a question though, so I
> thought I'd reach out to you about it while I continue to test the
> patch.
> 
> To refresh your memory, or fill in bits that might not have made it to
> you: we're seeing a lot of time spent in rcu_barrier() from
> mini_qdisc_pair_swap() when we are trying to (quickly) restore a bunch
> of network connections for VMs. I'm trying to make this go faster.
> 
> My patch is pasted below, which speeds things up drastically, by a
> factor of 10,000 or so in most cases. My question is about initializing
> mini_Qdisc.rcu_state. On my first pass I took a cautious approach and
> initialized it from get_state_synchronize_rcu() when the structure is
> first initialized. That patch didn't really help. I suspected this was
> because the time between init and calling mini_qdisc_pair_swap() was
> short enough that a grace period had not elapsed, so in the patch below
> rcu_state is initialized to 0 (because the memory is
> kzalloc-ed). I think this should be safe, based on my understanding of
> how the cookie works and the fact that no other task yet has a reference
> to either buffer. Is this correct?

Let me see if I understand your reasoning.  Right after creating the
mini Qdisc, someone might be accessing the current buffer, but no one
ever was accessing the non-current buffer.  So on the initial swap
(and -only- on the initial swap), there are no readers to wait for.
Is that your line of reasoning?  If so, sort-of OK, but let's talk about
more organized ways of handling this.
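
If so, your patch presumably looks roughly like the sketch below
(reconstructed from your prose above, not your actual code):

	/* Both buffers are kzalloc-ed, so ->rcu_state starts out as 0,
	 * which the patch relies on reading as an already-expired cookie.
	 * Before reusing a buffer:
	 */
	cond_synchronize_rcu(miniq->rcu_state);

	/* ... publish miniq as the active buffer ... */

	/* As the old buffer goes inactive, record "now" so that a later
	 * swap waits only if a full grace period has not yet elapsed:
	 */
	miniq_old->rcu_state = start_poll_synchronize_rcu();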

However, another approach is something like this:

	oldstate = start_poll_synchronize_rcu();

	// do something.

	if (!poll_state_synchronize_rcu(oldstate))
		synchronize_rcu_expedited();

This will be roughly an order of magnitude faster than you would see with
your current cond_synchronize_rcu().  And it will speed up all swaps,
not just the first one.  On the other hand, synchronize_rcu_expedited()
sends IPIs to all non-idle CPUs, which is often OK, but not so much for
heavy real-time workloads.  So you can let the heavy real-time workloads
have slowish Qdisc switching and IPI the IPI-insensitive workloads:

	oldstate = start_poll_synchronize_rcu();

	// do something.

	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		cond_synchronize_rcu(oldstate);
	else if (!poll_state_synchronize_rcu(oldstate))
		synchronize_rcu_expedited();

Yet another approach is this:

	oldstate = start_poll_synchronize_rcu();

	// do something.

	while (!poll_state_synchronize_rcu(oldstate))
		schedule_timeout_idle(1);

The advantage of this one is that you get credit for partial progress
through grace periods, but it won't be anywhere near as fast as
synchronize_rcu_expedited().

							Thanx, Paul

* Re: Improving RCU wait in mini_qdisc_pair_swap()
From: Seth Forshee @ 2021-10-21 14:15 UTC
  To: Paul E. McKenney; +Cc: rcu

On Wed, Oct 20, 2021 at 03:44:30PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 20, 2021 at 04:22:52PM -0500, Seth Forshee wrote:
> > Hi Paul,
> 
> Hello, Seth!
> 
> Adding the rcu list on CC in case others are seeing similar issues.
> 
> > You got some questions the other day from (I believe) Mathieu Desnoyers,
> > who a colleague of mine had reached out to about this problem. Thanks a
> > lot for your suggestions! I was headed in the right direction, but I
> > wouldn't have got there as quickly, and I'm not sure I would have known
> > to use start_poll_synchronize_rcu(). I do have a question though, so I
> > thought I'd reach out to you about it while I continue to test the
> > patch.
> > 
> > To refresh your memory, or fill in bits that might not have made it to
> > you: we're seeing a lot of time spent in rcu_barrier() from
> > mini_qdisc_pair_swap() when we are trying to (quickly) restore a bunch
> > of network connections for VMs. I'm trying to make this go faster.
> > 
> > My patch is pasted below, which speeds things up drastically, by a
> > factor of 10,000 or so in most cases. My question is about initializing
> > mini_Qdisc.rcu_state. On my first pass I took a cautious approach and
> > initialized it from get_state_synchronize_rcu() when the structure is
> > first initialized. That patch didn't really help. I suspected this was
> > because the time between init and calling mini_qdisc_pair_swap() was
> > short enough that a grace period had not elapsed, so in the patch below
> > rcu_state is initialized to 0 (because the memory is
> > kzalloc-ed). I think this should be safe, based on my understanding of
> > how the cookie works and the fact that no other task yet has a reference
> > to either buffer. Is this correct?
> 
> Let me see if I understand your reasoning.  Right after creating the
> mini Qdisc, someone might be accessing the current buffer, but no one
> ever was accessing the non-current buffer.  So on the initial swap
> (and -only- on the initial swap), there are no readers to wait for.
> Is that your line of reasoning?  If so, sort-of OK, but let's talk about
> more organized ways of handling this.

That's almost right, just missing a couple of details. Initially neither
buffer (let's call them miniq1 and miniq2 for clarity) is active, and
*p_miniq is NULL until the first mini_qdisc_pair_swap() call, so there
can be no readers for either. On the first swap miniq1 becomes active,
and there are no readers to wait for. On the second swap there are no
readers of miniq2 to wait for, and at this time miniq1->rcu_state will
be set. On the third swap we might finally have readers of miniq1 to
wait for.
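
A compact trace of that, using the same names (illustrative only, not
the actual code):

	/* init:   *p_miniq == NULL; miniq1/miniq2 rcu_state == 0 (kzalloc)
	 * swap 1: waits on miniq1->rcu_state == 0: no readers ever existed
	 * swap 2: waits on miniq2->rcu_state == 0: no readers ever existed;
	 *         miniq1->rcu_state is recorded as miniq1 goes inactive
	 * swap 3: waits on miniq1->rcu_state: real readers are possible now
	 */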

Also, this function is a little different from the way your examples are
structured in that oldstate is saved at the end of the function, and the
wait on oldstate happens in a subsequent call to the function. So it's
more like:

        oldstate = start_poll_synchronize_rcu();
        return;

        // Some unknown amount of time passes ...

        cond_synchronize_rcu(oldstate);

So I suspect that it wouldn't be uncommon for enough time to have passed
that there are no readers to wait for.

But that brings us to an important point. I'm not a networking expert, I
don't know what common patterns are for these swaps, so I'm probably
ill-equipped to say what makes the most sense beyond the specific
scenario I'm trying to improve. These swaps will happen when setting up
QoS on a network interface, so if I had to guess I'd think the most
common pattern would be short bursts of activity when interfaces are
set up. I know the pattern in the use case I've been debugging, which
sees a lot of "first swaps" for different instances of mini_Qdisc_pair,
and a smaller number of "second swaps" about 200 ms after the first,
which means we're paying a high price for the rcu_barrier() when there
can be no readers. Other scenarios might result in multiple swaps on the
same instance within a short time period, but I can't say for sure.

> However, another approach is something like this:
> 
> 	oldstate = start_poll_synchronize_rcu();
> 
> 	// do something.
> 
> 	if (!poll_state_synchronize_rcu(oldstate))
> 		synchronize_rcu_expedited();
> 
> This will be roughly an order of magnitude faster than you would see with
> your current cond_synchronize_rcu().  And it will speed up all swaps,
> not just the first one.  On the other hand, synchronize_rcu_expedited()
> sends IPIs to all non-idle CPUs, which is often OK, but not so much for
> heavy real-time workloads.
> 
> So you can let the heavy real-time workloads
> have slowish Qdisc switching and IPI the IPI-insensitive workloads:
> 
> 	oldstate = start_poll_synchronize_rcu();
> 
> 	// do something.
> 
> 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> 		cond_synchronize_rcu(oldstate);
> 	else if (!poll_state_synchronize_rcu(oldstate))
> 		synchronize_rcu_expedited();

I was concerned about the system-wide impact synchronize_rcu_expedited()
has, but this at least avoids doing it if it's unnecessary. If it's bad
form to leave mini_Qdisc.rcu_state initialized to 0, this option is
preferable from my perspective.

> 
> Yet another approach is this:
> 
> 	oldstate = start_poll_synchronize_rcu();
> 
> 	// do something.
> 
> 	while (!poll_state_synchronize_rcu(oldstate))
> 		schedule_timeout_idle(1);
> 
> The advantage of this one is that you get credit for partial progress
> through grace periods, but it won't be anywhere near as fast as
> synchronize_rcu_expedited().

If we need to initialize mini_Qdisc.rcu_state up front this approach
probably won't improve things much for my problem.

Thanks,
Seth

* Re: Improving RCU wait in mini_qdisc_pair_swap()
From: Paul E. McKenney @ 2021-10-21 17:55 UTC
  To: Seth Forshee; +Cc: rcu

On Thu, Oct 21, 2021 at 09:15:22AM -0500, Seth Forshee wrote:
> On Wed, Oct 20, 2021 at 03:44:30PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 20, 2021 at 04:22:52PM -0500, Seth Forshee wrote:
> > > Hi Paul,
> > 
> > Hello, Seth!
> > 
> > Adding the rcu list on CC in case others are seeing similar issues.
> > 
> > > You got some questions the other day from (I believe) Mathieu Desnoyers,
> > > who a colleague of mine had reached out to about this problem. Thanks a
> > > lot for your suggestions! I was headed in the right direction, but I
> > > wouldn't have got there as quickly, and I'm not sure I would have known
> > > to use start_poll_synchronize_rcu(). I do have a question though, so I
> > > thought I'd reach out to you about it while I continue to test the
> > > patch.
> > > 
> > > To refresh your memory, or fill in bits that might not have made it to
> > > you: we're seeing a lot of time spent in rcu_barrier() from
> > > mini_qdisc_pair_swap() when we are trying to (quickly) restore a bunch
> > > of network connections for VMs. I'm trying to make this go faster.
> > > 
> > > My patch is pasted below, which speeds things up drastically, by a
> > > factor of 10,000 or so in most cases. My question is about initializing
> > > mini_Qdisc.rcu_state. On my first pass I took a cautious approach and
> > > initialized it from get_state_synchronize_rcu() when the structure is
> > > first initialized. That patch didn't really help. I suspected this was
> > > because the time between init and calling mini_qdisc_pair_swap() was
> > > short enough that a grace period had not elapsed, so in the patch below
> > > rcu_state is initialized to 0 (because the memory is
> > > kzalloc-ed). I think this should be safe, based on my understanding of
> > > how the cookie works and the fact that no other task yet has a reference
> > > to either buffer. Is this correct?
> > 
> > Let me see if I understand your reasoning.  Right after creating the
> > mini Qdisc, someone might be accessing the current buffer, but no one
> > ever was accessing the non-current buffer.  So on the initial swap
> > (and -only- on the initial swap), there are no readers to wait for.
> > Is that your line of reasoning?  If so, sort-of OK, but let's talk about
> > more organized ways of handling this.
> 
> That's almost right, just missing a couple of details. Initially neither
> buffer (let's call them miniq1 and miniq2 for clarity) is active, and
> *p_miniq is NULL until the first mini_qdisc_pair_swap() call, so there
> can be no readers for either. On the first swap miniq1 becomes active,
> and there are no readers to wait for. On the second swap there are no
> readers of miniq2 to wait for, and at this time miniq1->rcu_state will
> be set. On the third swap we might finally have readers of miniq1 to
> wait for.
> 
> Also, this function is a little different from the way your examples are
> structured in that oldstate is saved at the end of the function, and the
> wait on oldstate happens in a subsequent call to the function. So it's
> more like:
> 
>         oldstate = start_poll_synchronize_rcu();
>         return;
> 
>         // Some unknown amount of time passes ...
> 
>         cond_synchronize_rcu(oldstate);
> 
> So I suspect that it wouldn't be uncommon for enough time to have passed
> that there are no readers to wait for.
> 
> But that brings us to an important point. I'm not a networking expert, I
> don't know what common patterns are for these swaps, so I'm probably
> ill-equipped to say what makes the most sense beyond the specific
> scenario I'm trying to improve. These swaps will happen when setting up
> QoS on a network interface, so if I had to guess I'd think the most
> common pattern would be short bursts of activity when interfaces are
> set up. I know the pattern in the use case I've been debugging, which
> sees a lot of "first swaps" for different instances of mini_Qdisc_pair,
> and a smaller number of "second swaps" about 200 ms after the first,
> which means we're paying a high price for the rcu_barrier() when there
> can be no readers. Other scenarios might result in multiple swaps on the
> same instance within a short time period, but I can't say for sure.

Very good!  We can pool our ignorance of networking!  ;-)

If the common case really is mostly just the one swap, I can easily give
you an API that hands you an already-expired "oldstate".  Full disclosure:
On 32-bit systems, after 2^31 grace periods (maybe as quickly as three
months of runtime), the "oldstate" would no longer appear to be at
all old.  And would take another 2^31 grace periods to look old again.
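
(Back-of-envelope on "three months": if grace periods complete every
~3-4 milliseconds under load, 2^31 of them take something like
2.1e9 * 3.6 ms ~= 7.7e6 seconds, or roughly 90 days.)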

However, that is an unusual API, so I would want to know that it was
really useful before creating it.

Plus if you start early enough at boot, the value of zero will not look
old, just so you know.  ;-)

> > However, another approach is something like this:
> > 
> > 	oldstate = start_poll_synchronize_rcu();
> > 
> > 	// do something.
> > 
> > 	if (!poll_state_synchronize_rcu(oldstate))
> > 		synchronize_rcu_expedited();
> > 
> > This will be roughly an order of magnitude faster than you would see with
> > your current cond_synchronize_rcu().  And it will speed up all swaps,
> > not just the first one.  On the other hand, synchronize_rcu_expedited()
> > sends IPIs to all non-idle CPUs, which is often OK, but not so much for
> > heavy real-time workloads.
> > 
> > So you can let the heavy real-time workloads
> > have slowish Qdisc switching and IPI the IPI-insensitive workloads:
> > 
> > 	oldstate = start_poll_synchronize_rcu();
> > 
> > 	// do something.
> > 
> > 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > 		cond_synchronize_rcu(oldstate);
> > 	else if (!poll_state_synchronize_rcu(oldstate))
> > 		synchronize_rcu_expedited();
> 
> I was concerned about the system-wide impact synchronize_rcu_expedited()
> has, but this at least avoids doing it if it's unnecessary. If it's bad
> form to leave mini_Qdisc.rcu_state initialized to 0, this option is
> preferable from my perspective.

This might well be the best approach if multiple swaps are at all common.

But if needed, we really could do both:  Make initial swaps really fast,
and improve the performance of later swaps.
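
For example (sketch only, and assuming ->rcu_state starts out as an
already-expired cookie, as in your patch):

	if (!poll_state_synchronize_rcu(miniq->rcu_state)) {
		/* Never reached on the initial swaps. */
		if (IS_ENABLED(CONFIG_PREEMPT_RT))
			synchronize_rcu();
		else
			synchronize_rcu_expedited();
	}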

> > Yet another approach is this:
> > 
> > 	oldstate = start_poll_synchronize_rcu();
> > 
> > 	// do something.
> > 
> > 	while (!poll_state_synchronize_rcu(oldstate))
> > 		schedule_timeout_idle(1);
> > 
> > The advantage of this one is that you get credit for partial progress
> > through grace periods, but it won't be anywhere near as fast as
> > synchronize_rcu_expedited().
> 
> If we need to initialize mini_Qdisc.rcu_state up front this approach
> probably won't improve things much for my problem.

It all depends on how long between swaps.  ;-)

							Thanx, Paul

* Re: Improving RCU wait in mini_qdisc_pair_swap()
From: Seth Forshee @ 2021-10-21 21:16 UTC
  To: Paul E. McKenney; +Cc: rcu

On Thu, Oct 21, 2021 at 10:55:37AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 21, 2021 at 09:15:22AM -0500, Seth Forshee wrote:
> > On Wed, Oct 20, 2021 at 03:44:30PM -0700, Paul E. McKenney wrote:
> > > On Wed, Oct 20, 2021 at 04:22:52PM -0500, Seth Forshee wrote:
> > > > Hi Paul,
> > > 
> > > Hello, Seth!
> > > 
> > > Adding the rcu list on CC in case others are seeing similar issues.
> > > 
> > > > You got some questions the other day from (I believe) Mathieu Desnoyers,
> > > > who a colleague of mine had reached out to about this problem. Thanks a
> > > > lot for your suggestions! I was headed in the right direction, but I
> > > > wouldn't have got there as quickly, and I'm not sure I would have known
> > > > to use start_poll_synchronize_rcu(). I do have a question though, so I
> > > > thought I'd reach out to you about it while I continue to test the
> > > > patch.
> > > > 
> > > > To refresh your memory, or fill in bits that might not have made it to
> > > > you: we're seeing a lot of time spent in rcu_barrier() from
> > > > mini_qdisc_pair_swap() when we are trying to (quickly) restore a bunch
> > > > of network connections for VMs. I'm trying to make this go faster.
> > > > 
> > > > My patch is pasted below, which speeds things up drastically, by a
> > > > factor of 10,000 or so in most cases. My question is about initializing
> > > > mini_Qdisc.rcu_state. On my first pass I took a cautious approach and
> > > > initialized it from get_state_synchronize_rcu() when the structure is
> > > > first initialized. That patch didn't really help. I suspected this was
> > > > because the time between init and calling mini_qdisc_pair_swap() was
> > > > short enough that a grace period had not elapsed, so in the patch below
> > > > rcu_state is initialized to 0 (because the memory is
> > > > kzalloc-ed). I think this should be safe, based on my understanding of
> > > > how the cookie works and the fact that no other task yet has a reference
> > > > to either buffer. Is this correct?
> > > 
> > > Let me see if I understand your reasoning.  Right after creating the
> > > mini Qdisc, someone might be accessing the current buffer, but no one
> > > ever was accessing the non-current buffer.  So on the initial swap
> > > (and -only- on the initial swap), there are no readers to wait for.
> > > Is that your line of reasoning?  If so, sort-of OK, but let's talk about
> > > more organized ways of handling this.
> > 
> > That's almost right, just missing a couple of details. Initially neither
> > buffer (let's call them miniq1 and miniq2 for clarity) is active, and
> > *p_miniq is NULL until the first mini_qdisc_pair_swap() call, so there
> > can be no readers for either. On the first swap miniq1 becomes active,
> > and there are no readers to wait for. On the second swap there are no
> > readers of miniq2 to wait for, and at this time miniq1->rcu_state will
> > be set. On the third swap we might finally have readers of miniq1 to
> > wait for.
> > 
> > Also, this function is a little different from the way your examples are
> > structured in that oldstate is saved at the end of the function, and the
> > wait on oldstate happens in a subsequent call to the function. So it's
> > more like:
> > 
> >         oldstate = start_poll_synchronize_rcu();
> >         return;
> > 
> >         // Some unknown amount of time passes ...
> > 
> >         cond_synchronize_rcu(oldstate);
> > 
> > So I suspect that it wouldn't be uncommon for enough time to have passed
> > that there are no readers to wait for.
> > 
> > But that brings us to an important point. I'm not a networking expert, I
> > don't know what common patterns are for these swaps, so I'm probably
> > ill-equipped to say what makes the most sense beyond the specific
> > scenario I'm trying to improve. These swaps will happen when setting up
> > QoS on a network interface, so if I had to guess I'd think the most
> > common pattern would be short bursts of activity when interfaces are
> > set up. I know the pattern in the use case I've been debugging, which
> > sees a lot of "first swaps" for different instances of mini_Qdisc_pair,
> > and a smaller number of "second swaps" about 200 ms after the first,
> > which means we're paying a high price for the rcu_barrier() when there
> > can be no readers. Other scenarios might result in multiple swaps on the
> > same instance within a short time period, but I can't say for sure.
> 
> Very good!  We can pool our ignorance of networking!  ;-)
> 
> If the common case really is mostly just the one swap, I can easily give
> you an API that hands you an already-expired "oldstate".  Full disclosure:
> On 32-bit systems, after 2^31 grace periods (maybe as quickly as three
> months of runtime), the "oldstate" would no longer appear to be at
> all old.  And would take another 2^31 grace periods to look old again.
> 
> However, that is an unusual API, so I would want to know that it was
> really useful before creating it.
> 
> Plus if you start early enough at boot, the value of zero will not look
> old, just so you know.  ;-)

Of course. However, when I was initializing mini_Qdisc.rcu_state from
get_state_synchronize_rcu() along with cond_synchronize_rcu(), I was
still seeing many calls (50% or more) to mini_qdisc_pair_swap() take
tens of milliseconds in my use case. When I initialized rcu_state to 0 I
saw virtually no times over 1 ms. Trading a high probability of a long
delay for a low probability seems like an improvement to me. :-)

I think the new API is probably unnecessary though. I have some notes
from testing the other two approaches below.

> > > However, another approach is something like this:
> > > 
> > > 	oldstate = start_poll_synchronize_rcu();
> > > 
> > > 	// do something.
> > > 
> > > 	if (!poll_state_synchronize_rcu(oldstate))
> > > 		synchronize_rcu_expedited();
> > > 
> > > This will be roughly an order of magnitude faster than you would see with
> > > your current cond_synchronize_rcu().  And it will speed up all swaps,
> > > not just the first one.  On the other hand, synchronize_rcu_expedited()
> > > sends IPIs to all non-idle CPUs, which is often OK, but not so much for
> > > heavy real-time workloads.
> > > 
> > > So you can let the heavy real-time workloads
> > > have slowish Qdisc switching and IPI the IPI-insensitive workloads:
> > > 
> > > 	oldstate = start_poll_synchronize_rcu();
> > > 
> > > 	// do something.
> > > 
> > > 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > 		cond_synchronize_rcu(oldstate);
> > > 	else if (!poll_state_synchronize_rcu(oldstate))
> > > 		synchronize_rcu_expedited();
> > 
> > I was concerned about the system-wide impact synchronize_rcu_expedited()
> > has, but this at least avoids doing it if it's unnecessary. If it's bad
> > form to leave mini_Qdisc.rcu_state initialized to 0, this option is
> > preferable from my perspective.
> 
> This might well be the best approach if multiple swaps are at all common.
> 
> But if needed, we really could do both:  Make initial swaps really fast,
> and improve the performance of later swaps.

Using this, the time for mini_qdisc_pair_swap() to complete was similar
to my earlier patch with mini_Qdisc.rcu_state initialized to 0. So this
seems acceptable for my purposes.

> > > Yet another approach is this:
> > > 
> > > 	oldstate = start_poll_synchronize_rcu();
> > > 
> > > 	// do something.
> > > 
> > > 	while (!poll_state_synchronize_rcu(oldstate))
> > > 		schedule_timeout_idle(1);
> > > 
> > > The advantage of this one is that you get credit for partial progress
> > > through grace periods, but it won't be anywhere near as fast as
> > > synchronize_rcu_expedited().
> > 
> > If we need to initialize mini_Qdisc.rcu_state up front this approach
> > probably won't improve things much for my problem.
> 
> It all depends on how long between swaps.  ;-)

This approach helps, but I still see about 50% of the
mini_qdisc_pair_swap() calls taking tens of milliseconds. This is
similar to what I saw using cond_synchronize_rcu() with rcu_state
initialized from get_state_synchronize_rcu().

Wouldn't this also behave very poorly in the wrapping case you mentioned
above?

In the end, I'm happy to move forward using synchronize_rcu_expedited().
I'm going to test a bit more, and I should have a patch out sometime
tomorrow.

Thanks,
Seth

* Re: Improving RCU wait in mini_qdisc_pair_swap()
From: Paul E. McKenney @ 2021-10-21 22:24 UTC
  To: Seth Forshee; +Cc: rcu

On Thu, Oct 21, 2021 at 04:16:41PM -0500, Seth Forshee wrote:
> On Thu, Oct 21, 2021 at 10:55:37AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 21, 2021 at 09:15:22AM -0500, Seth Forshee wrote:
> > > On Wed, Oct 20, 2021 at 03:44:30PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Oct 20, 2021 at 04:22:52PM -0500, Seth Forshee wrote:
> > > > > Hi Paul,
> > > > 
> > > > Hello, Seth!
> > > > 
> > > > Adding the rcu list on CC in case others are seeing similar issues.
> > > > 
> > > > > You got some questions the other day from (I believe) Mathieu Desnoyers,
> > > > > who a colleague of mine had reached out to about this problem. Thanks a
> > > > > lot for your suggestions! I was headed in the right direction, but I
> > > > > wouldn't have got there as quickly, and I'm not sure I would have known
> > > > > to use start_poll_synchronize_rcu(). I do have a question though, so I
> > > > > thought I'd reach out to you about it while I continue to test the
> > > > > patch.
> > > > > 
> > > > > To refresh your memory, or fill in bits that might not have made it to
> > > > > you: we're seeing a lot of time spent in rcu_barrier() from
> > > > > mini_qdisc_pair_swap() when we are trying to (quickly) restore a bunch
> > > > > of network connections for VMs. I'm trying to make this go faster.
> > > > > 
> > > > > My patch is pasted below, which speeds things up drastically, by a
> > > > > factor of 10,000 or so in most cases. My question is about initializing
> > > > > mini_Qdisc.rcu_state. On my first pass I took a cautious approach and
> > > > > initialized it from get_state_synchronize_rcu() when the structure is
> > > > > first initialized. That patch didn't really help. I suspected this was
> > > > > because the time between init and calling mini_qdisc_pair_swap() was
> > > > > short enough that a grace period had not elapsed, so in the patch below
> > > > > rcu_state is initialized to 0 (because the memory is
> > > > > kzalloc-ed). I think this should be safe, based on my understanding of
> > > > > how the cookie works and the fact that no other task yet has a reference
> > > > > to either buffer. Is this correct?
> > > > 
> > > > Let me see if I understand your reasoning.  Right after creating the
> > > > mini Qdisc, someone might be accessing the current buffer, but no one
> > > > ever was accessing the non-current buffer.  So on the initial swap
> > > > (and -only- on the initial swap), there are no readers to wait for.
> > > > Is that your line of reasoning?  If so, sort-of OK, but let's talk about
> > > > more organized ways of handling this.
> > > 
> > > That's almost right, just missing a couple of details. Initially neither
> > > buffer (let's call them miniq1 and miniq2 for clarity) is active, and
> > > *p_miniq is NULL until the first mini_qdisc_pair_swap() call, so there
> > > can be no readers for either. On the first swap miniq1 becomes active,
> > > and there are no readers to wait for. On the second swap there are no
> > > readers of miniq2 to wait for, and at this time miniq1->rcu_state will
> > > be set. On the third swap we might finally have readers of miniq1 to
> > > wait for.
> > > 
> > > Also, this function is a little different from the way your examples are
> > > structured in that oldstate is saved at the end of the function, and the
> > > wait on oldstate happens in a subsequent call to the function. So it's
> > > more like:
> > > 
> > >         oldstate = start_poll_synchronize_rcu();
> > >         return;
> > > 
> > >         // Some unknown amount of time passes ...
> > > 
> > >         cond_synchronize_rcu(oldstate);
> > > 
> > > So I suspect that it wouldn't be uncommon for enough time to have passed
> > > that there are no readers to wait for.
> > > 
> > > But that brings us to an important point. I'm not a networking expert, I
> > > don't know what common patterns are for these swaps, so I'm probably
> > > ill-equipped to say what makes the most sense beyond the specific
> > > scenario I'm trying to improve. These swaps will happen when setting up
> > > QoS on a network interface, so if I had to guess I'd think the most
> > > common pattern would be short bursts of activity when interfaces are
> > > set up. I know the pattern in the use case I've been debugging, which
> > > sees a lot of "first swaps" for different instances of mini_Qdisc_pair,
> > > and a smaller number of "second swaps" about 200 ms after the first,
> > > which means we're paying a high price for the rcu_barrier() when there
> > > can be no readers. Other scenarios might result in multiple swaps on the
> > > same instance within a short time period, but I can't say for sure.
> > 
> > Very good!  We can pool our ignorance of networking!  ;-)
> > 
> > If the common case really is mostly just the one swap, I can easily give
> > you an API that hands you an already-expired "oldstate".  Full disclosure:
> > On 32-bit systems, after 2^31 grace periods (maybe as quickly as three
> > months of runtime), the "oldstate" would no longer appear to be at
> > all old.  And would take another 2^31 grace periods to look old again.
> > 
> > However, that is an unusual API, so I would want to know that it was
> > really useful before creating it.
> > 
> > Plus if you start early enough at boot, the value of zero will not look
> > old, just so you know.  ;-)
> 
> Of course. However, when I was initializing mini_Qdisc.rcu_state from
> get_state_synchronize_rcu() along with cond_synchronize_rcu(), I was
> still seeing many calls (50% or more) to mini_qdisc_pair_swap() take
> tens of milliseconds in my use case. When I initialized rcu_state to 0 I
> saw virtually no times over 1 ms. Trading a high probability of a long
> delay for a low probability seems like an improvement to me. :-)
> 
> I think the new API is probably unnecessary though. I have some notes
> from testing the other two approaches below.

Happy to avoid further RCU API proliferation!  ;-)

> > > > However, another approach is something like this:
> > > > 
> > > > 	oldstate = start_poll_synchronize_rcu();
> > > > 
> > > > 	// do something.
> > > > 
> > > > 	if (!poll_state_synchronize_rcu(oldstate))
> > > > 		synchronize_rcu_expedited();
> > > > 
> > > > This will be roughly an order of magnitude faster than you would see with
> > > > your current cond_synchronize_rcu().  And it will speed up all swaps,
> > > > not just the first one.  On the other hand, synchronize_rcu_expedited()
> > > > sends IPIs to all non-idle CPUs, which is often OK, but not so much for
> > > > heavy real-time workloads.
> > > > 
> > > > So you can let the heavy real-time workloads
> > > > have slowish Qdisc switching and IPI the IPI-insensitive workloads:
> > > > 
> > > > 	oldstate = start_poll_synchronize_rcu();
> > > > 
> > > > 	// do something.
> > > > 
> > > > 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > 		cond_synchronize_rcu(oldstate);
> > > > 	else if (!poll_state_synchronize_rcu(oldstate))
> > > > 		synchronize_rcu_expedited();
> > > 
> > > I was concerned about the system-wide impact synchronize_rcu_expedited()
> > > has, but this at least avoids doing it if it's unnecessary. If it's bad
> > > form to leave mini_Qdisc.rcu_state initialized to 0, this option is
> > > preferable from my perspective.
> > 
> > This might well be the best approach if multiple swaps are at all common.
> > 
> > But if needed, we really could do both:  Make initial swaps really fast,
> > and improve the performance of later swaps.
> 
> Using this, the time for mini_qdisc_pair_swap() to complete was similar
> to my earlier patch with mini_Qdisc.rcu_state initialized to 0. So this
> seems acceptable for my purposes.

Not sure which approach "this" refers to.  For the moment, I am assuming
the last code fragment shown above.

> > > > Yet another approach is this:
> > > > 
> > > > 	oldstate = start_poll_synchronize_rcu();
> > > > 
> > > > 	// do something.
> > > > 
> > > > 	while (!poll_state_synchronize_rcu(oldstate))
> > > > 		schedule_timeout_idle(1);
> > > > 
> > > > The advantage of this one is that you get credit for partial progress
> > > > through grace periods, but it won't be anywhere near as fast as
> > > > synchronize_rcu_expedited().
> > > 
> > > If we need to initialize mini_Qdisc.rcu_state up front this approach
> > > probably won't improve things much for my problem.
> > 
> > It all depends on how long between swaps.  ;-)
> 
> This approach helps, but I still see about 50% of the
> mini_qdisc_pair_swap() calls taking tens of milliseconds. This is
> similar to what I saw using cond_synchronize_rcu() with rcu_state
> initialized from get_state_synchronize_rcu().

OK, which means that the swaps are too closely spaced to get much
benefit about 50% of the time.

> Wouldn't this also behave very poorly in the wrapping case you mentioned
> above?

I confess, I have no idea what case is the "wrapping case".

But any implementation that needs to wait for a full normal
(non-expedited) grace period will suffer at least 50% of the time
when running your workload.  And of the above code fragments, only the
synchronize_rcu_expedited() variants avoid waiting for a full normal
grace period.

> In the end, I'm happy to move forward using synchronize_rcu_expedited().
> I'm going to test a bit more, and I should have a patch out sometime
> tomorrow.

Works for me!  At least given the CONFIG_PREEMPT_RT check.

							Thanx, Paul

* Re: Improving RCU wait in mini_qdisc_pair_swap()
From: Seth Forshee @ 2021-10-22 13:17 UTC
  To: Paul E. McKenney; +Cc: rcu

On Thu, Oct 21, 2021 at 03:24:35PM -0700, Paul E. McKenney wrote:
> > > > > So you can let the heavy real-time workloads
> > > > > have slowish Qdisc switching and IPI the IPI-insensitive workloads:
> > > > > 
> > > > > 	oldstate = start_poll_synchronize_rcu();
> > > > > 
> > > > > 	// do something.
> > > > > 
> > > > > 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > > 		cond_synchronize_rcu(oldstate);
> > > > > 	else if (!poll_state_synchronize_rcu(oldstate))
> > > > > 		synchronize_rcu_expedited();
> > > > 
> > > > I was concerned about the system-wide impact synchronize_rcu_expedited()
> > > > has, but this at least avoids doing it if it's unnecessary. If it's bad
> > > > form to leave mini_Qdisc.rcu_state initialized to 0, this option is
> > > > preferable from my perspective.
> > > 
> > > This might well be the best approach if multiple swaps are at all common.
> > > 
> > > But if needed, we really could do both:  Make initial swaps really fast,
> > > and improve the performance of later swaps.
> > 
> > Using this, the time for mini_qdisc_pair_swap() to complete was similar
> > to my earlier patch with mini_Qdisc.rcu_state initialized to 0. So this
> > seems acceptable for my purposes.
> 
> Not sure which approach "this" refers to.  For the moment, I am assuming
> the last code fragment shown above.

Yes, that is correct. Apologies that I wasn't clear.

> > > > > Yet another approach is this:
> > > > > 
> > > > > 	oldstate = start_poll_synchronize_rcu();
> > > > > 
> > > > > 	// do something.
> > > > > 
> > > > > 	while (!poll_state_synchronize_rcu(oldstate))
> > > > > 		schedule_timeout_idle(1);
> > > > > 
> > > > > The advantage of this one is that you get credit for partial progress
> > > > > through grace periods, but it won't be anywhere near as fast as
> > > > > synchronize_rcu_expedited().
> > > > 
> > > > If we need to initialize mini_Qdisc.rcu_state up front this approach
> > > > probably won't improve things much for my problem.
> > > 
> > > It all depends on how long between swaps.  ;-)
> > 
> > This approach helps, but I still see about 50% of the
> > mini_qdisc_pair_swap() calls taking tens of milliseconds. This is
> > similar to what I saw using cond_synchronize_rcu() with rcu_state
> > initialized from get_state_synchronize_rcu().
> 
> OK, which means that the swaps are too closely spaced to get much
> benefit about 50% of the time.

Yes, or in my case it's the time between mini_qdisc_pair_init() and the
first swap, which is why initializing rcu_state to 0 helped before.

> > Wouldn't this also behave very poorly in the wrapping case you mentioned
> > above?
> 
> I confess, I have no idea what case is the "wrapping case".

This question was based on a misunderstanding on my part, you can ignore
it.

> But any implementation that needs to wait for a full normal
> (non-expedited) grace period will suffer at least 50% of the time
> when running your workload.  And of the above code fragments, only the
> synchronize_rcu_expedited() variants avoid waiting for a full normal
> grace period.
> 
> > In the end, I'm happy to move forward using synchronize_rcu_expedited().
> > I'm going to test a bit more, and I should have a patch out sometime
> > tomorrow.
> 
> Works for me!  At least given the CONFIG_PREEMPT_RT check.

Yes, I included that check.

Thanks,
Seth
