* kernel-rt rcuc lock contention problem
@ 2015-01-26 19:14 Luiz Capitulino
  2015-01-27 20:37 ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Luiz Capitulino @ 2015-01-26 19:14 UTC (permalink / raw)
  To: paulmck; +Cc: linux-rt-users, Marcelo Tosatti

Paul,

We're running some measurements with cyclictest running inside a
KVM guest where we could observe spinlock contention among rcuc
threads.

Basically, we have a 16-CPU NUMA machine very well set up for RT.
This machine and the guest run the RT kernel. As our test-case
requires an application in the guest taking 100% of the CPU, the
RT priority configuration that gives the best latency is this one:

 263  FF   3  [rcuc/15]
  13  FF   3  [rcub/1]
  12  FF   3  [rcub/0]
 265  FF   2  [ksoftirqd/15]
3181  FF   1  qemu-kvm

In this configuration, the rcuc can preempt the guest's vcpu
thread. This shouldn't be a problem, except for the fact that
we're seeing that in some cases the rcuc/15 thread spends 10us
or more spinning in this spinlock (note that IRQs are disabled
during this period):

__rcu_process_callbacks()
{
...
	local_irq_save(flags);
	if (cpu_needs_another_gp(rsp, rdp)) {
		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
		rcu_start_gp(rsp);
		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
...

We've tried playing with the rcu_nocbs= option. However, it
did not help because, for reasons we don't understand, the rcuc
threads have to handle grace period start even when callback
offloading is used. Handling this case requires this code path
to be executed.

We've cooked the following extremely dirty patch, just to see
what would happen:

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index eaed1ef..c0771cc 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
 	/* Does this CPU require a not-yet-started grace period? */
 	local_irq_save(flags);
 	if (cpu_needs_another_gp(rsp, rdp)) {
-		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
-		rcu_start_gp(rsp);
-		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
+		for (;;) {
+			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
+				local_irq_restore(flags);
+				local_bh_enable();
+				schedule_timeout_interruptible(2);
+				local_bh_disable();
+				local_irq_save(flags);
+				continue;
+			}
+			rcu_start_gp(rsp);
+			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
+			break;
+		}
 	} else {
 		local_irq_restore(flags);
 	}

With this patch rcuc is gone from our traces and the scheduling
latency is reduced by 3us in our CPU-bound test-case.

Could you please advise on how to solve this contention problem?

For example, can we test whether the local CPU is a nocb CPU and,
in that case, skip rcu_start_gp() entirely?
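
For illustration only, the kind of test we have in mind might look
roughly like this (a hypothetical sketch using rcu_is_nocb_cpu(); note
that, as Paul explains in the replies below, skipping rcu_start_gp()
outright can hang the system, so this is only to show the idea):

	local_irq_save(flags);
	if (cpu_needs_another_gp(rsp, rdp) &&
	    !rcu_is_nocb_cpu(smp_processor_id())) {
		/* Only non-offloaded CPUs would take the root lock here. */
		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
		rcu_start_gp(rsp);
		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
	} else {
		local_irq_restore(flags);
	}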

Thanks!


* Re: kernel-rt rcuc lock contention problem
  2015-01-26 19:14 kernel-rt rcuc lock contention problem Luiz Capitulino
@ 2015-01-27 20:37 ` Paul E. McKenney
  2015-01-28  1:55   ` Marcelo Tosatti
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2015-01-27 20:37 UTC (permalink / raw)
  To: Luiz Capitulino; +Cc: linux-rt-users, Marcelo Tosatti

On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote:
> Paul,
> 
> We're running some measurements with cyclictest running inside a
> KVM guest where we could observe spinlock contention among rcuc
> threads.
> 
> Basically, we have a 16-CPU NUMA machine very well setup for RT.
> This machine and the guest run the RT kernel. As our test-case
> requires an application in the guest taking 100% of the CPU, the
> RT priority configuration that gives the best latency is this one:
> 
>  263  FF   3  [rcuc/15]
>   13  FF   3  [rcub/1]
>   12  FF   3  [rcub/0]
>  265  FF   2  [ksoftirqd/15]
> 3181  FF   1  qemu-kvm
> 
> In this configuration, the rcuc can preempt the guest's vcpu
> thread. This shouldn't be a problem, except for the fact that
> we're seeing that in some cases the rcuc/15 thread spends 10us
> or more spinning in this spinlock (note that IRQs are disabled
> during this period):
> 
> __rcu_process_callbacks()
> {
> ...
> 	local_irq_save(flags);
> 	if (cpu_needs_another_gp(rsp, rdp)) {
> 		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> 		rcu_start_gp(rsp);
> 		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> ...

Life can be hard when irq-disabled spinlocks can be preempted!  But how
often does this happen?  Also, does this happen on smaller systems, for
example, with four or eight CPUs?  And I confess to being a bit surprised
that you expect real-time response from a guest that is subject to
preemption -- as I understand it, the usual approach is to give RT guests
their own CPUs.

Or am I missing something?

> We've tried playing with the rcu_nocbs= option. However, it
> did not help because, for reasons we don't understand, the rcuc
> threads have to handle grace period start even when callback
> offloading is used. Handling this case requires this code path
> to be executed.

Yep.  The rcu_nocbs= option offloads invocation of RCU callbacks, but not
the per-CPU work required to inform RCU of quiescent states.
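
To illustrate what gets offloaded: the callbacks in question are the
functions queued with call_rcu().  A generic sketch (not from any
particular subsystem) of queueing one:

	#include <linux/kernel.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct foo {
		struct rcu_head rcu;
		int data;
	};

	/* Runs after a grace period ends.  For CPUs listed in rcu_nocbs=,
	 * this invocation happens in the rcuo kthreads rather than on the
	 * CPU that queued the callback. */
	static void foo_free_rcu(struct rcu_head *head)
	{
		kfree(container_of(head, struct foo, rcu));
	}

	static void foo_release(struct foo *p)
	{
		call_rcu(&p->rcu, foo_free_rcu);
	}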

> We've cooked the following extremely dirty patch, just to see
> what would happen:
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index eaed1ef..c0771cc 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
>  	/* Does this CPU require a not-yet-started grace period? */
>  	local_irq_save(flags);
>  	if (cpu_needs_another_gp(rsp, rdp)) {
> -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> -		rcu_start_gp(rsp);
> -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> +		for (;;) {
> +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> +				local_irq_restore(flags);
> +				local_bh_enable();
> +				schedule_timeout_interruptible(2);

Yes, the above will get you a splat in mainline kernels, which do not
necessarily push softirq processing to the ksoftirqd kthreads.  ;-)

> +				local_bh_disable();
> +				local_irq_save(flags);
> +				continue;
> +			}
> +			rcu_start_gp(rsp);
> +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> +			break;
> +		}
>  	} else {
>  		local_irq_restore(flags);
>  	}
> 
> With this patch rcuc is gone from our traces and the scheduling
> latency is reduced by 3us in our CPU-bound test-case.
> 
> Could you please advice on how to solve this contention problem?

The usual advice would be to configure the system such that the guest's
VCPUs do not get preempted.  Or is the contention on the root rcu_node
structure's ->lock field high for some other reason?

> Can we test whether the local CPU is nocb, and in that case, 
> skip rcu_start_gp entirely for example?

If you do that, you can see system hangs due to needed grace periods never
getting started.

Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF?
If you are using a smaller value, it would be possible to rework the
code to reduce contention on ->lock, though if a VCPU does get preempted
while holding the root rcu_node structure's ->lock, life will be hard.

							Thanx, Paul



* Re: kernel-rt rcuc lock contention problem
  2015-01-27 20:37 ` Paul E. McKenney
@ 2015-01-28  1:55   ` Marcelo Tosatti
  2015-01-28 14:18     ` Luiz Capitulino
  2015-01-28 18:03     ` Paul E. McKenney
  0 siblings, 2 replies; 23+ messages in thread
From: Marcelo Tosatti @ 2015-01-28  1:55 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Luiz Capitulino, linux-rt-users

On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote:
> > Paul,
> > 
> > We're running some measurements with cyclictest running inside a
> > KVM guest where we could observe spinlock contention among rcuc
> > threads.
> > 
> > Basically, we have a 16-CPU NUMA machine very well setup for RT.
> > This machine and the guest run the RT kernel. As our test-case
> > requires an application in the guest taking 100% of the CPU, the
> > RT priority configuration that gives the best latency is this one:
> > 
> >  263  FF   3  [rcuc/15]
> >   13  FF   3  [rcub/1]
> >   12  FF   3  [rcub/0]
> >  265  FF   2  [ksoftirqd/15]
> > 3181  FF   1  qemu-kvm
> > 
> > In this configuration, the rcuc can preempt the guest's vcpu
> > thread. This shouldn't be a problem, except for the fact that
> > we're seeing that in some cases the rcuc/15 thread spends 10us
> > or more spinning in this spinlock (note that IRQs are disabled
> > during this period):
> > 
> > __rcu_process_callbacks()
> > {
> > ...
> > 	local_irq_save(flags);
> > 	if (cpu_needs_another_gp(rsp, rdp)) {
> > 		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > 		rcu_start_gp(rsp);
> > 		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > ...
> 
> Life can be hard when irq-disabled spinlocks can be preempted!  But how
> often does this happen?  Also, does this happen on smaller systems, for
> example, with four or eight CPUs?  And I confess to be a bit surprised
> that you expect real-time response from a guest that is subject to
> preemption -- as I understand it, the usual approach is to give RT guests
> their own CPUs.
> 
> Or am I missing something?

We are trying to avoid relying on the guest VCPU to voluntarily yield
the CPU in order to allow the critical services (such as RCU callback
processing and sched tick processing) to execute.

> > We've tried playing with the rcu_nocbs= option. However, it
> > did not help because, for reasons we don't understand, the rcuc
> > threads have to handle grace period start even when callback
> > offloading is used. Handling this case requires this code path
> > to be executed.
> 
> Yep.  The rcu_nocbs= option offloads invocation of RCU callbacks, but not
> the per-CPU work required to inform RCU of quiescent states.

Can't you execute that on vCPU entry/exit? Those are quiescent states
after all.

> > We've cooked the following extremely dirty patch, just to see
> > what would happen:
> > 
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index eaed1ef..c0771cc 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> >  	/* Does this CPU require a not-yet-started grace period? */
> >  	local_irq_save(flags);
> >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > -		rcu_start_gp(rsp);
> > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > +		for (;;) {
> > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > +				local_irq_restore(flags);
> > +				local_bh_enable();
> > +				schedule_timeout_interruptible(2);
> 
> Yes, the above will get you a splat in mainline kernels, which do not
> necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> 
> > +				local_bh_disable();
> > +				local_irq_save(flags);
> > +				continue;
> > +			}
> > +			rcu_start_gp(rsp);
> > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > +			break;
> > +		}
> >  	} else {
> >  		local_irq_restore(flags);
> >  	}
> > 
> > With this patch rcuc is gone from our traces and the scheduling
> > latency is reduced by 3us in our CPU-bound test-case.
> > 
> > Could you please advice on how to solve this contention problem?
> 
> The usual advice would be to configure the system such that the guest's
> VCPUs do not get preempted.

The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
spinning). In that case, rcuc would never execute, because it has a 
lower priority than guest VCPUs.

I do not think we want that.

> Or is the contention on the root rcu_node structure's ->lock field
> high for some other reason?

Luiz?

> > Can we test whether the local CPU is nocb, and in that case, 
> > skip rcu_start_gp entirely for example?
> 
> If you do that, you can see system hangs due to needed grace periods never
> getting started.

So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
necessary for nocb CPUs to execute rcu_start_gp?

> Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF?
> If you are using a smaller value, it would be possible to rework the
> code to reduce contention on ->lock, though if a VCPU does get preempted
> while holding the root rcu_node structure's ->lock, life will be hard.

It's a raw spinlock, isn't it?



* Re: kernel-rt rcuc lock contention problem
  2015-01-28  1:55   ` Marcelo Tosatti
@ 2015-01-28 14:18     ` Luiz Capitulino
  2015-01-28 18:09       ` Paul E. McKenney
  2015-01-28 18:03     ` Paul E. McKenney
  1 sibling, 1 reply; 23+ messages in thread
From: Luiz Capitulino @ 2015-01-28 14:18 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Paul E. McKenney, linux-rt-users

On Tue, 27 Jan 2015 23:55:08 -0200
Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote:
> > > Paul,
> > > 
> > > We're running some measurements with cyclictest running inside a
> > > KVM guest where we could observe spinlock contention among rcuc
> > > threads.
> > > 
> > > Basically, we have a 16-CPU NUMA machine very well setup for RT.
> > > This machine and the guest run the RT kernel. As our test-case
> > > requires an application in the guest taking 100% of the CPU, the
> > > RT priority configuration that gives the best latency is this one:
> > > 
> > >  263  FF   3  [rcuc/15]
> > >   13  FF   3  [rcub/1]
> > >   12  FF   3  [rcub/0]
> > >  265  FF   2  [ksoftirqd/15]
> > > 3181  FF   1  qemu-kvm
> > > 
> > > In this configuration, the rcuc can preempt the guest's vcpu
> > > thread. This shouldn't be a problem, except for the fact that
> > > we're seeing that in some cases the rcuc/15 thread spends 10us
> > > or more spinning in this spinlock (note that IRQs are disabled
> > > during this period):
> > > 
> > > __rcu_process_callbacks()
> > > {
> > > ...
> > > 	local_irq_save(flags);
> > > 	if (cpu_needs_another_gp(rsp, rdp)) {
> > > 		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > 		rcu_start_gp(rsp);
> > > 		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > ...
> > 
> > Life can be hard when irq-disabled spinlocks can be preempted!  But how
> > often does this happen?  

I have to run cyclictest in the guest for 16m a few times to reproduce it.

> > Also, does this happen on smaller systems, for
> > example, with four or eight CPUs?  

Didn't test.

> > And I confess to be a bit surprised
> > that you expect real-time response from a guest that is subject to
> > preemption -- as I understand it, the usual approach is to give RT guests
> > their own CPUs.
> > 
> > Or am I missing something?
> 
> We are trying to avoid relying on the guest VCPU to voluntarily yield
> the CPU therefore allowing the critical services (such as rcu callback 
> processing and sched tick processing) to execute.

Yes. I hope I won't regret saying this, but what I'm observing is that
preempting the vcpu is not the end of the world as long as you're
quick.

> > > We've tried playing with the rcu_nocbs= option. However, it
> > > did not help because, for reasons we don't understand, the rcuc
> > > threads have to handle grace period start even when callback
> > > offloading is used. Handling this case requires this code path
> > > to be executed.
> > 
> > Yep.  The rcu_nocbs= option offloads invocation of RCU callbacks, but not
> > the per-CPU work required to inform RCU of quiescent states.
> 
> Can't you execute that on vCPU entry/exit? Those are quiescent states
> after all.
> 
> > > We've cooked the following extremely dirty patch, just to see
> > > what would happen:
> > > 
> > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > index eaed1ef..c0771cc 100644
> > > --- a/kernel/rcutree.c
> > > +++ b/kernel/rcutree.c
> > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > >  	/* Does this CPU require a not-yet-started grace period? */
> > >  	local_irq_save(flags);
> > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > -		rcu_start_gp(rsp);
> > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > +		for (;;) {
> > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > +				local_irq_restore(flags);
> > > +				local_bh_enable();
> > > +				schedule_timeout_interruptible(2);
> > 
> > Yes, the above will get you a splat in mainline kernels, which do not
> > necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> > 
> > > +				local_bh_disable();
> > > +				local_irq_save(flags);
> > > +				continue;
> > > +			}
> > > +			rcu_start_gp(rsp);
> > > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > +			break;
> > > +		}
> > >  	} else {
> > >  		local_irq_restore(flags);
> > >  	}
> > > 
> > > With this patch rcuc is gone from our traces and the scheduling
> > > latency is reduced by 3us in our CPU-bound test-case.
> > > 
> > > Could you please advice on how to solve this contention problem?
> > 
> > The usual advice would be to configure the system such that the guest's
> > VCPUs do not get preempted.
> 
> The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> spinning). In that case, rcuc would never execute, because it has a 
> lower priority than guest VCPUs.
> 
> I do not think we want that.
> 
> > Or is the contention on the root rcu_node structure's ->lock field
> > high for some other reason?

I didn't get far trying to determine the reason. What I observed
was the rcuc preempting the vcpu and taking 10us or more. I debugged it,
and it spends most of that time spinning on the spinlock. The patch
above makes the rcuc disappear from our traces. This is all I've got.
I could try to debug it further if you have suggestions on how to
trace the cause.

> 
> Luiz?
> 
> > > Can we test whether the local CPU is nocb, and in that case, 
> > > skip rcu_start_gp entirely for example?
> > 
> > If you do that, you can see system hangs due to needed grace periods never
> > getting started.
> 
> So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> necessary for nocb CPUs to execute rcu_start_gp?
> 
> > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF?
> > If you are using a smaller value, it would be possible to rework the
> > code to reduce contention on ->lock, though if a VCPU does get preempted
> > while holding the root rcu_node structure's ->lock, life will be hard.
> 
> Its a raw spinlock, isnt it?
> 



* Re: kernel-rt rcuc lock contention problem
  2015-01-28  1:55   ` Marcelo Tosatti
  2015-01-28 14:18     ` Luiz Capitulino
@ 2015-01-28 18:03     ` Paul E. McKenney
  2015-01-28 18:25       ` Marcelo Tosatti
  1 sibling, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2015-01-28 18:03 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Luiz Capitulino, linux-rt-users

On Tue, Jan 27, 2015 at 11:55:08PM -0200, Marcelo Tosatti wrote:
> On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote:
> > > Paul,
> > > 
> > > We're running some measurements with cyclictest running inside a
> > > KVM guest where we could observe spinlock contention among rcuc
> > > threads.
> > > 
> > > Basically, we have a 16-CPU NUMA machine very well setup for RT.
> > > This machine and the guest run the RT kernel. As our test-case
> > > requires an application in the guest taking 100% of the CPU, the
> > > RT priority configuration that gives the best latency is this one:
> > > 
> > >  263  FF   3  [rcuc/15]
> > >   13  FF   3  [rcub/1]
> > >   12  FF   3  [rcub/0]
> > >  265  FF   2  [ksoftirqd/15]
> > > 3181  FF   1  qemu-kvm
> > > 
> > > In this configuration, the rcuc can preempt the guest's vcpu
> > > thread. This shouldn't be a problem, except for the fact that
> > > we're seeing that in some cases the rcuc/15 thread spends 10us
> > > or more spinning in this spinlock (note that IRQs are disabled
> > > during this period):
> > > 
> > > __rcu_process_callbacks()
> > > {
> > > ...
> > > 	local_irq_save(flags);
> > > 	if (cpu_needs_another_gp(rsp, rdp)) {
> > > 		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > 		rcu_start_gp(rsp);
> > > 		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > ...
> > 
> > Life can be hard when irq-disabled spinlocks can be preempted!  But how
> > often does this happen?  Also, does this happen on smaller systems, for
> > example, with four or eight CPUs?  And I confess to be a bit surprised
> > that you expect real-time response from a guest that is subject to
> > preemption -- as I understand it, the usual approach is to give RT guests
> > their own CPUs.
> > 
> > Or am I missing something?
> 
> We are trying to avoid relying on the guest VCPU to voluntarily yield
> the CPU therefore allowing the critical services (such as rcu callback 
> processing and sched tick processing) to execute.

These critical services executing in the context of the host?
(If not, I am confused.  Actually, I am confused either way...)

> > > We've tried playing with the rcu_nocbs= option. However, it
> > > did not help because, for reasons we don't understand, the rcuc
> > > threads have to handle grace period start even when callback
> > > offloading is used. Handling this case requires this code path
> > > to be executed.
> > 
> > Yep.  The rcu_nocbs= option offloads invocation of RCU callbacks, but not
> > the per-CPU work required to inform RCU of quiescent states.
> 
> Can't you execute that on vCPU entry/exit? Those are quiescent states
> after all.

I am guessing that we are talking about quiescent states in the guest.
If so, can't vCPU entry/exit operations happen in guest interrupt
handlers?  If so, these operations are not necessarily quiescent states.

> > > We've cooked the following extremely dirty patch, just to see
> > > what would happen:
> > > 
> > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > index eaed1ef..c0771cc 100644
> > > --- a/kernel/rcutree.c
> > > +++ b/kernel/rcutree.c
> > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > >  	/* Does this CPU require a not-yet-started grace period? */
> > >  	local_irq_save(flags);
> > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > -		rcu_start_gp(rsp);
> > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > +		for (;;) {
> > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > +				local_irq_restore(flags);
> > > +				local_bh_enable();
> > > +				schedule_timeout_interruptible(2);
> > 
> > Yes, the above will get you a splat in mainline kernels, which do not
> > necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> > 
> > > +				local_bh_disable();
> > > +				local_irq_save(flags);
> > > +				continue;
> > > +			}
> > > +			rcu_start_gp(rsp);
> > > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > +			break;
> > > +		}
> > >  	} else {
> > >  		local_irq_restore(flags);
> > >  	}
> > > 
> > > With this patch rcuc is gone from our traces and the scheduling
> > > latency is reduced by 3us in our CPU-bound test-case.
> > > 
> > > Could you please advice on how to solve this contention problem?
> > 
> > The usual advice would be to configure the system such that the guest's
> > VCPUs do not get preempted.
> 
> The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> spinning). In that case, rcuc would never execute, because it has a 
> lower priority than guest VCPUs.

OK, this leads me to believe that you are talking about the rcuc kthreads
in the host, not the guest.  In which case the usual approach is to
reserve a CPU or two on the host which never runs guest VCPUs, and to
force the rcuc kthreads there.  Note that CONFIG_NO_HZ_FULL will do this
automatically for you, reserving the boot CPU.  And CONFIG_NO_HZ_FULL
might well be very useful in this scenario.  And reserving a CPU or two
for housekeeping purposes is quite common for heavy CPU-bound workloads.

Of course, you need to make sure that the reserved CPU or two is sufficient
for all the rcuc kthreads, but if your guests are mostly CPU bound, this
should not be a problem.

> I do not think we want that.

Assuming "that" is "rcuc would never execute" -- agreed, that would be
very bad.  You would eventually OOM the system.

> > Or is the contention on the root rcu_node structure's ->lock field
> > high for some other reason?
> 
> Luiz?
> 
> > > Can we test whether the local CPU is nocb, and in that case, 
> > > skip rcu_start_gp entirely for example?
> > 
> > If you do that, you can see system hangs due to needed grace periods never
> > getting started.
> 
> So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> necessary for nocb CPUs to execute rcu_start_gp?

Sigh.  Are we in the host or the guest OS at this point?

In any case, if you want the best real-time response for a CPU-bound
workload on a given CPU, careful use of NO_HZ_FULL would prevent
that CPU from ever invoking __rcu_process_callbacks() in the first
place, which would have the beneficial side effect of preventing
__rcu_process_callbacks() from ever invoking rcu_start_gp().

Of course, NO_HZ_FULL does have the drawback of increasing the cost
of user-kernel transitions.

> > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF?
> > If you are using a smaller value, it would be possible to rework the
> > code to reduce contention on ->lock, though if a VCPU does get preempted
> > while holding the root rcu_node structure's ->lock, life will be hard.
> 
> Its a raw spinlock, isnt it?

As I understand it, in a guest OS, that means nothing.  The host can
preempt a guest even if that guest believes that it has interrupts
disabled, correct?

If we are talking about the host, then I have to ask what is causing
the high levels of contention on the root rcu_node structure's ->lock
field.  (Which is the only rcu_node structure if you are using default
.config.)

							Thanx, Paul



* Re: kernel-rt rcuc lock contention problem
  2015-01-28 14:18     ` Luiz Capitulino
@ 2015-01-28 18:09       ` Paul E. McKenney
  2015-01-28 18:39         ` Luiz Capitulino
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2015-01-28 18:09 UTC (permalink / raw)
  To: Luiz Capitulino; +Cc: Marcelo Tosatti, linux-rt-users

On Wed, Jan 28, 2015 at 09:18:36AM -0500, Luiz Capitulino wrote:
> On Tue, 27 Jan 2015 23:55:08 -0200
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote:
> > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote:
> > > > Paul,
> > > > 
> > > > We're running some measurements with cyclictest running inside a
> > > > KVM guest where we could observe spinlock contention among rcuc
> > > > threads.
> > > > 
> > > > Basically, we have a 16-CPU NUMA machine very well setup for RT.
> > > > This machine and the guest run the RT kernel. As our test-case
> > > > requires an application in the guest taking 100% of the CPU, the
> > > > RT priority configuration that gives the best latency is this one:
> > > > 
> > > >  263  FF   3  [rcuc/15]
> > > >   13  FF   3  [rcub/1]
> > > >   12  FF   3  [rcub/0]
> > > >  265  FF   2  [ksoftirqd/15]
> > > > 3181  FF   1  qemu-kvm
> > > > 
> > > > In this configuration, the rcuc can preempt the guest's vcpu
> > > > thread. This shouldn't be a problem, except for the fact that
> > > > we're seeing that in some cases the rcuc/15 thread spends 10us
> > > > or more spinning in this spinlock (note that IRQs are disabled
> > > > during this period):
> > > > 
> > > > __rcu_process_callbacks()
> > > > {
> > > > ...
> > > > 	local_irq_save(flags);
> > > > 	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > 		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > 		rcu_start_gp(rsp);
> > > > 		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > ...
> > > 
> > > Life can be hard when irq-disabled spinlocks can be preempted!  But how
> > > often does this happen?  
> 
> I have to run cyclictest in the guest for 16m a few times to reproduce it.

So you are seeing the high contention in the guest, correct?

> > > Also, does this happen on smaller systems, for
> > > example, with four or eight CPUs?  
> 
> Didn't test.
> 
> > > And I confess to be a bit surprised
> > > that you expect real-time response from a guest that is subject to
> > > preemption -- as I understand it, the usual approach is to give RT guests
> > > their own CPUs.
> > > 
> > > Or am I missing something?
> > 
> > We are trying to avoid relying on the guest VCPU to voluntarily yield
> > the CPU therefore allowing the critical services (such as rcu callback 
> > processing and sched tick processing) to execute.
> 
> Yes. I hope I won't regret saying this, but what I'm observing is that
> preempting-off the vcpu is not the end of the world as long as you're
> quick.

And as long as you get lucky and avoid preempting a VCPU that happens to
be holding a critical lock.

Look, if you want real-time response in a guest OS, there simply is no
substitute for ensuring that the guest has its own CPUs that are not used
for anything else, either by anything in the host or by another guest.
If you do allow preemption of a guest OS that might be holding a critical
guest-OS lock, you are going to see latency blows.  Count on it!  ;-)

> > > > We've tried playing with the rcu_nocbs= option. However, it
> > > > did not help because, for reasons we don't understand, the rcuc
> > > > threads have to handle grace period start even when callback
> > > > offloading is used. Handling this case requires this code path
> > > > to be executed.
> > > 
> > > Yep.  The rcu_nocbs= option offloads invocation of RCU callbacks, but not
> > > the per-CPU work required to inform RCU of quiescent states.
> > 
> > Can't you execute that on vCPU entry/exit? Those are quiescent states
> > after all.
> > 
> > > > We've cooked the following extremely dirty patch, just to see
> > > > what would happen:
> > > > 
> > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > index eaed1ef..c0771cc 100644
> > > > --- a/kernel/rcutree.c
> > > > +++ b/kernel/rcutree.c
> > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > > >  	/* Does this CPU require a not-yet-started grace period? */
> > > >  	local_irq_save(flags);
> > > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > -		rcu_start_gp(rsp);
> > > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > +		for (;;) {
> > > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > > +				local_irq_restore(flags);
> > > > +				local_bh_enable();
> > > > +				schedule_timeout_interruptible(2);
> > > 
> > > Yes, the above will get you a splat in mainline kernels, which do not
> > > necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> > > 
> > > > +				local_bh_disable();
> > > > +				local_irq_save(flags);
> > > > +				continue;
> > > > +			}
> > > > +			rcu_start_gp(rsp);
> > > > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > +			break;
> > > > +		}
> > > >  	} else {
> > > >  		local_irq_restore(flags);
> > > >  	}
> > > > 
> > > > With this patch rcuc is gone from our traces and the scheduling
> > > > latency is reduced by 3us in our CPU-bound test-case.
> > > > 
> > > > Could you please advice on how to solve this contention problem?
> > > 
> > > The usual advice would be to configure the system such that the guest's
> > > VCPUs do not get preempted.
> > 
> > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> > spinning). In that case, rcuc would never execute, because it has a 
> > lower priority than guest VCPUs.
> > 
> > I do not think we want that.
> > 
> > > Or is the contention on the root rcu_node structure's ->lock field
> > > high for some other reason?
> 
> I didn't go far on trying to determine the reason. What I observed
> was the rcuc preempting-off the vcpu and taking 10us+. I debugged it
> and most of this time it spends spinning on the spinlock. The patch
> above makes the rcuc disappear from our traces. This is all I've got.
> I could try to debug it further if you have suggestions on how to
> trace the cause.

My current guess is that either:

1.	You are allowing the host or another guest to preempt this
	guest's VCPU.  Don't do that.  ;-)

2.	You are letting the rcuc kthreads contend for the worker CPUs.
	Pin them to housekeeping CPUs.  This applies to both the
	host and the guest rcuc kthreads, but especially to the
	host rcuc kthreads.

Or am I still unclear on your goals and configuration?

							Thanx, Paul

> > Luiz?
> > 
> > > > Can we test whether the local CPU is nocb, and in that case, 
> > > > skip rcu_start_gp entirely for example?
> > > 
> > > If you do that, you can see system hangs due to needed grace periods never
> > > getting started.
> > 
> > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> > necessary for nocb CPUs to execute rcu_start_gp?
> > 
> > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF?
> > > If you are using a smaller value, it would be possible to rework the
> > > code to reduce contention on ->lock, though if a VCPU does get preempted
> > > while holding the root rcu_node structure's ->lock, life will be hard.
> > 
> > Its a raw spinlock, isnt it?
> > 
> 



* Re: kernel-rt rcuc lock contention problem
  2015-01-28 18:03     ` Paul E. McKenney
@ 2015-01-28 18:25       ` Marcelo Tosatti
  2015-01-28 18:55         ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2015-01-28 18:25 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Luiz Capitulino, linux-rt-users

On Wed, Jan 28, 2015 at 10:03:35AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 27, 2015 at 11:55:08PM -0200, Marcelo Tosatti wrote:
> > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote:
> > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote:
> > > > Paul,
> > > > 
> > > > We're running some measurements with cyclictest running inside a
> > > > KVM guest where we could observe spinlock contention among rcuc
> > > > threads.
> > > > 
> > > > Basically, we have a 16-CPU NUMA machine very well setup for RT.
> > > > This machine and the guest run the RT kernel. As our test-case
> > > > requires an application in the guest taking 100% of the CPU, the
> > > > RT priority configuration that gives the best latency is this one:
> > > > 
> > > >  263  FF   3  [rcuc/15]
> > > >   13  FF   3  [rcub/1]
> > > >   12  FF   3  [rcub/0]
> > > >  265  FF   2  [ksoftirqd/15]
> > > > 3181  FF   1  qemu-kvm
> > > > 
> > > > In this configuration, the rcuc can preempt the guest's vcpu
> > > > thread. This shouldn't be a problem, except for the fact that
> > > > we're seeing that in some cases the rcuc/15 thread spends 10us
> > > > or more spinning in this spinlock (note that IRQs are disabled
> > > > during this period):
> > > > 
> > > > __rcu_process_callbacks()
> > > > {
> > > > ...
> > > > 	local_irq_save(flags);
> > > > 	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > 		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > 		rcu_start_gp(rsp);
> > > > 		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > ...
> > > 
> > > Life can be hard when irq-disabled spinlocks can be preempted!  But how
> > > often does this happen?  Also, does this happen on smaller systems, for
> > > example, with four or eight CPUs?  And I confess to be a bit surprised
> > > that you expect real-time response from a guest that is subject to
> > > preemption -- as I understand it, the usual approach is to give RT guests
> > > their own CPUs.
> > > 
> > > Or am I missing something?
> > 
> > We are trying to avoid relying on the guest VCPU to voluntarily yield
> > the CPU therefore allowing the critical services (such as rcu callback 
> > processing and sched tick processing) to execute.
> 
> These critical services executing in the context of the host?
> (If not, I am confused.  Actually, I am confused either way...)

The host. Imagine a Windows 95 guest running a realtime app.
That should help.

> > > > We've tried playing with the rcu_nocbs= option. However, it
> > > > did not help because, for reasons we don't understand, the rcuc
> > > > threads have to handle grace period start even when callback
> > > > offloading is used. Handling this case requires this code path
> > > > to be executed.
> > > 
> > > Yep.  The rcu_nocbs= option offloads invocation of RCU callbacks, but not
> > > the per-CPU work required to inform RCU of quiescent states.
> > 
> > Can't you execute that on vCPU entry/exit? Those are quiescent states
> > after all.
> 
> I am guessing that we are talking about quiescent states in the guest.

Host.

> If so, can't vCPU entry/exit operations happen in guest interrupt
> handlers?  If so, these operations are not necessarily quiescent states.

vCPU entry/exit are quiescent states in the host.

> > > > We've cooked the following extremely dirty patch, just to see
> > > > what would happen:
> > > > 
> > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > index eaed1ef..c0771cc 100644
> > > > --- a/kernel/rcutree.c
> > > > +++ b/kernel/rcutree.c
> > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > > >  	/* Does this CPU require a not-yet-started grace period? */
> > > >  	local_irq_save(flags);
> > > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > -		rcu_start_gp(rsp);
> > > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > +		for (;;) {
> > > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > > +				local_irq_restore(flags);
> > > > +				local_bh_enable();
> > > > +				schedule_timeout_interruptible(2);
> > > 
> > > Yes, the above will get you a splat in mainline kernels, which do not
> > > necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> > > 
> > > > +				local_bh_disable();
> > > > +				local_irq_save(flags);
> > > > +				continue;
> > > > +			}
> > > > +			rcu_start_gp(rsp);
> > > > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > +			break;
> > > > +		}
> > > >  	} else {
> > > >  		local_irq_restore(flags);
> > > >  	}
> > > > 
> > > > With this patch rcuc is gone from our traces and the scheduling
> > > > latency is reduced by 3us in our CPU-bound test-case.
> > > > 
> > > > Could you please advice on how to solve this contention problem?
> > > 
> > > The usual advice would be to configure the system such that the guest's
> > > VCPUs do not get preempted.
> > 
> > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> > spinning). In that case, rcuc would never execute, because it has a 
> > lower priority than guest VCPUs.
> 
> OK, this leads me to believe that you are talking about the rcuc kthreads
> in the host, not the guest.  In which case the usual approach is to
> reserve a CPU or two on the host which never runs guest VCPUs, and to
> force the rcuc kthreads there.  Note that CONFIG_NO_HZ_FULL will do this
> automatically for you, reserving the boot CPU.  And CONFIG_NO_HZ_FULL
> might well be very useful in this scenario.  And reserving a CPU or two
> for housekeeping purposes is quite common for heavy CPU-bound workloads.
> 
> Of course, you need to make sure that the reserved CPU or two is sufficient
> for all the rcuc kthreads, but if your guests are mostly CPU bound, this
> should not be a problem.
> 
> > I do not think we want that.
> 
> Assuming "that" is "rcuc would never execute" -- agreed, that would be
> very bad.  You would eventually OOM the system.
> 
> > > Or is the contention on the root rcu_node structure's ->lock field
> > > high for some other reason?
> > 
> > Luiz?
> > 
> > > > Can we test whether the local CPU is nocb, and in that case, 
> > > > skip rcu_start_gp entirely for example?
> > > 
> > > If you do that, you can see system hangs due to needed grace periods never
> > > getting started.
> > 
> > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> > necessary for nocb CPUs to execute rcu_start_gp?
> 
> Sigh.  Are we in the host or the guest OS at this point?

Host.

> In any case, if you want the best real-time response for a CPU-bound
> workload on a given CPU, careful use of NO_HZ_FULL would prevent
> that CPU from ever invoking __rcu_process_callbacks() in the first
> place, which would have the beneficial side effect of preventing
> __rcu_process_callbacks() from ever invoking rcu_start_gp().
> 
> Of course, NO_HZ_FULL does have the drawback of increasing the cost
> of user-kernel transitions.

We need periodic processing of __run_timers to keep timer wheel
processing from falling behind too much.

See http://www.gossamer-threads.com/lists/linux/kernel/2094151.

> > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF?
> > > If you are using a smaller value, it would be possible to rework the
> > > code to reduce contention on ->lock, though if a VCPU does get preempted
> > > while holding the root rcu_node structure's ->lock, life will be hard.
> > 
> > Its a raw spinlock, isnt it?
> 
> As I understand it, in a guest OS, that means nothing.  The host can
> preempt a guest even if that guest believes that it has interrupts
> disabled, correct?

Yes.

> If we are talking about the host, then I have to ask what is causing
> the high levels of contention on the root rcu_node structure's ->lock
> field.  (Which is the only rcu_node structure if you are using default
> .config.)
> 
> 							Thanx, Paul

OK, great.

Thanks a lot.



* Re: kernel-rt rcuc lock contention problem
  2015-01-28 18:09       ` Paul E. McKenney
@ 2015-01-28 18:39         ` Luiz Capitulino
  2015-01-28 19:00           ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Luiz Capitulino @ 2015-01-28 18:39 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Marcelo Tosatti, linux-rt-users

On Wed, 28 Jan 2015 10:09:50 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Wed, Jan 28, 2015 at 09:18:36AM -0500, Luiz Capitulino wrote:
> > On Tue, 27 Jan 2015 23:55:08 -0200
> > Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > 
> > > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote:
> > > > > Paul,
> > > > > 
> > > > > We're running some measurements with cyclictest running inside a
> > > > > KVM guest where we could observe spinlock contention among rcuc
> > > > > threads.
> > > > > 
> > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT.
> > > > > This machine and the guest run the RT kernel. As our test-case
> > > > > requires an application in the guest taking 100% of the CPU, the
> > > > > RT priority configuration that gives the best latency is this one:
> > > > > 
> > > > >  263  FF   3  [rcuc/15]
> > > > >   13  FF   3  [rcub/1]
> > > > >   12  FF   3  [rcub/0]
> > > > >  265  FF   2  [ksoftirqd/15]
> > > > > 3181  FF   1  qemu-kvm
> > > > > 
> > > > > In this configuration, the rcuc can preempt the guest's vcpu
> > > > > thread. This shouldn't be a problem, except for the fact that
> > > > > we're seeing that in some cases the rcuc/15 thread spends 10us
> > > > > or more spinning in this spinlock (note that IRQs are disabled
> > > > > during this period):
> > > > > 
> > > > > __rcu_process_callbacks()
> > > > > {
> > > > > ...
> > > > > 	local_irq_save(flags);
> > > > > 	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > 		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > 		rcu_start_gp(rsp);
> > > > > 		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > ...
> > > > 
> > > > Life can be hard when irq-disabled spinlocks can be preempted!  But how
> > > > often does this happen?  
> > 
> > I have to run cyclictest in the guest for 16m a few times to reproduce it.
> 
> So you are seeing the high contention in the guest, correct?

No, it's in the host.

> 
> > > > Also, does this happen on smaller systems, for
> > > > example, with four or eight CPUs?  
> > 
> > Didn't test.
> > 
> > > > And I confess to be a bit surprised
> > > > that you expect real-time response from a guest that is subject to
> > > > preemption -- as I understand it, the usual approach is to give RT guests
> > > > their own CPUs.
> > > > 
> > > > Or am I missing something?
> > > 
> > > We are trying to avoid relying on the guest VCPU to voluntarily yield
> > > the CPU therefore allowing the critical services (such as rcu callback 
> > > processing and sched tick processing) to execute.
> > 
> > Yes. I hope I won't regret saying this, but what I'm observing is that
> > preempting-off the vcpu is not the end of the world as long as you're
> > quick.
> 
> And as long as you get lucky and avoid preempting a VCPU that happens to
> be holding a critical lock.

That's not the case. Everything I mentioned in this thread about RCU
and contention happens in the host.

> Look, if you want real-time response in a guest OS, there simply is no
> substitute for ensuring that the guest has its own CPUs that are not used
> for anything else, either by anything in the host or by another guest.
> If you do allow preemption of a guest OS that might be holding a critical
> guest-OS lock, you are going to see latency blows.  Count on it!  ;-)
> 
> > > > > We've tried playing with the rcu_nocbs= option. However, it
> > > > > did not help because, for reasons we don't understand, the rcuc
> > > > > threads have to handle grace period start even when callback
> > > > > offloading is used. Handling this case requires this code path
> > > > > to be executed.
> > > > 
> > > > Yep.  The rcu_nocbs= option offloads invocation of RCU callbacks, but not
> > > > the per-CPU work required to inform RCU of quiescent states.
> > > 
> > > Can't you execute that on vCPU entry/exit? Those are quiescent states
> > > after all.
> > > 
> > > > > We've cooked the following extremely dirty patch, just to see
> > > > > what would happen:
> > > > > 
> > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > > index eaed1ef..c0771cc 100644
> > > > > --- a/kernel/rcutree.c
> > > > > +++ b/kernel/rcutree.c
> > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > > > >  	/* Does this CPU require a not-yet-started grace period? */
> > > > >  	local_irq_save(flags);
> > > > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > -		rcu_start_gp(rsp);
> > > > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > +		for (;;) {
> > > > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > > > +				local_irq_restore(flags);
> > > > > +				local_bh_enable();
> > > > > +				schedule_timeout_interruptible(2);
> > > > 
> > > > Yes, the above will get you a splat in mainline kernels, which do not
> > > > necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> > > > 
> > > > > +				local_bh_disable();
> > > > > +				local_irq_save(flags);
> > > > > +				continue;
> > > > > +			}
> > > > > +			rcu_start_gp(rsp);
> > > > > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > +			break;
> > > > > +		}
> > > > >  	} else {
> > > > >  		local_irq_restore(flags);
> > > > >  	}
> > > > > 
> > > > > With this patch rcuc is gone from our traces and the scheduling
> > > > > latency is reduced by 3us in our CPU-bound test-case.
> > > > > 
> > > > > Could you please advice on how to solve this contention problem?
> > > > 
> > > > The usual advice would be to configure the system such that the guest's
> > > > VCPUs do not get preempted.
> > > 
> > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> > > spinning). In that case, rcuc would never execute, because it has a 
> > > lower priority than guest VCPUs.
> > > 
> > > I do not think we want that.
> > > 
> > > > Or is the contention on the root rcu_node structure's ->lock field
> > > > high for some other reason?
> > 
> > I didn't go far on trying to determine the reason. What I observed
> > was the rcuc preempting-off the vcpu and taking 10us+. I debugged it
> > and most of this time it spends spinning on the spinlock. The patch
> > above makes the rcuc disappear from our traces. This is all I've got.
> > I could try to debug it further if you have suggestions on how to
> > trace the cause.
> 
> My current guess is that either:
> 
> 1.	You are allowing the host or another guest to preempt this
> 	guest's VCPU.  Don't do that.  ;-)

We do allow the rcuc kthread to preempt the guest's vCPU (not other
guests). The reason for this is that the workload running inside the
guest may take 100% of the CPU, which won't allow the rcuc thread
to ever execute.

> 2.	You are letting the rcuc kthreads contend for the worker CPUs.
> 	Pin them to housekeeping CPUs.  This applies to both the
> 	host and the guest rcuc kthreads, but especially to the
> 	host rcuc kthreads.

I'd love to be able to do this, but the rcuc threads are per-CPU
kthreads bound to their CPUs. There's one per CPU and the kernel doesn't
allow me to move them around.

> 
> Or am I still unclear on your goals and configuration?
> 
> 							Thanx, Paul
> 
> > > Luiz?
> > > 
> > > > > Can we test whether the local CPU is nocb, and in that case, 
> > > > > skip rcu_start_gp entirely for example?
> > > > 
> > > > If you do that, you can see system hangs due to needed grace periods never
> > > > getting started.
> > > 
> > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> > > necessary for nocb CPUs to execute rcu_start_gp?
> > > 
> > > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF?
> > > > If you are using a smaller value, it would be possible to rework the
> > > > code to reduce contention on ->lock, though if a VCPU does get preempted
> > > > while holding the root rcu_node structure's ->lock, life will be hard.
> > > 
> > > Its a raw spinlock, isnt it?
> > > 
> > 
> 



* Re: kernel-rt rcuc lock contention problem
  2015-01-28 18:25       ` Marcelo Tosatti
@ 2015-01-28 18:55         ` Paul E. McKenney
  2015-01-29 17:06           ` Steven Rostedt
                             ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Paul E. McKenney @ 2015-01-28 18:55 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Luiz Capitulino, linux-rt-users

On Wed, Jan 28, 2015 at 04:25:12PM -0200, Marcelo Tosatti wrote:
> On Wed, Jan 28, 2015 at 10:03:35AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 27, 2015 at 11:55:08PM -0200, Marcelo Tosatti wrote:
> > > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote:
> > > > > Paul,
> > > > > 
> > > > > We're running some measurements with cyclictest running inside a
> > > > > KVM guest where we could observe spinlock contention among rcuc
> > > > > threads.
> > > > > 
> > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT.
> > > > > This machine and the guest run the RT kernel. As our test-case
> > > > > requires an application in the guest taking 100% of the CPU, the
> > > > > RT priority configuration that gives the best latency is this one:
> > > > > 
> > > > >  263  FF   3  [rcuc/15]
> > > > >   13  FF   3  [rcub/1]
> > > > >   12  FF   3  [rcub/0]
> > > > >  265  FF   2  [ksoftirqd/15]
> > > > > 3181  FF   1  qemu-kvm
> > > > > 
> > > > > In this configuration, the rcuc can preempt the guest's vcpu
> > > > > thread. This shouldn't be a problem, except for the fact that
> > > > > we're seeing that in some cases the rcuc/15 thread spends 10us
> > > > > or more spinning in this spinlock (note that IRQs are disabled
> > > > > during this period):
> > > > > 
> > > > > __rcu_process_callbacks()
> > > > > {
> > > > > ...
> > > > > 	local_irq_save(flags);
> > > > > 	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > 		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > 		rcu_start_gp(rsp);
> > > > > 		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > ...
> > > > 
> > > > Life can be hard when irq-disabled spinlocks can be preempted!  But how
> > > > often does this happen?  Also, does this happen on smaller systems, for
> > > > example, with four or eight CPUs?  And I confess to be a bit surprised
> > > > that you expect real-time response from a guest that is subject to
> > > > preemption -- as I understand it, the usual approach is to give RT guests
> > > > their own CPUs.
> > > > 
> > > > Or am I missing something?
> > > 
> > > We are trying to avoid relying on the guest VCPU to voluntarily yield
> > > the CPU therefore allowing the critical services (such as rcu callback 
> > > processing and sched tick processing) to execute.
> > 
> > These critical services executing in the context of the host?
> > (If not, I am confused.  Actually, I am confused either way...)
> 
> The host. Imagine a Windows 95 guest running a realtime app.
> That should help.

Then force the critical services to run on a housekeeping CPU.  If the
host is permitted to preempt the guest, the latency blows you are seeing
are expected behavior.

> > > > > We've tried playing with the rcu_nocbs= option. However, it
> > > > > did not help because, for reasons we don't understand, the rcuc
> > > > > threads have to handle grace period start even when callback
> > > > > offloading is used. Handling this case requires this code path
> > > > > to be executed.
> > > > 
> > > > Yep.  The rcu_nocbs= option offloads invocation of RCU callbacks, but not
> > > > the per-CPU work required to inform RCU of quiescent states.
> > > 
> > > Can't you execute that on vCPU entry/exit? Those are quiescent states
> > > after all.
> > 
> > I am guessing that we are talking about quiescent states in the guest.
> 
> Host.
> 
> > If so, can't vCPU entry/exit operations happen in guest interrupt
> > handlers?  If so, these operations are not necessarily quiescent states.
> 
> vCPU entry/exit are quiescent states in the host.

As is execution in the guest.  If you build the host with NO_HZ_FULL
and boot with the appropriate nohz_full= parameter, this will happen
automatically.  If that is infeasible, then yes, it should be possible
to add an explicit quiescent state in the host at vCPU entry/exit, at
least assuming that the host is in a state permitting this.
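
For example, a minimal sketch of such an explicit host-side quiescent
state at vCPU entry (hypothetical helper name; mainline KVM already does
something along these lines via rcu_virt_note_context_switch() from
kvm_guest_enter()) could be:

	#include <linux/rcutree.h>
	#include <linux/smp.h>

	/*
	 * Hypothetical hook, called on the host with preemption disabled
	 * just before entering guest mode: treat guest entry as a
	 * quiescent state for this CPU, much like a return to userspace,
	 * so RCU has less reason to poke the CPU while the vCPU runs.
	 */
	static inline void vcpu_entry_note_rcu_qs(void)
	{
		rcu_virt_note_context_switch(smp_processor_id());
	}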

> > > > > We've cooked the following extremely dirty patch, just to see
> > > > > what would happen:
> > > > > 
> > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > > index eaed1ef..c0771cc 100644
> > > > > --- a/kernel/rcutree.c
> > > > > +++ b/kernel/rcutree.c
> > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > > > >  	/* Does this CPU require a not-yet-started grace period? */
> > > > >  	local_irq_save(flags);
> > > > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > -		rcu_start_gp(rsp);
> > > > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > +		for (;;) {
> > > > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > > > +				local_irq_restore(flags);
> > > > > +				local_bh_enable();
> > > > > +				schedule_timeout_interruptible(2);
> > > > 
> > > > Yes, the above will get you a splat in mainline kernels, which do not
> > > > necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> > > > 
> > > > > +				local_bh_disable();
> > > > > +				local_irq_save(flags);
> > > > > +				continue;
> > > > > +			}
> > > > > +			rcu_start_gp(rsp);
> > > > > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > +			break;
> > > > > +		}
> > > > >  	} else {
> > > > >  		local_irq_restore(flags);
> > > > >  	}
> > > > > 
> > > > > With this patch rcuc is gone from our traces and the scheduling
> > > > > latency is reduced by 3us in our CPU-bound test-case.
> > > > > 
> > > > > Could you please advice on how to solve this contention problem?
> > > > 
> > > > The usual advice would be to configure the system such that the guest's
> > > > VCPUs do not get preempted.
> > > 
> > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> > > spinning). In that case, rcuc would never execute, because it has a 
> > > lower priority than guest VCPUs.
> > 
> > OK, this leads me to believe that you are talking about the rcuc kthreads
> > in the host, not the guest.  In which case the usual approach is to
> > reserve a CPU or two on the host which never runs guest VCPUs, and to
> > force the rcuc kthreads there.  Note that CONFIG_NO_HZ_FULL will do this
> > automatically for you, reserving the boot CPU.  And CONFIG_NO_HZ_FULL
> > might well be very useful in this scenario.  And reserving a CPU or two
> > for housekeeping purposes is quite common for heavy CPU-bound workloads.
> > 
> > Of course, you need to make sure that the reserved CPU or two is sufficient
> > for all the rcuc kthreads, but if your guests are mostly CPU bound, this
> > should not be a problem.
> > 
> > > I do not think we want that.
> > 
> > Assuming "that" is "rcuc would never execute" -- agreed, that would be
> > very bad.  You would eventually OOM the system.
> > 
> > > > Or is the contention on the root rcu_node structure's ->lock field
> > > > high for some other reason?
> > > 
> > > Luiz?
> > > 
> > > > > Can we test whether the local CPU is nocb, and in that case, 
> > > > > skip rcu_start_gp entirely for example?
> > > > 
> > > > If you do that, you can see system hangs due to needed grace periods never
> > > > getting started.
> > > 
> > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> > > necessary for nocb CPUs to execute rcu_start_gp?
> > 
> > Sigh.  Are we in the host or the guest OS at this point?
> 
> Host.

Can you build the host with NO_HZ_FULL and boot with nohz_full=?
That should get rid of most of your problems here.
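
As a concrete illustration for the 16-CPU box described earlier
(the CPU numbers are assumptions -- keep CPU 0 for housekeeping and
isolate the rest), the host command line would gain something like:

	isolcpus=1-15 nohz_full=1-15 rcu_nocbs=1-15

With that, the tick -- and much of the RCU and timer softirq work it
drives -- stays off CPUs 1-15 while they run nothing but the vCPU
threads, and the offloaded callbacks are invoked by rcuo kthreads
that can be pinned to CPU 0.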

> > In any case, if you want the best real-time response for a CPU-bound
> > workload on a given CPU, careful use of NO_HZ_FULL would prevent
> > that CPU from ever invoking __rcu_process_callbacks() in the first
> > place, which would have the beneficial side effect of preventing
> > __rcu_process_callbacks() from ever invoking rcu_start_gp().
> > 
> > Of course, NO_HZ_FULL does have the drawback of increasing the cost
> > of user-kernel transitions.
> 
> We need periodic processing of __run_timers to keep timer wheel
> processing from falling behind too much.
> 
> See http://www.gossamer-threads.com/lists/linux/kernel/2094151.

Hmmm...  Do you have the following commits in your build?

fff421580f51 timers: Track total number of timers in list
d550e81dc0dd timers: Reduce __run_timers() latency for empty list
16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list
18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list
aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0

Keeping extraneous processing off of the CPUs running the real-time
guest will minimize the number of timers, allowing these commits to
do their jobs.

> > > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF?
> > > > If you are using a smaller value, it would be possible to rework the
> > > > code to reduce contention on ->lock, though if a VCPU does get preempted
> > > > while holding the root rcu_node structure's ->lock, life will be hard.
> > > 
> > > Its a raw spinlock, isnt it?
> > 
> > As I understand it, in a guest OS, that means nothing.  The host can
> > preempt a guest even if that guest believes that it has interrupts
> > disabled, correct?
> 
> Yes.

Then your only hope is to prevent the host (and other guests) from
preempting the real-time guest.

> > If we are talking about the host, then I have to ask what is causing
> > the high levels of contention on the root rcu_node structure's ->lock
> > field.  (Which is the only rcu_node structure if you are using default
> > .config.)
> > 
> > 							Thanx, Paul
> 
> OK, great.
> 
> Thanks a lot.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-01-28 18:39         ` Luiz Capitulino
@ 2015-01-28 19:00           ` Paul E. McKenney
  2015-01-28 19:06             ` Luiz Capitulino
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2015-01-28 19:00 UTC (permalink / raw)
  To: Luiz Capitulino; +Cc: Marcelo Tosatti, linux-rt-users

On Wed, Jan 28, 2015 at 01:39:16PM -0500, Luiz Capitulino wrote:
> On Wed, 28 Jan 2015 10:09:50 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Wed, Jan 28, 2015 at 09:18:36AM -0500, Luiz Capitulino wrote:
> > > On Tue, 27 Jan 2015 23:55:08 -0200
> > > Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > 
> > > > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote:
> > > > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote:
> > > > > > Paul,
> > > > > > 
> > > > > > We're running some measurements with cyclictest running inside a
> > > > > > KVM guest where we could observe spinlock contention among rcuc
> > > > > > threads.
> > > > > > 
> > > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT.
> > > > > > This machine and the guest run the RT kernel. As our test-case
> > > > > > requires an application in the guest taking 100% of the CPU, the
> > > > > > RT priority configuration that gives the best latency is this one:
> > > > > > 
> > > > > >  263  FF   3  [rcuc/15]
> > > > > >   13  FF   3  [rcub/1]
> > > > > >   12  FF   3  [rcub/0]
> > > > > >  265  FF   2  [ksoftirqd/15]
> > > > > > 3181  FF   1  qemu-kvm
> > > > > > 
> > > > > > In this configuration, the rcuc can preempt the guest's vcpu
> > > > > > thread. This shouldn't be a problem, except for the fact that
> > > > > > we're seeing that in some cases the rcuc/15 thread spends 10us
> > > > > > or more spinning in this spinlock (note that IRQs are disabled
> > > > > > during this period):
> > > > > > 
> > > > > > __rcu_process_callbacks()
> > > > > > {
> > > > > > ...
> > > > > > 	local_irq_save(flags);
> > > > > > 	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > > 		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > > 		rcu_start_gp(rsp);
> > > > > > 		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > ...
> > > > > 
> > > > > Life can be hard when irq-disabled spinlocks can be preempted!  But how
> > > > > often does this happen?  
> > > 
> > > I have to run cyclictest in the guest for 16m a few times to reproduce it.
> > 
> > So you are seeing the high contention in the guest, correct?
> 
> No, it's in the host.

OK, good to know.  ;-)

> > > > > Also, does this happen on smaller systems, for
> > > > > example, with four or eight CPUs?  
> > > 
> > > Didn't test.
> > > 
> > > > > And I confess to be a bit surprised
> > > > > that you expect real-time response from a guest that is subject to
> > > > > preemption -- as I understand it, the usual approach is to give RT guests
> > > > > their own CPUs.
> > > > > 
> > > > > Or am I missing something?
> > > > 
> > > > We are trying to avoid relying on the guest VCPU to voluntarily yield
> > > > the CPU therefore allowing the critical services (such as rcu callback 
> > > > processing and sched tick processing) to execute.
> > > 
> > > Yes. I hope I won't regret saying this, but what I'm observing is that
> > > preempting-off the vcpu is not the end of the world as long as you're
> > > quick.
> > 
> > And as long as you get lucky and avoid preempting a VCPU that happens to
> > be holding a critical lock.
> 
> That's not the case. Everything I mentioned in this thread about RCU
> and contention happens in the host.
> 
> > Look, if you want real-time response in a guest OS, there simply is no
> > substitute for ensuring that the guest has its own CPUs that are not used
> > for anything else, either by anything in the host or by another guest.
> > If you do allow preemption of a guest OS that might be holding a critical
> > guest-OS lock, you are going to see latency blows.  Count on it!  ;-)
> > 
> > > > > > We've tried playing with the rcu_nocbs= option. However, it
> > > > > > did not help because, for reasons we don't understand, the rcuc
> > > > > > threads have to handle grace period start even when callback
> > > > > > offloading is used. Handling this case requires this code path
> > > > > > to be executed.
> > > > > 
> > > > > Yep.  The rcu_nocbs= option offloads invocation of RCU callbacks, but not
> > > > > the per-CPU work required to inform RCU of quiescent states.
> > > > 
> > > > Can't you execute that on vCPU entry/exit? Those are quiescent states
> > > > after all.
> > > > 
> > > > > > We've cooked the following extremely dirty patch, just to see
> > > > > > what would happen:
> > > > > > 
> > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > > > index eaed1ef..c0771cc 100644
> > > > > > --- a/kernel/rcutree.c
> > > > > > +++ b/kernel/rcutree.c
> > > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > > > > >  	/* Does this CPU require a not-yet-started grace period? */
> > > > > >  	local_irq_save(flags);
> > > > > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > > -		rcu_start_gp(rsp);
> > > > > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > +		for (;;) {
> > > > > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > > > > +				local_irq_restore(flags);
> > > > > > +				local_bh_enable();
> > > > > > +				schedule_timeout_interruptible(2);
> > > > > 
> > > > > Yes, the above will get you a splat in mainline kernels, which do not
> > > > > necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> > > > > 
> > > > > > +				local_bh_disable();
> > > > > > +				local_irq_save(flags);
> > > > > > +				continue;
> > > > > > +			}
> > > > > > +			rcu_start_gp(rsp);
> > > > > > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > +			break;
> > > > > > +		}
> > > > > >  	} else {
> > > > > >  		local_irq_restore(flags);
> > > > > >  	}
> > > > > > 
> > > > > > With this patch rcuc is gone from our traces and the scheduling
> > > > > > latency is reduced by 3us in our CPU-bound test-case.
> > > > > > 
> > > > > > Could you please advice on how to solve this contention problem?
> > > > > 
> > > > > The usual advice would be to configure the system such that the guest's
> > > > > VCPUs do not get preempted.
> > > > 
> > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> > > > spinning). In that case, rcuc would never execute, because it has a 
> > > > lower priority than guest VCPUs.
> > > > 
> > > > I do not think we want that.
> > > > 
> > > > > Or is the contention on the root rcu_node structure's ->lock field
> > > > > high for some other reason?
> > > 
> > > I didn't go far on trying to determine the reason. What I observed
> > > was the rcuc preempting-off the vcpu and taking 10us+. I debugged it
> > > and most of this time it spends spinning on the spinlock. The patch
> > > above makes the rcuc disappear from our traces. This is all I've got.
> > > I could try to debug it further if you have suggestions on how to
> > > trace the cause.
> > 
> > My current guess is that either:
> > 
> > 1.	You are allowing the host or another guest to preempt this
> > 	guest's VCPU.  Don't do that.  ;-)
> 
> We do allow the rcuc kthread to preempt the guest's vCPU (not other
> guests). The reason for this is that the workload running inside the
> guest may take 100% of the CPU, which won't allow the rcuc thread
> to ever execute.
> 
> > 2.	You are letting the rcuc kthreads contend for the worker CPUs.
> > 	Pin them to housekeeping CPUs.  This applies to both the
> > 	host and the guest rcuc kthreads, but especially to the
> > 	host rcuc kthreads.
> 
> I'd love to be able to do this, but the rcuc threads are CPU-bound
> threads. There's one per CPU and the kernel doesn't allow me to move
> them around.

Can you build with CONFIG_RCU_BOOST=n?  Then you won't have any rcuc
kthreads.  If you are preventing preemption of the VCPUs, you should
not need RCU priority boosting.
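
In .config terms that is just (illustrative fragment):

	# CONFIG_RCU_BOOST is not set

since in these trees the per-CPU rcuc kthreads come along with the
boosting machinery.  The trade-off is the one above: no boosting of
preempted readers, which should be acceptable as long as the vCPUs
are not preempted for long.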

							Thanx, Paul

> > 
> > Or am I still unclear on your goals and configuration?
> > 
> > 							Thanx, Paul
> > 
> > > > Luiz?
> > > > 
> > > > > > Can we test whether the local CPU is nocb, and in that case, 
> > > > > > skip rcu_start_gp entirely for example?
> > > > > 
> > > > > If you do that, you can see system hangs due to needed grace periods never
> > > > > getting started.
> > > > 
> > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> > > > necessary for nocb CPUs to execute rcu_start_gp?
> > > > 
> > > > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF?
> > > > > If you are using a smaller value, it would be possible to rework the
> > > > > code to reduce contention on ->lock, though if a VCPU does get preempted
> > > > > while holding the root rcu_node structure's ->lock, life will be hard.
> > > > 
> > > > Its a raw spinlock, isnt it?
> > > > 
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-01-28 19:00           ` Paul E. McKenney
@ 2015-01-28 19:06             ` Luiz Capitulino
  0 siblings, 0 replies; 23+ messages in thread
From: Luiz Capitulino @ 2015-01-28 19:06 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Marcelo Tosatti, linux-rt-users

On Wed, 28 Jan 2015 11:00:47 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> > > 2.	You are letting the rcuc kthreads contend for the worker CPUs.
> > > 	Pin them to housekeeping CPUs.  This applies to both the
> > > 	host and the guest rcuc kthreads, but especially to the
> > > 	host rcuc kthreads.
> > 
> > I'd love to be able to do this, but the rcuc threads are CPU-bound
> > threads. There's one per CPU and the kernel doesn't allow me to move
> > them around.
> 
> Can you build with CONFIG_RCU_BOOST=n?  Then you won't have any rcuc
> kthreads.  

Oh, really? I will try this right away!

Thanks for your help!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-01-28 18:55         ` Paul E. McKenney
@ 2015-01-29 17:06           ` Steven Rostedt
  2015-01-29 18:11             ` Paul E. McKenney
  2015-01-29 18:13           ` Marcelo Tosatti
  2015-02-02 18:24           ` Marcelo Tosatti
  2 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2015-01-29 17:06 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Marcelo Tosatti, Luiz Capitulino, linux-rt-users

On Wed, 28 Jan 2015 10:55:53 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> Then your only hope is to prevent the host (and other guests) from
> preempting the real-time guest.

Right!

I think there's a miscommunication here.

Basically what is needed is to run the RT guest on a CPU by itself. We
can all agree on that. That guest runs at a high priority where nothing
should preempt it. We should enable NO_HZ_FULL, and move as much off of
that CPU as possible (including rcu callbacks).

I'm not sure if the code does this or not, but I believe it does. When
we enter the guest, the host should be in an RCU quiescent state, where
RCU will ignore the CPU that is running the guest. Remember, we are only
talking about interactions of the host, not the workings of the guest.

Once this isolation happens, the guest should be running in a state
where it can handle RT reaction times for its own processes (if the
guest OS supports it). The guest shouldn't be preempted by anything
unless it does something that requires a service (interacting with
the network or another baremetal device), in which case it will need
to do the same things that any RT task must do.

I think all this is feasible.

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-01-29 17:06           ` Steven Rostedt
@ 2015-01-29 18:11             ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2015-01-29 18:11 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Marcelo Tosatti, Luiz Capitulino, linux-rt-users

On Thu, Jan 29, 2015 at 12:06:44PM -0500, Steven Rostedt wrote:
> On Wed, 28 Jan 2015 10:55:53 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > Then your only hope is to prevent the host (and other guests) from
> > preempting the real-time guest.
> 
> Right!
> 
> I think there's a miscommunication here.

I can easily believe that!

> Basically what is needed is to run the RT guest on a CPU by itself. We
> can all agree on that. That guest runs at a high priority where nothing
> should preempt it. We should enable NO_HZ_FULL, and move as much off of
> that CPU as possible (including rcu callbacks).
> 
> I'm not sure if the code does this or not, but I believe it does. When
> we enter the guest, the host should be in an RCU quiescent state, where
> RCU will ignore the CPU that is running the guest. Remember, we are only
> talking about interactions of the host, not the workings of the guest.

NO_HZ_FULL will automatically tell RCU about the guest-execution quiescent
state because the guest is seen by the host as user-mode execution.
(Right?  Or is KVM treating this specially such that RCU doesn't see
guest execution as a quiescent state?  I think this is currently handled
correctly, because if it wasn't, you would get RCU CPU stall warning
messages.)

> Once this isolation happens, then the guest should be running in a
> state that it could handle RT reaction times for its own processes (if
> the guest OS supports it). The guest shouldn't be preempted by anything
> unless it does something that requires a service (interacting with the
> network or other baremetal device), then it will need to do the same
> things that any RT task must do.

Agreed!

> I think all this is feasible.

The one thing that gives me pause is the high contention on the root
(AKA only) rcu_node structure's ->lock field.  If this persists, one
thing to try would be to build with CONFIG_RCU_FANOUT_LEAF=8 (or 4).
If that helps, it would be worthwhile to do some tracing or lock
profiling to see about reducing the ->lock contention for the default
CONFIG_RCU_FANOUT_LEAF=16.
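
On a 16-CPU box that experiment is a one-line Kconfig change
(illustrative fragment):

	CONFIG_RCU_FANOUT_LEAF=8

which spreads the sixteen CPUs over two leaf rcu_node structures
beneath the root instead of having them all share the single root
node, so a good part of the per-CPU ->lock traffic lands on a leaf
rather than the root.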

My first thought when I saw the high contention was to introduce
funnel locking for grace-period start, but that is unlikely to help
in cases where there is only one rcu_node structure.  ;-)
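
For those unfamiliar with the term, funnel locking would look roughly
like the sketch below (conceptual only -- the gp_requested field is
made up and this is not RCU's actual code): each CPU records its
request at its leaf rcu_node and walks toward the root only if no
earlier request is already recorded there, which is also why it buys
nothing when the root is the only node:

/* Conceptual sketch of a funnel-locked grace-period request. */
static void funnel_request_gp(struct rcu_node *rnp_leaf,
			      unsigned long gp_wanted)
{
	struct rcu_node *rnp;

	for (rnp = rnp_leaf; rnp != NULL; rnp = rnp->parent) {
		raw_spin_lock(&rnp->lock);
		if (rnp->gp_requested >= gp_wanted) {
			/* Already funnelled up by someone else. */
			raw_spin_unlock(&rnp->lock);
			return;
		}
		rnp->gp_requested = gp_wanted;
		raw_spin_unlock(&rnp->lock);
		/* First requester per node carries it one level up. */
	}
}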

							Thanx, Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-01-28 18:55         ` Paul E. McKenney
  2015-01-29 17:06           ` Steven Rostedt
@ 2015-01-29 18:13           ` Marcelo Tosatti
  2015-01-29 18:36             ` Paul E. McKenney
  2015-02-02 18:24           ` Marcelo Tosatti
  2 siblings, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2015-01-29 18:13 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Luiz Capitulino, linux-rt-users


On Wed, Jan 28, 2015 at 10:55:53AM -0800, Paul E. McKenney wrote:
> > The host. Imagine a Windows 95 guest running a realtime app.
> > That should help.
> 
> Then force the critical services to run on a housekeeping CPU.  If the
> host is permitted to preempt the guest, the latency blows you are seeing
> are expected behavior.

ksoftirqd, for example, must preempt the vcpu as it executes
irq_work routines.

IRQ threads must preempt the vcpu to inject HW interrupts
into the guest.

> automatically.  If that is infeasible, then yes, it should be possible
> to add an explicit quiescent state in the host at vCPU entry/exit, at
> least assuming that the host is in a state permitting this.
> 
> > > > > > We've cooked the following extremely dirty patch, just to see
> > > > > > what would happen:
> > > > > > 
> > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > > > index eaed1ef..c0771cc 100644
> > > > > > --- a/kernel/rcutree.c
> > > > > > +++ b/kernel/rcutree.c
> > > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > > > > >  	/* Does this CPU require a not-yet-started grace period? */
> > > > > >  	local_irq_save(flags);
> > > > > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > > -		rcu_start_gp(rsp);
> > > > > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > +		for (;;) {
> > > > > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > > > > +				local_irq_restore(flags);
> > > > > > +				local_bh_enable();
> > > > > > +				schedule_timeout_interruptible(2);
> > > > > 
> > > > > Yes, the above will get you a splat in mainline kernels, which do not
> > > > > necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> > > > > 
> > > > > > +				local_bh_disable();
> > > > > > +				local_irq_save(flags);
> > > > > > +				continue;
> > > > > > +			}
> > > > > > +			rcu_start_gp(rsp);
> > > > > > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > +			break;
> > > > > > +		}
> > > > > >  	} else {
> > > > > >  		local_irq_restore(flags);
> > > > > >  	}
> > > > > > 
> > > > > > With this patch rcuc is gone from our traces and the scheduling
> > > > > > latency is reduced by 3us in our CPU-bound test-case.
> > > > > > 
> > > > > > Could you please advice on how to solve this contention problem?
> > > > > 
> > > > > The usual advice would be to configure the system such that the guest's
> > > > > VCPUs do not get preempted.
> > > > 
> > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> > > > spinning). In that case, rcuc would never execute, because it has a 
> > > > lower priority than guest VCPUs.
> > > 
> > > OK, this leads me to believe that you are talking about the rcuc kthreads
> > > in the host, not the guest.  In which case the usual approach is to
> > > reserve a CPU or two on the host which never runs guest VCPUs, and to
> > > force the rcuc kthreads there.  Note that CONFIG_NO_HZ_FULL will do this
> > > automatically for you, reserving the boot CPU.  And CONFIG_NO_HZ_FULL
> > > might well be very useful in this scenario.  And reserving a CPU or two
> > > for housekeeping purposes is quite common for heavy CPU-bound workloads.
> > > 
> > > Of course, you need to make sure that the reserved CPU or two is sufficient
> > > for all the rcuc kthreads, but if your guests are mostly CPU bound, this
> > > should not be a problem.
> > > 
> > > > I do not think we want that.
> > > 
> > > Assuming "that" is "rcuc would never execute" -- agreed, that would be
> > > very bad.  You would eventually OOM the system.
> > > 
> > > > > Or is the contention on the root rcu_node structure's ->lock field
> > > > > high for some other reason?
> > > > 
> > > > Luiz?
> > > > 
> > > > > > Can we test whether the local CPU is nocb, and in that case, 
> > > > > > skip rcu_start_gp entirely for example?
> > > > > 
> > > > > If you do that, you can see system hangs due to needed grace periods never
> > > > > getting started.
> > > > 
> > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> > > > necessary for nocb CPUs to execute rcu_start_gp?
> > > 
> > > Sigh.  Are we in the host or the guest OS at this point?
> > 
> > Host.
> 
> Can you build the host with NO_HZ_FULL and boot with nohz_full=?
> That should get rid of of much of your problems here.
> 
> > > In any case, if you want the best real-time response for a CPU-bound
> > > workload on a given CPU, careful use of NO_HZ_FULL would prevent
> > > that CPU from ever invoking __rcu_process_callbacks() in the first
> > > place, which would have the beneficial side effect of preventing
> > > __rcu_process_callbacks() from ever invoking rcu_start_gp().
> > > 
> > > Of course, NO_HZ_FULL does have the drawback of increasing the cost
> > > of user-kernel transitions.
> > 
> > We need periodic processing of __run_timers to keep timer wheel
> > processing from falling behind too much.
> > 
> > See http://www.gossamer-threads.com/lists/linux/kernel/2094151.
> 
> Hmmm...  Do you have the following commits in your build?
> 
> fff421580f51 timers: Track total number of timers in list
> d550e81dc0dd timers: Reduce __run_timers() latency for empty list
> 16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list
> 18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list
> aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0
> 
> Keeping extraneous processing off of the CPUs running the real-time
> guest will minimize the number of timers, allowing these commits to
> do their jobs.

Clocksource watchdog:

        /*
         * Cycle through CPUs to check if the CPUs stay synchronized
         * to each other.
         */
        next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
        if (next_cpu >= nr_cpu_ids)
                next_cpu = cpumask_first(cpu_online_mask);
        watchdog_timer.expires += WATCHDOG_INTERVAL;
        add_timer_on(&watchdog_timer, next_cpu);

OK to disable...
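
One way to keep that cross-CPU timer away from the isolated cores,
assuming the TSC on this box really is trustworthy, is believed to be
marking it stable on the host command line:

	tsc=reliable

which should keep the clocksource watchdog from being armed for the
TSC at all -- at the cost of losing the runtime sanity check it
provides.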

MCE:

   2   1317  ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_timer_fn>>
             add_timer_on(t, smp_processor_id());
   3   1335  ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_timer_kick>>
             add_timer_on(t, smp_processor_id());
   4   1657  ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_start_timer>>
             add_timer_on(t, cpu);

Unsure how realistic it is to expect to exclude all add_timer_on
and queue_delayed_work_on users.

NOK to disable, I suppose.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-01-29 18:13           ` Marcelo Tosatti
@ 2015-01-29 18:36             ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2015-01-29 18:36 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Luiz Capitulino, linux-rt-users

On Thu, Jan 29, 2015 at 04:13:24PM -0200, Marcelo Tosatti wrote:
> 
> On Wed, Jan 28, 2015 at 10:55:53AM -0800, Paul E. McKenney wrote:
> > > The host. Imagine a Windows 95 guest running a realtime app.
> > > That should help.
> > 
> > Then force the critical services to run on a housekeeping CPU.  If the
> > host is permitted to preempt the guest, the latency blows you are seeing
> > are expected behavior.
> 
> ksoftirqd must preempt the vcpu as it executes irq_work
> routines for example.
> 
> IRQ threads must preempt the vcpu to inject HW interrupts
> to the guest.

Understood, and hopefully these short preemptions are not causing excessive
trouble.

And my concern with this was partly due to my assumption that you were
seeing high lock contention in the guest.

> > automatically.  If that is infeasible, then yes, it should be possible
> > to add an explicit quiescent state in the host at vCPU entry/exit, at
> > least assuming that the host is in a state permitting this.
> > 
> > > > > > > We've cooked the following extremely dirty patch, just to see
> > > > > > > what would happen:
> > > > > > > 
> > > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > > > > index eaed1ef..c0771cc 100644
> > > > > > > --- a/kernel/rcutree.c
> > > > > > > +++ b/kernel/rcutree.c
> > > > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > > > > > >  	/* Does this CPU require a not-yet-started grace period? */
> > > > > > >  	local_irq_save(flags);
> > > > > > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > > > -		rcu_start_gp(rsp);
> > > > > > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > > +		for (;;) {
> > > > > > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > > > > > +				local_irq_restore(flags);
> > > > > > > +				local_bh_enable();
> > > > > > > +				schedule_timeout_interruptible(2);
> > > > > > 
> > > > > > Yes, the above will get you a splat in mainline kernels, which do not
> > > > > > necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> > > > > > 
> > > > > > > +				local_bh_disable();
> > > > > > > +				local_irq_save(flags);
> > > > > > > +				continue;
> > > > > > > +			}
> > > > > > > +			rcu_start_gp(rsp);
> > > > > > > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > >  	} else {
> > > > > > >  		local_irq_restore(flags);
> > > > > > >  	}
> > > > > > > 
> > > > > > > With this patch rcuc is gone from our traces and the scheduling
> > > > > > > latency is reduced by 3us in our CPU-bound test-case.
> > > > > > > 
> > > > > > > Could you please advice on how to solve this contention problem?
> > > > > > 
> > > > > > The usual advice would be to configure the system such that the guest's
> > > > > > VCPUs do not get preempted.
> > > > > 
> > > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> > > > > spinning). In that case, rcuc would never execute, because it has a 
> > > > > lower priority than guest VCPUs.
> > > > 
> > > > OK, this leads me to believe that you are talking about the rcuc kthreads
> > > > in the host, not the guest.  In which case the usual approach is to
> > > > reserve a CPU or two on the host which never runs guest VCPUs, and to
> > > > force the rcuc kthreads there.  Note that CONFIG_NO_HZ_FULL will do this
> > > > automatically for you, reserving the boot CPU.  And CONFIG_NO_HZ_FULL
> > > > might well be very useful in this scenario.  And reserving a CPU or two
> > > > for housekeeping purposes is quite common for heavy CPU-bound workloads.
> > > > 
> > > > Of course, you need to make sure that the reserved CPU or two is sufficient
> > > > for all the rcuc kthreads, but if your guests are mostly CPU bound, this
> > > > should not be a problem.
> > > > 
> > > > > I do not think we want that.
> > > > 
> > > > Assuming "that" is "rcuc would never execute" -- agreed, that would be
> > > > very bad.  You would eventually OOM the system.
> > > > 
> > > > > > Or is the contention on the root rcu_node structure's ->lock field
> > > > > > high for some other reason?
> > > > > 
> > > > > Luiz?
> > > > > 
> > > > > > > Can we test whether the local CPU is nocb, and in that case, 
> > > > > > > skip rcu_start_gp entirely for example?
> > > > > > 
> > > > > > If you do that, you can see system hangs due to needed grace periods never
> > > > > > getting started.
> > > > > 
> > > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> > > > > necessary for nocb CPUs to execute rcu_start_gp?
> > > > 
> > > > Sigh.  Are we in the host or the guest OS at this point?
> > > 
> > > Host.
> > 
> > Can you build the host with NO_HZ_FULL and boot with nohz_full=?
> > That should get rid of of much of your problems here.
> > 
> > > > In any case, if you want the best real-time response for a CPU-bound
> > > > workload on a given CPU, careful use of NO_HZ_FULL would prevent
> > > > that CPU from ever invoking __rcu_process_callbacks() in the first
> > > > place, which would have the beneficial side effect of preventing
> > > > __rcu_process_callbacks() from ever invoking rcu_start_gp().
> > > > 
> > > > Of course, NO_HZ_FULL does have the drawback of increasing the cost
> > > > of user-kernel transitions.
> > > 
> > > We need periodic processing of __run_timers to keep timer wheel
> > > processing from falling behind too much.
> > > 
> > > See http://www.gossamer-threads.com/lists/linux/kernel/2094151.
> > 
> > Hmmm...  Do you have the following commits in your build?
> > 
> > fff421580f51 timers: Track total number of timers in list
> > d550e81dc0dd timers: Reduce __run_timers() latency for empty list
> > 16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list
> > 18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list
> > aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0
> > 
> > Keeping extraneous processing off of the CPUs running the real-time
> > guest will minimize the number of timers, allowing these commits to
> > do their jobs.
> 
> Clocksource watchdog:
> 
>         /*
>          * Cycle through CPUs to check if the CPUs stay synchronized
>          * to each other.
>          */
>         next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
>         if (next_cpu >= nr_cpu_ids)
>                 next_cpu = cpumask_first(cpu_online_mask);
>         watchdog_timer.expires += WATCHDOG_INTERVAL;
>         add_timer_on(&watchdog_timer, next_cpu);
> 
> OK to disable...

I have to defer to John Stultz and Thomas Gleixner on this one.

> MCE:
> 
>    2   1317  ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_timer_fn>>
>              add_timer_on(t, smp_processor_id());
>    3   1335  ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_timer_kick>>
>              add_timer_on(t, smp_processor_id());
>    4   1657  ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_start_timer>>
>              add_timer_on(t, cpu);
> 
> Unsure how realistic the expectation to be able to exclude add_timer_on
> and queue_delayed_work_on users is.
> 
> NOK to disable, i suppose.

And I must defer to x86 MCE experts on this one.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-01-28 18:55         ` Paul E. McKenney
  2015-01-29 17:06           ` Steven Rostedt
  2015-01-29 18:13           ` Marcelo Tosatti
@ 2015-02-02 18:24           ` Marcelo Tosatti
  2015-02-02 20:35             ` Steven Rostedt
  2 siblings, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2015-02-02 18:24 UTC (permalink / raw)
  To: Paul E. McKenney, Steven Rostedt, Steven Rostedt
  Cc: Luiz Capitulino, linux-rt-users

On Wed, Jan 28, 2015 at 10:55:53AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 28, 2015 at 04:25:12PM -0200, Marcelo Tosatti wrote:
> > On Wed, Jan 28, 2015 at 10:03:35AM -0800, Paul E. McKenney wrote:
> > > On Tue, Jan 27, 2015 at 11:55:08PM -0200, Marcelo Tosatti wrote:
> > > > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote:
> > > > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote:
> > > > > > Paul,
> > > > > > 
> > > > > > We're running some measurements with cyclictest running inside a
> > > > > > KVM guest where we could observe spinlock contention among rcuc
> > > > > > threads.
> > > > > > 
> > > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT.
> > > > > > This machine and the guest run the RT kernel. As our test-case
> > > > > > requires an application in the guest taking 100% of the CPU, the
> > > > > > RT priority configuration that gives the best latency is this one:
> > > > > > 
> > > > > >  263  FF   3  [rcuc/15]
> > > > > >   13  FF   3  [rcub/1]
> > > > > >   12  FF   3  [rcub/0]
> > > > > >  265  FF   2  [ksoftirqd/15]
> > > > > > 3181  FF   1  qemu-kvm
> > > > > > 
> > > > > > In this configuration, the rcuc can preempt the guest's vcpu
> > > > > > thread. This shouldn't be a problem, except for the fact that
> > > > > > we're seeing that in some cases the rcuc/15 thread spends 10us
> > > > > > or more spinning in this spinlock (note that IRQs are disabled
> > > > > > during this period):
> > > > > > 
> > > > > > __rcu_process_callbacks()
> > > > > > {
> > > > > > ...
> > > > > > 	local_irq_save(flags);
> > > > > > 	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > > 		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > > 		rcu_start_gp(rsp);
> > > > > > 		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > ...
> > > > > 
> > > > > Life can be hard when irq-disabled spinlocks can be preempted!  But how
> > > > > often does this happen?  Also, does this happen on smaller systems, for
> > > > > example, with four or eight CPUs?  And I confess to be a bit surprised
> > > > > that you expect real-time response from a guest that is subject to
> > > > > preemption -- as I understand it, the usual approach is to give RT guests
> > > > > their own CPUs.
> > > > > 
> > > > > Or am I missing something?
> > > > 
> > > > We are trying to avoid relying on the guest VCPU to voluntarily yield
> > > > the CPU therefore allowing the critical services (such as rcu callback 
> > > > processing and sched tick processing) to execute.
> > > 
> > > These critical services executing in the context of the host?
> > > (If not, I am confused.  Actually, I am confused either way...)
> > 
> > The host. Imagine a Windows 95 guest running a realtime app.
> > That should help.
> 
> Then force the critical services to run on a housekeeping CPU.  If the
> host is permitted to preempt the guest, the latency blows you are seeing
> are expected behavior.
> 
> > > > > > We've tried playing with the rcu_nocbs= option. However, it
> > > > > > did not help because, for reasons we don't understand, the rcuc
> > > > > > threads have to handle grace period start even when callback
> > > > > > offloading is used. Handling this case requires this code path
> > > > > > to be executed.
> > > > > 
> > > > > Yep.  The rcu_nocbs= option offloads invocation of RCU callbacks, but not
> > > > > the per-CPU work required to inform RCU of quiescent states.
> > > > 
> > > > Can't you execute that on vCPU entry/exit? Those are quiescent states
> > > > after all.
> > > 
> > > I am guessing that we are talking about quiescent states in the guest.
> > 
> > Host.
> > 
> > > If so, can't vCPU entry/exit operations happen in guest interrupt
> > > handlers?  If so, these operations are not necessarily quiescent states.
> > 
> > vCPU entry/exit are quiescent states in the host.
> 
> As is execution in the guest.  If you build the host with NO_HZ_FULL
> and boot with the appropriate nohz_full= parameter, this will happen
> automatically.  If that is infeasible, then yes, it should be possible
> to add an explicit quiescent state in the host at vCPU entry/exit, at
> least assuming that the host is in a state permitting this.
> 
> > > > > > We've cooked the following extremely dirty patch, just to see
> > > > > > what would happen:
> > > > > > 
> > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > > > index eaed1ef..c0771cc 100644
> > > > > > --- a/kernel/rcutree.c
> > > > > > +++ b/kernel/rcutree.c
> > > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > > > > >  	/* Does this CPU require a not-yet-started grace period? */
> > > > > >  	local_irq_save(flags);
> > > > > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > > -		rcu_start_gp(rsp);
> > > > > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > +		for (;;) {
> > > > > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > > > > +				local_irq_restore(flags);
> > > > > > +				local_bh_enable();
> > > > > > +				schedule_timeout_interruptible(2);
> > > > > 
> > > > > Yes, the above will get you a splat in mainline kernels, which do not
> > > > > necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> > > > > 
> > > > > > +				local_bh_disable();
> > > > > > +				local_irq_save(flags);
> > > > > > +				continue;
> > > > > > +			}
> > > > > > +			rcu_start_gp(rsp);
> > > > > > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > +			break;
> > > > > > +		}
> > > > > >  	} else {
> > > > > >  		local_irq_restore(flags);
> > > > > >  	}
> > > > > > 
> > > > > > With this patch rcuc is gone from our traces and the scheduling
> > > > > > latency is reduced by 3us in our CPU-bound test-case.
> > > > > > 
> > > > > > Could you please advice on how to solve this contention problem?
> > > > > 
> > > > > The usual advice would be to configure the system such that the guest's
> > > > > VCPUs do not get preempted.
> > > > 
> > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> > > > spinning). In that case, rcuc would never execute, because it has a 
> > > > lower priority than guest VCPUs.
> > > 
> > > OK, this leads me to believe that you are talking about the rcuc kthreads
> > > in the host, not the guest.  In which case the usual approach is to
> > > reserve a CPU or two on the host which never runs guest VCPUs, and to
> > > force the rcuc kthreads there.  Note that CONFIG_NO_HZ_FULL will do this
> > > automatically for you, reserving the boot CPU.  And CONFIG_NO_HZ_FULL
> > > might well be very useful in this scenario.  And reserving a CPU or two
> > > for housekeeping purposes is quite common for heavy CPU-bound workloads.
> > > 
> > > Of course, you need to make sure that the reserved CPU or two is sufficient
> > > for all the rcuc kthreads, but if your guests are mostly CPU bound, this
> > > should not be a problem.
> > > 
> > > > I do not think we want that.
> > > 
> > > Assuming "that" is "rcuc would never execute" -- agreed, that would be
> > > very bad.  You would eventually OOM the system.
> > > 
> > > > > Or is the contention on the root rcu_node structure's ->lock field
> > > > > high for some other reason?
> > > > 
> > > > Luiz?
> > > > 
> > > > > > Can we test whether the local CPU is nocb, and in that case, 
> > > > > > skip rcu_start_gp entirely for example?
> > > > > 
> > > > > If you do that, you can see system hangs due to needed grace periods never
> > > > > getting started.
> > > > 
> > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> > > > necessary for nocb CPUs to execute rcu_start_gp?
> > > 
> > > Sigh.  Are we in the host or the guest OS at this point?
> > 
> > Host.
> 
> Can you build the host with NO_HZ_FULL and boot with nohz_full=?
> That should get rid of of much of your problems here.
> 
> > > In any case, if you want the best real-time response for a CPU-bound
> > > workload on a given CPU, careful use of NO_HZ_FULL would prevent
> > > that CPU from ever invoking __rcu_process_callbacks() in the first
> > > place, which would have the beneficial side effect of preventing
> > > __rcu_process_callbacks() from ever invoking rcu_start_gp().
> > > 
> > > Of course, NO_HZ_FULL does have the drawback of increasing the cost
> > > of user-kernel transitions.
> > 
> > We need periodic processing of __run_timers to keep timer wheel
> > processing from falling behind too much.
> > 
> > See http://www.gossamer-threads.com/lists/linux/kernel/2094151.
> 
> Hmmm...  Do you have the following commits in your build?
> 
> fff421580f51 timers: Track total number of timers in list
> d550e81dc0dd timers: Reduce __run_timers() latency for empty list
> 16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list
> 18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list
> aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0
> 
> Keeping extraneous processing off of the CPUs running the real-time
> guest will minimize the number of timers, allowing these commits to
> do their jobs.

Steven,

The second commit, d550e81dc0dd should be part of -RT, and currently is
not, because:

-> Any IRQ work item will raise timer softirq.
-> __run_timers will do a full round of processing,
ruining latency.

Even without any timer pending on the timer wheel.

And about NO_HZ_FULL and -RT, is it correct that NO_HZ_FULL
renders

commit 1a2de830b90e364c3bf95e0000173bffcb65ddb7
Author: Steven Rostedt <rostedt@goodmis.org>
Date:   Fri Jan 31 12:07:57 2014 -0500

    timer/rt: Always raise the softirq if there's irq_work to be done

Inactive? Should the softirq be raised from irq_work_queue directly?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-02-02 18:24           ` Marcelo Tosatti
@ 2015-02-02 20:35             ` Steven Rostedt
  2015-02-02 20:46               ` Marcelo Tosatti
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2015-02-02 20:35 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users

On Mon, 2 Feb 2015 16:24:50 -0200
Marcelo Tosatti <mtosatti@redhat.com> wrote:

> > > > In any case, if you want the best real-time response for a CPU-bound
> > > > workload on a given CPU, careful use of NO_HZ_FULL would prevent
> > > > that CPU from ever invoking __rcu_process_callbacks() in the first
> > > > place, which would have the beneficial side effect of preventing
> > > > __rcu_process_callbacks() from ever invoking rcu_start_gp().
> > > > 
> > > > Of course, NO_HZ_FULL does have the drawback of increasing the cost
> > > > of user-kernel transitions.
> > > 
> > > We need periodic processing of __run_timers to keep timer wheel
> > > processing from falling behind too much.
> > > 
> > > See http://www.gossamer-threads.com/lists/linux/kernel/2094151.
> > 
> > Hmmm...  Do you have the following commits in your build?
> > 
> > fff421580f51 timers: Track total number of timers in list
> > d550e81dc0dd timers: Reduce __run_timers() latency for empty list
> > 16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list
> > 18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list
> > aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0
> > 
> > Keeping extraneous processing off of the CPUs running the real-time
> > guest will minimize the number of timers, allowing these commits to
> > do their jobs.
> 
> Steven,
> 
> The second commit, d550e81dc0dd should be part of -RT, and currently is
> not, because:
> 
> -> Any IRQ work item will raise timer softirq.
> -> __run_timers will do a full round of processing,
> ruining latency.

Was this discussed?

> 
> Even without any timer pending on the timer wheel.
> 
> And about NO_HZ_FULL and -RT, is it correct that NO_HZ_FULL
> renders
> 
> commit 1a2de830b90e364c3bf95e0000173bffcb65ddb7
> Author: Steven Rostedt <rostedt@goodmis.org>
> Date:   Fri Jan 31 12:07:57 2014 -0500
> 
>     timer/rt: Always raise the softirq if there's irq_work to be done
> 
> Inactive? Should raise softirq from irq_work_queue directly?

What do you mean raise from irq_work_queue directly? When irq work
needs to be done, that usually is because something happened in a
context where you can not wake up a process (like raise_softirq might
do). The irq_work itself could raise the softirq, but as it takes the
softirq to trigger the irq_work, you are stuck in a catch-22 there.

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-02-02 20:35             ` Steven Rostedt
@ 2015-02-02 20:46               ` Marcelo Tosatti
  2015-02-02 20:55                 ` Steven Rostedt
  0 siblings, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2015-02-02 20:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users

On Mon, Feb 02, 2015 at 03:35:53PM -0500, Steven Rostedt wrote:
> On Mon, 2 Feb 2015 16:24:50 -0200
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > > > > In any case, if you want the best real-time response for a CPU-bound
> > > > > workload on a given CPU, careful use of NO_HZ_FULL would prevent
> > > > > that CPU from ever invoking __rcu_process_callbacks() in the first
> > > > > place, which would have the beneficial side effect of preventing
> > > > > __rcu_process_callbacks() from ever invoking rcu_start_gp().
> > > > > 
> > > > > Of course, NO_HZ_FULL does have the drawback of increasing the cost
> > > > > of user-kernel transitions.
> > > > 
> > > > We need periodic processing of __run_timers to keep timer wheel
> > > > processing from falling behind too much.
> > > > 
> > > > See http://www.gossamer-threads.com/lists/linux/kernel/2094151.
> > > 
> > > Hmmm...  Do you have the following commits in your build?
> > > 
> > > fff421580f51 timers: Track total number of timers in list
> > > d550e81dc0dd timers: Reduce __run_timers() latency for empty list
> > > 16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list
> > > 18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list
> > > aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0
> > > 
> > > Keeping extraneous processing off of the CPUs running the real-time
> > > guest will minimize the number of timers, allowing these commits to
> > > do their jobs.
> > 
> > Steven,
> > 
> > The second commit, d550e81dc0dd should be part of -RT, and currently is
> > not, because:
> > 
> > -> Any IRQ work item will raise timer softirq.
> > -> __run_timers will do a full round of processing,
> > ruining latency.
> 
> Was this discussed?

Discussed where?

The point is this: __run_timers has horrible latency.
How to avoid it: configure the system in such a way that no timers 
(old interface, add_timers) expire on the local CPU.

The patches Paul listed above limit the issue by allowing
raise_softirq(TIMER_SOFTIRQ) to be called without having to go
through a full __run_timers pass in the case of no pending timers.
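
For reference, the effect of those commits is roughly the early exit
sketched below (paraphrased, not the literal upstream diffs;
fff421580f51 adds the base->all_timers count that makes the check
possible):

static inline void __run_timers(struct tvec_base *base)
{
	spin_lock_irq(&base->lock);
	if (!base->all_timers) {
		/*
		 * Nothing pending: just catch up the bookkeeping
		 * instead of walking every bucket up to jiffies.
		 */
		base->timer_jiffies = jiffies;
		spin_unlock_irq(&base->lock);
		return;
	}
	/* ... otherwise the usual cascade/expiry loop runs ... */
	spin_unlock_irq(&base->lock);
}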

> > Even without any timer pending on the timer wheel.
> > 
> > And about NO_HZ_FULL and -RT, is it correct that NO_HZ_FULL
> > renders
> > 
> > commit 1a2de830b90e364c3bf95e0000173bffcb65ddb7
> > Author: Steven Rostedt <rostedt@goodmis.org>
> > Date:   Fri Jan 31 12:07:57 2014 -0500
> > 
> >     timer/rt: Always raise the softirq if there's irq_work to be done
> > 
> > Inactive? Should raise softirq from irq_work_queue directly?
> 
> What do you mean raise from irq_work_queue directly? When irq work
> needs to be done, that usually is because something happened in a
> context that you can not wake up a process (like raise_softirq might
> do). The irq_work itself could raise the softirq, but as it takes the
> softirq to trigger the irq_work you are stuck in a catch 22 there.

Then you rely on the sched timer interrupt to notice there is a pending 
irq_work item? 

If you have no sched timer interrupts, then what happens with that
irq_work item?


> 
> -- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-02-02 20:46               ` Marcelo Tosatti
@ 2015-02-02 20:55                 ` Steven Rostedt
  2015-02-02 21:02                   ` Marcelo Tosatti
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2015-02-02 20:55 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users

On Mon, 2 Feb 2015 18:46:59 -0200
Marcelo Tosatti <mtosatti@redhat.com> wrote:

 
> > > The second commit, d550e81dc0dd should be part of -RT, and currently is
> > > not, because:
> > > 
> > > -> Any IRQ work item will raise timer softirq.
> > > -> __run_timers will do a full round of processing,
> > > ruining latency.
> > 
> > Was this discussed?
> 
> Discussed where?

It sounded like that commit was not added because of the above. That's
why I asked whether it was discussed. It sounded like you were saying
that commit d550e81dc0dd should be part of -RT but is not because ...,
which sounds like some decisions were made.

> 
> The point is this: __run_timers has horrible latency.
> How to avoid it: configure the system in such a way that no timers 
> (old interface, add_timers) expire on the local CPU.
> 
> The patches Paul listed above limit the issue allowing
> you to call raise_softirq(TIMER_SOFTIRQ) without having to go
> through __run_timers, in the case of no pending timers.

OK, so you are asking for me to add those patches?


> Then you rely on the sched timer interrupt to notice there is a pending 
> irq_work item? 

On x86, there shouldn't be. irq_work can usually trigger its own
interrupt. In the case that it cannot, it requires the softirq to
trigger when there's irq work to be done.
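
A minimal usage sketch of the interface in question (illustrative
only): the queued function runs from the dedicated irq_work interrupt
where the architecture has one, and otherwise has to be picked up
later, e.g. from the tick/softirq path being discussed here.

#include <linux/irq_work.h>

static struct irq_work my_work;

/* Runs in hard-irq context once the irq_work is dispatched. */
static void my_irq_work_fn(struct irq_work *work)
{
	/* do the wakeup/raise that the queueing context could not */
}

/* One-time setup, e.g. from driver init. */
static void my_setup(void)
{
	init_irq_work(&my_work, my_irq_work_fn);
}

/* Called from NMI or another context where waking a task is unsafe. */
static void report_event(void)
{
	irq_work_queue(&my_work);
}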

> 
> If you have no sched timer interrupts, then what happens with that
> irq_work item?
> 

That's what that patch does. It should trigger some. Hmm, I have to see
if no_hz_full checks irq work too.

But again, if there's no irq_work to do then this should not matter. If
there's irq_work to do, then something on that CPU asked to do irq
work. If you are worried about run_timers, make sure nothing is on that
CPU that can trigger a timer.

Isolation is the *only* way to make that work.

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-02-02 20:55                 ` Steven Rostedt
@ 2015-02-02 21:02                   ` Marcelo Tosatti
  2015-02-03 20:36                     ` Steven Rostedt
  0 siblings, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2015-02-02 21:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users

On Mon, Feb 02, 2015 at 03:55:28PM -0500, Steven Rostedt wrote:
> On Mon, 2 Feb 2015 18:46:59 -0200
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
>  
> > > > The second commit, d550e81dc0dd should be part of -RT, and currently is
> > > > not, because:
> > > > 
> > > > -> Any IRQ work item will raise timer softirq.
> > > > -> __run_timers will do a full round of processing,
> > > > ruining latency.
> > > 
> > > Was this discussed?
> > 
> > Discussed where?
> 
> It sounded like that commit was not added because of the above. That's
> why I asked, was it discussed. Sounded like you were saying that commit
> d550e81dc0dd should be part of -RT but it is not because ..., which
> sounds like there were some decisions made.
> 
> > 
> > The point is this: __run_timers has horrible latency.
> > How to avoid it: configure the system in such a way that no timers 
> > (old interface, add_timers) expire on the local CPU.
> > 
> > The patches Paul listed above limit the issue allowing
> > you to call raise_softirq(TIMER_SOFTIRQ) without having to go
> > through __run_timers, in the case of no pending timers.
> 
> OK, so you are asking for me to add those patches?

Yes.

> > Then you rely on the sched timer interrupt to notice there is a pending 
> > irq_work item? 
> 
> On, x86, there shouldn't be. irq_work can usually trigger its own
> interrupt. In the case that it can not, it requires the softirq to
> trigger when there's irq work to be done.
> 
> > 
> > If you have no sched timer interrupts, then what happens with that
> > irq_work item?
> > 
> 
> That's what that patch does. It should trigger some. Hmm, I have to see
> if no_hz_full checks irq work too.
> 
> But again, of there's no irq_work to do then this should not matter. If
> there's irq_work to do, then something on that CPU asked to do irq
> work. 

Right.

> If you are worried about run_timers, make sure nothing is on that
> CPU that can trigger a timer.

I am worried about two things:

1) Something calling raise_softirq(TIMER_SOFTIRQ) and lack of 
Paul's d550e81dc0dd.

The result is __run_timers checking all timer wheel "nodes"
and updating base->timer_jiffies; latency is ruined.

Even if one carefully made sure no timer is present.

2) Reliance on sched timer interrupt to raise timer softirq 
in case of pending irq work (your patch) AND no_hz_full.

> Isolation is the *only* way to make that work.

Fine. Please see item 1) above.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-02-02 21:02                   ` Marcelo Tosatti
@ 2015-02-03 20:36                     ` Steven Rostedt
  2015-02-03 20:57                       ` Paul E. McKenney
  2015-02-03 23:55                       ` Marcelo Tosatti
  0 siblings, 2 replies; 23+ messages in thread
From: Steven Rostedt @ 2015-02-03 20:36 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users

On Mon, 2 Feb 2015 19:02:29 -0200
Marcelo Tosatti <mtosatti@redhat.com> wrote:

> I am worried about two things:
> 
> 1) Something calling raise_softirq(TIMER_SOFTIRQ) combined with the
> lack of Paul's d550e81dc0dd.
> 
> The result is __run_timers checking all the timer wheel buckets
> and updating base->timer_jiffies, which ruins latency even if
> one has carefully made sure no timer is pending.
> 
> 2) Reliance on the sched timer interrupt to raise the timer softirq
> in case of pending irq_work (your patch) AND nohz_full.
> 
> > Isolation is the *only* way to make that work.
> 
> Fine. Please see item 1) above.
> 

So basically you are saying we just need: d550e81dc0dd ?

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-02-03 20:36                     ` Steven Rostedt
@ 2015-02-03 20:57                       ` Paul E. McKenney
  2015-02-03 23:55                       ` Marcelo Tosatti
  1 sibling, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2015-02-03 20:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Marcelo Tosatti, Steven Rostedt, Luiz Capitulino, linux-rt-users

On Tue, Feb 03, 2015 at 03:36:19PM -0500, Steven Rostedt wrote:
> On Mon, 2 Feb 2015 19:02:29 -0200
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > I am worried about two things:
> > 
> > 1) Something calling raise_softirq(TIMER_SOFTIRQ) combined with the
> > lack of Paul's d550e81dc0dd.
> > 
> > The result is __run_timers checking all the timer wheel buckets
> > and updating base->timer_jiffies, which ruins latency even if
> > one has carefully made sure no timer is pending.
> > 
> > 2) Reliance on the sched timer interrupt to raise the timer softirq
> > in case of pending irq_work (your patch) AND nohz_full.
> > 
> > > Isolation is the *only* way to make that work.
> > 
> > Fine. Please see item 1) above.
> 
> So basically you are saying we just need: d550e81dc0dd ?

fff421580f51 is of course a prerequisite for d550e81dc0dd.  Of the five
related commits, these two are the most important, as they cover things
for CPUs that never have any timers.  The other three handle CPUs that
occasionally have a timer or two.

So you definitely need fff421580f51 and d550e81dc0dd.  Less carefully
tuned systems will benefit from 16d937f88031, 18d8cb64c9c0, and
aea369b959be, but these last three are more in the nice-to-have
category.
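
Roughly, that pair works like this (sketch from memory, simplified;
fff421580f51 adds a per-base count of queued timers, and d550e81dc0dd
uses it to skip the catch-up walk entirely when the wheel is empty):

/* fff421580f51: base->all_timers counts every queued timer. */
static bool catchup_timer_jiffies(struct tvec_base *base)
{
	if (!base->all_timers) {
		base->timer_jiffies = jiffies;	/* nothing queued: resync and bail */
		return true;
	}
	return false;
}

static inline void __run_timers(struct tvec_base *base)
{
	spin_lock_irq(&base->lock);
	/* d550e81dc0dd: a stray TIMER_SOFTIRQ on an empty wheel is nearly free. */
	if (catchup_timer_jiffies(base)) {
		spin_unlock_irq(&base->lock);
		return;
	}
	/* ... otherwise the usual catch-up/expiry processing runs as before ... */
	spin_unlock_irq(&base->lock);
}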

							Thanx, Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: kernel-rt rcuc lock contention problem
  2015-02-03 20:36                     ` Steven Rostedt
  2015-02-03 20:57                       ` Paul E. McKenney
@ 2015-02-03 23:55                       ` Marcelo Tosatti
  1 sibling, 0 replies; 23+ messages in thread
From: Marcelo Tosatti @ 2015-02-03 23:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users

On Tue, Feb 03, 2015 at 03:36:19PM -0500, Steven Rostedt wrote:
> On Mon, 2 Feb 2015 19:02:29 -0200
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > I am worried about two things:
> > 
> > 1) Something calling raise_softirq(TIMER_SOFTIRQ) combined with the
> > lack of Paul's d550e81dc0dd.
> > 
> > The result is __run_timers checking all the timer wheel buckets
> > and updating base->timer_jiffies, which ruins latency even if
> > one has carefully made sure no timer is pending.
> > 
> > 2) Reliance on the sched timer interrupt to raise the timer softirq
> > in case of pending irq_work (your patch) AND nohz_full.
> > 
> > > Isolation is the *only* way to make that work.
> > 
> > Fine. Please see item 1) above.
> > 
> 
> So basically you are saying we just need: d550e81dc0dd ?

For 1), the 4 patches he mentioned, please.

For 2), it was just a hypothesis (perhaps fuelled by the fact that my
test box crashes with nohz_full= and isolated cpus).


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-02-03 23:56 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-26 19:14 kernel-rt rcuc lock contention problem Luiz Capitulino
2015-01-27 20:37 ` Paul E. McKenney
2015-01-28  1:55   ` Marcelo Tosatti
2015-01-28 14:18     ` Luiz Capitulino
2015-01-28 18:09       ` Paul E. McKenney
2015-01-28 18:39         ` Luiz Capitulino
2015-01-28 19:00           ` Paul E. McKenney
2015-01-28 19:06             ` Luiz Capitulino
2015-01-28 18:03     ` Paul E. McKenney
2015-01-28 18:25       ` Marcelo Tosatti
2015-01-28 18:55         ` Paul E. McKenney
2015-01-29 17:06           ` Steven Rostedt
2015-01-29 18:11             ` Paul E. McKenney
2015-01-29 18:13           ` Marcelo Tosatti
2015-01-29 18:36             ` Paul E. McKenney
2015-02-02 18:24           ` Marcelo Tosatti
2015-02-02 20:35             ` Steven Rostedt
2015-02-02 20:46               ` Marcelo Tosatti
2015-02-02 20:55                 ` Steven Rostedt
2015-02-02 21:02                   ` Marcelo Tosatti
2015-02-03 20:36                     ` Steven Rostedt
2015-02-03 20:57                       ` Paul E. McKenney
2015-02-03 23:55                       ` Marcelo Tosatti
