* kernel-rt rcuc lock contention problem @ 2015-01-26 19:14 Luiz Capitulino 2015-01-27 20:37 ` Paul E. McKenney 0 siblings, 1 reply; 23+ messages in thread From: Luiz Capitulino @ 2015-01-26 19:14 UTC (permalink / raw) To: paulmck; +Cc: linux-rt-users, Marcelo Tosatti Paul, We're running some measurements with cyclictest running inside a KVM guest where we could observe spinlock contention among rcuc threads. Basically, we have a 16-CPU NUMA machine very well setup for RT. This machine and the guest run the RT kernel. As our test-case requires an application in the guest taking 100% of the CPU, the RT priority configuration that gives the best latency is this one: 263 FF 3 [rcuc/15] 13 FF 3 [rcub/1] 12 FF 3 [rcub/0] 265 FF 2 [ksoftirqd/15] 3181 FF 1 qemu-kvm In this configuration, the rcuc can preempt the guest's vcpu thread. This shouldn't be a problem, except for the fact that we're seeing that in some cases the rcuc/15 thread spends 10us or more spinning in this spinlock (note that IRQs are disabled during this period): __rcu_process_callbacks() { ... local_irq_save(flags); if (cpu_needs_another_gp(rsp, rdp)) { raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ rcu_start_gp(rsp); raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); ... We've tried playing with the rcu_nocbs= option. However, it did not help because, for reasons we don't understand, the rcuc threads have to handle grace period start even when callback offloading is used. Handling this case requires this code path to be executed. We've cooked the following extremely dirty patch, just to see what would happen: diff --git a/kernel/rcutree.c b/kernel/rcutree.c index eaed1ef..c0771cc 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) /* Does this CPU require a not-yet-started grace period? */ local_irq_save(flags); if (cpu_needs_another_gp(rsp, rdp)) { - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. 
*/ - rcu_start_gp(rsp); - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); + for (;;) { + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { + local_irq_restore(flags); + local_bh_enable(); + schedule_timeout_interruptible(2); + local_bh_disable(); + local_irq_save(flags); + continue; + } + rcu_start_gp(rsp); + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); + break; + } } else { local_irq_restore(flags); } With this patch rcuc is gone from our traces and the scheduling latency is reduced by 3us in our CPU-bound test-case. Could you please advise on how to solve this contention problem? Can we test whether the local CPU is nocb, and in that case, skip rcu_start_gp entirely, for example? Thanks! ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-01-26 19:14 kernel-rt rcuc lock contention problem Luiz Capitulino @ 2015-01-27 20:37 ` Paul E. McKenney 2015-01-28 1:55 ` Marcelo Tosatti 0 siblings, 1 reply; 23+ messages in thread From: Paul E. McKenney @ 2015-01-27 20:37 UTC (permalink / raw) To: Luiz Capitulino; +Cc: linux-rt-users, Marcelo Tosatti On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote: > Paul, > > We're running some measurements with cyclictest running inside a > KVM guest where we could observe spinlock contention among rcuc > threads. > > Basically, we have a 16-CPU NUMA machine very well setup for RT. > This machine and the guest run the RT kernel. As our test-case > requires an application in the guest taking 100% of the CPU, the > RT priority configuration that gives the best latency is this one: > > 263 FF 3 [rcuc/15] > 13 FF 3 [rcub/1] > 12 FF 3 [rcub/0] > 265 FF 2 [ksoftirqd/15] > 3181 FF 1 qemu-kvm > > In this configuration, the rcuc can preempt the guest's vcpu > thread. This shouldn't be a problem, except for the fact that > we're seeing that in some cases the rcuc/15 thread spends 10us > or more spinning in this spinlock (note that IRQs are disabled > during this period): > > __rcu_process_callbacks() > { > ... > local_irq_save(flags); > if (cpu_needs_another_gp(rsp, rdp)) { > raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > rcu_start_gp(rsp); > raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > ... Life can be hard when irq-disabled spinlocks can be preempted! But how often does this happen? Also, does this happen on smaller systems, for example, with four or eight CPUs? And I confess to be a bit surprised that you expect real-time response from a guest that is subject to preemption -- as I understand it, the usual approach is to give RT guests their own CPUs. Or am I missing something? > We've tried playing with the rcu_nocbs= option. 
However, it > did not help because, for reasons we don't understand, the rcuc > threads have to handle grace period start even when callback > offloading is used. Handling this case requires this code path > to be executed. Yep. The rcu_nocbs= option offloads invocation of RCU callbacks, but not the per-CPU work required to inform RCU of quiescent states. > We've cooked the following extremely dirty patch, just to see > what would happen: > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > index eaed1ef..c0771cc 100644 > --- a/kernel/rcutree.c > +++ b/kernel/rcutree.c > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) > /* Does this CPU require a not-yet-started grace period? */ > local_irq_save(flags); > if (cpu_needs_another_gp(rsp, rdp)) { > - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > - rcu_start_gp(rsp); > - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > + for (;;) { > + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { > + local_irq_restore(flags); > + local_bh_enable(); > + schedule_timeout_interruptible(2); Yes, the above will get you a splat in mainline kernels, which do not necessarily push softirq processing to the ksoftirqd kthreads. ;-) > + local_bh_disable(); > + local_irq_save(flags); > + continue; > + } > + rcu_start_gp(rsp); > + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > + break; > + } > } else { > local_irq_restore(flags); > } > > With this patch rcuc is gone from our traces and the scheduling > latency is reduced by 3us in our CPU-bound test-case. > > Could you please advice on how to solve this contention problem? The usual advice would be to configure the system such that the guest's VCPUs do not get preempted. Or is the contention on the root rcu_node structure's ->lock field high for some other reason? > Can we test whether the local CPU is nocb, and in that case, > skip rcu_start_gp entirely for example? 
If you do that, you can see system hangs due to needed grace periods never getting started. Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF? If you are using a smaller value, it would be possible to rework the code to reduce contention on ->lock, though if a VCPU does get preempted while holding the root rcu_node structure's ->lock, life will be hard. Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-01-27 20:37 ` Paul E. McKenney @ 2015-01-28 1:55 ` Marcelo Tosatti 2015-01-28 14:18 ` Luiz Capitulino 2015-01-28 18:03 ` Paul E. McKenney 0 siblings, 2 replies; 23+ messages in thread From: Marcelo Tosatti @ 2015-01-28 1:55 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Luiz Capitulino, linux-rt-users On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote: > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote: > > Paul, > > > > We're running some measurements with cyclictest running inside a > > KVM guest where we could observe spinlock contention among rcuc > > threads. > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT. > > This machine and the guest run the RT kernel. As our test-case > > requires an application in the guest taking 100% of the CPU, the > > RT priority configuration that gives the best latency is this one: > > > > 263 FF 3 [rcuc/15] > > 13 FF 3 [rcub/1] > > 12 FF 3 [rcub/0] > > 265 FF 2 [ksoftirqd/15] > > 3181 FF 1 qemu-kvm > > > > In this configuration, the rcuc can preempt the guest's vcpu > > thread. This shouldn't be a problem, except for the fact that > > we're seeing that in some cases the rcuc/15 thread spends 10us > > or more spinning in this spinlock (note that IRQs are disabled > > during this period): > > > > __rcu_process_callbacks() > > { > > ... > > local_irq_save(flags); > > if (cpu_needs_another_gp(rsp, rdp)) { > > raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > rcu_start_gp(rsp); > > raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > ... > > Life can be hard when irq-disabled spinlocks can be preempted! But how > often does this happen? Also, does this happen on smaller systems, for > example, with four or eight CPUs? 
And I confess to be a bit surprised > that you expect real-time response from a guest that is subject to > preemption -- as I understand it, the usual approach is to give RT guests > their own CPUs. > > Or am I missing something? We are trying to avoid relying on the guest VCPU to voluntarily yield the CPU therefore allowing the critical services (such as rcu callback processing and sched tick processing) to execute. > > We've tried playing with the rcu_nocbs= option. However, it > > did not help because, for reasons we don't understand, the rcuc > > threads have to handle grace period start even when callback > > offloading is used. Handling this case requires this code path > > to be executed. > > Yep. The rcu_nocbs= option offloads invocation of RCU callbacks, but not > the per-CPU work required to inform RCU of quiescent states. Can't you execute that on vCPU entry/exit? Those are quiescent states after all. > > We've cooked the following extremely dirty patch, just to see > > what would happen: > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > > index eaed1ef..c0771cc 100644 > > --- a/kernel/rcutree.c > > +++ b/kernel/rcutree.c > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) > > /* Does this CPU require a not-yet-started grace period? */ > > local_irq_save(flags); > > if (cpu_needs_another_gp(rsp, rdp)) { > > - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > - rcu_start_gp(rsp); > > - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > + for (;;) { > > + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { > > + local_irq_restore(flags); > > + local_bh_enable(); > > + schedule_timeout_interruptible(2); > > Yes, the above will get you a splat in mainline kernels, which do not > necessarily push softirq processing to the ksoftirqd kthreads. 
;-) > > > + local_bh_disable(); > > + local_irq_save(flags); > > + continue; > > + } > > + rcu_start_gp(rsp); > > + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > + break; > > + } > > } else { > > local_irq_restore(flags); > > } > > > > With this patch rcuc is gone from our traces and the scheduling > > latency is reduced by 3us in our CPU-bound test-case. > > > > Could you please advice on how to solve this contention problem? > > The usual advice would be to configure the system such that the guest's > VCPUs do not get preempted. The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy spinning). In that case, rcuc would never execute, because it has a lower priority than guest VCPUs. I do not think we want that. > Or is the contention on the root rcu_node structure's ->lock field > high for some other reason? Luiz? > > Can we test whether the local CPU is nocb, and in that case, > > skip rcu_start_gp entirely for example? > > If you do that, you can see system hangs due to needed grace periods never > getting started. So it is not enough for CB CPUs to execute rcu_start_gp. Why is it necessary for nocb CPUs to execute rcu_start_gp? > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF? > If you are using a smaller value, it would be possible to rework the > code to reduce contention on ->lock, though if a VCPU does get preempted > while holding the root rcu_node structure's ->lock, life will be hard. Its a raw spinlock, isnt it? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-01-28 1:55 ` Marcelo Tosatti @ 2015-01-28 14:18 ` Luiz Capitulino 2015-01-28 18:09 ` Paul E. McKenney 2015-01-28 18:03 ` Paul E. McKenney 1 sibling, 1 reply; 23+ messages in thread From: Luiz Capitulino @ 2015-01-28 14:18 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Paul E. McKenney, linux-rt-users On Tue, 27 Jan 2015 23:55:08 -0200 Marcelo Tosatti <mtosatti@redhat.com> wrote: > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote: > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote: > > > Paul, > > > > > > We're running some measurements with cyclictest running inside a > > > KVM guest where we could observe spinlock contention among rcuc > > > threads. > > > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT. > > > This machine and the guest run the RT kernel. As our test-case > > > requires an application in the guest taking 100% of the CPU, the > > > RT priority configuration that gives the best latency is this one: > > > > > > 263 FF 3 [rcuc/15] > > > 13 FF 3 [rcub/1] > > > 12 FF 3 [rcub/0] > > > 265 FF 2 [ksoftirqd/15] > > > 3181 FF 1 qemu-kvm > > > > > > In this configuration, the rcuc can preempt the guest's vcpu > > > thread. This shouldn't be a problem, except for the fact that > > > we're seeing that in some cases the rcuc/15 thread spends 10us > > > or more spinning in this spinlock (note that IRQs are disabled > > > during this period): > > > > > > __rcu_process_callbacks() > > > { > > > ... > > > local_irq_save(flags); > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > rcu_start_gp(rsp); > > > raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > ... > > > > Life can be hard when irq-disabled spinlocks can be preempted! But how > > often does this happen? I have to run cyclictest in the guest for 16m a few times to reproduce it. 
> > Also, does this happen on smaller systems, for > > example, with four or eight CPUs? Didn't test. > > And I confess to be a bit surprised > > that you expect real-time response from a guest that is subject to > > preemption -- as I understand it, the usual approach is to give RT guests > > their own CPUs. > > > > Or am I missing something? > > We are trying to avoid relying on the guest VCPU to voluntarily yield > the CPU therefore allowing the critical services (such as rcu callback > processing and sched tick processing) to execute. Yes. I hope I won't regret saying this, but what I'm observing is that preempting-off the vcpu is not the end of the world as long as you're quick. > > > We've tried playing with the rcu_nocbs= option. However, it > > > did not help because, for reasons we don't understand, the rcuc > > > threads have to handle grace period start even when callback > > > offloading is used. Handling this case requires this code path > > > to be executed. > > > > Yep. The rcu_nocbs= option offloads invocation of RCU callbacks, but not > > the per-CPU work required to inform RCU of quiescent states. > > Can't you execute that on vCPU entry/exit? Those are quiescent states > after all. > > > > We've cooked the following extremely dirty patch, just to see > > > what would happen: > > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > > > index eaed1ef..c0771cc 100644 > > > --- a/kernel/rcutree.c > > > +++ b/kernel/rcutree.c > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) > > > /* Does this CPU require a not-yet-started grace period? */ > > > local_irq_save(flags); > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. 
*/ > > > - rcu_start_gp(rsp); > > > - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > + for (;;) { > > > + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { > > > + local_irq_restore(flags); > > > + local_bh_enable(); > > > + schedule_timeout_interruptible(2); > > > > Yes, the above will get you a splat in mainline kernels, which do not > > necessarily push softirq processing to the ksoftirqd kthreads. ;-) > > > > > + local_bh_disable(); > > > + local_irq_save(flags); > > > + continue; > > > + } > > > + rcu_start_gp(rsp); > > > + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > + break; > > > + } > > > } else { > > > local_irq_restore(flags); > > > } > > > > > > With this patch rcuc is gone from our traces and the scheduling > > > latency is reduced by 3us in our CPU-bound test-case. > > > > > > Could you please advice on how to solve this contention problem? > > > > The usual advice would be to configure the system such that the guest's > > VCPUs do not get preempted. > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy > spinning). In that case, rcuc would never execute, because it has a > lower priority than guest VCPUs. > > I do not think we want that. > > > Or is the contention on the root rcu_node structure's ->lock field > > high for some other reason? I didn't go far on trying to determine the reason. What I observed was the rcuc preempting-off the vcpu and taking 10us+. I debugged it and most of this time it spends spinning on the spinlock. The patch above makes the rcuc disappear from our traces. This is all I've got. I could try to debug it further if you have suggestions on how to trace the cause. > > Luiz? > > > > Can we test whether the local CPU is nocb, and in that case, > > > skip rcu_start_gp entirely for example? > > > > If you do that, you can see system hangs due to needed grace periods never > > getting started. > > So it is not enough for CB CPUs to execute rcu_start_gp. 
Why is it > necessary for nocb CPUs to execute rcu_start_gp? > > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF? > > If you are using a smaller value, it would be possible to rework the > > code to reduce contention on ->lock, though if a VCPU does get preempted > > while holding the root rcu_node structure's ->lock, life will be hard. > > Its a raw spinlock, isnt it? > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-01-28 14:18 ` Luiz Capitulino @ 2015-01-28 18:09 ` Paul E. McKenney 2015-01-28 18:39 ` Luiz Capitulino 0 siblings, 1 reply; 23+ messages in thread From: Paul E. McKenney @ 2015-01-28 18:09 UTC (permalink / raw) To: Luiz Capitulino; +Cc: Marcelo Tosatti, linux-rt-users On Wed, Jan 28, 2015 at 09:18:36AM -0500, Luiz Capitulino wrote: > On Tue, 27 Jan 2015 23:55:08 -0200 > Marcelo Tosatti <mtosatti@redhat.com> wrote: > > > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote: > > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote: > > > > Paul, > > > > > > > > We're running some measurements with cyclictest running inside a > > > > KVM guest where we could observe spinlock contention among rcuc > > > > threads. > > > > > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT. > > > > This machine and the guest run the RT kernel. As our test-case > > > > requires an application in the guest taking 100% of the CPU, the > > > > RT priority configuration that gives the best latency is this one: > > > > > > > > 263 FF 3 [rcuc/15] > > > > 13 FF 3 [rcub/1] > > > > 12 FF 3 [rcub/0] > > > > 265 FF 2 [ksoftirqd/15] > > > > 3181 FF 1 qemu-kvm > > > > > > > > In this configuration, the rcuc can preempt the guest's vcpu > > > > thread. This shouldn't be a problem, except for the fact that > > > > we're seeing that in some cases the rcuc/15 thread spends 10us > > > > or more spinning in this spinlock (note that IRQs are disabled > > > > during this period): > > > > > > > > __rcu_process_callbacks() > > > > { > > > > ... > > > > local_irq_save(flags); > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > > rcu_start_gp(rsp); > > > > raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > ... > > > > > > Life can be hard when irq-disabled spinlocks can be preempted! But how > > > often does this happen? 
> > I have to run cyclictest in the guest for 16m a few times to reproduce it. So you are seeing the high contention in the guest, correct? > > > Also, does this happen on smaller systems, for > > > example, with four or eight CPUs? > > Didn't test. > > > > And I confess to be a bit surprised > > > that you expect real-time response from a guest that is subject to > > > preemption -- as I understand it, the usual approach is to give RT guests > > > their own CPUs. > > > > > > Or am I missing something? > > > > We are trying to avoid relying on the guest VCPU to voluntarily yield > > the CPU therefore allowing the critical services (such as rcu callback > > processing and sched tick processing) to execute. > > Yes. I hope I won't regret saying this, but what I'm observing is that > preempting-off the vcpu is not the end of the world as long as you're > quick. And as long as you get lucky and avoid preempting a VCPU that happens to be holding a critical lock. Look, if you want real-time response in a guest OS, there simply is no substitute for ensuring that the guest has its own CPUs that are not used for anything else, either by anything in the host or by another guest. If you do allow preemption of a guest OS that might be holding a critical guest-OS lock, you are going to see latency blows. Count on it! ;-) > > > > We've tried playing with the rcu_nocbs= option. However, it > > > > did not help because, for reasons we don't understand, the rcuc > > > > threads have to handle grace period start even when callback > > > > offloading is used. Handling this case requires this code path > > > > to be executed. > > > > > > Yep. The rcu_nocbs= option offloads invocation of RCU callbacks, but not > > > the per-CPU work required to inform RCU of quiescent states. > > > > Can't you execute that on vCPU entry/exit? Those are quiescent states > > after all. 
> > > > > > We've cooked the following extremely dirty patch, just to see > > > > what would happen: > > > > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > > > > index eaed1ef..c0771cc 100644 > > > > --- a/kernel/rcutree.c > > > > +++ b/kernel/rcutree.c > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) > > > > /* Does this CPU require a not-yet-started grace period? */ > > > > local_irq_save(flags); > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > > - rcu_start_gp(rsp); > > > > - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > + for (;;) { > > > > + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { > > > > + local_irq_restore(flags); > > > > + local_bh_enable(); > > > > + schedule_timeout_interruptible(2); > > > > > > Yes, the above will get you a splat in mainline kernels, which do not > > > necessarily push softirq processing to the ksoftirqd kthreads. ;-) > > > > > > > + local_bh_disable(); > > > > + local_irq_save(flags); > > > > + continue; > > > > + } > > > > + rcu_start_gp(rsp); > > > > + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > + break; > > > > + } > > > > } else { > > > > local_irq_restore(flags); > > > > } > > > > > > > > With this patch rcuc is gone from our traces and the scheduling > > > > latency is reduced by 3us in our CPU-bound test-case. > > > > > > > > Could you please advice on how to solve this contention problem? > > > > > > The usual advice would be to configure the system such that the guest's > > > VCPUs do not get preempted. > > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy > > spinning). In that case, rcuc would never execute, because it has a > > lower priority than guest VCPUs. > > > > I do not think we want that. > > > > > Or is the contention on the root rcu_node structure's ->lock field > > > high for some other reason? 
> > I didn't go far on trying to determine the reason. What I observed > was the rcuc preempting-off the vcpu and taking 10us+. I debugged it > and most of this time it spends spinning on the spinlock. The patch > above makes the rcuc disappear from our traces. This is all I've got. > I could try to debug it further if you have suggestions on how to > trace the cause. My current guess is that either: 1. You are allowing the host or another guest to preempt this guest's VCPU. Don't do that. ;-) 2. You are letting the rcuc kthreads contend for the worker CPUs. Pin them to housekeeping CPUs. This applies to both the host and the guest rcuc kthreads, but especially to the host rcuc kthreads. Or am I still unclear on your goals and configuration? Thanx, Paul > > Luiz? > > > > > > Can we test whether the local CPU is nocb, and in that case, > > > > skip rcu_start_gp entirely for example? > > > > > > If you do that, you can see system hangs due to needed grace periods never > > > getting started. > > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it > > necessary for nocb CPUs to execute rcu_start_gp? > > > > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF? > > > If you are using a smaller value, it would be possible to rework the > > > code to reduce contention on ->lock, though if a VCPU does get preempted > > > while holding the root rcu_node structure's ->lock, life will be hard. > > > > Its a raw spinlock, isnt it? > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
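[Archive note] Paul's suggestion 1 is usually implemented by isolating the guest's CPUs at boot and pinning the vCPU threads there. A sketch of such a host configuration, assuming CPUs 8-15 are dedicated to the guest; the CPU list is an illustration, not taken from this thread:

```
# host kernel command line fragment (illustrative CPU list)
isolcpus=8-15 nohz_full=8-15 rcu_nocbs=8-15
```

With rcu_nocbs= in place, the offloaded rcuo callback kthreads (unlike the per-CPU rcuc kthreads) have no CPU affinity requirement and can be kept on the housekeeping CPUs, while the qemu-kvm vCPU threads are pinned to the isolated set.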
* Re: kernel-rt rcuc lock contention problem 2015-01-28 18:09 ` Paul E. McKenney @ 2015-01-28 18:39 ` Luiz Capitulino 2015-01-28 19:00 ` Paul E. McKenney 0 siblings, 1 reply; 23+ messages in thread From: Luiz Capitulino @ 2015-01-28 18:39 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Marcelo Tosatti, linux-rt-users On Wed, 28 Jan 2015 10:09:50 -0800 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > On Wed, Jan 28, 2015 at 09:18:36AM -0500, Luiz Capitulino wrote: > > On Tue, 27 Jan 2015 23:55:08 -0200 > > Marcelo Tosatti <mtosatti@redhat.com> wrote: > > > > > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote: > > > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote: > > > > > Paul, > > > > > > > > > > We're running some measurements with cyclictest running inside a > > > > > KVM guest where we could observe spinlock contention among rcuc > > > > > threads. > > > > > > > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT. > > > > > This machine and the guest run the RT kernel. As our test-case > > > > > requires an application in the guest taking 100% of the CPU, the > > > > > RT priority configuration that gives the best latency is this one: > > > > > > > > > > 263 FF 3 [rcuc/15] > > > > > 13 FF 3 [rcub/1] > > > > > 12 FF 3 [rcub/0] > > > > > 265 FF 2 [ksoftirqd/15] > > > > > 3181 FF 1 qemu-kvm > > > > > > > > > > In this configuration, the rcuc can preempt the guest's vcpu > > > > > thread. This shouldn't be a problem, except for the fact that > > > > > we're seeing that in some cases the rcuc/15 thread spends 10us > > > > > or more spinning in this spinlock (note that IRQs are disabled > > > > > during this period): > > > > > > > > > > __rcu_process_callbacks() > > > > > { > > > > > ... > > > > > local_irq_save(flags); > > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > > raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. 
*/ > > > > > rcu_start_gp(rsp); > > > > > raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > ... > > > > > > > > Life can be hard when irq-disabled spinlocks can be preempted! But how > > > > often does this happen? > > > > I have to run cyclictest in the guest for 16m a few times to reproduce it. > > So you are seeing the high contention in the guest, correct? No, it's in the host. > > > > > Also, does this happen on smaller systems, for > > > > example, with four or eight CPUs? > > > > Didn't test. > > > > > > And I confess to be a bit surprised > > > > that you expect real-time response from a guest that is subject to > > > > preemption -- as I understand it, the usual approach is to give RT guests > > > > their own CPUs. > > > > > > > > Or am I missing something? > > > > > > We are trying to avoid relying on the guest VCPU to voluntarily yield > > > the CPU therefore allowing the critical services (such as rcu callback > > > processing and sched tick processing) to execute. > > > > Yes. I hope I won't regret saying this, but what I'm observing is that > > preempting-off the vcpu is not the end of the world as long as you're > > quick. > > And as long as you get lucky and avoid preempting a VCPU that happens to > be holding a critical lock. That's not the case. Everything I mentioned in this thread about RCU and contention happens in the host. > Look, if you want real-time response in a guest OS, there simply is no > substitute for ensuring that the guest has its own CPUs that are not used > for anything else, either by anything in the host or by another guest. > If you do allow preemption of a guest OS that might be holding a critical > guest-OS lock, you are going to see latency blows. Count on it! ;-) > > > > > > We've tried playing with the rcu_nocbs= option. 
However, it > > > > > did not help because, for reasons we don't understand, the rcuc > > > > > threads have to handle grace period start even when callback > > > > > offloading is used. Handling this case requires this code path > > > > > to be executed. > > > > > > > > Yep. The rcu_nocbs= option offloads invocation of RCU callbacks, but not > > > > the per-CPU work required to inform RCU of quiescent states. > > > > > > Can't you execute that on vCPU entry/exit? Those are quiescent states > > > after all. > > > > > > > > We've cooked the following extremely dirty patch, just to see > > > > > what would happen: > > > > > > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > > > > > index eaed1ef..c0771cc 100644 > > > > > --- a/kernel/rcutree.c > > > > > +++ b/kernel/rcutree.c > > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) > > > > > /* Does this CPU require a not-yet-started grace period? */ > > > > > local_irq_save(flags); > > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > > - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > > > - rcu_start_gp(rsp); > > > > > - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > + for (;;) { > > > > > + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { > > > > > + local_irq_restore(flags); > > > > > + local_bh_enable(); > > > > > + schedule_timeout_interruptible(2); > > > > > > > > Yes, the above will get you a splat in mainline kernels, which do not > > > > necessarily push softirq processing to the ksoftirqd kthreads. 
;-) > > > > > > > > > + local_bh_disable(); > > > > > + local_irq_save(flags); > > > > > + continue; > > > > > + } > > > > > + rcu_start_gp(rsp); > > > > > + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > + break; > > > > > + } > > > > > } else { > > > > > local_irq_restore(flags); > > > > > } > > > > > > > > > > With this patch rcuc is gone from our traces and the scheduling > > > > > latency is reduced by 3us in our CPU-bound test-case. > > > > > > > > > > Could you please advice on how to solve this contention problem? > > > > > > > > The usual advice would be to configure the system such that the guest's > > > > VCPUs do not get preempted. > > > > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy > > > spinning). In that case, rcuc would never execute, because it has a > > > lower priority than guest VCPUs. > > > > > > I do not think we want that. > > > > > > > Or is the contention on the root rcu_node structure's ->lock field > > > > high for some other reason? > > > > I didn't go far on trying to determine the reason. What I observed > > was the rcuc preempting-off the vcpu and taking 10us+. I debugged it > > and most of this time it spends spinning on the spinlock. The patch > > above makes the rcuc disappear from our traces. This is all I've got. > > I could try to debug it further if you have suggestions on how to > > trace the cause. > > My current guess is that either: > > 1. You are allowing the host or another guest to preempt this > guest's VCPU. Don't do that. ;-) We do allow the rcuc kthread to preempt the guest's vCPU (not other guests). The reason for this is that the workload running inside the guest may take 100% of the CPU, which won't allow the rcuc thread to ever execute. > 2. You are letting the rcuc kthreads contend for the worker CPUs. > Pin them to housekeeping CPUs. This applies to both the > host and the guest rcuc kthreads, but especially to the > host rcuc kthreads. 
I'd love to be able to do this, but the rcuc threads are CPU-bound threads. There's one per CPU and the kernel doesn't allow me to move them around. > > Or am I still unclear on your goals and configuration? > > Thanx, Paul > > > > Luiz? > > > > > > > > Can we test whether the local CPU is nocb, and in that case, > > > > > skip rcu_start_gp entirely for example? > > > > > > > > If you do that, you can see system hangs due to needed grace periods never > > > > getting started. > > > > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it > > > necessary for nocb CPUs to execute rcu_start_gp? > > > > > > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF? > > > > If you are using a smaller value, it would be possible to rework the > > > > code to reduce contention on ->lock, though if a VCPU does get preempted > > > > while holding the root rcu_node structure's ->lock, life will be hard. > > > > > > Its a raw spinlock, isnt it? > > > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-01-28 18:39 ` Luiz Capitulino @ 2015-01-28 19:00 ` Paul E. McKenney 2015-01-28 19:06 ` Luiz Capitulino 0 siblings, 1 reply; 23+ messages in thread From: Paul E. McKenney @ 2015-01-28 19:00 UTC (permalink / raw) To: Luiz Capitulino; +Cc: Marcelo Tosatti, linux-rt-users On Wed, Jan 28, 2015 at 01:39:16PM -0500, Luiz Capitulino wrote: > On Wed, 28 Jan 2015 10:09:50 -0800 > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > On Wed, Jan 28, 2015 at 09:18:36AM -0500, Luiz Capitulino wrote: > > > On Tue, 27 Jan 2015 23:55:08 -0200 > > > Marcelo Tosatti <mtosatti@redhat.com> wrote: > > > > > > > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote: > > > > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote: > > > > > > Paul, > > > > > > > > > > > > We're running some measurements with cyclictest running inside a > > > > > > KVM guest where we could observe spinlock contention among rcuc > > > > > > threads. > > > > > > > > > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT. > > > > > > This machine and the guest run the RT kernel. As our test-case > > > > > > requires an application in the guest taking 100% of the CPU, the > > > > > > RT priority configuration that gives the best latency is this one: > > > > > > > > > > > > 263 FF 3 [rcuc/15] > > > > > > 13 FF 3 [rcub/1] > > > > > > 12 FF 3 [rcub/0] > > > > > > 265 FF 2 [ksoftirqd/15] > > > > > > 3181 FF 1 qemu-kvm > > > > > > > > > > > > In this configuration, the rcuc can preempt the guest's vcpu > > > > > > thread. This shouldn't be a problem, except for the fact that > > > > > > we're seeing that in some cases the rcuc/15 thread spends 10us > > > > > > or more spinning in this spinlock (note that IRQs are disabled > > > > > > during this period): > > > > > > > > > > > > __rcu_process_callbacks() > > > > > > { > > > > > > ... 
> > > > > > local_irq_save(flags); > > > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > > > raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > > > > rcu_start_gp(rsp); > > > > > > raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > > ... > > > > > > > > > > Life can be hard when irq-disabled spinlocks can be preempted! But how > > > > > often does this happen? > > > > > > I have to run cyclictest in the guest for 16m a few times to reproduce it. > > > > So you are seeing the high contention in the guest, correct? > > No, it's in the host. OK, good to know. ;-) > > > > > Also, does this happen on smaller systems, for > > > > > example, with four or eight CPUs? > > > > > > Didn't test. > > > > > > > > And I confess to be a bit surprised > > > > > that you expect real-time response from a guest that is subject to > > > > > preemption -- as I understand it, the usual approach is to give RT guests > > > > > their own CPUs. > > > > > > > > > > Or am I missing something? > > > > > > > > We are trying to avoid relying on the guest VCPU to voluntarily yield > > > > the CPU therefore allowing the critical services (such as rcu callback > > > > processing and sched tick processing) to execute. > > > > > > Yes. I hope I won't regret saying this, but what I'm observing is that > > > preempting-off the vcpu is not the end of the world as long as you're > > > quick. > > > > And as long as you get lucky and avoid preempting a VCPU that happens to > > be holding a critical lock. > > That's not the case. Everything I mentioned in this thread about RCU > and contention happens in the host. > > > Look, if you want real-time response in a guest OS, there simply is no > > substitute for ensuring that the guest has its own CPUs that are not used > > for anything else, either by anything in the host or by another guest. 
> > If you do allow preemption of a guest OS that might be holding a critical > > guest-OS lock, you are going to see latency blows. Count on it! ;-) > > > > > > > > We've tried playing with the rcu_nocbs= option. However, it > > > > > > did not help because, for reasons we don't understand, the rcuc > > > > > > threads have to handle grace period start even when callback > > > > > > offloading is used. Handling this case requires this code path > > > > > > to be executed. > > > > > > > > > > Yep. The rcu_nocbs= option offloads invocation of RCU callbacks, but not > > > > > the per-CPU work required to inform RCU of quiescent states. > > > > > > > > Can't you execute that on vCPU entry/exit? Those are quiescent states > > > > after all. > > > > > > > > > > We've cooked the following extremely dirty patch, just to see > > > > > > what would happen: > > > > > > > > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > > > > > > index eaed1ef..c0771cc 100644 > > > > > > --- a/kernel/rcutree.c > > > > > > +++ b/kernel/rcutree.c > > > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) > > > > > > /* Does this CPU require a not-yet-started grace period? */ > > > > > > local_irq_save(flags); > > > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > > > - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > > > > - rcu_start_gp(rsp); > > > > > > - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > > + for (;;) { > > > > > > + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { > > > > > > + local_irq_restore(flags); > > > > > > + local_bh_enable(); > > > > > > + schedule_timeout_interruptible(2); > > > > > > > > > > Yes, the above will get you a splat in mainline kernels, which do not > > > > > necessarily push softirq processing to the ksoftirqd kthreads. 
;-) > > > > > > > > > > > + local_bh_disable(); > > > > > > + local_irq_save(flags); > > > > > > + continue; > > > > > > + } > > > > > > + rcu_start_gp(rsp); > > > > > > + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > > + break; > > > > > > + } > > > > > > } else { > > > > > > local_irq_restore(flags); > > > > > > } > > > > > > > > > > > > With this patch rcuc is gone from our traces and the scheduling > > > > > > latency is reduced by 3us in our CPU-bound test-case. > > > > > > > > > > > > Could you please advice on how to solve this contention problem? > > > > > > > > > > The usual advice would be to configure the system such that the guest's > > > > > VCPUs do not get preempted. > > > > > > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy > > > > spinning). In that case, rcuc would never execute, because it has a > > > > lower priority than guest VCPUs. > > > > > > > > I do not think we want that. > > > > > > > > > Or is the contention on the root rcu_node structure's ->lock field > > > > > high for some other reason? > > > > > > I didn't go far on trying to determine the reason. What I observed > > > was the rcuc preempting-off the vcpu and taking 10us+. I debugged it > > > and most of this time it spends spinning on the spinlock. The patch > > > above makes the rcuc disappear from our traces. This is all I've got. > > > I could try to debug it further if you have suggestions on how to > > > trace the cause. > > > > My current guess is that either: > > > > 1. You are allowing the host or another guest to preempt this > > guest's VCPU. Don't do that. ;-) > > We do allow the rcuc kthread to preempt the guest's vCPU (not other > guests). The reason for this is that the workload running inside the > guest may take 100% of the CPU, which won't allow the rcuc thread > to ever execute. > > > 2. You are letting the rcuc kthreads contend for the worker CPUs. > > Pin them to housekeeping CPUs. 
This applies to both the > > host and the guest rcuc kthreads, but especially to the > > host rcuc kthreads. > > I'd love to be able to do this, but the rcuc threads are CPU-bound > threads. There's one per CPU and the kernel doesn't allow me to move > them around. Can you build with CONFIG_RCU_BOOST=n? Then you won't have any rcuc kthreads. If you are preventing preemption of the VCPUs, you should not need RCU priority boosting. Thanx, Paul > > > > Or am I still unclear on your goals and configuration? > > > > Thanx, Paul > > > > > > Luiz? > > > > > > > > > > Can we test whether the local CPU is nocb, and in that case, > > > > > > skip rcu_start_gp entirely for example? > > > > > > > > > > If you do that, you can see system hangs due to needed grace periods never > > > > > getting started. > > > > > > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it > > > > necessary for nocb CPUs to execute rcu_start_gp? > > > > > > > > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF? > > > > > If you are using a smaller value, it would be possible to rework the > > > > > code to reduce contention on ->lock, though if a VCPU does get preempted > > > > > while holding the root rcu_node structure's ->lock, life will be hard. > > > > > > > > Its a raw spinlock, isnt it? > > > > > > > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem
  2015-01-28 19:00     ` Paul E. McKenney
@ 2015-01-28 19:06       ` Luiz Capitulino
  0 siblings, 0 replies; 23+ messages in thread

From: Luiz Capitulino @ 2015-01-28 19:06 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Marcelo Tosatti, linux-rt-users

On Wed, 28 Jan 2015 11:00:47 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> > > 2.	You are letting the rcuc kthreads contend for the worker CPUs.
> > > 	Pin them to housekeeping CPUs.  This applies to both the
> > > 	host and the guest rcuc kthreads, but especially to the
> > > 	host rcuc kthreads.
> >
> > I'd love to be able to do this, but the rcuc threads are CPU-bound
> > threads. There's one per CPU and the kernel doesn't allow me to move
> > them around.
>
> Can you build with CONFIG_RCU_BOOST=n?  Then you won't have any rcuc
> kthreads.

Oh, really? I will try this right away!

Thanks for your help!

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-01-28 1:55 ` Marcelo Tosatti 2015-01-28 14:18 ` Luiz Capitulino @ 2015-01-28 18:03 ` Paul E. McKenney 2015-01-28 18:25 ` Marcelo Tosatti 1 sibling, 1 reply; 23+ messages in thread From: Paul E. McKenney @ 2015-01-28 18:03 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Luiz Capitulino, linux-rt-users On Tue, Jan 27, 2015 at 11:55:08PM -0200, Marcelo Tosatti wrote: > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote: > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote: > > > Paul, > > > > > > We're running some measurements with cyclictest running inside a > > > KVM guest where we could observe spinlock contention among rcuc > > > threads. > > > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT. > > > This machine and the guest run the RT kernel. As our test-case > > > requires an application in the guest taking 100% of the CPU, the > > > RT priority configuration that gives the best latency is this one: > > > > > > 263 FF 3 [rcuc/15] > > > 13 FF 3 [rcub/1] > > > 12 FF 3 [rcub/0] > > > 265 FF 2 [ksoftirqd/15] > > > 3181 FF 1 qemu-kvm > > > > > > In this configuration, the rcuc can preempt the guest's vcpu > > > thread. This shouldn't be a problem, except for the fact that > > > we're seeing that in some cases the rcuc/15 thread spends 10us > > > or more spinning in this spinlock (note that IRQs are disabled > > > during this period): > > > > > > __rcu_process_callbacks() > > > { > > > ... > > > local_irq_save(flags); > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > rcu_start_gp(rsp); > > > raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > ... > > > > Life can be hard when irq-disabled spinlocks can be preempted! But how > > often does this happen? Also, does this happen on smaller systems, for > > example, with four or eight CPUs? 
And I confess to be a bit surprised > > that you expect real-time response from a guest that is subject to > > preemption -- as I understand it, the usual approach is to give RT guests > > their own CPUs. > > > > Or am I missing something? > > We are trying to avoid relying on the guest VCPU to voluntarily yield > the CPU therefore allowing the critical services (such as rcu callback > processing and sched tick processing) to execute. These critical services executing in the context of the host? (If not, I am confused. Actually, I am confused either way...) > > > We've tried playing with the rcu_nocbs= option. However, it > > > did not help because, for reasons we don't understand, the rcuc > > > threads have to handle grace period start even when callback > > > offloading is used. Handling this case requires this code path > > > to be executed. > > > > Yep. The rcu_nocbs= option offloads invocation of RCU callbacks, but not > > the per-CPU work required to inform RCU of quiescent states. > > Can't you execute that on vCPU entry/exit? Those are quiescent states > after all. I am guessing that we are talking about quiescent states in the guest. If so, can't vCPU entry/exit operations happen in guest interrupt handlers? If so, these operations are not necessarily quiescent states. > > > We've cooked the following extremely dirty patch, just to see > > > what would happen: > > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > > > index eaed1ef..c0771cc 100644 > > > --- a/kernel/rcutree.c > > > +++ b/kernel/rcutree.c > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) > > > /* Does this CPU require a not-yet-started grace period? */ > > > local_irq_save(flags); > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. 
*/ > > > - rcu_start_gp(rsp); > > > - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > + for (;;) { > > > + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { > > > + local_irq_restore(flags); > > > + local_bh_enable(); > > > + schedule_timeout_interruptible(2); > > > > Yes, the above will get you a splat in mainline kernels, which do not > > necessarily push softirq processing to the ksoftirqd kthreads. ;-) > > > > > + local_bh_disable(); > > > + local_irq_save(flags); > > > + continue; > > > + } > > > + rcu_start_gp(rsp); > > > + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > + break; > > > + } > > > } else { > > > local_irq_restore(flags); > > > } > > > > > > With this patch rcuc is gone from our traces and the scheduling > > > latency is reduced by 3us in our CPU-bound test-case. > > > > > > Could you please advice on how to solve this contention problem? > > > > The usual advice would be to configure the system such that the guest's > > VCPUs do not get preempted. > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy > spinning). In that case, rcuc would never execute, because it has a > lower priority than guest VCPUs. OK, this leads me to believe that you are talking about the rcuc kthreads in the host, not the guest. In which case the usual approach is to reserve a CPU or two on the host which never runs guest VCPUs, and to force the rcuc kthreads there. Note that CONFIG_NO_HZ_FULL will do this automatically for you, reserving the boot CPU. And CONFIG_NO_HZ_FULL might well be very useful in this scenario. And reserving a CPU or two for housekeeping purposes is quite common for heavy CPU-bound workloads. Of course, you need to make sure that the reserved CPU or two is sufficient for all the rcuc kthreads, but if your guests are mostly CPU bound, this should not be a problem. > I do not think we want that. Assuming "that" is "rcuc would never execute" -- agreed, that would be very bad. 
You would eventually OOM the system. > > Or is the contention on the root rcu_node structure's ->lock field > > high for some other reason? > > Luiz? > > > > Can we test whether the local CPU is nocb, and in that case, > > > skip rcu_start_gp entirely for example? > > > > If you do that, you can see system hangs due to needed grace periods never > > getting started. > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it > necessary for nocb CPUs to execute rcu_start_gp? Sigh. Are we in the host or the guest OS at this point? In any case, if you want the best real-time response for a CPU-bound workload on a given CPU, careful use of NO_HZ_FULL would prevent that CPU from ever invoking __rcu_process_callbacks() in the first place, which would have the beneficial side effect of preventing __rcu_process_callbacks() from ever invoking rcu_start_gp(). Of course, NO_HZ_FULL does have the drawback of increasing the cost of user-kernel transitions. > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF? > > If you are using a smaller value, it would be possible to rework the > > code to reduce contention on ->lock, though if a VCPU does get preempted > > while holding the root rcu_node structure's ->lock, life will be hard. > > Its a raw spinlock, isnt it? As I understand it, in a guest OS, that means nothing. The host can preempt a guest even if that guest believes that it has interrupts disabled, correct? If we are talking about the host, then I have to ask what is causing the high levels of contention on the root rcu_node structure's ->lock field. (Which is the only rcu_node structure if you are using default .config.) Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-01-28 18:03 ` Paul E. McKenney @ 2015-01-28 18:25 ` Marcelo Tosatti 2015-01-28 18:55 ` Paul E. McKenney 0 siblings, 1 reply; 23+ messages in thread From: Marcelo Tosatti @ 2015-01-28 18:25 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Luiz Capitulino, linux-rt-users On Wed, Jan 28, 2015 at 10:03:35AM -0800, Paul E. McKenney wrote: > On Tue, Jan 27, 2015 at 11:55:08PM -0200, Marcelo Tosatti wrote: > > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote: > > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote: > > > > Paul, > > > > > > > > We're running some measurements with cyclictest running inside a > > > > KVM guest where we could observe spinlock contention among rcuc > > > > threads. > > > > > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT. > > > > This machine and the guest run the RT kernel. As our test-case > > > > requires an application in the guest taking 100% of the CPU, the > > > > RT priority configuration that gives the best latency is this one: > > > > > > > > 263 FF 3 [rcuc/15] > > > > 13 FF 3 [rcub/1] > > > > 12 FF 3 [rcub/0] > > > > 265 FF 2 [ksoftirqd/15] > > > > 3181 FF 1 qemu-kvm > > > > > > > > In this configuration, the rcuc can preempt the guest's vcpu > > > > thread. This shouldn't be a problem, except for the fact that > > > > we're seeing that in some cases the rcuc/15 thread spends 10us > > > > or more spinning in this spinlock (note that IRQs are disabled > > > > during this period): > > > > > > > > __rcu_process_callbacks() > > > > { > > > > ... > > > > local_irq_save(flags); > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > > rcu_start_gp(rsp); > > > > raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > ... > > > > > > Life can be hard when irq-disabled spinlocks can be preempted! But how > > > often does this happen? 
Also, does this happen on smaller systems, for > > > example, with four or eight CPUs? And I confess to be a bit surprised > > > that you expect real-time response from a guest that is subject to > > > preemption -- as I understand it, the usual approach is to give RT guests > > > their own CPUs. > > > > > > Or am I missing something? > > > > We are trying to avoid relying on the guest VCPU to voluntarily yield > > the CPU therefore allowing the critical services (such as rcu callback > > processing and sched tick processing) to execute. > > These critical services executing in the context of the host? > (If not, I am confused. Actually, I am confused either way...) The host. Imagine a Windows 95 guest running a realtime app. That should help. > > > > We've tried playing with the rcu_nocbs= option. However, it > > > > did not help because, for reasons we don't understand, the rcuc > > > > threads have to handle grace period start even when callback > > > > offloading is used. Handling this case requires this code path > > > > to be executed. > > > > > > Yep. The rcu_nocbs= option offloads invocation of RCU callbacks, but not > > > the per-CPU work required to inform RCU of quiescent states. > > > > Can't you execute that on vCPU entry/exit? Those are quiescent states > > after all. > > I am guessing that we are talking about quiescent states in the guest. Host. > If so, can't vCPU entry/exit operations happen in guest interrupt > handlers? If so, these operations are not necessarily quiescent states. vCPU entry/exit are quiescent states in the host. > > > > We've cooked the following extremely dirty patch, just to see > > > > what would happen: > > > > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > > > > index eaed1ef..c0771cc 100644 > > > > --- a/kernel/rcutree.c > > > > +++ b/kernel/rcutree.c > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) > > > > /* Does this CPU require a not-yet-started grace period? 
*/ > > > > local_irq_save(flags); > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > > - rcu_start_gp(rsp); > > > > - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > + for (;;) { > > > > + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { > > > > + local_irq_restore(flags); > > > > + local_bh_enable(); > > > > + schedule_timeout_interruptible(2); > > > > > > Yes, the above will get you a splat in mainline kernels, which do not > > > necessarily push softirq processing to the ksoftirqd kthreads. ;-) > > > > > > > + local_bh_disable(); > > > > + local_irq_save(flags); > > > > + continue; > > > > + } > > > > + rcu_start_gp(rsp); > > > > + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > + break; > > > > + } > > > > } else { > > > > local_irq_restore(flags); > > > > } > > > > > > > > With this patch rcuc is gone from our traces and the scheduling > > > > latency is reduced by 3us in our CPU-bound test-case. > > > > > > > > Could you please advice on how to solve this contention problem? > > > > > > The usual advice would be to configure the system such that the guest's > > > VCPUs do not get preempted. > > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy > > spinning). In that case, rcuc would never execute, because it has a > > lower priority than guest VCPUs. > > OK, this leads me to believe that you are talking about the rcuc kthreads > in the host, not the guest. In which case the usual approach is to > reserve a CPU or two on the host which never runs guest VCPUs, and to > force the rcuc kthreads there. Note that CONFIG_NO_HZ_FULL will do this > automatically for you, reserving the boot CPU. And CONFIG_NO_HZ_FULL > might well be very useful in this scenario. And reserving a CPU or two > for housekeeping purposes is quite common for heavy CPU-bound workloads. 
> > Of course, you need to make sure that the reserved CPU or two is sufficient > for all the rcuc kthreads, but if your guests are mostly CPU bound, this > should not be a problem. > > > I do not think we want that. > > Assuming "that" is "rcuc would never execute" -- agreed, that would be > very bad. You would eventually OOM the system. > > > > Or is the contention on the root rcu_node structure's ->lock field > > > high for some other reason? > > > > Luiz? > > > > > > Can we test whether the local CPU is nocb, and in that case, > > > > skip rcu_start_gp entirely for example? > > > > > > If you do that, you can see system hangs due to needed grace periods never > > > getting started. > > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it > > necessary for nocb CPUs to execute rcu_start_gp? > > Sigh. Are we in the host or the guest OS at this point? Host. > In any case, if you want the best real-time response for a CPU-bound > workload on a given CPU, careful use of NO_HZ_FULL would prevent > that CPU from ever invoking __rcu_process_callbacks() in the first > place, which would have the beneficial side effect of preventing > __rcu_process_callbacks() from ever invoking rcu_start_gp(). > > Of course, NO_HZ_FULL does have the drawback of increasing the cost > of user-kernel transitions. We need periodic processing of __run_timers to keep timer wheel processing from falling behind too much. See http://www.gossamer-threads.com/lists/linux/kernel/2094151. > > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF? > > > If you are using a smaller value, it would be possible to rework the > > > code to reduce contention on ->lock, though if a VCPU does get preempted > > > while holding the root rcu_node structure's ->lock, life will be hard. > > > > Its a raw spinlock, isnt it? > > As I understand it, in a guest OS, that means nothing. The host can > preempt a guest even if that guest believes that it has interrupts > disabled, correct? 
Yes.

> If we are talking about the host, then I have to ask what is causing
> the high levels of contention on the root rcu_node structure's ->lock
> field.  (Which is the only rcu_node structure if you are using default
> .config.)
>
> 							Thanx, Paul

OK, great. Thanks a lot.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-01-28 18:25 ` Marcelo Tosatti @ 2015-01-28 18:55 ` Paul E. McKenney 2015-01-29 17:06 ` Steven Rostedt ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Paul E. McKenney @ 2015-01-28 18:55 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Luiz Capitulino, linux-rt-users On Wed, Jan 28, 2015 at 04:25:12PM -0200, Marcelo Tosatti wrote: > On Wed, Jan 28, 2015 at 10:03:35AM -0800, Paul E. McKenney wrote: > > On Tue, Jan 27, 2015 at 11:55:08PM -0200, Marcelo Tosatti wrote: > > > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote: > > > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote: > > > > > Paul, > > > > > > > > > > We're running some measurements with cyclictest running inside a > > > > > KVM guest where we could observe spinlock contention among rcuc > > > > > threads. > > > > > > > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT. > > > > > This machine and the guest run the RT kernel. As our test-case > > > > > requires an application in the guest taking 100% of the CPU, the > > > > > RT priority configuration that gives the best latency is this one: > > > > > > > > > > 263 FF 3 [rcuc/15] > > > > > 13 FF 3 [rcub/1] > > > > > 12 FF 3 [rcub/0] > > > > > 265 FF 2 [ksoftirqd/15] > > > > > 3181 FF 1 qemu-kvm > > > > > > > > > > In this configuration, the rcuc can preempt the guest's vcpu > > > > > thread. This shouldn't be a problem, except for the fact that > > > > > we're seeing that in some cases the rcuc/15 thread spends 10us > > > > > or more spinning in this spinlock (note that IRQs are disabled > > > > > during this period): > > > > > > > > > > __rcu_process_callbacks() > > > > > { > > > > > ... > > > > > local_irq_save(flags); > > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > > raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. 
*/ > > > > > rcu_start_gp(rsp); > > > > > raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > ... > > > > > > > > Life can be hard when irq-disabled spinlocks can be preempted! But how > > > > often does this happen? Also, does this happen on smaller systems, for > > > > example, with four or eight CPUs? And I confess to be a bit surprised > > > > that you expect real-time response from a guest that is subject to > > > > preemption -- as I understand it, the usual approach is to give RT guests > > > > their own CPUs. > > > > > > > > Or am I missing something? > > > > > > We are trying to avoid relying on the guest VCPU to voluntarily yield > > > the CPU therefore allowing the critical services (such as rcu callback > > > processing and sched tick processing) to execute. > > > > These critical services executing in the context of the host? > > (If not, I am confused. Actually, I am confused either way...) > > The host. Imagine a Windows 95 guest running a realtime app. > That should help. Then force the critical services to run on a housekeeping CPU. If the host is permitted to preempt the guest, the latency blows you are seeing are expected behavior. > > > > > We've tried playing with the rcu_nocbs= option. However, it > > > > > did not help because, for reasons we don't understand, the rcuc > > > > > threads have to handle grace period start even when callback > > > > > offloading is used. Handling this case requires this code path > > > > > to be executed. > > > > > > > > Yep. The rcu_nocbs= option offloads invocation of RCU callbacks, but not > > > > the per-CPU work required to inform RCU of quiescent states. > > > > > > Can't you execute that on vCPU entry/exit? Those are quiescent states > > > after all. > > > > I am guessing that we are talking about quiescent states in the guest. > > Host. > > > If so, can't vCPU entry/exit operations happen in guest interrupt > > handlers? If so, these operations are not necessarily quiescent states. 
> > vCPU entry/exit are quiescent states in the host. As is execution in the guest. If you build the host with NO_HZ_FULL and boot with the appropriate nohz_full= parameter, this will happen automatically. If that is infeasible, then yes, it should be possible to add an explicit quiescent state in the host at vCPU entry/exit, at least assuming that the host is in a state permitting this. > > > > > We've cooked the following extremely dirty patch, just to see > > > > > what would happen: > > > > > > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > > > > > index eaed1ef..c0771cc 100644 > > > > > --- a/kernel/rcutree.c > > > > > +++ b/kernel/rcutree.c > > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) > > > > > /* Does this CPU require a not-yet-started grace period? */ > > > > > local_irq_save(flags); > > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > > - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > > > - rcu_start_gp(rsp); > > > > > - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > + for (;;) { > > > > > + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { > > > > > + local_irq_restore(flags); > > > > > + local_bh_enable(); > > > > > + schedule_timeout_interruptible(2); > > > > > > > > Yes, the above will get you a splat in mainline kernels, which do not > > > > necessarily push softirq processing to the ksoftirqd kthreads. ;-) > > > > > > > > > + local_bh_disable(); > > > > > + local_irq_save(flags); > > > > > + continue; > > > > > + } > > > > > + rcu_start_gp(rsp); > > > > > + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > + break; > > > > > + } > > > > > } else { > > > > > local_irq_restore(flags); > > > > > } > > > > > > > > > > With this patch rcuc is gone from our traces and the scheduling > > > > > latency is reduced by 3us in our CPU-bound test-case. > > > > > > > > > > Could you please advice on how to solve this contention problem? 
> > > > The usual advice would be to configure the system such that the guest's
> > > > VCPUs do not get preempted.
> > >
> > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> > > spinning). In that case, rcuc would never execute, because it has a
> > > lower priority than guest VCPUs.
> >
> > OK, this leads me to believe that you are talking about the rcuc kthreads
> > in the host, not the guest. In which case the usual approach is to
> > reserve a CPU or two on the host which never runs guest VCPUs, and to
> > force the rcuc kthreads there. Note that CONFIG_NO_HZ_FULL will do this
> > automatically for you, reserving the boot CPU. And CONFIG_NO_HZ_FULL
> > might well be very useful in this scenario. And reserving a CPU or two
> > for housekeeping purposes is quite common for heavy CPU-bound workloads.
> >
> > Of course, you need to make sure that the reserved CPU or two is sufficient
> > for all the rcuc kthreads, but if your guests are mostly CPU bound, this
> > should not be a problem.
> >
> > > I do not think we want that.
> >
> > Assuming "that" is "rcuc would never execute" -- agreed, that would be
> > very bad. You would eventually OOM the system.
> >
> > > > Or is the contention on the root rcu_node structure's ->lock field
> > > > high for some other reason?
> > >
> > > Luiz?
> > >
> > > > > Can we test whether the local CPU is nocb, and in that case,
> > > > > skip rcu_start_gp entirely for example?
> > > >
> > > > If you do that, you can see system hangs due to needed grace periods never
> > > > getting started.
> > >
> > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> > > necessary for nocb CPUs to execute rcu_start_gp?
> >
> > Sigh. Are we in the host or the guest OS at this point?
>
> Host.

Can you build the host with NO_HZ_FULL and boot with nohz_full=?
That should get rid of much of your problems here.
> > In any case, if you want the best real-time response for a CPU-bound
> > workload on a given CPU, careful use of NO_HZ_FULL would prevent
> > that CPU from ever invoking __rcu_process_callbacks() in the first
> > place, which would have the beneficial side effect of preventing
> > __rcu_process_callbacks() from ever invoking rcu_start_gp().
> >
> > Of course, NO_HZ_FULL does have the drawback of increasing the cost
> > of user-kernel transitions.
>
> We need periodic processing of __run_timers to keep timer wheel
> processing from falling behind too much.
>
> See http://www.gossamer-threads.com/lists/linux/kernel/2094151.

Hmmm... Do you have the following commits in your build?

fff421580f51 timers: Track total number of timers in list
d550e81dc0dd timers: Reduce __run_timers() latency for empty list
16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list
18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list
aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0

Keeping extraneous processing off of the CPUs running the real-time
guest will minimize the number of timers, allowing these commits to
do their jobs.

> > > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF?
> > > > If you are using a smaller value, it would be possible to rework the
> > > > code to reduce contention on ->lock, though if a VCPU does get preempted
> > > > while holding the root rcu_node structure's ->lock, life will be hard.
> > >
> > > It's a raw spinlock, isn't it?
> >
> > As I understand it, in a guest OS, that means nothing. The host can
> > preempt a guest even if that guest believes that it has interrupts
> > disabled, correct?
>
> Yes.

Then your only hope is to prevent the host (and other guests) from
preempting the real-time guest.
> > If we are talking about the host, then I have to ask what is causing
> > the high levels of contention on the root rcu_node structure's ->lock
> > field. (Which is the only rcu_node structure if you are using default
> > .config.)
> >
> > 							Thanx, Paul
>
> OK, great.
>
> Thanks a lot.

							Thanx, Paul
* Re: kernel-rt rcuc lock contention problem
  2015-01-28 18:55 ` Paul E. McKenney
@ 2015-01-29 17:06 ` Steven Rostedt
  2015-01-29 18:11 ` Paul E. McKenney
  2015-01-29 18:13 ` Marcelo Tosatti
  2015-02-02 18:24 ` Marcelo Tosatti
  2 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2015-01-29 17:06 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Marcelo Tosatti, Luiz Capitulino, linux-rt-users

On Wed, 28 Jan 2015 10:55:53 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> Then your only hope is to prevent the host (and other guests) from
> preempting the real-time guest.

Right!

I think there's a miscommunication here.

Basically what is needed is to run the RT guest on a CPU by itself. We
can all agree on that. That guest runs at a high priority where nothing
should preempt it. We should enable NO_HZ_FULL, and move as much off of
that CPU as possible (including rcu callbacks).

I'm not sure if the code does this or not, but I believe it does. When
we enter the guest, the host should be in an RCU quiescent state, where
RCU will ignore the CPU that is running the guest. Remember, we are only
talking about interactions of the host, not the workings of the guest.

Once this isolation happens, then the guest should be running in a
state that it could handle RT reaction times for its own processes (if
the guest OS supports it). The guest shouldn't be preempted by anything
unless it does something that requires a service (interacting with the
network or other baremetal device), then it will need to do the same
things that any RT task must do.

I think all this is feasible.

-- Steve
* Re: kernel-rt rcuc lock contention problem 2015-01-29 17:06 ` Steven Rostedt @ 2015-01-29 18:11 ` Paul E. McKenney 0 siblings, 0 replies; 23+ messages in thread From: Paul E. McKenney @ 2015-01-29 18:11 UTC (permalink / raw) To: Steven Rostedt; +Cc: Marcelo Tosatti, Luiz Capitulino, linux-rt-users On Thu, Jan 29, 2015 at 12:06:44PM -0500, Steven Rostedt wrote: > On Wed, 28 Jan 2015 10:55:53 -0800 > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > Then your only hope is to prevent the host (and other guests) from > > preempting the real-time guest. > > Right! > > I think there's a miscommunication here. I can easily believe that! > Basically what is needed is to run the RT guest on a CPU by itself. We > can all agree on that. That guest runs at a high priority where nothing > should preempt it. We should enable NO_HZ_FULL, and move as much off of > that CPU as possible (including rcu callbacks). > > I'm not sure if the code does this or not, but I believe it does. When > we enter the guest, the host should be in an RCU quiescent state, where > RCU will ignore the CPU that is running the guest. Remember, we are only > talking about interactions of the host, not the workings of the guest. NO_HZ_FULL will automatically tell RCU about the guest-execution quiescent state because the guest is seen by the host as user-mode execution. (Right? Or is KVM treating this specially such that RCU doesn't see guest execution as a quiescent state? I think this is currently handled correctly, because if it wasn't, you would get RCU CPU stall warning messages.) > Once this isolation happens, then the guest should be running in a > state that it could handle RT reaction times for its own processes (if > the guest OS supports it). The guest shouldn't be preempted by anything > unless it does something that requires a service (interacting with the > network or other baremetal device), then it will need to do the same > things that any RT task must do. Agreed! 
> I think all this is feasible.

The one thing that gives me pause is the high contention on the root
(AKA only) rcu_node structure's ->lock field. If this persists, one
thing to try would be to build with CONFIG_RCU_FANOUT_LEAF=8 (or 4).
If that helps, it would be worthwhile to do some tracing or lock
profiling to see about reducing the ->lock contention for the default
CONFIG_RCU_FANOUT_LEAF=16.

My first thought when I saw the high contention was to introduce funnel
locking for grace-period start, but that is unlikely to help in cases
where there is only one rcu_node structure. ;-)

							Thanx, Paul
* Re: kernel-rt rcuc lock contention problem
  2015-01-28 18:55 ` Paul E. McKenney
  2015-01-29 17:06 ` Steven Rostedt
@ 2015-01-29 18:13 ` Marcelo Tosatti
  2015-01-29 18:36 ` Paul E. McKenney
  2015-02-02 18:24 ` Marcelo Tosatti
  2 siblings, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2015-01-29 18:13 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Luiz Capitulino, linux-rt-users

On Wed, Jan 28, 2015 at 10:55:53AM -0800, Paul E. McKenney wrote:
> > The host. Imagine a Windows 95 guest running a realtime app.
> > That should help.
>
> Then force the critical services to run on a housekeeping CPU. If the
> host is permitted to preempt the guest, the latency blows you are seeing
> are expected behavior.

ksoftirqd must preempt the vcpu as it executes irq_work
routines for example.

IRQ threads must preempt the vcpu to inject HW interrupts
to the guest.

> automatically. If that is infeasible, then yes, it should be possible
> to add an explicit quiescent state in the host at vCPU entry/exit, at
> least assuming that the host is in a state permitting this.
>
> > > > > > We've cooked the following extremely dirty patch, just to see
> > > > > > what would happen:
> > > > > >
> > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > > > index eaed1ef..c0771cc 100644
> > > > > > --- a/kernel/rcutree.c
> > > > > > +++ b/kernel/rcutree.c
> > > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > > > > >  	/* Does this CPU require a not-yet-started grace period? */
> > > > > >  	local_irq_save(flags);
> > > > > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > > -		rcu_start_gp(rsp);
> > > > > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > +		for (;;) {
> > > > > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > > > > +				local_irq_restore(flags);
> > > > > > +				local_bh_enable();
> > > > > > +				schedule_timeout_interruptible(2);
> > > > >
> > > > > Yes, the above will get you a splat in mainline kernels, which do not
> > > > > necessarily push softirq processing to the ksoftirqd kthreads. ;-)
> > > > >
> > > > > > +			local_bh_disable();
> > > > > > +			local_irq_save(flags);
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		rcu_start_gp(rsp);
> > > > > > +		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > +		break;
> > > > > > +	}
> > > > > >  	} else {
> > > > > >  		local_irq_restore(flags);
> > > > > >  	}
> > > > > >
> > > > > > With this patch rcuc is gone from our traces and the scheduling
> > > > > > latency is reduced by 3us in our CPU-bound test-case.
> > > > > >
> > > > > > Could you please advice on how to solve this contention problem?
> > > > >
> > > > > The usual advice would be to configure the system such that the guest's
> > > > > VCPUs do not get preempted.
> > > >
> > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> > > > spinning). In that case, rcuc would never execute, because it has a
> > > > lower priority than guest VCPUs.
> > >
> > > OK, this leads me to believe that you are talking about the rcuc kthreads
> > > in the host, not the guest. In which case the usual approach is to
> > > reserve a CPU or two on the host which never runs guest VCPUs, and to
> > > force the rcuc kthreads there. Note that CONFIG_NO_HZ_FULL will do this
> > > automatically for you, reserving the boot CPU. And CONFIG_NO_HZ_FULL
> > > might well be very useful in this scenario. And reserving a CPU or two
> > > for housekeeping purposes is quite common for heavy CPU-bound workloads.
> > > > > > Of course, you need to make sure that the reserved CPU or two is sufficient > > > for all the rcuc kthreads, but if your guests are mostly CPU bound, this > > > should not be a problem. > > > > > > > I do not think we want that. > > > > > > Assuming "that" is "rcuc would never execute" -- agreed, that would be > > > very bad. You would eventually OOM the system. > > > > > > > > Or is the contention on the root rcu_node structure's ->lock field > > > > > high for some other reason? > > > > > > > > Luiz? > > > > > > > > > > Can we test whether the local CPU is nocb, and in that case, > > > > > > skip rcu_start_gp entirely for example? > > > > > > > > > > If you do that, you can see system hangs due to needed grace periods never > > > > > getting started. > > > > > > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it > > > > necessary for nocb CPUs to execute rcu_start_gp? > > > > > > Sigh. Are we in the host or the guest OS at this point? > > > > Host. > > Can you build the host with NO_HZ_FULL and boot with nohz_full=? > That should get rid of of much of your problems here. > > > > In any case, if you want the best real-time response for a CPU-bound > > > workload on a given CPU, careful use of NO_HZ_FULL would prevent > > > that CPU from ever invoking __rcu_process_callbacks() in the first > > > place, which would have the beneficial side effect of preventing > > > __rcu_process_callbacks() from ever invoking rcu_start_gp(). > > > > > > Of course, NO_HZ_FULL does have the drawback of increasing the cost > > > of user-kernel transitions. > > > > We need periodic processing of __run_timers to keep timer wheel > > processing from falling behind too much. > > > > See http://www.gossamer-threads.com/lists/linux/kernel/2094151. > > Hmmm... Do you have the following commits in your build? 
> fff421580f51 timers: Track total number of timers in list
> d550e81dc0dd timers: Reduce __run_timers() latency for empty list
> 16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list
> 18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list
> aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0
>
> Keeping extraneous processing off of the CPUs running the real-time
> guest will minimize the number of timers, allowing these commits to
> do their jobs.

Clocksource watchdog:

	/*
	 * Cycle through CPUs to check if the CPUs stay synchronized
	 * to each other.
	 */
	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
	if (next_cpu >= nr_cpu_ids)
		next_cpu = cpumask_first(cpu_online_mask);
	watchdog_timer.expires += WATCHDOG_INTERVAL;
	add_timer_on(&watchdog_timer, next_cpu);

OK to disable...

MCE:

   2   1317  ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_timer_fn>>
             add_timer_on(t, smp_processor_id());
   3   1335  ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_timer_kick>>
             add_timer_on(t, smp_processor_id());
   4   1657  ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_start_timer>>
             add_timer_on(t, cpu);

Unsure how realistic the expectation to be able to exclude add_timer_on
and queue_delayed_work_on users is.

NOK to disable, I suppose.
* Re: kernel-rt rcuc lock contention problem 2015-01-29 18:13 ` Marcelo Tosatti @ 2015-01-29 18:36 ` Paul E. McKenney 0 siblings, 0 replies; 23+ messages in thread From: Paul E. McKenney @ 2015-01-29 18:36 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Luiz Capitulino, linux-rt-users On Thu, Jan 29, 2015 at 04:13:24PM -0200, Marcelo Tosatti wrote: > > On Wed, Jan 28, 2015 at 10:55:53AM -0800, Paul E. McKenney wrote: > > > The host. Imagine a Windows 95 guest running a realtime app. > > > That should help. > > > > Then force the critical services to run on a housekeeping CPU. If the > > host is permitted to preempt the guest, the latency blows you are seeing > > are expected behavior. > > ksoftirqd must preempt the vcpu as it executes irq_work > routines for example. > > IRQ threads must preempt the vcpu to inject HW interrupts > to the guest. Understood, and hopefully these short preemptions are not causing excessive trouble. And my concern with this was partly due to my assumption that you were seeing high lock contention in the guest. > > automatically. If that is infeasible, then yes, it should be possible > > to add an explicit quiescent state in the host at vCPU entry/exit, at > > least assuming that the host is in a state permitting this. > > > > > > > > > We've cooked the following extremely dirty patch, just to see > > > > > > > what would happen: > > > > > > > > > > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > > > > > > > index eaed1ef..c0771cc 100644 > > > > > > > --- a/kernel/rcutree.c > > > > > > > +++ b/kernel/rcutree.c > > > > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) > > > > > > > /* Does this CPU require a not-yet-started grace period? */ > > > > > > > local_irq_save(flags); > > > > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > > > > - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. 
*/ > > > > > > > - rcu_start_gp(rsp); > > > > > > > - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > > > + for (;;) { > > > > > > > + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { > > > > > > > + local_irq_restore(flags); > > > > > > > + local_bh_enable(); > > > > > > > + schedule_timeout_interruptible(2); > > > > > > > > > > > > Yes, the above will get you a splat in mainline kernels, which do not > > > > > > necessarily push softirq processing to the ksoftirqd kthreads. ;-) > > > > > > > > > > > > > + local_bh_disable(); > > > > > > > + local_irq_save(flags); > > > > > > > + continue; > > > > > > > + } > > > > > > > + rcu_start_gp(rsp); > > > > > > > + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > > > + break; > > > > > > > + } > > > > > > > } else { > > > > > > > local_irq_restore(flags); > > > > > > > } > > > > > > > > > > > > > > With this patch rcuc is gone from our traces and the scheduling > > > > > > > latency is reduced by 3us in our CPU-bound test-case. > > > > > > > > > > > > > > Could you please advice on how to solve this contention problem? > > > > > > > > > > > > The usual advice would be to configure the system such that the guest's > > > > > > VCPUs do not get preempted. > > > > > > > > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy > > > > > spinning). In that case, rcuc would never execute, because it has a > > > > > lower priority than guest VCPUs. > > > > > > > > OK, this leads me to believe that you are talking about the rcuc kthreads > > > > in the host, not the guest. In which case the usual approach is to > > > > reserve a CPU or two on the host which never runs guest VCPUs, and to > > > > force the rcuc kthreads there. Note that CONFIG_NO_HZ_FULL will do this > > > > automatically for you, reserving the boot CPU. And CONFIG_NO_HZ_FULL > > > > might well be very useful in this scenario. 
And reserving a CPU or two > > > > for housekeeping purposes is quite common for heavy CPU-bound workloads. > > > > > > > > Of course, you need to make sure that the reserved CPU or two is sufficient > > > > for all the rcuc kthreads, but if your guests are mostly CPU bound, this > > > > should not be a problem. > > > > > > > > > I do not think we want that. > > > > > > > > Assuming "that" is "rcuc would never execute" -- agreed, that would be > > > > very bad. You would eventually OOM the system. > > > > > > > > > > Or is the contention on the root rcu_node structure's ->lock field > > > > > > high for some other reason? > > > > > > > > > > Luiz? > > > > > > > > > > > > Can we test whether the local CPU is nocb, and in that case, > > > > > > > skip rcu_start_gp entirely for example? > > > > > > > > > > > > If you do that, you can see system hangs due to needed grace periods never > > > > > > getting started. > > > > > > > > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it > > > > > necessary for nocb CPUs to execute rcu_start_gp? > > > > > > > > Sigh. Are we in the host or the guest OS at this point? > > > > > > Host. > > > > Can you build the host with NO_HZ_FULL and boot with nohz_full=? > > That should get rid of of much of your problems here. > > > > > > In any case, if you want the best real-time response for a CPU-bound > > > > workload on a given CPU, careful use of NO_HZ_FULL would prevent > > > > that CPU from ever invoking __rcu_process_callbacks() in the first > > > > place, which would have the beneficial side effect of preventing > > > > __rcu_process_callbacks() from ever invoking rcu_start_gp(). > > > > > > > > Of course, NO_HZ_FULL does have the drawback of increasing the cost > > > > of user-kernel transitions. > > > > > > We need periodic processing of __run_timers to keep timer wheel > > > processing from falling behind too much. > > > > > > See http://www.gossamer-threads.com/lists/linux/kernel/2094151. > > > > Hmmm... 
Do you have the following commits in your build? > > > > fff421580f51 timers: Track total number of timers in list > > d550e81dc0dd timers: Reduce __run_timers() latency for empty list > > 16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list > > 18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list > > aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0 > > > > Keeping extraneous processing off of the CPUs running the real-time > > guest will minimize the number of timers, allowing these commits to > > do their jobs. > > Clocksource watchdog: > > /* > * Cycle through CPUs to check if the CPUs stay synchronized > * to each other. > */ > next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask); > if (next_cpu >= nr_cpu_ids) > next_cpu = cpumask_first(cpu_online_mask); > watchdog_timer.expires += WATCHDOG_INTERVAL; > add_timer_on(&watchdog_timer, next_cpu); > > OK to disable... I have to defer to John Stultz and Thomas Gleixner on this one. > MCE: > > 2 1317 ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_timer_fn>> > add_timer_on(t, smp_processor_id()); > 3 1335 ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_timer_kick>> > add_timer_on(t, smp_processor_id()); > 4 1657 ../../arch/x86/kernel/cpu/mcheck/mce.c <<mce_start_timer>> > add_timer_on(t, cpu); > > Unsure how realistic the expectation to be able to exclude add_timer_on > and queue_delayed_work_on users is. > > NOK to disable, i suppose. And I must defer to x86 MCE experts on this one. Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-01-28 18:55 ` Paul E. McKenney 2015-01-29 17:06 ` Steven Rostedt 2015-01-29 18:13 ` Marcelo Tosatti @ 2015-02-02 18:24 ` Marcelo Tosatti 2015-02-02 20:35 ` Steven Rostedt 2 siblings, 1 reply; 23+ messages in thread From: Marcelo Tosatti @ 2015-02-02 18:24 UTC (permalink / raw) To: Paul E. McKenney, Steven Rostedt, Steven Rostedt Cc: Luiz Capitulino, linux-rt-users On Wed, Jan 28, 2015 at 10:55:53AM -0800, Paul E. McKenney wrote: > On Wed, Jan 28, 2015 at 04:25:12PM -0200, Marcelo Tosatti wrote: > > On Wed, Jan 28, 2015 at 10:03:35AM -0800, Paul E. McKenney wrote: > > > On Tue, Jan 27, 2015 at 11:55:08PM -0200, Marcelo Tosatti wrote: > > > > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote: > > > > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote: > > > > > > Paul, > > > > > > > > > > > > We're running some measurements with cyclictest running inside a > > > > > > KVM guest where we could observe spinlock contention among rcuc > > > > > > threads. > > > > > > > > > > > > Basically, we have a 16-CPU NUMA machine very well setup for RT. > > > > > > This machine and the guest run the RT kernel. As our test-case > > > > > > requires an application in the guest taking 100% of the CPU, the > > > > > > RT priority configuration that gives the best latency is this one: > > > > > > > > > > > > 263 FF 3 [rcuc/15] > > > > > > 13 FF 3 [rcub/1] > > > > > > 12 FF 3 [rcub/0] > > > > > > 265 FF 2 [ksoftirqd/15] > > > > > > 3181 FF 1 qemu-kvm > > > > > > > > > > > > In this configuration, the rcuc can preempt the guest's vcpu > > > > > > thread. This shouldn't be a problem, except for the fact that > > > > > > we're seeing that in some cases the rcuc/15 thread spends 10us > > > > > > or more spinning in this spinlock (note that IRQs are disabled > > > > > > during this period): > > > > > > > > > > > > __rcu_process_callbacks() > > > > > > { > > > > > > ... 
> > > > > > local_irq_save(flags); > > > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > > > raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > > > > rcu_start_gp(rsp); > > > > > > raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > > ... > > > > > > > > > > Life can be hard when irq-disabled spinlocks can be preempted! But how > > > > > often does this happen? Also, does this happen on smaller systems, for > > > > > example, with four or eight CPUs? And I confess to be a bit surprised > > > > > that you expect real-time response from a guest that is subject to > > > > > preemption -- as I understand it, the usual approach is to give RT guests > > > > > their own CPUs. > > > > > > > > > > Or am I missing something? > > > > > > > > We are trying to avoid relying on the guest VCPU to voluntarily yield > > > > the CPU therefore allowing the critical services (such as rcu callback > > > > processing and sched tick processing) to execute. > > > > > > These critical services executing in the context of the host? > > > (If not, I am confused. Actually, I am confused either way...) > > > > The host. Imagine a Windows 95 guest running a realtime app. > > That should help. > > Then force the critical services to run on a housekeeping CPU. If the > host is permitted to preempt the guest, the latency blows you are seeing > are expected behavior. > > > > > > > We've tried playing with the rcu_nocbs= option. However, it > > > > > > did not help because, for reasons we don't understand, the rcuc > > > > > > threads have to handle grace period start even when callback > > > > > > offloading is used. Handling this case requires this code path > > > > > > to be executed. > > > > > > > > > > Yep. The rcu_nocbs= option offloads invocation of RCU callbacks, but not > > > > > the per-CPU work required to inform RCU of quiescent states. > > > > > > > > Can't you execute that on vCPU entry/exit? Those are quiescent states > > > > after all. 
> > > > > > I am guessing that we are talking about quiescent states in the guest. > > > > Host. > > > > > If so, can't vCPU entry/exit operations happen in guest interrupt > > > handlers? If so, these operations are not necessarily quiescent states. > > > > vCPU entry/exit are quiescent states in the host. > > As is execution in the guest. If you build the host with NO_HZ_FULL > and boot with the appropriate nohz_full= parameter, this will happen > automatically. If that is infeasible, then yes, it should be possible > to add an explicit quiescent state in the host at vCPU entry/exit, at > least assuming that the host is in a state permitting this. > > > > > > > We've cooked the following extremely dirty patch, just to see > > > > > > what would happen: > > > > > > > > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > > > > > > index eaed1ef..c0771cc 100644 > > > > > > --- a/kernel/rcutree.c > > > > > > +++ b/kernel/rcutree.c > > > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp) > > > > > > /* Does this CPU require a not-yet-started grace period? */ > > > > > > local_irq_save(flags); > > > > > > if (cpu_needs_another_gp(rsp, rdp)) { > > > > > > - raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ > > > > > > - rcu_start_gp(rsp); > > > > > > - raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > > + for (;;) { > > > > > > + if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) { > > > > > > + local_irq_restore(flags); > > > > > > + local_bh_enable(); > > > > > > + schedule_timeout_interruptible(2); > > > > > > > > > > Yes, the above will get you a splat in mainline kernels, which do not > > > > > necessarily push softirq processing to the ksoftirqd kthreads. 
;-) > > > > > > > > > > > + local_bh_disable(); > > > > > > + local_irq_save(flags); > > > > > > + continue; > > > > > > + } > > > > > > + rcu_start_gp(rsp); > > > > > > + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); > > > > > > + break; > > > > > > + } > > > > > > } else { > > > > > > local_irq_restore(flags); > > > > > > } > > > > > > > > > > > > With this patch rcuc is gone from our traces and the scheduling > > > > > > latency is reduced by 3us in our CPU-bound test-case. > > > > > > > > > > > > Could you please advice on how to solve this contention problem? > > > > > > > > > > The usual advice would be to configure the system such that the guest's > > > > > VCPUs do not get preempted. > > > > > > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy > > > > spinning). In that case, rcuc would never execute, because it has a > > > > lower priority than guest VCPUs. > > > > > > OK, this leads me to believe that you are talking about the rcuc kthreads > > > in the host, not the guest. In which case the usual approach is to > > > reserve a CPU or two on the host which never runs guest VCPUs, and to > > > force the rcuc kthreads there. Note that CONFIG_NO_HZ_FULL will do this > > > automatically for you, reserving the boot CPU. And CONFIG_NO_HZ_FULL > > > might well be very useful in this scenario. And reserving a CPU or two > > > for housekeeping purposes is quite common for heavy CPU-bound workloads. > > > > > > Of course, you need to make sure that the reserved CPU or two is sufficient > > > for all the rcuc kthreads, but if your guests are mostly CPU bound, this > > > should not be a problem. > > > > > > > I do not think we want that. > > > > > > Assuming "that" is "rcuc would never execute" -- agreed, that would be > > > very bad. You would eventually OOM the system. > > > > > > > > Or is the contention on the root rcu_node structure's ->lock field > > > > > high for some other reason? > > > > > > > > Luiz? 
> > > > > > > > > > Can we test whether the local CPU is nocb, and in that case, > > > > > > skip rcu_start_gp entirely for example? > > > > > > > > > > If you do that, you can see system hangs due to needed grace periods never > > > > > getting started. > > > > > > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it > > > > necessary for nocb CPUs to execute rcu_start_gp? > > > > > > Sigh. Are we in the host or the guest OS at this point? > > > > Host. > > Can you build the host with NO_HZ_FULL and boot with nohz_full=? > That should get rid of of much of your problems here. > > > > In any case, if you want the best real-time response for a CPU-bound > > > workload on a given CPU, careful use of NO_HZ_FULL would prevent > > > that CPU from ever invoking __rcu_process_callbacks() in the first > > > place, which would have the beneficial side effect of preventing > > > __rcu_process_callbacks() from ever invoking rcu_start_gp(). > > > > > > Of course, NO_HZ_FULL does have the drawback of increasing the cost > > > of user-kernel transitions. > > > > We need periodic processing of __run_timers to keep timer wheel > > processing from falling behind too much. > > > > See http://www.gossamer-threads.com/lists/linux/kernel/2094151. > > Hmmm... Do you have the following commits in your build? > > fff421580f51 timers: Track total number of timers in list > d550e81dc0dd timers: Reduce __run_timers() latency for empty list > 16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list > 18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list > aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0 > > Keeping extraneous processing off of the CPUs running the real-time > guest will minimize the number of timers, allowing these commits to > do their jobs. 
Steven,

The second commit, d550e81dc0dd, should be part of -RT, and currently is
not, because:

-> Any IRQ work item will raise the timer softirq.
-> __run_timers will then do a full round of processing, ruining latency,
   even without any timer pending on the timer wheel.

And about NO_HZ_FULL and -RT: is it correct that NO_HZ_FULL renders

commit 1a2de830b90e364c3bf95e0000173bffcb65ddb7
Author: Steven Rostedt <rostedt@goodmis.org>
Date:   Fri Jan 31 12:07:57 2014 -0500

    timer/rt: Always raise the softirq if there's irq_work to be done

inactive? Should the softirq be raised from irq_work_queue directly?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-02-02 18:24 ` Marcelo Tosatti @ 2015-02-02 20:35 ` Steven Rostedt 2015-02-02 20:46 ` Marcelo Tosatti 0 siblings, 1 reply; 23+ messages in thread From: Steven Rostedt @ 2015-02-02 20:35 UTC (permalink / raw) To: Marcelo Tosatti Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users On Mon, 2 Feb 2015 16:24:50 -0200 Marcelo Tosatti <mtosatti@redhat.com> wrote: > > > > In any case, if you want the best real-time response for a CPU-bound > > > > workload on a given CPU, careful use of NO_HZ_FULL would prevent > > > > that CPU from ever invoking __rcu_process_callbacks() in the first > > > > place, which would have the beneficial side effect of preventing > > > > __rcu_process_callbacks() from ever invoking rcu_start_gp(). > > > > > > > > Of course, NO_HZ_FULL does have the drawback of increasing the cost > > > > of user-kernel transitions. > > > > > > We need periodic processing of __run_timers to keep timer wheel > > > processing from falling behind too much. > > > > > > See http://www.gossamer-threads.com/lists/linux/kernel/2094151. > > > > Hmmm... Do you have the following commits in your build? > > > > fff421580f51 timers: Track total number of timers in list > > d550e81dc0dd timers: Reduce __run_timers() latency for empty list > > 16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list > > 18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list > > aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0 > > > > Keeping extraneous processing off of the CPUs running the real-time > > guest will minimize the number of timers, allowing these commits to > > do their jobs. > > Steven, > > The second commit, d550e81dc0dd should be part of -RT, and currently is > not, because: > > -> Any IRQ work item will raise timer softirq. > -> __run_timers will do a full round of processing, > ruining latency. Was this discussed? 
> > Even without any timer pending on the timer wheel. > > And about NO_HZ_FULL and -RT, is it correct that NO_HZ_FULL > renders > > commit 1a2de830b90e364c3bf95e0000173bffcb65ddb7 > Author: Steven Rostedt <rostedt@goodmis.org> > Date: Fri Jan 31 12:07:57 2014 -0500 > > timer/rt: Always raise the softirq if there's irq_work to be done > > Inactive? Should raise softirq from irq_work_queue directly? What do you mean raise from irq_work_queue directly? When irq work needs to be done, that usually is because something happened in a context that you can not wake up a process (like raise_softirq might do). The irq_work itself could raise the softirq, but as it takes the softirq to trigger the irq_work you are stuck in a catch 22 there. -- Steve ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-02-02 20:35 ` Steven Rostedt @ 2015-02-02 20:46 ` Marcelo Tosatti 2015-02-02 20:55 ` Steven Rostedt 0 siblings, 1 reply; 23+ messages in thread From: Marcelo Tosatti @ 2015-02-02 20:46 UTC (permalink / raw) To: Steven Rostedt Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users On Mon, Feb 02, 2015 at 03:35:53PM -0500, Steven Rostedt wrote: > On Mon, 2 Feb 2015 16:24:50 -0200 > Marcelo Tosatti <mtosatti@redhat.com> wrote: > > > > > > In any case, if you want the best real-time response for a CPU-bound > > > > > workload on a given CPU, careful use of NO_HZ_FULL would prevent > > > > > that CPU from ever invoking __rcu_process_callbacks() in the first > > > > > place, which would have the beneficial side effect of preventing > > > > > __rcu_process_callbacks() from ever invoking rcu_start_gp(). > > > > > > > > > > Of course, NO_HZ_FULL does have the drawback of increasing the cost > > > > > of user-kernel transitions. > > > > > > > > We need periodic processing of __run_timers to keep timer wheel > > > > processing from falling behind too much. > > > > > > > > See http://www.gossamer-threads.com/lists/linux/kernel/2094151. > > > > > > Hmmm... Do you have the following commits in your build? > > > > > > fff421580f51 timers: Track total number of timers in list > > > d550e81dc0dd timers: Reduce __run_timers() latency for empty list > > > 16d937f88031 timers: Reduce future __run_timers() latency for newly emptied list > > > 18d8cb64c9c0 timers: Reduce future __run_timers() latency for first add to empty list > > > aea369b959be timers: Make internal_add_timer() update ->next_timer if ->active_timers == 0 > > > > > > Keeping extraneous processing off of the CPUs running the real-time > > > guest will minimize the number of timers, allowing these commits to > > > do their jobs. 
> > > > Steven, > > > > The second commit, d550e81dc0dd should be part of -RT, and currently is > > not, because: > > > > -> Any IRQ work item will raise timer softirq. > > -> __run_timers will do a full round of processing, > > ruining latency. > > Was this discussed? Discussed where? The point is this: __run_timers has horrible latency. How to avoid it: configure the system in such a way that no timers (old interface, add_timers) expire on the local CPU. The patches Paul listed above limit the issue allowing you to call raise_softirq(TIMER_SOFTIRQ) without having to go through __run_timers, in the case of no pending timers. > > Even without any timer pending on the timer wheel. > > > > And about NO_HZ_FULL and -RT, is it correct that NO_HZ_FULL > > renders > > > > commit 1a2de830b90e364c3bf95e0000173bffcb65ddb7 > > Author: Steven Rostedt <rostedt@goodmis.org> > > Date: Fri Jan 31 12:07:57 2014 -0500 > > > > timer/rt: Always raise the softirq if there's irq_work to be done > > > > Inactive? Should raise softirq from irq_work_queue directly? > > What do you mean raise from irq_work_queue directly? When irq work > needs to be done, that usually is because something happened in a > context that you can not wake up a process (like raise_softirq might > do). The irq_work itself could raise the softirq, but as it takes the > softirq to trigger the irq_work you are stuck in a catch 22 there. Then you rely on the sched timer interrupt to notice there is a pending irq_work item? If you have no sched timer interrupts, then what happens with that irq_work item? > > -- Steve ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem
  2015-02-02 20:46 ` Marcelo Tosatti
@ 2015-02-02 20:55 ` Steven Rostedt
  2015-02-02 21:02 ` Marcelo Tosatti
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2015-02-02 20:55 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users

On Mon, 2 Feb 2015 18:46:59 -0200
Marcelo Tosatti <mtosatti@redhat.com> wrote:

> > > The second commit, d550e81dc0dd should be part of -RT, and currently is
> > > not, because:
> > >
> > > -> Any IRQ work item will raise timer softirq.
> > > -> __run_timers will do a full round of processing,
> > > ruining latency.
> >
> > Was this discussed?
>
> Discussed where?

It sounded like that commit was not added because of the above. That's
why I asked whether it was discussed. It sounded like you were saying
that commit d550e81dc0dd should be part of -RT but it is not because ...,
which sounds like there were some decisions made.

> The point is this: __run_timers has horrible latency.
> How to avoid it: configure the system in such a way that no timers
> (old interface, add_timers) expire on the local CPU.
>
> The patches Paul listed above limit the issue, allowing
> you to call raise_softirq(TIMER_SOFTIRQ) without having to go
> through __run_timers, in the case of no pending timers.

OK, so you are asking me to add those patches?

> Then you rely on the sched timer interrupt to notice there is a pending
> irq_work item?

On x86, there shouldn't be. irq_work can usually trigger its own
interrupt. In the case that it can not, it requires the softirq to
trigger when there's irq work to be done.

> If you have no sched timer interrupts, then what happens with that
> irq_work item?

That's what that patch does. It should trigger some.
Hmm, I have to see if no_hz_full checks irq work too.

But again, if there's no irq_work to do then this should not matter. If
there's irq_work to do, then something on that CPU asked to do irq
work.

If you are worried about run_timers, make sure nothing is on that
CPU that can trigger a timer. Isolation is the *only* way to make
that work.

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-02-02 20:55 ` Steven Rostedt @ 2015-02-02 21:02 ` Marcelo Tosatti 2015-02-03 20:36 ` Steven Rostedt 0 siblings, 1 reply; 23+ messages in thread From: Marcelo Tosatti @ 2015-02-02 21:02 UTC (permalink / raw) To: Steven Rostedt Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users On Mon, Feb 02, 2015 at 03:55:28PM -0500, Steven Rostedt wrote: > On Mon, 2 Feb 2015 18:46:59 -0200 > Marcelo Tosatti <mtosatti@redhat.com> wrote: > > > > > > The second commit, d550e81dc0dd should be part of -RT, and currently is > > > > not, because: > > > > > > > > -> Any IRQ work item will raise timer softirq. > > > > -> __run_timers will do a full round of processing, > > > > ruining latency. > > > > > > Was this discussed? > > > > Discussed where? > > It sounded like that commit was not added because of the above. That's > why I asked, was it discussed. Sounded like you were saying that commit > d550e81dc0dd should be part of -RT but it is not because ..., which > sounds like there were some decisions made. > > > > > The point is this: __run_timers has horrible latency. > > How to avoid it: configure the system in such a way that no timers > > (old interface, add_timers) expire on the local CPU. > > > > The patches Paul listed above limit the issue allowing > > you to call raise_softirq(TIMER_SOFTIRQ) without having to go > > through __run_timers, in the case of no pending timers. > > OK, so you are asking for me to add those patches? Yes. > > Then you rely on the sched timer interrupt to notice there is a pending > > irq_work item? > > On, x86, there shouldn't be. irq_work can usually trigger its own > interrupt. In the case that it can not, it requires the softirq to > trigger when there's irq work to be done. > > > > > If you have no sched timer interrupts, then what happens with that > > irq_work item? > > > > That's what that patch does. It should trigger some. 
> Hmm, I have to see
> if no_hz_full checks irq work too.
>
> But again, if there's no irq_work to do then this should not matter. If
> there's irq_work to do, then something on that CPU asked to do irq
> work.

Right.

> If you are worried about run_timers, make sure nothing is on that
> CPU that can trigger a timer.

I am worried about two things:

1) Something calling raise_softirq(TIMER_SOFTIRQ) and lack of
   Paul's d550e81dc0dd.

   The result is __run_timers checking all timer wheel "nodes"
   and updating base->timer_jiffies; latency is ruined, even if one
   carefully made sure no timer is present.

2) Reliance on the sched timer interrupt to raise the timer softirq
   in case of pending irq work (your patch) AND no_hz_full.

> Isolation is the *only* way to make that work.

Fine. Please see item 1) above.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-02-02 21:02 ` Marcelo Tosatti @ 2015-02-03 20:36 ` Steven Rostedt 2015-02-03 20:57 ` Paul E. McKenney 2015-02-03 23:55 ` Marcelo Tosatti 0 siblings, 2 replies; 23+ messages in thread From: Steven Rostedt @ 2015-02-03 20:36 UTC (permalink / raw) To: Marcelo Tosatti Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users On Mon, 2 Feb 2015 19:02:29 -0200 Marcelo Tosatti <mtosatti@redhat.com> wrote: > I am worried about two things: > > 1) Something calling raise_softirq(TIMER_SOFTIRQ) and lack of > Paul's d550e81dc0dd. > > The result is __run_timers checking all timer wheel "nodes" > and updating base->timer_jiffies, latency is ruined. > > Even if one carefully made sure no timer is present. > > 2) Reliance on sched timer interrupt to raise timer softirq > in case of pending irq work (your patch) AND no_hz_full. > > > Isolation is the *only* way to make that work. > > Fine. Please see item 1) above. > So basically you are saying we just need: d550e81dc0dd ? -- Steve ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem 2015-02-03 20:36 ` Steven Rostedt @ 2015-02-03 20:57 ` Paul E. McKenney 2015-02-03 23:55 ` Marcelo Tosatti 1 sibling, 0 replies; 23+ messages in thread From: Paul E. McKenney @ 2015-02-03 20:57 UTC (permalink / raw) To: Steven Rostedt Cc: Marcelo Tosatti, Steven Rostedt, Luiz Capitulino, linux-rt-users On Tue, Feb 03, 2015 at 03:36:19PM -0500, Steven Rostedt wrote: > On Mon, 2 Feb 2015 19:02:29 -0200 > Marcelo Tosatti <mtosatti@redhat.com> wrote: > > > I am worried about two things: > > > > 1) Something calling raise_softirq(TIMER_SOFTIRQ) and lack of > > Paul's d550e81dc0dd. > > > > The result is __run_timers checking all timer wheel "nodes" > > and updating base->timer_jiffies, latency is ruined. > > > > Even if one carefully made sure no timer is present. > > > > 2) Reliance on sched timer interrupt to raise timer softirq > > in case of pending irq work (your patch) AND no_hz_full. > > > > > Isolation is the *only* way to make that work. > > > > Fine. Please see item 1) above. > > So basically you are saying we just need: d550e81dc0dd ? fff421580f51 is of course a prerequisite for d550e81dc0dd. Of the five related commits, these two are the most important, as they cover things for CPUs that never have any timers. The other three handle CPUs that occasionally have a timer or two. So you definitely need fff421580f51 and d550e81dc0dd. Less carefully tuned systems will benefit from 16d937f88031, 18d8cb64c9c0, and aea369b959be, but these last three are more in the nice-to-have category. Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: kernel-rt rcuc lock contention problem
  2015-02-03 20:36 ` Steven Rostedt
  2015-02-03 20:57 ` Paul E. McKenney
@ 2015-02-03 23:55 ` Marcelo Tosatti
  1 sibling, 0 replies; 23+ messages in thread
From: Marcelo Tosatti @ 2015-02-03 23:55 UTC (permalink / raw)
To: Steven Rostedt
Cc: Paul E. McKenney, Steven Rostedt, Luiz Capitulino, linux-rt-users

On Tue, Feb 03, 2015 at 03:36:19PM -0500, Steven Rostedt wrote:
> On Mon, 2 Feb 2015 19:02:29 -0200
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> > I am worried about two things:
> >
> > 1) Something calling raise_softirq(TIMER_SOFTIRQ) and lack of
> > Paul's d550e81dc0dd.
> >
> > The result is __run_timers checking all timer wheel "nodes"
> > and updating base->timer_jiffies, latency is ruined.
> >
> > Even if one carefully made sure no timer is present.
> >
> > 2) Reliance on sched timer interrupt to raise timer softirq
> > in case of pending irq work (your patch) AND no_hz_full.
> >
> > > Isolation is the *only* way to make that work.
> >
> > Fine. Please see item 1) above.
>
> So basically you are saying we just need: d550e81dc0dd ?

For 1), the 4 patches he mentioned, please.

For 2), it was just a hypothesis (perhaps fuelled by the fact that my
test box crashes with nohz_full= and isolated cpus).

^ permalink raw reply	[flat|nested] 23+ messages in thread
end of thread, other threads:[~2015-02-03 23:56 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-26 19:14 kernel-rt rcuc lock contention problem Luiz Capitulino
2015-01-27 20:37 ` Paul E. McKenney
2015-01-28  1:55 ` Marcelo Tosatti
2015-01-28 14:18 ` Luiz Capitulino
2015-01-28 18:09 ` Paul E. McKenney
2015-01-28 18:39 ` Luiz Capitulino
2015-01-28 19:00 ` Paul E. McKenney
2015-01-28 19:06 ` Luiz Capitulino
2015-01-28 18:03 ` Paul E. McKenney
2015-01-28 18:25 ` Marcelo Tosatti
2015-01-28 18:55 ` Paul E. McKenney
2015-01-29 17:06 ` Steven Rostedt
2015-01-29 18:11 ` Paul E. McKenney
2015-01-29 18:13 ` Marcelo Tosatti
2015-01-29 18:36 ` Paul E. McKenney
2015-02-02 18:24 ` Marcelo Tosatti
2015-02-02 20:35 ` Steven Rostedt
2015-02-02 20:46 ` Marcelo Tosatti
2015-02-02 20:55 ` Steven Rostedt
2015-02-02 21:02 ` Marcelo Tosatti
2015-02-03 20:36 ` Steven Rostedt
2015-02-03 20:57 ` Paul E. McKenney
2015-02-03 23:55 ` Marcelo Tosatti