linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH] powerpc: mitigate impact of decrementer reset
       [not found] <1412708517-84726-1-git-send-email-pc@us.ibm.com>
@ 2014-10-07 19:13 ` Paul Clarke
  2014-10-08  2:52   ` Michael Ellerman
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Paul Clarke @ 2014-10-07 19:13 UTC (permalink / raw)
  To: linuxppc-dev

The POWER ISA defines an always-running decrementer which can be used
to schedule interrupts after a certain time interval has elapsed.
The decrementer counts down at the same frequency as the Time Base,
which is 512 MHz.  The maximum value of the decrementer is 0x7fffffff.
This works out to a maximum interval of about 4.19 seconds.

If a larger interval is desired, the kernel will set the decrementer
to its maximum value and reset it after it expires (underflows)
a sufficient number of times until the desired interval has elapsed.

The negative effect of this is that an unwanted latency spike will
impact normal processing at most every 4.19 seconds.  On an IBM
POWER8-based system, this spike was measured at about 25-30
microseconds, much of which was basic, opportunistic housekeeping
tasks that could otherwise have waited.

This patch short-circuits the reset of the decrementer, exiting after
the decrementer reset, but before the housekeeping tasks if the only
need for the interrupt is simply to reset it.  After this patch,
the latency spike was measured at about 150 nanoseconds.

Signed-off-by: Paul A. Clarke <pc@us.ibm.com>
---
  arch/powerpc/kernel/time.c | 13 +++++++++++++
  1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 368ab37..962a06b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
  {
  	struct pt_regs *old_regs;
  	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+	u64 now;

  	/* Ensure a positive value is written to the decrementer, or else
  	 * some CPUs will continue to take decrementer exceptions.
@@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
  	 */
  	may_hard_irq_enable();

+	/* If this is simply the decrementer expiring (underflow) due to
+	 * the limited size of the decrementer, and not a set timer,
+	 * reset (if needed) and return
+	 */
+	now = get_tb_or_rtc();
+	if (now < *next_tb) {
+		now = *next_tb - now;
+		if (now <= DECREMENTER_MAX)
+			set_dec((int)now);
+		__get_cpu_var(irq_stat).timer_irqs_others++;
+		return;
+	}

  #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
  	if (atomic_read(&ppc_n_lost_interrupts) != 0)
-- 
2.1.2.330.g565301e
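
For reference, the interval arithmetic in the changelog can be checked with
a small user-space sketch.  It assumes the 512 MHz Time Base frequency and
the 0x7fffffff limit quoted above; the helper names dec_max_interval_us()
and dec_loads_needed() are made up for illustration and are not kernel
functions.  The load count is only a simplified model of how a long
interval gets split up.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the changelog above. */
#define TB_FREQ_HZ       512000000ULL   /* Time Base frequency: 512 MHz */
#define DECREMENTER_MAX  0x7fffffffULL  /* largest value the decrementer accepts */

/* Longest interval one decrementer load can cover, in microseconds. */
static uint64_t dec_max_interval_us(void)
{
	return DECREMENTER_MAX * 1000000ULL / TB_FREQ_HZ;
}

/*
 * Number of decrementer loads needed to cover `ticks` time base ticks
 * when each load covers at most DECREMENTER_MAX ticks.  Every load but
 * the last ends in a "reset only" interrupt.
 */
static uint64_t dec_loads_needed(uint64_t ticks)
{
	assert(ticks > 0);
	return (ticks + DECREMENTER_MAX - 1) / DECREMENTER_MAX;
}
```

With these values dec_max_interval_us() gives 4194303 (about 4.19 s), and a
10-second interval (10 * TB_FREQ_HZ ticks) needs 3 loads, two of which end
in a reset-only interrupt.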

* Re: powerpc: mitigate impact of decrementer reset
  2014-10-07 19:13 ` [PATCH] powerpc: mitigate impact of decrementer reset Paul Clarke
@ 2014-10-08  2:52   ` Michael Ellerman
  2014-10-08 10:27     ` Preeti U Murthy
  2014-11-05 17:06     ` Paul Clarke
  2014-10-08  5:37   ` [PATCH] " Heinz Wrobel
  2014-11-10 10:08   ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 14+ messages in thread
From: Michael Ellerman @ 2014-10-08  2:52 UTC (permalink / raw)
  To: Paul Clarke, linuxppc-dev

On Tue, 2014-10-07 at 19:13:24 UTC, Paul Clarke wrote:
> The POWER ISA defines an always-running decrementer which can be used
> to schedule interrupts after a certain time interval has elapsed.
> The decrementer counts down at the same frequency as the Time Base,
> which is 512 MHz.  The maximum value of the decrementer is 0x7fffffff.
> This works out to a maximum interval of about 4.19 seconds.
> 
> If a larger interval is desired, the kernel will set the decrementer
> to its maximum value and reset it after it expires (underflows)
> a sufficient number of times until the desired interval has elapsed.
> 
> The negative effect of this is that an unwanted latency spike will
> impact normal processing at most every 4.19 seconds.  On an IBM
> POWER8-based system, this spike was measured at about 25-30
> microseconds, much of which was basic, opportunistic housekeeping
> tasks that could otherwise have waited.
> 
> This patch short-circuits the reset of the decrementer, exiting after
> the decrementer reset, but before the housekeeping tasks if the only
> need for the interrupt is simply to reset it.  After this patch,
> the latency spike was measured at about 150 nanoseconds.

Hi Paul,

Thanks for the excellent changelog. But this patch makes me a bit nervous :)

Do you know where the latency is coming from? Is it primarily the irq work?

If so I'd prefer if we could move the short circuit into __timer_interrupt()
itself. That way we'd still have the trace points usable, and it would
hopefully result in less duplicated logic.

cheers

* RE: [PATCH] powerpc: mitigate impact of decrementer reset
  2014-10-07 19:13 ` [PATCH] powerpc: mitigate impact of decrementer reset Paul Clarke
  2014-10-08  2:52   ` Michael Ellerman
@ 2014-10-08  5:37   ` Heinz Wrobel
  2014-10-08 12:27     ` Paul Clarke
  2014-11-10 10:08   ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 14+ messages in thread
From: Heinz Wrobel @ 2014-10-08  5:37 UTC (permalink / raw)
  To: Paul Clarke, linuxppc-dev

Paul,

what if your tb wraps during the test?

> -----Original Message-----
> From: Linuxppc-dev
> [mailto:linuxppc-dev-bounces+heinz.wrobel=freescale.com@lists.ozlabs.org]
> On Behalf Of Paul Clarke
> Sent: Tuesday, October 07, 2014 21:13
> To: linuxppc-dev@lists.ozlabs.org
> Subject: [PATCH] powerpc: mitigate impact of decrementer reset
>
> The POWER ISA defines an always-running decrementer which can be used to
> schedule interrupts after a certain time interval has elapsed.
> The decrementer counts down at the same frequency as the Time Base, which
> is 512 MHz.  The maximum value of the decrementer is 0x7fffffff.
> This works out to a maximum interval of about 4.19 seconds.
>
> If a larger interval is desired, the kernel will set the decrementer to its
> maximum value and reset it after it expires (underflows) a sufficient number of
> times until the desired interval has elapsed.
>
> The negative effect of this is that an unwanted latency spike will impact normal
> processing at most every 4.19 seconds.  On an IBM POWER8-based system, this
> spike was measured at about 25-30 microseconds, much of which was basic,
> opportunistic housekeeping tasks that could otherwise have waited.
>
> This patch short-circuits the reset of the decrementer, exiting after the
> decrementer reset, but before the housekeeping tasks if the only need for the
> interrupt is simply to reset it.  After this patch, the latency spike was measured
> at about 150 nanoseconds.
>
> Signed-off-by: Paul A. Clarke <pc@us.ibm.com>
> ---
>   arch/powerpc/kernel/time.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
>
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 368ab37..962a06b 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
>   {
>   	struct pt_regs *old_regs;
>   	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
> +	u64 now;
>
>   	/* Ensure a positive value is written to the decrementer, or else
>   	 * some CPUs will continue to take decrementer exceptions.
> @@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
>   	 */
>   	may_hard_irq_enable();
>
> +	/* If this is simply the decrementer expiring (underflow) due to
> +	 * the limited size of the decrementer, and not a set timer,
> +	 * reset (if needed) and return
> +	 */
> +	now = get_tb_or_rtc();
> +	if (now < *next_tb) {

What if "now" and *next_tb are not on the same wrap count? They are both
modulo values AFACS.
Shouldn't this be right here more like a "if ((*next_tb - now) < 2^63)"
style test to check for deltas within the range instead of absolute values?

> +		now = *next_tb - now;
> +		if (now <= DECREMENTER_MAX)
> +			set_dec((int)now);
> +		__get_cpu_var(irq_stat).timer_irqs_others++;
> +		return;
> +	}
>
>   #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
>   	if (atomic_read(&ppc_n_lost_interrupts) != 0)
> --
> 2.1.2.330.g565301e

BR,

Heinz

* Re: powerpc: mitigate impact of decrementer reset
  2014-10-08  2:52   ` Michael Ellerman
@ 2014-10-08 10:27     ` Preeti U Murthy
  2014-11-05 17:06     ` Paul Clarke
  1 sibling, 0 replies; 14+ messages in thread
From: Preeti U Murthy @ 2014-10-08 10:27 UTC (permalink / raw)
  To: Michael Ellerman, Paul Clarke, linuxppc-dev

On 10/08/2014 08:22 AM, Michael Ellerman wrote:
> On Tue, 2014-10-07 at 19:13:24 UTC, Paul Clarke wrote:
>> The POWER ISA defines an always-running decrementer which can be used
>> to schedule interrupts after a certain time interval has elapsed.
>> The decrementer counts down at the same frequency as the Time Base,
>> which is 512 MHz.  The maximum value of the decrementer is 0x7fffffff.
>> This works out to a maximum interval of about 4.19 seconds.
>>
>> If a larger interval is desired, the kernel will set the decrementer
>> to its maximum value and reset it after it expires (underflows)
>> a sufficient number of times until the desired interval has elapsed.
>>
>> The negative effect of this is that an unwanted latency spike will
>> impact normal processing at most every 4.19 seconds.  On an IBM
>> POWER8-based system, this spike was measured at about 25-30
>> microseconds, much of which was basic, opportunistic housekeeping
>> tasks that could otherwise have waited.
>>
>> This patch short-circuits the reset of the decrementer, exiting after
>> the decrementer reset, but before the housekeeping tasks if the only
>> need for the interrupt is simply to reset it.  After this patch,
>> the latency spike was measured at about 150 nanoseconds.
> 
> Hi Paul,
> 
> Thanks for the excellent changelog. But this patch makes me a bit nervous :)
> 
> Do you know where the latency is coming from? Is it primarily the irq work?
> 
> If so I'd prefer if we could move the short circuit into __timer_interrupt()
> itself. That way we'd still have the trace points usable, and it would
> hopefully result in less duplicated logic.

I agree, this is perhaps the better approach.

Regards
Preeti U Murthy
> 
> cheers

* Re: [PATCH] powerpc: mitigate impact of decrementer reset
  2014-10-08  5:37   ` [PATCH] " Heinz Wrobel
@ 2014-10-08 12:27     ` Paul Clarke
  0 siblings, 0 replies; 14+ messages in thread
From: Paul Clarke @ 2014-10-08 12:27 UTC (permalink / raw)
  To: Heinz Wrobel, linuxppc-dev

On 10/08/2014 12:37 AM, Heinz Wrobel wrote:
> what if your tb wraps during the  test?

Per the Power ISA, Time Base is 64 bits, monotonically increasing, and 
is writable only in hypervisor state.  To my understanding, it is set to 
zero at boot (although this is not prescribed).

Also, as noted by others, the logic is roughly duplicated (with some 
differences) from the analogous code in __timer_interrupt just above it.

I don't see wrapping as a concern.

PC
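
To put numbers on the wrap question, here is a user-space sketch comparing
the absolute test used in the patch with a delta-based test of the kind
raised in the review.  The function names are hypothetical, and excluding
a zero delta (to mirror the strict "now < *next_tb") is an assumption of
this sketch rather than something stated in the thread.  The last helper
estimates how long a 64-bit counter starting at zero takes to wrap at the
512 MHz rate quoted in the changelog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TB_FREQ_HZ 512000000ULL  /* 512 MHz, as quoted in the changelog */

/* Absolute comparison, as in the patch: assumes neither value has wrapped. */
static bool before_absolute(uint64_t now, uint64_t next_tb)
{
	return now < next_tb;
}

/*
 * Delta comparison: next_tb counts as "in the future" when the
 * modulo-2^64 difference is a nonzero value below 2^63.
 */
static bool before_delta(uint64_t now, uint64_t next_tb)
{
	uint64_t delta = next_tb - now;  /* well defined: unsigned wraparound */

	return delta != 0 && delta < (1ULL << 63);
}

/* Whole 365-day years until a 64-bit counter starting at zero wraps. */
static uint64_t tb_years_until_wrap(void)
{
	const uint64_t secs_per_year = 365ULL * 24 * 60 * 60;

	return UINT64_MAX / TB_FREQ_HZ / secs_per_year;
}
```

The two tests agree whenever no wrap is involved and differ only when
next_tb has wrapped past zero and now has not.  At 512 MHz the wrap takes
on the order of 1142 years from zero, which is consistent with not
treating it as a practical concern here.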

* Re: powerpc: mitigate impact of decrementer reset
  2014-10-08  2:52   ` Michael Ellerman
  2014-10-08 10:27     ` Preeti U Murthy
@ 2014-11-05 17:06     ` Paul Clarke
  2014-11-13  2:39       ` Michael Ellerman
  1 sibling, 1 reply; 14+ messages in thread
From: Paul Clarke @ 2014-11-05 17:06 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev

Sorry it took me so long to get back to this...

On 10/07/2014 09:52 PM, Michael Ellerman wrote:
> On Tue, 2014-10-07 at 19:13:24 UTC, Paul Clarke wrote:
>> The POWER ISA defines an always-running decrementer which can be used
>> to schedule interrupts after a certain time interval has elapsed.
>> The decrementer counts down at the same frequency as the Time Base,
>> which is 512 MHz.  The maximum value of the decrementer is 0x7fffffff.
>> This works out to a maximum interval of about 4.19 seconds.
>>
>> If a larger interval is desired, the kernel will set the decrementer
>> to its maximum value and reset it after it expires (underflows)
>> a sufficient number of times until the desired interval has elapsed.
>>
>> The negative effect of this is that an unwanted latency spike will
>> impact normal processing at most every 4.19 seconds.  On an IBM
>> POWER8-based system, this spike was measured at about 25-30
>> microseconds, much of which was basic, opportunistic housekeeping
>> tasks that could otherwise have waited.
>>
>> This patch short-circuits the reset of the decrementer, exiting after
>> the decrementer reset, but before the housekeeping tasks if the only
>> need for the interrupt is simply to reset it.  After this patch,
>> the latency spike was measured at about 150 nanoseconds.

> Thanks for the excellent changelog. But this patch makes me a bit nervous :)
>
> Do you know where the latency is coming from? Is it primarily the irq work?

Yes, it is all under irq_enter (measured at ~10us) and irq_exit (~12us).

> If so I'd prefer if we could move the short circuit into __timer_interrupt()
> itself. That way we'd still have the trace points usable, and it would
> hopefully result in less duplicated logic.

But irq_enter and irq_exit are called in timer_interrupt, before 
__timer_interrupt is called.  I don't see how that helps.  The time 
spent in __timer_interrupt is minuscule by comparison.

Are you suggesting that irq_enter/exit be moved into __timer_interrupt 
as well?  (I'm not sure how that would impact the existing call to 
__timer_interrupt from tick_broadcast_ipi_handler?  And if there is no 
impact, what's the point of separating timer_interrupt and 
__timer_interrupt?)

Regards,
PC

* Re: [PATCH] powerpc: mitigate impact of decrementer reset
  2014-10-07 19:13 ` [PATCH] powerpc: mitigate impact of decrementer reset Paul Clarke
  2014-10-08  2:52   ` Michael Ellerman
  2014-10-08  5:37   ` [PATCH] " Heinz Wrobel
@ 2014-11-10 10:08   ` Benjamin Herrenschmidt
  2014-11-10 20:58     ` Paul Clarke
  2 siblings, 1 reply; 14+ messages in thread
From: Benjamin Herrenschmidt @ 2014-11-10 10:08 UTC (permalink / raw)
  To: Paul Clarke; +Cc: linuxppc-dev

On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
> The POWER ISA defines an always-running decrementer which can be used
> to schedule interrupts after a certain time interval has elapsed.
> The decrementer counts down at the same frequency as the Time Base,
> which is 512 MHz.  The maximum value of the decrementer is 0x7fffffff.
> This works out to a maximum interval of about 4.19 seconds.
> 
> If a larger interval is desired, the kernel will set the decrementer
> to its maximum value and reset it after it expires (underflows)
> a sufficient number of times until the desired interval has elapsed.
> 
> The negative effect of this is that an unwanted latency spike will
> impact normal processing at most every 4.19 seconds.  On an IBM
> POWER8-based system, this spike was measured at about 25-30
> microseconds, much of which was basic, opportunistic housekeeping
> tasks that could otherwise have waited.
> 
> This patch short-circuits the reset of the decrementer, exiting after
> the decrementer reset, but before the housekeeping tasks if the only
> need for the interrupt is simply to reset it.  After this patch,
> the latency spike was measured at about 150 nanoseconds.

Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
and your patch will probably cause it to be skipped...

Cheers,
Ben.

> Signed-off-by: Paul A. Clarke <pc@us.ibm.com>
> ---
>   arch/powerpc/kernel/time.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 368ab37..962a06b 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
>   {
>   	struct pt_regs *old_regs;
>   	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
> +	u64 now;
> 
>   	/* Ensure a positive value is written to the decrementer, or else
>   	 * some CPUs will continue to take decrementer exceptions.
> @@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
>   	 */
>   	may_hard_irq_enable();
> 
> +	/* If this is simply the decrementer expiring (underflow) due to
> +	 * the limited size of the decrementer, and not a set timer,
> +	 * reset (if needed) and return
> +	 */
> +	now = get_tb_or_rtc();
> +	if (now < *next_tb) {
> +		now = *next_tb - now;
> +		if (now <= DECREMENTER_MAX)
> +			set_dec((int)now);
> +		__get_cpu_var(irq_stat).timer_irqs_others++;
> +		return;
> +	}
> 
>   #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
>   	if (atomic_read(&ppc_n_lost_interrupts) != 0)
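
The irq_work concern can be illustrated with a small user-space model.
Everything here is hypothetical (struct cpu_model, run_pending_work() and
the two handlers are not kernel code); it only assumes that pending work
raises a decrementer interrupt early via set_dec(1), as described above,
and that the handler checks for pending work after the point where the
patch returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct cpu_model {
	uint64_t now;              /* current time base value */
	uint64_t next_tb;          /* time base value of the next real timer event */
	bool     irq_work_pending; /* work queued, interrupt forced via set_dec(1) */
	unsigned work_runs;        /* how many times pending work was run */
};

/* Stand-in for the part of the handler that runs pending irq_work. */
static void run_pending_work(struct cpu_model *cpu)
{
	if (cpu->irq_work_pending) {
		cpu->irq_work_pending = false;
		cpu->work_runs++;
	}
}

/* Handler without the short circuit: always reaches the pending-work check. */
static void handler_original(struct cpu_model *cpu)
{
	run_pending_work(cpu);
	/* ... timer event handling would follow when now >= next_tb ... */
}

/* Handler with the early return placed before the pending-work check. */
static void handler_patched(struct cpu_model *cpu)
{
	if (cpu->now < cpu->next_tb)
		return;            /* treated as a plain decrementer reset */
	run_pending_work(cpu);
}
```

With now = 1000, next_tb = 5000 and work pending, handler_original() runs
the work once while handler_patched() returns early and leaves it pending,
which is the skipped case described above.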

* Re: [PATCH] powerpc: mitigate impact of decrementer reset
  2014-11-10 10:08   ` Benjamin Herrenschmidt
@ 2014-11-10 20:58     ` Paul Clarke
  2014-11-13  2:42       ` Michael Ellerman
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Clarke @ 2014-11-10 20:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: paulmck, linuxppc-dev

On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
>> The POWER ISA defines an always-running decrementer which can be used
>> to schedule interrupts after a certain time interval has elapsed.
>> The decrementer counts down at the same frequency as the Time Base,
>> which is 512 MHz.  The maximum value of the decrementer is 0x7fffffff.
>> This works out to a maximum interval of about 4.19 seconds.
>>
>> If a larger interval is desired, the kernel will set the decrementer
>> to its maximum value and reset it after it expires (underflows)
>> a sufficient number of times until the desired interval has elapsed.
>>
>> The negative effect of this is that an unwanted latency spike will
>> impact normal processing at most every 4.19 seconds.  On an IBM
>> POWER8-based system, this spike was measured at about 25-30
>> microseconds, much of which was basic, opportunistic housekeeping
>> tasks that could otherwise have waited.
>>
>> This patch short-circuits the reset of the decrementer, exiting after
>> the decrementer reset, but before the housekeeping tasks if the only
>> need for the interrupt is simply to reset it.  After this patch,
>> the latency spike was measured at about 150 nanoseconds.
>
> Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
> and your patch will probably cause it to be skipped...

You're right.

I'm confused by the division between timer_interrupt() and 
__timer_interrupt().  The former is called with interrupts disabled (and 
enables them), but also calls irq_enter()/irq_exit().  Why are those 
calls not in __timer_interrupt()?  (If they were, the short-circuit 
logic might be a bit easier to put directly in __timer_interrupt(), 
which would eliminate any duplicate code.)

It looks like __timer_interrupt is only called directly by the broadcast 
timer IPI handler.  (Why is __timer_interrupt not static?)  Does this 
path not need irq_enter/irq_exit?
	
>> Signed-off-by: Paul A. Clarke <pc@us.ibm.com>
>> ---
>>    arch/powerpc/kernel/time.c | 13 +++++++++++++
>>    1 file changed, 13 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>> index 368ab37..962a06b 100644
>> --- a/arch/powerpc/kernel/time.c
>> +++ b/arch/powerpc/kernel/time.c
>> @@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
>>    {
>>    	struct pt_regs *old_regs;
>>    	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
>> +	u64 now;
>>
>>    	/* Ensure a positive value is written to the decrementer, or else
>>    	 * some CPUs will continue to take decrementer exceptions.
>> @@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
>>    	 */
>>    	may_hard_irq_enable();
>>
>> +	/* If this is simply the decrementer expiring (underflow) due to
>> +	 * the limited size of the decrementer, and not a set timer,
>> +	 * reset (if needed) and return
>> +	 */
>> +	now = get_tb_or_rtc();
>> +	if (now < *next_tb) {
>> +		now = *next_tb - now;
>> +		if (now <= DECREMENTER_MAX)
>> +			set_dec((int)now);
>> +		__get_cpu_var(irq_stat).timer_irqs_others++;
>> +		return;
>> +	}
>>
>>    #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
>>    	if (atomic_read(&ppc_n_lost_interrupts) != 0)

Regards,
PC

* Re: powerpc: mitigate impact of decrementer reset
  2014-11-05 17:06     ` Paul Clarke
@ 2014-11-13  2:39       ` Michael Ellerman
  2014-11-13 19:33         ` Paul Clarke
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Ellerman @ 2014-11-13  2:39 UTC (permalink / raw)
  To: Paul Clarke; +Cc: linuxppc-dev

On Wed, 2014-11-05 at 11:06 -0600, Paul Clarke wrote:
> Sorry it took me so long to get back to this...
> 
> On 10/07/2014 09:52 PM, Michael Ellerman wrote:
> > On Tue, 2014-10-07 at 19:13:24 UTC, Paul Clarke wrote:
> >> This patch short-circuits the reset of the decrementer, exiting after
> >> the decrementer reset, but before the housekeeping tasks if the only
> >> need for the interrupt is simply to reset it.  After this patch,
> >> the latency spike was measured at about 150 nanoseconds.
> 
> > Thanks for the excellent changelog. But this patch makes me a bit nervous :)
> >
> > Do you know where the latency is coming from? Is it primarily the irq work?
> 
> Yes, it is all under irq_enter (measured at ~10us) and irq_exit (~12us).

Hmm, OK. I actually meant irq_work_run().

AIUI irq_enter/exit() are just state tracking, they shouldn't be actually
running work.

How are you measuring it?

> > If so I'd prefer if we could move the short circuit into __timer_interrupt()
> > itself. That way we'd still have the trace points usable, and it would
> > hopefully result in less duplicated logic.
> 
> But irq_enter and irq_exit are called in timer_interrupt, before 
> __timer_interrupt is called.  I don't see how that helps.  The time 
> spent in __timer_interrupt is minuscule by comparison.

Right, it won't help if it's irq_enter() that is causing the delay. But I was
assuming it was irq_work_run().

> Are you suggesting that irq_enter/exit be moved into __timer_interrupt 
> as well?  (I'm not sure how that would impact the existing call to 
> __timer_interrupt from tick_broadcast_ipi_handler?  And if there is no 
> impact, what's the point of separating timer_interrupt and 
> __timer_interrupt?)

The point is __timer_interrupt() is called from tick_broadcast_ipi_handler(),
which is called from smp_ipi_demux(), from icp_hv_ipi_action(), from
__do_irq(), which has already done irq_enter() (and will do irq_exit()).

cheers
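
The two entry paths described above can be modelled in user space.  The
stubs below are hypothetical and only track nesting depth; the real
irq_enter()/irq_exit() do considerably more.  The third function shows
what the IPI path would look like if the bookkeeping were moved into the
core routine.

```c
#include <assert.h>

/* Hypothetical user-space stubs that only track nesting; not kernel code. */
static int depth, max_depth, core_calls;

static void irq_enter_stub(void) { if (++depth > max_depth) max_depth = depth; }
static void irq_exit_stub(void)  { depth--; }

/* Models __timer_interrupt(): the core work, with no enter/exit bookkeeping. */
static void core_timer_work(void) { core_calls++; }

/* Models timer_interrupt(): the decrementer path does its own bookkeeping. */
static void decrementer_path(void)
{
	irq_enter_stub();
	core_timer_work();
	irq_exit_stub();
}

/* Models __do_irq() delivering the broadcast IPI: bookkeeping happens here. */
static void ipi_path(void)
{
	irq_enter_stub();
	core_timer_work();   /* reached via the demux and broadcast handler */
	irq_exit_stub();
}

/* The IPI path if the bookkeeping were moved into the core routine. */
static void ipi_path_with_bookkeeping_in_core(void)
{
	irq_enter_stub();    /* done by the generic IRQ entry */
	irq_enter_stub();    /* done again inside the core routine */
	core_timer_work();
	irq_exit_stub();
	irq_exit_stub();
}
```

In this model both existing paths enter exactly once (max_depth stays at
1), while the third variant enters twice for a single interrupt.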

* Re: [PATCH] powerpc: mitigate impact of decrementer reset
  2014-11-10 20:58     ` Paul Clarke
@ 2014-11-13  2:42       ` Michael Ellerman
  2014-11-17 19:18         ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Ellerman @ 2014-11-13  2:42 UTC (permalink / raw)
  To: Paul Clarke; +Cc: paulmck, linuxppc-dev

On Mon, 2014-11-10 at 14:58 -0600, Paul Clarke wrote:
> On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:
> > On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
> >> This patch short-circuits the reset of the decrementer, exiting after
> >> the decrementer reset, but before the housekeeping tasks if the only
> >> need for the interrupt is simply to reset it.  After this patch,
> >> the latency spike was measured at about 150 nanoseconds.
> >
> > Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
> > and your patch will probably cause it to be skipped...
> 
> You're right.

Yeah, thanks Ben, that would have been bad.

So we'll need to come up with a different approach.
 
> I'm confused by the division between timer_interrupt() and 
> __timer_interrupt().  The former is called with interrupts disabled (and 
> enables them), but also calls irq_enter()/irq_exit().  Why are those 
> calls not in __timer_interrupt()?  (If they were, the short-circuit 
> logic might be a bit easier to put directly in __timer_interrupt(), 
> which would eliminate any duplicate code.)
> 
> It looks like __timer_interrupt is only called directly by the broadcast 
> timer IPI handler.  (Why is __timer_interrupt not static?)  Does this 
> path not need irq_enter/irq_exit?

I think I answered most of this in the other mail I just sent, but let me know
if not.

And __timer_interrupt() is static, if you have a new enough kernel :)

cheers

* Re: powerpc: mitigate impact of decrementer reset
  2014-11-13  2:39       ` Michael Ellerman
@ 2014-11-13 19:33         ` Paul Clarke
  0 siblings, 0 replies; 14+ messages in thread
From: Paul Clarke @ 2014-11-13 19:33 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: paulmck, linuxppc-dev

On 11/12/2014 08:39 PM, Michael Ellerman wrote:
> On Wed, 2014-11-05 at 11:06 -0600, Paul Clarke wrote:
>> On 10/07/2014 09:52 PM, Michael Ellerman wrote:
>>> On Tue, 2014-10-07 at 19:13:24 UTC, Paul Clarke wrote:
>>>> This patch short-circuits the reset of the decrementer, exiting after
>>>> the decrementer reset, but before the housekeeping tasks if the only
>>>> need for the interrupt is simply to reset it.  After this patch,
>>>> the latency spike was measured at about 150 nanoseconds.
>>
>>> Thanks for the excellent changelog. But this patch makes me a bit nervous :)
>>>
>>> Do you know where the latency is coming from? Is it primarily the irq work?
>>
>> Yes, it is all under irq_enter (measured at ~10us) and irq_exit (~12us).
>
> Hmm, OK. I actually meant irq_work_run().
>
> AIUI irq_enter/exit() are just state tracking, they shouldn't be actually
> running work.
>
> How are you measuring it?

ftrace function_graph tracer:
--
   127.425212 |                |  .irq_enter() {
   127.425213 |                |    .rcu_irq_enter() {
   127.425213 |  + 12.206 us   |      .rcu_eqs_exit_common.isra.41();
   127.425226 |  + 12.750 us   |    }
... RCU is a big hitter
   127.425226 |                |    .vtime_common_account_irq_enter() {
   127.425226 |                |      .vtime_account_user() {
   127.425226 |    0.032 us    |        ._raw_spin_lock();
   127.425227 |    0.034 us    |        .get_vtime_delta();
   127.425227 |                |        .account_user_time() {
   127.425228 |    0.030 us    |          .cpuacct_account_field();
   127.425228 |                |          .acct_account_cputime() {
   127.425228 |    0.082 us    |            .__acct_update_integrals();
   127.425229 |    0.562 us    |          }
   127.425229 |    1.500 us    |        }
   127.425229 |    2.954 us    |      }
   127.425230 |    3.434 us    |    }
... but even accounting is not insignificant
   127.425230 |  + 17.218 us   |  }
   127.425230 |                |  /* timer_interrupt_entry: [...] */
... nothing to see here, because there's nothing to do except reset the 
decrementer
   127.425230 |                |  /* timer_interrupt_exit: [...] */
... (less than 1 us spent doing the "required" work)
   127.425231 |                |  .irq_exit() {
   127.425231 |                |    .vtime_gen_account_irq_exit() {
   127.425231 |    0.036 us    |      ._raw_spin_lock();
   127.425232 |                |      .__vtime_account_system() {
   127.425232 |    0.030 us    |        .get_vtime_delta();
   127.425232 |                |        .account_system_time() {
   127.425233 |    0.030 us    |          .cpuacct_account_field();
   127.425233 |                |          .acct_account_cputime() {
   127.425233 |    0.072 us    |            .__acct_update_integrals();
   127.425234 |    0.564 us    |          }
   127.425234 |    1.546 us    |        }
   127.425234 |    2.528 us    |      }
   127.425235 |    3.700 us    |    }
... significant accounting time
   127.425235 |    0.032 us    |    .idle_cpu();
   127.425235 |                |    .tick_nohz_irq_exit() {
   127.425236 |                |      .can_stop_full_tick() {
   127.425236 |    0.022 us    |        .sched_can_stop_tick();
   127.425236 |    0.020 us    |        .posix_cpu_timers_can_stop_tick()
   127.425237 |    0.970 us    |      }
   127.425237 |    0.082 us    |      .ktime_get();
   127.425238 |                |      .tick_nohz_stop_sched_tick() {
   127.425238 |    0.032 us    |        .timekeeping_max_deferment();
   127.425238 |                |        .get_next_timer_interrupt() {
   127.425239 |    0.038 us    |          ._raw_spin_lock();
   127.425239 |                |          .hrtimer_get_next_event() {
   127.425239 |    0.030 us    |            ._raw_spin_lock_irqsave();
   127.425240 |    0.028 us    |            ._raw_spin_unlock_irqrestore
   127.425240 |    0.984 us    |          }
   127.425241 |    1.936 us    |        }
   127.425241 |    0.032 us    |        .scheduler_tick_max_deferment();
   127.425241 |    3.438 us    |      }
   127.425242 |    5.880 us    |    }
   127.425242 |                |    .rcu_irq_exit() {
   127.425242 |    0.102 us    |      .rcu_eqs_enter_common.isra.40();
   127.425243 |    0.576 us    |    }
   127.425243 |  + 12.156 us   |  }

This one was almost 30 us total (17.218 + 12.156 = 29.374 us), just to 
reset the decrementer.

>>> If so I'd prefer if we could move the short circuit into __timer_interrupt()
>>> itself. That way we'd still have the trace points usable, and it would
>>> hopefully result in less duplicated logic.
>>
>> But irq_enter and irq_exit are called in timer_interrupt, before
>> __timer_interrupt is called.  I don't see how that helps.  The time
>> spent in __timer_interrupt is minuscule by comparison.
>
> Right, it won't help if it's irq_enter() that is causing the delay. But I was
> assuming it was irq_work_run().
>
>> Are you suggesting that irq_enter/exit be moved into __timer_interrupt
>> as well?  (I'm not sure how that would impact the existing call to
>> __timer_interrupt from tick_broadcast_ipi_handler?  And if there is no
>> impact, what's the point of separating timer_interrupt and
>> __timer_interrupt?)
>
> The point is __timer_interrupt() is called from tick_broadcast_ipi_handler(),
> which is called from smp_ipi_demux(), from icp_hv_ipi_action(), from
> __do_irq(), which has already done irq_enter() (and will do irq_exit()).

If that's the only impact, maybe an "IRQ entered" flag would suffice to
either prevent a 2nd call to irq_enter(), or allow irq_enter() to be
"reentrant" (where it just returns if it was called and the last call
was not yet paired with an irq_exit()).  Alternatively, a new parameter to
__timer_interrupt() could indicate the same.
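
To make the "new parameter" variant concrete, here is a rough user-space
sketch.  It is not kernel code: every model_* name is a hypothetical
stand-in, and the enter/exit stubs only count nesting depth so that the
two call paths can be compared.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for irq_enter()/irq_exit(): they only track
 * nesting depth so this sketch can run outside the kernel. */
static int irq_depth;
static int max_irq_depth;
static int handler_runs;

static void model_irq_enter(void)
{
	irq_depth++;
	if (irq_depth > max_irq_depth)
		max_irq_depth = irq_depth;
}

static void model_irq_exit(void)
{
	irq_depth--;
}

/* Models __timer_interrupt() with an extra (assumed) parameter that says
 * whether the caller has already done the irq_enter() bookkeeping. */
static void model_timer_interrupt_body(bool irq_entered)
{
	if (!irq_entered)
		model_irq_enter();

	handler_runs++;		/* the real timer work would go here */

	if (!irq_entered)
		model_irq_exit();
}

/* Decrementer path: nothing has called irq_enter() yet. */
static void model_timer_interrupt(void)
{
	model_timer_interrupt_body(false);
}

/* Broadcast IPI path: the generic IRQ code has already entered. */
static void model_tick_broadcast_ipi(void)
{
	model_irq_enter();	/* stands in for what __do_irq() does */
	model_timer_interrupt_body(true);
	model_irq_exit();
}
```

Calling model_timer_interrupt() and model_tick_broadcast_ipi() one after
the other should leave irq_depth at 0 and never nest deeper than 1, which
is the property the flag or parameter would be there to guarantee.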

PC

* Re: [PATCH] powerpc: mitigate impact of decrementer reset
  2014-11-13  2:42       ` Michael Ellerman
@ 2014-11-17 19:18         ` Paul E. McKenney
  2014-11-18  1:46           ` Michael Ellerman
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2014-11-17 19:18 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, Paul Clarke

On Thu, Nov 13, 2014 at 01:42:12PM +1100, Michael Ellerman wrote:
> On Mon, 2014-11-10 at 14:58 -0600, Paul Clarke wrote:
> > On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:
> > > On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
> > >> This patch short-circuits the reset of the decrementer, exiting after
> > >> the decrementer reset, but before the housekeeping tasks if the only
> > >> need for the interrupt is simply to reset it.  After this patch,
> > >> the latency spike was measured at about 150 nanoseconds.
> > >
> > > Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
> > > and your patch will probably cause it to be skipped...
> > 
> > You're right.
> 
> Yeah, thanks Ben, that would have been bad.
> 
> So we'll need to come up with a different approach.
> 
> > I'm confused by the division between timer_interrupt() and 
> > __timer_interrupt().  The former is called with interrupts disabled (and 
> > enables them), but also calls irq_enter()/irq_exit().  Why are those 
> > calls not in __timer_interrupt()?  (If they were, the short-circuit 
> > logic might be a bit easier to put directly in __timer_interrupt(), 
> > which would eliminate any duplicate code.)
> > 
> > It looks like __timer_interrupt is only called directly by the broadcast 
> > timer IPI handler.  (Why is __timer_interrupt not static?)  Does this 
> > path not need irq_enter/irq_exit?
> 
> I think I answered most of this in the other mail I just sent, but let me know
> if not.
> 
> And __timer_interrupt() is static, if you have a new enough kernel :)

If I am understanding this correctly, it underscores the need for more
bits in the decrementer register.  :-/
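
For reference, the arithmetic behind that can be sketched as below.  This
assumes the 512 MHz Time Base quoted in the patch description; the wider
register widths are purely hypothetical inputs, not a statement about any
particular hardware.

```c
#include <stdint.h>

/* Time Base frequency quoted in the patch description: 512 MHz. */
#define TB_HZ 512000000ULL

/* Largest positive value of a signed decrementer that is `bits` wide,
 * e.g. 0x7fffffff for bits == 32.  Valid for 2 <= bits <= 64. */
static uint64_t dec_max(unsigned int bits)
{
	return (UINT64_C(1) << (bits - 1)) - 1;
}

/* Longest interval, in whole milliseconds, before the decrementer
 * underflows.  Dividing by ticks-per-millisecond avoids overflowing
 * the 64-bit intermediate for wide registers. */
static uint64_t dec_max_interval_ms(unsigned int bits)
{
	return dec_max(bits) / (TB_HZ / 1000);
}
```

With bits == 32 this gives 4194 ms, matching the roughly 4.19 seconds in
the patch description; each additional bit doubles the interval between
forced resets.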

							Thanx, Paul

* Re: [PATCH] powerpc: mitigate impact of decrementer reset
  2014-11-17 19:18         ` Paul E. McKenney
@ 2014-11-18  1:46           ` Michael Ellerman
  2014-11-18  3:08             ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Ellerman @ 2014-11-18  1:46 UTC (permalink / raw)
  To: paulmck; +Cc: linuxppc-dev, Paul Clarke

On Mon, 2014-11-17 at 11:18 -0800, Paul E. McKenney wrote:
> On Thu, Nov 13, 2014 at 01:42:12PM +1100, Michael Ellerman wrote:
> > On Mon, 2014-11-10 at 14:58 -0600, Paul Clarke wrote:
> > > On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:
> > > > On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
> > > >> This patch short-circuits the reset of the decrementer, exiting after
> > > >> the decrementer reset, but before the housekeeping tasks if the only
> > > >> need for the interrupt is simply to reset it.  After this patch,
> > > >> the latency spike was measured at about 150 nanoseconds.
> > > >
> > > > Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
> > > > and your patch will probably cause it to be skipped...
> > > 
> > > You're right.
> > 
> > Yeah, thanks Ben, that would have been bad.
> > 
> > So we'll need to come up with a different approach.
> 
> If I am understanding this correctly, it underscores the need for more
> bits in the decrementer register.  :-/

Yes that is the root cause of the problem :)

cheers

* Re: [PATCH] powerpc: mitigate impact of decrementer reset
  2014-11-18  1:46           ` Michael Ellerman
@ 2014-11-18  3:08             ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2014-11-18  3:08 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, Paul Clarke

On Tue, Nov 18, 2014 at 12:46:56PM +1100, Michael Ellerman wrote:
> On Mon, 2014-11-17 at 11:18 -0800, Paul E. McKenney wrote:
> > On Thu, Nov 13, 2014 at 01:42:12PM +1100, Michael Ellerman wrote:
> > > On Mon, 2014-11-10 at 14:58 -0600, Paul Clarke wrote:
> > > > On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:
> > > > > On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
> > > > >> This patch short-circuits the reset of the decrementer, exiting after
> > > > >> the decrementer reset, but before the housekeeping tasks if the only
> > > > >> need for the interrupt is simply to reset it.  After this patch,
> > > > >> the latency spike was measured at about 150 nanoseconds.
> > > > >
> > > > > Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
> > > > > and your patch will probably cause it to be skipped...
> > > > 
> > > > You're right.
> > > 
> > > Yeah, thanks Ben, that would have been bad.
> > > 
> > > So we'll need to come up with a different approach.
> > 
> > If I am understanding this correctly, it underscores the need for more
> > bits in the decrementer register.  :-/
> 
> Yes that is the root cause of the problem :)

Sigh!!!  I was hoping!  ;-)

							Thanx, Paul

Thread overview: 14+ messages
     [not found] <1412708517-84726-1-git-send-email-pc@us.ibm.com>
2014-10-07 19:13 ` [PATCH] powerpc: mitigate impact of decrementer reset Paul Clarke
2014-10-08  2:52   ` Michael Ellerman
2014-10-08 10:27     ` Preeti U Murthy
2014-11-05 17:06     ` Paul Clarke
2014-11-13  2:39       ` Michael Ellerman
2014-11-13 19:33         ` Paul Clarke
2014-10-08  5:37   ` [PATCH] " Heinz Wrobel
2014-10-08 12:27     ` Paul Clarke
2014-11-10 10:08   ` Benjamin Herrenschmidt
2014-11-10 20:58     ` Paul Clarke
2014-11-13  2:42       ` Michael Ellerman
2014-11-17 19:18         ` Paul E. McKenney
2014-11-18  1:46           ` Michael Ellerman
2014-11-18  3:08             ` Paul E. McKenney
