* PM related performance degradation on OMAP3
@ 2012-04-06 22:50 Grazvydas Ignotas
  2012-04-09 19:03 ` Kevin Hilman
  2012-04-11 14:59 ` Gary Thomas
  0 siblings, 2 replies; 36+ messages in thread
From: Grazvydas Ignotas @ 2012-04-06 22:50 UTC (permalink / raw)
  To: linux-omap; +Cc: Kevin Hilman, Paul Walmsley

Hello,

I'm seeing DMA performance loss related to CONFIG_PM on OMAP3.

# CONFIG_PM is set:
echo 3 > /proc/sys/vm/drop_caches
# file copy from NAND (using NAND driver in DMA mode)
dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 9.088714 seconds, 3.5MB/s
# file read from SD (hsmmc uses DMA)
dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 2.065460 seconds, 15.5MB/s

# CONFIG_PM not set:
# NAND
dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 5.653534 seconds, 5.7MB/s
# SD
dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
33554432 bytes (32.0MB) copied, 1.919007 seconds, 16.7MB/s

While SD card performance loss is not that bad (~7%), NAND one is
worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
cpuidle states over sysfs, it did not have any significant effect. Is
there something else to try?
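
For reference, the two percentages follow directly from the dd throughput
figures above. A minimal sketch of the arithmetic; the helper name
loss_percent is made up for illustration and is not part of any kernel code:

```c
#include <assert.h>

/*
 * Relative throughput loss, in percent, of a measurement against a
 * baseline. Hypothetical helper, for illustration only.
 */
double loss_percent(double measured_mbps, double baseline_mbps)
{
	assert(baseline_mbps > 0.0);
	return (baseline_mbps - measured_mbps) / baseline_mbps * 100.0;
}
```

Calling loss_percent(3.5, 5.7) gives roughly 38.6 for NAND and
loss_percent(15.5, 16.7) roughly 7.2 for SD, which is where the ~39% and
~7% come from.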

I'm guessing this is caused by CPU wakeup latency to service DMA
interrupts? I've noticed that if I keep CPU busy, the loss is reduced
almost completely.
Talking about cpuidle, what's the difference between C1 and C2 states?
They look mostly the same.
Then there is omap3_do_wfi, it seems to be unconditionally putting
SDRC on self-refresh, would it make sense to just do wfi in higher
power states, like OMAP4 seems to be doing?

-- 
Gražvydas
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: PM related performance degradation on OMAP3
  2012-04-06 22:50 PM related performance degradation on OMAP3 Grazvydas Ignotas
@ 2012-04-09 19:03 ` Kevin Hilman
  2012-04-11  0:29   ` Grazvydas Ignotas
  2012-04-11 14:59 ` Gary Thomas
  1 sibling, 1 reply; 36+ messages in thread
From: Kevin Hilman @ 2012-04-09 19:03 UTC (permalink / raw)
  To: Grazvydas Ignotas; +Cc: linux-omap, Paul Walmsley

Grazvydas Ignotas <notasas@gmail.com> writes:

> Hello,
>
> I'm seeing DMA performance loss related to CONFIG_PM on OMAP3.
>
> # CONFIG_PM is set:
> echo 3 > /proc/sys/vm/drop_caches
> # file copy from NAND (using NAND driver in DMA mode)
> dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 9.088714 seconds, 3.5MB/s
> # file read from SD (hsmmc uses DMA)
> dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 2.065460 seconds, 15.5MB/s
>
> # CONFIG_PM not set:
> # NAND
> dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 5.653534 seconds, 5.7MB/s
> # SD
> dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 1.919007 seconds, 16.7MB/s
>
> While SD card performance loss is not that bad (~7%), NAND one is
> worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
> cpuidle states over sysfs, it did not have any significant effect. Is
> there something else to try?

Looks like we might need a PM QoS constraint when there is DMA activity
in progress.  

You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when
DMA transfers are active and I suspect that will help.
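
A quick way to try the same class of constraint without patching a driver
might be the userspace PM QoS node. The sketch below assumes the target
kernel exposes /dev/cpu_dma_latency; the function name is hypothetical, and
the path is a parameter only so that the helper can be exercised against a
plain file:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Open a PM QoS device node (normally "/dev/cpu_dma_latency") and write
 * a 32-bit latency target in microseconds. The request stays active for
 * as long as the returned file descriptor is held open.
 * Returns the fd on success, -1 on error.
 */
int hold_cpu_dma_latency(const char *path, int32_t latency_us)
{
	int fd;

	assert(path != NULL);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	if (write(fd, &latency_us, sizeof(latency_us)) !=
	    (ssize_t)sizeof(latency_us)) {
		close(fd);
		return -1;
	}

	return fd;	/* caller closes this to drop the request */
}
```

Holding the fd open for the duration of the dd run and closing it afterwards
would approximate a constraint that is only present while transfers are
active. An in-driver pm_qos_add_request() remains the more precise tool.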

> I'm guessing this is caused by CPU wakeup latency to service DMA
> interrupts? I've noticed that if I keep CPU busy, the loss is reduced
> almost completely.

Yeah, that suggests a QoS constraint is what's needed here.

> Talking about cpuidle, what's the difference between C1 and C2 states?
> They look mostly the same.

Except that clockdomains are not allowed to idle in C1, which results in
much shorter wakeup latency.

> Then there is omap3_do_wfi, it seems to be unconditionally putting
> SDRC on self-refresh, would it make sense to just do wfi in higher
> power states, like OMAP4 seems to be doing?

Not sure what you're referring to in OMAP4.  There we do WFI in every
idle state.

Kevin


* Re: PM related performance degradation on OMAP3
  2012-04-09 19:03 ` Kevin Hilman
@ 2012-04-11  0:29   ` Grazvydas Ignotas
  2012-04-12  0:19     ` Kevin Hilman
  2012-04-12 23:02     ` Woodruff, Richard
  0 siblings, 2 replies; 36+ messages in thread
From: Grazvydas Ignotas @ 2012-04-11  0:29 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: linux-omap, Paul Walmsley

On Mon, Apr 9, 2012 at 10:03 PM, Kevin Hilman <khilman@ti.com> wrote:
> Grazvydas Ignotas <notasas@gmail.com> writes:
>> While SD card performance loss is not that bad (~7%), NAND one is
>> worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
>> cpuidle states over sysfs, it did not have any significant effect. Is
>> there something else to try?
>
> Looks like we might need a PM QoS constraint when there is DMA activity
> in progress.
>
> You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when
> DMA transfers are active and I suspect that will help.

I've tried it and it didn't help much. It looks like the only thing it
does is limiting cpuidle c-states, I tried to set qos dma latency to 0
and it made it stay in C1 while transfer was ongoing (I watched
/sys/devices/system/cpu/cpu0/cpuidle/state*/usage), but performance
was still poor.
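
The per-state counters mentioned above can also be sampled from a small
program before and after a transfer. A minimal sketch, assuming the usual
cpuidle sysfs layout (stateN/usage under the cpu0 cpuidle directory); the
helper name is hypothetical and the base directory is a parameter so it can
be pointed at a test tree:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read the entry counter of one cpuidle state.
 * base is normally "/sys/devices/system/cpu/cpu0/cpuidle".
 * Returns 0 on success, -1 on failure.
 */
int read_cpuidle_usage(const char *base, int state,
		       unsigned long long *usage)
{
	char path[256];
	char buf[32];
	char *end;
	FILE *f;
	int n;

	assert(base != NULL && usage != NULL);

	n = snprintf(path, sizeof(path), "%s/state%d/usage", base, state);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	*usage = strtoull(buf, &end, 10);
	return end != buf ? 0 : -1;	/* fail if no digits were parsed */
}
```

Taking the difference of two samples around a dd run shows how many times
each state was entered during the transfer.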

What I think is going on here is that omap_sram_idle() is taking too
much time because its overhead is too large. I've added a counter
there and it seems to be called ~530 times per megabyte (DMA operates
in ~2K chunks so it makes sense), that's over 2000 calls per second.
Some quick measurement code shows ~243us spent for setting up in
omap_sram_idle() (before and after omap34xx_do_sram_idle()).
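
As a rough sanity check, these figures can be multiplied out and compared
with the dd timings. A minimal sketch, using only the approximate numbers
quoted in this thread; the helper name is hypothetical:

```c
#include <assert.h>

/*
 * Estimated time spent in idle entry/exit bookkeeping over a whole
 * transfer, in seconds. Hypothetical helper, for illustration only.
 */
double idle_overhead_seconds(double calls_per_mb, double overhead_us,
			     double megabytes)
{
	assert(calls_per_mb >= 0.0 && overhead_us >= 0.0 && megabytes >= 0.0);
	return calls_per_mb * overhead_us * 1e-6 * megabytes;
}
```

idle_overhead_seconds(530, 243, 32) comes out at about 4.1 seconds, while
the measured NAND gap between the two configurations is 9.09 - 5.65, or
about 3.4 seconds. The estimate is in the same range as the observed loss,
so the setup overhead could plausibly account for most of it, bearing in
mind that both inputs are quick measurements.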

Could we perhaps have a lighter idle function for C1 that doesn't try
to switch all powerdomain states and maybe not enable RAM
self-refresh? As a quick test I've tried this in omap3_enter_idle():

        /* Execute ARM wfi */
        if (index == 0) {
                clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
                cpu_do_idle();
        } else
                omap_sram_idle();

..and it brought performance close to !CONFIG_PM case (cpu_do_idle()
is used as pm_idle on !CONFIG_PM). I don't know what side effects
something like this might have though.

>> Then there is omap3_do_wfi, it seems to be unconditionally putting
>> SDRC on self-refresh, would it make sense to just do wfi in higher
>> power states, like OMAP4 seems to be doing?
>
> Not sure what you're referring to in OMAP4.  There we do WFI in every
> idle state.

What I meant is that OMAP3 idle code always tries to enable RAM
self-refresh (regardless of c-state) before doing wfi while OMAP4 can
do wfi without suspending RAM (although I might be misunderstanding
all that asm code).

-- 
Gražvydas


* Re: PM related performance degradation on OMAP3
  2012-04-06 22:50 PM related performance degradation on OMAP3 Grazvydas Ignotas
  2012-04-09 19:03 ` Kevin Hilman
@ 2012-04-11 14:59 ` Gary Thomas
  2012-04-11 17:23   ` Grazvydas Ignotas
  2012-04-11 19:17   ` Kevin Hilman
  1 sibling, 2 replies; 36+ messages in thread
From: Gary Thomas @ 2012-04-11 14:59 UTC (permalink / raw)
  To: Grazvydas Ignotas; +Cc: linux-omap, Kevin Hilman, Paul Walmsley

On 2012-04-06 16:50, Grazvydas Ignotas wrote:
> Hello,
>
> I'm seeing DMA performance loss related to CONFIG_PM on OMAP3.
>
> # CONFIG_PM is set:
> echo 3 > /proc/sys/vm/drop_caches
> # file copy from NAND (using NAND driver in DMA mode)
> dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 9.088714 seconds, 3.5MB/s
> # file read from SD (hsmmc uses DMA)
> dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 2.065460 seconds, 15.5MB/s
>
> # CONFIG_PM not set:
> # NAND
> dd if=/mnt/tmp/a of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 5.653534 seconds, 5.7MB/s
> # SD
> dd if=/dev/mmcblk0 of=/dev/null bs=1M count=32
> 33554432 bytes (32.0MB) copied, 1.919007 seconds, 16.7MB/s
>
> While SD card performance loss is not that bad (~7%), NAND one is
> worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
> cpuidle states over sysfs, it did not have any significant effect. Is
> there something else to try?
>
> I'm guessing this is caused by CPU wakeup latency to service DMA
> interrupts? I've noticed that if I keep CPU busy, the loss is reduced
> almost completely.
> Talking about cpuidle, what's the difference between C1 and C2 states?
> They look mostly the same.
> Then there is omap3_do_wfi, it seems to be unconditionally putting
> SDRC on self-refresh, would it make sense to just do wfi in higher
> power states, like OMAP4 seems to be doing?
>

I fear I'm seeing similar problems with 3.3.  I have my board (similar
to the BeagleBoard) ported to 3.0 and 3.3.  I'm seeing terrible network
performance on 3.3.  For example, if I use TFTP to download a large file
(~35MB), I get this:
   3.0:  42.5 sec
   3.3: 625.0 sec
That's a factor of 15 worse!

I'd like to try building without CONFIG_PM, but when I disabled this, my
kernel fails to come up.  Can someone point me to the magic to build without
CONFIG_PM, or possibly send me a working config file?

Thanks

-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------


* Re: PM related performance degradation on OMAP3
  2012-04-11 14:59 ` Gary Thomas
@ 2012-04-11 17:23   ` Grazvydas Ignotas
  2012-04-11 18:20     ` Gary Thomas
  2012-04-11 19:17   ` Kevin Hilman
  1 sibling, 1 reply; 36+ messages in thread
From: Grazvydas Ignotas @ 2012-04-11 17:23 UTC (permalink / raw)
  To: Gary Thomas; +Cc: linux-omap, Kevin Hilman, Paul Walmsley

On Wed, Apr 11, 2012 at 5:59 PM, Gary Thomas <gary@mlbassoc.com> wrote:
> I'd like to try building without CONFIG_PM, but when I disabled this, my
> kernel fails to come up.  Can someone point me to the magic to build without
> CONFIG_PM, or possibly send me a working config file?

You probably need this patch:
http://marc.info/?l=linux-omap&m=133374930011086&w=2
If it still won't boot, you'll need to enable earlyprintk both in
.config and as kernel argument to see where it dies.


-- 
Gražvydas


* Re: PM related performance degradation on OMAP3
  2012-04-11 17:23   ` Grazvydas Ignotas
@ 2012-04-11 18:20     ` Gary Thomas
  0 siblings, 0 replies; 36+ messages in thread
From: Gary Thomas @ 2012-04-11 18:20 UTC (permalink / raw)
  To: Grazvydas Ignotas; +Cc: linux-omap, Kevin Hilman, Paul Walmsley

On 2012-04-11 11:23, Grazvydas Ignotas wrote:
> On Wed, Apr 11, 2012 at 5:59 PM, Gary Thomas<gary@mlbassoc.com>  wrote:
>> I'd like to try building without CONFIG_PM, but when I disabled this, my
>> kernel fails to come up.  Can someone point me to the magic to build without
>> CONFIG_PM, or possibly send me a working config file?
>
> You probably need this patch:
> http://marc.info/?l=linux-omap&m=133374930011086&w=2
> If it still won't boot, you'll need to enable earlyprintk both in
> .config and as kernel argument to see where it dies.

That does help, but there are lots of tracebacks like these:
[    0.588500] ------------[ cut here ]------------
[    0.588531] WARNING: at drivers/video/omap2/dss/dispc.c:404 dss_driver_probe+0x44/0xd8()
[    0.588562] Modules linked in:
[    0.588592] [<c0012204>] (unwind_backtrace+0x0/0xf8) from [<c002b81c>] (warn_slowpath_common+0x4c/0x64)
[    0.588623] [<c002b81c>] (warn_slowpath_common+0x4c/0x64) from [<c002b850>] (warn_slowpath_null+0x1c/0x24)
[    0.588623] [<c002b850>] (warn_slowpath_null+0x1c/0x24) from [<c022609c>] (dss_driver_probe+0x44/0xd8)
[    0.588653] [<c022609c>] (dss_driver_probe+0x44/0xd8) from [<c0273e10>] (driver_probe_device+0x70/0x1e4)
[    0.588684] [<c0273e10>] (driver_probe_device+0x70/0x1e4) from [<c0274018>] (__driver_attach+0x94/0x98)
[    0.588714] [<c0274018>] (__driver_attach+0x94/0x98) from [<c027270c>] (bus_for_each_dev+0x50/0x7c)
[    0.588745] [<c027270c>] (bus_for_each_dev+0x50/0x7c) from [<c0273664>] (bus_add_driver+0x184/0x244)
[    0.588775] [<c0273664>] (bus_add_driver+0x184/0x244) from [<c02742bc>] (driver_register+0x78/0x12c)
[    0.588775] [<c02742bc>] (driver_register+0x78/0x12c) from [<c00085a0>] (do_one_initcall+0x34/0x178)
[    0.588806] [<c00085a0>] (do_one_initcall+0x34/0x178) from [<c061d7dc>] (kernel_init+0x78/0x114)
[    0.588836] [<c061d7dc>] (kernel_init+0x78/0x114) from [<c000e0d0>] (kernel_thread_exit+0x0/0x8)
[    0.588867] ---[ end trace 1b75b31a2719ed24 ]---

I also had to disable the watchdog to get it up.

That said, with CONFIG_PM disabled, my network performance is
back to what it was in 3.0 :-)  Note: I also had CONFIG_PM disabled
in that kernel build, so I don't know for sure what the performance
might be with that version if it were enabled.

-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------


* Re: PM related performance degradation on OMAP3
  2012-04-11 14:59 ` Gary Thomas
  2012-04-11 17:23   ` Grazvydas Ignotas
@ 2012-04-11 19:17   ` Kevin Hilman
  2012-04-12 10:44     ` Gary Thomas
  1 sibling, 1 reply; 36+ messages in thread
From: Kevin Hilman @ 2012-04-11 19:17 UTC (permalink / raw)
  To: Gary Thomas; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley

Gary Thomas <gary@mlbassoc.com> writes:

[...]

> I fear I'm seeing similar problems with 3.3.  I have my board (similar
> to the BeagleBoard) ported to 3.0 and 3.3.  I'm seeing terrible network
> performance on 3.3.  For example, if I use TFTP to download a large file
> (~35MB), I get this:
>   3.0:  42.5 sec
>   3.3: 625.0 sec
> That's a factor of 15 worse!

This might not be the same problem.  What is the NIC being used, and
does it have GPIO interrupts?

If it's using GPIO interrupts, then you likely need this patch from
mainline (v3.4-rc1)[1].

If that doesn't work, or you're not using GPIO interrupts, could you
confirm if the patch below[2] (based on idea from Grazvydas) increases
performance for you when CONFIG_PM=y.

Kevin

[1]
Author: Kevin Hilman <khilman@ti.com>  2012-03-05 15:10:04
Committer: Grant Likely <grant.likely@secretlab.ca>  2012-03-12 09:16:11
Parent: 25db711df3258d125dc1209800317e5c0ef3c870 (gpio/omap: Fix IRQ handling for SPARSE_IRQ)
Child:  8805f410e4fb88a56552c1af42d61b38837a38fd (gpio/omap: Fix section warning for omap_mpuio_alloc_gc())
Branches: many (66)
Follows: v3.3-rc7
Precedes: v3.4-rc1

    gpio/omap: fix wakeups on level-triggered GPIOs
    
    While both level- and edge-triggered GPIOs are capable of generating
    interrupts, only edge-triggered GPIOs are capable of generating a
    module-level wakeup to the PRCM (c.f. 34xx NDA TRM section 25.5.3.2.)
    
    In order to ensure that devices using level-triggered GPIOs as
    interrupts can also cause wakeups (e.g. from idle), this patch enables
    edge-triggering for wakeup-enabled, level-triggered GPIOs when a GPIO
    bank is runtime-suspended (which also happens during idle.)
    
    This fixes a problem found in GPMC-connected network cards with GPIO
    interrupts (e.g. smsc911x on Zoom3, Overo, ...) where network booting
    with NFSroot was very slow since the GPIO IRQs used by the NIC were
    not generating PRCM wakeups, and thus not waking the system from idle.
    NOTE: until v3.3, this boot-time problem was somewhat masked because
    the UART init prevented WFI during boot until the full serial driver
    was available.  Preventing WFI allowed regular GPIO interrupts to fire
    and this problem was not seen.  After the UART runtime PM cleanups, we
    no longer avoid WFI during boot, so GPIO IRQs that were not causing
    wakeups resulted in very slow IRQ response times.
    
    Tested on platforms using level-triggered GPIOs for network IRQs using
    the SMSC911x NIC: 3530/Overo and 3630/Zoom3.
    
    Reported-by: Tony Lindgren <tony@atomide.com>
    Tested-by: Tarun Kanti DebBarma <tarun.kanti@ti.com>
    Tested-by: Tony Lindgren <tony@atomide.com>
    Signed-off-by: Kevin Hilman <khilman@ti.com>
    Signed-off-by: Grant Likely <grant.likely@secretlab.ca>

[2]
diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index 413aac4..ace4bf6 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -120,7 +120,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
 		cpu_pm_enter();
 
 	/* Execute ARM wfi */
-	omap_sram_idle();
+	if (index == 0)
+		cpu_do_idle();
+	else
+		omap_sram_idle();
 
 	/*
 	 * Call idle CPU PM enter notifier chain to restore


* Re: PM related performance degradation on OMAP3
  2012-04-11  0:29   ` Grazvydas Ignotas
@ 2012-04-12  0:19     ` Kevin Hilman
  2012-04-13 17:32       ` Grazvydas Ignotas
  2012-04-13 19:32       ` Grazvydas Ignotas
  2012-04-12 23:02     ` Woodruff, Richard
  1 sibling, 2 replies; 36+ messages in thread
From: Kevin Hilman @ 2012-04-12  0:19 UTC (permalink / raw)
  To: Grazvydas Ignotas; +Cc: linux-omap, Paul Walmsley

Grazvydas Ignotas <notasas@gmail.com> writes:

> On Mon, Apr 9, 2012 at 10:03 PM, Kevin Hilman <khilman@ti.com> wrote:
>> Grazvydas Ignotas <notasas@gmail.com> writes:
>>> While SD card performance loss is not that bad (~7%), NAND one is
>>> worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
>>> cpuidle states over sysfs, it did not have any significant effect. Is
>>> there something else to try?
>>
>> Looks like we might need a PM QoS constraint when there is DMA activity
>> in progress.
>>
>> You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when
>> DMA transfers are active and I suspect that will help.
>
> I've tried it and it didn't help much. It looks like the only thing it
> does is limiting cpuidle c-states, I tried to set qos dma latency to 0
> and it made it stay in C1 while transfer was ongoing (I watched
> /sys/devices/system/cpu/cpu0/cpuidle/state*/usage), but performance
> was still poor.

Great, thanks for doing this experiment.

Assuming we get to a C1 that's low-latency enough, we will still need
this constraint to ensure C1 during transfers.  But first we have to
figure out what's going on with C1...

> What I think is going on here is that omap_sram_idle() is taking too
> much time because its overhead is too large. I've added a counter
> there and it seems to be called ~530 times per megabyte (DMA operates
> in ~2K chunks so it makes sense), that's over 2000 calls per second.
> Some quick measurement code shows ~243us spent for setting up in
> omap_sram_idle() (before and after omap34xx_do_sram_idle()).

> Could we perhaps have a lighter idle function for C1 that doesn't try
> to switch all powerdomain states and maybe not enable RAM
> self-refresh? 

Yes, but first let's try to uncover exactly what makes the current C1 so
heavy.  

> As a quick test I've tried this in omap3_enter_idle():
>
>         /* Execute ARM wfi */
>         if (index == 0) {
>                 clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
>                 cpu_do_idle();
>         } else
>                 omap_sram_idle();
>
> ..and it brought performance close to !CONFIG_PM case (cpu_do_idle()
> is used as pm_idle on !CONFIG_PM). 

OK, I see now.   I think you're right about the overhead.

It would be helpful now to narrow down what are the big contributors to
the overhead in omap_sram_idle().  Most of the code there is skipped for
C1 because the next states for MPU and CORE are both ON.

There are 2 primary differences that I see as possible causes.  I list
them here with a couple more experiments for you to try to help us
narrow this down.

1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()

Could you try using omap_sram_idle() and just commenting out those
calls?  Does that help performance?  Those iterate over all the
powerdomains, so definitely add some overhead, but I don't think it
would be as significant as what you're seeing.    Much more likely is...

2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds

This is more likely the culprit of most of the overhead.  Specifically,
when returning from idle there are some errata to work around that
require waiting for DPLL3 to lock.  I suspect this is more likely to be
the source of the problem.  

Can you try the hack below[1], which basically does the cpu_do_idle() hack
that you've already done, but inside omap_sram_idle() and only
eliminates the jump to SRAM, SDRC self-refresh and SDRC errata
workarounds?

I assume that will get performance back to what you expect.  Then it
remains to be seen if it's the SDRC self-refresh that's causing the
delay, or the errata workarounds.

To add the self-refresh back, but eliminate the SDRC errata workaround,
You could try something like I hacked up in the (untested) branch here[2].
If performance is still good, that will tell us that it's the errata
workaround waiting that's causing the extra overhead.

I need to clarify for myself if SDRC self-refresh is even entered in C1.
When the CORE powerdomain is left on, I don't think the PRCM would
send IDLEREQ to the SDRC, so it should not enter self refresh, but I
need to verify that.

> I don't know what side effects something like this might have though.

There are some other errata workarounds that you miss by not calling
omap_sram_idle().  Specifically, the call to omap3_intc_prepare_idle()
is important.

Kevin




[1]
diff --git a/arch/arm/mach-omap2/pm34xx.c b/arch/arm/mach-omap2/pm34xx.c
index 3e6b564..0fb3942 100644
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -313,7 +313,7 @@ void omap_sram_idle(void)
 	if (save_state == 1 || save_state == 3)
 		cpu_suspend(save_state, omap34xx_do_sram_idle);
 	else
-		omap34xx_do_sram_idle(save_state);
+		cpu_do_idle();
 
 	/* Restore normal SDRC POWER settings */
 	if (cpu_is_omap3430() && omap_rev() >= OMAP3430_REV_ES3_0 &&


[2] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap-pm.git tmp/sdrc-hacks


* Re: PM related performance degradation on OMAP3
  2012-04-11 19:17   ` Kevin Hilman
@ 2012-04-12 10:44     ` Gary Thomas
  2012-04-12 14:14       ` Kevin Hilman
  0 siblings, 1 reply; 36+ messages in thread
From: Gary Thomas @ 2012-04-12 10:44 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley

On 2012-04-11 13:17, Kevin Hilman wrote:
> Gary Thomas<gary@mlbassoc.com>  writes:
>
> [...]
>
>> I fear I'm seeing similar problems with 3.3.  I have my board (similar
>> to the BeagleBoard) ported to 3.0 and 3.3.  I'm seeing terrible network
>> performance on 3.3.  For example, if I use TFTP to download a large file
>> (~35MB), I get this:
>>    3.0:  42.5 sec
>>    3.3: 625.0 sec
>> That's a factor of 15 worse!
>
> This might not be the same problem.  What is the NIC being used, and
> does it have GPIO interrupts?

My board uses SMSC911x with GPIO interrupt signal.

>
> If it's using GPIO interrupts, then you likely need this patch from
> mainline (v3.4-rc1)

I tried to just pick up the patch you [sort of] quoted below, but had
a hard time applying it to my kernel. I've tried to just pick up the
latest files from the mainline kernel, but so far I've nothing that
builds - too many dependencies.  These are the files I've pulled in
#       modified:   arch/arm/mach-omap2/cpuidle34xx.c
#       modified:   arch/arm/mach-omap2/gpio.c
#       modified:   arch/arm/mach-omap2/pm34xx.c
#       modified:   arch/arm/plat-omap/include/plat/gpio.h
#       modified:   drivers/gpio/gpio-omap.c
but it fails with these errors:
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:34:29: error: asm/system_misc.h: No such file or directory
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c: In function 'omap3_pm_init':
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:744: error: 'omap_pm_clkdms_setup' undeclared (first use in this function)
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:744: error: (Each undeclared identifier is reported only once
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:744: error: for each function it appears in.)
/local/linux-3.3/arch/arm/mach-omap2/pm34xx.c:767: error: 'arm_pm_idle' undeclared (first use in this function)

Is this a viable path towards getting the GPIO changes into my kernel?
It's hard for me to update the whole kernel as there are some other
dependencies (OMAP3ISP and video in particular), so I'd like to stay
with this 3.3-ish base.

Thanks for any ideas

>
> If that doesn't work, or you're not using GPIO interrupts, could you
> confirm if the patch below[2] (based on idea from Grazvydas) increases
> performance for you when CONFIG_PM=y.
>
> Kevin
>
> [1]
> Author: Kevin Hilman<khilman@ti.com>   2012-03-05 15:10:04
> Committer: Grant Likely<grant.likely@secretlab.ca>   2012-03-12 09:16:11
> Parent: 25db711df3258d125dc1209800317e5c0ef3c870 (gpio/omap: Fix IRQ handling for SPARSE_IRQ)
> Child:  8805f410e4fb88a56552c1af42d61b38837a38fd (gpio/omap: Fix section warning for omap_mpuio_alloc_gc())
> Branches: many (66)
> Follows: v3.3-rc7
> Precedes: v3.4-rc1
>
>      gpio/omap: fix wakeups on level-triggered GPIOs
>
>      While both level- and edge-triggered GPIOs are capable of generating
>      interrupts, only edge-triggered GPIOs are capable of generating a
>      module-level wakeup to the PRCM (c.f. 34xx NDA TRM section 25.5.3.2.)
>
>      In order to ensure that devices using level-triggered GPIOs as
>      interrupts can also cause wakeups (e.g. from idle), this patch enables
>      edge-triggering for wakeup-enabled, level-triggered GPIOs when a GPIO
>      bank is runtime-suspended (which also happens during idle.)
>
>      This fixes a problem found in GPMC-connected network cards with GPIO
>      interrupts (e.g. smsc911x on Zoom3, Overo, ...) where network booting
>      with NFSroot was very slow since the GPIO IRQs used by the NIC were
>      not generating PRCM wakeups, and thus not waking the system from idle.
>      NOTE: until v3.3, this boot-time problem was somewhat masked because
>      the UART init prevented WFI during boot until the full serial driver
>      was available.  Preventing WFI allowed regular GPIO interrupts to fire
>      and this problem was not seen.  After the UART runtime PM cleanups, we
>      no longer avoid WFI during boot, so GPIO IRQs that were not causing
>      wakeups resulted in very slow IRQ response times.
>
>      Tested on platforms using level-triggered GPIOs for network IRQs using
>      the SMSC911x NIC: 3530/Overo and 3630/Zoom3.
>
>      Reported-by: Tony Lindgren<tony@atomide.com>
>      Tested-by: Tarun Kanti DebBarma<tarun.kanti@ti.com>
>      Tested-by: Tony Lindgren<tony@atomide.com>
>      Signed-off-by: Kevin Hilman<khilman@ti.com>
>      Signed-off-by: Grant Likely<grant.likely@secretlab.ca>
>
> [2]
> diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
> index 413aac4..ace4bf6 100644
> --- a/arch/arm/mach-omap2/cpuidle34xx.c
> +++ b/arch/arm/mach-omap2/cpuidle34xx.c
> @@ -120,7 +120,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
>   		cpu_pm_enter();
>
>   	/* Execute ARM wfi */
> -	omap_sram_idle();
> +	if (index == 0)
> +		cpu_do_idle();
> +	else
> +		omap_sram_idle();
>
>   	/*
>   	 * Call idle CPU PM enter notifier chain to restore

-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------


* Re: PM related performance degradation on OMAP3
  2012-04-12 10:44     ` Gary Thomas
@ 2012-04-12 14:14       ` Kevin Hilman
  2012-04-12 15:28         ` Gary Thomas
  0 siblings, 1 reply; 36+ messages in thread
From: Kevin Hilman @ 2012-04-12 14:14 UTC (permalink / raw)
  To: Gary Thomas; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley

Gary Thomas <gary@mlbassoc.com> writes:

> On 2012-04-11 13:17, Kevin Hilman wrote:
>> Gary Thomas<gary@mlbassoc.com>  writes:
>>
>> [...]
>>
>>> I fear I'm seeing similar problems with 3.3.  I have my board (similar
>>> to the BeagleBoard) ported to 3.0 and 3.3.  I'm seeing terrible network
>>> performance on 3.3.  For example, if I use TFTP to download a large file
>>> (~35MB), I get this:
>>>    3.0:  42.5 sec
>>>    3.3: 625.0 sec
>>> That's a factor of 15 worse!
>>
>> This might not be the same problem.  What is the NIC being used, and
>> does it have GPIO interrupts?
>
> My board uses SMSC911x with GPIO interrupt signal.

OK, then your problem is almost certainly solved by my GPIO triggering
fix, and not related to Grazvydas' problem.

>>
>> If it's using GPIO interrupts, then you likely need this patch from
>> mainline (v3.4-rc1)
>
> I tried to just pick up the patch you [sort of] quoted below, but had
> a hard time applying it to my kernel. I've tried to just pick up the
> latest files from the mainline kernel, but so far I've nothing that
> builds

Oh, right.  Sorry about that.  Yeah, that patch actually has
dependencies on other GPIO changes that were queued for v3.4 (and not in
v3.3.)

If you're on v3.3, just pull the branch below[1] which is based on
v3.3-rc2.  Pulling that into a v3.3 should build just fine.

Kevin

[1] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap-pm.git for_3.4/fixes/gpio



* Re: PM related performance degradation on OMAP3
  2012-04-12 14:14       ` Kevin Hilman
@ 2012-04-12 15:28         ` Gary Thomas
  2012-04-12 16:57           ` Kevin Hilman
  0 siblings, 1 reply; 36+ messages in thread
From: Gary Thomas @ 2012-04-12 15:28 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley

On 2012-04-12 08:14, Kevin Hilman wrote:
> Gary Thomas<gary@mlbassoc.com>  writes:
>
>> On 2012-04-11 13:17, Kevin Hilman wrote:
>>> Gary Thomas<gary@mlbassoc.com>   writes:
>>>
>>> [...]
>>>
>>>> I fear I'm seeing similar problems with 3.3.  I have my board (similar
>>>> to the BeagleBoard) ported to 3.0 and 3.3.  I'm seeing terrible network
>>>> performance on 3.3.  For example, if I use TFTP to download a large file
>>>> (~35MB), I get this:
>>>>     3.0:  42.5 sec
>>>>     3.3: 625.0 sec
>>>> That's a factor of 15 worse!
>>>
>>> This might not be the same problem.  What is the NIC being used, and
>>> does it have GPIO interrupts?
>>
>> My board uses SMSC911x with GPIO interrupt signal.
>
> OK, then your problem is almost certainly solved by my GPIO triggering
> fix, and not related to Grazvydas' problem.
>
>>>
>>> If it's using GPIO interrupts, then you likely need this patch from
>>> mainline (v3.4-rc1)
>>
>> I tried to just pick up the patch you [sort of] quoted below, but had
>> a hard time applying it to my kernel. I've tried to just pick up the
>> latest files from the mainline kernel, but so far I've nothing that
>> builds
>
> Oh, right.  Sorry about that.  Yeah, that patch actually has
> dependencies on other GPIO changes that were queued for v3.4 (and not in
> v3.3.)
>
> If you're on v3.3, just pull the branch below[1] which is based on
> v3.3-rc2.  Pulling that into a v3.3 should build just fine.
>
> Kevin
>
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap-pm.git for_3.4/fixes/gpio

This worked a treat, thanks.  My network performance is better
now, but still not what it was.  The same TFTP transfer now takes
71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.

I am interested in having PM working as I'm designing a battery powered
portable unit, so I need to keep pursuing this.

Note: I noticed that when I built with CONFIG_PM off and no other
changes, my EHCI USB didn't work properly.  Should this be the case?

Thanks again for your help


-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-12 15:28         ` Gary Thomas
@ 2012-04-12 16:57           ` Kevin Hilman
  2012-04-12 17:10             ` Gary Thomas
  2012-04-13  9:13             ` Felipe Balbi
  0 siblings, 2 replies; 36+ messages in thread
From: Kevin Hilman @ 2012-04-12 16:57 UTC (permalink / raw)
  To: Gary Thomas; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley, Felipe Balbi

+Felipe for EHCI question

Gary Thomas <gary@mlbassoc.com> writes:

[...]

> This worked a treat, thanks.  My network performance is better
> now, but still not what it was.  The same TFTP transfer now takes
> 71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
> second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.

And does a CONFIG_PM=n kernel get you back to your v3.0 performance?

> I am interested in having PM working as I'm designing a battery powered
> portable unit, so I need to keep pursuing this.

So do I. :)

> Note: I noticed that when I built with CONFIG_PM off and no other
> changes, my EHCI USB didn't work properly.  Should this be the case?

Probably not, but haven't tested EHCI USB.  I've Cc'd Felipe to see if
he has any ideas why EHCI wouldn't work with CONFIG_PM=n.

Kevin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-12 16:57           ` Kevin Hilman
@ 2012-04-12 17:10             ` Gary Thomas
  2012-04-12 18:08               ` Kevin Hilman
  2012-04-13  9:13             ` Felipe Balbi
  1 sibling, 1 reply; 36+ messages in thread
From: Gary Thomas @ 2012-04-12 17:10 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley, Felipe Balbi

On 2012-04-12 10:57, Kevin Hilman wrote:
> +Felipe for EHCI question
>
> Gary Thomas<gary@mlbassoc.com>  writes:
>
> [...]
>
>> This worked a treat, thanks.  My network performance is better
>> now, but still not what it was.  The same TFTP transfer now takes
>> 71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
>> second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.
>
> And does a CONFIG_PM=n kernel get you back to your v3.0 performance?

Correct.

>
>> I am interested in having PM working as I'm designing a battery powered
>> portable unit, so I need to keep pursuing this.
>
> So do I. :)
>
>> Note: I noticed that when I built with CONFIG_PM off and no other
>> changes, my EHCI USB didn't work properly.  Should this be the case?
>
> Probably not, but haven't tested EHCI USB.  I've Cc'd Felipe to see if
> he has any ideas why EHCI wouldn't work with CONFIG_PM=n.

Thanks

-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-12 17:10             ` Gary Thomas
@ 2012-04-12 18:08               ` Kevin Hilman
  2012-04-12 19:05                 ` Gary Thomas
  0 siblings, 1 reply; 36+ messages in thread
From: Kevin Hilman @ 2012-04-12 18:08 UTC (permalink / raw)
  To: Gary Thomas; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley, Felipe Balbi

Gary Thomas <gary@mlbassoc.com> writes:

> On 2012-04-12 10:57, Kevin Hilman wrote:
>> +Felipe for EHCI question
>>
>> Gary Thomas<gary@mlbassoc.com>  writes:
>>
>> [...]
>>
>>> This worked a treat, thanks.  My network performance is better
>>> now, but still not what it was.  The same TFTP transfer now takes
>>> 71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
>>> second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.
>>
>> And does a CONFIG_PM=n kernel get you back to your v3.0 performance?
>
> Correct.
>

OK, I just tried your TFTP experiment on a 3530/Overo board with the
same smsc911x NIC that has GPIO interrupts, and I don't see much
difference between a PM-enabled v3.0 and a PM-enabled v3.3.

Are you TFTP'ing the file to an MMC filesystem?    Can you try to a
ramdisk[1]?  If you're using MMC, it could be MMC driver changes since
v3.0 that are actually causing your performance hit.

In my experiment, I TFTP'd a 24MB file to a ramdisk, to make sure no
other drivers were involved, and didn't see any major differences
between v3.0, v3.3, and v3.3 with CONFIG_PM disabled.

Below are my results.  As you can see, all the results seem to be pretty
close to the same.  This test was not on a controlled, isolated network,
so the differences are probably explained by other network activity:

- v3.0 vanilla: PM enabled, CPUidle enabled
  - Received 25362406 bytes in 35.5 seconds
  - Received 25362406 bytes in 44.9 seconds
  - Received 25362406 bytes in 49.0 seconds
  - Received 25362406 bytes in 36.2 seconds
  - Received 25362406 bytes in 56.3 seconds
  - Received 25362406 bytes in 65.2 seconds
  - Received 25362406 bytes in 37.0 seconds

- v3.3: PM enabled, CPUidle enabled
 + GPIO fix (my for_3.4/fixes/gpio branch)
 + smsc911x regulator boot fix (Tony's omap/fix-smsc911x-regulator branch)
  - Received 25362406 bytes in 32.1 seconds
  - Received 25362406 bytes in 29.8 seconds
  - Received 25362406 bytes in 33.5 seconds
  - Received 25362406 bytes in 44.5 seconds
  - Received 25362406 bytes in 39.2 seconds
  - Received 25362406 bytes in 57.0 seconds
  - Received 25362406 bytes in 49.6 seconds

- v3.3: CONFIG_PM=n + branches above 
 + fix from Grazvydas for !CONFIG_PM case: [PATCH] ARM: OMAP: sram: fix BUG in dpll code for !PM case
 + disable CONFIG_OMAP_WATCHDOG which fails to boot when CONFIG_PM=y 
  - Received 25362406 bytes in 34.1 seconds
  - Received 25362406 bytes in 33.9 seconds
  - Received 25362406 bytes in 34.9 seconds
  - Received 25362406 bytes in 37.8 seconds
  - Received 25362406 bytes in 40.0 seconds
  - Received 25362406 bytes in 37.6 seconds
  - Received 25362406 bytes in 34.4 seconds
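
For a quicker comparison, the runs above can be reduced to one mean per
configuration.  A minimal sketch in plain C (mean_seconds and the array
names are made up for illustration; the timings are the ones listed
above):

```c
#include <stddef.h>

/* Arithmetic mean of n transfer times in seconds (0.0 for an empty set). */
double mean_seconds(const double *t, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += t[i];
	return n ? sum / n : 0.0;
}

/* Timings copied from the three result lists above. */
const double v30_pm[]   = { 35.5, 44.9, 49.0, 36.2, 56.3, 65.2, 37.0 };
const double v33_pm[]   = { 32.1, 29.8, 33.5, 44.5, 39.2, 57.0, 49.6 };
const double v33_nopm[] = { 34.1, 33.9, 34.9, 37.8, 40.0, 37.6, 34.4 };
```

Calling mean_seconds() on each array gives roughly 46.3, 40.8 and 36.1
seconds.  The spread inside each set (35.5 to 65.2 seconds for v3.0
alone) is larger than the gap between the means, so on an uncontrolled
network these numbers do not show a clear PM penalty.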


Kevin

[1] simple steps to make a ramdisk
mkfs.ext2 /dev/ram0
mkdir /tmp/rd
mount /dev/ram0 /tmp/rd
cd /tmp/rd
<then TFTP file here>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-12 18:08               ` Kevin Hilman
@ 2012-04-12 19:05                 ` Gary Thomas
  2012-04-12 22:03                   ` Kevin Hilman
  0 siblings, 1 reply; 36+ messages in thread
From: Gary Thomas @ 2012-04-12 19:05 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley, Felipe Balbi

On 2012-04-12 12:08, Kevin Hilman wrote:
> Gary Thomas<gary@mlbassoc.com>  writes:
>
>> On 2012-04-12 10:57, Kevin Hilman wrote:
>>> +Felipe for EHCI question
>>>
>>> Gary Thomas<gary@mlbassoc.com>   writes:
>>>
>>> [...]
>>>
>>>> This worked a treat, thanks.  My network performance is better
>>>> now, but still not what it was.  The same TFTP transfer now takes
>>>> 71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
>>>> second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.
>>>
>>> And does a CONFIG_PM=n kernel get you back to your v3.0 performance?
>>
>> Correct.
>>
>
> OK, I just tried your TFTP experiment on a 3530/Overo board with the
> same smsc911x NIC that has GPIO interrupts, and I don't see much
> difference between a PM-enabled v3.0 and a PM-enabled v3.3.
>
> Are you TFTP'ing the file to an MMC filesystem?    Can you try to a
> ramdisk[1]?  If you're using MMC, it could be MMC driver changes since
> v3.0 that are actually causing your performance hit.

I'm testing to a ramdisk, so we're on the same page.

Could you send me your config file so I can compare?  Maybe I have something
"dumb" in my settings that aggravates things.

Also, what's your performance on 3.4-rc2?  The linux-media tree I started
from is a bit post v3.3, so there might be something else causing this.

>
> In my experiment, I TFTP'd a 24MB file to a ramdisk, to make sure no
> other drivers were involved, and didn't see any major differences
> between v3.0, v3.3, and v3.3 with CONFIG_PM disabled.
>
> Below are my results.  As you can see, all the results seem to be pretty
> close to the same.  This test was not on a controlled, isolated network,
> so the differences are probably explained by other network activity:
>
> - v3.0 vanilla: PM enabled, CPUidle enabled
>    - Received 25362406 bytes in 35.5 seconds
>    - Received 25362406 bytes in 44.9 seconds
>    - Received 25362406 bytes in 49.0 seconds
>    - Received 25362406 bytes in 36.2 seconds
>    - Received 25362406 bytes in 56.3 seconds
>    - Received 25362406 bytes in 65.2 seconds
>    - Received 25362406 bytes in 37.0 seconds
>
> - v3.3: PM enabled, CPUidle enabled
>   + GPIO fix (my for_3.4/fixes/gpio branch)
>   + smsc911x regulator boot fix (Tony's omap/fix-smsc911x-regulator branch)
>    - Received 25362406 bytes in 32.1 seconds
>    - Received 25362406 bytes in 29.8 seconds
>    - Received 25362406 bytes in 33.5 seconds
>    - Received 25362406 bytes in 44.5 seconds
>    - Received 25362406 bytes in 39.2 seconds
>    - Received 25362406 bytes in 57.0 seconds
>    - Received 25362406 bytes in 49.6 seconds
>
> - v3.3: CONFIG_PM=n + branches above
>   + fix from Grazvydas for !CONFIG_PM case: [PATCH] ARM: OMAP: sram: fix BUG in dpll code for !PM case
>   + disable CONFIG_OMAP_WATCHDOG which fails to boot when CONFIG_PM=y
>    - Received 25362406 bytes in 34.1 seconds
>    - Received 25362406 bytes in 33.9 seconds
>    - Received 25362406 bytes in 34.9 seconds
>    - Received 25362406 bytes in 37.8 seconds
>    - Received 25362406 bytes in 40.0 seconds
>    - Received 25362406 bytes in 37.6 seconds
>    - Received 25362406 bytes in 34.4 seconds
>
>
> Kevin
>
> [1] simple steps to make a ramdisk
> mkfs.ext2 /dev/ram0
> mkdir /tmp/rd
> mount /dev/ram0 /tmp/rd
> cd /tmp/rd
> <then TFTP file here>

-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-12 19:05                 ` Gary Thomas
@ 2012-04-12 22:03                   ` Kevin Hilman
  2012-04-13  0:39                     ` Gary Thomas
  0 siblings, 1 reply; 36+ messages in thread
From: Kevin Hilman @ 2012-04-12 22:03 UTC (permalink / raw)
  To: Gary Thomas; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley, Felipe Balbi

Gary Thomas <gary@mlbassoc.com> writes:

> On 2012-04-12 12:08, Kevin Hilman wrote:
>> Gary Thomas<gary@mlbassoc.com>  writes:
>>
>>> On 2012-04-12 10:57, Kevin Hilman wrote:
>>>> +Felipe for EHCI question
>>>>
>>>> Gary Thomas<gary@mlbassoc.com>   writes:
>>>>
>>>> [...]
>>>>
>>>>> This worked a treat, thanks.  My network performance is better
>>>>> now, but still not what it was.  The same TFTP transfer now takes
>>>>> 71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
>>>>> second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.
>>>>
>>>> And does a CONFIG_PM=n kernel get you back to your v3.0 performance?
>>>
>>> Correct.
>>>
>>
>> OK, I just tried your TFTP experiment on a 3530/Overo board with the
>> same smsc911x NIC that has GPIO interrupts, and I don't see much
>> difference between a PM-enabled v3.0 and a PM-enabled v3.3.
>>
>> Are you TFTP'ing the file to an MMC filesystem?    Can you try to a
>> ramdisk[1]?  If you're using MMC, it could be MMC driver changes since
>> v3.0 that are actually causing your performance hit.
>
> I'm testing to a ramdisk, so we're on the same page.
>
> Could you send me your config file so I can compare?  Maybe I have something
> "dumb" in my settings that aggravates things.

Below is the Kconfig snippet[1] I append to a default
omap2plus_defconfig to enable CPUidle, CPUfreq and some debug.  Rebuild
with that appended and these settings override the default ones.  I used
omap2plus_defconfig plus this snippet for v3.0, v3.3 and v3.4-rc2 tests.

> Also, what's your performance on 3.4-rc2?  The linux-media tree I started
> from is a bit post v3.3, so there might be something else causing this.

I just tried with vanilla v3.4-rc2, and I see basically the same
results.  Between 35 and 50 seconds for the 24MB file transfer, which is
similar to the v3.0 and v3.3 results.

Kevin

[1] 
CONFIG_CPU_IDLE=y
CONFIG_PM_ADVANCED_DEBUG=y
CONFIG_PM_SLEEP_ADVANCED_DEBUG=y
CONFIG_PM_GENERIC_DOMAINS=y
CONFIG_OMAP_SMARTREFLEX=y
CONFIG_OMAP_SMARTREFLEX_CLASS3=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_ARM_OMAP2PLUS_CPUFREQ=y
CONFIG_REGULATOR_OMAP_SMPS=y

CONFIG_DEBUG_LL=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_USER=y
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_SECTION_MISMATCH=y


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: PM related performance degradation on OMAP3
  2012-04-11  0:29   ` Grazvydas Ignotas
  2012-04-12  0:19     ` Kevin Hilman
@ 2012-04-12 23:02     ` Woodruff, Richard
  1 sibling, 0 replies; 36+ messages in thread
From: Woodruff, Richard @ 2012-04-12 23:02 UTC (permalink / raw)
  To: Grazvydas Ignotas, Hilman, Kevin; +Cc: linux-omap, Paul Walmsley

> From: linux-omap-owner@vger.kernel.org [mailto:linux-omap-
> owner@vger.kernel.org] On Behalf Of Grazvydas Ignotas
> Sent: Tuesday, April 10, 2012 7:30 PM

> What I think is going on here is that omap_sram_idle() is taking too
> much time because its overhead is too large. I've added a counter
> there and it seems to be called ~530 times per megabyte (DMA operates
> in ~2K chunks so it makes sense), that's over 2000 calls per second.
> Some quick measurement code shows ~243us spent for setting up in
> omap_sram_idle() (before and after omap34xx_do_sram_idle()).

243us is a really long time for C1. For some reason it has grown a lot
since the last time I captured the path in ETM.

Your analysis correlates well with reports from a couple of years back.
The N900 folks reported that the non-clock-gated C1 was needed (as exists
in the code today). IIRC the NAND stack had spins of a few us on NAND
status or something similar, so a higher clock-stop penalty resulted in a
big performance dip. You needed something like <10us for C1 or you took a
big hit.
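
The figures quoted above can be checked against the dd results from the
first mail of the thread.  A rough sketch (the function names are made
up; it assumes all of the measured setup time ends up on the transfer's
critical path, which may not be entirely true):

```c
/* Idle-path overhead per megabyte transferred, in milliseconds. */
double idle_overhead_ms_per_mb(unsigned int calls_per_mb, double us_per_call)
{
	return calls_per_mb * us_per_call / 1000.0;
}

/* Extra time per megabyte between two dd runs of the same size, in ms. */
double extra_ms_per_mb(double slow_s, double fast_s, double megabytes)
{
	return (slow_s - fast_s) / megabytes * 1000.0;
}
```

With ~530 calls per megabyte at ~243us each, idle_overhead_ms_per_mb()
gives about 129 ms per megabyte.  The NAND runs (9.088714 s with
CONFIG_PM, 5.653534 s without, 32 MB each) differ by about 107 ms per
megabyte.  The two numbers are of the same order, which is consistent
with the idle path accounting for most of the loss.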

Regards,
Richard W.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-12 22:03                   ` Kevin Hilman
@ 2012-04-13  0:39                     ` Gary Thomas
  0 siblings, 0 replies; 36+ messages in thread
From: Gary Thomas @ 2012-04-13  0:39 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley, Felipe Balbi

On 2012-04-12 16:03, Kevin Hilman wrote:
> Gary Thomas<gary@mlbassoc.com>  writes:
>
>> On 2012-04-12 12:08, Kevin Hilman wrote:
>>> Gary Thomas<gary@mlbassoc.com>   writes:
>>>
>>>> On 2012-04-12 10:57, Kevin Hilman wrote:
>>>>> +Felipe for EHCI question
>>>>>
>>>>> Gary Thomas<gary@mlbassoc.com>    writes:
>>>>>
>>>>> [...]
>>>>>
>>>>>> This worked a treat, thanks.  My network performance is better
>>>>>> now, but still not what it was.  The same TFTP transfer now takes
>>>>>> 71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
>>>>>> second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.
>>>>>
>>>>> And does a CONFIG_PM=n kernel get you back to your v3.0 performance?
>>>>
>>>> Correct.
>>>>
>>>
>>> OK, I just tried your TFTP experiment on a 3530/Overo board with the
>>> same smsc911x NIC that has GPIO interrupts, and I don't see much
>>> difference between a PM-enabled v3.0 and a PM-enabled v3.3.
>>>
>>> Are you TFTP'ing the file to an MMC filesystem?    Can you try to a
>>> ramdisk[1]?  If you're using MMC, it could be MMC driver changes since
>>> v3.0 that are actually causing your performance hit.
>>
>> I'm testing to a ramdisk, so we're on the same page.
>>
>> Could you send me your config file so I can compare?  Maybe I have something
>> "dumb" in my settings that aggravates things.
>
> Below is the Kconfig snippet[1] I append to a default
> omap2plus_defconfig to enable CPUidle, CPUfreq and some debug.  Rebuild
> with that appended and these settings override the default ones.  I used
> omap2plus_defconfig plus this snippet for v3.0, v3.3 and v3.4-rc2 tests.
>
>> Also, what's your performance on 3.4-rc2?  The linux-media tree I started
>> from is a bit post v3.3, so there might be something else causing this.
>
> I just tried with vanilla v3.4-rc2, and I see basically the same
> results.  Between 35 and 50 seconds for the 24MB file transfer, which is
> similar to the v3.0 and v3.3 results.
>
> Kevin
>
> [1]
> CONFIG_CPU_IDLE=y
> CONFIG_PM_ADVANCED_DEBUG=y
> CONFIG_PM_SLEEP_ADVANCED_DEBUG=y
> CONFIG_PM_GENERIC_DOMAINS=y
> CONFIG_OMAP_SMARTREFLEX=y
> CONFIG_OMAP_SMARTREFLEX_CLASS3=y
> CONFIG_CPU_FREQ=y
> CONFIG_CPU_FREQ_TABLE=y
> CONFIG_CPU_FREQ_STAT=y
> CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
> CONFIG_CPU_FREQ_GOV_USERSPACE=y
> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> CONFIG_CPU_FREQ_GOV_POWERSAVE=y
> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
> CONFIG_ARM_OMAP2PLUS_CPUFREQ=y
> CONFIG_REGULATOR_OMAP_SMPS=y
>
> CONFIG_DEBUG_LL=y
> CONFIG_DEBUG_BUGVERBOSE=y
> CONFIG_DEBUG_USER=y
> CONFIG_EARLY_PRINTK=y
> CONFIG_DEBUG_SECTION_MISMATCH=y

These settings made no difference.

I just reverified my results to xfer a 39MB file to ramdisk:
   3.0 + PM = 39sec
   3.3 + PM = 70sec
   3.3 - PM = 48sec
so it's not quite the same as 3.0 was, but closer.  BTW, your
results normalized to mine would be
   3.3 + PM = 56sec

I wish I knew why I'm seeing a big difference between +PM/-PM
and you don't.  Is there some way to compare your source tree
(the one you built for v3.3) and mine?  I'm not very good with
GIT so I'm not quite sure how to do it.

Sorry for being so much trouble, I'm just in search of all the
performance I can get out of my system :-)

Thanks


-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-12 16:57           ` Kevin Hilman
  2012-04-12 17:10             ` Gary Thomas
@ 2012-04-13  9:13             ` Felipe Balbi
  1 sibling, 0 replies; 36+ messages in thread
From: Felipe Balbi @ 2012-04-13  9:13 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: Gary Thomas, Grazvydas Ignotas, linux-omap, Paul Walmsley,
	Felipe Balbi, Govindraj Raja, Keshava Munegowda

Hi,

On Thu, Apr 12, 2012 at 09:57:32AM -0700, Kevin Hilman wrote:
> +Felipe for EHCI question
> 
> Gary Thomas <gary@mlbassoc.com> writes:
> 
> [...]
> 
> > This worked a treat, thanks.  My network performance is better
> > now, but still not what it was.  The same TFTP transfer now takes
> > 71 seconds, so about 50% slower than on the 3.0 kernel.  Applying the
> > second [unnamed] patch (arch/arm/mach-omap2/cpuidle34xx.c) made no difference.
> 
> And does a CONFIG_PM=n kernel get you back to your v3.0 performance?
> 
> > I am interested in having PM working as I'm designing a battery powered
> > portable unit, so I need to keep pursuing this.
> 
> So do I. :)

we all are :-p

> > Note: I noticed that when I built with CONFIG_PM off and no other
> > changes, my EHCI USB didn't work properly.  Should this be the case?
> 
> Probably not, but haven't tested EHCI USB.  I've Cc'd Felipe to see if
> he has any ideas why EHCI wouldn't work with CONFIG_PM=n.

Govind, Keshava... can you look into this at some point next week ? Or
maybe give us a good reason why it doesn't work without PM ;-)

-- 
balbi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-12  0:19     ` Kevin Hilman
@ 2012-04-13 17:32       ` Grazvydas Ignotas
  2012-04-13 19:32       ` Grazvydas Ignotas
  1 sibling, 0 replies; 36+ messages in thread
From: Grazvydas Ignotas @ 2012-04-13 17:32 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: linux-omap, Paul Walmsley

On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman <khilman@ti.com> wrote:
> Grazvydas Ignotas <notasas@gmail.com> writes:
>
>> On Mon, Apr 9, 2012 at 10:03 PM, Kevin Hilman <khilman@ti.com> wrote:
>>> Grazvydas Ignotas <notasas@gmail.com> writes:
>>>> While SD card performance loss is not that bad (~7%), NAND one is
>>>> worrying (~39%). I've tried disabling/enabling CONFIG_CPU_IDLE, also
>>>> cpuidle states over sysfs, it did not have any significant effect. Is
>>>> there something else to try?
>>>
>>> Looks like we might need a PM QoS constraint when there is DMA activity
>>> in progress.
>>>
>>> You can try doing a pm_qos_add_request() for PM_QOS_CPU_DMA_LATENCY when
>>> DMA transfers are active and I suspect that will help.
>>
>> I've tried it and it didn't help much. It looks like the only thing it
>> does is limiting cpuidle c-states, I tried to set qos dma latency to 0
>> and it made it stay in C1 while transfer was ongoing (I watched
>> /sys/devices/system/cpu/cpu0/cpuidle/state*/usage), but performance
>> was still poor.
>
> Great, thanks for doing this experiment.
>
> Assuming we get to a C1 that's low-latency enough, we will still need
> this constraint to ensure C1 during transfers.  But first we have to
> figure out what's going on with C1...
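
To repeat this experiment without patching a driver, the same
constraint class can also be requested from user space through
/dev/cpu_dma_latency: a process writes a 32-bit latency value in
microseconds, and the request stays in force for as long as the file
descriptor is kept open.  A minimal sketch (hold_cpu_dma_latency is a
made-up helper name; this is not the in-kernel request that was used
in the test quoted above):

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Write a latency limit (in microseconds) to a PM QoS device node such
 * as /dev/cpu_dma_latency.  Returns the open file descriptor on
 * success, or -1 on failure.  The caller must keep the descriptor open
 * for as long as the limit should apply, and close() it afterwards.
 */
int hold_cpu_dma_latency(const char *path, int32_t max_latency_us)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, &max_latency_us, sizeof(max_latency_us)) !=
	    (ssize_t)sizeof(max_latency_us)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A test program would call hold_cpu_dma_latency("/dev/cpu_dma_latency", 0)
before starting the transfer and close the descriptor when done.  As
the results above show, this only limits which C-state is chosen; it
does not remove overhead inside C1 itself.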

I've been working on this to collect more data, and noticed that PER
is often being put to RET even at C1. Is that expected? There is some
additional work being done in that case, like putting GPIOs to sleep,
and it seems to be the source of part of the performance loss here, as
it happens often during NAND transfers.

This can be reproduced while doing mmc transfers too and detected with this:
(not a valid patch, sorry, sending through gmail web)

--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -87,6 +87,8 @@ static int _cpuidle_deny_idle(struct powerdomain *pwrdm,
        return 0;
 }

+int is_c1;
+
 static int __omap3_enter_idle(struct cpuidle_device *dev,
                                struct cpuidle_driver *drv,
                                int index)
@@ -117,6 +120,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
                cpu_pm_enter();

        /* Execute ARM wfi */
+       is_c1 = (index == 0);
        omap_sram_idle();

        /*
diff --git a/arch/arm/mach-omap2/pm34xx.c b/arch/arm/mach-omap2/pm34xx.c
index 703bd10..519ce9d 100644
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -275,6 +275,7 @@ void omap_sram_idle(void)
        int per_going_off;
        int core_prev_state, per_prev_state;
        u32 sdrc_pwr = 0;
+       extern int is_c1;

        mpu_next_state = pwrdm_read_next_pwrst(mpu_pwrdm);
        switch (mpu_next_state) {
@@ -299,6 +300,8 @@ void omap_sram_idle(void)
        /* Enable IO-PAD and IO-CHAIN wakeups */
        per_next_state = pwrdm_read_next_pwrst(per_pwrdm);
        core_next_state = pwrdm_read_next_pwrst(core_pwrdm);
+if (is_c1 && (per_next_state != PWRDM_POWER_ON || core_next_state != PWRDM_POWER_ON))
+ printk(KERN_ERR "c1 core %d, per %d\n", core_next_state, per_next_state);
        if (omap3_has_io_wakeup() &&
            (per_next_state < PWRDM_POWER_ON ||
             core_next_state < PWRDM_POWER_ON)) {


-- 
Gražvydas

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-12  0:19     ` Kevin Hilman
  2012-04-13 17:32       ` Grazvydas Ignotas
@ 2012-04-13 19:32       ` Grazvydas Ignotas
  2012-04-17 14:30         ` Kevin Hilman
  1 sibling, 1 reply; 36+ messages in thread
From: Grazvydas Ignotas @ 2012-04-13 19:32 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: linux-omap, Paul Walmsley

On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman <khilman@ti.com> wrote:
> It would be helpful now to narrow down what are the big contributors to
> the overhead in omap_sram_idle().  Most of the code there is skipped for
> C1 because the next states for MPU and CORE are both ON.

OK, I did some tests, all on a mostly idle system with just init, a
busybox shell and dd doing a NAND read to /dev/null. MB/s is the
throughput that dd reports; mA is the approx. current draw during the
transfer, read from the onboard fuel gauge.

MB/s| mA|comment
 3.7|218|mainline f549e088b80
 3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
 4.4|220|[1] + pwrdm_p*_transition commented [2]
 3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
 4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
 4.0|224|[1] + 'Deny idle' [5]
 5.1|210|[2] + [4] + [5]
 5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
 5.5|243|!CONFIG_PM
 6.1|282|busywait DMA end (for reference)
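
One way to read the table is to combine the two columns into charge
drawn per megabyte.  A small sketch (the function names are made up;
it assumes the fuel gauge figure is an average over the transfer and
covers the whole board, not just the SoC):

```c
/* Charge drawn per megabyte read, in mA*s per MB. */
double charge_per_mb(double current_ma, double mb_per_s)
{
	return current_ma / mb_per_s;
}

/* Throughput as a percentage of a reference configuration. */
double percent_of(double mb_per_s, double reference_mb_per_s)
{
	return 100.0 * mb_per_s / reference_mb_per_s;
}
```

With the rows above, mainline comes out at about 58.9 mA*s per MB,
!CONFIG_PM at about 44.2 and the [5] + [6] row at about 38.8, while
mainline reaches about 67% of the !CONFIG_PM throughput.  By this
measure the mainline PM path costs more charge per megabyte read than
running without PM.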

> There are 2 primary differences that I see as possible causes.  I list
> them here with a couple more experiments for you to try to help us
> narrow this down.
>
> 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()
>
> Could you try using omap_sram_idle() and just commenting out those
> calls?  Does that help performance?  Those iterate over all the
> powerdomains, so definitely add some overhead, but I don't think it
> would be as significant as what you're seeing.

Seems to be taking a good part of it.

>    Much more likely is...
>
> 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds

Could not notice any difference.

To me it looks like this results from many small things adding up..
Idle is called so often that pwrdm_p*_transition() and those
pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
because they access lots of registers on slow buses? Maybe some
register cache would help us there, or are those registers expected to
be changed by hardware often?
Also, trying to idle PER while a transfer is ongoing (as reported in my
previous mail) doesn't sound like a good idea and is one of the
reasons for the slowdown. Ironically, it also seems to cause more
current drain.
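
To make the register cache idea concrete, here is a purely hypothetical
sketch.  None of these names exist in the kernel, and it is only valid
for registers that hardware never changes behind software's back, which
is exactly the open question above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached view of one slow-to-read register. */
struct cached_reg {
	uint32_t (*read_hw)(void);	/* slow bus read */
	uint32_t value;			/* last value read */
	bool valid;			/* false until first read or after invalidate */
};

/* Return the cached value, touching the bus only when the cache is cold. */
uint32_t cached_reg_read(struct cached_reg *r)
{
	if (!r->valid) {
		r->value = r->read_hw();
		r->valid = true;
	}
	return r->value;
}

/* Drop the cached value, e.g. after a transition that may alter it. */
void cached_reg_invalidate(struct cached_reg *r)
{
	r->valid = false;
}

/* Stand-in for a slow register read; counts how often the "bus" is hit. */
unsigned int hw_reads;

uint32_t fake_read_hw(void)
{
	hw_reads++;
	return 0x3;
}
```

With fake_read_hw plugged in as read_hw, repeated cached_reg_read()
calls hit the "bus" once until cached_reg_invalidate() is called.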


changes (again, sorry for corrupted diffs, but they should be easy to
reproduce):
[2]:
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -307,7 +307,7 @@ void omap_sram_idle(void)
                        omap3_enable_io_chain();
        }

-       pwrdm_pre_transition();
+//     pwrdm_pre_transition();

        /* PER */
        if (per_next_state < PWRDM_POWER_ON) {
@@ -372,7 +373,7 @@ void omap_sram_idle(void)
        }
        omap3_intc_resume_idle();

-       pwrdm_post_transition();
+//     pwrdm_post_transition();

        /* PER */
        if (per_next_state < PWRDM_POWER_ON) {
[3]:
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -347,7 +347,7 @@ void omap_sram_idle(void)
        if (save_state == 1 || save_state == 3)
                cpu_suspend(save_state, omap34xx_do_sram_idle);
        else
-               omap34xx_do_sram_idle(save_state);
+               cpu_do_idle();

        /* Restore normal SDRC POWER settings */
        if (cpu_is_omap3430() && omap_rev() >= OMAP3430_REV_ES3_0 &&
[4]:
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -107,6 +107,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
        if (index == 0) {
                pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
                pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
+               pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON);
        }

        /*
[5]:
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -105,8 +105,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,

        /* Deny idle for C1 */
        if (index == 0) {
-               pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
-               pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
+               clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
        }

        /*
@@ -128,8 +128,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,

        /* Re-allow idle for C1 */
        if (index == 0) {
-               pwrdm_for_each_clkdm(mpu_pd, _cpuidle_allow_idle);
-               pwrdm_for_each_clkdm(core_pd, _cpuidle_allow_idle);
+               clkdm_allow_idle(mpu_pd->pwrdm_clkdms[0]);
        }

 return_sleep_time:
[6]:
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -117,7 +116,8 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
                cpu_pm_enter();

        /* Execute ARM wfi */
-       omap_sram_idle();
+       //omap_sram_idle();
+       cpu_do_idle();

        /*
         * Call idle CPU PM enter notifier chain to restore


-- 
Gražvydas

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-13 19:32       ` Grazvydas Ignotas
@ 2012-04-17 14:30         ` Kevin Hilman
  2012-04-17 21:50           ` Grazvydas Ignotas
  2012-04-24  9:50           ` Jean Pihet
  0 siblings, 2 replies; 36+ messages in thread
From: Kevin Hilman @ 2012-04-17 14:30 UTC (permalink / raw)
  To: Grazvydas Ignotas; +Cc: linux-omap, Paul Walmsley

Grazvydas Ignotas <notasas@gmail.com> writes:

> On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman <khilman@ti.com> wrote:
>> It would be helpful now to narrow down what are the big contributors to
>> the overhead in omap_sram_idle().  Most of the code there is skipped for
>> C1 because the next states for MPU and CORE are both ON.
>
> Ok I did some tests, all in mostly idle system with just init, busybox
> shell and dd doing a NAND read to /dev/null . 

Hmm, I seem to get a hang using dd to read from NAND /dev/mtdX on my
Overo.  I saw your patch 'mtd: omap2: fix resource leak in prefetch-busy
path' but that didn't seem to help my crash.

> MB/s is throughput that
> dd reports, mA and approx. current draw during the transfer, read from
> fuel gauge that's onboard.
>
> MB/s| mA|comment
>  3.7|218|mainline f549e088b80
>  3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
>  4.4|220|[1] + pwrdm_p*_transition commented [2]
>  3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
>  4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
>  4.0|224|[1] + 'Deny idle' [5]
>  5.1|210|[2] + [4] + [5]
>  5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
>  5.5|243|!CONFIG_PM
>  6.1|282|busywait DMA end (for reference)

Thanks for the detailed experiments.  This definitely confirms we have
some serious unwanted overhead for C1, and our C-state latency values
are clearly way off base, since they only account for HW latency and not
any of the SW latency introduced in omap_sram_idle().
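
The table above can also be read as charge spent per megabyte, which
makes the "slower and hungrier" rows easier to spot.  A minimal sketch,
assuming the mA column is the average draw of the whole board during the
transfer (so baseline current is included); the struct and function names
are made up for illustration:

```c
#include <stdio.h>

/* One row of the table above: dd throughput and approximate current draw. */
struct run {
	const char *label;
	double mb_per_s;
	double milliamps;
};

/* A subset of the rows quoted above, copied verbatim from the table. */
static const struct run runs[] = {
	{ "mainline",    3.7, 218 },
	{ "[2]+[4]+[5]", 5.1, 210 },
	{ "[5]+[6]",     5.2, 202 },
	{ "!CONFIG_PM",  5.5, 243 },
	{ "busywait",    6.1, 282 },
};

/* Charge spent per megabyte transferred, in mA*s/MB (lower is better). */
static double charge_per_mb(const struct run *r)
{
	return r->milliamps / r->mb_per_s;
}

/* Throughput of one run relative to a reference run, in percent. */
static double relative_throughput(const struct run *r, const struct run *ref)
{
	return 100.0 * r->mb_per_s / ref->mb_per_s;
}

/* Print every row with its derived figures; !CONFIG_PM is the reference. */
static void print_runs(void)
{
	const struct run *ref = &runs[3];
	size_t i;

	for (i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
		printf("%-12s %5.1f mA*s/MB  %5.1f%% of !CONFIG_PM\n",
		       runs[i].label, charge_per_mb(&runs[i]),
		       relative_throughput(&runs[i], ref));
}
```

Calling print_runs() from a small main() prints one line per row.  By
this measure mainline costs roughly 59 mA*s per MB against roughly 39 for
[5]+[6], so the extra idle-path work appears to cost energy as well as
throughput.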

>> There are 2 primary differences that I see as possible causes.  I list
>> them here with a couple more experiments for you to try to help us
>> narrow this down.
>>
>> 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()
>>
>> Could you try using omap_sram_idle() and just commenting out those
>> calls?  Does that help performance?  Those iterate over all the
>> powerdomains, so defintely add some overhead, but I don't think it
>> would be as significant as what you're seeing.
>
> Seems to be taking good part of it.
>
>>    Much more likely is...
>>
>> 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds
>
> Could not notice any difference.
>
> To me it looks like this results from many small things adding up..
> Idle is called so often that pwrdm_p*_transition() and those
> pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
> because they access lots of registers on slow buses? 

Yes PRCM register accesses are unfortunately rather slow, and we've
known that for some time, but haven't done any detailed analysis of the
overhead.

Using the function_graph tracer, I was able to see that the pre/post
transitions are taking an enormous amount of time:

  - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
  - pwrdm post-transition: 1600+ us at 600MHz (6000+ us at 125MHz)

Notice the big difference between 600MHz OPP and 125MHz OPP.  Are you
using CPUfreq at all in your tests?  If using cpufreq + ondemand
governor, you're probably running at low OPP due to lack of CPU activity
which will also affect the latencies in the idle path.

> Maybe some register cache would help us there, or are those registers
> expected to be changed by hardware often?

Yes, we've known that some sort of register cache here would be useful
for some time, but haven't got to implementing it.
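
To make the register-cache idea concrete, here is a rough userspace
sketch, not kernel code.  Everything in it is hypothetical: fake_hw[]
stands in for the PRCM registers, slow_read() for an expensive bus
access, and the cached_*() helpers are invented names.  It assumes the
cached registers are only changed by software; any register the hardware
updates on its own would need an explicit invalidate before it is read
again.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 4

/* Hypothetical stand-in for slow hardware registers. */
static uint32_t fake_hw[NUM_REGS];
/* Counts simulated slow bus accesses so the saving is visible. */
static unsigned long slow_reads;

/* Simulated expensive register read. */
static uint32_t slow_read(unsigned int idx)
{
	slow_reads++;
	return fake_hw[idx];
}

/* Simulated register write. */
static void slow_write(unsigned int idx, uint32_t val)
{
	fake_hw[idx] = val;
}

/* Shadow copy of each register plus a flag saying whether it is usable. */
struct reg_cache {
	uint32_t val[NUM_REGS];
	bool valid[NUM_REGS];
};

/* Return the shadow value, touching the hardware only on a miss. */
static uint32_t cached_read(struct reg_cache *c, unsigned int idx)
{
	if (!c->valid[idx]) {
		c->val[idx] = slow_read(idx);
		c->valid[idx] = true;
	}
	return c->val[idx];
}

/* Write through to the hardware and keep the shadow copy in sync. */
static void cached_write(struct reg_cache *c, unsigned int idx, uint32_t val)
{
	slow_write(idx, val);
	c->val[idx] = val;
	c->valid[idx] = true;
}

/* Drop a shadow value, e.g. after the hardware may have changed it. */
static void cache_invalidate(struct reg_cache *c, unsigned int idx)
{
	c->valid[idx] = false;
}
```

With a zero-initialised struct reg_cache, repeated cached_read() calls
for the same index cost one slow access in total.  The price is that a
stale value is returned until someone calls cache_invalidate(), which is
why the answer to "are those registers changed by hardware?" matters.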

> Also trying to idle PER while transfer is ongoing (as reported in
> previous mail) doesn't sound like a good idea and is one of the
> reasons for slowdown. Seems to also causing more current drain,
> ironically.

Agreed.  Again, using the function_graph tracer, I get some pretty big
latencies from the GPIO pre/post idling process:

  - gpio_prepare_for_idle(): 2400+ us at 600MHz (8200+ us at 125MHz)
  - gpio_resume_from_idle(): 2200+ us at 600MHz (7600+ us at 125MHz)

Removing PER transitions as you did will get rid of those.

I'm looking into this in more detail now, and will likely have a few
patches for you to experiment with.

Thanks again for digging into this with us,

Kevin


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-17 14:30         ` Kevin Hilman
@ 2012-04-17 21:50           ` Grazvydas Ignotas
  2012-04-18  0:36             ` Kevin Hilman
  2012-04-24  9:50           ` Jean Pihet
  1 sibling, 1 reply; 36+ messages in thread
From: Grazvydas Ignotas @ 2012-04-17 21:50 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: linux-omap, Paul Walmsley

On Tue, Apr 17, 2012 at 5:30 PM, Kevin Hilman <khilman@ti.com> wrote:
> Grazvydas Ignotas <notasas@gmail.com> writes:
>>
>> Ok I did some tests, all in mostly idle system with just init, busybox
>> shell and dd doing a NAND read to /dev/null .
>
> Hmm, I seem to get a hang using dd to read from NAND /dev/mtdX on my
> Overo.  I saw your patch 'mtd: omap2: fix resource leak in prefetch-busy
> path' but that didn't seem to help my crash.

I see overo doesn't set 16bit flag, I think it has NAND on 16bit bus?
Perhaps try this:

--- a/arch/arm/mach-omap2/board-overo.c
+++ b/arch/arm/mach-omap2/board-overo.c
@@ -517,7 +517,7 @@ static void __init overo_init(void)
        omap_serial_init();
        omap_sdrc_init(mt46h32m32lf6_sdrc_params,
                                  mt46h32m32lf6_sdrc_params);
-       omap_nand_flash_init(0, overo_nand_partitions,
+       omap_nand_flash_init(NAND_BUSWIDTH_16, overo_nand_partitions,
                             ARRAY_SIZE(overo_nand_partitions));
        usb_musb_init(NULL);
        usbhs_init(&usbhs_bdata);

Also only pandora is using NAND DMA mode right now in mainline, the
default polling mode won't exhibit the latency problem (with all other
polling consequences like high CPU usage), so this is needed too for
the test:

--- a/arch/arm/mach-omap2/common-board-devices.c
+++ b/arch/arm/mach-omap2/common-board-devices.c
@@ -127,6 +127,7 @@ void __init omap_nand_flash_init(int options, struct mtd_partition *parts,
                nand_data.parts = parts;
                nand_data.nr_parts = nr_parts;
                nand_data.devsize = options;
+               nand_data.xfer_type = NAND_OMAP_PREFETCH_DMA;

                printk(KERN_INFO "Registering NAND on CS%d\n", nandcs);
                if (gpmc_nand_init(&nand_data) < 0)

I also forgot to mention I was using ubifs in my test (dd'ing large
file from it), I don't think it has much effect, but if you want to
try with that:
.config
CONFIG_MTD_UBI=y
CONFIG_UBIFS_FS=y
--
ubiformat /dev/mtdX -s 512
ubiattach /dev/ubi_ctrl -m X # X from mtdX
ubimkvol /dev/ubi0 -m -N somename
mount -t ubifs ubi0:somename /mnt

>> To me it looks like this results from many small things adding up..
>> Idle is called so often that pwrdm_p*_transition() and those
>> pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
>> because they access lots of registers on slow buses?
>
> Yes PRCM register accesses are unfortunately rather slow, and we've
> known that for some time, but haven't done any detailed analysis of the
> overhead.
>
> Using the function_graph tracer, I was able to see that the pre/post
> transition are taking an enormous amount of time:
>
>  - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
>  - pwrdm post-transtion: 1600+ us at 600MHz (6000+ us at 125MHz)

Hmm, with this it wouldn't be able to do ~500+ calls/sec I was seeing,
so the tracer overhead is probably quite large too..
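
The arithmetic behind that, as a small sketch (helper names are made up;
it assumes each idle entry pays the pre and post cost exactly once and
does nothing else):

```c
/*
 * Upper bound on idle entries per second if every entry costs
 * pre_us + post_us microseconds of pure overhead.
 */
static double max_idle_calls_per_sec(double pre_us, double post_us)
{
	return 1e6 / (pre_us + post_us);
}

/*
 * Largest per-entry overhead, in microseconds, that still allows the
 * observed number of idle entries per second.
 */
static double max_overhead_us(double calls_per_sec)
{
	return 1e6 / calls_per_sec;
}
```

max_idle_calls_per_sec(1400, 1600) comes out at about 333 per second,
below the ~500 I measured, and max_overhead_us(500) is 2000 us, so the
untraced pre+post cost has to be well under the 3000 us the tracer shows.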

> Notice the big difference between 600MHz OPP and 125MHz OPP.  Are you
> using CPUfreq at all in your tests?  If using cpufreq + ondemand
> governor, you're probably running at low OPP due to lack of CPU activity
> which will also affect the latencies in the idle path.

I used performance governor in my tests, so it all was at 600MHz.

> I'm looking into this in more detail know, and will likely have a few
> patches for you to experiment with.

Sounds good,


-- 
Gražvydas

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-17 21:50           ` Grazvydas Ignotas
@ 2012-04-18  0:36             ` Kevin Hilman
  0 siblings, 0 replies; 36+ messages in thread
From: Kevin Hilman @ 2012-04-18  0:36 UTC (permalink / raw)
  To: Grazvydas Ignotas; +Cc: linux-omap, Paul Walmsley

Grazvydas Ignotas <notasas@gmail.com> writes:

> On Tue, Apr 17, 2012 at 5:30 PM, Kevin Hilman <khilman@ti.com> wrote:
>> Grazvydas Ignotas <notasas@gmail.com> writes:
>>>
>>> Ok I did some tests, all in mostly idle system with just init, busybox
>>> shell and dd doing a NAND read to /dev/null .
>>
>> Hmm, I seem to get a hang using dd to read from NAND /dev/mtdX on my
>> Overo.  I saw your patch 'mtd: omap2: fix resource leak in prefetch-busy
>> path' but that didn't seem to help my crash.

[...]

> Also only pandora is using NAND DMA mode right now in mainline, the
> default polling mode won't exhibit the latency problem (with all other
> polling consequences like high CPU usage), so this is needed too for
> the test:

Yeah, I noticed that today when I discovered my dd tests weren't causing
any DMA interrupts. ;) I switched Overo to DMA mode by copying the pdata
from the Pandora board file, and now it's working fine, and I'm
seeing throughput similar to yours.

> I also forgot to mention I was using ubifs in my test (dd'ing large
> file from it), I don't think it has much effect, but if you want to
> try with that:

[...]

I'm just dd'ing raw bytes from /dev/mtdX to /dev/null, so the format
shouldn't matter I guess.

>>> To me it looks like this results from many small things adding up..
>>> Idle is called so often that pwrdm_p*_transition() and those
>>> pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
>>> because they access lots of registers on slow buses?
>>
>> Yes PRCM register accesses are unfortunately rather slow, and we've
>> known that for some time, but haven't done any detailed analysis of the
>> overhead.
>>
>> Using the function_graph tracer, I was able to see that the pre/post
>> transition are taking an enormous amount of time:
>>
>>  - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
>>  - pwrdm post-transtion: 1600+ us at 600MHz (6000+ us at 125MHz)
>
> Hmm, with this it wouldn't be able to do ~500+ calls/sec I was seeing,
> so the tracer overhead is probably quite large too..

Yes, tracer overhead is important there, but it still shows me who the
biggest contributors are to the overhead/delay.

>> Notice the big difference between 600MHz OPP and 125MHz OPP.  Are you
>> using CPUfreq at all in your tests?  If using cpufreq + ondemand
>> governor, you're probably running at low OPP due to lack of CPU activity
>> which will also affect the latencies in the idle path.
>
> I used performance governor in my tests, so it all was at 600MHz.

OK, good.

Kevin

>> I'm looking into this in more detail know, and will likely have a few
>> patches for you to experiment with.
>
> Sounds good,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-17 14:30         ` Kevin Hilman
  2012-04-17 21:50           ` Grazvydas Ignotas
@ 2012-04-24  9:50           ` Jean Pihet
  2012-04-24 10:38             ` Santosh Shilimkar
  2012-04-24 14:29             ` Kevin Hilman
  1 sibling, 2 replies; 36+ messages in thread
From: Jean Pihet @ 2012-04-24  9:50 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley

Hi Grazvydas, Kevin,

I gathered some performance measurements and statistics using
custom tracepoints in __omap3_enter_idle.
All the details are at
http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
.

The setup is:
- Beagleboard (OMAP3530) at 500MHz,
- l-o master kernel + functional power states + per-device PM QoS. It
has been checked that the changes from l-o master do not have an
impact on the performance.
- The data transfer is performed using dd from a file in JFFS2 to
/dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.

On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman <khilman@ti.com> wrote:
> Grazvydas Ignotas <notasas@gmail.com> writes:
>
>> On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman <khilman@ti.com> wrote:
>>> It would be helpful now to narrow down what are the big contributors to
>>> the overhead in omap_sram_idle().  Most of the code there is skipped for
>>> C1 because the next states for MPU and CORE are both ON.
>>
>> Ok I did some tests, all in mostly idle system with just init, busybox
>> shell and dd doing a NAND read to /dev/null .
>
...
>
>> MB/s is throughput that
>> dd reports, mA and approx. current draw during the transfer, read from
>> fuel gauge that's onboard.
>>
>> MB/s| mA|comment
>>  3.7|218|mainline f549e088b80
>>  3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
>>  4.4|220|[1] + pwrdm_p*_transition commented [2]
>>  3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
>>  4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
>>  4.0|224|[1] + 'Deny idle' [5]
>>  5.1|210|[2] + [4] + [5]
>>  5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
>>  5.5|243|!CONFIG_PM
>>  6.1|282|busywait DMA end (for reference)

Here are the results (BW in MB/s) on Beagleboard:
- 4.7: without using DMA,

- Using DMA
  2.1: [0]
  2.1: [1] only C1
  2.6: [1]+[2] no pre_ post_
  2.3: [1]+[5] no pwrdm_for_each_clkdm
  2.8: [1]+[5]+[2]
  3.1: [1]+[5]+[6] no omap_sram_idle
  3.1: No IDLE, no omap_sram_idle, all pwrdms to ON

So indeed this shows there is some serious performance issue with the
C1 C-state.

> Thanks for the detailed experiments.  This definitely confirms we have
> some serious unwanted overhead for C1, and our C-state latency values
> are clearly way off base, since they only account HW latency and not any
> of the SW latency introduced in omap_sram_idle().
>
>>> There are 2 primary differences that I see as possible causes.  I list
>>> them here with a couple more experiments for you to try to help us
>>> narrow this down.
>>>
>>> 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()
>>>
>>> Could you try using omap_sram_idle() and just commenting out those
>>> calls?  Does that help performance?  Those iterate over all the
>>> powerdomains, so defintely add some overhead, but I don't think it
>>> would be as significant as what you're seeing.
>>
>> Seems to be taking good part of it.
>>
>>>    Much more likely is...
>>>
>>> 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds
>>
>> Could not notice any difference.
>>
>> To me it looks like this results from many small things adding up..
>> Idle is called so often that pwrdm_p*_transition() and those
>> pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
>> because they access lots of registers on slow buses?

From the list of contributors, the main ones are:
    (140us) pwrdm_pre_transition and pwrdm_post_transition,
    (105us) omap2_gpio_prepare_for_idle and
omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
the latency-critical C-states,
    (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
    (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
    (11 us) clkdm_allow_idle(mpu). Is this needed?
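
Adding those up gives a feel for the total software cost.  A minimal
sketch, assuming the figures are per idle entry/exit and simply additive
(the table and helper names are made up for illustration):

```c
#include <stddef.h>

/* One measured contributor to the C1 software overhead. */
struct contributor {
	const char *name;
	unsigned int us;
};

/* Figures copied from the list above, in microseconds. */
static const struct contributor contributors[] = {
	{ "pwrdm_pre_transition + pwrdm_post_transition",               140 },
	{ "omap2_gpio_prepare_for_idle + omap2_gpio_resume_after_idle", 105 },
	{ "pwrdm_for_each_clkdm (mpu, core, deny/allow idle)",           78 },
	{ "omap_set_pwrdm_state (mpu, core, neon), estimated",           33 },
	{ "clkdm_allow_idle (mpu)",                                      11 },
};

#define NUM_CONTRIBUTORS (sizeof(contributors) / sizeof(contributors[0]))

/* Sum of all listed contributors, in microseconds. */
static unsigned int total_overhead_us(void)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < NUM_CONTRIBUTORS; i++)
		total += contributors[i].us;
	return total;
}

/* Share of contributor i in the total, in whole percent (rounded down). */
static unsigned int share_percent(size_t i)
{
	return 100 * contributors[i].us / total_overhead_us();
}
```

total_overhead_us() returns 367, with the pre/post transition calls and
the GPIO idle hooks together making up about two thirds of it.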

Here are a few questions and suggestions:
- In case of latency critical C-states could the high-latency code be
bypassed in favor of a much simpler version? Pushing the concept a bit
farther one could have a C1 state that just relaxes the cpu (no WFI),
a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
rest of the C-states as we have today,
- Is it needed to iterate through all the power and clock domains in
order to keep them active?
- Trying to idle some unrelated power domains (e.g. PER) causes a
performance hit. How to link all the power domains states to the
cpuidle C-state? The per-device PM QoS framework could be used to
constrain some power domains, but this is highly dependent on the use
case.

> Yes PRCM register accesses are unfortunately rather slow, and we've
> known that for some time, but haven't done any detailed analysis of the
> overhead.
It would be worth doing that analysis. A lot of read accesses to the
current, next and previous power states are performed in the idle
code.

> Using the function_graph tracer, I was able to see that the pre/post
> transition are taking an enormous amount of time:
>
>  - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
>  - pwrdm post-transtion: 1600+ us at 600MHz (6000+ us at 125MHz)
>
> Notice the big difference between 600MHz OPP and 125MHz OPP.  Are you
> using CPUfreq at all in your tests?  If using cpufreq + ondemand
> governor, you're probably running at low OPP due to lack of CPU activity
> which will also affect the latencies in the idle path.
>
>> Maybe some register cache would help us there, or are those registers
>> expected to be changed by hardware often?
>
> Yes, we've known that some sort of register cache here would be useful
> for some time, but haven't got to implementing it.
I can try some proof of concept code, just to prove its usefulness.

>> Also trying to idle PER while transfer is ongoing (as reported in
>> previous mail) doesn't sound like a good idea and is one of the
>> reasons for slowdown. Seems to also causing more current drain,
>> ironically.
>
> Agreed.  Again, using the function_graph tracer, I get some pretty big
> latencies from the GPIO pre/post idling process:
>
>  - gpio_prepare_for_idle(): 2400+ us at 600MHz (8200+ us at 125MHz)
>  - gpio_resume_from_idle(): 2200+ us at 600MHz (7600+ us at 125MHz)
>
> Removing PER transtions as you did will get rid of those.
>
> I'm looking into this in more detail know, and will likely have a few
> patches for you to experiment with.
>
> Thanks again for digging into this with us,
>
> Kevin
>

Any thoughts?

Regards,
Jean


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-24  9:50           ` Jean Pihet
@ 2012-04-24 10:38             ` Santosh Shilimkar
  2012-04-24 12:21               ` Tero Kristo
  2012-04-24 14:29             ` Kevin Hilman
  1 sibling, 1 reply; 36+ messages in thread
From: Santosh Shilimkar @ 2012-04-24 10:38 UTC (permalink / raw)
  To: Jean Pihet
  Cc: Kevin Hilman, Grazvydas Ignotas, linux-omap, Paul Walmsley, Tero Kristo

+ Tero

On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote:
> Hi Grazvydas, Kevin,
> 
> I did some gather some performance measurements and statistics using
> custom tracepoints in __omap3_enter_idle.
> All the details are at
> http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
> .
> 
Nice data.

> The setup is:
> - Beagleboard (OMAP3530) at 500MHz,
> - l-o master kernel + functional power states + per-device PM QoS. It
> has been checked that the changes from l-o master do not have an
> impact on the performance.
> - The data transfer is performed using dd from a file in JFFS2 to
> /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.
> 
> On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman <khilman@ti.com> wrote:
>> Grazvydas Ignotas <notasas@gmail.com> writes:
>>
>>> On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman <khilman@ti.com> wrote:
>>>> It would be helpful now to narrow down what are the big contributors to
>>>> the overhead in omap_sram_idle().  Most of the code there is skipped for
>>>> C1 because the next states for MPU and CORE are both ON.
>>>
>>> Ok I did some tests, all in mostly idle system with just init, busybox
>>> shell and dd doing a NAND read to /dev/null .
>>
> ...
>>
>>> MB/s is throughput that
>>> dd reports, mA and approx. current draw during the transfer, read from
>>> fuel gauge that's onboard.
>>>
>>> MB/s| mA|comment
>>>  3.7|218|mainline f549e088b80
>>>  3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
>>>  4.4|220|[1] + pwrdm_p*_transition commented [2]
>>>  3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
>>>  4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
>>>  4.0|224|[1] + 'Deny idle' [5]
>>>  5.1|210|[2] + [4] + [5]
>>>  5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
>>>  5.5|243|!CONFIG_PM
>>>  6.1|282|busywait DMA end (for reference)
> 
> Here are the results (BW in MB/s) on Beagleboard:
> - 4.7: without using DMA,
> 
> - Using DMA
>   2.1: [0]
>   2.1: [1] only C1
>   2.6: [1]+[2] no pre_ post_
>   2.3: [1]+[5] no pwrdm_for_each_clkdm
>   2.8: [1]+[5]+[2]
>   3.1: [1]+[5]+[6] no omap_sram_idle
>   3.1: No IDLE, no omap_sram_idle, all pwrdms to ON
> 
> So indeed this shows there is some serious performance issue with the
> C1 C-state.
>
Looks like the other clock domains (notably l4, per, AON) should be denied
idle in C1 to avoid the huge penalties. It might just do the trick.


>> Thanks for the detailed experiments.  This definitely confirms we have
>> some serious unwanted overhead for C1, and our C-state latency values
>> are clearly way off base, since they only account HW latency and not any
>> of the SW latency introduced in omap_sram_idle().
>>
>>>> There are 2 primary differences that I see as possible causes.  I list
>>>> them here with a couple more experiments for you to try to help us
>>>> narrow this down.
>>>>
>>>> 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()
>>>>
>>>> Could you try using omap_sram_idle() and just commenting out those
>>>> calls?  Does that help performance?  Those iterate over all the
>>>> powerdomains, so defintely add some overhead, but I don't think it
>>>> would be as significant as what you're seeing.
>>>
>>> Seems to be taking good part of it.
>>>
>>>>    Much more likely is...
>>>>
>>>> 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds
>>>
>>> Could not notice any difference.
>>>
>>> To me it looks like this results from many small things adding up..
>>> Idle is called so often that pwrdm_p*_transition() and those
>>> pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
>>> because they access lots of registers on slow buses?
> 
> From the list of contributors, the main ones are:
>     (140us) pwrdm_pre_transition and pwrdm_post_transition,

I have observed this one on OMAP4 too. There was a plan to remove
this as part of Tero's PD/CD use-counting series.

>     (105us) omap2_gpio_prepare_for_idle and
> omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
> the latency-critical C-states,
Yes. In C1, when you deny idle for per, there should be no need to
call this. But even when it is called, why is it taking
105 us? This needs further digging.

>     (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
Depending on OPP, a PRCM read can take up to ~12-14 us, so the above
shouldn't be surprising.

>     (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
This is again dominated by PRCM read

>     (11 us) clkdm_allow_idle(mpu). Is this needed?
> 
I guess yes, otherwise the MPU CD can't idle when C2+ is attempted.

> Here are a few questions and suggestions:
> - In case of latency critical C-states could the high-latency code be
> bypassed in favor of a much simpler version? Pushing the concept a bit
> farther one could have a C1 state that just relaxes the cpu (no WFI),
> a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
> rest of the C-states as we have today,
We should do that. In fact the C1 state should be as light as possible,
like WFI or so.

> - Is it needed to iterate through all the power and clock domains in
> order to keep them active?
That iteration should be removed.

> - Trying to idle some non related power domains (e.g. PER) causes a
> performance hit. How to link all the power domains states to the
> cpuidle C-state? The per-device PM QoS framework could be used to
> constraint some power domains, but this is highly dependent on the use
> case.
>
Note that just limiting PER PD state to ON is not going to
solve the penalty. You need to avoid per CD transition and
hence deny idle. I remember Nokia team did this on some
products.


>> Yes PRCM register accesses are unfortunately rather slow, and we've
>> known that for some time, but haven't done any detailed analysis of the
>> overhead.
> That would be worth doing the analysis. A lot of read accesses to the
> current, next and previous power states are performed in the idle
> code.
> 
>> Using the function_graph tracer, I was able to see that the pre/post
>> transition are taking an enormous amount of time:
>>
>>  - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
>>  - pwrdm post-transtion: 1600+ us at 600MHz (6000+ us at 125MHz)
>>
>> Notice the big difference between 600MHz OPP and 125MHz OPP.  Are you
>> using CPUfreq at all in your tests?  If using cpufreq + ondemand
>> governor, you're probably running at low OPP due to lack of CPU activity
>> which will also affect the latencies in the idle path.
>>
>>> Maybe some register cache would help us there, or are those registers
>>> expected to be changed by hardware often?
>>
>> Yes, we've known that some sort of register cache here would be useful
>> for some time, but haven't got to implementing it.
> I can try some proof of concept code, just to prove its usefulness.
> 
Please do so. We were hoping that after Tero's series we wouldn't need
this pre/post stuff, but I am not sure if Tero is addressing that.

Register cache initiative is most welcome.

Regards
Santosh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-24 10:38             ` Santosh Shilimkar
@ 2012-04-24 12:21               ` Tero Kristo
  2012-04-24 12:50                 ` Jean Pihet
  0 siblings, 1 reply; 36+ messages in thread
From: Tero Kristo @ 2012-04-24 12:21 UTC (permalink / raw)
  To: Santosh Shilimkar
  Cc: Jean Pihet, Kevin Hilman, Grazvydas Ignotas, linux-omap, Paul Walmsley

On Tue, 2012-04-24 at 16:08 +0530, Santosh Shilimkar wrote:
> + Tero
> 
> On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote:
> > Hi Grazvydas, Kevin,
> > 
> > I did some gather some performance measurements and statistics using
> > custom tracepoints in __omap3_enter_idle.
> > All the details are at
> > http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
> > .
> > 
> Nice data.
> 
> > The setup is:
> > - Beagleboard (OMAP3530) at 500MHz,
> > - l-o master kernel + functional power states + per-device PM QoS. It
> > has been checked that the changes from l-o master do not have an
> > impact on the performance.
> > - The data transfer is performed using dd from a file in JFFS2 to
> > /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.

Question: what is used for gathering the latency values?

> > 
> > On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman <khilman@ti.com> wrote:
> >> Grazvydas Ignotas <notasas@gmail.com> writes:
> >>
> >>> On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman <khilman@ti.com> wrote:
> >>>> It would be helpful now to narrow down what are the big contributors to
> >>>> the overhead in omap_sram_idle().  Most of the code there is skipped for
> >>>> C1 because the next states for MPU and CORE are both ON.
> >>>
> >>> Ok I did some tests, all in mostly idle system with just init, busybox
> >>> shell and dd doing a NAND read to /dev/null .
> >>
> > ...
> >>
> >>> MB/s is throughput that
> >>> dd reports, mA and approx. current draw during the transfer, read from
> >>> fuel gauge that's onboard.
> >>>
> >>> MB/s| mA|comment
> >>>  3.7|218|mainline f549e088b80
> >>>  3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
> >>>  4.4|220|[1] + pwrdm_p*_transition commented [2]
> >>>  3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
> >>>  4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
> >>>  4.0|224|[1] + 'Deny idle' [5]
> >>>  5.1|210|[2] + [4] + [5]
> >>>  5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
> >>>  5.5|243|!CONFIG_PM
> >>>  6.1|282|busywait DMA end (for reference)
> > 
> > Here are the results (BW in MB/s) on Beagleboard:
> > - 4.7: without using DMA,
> > 
> > - Using DMA
> >   2.1: [0]
> >   2.1: [1] only C1
> >   2.6: [1]+[2] no pre_ post_
> >   2.3: [1]+[5] no pwrdm_for_each_clkdm
> >   2.8: [1]+[5]+[2]
> >   3.1: [1]+[5]+[6] no omap_sram_idle
> >   3.1: No IDLE, no omap_sram_idle, all pwrdms to ON
> > 
> > So indeed this shows there is some serious performance issue with the
> > C1 C-state.
> >
> Looks like other clock-domain (notably l4, per, AON) should be denied
> idle in C1 to avoid the huge penalties. It might just do the trick.
> 
> 
> >> Thanks for the detailed experiments.  This definitely confirms we have
> >> some serious unwanted overhead for C1, and our C-state latency values
> >> are clearly way off base, since they only account HW latency and not any
> >> of the SW latency introduced in omap_sram_idle().
> >>
> >>>> There are 2 primary differences that I see as possible causes.  I list
> >>>> them here with a couple more experiments for you to try to help us
> >>>> narrow this down.
> >>>>
> >>>> 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()
> >>>>
> >>>> Could you try using omap_sram_idle() and just commenting out those
> >>>> calls?  Does that help performance?  Those iterate over all the
> >>>> powerdomains, so defintely add some overhead, but I don't think it
> >>>> would be as significant as what you're seeing.
> >>>
> >>> Seems to be taking good part of it.
> >>>
> >>>>    Much more likely is...
> >>>>
> >>>> 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds
> >>>
> >>> Could not notice any difference.
> >>>
> >>> To me it looks like this results from many small things adding up..
> >>> Idle is called so often that pwrdm_p*_transition() and those
> >>> pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
> >>> because they access lots of registers on slow buses?
> > 
> > From the list of contributors, the main ones are:
> >     (140us) pwrdm_pre_transition and pwrdm_post_transition,
> 
> I have observed this one on OMAP4 too. There was a plan to remove
> this as part of Tero's PD/CD use-counting series.

pwrdm_pre / post transitions could be optimized a bit already now. They
only should need to be called for mpu / core and per domains, but
currently they scan through everything.

> 
> >     (105us) omap2_gpio_prepare_for_idle and
> > omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
> > the latency-critical C-states,
> Yes. In C1 when you deny idle for per, there should be no need to
> call this. But even in the case when it is called, why is it taking
> 105 uS. Needs to dig further.
> 
> >     (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
> Depending on OPP, a PRCM read can take upto ~12-14 uS, so above
> shouldn't be surprising.
> 
> >     (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
> This is again dominated by PRCM read
> 
> >     (11 us) clkdm_allow_idle(mpu). Is this needed?
> > 
> I guess yes other wise when C2+ is attempted MPU CD can't idle.
> 
> > Here are a few questions and suggestions:
> > - In case of latency critical C-states could the high-latency code be
> > bypassed in favor of a much simpler version? Pushing the concept a bit
> > farther one could have a C1 state that just relaxes the cpu (no WFI),
> > a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
> > rest of the C-states as we have today,
> We should do that. Infact C1 state should be as lite as possible like
> WFI or so.
> 
> > - Is it needed to iterate through all the power and clock domains in
> > order to keep them active?
> That iteration should be removed.
> 
> > - Trying to idle some non related power domains (e.g. PER) causes a
> > performance hit. How to link all the power domains states to the
> > cpuidle C-state? The per-device PM QoS framework could be used to
> > constraint some power domains, but this is highly dependent on the use
> > case.
> >
> Note that just limiting PER PD state to ON is not going to
> solve the penalty. You need to avoid per CD transition and
> hence deny idle. I remember Nokia team did this on some
> products.

n9 kernel (which is available here
http://harmattan-dev.nokia.com/pool/harmattan/free/k/kernel/) contained
a lot of optimizations in the idle path. Maybe someone should take a
look at this at some point.

> 
> 
> >> Yes PRCM register accesses are unfortunately rather slow, and we've
> >> known that for some time, but haven't done any detailed analysis of the
> >> overhead.
> > That would be worth doing the analysis. A lot of read accesses to the
> > current, next and previous power states are performed in the idle
> > code.
> > 
> >> Using the function_graph tracer, I was able to see that the pre/post
> >> transition are taking an enormous amount of time:
> >>
> >>  - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
> >>  - pwrdm post-transtion: 1600+ us at 600MHz (6000+ us at 125MHz)
> >>
> >> Notice the big difference between 600MHz OPP and 125MHz OPP.  Are you
> >> using CPUfreq at all in your tests?  If using cpufreq + ondemand
> >> governor, you're probably running at low OPP due to lack of CPU activity
> >> which will also affect the latencies in the idle path.
> >>
> >>> Maybe some register cache would help us there, or are those registers
> >>> expected to be changed by hardware often?
> >>
> >> Yes, we've known that some sort of register cache here would be useful
> >> for some time, but haven't got to implementing it.
> > I can try some proof of concept code, just to prove its usefulness.
> > 
> Please do so. We were hoping that after Tero's series, we don't need
> this pre/post stuff but am not sure if Tero is addressing that.
> 
> Register cache initiative is most welcome.
> 
> Regards
> Santosh



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-24 12:21               ` Tero Kristo
@ 2012-04-24 12:50                 ` Jean Pihet
  2012-04-24 13:04                   ` Tero Kristo
  0 siblings, 1 reply; 36+ messages in thread
From: Jean Pihet @ 2012-04-24 12:50 UTC (permalink / raw)
  To: t-kristo
  Cc: Santosh Shilimkar, Kevin Hilman, Grazvydas Ignotas, linux-omap,
	Paul Walmsley

Hi Tero,

On Tue, Apr 24, 2012 at 2:21 PM, Tero Kristo <t-kristo@ti.com> wrote:
> On Tue, 2012-04-24 at 16:08 +0530, Santosh Shilimkar wrote:
>> + Tero
>>
>> On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote:
>> > Hi Grazvydas, Kevin,
>> >
>> > I did some gather some performance measurements and statistics using
>> > custom tracepoints in __omap3_enter_idle.
>> > All the details are at
>> > http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
>> > .
>> >
>> Nice data.
>>
>> > The setup is:
>> > - Beagleboard (OMAP3530) at 500MHz,
>> > - l-o master kernel + functional power states + per-device PM QoS. It
>> > has been checked that the changes from l-o master do not have an
>> > impact on the performance.
>> > - The data transfer is performed using dd from a file in JFFS2 to
>> > /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.
>
> Question: what is used for gathering the latency values?
I used ftrace tracepoints, which are supposed to be low-overhead. I
checked that the overhead cannot be measured on the measurement
interval (>400us), given the fact that the time base is 31us (32 kHz
clock).
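
As a quick check of those numbers (a sketch only, assuming the nominal
32768 Hz rate for the 32 kHz clock; the function names are made up for
illustration): one tick is about 30.5us, so one tick of uncertainty on a
400us interval stays below 8%.

```c
#include <assert.h>
#include <stdint.h>

/* Period of one clock tick in nanoseconds (truncated). */
static uint64_t tick_period_ns(uint32_t hz)
{
	assert(hz != 0);
	return 1000000000ULL / hz;
}

/*
 * Relative uncertainty, in percent, of timing an interval of
 * interval_us when the time base resolves only one full tick.
 */
static double resolution_error_pct(uint32_t hz, uint32_t interval_us)
{
	assert(interval_us != 0);
	return 100.0 * ((double)tick_period_ns(hz) / 1000.0) / interval_us;
}
```

Anything much shorter than one tick, such as the tracepoint overhead
itself, cannot be resolved with this time base.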

>> >
>> > On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman <khilman@ti.com> wrote:
>> >> Grazvydas Ignotas <notasas@gmail.com> writes:
>> >>
>> >>> On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman <khilman@ti.com> wrote:
>> >>>> It would be helpful now to narrow down what are the big contributors to
>> >>>> the overhead in omap_sram_idle().  Most of the code there is skipped for
>> >>>> C1 because the next states for MPU and CORE are both ON.
>> >>>
>> >>> Ok I did some tests, all in mostly idle system with just init, busybox
>> >>> shell and dd doing a NAND read to /dev/null .
>> >>
>> > ...
>> >>
>> >>> MB/s is throughput that
>> >>> dd reports, mA and approx. current draw during the transfer, read from
>> >>> fuel gauge that's onboard.
>> >>>
>> >>> MB/s| mA|comment
>> >>>  3.7|218|mainline f549e088b80
>> >>>  3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
>> >>>  4.4|220|[1] + pwrdm_p*_transition commented [2]
>> >>>  3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
>> >>>  4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
>> >>>  4.0|224|[1] + 'Deny idle' [5]
>> >>>  5.1|210|[2] + [4] + [5]
>> >>>  5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
>> >>>  5.5|243|!CONFIG_PM
>> >>>  6.1|282|busywait DMA end (for reference)
>> >
>> > Here are the results (BW in MB/s) on Beagleboard:
>> > - 4.7: without using DMA,
>> >
>> > - Using DMA
>> >   2.1: [0]
>> >   2.1: [1] only C1
>> >   2.6: [1]+[2] no pre_ post_
>> >   2.3: [1]+[5] no pwrdm_for_each_clkdm
>> >   2.8: [1]+[5]+[2]
>> >   3.1: [1]+[5]+[6] no omap_sram_idle
>> >   3.1: No IDLE, no omap_sram_idle, all pwrdms to ON
>> >
>> > So indeed this shows there is some serious performance issue with the
>> > C1 C-state.
>> >
>> Looks like other clock-domain (notably l4, per, AON) should be denied
>> idle in C1 to avoid the huge penalties. It might just do the trick.
>>
>>
>> >> Thanks for the detailed experiments.  This definitely confirms we have
>> >> some serious unwanted overhead for C1, and our C-state latency values
>> >> are clearly way off base, since they only account HW latency and not any
>> >> of the SW latency introduced in omap_sram_idle().
>> >>
>> >>>> There are 2 primary differences that I see as possible causes.  I list
>> >>>> them here with a couple more experiments for you to try to help us
>> >>>> narrow this down.
>> >>>>
>> >>>> 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()
>> >>>>
>> >>>> Could you try using omap_sram_idle() and just commenting out those
>> >>>> calls?  Does that help performance?  Those iterate over all the
>> >>>> powerdomains, so defintely add some overhead, but I don't think it
>> >>>> would be as significant as what you're seeing.
>> >>>
>> >>> Seems to be taking good part of it.
>> >>>
>> >>>>    Much more likely is...
>> >>>>
>> >>>> 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds
>> >>>
>> >>> Could not notice any difference.
>> >>>
>> >>> To me it looks like this results from many small things adding up..
>> >>> Idle is called so often that pwrdm_p*_transition() and those
>> >>> pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
>> >>> because they access lots of registers on slow buses?
>> >
>> > From the list of contributors, the main ones are:
>> >     (140us) pwrdm_pre_transition and pwrdm_post_transition,
>>
>> I have observed this one on OMAP4 too. There was a plan to remove
>> this as part of Tero's PD/CD use-counting series.
>
> pwrdm_pre / post transitions could be optimized a bit already now. They
> only should need to be called for mpu / core and per domains, but
> currently they scan through everything.
>
>>
>> >     (105us) omap2_gpio_prepare_for_idle and
>> > omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
>> > the latency-critical C-states,
>> Yes. In C1 when you deny idle for per, there should be no need to
>> call this. But even in the case when it is called, why is it taking
>> 105 uS. Needs to dig further.
>>
>> >     (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
>> Depending on OPP, a PRCM read can take upto ~12-14 uS, so above
>> shouldn't be surprising.
>>
>> >     (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
>> This is again dominated by PRCM read
>>
>> >     (11 us) clkdm_allow_idle(mpu). Is this needed?
>> >
>> I guess yes other wise when C2+ is attempted MPU CD can't idle.
>>
>> > Here are a few questions and suggestions:
>> > - In case of latency critical C-states could the high-latency code be
>> > bypassed in favor of a much simpler version? Pushing the concept a bit
>> > farther one could have a C1 state that just relaxes the cpu (no WFI),
>> > a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
>> > rest of the C-states as we have today,
>> We should do that. Infact C1 state should be as lite as possible like
>> WFI or so.
>>
>> > - Is it needed to iterate through all the power and clock domains in
>> > order to keep them active?
>> That iteration should be removed.
>>
>> > - Trying to idle some non related power domains (e.g. PER) causes a
>> > performance hit. How to link all the power domains states to the
>> > cpuidle C-state? The per-device PM QoS framework could be used to
>> > constraint some power domains, but this is highly dependent on the use
>> > case.
>> >
>> Note that just limiting PER PD state to ON is not going to
>> solve the penalty. You need to avoid per CD transition and
>> hence deny idle. I remember Nokia team did this on some
>> products.
>
> n9 kernel (which is available here
> http://harmattan-dev.nokia.com/pool/harmattan/free/k/kernel/) contained
> a lot of optimizations in the idle path. Maybe someone should take a
> look at this at some point.
Ok, thanks for the link.

>
>>
>>
>> >> Yes PRCM register accesses are unfortunately rather slow, and we've
>> >> known that for some time, but haven't done any detailed analysis of the
>> >> overhead.
>> > That would be worth doing the analysis. A lot of read accesses to the
>> > current, next and previous power states are performed in the idle
>> > code.
>> >
>> >> Using the function_graph tracer, I was able to see that the pre/post
>> >> transition are taking an enormous amount of time:
>> >>
>> >>  - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
>> >>  - pwrdm post-transtion: 1600+ us at 600MHz (6000+ us at 125MHz)
>> >>
>> >> Notice the big difference between 600MHz OPP and 125MHz OPP.  Are you
>> >> using CPUfreq at all in your tests?  If using cpufreq + ondemand
>> >> governor, you're probably running at low OPP due to lack of CPU activity
>> >> which will also affect the latencies in the idle path.
>> >>
>> >>> Maybe some register cache would help us there, or are those registers
>> >>> expected to be changed by hardware often?
>> >>
>> >> Yes, we've known that some sort of register cache here would be useful
>> >> for some time, but haven't got to implementing it.
>> > I can try some proof of concept code, just to prove its usefulness.
>> >
>> Please do so. We were hoping that after Tero's series, we don't need
>> this pre/post stuff but am not sure if Tero is addressing that.
>>
>> Register cache initiative is most welcome.
>>
>> Regards
>> Santosh
>
>

Regards,
Jean

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-24 12:50                 ` Jean Pihet
@ 2012-04-24 13:04                   ` Tero Kristo
  0 siblings, 0 replies; 36+ messages in thread
From: Tero Kristo @ 2012-04-24 13:04 UTC (permalink / raw)
  To: Jean Pihet
  Cc: Santosh Shilimkar, Kevin Hilman, Grazvydas Ignotas, linux-omap,
	Paul Walmsley

On Tue, 2012-04-24 at 14:50 +0200, Jean Pihet wrote:
> Hi Tero,
> 
> On Tue, Apr 24, 2012 at 2:21 PM, Tero Kristo <t-kristo@ti.com> wrote:
> > On Tue, 2012-04-24 at 16:08 +0530, Santosh Shilimkar wrote:
> >> + Tero
> >>
> >> On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote:
> >> > Hi Grazvydas, Kevin,
> >> >
> >> > I did some gather some performance measurements and statistics using
> >> > custom tracepoints in __omap3_enter_idle.
> >> > All the details are at
> >> > http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
> >> > .
> >> >
> >> Nice data.
> >>
> >> > The setup is:
> >> > - Beagleboard (OMAP3530) at 500MHz,
> >> > - l-o master kernel + functional power states + per-device PM QoS. It
> >> > has been checked that the changes from l-o master do not have an
> >> > impact on the performance.
> >> > - The data transfer is performed using dd from a file in JFFS2 to
> >> > /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.
> >
> > Question: what is used for gathering the latency values?
> I used ftrace tracepoints, which are supposed to be low-overhead. I
> checked that the overhead cannot be measured on the measurement
> interval (>400us), given the fact that the time base is 31us (32 kHz
> clock).

If you want to get accurate measurements, you could use ARM performance
counters, namely the cycle counter. I have a couple of patches for that
purpose I've used if you are interested.
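
For illustration, a minimal sketch of what cycle-counter based timing
could look like. This is not the patches mentioned above; it assumes an
ARMv7 core on which the PMU cycle counter (PMCCNTR) has already been
enabled, and the helper names are hypothetical. On other architectures
the read is stubbed out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Read the ARMv7 PMU cycle counter (PMCCNTR, CP15 c9/c13/0).
 * Assumption: the counter was enabled beforehand and the caller is
 * allowed to read it. On non-ARM builds this stub returns 0.
 */
static inline uint32_t read_cycle_counter(void)
{
#if defined(__arm__)
	uint32_t v;

	__asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));
	return v;
#else
	return 0;
#endif
}

/* Elapsed cycles; unsigned arithmetic absorbs a single wraparound. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
	return end - start;
}

/* Convert a cycle count to nanoseconds for a CPU clock given in MHz. */
static uint64_t cycles_to_ns(uint32_t cycles, uint32_t cpu_mhz)
{
	assert(cpu_mhz != 0);
	return (uint64_t)cycles * 1000ULL / cpu_mhz;
}
```

At 500MHz one cycle is 2ns, compared with roughly 31us per tick of the
32 kHz time base.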

-Tero


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-24  9:50           ` Jean Pihet
  2012-04-24 10:38             ` Santosh Shilimkar
@ 2012-04-24 14:29             ` Kevin Hilman
  2012-05-01 14:10               ` Jean Pihet
  1 sibling, 1 reply; 36+ messages in thread
From: Kevin Hilman @ 2012-04-24 14:29 UTC (permalink / raw)
  To: Jean Pihet; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley

Jean Pihet <jean.pihet@newoldbits.com> writes:

> Hi Grazvydas, Kevin,
>
> I did some gather some performance measurements and statistics using
> custom tracepoints in __omap3_enter_idle.
> All the details are at
> http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
> .

This is great, thanks.

[...]

> Here are the results (BW in MB/s) on Beagleboard:
> - 4.7: without using DMA,
>
> - Using DMA
>   2.1: [0]
>   2.1: [1] only C1
>   2.6: [1]+[2] no pre_ post_
>   2.3: [1]+[5] no pwrdm_for_each_clkdm
>   2.8: [1]+[5]+[2]
>   3.1: [1]+[5]+[6] no omap_sram_idle
>   3.1: No IDLE, no omap_sram_idle, all pwrdms to ON
>
> So indeed this shows there is some serious performance issue with the
> C1 C-state.

Yes, this confirms what both Grazvydas and I are seeing as well.

[...]

> From the list of contributors, the main ones are:
>     (140us) pwrdm_pre_transition and pwrdm_post_transition,

See the series I just posted to address this one:
[PATCH/RFT 0/3] ARM: OMAP: PM: reduce overhead of pwrdm pre/post transitions

>     (105us) omap2_gpio_prepare_for_idle and
> omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
> the latency-critical C-states,
>     (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
>     (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
>     (11 us) clkdm_allow_idle(mpu). Is this needed?

In that same series, I removed this as it appears to be a remnant of a
code move (c.f. patch 3 in above series.)

> Here are a few questions and suggestions:
> - In case of latency critical C-states could the high-latency code be
> bypassed in favor of a much simpler version? Pushing the concept a bit
> farther one could have a C1 state that just relaxes the cpu (no WFI),
> a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
> rest of the C-states as we have today,

I was thinking a "WFI only" state, with *all* powerdomains staying on is
probably sufficient for C1.  Do you see the enter/exit latency from that
as even being too high?

> - Is it needed to iterate through all the power and clock domains in
> order to keep them active?

No.  My series above starts to address this, but I think Tero's
use-counting series is the final solution since this should really be
done when we know the powerdomains are transitioning.

Kevin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-04-24 14:29             ` Kevin Hilman
@ 2012-05-01 14:10               ` Jean Pihet
  2012-05-01 17:27                 ` Kevin Hilman
  0 siblings, 1 reply; 36+ messages in thread
From: Jean Pihet @ 2012-05-01 14:10 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley

Hi Kevin, Grazvydas,

On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman <khilman@ti.com> wrote:
> Jean Pihet <jean.pihet@newoldbits.com> writes:
>
>> Hi Grazvydas, Kevin,
>>
>> I did some gather some performance measurements and statistics using
>> custom tracepoints in __omap3_enter_idle.
I posted the patches for the power domains registers cache, cf.
http://marc.info/?l=linux-omap&m=133587781712039&w=2.

>> All the details are at
>> http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
I updated the page with the measurements results with Kevin's patches
and the registers cache patches.

The results are showing that:
- the registers cache optimizes the low power mode transitions, but is
not sufficient to obtain a big gain. A few unused domains are
transitioning, which causes a big penalty in the idle path.
- khilman's optimizations are really helpful. Furthermore, they further
optimize the register cache statistics accesses.
- the average time in idle now drops to 246us, which is still very
large for a CPU-intensive C-state. For reference, with PM disabled
the average time in idle is 113us.
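
To illustrate the register cache idea (this is not the posted patch
series, just a hypothetical sketch with made-up names, where a counter
stands in for the slow PRCM access): a value is read from the hardware
once, later reads are served from memory, and the cache is invalidated
whenever the hardware may have changed the registers, e.g. after a
power state transition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CACHED_REGS 4	/* hypothetical number of cached registers */

struct reg_cache {
	uint32_t val[NR_CACHED_REGS];
	bool valid[NR_CACHED_REGS];
	unsigned int slow_reads;	/* counts simulated hardware reads */
};

/* Stand-in for a slow PRCM register read; returns a dummy value. */
static uint32_t slow_read(struct reg_cache *c, unsigned int idx)
{
	c->slow_reads++;
	return 0x100u + idx;
}

/* Return the cached value, going to the "hardware" only on a miss. */
static uint32_t cached_read(struct reg_cache *c, unsigned int idx)
{
	assert(idx < NR_CACHED_REGS);
	if (!c->valid[idx]) {
		c->val[idx] = slow_read(c, idx);
		c->valid[idx] = true;
	}
	return c->val[idx];
}

/* Drop all cached values, e.g. after the hardware changed state. */
static void cache_invalidate(struct reg_cache *c)
{
	unsigned int i;

	for (i = 0; i < NR_CACHED_REGS; i++)
		c->valid[i] = false;
}
```

Two reads of the same register between invalidations cost a single slow
access in this model.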

Regards,
Jean

>> .
>
> This is great, thanks.
>
> [...]
>
>> Here are the results (BW in MB/s) on Beagleboard:
>> - 4.7: without using DMA,
>>
>> - Using DMA
>>   2.1: [0]
>>   2.1: [1] only C1
>>   2.6: [1]+[2] no pre_ post_
>>   2.3: [1]+[5] no pwrdm_for_each_clkdm
>>   2.8: [1]+[5]+[2]
>>   3.1: [1]+[5]+[6] no omap_sram_idle
>>   3.1: No IDLE, no omap_sram_idle, all pwrdms to ON
>>
>> So indeed this shows there is some serious performance issue with the
>> C1 C-state.
>
> Yes, this confirms what both Grazvydas and I are seeing as well.
>
> [...]
>
>> From the list of contributors, the main ones are:
>>     (140us) pwrdm_pre_transition and pwrdm_post_transition,
>
> See the series I just posted to address this one:
> [PATCH/RFT 0/3] ARM: OMAP: PM: reduce overhead of pwrdm pre/post transitions
>
>>     (105us) omap2_gpio_prepare_for_idle and
>> omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
>> the latency-critical C-states,
>>     (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
>>     (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
>>     (11 us) clkdm_allow_idle(mpu). Is this needed?
>
> In that same series, I removed this as it appears to be a remnant of a
> code move (c.f. patch 3 in above series.)
>
>> Here are a few questions and suggestions:
>> - In case of latency critical C-states could the high-latency code be
>> bypassed in favor of a much simpler version? Pushing the concept a bit
>> farther one could have a C1 state that just relaxes the cpu (no WFI),
>> a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
>> rest of the C-states as we have today,
>
> I was thinking a "WFI only" state, with *all* powerdomains staying on is
> probably sufficient for C1.  Do you see the enter/exit latency from that
> as even being too high?
>
>> - Is it needed to iterate through all the power and clock domains in
>> order to keep them active?
>
> No.  My series above starts to address this, but I think Tero's
> use-counting series is the final solution since this should really be
> done when we know the powerdomains are transitioning.
>
> Kevin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-05-01 14:10               ` Jean Pihet
@ 2012-05-01 17:27                 ` Kevin Hilman
  2012-05-02  5:59                   ` Paul Walmsley
  2012-05-02 19:46                   ` Jean Pihet
  0 siblings, 2 replies; 36+ messages in thread
From: Kevin Hilman @ 2012-05-01 17:27 UTC (permalink / raw)
  To: Jean Pihet; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley

Jean Pihet <jean.pihet@newoldbits.com> writes:

> Hi Kevin, Grazvydas,
>
> On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman <khilman@ti.com> wrote:
>> Jean Pihet <jean.pihet@newoldbits.com> writes:
>>
>>> Hi Grazvydas, Kevin,
>>>
>>> I did some gather some performance measurements and statistics using
>>> custom tracepoints in __omap3_enter_idle.
> I posted the patches for the power domains registers cache, cf.
> http://marc.info/?l=linux-omap&m=133587781712039&w=2.
>
>>> All the details are at
>>> http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
> I updated the page with the measurements results with Kevin's patches
> and the registers cache patches.
>
> The results are showing that:
> - the registers cache optimizes the low power mode transitions, but is
> not sufficient to obtain a big gain. A few unused domains are
> transitioning, which causes a big penalty in the idle path.

PER is the one that seems to be causing the most latency.  

Can you try doing your measurements using the hack below, which makes
sure that PER isn't any deeper than CORE?

Kevin

From bb2f67ed93dc83c645080e293d315d383c23c0c6 Mon Sep 17 00:00:00 2001
From: Kevin Hilman <khilman@ti.com>
Date: Mon, 16 Apr 2012 17:53:14 -0700
Subject: [PATCH] cpuidle34xx: per follows core, C1 use _bm

---
 arch/arm/mach-omap2/cpuidle34xx.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index 374708d..00400ad 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -278,9 +278,11 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
 	cx = cpuidle_get_statedata(&dev->states_usage[index]);
 	core_next_state = cx->core_state;
 	per_next_state = per_saved_state = pwrdm_read_next_pwrst(per_pd);
-	if ((per_next_state == PWRDM_POWER_OFF) &&
-	    (core_next_state > PWRDM_POWER_RET))
-		per_next_state = PWRDM_POWER_RET;
+	/* if ((per_next_state == PWRDM_POWER_OFF) && */
+	/*     (core_next_state > PWRDM_POWER_RET)) */
+	/* 	per_next_state = PWRDM_POWER_RET; */
+	if (per_next_state < core_next_state)
+		per_next_state = core_next_state;
 
 	/* Are we changing PER target state? */
 	if (per_next_state != per_saved_state)
@@ -374,7 +376,6 @@ int __init omap3_idle_init(void)
 
 	/* C1 . MPU WFI + Core active */
 	_fill_cstate(drv, 0, "MPU ON + CORE ON");
-	(&drv->states[0])->enter = omap3_enter_idle;
 	drv->safe_state_index = 0;
 	cx = _fill_cstate_usage(dev, 0);
 	cx->valid = 1;	/* C1 is always valid */
-- 
1.7.9.2
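
The "per follows core" rule in the hack above boils down to a clamp. A
standalone sketch, assuming the usual powerdomain encoding where a lower
value means a deeper state (OFF=0, RET=1, INACTIVE=2, ON=3); the enum
and function names here are illustrative only, not kernel API:

```c
#include <assert.h>

/* Illustrative power state encoding: lower value = deeper state. */
enum pwrst {
	PWRST_OFF = 0,
	PWRST_RET = 1,
	PWRST_INACTIVE = 2,
	PWRST_ON = 3,
};

/*
 * Return the PER target state, never deeper than the CORE target
 * state. A PER state that is already shallower is left alone.
 */
static int per_follows_core(int per_next, int core_next)
{
	assert(per_next >= PWRST_OFF && per_next <= PWRST_ON);
	assert(core_next >= PWRST_OFF && core_next <= PWRST_ON);

	return per_next < core_next ? core_next : per_next;
}
```

With CORE staying ON, as in C1, PER is forced to ON as well, so no PER
transition is attempted in that state.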


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-05-01 17:27                 ` Kevin Hilman
@ 2012-05-02  5:59                   ` Paul Walmsley
  2012-05-02 19:46                   ` Jean Pihet
  1 sibling, 0 replies; 36+ messages in thread
From: Paul Walmsley @ 2012-05-02  5:59 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: Jean Pihet, Grazvydas Ignotas, linux-omap

On Tue, 1 May 2012, Kevin Hilman wrote:

> PER is the one that seems to be causing the most latency.  
> 
> Can you try doing your measurements using the hack below, which makes
> sure that PER isn't any deeper than CORE?

It might be the relock time for DPLL4, the PER DPLL.  You might also 
try disabling DPLL4 autoidle for the shallow C-states...
 

- Paul

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PM related performance degradation on OMAP3
  2012-05-01 17:27                 ` Kevin Hilman
  2012-05-02  5:59                   ` Paul Walmsley
@ 2012-05-02 19:46                   ` Jean Pihet
  2012-05-07 17:31                     ` Kevin Hilman
  1 sibling, 1 reply; 36+ messages in thread
From: Jean Pihet @ 2012-05-02 19:46 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley

On Tue, May 1, 2012 at 7:27 PM, Kevin Hilman <khilman@ti.com> wrote:
> Jean Pihet <jean.pihet@newoldbits.com> writes:
>
>> Hi Kevin, Grazvydas,
>>
>> On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman <khilman@ti.com> wrote:
>>> Jean Pihet <jean.pihet@newoldbits.com> writes:
>>>
>>>> Hi Grazvydas, Kevin,
>>>>
>>>> I did some gather some performance measurements and statistics using
>>>> custom tracepoints in __omap3_enter_idle.
>> I posted the patches for the power domains registers cache, cf.
>> http://marc.info/?l=linux-omap&m=133587781712039&w=2.
>>
>>>> All the details are at
>>>> http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
>> I updated the page with the measurements results with Kevin's patches
>> and the registers cache patches.
>>
>> The results are showing that:
>> - the registers cache optimizes the low power mode transitions, but is
>> not sufficient to obtain a big gain. A few unused domains are
>> transitioning, which causes a big penalty in the idle path.
>
> PER is the one that seems to be causing the most latency.
>
> Can you try doing your measurements using the hack below, which makes
> sure that PER isn't any deeper than CORE?

Indeed your patch brings significant improvements, cf. wiki page at
http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
for detailed information.
Below is the reworked patch [1], more suited for inclusion in mainline.

I have another optimisation (still in proof-of-concept state) that brings
another significant improvement. It allows/denies idle for only one
clkdm in a pwrdm instead of iterating through all the clkdms.
This still needs some rework though. Cf. patch [2].

Patches [1] and [2], applied on top of the registers cache and the
optimisations in pre/post_transition, bring the performance close to
that of the non-cpuidle case (3.0MB/s compared to 3.1MB/s
on Beagleboard).

What do you think?

Regards,
Jean

---
[1]
diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index e406d7b..572b605 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -279,32 +279,36 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
	int ret;

	/*
-	 * Prevent idle completely if CAM is active.
+	 * Use only C1 if CAM is active.
	 * CAM does not have wakeup capability in OMAP3.
	 */
-	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON) {
+	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON)
		new_state_idx = drv->safe_state_index;
-		goto select_state;
-	}
-
-	new_state_idx = next_valid_state(dev, drv, index);
+	else
+		new_state_idx = next_valid_state(dev, drv, index);

-	/*
-	 * Prevent PER off if CORE is not in retention or off as this
-	 * would disable PER wakeups completely.
-	 */
+	/* Program PER state */
	cx = cpuidle_get_statedata(&dev->states_usage[new_state_idx]);
	core_next_state = cx->core_state;
-	per_next_state = per_saved_state = pwrdm_read_next_func_pwrst(per_pd);
-	if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
-	    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
-		per_next_state = PWRDM_FUNC_PWRST_CSWR;
+	if (new_state_idx == 0) {
+		/* In C1 do not allow PER state lower than CORE state */
+		per_next_state = core_next_state;
+	} else {
+		/*
+		 * Prevent PER off if CORE is not in RETention or OFF as this
+		 * would disable PER wakeups completely.
+		 */
+		per_next_state = per_saved_state =
+				pwrdm_read_next_func_pwrst(per_pd);
+		if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
+		    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
+			per_next_state = PWRDM_FUNC_PWRST_CSWR;
+	}

	/* Are we changing PER target state? */
	if (per_next_state != per_saved_state)
		omap_set_pwrdm_state(per_pd, per_next_state);

-select_state:
	ret = omap3_enter_idle(dev, drv, new_state_idx);

	/* Restore original PER state if it was modified */
@@ -390,7 +394,6 @@ int __init omap3_idle_init(void)

	/* C1 . MPU WFI + Core active */
	_fill_cstate(drv, 0, "MPU ON + CORE ON");
-	(&drv->states[0])->enter = omap3_enter_idle;
	drv->safe_state_index = 0;
	cx = _fill_cstate_usage(dev, 0);
	cx->valid = 1;	/* C1 is always valid */

[2]
diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
index e406d7b..6aa3c75 100644
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -118,8 +118,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,

 	/* Deny idle for C1 */
 	if (index == 0) {
-		pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
-		pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
+		//pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
+		clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
+		//pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
+		clkdm_deny_idle(core_pd->pwrdm_clkdms[0]);
 	}

 	/*
@@ -141,8 +143,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,

 	/* Re-allow idle for C1 */
 	if (index == 0) {
-		pwrdm_for_each_clkdm(mpu_pd, _cpuidle_allow_idle);
-		pwrdm_for_each_clkdm(core_pd, _cpuidle_allow_idle);
+		//pwrdm_for_each_clkdm(mpu_pd, _cpuidle_allow_idle);
+		clkdm_allow_idle(mpu_pd->pwrdm_clkdms[0]);
+		//pwrdm_for_each_clkdm(core_pd, _cpuidle_allow_idle);
+		clkdm_allow_idle(core_pd->pwrdm_clkdms[0]);
 	}

 return_sleep_time:

>
> Kevin
>
> From bb2f67ed93dc83c645080e293d315d383c23c0c6 Mon Sep 17 00:00:00 2001
> From: Kevin Hilman <khilman@ti.com>
> Date: Mon, 16 Apr 2012 17:53:14 -0700
> Subject: [PATCH] cpuidle34xx: per follows core, C1 use _bm
>
> ---
>  arch/arm/mach-omap2/cpuidle34xx.c |    9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
> index 374708d..00400ad 100644
> --- a/arch/arm/mach-omap2/cpuidle34xx.c
> +++ b/arch/arm/mach-omap2/cpuidle34xx.c
> @@ -278,9 +278,11 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
>        cx = cpuidle_get_statedata(&dev->states_usage[index]);
>        core_next_state = cx->core_state;
>        per_next_state = per_saved_state = pwrdm_read_next_pwrst(per_pd);
> -       if ((per_next_state == PWRDM_POWER_OFF) &&
> -           (core_next_state > PWRDM_POWER_RET))
> -               per_next_state = PWRDM_POWER_RET;
> +       /* if ((per_next_state == PWRDM_POWER_OFF) && */
> +       /*     (core_next_state > PWRDM_POWER_RET)) */
> +       /*      per_next_state = PWRDM_POWER_RET; */
> +       if (per_next_state < core_next_state)
> +               per_next_state = core_next_state;
>
>        /* Are we changing PER target state? */
>        if (per_next_state != per_saved_state)
> @@ -374,7 +376,6 @@ int __init omap3_idle_init(void)
>
>        /* C1 . MPU WFI + Core active */
>        _fill_cstate(drv, 0, "MPU ON + CORE ON");
> -       (&drv->states[0])->enter = omap3_enter_idle;
>        drv->safe_state_index = 0;
>        cx = _fill_cstate_usage(dev, 0);
>        cx->valid = 1;  /* C1 is always valid */
> --
> 1.7.9.2
>


* Re: PM related performance degradation on OMAP3
  2012-05-02 19:46                   ` Jean Pihet
@ 2012-05-07 17:31                     ` Kevin Hilman
  2012-05-09 11:00                       ` Jean Pihet
  0 siblings, 1 reply; 36+ messages in thread
From: Kevin Hilman @ 2012-05-07 17:31 UTC (permalink / raw)
  To: Jean Pihet; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley

Jean Pihet <jean.pihet@newoldbits.com> writes:

> On Tue, May 1, 2012 at 7:27 PM, Kevin Hilman <khilman@ti.com> wrote:
>> Jean Pihet <jean.pihet@newoldbits.com> writes:
>>
>>> HI Kevin, Grazvydas,
>>>
>>> On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman <khilman@ti.com> wrote:
>>>> Jean Pihet <jean.pihet@newoldbits.com> writes:
>>>>
>>>>> Hi Grazvydas, Kevin,
>>>>>
>>>>> I did some gather some performance measurements and statistics using
>>>>> custom tracepoints in __omap3_enter_idle.
>>> I posted the patches for the power domains registers cache, cf.
>>> http://marc.info/?l=linux-omap&m=133587781712039&w=2.
>>>
>>>>> All the details are at
>>>>> http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
>>> I updated the page with the measurements results with Kevin's patches
>>> and the registers cache patches.
>>>
>>> The results are showing that:
>>> - the registers cache optimizes the low power mode transitions, but is
>>> not sufficient to obtain a big gain. A few unused domains are
>>> transitioning, which causes a big penalty in the idle path.
>>
>> PER is the one that seems to be causing the most latency.
>>
>> Can you try do your measurements using hack below which makes sure that
>> PER isn't any deeper than CORE?
>
> Indeed your patch brings significant improvements, cf. wiki page at
> http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
> for detailed information.
> Here below is the reworked patch, more suited for inclusion in mainline [1]
>
> I have another optimisation -in proof of concept state- that brings
> another significant improvement. It is about allowing/disabling idle
> for only 1 clkdm in a pwrdm and not iterate through all the clkdms.
> This still needs some rework though. Cf. patch [2]

That should work, since disabling idle for any single clkdm will have the
same effect.  Can you send this as a separate patch with a descriptive
changelog?
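
The reasoning can be sketched with a tiny standalone model. It rests on
one stated assumption: a power domain can only leave ON when none of its
clock domains denies idle. The function name `model_pwrdm_can_idle()` and
the boolean-array representation are invented for this example and are
not kernel APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Assumption for this model: the power domain may idle only when no
 * clock domain in it denies idle. idle_denied[i] is true when clock
 * domain i has idle denied.
 */
static bool model_pwrdm_can_idle(const bool *idle_denied, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (idle_denied[i])
			return false;
	return true;
}
```

Under that assumption, denying idle on one clock domain blocks the power
domain exactly as denying it on all of them does, which is why touching
only the first clkdm is enough.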

Kevin


> Patches [1] and [2] on top of the registers cache and the
> optimisations in pre/post_transition bring the performance close to
> the performance for the non cpuidle case (3.0MB/s compared to 3.1MB/s
> on Beagleboard).
>
> What do you think?
>
> Regards,
> Jean
>
> ---
> [1]
> diff --git a/arch/arm/mach-omap2/cpuidle34xx.c
> b/arch/arm/mach-omap2/cpuidle34xx.c
> index e406d7b..572b605 100644
> +++ b/arch/arm/mach-omap2/cpuidle34xx.c
> @@ -279,32 +279,36 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
> 	int ret;
>
> 	/*
> -	 * Prevent idle completely if CAM is active.
> +	 * Use only C1 if CAM is active.
> 	 * CAM does not have wakeup capability in OMAP3.
> 	 */
> -	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON) {
> +	if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON)
> 		new_state_idx = drv->safe_state_index;
> -		goto select_state;
> -	}
> -
> -	new_state_idx = next_valid_state(dev, drv, index);
> +	else
> +		new_state_idx = next_valid_state(dev, drv, index);
>
> -	/*
> -	 * Prevent PER off if CORE is not in retention or off as this
> -	 * would disable PER wakeups completely.
> -	 */
> +	/* Program PER state */
> 	cx = cpuidle_get_statedata(&dev->states_usage[new_state_idx]);
> 	core_next_state = cx->core_state;
> -	per_next_state = per_saved_state = pwrdm_read_next_func_pwrst(per_pd);
> -	if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
> -	    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
> -		per_next_state = PWRDM_FUNC_PWRST_CSWR;
> +	if (new_state_idx == 0) {
> +		/* In C1 do not allow PER state lower than CORE state */
> +		per_next_state = core_next_state;
> +	} else {
> +		/*
> +		 * Prevent PER off if CORE is not in RETention or OFF as this
> +		 * would disable PER wakeups completely.
> +		 */
> +		per_next_state = per_saved_state =
> +				pwrdm_read_next_func_pwrst(per_pd);
> +		if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
> +		    (core_next_state > PWRDM_FUNC_PWRST_CSWR))
> +			per_next_state = PWRDM_FUNC_PWRST_CSWR;
> +	}
>
> 	/* Are we changing PER target state? */
> 	if (per_next_state != per_saved_state)
> 		omap_set_pwrdm_state(per_pd, per_next_state);
>
> -select_state:
> 	ret = omap3_enter_idle(dev, drv, new_state_idx);
>
> 	/* Restore original PER state if it was modified */
> @@ -390,7 +394,6 @@ int __init omap3_idle_init(void)
>
> 	/* C1 . MPU WFI + Core active */
> 	_fill_cstate(drv, 0, "MPU ON + CORE ON");
> -	(&drv->states[0])->enter = omap3_enter_idle;
> 	drv->safe_state_index = 0;
> 	cx = _fill_cstate_usage(dev, 0);
> 	cx->valid = 1;	/* C1 is always valid */
>
> [2]
> diff --git a/arch/arm/mach-omap2/cpuidle34xx.c
> b/arch/arm/mach-omap2/cpuidle34xx.c
> index e406d7b..6aa3c75 100644
> --- a/arch/arm/mach-omap2/cpuidle34xx.c
> +++ b/arch/arm/mach-omap2/cpuidle34xx.c
> @@ -118,8 +118,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
>
>  	/* Deny idle for C1 */
>  	if (index == 0) {
> -		pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
> -		pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
> +		//pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
> +		clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
> +		//pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
> +		clkdm_deny_idle(core_pd->pwrdm_clkdms[0]);
>  	}
>
>  	/*
> @@ -141,8 +143,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
>
>  	/* Re-allow idle for C1 */
>  	if (index == 0) {
> -		pwrdm_for_each_clkdm(mpu_pd, _cpuidle_allow_idle);
> -		pwrdm_for_each_clkdm(core_pd, _cpuidle_allow_idle);
> +		//pwrdm_for_each_clkdm(mpu_pd, _cpuidle_allow_idle);
> +		clkdm_allow_idle(mpu_pd->pwrdm_clkdms[0]);
> +		//pwrdm_for_each_clkdm(core_pd, _cpuidle_allow_idle);
> +		clkdm_allow_idle(core_pd->pwrdm_clkdms[0]);
>  	}
>
>  return_sleep_time:
>
>>
>> Kevin
>>
>> From bb2f67ed93dc83c645080e293d315d383c23c0c6 Mon Sep 17 00:00:00 2001
>> From: Kevin Hilman <khilman@ti.com>
>> Date: Mon, 16 Apr 2012 17:53:14 -0700
>> Subject: [PATCH] cpuidle34xx: per follows core, C1 use _bm
>>
>> ---
>>  arch/arm/mach-omap2/cpuidle34xx.c |    9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
>> index 374708d..00400ad 100644
>> --- a/arch/arm/mach-omap2/cpuidle34xx.c
>> +++ b/arch/arm/mach-omap2/cpuidle34xx.c
>> @@ -278,9 +278,11 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
>>        cx = cpuidle_get_statedata(&dev->states_usage[index]);
>>        core_next_state = cx->core_state;
>>        per_next_state = per_saved_state = pwrdm_read_next_pwrst(per_pd);
>> -       if ((per_next_state == PWRDM_POWER_OFF) &&
>> -           (core_next_state > PWRDM_POWER_RET))
>> -               per_next_state = PWRDM_POWER_RET;
>> +       /* if ((per_next_state == PWRDM_POWER_OFF) && */
>> +       /*     (core_next_state > PWRDM_POWER_RET)) */
>> +       /*      per_next_state = PWRDM_POWER_RET; */
>> +       if (per_next_state < core_next_state)
>> +               per_next_state = core_next_state;
>>
>>        /* Are we changing PER target state? */
>>        if (per_next_state != per_saved_state)
>> @@ -374,7 +376,6 @@ int __init omap3_idle_init(void)
>>
>>        /* C1 . MPU WFI + Core active */
>>        _fill_cstate(drv, 0, "MPU ON + CORE ON");
>> -       (&drv->states[0])->enter = omap3_enter_idle;
>>        drv->safe_state_index = 0;
>>        cx = _fill_cstate_usage(dev, 0);
>>        cx->valid = 1;  /* C1 is always valid */
>> --
>> 1.7.9.2
>>


* Re: PM related performance degradation on OMAP3
  2012-05-07 17:31                     ` Kevin Hilman
@ 2012-05-09 11:00                       ` Jean Pihet
  0 siblings, 0 replies; 36+ messages in thread
From: Jean Pihet @ 2012-05-09 11:00 UTC (permalink / raw)
  To: Kevin Hilman; +Cc: Grazvydas Ignotas, linux-omap, Paul Walmsley

Hi Kevin,

On Mon, May 7, 2012 at 7:31 PM, Kevin Hilman <khilman@ti.com> wrote:
> Jean Pihet <jean.pihet@newoldbits.com> writes:
>
>> On Tue, May 1, 2012 at 7:27 PM, Kevin Hilman <khilman@ti.com> wrote:
>>> Jean Pihet <jean.pihet@newoldbits.com> writes:
>>>
>>>> HI Kevin, Grazvydas,
>>>>
>>>> On Tue, Apr 24, 2012 at 4:29 PM, Kevin Hilman <khilman@ti.com> wrote:
>>>>> Jean Pihet <jean.pihet@newoldbits.com> writes:
>>>>>
>>>>>> Hi Grazvydas, Kevin,
>>>>>>
>>>>>> I did some gather some performance measurements and statistics using
>>>>>> custom tracepoints in __omap3_enter_idle.
>>>> I posted the patches for the power domains registers cache, cf.
>>>> http://marc.info/?l=linux-omap&m=133587781712039&w=2.
>>>>
>>>>>> All the details are at
>>>>>> http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
>>>> I updated the page with the measurements results with Kevin's patches
>>>> and the registers cache patches.
>>>>
>>>> The results are showing that:
>>>> - the registers cache optimizes the low power mode transitions, but is
>>>> not sufficient to obtain a big gain. A few unused domains are
>>>> transitioning, which causes a big penalty in the idle path.
>>>
>>> PER is the one that seems to be causing the most latency.
>>>
>>> Can you try do your measurements using hack below which makes sure that
>>> PER isn't any deeper than CORE?
>>
>> Indeed your patch brings significant improvements, cf. wiki page at
>> http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
>> for detailed information.
>> Here below is the reworked patch, more suited for inclusion in mainline [1]
>>
>> I have another optimisation -in proof of concept state- that brings
>> another significant improvement. It is about allowing/disabling idle
>> for only 1 clkdm in a pwrdm and not iterate through all the clkdms.
>> This still needs some rework though. Cf. patch [2]
>
> That should work since disabling idle for any clkdm will have the same
> effect.  Can you send this as a separate patch with a descriptive
> changelog.
I just sent 2 patches which optimize the C1 state latency:
 . [PATCH 1/2] ARM: OMAP3: PM: cpuidle: optimize the PER latency in C1 state
 . [PATCH 2/2] ARM: OMAP3: PM: cpuidle: optimize the clkdm idle latency in C1 state

Note: these patches apply on top of your pre/post_transition
optimization patches.

The performance results are close to the !PM case (no idle, no
omap_sram_idle, all pwrdms ON), i.e. 3.1MB/s on Beagleboard.
The wiki page update will follow ASAP.

Regards,
Jean

>
> Kevin
>
>
>> Patches [1] and [2] on top of the registers cache and the
>> optimisations in pre/post_transition bring the performance close to
>> the performance for the non cpuidle case (3.0MB/s compared to 3.1MB/s
>> on Beagleboard).
>>
>> What do you think?
>>
>> Regards,
>> Jean
>>
>> ---
>> [1]
>> diff --git a/arch/arm/mach-omap2/cpuidle34xx.c
>> b/arch/arm/mach-omap2/cpuidle34xx.c
>> index e406d7b..572b605 100644
>> +++ b/arch/arm/mach-omap2/cpuidle34xx.c
>> @@ -279,32 +279,36 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
>>       int ret;
>>
>>       /*
>> -      * Prevent idle completely if CAM is active.
>> +      * Use only C1 if CAM is active.
>>        * CAM does not have wakeup capability in OMAP3.
>>        */
>> -     if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON) {
>> +     if (pwrdm_read_func_pwrst(cam_pd) == PWRDM_FUNC_PWRST_ON)
>>               new_state_idx = drv->safe_state_index;
>> -             goto select_state;
>> -     }
>> -
>> -     new_state_idx = next_valid_state(dev, drv, index);
>> +     else
>> +             new_state_idx = next_valid_state(dev, drv, index);
>>
>> -     /*
>> -      * Prevent PER off if CORE is not in retention or off as this
>> -      * would disable PER wakeups completely.
>> -      */
>> +     /* Program PER state */
>>       cx = cpuidle_get_statedata(&dev->states_usage[new_state_idx]);
>>       core_next_state = cx->core_state;
>> -     per_next_state = per_saved_state = pwrdm_read_next_func_pwrst(per_pd);
>> -     if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
>> -         (core_next_state > PWRDM_FUNC_PWRST_CSWR))
>> -             per_next_state = PWRDM_FUNC_PWRST_CSWR;
>> +     if (new_state_idx == 0) {
>> +             /* In C1 do not allow PER state lower than CORE state */
>> +             per_next_state = core_next_state;
>> +     } else {
>> +             /*
>> +              * Prevent PER off if CORE is not in RETention or OFF as this
>> +              * would disable PER wakeups completely.
>> +              */
>> +             per_next_state = per_saved_state =
>> +                             pwrdm_read_next_func_pwrst(per_pd);
>> +             if ((per_next_state == PWRDM_FUNC_PWRST_OFF) &&
>> +                 (core_next_state > PWRDM_FUNC_PWRST_CSWR))
>> +                     per_next_state = PWRDM_FUNC_PWRST_CSWR;
>> +     }
>>
>>       /* Are we changing PER target state? */
>>       if (per_next_state != per_saved_state)
>>               omap_set_pwrdm_state(per_pd, per_next_state);
>>
>> -select_state:
>>       ret = omap3_enter_idle(dev, drv, new_state_idx);
>>
>>       /* Restore original PER state if it was modified */
>> @@ -390,7 +394,6 @@ int __init omap3_idle_init(void)
>>
>>       /* C1 . MPU WFI + Core active */
>>       _fill_cstate(drv, 0, "MPU ON + CORE ON");
>> -     (&drv->states[0])->enter = omap3_enter_idle;
>>       drv->safe_state_index = 0;
>>       cx = _fill_cstate_usage(dev, 0);
>>       cx->valid = 1;  /* C1 is always valid */
>>
>> [2]
>> diff --git a/arch/arm/mach-omap2/cpuidle34xx.c
>> b/arch/arm/mach-omap2/cpuidle34xx.c
>> index e406d7b..6aa3c75 100644
>> --- a/arch/arm/mach-omap2/cpuidle34xx.c
>> +++ b/arch/arm/mach-omap2/cpuidle34xx.c
>> @@ -118,8 +118,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
>>
>>       /* Deny idle for C1 */
>>       if (index == 0) {
>> -             pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
>> -             pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
>> +             //pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle);
>> +             clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]);
>> +             //pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle);
>> +             clkdm_deny_idle(core_pd->pwrdm_clkdms[0]);
>>       }
>>
>>       /*
>> @@ -141,8 +143,10 @@ static int __omap3_enter_idle(struct cpuidle_device *dev,
>>
>>       /* Re-allow idle for C1 */
>>       if (index == 0) {
>> -             pwrdm_for_each_clkdm(mpu_pd, _cpuidle_allow_idle);
>> -             pwrdm_for_each_clkdm(core_pd, _cpuidle_allow_idle);
>> +             //pwrdm_for_each_clkdm(mpu_pd, _cpuidle_allow_idle);
>> +             clkdm_allow_idle(mpu_pd->pwrdm_clkdms[0]);
>> +             //pwrdm_for_each_clkdm(core_pd, _cpuidle_allow_idle);
>> +             clkdm_allow_idle(core_pd->pwrdm_clkdms[0]);
>>       }
>>
>>  return_sleep_time:
>>
>>>
>>> Kevin
>>>
>>> From bb2f67ed93dc83c645080e293d315d383c23c0c6 Mon Sep 17 00:00:00 2001
>>> From: Kevin Hilman <khilman@ti.com>
>>> Date: Mon, 16 Apr 2012 17:53:14 -0700
>>> Subject: [PATCH] cpuidle34xx: per follows core, C1 use _bm
>>>
>>> ---
>>>  arch/arm/mach-omap2/cpuidle34xx.c |    9 +++++----
>>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/arm/mach-omap2/cpuidle34xx.c b/arch/arm/mach-omap2/cpuidle34xx.c
>>> index 374708d..00400ad 100644
>>> --- a/arch/arm/mach-omap2/cpuidle34xx.c
>>> +++ b/arch/arm/mach-omap2/cpuidle34xx.c
>>> @@ -278,9 +278,11 @@ static int omap3_enter_idle_bm(struct cpuidle_device *dev,
>>>        cx = cpuidle_get_statedata(&dev->states_usage[index]);
>>>        core_next_state = cx->core_state;
>>>        per_next_state = per_saved_state = pwrdm_read_next_pwrst(per_pd);
>>> -       if ((per_next_state == PWRDM_POWER_OFF) &&
>>> -           (core_next_state > PWRDM_POWER_RET))
>>> -               per_next_state = PWRDM_POWER_RET;
>>> +       /* if ((per_next_state == PWRDM_POWER_OFF) && */
>>> +       /*     (core_next_state > PWRDM_POWER_RET)) */
>>> +       /*      per_next_state = PWRDM_POWER_RET; */
>>> +       if (per_next_state < core_next_state)
>>> +               per_next_state = core_next_state;
>>>
>>>        /* Are we changing PER target state? */
>>>        if (per_next_state != per_saved_state)
>>> @@ -374,7 +376,6 @@ int __init omap3_idle_init(void)
>>>
>>>        /* C1 . MPU WFI + Core active */
>>>        _fill_cstate(drv, 0, "MPU ON + CORE ON");
>>> -       (&drv->states[0])->enter = omap3_enter_idle;
>>>        drv->safe_state_index = 0;
>>>        cx = _fill_cstate_usage(dev, 0);
>>>        cx->valid = 1;  /* C1 is always valid */
>>> --
>>> 1.7.9.2
>>>


end of thread, other threads:[~2012-05-09 11:00 UTC | newest]

Thread overview: 36+ messages
2012-04-06 22:50 PM related performance degradation on OMAP3 Grazvydas Ignotas
2012-04-09 19:03 ` Kevin Hilman
2012-04-11  0:29   ` Grazvydas Ignotas
2012-04-12  0:19     ` Kevin Hilman
2012-04-13 17:32       ` Grazvydas Ignotas
2012-04-13 19:32       ` Grazvydas Ignotas
2012-04-17 14:30         ` Kevin Hilman
2012-04-17 21:50           ` Grazvydas Ignotas
2012-04-18  0:36             ` Kevin Hilman
2012-04-24  9:50           ` Jean Pihet
2012-04-24 10:38             ` Santosh Shilimkar
2012-04-24 12:21               ` Tero Kristo
2012-04-24 12:50                 ` Jean Pihet
2012-04-24 13:04                   ` Tero Kristo
2012-04-24 14:29             ` Kevin Hilman
2012-05-01 14:10               ` Jean Pihet
2012-05-01 17:27                 ` Kevin Hilman
2012-05-02  5:59                   ` Paul Walmsley
2012-05-02 19:46                   ` Jean Pihet
2012-05-07 17:31                     ` Kevin Hilman
2012-05-09 11:00                       ` Jean Pihet
2012-04-12 23:02     ` Woodruff, Richard
2012-04-11 14:59 ` Gary Thomas
2012-04-11 17:23   ` Grazvydas Ignotas
2012-04-11 18:20     ` Gary Thomas
2012-04-11 19:17   ` Kevin Hilman
2012-04-12 10:44     ` Gary Thomas
2012-04-12 14:14       ` Kevin Hilman
2012-04-12 15:28         ` Gary Thomas
2012-04-12 16:57           ` Kevin Hilman
2012-04-12 17:10             ` Gary Thomas
2012-04-12 18:08               ` Kevin Hilman
2012-04-12 19:05                 ` Gary Thomas
2012-04-12 22:03                   ` Kevin Hilman
2012-04-13  0:39                     ` Gary Thomas
2012-04-13  9:13             ` Felipe Balbi
