* python3 recipe PGO tests
@ 2020-06-12 21:28 Ryan Rowe
  2020-06-12 21:37 ` Alexander Kanavin
  2020-06-15  1:05 ` [OE-core] " Anuj Mittal
  0 siblings, 2 replies; 8+ messages in thread
From: Ryan Rowe @ 2020-06-12 21:28 UTC (permalink / raw)
  To: openembedded-core; +Cc: alex.kanavin, Martin Kelly, Jim Broadus

Hello Alex,

I’m investigating Python 3 performance issues on a Raspberry Pi Yocto build; I would appreciate any insights you can provide into the problem.

In my investigation, I noticed that PGO was disabled in all cases due to a small bug. I fixed it in a patch submitted to OE-Core (#139459<https://lists.openembedded.org/g/openembedded-core/message/139459>). Even with PGO enabled, Yocto-compiled Python 3.8.3 runs significantly slower than the same version compiled on Raspbian.

In your patch, 0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch<http://cgit.openembedded.org/openembedded-core/tree/meta/recipes-devtools/python/python3/0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch>, I see that you override the default PROFILE_TASK, which did not explicitly specify test suites, to a command that explicitly provides test suites. How did you decide on these tests? The standard PGO command runs 43 tests, while you specify 7. When I compile Python 3.8.3 on Raspbian, I see no intersection between the 43 tests run by default and the 7 you specify. Additionally, the default module for PROFILE is test while you use test.regrtest.

For reference, here are the results of a simple CPU-bound test. These tests were run on the same Raspberry Pi 4 with the same SD card.

python3 -m timeit -r 10 --setup '
def fib(n):
    if n < 2:
        return n
    if n == 2:
        return 1
    return fib(n - 1) + fib(n - 2)
' '[fib(n) for n in range(20)]'

# Yocto Python 3.8.3
# 10 loops, best of 10: 28.9 msec per loop
# 10 loops, best of 10: 29.3 msec per loop
# 10 loops, best of 10: 27.9 msec per loop
# 10 loops, best of 10: 30.4 msec per loop
# Average result: 31.625 msec per loop

# Raspbian Python 3.8.3
# 50 loops, best of 10: 7.73 msec per loop
# 50 loops, best of 10: 7.72 msec per loop
# 50 loops, best of 10: 7.67 msec per loop
# 50 loops, best of 10: 7.74 msec per loop
# Average result: 7.715 msec per loop

# Raspbian speedup: 4.09x
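
As a cross-check on the figures above, the averages and the speedup can be recomputed from the individual runs. This is a minimal sketch in Python with the timings copied from the listing; the variable names are illustrative. Note that the four Yocto runs as listed average to 29.125 msec, not the stated 31.625 msec, so either one of the runs or the stated average may contain a typo; with 29.125 msec the speedup comes to about 3.78x rather than 4.09x.

```python
from statistics import mean

# Per-loop timings in msec, copied from the timeit output above.
yocto_msec = [28.9, 29.3, 27.9, 30.4]
raspbian_msec = [7.73, 7.72, 7.67, 7.74]

yocto_avg = mean(yocto_msec)
raspbian_avg = mean(raspbian_msec)

# Ratio of the averages: how many times faster the Raspbian build is.
speedup = yocto_avg / raspbian_avg

print(f"Yocto average:    {yocto_avg:.3f} msec")
print(f"Raspbian average: {raspbian_avg:.3f} msec")
print(f"Raspbian speedup: {speedup:.2f}x")
```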

Best,
Ryan Rowe


* Re: python3 recipe PGO tests
  2020-06-12 21:28 python3 recipe PGO tests Ryan Rowe
@ 2020-06-12 21:37 ` Alexander Kanavin
  2020-06-15  1:05 ` [OE-core] " Anuj Mittal
  1 sibling, 0 replies; 8+ messages in thread
From: Alexander Kanavin @ 2020-06-12 21:37 UTC (permalink / raw)
  To: Ryan Rowe, Anuj Mittal, Ross Burton
  Cc: openembedded-core, Martin Kelly, Jim Broadus

Hello Ryan,

I did not write the PGO bits; I only preserved them (without testing) when
I rewrote the Python recipe from scratch in order to bring some sanity to
it and make it possible again to update it to newer versions.
The people you want to talk to are Anuj Mittal and Ross Burton (cc).

Alex



* Re: [OE-core] python3 recipe PGO tests
  2020-06-12 21:28 python3 recipe PGO tests Ryan Rowe
  2020-06-12 21:37 ` Alexander Kanavin
@ 2020-06-15  1:05 ` Anuj Mittal
  2020-06-15 20:33   ` Ryan Rowe
  1 sibling, 1 reply; 8+ messages in thread
From: Anuj Mittal @ 2020-06-15  1:05 UTC (permalink / raw)
  To: openembedded-core, rrowe; +Cc: alex.kanavin, jbroadus, mkelly

On Fri, 2020-06-12 at 21:28 +0000, Ryan Rowe wrote:
> Hello Alex,
>  
> I’m investigating Python 3 performance issues on a Raspberry Pi Yocto
> build; I appreciate any insights you can provide into the problem.
>  
> In my investigation, I noticed that PGO was disabled in all cases due
> to a small bug. I fixed it in a patch submitted to OE-Core (#139459).
> Even when PGO is indeed enabled, Python 3 runs significantly slower
> on Yocto-compiled Python 3.8.3 than the same version compiled on
> Raspbian.
>  
> In your patch, 0001-Makefile.pre-use-qemu-wrapper-when-gathering-
> profile.patch, I see that you override the default PROFILE_TASK,
> which did not explicitly specify test suites, to a command that
> explicitly provides test suites. How did you decide on these tests?
> The standard PGO command runs 43 tests, while you specify 7. When I
> compile Python 3.8.3 on Raspbian, I see no intersection between the
> 43 tests run by default and the 7 you specify. Additionally, the
> default module for PROFILE is test while you use test.regrtest.

We used to run pybench and then switched to regrtest:

https://git.yoctoproject.org/cgit/cgit.cgi/poky/commit/?id=d9f7b9d3ad44195e68b2c1b09e3eb42e623c9a20

It looks like the PROFILE_TASK value was changed recently:

https://github.com/python/cpython/commit/2406672984e4c1b18629e615edad52928a72ffcc#diff-45e8b91057f0c5b60efcb5944125b585

If the performance is actually degrading, maybe we should change it to
something more useful. Do you know how much time the default set of
tasks takes to run in QEMU?

Thanks,

Anuj



* Re: [OE-core] python3 recipe PGO tests
  2020-06-15  1:05 ` [OE-core] " Anuj Mittal
@ 2020-06-15 20:33   ` Ryan Rowe
  2020-06-18 23:24     ` Khem Raj
  0 siblings, 1 reply; 8+ messages in thread
From: Ryan Rowe @ 2020-06-15 20:33 UTC (permalink / raw)
  To: Mittal, Anuj, openembedded-core; +Cc: Martin Kelly, Jim Broadus, alex.kanavin

On 14/6/20, 18:05, "Mittal, Anuj" <anuj.mittal@intel.com> wrote:
> On Fri, 2020-06-12 at 21:28 +0000, Ryan Rowe wrote:
> > Hello Alex,
> >
> > I’m investigating Python 3 performance issues on a Raspberry Pi Yocto
> > build; I appreciate any insights you can provide into the problem.
> >
> > In my investigation, I noticed that PGO was disabled in all cases due
> > to a small bug. I fixed it in a patch submitted to OE-Core (#139459).
> > Even when PGO is indeed enabled, Python 3 runs significantly slower
> > on Yocto-compiled Python 3.8.3 than the same version compiled on
> > Raspbian.
> >
> > In your patch, 0001-Makefile.pre-use-qemu-wrapper-when-gathering-
> > profile.patch, I see that you override the default PROFILE_TASK,
> > which did not explicitly specify test suites, to a command that
> > explicitly provides test suites. How did you decide on these tests?
> > The standard PGO command runs 43 tests, while you specify 7. When I
> > compile Python 3.8.3 on Raspbian, I see no intersection between the
> > 43 tests run by default and the 7 you specify. Additionally, the
> > default module for PROFILE is test while you use test.regrtest.
>
> We used to run pybench and then switched to regrtest:
>
> https://git.yoctoproject.org/cgit/cgit.cgi/poky/commit/?id=d9f7b9d3ad44195e68b2c1b09e3eb42e623c9a20
>
> The PROFILE_TASK value it looks like was changed recently:
>
> https://github.com/python/cpython/commit/2406672984e4c1b18629e615edad52928a72ffcc#diff-45e8b91057f0c5b60efcb5944125b585
>
> If the performance is actually degrading, may be we should change it to
> something more useful. Do you know much time does the default set of
> tasks take to run in qemu?
>
> Thanks,
>
> Anuj

Thanks for looking into this. It took me about 20 minutes to run the PGO
tests and I did notice a significant improvement in Python runtime.
However, that is compared against a non-PGO build. I have not compared
the existing PGO arguments against the new upstream arguments.

We've come to realize that our performance issues are not due to Python
but to a much deeper-rooted issue. Simple C code takes 2-3 times
longer to run on our image, based on meta-raspberrypi's raspberrypi4
machine, than on stock Raspbian.

On a side note, it seems that CPython now exposes PROFILE_TASK as a
configuration option, so we can override that variable with our
desired profiling arguments rather than modifying the Makefile
directly with a patch.
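
To illustrate the override, here is a minimal sketch in Python that assembles a configure invocation with PROFILE_TASK set on the command line. It assumes that CPython's configure script accepts PROFILE_TASK as a `VAR=value` argument, as the paragraph above suggests; the helper name `configure_args` and the example task string are hypothetical, and how this would be wired into the Yocto recipe is not shown.

```python
import shlex


def configure_args(profile_task):
    """Return a configure argv that enables PGO with a custom profiling task."""
    return [
        "./configure",
        "--enable-optimizations",
        # Assumption: configure accepts PROFILE_TASK as a VAR=value argument.
        f"PROFILE_TASK={profile_task}",
    ]


# Example task string (an assumption): run the test suite in PGO mode.
args = configure_args("-m test --pgo")

# shlex.join quotes the argument containing spaces for safe use in a shell.
print(shlex.join(args))
```

Keeping the task string in one variable means the profiling workload can be changed without touching the Makefile.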

Thanks,
Ryan




* Re: [OE-core] python3 recipe PGO tests
  2020-06-15 20:33   ` Ryan Rowe
@ 2020-06-18 23:24     ` Khem Raj
  2020-06-18 23:47       ` Andre McCurdy
  0 siblings, 1 reply; 8+ messages in thread
From: Khem Raj @ 2020-06-18 23:24 UTC (permalink / raw)
  To: Mittal, Anuj, openembedded-core
  Cc: Martin Kelly, Jim Broadus, alex.kanavin, Ryan Rowe

On Monday, June 15, 2020 1:33:26 PM PDT Ryan Rowe wrote:
> On 14/6/20, 18:05, "Mittal, Anuj" <anuj.mittal@intel.com> wrote:
> 
> > On Fri, 2020-06-12 at 21:28 +0000, Ryan Rowe wrote:
> > 
> > > Hello Alex,
> > >
> > >
> > >
> > > I’m investigating Python 3 performance issues on a Raspberry Pi Yocto
> > > build; I appreciate any insights you can provide into the problem.
> > >
> > >
> > >
> > > In my investigation, I noticed that PGO was disabled in all cases due
> > > to a small bug. I fixed it in a patch submitted to OE-Core (#139459).
> > > Even when PGO is indeed enabled, Python 3 runs significantly slower
> > > on Yocto-compiled Python 3.8.3 than the same version compiled on
> > > Raspbian.
> > >
> > >
> > >
> > > In your patch, 0001-Makefile.pre-use-qemu-wrapper-when-gathering-
> > > profile.patch, I see that you override the default PROFILE_TASK,
> > > which did not explicitly specify test suites, to a command that
> > > explicitly provides test suites. How did you decide on these tests?
> > > The standard PGO command runs 43 tests, while you specify 7. When I
> > > compile Python 3.8.3 on Raspbian, I see no intersection between the
> > > 43 tests run by default and the 7 you specify. Additionally, the
> > > default module for PROFILE is test while you use test.regrtest.
> >
> >
> >
> > We used to run pybench and then switched to regrtest:
> >
> >
> >
> > https://git.yoctoproject.org/cgit/cgit.cgi/poky/commit/?id=d9f7b9d3ad44195
> > e68b2c1b09e3eb42e623c9a20
>
> >
> >
> > The PROFILE_TASK value it looks like was changed recently:
> >
> >
> >
> > https://github.com/python/cpython/commit/2406672984e4c1b18629e615edad52928
> > a72ffcc#diff-45e8b91057f0c5b60efcb5944125b585
>
> >
> >
> > If the performance is actually degrading, may be we should change it to
> > something more useful. Do you know much time does the default set of
> > tasks take to run in qemu?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Anuj
> 
> 
> Thanks for looking into this. It took me about 20 minutes to run the PGO
> tests and I did notice a significant improvement in Python runtime.
> However, that is compared against a non-PGO build. I have not compared
> the existing PGO arguments against the new upstream arguments.
> 
> We've come to realize that our performance issues are not due to Python,
> but in fact a much deeper rooted issue. Simple C code takes 2-3 times
> longer to run on our image based on meta-raspberrypi's raspberrypi4
> machine than stock Raspbian.
> 
> On a side node, it seems that cPython now exposes PROFILE_TASK as a
> configuration option, so we can override that variable with our
> desired profiling arguments rather than modifying the Makefile
> directly with a patch.
> 

The patch 0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch
seems to hardcode which tests to run; perhaps it would be better to use
PROFILE_TASK.

When the 3.5 -> 3.7 upgrade was done in

https://git.openembedded.org/openembedded-core/commit/?id=02714c105426b0d687620913c1a7401b386428b6

it silently dropped the use of PYTHON3_PROFILE_TASK, among the large swath of
changes this patch carried. I guess we have not checked the py3 runtime
performance to detect this regression.

So it would be good to reinstate the variable so that one can choose which tests
to run, with the defaults being whatever is optimal for the autobuilder.


* Re: [OE-core] python3 recipe PGO tests
  2020-06-18 23:24     ` Khem Raj
@ 2020-06-18 23:47       ` Andre McCurdy
  2020-06-18 23:56         ` Khem Raj
  0 siblings, 1 reply; 8+ messages in thread
From: Andre McCurdy @ 2020-06-18 23:47 UTC (permalink / raw)
  To: Khem Raj
  Cc: Mittal, Anuj, openembedded-core, Martin Kelly, Jim Broadus,
	alex.kanavin, Ryan Rowe

On Thu, Jun 18, 2020 at 4:25 PM Khem Raj <raj.khem@gmail.com> wrote:
>
> On Monday, June 15, 2020 1:33:26 PM PDT Ryan Rowe wrote:
> > On 14/6/20, 18:05, "Mittal, Anuj" <anuj.mittal@intel.com> wrote:
> >
> > > On Fri, 2020-06-12 at 21:28 +0000, Ryan Rowe wrote:
> > >
> > > > Hello Alex,
> > > >
> > > >
> > > >
> > > > I’m investigating Python 3 performance issues on a Raspberry Pi Yocto
> > > > build; I appreciate any insights you can provide into the problem.
> > > >
> > > >
> > > >
> > > > In my investigation, I noticed that PGO was disabled in all cases due
> > > > to a small bug. I fixed it in a patch submitted to OE-Core (#139459).
> > > > Even when PGO is indeed enabled, Python 3 runs significantly slower
> > > > on Yocto-compiled Python 3.8.3 than the same version compiled on
> > > > Raspbian.
> > > >
> > > >
> > > >
> > > > In your patch, 0001-Makefile.pre-use-qemu-wrapper-when-gathering-
> > > > profile.patch, I see that you override the default PROFILE_TASK,
> > > > which did not explicitly specify test suites, to a command that
> > > > explicitly provides test suites. How did you decide on these tests?
> > > > The standard PGO command runs 43 tests, while you specify 7. When I
> > > > compile Python 3.8.3 on Raspbian, I see no intersection between the
> > > > 43 tests run by default and the 7 you specify. Additionally, the
> > > > default module for PROFILE is test while you use test.regrtest.
> > >
> > >
> > >
> > > We used to run pybench and then switched to regrtest:
> > >
> > >
> > >
> > > https://git.yoctoproject.org/cgit/cgit.cgi/poky/commit/?id=d9f7b9d3ad44195
> > > e68b2c1b09e3eb42e623c9a20
> >
> > >
> > >
> > > The PROFILE_TASK value it looks like was changed recently:
> > >
> > >
> > >
> > > https://github.com/python/cpython/commit/2406672984e4c1b18629e615edad52928
> > > a72ffcc#diff-45e8b91057f0c5b60efcb5944125b585
> >
> > >
> > >
> > > If the performance is actually degrading, may be we should change it to
> > > something more useful. Do you know much time does the default set of
> > > tasks take to run in qemu?
> > >
> > >
> > >
> > > Thanks,
> > >
> > >
> > >
> > > Anuj
> >
> >
> > Thanks for looking into this. It took me about 20 minutes to run the PGO
> > tests and I did notice a significant improvement in Python runtime.
> > However, that is compared against a non-PGO build. I have not compared
> > the existing PGO arguments against the new upstream arguments.
> >
> > We've come to realize that our performance issues are not due to Python,
> > but in fact a much deeper rooted issue. Simple C code takes 2-3 times
> > longer to run on our image based on meta-raspberrypi's raspberrypi4
> > machine than stock Raspbian.
> >
> > On a side node, it seems that cPython now exposes PROFILE_TASK as a
> > configuration option, so we can override that variable with our
> > desired profiling arguments rather than modifying the Makefile
> > directly with a patch.
> >
>
> The patch 0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch
> seems to hardcode what tests to run, perhaps it will be better to use
> PROFILE_TASK
>
> When 3.5 -> 3.7 upgrade was done in
>
> https://git.openembedded.org/openembedded-core/commit/?
> id=02714c105426b0d687620913c1a7401b386428b6
>
> it dropped using PYTHON3_PROFILE_TASK silently, among large swath of changes
> this patch carried. I guess we have not checked the py3 runtime performance to
> detect this regression.

Are we sure there is a regression? Ryan posted a follow-up saying
everything was slower in his tests, not just Python.

> so it will be good to reinstate the variable to choose what tests one wants to
> run with defaults being whatever is optimal for autobuilder.
>


* Re: [OE-core] python3 recipe PGO tests
  2020-06-18 23:47       ` Andre McCurdy
@ 2020-06-18 23:56         ` Khem Raj
  2020-06-19  1:30           ` Ryan Rowe
  0 siblings, 1 reply; 8+ messages in thread
From: Khem Raj @ 2020-06-18 23:56 UTC (permalink / raw)
  To: Andre McCurdy
  Cc: Mittal, Anuj, openembedded-core, Martin Kelly, Jim Broadus,
	alex.kanavin, Ryan Rowe

On Thu, Jun 18, 2020 at 4:47 PM Andre McCurdy <armccurdy@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 4:25 PM Khem Raj <raj.khem@gmail.com> wrote:
> >
> > On Monday, June 15, 2020 1:33:26 PM PDT Ryan Rowe wrote:
> > > On 14/6/20, 18:05, "Mittal, Anuj" <anuj.mittal@intel.com> wrote:
> > >
> > > > On Fri, 2020-06-12 at 21:28 +0000, Ryan Rowe wrote:
> > > >
> > > > > Hello Alex,
> > > > >
> > > > >
> > > > >
> > > > > I’m investigating Python 3 performance issues on a Raspberry Pi Yocto
> > > > > build; I appreciate any insights you can provide into the problem.
> > > > >
> > > > >
> > > > >
> > > > > In my investigation, I noticed that PGO was disabled in all cases due
> > > > > to a small bug. I fixed it in a patch submitted to OE-Core (#139459).
> > > > > Even when PGO is indeed enabled, Python 3 runs significantly slower
> > > > > on Yocto-compiled Python 3.8.3 than the same version compiled on
> > > > > Raspbian.
> > > > >
> > > > >
> > > > >
> > > > > In your patch, 0001-Makefile.pre-use-qemu-wrapper-when-gathering-
> > > > > profile.patch, I see that you override the default PROFILE_TASK,
> > > > > which did not explicitly specify test suites, to a command that
> > > > > explicitly provides test suites. How did you decide on these tests?
> > > > > The standard PGO command runs 43 tests, while you specify 7. When I
> > > > > compile Python 3.8.3 on Raspbian, I see no intersection between the
> > > > > 43 tests run by default and the 7 you specify. Additionally, the
> > > > > default module for PROFILE is test while you use test.regrtest.
> > > >
> > > >
> > > >
> > > > We used to run pybench and then switched to regrtest:
> > > >
> > > >
> > > >
> > > > https://git.yoctoproject.org/cgit/cgit.cgi/poky/commit/?id=d9f7b9d3ad44195
> > > > e68b2c1b09e3eb42e623c9a20
> > >
> > > >
> > > >
> > > > The PROFILE_TASK value it looks like was changed recently:
> > > >
> > > >
> > > >
> > > > https://github.com/python/cpython/commit/2406672984e4c1b18629e615edad52928
> > > > a72ffcc#diff-45e8b91057f0c5b60efcb5944125b585
> > >
> > > >
> > > >
> > > > If the performance is actually degrading, may be we should change it to
> > > > something more useful. Do you know much time does the default set of
> > > > tasks take to run in qemu?
> > > >
> > > >
> > > >
> > > > Thanks,
> > > >
> > > >
> > > >
> > > > Anuj
> > >
> > >
> > > Thanks for looking into this. It took me about 20 minutes to run the PGO
> > > tests and I did notice a significant improvement in Python runtime.
> > > However, that is compared against a non-PGO build. I have not compared
> > > the existing PGO arguments against the new upstream arguments.
> > >
> > > We've come to realize that our performance issues are not due to Python,
> > > but in fact a much deeper rooted issue. Simple C code takes 2-3 times
> > > longer to run on our image based on meta-raspberrypi's raspberrypi4
> > > machine than stock Raspbian.
> > >
> > > On a side node, it seems that cPython now exposes PROFILE_TASK as a
> > > configuration option, so we can override that variable with our
> > > desired profiling arguments rather than modifying the Makefile
> > > directly with a patch.
> > >
> >
> > The patch 0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch
> > seems to hardcode what tests to run, perhaps it will be better to use
> > PROFILE_TASK
> >
> > When 3.5 -> 3.7 upgrade was done in
> >
> > https://git.openembedded.org/openembedded-core/commit/?
> > id=02714c105426b0d687620913c1a7401b386428b6
> >
> > it dropped using PYTHON3_PROFILE_TASK silently, among large swath of changes
> > this patch carried. I guess we have not checked the py3 runtime performance to
> > detect this regression.
>
> Are we sure there is a regression? Ryan posted a follow up saying
> everything was slower in his tests, not just python.

The regression is PGO being disabled by e53ebf29.

>
> > so it will be good to reinstate the variable to choose what tests one wants to
> > run with defaults being whatever is optimal for autobuilder.
> >


* Re: [OE-core] python3 recipe PGO tests
  2020-06-18 23:56         ` Khem Raj
@ 2020-06-19  1:30           ` Ryan Rowe
  0 siblings, 0 replies; 8+ messages in thread
From: Ryan Rowe @ 2020-06-19  1:30 UTC (permalink / raw)
  To: Khem Raj, Andre McCurdy
  Cc: openembedded-core, Mittal, Anuj, alex.kanavin, Jim Broadus, Martin Kelly

On 18/6/20, 16:57, "Khem Raj" <raj.khem@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 4:47 PM Andre McCurdy <armccurdy@gmail.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 4:25 PM Khem Raj <raj.khem@gmail.com> wrote:
> > >
> > > On Monday, June 15, 2020 1:33:26 PM PDT Ryan Rowe wrote:
> > >
> > > > On 14/6/20, 18:05, "Mittal, Anuj" <anuj.mittal@intel.com> wrote:
> > > >
> > > > > On Fri, 2020-06-12 at 21:28 +0000, Ryan Rowe wrote:
> > > > >
> > > > > > Hello Alex,
> > > > > >
> > > > > >
> > > > > >
> > > > > > I’m investigating Python 3 performance issues on a Raspberry Pi Yocto
> > > > > > build; I appreciate any insights you can provide into the problem.
> > > > > >
> > > > > >
> > > > > >
> > > > > > In my investigation, I noticed that PGO was disabled in all cases due
> > > > > > to a small bug. I fixed it in a patch submitted to OE-Core (#139459).
> > > > > > Even when PGO is indeed enabled, Python 3 runs significantly slower
> > > > > > on Yocto-compiled Python 3.8.3 than the same version compiled on
> > > > > > Raspbian.
> > > > > >
> > > > > >
> > > > > >
> > > > > > In your patch, 0001-Makefile.pre-use-qemu-wrapper-when-gathering-
> > > > > > profile.patch, I see that you override the default PROFILE_TASK,
> > > > > > which did not explicitly specify test suites, to a command that
> > > > > > explicitly provides test suites. How did you decide on these tests?
> > > > > > The standard PGO command runs 43 tests, while you specify 7. When I
> > > > > > compile Python 3.8.3 on Raspbian, I see no intersection between the
> > > > > > 43 tests run by default and the 7 you specify. Additionally, the
> > > > > > default module for PROFILE is test while you use test.regrtest.
> > > > >
> > > > >
> > > > >
> > > > > We used to run pybench and then switched to regrtest:
> > > > >
> > > > >
> > > > >
> > > > > https://git.yoctoproject.org/cgit/cgit.cgi/poky/commit/?id=d9f7b9d3ad44195
> > > > > e68b2c1b09e3eb42e623c9a20
> > > >
> > > > >
> > > > >
> > > > > The PROFILE_TASK value it looks like was changed recently:
> > > > >
> > > > >
> > > > >
> > > > > https://github.com/python/cpython/commit/2406672984e4c1b18629e615edad52928
> > > > > a72ffcc#diff-45e8b91057f0c5b60efcb5944125b585
> > > >
> > > > >
> > > > >
> > > > > If the performance is actually degrading, may be we should change it to
> > > > > something more useful. Do you know much time does the default set of
> > > > > tasks take to run in qemu?
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > >
> > > > >
> > > > > Anuj
> > > >
> > > >
> > > > Thanks for looking into this. It took me about 20 minutes to run the PGO
> > > > tests and I did notice a significant improvement in Python runtime.
> > > > However, that is compared against a non-PGO build. I have not compared
> > > > the existing PGO arguments against the new upstream arguments.
> > > >
> > > > We've come to realize that our performance issues are not due to Python,
> > > > but in fact a much deeper rooted issue. Simple C code takes 2-3 times
> > > > longer to run on our image based on meta-raspberrypi's raspberrypi4
> > > > machine than stock Raspbian.
> > > >
> > > > On a side node, it seems that cPython now exposes PROFILE_TASK as a
> > > > configuration option, so we can override that variable with our
> > > > desired profiling arguments rather than modifying the Makefile
> > > > directly with a patch.
> > > >
> > >
> > > > The patch 0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch
> > > > seems to hardcode which tests to run; perhaps it would be better to use
> > > > PROFILE_TASK

We can use the default PROFILE_TASK, but it sounds like Ross had a reason to
switch from pybench to regrtest, mainly execution time. In his commit, he notes
"also upstream have removed it from Python and instead use test.regrtest --pgo
to profile the interpreter." That no longer seems to be true, as upstream now
uses test rather than test.regrtest. However, the default tests take about 20
minutes to run, which is considerably longer than the current explicit tests.
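
To make the two options concrete, here is a rough sketch of how either
PROFILE_TASK value could be passed to CPython's configure instead of patching
the Makefile. It assumes that configure accepts PROFILE_TASK as a variable (as
noted earlier in the thread) and that "-m test --pgo" matches the upstream
default. The helper names and the two test names are only illustrative; they
are not the seven tests from the existing patch.

```python
import shlex


def profile_task(module="test", tests=()):
    """Build a PROFILE_TASK value: interpreter arguments for the PGO training run."""
    return " ".join(["-m", module, "--pgo", *tests])


def configure_command(task):
    """Return a configure argv that enables PGO with the given training task."""
    # Assumption: configure honours PROFILE_TASK=... passed as a variable.
    return ["./configure", "--enable-optimizations", f"PROFILE_TASK={task}"]


# Assumed upstream-style default: run the PGO selection of the test package.
default_task = profile_task()

# Hypothetical short list, for illustration only.
short_task = profile_task("test.regrtest", ["test_grammar", "test_dict"])

for task in (default_task, short_task):
    print(shlex.join(configure_command(task)))
```

In a recipe, the same string would presumably be supplied through whatever
variable the recipe exposes for it, rather than through a Makefile patch.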

> > >
> > > When the 3.5 -> 3.7 upgrade was done in
> > >
> > > https://git.openembedded.org/openembedded-core/commit/?id=02714c105426b0d687620913c1a7401b386428b6
> > >
> > > it silently dropped PYTHON3_PROFILE_TASK, among the large swath of changes
> > > this patch carried. I guess we have not checked the py3 runtime performance to
> > > detect this regression.
> >
> > Are we sure there is a regression? Ryan posted a follow up saying
> > everything was slower in his tests, not just python.

In case anyone is curious, I did find the issue: the CPU governor was set to
powersave rather than ondemand. Silly me, I only checked the min and max
frequencies, not whether they were actually being used. A quirk of the OS also
prevented my benchmarks from printing the observed clock speed during the
tests; they printed empty strings instead.

With this fixed and when compiling with upstream PGO in Yocto, I do observe
comparable performance to regular upstream Python 3.8 compiled with PGO on
Raspbian.
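
For anyone who wants to rule out the same mistake, this is a small sketch of
the check I should have done first. It assumes the standard Linux cpufreq sysfs
layout under /sys/devices/system/cpu; the function names are made up for this
example. Missing or empty entries are reported as None rather than silently
printed as empty strings.

```python
from pathlib import Path

CPUFREQ_ROOT = Path("/sys/devices/system/cpu")
FIELDS = ("scaling_governor", "scaling_cur_freq",
          "scaling_min_freq", "scaling_max_freq")


def read_cpufreq(cpu=0, root=CPUFREQ_ROOT):
    """Return the cpufreq entries for one CPU; unreadable or empty ones are None."""
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    info = {}
    for name in FIELDS:
        try:
            text = (base / name).read_text().strip()
        except OSError:
            text = ""
        info[name] = text or None
    return info


def governor_warning(info):
    """Return a warning string if the governor could skew benchmarks, else None."""
    governor = info["scaling_governor"]
    if governor is None:
        return "could not read the governor; do not trust the benchmark yet"
    if governor == "powersave":
        return "governor is powersave; the CPU may stay at its minimum frequency"
    return None


info = read_cpufreq()
print(info)
print(governor_warning(info) or "governor looks fine for benchmarking")
```

Reading scaling_cur_freq while the benchmark is running shows whether the
maximum frequency is actually reached, which the min and max values alone do
not.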

>
> regression is disabling it with e53ebf29

Yes, that's correct. That commit inadvertently disabled PGO entirely. I can run
some tests tomorrow to measure how much performance is lost by profiling with
these explicit test suites rather than the upstream defaults. I did notice a
performance gain when using PGO, but that was measured against a non-PGO build.
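
For that comparison I would probably reuse the fib benchmark from my first
mail, run the same way on each build. The sketch below mirrors the timeit
command line using the timeit module; the helper names and the default of 10
loops are my own choices for the example, not something the timeit CLI
guarantees on every machine.

```python
import timeit

# Same workload as the command-line benchmark quoted in this thread.
SETUP = """
def fib(n):
    if n < 2:
        return n
    if n == 2:
        return 1
    return fib(n - 1) + fib(n - 2)
"""
STMT = "[fib(n) for n in range(20)]"


def best_msec_per_loop(repeat=10, number=10):
    """Run the benchmark `repeat` times and return the best time per loop in msec."""
    times = timeit.repeat(STMT, setup=SETUP, repeat=repeat, number=number)
    return min(times) / number * 1000.0


def speedup(baseline_msec, candidate_msec):
    """How many times faster the candidate build is than the baseline build."""
    return baseline_msec / candidate_msec


print(f"best of 10: {best_msec_per_loop():.2f} msec per loop")
```

Running this once per build (non-PGO, PGO with the explicit tests, PGO with the
upstream defaults) and feeding the results to speedup() gives the ratios to
compare.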

>
> >
> > > so it will be good to reinstate the variable to choose what tests one wants to
> > > run with defaults being whatever is optimal for autobuilder.
> > >
> > > > Thanks,
> > > > Ryan
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > For reference, here are the results of a simple CPU-bound test. These
> > > > > > > tests were run on the same Raspberry Pi 4 with the same SD card.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > python3 -m timeit -r 10 --setup '
> > > > > > > def fib(n):
> > > > > > >  if n < 2:
> > > > > > >    return n
> > > > > > >  if n == 2:
> > > > > > >    return 1
> > > > > > >  return fib(n - 1) + fib(n - 2)
> > > > > > > ' '[fib(n) for n in range(20)]'
> > > > > >
> > > > > >
> > > > > >
> > > > > > # Yocto Python 3.8.3
> > > > > > # 10 loops, best of 10: 28.9 msec per loop
> > > > > > # 10 loops, best of 10: 29.3 msec per loop
> > > > > > # 10 loops, best of 10: 27.9 msec per loop
> > > > > > # 10 loops, best of 10: 30.4 msec per loop
> > > > > > # Average result: 31.625 msec per loop
> > > > > >
> > > > > >
> > > > > >
> > > > > > # Raspbian Python 3.8.3
> > > > > > # 50 loops, best of 10: 7.73 msec per loop
> > > > > > # 50 loops, best of 10: 7.72 msec per loop
> > > > > > # 50 loops, best of 10: 7.67 msec per loop
> > > > > > # 50 loops, best of 10: 7.74 msec per loop
> > > > > > # Average result: 7.715 msec per loop
> > > > > >
> > > > > >
> > > > > >
> > > > > > # Raspbian speedup: 4.09x
> > > > > >
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Ryan Rowe
> > > > > >


Thread overview: 8+ messages
2020-06-12 21:28 python3 recipe PGO tests Ryan Rowe
2020-06-12 21:37 ` Alexander Kanavin
2020-06-15  1:05 ` [OE-core] " Anuj Mittal
2020-06-15 20:33   ` Ryan Rowe
2020-06-18 23:24     ` Khem Raj
2020-06-18 23:47       ` Andre McCurdy
2020-06-18 23:56         ` Khem Raj
2020-06-19  1:30           ` Ryan Rowe
