bitbake-devel.lists.openembedded.org archive mirror
 help / color / mirror / Atom feed
* [kirkstone][2.0][PATCH] siggen: fix inefficient string concatenation
@ 2023-02-14 10:34 ecordonnier
  2023-02-14 10:56 ` [bitbake-devel] " Martin Jansa
  2023-02-14 14:49 ` Steve Sakoman
  0 siblings, 2 replies; 5+ messages in thread
From: ecordonnier @ 2023-02-14 10:34 UTC (permalink / raw)
  To: bitbake-devel; +Cc: Etienne Cordonnier, JJ Robertson

From: Etienne Cordonnier <ecordonnier@snap.com>

As discussed in https://stackoverflow.com/a/4435752/1710392 , CPython
has an optimization for statements in the form "a = a + b" or "a += b".
It seems that this line does not get optimized, because it has a form a = a + b + c:
data = data + "./" + f.split("/./")[1]

For that reason, it does a copy of data for each iteration, potentially copying megabytes
of data for each iteration.

Changing this line causes SignatureGeneratorBasic::get_taskhash to take 0.06 seconds
instead of 45 seconds on my test setup where SRC_URI points to a big directory.

Note that PEP8 recommends explicitely not to use this optimization which is specific to CPython:
"do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b"

However, the PEP8 recommended form using "join()" also does not avoid the copy and takes 45 seconds in my test setup:
data = ''.join((data, "./", f.split("/./")[1]))

I have changed the other lines to also use += for consistency only, however those were in the form a = a + b
and were optimized already.

Co-authored-by: JJ Robertson <jrobertson@snap.com>
Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
---
 lib/bb/siggen.py | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lib/bb/siggen.py b/lib/bb/siggen.py
index 130b38d8..2381edd2 100644
--- a/lib/bb/siggen.py
+++ b/lib/bb/siggen.py
@@ -328,19 +328,19 @@ class SignatureGeneratorBasic(SignatureGenerator):
 
         data = self.basehash[tid]
         for dep in self.runtaskdeps[tid]:
-            data = data + self.get_unihash(dep)
+            data += self.get_unihash(dep)
 
         for (f, cs) in self.file_checksum_values[tid]:
             if cs:
                 if "/./" in f:
-                    data = data + "./" + f.split("/./")[1]
-                data = data + cs
+                    data += "./" + f.split("/./")[1]
+                data += cs
 
         if tid in self.taints:
             if self.taints[tid].startswith("nostamp:"):
-                data = data + self.taints[tid][8:]
+                data += self.taints[tid][8:]
             else:
-                data = data + self.taints[tid]
+                data += self.taints[tid]
 
         h = hashlib.sha256(data.encode("utf-8")).hexdigest()
         self.taskhash[tid] = h
-- 
2.36.1.vfs.0.0



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [bitbake-devel] [kirkstone][2.0][PATCH] siggen: fix inefficient string concatenation
  2023-02-14 10:34 [kirkstone][2.0][PATCH] siggen: fix inefficient string concatenation ecordonnier
@ 2023-02-14 10:56 ` Martin Jansa
  2023-02-14 11:04   ` Etienne Cordonnier
  2023-02-14 14:49 ` Steve Sakoman
  1 sibling, 1 reply; 5+ messages in thread
From: Martin Jansa @ 2023-02-14 10:56 UTC (permalink / raw)
  To: ecordonnier; +Cc: bitbake-devel, JJ Robertson

[-- Attachment #1: Type: text/plain, Size: 3331 bytes --]

This might help with:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14942

will test it on my testcase with 380227 paths and update the ticket if this
helps.

On Tue, Feb 14, 2023 at 11:35 AM ecordonnier via lists.openembedded.org
<ecordonnier=snap.com@lists.openembedded.org> wrote:

> From: Etienne Cordonnier <ecordonnier@snap.com>
>
> As discussed in https://stackoverflow.com/a/4435752/1710392 , CPython
> has an optimization for statements in the form "a = a + b" or "a += b".
> It seems that this line does not get optimized, because it has a form a =
> a + b + c:
> data = data + "./" + f.split("/./")[1]
>
> For that reason, it does a copy of data for each iteration, potentially
> copying megabytes
> of data for each iteration.
>
> Changing this line causes SignatureGeneratorBasic::get_taskhash to take
> 0.06 seconds
> instead of 45 seconds on my test setup where SRC_URI points to a big
> directory.
>
> Note that PEP8 recommends explicitely not to use this optimization which
> is specific to CPython:
> "do not rely on CPython’s efficient implementation of in-place string
> concatenation for statements in the form a += b or a = a + b"
>
> However, the PEP8 recommended form using "join()" also does not avoid the
> copy and takes 45 seconds in my test setup:
> data = ''.join((data, "./", f.split("/./")[1]))
>
> I have changed the other lines to also use += for consistency only,
> however those were in the form a = a + b
> and were optimized already.
>
> Co-authored-by: JJ Robertson <jrobertson@snap.com>
> Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
> ---
>  lib/bb/siggen.py | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/lib/bb/siggen.py b/lib/bb/siggen.py
> index 130b38d8..2381edd2 100644
> --- a/lib/bb/siggen.py
> +++ b/lib/bb/siggen.py
> @@ -328,19 +328,19 @@ class SignatureGeneratorBasic(SignatureGenerator):
>
>          data = self.basehash[tid]
>          for dep in self.runtaskdeps[tid]:
> -            data = data + self.get_unihash(dep)
> +            data += self.get_unihash(dep)
>
>          for (f, cs) in self.file_checksum_values[tid]:
>              if cs:
>                  if "/./" in f:
> -                    data = data + "./" + f.split("/./")[1]
> -                data = data + cs
> +                    data += "./" + f.split("/./")[1]
> +                data += cs
>
>          if tid in self.taints:
>              if self.taints[tid].startswith("nostamp:"):
> -                data = data + self.taints[tid][8:]
> +                data += self.taints[tid][8:]
>              else:
> -                data = data + self.taints[tid]
> +                data += self.taints[tid]
>
>          h = hashlib.sha256(data.encode("utf-8")).hexdigest()
>          self.taskhash[tid] = h
> --
> 2.36.1.vfs.0.0
>
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#14404):
> https://lists.openembedded.org/g/bitbake-devel/message/14404
> Mute This Topic: https://lists.openembedded.org/mt/96957072/3617156
> Group Owner: bitbake-devel+owner@lists.openembedded.org
> Unsubscribe: https://lists.openembedded.org/g/bitbake-devel/unsub [
> Martin.Jansa@gmail.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>
>

[-- Attachment #2: Type: text/html, Size: 4916 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bitbake-devel] [kirkstone][2.0][PATCH] siggen: fix inefficient string concatenation
  2023-02-14 10:56 ` [bitbake-devel] " Martin Jansa
@ 2023-02-14 11:04   ` Etienne Cordonnier
       [not found]     ` <CAHUKmYbn95Mybr69s0WkU5a5j9c-2p-KbthrYqigTPwfis1q0Q@mail.gmail.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Etienne Cordonnier @ 2023-02-14 11:04 UTC (permalink / raw)
  To: Martin Jansa; +Cc: bitbake-devel, JJ Robertson

[-- Attachment #1: Type: text/plain, Size: 5836 bytes --]

Note that this is just a cherry-pick from the main branch, so I kept the
commit exactly identical.

However in case your test-case is still slow, I believe there is a
potential for speed optimization by creating a list and then calling join()
one time once the loop is finished ("loop 5" in
https://stackoverflow.com/a/75313327/1710392 ). Of course the memory usage
would increase, but it's a trade-off.

On Tue, Feb 14, 2023 at 11:57 AM Martin Jansa <martin.jansa@gmail.com>
wrote:

> This might help with:
> https://bugzilla.yoctoproject.org/show_bug.cgi?id=14942
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugzilla.yoctoproject.org_show-5Fbug.cgi-3Fid-3D14942&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=yeRxgb5569zYnhHWpVFzs4fvU9NrVNah2UOPePY64Rs&e=>
>
> will test it on my testcase with 380227 paths and update the ticket if
> this helps.
>
> On Tue, Feb 14, 2023 at 11:35 AM ecordonnier via lists.openembedded.org
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.openembedded.org&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=URKYJE0IqTwSkkx42HBRLJE5BsC41VfpRdhus8DX5xY&e=>
> <ecordonnier=snap.com@lists.openembedded.org> wrote:
>
>> From: Etienne Cordonnier <ecordonnier@snap.com>
>>
>> As discussed in https://stackoverflow.com/a/4435752/1710392
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_a_4435752_1710392&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=QpRKrH6Ih3NfErHKEJ3_t6yojMBxhRvk66sn4Gsazkw&e=>
>> , CPython
>> has an optimization for statements in the form "a = a + b" or "a += b".
>> It seems that this line does not get optimized, because it has a form a =
>> a + b + c:
>> data = data + "./" + f.split("/./")[1]
>>
>> For that reason, it does a copy of data for each iteration, potentially
>> copying megabytes
>> of data for each iteration.
>>
>> Changing this line causes SignatureGeneratorBasic::get_taskhash to take
>> 0.06 seconds
>> instead of 45 seconds on my test setup where SRC_URI points to a big
>> directory.
>>
>> Note that PEP8 recommends explicitely not to use this optimization which
>> is specific to CPython:
>> "do not rely on CPython’s efficient implementation of in-place string
>> concatenation for statements in the form a += b or a = a + b"
>>
>> However, the PEP8 recommended form using "join()" also does not avoid the
>> copy and takes 45 seconds in my test setup:
>> data = ''.join((data, "./", f.split("/./")[1]))
>>
>> I have changed the other lines to also use += for consistency only,
>> however those were in the form a = a + b
>> and were optimized already.
>>
>> Co-authored-by: JJ Robertson <jrobertson@snap.com>
>> Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
>> ---
>>  lib/bb/siggen.py | 10 +++++-----
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/lib/bb/siggen.py b/lib/bb/siggen.py
>> index 130b38d8..2381edd2 100644
>> --- a/lib/bb/siggen.py
>> +++ b/lib/bb/siggen.py
>> @@ -328,19 +328,19 @@ class SignatureGeneratorBasic(SignatureGenerator):
>>
>>          data = self.basehash[tid]
>>          for dep in self.runtaskdeps[tid]:
>> -            data = data + self.get_unihash(dep)
>> +            data += self.get_unihash(dep)
>>
>>          for (f, cs) in self.file_checksum_values[tid]:
>>              if cs:
>>                  if "/./" in f:
>> -                    data = data + "./" + f.split("/./")[1]
>> -                data = data + cs
>> +                    data += "./" + f.split("/./")[1]
>> +                data += cs
>>
>>          if tid in self.taints:
>>              if self.taints[tid].startswith("nostamp:"):
>> -                data = data + self.taints[tid][8:]
>> +                data += self.taints[tid][8:]
>>              else:
>> -                data = data + self.taints[tid]
>> +                data += self.taints[tid]
>>
>>          h = hashlib.sha256(data.encode("utf-8")).hexdigest()
>>          self.taskhash[tid] = h
>> --
>> 2.36.1.vfs.0.0
>>
>>
>> -=-=-=-=-=-=-=-=-=-=-=-
>> Links: You receive all messages sent to this group.
>> View/Reply Online (#14404):
>> https://lists.openembedded.org/g/bitbake-devel/message/14404
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_message_14404&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=Vw-mNHnRPLvPlWEXUYciNNhBB4pZanI2ZGEsbWRNR-g&e=>
>> Mute This Topic: https://lists.openembedded.org/mt/96957072/3617156
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_mt_96957072_3617156&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=l4fkRs6786x3hj5GGhT8XQ30SmRozQ8PJGoP_xPn_y0&e=>
>> Group Owner: bitbake-devel+owner@lists.openembedded.org
>> Unsubscribe: https://lists.openembedded.org/g/bitbake-devel/unsub
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_unsub&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=DVQKHhGOkYy3oSbWagenl7KtRRYBC1655FauLxtgkHw&e=>
>> [Martin.Jansa@gmail.com]
>> -=-=-=-=-=-=-=-=-=-=-=-
>>
>>

[-- Attachment #2: Type: text/html, Size: 7536 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bitbake-devel] [kirkstone][2.0][PATCH] siggen: fix inefficient string concatenation
  2023-02-14 10:34 [kirkstone][2.0][PATCH] siggen: fix inefficient string concatenation ecordonnier
  2023-02-14 10:56 ` [bitbake-devel] " Martin Jansa
@ 2023-02-14 14:49 ` Steve Sakoman
  1 sibling, 0 replies; 5+ messages in thread
From: Steve Sakoman @ 2023-02-14 14:49 UTC (permalink / raw)
  To: ecordonnier; +Cc: bitbake-devel, JJ Robertson

I finished autobuilder testing this patch on both langdale and
kirkstone - will send the review email out later today.

Steve

On Tue, Feb 14, 2023 at 12:35 AM ecordonnier via
lists.openembedded.org <ecordonnier=snap.com@lists.openembedded.org>
wrote:
>
> From: Etienne Cordonnier <ecordonnier@snap.com>
>
> As discussed in https://stackoverflow.com/a/4435752/1710392 , CPython
> has an optimization for statements in the form "a = a + b" or "a += b".
> It seems that this line does not get optimized, because it has a form a = a + b + c:
> data = data + "./" + f.split("/./")[1]
>
> For that reason, it does a copy of data for each iteration, potentially copying megabytes
> of data for each iteration.
>
> Changing this line causes SignatureGeneratorBasic::get_taskhash to take 0.06 seconds
> instead of 45 seconds on my test setup where SRC_URI points to a big directory.
>
> Note that PEP8 recommends explicitely not to use this optimization which is specific to CPython:
> "do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b"
>
> However, the PEP8 recommended form using "join()" also does not avoid the copy and takes 45 seconds in my test setup:
> data = ''.join((data, "./", f.split("/./")[1]))
>
> I have changed the other lines to also use += for consistency only, however those were in the form a = a + b
> and were optimized already.
>
> Co-authored-by: JJ Robertson <jrobertson@snap.com>
> Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
> ---
>  lib/bb/siggen.py | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/lib/bb/siggen.py b/lib/bb/siggen.py
> index 130b38d8..2381edd2 100644
> --- a/lib/bb/siggen.py
> +++ b/lib/bb/siggen.py
> @@ -328,19 +328,19 @@ class SignatureGeneratorBasic(SignatureGenerator):
>
>          data = self.basehash[tid]
>          for dep in self.runtaskdeps[tid]:
> -            data = data + self.get_unihash(dep)
> +            data += self.get_unihash(dep)
>
>          for (f, cs) in self.file_checksum_values[tid]:
>              if cs:
>                  if "/./" in f:
> -                    data = data + "./" + f.split("/./")[1]
> -                data = data + cs
> +                    data += "./" + f.split("/./")[1]
> +                data += cs
>
>          if tid in self.taints:
>              if self.taints[tid].startswith("nostamp:"):
> -                data = data + self.taints[tid][8:]
> +                data += self.taints[tid][8:]
>              else:
> -                data = data + self.taints[tid]
> +                data += self.taints[tid]
>
>          h = hashlib.sha256(data.encode("utf-8")).hexdigest()
>          self.taskhash[tid] = h
> --
> 2.36.1.vfs.0.0
>
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#14404): https://lists.openembedded.org/g/bitbake-devel/message/14404
> Mute This Topic: https://lists.openembedded.org/mt/96957072/3620601
> Group Owner: bitbake-devel+owner@lists.openembedded.org
> Unsubscribe: https://lists.openembedded.org/g/bitbake-devel/unsub [steve@sakoman.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bitbake-devel] [kirkstone][2.0][PATCH] siggen: fix inefficient string concatenation
       [not found]                                     ` <CAHUKmYZ5DJtA4NFdisJsT2aJJQjXfqkcPbSZx_SEh6K7SS40Kg@mail.gmail.com>
@ 2023-02-14 15:19                                       ` Martin Jansa
  0 siblings, 0 replies; 5+ messages in thread
From: Martin Jansa @ 2023-02-14 15:19 UTC (permalink / raw)
  To: Etienne Cordonnier, bitbake-devel

[-- Attachment #1: Type: text/plain, Size: 26157 bytes --]

On Tue, Feb 14, 2023 at 2:53 PM Etienne Cordonnier <ecordonnier@snap.com>
wrote:

> Yes, see my last email for the explanation of why 3.11 is slower when the
> code is not in a function. Let's use extend() instead of +=, this will be
> more portable (but as far as I understand, the code should already be fast
> on your setup because it is called inside a function when called inside
> poky).
>

Thanks, I've seen it after sending my reply. Yes it explains why I'm seeing
it in my simple test case, but I'm surprised that it's not seen as
regression, true global variables were always a bit slower than local, but
this is significantly slower and the improvement from 3.12 doesn't return
the 3.10 performance (at least not in alpha5 I was testing).

Do you mind adding me as co-author of the patch?
>

No problem, but it's not ready
1) It causes a lot of warnings like:
WARNING: Runqueue deadlocked on deferred tasks, forcing task
virtual:native:/OE/build/oe-core/openembedded-core/meta/recipes-devtools/gnu-config/gnu-config_git.bb:do_populate_sysroot
blocked by
virtual:native:/OE/build/oe-core/openembedded-core/meta/recipes-devtools/gnu-config/gnu-config_git.bb:
do_deploy_source_date_epoch
and
NOTE: Deferring
virtual:native:/OE/build/oe-core/openembedded-core/meta/recipes-support/sqlite/sqlite3_3.40.1.bb:do_populate_sysroot
after
virtual:native:/OE/build/oe-core/openembedded-core/meta/recipes-support/sqlite/sqlite3_3.40.1.bb:
do_deploy_source_date_epoch

2) I still want to test it with externalsrc (measuring whole get_taskhash()
function), the bottleneck might still be there somewhere else - it took
almost 2 hours before, so only some portion of this was this for loop.

+ML again as some parts of this might be interesting to other as well.

Regards,

On Tue, Feb 14, 2023 at 2:48 PM Martin Jansa <martin.jansa@gmail.com> wrote:
>
>> Gentoo:
>> $ python3 --version
>> Python 3.11.2
>> $ ./test-with-both-times.py
>> += took 507.63995265960693 seconds
>> extends() took 0.7833898067474365 seconds
>>
>> $ python3.10 test-with-both-times.py
>> += took 0.2911818027496338 seconds
>> extends() took 0.7215933799743652 seconds
>>
>> $ pypy3.9 ./test-with-both-times.py
>> += took 351.19071674346924 seconds
>> extends() took 0.9291989803314209 seconds
>>
>> Ubuntu-23.04 in docker:
>> $ python3 --version
>> Python 3.11.1
>> $ python3 test-with-both-times.py
>> += took 509.31823468208313 seconds
>> extends() took 0.1940152645111084 seconds
>>
>> Ubuntu-22.04 in docker:
>> $ python3 --version
>> Python 3.10.4
>> $ python3 test-with-both-times.py
>> += took 0.09383106231689453 seconds
>> extends() took 0.20920562744140625 seconds
>>
>> Current oe-core/master
>> $
>> /OE/build/oe-core/tmp-glibc/sysroots-components/x86_64/python3-native/usr/bin/python3-native/python3
>> --version
>> Python 3.11.1
>> $
>> /OE/build/oe-core/tmp-glibc/sysroots-components/x86_64/python3-native/usr/bin/python3-native/python3
>> test-with-both-times.py
>> += took 508.69702982902527 seconds
>> extends() took 0.22980260848999023 seconds
>>
>> My host is busy but not that busy (load only ~ 5 now) :/
>>
>> Looks like 3.10 python was built with some secret sauce where the
>> optimization you mentioned worked, but it doesn't anymore with 3.11.
>>
>> I'm checking python 3.12-alpha5 in gentoo and I've noticed that PGO is
>> disabled in my build, but I don't expect it to have such huge impact, lets
>> see.
>>
>> On Tue, Feb 14, 2023 at 2:00 PM Etienne Cordonnier <ecordonnier@snap.com>
>> wrote:
>>
>>> By the way, from a big picture point of view, I think using extends() is
>>> still acceptable. Even on my setup where it's 100% slower than the +=
>>> version, it is still 500 times faster than the original version and
>>> therefore the function is not a bottleneck. It makes the code more portable
>>> and less dependent on a particular optimization of CPython, which is the
>>> recommendation of PEP8 I quoted in my commit message.
>>>
>>> On Tue, Feb 14, 2023 at 1:55 PM Etienne Cordonnier <ecordonnier@snap.com>
>>> wrote:
>>>
>>>> What do you get if you run this?
>>>>
>>>>
>>>> ecordonnier@lj8k2dq3:~$ python test-with-both-times.py
>>>> += took 0.09932756423950195 seconds
>>>> extends() took 0.2495267391204834 seconds
>>>>
>>>> On Tue, Feb 14, 2023 at 1:53 PM Etienne Cordonnier <
>>>> ecordonnier@snap.com> wrote:
>>>>
>>>>> Yes, I also don't get why it's slower with extends. This contradicts
>>>>> the results of this test script attached where the version with extends is
>>>>> the faster:
>>>>>
>>>>> ecordonnier@lj8k2dq3:~$ python test2.py
>>>>> end loop2: 1.1121666431427002
>>>>> end loop4: 1.0626156330108643
>>>>> end loop5: 0.7910439968109131
>>>>>
>>>>>
>>>>> ecordonnier@lj8k2dq3:~$ ls -l checksums.json
>>>>> -rw-rw-r-- 1 ecordonnier ecordonnier 52554695 Feb 14 13:34
>>>>> checksums.json
>>>>> ecordonnier@lj8k2dq3:~$ du -h checksums.data
>>>>> 12M checksums.data
>>>>>
>>>>> On Tue, Feb 14, 2023 at 1:51 PM Martin Jansa <martin.jansa@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> That's weird, why does it use 512s on my host and only 0.09 on yours
>>>>>> :/
>>>>>> Were you using whole 51MB checksums.json with 12MB resulting file?
>>>>>>
>>>>>> $ du -h checksums.data
>>>>>> 12M     checksums.data
>>>>>>
>>>>>> On Tue, Feb 14, 2023 at 1:48 PM Etienne Cordonnier <
>>>>>> ecordonnier@snap.com> wrote:
>>>>>>
>>>>>>> Nevermind, the script is garbage after my modification. The correct
>>>>>>> time is 0.24 seconds for the "extends" version and 0.09 seconds for the
>>>>>>> "+=" version.
>>>>>>>
>>>>>>> On Tue, Feb 14, 2023 at 1:45 PM Etienne Cordonnier <
>>>>>>> ecordonnier@snap.com> wrote:
>>>>>>>
>>>>>>>> Those are the results on my setup (I modified the script slightly):
>>>>>>>>
>>>>>>>> $ python --version
>>>>>>>> Python 3.10.6
>>>>>>>> ecordonnier@lj8k2dq3:~$ python test.py
>>>>>>>> += took 0.09993505477905273 seconds
>>>>>>>> extends() took 0.0006468296051025391 seconds
>>>>>>>>
>>>>>>>> On Tue, Feb 14, 2023 at 1:34 PM Martin Jansa <
>>>>>>>> martin.jansa@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> > Do you have checksums.json?
>>>>>>>>>
>>>>>>>>> Link to it was in previous mail
>>>>>>>>>
>>>>>>>>> 51MB checksums.json:
>>>>>>>>>
>>>>>>>>> https://drive.google.com/file/d/1tA24mXF7cJ8RonoevFmU_7n4xh9k1KoS/view?usp=sharing
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 14, 2023 at 1:33 PM Etienne Cordonnier <
>>>>>>>>> ecordonnier@snap.com> wrote:
>>>>>>>>>
>>>>>>>>>> I wonder why the optimization with "+=" is not working as yocto
>>>>>>>>>> patch on your setup even though it works on your setup with
>>>>>>>>>> "concatenate.py".
>>>>>>>>>>
>>>>>>>>>> Do you have checksums.json?
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 14, 2023 at 1:21 PM Martin Jansa <
>>>>>>>>>> martin.jansa@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The approach with array seems way to go it went down from 512s
>>>>>>>>>>> to 0.767s (both with loaded host). It produced the same result as expected
>>>>>>>>>>> (but I the paths I have have no /./).
>>>>>>>>>>>
>>>>>>>>>>> Can you test this on your host with older python3?
>>>>>>>>>>>
>>>>>>>>>>> If it works similarly for you, then we can use this to fix
>>>>>>>>>>> https://bugzilla.yoctoproject.org/show_bug.cgi?id=14942
>>>>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugzilla.yoctoproject.org_show-5Fbug.cgi-3Fid-3D14942&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=u28ErN8vnQyB_3KruKe2A4aqRS4PBeCVjnIYNr0wAAvki8gZJWo_6ObIH7YCgvVN&s=NE0gvBMLDj8hPL8pFnfwoyE7f3zS5Kwj44TjfiYc5S0&e=>
>>>>>>>>>>>
>>>>>>>>>>> #!/usr/bin/env python3
>>>>>>>>>>>
>>>>>>>>>>> import json
>>>>>>>>>>>
>>>>>>>>>>> data = ''
>>>>>>>>>>> with open('checksums.json', "r") as f:
>>>>>>>>>>>     checksums = json.load(f)
>>>>>>>>>>> import time
>>>>>>>>>>> start_time = time.time()
>>>>>>>>>>> datalist = []
>>>>>>>>>>> for (f, cs) in checksums:
>>>>>>>>>>>     if cs:
>>>>>>>>>>>         # 1018.7424070835114s
>>>>>>>>>>>         # data += cs
>>>>>>>>>>>
>>>>>>>>>>>         # 1061.4139277935028
>>>>>>>>>>>         # data = data + cs
>>>>>>>>>>>
>>>>>>>>>>>         # 1020.9841980934143
>>>>>>>>>>>         # if "/./" in f:
>>>>>>>>>>>         #     data = data + "./" + f.split("/./")[1]
>>>>>>>>>>>         # data = data + cs
>>>>>>>>>>>
>>>>>>>>>>>         # Test again 2023-02-14
>>>>>>>>>>>         # took 1076.6172671318054 seconds
>>>>>>>>>>>         # data = data + cs
>>>>>>>>>>>
>>>>>>>>>>>         # took 699.0907781124115 seconds
>>>>>>>>>>>         # if "/./" in f:
>>>>>>>>>>>         #     data = data + "./" + f.split("/./")[1]
>>>>>>>>>>>         # data = data + cs
>>>>>>>>>>>
>>>>>>>>>>>         # took 582.110622882843 seconds
>>>>>>>>>>>         # took 512.9146637916565 seconds
>>>>>>>>>>>         if "/./" in f:
>>>>>>>>>>>             data += "./" + f.split("/./")[1]
>>>>>>>>>>>         data += cs
>>>>>>>>>>>
>>>>>>>>>>>         # took 546.2394247055054 seconds
>>>>>>>>>>>         # if "/./" in f:
>>>>>>>>>>>         #     data += "./"
>>>>>>>>>>>         #     data += f.split("/./")[1]
>>>>>>>>>>>         # data += cs
>>>>>>>>>>>
>>>>>>>>>>> #         took 0.7674515247344971 seconds
>>>>>>>>>>> #         if "/./" in f:
>>>>>>>>>>> #             datalist.extend(("./", f.split("/./")[1]))
>>>>>>>>>>> #         datalist.extend(cs)
>>>>>>>>>>> # data = "".join(datalist)
>>>>>>>>>>>
>>>>>>>>>>> stop_time = time.time()
>>>>>>>>>>> print("took %s seconds" % (stop_time - start_time))
>>>>>>>>>>>
>>>>>>>>>>> with open('checksums.data', "w") as f:
>>>>>>>>>>>     f.write(data)
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 14, 2023 at 1:07 PM Etienne Cordonnier <
>>>>>>>>>>> ecordonnier@snap.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> OK, I'm on Ubuntu 22.04, with python 3.10.6 (based on CPython)
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 14, 2023 at 12:36 PM Martin Jansa <
>>>>>>>>>>>> martin.jansa@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> OK,
>>>>>>>>>>>>>
>>>>>>>>>>>>> FWIW: There is no difference between loop2 and loop3 on mine:
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ python3 concatenate.py
>>>>>>>>>>>>> begin: 1676374316.6670556
>>>>>>>>>>>>> end loop1: 5.00896430015564
>>>>>>>>>>>>> end loop2: 0.7002358436584473
>>>>>>>>>>>>> end loop3: 0.7122139930725098
>>>>>>>>>>>>> $ python3 --version
>>>>>>>>>>>>> Python 3.11.2
>>>>>>>>>>>>>
>>>>>>>>>>>>> These are the versions I plan to test (faster 2nd loop might
>>>>>>>>>>>>> be caused by less load of the host during the run - which I why I wanted to
>>>>>>>>>>>>> wait for builds to finish first):
>>>>>>>>>>>>>
>>>>>>>>>>>>>         # Test again 2023-02-14
>>>>>>>>>>>>>         # took 1076.6172671318054 seconds
>>>>>>>>>>>>>         # data = data + cs
>>>>>>>>>>>>>
>>>>>>>>>>>>>         # took 699.0907781124115 seconds
>>>>>>>>>>>>>         # if "/./" in f:
>>>>>>>>>>>>>         #     data = data + "./" + f.split("/./")[1]
>>>>>>>>>>>>>         # data = data + cs
>>>>>>>>>>>>>
>>>>>>>>>>>>>         if "/./" in f:
>>>>>>>>>>>>>             data += "./" + f.split("/./")[1]
>>>>>>>>>>>>>         data += cs
>>>>>>>>>>>>>
>>>>>>>>>>>>>         # if "/./" in f:
>>>>>>>>>>>>>         #     data += "./"
>>>>>>>>>>>>>         #     data += f.split("/./")[1]
>>>>>>>>>>>>>         # data += cs
>>>>>>>>>>>>>
>>>>>>>>>>>>> #        if "/./" in f:
>>>>>>>>>>>>> #            datalist.extend(("./", f.split("/./")[1]))
>>>>>>>>>>>>> #        datalist.extend(cs)
>>>>>>>>>>>>> #data = "".join(datalist)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 14, 2023 at 12:30 PM Etienne Cordonnier <
>>>>>>>>>>>>> ecordonnier@snap.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> For instance on my linux machine:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> $ python concatenate.py /usr
>>>>>>>>>>>>>> begin: 1676374195.3408449
>>>>>>>>>>>>>> end loop1: 3.8941030502319336
>>>>>>>>>>>>>> end loop2: 0.026114225387573242
>>>>>>>>>>>>>> end loop3: 0.8803601264953613
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 14, 2023 at 12:28 PM Etienne Cordonnier <
>>>>>>>>>>>>>> ecordonnier@snap.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "I'm surprised if
>>>>>>>>>>>>>>> -data = data + "./" + f.split("/./")[1]
>>>>>>>>>>>>>>> +data += "./" + f.split("/./")[1]
>>>>>>>>>>>>>>> shows any significant improvement,"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -> It is surprising indeed, but it shows a fully
>>>>>>>>>>>>>>> reproducible massive improvement on my setup because of this CPython
>>>>>>>>>>>>>>> optimization (I have a directory in SRC_URI with 70K files, however I'm not
>>>>>>>>>>>>>>> using externalsrc). You can see that it is also fully reproducible with a
>>>>>>>>>>>>>>> generic test python script (the benchmark script I shared in
>>>>>>>>>>>>>>> https://stackoverflow.com/q/75313204/1710392
>>>>>>>>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_q_75313204_1710392&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=gt-tyvzhSf7FdU4C_i2N28p3HBcbuG1YK7V2zRfzMRyccJoUdOu3Mgd9-FEDkJ27&s=ExrYHlBJDdIxPj1Mt10Pl1_Yzsq6c8OWoUXDXyW2l48&e=>
>>>>>>>>>>>>>>> ).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Feb 14, 2023 at 12:24 PM Martin Jansa <
>>>>>>>>>>>>>>> martin.jansa@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Etienne,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm didn't notice that this was already merged in master.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm surprised if
>>>>>>>>>>>>>>>> -data = data + "./" + f.split("/./")[1]
>>>>>>>>>>>>>>>> +data += "./" + f.split("/./")[1]
>>>>>>>>>>>>>>>> shows any significant improvement, because that's what I
>>>>>>>>>>>>>>>> was testing before when working on:
>>>>>>>>>>>>>>>> https://bugzilla.yoctoproject.org/show_bug.cgi?id=14942
>>>>>>>>>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugzilla.yoctoproject.org_show-5Fbug.cgi-3Fid-3D14942&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=HPBNj1QO1qPbzf1RJ4lAV7mnwD4_jzP_YHYEg6vvnJs-ws8Hyh5xTgTYHnjbXlOZ&s=OvBJbC8LaGfBfX3f9OMXk24rW24sXFje0wa14nh_xPo&e=>
>>>>>>>>>>>>>>>> and it made no difference, I was assuming that you'll need
>>>>>>>>>>>>>>>> to split it as:
>>>>>>>>>>>>>>>> data += "./"
>>>>>>>>>>>>>>>> data += f.split("/./")[1]
>>>>>>>>>>>>>>>> to trigger that optimalization.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is the simplified test I'm using to simulate
>>>>>>>>>>>>>>>> externalsrc.bbclass on huge repo (with those 380227 paths):
>>>>>>>>>>>>>>>> 51MB checksums.json:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://drive.google.com/file/d/1tA24mXF7cJ8RonoevFmU_7n4xh9k1KoS/view?usp=sharing
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And the "test" (haven't updated it yet, because the host is
>>>>>>>>>>>>>>>> busy with building many things ATM).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> #!/usr/bin/env python3
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> import json
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> data = ''
>>>>>>>>>>>>>>>> with open('checksums.json', "r") as f:
>>>>>>>>>>>>>>>>     checksums = json.load(f)
>>>>>>>>>>>>>>>> import time
>>>>>>>>>>>>>>>> start_time = time.time()
>>>>>>>>>>>>>>>> for (f, cs) in checksums:
>>>>>>>>>>>>>>>>     if cs:
>>>>>>>>>>>>>>>>         # 1018.7424070835114s
>>>>>>>>>>>>>>>>         # data += cs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         # 1061.4139277935028
>>>>>>>>>>>>>>>>         # data = data + cs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         # 1020.9841980934143
>>>>>>>>>>>>>>>>         # if "/./" in f:
>>>>>>>>>>>>>>>>         #     data = data + "./" + f.split("/./")[1]
>>>>>>>>>>>>>>>>         # data = data + cs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         # Test again 2023-02-14
>>>>>>>>>>>>>>>>         # took 1076.6172671318054 seconds
>>>>>>>>>>>>>>>>         # data = data + cs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         if "/./" in f:
>>>>>>>>>>>>>>>>             data = data + "./" + f.split("/./")[1]
>>>>>>>>>>>>>>>>         data = data + cs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> #        if "/./" in f:
>>>>>>>>>>>>>>>> #            data += "./"
>>>>>>>>>>>>>>>> #            data += f.split("/./")[1]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> #        if "/./" in f:
>>>>>>>>>>>>>>>> #            data += "./" + f.split("/./")[1]
>>>>>>>>>>>>>>>> #        data = data + cs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> #       data += cs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> stop_time = time.time()
>>>>>>>>>>>>>>>> print("took %s seconds" % (stop_time - start_time))
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Feb 14, 2023 at 12:15 PM Etienne Cordonnier <
>>>>>>>>>>>>>>>> ecordonnier@snap.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Martin, I don't want to spam the mailing list, but
>>>>>>>>>>>>>>>>> basically you can try to replace this line:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -                    data = data + "./" + f.split("/./")[1]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With those lines:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> + datalist = []
>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>> +  datalist.extend(("./", f.split("/./")[1]))
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> + datalist = "".join(datalist)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And then you need to append datalist to data.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think it could be even faster than my patch, but for my test-case this part of the code doesn't seem to be the bottleneck any more anyway, so I don't think I'll spend more time testing this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Feb 14, 2023 at 12:04 PM Etienne Cordonnier <
>>>>>>>>>>>>>>>>> ecordonnier@snap.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Note that this is just a cherry-pick from the main
>>>>>>>>>>>>>>>>>> branch, so I kept the commit exactly identical.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> However in case your test-case is still slow, I believe
>>>>>>>>>>>>>>>>>> there is a potential for speed optimization by creating a list and then
>>>>>>>>>>>>>>>>>> calling join() one time once the loop is finished ("loop 5" in
>>>>>>>>>>>>>>>>>> https://stackoverflow.com/a/75313327/1710392
>>>>>>>>>>>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_a_75313327_1710392&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=HPBNj1QO1qPbzf1RJ4lAV7mnwD4_jzP_YHYEg6vvnJs-ws8Hyh5xTgTYHnjbXlOZ&s=Gdb2DQUOf6peT_9enI2SqvbqHXMOmYziGjtNQHkmPXs&e=>
>>>>>>>>>>>>>>>>>> ). Of course the memory usage would increase, but it's a trade-off.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Feb 14, 2023 at 11:57 AM Martin Jansa <
>>>>>>>>>>>>>>>>>> martin.jansa@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This might help with:
>>>>>>>>>>>>>>>>>>> https://bugzilla.yoctoproject.org/show_bug.cgi?id=14942
>>>>>>>>>>>>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugzilla.yoctoproject.org_show-5Fbug.cgi-3Fid-3D14942&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=yeRxgb5569zYnhHWpVFzs4fvU9NrVNah2UOPePY64Rs&e=>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> will test it on my testcase with 380227 paths and update
>>>>>>>>>>>>>>>>>>> the ticket if this helps.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Feb 14, 2023 at 11:35 AM ecordonnier via
>>>>>>>>>>>>>>>>>>> lists.openembedded.org
>>>>>>>>>>>>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.openembedded.org&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=URKYJE0IqTwSkkx42HBRLJE5BsC41VfpRdhus8DX5xY&e=>
>>>>>>>>>>>>>>>>>>> <ecordonnier=snap.com@lists.openembedded.org> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> From: Etienne Cordonnier <ecordonnier@snap.com>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> As discussed in
>>>>>>>>>>>>>>>>>>>> https://stackoverflow.com/a/4435752/1710392
>>>>>>>>>>>>>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_a_4435752_1710392&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=QpRKrH6Ih3NfErHKEJ3_t6yojMBxhRvk66sn4Gsazkw&e=>
>>>>>>>>>>>>>>>>>>>> , CPython
>>>>>>>>>>>>>>>>>>>> has an optimization for statements in the form "a = a +
>>>>>>>>>>>>>>>>>>>> b" or "a += b".
>>>>>>>>>>>>>>>>>>>> It seems that this line does not get optimized, because
>>>>>>>>>>>>>>>>>>>> it has a form a = a + b + c:
>>>>>>>>>>>>>>>>>>>> data = data + "./" + f.split("/./")[1]
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For that reason, it does a copy of data for each
>>>>>>>>>>>>>>>>>>>> iteration, potentially copying megabytes
>>>>>>>>>>>>>>>>>>>> of data for each iteration.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Changing this line causes
>>>>>>>>>>>>>>>>>>>> SignatureGeneratorBasic::get_taskhash to take 0.06 seconds
>>>>>>>>>>>>>>>>>>>> instead of 45 seconds on my test setup where SRC_URI
>>>>>>>>>>>>>>>>>>>> points to a big directory.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Note that PEP8 recommends explicitely not to use this
>>>>>>>>>>>>>>>>>>>> optimization which is specific to CPython:
>>>>>>>>>>>>>>>>>>>> "do not rely on CPython’s efficient implementation of
>>>>>>>>>>>>>>>>>>>> in-place string concatenation for statements in the form a += b or a = a +
>>>>>>>>>>>>>>>>>>>> b"
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> However, the PEP8 recommended form using "join()" also
>>>>>>>>>>>>>>>>>>>> does not avoid the copy and takes 45 seconds in my test setup:
>>>>>>>>>>>>>>>>>>>> data = ''.join((data, "./", f.split("/./")[1]))
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have changed the other lines to also use += for
>>>>>>>>>>>>>>>>>>>> consistency only, however those were in the form a = a + b
>>>>>>>>>>>>>>>>>>>> and were optimized already.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Co-authored-by: JJ Robertson <jrobertson@snap.com>
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com
>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>  lib/bb/siggen.py | 10 +++++-----
>>>>>>>>>>>>>>>>>>>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> diff --git a/lib/bb/siggen.py b/lib/bb/siggen.py
>>>>>>>>>>>>>>>>>>>> index 130b38d8..2381edd2 100644
>>>>>>>>>>>>>>>>>>>> --- a/lib/bb/siggen.py
>>>>>>>>>>>>>>>>>>>> +++ b/lib/bb/siggen.py
>>>>>>>>>>>>>>>>>>>> @@ -328,19 +328,19 @@ class
>>>>>>>>>>>>>>>>>>>> SignatureGeneratorBasic(SignatureGenerator):
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>          data = self.basehash[tid]
>>>>>>>>>>>>>>>>>>>>          for dep in self.runtaskdeps[tid]:
>>>>>>>>>>>>>>>>>>>> -            data = data + self.get_unihash(dep)
>>>>>>>>>>>>>>>>>>>> +            data += self.get_unihash(dep)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>          for (f, cs) in self.file_checksum_values[tid]:
>>>>>>>>>>>>>>>>>>>>              if cs:
>>>>>>>>>>>>>>>>>>>>                  if "/./" in f:
>>>>>>>>>>>>>>>>>>>> -                    data = data + "./" +
>>>>>>>>>>>>>>>>>>>> f.split("/./")[1]
>>>>>>>>>>>>>>>>>>>> -                data = data + cs
>>>>>>>>>>>>>>>>>>>> +                    data += "./" + f.split("/./")[1]
>>>>>>>>>>>>>>>>>>>> +                data += cs
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>          if tid in self.taints:
>>>>>>>>>>>>>>>>>>>>              if self.taints[tid].startswith("nostamp:"):
>>>>>>>>>>>>>>>>>>>> -                data = data + self.taints[tid][8:]
>>>>>>>>>>>>>>>>>>>> +                data += self.taints[tid][8:]
>>>>>>>>>>>>>>>>>>>>              else:
>>>>>>>>>>>>>>>>>>>> -                data = data + self.taints[tid]
>>>>>>>>>>>>>>>>>>>> +                data += self.taints[tid]
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>          h =
>>>>>>>>>>>>>>>>>>>> hashlib.sha256(data.encode("utf-8")).hexdigest()
>>>>>>>>>>>>>>>>>>>>          self.taskhash[tid] = h
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> 2.36.1.vfs.0.0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -=-=-=-=-=-=-=-=-=-=-=-
>>>>>>>>>>>>>>>>>>>> Links: You receive all messages sent to this group.
>>>>>>>>>>>>>>>>>>>> View/Reply Online (#14404):
>>>>>>>>>>>>>>>>>>>> https://lists.openembedded.org/g/bitbake-devel/message/14404
>>>>>>>>>>>>>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_message_14404&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=Vw-mNHnRPLvPlWEXUYciNNhBB4pZanI2ZGEsbWRNR-g&e=>
>>>>>>>>>>>>>>>>>>>> Mute This Topic:
>>>>>>>>>>>>>>>>>>>> https://lists.openembedded.org/mt/96957072/3617156
>>>>>>>>>>>>>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_mt_96957072_3617156&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=l4fkRs6786x3hj5GGhT8XQ30SmRozQ8PJGoP_xPn_y0&e=>
>>>>>>>>>>>>>>>>>>>> Group Owner: bitbake-devel+owner@lists.openembedded.org
>>>>>>>>>>>>>>>>>>>> Unsubscribe:
>>>>>>>>>>>>>>>>>>>> https://lists.openembedded.org/g/bitbake-devel/unsub
>>>>>>>>>>>>>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_unsub&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=oiN1UiZT6K3vh9RWCh9pNu6Z-ZcCPacnWP9stnxw3HyKW5Qn-XC3JVkLEOc4mUN5&s=DVQKHhGOkYy3oSbWagenl7KtRRYBC1655FauLxtgkHw&e=>
>>>>>>>>>>>>>>>>>>>> [Martin.Jansa@gmail.com]
>>>>>>>>>>>>>>>>>>>> -=-=-=-=-=-=-=-=-=-=-=-
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>

[-- Attachment #2: Type: text/html, Size: 30549 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-02-14 15:19 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-14 10:34 [kirkstone][2.0][PATCH] siggen: fix inefficient string concatenation ecordonnier
2023-02-14 10:56 ` [bitbake-devel] " Martin Jansa
2023-02-14 11:04   ` Etienne Cordonnier
     [not found]     ` <CAHUKmYbn95Mybr69s0WkU5a5j9c-2p-KbthrYqigTPwfis1q0Q@mail.gmail.com>
     [not found]       ` <CA+chaQfQmppPAYvHxCH5F2O24QmunG8dyWvmD7CmNz3pmYGimQ@mail.gmail.com>
     [not found]         ` <CAHUKmYaGb_4=YzhKrZxcSCeiNAOPTm2=yVM4KC9JP3c4HAmhBw@mail.gmail.com>
     [not found]           ` <CAHUKmYaVGiWd1gDEcih=Yz6n9xjiFQtAsb6QDi6S9quJuUteog@mail.gmail.com>
     [not found]             ` <CA+chaQc5PJYgRHQ_kTTuq62uh870SXAwAVx4OQVH9ui-K3WzgA@mail.gmail.com>
     [not found]               ` <CAHUKmYaESjaPsuXSaCnsD_Jc4f9McnPCB-SoVZpcdTw+L=JPaA@mail.gmail.com>
     [not found]                 ` <CA+chaQd_ZjXdZ6yvDTn4xqGTeT0v1-gPn4ao=+GVJLyvuYgUPQ@mail.gmail.com>
     [not found]                   ` <CAHUKmYaQOADurM5Xg+rqjTMggnzJ+PNr_hBEv7O4+H0xeowmUA@mail.gmail.com>
     [not found]                     ` <CA+chaQfM_RtiC4iejTVQvfjxyF9ALRx=Q_kR1Rcir70Og0zRFQ@mail.gmail.com>
     [not found]                       ` <CAHUKmYYDQh6KpHyBwkDwYxadTnZAPmzGmGvsFmo=dXD_EGXa9Q@mail.gmail.com>
     [not found]                         ` <CAHUKmYadeSYGPeV8VWbDnA_YaXVmw--EYQf1aPNaK0_JX4nfyQ@mail.gmail.com>
     [not found]                           ` <CA+chaQcAF6uBDSRQcVffuiAo-aJGN2JM7x1C24-VXiRydX3kOw@mail.gmail.com>
     [not found]                             ` <CAHUKmYZbm0YhePfh+1icGv25fda0X8L2315zvY1Bgi-of+PN7A@mail.gmail.com>
     [not found]                               ` <CAHUKmYa3+w1NN5kGOcnwnoGS2yRQfPe6o2NHy3TDhV=OXe24hA@mail.gmail.com>
     [not found]                                 ` <CAHUKmYY9rj+nfcxietZo6rPWfHd4Fbko3c5damyE3P+14Zvi3w@mail.gmail.com>
     [not found]                                   ` <CA+chaQdm+h1FMAOTJtkiPwY3hcnXFMjmhKC56Z3Jzfoahogh9Q@mail.gmail.com>
     [not found]                                     ` <CAHUKmYZ5DJtA4NFdisJsT2aJJQjXfqkcPbSZx_SEh6K7SS40Kg@mail.gmail.com>
2023-02-14 15:19                                       ` Martin Jansa
2023-02-14 14:49 ` Steve Sakoman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).