All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [bitbake-devel] [2.0][PATCH 1/5] runqueue: fix PSI check calculation
       [not found] <17964777D58C012A.3202@lists.openembedded.org>
@ 2023-11-10 14:06 ` Martin Jansa
  2024-02-15 23:06   ` Randy MacLeod
  0 siblings, 1 reply; 4+ messages in thread
From: Martin Jansa @ 2023-11-10 14:06 UTC (permalink / raw)
  To: martin.jansa; +Cc: bitbake-devel, steve, Chen Qi, Richard Purdie, Randy Macleod

[-- Attachment #1: Type: text/plain, Size: 5116 bytes --]

I forgot to include [RFC] tag in the subject, because these changes might
be a bit controversial.

From my testing:
https://github.com/shr-project/test-oe-build-time/commit/d5111f4472ac397c0f1197eb6366ac7d2e56453f
the reasonable pressure values for 2.0 (in kirkstone) and 2.6 (in nanbield)
are significantly different due to these changes in PSI calculation and
logic so while 1000 was reasonable value for my system in 2.0, I need
100000 with 2.6.

I've backported these changes long time ago for my local builds to use the
same PSI values for all builds.

And the controversy is that backporting these will change the expected
values (so kind of change in behavior which might not be suitable for
stable release). But the commits make it sound as a bug in the logic, so we
can also consider them as bug fixes.

The change in expected pressure values was reported in:
https://lists.openembedded.org/g/bitbake-devel/message/14942

If it's not a bug to be fixed in 2.0 then maybe we should mention different
PSI values in release notes for nanbield (I know I'm too late for that).

Regards,

On Fri, Nov 10, 2023 at 2:54 PM Martin Jansa via lists.openembedded.org
<martin.jansa=gmail.com@lists.openembedded.org> wrote:

> From: Chen Qi <Qi.Chen@windriver.com>
>
> The current PSI check calculation does not take into consideration
> the possibility of the time interval between last check and current
> check being much larger than 1s. In fact, the current behavior does
> not match what the manual says about BB_PRESSURE_MAX_XXX, even if
> the value is set to upper limit, 1000000, we still get many blocks
> on new task launch. The difference between 'total' should be divided
> by the time interval if it's larger than 1s.
>
> Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
> Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> ---
>  lib/bb/runqueue.py | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/lib/bb/runqueue.py b/lib/bb/runqueue.py
> index 8605c46c8..3e89c38bc 100644
> --- a/lib/bb/runqueue.py
> +++ b/lib/bb/runqueue.py
> @@ -198,15 +198,20 @@ class RunQueueScheduler(object):
>                  curr_cpu_pressure =
> cpu_pressure_fds.readline().split()[4].split("=")[1]
>                  curr_io_pressure =
> io_pressure_fds.readline().split()[4].split("=")[1]
>                  curr_memory_pressure =
> memory_pressure_fds.readline().split()[4].split("=")[1]
> -                exceeds_cpu_pressure =  self.rq.max_cpu_pressure and
> (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) >
> self.rq.max_cpu_pressure
> -                exceeds_io_pressure =  self.rq.max_io_pressure and
> (float(curr_io_pressure) - float(self.prev_io_pressure)) >
> self.rq.max_io_pressure
> -                exceeds_memory_pressure = self.rq.max_memory_pressure and
> (float(curr_memory_pressure) - float(self.prev_memory_pressure)) >
> self.rq.max_memory_pressure
>                  now = time.time()
> -                if now - self.prev_pressure_time > 1.0:
> +                tdiff = now - self.prev_pressure_time
> +                if tdiff > 1.0:
> +                    exceeds_cpu_pressure =  self.rq.max_cpu_pressure and
> (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) / tdiff >
> self.rq.max_cpu_pressure
> +                    exceeds_io_pressure =  self.rq.max_io_pressure and
> (float(curr_io_pressure) - float(self.prev_io_pressure)) / tdiff >
> self.rq.max_io_pressure
> +                    exceeds_memory_pressure = self.rq.max_memory_pressure
> and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) /
> tdiff > self.rq.max_memory_pressure
>                      self.prev_cpu_pressure = curr_cpu_pressure
>                      self.prev_io_pressure = curr_io_pressure
>                      self.prev_memory_pressure = curr_memory_pressure
>                      self.prev_pressure_time = now
> +                else:
> +                    exceeds_cpu_pressure =  self.rq.max_cpu_pressure and
> (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) >
> self.rq.max_cpu_pressure
> +                    exceeds_io_pressure =  self.rq.max_io_pressure and
> (float(curr_io_pressure) - float(self.prev_io_pressure)) >
> self.rq.max_io_pressure
> +                    exceeds_memory_pressure = self.rq.max_memory_pressure
> and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) >
> self.rq.max_memory_pressure
>              return (exceeds_cpu_pressure or exceeds_io_pressure or
> exceeds_memory_pressure)
>          return False
>
> --
> 2.42.1
>
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#15494):
> https://lists.openembedded.org/g/bitbake-devel/message/15494
> Mute This Topic: https://lists.openembedded.org/mt/102506693/3617156
> Group Owner: bitbake-devel+owner@lists.openembedded.org
> Unsubscribe: https://lists.openembedded.org/g/bitbake-devel/unsub [
> martin.jansa@gmail.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>
>

[-- Attachment #2: Type: text/html, Size: 6826 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [bitbake-devel] [2.0][PATCH 1/5] runqueue: fix PSI check calculation
  2023-11-10 14:06 ` [bitbake-devel] [2.0][PATCH 1/5] runqueue: fix PSI check calculation Martin Jansa
@ 2024-02-15 23:06   ` Randy MacLeod
  2024-02-15 23:14     ` Richard Purdie
  0 siblings, 1 reply; 4+ messages in thread
From: Randy MacLeod @ 2024-02-15 23:06 UTC (permalink / raw)
  To: Martin Jansa; +Cc: bitbake-devel, steve, Chen Qi, Richard Purdie

[-- Attachment #1: Type: text/plain, Size: 3772 bytes --]

On 2023-11-10 9:06 a.m., Martin Jansa wrote:
> I forgot to include [RFC] tag in the subject, because these changes 
> might be a bit controversial.
>
> From my testing:
> https://github.com/shr-project/test-oe-build-time/commit/d5111f4472ac397c0f1197eb6366ac7d2e56453f 
> <https://urldefense.com/v3/__https://github.com/shr-project/test-oe-build-time/commit/d5111f4472ac397c0f1197eb6366ac7d2e56453f__;!!AjveYdw8EvQ!f35PVorsjRHzPeCdLVCittHJcXcQTibeF0q5nqa0CcWnxVr6Zdu3-DT7IpTT6Q1miq7Pyxk8DDBJq1Nij5jI1m3znH8$>
> the reasonable pressure values for 2.0 (in kirkstone) and 2.6 (in 
> nanbield) are significantly different due to these changes in PSI 
> calculation and logic so while 1000 was reasonable value for my system 
> in 2.0, I need 100000 with 2.6.
>
> I've backported these changes long time ago for my local builds to use 
> the same PSI values for all builds.
>
> And the controversy is that backporting these will change the expected 
> values (so kind of change in behavior which might not be suitable for 
> stable release). But the commits make it sound as a bug in the logic, 
> so we can also consider them as bug fixes.
>
> The change in expected pressure values was reported in:
> https://lists.openembedded.org/g/bitbake-devel/message/14942 
> <https://urldefense.com/v3/__https://lists.openembedded.org/g/bitbake-devel/message/14942__;!!AjveYdw8EvQ!f35PVorsjRHzPeCdLVCittHJcXcQTibeF0q5nqa0CcWnxVr6Zdu3-DT7IpTT6Q1miq7Pyxk8DDBJq1Nij5jICLP93rQ$>
>
> If it's not a bug to be fixed in 2.0 then maybe we should mention 
> different PSI values in release notes for nanbield (I know I'm too 
> late for that).
>
> Regards,

So people were suggesting that the slow YP AB times might be due to PSI 
regulation.

On a 24 core system, I built core-image-minimal with the original code
from before Chen's commit in August 2023

poky.git
$ git log --oneline -2 653ff4d85cbaf53627f7978b06c1f025ac4694e2
653ff4d85c bitbake: runqueue.py: fix PSI check logic
ac5512b0ac bitbake: fetch2: add Google Cloud Platform (GCP) fetcher    
<---- Old


and with current master ~ Feb 2024 :

$ git log --oneline -2
9382d731bd bash: nativesdk-bash does not provide /bin/bash so don't 
claim to <----- New
0fe85ce0c6 busybox: Explicitly specify tty device for serial consoles

Note that the source being build is different and it would have been 
better to just change the
pressure regulaiton code but this was a weekend 'fun' project. Also, yes 
I fetched the code before
building with the set of BB_PRESSURE_MAX_CPU values.

There wasn't much of a difference as shown in the graph and data below.

I could try the same thing for world builds but I'm not sure that's 
worthwhile.
I still plan to construct a test of PSI regulation using some -native 
recipes that invoke stress-ng
enough to cause sustained (CPU or IO) pressure and then test if a new 
task will ALWAYS be prevented
from running until the pressure subsides.

../Randy


yocto/psi/feb-2024
✦ ❯ cat old.dat
100000, 82.7167
10000, 83.1833
8000, 83.4333
4000, 84.9833
2500, 88.2833
2000, 91.4
1500, 99.3667
1250, 107.6167
1000, 117.9667
750, 149.2333

yocto/psi/feb-2024
✦ ❯ cat new.dat
100000, 81.6833
10000, 83.25
5000, 83.9167
4000, 84.1167
2500, 87.65
2000, 90.6833
1500, 97.5833
1000, 120.6833
500, 194.1333
250, 213.3333

Steps:

$ bitbake --runall fetch core-image-minimal; sync; sleep 10

$ for i in 100000 10000 5000 4000 2500 2000 1500 1000 500 250 ; do \
     rm -rf sstate-cache/ cache tmp; sleep 2; \

    echo BB_PRESSURE_MAX_CPU=\"$i\" >> conf/local.conf; \

    /usr/bin/time time -o time-$i.log  bitbake core-image-minimal; \

done


-- 
# Randy MacLeod
# Wind River Linux

[-- Attachment #2.1: Type: text/html, Size: 5563 bytes --]

[-- Attachment #2.2: HFmTHJTOoDp5MXqn.png --]
[-- Type: image/png, Size: 22841 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [bitbake-devel] [2.0][PATCH 1/5] runqueue: fix PSI check calculation
  2024-02-15 23:06   ` Randy MacLeod
@ 2024-02-15 23:14     ` Richard Purdie
  2024-02-16  1:48       ` Randy MacLeod
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Purdie @ 2024-02-15 23:14 UTC (permalink / raw)
  To: Randy MacLeod, Martin Jansa; +Cc: bitbake-devel, steve, Chen Qi

[-- Attachment #1: Type: text/plain, Size: 3193 bytes --]

On Thu, 2024-02-15 at 18:06 -0500, Randy MacLeod wrote:
>  
> On 2023-11-10 9:06 a.m., Martin Jansa wrote:
>  
> >  
> > I forgot to include [RFC] tag in the subject, because these changes
> > might be a bit controversial. 
> > 
> > 
> > 
> > From my testing:
> > 
> > https://github.com/shr-project/test-oe-build-time/commit/d5111f4472ac397c0f1197eb6366ac7d2e56453f
> > 
> > the reasonable pressure values for 2.0 (in kirkstone) and 2.6 (in
> > nanbield) are significantly different due to these changes in PSI
> > calculation and logic so while 1000 was reasonable value for my
> > system in 2.0, I need 100000 with 2.6.
> > 
> > 
> > 
> > 
> > I've backported these changes long time ago for my local builds to
> > use the same PSI values for all builds.
> > 
> > 
> > 
> > 
> > And the controversy is that backporting these will change the
> > expected values (so kind of change in behavior which might not be
> > suitable for stable release). But the commits make it sound as a
> > bug in the logic, so we can also consider them as bug fixes.
> > 
> > 
> > 
> > 
> > The change in expected pressure values was reported in:
> > 
> > https://lists.openembedded.org/g/bitbake-devel/message/14942
> > 
> > 
> > 
> > 
> > If it's not a bug to be fixed in 2.0 then maybe we should mention
> > different PSI values in release notes for nanbield (I know I'm too
> > late for that).
> > 
> > 
> > 
> > 
> > Regards,
> >  
>  
> So people were suggesting that the slow YP AB times might be due to
> PSI regulation.
> 
> On a 24 core system, I built core-image-minimal with the original
> code 
> from before Chen's commit in August 2023 
>  
> poky.git
> $ git log --oneline -2 653ff4d85cbaf53627f7978b06c1f025ac4694e2
> 653ff4d85c bitbake: runqueue.py: fix PSI check logic 
> ac5512b0ac bitbake: fetch2: add Google Cloud Platform (GCP)
> fetcher    <---- Old
>  
> 
>  
>  
> and with current master ~ Feb 2024 :
>  
> $ git log --oneline -2 
> 9382d731bd bash: nativesdk-bash does not provide /bin/bash so don't
> claim to <----- New
> 0fe85ce0c6 busybox: Explicitly specify tty device for serial consoles
>  
> Note that the source being build is different and it would have been
> better to just change the
> pressure regulaiton code but this was a weekend 'fun' project. Also,
> yes I fetched the code before
> building with the set of BB_PRESSURE_MAX_CPU values.
>  
> There wasn't much of a difference as shown in the graph and data
> below.
>  
> I could try the same thing for world builds but I'm not sure that's
> worthwhile.
> I still plan to construct a test of PSI regulation using some -native
> recipes that invoke stress-ng
> enough to cause sustained (CPU or IO) pressure and then test if a new
> task will ALWAYS be prevented
> from running until the pressure subsides. 

Thanks for checking that, it is helpful to rule things out. I think the
bigger question was whether we needed to adjust the pressure values and
whether the autobuilder is under-utilising. I'm surprised and slightly
suspicious when the graphs have correlation that good btw :)

Cheers,

Richard

>  


[-- Attachment #2: Type: text/html, Size: 4672 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [bitbake-devel] [2.0][PATCH 1/5] runqueue: fix PSI check calculation
  2024-02-15 23:14     ` Richard Purdie
@ 2024-02-16  1:48       ` Randy MacLeod
  0 siblings, 0 replies; 4+ messages in thread
From: Randy MacLeod @ 2024-02-16  1:48 UTC (permalink / raw)
  To: Richard Purdie, Martin Jansa; +Cc: bitbake-devel, steve, Chen Qi

[-- Attachment #1: Type: text/plain, Size: 5370 bytes --]

On 2024-02-15 6:14 p.m., Richard Purdie wrote:
> On Thu, 2024-02-15 at 18:06 -0500, Randy MacLeod wrote:
>> On 2023-11-10 9:06 a.m., Martin Jansa wrote:
>>
>>> I forgot to include [RFC] tag in the subject, because these changes 
>>> might be a bit controversial.
>>>
>>> From my testing:
>>> https://github.com/shr-project/test-oe-build-time/commit/d5111f4472ac397c0f1197eb6366ac7d2e56453f 
>>> <https://urldefense.com/v3/__https://github.com/shr-project/test-oe-build-time/commit/d5111f4472ac397c0f1197eb6366ac7d2e56453f__;!!AjveYdw8EvQ!f35PVorsjRHzPeCdLVCittHJcXcQTibeF0q5nqa0CcWnxVr6Zdu3-DT7IpTT6Q1miq7Pyxk8DDBJq1Nij5jI1m3znH8$>
>>> the reasonable pressure values for 2.0 (in kirkstone) and 2.6 (in 
>>> nanbield) are significantly different due to these changes in PSI 
>>> calculation and logic so while 1000 was reasonable value for my 
>>> system in 2.0, I need 100000 with 2.6.
>>>
>>> I've backported these changes long time ago for my local builds to 
>>> use the same PSI values for all builds.
>>>
>>> And the controversy is that backporting these will change the 
>>> expected values (so kind of change in behavior which might not be 
>>> suitable for stable release). But the commits make it sound as a bug 
>>> in the logic, so we can also consider them as bug fixes.
>>>
>>> The change in expected pressure values was reported in:
>>> https://lists.openembedded.org/g/bitbake-devel/message/14942 
>>> <https://urldefense.com/v3/__https://lists.openembedded.org/g/bitbake-devel/message/14942__;!!AjveYdw8EvQ!f35PVorsjRHzPeCdLVCittHJcXcQTibeF0q5nqa0CcWnxVr6Zdu3-DT7IpTT6Q1miq7Pyxk8DDBJq1Nij5jICLP93rQ$>
>>>
>>> If it's not a bug to be fixed in 2.0 then maybe we should mention 
>>> different PSI values in release notes for nanbield (I know I'm too 
>>> late for that).
>>>
>>> Regards,
>>
>> So people were suggesting that the slow YP AB times might be due to 
>> PSI regulation.
>>
>> On a 24 core system, I built core-image-minimal with the original code
>> from before Chen's commit in August 2023
>>
>> poky.git
>> $ git log --oneline -2 653ff4d85cbaf53627f7978b06c1f025ac4694e2
>> 653ff4d85c bitbake: runqueue.py 
>> <https://urldefense.com/v3/__http://runqueue.py__;!!AjveYdw8EvQ!f5JS427vmCgvFjpobXMoKkWemtw0ZHFuTZhawQuPUzSeHjv22YOn5jDF9_QASdCRR2hDP4SptFSb1WF57ZYDLSQ-pN48msQ19wpneHc$>: 
>> fix PSI check logic
>> ac5512b0ac bitbake: fetch2: add Google Cloud Platform (GCP) 
>> fetcher    <---- Old
>>
>>
>> and with current master ~ Feb 2024 :
>>
>> $ git log --oneline -2
>> 9382d731bd bash: nativesdk-bash does not provide /bin/bash so don't 
>> claim to <----- New
>> 0fe85ce0c6 busybox: Explicitly specify tty device for serial consoles
>>
>> Note that the source being build is different and it would have been 
>> better to just change the
>> pressure regulaiton code but this was a weekend 'fun' project. Also, 
>> yes I fetched the code before
>> building with the set of BB_PRESSURE_MAX_CPU values.
>>
>> There wasn't much of a difference as shown in the graph and data below.
>>
>> I could try the same thing for world builds but I'm not sure that's 
>> worthwhile.
>> I still plan to construct a test of PSI regulation using some -native 
>> recipes that invoke stress-ng
>> enough to cause sustained (CPU or IO) pressure and then test if a new 
>> task will ALWAYS be prevented
>> from running until the pressure subsides.
>>
>
> Thanks for checking that, it is helpful to rule things out. I think 
> the bigger question was whether we needed to adjust the pressure 
> values and whether the autobuilder is under-utilising.

Right. Did you want me to send a patch to disable pressure regulation 
completely or
increase the limit in the y-ab-helper config.json for master-next?


> I'm surprised and slightly suspicious when the graphs have correlation 
> that good btw :)

Yeah, it's particularily odd when even the avg10 (seconds) data is so 
variable.
For fun, after dinner today, I took the cpu pressure data from 
BB_PRESSURE_MAX_CPU = 1K, 10K
and played with it.

The 10K limited run finishes in 4945 seconds and the more regulated 1K 
run takes 7193 seconds.
I just munged the logged pressure data using emacs macros because I know 
how to do that and
I didn't bother to figure out how to merge 3 lines using awk, python, 
and then filtered it with awk (1)
to stretch out the 10K time values to get:


You see that the avg10 pressure is generally much higher for the 10K 
run, thereby allowing the full build
to complete in less time. to 1K run, has lower avg10 pressure but 
occasionally jobs escape regulation and
the system is overloaded for a while. Once or twice in the middle of the 
build, a large job escapes regulation.

At least that's what I see! ;-)

With a job server pool and improvements to our regulation algorithm, the 
CPU pressure should smooth out  quite a bit... some day...

../Randy


1)

yocto/psi/feb-2024
✦ ❯ cat yp-ab-build-cpu-feb-9-2024/feb11/reduced-psi-10000/cpu.dat | awk 
'{ if (NR == 1) { for (i = 1; i <= NF; i++) { first_row[i] = $i} }; 
print (($1 - first_row[1])*7193.0/4945.0) ", " $2}' > 
yp-ab-build-cpu-feb-9-2024/feb11/reduced-psi-10000/cpu-time-avg10-scaled-1000-elapsed.dat


>
> Cheers,
>
> Richard
>
>

-- 
# Randy MacLeod
# Wind River Linux

[-- Attachment #2.1: Type: text/html, Size: 8949 bytes --]

[-- Attachment #2.2: veAFub9nsCKTt0Oa.png --]
[-- Type: image/png, Size: 30485 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-02-16  1:48 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <17964777D58C012A.3202@lists.openembedded.org>
2023-11-10 14:06 ` [bitbake-devel] [2.0][PATCH 1/5] runqueue: fix PSI check calculation Martin Jansa
2024-02-15 23:06   ` Randy MacLeod
2024-02-15 23:14     ` Richard Purdie
2024-02-16  1:48       ` Randy MacLeod

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.