bitbake-devel.lists.openembedded.org archive mirror
* Bitbake PSI checker
@ 2022-12-12 10:07 Ola x Nilsson
  2022-12-12 20:48 ` [bitbake-devel] " Randy MacLeod
  0 siblings, 1 reply; 9+ messages in thread
From: Ola x Nilsson @ 2022-12-12 10:07 UTC (permalink / raw)
  To: bitbake-devel

Hi,

I've been looking into using the pressure stall information awareness of
bitbake but I have some problems getting it to work.  Actually I think
it just doesn't work at all.

Reading the code I find that
runqueue.RunQueueScheduler.exceeds_max_pressure claims to "Monitor the
difference in pressure at least once per second".  But using some
debugprints added to that method I see output like

1670840023.757171 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
1670840023.758697 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
1670840023.760158 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
1670840023.761733 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
1670840023.959357 cpu_pressure 969.0 io_pressure 16135.0 mem_pressure 0.0
1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
1670840042.384582 cpu io  pressure exceeded over 18.677629 seconds
1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0
1670840042.490340 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0

where the first column is the value of 'now', and the pressure values
are the calculated deltas.  The 0-pressure values are probably because
this is very early in the run and the time delta is less than 0.01
seconds. 

But there is a time delta of almost 19 seconds between line 5 and 6, and
unsurprisingly the pressure exceeds my max settings of CPU:600000 and
IO:200000.

But the very next check is only 0.1 second later and while the
prev-values won't be updated, the calculated pressure will be used.  This
pressure will be below my settings and a new task will be started.

Am I missing something here?  If the pressure should be monitored each
second, isn't it reasonable to have some sort of tick to update the
prev-values?  And using the pressure delta of intervals of less than a
second also seems to give too low pressure values.
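To illustrate, here is a minimal standalone model of the check as I read
it (the names are mine, not the actual bitbake code):

```python
# Minimal model of the current check as I read it: prev_* is only
# refreshed when more than 1 s has passed, so a check made shortly after
# a long gap compares a tiny delta against the threshold and passes.

class PsiChecker:
    def __init__(self, max_pressure):
        self.max_pressure = max_pressure
        self.prev_pressure = 0.0
        self.prev_time = 0.0

    def exceeds(self, now, total):
        delta = total - self.prev_pressure
        exceeded = delta > self.max_pressure
        if now - self.prev_time > 1.0:      # reference only updated here
            self.prev_pressure = total
            self.prev_time = now
        return exceeded

c = PsiChecker(max_pressure=600000)
c.exceeds(0.0, 0.0)                  # establish the reference
print(c.exceeds(18.7, 8978077.0))    # True: ~19 s of stall time at once
print(c.exceeds(18.8, 8978543.0))    # False: only 0.1 s of new stall time
```

With thresholds like mine, the second call lets a new task start even
though the system is still under heavy pressure.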

/Ola Nilsson


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [bitbake-devel] Bitbake PSI checker
  2022-12-12 10:07 Bitbake PSI checker Ola x Nilsson
@ 2022-12-12 20:48 ` Randy MacLeod
  2022-12-19 12:50   ` Ola x Nilsson
  0 siblings, 1 reply; 9+ messages in thread
From: Randy MacLeod @ 2022-12-12 20:48 UTC (permalink / raw)
  To: bitbake-devel, Richard Purdie, ola.x.nilsson; +Cc: Zheng.qiu

CCing Richard

On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote:
> Hi,
>
> I've been looking into using the pressure stall information awareness of
> bitbake
That's good to hear Ola.
>   but I have some problems getting it to work.  Actually I think
> it just doesn't work at all.

Doesn't work at all?

Well that would be surprising. See below.

>
> Reading the code I find that
> runqueue.RunQueueScheduler.exceeds_max_pressure claims to "Monitor the
> difference in pressure at least once per second".

That comment isn't accurate. I'll fix it.

Currently, the pressure is only checked when
bitbake is looking for the next_buildable_task.

This can occur many/100s of times per second at some points in a build
and later, when larger recipes are compiling, the function may not be called
for 10s or 100s of seconds depending on what is being built.


>   But using some
> debugprints added to that method I see output like
>
> 1670840023.757171 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
> 1670840023.758697 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
> 1670840023.760158 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
> 1670840023.761733 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
> 1670840023.959357 cpu_pressure 969.0 io_pressure 16135.0 mem_pressure 0.0
   19 second gap
> 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
> 1670840042.384582 cpu io  pressure exceeded over 18.677629 seconds
> 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0
> 1670840042.490340 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0
>
> where the first column is the value of 'now', and the pressure values
> are the calculated deltas.  The 0-pressure values are probably because
> this is very early in the run and the time delta is less than 0.01
> seconds.
>
> But there is a time delta of almost 19 seconds between line 5 and 6, and
> unsurprisingly the pressure exceeds my max settings of CPU:600000 and
> IO:200000.
>
> But the very next check is only 0.1 second later and while the
> prev-values won't be updated, the calculated pressure will be used.  This
> pressure will be below my settings and a new task will be started.

Yes, that's a bug and I need to fix it. See below.

>
> Am I missing something here?

You aren't missing anything.

The code has "limitations" but it has still proven useful to some people
and on the Yocto Autobuilder system. Note the lack of 'interval' errors
starting around Aug 18th, 2022, when we enabled this feature for the YP
Autobuilder:

    https://autobuilder.yocto.io/pub/non-release/


> If the pressure should be monitored each
> second, isn't it reasonable to have some sort of tick to update the
> pre-values?  And using the pressure delta of intervals of less than a
> second also seems to give too low pressure values.

That would be a better implementation in some ways but
what we've done so far is only check the pressure when
bitbake is checking for a new task to run. This will be less
intrusive and people do worry about the efficiency of bitbake.
Adding a 1 second timer may not be where we want to go.

It's a little tricky to provide short-term averaging regardless
of how often the function is called. Here are the improvements
that I'm considering:

1. Rather than just keep track of the previous pressure values
seen more than 1 second ago as done currently:

       if now - self.prev_pressure_time > 1.0:

and always using that as a reference, we can
store say 10 values per second and use that as a reference.

There are some challenges in that approach in that we don't control
how often the function is called. Averaging over the last 10 calls
is tempting but likely has some edge cases such as when there are
lots of tasks starting/ending.
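A rough sketch of that sample-history idea, assuming we keep (time,
cumulative total) pairs and use the oldest sample within about a second
as the reference (names are mine, not bitbake's):

```python
import collections

# Sketch of idea 1: keep a short history of (timestamp, cumulative total)
# samples and measure the delta against the oldest sample within roughly
# the window, rather than against a single prev_* snapshot.

class PressureWindow:
    def __init__(self, window=1.0):
        self.window = window
        self.samples = collections.deque()  # (timestamp, cumulative total)

    def rate(self, now, total):
        self.samples.append((now, total))
        # Drop samples older than the window, but keep one as the reference.
        while len(self.samples) > 1 and now - self.samples[1][0] >= self.window:
            self.samples.popleft()
        t0, v0 = self.samples[0]
        if now == t0:
            return 0.0
        return (total - v0) / (now - t0)  # stall microseconds per second

w = PressureWindow()
w.rate(0.0, 0.0)
w.rate(0.5, 100.0)
print(w.rate(1.0, 300.0))  # 300.0, averaged over the whole 1 s window
```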


2. If there has been a long delay since the function was last called,
we could check the pressure, sleep for a short period of time, and check it
again. Some people would not like this since it will needlessly delay the
build, so we'd have to keep the delay to < 1 second. Too short a delay will
reduce the accuracy of the result, but I suspect that 0.1 seconds is
sufficient for most users. We could also look at the avg10 value in this
case, or even some combination of both the current contention and avg10.
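A sketch of this check-sleep-recheck idea; `read_total()` here is a fake
stand-in for parsing the total= field from /proc/pressure/cpu:

```python
import time

def read_total():
    # Fake cumulative stall counter: grows by 50000 us of stall time per
    # second of wall time. The real code would parse total= from
    # /proc/pressure/cpu instead.
    return time.monotonic() * 50000

def resampled_rate(delay=0.1):
    # Take two samples a known, short interval apart so the delta is not
    # dominated by however long ago the scheduler last called us.
    t0, v0 = time.monotonic(), read_total()
    time.sleep(delay)
    t1, v1 = time.monotonic(), read_total()
    return (v1 - v0) / (t1 - t0)

print(round(resampled_rate()))  # ~50000 for the fake counter above
```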


3. Just calculate the pressure per second by:

    ( current pressure - last pressure ) / (now - last_time)

This could handle short time differences such as milliseconds and would
be a 'cheap' way to deal with long delays. In your case,
the pressure would be:

   cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0

divided by ~19 since the initial values were close to zero.

Then for the next time, just 0.1 seconds later:

1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
1670840042.384582 cpu io  pressure exceeded over 18.677629 seconds
1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0

Multiplying by 10 for easy calculation, that would be a pressure of:

cpu: 4660, io: 307920, mem: 0.
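Spelling idea 3 out with the numbers from Ola's log (the interval comes
from the "exceeded over 18.677629 seconds" line; the function name is
mine):

```python
def per_second(delta, tdiff):
    # Normalize a raw total-stall delta (microseconds) to a per-second rate.
    return delta / tdiff

# The ~19 s gap: the huge raw delta collapses to a modest per-second rate.
cpu = per_second(8978077.0, 18.677629)
print(round(cpu))   # ~480700, under a 600000 CPU threshold

# The 0.1 s follow-up: the tiny raw delta is scaled up instead of
# slipping under the threshold.
io = per_second(30792.0, 1670840042.486946 - 1670840042.384582)
print(round(io))    # ~300800, over a 200000 IO threshold
```

So the same data would have blocked the follow-up check instead of
letting a new task through.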


Do you have another idea or a preference as to which approach we take?


../Randy


>
> /Ola Nilsson
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#14178): https://lists.openembedded.org/g/bitbake-devel/message/14178
> Mute This Topic: https://lists.openembedded.org/mt/95618299/3616765
> Group Owner: bitbake-devel+owner@lists.openembedded.org
> Unsubscribe: https://lists.openembedded.org/g/bitbake-devel/unsub [randy.macleod@windriver.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>

-- 
# Randy MacLeod
# Wind River Linux




* Re: [bitbake-devel] Bitbake PSI checker
  2022-12-12 20:48 ` [bitbake-devel] " Randy MacLeod
@ 2022-12-19 12:50   ` Ola x Nilsson
  2022-12-19 19:49     ` contrib
  0 siblings, 1 reply; 9+ messages in thread
From: Ola x Nilsson @ 2022-12-19 12:50 UTC (permalink / raw)
  To: Randy MacLeod; +Cc: Richard Purdie, Zheng.qiu, bitbake-devel


On Mon, Dec 12 2022, Randy MacLeod wrote:

> CCing Richard
>
> On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote:
>> Hi,
>>
>> I've been looking into using the pressure stall information awareness of
>> bitbake
> That's good to hear Ola.
>>   but I have some problems getting it to work.  Actually I think
>> it just doesn't work at all.
>
> Doesn't work at all?
>
> Well that would be surprising. See below.

OK, it will occasionally block a task. But since the next check almost
always comes after a very short interval, it will almost always start a new
task even if the pressure is high.
At least this is what I observe on my system.

<snip>

> 1. Rather than just keep track of the previous pressure values
> seen more than 1 second ago as done currently:
>
>        if now - self.prev_pressure_time > 1.0:
>
> and always using that as a reference, we can
> store say 10 values per second and use that as a reference.
>
> There are some challenges in that approach in that we don't control
> how often the function is called. Averaging over the last 10 calls
> is tempting but likely has some edge cases such as when there are
> lots of tasks starting/ending.
>
>
> 2. If there has been a long delay since the function was last called,
> we could check the pressure, sleep for a short period of time and check it
> again. Some people would not like this since it will needlessly delay 
> the build
> so we'd have to keep the delay to < 1 second. Too short a delay will reduce
> the accuracy of the result but I suspect that 0.1 seconds is sufficient 
> for most
> users. We could also look at the avg10 value in this case or even some 
> combination of
> both the current contention and avg10.
>
>
> 3. Just calculate the pressure per second by:
>
>     ( current pressure - last pressure ) / (now - last_time)
>
> This could handle short time differences such as milliseconds and would
> be a 'cheap' way to deal with long delays. In your case,
> the pressure would be:
>
>    cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
>
> divided by ~19 since the initial values were close to zero.
>
> Then for the next time, just 0.1 seconds later:
>
> 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
> 1670840042.384582 cpu io  pressure exceeded over 18.677629 seconds
> 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0
>
> Multiplying by 10 for easy calculation, that would be a pressure of:
>
> cpu: 4660, io: 307920, mem: 0.
>
>
> Do you have another idea or a preference as to which approach we take?

I think 3 is a good first step.  Using multiple samples could improve
our calculated "avg1", but let's do that later if needed.

/Ola

>
> ../Randy
>
>
>>
>> /Ola Nilsson
>>
>> 
>>




* Re: [bitbake-devel] Bitbake PSI checker
  2022-12-19 12:50   ` Ola x Nilsson
@ 2022-12-19 19:49     ` contrib
  2023-05-20 19:58       ` Randy MacLeod
  0 siblings, 1 reply; 9+ messages in thread
From: contrib @ 2022-12-19 19:49 UTC (permalink / raw)
  To: Ola x Nilsson; +Cc: Randy MacLeod, Richard Purdie, bitbake-devel

[-- Attachment #1: Type: text/plain, Size: 4281 bytes --]



> On Dec 19, 2022, at 7:50 AM, Ola x Nilsson <ola.x.nilsson@axis.com> wrote:
> 
> 
> On Mon, Dec 12 2022, Randy MacLeod wrote:
> 
>> CCing Richard
>> 
>> On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote:
>>> Hi,
>>> 
>>> I've been looking into using the pressure stall information awareness of
>>> bitbake
>> That's good to hear Ola.
>>>  but I have some problems getting it to work.  Actually I think
>>> it just doesn't work at all.
>> 
>> Doesn't work at all?
>> 
>> Well that would be surprising. See below.
> 
> OK, it will occasionally block a task. But since the next attempt will
> always be a very short time interval it will almost always start a new
> task even if the pressure is high.
> At least this is what I observe on my system.
> 
> <snip>
> 
>> 1. Rather than just keep track of the previous pressure values
>> seen more than 1 second ago as done currently:
>> 
>>       if now - self.prev_pressure_time > 1.0:
>> 
>> and always using that as a reference, we can
>> store say 10 values per second and use that as a reference.
>> 
>> There are some challenges in that approach in that we don't control
>> how often the function is called. Averaging over the last 10 calls
>> is tempting but likely has some edge cases such as when there are
>> lots of tasks starting/ending.
>> 
>> 
>> 2. If there has been a long delay since the function was last called,
>> we could check the pressure, sleep for a short period of time and check it
>> again. Some people would not like this since it will needlessly delay 
>> the build
>> so we'd have to keep the delay to < 1 second. Too short a delay will reduce
>> the accuracy of the result but I suspect that 0.1 seconds is sufficient 
>> for most
>> users. We could also look at the avg10 value in this case or even some 
>> combination of
>> both the current contention and avg10.
>> 
>> 
>> 3. Just calculate the pressure per second by:
>> 
>>    ( current pressure - last pressure ) / (now - last_time)
>> 
>> This could handle short time differences such as milliseconds and would
>> be a 'cheap' way to deal with long delays. In your case,
>> the pressure would be:
>> 
>>   cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
>> 
>> divided by ~19 since the initial values were close to zero.
>> 
>> Then for the next time, just 0.1 seconds later:
>> 
>> 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
>> 1670840042.384582 cpu io  pressure exceeded over 18.677629 seconds
>> 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0
>> 
>> Multiplying by 10 for easy calculation, that would be a pressure of:
>> 
>> cpu: 4660, io: 307920, mem: 0.
>> 
>> 
>> Do you have another idea or a preference as to which approach we take?
> 
> I think 3 is a good first step.  Using multiple samples could improve
> our calculated "avg1", but let's do that later if needed.

I agree; Randy and I have been working on patching make and have taken a similar approach:
https://github.com/ZhengQ2/make/tree/cpu-pressure
Additionally, we found that when the pressure is read too frequently, we may
get the same cpu pressure as a result, even if the pressure has actually
changed. This is likely due to the per-cpu variables used in the kernel.
So, in addition to the algorithm Randy described above, we also compare
whether the cpu pressure has changed; if not, we return the last result
that was produced.
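As a simplified sketch of that comparison (the names are hypothetical;
the real change is in the make branch above):

```python
# When two reads in quick succession return an identical cumulative total,
# which can happen with the kernel's per-cpu accounting, reuse the
# previously computed rate instead of reporting a false zero.

class CachedPressure:
    def __init__(self):
        self.last_total = None
        self.last_rate = 0.0

    def rate(self, tdiff, total):
        if total == self.last_total:
            return self.last_rate       # reading unchanged: keep old answer
        if self.last_total is not None:
            self.last_rate = (total - self.last_total) / tdiff
        self.last_total = total
        return self.last_rate

p = CachedPressure()
p.rate(1.0, 1000.0)
print(p.rate(1.0, 3000.0))   # 2000.0
print(p.rate(0.01, 3000.0))  # 2000.0 again: stale read, cached rate reused
```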

I will CC you when I have a patch, and you can try it out before the commit gets merged if you like.

ZQ

> 
> /Ola
> 
>> 
>> ../Randy
>> 
>> 
>>> 
>>> /Ola Nilsson
>>> 
>>> 
>>> 
> 
> 
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#14199): https://lists.openembedded.org/g/bitbake-devel/message/14199
> Mute This Topic: https://lists.openembedded.org/mt/95618299/7355053
> Group Owner: bitbake-devel+owner@lists.openembedded.org <mailto:bitbake-devel+owner@lists.openembedded.org>
> Unsubscribe: https://lists.openembedded.org/g/bitbake-devel/unsub [contrib@zhengqiu.net <mailto:contrib@zhengqiu.net>]
> -=-=-=-=-=-=-=-=-=-=-=-


[-- Attachment #2.1: Type: text/html, Size: 25607 bytes --]

[-- Attachment #2.2: make.png --]
[-- Type: image/png, Size: 107869 bytes --]


* Re: [bitbake-devel] Bitbake PSI checker
  2022-12-19 19:49     ` contrib
@ 2023-05-20 19:58       ` Randy MacLeod
  2023-05-22  2:17         ` ChenQi
  0 siblings, 1 reply; 9+ messages in thread
From: Randy MacLeod @ 2023-05-20 19:58 UTC (permalink / raw)
  To: contrib, Ola x Nilsson; +Cc: Richard Purdie, bitbake-devel, Chen, Qi

[-- Attachment #1: Type: text/plain, Size: 10014 bytes --]

On 2022-12-19 14:49, Zheng Qiu via lists.openembedded.org wrote:
>
>
>> On Dec 19, 2022, at 7:50 AM, Ola x Nilsson <ola.x.nilsson@axis.com> 
>> wrote:
>>
>>
>> On Mon, Dec 12 2022, Randy MacLeod wrote:
>>
>>> CCing Richard
>>>
>>> On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote:
>>>> Hi,
>>>>
>>>> I've been looking into using the pressure stall information 
>>>> awareness of
>>>> bitbake
>>> That's good to hear Ola.
>>>>  but I have some problems getting it to work.  Actually I think
>>>> it just doesn't work at all.
>>>
>>> Doesn't work at all?
>>>
>>> Well that would be surprising. See below.
>>
>> OK, it will occasionally block a task. But since the next attempt will
>> always be a very short time interval it will almost always start a new
>> task even if the pressure is high.
>> At least this is what I observe on my system.
>>
>> <snip>
>>
>>> 1. Rather than just keep track of the previous pressure values
>>> seen more than 1 second ago as done currently:
>>>
>>>       if now - self.prev_pressure_time > 1.0:
>>>
>>> and always using that as a reference, we can
>>> store say 10 values per second and use that as a reference.
>>>
>>> There are some challenges in that approach in that we don't control
>>> how often the function is called. Averaging over the last 10 calls
>>> is tempting but likely has some edge cases such as when there are
>>> lots of tasks starting/ending.
>>>
>>>
>>> 2. If there has been a long delay since the function was last called,
>>> we could check the pressure, sleep for a short period of time and 
>>> check it
>>> again. Some people would not like this since it will needlessly delay
>>> the build
>>> so we'd have to keep the delay to < 1 second. Too short a delay will 
>>> reduce
>>> the accuracy of the result but I suspect that 0.1 seconds is sufficient
>>> for most
>>> users. We could also look at the avg10 value in this case or even some
>>> combination of
>>> both the current contention and avg10.
>>>
>>>
>>> 3. Just calculate the pressure per second by:
>>>
>>>    ( current pressure - last pressure ) / (now - last_time)
>>>
>>> This could handle short time differences such as milliseconds and would
>>> be a 'cheap' way to deal with long delays. In your case,
>>> the pressure would be:
>>>
>>>   cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
>>>
>>> divided by ~19 since the initial values were close to zero.
>>>
>>> Then for the next time, just 0.1 seconds later:
>>>
>>> 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 
>>> mem_pressure 20922.0
>>> 1670840042.384582 cpu io  pressure exceeded over 18.677629 seconds
>>> 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 
>>> mem_pressure 0.0
>>>
>>> Multiplying by 10 for easy calculation, that would be a pressure of:
>>>
>>> cpu: 4660, io: 307920, mem: 0.
>>>
>>>
>>> Do you have another idea or a preference as to which approach we take?
>>
>> I think 3 is a good first step.  Using multiple samples could improve
>> our calculated "avg1", but let's do that later if needed.
>
> I agree; Randy and I have been working on patching make and have taken 
> a similar approach:
> https://github.com/ZhengQ2/make/tree/cpu-pressure
> Additionally, we found that when the pressure is read too frequently, we
> may get the same cpu pressure as a result, even if the pressure has
> actually changed. This is likely due to the per-cpu variables used in the
> kernel. So, in addition to the algorithm Randy described above, we also
> compare whether the cpu pressure has changed; if not, we return the last
> result that was produced.
>
> I will CC you when I have a patch, and you can try it out before the 
> commit gets merged if you like.


Ola,

Does Qi's patch below help in your situation?

I still want/intend to add a bitbake PSI test case that uses stress-ng
to induce load and a lightweight sleep task, but there are never enough
hours in the day/week/...

The basic idea is to:

1. Run a task that just sleeps for say 10 seconds and confirm that the 
actual
execution time is < 11 seconds or so.

2. Use stress-ng to get the system into a CPU pressure environment above
the current threshold for say 30 seconds and, simultaneously / shortly
thereafter, launch the same sleep task and confirm that this time the
actual execution time from launch to completion is 40+ seconds.
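A rough harness for step 1, as a sketch (the command lines are
assumptions; a real test would drive an actual bitbake task):

```python
import subprocess
import time

def timed_task(seconds):
    # Time a trivial stand-in "build task" from launch to completion.
    start = time.monotonic()
    subprocess.run(["sleep", str(seconds)], check=True)
    return time.monotonic() - start

# Step 1: on an idle system the task should finish promptly.
baseline = timed_task(1)
assert baseline < 2.0, "idle run should not be throttled"

# Step 2 would rerun the same task while something like
#   stress-ng --cpu 0 --timeout 30
# holds CPU pressure above BB_PRESSURE_MAX_CPU, and assert that the
# launch-to-completion time grows to 40+ seconds once PSI throttling
# holds the task back.
```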

../Randy 'getting caught up on email on the weekend' MacLeod


❯ git show ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
commit ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
Author: Chen Qi <Qi.Chen@windriver.com>
Date:   Thu Apr 6 23:07:14 2023

     bitbake: runqueue: fix PSI check calculation

     The current PSI check calculation does not take into consideration
     the possibility of the time interval between last check and current
     check being much larger than 1s. In fact, the current behavior does
     not match what the manual says about BB_PRESSURE_MAX_XXX, even if
     the value is set to upper limit, 1000000, we still get many blocks
     on new task launch. The difference between 'total' should be divided
     by the time interval if it's larger than 1s.

     (Bitbake rev: b4763c2c93e7494e0a27f5970c19c1aac66c228b)

     Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
     Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>


Δ bitbake/lib/bb/runqueue.py
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

────────────────────────────────────────┐
• 198: class RunQueueScheduler(object): │
────────────────────────────────────────┘
                curr_cpu_pressure = cpu_pressure_fds.readline().split()[4].split("=")[1]
                curr_io_pressure = io_pressure_fds.readline().split()[4].split("=")[1]
                curr_memory_pressure = memory_pressure_fds.readline().split()[4].split("=")[1]
-               exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
-               exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
-               exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
                now = time.time()
-               if now - self.prev_pressure_time > 1.0:
+               tdiff = now - self.prev_pressure_time
+               if tdiff > 1.0:
+                   exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) / tdiff > self.rq.max_cpu_pressure
+                   exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) / tdiff > self.rq.max_io_pressure
+                   exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) / tdiff > self.rq.max_memory_pressure
                    self.prev_cpu_pressure = curr_cpu_pressure
                    self.prev_io_pressure = curr_io_pressure
                    self.prev_memory_pressure = curr_memory_pressure
                    self.prev_pressure_time = now
+               else:
+                   exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
+                   exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
+                   exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
            return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
        return False


>
> ZQ
>
>>
>> /Ola
>>
>>>
>>> ../Randy
>>>
>>>
>>>>
>>>> /Ola Nilsson
>>>>
>>>>
>>>>
>>
>>
>
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#14206):https://lists.openembedded.org/g/bitbake-devel/message/14206
> Mute This Topic:https://lists.openembedded.org/mt/95618299/3616765
> Group Owner:bitbake-devel+owner@lists.openembedded.org
> Unsubscribe:https://lists.openembedded.org/g/bitbake-devel/unsub  [randy.macleod@windriver.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>

-- 
# Randy MacLeod
# Wind River Linux

[-- Attachment #2.1: Type: text/html, Size: 34570 bytes --]

[-- Attachment #2.2: make.png --]
[-- Type: image/png, Size: 107869 bytes --]


* Re: [bitbake-devel] Bitbake PSI checker
  2023-05-20 19:58       ` Randy MacLeod
@ 2023-05-22  2:17         ` ChenQi
  2023-05-22  9:36           ` Ola x Nilsson
  0 siblings, 1 reply; 9+ messages in thread
From: ChenQi @ 2023-05-22  2:17 UTC (permalink / raw)
  To: Randy MacLeod, contrib, Ola x Nilsson; +Cc: Richard Purdie, bitbake-devel

[-- Attachment #1: Type: text/plain, Size: 10747 bytes --]

Hi Ola & Randy,

I just checked the code and I think Ola is right. The current PSI check
cannot block the spawning of new tasks if the time interval between the
current check and the last check is small. I'll send out a patch to fix
this issue.

Also, I don't think calculating the value too often is a good idea, so 
I'll change the check to be >1s.

Please help review the patch.

Regards,
Qi


On 5/21/23 03:58, Randy MacLeod wrote:
> On 2022-12-19 14:49, Zheng Qiu via lists.openembedded.org wrote:
>>
>>
>>> On Dec 19, 2022, at 7:50 AM, Ola x Nilsson <ola.x.nilsson@axis.com> 
>>> wrote:
>>>
>>>
>>> On Mon, Dec 12 2022, Randy MacLeod wrote:
>>>
>>>> CCing Richard
>>>>
>>>> On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote:
>>>>> Hi,
>>>>>
>>>>> I've been looking into using the pressure stall information 
>>>>> awareness of
>>>>> bitbake
>>>> That's good to hear Ola.
>>>>>  but I have some problems getting it to work.  Actually I think
>>>>> it just doesn't work at all.
>>>>
>>>> Doesn't work at all?
>>>>
>>>> Well that would be surprising. See below.
>>>
>>> OK, it will occasionally block a task. But since the next attempt will
>>> always be a very short time interval it will almost always start a new
>>> task even if the pressure is high.
>>> At least this is what I observe on my system.
>>>
>>> <snip>
>>>
>>>> 1. Rather than just keep track of the previous pressure values
>>>> seen more than 1 second ago as done currently:
>>>>
>>>>       if now - self.prev_pressure_time > 1.0:
>>>>
>>>> and always using that as a reference, we can
>>>> store say 10 values per second and use that as a reference.
>>>>
>>>> There are some challenges in that approach in that we don't control
>>>> how often the function is called. Averaging over the last 10 calls
>>>> is tempting but likely has some edge cases such as when there are
>>>> lots of tasks starting/ending.
>>>>
>>>>
>>>> 2. If there has been a long delay since the function was last called,
>>>> we could check the pressure, sleep for a short period of time and 
>>>> check it
>>>> again. Some people would not like this since it will needlessly delay
>>>> the build
>>>> so we'd have to keep the delay to < 1 second. Too short a delay 
>>>> will reduce
>>>> the accuracy of the result but I suspect that 0.1 seconds is sufficient
>>>> for most
>>>> users. We could also look at the avg10 value in this case or even some
>>>> combination of
>>>> both the current contention and avg10.
>>>>
>>>>
>>>> 3. Just calculate the pressure per second by:
>>>>
>>>>    ( current pressure - last pressure ) / (now - last_time)
>>>>
>>>> This could handle short time differences such as milliseconds and would
>>>> be a 'cheap' way to deal with long delays. In your case,
>>>> the pressure would be:
>>>>
>>>>   cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
>>>>
>>>> divided by ~19 since the initial values were close to zero.
>>>>
>>>> Then for the next time, just 0.1 seconds later:
>>>>
>>>> 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 
>>>> mem_pressure 20922.0
>>>> 1670840042.384582 cpu io  pressure exceeded over 18.677629 seconds
>>>> 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 
>>>> mem_pressure 0.0
>>>>
>>>> Multiplying by 10 for easy calculation, that would be a pressure of:
>>>>
>>>> cpu: 4660, io: 307920, mem: 0.
>>>>
>>>>
>>>> Do you have another idea or a preference as to which approach we take?
>>>
>>> I think 3 is a good first step.  Using multiple samples could improve
>>> our calculated "avg1", but let's do that later if needed.
>>
>> I agree; Randy and I have been working on patching make and have 
>> taken a similar approach:
>> https://github.com/ZhengQ2/make/tree/cpu-pressure
>> Additionally, we found that when the pressure is read too frequently, we
>> may get the same cpu pressure as a result, even if the pressure has
>> actually changed. This is likely due to the per-cpu variables used in the
>> kernel. So, in addition to the algorithm Randy described above, we also
>> compare whether the cpu pressure has changed; if not, we return the last
>> result that was produced.
>>
>> I will CC you when I have a patch, and you can try it out before the 
>> commit gets merged if you like.
>
>
> Ola,
>
> Does Qi's patch below help in your situation?
>
> I still want/intend to add a bitbake PSI test case that uses stress-ng
> to induce load and a lightweight sleep task, but there are never enough
> hours in the day/week/...
>
> The basic idea is to:
>
> 1. Run a task that just sleeps for say 10 seconds and confirm that the 
> actual
> execution time is < 11 seconds or so.
>
> 2. Use stress-ng to get the system into a CPU pressure environment above
> the current threshold for say 30 seconds and, simultaneously / shortly
> thereafter, launch the same sleep task and confirm that this time the
> actual execution time from launch to completion is 40+ seconds.
>
> ../Randy 'getting caught up on email on the weekend' MacLeod
>
>
> ❯ git show ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
> commit ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
> Author: Chen Qi <Qi.Chen@windriver.com>
> Date:   Thu Apr 6 23:07:14 2023
>
>     bitbake: runqueue: fix PSI check calculation
>
>     The current PSI check calculation does not take into consideration
>     the possibility of the time interval between last check and current
>     check being much larger than 1s. In fact, the current behavior does
>     not match what the manual says about BB_PRESSURE_MAX_XXX, even if
>     the value is set to upper limit, 1000000, we still get many blocks
>     on new task launch. The difference between 'total' should be divided
>     by the time interval if it's larger than 1s.
>
>     (Bitbake rev: b4763c2c93e7494e0a27f5970c19c1aac66c228b)
>
>     Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
>     Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
>
>
> Δ bitbake/lib/bb/runqueue.py
> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>
> ────────────────────────────────────────┐
> • 198: class RunQueueScheduler(object): │
> ────────────────────────────────────────┘
>                 curr_cpu_pressure = cpu_pressure_fds.readline().split()[4].split("=")[1]
>                 curr_io_pressure = io_pressure_fds.readline().split()[4].split("=")[1]
>                 curr_memory_pressure = memory_pressure_fds.readline().split()[4].split("=")[1]
>                 exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
>                 exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
>                 exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
>                 now = time.time()
>                 if now - self.prev_pressure_time > 1.0:
>                 tdiff = now - self.prev_pressure_time
>                 if tdiff > 1.0:
>                     exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) / tdiff > self.rq.max_cpu_pressure
>                     exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) / tdiff > self.rq.max_io_pressure
>                     exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) / tdiff > self.rq.max_memory_pressure
>                     self.prev_cpu_pressure = curr_cpu_pressure
>                     self.prev_io_pressure = curr_io_pressure
>                     self.prev_memory_pressure = curr_memory_pressure
>                     self.prev_pressure_time = now
>                 else:
>                     exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
>                     exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
>                     exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
>             return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
>         return False
>
>
>>
>> ZQ
>>
>>>
>>> /Ola
>>>
>>>>
>>>> ../Randy
>>>>
>>>>
>>>>>
>>>>> /Ola Nilsson
>>>>>
>>>>>
>>>>>
>>>
>>>
>>
>>
>> -=-=-=-=-=-=-=-=-=-=-=-
>> Links: You receive all messages sent to this group.
>> View/Reply Online (#14206):https://lists.openembedded.org/g/bitbake-devel/message/14206
>> Mute This Topic:https://lists.openembedded.org/mt/95618299/3616765
>> Group Owner:bitbake-devel+owner@lists.openembedded.org
>> Unsubscribe:https://lists.openembedded.org/g/bitbake-devel/unsub  [randy.macleod@windriver.com]
>> -=-=-=-=-=-=-=-=-=-=-=-
>>
>
> -- 
> # Randy MacLeod
> # Wind River Linux


[-- Attachment #2.1: Type: text/html, Size: 37189 bytes --]

[-- Attachment #2.2: make.png --]
[-- Type: image/png, Size: 107869 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [bitbake-devel] Bitbake PSI checker
  2023-05-22  2:17         ` ChenQi
@ 2023-05-22  9:36           ` Ola x Nilsson
  2023-05-22 14:41             ` Randy MacLeod
  0 siblings, 1 reply; 9+ messages in thread
From: Ola x Nilsson @ 2023-05-22  9:36 UTC (permalink / raw)
  To: ChenQi; +Cc: Randy MacLeod, contrib, Richard Purdie, bitbake-devel


Hi Qi and Randy,

I did some testing this morning, and I think this works fine for the <1s
intervals.
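
The fix under discussion boils down to normalizing the delta in the PSI
'total' counter by the elapsed time when the sampling gap is long.  A
minimal standalone sketch (a hypothetical helper, not the actual
runqueue code, using the numbers reported earlier in this thread):

```python
def exceeds_max_pressure(curr_total, prev_total, tdiff, max_pressure):
    """Return True if the growth in PSI stall time (microseconds)
    exceeds max_pressure per second of wall time."""
    delta = float(curr_total) - float(prev_total)
    if tdiff > 1.0:
        delta /= tdiff  # scale to a per-second rate for long intervals
    return delta > max_pressure

# The 18.7s gap from the original report: 8978077 us of CPU stall over
# ~18.68s is roughly 480k us/s, below a 600000 threshold once normalized.
print(exceeds_max_pressure(8978077.0, 0.0, 18.677629, 600000))
```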

I added log prints whenever the exceeds_max_pressure function was called
and was a bit surprised at some of my observations.

It seems setscene tasks are started without checking the PSI.  Is this
by design?  With the antivirus program forced on me by IT I easily reach
CPU PSI on above 600000 (my current limit) while only running setscene
tasks.

If the PSI threshold has been reached, no new tasks will be started for
a while.  But once the PSI check passes, it seems as many tasks as are
allowed are started at once.  Considering the time interval between
checks for each started task would be very small, this would probably
happen even if the PSI was checked for each task start.  But won't this
cause 'waves' of tasks that compete and cause high PSI instead of
allowing just a few (one?) tasks to start and then wait a second?
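
One hedge against such waves (purely a sketch of the idea, with
hypothetical names; nothing bitbake currently implements) would be to
release at most one task per check window even when pressure is low:

```python
import time

class PressureGate:
    """Hypothetical throttle: even when pressure is below the limit,
    admit at most one new task per min_interval seconds, so freed
    capacity is consumed gradually instead of in a burst."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_release = 0.0

    def may_start_task(self, pressure_ok, now=None):
        now = time.time() if now is None else now
        if not pressure_ok:
            return False
        if now - self.last_release < self.min_interval:
            return False  # a task was released too recently
        self.last_release = now
        return True
```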

These two things are obviously not connected to this patch.  I think
this is fine except for the commit message which refers to runqemu.py
instead of runqueue.py.

Thank you for this improvement.
/Ola

On Mon, May 22 2023, ChenQi wrote:

> Hi Ola & Randy,
>
> I just checked the codes and I think Ola is right. The current PSI check cannot block spawning of new tasks if the time interval
> is small between current check and last check. I'll send out a patch to fix this issue.
>
> Also, I don't think calculating the value too often is a good idea, so I'll change the check to be >1s.
>
> Please help review the patch.
>
> Regards,
> Qi
>
> On 5/21/23 03:58, Randy MacLeod wrote:
>
>  On 2022-12-19 14:49, Zheng Qiu via lists.openembedded.org wrote:
>
>  On Dec 19, 2022, at 7:50 AM, Ola x Nilsson <ola.x.nilsson@axis.com> wrote:
>
>  On Mon, Dec 12 2022, Randy MacLeod wrote:
>
>  CCing Richard
>
>  On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote:
>
>  Hi,
>
>  I've been looking into using the pressure stall information awareness of
>  bitbake
>
>  That's good to hear Ola.
>
>   but I have some problems getting it to work.  Actually I think
>  it just doesn't work at all.
>
>  Doesn't work at all?
>
>  Well that would be surprising. See below.
>
>  OK, it will occasionally block a task. But since the next attempt will
>  always be a very short time interval it will almost always start a new
>  task even if the pressure is high.
>  At least this is what I observe on my system.
>
>  <snip>
>
>  1. Rather than just keep track of the previous pressure values
>  seen more than 1 second ago as done currently:
>
>        if now - self.prev_pressure_time > 1.0:
>
>  and always using that as a reference, we can
>  store say 10 values per second and use that as a reference.
>
>  There are some challenges in that approach in that we don't control
>  how often the function is called. Averaging over the last 10 calls
>  is tempting but likely has some edge cases such as when there are
>  lots of tasks starting/ending.
>
>  2. If there has been a long delay since the function was last called,
>  we could check the pressure, sleep for a short period of time and check it
>  again. Some people would not like this since it will needlessly delay 
>  the build
>  so we'd have to keep the delay to < 1 second. Too short a delay will reduce
>  the accuracy of the result but I suspect that 0.1 seconds is sufficient 
>  for most
>  users. We could also look at the avg10 value in this case or even some 
>  combination of
>  both the current contention and avg10.
>
>  3. Just calculate the pressure per second by:
>
>     ( current pressure - last pressure ) / (now - last_time)
>
>  This could handle short time differences such as milliseconds
>  and would be a 'cheap' way to deal with long delays. In your case,
>  the pressure would be:
>
>    978077.0 io_pressure 1353882.0 mem_pressure 20922.0
>
>  divided by ~19 since the initial values were close to zero.
>
>  Then for the next time, just 0.1 seconds later:
>
>  1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
>  1670840042.384582 cpu io  pressure exceeded over 18.677629 seconds
>  1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0
>
>  Multiplying by 10 for easy calculation, there would be a pressure of:
>
>  cpu: 4660, io: 307920, mem: 0.
>
>  Do you have another idea or a preference as to which approach we take?
>
>  I think 3 is a good first step.  Using multiple samples could improve
>  our calculated "avg1", but let's do that later if needed.
>
>  I agree; Randy and I have been working on patching make and have taken a similar approach:
>
>  make.png 
>  ZhengQ2/make at cpu-pressure github.com   
> make.png
>  Additionally, we found that when the pressure is read too frequently, we may get the same cpu pressure as a result,
>  even if the pressure has actually changed. This is likely due to the per-cpu variables used in the kernel.
>  So, in addition to the algorithm Randy talked about above, we also compare whether the cpu pressure has changed; if not,
>  we will return the last result that was produced.
>
>  I will CC you when I have a patch, and you can try it out before the commit gets merged if you like.
>
>  Ola, 
>
>  Does Qi's patch below help in your situation?
>
>  I still want/intend to add a bitbake PSI test case that uses stress-ng to induce load
>  and a lightweight sleep task but there are never enough hours in the day/week/...
>
>  The basic idea is to:
>
>  1. Run a task that just sleeps for say 10 seconds and confirm that the actual
>  execution time is < 11 seconds or so.
>
>  2. use stress to get the system into a CPU pressure environment above
>  the current threshold for say 30 seconds and simultaneously / shortly there after, 
>  launch the same sleep task and confirm that this time, the actual execution time of
>  the launch to completion time is 40+ seconds.
>
>  ../Randy 'getting caught up on email on the weekend' MacLeod
>
>  ❯ git show ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
>  commit ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
>  Author: Chen Qi <Qi.Chen@windriver.com>
>  Date:   Thu Apr 6 23:07:14 2023
>
>      bitbake: runqueue: fix PSI check calculation
>      
>      The current PSI check calculation does not take into consideration
>      the possibility of the time interval between last check and current
>      check being much larger than 1s. In fact, the current behavior does
>      not match what the manual says about BB_PRESSURE_MAX_XXX, even if
>      the value is set to upper limit, 1000000, we still get many blocks
>      on new task launch. The difference between 'total' should be divided
>      by the time interval if it's larger than 1s.
>      
>      (Bitbake rev: b4763c2c93e7494e0a27f5970c19c1aac66c228b)
>      
>      Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
>      Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
>
>  Δ bitbake/lib/bb/runqueue.py
>  ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>  
>
>  ────────────────────────────────────────┐
>  • 198: class RunQueueScheduler(object): │
>  ────────────────────────────────────────┘
>                  curr_cpu_pressure = cpu_pressure_fds.readline().split()[4].split("=")[1]
>                  curr_io_pressure = io_pressure_fds.readline().split()[4].split("=")[1]
>                  curr_memory_pressure = memory_pressure_fds.readline().split()[4].split("=")[1]
>                  exceeds_cpu_pressure =  self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure))
>  > self.rq.max_cpu_pressure
>                  exceeds_io_pressure =  self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) >
>  self.rq.max_io_pressure
>                  exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float
>  (self.prev_memory_pressure)) > self.rq.max_memory_pressure
>                  now = time.time()
>                  if now - self.prev_pressure_time > 1.0:
>                  tdiff = now - self.prev_pressure_time
>                  if tdiff > 1.0:
>                      exceeds_cpu_pressure =  self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float
>  (self.prev_cpu_pressure)) / tdiff > self.rq.max_cpu_pressure
>                      exceeds_io_pressure =  self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) /
>  tdiff > self.rq.max_io_pressure
>                      exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float
>  (self.prev_memory_pressure)) / tdiff > self.rq.max_memory_pressure
>                      self.prev_cpu_pressure = curr_cpu_pressure
>                      self.prev_io_pressure = curr_io_pressure
>                      self.prev_memory_pressure = curr_memory_pressure
>                      self.prev_pressure_time = now
>                  else:
>                      exceeds_cpu_pressure =  self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float
>  (self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
>                      exceeds_io_pressure =  self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) >
>  self.rq.max_io_pressure
>                      exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float
>  (self.prev_memory_pressure)) > self.rq.max_memory_pressure
>              return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
>          return False
>
>  ZQ
>
>  /Ola
>
>  ../Randy
>
>  /Ola Nilsson
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#14206): https://lists.openembedded.org/g/bitbake-devel/message/14206
> Mute This Topic: https://lists.openembedded.org/mt/95618299/3616765
> Group Owner: bitbake-devel+owner@lists.openembedded.org
> Unsubscribe: https://lists.openembedded.org/g/bitbake-devel/unsub [randy.macleod@windriver.com]
> -=-=-=-=-=-=-=-=-=-=-=-


-- 
Ola x Nilsson


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [bitbake-devel] Bitbake PSI checker
  2023-05-22  9:36           ` Ola x Nilsson
@ 2023-05-22 14:41             ` Randy MacLeod
  2023-05-23  2:08               ` Chen, Qi
  0 siblings, 1 reply; 9+ messages in thread
From: Randy MacLeod @ 2023-05-22 14:41 UTC (permalink / raw)
  To: Ola x Nilsson, ChenQi; +Cc: contrib, Richard Purdie, bitbake-devel

[-- Attachment #1: Type: text/plain, Size: 13064 bytes --]

On 2023-05-22 05:36, Ola x Nilsson wrote:
> Hi Qi and Randy,
>
> I did some testing this morning, and I think this works fine for the <1s
> intervals.
>
> I added log prints whenever the exceeds_max_pressure function was called
> and was a bit surprised at some of my observations.


Yes, the kernel uses per-cpu variables to track pressure
efficiently and only updates what you see in /proc/pressure
periodically. Fun, eh!

I don't have a graph at hand to show that but here's a
typical CPU pressure pattern:

https://photos.app.goo.gl/XCMVAjywmBgoqj4E6

for those who haven't looked at the data.

This graph doesn't show that if you over-sample you'll get the same
value from pressure repeatedly until the per-cpu data is updated.
I might have that data on hand somewhere else but officially today is
a holiday so I'm not going to go look for it even if graphs are more
of a hobby than work!
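
For reference, the 'some' line of /proc/pressure/cpu looks like
"some avg10=0.00 avg60=0.00 avg300=0.00 total=123456", and bitbake pulls
out the cumulative 'total' field (stall time in microseconds) with the
same split chain this sketch uses.  Because of the per-cpu update
granularity, two reads within one kernel update period can return an
identical total:

```python
def parse_psi_total(line):
    """Extract the cumulative 'total=' stall time (microseconds) from a
    PSI line such as the 'some' row of /proc/pressure/cpu."""
    # Field layout: some avg10=... avg60=... avg300=... total=...
    return float(line.split()[4].split("=")[1])

sample = "some avg10=1.71 avg60=0.50 avg300=0.18 total=8978077"
print(parse_psi_total(sample))
```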

>
> It seems setscene tasks are started without checking the PSI.  Is this
> by design?
Well, more like by lack of design!

I'll take a look, hopefully this week.


>   With the antivirus program forced on me by IT I easily reach
> CPU PSI on above 600000 (my current limit) while only running setscene
> tasks.
Ugh!
>
> If the PSI threshold has been reached, no new tasks will be started for
> a while.  But once the PSI check passes, it seems as many tasks as are
> allowed are started at once.  Considering the time interval between
> checks for each started task would be very small, this would probably
> happen even if the PSI was checked for each task start.  But won't this
> cause 'waves' of tasks that compete and cause high PSI instead of
> allowing just a few (one?) tasks to start and then wait a second?
Yes, I've considered that but hadn't gathered data on it when Zheng
was still working with me. I was also concerned that we didn't want
to slow the builds down too much. I'm not sure how to make that
trade-off in a generic manner given that we don't know if a new build
will generate little, some or tremendous pressure.


The problem is even harder if you have 2 or 3 builds on the
same machine. The related but not exactly appropriate term
for this phenomenon is 'the thundering herd problem':
https://en.wikipedia.org/wiki/Thundering_herd_problem

I expect that there are good or even optimal solutions but
I haven't had/taken time to read the literature.


>
> These two things are obviously not connected to this patch.  I think
> this is fine except for the commit message which refers to runqemu.py
> instead of runqueue.py.


Oops.... I don't actually see that error but if it's done, c'est la vie.

>
> Thank you for this improvement.

+1 Qi !

Ola,
Thanks for checking and reporting and helping push us to do better!

../Randy



> /Ola
>
> On Mon, May 22 2023, ChenQi wrote:
>
>> Hi Ola & Randy,
>>
>> I just checked the codes and I think Ola is right. The current PSI check cannot block spawning of new tasks if the time interval
>> is small between current check and last check. I'll send out a patch to fix this issue.
>>
>> Also, I don't think calculating the value too often is a good idea, so I'll change the check to be >1s.
>>
>> Please help review the patch.
>>
>> Regards,
>> Qi
>>
>> On 5/21/23 03:58, Randy MacLeod wrote:
>>
>>   On 2022-12-19 14:49, Zheng Qiu via lists.openembedded.org wrote:
>>
>>   On Dec 19, 2022, at 7:50 AM, Ola x Nilsson<ola.x.nilsson@axis.com>  wrote:
>>
>>   On Mon, Dec 12 2022, Randy MacLeod wrote:
>>
>>   CCing Richard
>>
>>   On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote:
>>
>>   Hi,
>>
>>   I've been looking into using the pressure stall information awareness of
>>   bitbake
>>
>>   That's good to hear Ola.
>>
>>    but I have some problems getting it to work.  Actually I think
>>   it just doesn't work at all.
>>
>>   Doesn't work at all?
>>
>>   Well that would be surprising. See below.
>>
>>   OK, it will occasionally block a task. But since the next attempt will
>>   always be a very short time interval it will almost always start a new
>>   task even if the pressure is high.
>>   At least this is what I observe on my system.
>>
>>   <snip>
>>
>>   1. Rather than just keep track of the previous pressure values
>>   seen more than 1 second ago as done currently:
>>
>>         if now - self.prev_pressure_time > 1.0:
>>
>>   and always using that as a reference, we can
>>   store say 10 values per second and use that as a reference.
>>
>>   There are some challenges in that approach in that we don't control
>>   how often the function is called. Averaging over the last 10 calls
>>   is tempting but likely has some edge cases such as when there are
>>   lots of tasks starting/ending.
>>
>>   2. If there has been a long delay since the function was last called,
>>   we could check the pressure, sleep for a short period of time and check it
>>   again. Some people would not like this since it will needlessly delay
>>   the build
>>   so we'd have to keep the delay to < 1 second. Too short a delay will reduce
>>   the accuracy of the result but I suspect that 0.1 seconds is sufficient
>>   for most
>>   users. We could also look at the avg10 value in this case or even some
>>   combination of
>>   both the current contention and avg10.
>>
>>   3. Just calculate the pressure per second by:
>>
>>      ( current pressure - last pressure ) / (now - last_time)
>>
>>   This could handle short time differences such as milliseconds
>>   and would be a 'cheap' way to deal with long delays. In your case,
>>   the pressure would be:
>>
>>     978077.0 io_pressure 1353882.0 mem_pressure 20922.0
>>
>>   divided by ~19 since the initial values were close to zero.
>>
>>   Then for the next time, just 0.1 seconds later:
>>
>>   1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
>>   1670840042.384582 cpu io  pressure exceeded over 18.677629 seconds
>>   1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0
>>
>>   Multiplying by 10 for easy calculation, there would be a pressure of:
>>
>>   cpu: 4660, io: 307920, mem: 0.
>>
>>   Do you have another idea or a preference as to which approach we take?
>>
>>   I think 3 is a good first step.  Using multiple samples could improve
>>   our calculated "avg1", but lets do that later if needed.
>>
>>   I agree; Randy and I have been working on patching make and have taken a similar approach:
>>
>>   make.png
>>   ZhengQ2/make at cpu-pressure github.com
>> make.png
>>   Additionally, we found that when the pressure is read too frequently, we may get the same cpu pressure as a result,
>>   even if the pressure has actually changed. This is likely due to the per-cpu variables used in the kernel.
>>   So, in addition to the algorithm Randy talked about above, we also compare whether the cpu pressure has changed; if not,
>>   we will return the last result that was produced.
>>
>>   I will CC you when I have a patch, and you can try it out before the commit gets merged if you like.
>>
>>   Ola,
>>
>>   Does Qi's patch below help in your situation?
>>
>>   I still want/intend to add a bitbake PSI test case that uses stress-ng to induce load
>>   and a lightweight sleep task but there are never enough hours in the day/week/...
>>
>>   The basic idea is to:
>>
>>   1. Run a task that just sleeps for say 10 seconds and confirm that the actual
>>   execution time is < 11 seconds or so.
>>
>>   2. use stress to get the system into a CPU pressure environment above
>>   the current threshold for say 30 seconds and simultaneously / shortly there after,
>>   launch the same sleep task and confirm that this time, the actual execution time of
>>   the launch to completion time is 40+ seconds.
>>
>>   ../Randy 'getting caught up on email on the weekend' MacLeod
>>
>>   ❯ git show ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
>>   commit ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
>>   Author: Chen Qi<Qi.Chen@windriver.com>
>>   Date:   Thu Apr 6 23:07:14 2023
>>
>>       bitbake: runqueue: fix PSI check calculation
>>       
>>       The current PSI check calculation does not take into consideration
>>       the possibility of the time interval between last check and current
>>       check being much larger than 1s. In fact, the current behavior does
>>       not match what the manual says about BB_PRESSURE_MAX_XXX, even if
>>       the value is set to upper limit, 1000000, we still get many blocks
>>       on new task launch. The difference between 'total' should be divided
>>       by the time interval if it's larger than 1s.
>>       
>>       (Bitbake rev: b4763c2c93e7494e0a27f5970c19c1aac66c228b)
>>       
>>       Signed-off-by: Chen Qi<Qi.Chen@windriver.com>
>>       Signed-off-by: Richard Purdie<richard.purdie@linuxfoundation.org>
>>
>>   Δ bitbake/lib/bb/runqueue.py
>>   ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>>   
>>
>>   ────────────────────────────────────────┐
>>   • 198: class RunQueueScheduler(object): │
>>   ────────────────────────────────────────┘
>>                   curr_cpu_pressure = cpu_pressure_fds.readline().split()[4].split("=")[1]
>>                   curr_io_pressure = io_pressure_fds.readline().split()[4].split("=")[1]
>>                   curr_memory_pressure = memory_pressure_fds.readline().split()[4].split("=")[1]
>>                   exceeds_cpu_pressure =  self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure))
>>   > self.rq.max_cpu_pressure
>>                   exceeds_io_pressure =  self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) >
>>   self.rq.max_io_pressure
>>                   exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float
>>   (self.prev_memory_pressure)) > self.rq.max_memory_pressure
>>                   now = time.time()
>>                   if now - self.prev_pressure_time > 1.0:
>>                   tdiff = now - self.prev_pressure_time
>>                   if tdiff > 1.0:
>>                       exceeds_cpu_pressure =  self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float
>>   (self.prev_cpu_pressure)) / tdiff > self.rq.max_cpu_pressure
>>                       exceeds_io_pressure =  self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) /
>>   tdiff > self.rq.max_io_pressure
>>                       exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float
>>   (self.prev_memory_pressure)) / tdiff > self.rq.max_memory_pressure
>>                       self.prev_cpu_pressure = curr_cpu_pressure
>>                       self.prev_io_pressure = curr_io_pressure
>>                       self.prev_memory_pressure = curr_memory_pressure
>>                       self.prev_pressure_time = now
>>                   else:
>>                       exceeds_cpu_pressure =  self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float
>>   (self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
>>                       exceeds_io_pressure =  self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) >
>>   self.rq.max_io_pressure
>>                       exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float
>>   (self.prev_memory_pressure)) > self.rq.max_memory_pressure
>>               return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
>>           return False
>>
>>   ZQ
>>
>>   /Ola
>>
>>   ../Randy
>>
>>   /Ola Nilsson
>>
>> -=-=-=-=-=-=-=-=-=-=-=-
>> Links: You receive all messages sent to this group.
>> View/Reply Online (#14206):https://lists.openembedded.org/g/bitbake-devel/message/14206
>> Mute This Topic:https://lists.openembedded.org/mt/95618299/3616765
>> Group Owner:bitbake-devel+owner@lists.openembedded.org
>> Unsubscribe:https://lists.openembedded.org/g/bitbake-devel/unsub  [randy.macleod@windriver.com]
>> -=-=-=-=-=-=-=-=-=-=-=-


-- 
# Randy MacLeod
# Wind River Linux

[-- Attachment #2: Type: text/html, Size: 15361 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [bitbake-devel] Bitbake PSI checker
  2023-05-22 14:41             ` Randy MacLeod
@ 2023-05-23  2:08               ` Chen, Qi
  0 siblings, 0 replies; 9+ messages in thread
From: Chen, Qi @ 2023-05-23  2:08 UTC (permalink / raw)
  To: MacLeod, Randy, Ola x Nilsson; +Cc: contrib, Richard Purdie, bitbake-devel

[-- Attachment #1: Type: text/plain, Size: 13458 bytes --]

Thanks for the review. I’ll fix the commit message and send out V2.

Regards,
Qi

From: MacLeod, Randy <Randy.MacLeod@windriver.com>
Sent: Monday, May 22, 2023 10:42 PM
To: Ola x Nilsson <ola.x.nilsson@axis.com>; Chen, Qi <Qi.Chen@windriver.com>
Cc: contrib@zhengqiu.net; Richard Purdie <richard.purdie@linuxfoundation.org>; bitbake-devel@lists.openembedded.org
Subject: Re: [bitbake-devel] Bitbake PSI checker

On 2023-05-22 05:36, Ola x Nilsson wrote:

Hi Qi and Randy,



I did some testing this morning, and I think this works fine for the <1s

intervals.



I added log prints whenever the exceeds_max_pressure function was called

and was a bit suprised at some of my observations.



Yes, the kernel uses per-cpu variables to track pressure
efficiently and only updates what you see in /proc/pressure
periodically. Fun, eh!

I don't have a graph at hand to show that but here's a
CPU pressure typical pattern:

   https://photos.app.goo.gl/XCMVAjywmBgoqj4E6

for those who haven't looked at the data.

This graph doesn't show that if you over-sample you'll get the same
value from pressure repeatedly until the per-cpu data is updated.
I might have that data on hand somewhere else but officially today is
a holiday so I'm not going to go look for it even if graphs are more
of a hobby than work!





It seems setscene tasks are started without checking the PSI.  Is this

by design?
Well, more like by lack of design!

I'll take a look, hopefully this week.



 With the antivirus program forced on me by IT I easily reach

CPU PSI on above 600000 (my current limit) while only running setscene

tasks.
Ugh!






If the PSI threshold has been reached, no new tasks will be started for

a while.  But once the PSI check passes, it seems as many tasks as are

allowed are started at once.  Considering the time interval between

checks for each started task would be very small, this would probably

happen even if the PSI was checked for each task start.  But won't this

cause 'waves' of tasks that compete and cause high PSI instead of

allowing just a few (one?) tasks to start and then wait a second?
Yes, I've considered that but hadn't gather data when
on it when Zheng was still working with me. I also was
concerned that we didn't want to slow the builds down
too much. I'm not sure how to make that trade-off in a
generic manner given that we don't know if a new build

will generate little, some or tremendous pressure.



The problem is even harder if you have 2 or 3 builds on the
same machine. The related but not exactly appropriate term
for this phenomena is, 'The thundering herd problem",
   https://en.wikipedia.org/wiki/Thundering_herd_problem

I expect that there are good or even optimal solutions but
I haven't had/taken time to read the literature.







These two things are obviously not connected to this patch.  I think

this is fine except for the commit message which refers to runqemu.py

instead of runqueue.py.



Oops.... I don't actually see that error but if it's done, c'est la vie.





Thank you for this improvment.

+1 Qi !

Ola,
Thanks for checking and reporting and helping push us to do better!

../Randy







/Ola



On Mon, May 22 2023, ChenQi wrote:



Hi Ola & Randy,



I just checked the codes and I think Ola is right. The current PSI check cannot block spawning of new tasks if the time interval

is small between current check and last check. I'll send out a patch to fix this issue.



Also, I don't think calculating the value too often is a good idea, so I'll change the check to be >1s.



Please help review the patch.



Regards,

Qi



On 5/21/23 03:58, Randy MacLeod wrote:



 On 2022-12-19 14:49, Zheng Qiu via lists.openembedded.org wrote:



 On Dec 19, 2022, at 7:50 AM, Ola x Nilsson <ola.x.nilsson@axis.com><mailto:ola.x.nilsson@axis.com> wrote:



 On Mon, Dec 12 2022, Randy MacLeod wrote:



 CCing Richard



 On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote:



 Hi,



 I've been looking into using the pressure stall information awareness of

 bitbake



 That's good to hear Ola.



  but I have some problems getting it to work.  Actually I think

 it just doesn't work at all.



 Doesn't work at all?



 Well that would be surprising. See below.



 OK, it will occasionally block a task. But since the next attempt will

 always be a very short time interval it will almost always start a new

 task even if the pressure is high.

 At least this is what I observe on my system.



 <snip>



 1. Rather than just keep track of the previous pressure values

 seen more than 1 second ago as done currently:



       if now - self.prev_pressure_time > 1.0:



 and always using that as a reference, we can

 store say 10 values per second and use that as a reference.



 There are some challenges in that approach in that we don't control

 how often the function is called. Averaging over the last 10 calls

 is tempting but likely has some edge cases such as when there are

 lots of tasks starting/ending.
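Option 1 could be sketched roughly as below. This is a hypothetical helper, not the actual RunQueueScheduler code, and it sidesteps the call-frequency problem by timestamping each sample rather than assuming 10 calls per second:

```python
from collections import deque
import time

class PressureSampler:
    """Sketch of option 1: keep up to ~10 timestamped PSI 'total'
    samples and diff the newest against the oldest sample that still
    falls inside a 1-second window."""

    def __init__(self, window=1.0, maxlen=10):
        self.window = window
        self.samples = deque(maxlen=maxlen)  # (timestamp, total) pairs

    def add(self, total, now=None):
        self.samples.append((time.time() if now is None else now, total))

    def delta_over_window(self):
        """Pressure accumulated over roughly the last `window` seconds."""
        if len(self.samples) < 2:
            return 0.0
        now, latest = self.samples[-1]
        for ts, total in self.samples:
            if now - ts <= self.window:  # first sample inside the window
                return latest - total
        return 0.0
```

Because samples carry their own timestamps, bursts of calls when many tasks start or end just add near-duplicate entries instead of skewing an average.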



 2. If there has been a long delay since the function was last called,

 we could check the pressure, sleep for a short period of time and check it

 again. Some people would not like this since it will needlessly delay

 the build

 so we'd have to keep the delay to < 1 second. Too short a delay will reduce

 the accuracy of the result but I suspect that 0.1 seconds is sufficient

 for most

 users. We could also look at the avg10 value in this case or even some

 combination of

 both the current contention and avg10.
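Option 2 might look something like this sketch (all names hypothetical; `read_total` is assumed to return the current PSI 'total' counter in microseconds):

```python
import time

def fresh_pressure_rate(read_total, prev_total, prev_time,
                        max_age=1.0, delay=0.1):
    """Sketch of option 2: if the previous sample is older than
    `max_age` seconds, take two fresh readings `delay` seconds apart
    instead of trusting the stale reference."""
    now = time.time()
    if now - prev_time <= max_age:
        # Recent reference: ordinary delta over the real elapsed time.
        return (read_total() - prev_total) / (now - prev_time)
    # Stale reference: sample twice, accepting a small build delay.
    first = read_total()
    time.sleep(delay)
    second = read_total()
    return (second - first) / delay
```

The cost is the extra `delay` of wall-clock time on the stale path, which is why keeping it well under a second matters.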



 3. Just calculate the pressure per second by:



    ( current pressure - last pressure ) / (now - last_time)



 This could handle short time differences such as milliseconds

 and would be a 'cheap' way to deal with long delays. In your case,

 the pressure would be:



   cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0



 divided by ~19 since the initial values were close to zero.



 Then for the next time, just 0.1 seconds later:



 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0

 1670840042.384582 cpu io  pressure exceeded over 18.677629 seconds

 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0



 Multiplying by 10 for easy calculation, the pressure would be:



 cpu: 4660, io: 307920, mem: 0.



 Do you have another idea or a preference as to which approach we take?



 I think 3 is a good first step.  Using multiple samples could improve

 our calculated "avg1", but let's do that later if needed.
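Option 3 boils down to a one-line rate calculation; a minimal sketch with hypothetical names (not the actual runqueue code), using the numbers from the log above:

```python
def pressure_rate(curr_total, prev_total, now, prev_time):
    """Sketch of option 3: normalize the PSI 'total' delta (in
    microseconds of stall time) by the elapsed wall-clock seconds,
    so a 19 s gap and a 0.1 s gap are compared on the same scale."""
    tdiff = now - prev_time
    if tdiff <= 0:
        return 0.0
    return (curr_total - prev_total) / tdiff

# The ~19 s gap from the log: 8978077 us of CPU stall over 18.677629 s
# works out to roughly 480000 us/s, under the reported 600000 limit.
rate = pressure_rate(8978077.0, 0.0, 18.677629, 0.0)
```

With this normalization the 0.1 s follow-up check no longer sees a huge raw delta, and the long gap no longer trips the threshold on its own.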



 I agree; Randy and I have been working on patching make and have taken a similar approach:



 ZhengQ2/make at cpu-pressure (github.com)

 Additionally, we found that when the pressure is read too frequently, we may get the same cpu pressure as a result,

 even though the pressure has actually changed. This is likely due to the per-cpu variables used in the kernel.

 So, in addition to the algorithm Randy described above, we also compare whether the cpu pressure has changed; if not,

 we return the last result that was produced.
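That duplicate-read guard could be sketched like this (hypothetical helper; `state` is just a dict carrying the previous reading and result, not the actual make or bitbake code):

```python
def exceeds_pressure(curr_total, state, threshold):
    """Sketch of the duplicate-read guard: when two reads land close
    together, the kernel may report an identical 'total' (per-cpu
    accounting not yet folded in), so return the previously computed
    answer rather than treating the zero delta as 'no pressure'."""
    if curr_total == state.get("last_total"):
        return state.get("last_result", False)
    result = (curr_total - state.get("last_total", 0.0)) > threshold
    state["last_total"] = curr_total
    state["last_result"] = result
    return result
```

The key point is the early return: an unchanged counter is treated as "no new information", not as "pressure dropped to zero".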



 I will CC you when I have a patch, and you can try it out before the commit gets merged if you like.



 Ola,



 Does Qi's patch below help in your situation?



 I still want/intend to add a bitbake PSI test case that uses stress-ng to induce load

 and a lightweight sleep task but there are never enough hours in the day/week/...



 The basic idea is to:



 1. Run a task that just sleeps for say 10 seconds and confirm that the actual

 execution time is < 11 seconds or so.



 2. Use stress to get the system into a CPU-pressure environment above

 the current threshold for say 30 seconds and simultaneously / shortly thereafter,

 launch the same sleep task and confirm that this time, the actual execution time

 from launch to completion is 40+ seconds.
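The outline above could be roughed out as a shell script. This is only a sketch: plain `sleep` stands in for the bitbake sleep task, the 10 s / 30 s durations are scaled down, and the real test would need BB_PRESSURE_MAX_CPU set in the build configuration:

```shell
#!/bin/sh
# Sketch of the proposed PSI test case (not an actual oe-selftest).
TASK=1   # stand-in for the 10 s sleep task
LOAD=2   # stand-in for the 30 s CPU-load window

# 1. Baseline: the unloaded task should take roughly $TASK seconds.
start=$(date +%s)
sleep "$TASK"
base=$(( $(date +%s) - start ))
echo "baseline=${base}s"

# 2. Load all CPUs (--cpu 0 = one worker per CPU), then rerun the task;
#    with BB_PRESSURE_MAX_CPU set, bitbake should defer launching it
#    until the pressure drops, inflating the wall-clock time.
stress-ng --cpu 0 --timeout "$LOAD" >/dev/null 2>&1 &
start=$(date +%s)
sleep "$TASK"
wait
loaded=$(( $(date +%s) - start ))
echo "loaded=${loaded}s"
```

In the real test the pass condition would be `base` under ~11 s and `loaded` over ~40 s, per the outline above.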



 ../Randy 'getting caught up on email on the weekend' MacLeod



 ❯ git show ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307

 commit ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307

 Author: Chen Qi <Qi.Chen@windriver.com>

 Date:   Thu Apr 6 23:07:14 2023



     bitbake: runqueue: fix PSI check calculation



     The current PSI check calculation does not take into consideration

     the possibility of the time interval between last check and current

     check being much larger than 1s. In fact, the current behavior does

     not match what the manual says about BB_PRESSURE_MAX_XXX, even if

     the value is set to upper limit, 1000000, we still get many blocks

     on new task launch. The difference between 'total' should be divided

     by the time interval if it's larger than 1s.



     (Bitbake rev: b4763c2c93e7494e0a27f5970c19c1aac66c228b)



     Signed-off-by: Chen Qi <Qi.Chen@windriver.com>

     Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>



 Δ bitbake/lib/bb/runqueue.py

 • 198: class RunQueueScheduler(object):

                 curr_cpu_pressure = cpu_pressure_fds.readline().split()[4].split("=")[1]
                 curr_io_pressure = io_pressure_fds.readline().split()[4].split("=")[1]
                 curr_memory_pressure = memory_pressure_fds.readline().split()[4].split("=")[1]
 -               exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
 -               exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
 -               exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
                 now = time.time()
 -               if now - self.prev_pressure_time > 1.0:
 +               tdiff = now - self.prev_pressure_time
 +               if tdiff > 1.0:
 +                   exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) / tdiff > self.rq.max_cpu_pressure
 +                   exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) / tdiff > self.rq.max_io_pressure
 +                   exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) / tdiff > self.rq.max_memory_pressure
                     self.prev_cpu_pressure = curr_cpu_pressure
                     self.prev_io_pressure = curr_io_pressure
                     self.prev_memory_pressure = curr_memory_pressure
                     self.prev_pressure_time = now
 +               else:
 +                   exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
 +                   exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
 +                   exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
             return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
         return False
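For reference, the `split()[4].split("=")[1]` indexing in the patch pulls the cumulative `total` field out of the documented /proc/pressure line format; a small standalone sketch:

```python
def parse_psi_total(line):
    """Extract the 'total' field from a /proc/pressure/* line, e.g.:
        some avg10=1.53 avg60=0.97 avg300=0.42 total=8978077
    Field 4 is 'total=<microseconds of stall time>', the monotonically
    increasing counter whose deltas the patch compares."""
    return float(line.split()[4].split("=")[1])
```

The avg10/avg60/avg300 fields are the kernel's own exponentially weighted averages; the patch intentionally works from `total` so it can pick its own window.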



 ZQ



 /Ola



 ../Randy



 /Ola Nilsson



-=-=-=-=-=-=-=-=-=-=-=-

Links: You receive all messages sent to this group.

View/Reply Online (#14206): https://lists.openembedded.org/g/bitbake-devel/message/14206

Mute This Topic: https://lists.openembedded.org/mt/95618299/3616765

Group Owner: bitbake-devel+owner@lists.openembedded.org

Unsubscribe: https://lists.openembedded.org/g/bitbake-devel/unsub [randy.macleod@windriver.com]

-=-=-=-=-=-=-=-=-=-=-=-





--

# Randy MacLeod

# Wind River Linux


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-05-23  2:08 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-12 10:07 Bitbake PSI checker Ola x Nilsson
2022-12-12 20:48 ` [bitbake-devel] " Randy MacLeod
2022-12-19 12:50   ` Ola x Nilsson
2022-12-19 19:49     ` contrib
2023-05-20 19:58       ` Randy MacLeod
2023-05-22  2:17         ` ChenQi
2023-05-22  9:36           ` Ola x Nilsson
2023-05-22 14:41             ` Randy MacLeod
2023-05-23  2:08               ` Chen, Qi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).