* Bitbake PSI checker
@ 2022-12-12 10:07 Ola x Nilsson
  2022-12-12 20:48 ` [bitbake-devel] " Randy MacLeod
  0 siblings, 1 reply; 9+ messages in thread
From: Ola x Nilsson @ 2022-12-12 10:07 UTC (permalink / raw)
  To: bitbake-devel

Hi,

I've been looking into using the pressure stall information awareness of
bitbake but I have some problems getting it to work.  Actually I think
it just doesn't work at all.

Reading the code I find that
runqueue.RunQueueScheduler.exceeds_max_pressure claims to "Monitor the
difference in pressure at least once per second".  But using some
debug prints added to that method I see output like

1670840023.757171 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
1670840023.758697 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
1670840023.760158 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
1670840023.761733 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
1670840023.959357 cpu_pressure 969.0 io_pressure 16135.0 mem_pressure 0.0
1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
1670840042.384582 cpu io pressure exceeded over 18.677629 seconds
1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0
1670840042.490340 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0

where the first column is the value of 'now', and the pressure values
are the calculated deltas.  The 0-pressure values are probably because
this is very early in the run and the time delta is less than 0.01
seconds.

But there is a time delta of almost 19 seconds between line 5 and 6, and
unsurprisingly the pressure exceeds my max settings of CPU:600000 and
IO:200000.

But the very next check is only 0.1 second later and while the
prev-values won't be updated, the calculated pressure will be used.  This
pressure will be below my settings and a new task will be started.

Am I missing something here?  If the pressure should be monitored each
second, isn't it reasonable to have some sort of tick to update the
prev-values?  And using the pressure delta of intervals of less than a
second also seems to give too low pressure values.

/Ola Nilsson

^ permalink raw reply	[flat|nested] 9+ messages in thread
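The behaviour described in this message can be modelled with a small sketch. It is not bitbake code: the class name, the method name and the cumulative totals are hypothetical, chosen only so that the deltas match the debug output above (8978077 after the ~19 second gap, then 466 on the next call 0.1 seconds later). It assumes, as the message reports, that the raw delta is compared against the limit and that the reference values are refreshed only when more than one second has passed.

```python
class StaleReferenceCheck:
    """Illustrative model of the check as described in this message."""

    def __init__(self, limit):
        self.limit = limit
        self.prev_total = 0.0   # reference "total" counter (microseconds)
        self.prev_time = 0.0    # time at which the reference was taken

    def exceeds(self, now, curr_total):
        # The raw delta is compared to the limit, whatever the elapsed time.
        exceeded = (curr_total - self.prev_total) > self.limit
        # The reference is refreshed only after more than one second.
        if now - self.prev_time > 1.0:
            self.prev_total = curr_total
            self.prev_time = now
        return exceeded


check = StaleReferenceCheck(limit=600000)  # CPU limit from the message
samples = [
    (18.677629, 8978077.0),          # first call after the long gap
    (18.777629, 8978077.0 + 466.0),  # 0.1 s later
    (18.781000, 8978077.0 + 466.0),  # a few ms later again
]
verdicts = [check.exceeds(now, total) for now, total in samples]
print(verdicts)
```

Only the first call reports that the limit is exceeded; the following sub-second calls see a tiny delta and would let a new task start, which is the pattern reported above.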
* Re: [bitbake-devel] Bitbake PSI checker
  2022-12-12 10:07 Bitbake PSI checker Ola x Nilsson
@ 2022-12-12 20:48 ` Randy MacLeod
  2022-12-19 12:50   ` Ola x Nilsson
  0 siblings, 1 reply; 9+ messages in thread
From: Randy MacLeod @ 2022-12-12 20:48 UTC (permalink / raw)
  To: bitbake-devel, Richard Purdie, ola.x.nilsson; +Cc: Zheng.qiu

CCing Richard

On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote:
> Hi,
>
> I've been looking into using the pressure stall information awareness of
> bitbake

That's good to hear Ola.

> but I have some problems getting it to work.  Actually I think
> it just doesn't work at all.

Doesn't work at all?

Well that would be surprising. See below.

>
> Reading the code I find that
> runqueue.RunQueueScheduler.exceeds_max_pressure claims to "Monitor the
> difference in pressure at least once per second".

That comment isn't accurate. I'll fix it.

Currently, the pressure is only checked when bitbake is looking for the
next_buildable_task. This can occur many (hundreds of) times per second
at some points in a build and later, when larger recipes are compiling,
the function may not be called for tens or hundreds of seconds depending
on what is being built.

> But using some
> debug prints added to that method I see output like
>
> 1670840023.757171 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
> 1670840023.758697 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
> 1670840023.760158 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
> 1670840023.761733 cpu_pressure 0.0 io_pressure 0.0 mem_pressure 0.0
> 1670840023.959357 cpu_pressure 969.0 io_pressure 16135.0 mem_pressure 0.0

19 second gap

> 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
> 1670840042.384582 cpu io pressure exceeded over 18.677629 seconds
> 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0
> 1670840042.490340 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0
>
> where the first column is the value of 'now', and the pressure values
> are the calculated deltas.  The 0-pressure values are probably because
> this is very early in the run and the time delta is less than 0.01
> seconds.
>
> But there is a time delta of almost 19 seconds between line 5 and 6, and
> unsurprisingly the pressure exceeds my max settings of CPU:600000 and
> IO:200000.
>
> But the very next check is only 0.1 second later and while the
> prev-values won't be updated, the calculated pressure will be used.  This
> pressure will be below my settings and a new task will be started.

Yes, that's a bug and I need to fix it. See below.

>
> Am I missing something here?

You aren't missing anything. The code has "limitations" but it has still
proven useful to some people and on the Yocto Autobuilder system. Note
the lack of 'interval' errors starting around Aug 18th, 2022, when we
enabled this feature for the YP Autobuilder:
   https://autobuilder.yocto.io/pub/non-release/

> If the pressure should be monitored each
> second, isn't it reasonable to have some sort of tick to update the
> prev-values?  And using the pressure delta of intervals of less than a
> second also seems to give too low pressure values.

That would be a better implementation in some ways, but what we've done
so far is only check the pressure when bitbake is checking for a new
task to run. This is less intrusive, and people do worry about the
efficiency of bitbake. Adding a 1 second timer may not be where we want
to go.

It's a little tricky to provide short-term averaging regardless of how
often the function is called. Here are the improvements that I'm
considering:

1. Rather than just keep track of the previous pressure values
   seen more than 1 second ago, as done currently:

       if now - self.prev_pressure_time > 1.0:

   and always using that as a reference, we can store, say, 10 values
   per second and use that as a reference.

   There are some challenges in that approach in that we don't control
   how often the function is called. Averaging over the last 10 calls
   is tempting but likely has some edge cases, such as when there are
   lots of tasks starting/ending.

2. If there has been a long delay since the function was last called,
   we could check the pressure, sleep for a short period of time and
   check it again. Some people would not like this since it will
   needlessly delay the build, so we'd have to keep the delay to
   < 1 second. Too short a delay will reduce the accuracy of the result
   but I suspect that 0.1 seconds is sufficient for most users. We could
   also look at the avg10 value in this case or even some combination of
   both the current contention and avg10.

3. Just calculate the pressure per second by:

       ( current pressure - last pressure ) / (now - last_time)

   This could handle short time differences such as milliseconds and
   would be a 'cheap' way to deal with long delays. In your case,
   the pressure would be:

       cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0

   divided by ~19 since the initial values were close to zero.

   Then for the next time, just 0.1 seconds later:

       1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
       1670840042.384582 cpu io pressure exceeded over 18.677629 seconds
       1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0

   Multiplying by 10 for easy calculation, that would be a pressure of:

       cpu: 4660, io: 307920, mem: 0.

Do you have another idea or a preference as to which approach we take?

../Randy

>
> /Ola Nilsson

-- 
# Randy MacLeod
# Wind River Linux

^ permalink raw reply	[flat|nested] 9+ messages in thread
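Option 3 above can be worked through with the numbers from the debug output. The following is a minimal sketch, not the bitbake implementation: pressure_rate is a hypothetical helper, and the previous totals are taken as zero, following the remark that the initial values were close to zero. The PSI "total" counters are cumulative stall time in microseconds, so the result is microseconds of stall per second of wall time.

```python
def pressure_rate(curr_total, prev_total, now, prev_time):
    """Stall microseconds accumulated per second between two samples."""
    elapsed = now - prev_time
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return (curr_total - prev_total) / elapsed


# Long gap: the deltas reported after 18.677629 seconds.
cpu_long = pressure_rate(8978077.0, 0.0, 18.677629, 0.0)
io_long = pressure_rate(1353882.0, 0.0, 18.677629, 0.0)

# Short gap: the deltas reported about 0.1 seconds later.
cpu_short = pressure_rate(466.0, 0.0, 0.1, 0.0)
io_short = pressure_rate(30792.0, 0.0, 0.1, 0.0)

print(round(cpu_long), round(io_long))
print(round(cpu_short), round(io_short))
```

With this normalisation the long-gap CPU figure comes out at roughly 480,000 per second, which is below the CPU:600000 limit from the original report, while the short-gap IO figure (about 307,920) is above the IO:200000 limit. The verdicts therefore differ from those produced by comparing raw deltas.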
* Re: [bitbake-devel] Bitbake PSI checker 2022-12-12 20:48 ` [bitbake-devel] " Randy MacLeod @ 2022-12-19 12:50 ` Ola x Nilsson 2022-12-19 19:49 ` contrib 0 siblings, 1 reply; 9+ messages in thread From: Ola x Nilsson @ 2022-12-19 12:50 UTC (permalink / raw) To: Randy MacLeod; +Cc: Richard Purdie, Zheng.qiu, bitbake-devel On Mon, Dec 12 2022, Randy MacLeod wrote: > CCing Richard > > On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote: >> Hi, >> >> I've been looking into using the pressure stall information awareness of >> bitbake > That's good to hear Ola. >> but I have some problems getting it to work. Actually I think >> it just doesn't work at all. > > Doesn't work at all? > > Well that would be surprising. See below. OK, it will occasionally block a task. But since the next attempt will always be a very short time interval it will almost always start a new task even if the pressure is high. At least this is what I observe on my system. <snip> > 1. Rather than just keep track of the previous pressure values > seen more than 1 second ago as done currently: > > if now - self.prev_pressure_time > 1.0: > > and always using that as a reference, we can > store say 10 values per second and use that as a reference. > > There are some challenges in that approach in that we don't control > how often the function is called. Averaging over the last 10 calls > is tempting but likely has some edge cases such as when there are > lots of tasks starting/ending. > > > 2. If there has been a long delay since the function was last called, > we could check the pressure, sleep for a short period of time and check it > again. Some people would not like this since it will needlessly delay > the build > so we'd have to keep the delay to < 1 second. Too short a delay will reduce > the accuracy of the result but I suspect that 0.1 seconds is sufficient > for most > users. 
We could also look at the avg10 value in this case or even some > combination of > both the current contention and avg10. > > > 3. Just calculate the pressure per second by: > > ( current pressure - last pressure ) / (now - last_time) > > This could handle short time differences such os milliseconds > as would be a 'cheap' way to deal with long delays. In your case, > the pressure would be: > > 978077.0 io_pressure 1353882.0 mem_pressure 20922.0 > > divided by ~19 since the initial values were close to zero. > > Then for the next time, just 0.1 seconds later: > > 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0 > 1670840042.384582 cpu io pressure exceeded over 18.677629 seconds > 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0 > > Multiplying by 10 or easy calculation, the would be a pressure: > > cpu: 4660, io: 307920, mem: 0. > > > Do you have another idea or a preference as to which approach we take? I think 3 is a good first step. Using multiple samples could improve our calculated "avg1", but lets do that later if needed. /Ola > > ../Randy > > >> >> /Ola Nilsson >> >> >> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [bitbake-devel] Bitbake PSI checker 2022-12-19 12:50 ` Ola x Nilsson @ 2022-12-19 19:49 ` contrib 2023-05-20 19:58 ` Randy MacLeod 0 siblings, 1 reply; 9+ messages in thread From: contrib @ 2022-12-19 19:49 UTC (permalink / raw) To: Ola x Nilsson; +Cc: Randy MacLeod, Richard Purdie, bitbake-devel [-- Attachment #1: Type: text/plain, Size: 4281 bytes --] > On Dec 19, 2022, at 7:50 AM, Ola x Nilsson <ola.x.nilsson@axis.com> wrote: > > > On Mon, Dec 12 2022, Randy MacLeod wrote: > >> CCing Richard >> >> On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote: >>> Hi, >>> >>> I've been looking into using the pressure stall information awareness of >>> bitbake >> That's good to hear Ola. >>> but I have some problems getting it to work. Actually I think >>> it just doesn't work at all. >> >> Doesn't work at all? >> >> Well that would be surprising. See below. > > OK, it will occasionally block a task. But since the next attempt will > always be a very short time interval it will almost always start a new > task even if the pressure is high. > At least this is what I observe on my system. > > <snip> > >> 1. Rather than just keep track of the previous pressure values >> seen more than 1 second ago as done currently: >> >> if now - self.prev_pressure_time > 1.0: >> >> and always using that as a reference, we can >> store say 10 values per second and use that as a reference. >> >> There are some challenges in that approach in that we don't control >> how often the function is called. Averaging over the last 10 calls >> is tempting but likely has some edge cases such as when there are >> lots of tasks starting/ending. >> >> >> 2. If there has been a long delay since the function was last called, >> we could check the pressure, sleep for a short period of time and check it >> again. Some people would not like this since it will needlessly delay >> the build >> so we'd have to keep the delay to < 1 second. 
Too short a delay will reduce
>> the accuracy of the result but I suspect that 0.1 seconds is sufficient
>> for most
>> users. We could also look at the avg10 value in this case or even some
>> combination of
>> both the current contention and avg10.
>>
>>
>> 3. Just calculate the pressure per second by:
>>
>>     ( current pressure - last pressure ) / (now - last_time)
>>
>> This could handle short time differences such as milliseconds
>> and would be a 'cheap' way to deal with long delays. In your case,
>> the pressure would be:
>>
>> cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
>>
>> divided by ~19 since the initial values were close to zero.
>>
>> Then for the next time, just 0.1 seconds later:
>>
>> 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0
>> 1670840042.384582 cpu io pressure exceeded over 18.677629 seconds
>> 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0
>>
>> Multiplying by 10 for easy calculation, that would be a pressure of:
>>
>>     cpu: 4660, io: 307920, mem: 0.
>>
>>
>> Do you have another idea or a preference as to which approach we take?
>
> I think 3 is a good first step.  Using multiple samples could improve
> our calculated "avg1", but lets do that later if needed.

I agree; Randy and I have been working on patching make and have taken
a similar approach:

    https://github.com/ZhengQ2/make/tree/cpu-pressure

Additionally, we found that when the pressure is read too frequently, we
may get the same CPU pressure as a result, even if the pressure has
actually changed. This is likely due to the per-CPU variables used in
the kernel. So, in addition to the algorithm Randy described above, we
also check whether the CPU pressure has changed; if not, we return the
last result that was produced.

I will CC you when I have a patch, and you can try it out before the
commit gets merged if you like.

ZQ

> /Ola
>
>> ../Randy
>>
>>> /Ola Nilsson

^ permalink raw reply	[flat|nested] 9+ messages in thread
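The "return the last result when the counter has not moved" idea from this message could look roughly like the sketch below. It is an interpretation of the description only, not code from the make branch linked above; the class, its attributes and the sample values are all hypothetical.

```python
class CachedPressureCheck:
    """Rate-based check that reuses its last verdict when the kernel
    counter reads the same as on the previous sample."""

    def __init__(self, limit):
        self.limit = limit
        self.prev_total = None
        self.prev_time = None
        self.last_result = False

    def exceeds(self, now, curr_total):
        if self.prev_total is None:
            # First sample: nothing to compare against yet.
            self.prev_total, self.prev_time = curr_total, now
            return self.last_result
        elapsed = now - self.prev_time
        if curr_total == self.prev_total or elapsed <= 0:
            # Counter unchanged (possibly not yet updated by the
            # kernel): keep the previous verdict and reference.
            return self.last_result
        rate = (curr_total - self.prev_total) / elapsed
        self.prev_total, self.prev_time = curr_total, now
        self.last_result = rate > self.limit
        return self.last_result


check = CachedPressureCheck(limit=600000)
results = [
    check.exceeds(0.0, 0.0),         # first sample, no verdict yet
    check.exceeds(1.0, 900000.0),    # 900000 us/s, above the limit
    check.exceeds(1.001, 900000.0),  # same counter value, verdict reused
    check.exceeds(2.0, 900100.0),    # 100 us/s, below the limit
]
print(results)
```

One consequence of this design is that a stale counter never produces a near-zero rate right after a high reading, so a blocked state persists until the counter actually moves.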
* Re: [bitbake-devel] Bitbake PSI checker 2022-12-19 19:49 ` contrib @ 2023-05-20 19:58 ` Randy MacLeod 2023-05-22 2:17 ` ChenQi 0 siblings, 1 reply; 9+ messages in thread From: Randy MacLeod @ 2023-05-20 19:58 UTC (permalink / raw) To: contrib, Ola x Nilsson; +Cc: Richard Purdie, bitbake-devel, Chen, Qi [-- Attachment #1: Type: text/plain, Size: 10014 bytes --] On 2022-12-19 14:49, Zheng Qiu via lists.openembedded.org wrote: > > >> On Dec 19, 2022, at 7:50 AM, Ola x Nilsson <ola.x.nilsson@axis.com> >> wrote: >> >> >> On Mon, Dec 12 2022, Randy MacLeod wrote: >> >>> CCing Richard >>> >>> On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote: >>>> Hi, >>>> >>>> I've been looking into using the pressure stall information >>>> awareness of >>>> bitbake >>> That's good to hear Ola. >>>> but I have some problems getting it to work. Actually I think >>>> it just doesn't work at all. >>> >>> Doesn't work at all? >>> >>> Well that would be surprising. See below. >> >> OK, it will occasionally block a task. But since the next attempt will >> always be a very short time interval it will almost always start a new >> task even if the pressure is high. >> At least this is what I observe on my system. >> >> <snip> >> >>> 1. Rather than just keep track of the previous pressure values >>> seen more than 1 second ago as done currently: >>> >>> if now - self.prev_pressure_time > 1.0: >>> >>> and always using that as a reference, we can >>> store say 10 values per second and use that as a reference. >>> >>> There are some challenges in that approach in that we don't control >>> how often the function is called. Averaging over the last 10 calls >>> is tempting but likely has some edge cases such as when there are >>> lots of tasks starting/ending. >>> >>> >>> 2. If there has been a long delay since the function was last called, >>> we could check the pressure, sleep for a short period of time and >>> check it >>> again. 
Some people would not like this since it will needlessly delay >>> the build >>> so we'd have to keep the delay to < 1 second. Too short a delay will >>> reduce >>> the accuracy of the result but I suspect that 0.1 seconds is sufficient >>> for most >>> users. We could also look at the avg10 value in this case or even some >>> combination of >>> both the current contention and avg10. >>> >>> >>> 3. Just calculate the pressure per second by: >>> >>> ( current pressure - last pressure ) / (now - last_time) >>> >>> This could handle short time differences such os milliseconds >>> as would be a 'cheap' way to deal with long delays. In your case, >>> the pressure would be: >>> >>> 978077.0 io_pressure 1353882.0 mem_pressure 20922.0 >>> >>> divided by ~19 since the initial values were close to zero. >>> >>> Then for the next time, just 0.1 seconds later: >>> >>> 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 >>> mem_pressure 20922.0 >>> 1670840042.384582 cpu io pressure exceeded over 18.677629 seconds >>> 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 >>> mem_pressure 0.0 >>> >>> Multiplying by 10 or easy calculation, the would be a pressure: >>> >>> cpu: 4660, io: 307920, mem: 0. >>> >>> >>> Do you have another idea or a preference as to which approach we take? >> >> I think 3 is a good first step. Using multiple samples could improve >> our calculated "avg1", but lets do that later if needed. > > I agree; Randy and I have been working on patching make and have taken > a similar approach: > make.png > ZhengQ2/make at cpu-pressure > <https://github.com/ZhengQ2/make/tree/cpu-pressure> > github.com <https://github.com/ZhengQ2/make/tree/cpu-pressure> > > <https://github.com/ZhengQ2/make/tree/cpu-pressure> > Additionally, we found that when the pressure read is too frequent, we > may get the same cpu pressure as an result, > even if the pressure have actually changed. This is likely due to the > per cpu variables used in the kernel. 
> So, in addition to the algorithm Randy described above, we also check
> whether the CPU pressure has changed; if not,
> we return the last result that was produced.
>
> I will CC you when I have a patch, and you can try it out before the
> commit gets merged if you like.

Ola,

Does Qi's patch below help in your situation?

I still want/intend to add a bitbake PSI test case that uses stress-ng
to induce load and a lightweight sleep task, but there are never enough
hours in the day/week/...

The basic idea is to:

1. Run a task that just sleeps for, say, 10 seconds and confirm that the
   actual execution time is < 11 seconds or so.

2. Use stress to get the system into a CPU pressure environment above
   the current threshold for, say, 30 seconds and simultaneously or
   shortly thereafter, launch the same sleep task and confirm that this
   time, the actual execution time from launch to completion is
   40+ seconds.

../Randy 'getting caught up on email on the weekend' MacLeod


❯ git show ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
commit ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
Author: Chen Qi <Qi.Chen@windriver.com>
Date:   Thu Apr 6 23:07:14 2023

    bitbake: runqueue: fix PSI check calculation

    The current PSI check calculation does not take into consideration
    the possibility of the time interval between last check and current
    check being much larger than 1s. In fact, the current behavior does
    not match what the manual says about BB_PRESSURE_MAX_XXX, even if
    the value is set to upper limit, 1000000, we still get many blocks
    on new task launch. The difference between 'total' should be divided
    by the time interval if it's larger than 1s.

    (Bitbake rev: b4763c2c93e7494e0a27f5970c19c1aac66c228b)

    Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>

Δ bitbake/lib/bb/runqueue.py
• 198: class RunQueueScheduler(object):

                 curr_cpu_pressure = cpu_pressure_fds.readline().split()[4].split("=")[1]
                 curr_io_pressure = io_pressure_fds.readline().split()[4].split("=")[1]
                 curr_memory_pressure = memory_pressure_fds.readline().split()[4].split("=")[1]
-                exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
-                exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
-                exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
                 now = time.time()
-                if now - self.prev_pressure_time > 1.0:
+                tdiff = now - self.prev_pressure_time
+                if tdiff > 1.0:
+                    exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) / tdiff > self.rq.max_cpu_pressure
+                    exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) / tdiff > self.rq.max_io_pressure
+                    exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) / tdiff > self.rq.max_memory_pressure
                     self.prev_cpu_pressure = curr_cpu_pressure
                     self.prev_io_pressure = curr_io_pressure
                     self.prev_memory_pressure = curr_memory_pressure
                     self.prev_pressure_time = now
+                else:
+                    exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
+                    exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
+                    exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
             return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
         return False

> ZQ
>
>> /Ola
>>
>>> ../Randy
>>>
>>>> /Ola Nilsson

-- 
# Randy MacLeod
# Wind River Linux

^ permalink raw reply	[flat|nested] 9+ messages in thread
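The decision logic in the patch quoted above can be isolated into a standalone sketch for experimenting with numbers from this thread. The field indexing (`split()[4].split("=")[1]`) and the `tdiff > 1.0` branch are taken from the patch; the function names, the single-resource signature and the sample line are assumptions made for illustration. The sample follows the /proc/pressure layout of a "some" line with avg10, avg60, avg300 and total fields.

```python
# Hypothetical sample of the first line of /proc/pressure/cpu.
SAMPLE_LINE = "some avg10=0.00 avg60=0.00 avg300=0.00 total=8978077"


def parse_total(line):
    """Extract the cumulative 'total' stall time (microseconds)."""
    return float(line.split()[4].split("=")[1])


def exceeds_after_patch(curr_total, prev_total, tdiff, limit):
    """Single-resource version of the check in the quoted patch."""
    if not limit:
        # No limit configured for this resource.
        return False
    delta = curr_total - prev_total
    if tdiff > 1.0:
        # Long interval: normalise the delta to a per-second rate.
        return delta / tdiff > limit
    # Short interval: the raw delta is still compared directly.
    return delta > limit


total = parse_total(SAMPLE_LINE)
long_gap = exceeds_after_patch(total, 0.0, 18.677629, 600000)
short_gap = exceeds_after_patch(466.0, 0.0, 0.1, 600000)
print(total, long_gap, short_gap)
```

Note that in the short-interval branch the raw delta is still used, so the sub-second case raised at the start of the thread is handled differently from option 3, which divides by the elapsed time in every case.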
* Re: [bitbake-devel] Bitbake PSI checker 2023-05-20 19:58 ` Randy MacLeod @ 2023-05-22 2:17 ` ChenQi 2023-05-22 9:36 ` Ola x Nilsson 0 siblings, 1 reply; 9+ messages in thread From: ChenQi @ 2023-05-22 2:17 UTC (permalink / raw) To: Randy MacLeod, contrib, Ola x Nilsson; +Cc: Richard Purdie, bitbake-devel [-- Attachment #1: Type: text/plain, Size: 10747 bytes --] Hi Ola & Randy, I just checked the codes and I think Ola is right. The current PSI check cannot block spawning of new tasks if the time interval is small between current check and last check. I'll send out a patch to fix this issue. Also, I don't think calculating the value too often is a good idea, so I'll change the check to be >1s. Please help review the patch. Regards, Qi On 5/21/23 03:58, Randy MacLeod wrote: > On 2022-12-19 14:49, Zheng Qiu via lists.openembedded.org wrote: >> >> >>> On Dec 19, 2022, at 7:50 AM, Ola x Nilsson <ola.x.nilsson@axis.com> >>> wrote: >>> >>> >>> On Mon, Dec 12 2022, Randy MacLeod wrote: >>> >>>> CCing Richard >>>> >>>> On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote: >>>>> Hi, >>>>> >>>>> I've been looking into using the pressure stall information >>>>> awareness of >>>>> bitbake >>>> That's good to hear Ola. >>>>> but I have some problems getting it to work. Actually I think >>>>> it just doesn't work at all. >>>> >>>> Doesn't work at all? >>>> >>>> Well that would be surprising. See below. >>> >>> OK, it will occasionally block a task. But since the next attempt will >>> always be a very short time interval it will almost always start a new >>> task even if the pressure is high. >>> At least this is what I observe on my system. >>> >>> <snip> >>> >>>> 1. Rather than just keep track of the previous pressure values >>>> seen more than 1 second ago as done currently: >>>> >>>> if now - self.prev_pressure_time > 1.0: >>>> >>>> and always using that as a reference, we can >>>> store say 10 values per second and use that as a reference. 
>>>> >>>> There are some challenges in that approach in that we don't control >>>> how often the function is called. Averaging over the last 10 calls >>>> is tempting but likely has some edge cases such as when there are >>>> lots of tasks starting/ending. >>>> >>>> >>>> 2. If there has been a long delay since the function was last called, >>>> we could check the pressure, sleep for a short period of time and >>>> check it >>>> again. Some people would not like this since it will needlessly delay >>>> the build >>>> so we'd have to keep the delay to < 1 second. Too short a delay >>>> will reduce >>>> the accuracy of the result but I suspect that 0.1 seconds is sufficient >>>> for most >>>> users. We could also look at the avg10 value in this case or even some >>>> combination of >>>> both the current contention and avg10. >>>> >>>> >>>> 3. Just calculate the pressure per second by: >>>> >>>> ( current pressure - last pressure ) / (now - last_time) >>>> >>>> This could handle short time differences such os milliseconds >>>> as would be a 'cheap' way to deal with long delays. In your case, >>>> the pressure would be: >>>> >>>> 978077.0 io_pressure 1353882.0 mem_pressure 20922.0 >>>> >>>> divided by ~19 since the initial values were close to zero. >>>> >>>> Then for the next time, just 0.1 seconds later: >>>> >>>> 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 >>>> mem_pressure 20922.0 >>>> 1670840042.384582 cpu io pressure exceeded over 18.677629 seconds >>>> 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 >>>> mem_pressure 0.0 >>>> >>>> Multiplying by 10 or easy calculation, the would be a pressure: >>>> >>>> cpu: 4660, io: 307920, mem: 0. >>>> >>>> >>>> Do you have another idea or a preference as to which approach we take? >>> >>> I think 3 is a good first step. Using multiple samples could improve >>> our calculated "avg1", but lets do that later if needed. 
>> >> I agree; Randy and I have been working on patching make and have >> taken a similar approach: >> make.png >> ZhengQ2/make at cpu-pressure >> <https://github.com/ZhengQ2/make/tree/cpu-pressure> >> github.com <https://github.com/ZhengQ2/make/tree/cpu-pressure> >> >> <https://github.com/ZhengQ2/make/tree/cpu-pressure> >> Additionally, we found that when the pressure read is too frequent, >> we may get the same cpu pressure as an result, >> even if the pressure have actually changed. This is likely due to the >> per cpu variables used in the kernel. >> So, in addition to the algorithm Randy talked above, we also compares >> if the cpu pressure has been changed, if not, >> we will return the last result that has been produced. >> >> I will CC you when I have a patch, and you can try it out before the >> commit gets merged if you like. > > > Ola, > > Does Qi's patch below help in your situation? > > I still want/intent to add a bitbake PSI test case that uses stress-ng > to induce load > and a lightweight sleep task but there are never enough hours in the > day/week/... > > The basic idea is to: > > 1. Run a task that just sleeps for say 10 seconds and confirm that the > actual > execution time is < 11 seconds or so. > > 2. use stress to get the system into a CPU pressure environment above > the current threshold for say 30 seconds and simultaneously / shortly > there after, > launch the same sleep task and confirm that this time, the actual > exectuion time of > the launch to completion time is 40+ seconds. 
>
> ../Randy 'getting caught up on email on the weekend' MacLeod
>
> ❯ git show ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
> commit ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307
> Author: Chen Qi <Qi.Chen@windriver.com>
> Date: Thu Apr 6 23:07:14 2023
>
>     bitbake: runqueue: fix PSI check calculation
>
>     The current PSI check calculation does not take into consideration
>     the possibility of the time interval between last check and current
>     check being much larger than 1s. In fact, the current behavior does
>     not match what the manual says about BB_PRESSURE_MAX_XXX, even if
>     the value is set to upper limit, 1000000, we still get many blocks
>     on new task launch. The difference between 'total' should be divided
>     by the time interval if it's larger than 1s.
>
>     (Bitbake rev: b4763c2c93e7494e0a27f5970c19c1aac66c228b)
>
>     Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
>     Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
>
> Δ bitbake/lib/bb/runqueue.py
>
> • 198: class RunQueueScheduler(object):
>
>                  curr_cpu_pressure = cpu_pressure_fds.readline().split()[4].split("=")[1]
>                  curr_io_pressure = io_pressure_fds.readline().split()[4].split("=")[1]
>                  curr_memory_pressure = memory_pressure_fds.readline().split()[4].split("=")[1]
> -                exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
> -                exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
> -                exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
>                  now = time.time()
> -                if now - self.prev_pressure_time > 1.0:
> +                tdiff = now - self.prev_pressure_time
> +                if tdiff > 1.0:
> +                    exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) / tdiff > self.rq.max_cpu_pressure
> +                    exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) / tdiff > self.rq.max_io_pressure
> +                    exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) / tdiff > self.rq.max_memory_pressure
>                      self.prev_cpu_pressure = curr_cpu_pressure
>                      self.prev_io_pressure = curr_io_pressure
>                      self.prev_memory_pressure = curr_memory_pressure
>                      self.prev_pressure_time = now
> +                else:
> +                    exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
> +                    exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
> +                    exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
>              return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
>          return False
>
>> ZQ
>>
>>> /Ola
>>>
>>>> ../Randy
>>>>
>>>>> /Ola Nilsson
>>
>> -=-=-=-=-=-=-=-=-=-=-=-
>> Links: You receive all messages sent to this group.
>> View/Reply Online (#14206): https://lists.openembedded.org/g/bitbake-devel/message/14206
>> Mute This Topic: https://lists.openembedded.org/mt/95618299/3616765
>> Group Owner: bitbake-devel+owner@lists.openembedded.org
>> Unsubscribe: https://lists.openembedded.org/g/bitbake-devel/unsub [randy.macleod@windriver.com]
>> -=-=-=-=-=-=-=-=-=-=-=-
>>
>
> --
> # Randy MacLeod
> # Wind River Linux

[-- Attachment #2.1: Type: text/html, Size: 37189 bytes --]

[-- Attachment #2.2: make.png --]
[-- Type: image/png, Size: 107869 bytes --]

^ permalink raw reply [flat|nested] 9+ messages in thread
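The tdiff logic in the patch quoted above can be sketched in isolation as follows. This is a minimal illustration, not the bitbake code: `parse_total` and `PressureCheck` are hypothetical names, and the caller passes the timestamp in rather than calling `time.time()`, so that the numbers from Ola's original log can be replayed.

```python
def parse_total(line):
    """Return the cumulative stall time from one /proc/pressure line.

    Such a line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
    where total is in microseconds.
    """
    return float(line.split()[4].split("=")[1])


class PressureCheck:
    """Hypothetical helper mirroring the tdiff branch of the quoted patch."""

    def __init__(self, max_pressure, start_total=0.0, start_time=0.0):
        self.max_pressure = max_pressure
        self.prev_total = start_total
        self.prev_time = start_time

    def exceeds(self, curr_total, now):
        tdiff = now - self.prev_time
        delta = curr_total - self.prev_total
        if tdiff > 1.0:
            # Long gap: turn the accumulated stall time into a per-second
            # rate, then move the reference point forward.
            exceeded = delta / tdiff > self.max_pressure
            self.prev_total = curr_total
            self.prev_time = now
        else:
            # Short gap: compare the raw delta, keep the old reference.
            exceeded = delta > self.max_pressure
        return exceeded


# Replaying the CPU sample from the log: 8978077 us of stall over 18.677629 s.
cpu = PressureCheck(max_pressure=600000)
print(cpu.exceeds(8978077.0, 18.677629))
```

By this arithmetic the CPU sample works out to about 480,000 per second and the IO sample (1353882 over the same interval) to about 72,000 per second, both under the limits of 600000 and 200000 mentioned at the start of the thread, whereas the raw deltas were over them.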
* Re: [bitbake-devel] Bitbake PSI checker 2023-05-22 2:17 ` ChenQi @ 2023-05-22 9:36 ` Ola x Nilsson 2023-05-22 14:41 ` Randy MacLeod 0 siblings, 1 reply; 9+ messages in thread From: Ola x Nilsson @ 2023-05-22 9:36 UTC (permalink / raw) To: ChenQi; +Cc: Randy MacLeod, contrib, Richard Purdie, bitbake-devel

Hi Qi and Randy,

I did some testing this morning, and I think this works fine for the <1s
intervals.

I added log prints whenever the exceeds_max_pressure function was called
and was a bit surprised at some of my observations.

It seems setscene tasks are started without checking the PSI. Is this
by design? With the antivirus program forced on me by IT I easily reach
a CPU PSI above 600000 (my current limit) while only running setscene
tasks.

If the PSI threshold has been reached, no new tasks will be started for
a while. But once the PSI check passes, it seems as many tasks as are
allowed are started at once. Considering the time interval between
checks for each started task would be very small, this would probably
happen even if the PSI was checked for each task start. But won't this
cause 'waves' of tasks that compete and cause high PSI instead of
allowing just a few (one?) tasks to start and then wait a second?

These two things are obviously not connected to this patch. I think
this is fine except for the commit message which refers to runqemu.py
instead of runqueue.py.

Thank you for this improvement.

/Ola

On Mon, May 22 2023, ChenQi wrote:

> Hi Ola & Randy,
>
> I just checked the code and I think Ola is right. The current PSI check cannot block spawning of new tasks if the time interval
> is small between current check and last check. I'll send out a patch to fix this issue.
>
> Also, I don't think calculating the value too often is a good idea, so I'll change the check to be >1s.
>
> Please help review the patch.
>
> Regards,
> Qi

--
Ola x Nilsson

^ permalink raw reply [flat|nested] 9+ messages in thread
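Ola's suggestion of letting only a few tasks start and then waiting a second could look roughly like the sketch below. It is an illustration of the idea only: `StartThrottle`, `burst` and `interval` are hypothetical names, and nothing here reflects how bitbake's runqueue actually schedules tasks.

```python
class StartThrottle:
    """Hypothetical gate: allow at most `burst` task starts per `interval` seconds."""

    def __init__(self, burst=1, interval=1.0):
        self.burst = burst
        self.interval = interval
        self.window_start = None
        self.started = 0

    def may_start(self, now):
        # Open a new window when none is active or the current one expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.started = 0
        if self.started < self.burst:
            self.started += 1
            return True
        return False


gate = StartThrottle(burst=1, interval=1.0)
for t in (0.0, 0.1, 0.9, 1.0):
    print(t, gate.may_start(t))
```

A scheduler would consult such a gate only after the PSI check has passed, so that a passing check releases one task (or a small burst) per interval rather than every runnable task at once. The cost is slower ramp-up when the machine is in fact idle.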
* Re: [bitbake-devel] Bitbake PSI checker 2023-05-22 9:36 ` Ola x Nilsson @ 2023-05-22 14:41 ` Randy MacLeod 2023-05-23 2:08 ` Chen, Qi 0 siblings, 1 reply; 9+ messages in thread From: Randy MacLeod @ 2023-05-22 14:41 UTC (permalink / raw) To: Ola x Nilsson, ChenQi; +Cc: contrib, Richard Purdie, bitbake-devel

On 2023-05-22 05:36, Ola x Nilsson wrote:
> Hi Qi and Randy,
>
> I did some testing this morning, and I think this works fine for the <1s
> intervals.
>
> I added log prints whenever the exceeds_max_pressure function was called
> and was a bit surprised at some of my observations.

Yes, the kernel uses per-cpu variables to track pressure efficiently and
only updates what you see in /proc/pressure periodically. Fun, eh!

I don't have a graph at hand to show that, but here's a typical CPU
pressure pattern:
https://photos.app.goo.gl/XCMVAjywmBgoqj4E6
for those who haven't looked at the data. This graph doesn't show that if
you over-sample you'll get the same value from pressure repeatedly until
the per-cpu data is updated. I might have that data on hand somewhere else
but officially today is a holiday so I'm not going to go look for it even
if graphs are more of a hobby than work!

> It seems setscene tasks are started without checking the PSI. Is this
> by design?

Well, more like by lack of design!
I'll take a look, hopefully this week.

> With the antivirus program forced on me by IT I easily reach a
> CPU PSI above 600000 (my current limit) while only running setscene
> tasks.

Ugh!

> If the PSI threshold has been reached, no new tasks will be started for
> a while. But once the PSI check passes, it seems as many tasks as are
> allowed are started at once. Considering the time interval between
> checks for each started task would be very small, this would probably
> happen even if the PSI was checked for each task start. But won't this
> cause 'waves' of tasks that compete and cause high PSI instead of
> allowing just a few (one?) tasks to start and then wait a second?

Yes, I've considered that but hadn't gathered data on it when Zheng was
still working with me. I was also concerned that we didn't want to slow
the builds down too much. I'm not sure how to make that trade-off in a
generic manner given that we don't know if a new build will generate
little, some or tremendous pressure. The problem is even harder if you
have 2 or 3 builds on the same machine.

The related but not exactly appropriate term for this phenomenon is
"the thundering herd problem":
https://en.wikipedia.org/wiki/Thundering_herd_problem

I expect that there are good or even optimal solutions but I haven't
had/taken time to read the literature.

> These two things are obviously not connected to this patch. I think
> this is fine except for the commit message which refers to runqemu.py
> instead of runqueue.py.

Oops.... I don't actually see that error but if it's done, c'est la vie.

> Thank you for this improvement.

+1 Qi!

Ola,
Thanks for checking and reporting and helping push us to do better!

../Randy

> /Ola
>
> On Mon, May 22 2023, ChenQi wrote:
>
>> Hi Ola & Randy,
>>
>> I just checked the code and I think Ola is right. The current PSI check cannot block spawning of new tasks if the time interval
>> is small between current check and last check. I'll send out a patch to fix this issue.
>>
>> Also, I don't think calculating the value too often is a good idea, so I'll change the check to be >1s.
>>
>> Please help review the patch.
>>
>> Regards,
>> Qi

--
# Randy MacLeod
# Wind River Linux

^ permalink raw reply [flat|nested] 9+ messages in thread
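The over-sampling effect Randy describes, combined with Zheng's earlier idea of returning the last result when the counter has not moved, might be sketched like this. Everything here is hypothetical: the class name, and in particular the `stale_after` cut-off, which is an added assumption so that a genuinely idle system eventually reports a rate of zero instead of repeating an old value forever.

```python
class CachedPressureRate:
    """Hypothetical helper: reuse the last rate while the kernel counter is unchanged."""

    def __init__(self, start_total=0.0, start_time=0.0, stale_after=2.0):
        self.prev_total = start_total
        self.prev_time = start_time
        self.stale_after = stale_after  # assumption, not from the thread
        self.last_rate = 0.0

    def rate(self, curr_total, now):
        elapsed = now - self.prev_time
        if elapsed <= 0:
            return self.last_rate
        if curr_total == self.prev_total and elapsed < self.stale_after:
            # Counter probably not refreshed yet: keep the previous answer.
            return self.last_rate
        # Counter moved (or has been flat long enough to trust): recompute
        # the stall rate in microseconds per second.
        self.last_rate = (curr_total - self.prev_total) / elapsed
        self.prev_total = curr_total
        self.prev_time = now
        return self.last_rate


r = CachedPressureRate()
print(r.rate(1000.0, 1.0))    # first real sample
print(r.rate(1000.0, 1.001))  # over-sampled read, same counter value
```

The right value for `stale_after` depends on how often the kernel refreshes the totals, which this thread does not state, so the default above is only a placeholder.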
* RE: [bitbake-devel] Bitbake PSI checker 2023-05-22 14:41 ` Randy MacLeod @ 2023-05-23 2:08 ` Chen, Qi 0 siblings, 0 replies; 9+ messages in thread From: Chen, Qi @ 2023-05-23 2:08 UTC (permalink / raw) To: MacLeod, Randy, Ola x Nilsson; +Cc: contrib, Richard Purdie, bitbake-devel [-- Attachment #1: Type: text/plain, Size: 13458 bytes --] Thanks for the review. I’ll fix the commit and send out V2. Regards, Qi From: MacLeod, Randy <Randy.MacLeod@windriver.com> Sent: Monday, May 22, 2023 10:42 PM To: Ola x Nilsson <ola.x.nilsson@axis.com>; Chen, Qi <Qi.Chen@windriver.com> Cc: contrib@zhengqiu.net; Richard Purdie <richard.purdie@linuxfoundation.org>; bitbake-devel@lists.openembedded.org Subject: Re: [bitbake-devel] Bitbake PSI checker On 2023-05-22 05:36, Ola x Nilsson wrote: Hi Qi and Randy, I did some testing this morning, and I think this works fine for the <1s intervals. I added log prints whenever the exceeds_max_pressure function was called and was a bit suprised at some of my observations. Yes, the kernel uses per-cpu variables to track pressure efficiently and only updates what you see in /proc/pressure periodically. Fun, eh! I don't have a graph at hand to show that but here's a CPU pressure typical pattern: https://photos.app.goo.gl/XCMVAjywmBgoqj4E6 for those who haven't looked at the data. This graph doesn't show that if you over-sample you'll get the same value from pressure repeatedly until the per-cpu data is updated. I might have that data on hand somewhere else but officially today is a holiday so I'm not going to go look for it even if graphs are more of a hobby than work! It seems setscene tasks are started without checking the PSI. Is this by design? Well, more like by lack of design! I'll take a look, hopefully this week. With the antivirus program forced on me by IT I easily reach CPU PSI on above 600000 (my current limit) while only running setscene tasks. Ugh! If the PSI threshold has been reached, no new tasks will be started for a while. 
But once the PSI check passes, it seems as many tasks as are allowed are started at once. Considering the time interval between checks for each started task would be very small, this would probably happen even if the PSI was checked for each task start. But won't this cause 'waves' of tasks that compete and cause high PSI instead of allowing just a few (one?) tasks to start and then wait a second? Yes, I've considered that but hadn't gather data when on it when Zheng was still working with me. I also was concerned that we didn't want to slow the builds down too much. I'm not sure how to make that trade-off in a generic manner given that we don't know if a new build will generate little, some or tremendous pressure. The problem is even harder if you have 2 or 3 builds on the same machine. The related but not exactly appropriate term for this phenomena is, 'The thundering herd problem", https://en.wikipedia.org/wiki/Thundering_herd_problem I expect that there are good or even optimal solutions but I haven't had/taken time to read the literature. These two things are obviously not connected to this patch. I think this is fine except for the commit message which refers to runqemu.py instead of runqueue.py. Oops.... I don't actually see that error but if it's done, c'est la vie. Thank you for this improvment. +1 Qi ! Ola, Thanks for checking and reporting and helping push us to do better! ../Randy /Ola On Mon, May 22 2023, ChenQi wrote: Hi Ola & Randy, I just checked the codes and I think Ola is right. The current PSI check cannot block spawning of new tasks if the time interval is small between current check and last check. I'll send out a patch to fix this issue. Also, I don't think calculating the value too often is a good idea, so I'll change the check to be >1s. Please help review the patch. 
Regards, Qi On 5/21/23 03:58, Randy MacLeod wrote: On 2022-12-19 14:49, Zheng Qiu via lists.openembedded.org wrote: On Dec 19, 2022, at 7:50 AM, Ola x Nilsson <ola.x.nilsson@axis.com><mailto:ola.x.nilsson@axis.com> wrote: On Mon, Dec 12 2022, Randy MacLeod wrote: CCing Richard On 2022-12-12 05:07, Ola x Nilsson via lists.openembedded.org wrote: Hi, I've been looking into using the pressure stall information awareness of bitbake That's good to hear Ola. but I have some problems getting it to work. Actually I think it just doesn't work at all. Doesn't work at all? Well that would be surprising. See below. OK, it will occasionally block a task. But since the next attempt will always be a very short time interval it will almost always start a new task even if the pressure is high. At least this is what I observe on my system. <snip> 1. Rather than just keep track of the previous pressure values seen more than 1 second ago as done currently: if now - self.prev_pressure_time > 1.0: and always using that as a reference, we can store say 10 values per second and use that as a reference. There are some challenges in that approach in that we don't control how often the function is called. Averaging over the last 10 calls is tempting but likely has some edge cases such as when there are lots of tasks starting/ending. 2. If there has been a long delay since the function was last called, we could check the pressure, sleep for a short period of time and check it again. Some people would not like this since it will needlessly delay the build so we'd have to keep the delay to < 1 second. Too short a delay will reduce the accuracy of the result but I suspect that 0.1 seconds is sufficient for most users. We could also look at the avg10 value in this case or even some combination of both the current contention and avg10. 3. 
Just calculate the pressure per second by: ( current pressure - last pressure ) / (now - last_time) This could handle short time differences such os milliseconds as would be a 'cheap' way to deal with long delays. In your case, the pressure would be: 978077.0 io_pressure 1353882.0 mem_pressure 20922.0 divided by ~19 since the initial values were close to zero. Then for the next time, just 0.1 seconds later: 1670840042.384582 cpu_pressure 8978077.0 io_pressure 1353882.0 mem_pressure 20922.0 1670840042.384582 cpu io pressure exceeded over 18.677629 seconds 1670840042.486946 cpu_pressure 466.0 io_pressure 30792.0 mem_pressure 0.0 Multiplying by 10 or easy calculation, the would be a pressure: cpu: 4660, io: 307920, mem: 0. Do you have another idea or a preference as to which approach we take? I think 3 is a good first step. Using multiple samples could improve our calculated "avg1", but lets do that later if needed. I agree; Randy and I have been working on patching make and have taken a similar approach: make.png ZhengQ2/make at cpu-pressure github.com make.png Additionally, we found that when the pressure read is too frequent, we may get the same cpu pressure as an result, even if the pressure have actually changed. This is likely due to the per cpu variables used in the kernel. So, in addition to the algorithm Randy talked above, we also compares if the cpu pressure has been changed, if not, we will return the last result that has been produced. I will CC you when I have a patch, and you can try it out before the commit gets merged if you like. Ola, Does Qi's patch below help in your situation? I still want/intent to add a bitbake PSI test case that uses stress-ng to induce load and a lightweight sleep task but there are never enough hours in the day/week/... The basic idea is to: 1. Run a task that just sleeps for say 10 seconds and confirm that the actual execution time is < 11 seconds or so. 2. 
use stress to get the system into a CPU pressure environment above the current threshold for say 30 seconds and simultaneously / shortly there after, launch the same sleep task and confirm that this time, the actual exectuion time of the launch to completion time is 40+ seconds. ../Randy 'getting caught up on email on the weekend' MacLeod ❯ git show ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307 commit ba94f9a3b1960cc0fdc831c20a9d2f8ad289f307 Author: Chen Qi <Qi.Chen@windriver.com><mailto:Qi.Chen@windriver.com> Date: Thu Apr 6 23:07:14 2023 bitbake: runqueue: fix PSI check calculation The current PSI check calculation does not take into consideration the possibility of the time interval between last check and current check being much larger than 1s. In fact, the current behavior does not match what the manual says about BB_PRESSURE_MAX_XXX, even if the value is set to upper limit, 1000000, we still get many blocks on new task launch. The difference between 'total' should be divided by the time interval if it's larger than 1s. 
    (Bitbake rev: b4763c2c93e7494e0a27f5970c19c1aac66c228b)

    Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>

Δ bitbake/lib/bb/runqueue.py
• 198: class RunQueueScheduler(object):

                 curr_cpu_pressure = cpu_pressure_fds.readline().split()[4].split("=")[1]
                 curr_io_pressure = io_pressure_fds.readline().split()[4].split("=")[1]
                 curr_memory_pressure = memory_pressure_fds.readline().split()[4].split("=")[1]
-                exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
-                exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
-                exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
                 now = time.time()
-                if now - self.prev_pressure_time > 1.0:
+                tdiff = now - self.prev_pressure_time
+                if tdiff > 1.0:
+                    exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) / tdiff > self.rq.max_cpu_pressure
+                    exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) / tdiff > self.rq.max_io_pressure
+                    exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) / tdiff > self.rq.max_memory_pressure
                     self.prev_cpu_pressure = curr_cpu_pressure
                     self.prev_io_pressure = curr_io_pressure
                     self.prev_memory_pressure = curr_memory_pressure
                     self.prev_pressure_time = now
+                else:
+                    exceeds_cpu_pressure = self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
+                    exceeds_io_pressure = self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
+                    exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
             return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
         return False

ZQ

/Ola

../Randy

/Ola Nilsson

-=-=-=-=-=-=-=-=-=-=-=-
Links:
View/Reply Online (#14206): https://lists.openembedded.org/g/bitbake-devel/message/14206
-=-=-=-=-=-=-=-=-=-=-=-

--
# Randy MacLeod
# Wind River Linux

[-- Attachment #2: Type: text/html, Size: 25587 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread
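
As a minimal sketch of the per-second normalisation proposed in this thread (this is not the code merged into bitbake, which only divides when the interval is larger than 1s), the delta between two 'total' readings can be divided by the elapsed time before it is compared with the limit. The helper names pressure_rate and exceeds_limit are hypothetical; the sample deltas and limits are the ones quoted in the debug output earlier in the thread.

```python
def pressure_rate(curr_total, prev_total, now, prev_time):
    """Return the stall-time delta per second between two samples."""
    elapsed = now - prev_time
    if elapsed <= 0:
        # No time has passed (or the clock went backwards): report no pressure.
        return 0.0
    return (curr_total - prev_total) / elapsed


def exceeds_limit(rate, limit):
    """Compare a per-second rate with a limit; a falsy limit disables the check."""
    return bool(limit) and rate > limit


# Hypothetical replay of the deltas from the thread, not a live read of
# /proc/pressure: one gap of 18.677629 s, then a gap of roughly 0.1 s.
long_cpu = pressure_rate(8978077.0, 0.0, 18.677629, 0.0)
short_io = pressure_rate(30792.0, 0.0, 0.102364, 0.0)

print(f"cpu over long gap: {long_cpu:.0f}/s, exceeds 600000: {exceeds_limit(long_cpu, 600000)}")
print(f"io over short gap: {short_io:.0f}/s, exceeds 200000: {exceeds_limit(short_io, 200000)}")
```

With these inputs the 19-second CPU delta falls below the CPU:600000 setting once normalised, while the 0.1-second IO delta rises above IO:200000, which is the opposite of what comparing the raw deltas would conclude.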
end of thread, other threads:[~2023-05-23  2:08 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-12 10:07 Bitbake PSI checker Ola x Nilsson
2022-12-12 20:48 ` [bitbake-devel] " Randy MacLeod
2022-12-19 12:50 ` Ola x Nilsson
2022-12-19 19:49 ` contrib
2023-05-20 19:58 ` Randy MacLeod
2023-05-22  2:17 ` ChenQi
2023-05-22  9:36 ` Ola x Nilsson
2023-05-22 14:41 ` Randy MacLeod
2023-05-23  2:08 ` Chen, Qi