bitbake-devel.lists.openembedded.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] runqueue: Improve multiconfig deferred task issues
@ 2021-05-08  9:14 Richard Purdie
  2021-06-25  1:48 ` [bitbake-devel] " bkylerussell
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Purdie @ 2021-05-08  9:14 UTC (permalink / raw)
  To: bitbake-devel

The previous patches have exposed new issues with this code path,
the issues being around what should happen when the hash of a task
changes and the task is or is not on the deferred task list.

Rather than rebuilding the deferred task list during each rehash
event, build it once at the start of a build. This avoids the problem
of tasks being added back after they have run and also avoids problems
of always ensuring the same task is deferred. It also allows the
'outrightfail' codepath to be handled separately as the conditions
are subtly differnt.

One significant win for the new approch is the build is not continually
printing out lists of deferred tasks, that list remains fairly static
from the start of the build. Logic is added in to ensure a rehashed
task with a hash matching other deferred tasks is deferred along with
them as a small optimization.

An interesting test case for this code was reported by Mark Hatle
with four multiconfigs, each the same apart from TMPDIR and running a
build of:

bitbake buildtools-tarball mc:{one,two,three,four}:core-image-minimal

which is interesting in that the build of buildtools partially overlaps
core-image-minimal and the build has a rehash event for qemuwrapper-cross
even without any external hash equivalence server or preexisting data.

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 lib/bb/runqueue.py | 56 +++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 26 deletions(-)

diff --git a/lib/bb/runqueue.py b/lib/bb/runqueue.py
index 6c41fe6d43..88983eba1a 100644
--- a/lib/bb/runqueue.py
+++ b/lib/bb/runqueue.py
@@ -2443,6 +2443,11 @@ class RunQueueExecute:
 
         if update_tasks:
             self.sqdone = False
+            for tid in [t[0] for t in update_tasks]:
+                h = pending_hash_index(tid, self.rqdata)
+                if h in self.sqdata.hashes and tid != self.sqdata.hashes[h]:
+                    self.sq_deferred[tid] = self.sqdata.hashes[h]
+                    bb.note("Deferring %s after %s" % (tid, self.sqdata.hashes[h]))
             update_scenequeue_data([t[0] for t in update_tasks], self.sqdata, self.rqdata, self.rq, self.cooker, self.stampcache, self, summary=False)
 
         for (tid, harddepfail, origvalid) in update_tasks:
@@ -2786,6 +2791,19 @@ def build_scenequeue_data(sqdata, rqdata, rq, cooker, stampcache, sqrq):
     sqdata.stamppresent = set()
     sqdata.valid = set()
 
+    sqdata.hashes = {}
+    sqrq.sq_deferred = {}
+    for mc in sorted(sqdata.multiconfigs):
+        for tid in sorted(sqdata.sq_revdeps):
+            if mc_from_tid(tid) != mc:
+                continue
+            h = pending_hash_index(tid, rqdata)
+            if h not in sqdata.hashes:
+                sqdata.hashes[h] = tid
+            else:
+                sqrq.sq_deferred[tid] = sqdata.hashes[h]
+                bb.note("Deferring %s after %s" % (tid, sqdata.hashes[h]))
+
     update_scenequeue_data(sqdata.sq_revdeps, sqdata, rqdata, rq, cooker, stampcache, sqrq, summary=True)
 
     # Compute a list of 'stale' sstate tasks where the current hash does not match the one
@@ -2850,32 +2868,18 @@ def update_scenequeue_data(tids, sqdata, rqdata, rq, cooker, stampcache, sqrq, s
 
     sqdata.valid |= rq.validate_hashes(tocheck, cooker.data, len(sqdata.stamppresent), False, summary=summary)
 
-    sqdata.hashes = {}
-    sqrq.sq_deferred = {}
-    for mc in sorted(sqdata.multiconfigs):
-        for tid in sorted(sqdata.sq_revdeps):
-            if mc_from_tid(tid) != mc:
-                continue
-            if tid in sqdata.stamppresent:
-                continue
-            if tid in sqdata.valid:
-                continue
-            if tid in sqdata.noexec:
-                continue
-            if tid in sqrq.scenequeue_notcovered:
-                continue
-            if tid in sqrq.scenequeue_covered:
-                continue
-
-            h = pending_hash_index(tid, rqdata)
-            if h not in sqdata.hashes:
-                if tid in tids:
-                    sqdata.outrightfail.add(tid)
-                sqdata.hashes[h] = tid
-            else:
-                sqrq.sq_deferred[tid] = sqdata.hashes[h]
-                bb.note("Deferring %s after %s" % (tid, sqdata.hashes[h]))
-
+    for tid in tids:
+        if tid in sqdata.stamppresent:
+            continue
+        if tid in sqdata.valid:
+            continue
+        if tid in sqdata.noexec:
+            continue
+        if tid in sqrq.scenequeue_covered:
+            continue
+        if tid in sqrq.sq_deferred:
+            continue
+        sqdata.outrightfail.add(tid)
 
 class TaskFailure(Exception):
     """
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [bitbake-devel] [PATCH] runqueue: Improve multiconfig deferred task issues
  2021-05-08  9:14 [PATCH] runqueue: Improve multiconfig deferred task issues Richard Purdie
@ 2021-06-25  1:48 ` bkylerussell
  2021-06-25 10:05   ` Richard Purdie
  0 siblings, 1 reply; 4+ messages in thread
From: bkylerussell @ 2021-06-25  1:48 UTC (permalink / raw)
  To: Richard Purdie; +Cc: bitbake-devel

[-- Attachment #1: Type: text/plain, Size: 6712 bytes --]

What's the status on this patch?

I have an issue on dunfell where two valid setscene tasks are getting
scheduled for native packages even though the hashes are equivalent.  When
both configs' do_populate_sysroot_setscene tasks get scheduled near the
same time, one config often fails with an unpack error.  One config sees
that the other config's task has already started downloading sstate, so it
immediately starts unpacking the tgz it finds.  However, if wget hasn't
actually finished, tar errors on EOF during run.sstate_unpack_package.  I
can readily reproduce this on my setup.

I think bitbake rev a75c5fd6d4ec5 attempted to address this with
pending_hash_index() in runqueue.py, but it seems that over time, that
block has been pushed further to the bottom of update_scenequeue_data.  I
think in my case, both tasks are being marked as valid, so the following
case appears to prevent one of the tasks from being deferred until later.

            if tid in sqdata.valid:
                continue

I have seen some instances where native sstate tasks are actually deferred
because the bb.note is printed in the log, but not always, so I'm not sure
under what circumstances update_scenequeue_data is supposed to execute the
pending_hash_index check.

It looks like this patch might take a step towards fixing that, but I
haven't followed all the recent multiconfig discussions completely.  I'm
not confident on the current state of things and wondered if you could
provide any insight.

Thanks!

Kyle


On Sat, May 8, 2021 at 5:14 AM Richard Purdie <
richard.purdie@linuxfoundation.org> wrote:

> The previous patches have exposed new issues with this code path,
> the issues being around what should happen when the hash of a task
> changes and the task is or is not on the deferred task list.
>
> Rather than rebuilding the deferred task list during each rehash
> event, build it once at the start of a build. This avoids the problem
> of tasks being added back after they have run and also avoids problems
> of always ensuring the same task is deferred. It also allows the
> 'outrightfail' codepath to be handled separately as the conditions
> are subtly differnt.
>
> One significant win for the new approch is the build is not continually
> printing out lists of deferred tasks, that list remains fairly static
> from the start of the build. Logic is added in to ensure a rehashed
> task with a hash matching other deferred tasks is deferred along with
> them as a small optimization.
>
> An interesting test case for this code was reported by Mark Hatle
> with four multiconfigs, each the same apart from TMPDIR and running a
> build of:
>
> bitbake buildtools-tarball mc:{one,two,three,four}:core-image-minimal
>
> which is interesting in that the build of buildtools partially overlaps
> core-image-minimal and the build has a rehash event for qemuwrapper-cross
> even without any external hash equivalence server or preexisting data.
>
> Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> ---
>  lib/bb/runqueue.py | 56 +++++++++++++++++++++++++---------------------
>  1 file changed, 30 insertions(+), 26 deletions(-)
>
> diff --git a/lib/bb/runqueue.py b/lib/bb/runqueue.py
> index 6c41fe6d43..88983eba1a 100644
> --- a/lib/bb/runqueue.py
> +++ b/lib/bb/runqueue.py
> @@ -2443,6 +2443,11 @@ class RunQueueExecute:
>
>          if update_tasks:
>              self.sqdone = False
> +            for tid in [t[0] for t in update_tasks]:
> +                h = pending_hash_index(tid, self.rqdata)
> +                if h in self.sqdata.hashes and tid !=
> self.sqdata.hashes[h]:
> +                    self.sq_deferred[tid] = self.sqdata.hashes[h]
> +                    bb.note("Deferring %s after %s" % (tid,
> self.sqdata.hashes[h]))
>              update_scenequeue_data([t[0] for t in update_tasks],
> self.sqdata, self.rqdata, self.rq, self.cooker, self.stampcache, self,
> summary=False)
>
>          for (tid, harddepfail, origvalid) in update_tasks:
> @@ -2786,6 +2791,19 @@ def build_scenequeue_data(sqdata, rqdata, rq,
> cooker, stampcache, sqrq):
>      sqdata.stamppresent = set()
>      sqdata.valid = set()
>
> +    sqdata.hashes = {}
> +    sqrq.sq_deferred = {}
> +    for mc in sorted(sqdata.multiconfigs):
> +        for tid in sorted(sqdata.sq_revdeps):
> +            if mc_from_tid(tid) != mc:
> +                continue
> +            h = pending_hash_index(tid, rqdata)
> +            if h not in sqdata.hashes:
> +                sqdata.hashes[h] = tid
> +            else:
> +                sqrq.sq_deferred[tid] = sqdata.hashes[h]
> +                bb.note("Deferring %s after %s" % (tid, sqdata.hashes[h]))
> +
>      update_scenequeue_data(sqdata.sq_revdeps, sqdata, rqdata, rq, cooker,
> stampcache, sqrq, summary=True)
>
>      # Compute a list of 'stale' sstate tasks where the current hash does
> not match the one
> @@ -2850,32 +2868,18 @@ def update_scenequeue_data(tids, sqdata, rqdata,
> rq, cooker, stampcache, sqrq, s
>
>      sqdata.valid |= rq.validate_hashes(tocheck, cooker.data,
> len(sqdata.stamppresent), False, summary=summary)
>
> -    sqdata.hashes = {}
> -    sqrq.sq_deferred = {}
> -    for mc in sorted(sqdata.multiconfigs):
> -        for tid in sorted(sqdata.sq_revdeps):
> -            if mc_from_tid(tid) != mc:
> -                continue
> -            if tid in sqdata.stamppresent:
> -                continue
> -            if tid in sqdata.valid:
> -                continue
> -            if tid in sqdata.noexec:
> -                continue
> -            if tid in sqrq.scenequeue_notcovered:
> -                continue
> -            if tid in sqrq.scenequeue_covered:
> -                continue
> -
> -            h = pending_hash_index(tid, rqdata)
> -            if h not in sqdata.hashes:
> -                if tid in tids:
> -                    sqdata.outrightfail.add(tid)
> -                sqdata.hashes[h] = tid
> -            else:
> -                sqrq.sq_deferred[tid] = sqdata.hashes[h]
> -                bb.note("Deferring %s after %s" % (tid, sqdata.hashes[h]))
> -
> +    for tid in tids:
> +        if tid in sqdata.stamppresent:
> +            continue
> +        if tid in sqdata.valid:
> +            continue
> +        if tid in sqdata.noexec:
> +            continue
> +        if tid in sqrq.scenequeue_covered:
> +            continue
> +        if tid in sqrq.sq_deferred:
> +            continue
> +        sqdata.outrightfail.add(tid)
>
>  class TaskFailure(Exception):
>      """
> --
> 2.30.2
>
>
> 
>
>

[-- Attachment #2: Type: text/html, Size: 7943 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [bitbake-devel] [PATCH] runqueue: Improve multiconfig deferred task issues
  2021-06-25  1:48 ` [bitbake-devel] " bkylerussell
@ 2021-06-25 10:05   ` Richard Purdie
  2021-06-25 12:38     ` bkylerussell
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Purdie @ 2021-06-25 10:05 UTC (permalink / raw)
  To: Kyle Russell; +Cc: bitbake-devel

On Thu, 2021-06-24 at 21:48 -0400, Kyle Russell wrote:
> What's the status on this patch?

It fixes some issues, it seems to cause others.

I'm afraid I simply don't have the bandwidth to work on everything and
whilst I do want to look into and fix those issues, I'm struggling
to find the time (amongst a long list of other things).

The challenge with this is finding test cases which don't take hours 
to run and the "starting" configuration is easily set up.

I think the code in master-next is probably correct so I'm tempted
to merge it and then improve from there as the code in -next definitely
feels more correct even if there are some other issues. It does reduce
the output noise a lot.

That said, I've held off since I know if this does cause further problems,
I don't have the bandwidth to look into it. I've also not 100% happy
the code is right and I don' want to see the inevitable backport to dunfell
request when I know there are more issues remaining.

> I have an issue on dunfell where two valid setscene tasks are getting
> scheduled for native packages even though the hashes are equivalent.  
> When both configs' do_populate_sysroot_setscene tasks get scheduled near 
> the same time, one config often fails with an unpack error.  One config 
> sees that the other config's task has already started downloading sstate, 
> so it immediately starts unpacking the tgz it finds.  However, if wget 
> hasn't actually finished, tar errors on EOF during run.sstate_unpack_package.
> I can readily reproduce this on my setup.
I'm not completely confident it I'm not completely confident it 
That sounds like an sstate or fetcher bug, it should be moving the files 
into place. Obviously it shouldn't be racing either but two builds could
race. We should probably have a bug filed for that...

> I think bitbake rev a75c5fd6d4ec5 attempted to address this with 
> pending_hash_index() in runqueue.py, but it seems that over time, that block 
> has been pushed further to the bottom of update_scenequeue_data.  I think in 
> my case, both tasks are being marked as valid, so the following case appears 
> to prevent one of the tasks from being deferred until later.
> 
>             if tid in sqdata.valid:
>                 continue
I'm not completely confident it 
You are correct, the code is definitely marking multiple things as valid when it 
shouldn't currently.


> I have seen some instances where native sstate tasks are actually deferred 
> because the bb.note is printed in the log, but not always, so I'm not sure 
> under what circumstances update_scenequeue_data is supposed to execute the 
> pending_hash_index check.

The problem is caused through interaction with the hash equivalence and unihash 
task "rehashing".


> It looks like this patch might take a step towards fixing that, but I haven't 
> followed all the recent multiconfig discussions completely.  I'm not confident 
> on the current state of things and wondered if you could provide any insight.

Basically I think it is the right direction but I'm not completely confident it 
fixes everything and I know it needs a few days to spend on it (mostly to be 
able to run large test builds).

Cheers,

Richard


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [bitbake-devel] [PATCH] runqueue: Improve multiconfig deferred task issues
  2021-06-25 10:05   ` Richard Purdie
@ 2021-06-25 12:38     ` bkylerussell
  0 siblings, 0 replies; 4+ messages in thread
From: bkylerussell @ 2021-06-25 12:38 UTC (permalink / raw)
  To: Richard Purdie; +Cc: bitbake-devel

[-- Attachment #1: Type: text/plain, Size: 2556 bytes --]

On Fri, Jun 25, 2021 at 6:05 AM Richard Purdie <
richard.purdie@linuxfoundation.org> wrote:

> I'm afraid I simply don't have the bandwidth to work on everything and
> whilst I do want to look into and fix those issues, I'm struggling
> to find the time (amongst a long list of other things).
>

Thanks, Richard.  No worries.  I completely understand.  The information
you've provided
is helpful to at least make sure I'm headed in the right direction.  I'll
keep studying the recent
multiconfig patch sets to see if I can come up with some options.

That said, I've held off since I know if this does cause further problems,
> I don't have the bandwidth to look into it. I've also not 100% happy
> the code is right and I don' want to see the inevitable backport to dunfell
> request when I know there are more issues remaining.
>

I did try naively backporting this to dunfell (in addition to a handful of
other
multiconfig runqueue patches that have been submitted recently), but ended
up
with a runqueue deadlock where essentially all the native tasks that had
just been
deferred were forced to run.  I'll keep analyzing.

> I have an issue on dunfell where two valid setscene tasks are getting
> > scheduled for native packages even though the hashes are equivalent.
> > When both configs' do_populate_sysroot_setscene tasks get scheduled near
> > the same time, one config often fails with an unpack error.  One config
> > sees that the other config's task has already started downloading
> sstate,
> > so it immediately starts unpacking the tgz it finds.  However, if wget
> > hasn't actually finished, tar errors on EOF during
> run.sstate_unpack_package.
> > I can readily reproduce this on my setup.
> I'm not completely confident it I'm not completely confident it
> That sounds like an sstate or fetcher bug, it should be moving the files
> into place. Obviously it shouldn't be racing either but two builds could
> race. We should probably have a bug filed for that...
>

I do think that my issue would be avoided if sstate_installpkg() looked for
the tgz done file
instead of the tgz itself.  Even if both tasks were scheduled, making sure
the download had
completed would be preferable to unpacking an incomplete download, but it
didn't seem like
the best approach if one of the tasks could be optimized away.  I'd be
happy to submit a patch
for that if it sounds interesting, but didn't want to waste time on it if
we'd rather just address the
task deferment.

Thanks again,

Kyle

[-- Attachment #2: Type: text/html, Size: 3488 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-06-25 12:38 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-08  9:14 [PATCH] runqueue: Improve multiconfig deferred task issues Richard Purdie
2021-06-25  1:48 ` [bitbake-devel] " bkylerussell
2021-06-25 10:05   ` Richard Purdie
2021-06-25 12:38     ` bkylerussell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).