* Cache unihash ... doesn't match BB_UNIHASH ...
@ 2020-02-07  8:16 Alex Kiernan
  2020-02-07  8:31 ` Richard Purdie
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Kiernan @ 2020-02-07  8:16 UTC (permalink / raw)
  To: Patches and discussions about the oe-core layer

Hi

We've got a couple of multiconfig builds with different configurations
in one bitbake run (whether it needs to be like this is a different
question - I suspect we had something that worked and stopped there!)

However, for some time we've been seeing errors like this on master
(the recipes vary):

ERROR: mc:forensic:ca-certificates-20190110-r0
do_deploy_source_date_epoch: Cache unihash
37d2546973a2b7c16fbcb1a8dd4939e4a1bba8c3d6d2dfb6e7204ca08d437ace
doesn't match BB_UNIHASH
09baf78bd9316acdd308980c42ea4ed1eacea61bcc0d840c36293d1a5a22db69

Coupled with deadlock messages:

WARNING: Runqeueue deadlocked on deferred tasks, forcing task
mc:factory:virtual:native:/var/lib/jenkins/workspace/nanohub_master/build/../poky/meta/recipes-support/sqlite/sqlite3_3.31.1.bb:do_populate_sysroot

Any pointers to whereabouts I should be looking for what we've got wrong?

-- 
Alex Kiernan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-07  8:16 Cache unihash ... doesn't match BB_UNIHASH Alex Kiernan
@ 2020-02-07  8:31 ` Richard Purdie
  2020-02-07  9:12   ` Alex Kiernan
  2020-02-07 14:43   ` chris.laplante
  0 siblings, 2 replies; 13+ messages in thread
From: Richard Purdie @ 2020-02-07  8:31 UTC (permalink / raw)
  To: Alex Kiernan, Patches and discussions about the oe-core layer

On Fri, 2020-02-07 at 08:16 +0000, Alex Kiernan wrote:
> Hi
> 
> We've got a couple of multiconfig builds with different
> configurations in one bitbake run (whether it needs to be like this a
> different question - I suspect we had something that worked and
> stopped there!)
> 
> However, for sometime we've been seeing errors like this on master
> (the recipes vary):
> 
> ERROR: mc:forensic:ca-certificates-20190110-r0
> do_deploy_source_date_epoch: Cache unihash
> 37d2546973a2b7c16fbcb1a8dd4939e4a1bba8c3d6d2dfb6e7204ca08d437ace
> doesn't match BB_UNIHASH
> 09baf78bd9316acdd308980c42ea4ed1eacea61bcc0d840c36293d1a5a22db69

This is an internal sanity test. It means the hash the task/worker
context had didn't match what the server thought the unihash should
be. There have been changes in this area to try to fix various
things; I'm wondering if we introduced a bug with multiconfig handling.

There are some patches in master-next related to this area so it would
be interesting to know if that made things better/worse.
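
(Roughly speaking, the check amounts to something like this - an
illustrative sketch, not the exact worker code:

    # sketch of the sanity test: compare the unihash the cache handed to
    # the task against what the server currently reports for it
    if cache_unihash != d.getVar('BB_UNIHASH'):
        bb.fatal("Cache unihash %s doesn't match BB_UNIHASH %s"
                 % (cache_unihash, d.getVar('BB_UNIHASH')))

so a mismatch means the two views of the hash diverged somewhere
between parsing and task execution.)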

> Coupled with deadlock messages:
> 
> WARNING: Runqeueue deadlocked on deferred tasks, forcing task
> mc:factory:virtual:native:/var/lib/jenkins/workspace/nanohub_master/b
> uild/../poky/meta/recipes-
> support/sqlite/sqlite3_3.31.1.bb:do_populate_sysroot

This happens if a build has multiple different components with matching
hashes. It builds the "first" one and defers the others since they
could become available in sstate after the first one builds.

You should see messages from runqueue.py:

bb.note("Deferring %s after %s" % (tid, sqdata.hashes[h]))

The behaviour should therefore be to prioritise one multiconfig. I
worry that the sorting isn't quite right and it's building bits of
several multiconfigs and deadlocking?
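
To illustrate what I mean about the sorting - this is a sketch of the
idea only, not the actual runqueue.py code: setscene tasks are grouped
by hash and every task after the first with a given hash gets deferred
behind it, so the iteration order decides which multiconfig "wins":

    # illustrative only - candidate_tids and task_hash() are stand-ins
    # for the real runqueue structures
    hashes = {}      # hash -> first tid seen with that hash
    deferred = {}    # tid -> tid it is deferred behind
    for tid in sorted(candidate_tids):
        h = task_hash(tid)
        if h in hashes:
            deferred[tid] = hashes[h]
            bb.note("Deferring %s after %s" % (tid, hashes[h]))
        else:
            hashes[h] = tid

If that ordering isn't consistent you can end up with bits of each
multiconfig being "first" and the rest deadlocked waiting on them.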

Do you have the logs with the Deferring XXX after YYY messages in
that you could share? I suspect it's a sorting problem.

Cheers,

Richard



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-07  8:31 ` Richard Purdie
@ 2020-02-07  9:12   ` Alex Kiernan
  2020-02-07 14:43   ` chris.laplante
  1 sibling, 0 replies; 13+ messages in thread
From: Alex Kiernan @ 2020-02-07  9:12 UTC (permalink / raw)
  To: Richard Purdie; +Cc: Patches and discussions about the oe-core layer

On Fri, Feb 7, 2020 at 8:31 AM Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
>
> On Fri, 2020-02-07 at 08:16 +0000, Alex Kiernan wrote:
> > Hi
> >
> > We've got a couple of multiconfig builds with different
> > configurations in one bitbake run (whether it needs to be like this a
> > different question - I suspect we had something that worked and
> > stopped there!)
> >
> > However, for sometime we've been seeing errors like this on master
> > (the recipes vary):
> >
> > ERROR: mc:forensic:ca-certificates-20190110-r0
> > do_deploy_source_date_epoch: Cache unihash
> > 37d2546973a2b7c16fbcb1a8dd4939e4a1bba8c3d6d2dfb6e7204ca08d437ace
> > doesn't match BB_UNIHASH
> > 09baf78bd9316acdd308980c42ea4ed1eacea61bcc0d840c36293d1a5a22db69
>
> This is an internal sanity test. It means the hash that it had in the
> task/worker context didn't match what the server thought the unihash
> should be. There have been changes in this area to try and fix various
> things, I'm wondering if we introduced a bug with multiconfig handling.
>
> There are some patches in master-next related to this area so it would
> be interesting to know if that made things better/worse.
>

I can pull those in, if it'd be useful to test?

> > Coupled with deadlock messages:
> >
> > WARNING: Runqeueue deadlocked on deferred tasks, forcing task
> > mc:factory:virtual:native:/var/lib/jenkins/workspace/nanohub_master/b
> > uild/../poky/meta/recipes-
> > support/sqlite/sqlite3_3.31.1.bb:do_populate_sysroot
>
> This happens if a build has multiple different components with matching
> hashes. It builds the "first" one and defers the others since they
> could become available in sstate after the first one builds.
>
> You should see messages from runqueue.py:
>
> bb.note("Deferring %s after %s" % (tid, sqdata.hashes[h]))
>
> The behaviour should be to therefore prioritise one multiconfig. I
> worry that the sorting isn't quite right and its building bits of
> several multiconfigs and deadlocking?
>
> Do you have the logs with the Deferring XXX after YYY messages in you
> could share? I suspect its a sorting problem.
>

Sure - I'll mail it over separately.

-- 
Alex Kiernan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-07  8:31 ` Richard Purdie
  2020-02-07  9:12   ` Alex Kiernan
@ 2020-02-07 14:43   ` chris.laplante
  2020-02-07 15:05     ` Richard Purdie
  1 sibling, 1 reply; 13+ messages in thread
From: chris.laplante @ 2020-02-07 14:43 UTC (permalink / raw)
  To: Richard Purdie, Patches and discussions about the oe-core layer

Hi Richard,

> On Fri, 2020-02-07 at 08:16 +0000, Alex Kiernan wrote:
> > Hi
> >
> > We've got a couple of multiconfig builds with different
> > configurations in one bitbake run (whether it needs to be like this a
> > different question - I suspect we had something that worked and
> > stopped there!)
> >
> > However, for sometime we've been seeing errors like this on master
> > (the recipes vary):
> >
> > ERROR: mc:forensic:ca-certificates-20190110-r0
> > do_deploy_source_date_epoch: Cache unihash
> > 37d2546973a2b7c16fbcb1a8dd4939e4a1bba8c3d6d2dfb6e7204ca08d437ace
> > doesn't match BB_UNIHASH
> > 09baf78bd9316acdd308980c42ea4ed1eacea61bcc0d840c36293d1a5a22db69
> 
> This is an internal sanity test. It means the hash that it had in the
> task/worker context didn't match what the server thought the unihash
> should be. There have been changes in this area to try and fix various
> things, I'm wondering if we introduced a bug with multiconfig handling.
> 
> There are some patches in master-next related to this area so it would
> be interesting to know if that made things better/worse.

Anecdotally, we are running Zeus for nightly builds with three multiconfigs. I cherry-picked your "bitbake: fix2" and "bitbake: fixup" patches and haven't seen any of the BB_UNIHASH errors since. Granted it's only been a week. But before that, hash equiv + multiconfig was unusable due to the BB_UNIHASH errors.

Thanks, 
Chris


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-07 14:43   ` chris.laplante
@ 2020-02-07 15:05     ` Richard Purdie
  2020-02-07 15:08       ` chris.laplante
  2020-02-07 15:44       ` Alex Kiernan
  0 siblings, 2 replies; 13+ messages in thread
From: Richard Purdie @ 2020-02-07 15:05 UTC (permalink / raw)
  To: chris.laplante, Patches and discussions about the oe-core layer

On Fri, 2020-02-07 at 14:43 +0000, chris.laplante@agilent.com wrote:
> Hi Richard,
> 
> > On Fri, 2020-02-07 at 08:16 +0000, Alex Kiernan wrote:
> > > Hi
> > > 
> > > We've got a couple of multiconfig builds with different
> > > configurations in one bitbake run (whether it needs to be like
> > > this a
> > > different question - I suspect we had something that worked and
> > > stopped there!)
> > > 
> > > However, for sometime we've been seeing errors like this on
> > > master
> > > (the recipes vary):
> > > 
> > > ERROR: mc:forensic:ca-certificates-20190110-r0
> > > do_deploy_source_date_epoch: Cache unihash
> > > 37d2546973a2b7c16fbcb1a8dd4939e4a1bba8c3d6d2dfb6e7204ca08d437ace
> > > doesn't match BB_UNIHASH
> > > 09baf78bd9316acdd308980c42ea4ed1eacea61bcc0d840c36293d1a5a22db69
> > 
> > This is an internal sanity test. It means the hash that it had in
> > the
> > task/worker context didn't match what the server thought the
> > unihash
> > should be. There have been changes in this area to try and fix
> > various
> > things, I'm wondering if we introduced a bug with multiconfig
> > handling.
> > 
> > There are some patches in master-next related to this area so it
> > would
> > be interesting to know if that made things better/worse.
> 
> Anecdotally, we are running Zeus for nightly builds with three
> multiconfigs. I cherry-picked your "bitbake: fix2" and "bitbake:
> fixup" patches and haven't seen any of the BB_UNIHASH errors since.
> Granted it's only been a week. But before that, hash equiv +
> multiconfig was unusable due to the BB_UNIHASH errors.

That is a really helpful data point, thanks. I should probably clean up
those bitbake patches and get them merged then; I couldn't decide if
they were right or not...

Cheers,

Richard



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-07 15:05     ` Richard Purdie
@ 2020-02-07 15:08       ` chris.laplante
  2020-02-07 15:44       ` Alex Kiernan
  1 sibling, 0 replies; 13+ messages in thread
From: chris.laplante @ 2020-02-07 15:08 UTC (permalink / raw)
  To: Richard Purdie, Patches and discussions about the oe-core layer

> > Anecdotally, we are running Zeus for nightly builds with three
> > multiconfigs. I cherry-picked your "bitbake: fix2" and "bitbake:
> > fixup" patches and haven't seen any of the BB_UNIHASH errors since.
> > Granted it's only been a week. But before that, hash equiv +
> > multiconfig was unusable due to the BB_UNIHASH errors.
> 
> That is a really helpful data point, thanks. I should probably clean up
> those bitbake patches and get them merged then, I couldn't decide if
> they were right or not...

Sure thing! I should note that I did delete my hashserv.db (on the hash equiv server) out of caution after applying those patches.

Chris 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-07 15:05     ` Richard Purdie
  2020-02-07 15:08       ` chris.laplante
@ 2020-02-07 15:44       ` Alex Kiernan
  2020-02-09  0:23         ` chris.laplante
  1 sibling, 1 reply; 13+ messages in thread
From: Alex Kiernan @ 2020-02-07 15:44 UTC (permalink / raw)
  To: Richard Purdie; +Cc: Patches and discussions about the oe-core layer

On Fri, Feb 7, 2020 at 3:06 PM Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
>
> On Fri, 2020-02-07 at 14:43 +0000, chris.laplante@agilent.com wrote:
> > Hi Richard,
> >
> > > On Fri, 2020-02-07 at 08:16 +0000, Alex Kiernan wrote:
> > > > Hi
> > > >
> > > > We've got a couple of multiconfig builds with different
> > > > configurations in one bitbake run (whether it needs to be like
> > > > this a
> > > > different question - I suspect we had something that worked and
> > > > stopped there!)
> > > >
> > > > However, for sometime we've been seeing errors like this on
> > > > master
> > > > (the recipes vary):
> > > >
> > > > ERROR: mc:forensic:ca-certificates-20190110-r0
> > > > do_deploy_source_date_epoch: Cache unihash
> > > > 37d2546973a2b7c16fbcb1a8dd4939e4a1bba8c3d6d2dfb6e7204ca08d437ace
> > > > doesn't match BB_UNIHASH
> > > > 09baf78bd9316acdd308980c42ea4ed1eacea61bcc0d840c36293d1a5a22db69
> > >
> > > This is an internal sanity test. It means the hash that it had in
> > > the
> > > task/worker context didn't match what the server thought the
> > > unihash
> > > should be. There have been changes in this area to try and fix
> > > various
> > > things, I'm wondering if we introduced a bug with multiconfig
> > > handling.
> > >
> > > There are some patches in master-next related to this area so it
> > > would
> > > be interesting to know if that made things better/worse.
> >
> > Anecdotally, we are running Zeus for nightly builds with three
> > multiconfigs. I cherry-picked your "bitbake: fix2" and "bitbake:
> > fixup" patches and haven't seen any of the BB_UNIHASH errors since.
> > Granted it's only been a week. But before that, hash equiv +
> > multiconfig was unusable due to the BB_UNIHASH errors.
>
> That is a really helpful data point, thanks. I should probably clean up
> those bitbake patches and get them merged then, I couldn't decide if
> they were right or not...
>

I just picked all your pending changes out of master-next into our
local patch queue - will let you know how it looks when it's finished
cooking!

-- 
Alex Kiernan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-07 15:44       ` Alex Kiernan
@ 2020-02-09  0:23         ` chris.laplante
  2020-02-09  7:27           ` Alex Kiernan
  2020-02-09 16:37           ` Joshua Watt
  0 siblings, 2 replies; 13+ messages in thread
From: chris.laplante @ 2020-02-09  0:23 UTC (permalink / raw)
  To: Richard Purdie; +Cc: Patches and discussions about the oe-core layer

Hi Richard,

> > > Anecdotally, we are running Zeus for nightly builds with three
> > > multiconfigs. I cherry-picked your "bitbake: fix2" and "bitbake:
> > > fixup" patches and haven't seen any of the BB_UNIHASH errors since.
> > > Granted it's only been a week. But before that, hash equiv +
> > > multiconfig was unusable due to the BB_UNIHASH errors.
> >
> > That is a really helpful data point, thanks. I should probably clean up
> > those bitbake patches and get them merged then, I couldn't decide if
> > they were right or not...
> >
> 
> I just picked all your pending changes out of master-next into our
> local patch queue - will let you know how it looks when it's finished
> cooking!

There are two small issues I have observed.  

One is that occasionally I get a lot of non-deterministic metadata errors when BB_CACHE_POLICY = "cache", multiconfig, and hash equiv are enabled. The errors are all on recipes for which SRCREV = "${AUTOREV}". It doesn't always happen. But it did just now when I rebased our "zeus-modified" branch onto the upstream "zeus" branch, to get the changes starting with 7dc72fde6edeb5d6ac6b3832530998afeea67cbc.

Two is that sometimes the "Initializing tasks" stage appears stuck at 44% for a couple of minutes. I traced it down to this code in runqueue.py (line 1168 on zeus):

        # Iterate over the task list and call into the siggen code
        dealtwith = set()
        todeal = set(self.runtaskentries)
        while len(todeal) > 0: 
            for tid in todeal.copy():
                if len(self.runtaskentries[tid].depends - dealtwith) == 0:
                    dealtwith.add(tid)
                    todeal.remove(tid)
                    self.prepare_task_hash(tid)

When I instrument the loop to print out the size of "todeal", I see it decrease very slowly, sometimes only a couple at a time. I'm guessing this is because prepare_task_hash is contacting the hash equiv server serially here. I'm over my work VPN, which makes things extra slow. Is there an opportunity for batching here?
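
The instrumentation was nothing clever - just a note at the top of the
while loop, on top of the zeus code quoted above:

        # unchanged zeus loop, plus one bb.note per pass to watch progress
        dealtwith = set()
        todeal = set(self.runtaskentries)
        while len(todeal) > 0:
            bb.note("todeal remaining: %d" % len(todeal))
            for tid in todeal.copy():
                if len(self.runtaskentries[tid].depends - dealtwith) == 0:
                    dealtwith.add(tid)
                    todeal.remove(tid)
                    # this appears to talk to the hash equiv server once per
                    # task, which is where the time goes over a slow link
                    self.prepare_task_hash(tid)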

Thanks,
Chris

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-09  0:23         ` chris.laplante
@ 2020-02-09  7:27           ` Alex Kiernan
  2020-02-09 16:25             ` Alex Kiernan
  2020-02-09 16:37           ` Joshua Watt
  1 sibling, 1 reply; 13+ messages in thread
From: Alex Kiernan @ 2020-02-09  7:27 UTC (permalink / raw)
  To: chris.laplante; +Cc: Patches and discussions about the oe-core layer

On Sun, Feb 9, 2020 at 12:23 AM chris.laplante@agilent.com
<chris.laplante@agilent.com> wrote:
>
> Hi Richard,
>
> > > > Anecdotally, we are running Zeus for nightly builds with three
> > > > multiconfigs. I cherry-picked your "bitbake: fix2" and "bitbake:
> > > > fixup" patches and haven't seen any of the BB_UNIHASH errors since.
> > > > Granted it's only been a week. But before that, hash equiv +
> > > > multiconfig was unusable due to the BB_UNIHASH errors.
> > >
> > > That is a really helpful data point, thanks. I should probably clean up
> > > those bitbake patches and get them merged then, I couldn't decide if
> > > they were right or not...
> > >
> >
> > I just picked all your pending changes out of master-next into our
> > local patch queue - will let you know how it looks when it's finished
> > cooking!
>
> There are two small issues I have observed.
>
> One is occasionally I get a lot of undeterministic metadata errors when BB_CACHE_POLICY = "cache", multiconfig, and hash equiv are enabled. The errors are all on recipes for which SRCREV = "${AUTOREV}". It doesn't always happen. But it did just now when I rebased our "zeus-modified" branch onto the upstream "zeus" branch, to get the changes starting with 7dc72fde6edeb5d6ac6b3832530998afeea67cbc.
>
> Two is that, sometimes "Initializing tasks" stage appears stuck at 44% for a couple minutes. I traced it down to this code in runqueue.py (line 1168 on zeus):
>
>         # Iterate over the task list and call into the siggen code
>         dealtwith = set()
>         todeal = set(self.runtaskentries)
>         while len(todeal) > 0:
>             for tid in todeal.copy():
>                 if len(self.runtaskentries[tid].depends - dealtwith) == 0:
>                     dealtwith.add(tid)
>                     todeal.remove(tid)
>                     self.prepare_task_hash(tid)
>
> When I instrument the loop to print out the size of "todeal", I see it decrease very slowly, sometimes only a couple at a time. I'm guessing this is because prepare_task_hash is contacting the hash equiv server, in a serial manner here. I'm over my work VPN which makes things extra slow. Is there an opportunity for batching here?
>

I've a new failure:

00:20:59.829  Traceback (most recent call last):
00:20:59.829    File
"/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/server/process.py",
line 278, in ProcessServer.idle_commands(delay=0.1,
fds=[<socket.socket fd=6, family=AddressFamily.AF_UNIX,
type=SocketKind.SOCK_STREAM, proto=0, laddr=bitbake.sock>,
<socket.socket fd=18, family=AddressFamily.AF_UNIX,
type=SocketKind.SOCK_STREAM, proto=0, laddr=bitbake.sock>,
<bb.server.process.ConnectionReader object at 0x7f831b7adb70>]):
00:20:59.829                   try:
00:20:59.829      >                retval = function(self, data, False)
00:20:59.829                       if retval is False:
00:20:59.829    File
"/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/cooker.py",
line 1434, in buildTargetsIdle(server=<ProcessServer(ProcessServer-1,
started)>, rq=<bb.runqueue.RunQueue object at 0x7f82f5112f98>,
abort=False):
00:20:59.829                   try:
00:20:59.829      >                retval = rq.execute_runqueue()
00:20:59.829                   except runqueue.TaskFailure as exc:
00:20:59.829    File
"/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
line 1522, in RunQueue.execute_runqueue():
00:20:59.829               try:
00:20:59.829      >            return self._execute_runqueue()
00:20:59.829               except bb.runqueue.TaskFailure:
00:20:59.829    File
"/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
line 1488, in RunQueue._execute_runqueue():
00:20:59.829               if self.state is runQueueRunning:
00:20:59.829      >            retval = self.rqexe.execute()
00:20:59.829
00:20:59.829    File
"/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
line 1997, in RunQueueExecute.execute():
00:20:59.829                               else:
00:20:59.829      >
self.sqdata.outrightfail.remove(nexttask)
00:20:59.829                           if nexttask in self.sqdata.outrightfail:

Just testing locally with:

diff --git a/bitbake/lib/bb/runqueue.py b/bitbake/lib/bb/runqueue.py
index 71108eeed752..a94a9bb27ae2 100644
--- a/bitbake/lib/bb/runqueue.py
+++ b/bitbake/lib/bb/runqueue.py
@@ -1994,7 +1994,7 @@ class RunQueueExecute:
                             self.sq_task_failoutright(nexttask)
                             return True
                         else:
-                            self.sqdata.outrightfail.remove(nexttask)
+                            self.sqdata.outrightfail.discard(nexttask)
                     if nexttask in self.sqdata.outrightfail:
                         logger.debug(2, 'No package found, so skipping setscene task %s', nexttask)
                         self.sq_task_failoutright(nexttask)
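
(Presumably the failure above is remove() raising KeyError because the
task is no longer in the set; discard() is the tolerant variant - plain
Python behaviour, nothing bitbake-specific:

    s = {"a"}
    s.discard("b")   # no-op
    s.remove("b")    # raises KeyError

so the change just makes the removal safe if the task has already been
taken out of outrightfail.)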



-- 
Alex Kiernan


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-09  7:27           ` Alex Kiernan
@ 2020-02-09 16:25             ` Alex Kiernan
  2020-02-10 14:22               ` Alex Kiernan
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Kiernan @ 2020-02-09 16:25 UTC (permalink / raw)
  To: chris.laplante; +Cc: Patches and discussions about the oe-core layer

On Sun, Feb 9, 2020 at 7:27 AM Alex Kiernan <alex.kiernan@gmail.com> wrote:
>
> On Sun, Feb 9, 2020 at 12:23 AM chris.laplante@agilent.com
> <chris.laplante@agilent.com> wrote:
> >
> > Hi Richard,
> >
> > > > > Anecdotally, we are running Zeus for nightly builds with three
> > > > > multiconfigs. I cherry-picked your "bitbake: fix2" and "bitbake:
> > > > > fixup" patches and haven't seen any of the BB_UNIHASH errors since.
> > > > > Granted it's only been a week. But before that, hash equiv +
> > > > > multiconfig was unusable due to the BB_UNIHASH errors.
> > > >
> > > > That is a really helpful data point, thanks. I should probably clean up
> > > > those bitbake patches and get them merged then, I couldn't decide if
> > > > they were right or not...
> > > >
> > >
> > > I just picked all your pending changes out of master-next into our
> > > local patch queue - will let you know how it looks when it's finished
> > > cooking!
> >
> > There are two small issues I have observed.
> >
> > One is occasionally I get a lot of undeterministic metadata errors when BB_CACHE_POLICY = "cache", multiconfig, and hash equiv are enabled. The errors are all on recipes for which SRCREV = "${AUTOREV}". It doesn't always happen. But it did just now when I rebased our "zeus-modified" branch onto the upstream "zeus" branch, to get the changes starting with 7dc72fde6edeb5d6ac6b3832530998afeea67cbc.
> >
> > Two is that, sometimes "Initializing tasks" stage appears stuck at 44% for a couple minutes. I traced it down to this code in runqueue.py (line 1168 on zeus):
> >
> >         # Iterate over the task list and call into the siggen code
> >         dealtwith = set()
> >         todeal = set(self.runtaskentries)
> >         while len(todeal) > 0:
> >             for tid in todeal.copy():
> >                 if len(self.runtaskentries[tid].depends - dealtwith) == 0:
> >                     dealtwith.add(tid)
> >                     todeal.remove(tid)
> >                     self.prepare_task_hash(tid)
> >
> > When I instrument the loop to print out the size of "todeal", I see it decrease very slowly, sometimes only a couple at a time. I'm guessing this is because prepare_task_hash is contacting the hash equiv server, in a serial manner here. I'm over my work VPN which makes things extra slow. Is there an opportunity for batching here?
> >
>
> I've a new failure:
>
> 00:20:59.829  Traceback (most recent call last):
> 00:20:59.829    File
> "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/server/process.py",
> line 278, in ProcessServer.idle_commands(delay=0.1,
> fds=[<socket.socket fd=6, family=AddressFamily.AF_UNIX,
> type=SocketKind.SOCK_STREAM, proto=0, laddr=bitbake.sock>,
> <socket.socket fd=18, family=AddressFamily.AF_UNIX,
> type=SocketKind.SOCK_STREAM, proto=0, laddr=bitbake.sock>,
> <bb.server.process.ConnectionReader object at 0x7f831b7adb70>]):
> 00:20:59.829                   try:
> 00:20:59.829      >                retval = function(self, data, False)
> 00:20:59.829                       if retval is False:
> 00:20:59.829    File
> "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/cooker.py",
> line 1434, in buildTargetsIdle(server=<ProcessServer(ProcessServer-1,
> started)>, rq=<bb.runqueue.RunQueue object at 0x7f82f5112f98>,
> abort=False):
> 00:20:59.829                   try:
> 00:20:59.829      >                retval = rq.execute_runqueue()
> 00:20:59.829                   except runqueue.TaskFailure as exc:
> 00:20:59.829    File
> "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
> line 1522, in RunQueue.execute_runqueue():
> 00:20:59.829               try:
> 00:20:59.829      >            return self._execute_runqueue()
> 00:20:59.829               except bb.runqueue.TaskFailure:
> 00:20:59.829    File
> "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
> line 1488, in RunQueue._execute_runqueue():
> 00:20:59.829               if self.state is runQueueRunning:
> 00:20:59.829      >            retval = self.rqexe.execute()
> 00:20:59.829
> 00:20:59.829    File
> "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
> line 1997, in RunQueueExecute.execute():
> 00:20:59.829                               else:
> 00:20:59.829      >
> self.sqdata.outrightfail.remove(nexttask)
> 00:20:59.829                           if nexttask in self.sqdata.outrightfail:
>
> Just testing locally with:
>
> diff --git a/bitbake/lib/bb/runqueue.py b/bitbake/lib/bb/runqueue.py
> index 71108eeed752..a94a9bb27ae2 100644
> --- a/bitbake/lib/bb/runqueue.py
> +++ b/bitbake/lib/bb/runqueue.py
> @@ -1994,7 +1994,7 @@ class RunQueueExecute:
>                              self.sq_task_failoutright(nexttask)
>                              return True
>                          else:
> -                            self.sqdata.outrightfail.remove(nexttask)
> +                            self.sqdata.outrightfail.discard(nexttask)
>                      if nexttask in self.sqdata.outrightfail:
>                          logger.debug(2, 'No package found, so
> skipping setscene task %s', nexttask)
>                          self.sq_task_failoutright(nexttask)
>

That change has got me a clean build completing end to end, and a
rebuild is then successfully using the sstate-cache.

But something is upsetting the sstate I'm serving back from the Jenkins
box to a local build, as I'm getting different hashes for the same
sstate:

akiernan@akiernan-virtual-machine:~/nanohub/build$ find sstate-cache
-name '*quilt-native*populate_sysroot*' -ls
  2240468     40 -rw-rw-r--   1 akiernan akiernan    39406 Feb  9
13:53 sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz.siginfo
  2241106     56 -rw-rw-r--   1 akiernan akiernan    53302 Feb  9
13:53 sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz
  2634859     40 -rw-rw-r--   1 akiernan akiernan    39387 Feb  9
16:16 sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz.siginfo
  2634858     52 -rw-rw-r--   1 akiernan akiernan    52543 Feb  9
16:16 sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz
akiernan@akiernan-virtual-machine:~/nanohub/build$ bitbake-diffsigs
sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz.siginfo
sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz.siginfo
NOTE: Starting bitbake server...
akiernan@akiernan-virtual-machine:~/nanohub/build$

Running dumpsig and diffing them manually, I'm none the wiser - other
than the variables being in a different order in the two sstate files,
they're identical.
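
(In case it helps anyone reproduce the comparison: this is roughly the
manual diff, driven through bb.siggen directly, which is what
bitbake-diffsigs/bitbake-dumpsig wrap - a sketch assuming bitbake's lib/
is on PYTHONPATH, with the paths shortened from the listing above:

    import bb.siggen

    a = "sstate-cache/universal/ff/29/...ff29...cade1_populate_sysroot.tgz.siginfo"
    b = "sstate-cache/universal/83/30/...8330...b29e_populate_sysroot.tgz.siginfo"

    # print whatever differences bitbake's own signature comparison finds
    for line in bb.siggen.compare_sigfiles(a, b):
        print(line)
)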

-- 
Alex Kiernan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-09  0:23         ` chris.laplante
  2020-02-09  7:27           ` Alex Kiernan
@ 2020-02-09 16:37           ` Joshua Watt
  2020-02-12 16:51             ` chris.laplante
  1 sibling, 1 reply; 13+ messages in thread
From: Joshua Watt @ 2020-02-09 16:37 UTC (permalink / raw)
  To: chris.laplante; +Cc: Patches and discussions about the oe-core layer


On Sat, Feb 8, 2020, 6:24 PM chris.laplante--- via Openembedded-core <
openembedded-core@lists.openembedded.org> wrote:

> Hi Richard,
>
> > > > Anecdotally, we are running Zeus for nightly builds with three
> > > > multiconfigs. I cherry-picked your "bitbake: fix2" and "bitbake:
> > > > fixup" patches and haven't seen any of the BB_UNIHASH errors since.
> > > > Granted it's only been a week. But before that, hash equiv +
> > > > multiconfig was unusable due to the BB_UNIHASH errors.
> > >
> > > That is a really helpful data point, thanks. I should probably clean up
> > > those bitbake patches and get them merged then, I couldn't decide if
> > > they were right or not...
> > >
> >
> > I just picked all your pending changes out of master-next into our
> > local patch queue - will let you know how it looks when it's finished
> > cooking!
>
> There are two small issues I have observed.
>
> One is occasionally I get a lot of undeterministic metadata errors when
> BB_CACHE_POLICY = "cache", multiconfig, and hash equiv are enabled. The
> errors are all on recipes for which SRCREV = "${AUTOREV}". It doesn't
> always happen. But it did just now when I rebased our "zeus-modified"
> branch onto the upstream "zeus" branch, to get the changes starting with
> 7dc72fde6edeb5d6ac6b3832530998afeea67cbc.
>
> Two is that, sometimes "Initializing tasks" stage appears stuck at 44% for
> a couple minutes. I traced it down to this code in runqueue.py (line 1168
> on zeus):
>
>         # Iterate over the task list and call into the siggen code
>         dealtwith = set()
>         todeal = set(self.runtaskentries)
>         while len(todeal) > 0:
>             for tid in todeal.copy():
>                 if len(self.runtaskentries[tid].depends - dealtwith) == 0:
>                     dealtwith.add(tid)
>                     todeal.remove(tid)
>                     self.prepare_task_hash(tid)
>
> When I instrument the loop to print out the size of "todeal", I see it
> decrease very slowly, sometimes only a couple at a time. I'm guessing this
> is because prepare_task_hash is contacting the hash equiv server, in a
> serial manner here. I'm over my work VPN which makes things extra slow. Is
> there an opportunity for batching here?
>

Batching is hard because the hashes you request might depend on hashes
previously received from the server. It might be possible to figure out these
dependencies and submit multiple batch requests, but this would require
some work in the code. Another option would be to have some multi-level
caching scheme where you can have a local database that mirrors your
centralized server.

If you have any ideas on how to make it faster, we would love to hear your
opinion :)
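
Something along these lines, purely as a sketch of the multi-level idea
(names are illustrative; the only assumption is a client object exposing
get_unihash(method, taskhash), as the bitbake hashserv client does):

    # hedged sketch: memoise server answers locally so repeated lookups
    # over a slow link (e.g. a VPN) only pay the round trip once per
    # (method, taskhash) pair
    class CachingHashClient:
        def __init__(self, client):
            self.client = client
            self.local = {}   # (method, taskhash) -> unihash

        def get_unihash(self, method, taskhash):
            key = (method, taskhash)
            if key not in self.local:
                self.local[key] = self.client.get_unihash(method, taskhash)
            return self.local[key]

though that only helps where the same hash gets asked for more than
once; it doesn't batch genuinely new queries.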



> Thanks,
> Chris
> --
> _______________________________________________
> Openembedded-core mailing list
> Openembedded-core@lists.openembedded.org
> http://lists.openembedded.org/mailman/listinfo/openembedded-core
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-09 16:25             ` Alex Kiernan
@ 2020-02-10 14:22               ` Alex Kiernan
  0 siblings, 0 replies; 13+ messages in thread
From: Alex Kiernan @ 2020-02-10 14:22 UTC (permalink / raw)
  To: chris.laplante; +Cc: Patches and discussions about the oe-core layer

On Sun, Feb 9, 2020 at 4:25 PM Alex Kiernan <alex.kiernan@gmail.com> wrote:
>
> On Sun, Feb 9, 2020 at 7:27 AM Alex Kiernan <alex.kiernan@gmail.com> wrote:
> >
> > On Sun, Feb 9, 2020 at 12:23 AM chris.laplante@agilent.com
> > <chris.laplante@agilent.com> wrote:
> > >
> > > Hi Richard,
> > >
> > > > > > Anecdotally, we are running Zeus for nightly builds with three
> > > > > > multiconfigs. I cherry-picked your "bitbake: fix2" and "bitbake:
> > > > > > fixup" patches and haven't seen any of the BB_UNIHASH errors since.
> > > > > > Granted it's only been a week. But before that, hash equiv +
> > > > > > multiconfig was unusable due to the BB_UNIHASH errors.
> > > > >
> > > > > That is a really helpful data point, thanks. I should probably clean up
> > > > > those bitbake patches and get them merged then, I couldn't decide if
> > > > > they were right or not...
> > > > >
> > > >
> > > > I just picked all your pending changes out of master-next into our
> > > > local patch queue - will let you know how it looks when it's finished
> > > > cooking!
> > >
> > > There are two small issues I have observed.
> > >
> > > One is occasionally I get a lot of undeterministic metadata errors when BB_CACHE_POLICY = "cache", multiconfig, and hash equiv are enabled. The errors are all on recipes for which SRCREV = "${AUTOREV}". It doesn't always happen. But it did just now when I rebased our "zeus-modified" branch onto the upstream "zeus" branch, to get the changes starting with 7dc72fde6edeb5d6ac6b3832530998afeea67cbc.
> > >
> > > Two is that, sometimes "Initializing tasks" stage appears stuck at 44% for a couple minutes. I traced it down to this code in runqueue.py (line 1168 on zeus):
> > >
> > >         # Iterate over the task list and call into the siggen code
> > >         dealtwith = set()
> > >         todeal = set(self.runtaskentries)
> > >         while len(todeal) > 0:
> > >             for tid in todeal.copy():
> > >                 if len(self.runtaskentries[tid].depends - dealtwith) == 0:
> > >                     dealtwith.add(tid)
> > >                     todeal.remove(tid)
> > >                     self.prepare_task_hash(tid)
> > >
> > > When I instrument the loop to print out the size of "todeal", I see it decrease very slowly, sometimes only a couple at a time. I'm guessing this is because prepare_task_hash is contacting the hash equiv server, in a serial manner here. I'm over my work VPN which makes things extra slow. Is there an opportunity for batching here?
> > >
> >
> > I've a new failure:
> >
> > 00:20:59.829  Traceback (most recent call last):
> > 00:20:59.829    File
> > "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/server/process.py",
> > line 278, in ProcessServer.idle_commands(delay=0.1,
> > fds=[<socket.socket fd=6, family=AddressFamily.AF_UNIX,
> > type=SocketKind.SOCK_STREAM, proto=0, laddr=bitbake.sock>,
> > <socket.socket fd=18, family=AddressFamily.AF_UNIX,
> > type=SocketKind.SOCK_STREAM, proto=0, laddr=bitbake.sock>,
> > <bb.server.process.ConnectionReader object at 0x7f831b7adb70>]):
> > 00:20:59.829                   try:
> > 00:20:59.829      >                retval = function(self, data, False)
> > 00:20:59.829                       if retval is False:
> > 00:20:59.829    File
> > "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/cooker.py",
> > line 1434, in buildTargetsIdle(server=<ProcessServer(ProcessServer-1,
> > started)>, rq=<bb.runqueue.RunQueue object at 0x7f82f5112f98>,
> > abort=False):
> > 00:20:59.829                   try:
> > 00:20:59.829      >                retval = rq.execute_runqueue()
> > 00:20:59.829                   except runqueue.TaskFailure as exc:
> > 00:20:59.829    File
> > "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
> > line 1522, in RunQueue.execute_runqueue():
> > 00:20:59.829               try:
> > 00:20:59.829      >            return self._execute_runqueue()
> > 00:20:59.829               except bb.runqueue.TaskFailure:
> > 00:20:59.829    File
> > "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
> > line 1488, in RunQueue._execute_runqueue():
> > 00:20:59.829               if self.state is runQueueRunning:
> > 00:20:59.829      >            retval = self.rqexe.execute()
> > 00:20:59.829
> > 00:20:59.829    File
> > "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
> > line 1997, in RunQueueExecute.execute():
> > 00:20:59.829                               else:
> > 00:20:59.829      >
> > self.sqdata.outrightfail.remove(nexttask)
> > 00:20:59.829                           if nexttask in self.sqdata.outrightfail:
> >
> > Just testing locally with:
> >
> > diff --git a/bitbake/lib/bb/runqueue.py b/bitbake/lib/bb/runqueue.py
> > index 71108eeed752..a94a9bb27ae2 100644
> > --- a/bitbake/lib/bb/runqueue.py
> > +++ b/bitbake/lib/bb/runqueue.py
> > @@ -1994,7 +1994,7 @@ class RunQueueExecute:
> >                              self.sq_task_failoutright(nexttask)
> >                              return True
> >                          else:
> > -                            self.sqdata.outrightfail.remove(nexttask)
> > +                            self.sqdata.outrightfail.discard(nexttask)
> >                      if nexttask in self.sqdata.outrightfail:
> >                          logger.debug(2, 'No package found, so
> > skipping setscene task %s', nexttask)
> >                          self.sq_task_failoutright(nexttask)
> >
>
> That change has got me a clean build to complete end to end, which a
> rebuild is then successfully using the sstate-cache.
>

With this change on top of master I've had 7 green builds one after
another, which is better than we've managed in a week.

> But something is upsetting sstate I'm serving back from the jenkins
> box to a local build, as I'm getting different hashes for the same
> sstate:
>
> akiernan@akiernan-virtual-machine:~/nanohub/build$ find sstate-cache
> -name '*quilt-native*populate_sysroot*' -ls
>   2240468     40 -rw-rw-r--   1 akiernan akiernan    39406 Feb  9
> 13:53 sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz.siginfo
>   2241106     56 -rw-rw-r--   1 akiernan akiernan    53302 Feb  9
> 13:53 sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz
>   2634859     40 -rw-rw-r--   1 akiernan akiernan    39387 Feb  9
> 16:16 sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz.siginfo
>   2634858     52 -rw-rw-r--   1 akiernan akiernan    52543 Feb  9
> 16:16 sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz
> akiernan@akiernan-virtual-machine:~/nanohub/build$ bitbake-diffsigs
> sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz.siginfo
> sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz.siginfo
> NOTE: Starting bitbake server...
> akiernan@akiernan-virtual-machine:~/nanohub/build$
>
> Running dumpsig and diffing them manually I'm none the wiser - other
> than variables being a in a different order in the two sstate files,
> they're identical.
>

Whatever my problem is here, it's now vanished again :|

-- 
Alex Kiernan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Cache unihash ... doesn't match BB_UNIHASH ...
  2020-02-09 16:37           ` Joshua Watt
@ 2020-02-12 16:51             ` chris.laplante
  0 siblings, 0 replies; 13+ messages in thread
From: chris.laplante @ 2020-02-12 16:51 UTC (permalink / raw)
  To: Joshua Watt; +Cc: Patches and discussions about the oe-core layer

> One is occasionally I get a lot of undeterministic metadata errors when BB_CACHE_POLICY = "cache", multiconfig, and hash equiv are
> enabled. The errors are all on recipes for which SRCREV = "${AUTOREV}". It doesn't always happen. But it did just now when I rebased
> our "zeus-modified" branch onto the upstream "zeus" branch, to get the changes starting with
> 7dc72fde6edeb5d6ac6b3832530998afeea67cbc.
> 
> Two is that, sometimes "Initializing tasks" stage appears stuck at 44% for a couple minutes. I traced it down to this code in
> runqueue.py (line 1168 on zeus):
> 
>         # Iterate over the task list and call into the siggen code
>         dealtwith = set()
>         todeal = set(self.runtaskentries)
>         while len(todeal) > 0:
>             for tid in todeal.copy():
>                 if len(self.runtaskentries[tid].depends - dealtwith) == 0:
>                     dealtwith.add(tid)
>                     todeal.remove(tid)
>                     self.prepare_task_hash(tid)
> 
> When I instrument the loop to print out the size of "todeal", I see it decrease very slowly, sometimes only a couple at a time. I'm
> guessing this is because prepare_task_hash is contacting the hash equiv server, in a serial manner here. I'm over my work VPN which
> makes things extra slow. Is there an opportunity for batching here?
> 
> Batching is hard because the hashes you request might depend on hashes previous received from the server. It might be possible to
> figure out these dependencies and submit multiple batch requests, but this would require some work in the code. Another option
> would be to have some multi-level caching scheme where you can have a local database that mirrors your centralized server.
> 
> If you have any ideas on how to make it faster, we would love to hear your opinion :)


Gotcha, unfortunately I don't have any ideas right now :/. Regarding the first issue, any ideas there?

Thanks,
Chris


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-02-12 16:51 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-07  8:16 Cache unihash ... doesn't match BB_UNIHASH Alex Kiernan
2020-02-07  8:31 ` Richard Purdie
2020-02-07  9:12   ` Alex Kiernan
2020-02-07 14:43   ` chris.laplante
2020-02-07 15:05     ` Richard Purdie
2020-02-07 15:08       ` chris.laplante
2020-02-07 15:44       ` Alex Kiernan
2020-02-09  0:23         ` chris.laplante
2020-02-09  7:27           ` Alex Kiernan
2020-02-09 16:25             ` Alex Kiernan
2020-02-10 14:22               ` Alex Kiernan
2020-02-09 16:37           ` Joshua Watt
2020-02-12 16:51             ` chris.laplante
