On Thu, Aug 15, 2019 at 05:05:48PM +0200, Martin Jansa wrote:
> On Tue, Aug 13, 2019 at 10:04:08AM +0100, Richard Purdie wrote:
> > On Mon, 2019-08-12 at 20:26 +0000, Peter Kjellerstedt wrote:
> > > Comparing that build to a corresponding do-nothing build with Thud,
> > > the time difference matches those three minutes where I have no idea
> > > what bitbake is doing now that it didn’t need to do before…
> > >  
> > > Hopefully these time degradations can be solved, because the current
> > > state of bitbake is barely usable. We also need to look into possible
> > > ways to improve the cooker output when it is running the setscene
> > > tasks so it makes some kind of sense again.
> > 
> > We talked on irc and you pointed at the commit things started to go
> > wrong. Just to summarise things for the benefit of the list, this is
> > some quick testing I did:
> > 
> > "bitbake -p; time bitbake core-image-minimal -n"
> > 
> > 30.0s 6c7c0cefd34067311144a1d4c01986fe0a4aef26
> > 30.6s a0d941c787cf3ef030d190903279d311bc05d752
> > 40.3s 7df31ff36892c2f9c65326b06b4c70 
> > 42.2s a0542ed3ff700eca35f9195f743c9e28bcd50f3e 
> > 45.4s 9983b07fffd19082abded7c3f15cc77d306dd69c 
> > 76.9s master-next
> 
> I know I'm late to the party, but it took really long for the test to finish..
> 
> I'm not using poky, so reproduce this testing on our builds I've used
> bitbake revisions matching with poky revision RP was testing:
> 
> + oe-core 18cdc087fd5da30e2b31f3d4e81b153cd36ca844 for older bitbake, because current master isn't compatible with old bitbake (bb.data_smart.ExpansionError: Failure expanding variable OE_IMPORTED[:=], expression was ${@oe_import(d)} which triggered exception AttributeError: module 'bb.siggen' has no attribute 'SignatureGeneratorUniHashMixIn')
> and oe-core 23db236a054ee7a989cdbbcb42ad5c6eefd4a6ae for newer bitbake (to possibly eliminate impact of metadata changes)
> 
> tested on 72core (Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz) with 380GB RAM
> 
> nothing in TMPDIR/SSTATE_DIR, no SSTATE_MIRRO, no Hash Equivalence Server configured
> 
> on a build with few more layers:
> Parsing of 3238 .bb files complete (0 cached, 3238 parsed). 7632 targets, 1927 skipped, 54 masked, 0 errors.
> 
> but first doing just core-image-minimal like RP did:
> 
> time		poky						bitbake						oe-core
> 2m8.191s	6c7c0cefd34067311144a1d4c01986fe0a4aef26	18d4a31fdcec1f0e5d2199d6142f0ce833fca1a7	18cdc087fd5da30e2b31f3d4e81b153cd36ca844
> N/A		a0d941c787cf3ef030d190903279d311bc05d752	doesn't exist in poky/poky-contrib		18cdc087fd5da30e2b31f3d4e81b153cd36ca844
> 2m17.053s	7df31ff36892c2f9c65326b06b4c70			1f630fdf0260db08541d3ca9f25f852931c19905	18cdc087fd5da30e2b31f3d4e81b153cd36ca844
> 2m18.515s	a0542ed3ff700eca35f9195f743c9e28bcd50f3e	f43778c2d19e70d4befd483b06aaf247fc65c799	23db236a054ee7a989cdbbcb42ad5c6eefd4a6ae
> 2m22.220s	9983b07fffd19082abded7c3f15cc77d306dd69c	8c26b451f22193ef1c544e2017cc84515566c1b8
> 2m38.185s	master-next					fbcb89dc3dbabfc80310e9a4ebe72d919300a32e
> cache.py:446: ResourceWarning: unclosed <socket.socket fd=10, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 41117)>
>   value = pickled.load()
> started showing with this revision
> 2m17.991s	master-next + fixups				f7cd14f90463957b3e4be6d3876def98b78f1424
> 2m9.651s	master-next + "Holding off tasks %s" out
> 
> now world
> 253m36.637s	7df31ff36892c2f9c65326b06b4c70			1f630fdf0260db08541d3ca9f25f852931c19905	18cdc087fd5da30e2b31f3d4e81b153cd36ca844
> 174m13.324s	a0542ed3ff700eca35f9195f743c9e28bcd50f3e	f43778c2d19e70d4befd483b06aaf247fc65c799	23db236a054ee7a989cdbbcb42ad5c6eefd4a6ae
>    this is time when killed at "NOTE: Executing Tasks" without showing any progress on the tasks (unlike other tests) and only one bitbake process is running
> 633m27.475s	master-next					fbcb89dc3dbabfc80310e9a4ebe72d919300a32e
>    this is time when killed at (1417 from 71749) - running very slowly unlike other bitbake revisions where the task number changes relatively quickly - about 10 tasks per second
> 89m13.992s	master-next + "Holding off tasks %s" out
> 89m59.201s	master-next updated today			85fe627fdb6510f0942917964386fad9d8c479c8
> 
> Interestingly old 1f630fdf0260db08541d3ca9f25f852931c19905 is 3 times
> slower than recent master-next.
> 
> I'll send -P output next.


So 1f630fdf0260db08541d3ca9f25f852931c19905 was already "bad", because
this commit introduced that
logger.debug(2, "Holding off tasks %s" % pprint.pformat(self.holdoff_tasks))

Here is -P output from it:

5min (til "Executing Tasks" message):
profile.log.processed.1f630fdf0260db08541d3ca9f25f852931c19905.return.at.Executing.Tasks
http://paste.ubuntu.com/p/Nw6n5hTPmF/

5hours (whole dry run):
profile.log.processed.1f630fdf0260db08541d3ca9f25f852931c19905.whole.dry.run
http://paste.ubuntu.com/p/sydFGvYTxQ/
profile.log.processed.1f630fdf0260db08541d3ca9f25f852931c19905.worker.log.processed
http://paste.ubuntu.com/p/RMGbm9DyZz/

So for good old baseline I need to go few commits back, to e.g. 18d4a31f RP was testing before:

1f630fdf runqueue: Enable dynamic task adjustment to hash equivalency
a0d941c7 siggen: Convert to use self.unitaskhashes
1f326f2c siggen: Add new unitaskhashes data variable which is cached
b707d0cb runqueue: Improve scenequeue processing logic
19a6e356 cache: Add SimpleCache class
7bb79099 siggen: Import unihash code from OE-Core
7aec8632 hashserv: SQL Optimizations
18d4a31f fetch2/wget: avoid 'maximum recursion depth' RuntimeErrors when handling 403 codes

94m19.081s
profile.log.processed.18d4a31fdcec1f0e5d2199d6142f0ce833fca1a7.whole.dry.run
http://paste.ubuntu.com/p/Jnvv8GJNKB/

And here is profile with latest master from yesterday:
90mins:
profile.log.processed.a5e084a266f63c2fd370122327615e49beaeb94e.whole.dry.run
http://paste.ubuntu.com/p/YhMCHPmg56/

So good news is that current master is a bit faster than "good old" 18d4a31fdcec1f0e5d2199d6142f0ce833fca1a7.

But still significantly slower on my fast builder compared to world tests on poky + meta-oe,
not sure if it's all just because the size of metadata in my builds (with 71K tasks).

Even the dry running the tasks was about 10 tasks/second while RP was seeing around 200 tasks/second.

Will try to bump BB_NUMBER_THREADS from 8 to 72.

I've already disabled rm_work and PRserv, but that doesn't seem to have any significant impact on performance here. Other than resolving the PRserv terminating and then warnings about lost connection.

Cheers,
> 
> > 
> > So basically the original changes showed a 25% hit but the performance
> > of -next is dire.
> > 
> > This is with no hash equiv server configured.
> > 
> > It will vary depending on the target used (numbers with -sato for the
> > above would be interesting for comparision) and how much was or is in
> > sstate, they type of sstate mirror configured and so on.
> > 
> > I really need to focus on getting the new code functioning correctly
> > before we attempt to optimise but if nobody tests the new code due to
> > performance problems we have a different issue. We also have a scaling
> > problem with the hash server itself I need to fix to stop the
> > autobuilder throwing weird errors. I'm therefore a bit challenged on
> > where to start with it all :/.
> > 
> > Cheers,
> > 
> > Richard
> > 
> > 
> > 
> > -- 
> > _______________________________________________
> > Openembedded-core mailing list
> > Openembedded-core@lists.openembedded.org
> > http://lists.openembedded.org/mailman/listinfo/openembedded-core
> 
> -- 
> Martin 'JaMa' Jansa     jabber: Martin.Jansa@gmail.com



-- 
Martin 'JaMa' Jansa     jabber: Martin.Jansa@gmail.com