On Fri, Sep 24, 2021 at 05:36:31PM -0600, Simon Glass wrote:
> Hi Tom,
>
> On Fri, 24 Sept 2021 at 08:55, Tom Rini wrote:
> >
> > On Fri, Sep 24, 2021 at 08:38:49AM -0600, Simon Glass wrote:
> > > Hi Tom,
> > >
> > > On Fri, 24 Sept 2021 at 08:20, Tom Rini wrote:
> > > >
> > > > On Fri, Sep 24, 2021 at 04:01:21PM +0200, Harald Seiler wrote:
> > > > > Hi Simon,
> > > > >
> > > > > On Mon, 2021-09-20 at 08:06 -0600, Simon Glass wrote:
> > > > > > Hi Harald,
> > > > > >
> > > > > > On Mon, 20 Sept 2021 at 02:12, Harald Seiler wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Sat, 2021-09-18 at 10:37 -0600, Simon Glass wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Is there something screwy with this? It seems that denx-vulcan
> > > > > > > > does two builds at once?
> > > > > > > >
> > > > > > > > https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/323540
> > > > > > >
> > > > > > > Hm, I made some changes to the vulcan runner which might have
> > > > > > > caused this... But still, even if it is running multiple jobs in
> > > > > > > parallel, they should still be isolated, so how does this lead
> > > > > > > to a build failure?
> > > > > >
> > > > > > I'm not sure that it does, but I do see this at the above link:
> > > > > >
> > > > > > Error: Unable to create
> > > > > > '/builds/u-boot/custodians/u-boot-dm/.git/logs/HEAD.lock': File
> > > > > > exists.
> > > > >
> > > > > This is super strange... Each build should be running in its own
> > > > > container, so there should never be a way for such a race to occur.
> > > > > No clue what is going on here...
> > > >
> > > > I know this from having to track down a different oddball failure
> > > > with konsulko-bootbake. It comes down to something along the lines
> > > > of volumes being re-used. Good in that it means that not every job
> > > > is doing a whole clone of the U-Boot tree every time. Bad in that
> > > > if the job gets wedged/killed in a crazy spot, you end up with
> > > > problems like this. If you run a 'find' on vulcan you'll figure out
> > > > which overlay has a problem. Or you can stop the runner for a
> > > > moment and tell docker to purge unused volumes and it'll clear it
> > > > up.
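For reference, the cleanup Tom describes might look roughly like this on
the runner host, assuming a Docker executor with the default storage
location (the paths here are illustrative, not taken from vulcan):

  # Locate the stale lock file left behind by a wedged/killed job
  # (adjust the search root to the host's Docker data directory):
  find /var/lib/docker -name 'HEAD.lock' 2>/dev/null

  # Or stop the runner and drop all unused volumes, so the next job
  # starts from a fresh clone:
  gitlab-runner stop
  docker volume prune -f
  gitlab-runner start
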
> > > > > > Re doing multiple builds, have you set it up so it doesn't
> > > > > > take on the very large builds? I would love to enable multiple
> > > > > > builds for the qemu steps since they mostly use a single CPU,
> > > > > > but am not sure how to do it.
> > > > >
> > > > > Actually, this was more a mistake than an intentional change. I
> > > > > updated the runner on vulcan to also take jobs for some other
> > > > > repos and wanted those jobs to run in parallel. It looks like I
> > > > > just forgot to set the `limit = 1` option for the U-Boot runner.
> > > > >
> > > > > Now, I think doing what you suggest is possible. We need to tag
> > > > > build and "test" jobs differently and then define multiple
> > > > > runners with different limits. E.g. in `.gitlab-ci.yml`:
> > > > >
> > > > >   build all 32bit ARM platforms:
> > > > >     stage: world build
> > > > >     tags:
> > > > >       - build
> > > > >
> > > > >   cppcheck:
> > > > >     stage: testsuites
> > > > >     tags:
> > > > >       - test
> > > > >
> > > > > And then define two runners in `/etc/gitlab-runner/config.toml`:
> > > > >
> > > > >   concurrent = 4
> > > > >
> > > > >   [[runners]]
> > > > >     name = "u-boot builder on vulcan"
> > > > >     limit = 1
> > > > >     ...
> > > > >
> > > > >   [[runners]]
> > > > >     name = "u-boot tester on vulcan"
> > > > >     limit = 4
> > > > >     ...
> > > > >
> > > > > and during registration they get the `build` and `test` tags
> > > > > respectively. This would allow running (in this example) up to 4
> > > > > test jobs concurrently, but only ever one large build job at
> > > > > once.
> > > >
> > > > Yes, but this would also make it harder for people to use the CI
> > > > as-is with their own runners. For example, the only thing stopping
> > > > people from using the free gitlab CI runners on their own is the
> > > > squashfs test being broken.
> > >
> > > Thanks for the info Harald.
> > >
> > > Would it just mean that they would need to add both 'build' and
> > > 'test' tags to their runner? If so, that does not sound onerous.
> >
> > Along with not being able to use the gitlab free runners.
> >
> > > I believe it would speed up CI quite a bit.
> >
> > I'm not sure? First, did you upgrade your runners recently? I started
> > by looking at
> > https://source.denx.de/u-boot/u-boot/-/pipelines/9238/builds and all
> > of the last stage jobs went super quick. But second, assuming the time
>
> They are the same as ever: tui did about 1 build per second on average
> and kaki did 0.5 builds per second, but this has slowed by about 15%
> recently. They both have quite a few cores. It could just be that the
> other two runners were busy, so kaki and tui did everything.
>
> > there includes spinning up the runner, sandbox+clang took 2x as long
> > to run as regular sandbox, to run fewer tests:
> > https://source.denx.de/u-boot/u-boot/-/jobs/326772
> > https://source.denx.de/u-boot/u-boot/-/jobs/326773
>
> Yes, but tui is 2x as fast as kaki (both in terms of number of CPUs and
> single-threaded performance) so that might explain it.
>
> > But we might save a minute, or two, if all of the other much quicker
> > tests ran to completion sooner, but we'd still be stuck waiting on the
> > longest running test.
>
> Yes, which can be many minutes. But each qemu run takes a good minute
> and we have about 30 of them now. Even if all four runners are working
> on them, that is about 7 minutes. In parallel it might only take a
> minute or two.
>
> > So while I think splitting the job into stages, such that if something
> > fails early we call it all off, is worthwhile, a time test where we
> > just have a single stage would mean more stuff in parallel and maybe
> > would be quicker, especially when we have more free runners. And to
> > me, sadly, that's our biggest gating factor and the one that can be
> > solved with money rather than technical wizardry.
>
> Makes sense. The other problem is that, to run the tests in parallel,
> we might need to clean some of them up (the series I sent is a start on
> that). But I think tui could probably run all the qemu jobs in parallel
> at once, for example.
>
> So perhaps we can come back to this when we get parallel tests running.
> It definitely is not efficient at present, in the second (qemu) stage.

OK. And I guess the other part of this would be that you could take
tui/kaki/etc out of general rotation for a bit and run some pipelines to
see what the time change is with your ideas in place.

--
Tom
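
As a footnote to Harald's tagging scheme above, the registration step he
mentions might look roughly like this, again assuming the Docker
executor; the URL, token, and container image below are placeholders:

  # Hypothetical registration of the two runners from the example:
  gitlab-runner register --non-interactive \
    --url https://source.denx.de/ \
    --registration-token TOKEN \
    --description "u-boot builder on vulcan" \
    --tag-list build \
    --limit 1 \
    --executor docker \
    --docker-image ubuntu:20.04

  gitlab-runner register --non-interactive \
    --url https://source.denx.de/ \
    --registration-token TOKEN \
    --description "u-boot tester on vulcan" \
    --tag-list test \
    --limit 4 \
    --executor docker \
    --docker-image ubuntu:20.04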