Re: [RFC] QEMU Gating CI

From: Stefan Hajnoczi <stefanha@redhat.com>
To: Cleber Rosa <crosa@redhat.com>
Cc: "Peter Maydell" <peter.maydell@linaro.org>,
	qemu-devel@nongnu.org,
	"Wainer dos Santos Moschetta" <wainersm@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Jeff Nelson" <jen@redhat.com>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"Ademar Reis" <areis@redhat.com>
Subject: Re: [RFC] QEMU Gating CI
Date: Tue, 3 Dec 2019 14:14:56 +0000
Message-ID: <20191203141456.GB230219@stefanha-x1.localdomain>
In-Reply-To: <20191202181254.GA20551@localhost.localdomain>

On Mon, Dec 02, 2019 at 01:12:54PM -0500, Cleber Rosa wrote:
> On Mon, Dec 02, 2019 at 05:00:18PM +0000, Stefan Hajnoczi wrote:
> > On Mon, Dec 02, 2019 at 09:05:52AM -0500, Cleber Rosa wrote:
> > > RFC: QEMU Gating CI
> > > ===================
> > 
> > Excellent, thank you for your work on this!
> > 
> > > 
> > > This RFC attempts to address most of the issues described in
> > > "Requirements/GatingCI"[1].  Another relevant write-up is the "State of
> > > QEMU CI as we enter 4.0"[2].
> > > 
> > > The general approach is to minimize the infrastructure maintenance
> > > and development burden, leveraging as much as possible "other people's"
> > > infrastructure and code.  GitLab's CI/CD platform is the most relevant
> > > component dealt with here.
> > > 
> > > Problem Statement
> > > -----------------
> > > 
> > > The following is copied verbatim from Peter Maydell's write up[1]:
> > > 
> > > "A gating CI is a prerequisite to having a multi-maintainer model of
> > > merging. By having a common set of tests that are run prior to a merge
> > > you do not rely on who is currently doing merging duties having access
> > > to the current set of test machines."
> > > 
> > > This is a very simplified view of the problem that I'd like to break
> > > down even further into the following key points:
> > > 
> > >  * Common set of tests
> > >  * Pre-merge ("prior to a merge")
> > >  * Access to the current set of test machines
> > >  * Multi-maintainer model
> > > 
> > > Common set of tests
> > > ~~~~~~~~~~~~~~~~~~~
> > > 
> > > Before we delve any further, let's make it clear that a "common set of
> > > tests" is really a "dynamic common set of tests".  My point is that a
> > > set of tests in QEMU may include or exclude different tests depending
> > > on the environment.
> > > 
> > > The exact tests that will be executed may differ depending on the
> > > environment, including:
> > > 
> > >  * Hardware
> > >  * Operating system
> > >  * Build configuration
> > >  * Environment variables
> > > 
> > > In the "State of QEMU CI as we enter 4.0" Alex Bennée listed some of
> > > those "common set of tests":
> > > 
> > >  * check
> > >  * check-tcg
> > >  * check-softfloat
> > >  * check-block
> > >  * check-acceptance
> > > 
> > > While Peter mentions that most of his checks are limited to:
> > > 
> > >  * check
> > >  * check-tcg
> > > 
> > > Our current inability to quickly identify a faulty test from test
> > > execution results (especially in remote environments), and act upon
> > > it (say quickly disable it on a given host platform), makes me believe
> > > that it's fair to start a gating CI implementation that uses this
> > > rather coarse granularity.
> > > 
> > > Another benefit is a close or even a 1:1 relationship between a common
> > > test set and an entry in the CI configuration.  For instance, the
> > > "check" common test set would map to a "make check" command in a
> > > "script:" YAML entry.
> > > 
> > > To exemplify my point, if one specific test run as part of "check-tcg"
> > > is found to be faulty on a specific job (say on a specific OS), the
> > > entire "check-tcg" test set may be disabled as a CI-level maintenance
> > > action.  Of course a follow up action to deal with the specific test
> > > is required, probably in the form of a Launchpad bug and patches
> > > dealing with the issue, but without necessarily a CI related angle to
> > > it.
> > 
> > I think this coarse level of granularity is unrealistic.  We cannot
> > disable 99 tests because of 1 known failure.  There must be a way of
> > disabling individual tests.  You don't need to implement it yourself,
> > but I think this needs to be solved by someone before a gating CI can be
> > put into use.
> >
> 
> IMO it should be realistic if you look at it from a "CI related
> angle".  The pull request could still be revised to disable a single
> test because of a known failure, but this would not necessarily be
> related to the CI.

That sounds fine, thanks.  I interpreted the text a little differently.
I agree this functionality doesn't need to be present in order to move to
GitLab.

> 
> > It probably involves adding a "make EXCLUDE_TESTS=foo,bar check"
> > variable so that .gitlab-ci.yml can be modified to exclude specific
> > tests on certain OSes.
> >
> 
> I certainly acknowledge the issue, but I don't think this (and many
> other issues that will certainly come up) should be a blocker to the
> transition to GitLab.
> 
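
To make the EXCLUDE_TESTS idea concrete, here is a minimal sketch of the
selection logic.  It is only an illustration: EXCLUDE_TESTS does not exist
in QEMU's makefiles today, and the function name and the list of test
names below are made up for the example.

```python
import os


def select_tests(all_tests, exclude_spec):
    """Return all_tests minus the names listed in exclude_spec.

    exclude_spec is a comma-separated string such as "foo,bar", the
    format of the hypothetical EXCLUDE_TESTS variable.
    """
    excluded = {name.strip() for name in exclude_spec.split(",") if name.strip()}
    unknown = excluded - set(all_tests)
    if unknown:
        # A misspelled name would leave the known-bad test enabled,
        # so fail loudly instead of ignoring it.
        raise ValueError(
            "unknown tests in EXCLUDE_TESTS: %s" % ", ".join(sorted(unknown)))
    return [t for t in all_tests if t not in excluded]


if __name__ == "__main__":
    # Illustrative test names only, not the real "make check" list.
    tests = ["tests/migration-test", "tests/qom-test", "tests/boot-order-test"]
    for t in select_tests(tests, os.environ.get("EXCLUDE_TESTS", "")):
        print(t)
```

A CI job for one OS could then set EXCLUDE_TESTS=tests/migration-test in
its environment while every other job keeps running the full set.
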
> > > 
> > > If/when test result presentation and control mechanisms evolve, we may
> > > feel confident and move to a finer granularity.  For instance, a
> > > mechanism for disabling nothing but "tests/migration-test" on a given
> > > environment would be possible and desirable from a CI management level.
> > > 
> > > Pre-merge
> > > ~~~~~~~~~
> > > 
> > > The natural way to have pre-merge CI jobs in GitLab is to send "Merge
> > > Requests"[3] (abbreviated as "MR" from now on).  In most projects, a
> > > MR comes from individual contributors, usually the authors of the
> > > changes themselves.  It's my understanding that the current maintainer
> > > model employed in QEMU will *not* change at this time, meaning that
> > > code contributions and reviews will continue to happen on the mailing
> > > list.  A maintainer then, having collected a number of patches, would
> > > submit a MR either in addition or in substitution to the Pull Requests
> > > sent to the mailing list.
> > > 
> > > "Pipelines for Merged Results"[4] is a very important feature to
> > > support the multi-maintainer model, and looks, in practice, similar to
> > > Peter's "staging" branch approach, with an "automatic refresh" of the
> > > target branch.  It can give a maintainer extra confidence that a MR
> > > will play nicely with the updated status of the target branch.  It's
> > > my understanding that it should be the "key to the gates".  A minor
> > > note is that conflicts are still possible in a multi-maintainer model
> > > if there is more than one person doing the merges.
> > 
> > The intention is to have only 1 active maintainer at a time.  The
> > maintainer will handle all merges for the current QEMU release and then
> > hand over to the next maintainer after the release has been made.
> > 
> > Solving the problem for multiple active maintainers is low priority at
> > the moment.
> >
> 
> Even so, I have the impression that the following workflow:
> 
>  - Look at Merge Results Pipeline for MR #1
>  - Merge MR #1
>  - Hack on something else
>  - Look at *automatically updated* Merge Results Pipeline for MR #2
>  - Merge MR #2
> 
> Is better than:
> 
>  - Push PR #1 to staging
>  - Wait for PR #1 Pipeline to finish
>  - Look at PR #1 Pipeline results
>  - Push staging into master
>  - Push PR #2 to staging 
>  - Wait for PR #2 Pipeline to finish
>  - Push staging into master
> 
> But I don't think I'll be a direct user of those workflows, so I'm
> completely open to feedback on it.

If the goal is to run multiple trees through the CI in parallel then
multiple branches can be used.  I guess I'm just

> 
> > > A worthy point is that the GitLab web UI is not the only way to create
> > > a Merge Request: a rich set of APIs is also available[5].  This is
> > > interesting for many reasons.  For instance, some of Peter's
> > > "apply-pullreq"[6] actions (such as the checks for bad UTF8 or bogus
> > > qemu-devel email addresses) could be made earlier, as part of a
> > > "send-mergereq"-like script, bringing conformance earlier in the merge
> > > process, at the MR creation stage.
> > > 
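
A "send-mergereq"-like script could look roughly like the sketch below.
The two checks are my guesses at what apply-pullreq looks for (commit
data that is not valid UTF-8, and an author field that carries the list
address); the project path, branch names and function names are
placeholders.  The only GitLab specifics it relies on are the v4
"POST /projects/:id/merge_requests" endpoint and its source_branch,
target_branch and title parameters.  Nothing is sent; the request is
only constructed.

```python
import json
import urllib.parse
import urllib.request

LIST_ADDRESS = "qemu-devel@nongnu.org"


def is_valid_utf8(raw):
    """True if the raw commit bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def has_bogus_author(author):
    """True if the author field contains the list address.

    Assumption: "bogus" means the mailing list address ended up in the
    author field in place of the contributor's own address.
    """
    return LIST_ADDRESS in author.lower()


def build_merge_request(project, source_branch, target_branch, title,
                        base_url="https://gitlab.com"):
    """Construct (but do not send) the API request that opens an MR."""
    # GitLab accepts the URL-encoded project path in place of a numeric ID.
    project_id = urllib.parse.quote(project, safe="")
    url = "%s/api/v4/projects/%s/merge_requests" % (base_url, project_id)
    body = json.dumps({
        "source_branch": source_branch,
        "target_branch": target_branch,
        "title": title,
    }).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", "application/json")
    # A real script would also add a PRIVATE-TOKEN header before sending.
    return req
```

The script would run the checks over every commit in the range first and
only build the request if they all pass, so a malformed pull never
reaches the CI.
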
> > > Note: It's possible to have CI job definitions that are specific to
> > > MRs, allowing generic non-MR jobs to be kept in the default
> > > configuration.  This can be used so individual contributors continue
> > > to leverage some of the "free" (shared) runners made available on
> > > gitlab.com.
> > 
> > I expected this section to say:
> > 1. Maintainer sets up a personal gitlab.com account with a qemu.git fork.
> > 2. Maintainer adds QEMU's CI tokens to their personal account.
> > 3. Each time a maintainer pushes to their "staging" branch the CI
> >    triggers.
> > 
> > IMO this model is simpler than MRs because once it has been set up the
> > maintainer just uses git push.  Why are MRs necessary?
> >
> 
> I am not sure GitLab "Specific Runners" can be used from other
> accounts/forks.  AFAICT, you'd need a MR to send jobs that would run
> on those machines, because (again AFAICT) the token used to register
> those gitlab-runner instances on those machines is not shareable
> across forks.  But, I'll double check that.

Another question:
Is a Merge Request necessary in order to trigger the CI or is just
pushing to a branch enough?  With GitHub + Travis just pushing is
enough.

Stefan
