From: Amir Goldstein <amir73il@gmail.com>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>,
	Bart Van Assche <bvanassche@acm.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Luis Chamberlain <mcgrof@kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-block <linux-block@vger.kernel.org>,
	Pankaj Raghav <pankydev8@gmail.com>,
	Josef Bacik <josef@toxicpanda.com>,
	jmeneghi@redhat.com, Jan Kara <jack@suse.cz>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Dan Williams <dan.j.williams@intel.com>, Jake Edge <jake@lwn.net>,
	Klaus Jensen <its@irrelevant.dk>,
	fstests <fstests@vger.kernel.org>, Zorro Lang <zlang@redhat.com>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
Date: Wed, 6 Jul 2022 19:35:57 +0300	[thread overview]
Message-ID: <CAOQ4uxgr8v=h1xi=sfJD9uSp6DR_iAiXScd68Ov7=6Cm-iA+ZA@mail.gmail.com> (raw)
In-Reply-To: <YsWcZbBALgWKS88+@mit.edu>

On Wed, Jul 6, 2022 at 5:30 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Wed, Jul 06, 2022 at 01:11:16PM +0300, Amir Goldstein wrote:
> >
> > So I am wondering what is the status today, because I rarely
> > see fstests failure reports from kernel test bot on the list, but there
> > are some reports.
> >
> > Does anybody have a clue which hw/fs/config/group of fstests the
> > kernel test bot is running on linux-next?
>
> The zero-day test bot only reports test regressions.  So they have some
> list of tests that have failed in the past, and they only report *new*
> test failures.  This is not just true for fstests; it's also true
> for things like check warnings and compiler warnings --- and I suspect
> it's those sorts of reports that caused the zero-day bot to keep
> state, and to filter out test failures and/or check warnings and/or
> compiler warnings, so that only new test failures and/or new compiler
> warnings are reported.  If they didn't, they would be spamming kernel
> developers, and given how.... "kind and understanding" kernel
> developers are at getting spammed, especially when sometimes the
> complaints are bogus ones (either test bugs or compiler bugs), my
> guess is that they did the filtering out of sheer self-defense.  It
> certainly wasn't something requested by a file system developer as far
> as I know.
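
If it helps to make this concrete, that filtering presumably boils down
to a set difference against saved state.  A sketch only -- the state
file name and format are made up, this is not what 0-day actually runs:

    import json
    from pathlib import Path

    STATE = Path("known-failures.json")  # hypothetical state file

    def report_new_failures(current_failures):
        """Report only failures that were not already known."""
        known = set(json.loads(STATE.read_text())) if STATE.exists() else set()
        new = sorted(set(current_failures) - known)
        for test in new:
            print(f"NEW failure: {test}")
        # Remember everything seen so far, so future runs stay quiet
        # about already-known failures.
        STATE.write_text(json.dumps(sorted(known | set(current_failures))))
        return new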
>
>
> So this is how I think an automated system for "drive-by testers"
> should work.  First, the tester would specify the baseline/origin tag,
> and the testing system would run the tests on the baseline once.
> Hopefully, the test runner already has exclude files so that kernel
> bugs that cause an immediate kernel crash or deadlock would already
> be in the exclude list.  But as I've discovered this weekend, for file
> systems that I haven't tried in a few years, like udf or ubifs, etc.,
> there may be missing exclusions for tests that cause the test VM to
> stop responding and/or crash.
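
(As an aside, the expunge handling itself is simple enough -- assuming
the usual one-test-per-line format with '#' comments, something like:)

    def tests_to_run(all_tests, exclude_file):
        """Drop tests listed in an exclude/expunge file (one test per
        line, '#' starts a comment)."""
        with open(exclude_file) as f:
            excluded = {line.split("#", 1)[0].strip() for line in f}
        excluded.discard("")  # ignore blank/comment-only lines
        return [t for t in all_tests if t not in excluded]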
>
> I have a planned improvement where, if you are using gce-xfstests's
> lightweight test manager, since the LTM is constantly reading the
> serial console, a deadlock can be detected and the LTM can restart the
> VM.  The VM can then disambiguate between a forced reboot caused by the
> LTM and a forced shutdown caused by the use of a preemptible VM (a
> planned feature not yet fully implemented), and the test runner
> can skip the tests already run, and skip the test which caused the
> crash or deadlock, and this could be reported so that eventually, the
> test could be added to the exclude file to benefit those people who
> are using kvm-xfstests.  (This is an example of a planned improvement
> in xfstests-bld which, if someone is interested in helping to implement
> it, they should give me a ring.)
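
For the console-watchdog part, I imagine something along these lines --
all of the helpers here are hypothetical callbacks, since I don't know
the LTM internals:

    import re

    # Patterns that typically show up on the serial console when the
    # kernel has crashed or a task has hung.
    TROUBLE = re.compile(r"hung_task|Kernel panic|BUG:|soft lockup")

    def watch_console(read_line, current_test, mark_test_bad, restart_vm):
        """Watch serial console output; on a crash/hang, remember the
        running test (a candidate for the exclude file) and restart the
        VM with a reason the guest can distinguish from a
        preemptible-VM shutdown."""
        for line in iter(read_line, None):  # read_line() returns None at EOF
            if TROUBLE.search(line):
                mark_test_bad(current_test())
                restart_vm(reason="watchdog")
                return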
>
> Once the tests which are failing given a particular baseline are
> known, this state would then get saved, and then now the tests can be
> run on the drive-by developer's changes.  We can now compare the known
> failures for the baseline, with the changed kernels, and if there are
> any new failures, there are two possibilities: (a) this was a new
> failure caused by the drive-by developer's changes, or (b) this was a
> pre-existing known flake.
>
> To disambiguate between these two cases, we now run the failed test N
> times (where N is probably something like 10-50 times; I normally use
> 25 times) on the changed kernel, and get the failure rate.  If the
> failure rate is 100%, then this is almost certainly (a).  If the
> failure rate is < 100% (and greater than 0%), then we rerun the failed
> test on the baseline kernel N times as well; if the baseline failure
> rate is 0%, then we should do a bisection search to determine the
> guilty commit.
>
> If the failure rate on the changed kernel is 0%, then this is either
> an extremely rare flake, in which case we might need to increase N ---
> or it's an example of a test failure which is sensitive to the order
> of the tests which are run, in which case we may need to rerun all of
> the tests in order up to the failed test.
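
If I read the triage logic right, it comes down to something like the
following (run_changed/run_baseline standing in for actual ./check
invocations returning True on a pass; N=25 per your numbers):

    def failure_rate(run_test, test, n=25):
        """Fraction of n standalone runs of `test` that fail."""
        return sum(1 for _ in range(n) if not run_test(test)) / n

    def triage(run_changed, run_baseline, test, n=25):
        changed = failure_rate(run_changed, test, n)
        if changed == 1.0:
            return "regression"      # (a): almost certainly the new changes
        if changed > 0.0:
            if failure_rate(run_baseline, test, n) == 0.0:
                return "bisect"      # reproducible only after the changes
            return "known flake"     # (b): pre-existing flakiness
        # Never reproduced standalone: very rare flake (increase n) or
        # order-dependent -- rerun the sequence up to the failed test.
        return "rare flake or order-dependent"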
>
> This is right now what I do when processing patches for upstream.
> It's also rather similar to what we're doing for the XFS stable
> backports, because it's much more efficient than running the baseline
> tests 100 times (which can take a week of continuous testing per
> Luis's comments) --- we only run tests dozens (or more) of times where
> a potential flake has been found, as opposed to *all* tests.  It's all
> done manually, but it would be great if we could automate this to make
> life easier for XFS stable backporters, and *also* for drive-by
> developers.
>

This process sounds like it could get us to mostly unattended regression
testing, so it sounds good.

I do wonder whether there is more that fstests developers could do to
assist, such as annotating new (and existing) tests to aid in that effort.

For example, there might be a case for tagging a test as "this is a very
reliable test that should have no failures at all - if there is a failure
then something is surely wrong".
I wonder if it would help to have a group like that, and how many
tests such a group would include.
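
E.g. if such a group existed (call it "reliable" -- a hypothetical
name, there is no such group in fstests today), a test runner could
skip the whole rerun dance for its members:

    def needs_rerun(test, failed, reliable_tests):
        """Decide whether a failure warrants the rerun-N-times dance.
        `reliable_tests` is the hypothetical "never flakes" group: a
        failure there is reported as a hard regression right away."""
        if not failed:
            return False
        if test in reliable_tests:
            print(f"{test}: failure in 'reliable' group, report as-is")
            return False
        return True  # potential flake; rerun N times to estimate the rate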

> And again, if anyone is interested in helping with this, especially if
> you're familiar with shell, python 3, and/or the Go language, please
> contact me off-line.
>

Please keep me in the loop when you have a prototype; I may be able
to help test it.

Thanks,
Amir.
