Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges

From: Amir Goldstein <amir73il@gmail.com>
To: Bart Van Assche <bvanassche@acm.org>,
	"Darrick J. Wong" <djwong@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-block <linux-block@vger.kernel.org>,
	Pankaj Raghav <pankydev8@gmail.com>, Theodore Tso <tytso@mit.edu>,
	Josef Bacik <josef@toxicpanda.com>,
	jmeneghi@redhat.com, Jan Kara <jack@suse.cz>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Dan Williams <dan.j.williams@intel.com>, Jake Edge <jake@lwn.net>,
	Klaus Jensen <its@irrelevant.dk>,
	fstests <fstests@vger.kernel.org>, Zorro Lang <zlang@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
Date: Sun, 3 Jul 2022 08:56:54 +0300
Message-ID: <CAOQ4uxgBtMifsNt1SDA0tz098Rt7Km6MAaNgfCeW=s=FPLtpCQ@mail.gmail.com>
In-Reply-To: <a120fb86-5a08-230f-33ee-1cb47381fff1@acm.org>

On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 5/18/22 20:07, Luis Chamberlain wrote:
> > I've been promoting the idea that running fstests once is nice,
> > but things get interesting if you try to run fstests multiple
> > times until a failure is found. It turns out at least kdevops has
> > found tests which fail with a failure rate of typically 1/2 to
> > 1/30 average failure rate. That is 1/2 means a failure can happen
> > 50% of the time, whereas 1/30 means it takes 30 runs to find the
> > failure.
> >
> > I have tried my best to annotate failure rates when I know what
> > they might be on the test expunge list, as an example:
> >
> > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d
> >
> > The term "failure rate 1/15" is 16 characters long, so I'd like
> > to propose to standardize a way to represent this. How about
> >
> > generic/530 # F:1/15
> >
> > Then we could extend the definition. F being current estimate, and this
> > can be just how long it took to find the first failure. A more valuable
> > figure would be failure rate average, so running the test multiple
> > times, say 10, to see what the failure rate is and then averaging the
> > failure out. So this could be a more accurate representation. For this
> > how about:
> >
> > generic/530 # FA:1/15
> >
> > This would mean on average the failure rate has been found to be about
> > 1/15, and this was determined based on 10 runs.
> >
> > We should also go extend check for fstests/blktests to run a test
> > until a failure is found and report back the number of successes.
> >
> > Thoughts?
> >
> > Note: yes, failure rates lower than 1/100 do exist but they are rare
> > creatures. I love them though as my experience shows so far that they
> > uncover hidden bones in the closet, and they may take months and
> > a lot of eyeballs to resolve.
>
> I strongly disagree with annotating tests with failure rates. My opinion
> is that on a given test setup a test either should pass 100% of the time
> or fail 100% of the time. If a test passes in one run and fails in
> another run that either indicates a bug in the test or a bug in the
> software that is being tested. Examples of behaviors that can cause
> tests to behave unpredictably are use-after-free bugs and race
> conditions. How likely it is to trigger such behavior depends on a
> number of factors. This could even depend on external factors like which
> network packets are received from other systems. I do not expect that
> flaky tests have an exact failure rate. Hence my opinion that flaky
> tests are not useful and also that it is not useful to annotate flaky
> tests with a failure rate. If a test is flaky I think that the root
> cause of the flakiness must be determined and fixed.
>

That is true for some use cases, but unfortunately the flaky fstests
are far too valuable, and too hard to replace or improve, to drop.
In practice, fs developers have to run them, even though not everyone does.

Zorro has already proposed to properly tag the non-deterministic tests
with a specific group and I think there is really no other solution.

The only question is whether we remove them from the 'auto' group
(I think we should).

There is probably a large overlap already between the 'stress', 'soak' and
'fuzzers' test groups and the non-deterministic tests.
Moreover, if a test is not a stress/fuzzer test and it is not deterministic,
then the test is likely buggy.

There is only one 'stress' test not in the 'auto' group (generic/019), and
only two 'soak' tests not in the 'auto' group (generic/52{1,2}).
There are only three tests in the 'soak' group, and they are exactly
the same three tests that are in the 'long_rw' group.

So instead of thinking up a new 'flaky', 'random' or 'stochastic' name,
we may just repurpose the 'soak' group for this matter and start
moving known flaky tests from 'auto' to 'soak'.

generic/52{1,2} can be removed from the 'soak' group and remain
in the 'long_rw' group, unless filesystem developers would like to
add those to the stochastic test run.

Filesystem developers who run ./check -g auto -g soak
will get the exact same test coverage as today's -g auto,
and the "commoners" who run ./check -g auto will enjoy blissfully
deterministic test results, at least for the default config of regularly
tested filesystems (a.k.a. the ones tested by the kernel test bot).
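
To illustrate the coverage claim above, here is a minimal Python sketch that
models the groups as sets. It is only a sketch under stated assumptions: the
members of auto_today and known_flaky are made-up placeholders, not real
fstests group membership, and regroup() is a hypothetical helper; only
generic/52{1,2} are taken from the discussion above.

```python
# Hypothetical group membership; placeholder test names, NOT real fstests data.
auto_today = {"generic/001", "generic/002", "generic/003", "generic/004"}
# Three 'soak' tests: one assumed to also be in 'auto', plus generic/52{1,2}.
soak_today = {"generic/004", "generic/521", "generic/522"}
# Assumed set of known flaky tests, all currently in 'auto'.
known_flaky = {"generic/002", "generic/003"}


def regroup(auto, soak, flaky, dropped_from_soak):
    """Move flaky tests from 'auto' to 'soak'; drop the given tests from 'soak'."""
    new_auto = auto - flaky
    new_soak = (soak - dropped_from_soak) | flaky
    return new_auto, new_soak


new_auto, new_soak = regroup(
    auto_today, soak_today, known_flaky, {"generic/521", "generic/522"}
)

# Selecting both groups runs the union of the two sets.
auto_plus_soak = new_auto | new_soak

# Same coverage as today's 'auto' when both groups are selected.
assert auto_plus_soak == auto_today
# No known flaky test is left in the new 'auto' group.
assert not (new_auto & known_flaky)
```

The union only matches today's 'auto' because generic/52{1,2}, which are not
in 'auto', are dropped from 'soak' in this model; if they stay in 'soak',
-g auto -g soak would run those two tests in addition.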

Darrick,

As the one who created the 'soak' group and the only one who added
tests to it, what do you think about this proposal?
What do you think should be done with generic/52{1,2}?

Thanks,
Amir.