* [RFC: kdevops] Standardizing on failure rate nomenclature for expunges @ 2022-05-19 3:07 Luis Chamberlain 2022-05-19 6:36 ` Amir Goldstein 2022-07-02 21:48 ` Bart Van Assche 0 siblings, 2 replies; 32+ messages in thread From: Luis Chamberlain @ 2022-05-19 3:07 UTC (permalink / raw) To: linux-fsdevel, linux-block Cc: amir73il, pankydev8, tytso, josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen I've been promoting the idea that running fstests once is nice, but things get interesting if you try to run fstests multiple times until a failure is found. It turns out at least kdevops has found tests which fail with an average failure rate of typically 1/2 to 1/30. That is, 1/2 means a failure can happen 50% of the time, whereas 1/30 means it takes 30 runs to find the failure. I have tried my best to annotate failure rates when I know what they might be on the test expunge list, as an example: workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d The term "failure rate 1/15" is 16 characters long, so I'd like to propose to standardize a way to represent this. How about generic/530 # F:1/15 Then we could extend the definition. F being the current estimate, and this can be just how long it took to find the first failure. A more valuable figure would be the failure rate average, so running the test multiple times, say 10, to see what the failure rate is and then averaging the failures out. So this could be a more accurate representation. For this how about: generic/530 # FA:1/15 This would mean the failure rate has been found, on average, to be about 1/15, and this was determined based on 10 runs. We should also go extend check for fstests/blktests to run a test until a failure is found and report back the number of successes. Thoughts? Note: yes, failure rates lower than 1/100 do exist but they are rare creatures. 
I love them though, as my experience so far shows that they uncover hidden bones in the closet, and they may take months and a lot of eyeballs to resolve. Luis ^ permalink raw reply [flat|nested] 32+ messages in thread
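The "run a test until a failure is found and report back the number of successes" extension Luis proposes could be prototyped outside of fstests first. Below is a minimal sketch (not actual fstests/kdevops code; the `run_test` callback stands in for shelling out to `./check <test>`):

```python
def run_until_failure(run_test, max_runs=100):
    """Run a single test repeatedly until it fails.

    run_test: callable returning True on pass, False on fail
              (in practice this would invoke './check <test>').
    Returns the number of successful runs before the first failure,
    or None if no failure was observed within max_runs.
    """
    for successes in range(max_runs):
        if not run_test():
            return successes
    return None
```

Under this scheme, an annotation like "F:1/15" would correspond to `run_until_failure()` returning 14: fourteen passes, then a failure on the fifteenth run.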
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 3:07 [RFC: kdevops] Standardizing on failure rate nomenclature for expunges Luis Chamberlain @ 2022-05-19 6:36 ` Amir Goldstein 2022-05-19 7:58 ` Dave Chinner 2022-05-19 11:24 ` Zorro Lang 2022-07-02 21:48 ` Bart Van Assche 1 sibling, 2 replies; 32+ messages in thread From: Amir Goldstein @ 2022-05-19 6:36 UTC (permalink / raw) To: Luis Chamberlain Cc: linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, Zorro Lang, fstests [adding fstests and Zorro] On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > I've been promoting the idea that running fstests once is nice, > but things get interesting if you try to run fstests multiple > times until a failure is found. It turns out at least kdevops has > found tests which fail with a failure rate of typically 1/2 to > 1/30 average failure rate. That is 1/2 means a failure can happen > 50% of the time, whereas 1/30 means it takes 30 runs to find the > failure. > > I have tried my best to annotate failure rates when I know what > they might be on the test expunge list, as an example: > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > The term "failure rate 1/15" is 16 characters long, so I'd like > to propose to standardize a way to represent this. How about > > generic/530 # F:1/15 > I am not fond of the 1/15 annotation at all, because the only fact that you are able to document is that the test failed after 15 runs. Suggesting that this means failure rate of 1/15 is a very big step. > Then we could extend the definition. F being current estimate, and this > can be just how long it took to find the first failure. 
A more valuable > figure would be failure rate avarage, so running the test multiple > times, say 10, to see what the failure rate is and then averaging the > failure out. So this could be a more accurate representation. For this > how about: > > generic/530 # FA:1/15 > > This would mean on average there failure rate has been found to be about > 1/15, and this was determined based on 10 runs. > > We should also go extend check for fstests/blktests to run a test > until a failure is found and report back the number of successes. > > Thoughts? > I have had a discussion about those tests with Zorro. Those tests that some people refer to as "flaky" are valuable, but they are not deterministic, they are stochastic. I think MTBF is the standard way to describe reliability of such tests, but I am having a hard time imagining how the community can manage to document accurate annotations of this sort, so I would stick with documenting the facts (i.e. the test fails after N runs). OTOH, we do have deterministic tests, maybe even the majority of fstests are deterministic(?) Considering that every auto test loop takes ~2 hours on our rig and that I have been running over 100 loops over the past two weeks, if half of fstests are deterministic, that is a lot of wait time and a lot of carbon emission gone to waste. It would have been nice if I was able to exclude a "deterministic" group. The problem is - can a developer ever tag a test as being "deterministic"? Thanks, Amir. ^ permalink raw reply [flat|nested] 32+ messages in thread
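Amir's objection that "the test failed after 15 runs" does not establish a 1/15 failure rate can be made quantitative. A sketch using the Wilson score interval for a binomial proportion (standard library only; this is an illustration, not something fstests computes):

```python
import math

def wilson_interval(failures, runs, z=1.96):
    """Approximate 95% Wilson score confidence interval for the
    underlying failure probability, given `failures` out of `runs`.
    Shows how little one failure in 15 runs pins down the rate."""
    p = failures / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half, center + half
```

For 1 failure in 15 runs, `wilson_interval(1, 15)` gives roughly (0.012, 0.30): the true rate could plausibly be anywhere from about 1/80 to about 1/3, which is why annotating "F:1/15" as a rate rather than a raw observation is a big step.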
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 6:36 ` Amir Goldstein @ 2022-05-19 7:58 ` Dave Chinner 2022-05-19 9:20 ` Amir Goldstein 2022-05-19 11:24 ` Zorro Lang 1 sibling, 1 reply; 32+ messages in thread From: Dave Chinner @ 2022-05-19 7:58 UTC (permalink / raw) To: Amir Goldstein Cc: Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, Zorro Lang, fstests On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote: > [adding fstests and Zorro] > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > I've been promoting the idea that running fstests once is nice, > > but things get interesting if you try to run fstests multiple > > times until a failure is found. It turns out at least kdevops has > > found tests which fail with a failure rate of typically 1/2 to > > 1/30 average failure rate. That is 1/2 means a failure can happen > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > failure. > > > > I have tried my best to annotate failure rates when I know what > > they might be on the test expunge list, as an example: > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > to propose to standardize a way to represent this. How about > > > > generic/530 # F:1/15 > > > > I am not fond of the 1/15 annotation at all, because the only fact that you > are able to document is that the test failed after 15 runs. > Suggesting that this means failure rate of 1/15 is a very big step. > > > Then we could extend the definition. F being current estimate, and this > > can be just how long it took to find the first failure. 
A more valuable > > figure would be failure rate avarage, so running the test multiple > > times, say 10, to see what the failure rate is and then averaging the > > failure out. So this could be a more accurate representation. For this > > how about: > > > > generic/530 # FA:1/15 > > > > This would mean on average there failure rate has been found to be about > > 1/15, and this was determined based on 10 runs. These tests are run on multiple different filesystems. What happens if you run xfs, ext4, btrfs, overlay in sequence? We now have 4 test results, and 1 failure. Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1? What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas? Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1? In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs 1/1 breakdown is useful information, because it tells us which filesystem failed the test, or which specific config failed the test. Hence I think the ability for us to draw useful conclusions from a number like this is largely dependent on the specific data set it is drawn from... > > We should also go extend check for fstests/blktests to run a test > until a failure is found and report back the number of successes. > > > > Thoughts? Who is the expected consumer of this information? I'm not sure it will be meaningful for anyone developing new code and needing to run every test every time they run fstests. OTOH, for a QA environment where you have a fixed progression of the kernel releases you are testing, it's likely valuable and already being tracked in various distro QE management tools and dashboards.... > I have had a discussion about those tests with Zorro. > > Those tests that some people refer to as "flaky" are valuable, > but they are not deterministic, they are stochastic. Extremely valuable. Worth their weight in gold to developers like me. The recoveryloop group tests are a good example of this. 
The name of the group indicates how we use it. I typically set it up to run with a loop iteration like "-I 100" knowing that it will likely fail a random test in the group within 10 iterations. Those one-off failures are almost always a real bug, and they are often unique and difficult to reproduce exactly. Post-mortem needs to be performed immediately because it may well be a unique one-off failure and running another test after the failure destroys the state needed to perform a post-mortem. Hence having a test farm running these multiple times and then reporting "failed once in 15 runs" isn't really useful to me as a developer - it doesn't tell us anything new, nor does it help us find the bugs that are being tripped over. Less obvious stochastic tests exist, too. There are many tests that use fsstress as a workload that runs while some other operation is performed - freeze, grow, ENOSPC, error injections, etc. They will never be deterministic, and again any failure tends to be a real bug, too. However, I think these should be run by QE environments all the time as they require long term, frequent execution across different configs in different environments to find the deep dark corners where the bugs may lie dormant. These are the tests that find things like subtle timing races no other tests ever exercise. I suspect that tests that alter their behaviour via LOAD_FACTOR or TIME_FACTOR will fall into this category. > I think MTBF is the standard way to describe reliability > of such tests, but I am having a hard time imagining how > the community can manage to document accurate annotations > of this sort, so I would stick with documenting the facts > (i.e. the test fails after N runs). I'm unsure of what "reliability of such tests" means in this context. The tests are trying to exercise and measure the reliability of the kernel code - if the *test is unreliable* then that says to me the test needs fixing. 
If the test is reliable, then any failures that occur indicate that the filesystem/kernel/fs tools are unreliable, not the test.... "test reliability" and "reliability of filesystem under test" are different things with similar names. The latter is what I think we are talking about measuring and reporting here, right? > OTOH, we do have deterministic tests, maybe even the majority of > fstests are deterministic(?) Very likely. As a generalisation, I'd say that anything that has a fixed, single step at a time recipe and a very well defined golden output or exact output comparison match is likely deterministic. We use things like 'within tolerance' so that slight variations in test results don't cause spurious failures and hence make the test more deterministic. Hence any test that uses 'within_tolerance' is probably a test that is expecting deterministic behaviour.... > Considering that every auto test loop takes ~2 hours on our rig and that > I have been running over 100 loops over the past two weeks, if half > of fstests are deterministic, that is a lot of wait time and a lot of carbon > emission gone to waste. > > It would have been nice if I was able to exclude a "deterministic" group. > The problem is - can a developer ever tag a test as being "deterministic"? fstests allows private exclude lists to be used - perhaps these could be used to start building such a group for your test environment. Building a list from the tests you never see fail in your environment could be a good way to seed such a group... Maybe you have all the raw results from those hundreds of tests sitting around - what does crunching that data look like? Who else has large sets of consistent historic data sitting around? I don't because I pollute my results archive by frequently running varied and badly broken kernels through fstests, but people who just run released or stable kernels may have data sets that could be used.... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 32+ messages in thread
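Dave's point about pooled ratios can be illustrated with a small tally that preserves the per-filesystem (or per-config) breakdown instead of collapsing everything into a single fraction. A sketch (the config labels are just illustrative strings, not an fstests data format):

```python
from collections import defaultdict

def per_config_tally(results):
    """results: iterable of (config, passed) pairs.
    Returns {config: 'failures/runs'}, preserving *which* config
    failed rather than pooling everything into a ratio like 1/4."""
    counts = defaultdict(lambda: [0, 0])  # config -> [failures, runs]
    for config, passed in results:
        counts[config][1] += 1
        if not passed:
            counts[config][0] += 1
    return {cfg: f"{fails}/{runs}" for cfg, (fails, runs) in counts.items()}
```

Running xfs, ext4, btrfs, overlay in sequence with one xfs failure yields `{"xfs": "1/1", "ext4": "0/1", "btrfs": "0/1", "overlay": "0/1"}` - the 1/1 vs 0/1 breakdown Dave argues is the useful information, where a pooled "1/4" would not be.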
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 7:58 ` Dave Chinner @ 2022-05-19 9:20 ` Amir Goldstein 2022-05-19 15:36 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Amir Goldstein @ 2022-05-19 9:20 UTC (permalink / raw) To: Dave Chinner Cc: Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, Zorro Lang, fstests On Thu, May 19, 2022 at 10:58 AM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote: > > [adding fstests and Zorro] > > > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > > > I've been promoting the idea that running fstests once is nice, > > > but things get interesting if you try to run fstests multiple > > > times until a failure is found. It turns out at least kdevops has > > > found tests which fail with a failure rate of typically 1/2 to > > > 1/30 average failure rate. That is 1/2 means a failure can happen > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > > failure. > > > > > > I have tried my best to annotate failure rates when I know what > > > they might be on the test expunge list, as an example: > > > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > > to propose to standardize a way to represent this. How about > > > > > > generic/530 # F:1/15 > > > > > > > I am not fond of the 1/15 annotation at all, because the only fact that you > > are able to document is that the test failed after 15 runs. > > Suggesting that this means failure rate of 1/15 is a very big step. > > > > > Then we could extend the definition. 
F being current estimate, and this > > > can be just how long it took to find the first failure. A more valuable > > > figure would be failure rate avarage, so running the test multiple > > > times, say 10, to see what the failure rate is and then averaging the > > > failure out. So this could be a more accurate representation. For this > > > how about: > > > > > > generic/530 # FA:1/15 > > > > > > This would mean on average there failure rate has been found to be about > > > 1/15, and this was determined based on 10 runs. > > These tests are run on multiple different filesystems. What happens > if you run xfs, ext4, btrfs, overlay in sequence? We now have 4 > tests results, and 1 failure. > > Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1? > > What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas? > > Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1? > > In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs > 1/1 breakdown is useful information, because it tells us whihc > filesystem failed the test, or which specific config failed the > test. > > Hence I think the ability for us to draw useful conclusions from a > number like this is large dependent on the specific data set it is > drawn from... > > > > We should also go extend check for fstests/blktests to run a test > > > until a failure is found and report back the number of successes. > > > > > > Thoughts? > > Who is the expected consumer of this information? > > I'm not sure it will be meaningful for anyone developing new code > and needing to run every test every time they run fstests. > > OTOH, for a QA environment where you have a fixed progression of the > kernel releases you are testing, it's likely valuable and already > being tracked in various distro QE management tools and > dashboards.... > > > I have had a discussion about those tests with Zorro. 
> > > > Those tests that some people refer to as "flaky" are valuable, > > but they are not deterministic, they are stochastic. > > Extremely valuable. Worth their weight in gold to developers like > me. > > The recoveryloop group tests are a good example of this. The name of > the group indicates how we use it. I typically set it up to run with > an loop iteration like "-I 100" knowing that is will likely fail a > random test in the group within 10 iterations. > > Those one-off failures are almost always a real bug, and they are > often unique and difficult to reproduce exactly. Post-mortem needs > to be performed immediately because it may well be a unique on-off > failure and running another test after the failure destroys the > state needed to perform a post-mortem. > > Hence having a test farm running these multiple times and then > reporting "failed once in 15 runs" isn't really useful to me as a > developer - it doesn't tell us anything new, nor does it help us > find the bugs that are being tripped over. > > Less obvious stochastic tests exist, too. There are many tests that > use fstress as a workload that runs while some other operation is > performed - freeze, grow, ENOSPC, error injections, etc. They will > never be deterministic, any again any failure tends to be a real > bug, too. > > However, I think these should be run by QE environments all the time > as they require long term, frequent execution across different > configs in different environments to find the deep dark corners > where the bugs may lie dormant. These are the tests that find things > like subtle timing races no other tests ever exercise. > > I suspect that tests that alter their behaviour via LOAD_FACTOR or > TIME_FACTOR will fall into this category. 
> > > I think MTBF is the standard way to describe reliability > > of such tests, but I am having a hard time imagining how > > the community can manage to document accurate annotations > > of this sort, so I would stick with documenting the facts > > (i.e. the test fails after N runs). > > I'm unsure of what "reliablity of such tests" means in this context. > The tests are trying to exercise and measure the reliability of the > kernel code - if the *test is unreliable* then that says to me the > test needs fixing. If the test is reliable, then any failures that > occur indicate that the filesystem/kernel/fs tools are unreliable, > not the test.... > > "test reliability" and "reliability of filesystem under test" are > different things with similar names. The latter is what I think we > are talking about measuring and reporting here, right? > > > OTOH, we do have deterministic tests, maybe even the majority of > > fstests are deterministic(?) > > Very likely. As a generalisation, I'd say that anything that has a > fixed, single step at a time recipe and a very well defined golden > output or exact output comparison match is likely deterministic. > > We use things like 'within tolerance' so that slight variations in > test results don't cause spurious failures and hence make the test > more deterministic. Hence any test that uses 'within_tolerance' is > probably a test that is expecting deterministic behaviour.... > > > Considering that every auto test loop takes ~2 hours on our rig and that > > I have been running over 100 loops over the past two weeks, if half > > of fstests are deterministic, that is a lot of wait time and a lot of carbon > > emission gone to waste. > > > > It would have been nice if I was able to exclude a "deterministic" group. > > The problem is - can a developer ever tag a test as being "deterministic"? 
> > fstests allows private exclude lists to be used - perhaps these > could be used to start building such a group for your test > environment. Building a list from the tests you never see fail in > your environment could be a good way to seed such a group... > > Maybe you have all the raw results from those hundreds of tests > sitting around - what does crunching that data look like? Who else > has large sets of consistent historic data sitting around? I don't > because I pollute my results archive by frequently running varied > and badly broken kernels through fstests, but people who just run > released or stable kernels may have data sets that could be used.... > I have no historic data of that sort and I have never stayed on the same test system long enough to collect this sort of data. Josef told us in LPC 2021 about his btrfs fstests dashboard where he started to collect historical data a while ago. Collaborating on expunge lists of different fs and different kernel/config/distro is one of the goals behind Luis's kdevops project. For now, the expunge lists are curated in git: https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges Going forward, this cannot scale. If we want to collaborate and collect results from multiple testers and test labs we should consult with the KernelCI project, who are doing exactly that for other test suites. You did not attend Luis' talk in LSFMM this year (he had already mentioned kdevops back in LSFMM 2019), where some of these issues were discussed. The video from the LSFMM 2022 talk should be available in the coming weeks. I hear that Luis is also planning on giving a talk to a wider audience in LPC 2022. Thanks, Amir. > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 32+ messages in thread
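The kdevops expunge lists Amir links to are plain text, one test per line with an optional `#` comment, as in the generic/530 example earlier in the thread. A sketch of a parser that would let such annotations be collected and compared across testers (the exact kdevops format may differ; this assumes only the line shape shown in this thread):

```python
def parse_expunge(text):
    """Parse expunge-list text into [(test, note)] pairs.
    Assumes the line format seen in this thread, e.g.:
        generic/530 # failure rate about 1/15 <url>
    Blank lines and pure comment lines are skipped."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        test, _, note = line.partition("#")
        entries.append((test.strip(), note.strip()))
    return entries
```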
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 9:20 ` Amir Goldstein @ 2022-05-19 15:36 ` Josef Bacik 2022-05-19 16:18 ` Zorro Lang 0 siblings, 1 reply; 32+ messages in thread From: Josef Bacik @ 2022-05-19 15:36 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, Zorro Lang, fstests On Thu, May 19, 2022 at 12:20:28PM +0300, Amir Goldstein wrote: > On Thu, May 19, 2022 at 10:58 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote: > > > [adding fstests and Zorro] > > > > > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > > > > > I've been promoting the idea that running fstests once is nice, > > > > but things get interesting if you try to run fstests multiple > > > > times until a failure is found. It turns out at least kdevops has > > > > found tests which fail with a failure rate of typically 1/2 to > > > > 1/30 average failure rate. That is 1/2 means a failure can happen > > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > > > failure. > > > > > > > > I have tried my best to annotate failure rates when I know what > > > > they might be on the test expunge list, as an example: > > > > > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > > > to propose to standardize a way to represent this. How about > > > > > > > > generic/530 # F:1/15 > > > > > > > > > > I am not fond of the 1/15 annotation at all, because the only fact that you > > > are able to document is that the test failed after 15 runs. 
> > > Suggesting that this means failure rate of 1/15 is a very big step. > > > > > > > Then we could extend the definition. F being current estimate, and this > > > > can be just how long it took to find the first failure. A more valuable > > > > figure would be failure rate avarage, so running the test multiple > > > > times, say 10, to see what the failure rate is and then averaging the > > > > failure out. So this could be a more accurate representation. For this > > > > how about: > > > > > > > > generic/530 # FA:1/15 > > > > > > > > This would mean on average there failure rate has been found to be about > > > > 1/15, and this was determined based on 10 runs. > > > > These tests are run on multiple different filesystems. What happens > > if you run xfs, ext4, btrfs, overlay in sequence? We now have 4 > > tests results, and 1 failure. > > > > Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1? > > > > What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas? > > > > Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1? > > > > In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs > > 1/1 breakdown is useful information, because it tells us whihc > > filesystem failed the test, or which specific config failed the > > test. > > > > Hence I think the ability for us to draw useful conclusions from a > > number like this is large dependent on the specific data set it is > > drawn from... > > > > > > We should also go extend check for fstests/blktests to run a test > > > > until a failure is found and report back the number of successes. > > > > > > > > Thoughts? > > > > Who is the expected consumer of this information? > > > > I'm not sure it will be meaningful for anyone developing new code > > and needing to run every test every time they run fstests. 
> > > > OTOH, for a QA environment where you have a fixed progression of the > > kernel releases you are testing, it's likely valuable and already > > being tracked in various distro QE management tools and > > dashboards.... > > > > > I have had a discussion about those tests with Zorro. > > > > > > Those tests that some people refer to as "flaky" are valuable, > > > but they are not deterministic, they are stochastic. > > > > Extremely valuable. Worth their weight in gold to developers like > > me. > > > > The recoveryloop group tests are a good example of this. The name of > > the group indicates how we use it. I typically set it up to run with > > an loop iteration like "-I 100" knowing that is will likely fail a > > random test in the group within 10 iterations. > > > > Those one-off failures are almost always a real bug, and they are > > often unique and difficult to reproduce exactly. Post-mortem needs > > to be performed immediately because it may well be a unique on-off > > failure and running another test after the failure destroys the > > state needed to perform a post-mortem. > > > > Hence having a test farm running these multiple times and then > > reporting "failed once in 15 runs" isn't really useful to me as a > > developer - it doesn't tell us anything new, nor does it help us > > find the bugs that are being tripped over. > > > > Less obvious stochastic tests exist, too. There are many tests that > > use fstress as a workload that runs while some other operation is > > performed - freeze, grow, ENOSPC, error injections, etc. They will > > never be deterministic, any again any failure tends to be a real > > bug, too. > > > > However, I think these should be run by QE environments all the time > > as they require long term, frequent execution across different > > configs in different environments to find the deep dark corners > > where the bugs may lie dormant. 
These are the tests that find things > > like subtle timing races no other tests ever exercise. > > > > I suspect that tests that alter their behaviour via LOAD_FACTOR or > > TIME_FACTOR will fall into this category. > > > > > I think MTBF is the standard way to describe reliability > > > of such tests, but I am having a hard time imagining how > > > the community can manage to document accurate annotations > > > of this sort, so I would stick with documenting the facts > > > (i.e. the test fails after N runs). > > > > I'm unsure of what "reliablity of such tests" means in this context. > > The tests are trying to exercise and measure the reliability of the > > kernel code - if the *test is unreliable* then that says to me the > > test needs fixing. If the test is reliable, then any failures that > > occur indicate that the filesystem/kernel/fs tools are unreliable, > > not the test.... > > > > "test reliability" and "reliability of filesystem under test" are > > different things with similar names. The latter is what I think we > > are talking about measuring and reporting here, right? > > > > > OTOH, we do have deterministic tests, maybe even the majority of > > > fstests are deterministic(?) > > > > Very likely. As a generalisation, I'd say that anything that has a > > fixed, single step at a time recipe and a very well defined golden > > output or exact output comparison match is likely deterministic. > > > > We use things like 'within tolerance' so that slight variations in > > test results don't cause spurious failures and hence make the test > > more deterministic. Hence any test that uses 'within_tolerance' is > > probably a test that is expecting deterministic behaviour.... > > > > > Considering that every auto test loop takes ~2 hours on our rig and that > > > I have been running over 100 loops over the past two weeks, if half > > > of fstests are deterministic, that is a lot of wait time and a lot of carbon > > > emission gone to waste. 
> > > > > > It would have been nice if I was able to exclude a "deterministic" group. > > > The problem is - can a developer ever tag a test as being "deterministic"? > > > > fstests allows private exclude lists to be used - perhaps these > > could be used to start building such a group for your test > > environment. Building a list from the tests you never see fail in > > your environment could be a good way to seed such a group... > > > > Maybe you have all the raw results from those hundreds of tests > > sitting around - what does crunching that data look like? Who else > > has large sets of consistent historic data sitting around? I don't > > because I pollute my results archive by frequently running varied > > and badly broken kernels through fstests, but people who just run > > released or stable kernels may have data sets that could be used.... > > > > I have no historic data of that sort and I have never stayed on the > same test system long enough to collect this sort of data. > > Josef has told us in LPC 2021 about his btrfs fstests dashboard > where he started to collect historical data a while ago. > I'm clearly biased, but I think this is the best way to go for *developers*. We want to know all the things, so we just need to have a clear way to see what's failing and have a historical view of what has failed. If you look at our dashboard at toxicpanda.com you can click on the tests and see their runs and failures on different configs. This has been insanely valuable to me, and helped me narrow down test cases that needed to be adjusted for compression. > Collaborating on expunge lists of different fs and different > kernel/config/distro > is one of the goals behind Luis's kdevops project. > I think this is also hugely valuable from the "Willy usecase" perspective. Willy doesn't care about failure rates or interpreting the tea leaves of what our format is, he wants to make sure he didn't break anything. 
We should strive to have 0 failures for this use case, so having expunge lists in place to get rid of any flaky results is going to make it easier for non-experts to get a solid grasp on whether they introduced a regression or not. There's room for both use cases. I want the expunge lists for newbies, and I want good reporting for the developers who know what they're doing. We can provide documentation for both - If Willy, run 'make fstests-clean' - If Josef, run 'make fstests' Thanks, Josef ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 15:36 ` Josef Bacik @ 2022-05-19 16:18 ` Zorro Lang 0 siblings, 0 replies; 32+ messages in thread From: Zorro Lang @ 2022-05-19 16:18 UTC (permalink / raw) To: Josef Bacik Cc: Amir Goldstein, Dave Chinner, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 11:36:02AM -0400, Josef Bacik wrote: > On Thu, May 19, 2022 at 12:20:28PM +0300, Amir Goldstein wrote: > > On Thu, May 19, 2022 at 10:58 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote: > > > > [adding fstests and Zorro] > > > > > > > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > > > > > > > I've been promoting the idea that running fstests once is nice, > > > > > but things get interesting if you try to run fstests multiple > > > > > times until a failure is found. It turns out at least kdevops has > > > > > found tests which fail with a failure rate of typically 1/2 to > > > > > 1/30 average failure rate. That is 1/2 means a failure can happen > > > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > > > > failure. > > > > > > > > > > I have tried my best to annotate failure rates when I know what > > > > > they might be on the test expunge list, as an example: > > > > > > > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > > > > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > > > > to propose to standardize a way to represent this. 
How about > > > > > > > > > > generic/530 # F:1/15 > > > > > > > > > > > > > I am not fond of the 1/15 annotation at all, because the only fact that you > > > > are able to document is that the test failed after 15 runs. > > > > Suggesting that this means failure rate of 1/15 is a very big step. > > > > > > > > > Then we could extend the definition. F being current estimate, and this > > > > > can be just how long it took to find the first failure. A more valuable > > > > > figure would be failure rate average, so running the test multiple > > > > > times, say 10, to see what the failure rate is and then averaging the > > > > > failure out. So this could be a more accurate representation. For this > > > > > how about: > > > > > > > > > > generic/530 # FA:1/15 > > > > > > > > > > This would mean on average the failure rate has been found to be about > > > > > 1/15, and this was determined based on 10 runs. > > > > > > These tests are run on multiple different filesystems. What happens > > > if you run xfs, ext4, btrfs, overlay in sequence? We now have 4 > > > test results, and 1 failure. > > > > > > Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1? > > > > > > What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas? > > > > > > Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1? > > > > > > In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs > > > 1/1 breakdown is useful information, because it tells us which > > > filesystem failed the test, or which specific config failed the > > > test. > > > > > > Hence I think the ability for us to draw useful conclusions from a > > > number like this is largely dependent on the specific data set it is > > > drawn from... > > > > > We should also go extend check for fstests/blktests to run a test > > > > > until a failure is found and report back the number of successes. > > > > > > > > > > Thoughts? > > > Who is the expected consumer of this information? 
> > > > > > I'm not sure it will be meaningful for anyone developing new code > > > and needing to run every test every time they run fstests. > > > > > > OTOH, for a QA environment where you have a fixed progression of the > > > kernel releases you are testing, it's likely valuable and already > > > being tracked in various distro QE management tools and > > > dashboards.... > > > > > > > I have had a discussion about those tests with Zorro. > > > > > > > > Those tests that some people refer to as "flaky" are valuable, > > > > but they are not deterministic, they are stochastic. > > > > > > Extremely valuable. Worth their weight in gold to developers like > > > me. > > > > > > The recoveryloop group tests are a good example of this. The name of > > > the group indicates how we use it. I typically set it up to run with > > > a loop iteration like "-I 100" knowing that it will likely fail a > > > random test in the group within 10 iterations. > > > > > > Those one-off failures are almost always a real bug, and they are > > > often unique and difficult to reproduce exactly. Post-mortem needs > > > to be performed immediately because it may well be a unique one-off > > > failure and running another test after the failure destroys the > > > state needed to perform a post-mortem. > > > > > > Hence having a test farm running these multiple times and then > > > reporting "failed once in 15 runs" isn't really useful to me as a > > > developer - it doesn't tell us anything new, nor does it help us > > > find the bugs that are being tripped over. > > > > > > Less obvious stochastic tests exist, too. There are many tests that > > > use fsstress as a workload that runs while some other operation is > > > performed - freeze, grow, ENOSPC, error injections, etc. They will > > > never be deterministic, and again any failure tends to be a real > > > bug, too. 
> > > > > > However, I think these should be run by QE environments all the time > > > as they require long term, frequent execution across different > > > configs in different environments to find the deep dark corners > > > where the bugs may lie dormant. These are the tests that find things > > > like subtle timing races no other tests ever exercise. > > > > > > I suspect that tests that alter their behaviour via LOAD_FACTOR or > > > TIME_FACTOR will fall into this category. > > > > > > > I think MTBF is the standard way to describe reliability > > > > of such tests, but I am having a hard time imagining how > > > > the community can manage to document accurate annotations > > > > of this sort, so I would stick with documenting the facts > > > > (i.e. the test fails after N runs). > > > > > > I'm unsure of what "reliablity of such tests" means in this context. > > > The tests are trying to exercise and measure the reliability of the > > > kernel code - if the *test is unreliable* then that says to me the > > > test needs fixing. If the test is reliable, then any failures that > > > occur indicate that the filesystem/kernel/fs tools are unreliable, > > > not the test.... > > > > > > "test reliability" and "reliability of filesystem under test" are > > > different things with similar names. The latter is what I think we > > > are talking about measuring and reporting here, right? > > > > > > > OTOH, we do have deterministic tests, maybe even the majority of > > > > fstests are deterministic(?) > > > > > > Very likely. As a generalisation, I'd say that anything that has a > > > fixed, single step at a time recipe and a very well defined golden > > > output or exact output comparison match is likely deterministic. > > > > > > We use things like 'within tolerance' so that slight variations in > > > test results don't cause spurious failures and hence make the test > > > more deterministic. 
Hence any test that uses 'within_tolerance' is > > > probably a test that is expecting deterministic behaviour.... > > > > > > > Considering that every auto test loop takes ~2 hours on our rig and that > > > > I have been running over 100 loops over the past two weeks, if half > > > > of fstests are deterministic, that is a lot of wait time and a lot of carbon > > > > emission gone to waste. > > > > > > > > It would have been nice if I was able to exclude a "deterministic" group. > > > > The problem is - can a developer ever tag a test as being "deterministic"? > > > > > > fstests allows private exclude lists to be used - perhaps these > > > could be used to start building such a group for your test > > > environment. Building a list from the tests you never see fail in > > > your environment could be a good way to seed such a group... > > > > > > Maybe you have all the raw results from those hundreds of tests > > > sitting around - what does crunching that data look like? Who else > > > has large sets of consistent historic data sitting around? I don't > > > because I pollute my results archive by frequently running varied > > > and badly broken kernels through fstests, but people who just run > > > released or stable kernels may have data sets that could be used.... > > > > > > > I have no historic data of that sort and I have never stayed on the > > same test system long enough to collect this sort of data. > > > > Josef has told us in LPC 2021 about his btrfs fstests dashboard > > where he started to collect historical data a while ago. > > > > I'm clearly biased, but I think this is the best way to go for *developers*. We > want to know all the things, so we just need to have a clear way to see what's > failing and have a historical view of what has failed. 
If you look at our I agree the "historical view" is needed, but it can't be provided by mainline fstests: fstests is used to test many different filesystems on many different combinations of system software and hardware, and there are lots of downstream projects, each with its own variations, so an "upstream mainline Linux historical view" isn't a useful reference for all of them. Some "downstream historical views" aren't useful references for others either. A "historical view" is valuable to the project (or project group) that produced it, but may not be universal. If someone would like to help test a particular project - say an Ubuntu LTS release, Debian, CentOS, or an LTS kernel - it would be better to ask whether the people involved have their own "historical view" data to help get started, rather than asking whether fstests carries that data for every known and unknown project. As I just replied to Ted, I think his idea makes more sense: fstests can provide some meaningful interfaces to help testers use their historical data, or help to summarize their historical data for each specific user/project. But fstests shouldn't store or provide such one-sided data directly. Thanks, Zorro > dashboard at toxicpanda.com you can click on the tests and see their runs and > failures on different configs. This has been insanely valuable to me, and > helped me narrow down test cases that needed to be adjusted for compression. > > > Collaborating on expunge lists of different fs and different > > kernel/config/distro > > is one of the goals behind Luis's kdevops project. > > > > I think this is also hugely valuable from the "Willy usecase" perspective. > Willy doesn't care about failure rates or interpreting the tea leaves of what > our format is, he wants to make sure he didn't break anything. 
We should strive > to have 0 failures for this use case, so having expunge lists in place to get > rid of any flaky results is going to make it easier for non-experts to get a > solid grasp on whether they introduced a regression or not. > > There's room for both use cases. I want the expunge lists for newbies, and I want > good reporting for the developers who know what they're doing. We can provide > documentation for both > > - If Willy, run 'make fstests-clean' > - If Josef, run 'make fstests' > > Thanks, > > Josef > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 6:36 ` Amir Goldstein 2022-05-19 7:58 ` Dave Chinner @ 2022-05-19 11:24 ` Zorro Lang 2022-05-19 14:18 ` Theodore Ts'o 2022-05-19 14:58 ` Matthew Wilcox 1 sibling, 2 replies; 32+ messages in thread From: Zorro Lang @ 2022-05-19 11:24 UTC (permalink / raw) To: Amir Goldstein Cc: Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote: > [adding fstests and Zorro] > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > I've been promoting the idea that running fstests once is nice, > > but things get interesting if you try to run fstests multiple > > times until a failure is found. It turns out at least kdevops has > > found tests which fail with a failure rate of typically 1/2 to > > 1/30 average failure rate. That is 1/2 means a failure can happen > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > failure. > > > > I have tried my best to annotate failure rates when I know what > > they might be on the test expunge list, as an example: > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > to propose to standardize a way to represent this. How about > > > > generic/530 # F:1/15 > > > > I am not fond of the 1/15 annotation at all, because the only fact that you > are able to document is that the test failed after 15 runs. > Suggesting that this means failure rate of 1/15 is a very big step. > > > Then we could extend the definition. F being current estimate, and this > > can be just how long it took to find the first failure. 
A more valuable > > figure would be failure rate average, so running the test multiple > > times, say 10, to see what the failure rate is and then averaging the > > failure out. So this could be a more accurate representation. For this > > how about: > > > > generic/530 # FA:1/15 > > > > This would mean on average the failure rate has been found to be about > > 1/15, and this was determined based on 10 runs. > > > > We should also go extend check for fstests/blktests to run a test > > until a failure is found and report back the number of successes. > > > > Thoughts? > > > > I have had a discussion about those tests with Zorro. Hi Amir, Thanks for publicizing this discussion. Yes, we talked about this, but if I don't remember wrong, I recommended that each downstream tester maintain their own "testing data/config", like exclude lists, failure ratios, known failures, etc. I think they're not suitable to be fixed in the mainline fstests. About the other idea I mentioned at LSF, we can create some more group names to mark those cases with random load/data/env etc.; they're worth running more times. I also talked about that with Darrick; we haven't made a decision, but I'd like to push for it if most other folks would like to see it. In my internal regression test for RHEL, I give some fstests cases a new group name "redhat_random" (sure, I know it's not a good name, it's just for my internal test, better names welcome, I'm not a good English speaker :). Then, combined with the quick and stress group names, I loop-run the "redhat_random" cases different numbers of times, with different LOAD/TIME_FACTOR. 
So I hope to have one "or more specific" group name to mark those random test cases at first, likes [1] (I'm sure it's incomplete, but can be improved if we can get more help from more people :) Thanks, Zorro [1] generic/013 generic/019 generic/051 generic/068 generic/070 generic/075 generic/076 generic/083 generic/091 generic/112 generic/117 generic/127 generic/231 generic/232 generic/233 generic/263 generic/269 generic/270 generic/388 generic/390 generic/413 generic/455 generic/457 generic/461 generic/464 generic/475 generic/476 generic/482 generic/521 generic/522 generic/547 generic/551 generic/560 generic/561 generic/616 generic/617 generic/648 generic/650 xfs/011 xfs/013 xfs/017 xfs/032 xfs/051 xfs/057 xfs/068 xfs/079 xfs/104 xfs/137 xfs/141 xfs/167 xfs/297 xfs/305 xfs/442 xfs/517 > > Those tests that some people refer to as "flaky" are valuable, > but they are not deterministic, they are stochastic. > > I think MTBF is the standard way to describe reliability > of such tests, but I am having a hard time imagining how > the community can manage to document accurate annotations > of this sort, so I would stick with documenting the facts > (i.e. the test fails after N runs). > > OTOH, we do have deterministic tests, maybe even the majority of > fstests are deterministic(?) > > Considering that every auto test loop takes ~2 hours on our rig and that > I have been running over 100 loops over the past two weeks, if half > of fstests are deterministic, that is a lot of wait time and a lot of carbon > emission gone to waste. > > It would have been nice if I was able to exclude a "deterministic" group. > The problem is - can a developer ever tag a test as being "deterministic"? > > Thanks, > Amir. > ^ permalink raw reply [flat|nested] 32+ messages in thread
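The annotations Luis proposes ("F:1/15" for a first-failure observation, "FA:1/15" for an averaged rate) are compact enough to parse mechanically. As a rough sketch of what a runner like kdevops could do with an annotated expunge line (the regex and helper below are purely illustrative assumptions, not part of fstests or kdevops):

```python
import re

# Matches the proposed expunge annotations, e.g.
#   generic/530 # F:1/15    (first failure seen on run 15)
#   generic/530 # FA:1/15   (averaged failure rate of about 1/15)
ANNOT_RE = re.compile(r"^(?P<test>\S+)\s*#\s*F(?P<avg>A?):1/(?P<runs>\d+)")

def parse_expunge_line(line):
    """Return the test name, whether the rate was averaged over runs,
    and the estimated failure rate; None for unannotated lines."""
    m = ANNOT_RE.match(line.strip())
    if not m:
        return None
    return {
        "test": m.group("test"),
        "averaged": m.group("avg") == "A",
        "failure_rate": 1.0 / int(m.group("runs")),
    }
```

A tool consuming these lines could then, for example, sort expunged tests by estimated failure rate to prioritize debugging effort.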
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 11:24 ` Zorro Lang @ 2022-05-19 14:18 ` Theodore Ts'o 2022-05-19 15:10 ` Zorro Lang 2022-05-19 14:58 ` Matthew Wilcox 1 sibling, 1 reply; 32+ messages in thread From: Theodore Ts'o @ 2022-05-19 14:18 UTC (permalink / raw) To: Zorro Lang Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > > Yes, we talked about this, but if I don't remember wrong, I recommended that each > downstream tester maintain their own "testing data/config", like exclude > lists, failure ratios, known failures etc. I think they're not suitable to be > fixed in the mainline fstests. Failure ratios are the sort of thing that are only applicable for * A specific filesystem * A specific configuration * A specific storage device / storage device class * A specific CPU architecture / CPU speed * A specific amount of memory available Put another way, there are problems that fail so rarely as to be effectively "never" on, say, an x86_64 class server with gobs and gobs of memory, but which can more reliably fail on, say, a Raspberry Pi using eMMC flash. I don't think that Luis was suggesting that this kind of failure annotation would go in upstream fstests. I suspect he just wants to use it in kdevops, and hopes that other people would use it as well in other contexts. But even in the context of test runners like kdevops and {kvm,gce,android}-xfstests, it's going to be very specific to a particular test environment, and for the global list of excludes for a particular file system. 
So in the gce-xfstests context, this is the difference between the excludes in the files: fs/ext4/excludes vs fs/ext4/cfg/bigalloc.exclude even if I only cared about, say, how things ran on GCE using SSD-backed Persistent Disk (never mind that I can only run gce-xfstests on Local SSD, and PD Extreme, etc.), failure percentages would never make sense for fs/ext4/excludes, since that covers multiple file system configs. And my infrastructure supports kvm, gce, and Android, as well as some people (such as at $WORK for our data center kernels) who run the test appliance directly on bare metal, so I wouldn't use the failure percentages in these files, etc. Now, what I *do* is to track this sort of thing in my own notes, e.g.: generic/051 ext4/adv Failure percentage: 16% (4/25) "Basic log recovery stress test - do lots of stuff, shut down in the middle of it and check that recovery runs to completion and everything can be successfully removed afterwards." generic/410 nojournal Couldn't reproduce after running 25 times "Test mount shared subtrees, verify the state transitions..." generic/68[12] encrypt Failure percentage: 100% The directory does grow, but blocks aren't charged to either root or the non-privileged users' quota. So this appears to be a real bug. There is one thing that I'd like to add to upstream fstests, and that is some kind of option so that "check --retry-failures NN" would cause fstests, upon finding a test failure, to automatically rerun that failing test NN additional times. Another potential related feature which we currently have in our daily spinner infrastructure at $WORK would be, on a test failure, to rerun the test up to M times (typically a small number, such as 3), and if it passes on a retry attempt, declare the test result as "flaky", and stop running the retries. If the test repeatedly fails after M attempts, then the test result is "fail". 
These results would be reported in the junit XML file, and would allow the test runners to annotate their test summaries appropriately. I'm thinking about trying to implement something like this in my copious spare time; but before I do, does the general idea seem acceptable? Thanks, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
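The retry policy Ted describes is simple enough to pin down precisely. Here is a minimal sketch of the classification logic (in Python just to make the bookkeeping explicit; an actual implementation would live in fstests' check script, and `run_test` is a stand-in for invoking one test):

```python
def classify(run_test, max_retries=3):
    """Classify one test per the policy described above: 'pass' if it
    passes outright, 'flaky' if it fails but then passes within
    max_retries reruns, 'fail' if every rerun fails too.
    run_test() returns True when the test passes."""
    if run_test():
        return "pass"
    for _ in range(max_retries):
        if run_test():
            return "flaky"   # passed on a retry: stop retrying
    return "fail"            # failed max_retries + 1 times in a row
```

With max_retries=3 this matches the "typically a small number, such as 3" above; the junit XML reporting side is omitted.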
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 14:18 ` Theodore Ts'o @ 2022-05-19 15:10 ` Zorro Lang 0 siblings, 0 replies; 32+ messages in thread From: Zorro Lang @ 2022-05-19 15:10 UTC (permalink / raw) To: Theodore Ts'o Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 10:18:48AM -0400, Theodore Ts'o wrote: > On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > > > > Yes, we talked about this, but if I don't rememeber wrong, I recommended each > > downstream testers maintain their own "testing data/config", likes exclude > > list, failed ratio, known failures etc. I think they're not suitable to be > > fixed in the mainline fstests. > > Failure ratios are the sort of thing that are only applicable for > > * A specific filesystem > * A specific configuration > * A specific storage device / storage device class > * A specific CPU architecture / CPU speed > * A specific amount of memory available And a specific bug I suppose :) > > Put another way, there are problems that fail so close to rarely as to > be "hever" on, say, an x86_64 class server with gobs and gobs of > memory, but which can more reliably fail on, say, a Rasberry PI using > eMMC flash. > > I don't think that Luis was suggesting that this kind of failure > annotation would go in upstream fstests. I suspect he just wants to > use it in kdevops, and hope that other people would use it as well in > other contexts. But even in the context of test runners like kdevops > and {kvm,gce,android}-xfstests, it's going to be very specific to a > particular test environment, and for the global list of excludes for a > particular file system. 
So in the gce-xfstests context, this is the > difference between the excludes in the files: > > fs/ext4/excludes > vs > fs/ext4/cfg/bigalloc.exclude > > even if I only cared about, say, how things ran on GCE using > SSD-backed Persistent Disk (never mind that I can only run > gce-xfstests on Local SSD, and PD Extreme, etc.), failure percentages > would never make sense for fs/ext4/excludes, since that covers > multiple file system configs. And my infrastructure supports kvm, > gce, and Android, as well as some people (such as at $WORK for our > data center kernels) who run the test appliacce directly on bare > metal, so I wouldn't use the failure percentages in these files, etc. > > Now, what I *do* is to track this sort of thing in my own notes, e.g: > > generic/051 ext4/adv Failure percentage: 16% (4/25) > "Basic log recovery stress test - do lots of stuff, shut down in > the middle of it and check that recovery runs to completion and > everything can be successfully removed afterwards." > > generic/410 nojournal Couldn't reproduce after running 25 times > "Test mount shared subtrees, verify the state transitions..." > > generic/68[12] encrypt Failure percentage: 100% > The directory does grow, but blocks aren't charged to either root or > the non-privileged users' quota. So this appears to be a real bug. > > > There is one thing that I'd like to add to upstream fstests, and that > is some kind of option so that "check --retry-failures NN" would cause > fstests to automatically, upon finding a test failure, will rerun that > failing test NN aditional times. That makes more sense for me :) I'd like to help the testers to retry the (randomly) failed cases, to help them to get their testing statistics. That's better than recording these statistics in fstests itself. 
> Another potential related feature > which we currently have in our daily spinner infrastructure at $WORK > would be, on a test failure, to rerun the test up to M times (typically a > small number, such as 3), and if it passes on a retry attempt, declare > the test result as "flaky", and stop running the retries. If the test > repeatedly fails after M attempts, then the test result is "fail". > > These results would be reported in the junit XML file, and would allow > the test runners to annotate their test summaries appropriately. > > I'm thinking about trying to implement something like this in my > copious spare time; but before I do, does the general idea seem > acceptable? After a "./check ..." run is done, fstests generally shows 3 lists: Ran: ... Not run: ... Failures: ... So you mean that if "--retry-failures N" is specified, we can have one more list named "Flaky", which is a subset of the "Failures" list, like: Ran: ... Not run: ... Failures: generic/388 generic/475 xfs/104 xfs/442 Flaky: generic/388 [2/N] xfs/104 [1/N] If I understand this correctly, it's acceptable to me. And it might be helpful for Amir's situation. But let's hear more voices from other developers; if there is no big objection from other fs maintainers, let's do it :) BTW, what do you think about the new group name to mark cases with random load/operations/env.? Any suggestions or good names for that? Thanks, Zorro > > Thanks, > > - Ted > ^ permalink raw reply [flat|nested] 32+ messages in thread
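Zorro's proposed summary, with "Flaky" as a subset of "Failures" annotated [k/N], could be rendered along these lines (a hypothetical formatting helper; the real report lives in fstests' check script, and the data shapes here are my assumptions):

```python
def summarize(ran, not_run, failures, flaky, n):
    """Render the end-of-run summary sketched above.  `flaky` maps a
    test name to the retry attempt k (out of n) on which it finally
    passed; flaky tests remain listed under Failures as well."""
    lines = [
        "Ran: " + " ".join(ran),
        "Not run: " + " ".join(not_run),
        "Failures: " + " ".join(sorted(failures)),
        "Flaky: " + " ".join(f"{t} [{k}/{n}]" for t, k in sorted(flaky.items())),
    ]
    return "\n".join(lines)
```

Keeping flaky tests inside the Failures list (rather than moving them out) preserves backward compatibility for any tooling that already parses the Failures line.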
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 11:24 ` Zorro Lang 2022-05-19 14:18 ` Theodore Ts'o @ 2022-05-19 14:58 ` Matthew Wilcox 2022-05-19 15:44 ` Zorro Lang 1 sibling, 1 reply; 32+ messages in thread From: Matthew Wilcox @ 2022-05-19 14:58 UTC (permalink / raw) To: Zorro Lang Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > Yes, we talked about this, but if I don't rememeber wrong, I recommended each > downstream testers maintain their own "testing data/config", likes exclude > list, failed ratio, known failures etc. I think they're not suitable to be > fixed in the mainline fstests. This assumes a certain level of expertise, which is a barrier to entry. For someone who wants to check "Did my patch to filesystem Y that I have never touched before break anything?", having non-deterministic tests run by default is bad. As an example, run xfstests against jfs. Hundreds of failures, including some very scary-looking assertion failures from the page allocator. They're (mostly) harmless in fact, just being a memory leak, but it makes xfstests useless for this scenario. Even for well-maintained filesystems like xfs which is regularly tested, I expect generic/270 and a few others to fail. They just do, and they're not an indication that *I* broke anything. By all means, we want to keep tests around which have failures, but they need to be restricted to people who have a level of expertise and interest in fixing long-standing problems, not people who are looking for regressions. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 14:58 ` Matthew Wilcox @ 2022-05-19 15:44 ` Zorro Lang 2022-05-19 16:06 ` Matthew Wilcox 0 siblings, 1 reply; 32+ messages in thread From: Zorro Lang @ 2022-05-19 15:44 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 03:58:31PM +0100, Matthew Wilcox wrote: > On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > > Yes, we talked about this, but if I don't rememeber wrong, I recommended each > > downstream testers maintain their own "testing data/config", likes exclude > > list, failed ratio, known failures etc. I think they're not suitable to be > > fixed in the mainline fstests. > > This assumes a certain level of expertise, which is a barrier to entry. > > For someone who wants to check "Did my patch to filesystem Y that I have > never touched before break anything?", having non-deterministic tests > run by default is bad. > > As an example, run xfstests against jfs. Hundreds of failures, including > some very scary-looking assertion failures from the page allocator. > They're (mostly) harmless in fact, just being a memory leak, but it > makes xfstests useless for this scenario. > > Even for well-maintained filesystems like xfs which is regularly tested, > I expect generic/270 and a few others to fail. They just do, and they're > not an indication that *I* broke anything. > > By all means, we want to keep tests around which have failures, but > they need to be restricted to people who have a level of expertise and > interest in fixing long-standing problems, not people who are looking > for regressions. It's hard to make sure if a failure is a regression, if someone only run the test once. The testers need some experience, at least need some history test data. 
If a tester finds a case that has a 10% chance of failing on his system, then to determine whether it's a regression, if he doesn't have historical test data, he at least needs to run the same test more times on an old kernel version on his system. If it never fails on the old kernel version but can fail on the new kernel, then we suspect it's a regression. Even if the tester isn't an expert in the fs he's testing, he can report the issue to that fs's experts for more checking. For a downstream kernel, he has to report to the downstream maintainers, or check by himself. If a case passes on upstream but fails on downstream, it might mean there's an upstream patchset that can be backported. So, either way, testers need their own "experience" (including historical test data, known issues, etc.) to judge whether a failure is a suspected regression or a known downstream issue that hasn't been fixed (by backport) yet. That's my personal perspective :) Thanks, Zorro > ^ permalink raw reply [flat|nested] 32+ messages in thread
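Zorro's rule of thumb above can be stated as a tiny predicate (illustrative only; the requirement that the old kernel was exercised at least as many times as the new one is my assumption, not something fstests defines):

```python
def suspect_regression(old_failures, old_runs, new_failures, new_runs):
    """Apply the heuristic above: suspect a regression only when the
    new kernel shows failures while the old kernel, run at least as
    many times on the same test and system, never failed."""
    return new_failures > 0 and old_failures == 0 and old_runs >= new_runs
```

Note that for a test with a 10% failure rate, even 20 clean runs on the old kernel leave a real chance the failure simply wasn't triggered there, which is exactly why Zorro stresses the need for historical data.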
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 15:44 ` Zorro Lang @ 2022-05-19 16:06 ` Matthew Wilcox 2022-05-19 16:54 ` Zorro Lang ` (2 more replies) 0 siblings, 3 replies; 32+ messages in thread From: Matthew Wilcox @ 2022-05-19 16:06 UTC (permalink / raw) To: Zorro Lang Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 11:44:19PM +0800, Zorro Lang wrote: > On Thu, May 19, 2022 at 03:58:31PM +0100, Matthew Wilcox wrote: > > On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > > > Yes, we talked about this, but if I don't rememeber wrong, I recommended each > > > downstream testers maintain their own "testing data/config", likes exclude > > > list, failed ratio, known failures etc. I think they're not suitable to be > > > fixed in the mainline fstests. > > > > This assumes a certain level of expertise, which is a barrier to entry. > > > > For someone who wants to check "Did my patch to filesystem Y that I have > > never touched before break anything?", having non-deterministic tests > > run by default is bad. > > > > As an example, run xfstests against jfs. Hundreds of failures, including > > some very scary-looking assertion failures from the page allocator. > > They're (mostly) harmless in fact, just being a memory leak, but it > > makes xfstests useless for this scenario. > > > > Even for well-maintained filesystems like xfs which is regularly tested, > > I expect generic/270 and a few others to fail. They just do, and they're > > not an indication that *I* broke anything. > > > > By all means, we want to keep tests around which have failures, but > > they need to be restricted to people who have a level of expertise and > > interest in fixing long-standing problems, not people who are looking > > for regressions. 
> > It's hard to make sure if a failure is a regression, if someone only run > the test once. The testers need some experience, at least need some > history test data. > > If a tester find a case has 10% chance fail on his system, to make sure > it's a regression or not, if he doesn't have history test data, at least > he need to do the same test more times on old kernel version with his > system. If it never fail on old kernel version, but can fail on new kernel. > Then we suspect it's a regression. > > Even if the tester isn't an expert of the fs he's testing, he can report > this issue to that fs experts, to get more checking. For downstream kernel, > he has to report to the maintainers of downstream, or check by himself. > If a case pass on upstream, but fail on downstream, it might mean there's > a patchset on upstream can be backported. > > So, anyway, the testers need their own "experience" (include testing history > data, known issue, etc) to judge if a failure is a suspected regression, or > a known issue of downstream which hasn't been fixed (by backport). > > That's my personal perspective :) Right, but that's the personal perspective of an expert tester. I don't particularly want to build that expertise myself; I want to write patches which touch dozens of filesystems, and I want to be able to smoke-test those patches. Maybe xfstests or kdevops doesn't want to solve that problem, but that would seem like a waste of other peoples time. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 16:06 ` Matthew Wilcox @ 2022-05-19 16:54 ` Zorro Lang 2022-07-01 23:36 ` Luis Chamberlain 2022-07-02 17:01 ` Theodore Ts'o 2 siblings, 0 replies; 32+ messages in thread From: Zorro Lang @ 2022-05-19 16:54 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 05:06:07PM +0100, Matthew Wilcox wrote: > On Thu, May 19, 2022 at 11:44:19PM +0800, Zorro Lang wrote: > > On Thu, May 19, 2022 at 03:58:31PM +0100, Matthew Wilcox wrote: > > > On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > > > > Yes, we talked about this, but if I don't rememeber wrong, I recommended each > > > > downstream testers maintain their own "testing data/config", likes exclude > > > > list, failed ratio, known failures etc. I think they're not suitable to be > > > > fixed in the mainline fstests. > > > > > > This assumes a certain level of expertise, which is a barrier to entry. > > > > > > For someone who wants to check "Did my patch to filesystem Y that I have > > > never touched before break anything?", having non-deterministic tests > > > run by default is bad. > > > > > > As an example, run xfstests against jfs. Hundreds of failures, including > > > some very scary-looking assertion failures from the page allocator. > > > They're (mostly) harmless in fact, just being a memory leak, but it > > > makes xfstests useless for this scenario. > > > > > > Even for well-maintained filesystems like xfs which is regularly tested, > > > I expect generic/270 and a few others to fail. They just do, and they're > > > not an indication that *I* broke anything. 
> > > > > > By all means, we want to keep tests around which have failures, but > > > they need to be restricted to people who have a level of expertise and > > > interest in fixing long-standing problems, not people who are looking > > > for regressions. > > > > It's hard to make sure if a failure is a regression, if someone only run > > the test once. The testers need some experience, at least need some > > history test data. > > > > If a tester find a case has 10% chance fail on his system, to make sure > > it's a regression or not, if he doesn't have history test data, at least > > he need to do the same test more times on old kernel version with his > > system. If it never fail on old kernel version, but can fail on new kernel. > > Then we suspect it's a regression. > > > > Even if the tester isn't an expert of the fs he's testing, he can report > > this issue to that fs experts, to get more checking. For downstream kernel, > > he has to report to the maintainers of downstream, or check by himself. > > If a case pass on upstream, but fail on downstream, it might mean there's > > a patchset on upstream can be backported. > > > > So, anyway, the testers need their own "experience" (include testing history > > data, known issue, etc) to judge if a failure is a suspected regression, or > > a known issue of downstream which hasn't been fixed (by backport). > > > > That's my personal perspective :) > > Right, but that's the personal perspective of an expert tester. I don't > particularly want to build that expertise myself; I want to write patches > which touch dozens of filesystems, and I want to be able to smoke-test > those patches. Maybe xfstests or kdevops doesn't want to solve that I think it's hard to judge, in general, which cases count as smoke-test cases, especially if you expect them to all pass when there are no real bugs. If it's for "all filesystems", I'd have to recommend only some simple fsx and fsstress cases...
Even if we add a group named 'smoke' and mark all stable and simple enough test cases as 'smoke', I still can't be sure './check -g smoke' will pass for all your filesystem testing in a random system environment :) Thanks, Zorro > problem, but that would seem like a waste of other peoples time. > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 16:06 ` Matthew Wilcox 2022-05-19 16:54 ` Zorro Lang @ 2022-07-01 23:36 ` Luis Chamberlain 2022-07-02 17:01 ` Theodore Ts'o 2 siblings, 0 replies; 32+ messages in thread From: Luis Chamberlain @ 2022-07-01 23:36 UTC (permalink / raw) To: Matthew Wilcox Cc: Zorro Lang, Amir Goldstein, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 05:06:07PM +0100, Matthew Wilcox wrote: > Right, but that's the personal perspective of an expert tester. I don't > particularly want to build that expertise myself; I want to write patches > which touch dozens of filesystems, and I want to be able to smoke-test > those patches. Maybe xfstests or kdevops doesn't want to solve that > problem, kdevop's goals are aligned to enable that. However at this point in time there is no agreement to share expunges and so we just carry tons of them per kernel / distro for those that *did* have time to run them for the environment used and share them. Today there are baselines for stable and linus' kernel for some filesystems, but these are on a best effort basis as this takes system resources and someone's time. The results are tracked in: workflows/fstests/expunges/ With time now that there is at least a rig to do this for stable and upstream this should expand to be more up to date. There is also a shared repo which enables folks to share results there. Luis ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 16:06 ` Matthew Wilcox 2022-05-19 16:54 ` Zorro Lang 2022-07-01 23:36 ` Luis Chamberlain @ 2022-07-02 17:01 ` Theodore Ts'o 2022-07-07 21:36 ` Luis Chamberlain 2 siblings, 1 reply; 32+ messages in thread From: Theodore Ts'o @ 2022-07-02 17:01 UTC (permalink / raw) To: Matthew Wilcox Cc: Zorro Lang, Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 05:06:07PM +0100, Matthew Wilcox wrote: > > Right, but that's the personal perspective of an expert tester. I don't > particularly want to build that expertise myself; I want to write patches > which touch dozens of filesystems, and I want to be able to smoke-test > those patches. Maybe xfstests or kdevops doesn't want to solve that > problem, but that would seem like a waste of other peoples time. Willy, For your use case I'm guessing that you have two major concerns: * bugs that you may have introduced by patches "which touch dozens of filesystems" * bugs in the core mm and fs-writeback code, which may be much more substantive/complex changes. Would you say that is correct? At least for ext4 and xfs, it's probably quite sufficient just to run the -g auto group for the ext4/4k and xfs/4k test configs --- that is, the standard default file system configs using the 4k block size. Both of these currently don't require any test exclusions for kvm-xfstests or gce-xfstests when running the auto group. And so, for the purposes of catching bugs in the core MM/VFS layer and any changes that the folio patches are likely to touch for ext4 and xfs, the auto group for ext4/4k and xfs/4k is probably quite sufficient. Testing the more exotic test configs, such as bigalloc for ext4, realtime for xfs, or the external log configs, is not likely to be relevant for the folio patches.
Note: I recommend that you skip using the loop device xfstests strategy, which Luis likes to advocate. From the perspective of *likely* regressions caused by the Folio patches, I claim they are going to cause you more pain than they are worth. If there are some strange Folio/loop device interactions, they aren't likely going to be obvious/reproducible failures that will cause pain to linux-next testers. While it would be nice to find **all** possible bugs before patches go upstream to Linus, if it slows down your development velocity to a near-standstill, it's not worth it. We have to be realistic about things. What about other file systems? Well, first of all, xfstests only has support for the following file systems: 9p btrfs ceph cifs exfat ext2 ext4 f2fs gfs glusterfs jfs msdos nfs ocfs2 overlay pvfs2 reiserfs tmpfs ubifs udf vfat virtiofs xfs {kvm,gce}-xfstests supports these 16 file systems: 9p btrfs exfat ext2 ext4 f2fs jfs msdos nfs overlay reiserfs tmpfs ubifs udf vfat xfs kdevops has support for these file systems: btrfs ext4 xfs So realistically, you're not going to have *full* test coverage for all of the file systems you might want to touch, no matter what you do. And even for those file systems that are technically supported by xfstests and kvm-xfstests, if they aren't being regularly run (for example, exfat, 9p, ubifs, udf, etc.) there may be bitrot, and very likely there is no one *to* actively maintain exclude files. For that matter, there might not be anyone you could turn to for help interpreting the test results. So.... I believe the most realistic thing to do is to run xfstests on a simple set of configs --- using no special mkfs or mount options --- first against the baseline, and then after you've applied your folio patches. If there are any new test failures, do something like: kvm-xfstests -c f2fs/default -C 10 generic/013 to check to see whether it's a hard failure or not. If it's a hard failure, then it's a problem with your patches.
If it's a flaky failure, it's possible you'll need to repeat the test against the baseline: git checkout origin; kbuild kvm-xfstests -c f2fs/default -C 10 generic/013 If it's also flaky on the baseline, you can ignore the test failure for the purposes of folio development. There are more complex things you could do, such as running a baseline set of tests 500 times (as Luis suggests), but I believe that for your use case, it's not a good use of your time. You'd need to spend several weeks finding *all* the flaky tests up front, especially if you want to do this for a large set of file systems. It's much more efficient to check whether a suspected test regression is really a flaky test result when you come across one. I'd also suggest using the -g quick tests for file systems other than ext4 and xfs. That's probably going to be quite sufficient for finding obvious problems that might be introduced when you're making changes to f2fs, btrfs, etc., and it will reduce the number of potential flaky tests that you might have to handle. It should be possible to automate this, and Leah and I have talked about designs to automate this process. Leah has some rough scripts that do a semantic-style diff for the baseline and after applying the proposed xfs backports. So it operates on something like this: f2fs/default: 868 tests, 10 failures, 217 skipped, 6899 seconds Failures: generic/050 generic/064 generic/252 generic/342 generic/383 generic/502 generic/506 generic/526 generic/527 generic/563 In theory, we could also have automated tools that look for the suspected test regressions, and then try running those test regressions 20 or 25 times on the baseline and after applying the patch series. Those don't exist yet, but it's just a Mere Matter of Programming.
:-) I can't promise anything, especially with dates, but developing better automation tools to support the xfs stable backports is on our near-term roadmap --- and that would probably be applicable for the folio development use case. Cheers, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
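The automated comparison described above can be sketched as a simple diff of failure sets between a baseline run and a patched run. This is an illustration only: the input format (plain lists of failing test names) and the function names are assumptions, not the interface of Leah's actual scripts.

```python
# Sketch: diff the failure sets of a baseline run and a patched run to
# surface suspected regressions. Input format is assumed for illustration.

def suspected_regressions(baseline_failures, patched_failures):
    """Tests failing with the patch applied but not on the baseline."""
    return sorted(set(patched_failures) - set(baseline_failures))

def apparently_fixed(baseline_failures, patched_failures):
    """Tests failing on the baseline but not with the patch applied."""
    return sorted(set(baseline_failures) - set(patched_failures))

baseline = ["generic/050", "generic/064", "generic/252", "generic/342"]
patched = ["generic/050", "generic/252", "generic/527"]

print(suspected_regressions(baseline, patched))  # ['generic/527']
print(apparently_fixed(baseline, patched))       # ['generic/064', 'generic/342']
```

Any test flagged as a suspected regression would then get the `-C 10`-style rerun against both trees to separate hard failures from flakes.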
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-02 17:01 ` Theodore Ts'o @ 2022-07-07 21:36 ` Luis Chamberlain 0 siblings, 0 replies; 32+ messages in thread From: Luis Chamberlain @ 2022-07-07 21:36 UTC (permalink / raw) To: Theodore Ts'o Cc: Matthew Wilcox, Zorro Lang, Amir Goldstein, linux-fsdevel, linux-block, pankydev8, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Sat, Jul 02, 2022 at 01:01:22PM -0400, Theodore Ts'o wrote: > Note: I recommend that you skip using the loop device xfstests > strategy, which Luis likes to advocate. For the perspective of > *likely* regressions caused by the Folio patches, I claim they are > going to cause you more pain than they are worth. If there are some > strange Folio/loop device interactions, they aren't likely going to be > obvious/reproduceable failures that will cause pain to linux-next > testers. While it would be nice to find **all** possible bugs before > patches go usptream to Linus, if it slows down your development > velocity to near-standstill, it's not worth it. We have to be > realistic about things. Regressions with the loopback block driver can creep up and we used to be much worse, but we have gotten better at it. Certainly testing a loopback driver can mean running into a regression with the loopback driver. But some block driver must be used in the end. > What about other file systems? Well, first of all, xfstests only has > support for the following file systems: > > 9p btrfs ceph cifs exfat ext2 ext4 f2fs gfs glusterfs jfs msdos > nfs ocfs2 overlay pvfs2 reiserfs tmpfs ubifs udf vfat virtiofs xfs > > {kvm,gce}-xfstests supports these 16 file systems: > > 9p btrfs exfat ext2 ext4 f2fs jfs msdos nfs overlay reiserfs > tmpfs ubifs udf vfat xfs > > kdevops has support for these file systems: > > btrfs ext4 xfs Thanks for this list Ted! 
And so adding support for a new filesystem in kdevops should be: * a kconfig symbol for the fs and then one per supported mkfs config option you want to support * a configuration file for it; this can be as elaborate as the one we have for xfs [0], supporting different mkfs config options, or one with just one or two mkfs config options [1]. The default is just shared information. [0] https://github.com/linux-kdevops/kdevops/blob/master/playbooks/roles/fstests/templates/xfs/xfs.config [1] https://github.com/linux-kdevops/kdevops/blob/master/playbooks/roles/fstests/templates/ext4/ext4.config > There are more complex things you could do, such as running a baseline > set of tests 500 times (as Luis suggests), I advocate 100, and I suggest that is a nice goal for enterprise kernels. I also personally advocate this confidence in a baseline for stable kernels if *I* am going to backport changes. > but I believe that for your > use case, it's not a good use of your time. You'd need to speed > several weeks finding *all* the flaky tests up front, especially if > you want to do this for a large set of file systems. It's much more > efficient to check if a suspetected test regression is really a flaky > test result when you come across them. Or you work with a test runner that already has the list of known failures / flaky failures for a target configuration, like using loopbacks. And hence I tend to attend to these for xfs, btrfs, and ext4 when I have time. My goal has been to work towards a baseline of at least 100 successful runs without failure, tracking upstream. Luis ^ permalink raw reply [flat|nested] 32+ messages in thread
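The 100-run goal can be given a rough statistical footing. Assuming a test fails independently with probability p on each run, the chance that a flaky test slips through N consecutive clean runs is (1 - p)^N; the sketch below just evaluates that expression for the failure rates mentioned in this thread (the independence assumption is mine, not anyone's claim here).

```python
# Chance that a flaky test with per-run failure probability p passes
# N runs in a row, i.e. slips into a "clean" baseline undetected.
# Assumes independent runs, which real flakes may well violate.

def miss_probability(p: float, runs: int) -> float:
    return (1.0 - p) ** runs

for label, p, runs in [("1/2", 0.5, 10), ("1/30", 1 / 30, 30), ("1/30", 1 / 30, 100)]:
    print(f"failure rate {label}, {runs} runs: miss chance {miss_probability(p, runs):.3f}")
```

A 1/30 flake has roughly a 36% chance of surviving 30 clean runs but only about a 3% chance of surviving 100, which is one way to motivate the 100-run baseline; the rarer sub-1/100 creatures mentioned earlier in the thread need correspondingly more runs to flush out.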
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 3:07 [RFC: kdevops] Standardizing on failure rate nomenclature for expunges Luis Chamberlain 2022-05-19 6:36 ` Amir Goldstein @ 2022-07-02 21:48 ` Bart Van Assche 2022-07-03 5:56 ` Amir Goldstein 2022-07-03 13:32 ` Theodore Ts'o 1 sibling, 2 replies; 32+ messages in thread From: Bart Van Assche @ 2022-07-02 21:48 UTC (permalink / raw) To: Luis Chamberlain, linux-fsdevel, linux-block Cc: amir73il, pankydev8, tytso, josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen On 5/18/22 20:07, Luis Chamberlain wrote: > I've been promoting the idea that running fstests once is nice, > but things get interesting if you try to run fstests multiple > times until a failure is found. It turns out at least kdevops has > found tests which fail with a failure rate of typically 1/2 to > 1/30 average failure rate. That is 1/2 means a failure can happen > 50% of the time, whereas 1/30 means it takes 30 runs to find the > failure. > > I have tried my best to annotate failure rates when I know what > they might be on the test expunge list, as an example: > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > The term "failure rate 1/15" is 16 characters long, so I'd like > to propose to standardize a way to represent this. How about > > generic/530 # F:1/15 > > Then we could extend the definition. F being current estimate, and this > can be just how long it took to find the first failure. A more valuable > figure would be failure rate avarage, so running the test multiple > times, say 10, to see what the failure rate is and then averaging the > failure out. So this could be a more accurate representation. 
For this > how about: > > generic/530 # FA:1/15 > > This would mean on average there failure rate has been found to be about > 1/15, and this was determined based on 10 runs. > > We should also go extend check for fstests/blktests to run a test > until a failure is found and report back the number of successes. > > Thoughts? > > Note: yes failure rates lower than 1/100 do exist but they are rare > creatures. I love them though as my experience shows so far that they > uncover hidden bones in the closet, and they they make take months and > a lot of eyeballs to resolve. I strongly disagree with annotating tests with failure rates. My opinion is that on a given test setup a test either should pass 100% of the time or fail 100% of the time. If a test passes in one run and fails in another run that either indicates a bug in the test or a bug in the software that is being tested. Examples of behaviors that can cause tests to behave unpredictably are use-after-free bugs and race conditions. How likely it is to trigger such behavior depends on a number of factors. This could even depend on external factors like which network packets are received from other systems. I do not expect that flaky tests have an exact failure rate. Hence my opinion that flaky tests are not useful and also that it is not useful to annotate flaky tests with a failure rate. If a test is flaky I think that the root cause of the flakiness must be determined and fixed. Bart. ^ permalink raw reply [flat|nested] 32+ messages in thread
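Whatever the verdict on this objection, the `F:`/`FA:` notation proposed at the top of the thread is at least easy to consume mechanically. Below is a sketch of a parser for such annotated expunge lines; the exact grammar is inferred from the examples in this thread and is an assumption, not an agreed format.

```python
import re

# Parse proposed expunge annotations such as:
#   generic/530 # F:1/15    (estimate from first observed failure)
#   generic/530 # FA:1/15   (failure rate averaged over repeated runs)
# Grammar inferred from the examples in this thread; not an official format.

ANNOTATION_RE = re.compile(
    r"^(?P<test>\S+)\s*#\s*(?P<kind>FA?):(?P<failures>\d+)/(?P<runs>\d+)\s*$"
)

def parse_expunge_line(line: str):
    m = ANNOTATION_RE.match(line.strip())
    if m is None:
        return None
    return {
        "test": m.group("test"),
        "averaged": m.group("kind") == "FA",
        "failure_rate": int(m.group("failures")) / int(m.group("runs")),
    }

print(parse_expunge_line("generic/530 # FA:1/15"))
```

A tool consuming expunge lists could use the parsed rate to decide, for example, how many reruns are needed before calling a test fixed.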
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-02 21:48 ` Bart Van Assche @ 2022-07-03 5:56 ` Amir Goldstein 2022-07-03 13:15 ` Theodore Ts'o 2022-07-04 3:25 ` Dave Chinner 2022-07-03 13:32 ` Theodore Ts'o 1 sibling, 2 replies; 32+ messages in thread From: Amir Goldstein @ 2022-07-03 5:56 UTC (permalink / raw) To: Bart Van Assche, Darrick J. Wong Cc: Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox, Dave Chinner On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@acm.org> wrote: > > On 5/18/22 20:07, Luis Chamberlain wrote: > > I've been promoting the idea that running fstests once is nice, > > but things get interesting if you try to run fstests multiple > > times until a failure is found. It turns out at least kdevops has > > found tests which fail with a failure rate of typically 1/2 to > > 1/30 average failure rate. That is 1/2 means a failure can happen > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > failure. > > > > I have tried my best to annotate failure rates when I know what > > they might be on the test expunge list, as an example: > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > to propose to standardize a way to represent this. How about > > > > generic/530 # F:1/15 > > > > Then we could extend the definition. F being current estimate, and this > > can be just how long it took to find the first failure. A more valuable > > figure would be failure rate avarage, so running the test multiple > > times, say 10, to see what the failure rate is and then averaging the > > failure out. So this could be a more accurate representation. 
For this > > how about: > > > > generic/530 # FA:1/15 > > > > This would mean on average there failure rate has been found to be about > > 1/15, and this was determined based on 10 runs. > > > > We should also go extend check for fstests/blktests to run a test > > until a failure is found and report back the number of successes. > > > > Thoughts? > > > > Note: yes failure rates lower than 1/100 do exist but they are rare > > creatures. I love them though as my experience shows so far that they > > uncover hidden bones in the closet, and they they make take months and > > a lot of eyeballs to resolve. > > I strongly disagree with annotating tests with failure rates. My opinion > is that on a given test setup a test either should pass 100% of the time > or fail 100% of the time. If a test passes in one run and fails in > another run that either indicates a bug in the test or a bug in the > software that is being tested. Examples of behaviors that can cause > tests to behave unpredictably are use-after-free bugs and race > conditions. How likely it is to trigger such behavior depends on a > number of factors. This could even depend on external factors like which > network packets are received from other systems. I do not expect that > flaky tests have an exact failure rate. Hence my opinion that flaky > tests are not useful and also that it is not useful to annotate flaky > tests with a failure rate. If a test is flaky I think that the root > cause of the flakiness must be determined and fixed. > That is true for some use cases, but unfortunately, the flaky fstests are way too valuable and too hard to replace or improve, so practically, fs developers have to run them, but not everyone does. Zorro has already proposed to properly tag the non deterministic tests with a specific group and I think there is really no other solution. The only question is whether we remove them from the 'auto' group (I think we should). 
There is probably a large overlap already between the 'stress', 'soak', and 'fuzzers' test groups and the non-deterministic tests. Moreover, if a test is not a stress/fuzzer test and it is not deterministic, then the test is likely buggy. There is only one 'stress' test not in the 'auto' group (generic/019), and only two 'soak' tests not in the 'auto' group (generic/52{1,2}). There are only three tests in the 'soak' group, and they are exactly the same three tests as in the 'long_rw' group. So instead of thinking up a new 'flaky'/'random'/'stochastic' name, we may just repurpose the 'soak' group for this matter and start moving known flaky tests from 'auto' to 'soak'. generic/52{1,2} can be removed from the 'soak' group and remain in the 'long_rw' group, unless filesystem developers would like to add those to the stochastic test run. Filesystem developers that run ./check -g auto -g soak will get the exact same test coverage as today's -g auto, and the "commoners" that run ./check -g auto will enjoy blissfully deterministic test results, at least for the default config of regularly tested filesystems (a.k.a. the ones tested by the kernel test bot). Darrick, As the one who created the 'soak' group and the only one who added tests to it, what do you think about this proposal? What do you think should be done with generic/52{1,2}? Thanks, Amir. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 5:56 ` Amir Goldstein @ 2022-07-03 13:15 ` Theodore Ts'o 2022-07-03 14:22 ` Amir Goldstein 2022-07-04 3:25 ` Dave Chinner 1 sibling, 1 reply; 32+ messages in thread From: Theodore Ts'o @ 2022-07-03 13:15 UTC (permalink / raw) To: Amir Goldstein Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox, Dave Chinner On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > > That is true for some use cases, but unfortunately, the flaky > fstests are way too valuable and too hard to replace or improve, > so practically, fs developers have to run them, but not everyone does. > > Zorro has already proposed to properly tag the non deterministic tests > with a specific group and I think there is really no other solution. The non-deterministic tests are not the sole, or even the most likely, cause of flaky tests. Or put another way, even if we used a deterministic pseudo-random number generator seed for some of the currently "non-deterministic tests" (and I believe we are for many of them already anyway), it's not going to make the flaky tests go away. That's because with many of these tests, we are running multiple threads, either in fsstress or fsx, or in the antagonist workload that is, say, running the space utilization to full to generate ENOSPC errors, and then deleting a bunch of files to trigger as many ENOSPC-hitting events as possible. > The only question is whether we remove them from the 'auto' group > (I think we should). I wouldn't; if someone wants to exclude the non-deterministic tests, once they are tagged as belonging to a group, they can just exclude that group. So there's no point removing them from the auto group IMHO.
> filesystem developers that will run ./check -g auto -g soak > will get the exact same test coverage as today's -g auto > and the "commoners" that run ./check -g auto will enjoy blissful > determitic test results, at least for the default config of regularly > tested filesystems (a.k.a, the ones tested by kernet test bot).? First of all, there are a number of tests today which are in soak or long_rw which are not in auto, so "-g auto -g soak" will *not* result in the "exact same test coverage". Secondly, as I've stated above, deterministic tests do not necessarily mean deterministic test results --- unless by "deterministic tests" you mean "completely single-threaded tests", which would eliminate a large amount of useful test coverage. Cheers, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 13:15 ` Theodore Ts'o @ 2022-07-03 14:22 ` Amir Goldstein 2022-07-03 16:30 ` Theodore Ts'o 0 siblings, 1 reply; 32+ messages in thread From: Amir Goldstein @ 2022-07-03 14:22 UTC (permalink / raw) To: Theodore Ts'o Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox, Dave Chinner On Sun, Jul 3, 2022 at 4:15 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > > > > That is true for some use cases, but unfortunately, the flaky > > fstests are way too valuable and too hard to replace or improve, > > so practically, fs developers have to run them, but not everyone does. > > > > Zorro has already proposed to properly tag the non deterministic tests > > with a specific group and I think there is really no other solution. > > The non-deterministic tests are not the sole, or even the most likely > cause of flaky tests. Or put another way, even if we used a > deterministic pseudo-random numberator seed for some of the curently > "non-determinstic tests" (and I believe we are for many of them > already anyway), it's not going to be make the flaky tests go away. > > That's because with many of these tests, we are running multiple > threads either in the fstress or fsx, or in the antogonist workload > that is say, running the space utilization to full to generate ENOSPC > errors, and then deleting a bunch of files to trigger as many ENOSPC > hitter events as possible. > > > The only question is whether we remove them from the 'auto' group > > (I think we should). > > I wouldn't; if someone wants to exclude the non-determistic tests, > once they are tagged as belonging to a group, they can just exclude > that group. 
So there's no point removing them from the auto group > IMHO. The reason I suggested that *we* change our habits is that we want to give passing-by fs testers an easier experience. Another argument in favor of splitting out -g soak from -g auto - You only need to run -g soak in a loop for as long as you like to be confident about the results. You need to run -g auto only once, by definition - if a test ends up failing the Nth time you run -g auto, then it belongs in -g soak and not in -g auto. > > > filesystem developers that will run ./check -g auto -g soak > > will get the exact same test coverage as today's -g auto > > and the "commoners" that run ./check -g auto will enjoy blissful > > determitic test results, at least for the default config of regularly > > tested filesystems (a.k.a, the ones tested by kernet test bot).? > > First of all, there are a number of tests today which are in soak or > long_rw which are not in auto, so "-g auto -g soak" will *not* result > in the "exact same test coverage". I addressed this in my proposal. I proposed to move these two tests out of soak and asked for Darrick's opinion. Who is using -g soak anyway? > > Secondly, as I've tested above, deterministic tests does not > necessasrily mean determinsitic test results --- unless by > "determinsitic tests" you mean "completely single-threaded tests", > which would eliminate a large amount of useful test coverage. > To be clear, when I wrote deterministic, what I meant was deterministic results empirically, in the same sense that Bart meant - a test should always pass. Because Luis was using the expunge lists to blacklist any test failure, no matter the failure rate, the kdevops expunge lists could be used as a first draft for the -g soak group, at least for tests that are blocklisted by kdevops for all of the ext4, xfs, and btrfs default configs on the upstream kernel. Thanks, Amir. ^ permalink raw reply [flat|nested] 32+ messages in thread
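The "run a test until a failure is found" mode proposed earlier in the thread for check, and the soak-in-a-loop habit described above, can be sketched with a small driver. The `./check` invocation and its nonzero-exit-on-failure convention are assumptions here; the runner command is injectable so the sketch can be exercised without an fstests setup.

```python
import subprocess

def run_until_failure(test_name, max_runs, check_cmd=("./check",)):
    """Run one test repeatedly; return the 1-based run number of the first
    failure (an observed failure rate of roughly 1 in that many runs), or
    None if all max_runs runs passed."""
    for run in range(1, max_runs + 1):
        result = subprocess.run(
            [*check_cmd, test_name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            return run
    return None

# Stub runners stand in for ./check so the sketch is self-contained:
print(run_until_failure("generic/530", 5, check_cmd=("true",)))   # None
print(run_until_failure("generic/530", 5, check_cmd=("false",)))  # 1
```

A run count with no failure (None) is only evidence of a failure rate better than 1/max_runs, which ties back to the question of how many runs a baseline needs.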
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 14:22 ` Amir Goldstein @ 2022-07-03 16:30 ` Theodore Ts'o 0 siblings, 0 replies; 32+ messages in thread From: Theodore Ts'o @ 2022-07-03 16:30 UTC (permalink / raw) To: Amir Goldstein Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox, Dave Chinner On Sun, Jul 03, 2022 at 05:22:17PM +0300, Amir Goldstein wrote: > > To be clear, when I wrote deterministic, what I meant was deterministic > results empirically, in the same sense that Bart meant - a test should > always pass. Well, all of the tests in the auto group pass 100% of the time for the ext4/4k and xfs/4k configs. (Well, at least if you use the HDD and SSD as the storage device. If you are using eMMC flash, or Luis's loop device config, there would be more failures.) But if we're talking about btrfs/4k, f2fs/4k, xfs/realtime, xfs/realtime_28k/logdev, ext4/bigalloc, etc., there would be a *lot* of tests that would need to be removed from the auto group. So what "non-deterministic tests" should we remove from the auto group? For what file systems, file system configs, and storage devices? What would you propose? Remember, Matthew wants something that he can use to test "dozens" of file systems that he's touching for the folio patches. If we have to remove all of the tests that fail if you are using nfs, vfat, hfs, msdos, etc., then the auto group would be pretty anemic. Let's not do that. If you want an "always pass" group, we could do that, but let's not call that the "auto" group, please. - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 5:56 ` Amir Goldstein 2022-07-03 13:15 ` Theodore Ts'o @ 2022-07-04 3:25 ` Dave Chinner 2022-07-04 7:58 ` Amir Goldstein 1 sibling, 1 reply; 32+ messages in thread From: Dave Chinner @ 2022-07-04 3:25 UTC (permalink / raw) To: Amir Goldstein Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@acm.org> wrote: > > > > On 5/18/22 20:07, Luis Chamberlain wrote: > > > I've been promoting the idea that running fstests once is nice, > > > but things get interesting if you try to run fstests multiple > > > times until a failure is found. It turns out at least kdevops has > > > found tests which fail with a failure rate of typically 1/2 to > > > 1/30 average failure rate. That is 1/2 means a failure can happen > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > > failure. > > > > > > I have tried my best to annotate failure rates when I know what > > > they might be on the test expunge list, as an example: > > > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > > to propose to standardize a way to represent this. How about > > > > > > generic/530 # F:1/15 > > > > > > Then we could extend the definition. F being current estimate, and this > > > can be just how long it took to find the first failure. 
A more valuable > > > figure would be the failure rate average, so running the test multiple > > > times, say 10, to see what the failure rate is and then averaging the > > > failure out. So this could be a more accurate representation. For this > > > how about: > > > > > > generic/530 # FA:1/15 > > > > > > This would mean on average the failure rate has been found to be about > > > 1/15, and this was determined based on 10 runs. > > > > > > We should also go extend check for fstests/blktests to run a test > > > until a failure is found and report back the number of successes. > > > > > > Thoughts? > > > > > > Note: yes, failure rates lower than 1/100 do exist but they are rare > > > creatures. I love them though, as my experience shows so far that they > > > uncover hidden bones in the closet, and they may take months and > > > a lot of eyeballs to resolve. > > > > I strongly disagree with annotating tests with failure rates. My opinion > > is that on a given test setup a test either should pass 100% of the time > > or fail 100% of the time. If a test passes in one run and fails in > > another run that either indicates a bug in the test or a bug in the > > software that is being tested. Examples of behaviors that can cause > > tests to behave unpredictably are use-after-free bugs and race > > conditions. How likely it is to trigger such behavior depends on a > > number of factors. This could even depend on external factors like which > > network packets are received from other systems. I do not expect that > > flaky tests have an exact failure rate. Hence my opinion that flaky > > tests are not useful and also that it is not useful to annotate flaky > > tests with a failure rate. If a test is flaky I think that the root > > cause of the flakiness must be determined and fixed.
> > That is true for some use cases, but unfortunately, the flaky > fstests are way too valuable and too hard to replace or improve, > so practically, fs developers have to run them, but not everyone does. Everyone *should* be running them. They find *new bugs*, and it doesn't matter how old the kernel is. e.g. if you're backporting XFS log changes and you aren't running the "flakey" recoveryloop group tests, then you are *not testing failure handling log recovery sufficiently*. Where do you draw the line? recoveryloop tests that shut down and recover the filesystem will find bugs, they are guaranteed to be flakey, and they are *absolutely necessary* to be run because they are the only tests that exercise that critical filesystem crash recovery functionality. What do we actually gain by excluding these "non-deterministic" tests from automated QA environments? > Zorro has already proposed to properly tag the non-deterministic tests > with a specific group and I think there is really no other solution. > > The only question is whether we remove them from the 'auto' group > (I think we should). As per above, this shows that many people simply don't understand what many of these non-deterministic tests are actually exercising, and hence what they fail to test by excluding them from automated testing. > There is probably a large overlap already between the 'stress' 'soak' and > 'fuzzers' test groups and the non-deterministic tests. > Moreover, if the test is not a stress/fuzzer test and it is not deterministic > then the test is likely buggy. > > There is only one 'stress' test not in the 'auto' group (generic/019), only two > 'soak' tests not in the 'auto' group (generic/52{1,2}). > There are only three tests in the 'soak' group and they are also exactly > the same three tests in the 'long_rw' group.
> So instead of thinking up a new 'flaky' 'random' 'stochastic' name > we may just repurpose the 'soak' group for this matter and start > moving known flaky tests from 'auto' to 'soak'. Please, no. The policy for the auto group is inclusive, not exclusive. It is based on the concept that every test is valuable and should be run if possible. Hence any test that generally passes, does not run forever and does not endanger the system should be a member of the auto group. That effectively only rules out fuzzer and dangerous tests from being in the auto group, as long-running tests should be scaled by TIME_FACTOR/LOAD_FACTOR and hence the default test behaviour results in only a short run time. If someone wants to *reduce their test coverage* for whatever reason (e.g. runtime, wanting to only run pass/fail tests, etc.) then the mechanism we already have in place for this is for that person to use *exclusion groups*. i.e. we exclude subsets of tests from the default set, we don't remove them from the default set. Such an environment would run: ./check -g auto -x soak So that the test environment doesn't run the "non-deterministic" tests in the 'soak' group. i.e. the requirements of this test environment do not dictate the tests that every other test environment runs by default. > generic/52{1,2} can be removed from the 'soak' group and remain > in the 'long_rw' group, unless filesystem developers would like to > add those to the stochastic test run. > > filesystem developers that will run ./check -g auto -g soak > will get the exact same test coverage as today's -g auto > and the "commoners" that run ./check -g auto will enjoy blissful > deterministic test results, at least for the default config of regularly > tested filesystems (a.k.a. the ones tested by kernel test bot).
An argument that says "everyone else has to change what they do so I don't have to change" means that the person making the argument thinks their requirements are more important than the requirements of anyone else. The test run policy mechanisms we already have avoid this whole can of worms - we don't need to care about the specific test requirements of any specific test environment because the default is inclusive and it is trivial to exclude tests from that default set if needed. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 32+ messages in thread
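[Editor's note: the "run until failure, report the number of successes" extension to check that Luis proposed could look roughly like the sketch below. The `run_test` callable is a hypothetical stand-in for invoking `./check <test>` and inspecting its exit status; fstests itself does not provide this hook.]

```python
def run_until_failure(run_test, max_runs=100):
    """Run a test repeatedly until it fails or max_runs is reached.

    run_test() returns True on pass, False on fail (standing in for one
    `./check <test>` invocation).  Returns (successes, failed) so that a
    failure after 14 passes can be annotated as F:1/15 in an expunge list.
    """
    successes = 0
    for _ in range(max_runs):
        if run_test():
            successes += 1
        else:
            return successes, True
    return successes, False

# Simulated flaky test that fails on its 15th run:
outcomes = iter([True] * 14 + [False])
successes, failed = run_until_failure(lambda: next(outcomes))
print(f"F:1/{successes + 1}" if failed else "no failure observed")  # F:1/15
```

Note this only yields the "F" (first-failure) estimate; the "FA" average would repeat the whole loop several times and average the observed rates.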
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-04 3:25 ` Dave Chinner @ 2022-07-04 7:58 ` Amir Goldstein 2022-07-05 2:29 ` Theodore Ts'o 2022-07-05 3:11 ` Dave Chinner 0 siblings, 2 replies; 32+ messages in thread From: Amir Goldstein @ 2022-07-04 7:58 UTC (permalink / raw) To: Dave Chinner Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Mon, Jul 4, 2022 at 6:25 AM Dave Chinner <david@fromorbit.com> wrote: > On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > > On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@acm.org> wrote: > > > On 5/18/22 20:07, Luis Chamberlain wrote: > > > > [Luis's failure rate proposal, quoted in full earlier in the thread, snipped] > > > I strongly disagree with annotating tests with failure rates. [...]
> > > If a test is flaky I think that the root cause of the flakiness must be determined and fixed. > > That is true for some use cases, but unfortunately, the flaky > > fstests are way too valuable and too hard to replace or improve, > > so practically, fs developers have to run them, but not everyone does. > Everyone *should* be running them. They find *new bugs*, and it > doesn't matter how old the kernel is. e.g. if you're backporting XFS > log changes and you aren't running the "flakey" recoveryloop group > tests, then you are *not testing failure handling log recovery > sufficiently*. > Where do you draw the line? recoveryloop tests that shut down and > recover the filesystem will find bugs, they are guaranteed to be > flakey, and they are *absolutely necessary* to be run because they > are the only tests that exercise that critical filesystem crash > recovery functionality. > What do we actually gain by excluding these "non-deterministic" > tests from automated QA environments? Automated QA environment is a broad term. We all have automated QA environments. But there is a specific class of automated test env, such as CI build bots, that do not tolerate human intervention. Is it enough to run only the deterministic tests to validate xfs code? No it is not. My LTS environment is human-monitored - I look at every failure and analyse the logs and look at historic data to decide if they are regressions or not. A bot simply cannot do that. The bot can go back and run the test N times on baseline vs patch. The question is, do we want kernel test bot to run -g auto -x soak on linux-next and report issues to us? I think the answer to this question should be yes. Do we want kernel test bot to run -g auto and report flaky test failures to us? I am pretty sure that the answer is no. So we need -x soak or whatever for this specific class of machine-automated validation. > > Zorro has already proposed to properly tag the non-deterministic tests
> > with a specific group and I think there is really no other solution. > > The only question is whether we remove them from the 'auto' group > > (I think we should). > As per above, this shows that many people simply don't understand > what many of these non-deterministic tests are actually exercising, > and hence what they fail to test by excluding them from automated > testing. [...] > If someone wants to *reduce their test coverage* for whatever reason > (e.g. runtime, wanting to only run pass/fail tests, etc.) then the > mechanism we already have in place for this is for that person to > use *exclusion groups*. i.e. we exclude subsets of tests from the > default set, we don't remove them from the default set.
> Such an environment would run: > ./check -g auto -x soak > So that the test environment doesn't run the "non-deterministic" > tests in the 'soak' group. i.e. the requirements of this test > environment do not dictate the tests that every other test > environment runs by default. OK. > > generic/52{1,2} can be removed from the 'soak' group and remain > > in the 'long_rw' group, unless filesystem developers would like to > > add those to the stochastic test run. > > filesystem developers that will run ./check -g auto -g soak > > will get the exact same test coverage as today's -g auto > > and the "commoners" that run ./check -g auto will enjoy blissful > > deterministic test results, at least for the default config of regularly > > tested filesystems (a.k.a. the ones tested by kernel test bot). > An argument that says "everyone else has to change what they do so I > don't have to change" means that the person making the argument > thinks their requirements are more important than the requirements > of anyone else. Unless that person was not arguing for themselves... I was referring to passing-by developers that develop a patch that interacts with fs code but who do not usually develop and test filesystems. Not for myself. > The test run policy mechanisms we already have avoid > this whole can of worms - we don't need to care about the specific > test requirements of any specific test environment because the > default is inclusive and it is trivial to exclude tests from that > default set if needed. I had the humble notion that we should make running fstests as easy as possible for passing-by developers, because I have had the chance to get feedback from some developers on their first-time attempt to run fstests and it wasn't pleasant, but never mind. -g auto -x soak is fine. When you think about it, many fs developers run ./check -g auto, so we should not interfere with that, but I bet very few run './check'?
so we could make the default for './check' some group combination that is as deterministic as possible. If I am not mistaken, LTP's main run script runltp.sh, when run w/o parameters, has a default set of tests, which obviously get run by the kernel bots. Thanks, Amir. ^ permalink raw reply [flat|nested] 32+ messages in thread
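[Editor's note: whatever group policy is settled on, the nomenclature under discussion is easy to consume mechanically. A sketch of parsing Luis's proposed F:/FA: annotations out of a kdevops-style expunge file; the exact file format beyond "test # comment" lines is an assumption:]

```python
import re

# Matches lines like "generic/530 # F:1/15" or "xfs/297 # FA:1/30 <url>".
EXPUNGE_RE = re.compile(
    r"^(?P<test>\S+/\d+)\s*(?:#\s*(?P<kind>FA?):(?P<num>\d+)/(?P<den>\d+))?"
)

def parse_expunge_line(line: str):
    """Return (test, kind, failure_rate) for an expunge entry.

    kind is 'F' (first-failure estimate), 'FA' (averaged over several
    run-until-failure loops) or None when the entry is unannotated."""
    m = EXPUNGE_RE.match(line.strip())
    if not m:
        return None
    kind = m.group("kind")
    rate = int(m.group("num")) / int(m.group("den")) if kind else None
    return m.group("test"), kind, rate

print(parse_expunge_line("generic/530 # F:1/15"))
print(parse_expunge_line("generic/019"))
```

Tooling like this is one practical payoff of standardizing on a short fixed token instead of free-form "failure rate about 1/15" comments.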
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-04 7:58 ` Amir Goldstein @ 2022-07-05 2:29 ` Theodore Ts'o 2022-07-05 3:11 ` Dave Chinner 0 siblings, 0 replies; 32+ messages in thread From: Theodore Ts'o @ 2022-07-05 2:29 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Mon, Jul 04, 2022 at 10:58:22AM +0300, Amir Goldstein wrote: > I had the humble notion that we should make running fstests as easy as > possible for passing-by developers, because I have had the chance to get > feedback from some developers on their first-time attempt to run fstests > and it wasn't pleasant, but never mind. > -g auto -x soak is fine. I really don't think using some kind of group exclusion is the right way to go. First of all, the definition of "determinism" keeps shifting around. The most recent way you've used it, it's for "passing-by developers" to be able to tell if their patches may have caused a regression. Is that right? Secondly, a group-based exclusion list is problematic because groups are fixed with respect to a kernel version (and therefore will very easily go out of date), and they are fixed with respect to a file system type, while which tests will either fail or be flaky will vary, wildly, by file system type. As an example of the "fixed in time" problem, I had a global exclude for the following tests: generic/471 generic/484 generic/554. That's because they were failing for me all the time, for various reasons that made them look like either test bugs or global kernel bugs.
But unless you regularly check whether a test which is on a particular exclusion list or in some magic "soak" group still fails (and "soak" is a ***massive*** misnomer --- the group as you've proposed it should really be named "might-fail-perhaps-on-some-fs-type-or-config"), the tests could remain on the list long after the test bug or the kernel bug has been addressed. So as of commit 7fd7c21547a1 ("test-appliance: add kernel version conditionals using cpp to exclude files") in xfstests-bld, generic/471 and generic/484 are only excluded when testing LTS kernels older than 5.10, and generic/554 is only excluded when testing LTS kernels older than 5.4. The point is (a) you can't easily do version-specific exclusions with xfstests group declarations, and (b) someone needs to periodically sweep through the tests to see if tests should be in the soak or "might-fail-perhaps-on-some-fs-type-or-config" group. As an example of why you're going to want to do the exclusions on a per-file-system basis, consider the tests that would have to be added to the "might-fail-perhaps-on-some-fs-type-or-config" group. If the goal is to let a "drive-by developer" know whether their patch has caused a regression, especially someone like Willy who might be modifying a large number of file systems, then you would need to add at *least* 135 tests to the "soak" or "might-fail-perhaps-on-some-fs-type-or-config" group: (These are current test failures that I have observed using v5.19-rc4.)
btrfs/default: 1176 tests, 7 failures, 244 skipped, 8937 seconds Failures: btrfs/012 btrfs/219 btrfs/235 generic/041 generic/297 generic/298 shared/298 exfat/default: 1222 tests, 23 failures, 552 skipped, 1561 seconds Failures: generic/013 generic/309 generic/310 generic/394 generic/409 generic/410 generic/411 generic/428 generic/430 generic/431 generic/432 generic/433 generic/438 generic/443 generic/465 generic/490 generic/519 generic/563 generic/565 generic/591 generic/633 generic/639 generic/676 ext2/default: 1188 tests, 4 failures, 472 skipped, 2803 seconds Failures: generic/347 generic/607 generic/614 generic/631 f2fs/default: 877 tests, 5 failures, 217 skipped, 4249 seconds Failures: generic/050 generic/064 generic/252 generic/506 generic/563 jfs/default: 1074 tests, 62 failures, 404 skipped, 3695 seconds Failures: generic/015 generic/034 generic/039 generic/040 generic/041 generic/056 generic/057 generic/065 generic/066 generic/073 generic/079 generic/083 generic/090 generic/101 generic/102 generic/104 generic/106 generic/107 generic/204 generic/226 generic/258 generic/260 generic/269 generic/288 generic/321 generic/322 generic/325 generic/335 generic/336 generic/341 generic/342 generic/343 generic/348 generic/376 generic/405 generic/416 generic/424 generic/427 generic/467 generic/475 generic/479 generic/480 generic/481 generic/489 generic/498 generic/502 generic/510 generic/520 generic/526 generic/527 generic/534 generic/535 generic/537 generic/547 generic/552 generic/557 generic/563 generic/607 generic/614 generic/629 generic/640 generic/690 nfs/loopback: 818 tests, 2 failures, 345 skipped, 9365 seconds Failures: generic/426 generic/551 reiserfs/default: 1076 tests, 25 failures, 413 skipped, 3368 seconds Failures: generic/102 generic/232 generic/235 generic/258 generic/321 generic/355 generic/381 generic/382 generic/383 generic/385 generic/386 generic/394 generic/418 generic/520 generic/533 generic/535 generic/563 generic/566 generic/594 generic/603 
generic/614 generic/620 generic/634 generic/643 generic/691 vfat/default: 1197 tests, 42 failures, 528 skipped, 4616 seconds Failures: generic/003 generic/130 generic/192 generic/213 generic/221 generic/258 generic/309 generic/310 generic/313 generic/394 generic/409 generic/410 generic/411 generic/428 generic/430 generic/431 generic/432 generic/433 generic/438 generic/443 generic/465 generic/467 generic/477 generic/495 generic/519 generic/563 generic/565 generic/568 generic/569 generic/589 generic/632 generic/633 generic/637 generic/638 generic/639 generic/644 generic/645 generic/656 generic/676 generic/683 generic/688 generic/689 This is not *all* of the tests. There are a number of file system types that are causing the VM to crash. I haven't had time this weekend to figure out what tests need to be added to the exclude group for udf, ubifs, overlayfs, etc. So there might be even *more* tests that would need to be added to the "might-fail-perhaps-on-some-fs-type-or-config" group. It *could* be fewer, if we want to exclude reiserfs and jfs, on the theory that they might be deprecated soon. But there are still some very commonly used file systems, such as vfat, exfat, etc., that have a *huge* number of failing tests that are going to make life unpleasant for the drive-by developer/tester. And there are other file systems which will cause a kernel crash or lockup on v5.19-rc4, which certainly will give trouble for the drive-by tester. Which is why I argue that using a group, whether it's called soak, or something else, to exclude all of the tests that might fail and thus confuse the passing-by fs testers is not the best way to go. > The reason I suggested that *we* change our habits is because > we want to give passing-by fs testers an easier experience. Realistically, if we want to give passing-by fs testers an easier experience, we need to give these testers a more turn-key experience. This was part of my original goals when I created kvm-xfstests and gce-xfstests.
This is why I upload pre-created VM images for kvm-xfstests and gce-xfstests --- so people don't have to build xfstests and all their dependencies, and to figure out how to set up and configure it. For example, it means that I can tell ext4 developers to just run "kvm-xfstests smoke" as a bare minimum before sending me a patch for review. There is more that we clearly need to do if we want to make something which is completely turn-key for a drive-by tester, especially if they need to test more than just ext4 and xfs. I have some ideas, and this is something that I'm hoping to do more work in the next few months. If someone is interested in contributing some time and energy to this project, please give me a ring. Many hands make light work, and all that. Cheers, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
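[Editor's note: Ted's point that exclusions must vary by file system, config, and kernel version suggests generating per-config exclude files from observed results rather than hard-coding a shared group. A rough sketch; the sample data comes from the summaries above, while the file layout and comment format are invented:]

```python
# Each entry mirrors a "Failures:" line from the per-config summaries above.
observed = {
    "ext2/default": ["generic/347", "generic/607", "generic/614", "generic/631"],
    "nfs/loopback": ["generic/551", "generic/426"],
}

def exclude_list(config, failures):
    """Render one per-config exclude file: a provenance comment followed by
    one test per line, so entries can be re-audited on each kernel release
    instead of rotting the way a static group annotation would."""
    lines = [f"# observed failures for {config}; re-verify on each kernel"]
    lines += sorted(failures)
    return "\n".join(lines) + "\n"

for config in observed:
    print(exclude_list(config, observed[config]))
```

This mirrors the kdevops layout of one expunge file per kernel version and per fs config, rather than a single group baked into the tests themselves.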
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-04 7:58 ` Amir Goldstein 2022-07-05 2:29 ` Theodore Ts'o @ 2022-07-05 3:11 ` Dave Chinner 2022-07-06 10:11 ` Amir Goldstein 1 sibling, 1 reply; 32+ messages in thread From: Dave Chinner @ 2022-07-05 3:11 UTC (permalink / raw) To: Amir Goldstein Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Mon, Jul 04, 2022 at 10:58:22AM +0300, Amir Goldstein wrote: > On Mon, Jul 4, 2022 at 6:25 AM Dave Chinner <david@fromorbit.com> wrote: > > On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > > > On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@acm.org> wrote: > > > > On 5/18/22 20:07, Luis Chamberlain wrote: > > > > > [Luis's failure rate proposal, quoted in full earlier in the thread, snipped] > > > > I strongly disagree with annotating tests with failure rates. [...]
> > > > If a test is flaky I think that the root cause of the flakiness must be determined and fixed. > > > That is true for some use cases, but unfortunately, the flaky > > > fstests are way too valuable and too hard to replace or improve, > > > so practically, fs developers have to run them, but not everyone does. > > Everyone *should* be running them. They find *new bugs*, and it > > doesn't matter how old the kernel is. e.g. if you're backporting XFS > > log changes and you aren't running the "flakey" recoveryloop group > > tests, then you are *not testing failure handling log recovery > > sufficiently*. > > Where do you draw the line? recoveryloop tests that shut down and > > recover the filesystem will find bugs, they are guaranteed to be > > flakey, and they are *absolutely necessary* to be run because they > > are the only tests that exercise that critical filesystem crash > > recovery functionality. > > What do we actually gain by excluding these "non-deterministic" > > tests from automated QA environments? > Automated QA environment is a broad term. > We all have automated QA environments. > But there is a specific class of automated test env, such as CI build bots, > that do not tolerate human intervention. Those environments need curated test lists because failures gate progress. They also tend to run in resource-limited environments, as fstests is not the only set of tests that are run. Hence, generally speaking, CI is not an environment where you'd be running a full "auto" group set of tests. Even the 'quick' group (which takes an hour to run here) is often far too time and resource intensive for a CI system to use effectively. IOWs, we can't easily curate a set of tests that are appropriate for all CI environments - it's up to the people running the CI environment to determine what level of testing is appropriate for gating commits to their source tree, not the fstests maintainers or developers...
> Is it enough to run only the deterministic tests to validate xfs code? > No it is not. > > My LTS environment is human monitored - I look at every failure and > analyse the logs and look at historic data to decide if they are regressions > or not. A bot simply cannot do that. > The bot can go back and run the test N times on baseline vs patch. > > The question is, do we want kernel test bot to run -g auto -x soak > on linux-next and report issues to us? > > I think the answer to this question should be yes. > > Do we want kernel test bot to run -g auto and report flaky test > failures to us? > > I am pretty sure that the answer is no. My answer to both is *yes, absolutely*. The zero-day kernel test bot runs all sorts of non-deterministic tests, including performance regression testing. We want these flakey/non-deterministic tests run in such environments, because they are often configurations we do not have access to and/or would never even consider. e.g. 128p server with a single HDD running IO scalability tests like AIM7... This is exactly where such automated testing provides developers with added value - it covers both hardware and software configs that individual developers cannot exercise themselves. Developers may or may not pay attention to those results depending on the test that "fails" and the hardware it "failed" on, but the point is that it got tested on something we'd never get coverage on otherwise. > > The test run policy mechanisms we already have avoid > > this whole can of worms - we don't need to care about the specific > > test requirements of any specific test environment because the > > default is inclusive and it is trivial to exclude tests from that > > default set if needed. > > > > I had the humble notion that we should make running fstests > as easy as possible for passing-by developers, because I have had the > chance to get feedback from some developers on their first time > attempt to run fstests and it wasn't pleasant, but nevermind. 
> -g auto -x soak is fine. I think that the way to do this is the way Ted has described - wrap fstests in an environment where all the required knowledge is already encapsulated and the "drive by testers" just need to crank the handle and it churns out results. As it is, I don't think making things easier for "drive-by" testing at the expense of making things arbitrarily different and/or harder for people who use it every day is a good trade-off. The "target market" for fstests is *filesystem developers* and people who spend their working life *testing filesystems*. The number of people who do *useful* "drive-by" testing of filesystems is pretty damn small, and IMO that niche is nowhere near as important as making things better for the people who use fstests every day.... > When you think about it, many fs developers run ./check -g auto, > so we should not interfere with that, but I bet very few run './check'? > so we could make the default for './check' some group combination > that is as deterministic as possible. Bare ./check invocations are designed to run every test, regardless of what group they are in. Stop trying to redefine longstanding existing behaviour - if you want to define "deterministic" tests so that you can run just those tests, define a group for it, add all the tests to it, and then document it in the README as "if you have no experience with fstests, this is where you should start". Good luck keeping that up to date, though, as you're now back to the same problem that Ted describes, which is the "deterministic" group changes based on kernel, filesystem, config, etc. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 32+ messages in thread
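The nomenclature proposed at the top of the thread (run a test repeatedly until it fails, report the number of successful runs, and record that as an `F:1/N` annotation) is small enough to sketch. The helper names below are invented for illustration, and the actual `./check <test>` invocation is abstracted behind a callable; this is not code from fstests or kdevops:

```python
def run_until_failure(run_test, max_runs=100):
    """Run `run_test()` repeatedly until it returns False (a failure).

    Returns the 1-based run number of the first failure, or None if the
    test passed `max_runs` times in a row.
    """
    for n in range(1, max_runs + 1):
        if not run_test():
            return n
    return None

def annotate(test_name, first_failure_run):
    """Format an expunge-style annotation from the first observed failure."""
    if first_failure_run is None:
        return test_name  # no annotation: no failure was observed
    return f"{test_name}  # F:1/{first_failure_run}"

# Example with a simulated flaky test that fails on its 15th run.
runs = iter(range(1, 1000))
first_fail = run_until_failure(lambda: next(runs) != 15)
print(annotate("generic/530", first_fail))  # generic/530  # F:1/15
```

In practice `run_test` would shell out to `./check generic/530` and inspect the exit status; the `FA:` variant from the proposal would repeat this loop several times and average the observed run counts.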
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-05 3:11 ` Dave Chinner @ 2022-07-06 10:11 ` Amir Goldstein 2022-07-06 14:29 ` Theodore Ts'o 0 siblings, 1 reply; 32+ messages in thread From: Amir Goldstein @ 2022-07-06 10:11 UTC (permalink / raw) To: Dave Chinner Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Tue, Jul 5, 2022 at 6:11 AM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Jul 04, 2022 at 10:58:22AM +0300, Amir Goldstein wrote: > > On Mon, Jul 4, 2022 at 6:25 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > > > > On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@acm.org> wrote: > > > > > > > > > > On 5/18/22 20:07, Luis Chamberlain wrote: > > > > > > I've been promoting the idea that running fstests once is nice, > > > > > > but things get interesting if you try to run fstests multiple > > > > > > times until a failure is found. It turns out at least kdevops has > > > > > > found tests which fail with a failure rate of typically 1/2 to > > > > > > 1/30 average failure rate. That is 1/2 means a failure can happen > > > > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > > > > > failure. > > > > > > > > > > > > I have tried my best to annotate failure rates when I know what > > > > > > they might be on the test expunge list, as an example: > > > > > > > > > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > > > > > > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > > > > > to propose to standardize a way to represent this. 
How about > > > > > > > > > > > > generic/530 # F:1/15 > > > > > > > > > > > > Then we could extend the definition. F being the current estimate, and this > > > > > > can be just how long it took to find the first failure. A more valuable > > > > > > figure would be the failure rate average, so running the test multiple > > > > > > times, say 10, to see what the failure rate is and then averaging the > > > > > > failures out. So this could be a more accurate representation. For this > > > > > > how about: > > > > > > > > > > > > generic/530 # FA:1/15 > > > > > > > > > > > > This would mean on average the failure rate has been found to be about > > > > > > 1/15, and this was determined based on 10 runs. > > > > > > > > > > > > We should also go extend check for fstests/blktests to run a test > > > > > > until a failure is found and report back the number of successes. > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > Note: yes, failure rates lower than 1/100 do exist but they are rare > > > > > > creatures. I love them though, as my experience shows so far that they > > > > > > uncover hidden bones in the closet, and they may take months and > > > > > > a lot of eyeballs to resolve. > > > > > > > > > > I strongly disagree with annotating tests with failure rates. My opinion > > > > > is that on a given test setup a test either should pass 100% of the time > > > > > or fail 100% of the time. If a test passes in one run and fails in > > > > > another run that either indicates a bug in the test or a bug in the > > > > > software that is being tested. Examples of behaviors that can cause > > > > > tests to behave unpredictably are use-after-free bugs and race > > > > > conditions. How likely it is to trigger such behavior depends on a > > > > > number of factors. This could even depend on external factors like which > > > > > network packets are received from other systems. I do not expect that > > > > > flaky tests have an exact failure rate. 
Hence my opinion that flaky > > > > > tests are not useful and also that it is not useful to annotate flaky > > > > > tests with a failure rate. If a test is flaky I think that the root > > > > > cause of the flakiness must be determined and fixed. > > > > > > > > > > > > That is true for some use cases, but unfortunately, the flaky > > > > fstests are way too valuable and too hard to replace or improve, > > > > so practically, fs developers have to run them, but not everyone does. > > > > > > Everyone *should* be running them. They find *new bugs*, and it > > > doesn't matter how old the kernel is. e.g. if you're backporting XFS > > > log changes and you aren't running the "flakey" recoveryloop group > > > tests, then you are *not testing failure handling log recovery > > > sufficiently*. > > > > > > Where do you draw the line? recoveryloop tests that shutdown and > > > recover the filesystem will find bugs, they are guaranteed to be > > > flakey, and they are *absolutely necessary* to be run because they > > > are the only tests that exercise that critical filesystem crash > > > recovery functionality. > > > > > > What do we actually gain by excluding these "non-deterministic" > > > tests from automated QA environments? > > > > > > > Automated QA environment is a broad term. > > We all have automated QA environments. > > But there is a specific class of automated test env, such as CI build bots > > that do not tolerate human intervention. > > Those environments need curated test lists because > failures gate progress. They also tend to run in resource > limited environments as fstests is not the only set of tests that > are run. Hence, generally speaking, CI is not an environment where you'd > be running the full "auto" group set of tests. Even the 'quick' group > (which takes an hour to run here) is often far too time and resource > intensive for a CI system to use effectively. 
> > IOWs, we can't easily curate a set of tests that are appropriate for > all CI environments - it's up to the people running the CI > environment to determine what level of testing is appropriate for > gating commits to their source tree, not the fstests maintainers or > developers... > OK. I think that is the way CIFS is doing CI - Running a whitelist of (probably quick) tests known to pass on cifs. > > Is it enough to run only the deterministic tests to validate xfs code? > > No it is not. > > > > My LTS environment is human monitored - I look at every failure and > > analyse the logs and look at historic data to decide if they are regressions > > or not. A bot simply cannot do that. > > The bot can go back and run the test N times on baseline vs patch. > > > > The question is, do we want kernel test bot to run -g auto -x soak > > on linux-next and report issues to us? > > > > I think the answer to this question should be yes. > > > > Do we want kernel test bot to run -g auto and report flaky test > > failures to us? > > > > I am pretty sure that the answer is no. > > My answer to both is *yes, absolutely*. > Ok. > The zero-day kernel test bot runs all sorts of non-deterministic > tests, including performance regression testing. We want these > flakey/non-deterministic tests run in such environments, because > they are often configurations we do not have access to and/or would > never even consider. e.g. 128p server with a single HDD running IO > scalability tests like AIM7... > > This is exactly where such automated testing provides developers > with added value - it covers both hardware and software configs that > individual developers cannot exercise themselves. Developers may or > may not pay attention to those results depending on the test that > "fails" and the hardware it "failed" on, but the point is that it > got tested on something we'd never get coverage on otherwise. 
> So I am wondering what is the status today, because I rarely see fstests failure reports from kernel test bot on the list, but there are some reports. Does anybody have a clue what hw/fs/config/group of fstests kernel test bot is running on linux-next? Did any fs maintainer communicate with the kernel test bot maintainer about this? > > > The test run policy mechanisms we already have avoid > > > this whole can of worms - we don't need to care about the specific > > > test requirements of any specific test environment because the > > > default is inclusive and it is trivial to exclude tests from that > > > default set if needed. > > > > > > > I had the humble notion that we should make running fstests > > as easy as possible for passing-by developers, because I have had the > > chance to get feedback from some developers on their first time > > attempt to run fstests and it wasn't pleasant, but nevermind. > > -g auto -x soak is fine. > > I think that the way to do this is the way Ted has described - wrap > fstests in an environment where all the required knowledge is > already encapsulated and the "drive by testers" just need to crank > the handle and it churns out results. > > As it is, I don't think making things easier for "drive-by" testing > at the expense of making things arbitrarily different and/or harder > for people who use it every day is a good trade-off. The "target > market" for fstests is *filesystem developers* and people who spend > their working life *testing filesystems*. The number of people who > do *useful* "drive-by" testing of filesystems is pretty damn small, > and IMO that niche is nowhere near as important as making things > better for the people who use fstests every day.... > I agree that using fstests runners for drive-by testing makes more sense. > > When you think about it, many fs developers run ./check -g auto, > > so we should not interfere with that, but I bet very few run './check'? 
> > so we could make the default for './check' some group combination > > that is as deterministic as possible. > > Bare ./check invocations are designed to run every test, regardless > of what group they are in. > > Stop trying to redefine longstanding existing behaviour - if you > want to define "deterministic" tests so that you can run just those > tests, define a group for it, add all the tests to it, and then > document it in the README as "if you have no experience with > fstests, this is where you should start". OK. Or better yet, use an fstests runner. > > Good luck keeping that up to date, though, as you're now back to the > same problem that Ted describes, which is the "deterministic" group > changes based on kernel, filesystem, config, etc. > It's true. I think there is some room for improvement in how tests are classified in the fstests repo. I will elaborate in my reply to Ted. Thanks, Amir. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-06 10:11 ` Amir Goldstein @ 2022-07-06 14:29 ` Theodore Ts'o 2022-07-06 16:35 ` Amir Goldstein 0 siblings, 1 reply; 32+ messages in thread From: Theodore Ts'o @ 2022-07-06 14:29 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Wed, Jul 06, 2022 at 01:11:16PM +0300, Amir Goldstein wrote: > > So I am wondering what is the status today, because I rarely > see fstests failure reports from kernel test bot on the list, but there > are some reports. > > Does anybody have a clue what hw/fs/config/group of fstests > kernel test bot is running on linux-next? The zero-day test bot only reports test regressions. So they have some list of tests that have failed in the past, and they only report *new* test failures. This is not just true for fstests, but it's also true for things like check warnings and compiler warnings --- and I suspect it's for those sorts of reports that caused the zero-day bot to keep state, and to filter out test failures and/or check warnings and/or compiler warnings, so that only new test failures and/or new compiler warnings are reported. If they didn't, they would be spamming kernel developers, and given how.... "kind and understanding" kernel developers are at getting spammed, especially when sometimes the complaints are bogus ones (either test bugs or compiler bugs), my guess is that they did the filtering out of sheer self-defense. It certainly wasn't something requested by a file system developer as far as I know. So this is how I think an automated system for "drive-by testers" should work. First, the tester would specify the baseline/origin tag, and the testing system would run the tests on the baseline once. 
Hopefully, the test runner already has exclude files so that kernel bugs that cause an immediate kernel crash or deadlock would already be in the exclude list. But as I've discovered this weekend, for file systems that I haven't tried in a few years, like udf or ubifs, etc., there may be tests missing from the exclude list that cause the test VM to stop responding and/or crash. I have a planned improvement where if you are using gce-xfstests's lightweight test manager, since the LTM is constantly reading the serial console, a deadlock can be detected and the LTM can restart the VM. The VM can then disambiguate between a forced reboot caused by the LTM and a forced shutdown caused by the use of a preemptible VM (a planned feature not yet fully implemented), and the test runner can skip the tests already run, and skip the test which caused the crash or deadlock, and this could be reported so that eventually, the test could be added to the exclude file to benefit those people who are using kvm-xfstests. (This is an example of a planned improvement in xfstests-bld; if someone is interested in helping to implement it, they should give me a ring.) Once the tests which are failing given a particular baseline are known, this state would then get saved, and now the tests can be run on the drive-by developer's changes. We can now compare the known failures for the baseline with those for the changed kernel, and if there are any new failures, there are two possibilities: (a) this was a new failure caused by the drive-by developer's changes, (b) this was a pre-existing known flake. To disambiguate between these two cases, we now run the failed test N times (where N is probably something like 10-50 times; I normally use 25 times) on the changed kernel, and get the failure rate. If the failure rate is 100%, then this is almost certainly (a). 
If the failure rate is < 100% (and greater than 0%), then we need to rerun the failed test on the baseline kernel N times as well; if the failure rate on the baseline is 0%, then we should do a bisection search to determine the guilty commit. If the failure rate is 0%, then this is either an extremely rare flake, in which case we might need to increase N --- or it's an example of a test failure which is sensitive to the order in which tests are run, in which case we may need to rerun all of the tests in order up to the failed test. This is right now what I do when processing patches for upstream. It's also rather similar to what we're doing for the XFS stable backports, because it's much more efficient than running the baseline tests 100 times (which can take a week of continuous testing per Luis's comments) --- we only run tests dozens (or more) of times where a potential flake has been found, as opposed to *all* tests. It's all done manually, but it would be great if we could automate this to make life easier for XFS stable backporters, and *also* for drive-by developers. And again, if anyone is interested in helping with this, especially if you're familiar with shell, python 3, and/or the Go language, please contact me off-line. Cheers, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
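The triage flow Ted describes above reduces to a small decision procedure. A sketch under the assumption that the failure counts for both kernels have already been gathered by rerunning the test; the function name and return labels are invented for illustration, and the expensive parts (the reruns, the bisection, the ordered replay) are left out:

```python
def classify(fail_changed, fail_baseline, n_runs=25):
    """Classify a newly-seen test failure, per the triage flow above.

    fail_changed / fail_baseline: failures observed in n_runs on the
    patched kernel and on the baseline kernel, respectively.
    """
    rate_changed = fail_changed / n_runs
    if rate_changed == 1.0:
        return "regression"          # fails every time: almost certainly the patch
    if rate_changed > 0.0:
        if fail_baseline == 0:
            return "bisect"          # flaky only on the patched kernel
        return "pre-existing flake"  # flaky on both kernels: a known flake
    # 0% on the rerun: very rare flake, or a test-ordering sensitivity
    return "increase N or replay test order"
```

For example, `classify(25, 0)` returns `"regression"`, while `classify(3, 2)` returns `"pre-existing flake"`; this only captures the branching logic, not the test execution itself.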
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-06 14:29 ` Theodore Ts'o @ 2022-07-06 16:35 ` Amir Goldstein 0 siblings, 0 replies; 32+ messages in thread From: Amir Goldstein @ 2022-07-06 16:35 UTC (permalink / raw) To: Theodore Ts'o Cc: Dave Chinner, Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Wed, Jul 6, 2022 at 5:30 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Wed, Jul 06, 2022 at 01:11:16PM +0300, Amir Goldstein wrote: > > > > So I am wondering what is the status today, because I rarely > > see fstests failure reports from kernel test bot on the list, but there > > are some reports. > > > > Does anybody have a clue what hw/fs/config/group of fstests > > kernel test bot is running on linux-next? > > The zero-day test bot only reports test regressions. So they have some > list of tests that have failed in the past, and they only report *new* > test failures. This is not just true for fstests, but it's also true > for things like check warnings and compiler warnings --- and I suspect > it's for those sorts of reports that caused the zero-day bot to keep > state, and to filter out test failures and/or check warnings and/or > compiler warnings, so that only new test failures and/or new compiler > warnings are reported. If they didn't, they would be spamming kernel > developers, and given how.... "kind and understanding" kernel > developers are at getting spammed, especially when sometimes the > complaints are bogus ones (either test bugs or compiler bugs), my > guess is that they did the filtering out of sheer self-defense. It > certainly wasn't something requested by a file system developer as far > as I know. > > > So this is how I think an automated system for "drive-by testers" > should work. 
First, the tester would specify the baseline/origin tag, > and the testing system would run the tests on the baseline once. > Hopefully, the test runner already has exclude files so that kernel > bugs that cause an immediate kernel crash or deadlock would already > be in the exclude list. But as I've discovered this weekend, for file > systems that I haven't tried in a few years, like udf or > ubifs, etc., there may be tests missing from the exclude list that cause the test VM to > stop responding and/or crash. > > I have a planned improvement where if you are using gce-xfstests's > lightweight test manager, since the LTM is constantly reading the > serial console, a deadlock can be detected and the LTM can restart the > VM. The VM can then disambiguate between a forced reboot caused by the > LTM and a forced shutdown caused by the use of a preemptible VM (a > planned feature not yet fully implemented), and the test runner > can skip the tests already run, and skip the test which caused the > crash or deadlock, and this could be reported so that eventually, the > test could be added to the exclude file to benefit those people who > are using kvm-xfstests. (This is an example of a planned improvement > in xfstests-bld; if someone is interested in helping to implement > it, they should give me a ring.) > > Once the tests which are failing given a particular baseline are > known, this state would then get saved, and now the tests can be > run on the drive-by developer's changes. We can now compare the known > failures for the baseline with those for the changed kernel, and if there are > any new failures, there are two possibilities: (a) this was a new > failure caused by the drive-by developer's changes, (b) this was a > pre-existing known flake. > > To disambiguate between these two cases, we now run the failed test N > times (where N is probably something like 10-50 times; I normally use > 25 times) on the changed kernel, and get the failure rate. 
If the > failure rate is 100%, then this is almost certainly (a). If the > failure rate is < 100% (and greater than 0%), then we need to rerun > the failed test on the baseline kernel N times as well; if the failure > rate on the baseline is 0%, then we should do a bisection search to determine the > guilty commit. > > If the failure rate is 0%, then this is either an extremely rare > flake, in which case we might need to increase N --- or it's an > example of a test failure which is sensitive to the order in which tests > are run, in which case we may need to rerun all of the tests > in order up to the failed test. > > This is right now what I do when processing patches for upstream. > It's also rather similar to what we're doing for the XFS stable > backports, because it's much more efficient than running the baseline > tests 100 times (which can take a week of continuous testing per > Luis's comments) --- we only run tests dozens (or more) of times where a > potential flake has been found, as opposed to *all* tests. It's all > done manually, but it would be great if we could automate this to make > life easier for XFS stable backporters, and *also* for drive-by > developers. > This process sounds like it could get us to mostly unattended regression testing, so it sounds good. I do wonder whether there is more that fstests developers can do to assist, by annotating new (and existing) tests to aid in that effort. For example, there might be a case to tag a test as "this is a very reliable test that should have no failures at all - if there is a failure then something is surely wrong". I wonder if it would help to have a group like that and how many tests that group would include. > And again, if anyone is interested in helping with this, especially if > you're familiar with shell, python 3, and/or the Go language, please > contact me off-line. > Please keep me in the loop; if you have a prototype I may be able to help test it. Thanks, Amir. 
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-02 21:48 ` Bart Van Assche 2022-07-03 5:56 ` Amir Goldstein @ 2022-07-03 13:32 ` Theodore Ts'o 2022-07-03 14:54 ` Bart Van Assche 2022-07-07 21:06 ` Luis Chamberlain 1 sibling, 2 replies; 32+ messages in thread From: Theodore Ts'o @ 2022-07-03 13:32 UTC (permalink / raw) To: Bart Van Assche Cc: Luis Chamberlain, linux-fsdevel, linux-block, amir73il, pankydev8, josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen On Sat, Jul 02, 2022 at 02:48:12PM -0700, Bart Van Assche wrote: > > I strongly disagree with annotating tests with failure rates. My opinion is > that on a given test setup a test either should pass 100% of the time or > fail 100% of the time. My opinion is also that no child should ever go to bed hungry, and we should end world hunger. However, meanwhile, in the real world, while we can *strive* to eliminate all flaky tests, whether they are caused by buggy tests or buggy kernel code, there's an old saying that the only time code is bug-free is when it is no longer being used. That being said, I completely agree that annotating failure rates in xfstests-dev upstream probably doesn't make much sense. As we've stated before, it is highly dependent on the hardware configuration and kernel version (remember, sometimes flaky tests are caused by bugs in other kernel subsystems --- including the loop device, which has not historically been bug-free(tm) either, and so bugs come and go across the entire kernel surface). I believe the best way to handle this is to have better test results analysis tools. We can certainly consider having some shared test results database, but I'm not convinced that flat text files shared via git are sufficiently scalable. The final thing I'll note is that we've lived with low probability flakes for a very long time, and it hasn't been the end of the world. 
Sometime in 2011 or 2012, when I first started at Google and when we first started rolling out ext4 to all of our data centers, once or twice a month --- across the entire world-wide fleet --- there would be an unexplained file system corruption that had remarkably similar characteristics. It took us several months to run it down, and it turned out to be a lock getting released one C statement too soon. When I did some further archeological research, it turned out it had been upstream for well over a *decade* --- in ext3 and ext4 --- and had not been noticed in at least 3 or 4 enterprise distro GA testing/qualification cycles. Or rather, it might have been noticed, but since it couldn't be replicated, I'm guessing the QA testers shrugged, assumed that it *must* have been due to some cosmic ray, or some such, and moved on. > If a test is flaky I think that the root cause of the flakiness must > be determined and fixed. In the ideal world, sure. Then again, in the ideal world, we wouldn't have thousands of people getting killed over border disputes and because some maniacal world leader thinks that it's A-OK to overrun the borders of adjacent countries. However, until we have infinite resources available to us, the reality is that we need to live with the fact that life is imperfect, despite all of our efforts to reduce these sorts of flaky tests --- especially when we're talking about esoteric test configurations that most users won't be using. (Or when they are triggered by test code that is not used in production, but for which the error injection or shutdown simulation code is itself not perfect.) Cheers, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 13:32 ` Theodore Ts'o @ 2022-07-03 14:54 ` Bart Van Assche 2022-07-07 21:16 ` Luis Chamberlain 2022-07-07 21:06 ` Luis Chamberlain 1 sibling, 1 reply; 32+ messages in thread From: Bart Van Assche @ 2022-07-03 14:54 UTC (permalink / raw) To: Theodore Ts'o Cc: Luis Chamberlain, linux-fsdevel, linux-block, amir73il, pankydev8, josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen On 7/3/22 06:32, Theodore Ts'o wrote: > On Sat, Jul 02, 2022 at 02:48:12PM -0700, Bart Van Assche wrote: >> >> I strongly disagree with annotating tests with failure rates. My opinion is >> that on a given test setup a test either should pass 100% of the time or >> fail 100% of the time. > > My opinion is also that no child should ever go to bed hungry, and we > should end world hunger. In my view the above comment is unfair. The first year after I wrote the SRP tests in blktests I submitted multiple fixes for kernel bugs encountered by running these tests. Although it took a significant effort, after about one year the test itself and the kernel code it triggered finally resulted in reliable operation of the test. After that initial stabilization period these tests uncovered regressions in many kernel development cycles, even in the v5.19-rc cycle. Since I'm not very familiar with xfstests I do not know what makes the stress tests in this test suite fail. Would it be useful to modify the code that decides the test outcome to remove the flakiness, e.g. by only checking that the stress tests do not trigger any unwanted behavior, e.g. kernel warnings or filesystem inconsistencies? Thanks, Bart. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 14:54 ` Bart Van Assche @ 2022-07-07 21:16 ` Luis Chamberlain 0 siblings, 0 replies; 32+ messages in thread From: Luis Chamberlain @ 2022-07-07 21:16 UTC (permalink / raw) To: Bart Van Assche Cc: Theodore Ts'o, linux-fsdevel, linux-block, amir73il, pankydev8, josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen On Sun, Jul 03, 2022 at 07:54:11AM -0700, Bart Van Assche wrote: > On 7/3/22 06:32, Theodore Ts'o wrote: > > On Sat, Jul 02, 2022 at 02:48:12PM -0700, Bart Van Assche wrote: > > > > > > I strongly disagree with annotating tests with failure rates. My opinion is > > > that on a given test setup a test either should pass 100% of the time or > > > fail 100% of the time. > > > > My opinion is also that no child should ever go to bed hungry, and we > > should end world hunger. > > In my view the above comment is unfair. The first year after I wrote the > SRP tests in blktests I submitted multiple fixes for kernel bugs encountered > by running these tests. Although it took a significant effort, after about > one year the test itself and the kernel code it triggered finally resulted > in reliable operation of the test. After that initial stabilization period > these tests uncovered regressions in many kernel development cycles, even in > the v5.19-rc cycle. > > Since I'm not very familiar with xfstests I do not know what makes the > stress tests in this test suite fail. Would it be useful to modify the code > that decides the test outcome to remove the flakiness, e.g. by only checking > that the stress tests do not trigger any unwanted behavior, e.g. kernel > warnings or filesystem inconsistencies? Filesystems and the block layer are built on top of tons of things in the kernel, and those layers can introduce non-determinism. 
To rule out determinism we must first rule out undeterminism in other areas of the kernel, and that will take a long time. Things like kunit tests will help here, along with adding more tests to other smaller layers. The list is long. At LSFMM I mentioned how blktests block/009 had an odd failure rate of about 1/669 a while ago. The issue was real, and it took a while to figure out what the real issue was. Jan Kara's patches solved these issues and they are not trivial to backport to ancient enterprise kernels ;) Another more recent one was the undeterministic RCU cpu stall warnings with a failure rate of about 1/80 on zbd/006 and that lead to some interesting revelations about how qemu's use of discard was shitty and just needed to be enhanced. Yes, you can probably make zbd/006 more atomic and split it into 10 tests, but I don't think we can escape the lack of determinism in certain areas of the kernel. We can *work to improve* it, but again, that will take time, and I am not quite sure many folks really want that too. Luis ^ permalink raw reply [flat|nested] 32+ messages in thread
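[Editor's sketch: failure rates such as 1/669 or 1/80 come from running one test over and over until it fails, as proposed at the top of the thread. The bookkeeping for the `F:1/N` annotation might look like the sketch below; `run_test` is a stand-in for invoking a single fstests/blktests test, not a real API.]

```python
from typing import Callable

def failure_rate(run_test: Callable[[], bool], max_runs: int) -> str:
    """Run a test repeatedly and report an F:1/N style annotation,
    where N is the run count at which the first failure appeared."""
    for run in range(1, max_runs + 1):
        if not run_test():        # False means the test run failed
            return f"F:1/{run}"
    return f"no failure in {max_runs} runs"

# Example with a deterministic stand-in that fails on the 15th run.
outcomes = iter([True] * 14 + [False])
print(failure_rate(lambda: next(outcomes), 100))  # prints "F:1/15"
```

An averaged `FA:1/N` figure would repeat this whole loop several times (say, 10) and average the observed N values, per the proposal at the start of the thread.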
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
  2022-07-03 13:32               ` Theodore Ts'o
  2022-07-03 14:54                 ` Bart Van Assche
@ 2022-07-07 21:06                 ` Luis Chamberlain
  1 sibling, 0 replies; 32+ messages in thread
From: Luis Chamberlain @ 2022-07-07 21:06 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Bart Van Assche, linux-fsdevel, linux-block, amir73il, pankydev8,
	josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams,
	Jake Edge, Klaus Jensen

On Sun, Jul 03, 2022 at 09:32:03AM -0400, Theodore Ts'o wrote:
> On Sat, Jul 02, 2022 at 02:48:12PM -0700, Bart Van Assche wrote:
> >
> > I strongly disagree with annotating tests with failure rates. My opinion is
> > that on a given test setup a test either should pass 100% of the time or
> > fail 100% of the time.
>
> in the real world, while we can *strive* to
> eliminate all flaky tests, whether it is caused by buggy tests, or
> buggy kernel code, there's an old saying that the only time code is
> bug-free is when it is no longer being used.

Agreed, but I will shortly provide a bit more proof related to the block
layer in a reply to Bart. I thought I made the case clear enough at
LSFMM, but I suppose not.

> That being said, I completely agree that annotating failure rates in
> xfstests-dev upstream probably doesn't make much sense. As we've
> stated before, it is highly dependent on the hardware configuration,
> and kernel version (remember, sometimes flaky tests are caused by bugs
> in other kernel subsystems --- including the loop device, which has
> not historically been bug-free(tm) either, and so bugs come and go
> across the entire kernel surface).

That does not eliminate the possible value of having failure rates for
the minimum virtualized storage arrangement you can have with either
loopback devices or LVM volumes. Nor does it eliminate the possibility
of, say, coming up with generic system names. Just as 0-day has names
for kernel configs, we can easily come up with names for hw profiles.

> I believe the best way to handle this is to have better test results
> analysis tools.

We're going to evaluate an ELK stack for this, but there is a difference
between historical data for random runs vs what may be useful
generically.

> We can certainly consider having some shared test
> results database, but I'm not convinced that flat text files shared
> via git is sufficiently scalable.

How data is stored is secondary; the first order of business is whether
sharing any of this information may be useful to others. I have results
dating back to 4.17.3, for each kernel supported, and I have found it
very valuable. I figured it may be, but if there is no agreement on it,
we can just keep that on kdevops as-is and move forward with our own
nomenclature for hw profiles.

  Luis

^ permalink raw reply	[flat|nested] 32+ messages in thread
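[Editor's sketch: the flat-text expunge files discussed here would carry the proposed `F:1/15` / `FA:1/15` annotations, which are straightforward to parse mechanically. The grammar below follows the proposal at the top of the thread; the function and regex names are made up for illustration.]

```python
import re
from typing import Optional, Tuple

# Matches expunge lines such as:
#   generic/530 # F:1/15
#   generic/530 # FA:1/15
_EXPUNGE_RE = re.compile(
    r"^(?P<test>\S+)"                             # test name, e.g. generic/530
    r"(?:\s*#\s*(?P<kind>FA?):1/(?P<runs>\d+))?"  # optional rate annotation
)

def parse_expunge_line(line: str) -> Optional[Tuple[str, Optional[str], Optional[int]]]:
    """Return (test, kind, runs), where kind is 'F' (first-failure
    estimate) or 'FA' (averaged failure rate), or None for blank and
    comment-only lines.  Unannotated tests yield kind=None, runs=None."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = _EXPUNGE_RE.match(line)
    kind = m.group("kind")
    runs = int(m.group("runs")) if m.group("runs") else None
    return (m.group("test"), kind, runs)
```

A tool like this could aggregate annotations across kernel versions and hw-profile names, which is roughly the shared-results use case being debated here.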
end of thread, other threads:[~2022-07-07 21:36 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-19  3:07 [RFC: kdevops] Standardizing on failure rate nomenclature for expunges Luis Chamberlain
2022-05-19  6:36 ` Amir Goldstein
2022-05-19  7:58   ` Dave Chinner
2022-05-19  9:20     ` Amir Goldstein
2022-05-19 15:36       ` Josef Bacik
2022-05-19 16:18         ` Zorro Lang
2022-05-19 11:24     ` Zorro Lang
2022-05-19 14:18       ` Theodore Ts'o
2022-05-19 15:10         ` Zorro Lang
2022-05-19 14:58   ` Matthew Wilcox
2022-05-19 15:44     ` Zorro Lang
2022-05-19 16:06       ` Matthew Wilcox
2022-05-19 16:54         ` Zorro Lang
2022-07-01 23:36   ` Luis Chamberlain
2022-07-02 17:01     ` Theodore Ts'o
2022-07-07 21:36       ` Luis Chamberlain
2022-07-02 21:48 ` Bart Van Assche
2022-07-03  5:56   ` Amir Goldstein
2022-07-03 13:15     ` Theodore Ts'o
2022-07-03 14:22       ` Amir Goldstein
2022-07-03 16:30         ` Theodore Ts'o
2022-07-04  3:25           ` Dave Chinner
2022-07-04  7:58             ` Amir Goldstein
2022-07-05  2:29               ` Theodore Ts'o
2022-07-05  3:11                 ` Dave Chinner
2022-07-06 10:11                   ` Amir Goldstein
2022-07-06 14:29                     ` Theodore Ts'o
2022-07-06 16:35                       ` Amir Goldstein
2022-07-03 13:32   ` Theodore Ts'o
2022-07-03 14:54     ` Bart Van Assche
2022-07-07 21:16       ` Luis Chamberlain
2022-07-07 21:06     ` Luis Chamberlain