* [RFC: kdevops] Standardizing on failure rate nomenclature for expunges @ 2022-05-19 3:07 Luis Chamberlain 2022-05-19 6:36 ` Amir Goldstein 2022-07-02 21:48 ` Bart Van Assche 0 siblings, 2 replies; 32+ messages in thread From: Luis Chamberlain @ 2022-05-19 3:07 UTC (permalink / raw) To: linux-fsdevel, linux-block Cc: amir73il, pankydev8, tytso, josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen I've been promoting the idea that running fstests once is nice, but things get interesting if you try to run fstests multiple times until a failure is found. It turns out at least kdevops has found tests which fail with an average failure rate of typically 1/2 to 1/30. That is, 1/2 means a failure can happen 50% of the time, whereas 1/30 means it takes 30 runs to find the failure. I have tried my best to annotate failure rates when I know what they might be on the test expunge list, as an example: workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d The term "failure rate 1/15" is 16 characters long, so I'd like to propose to standardize a way to represent this. How about generic/530 # F:1/15 Then we could extend the definition. F being the current estimate, and this can be just how long it took to find the first failure. A more valuable figure would be the failure rate average, so running the test multiple times, say 10, to see what the failure rate is and then averaging the failures out. So this could be a more accurate representation. For this how about: generic/530 # FA:1/15 This would mean the failure rate has been found, on average, to be about 1/15, and this was determined based on 10 runs. We should also go extend check for fstests/blktests to run a test until a failure is found and report back the number of successes. Thoughts? Note: yes, failure rates lower than 1/100 do exist but they are rare creatures. 
I love them though, as my experience so far shows that they uncover hidden bones in the closet, and they may take months and a lot of eyeballs to resolve. Luis ^ permalink raw reply [flat|nested] 32+ messages in thread
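The "run a test until a failure is found and report back the number of successes" extension Luis proposes could be prototyped outside of fstests first. Below is a minimal sketch (not actual fstests/kdevops code; the `run_test` callback stands in for shelling out to `./check <test>`):

```python
def run_until_failure(run_test, max_runs=100):
    """Run a single test repeatedly until it fails.

    run_test: callable returning True on pass, False on fail
              (in practice this would invoke './check <test>').
    Returns the number of successful runs before the first failure,
    or None if no failure was observed within max_runs.
    """
    for successes in range(max_runs):
        if not run_test():
            return successes
    return None
```

Under this scheme, an annotation like "F:1/15" would correspond to `run_until_failure()` returning 14: fourteen passes, then a failure on the fifteenth run.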
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 3:07 [RFC: kdevops] Standardizing on failure rate nomenclature for expunges Luis Chamberlain @ 2022-05-19 6:36 ` Amir Goldstein 2022-05-19 7:58 ` Dave Chinner 2022-05-19 11:24 ` Zorro Lang 2022-07-02 21:48 ` Bart Van Assche 1 sibling, 2 replies; 32+ messages in thread From: Amir Goldstein @ 2022-05-19 6:36 UTC (permalink / raw) To: Luis Chamberlain Cc: linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, Zorro Lang, fstests [adding fstests and Zorro] On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > I've been promoting the idea that running fstests once is nice, > but things get interesting if you try to run fstests multiple > times until a failure is found. It turns out at least kdevops has > found tests which fail with a failure rate of typically 1/2 to > 1/30 average failure rate. That is 1/2 means a failure can happen > 50% of the time, whereas 1/30 means it takes 30 runs to find the > failure. > > I have tried my best to annotate failure rates when I know what > they might be on the test expunge list, as an example: > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > The term "failure rate 1/15" is 16 characters long, so I'd like > to propose to standardize a way to represent this. How about > > generic/530 # F:1/15 > I am not fond of the 1/15 annotation at all, because the only fact that you are able to document is that the test failed after 15 runs. Suggesting that this means failure rate of 1/15 is a very big step. > Then we could extend the definition. F being current estimate, and this > can be just how long it took to find the first failure. 
A more valuable > figure would be failure rate avarage, so running the test multiple > times, say 10, to see what the failure rate is and then averaging the > failure out. So this could be a more accurate representation. For this > how about: > > generic/530 # FA:1/15 > > This would mean on average there failure rate has been found to be about > 1/15, and this was determined based on 10 runs. > > We should also go extend check for fstests/blktests to run a test > until a failure is found and report back the number of successes. > > Thoughts? > I have had a discussion about those tests with Zorro. Those tests that some people refer to as "flaky" are valuable, but they are not deterministic, they are stochastic. I think MTBF is the standard way to describe reliability of such tests, but I am having a hard time imagining how the community can manage to document accurate annotations of this sort, so I would stick with documenting the facts (i.e. the test fails after N runs). OTOH, we do have deterministic tests, maybe even the majority of fstests are deterministic(?) Considering that every auto test loop takes ~2 hours on our rig and that I have been running over 100 loops over the past two weeks, if half of fstests are deterministic, that is a lot of wait time and a lot of carbon emission gone to waste. It would have been nice if I was able to exclude a "deterministic" group. The problem is - can a developer ever tag a test as being "deterministic"? Thanks, Amir. ^ permalink raw reply [flat|nested] 32+ messages in thread
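Amir's objection that "the test failed after 15 runs" does not establish a 1/15 failure rate can be made quantitative. A sketch using the Wilson score interval for a binomial proportion (standard library only; this is an illustration, not something fstests computes):

```python
import math

def wilson_interval(failures, runs, z=1.96):
    """Approximate 95% Wilson score confidence interval for the
    underlying failure probability, given `failures` out of `runs`.
    Shows how little one failure in 15 runs pins down the rate."""
    p = failures / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half, center + half
```

For 1 failure in 15 runs, `wilson_interval(1, 15)` gives roughly (0.012, 0.30): the true rate could plausibly be anywhere from about 1/80 to about 1/3, which is why annotating "F:1/15" as a rate rather than a raw observation is a big step.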
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 6:36 ` Amir Goldstein @ 2022-05-19 7:58 ` Dave Chinner 2022-05-19 9:20 ` Amir Goldstein 2022-05-19 11:24 ` Zorro Lang 1 sibling, 1 reply; 32+ messages in thread From: Dave Chinner @ 2022-05-19 7:58 UTC (permalink / raw) To: Amir Goldstein Cc: Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, Zorro Lang, fstests On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote: > [adding fstests and Zorro] > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > I've been promoting the idea that running fstests once is nice, > > but things get interesting if you try to run fstests multiple > > times until a failure is found. It turns out at least kdevops has > > found tests which fail with a failure rate of typically 1/2 to > > 1/30 average failure rate. That is 1/2 means a failure can happen > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > failure. > > > > I have tried my best to annotate failure rates when I know what > > they might be on the test expunge list, as an example: > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > to propose to standardize a way to represent this. How about > > > > generic/530 # F:1/15 > > > > I am not fond of the 1/15 annotation at all, because the only fact that you > are able to document is that the test failed after 15 runs. > Suggesting that this means failure rate of 1/15 is a very big step. > > > Then we could extend the definition. F being current estimate, and this > > can be just how long it took to find the first failure. 
A more valuable > > figure would be failure rate avarage, so running the test multiple > > times, say 10, to see what the failure rate is and then averaging the > > failure out. So this could be a more accurate representation. For this > > how about: > > > > generic/530 # FA:1/15 > > > > This would mean on average there failure rate has been found to be about > > 1/15, and this was determined based on 10 runs. These tests are run on multiple different filesystems. What happens if you run xfs, ext4, btrfs, overlay in sequence? We now have 4 test results, and 1 failure. Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1? What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas? Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1? In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs 1/1 breakdown is useful information, because it tells us which filesystem failed the test, or which specific config failed the test. Hence I think the ability for us to draw useful conclusions from a number like this is largely dependent on the specific data set it is drawn from... > > We should also go extend check for fstests/blktests to run a test > until a failure is found and report back the number of successes. > > > > Thoughts? Who is the expected consumer of this information? I'm not sure it will be meaningful for anyone developing new code and needing to run every test every time they run fstests. OTOH, for a QA environment where you have a fixed progression of the kernel releases you are testing, it's likely valuable and already being tracked in various distro QE management tools and dashboards.... > I have had a discussion about those tests with Zorro. > > Those tests that some people refer to as "flaky" are valuable, > but they are not deterministic, they are stochastic. Extremely valuable. Worth their weight in gold to developers like me. The recoveryloop group tests are a good example of this. 
The name of the group indicates how we use it. I typically set it up to run with a loop iteration like "-I 100" knowing that it will likely fail a random test in the group within 10 iterations. Those one-off failures are almost always a real bug, and they are often unique and difficult to reproduce exactly. Post-mortem needs to be performed immediately because it may well be a unique one-off failure and running another test after the failure destroys the state needed to perform a post-mortem. Hence having a test farm running these multiple times and then reporting "failed once in 15 runs" isn't really useful to me as a developer - it doesn't tell us anything new, nor does it help us find the bugs that are being tripped over. Less obvious stochastic tests exist, too. There are many tests that use fsstress as a workload that runs while some other operation is performed - freeze, grow, ENOSPC, error injections, etc. They will never be deterministic, and again any failure tends to be a real bug, too. However, I think these should be run by QE environments all the time as they require long term, frequent execution across different configs in different environments to find the deep dark corners where the bugs may lie dormant. These are the tests that find things like subtle timing races no other tests ever exercise. I suspect that tests that alter their behaviour via LOAD_FACTOR or TIME_FACTOR will fall into this category. > I think MTBF is the standard way to describe reliability > of such tests, but I am having a hard time imagining how > the community can manage to document accurate annotations > of this sort, so I would stick with documenting the facts > (i.e. the test fails after N runs). I'm unsure of what "reliability of such tests" means in this context. The tests are trying to exercise and measure the reliability of the kernel code - if the *test is unreliable* then that says to me the test needs fixing. 
If the test is reliable, then any failures that occur indicate that the filesystem/kernel/fs tools are unreliable, not the test.... "test reliability" and "reliability of filesystem under test" are different things with similar names. The latter is what I think we are talking about measuring and reporting here, right? > OTOH, we do have deterministic tests, maybe even the majority of > fstests are deterministic(?) Very likely. As a generalisation, I'd say that anything that has a fixed, single step at a time recipe and a very well defined golden output or exact output comparison match is likely deterministic. We use things like 'within tolerance' so that slight variations in test results don't cause spurious failures and hence make the test more deterministic. Hence any test that uses 'within_tolerance' is probably a test that is expecting deterministic behaviour.... > Considering that every auto test loop takes ~2 hours on our rig and that > I have been running over 100 loops over the past two weeks, if half > of fstests are deterministic, that is a lot of wait time and a lot of carbon > emission gone to waste. > > It would have been nice if I was able to exclude a "deterministic" group. > The problem is - can a developer ever tag a test as being "deterministic"? fstests allows private exclude lists to be used - perhaps these could be used to start building such a group for your test environment. Building a list from the tests you never see fail in your environment could be a good way to seed such a group... Maybe you have all the raw results from those hundreds of tests sitting around - what does crunching that data look like? Who else has large sets of consistent historic data sitting around? I don't because I pollute my results archive by frequently running varied and badly broken kernels through fstests, but people who just run released or stable kernels may have data sets that could be used.... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 32+ messages in thread
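Dave's point about pooled ratios can be illustrated with a small tally that preserves the per-filesystem (or per-config) breakdown instead of collapsing everything into a single fraction. A sketch (the config labels are just illustrative strings, not an fstests data format):

```python
from collections import defaultdict

def per_config_tally(results):
    """results: iterable of (config, passed) pairs.
    Returns {config: 'failures/runs'}, preserving *which* config
    failed rather than pooling everything into a ratio like 1/4."""
    counts = defaultdict(lambda: [0, 0])  # config -> [failures, runs]
    for config, passed in results:
        counts[config][1] += 1
        if not passed:
            counts[config][0] += 1
    return {cfg: f"{fails}/{runs}" for cfg, (fails, runs) in counts.items()}
```

Running xfs, ext4, btrfs, overlay in sequence with one xfs failure yields `{"xfs": "1/1", "ext4": "0/1", "btrfs": "0/1", "overlay": "0/1"}` - the 1/1 vs 0/1 breakdown Dave argues is the useful information, where a pooled "1/4" would not be.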
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 7:58 ` Dave Chinner @ 2022-05-19 9:20 ` Amir Goldstein 2022-05-19 15:36 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Amir Goldstein @ 2022-05-19 9:20 UTC (permalink / raw) To: Dave Chinner Cc: Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, Zorro Lang, fstests On Thu, May 19, 2022 at 10:58 AM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote: > > [adding fstests and Zorro] > > > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > > > I've been promoting the idea that running fstests once is nice, > > > but things get interesting if you try to run fstests multiple > > > times until a failure is found. It turns out at least kdevops has > > > found tests which fail with a failure rate of typically 1/2 to > > > 1/30 average failure rate. That is 1/2 means a failure can happen > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > > failure. > > > > > > I have tried my best to annotate failure rates when I know what > > > they might be on the test expunge list, as an example: > > > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > > to propose to standardize a way to represent this. How about > > > > > > generic/530 # F:1/15 > > > > > > > I am not fond of the 1/15 annotation at all, because the only fact that you > > are able to document is that the test failed after 15 runs. > > Suggesting that this means failure rate of 1/15 is a very big step. > > > > > Then we could extend the definition. 
F being current estimate, and this > > > can be just how long it took to find the first failure. A more valuable > > > figure would be failure rate avarage, so running the test multiple > > > times, say 10, to see what the failure rate is and then averaging the > > > failure out. So this could be a more accurate representation. For this > > > how about: > > > > > > generic/530 # FA:1/15 > > > > > > This would mean on average there failure rate has been found to be about > > > 1/15, and this was determined based on 10 runs. > > These tests are run on multiple different filesystems. What happens > if you run xfs, ext4, btrfs, overlay in sequence? We now have 4 > tests results, and 1 failure. > > Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1? > > What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas? > > Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1? > > In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs > 1/1 breakdown is useful information, because it tells us whihc > filesystem failed the test, or which specific config failed the > test. > > Hence I think the ability for us to draw useful conclusions from a > number like this is large dependent on the specific data set it is > drawn from... > > > > We should also go extend check for fstests/blktests to run a test > > > until a failure is found and report back the number of successes. > > > > > > Thoughts? > > Who is the expected consumer of this information? > > I'm not sure it will be meaningful for anyone developing new code > and needing to run every test every time they run fstests. > > OTOH, for a QA environment where you have a fixed progression of the > kernel releases you are testing, it's likely valuable and already > being tracked in various distro QE management tools and > dashboards.... > > > I have had a discussion about those tests with Zorro. 
> > > > Those tests that some people refer to as "flaky" are valuable, > > but they are not deterministic, they are stochastic. > > Extremely valuable. Worth their weight in gold to developers like > me. > > The recoveryloop group tests are a good example of this. The name of > the group indicates how we use it. I typically set it up to run with > an loop iteration like "-I 100" knowing that is will likely fail a > random test in the group within 10 iterations. > > Those one-off failures are almost always a real bug, and they are > often unique and difficult to reproduce exactly. Post-mortem needs > to be performed immediately because it may well be a unique on-off > failure and running another test after the failure destroys the > state needed to perform a post-mortem. > > Hence having a test farm running these multiple times and then > reporting "failed once in 15 runs" isn't really useful to me as a > developer - it doesn't tell us anything new, nor does it help us > find the bugs that are being tripped over. > > Less obvious stochastic tests exist, too. There are many tests that > use fstress as a workload that runs while some other operation is > performed - freeze, grow, ENOSPC, error injections, etc. They will > never be deterministic, any again any failure tends to be a real > bug, too. > > However, I think these should be run by QE environments all the time > as they require long term, frequent execution across different > configs in different environments to find the deep dark corners > where the bugs may lie dormant. These are the tests that find things > like subtle timing races no other tests ever exercise. > > I suspect that tests that alter their behaviour via LOAD_FACTOR or > TIME_FACTOR will fall into this category. 
> > > I think MTBF is the standard way to describe reliability > > of such tests, but I am having a hard time imagining how > > the community can manage to document accurate annotations > > of this sort, so I would stick with documenting the facts > > (i.e. the test fails after N runs). > > I'm unsure of what "reliablity of such tests" means in this context. > The tests are trying to exercise and measure the reliability of the > kernel code - if the *test is unreliable* then that says to me the > test needs fixing. If the test is reliable, then any failures that > occur indicate that the filesystem/kernel/fs tools are unreliable, > not the test.... > > "test reliability" and "reliability of filesystem under test" are > different things with similar names. The latter is what I think we > are talking about measuring and reporting here, right? > > > OTOH, we do have deterministic tests, maybe even the majority of > > fstests are deterministic(?) > > Very likely. As a generalisation, I'd say that anything that has a > fixed, single step at a time recipe and a very well defined golden > output or exact output comparison match is likely deterministic. > > We use things like 'within tolerance' so that slight variations in > test results don't cause spurious failures and hence make the test > more deterministic. Hence any test that uses 'within_tolerance' is > probably a test that is expecting deterministic behaviour.... > > > Considering that every auto test loop takes ~2 hours on our rig and that > > I have been running over 100 loops over the past two weeks, if half > > of fstests are deterministic, that is a lot of wait time and a lot of carbon > > emission gone to waste. > > > > It would have been nice if I was able to exclude a "deterministic" group. > > The problem is - can a developer ever tag a test as being "deterministic"? 
> > fstests allows private exclude lists to be used - perhaps these > could be used to start building such a group for your test > environment. Building a list from the tests you never see fail in > your environment could be a good way to seed such a group... > > Maybe you have all the raw results from those hundreds of tests > sitting around - what does crunching that data look like? Who else > has large sets of consistent historic data sitting around? I don't > because I pollute my results archive by frequently running varied > and badly broken kernels through fstests, but people who just run > released or stable kernels may have data sets that could be used.... > I have no historic data of that sort and I have never stayed on the same test system long enough to collect this sort of data. Josef told us in LPC 2021 about his btrfs fstests dashboard where he started to collect historical data a while ago. Collaborating on expunge lists of different fs and different kernel/config/distro is one of the goals behind Luis's kdevops project. For now, the expunge lists are curated in git: https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges Going forward, this cannot scale. If we want to collaborate and collect results from multiple testers and test labs we should consult with the KernelCI project, who are doing exactly that for other test suites. You did not attend Luis' talk in LSFMM this year (he had already mentioned kdevops back in LSFMM 2019), where some of these issues were discussed. The video from the LSFMM 2022 talk should be available in the coming weeks. I hear that Luis is also planning on giving a talk to a wider audience in LPC 2022. Thanks, Amir. > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 32+ messages in thread
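The kdevops expunge lists Amir links to are plain text, one test per line with an optional `#` comment, as in the generic/530 example earlier in the thread. A sketch of a parser that would let such annotations be collected and compared across testers (the exact kdevops format may differ; this assumes only the line shape shown in this thread):

```python
def parse_expunge(text):
    """Parse expunge-list text into [(test, note)] pairs.
    Assumes the line format seen in this thread, e.g.:
        generic/530 # failure rate about 1/15 <url>
    Blank lines and pure comment lines are skipped."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        test, _, note = line.partition("#")
        entries.append((test.strip(), note.strip()))
    return entries
```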
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 9:20 ` Amir Goldstein @ 2022-05-19 15:36 ` Josef Bacik 2022-05-19 16:18 ` Zorro Lang 0 siblings, 1 reply; 32+ messages in thread From: Josef Bacik @ 2022-05-19 15:36 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, Zorro Lang, fstests On Thu, May 19, 2022 at 12:20:28PM +0300, Amir Goldstein wrote: > On Thu, May 19, 2022 at 10:58 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote: > > > [adding fstests and Zorro] > > > > > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > > > > > I've been promoting the idea that running fstests once is nice, > > > > but things get interesting if you try to run fstests multiple > > > > times until a failure is found. It turns out at least kdevops has > > > > found tests which fail with a failure rate of typically 1/2 to > > > > 1/30 average failure rate. That is 1/2 means a failure can happen > > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > > > failure. > > > > > > > > I have tried my best to annotate failure rates when I know what > > > > they might be on the test expunge list, as an example: > > > > > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > > > to propose to standardize a way to represent this. How about > > > > > > > > generic/530 # F:1/15 > > > > > > > > > > I am not fond of the 1/15 annotation at all, because the only fact that you > > > are able to document is that the test failed after 15 runs. 
> > > Suggesting that this means failure rate of 1/15 is a very big step. > > > > > > > Then we could extend the definition. F being current estimate, and this > > > > can be just how long it took to find the first failure. A more valuable > > > > figure would be failure rate avarage, so running the test multiple > > > > times, say 10, to see what the failure rate is and then averaging the > > > > failure out. So this could be a more accurate representation. For this > > > > how about: > > > > > > > > generic/530 # FA:1/15 > > > > > > > > This would mean on average there failure rate has been found to be about > > > > 1/15, and this was determined based on 10 runs. > > > > These tests are run on multiple different filesystems. What happens > > if you run xfs, ext4, btrfs, overlay in sequence? We now have 4 > > tests results, and 1 failure. > > > > Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1? > > > > What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas? > > > > Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1? > > > > In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs > > 1/1 breakdown is useful information, because it tells us whihc > > filesystem failed the test, or which specific config failed the > > test. > > > > Hence I think the ability for us to draw useful conclusions from a > > number like this is large dependent on the specific data set it is > > drawn from... > > > > > > We should also go extend check for fstests/blktests to run a test > > > > until a failure is found and report back the number of successes. > > > > > > > > Thoughts? > > > > Who is the expected consumer of this information? > > > > I'm not sure it will be meaningful for anyone developing new code > > and needing to run every test every time they run fstests. 
> > > > OTOH, for a QA environment where you have a fixed progression of the > > kernel releases you are testing, it's likely valuable and already > > being tracked in various distro QE management tools and > > dashboards.... > > > > > I have had a discussion about those tests with Zorro. > > > > > > Those tests that some people refer to as "flaky" are valuable, > > > but they are not deterministic, they are stochastic. > > > > Extremely valuable. Worth their weight in gold to developers like > > me. > > > > The recoveryloop group tests are a good example of this. The name of > > the group indicates how we use it. I typically set it up to run with > > an loop iteration like "-I 100" knowing that is will likely fail a > > random test in the group within 10 iterations. > > > > Those one-off failures are almost always a real bug, and they are > > often unique and difficult to reproduce exactly. Post-mortem needs > > to be performed immediately because it may well be a unique on-off > > failure and running another test after the failure destroys the > > state needed to perform a post-mortem. > > > > Hence having a test farm running these multiple times and then > > reporting "failed once in 15 runs" isn't really useful to me as a > > developer - it doesn't tell us anything new, nor does it help us > > find the bugs that are being tripped over. > > > > Less obvious stochastic tests exist, too. There are many tests that > > use fstress as a workload that runs while some other operation is > > performed - freeze, grow, ENOSPC, error injections, etc. They will > > never be deterministic, any again any failure tends to be a real > > bug, too. > > > > However, I think these should be run by QE environments all the time > > as they require long term, frequent execution across different > > configs in different environments to find the deep dark corners > > where the bugs may lie dormant. 
These are the tests that find things > > like subtle timing races no other tests ever exercise. > > > > I suspect that tests that alter their behaviour via LOAD_FACTOR or > > TIME_FACTOR will fall into this category. > > > > > I think MTBF is the standard way to describe reliability > > > of such tests, but I am having a hard time imagining how > > > the community can manage to document accurate annotations > > > of this sort, so I would stick with documenting the facts > > > (i.e. the test fails after N runs). > > > > I'm unsure of what "reliablity of such tests" means in this context. > > The tests are trying to exercise and measure the reliability of the > > kernel code - if the *test is unreliable* then that says to me the > > test needs fixing. If the test is reliable, then any failures that > > occur indicate that the filesystem/kernel/fs tools are unreliable, > > not the test.... > > > > "test reliability" and "reliability of filesystem under test" are > > different things with similar names. The latter is what I think we > > are talking about measuring and reporting here, right? > > > > > OTOH, we do have deterministic tests, maybe even the majority of > > > fstests are deterministic(?) > > > > Very likely. As a generalisation, I'd say that anything that has a > > fixed, single step at a time recipe and a very well defined golden > > output or exact output comparison match is likely deterministic. > > > > We use things like 'within tolerance' so that slight variations in > > test results don't cause spurious failures and hence make the test > > more deterministic. Hence any test that uses 'within_tolerance' is > > probably a test that is expecting deterministic behaviour.... > > > > > Considering that every auto test loop takes ~2 hours on our rig and that > > > I have been running over 100 loops over the past two weeks, if half > > > of fstests are deterministic, that is a lot of wait time and a lot of carbon > > > emission gone to waste. 
> > > > > > It would have been nice if I was able to exclude a "deterministic" group. > > > The problem is - can a developer ever tag a test as being "deterministic"? > > > > fstests allows private exclude lists to be used - perhaps these > > could be used to start building such a group for your test > > environment. Building a list from the tests you never see fail in > > your environment could be a good way to seed such a group... > > > > Maybe you have all the raw results from those hundreds of tests > > sitting around - what does crunching that data look like? Who else > > has large sets of consistent historic data sitting around? I don't > > because I pollute my results archive by frequently running varied > > and badly broken kernels through fstests, but people who just run > > released or stable kernels may have data sets that could be used.... > > > > I have no historic data of that sort and I have never stayed on the > same test system long enough to collect this sort of data. > > Josef has told us in LPC 2021 about his btrfs fstests dashboard > where he started to collect historical data a while ago. > I'm clearly biased, but I think this is the best way to go for *developers*. We want to know all the things, so we just need to have a clear way to see what's failing and have a historical view of what has failed. If you look at our dashboard at toxicpanda.com you can click on the tests and see their runs and failures on different configs. This has been insanely valuable to me, and helped me narrow down test cases that needed to be adjusted for compression. > Collaborating on expunge lists of different fs and different > kernel/config/distro > is one of the goals behind Luis's kdevops project. > I think this is also hugely valuable from the "Willy usecase" perspective. Willy doesn't care about failure rates or interpreting the tea leaves of what our format is, he wants to make sure he didn't break anything. 
We should strive to have 0 failures for this use case, so having expunge lists in place to get rid of any flaky results is going to make it easier for non-experts to get a solid grasp on whether they introduced a regression or not. There's room for both use cases. I want the expunge lists for newbies, and I want good reporting for the developers who know what they're doing. We can provide documentation for both - If Willy, run 'make fstests-clean' - If Josef, run 'make fstests' Thanks, Josef ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 15:36 ` Josef Bacik @ 2022-05-19 16:18 ` Zorro Lang 0 siblings, 0 replies; 32+ messages in thread From: Zorro Lang @ 2022-05-19 16:18 UTC (permalink / raw) To: Josef Bacik Cc: Amir Goldstein, Dave Chinner, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 11:36:02AM -0400, Josef Bacik wrote: > On Thu, May 19, 2022 at 12:20:28PM +0300, Amir Goldstein wrote: > > On Thu, May 19, 2022 at 10:58 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote: > > > > [adding fstests and Zorro] > > > > > > > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > > > > > > > I've been promoting the idea that running fstests once is nice, > > > > > but things get interesting if you try to run fstests multiple > > > > > times until a failure is found. It turns out at least kdevops has > > > > > found tests which fail with a failure rate of typically 1/2 to > > > > > 1/30 average failure rate. That is 1/2 means a failure can happen > > > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > > > > failure. > > > > > > > > > > I have tried my best to annotate failure rates when I know what > > > > > they might be on the test expunge list, as an example: > > > > > > > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > > > > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > > > > to propose to standardize a way to represent this. 
How about > > > > > > > > > > generic/530 # F:1/15 > > > > > > > > > > > > > I am not fond of the 1/15 annotation at all, because the only fact that you > > > > are able to document is that the test failed after 15 runs. > > > > Suggesting that this means failure rate of 1/15 is a very big step. > > > > > > > > > Then we could extend the definition. F being current estimate, and this > > > > > can be just how long it took to find the first failure. A more valuable > > > > > figure would be failure rate average, so running the test multiple > > > > > times, say 10, to see what the failure rate is and then averaging the > > > > > failure out. So this could be a more accurate representation. For this > > > > > how about: > > > > > > > > > > generic/530 # FA:1/15 > > > > > > > > > > This would mean on average the failure rate has been found to be about > > > > > 1/15, and this was determined based on 10 runs. > > > > > > These tests are run on multiple different filesystems. What happens > > > if you run xfs, ext4, btrfs, overlay in sequence? We now have 4 > > > test results, and 1 failure. > > > > > > Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1? > > > > > > What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas? > > > > > > Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1? > > > > > > In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs > > > 1/1 breakdown is useful information, because it tells us which > > > filesystem failed the test, or which specific config failed the > > > test. > > > > > > Hence I think the ability for us to draw useful conclusions from a > > > number like this is largely dependent on the specific data set it is > > > drawn from... > > > > > We should also go extend check for fstests/blktests to run a test > > > > > until a failure is found and report back the number of successes. > > > > > > > > > > Thoughts? > > > Who is the expected consumer of this information? 
> > > > > > I'm not sure it will be meaningful for anyone developing new code > > > and needing to run every test every time they run fstests. > > > > > > OTOH, for a QA environment where you have a fixed progression of the > > > kernel releases you are testing, it's likely valuable and already > > > being tracked in various distro QE management tools and > > > dashboards.... > > > > > > > I have had a discussion about those tests with Zorro. > > > > > > > > Those tests that some people refer to as "flaky" are valuable, > > > > but they are not deterministic, they are stochastic. > > > > > > Extremely valuable. Worth their weight in gold to developers like > > > me. > > > > > > The recoveryloop group tests are a good example of this. The name of > > > the group indicates how we use it. I typically set it up to run with > > > a loop iteration like "-I 100" knowing that it will likely fail a > > > random test in the group within 10 iterations. > > > > > > Those one-off failures are almost always a real bug, and they are > > > often unique and difficult to reproduce exactly. Post-mortem needs > > > to be performed immediately because it may well be a unique one-off > > > failure and running another test after the failure destroys the > > > state needed to perform a post-mortem. > > > > > > Hence having a test farm running these multiple times and then > > > reporting "failed once in 15 runs" isn't really useful to me as a > > > developer - it doesn't tell us anything new, nor does it help us > > > find the bugs that are being tripped over. > > > > > > Less obvious stochastic tests exist, too. There are many tests that > > > use fsstress as a workload that runs while some other operation is > > > performed - freeze, grow, ENOSPC, error injections, etc. They will > > > never be deterministic, and again any failure tends to be a real > > > bug, too. 
> > > > > > However, I think these should be run by QE environments all the time > > > as they require long term, frequent execution across different > > > configs in different environments to find the deep dark corners > > > where the bugs may lie dormant. These are the tests that find things > > > like subtle timing races no other tests ever exercise. > > > > > > I suspect that tests that alter their behaviour via LOAD_FACTOR or > > > TIME_FACTOR will fall into this category. > > > > > > > I think MTBF is the standard way to describe reliability > > > > of such tests, but I am having a hard time imagining how > > > > the community can manage to document accurate annotations > > > > of this sort, so I would stick with documenting the facts > > > > (i.e. the test fails after N runs). > > > > > > I'm unsure of what "reliablity of such tests" means in this context. > > > The tests are trying to exercise and measure the reliability of the > > > kernel code - if the *test is unreliable* then that says to me the > > > test needs fixing. If the test is reliable, then any failures that > > > occur indicate that the filesystem/kernel/fs tools are unreliable, > > > not the test.... > > > > > > "test reliability" and "reliability of filesystem under test" are > > > different things with similar names. The latter is what I think we > > > are talking about measuring and reporting here, right? > > > > > > > OTOH, we do have deterministic tests, maybe even the majority of > > > > fstests are deterministic(?) > > > > > > Very likely. As a generalisation, I'd say that anything that has a > > > fixed, single step at a time recipe and a very well defined golden > > > output or exact output comparison match is likely deterministic. > > > > > > We use things like 'within tolerance' so that slight variations in > > > test results don't cause spurious failures and hence make the test > > > more deterministic. 
Hence any test that uses 'within_tolerance' is > > > probably a test that is expecting deterministic behaviour.... > > > > > > > Considering that every auto test loop takes ~2 hours on our rig and that > > > > I have been running over 100 loops over the past two weeks, if half > > > > of fstests are deterministic, that is a lot of wait time and a lot of carbon > > > > emission gone to waste. > > > > > > > > It would have been nice if I was able to exclude a "deterministic" group. > > > > The problem is - can a developer ever tag a test as being "deterministic"? > > > > > > fstests allows private exclude lists to be used - perhaps these > > > could be used to start building such a group for your test > > > environment. Building a list from the tests you never see fail in > > > your environment could be a good way to seed such a group... > > > > > > Maybe you have all the raw results from those hundreds of tests > > > sitting around - what does crunching that data look like? Who else > > > has large sets of consistent historic data sitting around? I don't > > > because I pollute my results archive by frequently running varied > > > and badly broken kernels through fstests, but people who just run > > > released or stable kernels may have data sets that could be used.... > > > > > > > I have no historic data of that sort and I have never stayed on the > > same test system long enough to collect this sort of data. > > > > Josef has told us in LPC 2021 about his btrfs fstests dashboard > > where he started to collect historical data a while ago. > > > > I'm clearly biased, but I think this is the best way to go for *developers*. We > want to know all the things, so we just need to have a clear way to see what's > failing and have a historical view of what has failed. 
If you look at our I agree the "historical view" is needed, but it can't be provided by mainline fstests: fstests is used to test many different filesystems on many different combinations of system software and hardware, and there are lots of downstream projects, each with its own variations, so an "upstream mainline Linux historical view" isn't a useful reference for all of them. Some "downstream historical views" aren't useful references for others either. A "historical view" is valuable to the project (or project group) that produced it, but may not be universal. If someone would like to help test a particular project - say an Ubuntu LTS release, Debian, CentOS, or an LTS kernel - it would be better to ask whether the people involved have their own "historical view" data to help get started, rather than asking whether fstests carries that data for every known and unknown project. As I just replied to Ted, I think his idea makes more sense: fstests can provide some meaningful interfaces to help testers use their historical data, or help to summarize their historical data for each specific user/project. But fstests shouldn't store or provide such one-sided data directly. Thanks, Zorro > dashboard at toxicpanda.com you can click on the tests and see their runs and > failures on different configs. This has been insanely valuable to me, and > helped me narrow down test cases that needed to be adjusted for compression. > > > Collaborating on expunge lists of different fs and different > > kernel/config/distro > > is one of the goals behind Luis's kdevops project. > > > > I think this is also hugely valuable from the "Willy usecase" perspective. > Willy doesn't care about failure rates or interpreting the tea leaves of what > our format is, he wants to make sure he didn't break anything. 
We should strive > to have 0 failures for this use case, so having expunge lists in place to get > rid of any flaky results is going to make it easier for non-experts to get a > solid grasp on whether they introduced a regression or not. > > There's room for both use cases. I want the expunge lists for newbies, and I want > good reporting for the developers who know what they're doing. We can provide > documentation for both > > - If Willy, run 'make fstests-clean' > - If Josef, run 'make fstests' > > Thanks, > > Josef > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 6:36 ` Amir Goldstein 2022-05-19 7:58 ` Dave Chinner @ 2022-05-19 11:24 ` Zorro Lang 2022-05-19 14:18 ` Theodore Ts'o 2022-05-19 14:58 ` Matthew Wilcox 1 sibling, 2 replies; 32+ messages in thread From: Zorro Lang @ 2022-05-19 11:24 UTC (permalink / raw) To: Amir Goldstein Cc: Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote: > [adding fstests and Zorro] > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > I've been promoting the idea that running fstests once is nice, > > but things get interesting if you try to run fstests multiple > > times until a failure is found. It turns out at least kdevops has > > found tests which fail with a failure rate of typically 1/2 to > > 1/30 average failure rate. That is 1/2 means a failure can happen > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > failure. > > > > I have tried my best to annotate failure rates when I know what > > they might be on the test expunge list, as an example: > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > to propose to standardize a way to represent this. How about > > > > generic/530 # F:1/15 > > > > I am not fond of the 1/15 annotation at all, because the only fact that you > are able to document is that the test failed after 15 runs. > Suggesting that this means failure rate of 1/15 is a very big step. > > > Then we could extend the definition. F being current estimate, and this > > can be just how long it took to find the first failure. 
A more valuable > > figure would be failure rate average, so running the test multiple > > times, say 10, to see what the failure rate is and then averaging the > > failure out. So this could be a more accurate representation. For this > > how about: > > > > generic/530 # FA:1/15 > > > > This would mean on average the failure rate has been found to be about > > 1/15, and this was determined based on 10 runs. > > > > We should also go extend check for fstests/blktests to run a test > > until a failure is found and report back the number of successes. > > > > Thoughts? > > > > I have had a discussion about those tests with Zorro. Hi Amir, Thanks for publicizing this discussion. Yes, we talked about this, but if I don't remember wrong, I recommended that each downstream tester maintain their own "testing data/config", like exclude lists, failure ratios, known failures, etc. I think they're not suitable to be fixed in the mainline fstests. About the other idea I mentioned at LSF, we can create some more group names to mark those cases with random load/data/env etc.; they're worth running more times. I also talked about that with Darrick; we haven't made a decision, but I'd like to push for it if most other folks would like to see it. In my internal regression test for RHEL, I give some fstests cases a new group name "redhat_random" (sure, I know it's not a good name, it's just for my internal test, better names welcome, I'm not a good English speaker :). Then, combined with the quick and stress group names, I loop-run the "redhat_random" cases different numbers of times, with different LOAD/TIME_FACTOR. 
So I hope to have one "or more specific" group name to mark those random test cases at first, likes [1] (I'm sure it's incomplete, but can be improved if we can get more help from more people :) Thanks, Zorro [1] generic/013 generic/019 generic/051 generic/068 generic/070 generic/075 generic/076 generic/083 generic/091 generic/112 generic/117 generic/127 generic/231 generic/232 generic/233 generic/263 generic/269 generic/270 generic/388 generic/390 generic/413 generic/455 generic/457 generic/461 generic/464 generic/475 generic/476 generic/482 generic/521 generic/522 generic/547 generic/551 generic/560 generic/561 generic/616 generic/617 generic/648 generic/650 xfs/011 xfs/013 xfs/017 xfs/032 xfs/051 xfs/057 xfs/068 xfs/079 xfs/104 xfs/137 xfs/141 xfs/167 xfs/297 xfs/305 xfs/442 xfs/517 > > Those tests that some people refer to as "flaky" are valuable, > but they are not deterministic, they are stochastic. > > I think MTBF is the standard way to describe reliability > of such tests, but I am having a hard time imagining how > the community can manage to document accurate annotations > of this sort, so I would stick with documenting the facts > (i.e. the test fails after N runs). > > OTOH, we do have deterministic tests, maybe even the majority of > fstests are deterministic(?) > > Considering that every auto test loop takes ~2 hours on our rig and that > I have been running over 100 loops over the past two weeks, if half > of fstests are deterministic, that is a lot of wait time and a lot of carbon > emission gone to waste. > > It would have been nice if I was able to exclude a "deterministic" group. > The problem is - can a developer ever tag a test as being "deterministic"? > > Thanks, > Amir. > ^ permalink raw reply [flat|nested] 32+ messages in thread
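The annotations Luis proposes ("F:1/15" for a first-failure observation, "FA:1/15" for an averaged rate) are compact enough to parse mechanically. As a rough sketch of what a runner like kdevops could do with an annotated expunge line (the regex and helper below are purely illustrative assumptions, not part of fstests or kdevops):

```python
import re

# Matches the proposed expunge annotations, e.g.
#   generic/530 # F:1/15    (first failure seen on run 15)
#   generic/530 # FA:1/15   (averaged failure rate of about 1/15)
ANNOT_RE = re.compile(r"^(?P<test>\S+)\s*#\s*F(?P<avg>A?):1/(?P<runs>\d+)")

def parse_expunge_line(line):
    """Return the test name, whether the rate was averaged over runs,
    and the estimated failure rate; None for unannotated lines."""
    m = ANNOT_RE.match(line.strip())
    if not m:
        return None
    return {
        "test": m.group("test"),
        "averaged": m.group("avg") == "A",
        "failure_rate": 1.0 / int(m.group("runs")),
    }
```

A tool consuming these lines could then, for example, sort expunged tests by estimated failure rate to prioritize debugging effort.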
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 11:24 ` Zorro Lang @ 2022-05-19 14:18 ` Theodore Ts'o 2022-05-19 15:10 ` Zorro Lang 2022-05-19 14:58 ` Matthew Wilcox 1 sibling, 1 reply; 32+ messages in thread From: Theodore Ts'o @ 2022-05-19 14:18 UTC (permalink / raw) To: Zorro Lang Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > > Yes, we talked about this, but if I don't remember wrong, I recommended that each > downstream tester maintain their own "testing data/config", like exclude > lists, failure ratios, known failures etc. I think they're not suitable to be > fixed in the mainline fstests. Failure ratios are the sort of thing that are only applicable for * A specific filesystem * A specific configuration * A specific storage device / storage device class * A specific CPU architecture / CPU speed * A specific amount of memory available Put another way, there are problems that fail so rarely as to be effectively "never" on, say, an x86_64 class server with gobs and gobs of memory, but which can more reliably fail on, say, a Raspberry Pi using eMMC flash. I don't think that Luis was suggesting that this kind of failure annotation would go in upstream fstests. I suspect he just wants to use it in kdevops, and hopes that other people would use it as well in other contexts. But even in the context of test runners like kdevops and {kvm,gce,android}-xfstests, it's going to be very specific to a particular test environment, and for the global list of excludes for a particular file system. 
So in the gce-xfstests context, this is the difference between the excludes in the files: fs/ext4/excludes vs fs/ext4/cfg/bigalloc.exclude even if I only cared about, say, how things ran on GCE using SSD-backed Persistent Disk (never mind that I can only run gce-xfstests on Local SSD, and PD Extreme, etc.), failure percentages would never make sense for fs/ext4/excludes, since that covers multiple file system configs. And my infrastructure supports kvm, gce, and Android, as well as some people (such as at $WORK for our data center kernels) who run the test appliance directly on bare metal, so I wouldn't use the failure percentages in these files, etc. Now, what I *do* is to track this sort of thing in my own notes, e.g.: generic/051 ext4/adv Failure percentage: 16% (4/25) "Basic log recovery stress test - do lots of stuff, shut down in the middle of it and check that recovery runs to completion and everything can be successfully removed afterwards." generic/410 nojournal Couldn't reproduce after running 25 times "Test mount shared subtrees, verify the state transitions..." generic/68[12] encrypt Failure percentage: 100% The directory does grow, but blocks aren't charged to either root or the non-privileged users' quota. So this appears to be a real bug. There is one thing that I'd like to add to upstream fstests, and that is some kind of option so that "check --retry-failures NN" would cause fstests, upon finding a test failure, to automatically rerun that failing test NN additional times. Another potential related feature which we currently have in our daily spinner infrastructure at $WORK would be, on a test failure, to rerun the test up to M times (typically a small number, such as 3), and if it passes on a retry attempt, declare the test result as "flaky", and stop running the retries. If the test repeatedly fails after M attempts, then the test result is "fail". 
These results would be reported in the junit XML file, and would allow the test runners to annotate their test summaries appropriately. I'm thinking about trying to implement something like this in my copious spare time; but before I do, does the general idea seem acceptable? Thanks, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
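The retry policy Ted describes is simple enough to pin down precisely. Here is a minimal sketch of the classification logic (in Python just to make the bookkeeping explicit; an actual implementation would live in fstests' check script, and `run_test` is a stand-in for invoking one test):

```python
def classify(run_test, max_retries=3):
    """Classify one test per the policy described above: 'pass' if it
    passes outright, 'flaky' if it fails but then passes within
    max_retries reruns, 'fail' if every rerun fails too.
    run_test() returns True when the test passes."""
    if run_test():
        return "pass"
    for _ in range(max_retries):
        if run_test():
            return "flaky"   # passed on a retry: stop retrying
    return "fail"            # failed max_retries + 1 times in a row
```

With max_retries=3 this matches the "typically a small number, such as 3" above; the junit XML reporting side is omitted.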
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 14:18 ` Theodore Ts'o @ 2022-05-19 15:10 ` Zorro Lang 0 siblings, 0 replies; 32+ messages in thread From: Zorro Lang @ 2022-05-19 15:10 UTC (permalink / raw) To: Theodore Ts'o Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 10:18:48AM -0400, Theodore Ts'o wrote: > On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > > > > Yes, we talked about this, but if I don't rememeber wrong, I recommended each > > downstream testers maintain their own "testing data/config", likes exclude > > list, failed ratio, known failures etc. I think they're not suitable to be > > fixed in the mainline fstests. > > Failure ratios are the sort of thing that are only applicable for > > * A specific filesystem > * A specific configuration > * A specific storage device / storage device class > * A specific CPU architecture / CPU speed > * A specific amount of memory available And a specific bug I suppose :) > > Put another way, there are problems that fail so close to rarely as to > be "hever" on, say, an x86_64 class server with gobs and gobs of > memory, but which can more reliably fail on, say, a Rasberry PI using > eMMC flash. > > I don't think that Luis was suggesting that this kind of failure > annotation would go in upstream fstests. I suspect he just wants to > use it in kdevops, and hope that other people would use it as well in > other contexts. But even in the context of test runners like kdevops > and {kvm,gce,android}-xfstests, it's going to be very specific to a > particular test environment, and for the global list of excludes for a > particular file system. 
So in the gce-xfstests context, this is the > difference between the excludes in the files: > > fs/ext4/excludes > vs > fs/ext4/cfg/bigalloc.exclude > > even if I only cared about, say, how things ran on GCE using > SSD-backed Persistent Disk (never mind that I can only run > gce-xfstests on Local SSD, and PD Extreme, etc.), failure percentages > would never make sense for fs/ext4/excludes, since that covers > multiple file system configs. And my infrastructure supports kvm, > gce, and Android, as well as some people (such as at $WORK for our > data center kernels) who run the test appliacce directly on bare > metal, so I wouldn't use the failure percentages in these files, etc. > > Now, what I *do* is to track this sort of thing in my own notes, e.g: > > generic/051 ext4/adv Failure percentage: 16% (4/25) > "Basic log recovery stress test - do lots of stuff, shut down in > the middle of it and check that recovery runs to completion and > everything can be successfully removed afterwards." > > generic/410 nojournal Couldn't reproduce after running 25 times > "Test mount shared subtrees, verify the state transitions..." > > generic/68[12] encrypt Failure percentage: 100% > The directory does grow, but blocks aren't charged to either root or > the non-privileged users' quota. So this appears to be a real bug. > > > There is one thing that I'd like to add to upstream fstests, and that > is some kind of option so that "check --retry-failures NN" would cause > fstests to automatically, upon finding a test failure, will rerun that > failing test NN aditional times. That makes more sense for me :) I'd like to help the testers to retry the (randomly) failed cases, to help them to get their testing statistics. That's better than recording these statistics in fstests itself. 
> Another potential related feature > which we currently have in our daily spinner infrastructure at $WORK > would be, on a test failure, to rerun the test up to M times (typically a > small number, such as 3), and if it passes on a retry attempt, declare > the test result as "flaky", and stop running the retries. If the test > repeatedly fails after M attempts, then the test result is "fail". > > These results would be reported in the junit XML file, and would allow > the test runners to annotate their test summaries appropriately. > > I'm thinking about trying to implement something like this in my > copious spare time; but before I do, does the general idea seem > acceptable? After a "./check ..." run is done, fstests generally shows 3 lists: Ran: ... Not run: ... Failures: ... So you mean that if "--retry-failures N" is specified, we can have one more list named "Flaky", which is a subset of the "Failures" list, like: Ran: ... Not run: ... Failures: generic/388 generic/475 xfs/104 xfs/442 Flaky: generic/388 [2/N] xfs/104 [1/N] If I understand this correctly, it's acceptable to me. And it might be helpful for Amir's situation. But let's hear more voices from other developers; if there is no big objection from other fs maintainers, let's do it :) BTW, what do you think about the new group name to mark cases with random load/operations/env.? Any suggestions or good names for that? Thanks, Zorro > > Thanks, > > - Ted > ^ permalink raw reply [flat|nested] 32+ messages in thread
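Zorro's proposed summary, with "Flaky" as a subset of "Failures" annotated [k/N], could be rendered along these lines (a hypothetical formatting helper; the real report lives in fstests' check script, and the data shapes here are my assumptions):

```python
def summarize(ran, not_run, failures, flaky, n):
    """Render the end-of-run summary sketched above.  `flaky` maps a
    test name to the retry attempt k (out of n) on which it finally
    passed; flaky tests remain listed under Failures as well."""
    lines = [
        "Ran: " + " ".join(ran),
        "Not run: " + " ".join(not_run),
        "Failures: " + " ".join(sorted(failures)),
        "Flaky: " + " ".join(f"{t} [{k}/{n}]" for t, k in sorted(flaky.items())),
    ]
    return "\n".join(lines)
```

Keeping flaky tests inside the Failures list (rather than moving them out) preserves backward compatibility for any tooling that already parses the Failures line.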
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 11:24 ` Zorro Lang 2022-05-19 14:18 ` Theodore Ts'o @ 2022-05-19 14:58 ` Matthew Wilcox 2022-05-19 15:44 ` Zorro Lang 1 sibling, 1 reply; 32+ messages in thread From: Matthew Wilcox @ 2022-05-19 14:58 UTC (permalink / raw) To: Zorro Lang Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > Yes, we talked about this, but if I don't rememeber wrong, I recommended each > downstream testers maintain their own "testing data/config", likes exclude > list, failed ratio, known failures etc. I think they're not suitable to be > fixed in the mainline fstests. This assumes a certain level of expertise, which is a barrier to entry. For someone who wants to check "Did my patch to filesystem Y that I have never touched before break anything?", having non-deterministic tests run by default is bad. As an example, run xfstests against jfs. Hundreds of failures, including some very scary-looking assertion failures from the page allocator. They're (mostly) harmless in fact, just being a memory leak, but it makes xfstests useless for this scenario. Even for well-maintained filesystems like xfs which is regularly tested, I expect generic/270 and a few others to fail. They just do, and they're not an indication that *I* broke anything. By all means, we want to keep tests around which have failures, but they need to be restricted to people who have a level of expertise and interest in fixing long-standing problems, not people who are looking for regressions. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 14:58 ` Matthew Wilcox @ 2022-05-19 15:44 ` Zorro Lang 2022-05-19 16:06 ` Matthew Wilcox 0 siblings, 1 reply; 32+ messages in thread From: Zorro Lang @ 2022-05-19 15:44 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 03:58:31PM +0100, Matthew Wilcox wrote: > On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > > Yes, we talked about this, but if I don't rememeber wrong, I recommended each > > downstream testers maintain their own "testing data/config", likes exclude > > list, failed ratio, known failures etc. I think they're not suitable to be > > fixed in the mainline fstests. > > This assumes a certain level of expertise, which is a barrier to entry. > > For someone who wants to check "Did my patch to filesystem Y that I have > never touched before break anything?", having non-deterministic tests > run by default is bad. > > As an example, run xfstests against jfs. Hundreds of failures, including > some very scary-looking assertion failures from the page allocator. > They're (mostly) harmless in fact, just being a memory leak, but it > makes xfstests useless for this scenario. > > Even for well-maintained filesystems like xfs which is regularly tested, > I expect generic/270 and a few others to fail. They just do, and they're > not an indication that *I* broke anything. > > By all means, we want to keep tests around which have failures, but > they need to be restricted to people who have a level of expertise and > interest in fixing long-standing problems, not people who are looking > for regressions. It's hard to make sure if a failure is a regression, if someone only run the test once. The testers need some experience, at least need some history test data. 
If a tester finds a case that has a 10% chance of failing on his system, then to determine whether it's a regression, if he doesn't have historical test data, he at least needs to run the same test more times on an old kernel version on his system. If it never fails on the old kernel version but can fail on the new kernel, then we suspect it's a regression. Even if the tester isn't an expert in the fs he's testing, he can report the issue to that fs's experts for more checking. For a downstream kernel, he has to report to the downstream maintainers, or check by himself. If a case passes on upstream but fails on downstream, it might mean there's an upstream patchset that can be backported. So, either way, testers need their own "experience" (including historical test data, known issues, etc.) to judge whether a failure is a suspected regression or a known downstream issue that hasn't been fixed (by backport) yet. That's my personal perspective :) Thanks, Zorro > ^ permalink raw reply [flat|nested] 32+ messages in thread
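Zorro's rule of thumb above can be stated as a tiny predicate (illustrative only; the requirement that the old kernel was exercised at least as many times as the new one is my assumption, not something fstests defines):

```python
def suspect_regression(old_failures, old_runs, new_failures, new_runs):
    """Apply the heuristic above: suspect a regression only when the
    new kernel shows failures while the old kernel, run at least as
    many times on the same test and system, never failed."""
    return new_failures > 0 and old_failures == 0 and old_runs >= new_runs
```

Note that for a test with a 10% failure rate, even 20 clean runs on the old kernel leave a real chance the failure simply wasn't triggered there, which is exactly why Zorro stresses the need for historical data.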
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 15:44 ` Zorro Lang @ 2022-05-19 16:06 ` Matthew Wilcox 2022-05-19 16:54 ` Zorro Lang ` (2 more replies) 0 siblings, 3 replies; 32+ messages in thread From: Matthew Wilcox @ 2022-05-19 16:06 UTC (permalink / raw) To: Zorro Lang Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 11:44:19PM +0800, Zorro Lang wrote: > On Thu, May 19, 2022 at 03:58:31PM +0100, Matthew Wilcox wrote: > > On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > > > Yes, we talked about this, but if I don't rememeber wrong, I recommended each > > > downstream testers maintain their own "testing data/config", likes exclude > > > list, failed ratio, known failures etc. I think they're not suitable to be > > > fixed in the mainline fstests. > > > > This assumes a certain level of expertise, which is a barrier to entry. > > > > For someone who wants to check "Did my patch to filesystem Y that I have > > never touched before break anything?", having non-deterministic tests > > run by default is bad. > > > > As an example, run xfstests against jfs. Hundreds of failures, including > > some very scary-looking assertion failures from the page allocator. > > They're (mostly) harmless in fact, just being a memory leak, but it > > makes xfstests useless for this scenario. > > > > Even for well-maintained filesystems like xfs which is regularly tested, > > I expect generic/270 and a few others to fail. They just do, and they're > > not an indication that *I* broke anything. > > > > By all means, we want to keep tests around which have failures, but > > they need to be restricted to people who have a level of expertise and > > interest in fixing long-standing problems, not people who are looking > > for regressions. 
> > It's hard to make sure if a failure is a regression, if someone only run > the test once. The testers need some experience, at least need some > history test data. > > If a tester find a case has 10% chance fail on his system, to make sure > it's a regression or not, if he doesn't have history test data, at least > he need to do the same test more times on old kernel version with his > system. If it never fail on old kernel version, but can fail on new kernel. > Then we suspect it's a regression. > > Even if the tester isn't an expert of the fs he's testing, he can report > this issue to that fs experts, to get more checking. For downstream kernel, > he has to report to the maintainers of downstream, or check by himself. > If a case pass on upstream, but fail on downstream, it might mean there's > a patchset on upstream can be backported. > > So, anyway, the testers need their own "experience" (include testing history > data, known issue, etc) to judge if a failure is a suspected regression, or > a known issue of downstream which hasn't been fixed (by backport). > > That's my personal perspective :) Right, but that's the personal perspective of an expert tester. I don't particularly want to build that expertise myself; I want to write patches which touch dozens of filesystems, and I want to be able to smoke-test those patches. Maybe xfstests or kdevops doesn't want to solve that problem, but that would seem like a waste of other peoples time. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 16:06 ` Matthew Wilcox @ 2022-05-19 16:54 ` Zorro Lang 2022-07-01 23:36 ` Luis Chamberlain 2022-07-02 17:01 ` Theodore Ts'o 2 siblings, 0 replies; 32+ messages in thread From: Zorro Lang @ 2022-05-19 16:54 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 05:06:07PM +0100, Matthew Wilcox wrote: > On Thu, May 19, 2022 at 11:44:19PM +0800, Zorro Lang wrote: > > On Thu, May 19, 2022 at 03:58:31PM +0100, Matthew Wilcox wrote: > > > On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote: > > > > Yes, we talked about this, but if I don't rememeber wrong, I recommended each > > > > downstream testers maintain their own "testing data/config", likes exclude > > > > list, failed ratio, known failures etc. I think they're not suitable to be > > > > fixed in the mainline fstests. > > > > > > This assumes a certain level of expertise, which is a barrier to entry. > > > > > > For someone who wants to check "Did my patch to filesystem Y that I have > > > never touched before break anything?", having non-deterministic tests > > > run by default is bad. > > > > > > As an example, run xfstests against jfs. Hundreds of failures, including > > > some very scary-looking assertion failures from the page allocator. > > > They're (mostly) harmless in fact, just being a memory leak, but it > > > makes xfstests useless for this scenario. > > > > > > Even for well-maintained filesystems like xfs which is regularly tested, > > > I expect generic/270 and a few others to fail. They just do, and they're > > > not an indication that *I* broke anything. 
> > > > > > By all means, we want to keep tests around which have failures, but > > > they need to be restricted to people who have a level of expertise and > > > interest in fixing long-standing problems, not people who are looking > > > for regressions. > > > > It's hard to make sure if a failure is a regression, if someone only run > > the test once. The testers need some experience, at least need some > > history test data. > > > > If a tester find a case has 10% chance fail on his system, to make sure > > it's a regression or not, if he doesn't have history test data, at least > > he need to do the same test more times on old kernel version with his > > system. If it never fail on old kernel version, but can fail on new kernel. > > Then we suspect it's a regression. > > > > Even if the tester isn't an expert of the fs he's testing, he can report > > this issue to that fs experts, to get more checking. For downstream kernel, > > he has to report to the maintainers of downstream, or check by himself. > > If a case pass on upstream, but fail on downstream, it might mean there's > > a patchset on upstream can be backported. > > > > So, anyway, the testers need their own "experience" (include testing history > > data, known issue, etc) to judge if a failure is a suspected regression, or > > a known issue of downstream which hasn't been fixed (by backport). > > > > That's my personal perspective :) > > Right, but that's the personal perspective of an expert tester. I don't > particularly want to build that expertise myself; I want to write patches > which touch dozens of filesystems, and I want to be able to smoke-test > those patches. Maybe xfstests or kdevops doesn't want to solve that I think it's hard to judge, in general, which cases count as smoke-test cases, especially if you expect them to all pass when there are no real bugs. If it's for "all filesystems", I'd have to recommend only some simple fsx and fsstress cases...
Even if we add a group named 'smoke' and mark all stable and simple enough test cases as 'smoke', I still can't be sure './check -g smoke' will pass for all your filesystem testing in a random system environment :) Thanks, Zorro > problem, but that would seem like a waste of other peoples time. > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 16:06 ` Matthew Wilcox 2022-05-19 16:54 ` Zorro Lang @ 2022-07-01 23:36 ` Luis Chamberlain 2022-07-02 17:01 ` Theodore Ts'o 2 siblings, 0 replies; 32+ messages in thread From: Luis Chamberlain @ 2022-07-01 23:36 UTC (permalink / raw) To: Matthew Wilcox Cc: Zorro Lang, Amir Goldstein, linux-fsdevel, linux-block, pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 05:06:07PM +0100, Matthew Wilcox wrote: > Right, but that's the personal perspective of an expert tester. I don't > particularly want to build that expertise myself; I want to write patches > which touch dozens of filesystems, and I want to be able to smoke-test > those patches. Maybe xfstests or kdevops doesn't want to solve that > problem, kdevop's goals are aligned to enable that. However at this point in time there is no agreement to share expunges and so we just carry tons of them per kernel / distro for those that *did* have time to run them for the environment used and share them. Today there are baselines for stable and linus' kernel for some filesystems, but these are on a best effort basis as this takes system resources and someone's time. The results are tracked in: workflows/fstests/expunges/ With time now that there is at least a rig to do this for stable and upstream this should expand to be more up to date. There is also a shared repo which enables folks to share results there. Luis ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 16:06 ` Matthew Wilcox 2022-05-19 16:54 ` Zorro Lang 2022-07-01 23:36 ` Luis Chamberlain @ 2022-07-02 17:01 ` Theodore Ts'o 2022-07-07 21:36 ` Luis Chamberlain 2 siblings, 1 reply; 32+ messages in thread From: Theodore Ts'o @ 2022-07-02 17:01 UTC (permalink / raw) To: Matthew Wilcox Cc: Zorro Lang, Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block, pankydev8, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Thu, May 19, 2022 at 05:06:07PM +0100, Matthew Wilcox wrote: > > Right, but that's the personal perspective of an expert tester. I don't > particularly want to build that expertise myself; I want to write patches > which touch dozens of filesystems, and I want to be able to smoke-test > those patches. Maybe xfstests or kdevops doesn't want to solve that > problem, but that would seem like a waste of other peoples time. Willy, For your use case I'm guessing that you have two major concerns: * bugs that you may have introduced by patches "which touch dozens of filesystems" * bugs in the core mm and fs-writeback code, which may be much more substantive/complex changes. Would you say that is correct? At least for ext4 and xfs, it's probably quite sufficient just to run the -g auto group for the ext4/4k and xfs/4k test configs --- that is, the standard default file system configs using the 4k block size. Both of these currently don't require any test exclusions for kvm-xfstests or gce-xfstests when running the auto group. And so, for the purposes of catching bugs in the core MM/VFS layer and any changes that the folio patches are likely to touch for ext4 and xfs, the auto group for ext4/4k and xfs/4k is probably quite sufficient. Testing the more exotic test configs, such as bigalloc for ext4, realtime for xfs, or the external log configs, is not likely to be relevant for the folio patches.
Note: I recommend that you skip using the loop device xfstests strategy, which Luis likes to advocate. From the perspective of *likely* regressions caused by the Folio patches, I claim they are going to cause you more pain than they are worth. If there are some strange Folio/loop device interactions, they aren't likely going to be obvious/reproducible failures that will cause pain to linux-next testers. While it would be nice to find **all** possible bugs before patches go upstream to Linus, if it slows down your development velocity to a near-standstill, it's not worth it. We have to be realistic about things. What about other file systems? Well, first of all, xfstests only has support for the following file systems: 9p btrfs ceph cifs exfat ext2 ext4 f2fs gfs glusterfs jfs msdos nfs ocfs2 overlay pvfs2 reiserfs tmpfs ubifs udf vfat virtiofs xfs {kvm,gce}-xfstests supports these 16 file systems: 9p btrfs exfat ext2 ext4 f2fs jfs msdos nfs overlay reiserfs tmpfs ubifs udf vfat xfs kdevops has support for these file systems: btrfs ext4 xfs So realistically, you're not going to have *full* test coverage for all of the file systems you might want to touch, no matter what you do. And even for those file systems that are technically supported by xfstests and kvm-xfstests, if they aren't being regularly run (for example, exfat, 9p, ubifs, udf, etc.) there may be bitrot, and very likely there is no one *to* actively maintain exclude files. For that matter, there might not be anyone you could turn to for help interpreting the test results. So.... I believe the most realistic thing to do is to run xfstests on a simple set of configs --- using no special mkfs or mount options --- first against the baseline, and then after you've applied your folio patches. If there are any new test failures, do something like: kvm-xfstests -c f2fs/default -C 10 generic/013 to check to see whether it's a hard failure or not. If it's a hard failure, then it's a problem with your patches.
If it's a flaky failure, it's possible you'll need to repeat the test against the baseline: git checkout origin; kbuild kvm-xfstests -c f2fs/default -C 10 generic/013 If it's also flaky on the baseline, you can ignore the test failure for the purposes of folio development. There are more complex things you could do, such as running a baseline set of tests 500 times (as Luis suggests), but I believe that for your use case, it's not a good use of your time. You'd need to spend several weeks finding *all* the flaky tests up front, especially if you want to do this for a large set of file systems. It's much more efficient to check whether a suspected test regression is really a flaky test result when you come across one. I'd also suggest using the -g quick tests for file systems other than ext4 and xfs. That's probably going to be quite sufficient for finding obvious problems that might be introduced when you're making changes to f2fs, btrfs, etc., and it will reduce the number of potential flaky tests that you might have to handle. It should be possible to automate this, and Leah and I have talked about designs to automate this process. Leah has some rough scripts that do a semantic-style diff for the baseline and after applying the proposed xfs backports. So it operates on something like this: f2fs/default: 868 tests, 10 failures, 217 skipped, 6899 seconds Failures: generic/050 generic/064 generic/252 generic/342 generic/383 generic/502 generic/506 generic/526 generic/527 generic/563 In theory, we could also have automated tools that look for the suspected test regressions, and then try running those test regressions 20 or 25 times on the baseline and after applying the patch series. Those don't exist yet, but it's just a Mere Matter of Programming.
:-) I can't promise anything, especially with dates, but developing better automation tools to support the xfs stable backports is on our near-term roadmap --- and that would probably be applicable for the folio development use case. Cheers, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
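The automated comparison described above can be sketched as a simple diff of failure sets between a baseline run and a patched run. This is an illustration only: the input format (plain lists of failing test names) and the function names are assumptions, not the interface of Leah's actual scripts.

```python
# Sketch: diff the failure sets of a baseline run and a patched run to
# surface suspected regressions. Input format is assumed for illustration.

def suspected_regressions(baseline_failures, patched_failures):
    """Tests failing with the patch applied but not on the baseline."""
    return sorted(set(patched_failures) - set(baseline_failures))

def apparently_fixed(baseline_failures, patched_failures):
    """Tests failing on the baseline but not with the patch applied."""
    return sorted(set(baseline_failures) - set(patched_failures))

baseline = ["generic/050", "generic/064", "generic/252", "generic/342"]
patched = ["generic/050", "generic/252", "generic/527"]

print(suspected_regressions(baseline, patched))  # ['generic/527']
print(apparently_fixed(baseline, patched))       # ['generic/064', 'generic/342']
```

Any test flagged as a suspected regression would then get the `-C 10`-style rerun against both trees to separate hard failures from flakes.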
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-02 17:01 ` Theodore Ts'o @ 2022-07-07 21:36 ` Luis Chamberlain 0 siblings, 0 replies; 32+ messages in thread From: Luis Chamberlain @ 2022-07-07 21:36 UTC (permalink / raw) To: Theodore Ts'o Cc: Matthew Wilcox, Zorro Lang, Amir Goldstein, linux-fsdevel, linux-block, pankydev8, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests On Sat, Jul 02, 2022 at 01:01:22PM -0400, Theodore Ts'o wrote: > Note: I recommend that you skip using the loop device xfstests > strategy, which Luis likes to advocate. For the perspective of > *likely* regressions caused by the Folio patches, I claim they are > going to cause you more pain than they are worth. If there are some > strange Folio/loop device interactions, they aren't likely going to be > obvious/reproduceable failures that will cause pain to linux-next > testers. While it would be nice to find **all** possible bugs before > patches go usptream to Linus, if it slows down your development > velocity to near-standstill, it's not worth it. We have to be > realistic about things. Regressions with the loopback block driver can creep up and we used to be much worse, but we have gotten better at it. Certainly testing a loopback driver can mean running into a regression with the loopback driver. But some block driver must be used in the end. > What about other file systems? Well, first of all, xfstests only has > support for the following file systems: > > 9p btrfs ceph cifs exfat ext2 ext4 f2fs gfs glusterfs jfs msdos > nfs ocfs2 overlay pvfs2 reiserfs tmpfs ubifs udf vfat virtiofs xfs > > {kvm,gce}-xfstests supports these 16 file systems: > > 9p btrfs exfat ext2 ext4 f2fs jfs msdos nfs overlay reiserfs > tmpfs ubifs udf vfat xfs > > kdevops has support for these file systems: > > btrfs ext4 xfs Thanks for this list Ted! 
And so adding support for a new filesystem in kdevops should be: * a kconfig symbol for the fs and then one per supported mkfs config option you want to support * a configuration file for it; this can be as elaborate as the one we have for xfs [0], supporting different mkfs config options, or one with just one or two mkfs config options [1]. The default is just shared information. [0] https://github.com/linux-kdevops/kdevops/blob/master/playbooks/roles/fstests/templates/xfs/xfs.config [1] https://github.com/linux-kdevops/kdevops/blob/master/playbooks/roles/fstests/templates/ext4/ext4.config > There are more complex things you could do, such as running a baseline > set of tests 500 times (as Luis suggests), I advocate 100, and I suggest that is a nice goal for enterprise kernels. I also personally advocate this confidence in a baseline for stable kernels if *I* am going to backport changes. > but I believe that for your > use case, it's not a good use of your time. You'd need to speed > several weeks finding *all* the flaky tests up front, especially if > you want to do this for a large set of file systems. It's much more > efficient to check if a suspetected test regression is really a flaky > test result when you come across them. Or you work with a test runner that already has the list of known failures / flaky failures for a target configuration, like using loopbacks. And hence I tend to attend to these for xfs, btrfs, and ext4 when I have time. My goal has been to work towards a baseline of at least 100 successful runs without failure, tracking upstream. Luis ^ permalink raw reply [flat|nested] 32+ messages in thread
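The 100-run goal can be given a rough statistical footing. Assuming a test fails independently with probability p on each run, the chance that a flaky test slips through N consecutive clean runs is (1 - p)^N; the sketch below just evaluates that expression for the failure rates mentioned in this thread (the independence assumption is mine, not anyone's claim here).

```python
# Chance that a flaky test with per-run failure probability p passes
# N runs in a row, i.e. slips into a "clean" baseline undetected.
# Assumes independent runs, which real flakes may well violate.

def miss_probability(p: float, runs: int) -> float:
    return (1.0 - p) ** runs

for label, p, runs in [("1/2", 0.5, 10), ("1/30", 1 / 30, 30), ("1/30", 1 / 30, 100)]:
    print(f"failure rate {label}, {runs} runs: miss chance {miss_probability(p, runs):.3f}")
```

A 1/30 flake has roughly a 36% chance of surviving 30 clean runs but only about a 3% chance of surviving 100, which is one way to motivate the 100-run baseline; the rarer sub-1/100 creatures mentioned earlier in the thread need correspondingly more runs to flush out.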
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-05-19 3:07 [RFC: kdevops] Standardizing on failure rate nomenclature for expunges Luis Chamberlain 2022-05-19 6:36 ` Amir Goldstein @ 2022-07-02 21:48 ` Bart Van Assche 2022-07-03 5:56 ` Amir Goldstein 2022-07-03 13:32 ` Theodore Ts'o 1 sibling, 2 replies; 32+ messages in thread From: Bart Van Assche @ 2022-07-02 21:48 UTC (permalink / raw) To: Luis Chamberlain, linux-fsdevel, linux-block Cc: amir73il, pankydev8, tytso, josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen On 5/18/22 20:07, Luis Chamberlain wrote: > I've been promoting the idea that running fstests once is nice, > but things get interesting if you try to run fstests multiple > times until a failure is found. It turns out at least kdevops has > found tests which fail with a failure rate of typically 1/2 to > 1/30 average failure rate. That is 1/2 means a failure can happen > 50% of the time, whereas 1/30 means it takes 30 runs to find the > failure. > > I have tried my best to annotate failure rates when I know what > they might be on the test expunge list, as an example: > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > The term "failure rate 1/15" is 16 characters long, so I'd like > to propose to standardize a way to represent this. How about > > generic/530 # F:1/15 > > Then we could extend the definition. F being current estimate, and this > can be just how long it took to find the first failure. A more valuable > figure would be failure rate avarage, so running the test multiple > times, say 10, to see what the failure rate is and then averaging the > failure out. So this could be a more accurate representation. 
For this > how about: > > generic/530 # FA:1/15 > > This would mean on average there failure rate has been found to be about > 1/15, and this was determined based on 10 runs. > > We should also go extend check for fstests/blktests to run a test > until a failure is found and report back the number of successes. > > Thoughts? > > Note: yes failure rates lower than 1/100 do exist but they are rare > creatures. I love them though as my experience shows so far that they > uncover hidden bones in the closet, and they they make take months and > a lot of eyeballs to resolve. I strongly disagree with annotating tests with failure rates. My opinion is that on a given test setup a test either should pass 100% of the time or fail 100% of the time. If a test passes in one run and fails in another run that either indicates a bug in the test or a bug in the software that is being tested. Examples of behaviors that can cause tests to behave unpredictably are use-after-free bugs and race conditions. How likely it is to trigger such behavior depends on a number of factors. This could even depend on external factors like which network packets are received from other systems. I do not expect that flaky tests have an exact failure rate. Hence my opinion that flaky tests are not useful and also that it is not useful to annotate flaky tests with a failure rate. If a test is flaky I think that the root cause of the flakiness must be determined and fixed. Bart. ^ permalink raw reply [flat|nested] 32+ messages in thread
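Whatever the verdict on this objection, the `F:`/`FA:` notation proposed at the top of the thread is at least easy to consume mechanically. Below is a sketch of a parser for such annotated expunge lines; the exact grammar is inferred from the examples in this thread and is an assumption, not an agreed format.

```python
import re

# Parse proposed expunge annotations such as:
#   generic/530 # F:1/15    (estimate from first observed failure)
#   generic/530 # FA:1/15   (failure rate averaged over repeated runs)
# Grammar inferred from the examples in this thread; not an official format.

ANNOTATION_RE = re.compile(
    r"^(?P<test>\S+)\s*#\s*(?P<kind>FA?):(?P<failures>\d+)/(?P<runs>\d+)\s*$"
)

def parse_expunge_line(line: str):
    m = ANNOTATION_RE.match(line.strip())
    if m is None:
        return None
    return {
        "test": m.group("test"),
        "averaged": m.group("kind") == "FA",
        "failure_rate": int(m.group("failures")) / int(m.group("runs")),
    }

print(parse_expunge_line("generic/530 # FA:1/15"))
```

A tool consuming expunge lists could use the parsed rate to decide, for example, how many reruns are needed before calling a test fixed.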
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-02 21:48 ` Bart Van Assche @ 2022-07-03 5:56 ` Amir Goldstein 2022-07-03 13:15 ` Theodore Ts'o 2022-07-04 3:25 ` Dave Chinner 2022-07-03 13:32 ` Theodore Ts'o 1 sibling, 2 replies; 32+ messages in thread From: Amir Goldstein @ 2022-07-03 5:56 UTC (permalink / raw) To: Bart Van Assche, Darrick J. Wong Cc: Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox, Dave Chinner On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@acm.org> wrote: > > On 5/18/22 20:07, Luis Chamberlain wrote: > > I've been promoting the idea that running fstests once is nice, > > but things get interesting if you try to run fstests multiple > > times until a failure is found. It turns out at least kdevops has > > found tests which fail with a failure rate of typically 1/2 to > > 1/30 average failure rate. That is 1/2 means a failure can happen > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > failure. > > > > I have tried my best to annotate failure rates when I know what > > they might be on the test expunge list, as an example: > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > to propose to standardize a way to represent this. How about > > > > generic/530 # F:1/15 > > > > Then we could extend the definition. F being current estimate, and this > > can be just how long it took to find the first failure. A more valuable > > figure would be failure rate avarage, so running the test multiple > > times, say 10, to see what the failure rate is and then averaging the > > failure out. So this could be a more accurate representation. 
For this > > how about: > > > > generic/530 # FA:1/15 > > > > This would mean on average there failure rate has been found to be about > > 1/15, and this was determined based on 10 runs. > > > > We should also go extend check for fstests/blktests to run a test > > until a failure is found and report back the number of successes. > > > > Thoughts? > > > > Note: yes failure rates lower than 1/100 do exist but they are rare > > creatures. I love them though as my experience shows so far that they > > uncover hidden bones in the closet, and they they make take months and > > a lot of eyeballs to resolve. > > I strongly disagree with annotating tests with failure rates. My opinion > is that on a given test setup a test either should pass 100% of the time > or fail 100% of the time. If a test passes in one run and fails in > another run that either indicates a bug in the test or a bug in the > software that is being tested. Examples of behaviors that can cause > tests to behave unpredictably are use-after-free bugs and race > conditions. How likely it is to trigger such behavior depends on a > number of factors. This could even depend on external factors like which > network packets are received from other systems. I do not expect that > flaky tests have an exact failure rate. Hence my opinion that flaky > tests are not useful and also that it is not useful to annotate flaky > tests with a failure rate. If a test is flaky I think that the root > cause of the flakiness must be determined and fixed. > That is true for some use cases, but unfortunately, the flaky fstests are way too valuable and too hard to replace or improve, so practically, fs developers have to run them, but not everyone does. Zorro has already proposed to properly tag the non deterministic tests with a specific group and I think there is really no other solution. The only question is whether we remove them from the 'auto' group (I think we should). 
There is probably a large overlap already between the 'stress', 'soak', and 'fuzzers' test groups and the non-deterministic tests. Moreover, if a test is not a stress/fuzzer test and it is not deterministic, then the test is likely buggy. There is only one 'stress' test not in the 'auto' group (generic/019), and only two 'soak' tests not in the 'auto' group (generic/52{1,2}). There are only three tests in the 'soak' group, and they are exactly the same three tests as in the 'long_rw' group. So instead of thinking up a new 'flaky'/'random'/'stochastic' name, we may just repurpose the 'soak' group for this matter and start moving known flaky tests from 'auto' to 'soak'. generic/52{1,2} can be removed from the 'soak' group and remain in the 'long_rw' group, unless filesystem developers would like to add those to the stochastic test run. Filesystem developers that run ./check -g auto -g soak will get the exact same test coverage as today's -g auto, and the "commoners" that run ./check -g auto will enjoy blissfully deterministic test results, at least for the default config of regularly tested filesystems (a.k.a. the ones tested by the kernel test bot). Darrick, As the one who created the 'soak' group and the only one who added tests to it, what do you think about this proposal? What do you think should be done with generic/52{1,2}? Thanks, Amir. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 5:56 ` Amir Goldstein @ 2022-07-03 13:15 ` Theodore Ts'o 2022-07-03 14:22 ` Amir Goldstein 2022-07-04 3:25 ` Dave Chinner 1 sibling, 1 reply; 32+ messages in thread From: Theodore Ts'o @ 2022-07-03 13:15 UTC (permalink / raw) To: Amir Goldstein Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox, Dave Chinner On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > > That is true for some use cases, but unfortunately, the flaky > fstests are way too valuable and too hard to replace or improve, > so practically, fs developers have to run them, but not everyone does. > > Zorro has already proposed to properly tag the non deterministic tests > with a specific group and I think there is really no other solution. The non-deterministic tests are not the sole, or even the most likely, cause of flaky tests. Or put another way, even if we used a deterministic pseudo-random number generator seed for some of the currently "non-deterministic tests" (and I believe we are for many of them already anyway), it's not going to make the flaky tests go away. That's because with many of these tests, we are running multiple threads, either in fsstress or fsx, or in the antagonist workload that is, say, running the space utilization to full to generate ENOSPC errors, and then deleting a bunch of files to trigger as many ENOSPC-hitting events as possible. > The only question is whether we remove them from the 'auto' group > (I think we should). I wouldn't; if someone wants to exclude the non-deterministic tests, once they are tagged as belonging to a group, they can just exclude that group. So there's no point removing them from the auto group IMHO.
> filesystem developers that will run ./check -g auto -g soak > will get the exact same test coverage as today's -g auto > and the "commoners" that run ./check -g auto will enjoy blissful > determitic test results, at least for the default config of regularly > tested filesystems (a.k.a, the ones tested by kernet test bot).? First of all, there are a number of tests today which are in soak or long_rw which are not in auto, so "-g auto -g soak" will *not* result in the "exact same test coverage". Secondly, as I've stated above, deterministic tests do not necessarily mean deterministic test results --- unless by "deterministic tests" you mean "completely single-threaded tests", which would eliminate a large amount of useful test coverage. Cheers, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 13:15 ` Theodore Ts'o @ 2022-07-03 14:22 ` Amir Goldstein 2022-07-03 16:30 ` Theodore Ts'o 0 siblings, 1 reply; 32+ messages in thread From: Amir Goldstein @ 2022-07-03 14:22 UTC (permalink / raw) To: Theodore Ts'o Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox, Dave Chinner On Sun, Jul 3, 2022 at 4:15 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > > > > That is true for some use cases, but unfortunately, the flaky > > fstests are way too valuable and too hard to replace or improve, > > so practically, fs developers have to run them, but not everyone does. > > > > Zorro has already proposed to properly tag the non deterministic tests > > with a specific group and I think there is really no other solution. > > The non-deterministic tests are not the sole, or even the most likely > cause of flaky tests. Or put another way, even if we used a > deterministic pseudo-random numberator seed for some of the curently > "non-determinstic tests" (and I believe we are for many of them > already anyway), it's not going to be make the flaky tests go away. > > That's because with many of these tests, we are running multiple > threads either in the fstress or fsx, or in the antogonist workload > that is say, running the space utilization to full to generate ENOSPC > errors, and then deleting a bunch of files to trigger as many ENOSPC > hitter events as possible. > > > The only question is whether we remove them from the 'auto' group > > (I think we should). > > I wouldn't; if someone wants to exclude the non-determistic tests, > once they are tagged as belonging to a group, they can just exclude > that group. 
So there's no point removing them from the auto group > IMHO. The reason I suggested that *we* change our habits is that we want to give passing-by fs testers an easier experience. Another argument in favor of splitting out -g soak from -g auto - You only need to run -g soak in a loop for as long as you like to be confident about the results. You need to run -g auto only once, by definition - if a test ends up failing the Nth time you run -g auto, then it belongs in -g soak and not in -g auto. > > > filesystem developers that will run ./check -g auto -g soak > > will get the exact same test coverage as today's -g auto > > and the "commoners" that run ./check -g auto will enjoy blissful > > determitic test results, at least for the default config of regularly > > tested filesystems (a.k.a, the ones tested by kernet test bot).? > > First of all, there are a number of tests today which are in soak or > long_rw which are not in auto, so "-g auto -g soak" will *not* result > in the "exact same test coverage". I addressed this in my proposal. I proposed to move these two tests out of soak and asked for Darrick's opinion. Who is using -g soak anyway? > > Secondly, as I've tested above, deterministic tests does not > necessasrily mean determinsitic test results --- unless by > "determinsitic tests" you mean "completely single-threaded tests", > which would eliminate a large amount of useful test coverage. > To be clear, when I wrote deterministic, what I meant was deterministic results empirically, in the same sense that Bart meant - a test should always pass. Because Luis was using the expunge lists to blacklist any test failure, no matter the failure rate, the kdevops expunge lists could be used as a first draft for the -g soak group, at least for tests that are blocklisted by kdevops for all of the ext4, xfs, and btrfs default configs on the upstream kernel. Thanks, Amir. ^ permalink raw reply [flat|nested] 32+ messages in thread
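The "run a test until a failure is found" mode proposed earlier in the thread for check, and the soak-in-a-loop habit described above, can be sketched with a small driver. The `./check` invocation and its nonzero-exit-on-failure convention are assumptions here; the runner command is injectable so the sketch can be exercised without an fstests setup.

```python
import subprocess

def run_until_failure(test_name, max_runs, check_cmd=("./check",)):
    """Run one test repeatedly; return the 1-based run number of the first
    failure (an observed failure rate of roughly 1 in that many runs), or
    None if all max_runs runs passed."""
    for run in range(1, max_runs + 1):
        result = subprocess.run(
            [*check_cmd, test_name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            return run
    return None

# Stub runners stand in for ./check so the sketch is self-contained:
print(run_until_failure("generic/530", 5, check_cmd=("true",)))   # None
print(run_until_failure("generic/530", 5, check_cmd=("false",)))  # 1
```

A run count with no failure (None) is only evidence of a failure rate better than 1/max_runs, which ties back to the question of how many runs a baseline needs.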
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 14:22 ` Amir Goldstein @ 2022-07-03 16:30 ` Theodore Ts'o 0 siblings, 0 replies; 32+ messages in thread From: Theodore Ts'o @ 2022-07-03 16:30 UTC (permalink / raw) To: Amir Goldstein Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox, Dave Chinner On Sun, Jul 03, 2022 at 05:22:17PM +0300, Amir Goldstein wrote: > > To be clear, when I wrote deterministic, what I meant was deterministic > results empirically, in the same sense that Bart meant - a test should > always pass. Well, all of the tests in the auto group pass 100% of the time for the ext4/4k and xfs/4k configs. (Well, at least if you use the HDD and SSD as the storage device. If you are using eMMC flash, or Luis's loop device config, there would be more failures.) But if we're talking about btrfs/4k, f2fs/4k, xfs/realtime, xfs/realtime_28k/logdev, ext4/bigalloc, etc., there would be a *lot* of tests that would need to be removed from the auto group. So what "non-deterministic tests" should we remove from the auto group? For what file systems, file system configs, and storage devices? What would you propose? Remember, Matthew wants something that he can use to test "dozens" of file systems that he's touching for the folio patches. If we have to remove all of the tests that fail if you are using nfs, vfat, hfs, msdos, etc., then the auto group would be pretty anemic. Let's not do that. If you want an "always pass" group, we could do that, but let's not call that the "auto" group, please. - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 5:56 ` Amir Goldstein 2022-07-03 13:15 ` Theodore Ts'o @ 2022-07-04 3:25 ` Dave Chinner 2022-07-04 7:58 ` Amir Goldstein 1 sibling, 1 reply; 32+ messages in thread From: Dave Chinner @ 2022-07-04 3:25 UTC (permalink / raw) To: Amir Goldstein Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@acm.org> wrote: > > > > On 5/18/22 20:07, Luis Chamberlain wrote: > > > I've been promoting the idea that running fstests once is nice, > > > but things get interesting if you try to run fstests multiple > > > times until a failure is found. It turns out at least kdevops has > > > found tests which fail with a failure rate of typically 1/2 to > > > 1/30 average failure rate. That is 1/2 means a failure can happen > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > > failure. > > > > > > I have tried my best to annotate failure rates when I know what > > > they might be on the test expunge list, as an example: > > > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > > to propose to standardize a way to represent this. How about > > > > > > generic/530 # F:1/15 > > > > > > Then we could extend the definition. F being current estimate, and this > > > can be just how long it took to find the first failure. 
A more valuable > > > figure would be the failure rate average, so running the test multiple > > > times, say 10, to see what the failure rate is and then averaging the > > > failure out. So this could be a more accurate representation. For this > > > how about: > > > > > > generic/530 # FA:1/15 > > > > > > This would mean on average the failure rate has been found to be about > > > 1/15, and this was determined based on 10 runs. > > > > > > We should also go extend check for fstests/blktests to run a test > > > until a failure is found and report back the number of successes. > > > > > > Thoughts? > > > > > > Note: yes, failure rates lower than 1/100 do exist but they are rare > > > creatures. I love them though, as my experience shows so far that they > > > uncover hidden bones in the closet, and they may take months and > > > a lot of eyeballs to resolve. > > > > I strongly disagree with annotating tests with failure rates. My opinion > > is that on a given test setup a test either should pass 100% of the time > > or fail 100% of the time. If a test passes in one run and fails in > > another run that either indicates a bug in the test or a bug in the > > software that is being tested. Examples of behaviors that can cause > > tests to behave unpredictably are use-after-free bugs and race > > conditions. How likely it is to trigger such behavior depends on a > > number of factors. This could even depend on external factors like which > > network packets are received from other systems. I do not expect that > > flaky tests have an exact failure rate. Hence my opinion that flaky > > tests are not useful and also that it is not useful to annotate flaky > > tests with a failure rate. If a test is flaky I think that the root > > cause of the flakiness must be determined and fixed.
> > That is true for some use cases, but unfortunately, the flaky > fstests are way too valuable and too hard to replace or improve, > so practically, fs developers have to run them, but not everyone does. Everyone *should* be running them. They find *new bugs*, and it doesn't matter how old the kernel is. e.g. if you're backporting XFS log changes and you aren't running the "flakey" recoveryloop group tests, then you are *not testing failure handling log recovery sufficiently*. Where do you draw the line? recoveryloop tests that shut down and recover the filesystem will find bugs, they are guaranteed to be flakey, and they are *absolutely necessary* to be run because they are the only tests that exercise that critical filesystem crash recovery functionality. What do we actually gain by excluding these "non-deterministic" tests from automated QA environments? > Zorro has already proposed to properly tag the non-deterministic tests > with a specific group and I think there is really no other solution. > > The only question is whether we remove them from the 'auto' group > (I think we should). As per above, this shows that many people simply don't understand what many of these non-deterministic tests are actually exercising, and hence what they fail to test by excluding them from automated testing. > There is probably a large overlap already between the 'stress' 'soak' and > 'fuzzers' test groups and the non-deterministic tests. > Moreover, if the test is not a stress/fuzzer test and it is not deterministic > then the test is likely buggy. > > There is only one 'stress' test not in the 'auto' group (generic/019), only two > 'soak' tests not in the 'auto' group (generic/52{1,2}). > There are only three tests in the 'soak' group and they are also exactly > the same three tests in the 'long_rw' group.
> So instead of thinking up a new 'flaky' 'random' 'stochastic' name > we may just repurpose the 'soak' group for this matter and start > moving known flaky tests from 'auto' to 'soak'. Please, no. The policy for the auto group is inclusive, not exclusive. It is based on the concept that every test is valuable and should be run if possible. Hence any test that generally passes, does not run forever and does not endanger the system should be a member of the auto group. That effectively only rules out fuzzer and dangerous tests from being in the auto group, as long-running tests should be scaled by TIME_FACTOR/LOAD_FACTOR and hence the default test behaviour results in only a short run time. If someone wants to *reduce their test coverage* for whatever reason (e.g. runtime, wanting to only run pass/fail tests, etc.) then the mechanism we already have in place for this is for that person to use *exclusion groups*. i.e. we exclude subsets of tests from the default set, we don't remove them from the default set. Such an environment would run: ./check -g auto -x soak So that the test environment doesn't run the "non-deterministic" tests in the 'soak' group. i.e. the requirements of this test environment do not dictate the tests that every other test environment runs by default. > generic/52{1,2} can be removed from the 'soak' group and remain > in the 'long_rw' group, unless filesystem developers would like to > add those to the stochastic test run. > > filesystem developers that will run ./check -g auto -g soak > will get the exact same test coverage as today's -g auto > and the "commoners" that run ./check -g auto will enjoy blissful > deterministic test results, at least for the default config of regularly > tested filesystems (a.k.a. the ones tested by kernel test bot).
An argument that says "everyone else has to change what they do so I don't have to change" means that the person making the argument thinks their requirements are more important than the requirements of anyone else. The test run policy mechanisms we already have avoid this whole can of worms - we don't need to care about the specific test requirements of any specific test environment because the default is inclusive and it is trivial to exclude tests from that default set if needed. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 32+ messages in thread
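[Editor's note: the "run until failure, report the number of successes" extension to check that Luis proposed could look roughly like the sketch below. The `run_test` callable is a hypothetical stand-in for invoking `./check <test>` and inspecting its exit status; fstests itself does not provide this hook.]

```python
def run_until_failure(run_test, max_runs=100):
    """Run a test repeatedly until it fails or max_runs is reached.

    run_test() returns True on pass, False on fail (standing in for one
    `./check <test>` invocation).  Returns (successes, failed) so that a
    failure after 14 passes can be annotated as F:1/15 in an expunge list.
    """
    successes = 0
    for _ in range(max_runs):
        if run_test():
            successes += 1
        else:
            return successes, True
    return successes, False

# Simulated flaky test that fails on its 15th run:
outcomes = iter([True] * 14 + [False])
successes, failed = run_until_failure(lambda: next(outcomes))
print(f"F:1/{successes + 1}" if failed else "no failure observed")  # F:1/15
```

Note this only yields the "F" (first-failure) estimate; the "FA" average would repeat the whole loop several times and average the observed rates.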
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-04 3:25 ` Dave Chinner @ 2022-07-04 7:58 ` Amir Goldstein 2022-07-05 2:29 ` Theodore Ts'o 2022-07-05 3:11 ` Dave Chinner 0 siblings, 2 replies; 32+ messages in thread From: Amir Goldstein @ 2022-07-04 7:58 UTC (permalink / raw) To: Dave Chinner Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Mon, Jul 4, 2022 at 6:25 AM Dave Chinner <david@fromorbit.com> wrote: > On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > > On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@acm.org> wrote: > > > On 5/18/22 20:07, Luis Chamberlain wrote: > > > > [Luis's failure rate proposal, quoted in full earlier in the thread, snipped] > > > I strongly disagree with annotating tests with failure rates. [...]
> > > If a test is flaky I think that the root cause of the flakiness must be determined and fixed. > > That is true for some use cases, but unfortunately, the flaky > > fstests are way too valuable and too hard to replace or improve, > > so practically, fs developers have to run them, but not everyone does. > Everyone *should* be running them. They find *new bugs*, and it > doesn't matter how old the kernel is. e.g. if you're backporting XFS > log changes and you aren't running the "flakey" recoveryloop group > tests, then you are *not testing failure handling log recovery > sufficiently*. > Where do you draw the line? recoveryloop tests that shut down and > recover the filesystem will find bugs, they are guaranteed to be > flakey, and they are *absolutely necessary* to be run because they > are the only tests that exercise that critical filesystem crash > recovery functionality. > What do we actually gain by excluding these "non-deterministic" > tests from automated QA environments? Automated QA environment is a broad term. We all have automated QA environments. But there is a specific class of automated test env, such as CI build bots, that do not tolerate human intervention. Is it enough to run only the deterministic tests to validate xfs code? No it is not. My LTS environment is human-monitored - I look at every failure and analyse the logs and look at historic data to decide if they are regressions or not. A bot simply cannot do that. The bot can go back and run the test N times on baseline vs patch. The question is, do we want kernel test bot to run -g auto -x soak on linux-next and report issues to us? I think the answer to this question should be yes. Do we want kernel test bot to run -g auto and report flaky test failures to us? I am pretty sure that the answer is no. So we need -x soak or whatever for this specific class of machine-automated validation. > > Zorro has already proposed to properly tag the non-deterministic tests
> > with a specific group and I think there is really no other solution. > > The only question is whether we remove them from the 'auto' group > > (I think we should). > As per above, this shows that many people simply don't understand > what many of these non-deterministic tests are actually exercising, > and hence what they fail to test by excluding them from automated > testing. [...] > If someone wants to *reduce their test coverage* for whatever reason > (e.g. runtime, wanting to only run pass/fail tests, etc.) then the > mechanism we already have in place for this is for that person to > use *exclusion groups*. i.e. we exclude subsets of tests from the > default set, we don't remove them from the default set.
> Such an environment would run: > ./check -g auto -x soak > So that the test environment doesn't run the "non-deterministic" > tests in the 'soak' group. i.e. the requirements of this test > environment do not dictate the tests that every other test > environment runs by default. OK. > > generic/52{1,2} can be removed from the 'soak' group and remain > > in the 'long_rw' group, unless filesystem developers would like to > > add those to the stochastic test run. > > filesystem developers that will run ./check -g auto -g soak > > will get the exact same test coverage as today's -g auto > > and the "commoners" that run ./check -g auto will enjoy blissful > > deterministic test results, at least for the default config of regularly > > tested filesystems (a.k.a. the ones tested by kernel test bot). > An argument that says "everyone else has to change what they do so I > don't have to change" means that the person making the argument > thinks their requirements are more important than the requirements > of anyone else. Unless that person was not arguing for themselves... I was referring to passing-by developers that develop a patch that interacts with fs code but who do not usually develop and test filesystems. Not for myself. > The test run policy mechanisms we already have avoid > this whole can of worms - we don't need to care about the specific > test requirements of any specific test environment because the > default is inclusive and it is trivial to exclude tests from that > default set if needed. I had the humble notion that we should make running fstests as easy as possible for passing-by developers, because I have had the chance to get feedback from some developers on their first-time attempt to run fstests and it wasn't pleasant, but never mind. -g auto -x soak is fine. When you think about it, many fs developers run ./check -g auto, so we should not interfere with that, but I bet very few run './check'?
so we could make the default for './check' some group combination that is as deterministic as possible. If I am not mistaken, LTP's main run script runltp.sh, when run w/o parameters, has a default set of tests, which obviously get run by the kernel bots. Thanks, Amir. ^ permalink raw reply [flat|nested] 32+ messages in thread
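[Editor's note: whatever group policy is settled on, the nomenclature under discussion is easy to consume mechanically. A sketch of parsing Luis's proposed F:/FA: annotations out of a kdevops-style expunge file; the exact file format beyond "test # comment" lines is an assumption:]

```python
import re

# Matches lines like "generic/530 # F:1/15" or "xfs/297 # FA:1/30 <url>".
EXPUNGE_RE = re.compile(
    r"^(?P<test>\S+/\d+)\s*(?:#\s*(?P<kind>FA?):(?P<num>\d+)/(?P<den>\d+))?"
)

def parse_expunge_line(line: str):
    """Return (test, kind, failure_rate) for an expunge entry.

    kind is 'F' (first-failure estimate), 'FA' (averaged over several
    run-until-failure loops) or None when the entry is unannotated."""
    m = EXPUNGE_RE.match(line.strip())
    if not m:
        return None
    kind = m.group("kind")
    rate = int(m.group("num")) / int(m.group("den")) if kind else None
    return m.group("test"), kind, rate

print(parse_expunge_line("generic/530 # F:1/15"))
print(parse_expunge_line("generic/019"))
```

Tooling like this is one practical payoff of standardizing on a short fixed token instead of free-form "failure rate about 1/15" comments.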
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-04 7:58 ` Amir Goldstein @ 2022-07-05 2:29 ` Theodore Ts'o 2022-07-05 3:11 ` Dave Chinner 0 siblings, 0 replies; 32+ messages in thread From: Theodore Ts'o @ 2022-07-05 2:29 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Mon, Jul 04, 2022 at 10:58:22AM +0300, Amir Goldstein wrote: > I had the humble notion that we should make running fstests as easy as > possible for passing-by developers, because I have had the chance to get > feedback from some developers on their first-time attempt to run fstests > and it wasn't pleasant, but never mind. > -g auto -x soak is fine. I really don't think using some kind of group exclusion is the right way to go. First of all, the definition of "determinism" keeps shifting around. The most recent way you've used it, it's for "passing-by developers" to be able to tell if their patches may have caused a regression. Is that right? Secondly, a group-based exclusion list is problematic because groups are fixed with respect to a kernel version (and therefore will very easily go out of date), and they are fixed with respect to a file system type, while which tests will either fail or be flaky will vary, wildly, by file system type. As an example of the "fixed in time" problem, I had a global exclude for the following tests: generic/471 generic/484 generic/554. That's because they were failing for me all the time, for various reasons that made them look like either test bugs or global kernel bugs.
But unless you regularly check whether a test which is on a particular exclusion list or in some magic "soak" group still fails (and "soak" is a ***massive*** misnomer --- the group as you've proposed it should really be named "might-fail-perhaps-on-some-fs-type-or-config"), the tests could remain on the list long after the test bug or the kernel bug has been addressed. So as of commit 7fd7c21547a1 ("test-appliance: add kernel version conditionals using cpp to exclude files") in xfstests-bld, generic/471 and generic/484 are only excluded when testing LTS kernels older than 5.10, and generic/554 is only excluded when testing LTS kernels older than 5.4. The point is (a) you can't easily do version-specific exclusions with xfstests group declarations, and (b) someone needs to periodically sweep through the tests to see if tests should be in the soak or "might-fail-perhaps-on-some-fs-type-or-config" group. As an example of why you're going to want to do the exclusions on a per-file-system basis, consider the tests that would have to be added to the "might-fail-perhaps-on-some-fs-type-or-config" group. If the goal is to let a "drive-by developer" know whether their patch has caused a regression, especially someone like Willy who might be modifying a large number of file systems, then you would need to add at *least* 135 tests to the "soak" or "might-fail-perhaps-on-some-fs-type-or-config" group: (These are current test failures that I have observed using v5.19-rc4.)
btrfs/default: 1176 tests, 7 failures, 244 skipped, 8937 seconds Failures: btrfs/012 btrfs/219 btrfs/235 generic/041 generic/297 generic/298 shared/298 exfat/default: 1222 tests, 23 failures, 552 skipped, 1561 seconds Failures: generic/013 generic/309 generic/310 generic/394 generic/409 generic/410 generic/411 generic/428 generic/430 generic/431 generic/432 generic/433 generic/438 generic/443 generic/465 generic/490 generic/519 generic/563 generic/565 generic/591 generic/633 generic/639 generic/676 ext2/default: 1188 tests, 4 failures, 472 skipped, 2803 seconds Failures: generic/347 generic/607 generic/614 generic/631 f2fs/default: 877 tests, 5 failures, 217 skipped, 4249 seconds Failures: generic/050 generic/064 generic/252 generic/506 generic/563 jfs/default: 1074 tests, 62 failures, 404 skipped, 3695 seconds Failures: generic/015 generic/034 generic/039 generic/040 generic/041 generic/056 generic/057 generic/065 generic/066 generic/073 generic/079 generic/083 generic/090 generic/101 generic/102 generic/104 generic/106 generic/107 generic/204 generic/226 generic/258 generic/260 generic/269 generic/288 generic/321 generic/322 generic/325 generic/335 generic/336 generic/341 generic/342 generic/343 generic/348 generic/376 generic/405 generic/416 generic/424 generic/427 generic/467 generic/475 generic/479 generic/480 generic/481 generic/489 generic/498 generic/502 generic/510 generic/520 generic/526 generic/527 generic/534 generic/535 generic/537 generic/547 generic/552 generic/557 generic/563 generic/607 generic/614 generic/629 generic/640 generic/690 nfs/loopback: 818 tests, 2 failures, 345 skipped, 9365 seconds Failures: generic/426 generic/551 reiserfs/default: 1076 tests, 25 failures, 413 skipped, 3368 seconds Failures: generic/102 generic/232 generic/235 generic/258 generic/321 generic/355 generic/381 generic/382 generic/383 generic/385 generic/386 generic/394 generic/418 generic/520 generic/533 generic/535 generic/563 generic/566 generic/594 generic/603 
generic/614 generic/620 generic/634 generic/643 generic/691 vfat/default: 1197 tests, 42 failures, 528 skipped, 4616 seconds Failures: generic/003 generic/130 generic/192 generic/213 generic/221 generic/258 generic/309 generic/310 generic/313 generic/394 generic/409 generic/410 generic/411 generic/428 generic/430 generic/431 generic/432 generic/433 generic/438 generic/443 generic/465 generic/467 generic/477 generic/495 generic/519 generic/563 generic/565 generic/568 generic/569 generic/589 generic/632 generic/633 generic/637 generic/638 generic/639 generic/644 generic/645 generic/656 generic/676 generic/683 generic/688 generic/689 This is not *all* of the tests. There are a number of file system types that are causing the VM to crash. I haven't had time this weekend to figure out what tests need to be added to the exclude group for udf, ubifs, overlayfs, etc. So there might be even *more* tests that would need to be added to the "might-fail-perhaps-on-some-fs-type-or-config" group. It *could* be fewer, if we want to exclude reiserfs and jfs, on the theory that they might be deprecated soon. But there are still some very commonly used file systems, such as vfat, exfat, etc., that have a *huge* number of failing tests that are going to make life unpleasant for the drive-by developer/tester. And there are other file systems which will cause a kernel crash or lockup on v5.19-rc4, which certainly will give trouble for the drive-by tester. Which is why I argue that using a group, whether it's called soak, or something else, to exclude all of the tests that might fail and thus confuse the passing-by fs testers is not the best way to go. > The reason I suggested that *we* change our habits is because > we want to give passing-by fs testers an easier experience. Realistically, if we want to give passing-by fs testers an easier experience, we need to give these testers a more turn-key experience. This was part of my original goals when I created kvm-xfstests and gce-xfstests.
This is why I upload pre-created VM images for kvm-xfstests and gce-xfstests --- so people don't have to build xfstests and all their dependencies, and to figure out how to set up and configure it. For example, it means that I can tell ext4 developers to just run "kvm-xfstests smoke" as a bare minimum before sending me a patch for review. There is more that we clearly need to do if we want to make something which is completely turn-key for a drive-by tester, especially if they need to test more than just ext4 and xfs. I have some ideas, and this is something that I'm hoping to do more work in the next few months. If someone is interested in contributing some time and energy to this project, please give me a ring. Many hands make light work, and all that. Cheers, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
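[Editor's note: Ted's point that exclusions must vary by file system, config, and kernel version suggests generating per-config exclude files from observed results rather than hard-coding a shared group. A rough sketch; the sample data comes from the summaries above, while the file layout and comment format are invented:]

```python
# Each entry mirrors a "Failures:" line from the per-config summaries above.
observed = {
    "ext2/default": ["generic/347", "generic/607", "generic/614", "generic/631"],
    "nfs/loopback": ["generic/551", "generic/426"],
}

def exclude_list(config, failures):
    """Render one per-config exclude file: a provenance comment followed by
    one test per line, so entries can be re-audited on each kernel release
    instead of rotting the way a static group annotation would."""
    lines = [f"# observed failures for {config}; re-verify on each kernel"]
    lines += sorted(failures)
    return "\n".join(lines) + "\n"

for config in observed:
    print(exclude_list(config, observed[config]))
```

This mirrors the kdevops layout of one expunge file per kernel version and per fs config, rather than a single group baked into the tests themselves.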
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-04 7:58 ` Amir Goldstein 2022-07-05 2:29 ` Theodore Ts'o @ 2022-07-05 3:11 ` Dave Chinner 2022-07-06 10:11 ` Amir Goldstein 1 sibling, 1 reply; 32+ messages in thread From: Dave Chinner @ 2022-07-05 3:11 UTC (permalink / raw) To: Amir Goldstein Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Mon, Jul 04, 2022 at 10:58:22AM +0300, Amir Goldstein wrote: > On Mon, Jul 4, 2022 at 6:25 AM Dave Chinner <david@fromorbit.com> wrote: > > On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > > > On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@acm.org> wrote: > > > > On 5/18/22 20:07, Luis Chamberlain wrote: > > > > > [Luis's failure rate proposal, quoted in full earlier in the thread, snipped] > > > > I strongly disagree with annotating tests with failure rates. [...]
> > > > If a test is flaky I think that the root cause of the flakiness must be determined and fixed. > > > That is true for some use cases, but unfortunately, the flaky > > > fstests are way too valuable and too hard to replace or improve, > > > so practically, fs developers have to run them, but not everyone does. > > Everyone *should* be running them. They find *new bugs*, and it > > doesn't matter how old the kernel is. e.g. if you're backporting XFS > > log changes and you aren't running the "flakey" recoveryloop group > > tests, then you are *not testing failure handling log recovery > > sufficiently*. > > Where do you draw the line? recoveryloop tests that shut down and > > recover the filesystem will find bugs, they are guaranteed to be > > flakey, and they are *absolutely necessary* to be run because they > > are the only tests that exercise that critical filesystem crash > > recovery functionality. > > What do we actually gain by excluding these "non-deterministic" > > tests from automated QA environments? > Automated QA environment is a broad term. > We all have automated QA environments. > But there is a specific class of automated test env, such as CI build bots, > that do not tolerate human intervention. Those environments need curated test lists because failures gate progress. They also tend to run in resource-limited environments, as fstests is not the only set of tests that are run. Hence, generally speaking, CI is not an environment where you'd be running a full "auto" group set of tests. Even the 'quick' group (which takes an hour to run here) is often far too time and resource intensive for a CI system to use effectively. IOWs, we can't easily curate a set of tests that are appropriate for all CI environments - it's up to the people running the CI environment to determine what level of testing is appropriate for gating commits to their source tree, not the fstests maintainers or developers...
> Is it enough to run only the deterministic tests to validate xfs code? > No it is not. > > My LTS environment is human monitored - I look at every failure and > analyse the logs and look at historic data to decide if they are regressions > or not. A bot simply cannot do that. > The bot can go back and run the test N times on baseline vs patch. > > The question is, do we want kernel test bot to run -g auto -x soak > on linux-next and report issues to us? > > I think the answer to this question should be yes. > > Do we want kernel test bot to run -g auto and report flaky test > failures to us? > > I am pretty sure that the answer is no. My answer to both is *yes, absolutely*. The zero-day kernel test bot runs all sorts of non-deterministic tests, including performance regression testing. We want these flakey/non-deterministic tests run in such environments, because they are often configurations we do not have access to and/or would never even consider. e.g. 128p server with a single HDD running IO scalability tests like AIM7... This is exactly where such automated testing provides developers with added value - it covers both hardware and software configs that individual developers cannot exercise themselves. Developers may or may not pay attention to those results depending on the test that "fails" and the hardware it "failed" on, but the point is that it got tested on something we'd never get coverage on otherwise. > > The test run policy mechanisms we already have avoid > > this whole can of worms - we don't need to care about the specific > > test requirements of any specific test environment because the > > default is inclusive and it is trivial to exclude tests from that > > default set if needed. > > > > I had the humble notion that we should make running fstests > as easy as possible for passing-by developers, because I have had the > chance to get feedback from some developers on their first time > attempt to run fstests and it wasn't pleasant, but nevermind. 
> -g auto -x soak is fine. I think that the way to do this is the way Ted has described - wrap fstests in an environment where all the required knowledge is already encapsulated and the "drive by testers" just need to crank the handle and it churns out results. As it is, I don't think making things easier for "drive-by" testing at the expense of making things arbitrarily different and/or harder for people who use it every day is a good trade-off. The "target market" for fstests is *filesystem developers* and people who spend their working life *testing filesystems*. The number of people who do *useful* "drive-by" testing of filesystems is pretty damn small, and IMO that niche is nowhere near as important as making things better for the people who use fstests every day.... > When you think about it, many fs developers run ./check -g auto, > so we should not interfere with that, but I bet very few run './check'? > so we could make the default for './check' some group combination > that is as deterministic as possible. Bare ./check invocations are designed to run every test, regardless of what group they are in. Stop trying to redefine longstanding existing behaviour - if you want to define "deterministic" tests so that you can run just those tests, define a group for it, add all the tests to it, and then document it in the README as "if you have no experience with fstests, this is where you should start". Good luck keeping that up to date, though, as you're now back to the same problem that Ted describes, which is the "deterministic" group changes based on kernel, filesystem, config, etc. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 32+ messages in thread
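The nomenclature proposed at the top of the thread (run a test repeatedly until it fails, report the number of successful runs, and record that as an `F:1/N` annotation) is small enough to sketch. The helper names below are invented for illustration, and the actual `./check <test>` invocation is abstracted behind a callable; this is not code from fstests or kdevops:

```python
def run_until_failure(run_test, max_runs=100):
    """Run `run_test()` repeatedly until it returns False (a failure).

    Returns the 1-based run number of the first failure, or None if the
    test passed `max_runs` times in a row.
    """
    for n in range(1, max_runs + 1):
        if not run_test():
            return n
    return None

def annotate(test_name, first_failure_run):
    """Format an expunge-style annotation from the first observed failure."""
    if first_failure_run is None:
        return test_name  # no annotation: no failure was observed
    return f"{test_name}  # F:1/{first_failure_run}"

# Example with a simulated flaky test that fails on its 15th run.
runs = iter(range(1, 1000))
first_fail = run_until_failure(lambda: next(runs) != 15)
print(annotate("generic/530", first_fail))  # generic/530  # F:1/15
```

In practice `run_test` would shell out to `./check generic/530` and inspect the exit status; the `FA:` variant from the proposal would repeat this loop several times and average the observed run counts.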
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-05 3:11 ` Dave Chinner @ 2022-07-06 10:11 ` Amir Goldstein 2022-07-06 14:29 ` Theodore Ts'o 0 siblings, 1 reply; 32+ messages in thread From: Amir Goldstein @ 2022-07-06 10:11 UTC (permalink / raw) To: Dave Chinner Cc: Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Tue, Jul 5, 2022 at 6:11 AM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Jul 04, 2022 at 10:58:22AM +0300, Amir Goldstein wrote: > > On Mon, Jul 4, 2022 at 6:25 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > > > > On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@acm.org> wrote: > > > > > > > > > > On 5/18/22 20:07, Luis Chamberlain wrote: > > > > > > I've been promoting the idea that running fstests once is nice, > > > > > > but things get interesting if you try to run fstests multiple > > > > > > times until a failure is found. It turns out at least kdevops has > > > > > > found tests which fail with a failure rate of typically 1/2 to > > > > > > 1/30 average failure rate. That is 1/2 means a failure can happen > > > > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > > > > > failure. > > > > > > > > > > > > I have tried my best to annotate failure rates when I know what > > > > > > they might be on the test expunge list, as an example: > > > > > > > > > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > > > > > > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > > > > > to propose to standardize a way to represent this. 
How about > > > > > > > > > > > > generic/530 # F:1/15 > > > > > > > > > > > > Then we could extend the definition. F being the current estimate, and this > > > > > > can be just how long it took to find the first failure. A more valuable > > > > > > figure would be the failure rate average, so running the test multiple > > > > > > times, say 10, to see what the failure rate is and then averaging the > > > > > > failures out. So this could be a more accurate representation. For this > > > > > > how about: > > > > > > > > > > > > generic/530 # FA:1/15 > > > > > > > > > > > > This would mean on average the failure rate has been found to be about > > > > > > 1/15, and this was determined based on 10 runs. > > > > > > > > > > > > We should also go extend check for fstests/blktests to run a test > > > > > > until a failure is found and report back the number of successes. > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > Note: yes, failure rates lower than 1/100 do exist but they are rare > > > > > > creatures. I love them though, as my experience shows so far that they > > > > > > uncover hidden bones in the closet, and they may take months and > > > > > > a lot of eyeballs to resolve. > > > > > > > > > > I strongly disagree with annotating tests with failure rates. My opinion > > > > > is that on a given test setup a test either should pass 100% of the time > > > > > or fail 100% of the time. If a test passes in one run and fails in > > > > > another run that either indicates a bug in the test or a bug in the > > > > > software that is being tested. Examples of behaviors that can cause > > > > > tests to behave unpredictably are use-after-free bugs and race > > > > > conditions. How likely it is to trigger such behavior depends on a > > > > > number of factors. This could even depend on external factors like which > > > > > network packets are received from other systems. I do not expect that > > > > > flaky tests have an exact failure rate. 
Hence my opinion that flaky > > > > > tests are not useful and also that it is not useful to annotate flaky > > > > > tests with a failure rate. If a test is flaky I think that the root > > > > > cause of the flakiness must be determined and fixed. > > > > > > > > > > > > That is true for some use cases, but unfortunately, the flaky > > > > fstests are way too valuable and too hard to replace or improve, > > > > so practically, fs developers have to run them, but not everyone does. > > > > > > Everyone *should* be running them. They find *new bugs*, and it > > > doesn't matter how old the kernel is. e.g. if you're backporting XFS > > > log changes and you aren't running the "flakey" recoveryloop group > > > tests, then you are *not testing failure handling log recovery > > > sufficiently*. > > > > > > Where do you draw the line? recoveryloop tests that shutdown and > > > recover the filesystem will find bugs, they are guaranteed to be > > > flakey, and they are *absolutely necessary* to be run because they > > > are the only tests that exercise that critical filesystem crash > > > recovery functionality. > > > > > > What do we actually gain by excluding these "non-deterministic" > > > tests from automated QA environments? > > > > > > > Automated QA environment is a broad term. > > We all have automated QA environments. > > But there is a specific class of automated test env, such as CI build bots > > that do not tolerate human intervention. > > Those environments need curated test lists because > failures gate progress. They also tend to run in resource > limited environments as fstests is not the only set of tests that > are run. Hence, generally speaking, CI is not an environment where you'd > be running the full "auto" group set of tests. Even the 'quick' group > (which takes an hour to run here) is often far too time and resource > intensive for a CI system to use effectively. 
> > IOWs, we can't easily curate a set of tests that are appropriate for > all CI environments - it's up to the people running the CI > environment to determine what level of testing is appropriate for > gating commits to their source tree, not the fstests maintainers or > developers... > OK. I think that is the way CIFS is doing CI - Running a whitelist of (probably quick) tests known to pass on cifs. > > Is it enough to run only the deterministic tests to validate xfs code? > > No it is not. > > > > My LTS environment is human monitored - I look at every failure and > > analyse the logs and look at historic data to decide if they are regressions > > or not. A bot simply cannot do that. > > The bot can go back and run the test N times on baseline vs patch. > > > > The question is, do we want kernel test bot to run -g auto -x soak > > on linux-next and report issues to us? > > > > I think the answer to this question should be yes. > > > > Do we want kernel test bot to run -g auto and report flaky test > > failures to us? > > > > I am pretty sure that the answer is no. > > My answer to both is *yes, absolutely*. > Ok. > The zero-day kernel test bot runs all sorts of non-deterministic > tests, including performance regression testing. We want these > flakey/non-deterministic tests run in such environments, because > they are often configurations we do not have access to and/or would > never even consider. e.g. 128p server with a single HDD running IO > scalability tests like AIM7... > > This is exactly where such automated testing provides developers > with added value - it covers both hardware and software configs that > individual developers cannot exercise themselves. Developers may or > may not pay attention to those results depending on the test that > "fails" and the hardware it "failed" on, but the point is that it > got tested on something we'd never get coverage on otherwise. 
> So I am wondering what is the status today, because I rarely see fstests failure reports from kernel test bot on the list, but there are some reports. Does anybody have a clue what hw/fs/config/group of fstests kernel test bot is running on linux-next? Did any fs maintainer communicate with the kernel test bot maintainer about this? > > > The test run policy mechanisms we already have avoid > > > this whole can of worms - we don't need to care about the specific > > > test requirements of any specific test environment because the > > > default is inclusive and it is trivial to exclude tests from that > > > default set if needed. > > > > > > > I had the humble notion that we should make running fstests > > as easy as possible for passing-by developers, because I have had the > > chance to get feedback from some developers on their first time > > attempt to run fstests and it wasn't pleasant, but nevermind. > > -g auto -x soak is fine. > > I think that the way to do this is the way Ted has described - wrap > fstests in an environment where all the required knowledge is > already encapsulated and the "drive by testers" just need to crank > the handle and it churns out results. > > As it is, I don't think making things easier for "drive-by" testing > at the expense of making things arbitrarily different and/or harder > for people who use it every day is a good trade-off. The "target > market" for fstests is *filesystem developers* and people who spend > their working life *testing filesystems*. The number of people who > do *useful* "drive-by" testing of filesystems is pretty damn small, > and IMO that niche is nowhere near as important as making things > better for the people who use fstests every day.... > I agree that using fstests runners for drive-by testing makes more sense. > > When you think about it, many fs developers run ./check -g auto, > > so we should not interfere with that, but I bet very few run './check'? 
> > so we could make the default for './check' some group combination > > that is as deterministic as possible. > > Bare ./check invocations are designed to run every test, regardless > of what group they are in. > > Stop trying to redefine longstanding existing behaviour - if you > want to define "deterministic" tests so that you can run just those > tests, define a group for it, add all the tests to it, and then > document it in the README as "if you have no experience with > fstests, this is where you should start". OK. Or better yet, use an fstests runner. > > Good luck keeping that up to date, though, as you're now back to the > same problem that Ted describes, which is the "deterministic" group > changes based on kernel, filesystem, config, etc. > It's true. I think there is some room for improvement in how tests are classified in the fstests repo. I will elaborate in my reply to Ted. Thanks, Amir. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-06 10:11 ` Amir Goldstein @ 2022-07-06 14:29 ` Theodore Ts'o 2022-07-06 16:35 ` Amir Goldstein 0 siblings, 1 reply; 32+ messages in thread From: Theodore Ts'o @ 2022-07-06 14:29 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Wed, Jul 06, 2022 at 01:11:16PM +0300, Amir Goldstein wrote: > > So I am wondering what is the status today, because I rarely > see fstests failure reports from kernel test bot on the list, but there > are some reports. > > Does anybody have a clue what hw/fs/config/group of fstests > kernel test bot is running on linux-next? The zero-day test bot only reports test regressions. So they have some list of tests that have failed in the past, and they only report *new* test failures. This is not just true for fstests, but it's also true for things like check warnings and compiler warnings --- and I suspect it's for those sorts of reports that caused the zero-day bot to keep state, and to filter out test failures and/or check warnings and/or compiler warnings, so that only new test failures and/or new compiler warnings are reported. If they didn't, they would be spamming kernel developers, and given how.... "kind and understanding" kernel developers are at getting spammed, especially when sometimes the complaints are bogus ones (either test bugs or compiler bugs), my guess is that they did the filtering out of sheer self-defense. It certainly wasn't something requested by a file system developer as far as I know. So this is how I think an automated system for "drive-by testers" should work. First, the tester would specify the baseline/origin tag, and the testing system would run the tests on the baseline once. 
Hopefully, the test runner already has exclude files so that kernel bugs that cause an immediate kernel crash or deadlock would already be in the exclude list. But as I've discovered this weekend, for file systems that I haven't tried in a few years, like udf or ubifs, etc., there may be tests missing from the exclude list that cause the test VM to stop responding and/or crash. I have a planned improvement where if you are using gce-xfstests's lightweight test manager, since the LTM is constantly reading the serial console, a deadlock can be detected and the LTM can restart the VM. The VM can then disambiguate between a forced reboot caused by the LTM and a forced shutdown caused by the use of a preemptible VM (a planned feature not yet fully implemented), and the test runner can skip the tests already run, and skip the test which caused the crash or deadlock, and this could be reported so that eventually, the test could be added to the exclude file to benefit those people who are using kvm-xfstests. (This is an example of a planned improvement in xfstests-bld; if someone is interested in helping to implement it, they should give me a ring.) Once the tests which are failing given a particular baseline are known, this state would then get saved, and now the tests can be run on the drive-by developer's changes. We can now compare the known failures for the baseline with those for the changed kernel, and if there are any new failures, there are two possibilities: (a) this was a new failure caused by the drive-by developer's changes, (b) this was a pre-existing known flake. To disambiguate between these two cases, we now run the failed test N times (where N is probably something like 10-50 times; I normally use 25 times) on the changed kernel, and get the failure rate. If the failure rate is 100%, then this is almost certainly (a). 
If the failure rate is < 100% (and greater than 0%), then we need to rerun the failed test on the baseline kernel N times as well; if the failure rate on the baseline is 0%, then we should do a bisection search to determine the guilty commit. If the failure rate is 0%, then this is either an extremely rare flake, in which case we might need to increase N --- or it's an example of a test failure which is sensitive to the order in which tests are run, in which case we may need to rerun all of the tests in order up to the failed test. This is right now what I do when processing patches for upstream. It's also rather similar to what we're doing for the XFS stable backports, because it's much more efficient than running the baseline tests 100 times (which can take a week of continuous testing per Luis's comments) --- we only run tests dozens (or more) of times where a potential flake has been found, as opposed to *all* tests. It's all done manually, but it would be great if we could automate this to make life easier for XFS stable backporters, and *also* for drive-by developers. And again, if anyone is interested in helping with this, especially if you're familiar with shell, python 3, and/or the Go language, please contact me off-line. Cheers, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
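The triage flow Ted describes above reduces to a small decision procedure. A sketch under the assumption that the failure counts for both kernels have already been gathered by rerunning the test; the function name and return labels are invented for illustration, and the expensive parts (the reruns, the bisection, the ordered replay) are left out:

```python
def classify(fail_changed, fail_baseline, n_runs=25):
    """Classify a newly-seen test failure, per the triage flow above.

    fail_changed / fail_baseline: failures observed in n_runs on the
    patched kernel and on the baseline kernel, respectively.
    """
    rate_changed = fail_changed / n_runs
    if rate_changed == 1.0:
        return "regression"          # fails every time: almost certainly the patch
    if rate_changed > 0.0:
        if fail_baseline == 0:
            return "bisect"          # flaky only on the patched kernel
        return "pre-existing flake"  # flaky on both kernels: a known flake
    # 0% on the rerun: very rare flake, or a test-ordering sensitivity
    return "increase N or replay test order"
```

For example, `classify(25, 0)` returns `"regression"`, while `classify(3, 2)` returns `"pre-existing flake"`; this only captures the branching logic, not the test execution itself.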
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-06 14:29 ` Theodore Ts'o @ 2022-07-06 16:35 ` Amir Goldstein 0 siblings, 0 replies; 32+ messages in thread From: Amir Goldstein @ 2022-07-06 16:35 UTC (permalink / raw) To: Theodore Ts'o Cc: Dave Chinner, Bart Van Assche, Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-block, Pankaj Raghav, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests, Zorro Lang, Matthew Wilcox On Wed, Jul 6, 2022 at 5:30 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Wed, Jul 06, 2022 at 01:11:16PM +0300, Amir Goldstein wrote: > > > > So I am wondering what is the status today, because I rarely > > see fstests failure reports from kernel test bot on the list, but there > > are some reports. > > > > Does anybody have a clue what hw/fs/config/group of fstests > > kernel test bot is running on linux-next? > > The zero-day test bot only reports test regressions. So they have some > list of tests that have failed in the past, and they only report *new* > test failures. This is not just true for fstests, but it's also true > for things like check warnings and compiler warnings --- and I suspect > it's for those sorts of reports that caused the zero-day bot to keep > state, and to filter out test failures and/or check warnings and/or > compiler warnings, so that only new test failures and/or new compiler > warnings are reported. If they didn't, they would be spamming kernel > developers, and given how.... "kind and understanding" kernel > developers are at getting spammed, especially when sometimes the > complaints are bogus ones (either test bugs or compiler bugs), my > guess is that they did the filtering out of sheer self-defense. It > certainly wasn't something requested by a file system developer as far > as I know. > > > So this is how I think an automated system for "drive-by testers" > should work. 
First, the tester would specify the baseline/origin tag, > and the testing system would run the tests on the baseline once. > Hopefully, the test runner already has exclude files so that kernel > bugs that cause an immediate kernel crash or deadlock would already > be in the exclude list. But as I've discovered this weekend, for file > systems that I haven't tried in a few years, like udf or > ubifs, etc., there may be tests missing from the exclude list that cause the test VM to > stop responding and/or crash. > > I have a planned improvement where if you are using gce-xfstests's > lightweight test manager, since the LTM is constantly reading the > serial console, a deadlock can be detected and the LTM can restart the > VM. The VM can then disambiguate between a forced reboot caused by the > LTM and a forced shutdown caused by the use of a preemptible VM (a > planned feature not yet fully implemented), and the test runner > can skip the tests already run, and skip the test which caused the > crash or deadlock, and this could be reported so that eventually, the > test could be added to the exclude file to benefit those people who > are using kvm-xfstests. (This is an example of a planned improvement > in xfstests-bld; if someone is interested in helping to implement > it, they should give me a ring.) > > Once the tests which are failing given a particular baseline are > known, this state would then get saved, and now the tests can be > run on the drive-by developer's changes. We can now compare the known > failures for the baseline with those for the changed kernel, and if there are > any new failures, there are two possibilities: (a) this was a new > failure caused by the drive-by developer's changes, (b) this was a > pre-existing known flake. > > To disambiguate between these two cases, we now run the failed test N > times (where N is probably something like 10-50 times; I normally use > 25 times) on the changed kernel, and get the failure rate. 
If the > failure rate is 100%, then this is almost certainly (a). If the > failure rate is < 100% (and greater than 0%), then we need to rerun > the failed test on the baseline kernel N times as well; if the failure > rate on the baseline is 0%, then we should do a bisection search to determine the > guilty commit. > > If the failure rate is 0%, then this is either an extremely rare > flake, in which case we might need to increase N --- or it's an > example of a test failure which is sensitive to the order in which tests > are run, in which case we may need to rerun all of the tests > in order up to the failed test. > > This is right now what I do when processing patches for upstream. > It's also rather similar to what we're doing for the XFS stable > backports, because it's much more efficient than running the baseline > tests 100 times (which can take a week of continuous testing per > Luis's comments) --- we only run tests dozens (or more) of times where a > potential flake has been found, as opposed to *all* tests. It's all > done manually, but it would be great if we could automate this to make > life easier for XFS stable backporters, and *also* for drive-by > developers. > This process sounds like it could get us to mostly unattended regression testing, so it sounds good. I do wonder whether there is more that fstests developers can do to assist, by annotating new (and existing) tests to aid in that effort. For example, there might be a case to tag a test as "this is a very reliable test that should have no failures at all - if there is a failure then something is surely wrong". I wonder if it would help to have a group like that and how many tests that group would include. > And again, if anyone is interested in helping with this, especially if > you're familiar with shell, python 3, and/or the Go language, please > contact me off-line. > Please keep me in the loop; if you have a prototype I may be able to help test it. Thanks, Amir. 
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-02 21:48 ` Bart Van Assche 2022-07-03 5:56 ` Amir Goldstein @ 2022-07-03 13:32 ` Theodore Ts'o 2022-07-03 14:54 ` Bart Van Assche 2022-07-07 21:06 ` Luis Chamberlain 1 sibling, 2 replies; 32+ messages in thread From: Theodore Ts'o @ 2022-07-03 13:32 UTC (permalink / raw) To: Bart Van Assche Cc: Luis Chamberlain, linux-fsdevel, linux-block, amir73il, pankydev8, josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen On Sat, Jul 02, 2022 at 02:48:12PM -0700, Bart Van Assche wrote: > > I strongly disagree with annotating tests with failure rates. My opinion is > that on a given test setup a test either should pass 100% of the time or > fail 100% of the time. My opinion is also that no child should ever go to bed hungry, and we should end world hunger. However, meanwhile, in the real world, while we can *strive* to eliminate all flaky tests, whether they are caused by buggy tests or buggy kernel code, there's an old saying that the only time code is bug-free is when it is no longer being used. That being said, I completely agree that annotating failure rates in xfstests-dev upstream probably doesn't make much sense. As we've stated before, it is highly dependent on the hardware configuration and kernel version (remember, sometimes flaky tests are caused by bugs in other kernel subsystems --- including the loop device, which has not historically been bug-free(tm) either, and so bugs come and go across the entire kernel surface). I believe the best way to handle this is to have better test results analysis tools. We can certainly consider having some shared test results database, but I'm not convinced that flat text files shared via git are sufficiently scalable. The final thing I'll note is that we've lived with low probability flakes for a very long time, and it hasn't been the end of the world. 
Sometime in 2011 or 2012, when I first started at Google and when we first started rolling out ext4 to all of our data centers, once or twice a month --- across the entire world-wide fleet --- there would be an unexplained file system corruption that had remarkably similar characteristics. It took us several months to run it down, and it turned out to be a lock getting released one C statement too soon. When I did some further archeological research, it turned out it had been upstream for well over a *decade* --- in ext3 and ext4 --- and had not been noticed in at least 3 or 4 enterprise distro GA testing/qualification cycles. Or rather, it might have been noticed, but since it couldn't be replicated, I'm guessing the QA testers shrugged, assumed that it *must* have been due to some cosmic ray, or some such, and moved on. > If a test is flaky I think that the root cause of the flakiness must > be determined and fixed. In the ideal world, sure. Then again, in the ideal world, we wouldn't have thousands of people getting killed over border disputes and because some maniacal world leader thinks that it's A-OK to overrun the borders of adjacent countries. However, until we have infinite resources available to us, the reality is that we need to live with the fact that life is imperfect, despite all of our efforts to reduce these sorts of flaky tests --- especially when we're talking about esoteric test configurations that most users won't be using. (Or when they are triggered by test code that is not used in production, but for which the error injection or shutdown simulation code is itself not perfect.) Cheers, - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 13:32 ` Theodore Ts'o @ 2022-07-03 14:54 ` Bart Van Assche 2022-07-07 21:16 ` Luis Chamberlain 2022-07-07 21:06 ` Luis Chamberlain 1 sibling, 1 reply; 32+ messages in thread From: Bart Van Assche @ 2022-07-03 14:54 UTC (permalink / raw) To: Theodore Ts'o Cc: Luis Chamberlain, linux-fsdevel, linux-block, amir73il, pankydev8, josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen On 7/3/22 06:32, Theodore Ts'o wrote: > On Sat, Jul 02, 2022 at 02:48:12PM -0700, Bart Van Assche wrote: >> >> I strongly disagree with annotating tests with failure rates. My opinion is >> that on a given test setup a test either should pass 100% of the time or >> fail 100% of the time. > > My opinion is also that no child should ever go to bed hungry, and we > should end world hunger. In my view the above comment is unfair. The first year after I wrote the SRP tests in blktests I submitted multiple fixes for kernel bugs encountered by running these tests. Although it took a significant effort, after about one year the test itself and the kernel code it triggered finally resulted in reliable operation of the test. After that initial stabilization period these tests uncovered regressions in many kernel development cycles, even in the v5.19-rc cycle. Since I'm not very familiar with xfstests I do not know what makes the stress tests in this test suite fail. Would it be useful to modify the code that decides the test outcome to remove the flakiness, e.g. by only checking that the stress tests do not trigger any unwanted behavior, e.g. kernel warnings or filesystem inconsistencies? Thanks, Bart. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges 2022-07-03 14:54 ` Bart Van Assche @ 2022-07-07 21:16 ` Luis Chamberlain 0 siblings, 0 replies; 32+ messages in thread From: Luis Chamberlain @ 2022-07-07 21:16 UTC (permalink / raw) To: Bart Van Assche Cc: Theodore Ts'o, linux-fsdevel, linux-block, amir73il, pankydev8, josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen On Sun, Jul 03, 2022 at 07:54:11AM -0700, Bart Van Assche wrote: > On 7/3/22 06:32, Theodore Ts'o wrote: > > On Sat, Jul 02, 2022 at 02:48:12PM -0700, Bart Van Assche wrote: > > > > > > I strongly disagree with annotating tests with failure rates. My opinion is > > > that on a given test setup a test either should pass 100% of the time or > > > fail 100% of the time. > > > > My opinion is also that no child should ever go to bed hungry, and we > > should end world hunger. > > In my view the above comment is unfair. The first year after I wrote the > SRP tests in blktests I submitted multiple fixes for kernel bugs encountered > by running these tests. Although it took a significant effort, after about > one year the test itself and the kernel code it triggered finally resulted > in reliable operation of the test. After that initial stabilization period > these tests uncovered regressions in many kernel development cycles, even in > the v5.19-rc cycle. > > Since I'm not very familiar with xfstests I do not know what makes the > stress tests in this test suite fail. Would it be useful to modify the code > that decides the test outcome to remove the flakiness, e.g. by only checking > that the stress tests do not trigger any unwanted behavior, e.g. kernel > warnings or filesystem inconsistencies? Filesystems and the block layer are built on top of tons of things in the kernel, and those layers can introduce non-determinism. 
To rule out determinism we must first rule out undeterminism in other areas of the kernel, and that will take a long time. Things like kunit tests will help here, along with adding more tests to other smaller layers. The list is long. At LSFMM I mentioned how blktests block/009 had an odd failure rate of about 1/669 a while ago. The issue was real, and it took a while to figure out what the real issue was. Jan Kara's patches solved these issues and they are not trivial to backport to ancient enterprise kernels ;) Another more recent one was the undeterministic RCU cpu stall warnings with a failure rate of about 1/80 on zbd/006 and that lead to some interesting revelations about how qemu's use of discard was shitty and just needed to be enhanced. Yes, you can probably make zbd/006 more atomic and split it into 10 tests, but I don't think we can escape the lack of determinism in certain areas of the kernel. We can *work to improve* it, but again, that will take time, and I am not quite sure many folks really want that too. Luis ^ permalink raw reply [flat|nested] 32+ messages in thread
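[Editor's sketch: failure rates such as 1/669 or 1/80 come from running one test over and over until it fails, as proposed at the top of the thread. The bookkeeping for the `F:1/N` annotation might look like the sketch below; `run_test` is a stand-in for invoking a single fstests/blktests test, not a real API.]

```python
from typing import Callable

def failure_rate(run_test: Callable[[], bool], max_runs: int) -> str:
    """Run a test repeatedly and report an F:1/N style annotation,
    where N is the run count at which the first failure appeared."""
    for run in range(1, max_runs + 1):
        if not run_test():        # False means the test run failed
            return f"F:1/{run}"
    return f"no failure in {max_runs} runs"

# Example with a deterministic stand-in that fails on the 15th run.
outcomes = iter([True] * 14 + [False])
print(failure_rate(lambda: next(outcomes), 100))  # prints "F:1/15"
```

An averaged `FA:1/N` figure would repeat this whole loop several times (say, 10) and average the observed N values, per the proposal at the start of the thread.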
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
  2022-07-03 13:32               ` Theodore Ts'o
  2022-07-03 14:54                 ` Bart Van Assche
@ 2022-07-07 21:06                 ` Luis Chamberlain
  1 sibling, 0 replies; 32+ messages in thread
From: Luis Chamberlain @ 2022-07-07 21:06 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Bart Van Assche, linux-fsdevel, linux-block, amir73il, pankydev8,
	josef, jmeneghi, Jan Kara, Davidlohr Bueso, Dan Williams,
	Jake Edge, Klaus Jensen

On Sun, Jul 03, 2022 at 09:32:03AM -0400, Theodore Ts'o wrote:
> On Sat, Jul 02, 2022 at 02:48:12PM -0700, Bart Van Assche wrote:
> >
> > I strongly disagree with annotating tests with failure rates. My opinion is
> > that on a given test setup a test either should pass 100% of the time or
> > fail 100% of the time.
>
> in the real world, while we can *strive* to
> eliminate all flaky tests, whether it is caused by buggy tests, or
> buggy kernel code, there's an old saying that the only time code is
> bug-free is when it is no longer being used.

Agreed, but I will shortly provide a bit more proof related to the block
layer in a reply to Bart. I thought I made the case clear enough at
LSFMM, but I suppose not.

> That being said, I completely agree that annotating failure rates in
> xfstests-dev upstream probably doesn't make much sense. As we've
> stated before, it is highly dependent on the hardware configuration,
> and kernel version (remember, sometimes flaky tests are caused by bugs
> in other kernel subsystems --- including the loop device, which has
> not historically been bug-free(tm) either, and so bugs come and go
> across the entire kernel surface).

That does not eliminate the possible value of having failure rates for
the minimum virtualized storage arrangement you can have with either
loopback devices or LVM volumes. Nor does it eliminate the possibility
of, say, coming up with generic system names. Just as 0-day has names
for kernel configs, we can easily come up with names for hw profiles.

> I believe the best way to handle this is to have better test results
> analysis tools.

We're going to evaluate an ELK stack for this, but there is a difference
between historical data for random runs vs what may be useful
generically.

> We can certainly consider having some shared test
> results database, but I'm not convinced that flat text files shared
> via git is sufficiently scalable.

How data is stored is secondary; the first order of business is whether
sharing any of this information may be useful to others. I have results
dating back to 4.17.3, for each kernel supported, and I have found it
very valuable. I figured it may be, but if there is no agreement on it,
we can just keep that on kdevops as-is and move forward with our own
nomenclature for hw profiles.

  Luis

^ permalink raw reply	[flat|nested] 32+ messages in thread
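[Editor's sketch: the flat-text expunge files discussed here would carry the proposed `F:1/15` / `FA:1/15` annotations, which are straightforward to parse mechanically. The grammar below follows the proposal at the top of the thread; the function and regex names are made up for illustration.]

```python
import re
from typing import Optional, Tuple

# Matches expunge lines such as:
#   generic/530 # F:1/15
#   generic/530 # FA:1/15
_EXPUNGE_RE = re.compile(
    r"^(?P<test>\S+)"                             # test name, e.g. generic/530
    r"(?:\s*#\s*(?P<kind>FA?):1/(?P<runs>\d+))?"  # optional rate annotation
)

def parse_expunge_line(line: str) -> Optional[Tuple[str, Optional[str], Optional[int]]]:
    """Return (test, kind, runs), where kind is 'F' (first-failure
    estimate) or 'FA' (averaged failure rate), or None for blank and
    comment-only lines.  Unannotated tests yield kind=None, runs=None."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = _EXPUNGE_RE.match(line)
    kind = m.group("kind")
    runs = int(m.group("runs")) if m.group("runs") else None
    return (m.group("test"), kind, runs)
```

A tool like this could aggregate annotations across kernel versions and hw-profile names, which is roughly the shared-results use case being debated here.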
end of thread, other threads:[~2022-07-07 21:36 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-19  3:07 [RFC: kdevops] Standardizing on failure rate nomenclature for expunges Luis Chamberlain
2022-05-19  6:36 ` Amir Goldstein
2022-05-19  7:58   ` Dave Chinner
2022-05-19  9:20     ` Amir Goldstein
2022-05-19 15:36       ` Josef Bacik
2022-05-19 16:18         ` Zorro Lang
2022-05-19 11:24     ` Zorro Lang
2022-05-19 14:18       ` Theodore Ts'o
2022-05-19 15:10         ` Zorro Lang
2022-05-19 14:58   ` Matthew Wilcox
2022-05-19 15:44     ` Zorro Lang
2022-05-19 16:06       ` Matthew Wilcox
2022-05-19 16:54         ` Zorro Lang
2022-07-01 23:36   ` Luis Chamberlain
2022-07-02 17:01     ` Theodore Ts'o
2022-07-07 21:36       ` Luis Chamberlain
2022-07-02 21:48 ` Bart Van Assche
2022-07-03  5:56   ` Amir Goldstein
2022-07-03 13:15     ` Theodore Ts'o
2022-07-03 14:22       ` Amir Goldstein
2022-07-03 16:30         ` Theodore Ts'o
2022-07-04  3:25           ` Dave Chinner
2022-07-04  7:58             ` Amir Goldstein
2022-07-05  2:29               ` Theodore Ts'o
2022-07-05  3:11                 ` Dave Chinner
2022-07-06 10:11                   ` Amir Goldstein
2022-07-06 14:29                     ` Theodore Ts'o
2022-07-06 16:35                       ` Amir Goldstein
2022-07-03 13:32   ` Theodore Ts'o
2022-07-03 14:54     ` Bart Van Assche
2022-07-07 21:16       ` Luis Chamberlain
2022-07-07 21:06     ` Luis Chamberlain