* fuzzing bcachefs with dm-flakey
@ 2023-05-29 20:59 Mikulas Patocka
  2023-05-29 21:14 ` Matthew Wilcox
  ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Mikulas Patocka @ 2023-05-29 20:59 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, dm-devel, linux-fsdevel

Hi

I improved the dm-flakey device mapper target so that it can do random corruption of read and write bios - I uploaded it here:
https://people.redhat.com/~mpatocka/testcases/bcachefs/dm-flakey.c

I set up dm-flakey so that it corrupts 10% of read bios and 10% of write bios with this command:
dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000"

I created a bcachefs volume on a single disk (metadata and data checksums were turned off) and mounted it on dm-flakey. I got:

crash: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash1.txt
deadlock: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash2.txt
infinite loop: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash3.txt

Here I uploaded an image that causes an infinite loop when we run bcachefs fsck on it or when we attempt to mount it:
https://people.redhat.com/~mpatocka/testcases/bcachefs/inf-loop.gz

I tried to run bcachefs on two block devices and fuzzing just one of them (checksums and replication were turned on - so bcachefs should correct the corrupted data) - in this scenario, bcachefs doesn't return invalid data, but it sometimes returns errors and sometimes crashes.
This script will trigger an oops on unmount:
https://people.redhat.com/~mpatocka/testcases/bcachefs/crash4.txt
or nonsensical errors returned to userspace:
rm: cannot remove '/mnt/test/test/cmd_migrate.c': Unknown error 2206
or I/O errors returned to userspace:
diff: /mnt/test/test/rust-src/target/release/.fingerprint/bch_bindgen-f0bad16858ff0019/lib-bch_bindgen.json: Input/output error

#!/bin/sh -ex
umount /mnt/test || true
dmsetup remove_all || true
rmmod brd || true
SRC=/usr/src/git/bcachefs-tools
while true; do
  modprobe brd rd_size=1048576
  bcachefs format --replicas=2 /dev/ram0 /dev/ram1
  dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` linear /dev/ram0 0"
  mount -t bcachefs /dev/mapper/flakey:/dev/ram1 /mnt/test
  dmsetup load flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000"
  dmsetup suspend flakey
  dmsetup resume flakey
  cp -a "$SRC" /mnt/test/test
  diff -r "$SRC" /mnt/test/test
  echo 3 >/proc/sys/vm/drop_caches
  diff -r "$SRC" /mnt/test/test
  echo 3 >/proc/sys/vm/drop_caches
  diff -r "$SRC" /mnt/test/test
  echo 3 >/proc/sys/vm/drop_caches
  rm -rf /mnt/test/test
  echo 3 >/proc/sys/vm/drop_caches
  cp -a "$SRC" /mnt/test/test
  echo 3 >/proc/sys/vm/drop_caches
  diff -r "$SRC" /mnt/test/test
  umount /mnt/test
  dmsetup remove flakey
  rmmod brd
done

The oops happens in set_btree_iter_dontneed and it is caused by the fact that iter->path is NULL. The code in try_alloc_bucket is buggy because it sets "struct btree_iter iter = { NULL };" and then jumps to the "err" label that tries to dereference values in "iter".

Bcachefs does not give very useful error messages, like "Fatal error: Unknown error 2184" or "Error in recovery: cannot allocate memory" or "mount(2) system call failed: Unknown error 2186." or "rm: cannot remove '/mnt/test/xfstests-dev/tools/fs-walk': Unknown error 2206".

Mikulas

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey
  2023-05-29 20:59 fuzzing bcachefs with dm-flakey Mikulas Patocka
@ 2023-05-29 21:14 ` Matthew Wilcox
  2023-05-29 23:12   ` Dave Chinner
  2023-05-30 12:23   ` Mikulas Patocka
  2023-05-29 21:43 ` Kent Overstreet
  2023-06-02  1:13 ` Darrick J. Wong
  2 siblings, 2 replies; 13+ messages in thread
From: Matthew Wilcox @ 2023-05-29 21:14 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: Kent Overstreet, linux-bcachefs, dm-devel, linux-fsdevel

On Mon, May 29, 2023 at 04:59:40PM -0400, Mikulas Patocka wrote:
> Hi
>
> I improved the dm-flakey device mapper target, so that it can do random
> corruption of read and write bios - I uploaded it here:
> https://people.redhat.com/~mpatocka/testcases/bcachefs/dm-flakey.c
>
> I set up dm-flakey, so that it corrupts 10% of read bios and 10% of write
> bios with this command:
> dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000"

I'm not suggesting that any of the bugs you've found are invalid, but 10% seems really high. Is it reasonable to expect any filesystem to cope with that level of broken hardware? Can any of our existing ones cope with that level of flakiness? I mean, I've got some pretty shoddy USB cables, but ...

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey
  2023-05-29 21:14 ` Matthew Wilcox
@ 2023-05-29 23:12   ` Dave Chinner
  2023-05-29 23:51     ` Kent Overstreet
  2023-05-30 12:23   ` Mikulas Patocka
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2023-05-29 23:12 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Mikulas Patocka, Kent Overstreet, linux-bcachefs, dm-devel, linux-fsdevel

On Mon, May 29, 2023 at 10:14:51PM +0100, Matthew Wilcox wrote:
> On Mon, May 29, 2023 at 04:59:40PM -0400, Mikulas Patocka wrote:
> > Hi
> >
> > I improved the dm-flakey device mapper target, so that it can do random
> > corruption of read and write bios - I uploaded it here:
> > https://people.redhat.com/~mpatocka/testcases/bcachefs/dm-flakey.c
> >
> > I set up dm-flakey, so that it corrupts 10% of read bios and 10% of write
> > bios with this command:
> > dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000"
>
> I'm not suggesting that any of the bugs you've found are invalid, but 10%
> seems really high. Is it reasonable to expect any filesystem to cope
> with that level of broken hardware? Can any of our existing ones cope
> with that level of flakiness? I mean, I've got some pretty shoddy USB
> cables, but ...

It's realistic in that when you have lots of individual storage devices, load balanced over all of them, and then one fails completely, we'll see an IO error rate like this. These are the sorts of setups I'd expect to be using erasure coding with bcachefs, so the IO failure rate should be able to head towards 20-30% before actual loss and/or corruption starts occurring.

In this situation, if the failures were isolated to an individual device, then I'd want the filesystem to kick that device out of the backing pool. Hence all the failures go away, and rebuild of the redundancy the erasure coding provides can then take place. i.e.
an IO failure rate this high should be a very short lived incident for a filesystem that directly manages individual devices.

But within a single, small device, it's not a particularly realistic scenario. If it's really corrupting this much active metadata, then the filesystem should be shutting down at the first uncorrectable/unrecoverable metadata error, and every other IO error is then superfluous. Of course, bcachefs might be doing just that - cleanly shutting down an active filesystem is a very hard problem. XFS still has intricate and subtle issues with shutdown of active filesystems that can cause hangs and/or crashes, so I wouldn't expect bcachefs to be able to handle these scenarios completely cleanly at this stage of its development....

Perhaps it is worthwhile running the same tests on btrfs so we have something to compare the bcachefs behaviour to. I suspect that btrfs will fare little better on the single device, no checksums corruption test....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey
  2023-05-29 23:12 ` Dave Chinner
@ 2023-05-29 23:51   ` Kent Overstreet
  0 siblings, 0 replies; 13+ messages in thread
From: Kent Overstreet @ 2023-05-29 23:51 UTC (permalink / raw)
To: Dave Chinner
Cc: Matthew Wilcox, Mikulas Patocka, linux-bcachefs, dm-devel, linux-fsdevel

On Tue, May 30, 2023 at 09:12:44AM +1000, Dave Chinner wrote:
> Perhaps it is worthwhile running the same tests on btrfs so we have
> something to compare the bcachefs behaviour to. I suspect that btrfs
> will fare little better on the single device, no checksums
> corruption test....

It's also a test we _should_ be doing much, much better on: we've got validation code for every key type that we run on every metadata read, so there's no excuse for these bugs and they will be fixed. Just a question of when and in what order...

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey 2023-05-29 21:14 ` Matthew Wilcox 2023-05-29 23:12 ` Dave Chinner @ 2023-05-30 12:23 ` Mikulas Patocka 1 sibling, 0 replies; 13+ messages in thread From: Mikulas Patocka @ 2023-05-30 12:23 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Kent Overstreet, linux-bcachefs, dm-devel, linux-fsdevel On Mon, 29 May 2023, Matthew Wilcox wrote: > On Mon, May 29, 2023 at 04:59:40PM -0400, Mikulas Patocka wrote: > > Hi > > > > I improved the dm-flakey device mapper target, so that it can do random > > corruption of read and write bios - I uploaded it here: > > https://people.redhat.com/~mpatocka/testcases/bcachefs/dm-flakey.c > > > > I set up dm-flakey, so that it corrupts 10% of read bios and 10% of write > > bios with this command: > > dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000" > > I'm not suggesting that any of the bugs you've found are invalid, but 10% > seems really high. Is it reasonable to expect any filesystem to cope > with that level of broken hardware? Can any of our existing ones cope > with that level of flakiness? I mean, I've got some pretty shoddy USB > cables, but ... If you reduce the corruption probability, it will take more iterations to hit the bugs. So, for the "edit-compile-test" loop, the probability should be as high as possible, just to save the developer's time on testing. Mikulas ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey 2023-05-29 20:59 fuzzing bcachefs with dm-flakey Mikulas Patocka 2023-05-29 21:14 ` Matthew Wilcox @ 2023-05-29 21:43 ` Kent Overstreet 2023-05-30 21:00 ` Mikulas Patocka 2023-06-02 1:13 ` Darrick J. Wong 2 siblings, 1 reply; 13+ messages in thread From: Kent Overstreet @ 2023-05-29 21:43 UTC (permalink / raw) To: Mikulas Patocka; +Cc: linux-bcachefs, dm-devel, linux-fsdevel On Mon, May 29, 2023 at 04:59:40PM -0400, Mikulas Patocka wrote: > Hi > > I improved the dm-flakey device mapper target, so that it can do random > corruption of read and write bios - I uploaded it here: > https://people.redhat.com/~mpatocka/testcases/bcachefs/dm-flakey.c > > I set up dm-flakey, so that it corrupts 10% of read bios and 10% of write > bios with this command: > dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000" I've got some existing ktest tests for error injection: https://evilpiepirate.org/git/ktest.git/tree/tests/bcachefs/single_device.ktest#n200 https://evilpiepirate.org/git/ktest.git/tree/tests/bcachefs/replication.ktest#n491 I haven't looked at dm-flakey before, I take it you're silently corrupting data instead of just failing the IOs like these tests do? Let's add what you're doing to ktest, and see if we can merge it with the existing tests. > I created a bcachefs volume on a single disk (metadata and data checksums > were turned off) and mounted it on dm-flakey. I got: > > crash: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash1.txt > deadlock: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash2.txt > infinite loop: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash3.txt Fun! 
> Here I uploaded an image that causes an infinite loop when we run bcachefs
> fsck on it or when we attempt to mount it:
> https://people.redhat.com/~mpatocka/testcases/bcachefs/inf-loop.gz
>
> I tried to run bcachefs on two block devices and fuzzing just one of them
> (checksums and replication were turned on - so bcachefs should correct the
> corrupted data) - in this scenario, bcachefs doesn't return invalid data,
> but it sometimes returns errors and sometimes crashes.
>
> This script will trigger an oops on unmount:
> https://people.redhat.com/~mpatocka/testcases/bcachefs/crash4.txt
> or nonsensical errors returned to userspace:
> rm: cannot remove '/mnt/test/test/cmd_migrate.c': Unknown error 2206
> or I/O errors returned to userspace:
> diff: /mnt/test/test/rust-src/target/release/.fingerprint/bch_bindgen-f0bad16858ff0019/lib-bch_bindgen.json: Input/output error
>
> #!/bin/sh -ex
> umount /mnt/test || true
> dmsetup remove_all || true
> rmmod brd || true
> SRC=/usr/src/git/bcachefs-tools
> while true; do
>   modprobe brd rd_size=1048576
>   bcachefs format --replicas=2 /dev/ram0 /dev/ram1
>   dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` linear /dev/ram0 0"
>   mount -t bcachefs /dev/mapper/flakey:/dev/ram1 /mnt/test
>   dmsetup load flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000"
>   dmsetup suspend flakey
>   dmsetup resume flakey
>   cp -a "$SRC" /mnt/test/test
>   diff -r "$SRC" /mnt/test/test
>   echo 3 >/proc/sys/vm/drop_caches
>   diff -r "$SRC" /mnt/test/test
>   echo 3 >/proc/sys/vm/drop_caches
>   diff -r "$SRC" /mnt/test/test
>   echo 3 >/proc/sys/vm/drop_caches
>   rm -rf /mnt/test/test
>   echo 3 >/proc/sys/vm/drop_caches
>   cp -a "$SRC" /mnt/test/test
>   echo 3 >/proc/sys/vm/drop_caches
>   diff -r "$SRC" /mnt/test/test
>   umount /mnt/test
>   dmsetup remove flakey
>   rmmod brd
> done
>
> The oops happens in set_btree_iter_dontneed and it is caused by the fact
> that iter->path is NULL. The code in try_alloc_bucket is buggy because it
> sets "struct btree_iter iter = { NULL };" and then jumps to the "err"
> label that tries to dereference values in "iter".

Good catches on all of them. Darrick's been on me to get fuzz testing going, looks like it's definitely needed :)

However, there's two things I want in place first before I put much effort into fuzz testing:

- Code coverage analysis.

  ktest used to have integrated code coverage analysis, where you'd tell it a subdirectory of the kernel tree (doing code coverage analysis for the entire kernel is impossibly slow) and it would run tests and then give you the lcov output.

  However, several years ago something about kbuild changed, and the method ktest was using for passing in build flags for a specific subdir on the command line stopped working. I would like to track down someone who understands kbuild and get this working again.

- Fault injection

  Years and years ago, when I was still at Google and this was just bcache, we had fault injection that worked like dynamic debug: you could call dynamic_fault("type of fault") anywhere in your code, and it returned a bool indicating whether that fault had been enabled - and faults were controllable at runtime via debugfs. We had tests that iterated over e.g. faults in the initialization path, or memory allocation failures, and flipped them on one by one and ran $test_workload.

  The memory allocation profiling stuff that Suren and I have been working on includes code tagging, which is for (among other things) a new and simplified implementation of dynamic fault injection, which I'm going to push forward again once the memory allocation profiling stuff gets merged.

The reason I want this stuff is because fuzz testing tends to be a heavyweight, scattershot approach.

I want to be able to look at the code coverage analysis first to e.g. work on a chunk of code at a time and make sure it's tested thoroughly, instead of jumping around in the code at random depending on what fuzz testing finds, and when we are fuzz testing I want to be able to add fault injection points and write unit tests so that we can have much more targeted, quicker to run tests going forward.

Can I get you interested in either of those things? I'd really love to find someone to hand off or collaborate with on the fault injection stuff in particular.

> Bcachefs does not give very useful error messages, like "Fatal error: Unknown
> error 2184" or "Error in recovery: cannot allocate memory" or "mount(2)
> system call failed: Unknown error 2186." or "rm: cannot remove
> '/mnt/test/xfstests-dev/tools/fs-walk': Unknown error 2206".

Those are mostly missing bch2_err_str()/bch2_err_class() calls:
- bch2_err_str(), to print the error string for our private error code
- bch2_err_class(), to convert a private error code to a standard error code before returning it to outside bcachefs code

except "Error in recovery: cannot allocate memory" - that's ancient code that still squashes to -ENOMEM

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey
  2023-05-29 21:43 ` Kent Overstreet
@ 2023-05-30 21:00   ` Mikulas Patocka
  2023-05-30 23:29     ` Kent Overstreet
  0 siblings, 1 reply; 13+ messages in thread
From: Mikulas Patocka @ 2023-05-30 21:00 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, dm-devel, linux-fsdevel

On Mon, 29 May 2023, Kent Overstreet wrote:

> On Mon, May 29, 2023 at 04:59:40PM -0400, Mikulas Patocka wrote:
> > Hi
> >
> > I improved the dm-flakey device mapper target, so that it can do random
> > corruption of read and write bios - I uploaded it here:
> > https://people.redhat.com/~mpatocka/testcases/bcachefs/dm-flakey.c
> >
> > I set up dm-flakey, so that it corrupts 10% of read bios and 10% of write
> > bios with this command:
> > dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000"
>
> I've got some existing ktest tests for error injection:
> https://evilpiepirate.org/git/ktest.git/tree/tests/bcachefs/single_device.ktest#n200
> https://evilpiepirate.org/git/ktest.git/tree/tests/bcachefs/replication.ktest#n491
>
> I haven't looked at dm-flakey before, I take it you're silently
> corrupting data instead of just failing the IOs like these tests do?

Yes, silently corrupting. When I tried to simulate I/O errors with dm-flakey, bcachefs worked correctly - there were no errors returned to userspace and no crashes.

Perhaps it should treat metadata checksum errors in the same way as disk failures?

> Let's add what you're doing to ktest, and see if we can merge it with
> the existing tests.
>
> Good catches on all of them. Darrick's been on me to get fuzz testing
> going, looks like it's definitely needed :)
>
> However, there's two things I want in place first before I put much
> effort into fuzz testing:
>
> - Code coverage analysis.
>
>   ktest used to have integrated code coverage analysis, where you'd tell
>   it a subdirectory of the kernel tree (doing code coverage analysis for
>   the entire kernel is impossibly slow) and it would run tests and then
>   give you the lcov output.
>
>   However, several years ago something about kbuild changed, and the
>   method ktest was using for passing in build flags for a specific subdir
>   on the command line stopped working. I would like to track down someone
>   who understands kbuild and get this working again.
>
> - Fault injection
>
>   Years and years ago, when I was still at Google and this was just
>   bcache, we had fault injection that worked like dynamic debug: you
>   could call dynamic_fault("type of fault") anywhere in your code, and it
>   returned a bool indicating whether that fault had been enabled - and
>   faults were controllable at runtime via debugfs. We had tests that
>   iterated over e.g. faults in the initialization path, or memory
>   allocation failures, and flipped them on one by one and ran
>   $test_workload.
>
>   The memory allocation profiling stuff that Suren and I have been
>   working on includes code tagging, which is for (among other things) a
>   new and simplified implementation of dynamic fault injection, which I'm
>   going to push forward again once the memory allocation profiling stuff
>   gets merged.
>
> The reason I want this stuff is because fuzz testing tends to be a
> heavyweight, scattershot approach.
>
> I want to be able to look at the code coverage analysis first to e.g.
> work on a chunk of code at a time and make sure it's tested thoroughly,
> instead of jumping around in the code at random depending on what fuzz
> testing finds, and when we are fuzz testing I want to be able to add
> fault injection points and write unit tests so that we can have much
> more targeted, quicker to run tests going forward.
>
> Can I get you interested in either of those things? I'd really love to
> find someone to hand off or collaborate with on the fault injection
> stuff in particular.

I'd like to know how you want to do coverage analysis. By instrumenting each branch and creating a test case that tests that the branch goes both ways?

I know that people who write spacecraft-grade software do such tests, but I can't quite imagine how that would work in a filesystem.

"grep -w if fs/bcachefs/*.[ch] | wc -l" shows that there are 5828 conditions. That's one condition for every 15.5 lines.

Mikulas

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey
  2023-05-30 21:00 ` Mikulas Patocka
@ 2023-05-30 23:29   ` Kent Overstreet
  2023-06-09 20:57     ` Mikulas Patocka
  0 siblings, 1 reply; 13+ messages in thread
From: Kent Overstreet @ 2023-05-30 23:29 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: linux-bcachefs, dm-devel, linux-fsdevel

On Tue, May 30, 2023 at 05:00:39PM -0400, Mikulas Patocka wrote:
> I'd like to know how do you want to do coverage analysis? By instrumenting
> each branch and creating a test case that tests that the branch goes both
> ways?

Documentation/dev-tools/gcov.rst. The compiler instruments each branch and then the results are available in debugfs, then the lcov tool produces annotated source code as html output.

> I know that people who write spacecraft-grade software do such tests, but
> I can't quite imagine how would that work in a filesystem.
>
> "grep -w if fs/bcachefs/*.[ch] | wc -l" shows that there are 5828
> conditions. That's one condition for every 15.5 lines.

Most of which are covered by existing tests - but by running the existing tests with code coverage analysis we can see which branches the tests aren't hitting, and then we add fault injection points for those.

With fault injection we can improve test coverage a lot without needing to write any new tests (or only simple ones, e.g. for init/mount errors)

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey
  2023-05-30 23:29 ` Kent Overstreet
@ 2023-06-09 20:57   ` Mikulas Patocka
  2023-06-09 22:17     ` Kent Overstreet
  0 siblings, 1 reply; 13+ messages in thread
From: Mikulas Patocka @ 2023-06-09 20:57 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, dm-devel, linux-fsdevel

On Tue, 30 May 2023, Kent Overstreet wrote:

> On Tue, May 30, 2023 at 05:00:39PM -0400, Mikulas Patocka wrote:
> > I'd like to know how do you want to do coverage analysis? By instrumenting
> > each branch and creating a test case that tests that the branch goes both
> > ways?
>
> Documentation/dev-tools/gcov.rst. The compiler instruments each branch
> and then the results are available in debugfs, then the lcov tool
> produces annotated source code as html output.
>
> > I know that people who write spacecraft-grade software do such tests, but
> > I can't quite imagine how would that work in a filesystem.
> >
> > "grep -w if fs/bcachefs/*.[ch] | wc -l" shows that there are 5828
> > conditions. That's one condition for every 15.5 lines.
>
> Most of which are covered by existing tests - but by running the
> existing tests with code coverage analysis we can see which branches the
> tests aren't hitting, and then we add fault injection points for those.
>
> With fault injection we can improve test coverage a lot without needing
> to write any new tests (or simple ones, for e.g. init/mount errors)

I compiled the kernel with gcov and ran "xfstests-dev" on bcachefs; gcov shows that there is 56% coverage on "fs/bcachefs/*.o".

So, we have 2564 "if" branches (of the total 5828) that were not tested. What are you going to do about them? Will you create a filesystem image for each branch that triggers it? Or will you add 2564 fault-injection points to the source code?

It seems like an extreme amount of work.

Mikulas

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey 2023-06-09 20:57 ` Mikulas Patocka @ 2023-06-09 22:17 ` Kent Overstreet 0 siblings, 0 replies; 13+ messages in thread From: Kent Overstreet @ 2023-06-09 22:17 UTC (permalink / raw) To: Mikulas Patocka; +Cc: linux-bcachefs, dm-devel, linux-fsdevel On Fri, Jun 09, 2023 at 10:57:27PM +0200, Mikulas Patocka wrote: > > > On Tue, 30 May 2023, Kent Overstreet wrote: > > > On Tue, May 30, 2023 at 05:00:39PM -0400, Mikulas Patocka wrote: > > > I'd like to know how do you want to do coverage analysis? By instrumenting > > > each branch and creating a test case that tests that the branch goes both > > > ways? > > > > Documentation/dev-tools/gcov.rst. The compiler instruments each branch > > and then the results are available in debugfs, then the lcov tool > > produces annotated source code as html output. > > > > > I know that people who write spacecraft-grade software do such tests, but > > > I can't quite imagine how would that work in a filesystem. > > > > > > "grep -w if fs/bcachefs/*.[ch] | wc -l" shows that there are 5828 > > > conditions. That's one condition for every 15.5 lines. > > > > Most of which are covered by existing tests - but by running the > > existing tests with code coverage analylis we can see which branches the > > tests aren't hitting, and then we add fault injection points for those. > > > > With fault injection we can improve test coverage a lot without needing > > to write any new tests (or simple ones, for e.g. init/mount errors) > > I compiled the kernel with gcov, I ran "xfstests-dev" on bcachefs and gcov > shows that there is 56% coverage on "fs/bcachefs/*.o". Nice :) I haven't personally looked at the gcov output in ages, you might motivate me to see if I can get the kbuild issue for ktest integration sorted out. Just running xfstests won't exercise a lot of the code though - our own tests are written as ktest tests, and those exercise e.g. 
multiple devices (regular raid mode, tiering, erasure coding), subvolumes/snapshots, all the compression/checksumming/encryption modes, etc. No doubt our test coverage will still need improving :)

> So, we have 2564 "if" branches (of total 5828) that were not tested. What
> are you going to do about them? Will you create a filesystem image for
> each branch that triggers it? Or, will you add 2564 fault-injection points
> to the source code?

Fault injection points will be the first thing to look at, as well as any chunks of code that just have missing tests.

We won't have to manually add individual fault injection points in every case: once code tagging and dynamic fault injection go in, that will give us distinct fault injection points for every memory allocation, and then it's a simple matter to enable a 1% failure rate for all memory allocations in the bcachefs module - we'll do this in bcachefs_antagonist in ktest/tests/bcachefs/bcachefs-test-libs, which runs after mounting. Similarly, we'll also want to add fault injection for transaction restart points.

Fault injection is just the first, easiest thing I want people looking at; it won't be the best tool for the job in all situations. Darrick's also done cool stuff with injecting filesystem errors into the on disk image - he's got a tool that can select which individual field to corrupt - and I want to copy that idea. Our kill_btree_node test (in single_device.ktest) is some very initial work along those lines; we'll want to extend that. And we will definitely still want to be testing with dm-flakey, because no doubt those techniques won't catch everything :)

> It seems like extreme amount of work.

It is a fair amount of work - but it's a more focused kind of work, with a benchmark to look at to know when we're done. In practice, nobody but perhaps automotive & aerospace attains full 100% branch coverage. People generally aim for 80%, and with good, easy to use fault injection I'm hoping we'll be able to hit 90%.
IIRC when we were working on the predecessor to bcachefs and had fault injection available, we were hitting 85-88% code coverage. Granted the codebase was _much_ smaller back then, but it's still not a crazy unattainable goal. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey 2023-05-29 20:59 fuzzing bcachefs with dm-flakey Mikulas Patocka 2023-05-29 21:14 ` Matthew Wilcox 2023-05-29 21:43 ` Kent Overstreet @ 2023-06-02 1:13 ` Darrick J. Wong 2023-06-09 18:56 ` Mikulas Patocka 2 siblings, 1 reply; 13+ messages in thread From: Darrick J. Wong @ 2023-06-02 1:13 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Kent Overstreet, linux-bcachefs, dm-devel, linux-fsdevel On Mon, May 29, 2023 at 04:59:40PM -0400, Mikulas Patocka wrote: > Hi > > I improved the dm-flakey device mapper target, so that it can do random > corruption of read and write bios - I uploaded it here: > https://people.redhat.com/~mpatocka/testcases/bcachefs/dm-flakey.c > > I set up dm-flakey, so that it corrupts 10% of read bios and 10% of write > bios with this command: > dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000" > > > I created a bcachefs volume on a single disk (metadata and data checksums > were turned off) and mounted it on dm-flakey. I got: > > crash: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash1.txt > deadlock: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash2.txt > infinite loop: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash3.txt > > Here I uploaded an image that causes infnite loop when we run bcachefs > fsck on it or when we attempt mount it: > https://people.redhat.com/~mpatocka/testcases/bcachefs/inf-loop.gz > > > I tried to run bcachefs on two block devices and fuzzing just one of them > (checksums and replication were turned on - so bcachefs shold correct the > corrupted data) - in this scenario, bcachefs doesn't return invalid data, > but it sometimes returns errors and sometimes crashes. 
>
> This script will trigger an oops on unmount:
> https://people.redhat.com/~mpatocka/testcases/bcachefs/crash4.txt
> or nonsensical errors returned to userspace:
> rm: cannot remove '/mnt/test/test/cmd_migrate.c': Unknown error 2206
> or I/O errors returned to userspace:
> diff: /mnt/test/test/rust-src/target/release/.fingerprint/bch_bindgen-f0bad16858ff0019/lib-bch_bindgen.json: Input/output error
>
> #!/bin/sh -ex
> umount /mnt/test || true
> dmsetup remove_all || true
> rmmod brd || true
> SRC=/usr/src/git/bcachefs-tools
> while true; do
>   modprobe brd rd_size=1048576
>   bcachefs format --replicas=2 /dev/ram0 /dev/ram1
>   dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` linear /dev/ram0 0"
>   mount -t bcachefs /dev/mapper/flakey:/dev/ram1 /mnt/test
>   dmsetup load flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000"

Hey, that's really neat!

Any chance you'd be willing to get the dm-flakey changes merged into upstream so that someone can write a recoveryloop fstest to test all the filesystems systematically?

:D

--D

>   dmsetup suspend flakey
>   dmsetup resume flakey
>   cp -a "$SRC" /mnt/test/test
>   diff -r "$SRC" /mnt/test/test
>   echo 3 >/proc/sys/vm/drop_caches
>   diff -r "$SRC" /mnt/test/test
>   echo 3 >/proc/sys/vm/drop_caches
>   diff -r "$SRC" /mnt/test/test
>   echo 3 >/proc/sys/vm/drop_caches
>   rm -rf /mnt/test/test
>   echo 3 >/proc/sys/vm/drop_caches
>   cp -a "$SRC" /mnt/test/test
>   echo 3 >/proc/sys/vm/drop_caches
>   diff -r "$SRC" /mnt/test/test
>   umount /mnt/test
>   dmsetup remove flakey
>   rmmod brd
> done
>
> The oops happens in set_btree_iter_dontneed and it is caused by the fact
> that iter->path is NULL. The code in try_alloc_bucket is buggy because it
> sets "struct btree_iter iter = { NULL };" and then jumps to the "err"
> label that tries to dereference values in "iter".
>
> Bcachefs does not give very useful error messages, like "Fatal error:
> Unknown error 2184" or "Error in recovery: cannot allocate memory" or
> "mount(2) system call failed: Unknown error 2186." or "rm: cannot remove
> '/mnt/test/xfstests-dev/tools/fs-walk': Unknown error 2206".
>
> Mikulas

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey
  2023-06-02  1:13 ` Darrick J. Wong
@ 2023-06-09 18:56   ` Mikulas Patocka
  2023-06-09 19:38     ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Mikulas Patocka @ 2023-06-09 18:56 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Kent Overstreet, linux-bcachefs, dm-devel, linux-fsdevel

On Thu, 1 Jun 2023, Darrick J. Wong wrote:

> On Mon, May 29, 2023 at 04:59:40PM -0400, Mikulas Patocka wrote:
>
> > #!/bin/sh -ex
> > umount /mnt/test || true
> > dmsetup remove_all || true
> > rmmod brd || true
> > SRC=/usr/src/git/bcachefs-tools
> > while true; do
> > modprobe brd rd_size=1048576
> > bcachefs format --replicas=2 /dev/ram0 /dev/ram1
> > dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` linear /dev/ram0 0"
> > mount -t bcachefs /dev/mapper/flakey:/dev/ram1 /mnt/test
> > dmsetup load flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000"
>
> Hey, that's really neat!
>
> Any chance you'd be willing to get the dm-flakey changes merged into
> upstream so that someone can write a recoveryloop fstest to test all the
> filesystems systematically?
>
> :D
>
> --D

Yes, we will merge the improved dm-flakey in the next merge window.

Mikulas

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: fuzzing bcachefs with dm-flakey
  2023-06-09 18:56 ` Mikulas Patocka
@ 2023-06-09 19:38   ` Darrick J. Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2023-06-09 19:38 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: Kent Overstreet, linux-bcachefs, dm-devel, linux-fsdevel

On Fri, Jun 09, 2023 at 08:56:37PM +0200, Mikulas Patocka wrote:
>
> On Thu, 1 Jun 2023, Darrick J. Wong wrote:
>
> > On Mon, May 29, 2023 at 04:59:40PM -0400, Mikulas Patocka wrote:
> >
> > > #!/bin/sh -ex
> > > umount /mnt/test || true
> > > dmsetup remove_all || true
> > > rmmod brd || true
> > > SRC=/usr/src/git/bcachefs-tools
> > > while true; do
> > > modprobe brd rd_size=1048576
> > > bcachefs format --replicas=2 /dev/ram0 /dev/ram1
> > > dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` linear /dev/ram0 0"
> > > mount -t bcachefs /dev/mapper/flakey:/dev/ram1 /mnt/test
> > > dmsetup load flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000"
> >
> > Hey, that's really neat!
> >
> > Any chance you'd be willing to get the dm-flakey changes merged into
> > upstream so that someone can write a recoveryloop fstest to test all the
> > filesystems systematically?
> >
> > :D
> >
> > --D
>
> Yes, we will merge improved dm-flakey in the next merge window.

Thank you!

--D

> Mikulas

^ permalink raw reply	[flat|nested] 13+ messages in thread
end of thread, other threads:[~2023-06-09 22:17 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-29 20:59 fuzzing bcachefs with dm-flakey Mikulas Patocka
2023-05-29 21:14 ` Matthew Wilcox
2023-05-29 23:12   ` Dave Chinner
2023-05-29 23:51     ` Kent Overstreet
2023-05-30 12:23   ` Mikulas Patocka
2023-05-29 21:43 ` Kent Overstreet
2023-05-30 21:00   ` Mikulas Patocka
2023-05-30 23:29     ` Kent Overstreet
2023-06-09 20:57       ` Mikulas Patocka
2023-06-09 22:17         ` Kent Overstreet
2023-06-02  1:13 ` Darrick J. Wong
2023-06-09 18:56   ` Mikulas Patocka
2023-06-09 19:38     ` Darrick J. Wong