Date: Mon, 29 May 2023 17:43:53 -0400
From: Kent Overstreet
To: Mikulas Patocka
Cc: linux-bcachefs@vger.kernel.org, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org
Subject: Re: fuzzing bcachefs with dm-flakey

On Mon, May 29, 2023 at 04:59:40PM -0400, Mikulas Patocka wrote:
> Hi
>
> I improved the dm-flakey device mapper target, so that it can do random
> corruption of read and write bios - I uploaded it here:
> https://people.redhat.com/~mpatocka/testcases/bcachefs/dm-flakey.c
>
> I set up dm-flakey, so that it corrupts 10% of read bios and 10% of write
> bios with this command:
> dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000"

I've got some existing ktest tests for error injection:
https://evilpiepirate.org/git/ktest.git/tree/tests/bcachefs/single_device.ktest#n200
https://evilpiepirate.org/git/ktest.git/tree/tests/bcachefs/replication.ktest#n491

I haven't looked at dm-flakey before. I take it you're silently corrupting
data instead of just failing the IOs like these tests do?

Let's add what you're doing to ktest, and see if we can merge it with the
existing tests.

> I created a bcachefs volume on a single disk (metadata and data checksums
> were turned off) and mounted it on dm-flakey. I got:
>
> crash: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash1.txt
> deadlock: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash2.txt
> infinite loop: https://people.redhat.com/~mpatocka/testcases/bcachefs/crash3.txt

Fun!
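As a sanity check on the 10% figure in the quoted command: it works out if
the probability argument is read as parts per billion, since
100000000 / 1000000000 = 0.1. That scale is an assumption inferred from the
quoted numbers, and the helper names below are hypothetical. This is a
minimal sketch of the arithmetic, not code taken from dm-flakey.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the probability argument is in parts per billion,
 * so 1000000000 would mean "corrupt every bio". */
#define PROB_SCALE 1000000000ULL

/* Convert a probability argument into a whole percentage. */
unsigned prob_to_percent(uint64_t prob)
{
	assert(prob <= PROB_SCALE);
	return (unsigned)(prob * 100 / PROB_SCALE);
}

/* Hypothetical decision helper: 'draw' is a uniform random value in
 * [0, PROB_SCALE); the bio is corrupted when the draw falls below
 * the configured probability. */
int should_corrupt(uint64_t draw, uint64_t prob)
{
	return draw < prob;
}
```

With the quoted value, prob_to_percent(100000000) gives 10, matching the
"10% of read bios and 10% of write bios" description.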

> Here I uploaded an image that causes infinite loop when we run bcachefs
> fsck on it or when we attempt to mount it:
> https://people.redhat.com/~mpatocka/testcases/bcachefs/inf-loop.gz
>
>
> I tried to run bcachefs on two block devices and fuzzing just one of them
> (checksums and replication were turned on - so bcachefs should correct the
> corrupted data) - in this scenario, bcachefs doesn't return invalid data,
> but it sometimes returns errors and sometimes crashes.
>
> This script will trigger an oops on unmount:
> https://people.redhat.com/~mpatocka/testcases/bcachefs/crash4.txt
> or nonsensical errors returned to userspace:
> rm: cannot remove '/mnt/test/test/cmd_migrate.c': Unknown error 2206
> or I/O errors returned to userspace:
> diff: /mnt/test/test/rust-src/target/release/.fingerprint/bch_bindgen-f0bad16858ff0019/lib-bch_bindgen.json: Input/output error
>
> #!/bin/sh -ex
> umount /mnt/test || true
> dmsetup remove_all || true
> rmmod brd || true
> SRC=/usr/src/git/bcachefs-tools
> while true; do
> modprobe brd rd_size=1048576
> bcachefs format --replicas=2 /dev/ram0 /dev/ram1
> dmsetup create flakey --table "0 `blockdev --getsize /dev/ram0` linear /dev/ram0 0"
> mount -t bcachefs /dev/mapper/flakey:/dev/ram1 /mnt/test
> dmsetup load flakey --table "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 4 random_write_corrupt 100000000 random_read_corrupt 100000000"
> dmsetup suspend flakey
> dmsetup resume flakey
> cp -a "$SRC" /mnt/test/test
> diff -r "$SRC" /mnt/test/test
> echo 3 >/proc/sys/vm/drop_caches
> diff -r "$SRC" /mnt/test/test
> echo 3 >/proc/sys/vm/drop_caches
> diff -r "$SRC" /mnt/test/test
> echo 3 >/proc/sys/vm/drop_caches
> rm -rf /mnt/test/test
> echo 3 >/proc/sys/vm/drop_caches
> cp -a "$SRC" /mnt/test/test
> echo 3 >/proc/sys/vm/drop_caches
> diff -r "$SRC" /mnt/test/test
> umount /mnt/test
> dmsetup remove flakey
> rmmod brd
> done
>
> The oops happens in set_btree_iter_dontneed and it is caused by the fact
> that iter->path is NULL. The code in try_alloc_bucket is buggy because it
> sets "struct btree_iter iter = { NULL };" and then jumps to the "err"
> label that tries to dereference values in "iter".

Good catches on all of them. Darrick's been on me to get fuzz testing
going, and it looks like it's definitely needed :)

However, there are two things I want in place first before I put much
effort into fuzz testing:

 - Code coverage analysis

   ktest used to have integrated code coverage analysis, where you'd tell
   it a subdirectory of the kernel tree (doing code coverage analysis for
   the entire kernel is impossibly slow) and it would run tests and then
   give you the lcov output.

   However, several years ago something about kbuild changed, and the
   method ktest was using for passing in build flags for a specific
   subdir on the command line stopped working. I would like to track down
   someone who understands kbuild and get this working again.

 - Fault injection

   Years and years ago, when I was still at Google and this was just
   bcache, we had fault injection that worked like dynamic debug: you
   could call dynamic_fault("type of fault") anywhere in your code, and
   it returned a bool indicating whether that fault had been enabled.
   Faults were controllable at runtime via debugfs. We had tests that
   iterated over e.g. faults in the initialization path, or memory
   allocation failures, and flipped them on one by one and ran
   $test_workload.

   The memory allocation profiling stuff that Suren and I have been
   working on includes code tagging, which is for (among other things) a
   new and simplified implementation of dynamic fault injection, which
   I'm going to push forward again once the memory allocation profiling
   stuff gets merged.

The reason I want this stuff is that fuzz testing tends to be a
heavyweight, scattershot approach.

I want to be able to look at the code coverage analysis first to e.g.
work on a chunk of code at a time and make sure it's tested thoroughly,
instead of jumping around in the code at random depending on what fuzz
testing finds. And when we are fuzz testing, I want to be able to add
fault injection points and write unit tests so that we can have much more
targeted, quicker to run tests going forward.

Can I get you interested in either of those things? I'd really love to
find someone to hand off or collaborate with on the fault injection stuff
in particular.

> Bcachefs gives not much useful error messages, like "Fatal error: Unknown
> error 2184" or "Error in recovery: cannot allocate memory" or "mount(2)
> system call failed: Unknown error 2186." or "rm: cannot remove
> '/mnt/test/xfstests-dev/tools/fs-walk': Unknown error 2206".

Those are mostly missing bch2_err_str()/bch2_err_class() calls:
 - bch2_err_str(), to print the error string for our private error code
 - bch2_err_class(), to convert a private error code to a standard error
   code before returning it to code outside bcachefs

The exception is "Error in recovery: cannot allocate memory" - that's
ancient code that still squashes to -ENOMEM.
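
To make the split between the two helpers concrete, here is a minimal
sketch of the pattern in plain userspace C. Everything in it is
hypothetical: the demo_* names, the two example codes and their numbering
are made up for illustration and are not the real bcachefs error table.
The idea is only that private codes live above the standard errno range,
get a readable name for logging, and get mapped back to a standard errno
at the boundary.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical private error codes, numbered above the standard errno
 * range. The values and names are illustrative only. */
enum demo_err {
	DEMO_ERR_START = 2048,
	DEMO_ERR_ENOMEM_btree_alloc = DEMO_ERR_START,
	DEMO_ERR_EIO_read_corrupt,
	DEMO_ERR_MAX,
};

struct demo_err_info {
	const char *name;	/* string used in log messages */
	int std_errno;		/* standard errno this code belongs to */
};

/* Indexed by (private code - DEMO_ERR_START). */
static const struct demo_err_info demo_err_table[] = {
	{ "ENOMEM_btree_alloc", ENOMEM },
	{ "EIO_read_corrupt",   EIO },
};

/* Analogue of bch2_err_str(): readable string for a (negative) error. */
const char *demo_err_str(int err)
{
	int code = err < 0 ? -err : err;

	if (code >= DEMO_ERR_START && code < DEMO_ERR_MAX)
		return demo_err_table[code - DEMO_ERR_START].name;
	return strerror(code);	/* already a standard errno */
}

/* Analogue of bch2_err_class(): squash a private (negative) error to
 * the standard negative errno; standard codes pass through unchanged. */
int demo_err_class(int err)
{
	int code = err < 0 ? -err : err;

	if (code >= DEMO_ERR_START && code < DEMO_ERR_MAX)
		return -demo_err_table[code - DEMO_ERR_START].std_errno;
	return err;
}
```

In this sketch, a path that returns -DEMO_ERR_EIO_read_corrupt straight to
userspace without going through demo_err_class() would hand strerror() a
number it has no entry for, which is presumably how messages like "Unknown
error 2206" end up on the terminal.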