On Nov 21, 2019, at 5:09 PM, Darrick J. Wong wrote:
>
> On Thu, Nov 21, 2019 at 01:30:36PM -0500, Theodore Ts'o wrote:
>> This allows us to test various error handling code paths
>>
>> Signed-off-by: Theodore Ts'o
>> ---
>>  fs/ext4/balloc.c |  4 +++-
>>  fs/ext4/ext4.h   | 42 ++++++++++++++++++++++++++++++++++++++++++
>>  fs/ext4/ialloc.c |  4 +++-
>>  fs/ext4/inode.c  |  6 +++++-
>>  fs/ext4/namei.c  | 11 ++++++++---
>>  fs/ext4/sysfs.c  | 23 +++++++++++++++++++++++
>>  6 files changed, 84 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
>> index 102c38527a10..5f993a411251 100644
>> --- a/fs/ext4/balloc.c
>> +++ b/fs/ext4/balloc.c
>> @@ -371,7 +371,8 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
>>  	if (buffer_verified(bh))
>>  		goto verified;
>>  	if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
>> -						    desc, bh))) {
>> +						    desc, bh) ||
>> +		     ext4_simulate_fail(sb, EXT4_SIM_BBITMAP_CRC))) {
>>  		ext4_unlock_group(sb, block_group);
>>  		ext4_error(sb, "bg %u: bad block bitmap checksum", block_group);
>>  		ext4_mark_group_bitmap_corrupted(sb, block_group,
>> @@ -505,6 +506,7 @@ int ext4_wait_block_bitmap(struct super_block *sb, ext4_group_t block_group,
>>  	if (!desc)
>>  		return -EFSCORRUPTED;
>>  	wait_on_buffer(bh);
>> +	ext4_simulate_fail_bh(sb, bh, EXT4_SIM_BBITMAP_EIO);
>>  	if (!buffer_uptodate(bh)) {
>>  		ext4_set_errno(sb, EIO);
>>  		ext4_error(sb, "Cannot read block bitmap - "
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 1c9ac0fc8715..e6798db4634c 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1557,6 +1557,9 @@ struct ext4_sb_info {
>>  	/* Barrier between changing inodes' journal flags and writepages ops. */
>>  	struct percpu_rw_semaphore s_journal_flag_rwsem;
>>  	struct dax_device *s_daxdev;
>> +#ifdef CONFIG_EXT4_DEBUG
>> +	unsigned long s_simulate_fail;
>> +#endif
>>  };
>>
>>  static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
>> @@ -1575,6 +1578,45 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
>>  		ino <= le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count));
>>  }
>>
>> +static inline int ext4_simulate_fail(struct super_block *sb,
>> +				     unsigned long flag)
>
> Nit: bool?
>
>> +{
>> +#ifdef CONFIG_EXT4_DEBUG
>> +	unsigned long old, new;
>> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
>> +
>> +	do {
>> +		old = READ_ONCE(sbi->s_simulate_fail);
>> +		if (likely((old & flag) == 0))
>> +			return 0;
>> +		new = old & ~flag;
>> +	} while (unlikely(cmpxchg(&sbi->s_simulate_fail, old, new) != old));
>
> If I'm reading this correctly, this means that userspace sets a
> s_simulate_fail bit via sysfs knob, and the next time the filesystem
> calls ext4_simulate_fail with the same bit set in @flag we'll return
> true to say "simulate the failure" and clear the bit in s_simulate_fail?
>
> IOWs, the simulated failures have to be re-armed every time?
>
> Seems reasonable, but consider the possibility that in the future it
> might be useful if you could set up periodic failures (e.g. directory
> lookups fail 10% of the time) so that you can see how something like
> fsstress reacts to less-predictable failures?
>
> Of course that also increases the amount of fugly sysfs boilerplate so
> that each knob can have its own sysfs file... that alone is half of a
> reason not to do that. :(

Just for comparison, Lustre has had a fault injection mechanism for ages
that can do a bunch of things like this.
Each fault location has a unique number (we separate them by subsystem
in the code, but the numbers are rather arbitrary), and then a sysfs
parameter "fail_loc" that can be set to match the fault location to
inject errors, and "fail_val" that allows userspace to adjust/tune the
failure behavior (e.g. only affect target N, or sleep N seconds, ...).

The low 16 bits of fail_loc are the fault location number, and the high
16 bits are flags that modify the behavior independent of which failure
number is being used:
- CFS_FAIL_ONCE: the fail_loc should only fail once (default is forever)
- CFS_FAIL_SKIP: skip the fail_loc the first "fail_val" times
- CFS_FAIL_SOME: trigger the failure only the first "fail_val" times
- CFS_FAIL_RAND: trigger the failure at a rate of 1/fail_val

There are also flags set by the kernel when a failure is hit, so a test
script can read fail_loc to see whether the failure was already hit.

Internally in the code, the most common use is simply checking whether
we hit the currently-set fail_loc (which is unlikely() for minimal
impact), like:

	if (CFS_FAIL_CHECK(OBD_FAIL_TGT_REPLAY_RECONNECT))
		RETURN(1); /* don't send early reply */

	if (CFS_FAIL_CHECK(OBD_FAIL_FLD_QUERY_REQ) && req->rq_no_delay) {
		/* the same error returned by ptlrpc_import_delay_req() */
		rc = -EWOULDBLOCK;
		req->rq_status = rc;
	}

It is possible to inject a delay into a thread to allow something else
to happen (maybe more useful for a distributed system than a local one):

	CFS_FAIL_TIMEOUT(OBD_FAIL_TGT_REPLAY_DELAY2, cfs_fail_val);

It is also possible to set up a race between two threads in the same or
different parts of the code on the same node:

	CFS_RACE(CFS_FAIL_CHLOG_USER_REG_UNREG_RACE);

The first thread to hit this fail_loc will sleep, and the second thread
that hits it will wake it up.  There is a variation of this that makes
it explicit that only the first thread to hit one location should sleep,
and that a second thread needs to hit a different location to wake it up:

	thread1:
		CFS_RACE_WAIT(OBD_FAIL_OBD_ZERO_NLINK_RACE);

	thread2:
		CFS_RACE_WAKEUP(OBD_FAIL_OBD_ZERO_NLINK_RACE);

It is also possible to daisy-chain failure conditions:

	if (ns_is_client(ldlm_lock_to_ns(lock)) &&
	    CFS_FAIL_CHECK_RESET(OBD_FAIL_LDLM_INTR_CP_AST,
				 OBD_FAIL_LDLM_CP_BL_RACE | OBD_FAIL_ONCE))
		ldlm_set_fail_loc(lock);

Here, if OBD_FAIL_LDLM_INTR_CP_AST is hit, fail_loc is reset to
OBD_FAIL_LDLM_CP_BL_RACE (armed to fail once), a particular DLM lock is
flagged, and then the next two threads that access this lock will race
to process it:

	if (ldlm_is_fail_loc(lock))
		CFS_RACE(OBD_FAIL_LDLM_CP_BL_RACE);

The CFS_FAIL functionality has been in use for quite a few years and
has proven sufficient and easy to use for invoking failure conditions
that would otherwise be impossible to reproduce (there are over a
thousand fault injection sites in the Lustre code today, with
corresponding tests to trigger them).

Cheers, Andreas
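
P.S. For anyone who wants to experiment with the encoding described
above from userspace, here is a minimal, self-contained sketch of a
fail_loc-style check.  The names and flag values are invented for
illustration and are not the actual libcfs interface (the real macros
are CFS_FAIL_CHECK() and friends); the FAIL_ONCE path mirrors the
"re-arm every time" behaviour Darrick describes for
ext4_simulate_fail(), and FAIL_RAND corresponds to the periodic-failure
mode he suggests.

/*
 * Hypothetical sketch only: a toy fail_loc/fail_val check that mimics
 * the "low 16 bits = location, high 16 bits = behaviour flags" layout.
 * Flag names and values here are made up for the example.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define FAIL_LOC_MASK	0x0000ffffUL	/* low 16 bits: fault location id */
#define FAIL_ONCE	0x00010000UL	/* trigger once, then disarm */
#define FAIL_RAND	0x00020000UL	/* trigger at a rate of 1/fail_val */

static unsigned long fail_loc;		/* stands in for the sysfs knob */
static unsigned long fail_val;

static bool fail_check(unsigned long id)
{
	/* not the currently armed fault location: do nothing */
	if ((fail_loc & FAIL_LOC_MASK) != id)
		return false;

	/* rate-limited failures, roughly what CFS_FAIL_RAND describes */
	if ((fail_loc & FAIL_RAND) && fail_val > 1 &&
	    (unsigned long)rand() % fail_val != 0)
		return false;

	if (fail_loc & FAIL_ONCE)
		fail_loc = 0;		/* disarm after the first hit */

	return true;
}

int main(void)
{
	/* arm fault location 0x123 so that it fires exactly once */
	fail_loc = 0x123 | FAIL_ONCE;

	printf("first check:  %d\n", fail_check(0x123));	/* prints 1 */
	printf("second check: %d\n", fail_check(0x123));	/* prints 0 */
	return 0;
}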