On Nov 21, 2019, at 5:09 PM, Darrick J. Wong wrote:
>
> On Thu, Nov 21, 2019 at 01:30:36PM -0500, Theodore Ts'o wrote:
>> This allows us to test various error handling code paths
>>
>> Signed-off-by: Theodore Ts'o
>> ---
>>  fs/ext4/balloc.c |  4 +++-
>>  fs/ext4/ext4.h   | 42 ++++++++++++++++++++++++++++++++++++++++++
>>  fs/ext4/ialloc.c |  4 +++-
>>  fs/ext4/inode.c  |  6 +++++-
>>  fs/ext4/namei.c  | 11 ++++++++---
>>  fs/ext4/sysfs.c  | 23 +++++++++++++++++++++++
>>  6 files changed, 84 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
>> index 102c38527a10..5f993a411251 100644
>> --- a/fs/ext4/balloc.c
>> +++ b/fs/ext4/balloc.c
>> @@ -371,7 +371,8 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
>>  	if (buffer_verified(bh))
>>  		goto verified;
>>  	if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
>> -						    desc, bh))) {
>> +						    desc, bh) ||
>> +		     ext4_simulate_fail(sb, EXT4_SIM_BBITMAP_CRC))) {
>>  		ext4_unlock_group(sb, block_group);
>>  		ext4_error(sb, "bg %u: bad block bitmap checksum", block_group);
>>  		ext4_mark_group_bitmap_corrupted(sb, block_group,
>> @@ -505,6 +506,7 @@ int ext4_wait_block_bitmap(struct super_block *sb, ext4_group_t block_group,
>>  	if (!desc)
>>  		return -EFSCORRUPTED;
>>  	wait_on_buffer(bh);
>> +	ext4_simulate_fail_bh(sb, bh, EXT4_SIM_BBITMAP_EIO);
>>  	if (!buffer_uptodate(bh)) {
>>  		ext4_set_errno(sb, EIO);
>>  		ext4_error(sb, "Cannot read block bitmap - "
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 1c9ac0fc8715..e6798db4634c 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1557,6 +1557,9 @@ struct ext4_sb_info {
>>  	/* Barrier between changing inodes' journal flags and writepages ops. */
>>  	struct percpu_rw_semaphore s_journal_flag_rwsem;
>>  	struct dax_device *s_daxdev;
>> +#ifdef CONFIG_EXT4_DEBUG
>> +	unsigned long s_simulate_fail;
>> +#endif
>>  };
>>
>>  static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
>> @@ -1575,6 +1578,45 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
>>  		ino <= le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count));
>>  }
>>
>> +static inline int ext4_simulate_fail(struct super_block *sb,
>> +				     unsigned long flag)
>
> Nit: bool?
>
>> +{
>> +#ifdef CONFIG_EXT4_DEBUG
>> +	unsigned long old, new;
>> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
>> +
>> +	do {
>> +		old = READ_ONCE(sbi->s_simulate_fail);
>> +		if (likely((old & flag) == 0))
>> +			return 0;
>> +		new = old & ~flag;
>> +	} while (unlikely(cmpxchg(&sbi->s_simulate_fail, old, new) != old));
>
> If I'm reading this correctly, this means that userspace sets a
> s_simulate_fail bit via sysfs knob, and the next time the filesystem
> calls ext4_simulate_fail with the same bit set in @flag we'll return
> true to say "simulate the failure" and clear the bit in s_simulate_fail?
>
> IOWs, the simulated failures have to be re-armed every time?
>
> Seems reasonable, but consider the possibility that in the future it
> might be useful if you could set up periodic failures (e.g. directory
> lookups fail 10% of the time) so that you can see how something like
> fsstress reacts to less-predictable failures?
>
> Of course that also increases the amount of fugly sysfs boilerplate so
> that each knob can have its own sysfs file... that alone is half of a
> reason not to do that. :(

Just for comparison, Lustre has had a fault injection mechanism for ages
that can do a bunch of things like this.
Each fault location has a unique number (we separate them by subsystem
in the code, but the numbers are rather arbitrary), and then a sysfs
parameter "fail_loc" that can be set to match the fault location to
inject errors, and "fail_val" that allows userspace to adjust/tune the
failure behavior (e.g. only affect target N, or sleep N seconds, ...).

The low 16 bits of fail_loc are the fault location number, and the high
16 bits are flags that modify the behavior independent of which failure
number is being used:
- CFS_FAIL_ONCE: the fail_loc should only fail once (default is forever)
- CFS_FAIL_SKIP: skip the fail_loc the first "fail_val" times
- CFS_FAIL_SOME: trigger the failure only the first "fail_val" times
- CFS_FAIL_RAND: trigger the failure at a rate of 1/fail_val

There are also flags set by the kernel when a failure is hit, so a test
script can read fail_loc to see whether the failure was already hit.

Internally in the code, the most common use is simply checking whether
we hit the currently-set fail_loc (which is unlikely() for minimal
impact), like:

	if (CFS_FAIL_CHECK(OBD_FAIL_TGT_REPLAY_RECONNECT))
		RETURN(1); /* don't send early reply */

	if (CFS_FAIL_CHECK(OBD_FAIL_FLD_QUERY_REQ) && req->rq_no_delay) {
		/* the same error returned by ptlrpc_import_delay_req() */
		rc = -EWOULDBLOCK;
		req->rq_status = rc;
	}

It is possible to inject a delay into a thread to allow something else
to happen (maybe more useful for a distributed system than a local one):

	CFS_FAIL_TIMEOUT(OBD_FAIL_TGT_REPLAY_DELAY2, cfs_fail_val);

It is also possible to set up a race between two threads in the same or
different parts of the code on the same node:

	CFS_RACE(CFS_FAIL_CHLOG_USER_REG_UNREG_RACE);

The first thread to hit this fail_loc will sleep, and the second thread
that hits it will wake it up.  There is a variation of this that makes
it explicit that only the first thread to hit one location should sleep,
and that a second thread needs to hit a different location to wake it up:

	thread1:
		CFS_RACE_WAIT(OBD_FAIL_OBD_ZERO_NLINK_RACE);

	thread2:
		CFS_RACE_WAKEUP(OBD_FAIL_OBD_ZERO_NLINK_RACE);

It is also possible to daisy-chain failure conditions:

	if (ns_is_client(ldlm_lock_to_ns(lock)) &&
	    CFS_FAIL_CHECK_RESET(OBD_FAIL_LDLM_INTR_CP_AST,
				 OBD_FAIL_LDLM_CP_BL_RACE | OBD_FAIL_ONCE))
		ldlm_set_fail_loc(lock);

Here, if OBD_FAIL_LDLM_INTR_CP_AST is hit, fail_loc is reset to
OBD_FAIL_LDLM_CP_BL_RACE (armed to fail once), a particular DLM lock is
flagged, and then the next two threads that access this lock will race
to process it:

	if (ldlm_is_fail_loc(lock))
		CFS_RACE(OBD_FAIL_LDLM_CP_BL_RACE);

The CFS_FAIL functionality has been in use for quite a few years and
has proven sufficient and easy to use for invoking failure conditions
that would otherwise be impossible to reproduce (there are over a
thousand fault injection sites in the Lustre code today, with
corresponding tests to trigger them).

Cheers, Andreas
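
P.S. For anyone who wants to experiment with the encoding described
above from userspace, here is a minimal, self-contained sketch of a
fail_loc-style check.  The names and flag values are invented for
illustration and are not the actual libcfs interface (the real macros
are CFS_FAIL_CHECK() and friends); the FAIL_ONCE path mirrors the
"re-arm every time" behaviour Darrick describes for
ext4_simulate_fail(), and FAIL_RAND corresponds to the periodic-failure
mode he suggests.

/*
 * Hypothetical sketch only: a toy fail_loc/fail_val check that mimics
 * the "low 16 bits = location, high 16 bits = behaviour flags" layout.
 * Flag names and values here are made up for the example.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define FAIL_LOC_MASK	0x0000ffffUL	/* low 16 bits: fault location id */
#define FAIL_ONCE	0x00010000UL	/* trigger once, then disarm */
#define FAIL_RAND	0x00020000UL	/* trigger at a rate of 1/fail_val */

static unsigned long fail_loc;		/* stands in for the sysfs knob */
static unsigned long fail_val;

static bool fail_check(unsigned long id)
{
	/* not the currently armed fault location: do nothing */
	if ((fail_loc & FAIL_LOC_MASK) != id)
		return false;

	/* rate-limited failures, roughly what CFS_FAIL_RAND describes */
	if ((fail_loc & FAIL_RAND) && fail_val > 1 &&
	    (unsigned long)rand() % fail_val != 0)
		return false;

	if (fail_loc & FAIL_ONCE)
		fail_loc = 0;		/* disarm after the first hit */

	return true;
}

int main(void)
{
	/* arm fault location 0x123 so that it fires exactly once */
	fail_loc = 0x123 | FAIL_ONCE;

	printf("first check:  %d\n", fail_check(0x123));	/* prints 1 */
	printf("second check: %d\n", fail_check(0x123));	/* prints 0 */
	return 0;
}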