* [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups
@ 2020-05-01  6:29 Ritesh Harjani
  2020-05-01  6:29 ` [RFC 01/20] ext4: mballoc: Refactor ext4_mb_discard_preallocations() Ritesh Harjani
                   ` (19 more replies)
  0 siblings, 20 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

Hello All,

v2 -> v3:
v3 changes the code design to fix the ENOSPC error. It uses a percpu
discard_pa_seq counter (as discussed with Jan & Aneesh).
The Patch-2 commit msg describes both the problem and the new algorithm in
great detail. The rest of the patches are mostly either refactoring, code
cleanups or debug log improvements.

Posting this for early review comments.
For now I have done some basic testing of this series for the reported
ENOSPC errors, using a smaller filesystem (~240MB, 64K bs):
a. Tested multi-threaded file writes which only allocate group PAs.
b. Tested multi-threaded file writes which only allocate inode PAs.
c. Tested multi-threaded file writes doing a combination of both of the above.

[May Not be Ready for Merge yet, until the below points are properly discussed]
==================================================================

1. There is a query in the Patch-2 commit msg itself regarding the
   rcu_barrier() usage.

2. AFAICT, even if we reduce the PA size based on the available FS size to
   avoid this ENOSPC error, the issue could still happen, since the race is
   mostly about the 1st thread freeing up all the PAs while the 2nd thread
   returns 0 as freed (because the PA list was already empty). Hence the 2nd
   thread fails with ENOSPC even though the 1st thread did free blocks.

3. Is there any stress-ng test case, or any other use case, which measures
   multi-threaded write performance while the FS is close to ENOSPC?
   If yes, I would like to know about it so that it can be tested at my end
   while we are still at it, instead of regressing this later.

4. I do see that with a 64K blocksize, the performance of multi-threaded
   writes of < 1 MB files becomes very slow. But without this patch it anyway
   fails with an ENOSPC error. After doing the below, the performance improves
   close to 5x:
   echo 32 > /sys/fs/ext4/loop3/mb_group_prealloc
   I was thinking that instead of 512 blocks as the default value for
   'MB_DEFAULT_GROUP_PREALLOC', we could make 64K the default size and
   derive the no. of blocks from the blocksize (see the sketch after this
   list). I haven't got to it yet, but thought to capture it in this email.
   We can get to it after these patches.

5. Started fstests testing. Will let you know the results soon.
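
Below is a rough sketch of the idea in point 4, i.e. deriving the group
prealloc default from a byte-size target instead of a fixed block count.
The MB_DEFAULT_GROUP_PREALLOC_SZ name and the 64K target are hypothetical,
only meant as an illustration and not part of this series:

#define MB_DEFAULT_GROUP_PREALLOC_SZ	(64 * 1024)

static unsigned int mb_default_group_prealloc(struct super_block *sb)
{
	/*
	 * Scale the default group prealloc with the blocksize, e.g.
	 * 16 blocks at 4K bs and 1 block at 64K bs, never below 1.
	 */
	return max_t(unsigned int, 1,
		     MB_DEFAULT_GROUP_PREALLOC_SZ >> sb->s_blocksize_bits);
}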

[RFCv2]: https://patchwork.ozlabs.org/project/linux-ext4/patch/533ac1f5b19c520b08f8c99aec5baf8729185714.1586954511.git.riteshh@linux.ibm.com/
         - Problems with v2 are captured in Patch-2 commit msg itself.


Ritesh Harjani (20):
  ext4: mballoc: Refactor ext4_mb_discard_preallocations()
  ext4: Introduce percpu seq counter for freeing blocks(PA) to avoid
    ENOSPC err
  ext4: mballoc: Do print bb_free info even when it is 0
  ext4: mballoc: Refactor ext4_mb_show_ac()
  ext4: mballoc: Add more mb_debug() msgs
  ext4: mballoc: Correct the mb_debug() format specifier for pa_len var
  ext4: mballoc: Fix few other format specifier in mb_debug()
  ext4: mballoc: Simplify error handling in ext4_init_mballoc()
  ext4: mballoc: Make ext4_mb_use_preallocated() return type as bool
  ext4: mballoc: Remove EXT4_MB_HINT_GOAL_ONLY and it's related code
  ext4: mballoc: Refactor code inside DOUBLE_CHECK into separate
    function
  ext4: mballoc: Fix possible NULL ptr dereference from mb_cmp_bitmaps()
  ext4: mballoc: Don't BUG if kmalloc or read blk bitmap fail for
    DOUBLE_CHECK
  ext4: balloc: Use task_pid_nr() helper
  ext4: Use BIT() macro for BH_** state bits
  ext4: Improve ext_debug() msg in case of block allocation failure
  ext4: Replace EXT_DEBUG with __maybe_unused in
    ext4_ext_handle_unwritten_extents()
  ext4: mballoc: Make mb_debug() implementation to use pr_debug()
  ext4: Make ext_debug() implementation to use pr_debug()
  ext4: Add process name and pid in ext4_msg()

 fs/ext4/Kconfig             |   3 +-
 fs/ext4/balloc.c            |   5 +-
 fs/ext4/ext4.h              |  38 ++--
 fs/ext4/extents.c           | 150 ++++++++--------
 fs/ext4/inode.c             |  15 +-
 fs/ext4/mballoc.c           | 347 +++++++++++++++++++++++-------------
 fs/ext4/mballoc.h           |  16 +-
 fs/ext4/super.c             |   3 +-
 include/trace/events/ext4.h |   1 -
 9 files changed, 335 insertions(+), 243 deletions(-)

-- 
2.21.0



* [RFC 01/20] ext4: mballoc: Refactor ext4_mb_discard_preallocations()
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 02/20] ext4: Introduce percpu seq counter for freeing blocks(PA) to avoid ENOSPC err Ritesh Harjani
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

Implement ext4_mb_discard_preallocations_should_retry(),
which we will need in later patches to add more logic,
like checking for a sequence number match to see if we should
retry the block allocation or not.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 30d5d97548c4..a742e51e33b8 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4486,6 +4486,17 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 	return freed;
 }
 
+static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
+			struct ext4_allocation_context *ac)
+{
+	int freed;
+
+	freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
+	if (freed)
+		return true;
+	return false;
+}
+
 /*
  * Main entry point into mballoc to allocate blocks
  * it tries to use preallocation first, then falls back
@@ -4494,7 +4505,6 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 				struct ext4_allocation_request *ar, int *errp)
 {
-	int freed;
 	struct ext4_allocation_context *ac = NULL;
 	struct ext4_sb_info *sbi;
 	struct super_block *sb;
@@ -4593,8 +4603,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 			ar->len = ac->ac_b_ex.fe_len;
 		}
 	} else {
-		freed  = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
-		if (freed)
+		if (ext4_mb_discard_preallocations_should_retry(sb, ac))
 			goto repeat;
 		*errp = -ENOSPC;
 	}
-- 
2.21.0



* [RFC 02/20] ext4: Introduce percpu seq counter for freeing blocks(PA) to avoid ENOSPC err
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
  2020-05-01  6:29 ` [RFC 01/20] ext4: mballoc: Refactor ext4_mb_discard_preallocations() Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01 18:31   ` Paul E. McKenney
  2020-05-01  6:29 ` [RFC 03/20] ext4: mballoc: Do print bb_free info even when it is 0 Ritesh Harjani
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

There could be a race in ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through the group's bb_prealloc_list,
remove all the PAs and add them to the function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that group->bb_prealloc_list is empty and will return 0.

Consider a case where we have only a small number of groups
(e.g. just group 0):
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(which is where we call ext4_mb_discard_group_preallocations()).
But that is wrong, since the 2nd thread should have waited for the 1st
thread to release all the PAs and should have retried the allocation,
given that the 1st thread was anyway going to discard those PAs.

The algorithm using this percpu seq counter is as follows:
1. We sample the percpu discard_pa_seq counter before trying for block
   allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we succeed in
   allocating blocks and hence while adding the remaining blocks in the
   group's bb_prealloc_list (ext4_mb_new_inode_pa/ext4_mb_new_group_pa).
3. We also increment this percpu seq counter when we successfully identify
   that the bb_prealloc_list is not empty and hence proceed with discarding
   those PAs inside ext4_mb_discard_group_preallocations().

Now, to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails, and the number of freed
blocks found was 0, do we sample the percpu seq counter for all cpus using
the function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed.

TO CHECK: Though in here the rcu_barrier() only happens in the ENOSPC path.
=================================================================
On rcu_barrier() - how expensive can it be?
Does that mean that every thread which comes in and waits on
rcu_barrier() will actually check whether call_rcu() has completed by
checking that on every cpu? So will this be an O(n*m) operation?
(n = no. of threads, m = no. of cpus).
Or is there some sort of optimization in using rcu_barrier()?

---
Note: The other method [1] also checked grp->bb_free the next time around
if we couldn't discard anything. But it had the below concerns, due to which
we cannot use that approach.
1. One suspicion with that was that if grp->bb_free is non-zero for some
reason (not sure), then we may end up in that loop indefinitely but still
not be able to satisfy any request.
2. To check grp->bb_free, all threads would have to wait on the grp's
spin_lock, which might result in increased cpu usage.
3. In case of group PA allocation (i.e. when the file's size is < 1MB for
   64K blocksize), there is still a case where ENOSPC could be returned.
   This happens when grp->bb_free is set to 0 but those blocks are
   actually not yet added to a PA. Yes, this could actually happen, since
   reducing grp->bb_free and adding those extra blocks to the
   bb_prealloc_list are not done atomically. Hence the race.

[1]: https://patchwork.ozlabs.org/project/linux-ext4/patch/533ac1f5b19c520b08f8c99aec5baf8729185714.1586954511.git.riteshh@linux.ibm.com/

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 65 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 60 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a742e51e33b8..6bb08bb3c0ce 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -357,6 +357,35 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
 static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
 						ext4_group_t group);
 
+/*
+ * The algorithm using this percpu seq counter goes below:
+ * 1. We sample the percpu discard_pa_seq counter before trying for block
+ *    allocation in ext4_mb_new_blocks().
+ * 2. We increment this percpu discard_pa_seq counter when we succeed in
+ *    allocating blocks and hence while adding the remaining blocks in group's
+ *    bb_prealloc_list (ext4_mb_new_inode_pa/ext4_mb_new_group_pa).
+ * 3. We also increment this percpu seq counter when we successfully identify
+ *    that the bb_prealloc_list is not empty and hence proceed for discarding
+ *    of those PAs inside ext4_mb_discard_group_preallocations().
+ *
+ * Now to make sure that the regular fast path of block allocation is not
+ * affected, as a small optimization we only sample the percpu seq counter
+ * on that cpu. Only when the block allocation fails and when freed blocks
+ * found were 0, that is when we sample percpu seq counter for all cpus using
+ * below function ext4_get_discard_pa_seq_sum(). This happens after making
+ * sure that all the PAs on grp->bb_prealloc_list got freed.
+ */
+DEFINE_PER_CPU(u64, discard_pa_seq);
+static inline u64 ext4_get_discard_pa_seq_sum(void)
+{
+	int __cpu;
+	u64 __seq = 0;
+
+	for_each_possible_cpu(__cpu)
+		__seq += per_cpu(discard_pa_seq, __cpu);
+	return __seq;
+}
+
 static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
 {
 #if BITS_PER_LONG == 64
@@ -3730,6 +3759,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 	pa->pa_inode = ac->ac_inode;
 
 	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
+	this_cpu_inc(discard_pa_seq);
 	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
 	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
 
@@ -3791,6 +3821,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 	pa->pa_inode = NULL;
 
 	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
+	this_cpu_inc(discard_pa_seq);
 	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
 	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
 
@@ -3943,6 +3974,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	INIT_LIST_HEAD(&list);
 repeat:
 	ext4_lock_group(sb, group);
+	this_cpu_inc(discard_pa_seq);
 	list_for_each_entry_safe(pa, tmp,
 				&grp->bb_prealloc_list, pa_group_list) {
 		spin_lock(&pa->pa_lock);
@@ -4487,14 +4519,35 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 }
 
 static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
-			struct ext4_allocation_context *ac)
+			struct ext4_allocation_context *ac, u64 *seq)
 {
 	int freed;
+	u64 seq_retry = 0;
+	bool ret = false;
 
 	freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
-	if (freed)
-		return true;
-	return false;
+	if (freed) {
+		ret = true;
+		goto out_dbg;
+	}
+	/*
+	 * Unless it is ensured that PAs are actually freed, we may hit
+	 * a ENOSPC error since the next time seq may match while the PA blocks
+	 * are still getting freed in ext4_mb_release_inode/group_pa().
+	 * So, rcu_barrier() here is to make sure that any call_rcu queued in
+	 * ext4_mb_discard_group_preallocations() is completed before we
+	 * proceed further to retry for block allocation.
+	 */
+	rcu_barrier();
+	seq_retry = ext4_get_discard_pa_seq_sum();
+	if (seq_retry != *seq) {
+		*seq = seq_retry;
+		ret = true;
+	}
+
+out_dbg:
+	mb_debug(1, "freed %d, retry ? %s\n", freed, ret ? "yes" : "no");
+	return ret;
 }
 
 /*
@@ -4511,6 +4564,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 	ext4_fsblk_t block = 0;
 	unsigned int inquota = 0;
 	unsigned int reserv_clstrs = 0;
+	u64 seq;
 
 	might_sleep();
 	sb = ar->inode->i_sb;
@@ -4572,6 +4626,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 	}
 
 	ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
+	seq = *this_cpu_ptr(&discard_pa_seq);
 	if (!ext4_mb_use_preallocated(ac)) {
 		ac->ac_op = EXT4_MB_HISTORY_ALLOC;
 		ext4_mb_normalize_request(ac, ar);
@@ -4603,7 +4658,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 			ar->len = ac->ac_b_ex.fe_len;
 		}
 	} else {
-		if (ext4_mb_discard_preallocations_should_retry(sb, ac))
+		if (ext4_mb_discard_preallocations_should_retry(sb, ac, &seq))
 			goto repeat;
 		*errp = -ENOSPC;
 	}
-- 
2.21.0



* [RFC 03/20] ext4: mballoc: Do print bb_free info even when it is 0
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
  2020-05-01  6:29 ` [RFC 01/20] ext4: mballoc: Refactor ext4_mb_discard_preallocations() Ritesh Harjani
  2020-05-01  6:29 ` [RFC 02/20] ext4: Introduce percpu seq counter for freeing blocks(PA) to avoid ENOSPC err Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 04/20] ext4: mballoc: Refactor ext4_mb_show_ac() Ritesh Harjani
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

Improve the debugging msg by printing the group's bb_free info even when it is 0.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 6bb08bb3c0ce..478b33fb1183 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4202,8 +4202,6 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
 		}
 		ext4_unlock_group(sb, i);
 
-		if (grp->bb_free == 0)
-			continue;
 		printk(KERN_ERR "%u: %d/%d \n",
 		       i, grp->bb_free, grp->bb_fragments);
 	}
-- 
2.21.0



* [RFC 04/20] ext4: mballoc: Refactor ext4_mb_show_ac()
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (2 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 03/20] ext4: mballoc: Do print bb_free info even when it is 0 Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 05/20] ext4: mballoc: Add more mb_debug() msgs Ritesh Harjani
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

This factors out an ext4_mb_show_pa() function to show all the groups'
preallocation info. This info will be useful to print from more places
in later patches.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 65 +++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 478b33fb1183..d7ae5a092796 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4152,38 +4152,16 @@ void ext4_discard_preallocations(struct inode *inode)
 }
 
 #ifdef CONFIG_EXT4_DEBUG
-static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
+static inline void ext4_mb_show_pa(struct super_block *sb)
 {
-	struct super_block *sb = ac->ac_sb;
-	ext4_group_t ngroups, i;
+	ext4_group_t i, ngroups;
 
 	if (!ext4_mballoc_debug ||
 	    (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED))
 		return;
 
-	ext4_msg(ac->ac_sb, KERN_ERR, "Can't allocate:"
-			" Allocation context details:");
-	ext4_msg(ac->ac_sb, KERN_ERR, "status %d flags %d",
-			ac->ac_status, ac->ac_flags);
-	ext4_msg(ac->ac_sb, KERN_ERR, "orig %lu/%lu/%lu@%lu, "
-		 	"goal %lu/%lu/%lu@%lu, "
-			"best %lu/%lu/%lu@%lu cr %d",
-			(unsigned long)ac->ac_o_ex.fe_group,
-			(unsigned long)ac->ac_o_ex.fe_start,
-			(unsigned long)ac->ac_o_ex.fe_len,
-			(unsigned long)ac->ac_o_ex.fe_logical,
-			(unsigned long)ac->ac_g_ex.fe_group,
-			(unsigned long)ac->ac_g_ex.fe_start,
-			(unsigned long)ac->ac_g_ex.fe_len,
-			(unsigned long)ac->ac_g_ex.fe_logical,
-			(unsigned long)ac->ac_b_ex.fe_group,
-			(unsigned long)ac->ac_b_ex.fe_start,
-			(unsigned long)ac->ac_b_ex.fe_len,
-			(unsigned long)ac->ac_b_ex.fe_logical,
-			(int)ac->ac_criteria);
-	ext4_msg(ac->ac_sb, KERN_ERR, "%d found", ac->ac_found);
-	ext4_msg(ac->ac_sb, KERN_ERR, "groups: ");
 	ngroups = ext4_get_groups_count(sb);
+	ext4_msg(sb, KERN_ERR, "groups: ");
 	for (i = 0; i < ngroups; i++) {
 		struct ext4_group_info *grp = ext4_get_group_info(sb, i);
 		struct ext4_prealloc_space *pa;
@@ -4207,9 +4185,46 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
 	}
 	printk(KERN_ERR "\n");
 }
+
+static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
+{
+	struct super_block *sb = ac->ac_sb;
+
+	if (!ext4_mballoc_debug ||
+	    (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED))
+		return;
+
+	ext4_msg(sb, KERN_ERR, "Can't allocate:"
+			" Allocation context details:");
+	ext4_msg(sb, KERN_ERR, "status %d flags %d",
+			ac->ac_status, ac->ac_flags);
+	ext4_msg(sb, KERN_ERR, "orig %lu/%lu/%lu@%lu, "
+			"goal %lu/%lu/%lu@%lu, "
+			"best %lu/%lu/%lu@%lu cr %d",
+			(unsigned long)ac->ac_o_ex.fe_group,
+			(unsigned long)ac->ac_o_ex.fe_start,
+			(unsigned long)ac->ac_o_ex.fe_len,
+			(unsigned long)ac->ac_o_ex.fe_logical,
+			(unsigned long)ac->ac_g_ex.fe_group,
+			(unsigned long)ac->ac_g_ex.fe_start,
+			(unsigned long)ac->ac_g_ex.fe_len,
+			(unsigned long)ac->ac_g_ex.fe_logical,
+			(unsigned long)ac->ac_b_ex.fe_group,
+			(unsigned long)ac->ac_b_ex.fe_start,
+			(unsigned long)ac->ac_b_ex.fe_len,
+			(unsigned long)ac->ac_b_ex.fe_logical,
+			(int)ac->ac_criteria);
+	ext4_msg(sb, KERN_ERR, "%d found", ac->ac_found);
+	ext4_mb_show_pa(sb);
+}
 #else
+static inline void ext4_mb_show_pa(struct super_block *sb)
+{
+	return;
+}
 static inline void ext4_mb_show_ac(struct ext4_allocation_context *ac)
 {
+	ext4_mb_show_pa(ac->ac_sb);
 	return;
 }
 #endif
-- 
2.21.0



* [RFC 05/20] ext4: mballoc: Add more mb_debug() msgs
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (3 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 04/20] ext4: mballoc: Refactor ext4_mb_show_ac() Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 06/20] ext4: mballoc: Correct the mb_debug() format specifier for pa_len var Ritesh Harjani
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

This patch adds some more mb_debug() msgs to help improve mballoc code
debugging.
Other than adding mb_debug() msgs at a few more places,
there should be no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d7ae5a092796..2acabc3927e0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2137,7 +2137,7 @@ static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
 	ext4_group_t ngroups, group, i;
-	int cr;
+	int cr = -1;
 	int err = 0, first_err = 0;
 	struct ext4_sb_info *sbi;
 	struct super_block *sb;
@@ -2289,6 +2289,10 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 out:
 	if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
 		err = first_err;
+
+	mb_debug(1, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n",
+		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
+		 ac->ac_flags, cr, err);
 	return err;
 }
 
@@ -3949,7 +3953,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	mb_debug(1, "discard preallocation for group %u\n", group);
 
 	if (list_empty(&grp->bb_prealloc_list))
-		return 0;
+		goto out_dbg;
 
 	bitmap_bh = ext4_read_block_bitmap(sb, group);
 	if (IS_ERR(bitmap_bh)) {
@@ -3957,7 +3961,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		ext4_error_err(sb, -err,
 			       "Error %d reading block bitmap for %u",
 			       err, group);
-		return 0;
+		goto out_dbg;
 	}
 
 	err = ext4_mb_load_buddy(sb, group, &e4b);
@@ -3965,7 +3969,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		ext4_warning(sb, "Error %d loading buddy information for %u",
 			     err, group);
 		put_bh(bitmap_bh);
-		return 0;
+		goto out_dbg;
 	}
 
 	if (needed == 0)
@@ -4011,6 +4015,8 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	/* found anything to free? */
 	if (list_empty(&list)) {
 		BUG_ON(free != 0);
+		mb_debug(1, "Someone else may have freed PA for this group %u\n",
+			 group);
 		goto out;
 	}
 
@@ -4035,6 +4041,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	ext4_unlock_group(sb, group);
 	ext4_mb_unload_buddy(&e4b);
 	put_bh(bitmap_bh);
+out_dbg:
+	mb_debug(1, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n",
+		 free, group, grp->bb_free);
 	return free;
 }
 
@@ -4602,6 +4611,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 			ar->len = ar->len >> 1;
 		}
 		if (!ar->len) {
+			ext4_mb_show_pa(sb);
 			*errp = -ENOSPC;
 			return 0;
 		}
-- 
2.21.0



* [RFC 06/20] ext4: mballoc: Correct the mb_debug() format specifier for pa_len var
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (4 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 05/20] ext4: mballoc: Add more mb_debug() msgs Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 07/20] ext4: mballoc: Fix few other format specifier in mb_debug() Ritesh Harjani
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

pa->pa_len is an integer. Fix all of the format specifiers used in
mb_debug() for pa_len to be %d instead of %u.

As such no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 2acabc3927e0..a4d9abf97193 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3749,7 +3749,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 	pa->pa_deleted = 0;
 	pa->pa_type = MB_INODE_PA;
 
-	mb_debug(1, "new inode pa %p: %llu/%u for %u\n", pa,
+	mb_debug(1, "new inode pa %p: %llu/%d for %u\n", pa,
 			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
 	trace_ext4_mb_new_inode_pa(ac, pa);
 
@@ -3810,7 +3810,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 	pa->pa_deleted = 0;
 	pa->pa_type = MB_GROUP_PA;
 
-	mb_debug(1, "new group pa %p: %llu/%u for %u\n", pa,
+	mb_debug(1, "new group pa %p: %llu/%d for %u\n", pa,
 			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
 	trace_ext4_mb_new_group_pa(ac, pa);
 
@@ -3893,10 +3893,10 @@ ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
 	}
 	if (free != pa->pa_free) {
 		ext4_msg(e4b->bd_sb, KERN_CRIT,
-			 "pa %p: logic %lu, phys. %lu, len %lu",
+			 "pa %p: logic %lu, phys. %lu, len %d",
 			 pa, (unsigned long) pa->pa_lstart,
 			 (unsigned long) pa->pa_pstart,
-			 (unsigned long) pa->pa_len);
+			 pa->pa_len);
 		ext4_grp_locked_error(sb, group, 0, 0, "free %u, pa_free %u",
 					free, pa->pa_free);
 		/*
@@ -4184,7 +4184,7 @@ static inline void ext4_mb_show_pa(struct super_block *sb)
 			ext4_get_group_no_and_offset(sb, pa->pa_pstart,
 						     NULL, &start);
 			spin_unlock(&pa->pa_lock);
-			printk(KERN_ERR "PA:%u:%d:%u \n", i,
+			printk(KERN_ERR "PA:%u:%d:%d \n", i,
 			       start, pa->pa_len);
 		}
 		ext4_unlock_group(sb, i);
-- 
2.21.0



* [RFC 07/20] ext4: mballoc: Fix few other format specifier in mb_debug()
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (5 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 06/20] ext4: mballoc: Correct the mb_debug() format specifier for pa_len var Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 08/20] ext4: mballoc: Simplify error handling in ext4_init_mballoc() Ritesh Harjani
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

Fix a few other format specifiers in mb_debug() msgs.
As such, no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a4d9abf97193..2999b41bb5f8 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3309,8 +3309,8 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 		ac->ac_flags |= EXT4_MB_HINT_TRY_GOAL;
 	}
 
-	mb_debug(1, "goal: %u(was %u) blocks at %u\n", (unsigned) size,
-		(unsigned) orig_size, (unsigned) start);
+	mb_debug(1, "goal: %lld(was %lld) blocks at %u\n", size, orig_size,
+		 start);
 }
 
 static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
@@ -3399,7 +3399,7 @@ static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac,
 	BUG_ON(pa->pa_free < len);
 	pa->pa_free -= len;
 
-	mb_debug(1, "use %llu/%u from inode pa %p\n", start, len, pa);
+	mb_debug(1, "use %llu/%d from inode pa %p\n", start, len, pa);
 }
 
 /*
@@ -3606,7 +3606,7 @@ void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
 		ext4_set_bits(bitmap, start, len);
 		preallocated += len;
 	}
-	mb_debug(1, "preallocated %u for group %u\n", preallocated, group);
+	mb_debug(1, "preallocated %d for group %u\n", preallocated, group);
 }
 
 static void ext4_mb_pa_callback(struct rcu_head *head)
@@ -4205,7 +4205,7 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
 
 	ext4_msg(sb, KERN_ERR, "Can't allocate:"
 			" Allocation context details:");
-	ext4_msg(sb, KERN_ERR, "status %d flags %d",
+	ext4_msg(sb, KERN_ERR, "status %u flags 0x%x",
 			ac->ac_status, ac->ac_flags);
 	ext4_msg(sb, KERN_ERR, "orig %lu/%lu/%lu@%lu, "
 			"goal %lu/%lu/%lu@%lu, "
@@ -4223,7 +4223,7 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
 			(unsigned long)ac->ac_b_ex.fe_len,
 			(unsigned long)ac->ac_b_ex.fe_logical,
 			(int)ac->ac_criteria);
-	ext4_msg(sb, KERN_ERR, "%d found", ac->ac_found);
+	ext4_msg(sb, KERN_ERR, "%u found", ac->ac_found);
 	ext4_mb_show_pa(sb);
 }
 #else
-- 
2.21.0



* [RFC 08/20] ext4: mballoc: Simplify error handling in ext4_init_mballoc()
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (6 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 07/20] ext4: mballoc: Fix few other format specifier in mb_debug() Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 09/20] ext4: mballoc: Make ext4_mb_use_preallocated() return type as bool Ritesh Harjani
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

This patch simplifies error handling logic in ext4_init_mballoc(),
by adding all the cleanups at one place at the end of that function.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 2999b41bb5f8..10f295c61ccb 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2942,23 +2942,26 @@ int __init ext4_init_mballoc(void)
 	ext4_pspace_cachep = KMEM_CACHE(ext4_prealloc_space,
 					SLAB_RECLAIM_ACCOUNT);
 	if (ext4_pspace_cachep == NULL)
-		return -ENOMEM;
+		goto out;
 
 	ext4_ac_cachep = KMEM_CACHE(ext4_allocation_context,
 				    SLAB_RECLAIM_ACCOUNT);
-	if (ext4_ac_cachep == NULL) {
-		kmem_cache_destroy(ext4_pspace_cachep);
-		return -ENOMEM;
-	}
+	if (ext4_ac_cachep == NULL)
+		goto out_pa_free;
 
 	ext4_free_data_cachep = KMEM_CACHE(ext4_free_data,
 					   SLAB_RECLAIM_ACCOUNT);
-	if (ext4_free_data_cachep == NULL) {
-		kmem_cache_destroy(ext4_pspace_cachep);
-		kmem_cache_destroy(ext4_ac_cachep);
-		return -ENOMEM;
-	}
+	if (ext4_free_data_cachep == NULL)
+		goto out_ac_free;
+
 	return 0;
+
+out_ac_free:
+	kmem_cache_destroy(ext4_ac_cachep);
+out_pa_free:
+	kmem_cache_destroy(ext4_pspace_cachep);
+out:
+	return -ENOMEM;
 }
 
 void ext4_exit_mballoc(void)
-- 
2.21.0



* [RFC 09/20] ext4: mballoc: Make ext4_mb_use_preallocated() return type as bool
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (7 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 08/20] ext4: mballoc: Simplify error handling in ext4_init_mballoc() Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 10/20] ext4: mballoc: Remove EXT4_MB_HINT_GOAL_ONLY and it's related code Ritesh Harjani
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

Change return type of function ext4_mb_use_preallocated() to bool to
better reflect what this function can return.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 10f295c61ccb..6e7232fd109e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3461,7 +3461,7 @@ ext4_mb_check_group_pa(ext4_fsblk_t goal_block,
 /*
  * search goal blocks in preallocated space
  */
-static noinline_for_stack int
+static noinline_for_stack bool
 ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
@@ -3473,7 +3473,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 
 	/* only data can be preallocated */
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
-		return 0;
+		return false;
 
 	/* first, try per-file preallocation */
 	rcu_read_lock();
@@ -3500,7 +3500,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 			spin_unlock(&pa->pa_lock);
 			ac->ac_criteria = 10;
 			rcu_read_unlock();
-			return 1;
+			return true;
 		}
 		spin_unlock(&pa->pa_lock);
 	}
@@ -3508,12 +3508,12 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 
 	/* can we use group allocation? */
 	if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
-		return 0;
+		return false;
 
 	/* inode may have no locality group for some reason */
 	lg = ac->ac_lg;
 	if (lg == NULL)
-		return 0;
+		return false;
 	order  = fls(ac->ac_o_ex.fe_len) - 1;
 	if (order > PREALLOC_TB_SIZE - 1)
 		/* The max size of hash table is PREALLOC_TB_SIZE */
@@ -3542,9 +3542,9 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	if (cpa) {
 		ext4_mb_use_group_pa(ac, cpa);
 		ac->ac_criteria = 20;
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
 /*
-- 
2.21.0



* [RFC 10/20] ext4: mballoc: Remove EXT4_MB_HINT_GOAL_ONLY and it's related code
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (8 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 09/20] ext4: mballoc: Make ext4_mb_use_preallocated() return type as bool Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 11/20] ext4: mballoc: Refactor code inside DOUBLE_CHECK into separate function Ritesh Harjani
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

We don't set the EXT4_MB_HINT_GOAL_ONLY flag anywhere and, from our last
discussion, we don't see a need/use case for it anytime in the near
future either.
So just kill the flag and all of its references.
This also adjusts the remaining ac_flags macro values accordingly.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/ext4.h              | 12 +++++-------
 fs/ext4/mballoc.c           | 10 ----------
 include/trace/events/ext4.h |  1 -
 3 files changed, 5 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 91eb4381cae5..db4fb62c1169 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -130,18 +130,16 @@ enum SHIFT_DIRECTION {
 #define EXT4_MB_HINT_NOPREALLOC		0x0040
 /* allocate for locality group */
 #define EXT4_MB_HINT_GROUP_ALLOC	0x0080
-/* allocate goal blocks or none */
-#define EXT4_MB_HINT_GOAL_ONLY		0x0100
 /* goal is meaningful */
-#define EXT4_MB_HINT_TRY_GOAL		0x0200
+#define EXT4_MB_HINT_TRY_GOAL		0x0100
 /* blocks already pre-reserved by delayed allocation */
-#define EXT4_MB_DELALLOC_RESERVED	0x0400
+#define EXT4_MB_DELALLOC_RESERVED	0x0200
 /* We are doing stream allocation */
-#define EXT4_MB_STREAM_ALLOC		0x0800
+#define EXT4_MB_STREAM_ALLOC		0x0400
 /* Use reserved root blocks if needed */
-#define EXT4_MB_USE_ROOT_BLOCKS		0x1000
+#define EXT4_MB_USE_ROOT_BLOCKS		0x0800
 /* Use blocks from reserved pool */
-#define EXT4_MB_USE_RESERVED		0x2000
+#define EXT4_MB_USE_RESERVED		0x1000
 
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 6e7232fd109e..4d6effe22652 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2157,9 +2157,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	if (err || ac->ac_status == AC_STATUS_FOUND)
 		goto out;
 
-	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
-		goto out;
-
 	/*
 	 * ac->ac2_order is set only if the fe_len is a power of 2
 	 * if ac2_order is set we also set criteria to 0 so that we
@@ -3139,10 +3136,6 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return;
 
-	/* sometime caller may want exact blocks */
-	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
-		return;
-
 	/* caller may indicate that preallocation isn't
 	 * required (it's a tail, for example) */
 	if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
@@ -4257,9 +4250,6 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return;
 
-	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
-		return;
-
 	size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
 	isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
 		>> bsbits;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 19c87661eeec..a2a603172f57 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -30,7 +30,6 @@ struct partial_cluster;
 	{ EXT4_MB_HINT_DATA,		"HINT_DATA" },		\
 	{ EXT4_MB_HINT_NOPREALLOC,	"HINT_NOPREALLOC" },	\
 	{ EXT4_MB_HINT_GROUP_ALLOC,	"HINT_GRP_ALLOC" },	\
-	{ EXT4_MB_HINT_GOAL_ONLY,	"HINT_GOAL_ONLY" },	\
 	{ EXT4_MB_HINT_TRY_GOAL,	"HINT_TRY_GOAL" },	\
 	{ EXT4_MB_DELALLOC_RESERVED,	"DELALLOC_RESV" },	\
 	{ EXT4_MB_STREAM_ALLOC,		"STREAM_ALLOC" },	\
-- 
2.21.0



* [RFC 11/20] ext4: mballoc: Refactor code inside DOUBLE_CHECK into separate function
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (9 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 10/20] ext4: mballoc: Remove EXT4_MB_HINT_GOAL_ONLY and it's related code Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 12/20] ext4: mballoc: Fix possible NULL ptr dereference from mb_cmp_bitmaps() Ritesh Harjani
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

This patch implements the mb_group_bb_bitmap_alloc() and
mb_group_bb_bitmap_free() functions to remove the #ifdef DOUBLE_CHECK
macro and its related code from inside
ext4_mb_add_groupinfo()/ext4_mb_release().

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 50 +++++++++++++++++++++++++++++++----------------
 1 file changed, 33 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4d6effe22652..5e59c18c89c0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -540,6 +540,26 @@ static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
 	}
 }
 
+static void mb_group_bb_bitmap_alloc(struct super_block *sb,
+			struct ext4_group_info *grp, ext4_group_t group)
+{
+	struct buffer_head *bh;
+
+	grp->bb_bitmap = kmalloc(sb->s_blocksize, GFP_NOFS);
+	BUG_ON(grp->bb_bitmap == NULL);
+
+	bh = ext4_read_block_bitmap(sb, group);
+	BUG_ON(IS_ERR_OR_NULL(bh));
+
+	memcpy(grp->bb_bitmap, bh->b_data, sb->s_blocksize);
+	put_bh(bh);
+}
+
+static void mb_group_bb_bitmap_free(struct ext4_group_info *grp)
+{
+	kfree(grp->bb_bitmap);
+}
+
 #else
 static inline void mb_free_blocks_double(struct inode *inode,
 				struct ext4_buddy *e4b, int first, int count)
@@ -555,6 +575,17 @@ static inline void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
 {
 	return;
 }
+
+static inline void mb_group_bb_bitmap_alloc(struct super_block *sb,
+			struct ext4_group_info *grp, ext4_group_t group)
+{
+	return;
+}
+
+static inline void mb_group_bb_bitmap_free(struct ext4_group_info *grp)
+{
+	return;
+}
 #endif
 
 #ifdef AGGRESSIVE_CHECK
@@ -2482,20 +2513,7 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
 	meta_group_info[i]->bb_free_root = RB_ROOT;
 	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
 
-#ifdef DOUBLE_CHECK
-	{
-		struct buffer_head *bh;
-		meta_group_info[i]->bb_bitmap =
-			kmalloc(sb->s_blocksize, GFP_NOFS);
-		BUG_ON(meta_group_info[i]->bb_bitmap == NULL);
-		bh = ext4_read_block_bitmap(sb, group);
-		BUG_ON(IS_ERR_OR_NULL(bh));
-		memcpy(meta_group_info[i]->bb_bitmap, bh->b_data,
-			sb->s_blocksize);
-		put_bh(bh);
-	}
-#endif
-
+	mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
 	return 0;
 
 exit_group_info:
@@ -2762,9 +2780,7 @@ int ext4_mb_release(struct super_block *sb)
 		for (i = 0; i < ngroups; i++) {
 			cond_resched();
 			grinfo = ext4_get_group_info(sb, i);
-#ifdef DOUBLE_CHECK
-			kfree(grinfo->bb_bitmap);
-#endif
+			mb_group_bb_bitmap_free(grinfo);
 			ext4_lock_group(sb, i);
 			ext4_mb_cleanup_pa(grinfo);
 			ext4_unlock_group(sb, i);
-- 
2.21.0



* [RFC 12/20] ext4: mballoc: Fix possible NULL ptr dereference from mb_cmp_bitmaps()
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (10 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 11/20] ext4: mballoc: Refactor code inside DOUBLE_CHECK into separate function Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 13/20] ext4: mballoc: Don't BUG if kmalloc or read blk bitmap fail for DOUBLE_CHECK Ritesh Harjani
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

Make sure to check for e4b->bd_info->bb_bitmap == NULL in
mb_cmp_bitmaps() and return if it is NULL, to avoid a possible NULL ptr
dereference, similar to how we do it in the other ifdef DOUBLE_CHECK
functions.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5e59c18c89c0..e32f3675f962 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -522,6 +522,8 @@ static void mb_mark_used_double(struct ext4_buddy *e4b, int first, int count)
 
 static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
 {
+	if (unlikely(e4b->bd_info->bb_bitmap == NULL))
+		return;
 	if (memcmp(e4b->bd_info->bb_bitmap, bitmap, e4b->bd_sb->s_blocksize)) {
 		unsigned char *b1, *b2;
 		int i;
-- 
2.21.0



* [RFC 13/20] ext4: mballoc: Don't BUG if kmalloc or read blk bitmap fail for DOUBLE_CHECK
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (11 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 12/20] ext4: mballoc: Fix possible NULL ptr dereference from mb_cmp_bitmaps() Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 14/20] ext4: balloc: Use task_pid_nr() helper Ritesh Harjani
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

Remove the BUG_ON() logic if kmalloc() or ext4_read_block_bitmap() fails.
We should simply mark grp->bb_bitmap as NULL if the above happens.
In fact ext4_read_block_bitmap() may even return an error in case of a
resize ioctl. Hence remove this BUG_ON() logic (fstests ext4/032 may
trigger this).

---
Should we add an ext4_msg() if any of the above fails?

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index e32f3675f962..aa22ecf3f827 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -548,13 +548,20 @@ static void mb_group_bb_bitmap_alloc(struct super_block *sb,
 	struct buffer_head *bh;
 
 	grp->bb_bitmap = kmalloc(sb->s_blocksize, GFP_NOFS);
-	BUG_ON(grp->bb_bitmap == NULL);
+	if (!grp->bb_bitmap)
+		goto out;
 
 	bh = ext4_read_block_bitmap(sb, group);
-	BUG_ON(IS_ERR_OR_NULL(bh));
+	if (IS_ERR_OR_NULL(bh)) {
+		kfree(grp->bb_bitmap);
+		grp->bb_bitmap = NULL;
+		goto out;
+	}
 
 	memcpy(grp->bb_bitmap, bh->b_data, sb->s_blocksize);
 	put_bh(bh);
+out:
+	return;
 }
 
 static void mb_group_bb_bitmap_free(struct ext4_group_info *grp)
-- 
2.21.0



* [RFC 14/20] ext4: balloc: Use task_pid_nr() helper
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (12 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 13/20] ext4: mballoc: Don't BUG if kmalloc or read blk bitmap fail for DOUBLE_CHECK Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 15/20] ext4: Use BIT() macro for BH_** state bits Ritesh Harjani
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

Use task_pid_nr() function instead of current->pid.
There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/balloc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index a32e5f7b5385..1ba46d87cdf1 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -903,10 +903,11 @@ ext4_fsblk_t ext4_inode_to_goal_block(struct inode *inode)
 		return bg_start;
 
 	if (bg_start + EXT4_BLOCKS_PER_GROUP(inode->i_sb) <= last_block)
-		colour = (current->pid % 16) *
+		colour = (task_pid_nr(current) % 16) *
 			(EXT4_BLOCKS_PER_GROUP(inode->i_sb) / 16);
 	else
-		colour = (current->pid % 16) * ((last_block - bg_start) / 16);
+		colour = (task_pid_nr(current) % 16) *
+			((last_block - bg_start) / 16);
 	return bg_start + colour;
 }
 
-- 
2.21.0



* [RFC 15/20] ext4: Use BIT() macro for BH_** state bits
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (13 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 14/20] ext4: balloc: Use task_pid_nr() helper Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 16/20] ext4: Improve ext_debug() msg in case of block allocation failure Ritesh Harjani
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

Simply use BIT() macro for all BH_** state bits instead of open
coding it.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/ext4.h  | 8 ++++----
 fs/ext4/inode.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index db4fb62c1169..214102f84a10 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -169,10 +169,10 @@ struct ext4_allocation_request {
  * well as to store the information returned by ext4_map_blocks().  It
  * takes less room on the stack than a struct buffer_head.
  */
-#define EXT4_MAP_NEW		(1 << BH_New)
-#define EXT4_MAP_MAPPED		(1 << BH_Mapped)
-#define EXT4_MAP_UNWRITTEN	(1 << BH_Unwritten)
-#define EXT4_MAP_BOUNDARY	(1 << BH_Boundary)
+#define EXT4_MAP_NEW		BIT(BH_New)
+#define EXT4_MAP_MAPPED		BIT(BH_Mapped)
+#define EXT4_MAP_UNWRITTEN	BIT(BH_Unwritten)
+#define EXT4_MAP_BOUNDARY	BIT(BH_Boundary)
 #define EXT4_MAP_FLAGS		(EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
 				 EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY)
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2a4aae6acdcb..e294abeb7f03 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2078,7 +2078,7 @@ static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page)
 	return err;
 }
 
-#define BH_FLAGS ((1 << BH_Unwritten) | (1 << BH_Delay))
+#define BH_FLAGS (BIT(BH_Unwritten) | BIT(BH_Delay))
 
 /*
  * mballoc gives us at most this number of blocks...
@@ -2358,7 +2358,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	dioread_nolock = ext4_should_dioread_nolock(inode);
 	if (dioread_nolock)
 		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
-	if (map->m_flags & (1 << BH_Delay))
+	if (map->m_flags & BIT(BH_Delay))
 		get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
 
 	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
-- 
2.21.0



* [RFC 16/20] ext4: Improve ext_debug() msg in case of block allocation failure
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (14 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 15/20] ext4: Use BIT() macro for BH_** state bits Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:29 ` [RFC 17/20] ext4: Replace EXT_DEBUG with __maybe_unused in ext4_ext_handle_unwritten_extents() Ritesh Harjani
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

ext4_map_blocks() has an ext_debug() msg early at the start of the
function. We also get an ext_debug() msg if we could allocate a block from
ext4_ext_map_blocks(). But there is no ext_debug() msg in case the block
allocation fails. So add one along with the error code.

Also add more info in the ext_debug() msg, like how many blocks were
allocated vs how many were requested in ext4_ext_map_blocks().

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/extents.c | 4 ++--
 fs/ext4/inode.c   | 4 ++++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f2b577b315a0..461600c07316 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4218,10 +4218,10 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	newblock = ext4_mb_new_blocks(handle, &ar, &err);
 	if (!newblock)
 		goto out2;
-	ext_debug("allocate new block: goal %llu, found %llu/%u\n",
-		  ar.goal, newblock, allocated);
 	allocated_clusters = ar.len;
 	ar.len = EXT4_C2B(sbi, ar.len) - offset;
+	ext_debug("allocate new block: goal %llu, found %llu/%u, requested %u\n",
+		  ar.goal, newblock, ar.len, allocated);
 	if (ar.len > allocated)
 		ar.len = allocated;
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e294abeb7f03..5f120af22d48 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -726,6 +726,10 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 				return ret;
 		}
 	}
+
+	if (retval < 0)
+		ext_debug("failed for inode %lu with err %d\n",
+			  inode->i_ino, retval);
 	return retval;
 }
 
-- 
2.21.0



* [RFC 17/20] ext4: Replace EXT_DEBUG with __maybe_unused in ext4_ext_handle_unwritten_extents()
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (15 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 16/20] ext4: Improve ext_debug() msg in case of block allocation failure Ritesh Harjani
@ 2020-05-01  6:29 ` Ritesh Harjani
  2020-05-01  6:30 ` [RFC 18/20] ext4: mballoc: Make mb_debug() implementation to use pr_debug() Ritesh Harjani
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

Replace EXT_DEBUG with __maybe_unused inside the
ext4_ext_handle_unwritten_extents() function.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/extents.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 461600c07316..dc980fbc49aa 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3794,9 +3794,7 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 			struct ext4_ext_path **ppath, int flags,
 			unsigned int allocated, ext4_fsblk_t newblock)
 {
-#ifdef EXT_DEBUG
-	struct ext4_ext_path *path = *ppath;
-#endif
+	struct ext4_ext_path __maybe_unused *path = *ppath;
 	int ret = 0;
 	int err = 0;
 
-- 
2.21.0



* [RFC 18/20] ext4: mballoc: Make mb_debug() implementation to use pr_debug()
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (16 preceding siblings ...)
  2020-05-01  6:29 ` [RFC 17/20] ext4: Replace EXT_DEBUG with __maybe_unused in ext4_ext_handle_unwritten_extents() Ritesh Harjani
@ 2020-05-01  6:30 ` Ritesh Harjani
  2020-05-01  6:30 ` [RFC 19/20] ext4: Make ext_debug() " Ritesh Harjani
  2020-05-01  6:30 ` [RFC 20/20] ext4: Add process name and pid in ext4_msg() Ritesh Harjani
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:30 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

mb_debug() msgs have only one control level for all types of msgs, so
enabling mballoc_debug enables all of them at once. Instead of adding
multiple debug levels to mb_debug(), use pr_debug(), which gives finer
control over which msgs are printed (e.g. per file, per function or
per line no.) via dynamic debug.

Also add the process name/pid, superblock id and other info to the
mb_debug() msgs. This kills the mballoc_debug module parameter, since
it is no longer needed.
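
For example, assuming CONFIG_DYNAMIC_DEBUG is enabled and debugfs is
mounted, all mb_debug() msgs from mballoc.c could then be enabled at
run time with something like:

  echo 'file mballoc.c +p' > /sys/kernel/debug/dynamic_debug/control

or just the msgs of a single function, e.g.:

  echo 'func ext4_mb_regular_allocator +p' > /sys/kernel/debug/dynamic_debug/control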

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/Kconfig   |   3 +-
 fs/ext4/mballoc.c | 106 +++++++++++++++++++++-------------------------
 fs/ext4/mballoc.h |  16 +++----
 3 files changed, 56 insertions(+), 69 deletions(-)

diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
index 2a592e38cdfe..02376ddb0cb5 100644
--- a/fs/ext4/Kconfig
+++ b/fs/ext4/Kconfig
@@ -99,8 +99,7 @@ config EXT4_DEBUG
 	  Enables run-time debugging support for the ext4 filesystem.
 
 	  If you select Y here, then you will be able to turn on debugging
-	  with a command such as:
-		echo 1 > /sys/module/ext4/parameters/mballoc_debug
+	  using dynamic debug control for mb_debug() msgs.
 
 config EXT4_KUNIT_TESTS
 	tristate "KUnit tests for ext4"
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index aa22ecf3f827..72d7bd6ac3ba 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -18,13 +18,6 @@
 #include <linux/backing-dev.h>
 #include <trace/events/ext4.h>
 
-#ifdef CONFIG_EXT4_DEBUG
-ushort ext4_mballoc_debug __read_mostly;
-
-module_param_named(mballoc_debug, ext4_mballoc_debug, ushort, 0644);
-MODULE_PARM_DESC(mballoc_debug, "Debugging level for ext4's mballoc");
-#endif
-
 /*
  * MUSTDO:
  *   - test ext4_ext_search_left() and ext4_ext_search_right()
@@ -889,14 +882,14 @@ static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp)
 	char *bitmap;
 	struct ext4_group_info *grinfo;
 
-	mb_debug(1, "init page %lu\n", page->index);
-
 	inode = page->mapping->host;
 	sb = inode->i_sb;
 	ngroups = ext4_get_groups_count(sb);
 	blocksize = i_blocksize(inode);
 	blocks_per_page = PAGE_SIZE / blocksize;
 
+	mb_debug(sb, "init page %lu\n", page->index);
+
 	groups_per_page = blocks_per_page >> 1;
 	if (groups_per_page == 0)
 		groups_per_page = 1;
@@ -936,7 +929,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp)
 			bh[i] = NULL;
 			goto out;
 		}
-		mb_debug(1, "read bitmap for group %u\n", group);
+		mb_debug(sb, "read bitmap for group %u\n", group);
 	}
 
 	/* wait for I/O completion */
@@ -981,7 +974,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp)
 		if ((first_block + i) & 1) {
 			/* this is block of buddy */
 			BUG_ON(incore == NULL);
-			mb_debug(1, "put buddy for group %u in page %lu/%x\n",
+			mb_debug(sb, "put buddy for group %u in page %lu/%x\n",
 				group, page->index, i * blocksize);
 			trace_ext4_mb_buddy_bitmap_load(sb, group);
 			grinfo = ext4_get_group_info(sb, group);
@@ -1001,7 +994,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp)
 		} else {
 			/* this is block of bitmap */
 			BUG_ON(incore != NULL);
-			mb_debug(1, "put bitmap for group %u in page %lu/%x\n",
+			mb_debug(sb, "put bitmap for group %u in page %lu/%x\n",
 				group, page->index, i * blocksize);
 			trace_ext4_mb_bitmap_load(sb, group);
 
@@ -1107,7 +1100,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
 	int ret = 0;
 
 	might_sleep();
-	mb_debug(1, "init group %u\n", group);
+	mb_debug(sb, "init group %u\n", group);
 	this_grp = ext4_get_group_info(sb, group);
 	/*
 	 * This ensures that we don't reinit the buddy cache
@@ -1179,7 +1172,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
 	struct inode *inode = sbi->s_buddy_cache;
 
 	might_sleep();
-	mb_debug(1, "load group %u\n", group);
+	mb_debug(sb, "load group %u\n", group);
 
 	blocks_per_page = PAGE_SIZE / sb->s_blocksize;
 	grp = ext4_get_group_info(sb, group);
@@ -2327,7 +2320,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
 		err = first_err;
 
-	mb_debug(1, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n",
+	mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n",
 		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
 		 ac->ac_flags, cr, err);
 	return err;
@@ -2759,7 +2752,7 @@ int ext4_mb_init(struct super_block *sb)
 }
 
 /* need to called with the ext4 group lock held */
-static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
+static int ext4_mb_cleanup_pa(struct ext4_group_info *grp)
 {
 	struct ext4_prealloc_space *pa;
 	struct list_head *cur, *tmp;
@@ -2771,9 +2764,7 @@ static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
 		count++;
 		kmem_cache_free(ext4_pspace_cachep, pa);
 	}
-	if (count)
-		mb_debug(1, "mballoc: %u PAs left\n", count);
-
+	return count;
 }
 
 int ext4_mb_release(struct super_block *sb)
@@ -2784,6 +2775,7 @@ int ext4_mb_release(struct super_block *sb)
 	struct ext4_group_info *grinfo, ***group_info;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct kmem_cache *cachep = get_groupinfo_cache(sb->s_blocksize_bits);
+	int count;
 
 	if (sbi->s_group_info) {
 		for (i = 0; i < ngroups; i++) {
@@ -2791,7 +2783,10 @@ int ext4_mb_release(struct super_block *sb)
 			grinfo = ext4_get_group_info(sb, i);
 			mb_group_bb_bitmap_free(grinfo);
 			ext4_lock_group(sb, i);
-			ext4_mb_cleanup_pa(grinfo);
+			count = ext4_mb_cleanup_pa(grinfo);
+			if (count)
+				mb_debug(sb, "mballoc: %d PAs left\n",
+					 count);
 			ext4_unlock_group(sb, i);
 			kmem_cache_free(cachep, grinfo);
 		}
@@ -2864,7 +2859,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
 	struct ext4_group_info *db;
 	int err, count = 0, count2 = 0;
 
-	mb_debug(1, "gonna free %u blocks in group %u (0x%p):",
+	mb_debug(sb, "gonna free %u blocks in group %u (0x%p):",
 		 entry->efd_count, entry->efd_group, entry);
 
 	err = ext4_mb_load_buddy(sb, entry->efd_group, &e4b);
@@ -2904,7 +2899,8 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
 	kmem_cache_free(ext4_free_data_cachep, entry);
 	ext4_mb_unload_buddy(&e4b);
 
-	mb_debug(1, "freed %u blocks in %u structures\n", count, count2);
+	mb_debug(sb, "freed %d blocks in %d structures\n", count,
+		 count2);
 }
 
 /*
@@ -3135,8 +3131,7 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
 
 	BUG_ON(lg == NULL);
 	ac->ac_g_ex.fe_len = EXT4_SB(sb)->s_mb_group_prealloc;
-	mb_debug(1, "#%u: goal %u blocks for locality group\n",
-		current->pid, ac->ac_g_ex.fe_len);
+	mb_debug(sb, "goal %u blocks for locality group\n", ac->ac_g_ex.fe_len);
 }
 
 /*
@@ -3330,8 +3325,8 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 		ac->ac_flags |= EXT4_MB_HINT_TRY_GOAL;
 	}
 
-	mb_debug(1, "goal: %lld(was %lld) blocks at %u\n", size, orig_size,
-		 start);
+	mb_debug(ac->ac_sb, "goal: %lld(was %lld) blocks at %u\n", size,
+		 orig_size, start);
 }
 
 static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
@@ -3420,7 +3415,7 @@ static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac,
 	BUG_ON(pa->pa_free < len);
 	pa->pa_free -= len;
 
-	mb_debug(1, "use %llu/%d from inode pa %p\n", start, len, pa);
+	mb_debug(ac->ac_sb, "use %llu/%d from inode pa %p\n", start, len, pa);
 }
 
 /*
@@ -3444,7 +3439,8 @@ static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac,
 	 * in on-disk bitmap -- see ext4_mb_release_context()
 	 * Other CPUs are prevented from allocating from this pa by lg_mutex
 	 */
-	mb_debug(1, "use %u/%u from group pa %p\n", pa->pa_lstart-len, len, pa);
+	mb_debug(ac->ac_sb, "use %u/%u from group pa %p\n",
+		 pa->pa_lstart-len, len, pa);
 }
 
 /*
@@ -3627,7 +3623,7 @@ void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
 		ext4_set_bits(bitmap, start, len);
 		preallocated += len;
 	}
-	mb_debug(1, "preallocated %d for group %u\n", preallocated, group);
+	mb_debug(sb, "preallocated %d for group %u\n", preallocated, group);
 }
 
 static void ext4_mb_pa_callback(struct rcu_head *head)
@@ -3770,8 +3766,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 	pa->pa_deleted = 0;
 	pa->pa_type = MB_INODE_PA;
 
-	mb_debug(1, "new inode pa %p: %llu/%d for %u\n", pa,
-			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
+	mb_debug(sb, "new inode pa %p: %llu/%d for %u\n", pa, pa->pa_pstart,
+		 pa->pa_len, pa->pa_lstart);
 	trace_ext4_mb_new_inode_pa(ac, pa);
 
 	ext4_mb_use_inode_pa(ac, pa);
@@ -3831,8 +3827,8 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 	pa->pa_deleted = 0;
 	pa->pa_type = MB_GROUP_PA;
 
-	mb_debug(1, "new group pa %p: %llu/%d for %u\n", pa,
-			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
+	mb_debug(sb, "new group pa %p: %llu/%d for %u\n", pa, pa->pa_pstart,
+		 pa->pa_len, pa->pa_lstart);
 	trace_ext4_mb_new_group_pa(ac, pa);
 
 	ext4_mb_use_group_pa(ac, pa);
@@ -3900,7 +3896,7 @@ ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
 		if (bit >= end)
 			break;
 		next = mb_find_next_bit(bitmap_bh->b_data, end, bit);
-		mb_debug(1, "    free preallocated %u/%u in group %u\n",
+		mb_debug(sb, "free preallocated %u/%u in group %u\n",
 			 (unsigned) ext4_group_first_block_no(sb, group) + bit,
 			 (unsigned) next - bit, (unsigned) group);
 		free += next - bit;
@@ -3971,8 +3967,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	int busy = 0;
 	int free = 0;
 
-	mb_debug(1, "discard preallocation for group %u\n", group);
-
+	mb_debug(sb, "discard preallocation for group %u\n", group);
 	if (list_empty(&grp->bb_prealloc_list))
 		goto out_dbg;
 
@@ -4036,7 +4031,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	/* found anything to free? */
 	if (list_empty(&list)) {
 		BUG_ON(free != 0);
-		mb_debug(1, "Someone else may have freed PA for this group %u\n",
+		mb_debug(sb, "Someone else may have freed PA for this group %u\n",
 			 group);
 		goto out;
 	}
@@ -4063,7 +4058,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	ext4_mb_unload_buddy(&e4b);
 	put_bh(bitmap_bh);
 out_dbg:
-	mb_debug(1, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n",
+	mb_debug(sb, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n",
 		 free, group, grp->bb_free);
 	return free;
 }
@@ -4093,7 +4088,8 @@ void ext4_discard_preallocations(struct inode *inode)
 		return;
 	}
 
-	mb_debug(1, "discard preallocation for inode %lu\n", inode->i_ino);
+	mb_debug(sb, "discard preallocation for inode %lu\n",
+		 inode->i_ino);
 	trace_ext4_discard_preallocations(inode);
 
 	INIT_LIST_HEAD(&list);
@@ -4186,12 +4182,11 @@ static inline void ext4_mb_show_pa(struct super_block *sb)
 {
 	ext4_group_t i, ngroups;
 
-	if (!ext4_mballoc_debug ||
-	    (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED))
+	if (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
 		return;
 
 	ngroups = ext4_get_groups_count(sb);
-	ext4_msg(sb, KERN_ERR, "groups: ");
+	mb_debug(sb, "groups: ");
 	for (i = 0; i < ngroups; i++) {
 		struct ext4_group_info *grp = ext4_get_group_info(sb, i);
 		struct ext4_prealloc_space *pa;
@@ -4205,30 +4200,27 @@ static inline void ext4_mb_show_pa(struct super_block *sb)
 			ext4_get_group_no_and_offset(sb, pa->pa_pstart,
 						     NULL, &start);
 			spin_unlock(&pa->pa_lock);
-			printk(KERN_ERR "PA:%u:%d:%d \n", i,
-			       start, pa->pa_len);
+			mb_debug(sb, "PA:%u:%d:%d\n", i, start,
+				 pa->pa_len);
 		}
 		ext4_unlock_group(sb, i);
-
-		printk(KERN_ERR "%u: %d/%d \n",
-		       i, grp->bb_free, grp->bb_fragments);
+		mb_debug(sb, "%u: %d/%d\n", i, grp->bb_free,
+			 grp->bb_fragments);
 	}
-	printk(KERN_ERR "\n");
 }
 
 static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
 {
 	struct super_block *sb = ac->ac_sb;
 
-	if (!ext4_mballoc_debug ||
-	    (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED))
+	if (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
 		return;
 
-	ext4_msg(sb, KERN_ERR, "Can't allocate:"
+	mb_debug(sb, "Can't allocate:"
 			" Allocation context details:");
-	ext4_msg(sb, KERN_ERR, "status %u flags 0x%x",
+	mb_debug(sb, "status %u flags 0x%x",
 			ac->ac_status, ac->ac_flags);
-	ext4_msg(sb, KERN_ERR, "orig %lu/%lu/%lu@%lu, "
+	mb_debug(sb, "orig %lu/%lu/%lu@%lu, "
 			"goal %lu/%lu/%lu@%lu, "
 			"best %lu/%lu/%lu@%lu cr %d",
 			(unsigned long)ac->ac_o_ex.fe_group,
@@ -4244,7 +4236,7 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
 			(unsigned long)ac->ac_b_ex.fe_len,
 			(unsigned long)ac->ac_b_ex.fe_logical,
 			(int)ac->ac_criteria);
-	ext4_msg(sb, KERN_ERR, "%u found", ac->ac_found);
+	mb_debug(sb, "%u found", ac->ac_found);
 	ext4_mb_show_pa(sb);
 }
 #else
@@ -4354,7 +4346,7 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
 	 * locality group. this is a policy, actually */
 	ext4_mb_group_or_file(ac);
 
-	mb_debug(1, "init ac: %u blocks @ %u, goal %u, flags %x, 2^%d, "
+	mb_debug(sb, "init ac: %u blocks @ %u, goal %u, flags 0x%x, 2^%d, "
 			"left: %u/%u, right %u/%u to %swritable\n",
 			(unsigned) ar->len, (unsigned) ar->logical,
 			(unsigned) ar->goal, ac->ac_flags, ac->ac_2order,
@@ -4375,7 +4367,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
 	struct list_head discard_list;
 	struct ext4_prealloc_space *pa, *tmp;
 
-	mb_debug(1, "discard locality group preallocation\n");
+	mb_debug(sb, "discard locality group preallocation\n");
 
 	INIT_LIST_HEAD(&discard_list);
 
@@ -4586,7 +4578,7 @@ static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
 	}
 
 out_dbg:
-	mb_debug(1, "freed %d, retry ? %s\n", freed, ret ? "yes" : "no");
+	mb_debug(sb, "freed %d, retry ? %s\n", freed, ret ? "yes" : "no");
 	return ret;
 }
 
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 88c98f17e3d9..6b4d17c2935d 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -24,19 +24,15 @@
 #include "ext4.h"
 
 /*
+ * mb_debug() dynamic printk msgs could be used to debug mballoc code.
  */
 #ifdef CONFIG_EXT4_DEBUG
-extern ushort ext4_mballoc_debug;
-
-#define mb_debug(n, fmt, ...)	                                        \
-do {									\
-	if ((n) <= ext4_mballoc_debug) {				\
-		printk(KERN_DEBUG "(%s, %d): %s: " fmt,			\
-		       __FILE__, __LINE__, __func__, ##__VA_ARGS__);	\
-	}								\
-} while (0)
+#define mb_debug(sb, fmt, ...)						\
+	pr_debug("[%s/%d] EXT4-fs (%s): (%s, %d): %s: " fmt,		\
+		current->comm, task_pid_nr(current), sb->s_id,		\
+	       __FILE__, __LINE__, __func__, ##__VA_ARGS__)
 #else
-#define mb_debug(n, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
+#define mb_debug(sb, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
 #define EXT4_MB_HISTORY_ALLOC		1	/* allocation */
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC 19/20] ext4: Make ext_debug() implementation to use pr_debug()
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (17 preceding siblings ...)
  2020-05-01  6:30 ` [RFC 18/20] ext4: mballoc: Make mb_debug() implementation to use pr_debug() Ritesh Harjani
@ 2020-05-01  6:30 ` Ritesh Harjani
  2020-05-01  6:30 ` [RFC 20/20] ext4: Add process name and pid in ext4_msg() Ritesh Harjani
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:30 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

ext_debug() msgs are much more useful if they can be enabled without
recompiling the kernel and if only the required prints can be
selectively enabled for case-by-case debugging.

So make the ext_debug() implementation use pr_debug() and define it
under CONFIG_EXT4_DEBUG. The EXT_DEBUG macro now mostly remains only
for the three functions ext4_ext_show_path/leaf/move(), whose print
msgs use ext_debug() and hence can also be enabled dynamically via
pr_debug().

This also changes ext_debug() to take the inode as a parameter so that
the inode no. is added to all of its msgs, along with additional info
like process name/pid, superblock id etc. Explicit function names
passed to ext_debug() are removed, since ext_debug() already prints
the file, function and line no. on its own.
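
As a sketch of the selective control this gives (again assuming
CONFIG_DYNAMIC_DEBUG is enabled), only the ext_debug() msgs from
ext4_ext_map_blocks() could be enabled with something like:

  echo 'func ext4_ext_map_blocks +p' > /sys/kernel/debug/dynamic_debug/control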

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/Kconfig   |   2 +-
 fs/ext4/ext4.h    |  18 ++++--
 fs/ext4/extents.c | 144 ++++++++++++++++++++++------------------------
 fs/ext4/inode.c   |  11 ++--
 4 files changed, 87 insertions(+), 88 deletions(-)

diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
index 02376ddb0cb5..cf9e430514c4 100644
--- a/fs/ext4/Kconfig
+++ b/fs/ext4/Kconfig
@@ -99,7 +99,7 @@ config EXT4_DEBUG
 	  Enables run-time debugging support for the ext4 filesystem.
 
 	  If you select Y here, then you will be able to turn on debugging
-	  using dynamic debug control for mb_debug() msgs.
+	  using dynamic debug control for mb_debug() / ext_debug() msgs.
 
 config EXT4_KUNIT_TESTS
 	tristate "KUnit tests for ext4"
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 214102f84a10..1cc7a43fbc11 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -80,14 +80,22 @@
 #define ext4_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
+ /*
+  * Turn on EXT_DEBUG to enable ext4_ext_show_path/leaf/move in extents.c
+  */
+#define EXT_DEBUG__
+
 /*
- * Turn on EXT_DEBUG to get lots of info about extents operations.
+ * Dynamic printk for controlled extents debugging.
  */
-#define EXT_DEBUG__
-#ifdef EXT_DEBUG
-#define ext_debug(fmt, ...)	printk(fmt, ##__VA_ARGS__)
+#ifdef CONFIG_EXT4_DEBUG
+#define ext_debug(ino, fmt, ...)					\
+	pr_debug("[%s/%d] EXT4-fs (%s): ino %lu: (%s, %d): %s:" fmt,	\
+		 current->comm, task_pid_nr(current),			\
+		 ino->i_sb->s_id, ino->i_ino, __FILE__, __LINE__,	\
+		 __func__, ##__VA_ARGS__)
 #else
-#define ext_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
+#define ext_debug(ino, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
 /* data type for block offset of block group */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index dc980fbc49aa..155f9c6c1279 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -600,22 +600,22 @@ static void ext4_ext_show_path(struct inode *inode, struct ext4_ext_path *path)
 {
 	int k, l = path->p_depth;
 
-	ext_debug("path:");
+	ext_debug(inode, "path:");
 	for (k = 0; k <= l; k++, path++) {
 		if (path->p_idx) {
-			ext_debug("  %d->%llu",
+			ext_debug(inode, "  %d->%llu",
 				  le32_to_cpu(path->p_idx->ei_block),
 				  ext4_idx_pblock(path->p_idx));
 		} else if (path->p_ext) {
-			ext_debug("  %d:[%d]%d:%llu ",
+			ext_debug(inode, "  %d:[%d]%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
 				  ext4_ext_is_unwritten(path->p_ext),
 				  ext4_ext_get_actual_len(path->p_ext),
 				  ext4_ext_pblock(path->p_ext));
 		} else
-			ext_debug("  []");
+			ext_debug(inode, "  []");
 	}
-	ext_debug("\n");
+	ext_debug(inode, "\n");
 }
 
 static void ext4_ext_show_leaf(struct inode *inode, struct ext4_ext_path *path)
@@ -631,14 +631,14 @@ static void ext4_ext_show_leaf(struct inode *inode, struct ext4_ext_path *path)
 	eh = path[depth].p_hdr;
 	ex = EXT_FIRST_EXTENT(eh);
 
-	ext_debug("Displaying leaf extents for inode %lu\n", inode->i_ino);
+	ext_debug(inode, "Displaying leaf extents\n");
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
-		ext_debug("%d:[%d]%d:%llu ", le32_to_cpu(ex->ee_block),
+		ext_debug(inode, "%d:[%d]%d:%llu ", le32_to_cpu(ex->ee_block),
 			  ext4_ext_is_unwritten(ex),
 			  ext4_ext_get_actual_len(ex), ext4_ext_pblock(ex));
 	}
-	ext_debug("\n");
+	ext_debug(inode, "\n");
 }
 
 static void ext4_ext_show_move(struct inode *inode, struct ext4_ext_path *path,
@@ -651,10 +651,9 @@ static void ext4_ext_show_move(struct inode *inode, struct ext4_ext_path *path,
 		struct ext4_extent_idx *idx;
 		idx = path[level].p_idx;
 		while (idx <= EXT_MAX_INDEX(path[level].p_hdr)) {
-			ext_debug("%d: move %d:%llu in new index %llu\n", level,
-					le32_to_cpu(idx->ei_block),
-					ext4_idx_pblock(idx),
-					newblock);
+			ext_debug(inode, "%d: move %d:%llu in new index %llu\n",
+				  level, le32_to_cpu(idx->ei_block),
+				  ext4_idx_pblock(idx), newblock);
 			idx++;
 		}
 
@@ -663,7 +662,7 @@ static void ext4_ext_show_move(struct inode *inode, struct ext4_ext_path *path,
 
 	ex = path[depth].p_ext;
 	while (ex <= EXT_MAX_EXTENT(path[depth].p_hdr)) {
-		ext_debug("move %d:%llu:[%d]%d in new leaf %llu\n",
+		ext_debug(inode, "move %d:%llu:[%d]%d in new leaf %llu\n",
 				le32_to_cpu(ex->ee_block),
 				ext4_ext_pblock(ex),
 				ext4_ext_is_unwritten(ex),
@@ -707,7 +706,7 @@ ext4_ext_binsearch_idx(struct inode *inode,
 	struct ext4_extent_idx *r, *l, *m;
 
 
-	ext_debug("binsearch for %u(idx):  ", block);
+	ext_debug(inode, "binsearch for %u(idx):  ", block);
 
 	l = EXT_FIRST_INDEX(eh) + 1;
 	r = EXT_LAST_INDEX(eh);
@@ -717,13 +716,13 @@ ext4_ext_binsearch_idx(struct inode *inode,
 			r = m - 1;
 		else
 			l = m + 1;
-		ext_debug("%p(%u):%p(%u):%p(%u) ", l, le32_to_cpu(l->ei_block),
-				m, le32_to_cpu(m->ei_block),
-				r, le32_to_cpu(r->ei_block));
+		ext_debug(inode, "%p(%u):%p(%u):%p(%u) ", l,
+			  le32_to_cpu(l->ei_block), m, le32_to_cpu(m->ei_block),
+			  r, le32_to_cpu(r->ei_block));
 	}
 
 	path->p_idx = l - 1;
-	ext_debug("  -> %u->%lld ", le32_to_cpu(path->p_idx->ei_block),
+	ext_debug(inode, "  -> %u->%lld ", le32_to_cpu(path->p_idx->ei_block),
 		  ext4_idx_pblock(path->p_idx));
 
 #ifdef CHECK_BINSEARCH
@@ -774,7 +773,7 @@ ext4_ext_binsearch(struct inode *inode,
 		return;
 	}
 
-	ext_debug("binsearch for %u:  ", block);
+	ext_debug(inode, "binsearch for %u:  ", block);
 
 	l = EXT_FIRST_EXTENT(eh) + 1;
 	r = EXT_LAST_EXTENT(eh);
@@ -785,13 +784,13 @@ ext4_ext_binsearch(struct inode *inode,
 			r = m - 1;
 		else
 			l = m + 1;
-		ext_debug("%p(%u):%p(%u):%p(%u) ", l, le32_to_cpu(l->ee_block),
-				m, le32_to_cpu(m->ee_block),
-				r, le32_to_cpu(r->ee_block));
+		ext_debug(inode, "%p(%u):%p(%u):%p(%u) ", l,
+			  le32_to_cpu(l->ee_block), m, le32_to_cpu(m->ee_block),
+			  r, le32_to_cpu(r->ee_block));
 	}
 
 	path->p_ext = l - 1;
-	ext_debug("  -> %d:%llu:[%d]%d ",
+	ext_debug(inode, "  -> %d:%llu:[%d]%d ",
 			le32_to_cpu(path->p_ext->ee_block),
 			ext4_ext_pblock(path->p_ext),
 			ext4_ext_is_unwritten(path->p_ext),
@@ -871,7 +870,7 @@ ext4_find_extent(struct inode *inode, ext4_lblk_t block,
 		ext4_cache_extents(inode, eh);
 	/* walk through the tree */
 	while (i) {
-		ext_debug("depth %d: num %d, max %d\n",
+		ext_debug(inode, "depth %d: num %d, max %d\n",
 			  ppos, le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max));
 
 		ext4_ext_binsearch_idx(inode, path + ppos, block);
@@ -948,18 +947,20 @@ static int ext4_ext_insert_index(handle_t *handle, struct inode *inode,
 
 	if (logical > le32_to_cpu(curp->p_idx->ei_block)) {
 		/* insert after */
-		ext_debug("insert new index %d after: %llu\n", logical, ptr);
+		ext_debug(inode, "insert new index %d after: %llu\n",
+			  logical, ptr);
 		ix = curp->p_idx + 1;
 	} else {
 		/* insert before */
-		ext_debug("insert new index %d before: %llu\n", logical, ptr);
+		ext_debug(inode, "insert new index %d before: %llu\n",
+			  logical, ptr);
 		ix = curp->p_idx;
 	}
 
 	len = EXT_LAST_INDEX(curp->p_hdr) - ix + 1;
 	BUG_ON(len < 0);
 	if (len > 0) {
-		ext_debug("insert new index %d: "
+		ext_debug(inode, "insert new index %d: "
 				"move %d indices from 0x%p to 0x%p\n",
 				logical, len, ix, ix + 1);
 		memmove(ix + 1, ix, len * sizeof(struct ext4_extent_idx));
@@ -1022,12 +1023,12 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
 	}
 	if (path[depth].p_ext != EXT_MAX_EXTENT(path[depth].p_hdr)) {
 		border = path[depth].p_ext[1].ee_block;
-		ext_debug("leaf will be split."
+		ext_debug(inode, "leaf will be split."
 				" next leaf starts at %d\n",
 				  le32_to_cpu(border));
 	} else {
 		border = newext->ee_block;
-		ext_debug("leaf will be added."
+		ext_debug(inode, "leaf will be added."
 				" next leaf starts at %d\n",
 				le32_to_cpu(border));
 	}
@@ -1049,7 +1050,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
 		return -ENOMEM;
 
 	/* allocate all needed blocks */
-	ext_debug("allocate %d blocks for indexes/leaf\n", depth - at);
+	ext_debug(inode, "allocate %d blocks for indexes/leaf\n", depth - at);
 	for (a = 0; a < depth - at; a++) {
 		newblock = ext4_ext_new_meta_block(handle, inode, path,
 						   newext, &err, flags);
@@ -1135,7 +1136,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
 		goto cleanup;
 	}
 	if (k)
-		ext_debug("create %d intermediate indices\n", k);
+		ext_debug(inode, "create %d intermediate indices\n", k);
 	/* insert new index into current index block */
 	/* current depth stored in i var */
 	i = depth - 1;
@@ -1162,7 +1163,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
 		fidx->ei_block = border;
 		ext4_idx_store_pblock(fidx, oldblock);
 
-		ext_debug("int.index at %d (block %llu): %u -> %llu\n",
+		ext_debug(inode, "int.index at %d (block %llu): %u -> %llu\n",
 				i, newblock, le32_to_cpu(border), oldblock);
 
 		/* move remainder of path[i] to the new index block */
@@ -1176,7 +1177,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
 		}
 		/* start copy indexes */
 		m = EXT_MAX_INDEX(path[i].p_hdr) - path[i].p_idx++;
-		ext_debug("cur 0x%p, last 0x%p\n", path[i].p_idx,
+		ext_debug(inode, "cur 0x%p, last 0x%p\n", path[i].p_idx,
 				EXT_MAX_INDEX(path[i].p_hdr));
 		ext4_ext_show_move(inode, path, newblock, i);
 		if (m) {
@@ -1313,7 +1314,7 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
 		EXT_FIRST_INDEX(neh)->ei_block =
 			EXT_FIRST_EXTENT(neh)->ee_block;
 	}
-	ext_debug("new root: num %d(%d), lblock %d, ptr %llu\n",
+	ext_debug(inode, "new root: num %d(%d), lblock %d, ptr %llu\n",
 		  le16_to_cpu(neh->eh_entries), le16_to_cpu(neh->eh_max),
 		  le32_to_cpu(EXT_FIRST_INDEX(neh)->ei_block),
 		  ext4_idx_pblock(EXT_FIRST_INDEX(neh)));
@@ -1955,7 +1956,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 
 		/* Try to append newex to the ex */
 		if (ext4_can_extents_be_merged(inode, ex, newext)) {
-			ext_debug("append [%d]%d block to %u:[%d]%d"
+			ext_debug(inode, "append [%d]%d block to %u:[%d]%d"
 				  "(from %llu)\n",
 				  ext4_ext_is_unwritten(newext),
 				  ext4_ext_get_actual_len(newext),
@@ -1980,7 +1981,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 prepend:
 		/* Try to prepend newex to the ex */
 		if (ext4_can_extents_be_merged(inode, newext, ex)) {
-			ext_debug("prepend %u[%d]%d block to %u:[%d]%d"
+			ext_debug(inode, "prepend %u[%d]%d block to %u:[%d]%d"
 				  "(from %llu)\n",
 				  le32_to_cpu(newext->ee_block),
 				  ext4_ext_is_unwritten(newext),
@@ -2018,7 +2019,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 	if (le32_to_cpu(newext->ee_block) > le32_to_cpu(fex->ee_block))
 		next = ext4_ext_next_leaf_block(path);
 	if (next != EXT_MAX_BLOCKS) {
-		ext_debug("next leaf block - %u\n", next);
+		ext_debug(inode, "next leaf block - %u\n", next);
 		BUG_ON(npath != NULL);
 		npath = ext4_find_extent(inode, next, NULL, 0);
 		if (IS_ERR(npath))
@@ -2026,12 +2027,12 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 		BUG_ON(npath->p_depth != path->p_depth);
 		eh = npath[depth].p_hdr;
 		if (le16_to_cpu(eh->eh_entries) < le16_to_cpu(eh->eh_max)) {
-			ext_debug("next leaf isn't full(%d)\n",
+			ext_debug(inode, "next leaf isn't full(%d)\n",
 				  le16_to_cpu(eh->eh_entries));
 			path = npath;
 			goto has_space;
 		}
-		ext_debug("next leaf has no free space(%d,%d)\n",
+		ext_debug(inode, "next leaf has no free space(%d,%d)\n",
 			  le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max));
 	}
 
@@ -2057,7 +2058,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 
 	if (!nearex) {
 		/* there is no extent in this leaf, create first one */
-		ext_debug("first extent in the leaf: %u:%llu:[%d]%d\n",
+		ext_debug(inode, "first extent in the leaf: %u:%llu:[%d]%d\n",
 				le32_to_cpu(newext->ee_block),
 				ext4_ext_pblock(newext),
 				ext4_ext_is_unwritten(newext),
@@ -2067,7 +2068,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 		if (le32_to_cpu(newext->ee_block)
 			   > le32_to_cpu(nearex->ee_block)) {
 			/* Insert after */
-			ext_debug("insert %u:%llu:[%d]%d before: "
+			ext_debug(inode, "insert %u:%llu:[%d]%d before: "
 					"nearest %p\n",
 					le32_to_cpu(newext->ee_block),
 					ext4_ext_pblock(newext),
@@ -2078,7 +2079,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 		} else {
 			/* Insert before */
 			BUG_ON(newext->ee_block == nearex->ee_block);
-			ext_debug("insert %u:%llu:[%d]%d after: "
+			ext_debug(inode, "insert %u:%llu:[%d]%d after: "
 					"nearest %p\n",
 					le32_to_cpu(newext->ee_block),
 					ext4_ext_pblock(newext),
@@ -2088,7 +2089,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 		}
 		len = EXT_LAST_EXTENT(eh) - nearex + 1;
 		if (len > 0) {
-			ext_debug("insert %u:%llu:[%d]%d: "
+			ext_debug(inode, "insert %u:%llu:[%d]%d: "
 					"move %d extents from 0x%p to 0x%p\n",
 					le32_to_cpu(newext->ee_block),
 					ext4_ext_pblock(newext),
@@ -2232,7 +2233,7 @@ ext4_ext_put_gap_in_cache(struct inode *inode, ext4_lblk_t hole_start,
 			return;
 		hole_len = min(es.es_lblk - hole_start, hole_len);
 	}
-	ext_debug(" -> %u:%u\n", hole_start, hole_len);
+	ext_debug(inode, " -> %u:%u\n", hole_start, hole_len);
 	ext4_es_insert_extent(inode, hole_start, hole_len, ~0,
 			      EXTENT_STATUS_HOLE);
 }
@@ -2269,7 +2270,7 @@ static int ext4_ext_rm_idx(handle_t *handle, struct inode *inode,
 	err = ext4_ext_dirty(handle, inode, path);
 	if (err)
 		return err;
-	ext_debug("index is empty, remove it, free block %llu\n", leaf);
+	ext_debug(inode, "index is empty, remove it, free block %llu\n", leaf);
 	trace_ext4_ext_rm_idx(inode, leaf);
 
 	ext4_free_blocks(handle, inode, NULL, leaf, 1,
@@ -2548,7 +2549,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 	ext4_fsblk_t pblk;
 
 	/* the header must be checked already in ext4_ext_remove_space() */
-	ext_debug("truncate since %u in leaf to %u\n", start, end);
+	ext_debug(inode, "truncate since %u in leaf to %u\n", start, end);
 	if (!path[depth].p_hdr)
 		path[depth].p_hdr = ext_block_hdr(path[depth].p_bh);
 	eh = path[depth].p_hdr;
@@ -2574,7 +2575,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 		else
 			unwritten = 0;
 
-		ext_debug("remove ext %u:[%d]%d\n", ex_ee_block,
+		ext_debug(inode, "remove ext %u:[%d]%d\n", ex_ee_block,
 			  unwritten, ex_ee_len);
 		path[depth].p_ext = ex;
 
@@ -2582,7 +2583,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 		b = ex_ee_block+ex_ee_len - 1 < end ?
 			ex_ee_block+ex_ee_len - 1 : end;
 
-		ext_debug("  border %u:%u\n", a, b);
+		ext_debug(inode, "  border %u:%u\n", a, b);
 
 		/* If this extent is beyond the end of the hole, skip it */
 		if (end < ex_ee_block) {
@@ -2691,7 +2692,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 		if (err)
 			goto out;
 
-		ext_debug("new extent: %u:%u:%llu\n", ex_ee_block, num,
+		ext_debug(inode, "new extent: %u:%u:%llu\n", ex_ee_block, num,
 				ext4_ext_pblock(ex));
 		ex--;
 		ex_ee_block = le32_to_cpu(ex->ee_block);
@@ -2768,7 +2769,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	partial.lblk = 0;
 	partial.state = initial;
 
-	ext_debug("truncate since %u to %u\n", start, end);
+	ext_debug(inode, "truncate since %u to %u\n", start, end);
 
 	/* probably first extent we're gonna free will be last in block */
 	handle = ext4_journal_start_with_revoke(inode, EXT4_HT_TRUNCATE,
@@ -2909,7 +2910,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 
 		/* this is index block */
 		if (!path[i].p_hdr) {
-			ext_debug("initialize header\n");
+			ext_debug(inode, "initialize header\n");
 			path[i].p_hdr = ext_block_hdr(path[i].p_bh);
 		}
 
@@ -2917,7 +2918,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			/* this level hasn't been touched yet */
 			path[i].p_idx = EXT_LAST_INDEX(path[i].p_hdr);
 			path[i].p_block = le16_to_cpu(path[i].p_hdr->eh_entries)+1;
-			ext_debug("init index ptr: hdr 0x%p, num %d\n",
+			ext_debug(inode, "init index ptr: hdr 0x%p, num %d\n",
 				  path[i].p_hdr,
 				  le16_to_cpu(path[i].p_hdr->eh_entries));
 		} else {
@@ -2925,13 +2926,13 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			path[i].p_idx--;
 		}
 
-		ext_debug("level %d - index, first 0x%p, cur 0x%p\n",
+		ext_debug(inode, "level %d - index, first 0x%p, cur 0x%p\n",
 				i, EXT_FIRST_INDEX(path[i].p_hdr),
 				path[i].p_idx);
 		if (ext4_ext_more_to_rm(path + i)) {
 			struct buffer_head *bh;
 			/* go to the next level */
-			ext_debug("move to level %d (block %llu)\n",
+			ext_debug(inode, "move to level %d (block %llu)\n",
 				  i + 1, ext4_idx_pblock(path[i].p_idx));
 			memset(path + i + 1, 0, sizeof(*path));
 			bh = read_extent_tree_block(inode,
@@ -2967,7 +2968,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			brelse(path[i].p_bh);
 			path[i].p_bh = NULL;
 			i--;
-			ext_debug("return to level %d\n", i);
+			ext_debug(inode, "return to level %d\n", i);
 		}
 	}
 
@@ -3135,8 +3136,7 @@ static int ext4_split_extent_at(handle_t *handle,
 	BUG_ON((split_flag & (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2)) ==
 	       (EXT4_EXT_DATA_VALID1 | EXT4_EXT_DATA_VALID2));
 
-	ext_debug("ext4_split_extents_at: inode %lu, logical"
-		"block %llu\n", inode->i_ino, (unsigned long long)split);
+	ext_debug(inode, "logical block %llu\n", (unsigned long long)split);
 
 	ext4_ext_show_leaf(inode, path);
 
@@ -3369,9 +3369,8 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
 	int err = 0;
 	int split_flag = EXT4_EXT_DATA_VALID2;
 
-	ext_debug("ext4_ext_convert_to_initialized: inode %lu, logical"
-		"block %llu, max_blocks %u\n", inode->i_ino,
-		(unsigned long long)map->m_lblk, map_len);
+	ext_debug(inode, "logical block %llu, max_blocks %u\n",
+		  (unsigned long long)map->m_lblk, map_len);
 
 	sbi = EXT4_SB(inode->i_sb);
 	eof_block = (EXT4_I(inode)->i_disksize + inode->i_sb->s_blocksize - 1)
@@ -3623,8 +3622,7 @@ static int ext4_split_convert_extents(handle_t *handle,
 	unsigned int ee_len;
 	int split_flag = 0, depth;
 
-	ext_debug("%s: inode %lu, logical block %llu, max_blocks %u\n",
-		  __func__, inode->i_ino,
+	ext_debug(inode, "logical block %llu, max_blocks %u\n",
 		  (unsigned long long)map->m_lblk, map->m_len);
 
 	eof_block = (EXT4_I(inode)->i_disksize + inode->i_sb->s_blocksize - 1)
@@ -3670,8 +3668,7 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
 	ee_block = le32_to_cpu(ex->ee_block);
 	ee_len = ext4_ext_get_actual_len(ex);
 
-	ext_debug("ext4_convert_unwritten_extents_endio: inode %lu, logical"
-		"block %llu, max_blocks %u\n", inode->i_ino,
+	ext_debug(inode, "logical block %llu, max_blocks %u\n",
 		  (unsigned long long)ee_block, ee_len);
 
 	/* If extent is larger than requested it is a clear sign that we still
@@ -3741,8 +3738,7 @@ convert_initialized_extent(handle_t *handle, struct inode *inode,
 	ee_block = le32_to_cpu(ex->ee_block);
 	ee_len = ext4_ext_get_actual_len(ex);
 
-	ext_debug("%s: inode %lu, logical"
-		"block %llu, max_blocks %u\n", __func__, inode->i_ino,
+	ext_debug(inode, "logical block %llu, max_blocks %u\n",
 		  (unsigned long long)ee_block, ee_len);
 
 	if (ee_block != map->m_lblk || ee_len > map->m_len) {
@@ -3798,10 +3794,9 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 	int ret = 0;
 	int err = 0;
 
-	ext_debug("ext4_ext_handle_unwritten_extents: inode %lu, logical "
-		  "block %llu, max_blocks %u, flags %x, allocated %u\n",
-		  inode->i_ino, (unsigned long long)map->m_lblk, map->m_len,
-		  flags, allocated);
+	ext_debug(inode, "logical block %llu, max_blocks %u, flags 0x%x, allocated %u\n",
+		  (unsigned long long)map->m_lblk, map->m_len, flags,
+		  allocated);
 	ext4_ext_show_leaf(inode, path);
 
 	/*
@@ -4029,8 +4024,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	struct ext4_allocation_request ar;
 	ext4_lblk_t cluster_offset;
 
-	ext_debug("blocks %u/%u requested for inode %lu\n",
-		  map->m_lblk, map->m_len, inode->i_ino);
+	ext_debug(inode, "blocks %u/%u requested\n", map->m_lblk, map->m_len);
 	trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
 
 	/* find extent for this block */
@@ -4077,8 +4071,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			newblock = map->m_lblk - ee_block + ee_start;
 			/* number of remaining blocks in the extent */
 			allocated = ee_len - (map->m_lblk - ee_block);
-			ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk,
-				  ee_block, ee_len, newblock);
+			ext_debug(inode, "%u fit into %u:%d -> %llu\n",
+				  map->m_lblk, ee_block, ee_len, newblock);
 
 			/*
 			 * If the extent is initialized check whether the
@@ -4218,7 +4212,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		goto out2;
 	allocated_clusters = ar.len;
 	ar.len = EXT4_C2B(sbi, ar.len) - offset;
-	ext_debug("allocate new block: goal %llu, found %llu/%u, requested %u\n",
+	ext_debug(inode, "allocate new block: goal %llu, found %llu/%u, requested %u\n",
 		  ar.goal, newblock, ar.len, allocated);
 	if (ar.len > allocated)
 		ar.len = allocated;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5f120af22d48..a39fea85ca59 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -493,9 +493,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 #endif
 
 	map->m_flags = 0;
-	ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
-		  "logical block %lu\n", inode->i_ino, flags, map->m_len,
-		  (unsigned long) map->m_lblk);
+	ext_debug(inode, "flag 0x%x, max_blocks %u, logical block %lu\n",
+		  flags, map->m_len, (unsigned long) map->m_lblk);
 
 	/*
 	 * ext4_map_blocks returns an int, and m_len is an unsigned int
@@ -728,8 +727,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	}
 
 	if (retval < 0)
-		ext_debug("failed for inode %lu with err %d\n",
-			  inode->i_ino, retval);
+		ext_debug(inode, "failed with err %d\n", retval);
 	return retval;
 }
 
@@ -1685,8 +1683,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		invalid_block = ~0;
 
 	map->m_flags = 0;
-	ext_debug("ext4_da_map_blocks(): inode %lu, max_blocks %u,"
-		  "logical block %lu\n", inode->i_ino, map->m_len,
+	ext_debug(inode, "max_blocks %u, logical block %lu\n", map->m_len,
 		  (unsigned long) map->m_lblk);
 
 	/* Lookup extent status tree firstly */
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC 20/20] ext4: Add process name and pid in ext4_msg()
  2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
                   ` (18 preceding siblings ...)
  2020-05-01  6:30 ` [RFC 19/20] ext4: Make ext_debug() " Ritesh Harjani
@ 2020-05-01  6:30 ` Ritesh Harjani
  19 siblings, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-01  6:30 UTC (permalink / raw)
  To: linux-ext4
  Cc: Paul E . McKenney, linux-fsdevel, Jan Kara, tytso,
	Aneesh Kumar K . V, Ritesh Harjani

This adds the process name and pid to ext4_msg().
I found this to be useful; e.g. the print below now also shows which
process (name/pid) triggered the message.

[ 7671.131912]  [mount/12543] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: acl,user_xattr

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index bf5fcb477f66..5067a47f4f46 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -756,7 +756,8 @@ void __ext4_msg(struct super_block *sb,
 	va_start(args, fmt);
 	vaf.fmt = fmt;
 	vaf.va = &args;
-	printk("%sEXT4-fs (%s): %pV\n", prefix, sb->s_id, &vaf);
+	printk("%s [%s/%d] EXT4-fs (%s): %pV\n", prefix, current->comm,
+		task_pid_nr(current), sb->s_id, &vaf);
 	va_end(args);
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC 02/20] ext4: Introduce percpu seq counter for freeing blocks(PA) to avoid ENOSPC err
  2020-05-01  6:29 ` [RFC 02/20] ext4: Introduce percpu seq counter for freeing blocks(PA) to avoid ENOSPC err Ritesh Harjani
@ 2020-05-01 18:31   ` Paul E. McKenney
  2020-05-04 22:34     ` Ritesh Harjani
  0 siblings, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2020-05-01 18:31 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-ext4, linux-fsdevel, Jan Kara, tytso, Aneesh Kumar K . V

On Fri, May 01, 2020 at 11:59:44AM +0530, Ritesh Harjani wrote:
> There could be a race in function ext4_mb_discard_group_preallocations()
> where the 1st thread may iterate through group's bb_prealloc_list and
> remove all the PAs and add to function's local list head.
> Now if the 2nd thread comes in to discard the group preallocations,
> it will see that the group->bb_prealloc_list is empty and will return 0.
> 
> Consider for a case where we have less number of groups
> (for e.g. just group 0),
> this may even return an -ENOSPC error from ext4_mb_new_blocks()
> (where we call for ext4_mb_discard_group_preallocations()).
> But that is wrong, since 2nd thread should have waited for 1st thread
> to release all the PAs and should have retried for allocation.
> Since 1st thread was anyway going to discard the PAs.
> 
> The algorithm using this percpu seq counter goes below:
> 1. We sample the percpu discard_pa_seq counter before trying for block
>    allocation in ext4_mb_new_blocks().
> 2. We increment this percpu discard_pa_seq counter when we succeed in
>    allocating blocks and hence while adding the remaining blocks in group's
>    bb_prealloc_list (ext4_mb_new_inode_pa/ext4_mb_new_group_pa).
> 3. We also increment this percpu seq counter when we successfully identify
>    that the bb_prealloc_list is not empty and hence proceed for discarding
>    of those PAs inside ext4_mb_discard_group_preallocations().
> 
> Now to make sure that the regular fast path of block allocation is not
> affected, as a small optimization we only sample the percpu seq counter
> on that cpu. Only when the block allocation fails and when freed blocks
> found were 0, that is when we sample percpu seq counter for all cpus using
> below function ext4_get_discard_pa_seq_sum(). This happens after making
> sure that all the PAs on grp->bb_prealloc_list got freed.
> 
> TO CHECK: Though in here rcu_barrier only happens in ENOSPC path.
> =================================================================
> On rcu_barrier() - How expensive it can be?
> Does that mean that every thread who is coming and waiting on
> rcu_barrier() will actually check whether call_rcu() has completed by
> checking that on every cpu? So will this be a O(n*m) operation?
> (n = no. of threads, m = no. of cpus).
> Or are there some sort of optimization in using rcu_barrier()?

This first part assumes that a single task is executing a series of
rcu_barrier() calls sequentially.

Yes, the rcu_barrier() function makes heavy use of batching optimizations
and asynchrony.  So from an overhead perspective, it queues an RCU
callback on each CPU that has at least one callback already queued.
It does this queuing in such a way to avoid the need for any additional
grace periods.  Of course, each callback actually queued will eventually
be invoked (function call through a pointer), and the callback will
atomically decrement a counter and wake up the rcu_barrier() task if
that counter reaches zero.  So the CPU overhead is O(M), where M is the
number of CPUs, with negligible overhead except for those CPUs that have
at least one RCU callback queued.

So the CPU overhead of rcu_barrier() is quite small, at least when
taken on a per-CPU basis.

However, the latency can be a bit longer than that of an RCU
grace period.  In normal operation, this will normally range from a few
milliseconds to a few hundred milliseconds.

And the caller of rcu_barrier() can also do batching.  For example,
the following:

	make_call_rcu_stop(a);
	rcu_barrier();
	depend_on_callbacks_finished(a);
	make_call_rcu_stop(b);
	rcu_barrier();
	depend_on_callbacks_finished(b);
	...
	make_call_rcu_stop(z);
	rcu_barrier();
	depend_on_callbacks_finished(z);

Can be safely transformed into this:

	make_call_rcu_stop(a);
	make_call_rcu_stop(b);
	...
	make_call_rcu_stop(z);
	rcu_barrier();
	depend_on_callbacks_finished(a);
	depend_on_callbacks_finished(b);
	...
	depend_on_callbacks_finished(z);

That single rcu_barrier() call can cover all 26 instances.
Give or take possible software-engineering and maintainability
issues, of course.

But what if a large number of tasks are concurrently executing
rcu_barrier()?

These will also be batched.  The first task will run the rcu_barrier()
machinery.  Any rcu_barrier() calls that show up before the first one
completes will share a single second run of the rcu_barrier() machinery.
And any rcu_barrier() calls that show up before this second run of
the machinery completes will be handled by a single third run of this
machinery.  And so on.

Does that help, or am I missing the point of your question?

							Thanx, Paul

> ---
> Note: The other method [1] also used to check for grp->bb_free next time
> if we couldn't discard anything. But it had below concerns due to which
> we cannot use that approach.
> 1. But one suspicion with that was that if grp->bb_free is non-zero for some
> reason (not sure), then we may result in that loop indefinitely but still
> won't be able to satisfy any request.
> 2. To check for grp->bb_free all threads were to wait on grp's spin_lock
> which might result in increased cpu usage.
> 3. In case of group's PA allocation (i.e. when file's size is < 1MB for
>    64K blocksize), there is still a case where ENOSPC could be returned.
>    This happens when the grp->bb_free is set to 0 but those blocks are
>    actually not yet added to PA. Yes, this could actually happen since
>    reducing grp->bb_free and adding those extra blocks in
>    bb_prealloc_list are not done atomically. Hence the race.
> 
> [1]: https://patchwork.ozlabs.org/project/linux-ext4/patch/533ac1f5b19c520b08f8c99aec5baf8729185714.1586954511.git.riteshh@linux.ibm.com/
> 
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> ---
>  fs/ext4/mballoc.c | 65 +++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 60 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index a742e51e33b8..6bb08bb3c0ce 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -357,6 +357,35 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
>  static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
>  						ext4_group_t group);
>  
> +/*
> + * The algorithm using this percpu seq counter goes below:
> + * 1. We sample the percpu discard_pa_seq counter before trying for block
> + *    allocation in ext4_mb_new_blocks().
> + * 2. We increment this percpu discard_pa_seq counter when we succeed in
> + *    allocating blocks and hence while adding the remaining blocks in group's
> + *    bb_prealloc_list (ext4_mb_new_inode_pa/ext4_mb_new_group_pa).
> + * 3. We also increment this percpu seq counter when we successfully identify
> + *    that the bb_prealloc_list is not empty and hence proceed for discarding
> + *    of those PAs inside ext4_mb_discard_group_preallocations().
> + *
> + * Now to make sure that the regular fast path of block allocation is not
> + * affected, as a small optimization we only sample the percpu seq counter
> + * on that cpu. Only when the block allocation fails and when freed blocks
> + * found were 0, that is when we sample percpu seq counter for all cpus using
> + * below function ext4_get_discard_pa_seq_sum(). This happens after making
> + * sure that all the PAs on grp->bb_prealloc_list got freed.
> + */
> +DEFINE_PER_CPU(u64, discard_pa_seq);
> +static inline u64 ext4_get_discard_pa_seq_sum(void)
> +{
> +	int __cpu;
> +	u64 __seq = 0;
> +
> +	for_each_possible_cpu(__cpu)
> +		__seq += per_cpu(discard_pa_seq, __cpu);
> +	return __seq;
> +}
> +
>  static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
>  {
>  #if BITS_PER_LONG == 64
> @@ -3730,6 +3759,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  	pa->pa_inode = ac->ac_inode;
>  
>  	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
> +	this_cpu_inc(discard_pa_seq);
>  	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
>  	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
>  
> @@ -3791,6 +3821,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
>  	pa->pa_inode = NULL;
>  
>  	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
> +	this_cpu_inc(discard_pa_seq);
>  	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
>  	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
>  
> @@ -3943,6 +3974,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  	INIT_LIST_HEAD(&list);
>  repeat:
>  	ext4_lock_group(sb, group);
> +	this_cpu_inc(discard_pa_seq);
>  	list_for_each_entry_safe(pa, tmp,
>  				&grp->bb_prealloc_list, pa_group_list) {
>  		spin_lock(&pa->pa_lock);
> @@ -4487,14 +4519,35 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
>  }
>  
>  static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
> -			struct ext4_allocation_context *ac)
> +			struct ext4_allocation_context *ac, u64 *seq)
>  {
>  	int freed;
> +	u64 seq_retry = 0;
> +	bool ret = false;
>  
>  	freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
> -	if (freed)
> -		return true;
> -	return false;
> +	if (freed) {
> +		ret = true;
> +		goto out_dbg;
> +	}
> +	/*
> +	 * Unless it is ensured that PAs are actually freed, we may hit
> +	 * a ENOSPC error since the next time seq may match while the PA blocks
> +	 * are still getting freed in ext4_mb_release_inode/group_pa().
> +	 * So, rcu_barrier() here is to make sure that any call_rcu queued in
> +	 * ext4_mb_discard_group_preallocations() is completed before we
> +	 * proceed further to retry for block allocation.
> +	 */
> +	rcu_barrier();
> +	seq_retry = ext4_get_discard_pa_seq_sum();
> +	if (seq_retry != *seq) {
> +		*seq = seq_retry;
> +		ret = true;
> +	}
> +
> +out_dbg:
> +	mb_debug(1, "freed %d, retry ? %s\n", freed, ret ? "yes" : "no");
> +	return ret;
>  }
>  
>  /*
> @@ -4511,6 +4564,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>  	ext4_fsblk_t block = 0;
>  	unsigned int inquota = 0;
>  	unsigned int reserv_clstrs = 0;
> +	u64 seq;
>  
>  	might_sleep();
>  	sb = ar->inode->i_sb;
> @@ -4572,6 +4626,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>  	}
>  
>  	ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
> +	seq = *this_cpu_ptr(&discard_pa_seq);
>  	if (!ext4_mb_use_preallocated(ac)) {
>  		ac->ac_op = EXT4_MB_HISTORY_ALLOC;
>  		ext4_mb_normalize_request(ac, ar);
> @@ -4603,7 +4658,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>  			ar->len = ac->ac_b_ex.fe_len;
>  		}
>  	} else {
> -		if (ext4_mb_discard_preallocations_should_retry(sb, ac))
> +		if (ext4_mb_discard_preallocations_should_retry(sb, ac, &seq))
>  			goto repeat;
>  		*errp = -ENOSPC;
>  	}
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 02/20] ext4: Introduce percpu seq counter for freeing blocks(PA) to avoid ENOSPC err
  2020-05-01 18:31   ` Paul E. McKenney
@ 2020-05-04 22:34     ` Ritesh Harjani
  2020-05-05  0:36       ` Paul E. McKenney
  0 siblings, 1 reply; 24+ messages in thread
From: Ritesh Harjani @ 2020-05-04 22:34 UTC (permalink / raw)
  To: paulmck; +Cc: linux-ext4, linux-fsdevel, Jan Kara, tytso, Aneesh Kumar K . V

Hello Paul,

On 5/2/20 12:01 AM, Paul E. McKenney wrote:
> On Fri, May 01, 2020 at 11:59:44AM +0530, Ritesh Harjani wrote:
>> There could be a race in function ext4_mb_discard_group_preallocations()
>> where the 1st thread may iterate through group's bb_prealloc_list and
>> remove all the PAs and add to function's local list head.
>> Now if the 2nd thread comes in to discard the group preallocations,
>> it will see that the group->bb_prealloc_list is empty and will return 0.
>>
>> Consider for a case where we have less number of groups
>> (for e.g. just group 0),
>> this may even return an -ENOSPC error from ext4_mb_new_blocks()
>> (where we call for ext4_mb_discard_group_preallocations()).
>> But that is wrong, since 2nd thread should have waited for 1st thread
>> to release all the PAs and should have retried for allocation.
>> Since 1st thread was anyway going to discard the PAs.
>>
>> The algorithm using this percpu seq counter goes below:
>> 1. We sample the percpu discard_pa_seq counter before trying for block
>>     allocation in ext4_mb_new_blocks().
>> 2. We increment this percpu discard_pa_seq counter when we succeed in
>>     allocating blocks and hence while adding the remaining blocks in group's
>>     bb_prealloc_list (ext4_mb_new_inode_pa/ext4_mb_new_group_pa).
>> 3. We also increment this percpu seq counter when we successfully identify
>>     that the bb_prealloc_list is not empty and hence proceed for discarding
>>     of those PAs inside ext4_mb_discard_group_preallocations().
>>
>> Now to make sure that the regular fast path of block allocation is not
>> affected, as a small optimization we only sample the percpu seq counter
>> on that cpu. Only when the block allocation fails and when freed blocks
>> found were 0, that is when we sample percpu seq counter for all cpus using
>> below function ext4_get_discard_pa_seq_sum(). This happens after making
>> sure that all the PAs on grp->bb_prealloc_list got freed.
>>
>> TO CHECK: Though in here rcu_barrier only happens in ENOSPC path.
>> =================================================================
>> On rcu_barrier() - How expensive it can be?
>> Does that mean that every thread who is coming and waiting on
>> rcu_barrier() will actually check whether call_rcu() has completed by
>> checking that on every cpu? So will this be a O(n*m) operation?
>> (n = no. of threads, m = no. of cpus).
>> Or are there some sort of optimization in using rcu_barrier()?
> 
> This first part assumes that a single task is executing a series of
> rcu_barrier() calls sequentially.
> 
> Yes, the rcu_barrier() function makes heavy use of batching optimizations
> and asynchrony.  So from an overhead perspective, it queues an RCU
> callback on each CPU that has at least one callback already queued.
> It does this queuing in such a way to avoid the need for any additional
> grace periods.  Of course, each callback actually queued will eventually
> be invoked (function call through a pointer), and the callback will
> atomically decrement a counter and wake up the rcu_barrier() task if
> that counter reaches zero.  So the CPU overhead is O(M), where M is the
> number of CPUs, with negligible overhead except for those CPUs that have
> at least one RCU callback queued.
> 
> So the CPU overhead of rcu_barrier() is quite small, at least when
> taken on a per-CPU basis.
> 
> However, the latency can be a bit longer than than that of an RCU
> grace period.  In normal operation, this will normally range from a few
> milliseconds to a few hundred milliseconds.
> 
> And the caller of rcu_barrier() can also do batching.  For example,
> the following:
> 
> 	make_call_rcu_stop(a);
> 	rcu_barrier();
> 	depend_on_callbacks_finished(a);
> 	make_call_rcu_stop(b);
> 	rcu_barrier();
> 	depend_on_callbacks_finished(b);
> 	...
> 	make_call_rcu_stop(z);
> 	rcu_barrier();
> 	depend_on_callbacks_finished(z);
> 
> Can be safely transformed into this:
> 
> 	make_call_rcu_stop(a);
> 	make_call_rcu_stop(b);
> 	...
> 	make_call_rcu_stop(z);
> 	rcu_barrier();
> 	depend_on_callbacks_finished(a);
> 	depend_on_callbacks_finished(b);
> 	...
> 	depend_on_callbacks_finished(z);
> 
> That single rcu_barrier() call can cover all 26 instances.
> Give or take possible software-engineering and maintainability
> issues, of course.
> 
> But what if a large number of tasks are concurrently executing
> rcu_barrier()?
> 
> These will also be batched.  The first task will run the rcu_barrier()
> machinery.  Any rcu_barrier() calls that show up before the first one
> completes will share a single second run of the rcu_barrier() machinery.
> And any rcu_barrier() calls that show up before this second run of
> the machinery completes will be handled by a single third run of this
> machinery.  And so on.

Thanks for explaining that. I did some more reading of the
rcu_barrier() code, and I think what it essentially does is depicted
in lines 1 - 12 below.

So AFAIU,
when there are multiple threads calling rcu_barrier(), what will mostly
happen is that the 1st thread which calls this func will get the
mutex_lock (line 1) and will make sure that all calls to call_rcu()
already queued on different cpus are completed before this
rcu_barrier() returns.
The way this happens is that rcu_barrier() runs its own
rcu_barrier_func() on all of those cpus which have call_rcu() queued
up. rcu_barrier_func(), while running on each of those cpus, then
queues a call_rcu() (at the tail of that cpu's call_rcu queue) with the
rcu_barrier_callback() function. This call_rcu queued on the tail will
run after all of the previous call_rcu() callbacks from those per-cpu
queues have completed. This rcu_barrier_callback() will mostly run
after a grace period (like how call_rcu() works in general).

And meanwhile all other threads will sleep while trying to acquire
that mutex lock. So during this time these cpus could still be doing
something useful (while their threads wait on the mutex).
Now, when these other threads are able to acquire the mutex_lock() and
proceed inside rcu_barrier(), if there are no more call_rcu() callbacks
queued (since they should have been taken care of by the previous
thread, checked in line 5 below), then nothing else will be done by
rcu_barrier(), and it will simply exit.

Is the above understanding correct?


Instead, if I use a spinlock, then in the worst case most of the
threads will just spin on this lock while not being able to do anything
useful. Also, if the workload is heavily multi-threaded and we hit this
scenario, then this could also result in high cpu usage and sluggish
performance, since all the threads are waiting on the spinlock.

My main reason for trying to understand rcu_barrier() functionality is
to weigh my options on which approach is better to deploy here.
Hence all of these queries :)


1. mutex_lock(&rcu_state.barrier_mutex);
2. init_completion(&rcu_state.barrier_completion);
3. atomic_set(&rcu_state.barrier_cpu_count, 1);

4. for_each_possible_cpu(cpu) {
5. 	if_we_have_any_call_rcu_pending_on_this_cpu {
6. 		smp_call_function_single(cpu, rcu_barrier_func, NULL, 1)
7. 	}
8. }

9. if (atomic_dec_and_test(&rcu_state.barrier_cpu_count))
10. 	complete(&rcu_state.barrier_completion);

11. wait_for_completion(&rcu_state.barrier_completion);

12. mutex_unlock(&rcu_state.barrier_mutex);


Thanks
-ritesh



>> ---
>> Note: The other method [1] also used to check for grp->bb_free next time
>> if we couldn't discard anything. But it had below concerns due to which
>> we cannot use that approach.
>> 1. One suspicion with that was that if grp->bb_free is non-zero for some
>> reason (not sure), then we may end up in that loop indefinitely but still
>> won't be able to satisfy any request.
>> 2. To check for grp->bb_free all threads would have to wait on grp's
>> spin_lock, which might result in increased cpu usage.
>> 3. In case of group's PA allocation (i.e. when file's size is < 1MB for
>>     64K blocksize), there is still a case where ENOSPC could be returned.
>>     This happens when the grp->bb_free is set to 0 but those blocks are
>>     actually not yet added to PA. Yes, this could actually happen since
>>     reducing grp->bb_free and adding those extra blocks in
>>     bb_prealloc_list are not done atomically. Hence the race.
>>
>> [1]: https://patchwork.ozlabs.org/project/linux-ext4/patch/533ac1f5b19c520b08f8c99aec5baf8729185714.1586954511.git.riteshh@linux.ibm.com/
>>
>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
>> ---
>>   fs/ext4/mballoc.c | 65 +++++++++++++++++++++++++++++++++++++++++++----
>>   1 file changed, 60 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index a742e51e33b8..6bb08bb3c0ce 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -357,6 +357,35 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
>>   static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
>>   						ext4_group_t group);
>>   
>> +/*
>> + * The algorithm using this percpu seq counter goes below:
>> + * 1. We sample the percpu discard_pa_seq counter before trying for block
>> + *    allocation in ext4_mb_new_blocks().
>> + * 2. We increment this percpu discard_pa_seq counter when we succeed in
>> + *    allocating blocks and hence while adding the remaining blocks in group's
>> + *    bb_prealloc_list (ext4_mb_new_inode_pa/ext4_mb_new_group_pa).
>> + * 3. We also increment this percpu seq counter when we successfully identify
>> + *    that the bb_prealloc_list is not empty and hence proceed for discarding
>> + *    of those PAs inside ext4_mb_discard_group_preallocations().
>> + *
>> + * Now to make sure that the regular fast path of block allocation is not
>> + * affected, as a small optimization we only sample the percpu seq counter
>> + * on that cpu. Only when the block allocation fails and when freed blocks
>> + * found were 0, that is when we sample percpu seq counter for all cpus using
>> + * below function ext4_get_discard_pa_seq_sum(). This happens after making
>> + * sure that all the PAs on grp->bb_prealloc_list got freed.
>> + */
>> +DEFINE_PER_CPU(u64, discard_pa_seq);
>> +static inline u64 ext4_get_discard_pa_seq_sum(void)
>> +{
>> +	int __cpu;
>> +	u64 __seq = 0;
>> +
>> +	for_each_possible_cpu(__cpu)
>> +		__seq += per_cpu(discard_pa_seq, __cpu);
>> +	return __seq;
>> +}
>> +
>>   static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
>>   {
>>   #if BITS_PER_LONG == 64
>> @@ -3730,6 +3759,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>   	pa->pa_inode = ac->ac_inode;
>>   
>>   	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
>> +	this_cpu_inc(discard_pa_seq);
>>   	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
>>   	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
>>   
>> @@ -3791,6 +3821,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
>>   	pa->pa_inode = NULL;
>>   
>>   	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
>> +	this_cpu_inc(discard_pa_seq);
>>   	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
>>   	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
>>   
>> @@ -3943,6 +3974,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>   	INIT_LIST_HEAD(&list);
>>   repeat:
>>   	ext4_lock_group(sb, group);
>> +	this_cpu_inc(discard_pa_seq);
>>   	list_for_each_entry_safe(pa, tmp,
>>   				&grp->bb_prealloc_list, pa_group_list) {
>>   		spin_lock(&pa->pa_lock);
>> @@ -4487,14 +4519,35 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
>>   }
>>   
>>   static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
>> -			struct ext4_allocation_context *ac)
>> +			struct ext4_allocation_context *ac, u64 *seq)
>>   {
>>   	int freed;
>> +	u64 seq_retry = 0;
>> +	bool ret = false;
>>   
>>   	freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
>> -	if (freed)
>> -		return true;
>> -	return false;
>> +	if (freed) {
>> +		ret = true;
>> +		goto out_dbg;
>> +	}
>> +	/*
>> +	 * Unless it is ensured that PAs are actually freed, we may hit
>> +	 * a ENOSPC error since the next time seq may match while the PA blocks
>> +	 * are still getting freed in ext4_mb_release_inode/group_pa().
>> +	 * So, rcu_barrier() here is to make sure that any call_rcu queued in
>> +	 * ext4_mb_discard_group_preallocations() is completed before we
>> +	 * proceed further to retry for block allocation.
>> +	 */
>> +	rcu_barrier();
>> +	seq_retry = ext4_get_discard_pa_seq_sum();
>> +	if (seq_retry != *seq) {
>> +		*seq = seq_retry;
>> +		ret = true;
>> +	}
>> +
>> +out_dbg:
>> +	mb_debug(1, "freed %d, retry ? %s\n", freed, ret ? "yes" : "no");
>> +	return ret;
>>   }
>>   
>>   /*
>> @@ -4511,6 +4564,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>>   	ext4_fsblk_t block = 0;
>>   	unsigned int inquota = 0;
>>   	unsigned int reserv_clstrs = 0;
>> +	u64 seq;
>>   
>>   	might_sleep();
>>   	sb = ar->inode->i_sb;
>> @@ -4572,6 +4626,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>>   	}
>>   
>>   	ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
>> +	seq = *this_cpu_ptr(&discard_pa_seq);
>>   	if (!ext4_mb_use_preallocated(ac)) {
>>   		ac->ac_op = EXT4_MB_HISTORY_ALLOC;
>>   		ext4_mb_normalize_request(ac, ar);
>> @@ -4603,7 +4658,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>>   			ar->len = ac->ac_b_ex.fe_len;
>>   		}
>>   	} else {
>> -		if (ext4_mb_discard_preallocations_should_retry(sb, ac))
>> +		if (ext4_mb_discard_preallocations_should_retry(sb, ac, &seq))
>>   			goto repeat;
>>   		*errp = -ENOSPC;
>>   	}
>> -- 
>> 2.21.0
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 02/20] ext4: Introduce percpu seq counter for freeing blocks(PA) to avoid ENOSPC err
  2020-05-04 22:34     ` Ritesh Harjani
@ 2020-05-05  0:36       ` Paul E. McKenney
  0 siblings, 0 replies; 24+ messages in thread
From: Paul E. McKenney @ 2020-05-05  0:36 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-ext4, linux-fsdevel, Jan Kara, tytso, Aneesh Kumar K . V

On Tue, May 05, 2020 at 04:04:42AM +0530, Ritesh Harjani wrote:
> Hello Paul,
> 
> On 5/2/20 12:01 AM, Paul E. McKenney wrote:
> > On Fri, May 01, 2020 at 11:59:44AM +0530, Ritesh Harjani wrote:
> > > There could be a race in function ext4_mb_discard_group_preallocations()
> > > where the 1st thread may iterate through group's bb_prealloc_list and
> > > remove all the PAs and add to function's local list head.
> > > Now if the 2nd thread comes in to discard the group preallocations,
> > > it will see that the group->bb_prealloc_list is empty and will return 0.
> > > 
> > > Consider for a case where we have less number of groups
> > > (for e.g. just group 0),
> > > this may even return an -ENOSPC error from ext4_mb_new_blocks()
> > > (where we call for ext4_mb_discard_group_preallocations()).
> > > But that is wrong, since 2nd thread should have waited for 1st thread
> > > to release all the PAs and should have retried for allocation.
> > > Since 1st thread was anyway going to discard the PAs.
> > > 
> > > The algorithm using this percpu seq counter goes below:
> > > 1. We sample the percpu discard_pa_seq counter before trying for block
> > >     allocation in ext4_mb_new_blocks().
> > > 2. We increment this percpu discard_pa_seq counter when we succeed in
> > >     allocating blocks and hence while adding the remaining blocks in group's
> > >     bb_prealloc_list (ext4_mb_new_inode_pa/ext4_mb_new_group_pa).
> > > 3. We also increment this percpu seq counter when we successfully identify
> > >     that the bb_prealloc_list is not empty and hence proceed for discarding
> > >     of those PAs inside ext4_mb_discard_group_preallocations().
> > > 
> > > Now to make sure that the regular fast path of block allocation is not
> > > affected, as a small optimization we only sample the percpu seq counter
> > > on that cpu. Only when the block allocation fails and when freed blocks
> > > found were 0, that is when we sample percpu seq counter for all cpus using
> > > below function ext4_get_discard_pa_seq_sum(). This happens after making
> > > sure that all the PAs on grp->bb_prealloc_list got freed.
> > > 
> > > TO CHECK: Though in here rcu_barrier only happens in ENOSPC path.
> > > =================================================================
> > > On rcu_barrier() - How expensive it can be?
> > > Does that mean that every thread who is coming and waiting on
> > > rcu_barrier() will actually check whether call_rcu() has completed by
> > > checking that on every cpu? So will this be a O(n*m) operation?
> > > (n = no. of threads, m = no. of cpus).
> > > Or are there some sort of optimization in using rcu_barrier()?
> > 
> > This first part assumes that a single task is executing a series of
> > rcu_barrier() calls sequentially.
> > 
> > Yes, the rcu_barrier() function makes heavy use of batching optimizations
> > and asynchrony.  So from an overhead perspective, it queues an RCU
> > callback on each CPU that has at least one callback already queued.
> > It does this queuing in such a way to avoid the need for any additional
> > grace periods.  Of course, each callback actually queued will eventually
> > be invoked (function call through a pointer), and the callback will
> > atomically decrement a counter and wake up the rcu_barrier() task if
> > that counter reaches zero.  So the CPU overhead is O(M), where M is the
> > number of CPUs, with negligible overhead except for those CPUs that have
> > at least one RCU callback queued.
> > 
> > So the CPU overhead of rcu_barrier() is quite small, at least when
> > taken on a per-CPU basis.
> > 
> > However, the latency can be a bit longer than that of an RCU
> > grace period.  In normal operation, this will normally range from a few
> > milliseconds to a few hundred milliseconds.
> > 
> > And the caller of rcu_barrier() can also do batching.  For example,
> > the following:
> > 
> > 	make_call_rcu_stop(a);
> > 	rcu_barrier();
> > 	depend_on_callbacks_finished(a);
> > 	make_call_rcu_stop(b);
> > 	rcu_barrier();
> > 	depend_on_callbacks_finished(b);
> > 	...
> > 	make_call_rcu_stop(z);
> > 	rcu_barrier();
> > 	depend_on_callbacks_finished(z);
> > 
> > Can be safely transformed into this:
> > 
> > 	make_call_rcu_stop(a);
> > 	make_call_rcu_stop(b);
> > 	...
> > 	make_call_rcu_stop(z);
> > 	rcu_barrier();
> > 	depend_on_callbacks_finished(a);
> > 	depend_on_callbacks_finished(b);
> > 	...
> > 	depend_on_callbacks_finished(z);
> > 
> > That single rcu_barrier() call can cover all 26 instances.
> > Give or take possible software-engineering and maintainability
> > issues, of course.
> > 
> > But what if a large number of tasks are concurrently executing
> > rcu_barrier()?
> > 
> > These will also be batched.  The first task will run the rcu_barrier()
> > machinery.  Any rcu_barrier() calls that show up before the first one
> > completes will share a single second run of the rcu_barrier() machinery.
> > And any rcu_barrier() calls that show up before this second run of
> > the machinery completes will be handled by a single third run of this
> > machinery.  And so on.
> 
> Thanks for explaining that. So I was doing more reading of the
> rcu_barrier() code, and I think what it essentially does is depicted in
> lines 1 - 12 below.
> 
> So AFAIU,
> when there are multiple threads calling rcu_barrier(), what will mostly
> happen is that the 1st thread which calls this func will get the
> mutex_lock (line 1) and will make sure that all calls to call_rcu()
> already queued on different cpus are completed before this
> rcu_barrier() returns.
> The way this happens is that rcu_barrier() runs its own
> rcu_barrier_func() on all of those cpus which have call_rcu() queued
> up. rcu_barrier_func(), while running on each of those cpus, then
> queues a call_rcu() (at the tail of that cpu's call_rcu queue) with the
> rcu_barrier_callback() function. This call_rcu queued on the tail will
> run after all of the previous call_rcu() callbacks from those per-cpu
> queues have completed. This rcu_barrier_callback() will mostly run
> after a grace period (like how call_rcu() works in general).
> 
> And meanwhile all other threads will sleep while trying to acquire
> that mutex lock. So during this time these cpus could still be doing
> something useful (while their threads wait on the mutex).
> Now, when these other threads are able to acquire the mutex_lock() and
> proceed inside rcu_barrier(), if there are no more call_rcu() callbacks
> queued (since they should have been taken care of by the previous
> thread, checked in line 5 below), then nothing else will be done by
> rcu_barrier(), and it will simply exit.

True enough.  However, on a busy system it is very likely that
there will be additional callbacks enqueued somewhere by that
time, and so the next task to acquire the mutex will have to wait.
However, the third task might well be able to skip the whole process
based on the ->barrier_sequence counter in the rcu_state structure.
(The ->barrier_sequence manipulations are not shown in your steps below,
but they lead to the "if (rcu_seq_done" early exit just after the mutex
is acquired.)
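
Roughly, and simplifying quite a bit (details differ across kernel
versions), the entry of rcu_barrier() looks something like this:

	void rcu_barrier(void)
	{
		unsigned long s = rcu_seq_snap(&rcu_state.barrier_sequence);

		mutex_lock(&rcu_state.barrier_mutex);

		/* Did someone else already do our work for us? */
		if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
			smp_mb(); /* caller's code sees prior callbacks done. */
			mutex_unlock(&rcu_state.barrier_mutex);
			return;
		}

		/* ... otherwise queue rcu_barrier_callback() on each CPU
		 * that has callbacks pending and wait for the completion ... */
	}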

> Is the above understanding correct?

Reasonably close.

> Instead, if I use a spinlock, then in the worst case most of the
> threads will just spin on this lock while not being able to do anything
> useful. Also, if the workload is heavily multi-threaded and we hit this
> scenario, then this could also result in high cpu usage and sluggish
> performance, since all the threads are waiting on the spinlock.

It can also result in deadlock because something holding a pure spinlock
will need to do a wait_for_completion() while holding that spinlock.
And of course, lockdep will complain bitterly, as well it should.  ;-)
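
Just to make that concrete, a purely hypothetical sketch of the
problematic pattern (the lock and completion names below are made up,
not taken from the actual RCU code):

	static DEFINE_SPINLOCK(my_barrier_lock);	/* hypothetical */
	static DECLARE_COMPLETION(my_barrier_done);	/* hypothetical */

	static void my_barrier(void)
	{
		spin_lock(&my_barrier_lock);	/* disables preemption */

		/* ... queue per-CPU work that will eventually call
		 *     complete(&my_barrier_done) ... */

		/*
		 * wait_for_completion() may sleep, which is not allowed
		 * while holding a spinlock; lockdep (and might_sleep())
		 * will rightly complain, and we can deadlock here.
		 */
		wait_for_completion(&my_barrier_done);

		spin_unlock(&my_barrier_lock);
	}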

You also don't get to drop the lock before doing the wait_for_completion()
without adding some sort of synchronization preventing the next task
from using the rcu_data structure's ->barrier_head before the current
task is done with them.

> My main reason for trying to understand rcu_barrier() functionality is
> to weigh my options on which approach is better to deploy here.
> Hence all of these queries :)

Why not just try it and then measure the actual effect on performance?

Also, given actual measurements, there might be things I could do to
speed this up, for example, applying funnel locking to decrease memory
contention on ->barrier_mutex, should that turn out to be a real problem.

But it all depends on whether there is an actual performance problem,
and if so, what aspect of the overall algorithm is in trouble.

							Thanx, Paul

> 1. mutex_lock(&rcu_state.barrier_mutex);
> 2. init_completion(&rcu_state.barrier_completion);
> 3. atomic_set(&rcu_state.barrier_cpu_count, 1);
> 
> 4. for_each_possible_cpu(cpu) {
> 5. 	if_we_have_any_call_rcu_pending_on_this_cpu {
> 6. 		smp_call_function_single(cpu, rcu_barrier_func, NULL, 1)
> 7. 	}
> 8. }
> 
> 9. if (atomic_dec_and_test(&rcu_state.barrier_cpu_count))
> 10. 	complete(&rcu_state.barrier_completion);
> 
> 11. wait_for_completion(&rcu_state.barrier_completion);
> 
> 12. mutex_unlock(&rcu_state.barrier_mutex);
> 
> 
> Thanks
> -ritesh
> 
> 
> 
> > > ---
> > > Note: The other method [1] also used to check for grp->bb_free next time
> > > if we couldn't discard anything. But it had below concerns due to which
> > > we cannot use that approach.
> > > 1. One suspicion with that was that if grp->bb_free is non-zero for some
> > > reason (not sure), then we may end up in that loop indefinitely but still
> > > won't be able to satisfy any request.
> > > 2. To check for grp->bb_free all threads would have to wait on grp's
> > > spin_lock, which might result in increased cpu usage.
> > > 3. In case of group's PA allocation (i.e. when file's size is < 1MB for
> > >     64K blocksize), there is still a case where ENOSPC could be returned.
> > >     This happens when the grp->bb_free is set to 0 but those blocks are
> > >     actually not yet added to PA. Yes, this could actually happen since
> > >     reducing grp->bb_free and adding those extra blocks in
> > >     bb_prealloc_list are not done atomically. Hence the race.
> > > 
> > > [1]: https://patchwork.ozlabs.org/project/linux-ext4/patch/533ac1f5b19c520b08f8c99aec5baf8729185714.1586954511.git.riteshh@linux.ibm.com/
> > > 
> > > Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> > > ---
> > >   fs/ext4/mballoc.c | 65 +++++++++++++++++++++++++++++++++++++++++++----
> > >   1 file changed, 60 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > index a742e51e33b8..6bb08bb3c0ce 100644
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -357,6 +357,35 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
> > >   static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
> > >   						ext4_group_t group);
> > > +/*
> > > + * The algorithm using this percpu seq counter goes below:
> > > + * 1. We sample the percpu discard_pa_seq counter before trying for block
> > > + *    allocation in ext4_mb_new_blocks().
> > > + * 2. We increment this percpu discard_pa_seq counter when we succeed in
> > > + *    allocating blocks and hence while adding the remaining blocks in group's
> > > + *    bb_prealloc_list (ext4_mb_new_inode_pa/ext4_mb_new_group_pa).
> > > + * 3. We also increment this percpu seq counter when we successfully identify
> > > + *    that the bb_prealloc_list is not empty and hence proceed for discarding
> > > + *    of those PAs inside ext4_mb_discard_group_preallocations().
> > > + *
> > > + * Now to make sure that the regular fast path of block allocation is not
> > > + * affected, as a small optimization we only sample the percpu seq counter
> > > + * on that cpu. Only when the block allocation fails and when freed blocks
> > > + * found were 0, that is when we sample percpu seq counter for all cpus using
> > > + * below function ext4_get_discard_pa_seq_sum(). This happens after making
> > > + * sure that all the PAs on grp->bb_prealloc_list got freed.
> > > + */
> > > +DEFINE_PER_CPU(u64, discard_pa_seq);
> > > +static inline u64 ext4_get_discard_pa_seq_sum(void)
> > > +{
> > > +	int __cpu;
> > > +	u64 __seq = 0;
> > > +
> > > +	for_each_possible_cpu(__cpu)
> > > +		__seq += per_cpu(discard_pa_seq, __cpu);
> > > +	return __seq;
> > > +}
> > > +
> > >   static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
> > >   {
> > >   #if BITS_PER_LONG == 64
> > > @@ -3730,6 +3759,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
> > >   	pa->pa_inode = ac->ac_inode;
> > >   	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
> > > +	this_cpu_inc(discard_pa_seq);
> > >   	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
> > >   	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
> > > @@ -3791,6 +3821,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
> > >   	pa->pa_inode = NULL;
> > >   	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
> > > +	this_cpu_inc(discard_pa_seq);
> > >   	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
> > >   	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
> > > @@ -3943,6 +3974,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> > >   	INIT_LIST_HEAD(&list);
> > >   repeat:
> > >   	ext4_lock_group(sb, group);
> > > +	this_cpu_inc(discard_pa_seq);
> > >   	list_for_each_entry_safe(pa, tmp,
> > >   				&grp->bb_prealloc_list, pa_group_list) {
> > >   		spin_lock(&pa->pa_lock);
> > > @@ -4487,14 +4519,35 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
> > >   }
> > >   static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
> > > -			struct ext4_allocation_context *ac)
> > > +			struct ext4_allocation_context *ac, u64 *seq)
> > >   {
> > >   	int freed;
> > > +	u64 seq_retry = 0;
> > > +	bool ret = false;
> > >   	freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
> > > -	if (freed)
> > > -		return true;
> > > -	return false;
> > > +	if (freed) {
> > > +		ret = true;
> > > +		goto out_dbg;
> > > +	}
> > > +	/*
> > > +	 * Unless it is ensured that PAs are actually freed, we may hit
> > > +	 * a ENOSPC error since the next time seq may match while the PA blocks
> > > +	 * are still getting freed in ext4_mb_release_inode/group_pa().
> > > +	 * So, rcu_barrier() here is to make sure that any call_rcu queued in
> > > +	 * ext4_mb_discard_group_preallocations() is completed before we
> > > +	 * proceed further to retry for block allocation.
> > > +	 */
> > > +	rcu_barrier();
> > > +	seq_retry = ext4_get_discard_pa_seq_sum();
> > > +	if (seq_retry != *seq) {
> > > +		*seq = seq_retry;
> > > +		ret = true;
> > > +	}
> > > +
> > > +out_dbg:
> > > +	mb_debug(1, "freed %d, retry ? %s\n", freed, ret ? "yes" : "no");
> > > +	return ret;
> > >   }
> > >   /*
> > > @@ -4511,6 +4564,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> > >   	ext4_fsblk_t block = 0;
> > >   	unsigned int inquota = 0;
> > >   	unsigned int reserv_clstrs = 0;
> > > +	u64 seq;
> > >   	might_sleep();
> > >   	sb = ar->inode->i_sb;
> > > @@ -4572,6 +4626,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> > >   	}
> > >   	ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
> > > +	seq = *this_cpu_ptr(&discard_pa_seq);
> > >   	if (!ext4_mb_use_preallocated(ac)) {
> > >   		ac->ac_op = EXT4_MB_HISTORY_ALLOC;
> > >   		ext4_mb_normalize_request(ac, ar);
> > > @@ -4603,7 +4658,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> > >   			ar->len = ac->ac_b_ex.fe_len;
> > >   		}
> > >   	} else {
> > > -		if (ext4_mb_discard_preallocations_should_retry(sb, ac))
> > > +		if (ext4_mb_discard_preallocations_should_retry(sb, ac, &seq))
> > >   			goto repeat;
> > >   		*errp = -ENOSPC;
> > >   	}
> > > -- 
> > > 2.21.0
> > > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-05-05  0:36 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-01  6:29 [RFC 00/20] ext4: Fix ENOSPC error, improve mballoc dbg, other cleanups Ritesh Harjani
2020-05-01  6:29 ` [RFC 01/20] ext4: mballoc: Refactor ext4_mb_discard_preallocations() Ritesh Harjani
2020-05-01  6:29 ` [RFC 02/20] ext4: Introduce percpu seq counter for freeing blocks(PA) to avoid ENOSPC err Ritesh Harjani
2020-05-01 18:31   ` Paul E. McKenney
2020-05-04 22:34     ` Ritesh Harjani
2020-05-05  0:36       ` Paul E. McKenney
2020-05-01  6:29 ` [RFC 03/20] ext4: mballoc: Do print bb_free info even when it is 0 Ritesh Harjani
2020-05-01  6:29 ` [RFC 04/20] ext4: mballoc: Refactor ext4_mb_show_ac() Ritesh Harjani
2020-05-01  6:29 ` [RFC 05/20] ext4: mballoc: Add more mb_debug() msgs Ritesh Harjani
2020-05-01  6:29 ` [RFC 06/20] ext4: mballoc: Correct the mb_debug() format specifier for pa_len var Ritesh Harjani
2020-05-01  6:29 ` [RFC 07/20] ext4: mballoc: Fix few other format specifier in mb_debug() Ritesh Harjani
2020-05-01  6:29 ` [RFC 08/20] ext4: mballoc: Simplify error handling in ext4_init_mballoc() Ritesh Harjani
2020-05-01  6:29 ` [RFC 09/20] ext4: mballoc: Make ext4_mb_use_preallocated() return type as bool Ritesh Harjani
2020-05-01  6:29 ` [RFC 10/20] ext4: mballoc: Remove EXT4_MB_HINT_GOAL_ONLY and it's related code Ritesh Harjani
2020-05-01  6:29 ` [RFC 11/20] ext4: mballoc: Refactor code inside DOUBLE_CHECK into separate function Ritesh Harjani
2020-05-01  6:29 ` [RFC 12/20] ext4: mballoc: Fix possible NULL ptr dereference from mb_cmp_bitmaps() Ritesh Harjani
2020-05-01  6:29 ` [RFC 13/20] ext4: mballoc: Don't BUG if kmalloc or read blk bitmap fail for DOUBLE_CHECK Ritesh Harjani
2020-05-01  6:29 ` [RFC 14/20] ext4: balloc: Use task_pid_nr() helper Ritesh Harjani
2020-05-01  6:29 ` [RFC 15/20] ext4: Use BIT() macro for BH_** state bits Ritesh Harjani
2020-05-01  6:29 ` [RFC 16/20] ext4: Improve ext_debug() msg in case of block allocation failure Ritesh Harjani
2020-05-01  6:29 ` [RFC 17/20] ext4: Replace EXT_DEBUG with __maybe_unused in ext4_ext_handle_unwritten_extents() Ritesh Harjani
2020-05-01  6:30 ` [RFC 18/20] ext4: mballoc: Make mb_debug() implementation to use pr_debug() Ritesh Harjani
2020-05-01  6:30 ` [RFC 19/20] ext4: Make ext_debug() " Ritesh Harjani
2020-05-01  6:30 ` [RFC 20/20] ext4: Add process name and pid in ext4_msg() Ritesh Harjani

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).