linux-ext4.vger.kernel.org archive mirror
* [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case
@ 2020-05-20  6:40 Ritesh Harjani
  2020-05-20  6:40 ` [PATCHv5 1/5] ext4: mballoc: Add blocks to PA list under same spinlock after allocating blocks Ritesh Harjani
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-05-20  6:40 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, Jan Kara, tytso, Aneesh Kumar K . V, linux-kernel,
	Ritesh Harjani

Hello All,

Please note that these patches are based on top of the mballoc cleanup
series [2], which is also pending review. :)

v4 -> v5:
1. Removed ext4_lock_group() from the fastpath and take it only in the retry
attempt, so that fastpath performance is not affected.

v3 -> v4:
1. Split out the code cleanups and debug improvements into a separate patch
series.
2. Dropped the rcu_barrier() approach since it caused some latency
in my testing of the ENOSPC handling.
3. This patch series takes a different approach to improve the multi-threaded
ENOSPC handling in the ext4 mballoc code. The mail below gives more details.


Background
==========
Consider a case where your disk is close to full but enough space still
remains for your multi-threaded application to run. Now when this
application's threads try to write in parallel (e.g. to a sparse file
followed by mmap writes, or even by fallocating multiple files), then with
the current ext4 multi-block allocator code the application may get an
ENOSPC error in some cases. Examining the disk space at that time, we see
there is sufficient space remaining for the application to continue to run.

Additional info:
================
1. Our internal test team was easily able to reproduce this ENOSPC error on
   an upstream kernel with a 2GB ext4 image with 64K blocksize. They didn't
   try above 2GB and reported this issue directly to the dev team. On
   examining the free space when the application got ENOSPC, the free space
   left was more than 50% of the filesystem size in some cases.

2. For debugging/development of these patches, I used the script below [1]
   to trigger this issue quite frequently on a 64K blocksize setup with a
   240MB ext4 image.


Summary of patches and problem with current design
==================================================
There are 3 main problems which these patches try to address, and hence
improve the ENOSPC handling in ext4's multi-block allocator code.

1. Patch-5: Earlier we checked whether a group is good (i.e. whether it has
   enough free blocks to serve the request) without taking the group's lock.
   This could race with another thread discarding the group's prealloc list:
   the allocation thread would not consider the blocks about to be freed,
   report the group as unfit for allocation, and eventually fail with an
   ENOSPC error. (Patch-4 refactors ext4_mb_good_group() to make this
   possible.)

2. Patch-1: The discard-PA algorithm only scans the PA list to free up the
   additional blocks which got added to the PA; this is done by the same
   thread-A which at first couldn't allocate any blocks. But there was a
   window where, once the blocks were allocated (say by some other thread-B
   previously), we dropped the group's lock, checked whether some of these
   blocks could be added to the prealloc list of the group we allocated
   from, and only then re-took the lock to add these additional blocks
   allocated by thread-B to the PA list. If thread-A scanned the PA list
   within that interval, there is a possibility that it would find no blocks
   added to the PA list and hence return an ENOSPC error.
   Hence this patch adds those additional blocks to the PA list right after
   the blocks are marked as used, with the same group spinlock held.

3. Patch-3: Introduces a percpu discard_pa_seq counter which is incremented
   whenever blocks are allocated or freed, or when the discarding of any
   group's PA list starts. With this we know when to stop the retry logic
   and return an ENOSPC error if there is actually no free space left.
   As an optimization, so that the block allocation fast path is not
   affected, before starting the block allocation we only sample the percpu
   seq count on the local cpu. Only when the allocation fails, and discard
   couldn't free up any blocks from any of the groups' PA lists, do we
   sample the sum of the percpu seq counter over all possible cpus to check
   whether we need to retry (see the condensed sketch just below).
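
A condensed sketch of the resulting allocation/retry flow (simplified from
the diffs in patches 2/5 and 3/5 below; the ac setup, PA management and
error paths are elided):

	static DEFINE_PER_CPU(u64, discard_pa_seq);

	/* inside ext4_mb_new_blocks(), simplified */
	seq = *this_cpu_ptr(&discard_pa_seq);	/* fastpath: local cpu only */
repeat:
	*errp = ext4_mb_regular_allocator(ac);
	if (unlikely(ac->ac_status != AC_STATUS_FOUND)) {
		/*
		 * Slowpath: try discarding PAs. If nothing could be
		 * freed, retry only if the sum of discard_pa_seq over
		 * all cpus has moved on, i.e. someone else allocated,
		 * freed or started discarding blocks in the meantime.
		 */
		if (ext4_mb_discard_preallocations_should_retry(sb, ac, &seq))
			goto repeat;
		*errp = -ENOSPC;
	}

(The *this_cpu_ptr() read on the fast path is what later triggers the
"BUG: using smp_processor_id() in preemptible code" report discussed at the
end of this thread, where this_cpu_read() is suggested instead.)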


Testing:
=========
Tested fstests with the default bs of 4K and with bs == PAGESIZE ("-g auto").
No new failures were reported with this patch series in this testing.

NOTE:
1. This patch series is based on top of mballoc code cleanup patch series
posted at [2].

References:
===========
[v3]: https://lkml.kernel.org/linux-ext4/cover.1588313626.git.riteshh@linux.ibm.com/
[1]: https://github.com/riteshharjani/LinuxStudy/blob/master/tools/test_mballoc.sh
[2]: https://lkml.kernel.org/linux-ext4/cover.1589086800.git.riteshh@linux.ibm.com/


Ritesh Harjani (5):
  ext4: mballoc: Add blocks to PA list under same spinlock after allocating blocks
  ext4: mballoc: Refactor ext4_mb_discard_preallocations()
  ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
  ext4: mballoc: Refactor ext4_mb_good_group()
  ext4: mballoc: Use lock for checking free blocks while retrying

 fs/ext4/ext4.h              |   2 +
 fs/ext4/mballoc.c           | 247 ++++++++++++++++++++++++++----------
 include/trace/events/ext4.h |   3 +-
 3 files changed, 185 insertions(+), 67 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCHv5 1/5] ext4: mballoc: Add blocks to PA list under same spinlock after allocating blocks
  2020-05-20  6:40 [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case Ritesh Harjani
@ 2020-05-20  6:40 ` Ritesh Harjani
  2020-05-20  6:40 ` [PATCHv5 2/5] ext4: mballoc: Refactor ext4_mb_discard_preallocations() Ritesh Harjani
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-05-20  6:40 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, Jan Kara, tytso, Aneesh Kumar K . V, linux-kernel,
	Ritesh Harjani

ext4_mb_discard_preallocations() only checks each group's
grp->bb_prealloc_list to discard the group's PAs and free up space when
an allocation request fails. Consider the race below:-

Process A  				Process B

1. allocate blocks
					1. Fails block allocation from
					     ext4_mb_regular_allocator()
   ext4_lock_group()
	allocated blocks
	more than ac_o_ex.fe_len
   ext4_unlock_group()
					2. Scans the
					   grp->bb_prealloc_list (under
					   ext4_lock_group()) and
					   find nothing and thus return
					   -ENOSPC.

2. Add the additional blocks to PA list

   ext4_lock_group()
   	add blocks to grp->bb_prealloc_list
   ext4_unlock_group()

The above race can be avoided if we add those additional blocks to
grp->bb_prealloc_list at the same time as the block allocation, while
ext4_lock_group() is still held.
With this, discard-PA will know if there are actually any blocks which
could be freed from the PA.
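
Condensed view of the net effect (taken from the diff below): the PA
descriptor is now allocated up front in ext4_mb_new_blocks(), and the
new-preallocation step moves into ext4_mb_use_best_found(), which runs with
ext4_lock_group() held:

	/* ext4_mb_use_best_found(), called under ext4_lock_group() */
	/*
	 * As we've just preallocated more space than
	 * user requested originally, we store allocated
	 * space in a special descriptor.
	 */
	if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
		ext4_mb_new_preallocation(ac);

so discard-PA can no longer observe a window in which the blocks are
already marked used but their PA is not yet on grp->bb_prealloc_list.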

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 97 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 62 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 33a69424942c..decc5168d126 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -349,6 +349,7 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
 					ext4_group_t group);
 static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
 						ext4_group_t group);
+static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
 static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
 {
@@ -1701,6 +1702,14 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
 		spin_unlock(&sbi->s_md_lock);
 	}
+	/*
+	 * As we've just preallocated more space than
+	 * user requested originally, we store allocated
+	 * space in a special descriptor.
+	 */
+	if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
+		ext4_mb_new_preallocation(ac);
+
 }
 
 /*
@@ -1949,7 +1958,7 @@ void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac,
 
 		ext4_mb_use_best_found(ac, e4b);
 
-		BUG_ON(ac->ac_b_ex.fe_len != ac->ac_g_ex.fe_len);
+		BUG_ON(ac->ac_f_ex.fe_len != ac->ac_g_ex.fe_len);
 
 		if (EXT4_SB(sb)->s_mb_stats)
 			atomic_inc(&EXT4_SB(sb)->s_bal_2orders);
@@ -3675,7 +3684,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
 /*
  * creates new preallocated space for given inode
  */
-static noinline_for_stack int
+static noinline_for_stack void
 ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 {
 	struct super_block *sb = ac->ac_sb;
@@ -3688,10 +3697,9 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 	BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
 	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
 	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
+	BUG_ON(ac->ac_pa == NULL);
 
-	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
-	if (pa == NULL)
-		return -ENOMEM;
+	pa = ac->ac_pa;
 
 	if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
 		int winl;
@@ -3735,7 +3743,6 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 	pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
 	pa->pa_len = ac->ac_b_ex.fe_len;
 	pa->pa_free = pa->pa_len;
-	atomic_set(&pa->pa_count, 1);
 	spin_lock_init(&pa->pa_lock);
 	INIT_LIST_HEAD(&pa->pa_inode_list);
 	INIT_LIST_HEAD(&pa->pa_group_list);
@@ -3755,21 +3762,17 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 	pa->pa_obj_lock = &ei->i_prealloc_lock;
 	pa->pa_inode = ac->ac_inode;
 
-	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
 	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
-	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
 
 	spin_lock(pa->pa_obj_lock);
 	list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
 	spin_unlock(pa->pa_obj_lock);
-
-	return 0;
 }
 
 /*
  * creates new preallocated space for locality group inodes belongs to
  */
-static noinline_for_stack int
+static noinline_for_stack void
 ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 {
 	struct super_block *sb = ac->ac_sb;
@@ -3781,11 +3784,9 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 	BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
 	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
 	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
+	BUG_ON(ac->ac_pa == NULL);
 
-	BUG_ON(ext4_pspace_cachep == NULL);
-	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
-	if (pa == NULL)
-		return -ENOMEM;
+	pa = ac->ac_pa;
 
 	/* preallocation can change ac_b_ex, thus we store actually
 	 * allocated blocks for history */
@@ -3795,7 +3796,6 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 	pa->pa_lstart = pa->pa_pstart;
 	pa->pa_len = ac->ac_b_ex.fe_len;
 	pa->pa_free = pa->pa_len;
-	atomic_set(&pa->pa_count, 1);
 	spin_lock_init(&pa->pa_lock);
 	INIT_LIST_HEAD(&pa->pa_inode_list);
 	INIT_LIST_HEAD(&pa->pa_group_list);
@@ -3816,26 +3816,20 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 	pa->pa_obj_lock = &lg->lg_prealloc_lock;
 	pa->pa_inode = NULL;
 
-	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
 	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
-	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
 
 	/*
 	 * We will later add the new pa to the right bucket
 	 * after updating the pa_free in ext4_mb_release_context
 	 */
-	return 0;
 }
 
-static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
+static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
 {
-	int err;
-
 	if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
-		err = ext4_mb_new_group_pa(ac);
+		ext4_mb_new_group_pa(ac);
 	else
-		err = ext4_mb_new_inode_pa(ac);
-	return err;
+		ext4_mb_new_inode_pa(ac);
 }
 
 /*
@@ -4150,6 +4144,29 @@ void ext4_discard_preallocations(struct inode *inode)
 	}
 }
 
+static int ext4_mb_pa_alloc(struct ext4_allocation_context *ac)
+{
+	struct ext4_prealloc_space *pa;
+
+	BUG_ON(ext4_pspace_cachep == NULL);
+	pa = kmem_cache_zalloc(ext4_pspace_cachep, GFP_NOFS);
+	if (!pa)
+		return -ENOMEM;
+	atomic_set(&pa->pa_count, 1);
+	ac->ac_pa = pa;
+	return 0;
+}
+
+static void ext4_mb_pa_free(struct ext4_allocation_context *ac)
+{
+	struct ext4_prealloc_space *pa = ac->ac_pa;
+
+	BUG_ON(!pa);
+	ac->ac_pa = NULL;
+	WARN_ON(!atomic_dec_and_test(&pa->pa_count));
+	kmem_cache_free(ext4_pspace_cachep, pa);
+}
+
 #ifdef CONFIG_EXT4_DEBUG
 static inline void ext4_mb_show_pa(struct super_block *sb)
 {
@@ -4606,23 +4623,28 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 	if (!ext4_mb_use_preallocated(ac)) {
 		ac->ac_op = EXT4_MB_HISTORY_ALLOC;
 		ext4_mb_normalize_request(ac, ar);
+
+		*errp = ext4_mb_pa_alloc(ac);
+		if (*errp)
+			goto errout;
 repeat:
 		/* allocate space in core */
 		*errp = ext4_mb_regular_allocator(ac);
-		if (*errp)
-			goto discard_and_exit;
-
-		/* as we've just preallocated more space than
-		 * user requested originally, we store allocated
-		 * space in a special descriptor */
-		if (ac->ac_status == AC_STATUS_FOUND &&
-		    ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
-			*errp = ext4_mb_new_preallocation(ac);
+		/*
+		 * pa allocated above is added to grp->bb_prealloc_list only
+		 * when we were able to allocate some block i.e. when
+		 * ac->ac_status == AC_STATUS_FOUND.
+		 * And error from above mean ac->ac_status != AC_STATUS_FOUND
+		 * So we have to free this pa here itself.
+		 */
 		if (*errp) {
-		discard_and_exit:
+			ext4_mb_pa_free(ac);
 			ext4_discard_allocated_blocks(ac);
 			goto errout;
 		}
+		if (ac->ac_status == AC_STATUS_FOUND &&
+			ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len)
+			ext4_mb_pa_free(ac);
 	}
 	if (likely(ac->ac_status == AC_STATUS_FOUND)) {
 		*errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_clstrs);
@@ -4637,6 +4659,11 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 		freed  = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
 		if (freed)
 			goto repeat;
+		/*
+		 * If block allocation fails then the pa allocated above
+		 * needs to be freed here itself.
+		 */
+		ext4_mb_pa_free(ac);
 		*errp = -ENOSPC;
 	}
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCHv5 2/5] ext4: mballoc: Refactor ext4_mb_discard_preallocations()
  2020-05-20  6:40 [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case Ritesh Harjani
  2020-05-20  6:40 ` [PATCHv5 1/5] ext4: mballoc: Add blocks to PA list under same spinlock after allocating blocks Ritesh Harjani
@ 2020-05-20  6:40 ` Ritesh Harjani
  2020-05-20  6:40 ` [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling Ritesh Harjani
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-05-20  6:40 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, Jan Kara, tytso, Aneesh Kumar K . V, linux-kernel,
	Ritesh Harjani

Implement ext4_mb_discard_preallocations_should_retry(), which we will
need in later patches to add more logic, such as checking for a
sequence number match, to decide whether we should retry the block
allocation or not.

There should be no functional change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index decc5168d126..b75408d72773 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4543,6 +4543,17 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 	return freed;
 }
 
+static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
+			struct ext4_allocation_context *ac)
+{
+	int freed;
+
+	freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
+	if (freed)
+		return true;
+	return false;
+}
+
 /*
  * Main entry point into mballoc to allocate blocks
  * it tries to use preallocation first, then falls back
@@ -4551,7 +4562,6 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 				struct ext4_allocation_request *ar, int *errp)
 {
-	int freed;
 	struct ext4_allocation_context *ac = NULL;
 	struct ext4_sb_info *sbi;
 	struct super_block *sb;
@@ -4656,8 +4666,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 			ar->len = ac->ac_b_ex.fe_len;
 		}
 	} else {
-		freed  = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
-		if (freed)
+		if (ext4_mb_discard_preallocations_should_retry(sb, ac))
 			goto repeat;
 		/*
 		 * If block allocation fails then the pa allocated above
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
  2020-05-20  6:40 [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case Ritesh Harjani
  2020-05-20  6:40 ` [PATCHv5 1/5] ext4: mballoc: Add blocks to PA list under same spinlock after allocating blocks Ritesh Harjani
  2020-05-20  6:40 ` [PATCHv5 2/5] ext4: mballoc: Refactor ext4_mb_discard_preallocations() Ritesh Harjani
@ 2020-05-20  6:40 ` Ritesh Harjani
       [not found]   ` <CGME20200603064851eucas1p2e435089fbdf4de1d1fa3fb051c2f3d7b@eucas1p2.samsung.com>
  2020-05-20  6:40 ` [PATCHv5 4/5] ext4: mballoc: Refactor ext4_mb_good_group() Ritesh Harjani
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Ritesh Harjani @ 2020-05-20  6:40 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, Jan Kara, tytso, Aneesh Kumar K . V, linux-kernel,
	Ritesh Harjani

There could be a race in ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through the group's bb_prealloc_list,
remove all the PAs and add them to the function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that group->bb_prealloc_list is empty and will return 0.

Consider a case where we have a small number of groups
(e.g. just group 0);
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(from where we call ext4_mb_discard_group_preallocations()).
But that is wrong, since the 2nd thread should have waited for the 1st
thread to release all the PAs and then retried the allocation,
since the 1st thread was going to discard the PAs anyway.

The algorithm using this percpu seq counter goes below:
1. We sample the percpu discard_pa_seq counter before trying for block
   allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
   or free these blocks i.e. while marking those blocks as used/free in
   mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
   that the bb_prealloc_list is not empty and hence proceed for discarding
   of those PAs inside ext4_mb_discard_group_preallocations().

Now to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails and when freed blocks
found were 0, that is when we sample percpu seq counter for all cpus using
below function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.

It can well be argued: why not just check grp->bb_free to see if there
are any free blocks to be allocated? So here are the two concerns which
were discussed:-

1. If for some reason the blocks available in the group are not
   appropriate for the allocation logic (say e.g.
   EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
   the retry logic may result in infinite looping since grp->bb_free is
   non-zero.

2. Also, before preallocation was clubbed with block allocation with the
   same ext4_lock_group() held, there were a lot of races where
   grp->bb_free could not be relied upon.

Due to the above, this patch uses the discard_pa_seq logic to determine
if we should retry the block allocation. Say there are n threads trying
for block allocation and none of them could allocate or discard any
blocks; then all of those n threads will fail the block allocation and
return an -ENOSPC error (since the seq counter for all of them will
match, as no block allocation/discard was done during that duration).

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 56 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 51 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b75408d72773..754ff9f65199 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -351,6 +351,35 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
 						ext4_group_t group);
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
+/*
+ * The algorithm using this percpu seq counter goes below:
+ * 1. We sample the percpu discard_pa_seq counter before trying for block
+ *    allocation in ext4_mb_new_blocks().
+ * 2. We increment this percpu discard_pa_seq counter when we either allocate
+ *    or free these blocks i.e. while marking those blocks as used/free in
+ *    mb_mark_used()/mb_free_blocks().
+ * 3. We also increment this percpu seq counter when we successfully identify
+ *    that the bb_prealloc_list is not empty and hence proceed for discarding
+ *    of those PAs inside ext4_mb_discard_group_preallocations().
+ *
+ * Now to make sure that the regular fast path of block allocation is not
+ * affected, as a small optimization we only sample the percpu seq counter
+ * on that cpu. Only when the block allocation fails and when freed blocks
+ * found were 0, that is when we sample percpu seq counter for all cpus using
+ * below function ext4_get_discard_pa_seq_sum(). This happens after making
+ * sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.
+ */
+static DEFINE_PER_CPU(u64, discard_pa_seq);
+static inline u64 ext4_get_discard_pa_seq_sum(void)
+{
+	int __cpu;
+	u64 __seq = 0;
+
+	for_each_possible_cpu(__cpu)
+		__seq += per_cpu(discard_pa_seq, __cpu);
+	return __seq;
+}
+
 static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
 {
 #if BITS_PER_LONG == 64
@@ -1462,6 +1491,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
 	mb_check_buddy(e4b);
 	mb_free_blocks_double(inode, e4b, first, count);
 
+	this_cpu_inc(discard_pa_seq);
 	e4b->bd_info->bb_free += count;
 	if (first < e4b->bd_info->bb_first_free)
 		e4b->bd_info->bb_first_free = first;
@@ -1603,6 +1633,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 	mb_check_buddy(e4b);
 	mb_mark_used_double(e4b, start, len);
 
+	this_cpu_inc(discard_pa_seq);
 	e4b->bd_info->bb_free -= len;
 	if (e4b->bd_info->bb_first_free == start)
 		e4b->bd_info->bb_first_free += len;
@@ -3962,6 +3993,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	INIT_LIST_HEAD(&list);
 repeat:
 	ext4_lock_group(sb, group);
+	this_cpu_inc(discard_pa_seq);
 	list_for_each_entry_safe(pa, tmp,
 				&grp->bb_prealloc_list, pa_group_list) {
 		spin_lock(&pa->pa_lock);
@@ -4544,14 +4576,26 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 }
 
 static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
-			struct ext4_allocation_context *ac)
+			struct ext4_allocation_context *ac, u64 *seq)
 {
 	int freed;
+	u64 seq_retry = 0;
+	bool ret = false;
 
 	freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
-	if (freed)
-		return true;
-	return false;
+	if (freed) {
+		ret = true;
+		goto out_dbg;
+	}
+	seq_retry = ext4_get_discard_pa_seq_sum();
+	if (seq_retry != *seq) {
+		*seq = seq_retry;
+		ret = true;
+	}
+
+out_dbg:
+	mb_debug(sb, "freed %d, retry ? %s\n", freed, ret ? "yes" : "no");
+	return ret;
 }
 
 /*
@@ -4568,6 +4612,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 	ext4_fsblk_t block = 0;
 	unsigned int inquota = 0;
 	unsigned int reserv_clstrs = 0;
+	u64 seq;
 
 	might_sleep();
 	sb = ar->inode->i_sb;
@@ -4630,6 +4675,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 	}
 
 	ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
+	seq = *this_cpu_ptr(&discard_pa_seq);
 	if (!ext4_mb_use_preallocated(ac)) {
 		ac->ac_op = EXT4_MB_HISTORY_ALLOC;
 		ext4_mb_normalize_request(ac, ar);
@@ -4666,7 +4712,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 			ar->len = ac->ac_b_ex.fe_len;
 		}
 	} else {
-		if (ext4_mb_discard_preallocations_should_retry(sb, ac))
+		if (ext4_mb_discard_preallocations_should_retry(sb, ac, &seq))
 			goto repeat;
 		/*
 		 * If block allocation fails then the pa allocated above
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCHv5 4/5] ext4: mballoc: Refactor ext4_mb_good_group()
  2020-05-20  6:40 [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case Ritesh Harjani
                   ` (2 preceding siblings ...)
  2020-05-20  6:40 ` [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling Ritesh Harjani
@ 2020-05-20  6:40 ` Ritesh Harjani
  2020-05-20  6:40 ` [PATCHv5 5/5] ext4: mballoc: Use lock for checking free blocks while retrying Ritesh Harjani
  2020-05-29  2:40 ` [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case Theodore Y. Ts'o
  5 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-05-20  6:40 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, Jan Kara, tytso, Aneesh Kumar K . V, linux-kernel,
	Ritesh Harjani

The ext4_mb_good_group() definition was changed some time back and it
now even initializes the buddy cache (via ext4_mb_init_group()) in case
EXT4_MB_GRP_NEED_INIT() is true for a group.
Note that ext4_mb_init_group() could sleep and so should not be called
with a spinlock held.
This is fine as of now because ext4_mb_good_group() is called before
loading the buddy bitmap without ext4_lock_group() held, and called
again after loading the bitmap, this time with ext4_lock_group() held.
But still, this whole thing is confusing.

So this patch refactors out ext4_mb_good_group_nolock(), which should
be called without holding ext4_lock_group().
Also, in later patches we hold the spinlock (ext4_lock_group()) while
doing any calculations which involve grp->bb_free or grp->bb_fragments.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 78 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 754ff9f65199..c9297c878a90 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2106,15 +2106,14 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
 }
 
 /*
- * This is now called BEFORE we load the buddy bitmap.
+ * This is also called BEFORE we load the buddy bitmap.
  * Returns either 1 or 0 indicating that the group is either suitable
- * for the allocation or not. In addition it can also return negative
- * error code when something goes wrong.
+ * for the allocation or not.
  */
-static int ext4_mb_good_group(struct ext4_allocation_context *ac,
+static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
 				ext4_group_t group, int cr)
 {
-	unsigned free, fragments;
+	ext4_grpblk_t free, fragments;
 	int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb));
 	struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
 
@@ -2122,23 +2121,16 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 
 	free = grp->bb_free;
 	if (free == 0)
-		return 0;
+		return false;
 	if (cr <= 2 && free < ac->ac_g_ex.fe_len)
-		return 0;
+		return false;
 
 	if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
-		return 0;
-
-	/* We only do this if the grp has never been initialized */
-	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
-		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
-		if (ret)
-			return ret;
-	}
+		return false;
 
 	fragments = grp->bb_fragments;
 	if (fragments == 0)
-		return 0;
+		return false;
 
 	switch (cr) {
 	case 0:
@@ -2148,31 +2140,63 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 		if ((ac->ac_flags & EXT4_MB_HINT_DATA) &&
 		    (flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) &&
 		    ((group % flex_size) == 0))
-			return 0;
+			return false;
 
 		if ((ac->ac_2order > ac->ac_sb->s_blocksize_bits+1) ||
 		    (free / fragments) >= ac->ac_g_ex.fe_len)
-			return 1;
+			return true;
 
 		if (grp->bb_largest_free_order < ac->ac_2order)
-			return 0;
+			return false;
 
-		return 1;
+		return true;
 	case 1:
 		if ((free / fragments) >= ac->ac_g_ex.fe_len)
-			return 1;
+			return true;
 		break;
 	case 2:
 		if (free >= ac->ac_g_ex.fe_len)
-			return 1;
+			return true;
 		break;
 	case 3:
-		return 1;
+		return true;
 	default:
 		BUG();
 	}
 
-	return 0;
+	return false;
+}
+
+/*
+ * This could return negative error code if something goes wrong
+ * during ext4_mb_init_group(). This should not be called with
+ * ext4_lock_group() held.
+ */
+static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
+				     ext4_group_t group, int cr)
+{
+	struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
+	ext4_grpblk_t free;
+	int ret = 0;
+
+	free = grp->bb_free;
+	if (free == 0)
+		goto out;
+	if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+		goto out;
+	if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
+		goto out;
+
+	/* We only do this if the grp has never been initialized */
+	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
+		ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
+		if (ret)
+			return ret;
+	}
+
+	ret = ext4_mb_good_group(ac, group, cr);
+out:
+	return ret;
 }
 
 static noinline_for_stack int
@@ -2260,7 +2284,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 				group = 0;
 
 			/* This now checks without needing the buddy page */
-			ret = ext4_mb_good_group(ac, group, cr);
+			ret = ext4_mb_good_group_nolock(ac, group, cr);
 			if (ret <= 0) {
 				if (!first_err)
 					first_err = ret;
@@ -2278,11 +2302,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			 * block group
 			 */
 			ret = ext4_mb_good_group(ac, group, cr);
-			if (ret <= 0) {
+			if (ret == 0) {
 				ext4_unlock_group(sb, group);
 				ext4_mb_unload_buddy(&e4b);
-				if (!first_err)
-					first_err = ret;
 				continue;
 			}
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCHv5 5/5] ext4: mballoc: Use lock for checking free blocks while retrying
  2020-05-20  6:40 [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case Ritesh Harjani
                   ` (3 preceding siblings ...)
  2020-05-20  6:40 ` [PATCHv5 4/5] ext4: mballoc: Refactor ext4_mb_good_group() Ritesh Harjani
@ 2020-05-20  6:40 ` Ritesh Harjani
  2020-05-29  2:40 ` [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case Theodore Y. Ts'o
  5 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-05-20  6:40 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, Jan Kara, tytso, Aneesh Kumar K . V, linux-kernel,
	Ritesh Harjani

Currently, while doing block allocation, grp->bb_free may be getting
modified if a discard is happening in parallel.
For example, consider a case where a lot of threads have preallocated a
lot of blocks and one thread is trying to discard all of this group's
PAs. Now it could happen that we see the group's bb_free as zero and
fail the allocation, while there would be sufficient space if we freed
up all the PAs.

So this patch adds another flag, "EXT4_MB_STRICT_CHECK", which will be
set if we are unable to allocate any blocks in the first try (since we
may not have considered blocks about to be discarded from PA lists).
During the retry attempt to allocate blocks we then use
ext4_lock_group() for checking if the group is good or not.
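
Condensed from the diff below: the retry decision now flips the
allocation into strict mode, and the nolock good-group check honours it:

	/* ext4_mb_discard_preallocations_should_retry(), simplified */
	seq_retry = ext4_get_discard_pa_seq_sum();
	if (!(ac->ac_flags & EXT4_MB_STRICT_CHECK) || seq_retry != *seq) {
		ac->ac_flags |= EXT4_MB_STRICT_CHECK; /* retry under lock */
		*seq = seq_retry;
		ret = true;
	}

	/* ext4_mb_good_group_nolock(), simplified */
	bool should_lock = ac->ac_flags & EXT4_MB_STRICT_CHECK;

	if (should_lock)
		ext4_lock_group(sb, group);
	/* ... grp->bb_free and ext4_mb_good_group() checks ... */
	if (should_lock)
		ext4_unlock_group(sb, group);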

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/ext4.h              |  2 ++
 fs/ext4/mballoc.c           | 13 ++++++++++++-
 include/trace/events/ext4.h |  3 ++-
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index fb37fb3fe689..d185f3bcb9eb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -150,6 +150,8 @@ enum SHIFT_DIRECTION {
 #define EXT4_MB_USE_ROOT_BLOCKS		0x1000
 /* Use blocks from reserved pool */
 #define EXT4_MB_USE_RESERVED		0x2000
+/* Do strict check for free blocks while retrying block allocation */
+#define EXT4_MB_STRICT_CHECK		0x4000
 
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c9297c878a90..a9083113a8c0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2176,9 +2176,13 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
 				     ext4_group_t group, int cr)
 {
 	struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
+	struct super_block *sb = ac->ac_sb;
+	bool should_lock = ac->ac_flags & EXT4_MB_STRICT_CHECK;
 	ext4_grpblk_t free;
 	int ret = 0;
 
+	if (should_lock)
+		ext4_lock_group(sb, group);
 	free = grp->bb_free;
 	if (free == 0)
 		goto out;
@@ -2186,6 +2190,8 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
 		goto out;
 	if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
 		goto out;
+	if (should_lock)
+		ext4_unlock_group(sb, group);
 
 	/* We only do this if the grp has never been initialized */
 	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
@@ -2194,8 +2200,12 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
 			return ret;
 	}
 
+	if (should_lock)
+		ext4_lock_group(sb, group);
 	ret = ext4_mb_good_group(ac, group, cr);
 out:
+	if (should_lock)
+		ext4_unlock_group(sb, group);
 	return ret;
 }
 
@@ -4610,7 +4620,8 @@ static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
 		goto out_dbg;
 	}
 	seq_retry = ext4_get_discard_pa_seq_sum();
-	if (seq_retry != *seq) {
+	if (!(ac->ac_flags & EXT4_MB_STRICT_CHECK) || seq_retry != *seq) {
+		ac->ac_flags |= EXT4_MB_STRICT_CHECK;
 		*seq = seq_retry;
 		ret = true;
 	}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 19c87661eeec..0df9efa80b16 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -35,7 +35,8 @@ struct partial_cluster;
 	{ EXT4_MB_DELALLOC_RESERVED,	"DELALLOC_RESV" },	\
 	{ EXT4_MB_STREAM_ALLOC,		"STREAM_ALLOC" },	\
 	{ EXT4_MB_USE_ROOT_BLOCKS,	"USE_ROOT_BLKS" },	\
-	{ EXT4_MB_USE_RESERVED,		"USE_RESV" })
+	{ EXT4_MB_USE_RESERVED,		"USE_RESV" },		\
+	{ EXT4_MB_STRICT_CHECK,		"STRICT_CHECK" })
 
 #define show_map_flags(flags) __print_flags(flags, "|",			\
 	{ EXT4_GET_BLOCKS_CREATE,		"CREATE" },		\
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case
  2020-05-20  6:40 [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case Ritesh Harjani
                   ` (4 preceding siblings ...)
  2020-05-20  6:40 ` [PATCHv5 5/5] ext4: mballoc: Use lock for checking free blocks while retrying Ritesh Harjani
@ 2020-05-29  2:40 ` Theodore Y. Ts'o
  5 siblings, 0 replies; 11+ messages in thread
From: Theodore Y. Ts'o @ 2020-05-29  2:40 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-ext4, linux-fsdevel, Jan Kara, Aneesh Kumar K . V, linux-kernel


Thanks, I've applied this patch series.

					- Ted


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
       [not found]   ` <CGME20200603064851eucas1p2e435089fbdf4de1d1fa3fb051c2f3d7b@eucas1p2.samsung.com>
@ 2020-06-03  6:48     ` Marek Szyprowski
  2020-06-03 10:10       ` Ritesh Harjani
  0 siblings, 1 reply; 11+ messages in thread
From: Marek Szyprowski @ 2020-06-03  6:48 UTC (permalink / raw)
  To: Ritesh Harjani, linux-ext4
  Cc: linux-fsdevel, Jan Kara, Theodore Ts'o, Aneesh Kumar K . V,
	linux-kernel

Hi Ritesh,

On 20.05.2020 08:40, Ritesh Harjani wrote:
> There could be a race in function ext4_mb_discard_group_preallocations()
> where the 1st thread may iterate through group's bb_prealloc_list and
> remove all the PAs and add to function's local list head.
> Now if the 2nd thread comes in to discard the group preallocations,
> it will see that the group->bb_prealloc_list is empty and will return 0.
>
> Consider a case where we have a small number of groups
> (e.g. just group 0),
> this may even return an -ENOSPC error from ext4_mb_new_blocks()
> (where we call for ext4_mb_discard_group_preallocations()).
> But that is wrong, since 2nd thread should have waited for 1st thread
> to release all the PAs and should have retried for allocation.
> Since 1st thread was anyway going to discard the PAs.
>
> The algorithm using this percpu seq counter goes below:
> 1. We sample the percpu discard_pa_seq counter before trying for block
>     allocation in ext4_mb_new_blocks().
> 2. We increment this percpu discard_pa_seq counter when we either allocate
>     or free these blocks i.e. while marking those blocks as used/free in
>     mb_mark_used()/mb_free_blocks().
> 3. We also increment this percpu seq counter when we successfully identify
>     that the bb_prealloc_list is not empty and hence proceed for discarding
>     of those PAs inside ext4_mb_discard_group_preallocations().
>
> Now to make sure that the regular fast path of block allocation is not
> affected, as a small optimization we only sample the percpu seq counter
> on that cpu. Only when the block allocation fails and when freed blocks
> found were 0, that is when we sample percpu seq counter for all cpus using
> below function ext4_get_discard_pa_seq_sum(). This happens after making
> sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.
>
> It can be well argued that why don't just check for grp->bb_free to
> see if there are any free blocks to be allocated. So here are the two
> concerns which were discussed:-
>
> 1. If for some reason the blocks available in the group are not
>     appropriate for allocation logic (say for e.g.
>     EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
>     the retry logic may result in infinite looping since grp->bb_free is
>     non-zero.
>
> 2. Also before preallocation was clubbed with block allocation with the
>     same ext4_lock_group() held, there were a lot of races where grp->bb_free
>     could not be reliably relied upon.
> Due to above, this patch considers discard_pa_seq logic to determine if
> we should retry for block allocation. Say if there are n threads
> trying for block allocation and none of those could allocate or discard
> any of the blocks, then all of those n threads will fail the block
> allocation and return -ENOSPC error. (Since the seq counter for all of
> those will match as no block allocation/discard was done during that
> duration).
>
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>

This patch landed in yesterday's linux-next and causes the following
WARNING/BUG on various Samsung Exynos-based boards:

  BUG: using smp_processor_id() in preemptible [00000000] code: logsave/552
  caller is ext4_mb_new_blocks+0x404/0x1300
  CPU: 3 PID: 552 Comm: logsave Tainted: G        W 5.7.0-next-20200602 #4
  Hardware name: Samsung Exynos (Flattened Device Tree)
  [<c011184c>] (unwind_backtrace) from [<c010d250>] (show_stack+0x10/0x14)
  [<c010d250>] (show_stack) from [<c05185fc>] (dump_stack+0xbc/0xe8)
  [<c05185fc>] (dump_stack) from [<c0ab689c>] (check_preemption_disabled+0xec/0xf0)
  [<c0ab689c>] (check_preemption_disabled) from [<c03a7b38>] (ext4_mb_new_blocks+0x404/0x1300)
  [<c03a7b38>] (ext4_mb_new_blocks) from [<c03780f4>] (ext4_ext_map_blocks+0xc7c/0x10f4)
  [<c03780f4>] (ext4_ext_map_blocks) from [<c03902b4>] (ext4_map_blocks+0x118/0x5a0)
  [<c03902b4>] (ext4_map_blocks) from [<c0394524>] (mpage_map_and_submit_extent+0x134/0x9c0)
  [<c0394524>] (mpage_map_and_submit_extent) from [<c03958c8>] (ext4_writepages+0xb18/0xcb0)
  [<c03958c8>] (ext4_writepages) from [<c02588ec>] (do_writepages+0x20/0x94)
  [<c02588ec>] (do_writepages) from [<c024c688>] (__filemap_fdatawrite_range+0xac/0xcc)
  [<c024c688>] (__filemap_fdatawrite_range) from [<c024c700>] (filemap_flush+0x28/0x30)
  [<c024c700>] (filemap_flush) from [<c037eedc>] (ext4_release_file+0x70/0xac)
  [<c037eedc>] (ext4_release_file) from [<c02c3310>] (__fput+0xc4/0x234)
  [<c02c3310>] (__fput) from [<c014eb74>] (task_work_run+0x88/0xcc)
  [<c014eb74>] (task_work_run) from [<c010ca40>] (do_work_pending+0x52c/0x5cc)
  [<c010ca40>] (do_work_pending) from [<c0100094>] (slow_work_pending+0xc/0x20)
  Exception stack(0xec9c1fb0 to 0xec9c1ff8)
  1fa0:                                     00000000 0044969c 0000006c 00000000
  1fc0: 00000001 0045a014 00000241 00000006 00000000 be91abb4 be91abb0 0000000c
  1fe0: 00459fd4 be91ab90 00448ed4 b6e43444 60000050 00000003

Please let me know how I can help debugging this issue. The above log is
from linux-next 20200602, compiled from exynos_defconfig, running on an
ARM 32-bit Samsung Exynos4412-based Odroid U3 board; however, I don't
think this is an Exynos-specific issue. I've probably observed it
because exynos_defconfig has most of the debugging options enabled.

 > ...

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
  2020-06-03  6:48     ` Marek Szyprowski
@ 2020-06-03 10:10       ` Ritesh Harjani
  2020-06-09 10:20         ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Ritesh Harjani @ 2020-06-03 10:10 UTC (permalink / raw)
  To: Marek Szyprowski, Ritesh Harjani, linux-ext4
  Cc: linux-fsdevel, Jan Kara, Theodore Ts'o, Aneesh Kumar K . V,
	linux-kernel

Hi Marek,

On 6/3/20 12:18 PM, Marek Szyprowski wrote:
> Hi Ritesh,
> 
> On 20.05.2020 08:40, Ritesh Harjani wrote:
>> There could be a race in function ext4_mb_discard_group_preallocations()
>> where the 1st thread may iterate through group's bb_prealloc_list and
>> remove all the PAs and add to function's local list head.
>> Now if the 2nd thread comes in to discard the group preallocations,
>> it will see that the group->bb_prealloc_list is empty and will return 0.
>>
>> Consider a case where we have a small number of groups
>> (e.g. just group 0),
>> this may even return an -ENOSPC error from ext4_mb_new_blocks()
>> (where we call for ext4_mb_discard_group_preallocations()).
>> But that is wrong, since 2nd thread should have waited for 1st thread
>> to release all the PAs and should have retried for allocation.
>> Since 1st thread was anyway going to discard the PAs.
>>
>> The algorithm using this percpu seq counter goes below:
>> 1. We sample the percpu discard_pa_seq counter before trying for block
>>      allocation in ext4_mb_new_blocks().
>> 2. We increment this percpu discard_pa_seq counter when we either allocate
>>      or free these blocks i.e. while marking those blocks as used/free in
>>      mb_mark_used()/mb_free_blocks().
>> 3. We also increment this percpu seq counter when we successfully identify
>>      that the bb_prealloc_list is not empty and hence proceed for discarding
>>      of those PAs inside ext4_mb_discard_group_preallocations().
>>
>> Now to make sure that the regular fast path of block allocation is not
>> affected, as a small optimization we only sample the percpu seq counter
>> on that cpu. Only when the block allocation fails and when freed blocks
>> found were 0, that is when we sample percpu seq counter for all cpus using
>> below function ext4_get_discard_pa_seq_sum(). This happens after making
>> sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.
>>
>> It can be well argued that why don't just check for grp->bb_free to
>> see if there are any free blocks to be allocated. So here are the two
>> concerns which were discussed:-
>>
>> 1. If for some reason the blocks available in the group are not
>>      appropriate for allocation logic (say for e.g.
>>      EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
>>      the retry logic may result in infinite looping since grp->bb_free is
>>      non-zero.
>>
>> 2. Also before preallocation was clubbed with block allocation with the
>>      same ext4_lock_group() held, there were a lot of races where grp->bb_free
>>      could not be reliably relied upon.
>> Due to above, this patch considers discard_pa_seq logic to determine if
>> we should retry for block allocation. Say if there are n threads
>> trying for block allocation and none of those could allocate or discard
>> any of the blocks, then all of those n threads will fail the block
>> allocation and return -ENOSPC error. (Since the seq counter for all of
>> those will match as no block allocation/discard was done during that
>> duration).
>>
>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> 
> This patch landed in yesterday's linux-next and causes following
> WARNING/BUG on various Samsung Exynos-based boards:
> 
>    BUG: using smp_processor_id() in preemptible [00000000] code: logsave/552
>    caller is ext4_mb_new_blocks+0x404/0x1300

Yes, this is being discussed in the community.
I have submitted a patch which should help fix this warning msg.
Feel free to give this a try on your setup.

https://marc.info/?l=linux-ext4&m=159110574414645&w=2


-ritesh


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
  2020-06-03 10:10       ` Ritesh Harjani
@ 2020-06-09 10:20         ` Borislav Petkov
  2020-06-09 10:37           ` Ritesh Harjani
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2020-06-09 10:20 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Marek Szyprowski, Ritesh Harjani, linux-ext4, linux-fsdevel,
	Jan Kara, Theodore Ts'o, Aneesh Kumar K . V, linux-kernel

On Wed, Jun 03, 2020 at 03:40:16PM +0530, Ritesh Harjani wrote:
> Yes, this is being discussed in the community.
> I have submitted a patch which should help fix this warning msg.
> Feel free to give this a try on your setup.
> 
> https://marc.info/?l=linux-ext4&m=159110574414645&w=2

I just triggered the same thing here too. Looking at your fix and not
even pretending to know what's going on with that percpu sequence
counting, I can't help but wonder why do you wanna do:

	seq = *raw_cpu_ptr(&discard_pa_seq);

instead of simply doing:

	seq = this_cpu_read(discard_pa_seq);

?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
  2020-06-09 10:20         ` Borislav Petkov
@ 2020-06-09 10:37           ` Ritesh Harjani
  0 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-06-09 10:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Marek Szyprowski, Ritesh Harjani, linux-ext4, linux-fsdevel,
	Jan Kara, Theodore Ts'o, Aneesh Kumar K . V, linux-kernel



On 6/9/20 3:50 PM, Borislav Petkov wrote:
> On Wed, Jun 03, 2020 at 03:40:16PM +0530, Ritesh Harjani wrote:
>> Yes, this is being discussed in the community.
>> I have submitted a patch which should help fix this warning msg.
>> Feel free to give this a try on your setup.
>>
>> https://marc.info/?l=linux-ext4&m=159110574414645&w=2
> 
> I just triggered the same thing here too. Looking at your fix and not
> even pretending to know what's going on with that percpu sequence
> counting, I can't help but wonder why do you wanna do:
> 
> 	seq = *raw_cpu_ptr(&discard_pa_seq);
> 
> instead of simply doing:
> 
> 	seq = this_cpu_read(discard_pa_seq);
> 

That's correct. Thanks for pointing that out.
I guess in my development version of the code I had seq as a u64
pointer variable, which I later changed to a plain u64 variable, but I
continued using the pcpu ptr APIs for it.

Let me re-submit that patch with your Suggested-by tag and corresponding
changes.

-riteshh

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-06-09 10:37 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-20  6:40 [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case Ritesh Harjani
2020-05-20  6:40 ` [PATCHv5 1/5] ext4: mballoc: Add blocks to PA list under same spinlock after allocating blocks Ritesh Harjani
2020-05-20  6:40 ` [PATCHv5 2/5] ext4: mballoc: Refactor ext4_mb_discard_preallocations() Ritesh Harjani
2020-05-20  6:40 ` [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling Ritesh Harjani
     [not found]   ` <CGME20200603064851eucas1p2e435089fbdf4de1d1fa3fb051c2f3d7b@eucas1p2.samsung.com>
2020-06-03  6:48     ` Marek Szyprowski
2020-06-03 10:10       ` Ritesh Harjani
2020-06-09 10:20         ` Borislav Petkov
2020-06-09 10:37           ` Ritesh Harjani
2020-05-20  6:40 ` [PATCHv5 4/5] ext4: mballoc: Refactor ext4_mb_good_group() Ritesh Harjani
2020-05-20  6:40 ` [PATCHv5 5/5] ext4: mballoc: Use lock for checking free blocks while retrying Ritesh Harjani
2020-05-29  2:40 ` [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case Theodore Y. Ts'o
