All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations
@ 2020-04-08 16:54 Ritesh Harjani
  2020-04-08 16:54 ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Ritesh Harjani
  2020-04-08 17:06 ` [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
  0 siblings, 2 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-08 16:54 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, linux-fsdevel, Aneesh Kumar K . V,
	sandeen, Ritesh Harjani

Hello All, 

There seems to be a race in the current
ext4_mb_group_discard_preallocations() function. It can be fairly
easily reproduced on a system with only 1 group, but internally it was
reported even with more than 1 group. We could reproduce it on both 64K and
4K blocksize filesystems.

This is an RFC patch sent out for review, feedback and/or any comments.
The mail below provides all the necessary details.

Test setup
==========
1. It's a multithreaded test case where each thread tries to create a
file using open() -> ftruncate().
2. Then we mmap this file for filesize bytes.
3. Then we start writing sequentially, byte by byte, for the full filesize.

Details for creating it on a loop device:-
======================================
1. truncate -s 240M filefs   (easier on a smaller filesystem)
2. losetup /dev/loop0 filefs 
3. mkfs.ext4 -F -b 65536 /dev/loop0
4. mkdir mnt
5. mount -t ext4 /dev/loop0 mnt/
6. cd mnt/
7. Start running the test case mentioned in "Test setup" above.
(For our test we kept the number of threads at ~70 and filled the available
filesystem space (df -h) to around ~70-80%. Each filesize is calculated
based on that.)
8. cd .. (once the test finishes)
9. umount mnt
10. Go to step 3.

The test file (test-ext4-mballoc.c) and the script which does the
unmount/mount and runs ./a.out are available at [1] and [2].


Analysis:-
==========

It seems the below race could be occurring:
	P1 							P2
ext4_mb_new_blocks() 						|
	|						ext4_mb_new_blocks()
	|
ext4_mb_group_discard_preallocations() 				|
		| 				ext4_mb_group_discard_preallocations()
	if (list_empty(&grp->bb_prealloc_list))
		return 0; 					|
		| 						|
	ext4_lock_group() 					|
	list_for_each_entry_safe() {  				|
		<..>  						|
		list_del(&pa->pa_group_list);  			|
		list_add(&pa->u.pa_tmp_list, &list) 		|
	} 							|
								|
	processing-local-list() 		if (list_empty(&grp->bb_prealloc_list))
	 	|					return 0
	<...>
	ext4_unlock_group()

What we see here is that there are multiple threads trying to allocate.
But since there is not enough space, they try to discard the group
preallocations. (This is easier to understand if we only consider group 0,
though it could be reproduced with multiple groups as well.)
Now, while more than 1 thread tries to free up the group preallocations,
there could be 1 thread (P2) which sees that the bb_prealloc_list is
already empty and assumes that there is nothing to free from here, hence
returning 0.
Now consider this happening to thread P2 for all the other groups as well
(where some other thread came in early and freed up the group
preallocations). At that point, P2 sees that the total freed blocks
returned by ext4_mb_discard_preallocations() back to ext4_mb_new_blocks()
is 0, and hence it does not retry the allocation; instead it returns an
-ENOSPC error.

This causes a SIGBUS to the application which was doing the mmap write. Once
the application crashes, we can still see that the filesystem's available
space is more than ~70-80% (worst case scenario). So ideally P2 should have
waited for P1 to finish and then checked how much P1 could free up.


Solution (based on the understanding of the mballoc code)
=========================================================

We think that it is best to check whether there is anything to be freed
while holding ext4_lock_group(), i.e. to check whether the
bb_prealloc_list is empty under the lock. This patch attempts to fix the
race by checking whether nothing could be collected in the local list.
That could mean that someone else might have freed all of this group's
PAs for us. So simply return grp->bb_free, which should also give us an
upper bound on the total available space for allocation in this group.

We need not worry about the fast path of checking whether the list is
empty without taking ext4_lock_group(), since we are anyway in the slow
path where ext4_mb_regular_allocator() failed and hence we are now
desperately trying to discard all the group PAs to free up some space for
allocation.


Please correct me if any of the below understanding is incorrect:-
==========================================================
1. grp->bb_free is the number of free blocks in that group available for
   allocation by anyone.
2. If grp->bb_free is non-zero and we call ext4_mb_regular_allocator(ac),
then it will always return ac->ac_status == AC_STATUS_FOUND
(and it could even allocate and return less than the requested number of
blocks).
3. There shouldn't be any infinite loop in ext4_mb_new_blocks() after we
return grp->bb_free in this patch
(i.e. between ext4_mb_regular_allocator() and
ext4_mb_discard_preallocations()).
It could only happen if ext4_mb_regular_allocator() cannot make any
forward progress even though grp->bb_free is non-zero.
But IIUC, that won't happen. Please correct me here if wrong.

Tests run:-
==========
For now I have only done unit testing with the shared test code [1] [2].
I wanted to post this RFC early for review comments/discussion.

Resources:
==========
[1] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/tools/test-ext4-mballoc.c
[2] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/tools/test_mballoc.sh

Ritesh Harjani (1):
  ext4: Fix race in ext4_mb_discard_group_preallocations()

 fs/ext4/mballoc.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-08 16:54 [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
@ 2020-04-08 16:54 ` Ritesh Harjani
  2020-04-09 13:37   ` Jan Kara
  2020-04-08 17:06 ` [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
  1 sibling, 1 reply; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-08 16:54 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, linux-fsdevel, Aneesh Kumar K . V,
	sandeen, Ritesh Harjani

There could be a race in ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through the group's bb_prealloc_list,
remove all the PAs and add them to the function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that group->bb_prealloc_list is empty and will return 0.

Consider a case where we have a small number of groups (e.g. just group 0):
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(which calls ext4_mb_discard_group_preallocations()).
But that is wrong, since the 2nd thread should have waited for the 1st
thread to release all the PAs and then retried the allocation, since the
1st thread was anyway going to discard the PAs.

This patch fixes the race by checking whether nothing could be collected
in the local list. That could mean that someone else might have freed all
of this group's PAs for us. So simply return grp->bb_free, which should
also give us an upper bound on the total available space for allocation
in this group.

We need not check the fast path of whether the list (bb_prealloc_list)
is empty, since we are anyway in the slow path where
ext4_mb_regular_allocator() has failed and hence we are now desperately
trying to discard all the group PAs to free up some space for allocation.

Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1027e01c9011..a6c92567ec14 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3885,7 +3885,18 @@ ext4_mb_release_group_pa(struct ext4_buddy *e4b,
 }
 
 /*
- * releases all preallocations in given group
+ * This function discards all the preallocations for the given group.
+ * Here there could be a race with another process also trying to discard
+ * the PAs. So if we could not get any PA from grp->bb_prealloc_list into
+ * our local list, then we simply return grp->bb_free. This will now be an
+ * upper bound on the available space for that group. If we chose to return
+ * 0 instead, then we might end up returning an -ENOSPC error in the worst
+ * case scenario, which is certainly wrong.
+ *
+ * The return value can be either of the 2 below:-
+ *  - actual discarded PA blocks from grp->bb_prealloc_list.
+ *  - grp->bb_free value (this will be 0 only when there are actually no
+ *    free blocks available to be allocated from this group anymore).
  *
  * first, we need to decide discard policy:
  * - when do we discard
@@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 
 	mb_debug(1, "discard preallocation for group %u\n", group);
 
-	if (list_empty(&grp->bb_prealloc_list))
-		return 0;
-
 	bitmap_bh = ext4_read_block_bitmap(sb, group);
 	if (IS_ERR(bitmap_bh)) {
 		err = PTR_ERR(bitmap_bh);
 		ext4_set_errno(sb, -err);
 		ext4_error(sb, "Error %d reading block bitmap for %u",
 			   err, group);
-		return 0;
+		goto out_dbg;
 	}
 
 	err = ext4_mb_load_buddy(sb, group, &e4b);
@@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		ext4_warning(sb, "Error %d loading buddy information for %u",
 			     err, group);
 		put_bh(bitmap_bh);
-		return 0;
+		goto out_dbg;
 	}
 
 	if (needed == 0)
@@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		goto repeat;
 	}
 
-	/* found anything to free? */
+	/*
+	 * If this list is empty, then return the grp->bb_free. As someone
+	 * else may have freed the PAs and updated grp->bb_free.
+	 */
 	if (list_empty(&list)) {
 		BUG_ON(free != 0);
+		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
+			 group, grp->bb_free);
+		free = grp->bb_free;
 		goto out;
 	}
 
@@ -3994,6 +4008,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	ext4_unlock_group(sb, group);
 	ext4_mb_unload_buddy(&e4b);
 	put_bh(bitmap_bh);
+out_dbg:
+	mb_debug(1, "discarded (%d) blocks preallocated for group %u\n",
+		 free, group);
 	return free;
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations
  2020-04-08 16:54 [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
  2020-04-08 16:54 ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Ritesh Harjani
@ 2020-04-08 17:06 ` Ritesh Harjani
  1 sibling, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-08 17:06 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, linux-fsdevel, Aneesh Kumar K . V, sandeen


Sorry about the typo.
ext4_mb_group_discard_preallocations() should be 
ext4_mb_discard_group_preallocations()

-ritesh

On 4/8/20 10:24 PM, Ritesh Harjani wrote:
> [...]


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-08 16:54 ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Ritesh Harjani
@ 2020-04-09 13:37   ` Jan Kara
  2020-04-09 18:45     ` Ritesh Harjani
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2020-04-09 13:37 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-ext4, tytso, jack, adilger.kernel, linux-fsdevel,
	Aneesh Kumar K . V, sandeen

Hello Ritesh!

On Wed 08-04-20 22:24:10, Ritesh Harjani wrote:
> @@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  
>  	mb_debug(1, "discard preallocation for group %u\n", group);
>  
> -	if (list_empty(&grp->bb_prealloc_list))
> -		return 0;
> -

OK, so ext4_mb_discard_preallocations() is now going to lock every group
when we try to discard preallocations. That's likely going to increase lock
contention on the group locks if we are running out of free blocks while
there are multiple processes trying to allocate blocks. I guess we don't
care about the performance of this case too deeply, but I'm not sure if the
cost won't be too big - probably we should check how much the CPU usage
grows with multiple allocating processes trying to find free blocks...

>  	bitmap_bh = ext4_read_block_bitmap(sb, group);
>  	if (IS_ERR(bitmap_bh)) {
>  		err = PTR_ERR(bitmap_bh);
>  		ext4_set_errno(sb, -err);
>  		ext4_error(sb, "Error %d reading block bitmap for %u",
>  			   err, group);
> -		return 0;
> +		goto out_dbg;
>  	}
>  
>  	err = ext4_mb_load_buddy(sb, group, &e4b);
> @@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  		ext4_warning(sb, "Error %d loading buddy information for %u",
>  			     err, group);
>  		put_bh(bitmap_bh);
> -		return 0;
> +		goto out_dbg;
>  	}
>  
>  	if (needed == 0)
> @@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  		goto repeat;
>  	}
>  
> -	/* found anything to free? */
> +	/*
> +	 * If this list is empty, then return the grp->bb_free. As someone
> +	 * else may have freed the PAs and updated grp->bb_free.
> +	 */
>  	if (list_empty(&list)) {
>  		BUG_ON(free != 0);
> +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
> +			 group, grp->bb_free);
> +		free = grp->bb_free;
>  		goto out;
>  	}

OK, but this still doesn't reliably fix the problem, does it? Because
bb_free can still be zero while another process just has some extents to free
in its local 'list' (e.g. because it has decided it doesn't have enough
extents, some were busy and it decided to cond_resched()), so bb_free will
increase from 0 only once these extents are freed.

Honestly, I don't understand why ext4_mb_discard_group_preallocations()
bothers with the local 'list'. Why doesn't it simply free the preallocation
right away? And that would also basically fix your problem (well, it would
still theoretically exist because there's still freeing of one extent
potentially pending but I'm not sure if that will still be a practical
issue).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-09 13:37   ` Jan Kara
@ 2020-04-09 18:45     ` Ritesh Harjani
  2020-04-10 15:52       ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-09 18:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	Aneesh Kumar K . V, sandeen

Hello Jan,

Thanks for looking into this.

On 4/9/20 7:07 PM, Jan Kara wrote:
> Hello Ritesh!
> 
> On Wed 08-04-20 22:24:10, Ritesh Harjani wrote:
>> @@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>   
>>   	mb_debug(1, "discard preallocation for group %u\n", group);
>>   
>> -	if (list_empty(&grp->bb_prealloc_list))
>> -		return 0;
>> -
> 
> OK, so ext4_mb_discard_preallocations() is now going to lock every group
> when we try to discard preallocations. That's likely going to increase lock
> contention on the group locks if we are running out of free blocks when
> there are multiple processes trying to allocate blocks. I guess we don't
> care about the performace of this case too deeply but I'm not sure if the
> cost won't be too big - probably we should check how much the CPU usage
> with multiple allocating process trying to find free blocks grows...

Sure, let me check the CPU usage in my test case with this patch.
But either way, unless we take the lock, we are not able to confirm
the number of free blocks available in the filesystem, right?

This will mostly happen only when there are a lot of threads and, due to
all of their preallocations, the filesystem is running low on space and
hence trying to discard all the preallocations. => So when the FS is
running low on space, isn't this CPU usage justifiable (in an attempt to
make sure we don't fail with ENOSPC)?
Maybe not, since this is only due to the spinlock case, is it?

Or are you suggesting we should use some other method for discarding
all the group's PAs, so that other threads could sleep while the discard
is happening? Something like a discard work item which should free up
all of the group's PAs. But we would need a way to determine whether the
needed number of blocks were freed, so that we wake up and retry the
allocation.

(Darrick did mention something along these lines related to a
work/workqueue, but we couldn't discuss it much at the time.)


> 
>>   	bitmap_bh = ext4_read_block_bitmap(sb, group);
>>   	if (IS_ERR(bitmap_bh)) {
>>   		err = PTR_ERR(bitmap_bh);
>>   		ext4_set_errno(sb, -err);
>>   		ext4_error(sb, "Error %d reading block bitmap for %u",
>>   			   err, group);
>> -		return 0;
>> +		goto out_dbg;
>>   	}
>>   
>>   	err = ext4_mb_load_buddy(sb, group, &e4b);
>> @@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>   		ext4_warning(sb, "Error %d loading buddy information for %u",
>>   			     err, group);
>>   		put_bh(bitmap_bh);
>> -		return 0;
>> +		goto out_dbg;
>>   	}
>>   
>>   	if (needed == 0)
>> @@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>   		goto repeat;
>>   	}
>>   
>> -	/* found anything to free? */
>> +	/*
>> +	 * If this list is empty, then return the grp->bb_free. As someone
>> +	 * else may have freed the PAs and updated grp->bb_free.
>> +	 */
>>   	if (list_empty(&list)) {
>>   		BUG_ON(free != 0);
>> +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
>> +			 group, grp->bb_free);
>> +		free = grp->bb_free;
>>   		goto out;
>>   	}
> 
> OK, but this still doesn't reliably fix the problem, does it? Because
> bb_free can be still zero and another process just has some extents to free
> in its local 'list' (e.g. because it has decided it doesn't have enough
> extents, some were busy and it decided to cond_resched()), so bb_free will
> increase from 0 only once these extents are freed.

This patch should reliably fix it, I think.
So even if, say, process P1 didn't free all extents because some of the
PAs were busy and it decided to cond_resched(), that still means that the
list (bb_prealloc_list) is not empty, and whoever gets the
ext4_lock_group() next will either
get the busy PAs or be blocked on this lock_group() until all of
the PAs are freed by those processes.
So if you see, we may never actually return 0 unless there are no PAs
and grp->bb_free is truly 0.

But your case does show that grp->bb_free may not be the upper bound
of free blocks for this group. It could be just 1 PA's free blocks,
while other PAs are still on some other process's local list (due to
cond_resched()).


> 
> Honestly, I don't understand why ext4_mb_discard_group_preallocations()
> bothers with the local 'list'. Why doesn't it simply free the preallocation

Let's see if someone else knows about this. I am not really sure
why it was done this way.


> right away? And that would also basically fix your problem (well, it would
> still theoretically exist because there's still freeing of one extent
> potentially pending but I'm not sure if that will still be a practical
> issue).

I guess this can still be a problem. Let's say the process P1
just checks that the list is not empty, and then in parallel process P2
deletes the last entry - then when process P1 iterates over the
list, it will find it empty and return 0, which may result in an -ENOSPC
failure, unless we again take the group lock to check whether the list is
really empty and return grp->bb_free if it is.


-ritesh


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-09 18:45     ` Ritesh Harjani
@ 2020-04-10 15:52       ` Jan Kara
  2020-04-15 17:20         ` Ritesh Harjani
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2020-04-10 15:52 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	Aneesh Kumar K . V, sandeen


Hello Ritesh!

On Fri 10-04-20 00:15:44, Ritesh Harjani wrote:
> On 4/9/20 7:07 PM, Jan Kara wrote:
> > Hello Ritesh!
> > 
> > On Wed 08-04-20 22:24:10, Ritesh Harjani wrote:
> > > @@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> > >   	mb_debug(1, "discard preallocation for group %u\n", group);
> > > -	if (list_empty(&grp->bb_prealloc_list))
> > > -		return 0;
> > > -
> > 
> > OK, so ext4_mb_discard_preallocations() is now going to lock every group
> > when we try to discard preallocations. That's likely going to increase lock
> > contention on the group locks if we are running out of free blocks when
> > there are multiple processes trying to allocate blocks. I guess we don't
> > care about the performace of this case too deeply but I'm not sure if the
> > cost won't be too big - probably we should check how much the CPU usage
> > with multiple allocating process trying to find free blocks grows...
> 
> Sure let me check the cpu usage in my test case with this patch.
> But either ways unless we take the lock we are not able to confirm
> that what are no. of free blocks available in the filesystem, right?
> 
> This mostly will happen only when there are lot of threads and due to
> all of their preallocations filesystem is running into low space and
> hence
> trying to discard all the preallocations. => so when FS is going low on
> space, isn't this cpu usage justifiable? (in an attempt to make sure we
> don't fail with ENOSPC)?
> Maybe not since this is only due to spinlock case, is it?

As I wrote, I'm not too much concerned about *some* increase in CPU usage.
But I'd like to get that quantified because if we can say softlockup the
machine in the extreme case (or burn 100% of several CPUs spinning on the
lock), then we need a better mechanism to handle the preallocation
discarding and waiting...

> Or are you suggesting we should use some other method for discarding
> all the group's PA. So that other threads could sleep while discard is
> happening. Something like a discard work item which should free up
> all of the group's PA. But we need a way to determine if the needed
> no of blocks were freed so that we wake up and retry the allocation.
> 
> (Darrick did mentioned something on this line related to work/workqueue,
> but couldn't discuss much that time).
> 
> 
> > 
> > >   	bitmap_bh = ext4_read_block_bitmap(sb, group);
> > >   	if (IS_ERR(bitmap_bh)) {
> > >   		err = PTR_ERR(bitmap_bh);
> > >   		ext4_set_errno(sb, -err);
> > >   		ext4_error(sb, "Error %d reading block bitmap for %u",
> > >   			   err, group);
> > > -		return 0;
> > > +		goto out_dbg;
> > >   	}
> > >   	err = ext4_mb_load_buddy(sb, group, &e4b);
> > > @@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> > >   		ext4_warning(sb, "Error %d loading buddy information for %u",
> > >   			     err, group);
> > >   		put_bh(bitmap_bh);
> > > -		return 0;
> > > +		goto out_dbg;
> > >   	}
> > >   	if (needed == 0)
> > > @@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> > >   		goto repeat;
> > >   	}
> > > -	/* found anything to free? */
> > > +	/*
> > > +	 * If this list is empty, then return the grp->bb_free. As someone
> > > +	 * else may have freed the PAs and updated grp->bb_free.
> > > +	 */
> > >   	if (list_empty(&list)) {
> > >   		BUG_ON(free != 0);
> > > +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
> > > +			 group, grp->bb_free);
> > > +		free = grp->bb_free;
> > >   		goto out;
> > >   	}
> > 
> > OK, but this still doesn't reliably fix the problem, does it? Because
> > bb_free can be still zero and another process just has some extents to free
> > in its local 'list' (e.g. because it has decided it doesn't have enough
> > extents, some were busy and it decided to cond_resched()), so bb_free will
> > increase from 0 only once these extents are freed.
> 
> This patch should reliably fix it, I think.
> So even if say Process P1 didn't free all extents, since some of the
> PAs were busy it decided to cond_resched(), that still means that the
> list(bb_prealloc_list) is not empty and whoever will get the
> ext4_lock_group() next will either
> get the busy PAs or it will be blocked on this lock_group() until all of
> the PAs were freed by processes.
> So if you see we may never actually return 0, unless, there are no PAs and
> grp->bb_free is truly 0.

Right, I missed that. Thanks for correction.

> But your case does shows that grp->bb_free may not be the upper bound
> of free blocks for this group. It could be just 1 PA's free blocks, while
> other PAs are still in some other process's local list (due to
> cond_resched())

Yes, but note that the return value of ext4_mb_discard_preallocations() is
effectively treated as bool in the end - i.e., did I free some blocks? If
yes, retry, if not -> ENOSPC.

Looking more into the code, I'm also somewhat concerned whether your
changes cannot lead to excessive looping in ext4_mb_new_blocks(). Because
the allocation request can be restricting which blocks are eligible for
allocation, and thus even if there are some free blocks in the group, it
may not be possible to satisfy the request. Currently, the looping is
limited by the fact that you have to discard some preallocation to loop
again. With your bb_free check, this protection is removed and you could
in principle loop indefinitely AFAICS.

> > Honestly, I don't understand why ext4_mb_discard_group_preallocations()
> > bothers with the local 'list'. Why doesn't it simply free the preallocation
> 
> Let's see if someone else know about this. I am not really sure
> why it was done this way.
> 
> 
> > right away? And that would also basically fix your problem (well, it would
> > still theoretically exist because there's still freeing of one extent
> > potentially pending but I'm not sure if that will still be a practical
> > issue).
> 
> I guess this still can be a problem. So let's say if the process P1
> just checks that the list was not empty and then in parallel process P2
> just deletes the last entry - then when process P1 iterates over the list,
> it will find it empty and return 0, which may result in an -ENOSPC failure,
> unless we again take the group lock to check if the list is really empty
> and return grp->bb_free if it is.

I see. You're correct this could be an issue. Good spotting! So we really
need to check if some preallocation was discarded (either by us or someone
else) since we last tried allocation. If yes, loop again, if no, return
ENOSPC. Do you agree? This could be implemented quite efficiently by
"preallocation discard" sequence counter. We'd sample it before trying
preallocation, then again after returning from
ext4_mb_discard_preallocations() and if it differs, we'll loop. What do you
think?
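
A rough userspace C model of this scheme (every name here, sb_discard_seq,
discard_one_pa(), allocate_with_retry(), is invented for illustration; in the
kernel the counter would live in the ext4 superblock and be incremented from
the PA release paths):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical "preallocation discard" counter, bumped on every PA discard. */
static atomic_ulong sb_discard_seq;

/* Stand-in for the PA release paths that would bump the counter. */
static void discard_one_pa(void)
{
	atomic_fetch_add(&sb_discard_seq, 1);
}

/* Stand-in for the actual block allocation attempt; always fails here. */
static bool try_allocation(void)
{
	return false;
}

/*
 * Retry logic as it could look in the allocation path: sample the counter
 * before the allocation attempt; after the discard pass, loop again only if
 * the counter moved, i.e. someone (we or another process) discarded at least
 * one PA since we last tried. Otherwise return ENOSPC.
 */
static int allocate_with_retry(int max_loops)
{
	int loops = 0;

	while (loops < max_loops) {
		unsigned long seq = atomic_load(&sb_discard_seq);

		if (try_allocation())
			return 0;
		discard_one_pa();	/* model: our discard pass freed one PA */
		loops++;
		if (atomic_load(&sb_discard_seq) == seq)
			return -1;	/* nothing was discarded anywhere: ENOSPC */
	}
	return -1;
}
```

The max_loops cap exists only to keep the model finite; the point is that the
retry loop is bounded by actual discards, not by a possibly stale free count.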

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-10 15:52       ` Jan Kara
@ 2020-04-15 17:20         ` Ritesh Harjani
  2020-04-15 17:23           ` [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations Ritesh Harjani
  2020-04-16 10:27           ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Jan Kara
  0 siblings, 2 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-15 17:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	aneesh.kumar, sandeen

Hello Jan, 

Sorry if this mail is not properly formatted. Somehow my mail server lost a few
of the emails and I didn't receive your email. I will have to start exploring
some other, more reliable method for at least receiving all the emails.
But anyway, I copied this from patchwork so that we could continue our
discussion.

>Hello Ritesh!
>
>On Fri 10-04-20 00:15:44, Ritesh Harjani wrote:
>> On 4/9/20 7:07 PM, Jan Kara wrote:
>> > Hello Ritesh!
>> > 
>> > On Wed 08-04-20 22:24:10, Ritesh Harjani wrote:
>> > > @@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>> > >   	mb_debug(1, "discard preallocation for group %u\n", group);
>> > > -	if (list_empty(&grp->bb_prealloc_list))
>> > > -		return 0;
>> > > -
>> > 
>> > OK, so ext4_mb_discard_preallocations() is now going to lock every group
>> > when we try to discard preallocations. That's likely going to increase lock
>> > contention on the group locks if we are running out of free blocks when
>> > there are multiple processes trying to allocate blocks. I guess we don't
>> > care about the performace of this case too deeply but I'm not sure if the
>> > cost won't be too big - probably we should check how much the CPU usage
>> > with multiple allocating process trying to find free blocks grows...
>> 
>> Sure, let me check the CPU usage in my test case with this patch.
>> But either way, unless we take the lock we cannot confirm the
>> number of free blocks available in the filesystem, right?
>> 
>> This will mostly happen only when there are a lot of threads and, due to
>> all of their preallocations, the filesystem is running low on space and
>> hence trying to discard all the preallocations. So when the FS is running
>> low on space, isn't this CPU usage justifiable (in an attempt to make sure
>> we don't fail with ENOSPC)?
>> Maybe not, since this is only due to the spinlock case, is it?
>
>As I wrote, I'm not too much concerned about *some* increase in CPU usage.
>But I'd like to get that quantified because if we can say softlockup the
>machine in the extreme case (or burn 100% of several CPUs spinning on the
>lock), then we need a better mechanism to handle the preallocation
>discarding and waiting...

So I gave this some thought and looked at our code again. To me, one other
reason why we could be deleting the PAs from grp->bb_prealloc_list and adding
them to our local list is this: we should be able to make the
grp->bb_prealloc_list empty as quickly as possible, so that if another process
tries to discard the same group's PA list, it finds the list empty and can
proceed to discard some other group's PAs. This way, instead of spending CPU
cycles spinning on the lock waiting for the discard to complete, we could just
as well try to discard some other group's PA list.
Not sure if there is some other reason for doing it like this.

>
>
>> Or are you suggesting we should use some other method for discarding
>> all the group's PA. So that other threads could sleep while discard is
>> happening. Something like a discard work item which should free up
>> all of the group's PA. But we need a way to determine if the needed
>> no of blocks were freed so that we wake up and retry the allocation.
>> 
>> (Darrick did mentioned something on this line related to work/workqueue,
>> but couldn't discuss much that time).
>> 
>> 
>> > 
>> > >   	bitmap_bh = ext4_read_block_bitmap(sb, group);
>> > >   	if (IS_ERR(bitmap_bh)) {
>> > >   		err = PTR_ERR(bitmap_bh);
>> > >   		ext4_set_errno(sb, -err);
>> > >   		ext4_error(sb, "Error %d reading block bitmap for %u",
>> > >   			   err, group);
>> > > -		return 0;
>> > > +		goto out_dbg;
>> > >   	}
>> > >   	err = ext4_mb_load_buddy(sb, group, &e4b);
>> > > @@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>> > >   		ext4_warning(sb, "Error %d loading buddy information for %u",
>> > >   			     err, group);
>> > >   		put_bh(bitmap_bh);
>> > > -		return 0;
>> > > +		goto out_dbg;
>> > >   	}
>> > >   	if (needed == 0)
>> > > @@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>> > >   		goto repeat;
>> > >   	}
>> > > -	/* found anything to free? */
>> > > +	/*
>> > > +	 * If this list is empty, then return the grp->bb_free. As someone
>> > > +	 * else may have freed the PAs and updated grp->bb_free.
>> > > +	 */
>> > >   	if (list_empty(&list)) {
>> > >   		BUG_ON(free != 0);
>> > > +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
>> > > +			 group, grp->bb_free);
>> > > +		free = grp->bb_free;
>> > >   		goto out;
>> > >   	}
>> > 
>> > OK, but this still doesn't reliably fix the problem, does it? Because
>> > bb_free can be still zero and another process just has some extents
>> > to free
>> > in its local 'list' (e.g. because it has decided it doesn't have enough
>> > extents, some were busy and it decided to cond_resched()), so bb_free will
>> > increase from 0 only once these extents are freed.
>> 
>> This patch should reliably fix it, I think.
>> So even if say Process P1 didn't free all extents, since some of the
>> PAs were busy it decided to cond_resched(), that still means that the
>> list(bb_prealloc_list) is not empty and whoever will get the
>> ext4_lock_group() next will either
>> get the busy PAs or it will be blocked on this lock_group() until all of
>> the PAs were freed by processes.
>> So if you see, we may never actually return 0 unless there are no PAs and
>> grp->bb_free is truly 0.
>
>Right, I missed that. Thanks for correction.
>
>> But your case does show that grp->bb_free may not be the upper bound
>> of free blocks for this group. It could be just 1 PA's free blocks, while
>> other PAs are still in some other process's local list (due to
>> cond_resched())
>
>Yes, but note that the return value of ext4_mb_discard_preallocations() is
>effectively treated as bool in the end - i.e., did I free some blocks? If
>yes, retry, if not -> ENOSPC.

Yes, that's true.

>Looking more into the code, I'm also somewhat concerned, whether your
>changes cannot lead to excessive looping in ext4_mb_new_blocks(). Because
>allocation request can restrict which blocks are eligible for
>allocation. And thus even if there are some free blocks in the group, it
>may not be possible to satisfy the request. Currently, the looping is

I am not very sure that the above is true. I do see that we have ac_criteria,
and if it is 3, then even if a group is found with free space less than the
original allocation request length, the allocation can still proceed.
I see that we iterate until the criteria value is <= 3 in
ext4_mb_regular_allocator(). Also note that in ext4_mb_measure_extent()
we record any free block extent into ac->ac_b_ex, unless on doing
more scanning we are able to find a better one.

But considering there are a lot of logic conditions in here, maybe
I have missed your case. Could you please help explain the case where,
even though there are some free blocks in the group (i.e. grp->bb_free is
non-zero), it may still fail to allocate anything?

>limited by the fact that you have to discard some preallocation to loop
>again. With your bb_free check, this protection is removed and you could
>loop in principle indefinitely AFAICS.

So even with a positive grp->bb_free value, if ext4_mb_regular_allocator()
cannot make forward progress of any kind, then yes, this indefinite looping
is possible. But I am unable to find a case where this is true.

>
>> > Honestly, I don't understand why ext4_mb_discard_group_preallocations()
>> > bothers with the local 'list'. Why doesn't it simply free the preallocation
>> 
>> Let's see if someone else know about this. I am not really sure
>> why it was done this way.

One reason could be to allow freeing other groups' PA lists in parallel
(as explained above). Not sure whether there was any other reason apart from
this though.

>> 
>> > right away? And that would also basically fix your problem (well, it would
>> > still theoretically exist because there's still freeing of one extent
>> > potentially pending but I'm not sure if that will still be a practical
>> > issue).
>> 
>> I guess this still can be a problem. So let's say if the process P1
>> just checks that the list was not empty and then in parallel process P2
>> just deletes the last entry - then when process P1 iterates over the list,
>> it will find it empty and return 0, which may return -ENOSPC failure.
>> unless we again take the group lock to check if the list is really free
>> and return grp->bb_free if it is.
>
>I see. You're correct this could be an issue. Good spotting! So we really
>need to check if some preallocation was discarded (either by us or someone
>else) since we last tried allocation. If yes, loop again, if no, return
>ENOSPC. Do you agree? This could be implemented quite efficiently by
>"preallocation discard" sequence counter. We'd sample it before trying
>preallocation, then again after returning from
>ext4_mb_discard_preallocations() and if it differs, we'll loop. What do you
>think?

IIUC, a write seqcounter needs locking, since write paths cannot be nested.
So the caller needs to implement its own locking for the seqcounter. What could
then happen is that, even if we update write_seqcount_begin/end() inside
ext4_lock/unlock_group(), some other thread working on a different group can
proceed under its own group lock and call write_seqcount_begin(), which will
make the writes nested (since we are keeping the seqcount at the ext4
super_block level), unless we decide to keep a seqcount at the group level as
well. But that would also mean that we need to sample the read_seqcount for
all groups at the beginning?

Hence the slight trouble understanding a way around this issue.
Does this make sense, or am I lost here?

Now, while we are at this, I implemented another approach with a fastpath and
a slowpath in ext4_mb_discard_group_preallocations().
But this still relies upon the grp->bb_free counter. This also means that
we need an answer to our above concern, where we assume that if grp->bb_free
is non-zero then forward progress is possible.

While reading more code, I see that updating grp->bb_free and
grp->bb_prealloc_list is not atomic. This happens in two different functions,
and in between the lock is released. So relying completely on grp->bb_free
when the PA list is empty is just an approximation in our algorithm.
It may happen that some processes come in parallel and allocate all
grp->bb_free blocks while the PAs for those allocations are not yet created.
If that happens and we return grp->bb_free, which is 0, this could again
be a corner case of ENOSPC failure, because there was still some space
left which was about to be added to the group's PA list.
But this would also mean that we are anyway reaching our limit
of ENOSPC, so it should be ok if we get an ENOSPC failure then, right?

Meanwhile, since I anyway have that patch with me, I will reply to this
email with the new patch. That patch should help address your concern about
spinning unnecessarily on the group lock. The old approach also had the other
problem explained above: it could have made all discards proceed only
sequentially. With the new fastpath/slowpath approach, it is still possible to
discard in parallel (if, by the time we check list_empty(), some other process
has made this group's bb_prealloc_list empty and added all the PAs to its
local list).
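
The fastpath/slowpath retry described above can be sketched in plain userspace
C as follows (group_pa_blocks[] and the function names are invented stand-ins
for the real group PA lists and locking; the actual kernel patch follows in
the next mail):

```c
#include <assert.h>
#include <stdbool.h>

#define NGROUPS 4

/* Blocks held in each group's PAs; group 0 and 2 look empty. */
static int group_pa_blocks[NGROUPS] = {0, 5, 0, 3};

static int discard_group(int g, bool fastpath)
{
	/* Fastpath: if the list looks empty (maybe another process is
	 * draining it into its local list), skip it and stay parallel. */
	if (fastpath && group_pa_blocks[g] == 0)
		return 0;
	/* Slowpath (or non-empty fastpath): drain under the group lock. */
	int freed = group_pa_blocks[g];
	group_pa_blocks[g] = 0;
	return freed;
}

static int discard_all(int needed)
{
	bool fastpath = true;
	int freed = 0;

repeat:
	for (int g = 0; g < NGROUPS && needed > 0; g++) {
		int ret = discard_group(g, fastpath);
		freed += ret;
		needed -= ret;
	}
	/* Nothing freed on the fast pass: retry once, unconditionally
	 * visiting every group. */
	if (!freed && fastpath) {
		fastpath = false;
		goto repeat;
	}
	return freed;
}
```

The second pass only runs when the fast pass freed nothing, so in the common
case no process blocks on a group lock just because the list looked empty.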

Appreciate your feedback and comments.

-ritesh

-- 
2.21.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations
  2020-04-15 17:20         ` Ritesh Harjani
@ 2020-04-15 17:23           ` Ritesh Harjani
  2020-04-20 14:38             ` Jan Kara
  2020-04-16 10:27           ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Jan Kara
  1 sibling, 1 reply; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-15 17:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	aneesh.kumar, sandeen

There could be a race in ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through the group's bb_prealloc_list,
remove all the PAs and add them to the function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that group->bb_prealloc_list is empty and will return 0.

Consider a case where we have a small number of groups (e.g. just group 0):
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(from where we call ext4_mb_discard_group_preallocations()).
But that is wrong, since the 2nd thread should have waited for the 1st thread
to release all the PAs and then retried the allocation, given that the 1st
thread was going to discard the PAs anyway.

This patch fixes this race by introducing two paths (fastpath and
slowpath). We first try the fastpath in
ext4_mb_discard_preallocations(): if a group's PA list is
empty, then instead of waiting on the group lock we continue to discard
other groups' PAs. This helps maintain parallelism when trying to
discard multiple groups' PA lists. If at the end some process is
not able to find any freed block, then we retry freeing all of the
groups' PA lists while holding the group lock. And in case the PA list
is empty, we return grp->bb_free, which should tell us
whether there are any free blocks left in the given group with which to
make forward progress.

Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 76 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 63 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1027e01c9011..0728bfd3bc7e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3885,7 +3885,27 @@ ext4_mb_release_group_pa(struct ext4_buddy *e4b,
 }
 
 /*
- * releases all preallocations in given group
+ * This function discards all the preallocations for the given group.
+ * In this there could be a race with another process also trying to discard
+ * this group's PA. So, if we could not get any PA from grp->bb_prealloc_list
+ * into our local list (though when initially the list was non-empty),
+ * then we simply return grp->bb_free. Note that this need
+ * not be an upper bound value since it may happen that some of the PAs of
+ * a given group are in process-A's local list while some other PAs of the same
+ * group could end up in process-B's local list (this is due to the cond_resched()
+ * busy logic below). But as long as this process could free some PAs,
+ * or if there are any grp->bb_free blocks freed by some other process, a
+ * forward progress could be made.
+ *
+ * In the fastpath, the return value could be any of the below:-
+ *  - 0 if the grp->bb_prealloc_list is empty.
+ *  - the actual count of discarded PA blocks from grp->bb_prealloc_list.
+ *  - grp->bb_free.
+ *
+ * In the slowpath case, we try to free all the PAs of the given group in a
+ * similar manner. But if, even after taking the group_lock, we find
+ * that the list is empty, then we return grp->bb_free blocks.
+ * This will be 0 only when there are actually no free blocks in this group.
  *
  * first, we need to decide discard policy:
  * - when do we discard
@@ -3894,8 +3914,8 @@ ext4_mb_release_group_pa(struct ext4_buddy *e4b,
  *   1) how many requested
  */
 static noinline_for_stack int
-ext4_mb_discard_group_preallocations(struct super_block *sb,
-					ext4_group_t group, int needed)
+ext4_mb_discard_group_preallocations(struct super_block *sb, ext4_group_t group,
+				     int needed, bool fastpath)
 {
 	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
 	struct buffer_head *bitmap_bh = NULL;
@@ -3907,9 +3927,8 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	int free = 0;
 
 	mb_debug(1, "discard preallocation for group %u\n", group);
-
-	if (list_empty(&grp->bb_prealloc_list))
-		return 0;
+	if (fastpath && list_empty(&grp->bb_prealloc_list))
+		goto out_dbg;
 
 	bitmap_bh = ext4_read_block_bitmap(sb, group);
 	if (IS_ERR(bitmap_bh)) {
@@ -3917,7 +3936,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		ext4_set_errno(sb, -err);
 		ext4_error(sb, "Error %d reading block bitmap for %u",
 			   err, group);
-		return 0;
+		goto out_dbg;
 	}
 
 	err = ext4_mb_load_buddy(sb, group, &e4b);
@@ -3925,7 +3944,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		ext4_warning(sb, "Error %d loading buddy information for %u",
 			     err, group);
 		put_bh(bitmap_bh);
-		return 0;
+		goto out_dbg;
 	}
 
 	if (needed == 0)
@@ -3967,9 +3986,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		goto repeat;
 	}
 
-	/* found anything to free? */
+	/*
+	 * If this list is empty, then return the grp->bb_free. As someone
+	 * else may have freed the PAs and updated grp->bb_free.
+	 */
 	if (list_empty(&list)) {
 		BUG_ON(free != 0);
+		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
+			 group, grp->bb_free);
+		free = grp->bb_free;
 		goto out;
 	}
 
@@ -3994,6 +4019,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	ext4_unlock_group(sb, group);
 	ext4_mb_unload_buddy(&e4b);
 	put_bh(bitmap_bh);
+out_dbg:
+	mb_debug(1, "discarded (%d) blocks preallocated for group %u fastpath (%d)\n",
+		 free, group, fastpath);
 	return free;
 }
 
@@ -4464,17 +4492,39 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
 	return 0;
 }
 
+/*
+ * ext4_mb_discard_preallocations: This function loops over each group's
+ * prealloc list and tries to free it. More than 1 process may
+ * call this function in parallel. That's why we initially take a fastpath
+ * approach in which we first check if the grp->bb_prealloc_list is empty;
+ * that could mean that someone else has removed all of its PAs and added
+ * them to its local list. So we quickly return from there and try to discard
+ * the next group's PAs. This way we try to parallelize discarding multiple
+ * groups' PAs. But in case any process is unfortunate enough to not free
+ * any group's PAs, then we retry with the slow path, which guarantees that
+ * either some PAs will be freed or we will get grp->bb_free blocks
+ * (grp->bb_free, if non-zero, guarantees forward progress in ext4_mb_new_blocks())
+ */
 static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 {
 	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
 	int ret;
 	int freed = 0;
+	bool fastpath = true;
+	int tmp_needed;
 
-	trace_ext4_mb_discard_preallocations(sb, needed);
-	for (i = 0; i < ngroups && needed > 0; i++) {
-		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
+repeat:
+	tmp_needed = needed;
+	trace_ext4_mb_discard_preallocations(sb, tmp_needed);
+	for (i = 0; i < ngroups && tmp_needed > 0; i++) {
+		ret = ext4_mb_discard_group_preallocations(sb, i, tmp_needed,
+							   fastpath);
 		freed += ret;
-		needed -= ret;
+		tmp_needed -= ret;
+	}
+	if (!freed && fastpath) {
+		fastpath = false;
+		goto repeat;
 	}
 
 	return freed;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-15 17:20         ` Ritesh Harjani
  2020-04-15 17:23           ` [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations Ritesh Harjani
@ 2020-04-16 10:27           ` Jan Kara
  1 sibling, 0 replies; 11+ messages in thread
From: Jan Kara @ 2020-04-16 10:27 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	aneesh.kumar, sandeen

On Wed 15-04-20 22:50:19, Ritesh Harjani wrote:
> Sorry if this mail is not properly formatted. Somehow my mail server lost few
> of the emails and I didn't receive your email. I have to start exploring some
> other reliable method for at least receiving all the emails.
> But anyways, I copied this from patchwork so that we could continue our
> discussion.

Sure, no problem.

> >On Fri 10-04-20 00:15:44, Ritesh Harjani wrote:
> >> On 4/9/20 7:07 PM, Jan Kara wrote:
> >> > Hello Ritesh!
> >> > 
> >> > On Wed 08-04-20 22:24:10, Ritesh Harjani wrote:
> >> > > @@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> >> > >   	mb_debug(1, "discard preallocation for group %u\n", group);
> >> > > -	if (list_empty(&grp->bb_prealloc_list))
> >> > > -		return 0;
> >> > > -
> >> > 
> >> > OK, so ext4_mb_discard_preallocations() is now going to lock every group
> >> > when we try to discard preallocations. That's likely going to increase lock
> >> > contention on the group locks if we are running out of free blocks when
> >> > there are multiple processes trying to allocate blocks. I guess we don't
> >> > care about the performace of this case too deeply but I'm not sure if the
> >> > cost won't be too big - probably we should check how much the CPU usage
> >> > with multiple allocating process trying to find free blocks grows...
> >> 
> >> Sure let me check the cpu usage in my test case with this patch.
> >> But either ways unless we take the lock we are not able to confirm
> >> that what are no. of free blocks available in the filesystem, right?
> >> 
> >> This mostly will happen only when there are lot of threads and due to
> >> all of their preallocations filesystem is running into low space and
> >> hence
> >> trying to discard all the preallocations. => so when FS is going low on
> >> space, isn't this cpu usage justifiable? (in an attempt to make sure we
> >> don't fail with ENOSPC)?
> >> Maybe not since this is only due to spinlock case, is it?
> >
> >As I wrote, I'm not too much concerned about *some* increase in CPU usage.
> >But I'd like to get that quantified because if we can say softlockup the
> >machine in the extreme case (or burn 100% of several CPUs spinning on the
> >lock), then we need a better mechanism to handle the preallocation
> >discarding and waiting...
> 
> So I did give this a thought and re-looked at our code again. So to me one other
> reason why we could be deleting the PAs from grp->bb_prealloc_list and adding
> it to our local list is because:-
> We should be able to make the grp->bb_prealloc_list empty as quickly as
> possible, so that if the other process tries to discard the same group's PA
> list, it should find the list empty and it could proceed discarding some
> other group's PA. This way instead of spending cpu cycles on spin lock waiting
> for discard to complete, we could as well try and discard some other group's
> PA list.
> Not sure if there is some other reason as well for doing it like this.

Hum, interesting idea but I somewhat doubt that was the reason... I rather
suspect that the reason was some lock ordering problem which got removed
later (e.g. maybe there originally was no pa->deleted entry to mark
preallocation as on its way to be deleted and then lock ordering would
basically dictate a local list of to-discard items).

> >> Or are you suggesting we should use some other method for discarding
> >> all the group's PA. So that other threads could sleep while discard is
> >> happening. Something like a discard work item which should free up
> >> all of the group's PA. But we need a way to determine if the needed
> >> no of blocks were freed so that we wake up and retry the allocation.
> >> 
> >> (Darrick did mentioned something on this line related to work/workqueue,
> >> but couldn't discuss much that time).
> >> 
> >> 
> >> > 
> >> > >   	bitmap_bh = ext4_read_block_bitmap(sb, group);
> >> > >   	if (IS_ERR(bitmap_bh)) {
> >> > >   		err = PTR_ERR(bitmap_bh);
> >> > >   		ext4_set_errno(sb, -err);
> >> > >   		ext4_error(sb, "Error %d reading block bitmap for %u",
> >> > >   			   err, group);
> >> > > -		return 0;
> >> > > +		goto out_dbg;
> >> > >   	}
> >> > >   	err = ext4_mb_load_buddy(sb, group, &e4b);
> >> > > @@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> >> > >   		ext4_warning(sb, "Error %d loading buddy information for %u",
> >> > >   			     err, group);
> >> > >   		put_bh(bitmap_bh);
> >> > > -		return 0;
> >> > > +		goto out_dbg;
> >> > >   	}
> >> > >   	if (needed == 0)
> >> > > @@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> >> > >   		goto repeat;
> >> > >   	}
> >> > > -	/* found anything to free? */
> >> > > +	/*
> >> > > +	 * If this list is empty, then return the grp->bb_free. As someone
> >> > > +	 * else may have freed the PAs and updated grp->bb_free.
> >> > > +	 */
> >> > >   	if (list_empty(&list)) {
> >> > >   		BUG_ON(free != 0);
> >> > > +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
> >> > > +			 group, grp->bb_free);
> >> > > +		free = grp->bb_free;
> >> > >   		goto out;
> >> > >   	}
> >> > 
> >> > OK, but this still doesn't reliably fix the problem, does it? Because
> >> > bb_free can be still zero and another process just has some extents
> >> > to free
> >> > in its local 'list' (e.g. because it has decided it doesn't have enough
> >> > extents, some were busy and it decided to cond_resched()), so bb_free will
> >> > increase from 0 only once these extents are freed.
> >> 
> >> This patch should reliably fix it, I think.
> >> So even if say Process P1 didn't free all extents, since some of the
> >> PAs were busy it decided to cond_resched(), that still means that the
> >> list(bb_prealloc_list) is not empty and whoever will get the
> >> ext4_lock_group() next will either
> >> get the busy PAs or it will be blocked on this lock_group() until all of
> >> the PAs were freed by processes.
> >> So if you see, we may never actually return 0 unless there are no PAs and
> >> grp->bb_free is truly 0.
> >
> >Right, I missed that. Thanks for correction.
> >
> >> But your case does show that grp->bb_free may not be the upper bound
> >> of free blocks for this group. It could be just 1 PA's free blocks, while
> >> other PAs are still in some other process's local list (due to
> >> cond_resched())
> >
> >Yes, but note that the return value of ext4_mb_discard_preallocations() is
> >effectively treated as bool in the end - i.e., did I free some blocks? If
> >yes, retry, if not -> ENOSPC.
> 
> Yes, that's true.
> 
> >Looking more into the code, I'm also somewhat concerned, whether your
> >changes cannot lead to excessive looping in ext4_mb_new_blocks(). Because
> >allocation request can restrict which blocks are eligible for
> >allocation. And thus even if there are some free blocks in the group, it
> >may not be possible to satisfy the request. Currently, the looping is
> 
> I am not very sure if above is true. I do see that we have a ac_criteria,
> which if it's 3, then even if any group is found with free space less than the
> original allocation request length, even then it could proceed with allocation.
> I see that we iterate till the criteria value is <= 3 in 
> ext4_mb_regular_allocator(). Also note in function ext4_mb_measure_extent().
> In that we note any free block extent into ac->ac_b_ex, unless on doing
> more scanning, we are able to find a better one.
> 
> But considering there are a lot of logic conditions put in here, maybe
> I have missed your case here. Could you please help explain this case
> where if there are some free blocks in the group i.e. grp->bb_free is non-zero,
> even then it may fail to allocate anything?

What I had in mind was e.g. an allocation with EXT4_MB_HINT_GOAL_ONLY
flag. Then AFAICT we will only try to allocate starting at given block and
if that happens to be unfreeable, we have to bail out with ENOSPC even if
the group has other free blocks. But as you said, the mballoc logic is
rather convoluted so I may be missing something as well :).
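
A toy model of that case (names and data are invented; the real logic lives in
mballoc's goal allocation path): the group's free count is non-zero, yet a
goal-only style request can still fail, because only one fixed block is ever
considered:

```c
#include <assert.h>
#include <stdbool.h>

#define GROUP_BLOCKS 8

/* 1 = free, 0 = in use. The group has free blocks, but not at the goal. */
static const bool block_free[GROUP_BLOCKS] = {0, 0, 1, 1, 0, 0, 0, 0};

/* Plays the role of grp->bb_free. */
static int count_free(void)
{
	int n = 0;

	for (int b = 0; b < GROUP_BLOCKS; b++)
		n += block_free[b];
	return n;
}

/* A GOAL_ONLY-style request: succeed only if the goal block itself is free. */
static bool alloc_goal_only(int goal)
{
	return block_free[goal];
}
```

Here count_free() returns 2, yet alloc_goal_only(0) fails forever, so a retry
loop gated only on a non-zero free count would never terminate.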

> >limited by the fact that you have to discard some preallocation to loop
> >again. With your bb_free check, this protection is removed and you could
> >loop in principle indefinitely AFAICS.
> 
> So even with a positive grp->bb_free value, if ext4_mb_regular_allocator()
> cannot make forward progress of any kind, then yes this indefinite looping
> is possible. But I am unable to find a case where this is true.
> 
> >
> >> > Honestly, I don't understand why ext4_mb_discard_group_preallocations()
> >> > bothers with the local 'list'. Why doesn't it simply free the preallocation
> >> 
> >> Let's see if someone else know about this. I am not really sure
> >> why it was done this way.
> 
> One reason could be trying to free other group's PA list in parallel
> (as explained above). Not sure if there was any other reason apart from this
> though.
> 
> >> 
> >> > right away? And that would also basically fix your problem (well, it would
> >> > still theoretically exist because there's still freeing of one extent
> >> > potentially pending but I'm not sure if that will still be a practical
> >> > issue).
> >> 
> >> I guess this still can be a problem. So let's say if the process P1
> >> just checks that the list was not empty and then in parallel process P2
> >> just deletes the last entry - then when process P1 iterates over the list,
> >> it will find it empty and return 0, which may return -ENOSPC failure.
> >> unless we again take the group lock to check if the list is really free
> >> and return grp->bb_free if it is.
> >
> >I see. You're correct this could be an issue. Good spotting! So we really
> >need to check if some preallocation was discarded (either by us or someone
> >else) since we last tried allocation. If yes, loop again, if no, return
> >ENOSPC. Do you agree? This could be implemented quite efficiently by
> >"preallocation discard" sequence counter. We'd sample it before trying
> >preallocation, then again after returning from
> >ext4_mb_discard_preallocations() and if it differs, we'll loop. What do you
> >think?
> 
> IIUC, write_seqcounter needs a locking since write paths cannot be nested.
> So the caller needs to implement its own locking for the seqcounter. So what could
> actually happen is even if update write_seqcount_begin/end() inside
> ext4_lock/unlock_group(). It may still happen that some other thread with a
> different group can proceed with it's own different group lock and call
> write_seqcount_begin() which will make it nested (since we are keeping
> seqcount at EXT4 super_block level), unless we decide to keep seqcount also
> at group level. But that would also mean that we need to sample the
> read_seqcount for all groups in beginning?
> 
> Hence the slight trouble understanding a way around this issue.
> Does this make sense or am I lost here?

Right, it's not as simple as I thought :). So how I'd do it is to have
atomic_t counter in the superblock, increment it for each discard of a
preallocated extent (likely in ext4_mb_release_group_pa() and
ext4_mb_release_inode_pa()). Sample it in ext4_mb_new_blocks() before the
first allocation and then again after ext4_mb_discard_preallocations() to
check whether anything was discarded.

Now it could happen that this atomic_t will actually become a bottleneck
due to cacheline bouncing if there's heavy discarding of preallocations
happening from multiple CPUs. In case we find a workload where that would
happen we'd need to make it a per-cpu counter and then play some tricks
with the sampling - like do the first sampling in ext4_mb_new_blocks() with
percpu_counter_read() (to avoid the overhead of summing the counter for
common allocations) and if there's no space, use percpu_counter_sum() to
get accurate value. This could result in falsely retrying once if there's
no space but that's not that bad...
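
For illustration, here is a minimal userspace sketch of the sampling
scheme I have in mind. All the names here (sbi_discard_seq, try_alloc,
alloc_with_retry, release_one_pa) are illustrative stand-ins, not the
real ext4 API, and the single-threaded simulation only demonstrates the
retry logic, not the locking:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Superblock-wide "discard sequence" counter, bumped on every PA
 * discard (as would be done in ext4_mb_release_{group,inode}_pa()). */
static atomic_int sbi_discard_seq;
static int free_blocks;		/* stand-in for allocatable free space */

static void release_one_pa(int blocks)
{
	free_blocks += blocks;
	atomic_fetch_add(&sbi_discard_seq, 1);
}

static bool try_alloc(int needed)
{
	if (free_blocks >= needed) {
		free_blocks -= needed;
		return true;
	}
	return false;
}

/*
 * Retry the allocation as long as someone (us or another thread)
 * discarded a preallocation since our last attempt; otherwise report
 * failure (the -ENOSPC case). pa_count/pa_blocks simulate the PAs
 * that ext4_mb_discard_preallocations() would be able to free.
 */
static bool alloc_with_retry(int needed, int pa_blocks, int pa_count)
{
	for (;;) {
		int seq = atomic_load(&sbi_discard_seq);

		if (try_alloc(needed))
			return true;
		/* simulate ext4_mb_discard_preallocations() */
		if (pa_count > 0) {
			release_one_pa(pa_blocks);
			pa_count--;
		}
		/* nothing discarded since we sampled -> give up */
		if (atomic_load(&sbi_discard_seq) == seq)
			return false;
	}
}
```

The key property is that a thread only returns failure after observing
that no discard at all happened between its sample and its retry check.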

> Now, while we are at this, I implemented another approach with a
> fastpath and a slowpath in ext4_mb_discard_group_preallocations().
> But this still relies upon the grp->bb_free counter, which means we
> still need an answer to the concern above, where we assume that if
> grp->bb_free is non-zero then forward progress is possible.
> 
> While reading more code, I see that updating grp->bb_free and
> grp->bb_prealloc_list is not atomic: the updates happen in two
> different functions, and the lock is released in between. So relying
> on grp->bb_free when the PA list is empty is only an approximation in
> our algorithm. It may happen that some processes come in parallel and
> allocate all grp->bb_free blocks before the PA list entries for them
> are created. If that happens and we return grp->bb_free as 0, this
> could again be a corner case of an ENOSPC failure, because there was
> still some space left that was about to be added to the group's PA
> list. But that would also mean we are anyway reaching the ENOSPC
> limit, so it should be OK to get an ENOSPC failure then, right?
> 
> Meanwhile, since I anyway have that patch with me, I will reply to
> this email with the new patch. It should address your concern about
> spinning unnecessarily on the group lock. The earlier approach also
> had the other problem, as explained, that it could have made all
> discards proceed only sequentially. With the new fastpath/slowpath
> approach it is still possible to discard in parallel (if, by the time
> we check list_empty(), some other process has emptied this group's
> bb_prealloc_list and moved all the PAs to its local list).

OK, I'll have a look.
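
For reference, the fastpath/slowpath loop described in the quoted text
could be sketched in a simplified userspace form as follows. The struct
and function names are stand-ins for the real mballoc structures, the
per-group locking is omitted, and the "needed" accounting is simplified
to a freed-vs-needed comparison:

```c
#include <stdbool.h>

struct group {
	int pa_blocks;	/* blocks held in this group's PA list */
	int bb_free;	/* free blocks accounted to the group */
};

static int discard_group(struct group *grp, bool fastpath)
{
	if (grp->pa_blocks == 0) {
		/* fast pass: don't wait on a contended/empty group,
		 * just move on to the next one */
		if (fastpath)
			return 0;
		/* slow pass: report bb_free so the caller knows
		 * whether forward progress is still possible */
		return grp->bb_free;
	}
	/* discard this group's PAs and account the blocks as free */
	int freed = grp->pa_blocks;
	grp->bb_free += freed;
	grp->pa_blocks = 0;
	return freed;
}

static int discard_preallocations(struct group *groups, int ngroups,
				  int needed)
{
	bool fastpath = true;
	int freed = 0;

repeat:
	for (int i = 0; i < ngroups && needed > freed; i++)
		freed += discard_group(&groups[i], fastpath);
	/* nothing freed on the fast pass: retry with the slow pass */
	if (!freed && fastpath) {
		fastpath = false;
		goto repeat;
	}
	return freed;
}
```

The fast pass keeps multiple discarding threads from serializing on one
group, while the slow pass is the fallback that either frees PAs or
reports grp->bb_free.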

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations
  2020-04-15 17:23           ` [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations Ritesh Harjani
@ 2020-04-20 14:38             ` Jan Kara
  2020-04-21  4:24               ` Ritesh Harjani
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2020-04-20 14:38 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	aneesh.kumar, sandeen

On Wed 15-04-20 22:53:01, Ritesh Harjani wrote:
> There could be a race in ext4_mb_discard_group_preallocations(),
> where the 1st thread may iterate through a group's bb_prealloc_list,
> remove all the PAs and add them to the function's local list head.
> Now if the 2nd thread comes in to discard the group preallocations,
> it will see that group->bb_prealloc_list is empty and will return 0.
> 
> Consider a case where we have only a few groups (e.g. just group 0);
> this may even cause ext4_mb_new_blocks() (which calls
> ext4_mb_discard_group_preallocations()) to return -ENOSPC.
> But that is wrong, since the 2nd thread should have waited for the
> 1st thread to release all the PAs and then retried the allocation,
> as the 1st thread was going to discard the PAs anyway.
> 
> This patch fixes the race by introducing two paths (fastpath and
> slowpath). We first try the fastpath via
> ext4_mb_discard_preallocations(): if a group's PA list is empty,
> then instead of waiting on the group lock we continue to discard
> other groups' PAs. This helps maintain parallelism when discarding
> multiple groups' PA lists. If at the end a process is still unable
> to find any freed blocks, we retry, freeing each group's PA list
> while holding the group lock. And if the PA list is empty even then,
> we return grp->bb_free, which tells us whether the group has any
> free blocks left with which to make forward progress.
> 
> Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>

Ritesh, do you still want to push this patch as is or do you plan to change
it after a discussion on Thursday?

> @@ -3967,9 +3986,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  		goto repeat;
>  	}
>  
> -	/* found anything to free? */
> +	/*
> +	 * If this list is empty, then return the grp->bb_free. As someone
> +	 * else may have freed the PAs and updated grp->bb_free.
> +	 */
>  	if (list_empty(&list)) {
>  		BUG_ON(free != 0);
> +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
> +			 group, grp->bb_free);
> +		free = grp->bb_free;
>  		goto out;
>  	}

I'm still somewhat concerned about the forward progress guarantee here...
If you're convinced the allocation from goal is the only possibility of
lockup and that logic can be removed, then please remove it and then write a
good comment why lockup is not possible due to this.

> @@ -4464,17 +4492,39 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>  	return 0;
>  }
>  
> +/*
> + * ext4_mb_discard_preallocations: This function loops over each
> + * group's prealloc list and tries to free it. More than one process
> + * may call this function in parallel. That's why we initially take a
> + * fastpath approach in which we first check if grp->bb_prealloc_list
> + * is empty; that could mean that someone else has removed all of its
> + * PAs and added them to its own local list. So we quickly return from
> + * there and try to discard the next group's PAs. This way we try to
> + * parallelize discarding of multiple groups' PAs. But in case a
> + * process is unfortunate enough to be unable to free any group's PA,
> + * we retry with the slow path, which guarantees that either some PAs
> + * will be freed or we will get group->bb_free blocks (a non-zero
> + * grp->bb_free guarantees forward progress in ext4_mb_new_blocks()).
> + */
>  static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
>  {
>  	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
>  	int ret;
>  	int freed = 0;
> +	bool fastpath = true;
> +	int tmp_needed;
>  
> -	trace_ext4_mb_discard_preallocations(sb, needed);
> -	for (i = 0; i < ngroups && needed > 0; i++) {
> -		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
> +repeat:
> +	tmp_needed = needed;
> +	trace_ext4_mb_discard_preallocations(sb, tmp_needed);
> +	for (i = 0; i < ngroups && tmp_needed > 0; i++) {
> +		ret = ext4_mb_discard_group_preallocations(sb, i, tmp_needed,
> +							   fastpath);
>  		freed += ret;
> -		needed -= ret;
> +		tmp_needed -= ret;
> +	}

Why do you need 'tmp_needed'? When freed is 0, tmp_needed == needed, right?

> +	if (!freed && fastpath) {
> +		fastpath = false;
> +		goto repeat;
>  	}

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations
  2020-04-20 14:38             ` Jan Kara
@ 2020-04-21  4:24               ` Ritesh Harjani
  0 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-21  4:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, linux-fsdevel, aneesh.kumar, sandeen

Hello Jan,

Thanks for your email.

On 4/20/20 8:08 PM, Jan Kara wrote:
> On Wed 15-04-20 22:53:01, Ritesh Harjani wrote:
>> There could be a race in ext4_mb_discard_group_preallocations(),
>> where the 1st thread may iterate through a group's bb_prealloc_list,
>> remove all the PAs and add them to the function's local list head.
>> Now if the 2nd thread comes in to discard the group preallocations,
>> it will see that group->bb_prealloc_list is empty and will return 0.
>>
>> Consider a case where we have only a few groups (e.g. just group 0);
>> this may even cause ext4_mb_new_blocks() (which calls
>> ext4_mb_discard_group_preallocations()) to return -ENOSPC.
>> But that is wrong, since the 2nd thread should have waited for the
>> 1st thread to release all the PAs and then retried the allocation,
>> as the 1st thread was going to discard the PAs anyway.
>>
>> This patch fixes the race by introducing two paths (fastpath and
>> slowpath). We first try the fastpath via
>> ext4_mb_discard_preallocations(): if a group's PA list is empty,
>> then instead of waiting on the group lock we continue to discard
>> other groups' PAs. This helps maintain parallelism when discarding
>> multiple groups' PA lists. If at the end a process is still unable
>> to find any freed blocks, we retry, freeing each group's PA list
>> while holding the group lock. And if the PA list is empty even then,
>> we return grp->bb_free, which tells us whether the group has any
>> free blocks left with which to make forward progress.
>>
>> Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> 
> Ritesh, do you still want to push this patch as is or do you plan to change
> it after a discussion on Thursday?

I need more time on this. I will get back once I have studied how
your suggested approach works out.

-ritesh


> 
>> @@ -3967,9 +3986,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>   		goto repeat;
>>   	}
>>   
>> -	/* found anything to free? */
>> +	/*
>> +	 * If this list is empty, then return the grp->bb_free. As someone
>> +	 * else may have freed the PAs and updated grp->bb_free.
>> +	 */
>>   	if (list_empty(&list)) {
>>   		BUG_ON(free != 0);
>> +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
>> +			 group, grp->bb_free);
>> +		free = grp->bb_free;
>>   		goto out;
>>   	}
> 
> I'm still somewhat concerned about the forward progress guarantee here...
> If you're convinced the allocation from goal is the only possibility of
> lockup and that logic can be removed, then please remove it and then write a
> good comment why lockup is not possible due to this.
> 
>> @@ -4464,17 +4492,39 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>>   	return 0;
>>   }
>>   
>> +/*
>> + * ext4_mb_discard_preallocations: This function loops over each
>> + * group's prealloc list and tries to free it. More than one process
>> + * may call this function in parallel. That's why we initially take a
>> + * fastpath approach in which we first check if grp->bb_prealloc_list
>> + * is empty; that could mean that someone else has removed all of its
>> + * PAs and added them to its own local list. So we quickly return from
>> + * there and try to discard the next group's PAs. This way we try to
>> + * parallelize discarding of multiple groups' PAs. But in case a
>> + * process is unfortunate enough to be unable to free any group's PA,
>> + * we retry with the slow path, which guarantees that either some PAs
>> + * will be freed or we will get group->bb_free blocks (a non-zero
>> + * grp->bb_free guarantees forward progress in ext4_mb_new_blocks()).
>> + */
>>   static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
>>   {
>>   	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
>>   	int ret;
>>   	int freed = 0;
>> +	bool fastpath = true;
>> +	int tmp_needed;
>>   
>> -	trace_ext4_mb_discard_preallocations(sb, needed);
>> -	for (i = 0; i < ngroups && needed > 0; i++) {
>> -		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
>> +repeat:
>> +	tmp_needed = needed;
>> +	trace_ext4_mb_discard_preallocations(sb, tmp_needed);
>> +	for (i = 0; i < ngroups && tmp_needed > 0; i++) {
>> +		ret = ext4_mb_discard_group_preallocations(sb, i, tmp_needed,
>> +							   fastpath);
>>   		freed += ret;
>> -		needed -= ret;
>> +		tmp_needed -= ret;
>> +	}
> 
> Why do you need 'tmp_needed'? When freed is 0, tmp_needed == needed, right?
> 
>> +	if (!freed && fastpath) {
>> +		fastpath = false;
>> +		goto repeat;
>>   	}
> 
> 								Honza
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-04-21  4:25 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-08 16:54 [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
2020-04-08 16:54 ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Ritesh Harjani
2020-04-09 13:37   ` Jan Kara
2020-04-09 18:45     ` Ritesh Harjani
2020-04-10 15:52       ` Jan Kara
2020-04-15 17:20         ` Ritesh Harjani
2020-04-15 17:23           ` [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations Ritesh Harjani
2020-04-20 14:38             ` Jan Kara
2020-04-21  4:24               ` Ritesh Harjani
2020-04-16 10:27           ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Jan Kara
2020-04-08 17:06 ` [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
