All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations
@ 2020-04-08 16:54 Ritesh Harjani
  2020-04-08 16:54 ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Ritesh Harjani
  2020-04-08 17:06 ` [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
  0 siblings, 2 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-08 16:54 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, linux-fsdevel, Aneesh Kumar K . V,
	sandeen, Ritesh Harjani

Hello All, 

There seems to be a race in the current
ext4_mb_group_discard_preallocations() function. It can be fairly
easily reproduced on a system with only 1 group, but internally it was
reported even with more than 1 group. We could reproduce it on both 64K and
4K blocksize filesystems.

This is an RFC patch sent out for review, feedback and/or any comments.
The mail below provides all the necessary details.

Test setup
==========
1. It's a multithreaded test case where each thread tries to create a
file using open() -> ftruncate().
2. Then we mmap this file for filesize bytes.
3. Then we start writing sequentially, byte by byte, for the full filesize.

Details for creating it on a loop device:-
======================================
1. truncate -s 240M filefs   (easier on a smaller filesystem)
2. losetup /dev/loop0 filefs 
3. mkfs.ext4 -F -b 65536 /dev/loop0
4. mkdir mnt
5. mount -t ext4 /dev/loop0 mnt/
6. cd mnt/
7. Start running the test case mentioned in "Test setup" above.
(For our test we kept the number of threads at ~70 and filled the available
filesystem space (df -h) to around ~70-80%. Each filesize is calculated
based on that.)
8. cd .. (once the test finishes)
9. umount mnt
10. Go to step 3.

The test file (test-ext4-mballoc.c) and the script which does the
unmount/mount and runs ./a.out are available at [1] and [2].


Analysis:-
==========

It seems the below race could be occurring:
	P1 							P2
ext4_mb_new_blocks() 						|
	|						ext4_mb_new_blocks()
	|
ext4_mb_group_discard_preallocations() 				|
		| 				ext4_mb_group_discard_preallocations()
	if (list_empty(&grp->bb_prealloc_list))
		return 0; 					|
		| 						|
	ext4_lock_group() 					|
	list_for_each_entry_safe() {  				|
		<..>  						|
		list_del(&pa->pa_group_list);  			|
		list_add(&pa->u.pa_tmp_list, &list) 		|
	} 							|
								|
	processing-local-list() 		if (list_empty(&grp->bb_prealloc_list))
	 	|					return 0
	<...>
	ext4_unlock_group()

What we see here is that there are multiple threads trying to allocate.
But since there is not enough space, they try to discard the group
preallocations. (This is easier to understand if we only consider group 0,
though it could be reproduced with multiple groups as well.)
Now, while more than 1 thread tries to free up the group preallocations,
there could be 1 thread (P2) which sees that the bb_prealloc_list is
already empty and assumes that there is nothing to free from here, hence
returning 0.
Now consider this happening to thread P2 for all the other groups as well
(where some other thread came in early and freed up the group
preallocations). At that point, P2 sees that the total freed blocks
returned by ext4_mb_discard_preallocations() back to ext4_mb_new_blocks()
is 0, and hence it does not retry the allocation; instead it returns an
-ENOSPC error.

This causes a SIGBUS to the application which was doing the mmap write. Once
the application crashes, we can still see that the filesystem's available
space is more than ~70-80% (worst case scenario). So ideally P2 should have
waited for P1 to finish and then checked how much P1 could free up.


Solution (based on the understanding of the mballoc code)
=========================================================

We think that it is best to check whether there is anything to be freed
while holding ext4_lock_group(), i.e. to check whether the
bb_prealloc_list is empty under the lock. This patch attempts to fix the
race by checking whether nothing could be collected in the local list.
That could mean that someone else might have freed all of this group's
PAs for us. So simply return grp->bb_free, which should also give us an
upper bound on the total available space for allocation in this group.

We need not worry about the fast path of checking whether the list is
empty without taking ext4_lock_group(), since we are anyway in the slow
path where ext4_mb_regular_allocator() failed and hence we are now
desperately trying to discard all the group PAs to free up some space for
allocation.


Please correct me if any of the below understanding is incorrect:-
==========================================================
1. grp->bb_free is the number of free blocks in that group available for
   allocation by anyone.
2. If grp->bb_free is non-zero and we call ext4_mb_regular_allocator(ac),
then it will always return ac->ac_status == AC_STATUS_FOUND
(and it could even allocate and return less than the requested number of
blocks).
3. There shouldn't be any infinite loop in ext4_mb_new_blocks() after we
return grp->bb_free in this patch
(i.e. between ext4_mb_regular_allocator() and
ext4_mb_discard_preallocations()).
It could only happen if ext4_mb_regular_allocator() cannot make any
forward progress even though grp->bb_free is non-zero.
But IIUC, that won't happen. Please correct me here if wrong.

Tests run:-
==========
For now I have only done unit testing with the shared test code [1] [2].
I wanted to post this RFC early for review comments/discussion.

Resources:
==========
[1] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/tools/test-ext4-mballoc.c
[2] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/tools/test_mballoc.sh

Ritesh Harjani (1):
  ext4: Fix race in ext4_mb_discard_group_preallocations()

 fs/ext4/mballoc.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-08 16:54 [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
@ 2020-04-08 16:54 ` Ritesh Harjani
  2020-04-09 13:37   ` Jan Kara
  2020-04-08 17:06 ` [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
  1 sibling, 1 reply; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-08 16:54 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, linux-fsdevel, Aneesh Kumar K . V,
	sandeen, Ritesh Harjani

There could be a race in ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through the group's bb_prealloc_list,
remove all the PAs and add them to the function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that group->bb_prealloc_list is empty and will return 0.

Consider a case where we have a small number of groups (e.g. just group 0):
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(which calls ext4_mb_discard_group_preallocations()).
But that is wrong, since the 2nd thread should have waited for the 1st
thread to release all the PAs and then retried the allocation, since the
1st thread was anyway going to discard the PAs.

This patch fixes the race by checking whether nothing could be collected
in the local list. That could mean that someone else might have freed all
of this group's PAs for us. So simply return grp->bb_free, which should
also give us an upper bound on the total available space for allocation
in this group.

We need not check the fast path of whether the list (bb_prealloc_list)
is empty, since we are anyway in the slow path where
ext4_mb_regular_allocator() has failed and hence we are now desperately
trying to discard all the group PAs to free up some space for allocation.

Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1027e01c9011..a6c92567ec14 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3885,7 +3885,18 @@ ext4_mb_release_group_pa(struct ext4_buddy *e4b,
 }
 
 /*
- * releases all preallocations in given group
+ * This function discards all the preallocations for the given group.
+ * Here there could be a race with another process also trying to discard
+ * the PAs. So if we could not get any PA from grp->bb_prealloc_list into
+ * our local list, then we simply return grp->bb_free. This will now be an
+ * upper bound on the available space for that group. If we chose to return
+ * 0 instead, then we might end up returning an -ENOSPC error in the worst
+ * case scenario, which is certainly wrong.
+ *
+ * The return value can be either of the 2 below:-
+ *  - actual discarded PA blocks from grp->bb_prealloc_list.
+ *  - grp->bb_free value (this will be 0 only when there are actually no
+ *    free blocks available to be allocated from this group anymore).
  *
  * first, we need to decide discard policy:
  * - when do we discard
@@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 
 	mb_debug(1, "discard preallocation for group %u\n", group);
 
-	if (list_empty(&grp->bb_prealloc_list))
-		return 0;
-
 	bitmap_bh = ext4_read_block_bitmap(sb, group);
 	if (IS_ERR(bitmap_bh)) {
 		err = PTR_ERR(bitmap_bh);
 		ext4_set_errno(sb, -err);
 		ext4_error(sb, "Error %d reading block bitmap for %u",
 			   err, group);
-		return 0;
+		goto out_dbg;
 	}
 
 	err = ext4_mb_load_buddy(sb, group, &e4b);
@@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		ext4_warning(sb, "Error %d loading buddy information for %u",
 			     err, group);
 		put_bh(bitmap_bh);
-		return 0;
+		goto out_dbg;
 	}
 
 	if (needed == 0)
@@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		goto repeat;
 	}
 
-	/* found anything to free? */
+	/*
+	 * If this list is empty, then return the grp->bb_free. As someone
+	 * else may have freed the PAs and updated grp->bb_free.
+	 */
 	if (list_empty(&list)) {
 		BUG_ON(free != 0);
+		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
+			 group, grp->bb_free);
+		free = grp->bb_free;
 		goto out;
 	}
 
@@ -3994,6 +4008,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	ext4_unlock_group(sb, group);
 	ext4_mb_unload_buddy(&e4b);
 	put_bh(bitmap_bh);
+out_dbg:
+	mb_debug(1, "discarded (%d) blocks preallocated for group %u\n",
+		 free, group);
 	return free;
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations
  2020-04-08 16:54 [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
  2020-04-08 16:54 ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Ritesh Harjani
@ 2020-04-08 17:06 ` Ritesh Harjani
  1 sibling, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-08 17:06 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, linux-fsdevel, Aneesh Kumar K . V, sandeen


Sorry about the typo.
ext4_mb_group_discard_preallocations() should be 
ext4_mb_discard_group_preallocations()

-ritesh

On 4/8/20 10:24 PM, Ritesh Harjani wrote:
> [...]


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-08 16:54 ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Ritesh Harjani
@ 2020-04-09 13:37   ` Jan Kara
  2020-04-09 18:45     ` Ritesh Harjani
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2020-04-09 13:37 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-ext4, tytso, jack, adilger.kernel, linux-fsdevel,
	Aneesh Kumar K . V, sandeen

Hello Ritesh!

On Wed 08-04-20 22:24:10, Ritesh Harjani wrote:
> @@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  
>  	mb_debug(1, "discard preallocation for group %u\n", group);
>  
> -	if (list_empty(&grp->bb_prealloc_list))
> -		return 0;
> -

OK, so ext4_mb_discard_preallocations() is now going to lock every group
when we try to discard preallocations. That's likely going to increase lock
contention on the group locks if we are running out of free blocks while
there are multiple processes trying to allocate blocks. I guess we don't
care about the performance of this case too deeply, but I'm not sure if the
cost won't be too big - probably we should check how much the CPU usage
grows with multiple allocating processes trying to find free blocks...

>  	bitmap_bh = ext4_read_block_bitmap(sb, group);
>  	if (IS_ERR(bitmap_bh)) {
>  		err = PTR_ERR(bitmap_bh);
>  		ext4_set_errno(sb, -err);
>  		ext4_error(sb, "Error %d reading block bitmap for %u",
>  			   err, group);
> -		return 0;
> +		goto out_dbg;
>  	}
>  
>  	err = ext4_mb_load_buddy(sb, group, &e4b);
> @@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  		ext4_warning(sb, "Error %d loading buddy information for %u",
>  			     err, group);
>  		put_bh(bitmap_bh);
> -		return 0;
> +		goto out_dbg;
>  	}
>  
>  	if (needed == 0)
> @@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  		goto repeat;
>  	}
>  
> -	/* found anything to free? */
> +	/*
> +	 * If this list is empty, then return the grp->bb_free. As someone
> +	 * else may have freed the PAs and updated grp->bb_free.
> +	 */
>  	if (list_empty(&list)) {
>  		BUG_ON(free != 0);
> +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
> +			 group, grp->bb_free);
> +		free = grp->bb_free;
>  		goto out;
>  	}

OK, but this still doesn't reliably fix the problem, does it? Because
bb_free can still be zero while another process just has some extents to free
in its local 'list' (e.g. because it has decided it doesn't have enough
extents, some were busy and it decided to cond_resched()), so bb_free will
increase from 0 only once these extents are freed.

Honestly, I don't understand why ext4_mb_discard_group_preallocations()
bothers with the local 'list'. Why doesn't it simply free the preallocation
right away? And that would also basically fix your problem (well, it would
still theoretically exist because there's still freeing of one extent
potentially pending but I'm not sure if that will still be a practical
issue).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-09 13:37   ` Jan Kara
@ 2020-04-09 18:45     ` Ritesh Harjani
  2020-04-10 15:52       ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-09 18:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	Aneesh Kumar K . V, sandeen

Hello Jan,

Thanks for looking into this.

On 4/9/20 7:07 PM, Jan Kara wrote:
> Hello Ritesh!
> 
> On Wed 08-04-20 22:24:10, Ritesh Harjani wrote:
>> @@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>   
>>   	mb_debug(1, "discard preallocation for group %u\n", group);
>>   
>> -	if (list_empty(&grp->bb_prealloc_list))
>> -		return 0;
>> -
> 
> OK, so ext4_mb_discard_preallocations() is now going to lock every group
> when we try to discard preallocations. That's likely going to increase lock
> contention on the group locks if we are running out of free blocks when
> there are multiple processes trying to allocate blocks. I guess we don't
> care about the performace of this case too deeply but I'm not sure if the
> cost won't be too big - probably we should check how much the CPU usage
> with multiple allocating process trying to find free blocks grows...

Sure, let me check the CPU usage in my test case with this patch.
But either way, unless we take the lock, we are not able to confirm
the number of free blocks available in the filesystem, right?

This will mostly happen only when there are a lot of threads and, due to
all of their preallocations, the filesystem is running low on space and
hence trying to discard all the preallocations. => So when the FS is
running low on space, isn't this CPU usage justifiable (in an attempt to
make sure we don't fail with ENOSPC)?
Maybe not, since this is only due to the spinlock case, is it?

Or are you suggesting we should use some other method for discarding
all the group's PAs, so that other threads could sleep while the discard
is happening? Something like a discard work item which should free up
all of the group's PAs. But we would need a way to determine whether the
needed number of blocks were freed, so that we wake up and retry the
allocation.

(Darrick did mention something along these lines related to a
work/workqueue, but we couldn't discuss it much at the time.)


> 
>>   	bitmap_bh = ext4_read_block_bitmap(sb, group);
>>   	if (IS_ERR(bitmap_bh)) {
>>   		err = PTR_ERR(bitmap_bh);
>>   		ext4_set_errno(sb, -err);
>>   		ext4_error(sb, "Error %d reading block bitmap for %u",
>>   			   err, group);
>> -		return 0;
>> +		goto out_dbg;
>>   	}
>>   
>>   	err = ext4_mb_load_buddy(sb, group, &e4b);
>> @@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>   		ext4_warning(sb, "Error %d loading buddy information for %u",
>>   			     err, group);
>>   		put_bh(bitmap_bh);
>> -		return 0;
>> +		goto out_dbg;
>>   	}
>>   
>>   	if (needed == 0)
>> @@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>   		goto repeat;
>>   	}
>>   
>> -	/* found anything to free? */
>> +	/*
>> +	 * If this list is empty, then return the grp->bb_free. As someone
>> +	 * else may have freed the PAs and updated grp->bb_free.
>> +	 */
>>   	if (list_empty(&list)) {
>>   		BUG_ON(free != 0);
>> +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
>> +			 group, grp->bb_free);
>> +		free = grp->bb_free;
>>   		goto out;
>>   	}
> 
> OK, but this still doesn't reliably fix the problem, does it? Because
> bb_free can be still zero and another process just has some extents to free
> in its local 'list' (e.g. because it has decided it doesn't have enough
> extents, some were busy and it decided to cond_resched()), so bb_free will
> increase from 0 only once these extents are freed.

This patch should reliably fix it, I think.
So even if, say, process P1 didn't free all extents because some of the
PAs were busy and it decided to cond_resched(), that still means that the
list (bb_prealloc_list) is not empty, and whoever gets the
ext4_lock_group() next will either
get the busy PAs or be blocked on this lock_group() until all of
the PAs are freed by those processes.
So if you see, we may never actually return 0 unless there are no PAs
and grp->bb_free is truly 0.

But your case does show that grp->bb_free may not be the upper bound
of free blocks for this group. It could be just 1 PA's free blocks,
while other PAs are still on some other process's local list (due to
cond_resched()).


> 
> Honestly, I don't understand why ext4_mb_discard_group_preallocations()
> bothers with the local 'list'. Why doesn't it simply free the preallocation

Let's see if someone else knows about this. I am not really sure
why it was done this way.


> right away? And that would also basically fix your problem (well, it would
> still theoretically exist because there's still freeing of one extent
> potentially pending but I'm not sure if that will still be a practical
> issue).

I guess this can still be a problem. Let's say the process P1
just checks that the list is not empty, and then in parallel process P2
deletes the last entry - then when process P1 iterates over the
list, it will find it empty and return 0, which may result in an -ENOSPC
failure, unless we again take the group lock to check whether the list is
really empty and return grp->bb_free if it is.


-ritesh


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-09 18:45     ` Ritesh Harjani
@ 2020-04-10 15:52       ` Jan Kara
  2020-04-15 17:20         ` Ritesh Harjani
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2020-04-10 15:52 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	Aneesh Kumar K . V, sandeen


Hello Ritesh!

On Fri 10-04-20 00:15:44, Ritesh Harjani wrote:
> On 4/9/20 7:07 PM, Jan Kara wrote:
> > Hello Ritesh!
> > 
> > On Wed 08-04-20 22:24:10, Ritesh Harjani wrote:
> > > @@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> > >   	mb_debug(1, "discard preallocation for group %u\n", group);
> > > -	if (list_empty(&grp->bb_prealloc_list))
> > > -		return 0;
> > > -
> > 
> > OK, so ext4_mb_discard_preallocations() is now going to lock every group
> > when we try to discard preallocations. That's likely going to increase lock
> > contention on the group locks if we are running out of free blocks when
> > there are multiple processes trying to allocate blocks. I guess we don't
> > care about the performace of this case too deeply but I'm not sure if the
> > cost won't be too big - probably we should check how much the CPU usage
> > with multiple allocating process trying to find free blocks grows...
> 
> Sure let me check the cpu usage in my test case with this patch.
> But either ways unless we take the lock we are not able to confirm
> that what are no. of free blocks available in the filesystem, right?
> 
> This mostly will happen only when there are lot of threads and due to
> all of their preallocations filesystem is running into low space and
> hence
> trying to discard all the preallocations. => so when FS is going low on
> space, isn't this cpu usage justifiable? (in an attempt to make sure we
> don't fail with ENOSPC)?
> Maybe not since this is only due to spinlock case, is it?

As I wrote, I'm not too much concerned about *some* increase in CPU usage.
But I'd like to get that quantified because if we can say softlockup the
machine in the extreme case (or burn 100% of several CPUs spinning on the
lock), then we need a better mechanism to handle the preallocation
discarding and waiting...

> Or are you suggesting we should use some other method for discarding
> all the group's PA. So that other threads could sleep while discard is
> happening. Something like a discard work item which should free up
> all of the group's PA. But we need a way to determine if the needed
> no of blocks were freed so that we wake up and retry the allocation.
> 
> (Darrick did mentioned something on this line related to work/workqueue,
> but couldn't discuss much that time).
> 
> 
> > 
> > >   	bitmap_bh = ext4_read_block_bitmap(sb, group);
> > >   	if (IS_ERR(bitmap_bh)) {
> > >   		err = PTR_ERR(bitmap_bh);
> > >   		ext4_set_errno(sb, -err);
> > >   		ext4_error(sb, "Error %d reading block bitmap for %u",
> > >   			   err, group);
> > > -		return 0;
> > > +		goto out_dbg;
> > >   	}
> > >   	err = ext4_mb_load_buddy(sb, group, &e4b);
> > > @@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> > >   		ext4_warning(sb, "Error %d loading buddy information for %u",
> > >   			     err, group);
> > >   		put_bh(bitmap_bh);
> > > -		return 0;
> > > +		goto out_dbg;
> > >   	}
> > >   	if (needed == 0)
> > > @@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> > >   		goto repeat;
> > >   	}
> > > -	/* found anything to free? */
> > > +	/*
> > > +	 * If this list is empty, then return the grp->bb_free. As someone
> > > +	 * else may have freed the PAs and updated grp->bb_free.
> > > +	 */
> > >   	if (list_empty(&list)) {
> > >   		BUG_ON(free != 0);
> > > +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
> > > +			 group, grp->bb_free);
> > > +		free = grp->bb_free;
> > >   		goto out;
> > >   	}
> > 
> > OK, but this still doesn't reliably fix the problem, does it? Because
> > bb_free can be still zero and another process just has some extents to free
> > in its local 'list' (e.g. because it has decided it doesn't have enough
> > extents, some were busy and it decided to cond_resched()), so bb_free will
> > increase from 0 only once these extents are freed.
> 
> This patch should reliably fix it, I think.
> So even if say Process P1 didn't free all extents, since some of the
> PAs were busy it decided to cond_resched(), that still means that the
> list(bb_prealloc_list) is not empty and whoever will get the
> ext4_lock_group() next will either
> get the busy PAs or it will be blocked on this lock_group() until all of
> the PAs were freed by processes.
> So if you see we may never actually return 0, unless, there are no PAs and
> grp->bb_free is truly 0.

Right, I missed that. Thanks for correction.

> But your case does shows that grp->bb_free may not be the upper bound
> of free blocks for this group. It could be just 1 PA's free blocks, while
> other PAs are still in some other process's local list (due to
> cond_resched())

Yes, but note that the return value of ext4_mb_discard_preallocations() is
effectively treated as bool in the end - i.e., did I free some blocks? If
yes, retry, if not -> ENOSPC.

Looking more into the code, I'm also somewhat concerned whether your
changes cannot lead to excessive looping in ext4_mb_new_blocks(). Because
the allocation request can be restricting which blocks are eligible for
allocation, and thus even if there are some free blocks in the group, it
may not be possible to satisfy the request. Currently, the looping is
limited by the fact that you have to discard some preallocation to loop
again. With your bb_free check, this protection is removed and you could
in principle loop indefinitely AFAICS.

> > Honestly, I don't understand why ext4_mb_discard_group_preallocations()
> > bothers with the local 'list'. Why doesn't it simply free the preallocation
> 
> Let's see if someone else know about this. I am not really sure
> why it was done this way.
> 
> 
> > right away? And that would also basically fix your problem (well, it would
> > still theoretically exist because there's still freeing of one extent
> > potentially pending but I'm not sure if that will still be a practical
> > issue).
> 
> I guess this still can be a problem. So let's say if the process P1
> just checks that the list was not empty and then in parallel process P2
> just deletes the last entry - then when process P1 iterates over the list,
> it will find it empty and return 0, which may result in an -ENOSPC failure,
> unless we again take the group lock to check if the list is really empty
> and return grp->bb_free if it is.

I see. You're correct this could be an issue. Good spotting! So we really
need to check if some preallocation was discarded (either by us or someone
else) since we last tried allocation. If yes, loop again, if no, return
ENOSPC. Do you agree? This could be implemented quite efficiently by
"preallocation discard" sequence counter. We'd sample it before trying
preallocation, then again after returning from
ext4_mb_discard_preallocations() and if it differs, we'll loop. What do you
think?
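
A rough userspace C model of this scheme (every name here, sb_discard_seq,
discard_one_pa(), allocate_with_retry(), is invented for illustration; in the
kernel the counter would live in the ext4 superblock and be incremented from
the PA release paths):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical "preallocation discard" counter, bumped on every PA discard. */
static atomic_ulong sb_discard_seq;

/* Stand-in for the PA release paths that would bump the counter. */
static void discard_one_pa(void)
{
	atomic_fetch_add(&sb_discard_seq, 1);
}

/* Stand-in for the actual block allocation attempt; always fails here. */
static bool try_allocation(void)
{
	return false;
}

/*
 * Retry logic as it could look in the allocation path: sample the counter
 * before the allocation attempt; after the discard pass, loop again only if
 * the counter moved, i.e. someone (we or another process) discarded at least
 * one PA since we last tried. Otherwise return ENOSPC.
 */
static int allocate_with_retry(int max_loops)
{
	int loops = 0;

	while (loops < max_loops) {
		unsigned long seq = atomic_load(&sb_discard_seq);

		if (try_allocation())
			return 0;
		discard_one_pa();	/* model: our discard pass freed one PA */
		loops++;
		if (atomic_load(&sb_discard_seq) == seq)
			return -1;	/* nothing was discarded anywhere: ENOSPC */
	}
	return -1;
}
```

The max_loops cap exists only to keep the model finite; the point is that the
retry loop is bounded by actual discards, not by a possibly stale free count.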

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-10 15:52       ` Jan Kara
@ 2020-04-15 17:20         ` Ritesh Harjani
  2020-04-15 17:23           ` [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations Ritesh Harjani
  2020-04-16 10:27           ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Jan Kara
  0 siblings, 2 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-15 17:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	aneesh.kumar, sandeen

Hello Jan, 

Sorry if this mail is not properly formatted. Somehow my mail server lost a few
of the emails and I didn't receive your email. I will have to start exploring
some other, more reliable method for at least receiving all the emails.
But anyway, I copied this from patchwork so that we could continue our
discussion.

>Hello Ritesh!
>
>On Fri 10-04-20 00:15:44, Ritesh Harjani wrote:
>> On 4/9/20 7:07 PM, Jan Kara wrote:
>> > Hello Ritesh!
>> > 
>> > On Wed 08-04-20 22:24:10, Ritesh Harjani wrote:
>> > > @@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>> > >   	mb_debug(1, "discard preallocation for group %u\n", group);
>> > > -	if (list_empty(&grp->bb_prealloc_list))
>> > > -		return 0;
>> > > -
>> > 
>> > OK, so ext4_mb_discard_preallocations() is now going to lock every group
>> > when we try to discard preallocations. That's likely going to increase lock
>> > contention on the group locks if we are running out of free blocks when
>> > there are multiple processes trying to allocate blocks. I guess we don't
>> > care about the performace of this case too deeply but I'm not sure if the
>> > cost won't be too big - probably we should check how much the CPU usage
>> > with multiple allocating process trying to find free blocks grows...
>> 
>> Sure, let me check the CPU usage in my test case with this patch.
>> But either way, unless we take the lock we cannot confirm the
>> number of free blocks available in the filesystem, right?
>> 
>> This will mostly happen only when there are a lot of threads and, due to
>> all of their preallocations, the filesystem is running low on space and
>> hence trying to discard all the preallocations. So when the FS is running
>> low on space, isn't this CPU usage justifiable (in an attempt to make sure
>> we don't fail with ENOSPC)?
>> Maybe not, since this is only due to the spinlock case, is it?
>
>As I wrote, I'm not too much concerned about *some* increase in CPU usage.
>But I'd like to get that quantified because if we can say softlockup the
>machine in the extreme case (or burn 100% of several CPUs spinning on the
>lock), then we need a better mechanism to handle the preallocation
>discarding and waiting...

So I gave this some thought and looked at our code again. To me, one other
reason why we could be deleting the PAs from grp->bb_prealloc_list and adding
them to our local list is this: we should be able to make the
grp->bb_prealloc_list empty as quickly as possible, so that if another process
tries to discard the same group's PA list, it finds the list empty and can
proceed to discard some other group's PAs. This way, instead of spending CPU
cycles spinning on the lock waiting for the discard to complete, we could just
as well try to discard some other group's PA list.
Not sure if there is some other reason for doing it like this.

>
>
>> Or are you suggesting we should use some other method for discarding
>> all the group's PA. So that other threads could sleep while discard is
>> happening. Something like a discard work item which should free up
>> all of the group's PA. But we need a way to determine if the needed
>> no of blocks were freed so that we wake up and retry the allocation.
>> 
>> (Darrick did mentioned something on this line related to work/workqueue,
>> but couldn't discuss much that time).
>> 
>> 
>> > 
>> > >   	bitmap_bh = ext4_read_block_bitmap(sb, group);
>> > >   	if (IS_ERR(bitmap_bh)) {
>> > >   		err = PTR_ERR(bitmap_bh);
>> > >   		ext4_set_errno(sb, -err);
>> > >   		ext4_error(sb, "Error %d reading block bitmap for %u",
>> > >   			   err, group);
>> > > -		return 0;
>> > > +		goto out_dbg;
>> > >   	}
>> > >   	err = ext4_mb_load_buddy(sb, group, &e4b);
>> > > @@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>> > >   		ext4_warning(sb, "Error %d loading buddy information for %u",
>> > >   			     err, group);
>> > >   		put_bh(bitmap_bh);
>> > > -		return 0;
>> > > +		goto out_dbg;
>> > >   	}
>> > >   	if (needed == 0)
>> > > @@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>> > >   		goto repeat;
>> > >   	}
>> > > -	/* found anything to free? */
>> > > +	/*
>> > > +	 * If this list is empty, then return the grp->bb_free. As someone
>> > > +	 * else may have freed the PAs and updated grp->bb_free.
>> > > +	 */
>> > >   	if (list_empty(&list)) {
>> > >   		BUG_ON(free != 0);
>> > > +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
>> > > +			 group, grp->bb_free);
>> > > +		free = grp->bb_free;
>> > >   		goto out;
>> > >   	}
>> > 
>> > OK, but this still doesn't reliably fix the problem, does it? Because
>> > bb_free can be still zero and another process just has some extents
>> > to free
>> > in its local 'list' (e.g. because it has decided it doesn't have enough
>> > extents, some were busy and it decided to cond_resched()), so bb_free will
>> > increase from 0 only once these extents are freed.
>> 
>> This patch should reliably fix it, I think.
>> So even if say Process P1 didn't free all extents, since some of the
>> PAs were busy it decided to cond_resched(), that still means that the
>> list(bb_prealloc_list) is not empty and whoever will get the
>> ext4_lock_group() next will either
>> get the busy PAs or it will be blocked on this lock_group() until all of
>> the PAs were freed by processes.
>> So if you see, we may never actually return 0 unless there are no PAs and
>> grp->bb_free is truly 0.
>
>Right, I missed that. Thanks for correction.
>
>> But your case does show that grp->bb_free may not be the upper bound
>> of free blocks for this group. It could be just 1 PA's free blocks, while
>> other PAs are still in some other process's local list (due to
>> cond_resched())
>
>Yes, but note that the return value of ext4_mb_discard_preallocations() is
>effectively treated as bool in the end - i.e., did I free some blocks? If
>yes, retry, if not -> ENOSPC.

Yes, that's true.

>Looking more into the code, I'm also somewhat concerned, whether your
>changes cannot lead to excessive looping in ext4_mb_new_blocks(). Because
>allocation request can restrict which blocks are eligible for
>allocation. And thus even if there are some free blocks in the group, it
>may not be possible to satisfy the request. Currently, the looping is

I am not very sure that the above is true. I do see that we have ac_criteria,
and if it is 3, then even if a group is found with free space less than the
original allocation request length, the allocation can still proceed.
I see that we iterate until the criteria value is <= 3 in
ext4_mb_regular_allocator(). Also note that in ext4_mb_measure_extent()
we record any free block extent into ac->ac_b_ex, unless on doing
more scanning we are able to find a better one.

But considering there are a lot of logic conditions in here, maybe
I have missed your case. Could you please help explain the case where,
even though there are some free blocks in the group (i.e. grp->bb_free is
non-zero), it may still fail to allocate anything?

>limited by the fact that you have to discard some preallocation to loop
>again. With your bb_free check, this protection is removed and you could
>loop in principle indefinitely AFAICS.

So even with a positive grp->bb_free value, if ext4_mb_regular_allocator()
cannot make forward progress of any kind, then yes, this indefinite looping
is possible. But I am unable to find a case where this is true.

>
>> > Honestly, I don't understand why ext4_mb_discard_group_preallocations()
>> > bothers with the local 'list'. Why doesn't it simply free the preallocation
>> 
>> Let's see if someone else know about this. I am not really sure
>> why it was done this way.

One reason could be to allow freeing other groups' PA lists in parallel
(as explained above). Not sure whether there was any other reason apart from
this though.

>> 
>> > right away? And that would also basically fix your problem (well, it would
>> > still theoretically exist because there's still freeing of one extent
>> > potentially pending but I'm not sure if that will still be a practical
>> > issue).
>> 
>> I guess this still can be a problem. So let's say if the process P1
>> just checks that the list was not empty and then in parallel process P2
>> just deletes the last entry - then when process P1 iterates over the list,
>> it will find it empty and return 0, which may return -ENOSPC failure.
>> unless we again take the group lock to check if the list is really free
>> and return grp->bb_free if it is.
>
>I see. You're correct this could be an issue. Good spotting! So we really
>need to check if some preallocation was discarded (either by us or someone
>else) since we last tried allocation. If yes, loop again, if no, return
>ENOSPC. Do you agree? This could be implemented quite efficiently by
>"preallocation discard" sequence counter. We'd sample it before trying
>preallocation, then again after returning from
>ext4_mb_discard_preallocations() and if it differs, we'll loop. What do you
>think?

IIUC, a write seqcounter needs locking, since write paths cannot be nested.
So the caller needs to implement its own locking for the seqcounter. What could
then happen is that, even if we update write_seqcount_begin/end() inside
ext4_lock/unlock_group(), some other thread working on a different group can
proceed under its own group lock and call write_seqcount_begin(), which will
make the writes nested (since we are keeping the seqcount at the ext4
super_block level), unless we decide to keep a seqcount at the group level as
well. But that would also mean that we need to sample the read_seqcount for
all groups at the beginning?

Hence the slight trouble understanding a way around this issue.
Does this make sense, or am I lost here?

Now, while we are at this, I implemented another approach with a fastpath and
a slowpath in ext4_mb_discard_group_preallocations().
But this still relies upon the grp->bb_free counter. This also means that
we need an answer to our above concern, where we assume that if grp->bb_free
is non-zero then forward progress is possible.

While reading more code, I see that updating grp->bb_free and
grp->bb_prealloc_list is not atomic. This happens in two different functions,
and in between the lock is released. So relying completely on grp->bb_free
when the PA list is empty is just an approximation in our algorithm.
It may happen that some processes come in parallel and allocate all
grp->bb_free blocks while the PAs for those allocations are not yet created.
If that happens and we return grp->bb_free, which is 0, this could again
be a corner case of ENOSPC failure, because there was still some space
left which was about to be added to the group's PA list.
But this would also mean that we are anyway reaching our limit
of ENOSPC, so it should be ok if we get an ENOSPC failure then, right?

Meanwhile, since I anyway have that patch with me, I will reply to this
email with the new patch. That patch should help address your concern about
spinning unnecessarily on the group lock. The old approach also had the other
problem explained above: it could have made all discards proceed only
sequentially. With the new fastpath/slowpath approach, it is still possible to
discard in parallel (if, by the time we check list_empty(), some other process
has made this group's bb_prealloc_list empty and added all the PAs to its
local list).
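
The fastpath/slowpath retry described above can be sketched in plain userspace
C as follows (group_pa_blocks[] and the function names are invented stand-ins
for the real group PA lists and locking; the actual kernel patch follows in
the next mail):

```c
#include <assert.h>
#include <stdbool.h>

#define NGROUPS 4

/* Blocks held in each group's PAs; group 0 and 2 look empty. */
static int group_pa_blocks[NGROUPS] = {0, 5, 0, 3};

static int discard_group(int g, bool fastpath)
{
	/* Fastpath: if the list looks empty (maybe another process is
	 * draining it into its local list), skip it and stay parallel. */
	if (fastpath && group_pa_blocks[g] == 0)
		return 0;
	/* Slowpath (or non-empty fastpath): drain under the group lock. */
	int freed = group_pa_blocks[g];
	group_pa_blocks[g] = 0;
	return freed;
}

static int discard_all(int needed)
{
	bool fastpath = true;
	int freed = 0;

repeat:
	for (int g = 0; g < NGROUPS && needed > 0; g++) {
		int ret = discard_group(g, fastpath);
		freed += ret;
		needed -= ret;
	}
	/* Nothing freed on the fast pass: retry once, unconditionally
	 * visiting every group. */
	if (!freed && fastpath) {
		fastpath = false;
		goto repeat;
	}
	return freed;
}
```

The second pass only runs when the fast pass freed nothing, so in the common
case no process blocks on a group lock just because the list looked empty.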

Appreciate your feedback and comments.

-ritesh

-- 
2.21.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations
  2020-04-15 17:20         ` Ritesh Harjani
@ 2020-04-15 17:23           ` Ritesh Harjani
  2020-04-20 14:38             ` Jan Kara
  2020-04-16 10:27           ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Jan Kara
  1 sibling, 1 reply; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-15 17:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	aneesh.kumar, sandeen

There could be a race in ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through the group's bb_prealloc_list,
remove all the PAs and add them to the function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that group->bb_prealloc_list is empty and will return 0.

Consider a case where we have a small number of groups (e.g. just group 0):
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(from where we call ext4_mb_discard_group_preallocations()).
But that is wrong, since the 2nd thread should have waited for the 1st thread
to release all the PAs and then retried the allocation, given that the 1st
thread was going to discard the PAs anyway.

This patch fixes this race by introducing two paths (fastpath and
slowpath). We first try the fastpath in
ext4_mb_discard_preallocations(): if a group's PA list is
empty, then instead of waiting on the group lock we continue to discard
other groups' PAs. This helps maintain parallelism when trying to
discard multiple groups' PA lists. If at the end some process is
not able to find any freed block, then we retry freeing all of the
groups' PA lists while holding the group lock. And in case the PA list
is empty, we return grp->bb_free, which should tell us
whether there are any free blocks left in the given group with which to
make forward progress.

Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/mballoc.c | 76 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 63 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1027e01c9011..0728bfd3bc7e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3885,7 +3885,27 @@ ext4_mb_release_group_pa(struct ext4_buddy *e4b,
 }
 
 /*
- * releases all preallocations in given group
+ * This function discards all the preallocations for the given group.
+ * In this there could be a race with another process also trying to discard
+ * this group's PA. So, if we could not get any PA from grp->bb_prealloc_list
+ * into our local list (though when initially the list was non-empty),
+ * then we simply return grp->bb_free. Note that this need
+ * not be an upper bound value since it may happen that some of the PAs of
+ * a given group are in process-A's local list while some other PAs of the same
+ * group could end up in process-B's local list (this is due to the cond_resched()
+ * busy logic below). But as long as this process could free some PAs,
+ * or if there are any grp->bb_free blocks freed by some other process, a
+ * forward progress could be made.
+ *
+ * In the fastpath, the return value could be any of the below:-
+ *  - 0 if the grp->bb_prealloc_list is empty.
+ *  - the actual count of discarded PA blocks from grp->bb_prealloc_list.
+ *  - grp->bb_free.
+ *
+ * In the slowpath case, we try to free all the PAs of the given group in a
+ * similar manner. But if, even after taking the group_lock, we find
+ * that the list is empty, then we return grp->bb_free blocks.
+ * This will be 0 only when there are actually no free blocks in this group.
  *
  * first, we need to decide discard policy:
  * - when do we discard
@@ -3894,8 +3914,8 @@ ext4_mb_release_group_pa(struct ext4_buddy *e4b,
  *   1) how many requested
  */
 static noinline_for_stack int
-ext4_mb_discard_group_preallocations(struct super_block *sb,
-					ext4_group_t group, int needed)
+ext4_mb_discard_group_preallocations(struct super_block *sb, ext4_group_t group,
+				     int needed, bool fastpath)
 {
 	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
 	struct buffer_head *bitmap_bh = NULL;
@@ -3907,9 +3927,8 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	int free = 0;
 
 	mb_debug(1, "discard preallocation for group %u\n", group);
-
-	if (list_empty(&grp->bb_prealloc_list))
-		return 0;
+	if (fastpath && list_empty(&grp->bb_prealloc_list))
+		goto out_dbg;
 
 	bitmap_bh = ext4_read_block_bitmap(sb, group);
 	if (IS_ERR(bitmap_bh)) {
@@ -3917,7 +3936,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		ext4_set_errno(sb, -err);
 		ext4_error(sb, "Error %d reading block bitmap for %u",
 			   err, group);
-		return 0;
+		goto out_dbg;
 	}
 
 	err = ext4_mb_load_buddy(sb, group, &e4b);
@@ -3925,7 +3944,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		ext4_warning(sb, "Error %d loading buddy information for %u",
 			     err, group);
 		put_bh(bitmap_bh);
-		return 0;
+		goto out_dbg;
 	}
 
 	if (needed == 0)
@@ -3967,9 +3986,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		goto repeat;
 	}
 
-	/* found anything to free? */
+	/*
+	 * If this list is empty, then return the grp->bb_free. As someone
+	 * else may have freed the PAs and updated grp->bb_free.
+	 */
 	if (list_empty(&list)) {
 		BUG_ON(free != 0);
+		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
+			 group, grp->bb_free);
+		free = grp->bb_free;
 		goto out;
 	}
 
@@ -3994,6 +4019,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	ext4_unlock_group(sb, group);
 	ext4_mb_unload_buddy(&e4b);
 	put_bh(bitmap_bh);
+out_dbg:
+	mb_debug(1, "discarded (%d) blocks preallocated for group %u fastpath (%d)\n",
+		 free, group, fastpath);
 	return free;
 }
 
@@ -4464,17 +4492,39 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
 	return 0;
 }
 
+/*
+ * ext4_mb_discard_preallocations: This function loops over each group's
+ * prealloc list and tries to free it. More than 1 process may
+ * call this function in parallel. That's why we initially take a fastpath
+ * approach in which we first check if the grp->bb_prealloc_list is empty;
+ * that could mean that someone else has removed all of its PAs and added
+ * them to its local list. So we quickly return from there and try to discard
+ * the next group's PAs. This way we try to parallelize discarding multiple
+ * groups' PAs. But in case any process is unfortunate enough to not free
+ * any group's PAs, then we retry with the slow path, which guarantees that
+ * either some PAs will be freed or we will get grp->bb_free blocks
+ * (grp->bb_free, if non-zero, guarantees forward progress in ext4_mb_new_blocks())
+ */
 static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 {
 	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
 	int ret;
 	int freed = 0;
+	bool fastpath = true;
+	int tmp_needed;
 
-	trace_ext4_mb_discard_preallocations(sb, needed);
-	for (i = 0; i < ngroups && needed > 0; i++) {
-		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
+repeat:
+	tmp_needed = needed;
+	trace_ext4_mb_discard_preallocations(sb, tmp_needed);
+	for (i = 0; i < ngroups && tmp_needed > 0; i++) {
+		ret = ext4_mb_discard_group_preallocations(sb, i, tmp_needed,
+							   fastpath);
 		freed += ret;
-		needed -= ret;
+		tmp_needed -= ret;
+	}
+	if (!freed && fastpath) {
+		fastpath = false;
+		goto repeat;
 	}
 
 	return freed;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations()
  2020-04-15 17:20         ` Ritesh Harjani
  2020-04-15 17:23           ` [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations Ritesh Harjani
@ 2020-04-16 10:27           ` Jan Kara
  1 sibling, 0 replies; 11+ messages in thread
From: Jan Kara @ 2020-04-16 10:27 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	aneesh.kumar, sandeen

On Wed 15-04-20 22:50:19, Ritesh Harjani wrote:
> Sorry if this mail is not properly formatted. Somehow my mail server lost few
> of the emails and I didn't receive your email. I have to start exploring some
> other reliable method for at least receiving all the emails.
> But anyways, I copied this from patchwork so that we could continue our
> discussion.

Sure, no problem.

> >On Fri 10-04-20 00:15:44, Ritesh Harjani wrote:
> >> On 4/9/20 7:07 PM, Jan Kara wrote:
> >> > Hello Ritesh!
> >> > 
> >> > On Wed 08-04-20 22:24:10, Ritesh Harjani wrote:
> >> > > @@ -3908,16 +3919,13 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> >> > >   	mb_debug(1, "discard preallocation for group %u\n", group);
> >> > > -	if (list_empty(&grp->bb_prealloc_list))
> >> > > -		return 0;
> >> > > -
> >> > 
> >> > OK, so ext4_mb_discard_preallocations() is now going to lock every group
> >> > when we try to discard preallocations. That's likely going to increase lock
> >> > contention on the group locks if we are running out of free blocks when
> >> > there are multiple processes trying to allocate blocks. I guess we don't
> >> > care about the performace of this case too deeply but I'm not sure if the
> >> > cost won't be too big - probably we should check how much the CPU usage
> >> > with multiple allocating process trying to find free blocks grows...
> >> 
> >> Sure let me check the cpu usage in my test case with this patch.
> >> But either ways unless we take the lock we are not able to confirm
> >> that what are no. of free blocks available in the filesystem, right?
> >> 
> >> This mostly will happen only when there are lot of threads and due to
> >> all of their preallocations filesystem is running into low space and
> >> hence
> >> trying to discard all the preallocations. => so when FS is going low on
> >> space, isn't this cpu usage justifiable? (in an attempt to make sure we
> >> don't fail with ENOSPC)?
> >> Maybe not since this is only due to spinlock case, is it?
> >
> >As I wrote, I'm not too much concerned about *some* increase in CPU usage.
> >But I'd like to get that quantified because if we can say softlockup the
> >machine in the extreme case (or burn 100% of several CPUs spinning on the
> >lock), then we need a better mechanism to handle the preallocation
> >discarding and waiting...
> 
> So I did give this a thought and re-looked at our code again. So to me one other
> reason why we could be deleting the PAs from grp->bb_prealloc_list and adding
> it to our local list is because:-
> We should be able to make the grp->bb_prealloc_list empty as quickly as
> possible, so that if the other process tries to discard the same group's PA
> list, it should find the list empty and it could proceed discarding some
> other group's PA. This way instead of spending cpu cycles on spin lock waiting
> for discard to complete, we could as well try and discard some other group's
> PA list.
> Not sure if there is some other reason as well for doing it like this.

Hum, interesting idea but I somewhat doubt that was the reason... I rather
suspect that the reason was some lock ordering problem which got removed
later (e.g. maybe there originally was no pa->deleted entry to mark
preallocation as on its way to be deleted and then lock ordering would
basically dictate a local list of to-discard items).

> >> Or are you suggesting we should use some other method for discarding
> >> all the group's PA. So that other threads could sleep while discard is
> >> happening. Something like a discard work item which should free up
> >> all of the group's PA. But we need a way to determine if the needed
> >> no of blocks were freed so that we wake up and retry the allocation.
> >> 
> >> (Darrick did mentioned something on this line related to work/workqueue,
> >> but couldn't discuss much that time).
> >> 
> >> 
> >> > 
> >> > >   	bitmap_bh = ext4_read_block_bitmap(sb, group);
> >> > >   	if (IS_ERR(bitmap_bh)) {
> >> > >   		err = PTR_ERR(bitmap_bh);
> >> > >   		ext4_set_errno(sb, -err);
> >> > >   		ext4_error(sb, "Error %d reading block bitmap for %u",
> >> > >   			   err, group);
> >> > > -		return 0;
> >> > > +		goto out_dbg;
> >> > >   	}
> >> > >   	err = ext4_mb_load_buddy(sb, group, &e4b);
> >> > > @@ -3925,7 +3933,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> >> > >   		ext4_warning(sb, "Error %d loading buddy information for %u",
> >> > >   			     err, group);
> >> > >   		put_bh(bitmap_bh);
> >> > > -		return 0;
> >> > > +		goto out_dbg;
> >> > >   	}
> >> > >   	if (needed == 0)
> >> > > @@ -3967,9 +3975,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> >> > >   		goto repeat;
> >> > >   	}
> >> > > -	/* found anything to free? */
> >> > > +	/*
> >> > > +	 * If this list is empty, then return the grp->bb_free. As someone
> >> > > +	 * else may have freed the PAs and updated grp->bb_free.
> >> > > +	 */
> >> > >   	if (list_empty(&list)) {
> >> > >   		BUG_ON(free != 0);
> >> > > +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
> >> > > +			 group, grp->bb_free);
> >> > > +		free = grp->bb_free;
> >> > >   		goto out;
> >> > >   	}
> >> > 
> >> > OK, but this still doesn't reliably fix the problem, does it? Because
> >> > bb_free can be still zero and another process just has some extents
> >> > to free
> >> > in its local 'list' (e.g. because it has decided it doesn't have enough
> >> > extents, some were busy and it decided to cond_resched()), so bb_free will
> >> > increase from 0 only once these extents are freed.
> >> 
> >> This patch should reliably fix it, I think.
> >> So even if say Process P1 didn't free all extents, since some of the
> >> PAs were busy it decided to cond_resched(), that still means that the
> >> list(bb_prealloc_list) is not empty and whoever will get the
> >> ext4_lock_group() next will either
> >> get the busy PAs or it will be blocked on this lock_group() until all of
> >> the PAs were freed by processes.
> >> So if you see, we may never actually return 0 unless there are no PAs and
> >> grp->bb_free is truly 0.
> >
> >Right, I missed that. Thanks for correction.
> >
> >> But your case does show that grp->bb_free may not be the upper bound
> >> of free blocks for this group. It could be just 1 PA's free blocks, while
> >> other PAs are still in some other process's local list (due to
> >> cond_resched())
> >
> >Yes, but note that the return value of ext4_mb_discard_preallocations() is
> >effectively treated as bool in the end - i.e., did I free some blocks? If
> >yes, retry, if not -> ENOSPC.
> 
> Yes, that's true.
> 
> >Looking more into the code, I'm also somewhat concerned, whether your
> >changes cannot lead to excessive looping in ext4_mb_new_blocks(). Because
> >allocation request can restrict which blocks are eligible for
> >allocation. And thus even if there are some free blocks in the group, it
> >may not be possible to satisfy the request. Currently, the looping is
> 
> I am not very sure if above is true. I do see that we have a ac_criteria,
> which if it's 3, then even if any group is found with free space less than the
> original allocation request length, even then it could proceed with allocation.
> I see that we iterate till the criteria value is <= 3 in 
> ext4_mb_regular_allocator(). Also note in function ext4_mb_measure_extent().
> In that we note any free block extent into ac->ac_b_ex, unless on doing
> more scanning, we are able to find a better one.
> 
> But considering there are a lot of logic conditions put in here, maybe
> I have missed your case here. Could you please help explain this case
> where if there are some free blocks in the group i.e. grp->bb_free is non-zero,
> even then it may fail to allocate anything?

What I had in mind was e.g. an allocation with EXT4_MB_HINT_GOAL_ONLY
flag. Then AFAICT we will only try to allocate starting at given block and
if that happens to be unfreeable, we have to bail out with ENOSPC even if
the group has other free blocks. But as you said, the mballoc logic is
rather convoluted so I may be missing something as well :).
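
A toy model of that case (names and data are invented; the real logic lives in
mballoc's goal allocation path): the group's free count is non-zero, yet a
goal-only style request can still fail, because only one fixed block is ever
considered:

```c
#include <assert.h>
#include <stdbool.h>

#define GROUP_BLOCKS 8

/* 1 = free, 0 = in use. The group has free blocks, but not at the goal. */
static const bool block_free[GROUP_BLOCKS] = {0, 0, 1, 1, 0, 0, 0, 0};

/* Plays the role of grp->bb_free. */
static int count_free(void)
{
	int n = 0;

	for (int b = 0; b < GROUP_BLOCKS; b++)
		n += block_free[b];
	return n;
}

/* A GOAL_ONLY-style request: succeed only if the goal block itself is free. */
static bool alloc_goal_only(int goal)
{
	return block_free[goal];
}
```

Here count_free() returns 2, yet alloc_goal_only(0) fails forever, so a retry
loop gated only on a non-zero free count would never terminate.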

> >limited by the fact that you have to discard some preallocation to loop
> >again. With your bb_free check, this protection is removed and you could
> >loop in principle indefinitely AFAICS.
> 
> So even with a positive grp->bb_free value, if ext4_mb_regular_allocator()
> cannot make forward progress of any kind, then yes this indefinite looping
> is possible. But I am unable to find a case where this is true.
> 
> >
> >> > Honestly, I don't understand why ext4_mb_discard_group_preallocations()
> >> > bothers with the local 'list'. Why doesn't it simply free the preallocation
> >> 
> >> Let's see if someone else know about this. I am not really sure
> >> why it was done this way.
> 
> One reason could be trying to free other group's PA list in parallel
> (as explained above). Not sure if there was any other reason apart from this
> though.
> 
> >> 
> >> > right away? And that would also basically fix your problem (well, it would
> >> > still theoretically exist because there's still freeing of one extent
> >> > potentially pending but I'm not sure if that will still be a practical
> >> > issue).
> >> 
> >> I guess this still can be a problem. So let's say if the process P1
> >> just checks that the list was not empty and then in parallel process P2
> >> just deletes the last entry - then when process P1 iterates over the list,
> >> it will find it empty and return 0, which may return -ENOSPC failure.
> >> unless we again take the group lock to check if the list is really free
> >> and return grp->bb_free if it is.
> >
> >I see. You're correct this could be an issue. Good spotting! So we really
> >need to check if some preallocation was discarded (either by us or someone
> >else) since we last tried allocation. If yes, loop again, if no, return
> >ENOSPC. Do you agree? This could be implemented quite efficiently by
> >"preallocation discard" sequence counter. We'd sample it before trying
> >preallocation, then again after returning from
> >ext4_mb_discard_preallocations() and if it differs, we'll loop. What do you
> >think?
> 
> IIUC, write_seqcounter needs a locking since write paths cannot be nested.
> So the caller needs to implement its own locking for the seqcounter. So what could
> actually happen is even if update write_seqcount_begin/end() inside
> ext4_lock/unlock_group(). It may still happen that some other thread with a
> different group can proceed with it's own different group lock and call
> write_seqcount_begin() which will make it nested (since we are keeping
> seqcount at EXT4 super_block level), unless we decide to keep seqcount also
> at group level. But that would also mean that we need to sample the
> read_seqcount for all groups in beginning?
> 
> Hence the slight trouble understanding a way around this issue.
> Does this make sense or am I lost here?

Right, it's not as simple as I thought :). So how I'd do it is to have
atomic_t counter in the superblock, increment it for each discard of a
preallocated extent (likely in ext4_mb_release_group_pa() and
ext4_mb_release_inode_pa()). Sample it in ext4_mb_new_blocks() before the
first allocation and then again after ext4_mb_discard_preallocations() to
check whether anything was discarded.

Now it could happen that this atomic_t will actually become a bottleneck
due to cacheline bouncing if there's heavy discarding of preallocations
happening from multiple CPUs. In case we find a workload where that would
happen we'd need to make it a per-cpu counter and then play some tricks
with the sampling - like do the first sampling in ext4_mb_new_blocks() with
percpu_counter_read() (to avoid the overhead of summing the counter for
common allocations) and if there's no space, use percpu_counter_sum() to
get accurate value. This could result in falsely retrying once if there's
no space but that's not that bad...
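
For illustration, here is a minimal userspace sketch of the sampling
scheme I have in mind. All the names here (sbi_discard_seq, try_alloc,
alloc_with_retry, release_one_pa) are illustrative stand-ins, not the
real ext4 API, and the single-threaded simulation only demonstrates the
retry logic, not the locking:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Superblock-wide "discard sequence" counter, bumped on every PA
 * discard (as would be done in ext4_mb_release_{group,inode}_pa()). */
static atomic_int sbi_discard_seq;
static int free_blocks;		/* stand-in for allocatable free space */

static void release_one_pa(int blocks)
{
	free_blocks += blocks;
	atomic_fetch_add(&sbi_discard_seq, 1);
}

static bool try_alloc(int needed)
{
	if (free_blocks >= needed) {
		free_blocks -= needed;
		return true;
	}
	return false;
}

/*
 * Retry the allocation as long as someone (us or another thread)
 * discarded a preallocation since our last attempt; otherwise report
 * failure (the -ENOSPC case). pa_count/pa_blocks simulate the PAs
 * that ext4_mb_discard_preallocations() would be able to free.
 */
static bool alloc_with_retry(int needed, int pa_blocks, int pa_count)
{
	for (;;) {
		int seq = atomic_load(&sbi_discard_seq);

		if (try_alloc(needed))
			return true;
		/* simulate ext4_mb_discard_preallocations() */
		if (pa_count > 0) {
			release_one_pa(pa_blocks);
			pa_count--;
		}
		/* nothing discarded since we sampled -> give up */
		if (atomic_load(&sbi_discard_seq) == seq)
			return false;
	}
}
```

The key property is that a thread only returns failure after observing
that no discard at all happened between its sample and its retry check.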

> Now, while we are at this, I implemented another approach with a
> fastpath and a slowpath in ext4_mb_discard_group_preallocations().
> But this still relies upon the grp->bb_free counter, which means we
> still need an answer to the concern above, where we assume that if
> grp->bb_free is non-zero then forward progress is possible.
> 
> While reading more code, I see that updating grp->bb_free and
> grp->bb_prealloc_list is not atomic: the updates happen in two
> different functions, and the lock is released in between. So relying
> on grp->bb_free when the PA list is empty is only an approximation in
> our algorithm. It may happen that some processes come in parallel and
> allocate all grp->bb_free blocks before the PA list entries for them
> are created. If that happens and we return grp->bb_free as 0, this
> could again be a corner case of an ENOSPC failure, because there was
> still some space left that was about to be added to the group's PA
> list. But that would also mean we are anyway reaching the ENOSPC
> limit, so it should be OK to get an ENOSPC failure then, right?
> 
> Meanwhile, since I anyway have that patch with me, I will reply to
> this email with the new patch. It should address your concern about
> spinning unnecessarily on the group lock. The earlier approach also
> had the other problem, as explained, that it could have made all
> discards proceed only sequentially. With the new fastpath/slowpath
> approach it is still possible to discard in parallel (if, by the time
> we check list_empty(), some other process has emptied this group's
> bb_prealloc_list and moved all the PAs to its local list).

OK, I'll have a look.
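
For reference, the fastpath/slowpath loop described in the quoted text
could be sketched in a simplified userspace form as follows. The struct
and function names are stand-ins for the real mballoc structures, the
per-group locking is omitted, and the "needed" accounting is simplified
to a freed-vs-needed comparison:

```c
#include <stdbool.h>

struct group {
	int pa_blocks;	/* blocks held in this group's PA list */
	int bb_free;	/* free blocks accounted to the group */
};

static int discard_group(struct group *grp, bool fastpath)
{
	if (grp->pa_blocks == 0) {
		/* fast pass: don't wait on a contended/empty group,
		 * just move on to the next one */
		if (fastpath)
			return 0;
		/* slow pass: report bb_free so the caller knows
		 * whether forward progress is still possible */
		return grp->bb_free;
	}
	/* discard this group's PAs and account the blocks as free */
	int freed = grp->pa_blocks;
	grp->bb_free += freed;
	grp->pa_blocks = 0;
	return freed;
}

static int discard_preallocations(struct group *groups, int ngroups,
				  int needed)
{
	bool fastpath = true;
	int freed = 0;

repeat:
	for (int i = 0; i < ngroups && needed > freed; i++)
		freed += discard_group(&groups[i], fastpath);
	/* nothing freed on the fast pass: retry with the slow pass */
	if (!freed && fastpath) {
		fastpath = false;
		goto repeat;
	}
	return freed;
}
```

The fast pass keeps multiple discarding threads from serializing on one
group, while the slow pass is the fallback that either frees PAs or
reports grp->bb_free.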

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations
  2020-04-15 17:23           ` [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations Ritesh Harjani
@ 2020-04-20 14:38             ` Jan Kara
  2020-04-21  4:24               ` Ritesh Harjani
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2020-04-20 14:38 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, linux-fsdevel,
	aneesh.kumar, sandeen

On Wed 15-04-20 22:53:01, Ritesh Harjani wrote:
> There could be a race in ext4_mb_discard_group_preallocations(),
> where the 1st thread may iterate through a group's bb_prealloc_list,
> remove all the PAs and add them to the function's local list head.
> Now if the 2nd thread comes in to discard the group preallocations,
> it will see that group->bb_prealloc_list is empty and will return 0.
> 
> Consider a case where we have only a few groups (e.g. just group 0);
> this may even cause ext4_mb_new_blocks() (which calls
> ext4_mb_discard_group_preallocations()) to return -ENOSPC.
> But that is wrong, since the 2nd thread should have waited for the
> 1st thread to release all the PAs and then retried the allocation,
> as the 1st thread was going to discard the PAs anyway.
> 
> This patch fixes the race by introducing two paths (fastpath and
> slowpath). We first try the fastpath via
> ext4_mb_discard_preallocations(): if a group's PA list is empty,
> then instead of waiting on the group lock we continue to discard
> other groups' PAs. This helps maintain parallelism when discarding
> multiple groups' PA lists. If at the end a process is still unable
> to find any freed blocks, we retry, freeing each group's PA list
> while holding the group lock. And if the PA list is empty even then,
> we return grp->bb_free, which tells us whether the group has any
> free blocks left with which to make forward progress.
> 
> Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>

Ritesh, do you still want to push this patch as is or do you plan to change
it after a discussion on Thursday?

> @@ -3967,9 +3986,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  		goto repeat;
>  	}
>  
> -	/* found anything to free? */
> +	/*
> +	 * If this list is empty, then return the grp->bb_free. As someone
> +	 * else may have freed the PAs and updated grp->bb_free.
> +	 */
>  	if (list_empty(&list)) {
>  		BUG_ON(free != 0);
> +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
> +			 group, grp->bb_free);
> +		free = grp->bb_free;
>  		goto out;
>  	}

I'm still somewhat concerned about the forward progress guarantee here...
If you're convinced the allocation from goal is the only possibility of
lockup and that logic can be removed, then please remove it and then write a
good comment why lockup is not possible due to this.

> @@ -4464,17 +4492,39 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>  	return 0;
>  }
>  
> +/*
> + * ext4_mb_discard_preallocations: This function loops over each
> + * group's prealloc list and tries to free it. More than one process
> + * may call this function in parallel. That's why we initially take a
> + * fastpath approach in which we first check if grp->bb_prealloc_list
> + * is empty; that could mean that someone else has removed all of its
> + * PAs and added them to its own local list. So we quickly return from
> + * there and try to discard the next group's PAs. This way we try to
> + * parallelize discarding of multiple groups' PAs. But in case a
> + * process is unfortunate enough to be unable to free any group's PA,
> + * we retry with the slow path, which guarantees that either some PAs
> + * will be freed or we will get group->bb_free blocks (a non-zero
> + * grp->bb_free guarantees forward progress in ext4_mb_new_blocks()).
> + */
>  static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
>  {
>  	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
>  	int ret;
>  	int freed = 0;
> +	bool fastpath = true;
> +	int tmp_needed;
>  
> -	trace_ext4_mb_discard_preallocations(sb, needed);
> -	for (i = 0; i < ngroups && needed > 0; i++) {
> -		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
> +repeat:
> +	tmp_needed = needed;
> +	trace_ext4_mb_discard_preallocations(sb, tmp_needed);
> +	for (i = 0; i < ngroups && tmp_needed > 0; i++) {
> +		ret = ext4_mb_discard_group_preallocations(sb, i, tmp_needed,
> +							   fastpath);
>  		freed += ret;
> -		needed -= ret;
> +		tmp_needed -= ret;
> +	}

Why do you need 'tmp_needed'? When freed is 0, tmp_needed == needed, right?

> +	if (!freed && fastpath) {
> +		fastpath = false;
> +		goto repeat;
>  	}

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations
  2020-04-20 14:38             ` Jan Kara
@ 2020-04-21  4:24               ` Ritesh Harjani
  0 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2020-04-21  4:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, linux-fsdevel, aneesh.kumar, sandeen

Hello Jan,

Thanks for your email.

On 4/20/20 8:08 PM, Jan Kara wrote:
> On Wed 15-04-20 22:53:01, Ritesh Harjani wrote:
>> There could be a race in ext4_mb_discard_group_preallocations(),
>> where the 1st thread may iterate through a group's bb_prealloc_list,
>> remove all the PAs and add them to the function's local list head.
>> Now if the 2nd thread comes in to discard the group preallocations,
>> it will see that group->bb_prealloc_list is empty and will return 0.
>>
>> Consider a case where we have only a few groups (e.g. just group 0);
>> this may even cause ext4_mb_new_blocks() (which calls
>> ext4_mb_discard_group_preallocations()) to return -ENOSPC.
>> But that is wrong, since the 2nd thread should have waited for the
>> 1st thread to release all the PAs and then retried the allocation,
>> as the 1st thread was going to discard the PAs anyway.
>>
>> This patch fixes the race by introducing two paths (fastpath and
>> slowpath). We first try the fastpath via
>> ext4_mb_discard_preallocations(): if a group's PA list is empty,
>> then instead of waiting on the group lock we continue to discard
>> other groups' PAs. This helps maintain parallelism when discarding
>> multiple groups' PA lists. If at the end a process is still unable
>> to find any freed blocks, we retry, freeing each group's PA list
>> while holding the group lock. And if the PA list is empty even then,
>> we return grp->bb_free, which tells us whether the group has any
>> free blocks left with which to make forward progress.
>>
>> Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> 
> Ritesh, do you still want to push this patch as is or do you plan to change
> it after a discussion on Thursday?

I need more time on this. I will get back once I have studied how
your suggested approach works out.

-ritesh


> 
>> @@ -3967,9 +3986,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>   		goto repeat;
>>   	}
>>   
>> -	/* found anything to free? */
>> +	/*
>> +	 * If this list is empty, then return the grp->bb_free. As someone
>> +	 * else may have freed the PAs and updated grp->bb_free.
>> +	 */
>>   	if (list_empty(&list)) {
>>   		BUG_ON(free != 0);
>> +		mb_debug(1, "Someone may have freed PA for this group %u, grp->bb_free %d\n",
>> +			 group, grp->bb_free);
>> +		free = grp->bb_free;
>>   		goto out;
>>   	}
> 
> I'm still somewhat concerned about the forward progress guarantee here...
> If you're convinced the allocation from goal is the only possibility of
> lockup and that logic can be removed, then please remove it and then write a
> good comment why lockup is not possible due to this.
> 
>> @@ -4464,17 +4492,39 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>>   	return 0;
>>   }
>>   
>> +/*
>> + * ext4_mb_discard_preallocations: This function loops over each
>> + * group's prealloc list and tries to free it. More than one process
>> + * may call this function in parallel. That's why we initially take a
>> + * fastpath approach in which we first check if grp->bb_prealloc_list
>> + * is empty; that could mean that someone else has removed all of its
>> + * PAs and added them to its own local list. So we quickly return from
>> + * there and try to discard the next group's PAs. This way we try to
>> + * parallelize discarding of multiple groups' PAs. But in case a
>> + * process is unfortunate enough to be unable to free any group's PA,
>> + * we retry with the slow path, which guarantees that either some PAs
>> + * will be freed or we will get group->bb_free blocks (a non-zero
>> + * grp->bb_free guarantees forward progress in ext4_mb_new_blocks()).
>> + */
>>   static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
>>   {
>>   	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
>>   	int ret;
>>   	int freed = 0;
>> +	bool fastpath = true;
>> +	int tmp_needed;
>>   
>> -	trace_ext4_mb_discard_preallocations(sb, needed);
>> -	for (i = 0; i < ngroups && needed > 0; i++) {
>> -		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
>> +repeat:
>> +	tmp_needed = needed;
>> +	trace_ext4_mb_discard_preallocations(sb, tmp_needed);
>> +	for (i = 0; i < ngroups && tmp_needed > 0; i++) {
>> +		ret = ext4_mb_discard_group_preallocations(sb, i, tmp_needed,
>> +							   fastpath);
>>   		freed += ret;
>> -		needed -= ret;
>> +		tmp_needed -= ret;
>> +	}
> 
> Why do you need 'tmp_needed'? When freed is 0, tmp_needed == needed, right?
> 
>> +	if (!freed && fastpath) {
>> +		fastpath = false;
>> +		goto repeat;
>>   	}
> 
> 								Honza
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-04-21  4:25 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-08 16:54 [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
2020-04-08 16:54 ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Ritesh Harjani
2020-04-09 13:37   ` Jan Kara
2020-04-09 18:45     ` Ritesh Harjani
2020-04-10 15:52       ` Jan Kara
2020-04-15 17:20         ` Ritesh Harjani
2020-04-15 17:23           ` [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations Ritesh Harjani
2020-04-20 14:38             ` Jan Kara
2020-04-21  4:24               ` Ritesh Harjani
2020-04-16 10:27           ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Jan Kara
2020-04-08 17:06 ` [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
