All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ritesh Harjani <riteshh@linux.ibm.com>
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, jack@suse.cz, adilger.kernel@dilger.ca,
	linux-fsdevel@vger.kernel.org,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>,
	sandeen@sandeen.net
Subject: Re: [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations
Date: Wed, 8 Apr 2020 22:36:44 +0530	[thread overview]
Message-ID: <20200408170647.31019A4040@d06av23.portsmouth.uk.ibm.com> (raw)
In-Reply-To: <cover.1586358980.git.riteshh@linux.ibm.com>


Sorry about the typo.
ext4_mb_group_discard_preallocations() should be 
ext4_mb_discard_group_preallocations()

-ritesh

On 4/8/20 10:24 PM, Ritesh Harjani wrote:
> Hello All,
> 
> There seems to be a below race with current
> ext4_mb_group_discard_preallocations() function. This could be fairly
> easily reproduced on a system with only 1 group. But internally this was even
> reported with more than 1 group. We could reproduce this on both 64K and 4K
> blocksize filesystem.
> 
> This is a RFC patch sent out for reviews, feedback and/or any comments.
> Below mail provide all the necessary details.
> 
> Test setup
> ==========
> 1. It's a multithreaded test case where each thread is trying to create a
> file using open() -> ftruncate().
> 2. Then we are doing mmap of this file for filesize bytes.
> 3. Then start writing sequentially byte by byte for full filesize.
> 
> Details for creating it on loop device:-
> ======================================
> 1. truncate -s 240M filefs   (easier on a smaller filesystem)
> 2. losetup /dev/loop0 filefs
> 3. mkfs.ext4 -F -b 65536 /dev/loop0
> 4. mkdir mnt
> 5. mount -t ext4 /dev/loop0 mnt/
> 6. cd mnt/
> 7. Start running the test case mentioned above in above "Test setup".
> (for our test we were keeping no. of threads to ~70 and filling the available
> filesystem space (df -h) to around ~80%/70%. Based on that each filesize is
> calculated).
> 8. cd .. (once the test finishes)
> 9. umount mnt
> 10. Go to step 3.
> 
> Test (test-ext4-mballoc.c) file and script which does the
> unmount/mount and run the ./a.out is mentioned at [1], [2].
> 
> 
> Analysis:-
> ==========
> 
> It seems below race could be occurring
> 	P1 							P2
> ext4_mb_new_blocks() 						|
> 	|						ext4_mb_new_blocks()
> 	|
> ext4_mb_group_discard_preallocations() 				|
> 		| 				ext4_mb_group_discard_preallocations()
> 	if (list_empty(&grp->bb_prealloc_list)
> 		return 0; 					|
> 		| 						|
> 	ext4_lock_group() 					|
> 	list_for_each_entry_safe() {  				|
> 		<..>  						|
> 		list_del(&pa->pa_group_list);  			|
> 		list_add(&pa->u.pa_tmp_list, &list) 		|
> 	} 							|
> 								|
> 	processing-local-list() 		if(list_empty(&grp->bb_prealloc_list)
> 	 	|					return 0
> 	<...>
> 	ext4_unlock_group()
> 
> What we see here is that, there are multiple threads which are trying to allocate.
> But since there is not enough space, they try to discard the group preallocations.
> (will be easy to understand if we only consider group 0, though it could
> be reproduced with multiple groups as well).
> Now while more than 1 thread tries to free up the group preallocations, there
> could be 1 thread (P2) which sees that the bb_prealloc_list is already
> empty and will assume that there is nothing to free from here. Hence return 0.
> Now consider this happens with thread P2 for all other groups as well (where some other
> thread came in early and freed up the group preallocations). At that point,
> P2 sees that the total freed blocks returned by ext4_mb_discard_preallocations()
> back to ext4_mb_new_blocks() is 0 and hence it does not retry the allocation,
> instead it returns -ENOSPC error.
> 
> This causes SIGBUS to the application who was doing an mmap write. Once
> the application crashes we could still see that the filesystem available space
> is more than ~70-80% (worst case scenario). So ideally P2 should have waited
> for P1 to get over and should have checked how much P1 could free up.
> 
> 
> Solution (based on the understanding of the mballoc code)
> =========================================================
> 
> We think that it is best to check if there is anything to be freed
> within ext4_group_lock(). i.e. to check if the bb_prealloc_list is empty.
> This patch attempts to fix this race by checking if nothing could be collected
> in the local list. This could mean that someone else might have freed
> all of this group PAs for us. So simply return group->bb_free which
> should also give us an upper bound on the total available space for
> allocation in this group.
> 
> We need not worry about the fast path of whether the list is empty without
> taking ext4_group_lock(), since we are anyway in the slow path where the
> ext4_mb_regular_allocator() failed and hence we are now desperately trying
> to discard all the group PAs to free up some space for allocation.
> 
> 
> Please correct if any of below understanding is incorrect:-
> ==========================================================
> 1. grp->bb_free is the available number of free blocks in that group for
>     allocation by anyone.
> 2. If grp->bb_free is non-zero and we call ext4_mb_regular_allocator(ac),
> then it will always return ac->ac_status == AC_STATUS_FOUND
> (and it could even allocate and return less than the requested no. of blocks).
> 3. There shouldn't be any infinte loop in ext4_mb_new_blocks() after we
> return grp->bb_free in this patch.
> (i.e. between ext4_mb_regular_allocator() and ext4_mb_discard_preallocations())
> It could only happen if ext4_mb_regular_allocator cannot make any forward
> progress even if grp->bb_free is non-zero.
> But IIUC, that won't happen. Please correct here.
> 
> Tests run:-
> ==========
> For now I have only done unit testing with the shared test code [1] [2].
> Wanted to post this RFC for review comments/discussion.
> 
> Resources:
> ==========
> [1] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/tools/test-ext4-mballoc.c
> [2] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/tools/test_mballoc.sh
> 
> Ritesh Harjani (1):
>    ext4: Fix race in ext4_mb_discard_group_preallocations()
> 
>   fs/ext4/mballoc.c | 31 ++++++++++++++++++++++++-------
>   1 file changed, 24 insertions(+), 7 deletions(-)
> 


      parent reply	other threads:[~2020-04-08 17:06 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-08 16:54 [RFC 0/1] ext4: Fix mballoc race in freeing up group preallocations Ritesh Harjani
2020-04-08 16:54 ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Ritesh Harjani
2020-04-09 13:37   ` Jan Kara
2020-04-09 18:45     ` Ritesh Harjani
2020-04-10 15:52       ` Jan Kara
2020-04-15 17:20         ` Ritesh Harjani
2020-04-15 17:23           ` [RFCv2 1/1] ext4: Fix race in ext4 mb discard group preallocations Ritesh Harjani
2020-04-20 14:38             ` Jan Kara
2020-04-21  4:24               ` Ritesh Harjani
2020-04-16 10:27           ` [RFC 1/1] ext4: Fix race in ext4_mb_discard_group_preallocations() Jan Kara
2020-04-08 17:06 ` Ritesh Harjani [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200408170647.31019A4040@d06av23.portsmouth.uk.ibm.com \
    --to=riteshh@linux.ibm.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=sandeen@sandeen.net \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.