All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
	Ritesh Harjani <riteshh@linux.ibm.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	rookxu <brookxu.cn@gmail.com>,
	Ritesh Harjani <ritesh.list@gmail.com>
Subject: Re: [PATCH v3 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list
Date: Fri, 27 Jan 2023 15:43:12 +0100	[thread overview]
Message-ID: <20230127144312.3m3hmcufcvxxp6f4@quack3> (raw)
In-Reply-To: <Y8jizbGg6l2WxJPF@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com>

Hi Ojaswin!

I'm sorry for a bit delayed reply...

On Thu 19-01-23 11:57:25, Ojaswin Mujoo wrote:
> On Tue, Jan 17, 2023 at 12:03:35PM +0100, Jan Kara wrote:
> > On Tue 17-01-23 16:00:47, Ojaswin Mujoo wrote:
> > > On Mon, Jan 16, 2023 at 01:23:34PM +0100, Jan Kara wrote:
> > > > > Since this covers the special case we discussed above, we will always
> > > > > un-delete the PA when we encounter the special case and we can then
> > > > > adjust for overlap and traverse the PA rbtree without any issues.
> > > > > 
> > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > > > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> > > 
> > > Hi Jan,
> > > Thanks for the review, sharing some of my thoughts below.
> > > 
> > > > 
> > > > So I find this putting back of already deleted inode PA very fragile. For
> > > > example in current code I suspect you've missed a case in ext4_mb_put_pa()
> > > > which can mark inode PA (so it can then be spotted by
> > > > ext4_mb_pa_adjust_overlap() and marked as in use again) but
> > > > ext4_mb_put_pa() still goes on and destroys the PA.
> > > 
> > > The 2 code paths that clash here are:
> > > 
> > > ext4_mb_new_blocks() -> ext4_mb_release_context() -> ext4_mb_put_pa()
> > > ext4_mb_new_blocks() -> ext4_mb_normalize_request() -> ext4_mb_pa_adjust_overlap()
> > > 
> > > Since these are the only code paths from which these 2 functions are
> > > called, for a given inode, access will always be serialized by the upper
> > > level ei->i_data_sem, which is always taken when writing data blocks
> > > using ext4_mb_new_block(). 
> > 
> > Indeed, inode->i_data_sem prevents the race I was afraid of.
> >  
> > > From my understanding of the code, I feel only
> > > ext4_mb_discard_group_preallocations() can race against other functions
> > > that are modifying the PA rbtree since it does not take any inode locks.
> > > 
> > > That being said, I do understand your concerns regarding the solution,
> > > however I'm willing to work with the community to ensure our
> > > implementation of this undelete feature is as robust as possible. Along
> > > with fixing the bug reported here [1], I believe that it is also a good
> > > optimization to have especially when the disk is near full and we are
> > > seeing a lot of group discards going on. 
> > > 
> > > Also, in case the deleted PA completely lies inside our new range, it is
> > > much better to just undelete and use it rather than deleting the
> > > existing PA and reallocating the range again. I think the advantage
> > > would be even bigger in ext4_mb_use_preallocated() function where we can
> > > just undelete and use the PA and skip the entire allocation, incase original
> > > range lies in a deleted PA.
> > 
> > Thanks for explantion. However I think you're optimizing the wrong thing.
> > We are running out of space (to run ext4_mb_discard_group_preallocations()
> > at all) and we allocate from an area covered by PA that we've just decided
> > to discard - if anything relies on performance of the filesystem in ENOSPC
> > conditions it has serious problems no matter what. Sure, we should deliver
> > the result (either ENOSPC or some block allocation) in a reasonable time
> > but the performance does not really matter much because all the scanning
> > and flushing is going to slow down everything a lot anyway. One additional
> > scan of the rbtree is really negligible in this case. So what we should
> > rather optimize for in this case is the code simplicity and maintainability
> > of this rare corner-case that will also likely get only a small amount of
> > testing. And in terms of code simplicity the delete & restart solution
> > seems to be much better (at least as far as I'm imagining it - maybe the
> > code will prove me wrong ;)).
> Hi Jan,
> 
> So I did try out the 'rb_erase from ext4_mb_adjust_overlap() and retry' method,
> with ane extra pa_removed flag, but the locking is getting pretty messy. I'm
> not sure if such a design is possible is the lock we currently have. 
> 
> Basically, the issue I'm facing is that we are having to drop the
> locks read locks and accquire the write locks in
> ext4_mb_adjust_overlap(), which looks something like this:
> 
> 				spin_unlock(&tmp_pa->pa_lock);
> 				read_unlock(&ei->i_prealloc_lock);
> 
> 				write_lock(&ei->i_prealloc_lock);
> 				spin_lock(&tmp_pa->pa_lock);
> 
> We have to preserve the order and drop both tree and PA locks to avoid
> deadlocks.  With this approach, the issue is that in between dropping and
> accquiring this lock, the group discard path can actually go ahead and free the
> PA memory after calling rb erase on it, which can result in use after free in
> the adjust overlap path.  This is because the PA is freed without any locks in
> discard path, as it assumes no other thread will have a reference to it. This
> assumption was true earlier since our allocation path never gave up the rbtree
> lock however it is not possible with this approach now.  Essentially, the
> concept of having two different areas where a PA can be deleted is bringing in
> additional challenges and complexity, which might make things worse from a
> maintainers/reviewers point of view.

Right, I didn't realize that. That is nasty.

> After brainstorming a bit, I think there might be a few alternatives here:
> 
> 1. Instead of deleting PA in the adjust overlap thread, make it sleep till group
> discard path goes ahead and deletes/frees it. At this point we can wake it up and retry
> allocation. 
> 
> * Pros: We can be sure that PA would have been removed at the time of retry so
> we don't waste extra retries. C
> * Cons: Extra complexity in code. 
> 
> 2. Just go for a retry in adjust overlap without doing anything. In ideal case,
> by the time we start retrying the PA might be already removed. Worse case: We
> keep looping again and again since discard path has not deleted it yet.
> 
> * Pros: Simplest approach, code remains straightforward.
> * Cons: We can end up uselessly retrying if the discard path doesn't delete the PA fast enough.

Well, I think cond_resched() + goto retry would be OK here. We could also
cycle the corresponding group lock which would wait for
ext4_mb_discard_group_preallocations() to finish but that is going to burn
the CPU even more than the cond_resched() + retry as we'll be just spinning
on the spinlock. Sleeping is IMHO not warranted as the whole
ext4_mb_discard_group_preallocations() is running under a spinlock anyway
so it should better be a very short sleep.

Or actually I have one more possible solution: What the adjusting function
is doing that it looks up PA before and after ac->ac_o_ex.fe_logical and
trims start & end to not overlap these PAs. So we could just lookup these
two PAs (ignoring the deleted state) and then just iterate from these with
rb_prev() & rb_next() until we find not-deleted ones. What do you think? 
 
> 3. The approach of undeleting the PA (proposed in this patchset) that
> we've already discussed.
> 
> Now, to be honest, I still prefer the undelete PA approach as it makes more
> sense to me and I think the code is simple enough as there are not many paths
> that might race. Mostly just adjust_overlap and group discard or
> use_preallocated and group discard.

Yeah, I'm still not too keen on this but I'm willing to reconsider if 
above approach proves to be too expensive under ENOSPC conditions...

									Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply	other threads:[~2023-01-27 14:43 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-16  8:02 [PATCH v3 0/8] ext4: Convert inode preallocation list to an rbtree Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 1/8] ext4: Stop searching if PA doesn't satisfy non-extent file Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 2/8] ext4: Refactor code related to freeing PAs Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 3/8] ext4: Refactor code in ext4_mb_normalize_request() and ext4_mb_use_preallocated() Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 4/8] ext4: Move overlap assert logic into a separate function Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 5/8] ext4: Abstract out overlap fix/check logic in ext4_mb_normalize_request() Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 6/8] ext4: Convert pa->pa_inode_list and pa->pa_obj_lock into a union Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list Ojaswin Mujoo
2023-01-16 12:23   ` Jan Kara
2023-01-17 10:30     ` Ojaswin Mujoo
2023-01-17 11:03       ` Jan Kara
2023-01-19  6:27         ` Ojaswin Mujoo
2023-01-27 14:43           ` Jan Kara [this message]
2023-02-03  8:36             ` Ojaswin Mujoo
2023-02-08 11:25               ` Ojaswin Mujoo
2023-02-09 10:54                 ` Jan Kara
2023-02-09 17:54                   ` Ojaswin Mujoo
2023-02-10 14:37                     ` Jan Kara
2023-02-13 17:58                       ` Ojaswin Mujoo
2023-02-14  8:50                         ` Jan Kara
2023-02-16 17:07                       ` Andreas Dilger
2023-03-17 12:40                         ` Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 8/8] ext4: Remove the logic to trim inode PAs Ojaswin Mujoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230127144312.3m3hmcufcvxxp6f4@quack3 \
    --to=jack@suse.cz \
    --cc=brookxu.cn@gmail.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=riteshh@linux.ibm.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.