Linux-ext4 Archive on lore.kernel.org
From: Zhang Yi <yi.zhang@huawei.com>
To: Jan Kara <jack@suse.cz>, Theodore Ts'o <tytso@mit.edu>
Cc: Christoph Hellwig <hch@infradead.org>,
	<linux-ext4@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>,
	<adilger.kernel@dilger.ca>, <yukuai3@huawei.com>
Subject: Re: [RFC PATCH v2 7/7] ext4: fix race between blkdev_releasepage() and ext4_put_super()
Date: Fri, 23 Apr 2021 19:39:09 +0800
Message-ID: <9c83866e-7517-2051-8894-bca2892df1b6@huawei.com>
In-Reply-To: <20210422090410.GA26221@quack2.suse.cz>

On 2021/4/22 17:04, Jan Kara wrote:
> On Wed 21-04-21 12:57:39, Theodore Ts'o wrote:
>> On Wed, Apr 21, 2021 at 03:46:34PM +0200, Jan Kara wrote:
>>>
>>> Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only
>>> by ext4. So maybe it is not that important? I agree with Zhang and
>>> Christoph that getting the lifetime rules sorted out will be hairy and it
>>> is questionable, whether it is worth the additional pages we can reclaim.
>>> Ted, do you remember what was the original motivation for this?
>>
>> The comment in fs/ext4/super.c is I thought a pretty good explanation:
>>
>> /*
>>  * Try to release metadata pages (indirect blocks, directories) which are
>>  * mapped via the block device.  Since these pages could have journal heads
>>  * which would prevent try_to_free_buffers() from freeing them, we must use
>>  * jbd2 layer's try_to_free_buffers() function to release them.
>>  */
>>
>> When we modify a metadata block, we attach a journal_head (jh)
>> structure to the buffer_head, and bump the ref count to prevent the
>> buffer from being freed.  Before the transaction is committed, the
>> buffer is marked jbddirty, but the dirty bit is not set until the
>> transaction commit.
>>
>> At that point, writeback happens entirely at the discretion of the
>> buffer cache.  The jbd layer doesn't get notification when the I/O is
>> completed, nor when there is an I/O error.  (There was an attempt to
>> add a callback but that was NACK'ed because of a complaint that it was
>> jbd specific.)
>>
>> So we don't actually know when it's safe to detach the jh from the
>> buffer_head and can drop the refcount so that the buffer_head can be
>> freed.  When the space in the journal starts getting low, we'll look
>> at the jh's attached to completed transactions, and see how many of
>> them have clean bh's, and at that point, we can release the buffer
>> heads.
>>
>> The other time when we'll attempt to detach jh's from clean buffers is
>> via bdev_try_to_free_buffers().  So if we drop the
>> bdev_try_to_free_page hook, then when we are under memory pressure,
>> there could be potentially a large percentage of the buffer cache
>> which can't be freed, and so the OOM-killer might trigger more often.
> 
> Yes, I understand that. What I was more asking about is: does it really
> matter if we leave those buffer heads and journal heads unreclaimed? I
> understand it could be triggering premature OOM in theory but is it a
> problem in practice? Was there some observed practical case for which this
> was added or was it just added due to the theoretical concern?
> 
>> Now, if we could get a callback on I/O completion on a per-bh basis,
>> then we could detach the jh when the buffer is clean --- and as a
>> bonus, we'd get a notification when there was an I/O error writing
>> back a metadata block, which would be even better.
>>
>> So how about an even swap?  If we can get a buffer I/O completion
>> callback, we can drop the bdev_try_to_free_page hook.....
> 
> I'm OK with that because mainly for IO error reporting it makes sense to
> me. For this memory reclaim problem I think we have also other reasonably
> sensible options. E.g. we could have a shrinker that would just walk the
> checkpoint list and reclaim journal heads for whatever is already written
> out... Or we could just release journal heads already after commit and
> during checkpoint we'd fetch the list of blocks that may need to be written
> out e.g. from journal descriptor blocks. This would be a larger overhaul
> but as a bonus we'd get rid of probably the last place in the kernel which
> can write out page contents through buffer heads without updating page
> state properly (and thus get rid of some workarounds in mm code as well).

Thanks for these suggestions. I understand your first solution and it sounds
good, but I do not understand your last sentence: how does ext4 fail to update
page state properly? Could you explain that in more detail?
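
To make the first option concrete, here is a minimal user-space sketch of a
shrinker-style walk over a checkpoint list that reclaims journal heads whose
buffers are already written out. Everything in it is hypothetical: the
model_bh and model_jh types and the model_* functions are illustrative
stand-ins, not the real buffer_head, journal_head, or jbd2 interfaces, and
locking, transactions, and I/O error handling are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer_head; not the kernel structure. */
struct model_bh {
	bool dirty;	/* true while the buffer still needs writeback */
	int refcount;	/* an attached journal head holds one reference */
};

/* Hypothetical stand-in for a journal_head on a checkpoint list. */
struct model_jh {
	struct model_bh *bh;
	struct model_jh *next;
};

/* Attach a journal head to bh, pin bh, and push it onto the list. */
static struct model_jh *model_attach(struct model_jh **head,
				     struct model_bh *bh)
{
	struct model_jh *jh;

	assert(head && bh);
	jh = malloc(sizeof(*jh));
	if (!jh)
		return NULL;
	jh->bh = bh;
	jh->next = *head;
	*head = jh;
	bh->refcount++;
	return jh;
}

/*
 * Scan at most nr_to_scan entries. Detach and free every journal head
 * whose buffer is clean, dropping the reference it held on the buffer.
 * Entries with dirty buffers stay on the list. Returns the number of
 * journal heads reclaimed.
 */
static size_t model_shrink_checkpoint_list(struct model_jh **head,
					   size_t nr_to_scan)
{
	struct model_jh **link = head;
	size_t freed = 0;

	assert(head);
	while (*link && nr_to_scan > 0) {
		struct model_jh *jh = *link;

		nr_to_scan--;
		if (jh->bh->dirty) {
			link = &jh->next;	/* not written out yet: skip */
			continue;
		}
		*link = jh->next;		/* unlink from the list */
		jh->bh->refcount--;		/* buffer is no longer pinned */
		free(jh);
		freed++;
	}
	return freed;
}
```

In this model, a buffer whose refcount drops back to zero is one that
ordinary reclaim could free without going through a bdev_try_to_free_page
style hook; the nr_to_scan bound only mirrors the general idea of a
shrinker being asked to scan a limited number of objects per call.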

Thanks,
Yi.