linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Mauricio Faria de Oliveira <mfo@canonical.com>
Cc: linux-ext4@vger.kernel.org, "Theodore Y. Ts'o" <tytso@mit.edu>,
	dann frazier <dann.frazier@canonical.com>,
	Andreas Dilger <adilger@dilger.ca>, Jan Kara <jack@suse.com>
Subject: Re: [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache
Date: Wed, 10 Jun 2020 15:21:39 +0200	[thread overview]
Message-ID: <20200610132139.GG12551@quack2.suse.cz> (raw)
In-Reply-To: <20200423233705.5878-1-mfo@canonical.com>

Hello!

Firstly, thanks for the patches and I'm sorry that it took me so long to
get to this.

On Thu 23-04-20 20:36:54, Mauricio Faria de Oliveira wrote:
> Ted Ts'o explains, in the linux-ext4 thread [1], that currently
> data=journal + journal_checksum + mmap is not handled correctly,
> and outlines the changes to fix it (+ items/breaks for clarity):
> 
>     """
>     The fix is going to have to involve
>     - fixing __ext4_journalled_writepage() to call set_page_writeback()
>       before it unlocks the page,
>     - adding a list of pages under data=journalled writeback
>       which is attached to the transaction handle,
>     - have the jbd2 commit hook call end_page_writeback()
>       on all of these pages,
>     - and then in the places where ext4 calls wait_for_stable_page()
>       or grab_cache_page_write_begin(), we need to add:
>     
>     	if (ext4_should_journal_data(inode))
>     		wait_on_page_writeback(page);
>     
>     It's all relatively straightforward except for the part
>     where we have to attach a list of pages to the currently
>     running transaction.  That will require adding some
>     plumbing into the jbd2 layer.
>     """

So I was pondering about this general design for some time. First let me
state here how I see the problem you are hitting in data=journal mode:

The journalling machinery expects the buffer contents cannot change from
the moment transaction commit starts (more exactly from the moment
transaction exists LOCKED state) until the moment transaction commit
finishes. Places like jbd2_journal_get_write_access() are there exactly to
ascertain this - they copy the "to be committed" contents into a bounce
buffer (jh->b_frozen_data) or wait for commit to finish (if it's too late
for copying and the data is already in flight). data=journal mode breaks
this assumption because although ext4_page_mkwrite() calls
do_journal_get_write_access() for each page buffer (and thus makes sure
there's no commit with these buffers running at that moment), the commit
can start the moment we call ext4_journal_stop() in ext4_page_mkwrite() and
then this commit runs while buffers in the committing transaction are
writeably mapped to userspace.

However I don't think Ted's solution actually fully solves the above
problem. Think for example about the following scenario:

fd = open('file');
addr = mmap(fd);
addr[0] = 'a';
	-> caused page fault, ext4_page_mkwrite() is called, page is
	   dirtied, all buffers are added to running transaction
pwrite(fd, buf, 1, 1);
	-> page dirtied again, all buffer are dirtied in the running
	   transaction

					jbd2 thread commits transaction now
-> the problem with committing buffers that are writeably mapped is still
there and your patches wouldn't solve it because
ext4_journalled_writepage() didn't get called at all.

Also, as you found out, leaving pages under writeback while we didn't even
start transaction commit that will end writeback on them is going to cause
odd stalls in various unexpected places.

So I'd suggest a somewhat different solution. Instead of trying to mark,
when page is under writeback, do it the other way around and make jbd2 kick
ext4 on transaction commit to writeprotect journalled pages. Then, because
of do_journal_get_write_access() in ext4_page_mkwrite() we are sure pages
cannot be writeably mapped into page tables until transaction commit
finishes (or we'll copy data to commit to b_frozen_data).

Now let me explain a bit more the "make jbd2 kick ext4 on transaction
commit to writeprotect journalled pages" part. I think we could mostly
reuse the data=ordered machinery for that. data=ordered machinery tracks
with each running transaction also a list of inodes for which we need to
flush data pages before doing commit of metadata into the journal. Now we
could use the same mechanism for data=journal mode - we'd track in the
inode list inodes with writeably mapped pages and on transaction commit we
need to writeprotect these pages using clear_page_dirty_for_io(). Probably
the best place to add inode to this list is ext4_journalled_writepage().
To do the commit handling we probably need to introduce callbacks that jbd2
will call instead of journal_submit_inode_data_buffers() in
journal_submit_data_buffers() and instead of
filemap_fdatawait_range_keep_errors() in
journal_finish_inode_data_buffers(). In data=ordered mode ext4 will just do
what jbd2 does in its callback, when inode's data should be journalled, the
callback will use write_cache_pages() to iterate and writeprotect all dirty
pages. The writepage callback just returns AOP_WRITEPAGE_ACTIVATE, some
care needs to be taken to redirty a page in the writepage callback if not
all its underlying buffers are part of the committing transaction or if
some buffer already has b_next_transaction set (since in that case the page
was already dirtied also against the following transaction). But it should
all be reasonably doable.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  parent reply	other threads:[~2020-06-10 13:21 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-23 23:36 [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache Mauricio Faria de Oliveira
2020-04-23 23:36 ` [RFC PATCH 01/11] ext4: data=journal: introduce struct/kmem_cache ext4_journalled_wb_page/_cachep Mauricio Faria de Oliveira
2020-04-23 23:36 ` [RFC PATCH 02/11] ext4: data=journal: handle page writeback in __ext4_journalled_writepage() Mauricio Faria de Oliveira
2020-04-23 23:36 ` [RFC PATCH 03/11] ext4: data=journal: call ext4_force_commit() in ext4_writepages() for msync() Mauricio Faria de Oliveira
2020-04-23 23:36 ` [RFC PATCH 04/11] ext4: data=journal: introduce helpers for journalled writeback deadlock Mauricio Faria de Oliveira
2020-04-23 23:36 ` [RFC PATCH 05/11] ext4: data=journal: prevent journalled writeback deadlock in __ext4_journalled_writepage() Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 06/11] ext4: data=journal: prevent journalled writeback deadlock in ext4_write_begin() Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 07/11] ext4: grab page before starting transaction handle in ext4_convert_inline_data_to_extent() Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 08/11] ext4: data=journal: prevent journalled writeback deadlock " Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 09/11] ext4: grab page before starting transaction handle in ext4_try_to_write_inline_data() Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 10/11] ext4: deduplicate code with error legs " Mauricio Faria de Oliveira
2020-04-23 23:37 ` [RFC PATCH 11/11] ext4: data=journal: prevent journalled writeback deadlock " Mauricio Faria de Oliveira
2020-05-15 18:39 ` [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache Mauricio Faria de Oliveira
2020-05-17  7:40   ` Andreas Dilger
2020-06-10 13:21 ` Jan Kara [this message]
2020-06-10 15:15   ` Mauricio Faria de Oliveira

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200610132139.GG12551@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=adilger@dilger.ca \
    --cc=dann.frazier@canonical.com \
    --cc=jack@suse.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=mfo@canonical.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).